How do I use glob with urllib2?

What I have been trying to achieve, with no success, is to build a list of file names with glob from each of two sources, compare them, and download a file if it doesn't exist locally.

I can't get past the start because I am not sure how to tell glob to start at the end of the url or directory path.

This is where I was going.

import urllib2, urlparse, glob

def getfile(base, fileExt):
    base = ('http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/')
# the files wanted end in _C.rtf
# for files on site not in /home/sayth/python/secFiles download
    files = []
    files = files.append(urllib2.urlopen(base + glob.glob('?_C.rtf')))

PS: I checked with urllib that the full path was correct. I didn't include the full printout, but as you can see it works.

>>> import urllib
>>> data = urllib.urlopen('http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110702Race01_C.rtf').read()
>>> print data
{\rtf1\ansi \deff1\deflang1033{\fonttbl{\f1\froman\fcharset0\fprq2 Arial;}}
\paperh11907\paperw16840\margt794\margb794\margl794\margr794\lndscpsxn\psz9\viewkind1\viewscale84

glob.glob() does not accept URL arguments; read the documentation. If the site doesn't give you a URL that lists the directory, I don't think it can be done.

It appears you can do it. It's not directly an answer, but there is a solution using glob and urllib here http://pastebin.com/m6ae1ae41

The docs say glob is for finding file names and urllib/urllib2 is for building URLs, so it should work. To me, as a newbie, the example above seems to just treat the URL like a directory path.

Will keep trying. Any thoughts appreciated.

In your example, glob is only used to search a local directory. A url looks only superficially like a directory. Browsing the web and exploring the hard disk involve quite different functions, so I don't think glob can work for this.
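
To illustrate the distinction: glob reads names from the local filesystem, but the wildcard matching itself (the fnmatch module) works on any list of strings. Below is a minimal sketch in Python 3, assuming you have already obtained a list of remote file names by some other means. The list here is made up for illustration from names mentioned in this thread.

```python
import fnmatch

# Hypothetical list of names, e.g. collected from a page that links to the files.
remote_names = [
    "110706Race01_B.rtf",
    "110706Race01_C.rtf",
    "110706Race02_B.rtf",
    "110706Race02_C.rtf",
]

# fnmatch.filter applies shell-style wildcards to plain strings,
# so no directory access is involved.
wanted = fnmatch.filter(remote_names, "*_C.rtf")
print(wanted)
```

So the pattern part of glob can still be used; what is missing is a way to list the names on the server.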

Is there another function that could perform this?

Possibly there is FTP access:
http://docs.python.org/library/ftplib.html

Thanks, having a read now. I also found a Python downloader script. I didn't want most of it, but the concept is that it figures out what files there are, given a base URL.
The full 140-odd lines are here https://github.com/mobileProgrammer/Automatic-Downloader/blob/master/downloader.py
But this is the section of interest.

    # start reading the URL
    connection = urllib.urlopen(url)
    html = connection.read()

    patternLinks = re.compile(r'<a\s.*?href\s*?=\s*?"(.*?)"', re.DOTALL)
    iterator = patternLinks.finditer(html)

    downloadList = []

The algorithm is to read the web page at the base URL and extract all the links from it. Unfortunately, your base URL leads to a 404 error, not to a page containing the file URLs.
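
As a rough sketch of that idea in Python 3, the snippet below applies the same regular expression as the excerpt to a string of HTML and keeps only the links ending in _C.rtf. The helper name extract_links and the sample page are made up for illustration. On a real site the HTML would come from urlopen(...).read() on a page that actually lists the files.

```python
import re

# Same pattern as in the excerpt: capture the href value of each <a> tag.
LINK_PATTERN = re.compile(r'<a\s.*?href\s*?=\s*?"(.*?)"', re.DOTALL)


def extract_links(html):
    """Return every href value found in the given HTML text."""
    return LINK_PATTERN.findall(html)


# Tiny made-up page standing in for the real one.
sample = (
    '<p><a href="http://example.com/files/a_B.rtf">Race 1</a>'
    '<a href="http://example.com/files/a_C.rtf">Race 1</a></p>'
)

links = extract_links(sample)
c_links = [link for link in links if link.endswith("_C.rtf")]
print(c_links)
```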

Another question is how you discovered the file 110702Race01_C.rtf. If you found it by browsing the site http://www.tvn.com.au/ , then Python can probably follow the same steps that you followed manually (for example, the mechanize module can simulate hand browsing).

Well, if I navigate to the page containing the files, the full URL is a mess.

http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/jsptemplates/tvn_sectionals.jsp?TVNSESSION=2DC7A10ABB1C10265323569B4D89208A.tvnEngine1

However, I found the link by right-clicking on the file and selecting "copy link address". On the page there are two sets of files: the ones I want end in _C.rtf and the others end in _B.rtf.

http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110702SRace01_B.rtf

Excellent! The page containing all the files has sections like

<p><B><FONT color=#ffff33 size=3> Finish Split Times Sectionals<br><br></font>

<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race01_B.rtf">Race 1</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race02_B.rtf">Race 2</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race03_B.rtf">Race 3</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race04_B.rtf">Race 4</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race05_B.rtf">Race 5</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race06_B.rtf">Race 6</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race07_B.rtf">Race 7</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race08_B.rtf">Race 8</a>&nbsp;&nbsp;
</p>

<p><b><FONT color=#ffff33 size=3> Runner Sectional Rates<br><br></font>
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race01_C.rtf">Race 1</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race02_C.rtf">Race 2</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race03_C.rtf">Race 3</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race04_C.rtf">Race 4</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race05_C.rtf">Race 5</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race06_C.rtf">Race 6</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race07_C.rtf">Race 7</a>&nbsp;&nbsp;

<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race08_C.rtf">Race 8</a>&nbsp;&nbsp;
<br>
</p><br>

It will be very easy to extract all these urls using the BeautifulSoup module (or even regular expressions).
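
For illustration, here is a minimal sketch in Python 3 that uses the standard library's html.parser instead of BeautifulSoup (a substitution made only so the example has no third-party dependency). The class name LinkCollector is made up, and the sample page is a shortened copy of the HTML quoted above.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


base = ("http://www.tvn.com.au/tvnlive/v1/system/modules/"
        "org.tvn.website/resources/sectionaltimes/")

# Shortened stand-in for the real page source.
page = (
    '<p><a href="' + base + '110706Race01_B.rtf">Race 1</a>&nbsp;&nbsp;\n'
    '<a href="' + base + '110706Race01_C.rtf">Race 1</a>&nbsp;&nbsp;\n'
    '<a href="' + base + '110706Race02_C.rtf">Race 2</a></p>'
)

collector = LinkCollector()
collector.feed(page)

# Keep only the files that end in _C.rtf.
c_urls = [url for url in collector.links if url.endswith("_C.rtf")]
print(c_urls)
```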

I'm reading up on Beautiful Soup now; the documentation here http://www.crummy.com/software/BeautifulSoup/documentation.html#Parsing HTML gives a lot of examples. One thing, though: how do you tell Beautiful Soup which URL to work on?

Looking at the most appropriate example.

from BeautifulSoup import BeautifulSoup, SoupStrainer
import re
mentionsOfBob = SoupStrainer(text=re.compile("Bob"))
[text for text in BeautifulSoup(doc, parseOnlyThese=mentionsOfBob)]
# [u'Bob reports ', u"Don't get any on\nus, Bob!"]

Where does it get the URL from? Is it returning the result as a list?

This may help you, but it may not be an easy task if you are new to this and to Python.
The files I get from the code below are:

110702SRace01_B.rtf
110702SRace03_B.rtf
110702SRace04_B.rtf
110702SRace05_B.rtf
110702SRace06_B.rtf
110702SRace07_B.rtf
110702SRace08_B.rtf

The files will be saved in the folder you run the script from.

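from BeautifulSoup import BeautifulSoup
import urllib2
from urllib import urlretrieve
from urlparse import urljoin


# Fetch the page that lists the files and parse it
url = urllib2.urlopen("http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/jsptemplates/tvn_sectionals.jsp?TVNSESSION=BD8A1BBD2F555AECA581698BFD1BDC6E.tvnEngine2")
soup = BeautifulSoup(url)

site = 'http://www.tvn.com.au'
# Take the first <p> on the page and collect its <a> tags that have an href
race = soup.findAll('p', limit=1)
race = race[0].findAll('a', href=True)
# item is the 0-based position of the link within that paragraph
for item, link in enumerate(race):
    #print link['href'] #Test print
    # urljoin copes with both relative and absolute hrefs
    urlretrieve(urljoin(site, link['href']), '110702SRace0%s_B.rtf' % item)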
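
Coming back to the original goal of downloading only the files that are not already in a local folder, here is a hedged sketch in Python 3. The function names missing_urls and download_missing are made up for illustration, and the code assumes you already have a list of file URLs taken from the page.

```python
import os
import posixpath
from urllib.parse import urlsplit
from urllib.request import urlretrieve


def file_name(url):
    """Return the last path component of a URL, e.g. '110706Race01_C.rtf'."""
    return posixpath.basename(urlsplit(url).path)


def missing_urls(urls, local_dir):
    """Return the URLs whose file name is not yet present in local_dir."""
    have = set(os.listdir(local_dir))
    return [url for url in urls if file_name(url) and file_name(url) not in have]


def download_missing(urls, local_dir):
    """Download every file that local_dir does not contain yet."""
    for url in missing_urls(urls, local_dir):
        urlretrieve(url, os.path.join(local_dir, file_name(url)))
```

With the folder from the question, the call would look something like download_missing(c_urls, '/home/sayth/python/secFiles'), where c_urls is whatever list of _C.rtf links you extracted.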