Use glob with urllib2

Question

flebber 12 Light Poster

13 Years Ago

How do I use glob with urllib2?

What I have been trying to achieve, without success, is to build lists of file names with glob from two sources, compare them, and download any file that doesn't already exist locally.

I can't get past the start because I am not sure how to tell glob to start at the end of the URL or directory path.

This is where I was going.

import urllib2, urlparse, glob

def getfile(base, fileExt):
    base = ('http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/')
# the files wanted end in _C.rtf
# for files on site not in /home/sayth/python/secFiles download
    files = []
    files = files.append(urllib2.urlopen(base + glob.glob('?_C.rtf')))

PS: I checked with urllib that the full path is correct. I haven't included the full printout, but as you can see it works.

>>> import urllib
>>> data = urllib.urlopen('http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110702Race01_C.rtf').read()
>>> print data
{\rtf1\ansi \deff1\deflang1033{\fonttbl{\f1\froman\fcharset0\fprq2 Arial;}}
\paperh11907\paperw16840\margt794\margb794\margl794\margr794\lndscpsxn\psz9\viewkind1\viewscale84


Gribouillis 1,391 Programming Explorer

13 Years Ago

glob.glob() does not accept URL arguments; see its documentation. If the site doesn't give you a URL that lists the directory, I don't think it can be done.

Gribouillis 1,391 Programming Explorer

13 Years Ago

Quoting flebber:
It appears you can do it. It's not directly an answer, but there is a solution using glob and urllib here: http://pastebin.com/m6ae1ae41
The docs say glob is for finding filenames and urllib/urllib2 is for building URLs. It should work. To me, a newbie, the above example seems to just treat the URL like a directory path.
Will keep trying. Any thoughts appreciated.

In your example, glob is only used to search a local directory. A URL only superficially resembles a directory path. Browsing the web and exploring the hard disk involve quite different functions, so I don't think glob can work for this.
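To illustrate the distinction, here is a minimal sketch in Python 3 (the thread itself uses Python 2). It assumes you already have a list of URLs from somewhere; glob cannot produce that list, but fnmatch can apply the same wildcard pattern to the file-name part of each URL. The helper names local_matches and url_matches, and the sample data, are made up for illustration.

```python
import fnmatch
import glob
import os
import posixpath
import tempfile
from urllib.parse import urlsplit


def local_matches(directory, pattern):
    """Return sorted base names of files in a local directory matching pattern."""
    paths = glob.glob(os.path.join(directory, pattern))
    return sorted(os.path.basename(p) for p in paths)


def url_matches(urls, pattern):
    """Return the URLs whose final path component matches pattern."""
    matched = []
    for url in urls:
        name = posixpath.basename(urlsplit(url).path)
        if fnmatch.fnmatchcase(name, pattern):
            matched.append(url)
    return matched


# glob needs real files on disk, so create two empty ones in a temp directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("110702Race01_C.rtf", "110702Race01_B.rtf"):
        open(os.path.join(tmp, name), "w").close()
    local = local_matches(tmp, "*_C.rtf")

# fnmatch only compares strings, so no network access happens here.
base = "http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/"
remote = url_matches([base + "110706Race01_B.rtf", base + "110706Race01_C.rtf"], "*_C.rtf")
```

Note that the pattern '?_C.rtf' from the question would not match these names in either case, because ? stands for exactly one character; '*_C.rtf' matches any prefix.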


Gribouillis 1,391 Programming Explorer

13 Years Ago

Quoting flebber:
Thanks, having a read now. I also found a Python downloader script. I didn't want most of it, but the concept is that it figures out what files there are, given a base URL.
The full 140-odd lines are here: https://github.com/mobileProgrammer/Automatic-Downloader/blob/master/downloader.py
But this is the section of interest.
# start reading the URL
    connection = urllib.urlopen(url)
    html = connection.read()
    
    patternLinks = re.compile(r'<a\s.*?href\s*?=\s*?"(.*?)"', re.DOTALL)
    iterator = patternLinks.finditer(html);

    downloadList = []

The algorithm is to read the web page at the base URL and extract all the links from it. Unfortunately, your base URL leads to a 404 error, not to a page containing the file URLs.

Another question is how you discovered the file 110702Race01_C.rtf. If you found it by browsing the site http://www.tvn.com.au/, then Python can probably follow the same steps that you followed manually (for example, the mechanize module can simulate browsing by hand).
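As a rough sketch of that algorithm in Python 3 (not the downloader script itself), the same regular expression can be applied to a page's HTML and each href resolved against the page URL. The function names fetch_html and extract_links, the sample HTML, and the example.com URLs are placeholders; the UTF-8 decoding is also an assumption about the page.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

# Same pattern as in the quoted script: capture the href value of each <a> tag.
LINK_PATTERN = re.compile(r'<a\s.*?href\s*?=\s*?"(.*?)"', re.DOTALL)


def fetch_html(url):
    """Download a page and decode it (UTF-8 is assumed, with replacement on errors)."""
    with urlopen(url) as connection:
        return connection.read().decode("utf-8", errors="replace")


def extract_links(html, page_url):
    """Return every href found in html, resolved against page_url."""
    return [urljoin(page_url, m.group(1)) for m in LINK_PATTERN.finditer(html)]


# Offline demonstration on a made-up snippet; fetch_html is not called here.
sample_html = (
    '<p><a href="files/a_C.rtf">A</a> '
    '<a class="x" href="http://example.com/b_B.rtf">B</a></p>'
)
links = extract_links(sample_html, "http://example.com/index.html")
```

In real use, extract_links(fetch_html(page_url), page_url) would combine the two steps. This only helps if page_url returns a page that actually lists the files.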



flebber 12 Light Poster · Answer 1 · 2011-08-03T13:59:58+00:00

It appears you can do it. It's not directly an answer, but there is a solution using glob and urllib here: http://pastebin.com/m6ae1ae41

The docs say glob is for finding filenames and urllib/urllib2 is for building URLs. It should work. To me, a newbie, the above example seems to just treat the URL like a directory path.

Will keep trying. Any thoughts appreciated.

flebber 12 Light Poster · Answer 2 · 2011-08-03T16:46:36+00:00


Is there another function that could perform this?

TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 3 · 2011-08-03T17:07:49+00:00


Possibly there is FTP access:
http://docs.python.org/library/ftplib.html


flebber 12 Light Poster · Answer 4 · 2011-08-03T17:22:30+00:00

Quoting TrustyTony:
Possibly there is FTP access:
http://docs.python.org/library/ftplib.html

Thanks, having a read now. I also found a Python downloader script. I didn't want most of it, but the concept is that it figures out what files there are, given a base URL.
The full 140-odd lines are here: https://github.com/mobileProgrammer/Automatic-Downloader/blob/master/downloader.py
But this is the section of interest.

# start reading the URL
    connection = urllib.urlopen(url)
    html = connection.read()
    
    patternLinks = re.compile(r'<a\s.*?href\s*?=\s*?"(.*?)"', re.DOTALL)
    iterator = patternLinks.finditer(html);

    downloadList = []

flebber 12 Light Poster · Answer 5 · 2011-08-03T17:59:52+00:00

Well, if I navigate to the page containing the files, the full URL is a mess.

http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/jsptemplates/tvn_sectionals.jsp?TVNSESSION=2DC7A10ABB1C10265323569B4D89208A.tvnEngine1

However, I found the link by right-clicking on the file and selecting "copy link address". On the page there are two sets of files: the ones I want end in _C.rtf and the others end in _B.rtf.

http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110702SRace01_B.rtf

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 6 · 2011-08-03T18:04:41+00:00

Quoting flebber:
Well, if I navigate to the page containing the files, the full URL is a mess.
http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/jsptemplates/tvn_sectionals.jsp?TVNSESSION=2DC7A10ABB1C10265323569B4D89208A.tvnEngine1
However, I found the link by right-clicking on the file and selecting "copy link address". On the page there are two sets of files: the ones I want end in _C.rtf and the others end in _B.rtf.
http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110702SRace01_B.rtf

Excellent! The page containing all the files has sections like

<p><B><FONT color=#ffff33 size=3> Finish Split Times Sectionals<br><br></font>

<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race01_B.rtf">Race 1</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race02_B.rtf">Race 2</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race03_B.rtf">Race 3</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race04_B.rtf">Race 4</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race05_B.rtf">Race 5</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race06_B.rtf">Race 6</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race07_B.rtf">Race 7</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race08_B.rtf">Race 8</a>&nbsp;&nbsp;
</p>

<p><b><FONT color=#ffff33 size=3> Runner Sectional Rates<br><br></font>
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race01_C.rtf">Race 1</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race02_C.rtf">Race 2</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race03_C.rtf">Race 3</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race04_C.rtf">Race 4</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race05_C.rtf">Race 5</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race06_C.rtf">Race 6</a>&nbsp;&nbsp;
<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race07_C.rtf">Race 7</a>&nbsp;&nbsp;

<a href="http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/110706Race08_C.rtf">Race 8</a>&nbsp;&nbsp;
<br>
</p><br>

It will be very easy to extract all these URLs using the BeautifulSoup module (or even regular expressions).
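For illustration, here is one way that extraction might look in Python 3. This sketch swaps in the standard library's html.parser in place of BeautifulSoup, so that it runs without third-party packages. The class HrefCollector, the function hrefs_ending_with, and the shortened sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def hrefs_ending_with(html, suffix):
    """Return the hrefs in html that end with suffix, in document order."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return [h for h in parser.hrefs if h.endswith(suffix)]


# Shortened sample modelled on the sections quoted above.
BASE = "http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/resources/sectionaltimes/"
sample = (
    '<p><a href="' + BASE + '110706Race01_B.rtf">Race 1</a>&nbsp;&nbsp;\n'
    '<a href="' + BASE + '110706Race02_B.rtf">Race 2</a>&nbsp;&nbsp;</p>\n'
    '<p><a href="' + BASE + '110706Race01_C.rtf">Race 1</a>&nbsp;&nbsp;</p>'
)
wanted = hrefs_ending_with(sample, "_C.rtf")
```

Filtering on the "_C.rtf" suffix keeps only the second set of files, which is the set the question asks for.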

flebber 12 Light Poster · Answer 7 · 2011-08-03T19:13:21+00:00

I'm reading up on Beautiful Soup now; the documentation at http://www.crummy.com/software/BeautifulSoup/documentation.html#Parsing HTML gives a lot of examples. One thing, though: how do you tell Beautiful Soup which URL to work on?

Looking at the most appropriate example:

from BeautifulSoup import BeautifulSoup, SoupStrainer
import re
mentionsOfBob = SoupStrainer(text=re.compile("Bob"))
[text for text in BeautifulSoup(doc, parseOnlyThese=mentionsOfBob)]
# [u'Bob reports ', u"Don't get any on\nus, Bob!"]

Where does it get the URL from? Is it returning a list?

snippsat 661 Master Poster · Answer 8 · 2011-08-03T23:17:14+00:00

This may help you, but it may not be an easy task if you are new to this and to Python.
The files I get from the code below are:

110702SRace01_B.rtf
110702SRace03_B.rtf
110702SRace04_B.rtf
110702SRace05_B.rtf
110702SRace06_B.rtf
110702SRace07_B.rtf
110702SRace08_B.rtf

The files will be saved in the folder you run the script from.

from BeautifulSoup import BeautifulSoup
import urllib2
from urllib import urlretrieve


url = urllib2.urlopen("http://www.tvn.com.au/tvnlive/v1/system/modules/org.tvn.website/jsptemplates/tvn_sectionals.jsp?TVNSESSION=BD8A1BBD2F555AECA581698BFD1BDC6E.tvnEngine2")
soup = BeautifulSoup(url)

site = 'http://www.tvn.com.au'
race = soup.findAll('p', limit=1)
race = race[0].findAll('a', href=True)
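for item, link in enumerate(race, 1):  # count from 1 so the names start at Race01
    #print link['href'] #Test print
    # site is prepended on the assumption that each href is relative to the site root
    urlretrieve(site+link['href'], '110702SRace0%s_B.rtf' % item)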
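The original question also asked to download a file only if it doesn't already exist locally. A minimal Python 3 sketch of that comparison follows, assuming the list of file URLs has already been extracted from the page. The names missing_urls and download_missing are hypothetical, and each file is saved under its remote name, not a generated one.

```python
import os
import posixpath
from urllib.parse import urlsplit
from urllib.request import urlretrieve


def remote_name(url):
    """Return the file-name part of a URL, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


def missing_urls(urls, directory):
    """Return the URLs whose file name is not already present in directory."""
    existing = set(os.listdir(directory))
    return [u for u in urls if remote_name(u) not in existing]


def download_missing(urls, directory):
    """Download each missing URL into directory and return the saved paths."""
    saved = []
    for url in missing_urls(urls, directory):
        target = os.path.join(directory, remote_name(url))
        urlretrieve(url, target)  # performs the network request
        saved.append(target)
    return saved
```

Calling download_missing(urls, '/home/sayth/python/secFiles') would then fetch only the files that are not yet in that folder. Running it a second time should download nothing.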
