HTML Scraper: Urllib2 / BeautifulSoup / Regex Help

Question

katamole 0 Newbie Poster

15 Years Ago

Hi everyone,

As a personal project I've decided to write a small script which will take a raw_input film title, then look up the IMDB rating and return the result. As an extra challenge I decided to employ re.

Now, this is how far I have got (yes, I am yet to wrap most things in functions; I will do this when I have ironed out the following problems):

from BeautifulSoup import BeautifulSoup
import urllib2
import re


#get source code of page (function used later)
def fetchsource(url):
    url = urllib2.urlopen(url)
    source = url.read()
    return source


#ask for film title
title = raw_input("Please enter a film title: ")


#format the raw_input string for searching
raw_string = re.compile(' ') #search for a space in string
searchstring = raw_string.sub('+', title) #replace with +
print searchstring

#find the film page url
url = "http://www.imdb.com/find?s=" + searchstring
print url
source = fetchsource(url)
soup = BeautifulSoup(source)
filmlink = soup.find('a', href=re.compile("title\/tt[0-9]*\/"))
print filmlink

If you run this code, it prints the film string and the search url fine: the problem is that my regex for getting the url of the film page from the search results page never produces anything. So "filmlink" is always empty. I'm not really sure why I'm getting no value here.

Is my regex bad, or have I not put the right options in?

Also, I don't quite understand exactly what I am doing with re.compile() but it works! Could somebody possibly write an easy to understand sentence or two?

Many thanks for your help.

python


All 10 Replies

Gribouillis 1,391 Programming Explorer

15 Years Ago

I was more successful with a slightly modified version. I modified the query string. Also, it's a good habit to always use strings prefixed by 'r' when you pass a literal string to re.compile.

#format the raw_input string for searching
raw_string = re.compile(r' ') #search for a space in string
searchstring = raw_string.sub('+', title) #replace with +
print searchstring

#find the film page url
url = "http://www.imdb.com/find?s=all&q=" + searchstring
print url
source = fetchsource(url)
soup = BeautifulSoup(source)
filmlink = soup.find('a', href=re.compile(r"/title/tt[0-9]*/"))
print filmlink
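
The query string can also be built without a regular expression. What follows is only a minimal sketch, assuming Python 3 (the thread uses Python 2, where the same function lives in the `urllib` module). The helper name `build_search_url` is hypothetical, and the `s=all&q=` parameters are simply copied from the post above, not checked against the current IMDb site.

```python
from urllib.parse import urlencode


def build_search_url(title):
    """Return a search URL for a film title (hypothetical helper)."""
    # urlencode escapes each value with quote_plus, so spaces become '+'
    # and characters such as '&' are percent-encoded.
    query = urlencode({"s": "all", "q": title})
    return "http://www.imdb.com/find?" + query


print(build_search_url("monty python and the holy grail"))
```

Compared with replacing spaces by hand, this also copes with titles containing characters such as `&` or `?`.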

For your other question: what happens when you call re.compile? The regular expression specified by the argument string is interpreted to create a finite automaton (a collection of nodes and transition rules between these nodes). This automaton is a machine that looks for the regular expression in a string. The whole thing is hidden in a "regular expression object" with a nice interface from the point of view of the client code.
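
To make the "regular expression object" concrete, here is a small sketch, assuming Python 3. The href values are made up for illustration; only the pattern comes from the thread.

```python
import re

# Compile once: the pattern string is parsed into a reusable pattern object.
film_href = re.compile(r"/title/tt[0-9]*/")

# Hypothetical href values, for illustration only.
hrefs = ["/title/tt0071853/", "/name/nm0000092/", "/find?s=all&q=holy+grail"]

# The same compiled object can then be applied to many strings.
matches = [h for h in hrefs if film_href.search(h)]
print(matches)

# The module-level function gives the same answer from the pattern string.
same_answer = bool(re.search(r"/title/tt[0-9]*/", hrefs[0])) == bool(film_href.search(hrefs[0]))
print(same_answer)
```

The compiled object is also what BeautifulSoup receives in `href=re.compile(...)`: it calls the object's search method on each candidate attribute value.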

scru 909 Posting Virtuoso

15 Years Ago

EDIT: Whoops, I misunderstood the link structure.

Also, when working with regexes, it's probably a good idea to test them independently before incorporating them into code. For example, test the regular expression in IDLE against a couple of the links and make sure it matches them.

Gribouillis 1,391 Programming Explorer

15 Years Ago

No, the / is not special in Python's regex syntax. The \ is special, so if you want to match a single backslash, you must write re.compile(r"\\"). The role of the r is to stop the Python compiler from interpreting special characters in the string. For example, the literal string "\n" has a single character (a newline), but the string r"\n" has 2 characters, a backslash and an 'n' (and no newline). When r"\n" is passed to re.compile, the regex engine then interprets those two characters as a newline. You don't need to think too much about this: if you put the 'r' when you use re.compile, it will usually do what you're expecting.
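
The difference between the two kinds of literal can be checked directly. This is a short sketch, assuming Python 3; the sample strings are made up for illustration.

```python
import re

plain = "\n"   # Python turns the escape into one character, a newline
raw = r"\n"    # the r prefix keeps both characters: a backslash and 'n'
print(len(plain), len(raw))

# Both spellings match a newline, because the regex engine itself
# understands the two-character sequence \n.
both_match = bool(re.search(plain, "a\nb")) and bool(re.search(raw, "a\nb"))

# The difference shows with \b: Python reads "\b" as a backspace character,
# while r"\b" reaches the regex engine intact as a word boundary.
without_r = re.findall("\bcat\b", "cat concat")
with_r = re.findall(r"\bcat\b", "cat concat")

# Matching one literal backslash takes r"\\".
backslashes = re.findall(r"\\", r"C:\films\holy_grail")
print(without_r, with_r, len(backslashes))
```

So the r prefix does not remove the need to escape regex metacharacters; it only prevents Python from processing the backslashes before the regex engine sees them.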


katamole 0 Newbie Poster · Answer 1 · 2009-03-03T19:33:15+00:00

I was more successful with a slightly modified version. I modified the query string.

Thanks, that works in the way I want it to. So it is not necessary to escape special characters if you prefix the string with "r"?

katamole 0 Newbie Poster · Answer 2 · 2009-03-03T19:53:13+00:00

Argh, why did I bother with regexes.

So here I am trying to extract the rating: (see screen grab)

And here's my poor attempt at an expression: ratingregexp = re.compile(r"^*/10$"). I'm not sure why it doesn't work!

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 3 · 2009-03-03T19:58:19+00:00

Because the * is a multiplier which must apply to something. You can look for ^[^/]*/10$ , which means beginning of string, any number of non-slash characters, a slash, a 10 and the end of the string.
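
Both points can be tried out in isolation. A minimal sketch, assuming Python 3; the sample strings are invented, not taken from an IMDb page.

```python
import re

# A bare '*' right after '^' has nothing to repeat, so compiling fails.
try:
    re.compile(r"^*/10$")
    compiled_ok = True
except re.error as exc:
    compiled_ok = False
    print("invalid pattern:", exc)

# The corrected pattern: start, any non-slash characters, "/10", end.
rating = re.compile(r"^[^/]*/10$")

# Hypothetical sample strings, for illustration only.
samples = ["8.4/10", "7/10", "8.4/10 stars", "1/2/10"]
results = {s: bool(rating.search(s)) for s in samples}
print(results)
```

Because of the ^ and $ anchors, the pattern accepts only a string that consists entirely of the rating, which matters when it is later applied to the text of a single tag.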

katamole 0 Newbie Poster · Answer 4 · 2009-03-03T20:16:57+00:00

Because the * is a multiplier which must apply to something. You can look for ^[^/]*/10$ , which means beginning of string, any number of non-slash characters, a slash, a 10 and the end of the string.

Grib, thanks for your help. I am still getting "None" as the output though. Here is the snippet: I have verified that the source is coming through ok.

rating_source = fetchsource(pagelink)
soup = BeautifulSoup(rating_source)
ratingregexp = re.compile(r"^[^/]*/10$")
rating_element = soup.find(ratingregexp)
print rating_element

katamole 0 Newbie Poster · Answer 5 · 2009-03-03T20:17:40+00:00

No, the / is not special in Python's regex syntax. The \ is special, so if you want to match a single backslash, you must write re.compile(r"\\"). The role of the r is to stop the Python compiler from interpreting special characters in the string. For example, the literal string "\n" has a single character (a newline), but the string r"\n" has 2 characters, a backslash and an 'n' (and no newline). When r"\n" is passed to re.compile, the regex engine then interprets those two characters as a newline. You don't need to think too much about this: if you put the 'r' when you use re.compile, it will usually do what you're expecting.

That makes perfect sense. I think I must've been asleep when I was reading the documentation!

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 6 · 2009-03-03T23:02:39+00:00

Grib, thanks for your help. I am still getting "None" as the output though. Here is the snippet: I have verified that the source is coming through ok.

rating_source = fetchsource(pagelink)
soup = BeautifulSoup(rating_source)
ratingregexp = re.compile(r"^[^/]*/10$")
rating_element = soup.find(ratingregexp)
print rating_element

It worked for me with this

source = fetchsource("http://www.imdb.com/title/tt0071853/")
soup = BeautifulSoup(source)
ratingregexp = re.compile(r"^[^/]*/10$")
rating_element = soup.find("b", text=ratingregexp)
print rating_element

I think the error was in the call to soup.find.
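
The difference between the two calls can be reproduced offline. This is only a sketch under stated assumptions: it uses Python 3 and the `bs4` package (the successor to the `BeautifulSoup` module imported in this thread), where the `text=` argument is spelled `string=`, and the HTML fragment is a made-up stand-in for the real page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the rating markup on the film page.
html = "<div>User rating: <b>8.4/10</b> <a href='/ratings'>votes</a></div>"
soup = BeautifulSoup(html, "html.parser")
ratingregexp = re.compile(r"^[^/]*/10$")

# As the first argument, the pattern is tested against tag names
# ('div', 'b', 'a'), none of which ends in '/10', so nothing is found.
by_name = soup.find(ratingregexp)

# As a text filter, the pattern is tested against the tag's text instead.
by_text = soup.find("b", string=ratingregexp)

print(by_name)
print(by_text.get_text())
```

In other words, the regex itself was fine; it was being compared with tag names instead of with the text inside the tags.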

katamole 0 Newbie Poster · Answer 7 · 2009-03-04T02:56:50+00:00

Of course, I wasn't searching inside <b> tags before. Thanks for the fix!

The script is up and running now, works great.

I will mark this thread as <solved>, and once again, thanks for your help.

In a few days I'm going to add functionality whereby you direct the script to a directory with films in it, and it parses each title, extracts ratings, and puts the ratings into a sorted html page. I'll post here once I've some of it up and running!

k.
