Beautiful Soup and Python Error

Question

jacob501 0 Light Poster

13 Years Ago

I am trying to use Beautiful Soup to scrape a website, Locationary.com, and get some information from it. I am a member and even when I'm logged in this doesn't work...

OK. This first bit of code just returns the HTML of Locationary.com (the home page) in a "pretty" form. And it works!!!

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen('http://www.locationary.com/').read()

soup = BeautifulSoup(page)

print soup.prettify()

However when I add more stuff to the URL, such as a place page on their website, I get a bad result...

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen('http://www.locationary.com/place/en/US/North_Carolina/Raleigh/Noodles_%26_Company-p1022884996.jsp').read()

soup = BeautifulSoup(page)

print soup.prettify()

With the above code, Python gives me something like this:

>>>
&lsaquo; (with a big dot at the end that won't copy!!!)
>>>

Does anybody know why this is happening? How come it can give me the HTML of the website's main page but not one of its other pages? What are these few weird characters Python is giving me?

I would appreciate any help. Thanks!

python

All 13 Replies

Gribouillis 1,391 Programming Explorer

13 Years Ago

I get a better result with

page = urllib2.urlopen('http://www.locationary.com/place/en/US/North_Carolina/Raleigh/Noodles_&_Company-p1022884996.jsp').read()

(I replaced %26 with &)
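For reference, %26 is the percent-encoded form of the & character, so the two URLs differ only in how that one character is written. A minimal sketch, assuming Python 3 (the thread itself uses Python 2) and using only the path segment from the URL above:

```python
from urllib.parse import quote, unquote

# Last path segment of the place URL from the question.
segment = "Noodles_%26_Company-p1022884996.jsp"

decoded = unquote(segment)         # turn %26 back into a literal ampersand
encoded = quote(decoded, safe="")  # percent-encode it again

print(decoded)
print(encoded)
```

Whether a given server treats the two spellings identically is up to that server, so this only shows the encoding relationship, not how Locationary.com handles it.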

Gribouillis 1,391 Programming Explorer

13 Years Ago

Oh well... that's what my code looks like already. DaniWeb just changed it a little... putting it on one line doesn't change anything for me... I still get the weird result ("&lsaquo;" followed by a dot).
Or do you mean that that link worked for you and you got the HTML from it?

I mean, did you replace the %26 in the URL with &?

Gribouillis 1,391 Programming Explorer

13 Years Ago

hehe

Gribouillis 1,391 Programming Explorer

13 Years Ago

Sorry. I'm kind of new to all this programming stuff. What is a BOM and how will it help?

A BOM (byte order mark) is a short byte sequence at the very start of a file that is used to detect its encoding (see Wikipedia). In our case, the first two bytes are \x1f\x8b, and Google tells me that this is not a BOM but the signature of files compressed with gzip. Indeed, my Linux system detects a compressed file and is able to uncompress it with gunzip. Python can do this too with the gzip module. Here we go:

>>> from urllib import urlretrieve
>>> urlretrieve('http://www.locationary.com/place/en/US/North_Carolina/Raleigh/Noodles_%26_Company-p1022884996.jsp', 'myfile')
>>> import gzip
>>> data = gzip.open('myfile', 'rb').read()

!!!
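The check described above can be sketched as a small helper that looks at the leading bytes and tells a gzip signature apart from a real BOM. This is only an illustration, assuming Python 3; the names `sniff` and `maybe_gunzip` are made up for this example, and only a few common BOMs are listed.

```python
import codecs
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream

# A few common byte order marks (not an exhaustive list).
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff(raw):
    """Return 'gzip', an encoding name implied by a BOM, or None."""
    if raw[:2] == GZIP_MAGIC:
        return "gzip"
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return None


def maybe_gunzip(raw):
    """Decompress raw bytes if they look gzip-compressed, else return them unchanged."""
    return gzip.decompress(raw) if sniff(raw) == "gzip" else raw


# Synthetic data standing in for the downloaded page.
sample = gzip.compress(b"<html></html>")
print(sniff(sample))
print(maybe_gunzip(sample))
```

The bytes returned by `maybe_gunzip` could then be handed to Beautiful Soup in place of the raw download.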

jacob501 0 Light Poster · Answer 1 · 2011-12-10T22:30:17+00:00

I get a better result with

page = urllib2.urlopen('http://www.locationary.com/place/en/US/North_Carolina/Raleigh/Noodles_&_Company-p1022884996.jsp').read()

(I replaced %26 with &)

Oh well... that's what my code looks like already. DaniWeb just changed it a little... putting it on one line doesn't change anything for me... I still get the weird result ("&lsaquo;" followed by a dot).

Or do you mean that that link worked for you and you got the HTML from it?

jacob501 0 Light Poster · Answer 2 · 2011-12-10T22:38:03+00:00

I mean, did you replace the %26 in the URL with &?

Oh wow!!! Thank you so much! I looked through your code for differences at first but barely missed this. Thanks!! It works now.

jacob501 0 Light Poster · Answer 3 · 2011-12-10T22:39:23+00:00

hehe

:)

jacob501 0 Light Poster · Answer 4 · 2011-12-10T22:44:54+00:00

Crap... it's not working anymore.

jacob501 0 Light Poster · Answer 5 · 2011-12-10T22:47:21+00:00

Why did it only work once??

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 6 · 2011-12-10T23:48:47+00:00

Why did it only work once??

I don't know why it worked only once. Obviously the content has a special encoding. I did this

from urllib import urlretrieve
urlretrieve('http://www.locationary.com/place/en/US/North_Carolina/Raleigh/Noodles_%26_Company-p1022884996.jsp', 'myfile.jsp')

Then when I cat myfile.jsp in a terminal it looks good, but when I load the content with python, it shows the same error. We could perhaps find a BOM at the beginning of the data.

jacob501 0 Light Poster · Answer 7 · 2011-12-10T23:54:59+00:00

I don't know why it worked only once. Obviously the content has a special encoding. I did this

from urllib import urlretrieve
urlretrieve('http://www.locationary.com/place/en/US/North_Carolina/Raleigh/Noodles_%26_Company-p1022884996.jsp', 'myfile.jsp')

Then when I cat myfile.jsp in a terminal it looks good, but when I load the content with python, it shows the same error. We could perhaps find a BOM at the beginning of the data.

Sorry. I'm kind of new to all this programming stuff. What is a BOM and how will it help?

jacob501 0 Light Poster · Answer 8 · 2011-12-11T00:19:45+00:00

Oh. Okay. I ran it a few times to check and it worked! Thanks! Now I know what a BOM is too!

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 9 · 2011-12-11T08:02:05+00:00

Oh. Okay. I ran it a few times to check and it worked! Thanks! Now I know what a BOM is too!

You can also decompress it without using a temporary file, like this:

from urllib2 import urlopen
from gzip import GzipFile
from cStringIO import StringIO
fobj = urlopen('http://www.locationary.com/place/en/US/North_Carolina/Raleigh/Noodles_%26_Company-p1022884996.jsp')
fobj = StringIO(fobj.read())
ifh = GzipFile(mode='rb', fileobj=fobj)
data = ifh.read()
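For readers on Python 3, where cStringIO no longer exists, io.BytesIO plays the same role. The following is a rough port of the snippet above, not the original poster's code; the helper name `gunzip_stream` is hypothetical, and a synthetic in-memory object stands in for the network response so the example runs offline.

```python
import gzip
import io


def gunzip_stream(fobj):
    """Read all bytes from a file-like object and decompress them in memory."""
    buffer = io.BytesIO(fobj.read())  # in-memory copy of the response body
    with gzip.GzipFile(mode="rb", fileobj=buffer) as ifh:
        return ifh.read()


# Stand-in for a response whose body is gzip-compressed.
fake_response = io.BytesIO(gzip.compress(b"<html>hello</html>"))
data = gunzip_stream(fake_response)
print(data)
```

With a real request, the object returned by `urllib.request.urlopen(url)` could be passed in place of `fake_response`, assuming the server actually sends gzip-compressed content.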
