I am trying to use Beautiful Soup to scrape a website, Locationary.com, and get some information from it. I am a member and even when I'm logged in this doesn't work...

OK. This first bit of code just returns the HTML of Locationary.com (the home page) in a "pretty" form. And it works!!!

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen('http://www.locationary.com/').read()

soup = BeautifulSoup(page)

print soup.prettify()

However, when I request a longer URL, such as a place page on their website, I get a bad result...

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen('http://www.locationary.com/place/en/US/North_Carolina/Raleigh/Noodles_%26_Company-p1022884996.jsp').read()

soup = BeautifulSoup(page)

print soup.prettify()

With the above code, Python gives me something like this:

>>>
‹ (with a big dot at the end that won't copy!!!)
>>>

Does anybody know why this is happening? How come it can give me the HTML of the website's main page but not one of its other pages? What are these few weird characters Python is giving me?

I would appreciate any help. Thanks!

Oh well... that's what my code looks like already. Daniweb just changed it a little. Putting it on one line doesn't change anything for me. I still get the weird result ("‹" followed by a big dot).

Or do you mean that that link worked for you and you got the HTML from it?

I mean, did you replace the %26 in the URL with & ?
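For reference, %26 is the percent-encoded form of the & character. Here is a small sketch using Python 3's urllib.parse (the thread itself uses Python 2, where quote and unquote live in the urllib module). The variable names are only illustrative:

```python
from urllib.parse import quote, unquote

# The path segment from the place-page URL discussed in this thread.
segment = "Noodles_%26_Company"

# unquote() turns percent-escapes back into the characters they stand for.
decoded = unquote(segment)

# quote() escapes reserved characters such as "&" again.
encoded = quote(decoded)

print(decoded)
print(encoded)
```

Whether the server wants the escaped or the literal form is up to the server, so it is worth trying both.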

Oh wow!!! Thank you so much! I looked through your code for differences at first but barely missed this. Thanks!! It works now.

I don't know why it worked only once. Obviously the content has a special encoding. I did this

from urllib import urlretrieve
urlretrieve('http://www.locationary.com/place/en/US/North_Carolina/Raleigh/Noodles_%26_Company-p1022884996.jsp', 'myfile.jsp')

Then when I cat myfile.jsp in a terminal it looks good, but when I load the content with python, it shows the same error. We could perhaps find a BOM at the beginning of the data.
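One way to check is to compare the first bytes of the downloaded data against known signatures. This is only a sketch, written for Python 3 rather than the Python 2 used above. describe_prefix and KNOWN_PREFIXES are hypothetical names, and the table only covers the Unicode BOMs exposed by the codecs module plus the gzip magic number:

```python
import codecs

# Known leading byte sequences. Longer prefixes come first so that a
# UTF-32 LE BOM (ff fe 00 00) is not mistaken for a UTF-16 LE BOM (ff fe).
KNOWN_PREFIXES = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE BOM"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE BOM"),
    (codecs.BOM_UTF8, "UTF-8 BOM"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE BOM"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE BOM"),
    (b"\x1f\x8b", "gzip magic number"),
]


def describe_prefix(data):
    """Return a label for the first known signature that `data` starts with."""
    for prefix, label in KNOWN_PREFIXES:
        if data.startswith(prefix):
            return label
    return "no known prefix"
```

You would call it on the raw bytes, for example describe_prefix(open('myfile.jsp', 'rb').read(4)).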

Sorry. I'm kind of new to all this programming stuff. What is a BOM and how will it help?

A BOM (byte order mark) is a short sequence of bytes at the very start of a file that is used to detect its encoding (see Wikipedia). In our case, the first two bytes are \x1f\x8b, and Google tells me that this marks files compressed with gzip, so strictly speaking it is the gzip magic number rather than a BOM. Indeed, my Linux system detects a compressed file and is able to uncompress it with gunzip. Python can do this too, with the gzip module. Here we go:

>>> from urllib import urlretrieve
>>> urlretrieve('http://www.locationary.com/place/en/US/North_Carolina/Raleigh/Noodles_%26_Company-p1022884996.jsp', 'myfile')
>>> import gzip
>>> data = gzip.open('myfile', 'rb').read()

!!!
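For anyone on Python 3, a rough equivalent that reads a saved file and only decompresses it when it starts with the gzip marker might look like this. read_maybe_gzipped is a made-up helper name, and the download step is left out (in Python 3, urlretrieve lives in urllib.request):

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def read_maybe_gzipped(path):
    """Return the file's bytes, decompressing them if the file is gzip data."""
    # Peek at the first two bytes to decide how to open the file.
    with open(path, "rb") as fh:
        magic = fh.read(2)
    opener = gzip.open if magic == GZIP_MAGIC else open
    with opener(path, "rb") as fh:
        return fh.read()
```

Calling read_maybe_gzipped('myfile') on the file saved above should then give the HTML as bytes, whether or not the server compressed it.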

Edited 4 Years Ago by Gribouillis: n/a

Oh. Okay. I ran it a few times to check and it worked! Thanks! Now I know what a BOM is too!

You can also uncompress it without using a temporary file, like this:

from urllib2 import urlopen
from gzip import GzipFile
from cStringIO import StringIO
fobj = urlopen('http://www.locationary.com/place/en/US/North_Carolina/Raleigh/Noodles_%26_Company-p1022884996.jsp')
fobj = StringIO(fobj.read())
ifh = GzipFile(mode='rb', fileobj=fobj)
data = ifh.read()
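The snippet above is Python 2 (urllib2 and cStringIO no longer exist in Python 3). A possible Python 3 version of the same idea, using io.BytesIO as the in-memory file, is sketched below. gunzip_in_memory and fetch_html are hypothetical helper names, and checking the first two bytes is just one way to decide whether to decompress:

```python
import gzip
import io
import urllib.request

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def gunzip_in_memory(raw):
    """Decompress gzip bytes through an in-memory file object (no temp file)."""
    with gzip.GzipFile(fileobj=io.BytesIO(raw), mode="rb") as fh:
        return fh.read()


def fetch_html(url):
    """Download `url` and decompress the body if it looks like gzip data."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()
    if raw[:2] == GZIP_MAGIC:
        return gunzip_in_memory(raw)
    return raw
```

fetch_html returns bytes, so the result can be handed to BeautifulSoup in the same way as page in the first post.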
