Hi,
I am trying to build a scraper, using mechanize to fetch the page source and Beautiful Soup to parse it. Yesterday I ran into a very strange error: the page source I get back looks like binary data. In a browser I can view the page source and it looks like any other web page's source code, but through mechanize, or even urllib2, it comes back as binary.

This is one such page from eBay, http://stores.ebay.com/honesty-seller-ly9999, which mechanize does not fetch correctly. Its header has some information related to IE 7/8/9, which is different from the other page sources I have seen so far.

Thanks.


The data is compressed with gzip. I tested it with:

$ curl http://stores.ebay.com/honesty-seller-ly9999 > bad_data
$ file bad_data
      bad_data: gzip compressed data, from FAT filesystem (MS-DOS, OS/2, NT)
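
The same check can be done from Python without the file utility, since every gzip stream starts with the two magic bytes 0x1f 0x8b. This is only a minimal sketch; looks_gzipped is a hypothetical helper name, not part of any library:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)

def looks_gzipped(data):
    """Return True if the bytes start with the gzip magic number."""
    return data[:2] == GZIP_MAGIC

# Compare a gzip-compressed sample against plain markup.
sample = gzip.compress(b"<html>hello</html>")
print(looks_gzipped(sample))
print(looks_gzipped(b"<html>hello</html>"))
```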

The site is served with the header Content-Encoding: gzip, which isn't correct unless you request it with an Accept-Encoding: gzip header. Anyway, this is a very minimal example of automatically decompressing pages like this. I'm using Python 3 and urllib (which differs a little from Python 2's urllib):

from gzip import GzipFile
from io import BytesIO
from urllib import request

def get_url(url):
    """ Open a url and return its response body as bytes.
        If the response is compressed using gzip, then decompress it.
    """
    # I haven't included any error handling here.
    con = request.urlopen(url)
    data = con.read()
    if con.headers.get('content-encoding', None) == 'gzip':
        # Server says the data is gzipped, use a bytes stream with GzipFile.
        fd = BytesIO(data)
        # The fileobj keyword is important.
        # We want a file object, not a file name.
        return GzipFile(fileobj=fd).read()

    # Not compressed, return normal response.
    # In Python 3 this is still encoded, and needs to be decoded
    # with the correct encoding (utf-8?).
    return data

response = get_url('http://stores.ebay.com/honesty-seller-ly9999')
print(response.decode())

That is enough to get the correct data out of that page.
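
If you would rather ask for gzip explicitly than rely on the server sending it unprompted, urllib lets you set the Accept-Encoding header on a Request object. A small sketch under that assumption; build_gzip_request is a hypothetical helper name, and nothing is fetched until the request is passed to request.urlopen:

```python
from urllib import request

def build_gzip_request(url):
    """Build a Request that tells the server gzip responses are acceptable."""
    return request.Request(url, headers={'Accept-Encoding': 'gzip'})

req = build_gzip_request('http://stores.ebay.com/honesty-seller-ly9999')
# urllib stores header names capitalized, so the key becomes 'Accept-encoding'.
print(req.get_header('Accept-encoding'))
```

The response body still has to be decompressed by hand, as get_url does; urllib does not do that for you.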

This would probably fail miserably if you landed on a page like: example.com/zippedapplication.tar.gz

It assumes the response is encoded plain text with possible gzip compression, not gzipped binary data.
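
One way to guard against that case is to look at the Content-Type header before decoding to text. The following is only a rough sketch: decode_body is a hypothetical helper, the header values are assumed to come from con.headers.get('content-encoding') and con.headers.get('content-type'), and utf-8 is an assumption, since a real page may declare a different charset:

```python
import gzip

def decode_body(data, content_encoding, content_type):
    """Decompress gzip transfer encoding, then decode only text/* bodies.

    Returns str for text content and raw bytes for anything else.
    """
    if (content_encoding or '').lower() == 'gzip':
        # The server compressed the body for transfer; undo that first.
        data = gzip.decompress(data)
    if (content_type or '').lower().startswith('text/'):
        # Assumption: utf-8. Undecodable bytes are replaced, not raised.
        return data.decode('utf-8', errors='replace')
    # Binary payload (for example a .tar.gz download): leave it untouched.
    return data

page = decode_body(gzip.compress(b'<html>ok</html>'), 'gzip',
                   'text/html; charset=utf-8')
print(page)

archive = decode_body(b'\x1f\x8b\x08\x00', None, 'application/x-gzip')
print(type(archive))
```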
