Python Mechanize Page Source Error.

Question

sun_2588 0 Newbie Poster

8 Years Ago

Hi,
I am trying to build a scraper, using mechanize to get the page source and Beautiful Soup to parse it. Yesterday I ran into a very strange error: the page source I get back looks like binary data. In a browser I can view the page source and it looks like any other web page's source code, but through mechanize, or even urllib2, it comes back as binary.

This is one such page from eBay, http://stores.ebay.com/honesty-seller-ly9999, which is not fetched correctly through mechanize. Its header has some information related to IE 7/8/9, which is different from the other page sources I have seen so far.

Thanks.

mechanize python

Edited 8 Years Ago by sun_2588


All 2 Replies

chriswelborn 63 ...

8 Years Ago

The data is compressed with gzip. I tested it with:

$ curl http://stores.ebay.com/honesty-seller-ly9999 > bad_data
$ file bad_data
      bad_data: gzip compressed data, from FAT filesystem (MS-DOS, OS/2, NT)

The site is served with the header Content-Encoding: gzip, which isn't correct unless you request it with an Accept-Encoding: gzip header. Anyway, this is a very minimal example of automatically decompressing pages like this. I'm using Python 3 and urllib (which differs from Python 2's urllib a little):

from gzip import GzipFile
from io import BytesIO
from urllib import request

def get_url(url):
    """ Open a url and return its response.
        If the response is compressed using gzip, then decompress it.
    """
    # I haven't included any error handling here.
    con = request.urlopen(url)
    data = con.read()
    if con.headers.get('content-encoding', None) == 'gzip':
        # Server says the data is gzipped, use a bytes stream with GzipFile.
        fd = BytesIO(data)
        # The fileobj keyword is important.
        # We want a file object, not a file name.
        return GzipFile(fileobj=fd).read()

    # Not compressed, return normal response.
    # In Python 3 this is still encoded, and needs to be decoded
    # with the correct encoding (utf-8?).
    return data

response = get_url('http://stores.ebay.com/honesty-seller-ly9999')
print(response.decode())

That is enough to get the correct data out of that page.
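
As a variation on the function above, here is a rough sketch that also sends the Accept-Encoding: gzip header mentioned earlier and keeps the decompression step in its own function so it can be checked without touching the network. The names build_request, decompress_body and fetch are made up for this example, and the case-insensitive header comparison is an assumption of the sketch, not something the original code does. There is still no error handling.

```python
import gzip
from urllib import request


def build_request(url):
    """Build a request that explicitly tells the server gzip is acceptable."""
    return request.Request(url, headers={"Accept-Encoding": "gzip"})


def decompress_body(data, content_encoding):
    """Return the raw body, gunzipped if the header value says it is gzip."""
    # Comparing case-insensitively is a choice made for this sketch.
    if (content_encoding or "").strip().lower() == "gzip":
        return gzip.decompress(data)
    return data


def fetch(url):
    """Fetch url and return the (possibly decompressed) body as bytes."""
    with request.urlopen(build_request(url)) as con:
        return decompress_body(con.read(), con.headers.get("Content-Encoding"))
```

Calling fetch('http://stores.ebay.com/honesty-seller-ly9999').decode() would then play the same role as the last two lines of the original example, assuming the page is still served the same way.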

Edited 8 Years Ago by chriswelborn because: code formatting, decode step


chriswelborn 63 ... · Answer 1 · 2015-12-01T02:26:52+00:00

This would probably fail miserably if you landed on a page like: example.com/zippedapplication.tar.gz

It assumes the response is encoded plain text with possible gzip compression, not gzipped binary data.
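
One way to guard against that case, sketched under some assumptions: look at the Content-Type header before decoding, and only treat the body as text when the main type is "text". The function name body_as_text is hypothetical, the utf-8 fallback is a guess for when the server names no charset, and the check is deliberately crude (it would also refuse things like application/json). The headers argument is expected to behave like the headers attribute of a urllib response, which is an email.message.Message subclass.

```python
import gzip
from email.message import Message


def body_as_text(data, headers):
    """Return decoded text for text responses, or None for anything else.

    `headers` is an email.message.Message-like object, such as the
    `headers` attribute of a urllib response.
    """
    # Undo transport-level gzip first, as in the earlier example.
    if (headers.get("Content-Encoding") or "").strip().lower() == "gzip":
        data = gzip.decompress(data)

    # Anything that is not text/* (e.g. application/gzip for a .tar.gz
    # download) is left alone; the caller should keep the raw bytes.
    if headers.get_content_maintype() != "text":
        return None

    # utf-8 is an assumed fallback when no charset is declared.
    charset = headers.get_content_charset() or "utf-8"
    return data.decode(charset, errors="replace")
```

With a live response this would be called as body_as_text(con.read(), con.headers); a None result means the caller should handle the data as binary instead of printing it.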
