Hey all. First post to this forum, though I've casually browsed threads here often in the past. I'm pretty new to Python but have several years of experience programming in C, C++, and Java.

I've got a Java app currently deployed to Google App Engine and wanted to be able to do some batch processing on a local machine: basically going to the datastore, pulling records that represent web pages, then checking whether those pages are still active. The easiest way to do this seemed to be something called Remote API (http://code.google.com/appengine/articles/remote_api.html), which is Python-only, so I figured I ought to just go ahead and try working in Python. Cut to two days later and I feel I have a decent handle on the language.

I'm posting here because I've hit an odd (or maybe not so odd) problem. I wrote a script that fetches records from my database (each of which contains a URL for a web page and some other properties), then iterates through them and uses urllib2.urlopen() and readline() to look at the contents of the page associated with each record. If the page is no longer active (these are craigslist housing listings, so if I get a 404 or similar message), the script removes the associated record from the datastore.

This works alright for the most part, but I've noticed that about 10% of the time urllib2 grabs me a page different from the one Firefox grabs me using the same URL. An example from just a minute ago: for "http://newyork.craigslist.org/aap/jsy/abo/1297710020.html", the page received by the Python script has a "this posting has been deleted by its author" message, whereas the page downloaded by Firefox is totally active. When I try to download these odd pages with wget or using a Java program, I get the same erroneous content that I get with my Python script.

Thinking it might be a user-agent discrimination type thing, I added the headers from my browser to my request (basically trying to spoof the server):

import urllib2

# the listing URL pulled from the datastore record (the example from above)
url = 'http://newyork.craigslist.org/aap/jsy/abo/1297710020.html'

txdata = None
txheaders = {
    'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2',
    'Accept': 'text/html, image/jpeg, image/png, text/*, image/*, */*',
    'Accept-Language': 'en-us',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.7',
    'Keep-Alive': '300',
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
}

req = urllib2.Request(url, txdata, txheaders)
pagefile = urllib2.urlopen(req)
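For context, here's roughly what the script does with the response to decide whether a listing is dead. It reuses the txdata/txheaders above, and fetch_records()/delete_record() are just hypothetical stand-ins for my remote-api calls:

def listing_is_dead(url):
    try:
        pagefile = urllib2.urlopen(urllib2.Request(url, txdata, txheaders))
    except urllib2.HTTPError, e:
        # an explicit 404 means the listing is gone
        return e.code == 404
    page = pagefile.read()
    # removed craigslist listings often come back as a normal 200 page
    # with a notice in the body, so check the content too
    return 'deleted by its author' in page or 'has expired' in page

# stand-ins for the remote-api fetch/delete calls:
for record in fetch_records():
    if listing_is_dead(record.url):
        delete_record(record)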

The only header I left out was (Accept-Encoding: gzip,deflate), because it seemed (logically, I guess) to result in the server sending me back a compressed page. I still don't end up with the right page (well, the page I see in Firefox or Konqueror). So I guess the question is: what is happening here, and what can I do to make it such that my script is able to access the same pages my browser is accessing?
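For what it's worth, if accepting gzip turns out to matter, I believe something like this would handle the compressed response by hand (an untested sketch, reusing url/txdata/txheaders from above):

import gzip
from StringIO import StringIO

req = urllib2.Request(url, txdata, txheaders)
req.add_header('Accept-Encoding', 'gzip')
response = urllib2.urlopen(req)

body = response.read()
# urllib2 doesn't decompress for you; check the response header and
# unzip manually if the server actually sent gzip
if response.info().get('Content-Encoding') == 'gzip':
    body = gzip.GzipFile(fileobj=StringIO(body)).read()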
Thanks,
Nick Z

All 3 Replies

Is there any possibility that the page your browser is displaying is a cached version and hasn't updated to notify you of the user's deletion?

Try clearing Firefox's cache and then fetching the page again.

Tried this. Still the same weirdness. There must be something to the fact that wget and java.net.* are being served the same page that the Python script is being served. Here are a few more examples of URLs where this weirdness occurred on the most recent run of the script (I could post hundreds; like I said, this is happening with about 10% of all the listing pages I try to download). If you all want to have a look and let me know what you see with Firefox, wget, Java, Python, etc., I would appreciate your thoughts (there's a quick checker script after the list):

http://newyork.craigslist.org/aap/brk/abo/1298022085.html
http://newyork.craigslist.org/aap/mnh/abo/1298012842.html
http://newyork.craigslist.org/aap/fct/abo/1298015599.html
http://newyork.craigslist.org/aap/mnh/abo/1298017522.html
http://newyork.craigslist.org/aap/brk/abo/1298022085.html
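In case anyone wants to reproduce this quickly, here's a little script that fetches each URL and reports what it got back. The 'deleted'/'expired' strings are just the notices I've been seeing, not anything official:

import urllib2

urls = [
    'http://newyork.craigslist.org/aap/brk/abo/1298022085.html',
    'http://newyork.craigslist.org/aap/mnh/abo/1298012842.html',
    'http://newyork.craigslist.org/aap/fct/abo/1298015599.html',
    'http://newyork.craigslist.org/aap/mnh/abo/1298017522.html',
]

for url in urls:
    try:
        page = urllib2.urlopen(url).read()
    except urllib2.HTTPError, e:
        print url, '-> HTTP', e.code
        continue
    if 'deleted by its author' in page or 'has expired' in page:
        print url, '-> dead-listing notice'
    else:
        print url, '-> looks active'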

-Nick

Wow, 15 hours later and I just clicked through all of those links to find that Firefox now shows me a "this listing has expired" or "deleted by author" page for each of them. My best theory now as to what's going on is that perhaps craigslist's content is mirrored across several servers, and whichever servers Firefox/Konqueror were accessing content from hadn't yet been updated to reflect the content's expiration/removal, but... I don't know. All very weird.
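If anyone wants to poke at the mirror theory, one quick check is to see how many addresses the hostname resolves to. Several IPs would at least be consistent with round-robin DNS in front of multiple servers; this is just a sketch, not proof either way:

import socket

# gethostbyname_ex returns (canonical name, aliases, list of IPs);
# more than one IP suggests several servers behind the same hostname
name, aliases, ips = socket.gethostbyname_ex('newyork.craigslist.org')
print 'canonical name:', name
print 'aliases:', aliases
print 'addresses:', ips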
