urllib - check if web page exists

mn_kthompson
Jan 2nd, 2009
I've been writing a script that searches Google for pages on my web site that meet certain criteria, for example Excel spreadsheets containing data that should not be on the web. So far, so good.

When we find something and take it down, it can take a while for the result to drop out of Google's index. To reduce false positives, I would like to check whether each link is dead or not. The problem is that I don't know how to do that without downloading the whole page, and that could add up to a lot of data.

Is there a way to touch a link and see whether it is still active without downloading each of the files?

QwertyManiac
Jan 6th, 2009
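I know a small recipe that does just this: it checks whether a page exists over HTTP by sending a HEAD request, so only the response headers are transferred and the body is never downloaded. I'm not sure it will fit your case exactly, but here it is: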

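    # Python 2 (httplib and urlparse were renamed in Python 3)
    from httplib import HTTP
    from urlparse import urlparse

    def checkURL(url):
        p = urlparse(url)
        h = HTTP(p[1])                      # p[1] is the host part of the URL
        # HEAD asks for headers only; fall back to '/' when the URL has no path
        h.putrequest('HEAD', p[2] or '/')
        h.endheaders()
        # getreply() returns (status, message, headers); 200 means the page exists
        if h.getreply()[0] == 200:
            return 1
        else:
            return 0

    if __name__ == '__main__':
        print checkURL('http://slashdot.org')
        print checkURL('http://slashdot.org/notadirectory')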
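
For anyone on Python 3, where httplib became http.client and urlparse moved into urllib.parse, the same HEAD-request idea might look roughly like the sketch below. The function name check_url and the timeout parameter are assumptions made for illustration and are not part of the recipe above; the sketch also treats any network error as "not there", which may or may not be what you want.

```python
import http.client
from urllib.parse import urlsplit


def check_url(url, timeout=10):
    """Return True if a HEAD request for url is answered with status 200.

    Hypothetical helper: only the headers are requested, so the body of the
    page or file is never downloaded.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn_cls = http.client.HTTPSConnection
    else:
        conn_cls = http.client.HTTPConnection

    # An empty path must be sent as "/"; keep the query string if there is one.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query

    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", path)
        return conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        # Assumption: DNS failures, refused connections and timeouts all
        # count as a dead link.
        return False
    finally:
        conn.close()
```

It would be called the same way as the recipe above, for example check_url('http://slashdot.org/notadirectory'). Note that the check is strict: a redirect status such as 301 or 302 is reported as False, because the sketch does not follow redirects.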

©2003 - 2009 DaniWeb® LLC