urllib - check if web page exists
Join Date: Nov 2007, Posts: 140, Solved Threads: 28
I've been writing a script that searches Google for pages on my web site that meet certain criteria, for example Excel spreadsheets containing data that should not be on the web. So far, so good.
When we find something and take it down, it can take a while for the result to drop out of Google. To reduce false positives, I would like to check whether each link is dead or not. The problem is that I don't know how to do that without downloading the whole page, and that could add up to a lot of data.
Is there a way that I can touch the link and see if it is still an active link without downloading each of the files?
Join Date: Nov 2007, Posts: 6, Solved Threads: 1
I know a small recipe that does just this (checking for existence) over HTTP. I'm not sure whether it applies to your case, but here it is:
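    from httplib import HTTP
    from urlparse import urlparse

    def checkURL(url):
        # Split the URL: p[1] is the host, p[2] is the path.
        p = urlparse(url)
        h = HTTP(p[1])
        # HEAD asks for the headers only, so the body is never downloaded.
        h.putrequest('HEAD', p[2])
        h.endheaders()
        # getreply() returns (status code, message, headers).
        if h.getreply()[0] == 200:
            return 1
        else:
            return 0

    if __name__ == '__main__':
        print checkURL('http://slashdot.org')
        print checkURL('http://slashdot.org/notadirectory')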
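For anyone on Python 3, where httplib and urlparse became http.client and urllib.parse, a rough equivalent of the recipe above might look like the following. It is only a sketch: the name check_url, the timeout parameter, the https handling and the error handling are assumptions added for illustration, not part of the original recipe.

```python
from http.client import HTTPConnection, HTTPSConnection, HTTPException
from urllib.parse import urlsplit


def check_url(url, timeout=10):
    """Return True if a HEAD request for url is answered with status 200.

    Hypothetical helper: the name, the timeout default and the https
    support are assumptions, not taken from the original recipe.
    """
    parts = urlsplit(url)
    conn_class = HTTPSConnection if parts.scheme == "https" else HTTPConnection
    # Fall back to "/" when the URL has no path, and keep any query string.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    conn = conn_class(parts.netloc, timeout=timeout)
    try:
        # HEAD asks for the headers only, so the file body is never sent.
        conn.request("HEAD", target)
        return conn.getresponse().status == 200
    except (OSError, HTTPException):
        # DNS failure, refused connection, timeout or a malformed reply:
        # treat all of these as a dead link.
        return False
    finally:
        conn.close()


if __name__ == "__main__":
    print(check_url("http://slashdot.org"))
    print(check_url("http://slashdot.org/notadirectory"))
```

Like the original, this only counts a plain 200 as alive and does not follow redirects, so a link that answers with 301/302, or a server that refuses HEAD with 405, would be reported as dead. Whether that is acceptable depends on how the site in question responds.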





