urllib - check if web page exists
I've been writing a script that checks Google for pages on my web site that meet certain criteria, for example Excel spreadsheets containing data that should not be on the web. So far, so good.
When we find something and take it down, it can take a while for the result to drop out of Google. To reduce false positives, I would like to check whether each link is dead. The problem is that I don't know how to do that without downloading the whole page, which could add up to a lot of data.
Is there a way to touch the link and see whether it is still active without downloading each of the files?
I know a small recipe that checks for existence over HTTP. It sends a HEAD request, so the server returns only the headers and not the file itself. I'm not sure it fits your case, but here it is:
    from httplib import HTTP
    from urlparse import urlparse

    def checkURL(url):
        # Send a HEAD request so only the headers come back, not the body.
        p = urlparse(url)
        h = HTTP(p[1])                      # p[1] is the host (netloc)
        h.putrequest('HEAD', p[2] or '/')   # p[2] is the path; use '/' if empty
        h.endheaders()
        if h.getreply()[0] == 200:
            return 1
        else:
            return 0

    if __name__ == '__main__':
        print checkURL('http://slashdot.org')
        print checkURL('http://slashdot.org/notadirectory')
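The recipe above is Python 2. In Python 3, httplib and urlparse became http.client and urllib.parse, so the same idea might look roughly like the sketch below. The name check_url, the timeout parameter, and the choice to count any 2xx or 3xx status as "alive" are my own assumptions, not part of the original recipe. Some servers also refuse HEAD or answer it differently from GET, so treat the result as a hint rather than proof.

```python
from http.client import HTTPConnection, HTTPSConnection, HTTPException
from urllib.parse import urlsplit


def check_url(url, timeout=10):
    """Return True if a HEAD request for url gets a 2xx or 3xx status.

    check_url and timeout are illustrative names, not from the original post.
    """
    try:
        parts = urlsplit(url)
        if not parts.hostname:
            return False  # not an absolute http(s) URL
        # Pick the connection class from the URL scheme.
        conn_cls = HTTPSConnection if parts.scheme == "https" else HTTPConnection
        # An empty path is not a valid request target, so fall back to "/".
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn = conn_cls(parts.hostname, parts.port, timeout=timeout)
        try:
            # HEAD asks for the headers only, so no file body is downloaded.
            conn.request("HEAD", path)
            status = conn.getresponse().status
        finally:
            conn.close()
    except (OSError, HTTPException, ValueError):
        # DNS failure, refused connection, timeout, malformed reply or bad port.
        return False
    # Assumption: redirects (3xx) still count as a live link.
    return 200 <= status < 400


if __name__ == "__main__":
    print(check_url("http://slashdot.org"))
    print(check_url("http://slashdot.org/notadirectory"))
```

If you only want an exact 200 to count, as in the original recipe, change the last line of the function to `return status == 200`.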