I've been writing a script that checks Google for pages on my web site that meet certain criteria, for example Excel spreadsheets containing data that shouldn't be on the web. So far, so good.

When we find something and take it down, it can take a while for the result to drop out of Google's index. One thing I would like to do to reduce false positives is check whether each link is dead or not. The problem I'm having is that I don't know how to do that without downloading the whole page, and that could add up to a lot of data.

Is there a way I can touch each link and see whether it is still active without downloading the files themselves?

I know a small recipe that does just this (checking for existence) over HTTP. I'm not too sure whether it will apply here, but it's the following:

from httplib import HTTP
from urlparse import urlparse

def checkURL(url):
    # Split the URL into (scheme, netloc, path, ...) pieces.
    p = urlparse(url)
    # Connect to the host and send a HEAD request, which asks the
    # server for the headers only -- the page body is never downloaded.
    h = HTTP(p[1])
    h.putrequest('HEAD', p[2] or '/')  # fall back to '/' for bare domain URLs
    h.endheaders()
    # getreply() returns (status, reason, headers); 200 means the page exists.
    if h.getreply()[0] == 200:
        return 1
    else:
        return 0

if __name__ == '__main__':
    print checkURL('http://slashdot.org')
    print checkURL('http://slashdot.org/notadirectory')
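
If it helps, here is a rough Python 3 equivalent of the same idea, a minimal sketch rather than a drop-in solution. The check_url name, the timeout, and the choice to treat any status below 400 (including redirects) as "still alive" are my own assumptions, not part of the original recipe; adjust them to whatever counts as a live result for your script.

# Minimal sketch of a Python 3 HEAD check (standard library only).
# Assumption: any status < 400, including 3xx redirects, counts as alive.
from http.client import HTTPConnection, HTTPSConnection
from urllib.parse import urlparse

def check_url(url, timeout=10):
    parts = urlparse(url)
    conn_cls = HTTPSConnection if parts.scheme == 'https' else HTTPConnection
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        # HEAD returns headers only, so no page body is transferred.
        conn.request('HEAD', parts.path or '/')
        return conn.getresponse().status < 400
    except OSError:
        # DNS failure, refused connection, timeout, etc. -- treat as dead.
        return False
    finally:
        conn.close()

if __name__ == '__main__':
    print(check_url('http://slashdot.org'))
    print(check_url('http://slashdot.org/notadirectory'))

For your use case, a page you have already taken down will typically come back as 404, so checking for "not OK" rather than exactly 200 may be the more useful test.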