I am building a crawler and parser in Python that needs to run for about 20 hours. How can I modify the code so that execution pauses (before the next urllib2.urlopen) when the internet connection drops, and automatically resumes with the same variable values when the connection comes back?


The problem is that there's no way to know "when the internet is disconnected" ahead of time. You might be able to reach 99.99% of all websites but not the one you're trying at the moment. Is "the internet disconnected" in that case? What if it's the other way around, where you can't reach most websites but can reach the one you are looking for? Is "the internet disconnected" in that case?

My suggestions:

  1. Learn to handle connection failures with exception handlers and timed waits in a loop. urllib2 raises urllib2.URLError when it has a problem reaching a website. This is not that difficult.
  2. Use threading to have several spiders going at once.
  3. If you don't already, learn to read and honor robots.txt; otherwise you may face legal action for noncompliance, depending on the jurisdiction.
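
To make the first suggestion concrete, here is a minimal sketch of a retry loop. The function name fetch_with_retry and its parameters (opener, wait, max_wait, sleep) are made up for illustration, and the doubling wait is just one possible policy. Because the loop sits inside an ordinary function call, the crawler's variables simply stay in memory while it waits. HTTP errors such as 404 are re-raised rather than retried, on the assumption that a bad link will not fix itself.

```python
import time

try:  # Python 2, as used in the question
    from urllib2 import urlopen, URLError, HTTPError
except ImportError:  # Python 3 equivalent
    from urllib.request import urlopen
    from urllib.error import URLError, HTTPError


def fetch_with_retry(url, opener=urlopen, wait=30, max_wait=600, sleep=time.sleep):
    """Return the body of `url`, sleeping and retrying while the network is down.

    All names and defaults here are illustrative assumptions.
    """
    delay = wait
    while True:
        try:
            return opener(url).read()
        except HTTPError:
            # The server answered (404, 500, ...), so the connection is up.
            # Let the caller decide what to do with a bad link.
            raise
        except URLError:
            # Could not reach the server at all: wait, then try the same URL again.
            sleep(delay)
            delay = min(delay * 2, max_wait)  # back off, but cap the wait
```

Replacing urllib2.urlopen(url).read() in the crawl loop with fetch_with_retry(url) would be enough for the pause-and-resume behaviour. Depending on the failure, other exceptions (socket timeouts, for example) may also need to be caught.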

A small warning: be careful not to create an ever-increasing herd of spiders when you start spawning them in multiple threads or processes. As with robots.txt, you have responsibilities to the community (and a contract with your ISP).
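
One way to keep the herd from growing is to start a fixed number of worker threads that pull URLs from a shared queue, instead of spawning a thread per link. The sketch below assumes a hypothetical crawl_with_pool helper and a caller-supplied fetch function; the worker count of 4 is arbitrary.

```python
import threading

try:
    import Queue as queue  # Python 2
except ImportError:
    import queue  # Python 3


def crawl_with_pool(urls, fetch, num_workers=4):
    """Fetch every URL with a fixed number of threads; return {url: result}."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # no work left, so this thread exits
            try:
                data = fetch(url)
            except Exception as exc:
                data = exc  # record the failure and keep going
            with lock:
                results[url] = data

    # The thread count is fixed up front, so it can never grow with the link count.
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Keeping num_workers small also limits how hard any single site gets hit at once.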

Thanks for the suggestions. My spider is complete and running successfully. I just wanted to know in advance about resuming, in case there is a problem. URLError works great for bad links (which are very rare, though). Currently the spider is capable of resuming from the last point, since I am saving links to disk after every fetch. (So on reconnection I have to manually run a different subroutine that takes the URL after the last fetched one.) I just thought it would be cool to know whether there is some tweak available to automate that.
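
The manual restart step could probably be folded into the crawl loop itself. The following is only a sketch under assumptions: the file format (one finished URL per line), the names load_done and crawl_resumable, and the fetch/handle callbacks are all hypothetical stand-ins for whatever the real spider uses.

```python
import os


def load_done(path):
    """Return the set of URLs already recorded in the checkpoint file."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())


def crawl_resumable(urls, fetch, handle, done_path):
    """Process `urls` in order, skipping any that a previous run completed."""
    done = load_done(done_path)
    with open(done_path, "a") as log:
        for url in urls:
            if url in done:
                continue  # finished in an earlier run
            handle(url, fetch(url))
            # Record the URL only after it was fully handled, and flush so
            # that a crash loses at most the page currently in progress.
            log.write(url + "\n")
            log.flush()
```

Running the same script again after a crash then picks up at the first URL that is not in the checkpoint file, with no separate subroutine needed.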

Luckily for me, robots.txt only disallows 7 URLs from the entire archives section, which has 4,000,000+ links. Thanks for the replies. Will try threading.
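
For checking each link against a handful of disallowed paths, the standard library's robots.txt parser may be enough. This is a rough sketch: the rules, the "MyCrawler" user agent, and the example.com URLs are invented for illustration, and a real crawler would load the site's actual robots.txt instead of a hard-coded list.

```python
try:
    from robotparser import RobotFileParser  # Python 2
except ImportError:
    from urllib.robotparser import RobotFileParser  # Python 3


def build_robot_rules(robots_txt_lines):
    """Parse robots.txt content that is already in memory (no network access)."""
    rules = RobotFileParser()
    rules.parse(robots_txt_lines)
    return rules


# Made-up rules standing in for the site's real robots.txt.
rules = build_robot_rules([
    "User-agent: *",
    "Disallow: /archives/private/",
])

# Ask before every fetch; skip the URL when can_fetch returns False.
allowed = rules.can_fetch("MyCrawler", "http://example.com/archives/2009/page.html")
blocked = rules.can_fetch("MyCrawler", "http://example.com/archives/private/x.html")
```

To read the live file instead, RobotFileParser also offers set_url() followed by read().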