Hello.
I was looking for a tutorial or any example of creating a web crawler when I found this code somewhere, and I copied and pasted it to test it:

First, it is a web crawler, right? Because when I gave it the URL of a website, the output was some links printed to the terminal.

Second, if you test it yourself, you will see that the links are divided into sections headed "Scanning depth 1 web" and so on (the number changes). What is that for? What does it mean? What does the depth number mean?

Third, I want to send exactly everything I see printed in the terminal into a text file, so where should I put this code:

with open('file.txt', 'w') as f:
    f.write()

And what should I type in the parentheses?
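My own guess is something like the sketch below, which would copy everything printed into file.txt by replacing sys.stdout (file.txt is just a placeholder name, and I'm not sure this is the right approach):

import sys

class Tee(object):
    # writes everything to the terminal and to a log file at the same time
    def __init__(self, path):
        self.terminal = sys.stdout
        self.log = open(path, 'w')
    def write(self, text):
        self.terminal.write(text)
        self.log.write(text)
    def flush(self):
        self.terminal.flush()
        self.log.flush()

sys.stdout = Tee('file.txt')  # every print after this line also goes into file.txt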

And finally, I have a request: could you explain each line of the code for me, please, if you are familiar with it? Even an explanation of a few lines would be really helpful, because I don't understand it clearly and I want to learn it well. It's only a request, and I will be happy if you help me understand it.
Thank you in advance :)

All 5 Replies

Oops! Forgot to send the code!

# -*- coding: utf-8 -*-
# NOTE: Python 2 code; in Python 3 these modules are html.parser and urllib.request
from HTMLParser import HTMLParser
from urllib2 import urlopen

class Spider(HTMLParser):
    def __init__(self, starting_url, depth, max_span):
        HTMLParser.__init__(self)
        self.url = starting_url
        self.db = {self.url: 1}  # maps each link to how many times it was seen
        self.node = [self.url]   # the frontier: pages to scan at the next depth

        self.depth = depth # recursion depth max
        self.max_span = max_span # max links obtained per url
        self.links_found = 0

    def handle_starttag(self, tag, attrs):
        # called by feed() for every opening tag; we only care about <a> tags
        if self.links_found < self.max_span and tag == 'a' and attrs:
            link = dict(attrs).get('href')  # take the href attribute explicitly
            if not link:
                return
            if link[:4] != "http":  # relative link: glue it onto the site root
                link = '/'.join(self.url.split('/')[:3])+('/'+link).replace('//','/')

            if link not in self.db:
                print "new link ---> %s" % link
                self.links_found += 1
                self.node.append(link)  # queue it for the next depth
            self.db[link] = (self.db.get(link) or 0) + 1

    def crawl(self):
        for depth in xrange(self.depth):
            print "*"*70+("\nScanning depth %d web\n" % (depth+1))+"*"*70
            context_node = self.node[:]  # pages found at the previous depth
            self.node = []               # next depth's pages get collected here
            for self.url in context_node:
                self.links_found = 0
                try:
                    req = urlopen(self.url)
                    res = req.read()
                    self.feed(res)       # parsing triggers handle_starttag above
                except Exception:        # a page that fails to load or parse
                    self.reset()         # shouldn't kill the whole crawl
        print "*"*40 + "\nRESULTS\n" + "*"*40
        zorted = [(v,k) for (k,v) in self.db.items()]
        zorted.sort(reverse = True)      # most frequently seen links first
        return zorted

if __name__ == "__main__":
    spidey = Spider(starting_url = 'http://www.python.org', depth = 5, max_span = 10)
    result = spidey.crawl()
    for (n,link) in result:
        print "%s was found %d time%s." %(link,n, "s" if n is not 1 else "")

I've tested the code again. I gave it a URL, and then the output was:

**********************************************************************
Scanning depth 1 web
**********************************************************************
**********************************************************************
Scanning depth 2 web
**********************************************************************
**********************************************************************
Scanning depth 3 web
**********************************************************************
**********************************************************************
Scanning depth 4 web
**********************************************************************
**********************************************************************
Scanning depth 5 web
**********************************************************************
****************************************
RESULTS
****************************************
http://<the main URL I gave to the program as starting_url>/ was found 1 time.

But there were many links on that page of the website, so why were none of them printed to the terminal?! Is it a web crawler?

I thought a web crawler first enters the page whose URL we give it, then finds all the links on that page and prints them, then enters each of those links and does it all again, but here I got a different result. In other words, I expected something like the loop sketched below.
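This is only a sketch of my understanding, not the posted code (python.org is just an example URL, and the page cap is there only so the sketch stops):

from HTMLParser import HTMLParser
from urllib2 import urlopen

class LinkCollector(HTMLParser):
    # collects the href of every <a> tag on one page into self.links
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)

to_visit = ['http://www.python.org']  # example starting URL
seen = set(to_visit)
pages = 0
while to_visit and pages < 10:        # the page cap just keeps the sketch finite
    url = to_visit.pop(0)             # take the next page in the queue
    pages += 1
    collector = LinkCollector()
    try:
        collector.feed(urlopen(url).read())
    except Exception:
        continue                      # skip pages that fail to load or parse
    for link in collector.links:
        print link                    # print every link found on the page
        if link.startswith('http') and link not in seen:  # only absolute links here
            seen.add(link)
            to_visit.append(link)     # enter that page later and repeat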

Your code (the crawl function) doesn't find the links in the page; it isn't crawling it. I think it's not implemented yet.

So why does it work with 'http://www.python.org' as the main URL I give to the program, but when I tried it with another URL, the result was what I posted in my previous post?!

The links might be hidden in the JavaScript section of the page. Not everything is straight HTML.
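A quick way to check (a rough sketch; python.org is only the example URL, substitute the site that gave you nothing):

from urllib2 import urlopen

html = urlopen('http://www.python.org').read()
print "bytes fetched:", len(html)
print "rough count of <a> tags in the raw HTML:", html.count('<a ')

If the count is near zero but your browser shows plenty of links, the links are being built by JavaScript, which urlopen never executes, so HTMLParser has nothing to find.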
