I was looking for a tutorial or any example of creating a web crawler when I found this code somewhere, so I copied and pasted it to test it:

First, it is a web crawler, right? Because when I gave it the URL of a website, the output was some links printed to the terminal.

Second, if you test it yourself, you will see that the links are divided into sections with the title "Scanning depth 1 web" and so on (the number changes). What is that for? What does it mean? What does the depth number mean?

Third, I want to send exactly everything that gets printed to the terminal into a text file, so where should I put this code:

with open('file.txt', 'w') as f:

And what should I type in the parentheses?
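One common way to capture everything a script prints (a sketch, not from the thread: it temporarily points `sys.stdout` at the file, and `'file.txt'` is just an example name):

```python
import sys

with open('file.txt', 'w') as f:
    old_stdout = sys.stdout   # remember the real terminal stream
    sys.stdout = f            # from here on, print writes into file.txt
    try:
        print('new link ---> http://www.python.org/about')  # sample output line
    finally:
        sys.stdout = old_stdout  # always restore the terminal

with open('file.txt') as f:
    print(f.read().strip())
```

Placed around the whole crawl, this would send every `print` in the program into the file instead of the terminal.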

And finally, I have a request: could you please explain each line of the code for me, if you are familiar with it? Even a few lines of explanation would be really helpful, because I don't understand it clearly and I want to learn it well. It's only a request, and I'd be happy if you could help me understand it.
Thank you in advance :)


Oops! I forgot to send the code!

# -*- coding: utf-8 -*-
from HTMLParser import HTMLParser
from urllib2 import urlopen

class Spider(HTMLParser):
    def __init__(self, starting_url, depth, max_span):
        HTMLParser.__init__(self)  # initialise the base parser
        self.url = starting_url
        self.db = {self.url: 1}  # link -> number of times seen
        self.node = [self.url]   # frontier: urls to visit at the next depth

        self.depth = depth # recursion depth max
        self.max_span = max_span # max links obtained per url
        self.links_found = 0

    def handle_starttag(self, tag, attrs):
        if self.links_found < self.max_span and tag == 'a' and attrs:
            link = attrs[0][1]
            if link[:4] != "http":
                link = '/'.join(self.url.split('/')[:3])+('/'+link).replace('//','/')

            if link not in self.db:
                print "new link ---> %s" % link
                self.links_found += 1
                self.node.append(link)  # queue the new link for the next depth
            self.db[link] = (self.db.get(link) or 0) + 1

    def crawl(self):
        for depth in xrange(self.depth):
            print "*"*70+("\nScanning depth %d web\n" % (depth+1))+"*"*70
            context_node = self.node[:]
            self.node = []
            for self.url in context_node:
                self.links_found = 0
                try:
                    req = urlopen(self.url)
                    res = req.read()
                    self.feed(res)  # parse the page; handle_starttag collects links
                except Exception:
                    continue  # skip urls that fail to open or parse
        print "*"*40 + "\nRESULTS\n" + "*"*40
        zorted = [(v,k) for (k,v) in self.db.items()]
        zorted.sort(reverse = True)
        return zorted

if __name__ == "__main__":
    spidey = Spider(starting_url = 'http://www.python.org', depth = 5, max_span = 10)
    result = spidey.crawl()
    for (n,link) in result:
        print "%s was found %d time%s." % (link, n, "s" if n != 1 else "")

I've tested the code again. I gave it a URL and the output was:

Scanning depth 1 web
Scanning depth 2 web
Scanning depth 3 web
Scanning depth 4 web
Scanning depth 5 web
http://<the main URL I gave to the program in line 47>/ was found 1 time.

But there were many links on that page of the website, so why weren't any of them printed to the terminal?! Is it a web crawler?!

I thought a web crawler first entered the page whose URL we gave it, then found all the links on that page, printed them, and then entered each of those links to do the same thing again, but here I got a different result.
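That "find links, then visit each of them one level deeper" behaviour is a breadth-first crawl, and each pass is one "depth". It can be sketched without the network by swapping `urlopen` for a hypothetical in-memory site (`fake_pages` below is made-up data; Python 3 module names):

```python
from html.parser import HTMLParser  # Python 3; the thread's code uses the Python 2 module

class LinkCollector(HTMLParser):
    """Gather the href of every <a> tag in one page."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links.extend(v for k, v in attrs if k == 'href')

# Hypothetical site: a tiny in-memory "web" instead of real downloads.
fake_pages = {
    '/':  '<a href="/a">a</a> <a href="/b">b</a>',
    '/a': '<a href="/c">c</a>',
    '/b': '',
    '/c': '',
}

def crawl(start, depth):
    seen = {start}
    frontier = [start]          # pages to visit at the current depth
    for level in range(depth):
        next_frontier = []
        for url in frontier:
            parser = LinkCollector()
            parser.feed(fake_pages[url])        # real code would feed urlopen(url).read()
            for link in parser.links:
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)  # visit it one level deeper
        frontier = next_frontier
    return sorted(seen)

print(crawl('/', depth=2))
```

Depth 1 discovers /a and /b from the start page; depth 2 follows them and discovers /c, which is exactly what the "Scanning depth N web" headings mark.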

Your code (the crawl function) doesn't find the links in the page - it's not crawling it. I think it's not implemented yet.

So why does it work with 'http://www.python.org' as the main URL I give the program, but when I tried it with another URL, the result was what I posted in my previous post?!

The links might be hidden in the JavaScript section of the page. Not everything is straight HTML.
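A quick way to see this: HTMLParser treats everything inside a `<script>` element as raw text, so an anchor built by JavaScript never reaches `handle_starttag`. A minimal sketch (hypothetical HTML, Python 3's `html.parser`):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Record the href of every <a> start tag the parser actually sees."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

# One anchor written by JavaScript, one in plain HTML.
page = ('<script>document.write(\'<a href="/hidden">x</a>\');</script>'
        '<a href="/visible">visible</a>')
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # only the plain-HTML link is found
```

A page whose navigation is generated this way would give the crawler nothing to follow, which matches the empty result you saw.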
