I have the following web crawler code and would like to add a limit on the depth of the search, but I am unsure how to implement this.

import urllib2

def getAllNewLinksOnPage(page, prevLinks):

    # Download the page and scan its HTML for anchor tags.
    response = urllib2.urlopen(page)
    html = response.read()

    links, pos, allFound = [], 0, False
    while not allFound:
        aTag = html.find("<a href=", pos)
        if aTag > -1:
            # Pull the quoted url out of the href attribute.
            href = html.find('"', aTag + 1)
            endHref = html.find('"', href + 1)
            url = html[href + 1:endHref]
            if url[:7] == "http://":
                if url[-1] == "/":
                    url = url[:-1]
                # Keep only urls not already found on this page or crawled earlier.
                if url not in links and url not in prevLinks:
                    links.append(url)
                    print url
            closeTag = html.find("</a>", aTag)
            pos = closeTag + 1
        else:
            allFound = True
    return links

url = raw_input("Enter the seed URL:")
toCrawl=[url]
crawled=[]
while toCrawl:
    url=toCrawl.pop()
    crawled.append(url)
    newLinks=getAllNewLinksOnPage(url,crawled)
    toCrawl=list(set(toCrawl)|set(newLinks))

print crawled   

You need a breadth-first traversal for this. The algorithm is very simple: suppose the urls line up at the post office, each with a number D written on its T-shirt representing its depth. For every url, its web page is loaded and new urls are discovered. The urls which have already been seen are thrown away; the others join the queue wearing T-shirt D+1. Initially, there is only one url in the queue, with T-shirt 0. To cap the depth of the crawl, simply stop loading pages for urls whose T-shirt number has reached your limit.
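Here is a minimal sketch of that idea bolted onto your existing loop. It reuses your getAllNewLinksOnPage function unchanged and adds a maxDepth value (my name, read from the user here); collections.deque provides the FIFO queue, and each queue entry carries the url together with its T-shirt number.

from collections import deque

maxDepth = int(raw_input("Enter the maximum depth: "))   # hypothetical limit, e.g. 2
seed = raw_input("Enter the seed URL: ")

toCrawl = deque([(seed, 0)])   # each entry is a (url, depth) pair; the seed wears T-shirt 0
crawled = []

while toCrawl:
    url, depth = toCrawl.popleft()          # FIFO pop gives breadth-first order
    crawled.append(url)
    if depth < maxDepth:
        # getAllNewLinksOnPage already discards urls that were crawled before;
        # anything new joins the queue wearing T-shirt depth + 1.
        for link in getAllNewLinksOnPage(url, crawled):
            if link not in [u for u, d in toCrawl]:
                toCrawl.append((link, depth + 1))

print crawled

Popping from the left of the deque is what makes this breadth-first; your original code rebuilds toCrawl from a set each pass, so it keeps no order and no depth information at all. Also note that in this sketch urls discovered at the maximum depth are recorded in crawled but their pages are never fetched, which is one reasonable reading of a depth limit; adjust the depth test if you want those last pages downloaded as well.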
