web crawler

Question

leegeorg07

15 Years Ago

Hi again. I have been assigned a project to create a web crawler in Python, but I have no idea where to start, so all help will be welcome.

python


mn_kthompson 3 Junior Poster

15 Years Ago

This is a good place to start. http://cis.poly.edu/cs912/parsing.txt

That is sample code that you can use to gather all of the links on a particular web page. Once you have the list of links on a page, you can repeat the process for each of those links. Keep repeating until you have all the links you want.
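The loop described above (collect a page's links, then repeat for each new link) could be sketched roughly as follows. This is only a minimal illustration, assuming Python 3's `html.parser` and `urllib` rather than the older modules used in the linked sample; the names `LinkParser`, `extract_links`, `fetch`, and `crawl`, and the `max_pages` limit, are hypothetical and not taken from that sample.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Return absolute http(s) links found in html, resolved against base_url."""
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    absolute = []
    for href in parser.links:
        # Resolve relative links and drop any "#fragment" part.
        url, _fragment = urldefrag(urljoin(base_url, href))
        if url.startswith(("http://", "https://")):
            absolute.append(url)
    return absolute


def fetch(url):
    """Download a page as text (assumed helper; needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_pages=10, fetch_page=fetch):
    """Breadth-first crawl from start_url, visiting at most max_pages pages."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except (OSError, ValueError):
            continue  # skip pages that cannot be opened
        visited.append(url)
        for link in extract_links(html, url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Calling `crawl("http://example.com/", max_pages=5)` would return the list of pages actually visited. The `seen` set keeps the crawler from fetching the same page twice, and `max_pages` keeps it from running forever.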

Gribouillis 1,391 Programming Explorer

15 Years Ago

If you add a line print(item) before your line data = urllib.urlopen(item), you might see why urlopen can't open the URL.
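Printing each item will probably show that many of the links are relative, such as "/" or "/login", rather than full URLs. The traceback (a failed attempt to open '\\' as a local file) is consistent with that, though this is an inference, not something confirmed in the thread. A minimal sketch of one way to handle it is to resolve every href against the page's URL before opening it. The example below uses Python 3's `urllib.parse` (in the Python 2 used in this thread, `urljoin` and `urlparse` live in the `urlparse` module); the function name `resolve_links` and the sample href values are made up for illustration.

```python
from urllib.parse import urljoin, urlparse


def resolve_links(base_url, hrefs):
    """Turn raw href values into absolute http(s) URLs, dropping the rest."""
    resolved = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # "/login" -> "http://host/login"
        if urlparse(absolute).scheme in ("http", "https"):
            resolved.append(absolute)
    return resolved


# Hypothetical href values of the kind a page's links might contain.
sample = ["/", "/login", "watch?v=abc", "http://example.com/page", "javascript:void(0)"]
links = resolve_links("http://uk.youtube.com/", sample)
for link in links:
    print(link)
```

Only the resolved links are then passed to urlopen, so entries like "/" are never mistaken for local file paths, and non-web links such as javascript: pseudo-URLs are skipped.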


leegeorg07 · Answer 1 · 2009-01-28T23:08:26+00:00

Thanks, that works, but I have a little problem with the code:

import urllib, htmllib, formatter
a = []
running = 0
class LinksExtractor(htmllib.HTMLParser): # derive new HTML parser
    def __init__(self, formatter) :        # class constructor
      htmllib.HTMLParser.__init__(self, formatter)  # base class constructor
      self.links = []        # create an empty list for storing hyperlinks
    def start_a(self, attrs) :  # override handler of <A ...>...</A> tags
      # process the attributes
      if len(attrs) > 0 :
         for attr in attrs :
            if attr[0] == "href" :         # ignore all non HREF attributes
                self.links.append(attr[1]) # save the link info in the list
    def get_links(self):
        return self.links
format = formatter.NullFormatter()           # create default formatter
htmlparser = LinksExtractor(format)        # create new parser object

data = urllib.urlopen("http://uk.youtube.com/")
htmlparser.feed(data.read())      # parse the file saving the info about links
htmlparser.close()

links = htmlparser.get_links()   # get the hyperlinks list
print (links)   # print all the links

while running <=3:
    for item in links:
        a.append(item)
        for item in a:
            data = urllib.urlopen(item)
        htmlparser.feed(data.read())
        htmlparser.close()

        links = htmlparser.get_links()
        print(links)

It raises this error:

Traceback (most recent call last):
  File "C:\Python26\web crawler start.py", line 30, in <module>
    data = urllib.urlopen(item)
  File "C:\Python26\lib\urllib.py", line 87, in urlopen
    return opener.open(url)
  File "C:\Python26\lib\urllib.py", line 203, in open
    return getattr(self, name)(url)
  File "C:\Python26\lib\urllib.py", line 461, in open_file
    return self.open_local_file(url)
  File "C:\Python26\lib\urllib.py", line 486, in open_local_file
    return addinfourl(open(localname, 'rb'),
IOError: [Errno 2] No such file or directory: '\\'

In this trial I used YouTube.

Why does this error occur, and how can I solve it?

leegeorg07 · Answer 2 · 2009-01-29T01:30:14+00:00

Hi, I worked out that it does that whenever I have a website with a login link. How can I solve this problem, and after that, how can I implement it into a search engine?

leegeorg07 · Answer 3 · 2009-01-30T14:35:43+00:00

I guess that I'm going to need some knowledge of PHP, and the information will need to be in a txt file, but how can I go about doing this?
