Hey guys,
I'm trying to extract the top 10 links from a Yahoo search results page. I can get all the links using the code below, but that could be 70 links.

Any idea how I could get just the top 10 ranked ones, and not the adverts etc.?

For example, for this page:

http://uk.search.yahoo.com/search?p=python&fr=yfp-t-501&ei=UTF-8&meta=vc%3D

I would only want:
1. www.python.org
2. www.pythonline.com
3. www.python.org/download
...
10. etc.

Is there anything that distinguishes the links in the top ten, so that I could try to extract just those?

Here's the main lump of my code, which returns ALL links on that page:

if __name__ == "__main__":
    import urllib
    usock = urllib.urlopen("http://uk.search.yahoo.com/search?p=python&fr=yfp-t-501&ei=UTF-8&meta=vc%3D")
    parser = URLLister()  # my own parser class, defined elsewhere in the script
    parser.feed(usock.read())
    parser.close()
    usock.close()
    path = u"c:\\Users\\admin\\Desktop\\"
    # Download each collected link and save the page to its own numbered file.
    for i, url in enumerate(parser.urls):
        print i
        print url
        page = urllib.urlopen(url).read()
        f = open(path + u"test" + str(i) + u".txt", "w")
        f.write(page)
        f.close()
        print "Html file successfully printed to file!"

Any help appreciated,

Thanks guys :)


Your code is broken as it stands. What's the URLLister() class? I don't have it in my urllib.

Jeff


Sorry, it's one of my own classes. I didn't include all the code, just the main chunk. Here is the whole thing:

from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k=='href']
        if href:
            self.urls.extend(href)

if __name__ == "__main__":
    import urllib
    usock = urllib.urlopen("http://uk.search.yahoo.com/search?p=cinemas+in+dublin&fr=yfp-t-501&ei=UTF-8&meta=vc%3D")
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    path = u"c:\\Users\\Neil\\Desktop\\"
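    # Visit every second link (indices 0, 2, 4, ...) and save each page to its own file.
    # range() stops before len(parser.urls), so the index can never run past the end.
    for i in range(0, len(parser.urls), 2):
        print i
        print parser.urls[i]
        page = urllib.urlopen(parser.urls[i]).read()
        f = open(path + u"test" + str(i) + u".txt", "w")
        f.write(page)
        f.close()
        print "Html file successfully printed to file!"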
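
For anyone reading this on Python 3, where sgmllib is no longer available, roughly the same link collector can be sketched with html.parser. This is only an approximation of the URLLister class above, not a drop-in replacement, and the helper name list_links is made up for illustration:

```python
from html.parser import HTMLParser


class URLLister(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            self.urls.extend(v for k, v in attrs if k == "href" and v)


def list_links(html):
    """Hypothetical helper: return every link found in an HTML string."""
    parser = URLLister()
    parser.feed(html)
    parser.close()
    return parser.urls
```

Like the original, this returns every link on the page, adverts included, so it does not solve the top-ten problem by itself.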

Any idea how I can just get the top ten links?

Thanks :)
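
One possible direction, sketched under assumptions: if the ranked results carry some marker in the HTML that the adverts and navigation links lack, the parser can keep only the links with that marker and stop after ten. The sketch below assumes the marker is a CSS class on the result's own <a> tag. The class name "result-title" is purely hypothetical, as are the names TopResultLister and top_links; the real marker (if there is one) has to be found by viewing the source of the results page, and it may change whenever the site changes its layout. The sketch is written for Python 3 (html.parser) rather than the Python 2 sgmllib used above.

```python
from html.parser import HTMLParser


class TopResultLister(HTMLParser):
    """Collect hrefs of <a> tags that carry a given CSS class, up to a limit."""

    def __init__(self, marker_class, limit=10):
        super().__init__()
        self.marker_class = marker_class  # hypothetical marker, see lead-in
        self.limit = limit
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Ignore everything that is not a link, and stop once we have enough.
        if tag != "a" or len(self.urls) >= self.limit:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        href = attrs.get("href")
        if href and self.marker_class in classes:
            self.urls.append(href)


def top_links(html, marker_class="result-title", limit=10):
    """Return at most `limit` links whose <a> tag has `marker_class`."""
    parser = TopResultLister(marker_class, limit)
    parser.feed(html)
    parser.close()
    return parser.urls
```

If the marker turns out to sit on a surrounding element (a list item or heading, say) instead of on the link itself, the same idea still applies, but the parser would also need to track whether it is currently inside such an element.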
