Hi All,

Newbie here -

I am trying to write a program that lists all the URLs on a webpage, which I was able to do with the following script:

import urllib, urllister

urladdr = raw_input("Enter URL here: ")
usock = urllib.urlopen(urladdr)    # fetch the page
parser = urllister.URLLister()
parser.feed(usock.read())          # parse the HTML; links end up in parser.urls
usock.close()
parser.close()
for url in parser.urls:
    print url

However, I also want to reduce the output to unique entries and keep only the domain and subdomain names from the list. For example, if the output is:

http://abc.mydomain.com/partners/
http://xyz.mydomain.com/study/
http://otherdomain.com/
http://books.otherdomain.com/
http://irc.otherdomain.com/misc/tutorial.html
http://xyz.mydomain.com/study/
http://otherdomain.com/

I want to filter the output so that it displays only the unique domain and subdomain names. For example:

http://abc.mydomain.com/
http://xyz.mydomain.com/
http://otherdomain.com/
http://books.otherdomain.com/
http://irc.otherdomain.com/

Can anybody guide me with this?

Last Post by woooee
There are no doubt other and probably better ways to do this, but try
print url.split("/")
You want the third item, result[2], where result is the list returned by split(). For http://abc.mydomain.com/partners/ that item is abc.mydomain.com. You can then add it to a list or dictionary if it is not already there.
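
One of those other ways, as a rough sketch rather than a definitive answer: the standard library's urlsplit can pull out the scheme and host, and a set can track which ones have already been seen. The function name unique_hosts and the sample list are made up for illustration; in the original script you would pass parser.urls instead. The try/except import is only there so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit      # Python 2


def unique_hosts(urls):
    """Return 'scheme://host/' for each URL, without duplicates,
    in the order the hosts first appear. (Hypothetical helper.)"""
    seen = set()
    result = []
    for url in urls:
        parts = urlsplit(url)
        if not parts.netloc:
            continue  # skip links with no host part, e.g. relative links
        root = "%s://%s/" % (parts.scheme, parts.netloc)
        if root not in seen:
            seen.add(root)
            result.append(root)
    return result


# Sample data copied from the question; replace with parser.urls.
sample = [
    "http://abc.mydomain.com/partners/",
    "http://xyz.mydomain.com/study/",
    "http://otherdomain.com/",
    "http://books.otherdomain.com/",
    "http://irc.otherdomain.com/misc/tutorial.html",
    "http://xyz.mydomain.com/study/",
    "http://otherdomain.com/",
]

for root in unique_hosts(sample):
    print(root)
```

The list keeps the first-seen order; if order does not matter, sorted(set(...)) over the same roots would also do.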