Hi All,
Newbie here -
I am trying to write a program that lists all the URLs on a webpage, which I was able to do with the following script:
import urllib, urllister

# Fetch the page and feed its HTML to the link parser
urladdr = raw_input("Enter URL here: ")
usock = urllib.urlopen(urladdr)
parser = urllister.URLLister()
parser.feed(usock.read())
usock.close()
parser.close()

# Print every URL the parser collected
for url in parser.urls:
    print url
However, I also want to remove duplicates from the output and keep only the domain and subdomain part of each URL. For example, if the output is:
[url]http://abc.mydomain.com/partners/[/url]
[url]http://xyz.mydomain.com/study/[/url]
[url]http://otherdomain.com/[/url]
[url]http://books.otherdomain.com/[/url]
[url]http://irc.otherdomain.com/misc/tutorial.html[/url]
[url]http://xyz.mydomain.com/study/[/url]
[url]http://otherdomain.com/[/url]
I want to filter the output so that it displays only the unique domain and subdomain names. For example:
[url]http://abc.mydomain.com/[/url]
[url]http://xyz.mydomain.com/[/url]
[url]http://otherdomain.com/[/url]
[url]http://books.otherdomain.com/[/url]
[url]http://irc.otherdomain.com/[/url]
Can anybody guide me with this?
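
One possible approach, as a minimal sketch rather than a definitive answer: split each URL with the standard library's urlsplit, keep only the scheme and host, and use a set to skip anything already seen. The function name unique_hosts and the sample list are made up for illustration; the try/except import is only there so the same snippet runs on both Python 2 (as in the script above) and Python 3.

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit      # Python 2, as used by the script above


def unique_hosts(urls):
    """Return 'scheme://host/' for each URL, in first-seen order, without duplicates.

    unique_hosts is a hypothetical helper name, not part of urllister.
    """
    seen = set()
    result = []
    for url in urls:
        parts = urlsplit(url)
        if not parts.netloc:
            continue  # skip relative links, mailto: links, and similar
        # Host names are case-insensitive, so normalise before comparing
        root = "%s://%s/" % (parts.scheme, parts.netloc.lower())
        if root not in seen:
            seen.add(root)
            result.append(root)
    return result


# Sample input copied from the example output in the question
sample = [
    "http://abc.mydomain.com/partners/",
    "http://xyz.mydomain.com/study/",
    "http://otherdomain.com/",
    "http://books.otherdomain.com/",
    "http://irc.otherdomain.com/misc/tutorial.html",
    "http://xyz.mydomain.com/study/",
    "http://otherdomain.com/",
]

for root in unique_hosts(sample):
    print(root)
```

In the original script this would presumably replace the final loop, passing parser.urls instead of sample. If alphabetical order is preferred over first-seen order, wrapping the call in sorted() should be enough.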