
Fetching unique URLs, domain names and subdomain names from a web page

Hi All,

Newbie here -

I am trying to write a program that lists all the URLs on a web page, which I was able to do successfully with the following script:

import urllib, urllister
urladdr = raw_input("Enter URL here: ")
usock = urllib.urlopen(urladdr)
parser = urllister.URLLister()
parser.feed(usock.read())
usock.close()
parser.close()
for url in parser.urls: print url

However, I also want to remove duplicates from the output and keep only the domain and subdomain names. For example, if the output is:

http://abc.mydomain.com/partners/
http://xyz.mydomain.com/study/
http://otherdomain.com/
http://books.otherdomain.com/
http://irc.otherdomain.com/misc/tutorial.html
http://xyz.mydomain.com/study/
http://otherdomain.com/


I want to filter the output so that it displays only the unique domain and subdomain names. For example:

http://abc.mydomain.com/
http://xyz.mydomain.com/
http://otherdomain.com/
http://books.otherdomain.com/
http://irc.otherdomain.com/


Can anybody guide me with this?

morpheus063
Newbie Poster
2 posts since Feb 2008
Reputation Points: 10
Solved Threads: 0
 

There are no doubt other, and probably better, ways to do this, but try
print url.split("/")
You want the third item of the list that split returns, i.e. url.split("/")[2], which is the host name (for example abc.mydomain.com). You can then add it to a list or dictionary if it is not already in the list or dictionary.
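
Putting that suggestion together, here is a minimal sketch of the split-and-deduplicate idea. The function name unique_hosts and the sample list are made up for illustration; in the original script the input would be parser.urls. It assumes absolute links of the form scheme://host/... and simply skips anything else (relative paths, mailto: links, and so on). print(host) with a single argument behaves the same on Python 2 and Python 3.

```python
def unique_hosts(urls):
    """Return 'scheme://host/' prefixes in first-seen order, without duplicates."""
    seen = set()
    hosts = []
    for url in urls:
        # "http://abc.mydomain.com/partners/".split("/")
        #   -> ['http:', '', 'abc.mydomain.com', 'partners', '']
        parts = url.split("/")
        # Keep only absolute URLs: a scheme ending in ':', an empty
        # second item (from the '//'), and a non-empty host.
        if len(parts) < 3 or not parts[0].endswith(":") or parts[1] or not parts[2]:
            continue
        host = parts[0] + "//" + parts[2] + "/"
        if host not in seen:
            seen.add(host)
            hosts.append(host)
    return hosts


# Hypothetical sample input, copied from the example in the question.
sample = [
    "http://abc.mydomain.com/partners/",
    "http://xyz.mydomain.com/study/",
    "http://otherdomain.com/",
    "http://books.otherdomain.com/",
    "http://irc.otherdomain.com/misc/tutorial.html",
    "http://xyz.mydomain.com/study/",
    "http://otherdomain.com/",
]

for host in unique_hosts(sample):
    print(host)
```

The set is only there to make the "already seen?" check fast; the list keeps the hosts in the order they first appear on the page. If the order does not matter, sorting the set would do as well.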

woooee
Nearly a Posting Maven
2,454 posts since Dec 2006
Reputation Points: 777
Solved Threads: 714
 
