fetching Unique URL, Domain name and sub domain names from Web Page


morpheus063
Newbie Poster

  #1
Feb 8th, 2008
Hi All,

Newbie here -

I am trying to write a program that lists all the URLs on a web page. I was able to do that successfully with the following script:

    import urllib, urllister

    # Ask for the page address and download it
    urladdr = raw_input("Enter URL here: ")
    usock = urllib.urlopen(urladdr)

    # Feed the HTML to the parser, which collects the URLs in parser.urls
    parser = urllister.URLLister()
    parser.feed(usock.read())
    usock.close()
    parser.close()

    for url in parser.urls:
        print url


However, I also want to reduce the output to unique entries and keep only the domain and subdomain names. For example, if the output is:

    http://abc.mydomain.com/partners/
    http://xyz.mydomain.com/study/
    http://otherdomain.com/
    http://books.otherdomain.com/
    http://irc.otherdomain.com/misc/tutorial.html
    http://xyz.mydomain.com/study/
    http://otherdomain.com/

I want to filter the output so that it displays only the unique domain and subdomain names. For example:

    http://abc.mydomain.com/
    http://xyz.mydomain.com/
    http://otherdomain.com/
    http://books.otherdomain.com/
    http://irc.otherdomain.com/

Can anybody guide me with this?
woooee
Veteran Poster

  #2
Feb 8th, 2008
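There are no doubt other, and probably better, ways to do this, but try
print url.split("/")
For an absolute URL such as http://abc.mydomain.com/partners/ this prints ['http:', '', 'abc.mydomain.com', 'partners', '']. You want the 3rd item: if you store the list as result, that is result[2]. You can then add it to a list or dictionary if it is not already there.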
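
A rough sketch of that idea, in case it helps. It is written for Python 3, so print is a function here, unlike in the script above. The names host_via_split, unique_hosts and sample are made up for illustration, and sample just stands in for parser.urls. It also shows urlsplit from the standard library as an alternative to splitting by hand; that is a swapped-in technique, not something the split suggestion requires.

```python
from urllib.parse import urlsplit  # Python 3; Python 2 has urlsplit in the urlparse module


def host_via_split(url):
    """Host part of an absolute URL, using the split("/") idea."""
    parts = url.split("/")
    # "http://abc.mydomain.com/partners/" splits into
    # ['http:', '', 'abc.mydomain.com', 'partners', '']
    return parts[2] if len(parts) > 2 else None


def unique_hosts(urls):
    """Return 'scheme://host/' once per host, in first-seen order."""
    seen = set()
    roots = []
    for url in urls:
        pieces = urlsplit(url)
        if not pieces.netloc:
            continue  # relative links and mailto: links have no host
        if pieces.netloc not in seen:
            seen.add(pieces.netloc)
            roots.append("%s://%s/" % (pieces.scheme, pieces.netloc))
    return roots


# Hypothetical stand-in for parser.urls, taken from the example in the question
sample = [
    "http://abc.mydomain.com/partners/",
    "http://xyz.mydomain.com/study/",
    "http://otherdomain.com/",
    "http://books.otherdomain.com/",
    "http://irc.otherdomain.com/misc/tutorial.html",
    "http://xyz.mydomain.com/study/",
    "http://otherdomain.com/",
]

for root in unique_hosts(sample):
    print(root)
```

The set is only there to make the "is it already in the list" check cheap; the list keeps the order in which hosts were first seen. Note that host_via_split assumes an absolute URL: a relative link such as /misc/tutorial.html would give a wrong answer, which is why unique_hosts goes through urlsplit and skips anything without a host.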

©2003 - 2009 DaniWeb® LLC