Fetching unique URLs, domain names and subdomain names from a web page
Join Date: Feb 2008
Posts: 1
Hi All,
I'm a newbie here.
I am writing a program that lists all the URLs found on a web page. I got that part working with the following script:
    import urllib, urllister

    urladdr = raw_input("Enter URL here: ")
    usock = urllib.urlopen(urladdr)
    parser = urllister.URLLister()
    parser.feed(usock.read())
    usock.close()
    parser.close()
    for url in parser.urls:
        print url
However, I also want to reduce the output to unique entries and keep only the domain and subdomain names from the list. For example, if the output is:
    http://abc.mydomain.com/partners/
    http://xyz.mydomain.com/study/
    http://otherdomain.com/
    http://books.otherdomain.com/
    http://irc.otherdomain.com/misc/tutorial.html
    http://xyz.mydomain.com/study/
    http://otherdomain.com/
I want to filter the output so that it displays only the unique domain and subdomain names. For example:
    http://abc.mydomain.com/
    http://xyz.mydomain.com/
    http://otherdomain.com/
    http://books.otherdomain.com/
    http://irc.otherdomain.com/
Can anybody guide me with this?
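One possible way to do the filtering described above is sketched here, as an illustration rather than a definitive answer. It assumes Python 3, where URL splitting lives in `urllib.parse`. In Python 2, which the script above targets, the same `urlsplit` function is in the `urlparse` module. The helper name `unique_hosts` is made up for this example, and the sample list simply repeats the URLs from the question. In the original script you would pass `parser.urls` instead.

```python
from urllib.parse import urlsplit


def unique_hosts(urls):
    """Return 'scheme://host/' roots in first-seen order, without duplicates.

    Hypothetical helper for illustration; not part of urllib or urllister.
    """
    seen = set()
    roots = []
    for url in urls:
        parts = urlsplit(url)
        # Relative links and mailto: links have no scheme or no host; skip them.
        if not parts.scheme or not parts.netloc:
            continue
        # Host names are case-insensitive, so normalise before comparing.
        root = "%s://%s/" % (parts.scheme, parts.netloc.lower())
        if root not in seen:
            seen.add(root)
            roots.append(root)
    return roots


# Sample input copied from the question (assumed to match parser.urls).
sample = [
    "http://abc.mydomain.com/partners/",
    "http://xyz.mydomain.com/study/",
    "http://otherdomain.com/",
    "http://books.otherdomain.com/",
    "http://irc.otherdomain.com/misc/tutorial.html",
    "http://xyz.mydomain.com/study/",
    "http://otherdomain.com/",
]

for root in unique_hosts(sample):
    print(root)
```

A set tracks what has already been seen, while a separate list preserves the order in which hosts first appear. If alphabetical order is preferred, `sorted(unique_hosts(sample))` would work as well.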