I'm trying to extract the URLs from the text below, without the surrounding HTML tags.
Is there any way I can get just the parts starting with (http://) and ending with (")?

<a href="http://www.gumtree.sg/?ChangeLocation=Y" rel="nofollow">Singapore</a>, <a href="http://www.gumtree.com.au/?ChangeLocation=Y" rel="nofollow">Australia</a>, <a href="http://www.gumtree.co.nz/?ChangeLocation=Y" rel="nofollow">New Zealand</a>, <a href="http://www.gumtree.com" rel="nofollow">England</a>, <a href="http://edinburgh.gumtree.com" rel="nofollow">Scotland</a>, <a href="http://cardiff.gumtree.com" rel="nofollow">Wales</a>, <a href="http://www.gumtree.ie" rel="nofollow">Ireland</a>, <a

Just looking for the simplest way. Thanks, this community is excellent.
I'm using python 2.6.
(I'll store the html in a .txt file)


Use a parser; BeautifulSoup is a good one.

from BeautifulSoup import BeautifulSoup

html = '''\
<a href="http://www.gumtree.sg/?ChangeLocation=Y" rel="nofollow">Singapore</a>,
<a href="http://www.gumtree.com.au/?ChangeLocation=Y" rel="nofollow">Australia</a>,
<a href="http://www.gumtree.co.nz/?ChangeLocation=Y" rel="nofollow">New Zealand</a>,
<a href="http://www.gumtree.com" rel="nofollow">England</a>, <a href="http://edinburgh.gumtree.com" rel="nofollow">Scotland</a>,
<a href="http://cardiff.gumtree.com" rel="nofollow">Wales</a>,
<a href="http://www.gumtree.ie" rel="nofollow">Ireland</a>, <a>
'''

soup = BeautifulSoup(html)
links = soup.findAll('a', href=True) # find <a> with a defined href attribute
for link in links:
    print link['href']
    
''' output-->
http://www.gumtree.sg/?ChangeLocation=Y
http://www.gumtree.com.au/?ChangeLocation=Y
http://www.gumtree.co.nz/?ChangeLocation=Y
http://www.gumtree.com
http://edinburgh.gumtree.com
http://cardiff.gumtree.com
http://www.gumtree.ie
'''
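
If you really do want just the parts starting with http:// and ending with ", a regular expression can do that without any extra library. This is only a rough sketch: the names `URL_PATTERN` and `extract_urls` are made up for illustration, and it assumes every link is written as href="..." with double quotes, which is true of the sample but not of HTML in general. A parser is the safer choice for real pages.

```python
import re

html = '''\
<a href="http://www.gumtree.sg/?ChangeLocation=Y" rel="nofollow">Singapore</a>,
<a href="http://www.gumtree.com" rel="nofollow">England</a>,
<a href="http://www.gumtree.ie" rel="nofollow">Ireland</a>, <a>
'''

# Hypothetical pattern: capture the text after href=" that starts with
# http:// and runs up to (but not including) the next double quote.
URL_PATTERN = re.compile(r'href="(http://[^"]*)"')

def extract_urls(text):
    """Return every captured URL in the order it appears in text."""
    return URL_PATTERN.findall(text)

for url in extract_urls(html):
    print(url)
```

The single-argument print(url) form behaves the same on Python 2.6 and Python 3. Since you plan to keep the HTML in a .txt file, you would read the file into a string first and pass that string to `extract_urls`.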

Thank you, you guys are badass.
I'm going to use BeautifulSoup.

Thanks again

Again I can't say how much that helped.
Here's how I'm using it to get all the links.


import urllib2
from BeautifulSoup import BeautifulSoup

# Download the page and hand it to the parser
html = urllib2.urlopen('http://www.mylinkwenthere.com')
soup = BeautifulSoup(html)

# findAll('a') returns every <a> tag on the page
cost = soup.findAll('a')
for link in cost:
    print link  # the loop body must be indented

Can someone please explain the

for link in links:
    print link

part of the code?

I am trying to store each of the links found into a list but am unable to.

Thank you
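
For anyone landing here with the same question: `for link in links:` walks through the results one item at a time, and the indented body runs once per item with `link` bound to the current one. To keep the results instead of printing them, append each one to a list inside the loop. Here is a minimal sketch of that idea. Note that it swaps BeautifulSoup for the standard library's HTMLParser so it runs with nothing installed, and that `LinkCollector` and `collect_links` are hypothetical names, not part of any library.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

class LinkCollector(HTMLParser):
    """Hypothetical helper that stores the href of every <a> tag it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.urls = []  # the list that accumulates the links

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.urls.append(value)

def collect_links(html):
    """Parse html and return the list of href values found in <a> tags."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.urls

sample = ('<a href="http://www.gumtree.com" rel="nofollow">England</a>, '
          '<a href="http://www.gumtree.ie" rel="nofollow">Ireland</a>, <a>')

urls = collect_links(sample)
for url in urls:  # visit each stored link in turn
    print(url)
```

With BeautifulSoup the same pattern applies: create an empty list before the loop and call its append method with link['href'] in the loop body.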

Never mind, got it. Thanks a bunch.
