Hi all,

I want to extract a certain link from a web page using a Python regular expression.

The scenario is like this.

The code:

blah...
...
....
<div class="test" src="http://www.test.com/file.ext" style="top:0px;width:100%;"
....
blah
blah
blah

I want to extract the URL "http://www.test.com/file.ext" from the page using a Python regular expression.

Thanks in advance!

All 5 Replies

Read this.
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

So regex is not the right tool when it comes to HTML/XML.
There is a reason why parsers exist; Python has two very good ones, lxml and BeautifulSoup.

from bs4 import BeautifulSoup  # BeautifulSoup 4, the Python 3 package

html = """\
<div class="test" src="http://www.test.com/file.ext" style="top:0px;width:100%;">
"""

soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('div')  # first <div> in the document
print(tag['src'])
#--> http://www.test.com/file.ext

So in a larger page your search would be more specific, something like this.

from bs4 import BeautifulSoup  # BeautifulSoup 4, the Python 3 package

html = """\
<div class="test" src="http://www.test.com/file.ext" style="top:0px;width:100%;">
"""

soup = BeautifulSoup(html, 'html.parser')
# Returns a list of every <div> whose class is "test" (empty if none match)
tag = soup.find_all('div', {'class': 'test'})
print(tag[0]['src'])
#--> http://www.test.com/file.ext

Well, thanks for your suggestion, but in this case it's not working. I am getting an "IndexError: list index out of range" error. Maybe it's because I am trying it on a huge page. One more thing: this part of the HTML is inactive, meaning it sits between <!-- --> comment tags.
I would be very thankful if you could solve this with a regular expression that extracts the URL between <div class="test" src=" and " style="top:0px
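
One possible explanation for the IndexError, assuming the div really does sit inside a comment: an HTML parser hands commented-out markup over as a single comment, not as tags, so a search for div tags comes back empty. Here is a minimal sketch of that behaviour using the standard library's html.parser instead of the BeautifulSoup code above. The class name DivSrcCollector and the sample page are made up for illustration.

```python
from html.parser import HTMLParser


class DivSrcCollector(HTMLParser):
    """Hypothetical helper: records src of <div class="test"> tags, plus raw comments."""

    def __init__(self):
        super().__init__()
        self.srcs = []      # src values of real <div class="test"> tags
        self.comments = []  # text of every <!-- ... --> comment

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "test" and "src" in attrs:
            self.srcs.append(attrs["src"])

    def handle_comment(self, data):
        self.comments.append(data)


# Assumed sample page: the div is commented out, as described in the post.
page = """\
<p>blah</p>
<!-- <div class="test" src="http://www.test.com/file.ext" style="top:0px;width:100%;"> -->
<p>blah</p>
"""

collector = DivSrcCollector()
collector.feed(page)
collector.close()

print(collector.srcs)           # no div was ever seen as a tag
print(len(collector.comments))  # the markup arrived as comment text instead
```

If that is what is happening, the URL is still present in the comment text, so a plain text search over the raw page can reach it even though a tag search cannot.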

What have you tried? It looks like a simple match between the 'start' and 'end' tags.

Well, thanks for your suggestion, but in this case it's not working. I am getting an "IndexError: list index out of range" error. Maybe it's because I am trying it on a huge page. One more thing: this part of the HTML is inactive, meaning it sits between <!-- --> comment tags.

That may be because you are making an error; it's impossible to say without seeing some code.
Regex... no, but here is something you can look at.

>>> import re
>>> re.findall(r'class="test" src="(.*?)"', html)
['http://www.test.com/file.ext']
>>> ''.join(re.findall(r'class="test" src="(.*?)"', html))
'http://www.test.com/file.ext'
>>>
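
The session above can be wrapped in a small function for use in a script. This is only a sketch: the name extract_src and the sample page are assumptions, and the pattern is the same one used above, so it still depends on class and src appearing in exactly that order with that spacing. Because it searches the raw text, it also finds the div when it is inside a <!-- --> comment, and it returns None when nothing matches.

```python
import re

# Same pattern as in the interactive session; compiled once for reuse.
SRC_PATTERN = re.compile(r'class="test" src="(.*?)"')


def extract_src(page_text):
    """Return the first matching URL, or None when the pattern is absent."""
    match = SRC_PATTERN.search(page_text)
    return match.group(1) if match else None


# Assumed sample page with the div commented out.
sample_page = """\
<p>blah</p>
<!--
<div class="test" src="http://www.test.com/file.ext" style="top:0px;width:100%;">
-->
<p>blah</p>
"""

print(extract_src(sample_page))
print(extract_src("<p>no match here</p>"))
```

Checking for None before using the result avoids the kind of IndexError that comes from indexing into an empty result list.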