Python Regular Expression Help

Question

debasishgang7 0 Junior Poster in Training

12 Years Ago

Hi all,

I want to extract a certain link from a web page using a Python regular expression.

The scenario is like this:

The code:

blah...
...
....
<div class="test" src="http://www.test.com/file.ext" style="top:0px;width:100%;"
....
blah
blah
blah

I want to extract the URL "http://www.test.com/file.ext" from the page using a Python regular expression.

Thanks in advance!

python regex

snippsat 661 Master Poster

12 Years Ago

Read this.
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

So regex is not the right tool when it comes to HTML/XML.
There is a reason why parsers exist; Python has two very good ones, lxml and BeautifulSoup.

# Python 2 with BeautifulSoup 3
from BeautifulSoup import BeautifulSoup

html = """\
<div class="test" src="http://www.test.com/file.ext" style="top:0px;width:100%;">
"""

soup = BeautifulSoup(html)
tag = soup.find('div')  # first <div> in the document
print tag['src']
#--> http://www.test.com/file.ext

So in a larger page your search would be more specific, something like this.

# Python 2 with BeautifulSoup 3
from BeautifulSoup import BeautifulSoup

html = """\
<div class="test" src="http://www.test.com/file.ext" style="top:0px;width:100%;">
"""

soup = BeautifulSoup(html)
# findAll returns a list of every <div> whose class is "test"
tag = soup.findAll('div', {'class': 'test'})
print tag[0]['src']
#--> http://www.test.com/file.ext
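
For anyone on Python 3 without BeautifulSoup installed, roughly the same lookup can be sketched with the standard library's html.parser module. This is only an illustration and not the approach used in the snippets above; the names SrcCollector and find_src are made up for this example. It returns an empty list when nothing matches, so there is nothing to index into by mistake.

```python
from html.parser import HTMLParser


class SrcCollector(HTMLParser):
    """Hypothetical helper: collect src from every <div class="test"> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag != "div":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "test" in classes and attr_map.get("src"):
            self.sources.append(attr_map["src"])


def find_src(html):
    """Return a list of matching src values (empty if none are found)."""
    parser = SrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


html = '<div class="test" src="http://www.test.com/file.ext" style="top:0px;width:100%;"></div>'
print(find_src(html))
```

Note that a parser like this only sees real tags. Markup that sits inside an HTML comment is passed to handle_comment instead, so it would not show up in the result.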

TrustyTony 888 pyMod

12 Years Ago

What have you tried? It looks like a simple match between 'start and end tags'.

debasishgang7 0 Junior Poster in Training · Answer 1 · 2012-01-14T21:13:36+00:00

Well, thanks for your suggestion, but in this case it's not working. I am getting an "IndexError: list index out of range" error. Maybe it's because I am trying it on a huge page. One more thing: this part of the HTML code is inactive, meaning it's between  this tags.
I would be very thankful if you could solve this with a regular expression that extracts the URL between <div class="test" src=" and " style="top:0px

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 2 · 2012-01-14T22:29:44+00:00

Here is the solution.

snippsat 661 Master Poster · Answer 3 · 2012-01-15T02:02:27+00:00

Well, thanks for your suggestion, but in this case it's not working. I am getting an "IndexError: list index out of range" error. Maybe it's because I am trying it on a huge page. One more thing: this part of the HTML code is inactive, meaning it's between  this tags.

That may be because you are making an error; it's impossible to say without seeing some code.
Regex ... no, but here is something you can look at.

>>> import re
>>> re.findall(r'class="test" src="(.*?)"', html)
['http://www.test.com/file.ext']
>>> ''.join(re.findall(r'class="test" src="(.*?)"', html))
'http://www.test.com/file.ext'
>>>
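
A slightly more defensive variant of the session above, as a sketch only. It uses re.search, so a page with no match gives None instead of an empty list to index into (one possible cause of the IndexError mentioned earlier, though that can't be confirmed without the real code). The function name extract_src and the sample page are invented for illustration. The sample wraps the div in an HTML comment purely as an assumption about what "inactive" meant; a regex scans the raw text, so it does not matter whether the markup is commented out.

```python
import re

# Match the src value of a div carrying class="test";
# [^"]* stops at the first closing double quote.
SRC_PATTERN = re.compile(r'<div\s+class="test"\s+src="([^"]*)"')


def extract_src(page):
    """Return the first matching URL, or None if the pattern is absent."""
    match = SRC_PATTERN.search(page)
    return match.group(1) if match else None


# Hypothetical page: the div sits inside an HTML comment (an assumption).
page = '''
<html><body>
<!-- <div class="test" src="http://www.test.com/file.ext" style="top:0px;width:100%;"> -->
</body></html>
'''

print(extract_src(page))
print(extract_src("<p>nothing here</p>"))
```

The pattern is tied to this exact attribute order and quoting, so it will silently miss the tag if the page writes the attributes differently. That fragility is the usual argument for a parser when the markup is not under your control.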
