
Python Regular Expression Help

Hi all,

I want to extract a certain link from a web page using a Python regular expression.

The scenario is like this:

The code:

blah...
...
....
http://www.test.com/file.ext " style="top:0px;width:100%;"
....
blah
blah
blah

I want to extract the URL "http://www.test.com/file.ext" from the page using a Python regular expression.

Thanks in advance!

debasishgang7
Junior Poster in Training
91 posts since Oct 2009
Reputation Points: 10
Solved Threads: 0
 

Read this.
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

So regex is not the right tool when it comes to HTML/XML.
There is a reason why parsers exist; Python has two very good ones, lxml and BeautifulSoup.

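# BeautifulSoup 3 import; with the newer bs4 package use: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup

html = """\
<div class="test" src="http://www.test.com/file.ext" style="top:0px;width:100%;">
"""

soup = BeautifulSoup(html)
tag = soup.find('div')  # first <div> in the document
print(tag['src'])
#--> http://www.test.com/file.ext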

So in a larger page your search would be more specific, something like this.

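# BeautifulSoup 3 import; with the newer bs4 package use: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup

html = """\
<div class="test" src="http://www.test.com/file.ext" style="top:0px;width:100%;">
"""

soup = BeautifulSoup(html)
tag = soup.findAll('div', {'class': 'test'})  # list of every <div> whose class is "test"
print(tag[0]['src'])
#--> http://www.test.com/file.ext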
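
If installing a third-party parser is not an option, roughly the same lookup can be sketched with the standard library's html.parser. This is only a minimal Python 3 sketch, assuming the tag is a div with class "test" as in the sample above; the class name SrcCollector is made up for illustration.

```python
from html.parser import HTMLParser


class SrcCollector(HTMLParser):
    """Collect the src attribute of every <div class="test"> start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current start tag
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "div" and "test" in classes and attributes.get("src"):
            self.sources.append(attributes["src"])


html = '<div class="test" src="http://www.test.com/file.ext" style="top:0px;width:100%;"></div>'

collector = SrcCollector()
collector.feed(html)
collector.close()
print(collector.sources)
```

Because the matches are gathered in a list, an empty list simply means nothing matched, and that can be checked before indexing into it.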
snippsat
Practically a Posting Shark
808 posts since Aug 2008
Reputation Points: 353
Solved Threads: 294
 

Well, thanks for your suggestion, but in this case it's not working. I am getting an "IndexError: list index out of range" error. Maybe it's because I am trying it with a huge page. One more thing: this part of the HTML code is inactive, meaning it is between these tags.
I will be very thankful if you can solve this with a regular expression which will extract the URL between

debasishgang7
Junior Poster in Training
91 posts since Oct 2009
Reputation Points: 10
Solved Threads: 0
 

What have you tried? It looks like a simple match between 'start and end tags'.
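
A match between two markers could be sketched roughly as follows. The thread does not show which tags surround the link, so the HTML comment delimiters used here are purely an assumption for illustration, and between is a hypothetical helper name.

```python
import re


def between(text, start, end):
    """Return every shortest stretch of text found between start and end."""
    # re.escape makes the markers literal; (.*?) is non-greedy, and
    # re.DOTALL lets the match span line breaks.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    return re.findall(pattern, text, re.DOTALL)


# Hypothetical page fragment: the markers are an assumption for illustration.
page = 'blah <!-- <div src="http://www.test.com/file.ext"> --> blah'

for chunk in between(page, "<!--", "-->"):
    print(chunk.strip())
```

The function returns an empty list when either marker is missing, so the loop just does nothing in that case.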

pyTony
pyMod
Moderator
5,359 posts since Apr 2010
Reputation Points: 782
Solved Threads: 852
 
Gribouillis
Posting Maven
Moderator
2,786 posts since Jul 2008
Reputation Points: 1,044
Solved Threads: 691
 
Well, thanks for your suggestion, but in this case it's not working. I am getting an "IndexError: list index out of range" error. Maybe it's because I am trying it with a huge page. One more thing: this part of the HTML code is inactive, meaning it is between these tags.


That may be because you are making an error; it is impossible to say without seeing some code.
Regex... no, but here is something you can look at.

>>> import re
>>> re.findall(r'class="test" src="(.*?)"', html)
['http://www.test.com/file.ext']
>>> ''.join(re.findall(r'class="test" src="(.*?)"', html))
'http://www.test.com/file.ext'
>>>
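
Indexing a result that found nothing is what raises "IndexError: list index out of range", both with findAll(...)[0] and with re.findall(...)[0]. A hedged sketch of a guarded lookup in Python 3 follows; first_src is a hypothetical helper name and the sample strings are made up.

```python
import re

# Same idea as the pattern above, but tolerant of extra whitespace
# between the two attributes.
SRC_PATTERN = re.compile(r'class="test"\s+src="(.*?)"')


def first_src(html):
    """Return the first matching src value, or None when nothing matches."""
    match = SRC_PATTERN.search(html)
    return match.group(1) if match else None


print(first_src('<div class="test" src="http://www.test.com/file.ext">'))
print(first_src('<p>no matching tag here</p>'))
```

Note that a pattern like this depends on the attributes appearing in exactly this order, which is one more reason a parser is usually the safer choice.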
snippsat
Practically a Posting Shark
808 posts since Aug 2008
Reputation Points: 353
Solved Threads: 294
 
