Hi all,

I am trying to extract some text from an HTML page using regex.

<html>
...some code....
<B><FONT color="green">TEXT to BE EXTRACTED 1</FONT></B><br>
<P>
<B><FONT color="green">TEXT to BE EXTRACTED 2</FONT></B><br>
<P>
<B><FONT color="red">TEXT to BE EXTRACTED 3</FONT></B>
....some code....
</html>

I want to make a script which will print

TEXT to BE EXTRACTED 1
TEXT to BE EXTRACTED 2
TEXT to BE EXTRACTED 3

from the entire HTML page.

Thanks

All 3 Replies

Before the regex experts start giving their advice, here is the standard answer for HTML: don't. Use a parser such as the BeautifulSoup module instead.
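To illustrate why, here is a minimal sketch. The pattern is a hypothetical one written only for the exact markup in the question, and the variant line is an assumed example of equivalent HTML that a browser would render the same way:

```python
import re

# Hypothetical pattern tailored to the exact markup in the question.
pattern = re.compile(r'<FONT color="\w+">(.*?)</FONT>')

# A line copied from the question.
sample = '<B><FONT color="green">TEXT to BE EXTRACTED 1</FONT></B><br>'

# Assumed variant: lowercase tags, single quotes, and a line break in the text.
variant = "<b><font color='green'>TEXT to BE\nEXTRACTED 1</font></b><br>"

matches_sample = pattern.findall(sample)
matches_variant = pattern.findall(variant)

print(matches_sample)   # ['TEXT to BE EXTRACTED 1']
print(matches_variant)  # []
```

Flags such as re.IGNORECASE and re.DOTALL can patch individual cases like these, but every new variation in the markup needs another patch, which is the work a parser already does.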

As Tony posted, regex is not a good choice for HTML or XML.
This is why parsers exist to do this job.

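# Beautiful Soup 4 (pip install beautifulsoup4), Python 3 syntax
from bs4 import BeautifulSoup

html = '''\
<html>
...some code....
<B><FONT color="green">TEXT to BE EXTRACTED 1</FONT></B><br>
<P>
<B><FONT color="green">TEXT to BE EXTRACTED 2</FONT></B><br>
<P>
<B><FONT color="red">TEXT to BE EXTRACTED 3</FONT></B>
....some code....
</html>'''

# Name the parser explicitly; html.parser ships with Python
soup = BeautifulSoup(html, 'html.parser')
# Tag names are lowercased by the parser, so 'font' also matches <FONT>
tags = soup.find_all('font')
for item in tags:
    print(item.text)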

'''Output-->
TEXT to BE EXTRACTED 1
TEXT to BE EXTRACTED 2
TEXT to BE EXTRACTED 3
'''
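If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser module. This is only an illustration under assumptions: FontTextExtractor is a made-up class name, not part of any library, and it assumes the wanted text always sits inside a font element, as in the sample page.

```python
from html.parser import HTMLParser


class FontTextExtractor(HTMLParser):
    """Hypothetical helper: collects the text inside <font> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0     # number of <font> tags currently open
        self.texts = []    # one entry per outermost <font> element
        self._buf = []     # text pieces of the element being read

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <FONT> arrives as 'font'.
        if tag == 'font':
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'font' and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.texts.append(''.join(self._buf).strip())
                self._buf = []

    def handle_data(self, data):
        # Keep text only while inside a <font> element.
        if self.depth:
            self._buf.append(data)


html = '''\
<html>
...some code....
<B><FONT color="green">TEXT to BE EXTRACTED 1</FONT></B><br>
<P>
<B><FONT color="green">TEXT to BE EXTRACTED 2</FONT></B><br>
<P>
<B><FONT color="red">TEXT to BE EXTRACTED 3</FONT></B>
....some code....
</html>'''

parser = FontTextExtractor()
parser.feed(html)
parser.close()

for text in parser.texts:
    print(text)
```

It takes more code than the BeautifulSoup version, but it has no third-party dependency and copes with upper- or lowercase tags and either quoting style for attributes.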

Thanks, all!
