I'm writing code that should extract tags from HTML (I'm skipping the parts about parsing and so on). I'm testing it with a simple fixed string; however, it doesn't remove this <div> tag, and I have no idea why...

Thanks...

import re

RegExpression_Tags = r"<.*?>"

html = """
<div style="background: url(/groups/roundedcorners?
c=999999&bc=white&w=4&h=4&a=af) 0px 0px; width: 4px; height: 4px">
Ask how to use HTMLParser
</div>"""

p = re.compile(RegExpression_Tags)
NoTags = p.sub('', html)
print NoTags

This?

^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$

Thank you but it didn't work :confused:

I'm writing a code that should extract tags from an HTML code

When it comes to HTML, regex may not be the right tool.
Look into BeautifulSoup and lxml.

Read this answer by bobince about regex and HTML; it is one of the best answers I have read, and funny too.
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
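
To illustrate the parser-based route, here is a minimal sketch. It does not use BeautifulSoup or lxml; it swaps in the standard library's HTMLParser instead, so nothing has to be installed. The class name TextExtractor and its text() helper are made up for this example, and the code is written for Python 3 (in Python 2 the module is called HTMLParser rather than html.parser).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps the text between tags, drops the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside a tag.
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)


html = """\
<div style="background: url(/groups/roundedcorners?
c=999999&bc=white&w=4&h=4&a=af) 0px 0px; width: 4px; height: 4px">
Ask how to use HTMLParser
</div>"""

parser = TextExtractor()
parser.feed(html)
parser.close()  # flush any text the parser is still buffering
print(parser.text().strip())
```

Because the parser tokenizes the tag itself, the line break inside the style attribute should not matter here.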

Here is one way.

import re

html = """\
<div style="background: url(/groups/roundedcorners?
c=999999&bc=white&w=4&h=4&a=af) 0px 0px; width: 4px; height: 4px">
Ask how to use HTMLParser
</div>"""

test_match = re.sub(r'<.*?>|<\w{3}', '', html)
print test_match 

'''-->Out
 style="background: url(/groups/roundedcorners?
c=999999&bc=white&w=4&h=4&a=af) 0px 0px; width: 4px; height: 4px">
Ask how to use HTMLParser
'''
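
The output above still contains the style attribute because "." does not match a newline by default, so <.*?> cannot span the two-line opening tag. The sketch below shows two possible ways around that while staying with regex: the re.DOTALL flag, or a negated character class. The variable names are only for illustration, and print() is used so the snippet also runs on Python 3.

```python
import re

html = """\
<div style="background: url(/groups/roundedcorners?
c=999999&bc=white&w=4&h=4&a=af) 0px 0px; width: 4px; height: 4px">
Ask how to use HTMLParser
</div>"""

# Default behaviour: "." stops at the newline, so the opening tag survives.
no_flag = re.sub(r"<.*?>", "", html)

# With re.DOTALL, "." also matches "\n", so the two-line tag is removed.
with_flag = re.sub(r"<.*?>", "", html, flags=re.DOTALL)

# Alternative without a flag: "[^>]" matches any character except ">",
# newlines included.
negated = re.sub(r"<[^>]*>", "", html)

print(with_flag.strip())
```

This is still a regex approach, so it would likely break on input such as a ">" inside an attribute value, comments, or script blocks; a real parser is the safer choice for arbitrary pages.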

Thanks, snippsat.
Actually, I am using BeautifulSoup to parse the HTML pages. I tried traversing the parsed tree until I reached the leaves, i.e. the NavigableStrings, and extracting those. This sounds like the best thing to do, but it didn't work :( I couldn't find enough examples for NavigableStrings. That's why I went with option 2, which is regex.
