Help with RegEx

Question

kshw 3 Newbie Poster

14 Years Ago

I'm writing code that should extract tags from HTML (I'm skipping the parts about parsing and so on). I'm testing it with a simple fixed string; however, it doesn't remove this <div> tag and I have no idea why...

Thanks...

import re

RegExpression_Tags = r"<.*?>"

html = """
<div style="background: url(/groups/roundedcorners?
c=999999&bc=white&w=4&h=4&a=af) 0px 0px; width: 4px; height: 4px">
Ask how to use HTMLParser
</div>"""

p = re.compile(RegExpression_Tags)
NoTags = p.sub('', html)
print(NoTags)

python regex

snippsat 661 Master Poster

14 Years Ago

I'm writing code that should extract tags from HTML

When it comes to HTML, regex may not be the right tool.
Look into BeautifulSoup and lxml.

Read this answer by bobince about regex and HTML; it's one of the best answers I have read, and funny too.
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

Here is one way.

import re

html = """\
<div style="background: url(/groups/roundedcorners?
c=999999&bc=white&w=4&h=4&a=af) 0px 0px; width: 4px; height: 4px">
Ask how to use HTMLParser
</div>"""

test_match = re.sub(r'<.*?>|<\w{3}', '', html)
print(test_match)

'''-->Out
 style="background: url(/groups/roundedcorners?
c=999999&bc=white&w=4&h=4&a=af) 0px 0px; width: 4px; height: 4px">
Ask how to use HTMLParser
'''
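
The original pattern misses the opening <div> because, by default, "." does not match a newline, and that tag spans two lines. Below is a minimal sketch of two common workarounds on the same sample string: pass re.DOTALL, or use a negated character class such as <[^>]*>. The function names strip_tags_dotall and strip_tags_negated are made up for illustration, and neither version copes with a ">" inside an attribute value, a comment, or a script.

```python
import re

html = """\
<div style="background: url(/groups/roundedcorners?
c=999999&bc=white&w=4&h=4&a=af) 0px 0px; width: 4px; height: 4px">
Ask how to use HTMLParser
</div>"""

# Option 1: re.DOTALL lets "." match newlines, so "<.*?>" can span lines.
TAG_DOTALL = re.compile(r"<.*?>", re.DOTALL)

# Option 2: "[^>]" already matches newlines, so no flag is needed.
TAG_NEGATED = re.compile(r"<[^>]*>")


def strip_tags_dotall(text):
    """Remove tags using the lazy dot pattern with DOTALL (hypothetical helper)."""
    return TAG_DOTALL.sub("", text)


def strip_tags_negated(text):
    """Remove tags using the negated character class (hypothetical helper)."""
    return TAG_NEGATED.sub("", text)


print(strip_tags_dotall(html).strip())
print(strip_tags_negated(html).strip())
```

Both calls should leave only the text between the tags. This is still a quick fix for simple, well-behaved markup rather than a general HTML solution.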

ultimatebuster 14 Posting Whiz in Training · Answer 1 · 2010-07-19T20:09:25+00:00

This?

^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$

kshw 3 Newbie Poster · Answer 2 · 2010-07-19T20:13:59+00:00

This?
^<([a-z]+)([^<]+)*(?:>(.*)<\/\1>|\s+\/>)$

Thank you, but it didn't work.

kshw 3 Newbie Poster · Answer 3 · 2010-07-19T20:51:59+00:00

Thanks, snippsat.
Actually, I am using BeautifulSoup to parse the HTML pages. I tried traversing the parsed tree until I reached the leaves, i.e. the NavigableStrings, and extracting those. This sounds like the best approach, but it didn't work :( I couldn't find enough examples for NavigableStrings. That's why I went with option 2, which is regex.
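
The idea of collecting only the leaf text can also be sketched with the standard library's HTMLParser, which the sample string itself mentions. This is a minimal Python 3 sketch, not BeautifulSoup code; TextExtractor and extract_text are names made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes (the leaves) and ignore the tags around them."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)


def extract_text(markup):
    """Return the concatenated text content of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


html = """\
<div style="background: url(/groups/roundedcorners?
c=999999&bc=white&w=4&h=4&a=af) 0px 0px; width: 4px; height: 4px">
Ask how to use HTMLParser
</div>"""

print(extract_text(html).strip())
```

Because the parser tokenizes the tag itself, the newline inside the style attribute is not a problem here. In the current BeautifulSoup package (bs4), soup.get_text() should give a comparable result.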
