Extracting text from between tags

Question

daviddoria 334 Posting Virtuoso

14 Years Ago

I am trying to parse the content of a wiki page.

In a string like this:

==Heading1==
<test>
some text here
</test> 

==Heading2==
<test>
even more text
</test>

I need to obtain "Heading1", "some text here", "Heading2" and "even more text".

I got this to work:

import re
MyStr = "<test>some text here</test>"
m=re.compile('<test>(.*?)</test>').search(MyStr)
print m.group(1)

it produces "some text here".

But I tried this:

MyStr = "==some text here=="
m=re.compile('<test>(.*?)</test>').search(MyStr)
print m.group(1)

and it had an error.

I also tried this:

MyStr = "<test>some text here</test> <other> more text </other> <test> even more text</test>"
m=re.compile('<test>(.*?)</test>').search(MyStr)
print m.group(1)
print m.group(2)

and it had an error getting group(2) (which I was hoping was the second occurrence of the matching string?)

Can anyone point me in the right direction?

Thanks,

Dave

python


Gribouillis 1,391 Programming Explorer

14 Years Ago

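You should probably compile with the re.DOTALL option, because the dot character does not normally match a newline. Keep the non-greedy (.*?) as well; with re.DOTALL a greedy (.*) would run from the first <test> to the last </test>:

m=re.compile('<test>(.*?)</test>', re.DOTALL).search(MyStr)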
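To see what the flag changes, here is a small sketch in Python 3 syntax (the thread itself uses Python 2 print statements). The string is modelled on the wiki snippet from the question, and the variable names are made up for the example.

```python
import re

# Sample text modelled on the wiki snippet in the question (tags span several lines).
wiki_text = (
    "==Heading1==\n<test>\nsome text here\n</test>\n\n"
    "==Heading2==\n<test>\neven more text\n</test>\n"
)

pattern_plain = re.compile(r'<test>(.*?)</test>')
pattern_dotall = re.compile(r'<test>(.*?)</test>', re.DOTALL)

# Without DOTALL the dot cannot cross the newline after <test>, so nothing matches.
no_match = pattern_plain.search(wiki_text)

# With DOTALL the dot also matches newlines, so the first block is found.
first = pattern_dotall.search(wiki_text)
first_text = first.group(1).strip()

print(no_match)    # None
print(first_text)  # some text here
```

The strip() call is only there to drop the newlines that sit directly inside the tags.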


woooee 814 Nearly a Posting Maven

14 Years Ago

The "standard" way does not use regular expressions. Read the text line by line. When "<test>" is found, start appending the following lines to a list. When "</test>" is found, print or otherwise process the list, reset it to an empty list, and continue down the text.


daviddoria 334 Posting Virtuoso Featured Poster · Answer 1 · 2010-04-23T23:46:09+00:00

Sounds like that will definitely help down the road, but it still didn't work in this case:

>>> import re
>>> MyStr = "<test>some text here</test> <other> more text </other> <test> even more text</test>"
>>> m=re.compile('<test>(.*?)</test>', re.DOTALL).search(MyStr)
>>> print m.group(1)
some text here
>>> print m.group(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group

Dave

daviddoria 334 Posting Virtuoso Featured Poster · Answer 2 · 2010-04-24T01:23:49+00:00

You'll have to forgive me, I'm pretty new to Python. How would I traverse the string to find the starting tag?

Then, I can add things to a list:

list.append(6)

but how do I get characters/words/strings from the big string in a sequential fashion to add them to this list?

Thanks,

Dave

d5e5 109 Master Poster · Answer 3 · 2010-04-25T01:31:41+00:00

If you use findall instead of search you get a list of strings.

>>> import re
>>> MyStr = "<test>some text here</test> <other> more text </other> <test> even more text</test>"
>>> m=re.compile('<test>(.*?)</test>', re.DOTALL).findall(MyStr)
>>> print m
['some text here', ' even more text']
>>>


TrustyTony 888 ex-Moderator Team Colleague Featured Poster · Answer 4 · 2010-04-27T10:36:20+00:00

I would use a while loop with partition:

MyStr = "<test>some text here</test> <other> more text </other> <test> even more text</test>"

tlist = []
_,found,t = MyStr.partition('<test>')

while found:
    t,found,more = t.partition('</test>')
    if found:
        tlist.append(t) ## no assignment, just append
    else:
        raise ValueError("Missing end tag: " + t)
    
    _,found,t = more.partition('<test>')
    

print tlist
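For anyone on Python 3, the same partition loop can be wrapped in a function. This is only a sketch; the name between_tags and its parameters are made up for the example.

```python
def between_tags(text, open_tag='<test>', close_tag='</test>'):
    """Collect every substring enclosed by open_tag ... close_tag using str.partition."""
    results = []
    # partition returns (before, separator, after); separator is '' if not found.
    _, found, rest = text.partition(open_tag)
    while found:
        inner, found, rest = rest.partition(close_tag)
        if not found:
            raise ValueError("Missing end tag: " + inner)
        results.append(inner)
        _, found, rest = rest.partition(open_tag)
    return results


my_str = "<test>some text here</test> <other> more text </other> <test> even more text</test>"
found_texts = between_tags(my_str)
print(found_texts)  # ['some text here', ' even more text']
```

Because partition works on the whole string, this also handles tag contents that span several lines.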

snippsat 661 Master Poster · Answer 5 · 2010-04-27T11:29:08+00:00

When it comes to web pages ([X]HTML), regular expressions may not be the right tool.
Read bobince's famous answer on Stack Overflow:
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

For small, stable web pages, regular expressions can work fine.
Python has some really good tools for this, such as BeautifulSoup and lxml.

For a small wiki page, the solutions posted here by d5e5 and tonyjv can work fine.
Just to show one in BeautifulSoup:

import BeautifulSoup as bs

html = '''\
==Heading1==
<test>
some text here
</test>

==Heading2==
<test>
even more text
</test>
'''

soup = bs.BeautifulSoup(html)
divs = soup.findAll('test')
children = divs[0].contents
my_data = divs[0].string + divs[1].string
print my_data  #some text here even more text

BeautifulSoup can handle almost any web page, even one with a lot of bad HTML.

You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.
Neither does the BeautifulSoup parser.
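If installing BeautifulSoup is not an option, a similar idea can be sketched with the standard library's html.parser module in Python 3. Note that this is a different parser from BeautifulSoup, and the class name TagTextCollector is made up for the example.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside every <test> element (hypothetical helper)."""

    def __init__(self, tag='test'):
        super().__init__()
        self.tag = tag
        self.texts = []      # finished pieces of text, one per element
        self._buffer = None  # list of chunks while inside the tag, else None

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self._buffer = []

    def handle_data(self, data):
        # Text outside the tag (such as the ==headings==) is ignored.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.tag and self._buffer is not None:
            self.texts.append(''.join(self._buffer).strip())
            self._buffer = None


html = (
    "==Heading1==\n<test>\nsome text here\n</test>\n\n"
    "==Heading2==\n<test>\neven more text\n</test>\n"
)

collector = TagTextCollector()
collector.feed(html)
collector.close()
print(collector.texts)  # ['some text here', 'even more text']
```")
c2.close()
assert c2.texts == ['y']
</test>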

daviddoria 334 Posting Virtuoso Featured Poster · Answer 6 · 2010-04-27T17:39:02+00:00

Ok, interesting. I'll definitely look into XML parsing and using Beautiful Soup.

Thanks all,

Dave