I am trying to parse the content of a wiki page.

In a string like this:

==Heading1==
<test>
some text here
</test> 

==Heading2==
<test>
even more text
</test>

I need to obtain "Heading1", "some text here", "Heading2" and "even more text".

I got this to work:

import re
MyStr = "<test>some text here</test>"
m=re.compile('<test>(.*?)</test>').search(MyStr)
print m.group(1)

It produces "some text here".

But I tried this:

MyStr = "==some text here=="
m=re.compile('<test>(.*?)</test>').search(MyStr)
print m.group(1)

and it had an error.

I also tried this:

MyStr = "<test>some text here</test> <other> more text </other> <test> even more text</test>"
m=re.compile('<test>(.*?)</test>').search(MyStr)
print m.group(1)
print m.group(2)

and it had an error getting group(2) (which I was hoping was the second occurrence of the matching string?)

Can anyone point me in the right direction?

Thanks,

Dave

You should probably compile with the re.DOTALL option, because the dot character does not normally match a newline. Keep the non-greedy (.*?) so the match stops at the first closing tag:

m=re.compile('<test>(.*?)</test>', re.DOTALL).search(MyStr)
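
To illustrate the difference, here is a minimal sketch using a shortened copy of the wiki text from the first post (the variable names page, plain and dotall are only for illustration). It uses print() with a single argument, which should run on both Python 2 and Python 3.

```python
import re

# Shortened copy of the sample text, with the tags on their own lines.
page = "==Heading1==\n<test>\nsome text here\n</test>\n"

# Without re.DOTALL the dot stops at a newline, so nothing matches.
plain = re.search('<test>(.*?)</test>', page)
print(plain)

# With re.DOTALL the dot also matches newlines, so the block is found.
dotall = re.search('<test>(.*?)</test>', page, re.DOTALL)
print(repr(dotall.group(1)))
```

The captured text still carries the newlines that surround it, so calling strip() on it is probably what you want.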

Sounds like that will definitely help down the road, but it still didn't work in this case:

>>> import re
>>> MyStr = "<test>some text here</test> <other> more text </other> <test> even more text</test>"
>>> m=re.compile('<test>(.*?)</test>', re.DOTALL).search(MyStr)
>>> print m.group(1)
some text here
>>> print m.group(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group

Dave
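
A note on why group(2) fails: group numbers refer to the parenthesised groups inside the pattern, not to successive matches. The pattern has one pair of parentheses, so only group(1) exists. A minimal sketch, assuming finditer is an acceptable way to walk over every occurrence (the names pattern and texts are only illustrative):

```python
import re

MyStr = "<test>some text here</test> <other> more text </other> <test> even more text</test>"
pattern = re.compile('<test>(.*?)</test>', re.DOTALL)

# One pair of parentheses in the pattern means each match has only group(1).
# Every further occurrence is a separate match object, which finditer yields in turn.
texts = [m.group(1) for m in pattern.finditer(MyStr)]
print(texts)

# search() returns None when nothing matches, so test before calling group().
m = pattern.search("==some text here==")
if m is None:
    print("no <test> block found")
else:
    print(m.group(1))
```

The second half is likely also what went wrong with the "==some text here==" attempt: with no match, search() returns None, and calling group() on None raises an AttributeError.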

The "standard" way does not use regular expressions. Read the text line by line. When "<test>" is found, start appending the following lines to a list. When "</test>" is found, print the list (or do whatever you need with it), reset it to an empty list, and continue down the file.

You'll have to forgive me, I'm pretty new to Python. How would I traverse the string to find the starting tag?

Then, I can add things to a list:

list.append(6)

but how do I get characters/words/strings from the big string in a sequential fashion to add them to this list?

Thanks,

Dave
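
The line-by-line approach described above could look roughly like this. It is only a sketch, and it assumes each tag sits on a line of its own, as in the sample from the first post; the names page, blocks and current are made up for the example.

```python
page = """==Heading1==
<test>
some text here
</test>

==Heading2==
<test>
even more text
</test>
"""

blocks = []      # one list of lines per <test> block
current = None   # None while we are outside a block

for line in page.splitlines():
    stripped = line.strip()
    if stripped == '<test>':
        current = []                 # opening tag: start a fresh list
    elif stripped == '</test>':
        if current is not None:
            blocks.append(current)   # closing tag: keep what was collected
        current = None
    elif current is not None:
        current.append(stripped)     # inside a block: collect the line

print(blocks)
```

splitlines() is what hands you the big string one line at a time. If the tags can share a line with other text, this simple equality test will miss them.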

If you use findall instead of search you get a list of strings.

>>> import re
>>> MyStr = "<test>some text here</test> <other> more text </other> <test> even more text</test>"
>>> m=re.compile('<test>(.*?)</test>', re.DOTALL).findall(MyStr)
>>> print m
['some text here', ' even more text']
>>>

I would use a while loop with partition:

MyStr = "<test>some text here</test> <other> more text </other> <test> even more text</test>"

tlist = []
# skip everything up to and including the first opening tag
_, found, t = MyStr.partition('<test>')

while found:
    t, found, more = t.partition('</test>')
    if found:
        tlist.append(t)  # no assignment, just append
    else:
        raise ValueError("Missing end tag: " + t)

    # look for the next opening tag in the remainder
    _, found, t = more.partition('<test>')

print tlist

When it comes to web pages ([X]HTML), regular expressions may not be the right tool.
Read bobince's famous answer on Stack Overflow:
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

For small, stable web pages regular expressions can work fine.
Python has some really good tools for this, such as BeautifulSoup and lxml.

For a small wiki page the solutions posted here by d5e5 and tonyjv can work fine.
Just to show one in BeautifulSoup:

import BeautifulSoup as bs

html = '''\
==Heading1==
<test>
some text here
</test>

==Heading2==
<test>
even more text
</test>
'''

soup = bs.BeautifulSoup(html)
divs = soup.findAll('test')
# .string keeps the newlines around each block, so strip them before joining
my_data = divs[0].string.strip() + ' ' + divs[1].string.strip()
print my_data  # some text here even more text

BeautifulSoup can handle almost any web page, even one with a lot of bad HTML.

You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.

Neither does the BeautifulSoup parser.
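
If installing a third-party package is not an option, the standard library can probably handle a case this simple as well. Note that this swaps in a different technique from the BeautifulSoup code above: it uses html.parser from Python 3 (in Python 2 the module is named HTMLParser), and TestTagCollector is a made-up class name for this sketch.

```python
from html.parser import HTMLParser  # Python 3 module name

class TestTagCollector(HTMLParser):
    """Collect the text found inside each <test>...</test> pair."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.inside = False   # True while between <test> and </test>
        self.pieces = []      # text chunks of the block being read
        self.blocks = []      # finished blocks

    def handle_starttag(self, tag, attrs):
        if tag == 'test':
            self.inside = True
            self.pieces = []

    def handle_endtag(self, tag):
        if tag == 'test' and self.inside:
            self.blocks.append(''.join(self.pieces).strip())
            self.inside = False

    def handle_data(self, data):
        # Text outside <test> (such as the ==Heading== lines) is ignored here.
        if self.inside:
            self.pieces.append(data)

page = """==Heading1==
<test>
some text here
</test>

==Heading2==
<test>
even more text
</test>
"""

collector = TestTagCollector()
collector.feed(page)
collector.close()
print(collector.blocks)
```

It is more code than the BeautifulSoup version and less forgiving of badly broken markup, so treat it as a fallback.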

Ok, interesting. I'll definitely look into XML parsing and using Beautiful Soup.

Thanks all,

Dave
