Problem with regular expression

Question

theharshest 0 Newbie Poster

13 Years Ago

I am using the following code to extract the second name from HTML containing lines like this:

<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>

So I want to extract "Brittany" from the line above.

for line in f:
    match3 = re.search(r'$([a-zA-Z]+)(</td>)',line)
    if match3:
        print match3.group(1)

But this ain't working. Please help.

python regex


Gribouillis 1,391 Programming Explorer

13 Years Ago

You can find all the matches with finditer() and then select the last one

import re

line = '<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>'
matches = list(re.finditer(r'([a-zA-Z]+)(?:</td>)', line))

print "matches:", matches

name = matches[-1].group(1)

print "name:", name

""" my output --->
matches: [<_sre.SRE_Match object at 0x7fe2548059c0>, <_sre.SRE_Match object at 0x7fe254805b58>]
name: Brittany
"""

Edit: I agree with snippsat that using beautifulsoup is more robust


Gribouillis 1,391 Programming Explorer

13 Years Ago

theharshest wrote:
    Hi Gribouillis and snippsat, thanks a lot for your solutions.
    But I was more interested in finding the flaw in the re I have written.
    Because I was able to extract the first name successfully using the following -
    match2 = re.search(r'(<td>)([a-zA-Z]+)',line)
    Then why isn't it working in the second name case, as $ would look for the pattern from the end?

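That is not true: $ only matches at the end of the string (or just before a trailing newline); it does not make the search start from the end. The re module has no way to look for a pattern from the end. In this case, you could use a devilish trick: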

re.search(r">dt/<([a-zA-Z]+)", line[::-1]).group(1)[::-1]

Or, if only whitespace follows the last </td>, you can anchor the pattern to the end of the line like this:

re.search(r'([a-zA-Z]+)</td>\s*$',line)
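The three patterns discussed so far can be compared side by side. This is a minimal sketch, assuming Python 3 (unlike the Python 2 snippets in the thread); the variable names are only illustrative.

```python
import re

line = '<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>'

# A leading $ asserts "end of string" first, so no letters can follow it.
leading = re.search(r'$([a-zA-Z]+)(</td>)', line)

# Reversed-string trick: search the reversed line, then reverse the capture.
reversed_name = re.search(r'>dt/<([a-zA-Z]+)', line[::-1]).group(1)[::-1]

# Anchor at the end instead, allowing trailing whitespace.
anchored = re.search(r'([a-zA-Z]+)</td>\s*$', line)
anchored_name = anchored.group(1) if anchored else None

print(leading)        # None
print(reversed_name)  # Brittany
print(anchored_name)  # Brittany
```

The original pattern fails because nothing can follow the position that $ asserts, not because the search direction is wrong.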


Gribouillis 1,391 Programming Explorer

13 Years Ago

theharshest wrote:
    Hey Gribouillis,
    I finally got it (from your second suggestion). Please correct me if I am wrong.
    I was putting $ at the front of the re instead of at the end. And I still think that $ looks for the pattern in a string from the end.
    match3 = re.search(r'([a-zA-Z]+)(</td>)$',line)
    The code above works perfectly! :)
    Thanks!

It works, but it does not mean that $ searches from the end. It means that in the line, the endtag </td> is immediately followed by the end of the line.
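A small sketch of what "immediately followed by the end of the line" means in practice, assuming Python 3 and two made-up sample lines. Lines read from a file usually keep their trailing newline, and $ also matches just before that final newline, which is why the pattern works inside a `for line in f` loop.

```python
import re

pattern = r'([a-zA-Z]+)(</td>)$'

# Lines read from a file usually keep their trailing "\n".
with_newline = '<td>Matthew</td><td>Brittany</td>\n'
# A trailing space before the newline is enough to defeat the anchor.
with_space = '<td>Matthew</td><td>Brittany</td> \n'

m1 = re.search(pattern, with_newline)
m2 = re.search(pattern, with_space)

print(m1.group(1) if m1 else None)  # Brittany
print(m2)                           # None
```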


snippsat 661 Master Poster · Answer 1 · 2011-08-02T03:39:15+00:00

The problem here is not your regex as such: regex is the wrong tool when it comes to HTML/XML.
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
Two good parsers for Python are lxml and BeautifulSoup.

from BeautifulSoup import BeautifulSoup

html = '''\
<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>'''

soup = BeautifulSoup(html)
tag = soup.findAll('td')
print tag[2].string #Brittany
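
For readers without BeautifulSoup installed, the same parser-based idea can be sketched with the standard library's html.parser. This is a swapped-in parser, not the one used above; it assumes Python 3, and CellCollector is a hypothetical helper name.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text of every <td> cell (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self._in_td = True
            self.cells.append('')  # start a new cell

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1] += data  # text inside the current cell


html = '<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>'
parser = CellCollector()
parser.feed(html)
parser.close()
print(parser.cells[2])  # Brittany
```

Because the parser tracks tags rather than raw characters, it should keep working when cells contain attributes or extra whitespace that would trip up a hand-written pattern.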

theharshest 0 Newbie Poster · Answer 2 · 2011-08-02T03:52:05+00:00

Hi Gribouillis and snippsat, thanks a lot for your solutions.
But I was more interested in finding the flaw in the re I have written, as I am still learning and currently concentrating on re.
Because I was able to extract the first name successfully using the following -

match2 = re.search(r'(<td>)([a-zA-Z]+)',line)

Then why isn't it working in the second name case, as $ would look for the pattern from the end?

theharshest 0 Newbie Poster · Answer 3 · 2011-08-02T04:12:52+00:00

Hey Gribouillis,

I finally got it (from your second suggestion). Please correct me if I am wrong.

I was putting $ at the front of the re instead of at the end. And I still think that $ looks for the pattern in a string from the end.

match3 = re.search(r'([a-zA-Z]+)(</td>)$',line)

The code above works perfectly! :)

Thanks!

theharshest 0 Newbie Poster · Answer 4 · 2011-08-02T04:23:08+00:00

Gribouillis wrote:
    It works, but it does not mean that $ searches from the end. It means that in the line, the endtag </td> is immediately followed by the end of the line.

I am not arguing, I just want to get my doubt clarified. I can quote the following directly from Google's Python class -

^ = start, $ = end -- match the start or end of the string

It clearly says that if we use $ in any re then matching is from the end of the string.

snippsat 661 Master Poster · Answer 5 · 2011-08-02T04:31:22+00:00

Here is one from me too, but remember that regex and HTML/XML tags are a difficult combination.
When parsing a website with hundreds of <td> tags, it will break down.
That's why we have parsers and do not use regex for this.
But for practice this can be fun.

>>> import re
>>> s = '<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>'
>>> r = [match.group(1) for match in re.finditer(r"td>(\w+)", s)]
>>> r
['3', 'Matthew', 'Brittany']
>>> r[1]
'Matthew'
>>>

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 6 · 2011-08-02T04:33:59+00:00

theharshest wrote:
    I am not arguing, I just want to get my doubt clarified. I can quote the following directly from Google's Python class -
    ^ = start, $ = end -- match the start or end of the string
    It clearly says that if we use $ in any re then matching is from the end of the string.

No. Your search wouldn't match the line

line = '<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td> hello world'

for example, because the pattern r'([a-zA-Z]+)(</td>)$' requires that </td> be immediately followed by the end of the line. That's also why, on your original line, the search doesn't see Matthew: its </td> is not at the end of the line. Regular expression searches always go from left to right.
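
The point can be checked directly. This is a minimal sketch, assuming Python 3; it reuses the line above and adds an unanchored search and a findall call for comparison.

```python
import re

line = '<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td> hello world'

# With the $ anchor, no </td> sits at the end of the line, so nothing matches.
anchored = re.search(r'([a-zA-Z]+)(</td>)$', line)

# Without the anchor, the scan runs left to right and stops at the first hit.
first = re.search(r'([a-zA-Z]+)(</td>)', line)

# To get the last cell, collect every hit and take the final one.
last = re.findall(r'([a-zA-Z]+)</td>', line)[-1]

print(anchored)        # None
print(first.group(1))  # Matthew
print(last)            # Brittany
```

So $ restricts where a match may end; it never changes the left-to-right direction of the scan.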

Gribouillis 1,391 Programming Explorer Team Colleague · Answer 7 · 2011-08-02T13:24:27+00:00

Here is a screenshot of Kodos, the Python regex debugger, running with Python 2.6 on your example.
