I am using the following code to extract the second name from HTML that contains lines like this:

<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>

So I want to extract "Brittany" from the line above.

for line in f:
    match3 = re.search(r'$([a-zA-Z]+)(</td>)', line)
    if match3:
        print match3.group(1)

But this ain't working. Please help.

You can find all the matches with finditer() and then select the last one:

import re

line = '<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>'
matches = list(re.finditer(r'([a-zA-Z]+)(?:</td>)', line))

print "matches:", matches

name = matches[-1].group(1)

print "name:", name

""" my output --->
matches: [<_sre.SRE_Match object at 0x7fe2548059c0>, <_sre.SRE_Match object at 0x7fe254805b58>]
name: Brittany
"""

Edit: I agree with snippsat that using BeautifulSoup is more robust.

Edited 5 Years Ago by Gribouillis: n/a

The problem here is not your regex as such; regex is simply the wrong tool when it comes to HTML/XML.
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
Two good parsers for Python are lxml and BeautifulSoup.

from BeautifulSoup import BeautifulSoup

html = '''\
<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>'''

soup = BeautifulSoup(html)
tag = soup.findAll('td')
print tag[2].string #Brittany
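For anyone who cannot install a third-party parser, here is a rough sketch of the same idea using only the standard library. Note that this swaps BeautifulSoup for html.parser.HTMLParser and is written for Python 3; the CellCollector class and its attribute names are made up for this illustration.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Hypothetical helper: collects the text of every <td> cell."""

    def __init__(self):
        super().__init__()
        self.in_cell = False   # True while we are between <td> and </td>
        self.cells = []        # one string per cell, in document order

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_cell = True
            self.cells.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False

    def handle_data(self, data):
        # Text can arrive in several pieces, so append to the current cell.
        if self.in_cell:
            self.cells[-1] += data


html = '<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>'

parser = CellCollector()
parser.feed(html)
parser.close()

print(parser.cells[2])  # Brittany
```

Like the BeautifulSoup version, this picks the cell by position rather than by pattern, so it should not care what comes after the last </td>.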

Edited 5 Years Ago by snippsat: n/a

Hi Gribouillis and snippsat, thanks a lot for your solutions.
But I am more interested in finding the flaw in the regex I wrote, since I am still learning and currently concentrating on re.
I was able to extract the first name successfully using the following:

match2 = re.search(r'(<td>)([a-zA-Z]+)',line)

So why doesn't it work for the second name, given that $ would look for the pattern from the end?

Edited 5 Years Ago by theharshest: n/a

That is not true: $ matches the end of the string. There is no way to look for a pattern from the end. In this case, you could use a devilish trick and reverse both the line and the pattern:

re.search(r">dt/<([a-zA-Z]+)", line[::-1]).group(1)[::-1]

or, if there is only whitespace after the last </td>, you could anchor the pattern to the end of the line like this:

re.search(r'([a-zA-Z]+)</td>\s*$',line)
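To make the difference concrete, here is a small sketch (Python 3 syntax, so print is a function) comparing the original pattern with $ in front against a pattern with $ at the end. The variable names leading and trailing are only for illustration, and the sample line ends with a newline on the assumption that it was read from a file.

```python
import re

# A line as it would come out of "for line in f", trailing newline included.
line = '<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>\n'

# '$' first: the pattern demands the end of the string and THEN letters,
# but nothing can follow the end, so this can never match.
leading = re.search(r'$([a-zA-Z]+)(</td>)', line)

# '$' last: letters, then </td>, then the end of the string.
# In Python, '$' also matches just before a single trailing newline.
trailing = re.search(r'([a-zA-Z]+)(</td>)$', line)

print(leading)            # None
print(trailing.group(1))  # Brittany
```

So $ is a condition on the position where it appears in the pattern; it does not change where the search starts.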

Edited 5 Years Ago by Gribouillis: n/a

Hey Gribouillis,

I finally got it (from your second suggestion). Please correct me if I am wrong.

I was putting $ at the front of the regex instead of at the end. And I still think that $ looks for the pattern in a string from the end.

match3 = re.search(r'([a-zA-Z]+)(</td>)$',line)

The above code works perfectly! :)

Thanks!

It works, but that does not mean that $ searches from the end. It means that in this line, the end tag </td> is immediately followed by the end of the line.

I am not arguing, I just want to clear up my doubt, because I can quote the following directly from Google's Python class:

^ = start, $ = end -- match the start or end of the string

It clearly says that if we use $ in any re then matching is from the end of string.

One from me too, but remember that regex and HTML/XML tags are a difficult combination.
When parsing a website with hundreds of <td> tags, this will break down.
That's why we have parsers and don't use regex for this.
But for practice this can be fun.

>>> import re
>>> s = '<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>'
>>> r = [match.group(1) for match in re.finditer(r"td>(\w+)", s)]
>>> r
['3', 'Matthew', 'Brittany']
>>> r[1]
'Matthew'
>>>

Edited 5 Years Ago by snippsat: n/a

No, $ does not make the matching start from the end of the string. Your search wouldn't match the line

line = '<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td> hello world'

for example, because the pattern r'([a-zA-Z]+)(</td>)$' requires that </td> is recognised only if it is immediately followed by the end of the line. That is also why the search doesn't pick up Matthew in your original line: its </td> is not at the end. Regular expression searches always go from left to right.
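Here is a short sketch of that behaviour, again in Python 3 syntax. The names plain, extra and cells are only for this illustration, and the findall() call at the end is one possible way to get the last name without relying on $; it is not the only one.

```python
import re

pattern = re.compile(r'([a-zA-Z]+)</td>$')

plain = '<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>'
extra = plain + ' hello world'

# The scan still runs left to right; '$' only rejects every candidate
# whose </td> is not at the very end, so Matthew is skipped.
m = pattern.search(plain)
print(m.group(1))  # Brittany

# With text after the last </td>, no candidate satisfies '$' at all.
print(pattern.search(extra))  # None

# Without '$': collect every alphabetic cell and take the last one.
cells = re.findall(r'<td>([a-zA-Z]+)</td>', extra)
print(cells[-1])  # Brittany
```

The last approach keeps working when something follows the final </td>, because it does not depend on the end of the line.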

Edited 5 Years Ago by Gribouillis: n/a
