0

I am using the following code to extract the second name from an HTML file that contains lines like this:

<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>

So, I want to extract "Brittany" from the above line.

for line in f:
    match3 = re.search(r'$([a-zA-Z]+)(</td>)', line)
    if match3:
        print match3.group(1)

But this isn't working. Please help.


0

You can find all the matches with finditer() and then select the last one:

import re

line = '<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>'
matches = list(re.finditer(r'([a-zA-Z]+)(?:</td>)', line))

print "matches:", matches

name = matches[-1].group(1)

print "name:", name

""" my output --->
matches: [<_sre.SRE_Match object at 0x7fe2548059c0>, <_sre.SRE_Match object at 0x7fe254805b58>]
name: Brittany
"""
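For comparison, the same idea can also be written with re.findall(), which returns the captured strings directly when the pattern has exactly one capturing group. This is only a minimal sketch in Python 3 syntax (the code in this thread is Python 2); the names `names` and `last_name` are made up for illustration.

```python
import re

line = '<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>'

# With a single capturing group, findall() returns a list of the captured strings.
# The cell containing "3" is skipped because the group only accepts letters.
names = re.findall(r'([a-zA-Z]+)</td>', line)

# Take the last captured cell, guarding against lines with no match at all.
last_name = names[-1] if names else None

print("names:", names)
print("last name:", last_name)
```

Like the finditer() version, this relies on the wanted cell being the last alphabetic cell on the line.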

Edit: I agree with snippsat that using BeautifulSoup is more robust.

Edited by Gribouillis: n/a

0

The problem here is not your particular regex: regex is simply the wrong tool when it comes to HTML/XML.
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
Two good parsers for Python are lxml and BeautifulSoup.

from BeautifulSoup import BeautifulSoup

html = '''\
<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>'''

soup = BeautifulSoup(html)
tag = soup.findAll('td')
print tag[2].string #Brittany
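If installing a third-party parser is not an option, the same parser-based approach can be sketched with the standard library's html.parser module. This is an illustration in Python 3 under stated assumptions, not a replacement for lxml or BeautifulSoup: the class name `CellCollector` and its attributes are hypothetical, and it assumes the cells contain no nested <td> tags.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Hypothetical helper: collects the text of every <td> cell it sees."""

    def __init__(self):
        super().__init__()
        self.in_td = False   # True while we are between <td> and </td>
        self.cells = []      # one string per <td> cell, in document order

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True
            self.cells.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        # Text outside a <td> cell is ignored.
        if self.in_td:
            self.cells[-1] += data


html = '<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>'

collector = CellCollector()
collector.feed(html)
collector.close()

print(collector.cells[2])
```

The cells are selected by position rather than by pattern, so the result does not depend on what follows the last </td> on the line.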

Edited by snippsat: n/a

0

Hi Gribouillis and snippsat, thanks a lot for your solutions.
But I was more interested in finding the flaw in the regex I have written, as I am still learning and currently concentrating on re.
I was able to extract the first name successfully using the following:

match2 = re.search(r'(<td>)([a-zA-Z]+)',line)

Then why doesn't it work for the second name, given that $ should look for the pattern from the end?

Edited by theharshest: n/a

1

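That is not true: $ only matches at the end of the string; it does not make the search start from the end. There is no way to look for a pattern from the end. In this case, you could use a devilish trick and reverse both the pattern and the line: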

re.search(r">dt/<([a-zA-Z]+)", line[::-1]).group(1)[::-1]

Or, if there is only whitespace after the last </td>, for example, you could anchor the pattern to the end of the line like this:

re.search(r'([a-zA-Z]+)</td>\s*$',line)
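To make the role of $ concrete, here is a small sketch (Python 3 syntax, with illustrative variable names `at_front` and `at_end`) comparing the original pattern, where $ comes first, with the pattern above, where it comes last.

```python
import re

line = '<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>'

# '$' first: the pattern asks for letters *after* the end of the string,
# which can never happen, so search() finds nothing.
at_front = re.search(r'$([a-zA-Z]+)(</td>)', line)

# '$' last: the pattern asks for letters, then </td>, then optional
# whitespace, then the end of the string.
at_end = re.search(r'([a-zA-Z]+)</td>\s*$', line)

print(at_front)
print(at_end.group(1))
```

The first search prints None and the second prints Brittany: $ is a condition on where the match must stop, not an instruction about where the search begins.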

Edited by Gribouillis: n/a

0

Hey Gribouillis,

I finally got it (from your second suggestion). Please correct me if I am wrong.

I was putting $ at the front of the regex instead of at the end. And I still think that $ looks for the pattern in a string from the end.

match3 = re.search(r'([a-zA-Z]+)(</td>)$',line)

The above code works perfectly! :)

Thanks!

0

It works, but that does not mean that $ searches from the end. It means that, in this line, the end tag </td> is immediately followed by the end of the line.

0

I am not arguing, but I want to get my doubt clarified, as I can quote the following directly from Google's Python class:

^ = start, $ = end -- match the start or end of the string

It clearly says that if we use $ in any regex, then matching is from the end of the string.

0

One from me too, but remember that combining regex and HTML/XML tags is difficult.
When parsing a website with hundreds of <td> tags, it will break down.
That's why we have parsers and do not use regex for this.
But for practice this can be fun.

>>> import re
>>> s = '<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>'
>>> r = [match.group(1) for match in re.finditer(r"td>(\w+)", s)]
>>> r
['3', 'Matthew', 'Brittany']
>>> r[1]
'Matthew'
>>>

Edited by snippsat: n/a

0

No. Your search wouldn't match the line

line = '<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td> hello world'

for example, because the pattern r'([a-zA-Z]+)(</td>)$' accepts </td> only when it is immediately followed by the end of the line. That is also why, on your original line, the search skips Matthew: his </td> is not at the end of the line. Regular expression searches always go from left to right.
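A quick way to check this is to run the same line through the pattern with and without the $ anchor, and then with trailing text. This is a sketch in Python 3 syntax; the variable names `leftmost`, `anchored` and `trailing` are only for illustration.

```python
import re

line = '<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>'

# Without '$', search() scans left to right and stops at the first
# place the pattern fits, which is the Matthew cell.
leftmost = re.search(r'([a-zA-Z]+)(</td>)', line)

# With '$', the Matthew cell is rejected because its </td> is not at the
# end of the line, so the scan continues until it reaches Brittany.
anchored = re.search(r'([a-zA-Z]+)(</td>)$', line)

# With trailing text, no </td> sits at the end of the line any more,
# so the anchored pattern matches nothing.
trailing = re.search(r'([a-zA-Z]+)(</td>)$', line + ' hello world')

print(leftmost.group(1))
print(anchored.group(1))
print(trailing)
```

This prints Matthew, Brittany and None, in that order. The scan direction is the same in all three cases; only the extra end-of-line condition changes which candidate is accepted.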

Edited by Gribouillis: n/a
