I am trying to extract three values from the td tags in a downloaded HTML file.

<tr align="right"><td>236</td><td>Roy</td><td>Allyson</td>
<tr align="right"><td>237</td><td>Marvin</td><td>Pamela</td>
<tr align="right"><td>238</td><td>Micah</td><td>Kristine</td>
<tr align="right"><td>239</td><td>Collin</td><td>Raquel</td>

I am using the pattern match = re.findall(r'<td.?>([\d+])([.?])*<\/td>', file)

The variable file holds the contents of the downloaded file, obtained with read().

The output should look like

(236, "Roy", "Allyson")
(237, "Marvin", "Pamela")
(238, "Micah", "Kristine")
(239, "Collin", "Raquel")

What I get is

(236, "")
(237, "")
(238, "")
(239, "")

I've tried different variations of the same pattern and get

('236', '23', '6')
('Roy', '', 'Roy')
('Allyson', '', 'Allyson')
('237', '23', '7')
('Marvin', '', 'Marvin')
('Pamela', '', 'Pamela')
('238', '23', '8')
('Micah', '', 'Micah')
('Kristine', '', 'Kristine')
('239', '23', '9')
('Collin', '', 'Collin')
('Raquel', '', 'Raquel')

I'm relatively new to regular expressions, so be gentle, but any help would
be appreciated.

PS: I'm using Python

All 5 Replies

The trick is to use lazy matching, which matches the shortest possible string.

html = '<tr align="right"><td>236</td><td>Roy</td><td>Allyson</td>'
pat = '<td>(.+?)</td>'

then

re.split(pat,html)

returns

['<tr align="right">', '236', '', 'Roy', '', 'Allyson', '']

and

re.split(pat,html)[1::2]

returns

['236', 'Roy', 'Allyson']
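
To see why the lazy quantifier matters, here is a minimal sketch that runs a greedy and a lazy version of the same pattern on the sample row. It uses re.findall rather than re.split, which is only an alternative that avoids the slicing step; the variable names greedy and lazy are made up for illustration.

```python
import re

html = '<tr align="right"><td>236</td><td>Roy</td><td>Allyson</td>'

# Greedy: .+ grabs as much as it can, so the match runs to the LAST </td>.
greedy = re.findall(r'<td>(.+)</td>', html)

# Lazy: .+? grabs as little as it can, so each match stops at the FIRST </td>.
lazy = re.findall(r'<td>(.+?)</td>', html)

print(greedy)  # ['236</td><td>Roy</td><td>Allyson']
print(lazy)    # ['236', 'Roy', 'Allyson']
```

With a single capturing group, re.findall returns a flat list of the captured strings, so no [1::2] slice is needed.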
Comment: Hey Jim, thanks for the advice. I did finally get the results I was looking for using the pattern '<td>(\d+)+<\/td><td>(\w+)<\/td><td>(\w+)'.

Sidenote: If you want to learn, understand and experiment with regexes I can highly recommend RegexBuddy.

Question has been answered.

The correct pattern is:

matches = re.findall(r'<td>(\d+)</td><td>(\w+)</td><td>(\w+)', file)
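
As a rough sketch of how this produces the tuples asked for in the question, the example below applies a three-group pattern of the same shape (written with a single + after \d and unescaped slashes) to two of the sample rows. The names text, pattern, and rows are assumptions for illustration; in the original setup the string would come from read().

```python
import re

# Two sample rows from the question; normally this would be the result of read().
text = '''<tr align="right"><td>236</td><td>Roy</td><td>Allyson</td>
<tr align="right"><td>237</td><td>Marvin</td><td>Pamela</td>'''

# Three capturing groups: the number, then the two names.
pattern = r'<td>(\d+)</td><td>(\w+)</td><td>(\w+)'

# With several groups, findall returns one tuple of strings per match;
# the first field is converted to int to match the desired output.
rows = [(int(num), first, second)
        for num, first, second in re.findall(pattern, text)]

for row in rows:
    print(row)
# (236, 'Roy', 'Allyson')
# (237, 'Marvin', 'Pamela')
```

Note that \w+ only covers letters, digits, and underscores, so a hyphenated or spaced name would need a broader class such as [^<]+.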
