newbie: extracting parts of a reg ex match

Question

ChrisP_Buffalo 0 Newbie Poster

16 Years Ago

I'm trying to write my first web scraper in Python, using simple regular expressions to match the info I want to extract. (I realize BeautifulSoup is available, but I'm not ready to use it yet, so I want to figure out regular expressions first.)

I want to extract information about tennis players and their rankings from a table on a tennis site. The rank of each player is contained within lines like the ones below:

So, I wrote a regular expression that matches on ">[\d+]</div>", which I thought would output all of the ranks as a list of numbers like this: . I'll associate these ranks with their respective players later as I develop the code more, but right now the brackets [] are not working the way I thought they should in Python (as specifying the set of characters to match).

import re
import urllib

f = urllib.urlopen("http://www.atptennis.com/3/en/rankings/entrysystem/")
tennis_rankings = f.read()

tennis_players = re.compile(">[\d+]</div>", re.I | re.S | re.M)
find_result = tennis_players.findall(tennis_rankings)

print find_result

The code above outputs this:

There are two problems I need to solve: 1) this outputs the whole match, not just the number, and 2) it matches only numbers 1-9, not 10 or above. If I delete the brackets [] from the above code, it matches all 100 players listed, but it still returns the whole ">3</div>" string.

Any help would be appreciated.

python


All 2 Replies

jlm699 320 Veteran Poster

16 Years Ago

[] designates a set of characters to match. You had [\d+] in there, meaning that you wanted to match any one of the digits 0-9 or the plus symbol. An equivalent statement would be [0-9+], or, typing the entire set of digits out, [0123456789+]. Placing a * after your set of matchable characters means match it zero or more times, as many as possible. So the * solves the "only matches single digit" problem. The () are what give you the ability to extract a specific part of the text out of the matched text. Anything that is enclosed in parentheses can be extracted after the pattern is matched.

P.S. If you meant to use '+' as "match 1 or more times", then it would have to be outside of the square brackets. Read up on regular expression syntax for more info.
HTH
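
To make the points above concrete, here is a minimal sketch in Python 3 syntax (the thread's code is Python 2). The sample string is made up, since the real page's HTML is not shown in this thread, and the variable names are only illustrative.

```python
import re

# Hypothetical sample markup; the actual rankings page HTML is not shown here.
sample = '<div class="rank">3</div><div class="rank">10</div><div class="rank">100</div>'

# Original pattern: the set [\d+] matches exactly ONE character (a digit or '+'),
# so only single-digit ranks are found, and the whole match is returned.
whole_single = re.findall(r">[\d+]</div>", sample)

# Quantifier outside the set: one or more digits, but still the whole match.
whole_multi = re.findall(r">\d+</div>", sample)

# With one capturing group, findall returns only the text inside the ().
ranks = re.findall(r">(\d+)</div>", sample)

# The captured values are strings; convert them if numbers are needed.
rank_numbers = [int(r) for r in ranks]

print(whole_single)
print(whole_multi)
print(ranks)
print(rank_numbers)
```

The last pattern, ">(\d+)</div>", is the "+ outside the square brackets" form mentioned in the P.S., combined with parentheses around the part to extract.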


ChrisP_Buffalo 0 Newbie Poster · Answer 1 · 2008-07-23T23:30:29+00:00

I seem to have found a solution. It appears ">([\d+]*)</div>" gets the job done. It's not clear to me why the combination of ([]*) makes this work, but I found the solution here: http://www.developertutorials.com/tutorials/python/advanced-python-topics-050706/page1.html
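
As a rough illustration of what that pattern accepts, the sketch below (Python 3 syntax) compares it with a stricter variant, ">(\d+)</div>". The `cells` string is invented test input, not markup from the rankings page, and the stricter pattern is just one possible alternative.

```python
import re

# Invented test input: a normal rank, an empty cell, a lone '+', and a two-digit rank.
cells = '<div>7</div><div></div><div>+</div><div>42</div>'

# The pattern from this post: zero or more characters, each a digit or '+'.
loose = re.findall(r">([\d+]*)</div>", cells)

# Stricter variant: one or more digits only.
strict = re.findall(r">(\d+)</div>", cells)

print(loose)
print(strict)
```

Because * allows zero repetitions and the set includes '+', the first pattern also captures the empty cell and the lone plus sign, while the stricter one captures only the two numbers. On a page where every such div holds a plain number, both would return the same list.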
