This is a follow-up to my solved thread from a few days ago about extracting parts of a RegEx search in a web scraping app. I now have a script with 4 RegEx searches that each work individually, and I want to combine all 4 into a single search that returns the 4 pieces of information in a single list. I've seen examples on the web using a plus sign "+" between the RegExes, but when I do that I get an empty list (the same happens if I put nothing between the searches). If I use "and" in place of "+", only the last search returns its value.

import re
import urllib

f = urllib.urlopen("http://www.atptennis.com/3/en/rankings/entrysystem/")
tennis_rankings = f.read()

#++++++++++++++++++  The RegEx's below all work individually 
#They extract, in order: 1) player's rank, 2) player's name, 3) player's total points, 4) number of tourneys played

#tennis_players = re.compile("<div class=\"entrylisttext\">([\d+]*)</div>", re.I | re.S | re.M)
#tennis_players = re.compile("playernumber=[A-Z][0-9]+\" id=\"blacklink\">([a-zA-Z]+, [a-zA-Z]+)", re.I | re.S | re.M)
#tennis_players = re.compile("pointsbreakdown.asp\?player=[A-Z][0-9]+&ss=y\" id=\"blacklink\">([0-9]+)", re.I | re.S | re.M)
#tennis_players = re.compile("playeractivity.asp\?player=[A-Z][0-9]+\" id=\"blacklink\">([0-9]+)", re.I | re.S | re.M)

#++++++++++++++++++  Now, together as a single search 

tennis_players = re.compile("<div class=\"entrylisttext\">([\d+]*)</div>" + "playernumber=[A-Z][0-9]+\" id=\"blacklink\">([a-zA-Z]+, [a-zA-Z]+)" + "pointsbreakdown.asp\?player=[A-Z][0-9]+&ss=y\" id=\"blacklink\">([0-9]+)" + "playeractivity.asp\?player=[A-Z][0-9]+\" id=\"blacklink\">([0-9]+)", re.I | re.S | re.M)

find_result = tennis_players.findall(tennis_rankings)

print find_result
print 'done'

My preferred return is some sort of array of tuples:

[('1', 'Federer, Roger', '6600', '18'), ('2', 'Nadal, Rafael', '5800', '19'), ('3', 'Djokovic, Novak', '4900','20'), ...]

Any help would be appreciated!
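
For reference, the two behaviours described above can be reproduced on a small made-up snippet. Joining the strings with "+" produces one pattern whose parts must match back to back, so any markup between the pieces makes the whole match fail; "and" between two non-empty strings simply evaluates to the last string. The sample HTML and the shortened patterns below are invented for illustration and are not the real page markup.

```python
import re

# Made-up stand-in for a fragment of the rankings page (an assumption).
sample = ('<div class="entrylisttext">1</div></td><td>'
          '<a id="blacklink">Federer, Roger</a>')

rank = r'<div class="entrylisttext">(\d+)</div>'
name = r'id="blacklink">([a-zA-Z]+, [a-zA-Z]+)'

# "+" concatenates the strings, so the name part would have to start
# immediately after "</div>". Here "</td><td><a " sits in between.
concatenated = re.compile(rank + name, re.I)
print(concatenated.findall(sample))   # empty list: the parts are not adjacent

# "and" returns its last operand when the first one is a non-empty string,
# so only the last pattern ever reaches re.compile().
print((rank and name) == name)
```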

All 2 Replies

In my experience with regex, you can combine several different regexes with a | (the "or" operator). So within the quotes, just place a | between each regex, like so:

tennis_players = re.compile("<div class=\"entrylisttext\">([\d+]*)</div>|playernumber=[A-Z][0-9]+\" id=\"blacklink\">([a-zA-Z]+, [a-zA-Z]+)|pointsbreakdown.asp\?player=[A-Z][0-9]+&ss=y\" id=\"blacklink\">([0-9]+)|playeractivity.asp\?player=[A-Z][0-9]+\" id=\"blacklink\">([0-9]+)", re.I | re.S | re.M)
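
To see what this kind of alternation gives back, here is a small sketch on made-up sample text (the markup is an assumption, loosely shaped like the patterns above). With four capturing groups separated by |, findall returns one 4-tuple per match, and only the group belonging to the branch that matched is filled; the others are empty strings. The values therefore still need to be regrouped into one tuple per player.

```python
import re

# Invented sample, loosely shaped like the patterns in this thread
# (not the real page).
sample = (
    '<div class="entrylisttext">1</div>\n'
    '<a href="x.asp?playernumber=F324" id="blacklink">Federer, Roger</a>\n'
    '<a href="pointsbreakdown.asp?player=F324&ss=y" id="blacklink">6600</a>\n'
    '<a href="playeractivity.asp?player=F324" id="blacklink">18</a>\n'
)

pattern = re.compile(
    r'<div class="entrylisttext">(\d+)</div>'
    r'|playernumber=[A-Z][0-9]+" id="blacklink">([a-zA-Z]+, [a-zA-Z]+)'
    r'|pointsbreakdown\.asp\?player=[A-Z][0-9]+&ss=y" id="blacklink">([0-9]+)'
    r'|playeractivity\.asp\?player=[A-Z][0-9]+" id="blacklink">([0-9]+)',
    re.I,
)

matches = pattern.findall(sample)
for groups in matches:
    print(groups)   # one filled group per match, the rest are ''

# Keep only the filled group from each match, then chunk into fours.
values = [value for groups in matches for value in groups if value]
players = [tuple(values[i:i + 4]) for i in range(0, len(values), 4)]
print(players)
```

The chunking step assumes every player contributes exactly four matches in a fixed order; a missing field would shift everything after it.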

You can use string formatting to build your regex pattern. The following iterates over the file object line by line. For the top 10 players, the variable data should end up with 40 items. The match object's m.groups() should have 5 items. The first item is the entire matched string (the named group "data"). The data you are interested in will be one of the last 4 items; the other 3 will be None.

import re
import urllib

p1 = r'<div class="entrylisttext">([\d+]*)</div>'
p2 = r'playernumber=[A-Z]+[0-9]+" id="blacklink">([A-Z ]+-?[A-Z ]*, [A-Z ]+)'
p3 = r'pointsbreakdown.asp\?player=[A-Z]+[0-9]+&ss=y" id="blacklink">([0-9]+)'
p4 = r'playeractivity.asp\?player=[A-Z]+[0-9]+" id="blacklink">([0-9]+)'

patt = re.compile(r'(?P<data>%s|%s|%s|%s)' % (p1,p2,p3,p4), re.I)

f = urllib.urlopen("http://www.atptennis.com/3/en/rankings/entrysystem/")
data = []
for line in f:
    m = patt.search(line)
    if m:
        # Skip group 0 (the whole match) and keep only the group that matched
        data.extend(item for item in m.groups()[1:] if item)

f.close()

output = [(data[i],data[i+1],data[i+2],data[i+3]) for i in range(0,len(data),4)]

for item in output:
    print "Rank: %-3s Player: %-28s Points: %-5s Tournaments: %s" % (item[0], item[1], item[2], item[3])
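
If a list of tuples straight from findall is preferred, one alternative (not what the code above does) is to join the four patterns with a non-greedy .*? and compile with re.S so the gap can span line breaks. This is only a sketch on invented sample markup: it assumes the four pieces always appear in this order for every player, which may not hold on the real page. It is written so that it also runs on Python 3.

```python
import re

# Invented two-player sample (an assumption, not the real page markup).
sample = """
<div class="entrylisttext">1</div>
<a href="x.asp?playernumber=F324" id="blacklink">Federer, Roger</a>
<a href="pointsbreakdown.asp?player=F324&ss=y" id="blacklink">6600</a>
<a href="playeractivity.asp?player=F324" id="blacklink">18</a>
<div class="entrylisttext">2</div>
<a href="x.asp?playernumber=N409" id="blacklink">Nadal, Rafael</a>
<a href="pointsbreakdown.asp?player=N409&ss=y" id="blacklink">5800</a>
<a href="playeractivity.asp?player=N409" id="blacklink">19</a>
"""

parts = [
    r'<div class="entrylisttext">(\d+)</div>',
    r'playernumber=[A-Z]+[0-9]+" id="blacklink">([A-Z ]+-?[A-Z ]*, [A-Z ]+)',
    r'pointsbreakdown\.asp\?player=[A-Z]+[0-9]+&ss=y" id="blacklink">([0-9]+)',
    r'playeractivity\.asp\?player=[A-Z]+[0-9]+" id="blacklink">([0-9]+)',
]

# ".*?" bridges whatever markup lies between the pieces;
# re.S lets "." cross newlines, re.I keeps the case-insensitive matching.
combined = re.compile(r'.*?'.join(parts), re.I | re.S)

# Four groups in one pattern, so findall yields one 4-tuple per player.
players = combined.findall(sample)
for rank, name, points, tourneys in players:
    print("Rank: %-3s Player: %-28s Points: %-5s Tournaments: %s"
          % (rank, name, points, tourneys))
```

Unlike the line-by-line loop, this needs the whole page in one string (for example the result of f.read()).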