Views: 200 | Replies: 2 | Solved
This is a follow-up to my solved thread from a few days ago about extracting parts of a RegEx search in a web-scraping app. I now have a script with 4 RegEx searches that each work individually, and I want to combine all 4 into a single search that returns the 4 pieces of information in a single list. I've seen examples on the web using a plus sign "+" between the RegExes, but when I do that I get an empty list back (the same happens if I put nothing between the searches). If I use "and" in place of "+", only the last search returns its value.
My preferred return is some sort of array of tuples:
[('1', 'Federer, Roger', '6600', '18'), ('2', 'Nadal, Rafael', '5800', '19'), ('3', 'Djokovic, Novak', '4900','20'), ...]
Any help would be appreciated!
Python Syntax

    import re
    import urllib

    f = urllib.urlopen("http://www.atptennis.com/3/en/rankings/entrysystem/")
    tennis_rankings = f.read()

    #++++++++++++++++++ The RegEx's below all work individually
    #They extract, in order: 1) player's rank, 2) player's name, 3) player's total points, 4) number of tourneys played
    #tennis_players = re.compile("<div class=\"entrylisttext\">([\d+]*)</div>", re.I | re.S | re.M)
    #tennis_players = re.compile("playernumber=[A-Z][0-9]+\" id=\"blacklink\">([a-zA-Z]+, [a-zA-Z]+)", re.I | re.S | re.M)
    #tennis_players = re.compile("pointsbreakdown.asp\?player=[A-Z][0-9]+&ss=y\" id=\"blacklink\">([0-9]+)", re.I | re.S | re.M)
    #tennis_players = re.compile("playeractivity.asp\?player=[A-Z][0-9]+\" id=\"blacklink\">([0-9]+)", re.I | re.S | re.M)

    #++++++++++++++++++ Now, together as a single search
    tennis_players = re.compile(
        "<div class=\"entrylisttext\">([\d+]*)</div>" +
        "playernumber=[A-Z][0-9]+\" id=\"blacklink\">([a-zA-Z]+, [a-zA-Z]+)" +
        "pointsbreakdown.asp\?player=[A-Z][0-9]+&ss=y\" id=\"blacklink\">([0-9]+)" +
        "playeractivity.asp\?player=[A-Z][0-9]+\" id=\"blacklink\">([0-9]+)",
        re.I | re.S | re.M)

    find_result = tennis_players.findall(tennis_rankings)
    print find_result
    print 'done'
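The two symptoms described in the question have separate causes, and a small stand-alone test can show both. What follows is only a sketch: it is written for Python 3 (the thread's code is Python 2), and SAMPLE is made-up, simplified markup with hypothetical player IDs, so the real page may need different patterns. The names RANK, NAME, POINTS, TOURNEYS and GAP are illustrative too. The idea is that "and" between strings is resolved by Python before re.compile ever runs, and that "+" simply glues the patterns together, so they only match if the four fragments sit directly next to each other in the page.

```python
import re

# Hypothetical, simplified markup standing in for the real rankings page.
SAMPLE = (
    '<div class="entrylisttext">1</div>\n'
    '<a href="playernumber=F324" id="blacklink">Federer, Roger</a>\n'
    '<a href="pointsbreakdown.asp?player=F324&ss=y" id="blacklink">6600</a>\n'
    '<a href="playeractivity.asp?player=F324" id="blacklink">18</a>\n'
    '<div class="entrylisttext">2</div>\n'
    '<a href="playernumber=N409" id="blacklink">Nadal, Rafael</a>\n'
    '<a href="pointsbreakdown.asp?player=N409&ss=y" id="blacklink">5800</a>\n'
    '<a href="playeractivity.asp?player=N409" id="blacklink">19</a>\n'
)

RANK = r'<div class="entrylisttext">(\d+)</div>'
NAME = r'playernumber=[A-Z][0-9]+" id="blacklink">([a-zA-Z]+, [a-zA-Z]+)'
POINTS = r'pointsbreakdown\.asp\?player=[A-Z][0-9]+&ss=y" id="blacklink">([0-9]+)'
TOURNEYS = r'playeractivity\.asp\?player=[A-Z][0-9]+" id="blacklink">([0-9]+)'

# "and" on non-empty strings evaluates to the last string, so only the
# last pattern would ever reach re.compile.
and_pattern = RANK and NAME and POINTS and TOURNEYS

# Plain concatenation demands that the four fragments be adjacent in the
# text, which they are not, so nothing matches.
adjacent = re.compile(RANK + NAME + POINTS + TOURNEYS, re.I | re.S)
adjacent_result = adjacent.findall(SAMPLE)

# Joining with a non-greedy ".*?" lets the pattern skip the markup between
# fragments; re.S makes "." match newlines as well.
GAP = r'.*?'
spaced = re.compile(GAP.join([RANK, NAME, POINTS, TOURNEYS]), re.I | re.S)
rows = spaced.findall(SAMPLE)

print(and_pattern == TOURNEYS)  # True
print(adjacent_result)          # []
print(rows)
```

On this sample, rows comes out as a list of 4-tuples in the shape asked for above. One caveat with the ".*?" approach: if a fragment is missing for one player, the pattern can stretch into the next player's markup, so the results are worth spot-checking against the page.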
In my experience you can combine several regexes with '|' (the or operator). Within the quotes, just place a | between each regex.
Like so:
Python Syntax

    tennis_players = re.compile(
        "<div class=\"entrylisttext\">([\d+]*)</div>"
        "|playernumber=[A-Z][0-9]+\" id=\"blacklink\">([a-zA-Z]+, [a-zA-Z]+)"
        "|pointsbreakdown.asp\?player=[A-Z][0-9]+&ss=y\" id=\"blacklink\">([0-9]+)"
        "|playeractivity.asp\?player=[A-Z][0-9]+\" id=\"blacklink\">([0-9]+)",
        re.I | re.S | re.M)
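It may help to see what findall actually returns for an alternation like this. Each hit satisfies only one alternative, so every tuple has one filled slot and empty strings for the groups that did not take part. The sketch below is Python 3 and uses just two of the four alternatives on a made-up two-line sample (SAMPLE and the player ID are assumptions, not the real page).

```python
import re

# Hypothetical two-line sample; the real page markup may differ.
SAMPLE = (
    '<div class="entrylisttext">1</div>\n'
    '<a href="playernumber=F324" id="blacklink">Federer, Roger</a>\n'
)

pattern = re.compile(
    r'<div class="entrylisttext">(\d+)</div>'
    r'|playernumber=[A-Z][0-9]+" id="blacklink">([a-zA-Z]+, [a-zA-Z]+)',
    re.I,
)

# One tuple per hit, one slot per group; only the alternative that
# matched is filled, the other slot is an empty string.
sparse = pattern.findall(SAMPLE)
print(sparse)  # [('1', ''), ('', 'Federer, Roger')]

# Keep the single non-empty value from each tuple.
flat = [value for groups in sparse for value in groups if value]
print(flat)    # ['1', 'Federer, Roger']
```

So the '|' version does find everything in one pass, but the values still have to be regrouped per player afterwards.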
You can use string formatting to build your regex pattern. The code below iterates over the file object line by line. For the top 10 players, the variable data should end up with 40 items. For each match, m.groups() has 5 items: the first is the entire matched string, and the data you are interested in is one of the last 4; the other 3 will be None.
Python Syntax

    import re
    import urllib

    p1 = "<div class=\"entrylisttext\">([\d+]*)</div>"
    p2 = "playernumber=[A-Z]+[0-9]+\" id=\"blacklink\">([A-Z ]+-?[A-Z ]*, [A-Z ]+)"
    p3 = "pointsbreakdown.asp\?player=[A-Z]+[0-9]+&ss=y\" id=\"blacklink\">([0-9]+)"
    p4 = "playeractivity.asp\?player=[A-Z]+[0-9]+\" id=\"blacklink\">([0-9]+)"
    patt = re.compile(r'(?P<data>%s|%s|%s|%s)' % (p1,p2,p3,p4), re.I)

    f = urllib.urlopen("http://www.atptennis.com/3/en/rankings/entrysystem/")
    data = []
    for line in f:
        m = patt.search(line)
        if m:
            # skip the whole-match group, keep the one group that is not None
            data.extend(item for item in m.groups()[1:] if item)
    f.close()

    # regroup the flat list into one 4-tuple per player
    output = [(data[i],data[i+1],data[i+2],data[i+3]) for i in range(0,len(data),4)]
    for item in output:
        print "Rank: %-3s Player: %-28s Points: %-5s Tournaments: %s" % (item[0], item[1], item[2], item[3])
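For anyone trying this without the live page, here is a rough Python 3 sketch of the same collect-then-regroup idea run against a few in-memory lines. LINES is made-up markup with hypothetical player IDs, and group_counts is an extra variable added only to confirm the "5 items" remark above. The patterns are the ones from the reply, written as raw strings.

```python
import re

p1 = r'<div class="entrylisttext">([\d+]*)</div>'
p2 = r'playernumber=[A-Z]+[0-9]+" id="blacklink">([A-Z ]+-?[A-Z ]*, [A-Z ]+)'
p3 = r'pointsbreakdown.asp\?player=[A-Z]+[0-9]+&ss=y" id="blacklink">([0-9]+)'
p4 = r'playeractivity.asp\?player=[A-Z]+[0-9]+" id="blacklink">([0-9]+)'
patt = re.compile('(?P<data>%s|%s|%s|%s)' % (p1, p2, p3, p4), re.I)

# Hypothetical lines standing in for the downloaded page.
LINES = [
    '<div class="entrylisttext">1</div>',
    '<a href="playernumber=F324" id="blacklink">Federer, Roger</a>',
    '<a href="pointsbreakdown.asp?player=F324&ss=y" id="blacklink">6600</a>',
    '<a href="playeractivity.asp?player=F324" id="blacklink">18</a>',
    '<p>unrelated line</p>',
    '<div class="entrylisttext">2</div>',
    '<a href="playernumber=N409" id="blacklink">Nadal, Rafael</a>',
    '<a href="pointsbreakdown.asp?player=N409&ss=y" id="blacklink">5800</a>',
    '<a href="playeractivity.asp?player=N409" id="blacklink">19</a>',
]

data = []
group_counts = set()
for line in LINES:
    m = patt.search(line)
    if m:
        # groups() = (whole match, rank, name, points, tourneys);
        # exactly one of the last four is set, the rest are None.
        group_counts.add(len(m.groups()))
        data.extend(item for item in m.groups()[1:] if item)

# Regroup the flat list into one 4-tuple per player.
output = [tuple(data[i:i + 4]) for i in range(0, len(data), 4)]
for rank, name, points, tourneys in output:
    print("Rank: %-3s Player: %-28s Points: %-5s Tournaments: %s"
          % (rank, name, points, tourneys))
```

Note that slicing in fours assumes every player yields all four values in order. If one pattern misses for a single player, every later tuple shifts, so checking that len(data) is a multiple of 4 is a cheap sanity test.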