Views: 200 | Replies: 2 | Solved
ChrisP_Buffalo

newbie: concatenating multiple RegEx's

  #1  
Jul 24th, 2008
This is a follow-up to my solved thread from a few days ago about extracting parts of a RegEx match in a web-scraping app. I now have a script with 4 RegEx searches that each work individually. I want to combine all 4 into a single search that returns the 4 pieces of information together in one list. I've seen examples on the web that put a plus sign "+" between the RegEx's, but when I do that I get an empty list (the same happens if I put nothing between the searches). If I use "and" in place of "+", only the last search returns its value.

import re
import urllib

f = urllib.urlopen("http://www.atptennis.com/3/en/rankings/entrysystem/")
tennis_rankings = f.read()

#++++++++++++++++++ The RegEx's below all work individually
#They extract, in order: 1) player's rank, 2) player's name, 3) player's total points, 4) number of tourneys played

#tennis_players = re.compile("<div class=\"entrylisttext\">([\d+]*)</div>", re.I | re.S | re.M)
#tennis_players = re.compile("playernumber=[A-Z][0-9]+\" id=\"blacklink\">([a-zA-Z]+, [a-zA-Z]+)", re.I | re.S | re.M)
#tennis_players = re.compile("pointsbreakdown.asp\?player=[A-Z][0-9]+&ss=y\" id=\"blacklink\">([0-9]+)", re.I | re.S | re.M)
#tennis_players = re.compile("playeractivity.asp\?player=[A-Z][0-9]+\" id=\"blacklink\">([0-9]+)", re.I | re.S | re.M)

#++++++++++++++++++ Now, together as a single search

tennis_players = re.compile("<div class=\"entrylisttext\">([\d+]*)</div>" + "playernumber=[A-Z][0-9]+\" id=\"blacklink\">([a-zA-Z]+, [a-zA-Z]+)" + "pointsbreakdown.asp\?player=[A-Z][0-9]+&ss=y\" id=\"blacklink\">([0-9]+)" + "playeractivity.asp\?player=[A-Z][0-9]+\" id=\"blacklink\">([0-9]+)", re.I | re.S | re.M)

find_result = tennis_players.findall(tennis_rankings)

print find_result
print 'done'

My preferred return is some sort of array of tuples:

[('1', 'Federer, Roger', '6600', '18'), ('2', 'Nadal, Rafael', '5800', '19'), ('3', 'Djokovic, Novak', '4900','20'), ...]

Any help would be appreciated!
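
A minimal sketch of why the "+" and "and" attempts behave as described, written in Python 3 syntax. The two-field HTML snippet and the names html, rank and name are invented for illustration; the real ATP markup is not shown in this thread. Joining patterns with "+" only concatenates the strings, so the combined regex requires the pieces to sit directly next to each other in the page, and "and" is a Python boolean operator rather than a regex operator.

```python
import re

# Hypothetical markup: the two fields are separated by other text,
# as is likely (but not confirmed) on the real page.
html = '<div class="rank">1</div>\n<td><a id="name">Federer, Roger</a></td>'

rank = r'<div class="rank">(\d+)</div>'
name = r'<a id="name">([A-Za-z]+, [A-Za-z]+)</a>'

# "+" glues the strings together, so the name must follow the rank immediately.
adjacent = re.findall(rank + name, html)
print(adjacent)                      # []

# "and" returns its second operand when the first is a non-empty string,
# so only the last pattern is ever compiled.
and_result = rank and name
print(and_result == name)            # True

# Allowing arbitrary text between the pieces: ".*?" is non-greedy and
# re.S lets "." match newlines as well.
spaced = re.findall(rank + r'.*?' + name, html, re.S)
print(spaced)                        # [('1', 'Federer, Roger')]
```

With several groups in one pattern, findall returns one tuple per match, which is the shape asked for above. Whether ".*?" is safe on the real page depends on the fields always appearing in the same order.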
jlm699

Re: newbie: concatenating multiple RegEx's

  #2  
Jul 24th, 2008
In my experience with regex, you can combine different regexes with '|' (the OR operator). Within the quotes, just place a | between each regex (without the single quotes).
Like so:
tennis_players = re.compile("<div class=\"entrylisttext\">([\d+]*)</div>|playernumber=[A-Z][0-9]+\" id=\"blacklink\">([a-zA-Z]+, [a-zA-Z]+)|pointsbreakdown.asp\?player=[A-Z][0-9]+&ss=y\" id=\"blacklink\">([0-9]+)|playeractivity.asp\?player=[A-Z][0-9]+\" id=\"blacklink\">([0-9]+)", re.I | re.S | re.M)
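
A small sketch of what findall returns for an alternation like this, in Python 3 syntax. The sample HTML and the shortened two-field pattern are assumptions made for illustration, not the real page. Each match satisfies only one branch of the '|', so every tuple has one filled group and empty strings for the rest, and the values still need to be regrouped afterwards.

```python
import re

# Invented sample with two fields per player.
html = (
    '<div class="rank">1</div><a id="name">Federer, Roger</a>'
    '<div class="rank">2</div><a id="name">Nadal, Rafael</a>'
)

combined = re.compile(
    r'<div class="rank">(\d+)</div>|<a id="name">([A-Za-z]+, [A-Za-z]+)</a>'
)

# One tuple per match; the group from the branch that did not match is ''.
matches = combined.findall(html)
print(matches)
# [('1', ''), ('', 'Federer, Roger'), ('2', ''), ('', 'Nadal, Rafael')]

# Drop the empty strings, then pair the values up again (2 fields per player).
values = [field for match in matches for field in match if field]
pairs = list(zip(values[0::2], values[1::2]))
print(pairs)
# [('1', 'Federer, Roger'), ('2', 'Nadal, Rafael')]
```

The regrouping step assumes every player contributes every field in the same order; a missing field would shift all later values.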
Let's Go Pens!
solsteel

Re: newbie: concatenating multiple RegEx's

  #3  
Jul 25th, 2008
You can use string formatting to build your regex pattern. The following iterates over the file object line by line. For the top 10 players, the variable data should end up with 40 items. For each match, m.groups() should have 5 items. The first item is the entire matched string (the named group data). The value you are interested in is one of the last 4 items; the other 3 will be None.
import re
import urllib

p1 = "<div class=\"entrylisttext\">([\d+]*)</div>"
p2 = "playernumber=[A-Z]+[0-9]+\" id=\"blacklink\">([A-Z ]+-?[A-Z ]*, [A-Z ]+)"
p3 = "pointsbreakdown.asp\?player=[A-Z]+[0-9]+&ss=y\" id=\"blacklink\">([0-9]+)"
p4 = "playeractivity.asp\?player=[A-Z]+[0-9]+\" id=\"blacklink\">([0-9]+)"

patt = re.compile(r'(?P<data>%s|%s|%s|%s)' % (p1,p2,p3,p4), re.I)

f = urllib.urlopen("http://www.atptennis.com/3/en/rankings/entrysystem/")
data = []
for line in f:
    m = patt.search(line)
    if m:
        # groups()[0] is the whole match; keep only the inner group that matched
        data.extend(item for item in m.groups()[1:] if item)

f.close()

# regroup the flat list into (rank, name, points, tournaments) tuples
output = [(data[i],data[i+1],data[i+2],data[i+3]) for i in range(0,len(data),4)]

for item in output:
    print "Rank: %-3s Player: %-28s Points: %-5s Tournaments: %s" % (item[0], item[1], item[2], item[3])
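
The same idea can be tried without network access. The sketch below uses Python 3 syntax and the four patterns from the post, fed with a handful of invented lines. The link prefix "playerprofile.asp" and the player IDs F324 and N409 are made up to fit the patterns; the displayed values come from the example output in the first post. It shows the 5-item groups() result and the regrouping into 4-tuples.

```python
import re

# Patterns as in the post, written as raw strings.
p1 = r'<div class="entrylisttext">([\d+]*)</div>'
p2 = r'playernumber=[A-Z]+[0-9]+" id="blacklink">([A-Z ]+-?[A-Z ]*, [A-Z ]+)'
p3 = r'pointsbreakdown.asp\?player=[A-Z]+[0-9]+&ss=y" id="blacklink">([0-9]+)'
p4 = r'playeractivity.asp\?player=[A-Z]+[0-9]+" id="blacklink">([0-9]+)'
patt = re.compile('(?P<data>%s|%s|%s|%s)' % (p1, p2, p3, p4), re.I)

# Invented input lines, one field per line (an assumption about the page layout).
lines = [
    '<div class="entrylisttext">1</div>',
    '<a href="playerprofile.asp?playernumber=F324" id="blacklink">Federer, Roger</a>',
    '<a href="pointsbreakdown.asp?player=F324&ss=y" id="blacklink">6600</a>',
    '<a href="playeractivity.asp?player=F324" id="blacklink">18</a>',
    '<div class="entrylisttext">2</div>',
    '<a href="playerprofile.asp?playernumber=N409" id="blacklink">Nadal, Rafael</a>',
    '<a href="pointsbreakdown.asp?player=N409&ss=y" id="blacklink">5800</a>',
    '<a href="playeractivity.asp?player=N409" id="blacklink">19</a>',
]

# groups() has 5 entries: the whole match, then one value and three Nones.
first_groups = patt.search(lines[0]).groups()
print(first_groups)
# ('<div class="entrylisttext">1</div>', '1', None, None, None)

data = []
for line in lines:
    m = patt.search(line)
    if m:
        data.extend(item for item in m.groups()[1:] if item)

# Slice the flat list into (rank, name, points, tournaments) tuples.
rows = [tuple(data[i:i + 4]) for i in range(0, len(data), 4)]
print(rows)
# [('1', 'Federer, Roger', '6600', '18'), ('2', 'Nadal, Rafael', '5800', '19')]
```

Note that search() reports only the first match on each line, so this relies on each field sitting on its own line.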