Simple HTML Parsing Question

Thread Solved

SoulMazer (Junior Poster in Training), post #1
Oct 13th, 2009
So, I have a rather simple question today. I'll try to explain it by using an example. Let's say there is a line of HTML in the page "www.mywebsite.com/py" that says "<tr colData0='Friday'>". However, this line of code can change to be "Sunday", "Tuesday", etc. What would be the easiest way for me to extract the data (Monday, Wednesday, etc.) from that line of code via a Python script?

Thanks in advance.

Paul Thompson (previously paulthom12345), post #2
Oct 14th, 2009
Have a look at Beautiful Soup.

I have heard that it is an excellent tool for scraping webpages.

If that doesn't work then you can always try using string methods...
    text = "<tr colData0='Friday'>"
    # Split on the single quotes: ['<tr colData0=', 'Friday', '>']
    parts = text.split("'")
    print(parts[1])

Actually, the second idea would probably be the simplest.
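
The split above assumes the text is exactly that one line. As a variation on the string-method idea, a regular expression can pull the value out of a whole page. This is only a sketch: the names COLDATA_RE and extract_days are made up for illustration, and it assumes the attribute value is always quoted.

```python
import re

# Matches colData0='...' or colData0="..." anywhere in the text.
# The attribute name is taken from the example in the question.
COLDATA_RE = re.compile(r"""colData0\s*=\s*(['"])(.*?)\1""", re.IGNORECASE)

def extract_days(html):
    """Return every colData0 value found in html, in document order."""
    return [match.group(2) for match in COLDATA_RE.finditer(html)]

sample = "<table><tr colData0='Friday'></tr><tr colData0=\"Sunday\"></tr></table>"
print(extract_days(sample))
```

A regular expression knows nothing about HTML structure, so it would also match the same text inside a comment or a script. For a single known attribute that is usually acceptable.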
Last edited by Paul Thompson; Oct 14th, 2009 at 12:16 am.
Make it idiot proof and someone will make a better idiot.

Gribouillis (Posting Shark), post #3
Oct 14th, 2009
You could use HTMLParser like this:

    import sys
    if sys.version_info[0] < 3:
        from HTMLParser import HTMLParser
        from urllib2 import urlopen
    else:
        from html.parser import HTMLParser
        from urllib.request import urlopen

    class MyParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.day = None

        def handle_starttag(self, tag, attrs):
            if tag == 'tr':
                for key, value in attrs:
                    # HTMLParser lower-cases attribute names, so compare
                    # against 'coldata0' rather than 'colData0'.
                    if key == 'coldata0':
                        self.day = value

    def get_day(url):
        parser = MyParser()
        html = urlopen(url).read().decode('utf8')
        parser.feed(html)
        parser.close()
        return parser.day

    if __name__ == '__main__':
        print(get_day("http://www.mywebsite.com/py"))
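
The parser above keeps only the last matching value, because each match overwrites self.day. If the page has several such rows, a variant that collects all of them might look like the following sketch. It is written for Python 3 only and fed from a string so it can be tried without a network connection. DayCollector and get_days are hypothetical names, not part of any library.

```python
from html.parser import HTMLParser  # Python 3; the Python 2 module is HTMLParser

class DayCollector(HTMLParser):
    """Collect the colData0 value of every <tr> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.days = []

    def handle_starttag(self, tag, attrs):
        if tag != 'tr':
            return
        for key, value in attrs:
            # HTMLParser reports attribute names in lower case.
            if key == 'coldata0':
                self.days.append(value)

def get_days(html):
    """Parse an HTML string and return all collected values."""
    parser = DayCollector()
    parser.feed(html)
    parser.close()
    return parser.days

sample = "<table><tr colData0='Friday'></tr><tr colData0='Sunday'></tr></table>"
print(get_days(sample))
```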

SoulMazer (Junior Poster in Training), post #4
Oct 14th, 2009
@Gribouillis: I tried the code you gave me, but I get an error about a bad end tag:

    HTMLParser.HTMLParseError: bad end tag: u"</SCR');\ndocument.write('IPT>"
    ...blah...
    File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())

Since the source of the HTML page is too long to post, just view the page source via your browser: http://www.xfire.com/friends/soulmazer/.

@paulthom: Well, I would prefer to just parse it myself, since the example I gave is less confusing than what I am actually trying to accomplish (getting information from a profile). But how could I get the unparsed HTML of a web page via a script?

Paul Thompson (previously paulthom12345), post #5
Oct 15th, 2009
If by "unparsed HTML via a script" you mean getting the source code of a page, you can do that with urllib:

    try:
        from urllib.request import urlopen  # Python 3
    except ImportError:
        from urllib import urlopen  # Python 2

    # This is a file-like object. The URL needs its scheme ("http://").
    data = urlopen("http://www.daniweb.com")

    # So we have to read() it to get the page source
    print(data.read())

Hope that is what you meant.
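
On Python 3, read() returns bytes, so the result usually needs decoding before any string handling. A minimal sketch of that step follows, assuming the server either declares its charset in the Content-Type header or uses UTF-8. The helper name fetch_html and its default_encoding parameter are invented for this example.

```python
from urllib.request import urlopen  # Python 3 only

def fetch_html(url, default_encoding="utf-8"):
    """Return the source of the page at url as text (hypothetical helper)."""
    # The URL must include its scheme, e.g. "http://...".
    with urlopen(url) as response:
        raw = response.read()  # bytes
        # Use the charset from the Content-Type header when one is declared.
        charset = response.headers.get_content_charset() or default_encoding
    # Undecodable bytes are replaced rather than raising an error.
    return raw.decode(charset, errors="replace")

# Usage (needs network access):
#     html = fetch_html("http://www.daniweb.com")
```

The returned text can then be searched with string methods or fed to a parser.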

SoulMazer (Junior Poster in Training), post #6
Oct 15th, 2009
That's perfect! Ok, my problem's solved. Thank you very much.

This thread has been marked solved.

©2003 - 2009 DaniWeb® LLC