Oct 13th, 2009

Simple HTML Parsing Question

So, I have a rather simple question today. I'll try to explain it with an example. Let's say the page "www.mywebsite.com/py" contains the line of HTML "<tr colData0='Friday'>". However, the value can change to "Sunday", "Tuesday", etc. What would be the easiest way for me to extract that value (Monday, Wednesday, etc.) from the line with a Python script?

Thanks in advance.
SoulMazer

Oct 14th, 2009
Re: Simple HTML Parsing Question
Have a look at Beautiful Soup. I have heard that it is an excellent tool for scraping web pages.

If that doesn't work, you can always try using string methods:

    text = "<tr colData0='Friday'>"
    # Split on the single quote into a list with 3 items.
    text = text.split("'")
    # The value sits between the two quotes, at index 1.
    print(text[1])

Actually, the second idea would probably be the simplest.
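The split trick works when the string is exactly that one tag. On a whole page there will usually be other quotes before the tag, so index 1 would no longer be the day. Here is a rough sketch that still uses only string methods but searches for the attribute first. The helper name `extract_attribute` and the sample `page` string are made up for illustration, and it assumes the attribute is written as name='value' or name="value" with no spaces around the equals sign.

```python
def extract_attribute(html, name):
    """Return the value of the first name='...' or name="..." found, or None."""
    marker = name + "="
    start = html.find(marker)
    if start == -1:
        return None  # attribute not present
    start += len(marker)
    # The character right after '=' must be an opening quote.
    if start >= len(html) or html[start] not in "'\"":
        return None
    quote = html[start]
    end = html.find(quote, start + 1)  # matching closing quote
    if end == -1:
        return None
    return html[start + 1:end]


# Hypothetical sample page, only here to exercise the helper.
page = "<table class='week'>\n<tr colData0='Friday'>\n<td>...</td></tr></table>"
print(extract_attribute(page, "colData0"))
```

It is still plain text searching, so it can be fooled by the same text appearing inside a comment or a script, but it is a bit sturdier than a bare split.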
Last edited by Paul Thompson; Oct 14th, 2009 at 12:16 am.
Paul Thompson

Oct 14th, 2009
Re: Simple HTML Parsing Question
You could use HTMLParser like this:

    import sys
    if sys.version_info[0] < 3:
        from HTMLParser import HTMLParser
        from urllib2 import urlopen
    else:
        from html.parser import HTMLParser
        from urllib.request import urlopen

    class MyParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.day = None

        def handle_starttag(self, tag, attrs):
            if tag == 'tr':
                for key, value in attrs:
                    # HTMLParser lowercases attribute names, so compare
                    # against 'coldata0' rather than 'colData0'.
                    if key == 'coldata0':
                        self.day = value

    def get_day(url):
        parser = MyParser()
        html = urlopen(url).read().decode('utf8')
        parser.feed(html)
        parser.close()
        return parser.day

    if __name__ == '__main__':
        print(get_day("http://www.mywebsite.com/py"))
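To try the parser idea without touching the network, you can feed it a literal string. This is only a sketch for Python 3: the class name `DayParser` and the `sample` markup are invented for the test. One thing to watch is that the parser hands attribute names to `handle_starttag` in lower case, so the comparison uses 'coldata0'.

```python
from html.parser import HTMLParser


class DayParser(HTMLParser):
    """Remember the colData0 attribute of the last <tr> tag seen."""

    def __init__(self):
        super().__init__()
        self.day = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            for key, value in attrs:
                # Attribute names arrive lowercased from the parser.
                if key == "coldata0":
                    self.day = value


# Hypothetical sample markup standing in for the real page.
sample = "<table><tr colData0='Friday'><td>x</td></tr></table>"
parser = DayParser()
parser.feed(sample)
parser.close()
print(parser.day)
```

Once that works on a string, swapping in the text returned by urlopen is the only change needed.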
Gribouillis

Oct 14th, 2009
Re: Simple HTML Parsing Question
@Gribouillis: I tried the code you gave me, but I get an error about a bad end tag:

    HTMLParser.HTMLParseError: bad end tag: u"</SCR');\ndocument.write('IPT>"
    ...blah...
    File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())

Since the source of the HTML page is too long to post, just view the page source via your browser: http://www.xfire.com/friends/soulmazer/.

@paulthom: Well, I would prefer to just parse it myself, as the example I gave is simpler than what I am actually trying to accomplish (get information from a profile). But how could I get the unparsed HTML of a web page via a script?
SoulMazer

Oct 15th, 2009
Re: Simple HTML Parsing Question
If by "unparsed HTML via a script" you mean getting the source code for a page, then you can do that with urllib:

    try:
        from urllib.request import urlopen  # Python 3
    except ImportError:
        from urllib import urlopen  # Python 2

    # This is a file-like object. The URL needs its scheme (http://).
    data = urlopen("http://www.daniweb.com")

    # So we have to read() it to get the text
    print(data.read())

Hope that is what you meant.
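Putting the two pieces together, fetching the page and then pulling the value out yourself might look roughly like this on Python 3. The function names `fetch_html` and `find_day` are made up, the regular expression is just one possible pattern for the colData0 example from the first post, and decoding as UTF-8 is an assumption about the page.

```python
import re
from urllib.request import urlopen


def fetch_html(url):
    """Download a page and return its source as text (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def find_day(html):
    """Return the colData0 value of the first <tr> that has one, or None."""
    # Match <tr ... colData0='value'> with single or double quotes.
    match = re.search(r"""<tr\b[^>]*\bcolData0=(['"])(.*?)\1""", html)
    return match.group(2) if match else None


# Offline check on a literal string; for a real page you would call
# find_day(fetch_html("http://www.mywebsite.com/py")) instead.
sample = "<tr colData0='Friday'>"
print(find_day(sample))
```

A regular expression will not cope with every page, but it does not choke on odd script blocks the way a strict parser can.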
Paul Thompson

Oct 15th, 2009
Re: Simple HTML Parsing Question
That's perfect! Ok, my problem's solved. Thank you very much.
SoulMazer

This thread is solved
