Simple HTML Parsing Question

I have a rather simple question today, and I'll try to explain it with an example. Let's say there is a line of HTML in the page "www.mywebsite.com/py" that says "<tr colData0='Monday'>". However, the value in this line can change to "Sunday", "Tuesday", etc. What would be the easiest way for me to extract that value (Monday, Wednesday, etc.) from the line via a Python script?

Thanks in advance.

SoulMazer
Posting Whiz in Training
213 posts since Sep 2008
Reputation Points: 23
Solved Threads: 12
 

Have a look at Beautiful Soup.

I have heard that it is an excellent tool for scraping webpages.

If that doesn't work then you can always try using string methods...

text = "<tr colData0='Friday'>"
# Split on the single quote into a list with 3 items.
parts = text.split("'")
# The day sits between the two quotes, so it is the middle item.
print(parts[1])


Actually, the second idea would probably be the simplest :P
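
If the quoting style or the spacing around the attribute might vary, a regular expression is a slightly more forgiving take on the same string-based idea. This is only a sketch: the names `DAY_PATTERN` and `extract_day` are made up for illustration, and it assumes the attribute is called colData0 as in the example above.

```python
import re

# Matches colData0='...' or colData0="..." (attribute name taken from the
# example in this thread). The backreference \1 requires the closing quote
# to be the same character as the opening one.
DAY_PATTERN = re.compile(r"""colData0\s*=\s*(['"])(.*?)\1""", re.IGNORECASE)


def extract_day(html):
    """Return the colData0 value of the first match, or None if absent."""
    match = DAY_PATTERN.search(html)
    return match.group(2) if match else None


print(extract_day("<tr colData0='Friday'>"))  # Friday
print(extract_day('<tr colData0="Monday">'))  # Monday
print(extract_day("<tr>"))                    # None
```

Like the split() approach, this does not understand HTML structure, so it is best kept for small, predictable snippets.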

Paul Thompson
Veteran Poster
1,119 posts since May 2008
Reputation Points: 264
Solved Threads: 183
 

You could use HTMLParser like this

import sys
if sys.version_info[0] < 3:
    from HTMLParser import HTMLParser
    from urllib2 import urlopen
else:
    from html.parser import HTMLParser
    from urllib.request import urlopen

class MyParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.day = None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
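            # HTMLParser lowercases attribute names before calling this
            # method, so compare against the lowercase spelling.
            for key, value in attrs:
                if key == 'coldata0':
                    self.day = value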

def get_day(url):
    parser = MyParser()
    html = urlopen(url).read().decode('utf8')
    parser.feed(html)
    parser.close()
    return parser.day

if __name__ == '__main__':
    print(get_day("http://www.mywebsite.com/py"))
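
To try this kind of parser without touching the network, it can be fed an inline string. The sketch below is a standalone variant, not the code above: `DayParser` and `day_from_html` are hypothetical names, and it is written for Python 3. One detail worth knowing is that HTMLParser hands attribute names to handle_starttag in lowercase, so the lookup uses "coldata0" even though the page spells it colData0.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class DayParser(HTMLParser):
    """Remembers the colData0 attribute of the first <tr> that has one."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.day = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        if tag == "tr" and self.day is None:
            self.day = dict(attrs).get("coldata0")


def day_from_html(html):
    """Parse an HTML string and return the day, or None if not found."""
    parser = DayParser()
    parser.feed(html)
    parser.close()
    return parser.day


print(day_from_html("<table><tr colData0='Friday'><td>x</td></tr></table>"))  # Friday
```
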
Gribouillis
Posting Maven
Moderator
2,786 posts since Jul 2008
Reputation Points: 1,044
Solved Threads: 691
 

@Gribouillis: I tried the code you gave me, but I get an error about a bad end tag:
HTMLParser.HTMLParseError: bad end tag: u"');\ndocument.write('IPT>"
...blah...
File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())
The source of the HTML page is too long to post, so please view the page source in your browser: http://www.xfire.com/friends/soulmazer/

@paulthom: I would prefer to just parse it myself, since the example I gave is simpler than what I am actually trying to accomplish (getting information from a profile). But how can I get the raw, unparsed HTML of a web page from a script?

SoulMazer
 

If by unparsed HTML you mean the source code of a page, you can get that with urllib:

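try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2

# The URL needs its scheme (http://). urlopen() returns a file-like object.
data = urlopen("http://www.daniweb.com")

# So we have to read() it to get the page source
print(data.read())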


Hope that is what you meant :P
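
One thing to watch on Python 3 is that read() returns bytes rather than text, so the result has to be decoded before string methods like split() behave as expected. Here is a rough sketch of a helper that does that. The name `fetch_html` and the UTF-8 fallback are my own assumptions, not something from the posts above.

```python
from urllib.request import urlopen  # Python 3


def fetch_html(url, default_charset="utf-8"):
    """Download url and return its body as a text string."""
    response = urlopen(url)
    try:
        raw = response.read()  # bytes on Python 3
        # Use the charset from the Content-Type header when one is given,
        # otherwise fall back to the assumed default.
        charset = response.headers.get_content_charset() or default_charset
    finally:
        response.close()
    # errors="replace" keeps a few undecodable bytes from raising an error.
    return raw.decode(charset, errors="replace")
```

The returned string can then be searched with ordinary string methods. Nothing is downloaded until fetch_html() is actually called with a full URL, scheme included.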

Paul Thompson
 

That's perfect! Ok, my problem's solved. Thank you very much.

SoulMazer
 

This question has already been solved
