So, I have a rather simple question today. I'll try to explain it with an example. Let's say there is a line of HTML in the page "www.mywebsite.com/py" that says "<tr colData0='Friday'>". However, the value in that line can change to "Sunday", "Tuesday", etc. What would be the easiest way for me to extract that value (Monday, Wednesday, etc.) from the line via a Python script?

Thanks in advance.

Have a look at Beautiful Soup. I have heard that it is an excellent tool for scraping webpages.

If that doesn't work then you can always try using string methods...

text = "<tr colData0='Friday'>"
# Split on the single quote into a list with 3 items; the value is the middle one.
parts = text.split("'")
print(parts[1])

Actually, the second idea would probably be the simplest :P
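If the quote style or the surrounding markup varies, the same string-based idea can be made a little more tolerant with a regular expression. This is only a sketch: the helper name extract_attr is made up for illustration, and it assumes the attribute value is wrapped in single or double quotes.

```python
import re

def extract_attr(html, name):
    """Return the value of the first attribute called `name` in html, or None.

    Hypothetical helper; assumes the value is quoted with ' or ".
    """
    pattern = re.compile(
        r"\b" + re.escape(name) + r"""\s*=\s*(['"])(.*?)\1""",
        re.IGNORECASE,
    )
    match = pattern.search(html)
    return match.group(2) if match else None

print(extract_attr("<tr colData0='Friday'>", "colData0"))  # Friday
```

The back-reference \1 makes sure the closing quote is the same kind as the opening one, so a value such as "O'Clock" inside double quotes is not cut short.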

You could use HTMLParser like this

import sys
if sys.version_info[0] < 3:
    from HTMLParser import HTMLParser
    from urllib2 import urlopen
else:
    from html.parser import HTMLParser
    from urllib.request import urlopen

class MyParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.day = None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            for key, value in attrs:
                # HTMLParser lowercases attribute names before passing them here
                if key == 'coldata0':
                    self.day = value

def get_day(url):
    parser = MyParser()
    html = urlopen(url).read().decode('utf8')
    parser.feed(html)
    parser.close()
    return parser.day

if __name__ == '__main__':
    print(get_day("http://www.mywebsite.com/py"))

@Gribouillis: I tried the code you gave me, but I receive an error about a bad end tag:

HTMLParser.HTMLParseError: bad end tag: u"</SCR');\ndocument.write('IPT>"
...blah...
File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())

Since the source of the HTML page is too long to post, just view the page source via your browser: http://www.xfire.com/friends/soulmazer/.
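One possible workaround for that error, as a sketch only: the message suggests the parser trips over JavaScript that builds a closing tag in pieces with document.write, so removing the <script> blocks before feeding the page to HTMLParser may avoid it. This assumes the script content is the only thing the parser rejects. The names SCRIPT_RE, DayParser and get_day_from_html are made up for illustration, and cutting markup out with a regular expression is crude and will not cope with every page.

```python
import re
from html.parser import HTMLParser  # Python 3; on Python 2: from HTMLParser import HTMLParser

# Matches a whole <script>...</script> element, ignoring case and spanning lines.
SCRIPT_RE = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)

class DayParser(HTMLParser):
    """Hypothetical parser that keeps the colData0 value of the first matching <tr>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.day = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr" and self.day is None:
            for key, value in attrs:
                # Attribute names arrive lowercased.
                if key == "coldata0":
                    self.day = value

def get_day_from_html(html):
    """Strip script blocks (assumed to be the cause of the error), then parse."""
    cleaned = SCRIPT_RE.sub("", html)
    parser = DayParser()
    parser.feed(cleaned)
    parser.close()
    return parser.day

# Made-up sample that imitates the split-up closing tag from the error message.
sample = (
    "<html><body>\n"
    "<script>document.write('<SCR');\ndocument.write('IPT src=x></SCR');\n"
    "document.write('IPT>');</script>\n"
    "<table><tr colData0='Friday'><td>x</td></tr></table>\n"
    "</body></html>"
)
print(get_day_from_html(sample))  # Friday
```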

@paulthom: Well, I would prefer to just parse it myself, as the example I gave is simpler than what I am actually trying to accomplish (getting information from a profile). But how could I get the unparsed HTML of a web page via a script?

If by "unparsed HTML via a script" you mean getting the source code of a page, then you can do that with urllib:

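import urllib  # Python 2; in Python 3 urlopen lives in urllib.request

# urlopen() returns a file-like object. The URL must include the scheme ("http://").
data = urllib.urlopen("http://www.daniweb.com")

# So we have to read() it to get the text
print(data.read())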

Hope that is what you meant :P
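Putting the two suggestions from this thread together, a rough sketch of "fetch the source, then pull the value out with string methods" could look like the following. The function names fetch_html and day_from_source are made up for illustration, and the code assumes the page is UTF-8 and that the attribute is written exactly as colData0='...' with single quotes.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

def fetch_html(url, encoding="utf-8"):
    """Download url and return the page source as text (encoding is an assumption)."""
    response = urlopen(url)
    try:
        return response.read().decode(encoding, "replace")
    finally:
        response.close()

def day_from_source(source):
    """Return the value of the first colData0='...' attribute, or None if absent."""
    marker = "colData0='"
    start = source.find(marker)
    if start == -1:
        return None
    # Everything up to the next single quote is the value.
    return source[start + len(marker):].split("'", 1)[0]
```

It would be used as day_from_source(fetch_html("http://www.mywebsite.com/py")), with the example URL from the question standing in for the real page.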

That's perfect! Ok, my problem's solved. Thank you very much.
