Simple HTML Parsing Question

Question

SoulMazer 26 Posting Whiz in Training

14 Years Ago

So, I have a rather simple question today. I'll try to explain it with an example. Let's say the page "www.mywebsite.com/py" contains the line of HTML "<tr colData0='Friday'>". However, the value can change to "Sunday", "Tuesday", etc. What would be the easiest way for me to extract that value (Friday, Sunday, etc.) from the line via a Python script?

Thanks in advance.

python


All 5 Replies

Gribouillis 1,391 Programming Explorer

14 Years Ago

You could use HTMLParser like this:

import sys
if sys.version_info[0] < 3:
    from HTMLParser import HTMLParser
    from urllib2 import urlopen
else:
    from html.parser import HTMLParser
    from urllib.request import urlopen

class MyParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.day = None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            for key, value in attrs:
                # HTMLParser lowercases attribute names, so compare in lowercase
                if key == 'coldata0':
                    self.day = value

def get_day(url):
    parser = MyParser()
    html = urlopen(url).read().decode('utf8')
    parser.feed(html)
    parser.close()
    return parser.day

if __name__ == '__main__':
    print(get_day("http://www.mywebsite.com/py"))
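The parser can also be tried without any network access by feeding it a literal string. The sketch below is Python 3 only, and the class name AttrCollector, the function collect_row_attrs, and the sample markup are made up for illustration. It records the attributes of every <tr> tag and shows that HTMLParser hands attribute names to handle_starttag in lowercase, so colData0 arrives as coldata0.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: collects the attributes of every <tr> start tag."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # attrs is a list of (name, value) pairs; names arrive lowercased
            self.rows.append(dict(attrs))


def collect_row_attrs(html):
    """Parse html and return one dict of attributes per <tr> tag."""
    parser = AttrCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


# Sample markup invented for this example
sample = "<table><tr colData0='Friday'><td>x</td></tr></table>"
rows = collect_row_attrs(sample)
print(rows)  # [{'coldata0': 'Friday'}]
```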

lllllIllIlllI 178 Veteran Poster

14 Years Ago

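If by "unparsed HTML via script" you mean getting the source code of a page, you can do that with urllib:

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2

# urlopen() returns a file-like object; the URL must include the scheme.
data = urlopen("http://www.daniweb.com")

# So we have to read() it to get the text
print(data.read())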

Hope that is what you meant :P

SoulMazer commented: Very helpful and follows through +2


lllllIllIlllI 178 Veteran Poster · Answer 1 · 2009-10-14T09:15:39+00:00

Have a look at Beautiful Soup.

I have heard that it is an excellent tool for scraping webpages.

If that doesn't work then you can always try using string methods...

text = "<tr colData0='Friday'>"
# Split on the quote character into a list with 3 items.
parts = text.split("'")
print(parts[1])

Actually, the second idea would probably be the simplest :P
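If the quoting style or the position of the attribute might vary, a regular expression is a slightly sturdier variant of the split() idea. This is only a minimal sketch, assuming the attribute is always named colData0 and its value is always quoted; the names DAY_RE and extract_day are invented here.

```python
import re

# Matches colData0='value' or colData0="value"; the attribute name is
# matched case-insensitively and \1 requires the same closing quote.
DAY_RE = re.compile(r"""colData0\s*=\s*(['"])(.*?)\1""", re.IGNORECASE)


def extract_day(html):
    """Return the first colData0 value found in html, or None."""
    match = DAY_RE.search(html)
    return match.group(2) if match else None


print(extract_day("<tr colData0='Friday'>"))              # Friday
print(extract_day('<tr class="row" colData0="Sunday">'))  # Sunday
print(extract_day("<tr>"))                                # None
```

Unlike text.split("'"), this does not depend on colData0 being the first quoted attribute on the line.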

SoulMazer 26 Posting Whiz in Training · Answer 2 · 2009-10-15T08:45:30+00:00

@Gribouillis: I tried the code you gave me, but I get an error about a bad end tag:

HTMLParser.HTMLParseError: bad end tag: u"</SCR');\ndocument.write('IPT>"
...blah...
File "/usr/lib/python2.6/HTMLParser.py", line 115, in error
raise HTMLParseError(message, self.getpos())

Since the source of the HTML page is too long to post, just view the page source via your browser: http://www.xfire.com/friends/soulmazer/.

@paulthom: Well, I would prefer to just parse it myself, as the example I gave is simpler than what I am actually trying to accomplish (get information from a profile). But how could I get the unparsed HTML of a web page via a script?
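The bad end tag appears to come from JavaScript on the page that writes a closing script tag in two pieces, which the Python 2.6 parser then tries to read as markup. One possible workaround, sketched here under the assumption that the data of interest never lives inside a script block, is to cut the <script> elements out of the source before parsing or searching it. The names SCRIPT_RE and strip_scripts and the sample page are hypothetical.

```python
import re

# Non-greedy match from an opening <script ...> to the first literal
# </script>; DOTALL lets the match span several lines.
SCRIPT_RE = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_scripts(html):
    """Return html with every <script>...</script> element removed."""
    return SCRIPT_RE.sub("", html)


# Invented page that imitates the split end tag from the error message
page = (
    "<html><script>document.write('</SCR');\n"
    "document.write('IPT>');</script>"
    "<table><tr colData0='Friday'></tr></table></html>"
)
cleaned = strip_scripts(page)
print(cleaned)  # <html><table><tr colData0='Friday'></tr></table></html>
```

The cleaned text can then be passed to parser.feed() or searched with string methods.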

SoulMazer 26 Posting Whiz in Training · Answer 3 · 2009-10-15T10:44:08+00:00

That's perfect! Ok, my problem's solved. Thank you very much.
