I'm trying to learn the very basics of HTML parsing in Python. Through these forums I learned what a parser is:
"
Parsing often means "perform syntax analysis" on a program or a text. It means checking whether a text obeys given grammar rules and extracting the corresponding information. For example, suppose that you define the rule that the structure of a question in English is auxiliary verb + subject + main verb + rest. Then the output of the statement parse("Are they playing football?") could be a hierarchy of tuples, or other objects, like this:

("question",
    ("auxiliary verb", "are"),
    ("subject", "they"),
    ("verb", "playing"),
    ("rest", "football"),
)

Programs and compilers handle such trees more easily than raw text." (thanks for that explanation Gribouillis)

So what would the output be if I fed this data:

<html>
<body>

<h1>My First Heading</h1>

<p>My first paragraph.</p>

</body>
</html>

to the Python HTML parser?
(i.e. from html.parser import HTMLParser)

I suggest that you write your own program to see what it does.

That's the point: I don't know what to expect. More importantly, I don't know what form the output takes or how to display it. I was hoping for an example.

EDIT: The Python documentation simply confuses me.

You can run this program to see how the parser's methods are called while the parser reads your HTML data:

# Python 3 import, as in the question (the Python 2 module was named HTMLParser)
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    # Every handler only reports that it was called, and with which arguments.

    def handle_starttag(self, *args):
        print("handle_starttag%s called." % str(args))

    def handle_startendtag(self, *args):
        print("handle_startendtag%s called." % str(args))

    def handle_endtag(self, *args):
        print("handle_endtag%s called." % str(args))

    def handle_data(self, *args):
        print("handle_data%s called." % str(args))

    def handle_charref(self, *args):
        print("handle_charref%s called." % str(args))

    def handle_entityref(self, *args):
        print("handle_entityref%s called." % str(args))

    def handle_comment(self, *args):
        print("handle_comment%s called." % str(args))

    def handle_decl(self, *args):
        print("handle_decl%s called." % str(args))

    def handle_pi(self, *args):
        print("handle_pi%s called." % str(args))

myData = """

<html>
<body>

<h1>My First Heading</h1>

<p>My first paragraph.</p>

</body>
</html>
"""

theParser = MyHTMLParser()
theParser.feed(myData)
theParser.close()

"""my output  ------>
handle_data('\n\n',) called.
handle_starttag('html', []) called.
handle_data('\n',) called.
handle_starttag('body', []) called.
handle_data('\n\n',) called.
handle_starttag('h1', []) called.
handle_data('My First Heading',) called.
handle_endtag('h1',) called.
handle_data('\n\n',) called.
handle_starttag('p', []) called.
handle_data('My first paragraph.',) called.
handle_endtag('p',) called.
handle_data('\n\n',) called.
handle_endtag('body',) called.
handle_data('\n',) called.
handle_endtag('html',) called.
handle_data('\n',) called.
"""

The next step is to fill in the method bodies so that your parser does something more useful than just printing the order of the calls and their arguments.
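As one possible illustration of filling in the method bodies, here is a minimal sketch that turns the handler calls into a nested structure similar to the "hierarchy of tuples" quoted in the question. The class name TreeBuilder and its root and stack attributes are made up for this example, and the sketch assumes well-formed HTML in which every start tag has a matching end tag.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Hypothetical helper: collects parser events into nested lists."""

    def __init__(self):
        super().__init__()
        self.root = ["document"]   # each node is [name, child, child, ...]
        self.stack = [self.root]   # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = [tag]
        self.stack[-1].append(node)  # attach to the enclosing element
        self.stack.append(node)      # the new element is now the innermost

    def handle_endtag(self, tag):
        # Assumption: the end tag always closes the innermost open element.
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:                     # skip whitespace-only chunks such as "\n\n"
            self.stack[-1].append(text)


builder = TreeBuilder()
builder.feed("<html><body><h1>My First Heading</h1>"
             "<p>My first paragraph.</p></body></html>")
builder.close()
tree = builder.root
print(tree)
```

HTMLParser itself does not return a tree; it only calls the handlers, so a structure like this has to be assembled by your own subclass.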

Thanks for taking the trouble to write that code...
but I see you haven't called any of the methods yourself. So how did you get any output? Thanks.
EDIT: I think it would be better to explain just one method (for example the "handle_starttag" method) and discuss how it can be used. That way, I might be able to use the documentation for the rest of the methods.

I don't call the methods; the parser does. When I call theParser.feed and theParser.close, the parser "reads" the HTML input and calls the methods. For example, when it encounters <h1>, it calls handle_starttag('h1', []), and so on.
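To see when the parser calls the methods, here is a small sketch that records the calls in a list instead of printing them. The class name EventRecorder and its events attribute are invented for this example. It feeds the input in two pieces and inspects the list after the first piece, before close() has been called.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records which handler ran, in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


recorder = EventRecorder()
recorder.feed("<h1>")                       # only a start tag so far
events_after_first_feed = list(recorder.events)
recorder.feed("My First Heading</h1>")      # the rest of the element
recorder.close()                            # processes anything still buffered
print(events_after_first_feed)
print(recorder.events)
```

The first list already contains the start tag, so feed() calls a handler as soon as it has read a complete piece of markup, and close() only deals with whatever input is still buffered at the end.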

Oh... So theParser.feed reads (and stores) the data, and when theParser.close() is called the parser calls the methods depending on the characters it encounters. Am I right?

Oh... one more question.
How would you write the code so that the content between the two tags is printed? Is that possible?
For example, for <p> Hello World </p> the output should be: Hello World

Well, if you only want to print the text content of the first paragraph, you could do it this way:

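from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.count_paragraph = 0   # number of <p> start tags seen so far
        self.print_data = False    # True only while inside the first <p>

    def handle_starttag(self, tag, args):
        if tag == "p":
            self.count_paragraph += 1
            if self.count_paragraph == 1:
                self.print_data = True

    def handle_data(self, data):
        # Called for the text between tags; print it only inside the first <p>.
        if self.print_data:
            print(data)

    def handle_endtag(self, tag):
        # Stop printing once the first paragraph is closed.
        if tag == "p" and self.print_data:
            self.print_data = False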
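If you would rather keep the text than print it, the same idea can be adapted. The following is only a sketch: the class name ParagraphCollector and its paragraphs and in_paragraph attributes are made up for this example, and it assumes paragraphs are not nested inside one another.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Hypothetical variant: stores the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self.in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")   # start a new, empty paragraph

    def handle_data(self, data):
        if self.in_paragraph:
            self.paragraphs[-1] += data  # text may arrive in several chunks

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False


collector = ParagraphCollector()
collector.feed("<p> Hello World </p><p>Second one</p>")
collector.close()
# Strip the spaces that surround the text inside the tags.
paragraphs = [text.strip() for text in collector.paragraphs]
print(paragraphs)
```

For the input above this should give a list with "Hello World" and "Second one", which you can then print or process however you like.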

thanks
