0

I'm trying to learn the very basics of HTML parsing in Python. Through these forums I learned what a parser is:
"
Parsing often means "perform syntax analysis" on a program or a text. It means checking whether a text obeys given grammar rules and extracting the corresponding information. For example, suppose that you define the rule that the structure of a question in English is auxiliary verb + subject + main verb + rest. Then the output of the statement parse("Are they playing football?") could be a hierarchy of tuples, or other objects, like this:

("question",
    ("auxiliary verb", "are"),
    ("subject", "they"),
    ("verb", "playing"),
    ("rest", "football"),
)

Programs and compilers handle such trees more easily than raw text." (thanks for that explanation Gribouillis)
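
As a rough illustration of the quoted rule, here is a toy sketch of such a parse() function. The function is hypothetical, and it assumes that the auxiliary verb, the subject and the main verb are each exactly one word, which is far too naive for real English.

```python
def parse(sentence):
    """Toy parser for the rule: auxiliary verb + subject + main verb + rest.

    Assumption: the first three words are the auxiliary verb, the subject
    and the main verb; everything after them is the "rest".
    """
    words = sentence.rstrip("?").split()
    if len(words) < 3:
        raise ValueError("expected an auxiliary verb, a subject and a verb")
    auxiliary, subject, verb, *rest = words
    return (
        "question",
        ("auxiliary verb", auxiliary.lower()),
        ("subject", subject),
        ("verb", verb),
        ("rest", " ".join(rest)),
    )


tree = parse("Are they playing football?")
print(tree)
```

For the example sentence this produces the same hierarchy of tuples as shown in the quote.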

So what would the output be if I fed this data

<html>
<body>

<h1>My First Heading</h1>

<p>My first paragraph.</p>

</body>
</html>

to the Python HTML parser (i.e. from html.parser import HTMLParser)?

0

That's the point: I don't know what to expect. More importantly, I don't know what form the output takes or how to display it. I was hoping for an example.

EDIT: The Python documentation simply confuses me.

0

You can run this program to see how the parser's methods are called while the parser reads your HTML data:

# Python 3 import, matching the question; in Python 2 the module was named HTMLParser.
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    # Every handler only reports its own name and the arguments the parser passed in.

    def handle_starttag(self, *args):
        print("handle_starttag%s called." % str(args))

    def handle_startendtag(self, *args):
        print("handle_startendtag%s called." % str(args))

    def handle_endtag(self, *args):
        print("handle_endtag%s called." % str(args))

    def handle_data(self, *args):
        print("handle_data%s called." % str(args))

    def handle_charref(self, *args):
        print("handle_charref%s called." % str(args))

    def handle_entityref(self, *args):
        print("handle_entityref%s called." % str(args))

    def handle_comment(self, *args):
        print("handle_comment%s called." % str(args))

    def handle_decl(self, *args):
        print("handle_decl%s called." % str(args))

    def handle_pi(self, *args):
        print("handle_pi%s called." % str(args))

myData = """

<html>
<body>

<h1>My First Heading</h1>

<p>My first paragraph.</p>

</body>
</html>
"""

theParser = MyHTMLParser()
theParser.feed(myData)
theParser.close()

"""my output  ------>
handle_data('\n\n',) called.
handle_starttag('html', []) called.
handle_data('\n',) called.
handle_starttag('body', []) called.
handle_data('\n\n',) called.
handle_starttag('h1', []) called.
handle_data('My First Heading',) called.
handle_endtag('h1',) called.
handle_data('\n\n',) called.
handle_starttag('p', []) called.
handle_data('My first paragraph.',) called.
handle_endtag('p',) called.
handle_data('\n\n',) called.
handle_endtag('body',) called.
handle_data('\n',) called.
handle_endtag('html',) called.
handle_data('\n',) called.
"""

The next step is to fill in the method bodies so that your parser does something more useful than just printing the order of the calls and their arguments.
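
As one possible way of filling in the methods, here is a minimal sketch that builds a nested structure similar to the hierarchy of tuples in the question. The class name TreeBuilder and the [tag, child, child, ...] list layout are my own choices for illustration, not something html.parser provides, and the sketch assumes well-formed input where every end tag closes the innermost open tag.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Hypothetical helper: builds nested lists of the form [tag, child, ...]."""

    def __init__(self):
        super().__init__()
        self.root = ["document"]
        self.stack = [self.root]  # stack[-1] is the element currently open

    def handle_starttag(self, tag, attrs):
        node = [tag]
        self.stack[-1].append(node)  # attach to the parent element
        self.stack.append(node)      # the new element is now the open one

    def handle_endtag(self, tag):
        # Assumption: well-formed input, so this closes the innermost element.
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only text between tags
            self.stack[-1].append(text)


my_data = """
<html>
<body>

<h1>My First Heading</h1>

<p>My first paragraph.</p>

</body>
</html>
"""

builder = TreeBuilder()
builder.feed(my_data)
builder.close()
tree = builder.root
print(tree)
```

For the sample page, the resulting tree has an "html" node containing a "body" node, which in turn holds the "h1" and "p" nodes together with their text.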

0

Thanks for taking the trouble to write that code...
but I see you haven't called any of the methods yourself. So how did you get any output? Thanks.
EDIT: I think it would be better to explain just one method (for example, the "handle_starttag" method) and discuss how it can be used. That way, I might be able to use the documentation for the rest of the methods.

0

I don't call the methods; the parser does. When I call theParser.feed and theParser.close, the parser "reads" the HTML input and calls the methods. For example, when it encounters <h1>, it calls handle_starttag('h1', []), etc.
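
To make this visible, here is a small sketch that records each handler call in a list instead of printing it. The class name RecordingParser and the ("start", tag) style event tuples are made up for this example. Because the list is copied before close() is called, it also shows that complete elements are already processed while feed() runs; close() only forces processing of any data that is still buffered.

```python
from html.parser import HTMLParser


class RecordingParser(HTMLParser):
    """Hypothetical helper: remembers which handlers the parser invoked."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


parser = RecordingParser()
parser.feed("<h1>My First Heading</h1>")  # we never call a handler ourselves
events_before_close = list(parser.events)  # snapshot taken before close()
parser.close()

for event in events_before_close:
    print(event)
```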

0

Oh... So theParser.feed reads (and stores) the data, and when theParser.close() is called, the parser calls the methods depending on the characters it encounters. Am I right?

0

Oh... one more question.
How would you write the code so that the content between the two tags is printed? Is that possible?
For example, for <p> Hello World </p> the output should be Hello World

0

Well, if you only want to print the text content of the first paragraph, you could go this way:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.count_paragraph = 0   # number of <p> start tags seen so far
        self.print_data = False    # True only while inside the first <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.count_paragraph += 1
            if self.count_paragraph == 1:
                self.print_data = True

    def handle_data(self, data):
        if self.print_data:
            print(data)

    def handle_endtag(self, tag):
        # Stop printing once the first paragraph has been closed.
        if tag == "p" and self.print_data:
            self.print_data = False
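
If you would rather keep the text of every paragraph than print only the first one, a variation along the same lines could look like the sketch below. The class name ParagraphCollector and the sample input are my own, and the sketch assumes that <p> elements are not nested.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Hypothetical variation: collects the text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")  # start a new, empty paragraph

    def handle_data(self, data):
        if self.in_paragraph:
            # Text inside nested tags such as <b> is also added here.
            self.paragraphs[-1] += data

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False


collector = ParagraphCollector()
collector.feed("<body><h1>Title</h1><p> Hello World </p><p>Second <b>one</b></p></body>")
collector.close()
paragraphs = [text.strip() for text in collector.paragraphs]
print(paragraphs)
```

Storing the text in a list instead of printing it inside handle_data lets you decide afterwards what to do with it, for example stripping the surrounding spaces as done here.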