If I had a data file with HTML/XHTML tags:

Code:
<html>
<head>
<title> data file </title>
</head>

<body>
<center><h1>
heading 1
</h1></center>

<b>bolded</b>
<P>paragraph</P>
<P>
<br />


How would I get a Python program to read ONLY the start and end tags and enqueue them in a queue?

For example, the queue for this would look like:

<html>
<head>
<title>
</title>
</head>
<body>
<center>
<h1>
</h1>
</center>
<b>
</b>
<P>
</P>
<P>
<br />

Thanks!

You can use the HTMLParser module to parse the file and collections.deque to implement the queue:

from HTMLParser import HTMLParser  # Python 2 module name (html.parser in Python 3)
from collections import deque  # double-ended queue, usable as FIFO or LIFO

class MyHTMLParser(HTMLParser):

    def __init__(self):
        HTMLParser.__init__(self)
        self.tag_deque = deque()

    def handle_starttag(self, tag, attrs):
        self.tag_deque.append("<{t}>".format(t=tag))

    def handle_endtag(self, tag):
        self.tag_deque.append("</{t}>".format(t=tag))


def main():
    parser = MyHTMLParser()
    filename = "mydatafile.html"
    with open(filename) as f:  # close the file once it has been read
        parser.feed(f.read())
    parser.close()
    print(parser.tag_deque)

if __name__ == "__main__":
    main()
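
Once the tags are in the deque, they can be taken out again in the order they were added. This is only a minimal sketch of that FIFO behaviour; the tag strings and the names tag_queue and dequeued are made up for illustration and are not part of the code above.

```python
from collections import deque

tag_queue = deque()
for tag in ["<html>", "<head>", "</head>", "</html>"]:
    tag_queue.append(tag)  # enqueue at the right end

dequeued = []
while tag_queue:
    # dequeue from the left end, so the oldest item comes out first
    dequeued.append(tag_queue.popleft())

print(dequeued)
```

Using append() together with popleft() gives queue behaviour; using append() together with pop() would give stack behaviour instead.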

Is there a simpler way to do this without having to import things? I'm a beginner at Python.

You will always have to import stuff for anything but the most trivial tasks.
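
For a very simple file like the one in the question, a rough import-free approach is possible with plain string methods. The sketch below is an assumption about what "without imports" could look like, not a replacement for a real parser: the function name extract_tags and the sample string are hypothetical, and the scan will be fooled by comments, scripts, or a ">" inside an attribute value.

```python
def extract_tags(text):
    """Return every '<...>' chunk found in text, in order of appearance."""
    tags = []
    pos = 0
    while True:
        start = text.find("<", pos)   # next opening bracket
        if start == -1:
            break
        end = text.find(">", start)   # matching closing bracket
        if end == -1:
            break                     # unterminated tag, stop scanning
        tags.append(text[start:end + 1])
        pos = end + 1
    return tags


sample = "<center><h1>\nheading 1\n</h1></center>\n<b>bolded</b>\n<P>paragraph</P>\n<P>\n<br />\n"
for tag in extract_tags(sample):
    print(tag)
```

This keeps each tag exactly as written, including its case and any attributes, which is why it is only suitable for small, well-behaved input.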

You really don't need a queue; a list will do fine. Here is a simpler version of Gribouillis' code that actually works in Python 2 or Python 3:

# extract tag names in an html code file
# works with Python2 and Python3

try:
    # Python2
    import HTMLParser as hp
except ImportError:
    # Python3
    import html.parser as hp

class MyHTMLParser(hp.HTMLParser):
    def __init__(self):
        hp.HTMLParser.__init__(self)
        self.tag_list = list()

    def handle_starttag(self, tag, attrs):
        self.tag_list.append("<%s>" % tag)

    def handle_endtag(self, tag):
        self.tag_list.append("</%s>" % tag)


parser = MyHTMLParser()
# pick an HTML file you have in the working directory
# or give the full file path
filename = "test1.htm"
with open(filename) as f:  # close the file once it has been read
    parser.feed(f.read())
parser.close()
for tag in parser.tag_list:
    print(tag)

"""typical result -->
<html>
<head>
<title>
</title>
</head>
<body>
<table>
<tr>
<td>
<img>
<a>
</a>
</td>
</tr>
</table>
</body>
</html>
"""
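
One detail worth knowing: HTMLParser lowercases tag names, and for a self-closing tag such as <br /> it calls handle_startendtag(), whose default implementation calls handle_starttag() and then handle_endtag(), so the list would get both <br> and </br>. The sketch below is a Python 3 variant that overrides that method to record the tag once; the class name TagCollector and the sample string are made up for illustration.

```python
from html.parser import HTMLParser  # Python 3 module name


class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tag_list = []

    def handle_starttag(self, tag, attrs):
        self.tag_list.append("<%s>" % tag)

    def handle_endtag(self, tag):
        self.tag_list.append("</%s>" % tag)

    def handle_startendtag(self, tag, attrs):
        # called for self-closing tags like <br />; record them as one entry
        self.tag_list.append("<%s />" % tag)


collector = TagCollector()
collector.feed("<b>bolded</b>\n<P>paragraph</P>\n<P>\n<br />\n")
collector.close()
for tag in collector.tag_list:
    print(tag)
```

Note that the <P> tags from the question come out as <p>, because the parser normalizes tag names to lowercase.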

Note that Python is a modular language and comes with many thoroughly tested and optimized modules. To code in Python means using those modules to your advantage. Python syntax may be easy, but remembering all those modules may use all the power of your brain!
