If I had a data file with HTML/XHTML tags:

Code:
<html>
<head>
<title> data file </title>
</head>

<body>
<center><h1>
heading 1
</h1></center>

<b>bolded</b>
<P>paragraph</P>
<P>
<br />


How would I get a Python program to read ONLY the start and end tags and enqueue them in a queue?

For example, the queue for this would look like:

<html>
<head>
<title>
</title>
</head>
<body>
<center>
<h1>
</h1>
</center>
<b>
</b>
<P>
</P>
<P>
<br />

Thanks!

You can use the HTMLParser module to pick out the tags and collections.deque to implement the queue:

from HTMLParser import HTMLParser  # Python2 module name; in Python3 it is html.parser
from collections import deque  # deque is a double-ended queue, usable as FIFO or LIFO

class MyHTMLParser(HTMLParser):

    def __init__(self):
        HTMLParser.__init__(self)
        self.tag_deque = deque()

    def handle_starttag(self, tag, attrs):
        self.tag_deque.append("<{t}>".format(t=tag))

    def handle_endtag(self, tag):
        self.tag_deque.append("</{t}>".format(t=tag))


def main():
    parser = MyHTMLParser()
    filename = "mydatafile.html"
    with open(filename) as f:  # make sure the file gets closed
        parser.feed(f.read())
    parser.close()
    print(parser.tag_deque)

if __name__ == "__main__":
    main()
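
To show the queue behaviour itself, here is a minimal sketch for Python3 that feeds the parser from a string instead of a file and then dequeues the tags from the front with popleft(). The names TagQueueParser and tags_in_order are made up for this illustration, and the sample string is just a cut-down version of the data file in the question.

```python
from collections import deque
from html.parser import HTMLParser  # Python3 module name


class TagQueueParser(HTMLParser):
    """Collects start and end tags in document order (hypothetical name)."""

    def __init__(self):
        super().__init__()
        self.tag_deque = deque()

    def handle_starttag(self, tag, attrs):
        # enqueue at the back
        self.tag_deque.append("<{}>".format(tag))

    def handle_endtag(self, tag):
        self.tag_deque.append("</{}>".format(tag))


def tags_in_order(html_text):
    """Parse html_text and drain the queue first-in, first-out."""
    parser = TagQueueParser()
    parser.feed(html_text)
    parser.close()
    result = []
    while parser.tag_deque:
        # dequeue from the front, so tags come out in the order they went in
        result.append(parser.tag_deque.popleft())
    return result


sample = "<html><body><b>bolded</b></body></html>"
print(tags_in_order(sample))
# ['<html>', '<body>', '<b>', '</b>', '</body>', '</html>']
```

append() adds to the right end and popleft() removes from the left end, which is what makes the deque behave as a FIFO queue here.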

Is there a simpler way to do this without having to import things? I'm a beginner at Python.

You will always have to import stuff for anything but the most trivial tasks.
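
For a small, well-behaved file like the one in the question, the trivial version can be sketched with plain string methods and no imports. This is only a rough sketch: extract_tags is a made-up name, and it assumes every "<" starts a tag and the next ">" ends it, so it will be fooled by comments, scripts, or a ">" inside an attribute value. Attributes, if any, are kept as part of the tag text.

```python
def extract_tags(text):
    """Return every '<...>' chunk in text, in order of appearance."""
    tags = []
    pos = 0
    while True:
        start = text.find("<", pos)
        if start == -1:
            break  # no more tags
        end = text.find(">", start)
        if end == -1:
            break  # unterminated tag at the end of the input
        tags.append(text[start:end + 1])
        pos = end + 1  # continue scanning after this tag
    return tags


sample = "<center><h1>\nheading 1\n</h1></center>\n<b>bolded</b>\n<P>\n<br />\n"
for tag in extract_tags(sample):
    print(tag)
```

On real-world HTML those assumptions break quickly, which is where a proper parser module earns its import.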

You really don't need a queue; a list will do fine. Here is a simpler version of Gribouillis' code that works in both Python2 and Python3 ...

# extract tag names in an html code file
# works with Python2 and Python3

try:
    # Python2
    import HTMLParser as hp
except ImportError:
    # Python3
    import html.parser as hp

class MyHTMLParser(hp.HTMLParser):
    def __init__(self):
        hp.HTMLParser.__init__(self)
        self.tag_list = list()

    def handle_starttag(self, tag, attrs):
        self.tag_list.append("<%s>" % tag)

    def handle_endtag(self, tag):
        self.tag_list.append("</%s>" % tag)


parser = MyHTMLParser()
# pick an HTML file you have in the working directory
# or give the full file path
filename = "test1.htm"
with open(filename) as f:  # make sure the file gets closed
    parser.feed(f.read())
parser.close()
for tag in parser.tag_list:
    print(tag)

"""typical result -->
<html>
<head>
<title>
</title>
</head>
<body>
<table>
<tr>
<td>
<img>
<a>
</a>
</td>
</tr>
</table>
</body>
</html>
"""
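
One detail relative to the queue asked for in the question: HTMLParser lowercases tag names, so <P> is reported as <p>, and a self-closing tag such as <br /> is routed through handle_startendtag, which by default calls handle_starttag and then handle_endtag (giving <br> followed by </br>). A minimal Python3 sketch that keeps <br /> as a single entry could look like this. The class name SelfClosingAwareParser is made up for the illustration.

```python
from html.parser import HTMLParser  # Python3 module name


class SelfClosingAwareParser(HTMLParser):
    """Records tags in order, keeping self-closing tags as one entry."""

    def __init__(self):
        super().__init__()
        self.tag_list = []

    def handle_starttag(self, tag, attrs):
        self.tag_list.append("<%s>" % tag)

    def handle_endtag(self, tag):
        self.tag_list.append("</%s>" % tag)

    def handle_startendtag(self, tag, attrs):
        # called for XHTML-style tags such as <br />; overriding it stops
        # the default start-tag-then-end-tag behaviour
        self.tag_list.append("<%s />" % tag)


parser = SelfClosingAwareParser()
parser.feed("<P>paragraph</P>\n<P>\n<br />\n")
parser.close()
print(parser.tag_list)
# ['<p>', '</p>', '<p>', '<br />']
```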

Note that Python is a modular language and comes with many thoroughly tested and optimized modules. Coding in Python means using those modules to your advantage. Python syntax may be easy, but remembering all those modules may take all the power of your brain!
