Hi,

I'm having trouble just printing particular strings out of a sentence.

<h1>hi my name is</h1>

If I just want the program to print out

<h1>
</h1>

and ignore the text in between, how would I do that?

I have tried converting everything to ASCII and printing within the range, but that doesn't exactly work. :/

thanks!


We just discussed that in detail; all you have to do is slightly modify the solution:

# extract tag names in html code

try:
    # Python2
    import HTMLParser as hp
except ImportError:
    # Python3
    import html.parser as hp

class MyHTMLParser(hp.HTMLParser):
    def __init__(self):
        hp.HTMLParser.__init__(self)
        self.tag_list = list()

    def handle_starttag(self, tag, attrs):
        self.tag_list.append("<%s>" % tag)

    def handle_endtag(self, tag):
        self.tag_list.append("</%s>" % tag)


parser = MyHTMLParser()
html_str = """<h1>hi my name is</h1>"""
parser.feed(html_str)
parser.close()
for tag in parser.tag_list:
    print(tag)

"""my result -->
<h1>
</h1>
"""

I understand. Thank you.

Though, I am assigned to write this function just using the String library to write my own parser.

So far, I have...

def readFile(filename):
    # print the first whitespace-separated item of every non-empty line
    with open(filename, "r") as file:
        for line in file:
            items = line.split()
            if len(items) > 0:
                print(items[0])

It's not completely correct, though, because it grabs some of the words within the sentence on various lines.

---
Though, I am assigned to write this function just using the String library to write my own parser.
---

You need to tell us this sort of thing right away!

You could use find() assuming each line has one opening and one closing tag. Here is an example of string function find():

line = '<h1>hi my name is Fred</h1>'

# find index of first '<'
ix1 = line.find('<')

# find index of first '>'
ix2 = line.find('>')

# find index of first '</'
ix3 = line.find('</')

# find index of second '>' (past ix2)
ix4 = line.find('>', ix2+1)

# test
print( ix1, ix2, ix3, ix4 )

# now get tags using string slicing and the indexes found
tag1 = line[ix1:ix2+1]
tag2 = line[ix3:ix4+1]

print( tag1 )  # <h1>
print( tag2 )  # </h1>
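
One caveat with the example above: find() returns -1 when the substring is not there, so on a line without any tag the slices quietly come back as empty strings. Below is a minimal sketch of a guard, assuming the same layout of one opening and one closing tag per line. The helper name first_tag_pair is made up for illustration.

```python
def first_tag_pair(line):
    # hypothetical helper: returns (opening_tag, closing_tag) or None
    ix1 = line.find('<')
    ix2 = line.find('>')
    ix3 = line.find('</')
    ix4 = line.find('>', ix2 + 1)
    # find() returns -1 for "not found", so check before slicing
    if -1 in (ix1, ix2, ix3, ix4):
        return None
    return line[ix1:ix2 + 1], line[ix3:ix4 + 1]

print(first_tag_pair('<h1>hi my name is Fred</h1>'))  # ('<h1>', '</h1>')
print(first_tag_pair('just some text'))               # None
```

This only covers the simple case; a line with two tags of the same kind still needs a different approach.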

Thanks again.

There are a couple of issues I am having. There are a couple of lines in the data file with two closing tags on one line, like this:

</h1></center>

and the output prints them together like that, instead of on separate lines.

Also, I have to place these tags into the list in order of when they are found, but the spaces in between are being added as well.

Here is my code as it stands:

tags = []
for line in file:
    # find index of first '<'
    ix1 = line.find('<')
    # find index of first '>'
    ix2 = line.find('>')
    # find index of first '</'
    ix3 = line.find('</')
    # find index of second '>' (past ix2)
    ix4 = line.find('>', ix2+1)
    # now get tags using string slicing and the indexes found
    tag1 = line[ix1:ix2+1]
    tags.append(tag1)  # <h1>
    tag2 = line[ix3:ix4+1]
    tags.append(tag2)  # </h1>

print(tags)

Thanks again!
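
Both problems seem to come from the assumption of one opening and one closing tag per line. With </h1></center>, the first '<' and the first '</' are the same position and ix4 lands on the last '>', so tag2 becomes the whole line. On a line with no tags at all, find() returns -1 everywhere, the slices are empty strings, and those still get appended, which is probably the "spaces" showing up in the list. One possible sketch, still using only string methods, is to keep calling find() from where the previous tag ended until no '<' is left. The name extract_tags and the sample lines are made up for illustration, and it assumes a tag never spans two lines.

```python
def extract_tags(line):
    # hypothetical helper: collect every <...> chunk in a line, in order
    found = []
    pos = 0
    while True:
        start = line.find('<', pos)
        if start == -1:
            break              # no more tags on this line
        end = line.find('>', start + 1)
        if end == -1:
            break              # unclosed '<', give up on this line
        found.append(line[start:end + 1])
        pos = end + 1          # continue searching after this tag
    return found

tags = []
for line in ['<center><h1>hi my name is', 'plain text only', '</h1></center>']:
    tags.extend(extract_tags(line))

print(tags)  # ['<center>', '<h1>', '</h1>', '</center>']
```

Lines without tags add nothing to the list, and two tags on one line come out as two separate entries in the order they were found.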
