I want this program to read a text file, then find and replace anything that starts with < and ends with >.
For example, if it finds <html>, it should replace that with ****.
But when I tested it, it didn't work the way I expected. Any suggestions?

def remove_html(text):
    txtLIST = list(text)
    i = 0
    while i < len(txtLIST):
        if txtLIST[i] == '<':
            while txtLIST[i] != '>':
                txtLIST.pop(i)
            txtLIST.pop(i)
        else:
            i = i + 1
    replace = 4*'*'
    return replace.join(txtLIST)

file = open('remHTML.txt','r')
test = file
display = remove_html(test)
print display


Lines 11 and 12 put "****" between every single character. Other than this, is the output what you expect?
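
To see why, here is a small sketch of what str.join does with a list of single characters, followed by one possible way to restructure the loop so that each tag becomes a single "****". The helper name replace_tags and its placeholder parameter are made up for this illustration, and it assumes the input is already a string (for a file, that means the result of file.read(), not the file object itself).

```python
# str.join puts the separator between every element of the list.
chars = list("abc")
joined = "****".join(chars)
print(joined)  # a****b****c


def replace_tags(text, placeholder="****"):
    """Replace each <...> run with placeholder (hypothetical helper)."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "<":
            end = text.find(">", i)
            if end == -1:
                # No closing '>': keep the rest of the text unchanged.
                out.append(text[i:])
                break
            out.append(placeholder)
            i = end + 1  # continue after the closing '>'
        else:
            out.append(text[i])
            i += 1
    return "".join(out)  # join with an empty separator


print(replace_tags("<html>hi</html>"))  # ****hi****
```

Joining with an empty string at the end, and appending the placeholder once per tag, avoids the separator showing up between every character.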

By the way, you may want to look at the BeautifulSoup Python library for working with html files (and extracting text from them).

I agree with this, but it now looks like boiishuvo's approach will destroy the structure of the HTML.
Should it replace the whole tag like this, or keep the <> intact?

>>> s = '<html>'
>>> s.replace('<html>', '***')
'***'

Something like this with regex.

import re

html = '''\
<html>
<head>
    <title></title>
</head>
<body>

</body>
</html>'''

print re.sub(r'<.*>', '****', html)
"""Output-->
****
****
    ****
****
****

****
****
"""

Wait, I read the guide: they expect me to write code that removes all HTML markup, including < and >, from a text file and then displays whatever text is left.

I've never heard of the BeautifulSoup Python module, and the guide doesn't expect me to use it anyway.

For example, a text file that shows: <title>Lachlan Osborn</title>
and the output should be like that: Lachlan Osborn

It can depend on how that text file looks.
You can change the regex to something like this.

import re

data = '''\
<title>Lachlan Osborn</title>
<head>hello world</head>
'''

text = re.sub(r'<.*?>', '', data)
print text.strip()
"""Output-->
Lachlan Osborn
hello world
"""
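
The only change from the earlier pattern is the ? after .*, which makes the match non-greedy. A short sketch of the difference on a line that holds two tags (the variable names here are just for illustration):

```python
import re

line = "<title>Lachlan Osborn</title>"

# Greedy: .* runs from the first '<' to the LAST '>' on the line,
# so the text between the tags is swallowed too.
greedy = re.sub(r"<.*>", "", line)

# Non-greedy: .*? stops at the first '>' it can, so each tag
# is matched on its own and the text in between survives.
lazy = re.sub(r"<.*?>", "", line)

print(repr(greedy))  # ''
print(repr(lazy))    # 'Lachlan Osborn'
```

Note that . does not match a newline by default, so both patterns only work on tags that start and end on the same line.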

Thanks. That's a good example, but I modified the code to meet the guide's requirements and now the output doesn't show anything.

def remove_html(text):
    import re
    info = open('remHTML.txt','r')
    data = info
    info.close()
    text = re.sub(r'<.*?>', '', data)

text = []
text_list = remove_html(text)
print(text_list.strip())

you are not returning anything.

Yeah, I forgot to add that,

but lines 8-10 seem incorrect.

You are modifying the passed-in list text, aren't you? I don't understand why you try to assign the result to another variable, text_list.
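
Putting these comments together, one possible shape for the function is sketched below. It assumes the function should take a filename, read the whole file with read() (the file object itself is not a string, so re.sub cannot work on it), and return the stripped text. The demo writes its own throwaway sample file as a stand-in for remHTML.txt so that it can run anywhere.

```python
import os
import re
import tempfile


def remove_html(filename):
    """Return the contents of filename with <...> tags removed."""
    with open(filename, "r") as info:
        data = info.read()  # read() returns a string; the file object is not one
    return re.sub(r"<.*?>", "", data)  # return the result to the caller


# Demo with a temporary sample file (stand-in for remHTML.txt).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("<title>Lachlan Osborn</title>\n")
    sample_path = tmp.name

result = remove_html(sample_path).strip()
os.remove(sample_path)  # clean up the sample file
print(result)  # Lachlan Osborn
```

With a real file, the call would simply be print(remove_html('remHTML.txt').strip()).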

How do I run the procedure?

def remove_html(text)
info = '''<table>
    <tr align = "center">
        <h1> Lachlan Osborn </h1>
        <p> Address: 5 Smith Street, Manly <br>
        Date of Birth: 26th April 1993 </p>
        
        <a href="semester.html"><b>My Semester Units</b></a>
        <p><b>Check out my <a href="hobbies.html">hobbies.</a></b></p>
    </tr>
</center>'''

def remove_html(text, info):
    import re
    text = re.sub(r'<.*?>', '', info)
    return text

remove_html(text.strip())
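
One way to run it, as a rough sketch: the function only needs the HTML string, so here it takes a single parameter, the call passes info to it, and the returned value is stripped and printed. The shortened info string is a stand-in for the full one in the post.

```python
import re

# Shortened stand-in for the info string in the post.
info = '''<h1> Lachlan Osborn </h1>
<p><b>Check out my <a href="hobbies.html">hobbies.</a></b></p>'''


def remove_html(text):
    """Strip <...> tags from text and return what is left."""
    return re.sub(r'<.*?>', '', text)


# Call the function with the string, then strip and print the result.
cleaned = remove_html(info).strip()
print(cleaned)
```

The stray first line def remove_html(text) without a colon or body would need to be deleted, and strip() has to be called on the returned string rather than on an undefined name.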