
remove HTML markup in the input text, return a plain text string

boiishuvo
Junior Poster in Training
86 posts since Jun 2009

I want this program to read a text file, then find and replace anything that starts with < and ends with >.
For example, if it finds <html>, it should replace that with ****.
But when I tested it, it didn't work the way I expected. Any suggestions?

def remove_html(text):
    txtLIST = list(text)
    i = 0
    while i < len(txtLIST):
        if txtLIST[i] == '<':
            while txtLIST[i] != '>':
                txtLIST.pop(i)
            txtLIST.pop(i)
        else:
            i = i + 1
    replace = 4*'*'
    return replace.join(txtLIST)

file = open('remHTML.txt','r')
test = file
display = remove_html(test)
print display

nosehat
Newbie Poster
15 posts since Dec 2010

Lines 11 and 12 of your code (replace = 4*'*' and return replace.join(txtLIST)) put "****" between every single character. Other than this, is the output what you expect?

By the way, you may want to look at the BeautifulSoup Python library for working with html files (and extracting text from them).
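
To make the join point concrete, here is a minimal sketch (not the original poster's code) of a character-by-character version that joins with an empty string instead of '****'. The inside_tag flag and the kept list are names invented for this illustration. Note also that the posted script passes the file object itself to remove_html; calling file.read() would hand it the text as one string.

```python
def remove_html(text):
    """Drop everything from each '<' up to and including the next '>'."""
    kept = []            # characters that are outside any tag
    inside_tag = False   # True while we are between '<' and '>'
    for ch in text:
        if ch == '<':
            inside_tag = True
        elif ch == '>' and inside_tag:
            inside_tag = False
        elif not inside_tag:
            kept.append(ch)
    # ''.join puts nothing between the characters
    return ''.join(kept)

# str.join inserts the separator between every pair of items
joined_with_stars = '****'.join(list('abc'))   # 'a****b****c'
joined_plain = ''.join(list('abc'))            # 'abc'

print(remove_html('<title>Lachlan Osborn</title>'))  # Lachlan Osborn
```

Unlike the pop-based loop, this version does not raise an IndexError when a '<' has no matching '>'; it simply drops the rest of the text after that '<'.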

snippsat
Veteran Poster
1,039 posts since Aug 2008

"By the way, you may want to look at the BeautifulSoup Python library for working with html files (and extracting text from them)."

I agree with this, but it now looks like boiishuvo will destroy the structure of the HTML.
Should it replace the whole tag like this, or keep the < and > intact?

>>> s = '<html>'
>>> s.replace('<html>', '***')
'***'

Something like this with regex.

import re

html = '''\
<html>
<head>
    <title></title>
</head>
<body>

</body>
</html>'''

print re.sub(r'<.*>', '****', html)
"""Output-->
****
****
    ****
****
****

****
****
"""

boiishuvo

Wait, I read the guide: they expect me to write code that removes all HTML markup, including < and >, from a text file and then displays the text that is left.

I've never heard of the BeautifulSoup Python module, and the guide doesn't expect me to use it anyway.

For example, for a text file that contains: <title>Lachlan Osborn</title>
the output should be: Lachlan Osborn

snippsat

"For example, a text file that shows: <title>Lachlan Osborn</title>
and the output should be like that: Lachlan Osborn"

It can depend on how the text file looks.
You can change the regex to something like this.

import re

data = '''\
<title>Lachlan Osborn</title>
<head>hello world</head>
'''

text = re.sub(r'<.*?>', '', data)
print text.strip()
"""Output-->
Lachlan Osborn
hello world
"""
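
The difference between the two patterns in this thread is the ? after .*, which makes the match non-greedy. A small sketch of what that changes on a single line follows. The split_tag sample and the alternative pattern <[^>]*> are additions for illustration, not something from the posts above.

```python
import re

line = '<title>Lachlan Osborn</title>'

# '.*' is greedy: it runs to the LAST '>' on the line, swallowing the text too
greedy = re.sub(r'<.*>', '', line)
# '.*?' is non-greedy: it stops at the FIRST '>' after each '<'
lazy = re.sub(r'<.*?>', '', line)

print(repr(greedy))  # ''
print(repr(lazy))    # 'Lachlan Osborn'

# '.' does not match a newline by default, so a tag that is split
# across two lines is missed by '<.*?>' but caught by '<[^>]*>'
split_tag = '<a\nhref="x.html">link</a>'
missed = re.sub(r'<.*?>', '', split_tag)
caught = re.sub(r'<[^>]*>', '', split_tag)

print(repr(missed))  # '<a\nhref="x.html">link'
print(repr(caught))  # 'link'
```

Neither pattern understands comments, scripts, or a literal > inside an attribute value, so for anything beyond simple input a real HTML parser is the safer choice.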

boiishuvo

Thanks. That's a good example, but when I modified the code to meet the guide's requirements, the output didn't show anything.

def remove_html(text):
    import re
    info = open('remHTML.txt','r')
    data = info
    info.close()
    text = re.sub(r'<.*?>', '', data)

text = []
text_list = remove_html(text)
print(text_list.strip())

pyTony
pyMod
6,103 posts since Apr 2010
Moderator

You are not returning anything from the function.
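
A short sketch of what that means in practice: a Python function with no return statement hands back None, so there is nothing to call .strip() on. The function names strip_tags_no_return and strip_tags are made up for this comparison. Separately, the posted code binds data to the file object itself; info.read() would give the file's text as a string.

```python
import re

def strip_tags_no_return(data):
    # The result is computed, then thrown away when the function ends
    text = re.sub(r'<.*?>', '', data)

def strip_tags(data):
    # Same substitution, but the result is handed back to the caller
    return re.sub(r'<.*?>', '', data)

sample = '<title>Lachlan Osborn</title>'

print(strip_tags_no_return(sample))  # None
print(strip_tags(sample))            # Lachlan Osborn
```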

boiishuvo

Yeah, I forgot to add that,

but lines 8-10 (text = [], text_list = remove_html(text), and the print call) seem incorrect.

pyTony

You are modifying the passed-in list text, aren't you? I do not understand why you try to set another variable, text_list.
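
One detail that may help here: assigning to a parameter name inside a function only rebinds the local name, while calling a method such as append changes the caller's list. The sketch below uses two made-up functions, rebind and mutate, to show the difference. Since re.sub builds a new string, the function in question rebinds rather than mutates, so its result has to be returned.

```python
def rebind(items):
    # Points the local name at a new list; the caller's list is untouched
    items = ['changed']

def mutate(items):
    # Changes the list object that the caller passed in
    items.append('changed')

text = []
rebind(text)
after_rebind = list(text)   # still empty
mutate(text)
after_mutate = list(text)   # now holds one item

print(after_rebind)  # []
print(after_mutate)  # ['changed']
```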

boiishuvo

How do I run the procedure?

def remove_html(text)

boiishuvo

info = '''<table>
    <tr align = "center">
        <h1> Lachlan Osborn </h1>
        <p> Address: 5 Smith Street, Manly <br>
        Date of Birth: 26th April 1993 </p>
        
        <a href="semester.html"><b>My Semester Units</b></a>
        <p><b>Check out my <a href="hobbies.html">hobbies.</a></b></p>
    </tr>
</center>'''

def remove_html(text, info):
    import re
    text = re.sub(r'<.*?>', '', info)
    return text

remove_html(text.strip())
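
As posted, this last version defines remove_html with two parameters but calls it with one, uses text before it is defined, and never prints the result. One way it could be put together is sketched below, assuming the goal is the plain text with blank lines removed. The sample string is a shortened copy of the one above, and the names no_tags, lines, and plain are introduced for the example.

```python
import re

# Shortened copy of the sample HTML from the post
info = '''<table>
    <tr align = "center">
        <h1> Lachlan Osborn </h1>
        <p> Address: 5 Smith Street, Manly <br>
        Date of Birth: 26th April 1993 </p>
    </tr>
</table>'''

def remove_html(text):
    """Return text with anything between '<' and '>' removed."""
    # [^>]* also matches newlines, so tags split across lines are handled
    no_tags = re.sub(r'<[^>]*>', '', text)
    # Trim each line and drop the lines that are now empty
    lines = [line.strip() for line in no_tags.splitlines()]
    return '\n'.join(line for line in lines if line)

# Call the function with the string and keep what it returns
plain = remove_html(info)
print(plain)
```

To run it on a file instead, the string can come from open('remHTML.txt').read(), using the file name from the first post.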
Question Answered as of 2 Years Ago by snippsat, pyTony and nosehat