Remove HTML markup from the input text and return a plain text string.
I want this program to read a text file, then find and replace anything that starts with < and ends with >.
For example, if it finds <html>, it should replace that with ****.
But when I tested it, it didn't work as I expected. Any suggestions?
def remove_html(text):
    txtLIST = list(text)
    i = 0
    while i < len(txtLIST):
        if txtLIST[i] == '<':
            while txtLIST[i] != '>':
                txtLIST.pop(i)
            txtLIST.pop(i)
        else:
            i = i + 1
    replace = 4*'*'
    return replace.join(txtLIST)

file = open('remHTML.txt','r')
test = file
display = remove_html(test)
print display
boiishuvo
Junior Poster in Training
76 posts since Jun 2009
Reputation Points: 10
Solved Threads: 0
Skill Endorsements: 0
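The posted attempt appears to have two problems: it hands the open file object to the function instead of the file's contents, and `replace.join(txtLIST)` puts `****` between every remaining list item rather than where a tag used to be. A minimal sketch of the same character-by-character idea is below. The name `remove_html_scan` and the `replacement` parameter are hypothetical, chosen only for illustration, and an unclosed `<` is assumed to swallow the rest of the text.

```python
def remove_html_scan(text, replacement=''):
    """Drop every <...> span from text, putting `replacement` in its place."""
    pieces = []       # output fragments collected so far
    in_tag = False    # True while we are between '<' and '>'
    for ch in text:
        if ch == '<' and not in_tag:
            in_tag = True
            pieces.append(replacement)   # one replacement per tag
        elif ch == '>' and in_tag:
            in_tag = False               # tag finished, resume copying
        elif not in_tag:
            pieces.append(ch)            # ordinary character, keep it
    return ''.join(pieces)


print(remove_html_scan('<html>text</html>', '****'))   # prints: ****text****
```

To apply it to the file from the question, pass the contents rather than the file object, for example `remove_html_scan(open('remHTML.txt').read())`.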
Wait, I reread the guide: they expect me to write code that removes all HTML markup, including < and >, from a text file and then displays the text that is left.
I've never heard of the BeautifulSoup Python module, but the guide doesn't expect me to use it anyway.
For example, given a text file that contains: <title>Lachlan Osborn</title>
the output should be: Lachlan Osborn
boiishuvo
Thanks. That's a good example, but when I modified the code to meet the guide's requirements, the output didn't show anything.
def remove_html(text):
    import re
    info = open('remHTML.txt','r')
    data = info
    info.close()
    text = re.sub(r'<.*?>', '', data)

text = []
text_list = remove_html(text)
print(text_list.strip())
boiishuvo
You are not returning anything from remove_html.
pyTony
pyMod
6,312 posts since Apr 2010
Reputation Points: 879
Solved Threads: 987
Skill Endorsements: 26
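To make the point about the missing return concrete, here is a rough sketch, not the poster's final code. It keeps the `<.*?>` pattern from the post, returns the result, and reads the file with `read()` before closing it, since the file object itself is not a string. `remove_html_from_file` is a hypothetical helper name introduced only for this example.

```python
import re

TAG_PATTERN = re.compile(r'<.*?>')   # same non-greedy pattern as in the post


def remove_html(text):
    """Return text with every <...> tag removed."""
    return TAG_PATTERN.sub('', text)   # the return the post was missing


def remove_html_from_file(path):
    """Hypothetical helper: read the whole file, then strip its tags."""
    with open(path, 'r') as handle:
        data = handle.read()   # read() yields a string; the file object does not
    return remove_html(data)


print(remove_html('<title>Lachlan Osborn</title>'))   # prints: Lachlan Osborn
```

With the file from the thread, the call would be `print(remove_html_from_file('remHTML.txt').strip())`.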
Yeah, I forgot to add that,
but lines 8-10 seem incorrect.
boiishuvo
You are modifying the passed-in list text, aren't you? I do not understand why you are trying to set another variable, text_list.
pyTony
How do I run the procedure
def remove_html(text)?
boiishuvo
import re

info = '''<table>
<tr align = "center">
<h1> Lachlan Osborn </h1>
<p> Address: 5 Smith Street, Manly <br>
Date of Birth: 26th April 1993 </p>
<a href="semester.html"><b>My Semester Units</b></a>
<p><b>Check out my <a href="hobbies.html">hobbies.</a></b></p>
</tr>
</center>'''

def remove_html(text):
    # Strip every <...> tag with a non-greedy match
    return re.sub(r'<.*?>', '', text)

# Call the function with the string to clean and print the result
print(remove_html(info).strip())
boiishuvo
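Running the function just means calling it with the string to clean as its argument. On multi-line markup like the sample above, the regex leaves padding and blank lines where tags used to be, so a small clean-up pass may be wanted. The sketch below assumes that is desirable. `strip_tags` and `tidy_lines` are hypothetical names, the sample is a shortened copy of the one in the post, and `re.DOTALL` is added on the assumption that a tag might span more than one line.

```python
import re

# Shortened copy of the sample markup from the post.
sample = '''<tr align = "center">
<h1> Lachlan Osborn </h1>
<p> Address: 5 Smith Street, Manly <br>
Date of Birth: 26th April 1993 </p>
</tr>'''


def strip_tags(text):
    """Remove every <...> tag; re.DOTALL lets a tag span several lines."""
    return re.sub(r'<.*?>', '', text, flags=re.DOTALL)


def tidy_lines(text):
    """Hypothetical helper: trim each line and drop the ones left empty."""
    stripped = (line.strip() for line in text.splitlines())
    return '\n'.join(line for line in stripped if line)


# Call the functions by passing the string in as the argument.
print(tidy_lines(strip_tags(sample)))
```

This should print the three text lines (name, address, date of birth) with no surrounding blank lines. A regex like this is only suitable for simple, well-behaved markup such as the exercise file.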
Question answered as of 1 year ago by snippsat, pyTony and nosehat