remove HTML markup in the input text, return a plain text string

I want this program to read a text file, then find and replace anything that starts with < and ends with >.
For example, if it finds <html>, it should replace that with ****.
But when I tested it, it didn't work as I expected. Any suggestions?

def remove_html(text):
    txtLIST = list(text)
    i = 0
    while i < len(txtLIST):
        if txtLIST[i] == '<':
            while txtLIST[i] != '>':
                txtLIST.pop(i)
            txtLIST.pop(i)
        else:
            i = i + 1
    replace = 4*'*'
    return replace.join(txtLIST)

file = open('remHTML.txt','r')
test = file
display = remove_html(test)
print display
boiishuvo
Junior Poster in Training
76 posts since Jun 2009
Reputation Points: 10
Solved Threads: 0
Skill Endorsements: 0

Lines 11 and 12 put "****" between every single character. Other than this, is the output what you expect?
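
To illustrate the point about `join`: `str.join` puts the separator between every element of the list, so joining a list of single characters with `'****'` interleaves asterisks through the whole text. Below is a minimal sketch of a character-scanning version that inserts the replacement once per tag instead. The `replacement` parameter and the `inside_tag` flag are assumptions for illustration, not part of the original post.

```python
def remove_html(text, replacement='****'):
    """Return text with every <...> run replaced by replacement."""
    result = []
    inside_tag = False
    for ch in text:
        if ch == '<' and not inside_tag:
            inside_tag = True
            result.append(replacement)   # one replacement per tag
        elif ch == '>' and inside_tag:
            inside_tag = False           # tag finished, resume copying
        elif not inside_tag:
            result.append(ch)            # ordinary character, keep it
    return ''.join(result)               # empty separator: nothing inserted

# What the original join does to a list of characters:
joined = '****'.join(list('ab'))
print(joined)

print(remove_html('<b>hi</b>'))
```

Note also that the original script passes the file object itself to the function; calling `.read()` on it first would hand the function a string.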

By the way, you may want to look at the BeautifulSoup Python library for working with html files (and extracting text from them).

nosehat
Newbie Poster
15 posts since Dec 2010
Reputation Points: 10
Solved Threads: 3
Skill Endorsements: 0

By the way, you may want to look at the BeautifulSoup Python library for working with html files (and extracting text from them).

I agree with this, but now it looks like boiishuvo will destroy the structure of the HTML.
Should it replace the whole tag like this, or keep the < and > intact?

>>> s = '<html>'
>>> s.replace('<html>', '***')
'***'

Something like this with regex.

import re

html = '''\
<html>
<head>
    <title></title>
</head>
<body>

</body>
</html>'''

print re.sub(r'<.*>', '****', html)
"""Output-->
****
****
    ****
****
****

****
****
"""
snippsat
Posting Shark
957 posts since Aug 2008
Reputation Points: 482
Solved Threads: 344
Skill Endorsements: 8

Wait, I read the guide: they expect me to write code that removes all HTML markup, including < and >, from a text file and then displays the text that is left.

I've never heard of the BeautifulSoup Python module, and the guide doesn't expect me to use it anyway.

For example, a text file that shows: <title>Lachlan Osborn</title>
and the output should be like that: Lachlan Osborn

boiishuvo

For example, a text file that shows: <title>Lachlan Osborn</title>
and the output should be like that: Lachlan Osborn

It can depend on how that text file looks.
You can change the regex to something like this.

import re

data = '''\
<title>Lachlan Osborn</title>
<head>hello world</head>
'''

text = re.sub(r'<.*?>', '', data)
print text.strip()
"""Output-->
Lachlan Osborn
hello world
"""
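
The difference between the two patterns in this thread is greedy versus non-greedy matching. The sketch below, using only the standard `re` module, compares them on a single line. The multi-line sample string and the use of `re.DOTALL` are additions for illustration and do not come from the posts above.

```python
import re

line = '<title>Lachlan Osborn</title>'

# Greedy: .* runs from the first '<' to the last '>' on the line.
greedy = re.sub(r'<.*>', '', line)

# Non-greedy: .*? stops at the first '>', so each tag is matched separately.
lazy = re.sub(r'<.*?>', '', line)

print(repr(greedy))
print(repr(lazy))

# '.' does not match a newline by default, so a tag split across
# lines is only matched when re.DOTALL is passed.
split_tag = 'a <b\nc> d'
unchanged = re.sub(r'<.*?>', '', split_tag)
stripped = re.sub(r'<.*?>', '', split_tag, flags=re.DOTALL)
```

For anything beyond simple, well-formed input, an HTML parser such as the BeautifulSoup library mentioned earlier is likely to be more robust than a regex.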
snippsat

Thanks. That's a good example, but I modified the code to meet the guide's requirements and the output didn't show anything.

def remove_html(text):
    import re
    info = open('remHTML.txt','r')
    data = info
    info.close()
    text = re.sub(r'<.*?>', '', data)

text = []
text_list = remove_html(text)
print(text_list.strip())
boiishuvo

You are not returning anything.
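
One possible corrected version is sketched below. Besides adding the `return`, it reads the file contents with `.read()`, since the snippet above assigns the file object itself to `data`. Taking the file name as a `path` parameter and using a temporary file in place of `remHTML.txt` are assumptions made so the example runs on its own.

```python
import os
import re
import tempfile

def remove_html(path):
    """Read the file at path and return its text with <...> tags removed."""
    with open(path, 'r') as info:
        data = info.read()            # a string, not the file object
    return re.sub(r'<.*?>', '', data)  # hand the result back to the caller

# Demo: a temporary file stands in for remHTML.txt.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('<title>Lachlan Osborn</title>\n<head>hello world</head>\n')
    demo_path = tmp.name

cleaned = remove_html(demo_path).strip()
os.remove(demo_path)
print(cleaned)
```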

pyTony
pyMod
Moderator
6,312 posts since Apr 2010
Reputation Points: 879
Solved Threads: 987
Skill Endorsements: 26

Yeah, I forgot to add that,

but lines 8-10 seem incorrect.

boiishuvo

You are modifying the passed-in list text, aren't you? I do not understand why you are trying to set another variable, text_list.

pyTony

How do I run the procedure?

def remove_html(text)
boiishuvo
info = '''<table>
    <tr align = "center">
        <h1> Lachlan Osborn </h1>
        <p> Address: 5 Smith Street, Manly <br>
        Date of Birth: 26th April 1993 </p>
        
        <a href="semester.html"><b>My Semester Units</b></a>
        <p><b>Check out my <a href="hobbies.html">hobbies.</a></b></p>
    </tr>
</center>'''

def remove_html(text, info):
    import re
    text = re.sub(r'<.*?>', '', info)
    return text

remove_html(text.strip())
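
The call above fails because `text` is never defined and the function is declared with two parameters but called with one. A minimal sketch of one way to run it follows: the function takes a single string and the caller passes `info` to it. The sample markup is shortened from the post, and the `plain` and `lines` names are assumptions for illustration.

```python
import re

info = '''<table>
    <tr align = "center">
        <h1> Lachlan Osborn </h1>
        <p> Address: 5 Smith Street, Manly <br>
        Date of Birth: 26th April 1993 </p>
    </tr>
</table>'''

def remove_html(text):
    """Return text with every <...> tag removed."""
    return re.sub(r'<.*?>', '', text)

# Call the function with the string and keep the returned value.
plain = remove_html(info)

# Drop the blank lines and indentation left behind by the removed tags.
lines = [line.strip() for line in plain.splitlines() if line.strip()]
print('\n'.join(lines))
```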
boiishuvo
Question Answered as of 1 Year Ago by snippsat, pyTony and nosehat