Replacing HTML tags and entities in a string

Question

G_S 38 Junior Poster in Training

11 Years Ago

Hello.

I am working on a personal project. It's basically a program for changing specific tags from certain HTML files.

So far, everything works. The GUI and the logic work, but I know the main function is wrong because it looks like this:

def tag_remove(HTML_string):
    clean_HTML = HTML_string.replace('<b>', '').replace('<i>', '').replace('<p>', '').replace('<h1>', '') #etc.
    return clean_HTML

Is there a way of doing this using a data structure like a list, tuple or dictionary? I don't want it to be so repetitive. I was thinking of a dictionary where the key would be the tag to replace and the value would be its replacement, but I don't know how to do that.

python

Edited 11 Years Ago by G_S

TrustyTony 888 pyMod

11 Years Ago

The correct way is to use an HTML parser such as Beautiful Soup: http://www.crummy.com/software/BeautifulSoup/

snippsat 661 Master Poster

11 Years Ago

from bs4 import BeautifulSoup

html = """\
<html>
<head>
   <title>html page</title>
</head>
<body>
  <div>Hello world</div>
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
head_tag = soup.find('head')
head_tag.name = 'New name'  # renames both the opening and the closing tag
print(soup.prettify())

"""Output-->
<html>
 <New name>
  <title>
   html page
  </title>
 </New name>
 <body>
  <div>
   Hello world
  </div>
 </body>
</html>
"""

Edited 11 Years Ago by snippsat
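For comparison, here is a minimal sketch of the same kind of tag renaming using only the standard library's html.parser module, assuming Python 3. The names RENAMES, TagRenamer and rename_tags are hypothetical and chosen for illustration. The sketch drops comments and doctype declarations, so treat it as a starting point and not a full solution.

```python
from html.parser import HTMLParser

# Hypothetical mapping for illustration: old tag name -> new tag name.
RENAMES = {"b": "strong", "i": "em"}


class TagRenamer(HTMLParser):
    """Re-emit the parsed document, renaming the tags listed in `renames`."""

    def __init__(self, renames):
        # convert_charrefs=False keeps entities such as &amp; as they are.
        super().__init__(convert_charrefs=False)
        self.renames = renames
        self.out = []

    def _emit_start(self, tag):
        # Reuse the original start tag text so attributes stay untouched;
        # only the tag name at the front is swapped.
        raw = self.get_starttag_text()
        self.out.append("<" + self.renames.get(tag, tag) + raw[1 + len(tag):])

    def handle_starttag(self, tag, attrs):
        self._emit_start(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, with no end tag.
        self._emit_start(tag)

    def handle_endtag(self, tag):
        self.out.append("</%s>" % self.renames.get(tag, tag))

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def rename_tags(html, renames=RENAMES):
    parser = TagRenamer(renames)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(rename_tags("<p><b>bold</b> &amp; <i>italic</i></p>"))
# prints: <p><strong>bold</strong> &amp; <em>italic</em></p>
```

Because the parser reports tag names in lower case, a mapping keyed on "b" also covers <B>.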

woooee 814 Nearly a Posting Maven · Answer 1 · 2012-06-30T22:33:06+00:00

If you have a <b> then you should also have a </b>, etc. Doing this yourself usually involves a split() and iterating over each item. A good habit to form is to iterate over the text as few times as possible, which means looking for every tag in one pass over the string, instead of a find-and-replace approach that goes through the string once per tag. Also, remember that you want a way to do this that you understand, so ignore any criticism from any self-appointed gods that frequent this forum.

def test_tags(tag):
    """ test each item from the list and return it untouched unless
        it starts with one of the "delete" tags
    """
    replace_these=['<b>', '</b>', '<p>', '</p>', '<h1>', '</h1>']
    for looking in replace_these:
        if tag.startswith(looking):
            return tag[len(looking):]     ## removed
    return tag  ## nothing found so save this tag

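    ## or, to replace instead of remove, use this loop in place of the
    ## one above (as written it is never reached because of the return)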
    replace_these=[('<b>', 'rep_b'), 
                   ('</b>', 'rep_b2')]
    for looking, replacement in replace_these:
        if tag.startswith(looking):
            return replacement+tag[len(looking):]
    return tag  ## nothing found so save this tag



def tag_remove(HTML_string):
    """ split the string into a list on the "<" character and send
        each item to the test_tags() function for "cleaning"
    """
    clean_HTML = []
    HTML_list = HTML_string.split("<")
    ## the first item is whatever came before the first "<", so it is
    ## not a tag and is kept as is
    clean_HTML.append(HTML_list[0])
    for tag in HTML_list[1:]:
        clean_HTML.append(test_tags("<"+tag))
    return "".join(clean_HTML)

test_html="""<html>
    <head>
    <title>Should not be removed</title>
    </head>
    <b>bold test</b>
</html>
"""

print(tag_remove(test_html))
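The one-pass idea can also be sketched with a regular expression, which is a different technique from the split() approach above: all tags are combined into a single pattern, and a dictionary supplies the replacement for whichever tag matched. The name make_replacer and the mapping below are hypothetical, chosen for illustration.

```python
import re


def make_replacer(replacements):
    """Return a function that applies every replacement in a single pass."""
    if not replacements:
        return lambda text: text
    # Longest keys first, so a key that is a prefix of another cannot shadow it.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # The matched text is looked up in the dictionary to find its replacement.
    return lambda text: pattern.sub(lambda m: replacements[m.group(0)], text)


# Hypothetical mapping: '' removes the tag, anything else replaces it.
replace_tags = make_replacer({'<b>': '', '</b>': '',
                              '<h1>': '<h2>', '</h1>': '</h2>'})

print(replace_tags("<h1>Title</h1> <b>bold test</b>"))
# prints: <h2>Title</h2> bold test
```

Only exact matches are handled, so a tag with attributes, such as <b class="x">, is left alone.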

G_S 38 Junior Poster in Training · Answer 2 · 2012-07-03T18:00:36+00:00

Thanks for your suggestions. I'll try Beautiful Soup in future projects, since I think it is excellent, but I don't like installing a full library just to use one function.

I then decided to take inspiration from woooee's approach.

So far I managed to turn the string into a list including both words and tags:

test_html="""<html>
<head>
<title>Should not be removed</title>
</head>
<b>bold test</b>
</html>
"""
new_html = test_html.replace('<', '*<').replace('>', '>*')
html_as_list = new_html.split('*')

This produces a list where tags AND words are separate elements. It also produces crap (empty strings), but join deals with that later.
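The same kind of list can be sketched without inserting a marker character, by using re.split with a capturing group. This is a swapped-in technique and not the code from this post, and split_tags is a hypothetical name. Like the approach above, it leaves empty strings in the list, which join() absorbs later.

```python
import re


def split_tags(html):
    # The capturing group makes re.split keep the matched tags in the result,
    # so tags and the text between them end up as separate list items.
    return re.split(r'(<[^>]*>)', html)


pieces = split_tags("<b>bold test</b>")
print(pieces)
# prints: ['', '<b>', 'bold test', '</b>', '']
```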

Next, I have a dictionary of changes:

replacements = {'<b>': '<strong>'}

That is what I've got so far. Now the plan is:

for entry in html_as_list:
    #if entry is a key in replacements:
        #replace it with the corresponding value#
    else:
        pass
return "".join(html_as_list)

Can somebody help me with those two commented lines? Is there a python function for doing that?

G_S 38 Junior Poster in Training · Answer 3 · 2012-07-03T19:55:26+00:00

Small update: I figured out those two lines on my own. Here is the code:

def tag_remove(html):
    # mark every tag boundary with '*', then split on the marker
    new_html = html.replace('<', '*<').replace('>', '>*')
    html_as_list = new_html.split('*')

    replacements = {'<b>':'<strong>', '</b>':'</strong>', '<table>':'<p>','</table>':'</p>','<td>':'<p>', '</td>':'</p>', '<tr>':'<p>', '</tr>':'</p>'}

    # swap every item that is a key in the dictionary; leave the rest alone
    for i in range(len(html_as_list)):
        if html_as_list[i] in replacements:
            html_as_list[i] = replacements[html_as_list[i]]

    new_html = "".join(html_as_list)
    return new_html

It's working pretty well now, but... is it more efficient than my first version of the code?

The first version was

def tag_remove(HTML_string):
    clean_HTML = HTML_string.replace('<b>', '').replace('<i>', '').replace('<p>', '').replace('<h1>', '') #etc.
    return clean_HTML

But there were 120+ .replace(something, something) statements. Is this new code really more efficient?
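One way to find out is to measure both versions with the standard timeit module. The sketch below is a hedged starting point, not a verdict. The function names, the sample input and the "\0" marker are assumptions made for illustration, and the loop over the dictionary stands in for the long chain of .replace() calls. The timings depend on the input and the machine, so none are quoted here.

```python
import timeit

REPLACEMENTS = {'<b>': '<strong>', '</b>': '</strong>',
                '<table>': '<p>', '</table>': '</p>',
                '<td>': '<p>', '</td>': '</p>',
                '<tr>': '<p>', '</tr>': '</p>'}


def chained_replace(html):
    # One full pass over the string per dictionary entry,
    # like a chain of .replace() calls.
    for old, new in REPLACEMENTS.items():
        html = html.replace(old, new)
    return html


def split_and_lookup(html):
    # Mark tag boundaries, split once, then do one dictionary lookup per piece.
    # "\0" is an assumed marker that should not occur in the input.
    pieces = html.replace('<', '\0<').replace('>', '>\0').split('\0')
    return "".join(REPLACEMENTS.get(piece, piece) for piece in pieces)


# Hypothetical sample input; real pages will behave differently.
sample = "<table><tr><td><b>cell</b></td></tr></table>\n" * 1000

# Both versions must agree before their speed is worth comparing.
assert chained_replace(sample) == split_and_lookup(sample)

for func in (chained_replace, split_and_lookup):
    seconds = timeit.timeit(lambda: func(sample), number=20)
    print("%-18s %.4f s for 20 runs" % (func.__name__, seconds))
```

Running it on real files with the full set of 120+ replacements gives a more meaningful comparison than this small sample.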
