Hello.

I am working on a personal project. It's basically a program for changing specific tags from certain HTML files.

So far, everything works. The GUI and the logic work, but I know the main function is wrong because it looks like this:

def tag_remove(HTML_string):
    clean_HTML = HTML_string.replace('<b>', '').replace('<i>', '').replace('<p>', '').replace('<h1>', '') #etc.
    return clean_HTML

Is there a way of doing this using a data structure like a list, tuple or dictionary? I don't want it to be so repetitive. I was thinking of a dictionary where each key is the tag to replace and each value is its replacement, but I don't know how to do that.
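
A minimal sketch of the dictionary idea described above, assuming the tags are plain strings with no attributes. The name replace_tags and the sample mapping are made up for illustration. It still calls str.replace once per entry, so it tidies the code without reducing the number of passes over the string.

```python
def replace_tags(html_string, replacements):
    """Apply every tag -> replacement pair from the dictionary in turn."""
    for tag, new_tag in replacements.items():
        html_string = html_string.replace(tag, new_tag)
    return html_string

# Hypothetical mapping: an empty string removes the tag entirely
replacements = {'<b>': '', '</b>': '', '<i>': '<em>', '</i>': '</em>'}
print(replace_tags('<p><b>bold</b> and <i>italic</i></p>', replacements))
```

Adding another tag then means adding one dictionary entry instead of another chained call.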

Edited 4 Years Ago by G_S

from bs4 import BeautifulSoup

html = """\
<html>
<head>
   <title>html page</title>
</head>
<body>
  <div>Hello world</div>
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
head_tag = soup.find('head')
head_tag.name = 'New name'
print(soup.prettify())

"""Output-->
<html>
 <New name>
  <title>
   html page
  </title>
 </New name>
 <body>
  <div>
   Hello world
  </div>
 </body>
</html>
"""

Edited 4 Years Ago by snippsat

If you have a <b> then you should also have a </b>, etc. Doing this yourself usually involves a split() and iterating over each item. A good habit to form is to iterate over the text as few times as possible, which means looking for every tag in one pass over the string rather than using find-and-replace, which goes through the string once per tag. Also, remember that you want a way to do this that you understand, so ignore any criticism from any self-appointed gods that frequent this forum.

def test_tags(tag):
    """ test each item from the list and return it untouched unless
        it starts with one of the "delete" tags
    """
    replace_these=['<b>', '</b>', '<p>', '</p>', '<h1>', '</h1>']
    for looking in replace_these:
        if tag.startswith(looking):
            return tag[len(looking):]     ## removed
    return tag  ## nothing found so save this tag

    ## or, in place of the loop above, you can replace and return
    ## (as written, this second loop is never reached)
    replace_these=[('<b>', 'rep_b'), 
                   ('</b>', 'rep_b2')]
    for looking, replacement in replace_these:
        if tag.startswith(looking):
            return replacement+tag[len(looking):]
    return tag  ## nothing found so save this tag



def tag_remove(HTML_string):
    """ split the string into a list on the "<" character and send
        each item to the test_tags() function for "cleaning"
    """
    HTML_list = HTML_string.split("<")
    clean_HTML = [HTML_list[0]]   ## text before the first "<", if any
    for tag in HTML_list[1:]:     ## every remaining item followed a "<"
        return_ch = test_tags("<"+tag)
        if len(return_ch):
            clean_HTML.append(return_ch)
    return "".join(clean_HTML)

test_html="""<html>
    <head>
    <title>Should not be removed</title>
    </head>
    <b>bold test</b>
</html>
"""

print(tag_remove(test_html))
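
For comparison, the "one pass over the string" idea can also be sketched with the standard re module: one alternation pattern built from the dictionary keys, plus a callback that looks up each match. This is a different technique from the split-based code above, and replace_tags_once and the sample mapping are illustrative names only. It assumes no key is a prefix of another key.

```python
import re

def replace_tags_once(html_string, replacements):
    """Replace every key of `replacements` in a single scan of the string."""
    if not replacements:
        return html_string   # an empty pattern would match everywhere
    # re.escape keeps every character of each tag literal in the pattern
    pattern = re.compile('|'.join(re.escape(tag) for tag in replacements))
    # The callback receives each match and returns its replacement
    return pattern.sub(lambda match: replacements[match.group(0)], html_string)

# Hypothetical mapping: an empty string removes the tag entirely
replacements = {'<b>': '<strong>', '</b>': '</strong>', '<h1>': '', '</h1>': ''}
print(replace_tags_once('<h1>Title</h1><b>bold test</b>', replacements))
```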

Edited 4 Years Ago by woooee

Thanks for your suggestions. I'll try Beautiful Soup in future projects, since I think it is excellent, but I don't like installing a full library just to use one function.

I then decided to take inspiration from woooee's approach:

So far I managed to turn the string into a list including both words and tags:

test_html="""<html>
<head>
<title>Should not be removed</title>
</head>
<b>bold test</b>
</html>
"""
new_html = test_html.replace('<', r'\*<').replace('>', r'>\*')
html_as_list = new_html.split(r'\*')

This produces a list where tags AND words are separate elements. It also produces crap (empty strings), but join deals with that later.
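
The same kind of list can be sketched without a marker string by using re.split with a capturing group, which keeps the matched tags in the result. The pattern is a simplification that assumes no ">" appears inside a tag's attributes, and the sample input is made up.

```python
import re

sample = "<b>bold test</b>"   # hypothetical input
# The parentheses make re.split keep each tag as its own list item
pieces = re.split(r'(<[^>]+>)', sample)
print(pieces)   # ['', '<b>', 'bold test', '</b>', '']
```

The empty strings at both ends disappear again when the list is joined.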

Next, I have a dictionary of changes:

replacements = {'<b>': '<strong>'}

That is what I've got so far. Now the plan is:

for entry in html_as_list:
    #if entry is a key in replacements:
        #replace it with the corresponding value#
    else:
        pass
return "".join(html_as_list)

Can somebody help me with those two commented lines? Is there a python function for doing that?

Edited 4 Years Ago by G_S

Small update: I found out about those two lines on my own. Here is the code:

def tag_remove(html):
    # Mark the tag boundaries so tags and text become separate list items
    new_html = html.replace('<', r'\*<').replace('>', r'>\*')
    html_as_list = new_html.split(r'\*')

    replacements = {'<b>':'<strong>', '</b>':'</strong>', '<table>':'<p>','</table>':'</p>','<td>':'<p>', '</td>':'</p>', '<tr>':'<p>', '</tr>':'</p>'}

    # Swap every item that is a known tag; leave everything else alone
    for i in range(len(html_as_list)):
        if html_as_list[i] in replacements:
            html_as_list[i] = replacements[html_as_list[i]]

    return "".join(html_as_list)

It's working pretty well now, but... is it more efficient than my first version of the code?

The first version was

def tag_remove(HTML_string):
    clean_HTML = HTML_string.replace('<b>', '').replace('<i>', '').replace('<p>', '').replace('<h1>', '') #etc.
    return clean_HTML

But there were 120+ .replace(something, something) statements. Is this new code really more efficient?
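
One way to find out is to measure it. The sketch below uses the standard timeit module to time a chain of .replace() calls against a split-and-lookup version on the same input. The sample document, the short tag mapping, the function names and the NUL marker are all stand-ins, and the timings depend on the machine and on the real 120+ replacements, so no result is shown here.

```python
import timeit

# Stand-in mapping; the real one would hold all 120+ replacements
REPLACEMENTS = {'<b>': '<strong>', '</b>': '</strong>',
                '<td>': '<p>', '</td>': '</p>'}

def chained_replace(html):
    # One full pass over the string per dictionary entry
    for old, new in REPLACEMENTS.items():
        html = html.replace(old, new)
    return html

def split_and_lookup(html):
    # Split into tags and text, then swap known tags via dictionary lookup.
    # Assumption: the NUL character never appears in the HTML itself.
    marked = html.replace('<', '\x00<').replace('>', '>\x00')
    pieces = marked.split('\x00')
    return ''.join(REPLACEMENTS.get(piece, piece) for piece in pieces)

sample = '<td><b>cell</b> text</td>\n' * 1000   # made-up test document

for func in (chained_replace, split_and_lookup):
    seconds = timeit.timeit(lambda: func(sample), number=100)
    print(func.__name__, seconds)
```

Running it against a real file and the full dictionary would give a more meaningful comparison than this toy input.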
