
Hi all.

I'm a newbie here, so excuse my question if it's a bit dumb. I'm a C programmer but needed to do some text file stripping, and was told Python would be good for this.
I have been messing about with it for about a week now and have the following problem.
I want to look at blocks of HTML and keep or drop each chunk depending on a name it contains. For example, if my text doc looked like the one below, I want to scan through it and take out only the blocks with the word "remove".
I wrote some code which captures all the blocks, but I can't figure out how to leave the "leave" blocks and carry on. My code gets to "leave" and then starts to rescan the doc again, so it gets stuck in a loop. I have also included my code. Don't laugh, I'm a beginner ;)

<tr>
<td><a leave </a></td>
</tr>
<tr>
<td><a remove </a></td>
</tr>
<tr>
<td><a leave </a></td>
</tr>
<tr>
<td><a remove </a></td>
</tr>

import re

TRUE = 1
FALSE = 0

leave_search = re.compile ('leave')#need to use this to somehow skip block with this regex
main_search = re.compile ('<tr>\s.*\s</tr>\s')

def file_strip(file_name,search_type):

    result = search_type.search(file_name)
    leave = leave_search.search(result.group())

    print (file_name) #debug not needed
    print (result) #debug not needed
    
    search = TRUE
    while search:
        
        if result:
            print ('We have a result')
            if leave:
                print ('leave text found') #here i somehow need to search to the next block
            else:
                print ('leave text NOT found')
                file_name = file_name.replace(result.group(),"")
        else:
            print ('No result left in file')
            return file_name
        
def HTML_strip (filename):

    # read the whole file; "with" closes it afterwards
    # (close() is a method, .closed is only a flag and closes nothing)
    with open(filename, 'r') as file_to_open:
        file_to_read = file_to_open.read()
    file_to_read = file_strip(file_to_read,main_search)
    # opening with 'w' empties the file before the stripped text is written back
    with open(filename, 'w') as file_to_open:
        file_to_open.write(file_to_read)
    return file_to_read

HTML_strip ('webtest.txt')
Last Post by pyTony

I don't know regexps in Python, but here, one more time, is a scan in plain Python using str.partition:

MyStr = """
<tr>
<td><a leave </a></td>
</tr>
<tr>
<td><a remove </a></td>
</tr>
<tr>
<td><a leave </a></td>
</tr>
<tr>
<td><a remove </a></td>
</tr>
"""

before, found, t = MyStr.partition('<tr>')
print(before, end='')  # keep everything outside <tr> blocks

while found:
    t, found, more = t.partition('</tr>')
    if not found:
        raise ValueError("Missing end tag: " + t)
    # keep the whole block, tags included, unless it is marked for removal
    if t.find('<a remove') == -1:
        print('<tr>' + t + found, end='')

    before, found, t = more.partition('<tr>')
    print(before, end='')  # keep everything outside <tr> blocks
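
The same scan can be wrapped in a function that returns the cleaned text instead of printing it, so the result can be written back to the file. It also avoids the endless loop in the question, where result and leave are set once before the while and never change inside it. This is only a minimal sketch in Python 3: the name remove_blocks and its parameters are made up for illustration, and it assumes the blocks are not nested.

```python
def remove_blocks(text, marker='<a remove', start='<tr>', end='</tr>'):
    """Return text without the start...end blocks that contain marker.

    Hypothetical helper: assumes blocks are not nested.
    """
    kept = []
    before, found, rest = text.partition(start)
    kept.append(before)  # text ahead of the first block
    while found:
        body, closing, rest = rest.partition(end)
        if not closing:
            raise ValueError('Missing end tag: ' + body)
        if marker not in body:
            kept.append(start + body + closing)  # keep the block, tags included
        before, found, rest = rest.partition(start)
        kept.append(before)  # text between blocks, or after the last one
    return ''.join(kept)


sample = """<tr>
<td><a leave </a></td>
</tr>
<tr>
<td><a remove </a></td>
</tr>
"""

cleaned = remove_blocks(sample)
print(cleaned)
```

Each call to partition continues from the text left over by the previous one, so the scan always moves forward and stops when no more <tr> is found. The newline that followed a removed block is kept, so a blank line stays where the block used to be.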