Hi, I'm learning Python and need a little help.
I have a text file that contains the data below:

<SYNC Start=5047><P Class=ENCC>
Back, back, back, back!
<SYNC Start=7235><P Class=ENCC>&nbsp;
<SYNC Start=10725><P Class=ENCC>
Yeah, Dan!

I want to extract only the text, i.e.

Back, back, back, back!
Yeah, Dan!

I'm using this code, but it gives me the lines that I want to ignore:

new_list = []
ignore = False
for line in file("tstl.txt"):
    data_list = line.split(" ")
    if line.startswith("<"):
        if line.endswith(">") or line.endswith(";"):
            ignore = True
    if ignore:
        new_list.append(line)

fout = open("tst2.txt", "w")
fout.writelines(new_list)
fout.close()

Could somebody guide me in this regard?

if not ignore:
    new_list.append(line)
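
For context, here is a minimal sketch of how the original loop might look with that change applied. Two things in it go beyond the reply and are my own assumptions: the trailing newline is stripped before calling endswith(), and ignore is decided fresh for every line instead of being set once. The function name keep_text_lines and the sample list are hypothetical.

```python
def keep_text_lines(lines):
    """Return only the lines that are not pure markup lines."""
    kept = []
    for line in lines:
        # Lines read from a file end with a newline, so strip it
        # before testing the last character.
        stripped = line.rstrip("\r\n")
        # Decide per line, so one tag line does not affect later lines.
        ignore = stripped.startswith("<") and (
            stripped.endswith(">") or stripped.endswith(";")
        )
        if not ignore:
            kept.append(line)
    return kept


# Hypothetical sample standing in for the contents of the input file.
sample = [
    "<SYNC Start=5047><P Class=ENCC>\n",
    "Back, back, back, back!\n",
    "<SYNC Start=7235><P Class=ENCC>&nbsp;\n",
    "<SYNC Start=10725><P Class=ENCC>\n",
    "Yeah, Dan!\n",
]
print("".join(keep_text_lines(sample)), end="")
```

With a real file, the list could be replaced by the open file object, since iterating over it also yields one line at a time.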

Looks like you could simplify your code a little:

data = """\
<SYNC Start=5047><P Class=ENCC>
Back, back, back, back!
<SYNC Start=7235><P Class=ENCC>&nbsp;
<SYNC Start=10725><P Class=ENCC>
Yeah, Dan!"""

# this would be like the file data
lines = data.split('\n')

new_data = ""
for line in lines:
    if not line.startswith('<'):
        new_data += line + '\n'

# test it
print(new_data)

"""
my output -->
Back, back, back, back!
Yeah, Dan!
"""

What if I have a text file where the tag lines also contain text?

<FONT COLOR="ff99ff"> So hold onto your special friend </font>
- Wheels, one eight, wheels.
<SYNC Start=2474996><P Class=ENCC>
<FONT COLOR="ff99ff"> You'll need something to keep her in
<SYNC Start=2480275><P Class=ENCC>
<FONT COLOR="ff99ff"> now you stay inside this foolish grin
<SYNC Start=2486187><P Class=ENCC>
<FONT COLOR="ff99ff"> though any day your secrets end
<SYNC Start=2490682><P Class=ENCC>
<FONT COLOR="ff99ff"> but then again years may go by
<SYNC Start=2495728><P Class=ENCC>&nbsp;

and I want to extract just the text that makes sense, like

Wheels, one eight, wheels.
You'll need something to keep her in
now you stay inside this foolish grin
though any day your secrets end

In this case the above code does not work. Could you suggest something?

Here's a possible way to get you started (note this is untested):

inp = """<FONT COLOR="ff99ff"> So hold onto your special friend </font>
- Wheels, one eight, wheels.
<SYNC Start=2474996><P Class=ENCC>
<FONT COLOR="ff99ff"> You'll need something to keep her in
<SYNC Start=2480275><P Class=ENCC>
<FONT COLOR="ff99ff"> now you stay inside this foolish grin
<SYNC Start=2486187><P Class=ENCC>
<FONT COLOR="ff99ff"> though any day your secrets end
<SYNC Start=2490682><P Class=ENCC>
<FONT COLOR="ff99ff"> but then again years may go by
<SYNC Start=2495728><P Class=ENCC>&nbsp;"""
out = []
found = False
for character in inp:
    if not found:
        if character == '<':
            found = True
        else:
            out.append(character)
    else:
        if character == '>':
            found = False
    
print(''.join(out))

That's great, it worked better than I thought. Thanks.
What if I want to ignore more characters at the start or end of a line? Also, what if I want to get the text from a file rather than from a few lines of text? For example, I have this file with more than 10000 lines and I want to extract only the relevant text, like this:

inp = open("friends.smi", 'r')
out = []
found = False
for character in inp:
    if not found:
        if character == '<' or character == '&':
            found = True
        else:
            out.append(character)
    else:
        if character == '>':
            found = False

fout = open("1.txt", "w")
fout.writelines(out)
fout.close()

This doesn't separate the relevant text from the file. What could be the problem?

Well, for one, I suggest removing the ampersand as a criterion for ignoring. Since you know that &nbsp; will consistently appear, you can simply remove it at the end, after you join all your characters back together (a step you omitted). You can also replace the double newline characters ( \n\n ) with single ones.

I also suggest using write() instead of writelines() in this case. Look over the modifications and ask about anything you don't understand:

inp=open("friends.smi", 'r')
out = []
found = False
for character in inp:
    if not found:
        if character == '<':
            found = True
        else:
            out.append(character)
    else:
        if character == '>':
            found = False
new_text = ''.join( out ).replace('&nbsp;', '').replace('\n\n', '\n')
fout = open("1.txt", "w")
fout.write(new_text)
fout.close()

Ignoring on a per-character basis doesn't work for a file. When I use a text file whose lines follow this repeated pattern, i.e.

<FONT COLOR="ff99ff"> So hold onto your special friend </font>
- Wheels, one eight, wheels.
<SYNC Start=2474996><P Class=ENCC>
<FONT COLOR="ff99ff"> You'll need something to keep her in
<SYNC Start=2480275><P Class=ENCC>
<FONT COLOR="ff99ff"> now you stay inside this foolish grin
<SYNC Start=2486187><P Class=ENCC>
<FONT COLOR="ff99ff"> though any day your secrets end
<SYNC Start=2490682><P Class=ENCC>
<FONT COLOR="ff99ff"> but then again years may go by
<SYNC Start=2495728><P Class=ENCC>&nbsp;

It returns the whole text file as it was before parsing, without ignoring any characters.

<FONT COLOR="ff99ff"> So hold onto your special friend </font>
- Wheels, one eight, wheels.
<SYNC Start=2474996><P Class=ENCC>
<FONT COLOR="ff99ff"> You'll need something to keep her in
<SYNC Start=2480275><P Class=ENCC>
<FONT COLOR="ff99ff"> now you stay inside this foolish grin
<SYNC Start=2486187><P Class=ENCC>
<FONT COLOR="ff99ff"> though any day your secrets end
<SYNC Start=2490682><P Class=ENCC>
<FONT COLOR="ff99ff"> but then again years may go by
<SYNC Start=2495728><P Class=ENCC>&nbsp;

What could be the problem?

You're not reading the file. When you iterate over a file handle you iterate line by line. So no line will ever equal '<' or '>'.
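
To make that concrete, here is a rough sketch of the same character loop applied to the full text of a file, obtained with read(). The helper names strip_tags and strip_tags_in_file are hypothetical, and reading the whole file into memory is an assumption that should be fine for a file of around 10000 lines.

```python
def strip_tags(text):
    """Drop everything between '<' and '>' from a string."""
    out = []
    inside_tag = False
    for character in text:  # iterating over a str yields single characters
        if inside_tag:
            if character == ">":
                inside_tag = False
        elif character == "<":
            inside_tag = True
        else:
            out.append(character)
    return "".join(out)


def strip_tags_in_file(src, dst):
    """Read src as one string, strip the tags, and write the result to dst."""
    with open(src, "r") as inp:
        text = inp.read()  # the whole file as one string, not a list of lines
    cleaned = strip_tags(text).replace("&nbsp;", "")
    with open(dst, "w") as fout:
        fout.write(cleaned)


# Quick check on a string, so no file is needed to try it out.
sample = '<FONT COLOR="ff99ff"> now you stay inside this foolish grin\n'
print(repr(strip_tags(sample)))
```

For the files in this thread, the call would be something like strip_tags_in_file("friends.smi", "1.txt").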

I got that. Is there some other way to do it? I mean by parsing lines only.
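
One line-based possibility, offered as a sketch rather than a definitive answer: remove the tags from each line with a regular expression and keep whatever text is left. This swaps the character loop from earlier in the thread for re.sub, and it assumes that a tag never spans more than one line. The name text_from_lines is hypothetical, and stripping a leading "- " is my guess at the "characters at the start of a line" mentioned above.

```python
import re

# Matches one tag: '<', then anything except '>', then '>'.
TAG = re.compile(r"<[^>]*>")


def text_from_lines(lines):
    """Return the non-empty text left on each line after removing tags."""
    result = []
    for line in lines:
        text = TAG.sub("", line).replace("&nbsp;", "").strip()
        text = text.lstrip("- ")  # assumption: leading dashes are unwanted
        if text:  # skip lines that contained only markup
            result.append(text)
    return result


# Hypothetical sample standing in for a few lines of the .smi file.
sample = [
    '<FONT COLOR="ff99ff"> So hold onto your special friend </font>\n',
    "- Wheels, one eight, wheels.\n",
    "<SYNC Start=2474996><P Class=ENCC>\n",
    "<FONT COLOR=\"ff99ff\"> You'll need something to keep her in\n",
    "<SYNC Start=2495728><P Class=ENCC>&nbsp;\n",
]
for text in text_from_lines(sample):
    print(text)
```

Because an open file yields one line per iteration, the same function should accept the file object directly in place of the list.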
