Getting text from file

Question

gujjar19 0 Newbie Poster

16 Years Ago

Hi, I'm learning Python and need a little help.
I have a text file that contains the following data:

<SYNC Start=5047>
Back, back, back, back!
<SYNC Start=7235> 
<SYNC Start=10725>
Yeah, Dan!

I want to extract only the text, i.e.

Back, back, back, back!
Yeah, Dan!

I'm using this code, but it gives me the lines that I want to ignore:

new_list = []
ignore = False
for line in file("tstl.txt"):
    data_list = line.split(" ")
    if line.startswith("<"):
        if line.endswith(">") or line.endswith(";"):
            ignore = True
    if ignore:
        new_list.append(line)

fout = open("tst2.txt", "w")
fout.writelines(new_list)
fout.close()

Could somebody guide me in this regard?

python

jice 53 Posting Whiz in Training

16 Years Ago

if not ignore:
    new_list.append(line)
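
For context, here is a minimal sketch of that fix inside a complete loop. It rests on two assumptions that are mine and not from the original post: the flag is recomputed for every line (otherwise the first tag line would suppress everything after it), and whitespace is stripped before calling endswith(), because a line read from a file ends in '\n' and so never ends with '>'. The helper name keep_text_lines is hypothetical.

```python
def keep_text_lines(lines):
    """Return the lines that are not pure tag lines such as '<SYNC Start=5047>'."""
    kept = []
    for line in lines:
        stripped = line.strip()
        # Recompute the flag on every line; a flag that is set once and
        # never reset would cause every later line to be ignored too.
        ignore = stripped.startswith("<") and stripped.endswith((">", ";"))
        if not ignore and stripped:
            kept.append(stripped + "\n")
    return kept


# Sample data taken from the question, with the newlines a file would supply.
sample = [
    "<SYNC Start=5047>\n",
    "Back, back, back, back!\n",
    "<SYNC Start=7235> \n",
    "<SYNC Start=10725>\n",
    "Yeah, Dan!\n",
]
print("".join(keep_text_lines(sample)), end="")
```

The same function should accept an open file object in place of the list, since iterating a file yields one line at a time.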

sneekula 969 Nearly a Posting Maven

16 Years Ago

Looks like you could simplify your code a little:

data = """\
<SYNC Start=5047><P Class=ENCC>
Back, back, back, back!
<SYNC Start=7235><P Class=ENCC>&nbsp;
<SYNC Start=10725><P Class=ENCC>
Yeah, Dan!"""

# this would be like the file data
lines = data.split('\n')

new_data = ""
for line in lines:
    if not line.startswith('<'):
        new_data += line + '\n'

# test it
print(new_data)

"""
my output -->
Back, back, back, back!
Yeah, Dan!
"""

jlm699 320 Veteran Poster

16 Years Ago

Here's a possible way to get you started (note this is untested):

inp = """<FONT COLOR="ff99ff"> So hold onto your special friend </font>
- Wheels, one eight, wheels.
<SYNC Start=2474996><P Class=ENCC>
<FONT COLOR="ff99ff"> You'll need something to keep her in
<SYNC Start=2480275><P Class=ENCC>
<FONT COLOR="ff99ff"> now you stay inside this foolish grin
<SYNC Start=2486187><P Class=ENCC>
<FONT COLOR="ff99ff"> though any day your secrets end
<SYNC Start=2490682><P Class=ENCC>
<FONT COLOR="ff99ff"> but then again years may go by
<SYNC Start=2495728><P Class=ENCC>&nbsp;"""
out = []
found = False
for character in inp:
    if not found:
        if character == '<':
            found = True
        else:
            out.append(character)
    else:
        if character == '>':
            found = False
    
print(''.join(out))

gujjar19 0 Newbie Poster · Answer 1 · 2009-06-04T00:03:07+00:00

What if I have a text file whose lines look like this:

So hold onto your special friend 
- Wheels, one eight, wheels.
<SYNC Start=2474996>
 You'll need something to keep her in
<SYNC Start=2480275>
 now you stay inside this foolish grin
<SYNC Start=2486187>
 though any day your secrets end
<SYNC Start=2490682>
 but then again years may go by
<SYNC Start=2495728>

and I want to extract only the text that makes sense, like

Wheels, one eight, wheels.
You'll need something to keep her in
now you stay inside this foolish grin
though any day your secrets end

In this case the above code does not work. Could you suggest something?

gujjar19 0 Newbie Poster · Answer 2 · 2009-06-04T12:43:02+00:00

That's great, it worked better than I thought. Thx.
What if I want to ignore more characters at the start or end of a line? Also, what if I want to get the text from a file rather than from a few lines of text? For example, I have a file with more than 10,000 lines and I want to extract only the relevant text, like this:

inp = open("friends.smi", 'r')
out = []
found = False
for character in inp:
    if not found:
        if character == '<' or character == '&':
            found = True
        else:
            out.append(character)
    else:
        if character == '>':
            found = False

fout = open("1.txt", "w")
fout.writelines(out)
fout.close()

This doesn't separate the relevant text from the rest of the file. What could be the problem?

jlm699 320 Veteran Poster · Answer 3 · 2009-06-04T18:29:46+00:00

Well, for one, I suggest removing the ampersand as a criterion for ignoring. Since you know that &nbsp; will consistently appear, you can simply replace it at the end, after you join all your characters back together (which is something you omitted). Another thing you can replace is the double newline ( \n\n ).

I also suggest using write() instead of writelines() in this case. Look over the modifications and ask about anything you don't understand:

inp=open("friends.smi", 'r')
out = []
found = False
for character in inp:
    if not found:
        if character == '<':
            found = True
        else:
            out.append(character)
    else:
        if character == '>':
            found = False
new_text = ''.join( out ).replace('&nbsp;', '').replace('\n\n', '\n')
fout = open("1.txt", "w")
fout.write(new_text)
fout.close()

gujjar19 0 Newbie Poster · Answer 4 · 2009-06-05T12:42:23+00:00

Ignoring on a per-character basis doesn't work for a file. When I use a text file whose lines repeat the following pattern, i.e.

<FONT COLOR="ff99ff"> So hold onto your special friend </font>
- Wheels, one eight, wheels.
<SYNC Start=2474996><P Class=ENCC>
<FONT COLOR="ff99ff"> You'll need something to keep her in
<SYNC Start=2480275><P Class=ENCC>
<FONT COLOR="ff99ff"> now you stay inside this foolish grin
<SYNC Start=2486187><P Class=ENCC>
<FONT COLOR="ff99ff"> though any day your secrets end
<SYNC Start=2490682><P Class=ENCC>
<FONT COLOR="ff99ff"> but then again years may go by
<SYNC Start=2495728><P Class=ENCC>&nbsp;

It returns the whole text file as it was before parsing, without ignoring any character:

<FONT COLOR="ff99ff"> So hold onto your special friend </font>
- Wheels, one eight, wheels.
<SYNC Start=2474996><P Class=ENCC>
<FONT COLOR="ff99ff"> You'll need something to keep her in
<SYNC Start=2480275><P Class=ENCC>
<FONT COLOR="ff99ff"> now you stay inside this foolish grin
<SYNC Start=2486187><P Class=ENCC>
<FONT COLOR="ff99ff"> though any day your secrets end
<SYNC Start=2490682><P Class=ENCC>
<FONT COLOR="ff99ff"> but then again years may go by
<SYNC Start=2495728><P Class=ENCC>&nbsp;

What could be the problem?

jlm699 320 Veteran Poster · Answer 5 · 2009-06-05T19:46:52+00:00

What could be the problem?

You're not reading the file. When you iterate over a file handle you iterate line by line. So no line will ever equal '<' or '>'.
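
A minimal sketch of one way to address this, assuming the file is small enough to load into memory: call read() so that the loop runs over a single string, which yields one character at a time. The function names strip_tags and strip_file are hypothetical; the file names in the usage note are the ones from the earlier posts.

```python
def strip_tags(text):
    """Drop everything between '<' and '>' using the same flag-based scan."""
    out = []
    inside_tag = False
    for character in text:
        if inside_tag:
            if character == ">":
                inside_tag = False
        elif character == "<":
            inside_tag = True
        else:
            out.append(character)
    return "".join(out)


def strip_file(src_path, dst_path):
    with open(src_path) as inp:
        text = inp.read()  # one string; iterating it yields single characters
    cleaned = strip_tags(text).replace("&nbsp;", "").replace("\n\n", "\n")
    with open(dst_path, "w") as fout:
        fout.write(cleaned)
```

Calling strip_file("friends.smi", "1.txt") would then mirror the earlier script. Note that a single replace('\n\n', '\n') pass only halves runs of blank lines, so some empty lines may remain.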

gujjar19 0 Newbie Poster · Answer 6 · 2009-06-06T02:09:07+00:00

I got that. Is there some other way to do it? I mean by parsing lines only.
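
One possible line-only approach, offered as a sketch rather than as an answer from the thread: remove the tags from each line with a regular expression and keep whatever text is left. It assumes that no tag spans more than one line and that '&nbsp;' is the only entity present. The names TAG, clean_line and extract_text are hypothetical.

```python
import re

TAG = re.compile(r"<[^>]*>")  # one tag: '<', anything except '>', then '>'


def clean_line(line):
    """Remove tags and '&nbsp;' from a single line and trim whitespace."""
    return TAG.sub("", line).replace("&nbsp;", "").strip()


def extract_text(lines):
    """Yield the non-empty text left on each line once markup is removed."""
    for line in lines:
        text = clean_line(line)
        if text:
            yield text
```

Because a file object yields lines, extract_text(open("friends.smi")) should work directly, and each result can be written out followed by '\n'.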
