May 26th, 2009

Parsing text from a SMI file and writing to a file

Hi, I am new to Python. I am working on parsing text from an SMI file. In one case I want to extract only the dialogue and ignore the timestamps (lines starting with <SYNC). Below is part of the SMI file:

<SAMI>

<HEAD>

<Title>ÁŠžñÀ» ÀûŸî ÁÖŒŒ¿ä.</Title>

<Style TYPE="text/css">

<!--

P {margin-left:8pt; margin-right:8pt; margin-bottom:2pt; margin-top:2pt;

text-align:center; font-size:22pt; font-family: Arial, Sans-serif;

font-weight:bold; color:white;}

.KRCC {Name:Korean; lang:ko-KR; SAMIType:CC;}

.ENCC {Name:English; lang:en-US; SAMIType:CC;}

#STDPrn {Name:Standard Print;}

#VLargePrn {Name:34pt (VLarge Print); font-size:34pt;}

#LargePrn {Name:28pt (Large Print); font-size:28pt;}

#MediumPrn {Name:24pt (Medium Print); font-size:24pt;}

#BSmallPrn {Name:18pt (BSmall Print); font-size:18pt;}

#SmallPrn {Name:12pt (Small Print); font-size:12pt;}

-->

</Style>

<!--


-->

</HEAD>

<BODY>

<SYNC Start=52><P Class=ENCC>

Subtitles by Korea NSC Subtitle Team <br>

(http://club.nate.com/tsm)

<SYNC Start=3989><P Class=ENCC>&nbsp;

<SYNC Start=5047><P Class=ENCC>

Back, back, back, back!

<SYNC Start=7235><P Class=ENCC>&nbsp;

<SYNC Start=10725><P Class=ENCC>

Yeah, Dan!

I want to extract only the lines that contain sentence text, e.g.
Back, back, back, back!
and write them to a file.
I have written a function, but it doesn't give the required output.

def getText( inFile ):
    text = []
    file = open( inFile, "ra" )
    wholefile = file.readlines()
    for line in wholefile:
        line = line.strip()
        if line.startswith("<"):
            break
        elif line.endswith(":>"):
            break
        else:
            continue
        text.append(line)
    text.sort()
    return text

What could be the problem with this code, or is there another way to do it?
In the second case I want to keep the timestamp counts and ignore the text while parsing. How could that be achieved?
Posted by gujjar19

May 26th, 2009

Re: Parsing text from a SMI file and writing to a file

Please use code tags.

I see two problems with your code:
1. No data will be written to the text list. It is a matter of logic: if you exit the loop when a line starts with "<" or ends with ":>", and skip every other line with continue, then text.append(line) is never reached.
2. I am not sure text.sort() does what you want. Why is it needed?
A program pattern that does what you want looks like:
fi = open("the filename")
output = list()
for line in fi:
    if <line is to be output>:
        output.append(line)
fi.close()
return output

fileobject.readlines() is a pain when files get bigger, because it reads the whole file into memory at once.
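For reference, here is a minimal runnable sketch of the pattern above. It assumes the "line is to be output" test is passed in as a function; the names filter_lines, keep and not_a_tag are made up for illustration and are not from the original post.

```python
def filter_lines(filename, keep):
    """Return the lines of filename for which keep(line) is true."""
    output = []
    # Iterating over the file object reads one line at a time,
    # so the whole file never has to sit in memory at once.
    with open(filename) as fi:
        for line in fi:
            if keep(line):
                output.append(line)
    return output


def not_a_tag(line):
    # Hypothetical test: keep non-empty lines that do not start with "<".
    stripped = line.strip()
    return bool(stripped) and not stripped.startswith("<")
```

Calling filter_lines("movie.smi", not_a_tag) would then return the candidate text lines, where "movie.smi" stands in for whatever file you are reading.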
Posted by slate

May 26th, 2009

Re: Parsing text from a SMI file and writing to a file

Thanks for correcting me and for the code, but I don't see how your suggested code would differentiate between a line that starts with the delimiter "<" and a line that contains text. Could you explain it a bit?
Posted by gujjar19

May 26th, 2009

Re: Parsing text from a SMI file and writing to a file

If you simply drop the else part, text will contain every line that does not begin with "<" and does not end with ":>":
for line in wholefile:
    line = line.strip()
    if line.startswith("<") or line.endswith(":>"):
        continue
    text.append(line)

line.find("Back, back, back, back!") == -1
That means the line does not contain the string given as the argument, if that is what you mean.
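A small sketch of that test, with a made-up sample line. It also shows the "in" operator, which expresses the same containment check more directly than comparing find() with -1.

```python
line = "Back, back, back, back!"

# str.find returns the index of the first match, or -1 if there is none.
first_lower = line.find("back")
missing = line.find("Dan") == -1

# The "in" operator performs the same containment test.
also_missing = "Dan" not in line

print(first_lower, missing, also_missing)
```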

If you want to know whether a line is text in general, that is hard, because the whole file seems to be a text file.

I do not know the SMI format, but it looks like an XML file. If the file is reasonable in size (under 1 GB), you could try a Python DOM library, maybe xml.etree.ElementTree from the standard library.

Otherwise, please post your minimal code (in code tags) that runs, and give us the error or the unexpected result you had.
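One caveat worth checking before going the XML route: the sample in this thread uses unquoted attribute values (Start=52) and never closes its SYNC and P tags, so a strict XML parser is likely to reject it. The sketch below tries ElementTree on a one-line excerpt and then falls back to the more tolerant html.parser module from the standard library. The TextCollector class is a hypothetical helper written for this example, not part of any library.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

sample = "<BODY><SYNC Start=52><P Class=ENCC>Back, back, back, back!</BODY>"

# A strict XML parser rejects the unquoted attribute value (Start=52).
try:
    ET.fromstring(sample)
    parsed_as_xml = True
except ET.ParseError:
    parsed_as_xml = False


class TextCollector(HTMLParser):
    """Hypothetical helper: collects the text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Ignore whitespace-only runs between tags.
        if data.strip():
            self.chunks.append(data.strip())


collector = TextCollector()
collector.feed(sample)
collector.close()
print(parsed_as_xml, collector.chunks)
```

On a complete file the style sheet in the header would probably be reported as text as well, so it would need to be skipped separately.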
Posted by slate

May 29th, 2009

Re: Parsing text from a SMI file and writing to a file

When I use the code below, it gives the following error:

def getText(inFile):
text=[ ]
file = open( inFile, "ra" )

for line in wholefile:
line = line.strip()
if line.startswith("<") or line.endswith(">"):
continue
text.append(line)
return text

File "t1.py", line 12
continue
SyntaxError: 'continue' not properly in loop

Also, can I write this code in a python shell, without defining a function and passing it to main.
Another thing, how can I achieve multiple matches from a line in a single if statement? For example, I need to ignore a line that starts with '<' and ends with either '>' or ';'.
Posted by gujjar19

May 29th, 2009

Re: Parsing text from a SMI file and writing to a file

Quote ...
File "t1.py", line 12
continue
SyntaxError: 'continue' not properly in loop
Your code is not indented inside the function.

Try:
def getText(inFile):
    text = []
    file = open(inFile, "r")
    # Iterate over the file object itself; wholefile was never defined.
    for line in file:
        line = line.strip()
        # Skip tag lines and keep everything else.
        if line.startswith("<") or line.endswith(">"):
            continue
        text.append(line)
    file.close()
    return text

Quote ...
Also, can I write this code in a python shell, without defining a function and passing it to main.
Is this a question? I do not understand what you mean. If it's not, then neither


Quote ...
how can I achieve multiple matches from a line in a single if statement? For example, I need to ignore a line that starts with '<' and ends with either '>' or ';'.
Can you please give an example?
Last edited by slate; May 29th, 2009 at 9:38 am.
Posted by slate

May 29th, 2009

Re: Parsing text from a SMI file and writing to a file

Quote ...
how can I achieve multiple matches from a line in a single if statement? For example, I need to ignore a line that starts with '<' and ends with either '>' or ';'.
The simplest way, without using and/or combinations, is to use an indicator which you set to True if found.
test_data = ["keep this line>",
             "<ignore this line>",
             "<ignore this line also;"]

for rec in test_data:
    rec = rec.strip()
    # Reset the indicator for every record so one match
    # does not carry over to the lines that follow.
    ignore = False
    if rec.startswith("<"):
        if rec.endswith(">") or rec.endswith(";"):
            ignore = True

    if ignore:
        print("Ignore", rec)
    else:
        print("OK", rec)
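If you do not mind an and/or combination, the same check also fits in a single if statement, because startswith and endswith accept a tuple of alternatives. This is only a sketch; the fourth test line is added here for illustration.

```python
test_data = ["keep this line>",
             "<ignore this line>",
             "<ignore this line also;",
             "Back, back, back, back!"]

kept = []
for rec in test_data:
    rec = rec.strip()
    # endswith((">", ";")) is true if the string ends with either suffix,
    # so "starts with < and ends with > or ;" becomes one condition.
    if rec.startswith("<") and rec.endswith((">", ";")):
        continue
    kept.append(rec)

print(kept)
```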
Posted by woooee

May 30th, 2009

Re: Parsing text from a SMI file and writing to a file

In reply to woooee.
Forgive my ignorance. With this code I am still not able to extract the text that I want, e.g. from the lines below (the file has more than 2000 lines of similar text):
</HEAD>
<BODY>
<SYNC Start=52><P Class=ENCC>
Subtitles by Korea NSC Subtitle Team <br>
(http://club.nate.com/tsm)
<SYNC Start=3989><P Class=ENCC>&nbsp;
<SYNC Start=5047><P Class=ENCC>
Back, back, back, back!
<SYNC Start=7235><P Class=ENCC>&nbsp;
<SYNC Start=10725><P Class=ENCC>
Yeah, Dan!
<SYNC Start=11984><P Class=ENCC>
Hey! Hey!
<SYNC Start=14072><P Class=ENCC>&nbsp;
<SYNC Start=15212><P Class=ENCC>
Back, back, back, back!
<SYNC Start=17249><P Class=ENCC>&nbsp;
I want only the text lines to be written to a new file and the rest of the lines to be ignored, e.g. keep this text:
Back, back, back, back!
Yeah, Dan!
Hey! Hey!
and ignore the rest.
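One possible way to get both results (the dialogue for the first case, the timestamps for the second) is sketched below. It assumes the layout shown in the sample: everything before <BODY> is header, each timestamp sits in a tag of the form <SYNC Start=N>, and the dialogue is whatever remains on a line once tags and &nbsp; are removed. The name parse_smi and both regular expressions are made up for this example.

```python
import re

SYNC_RE = re.compile(r"<SYNC\s+Start=(\d+)>", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]*>")


def parse_smi(lines):
    """Return (dialogue, timestamps) from the lines of a SAMI file."""
    dialogue, timestamps = [], []
    in_body = False
    for line in lines:
        line = line.strip()
        if not in_body:
            # Skip the header (title, style sheet) up to the <BODY> tag.
            in_body = line.upper().startswith("<BODY")
            continue
        match = SYNC_RE.match(line)
        if match:
            timestamps.append(int(match.group(1)))
        # Remove any tags and the &nbsp; placeholder; keep what is left.
        text = TAG_RE.sub("", line).replace("&nbsp;", "").strip()
        if text:
            dialogue.append(text)
    return dialogue, timestamps


sample = """<HEAD>
<Title>demo</Title>
</HEAD>
<BODY>
<SYNC Start=5047><P Class=ENCC>
Back, back, back, back!
<SYNC Start=7235><P Class=ENCC>&nbsp;
<SYNC Start=10725><P Class=ENCC>
Yeah, Dan!
""".splitlines()

dialogue, timestamps = parse_smi(sample)
print(dialogue)
print(timestamps)
```

To process a real file, pass an open file object instead of the sample list, and write the result with something like open("dialogue.txt", "w").write("\n".join(dialogue)), where the output filename is only a placeholder. The garbled title in the sample suggests the file is not plain ASCII, so opening it with errors="replace" may be needed.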
Posted by gujjar19

This thread is solved
