Parsing text from a SMI file and writing to a file

Thread Solved

#1 | gujjar19 | May 26th, 2009
Hi, I am new to Python. I am working on parsing text from an SMI file. In the first case, I want to extract only the dialogue and ignore the timestamps (lines starting with <SYNC). Below is part of the SMI file:

<SAMI>

<HEAD>

<Title>ÁŠžñÀ» ÀûŸî ÁÖŒŒ¿ä.</Title>

<Style TYPE="text/css">

<!--

P {margin-left:8pt; margin-right:8pt; margin-bottom:2pt; margin-top:2pt;

text-align:center; font-size:22pt; font-family: Arial, Sans-serif;

font-weight:bold; color:white;}

.KRCC {Name:Korean; lang:ko-KR; SAMIType:CC;}

.ENCC {Name:English; lang:en-US; SAMIType:CC;}

#STDPrn {Name:Standard Print;}

#VLargePrn {Name:34pt (VLarge Print); font-size:34pt;}

#LargePrn {Name:28pt (Large Print); font-size:28pt;}

#MediumPrn {Name:24pt (Medium Print); font-size:24pt;}

#BSmallPrn {Name:18pt (BSmall Print); font-size:18pt;}

#SmallPrn {Name:12pt (Small Print); font-size:12pt;}

-->

</Style>

<!--


-->

</HEAD>

<BODY>

<SYNC Start=52><P Class=ENCC>

Subtitles by Korea NSC Subtitle Team <br>

(http://club.nate.com/tsm)

<SYNC Start=3989><P Class=ENCC>&nbsp;

<SYNC Start=5047><P Class=ENCC>

Back, back, back, back!

<SYNC Start=7235><P Class=ENCC>&nbsp;

<SYNC Start=10725><P Class=ENCC>

Yeah, Dan!

I want to extract only the lines that contain dialogue text, e.g.
Back, back, back, back!
and write them to a file.
I have written a function, but it doesn't give the required output.

def getText( inFile ):

text = []
file = open( inFile, "ra" )

wholefile = file.readlines()

for line in wholefile:

line = line.strip()

if line.startswith("<"):
break
elif line.endswith(:>)
break
else
continue
text.append(line)

text.sort()
return text

What could be the problem with this code, or is there another way to do it?
In the second case, I want to keep only the timestamps and ignore the text while parsing. How could this be achieved?

#2 | slate | May 26th, 2009
Please use code tags.

I see two problems with your code:
1. No data will ever be appended to the text list. It's a matter of logic: if you exit the loop when a line starts with "<" or ends with ":>", and skip every other line with continue, then nothing is ever processed.
2. I am not sure text.sort() does what you want. Why is it needed?

A program pattern that does what you want looks like this:

    fi = open("the filename")
    output = list()
    for line in fi:
        if <line is to be output>:
            output.append(line)
    fi.close()
    return output

fileobject.readlines() is a pain when files get bigger, because it reads the whole file into memory at once.
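
The pattern above can be made concrete by passing the test in as a function. This is only a minimal sketch: the names filter_lines and is_dialogue are illustrative and not from the thread, and the rule inside is_dialogue is an assumption about what counts as a line "to be output". A short in-memory list stands in for the file.

```python
def filter_lines(lines, keep):
    """Return the stripped lines for which keep(line) is true."""
    output = []
    for line in lines:          # a file object can be iterated the same way
        line = line.strip()
        if keep(line):
            output.append(line)
    return output


def is_dialogue(line):
    # Assumed rule: keep non-empty lines that do not start with a tag
    return bool(line) and not line.startswith("<")


sample = [
    "<SYNC Start=5047><P Class=ENCC>\n",
    "Back, back, back, back!\n",
    "\n",
]
result = filter_lines(sample, is_dialogue)
print(result)
```

With a real file, the list would be replaced by an open file object, which yields one line at a time instead of loading everything as readlines() does.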

#3 | gujjar19 | May 26th, 2009
Thanks for correcting me and for the code, but I don't see how your suggested code would differentiate between a line that starts with the delimiter "<" and a line that contains text. Could you explain it a bit?

#4 | slate | May 26th, 2009
If you simply drop the else part, text will contain all lines that do not begin with "<" and do not end with ":>":

    for line in wholefile:
        line = line.strip()
        # skip markup lines, keep everything else
        if line.startswith("<") or line.endswith(":>"):
            continue
        text.append(line)

line.find("Back, back, back, back!") == -1
That means the line does not contain the string given as the argument, if that is what you mean.
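
A small sketch of that test, using a line from the sample file. str.find returns the index of the first match, or -1 when there is none; the in operator expresses the same check more directly.

```python
line = "Back, back, back, back!"

first = line.find("back")        # index of the first lowercase "back"
missing = line.find("Dan")       # -1, because "Dan" does not occur

print(first)
print(missing == -1)
print("Dan" not in line)         # same test as the line above
```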

If you want to know whether a line is text in general, that is hard, because the whole file seems to be a text file.

I do not know the SMI format, but it seems to be a valid XML file. If the file is of reasonable size (<1 GB), you can use a Python DOM library, maybe xml.etree.ElementTree from the standard library.

Otherwise, please post your minimal code (in code tags) that runs, and give us the error or the unexpected result you got.
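
Whether the excerpt from post #1 really parses as XML can be checked directly. The sketch below assumes one SYNC line from the sample, wrapped in a BODY element for the test. Its attribute values are unquoted (Start=5047) and the SYNC and P tags are never closed, so a strict parser such as xml.etree.ElementTree is expected to reject it.

```python
import xml.etree.ElementTree as ET

# One subtitle entry from the sample, wrapped in BODY for the test
snippet = "<BODY><SYNC Start=5047><P Class=ENCC>Back, back, back, back!</BODY>"

try:
    ET.fromstring(snippet)
    parsed = True
except ET.ParseError as err:
    # Unquoted attribute values are not well-formed XML
    parsed = False
    print("not well-formed XML: %s" % err)

print(parsed)
```

If the check fails like this on the real file as well, a line-based or regular-expression approach is probably the simpler route.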

#5 | gujjar19 | May 29th, 2009
When I use the code below, it gives the following error:

def getText(inFile):
text=[ ]
file = open( inFile, "ra" )

for line in wholefile:
line = line.strip()
if line.startswith("<") or line.endswith(">"):
continue
text.append(line)
return text

File "t1.py", line 12
continue
SyntaxError: 'continue' not properly in loop

Also, can I write this code in a Python shell, without defining a function and passing it to main?
Another thing: how can I test for multiple matches on a line in a single if statement? For example, I need to ignore a line that starts with '<' and ends with either '>' or ';'.

#6 | slate | May 29th, 2009
File "t1.py", line 12
continue
SyntaxError: 'continue' not properly in loop
Your code is not indented properly inside the function.

Try:

    def getText(inFile):
        text = []
        file = open(inFile, "r")

        # iterate over the file object itself; wholefile was never defined
        for line in file:
            line = line.strip()
            if line.startswith("<") or line.endswith(">"):
                continue
            text.append(line)
        file.close()
        return text

"Also, can I write this code in a Python shell, without defining a function and passing it to main."
Is this a question? I do not understand what you mean. If it's not, then neither

"How can I achieve multiple matches from a line in a single if statement, like I need to ignore a line that starts with '<' and ends either with '>' or ';'"
Can you please give an example?
Last edited by slate; May 29th, 2009 at 9:38 am.
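
On the shell question quoted in post #6: the same loop also works at the top level of a script or in the interactive interpreter, with no function wrapper. A minimal sketch, assuming an in-memory list of sample lines in place of the open file; the names sample_lines and text, and the file name movie.smi in the comment, are illustrative.

```python
sample_lines = [
    "<SYNC Start=5047><P Class=ENCC>\n",
    "Back, back, back, back!\n",
    "<SYNC Start=7235><P Class=ENCC>&nbsp;\n",
]

text = []
for line in sample_lines:   # with a real file: for line in open("movie.smi")
    line = line.strip()
    if line.startswith("<") or line.endswith(">"):
        continue            # legal here because it is inside the for loop
    text.append(line)

print(text)
```

In the interactive interpreter, an empty line after the loop body closes the for block before print(text) is typed.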

#7 | woooee | May 29th, 2009
"How can I achieve multiple matches from a line in a single if statement, like I need to ignore a line that starts with '<' and ends either with '>' or ';'"
The simplest way, without using and/or combinations, is to use an indicator that you set to True when a match is found.

    test_data = [ "keep this line>",
                  "<ignore this line>",
                  "<ignore this line also;" ]

    for rec in test_data:
        rec = rec.strip()
        ignore = False      # reset the indicator for every record
        if rec.startswith("<"):
            if rec.endswith(">") or rec.endswith(";"):
                ignore = True

        if ignore:
            print "Ignore",
        else:
            print "OK",
        print rec
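
The same rule can also be written as a single if statement. A minimal sketch, reusing the test data from post #7 plus one extra record added here for illustration; it relies on str.endswith accepting a tuple of suffixes and matching any of them.

```python
test_data = [
    "keep this line>",
    "<ignore this line>",
    "<ignore this line also;",
    "<keep this one too",      # extra record, not from the thread
]

kept = []
for rec in test_data:
    rec = rec.strip()
    # starts with "<" AND ends with either ">" or ";"
    if rec.startswith("<") and rec.endswith((">", ";")):
        continue
    kept.append(rec)

print(kept)
```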

#8 | gujjar19 | May 30th, 2009
In reply to woooee:
Forgive my ignorance. With this code I am still not able to extract the text I want, e.g. from the lines below (the file has more than 2000 lines of similar text):

</HEAD>
<BODY>
<SYNC Start=52><P Class=ENCC>
Subtitles by Korea NSC Subtitle Team <br>
(http://club.nate.com/tsm)
<SYNC Start=3989><P Class=ENCC>&nbsp;
<SYNC Start=5047><P Class=ENCC>
Back, back, back, back!
<SYNC Start=7235><P Class=ENCC>&nbsp;
<SYNC Start=10725><P Class=ENCC>
Yeah, Dan!
<SYNC Start=11984><P Class=ENCC>
Hey! Hey!
<SYNC Start=14072><P Class=ENCC>&nbsp;
<SYNC Start=15212><P Class=ENCC>
Back, back, back, back!
<SYNC Start=17249><P Class=ENCC>&nbsp;

I want only the text lines to be written to a new file and the rest of the lines to be ignored, i.e. keep this text:
Back, back, back, back!
Yeah, Dan!
Hey! Hey!
and ignore everything else.
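
One possible way to get both results from the lines shown in post #8 is to strip the markup with a regular expression instead of testing how a line starts or ends. This is a sketch under assumptions, not a tested SMI parser: it assumes subtitle text only appears after the <BODY> tag, that &nbsp; marks an empty caption, and that the timestamp is the number in <SYNC Start=...>. The names TAG, SYNC, extract_dialogue, extract_timestamps and the output file name dialogue.txt are all illustrative.

```python
import os
import re
import tempfile

TAG = re.compile(r"<[^>]*>")                              # any <...> markup
SYNC = re.compile(r"<SYNC\s+Start=(\d+)", re.IGNORECASE)  # timestamp lines


def extract_dialogue(lines):
    """Case 1: return the text found after <BODY>, with tags removed."""
    in_body = False
    text = []
    for line in lines:
        line = line.strip()
        if not in_body:
            # skip the header (title, CSS) until the <BODY> tag is seen
            in_body = line.upper().startswith("<BODY")
            continue
        cleaned = TAG.sub("", line).replace("&nbsp;", " ").strip()
        if cleaned:
            text.append(cleaned)
    return text


def extract_timestamps(lines):
    """Case 2: return the Start values of the <SYNC> lines as integers."""
    stamps = []
    for line in lines:
        match = SYNC.search(line)
        if match:
            stamps.append(int(match.group(1)))
    return stamps


sample = [
    "</HEAD>",
    "<BODY>",
    "<SYNC Start=52><P Class=ENCC>",
    "Subtitles by Korea NSC Subtitle Team <br>",
    "(http://club.nate.com/tsm)",
    "<SYNC Start=3989><P Class=ENCC>&nbsp;",
    "<SYNC Start=5047><P Class=ENCC>",
    "Back, back, back, back!",
    "<SYNC Start=7235><P Class=ENCC>&nbsp;",
    "<SYNC Start=10725><P Class=ENCC>",
    "Yeah, Dan!",
]

dialogue = extract_dialogue(sample)
stamps = extract_timestamps(sample)

# Write the dialogue to a new file, one line per entry
out_path = os.path.join(tempfile.mkdtemp(), "dialogue.txt")
with open(out_path, "w") as out:
    out.write("\n".join(dialogue) + "\n")

print(dialogue)
print(stamps)
```

The credit lines at the top of the body are kept as well, since they are ordinary text; they would need a separate rule to be dropped. With the real file, sample would be replaced by an open file object, and in Python 3 an explicit encoding argument to open() may be needed if the file is not in the platform's default encoding.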