Parsing text from a SMI file and writing to a file
Join Date: May 2009
Posts: 13
Reputation:
Solved Threads: 0
Hi, I am new to Python. I am working on parsing text from an SMI file. In the first case, I want to extract only the dialogue and ignore the timestamps (lines starting with <SYNC). For example, below is part of the SMI file:
<SAMI>
<HEAD>
<Title>ÁŠžñÀ» ÀûŸî ÁÖŒŒ¿ä.</Title>
<Style TYPE="text/css">
<!--
P {margin-left:8pt; margin-right:8pt; margin-bottom:2pt; margin-top:2pt;
text-align:center; font-size:22pt; font-family: Arial, Sans-serif;
font-weight:bold; color:white;}
.KRCC {Name:Korean; lang:ko-KR; SAMIType:CC;}
.ENCC {Name:English; lang:en-US; SAMIType:CC;}
#STDPrn {Name:Standard Print;}
#VLargePrn {Name:34pt (VLarge Print); font-size:34pt;}
#LargePrn {Name:28pt (Large Print); font-size:28pt;}
#MediumPrn {Name:24pt (Medium Print); font-size:24pt;}
#BSmallPrn {Name:18pt (BSmall Print); font-size:18pt;}
#SmallPrn {Name:12pt (Small Print); font-size:12pt;}
-->
</Style>
<!--
-->
</HEAD>
<BODY>
<SYNC Start=52><P Class=ENCC>
Subtitles by Korea NSC Subtitle Team <br>
(http://club.nate.com/tsm)
<SYNC Start=3989><P Class=ENCC>
<SYNC Start=5047><P Class=ENCC>
Back, back, back, back!
<SYNC Start=7235><P Class=ENCC>
<SYNC Start=10725><P Class=ENCC>
Yeah, Dan!
I want to extract only the lines that contain a sentence of text, e.g.
Back, back, back, back!
and write them into a file.
I have written a function, but it doesn't give the required output.
def getText( inFile ):
    text = []
    file = open( inFile, "ra" )
    wholefile = file.readlines()
    for line in wholefile:
        line = line.strip()
        if line.startswith("<"):
            break
        elif line.endswith(:>)
            break
        else
            continue
        text.append(line)
    text.sort()
    return text
What could be the problem with this code, or is there another way to do it?
In the second case, I want to keep the timestamps and count them, ignoring the text while parsing. How could that be achieved?
Join Date: Jun 2008
Posts: 122
Reputation:
Solved Threads: 30
Please use code tag.
I see two problems with your code:
1. No data will ever be written to the text list. It's a logic error: if you exit the loop when a line starts with "<" or ends with ":>", and skip the line otherwise, then you never process anything.
2. I am not sure text.sort() does what you want. Why is it needed?
A program pattern that does what you want looks like this:
fi = open("the filename")
output = list()
for line in fi:
    if <line is to be output>:
        output.append(line)
fi.close()
return output
fileobject.readlines is a pain when files get bigger, because it loads every line into memory at once.
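The pattern above could be made concrete roughly as follows. This is only a sketch: the names collect_lines and is_plain_text are made up for illustration, and the utf-8 encoding with errors="replace" is an assumption (the posted file looks like it was saved in a Korean code page, so the real encoding may differ).

```python
import os
import tempfile


def collect_lines(path, keep):
    """Return the stripped lines of the file at `path` for which keep(line) is true."""
    output = []
    # Iterating over the file object reads one line at a time,
    # so the whole file is never held in memory at once.
    # The encoding here is an assumption, not taken from the thread.
    with open(path, encoding="utf-8", errors="replace") as fi:
        for line in fi:
            line = line.strip()
            if keep(line):
                output.append(line)
    return output


def is_plain_text(line):
    # Hypothetical predicate: keep non-empty lines that do not start with a tag.
    return bool(line) and not line.startswith("<")


# Small demonstration with a throwaway file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".smi", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("<SYNC Start=5047><P Class=ENCC>\nBack, back, back, back!\n\n")
    demo_path = tmp.name

kept = collect_lines(demo_path, is_plain_text)
os.remove(demo_path)
print(kept)
```

Passing the test as a function keeps the loop itself unchanged when the rule for "line is to be output" changes.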
Join Date: Jun 2008
Posts: 122
Reputation:
Solved Threads: 30
If you simply drop the else part, the text list will contain all lines that do not begin with "<" and do not end with ":>".
for line in wholefile:
    line = line.strip()
    if line.startswith("<") or line.endswith(":>"):
        continue
    text.append(line)
line.find("Back, back, back, back!") == -1
That means the line does not contain the string given as the argument, if that is what you mean.
If you want to know whether a line is text in general, that is hard, because the whole file seems to be a text file.
I do not know the SMI format, but it looks like XML. If it parses as XML and the file is reasonable in size (<1G), you can use a Python DOM library, maybe xml.etree.ElementTree from the standard library.
Otherwise, please post your minimal code (in code tags) that runs, and give us the error or the unexpected result you had.
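Before reaching for ElementTree, it may be worth checking whether the file really parses as XML. The sample as posted has unquoted attribute values (Start=52) and unclosed <SYNC> and <P> tags, which a strict XML parser rejects. The sketch below tests that on a shortened, hypothetical sample and then uses the more lenient html.parser.HTMLParser from the standard library instead. The class name SamiCollector and the helper parses_as_xml are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical, shortened sample in the style of the posted file.
SAMPLE = """<SAMI>
<HEAD>
<Title>demo</Title>
</HEAD>
<BODY>
<SYNC Start=52><P Class=ENCC>
Subtitles by Korea NSC Subtitle Team <br>
(http://club.nate.com/tsm)
<SYNC Start=3989><P Class=ENCC>
<SYNC Start=5047><P Class=ENCC>
Back, back, back, back!
"""


def parses_as_xml(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


class SamiCollector(HTMLParser):
    """Collect SYNC start times and caption text found after <BODY>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.timestamps = []
        self.captions = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names.
        if tag == "body":
            self.in_body = True
        elif tag == "sync":
            start = dict(attrs).get("start")
            if start is not None:
                self.timestamps.append(int(start))

    def handle_data(self, data):
        # Ignore the title and style sheet in the header.
        if not self.in_body:
            return
        text = data.strip()
        if text:
            self.captions.append(text)


collector = SamiCollector()
collector.feed(SAMPLE)
collector.close()  # flush any text still buffered at end of input

print(parses_as_xml(SAMPLE))
print(collector.timestamps)
print(collector.captions)
```

The same parser object covers both cases from the first post: captions holds the dialogue, and len(collector.timestamps) gives the timestamp count.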
Join Date: May 2009
Posts: 13
Reputation:
Solved Threads: 0
When I use the code below, it gives the following error.
def getText(inFile):
    text=[ ]
    file = open( inFile, "ra" )
    for line in wholefile:
        line = line.strip()
        if line.startswith("<") or line.endswith(">"):
            continue
        text.append(line)
    return text
  File "t1.py", line 12
    continue
SyntaxError: 'continue' not properly in loop
Also, can I write this code in a Python shell, without defining a function and calling it from main?
Another thing: how can I achieve multiple matches on a line in a single if statement? For example, I need to ignore a line that starts with '<' and ends with either '>' or ';'.
Join Date: Jun 2008
Posts: 122
Reputation:
Solved Threads: 30
File "t1.py", line 12
continue
SyntaxError: 'continue' not properly in loop
Try:
def getText(inFile):
    text = []
    file = open(inFile, "r")
    # Loop over the file object itself; wholefile was never defined
    for line in file:
        line = line.strip()
        if line.startswith("<") or line.endswith(">"):
            continue
        text.append(line)
    file.close()
    return text
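As for the error itself: Python reports "'continue' not properly in loop" when a continue statement is not inside the body of a for or while loop, which usually comes down to indentation. The exact indentation of t1.py was not posted, so the BROKEN snippet below is only a guess at what it might have looked like. The helper syntax_error_of is made up for illustration and compiles the source without running it.

```python
# Hypothetical reconstruction: the `if` has slipped out of the `for` body,
# so `continue` is no longer inside any loop.
BROKEN = '''
for line in lines:
    line = line.strip()
if line.startswith("<"):
    continue
'''

# Same statements with the `if` indented into the loop body.
FIXED = '''
for line in lines:
    line = line.strip()
    if line.startswith("<"):
        continue
'''


def syntax_error_of(source):
    """Compile the source without running it; return the error message or None."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError as exc:
        return exc.msg
    return None


print(syntax_error_of(BROKEN))
print(syntax_error_of(FIXED))
```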
Also, can I write this code in a Python shell, without defining a function and calling it from main?

how can I achieve multiple matches on a line in a single if statement? For example, I need to ignore a line that starts with '<' and ends with either '>' or ';'.
Last edited by slate; May 29th, 2009 at 9:38 am.
Join Date: Dec 2006
Posts: 1,017
Reputation:
Solved Threads: 286
how can I achieve multiple matches on a line in a single if statement? For example, I need to ignore a line that starts with '<' and ends with either '>' or ';'.
test_data = ["keep this line>",
             "<ignore this line>",
             "<ignore this line also;"]

for rec in test_data:
    rec = rec.strip()
    # Reset the flag for every record so one match does not stick
    ignore = False
    if rec.startswith("<"):
        if rec.endswith(">") or rec.endswith(";"):
            ignore = True
    if ignore:
        print("Ignore", rec)
    else:
        print("OK", rec)
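If the goal is literally one if statement, the same test can probably be written more compactly, because str.endswith (like str.startswith) accepts a tuple of alternatives. A small sketch, where should_ignore is a made-up helper name:

```python
def should_ignore(rec):
    """True for records that start with '<' and end with '>' or ';'."""
    rec = rec.strip()
    # str.endswith accepts a tuple of suffixes and matches any of them.
    return rec.startswith("<") and rec.endswith((">", ";"))


test_data = ["keep this line>", "<ignore this line>", "<ignore this line also;"]
flags = [should_ignore(rec) for rec in test_data]
print(flags)
```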
Join Date: May 2009
Posts: 13
Reputation:
Solved Threads: 0
In reply to Wooee.
Forgive my ignorance. With this code I am still not able to get the text that I want to extract, e.g. from the lines below (the file has more than 2000 lines of similar text):
</HEAD>
<BODY>
<SYNC Start=52><P Class=ENCC>
Subtitles by Korea NSC Subtitle Team <br>
(http://club.nate.com/tsm)
<SYNC Start=3989><P Class=ENCC>
<SYNC Start=5047><P Class=ENCC>
Back, back, back, back!
<SYNC Start=7235><P Class=ENCC>
<SYNC Start=10725><P Class=ENCC>
Yeah, Dan!
<SYNC Start=11984><P Class=ENCC>
Hey! Hey!
<SYNC Start=14072><P Class=ENCC>
<SYNC Start=15212><P Class=ENCC>
Back, back, back, back!
<SYNC Start=17249><P Class=ENCC>
I want only the text lines to be written into a new file and the rest of the lines ignored, e.g. keep this text:
Back, back, back, back!
Yeah, Dan!
Hey! Hey!
and ignore the rest.
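One possible line-based approach to both cases (dialogue only, and timestamps only) is sketched below. It assumes the layout shown in the sample: a header that ends at <BODY>, and SYNC lines of the form <SYNC Start=N>. The names split_smi and write_lines and the two regular expressions are made up for illustration, and writing the output as utf-8 is an assumption.

```python
import re

# Matches the start time in a line such as: <SYNC Start=5047><P Class=ENCC>
SYNC_RE = re.compile(r"<SYNC\s+Start=(\d+)", re.IGNORECASE)
# Matches any tag; used to drop inline markup such as <br>.
TAG_RE = re.compile(r"<[^>]*>")


def split_smi(lines):
    """Return (timestamps, dialogue) from an iterable of SMI lines."""
    timestamps, dialogue = [], []
    in_body = False
    for raw in lines:
        line = raw.strip()
        if not in_body:
            # Skip the header (title, style sheet) up to and including <BODY>.
            in_body = line.upper().startswith("<BODY")
            continue
        match = SYNC_RE.match(line)
        if match:
            timestamps.append(int(match.group(1)))
        # Whatever remains after removing the tags is caption text.
        text = TAG_RE.sub("", line).strip()
        if text:
            dialogue.append(text)
    return timestamps, dialogue


def write_lines(path, lines):
    """Write one item per line to the file at `path`."""
    with open(path, "w", encoding="utf-8") as out:
        for line in lines:
            out.write(line + "\n")


sample = """</HEAD>
<BODY>
<SYNC Start=52><P Class=ENCC>
Subtitles by Korea NSC Subtitle Team <br>
(http://club.nate.com/tsm)
<SYNC Start=3989><P Class=ENCC>
<SYNC Start=5047><P Class=ENCC>
Back, back, back, back!
<SYNC Start=7235><P Class=ENCC>
<SYNC Start=10725><P Class=ENCC>
Yeah, Dan!
""".splitlines()

timestamps, dialogue = split_smi(sample)
print(len(timestamps), timestamps)
print(dialogue)
```

For a real file, the lines could come from an open file object instead of the sample list, and write_lines("dialogue.txt", dialogue) (the output name is hypothetical) would save the result. Note that the credit lines at the top of the sample are kept as dialogue too, since nothing in the markup distinguishes them.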