Hi, I am new to Python. I am working on parsing text from an SMI file. In one case I want to extract only the dialogue and ignore the timestamps (lines starting with <SYNC). Below is part of the SMI file:

<SAMI>

<HEAD>

<Title>ÁŠžñÀ» ÀûŸî ÁÖŒŒ¿ä.</Title>

<Style TYPE="text/css">

<!--

P {margin-left:8pt; margin-right:8pt; margin-bottom:2pt; margin-top:2pt;

   text-align:center; font-size:22pt; font-family: Arial, Sans-serif;

   font-weight:bold; color:white;}

.KRCC {Name:Korean; lang:ko-KR; SAMIType:CC;}

.ENCC {Name:English; lang:en-US; SAMIType:CC;}

#STDPrn {Name:Standard Print;}

#VLargePrn {Name:34pt (VLarge Print); font-size:34pt;}

#LargePrn {Name:28pt (Large Print); font-size:28pt;}

#MediumPrn {Name:24pt (Medium Print); font-size:24pt;}

#BSmallPrn {Name:18pt (BSmall Print); font-size:18pt;}

#SmallPrn {Name:12pt (Small Print); font-size:12pt;}

-->

</Style>

<!--


-->

</HEAD>

<BODY>

<SYNC Start=52><P Class=ENCC>

Subtitles by Korea NSC Subtitle Team <br>

(http://club.nate.com/tsm)

<SYNC Start=3989><P Class=ENCC>&nbsp;

<SYNC Start=5047><P Class=ENCC>

Back, back, back, back!

<SYNC Start=7235><P Class=ENCC>&nbsp;

<SYNC Start=10725><P Class=ENCC>

Yeah, Dan!

I want to extract only the lines that contain dialogue text, e.g.
Back, back, back, back!
and write them into a file.
I have written a function, but it doesn't give the required output.

def getText( inFile ):

    text = []
    file = open( inFile, "ra" )

    wholefile = file.readlines()

    for line in wholefile:

        line = line.strip()

        if line.startswith("<"):
            break
        elif line.endswith(:>)
            break
        else 
            continue
            text.append(line)

 text.sort()
 return text

What could be the problem with this code, or is there another way to do it?
In the second case I want to keep the timestamp counts and ignore the text while parsing. How could that be achieved?

All 7 Replies

Please use code tags.

I see two problems with your code, and I have one suggestion:
1. No data will ever be written to the text list. It's a logic error, you know. :) The loop exits (break) as soon as a line starts with "<" or ends with ":>", and every other line hits continue before text.append(line) can run, so nothing is ever appended.
2. I am not sure text.sort() does what you want. It reorders the lines alphabetically, so the dialogue would no longer be in file order. Why is it needed?
3. A program pattern that does what you want looks like:

def collect_lines(filename):
    output = []
    fi = open(filename)
    for line in fi:
        # keep_line() is a placeholder: put your own test for lines to output here
        if keep_line(line):
            output.append(line)
    fi.close()
    return output

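fileobject.readlines() reads the whole file into memory at once, which becomes a pain when files get bigger. Iterating over the file object reads one line at a time.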

Thanks for correcting me and for the code, but I don't get how your suggested code would differentiate between a line that starts with the delimiter "<" and a line that has text characters. Could you explain it a bit?

If you simply drop the else part, then text will contain all lines that neither begin with "<" nor end with ":>":

for line in wholefile:
    line = line.strip()
    if line.startswith("<") or line.endswith(":>"):
        continue
    text.append(line)
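
Putting that together, here is a minimal sketch of the whole job, under a few assumptions that are not from the posts above: the function name extract_dialogue and the file names are made up for illustration, everything up to the <BODY> line is treated as header and skipped, and blank lines are dropped. It uses the same startswith/endswith test, with ">" as the closing character.

```python
def extract_dialogue(in_path, out_path):
    """Copy only the dialogue lines of an SMI file to out_path (illustrative helper)."""
    kept = []
    in_body = False
    # Iterating over the file object reads one line at a time.
    # Pass encoding=... to open() if the file is not in the platform default.
    with open(in_path) as src:
        for line in src:
            line = line.strip()
            if not in_body:
                # Assumption: the header (CSS, comments) ends at the <BODY> line.
                in_body = line.upper() == "<BODY>"
                continue
            if not line:
                continue  # skip blank lines
            if line.startswith("<") or line.endswith(">"):
                continue  # skip <SYNC ...> lines and lines ending in a tag
            kept.append(line)
    with open(out_path, "w") as dst:
        for text in kept:
            dst.write(text + "\n")
    return kept
```

It could be called as extract_dialogue("movie.smi", "dialogue.txt"), where both file names are hypothetical. Note that a line such as "Subtitles by Korea NSC Subtitle Team <br>" is dropped as well, because it ends with ">".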

line.find("Back, back, back, back!") == -1
That means the line does not contain the string given as the argument, if that is what you mean.

If you want to know whether a line is text in general, that is hard, because the whole file is a text file.

I do not know the SMI format, but it looks like XML. If it really is well-formed XML and the file is reasonable in size (<1 GB), you can use a Python DOM library, for example xml.etree.ElementTree from the standard library.
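
One caveat on the parser idea: the sample above has unquoted attributes (Start=52) and unclosed <SYNC> and <P> tags, so a strict XML parser such as xml.etree.ElementTree will raise a parse error on it. A more forgiving option is html.parser.HTMLParser from the standard library. The sketch below is only an illustration under assumptions: the names SyncCollector and parse_smi are made up, and it assumes every <SYNC Start=...> tag carries an integer start time. It collects (start, text) pairs, which covers both cases: keeping the text or keeping the timestamps.

```python
from html.parser import HTMLParser


class SyncCollector(HTMLParser):
    """Collect (start, [text pieces]) for every <SYNC Start=...> tag (illustrative)."""

    def __init__(self):
        super().__init__()
        self.cues = []
        self.in_cue = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names.
        if tag == "sync":
            start = dict(attrs).get("start") or ""
            self.in_cue = start.isdecimal()
            if self.in_cue:
                self.cues.append((int(start), []))

    def handle_data(self, data):
        # strip() also removes the no-break space that &nbsp; decodes to.
        text = data.strip()
        if text and self.in_cue:
            self.cues[-1][1].append(text)


def parse_smi(markup):
    """Return a list of (start, text) pairs; text is '' for blank cues."""
    parser = SyncCollector()
    parser.feed(markup)
    parser.close()
    return [(start, " ".join(parts)) for start, parts in parser.cues]


sample = """<SAMI>
<HEAD>
<Style TYPE="text/css">
<!--
P {font-size:22pt;}
-->
</Style>
</HEAD>
<BODY>
<SYNC Start=5047><P Class=ENCC>
Back, back, back, back!
<SYNC Start=7235><P Class=ENCC>&nbsp;
<SYNC Start=10725><P Class=ENCC>
Yeah, Dan!
"""

cues = parse_smi(sample)
dialogue = [text for _, text in cues if text]    # first case: text only
timestamps = [start for start, _ in cues]        # second case: timestamps only
print(dialogue)
print(timestamps)
```

Text that appears before the first <SYNC> tag (title, style sheet) is ignored because no cue is open yet.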

Otherwise, please post your minimal code (in code tags) that runs, and give us the error or the unexpected result you had.

When I use the code below, it gives the error shown underneath it.

def getText(inFile):
text=[ ]
file = open( inFile, "ra" )

for line in wholefile:
    line = line.strip()
    if line.startswith("<") or line.endswith(">"):
        continue
    text.append(line)
return text

File "t1.py", line 12
continue
SyntaxError: 'continue' not properly in loop

Also, can I write this code in a Python shell, without defining a function and passing it to main?
Another thing: how can I achieve multiple matches on a line in a single if statement? For example, I need to ignore a line that starts with '<' and ends with either '>' or ';'.

File "t1.py", line 12
continue
SyntaxError: 'continue' not properly in loop

Your code is not indented inside the function.

Try:

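def getText(inFile):
    text = []
    # "r" is the read mode; "ra" is not a valid mode
    file = open(inFile, "r")

    # iterate over the file object itself; wholefile was never defined here
    for line in file:
        line = line.strip()
        if line.startswith("<") or line.endswith(">"):
            continue
        text.append(line)
    file.close()
    return text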

Also, can I write this code in a Python shell, without defining a function and passing it to main?

Is this a question? I do not understand what you mean. If it's not, then neither:)

how can I achieve multiple matches on a line in a single if statement? For example, I need to ignore a line that starts with '<' and ends with either '>' or ';'.

Can you please write an example!

how can I achieve multiple matches on a line in a single if statement? For example, I need to ignore a line that starts with '<' and ends with either '>' or ';'.

The simplest way, without using and/or combinations, is to use an indicator that you set to True when a match is found.

test_data = [ "keep this line>",
              "<ignore this line>",
              "<ignore this line also;" ]

for rec in test_data:
   rec = rec.strip()
   # reset the indicator for every record, otherwise one match
   # would make every later line count as ignored too
   ignore = False
   if rec.startswith("<"):
      if rec.endswith(">") or rec.endswith(";"):
         ignore = True

   if ignore:
      print("Ignore", rec)
   else:
      print("OK", rec)
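
As a side note, the same test can be written as a single condition, because str.startswith and str.endswith also accept a tuple of alternatives. A minimal sketch; the function name should_ignore is made up for illustration:

```python
def should_ignore(rec):
    """Return True for a line that starts with '<' and ends with '>' or ';'."""
    rec = rec.strip()
    # endswith() accepts a tuple and matches if any of the suffixes fits.
    return rec.startswith("<") and rec.endswith((">", ";"))


test_data = ["keep this line>",
             "<ignore this line>",
             "<ignore this line also;"]

for rec in test_data:
    print("Ignore" if should_ignore(rec) else "OK", rec)
```

This avoids the separate indicator variable, at the cost of packing both checks into one expression.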

In reply to Wooee.
Forgive my ignorance. With this code I am still not able to get the text that I want to extract, e.g. from the lines below (the file has more than 2000 lines of similar text):

</HEAD>
<BODY>
<SYNC Start=52><P Class=ENCC>
Subtitles by Korea NSC Subtitle Team <br>
(http://club.nate.com/tsm)
<SYNC Start=3989><P Class=ENCC>&nbsp;
<SYNC Start=5047><P Class=ENCC>
Back, back, back, back!
<SYNC Start=7235><P Class=ENCC>&nbsp;
<SYNC Start=10725><P Class=ENCC>
Yeah, Dan!
<SYNC Start=11984><P Class=ENCC>
Hey! Hey!
<SYNC Start=14072><P Class=ENCC>&nbsp;
<SYNC Start=15212><P Class=ENCC>
Back, back, back, back!
<SYNC Start=17249><P Class=ENCC>&nbsp;

I want only the text lines to be written into a new file and the rest of the lines to be ignored, e.g.
keep this text:
Back, back, back, back!
Yeah, Dan!
Hey! Hey!
and ignore the rest of the text.
