Hello,

I think this is a pretty simple problem but I just don't know where to start. I have a text file:

1
00:00:34,000 --> 00:00:36,135
Thank you, Detective.

2
00:00:42,714 --> 00:00:45,794
- Any change?
- Nothing since you left.

3
00:00:52,988 --> 00:00:55,585
She seems to be looking for something.

4
00:00:55,588 --> 00:00:59,234
Camera?

5
00:01:23,961 --> 00:01:26,662
She has a nice ass.

6
00:01:27,571 --> 00:01:30,407
Stay focused on the mission.

7
00:01:36,600 --> 00:01:40,336
Keep an eye on her,
but don't get too close.

8
00:01:51,605 --> 00:01:53,832
- Good morning.
- Good morning.

Actually, it's an .srt (subtitle) file and I need to extract the text, ignoring the timestamps and index numbers. Ultimately, I need to create a corpus of subtitle files as part of my linguistics course. Is Python the right tool for this job? Any help would be much appreciated :D

Start with a script which prints the lines one by one:

SOURCE_FILE = "myfile.srt"

def main():
    with open(SOURCE_FILE) as src_file:
        for line in src_file:
            print(repr(line))

if __name__ == "__main__":
    main()
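
From there, one possible next step is to filter out the lines you don't want. The sketch below is only an illustration, not a full .srt parser: it assumes each block starts with an index number right after a blank line (or at the start of the file), followed by a timestamp line in the format shown in the question. The names extract_text and TIMESTAMP are made up for this example.

```python
import re

# Matches a timestamp line such as "00:00:34,000 --> 00:00:36,135".
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}")

def extract_text(lines):
    """Yield only the subtitle text, skipping index numbers, timestamps and blank lines."""
    previous_blank = True  # the very first line of the file starts a block
    for raw in lines:
        # Drop the line ending and a possible byte order mark at the start of the file.
        line = raw.strip().lstrip("\ufeff")
        if not line:
            previous_blank = True
            continue
        if previous_blank and line.isdigit():
            # A bare number directly after a blank line is taken to be the index.
            previous_blank = False
            continue
        previous_blank = False
        if TIMESTAMP.match(line):
            continue
        yield line
```

Any iterable of lines works as input, so an open file can be passed in directly, for example `for text in extract_text(src_file): print(text)` inside the `with` block above.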

Yes, I have learnt a little more about Python. I've finally got the code to do what I want it to do.

import sys, re

output = sys.stdout
text = sys.stdin.read()

#rx_blanks = re.compile(r"\W+")
paragraph = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}\s-->\s\d{2}:\d{2}:\d{2},\d{3})\r\n(.*?\r?\n?.*)\r\n\r\n\d{1,3}\r",re.MULTILINE)
sub_oneline = re.compile(r"\r\n")

for match in paragraph.finditer(text):
	timestamp, subs = match.groups()
	timestamp = timestamp.strip()
	subs = sub_oneline.sub("->>-",subs)
	print (timestamp, '@', subs)

Which gives this:


01:31:41,632 --> 01:31:44,763 @ I love him too, unfortunately.
01:31:48,515 --> 01:31:50,939 @ I may have a solution for you.
01:32:24,031 --> 01:32:25,689 @ Are you with me this time?
01:32:51,829 --> 01:32:52,868 @ C'mon, let him up.
01:32:54,861 --> 01:32:56,322 @ I'm just a tourist.

But why can I write:

print (timestamp, '@', subs)

and not:

f = open('file.txt','w')
f.write(timestamp, '@', subs)

??

Assuming you are using Python 3, you should be able to write

f = open('file.txt','w')
print(timestamp, '@', subs, file=f)

or alternatively

f.write(' '.join((timestamp, '@', subs)) + '\n')
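
To spell out the difference: print() accepts any number of arguments, separates them with spaces and appends a newline, whereas write() takes exactly one string, so you have to build that string yourself. Here is a small sketch comparing the two, assuming Python 3. The helper names are invented for the example, and io.StringIO stands in for a real file so that nothing is written to disk.

```python
import io

def write_with_print(out, timestamp, subs):
    # print() joins its arguments with spaces and adds the trailing newline itself.
    print(timestamp, "@", subs, file=out)

def write_with_write(out, timestamp, subs):
    # write() needs a single string: join the pieces and add the newline by hand.
    out.write(" ".join((timestamp, "@", subs)) + "\n")

# In-memory stand-ins for open('file.txt', 'w').
buffer_a = io.StringIO()
buffer_b = io.StringIO()

write_with_print(buffer_a, "01:32:54,861 --> 01:32:56,322", "I'm just a tourist.")
write_with_write(buffer_b, "01:32:54,861 --> 01:32:56,322", "I'm just a tourist.")

# Both buffers now hold the same line of text.
print(buffer_a.getvalue() == buffer_b.getvalue())
```

With a real file, opening it as `with open('file.txt', 'w') as f:` makes sure it gets closed, and the contents flushed, when the block ends.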


Great, I'm still on 2.5; I'll update.

Many thanks

Actually you can do the same in 2.6 or 2.7 if you add

from __future__ import print_function

as the first line of your file.
