Hello,

I think this is a pretty simple problem but I just don't know where to start. I have a text file:

1
00:00:34,000 --> 00:00:36,135
Thank you, Detective.

2
00:00:42,714 --> 00:00:45,794
- Any change?
- Nothing since you left.

3
00:00:52,988 --> 00:00:55,585
She seems to be looking for something.

4
00:00:55,588 --> 00:00:59,234
Camera?

5
00:01:23,961 --> 00:01:26,662
She has a nice ass.

6
00:01:27,571 --> 00:01:30,407
Stay focused on the mission.

7
00:01:36,600 --> 00:01:40,336
Keep an eye on her,
but don't get too close.

8
00:01:51,605 --> 00:01:53,832
- Good morning.
- Good morning.

Actually, it's a .srt (subtitle) file and I need to extract the text, so ignore the 'timestamps' and 'index number'. Ultimately, I need to create a corpus of subtitle files as part of my linguistics course. Is Python the right tool for this job? Any help would be much appreciated :D

All 8 Replies

Start with a script which prints the lines one by one:

SOURCE_FILE = "myfile.srt"

def main():
    with open(SOURCE_FILE) as src_file:
        for line in src_file:
            print(repr(line))

if __name__ == "__main__":
    main()
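
Once the loop above shows what each line looks like, the text can be picked out by skipping lines that are blank, purely numeric, or contain the `-->` arrow. A minimal sketch, assuming the file follows the layout shown in the question; `extract_text` is a name made up for this example, and this simple filter would also drop a spoken line that consists only of digits:

```python
def extract_text(lines):
    """Return the subtitle text lines, skipping indexes, timestamps and blanks."""
    text_lines = []
    for line in lines:
        line = line.strip()      # also removes a trailing "\r" from Windows files
        if not line:             # blank separator between subtitles
            continue
        if line.isdigit():       # index number such as "1", "2", ...
            continue
        if "-->" in line:        # timestamp line
            continue
        text_lines.append(line)
    return text_lines


sample = [
    "1\n",
    "00:00:34,000 --> 00:00:36,135\n",
    "Thank you, Detective.\n",
    "\n",
    "2\n",
    "00:00:42,714 --> 00:00:45,794\n",
    "- Any change?\n",
    "- Nothing since you left.\n",
]
print(extract_text(sample))
```

An open file object can be passed in place of `sample`, since iterating over a file yields its lines.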

Thank you very much. I will try this.

Yes, I have learnt a little more about Python. I've finally got the code to do what I want it to do.

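import sys, re

output = sys.stdout
# Read the whole subtitle file from standard input
text = sys.stdin.read()

#rx_blanks = re.compile(r"\W+")
# Group 1: the timestamp line. Group 2: one or two lines of subtitle text.
# The pattern expects \r\n line endings and must be followed by a blank
# line and the next index number.
paragraph = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}\s-->\s\d{2}:\d{2}:\d{2},\d{3})\r\n(.*?\r?\n?.*)\r\n\r\n\d{1,3}\r",re.MULTILINE)
# Used to join a two-line subtitle into a single line
sub_oneline = re.compile(r"\r\n")

for match in paragraph.finditer(text):
    timestamp, subs = match.groups()
    timestamp = timestamp.strip()
    # Replace the line break inside the subtitle with a visible marker
    subs = sub_oneline.sub("->>-",subs)
    print (timestamp, '@', subs)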

Which gives this:


01:31:41,632 --> 01:31:44,763 @ I love him too, unfortunately.
01:31:48,515 --> 01:31:50,939 @ I may have a solution for you.
01:32:24,031 --> 01:32:25,689 @ Are you with me this time?
01:32:51,829 --> 01:32:52,868 @ C'mon, let him up.
01:32:54,861 --> 01:32:56,322 @ I'm just a tourist.

But why can I write:

print (timestamp, '@', subs)

and not:

f = open('file.txt','w')
f.write(timestamp, '@', subs)

??

f.write() accepts exactly one string argument, which is why your call fails. Assuming you are using Python 3, you should be able to write

f = open('file.txt','w')
print(timestamp, '@', subs, file=f)

or alternately

f.write(' '.join((timestamp, '@', subs)) + '\n')
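
Putting the two variants side by side, here is a small sketch for Python 3. The sample values are placeholders taken from the question, and the `with` block is an addition here so that the file is closed automatically:

```python
timestamp = "00:00:34,000 --> 00:00:36,135"
subs = "Thank you, Detective."

with open("file.txt", "w") as f:
    # print() accepts several arguments, joins them with spaces and adds "\n"
    print(timestamp, "@", subs, file=f)
    # write() accepts exactly one string, so the line is built first
    f.write(" ".join((timestamp, "@", subs)) + "\n")
```

Both calls should write the same line to the file.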

Great, I'm still on 2.5; I'll update.

Many thanks

acehigher, where did you get the above code from?

Actually you can do the same in 2.6 or 2.7 if you add

from __future__ import print_function

as the first line of your file.
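
Coming back to the regex posted earlier: it only matches an entry when another index number follows it, so the final subtitle in a file would probably be missed. One possible alternative is to split the text on blank lines instead. This is only a sketch, written for Python 3 and assuming entries are separated by blank lines as in the sample; `parse_srt` is a hypothetical helper, not part of any library:

```python
import re


def parse_srt(text):
    """Yield (timestamp, subtitle) pairs from the contents of an .srt file."""
    text = text.replace("\r\n", "\n")          # normalise Windows line endings
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.split("\n")
        if len(lines) < 3 or "-->" not in lines[1]:
            continue                           # not a complete entry, skip it
        timestamp = lines[1].strip()
        # Join multi-line subtitles with the same marker used in the thread
        subs = "->>-".join(line.strip() for line in lines[2:])
        yield timestamp, subs


sample = (
    "7\r\n00:01:36,600 --> 00:01:40,336\r\n"
    "Keep an eye on her,\r\nbut don't get too close.\r\n\r\n"
    "8\r\n00:01:51,605 --> 00:01:53,832\r\n"
    "- Good morning.\r\n- Good morning.\r\n"
)
for timestamp, subs in parse_srt(sample):
    print(timestamp, "@", subs)
```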
