vbScript - An Application to Modify srt Subtitle Files
Please see my post vbScript - The Basics for more details on vbScript.
So far I've posted either functions/subs or code snippets to illustrate a technique. This time I am going to post a complete application.
First, a few points.
- I am somewhat hearing impaired - just one of the hazards of increasing age and heredity.
- Because of that, I find I am making increased use of videos with subtitles
- I like to collect videos over the winter to watch at the cottage in the summer
I use VLC media player exclusively to watch videos and I make use of the vlsub plugin to download subtitles for those videos. I have found two things with most subtitles.
- They contain ads (most are identifiable by a few common keywords)
- They are frequently out of sync with the actual video
Almost all subtitles now are stored as plain text in a file with the srt extension. The file consists of multiple blocks such as
2
00:00:02,902 --> 00:00:06,429
We're going to be going for an extremely youth-oriented product.
The format is
- A sequential index number (which is of no use that I can determine)
- A time range specified as start time --> end time
- One or more lines of dialog followed by a blank line to indicate end of block
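As a rough illustration of the layout above, the time line of a block can be split on the arrow to get the start and end times. This is only a sketch; SplitTimespec is a hypothetical helper name and is not taken from Subtitle.vbs.

```vbscript
' Hypothetical helper: split "start --> end" into a two-element array.
Function SplitTimespec(timespec)
    Dim parts
    parts = Split(timespec, "-->")
    If UBound(parts) <> 1 Then
        SplitTimespec = Array("", "")      ' not a valid time line
    Else
        SplitTimespec = Array(Trim(parts(0)), Trim(parts(1)))
    End If
End Function

Dim t
t = SplitTimespec("00:00:02,902 --> 00:00:06,429")
WScript.Echo "start = " & t(0)   ' start = 00:00:02,902
WScript.Echo "end   = " & t(1)   ' end   = 00:00:06,429
```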
Given that, I wrote Subtitle.vbs to do cleanup, reindexing (although I don't know why), and time stretching/compression. Cleanup is a simple task. I maintain a list of words or phrases usually found only in ads and I remove any blocks containing those words (or phrases). Once I am left with a "clean" subtitle file I can then apply any stretching/compression as needed.
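A keyword check of the kind described above might look something like the following. The keyword list here is made up for illustration (the actual list is not shown in this part of the post), and the function body is an assumption about how such a test could be written.

```vbscript
' Assumed keyword list, for illustration only.
Dim adWords
adWords = Array("subtitles by", "www.", "download")

' Returns True if text contains none of the ad keywords (case-insensitive).
Function NotAnAd(text)
    Dim word
    NotAnAd = True
    For Each word In adWords
        If InStr(1, text, word, vbTextCompare) > 0 Then
            NotAnAd = False
            Exit Function
        End If
    Next
End Function

' The first line is ordinary dialog, the second matches two keywords.
WScript.Echo NotAnAd("We're going to be going for an extremely youth-oriented product.")
WScript.Echo NotAnAd("Subtitles by example at www.example.com")
```

A block would be kept only when the test returns True for its text.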
The only information I need to adjust the time scale is the exact time of the first line of dialog, and the exact time of the last line of dialog. Once I know that I can use that along with the current first and last times from the srt file to calculate an offset and scaling to apply to each block in the file. Once I apply the offset and scale the subtitles should then sync with the video. For almost all files this works well.
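One way to express the offset and scaling is as a straight-line mapping: the scale factor is (actual last - actual first) / (file last - file first), and each time is shifted so that the first line lands on its actual time. The sketch below assumes all times are already in thousandths of a second; Rescale and its parameter names are hypothetical, and the sample values are made up.

```vbscript
' Map a time t (in ms) from the srt file's time scale onto the video's.
' fileFirst/fileLast : start times of the first and last blocks in the file
' realFirst/realLast : times at which those lines are actually spoken
Function Rescale(t, fileFirst, fileLast, realFirst, realLast)
    Dim factor
    If fileLast = fileFirst Then
        Rescale = t + (realFirst - fileFirst)   ' only an offset can be applied
        Exit Function
    End If
    factor = (realLast - realFirst) / (fileLast - fileFirst)
    Rescale = CLng(realFirst + (t - fileFirst) * factor)
End Function

' Made-up example values, in thousandths of a second
WScript.Echo Rescale(2902, 2902, 5400000, 4000, 5410000)      ' first line lands on realFirst
WScript.Echo Rescale(5400000, 2902, 5400000, 4000, 5410000)   ' last line lands on realLast
```

The same mapping would be applied to both the start and end time of every block.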
To contain the subtitle data I created two classes
Class Subtitle (holds information for one subtitle block)
Properties:
-----------
Public timespec 'time in the form hh:mm:ss,ddd --> hh:mm:ss,ddd      '
Public stime    'start time from timespec in thousandths of a second '
Public etime    'end time from timespec in thousandths of a second   '
Public text     'one or more lines of dialog to display              '
Class Subtitles (a sorted list of subtitle objects)
Methods:
--------
Class_Initialize - initializes some private data (list object, etc)
Count            - returns the number of subtitle blocks
Add              - adds a new subtitle block to the internal list
Item             - returns the subtitle block with the given 0-relative index (or from the end if index < 0)
TimeToNum        - converts time spec from hh:mm:ss,ddd to thousandths of seconds
NumToTime        - converts time from thousandths of seconds to "hh:mm:ss,ddd"
Read             - reads the given srt file into memory
Write            - writes the in-memory subtitles to the given file (or console if file = "")
Adjust           - adjusts all times by a given expansion/compression factor
NotAnAd          - returns True if the given text is not an advertisement
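To give an idea of what a conversion like NumToTime involves, here is a minimal standalone sketch. It is not the class method itself; Pad is a hypothetical helper, and the sketch assumes non-negative times of less than 100 hours.

```vbscript
' Hypothetical helper: left-pad a number with zeros to the given width.
Function Pad(n, width)
    Pad = Right(String(width, "0") & n, width)
End Function

' Convert thousandths of a second to "hh:mm:ss,ddd".
Function NumToTime(ByVal ms)
    Dim h, m, s, d
    h = ms \ 3600000                ' whole hours
    m = (ms Mod 3600000) \ 60000    ' whole minutes within the hour
    s = (ms Mod 60000) \ 1000       ' whole seconds within the minute
    d = ms Mod 1000                 ' remaining thousandths
    NumToTime = Pad(h, 2) & ":" & Pad(m, 2) & ":" & Pad(s, 2) & "," & Pad(d, 3)
End Function

WScript.Echo NumToTime(2902)   ' 00:00:02,902
WScript.Echo NumToTime(6429)   ' 00:00:06,429
```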
Plus a few other housekeeping methods.
Capitalize was written to try to clean up subtitles that were posted in all upper case.
RemoveBlanksInNumbers handles a problem where numbers like 1045 tend to be written as 1 045 (embedded space). This happens only with the digit 1.
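A fix of this kind can be sketched with a regular expression. The pattern below is an assumption rather than the code from Subtitle.vbs: it joins a standalone 1 to a digit that follows a single space, so it would also join legitimate text such as "1 2" and may need tightening.

```vbscript
' Sketch only: join a standalone "1" to the digits that follow a space.
Function RemoveBlanksInNumbers(text)
    Dim re
    Set re = New RegExp
    re.Global = True
    re.Pattern = "\b1 (\d)"          ' a lone 1, one space, then a digit
    RemoveBlanksInNumbers = re.Replace(text, "1$1")
End Function

WScript.Echo RemoveBlanksInNumbers("It was built in 1 045 AD.")   ' It was built in 1045 AD.
```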
The script can be run in two modes
subtitle file.srt
This will read the srt file, remove any obvious advertisements, reindex, then rewrite the srt file.
subtitle file.srt HH:MM:SS,DDD HH:MM:SS,DDD
This will read the srt file, remove any obvious advertisements, reindex, then adjust all subtitle blocks to correspond with the given start times of the actual first and last lines of dialog. You can use several methods to determine the actual times. VLC does not display any resolution finer than one second so you could always use trial and error until it looks right. Personally, I load up the video into VirtualDub and get the values from there.
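Telling the two modes apart comes down to counting the command line arguments. The following is a minimal sketch of that dispatch, with the real work replaced by messages; the usage text is my own wording.

```vbscript
Dim args
Set args = WScript.Arguments

Select Case args.Count
    Case 1
        ' file name only: clean up and reindex
        WScript.Echo "clean up and reindex " & args(0)
    Case 3
        ' file name plus actual first and last dialog times
        WScript.Echo "clean up, reindex and resync " & args(0) & _
                     " to " & args(1) & " .. " & args(2)
    Case Else
        WScript.Echo "usage: subtitle file.srt [HH:MM:SS,DDD HH:MM:SS,DDD]"
        WScript.Quit 1
End Select
```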
After frequently mistyping the times as HH:MM:SS:DDD instead of HH:MM:SS,DDD (colon instead of comma) I just decided to code it up to accept either form.
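Accepting either form is easy if the comma is turned into a colon before the string is split. The sketch below shows one way to do that while converting to thousandths of a second; it is a standalone illustration, and the -1 return value for a malformed time is an assumption.

```vbscript
' Convert "hh:mm:ss,ddd" or "hh:mm:ss:ddd" to thousandths of a second.
Function TimeToNum(timestr)
    Dim f
    f = Split(Replace(Trim(timestr), ",", ":"), ":")
    If UBound(f) <> 3 Then
        TimeToNum = -1               ' assumed error value for a malformed time
        Exit Function
    End If
    TimeToNum = CLng(f(0)) * 3600000 + CLng(f(1)) * 60000 + _
                CLng(f(2)) * 1000 + CLng(f(3))
End Function

WScript.Echo TimeToNum("00:00:02,902")   ' 2902
WScript.Echo TimeToNum("00:00:02:902")   ' 2902
```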
Because working with times as HH:MM:SS,DDD is difficult when scaling, all timespecs are converted on input to thousandths of seconds (see