vbScript - An Application to Modify srt Subtitle Files

Please see my post vbScript - The Basics for more details on vbScript.

So far I've posted either functions/subs or code snippets to illustrate a technique. This time I am going to post a complete application.

First, a few points.

  1. I am somewhat hearing impaired - just one of the hazards of increasing age and heredity.
  2. Because of 1 I find I am making increased use of videos with subtitles
  3. I like to collect videos over the winter to watch at the cottage in the summer

I use VLC media player exclusively to watch videos and I make use of the vlsub plugin to download subtitles for those videos. I have found two things with most subtitles.

  1. They contain ads (most identifiable by a few common keywords)
  2. They are frequently out of sync with the actual video

Almost all subtitles now are stored as plain text in a file with the srt extension. The file consists of multiple blocks such as

2
00:00:02,902 --> 00:00:06,429
We're going to be going for an
extremely youth-oriented product.

The format is

  1. A sequential index number (which is of no use that I can determine)
  2. A duration specified as start time --> end time
  3. One or more lines of dialog followed by a blank line to indicate end of block

Given that, I wrote Subtitle.vbs to do cleanup, reindexing (although I don't know why), and time stretching/compression. Cleanup is a simple task. I maintain a list of words or phrases usually found only in ads and I remove any blocks containing those words (or phrases). Once I am left with a "clean" subtitle file I can then apply any stretching/compression as needed.

The only information I need to adjust the time scale is the exact time of the first line of dialog, and the exact time of the last line of dialog. Once I know that I can use that along with the current first and last times from the srt file to calculate an offset and scaling to apply to each block in the file. Once I apply the offset and scale the subtitles should then sync with the video. For almost all files this works well.

To contain the subtitle data I created two classes

Class Subtitle (holds information for one subtitle block)

Properties:
-----------
Public timespec     'time in the form hh:mm:ss,ddd --> hh:mm:ss.ddd     '
Public stime        'start time from timespec in thousandths of a second'
Public etime        'end time from timespec in thousandths of a second  '
Public text         'one or more lines of dialog to display                                  '

Class Subtitles (a sorted list of subtitle objects)

Methods:
--------
Class_Initialize    - initializes some private data (list object, etc)
Count               - returns the number of subtitle blocks
Add                 - adds a new subtitle block to the internal list
Item                - returns the subtitle block with the given 0-relative index (or from the end if index < 0)
TimeToNum           - converts time spec from hh:mm:ss,ddd to thousandths of seconds
NumToTime           - converts time from thousandths of seconds to "hh:mm:ss,ddd"
Read                - reads the given srt file into memory
Write               - writes the in-memory subtitles to the given file (or console if file = "")
Adjust              - adjusts all times by a given expansion/compression factor
NotAnAd             - returns True if the given text is not an advertisement

Plus a few other housekeeping methods. Capitalize was written to try to clean up subtitles that were posted in all upper case. RemoveBlanksInNumbers handles a problem where numbers like 1045 tend to be written as 1 045 (embedded space). This happens only with the digit 1.

The script can be run in two modes

subtitle file.srt

This will read the srt file, remove any obvious advertisements, reindex, then rewrite the srt file.

subtitle file.srt HH:MM:SS,DDD HH:MM:SS,DDD

This will read the srt file, remove anay obvious advertisements, reindex, then adjust all subtitle blocks to correspond with the given start times of the actual first and last lines of dialog. You can use several methods to determine the actual times. vlc does not display any resolution finer than one second so you could always use trial and error until it looks right. Personally, I load up the video into VirtualDub and get the values from there.

After frequently mistyping the times as HH:MM:SS:DDD instead of HH:MM:SS,DDD (colon instead of comma) I just decided to code it up to accept either form.

Because working with times as HH:MM:SS,DDD is difficult when scaling, all timespecs are converted in input to thousandths of seconds (see TimeToNum and NumToTime)

''#region Header                                                                        '
''                                                                                      '
''  Name:                                                                               '
''                                                                                      '
''      Subtitle.vbs                                                                    '
''                                                                                      '
''  Description:                                                                        '
''                                                                                      '
''      Script to modify subtitle files in the srt format. Simple modification consists '
''      of removing ad blocks. Optionally, this script can stretch or compress dialog   '
''      times to match previously determined (by playing the video) actual dialog start '
''      and stop times.                                                                 '
''                                                                                      '
''  Usage:                                                                              '
''                                                                                      '
''      subtitle file                                                                   '
''                                                                                      '
''          Read the file, remove obvious ad blocks, reindex, and write file.           '
''                                                                                      '
''      subtitle file ftime ltime                                                       '
''                                                                                      '
''          Read the file, stretch or compress all times to fit the given first (ftime) '
''          and last (ltime) time specs and write file. Time specs are in the same      '
''          format as those in the srt file, to wit:                                    '
''                                                                                      '
''              HH:MM:SS,DDD    or HH:MM:SS:DDD                                         '
''                                                                                      '
''          where DDD represents thousandths of a second.                               '
''                                                                                      '
''  Notes:                                                                              '
''                                                                                      '
''      More and more subtitle (srt) files come loaded with ads. A lot of ads can be    '
''      recognized by the presence of certain words or phrases. The last function in    '
''      this file is NotAnAd, which checks a subtitle block to determine if it is an    '
''      ad. You may find the need to modify the adwords list. To date, if a block       '
''      contains an ad then it will contain no other dialog so removing the block will  '
''      not result in the loss of any dialog.                                           '
''                                                                                      '
''      Frequently, the times associated with the dialog blocks will not correctly sync '
''      with the actual dialog in the video. If this happens then do the following:     '
''                                                                                      '
''          1) edit the srt file and delete all blocks before the first block of spoken '
''              dialog (some srt blocks may contain descriptive but non-spoken text)    '
''          2) do the same thing for the end of the srt file                            '
''          3) play the video and note the exact time of the start of the first And     '
''              last spoken blocks                                                      '
''          4) run subtitle.vbs, giving the exa t start and stop times                  '
''                                                                                      '
''      The script will stretch or compress the times in the srt block to correspond To '
''      the given start and stop times.                                                 '
''                                                                                      '
''  Example:                                                                            '
''                                                                                      '
''      If you have a subtitle file which has the first line of dialog being started at '
''      00:00:35,968 and the last line at 01:58:59,064, but when you check the video    '
''      you find that the actual first and last times are 00:00:39,500 and 01:59:01,950 '
''      then you can adjust all of the times in the file to correspond to the correct   '
''      times by                                                                        '
''                                                                                      '
''          subtitle yourfile.srt 00:00:39,500 01:59:01,950                             '
''                                                                                      '
''  Audit:                                                                              '
''                                                                                      '
''      2016-11-24  rj  original code                                                   '
''                                                                                      '
''#endregion                                                                            '

MakeBackup = False          'set to True to make a backup of the original subtitle file '
ChangeCase = False          'set to True when file is in all caps                       '

Set fso = CreateObject("Scripting.FileSystemObject")
Set arg = Wscript.Arguments.Unnamed
Set srt = New Subtitles

For Each opt In WScript.Arguments.Named
    Select Case opt
        Case "b","backup"   : MakeBackup = True
        Case "c","case"     : ChangeCase = True
        Case "?","help"     : Help("")  : WScript.Quit
        Case Else           : Help(opt) : WScript.Quit
    End Select
Next

Select Case arg.Count

    Case 1      'no time specs given - remove ads and resequence

        reseqOnly = True

    Case 3      'time specs given - remove ads, adjust times and resequence

        reseqOnly = False

        'ensure that ftime & ltime are given as hh:mm:ss,ddd or hh:mm:ss:ddd

        Set rex = New RegExp
            rex.Pattern = "\d\d:\d\d:\d\d[,:]\d\d\d"
            rex.Global = False

        If Not rex.Test(arg(1)) Then
            Wscript.Echo "ftime not in the form HH:MM:SS,DDD or HH:MM:SS:DDD"
            Wscript.Quit
        End If

        If Not rex.Test(arg(2)) Then
            Wscript.Echo "ltime not in the form HH:MM:SS,DDD or HH:MM:SS:DDD"
            Wscript.Quit
        End If

    Case Else   : Help("") : WScript.Quit

End Select

file = arg(0)

'check that file exists and is of type srt

If Not fso.FileExists(file) Then
    Wscript.Echo "file '" & file & "' not found"
    Wscript.Quit
End If

If Lcase(fso.GetExtensionName(file)) <> "srt" Then
    Wscript.Echo file,"is not an srt type file"
    Wscript.Quit
End If

'make a .bak file if requested unless one already exists

bakfile = fso.GetBaseName(file) & ".bak"
If MakeBackup And Not fso.FileExists(bakfile) Then
    fso.CopyFile file, bakfile
End If

'Read the .srt file

srt.Read(file)

'Variable names indicate                            '
'                                                   '
'   sftm - srt start time of first line of dialog   '
'   sltm - srt start time of last line of dialog    '
'   vftm - video start time of first line of dialog '
'   vltm - video start time of last line of dialog  '

'get current (srt) dialog start and end times from the sorted list

sftm = srt.Item(0).stime            'get time from first dialog line'
sltm = srt.Item(-1).stime           'get time from last dialog line '
slen = sltm - sftm                  'calculate total dialog length  '

If Not reseqOnly Then

    'get actual dialog start and end times from the command line

    vftm = srt.TimeToNum(arg(1))    'get time from first dialog line'
    vltm = srt.TimeToNum(arg(2))    'get time from last dialog line '
    vlen = vltm - vftm              'calculate total dialog length  '

    factor = vlen / slen            'compression/expansion factor   '

    'calculate the factor to multiply srt times by to get actual times

    srt.Adjust factor,sftm,vftm     'compress/expand all dialog     '

End If

'write the new subtitle file

srt.Write(file)

''Display help info                                                                     '
Function Help (str)

    If Len(str) > 0 Then WScript.Echo VbCrLf & ">>>Unknown option: /" & str

    Wscript.Echo ""
    Wscript.Echo "subtitle [file [[ftime ltime]]"
    Wscript.Echo ""
    WScript.Echo "    Removes ads from subtitle file with an srt extension. Will also"
    WScript.Echo "    optionally stretch/compress times to match actual dialog start"
    Wscript.Echo "    and stop times. Note that you will have to manually determine"
    Wscript.Echo "    these times by playing the video. Specify ftime and ltime (the"
    Wscript.Echo "    times of the first and last dialog lines) as HH:MM:SS,DDD. or"
    WScript.Echo "    HH:MM:SS:DDD."
    Wscript.Echo ""
    Wscript.Echo "    Options:"
    Wscript.Echo ""
    WScript.Echo "        /backup       (/b) Save the original file in *.bak"
    WScript.Echo "        /case         (/c) change case to lower case"
    Wscript.Echo "        /help         (/?) show this help"

End Function

''#region                                                                               '
''                                                                                      '
''  Name:                                                                               '
''                                                                                      '
''      Subtitle.cls                                                                    '
''                                                                                      '
''  Description:                                                                        '
''                                                                                      '
''      Class definition for I/O and manipulation of susbtitle files in the srt format. '
''                                                                                      '
''  Properties:                                                                         '
''                                                                                      '
''      Count           number of subtitle blocks                                       '
''      timespec        time in the form hh:mm:ss,ddd --> hh:mm:ss.ddd                  '
''      stime           start time from timespec in thousandths of a second             '
''      etime           end time from timespec in thousandths of a second               '
''      text            text to display                                                 '
''                                                                                      '
''  Methods:                                                                            '
''                                                                                      '
''      Item(i)         Return the Subtitle object with the given 0-relative index      '
''      Read(file)      Read the given srt file                                         '
''      Write(file)     Write subtitles to the given file (use "" for console output)   '
''      Adjust(f,v)     Compress or expand all dialog blocks                            '
''      TimeToNum(t)    Convert hh:mm:ss,ddd to thousandths of seconds                  '
''      NumToTime(n)    Convert thousandths of seconds to hh:mm:ss,ddd                  '
''                                                                                      '
''  Notes:                                                                              '
''                                                                                      '
''      To expand or compress dialog you require two numbers. The first is a factor (f) '
''      indicating how much to expand (f > 1.0) or compress (F < 1.0). The other number '
''      is the starting time (v) of the first line of dialog in the video. This value   '
''      may be specified as thousandths of seconds or a string in the form hh:mm:ss,ddd '
''                                                                                      '
''      You can reference items in reverse order by using negative numbers. To access   '
''      the last item use obj.Item(-1)                                                  '
''                                                                                      '
''      If you find ads that are not removed on input, find a string or strings that    '
''      appear in the ad but are unlikely tp appear in other dialog and add a section   '
''      of logic to the Sub NotAnAd to trap it.                                         '
''                                                                                      '
''#endregion                                                                            '

''This class holds the information for one subtitle block from an srt type subtitle file'

Class Subtitle

    Public timespec     'time in the form hh:mm:ss,ddd --> hh:mm:ss.ddd     '
    Public stime        'start time from timespec in thousandths of a second'
    Public etime        'end time from timespec in thousandths of a second  '
    Public text         'one or more lines of dialog to display             '

End Class

''This class contains all of the subtitle blocks from within an srt type subtitle file  '

Class Subtitles

    Private srt         'sorted list - key=index val=Subtitle object    '
    Private rex         'regular expression                             '

    Public Sub Class_Initialize()
        Set srt = CreateObject("System.Collections.SortedList")
        Set rex = New RegExp
        rex.Pattern = "[0-9] [0-9]"
    End Sub

    'Returns the number of subtitle blocks

    Public Property Get Count()
        Count = srt.Count
    End Property

    'Add a new subtitle block                                           '
    '                                                                   '
    '   timespec    string of form "hh:mm:ss,ddd --> hh:mm:ss,ddd"      '
    '   text        dialog block associated with this timespec          '

    Private Sub Add(timespec, text)

        Dim s: Set s = New Subtitle
        Dim fld: fld = Split(timespec," --> ")

        s.timespec = timespec
        s.stime    = TimeToNum(fld(0))
        s.etime    = TimeToNum(fld(1))
        s.text     = Capitalize(text)

        srt.Add Count,s

    End Sub

    'Return a subtitle object given the index                           '
    '                                                                   '
    '   index       a 0-relative index number (< 0 then index from end) '

    Public Function Item (ByVal index)

        'if index < 0 then index back from end

        If index < 0 Then index = srt.Count + index

        If index < 0 or index >= srt.Count then
            Set item = Nothing
        Else
            Set Item = srt.GetByIndex(index)
        End If

    End Function

    'Convert time spec from hh:mm:ss,ddd to thousandths of seconds      '
    '                                                                   '
    '   timespec    time in the form "hh:mm:ss,ddd"                     '

    Public Function TimeToNum ( timespec )

        Dim tt, hh, mm, ss, dd

        tt = Replace(timespec,",",":")
        tt = Replace(tt," --> ",":")
        tt = Split(tt,":")

        'get the numeric equivalents of all time components

        hh = tt(0)          'hours          '
        mm = tt(1)          'minutes        '
        ss = tt(2)          'seconds        '
        dd = tt(3)          'thousandths    '

        TimeToNum = 1000 * (60 * (60 * hh + mm) + ss) + dd

    End Function

    'Convert time from thousandths of seconds to "hh:mm:ss,ddd"         '
    '                                                                   '
    '   num         time in thousandths of seconds                      '

    Public Function NumToTime ( ByVal num )

        Dim hh, mm, ss, dd

        hh = num \ 3600000: num = num - 3600000 * hh    'hours          '
        mm = num \ 60000  : num = num -   60000 * mm    'minutes        '
        ss = num \ 1000                                 'seconds        '
        dd = num Mod 1000                               'thousandths    '

        NumToTime = Pad(hh,2) & ":" & Pad(mm,2) & ":" & Pad(ss,2) & "," & Pad(dd,3)

    End Function

    'Read the given srt file into memory. The file is stored as a       '
    'collection of Subtitle objects.                                    '
    '                                                                   '
    '   filename    name of the file to read                            '

    Public Function Read (filename)

        Dim lines       'array of all lines read from the file          '
        Dim line        'single line from lines                         '
        Dim timeSpec    'time as "hh:mm:ss,ddd --> hh:mm:ss,ddd"        '
        Dim text        'accumulated dialog from one dialog block       '
        Dim ind         'regexp matches an index number line            '
        Dim tim         'regexp matches a timespec line                 '
        Dim inText      'true if in text portion of dialog block        '

        'matches a line containing only a number (index code)

        Set ind = New RegExp
            ind.Pattern = "^\d+$"
            ind.Global = False

        'matches a line containing a start/stop time spec

        Set tim = New RegExp
            tim.Pattern = "^\d\d:\d\d:\d\d,\d\d\d --> \d\d:\d\d:\d\d,\d\d\d$"
            tim.Global = False

        lines = fso.OpenTextFile(filename).ReadAll
        lines = Split(Replace(lines,vbCr,""),vbLf)
        lines = Filter(lines,"http",False,vbTextCompare)

        text   = ""
        inText = False

        For Each line In lines

            line = Trim(line)

            Select Case True

                Case ind.Test(line)

                    'ignore index numbers

                Case inText

                    line = RemoveBlanksInNumbers(line)

                    'add text to buffer (if present) or end text block
                    'lines containing ads and are ignored

                    If line = "" Then
                        inText = False
                        If NotAnAd(text) Then Add timeSpec, Mid(text,3)
                        text = ""
                    Else
                        text = text & vbCrLf & line
                    End If

                Case tim.Test(line)

                    timespec = line
                    inText   = True
                    text     = ""

            End Select

        Next

        'add current text block if not already added (and not an ad)

        If text <> "" Then
            If NotAnAd(text) Then
                Add timeSpec, Mid(text,3)
            End If
        End If

    End Function

    'Write the subtitles to the given file. If the file is "" then      '
    'the subtitles are written to the console (StdOut).                 '
    '                                                                   '
    '   filename    name of file to write                               '

    Public Function Write (filename)

        Dim tso         'textstream object for output                   '
        Dim i,s

        If filename = "" Then
            Set tso = Wscript.StdOut
        Else
            Set tso = fso.OpenTextFile(filename,2,True)
        End If

        For i = 0 To srt.Count - 1
            Set s = Me.Item(i)
            tso.WriteLine(i+1)
            tso.WriteLine(s.timespec)
            tso.WriteLine(s.text)
            tso.WriteLine("")
        Next

        tso.Close()

    End Function

    'Adjust all times by a given expansion/compression factor.          '
    '                                                                   '
    '   factor          expansion (>1.0) or compression (<1.9)          '
    '   sftm            timespec of first line of dialog from srt file  '
    '   vftm            timespec of first line of dialog                '
    '                   specify as "hh:mm:ss,ddd" or thousandths of     '
    '                       seconds                                     '

    Public Function Adjust (factor, ByVal sftm, ByVal vftm)

        Dim i, st, et

        'if vftm given as hh:mm then convert to thousandths of seconds

        If VarType(vftm) = vbString Then vftm = TimeToNum(vftm)

        For i = 0 to Me.Count - 1

            Set s = Me.Item(i)

            'make start and end times zero-relative

            st = s.stime - sftm
            et = s.etime - sftm

            'apply factor and offset to actual dialog start/end times with rounding

            st = vftm + factor * st
            et = vftm + factor * et

            'copy new values to subtitle object

            s.stime    = st
            s.etime    = et
            s.timespec = NumToTime(st) & " --> " & NumToTime(et)

        Next

    End Function

    'left pad a number with "0" to the given width                      '
    '                                                                   '
    '   num             integer to pad                                  '
    '   width           number of chars in resulting string             '

    Public Function Pad (num, width)
        Pad = Right(String(width,"0") & num,width)
    End Function

    Private Function Capitalize (text)

        If ChangeCase Then
            Capitalize = Ucase(Left(text,1)) & Lcase(Mid(text,2))
            Capitalize = Replace(Capitalize," i "," I ")
            Capitalize = Replace(Capitalize," i'"," I'")
            Capitalize = Replace(Capitalize,vbCrLf & "i ",vbCrLf & "I ")
        Else
            Capitalize = text
        End If

    End Function

    'Subtitles that are auto-ripped from a DVD using OCR tend to        '
    'have numbers containing embedded blanks, especially after a 1      '
    'digit. This Function removes them.                                 '

    Function RemoveBlanksInNumbers(ByVal text)

        Dim match

        Do While rex.Test(text)
            For Each match In rex.Execute(text)
                text = rex.Replace(text,Replace(match.Value," ",""))
            Next
        Loop

        RemoveBlanksInNumbers = text

    End Function

    'Certain subtitle blocks contain ads and such that do not have      '
    'anything to do with the video. This function returns True if       '
    'the text block is not obviously an ad.                             '
    '                                                                   '
    '   text            dialog text                                     '

    Private Function NotAnAd (ByVal text)

        Dim adwords: adwords = Array( _
            ".com "                     , _
            "? ?"                       , _
            "closed captions"           , _
            "dvdrip"                    , _
            "gom player"                , _
            "http:"                     , _
            "https:"                    , _
            "joycasino"                 , _
            "opensubtitles"             , _
            "professional translation"  , _
            "raceman"                   , _
            "subtitles"                 , _
            "sync by"                   , _
            "sync,"                     , _
            "synced"                    , _
            "synchronized by"           , _
            "captioned by"              , _
            "captioning"                , _
            ".com"                      , _
            "www."                      )

        NotAnAd = False

        text = lcase(text)

        If Instr(text,"rate") > 0 And Instr(text,"subtitle") > 0 Then
            Wscript.Echo "FOUND: rate AND subtitle IN",text
            Exit Function
        End If

        For Each entry in adwords
            if Instr(1,text,entry,vbTextCompare) > 0 Then
                Wscript.Echo "FOUND:",entry,"IN",text
                Exit Function
            End If
        Next

        NotAnAd = True

    End Function

End Class