├── .gitattributes
├── .gitignore
├── LICENSE
├── README.md
└── SRT-To-SSML.py


/.gitattributes:
--------------------------------------------------------------------------------
1 | # Auto detect text files and perform LF normalization
2 | * text=auto
3 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | 
2 | subtitles.srt
3 | SSML.txt
4 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2022 ThioJoe
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # SRT-To-SSML
 2 |  Converts an SRT subtitle file to an SSML file with speech durations. 
 3 |  
 4 | #### Note: If looking for a more comprehensive tool for also generating synced and translated dubs, visit [my other repo](https://github.com/ThioJoe/Auto-Synced-Translated-Dubs).
 5 | 
 6 | ### Use Cases
 7 | - Using TTS to generate speech for a video using only subtitles
 8 | - Automated translation and dubbing of videos while keeping the dub in sync. You can simply translate the text portions of the subtitles before feeding them into the script. Each translated line is then given the same duration as the original speech, so the generated speech should theoretically be a drop-in replacement for the original.
 9 | 
10 | ### How it Works:
11 | - It takes the text lines from the subtitle file and puts each on a separate line within the `speak` tag
12 | - It takes the start and end timestamps of each subtitle line and calculates the difference in milliseconds. It then uses that value for the `duration` attribute of the `prosody` tag. This tells the TTS how long it should take to say the line, so it stays in sync with the original video.
13 |   - Note: Not every TTS service supports the `duration` attribute. Amazon Polly (non-neural voices only) and Azure Speech support timed speech, but through their own tags (`amazon:max-duration` and `mstts:audioduration`), which this script uses automatically when the matching service mode is selected.
14 | - It also calculates the time difference between the end of one subtitle line and the beginning of the next, and uses that as the `time` attribute for the `break` tag at the end of each text line. This is also to keep it in sync with the original video.
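The timing arithmetic described above can be sketched as follows. This is a minimal illustration rather than the script's actual code: the helper names `to_ms` and `timings` are hypothetical, and it assumes well-formed `HH:MM:SS,mmm --> HH:MM:SS,mmm` lines.

```python
import re

# Matches one SRT timestamp such as 00:00:20,130
TIMESTAMP = re.compile(r"(\d\d):(\d\d):(\d\d),(\d\d\d)")


def to_ms(timestamp: str) -> int:
    """Convert an SRT timestamp (HH:MM:SS,mmm) to milliseconds."""
    match = TIMESTAMP.fullmatch(timestamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {timestamp!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    return hours * 3_600_000 + minutes * 60_000 + seconds * 1_000 + millis


def timings(time_lines):
    """Return (duration_ms, break_until_next_ms) for each 'start --> end' line."""
    spans = [tuple(to_ms(t) for t in line.split(" --> ")) for line in time_lines]
    result = []
    for i, (start, end) in enumerate(spans):
        # Gap until the next subtitle starts; the last subtitle has no following break
        gap = spans[i + 1][0] - end if i + 1 < len(spans) else 0
        result.append((end - start, gap))
    return result


example = timings([
    "00:00:00,140 --> 00:00:05,050",
    "00:00:05,240 --> 00:00:13,290",
])
print(example)  # [(4910, 190), (8050, 0)]
```

The first tuple matches the first line of the example output further down: a `duration` of 4910ms followed by a 190ms `break`.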
15 | 
16 | ### Other Notable Features
17 | - Automatic tag configuration based on TTS service (currently supports Microsoft Azure and Amazon Polly non-neural voices)
18 |   - Note: Currently only Azure Speech seems to support specifying the duration of speech for neural voices. Therefore that is the only service that can properly take advantage of this script. Amazon Polly does too, but only for standard non-neural voices.
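As a rough sketch of what the per-service tag configuration amounts to, the function below formats one subtitle line for each service mode. The function name `format_line` and its parameters are hypothetical. The Azure branch assumes the `mstts:audioduration` element takes its duration in a `value` attribute, which is worth checking against the current Azure Speech documentation.

```python
def format_line(text: str, duration_ms: int, service_mode: str = "generic",
                voice_name: str = "en-US-DavisNeural") -> str:
    """Wrap already-escaped text in the duration markup for one service."""
    if service_mode == "azure":
        # Azure: the duration is an empty element placed inside each voice element
        # (assumed syntax: <mstts:audioduration value="...ms"/>)
        return (f'<voice name="{voice_name}">'
                f'<mstts:audioduration value="{duration_ms}ms"/>{text}</voice>')
    if service_mode == "amazon-standard-voice":
        # Amazon Polly standard voices: a prosody attribute with an Amazon-specific name
        return f'<prosody amazon:max-duration="{duration_ms}ms">{text}</prosody>'
    # Generic SSML: the standard prosody duration attribute
    return f'<prosody duration="{duration_ms}ms">{text}</prosody>'


for mode in ("generic", "amazon-standard-voice", "azure"):
    print(format_line("Hello there.", 770, mode))
```

Keeping the service differences in one place like this means the rest of the conversion (parsing, timing, escaping) stays identical across services.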
19 | 
20 | ### SSML Options Changeable With Variables
21 | - Language
22 | - TTS Voice Name
23 | - SSML Version
24 | - xmlns Attributes for `<speak>` tag
25 | - Whether to include the `xmlns:xsi` and `xsi:schemaLocation` attributes
26 | - Input and Output file names (Defaults: `subtitles.srt` for input and `SSML.txt` for output)
27 | - Duration Attribute Name
28 | 
29 | # Example
30 | ### Input (SRT Subtitle File)
31 | ```
32 | 1
33 | 00:00:00,140 --> 00:00:05,050
34 | This is an example of a subtitle file with a bunch of random words I've added with various timestamps.
35 | 
36 | 2
37 | 00:00:05,240 --> 00:00:13,290
38 | Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim
39 | 
40 | 3
41 | 00:00:13,480 --> 00:00:14,250
42 | veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
43 | 
44 | 4
45 | 00:00:14,340 --> 00:00:19,930
46 | Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla.
47 | 
48 | 5
49 | 00:00:20,130 --> 00:00:23,419
50 | Now some examples of some escaped characters such as & and ' and " and < and > just to name a few
51 | ```
52 | 
53 | 
54 | ### Output
55 | ```xml
56 | <?xml version="1.0" encoding="UTF-8"?>
57 | <speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd" version="1.0" xml:lang="en-US"><voice name="en-US-DavisNeural">
58 | 	<prosody duration="4910ms">This is an example of a subtitle file with a bunch of random words I&apos;ve added with various timestamps.</prosody><break time="190ms"/>
59 | 	<prosody duration="8050ms">Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim</prosody><break time="190ms"/>
60 | 	<prosody duration="770ms">veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</prosody><break time="90ms"/>
61 | 	<prosody duration="5590ms">Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla.</prosody><break time="200ms"/>
62 | 	<prosody duration="3289ms">Now some examples of some escaped characters such as &amp; and &apos; and &quot; and &lt; and &gt; just to name a few</prosody>
63 | </voice></speak>
64 | ```
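The character escaping visible in the last output line can also be reproduced with Python's standard library. The sketch below is an alternative to chained `str.replace` calls, not the script's own implementation, and the helper name `escape_ssml` is hypothetical.

```python
from xml.sax.saxutils import escape


def escape_ssml(text: str) -> str:
    """Escape the five XML special characters in subtitle text."""
    # escape() handles &, < and > by default; quotes are added via the entities mapping
    return escape(text, {'"': "&quot;", "'": "&apos;"})


print(escape_ssml("such as & and ' and \" and < and >"))
# such as &amp; and &apos; and &quot; and &lt; and &gt;
```

The ampersand has to be escaped before the other characters, otherwise the `&` in entities such as `&lt;` would itself be escaped a second time. `escape()` already applies the replacements in that order.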
65 | 


--------------------------------------------------------------------------------
/SRT-To-SSML.py:
--------------------------------------------------------------------------------
  1 | # This script converts an SRT subtitle file to an SSML file, using the timestamps of the subtitles to accurately create tags for speech timing.
  2 | # Note: Many services do not support the 'duration' parameter, so this might not always work as expected.
  3 | #
  4 | # I offer no warranty or guarantees of any kind. Use at your own risk. I didn't even know what SSML meant a few days ago.
  5 | #--------------------------------------------------------------
  6 | import re
  7 | 
  8 | #====================================================================================================
  9 | #======================================== USER VARIABLES ============================================
 10 | #====================================================================================================
 11 | 
 12 | #------- Basic Options -------
 13 | # Path to Subtitles File
 14 | srtFile = r"subtitles.srt"
 15 | # Output file name
 16 | outputFile = "SSML.txt"
 17 | 
 18 | #------- SSML Options -------
 19 |     # Service Mode - Automatically adjusts some variables depending on the TTS service
 20 |     # Note: Amazon Polly only supports the duration feature on non-neural voices. Only Azure currently supports duration on neural voices.
 21 |     # Default: "generic"
 22 | serviceMode = "generic" # Possible Values: "azure", "amazon-standard-voice", "generic"
 23 |     # Language
 24 | language = "en-US"
 25 |     # Voice Name - To not specify a voice, put nothing between the quotes or set value to None
 26 | voiceName = "en-US-DavisNeural"
 27 |     # Whether to escape special characters in the text. Possible Values: True, False
 28 | enableCharacterEscape = True
 29 | 
 30 | #------- Advanced SSML Options -------
 31 |     # SSML Version
 32 | ssmlVersion = "1.0"
 33 |     # Whether to include the xmlns:xsi and xsi:schemaLocation attributes in the <speak> tag.
 34 | includeSchemaLocation = True   # Possible Values: True, False
 35 |     #schemaLocations
 36 | schema_1_0 = "http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
 37 | schema_1_1 = "http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
 38 |     # Output File Encoding
 39 | chosenFileEncoding = "utf_8_sig" # utf_8_sig for BOM, utf_8 for no BOM
 40 |     # Dictionary of xmlns attributes to be added to <speak> tag
 41 |     # To not include one, just comment it out
 42 | xmlnsAttributesDict = {
 43 |     "xmlns": "http://www.w3.org/2001/10/synthesis", # Required! See: https://www.w3.org/TR/speech-synthesis11/#S2.1
 44 |     "xmlns:mstts": "http://www.w3.org/2001/mstts", 
 45 |     "xmlns:emo": "http://www.w3.org/2009/10/emotionml",
 46 |     #"xmlns:xsi":           # Don't uncomment this, refer to "includeSchemaLocation" option above
 47 |     #"xsi:schemaLocation":  # Don't uncomment this, refer to "includeSchemaLocation" option above
 48 | }
 49 | 
 50 | # ------- Other Optional Advanced Settings You Probably Don't Need to Worry About -------
 51 |     # NOTE: The script will already automatically account for Microsoft Azure and Amazon Polly, but if you are using a different service, you may wish to change this.
 52 |     # Duration Attribute Name - The standard name for this attribute within the 'prosody' tag is 'duration', however some services may use their own name.
 53 |     # Default/Standard: "duration"
 54 | durationAttributeName = "duration"
 55 |     # If you are using Azure or Amazon, but want to force the use of durationAttributeName instead of whatever name would be set automatically.
 56 |     # You probably don't need to change this.
 57 | overrideDurationAttributeName = False # Default: False
 58 | 
 59 | # ---- Possibly Helpful Resources -----
 60 | # Amazon Polly Duration Tag Info: "amazon:max-duration"  # See: https://docs.aws.amazon.com/polly/latest/dg/supportedtags.html#maxduration-tag
 61 | 
 62 | #====================================================================================================
 63 | #====================================== Start Program ===============================================
 64 | #====================================================================================================
 65 | 
 66 | # ---- Prepare variables with correct formatting ----
 67 | serviceMode = serviceMode.lower()
 68 | useInnerDurationTag = False
 69 | # Only need to set this for Amazon, because it isn't used for Azure
 70 | if serviceMode == "amazon-standard-voice":
 71 |     # If the user chose to override the automatic name, keep their
 72 |     # durationAttributeName instead of replacing it (or assigning it the boolean flag)
 73 |     if not overrideDurationAttributeName:
 74 |         durationAttributeName = "amazon:max-duration"
 75 | elif serviceMode == "azure":
 76 |     # Azure uses an inner mstts:audioduration tag rather than a prosody attribute
 77 |     useInnerDurationTag = True
 78 | 
 79 | 
 80 | # Sets the schemaLocation attribute based on the SSML version you chose
 81 | if includeSchemaLocation:
 82 |     xmlnsAttributesDict["xmlns:xsi"] = "http://www.w3.org/2001/XMLSchema-instance"
 83 |     if ssmlVersion == "1.0":
 84 |         xmlnsAttributesDict["xsi:schemaLocation"] = schema_1_0
 85 |     elif ssmlVersion == "1.1":
 86 |         xmlnsAttributesDict["xsi:schemaLocation"] = schema_1_1
 87 | 
 88 | # Constructs the xmlns attributes string
 89 | xmlnsAttributesString = ""
 90 | for key, value in xmlnsAttributesDict.items():
 91 |     xmlnsAttributesString += f"{key}=\"{value}\" "
 92 | xmlnsAttributesString = xmlnsAttributesString.strip() # Remove extra space at end
 93 | 
 94 | # Creates function to escape special characters such as: " & ' < >
 95 | def escapeChars(enableCharacterEscape, text):
 96 |     if enableCharacterEscape:
 97 |         text = text.replace("&", "&amp;")
 98 |         text = text.replace('"', "&quot;")
 99 |         text = text.replace("'", "&apos;")
100 |         text = text.replace("<", "&lt;")
101 |         text = text.replace(">", "&gt;")
102 |     return text
103 | 
104 | 
105 | #======================================== Parse SRT File ================================================
106 | # Open an srt file and read the lines into a list
107 | with open(srtFile, 'r', encoding='utf-8-sig') as f:
108 |     lines = f.readlines()
109 | 
110 | # Matches the following example with regex:    00:00:20,130 --> 00:00:23,419
111 | subtitleTimeLineRegex = re.compile(r'\d\d:\d\d:\d\d,\d\d\d --> \d\d:\d\d:\d\d,\d\d\d')
112 | 
113 | # Create a dictionary
114 | subsDict = {}
115 | 
116 | # Enumerate lines, and if a line in lines contains only an integer, put that number in the key, and a dictionary in the value
117 | # The dictionary contains the start, ending, and duration of the subtitles as well as the text
118 | # The next line uses the syntax HH:MM:SS,MMM --> HH:MM:SS,MMM . Get the difference between the two times and put that in the dictionary
119 | # For the line after that, put the text in the dictionary
120 | for lineNum, line in enumerate(lines):
121 |     line = line.strip()
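122 |     # If the line is a subtitle index followed by a timestamp line (with room left for the text line)
123 |     if line.isdigit() and lineNum + 2 < len(lines) and subtitleTimeLineRegex.match(lines[lineNum + 1]):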
124 |         lineWithTimestamps = lines[lineNum + 1].strip()
125 |         lineWithSubtitleText = lines[lineNum + 2].strip()
126 | 
127 |         # If there are more lines after the subtitle text, add them to the text
128 |         count = 3
129 |         while True:
130 |             # Check if the next line is blank or not
131 |             if (lineNum+count) < len(lines) and lines[lineNum + count].strip():
132 |                 lineWithSubtitleText += ' ' + lines[lineNum + count].strip()
133 |                 count += 1
134 |             else:
135 |                 break
136 | 
137 |         # Create empty dictionary with keys for start and end times and subtitle text
138 |         subsDict[line] = {'start_ms': '', 'end_ms': '', 'duration_ms': '', 'text': '', 'break_until_next': ''}
139 | 
140 |         time = lineWithTimestamps.split(' --> ')
141 |         time1 = time[0].split(':')
142 |         time2 = time[1].split(':')
143 |         # Converts the time to milliseconds
144 |         processedTime1 = int(time1[0]) * 3600000 + int(time1[1]) * 60000 + int(time1[2].split(',')[0]) * 1000 + int(time1[2].split(',')[1]) #/ 1000 #Uncomment to turn into seconds
145 |         processedTime2 = int(time2[0]) * 3600000 + int(time2[1]) * 60000 + int(time2[2].split(',')[0]) * 1000 + int(time2[2].split(',')[1]) #/ 1000 #Uncomment to turn into seconds
146 |         timeDifferenceMs = str(processedTime2 - processedTime1)
147 |         # Set the keys in the dictionary to the values
148 |         subsDict[line]['start_ms'] = str(processedTime1)
149 |         subsDict[line]['end_ms'] = str(processedTime2)
150 |         subsDict[line]['duration_ms'] = timeDifferenceMs
151 |         subsDict[line]['text'] = lineWithSubtitleText
152 |         # Go back to the previous subtitle's entry (if there is one) and store the gap
153 |         # between its end and the start of the current subtitle, never below zero
154 |         previousKey = str(int(line) - 1)
155 |         if previousKey in subsDict:
156 |             subsDict[previousKey]['break_until_next'] = str(max(0, processedTime1 - int(subsDict[previousKey]['end_ms'])))
157 | 
158 | #=========================================== Create SSML File ============================================
159 | # Make voice tag if applicable
160 | if voiceName is None or voiceName == '' or voiceName.lower() == 'none':
161 |     voiceTag = ''
162 |     voiceTagEnd = ''
163 | else:
164 |     voiceTag = '<voice name="' + voiceName + '">'
165 |     voiceTagEnd = '</voice>'
166 | 
167 | # Set Up Special Tag If Necessary
168 | if useInnerDurationTag:
169 |     if serviceMode == "azure":
170 |         specialDurationTag = "mstts:audioduration"
171 | else:
172 |     specialDurationTag = None
173 | 
174 | # Encoding with utf-8-sig adds BOM to the beginning of the file, because use with Azure requires it
175 | with open(outputFile, 'w', encoding=chosenFileEncoding) as f:
176 |     # Write the header
177 |     f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
178 |     f.write(f'<speak {xmlnsAttributesString} version="{ssmlVersion}" xml:lang="{language}">\n')
179 |     # If using Azure, each duration tag must be inside a voice tag, so need to write voice tag later
180 |     if not useInnerDurationTag:
181 |         f.write(f'{voiceTag}\n')
182 | 
183 |     # Write SSML tags with the text and duration from the dictionary
184 |     # Prosody Syntax: https://www.w3.org/TR/speech-synthesis11/#S3.2.4
185 |     for key, value in subsDict.items():
186 |         # Get Break Time
187 |         if not value['break_until_next'] or value['break_until_next'] == '0':
188 |             breakTimeString = ''
189 |         else:
190 |             breakTime = str(value['break_until_next'])
191 |             breakTimeString = f'<break time="{breakTime}ms"/>'
192 |         
193 |         # Get and escape text, then write
194 |         text = escapeChars(enableCharacterEscape, value['text'])
195 |         # Format each line of text, then write
196 |         if not useInnerDurationTag:
197 |             textToWrite = (f'\t<prosody {durationAttributeName}="{value["duration_ms"]}ms">{text}</prosody>{breakTimeString}\n')
198 |         else:
199 |             textToWrite = (f'\t{voiceTag}<{specialDurationTag} value="{value["duration_ms"]}ms"/>{text}{voiceTagEnd}{breakTimeString}\n')
200 |         f.write(textToWrite)
201 | 
202 |     # Write ending voice tag if applicable
203 |     if not useInnerDurationTag:
204 |         f.write(f'{voiceTagEnd}\n')
205 |     # Write ending speak tag
206 |     f.write('</speak>\n')
207 | 


--------------------------------------------------------------------------------