├── .gitattributes
├── .gitignore
├── LICENSE
├── README.md
└── SRT-To-SSML.py


/.gitattributes:
--------------------------------------------------------------------------------
1 | # Auto detect text files and perform LF normalization
2 | * text=auto
3 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | 
2 | subtitles.srt
3 | SSML.txt
4 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2022 ThioJoe
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # SRT-To-SSML
2 | Converts an SRT subtitle file to an SSML file with speech durations.
3 | 
4 | #### Note: If you are looking for a more comprehensive tool that also generates synced and translated dubs, visit [my other repo](https://github.com/ThioJoe/Auto-Synced-Translated-Dubs).
5 | 
6 | ### Use Cases
7 | - Using TTS to generate speech for a video using only subtitles
8 | - Automated translation and dubbing of videos while keeping the dub in sync. You can simply translate the text portions of the subtitles before feeding them into the script. This lets the translation of each line keep the same duration as the original speech, so the generated speech should theoretically be a drop-in replacement for the original.
9 | 
10 | ### How it Works:
11 | - It takes the text lines from the subtitle file and puts each on a separate line within the `speak` tag
12 | - It takes the start and end timestamps of each subtitle line and calculates the difference in milliseconds. It then uses that value for the `duration` attribute of the `prosody` tag. This tells the TTS how long it should take to say the line, so it will stay in sync with the original video.
13 | - Note: Not every neural TTS service supports the duration feature. Amazon Polly (non-neural voices only) and Azure Speech do, but use their own tags, which this script will automatically use instead.
14 | - It also calculates the time difference between the end of one subtitle line and the beginning of the next, and uses that as the `time` attribute for the `break` tag at the end of each text line. This is also to keep it in sync with the original video.
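The two timing calculations described in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the script's actual code: the helper names `to_ms` and `timing` are hypothetical, and the timestamps are taken from the first two entries of the example input in this README.

```python
import re
from typing import Optional, Tuple

# Hypothetical helpers for illustration only; they are not part of SRT-To-SSML.py.
# An SRT timestamp looks like HH:MM:SS,mmm
TIMESTAMP = re.compile(r"(\d+):(\d\d):(\d\d),(\d\d\d)")


def to_ms(timestamp: str) -> int:
    """Convert an SRT timestamp such as '00:00:05,050' to milliseconds."""
    match = TIMESTAMP.fullmatch(timestamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {timestamp!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def timing(time_line: str, next_time_line: Optional[str] = None) -> Tuple[int, int]:
    """Return (duration_ms, break_ms) for one subtitle.

    duration_ms is the value for the prosody 'duration' attribute,
    break_ms is the pause until the next subtitle starts (0 if there is none).
    """
    start, end = (to_ms(part) for part in time_line.split(" --> "))
    break_ms = 0
    if next_time_line is not None:
        next_start = to_ms(next_time_line.split(" --> ")[0])
        break_ms = next_start - end
    return end - start, break_ms


print(timing("00:00:00,140 --> 00:00:05,050", "00:00:05,240 --> 00:00:13,290"))
# (4910, 190)
```

For the first subtitle this gives a speech duration of 4910 ms followed by a 190 ms pause, which are the values that would go into the `duration` and `time` attributes respectively.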
15 | 
16 | ### Other Notable Features
17 | - Automatic tag configuration based on TTS service (currently supports Microsoft Azure and Amazon Polly non-neural voices)
18 | - Note: Currently only Azure Speech seems to support specifying the duration of speech for neural voices, so it is the only service that can take full advantage of this script. Amazon Polly supports it too, but only for standard (non-neural) voices.
19 | 
20 | ### SSML Options Changeable With Variables
21 | - Language
22 | - TTS Voice Name
23 | - SSML Version
24 | - xmlns attributes for the `speak` tag
25 | - Whether to include the `xmlns:xsi` and `xsi:schemaLocation` attributes
26 | - Input and Output file names (Defaults: `subtitles.srt` for input and `SSML.txt` for output)
27 | - Duration Attribute Name
28 | 
29 | # Example
30 | ### Input (SRT Subtitle File)
31 | ```
32 | 1
33 | 00:00:00,140 --> 00:00:05,050
34 | This is an example of a subtitle file with a bunch of random words I've added with various timestamps.
35 | 
36 | 2
37 | 00:00:05,240 --> 00:00:13,290
38 | Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim
39 | 
40 | 3
41 | 00:00:13,480 --> 00:00:14,250
42 | veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
43 | 
44 | 4
45 | 00:00:14,340 --> 00:00:19,930
46 | Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla.
47 | 
48 | 5
49 | 00:00:20,130 --> 00:00:23,419
50 | Now some examples of some escaped characters such as & and ' and " and < and > just to name a few
51 | ```
52 | 
53 | 
54 | ### Output
55 | ```xml
56 | 
57 | 
58 | This is an example of a subtitle file with a bunch of random words I've added with various timestamps.
59 | Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim
60 | veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
61 | Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla.
62 | Now some examples of some escaped characters such as & and ' and " and < and > just to name a few
63 | 
64 | ```
65 | 
--------------------------------------------------------------------------------
/SRT-To-SSML.py:
--------------------------------------------------------------------------------
1 | # This script converts an SRT subtitle file to an SSML file, using the timestamps of the subtitles to accurately create tags for speech timing.
2 | # Note: Many services do not support the 'duration' parameter, so this might not always work as expected.
3 | #
4 | # I offer no warranty or guarantees of any kind. Use at your own risk. I didn't even know what SSML meant a few days ago.
5 | #--------------------------------------------------------------
6 | import re
7 | 
8 | #====================================================================================================
9 | #======================================== USER VARIABLES ============================================
10 | #====================================================================================================
11 | 
12 | #------- Basic Options -------
13 | # Path to Subtitles File
14 | srtFile = r"subtitles.srt"
15 | # Output file name
16 | outputFile = "SSML.txt"
17 | 
18 | #------- SSML Options -------
19 | # Service Mode - Automatically adjusts some variables depending on the TTS service
20 | # Note: Amazon Polly only supports the duration feature on non-neural voices. Only Azure currently supports duration on neural voices.
21 | # Default: "generic"
22 | serviceMode = "generic" # Possible Values: "azure", "amazon-standard-voice", "generic"
23 | # Language
24 | language = "en-US"
25 | # Voice Name - To not specify a voice, put nothing between the quotes or set value to None
26 | voiceName = "en-US-DavisNeural"
27 | # Whether to escape special characters in the text. Possible Values: True, False
28 | enableCharacterEscape = True
29 | 
30 | #------- Advanced SSML Options -------
31 | # SSML Version
32 | ssmlVersion = "1.0"
33 | # Whether to include the xmlns:xsi and xsi:schemaLocation attributes in the <speak> tag.
34 | includeSchemaLocation = True # Possible Values: True, False
35 | #schemaLocations
36 | schema_1_0 = "http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
37 | schema_1_1 = "http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
38 | # Output File Encoding
39 | chosenFileEncoding = "utf_8_sig" # utf_8_sig for BOM, utf_8 for no BOM
40 | # Dictionary of xmlns attributes to be added to the <speak> tag
41 | # To not include one, just comment it out
42 | xmlnsAttributesDict = {
43 |     "xmlns": "http://www.w3.org/2001/10/synthesis", # Required! See: https://www.w3.org/TR/speech-synthesis11/#S2.1
44 |     "xmlns:mstts": "http://www.w3.org/2001/mstts",
45 |     "xmlns:emo": "http://www.w3.org/2009/10/emotionml",
46 |     #"xmlns:xsi": # Don't uncomment this, refer to "includeSchemaLocation" option above
47 |     #"xsi:schemaLocation": # Don't uncomment this, refer to "includeSchemaLocation" option above
48 | }
49 | 
50 | # ------- Other Optional Advanced Settings You Probably Don't Need to Worry About -------
51 | # NOTE: The script will already automatically account for Microsoft Azure and Amazon Polly, but if you are using a different service, you may wish to change this.
52 | # Duration Attribute Name - The standard name for this attribute within the 'prosody' tag is 'duration', however some services may use their own name.
53 | # Default/Standard: "duration"
54 | durationAttributeName = "duration"
55 | # Set this to True if you are using Azure or Amazon but want to force the use of durationAttributeName instead of whatever would be set automatically.
56 | # You probably don't need to change this.
57 | overrideDurationAttributeName = False # Default: False
58 | 
59 | # ---- Possibly Helpful Resources -----
60 | # Amazon Polly Duration Tag Info: "amazon:max-duration" # See: https://docs.aws.amazon.com/polly/latest/dg/supportedtags.html#maxduration-tag
61 | 
62 | #====================================================================================================
63 | #====================================== Start Program ===============================================
64 | #====================================================================================================
65 | 
66 | # ---- Prepare variables with correct formatting ----
67 | serviceMode = serviceMode.lower()
68 | useInnerDurationTag = False
69 | # If the user chooses to override the automatic tag, fall back to generic mode
70 | # so that durationAttributeName is used as-is in the 'prosody' tag
71 | if overrideDurationAttributeName:
72 |     serviceMode = "generic"
73 | 
74 | # Only need to set this for Amazon, because it isn't used for Azure
75 | if serviceMode == "amazon-standard-voice":
76 |     durationAttributeName = "amazon:max-duration"
77 | elif serviceMode == "azure":
78 |     useInnerDurationTag = True
79 | 
80 | # Sets the schemaLocation attribute based on the SSML version you chose
81 | if includeSchemaLocation:
82 |     xmlnsAttributesDict["xmlns:xsi"] = "http://www.w3.org/2001/XMLSchema-instance"
83 |     if ssmlVersion == "1.0":
84 |         xmlnsAttributesDict["xsi:schemaLocation"] = schema_1_0
85 |     elif ssmlVersion == "1.1":
86 |         xmlnsAttributesDict["xsi:schemaLocation"] = schema_1_1
87 | 
88 | # Constructs the xmlns attributes string
89 | xmlnsAttributesString = ""
90 | for key, value in xmlnsAttributesDict.items():
91 |     xmlnsAttributesString += f"{key}=\"{value}\" "
92 | xmlnsAttributesString = 
xmlnsAttributesString.strip() # Remove extra space at end
93 | 
94 | # Creates function to escape special characters such as: " & ' < >
95 | def escapeChars(enableCharacterEscape, text):
96 |     if enableCharacterEscape:
97 |         text = text.replace("&", "&amp;") # Must come first so the entities added below are not escaped again
98 |         text = text.replace('"', "&quot;")
99 |         text = text.replace("'", "&apos;")
100 |         text = text.replace("<", "&lt;")
101 |         text = text.replace(">", "&gt;")
102 |     return text
103 | 
104 | 
105 | #======================================== Parse SRT File ================================================
106 | # Open an srt file and read the lines into a list
107 | with open(srtFile, 'r', encoding='utf-8-sig') as f:
108 |     lines = f.readlines()
109 | 
110 | # Matches the following example with regex: 00:00:20,130 --> 00:00:23,419
111 | subtitleTimeLineRegex = re.compile(r'\d\d:\d\d:\d\d,\d\d\d --> \d\d:\d\d:\d\d,\d\d\d')
112 | 
113 | # Create a dictionary
114 | subsDict = {}
115 | 
116 | # Enumerate lines, and if a line in lines contains only an integer, put that number in the key, and a dictionary in the value
117 | # The dictionary contains the start, ending, and duration of the subtitles as well as the text
118 | # The next line uses the syntax HH:MM:SS,MMM --> HH:MM:SS,MMM .
Get the difference between the two times and put that in the dictionary
119 | # For the line after that, put the text in the dictionary
120 | for lineNum, line in enumerate(lines):
121 |     line = line.strip()
122 |     # If the line is a subtitle index and the next line holds its timestamps
123 |     if line.isdigit() and subtitleTimeLineRegex.match(lines[lineNum + 1]):
124 |         lineWithTimestamps = lines[lineNum + 1].strip()
125 |         lineWithSubtitleText = lines[lineNum + 2].strip()
126 | 
127 |         # If there are more lines after the subtitle text, add them to the text
128 |         count = 3
129 |         while True:
130 |             # Check if the next line is blank or not
131 |             if (lineNum+count) < len(lines) and lines[lineNum + count].strip():
132 |                 lineWithSubtitleText += ' ' + lines[lineNum + count].strip()
133 |                 count += 1
134 |             else:
135 |                 break
136 | 
137 |         # Create empty dictionary with keys for start and end times and subtitle text
138 |         subsDict[line] = {'start_ms': '', 'end_ms': '', 'duration_ms': '', 'text': '', 'break_until_next': ''}
139 | 
140 |         time = lineWithTimestamps.split(' --> ')
141 |         time1 = time[0].split(':')
142 |         time2 = time[1].split(':')
143 |         # Converts the time to milliseconds
144 |         processedTime1 = int(time1[0]) * 3600000 + int(time1[1]) * 60000 + int(time1[2].split(',')[0]) * 1000 + int(time1[2].split(',')[1]) #/ 1000 #Uncomment to turn into seconds
145 |         processedTime2 = int(time2[0]) * 3600000 + int(time2[1]) * 60000 + int(time2[2].split(',')[0]) * 1000 + int(time2[2].split(',')[1]) #/ 1000 #Uncomment to turn into seconds
146 |         timeDifferenceMs = str(processedTime2 - processedTime1)
147 |         # Set the keys in the dictionary to the values
148 |         subsDict[line]['start_ms'] = str(processedTime1)
149 |         subsDict[line]['end_ms'] = str(processedTime2)
150 |         subsDict[line]['duration_ms'] = timeDifferenceMs
151 |         subsDict[line]['text'] = lineWithSubtitleText
152 |         if str(int(line) - 1) in subsDict:
153 |             # Goes back to the previous subtitle's dictionary and writes the gap between its end and the current subtitle's start
154 |             subsDict[str(int(line)-1)]['break_until_next'] = 
str(processedTime1 - int(subsDict[str(int(line) - 1)]['end_ms']))
155 |         else:
156 |             subsDict[line]['break_until_next'] = '0'
157 | 
158 | #=========================================== Create SSML File ============================================
159 | # Make voice tag if applicable
160 | if voiceName is None or voiceName == '' or voiceName.lower() == 'none':
161 |     voiceTag = ''
162 |     voiceTagEnd = ''
163 | else:
164 |     voiceTag = f'<voice name="{voiceName}">'
165 |     voiceTagEnd = '</voice>'
166 | 
167 | # Set Up Special Tag If Necessary
168 | if useInnerDurationTag:
169 |     if serviceMode == "azure":
170 |         specialDurationTag = "mstts:audioduration"
171 | else:
172 |     specialDurationTag = None
173 | 
174 | # Encoding with utf-8-sig adds BOM to the beginning of the file, because use with Azure requires it
175 | with open(outputFile, 'w', encoding=chosenFileEncoding) as f:
176 |     # Write the header
177 |     f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
178 |     f.write(f'<speak version="{ssmlVersion}" {xmlnsAttributesString} xml:lang="{language}">\n')
179 |     # If using Azure, each duration tag must be inside a voice tag, so need to write voice tag later
180 |     if not serviceMode == "azure":
181 |         f.write(f'{voiceTag}\n')
182 | 
183 |     # Write SSML tags with the text and duration from the dictionary
184 |     # Prosody Syntax: https://www.w3.org/TR/speech-synthesis11/#S3.2.4
185 |     for key, value in subsDict.items():
186 |         # Get Break Time
187 |         if not value['break_until_next'] or value['break_until_next'] == '0':
188 |             breakTimeString = ''
189 |         else:
190 |             breakTime = str(value['break_until_next'])
191 |             breakTimeString = f'<break time="{breakTime}ms"/>'
192 | 
193 |         # Get and escape text, then write
194 |         text = escapeChars(enableCharacterEscape, value['text'])
195 |         # Format each line of text, then write (Azure's mstts:audioduration element takes the duration in its 'value' attribute)
196 |         if not useInnerDurationTag:
197 |             textToWrite = (f'\t<prosody {durationAttributeName}="{value["duration_ms"]}ms">{text}</prosody>{breakTimeString}\n')
198 |         else:
199 |             textToWrite = (f'\t{voiceTag}<{specialDurationTag} value="{value["duration_ms"]}ms"/>{text}{voiceTagEnd}{breakTimeString}\n')
200 |         f.write(textToWrite)
201 | 
202 |     # Write ending voice tag if applicable
203 |     if not useInnerDurationTag:
204 | 
              f.write(f'{voiceTagEnd}\n')
205 |     # Write ending speak tag
206 |     f.write('</speak>\n')
207 | 
--------------------------------------------------------------------------------
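As a side note on the `escapeChars` function in `SRT-To-SSML.py`: the same five XML characters can also be escaped with the Python standard library. The sketch below is an alternative illustration, not what the script does; the wrapper name `escape_ssml_text` is hypothetical. `xml.sax.saxutils.escape` handles `&`, `<` and `>` by itself, and the extra dictionary adds the two quote characters.

```python
from xml.sax.saxutils import escape


def escape_ssml_text(text: str) -> str:
    """Escape the five XML special characters in subtitle text.

    Hypothetical helper for illustration; not part of SRT-To-SSML.py.
    escape() replaces '&' first, so the entities added for the quote
    characters afterwards are not escaped a second time.
    """
    return escape(text, {'"': "&quot;", "'": "&apos;"})


sample = "such as & and ' and \" and < and >"
print(escape_ssml_text(sample))
# such as &amp; and &apos; and &quot; and &lt; and &gt;
```

Text that contains none of these characters is returned unchanged, so the helper is safe to apply to every subtitle line.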