├── LICENSE.md
├── README.md
└── subs_extract.py

/LICENSE.md:
--------------------------------------------------------------------------------
1 | Copyright (c) 2019 Nathan Vegdahl
2 | 
3 | Permission is hereby granted, free of charge, to any person obtaining a copy of
4 | this software and associated documentation files (the "Software"), to deal in
5 | the Software without restriction, including without limitation the rights to
6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
7 | of the Software, and to permit persons to whom the Software is furnished to do
8 | so, subject to the following conditions:
9 | 
10 | The above copyright notice and this permission notice shall be included in all
11 | copies or substantial portions of the Software.
12 | 
13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
19 | SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Subs Extract
2 | 
3 | A simple Python script that uses ffmpeg to extract subtitles and audio from a video. The purpose is to facilitate creating flash cards from foreign-language TV shows and movies.
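For reference, the audio extraction step boils down to one ffmpeg call per subtitle line. A minimal sketch of the argument list the script builds (the helper name, file names, and timestamps here are illustrative, not part of the script itself; the flags mirror those used in subs_extract.py below):

```python
def extract_audio_args(video, start, length, out_mp3):
    # Mirrors the flags subs_extract.py passes to ffmpeg: never overwrite
    # (-n), drop the video stream (-vn), seek to the line's start (-ss),
    # cut to the line's length (-t), and write mono 44.1 kHz MP3 audio.
    return ["ffmpeg", "-n", "-vn", "-ss", start, "-i", video,
            "-aq", "8", "-t", length, "-ar", "44100", "-ac", "1", out_mp3]

args = extract_audio_args("a_really_cool_show.mp4", "0:01:23.40",
                          "0:00:03.20", "clip.mp3")
# To actually run it (requires ffmpeg on your PATH):
#     import subprocess
#     subprocess.run(args)
```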
4 | 
5 | It takes as input a video file and a corresponding .ssa, .ass, .vtt, or .srt subtitle file, like this:
6 | 
7 | ```
8 | subs_extract.py a_really_cool_show.mp4 a_really_cool_show.srt
9 | ```
10 | 
11 | It then creates a text file, an mp3 file, and an image thumbnail for each sentence in the subtitles, putting them in a new directory named after the video file. It also creates a deck file that can be imported into Anki, with the following fields per note:
12 | 
13 | 1. The line's subtitle text.
14 | 2. The filename of the line's audio file, wrapped in an Anki audio tag.
15 | 3. A blank field (unless a second subtitle file was provided--see the next section of this readme).
16 | 4. The filename of the line's image thumbnail.
17 | 5. The name of the video file the line came from, without the file extension (this basically identifies the movie/episode, assuming your video files are named appropriately).
18 | 6. A timestamp, indicating where in the video file the line is.
19 | 
20 | 
21 | ## Using a second subtitle file
22 | 
23 | You can also pass a second subtitle file like so:
24 | 
25 | ```
26 | subs_extract.py a_really_cool_show.mp4 a_really_cool_show.jp.srt a_really_cool_show.en.srt
27 | ```
28 | 
29 | If you do this, Subs Extract will attempt to find a matching subtitle in the second file for every subtitle in the first, and include it as a translation. Matching is based on the subtitles' start times. It isn't perfect--there will usually be a handful of weird matches per video--but it generally does an okay job. If the timing of a given subtitle is too different, no match is included at all.
30 | 
31 | The generated deck file will also include the second subtitle like so:
32 | 
33 | 1. The line's subtitle text.
34 | 2. The filename of the line's audio file, wrapped in an Anki audio tag.
35 | 3. **Matching subtitle from second subtitle file.**
36 | 4. The filename of the line's image thumbnail.
37 | 5. The name of the video file the line came from, without the file extension (this basically identifies the movie/episode, assuming your video files are named appropriately).
38 | 6. A timestamp, indicating where in the video file the line is.
39 | 
--------------------------------------------------------------------------------
/subs_extract.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | 
3 | import sys
4 | import os
5 | import subprocess
6 | import re
7 | 
8 | 
9 | def timecode_to_milliseconds(code):
10 |     """ Takes a time code and converts it into an integer of milliseconds.
11 |     """
12 |     elements = code.replace(",", ".").split(":")
13 |     assert len(elements) < 4
14 | 
15 |     milliseconds = 0
16 |     if len(elements) >= 1:
17 |         milliseconds += int(float(elements[-1]) * 1000)
18 |     if len(elements) >= 2:
19 |         milliseconds += int(elements[-2]) * 60000
20 |     if len(elements) >= 3:
21 |         milliseconds += int(elements[-3]) * 3600000
22 | 
23 |     return milliseconds
24 | 
25 | 
26 | def milliseconds_to_timecode(milliseconds):
27 |     """ Takes a time in milliseconds and converts it into a time code.
28 |     """
29 |     hours = milliseconds // 3600000
30 |     milliseconds %= 3600000
31 |     minutes = milliseconds // 60000
32 |     milliseconds %= 60000
33 |     seconds = milliseconds // 1000
34 |     milliseconds %= 1000
35 |     return "{}:{:02}:{:02}.{:02}".format(hours, minutes, seconds, milliseconds // 10)
36 | 
37 | 
38 | def parse_ass_file(path, padding=0):
39 |     """ Parses an entire .ass file, extracting the dialogue.
40 | 
41 |         Returns a list of (start, length, dialogue) tuples, one for each
42 |         subtitle found in the file. A padding of "padding" milliseconds is
43 |         added to the start/end times.
44 |     """
45 |     subtitles = []
46 |     with open(path) as f:
47 |         # First find out the field order of the dialogue.
48 |         found_start = False
49 |         fields = {}
50 |         for line in f:
51 |             if not found_start:
52 |                 found_start = line.strip() == "[Events]"
53 |             elif line.strip().startswith("Format:"):
54 |                 line = line[7:].strip()
55 |                 tmp = line.split(",")
56 |                 for i in range(len(tmp)):
57 |                     fields[tmp[i].strip().lower()] = i
58 |                 break
59 |         if ("start" not in fields) or ("end" not in fields) or ("text" not in fields):
60 |             raise Exception("'Start', 'End', or 'Text' field not found.")
61 | 
62 |         # Then parse the dialogue lines.
63 |         start_times = set()  # Used to prevent duplicates
64 |         for line in f:
65 |             if line.strip().startswith("Dialogue:"):
66 |                 elements = line[9:].strip().split(',', len(fields) - 1)
67 |                 start_element = elements[fields["start"]].strip()
68 |                 end_element = elements[fields["end"]].strip()
69 |                 text_element = elements[-1].strip()
70 | 
71 |                 if (start_element not in start_times) and (text_element != ""):
72 |                     start_times.add(start_element)
73 |                     start = max(timecode_to_milliseconds(start_element) - padding, 0)
74 |                     end = timecode_to_milliseconds(end_element) + padding
75 |                     length = end - start
76 |                     subtitles += [(
77 |                         milliseconds_to_timecode(start),
78 |                         milliseconds_to_timecode(length),
79 |                         text_element,
80 |                     )]
81 |     subtitles.sort()
82 |     return subtitles
83 | 
84 | 
85 | def parse_vtt_file(path, padding=0):
86 |     """ Parses an entire WebVTT/SRT file, extracting the dialogue.
87 | 
88 |         Returns a list of (start, length, dialogue) tuples, one for each
89 |         subtitle found in the file. A padding of "padding" milliseconds is
90 |         added to the start/end times.
91 |     """
92 |     subtitles = []
93 |     with open(path) as f:
94 |         start_times = set()  # Used to prevent duplicates
95 |         for line in f:
96 |             if "-->" in line:
97 |                 # Get the timing.
98 |                 times = line.split("-->")
99 |                 start = max(timecode_to_milliseconds(times[0].strip()) - padding, 0)
100 |                 end = timecode_to_milliseconds(times[1].strip()) + padding
101 |                 length = end - start
102 | 
103 |                 # Get the text.
104 |                 text = ""
105 |                 next_line = f.readline()
106 |                 while next_line.strip() != "":
107 |                     text += next_line
108 |                     next_line = f.readline()
109 |                 text = text.strip()
110 | 
111 |                 # Process text to get rid of unnecessary tags
112 |                 # (e.g. HTML-style <i>...</i> markup and ASS-style {...} overrides).
113 |                 text = re.sub(r"<[^>]*>", "", text)
114 |                 text = re.sub(r"\{.*?\}", "", text)
115 | 
116 |                 # Add to the subtitles list.
117 |                 if (start not in start_times) and (text != ""):
118 |                     start_times.add(start)
119 |                     subtitles += [(
120 |                         milliseconds_to_timecode(start),
121 |                         milliseconds_to_timecode(length),
122 |                         text,
123 |                     )]
124 |     return subtitles
125 | 
126 | 
127 | def parse_subtitle_file(filepath, padding=0):
128 |     """ Parses a subtitle file, attempting to automatically determine the
129 |         file format for parsing.
130 |     """
131 |     if filepath.endswith(".ass") or filepath.endswith(".ssa"):
132 |         return parse_ass_file(filepath, padding)
133 |     elif filepath.endswith(".vtt") or filepath.endswith(".srt"):
134 |         return parse_vtt_file(filepath, padding)
135 |     else:
136 |         raise Exception("Unknown subtitle format. Supported formats are SSA, ASS, VTT, and SRT.")
137 | 
138 | 
139 | def find_closest_sub(subs_list, timecode, max_diff_milliseconds):
140 |     """ Finds the sub in the given list with the start time closest to timecode.
141 | 
142 |         Will only return a matching sub if the start time difference is less
143 |         than max_diff_milliseconds. Otherwise it will return None.
144 |     """
145 |     time_mil = timecode_to_milliseconds(timecode)
146 | 
147 |     # This certainly isn't the most efficient way to do this, but it's
148 |     # dead-simple and does not appear to be a performance bottleneck at all.
149 |     closest_so_far = -1
150 |     closest_diff = max_diff_milliseconds
151 |     for i in range(len(subs_list)):
152 |         sub_start = timecode_to_milliseconds(subs_list[i][0])
153 |         diff = abs(time_mil - sub_start)
154 |         if diff < closest_diff:
155 |             closest_so_far = i
156 |             closest_diff = diff
157 | 
158 |     if closest_so_far >= 0:
159 |         return subs_list[closest_so_far]
160 |     else:
161 |         return None
162 | 
163 | 
164 | if __name__ == "__main__":
165 |     video_filename = sys.argv[1]
166 |     subs_filename = sys.argv[2]
167 |     second_subs_filename = None
168 |     if len(sys.argv) >= 4:
169 |         second_subs_filename = sys.argv[3]
170 |     dir_name = video_filename.rsplit(".", 1)[0]
171 |     base_name = os.path.basename(dir_name)
172 | 
173 |     # Parse the subtitle files
174 |     subtitles = parse_subtitle_file(subs_filename, 300)
175 |     second_subs = parse_subtitle_file(second_subs_filename, 300) if second_subs_filename else None
176 | 
177 |     # Create the directory for the new files if it doesn't already exist.
178 |     try:
179 |         os.mkdir(dir_name)
180 |     except FileExistsError:
181 |         pass
182 |     except Exception as e:
183 |         raise e
184 | 
185 |     # Set up deck file
186 |     deck_out_filepath = os.path.join(dir_name, "0_deck -- {}.txt".format(base_name))
187 |     deck_file = open(deck_out_filepath, 'w')
188 | 
189 |     # Process the subtitles
190 |     first_card = True
191 |     for item in subtitles:
192 |         print("\n\n========================================================")
193 |         print("Extracting \"{}\"".format(item[2]))
194 |         print("========================================================")
195 | 
196 |         # Generate base filename
197 |         base_filename = os.path.join(dir_name, "{} -- {}".format(base_name, item[0].replace(":", "_").replace(".", "-")))
198 | 
199 |         # Find matching alt sub if any.
200 |         if second_subs:
201 |             alt_sub = find_closest_sub(second_subs, item[0], 1000)
202 |             if alt_sub:
203 |                 alt_sub = alt_sub[2]
204 |             else:
205 |                 alt_sub = ""
206 | 
207 |         # Write text file of subtitle
208 |         subtitle_out_filepath = base_filename + ".txt"
209 |         if not os.path.isfile(subtitle_out_filepath):
210 |             with open(subtitle_out_filepath, 'w') as f:
211 |                 f.write(item[2])
212 |                 if second_subs:
213 |                     f.write("\n\n" + alt_sub)
214 | 
215 |         # Extract audio of subtitle into mp3 file
216 |         audio_out_filepath = base_filename + ".mp3"
217 |         if not os.path.isfile(audio_out_filepath):
218 |             subprocess.Popen([
219 |                 "ffmpeg",
220 |                 "-n",
221 |                 "-vn",
222 |                 "-ss",
223 |                 item[0],
224 |                 "-i",
225 |                 video_filename,
226 |                 "-aq", "8",
227 |                 "-t",
228 |                 item[1],
229 |                 "-ar",
230 |                 "44100",
231 |                 "-ac",
232 |                 "1",
233 |                 audio_out_filepath,
234 |             ]).wait()
235 | 
236 |         # Extract video frame of subtitle into jpg file
237 |         image_out_filepath = base_filename + ".jpg"
238 |         if not os.path.isfile(image_out_filepath):
239 |             image_timecode = milliseconds_to_timecode(
240 |                 timecode_to_milliseconds(item[0])
241 |                 + (timecode_to_milliseconds(item[1]) // 2)
242 |             )
243 |             subprocess.Popen([
244 |                 "ffmpeg",
245 |                 "-ss",
246 |                 image_timecode,
247 |                 "-i",
248 |                 video_filename,
249 |                 "-vf",
250 |                 "scale='min(480,iw)':'min(240,ih)':force_original_aspect_ratio=decrease",
251 |                 "-frames:v",
252 |                 "1",
253 |                 "-q:v",
254 |                 "6",
255 |                 "-y",
256 |                 image_out_filepath,
257 |             ]).wait()
258 | 
259 |         # Add card to deck file
260 |         if not first_card:
261 |             deck_file.write("\n")
262 |         first_card = False
263 |         deck_file.write(item[2].replace("\t", " ").replace("\r\n", "<br/>").replace("\n", "<br/>") + "\t")
264 |         deck_file.write("[sound:{}]".format(os.path.basename(audio_out_filepath)) + "\t")
265 |         if second_subs:
266 |             deck_file.write(alt_sub.replace("\t", " ").replace("\r\n", "<br/>").replace("\n", "<br/>") + "\t")
267 |         else:
268 |             deck_file.write("\t")
269 |         deck_file.write('<img src="{}">'.format(os.path.basename(image_out_filepath)) + "\t")
270 |         deck_file.write(base_name + "\t")
271 |         deck_file.write("{}".format(item[0].rsplit(".")[0]))
272 | 
273 |     deck_file.close()
274 | 
275 |     print("\n\nDone extracting subtitles!")
--------------------------------------------------------------------------------
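The nearest-start-time matching used to pair lines with a second subtitle file can be sketched standalone. The following is a condensed re-implementation for illustration (helper names and timings are examples, not the script's API); it applies the same rule as find_closest_sub above, with the 1000 ms threshold the script uses:

```python
def to_ms(code):
    """Convert an "H:MM:SS.cc" (or comma-decimal) timecode to milliseconds."""
    h, m, s = code.replace(",", ".").split(":")
    return int(h) * 3600000 + int(m) * 60000 + int(float(s) * 1000)

def closest_sub(subs, timecode, max_diff_ms=1000):
    """Return the (start, length, text) entry whose start time is nearest
    to `timecode`, or None if even the nearest one is max_diff_ms or more away."""
    target = to_ms(timecode)
    best = min(subs, key=lambda sub: abs(to_ms(sub[0]) - target), default=None)
    if best is not None and abs(to_ms(best[0]) - target) < max_diff_ms:
        return best
    return None

subs = [
    ("0:00:01.00", "0:00:02.00", "Hello!"),
    ("0:00:10.00", "0:00:02.00", "Goodbye!"),
]
assert closest_sub(subs, "0:00:01.30")[2] == "Hello!"  # 300 ms off: matched
assert closest_sub(subs, "0:00:05.00") is None         # 4 s off: no match
```

This linear scan is O(n) per lookup, like the script's own loop; as the comment in find_closest_sub notes, that simplicity is fine at subtitle-file scale.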