├── LICENSE.md
├── README.md
└── subs_extract.py
/LICENSE.md:
--------------------------------------------------------------------------------
1 | Copyright (c) 2019 Nathan Vegdahl
2 |
3 | Permission is hereby granted, free of charge, to any person obtaining a copy of
4 | this software and associated documentation files (the "Software"), to deal in
5 | the Software without restriction, including without limitation the rights to
6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
7 | of the Software, and to permit persons to whom the Software is furnished to do
8 | so, subject to the following conditions:
9 |
10 | The above copyright notice and this permission notice shall be included in all
11 | copies or substantial portions of the Software.
12 |
13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
19 | SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Subs Extract
2 |
3 | A simple Python script that uses ffmpeg to extract subtitles and audio from a video. The purpose is to facilitate creating flash cards from foreign-language TV shows and movies.
4 |
5 | It takes as input a video file and a corresponding .ssa, .ass, .vtt, or .srt subtitle file, like this:
6 |
7 | ```
8 | subs_extract.py a_really_cool_show.mp4 a_really_cool_show.srt
9 | ```
10 |
11 | And it then creates a text file, an mp3 file, and an image thumbnail for each sentence in the subtitles, putting them in a new directory named after the video file. It also creates a deck file that can be imported into Anki, with the following fields per note:
12 |
13 | 1. The line's subtitle text.
14 | 2. The filename of the line's audio file, wrapped in an Anki audio tag.
15 | 3. A blank field (unless a second subtitle file was provided; see the next section of this readme).
16 | 4. The filename of the line's image thumbnail.
17 | 5. The name of the video file the line came from, without the file extension (this basically identifies the movie/episode, assuming your video files are named appropriately).
18 | 6. A timestamp, indicating where in the video file the line is.
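
For a concrete picture, each note in the generated deck file is a single tab-separated line containing those six fields. With hypothetical subtitle text and filenames (tabs shown here as `<TAB>`, and the exact markup of the image field illustrative), a note might look like:

```
Hello there!<TAB>[sound:a_really_cool_show -- 0_01_23-50.mp3]<TAB><TAB><img src="a_really_cool_show -- 0_01_23-50.jpg"><TAB>a_really_cool_show<TAB>0:01:23
```

The audio and image filenames are derived from the video's base name plus the subtitle's start time, so every note's media files sort together with the episode they came from.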
19 |
20 |
21 | ## Using a second subtitle file
22 |
23 | You can also pass a second subtitle file like so:
24 |
25 | ```
26 | subs_extract.py a_really_cool_show.mp4 a_really_cool_show.jp.srt a_really_cool_show.en.srt
27 | ```
28 |
29 | If you do this, Subs Extract will attempt to find a matching subtitle in the second file for every subtitle in the first, and include that as a translation. It does this based on the start times of the subtitles. It isn't perfect, and there will usually be a handful of weird matches for each video, but it generally does an okay job. Also, it sometimes simply won't include a match at all if the timing for a given subtitle is just too different.
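
The matching idea can be sketched as follows. This is a simplified illustration, not the script's exact code; the subtitle lists are shown as hypothetical `(start_ms, text)` pairs, and the 1000 ms cutoff is just an example value.

```python
def closest_match(start_ms, candidates, max_diff_ms=1000):
    """Return the (start_ms, text) candidate whose start time is nearest
    to start_ms, or None if every candidate is further away than
    max_diff_ms (or the candidate list is empty)."""
    best = min(candidates, key=lambda c: abs(c[0] - start_ms), default=None)
    if best is not None and abs(best[0] - start_ms) <= max_diff_ms:
        return best
    return None

# Hypothetical subtitle lists, as (start time in milliseconds, text) pairs.
first = [(1200, "line one"), (5400, "line two")]
second = [(1100, "translation one"), (9000, "translation two")]

# 1200 vs 1100 is only 100 ms apart, so it matches.
closest_match(first[0][0], second)   # (1100, "translation one")
# 5400 is at least 3600 ms from every candidate, so no match is included.
closest_match(first[1][0], second)   # None
```

Because matching is purely by start time, a retimed or restructured second subtitle track is exactly the situation that produces the occasional weird or missing match described above.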
30 |
31 | The generated deck file will also include the second subtitle like so:
32 |
33 | 1. The line's subtitle text.
34 | 2. The filename of the line's audio file, wrapped in an Anki audio tag.
35 | 3. **Matching subtitle from second subtitle file.**
36 | 4. The filename of the line's image thumbnail.
37 | 5. The name of the video file the line came from, without the file extension (this basically identifies the movie/episode, assuming your video files are named appropriately).
38 | 6. A timestamp, indicating where in the video file the line is.
39 |
--------------------------------------------------------------------------------
/subs_extract.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 |
3 | import sys
4 | import os
5 | import subprocess
6 | import re
7 |
8 |
9 | def timecode_to_milliseconds(code):
10 | """ Takes a time code and converts it into an integer of milliseconds.
11 | """
12 | elements = code.replace(",", ".").split(":")
13 | assert(len(elements) < 4)
14 |
15 | milliseconds = 0
16 | if len(elements) >= 1:
17 | milliseconds += int(float(elements[-1]) * 1000)
18 | if len(elements) >= 2:
19 | milliseconds += int(elements[-2]) * 60000
20 | if len(elements) >= 3:
21 | milliseconds += int(elements[-3]) * 3600000
22 |
23 | return milliseconds
24 |
25 |
26 | def milliseconds_to_timecode(milliseconds):
27 | """ Takes a time in milliseconds and converts it into a time code.
28 | """
29 | hours = milliseconds // 3600000
30 | milliseconds %= 3600000
31 | minutes = milliseconds // 60000
32 | milliseconds %= 60000
33 | seconds = milliseconds // 1000
34 | milliseconds %= 1000
35 | return "{}:{:02}:{:02}.{:02}".format(hours, minutes, seconds, milliseconds // 10)
36 |
37 |
38 | def parse_ass_file(path, padding=0):
39 | """ Parses an entire .ass file, extracting the dialogue.
40 |
41 | Returns a list of (start, length, dialogue) tuples, one for each
42 | subtitle found in the file. A padding of "padding" milliseconds is
43 | added to the start/end times.
44 | """
45 | subtitles = []
46 | with open(path) as f:
47 | # First find out the field order of the dialogue.
48 | found_start = False
49 | fields = {}
50 | for line in f:
51 | if not found_start:
52 | found_start = line.strip() == "[Events]"
53 | elif line.strip().startswith("Format:"):
54 | line = line[7:].strip()
55 | tmp = line.split(",")
56 | for i in range(len(tmp)):
57 | fields[tmp[i].strip().lower()] = i
58 | break
59 | if ("start" not in fields) or ("end" not in fields) or ("text" not in fields):
60 | raise Exception("'Start', 'End', or 'Text' field not found.")
61 |
62 | # Then parse the dialogue lines.
63 | start_times = set() # Used to prevent duplicates
64 | for line in f:
65 | if line.strip().startswith("Dialogue:"):
66 | elements = line[9:].strip().split(',', len(fields) - 1)
67 | start_element = elements[fields["start"]].strip()
68 | end_element = elements[fields["end"]].strip()
69 | text_element = elements[-1].strip()
70 |
71 | if (start_element not in start_times) and (text_element != ""):
72 |                     start_times.add(start_element)
73 | start = max(timecode_to_milliseconds(start_element) - padding, 0)
74 | end = timecode_to_milliseconds(end_element) + padding
75 | length = end - start
76 | subtitles += [(
77 | milliseconds_to_timecode(start),
78 | milliseconds_to_timecode(length),
79 | text_element,
80 | )]
81 | subtitles.sort()
82 | return subtitles
83 |
84 |
85 | def parse_vtt_file(path, padding=0):
86 | """ Parses an entire WebVTT/SRT file, extracting the dialogue.
87 |
88 | Returns a list of (start, length, dialogue) tuples, one for each
89 | subtitle found in the file. A padding of "padding" milliseconds is
90 | added to the start/end times.
91 | """
92 | subtitles = []
93 | with open(path) as f:
94 | start_times = set() # Used to prevent duplicates
95 | for line in f:
96 | if "-->" in line:
97 | # Get the timing.
98 | times = line.split("-->")
99 | start = max(timecode_to_milliseconds(times[0].strip()) - padding, 0)
100 | end = timecode_to_milliseconds(times[1].strip()) + padding
101 | length = end - start
102 |
103 | # Get the text.
104 | text = ""
105 | next_line = f.readline()
106 | while next_line.strip() != "":
107 | text += next_line
108 | next_line = f.readline()
109 | text = text.strip()
110 |
111 | # Process text to get rid of unnecessary tags.
112 |                 text = re.sub(r"</?ruby>", "", text)
113 |                 text = re.sub(r"<rt>.*?</rt>", "", text)
114 |                 text = re.sub(r"</?[^<>]+>", "", text)
115 |
116 | # Add to the subtitles list.
117 | if (start not in start_times) and (text != ""):
118 |                     start_times.add(start)
119 | subtitles += [(
120 | milliseconds_to_timecode(start),
121 | milliseconds_to_timecode(length),
122 | text,
123 | )]
124 | return subtitles
125 |
126 |
127 | def parse_subtitle_file(filepath, padding=0):
128 | """ Parses a subtitle file, attempting to automatically determine the
129 | file format for parsing.
130 | """
131 | if filepath.endswith(".ass") or filepath.endswith(".ssa"):
132 | return parse_ass_file(filepath, padding)
133 | elif filepath.endswith(".vtt") or filepath.endswith(".srt"):
134 | return parse_vtt_file(filepath, padding)
135 | else:
136 |         raise Exception("Unknown subtitle format. Supported formats are SSA, ASS, VTT, and SRT.")
137 |
138 |
139 | def find_closest_sub(subs_list, timecode, max_diff_milliseconds):
140 | """ Finds the sub in the given list with the start time closest to timecode.
141 |
142 | Will only return a matching sub if the start time difference is less
143 | than max_diff_milliseconds. Otherwise it will return None.
144 | """
145 | time_mil = timecode_to_milliseconds(timecode)
146 |
147 | # This certainly isn't the most efficient way to do this, but it's
148 | # dead-simple and does not appear to be a performance bottleneck at all.
149 | closest_so_far = -1
150 | closest_diff = max_diff_milliseconds
151 | for i in range(len(subs_list)):
152 | sub_start = timecode_to_milliseconds(subs_list[i][0])
153 | diff = abs(time_mil - sub_start)
154 | if diff < closest_diff:
155 | closest_so_far = i
156 | closest_diff = diff
157 |
158 | if closest_so_far >= 0:
159 | return subs_list[closest_so_far]
160 | else:
161 | return None
162 |
163 |
164 | if __name__ == "__main__":
165 | video_filename = sys.argv[1]
166 | subs_filename = sys.argv[2]
167 | second_subs_filename = None
168 | if len(sys.argv) >= 4:
169 | second_subs_filename = sys.argv[3]
170 | dir_name = video_filename.rsplit(".")[0]
171 | base_name = os.path.basename(dir_name)
172 |
173 | # Parse the subtitle files
174 | subtitles = parse_subtitle_file(subs_filename, 300)
175 | second_subs = parse_subtitle_file(second_subs_filename, 300) if second_subs_filename else None
176 |
177 | # Create the directory for the new files if it doesn't already exist.
178 | try:
179 | os.mkdir(dir_name)
180 | except FileExistsError:
181 | pass
182 | except Exception as e:
183 | raise e
184 |
185 | # Set up deck file
186 | deck_out_filepath = os.path.join(dir_name, "0_deck -- {}.txt".format(base_name))
187 | deck_file = open(deck_out_filepath, 'w')
188 |
189 | # Process the subtitles
190 | first_card = True
191 | for item in subtitles:
192 | print("\n\n========================================================")
193 | print("Extracting \"{}\"".format(item[2]))
194 | print("========================================================")
195 |
196 | # Generate base filename
197 | base_filename = os.path.join(dir_name, "{} -- {}".format(base_name, item[0].replace(":", "_").replace(".", "-")))
198 |
199 | # Find matching alt sub if any.
200 | if second_subs:
201 | alt_sub = find_closest_sub(second_subs, item[0], 1000)
202 | if alt_sub:
203 | alt_sub = alt_sub[2]
204 | else:
205 | alt_sub = ""
206 |
207 | # Write text file of subtitle
208 | subtitle_out_filepath = base_filename + ".txt"
209 | if not os.path.isfile(subtitle_out_filepath):
210 | with open(subtitle_out_filepath, 'w') as f:
211 | f.write(item[2])
212 | if second_subs:
213 | f.write("\n\n" + alt_sub)
214 |
215 | # Extract audio of subtitle into mp3 file
216 | audio_out_filepath = base_filename + ".mp3"
217 | if not os.path.isfile(audio_out_filepath):
218 | subprocess.Popen([
219 | "ffmpeg",
220 | "-n",
221 | "-vn",
222 | "-ss",
223 | item[0],
224 | "-i",
225 | video_filename,
226 | "-aq", "8",
227 | "-t",
228 | item[1],
229 | "-ar",
230 | "44100",
231 | "-ac",
232 | "1",
233 | audio_out_filepath,
234 | ]).wait()
235 |
236 | # Extract video frame of subtitle into jpg file
237 | image_out_filepath = base_filename + ".jpg"
238 | if not os.path.isfile(image_out_filepath):
239 | image_timecode = milliseconds_to_timecode(
240 | timecode_to_milliseconds(item[0])
241 | + (timecode_to_milliseconds(item[1]) // 2 )
242 | )
243 | subprocess.Popen([
244 | "ffmpeg",
245 | "-ss",
246 | image_timecode,
247 | "-i",
248 | video_filename,
249 | "-vf",
250 | "scale='min(480,iw)':'min(240,ih)':force_original_aspect_ratio=decrease",
251 | "-frames:v",
252 | "1",
253 | "-q:v",
254 | "6",
255 | "-y",
256 | image_out_filepath,
257 | ]).wait()
258 |
259 | # Add card to deck file
260 | if not first_card:
261 | deck_file.write("\n")
262 | first_card = False
263 | deck_file.write(item[2].replace("\t", " ").replace("\r\n", "").replace("\n", "") + "\t")
264 | deck_file.write("[sound:{}]".format(os.path.basename(audio_out_filepath)) + "\t")
265 | if second_subs:
266 | deck_file.write(alt_sub.replace("\t", " ").replace("\r\n", "").replace("\n", "") + "\t")
267 | else:
268 | deck_file.write("\t")
269 |         deck_file.write('<img src="{}">'.format(os.path.basename(image_out_filepath)) + "\t")
270 | deck_file.write(base_name + "\t")
271 | deck_file.write("{}".format(item[0].rsplit(".")[0]))
272 |
273 | deck_file.close()
274 |
275 | print("\n\nDone extracting subtitles!")
--------------------------------------------------------------------------------