├── LICENSE.md
├── README.md
└── subs_extract.py
/LICENSE.md:
--------------------------------------------------------------------------------
1 | Copyright (c) 2019 Nathan Vegdahl
2 |
3 | Permission is hereby granted, free of charge, to any person obtaining a copy of
4 | this software and associated documentation files (the "Software"), to deal in
5 | the Software without restriction, including without limitation the rights to
6 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
7 | of the Software, and to permit persons to whom the Software is furnished to do
8 | so, subject to the following conditions:
9 |
10 | The above copyright notice and this permission notice shall be included in all
11 | copies or substantial portions of the Software.
12 |
13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
19 | SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Subs Extract
2 |
3 | A simple Python script that uses ffmpeg to extract subtitles and audio from a video. The purpose is to facilitate creating flash cards from foreign-language TV shows and movies.
4 |
5 | It takes as input a video file and a corresponding .ssa, .ass, .vtt, or .srt subtitle file, like this:
6 |
7 | ```
8 | subs_extract.py a_really_cool_show.mp4 a_really_cool_show.srt
9 | ```
10 |
11 | And it then creates a text file, an mp3 file, and an image thumbnail for each sentence in the subtitles, putting them in a new directory named after the video file. It also creates a deck file that can be imported into Anki, with the following fields per note:
12 |
13 | 1. The line's subtitle text.
14 | 2. The filename of the line's audio file, wrapped in an Anki audio tag.
15 | 3. A blank field (unless a second subtitle file was provided; see the next section of this readme).
16 | 4. The filename of the line's image thumbnail.
17 | 5. The name of the video file the line came from, without the file extension (this basically identifies the movie/episode, assuming your video files are named appropriately).
18 | 6. A timestamp, indicating where in the video file the line is.
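
For a concrete picture, each note in the generated deck file is a single tab-separated line containing those six fields. With hypothetical subtitle text and filenames (tabs shown here as `<TAB>`, and the exact markup of the image field illustrative), a note might look like:

```
Hello there!<TAB>[sound:a_really_cool_show -- 0_01_23-50.mp3]<TAB><TAB><img src="a_really_cool_show -- 0_01_23-50.jpg"><TAB>a_really_cool_show<TAB>0:01:23
```

The audio and image filenames are derived from the video's base name plus the subtitle's start time, so every note's media files sort together with the episode they came from.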
19 |
20 |
21 | ## Using a second subtitle file
22 |
23 | You can also pass a second subtitle file like so:
24 |
25 | ```
26 | subs_extract.py a_really_cool_show.mp4 a_really_cool_show.jp.srt a_really_cool_show.en.srt
27 | ```
28 |
29 | If you do this, Subs Extract will attempt to find a matching subtitle in the second file for every subtitle in the first, and include that as a translation. It does this based on the start times of the subtitles. It isn't perfect, and there will usually be a handful of weird matches for each video, but it generally does an okay job. Also, it sometimes simply won't include a match at all if the timing for a given subtitle is just too different.
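
The matching idea can be sketched as follows. This is a simplified illustration, not the script's exact code; the subtitle lists are shown as hypothetical `(start_ms, text)` pairs, and the 1000 ms cutoff is just an example value.

```python
def closest_match(start_ms, candidates, max_diff_ms=1000):
    """Return the (start_ms, text) candidate whose start time is nearest
    to start_ms, or None if every candidate is further away than
    max_diff_ms (or the candidate list is empty)."""
    best = min(candidates, key=lambda c: abs(c[0] - start_ms), default=None)
    if best is not None and abs(best[0] - start_ms) <= max_diff_ms:
        return best
    return None

# Hypothetical subtitle lists, as (start time in milliseconds, text) pairs.
first = [(1200, "line one"), (5400, "line two")]
second = [(1100, "translation one"), (9000, "translation two")]

# 1200 vs 1100 is only 100 ms apart, so it matches.
closest_match(first[0][0], second)   # (1100, "translation one")
# 5400 is at least 3600 ms from every candidate, so no match is included.
closest_match(first[1][0], second)   # None
```

Because matching is purely by start time, a retimed or restructured second subtitle track is exactly the situation that produces the occasional weird or missing match described above.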
30 |
31 | The generated deck file will also include the second subtitle like so:
32 |
33 | 1. The line's subtitle text.
34 | 2. The filename of the line's audio file, wrapped in an Anki audio tag.
35 | 3. **Matching subtitle from second subtitle file.**
36 | 4. The filename of the line's image thumbnail.
37 | 5. The name of the video file the line came from, without the file extension (this basically identifies the movie/episode, assuming your video files are named appropriately).
38 | 6. A timestamp, indicating where in the video file the line is.
39 |
--------------------------------------------------------------------------------
/subs_extract.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 |
3 | import sys
4 | import os
5 | import subprocess
6 | import re
7 |
8 |
9 | def timecode_to_milliseconds(code):
10 | """ Takes a time code and converts it into an integer of milliseconds.
11 | """
12 | elements = code.replace(",", ".").split(":")
13 | assert(len(elements) < 4)
14 |
15 | milliseconds = 0
16 | if len(elements) >= 1:
17 | milliseconds += int(float(elements[-1]) * 1000)
18 | if len(elements) >= 2:
19 | milliseconds += int(elements[-2]) * 60000
20 | if len(elements) >= 3:
21 | milliseconds += int(elements[-3]) * 3600000
22 |
23 | return milliseconds
24 |
25 |
26 | def milliseconds_to_timecode(milliseconds):
27 | """ Takes a time in milliseconds and converts it into a time code.
28 | """
29 | hours = milliseconds // 3600000
30 | milliseconds %= 3600000
31 | minutes = milliseconds // 60000
32 | milliseconds %= 60000
33 | seconds = milliseconds // 1000
34 | milliseconds %= 1000
35 | return "{}:{:02}:{:02}.{:02}".format(hours, minutes, seconds, milliseconds // 10)
36 |
37 |
38 | def parse_ass_file(path, padding=0):
39 | """ Parses an entire .ass file, extracting the dialogue.
40 |
41 | Returns a list of (start, length, dialogue) tuples, one for each
42 | subtitle found in the file. A padding of "padding" milliseconds is
43 | added to the start/end times.
44 | """
45 | subtitles = []
46 | with open(path) as f:
47 | # First find out the field order of the dialogue.
48 | found_start = False
49 | fields = {}
50 | for line in f:
51 | if not found_start:
52 | found_start = line.strip() == "[Events]"
53 | elif line.strip().startswith("Format:"):
54 | line = line[7:].strip()
55 | tmp = line.split(",")
56 | for i in range(len(tmp)):
57 | fields[tmp[i].strip().lower()] = i
58 | break
59 | if ("start" not in fields) or ("end" not in fields) or ("text" not in fields):
60 | raise Exception("'Start', 'End', or 'Text' field not found.")
61 |
62 | # Then parse the dialogue lines.
63 | start_times = set() # Used to prevent duplicates
64 | for line in f:
65 | if line.strip().startswith("Dialogue:"):
66 | elements = line[9:].strip().split(',', len(fields) - 1)
67 | start_element = elements[fields["start"]].strip()
68 | end_element = elements[fields["end"]].strip()
69 | text_element = elements[-1].strip()
70 |
71 | if (start_element not in start_times) and (text_element != ""):
72 |                     start_times.add(start_element)
73 | start = max(timecode_to_milliseconds(start_element) - padding, 0)
74 | end = timecode_to_milliseconds(end_element) + padding
75 | length = end - start
76 | subtitles += [(
77 | milliseconds_to_timecode(start),
78 | milliseconds_to_timecode(length),
79 | text_element,
80 | )]
81 | subtitles.sort()
82 | return subtitles
83 |
84 |
85 | def parse_vtt_file(path, padding=0):
86 | """ Parses an entire WebVTT/SRT file, extracting the dialogue.
87 |
88 | Returns a list of (start, length, dialogue) tuples, one for each
89 | subtitle found in the file. A padding of "padding" milliseconds is
90 | added to the start/end times.
91 | """
92 | subtitles = []
93 | with open(path) as f:
94 | start_times = set() # Used to prevent duplicates
95 | for line in f:
96 | if "-->" in line:
97 | # Get the timing.
98 | times = line.split("-->")
99 | start = max(timecode_to_milliseconds(times[0].strip()) - padding, 0)
100 | end = timecode_to_milliseconds(times[1].strip()) + padding
101 | length = end - start
102 |
103 | # Get the text.
104 | text = ""
105 | next_line = f.readline()
106 | while next_line.strip() != "":
107 | text += next_line
108 | next_line = f.readline()
109 | text = text.strip()
110 |
111 | # Process text to get rid of unnecessary tags.
112 |                 text = re.sub(r"</?ruby>", "", text)
113 |                 text = re.sub(r"<rt>.*?</rt>", "", text)
114 |                 text = re.sub(r"</?[^<>]+>", "", text)
115 |
116 | # Add to the subtitles list.
117 | if (start not in start_times) and (text != ""):
118 |                     start_times.add(start)
119 | subtitles += [(
120 | milliseconds_to_timecode(start),
121 | milliseconds_to_timecode(length),
122 | text,
123 | )]
124 | return subtitles
125 |
126 |
127 | def parse_subtitle_file(filepath, padding=0):
128 | """ Parses a subtitle file, attempting to automatically determine the
129 | file format for parsing.
130 | """
131 | if filepath.endswith(".ass") or filepath.endswith(".ssa"):
132 | return parse_ass_file(filepath, padding)
133 | elif filepath.endswith(".vtt") or filepath.endswith(".srt"):
134 | return parse_vtt_file(filepath, padding)
135 | else:
136 |         raise Exception("Unknown subtitle format. Supported formats are SSA, ASS, VTT, and SRT.")
137 |
138 |
139 | def find_closest_sub(subs_list, timecode, max_diff_milliseconds):
140 | """ Finds the sub in the given list with the start time closest to timecode.
141 |
142 | Will only return a matching sub if the start time difference is less
143 | than max_diff_milliseconds. Otherwise it will return None.
144 | """
145 | time_mil = timecode_to_milliseconds(timecode)
146 |
147 | # This certainly isn't the most efficient way to do this, but it's
148 | # dead-simple and does not appear to be a performance bottleneck at all.
149 | closest_so_far = -1
150 | closest_diff = max_diff_milliseconds
151 | for i in range(len(subs_list)):
152 | sub_start = timecode_to_milliseconds(subs_list[i][0])
153 | diff = abs(time_mil - sub_start)
154 | if diff < closest_diff:
155 | closest_so_far = i
156 | closest_diff = diff
157 |
158 | if closest_so_far >= 0:
159 | return subs_list[closest_so_far]
160 | else:
161 | return None
162 |
163 |
164 | if __name__ == "__main__":
165 | video_filename = sys.argv[1]
166 | subs_filename = sys.argv[2]
167 | second_subs_filename = None
168 | if len(sys.argv) >= 4:
169 | second_subs_filename = sys.argv[3]
170 | dir_name = video_filename.rsplit(".")[0]
171 | base_name = os.path.basename(dir_name)
172 |
173 | # Parse the subtitle files
174 | subtitles = parse_subtitle_file(subs_filename, 300)
175 | second_subs = parse_subtitle_file(second_subs_filename, 300) if second_subs_filename else None
176 |
177 | # Create the directory for the new files if it doesn't already exist.
178 | try:
179 | os.mkdir(dir_name)
180 | except FileExistsError:
181 | pass
182 | except Exception as e:
183 | raise e
184 |
185 | # Set up deck file
186 | deck_out_filepath = os.path.join(dir_name, "0_deck -- {}.txt".format(base_name))
187 | deck_file = open(deck_out_filepath, 'w')
188 |
189 | # Process the subtitles
190 | first_card = True
191 | for item in subtitles:
192 | print("\n\n========================================================")
193 | print("Extracting \"{}\"".format(item[2]))
194 | print("========================================================")
195 |
196 | # Generate base filename
197 | base_filename = os.path.join(dir_name, "{} -- {}".format(base_name, item[0].replace(":", "_").replace(".", "-")))
198 |
199 | # Find matching alt sub if any.
200 | if second_subs:
201 | alt_sub = find_closest_sub(second_subs, item[0], 1000)
202 | if alt_sub:
203 | alt_sub = alt_sub[2]
204 | else:
205 | alt_sub = ""
206 |
207 | # Write text file of subtitle
208 | subtitle_out_filepath = base_filename + ".txt"
209 | if not os.path.isfile(subtitle_out_filepath):
210 | with open(subtitle_out_filepath, 'w') as f:
211 | f.write(item[2])
212 | if second_subs:
213 | f.write("\n\n" + alt_sub)
214 |
215 | # Extract audio of subtitle into mp3 file
216 | audio_out_filepath = base_filename + ".mp3"
217 | if not os.path.isfile(audio_out_filepath):
218 | subprocess.Popen([
219 | "ffmpeg",
220 | "-n",
221 | "-vn",
222 | "-ss",
223 | item[0],
224 | "-i",
225 | video_filename,
226 | "-aq", "8",
227 | "-t",
228 | item[1],
229 | "-ar",
230 | "44100",
231 | "-ac",
232 | "1",
233 | audio_out_filepath,
234 | ]).wait()
235 |
236 | # Extract video frame of subtitle into jpg file
237 | image_out_filepath = base_filename + ".jpg"
238 | if not os.path.isfile(image_out_filepath):
239 | image_timecode = milliseconds_to_timecode(
240 | timecode_to_milliseconds(item[0])
241 | + (timecode_to_milliseconds(item[1]) // 2 )
242 | )
243 | subprocess.Popen([
244 | "ffmpeg",
245 | "-ss",
246 | image_timecode,
247 | "-i",
248 | video_filename,
249 | "-vf",
250 | "scale='min(480,iw)':'min(240,ih)':force_original_aspect_ratio=decrease",
251 | "-frames:v",
252 | "1",
253 | "-q:v",
254 | "6",
255 | "-y",
256 | image_out_filepath,
257 | ]).wait()
258 |
259 | # Add card to deck file
260 | if not first_card:
261 | deck_file.write("\n")
262 | first_card = False
263 | deck_file.write(item[2].replace("\t", " ").replace("\r\n", "").replace("\n", "") + "\t")
264 | deck_file.write("[sound:{}]".format(os.path.basename(audio_out_filepath)) + "\t")
265 | if second_subs:
266 | deck_file.write(alt_sub.replace("\t", " ").replace("\r\n", "").replace("\n", "") + "\t")
267 | else:
268 | deck_file.write("\t")
269 |         deck_file.write('<img src="{}">'.format(os.path.basename(image_out_filepath)) + "\t")
270 | deck_file.write(base_name + "\t")
271 | deck_file.write("{}".format(item[0].rsplit(".")[0]))
272 |
273 | deck_file.close()
274 |
275 | print("\n\nDone extracting subtitles!")
--------------------------------------------------------------------------------