├── LICENSE
├── README.md
├── main.py
└── requirements.txt


/LICENSE:
--------------------------------------------------------------------------------
 1 | The MIT License (MIT)
 2 | 
 3 | Copyright (c) 2023 Alexey Sokolov
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | 
  2 | # Video Dubber
  3 | The program for automatic dubbing any video file for a lot of languages.
  4 | 
  5 | This Python script extracts the audio from a video file, transcribes it, 
  6 | translates it into a different language, generates a new audio file with the translated text, and then merges it with the original video.
  7 | 
  8 | ## Prerequisites
  9 | 
 10 | - Python 3.8 or higher
 11 | - [FFmpeg](https://ffmpeg.org/download.html)
 12 | 
 13 | ## Technologies Used
 14 | 
 15 | - [Google Cloud Text-to-Speech API](https://cloud.google.com/text-to-speech): Used to generate the audio for the translated text.
 16 | - [Google Cloud Translate API](https://cloud.google.com/translate): Used to translate the transcribed text into a different language.
 17 | - [Whisper ASR](https://www.openai.com/research/whisper/): Used to transcribe the audio from the video file.
 18 | - [Spacy](https://spacy.io/): Used for natural language processing tasks, such as tokenization and syllable counting.
 19 | - [PyDub](http://pydub.com/): Used for manipulating audio files.
 20 | - [MoviePy](https://zulko.github.io/moviepy/): Used for extracting the audio from the video file.
 21 | 
 22 | 
 23 | ## Installation
 24 | 
 25 | 1. Clone this repository:
 26 |    ```
 27 |    git clone https://github.com/am-sokolov/videodubber.git
 28 |    ```
 29 | 2. Install the required Python packages:
 30 |    ```
 31 |    pip install -r requirements.txt
 32 |    ```
 33 | 
 34 | ## Google Cloud Credentials
 35 | 
 36 | This script uses Google Cloud's Text-to-Speech and Translate APIs, which require authentication. Follow these steps to get your credentials:
 37 | 
 38 | 1. Create a new project in the [Google Cloud Console](https://console.cloud.google.com/).
 39 | 2. Enable the [Text-to-Speech](https://cloud.google.com/text-to-speech/docs/quickstart-client-libraries) and [Translate](https://cloud.google.com/translate/docs/setup) APIs for your project.
 40 | 3. Create a new service account for your project in the [Service Accounts](https://console.cloud.google.com/iam-admin/serviceaccounts) page.
 41 | 4. Create a new JSON key for your service account, and download it. This is your credentials file.
 42 | 
 43 | ## Usage
 44 | 
 45 | Run the script with the following command:
 46 | 
 47 | ```
 48 | python main.py --input <path_to_video_file> --voice <target_voice> --credentials <path_to_credentials_file> --source_language <source_language>
 49 | ```
 50 | 
 51 | - `<path_to_video_file>`: Path to the source video file
 52 | - `<target_voice>`: Target dubbing voice name from [Google Cloud Text-to-Speech Voices](https://cloud.google.com/text-to-speech/docs/voices). Default is "es-US-Neural2-B". Recommended voices are:
 53 | 
 54 |     - English: "en-US-Neural2-J"
 55 |     - Spanish: "es-US-Neural2-B"
 56 |     - German: "de-DE-Neural2-D"
 57 |     - Italian: "it-IT-Neural2-C"
 58 |     - French: "fr-FR-Neural2-D"
 59 |     - Russian: "ru-RU-Wavenet-D"
 60 |     - Hindi: "hi-IN-Neural2-B"  
 61 | But you feel free to use any other voice.
 62 | 
 63 | - `<path_to_credentials_file>`: Path to the Google Cloud credentials JSON file
 64 | - `<source_language>`: Source language, e.g. "english". 
 65 | 
 66 | Now, the fully supported source languages are: English, German, French, Italian, Catalan, Chinese, Croatian, Danish, Dutch, Finnish, Greek, Japanese, Korean, Lithuanian, Macedonian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Ukrainian.
 67 | 
 68 | ## Output
 69 | 
 70 | The script will create a new video file with the same name as the input video file, but with "_translated" appended to the name. The new video file will have the original video with the new translated audio track.
 71 | Additionaly, the script will create a new `.wav` audio track with the same name as the input video file contains translation only.
 72 | 
 73 | ## Testing
 74 | 
 75 | You can test this script with any video that contains narration. For example, you can use this [free video of US President Donald Trump speaking at the Young Black Leadership Summit at the White House](https://www.videvo.net/video/us-president-donald-trump-speaks-to-african-americans-young-black-leadership-summit-at-the-white-house-8/613121/). 
 76 | 
 77 | Here are the step-by-step instructions for testing:
 78 | 
 79 | 1. Download the video from the link above. 
 80 | 
 81 | 2. Save the video file in the same directory as the script under the name `trump_speech.mp4`.
 82 | 
 83 | 3. Run the script with the downloaded video file as the input. For example, if you saved the video as `trump_speech.mp4`, you would run:
 84 | 
 85 |    ```
 86 |    python main.py trump_speech.mp4  de-DE-Neural2-B path_to_credentials.json  english
 87 |    ```
 88 | 
 89 |    Replace `path_to_credentials.json` with the path to your Google Cloud credentials JSON file.
 90 | 
 91 | 4. The script will create a new `.wav` audio file named `trump_speech.wav` in the same directory. This file contains the translated audio.
 92 | 
 93 | 5. Listen to the `trump_speech.wav` file to verify that the script worked correctly. The audio should be a translation of the original speech in the video.
 94 | 
 95 | Feel free to replace `de-DE-Neural2-B` with the desired target voice.
 96 | 
 97 | ## License
 98 | 
 99 | Alexey Sokolov (c). This project is licensed under the terms of the MIT license included in this repository.
100 | 


--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
  1 | import argparse
  2 | import os
  3 | import uuid
  4 | from pydub import AudioSegment
  5 | from moviepy.editor import VideoFileClip, AudioFileClip
  6 | from google.cloud import texttospeech
  7 | from google.cloud import translate_v2 as translate
  8 | import whisper
  9 | import spacy
 10 | from spacy_syllables import SpacySyllables
 11 | from tqdm import tqdm
 12 | import tempfile
 13 | import re
 14 | 
 15 | spacy_models = {
 16 |     "english": "en_core_web_sm",
 17 |     "german": "de_core_news_sm",
 18 |     "french": "fr_core_news_sm",
 19 |     "italian": "it_core_news_sm",
 20 |     "catalan": "ca_core_news_sm",
 21 |     "chinese": "zh_core_web_sm",
 22 |     "croatian": "hr_core_news_sm",
 23 |     "danish": "da_core_news_sm",
 24 |     "dutch": "nl_core_news_sm",
 25 |     "finnish": "fi_core_news_sm",
 26 |     "greek": "el_core_news_sm",
 27 |     "japanese": "ja_core_news_sm",
 28 |     "korean": "ko_core_news_sm",
 29 |     "lithuanian": "lt_core_news_sm",
 30 |     "macedonian": "mk_core_news_sm",
 31 |     "polish": "pl_core_news_sm",
 32 |     "portuguese": "pt_core_news_sm",
 33 |     "romanian": "ro_core_news_sm",
 34 |     "russian": "ru_core_news_sm",
 35 |     "spanish": "es_core_news_sm",
 36 |     "swedish": "sv_core_news_sm",
 37 |     "ukrainian": "uk_core_news_sm"
 38 | }
 39 | 
 40 | 
 41 | def extract_audio_from_video(video_file):
 42 |     try:
 43 |         print("Extracting audio track")
 44 |         video = VideoFileClip(video_file)
 45 |         audio = video.audio
 46 |         audio_file = os.path.splitext(video_file)[0] + ".wav"
 47 |         audio.write_audiofile(audio_file)
 48 |         return audio_file
 49 |     except Exception as e:
 50 |         print(f"Error extracting audio from video: {e}")
 51 |         return None
 52 | 
 53 | 
 54 | def transcribe_audio(audio_file, source_language):
 55 |     try:
 56 |         print("Transcribing audio track")
 57 |         model = whisper.load_model("large")
 58 |         trans = model.transcribe(audio_file, language=source_language, verbose=False, word_timestamps=True)
 59 |         return trans
 60 |     except Exception as e:
 61 |         print(f"Error transcribing audio: {e}")
 62 |         return None
 63 | 
 64 | 
 65 | def translate_text(texts, target_language):
 66 |     try:
 67 |         translate_client = translate.Client()
 68 |         results = translate_client.translate(texts, target_language=target_language)
 69 |         return [result['translatedText'] for result in results]
 70 |     except Exception as e:
 71 |         print(f"Error translating texts: {e}")
 72 |         return None
 73 | 
 74 | 
 75 | def create_audio_from_text(text, target_language, target_voice):
 76 |     audio_file = "translated_" + str(uuid.uuid4()) + ".wav"
 77 |     try:
 78 |         client = texttospeech.TextToSpeechClient()
 79 |         input_text = texttospeech.SynthesisInput(text=text)
 80 |         voice = texttospeech.VoiceSelectionParams(
 81 |             language_code=target_language,
 82 |             name=target_voice
 83 |         )
 84 |         audio_config = texttospeech.AudioConfig(
 85 |             audio_encoding=texttospeech.AudioEncoding.LINEAR16, speaking_rate=1.1
 86 |         )
 87 |         response = client.synthesize_speech(
 88 |             request={"input": input_text, "voice": voice, "audio_config": audio_config}
 89 |         )
 90 |         with open(audio_file, "wb") as out:
 91 |             out.write(response.audio_content)
 92 |         return audio_file
 93 |     except Exception as e:
 94 |         if os.path.isfile(audio_file):
 95 |             os.remove(audio_file)
 96 |         raise Exception(f"Error creating audio from text: {e}")
 97 | 
 98 | 
 99 | ABBREVIATIONS = {
100 |     "Mr.": "Mister",
101 |     "Mrs.": "Misses",
102 |     "No.": "Number",
103 |     "Dr.": "Doctor",
104 |     "Ms.": "Miss",
105 |     "Ave.": "Avenue",
106 |     "Blvd.": "Boulevard",
107 |     "Ln.": "Lane",
108 |     "Rd.": "Road",
109 |     "a.m.": "before noon",
110 |     "p.m.": "after noon",
111 |     "ft.": "feet",
112 |     "hr.": "hour",
113 |     "min.": "minute",
114 |     "sq.": "square",
115 |     "St.": "street",
116 |     "Asst.": "assistant",
117 |     "Corp.": "corporation"
118 | }
119 | 
120 | ISWORD = re.compile(r'.*\w.*')
121 | 
122 | def merge_audio_files(transcription, source_language, target_language, target_voice, audio_file):
123 |     temp_files = []
124 |     try:
125 |         ducked_audio = AudioSegment.from_wav(audio_file)
126 |         if spacy_models[source_language] not in spacy.util.get_installed_models():
127 |             spacy.cli.download(spacy_models[source_language])
128 |         nlp = spacy.load(spacy_models[source_language])
129 |         nlp.add_pipe("syllables", after="tagger")
130 |         merged_audio = AudioSegment.silent(duration=0)
131 |         sentences = []
132 |         sentence_starts = []
133 |         sentence_ends = []
134 |         sentence = ""
135 |         sent_start = 0
136 |         print("Composing sentences")
137 |         for segment in tqdm(transcription["segments"]):
138 |             if segment["text"].isupper():
139 |                 continue
140 |             for i, word in enumerate(segment["words"]):
141 |                 if not ISWORD.search(word["word"]):
142 |                     continue
143 |                 word["word"] = ABBREVIATIONS.get(word["word"].strip(), word["word"])
144 |                 if word["word"].startswith("-"):
145 |                     sentence = sentence[:-1] + word["word"] + " "
146 |                 else:
147 |                     sentence += word["word"] + " "
148 |                 # this is a trick to compensate the absense of VAD in Whisper
149 |                 word_syllables = sum(token._.syllables_count for token in nlp(word["word"]) if token._.syllables_count)
150 |                 segment_syllables = sum(token._.syllables_count for token in nlp(segment["text"]) if token._.syllables_count)
151 |                 if i == 0 or sent_start == 0:
152 |                     word_speed = word_syllables / (word["end"] - word["start"])
153 |                     if word_speed < 3:
154 |                         sent_start = word["end"] - word_syllables / 3
155 |                     else:
156 |                         sent_start = word["start"]
157 |                 if i == len(segment["words"]) - 1:  # last word in segment
158 |                     word_speed = word_syllables / (word["end"] - word["start"])
159 |                     segment_speed = segment_syllables / (segment["end"] - segment["start"])
160 |                     if word_speed < 1.0 or segment_speed < 2.0:
161 |                         word["word"] += "."
162 | 
163 |                 if word["word"].endswith("."):
164 |                     sentences.append(sentence)
165 |                     sentence_starts.append(sent_start)
166 |                     sentence_ends.append(word["end"])
167 |                     sent_start = 0
168 |                     sentence = ""
169 |         # translate sentences in chunks of 128
170 |         print("Translating sentences")
171 |         translated_texts = []
172 |         for i in tqdm(range(0, len(sentences), 128)):
173 |             chunk = sentences[i:i + 128]
174 |             translated_chunk = translate_text(chunk, target_language)
175 |             if translated_chunk is None:
176 |                 raise Exception("Translation failed")
177 |             translated_texts.extend(translated_chunk)
178 |         print("Creating translated audio track")
179 |         prev_end_time = 0
180 |         for i, translated_text in enumerate(tqdm(translated_texts)):
181 |             translated_audio_file = create_audio_from_text(translated_text, target_language, target_voice)
182 |             if translated_audio_file is None:
183 |                 raise Exception("Audio creation failed")
184 |             temp_files.append(translated_audio_file)
185 |             translated_audio = AudioSegment.from_wav(translated_audio_file)
186 | 
187 |             # Apply "ducking" effect: reduce volume of original audio during translated sentence
188 |             start_time = int(sentence_starts[i] * 1000)
189 |             end_time = start_time + len(translated_audio)
190 |             next_start_time = int(sentence_starts[i+1] * 1000) if i < len(translated_texts) - 1 else len(ducked_audio)
191 |             ducked_segment = ducked_audio[start_time:end_time].apply_gain(-10)  # adjust volume reduction as needed
192 | 
193 |             fade_out_duration = min(500, max(1, start_time - prev_end_time))
194 |             fade_in_duration = min(500, max(1, next_start_time  - end_time))
195 |             prev_end_time = end_time
196 |             # Apply fade in effect to the end of the audio before the ducked segment
197 |             if start_time == 0:
198 |                 ducked_audio = ducked_segment +  ducked_audio[end_time:].fade_in(fade_in_duration)
199 |             elif end_time == len(ducked_audio):
200 |                 ducked_audio = ducked_audio[:start_time].fade_out(fade_out_duration) + ducked_segment
201 |             else:
202 |                 ducked_audio = ducked_audio[:start_time].fade_out(fade_out_duration) \
203 |                                + ducked_segment +  ducked_audio[end_time:].fade_in(fade_in_duration)
204 | 
205 |             # Overlay the translated audio on top of the original audio
206 |             ducked_audio = ducked_audio.overlay(translated_audio, position=start_time)
207 | 
208 |             original_duration = int(sentence_ends[i] * 1000)
209 |             new_duration = len(translated_audio) + len(merged_audio)
210 |             padding_duration = max(0, original_duration - new_duration)
211 |             padding = AudioSegment.silent(duration=padding_duration)
212 |             merged_audio += padding + translated_audio
213 |         return merged_audio, ducked_audio
214 |     except Exception as e:
215 |         print(f"Error merging audio files: {e}")
216 |         return None
217 |     finally:
218 |         # cleanup: remove all temporary files
219 |         for file in temp_files:
220 |             try:
221 |                 os.remove(file)
222 |             except Exception as e:
223 |                 print(f"Error removing temporary file {file}: {e}")
224 | 
225 | 
226 | def save_audio_to_file(audio, filename):
227 |     try:
228 |         audio.export(filename, format="wav")
229 |         print(f"Audio track with translation only saved to {filename}")
230 |     except Exception as e:
231 |         print(f"Error saving audio to file: {e}")
232 | 
233 | 
234 | 
235 | def replace_audio_in_video(video_file, new_audio):
236 |     try:
237 |         # Load the video
238 |         video = VideoFileClip(video_file)
239 | 
240 |         # Save the new audio to a temporary file
241 |         with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as temp_audio_file:
242 |             new_audio.export(temp_audio_file.name, format="wav")
243 |         new_audio.export("duckled.wav", format="wav")
244 | 
245 |         # Load the new audio into an AudioFileClip
246 |         try:
247 |             new_audio_clip = AudioFileClip(temp_audio_file.name)
248 |         except Exception as e:
249 |             print(f"Error loading new audio into an AudioFileClip: {e}")
250 |             return
251 | 
252 |         # Check if the audio is compatible with the video
253 |         if new_audio_clip.duration < video.duration:
254 |             print("Warning: The new audio is shorter than the video. The remaining video will have no sound.")
255 |         elif new_audio_clip.duration > video.duration:
256 |             print("Warning: The new audio is longer than the video. The extra audio will be cut off.")
257 |             new_audio_clip = new_audio_clip.subclip(0, video.duration)
258 | 
259 |         # Set the audio of the video to the new audio
260 |         video = video.set_audio(new_audio_clip)
261 | 
262 |         # Write the result to a new video file
263 |         output_filename = os.path.splitext(video_file)[0] + "_translated.mp4"
264 |         try:
265 |             video.write_videofile(output_filename, audio_codec='aac')
266 |         except Exception as e:
267 |             print(f"Error writing the new video file: {e}")
268 |             return
269 | 
270 |         print(f"Translated video saved as {output_filename}")
271 | 
272 |     except Exception as e:
273 |         print(f"Error replacing audio in video: {e}")
274 |     finally:
275 |         # Remove the temporary audio file
276 |         if os.path.isfile(temp_audio_file.name):
277 |             os.remove(temp_audio_file.name)
278 | 
279 | 
280 | def main():
281 |     parser = argparse.ArgumentParser()
282 |     parser.add_argument('--input', type=str, help='Path to the source video file', required=True)
283 |     parser.add_argument('--voice', type=str, default="es-US-Neural2-B",
284 |                         help=f'Target dubbing voice name from https://cloud.google.com/text-to-speech/docs/voices')
285 |     parser.add_argument('--credentials', type=str, help='Path to the Google Cloud credentials JSON file', required=True)
286 |     parser.add_argument('--source_language', type=str, help=f'Source language, e.g. english. Now the following languages are supported:'
287 |                                                             f' {list(spacy_models.keys())}', default="english")
288 |     args = parser.parse_args()
289 | 
290 |     # Set the GOOGLE_APPLICATION_CREDENTIALS environment variable
291 |     os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = args.credentials
292 | 
293 |     audio_file = extract_audio_from_video(args.input)
294 |     if audio_file is None:
295 |         return
296 | 
297 |     transcription = transcribe_audio(audio_file, args.source_language.lower())
298 |     if transcription is None:
299 |         return
300 | 
301 |     merged_audio, ducked_audio = merge_audio_files(transcription, args.source_language.lower(), args.voice[:5], args.voice, audio_file)
302 |     if merged_audio is None:
303 |         return
304 |     replace_audio_in_video(args.input, ducked_audio)
305 |     # Save the audio file with the same name as the video file but with a ".wav" extension
306 |     output_filename = os.path.splitext(args.input)[0] + ".wav"
307 |     save_audio_to_file(merged_audio, output_filename)
308 | 
309 | 
310 | if __name__ == "__main__":
311 |     main()
312 | 


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
 1 | blis==0.7.9
 2 | cachetools==5.3.1
 3 | catalogue==2.0.8
 4 | certifi==2023.5.7
 5 | charset-normalizer==3.1.0
 6 | click==8.1.3
 7 | cmake==3.26.3
 8 | confection==0.0.4
 9 | cymem==2.0.7
10 | decorator==4.4.2
11 | ffmpeg-python==0.2.0
12 | filelock==3.12.0
13 | future==0.18.3
14 | google-api-core==2.11.0
15 | google-auth==2.19.1
16 | google-cloud-core==2.3.2
17 | google-cloud-speech==2.20.0
18 | google-cloud-texttospeech==2.14.1
19 | google-cloud-translate==3.11.1
20 | googleapis-common-protos==1.59.0
21 | grpcio==1.54.2
22 | grpcio-status==1.54.2
23 | idna==3.4
24 | imageio==2.30.0
25 | imageio-ffmpeg==0.4.8
26 | Jinja2==3.1.2
27 | langcodes==3.3.0
28 | lit==16.0.5.post0
29 | llvmlite==0.39.1
30 | MarkupSafe==2.1.3
31 | more-itertools==9.1.0
32 | moviepy==1.0.3
33 | mpmath==1.3.0
34 | murmurhash==1.0.9
35 | networkx==3.1
36 | nvidia-cublas-cu11==11.10.3.66
37 | nvidia-cuda-cupti-cu11==11.7.101
38 | nvidia-cuda-nvrtc-cu11==11.7.99
39 | nvidia-cuda-runtime-cu11==11.7.99
40 | nvidia-cudnn-cu11==8.5.0.96
41 | nvidia-cufft-cu11==10.9.0.58
42 | nvidia-curand-cu11==10.2.10.91
43 | nvidia-cusolver-cu11==11.4.0.1
44 | nvidia-cusparse-cu11==11.7.4.91
45 | nvidia-nccl-cu11==2.14.3
46 | nvidia-nvtx-cu11==11.7.91
47 | openai-whisper==20230314
48 | packaging==23.1
49 | pathy==0.10.1
50 | Pillow==9.5.0
51 | preshed==3.0.8
52 | proglog==0.1.10
53 | proto-plus==1.22.2
54 | protobuf==4.23.2
55 | pyasn1==0.5.0
56 | pyasn1-modules==0.3.0
57 | pydantic==1.10.8
58 | pydub==0.25.1
59 | pyphen==0.14.0
60 | regex==2023.6.3
61 | requests==2.31.0
62 | rsa==4.9
63 | six==1.16.0
64 | smart-open==6.3.0
65 | spacy==3.5.3
66 | spacy-legacy==3.0.12
67 | spacy-loggers==1.0.4
68 | spacy-syllables==3.0.2
69 | srsly==2.4.6
70 | sympy==1.12
71 | thinc==8.1.10
72 | tiktoken==0.3.1
73 | torch==2.0.1
74 | tqdm==4.65.0
75 | triton==2.0.0
76 | typer==0.7.0
77 | typing_extensions==4.6.3
78 | urllib3==1.26.16
79 | wasabi==1.1.1
80 | numba==0.56.4
81 | numpy==1.23.5
82 | 


--------------------------------------------------------------------------------