├── LICENSE ├── README.md ├── main.py └── requirements.txt /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2023 Alexey Sokolov 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | # Video Dubber 3 | The program for automatic dubbing any video file for a lot of languages. 4 | 5 | This Python script extracts the audio from a video file, transcribes it, 6 | translates it into a different language, generates a new audio file with the translated text, and then merges it with the original video. 7 | 8 | ## Prerequisites 9 | 10 | - Python 3.8 or higher 11 | - [FFmpeg](https://ffmpeg.org/download.html) 12 | 13 | ## Technologies Used 14 | 15 | - [Google Cloud Text-to-Speech API](https://cloud.google.com/text-to-speech): Used to generate the audio for the translated text. 16 | - [Google Cloud Translate API](https://cloud.google.com/translate): Used to translate the transcribed text into a different language. 17 | - [Whisper ASR](https://www.openai.com/research/whisper/): Used to transcribe the audio from the video file. 18 | - [Spacy](https://spacy.io/): Used for natural language processing tasks, such as tokenization and syllable counting. 19 | - [PyDub](http://pydub.com/): Used for manipulating audio files. 20 | - [MoviePy](https://zulko.github.io/moviepy/): Used for extracting the audio from the video file. 21 | 22 | 23 | ## Installation 24 | 25 | 1. Clone this repository: 26 | ``` 27 | git clone https://github.com/am-sokolov/videodubber.git 28 | ``` 29 | 2. Install the required Python packages: 30 | ``` 31 | pip install -r requirements.txt 32 | ``` 33 | 34 | ## Google Cloud Credentials 35 | 36 | This script uses Google Cloud's Text-to-Speech and Translate APIs, which require authentication. Follow these steps to get your credentials: 37 | 38 | 1. Create a new project in the [Google Cloud Console](https://console.cloud.google.com/). 39 | 2. Enable the [Text-to-Speech](https://cloud.google.com/text-to-speech/docs/quickstart-client-libraries) and [Translate](https://cloud.google.com/translate/docs/setup) APIs for your project. 40 | 3. Create a new service account for your project in the [Service Accounts](https://console.cloud.google.com/iam-admin/serviceaccounts) page. 41 | 4. Create a new JSON key for your service account, and download it. This is your credentials file. 42 | 43 | ## Usage 44 | 45 | Run the script with the following command: 46 | 47 | ``` 48 | python main.py --input --voice --credentials --source_language 49 | ``` 50 | 51 | - ``: Path to the source video file 52 | - ``: Target dubbing voice name from [Google Cloud Text-to-Speech Voices](https://cloud.google.com/text-to-speech/docs/voices). Default is "es-US-Neural2-B". Recommended voices are: 53 | 54 | - English: "en-US-Neural2-J" 55 | - Spanish: "es-US-Neural2-B" 56 | - German: "de-DE-Neural2-D" 57 | - Italian: "it-IT-Neural2-C" 58 | - French: "fr-FR-Neural2-D" 59 | - Russian: "ru-RU-Wavenet-D" 60 | - Hindi: "hi-IN-Neural2-B" 61 | But you feel free to use any other voice. 62 | 63 | - ``: Path to the Google Cloud credentials JSON file 64 | - ``: Source language, e.g. "english". 65 | 66 | Now, the fully supported source languages are: English, German, French, Italian, Catalan, Chinese, Croatian, Danish, Dutch, Finnish, Greek, Japanese, Korean, Lithuanian, Macedonian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Ukrainian. 67 | 68 | ## Output 69 | 70 | The script will create a new video file with the same name as the input video file, but with "_translated" appended to the name. The new video file will have the original video with the new translated audio track. 71 | Additionaly, the script will create a new `.wav` audio track with the same name as the input video file contains translation only. 72 | 73 | ## Testing 74 | 75 | You can test this script with any video that contains narration. For example, you can use this [free video of US President Donald Trump speaking at the Young Black Leadership Summit at the White House](https://www.videvo.net/video/us-president-donald-trump-speaks-to-african-americans-young-black-leadership-summit-at-the-white-house-8/613121/). 76 | 77 | Here are the step-by-step instructions for testing: 78 | 79 | 1. Download the video from the link above. 80 | 81 | 2. Save the video file in the same directory as the script under the name `trump_speech.mp4`. 82 | 83 | 3. Run the script with the downloaded video file as the input. For example, if you saved the video as `trump_speech.mp4`, you would run: 84 | 85 | ``` 86 | python main.py trump_speech.mp4 de-DE-Neural2-B path_to_credentials.json english 87 | ``` 88 | 89 | Replace `path_to_credentials.json` with the path to your Google Cloud credentials JSON file. 90 | 91 | 4. The script will create a new `.wav` audio file named `trump_speech.wav` in the same directory. This file contains the translated audio. 92 | 93 | 5. Listen to the `trump_speech.wav` file to verify that the script worked correctly. The audio should be a translation of the original speech in the video. 94 | 95 | Feel free to replace `de-DE-Neural2-B` with the desired target voice. 96 | 97 | ## License 98 | 99 | Alexey Sokolov (c). This project is licensed under the terms of the MIT license included in this repository. 100 | -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | import uuid 4 | from pydub import AudioSegment 5 | from moviepy.editor import VideoFileClip, AudioFileClip 6 | from google.cloud import texttospeech 7 | from google.cloud import translate_v2 as translate 8 | import whisper 9 | import spacy 10 | from spacy_syllables import SpacySyllables 11 | from tqdm import tqdm 12 | import tempfile 13 | import re 14 | 15 | spacy_models = { 16 | "english": "en_core_web_sm", 17 | "german": "de_core_news_sm", 18 | "french": "fr_core_news_sm", 19 | "italian": "it_core_news_sm", 20 | "catalan": "ca_core_news_sm", 21 | "chinese": "zh_core_web_sm", 22 | "croatian": "hr_core_news_sm", 23 | "danish": "da_core_news_sm", 24 | "dutch": "nl_core_news_sm", 25 | "finnish": "fi_core_news_sm", 26 | "greek": "el_core_news_sm", 27 | "japanese": "ja_core_news_sm", 28 | "korean": "ko_core_news_sm", 29 | "lithuanian": "lt_core_news_sm", 30 | "macedonian": "mk_core_news_sm", 31 | "polish": "pl_core_news_sm", 32 | "portuguese": "pt_core_news_sm", 33 | "romanian": "ro_core_news_sm", 34 | "russian": "ru_core_news_sm", 35 | "spanish": "es_core_news_sm", 36 | "swedish": "sv_core_news_sm", 37 | "ukrainian": "uk_core_news_sm" 38 | } 39 | 40 | 41 | def extract_audio_from_video(video_file): 42 | try: 43 | print("Extracting audio track") 44 | video = VideoFileClip(video_file) 45 | audio = video.audio 46 | audio_file = os.path.splitext(video_file)[0] + ".wav" 47 | audio.write_audiofile(audio_file) 48 | return audio_file 49 | except Exception as e: 50 | print(f"Error extracting audio from video: {e}") 51 | return None 52 | 53 | 54 | def transcribe_audio(audio_file, source_language): 55 | try: 56 | print("Transcribing audio track") 57 | model = whisper.load_model("large") 58 | trans = model.transcribe(audio_file, language=source_language, verbose=False, word_timestamps=True) 59 | return trans 60 | except Exception as e: 61 | print(f"Error transcribing audio: {e}") 62 | return None 63 | 64 | 65 | def translate_text(texts, target_language): 66 | try: 67 | translate_client = translate.Client() 68 | results = translate_client.translate(texts, target_language=target_language) 69 | return [result['translatedText'] for result in results] 70 | except Exception as e: 71 | print(f"Error translating texts: {e}") 72 | return None 73 | 74 | 75 | def create_audio_from_text(text, target_language, target_voice): 76 | audio_file = "translated_" + str(uuid.uuid4()) + ".wav" 77 | try: 78 | client = texttospeech.TextToSpeechClient() 79 | input_text = texttospeech.SynthesisInput(text=text) 80 | voice = texttospeech.VoiceSelectionParams( 81 | language_code=target_language, 82 | name=target_voice 83 | ) 84 | audio_config = texttospeech.AudioConfig( 85 | audio_encoding=texttospeech.AudioEncoding.LINEAR16, speaking_rate=1.1 86 | ) 87 | response = client.synthesize_speech( 88 | request={"input": input_text, "voice": voice, "audio_config": audio_config} 89 | ) 90 | with open(audio_file, "wb") as out: 91 | out.write(response.audio_content) 92 | return audio_file 93 | except Exception as e: 94 | if os.path.isfile(audio_file): 95 | os.remove(audio_file) 96 | raise Exception(f"Error creating audio from text: {e}") 97 | 98 | 99 | ABBREVIATIONS = { 100 | "Mr.": "Mister", 101 | "Mrs.": "Misses", 102 | "No.": "Number", 103 | "Dr.": "Doctor", 104 | "Ms.": "Miss", 105 | "Ave.": "Avenue", 106 | "Blvd.": "Boulevard", 107 | "Ln.": "Lane", 108 | "Rd.": "Road", 109 | "a.m.": "before noon", 110 | "p.m.": "after noon", 111 | "ft.": "feet", 112 | "hr.": "hour", 113 | "min.": "minute", 114 | "sq.": "square", 115 | "St.": "street", 116 | "Asst.": "assistant", 117 | "Corp.": "corporation" 118 | } 119 | 120 | ISWORD = re.compile(r'.*\w.*') 121 | 122 | def merge_audio_files(transcription, source_language, target_language, target_voice, audio_file): 123 | temp_files = [] 124 | try: 125 | ducked_audio = AudioSegment.from_wav(audio_file) 126 | if spacy_models[source_language] not in spacy.util.get_installed_models(): 127 | spacy.cli.download(spacy_models[source_language]) 128 | nlp = spacy.load(spacy_models[source_language]) 129 | nlp.add_pipe("syllables", after="tagger") 130 | merged_audio = AudioSegment.silent(duration=0) 131 | sentences = [] 132 | sentence_starts = [] 133 | sentence_ends = [] 134 | sentence = "" 135 | sent_start = 0 136 | print("Composing sentences") 137 | for segment in tqdm(transcription["segments"]): 138 | if segment["text"].isupper(): 139 | continue 140 | for i, word in enumerate(segment["words"]): 141 | if not ISWORD.search(word["word"]): 142 | continue 143 | word["word"] = ABBREVIATIONS.get(word["word"].strip(), word["word"]) 144 | if word["word"].startswith("-"): 145 | sentence = sentence[:-1] + word["word"] + " " 146 | else: 147 | sentence += word["word"] + " " 148 | # this is a trick to compensate the absense of VAD in Whisper 149 | word_syllables = sum(token._.syllables_count for token in nlp(word["word"]) if token._.syllables_count) 150 | segment_syllables = sum(token._.syllables_count for token in nlp(segment["text"]) if token._.syllables_count) 151 | if i == 0 or sent_start == 0: 152 | word_speed = word_syllables / (word["end"] - word["start"]) 153 | if word_speed < 3: 154 | sent_start = word["end"] - word_syllables / 3 155 | else: 156 | sent_start = word["start"] 157 | if i == len(segment["words"]) - 1: # last word in segment 158 | word_speed = word_syllables / (word["end"] - word["start"]) 159 | segment_speed = segment_syllables / (segment["end"] - segment["start"]) 160 | if word_speed < 1.0 or segment_speed < 2.0: 161 | word["word"] += "." 162 | 163 | if word["word"].endswith("."): 164 | sentences.append(sentence) 165 | sentence_starts.append(sent_start) 166 | sentence_ends.append(word["end"]) 167 | sent_start = 0 168 | sentence = "" 169 | # translate sentences in chunks of 128 170 | print("Translating sentences") 171 | translated_texts = [] 172 | for i in tqdm(range(0, len(sentences), 128)): 173 | chunk = sentences[i:i + 128] 174 | translated_chunk = translate_text(chunk, target_language) 175 | if translated_chunk is None: 176 | raise Exception("Translation failed") 177 | translated_texts.extend(translated_chunk) 178 | print("Creating translated audio track") 179 | prev_end_time = 0 180 | for i, translated_text in enumerate(tqdm(translated_texts)): 181 | translated_audio_file = create_audio_from_text(translated_text, target_language, target_voice) 182 | if translated_audio_file is None: 183 | raise Exception("Audio creation failed") 184 | temp_files.append(translated_audio_file) 185 | translated_audio = AudioSegment.from_wav(translated_audio_file) 186 | 187 | # Apply "ducking" effect: reduce volume of original audio during translated sentence 188 | start_time = int(sentence_starts[i] * 1000) 189 | end_time = start_time + len(translated_audio) 190 | next_start_time = int(sentence_starts[i+1] * 1000) if i < len(translated_texts) - 1 else len(ducked_audio) 191 | ducked_segment = ducked_audio[start_time:end_time].apply_gain(-10) # adjust volume reduction as needed 192 | 193 | fade_out_duration = min(500, max(1, start_time - prev_end_time)) 194 | fade_in_duration = min(500, max(1, next_start_time - end_time)) 195 | prev_end_time = end_time 196 | # Apply fade in effect to the end of the audio before the ducked segment 197 | if start_time == 0: 198 | ducked_audio = ducked_segment + ducked_audio[end_time:].fade_in(fade_in_duration) 199 | elif end_time == len(ducked_audio): 200 | ducked_audio = ducked_audio[:start_time].fade_out(fade_out_duration) + ducked_segment 201 | else: 202 | ducked_audio = ducked_audio[:start_time].fade_out(fade_out_duration) \ 203 | + ducked_segment + ducked_audio[end_time:].fade_in(fade_in_duration) 204 | 205 | # Overlay the translated audio on top of the original audio 206 | ducked_audio = ducked_audio.overlay(translated_audio, position=start_time) 207 | 208 | original_duration = int(sentence_ends[i] * 1000) 209 | new_duration = len(translated_audio) + len(merged_audio) 210 | padding_duration = max(0, original_duration - new_duration) 211 | padding = AudioSegment.silent(duration=padding_duration) 212 | merged_audio += padding + translated_audio 213 | return merged_audio, ducked_audio 214 | except Exception as e: 215 | print(f"Error merging audio files: {e}") 216 | return None 217 | finally: 218 | # cleanup: remove all temporary files 219 | for file in temp_files: 220 | try: 221 | os.remove(file) 222 | except Exception as e: 223 | print(f"Error removing temporary file {file}: {e}") 224 | 225 | 226 | def save_audio_to_file(audio, filename): 227 | try: 228 | audio.export(filename, format="wav") 229 | print(f"Audio track with translation only saved to {filename}") 230 | except Exception as e: 231 | print(f"Error saving audio to file: {e}") 232 | 233 | 234 | 235 | def replace_audio_in_video(video_file, new_audio): 236 | try: 237 | # Load the video 238 | video = VideoFileClip(video_file) 239 | 240 | # Save the new audio to a temporary file 241 | with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as temp_audio_file: 242 | new_audio.export(temp_audio_file.name, format="wav") 243 | new_audio.export("duckled.wav", format="wav") 244 | 245 | # Load the new audio into an AudioFileClip 246 | try: 247 | new_audio_clip = AudioFileClip(temp_audio_file.name) 248 | except Exception as e: 249 | print(f"Error loading new audio into an AudioFileClip: {e}") 250 | return 251 | 252 | # Check if the audio is compatible with the video 253 | if new_audio_clip.duration < video.duration: 254 | print("Warning: The new audio is shorter than the video. The remaining video will have no sound.") 255 | elif new_audio_clip.duration > video.duration: 256 | print("Warning: The new audio is longer than the video. The extra audio will be cut off.") 257 | new_audio_clip = new_audio_clip.subclip(0, video.duration) 258 | 259 | # Set the audio of the video to the new audio 260 | video = video.set_audio(new_audio_clip) 261 | 262 | # Write the result to a new video file 263 | output_filename = os.path.splitext(video_file)[0] + "_translated.mp4" 264 | try: 265 | video.write_videofile(output_filename, audio_codec='aac') 266 | except Exception as e: 267 | print(f"Error writing the new video file: {e}") 268 | return 269 | 270 | print(f"Translated video saved as {output_filename}") 271 | 272 | except Exception as e: 273 | print(f"Error replacing audio in video: {e}") 274 | finally: 275 | # Remove the temporary audio file 276 | if os.path.isfile(temp_audio_file.name): 277 | os.remove(temp_audio_file.name) 278 | 279 | 280 | def main(): 281 | parser = argparse.ArgumentParser() 282 | parser.add_argument('--input', type=str, help='Path to the source video file', required=True) 283 | parser.add_argument('--voice', type=str, default="es-US-Neural2-B", 284 | help=f'Target dubbing voice name from https://cloud.google.com/text-to-speech/docs/voices') 285 | parser.add_argument('--credentials', type=str, help='Path to the Google Cloud credentials JSON file', required=True) 286 | parser.add_argument('--source_language', type=str, help=f'Source language, e.g. english. Now the following languages are supported:' 287 | f' {list(spacy_models.keys())}', default="english") 288 | args = parser.parse_args() 289 | 290 | # Set the GOOGLE_APPLICATION_CREDENTIALS environment variable 291 | os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = args.credentials 292 | 293 | audio_file = extract_audio_from_video(args.input) 294 | if audio_file is None: 295 | return 296 | 297 | transcription = transcribe_audio(audio_file, args.source_language.lower()) 298 | if transcription is None: 299 | return 300 | 301 | merged_audio, ducked_audio = merge_audio_files(transcription, args.source_language.lower(), args.voice[:5], args.voice, audio_file) 302 | if merged_audio is None: 303 | return 304 | replace_audio_in_video(args.input, ducked_audio) 305 | # Save the audio file with the same name as the video file but with a ".wav" extension 306 | output_filename = os.path.splitext(args.input)[0] + ".wav" 307 | save_audio_to_file(merged_audio, output_filename) 308 | 309 | 310 | if __name__ == "__main__": 311 | main() 312 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | blis==0.7.9 2 | cachetools==5.3.1 3 | catalogue==2.0.8 4 | certifi==2023.5.7 5 | charset-normalizer==3.1.0 6 | click==8.1.3 7 | cmake==3.26.3 8 | confection==0.0.4 9 | cymem==2.0.7 10 | decorator==4.4.2 11 | ffmpeg-python==0.2.0 12 | filelock==3.12.0 13 | future==0.18.3 14 | google-api-core==2.11.0 15 | google-auth==2.19.1 16 | google-cloud-core==2.3.2 17 | google-cloud-speech==2.20.0 18 | google-cloud-texttospeech==2.14.1 19 | google-cloud-translate==3.11.1 20 | googleapis-common-protos==1.59.0 21 | grpcio==1.54.2 22 | grpcio-status==1.54.2 23 | idna==3.4 24 | imageio==2.30.0 25 | imageio-ffmpeg==0.4.8 26 | Jinja2==3.1.2 27 | langcodes==3.3.0 28 | lit==16.0.5.post0 29 | llvmlite==0.39.1 30 | MarkupSafe==2.1.3 31 | more-itertools==9.1.0 32 | moviepy==1.0.3 33 | mpmath==1.3.0 34 | murmurhash==1.0.9 35 | networkx==3.1 36 | nvidia-cublas-cu11==11.10.3.66 37 | nvidia-cuda-cupti-cu11==11.7.101 38 | nvidia-cuda-nvrtc-cu11==11.7.99 39 | nvidia-cuda-runtime-cu11==11.7.99 40 | nvidia-cudnn-cu11==8.5.0.96 41 | nvidia-cufft-cu11==10.9.0.58 42 | nvidia-curand-cu11==10.2.10.91 43 | nvidia-cusolver-cu11==11.4.0.1 44 | nvidia-cusparse-cu11==11.7.4.91 45 | nvidia-nccl-cu11==2.14.3 46 | nvidia-nvtx-cu11==11.7.91 47 | openai-whisper==20230314 48 | packaging==23.1 49 | pathy==0.10.1 50 | Pillow==9.5.0 51 | preshed==3.0.8 52 | proglog==0.1.10 53 | proto-plus==1.22.2 54 | protobuf==4.23.2 55 | pyasn1==0.5.0 56 | pyasn1-modules==0.3.0 57 | pydantic==1.10.8 58 | pydub==0.25.1 59 | pyphen==0.14.0 60 | regex==2023.6.3 61 | requests==2.31.0 62 | rsa==4.9 63 | six==1.16.0 64 | smart-open==6.3.0 65 | spacy==3.5.3 66 | spacy-legacy==3.0.12 67 | spacy-loggers==1.0.4 68 | spacy-syllables==3.0.2 69 | srsly==2.4.6 70 | sympy==1.12 71 | thinc==8.1.10 72 | tiktoken==0.3.1 73 | torch==2.0.1 74 | tqdm==4.65.0 75 | triton==2.0.0 76 | typer==0.7.0 77 | typing_extensions==4.6.3 78 | urllib3==1.26.16 79 | wasabi==1.1.1 80 | numba==0.56.4 81 | numpy==1.23.5 82 | --------------------------------------------------------------------------------