├── LICENSE
├── README.md
├── .gitignore
├── add_fades_captions_to_video.py
├── transcribe_from_video_aws.py
├── clean_video_from_transcription.py
├── transcribe_from_video_whisper.py
└── summary_chapters_blog.py

/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2023 Roy Shilkrot
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
1 | # Video Transcript Helper
2 | 
3 | 
4 | 
5 | [![Discord](https://img.shields.io/discord/1200229425141252116)](https://discord.gg/KbjGU2vvUz)
6 | 
7 | 
8 | 
9 | A comprehensive toolkit designed for content creators, educators, digital marketers, and video editing enthusiasts.
10 | It harnesses the power of AI and video processing through a suite of Python scripts that simplify the post-production process.
11 | This free, open-source project aims to transform the way users handle video content, turning hours of editing into a task of a few command lines.
12 | 
13 | This project contains four scripts:
14 | - `transcribe_from_video_XXX.py`: Transcribe a video (via AWS Transcribe or Faster-Whisper)
15 | - `clean_video_from_transcription.py`: Zap filler words ('uh', 'um') in videos using FFmpeg
16 | - `summary_chapters_blog.py`: Generate a summary, video chapters and a blog post
17 | - `add_fades_captions_to_video.py`: Add fade-in/fade-out effects and chapter captions to a video
18 | 
19 | Roadmap of future features:
20 | - Remove or speed up (shorten) periods of "silence"
21 | - Enhance speech with voice-separation models
22 | - Generate a supercut for a quick video snippet
23 | - Add Audiogram / Karaoke-style subtitles to the video
24 | - Translate the subtitles into any language
25 | 
26 | ## Usage
27 | Transcribe the video, where `XXX` is either `aws` (AWS Transcribe API) or `whisper` ([Faster-Whisper](https://github.com/guillaumekln/faster-whisper)):
28 | 
29 | ```sh
30 | $ python transcribe_from_video_XXX.py <video_file>
31 | ```
32 | 
33 | The output will be a file called `<video_file>.json` in the same directory as the video.
34 | 
35 | Zap the filler words:
36 | 
37 | ```sh
38 | $ python clean_video_from_transcription.py <video_file> <video_file>.json
39 | ```
40 | 
41 | The output will be a file called `<video_file>_cleaned.mp4` in the same directory as the video.
42 | 
43 | Generate the summary, chapters and blog post:
44 | 
45 | ```sh
46 | $ python summary_chapters_blog.py <video_file>.json --generate_summary --generate_chapters --generate_blog
47 | ```
48 | 
49 | ## Dependencies
50 | - Python 3.6+
51 | - [FFmpeg](https://ffmpeg.org/)
52 | - [AWS CLI](https://aws.amazon.com/cli/) (for AWS transcription)
53 | 
54 | Make sure to configure your AWS CLI with your credentials and region.
55 | 
--------------------------------------------------------------------------------

/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 | 
6 | # C extensions
7 | *.so
8 | 
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | share/python-wheels/
24 | *.egg-info/
25 | .installed.cfg
26 | *.egg
27 | MANIFEST
28 | 
29 | # PyInstaller
30 | # Usually these files are written by a python script from a template
31 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
32 | *.manifest
33 | *.spec
34 | 
35 | # Installer logs
36 | pip-log.txt
37 | pip-delete-this-directory.txt
38 | 
39 | # Unit test / coverage reports
40 | htmlcov/
41 | .tox/
42 | .nox/
43 | .coverage
44 | .coverage.*
45 | .cache
46 | nosetests.xml
47 | coverage.xml
48 | *.cover
49 | *.py,cover
50 | .hypothesis/
51 | .pytest_cache/
52 | cover/
53 | 
54 | # Translations
55 | *.mo
56 | *.pot
57 | 
58 | # Django stuff:
59 | *.log
60 | local_settings.py
61 | db.sqlite3
62 | db.sqlite3-journal
63 | 
64 | # Flask stuff:
65 | instance/
66 | .webassets-cache
67 | 
68 | # Scrapy stuff:
69 | .scrapy
70 | 
71 | # Sphinx documentation
72 | docs/_build/
73 | 
74 | # PyBuilder
75 | .pybuilder/
76 | target/
77 | 
78 | # Jupyter Notebook
79 | .ipynb_checkpoints
80 | 
81 | # IPython
82 | profile_default/
83 | ipython_config.py
84 | 
85 | # pyenv
86 | # For a library or package, you might want to ignore these files since the code is
87 | # intended to run in multiple environments; otherwise, check them in:
88 | # .python-version
89 | 
90 | # pipenv
91 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
92 | # However, in case of collaboration, if having platform-specific dependencies or dependencies
93 | # having no cross-platform support, pipenv may install dependencies that don't work, or not
94 | # install all needed dependencies.
95 | #Pipfile.lock
96 | 
97 | # poetry
98 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
99 | # This is especially recommended for binary packages to ensure reproducibility, and is more
100 | # commonly ignored for libraries.
101 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
102 | #poetry.lock
103 | 
104 | # pdm
105 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
106 | #pdm.lock
107 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
108 | # in version control.
109 | # https://pdm.fming.dev/#use-with-ide
110 | .pdm.toml
111 | 
112 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
113 | __pypackages__/
114 | 
115 | # Celery stuff
116 | celerybeat-schedule
117 | celerybeat.pid
118 | 
119 | # SageMath parsed files
120 | *.sage.py
121 | 
122 | # Environments
123 | .env
124 | .venv
125 | env/
126 | venv/
127 | ENV/
128 | env.bak/
129 | venv.bak/
130 | 
131 | # Spyder project settings
132 | .spyderproject
133 | .spyproject
134 | 
135 | # Rope project settings
136 | .ropeproject
137 | 
138 | # mkdocs documentation
139 | /site
140 | 
141 | # mypy
142 | .mypy_cache/
143 | .dmypy.json
144 | dmypy.json
145 | 
146 | # Pyre type checker
147 | .pyre/
148 | 
149 | # pytype static type analyzer
150 | .pytype/
151 | 
152 | # Cython debug symbols
153 | cython_debug/
154 | 
155 | # PyCharm
156 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
157 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
158 | # and can be added to the global gitignore or merged into this file. For a more nuclear
159 | # option (not recommended) you can uncomment the following to ignore the entire idea folder.
160 | #.idea/
161 | 
162 | # Mac
163 | .DS_Store
164 | .AppleDouble
165 | .LSOverride
166 | 
167 | # Thumbnails
168 | ._*
169 | 
170 | # Files that might appear on external disk
171 | .Spotlight-V100
172 | .Trashes
173 | 
174 | # Directories potentially created on remote AFP share
175 | .AppleDB
176 | .AppleDesktop
177 | Network Trash Folder
178 | Temporary Items
179 | .apdisk
180 | 
181 | # Windows
182 | # Windows thumbnail cache files
183 | Thumbs.db
184 | ehthumbs.db
185 | 
186 | # Folder config file
187 | Desktop.ini
188 | 
189 | # Recycle Bin used on file shares
190 | $RECYCLE.BIN/
191 | 
--------------------------------------------------------------------------------

/add_fades_captions_to_video.py:
--------------------------------------------------------------------------------
1 | # this script will add fade-in, fade-out effects and captions to a video
2 | # based on the input timed chapters (output from summary_chapters_blog.py)
3 | 
4 | import argparse
5 | import subprocess
6 | import os
7 | 
8 | # get the input video file and the input chapters file
9 | parser = argparse.ArgumentParser()
10 | parser.add_argument("input_video_file", help="input video file")
11 | parser.add_argument(
12 |     "input_timed_chapters_file", help="input text file with timed chapters"
13 | )
14 | args = parser.parse_args()
15 | 
16 | # get the input video file name and the input chapters file name
17 | input_video_file = args.input_video_file
18 | input_timed_chapters_file = args.input_timed_chapters_file
19 | 
20 | chapters = []
21 | 
22 | # read the input text file
23 | print("Parsing the input text file...")
24 | with open(input_timed_chapters_file) as f:
25 |     # each line in the file is a chapter, in the format:
26 |     # <start_time> - <end_time> <chapter_title>
27 |     # e.g.
28 |     # 00:00 - 00:10 Introduction
29 |     # 00:10 - 00:20 Chapter 1
30 | 
31 |     # read the lines
32 |     lines = f.readlines()
33 | 
34 |     # split each line into start_time, end_time and chapter_title
35 |     # and convert the start_time and end_time to seconds
36 |     for line in lines:
37 |         if line == "\n" or line == "":
38 |             continue
39 |         # split the line into start_time, end_time and chapter_title
40 |         start_time, end_time_and_chapter_title = line.split(" - ")
41 |         end_time, chapter_title = end_time_and_chapter_title.split(" ", 1)
42 | 
43 |         # convert the start_time and end_time to seconds
44 |         start_time_seconds = sum(
45 |             x * float(t) for x, t in zip([60, 1], start_time.split(":"))
46 |         )
47 |         end_time_seconds = sum(
48 |             x * float(t) for x, t in zip([60, 1], end_time.split(":"))
49 |         )
50 | 
51 |         # add the chapter to the list of chapters
52 |         chapters.append((start_time_seconds, end_time_seconds, chapter_title.strip()))
53 | 
54 | print(f"Found {len(chapters)} chapters.")
55 | 
56 | # sort the chapters by start_time
57 | chapters.sort(key=lambda x: x[0])
58 | 
59 | # create an .ass file with the captions in Advanced SSA format
60 | # each chapter will have a caption at the beginning of the chapter
61 | 
62 | # create the output file name
63 | output_ass_file = os.path.splitext(input_video_file)[0] + ".ass"
64 | print(f"Creating the output file {output_ass_file}...")
65 | 
66 | # format a time in seconds as H:MM:SS.cc, the timestamp format ASS expects
67 | write_ass_time = (
68 |     lambda x: f"{int(x // 3600)}:{str(int((x % 3600) // 60)).zfill(2)}:{str(int(x % 60)).zfill(2)}.00"
69 | )
70 | 
71 | ssa_prefix = """
72 | [Script Info]
73 | Title: 
74 | ScriptType: v4.00+
75 | Collisions: Normal
76 | PlayDepth: 0
77 | 
78 | [V4+ Styles]
79 | Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, Alignment, MarginL, MarginR, MarginV, Encoding
80 | Style: Default,Arial,20,&H00FFFFFF,&H000080FF,&H00000000,&H80000000,0,0,0,0,100,100,0,0,1,2,2,2,10,10,20,0
81 | 
82 | [Events]
83 | Format: Layer, Start, End, Style, Actor, MarginL, MarginR, MarginV, Effect, Text
84 | """
85 | 
86 | # create the output file
87 | with open(output_ass_file, "w") as f:
88 |     # write the prefix
89 |     f.write(ssa_prefix)
90 |     # write the captions
91 |     for i, chapter in enumerate(chapters):
92 |         # each subtitle is of the form e.g.
93 |         # Dialogue: 0,0:00:03.00,0:00:08.00,Default,,0,0,0,,subtitle text
94 | 
95 |         # write the subtitle line to the file. The subtitle will be shown for 5 seconds
96 |         f.write(
97 |             f"Dialogue: 0,{write_ass_time(chapter[0])},{write_ass_time(chapter[0] + 5)},Default,,0,0,0,,{'{'}\\fad(1200,250){'}'}{chapter[2]}\n"
98 |         )
99 | 
100 | 
101 | # get the duration of the video
102 | print("Getting the duration of the video...")
103 | result = subprocess.run(
104 |     [
105 |         "ffprobe",
106 |         "-v",
107 |         "error",
108 |         "-show_entries",
109 |         "format=duration",
110 |         "-of",
111 |         "default=noprint_wrappers=1:nokey=1",
112 |         input_video_file,
113 |     ],
114 |     stdout=subprocess.PIPE,
115 |     stderr=subprocess.STDOUT,
116 | )
117 | duration = int(float(result.stdout))
118 | 
119 | print(f"Video duration: {duration} seconds.")
120 | 
121 | output_video_file_path = os.path.splitext(input_video_file)[0] + "_with_captions.mp4"
122 | 
123 | # add the captions to the video with ffmpeg, fading in/out over the first/last second
124 | print("Adding captions and fades to the video...")
125 | subprocess.run(
126 |     [
127 |         "ffmpeg",
128 |         "-i",
129 |         input_video_file,
130 |         "-vf",
131 |         f"subtitles={output_ass_file}:force_style='Fontsize=24,PrimaryColour=&Hffffff&',fade=t=in:st=0:d=1,fade=t=out:st={duration - 1}:d=1",
132 |         "-c:v",
133 |         "libx264",
134 |         "-c:a",
135 |         "copy",
136 |         "-y",
137 |         output_video_file_path,
138 |     ]
139 | )
140 | 
141 | # delete the temporary files
142 | print("Deleting the temporary files...")
143 | os.remove(output_ass_file)
144 | 
145 | print("Done!")
146 | 
--------------------------------------------------------------------------------
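A quick usage sketch for the script above (file names are hypothetical). The chapters file is expected to contain one `MM:SS - MM:SS Chapter title` line per chapter, matching the parser comments at the top of the script:

```sh
$ cat chapters.txt
00:00 - 01:30 Introduction
01:30 - 04:45 Setting up the pipeline
04:45 - 08:10 Cleaning the footage

$ python add_fades_captions_to_video.py input_video.mp4 chapters.txt
```

The result is `input_video_with_captions.mp4`, with each chapter title faded in for five seconds at the start of its chapter.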
/transcribe_from_video_aws.py:
--------------------------------------------------------------------------------
1 | # this script will transcribe the audio from an input video file
2 | # the output will be a JSON file with the transcription
3 | # use argparse to get the input video file
4 | # use AWS Transcribe via the AWS CLI to transcribe the audio
5 | #
6 | # Usage:
7 | # python transcribe_from_video_aws.py <input_video_file>
8 | #
9 | # The output JSON file will be saved in the same directory as the input video file
10 | #
11 | # Example:
12 | # python transcribe_from_video_aws.py "input_video.mp4"
13 | #
14 | # The output JSON file will have the name "input_video.json"
15 | 
16 | import argparse
17 | import json
18 | import subprocess
19 | import os
20 | import re
21 | import time
22 | import uuid
23 | 
24 | # get the input video file
25 | parser = argparse.ArgumentParser()
26 | parser.add_argument("input_video_file", help="input video file")
27 | args = parser.parse_args()
28 | 
29 | # get the input video file name
30 | input_video_file = args.input_video_file
31 | 
32 | # get the input video file name without the extension
33 | input_video_file_name = os.path.splitext(input_video_file)[0]
34 | 
35 | # get the input video file name without the extension and without the path
36 | input_video_file_name_without_path = os.path.basename(input_video_file_name)
37 | 
38 | 
39 | def cleanup(job_name, s3_uri, flac_audio_file):
40 |     if s3_uri is not None:
41 |         # delete the temporary S3 audio file
42 |         print("Deleting the temporary S3 audio file...")
43 |         subprocess.run(["aws", "s3", "rm", s3_uri])
44 | 
45 |     if job_name is not None:
46 |         # delete the transcription job in AWS Transcribe
47 |         print("Deleting the transcription job in AWS Transcribe...")
48 |         subprocess.run(["aws", "transcribe", "delete-transcription-job",
49 |                         "--region", "us-east-1", "--transcription-job-name", job_name])
50 | 
51 |         # delete the temporary S3 bucket
52 |         print("Deleting the temporary S3 bucket...")
53 |         subprocess.run(["aws", "s3", "rb", f"s3://{job_name}", "--force"])
54 | 
55 |     if flac_audio_file is not None:
56 |         # delete the local FLAC audio file
57 |         print("Deleting the local FLAC audio file...")
58 |         os.remove(flac_audio_file)
59 | 
60 | 
61 | # convert the video file to a FLAC audio file using ffmpeg (quiet mode)
62 | # the FLAC audio file will be saved in the same directory as the input video file
63 | # the FLAC audio file will have the same name as the input video file but with a FLAC extension
64 | flac_audio_file = input_video_file_name_without_path
65 | # make sure the file has an S3-compatible name
66 | flac_audio_file = re.sub('[^0-9a-zA-Z]+', '-', flac_audio_file)
67 | flac_audio_file_without_path = flac_audio_file + ".flac"
68 | # add the path to the FLAC audio file
69 | flac_audio_file = os.path.join(os.path.dirname(input_video_file), flac_audio_file_without_path)
70 | 
71 | print(f"Converting video file to FLAC audio file using ffmpeg... {flac_audio_file}")
72 | subprocess.run(["ffmpeg", "-i", input_video_file, "-vn", "-ac", "1", "-ar", "16000", "-c:a", "flac",
73 |                 "-qscale:a", "0", "-loglevel", "quiet", "-copyts", "-y", flac_audio_file])
74 | 
75 | # generate a UUID for the job name
76 | job_name = f"transcribe-job-{uuid.uuid4().hex}"
77 | 
78 | # create a temporary S3 bucket for the transcription job
79 | # the bucket name will be the same as the job name
80 | print("Creating temporary S3 bucket for the transcription job...")
81 | process = subprocess.run(["aws", "s3", "mb", f"s3://{job_name}"])
82 | 
83 | if process.returncode != 0:
84 |     print("Error creating temporary S3 bucket for the transcription job")
85 |     exit(1)
86 | 
87 | # upload the FLAC audio file to the temporary S3 bucket
88 | print("Uploading FLAC audio file to the temporary S3 bucket...")
89 | process = subprocess.run(["aws", "s3", "cp", flac_audio_file, f"s3://{job_name}"])
90 | 
91 | if process.returncode != 0:
92 |     print("Error uploading FLAC audio file to the temporary S3 bucket")
93 |     cleanup(job_name, None, flac_audio_file)
94 |     exit(1)
95 | 
96 | # get the S3 URI for the FLAC audio file
97 | s3_uri = f"s3://{job_name}/{flac_audio_file_without_path}"
98 | 
99 | print(s3_uri)
100 | 
101 | # start the transcription job
102 | # aws transcribe start-transcription-job \
103 | #     --region us-east-1 \
104 | #     --transcription-job-name "$TEMP_NAME" \
105 | #     --media "MediaFileUri=$S3_URI" \
106 | #     --language-code en-US
107 | print("Starting the transcription job...")
108 | process = subprocess.run(["aws", "transcribe", "start-transcription-job",
109 |                           "--region", "us-east-1", "--transcription-job-name", job_name,
110 |                           "--media", f"MediaFileUri={s3_uri}",
111 |                           "--language-code", "en-US"])
112 | 
113 | if process.returncode != 0:
114 |     print("Error starting the transcription job")
transcription job") 114 | cleanup(job_name, s3_uri, flac_audio_file) 115 | exit(1) 116 | 117 | # wait for the transcription job to complete 118 | # run `aws transcribe get-transcription-job`` and capture the output JSON 119 | # e.g. aws transcribe get-transcription-job \ 120 | # --region us-east-1 \ 121 | # --transcription-job-name "$TEMP_NAME" 122 | # check the `TranscriptionJobStatus` field in the JSON if it is `COMPLETED` 123 | # if it is not `COMPLETED`, wait for 5 seconds and then check again 124 | # if it is `COMPLETED`, then break out of the loop 125 | print("Waiting for the transcription job to complete...") 126 | while True: 127 | process = subprocess.run(["aws", "transcribe", "get-transcription-job", 128 | "--region", "us-east-1", "--transcription-job-name", job_name], 129 | capture_output=True) 130 | output = process.stdout.decode("utf-8") 131 | if "COMPLETED" in output: 132 | break 133 | else: 134 | print("Transcription job not completed yet. Waiting for 5 seconds...") 135 | subprocess.run(["sleep", "5"]) 136 | 137 | # get the transcription job output JSON 138 | # use the last output JSON from the previous loop iteration to get the output JSON 139 | # from the `TranscriptionJob.Transcript.TranscriptFileUri` field 140 | # parse the JSON and get the `TranscriptFileUri` field 141 | parsed = json.loads(output) 142 | output_uri = parsed["TranscriptionJob"]["Transcript"]["TranscriptFileUri"] 143 | 144 | # download the transcription job output JSON file using regular `curl` 145 | # the transcription job output JSON file will be saved in the same directory as the input video file 146 | # and have the same name as the input video file but with a JSON extension 147 | output_json_file = input_video_file_name + ".json" 148 | print("Downloading the transcription job output JSON file...") 149 | subprocess.run(["curl", "-o", output_json_file, output_uri]) 150 | 151 | cleanup(job_name, s3_uri, flac_audio_file) 152 | -------------------------------------------------------------------------------- /clean_video_from_transcription.py: -------------------------------------------------------------------------------- 1 | # this script will read the transcription from the output JSON file and then clean the video 2 | # from filler words (e.g. um, uh, like, etc.) 
/clean_video_from_transcription.py:
--------------------------------------------------------------------------------
1 | # this script will read the transcription from the output JSON file and then clean the video
2 | # from filler words (e.g. um, uh, like, etc.)
3 | #
4 | # Usage:
5 | # python clean_video_from_transcription.py <input_video_file> <input_json_file>
6 | #
7 | # The output video file will be saved in the same directory as the input video file
8 | #
9 | # Example:
10 | # python clean_video_from_transcription.py "input_video.mp4" "input_json.json"
11 | 
12 | import argparse
13 | import json
14 | import subprocess
15 | import os
16 | 
17 | # get the input video file and the input JSON file
18 | parser = argparse.ArgumentParser()
19 | parser.add_argument("input_video_file", help="input video file")
20 | parser.add_argument("input_json_file", help="input json transcription file")
21 | args = parser.parse_args()
22 | 
23 | # get the input video file name and the input JSON file name
24 | input_video_file = args.input_video_file
25 | input_json_file = args.input_json_file
26 | 
27 | # read the input JSON file
28 | print("Parsing the input JSON file...")
29 | with open(input_json_file) as f:
30 |     data = json.load(f)
31 | 
32 | # get all the items where .results.items.alternatives.content is a filler word
33 | filler_words = ["um", "uh", "so"]
34 | 
35 | # filter to keep only pronunciations
36 | pronunciation_items = list(
37 |     filter(lambda x: x["type"] == "pronunciation", data["results"]["items"])
38 | )
39 | 
40 | # merge consecutive filler words in pronunciation_items
41 | i = 0
42 | while i < len(pronunciation_items) - 1:
43 |     if (
44 |         pronunciation_items[i]["alternatives"][0]["content"].lower() in filler_words
45 |         and pronunciation_items[i + 1]["alternatives"][0]["content"].lower()
46 |         in filler_words
47 |     ):
48 |         print(
49 |             "Found consecutive filler words: "
50 |             f"{pronunciation_items[i]['alternatives'][0]['content']} "
51 |             f"{pronunciation_items[i+1]['alternatives'][0]['content']} "
52 |             "at "
53 |             f"{pronunciation_items[i]['start_time']} "
54 |             f"{pronunciation_items[i+1]['start_time']}"
55 |         )
56 |         # merge the start and end timings of the two items
57 |         pronunciation_items[i]["end_time"] = pronunciation_items[i + 1]["end_time"]
58 | 
59 |         # remove the second item
60 |         pronunciation_items.pop(i + 1)
61 |     else:
62 |         i += 1
63 | 
64 | # extract the timings from the filler words items, in (start, end) tuples
65 | # parse float from string
66 | # the end time of a filler word is the start time of the next pronunciation
67 | # unless the next pronunciation is also a filler word, in which case the end time is the end time
68 | # of the next pronunciation
69 | filler_words_timings = [(0.0, 0.0)]
70 | for i, item in enumerate(pronunciation_items[:-1]):
71 |     # check in lowercase
72 |     if item["alternatives"][0]["content"].lower() in filler_words:
73 |         # get the start & end time of the filler word
74 |         start_time = float(item["start_time"])
75 |         # end_time = float(pronunciation_items[i+1]["start_time"]) + 0.1
76 |         end_time = float(item["end_time"])
77 |         # the duration of a filler word is at least 0.3 seconds
78 |         if end_time - start_time < 0.3:
79 |             end_time = start_time + 0.3
80 |         # if the next pronunciation starts after that window, use the start time of the
81 |         # next pronunciation as the end time of this filler word
82 |         if float(pronunciation_items[i + 1]["start_time"]) > end_time:
83 |             end_time = float(pronunciation_items[i + 1]["start_time"])
84 | 
85 |         if start_time >= end_time:
86 |             continue
87 | 
88 |         filler_words_timings.append((start_time, end_time))
89 | 
90 | # append in the end the duration of the video
91 | # find the duration of the video using ffprobe
92 | print("Finding the duration of the video...")
93 | ffprobe_output = subprocess.check_output(
94 |     [
| "ffprobe", 96 | "-v", 97 | "error", 98 | "-show_entries", 99 | "format=duration", 100 | "-of", 101 | "default=noprint_wrappers=1:nokey=1", 102 | input_video_file, 103 | ] 104 | ) 105 | video_duration = float(ffprobe_output) 106 | filler_words_timings.append((video_duration, video_duration)) 107 | 108 | # sort the filler words timings by start time 109 | filler_words_timings.sort(key=lambda x: x[0]) 110 | 111 | print(f"Found {len(filler_words_timings)-2} filler words in the video.") 112 | 113 | print("Filler words timings:") 114 | print(filler_words_timings[:5] + ["..."] + filler_words_timings[-5:]) 115 | 116 | # build an ffmpeg filter to remove the filler words by using the timings 117 | # e.g. 118 | # [0:v]trim=start=10:end=20,setpts=PTS-STARTPTS,format=yuv420p[0v]; 119 | # [0:a]atrim=start=10:end=20,asetpts=PTS-STARTPTS[0a]; 120 | # [0:v]trim=start=30:end=40,setpts=PTS-STARTPTS,format=yuv420p[1v]; 121 | # [0:a]atrim=start=30:end=40,asetpts=PTS-STARTPTS[1a]; 122 | # [0:v]trim=start=30:end=40,setpts=PTS-STARTPTS,format=yuv420p[2v]; 123 | # [0:a]atrim=start=30:end=40,asetpts=PTS-STARTPTS[2a]; 124 | # and then concatenate the inputs 125 | # [0v][0a][1v][1a][2v][2a]concat=n=3:v=1:a=1[outv][outa] 126 | 127 | 128 | 129 | def build_ffmpeg_cmd_with_filter(): 130 | n_filrs = len(filler_words_timings) 131 | filter = "" 132 | for i in range(1, n_filrs): 133 | # stagger the start and end time of the video and audio filters 134 | # so that we take the "non-filler" portion of the video 135 | start_time = filler_words_timings[i - 1][1] 136 | end_time = filler_words_timings[i][0] 137 | 138 | # add the video filter 139 | filter += ( 140 | f"[0:v]trim=start={start_time}:end={end_time},setpts=PTS-STARTPTS[{i}v];" 141 | ) 142 | 143 | # add the audio filter 144 | filter += ( 145 | f"[0:a]atrim=start={start_time}:end={end_time},asetpts=PTS-STARTPTS[{i}a];" 146 | ) 147 | 148 | # add the concat filter 149 | all_inputs = "".join([f"[{i}v][{i}a]" for i in range(n_filrs)]) 150 | filter += f"{all_inputs}concat=n={n_filrs}:v=1:a=1[outv][outa]" 151 | print("Filter:") 152 | print(filter) 153 | 154 | return [ 155 | "ffmpeg", 156 | "-i", 157 | input_video_file, 158 | "-filter_complex", 159 | filter, 160 | "-map", 161 | "[outv]", 162 | "-map", 163 | "[outa]", 164 | "-avoid_negative_ts", 165 | "1", 166 | "-y", 167 | ] 168 | 169 | 170 | def build_ffmpeg_cmd_with_ss_to(): 171 | n_filrs = len(filler_words_timings) 172 | cmd = ["ffmpeg"] 173 | remove_fillers = 0 174 | for i in range(1, n_filrs): 175 | # stagger the start and end time of the video and audio filters 176 | # so that we take the "non-filler" portion of the video 177 | start_time = filler_words_timings[i - 1][1] # end of last filler word 178 | end_time = filler_words_timings[i][0] # start of next filler word 179 | 180 | if start_time >= end_time: 181 | remove_fillers += 1 182 | continue 183 | 184 | # add the start and end time to the ffmpeg command 185 | cmd += [ 186 | "-ss", 187 | str(start_time) + "s", 188 | "-to", 189 | str(end_time) + "s", 190 | "-i", 191 | input_video_file, 192 | ] 193 | 194 | # add the number of filler words to remove 195 | print(f"Found {remove_fillers} inconsistent-timing filler words.") 196 | n_filrs -= remove_fillers 197 | 198 | # add the concat filter 199 | all_inputs = "".join([f"[{i}:v][{i}:a]" for i in range(n_filrs - 1)]) 200 | filter = f"{all_inputs}concat=n={n_filrs-1}:v=1:a=1[outv][outa]" 201 | 202 | cmd += [ 203 | "-filter_complex", 204 | filter, 205 | "-map", 206 | "[outv]", 207 | "-map", 208 | "[outa]", 209 | 
"-avoid_negative_ts", 210 | "1", 211 | "-y", 212 | "-loglevel", 213 | "error", 214 | ] 215 | return cmd 216 | 217 | 218 | # build the ffmpeg command 219 | ffmpeg_cmd = build_ffmpeg_cmd_with_ss_to() 220 | 221 | output_video_file = os.path.splitext(input_video_file)[0] + "_cleaned.mp4" 222 | 223 | # run ffmpeg to remove the filler words 224 | print("Removing the filler words from the video...") 225 | subprocess.run([*ffmpeg_cmd, output_video_file]) 226 | 227 | print("Done.") 228 | -------------------------------------------------------------------------------- /transcribe_from_video_whisper.py: -------------------------------------------------------------------------------- 1 | # this script will transcribe the audio from an input video file 2 | # the output will be a JSON file with the transcription 3 | # use argparse to get the input video file and the output text file 4 | # use whisper from openai to transcribe the audio 5 | # 6 | # Usage: 7 | # python transcribe_from_video.py 8 | # 9 | # The output JSON file will be saved in the same directory as the input video file 10 | # 11 | # Example: 12 | # python transcribe_from_video.py "input_video.mp4" 13 | # 14 | # The output JSON file will have the name "input_video.json" 15 | 16 | import argparse 17 | import json 18 | import subprocess 19 | import os 20 | import re 21 | import uuid 22 | from faster_whisper import WhisperModel 23 | 24 | # get the input video file and the output text file 25 | parser = argparse.ArgumentParser() 26 | parser.add_argument("input_video_file", help="input video file") 27 | args = parser.parse_args() 28 | 29 | # get the input video file name and the output text file name 30 | input_video_file = args.input_video_file 31 | 32 | # get the input video file name without the extension 33 | input_video_file_name = os.path.splitext(input_video_file)[0] 34 | 35 | # get the input video file name without the extension and without the path 36 | input_video_file_name_without_path = os.path.basename(input_video_file_name) 37 | 38 | # transcribe the audio from the input video file 39 | # the output will be a JSON file with the transcription 40 | # use whisper from openai to transcribe the audio 41 | # the output JSON file will be saved in the same directory as the input video file 42 | # the output JSON file will have the name "input_video.json" 43 | 44 | # get the audio from the input video file 45 | # the output will be a wav file with the same name as the input video file 46 | # the output wav file will be saved in the same directory as the input video file 47 | 48 | # get the input video file name without the extension 49 | input_video_file_name = os.path.splitext(input_video_file)[0] 50 | 51 | # get the input video file name without the extension and without the path 52 | input_video_file_name_without_path = os.path.basename(input_video_file_name) 53 | 54 | # get the output wav file name 55 | output_wav_file_name = input_video_file_name_without_path + ".wav" 56 | 57 | # get the output wav file name with the path 58 | output_wav_file_name_with_path = os.path.join( 59 | os.path.dirname(input_video_file), output_wav_file_name 60 | ) 61 | 62 | print("converting video to audio...") 63 | # execute the command to extract the audio from the input video file 64 | # the output will be a wav file with the same name as the input video file 65 | # the output wav file will be saved in the same directory as the input video file 66 | subprocess.run( 67 | [ 68 | "ffmpeg", 69 | "-i", 70 | input_video_file, 71 | "-vn", 72 | "-ac", 73 | "1", 74 
| "-ar", 75 | "16000", 76 | "-loglevel", 77 | "quiet", 78 | "-copyts", 79 | "-y", 80 | output_wav_file_name_with_path, 81 | ] 82 | ) 83 | 84 | # check if the output wav file exists 85 | if not os.path.exists(output_wav_file_name_with_path): 86 | print( 87 | 'Error: the output wav file does not exist "' 88 | + output_wav_file_name_with_path 89 | + '"' 90 | ) 91 | exit(1) 92 | 93 | print("transcribing audio...") 94 | model = WhisperModel("base") 95 | # hack the model to produce filler words by adding them as an input prompt 96 | segments, transcriptionInfo = model.transcribe( 97 | output_wav_file_name_with_path, 98 | initial_prompt="So uhm, yeaah. Uh, um. Uhh, Umm. Like, Okay, ehm, uuuh.", 99 | word_timestamps=True, 100 | suppress_blank=True, 101 | ) 102 | 103 | punctuation_marks = "\"'.。,,!!??::”)]}、" 104 | 105 | new_segments = [] 106 | # split punctuation from words into new items 107 | for segment in segments: 108 | new_words = [] 109 | for word in segment.words: 110 | wordStr = word.word.strip() 111 | if len(wordStr) < 1: 112 | continue 113 | if wordStr[-1] in punctuation_marks: 114 | punctuation = wordStr[-1] 115 | new_words.append( 116 | { 117 | "word": wordStr[:-1].strip(), 118 | "start": word.start, 119 | "end": word.end, 120 | "probability": word.probability, 121 | } 122 | ) 123 | new_words.append( 124 | { 125 | "word": punctuation, 126 | "start": word.end, 127 | "end": word.end, 128 | "probability": word.probability, 129 | } 130 | ) 131 | else: 132 | new_words.append( 133 | { 134 | "word": word.word, 135 | "start": word.start, 136 | "end": word.end, 137 | "probability": word.probability, 138 | } 139 | ) 140 | new_segment = {"words": new_words} 141 | new_segments.append(new_segment) 142 | 143 | # print(json.dumps(result, indent=4)) 144 | 145 | # get the output json file name 146 | output_json_file_name = input_video_file_name_without_path + ".json" 147 | 148 | # get the output json file name with the path 149 | output_json_file_name_with_path = os.path.join( 150 | os.path.dirname(input_video_file), output_json_file_name 151 | ) 152 | 153 | # write the output json file where the output format is: 154 | # { 155 | # "results": { 156 | # "transcripts": [{ 157 | # "transcript": "the transcript" 158 | # }], 159 | # "items": [ 160 | # { 161 | # "alternatives": [ 162 | # { 163 | # "content": "the word", 164 | # "confidence": 0.0 165 | # } 166 | # ], 167 | # "start_time": 0.0, 168 | # "end_time": 0.0, 169 | # "type": "pronunciation" 170 | # }, 171 | # ... 172 | # ] 173 | # } 174 | # } 175 | # 176 | # the input forma from whisper is: 177 | # { 178 | # "segments": [ 179 | # { 180 | # "words": [ 181 | # { 182 | # "word": "the word", 183 | # "start": 0.0, 184 | # "end": 0.0, 185 | # "probability": 0.0 186 | # }, 187 | # ... 188 | # ] 189 | # }, 190 | # ... 
183 | #     ]
184 | # }
185 | #
186 | # translate from whisper format to output format
187 | with open(output_json_file_name_with_path, "w") as outfile:
188 |     json.dump(
189 |         {
190 |             "results": {
191 |                 "transcripts": [
192 |                     {
193 |                         "transcript": " ".join(
194 |                             [
195 |                                 word["word"].strip()
196 |                                 for segment in new_segments
197 |                                 for word in segment["words"]
198 |                             ]
199 |                         ),
200 |                     }
201 |                 ],
202 |                 "items": [
203 |                     {
204 |                         "alternatives": [
205 |                             {
206 |                                 "content": word["word"],
207 |                                 "confidence": word["probability"],
208 |                             },
209 |                         ],
210 |                         "start_time": word["start"],
211 |                         "end_time": word["end"],
212 |                         "confidence": word["probability"],
213 |                         "type": (
214 |                             "pronunciation"
215 |                             if word["word"] not in punctuation_marks
216 |                             else "punctuation"
217 |                         ),
218 |                     }
219 |                     for segment in new_segments
220 |                     for word in segment["words"]
221 |                 ],
222 |             }
223 |         },
224 |         outfile,
225 |         indent=2,
226 |     )
227 | 
228 | # cleanup the output wav file
229 | os.remove(output_wav_file_name_with_path)
230 | 
--------------------------------------------------------------------------------
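A usage sketch for the Faster-Whisper path (file name hypothetical). The model runs locally, so no AWS credentials are needed, and the resulting `input_video.json` follows the same Transcribe-style layout sketched earlier:

```sh
$ pip install faster-whisper
$ python transcribe_from_video_whisper.py input_video.mp4
converting video to audio...
transcribing audio...
```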
/summary_chapters_blog.py:
--------------------------------------------------------------------------------
1 | # Description: This script takes a JSON file as input and outputs a summary of the video and the
2 | # chapters for adding to the video description on YouTube.
3 | #
4 | # Usage:
5 | # python summary_chapters_blog.py <input_json_file> [--generate_summary] [--generate_chapters] \
6 | #     [--generate_blog] [--print_prompts] [--trim_length <n>]
7 | #
8 | # Example:
9 | # python summary_chapters_blog.py "input_json.json"
10 | 
11 | import argparse
12 | import json
13 | 
14 | import openai
15 | 
16 | 
17 | # get the input JSON file
18 | parser = argparse.ArgumentParser()
19 | parser.add_argument("input_json_file", help="input json transcription file")
20 | # non-positional arguments for generating summary and chapters
21 | parser.add_argument("--generate_summary", action="store_true", help="generate summary")
22 | parser.add_argument(
23 |     "--generate_chapters", action="store_true", help="generate chapters"
24 | )
25 | parser.add_argument("--generate_blog", action="store_true", help="generate blog")
26 | parser.add_argument("--print_prompts", action="store_true", help="print prompts")
27 | parser.add_argument("--trim_length", type=int, default=100, help="trim length")
28 | parser.add_argument(
29 |     "--whisper_cpp_json", action="store_true", help="is this a whisper.cpp json file?"
30 | )
31 | # optional arguments for generating summary and chapters
32 | parser.add_argument("--summary_prompt", type=str, default="", help="prompt to use for summary")
33 | args = parser.parse_args()
34 | 
35 | # get the input JSON file name
36 | input_json_file = args.input_json_file
37 | 
38 | # read the input JSON file
39 | # print("Parsing the input JSON file...")
40 | with open(input_json_file) as f:
41 |     data = json.load(f)
42 | 
43 | # combine words into sentences and keep the timings, using the start time of the first word
44 | # and the end time of the last word.
45 | # sentences are separated by a `punctuation` type item in the JSON file.
46 | # collect sentences in a list of lists of items from the JSON file.
47 | sentences = []
48 | 
49 | if not args.whisper_cpp_json:
50 |     sentence = []
51 |     for item in data["results"]["items"]:
52 |         # if the item is a punctuation, then it's the end of the sentence
53 |         if item["type"] == "punctuation" and item["alternatives"][0]["content"] in [
54 |             ".",
55 |             "?",
56 |             "!",
57 |         ]:
58 |             # add an 'end_time' to the punctuation item by using the end time of the last word
59 |             item["end_time"] = (
60 |                 sentence[-1]["end_time"] if len(sentence) > 0 else item["start_time"]
61 |             )
62 | 
63 |             # add the punctuation to the sentence
64 |             sentence.append(item)
65 | 
66 |             # add the sentence to the list of sentences
67 |             sentences.append(sentence)
68 | 
69 |             # start a new sentence
70 |             sentence = []
71 |         else:
72 |             # filter out the filler words
73 |             if item["type"] == "pronunciation" and item["alternatives"][0][
74 |                 "content"
75 |             ].lower() in ["um", "uh", "so", "hmm", "like"]:
76 |                 continue
77 | 
78 |             # filter out punctuation
79 |             if item["type"] == "punctuation":
80 |                 continue
81 | 
82 |             # add the word to the sentence
83 |             sentence.append(item)
84 | 
85 | # get the timings of the sentences
86 | sentences_timings = []
87 | for sentence in sentences:
88 |     # get the start time of the sentence
89 |     start_time = float(sentence[0]["start_time"])
90 | 
91 |     # get the end time of the sentence
92 |     end_time = float(sentence[-1]["end_time"])
93 | 
94 |     # add the timings to the list of timings
95 |     sentences_timings.append((start_time, end_time))
96 | 
97 | 
98 | def convert_seconds_to_mmss(seconds):
99 |     return f"{int(seconds // 60):02d}:{int(seconds % 60):02d}"
100 | 
101 | 
102 | def build_summary(trim=True, remove_filler_words=True):
103 |     # build a summary list from the sentences and their timings
104 |     summary = []
105 |     if not args.whisper_cpp_json:
106 |         for sentence, timings in zip(sentences, sentences_timings):
107 |             # get the pronunciations from the sentence
108 |             pronunciations = [
109 |                 item["alternatives"][0]["content"].strip()
110 |                 for item in sentence
111 |                 if item["type"] == "pronunciation"
112 |             ]
113 | 
114 |             if remove_filler_words:
115 |                 # remove the filler words from the sentence
116 |                 pronunciations = [
117 |                     word
118 |                     for word in pronunciations
119 |                     if word.lower() not in ["um", "uh", "so", "hmm", "like"]
120 |                 ]
121 | 
122 |             # get the sentence text
123 |             sentence_text = " ".join(pronunciations) + "."
124 | 
125 |         if trim:
126 |             # trim the sentence text to at most trim_length characters
127 |             sentence_text = sentence_text[: args.trim_length]
128 | 
129 |         # get the sentence start and end timings
130 |         sentence_start_time, sentence_end_time = timings
131 | 
132 |         # convert the timings to strings in the format MM:SS
133 |         sentence_start_time = f"{convert_seconds_to_mmss(sentence_start_time)}"
134 |         sentence_end_time = f"{convert_seconds_to_mmss(sentence_end_time)}"
135 | 
136 |         # add the sentence to the summary
137 |         summary.append(
138 |             {
139 |                 "text": sentence_text,
140 |                 "start_time": sentence_start_time,
141 |                 "end_time": sentence_end_time,
142 |             }
143 |         )
144 |     else:
145 |         for sentence in data["transcription"]:
146 |             # get the sentence text
147 |             sentence_text = sentence["text"]
148 | 
149 |             if trim:
150 |                 # trim the sentence text to at most trim_length characters
151 |                 sentence_text = sentence_text[: args.trim_length]
152 | 
153 |             # get the sentence start and end timings
154 |             sentence_start_time = sentence["timestamps"]["from"]
155 |             sentence_end_time = sentence["timestamps"]["to"]
156 | 
157 |             # add the sentence to the summary
158 |             summary.append(
159 |                 {
160 |                     "text": sentence_text,
161 |                     "start_time": sentence_start_time,
162 |                     "end_time": sentence_end_time,
163 |                 }
164 |             )
165 | 
166 |     return summary
167 | 
168 | 
169 | if args.generate_summary:
170 |     # build a prompt for OpenAI generation:
171 |     prompt = "transcript for the video:\n"
172 |     prompt += "---\n"
173 |     for sentence in build_summary(trim=args.trim_length > 0):
174 |         prompt += f"{sentence['text']}\n"
175 |     prompt += "---\n"
176 |     if args.summary_prompt:
177 |         prompt += args.summary_prompt
178 |     else:
179 |         prompt += (
180 |             "write a short summary description paragraph for the above video on YouTube.\n"
181 |         )
182 |     prompt += "Summary for the video:\n"
183 | 
184 |     if args.print_prompts:
185 |         print(prompt)
186 | 
187 |     history = [{"role": "user", "content": prompt}]
188 | 
189 |     # send a request to the OpenAI API (model gpt-3.5-turbo-16k) to generate the summary
190 |     # print("Sending a request to the OpenAI API to generate the summary...")
191 |     print("Generating the summary...")
192 |     response = openai.ChatCompletion.create(
193 |         model="gpt-3.5-turbo-16k",
194 |         messages=history,
195 |     )
196 | 
197 |     # get the generated summary
198 |     generated_summary = response["choices"][0]["message"]["content"]
199 |     history += [{"role": "assistant", "content": generated_summary}]
200 | 
201 |     # print the generated summary
202 |     print("----------------------")
203 |     print(generated_summary)
204 |     print("----------------------")
205 | 
206 | if args.generate_chapters:
207 |     prompt = "transcript for the video:\n"
208 |     prompt += "---\n"
209 |     for sentence in build_summary(trim=True):
210 |         prompt += (
211 |             f"[{sentence['start_time']} - {sentence['end_time']}] {sentence['text']}\n"
212 |         )
213 |     prompt += "---\n"
214 |     prompt += (
215 |         "write up to 10 high-level chapters for the video on YouTube in the format: "
216 |         + "'MM:SS <chapter title>.'\n"
217 |     )
218 |     prompt += "Chapters for the video:\n"
219 | 
220 |     if args.print_prompts:
221 |         print(prompt)
222 | 
223 |     history = [{"role": "user", "content": prompt}]
224 | 
225 |     # send a request to the OpenAI API (model gpt-3.5-turbo) to generate the chapters
226 |     print("Sending a request to the OpenAI API to generate the chapters...")
227 |     response = openai.ChatCompletion.create(
228 |         model="gpt-3.5-turbo",
229 |         messages=history,
230 |     )
231 | 
232 |     # get the generated chapters
233 |     generated_chapters = response["choices"][0]["message"]["content"]
234 |     history += [{"role": "assistant", "content": generated_chapters}]
235 | 
236 |     # print the generated chapters
237 |     print("----------------------")
238 |     print(generated_chapters)
239 |     print("----------------------")
240 | 
241 | if args.generate_blog:
242 |     prompt = "transcript for the video:\n"
243 |     prompt += "---\n"
244 |     for sentence in build_summary(trim=False):
245 |         prompt += f"{sentence['text']}\n"
246 |     prompt += "---\n"
247 |     prompt += "write a blog post of at least 500 words for the above video. write the title and then the post body.\n"
248 |     prompt += "Title of the blog post:\n"
249 | 
250 |     if args.print_prompts:
251 |         print(prompt)
252 | 
253 |     history = [{"role": "user", "content": prompt}]
254 | 
255 |     # send a request to the OpenAI API (model gpt-3.5-turbo) to generate the blog post
256 |     print("Sending a request to the OpenAI API to generate the blog post...")
257 |     response = openai.ChatCompletion.create(
258 |         model="gpt-3.5-turbo",
259 |         messages=history,
260 |     )
261 | 
262 |     # get the generated blog post
263 |     generated_blog = response["choices"][0]["message"]["content"]
264 |     history += [{"role": "assistant", "content": generated_blog}]
265 | 
266 |     # print the generated blog post
267 |     print("----------------------")
268 |     print(generated_blog)
269 |     print("----------------------")
270 | 
271 | print("Done.")
272 | 
--------------------------------------------------------------------------------
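Putting the pieces together, a typical end-to-end session might look like this (file names hypothetical; the OpenAI-backed steps assume an `OPENAI_API_KEY` in the environment):

```sh
$ python transcribe_from_video_whisper.py talk.mp4              # writes talk.json
$ python clean_video_from_transcription.py talk.mp4 talk.json   # writes talk_cleaned.mp4
$ python summary_chapters_blog.py talk.json --generate_summary --generate_chapters
```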