├── tweet.png ├── frames ├── frame_139.jpg ├── frame_347.jpg ├── frame_365.jpg ├── frame_7.jpg ├── frame_849.jpg ├── frame_901.jpg ├── frame_961.jpg ├── frame_1431.jpg ├── frame_1625.jpg ├── frame_2073.jpg ├── frame_2170.jpg ├── frame_2324.jpg ├── frame_2503.jpg ├── frame_2686.jpg ├── frame_2907.jpg ├── frame_2932.jpg ├── frame_2940.jpg ├── frame_3193.jpg ├── frame_3261.jpg ├── frame_3346.jpg ├── frame_3459.jpg ├── frame_3595.jpg ├── frame_3631.jpg ├── frame_3646.jpg ├── frame_3781.jpg ├── frame_3874.jpg ├── frame_3926.jpg ├── frame_3989.jpg ├── frame_4002.jpg ├── frame_4069.jpg ├── frame_4107.jpg ├── frame_4194.jpg ├── frame_4267.jpg ├── frame_4301.jpg ├── frame_4398.jpg ├── frame_4436.jpg ├── frame_4458.jpg ├── frame_4613.jpg ├── frame_4659.jpg ├── frame_4797.jpg ├── frame_5015.jpg ├── frame_5065.jpg ├── frame_5225.jpg ├── frame_5282.jpg ├── frame_5378.jpg ├── frame_5446.jpg ├── frame_5461.jpg ├── frame_5482.jpg ├── frame_5589.jpg ├── frame_5683.jpg ├── frame_5781.jpg ├── frame_5865.jpg ├── frame_5880.jpg ├── frame_5955.jpg ├── frame_6035.jpg ├── frame_6128.jpg ├── frame_6156.jpg ├── frame_6193.jpg ├── frame_6792.jpg ├── frame_6867.jpg ├── frame_6914.jpg ├── frame_7025.jpg ├── frame_7042.jpg ├── frame_7134.jpg ├── frame_7308.jpg ├── frame_7373.jpg ├── frame_7500.jpg ├── frame_7736.jpg ├── frame_7824.jpg ├── frame_7891.jpg ├── frame_7901.jpg └── frame_7946.jpg ├── README.md ├── LICENSE ├── karpathy_challenge_final.ipynb ├── transcript_json └── llm_outline.txt └── companion_guide.txt /tweet.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/tweet.png -------------------------------------------------------------------------------- /frames/frame_139.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_139.jpg 
-------------------------------------------------------------------------------- /frames/frame_347.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_347.jpg -------------------------------------------------------------------------------- /frames/frame_365.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_365.jpg -------------------------------------------------------------------------------- /frames/frame_7.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_7.jpg -------------------------------------------------------------------------------- /frames/frame_849.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_849.jpg -------------------------------------------------------------------------------- /frames/frame_901.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_901.jpg -------------------------------------------------------------------------------- /frames/frame_961.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_961.jpg -------------------------------------------------------------------------------- /frames/frame_1431.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_1431.jpg 
-------------------------------------------------------------------------------- /frames/frame_1625.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_1625.jpg -------------------------------------------------------------------------------- /frames/frame_2073.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_2073.jpg -------------------------------------------------------------------------------- /frames/frame_2170.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_2170.jpg -------------------------------------------------------------------------------- /frames/frame_2324.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_2324.jpg -------------------------------------------------------------------------------- /frames/frame_2503.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_2503.jpg -------------------------------------------------------------------------------- /frames/frame_2686.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_2686.jpg -------------------------------------------------------------------------------- /frames/frame_2907.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_2907.jpg 
-------------------------------------------------------------------------------- /frames/frame_2932.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_2932.jpg -------------------------------------------------------------------------------- /frames/frame_2940.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_2940.jpg -------------------------------------------------------------------------------- /frames/frame_3193.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_3193.jpg -------------------------------------------------------------------------------- /frames/frame_3261.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_3261.jpg -------------------------------------------------------------------------------- /frames/frame_3346.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_3346.jpg -------------------------------------------------------------------------------- /frames/frame_3459.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_3459.jpg -------------------------------------------------------------------------------- /frames/frame_3595.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_3595.jpg 
-------------------------------------------------------------------------------- /frames/frame_3631.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_3631.jpg -------------------------------------------------------------------------------- /frames/frame_3646.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_3646.jpg -------------------------------------------------------------------------------- /frames/frame_3781.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_3781.jpg -------------------------------------------------------------------------------- /frames/frame_3874.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_3874.jpg -------------------------------------------------------------------------------- /frames/frame_3926.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_3926.jpg -------------------------------------------------------------------------------- /frames/frame_3989.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_3989.jpg -------------------------------------------------------------------------------- /frames/frame_4002.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_4002.jpg 
-------------------------------------------------------------------------------- /frames/frame_4069.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_4069.jpg -------------------------------------------------------------------------------- /frames/frame_4107.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_4107.jpg -------------------------------------------------------------------------------- /frames/frame_4194.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_4194.jpg -------------------------------------------------------------------------------- /frames/frame_4267.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_4267.jpg -------------------------------------------------------------------------------- /frames/frame_4301.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_4301.jpg -------------------------------------------------------------------------------- /frames/frame_4398.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_4398.jpg -------------------------------------------------------------------------------- /frames/frame_4436.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_4436.jpg 
-------------------------------------------------------------------------------- /frames/frame_4458.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_4458.jpg -------------------------------------------------------------------------------- /frames/frame_4613.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_4613.jpg -------------------------------------------------------------------------------- /frames/frame_4659.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_4659.jpg -------------------------------------------------------------------------------- /frames/frame_4797.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_4797.jpg -------------------------------------------------------------------------------- /frames/frame_5015.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_5015.jpg -------------------------------------------------------------------------------- /frames/frame_5065.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_5065.jpg -------------------------------------------------------------------------------- /frames/frame_5225.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_5225.jpg 
-------------------------------------------------------------------------------- /frames/frame_5282.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_5282.jpg -------------------------------------------------------------------------------- /frames/frame_5378.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_5378.jpg -------------------------------------------------------------------------------- /frames/frame_5446.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_5446.jpg -------------------------------------------------------------------------------- /frames/frame_5461.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_5461.jpg -------------------------------------------------------------------------------- /frames/frame_5482.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_5482.jpg -------------------------------------------------------------------------------- /frames/frame_5589.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_5589.jpg -------------------------------------------------------------------------------- /frames/frame_5683.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_5683.jpg 
-------------------------------------------------------------------------------- /frames/frame_5781.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_5781.jpg -------------------------------------------------------------------------------- /frames/frame_5865.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_5865.jpg -------------------------------------------------------------------------------- /frames/frame_5880.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_5880.jpg -------------------------------------------------------------------------------- /frames/frame_5955.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_5955.jpg -------------------------------------------------------------------------------- /frames/frame_6035.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_6035.jpg -------------------------------------------------------------------------------- /frames/frame_6128.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_6128.jpg -------------------------------------------------------------------------------- /frames/frame_6156.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_6156.jpg 
-------------------------------------------------------------------------------- /frames/frame_6193.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_6193.jpg -------------------------------------------------------------------------------- /frames/frame_6792.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_6792.jpg -------------------------------------------------------------------------------- /frames/frame_6867.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_6867.jpg -------------------------------------------------------------------------------- /frames/frame_6914.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_6914.jpg -------------------------------------------------------------------------------- /frames/frame_7025.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_7025.jpg -------------------------------------------------------------------------------- /frames/frame_7042.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_7042.jpg -------------------------------------------------------------------------------- /frames/frame_7134.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_7134.jpg 
-------------------------------------------------------------------------------- /frames/frame_7308.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_7308.jpg -------------------------------------------------------------------------------- /frames/frame_7373.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_7373.jpg -------------------------------------------------------------------------------- /frames/frame_7500.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_7500.jpg -------------------------------------------------------------------------------- /frames/frame_7736.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_7736.jpg -------------------------------------------------------------------------------- /frames/frame_7824.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_7824.jpg -------------------------------------------------------------------------------- /frames/frame_7891.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_7891.jpg -------------------------------------------------------------------------------- /frames/frame_7901.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_7901.jpg 
-------------------------------------------------------------------------------- /frames/frame_7946.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/companion-guide-challenge/main/frames/frame_7946.jpg -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |

Video Companion Guide Challenge

2 |

Submission by: Adam Łucek

3 |

Automating the creation and media population of markdown companion guides with ✨AI✨

4 |
5 | 6 |
7 | 8 | --- 9 | 10 | Main output file in companion_guide_output.md 11 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 Adam Łucek 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /karpathy_challenge_final.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "ceb3ab89-7990-4469-8aad-43c24472b1f9", 6 | "metadata": {}, 7 | "source": [ 8 | "

Video Companion Guide Challenge

\n", 9 | "

Submission by: Adam Łucek

\n", 10 | "

Automating the creation and media population of markdown companion guides with ✨AI✨

\n", 11 | "
\n", 12 | "\n", 13 | "
\n", 14 | "
\n" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "id": "16a5282e-f372-42cd-ae4a-c79e17e1377f", 20 | "metadata": {}, 21 | "source": [ 22 | "### Dependencies\n", 23 | "---" 24 | ] 25 | }, 26 | { 27 | "cell_type": "code", 28 | "execution_count": 35, 29 | "id": "07739e56-7509-4f02-9844-48df5d97b990", 30 | "metadata": {}, 31 | "outputs": [], 32 | "source": [ 33 | "from pytube import YouTube as pyt\n", 34 | "from moviepy.editor import *\n", 35 | "import os\n", 36 | "from moviepy.editor import AudioFileClip\n", 37 | "from openai import OpenAI\n", 38 | "import time\n", 39 | "import json\n", 40 | "import re\n", 41 | "from langchain_experimental.text_splitter import SemanticChunker\n", 42 | "from langchain_openai.embeddings import OpenAIEmbeddings\n", 43 | "from langchain_openai import ChatOpenAI\n", 44 | "from langchain.prompts import ChatPromptTemplate\n", 45 | "from langchain_core.runnables import RunnablePassthrough\n", 46 | "from langchain_core.output_parsers import StrOutputParser\n", 47 | "from langchain_community.document_loaders import JSONLoader\n", 48 | "from langchain_community.vectorstores import FAISS\n", 49 | "from langchain_openai import OpenAIEmbeddings\n", 50 | "from langchain_text_splitters import CharacterTextSplitter\n", 51 | "import cv2\n", 52 | "\n", 53 | "OPENAI_API_KEY = os.environ.get(\"OPENAI_API_KEY\")\n", 54 | "os.environ[\"IMAGEIO_FFMPEG_EXE\"] = \"/opt/homebrew/bin/ffmpeg\" " 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "id": "53dce99a-e995-4e3d-9290-2b3d07f7ec56", 60 | "metadata": {}, 61 | "source": [ 62 | "### Time Keeping Function\n", 63 | "---" 64 | ] 65 | }, 66 | { 67 | "cell_type": "code", 68 | "execution_count": 22, 69 | "id": "b34131ab-8ea8-477e-b851-62ed6694eadd", 70 | "metadata": {}, 71 | "outputs": [], 72 | "source": [ 73 | "def start_timer():\n", 74 | " global start_time\n", 75 | " start_time = time.time()\n", 76 | "\n", 77 | "def show_current_runtime():\n", 78 | " return round(time.time() - start_time, 2)" 
79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "id": "9d76fb8f-84fd-4c8b-aefc-42f0b0353b0e", 84 | "metadata": {}, 85 | "source": [ 86 | "### Downloading the Audio & Video\n", 87 | "---" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": 36, 93 | "id": "8da796c3-f9bb-463c-855d-b3ad5dec9a31", 94 | "metadata": {}, 95 | "outputs": [], 96 | "source": [ 97 | "def download_video(url, path):\n", 98 | " yt = pyt(url)\n", 99 | " stream = yt.streams.get_highest_resolution()\n", 100 | " stream.download(output_path=f\"{path}/original_files/video\", filename=\"video_file.mp4\")\n", 101 | " audio_stream = yt.streams.filter(only_audio=True).first()\n", 102 | " audio_stream.download(output_path=f\"{path}/original_files/audio\", filename=\"audio_file.mp4\")\n", 103 | "\n", 104 | "def get_title(url):\n", 105 | " yt = pyt(url)\n", 106 | " return yt.title" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "id": "c23a065e-bf3b-4265-b717-a1b8699dad31", 112 | "metadata": {}, 113 | "source": [ 114 | "### Checking Audio File Size, and Chunking it if Large\n", 115 | "---" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": 37, 121 | "id": "1e7f5597-aa5d-4f34-885f-106d62c13451", 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "def split_audio(file_path, chunk_size_mb=12, output_folder=\"split_chunks\"):\n", 126 | " global split_audio_return\n", 127 | " split_audio_return = False # default: no split; also clears any value left over from an earlier run\n", 128 | " if not os.path.exists(file_path):\n", 129 | " print(f\"File {file_path} not found.\")\n", 130 | " return\n", 131 | "\n", 132 | " if not os.path.exists(output_folder):\n", 133 | " os.makedirs(output_folder)\n", 134 | "\n", 135 | " file_size_mb = os.path.getsize(file_path) / (1024 * 1024)\n", 136 | " print(f\"File size: {file_size_mb:.2f} MB\")\n", 137 | "\n", 138 | " if file_size_mb <= chunk_size_mb:\n", 139 | " print(\"File size is within the limit. 
No need to split.\")\n", 140 | " return\n", 141 | " else:\n", 142 | " split_audio_return = True\n", 143 | "\n", 144 | " clip = AudioFileClip(file_path)\n", 145 | " total_duration = clip.duration\n", 146 | " chunk_duration = (chunk_size_mb / file_size_mb) * total_duration\n", 147 | "\n", 148 | " # Split the audio\n", 149 | " start = 0\n", 150 | " part = 1\n", 151 | " while start < total_duration:\n", 152 | " end = min(start + chunk_duration, total_duration)\n", 153 | " chunk = clip.subclip(start, end)\n", 154 | " chunk_filename = os.path.join(output_folder, f\"{os.path.splitext(os.path.basename(file_path))[0]}_part{part}.mp4\")\n", 155 | " chunk.write_audiofile(chunk_filename, bitrate=\"64k\", codec=\"aac\")\n", 156 | "\n", 157 | " print(f\"Created chunk: {chunk_filename}\")\n", 158 | "\n", 159 | " start = end\n", 160 | " part += 1\n", 161 | "\n", 162 | " clip.close()" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "id": "b2f3142e-6afa-4ac0-9247-f720862ab98e", 168 | "metadata": {}, 169 | "source": [ 170 | "### Transcribing with Whisper-1 & Writing to JSON File(s)\n", 171 | "---" 172 | ] 173 | }, 174 | { 175 | "cell_type": "code", 176 | "execution_count": 38, 177 | "id": "eab4f585-ac13-478d-8d80-b96458e816c5", 178 | "metadata": {}, 179 | "outputs": [], 180 | "source": [ 181 | "def create_json(split_audio_return, input_folder, output_folder=\"transcript_json\"):\n", 182 | " if not os.path.exists(input_folder):\n", 183 | " print(f\"The folder {input_folder} does not exist.\")\n", 184 | " return\n", 185 | " if not os.path.exists(output_folder):\n", 186 | " os.makedirs(output_folder)\n", 187 | "\n", 188 | " client = OpenAI()\n", 189 | "\n", 190 | " if not split_audio_return:\n", 191 | " # If audio is not split, use the path to the original file\n", 192 | " original_audio_path = f\"{input_folder}/original_files/audio\" \n", 193 | " if os.path.exists(original_audio_path):\n", 194 | " for filename in os.listdir(original_audio_path):\n", 195 | " if 
filename.endswith(\".mp4\"):\n", 196 | " process_audio_file(client, original_audio_path, filename, output_folder)\n", 197 | " else:\n", 198 | " print(f\"The original audio folder {original_audio_path} does not exist.\")\n", 199 | " else:\n", 200 | " input_folder_chunks = f'{input_folder}/split_chunks'\n", 201 | " # If audio is split, iterate over the split audio files\n", 202 | " for filename in os.listdir(input_folder_chunks):\n", 203 | " if filename.endswith(\".mp4\"):\n", 204 | " process_audio_file(client, input_folder_chunks, filename, output_folder)\n", 205 | "\n", 206 | "def process_audio_file(client, folder_path, filename, output_folder):\n", 207 | " file_path = os.path.join(folder_path, filename)\n", 208 | " with open(file_path, \"rb\") as audio_file:\n", 209 | " transcript = client.audio.transcriptions.create(\n", 210 | " file=audio_file,\n", 211 | " model=\"whisper-1\",\n", 212 | " response_format=\"verbose_json\",\n", 213 | " timestamp_granularities=[\"segment\"]\n", 214 | " )\n", 215 | "\n", 216 | " json_filename = f\"{os.path.splitext(filename)[0]}_transcript.json\"\n", 217 | " output_path = os.path.join(output_folder, json_filename)\n", 218 | "\n", 219 | " with open(output_path, 'w') as f:\n", 220 | " json.dump(transcript.segments, f, indent=4)\n", 221 | "\n", 222 | " print(f\"Transcript for {filename} saved to {output_path}\")" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "id": "054935e3-818d-474f-8102-0460f81eeef5", 228 | "metadata": {}, 229 | "source": [ 230 | "### Cleaning & Concatenating the Transcription JSON files\n", 231 | "---" 232 | ] 233 | }, 234 | { 235 | "cell_type": "code", 236 | "execution_count": 39, 237 | "id": "e2152362-7151-47ee-aa19-34c004d52805", 238 | "metadata": { 239 | "scrolled": true 240 | }, 241 | "outputs": [], 242 | "source": [ 243 | "def extract_part_number(filename):\n", 244 | " match = re.search(r'part(\\d+)', filename)\n", 245 | " return int(match.group(1)) if match else 0\n", 246 | "\n", 247 | "def 
process_file(filepath, max_id, last_end_time):\n", 248 | " with open(filepath, 'r') as file:\n", 249 | " data = json.load(file)\n", 250 | " time_adjustment = last_end_time - float(data[0]['start']) if last_end_time and data else 0\n", 251 | " new_data = []\n", 252 | " for entry in data:\n", 253 | " new_entry = {\n", 254 | " 'id': max_id + 1,\n", 255 | " 'start': float(entry['start']) + time_adjustment,\n", 256 | " 'end': float(entry['end']) + time_adjustment,\n", 257 | " 'text': entry['text']\n", 258 | " }\n", 259 | " new_data.append(new_entry)\n", 260 | " max_id += 1\n", 261 | " return new_data, max_id, new_data[-1]['end'] if new_data else last_end_time\n", 262 | "\n", 263 | "def process_transcripts(directory_path, split_audio_return):\n", 264 | " files = os.listdir(directory_path)\n", 265 | " combined_data = []\n", 266 | " max_id = -1\n", 267 | " last_end_time = 0.0\n", 268 | "\n", 269 | " if split_audio_return:\n", 270 | " sorted_files = sorted(\n", 271 | " [file for file in files if file.startswith('audio') and file.endswith('.json')],\n", 272 | " key=extract_part_number\n", 273 | " )\n", 274 | " else:\n", 275 | " sorted_files = [file for file in files if file.endswith('.json') and 'part' not in file]\n", 276 | "\n", 277 | " for filename in sorted_files:\n", 278 | " full_path = os.path.join(directory_path, filename)\n", 279 | " processed_data, max_id, last_end_time = process_file(full_path, max_id, last_end_time)\n", 280 | " combined_data.extend(processed_data)\n", 281 | "\n", 282 | " # Output the combined data to a new JSON file\n", 283 | " output_path = os.path.join(directory_path, 'combined_data.json')\n", 284 | " with open(output_path, 'w') as outfile:\n", 285 | " json.dump(combined_data, outfile, indent=4)\n", 286 | "\n" 287 | ] 288 | }, 289 | { 290 | "cell_type": "markdown", 291 | "id": "80b56ef6-96ba-4c14-acec-2b7a84c8b2b3", 292 | "metadata": {}, 293 | "source": [ 294 | "### Also Grabbing the Full Transcript\n", 295 | "---" 296 | ] 297 | }, 298 | { 299 | 
"cell_type": "code", 300 | "execution_count": 40, 301 | "id": "3ff9b387-63a1-4c52-b609-734b25e220be", 302 | "metadata": {}, 303 | "outputs": [], 304 | "source": [ 305 | "def full_transcript(file_path):\n", 306 | " with open(file_path, 'r') as file:\n", 307 | " data = json.load(file)\n", 308 | " transcript = \"\"\n", 309 | " for entry in data:\n", 310 | " transcript += entry['text']\n", 311 | " return transcript" 312 | ] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "id": "123e38fa-15de-40de-938c-e41bfc264c24", 317 | "metadata": {}, 318 | "source": [ 319 | "### Function to Combine Short Chunks of Transcription\n", 320 | "---" 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": 41, 326 | "id": "2cfbee6b-a42d-478d-abfe-6dd7de05becd", 327 | "metadata": { 328 | "scrolled": true 329 | }, 330 | "outputs": [], 331 | "source": [ 332 | "def merge_short_documents(documents, min_length=2000):\n", 333 | " i = 0\n", 334 | " while i < len(documents) - 1:\n", 335 | " current_doc = documents[i]\n", 336 | " if len(current_doc.page_content) < min_length:\n", 337 | " documents[i + 1].page_content = current_doc.page_content + documents[i + 1].page_content\n", 338 | " del documents[i]\n", 339 | " else:\n", 340 | " i += 1\n", 341 | " return documents" 342 | ] 343 | }, 344 | { 345 | "cell_type": "markdown", 346 | "id": "a378a120-6934-4d21-9122-38830f00bbcc", 347 | "metadata": {}, 348 | "source": [ 349 | "### Main Markdown File Prompting with GPT-4-Turbo & LangChain Setup\n", 350 | "---" 351 | ] 352 | }, 353 | { 354 | "cell_type": "code", 355 | "execution_count": 42, 356 | "id": "ba169ab4-c63e-411e-8358-42dff42e6733", 357 | "metadata": {}, 358 | "outputs": [], 359 | "source": [ 360 | "guide_prompt_template = \"\"\"\n", 361 | "\n", 362 | "Below is a script from a video that I am making into a companion guide blog post first. \\\n", 363 | "You are a helpful assistant made to assist in the creation I'm doing. 
\\\n", 364 | "This is a continuation of a guide so include chapters, key summaries, and incorporate visual aids and direct links to relevant parts of the video, \\\n", 365 | "however do not include any conclusion or overarching title. \\\n", 366 | "For visual aids, specific frames from the video will be identified where images can be inserted to enhance understanding. \\\n", 367 | "For direct links, portions of the text should be hyperlinked to their corresponding times in the video. \\\n", 368 | "To indicate that a sentence should be hyperlinked, insert the raw text of the transcript next to the word with the indicator . \\\n", 369 | "To indicate a picture regarding the text, insert the indicator . \\\n", 370 | "It is crucial to use the raw text from the transcript that will be used, as the additional tools that will be inserting the hyperlinks and pictures need this to know where in the video to look.\n", 371 | "\n", 372 | "In this blog post, in addition to the paragraphs: \\\n", 373 | "\n", 374 | "Create titles or headings that encapsulate main points and ideas \\\n", 375 | "\n", 376 | "Format your response in markdown, ensuring distinction and clean styling between titles and paragraphs. \\\n", 377 | "Be sure to include the image placeholders, and hyperlinks with enough distinguishable text WITHOUT ANY QUOTATIONS, as the placeholders will be fed into a semantic search algorithm. \\\n", 378 | "This structured approach will be applied to the entire transcript. \\\n", 379 | "The example below only shows one style, but use multiple styles including different headings, bullet points, and other markdown elements when needed. \\\n", 380 | "\n", 381 | "Here is a shortened example of the input and of the expected output:\n", 382 | "\n", 383 | "example input:\n", 384 | "\n", 385 | "Hi everyone. So in this video I'd like us to cover the process of tokenization in large language models. 
Now you see here that I have a sad face and that's because tokenization is my least favorite part of working with large language models but unfortunately it is necessary to understand in some detail because it is fairly hairy, gnarly and there's a lot of hidden foot gums to be aware of and a lot of oddness with large language models typically traces back to tokenization. So what is tokenization? Now in my previous video Let's Build GPT from Scratch we actually already did tokenization but we did a very naive simple version of tokenization. So when you go to the Google Colab for that video you see here that we loaded our training set and our training set was this Shakespeare dataset. Now in the beginning the Shakespeare dataset is just a large string in Python it's just text and so the question is how do we plug text into large language models and in this case here we created a vocabulary of 65 possible characters that we saw occur in this string. These were the possible characters and we saw that there are 65 of them and then we created a lookup table for converting from every possible character a little string piece into a token an integer. So here for example we tokenized the string hi there and we received this sequence of tokens and here we took the first 1000 characters of our dataset and we encoded it into tokens and because this is character level we received 1000 tokens in a sequence so token 18, 47, etc. 
Now later we saw that the way we plug these tokens into the language model is by using an embedding table and so basically if we have 65 possible tokens then this embedding table is going to have 65 rows and roughly speaking we're taking the integer associated with every single token we're using that as a lookup into this table and we're plucking out the corresponding row and this row is trainable parameters that we're going to train using backpropagation and this is the vector that then feeds into the transformer and that's how the transformer sort of perceives every single token. So here we had a very naive tokenization process that was a character level tokenizer\n", 386 | "\n", 387 | "example output:\n", 388 | "\n", 389 | "Introduction to Tokenization\n", 390 | "----------------------------\n", 391 | "\n", 392 | "Welcome to our comprehensive guide on tokenization in large language models (LLMs). Tokenization is a critical yet complex aspect of working with LLMs, essential for understanding how these models process text data. Despite its challenges, tokenization is foundational, as it converts strings of text into sequences of tokens, small units of text that LLMs can manage more effectively.\n", 393 | "\n", 394 | "