├── README.md
└── app.py

/README.md:
--------------------------------------------------------------------------------

# NotebookLlama: An Open Source version of NotebookLM

![image](https://github.com/user-attachments/assets/37296d45-f862-4a29-960f-4a1447a0da08)

## Audio Example
https://peregrine-results.s3.amazonaws.com/pigeon/Rg3RFytruIL15cOV3x_0.mp3

This is a guided series of tutorials/notebooks that can be taken as a reference or course for building a PDF-to-podcast workflow with voice cloning capabilities.

You will learn from experiments using text-to-speech models and voice cloning technology.

It assumes zero knowledge of LLMs, prompting, audio models, and voice cloning - everything is covered in the respective notebooks.

### Outline:

Here is the step-by-step thought (pun intended) for the task:

- Step 1: Pre-process PDF: Use `Llama-3.2-1B-Instruct` to pre-process the PDF and save it in a `.txt` file.
- Step 2: Transcript Writer: Use the `Llama-3.1-70B-Instruct` model to write a podcast transcript from the text.
- Step 3: Dramatic Re-Writer: Use the `Llama-3.1-8B-Instruct` model to make the transcript more dramatic.
- Step 4: Text-to-Speech Workflow: Use `parler-tts/parler-tts-mini-v1`, `bark/suno`, or `PlayHT` to generate a conversational podcast with optional voice cloning.

Note 1: In Step 1, we prompt the 1B model not to modify or summarize the text, but strictly to clean up extra characters or garbage characters that get picked up from the PDF's encoding. Please see the prompt in Notebook 1 for more details.

Note 2: For Step 2, you can also use the `Llama-3.1-8B-Instruct` model; we recommend experimenting to see if you notice any differences. The 70B model was used here because it gave slightly more creative podcast transcripts on the tested examples.

Note 3: For Step 4, please try to extend the approach with other models. These models were chosen based on a sample prompt and worked best; newer models might sound better. Please see [Notes](./TTS_Notes.md) for some of the sample tests.

Note 4: For voice cloning capabilities, we've integrated PlayHT, which lets you either use preset voices or clone your own voice for more personalized audio generation. The voice cloning feature requires a PlayHT account and API credentials.
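
To make Step 1 concrete, here is a minimal sketch of what the cleanup call can look like with the `transformers` chat pipeline. The system prompt and sample text below are illustrative stand-ins, not the exact prompt from Notebook 1:

```python
# Sketch of the Step 1 cleanup pass (illustrative prompt, not the one from Notebook 1).
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

raw_pdf_chunk = "Ex- ample te xt ext racted fr om a PDF..."  # stand-in for one extracted chunk

messages = [
    {
        "role": "system",
        "content": (
            "Clean up the following text extracted from a PDF. Do not modify "
            "or summarize it; strictly remove garbage characters and fix "
            "encoding artifacts."
        ),
    },
    {"role": "user", "content": raw_pdf_chunk},
]

out = pipe(messages, max_new_tokens=512)
print(out[0]["generated_text"][-1]["content"])  # the cleaned chunk
```

Running every chunk through a call like this and concatenating the outputs into a `.txt` file is the whole of Step 1.
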
### Detailed steps on running the notebook:

Requirements:
- A GPU server or an API provider for the 70B, 8B, and 1B Llama models
- For the 70B model, a GPU setup with roughly 140GB of aggregate memory to run inference in bfloat16 precision
- A PlayHT account and API credentials (for voice cloning)
- An audio sample for voice cloning (optional)

Note: For our GPU-poor friends, you can also use the 8B and lower models for the entire pipeline. There is no strong recommendation; the pipeline above is simply what worked best in the first few tests. Try it and see what works best for you!

Before getting started:
1. Get your Hugging Face access token from [here](https://huggingface.co/settings/tokens)
2. Run `huggingface-cli login` and paste your access token
3. Launch your Jupyter notebook server
4. Set up your PlayHT credentials if using voice cloning

Installation:
```bash
git clone https://github.com/meta-llama/llama-recipes
cd llama-recipes/recipes/quickstart/NotebookLlama/
pip install -r requirements.txt
```

### Notebook Walkthrough:

#### Notebook 1:
This notebook processes the PDF into a `.txt` file using the new featherlight `Llama-3.2-1B-Instruct` model.
- Update the first cell with your PDF link
- Experiment with the `Llama-3.2-1B-Instruct` model prompts

#### Notebook 2:
Takes the processed output and creates a podcast transcript using `Llama-3.1-70B-Instruct`.
- Try the 405B model if GPU-rich
- Experiment with the system prompts
- Compare with the 8B model's results

#### Notebook 3:
Adds dramatization using `Llama-3.1-8B-Instruct`.
- Creates conversation tuples for easier processing
- Customizes speaker-specific prompts
- Try the 3B and 1B models

#### Notebook 4:
Converts the transcript to podcast audio using multiple options:
1. Traditional approach:
   - `parler-tts/parler-tts-mini-v1`
   - `bark/suno` models
2. Voice cloning approach (PlayHT):
   - Upload a voice sample (30s - 1min recommended)
   - Use the generated voice ID or preset voices
   - Generate high-quality TTS with the cloned voice

Note: Parler requires transformers 4.43.3 or earlier, while steps 1-3 need the latest version. A minimal Parler sketch follows below.
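
For the traditional approach in Notebook 4, here is a minimal Parler-TTS sketch following the model card's usage pattern; the voice description and prompt text are illustrative:

```python
# Generate one spoken line with Parler-TTS (requires transformers <= 4.43.3).
import torch
import soundfile as sf
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained(
    "parler-tts/parler-tts-mini-v1"
).to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

# Parler conditions on a free-text voice description plus the line to speak.
description = "A female speaker delivers a lively, expressive podcast intro."
prompt = "Welcome back to the show! Today we're diving into NotebookLlama."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
sf.write("speaker_line.wav", generation.cpu().numpy().squeeze(), model.config.sampling_rate)
```

Generating each speaker turn with a different `description` and concatenating the clips gives the two-speaker podcast effect.
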
### Next-Improvements/Further ideas:

- Speech-model experimentation with advanced voice cloning
- An LLM vs. LLM debate feature
- Testing 405B for transcripts
- Better prompting strategies
- Support for website, audio, and YouTube inputs
- Enhanced voice cloning capabilities
- Multi-speaker voice cloning support

### Resources for further learning:

- https://betterprogramming.pub/text-to-audio-generation-with-bark-clearly-explained-4ee300a3713a
- https://colab.research.google.com/drive/1dWWkZzvu7L9Bunq9zvD-W02RFUXoW-Pd?usp=sharing
- https://colab.research.google.com/drive/1eJfA2XUa-mXwdMy7DoYKVYHI1iTd9Vkt?usp=sharing#scrollTo=NyYQ--3YksJY
- https://replicate.com/suno-ai/bark?prediction=zh8j6yddxxrge0cjp9asgzd534
- https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c
- https://docs.play.ht/reference/api-getting-started
- https://play.ht/voice-cloning

This project welcomes community contributions and PRs for any improvements!

--------------------------------------------------------------------------------
/app.py:
--------------------------------------------------------------------------------
import os
import asyncio
import logging
from dataclasses import dataclass
from typing import Optional, Dict, Any

import pypdf
from groq import Groq
from pyht import Client
from pyht.client import TTSOptions

# Set up logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

# Initialize Groq client (set GROQ_API_KEY in your environment)
client = Groq(
    api_key=os.environ.get("GROQ_API_KEY", "add your api key here"),
)

# Initialize PlayHT client (set PLAYHT_USER_ID and PLAYHT_API_KEY in your environment)
playht_client = Client(
    user_id=os.environ.get("PLAYHT_USER_ID", "add your user id here"),
    api_key=os.environ.get("PLAYHT_API_KEY", "add your api key here"),
)

@dataclass
class ProcessingConfig:
    """Configuration for the NotebookLlama processing pipeline"""
    temperature: float = 0.7
    max_tokens: int = 1500
    pdf_chunk_size: int = 4000

class NotebookLlama:
    def __init__(self, config: Optional[ProcessingConfig] = None):
        self.config = config or ProcessingConfig()

    def _extract_text_from_pdf(self, pdf_path: str) -> str:
        """Extract text from PDF while maintaining structure"""
        pdf_reader = pypdf.PdfReader(pdf_path)
        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text() + "\n\n"
        return text

    def _chunk_text(self, text: str, chunk_size: int = 4000) -> list[str]:
        """Split text into manageable chunks by greedily packing whole words"""
        words = text.split()
        chunks = []
        current_chunk = []
        current_length = 0

        for word in words:
            if current_length + len(word) + 1 > chunk_size:
                # Adding this word would exceed the limit; start a new chunk
                chunks.append(" ".join(current_chunk))
                current_chunk = [word]
                current_length = len(word)
            else:
                current_chunk.append(word)
                current_length += len(word) + 1  # +1 for the joining space

        if current_chunk:
            chunks.append(" ".join(current_chunk))

        return chunks
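
    # Illustration (hypothetical values): with chunk_size=10, the greedy
    # word-packing above splits "aa bb cc dd" into ["aa bb cc", "dd"], since
    # appending "dd" would push the running length past the limit.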
    async def preprocess_pdf(self, pdf_path: str) -> str:
        """Step 1: Preprocess PDF using llama-3.1-8b-instant"""
        raw_text = self._extract_text_from_pdf(pdf_path)
        chunks = self._chunk_text(raw_text, self.config.pdf_chunk_size)
        processed_chunks = []

        for chunk in chunks:
            response = client.chat.completions.create(
                messages=[
                    {
                        "role": "system",
                        "content": "Clean up this text extracted from a PDF while preserving context. Clean and structure the text while maintaining its original meaning."
                    },
                    {
                        "role": "user",
                        "content": chunk
                    }
                ],
                model="llama-3.1-8b-instant",
                temperature=self.config.temperature,
                max_tokens=self.config.max_tokens
            )
            processed_chunks.append(response.choices[0].message.content)

        return "\n".join(processed_chunks)

    async def generate_podcast_script(self, clean_text: str) -> str:
        """Step 2: Generate podcast script using llama-3.1-70b-versatile"""
        response = client.chat.completions.create(
            messages=[
                {
                    "role": "system",
                    "content": "Create an engaging podcast script from the following text. Make it conversational and easy to follow."
                },
                {
                    "role": "user",
                    "content": clean_text
                }
            ],
            model="llama-3.1-70b-versatile",
            temperature=self.config.temperature,
            max_tokens=self.config.max_tokens
        )
        return response.choices[0].message.content

    async def dramatize_script(self, podcast_script: str) -> str:
        """Step 3: Dramatize the podcast script using llama-3.1-8b-instant"""
        response = client.chat.completions.create(
            messages=[
                {
                    "role": "system",
                    "content": "Make this podcast script more dramatic and engaging while maintaining its core message."
                },
                {
                    "role": "user",
                    "content": podcast_script
                }
            ],
            model="llama-3.1-8b-instant",
            temperature=self.config.temperature,
            max_tokens=self.config.max_tokens
        )
        return response.choices[0].message.content

    async def generate_audio(self, dramatic_script: str, output_path: str = "podcast.mp3") -> str:
        """Step 4: Generate audio using PlayHT, limited to about 20 seconds"""
        try:
            # ~150 spoken words per minute works out to ~50 words in 20 seconds
            words_for_20_seconds = 50

            # Truncate the script to the estimated word count
            truncated_script = ' '.join(dramatic_script.split()[:words_for_20_seconds])

            logger.debug(f"Truncated script: {truncated_script}")

            # Generate audio using PlayHT with a preset voice
            options = TTSOptions(
                voice="s3://voice-cloning-zero-shot/a778f067-54d5-4e01-bc3a-3e7406dc0d2f/original/manifest.json",
                format="mp3"
            )

            logger.debug("Starting audio generation with PlayHT")
            audio_data = b''
            try:
                # The PlayHT client streams the audio back in chunks
                for i, chunk in enumerate(playht_client.tts(truncated_script, options)):
                    logger.debug(f"Received chunk {i}: {type(chunk)}, {len(chunk) if chunk is not None else 'None'} bytes")
                    if chunk is not None:
                        audio_data += chunk
                    else:
                        logger.warning(f"Received None chunk from PlayHT API at iteration {i}")
            except Exception as tts_error:
                logger.error(f"Error during TTS generation: {tts_error}", exc_info=True)
                raise

            logger.debug(f"Total audio data received: {len(audio_data)} bytes")

            if not audio_data:
                raise ValueError("No audio data received from PlayHT API")

            # Save the generated audio
            with open(output_path, 'wb') as f:
                f.write(audio_data)

            logger.info(f"20-second podcast audio generated and saved as: {output_path}")
            return f"20-second audio generated and saved as {output_path}"
        except Exception as e:
            logger.exception("Error in audio generation")
            return f"Error in audio generation: {str(e)}"
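
    # Assumption: to use a cloned voice instead of the preset manifest URL
    # above, create a clone from a 30s-1min sample in the PlayHT dashboard (or
    # via their API) and pass the voice it returns as TTSOptions(voice=...).
    # See https://play.ht/voice-cloning.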
    async def process_document(self, pdf_path: str, output_path: str = "podcast.mp3") -> Dict[str, Any]:
        """Process the entire pipeline"""
        try:
            # Step 1: Preprocess PDF
            clean_text = await self.preprocess_pdf(pdf_path)

            # Step 2: Generate podcast script
            podcast_script = await self.generate_podcast_script(clean_text)

            # Step 3: Dramatize script
            dramatic_script = await self.dramatize_script(podcast_script)

            # Step 4: Generate audio
            audio_status = await self.generate_audio(dramatic_script, output_path)

            return {
                "status": "success",
                "clean_text": clean_text,
                "podcast_script": podcast_script,
                "dramatic_script": dramatic_script,
                "audio_status": audio_status
            }

        except Exception as e:
            return {
                "status": "error",
                "error": str(e)
            }

# Example usage
async def main():
    # Initialize NotebookLlama with custom configuration
    config = ProcessingConfig(
        temperature=0.8,
        max_tokens=2000,
        pdf_chunk_size=3000
    )

    notebook_llama = NotebookLlama(config)

    # Process a document
    result = await notebook_llama.process_document(
        pdf_path="sample_tech_article.pdf",
        output_path="D:\\Deep learning\\output_podcast.mp3"  # Make sure this path is valid and writable
    )

    # Print results
    if result["status"] == "success":
        print("Processing completed successfully!")
        print(f"Audio status: {result['audio_status']}")
    else:
        print(f"Error during processing: {result['error']}")

if __name__ == "__main__":
    asyncio.run(main())
--------------------------------------------------------------------------------