├── README.md
└── app.py

/README.md:
--------------------------------------------------------------------------------

# NotebookLlama: An Open Source version of NotebookLM

![image](https://github.com/user-attachments/assets/37296d45-f862-4a29-960f-4a1447a0da08)

## Audio Example
https://peregrine-results.s3.amazonaws.com/pigeon/Rg3RFytruIL15cOV3x_0.mp3

This is a guided series of tutorials/notebooks that can be taken as a reference or course for building a PDF-to-podcast workflow with voice cloning capabilities.

You will learn from experiments using text-to-speech models and voice cloning technology.

It assumes zero knowledge of LLMs, prompting, audio models, and voice cloning - everything is covered in the respective notebooks.

### Outline:

Here is the step-by-step thought (pun intended) for the task:

- Step 1: Pre-process PDF: Use `Llama-3.2-1B-Instruct` to pre-process the PDF and save it in a `.txt` file.
- Step 2: Transcript Writer: Use the `Llama-3.1-70B-Instruct` model to write a podcast transcript from the text.
- Step 3: Dramatic Re-Writer: Use the `Llama-3.1-8B-Instruct` model to make the transcript more dramatic.
- Step 4: Text-to-Speech Workflow: Use `parler-tts/parler-tts-mini-v1`, `bark/suno`, or `PlayHT` to generate a conversational podcast with optional voice cloning.

Note 1: In Step 1, we prompt the 1B model not to modify or summarize the text, but strictly to clean up extra characters or garbage characters that get picked up from the PDF's encoding. Please see the prompt in Notebook 1 for more details.

Note 2: For Step 2, you can also use the `Llama-3.1-8B-Instruct` model; we recommend experimenting to see if you notice any differences. The 70B model was used here because it gave slightly more creative podcast transcripts on the tested examples.

Note 3: For Step 4, please try to extend the approach with other models. These models were chosen based on a sample prompt and worked best; newer models might sound better. Please see [Notes](./TTS_Notes.md) for some of the sample tests.

Note 4: For voice cloning capabilities, we've integrated PlayHT, which lets you either use preset voices or clone your own voice for more personalized audio generation. The voice cloning feature requires a PlayHT account and API credentials.
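
To make Step 1 concrete, here is a minimal sketch of what the cleanup call can look like with the `transformers` chat pipeline. The system prompt and sample text below are illustrative stand-ins, not the exact prompt from Notebook 1:

```python
# Sketch of the Step 1 cleanup pass (illustrative prompt, not the one from Notebook 1).
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

raw_pdf_chunk = "Ex- ample te xt ext racted fr om a PDF..."  # stand-in for one extracted chunk

messages = [
    {
        "role": "system",
        "content": (
            "Clean up the following text extracted from a PDF. Do not modify "
            "or summarize it; strictly remove garbage characters and fix "
            "encoding artifacts."
        ),
    },
    {"role": "user", "content": raw_pdf_chunk},
]

out = pipe(messages, max_new_tokens=512)
print(out[0]["generated_text"][-1]["content"])  # the cleaned chunk
```

Running every chunk through a call like this and concatenating the outputs into a `.txt` file is the whole of Step 1.
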
### Detailed steps on running the notebook:

Requirements:
- A GPU server or an API provider for the 70B, 8B, and 1B Llama models
- For the 70B model, a GPU setup with roughly 140GB of aggregate memory to run inference in bfloat16 precision
- A PlayHT account and API credentials (for voice cloning)
- An audio sample for voice cloning (optional)

Note: For our GPU-poor friends, you can also use the 8B and lower models for the entire pipeline. There is no strong recommendation; the pipeline above is simply what worked best in the first few tests. Try it and see what works best for you!

Before getting started:
1. Get your Hugging Face access token from [here](https://huggingface.co/settings/tokens)
2. Run `huggingface-cli login` and paste your access token
3. Launch your Jupyter notebook server
4. Set up your PlayHT credentials if using voice cloning

Installation:
```bash
git clone https://github.com/meta-llama/llama-recipes
cd llama-recipes/recipes/quickstart/NotebookLlama/
pip install -r requirements.txt
```

### Notebook Walkthrough:

#### Notebook 1:
This notebook processes the PDF into a `.txt` file using the new featherlight `Llama-3.2-1B-Instruct` model.
- Update the first cell with your PDF link
- Experiment with the `Llama-3.2-1B-Instruct` model prompts

#### Notebook 2:
Takes the processed output and creates a podcast transcript using `Llama-3.1-70B-Instruct`.
- Try the 405B model if GPU-rich
- Experiment with the system prompts
- Compare with the 8B model's results

#### Notebook 3:
Adds dramatization using `Llama-3.1-8B-Instruct`.
- Creates conversation tuples for easier processing
- Customizes speaker-specific prompts
- Try the 3B and 1B models

#### Notebook 4:
Converts the transcript to podcast audio using multiple options:
1. Traditional approach:
   - `parler-tts/parler-tts-mini-v1`
   - `bark/suno` models
2. Voice cloning approach (PlayHT):
   - Upload a voice sample (30s - 1min recommended)
   - Use the generated voice ID or preset voices
   - Generate high-quality TTS with the cloned voice

Note: Parler requires transformers 4.43.3 or earlier, while steps 1-3 need the latest version. A minimal Parler sketch follows below.
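
For the traditional approach in Notebook 4, here is a minimal Parler-TTS sketch following the model card's usage pattern; the voice description and prompt text are illustrative:

```python
# Generate one spoken line with Parler-TTS (requires transformers <= 4.43.3).
import torch
import soundfile as sf
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained(
    "parler-tts/parler-tts-mini-v1"
).to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

# Parler conditions on a free-text voice description plus the line to speak.
description = "A female speaker delivers a lively, expressive podcast intro."
prompt = "Welcome back to the show! Today we're diving into NotebookLlama."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
sf.write("speaker_line.wav", generation.cpu().numpy().squeeze(), model.config.sampling_rate)
```

Generating each speaker turn with a different `description` and concatenating the clips gives the two-speaker podcast effect.
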
### Next-Improvements/Further ideas:

- Speech-model experimentation with advanced voice cloning
- An LLM vs. LLM debate feature
- Testing 405B for transcripts
- Better prompting strategies
- Support for website, audio, and YouTube inputs
- Enhanced voice cloning capabilities
- Multi-speaker voice cloning support

### Resources for further learning:

- https://betterprogramming.pub/text-to-audio-generation-with-bark-clearly-explained-4ee300a3713a
- https://colab.research.google.com/drive/1dWWkZzvu7L9Bunq9zvD-W02RFUXoW-Pd?usp=sharing
- https://colab.research.google.com/drive/1eJfA2XUa-mXwdMy7DoYKVYHI1iTd9Vkt?usp=sharing#scrollTo=NyYQ--3YksJY
- https://replicate.com/suno-ai/bark?prediction=zh8j6yddxxrge0cjp9asgzd534
- https://suno-ai.notion.site/8b8e8749ed514b0cbf3f699013548683?v=bc67cff786b04b50b3ceb756fd05f68c
- https://docs.play.ht/reference/api-getting-started
- https://play.ht/voice-cloning

This project welcomes community contributions and PRs for any improvements!

--------------------------------------------------------------------------------
/app.py:
--------------------------------------------------------------------------------
import os
import asyncio
import logging
from dataclasses import dataclass
from typing import Optional, Dict, Any

import pypdf
from groq import Groq
from pyht import Client
from pyht.client import TTSOptions

# Set up logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

# Initialize Groq client (set GROQ_API_KEY in your environment)
client = Groq(
    api_key=os.environ.get("GROQ_API_KEY", "add your api key here"),
)

# Initialize PlayHT client (set PLAYHT_USER_ID and PLAYHT_API_KEY in your environment)
playht_client = Client(
    user_id=os.environ.get("PLAYHT_USER_ID", "add your user id here"),
    api_key=os.environ.get("PLAYHT_API_KEY", "add your api key here"),
)

@dataclass
class ProcessingConfig:
    """Configuration for the NotebookLlama processing pipeline"""
    temperature: float = 0.7
    max_tokens: int = 1500
    pdf_chunk_size: int = 4000

class NotebookLlama:
    def __init__(self, config: Optional[ProcessingConfig] = None):
        self.config = config or ProcessingConfig()

    def _extract_text_from_pdf(self, pdf_path: str) -> str:
        """Extract text from PDF while maintaining structure"""
        pdf_reader = pypdf.PdfReader(pdf_path)
        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text() + "\n\n"
        return text

    def _chunk_text(self, text: str, chunk_size: int = 4000) -> list[str]:
        """Split text into manageable chunks by greedily packing whole words"""
        words = text.split()
        chunks = []
        current_chunk = []
        current_length = 0

        for word in words:
            if current_length + len(word) + 1 > chunk_size:
                # Adding this word would exceed the limit; start a new chunk
                chunks.append(" ".join(current_chunk))
                current_chunk = [word]
                current_length = len(word)
            else:
                current_chunk.append(word)
                current_length += len(word) + 1  # +1 for the joining space

        if current_chunk:
            chunks.append(" ".join(current_chunk))

        return chunks
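
    # Illustration (hypothetical values): with chunk_size=10, the greedy
    # word-packing above splits "aa bb cc dd" into ["aa bb cc", "dd"], since
    # appending "dd" would push the running length past the limit.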
    async def preprocess_pdf(self, pdf_path: str) -> str:
        """Step 1: Preprocess PDF using llama-3.1-8b-instant"""
        raw_text = self._extract_text_from_pdf(pdf_path)
        chunks = self._chunk_text(raw_text, self.config.pdf_chunk_size)
        processed_chunks = []

        for chunk in chunks:
            response = client.chat.completions.create(
                messages=[
                    {
                        "role": "system",
                        "content": "Clean up this text extracted from a PDF while preserving context. Clean and structure the text while maintaining its original meaning."
                    },
                    {
                        "role": "user",
                        "content": chunk
                    }
                ],
                model="llama-3.1-8b-instant",
                temperature=self.config.temperature,
                max_tokens=self.config.max_tokens
            )
            processed_chunks.append(response.choices[0].message.content)

        return "\n".join(processed_chunks)

    async def generate_podcast_script(self, clean_text: str) -> str:
        """Step 2: Generate podcast script using llama-3.1-70b-versatile"""
        response = client.chat.completions.create(
            messages=[
                {
                    "role": "system",
                    "content": "Create an engaging podcast script from the following text. Make it conversational and easy to follow."
                },
                {
                    "role": "user",
                    "content": clean_text
                }
            ],
            model="llama-3.1-70b-versatile",
            temperature=self.config.temperature,
            max_tokens=self.config.max_tokens
        )
        return response.choices[0].message.content

    async def dramatize_script(self, podcast_script: str) -> str:
        """Step 3: Dramatize the podcast script using llama-3.1-8b-instant"""
        response = client.chat.completions.create(
            messages=[
                {
                    "role": "system",
                    "content": "Make this podcast script more dramatic and engaging while maintaining its core message."
                },
                {
                    "role": "user",
                    "content": podcast_script
                }
            ],
            model="llama-3.1-8b-instant",
            temperature=self.config.temperature,
            max_tokens=self.config.max_tokens
        )
        return response.choices[0].message.content

    async def generate_audio(self, dramatic_script: str, output_path: str = "podcast.mp3") -> str:
        """Step 4: Generate audio using PlayHT, limited to about 20 seconds"""
        try:
            # ~150 spoken words per minute works out to ~50 words in 20 seconds
            words_for_20_seconds = 50

            # Truncate the script to the estimated word count
            truncated_script = ' '.join(dramatic_script.split()[:words_for_20_seconds])

            logger.debug(f"Truncated script: {truncated_script}")

            # Generate audio using PlayHT with a preset voice
            options = TTSOptions(
                voice="s3://voice-cloning-zero-shot/a778f067-54d5-4e01-bc3a-3e7406dc0d2f/original/manifest.json",
                format="mp3"
            )

            logger.debug("Starting audio generation with PlayHT")
            audio_data = b''
            try:
                # The PlayHT client streams the audio back in chunks
                for i, chunk in enumerate(playht_client.tts(truncated_script, options)):
                    logger.debug(f"Received chunk {i}: {type(chunk)}, {len(chunk) if chunk is not None else 'None'} bytes")
                    if chunk is not None:
                        audio_data += chunk
                    else:
                        logger.warning(f"Received None chunk from PlayHT API at iteration {i}")
            except Exception as tts_error:
                logger.error(f"Error during TTS generation: {tts_error}", exc_info=True)
                raise

            logger.debug(f"Total audio data received: {len(audio_data)} bytes")

            if not audio_data:
                raise ValueError("No audio data received from PlayHT API")

            # Save the generated audio
            with open(output_path, 'wb') as f:
                f.write(audio_data)

            logger.info(f"20-second podcast audio generated and saved as: {output_path}")
            return f"20-second audio generated and saved as {output_path}"
        except Exception as e:
            logger.exception("Error in audio generation")
            return f"Error in audio generation: {str(e)}"
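
    # Assumption: to use a cloned voice instead of the preset manifest URL
    # above, create a clone from a 30s-1min sample in the PlayHT dashboard (or
    # via their API) and pass the voice it returns as TTSOptions(voice=...).
    # See https://play.ht/voice-cloning.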
    async def process_document(self, pdf_path: str, output_path: str = "podcast.mp3") -> Dict[str, Any]:
        """Process the entire pipeline"""
        try:
            # Step 1: Preprocess PDF
            clean_text = await self.preprocess_pdf(pdf_path)

            # Step 2: Generate podcast script
            podcast_script = await self.generate_podcast_script(clean_text)

            # Step 3: Dramatize script
            dramatic_script = await self.dramatize_script(podcast_script)

            # Step 4: Generate audio
            audio_status = await self.generate_audio(dramatic_script, output_path)

            return {
                "status": "success",
                "clean_text": clean_text,
                "podcast_script": podcast_script,
                "dramatic_script": dramatic_script,
                "audio_status": audio_status
            }

        except Exception as e:
            return {
                "status": "error",
                "error": str(e)
            }

# Example usage
async def main():
    # Initialize NotebookLlama with custom configuration
    config = ProcessingConfig(
        temperature=0.8,
        max_tokens=2000,
        pdf_chunk_size=3000
    )

    notebook_llama = NotebookLlama(config)

    # Process a document
    result = await notebook_llama.process_document(
        pdf_path="sample_tech_article.pdf",
        output_path="D:\\Deep learning\\output_podcast.mp3"  # Make sure this path is valid and writable
    )

    # Print results
    if result["status"] == "success":
        print("Processing completed successfully!")
        print(f"Audio status: {result['audio_status']}")
    else:
        print(f"Error during processing: {result['error']}")

if __name__ == "__main__":
    asyncio.run(main())
--------------------------------------------------------------------------------