├── README.md
├── Day-01: Python Foundations for GenAI
│   ├── assignment.md
│   └── README.md
├── Day-02: Generative AI & LLM Basics
│   ├── assignment.md
│   └── README.md
├── Day-03: Prompt Engineering Essentials
│   ├── assignment.md
│   └── README.md
├── Day-04: Chunking & Data Extraction (PDF-Web-Docs)
│   ├── assignment.md
│   └── README.md
├── Day-05: Embeddings & Vector Databases
│   ├── assignment.md
│   └── README.md
├── Day-06: RAG Fundamentals (Retrieval → Augmentation → Generation)
│   ├── assignment.md
│   └── README.md
├── Day-07: Implement RAG From Scratch (Pure Python)
│   ├── assignment.md
│   └── README.md
├── Day-08: RAG Using LangChain or LlamaIndex
│   ├── assignment.md
│   └── README.md
├── Day-09: Advanced RAG (Reranking, Query Rewriting, Fusion)
│   ├── assignment.md
│   └── README.md
├── Day-10: Build & Deploy a RAG Application (FastAPI-Streamlit)
│   ├── assignment.md
│   └── README.md
└── RAG Projects
    └── readme.md

/README.md:
--------------------------------------------------------------------------------
1 | # 🚀 10-Day RAG Beginner Roadmap
2 | 
3 | Welcome to your comprehensive learning journey into **Retrieval-Augmented Generation (RAG)**! This repository is designed for absolute beginners who want to master RAG from the ground up in just 10 days.
4 | 
5 | ## 📖 Description
6 | 
7 | This roadmap takes you from Python fundamentals all the way to building and deploying a complete RAG application. Each day builds upon the previous one, ensuring you have a solid foundation before moving to more advanced concepts. By the end of 10 days, you'll have hands-on experience with:
8 | 
9 | - Python programming for AI applications
10 | - Large Language Models (LLMs) and their capabilities
11 | - Prompt engineering techniques
12 | - Data extraction and chunking strategies
13 | - Vector embeddings and databases
14 | - Building RAG systems from scratch
15 | - Using frameworks like LangChain and LlamaIndex
16 | - Advanced RAG techniques
17 | - Deploying production-ready RAG applications
18 | 
19 | ## 🎯 How to Use This Repository:
20 | 
21 | 1. **Study Day-by-Day**: Follow the roadmap sequentially, starting with Day 1
22 | 2. **Read the Notes**: Open each day's folder and read the `README.md` file thoroughly
23 | 3. **Complete Assignments**: Work through the `assignment.md` file for hands-on practice
24 | 4. **Practice Regularly**: Code along with the examples and complete all practice tasks
25 | 5. **Build Projects**: Each day includes a mini-project to reinforce your learning
26 | 
27 | ### Recommended Study Schedule:
28 | 
29 | - **Time per day**: 2-4 hours
30 | - **Read notes**: 30-60 minutes
31 | - **Complete assignments**: 1-2 hours
32 | - **Mini project**: 30-60 minutes
33 | 
34 | ## 🛠️ Technical Requirements:
35 | 
36 | ### Python Version
37 | - **Python 3.8 or higher** (Python 3.10+ recommended)
38 | 
39 | ### Required Libraries
40 | 
41 | You'll install these progressively throughout the roadmap:
42 | 
43 | ```bash
44 | # Core libraries
45 | pip install openai
46 | pip install langchain
47 | pip install llama-index
48 | pip install chromadb
49 | pip install sentence-transformers
50 | pip install pypdf
51 | pip install beautifulsoup4
52 | pip install requests
53 | pip install fastapi
54 | pip install streamlit
55 | pip install uvicorn
56 | ```
57 | 
58 | ### API Keys
59 | 
60 | You'll need API keys for certain days:
61 | 
62 | - **OpenAI API Key** (for Days 2, 3, and 5-10)
63 |   - Sign up at [platform.openai.com](https://platform.openai.com)
64 |   - Get your API key from the API keys section
65 |   - Store it securely (use environment variables)
66 | 
67 | ### Environment Setup
68 | 
69 | Create a `.env` file in the root directory:
70 | 
71 | ```env
72 | OPENAI_API_KEY=your_api_key_here
73 | ```
74 | 
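Once the `.env` file exists, a minimal sketch of loading the key in Python (this uses the `python-dotenv` package, installed with `pip install python-dotenv`, an extra dependency beyond the list above):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # copies values from .env into the process environment
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; check your .env file")
```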
75 | ## 📚 Roadmap Overview
76 | 
77 | | Day | Topic | Focus Area |
78 | |-----|-------|------------|
79 | | **Day 1** | Python Foundations for GenAI | Python basics, data structures, file handling, APIs |
80 | | **Day 2** | Generative AI & LLM Basics | Understanding LLMs, OpenAI API, model capabilities |
81 | | **Day 3** | Prompt Engineering Essentials | Crafting effective prompts, few-shot learning, chain-of-thought |
82 | | **Day 4** | Chunking & Data Extraction | PDF parsing, web scraping, document processing |
83 | | **Day 5** | Embeddings & Vector Databases | Vector embeddings, similarity search, ChromaDB |
84 | | **Day 6** | RAG Fundamentals | Retrieval → Augmentation → Generation pipeline |
85 | | **Day 7** | Implement RAG From Scratch | Building RAG system with pure Python |
86 | | **Day 8** | RAG Using LangChain or LlamaIndex | Using popular RAG frameworks |
87 | | **Day 9** | Advanced RAG | Reranking, query rewriting, fusion techniques |
88 | | **Day 10** | Build & Deploy RAG Application | FastAPI/Streamlit deployment, production considerations |
89 | 
90 | ## 🗂️ Repository Structure
91 | 
92 | ```
93 | rag-roadmap/
94 | │
95 | ├── Day-01: Python Foundations for GenAI/
96 | │   ├── README.md
97 | │   └── assignment.md
98 | │
99 | ├── Day-02: Generative AI & LLM Basics/
100 | │   ├── README.md
101 | │   └── assignment.md
102 | │
103 | ├── Day-03: Prompt Engineering Essentials/
104 | │   ├── README.md
105 | │   └── assignment.md
106 | │
107 | ├── Day-04: Chunking & Data Extraction (PDF-Web-Docs)/
108 | │   ├── README.md
109 | │   └── assignment.md
110 | │
111 | ├── Day-05: Embeddings & Vector Databases/
112 | │   ├── README.md
113 | │   └── assignment.md
114 | │
115 | ├── Day-06: RAG Fundamentals (Retrieval → Augmentation → Generation)/
116 | │   ├── README.md
117 | │   └── assignment.md
118 | │
119 | ├── Day-07: Implement RAG From Scratch (Pure Python)/
120 | │   ├── README.md
121 | │   └── assignment.md
122 | │
123 | ├── Day-08: RAG Using LangChain or LlamaIndex/
124 | │   ├── README.md
125 | │   └── assignment.md
126 | │
127 | ├── Day-09: Advanced RAG (Reranking, Query Rewriting, Fusion)/
128 | │   ├── README.md
129 | │   └── assignment.md
130 | │
131 | ├── Day-10: Build & Deploy a RAG Application (FastAPI-Streamlit)/
132 | │   ├── README.md
133 | │   └── assignment.md
134 | │
135 | ├── RAG Projects/
136 | │   └── readme.md
137 | │
138 | └── README.md (this file)
139 | ```
140 | 
141 | ## 💡 Learning Tips
142 | 
143 | 1. **Don't Skip Days**: Each day builds on previous concepts
144 | 2. **Code Along**: Type out the examples yourself, don't just read
145 | 3. **Experiment**: Modify examples to see what happens
146 | 4. **Ask Questions**: If something is unclear, research it
147 | 5. **Take Notes**: Write down key concepts in your own words
148 | 6. **Build Projects**: The mini-projects are crucial for understanding
149 | 
150 | ## 🎓 Prerequisites
151 | 
152 | - Basic understanding of programming concepts (variables, functions, loops)
153 | - Familiarity with command line/terminal
154 | - Willingness to learn and experiment
155 | - No prior AI/ML experience required!
156 | 
157 | ## 📝 Notes
158 | 
159 | - All code examples are beginner-friendly
160 | - Solutions are not provided for assignments (learning by doing!)
161 | - You can work at your own pace, but aim to complete one day's material per day
162 | - Feel free to revisit previous days if needed
163 | 
164 | ## 🙏 Credits
165 | 
166 | Created by **Chandra Sekhar**
167 | 
168 | ---
169 | 
170 | **Ready to start?** Navigate to `Day-01: Python Foundations for GenAI/` and begin your RAG journey! 🚀
171 | 

--------------------------------------------------------------------------------
/Day-01: Python Foundations for GenAI/assignment.md:
--------------------------------------------------------------------------------
1 | # Day 1 — Assignment
2 | 
3 | ## Instructions
4 | 
5 | Complete the following tasks to reinforce your Python foundations. Write all code in separate Python files (`.py`). Test your code thoroughly and make sure it runs without errors.
6 | 
7 | **Important:**
8 | - Use proper error handling
9 | - Add comments to explain your code
10 | - Follow Python naming conventions (snake_case for functions/variables)
11 | - Test with different inputs to ensure your code is robust
12 | 
13 | ---
14 | 
15 | ## Tasks
16 | 
17 | ### Task 1: Document Statistics Calculator
18 | 
19 | Create a function `calculate_document_stats(filename)` that:
20 | - Reads a text file
21 | - Returns a dictionary with:
22 |   - Total characters (including spaces)
23 |   - Total words
24 |   - Total sentences (split by `.`, `!`, `?`)
25 |   - Average words per sentence
26 |   - Most common word (and its frequency)
27 | 
28 | **Test file:** Create a `sample.txt` file with at least 5 sentences to test your function.
29 | 
30 | ---
31 | 
32 | ### Task 2: Text Chunker with Overlap
33 | 
34 | Implement a function `chunk_with_overlap(text, chunk_size, overlap)` that:
35 | - Splits text into chunks of `chunk_size` characters
36 | - Each chunk overlaps with the previous one by `overlap` characters
37 | - Returns a list of dictionaries, each containing:
38 |   - `chunk_id`: Sequential number (1, 2, 3...)
39 |   - `text`: The chunk text
40 |   - `start_pos`: Starting character position
41 |   - `end_pos`: Ending character position
42 |   - `word_count`: Number of words in chunk
43 | 
44 | **Example:**
45 | ```python
46 | text = "This is a sample text for chunking."
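# With chunk_size=10 and overlap=3, consecutive chunks start
# chunk_size - overlap = 7 characters apart, so start positions
# run 0, 7, 14, 21, 28, ... until the text is exhausted.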
47 | chunks = chunk_with_overlap(text, chunk_size=10, overlap=3) 48 | # Should create overlapping chunks 49 | ``` 50 | 51 | --- 52 | 53 | ### Task 3: Document Manager Class 54 | 55 | Create a `DocumentManager` class that: 56 | - Can load multiple documents 57 | - Stores each document with metadata (filename, content, word_count) 58 | - Has a method to find documents by keyword (searches in content) 59 | - Has a method to get statistics across all documents 60 | - Has a method to export all document info to a JSON file 61 | 62 | **Requirements:** 63 | - Use a dictionary to store documents (key: filename, value: document data) 64 | - Implement `add_document(filename, content)` 65 | - Implement `search_documents(keyword)` → returns list of matching filenames 66 | - Implement `get_all_stats()` → returns summary statistics 67 | - Implement `export_to_json(output_file)` → saves all document data 68 | 69 | --- 70 | 71 | ### Task 4: Text Preprocessing Function 72 | 73 | Write a function `preprocess_text(text)` that: 74 | - Converts text to lowercase 75 | - Removes all punctuation (keep spaces) 76 | - Removes extra whitespace (multiple spaces → single space) 77 | - Removes leading/trailing whitespace 78 | - Returns the cleaned text 79 | 80 | **Bonus:** Also create a function that removes stop words (common words like "the", "a", "an", "is", etc.) 81 | 82 | --- 83 | 84 | ### Task 5: File Batch Processor 85 | 86 | Create a function `process_multiple_files(file_list, output_dir)` that: 87 | - Takes a list of file paths 88 | - Reads each file 89 | - Processes it (calculate stats, chunk it, etc.) 90 | - Saves processed results to `output_dir` 91 | - Returns a summary report 92 | 93 | **Requirements:** 94 | - Handle errors gracefully (skip files that can't be read) 95 | - Create output directory if it doesn't exist 96 | - Save each file's stats as a separate JSON file 97 | - Return a dictionary with success/failure counts 98 | 99 | --- 100 | 101 | ## One Mini Project 102 | 103 | ### 📘 Build a Document Analyzer Tool 104 | 105 | Create a complete Python script `document_analyzer.py` that: 106 | 107 | 1. **Takes command-line arguments:** 108 | - Input file or directory 109 | - Output format (JSON, TXT, or both) 110 | - Chunk size (optional, default 200) 111 | 112 | 2. **For a single file:** 113 | - Reads the file 114 | - Calculates statistics (words, sentences, characters) 115 | - Chunks the text 116 | - Generates a word frequency report 117 | - Exports results 118 | 119 | 3. **For a directory:** 120 | - Processes all `.txt` files in the directory 121 | - Creates a summary report 122 | - Exports individual file reports 123 | 124 | 4. 
**Output includes:** 125 | - Document statistics 126 | - Top 10 most common words 127 | - Chunk information 128 | - Processing timestamp 129 | 130 | **Example usage:** 131 | ```bash 132 | python document_analyzer.py input.txt --output json --chunk-size 200 133 | python document_analyzer.py ./documents/ --output both 134 | ``` 135 | 136 | **Requirements:** 137 | - Use `argparse` for command-line arguments 138 | - Implement proper error handling 139 | - Use classes to organize your code 140 | - Include docstrings for all functions 141 | - Make it user-friendly with clear output messages 142 | 143 | **Deliverables:** 144 | - `document_analyzer.py` - Main script 145 | - `requirements.txt` - List of dependencies (if any) 146 | - Sample output files showing the results 147 | 148 | --- 149 | 150 | ## Expected Output Section 151 | 152 | ### Task 1 Expected Output: 153 | ```python 154 | stats = calculate_document_stats("sample.txt") 155 | # Output: { 156 | # 'characters': 245, 157 | # 'words': 42, 158 | # 'sentences': 5, 159 | # 'avg_words_per_sentence': 8.4, 160 | # 'most_common_word': ('the', 5) 161 | # } 162 | ``` 163 | 164 | ### Task 2 Expected Output: 165 | ```python 166 | chunks = chunk_with_overlap("Long text here...", 20, 5) 167 | # Output: [ 168 | # {'chunk_id': 1, 'text': 'Long text here...', 'start_pos': 0, 'end_pos': 20, 'word_count': 4}, 169 | # {'chunk_id': 2, 'text': 'here...more text', 'start_pos': 15, 'end_pos': 35, 'word_count': 3}, 170 | # ... 171 | # ] 172 | ``` 173 | 174 | ### Mini Project Expected Output: 175 | 176 | When you run the document analyzer: 177 | - Clear console output showing progress 178 | - Generated JSON/TXT files with analysis results 179 | - Summary statistics displayed in terminal 180 | - Error messages for any files that couldn't be processed 181 | 182 | **Example console output:** 183 | ``` 184 | Document Analyzer Tool 185 | ===================== 186 | Processing: sample.txt 187 | ✓ File processed successfully 188 | - Words: 1,234 189 | - Sentences: 45 190 | - Chunks: 12 191 | - Top word: 'the' (45 occurrences) 192 | Results saved to: output/sample_analysis.json 193 | ``` 194 | 195 | --- 196 | 197 | ## Submission Checklist 198 | 199 | - [ ] All 5 tasks completed and tested 200 | - [ ] Mini project fully functional 201 | - [ ] Code is well-commented 202 | - [ ] Error handling implemented 203 | - [ ] Code follows Python best practices 204 | - [ ] All files run without errors 205 | 206 | **Good luck!** 🚀 207 | 208 | -------------------------------------------------------------------------------- /Day-08: RAG Using LangChain or LlamaIndex/assignment.md: -------------------------------------------------------------------------------- 1 | # Day 8 — Assignment 2 | 3 | ## Instructions 4 | 5 | Build RAG systems using LangChain and LlamaIndex frameworks. Compare both approaches and understand their strengths. Install required libraries: 6 | 7 | ```bash 8 | pip install langchain llama-index openai chromadb 9 | ``` 10 | 11 | **Important:** 12 | - Try both frameworks 13 | - Compare approaches 14 | - Experiment with configurations 15 | - Document differences 16 | - Understand when to use which 17 | 18 | --- 19 | 20 | ## Tasks 21 | 22 | ### Task 1: LangChain RAG System 23 | 24 | Build a complete RAG system using LangChain `langchain_rag.py`: 25 | 26 | **Components:** 27 | 1. Document loading (PDF/TXT) 28 | 2. Text splitting 29 | 3. Vector store (ChromaDB) 30 | 4. Retrieval QA chain 31 | 5. 
Query interface 32 | 33 | **Requirements:** 34 | - Use LangChain components 35 | - Support multiple document types 36 | - Configurable chunking 37 | - Return sources with answers 38 | - Handle errors gracefully 39 | 40 | **Test with:** Multiple documents 41 | 42 | **Deliverable:** `task1_langchain_rag.py` 43 | 44 | --- 45 | 46 | ### Task 2: LlamaIndex RAG System 47 | 48 | Build a complete RAG system using LlamaIndex `llamaindex_rag.py`: 49 | 50 | **Components:** 51 | 1. Document loading 52 | 2. Index creation 53 | 3. Query engine 54 | 4. Response synthesis 55 | 5. Query interface 56 | 57 | **Requirements:** 58 | - Use LlamaIndex components 59 | - Custom service context 60 | - Configurable settings 61 | - Source retrieval 62 | - Error handling 63 | 64 | **Test with:** Same documents as Task 1 65 | 66 | **Deliverable:** `task2_llamaindex_rag.py` 67 | 68 | --- 69 | 70 | ### Task 3: Framework Configuration Comparison 71 | 72 | Create a comparison tool `framework_comparison.py`: 73 | 74 | **Compare:** 75 | - Code complexity (lines of code) 76 | - Setup time 77 | - Query performance 78 | - Feature availability 79 | - Ease of customization 80 | 81 | **Requirements:** 82 | - Build same RAG system with both 83 | - Measure performance metrics 84 | - Document differences 85 | - Create comparison report 86 | 87 | **Deliverable:** `task3_comparison.py` + comparison report 88 | 89 | --- 90 | 91 | ### Task 4: Advanced Features Exploration 92 | 93 | Explore advanced features in both frameworks: 94 | 95 | **LangChain:** 96 | - Conversational memory 97 | - Different chain types 98 | - Agents 99 | - Custom retrievers 100 | 101 | **LlamaIndex:** 102 | - Different index types 103 | - Advanced retrievers 104 | - Response modes 105 | - Node postprocessors 106 | 107 | **Requirements:** 108 | - Implement 2-3 advanced features from each 109 | - Document what each does 110 | - Show examples 111 | 112 | **Deliverable:** `task4_advanced_features.py` 113 | 114 | --- 115 | 116 | ### Task 5: Hybrid Approach 117 | 118 | Create a system that uses both frameworks `hybrid_rag.py`: 119 | 120 | **Ideas:** 121 | - Use LangChain for document loading 122 | - Use LlamaIndex for indexing 123 | - Combine retrieval strategies 124 | - Use best of both worlds 125 | 126 | **Requirements:** 127 | - Integrate both frameworks 128 | - Explain why you chose each component 129 | - Make it work seamlessly 130 | 131 | **Deliverable:** `task5_hybrid_rag.py` 132 | 133 | --- 134 | 135 | ## One Mini Project 136 | 137 | ### 🚀 Build a Framework Comparison RAG Application 138 | 139 | Create a comprehensive application `framework_rag_comparison.py` that demonstrates both LangChain and LlamaIndex. 140 | 141 | **Features:** 142 | 143 | 1. **Dual Framework Support:** 144 | - Switch between LangChain and LlamaIndex 145 | - Same documents, different frameworks 146 | - Side-by-side comparison 147 | 148 | 2. **Document Management:** 149 | - Load documents once 150 | - Index with both frameworks 151 | - Compare indexing time 152 | - Compare storage size 153 | 154 | 3. **Query Interface:** 155 | ``` 156 | === Framework RAG Comparison === 157 | 1. Load documents 158 | 2. Index with LangChain 159 | 3. Index with LlamaIndex 160 | 4. Query (LangChain) 161 | 5. Query (LlamaIndex) 162 | 6. Compare frameworks 163 | 7. Performance metrics 164 | 8. Exit 165 | ``` 166 | 167 | 4. 
**Comparison Features:** 168 | - Side-by-side query results 169 | - Performance metrics (time, tokens) 170 | - Answer quality comparison 171 | - Source comparison 172 | - Code complexity metrics 173 | 174 | 5. **Advanced Analysis:** 175 | - Response time comparison 176 | - Token usage comparison 177 | - Answer similarity 178 | - Source overlap 179 | - Quality scoring 180 | 181 | 6. **Reporting:** 182 | - Generate comparison reports 183 | - Export results 184 | - Visualize differences 185 | - Recommendations 186 | 187 | **Requirements:** 188 | - Clean, modular code 189 | - Both frameworks fully implemented 190 | - Fair comparison methodology 191 | - Detailed documentation 192 | - Performance metrics 193 | - User-friendly interface 194 | 195 | **Example Usage:** 196 | ```python 197 | app = FrameworkComparison() 198 | app.load_documents("./documents/") 199 | 200 | # Index with both 201 | app.index_langchain() 202 | app.index_llamaindex() 203 | 204 | # Compare 205 | results = app.compare_query("What is machine learning?") 206 | print("LangChain:", results["langchain"]["answer"]) 207 | print("LlamaIndex:", results["llamaindex"]["answer"]) 208 | print("Similarity:", results["similarity_score"]) 209 | ``` 210 | 211 | **Deliverables:** 212 | - `framework_rag_comparison.py` - Main application 213 | - `requirements.txt` - Dependencies 214 | - `README_frameworks.md` - Documentation 215 | - Comparison report template 216 | - Example outputs 217 | 218 | --- 219 | 220 | ## Expected Output Section 221 | 222 | ### Task 1 Expected Output: 223 | ```python 224 | # LangChain RAG 225 | from langchain.chains import RetrievalQA 226 | 227 | qa_chain = RetrievalQA.from_chain_type(...) 228 | result = qa_chain({"query": "What is Python?"}) 229 | 230 | # Output: 231 | { 232 | "result": "Python is a programming language...", 233 | "source_documents": [...] 234 | } 235 | ``` 236 | 237 | ### Task 2 Expected Output: 238 | ```python 239 | # LlamaIndex RAG 240 | index = VectorStoreIndex.from_documents(documents) 241 | query_engine = index.as_query_engine() 242 | response = query_engine.query("What is Python?") 243 | 244 | # Output: 245 | ResponseObject with: 246 | - response: "Python is a programming language..." 247 | - source_nodes: [...] 248 | ``` 249 | 250 | ### Task 3 Expected Output: 251 | ``` 252 | === Framework Comparison === 253 | 254 | LangChain: 255 | - Setup time: 2.3s 256 | - Query time: 1.2s 257 | - Code lines: 45 258 | - Features: High flexibility 259 | 260 | LlamaIndex: 261 | - Setup time: 1.8s 262 | - Query time: 0.9s 263 | - Code lines: 28 264 | - Features: RAG-optimized 265 | 266 | Recommendation: Use LlamaIndex for RAG-focused apps 267 | ``` 268 | 269 | ### Mini Project Expected Output: 270 | 271 | The comparison app should provide: 272 | - Fair side-by-side comparisons 273 | - Detailed metrics 274 | - Clear recommendations 275 | - Professional interface 276 | 277 | **Example session:** 278 | ``` 279 | === Framework RAG Comparison === 280 | Choose: 6 281 | 282 | Query: "What is RAG?" 283 | 284 | LangChain Result: 285 | Answer: RAG stands for Retrieval-Augmented Generation... 286 | Time: 1.2s | Tokens: 150 287 | 288 | LlamaIndex Result: 289 | Answer: RAG (Retrieval-Augmented Generation) is... 
290 | Time: 0.9s | Tokens: 145 291 | 292 | Comparison: 293 | - Answer similarity: 0.87 294 | - Time difference: 0.3s (LlamaIndex faster) 295 | - Token difference: 5 tokens 296 | - Source overlap: 2/3 documents 297 | ``` 298 | 299 | --- 300 | 301 | ## Submission Checklist 302 | 303 | - [ ] Task 1: LangChain RAG working 304 | - [ ] Task 2: LlamaIndex RAG working 305 | - [ ] Task 3: Comparison complete 306 | - [ ] Task 4: Advanced features explored 307 | - [ ] Task 5: Hybrid approach implemented 308 | - [ ] Mini project: Full comparison app 309 | - [ ] Both frameworks tested 310 | - [ ] Differences documented 311 | - [ ] Code is well-documented 312 | 313 | **Remember:** Frameworks save time, but understanding the fundamentals (Day 7) is crucial! 314 | 315 | **Good luck!** 🚀 316 | 317 | -------------------------------------------------------------------------------- /Day-02: Generative AI & LLM Basics/assignment.md: -------------------------------------------------------------------------------- 1 | # Day 2 — Assignment 2 | 3 | ## Instructions 4 | 5 | Complete these tasks to get hands-on experience with LLMs and the OpenAI API. Make sure you have: 6 | - An OpenAI API key (get one at platform.openai.com) 7 | - Python `openai` library installed: `pip install openai` 8 | - Your API key stored securely (use environment variables) 9 | 10 | **Important:** 11 | - Never commit your API key to version control 12 | - Use environment variables or `.env` files 13 | - Handle errors gracefully 14 | - Test with different prompts and parameters 15 | 16 | --- 17 | 18 | ## Tasks 19 | 20 | ### Task 1: API Setup and First Call 21 | 22 | Create a Python script that: 23 | 1. Loads your OpenAI API key from an environment variable 24 | 2. Makes a simple API call asking "What is Python?" 25 | 3. Prints the response 26 | 4. Displays the number of tokens used 27 | 5. Handles errors if the API key is missing or invalid 28 | 29 | **Deliverable:** `task1_first_call.py` 30 | 31 | --- 32 | 33 | ### Task 2: Temperature Comparison Tool 34 | 35 | Create a function that: 36 | - Takes a prompt as input 37 | - Sends the same prompt with 3 different temperature values: 0.1, 0.7, 1.5 38 | - Collects all responses 39 | - Returns a comparison showing how temperature affects output 40 | 41 | Test with prompts like: 42 | - "Write a haiku about coding" 43 | - "Explain machine learning in one sentence" 44 | - "Describe a futuristic city" 45 | 46 | **Deliverable:** `task2_temperature_comparison.py` 47 | 48 | --- 49 | 50 | ### Task 3: Token Counter 51 | 52 | Create a utility that: 53 | 1. Estimates token count for input text (rough estimate: 1 token ≈ 4 characters) 54 | 2. Makes an API call 55 | 3. Compares your estimate with the actual token count from the API response 56 | 4. Calculates the accuracy of your estimation 57 | 58 | **Bonus:** Use the `tiktoken` library for more accurate token counting. 
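For reference, a minimal sketch of the bonus using `tiktoken` (installed with `pip install tiktoken`; the fallback encoding name is an assumption that fits gpt-3.5-turbo-era models):

```python
import tiktoken

def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    """Count tokens exactly, using the tokenizer that matches the model."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name: fall back to a widely used encoding
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

text = "Tokens are not the same as words or characters."
print("tiktoken count:", count_tokens(text))
print("rough estimate:", len(text) / 4)  # the 1 token ≈ 4 characters heuristic
```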
59 | 60 | **Deliverable:** `task3_token_counter.py` 61 | 62 | --- 63 | 64 | ### Task 4: Simple Chatbot 65 | 66 | Build a simple chatbot that: 67 | - Maintains conversation history 68 | - Allows multiple turns of conversation 69 | - Remembers context from previous messages 70 | - Has a command to clear history (type "clear" or "reset") 71 | - Has a command to exit (type "quit" or "exit") 72 | 73 | **Features:** 74 | - Greet the user 75 | - Show conversation history 76 | - Handle empty inputs 77 | - Display token usage after each response 78 | 79 | **Deliverable:** `task4_chatbot.py` 80 | 81 | --- 82 | 83 | ### Task 5: Model Comparison Tool 84 | 85 | Create a script that: 86 | - Takes a prompt as input 87 | - Sends it to both `gpt-3.5-turbo` and `gpt-4` 88 | - Compares: 89 | - Response quality (subjective) 90 | - Response length 91 | - Token usage 92 | - Response time (if possible) 93 | - Displays a side-by-side comparison 94 | 95 | Test with prompts requiring: 96 | - Simple factual answers 97 | - Creative writing 98 | - Complex reasoning 99 | - Code generation 100 | 101 | **Deliverable:** `task5_model_comparison.py` 102 | 103 | --- 104 | 105 | ## One Mini Project 106 | 107 | ### 🤖 Build an LLM Playground Application 108 | 109 | Create a comprehensive Python application `llm_playground.py` that allows users to experiment with different LLM settings interactively. 110 | 111 | **Features:** 112 | 113 | 1. **Interactive Menu:** 114 | ``` 115 | === LLM Playground === 116 | 1. Single Prompt 117 | 2. Conversation Mode 118 | 3. Compare Models 119 | 4. Parameter Tuning 120 | 5. View History 121 | 6. Export Results 122 | 7. Exit 123 | ``` 124 | 125 | 2. **Single Prompt Mode:** 126 | - Enter a prompt 127 | - Adjust temperature, max_tokens, model 128 | - View response 129 | - Save to history 130 | 131 | 3. **Conversation Mode:** 132 | - Multi-turn conversation 133 | - View full conversation history 134 | - Clear conversation option 135 | 136 | 4. **Compare Models:** 137 | - Enter a prompt 138 | - Automatically test with gpt-3.5-turbo and gpt-4 139 | - Show side-by-side comparison 140 | - Show cost comparison (if possible) 141 | 142 | 5. **Parameter Tuning:** 143 | - Test same prompt with different: 144 | - Temperature values (0.0 to 2.0) 145 | - Max tokens (50 to 500) 146 | - Top P values 147 | - Display all results for comparison 148 | 149 | 6. **View History:** 150 | - Show all previous prompts and responses 151 | - Filter by model 152 | - Show token usage statistics 153 | 154 | 7. **Export Results:** 155 | - Export conversation history to JSON 156 | - Export to text file 157 | - Include metadata (tokens, model, timestamp) 158 | 159 | **Requirements:** 160 | - Use classes to organize code 161 | - Store conversation history in memory (or file) 162 | - Implement proper error handling 163 | - Add input validation 164 | - Make it user-friendly with clear prompts 165 | - Display token usage and estimated costs 166 | - Use color coding for different types of output (optional) 167 | 168 | **Example Interaction:** 169 | ``` 170 | === LLM Playground === 171 | Choose an option: 1 172 | 173 | Enter your prompt: Explain RAG in simple terms 174 | Model (gpt-3.5-turbo/gpt-4) [gpt-3.5-turbo]: 175 | Temperature (0.0-2.0) [0.7]: 0.5 176 | Max tokens [150]: 200 177 | 178 | [Processing...] 179 | 180 | Response: 181 | RAG stands for Retrieval-Augmented Generation... 182 | 183 | Tokens used: 45 184 | Estimated cost: $0.0001 185 | 186 | Save to history? 
(y/n): y 187 | ``` 188 | 189 | **Deliverables:** 190 | - `llm_playground.py` - Main application 191 | - `requirements.txt` - Dependencies 192 | - `README_playground.md` - Brief usage instructions 193 | - Sample output showing the application in action 194 | 195 | --- 196 | 197 | ## Expected Output Section 198 | 199 | ### Task 1 Expected Output: 200 | ``` 201 | API Key loaded successfully. 202 | Making API call... 203 | 204 | Response: Python is a high-level programming language... 205 | 206 | Tokens used: 25 207 | ``` 208 | 209 | ### Task 2 Expected Output: 210 | ``` 211 | === Temperature Comparison === 212 | Prompt: "Write a haiku about coding" 213 | 214 | Temperature 0.1: 215 | Code flows like water, 216 | Functions dance in harmony, 217 | Logic finds its way. 218 | 219 | Temperature 0.7: 220 | Bits and bytes align, 221 | Algorithms come alive, 222 | Code becomes art form. 223 | 224 | Temperature 1.5: 225 | Electric dreams pulse, 226 | Syntax sings in midnight glow, 227 | Digital poetry blooms. 228 | 229 | [Notice how creativity increases with temperature] 230 | ``` 231 | 232 | ### Task 4 Expected Output: 233 | ``` 234 | === Simple Chatbot === 235 | Hello! I'm your AI assistant. Type 'quit' to exit, 'clear' to reset. 236 | 237 | You: Hello, my name is Bob 238 | Assistant: Hello Bob! Nice to meet you. How can I help you today? 239 | 240 | You: What's my name? 241 | Assistant: Your name is Bob! 242 | 243 | You: clear 244 | [Conversation cleared] 245 | 246 | You: What's my name? 247 | Assistant: I don't have that information. Could you tell me your name? 248 | 249 | You: quit 250 | Goodbye! 251 | ``` 252 | 253 | ### Mini Project Expected Output: 254 | 255 | The playground should provide a smooth, interactive experience: 256 | - Clear menu navigation 257 | - Real-time responses 258 | - Formatted output with proper spacing 259 | - Error messages for invalid inputs 260 | - History tracking and export functionality 261 | - Professional-looking interface 262 | 263 | **Example session:** 264 | ``` 265 | === LLM Playground === 266 | 1. Single Prompt 267 | 2. Conversation Mode 268 | ... 269 | Choose: 1 270 | 271 | Enter prompt: What is machine learning? 272 | Model [gpt-3.5-turbo]: 273 | Temperature [0.7]: 274 | Max tokens [150]: 275 | 276 | Response: 277 | Machine learning is a subset of artificial intelligence... 278 | 279 | Tokens: 42 | Cost: $0.00008 280 | 281 | [1] Try again 282 | [2] Save to history 283 | [3] Main menu 284 | ``` 285 | 286 | --- 287 | 288 | ## Submission Checklist 289 | 290 | - [ ] Task 1: API setup working 291 | - [ ] Task 2: Temperature comparison functional 292 | - [ ] Task 3: Token counter implemented 293 | - [ ] Task 4: Chatbot maintains conversation 294 | - [ ] Task 5: Model comparison working 295 | - [ ] Mini project: Full playground application 296 | - [ ] All code includes error handling 297 | - [ ] API keys stored securely (not in code) 298 | - [ ] Code is well-commented 299 | 300 | **Remember:** Keep your API key secret! Never share it or commit it to version control. 301 | 302 | **Good luck!** 🚀 303 | 304 | -------------------------------------------------------------------------------- /Day-07: Implement RAG From Scratch (Pure Python)/assignment.md: -------------------------------------------------------------------------------- 1 | # Day 7 — Assignment 2 | 3 | ## Instructions 4 | 5 | Build a complete RAG system from scratch using pure Python. No frameworks allowed! This will help you understand every component deeply. 
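Before installing anything, it is worth seeing how small the core retrieval math is. A minimal NumPy sketch of cosine similarity and top-K selection (function names here are illustrative, not required by the tasks):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); 1.0 means identical direction
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_indices(query_embedding, chunk_embeddings, k=3):
    # Score every stored chunk against the query, best matches first
    scores = [cosine_similarity(query_embedding, c) for c in chunk_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

Your `VectorStore` in Task 1 is essentially bookkeeping around these two functions.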
Install only basic dependencies: 6 | 7 | ```bash 8 | pip install openai numpy pypdf 9 | ``` 10 | 11 | **Important:** 12 | - Build each component separately first 13 | - Test thoroughly 14 | - Add error handling 15 | - Document your code 16 | - Make it production-ready 17 | 18 | --- 19 | 20 | ## Tasks 21 | 22 | ### Task 1: Core Component Implementation 23 | 24 | Implement each core component as a separate class: 25 | 26 | 1. **DocumentLoader** (`document_loader.py`) 27 | - Load .txt files 28 | - Load .pdf files 29 | - Handle errors 30 | - Return clean text 31 | 32 | 2. **TextChunker** (`text_chunker.py`) 33 | - Fixed-size chunking 34 | - Configurable overlap 35 | - Add metadata 36 | - Return structured chunks 37 | 38 | 3. **EmbeddingGenerator** (`embedding_generator.py`) 39 | - Single text embedding 40 | - Batch embedding 41 | - Error handling 42 | - API key management 43 | 44 | 4. **VectorStore** (`vector_store.py`) 45 | - Store embeddings 46 | - Store chunks 47 | - Similarity search 48 | - Top K retrieval 49 | 50 | **Test each component independently before integrating.** 51 | 52 | **Deliverables:** 53 | - `document_loader.py` 54 | - `text_chunker.py` 55 | - `embedding_generator.py` 56 | - `vector_store.py` 57 | 58 | --- 59 | 60 | ### Task 2: RAG System Integration 61 | 62 | Create `rag_system.py` that integrates all components: 63 | 64 | **Requirements:** 65 | - `RAGSystem` class 66 | - `index_document(filepath)` method 67 | - `query(question, k=3)` method 68 | - Complete pipeline: Load → Chunk → Embed → Store → Retrieve → Augment → Generate 69 | - Return structured results 70 | 71 | **Test with:** Multiple documents and various questions 72 | 73 | **Deliverable:** `task2_rag_system.py` 74 | 75 | --- 76 | 77 | ### Task 3: Error Handling & Robustness 78 | 79 | Enhance your RAG system with comprehensive error handling: 80 | 81 | **Handle:** 82 | - Missing API keys 83 | - API rate limits 84 | - Network errors 85 | - File not found 86 | - Empty documents 87 | - No search results 88 | - Invalid inputs 89 | 90 | **Requirements:** 91 | - Graceful error messages 92 | - Fallback behaviors 93 | - Logging (optional) 94 | - User-friendly error reporting 95 | 96 | **Deliverable:** `task3_robust_rag.py` 97 | 98 | --- 99 | 100 | ### Task 4: Configuration System 101 | 102 | Create a configuration system `config_rag.py`: 103 | 104 | **Configurable options:** 105 | - Chunk size 106 | - Overlap size 107 | - K value (retrieval) 108 | - Similarity threshold 109 | - Embedding model 110 | - LLM model 111 | - Temperature 112 | - Max tokens 113 | 114 | **Requirements:** 115 | - Load from JSON file 116 | - Default values 117 | - Validation 118 | - Easy to modify 119 | 120 | **Deliverable:** `task4_configurable_rag.py` + `config.json` 121 | 122 | --- 123 | 124 | ### Task 5: Performance Optimization 125 | 126 | Optimize your RAG system: 127 | 128 | **Optimizations:** 129 | 1. Batch embedding generation 130 | 2. Cache embeddings (save to file) 131 | 3. Efficient similarity search (use NumPy) 132 | 4. Progress indicators 133 | 5. Memory management 134 | 135 | **Requirements:** 136 | - Measure performance (time) 137 | - Compare before/after 138 | - Handle large documents 139 | - Show progress for long operations 140 | 141 | **Deliverable:** `task5_optimized_rag.py` 142 | 143 | --- 144 | 145 | ## One Mini Project 146 | 147 | ### 🏗️ Build a Production-Ready RAG System 148 | 149 | Create a complete, production-ready RAG system `production_rag.py` with all features. 150 | 151 | **Features:** 152 | 153 | 1. 
**Complete Component System:** 154 | - DocumentLoader (multiple formats) 155 | - TextChunker (multiple strategies) 156 | - EmbeddingGenerator (with caching) 157 | - VectorStore (persistent storage) 158 | - RAGPipeline (orchestration) 159 | 160 | 2. **Document Management:** 161 | - Index single documents 162 | - Index directories 163 | - Remove documents 164 | - List indexed documents 165 | - Document statistics 166 | 167 | 3. **Query System:** 168 | - Single queries 169 | - Batch queries 170 | - Query history 171 | - Result caching 172 | 173 | 4. **Configuration:** 174 | - JSON config file 175 | - Runtime configuration 176 | - Environment variables 177 | - Default values 178 | 179 | 5. **Error Handling:** 180 | - Comprehensive try-catch 181 | - User-friendly messages 182 | - Logging system 183 | - Recovery mechanisms 184 | 185 | 6. **Performance Features:** 186 | - Embedding caching 187 | - Batch processing 188 | - Progress tracking 189 | - Performance metrics 190 | 191 | 7. **CLI Interface:** 192 | ``` 193 | === Production RAG System === 194 | 1. Index document 195 | 2. Index directory 196 | 3. Query 197 | 4. View indexed documents 198 | 5. Remove document 199 | 6. Configuration 200 | 7. Statistics 201 | 8. Exit 202 | ``` 203 | 204 | 8. **Advanced Features:** 205 | - Multiple collections 206 | - Export/import data 207 | - Search with filters 208 | - Answer quality metrics 209 | - System health check 210 | 211 | **Requirements:** 212 | - Clean, modular code 213 | - Comprehensive documentation 214 | - Error handling throughout 215 | - Configuration system 216 | - Performance optimizations 217 | - User-friendly interface 218 | - Production-ready quality 219 | 220 | **Example Usage:** 221 | ```python 222 | from production_rag import ProductionRAG 223 | 224 | # Initialize 225 | rag = ProductionRAG(config_file="config.json") 226 | 227 | # Index documents 228 | rag.index_document("doc1.pdf") 229 | rag.index_directory("./documents/") 230 | 231 | # Query 232 | result = rag.query("What is machine learning?") 233 | print(result["answer"]) 234 | print(f"Sources: {len(result['sources'])}") 235 | 236 | # Statistics 237 | stats = rag.get_statistics() 238 | print(f"Total chunks: {stats['total_chunks']}") 239 | ``` 240 | 241 | **Deliverables:** 242 | - `production_rag.py` - Main system 243 | - `config.json` - Configuration template 244 | - `requirements.txt` - Dependencies 245 | - `README_production.md` - Documentation 246 | - Unit tests (optional but recommended) 247 | - Example usage script 248 | 249 | --- 250 | 251 | ## Expected Output Section 252 | 253 | ### Task 2 Expected Output: 254 | ```python 255 | rag = RAGSystem() 256 | rag.index_document("document.pdf") 257 | # Output: "Indexed document.pdf: 15 chunks" 258 | 259 | result = rag.query("What is the main topic?") 260 | # Output: 261 | { 262 | "answer": "The main topic is...", 263 | "sources": [ 264 | {"text": "...", "source": "document.pdf", "chunk_id": 1}, 265 | ... 
266 | ], 267 | "similarities": [0.89, 0.85, 0.82] 268 | } 269 | ``` 270 | 271 | ### Task 4 Expected Output: 272 | ```json 273 | // config.json 274 | { 275 | "chunk_size": 500, 276 | "overlap": 50, 277 | "k": 3, 278 | "similarity_threshold": 0.7, 279 | "embedding_model": "text-embedding-ada-002", 280 | "llm_model": "gpt-3.5-turbo", 281 | "temperature": 0.3, 282 | "max_tokens": 300 283 | } 284 | ``` 285 | 286 | ### Mini Project Expected Output: 287 | 288 | The production system should be: 289 | - Robust and error-resistant 290 | - Well-documented 291 | - Configurable 292 | - Performant 293 | - User-friendly 294 | 295 | **Example session:** 296 | ``` 297 | === Production RAG System === 298 | Choose: 1 299 | 300 | Enter document path: document.pdf 301 | [Indexing...] 302 | ✓ Loaded document 303 | ✓ Created 15 chunks 304 | ✓ Generated embeddings 305 | ✓ Stored in vector database 306 | Indexed successfully! 307 | 308 | Choose: 3 309 | 310 | Question: What is RAG? 311 | [Processing...] 312 | 313 | Answer: 314 | RAG stands for Retrieval-Augmented Generation... 315 | 316 | Sources (3): 317 | 1. [0.91] document.pdf, chunk 5 318 | 2. [0.87] document.pdf, chunk 8 319 | 3. [0.84] document.pdf, chunk 12 320 | ``` 321 | 322 | --- 323 | 324 | ## Submission Checklist 325 | 326 | - [ ] Task 1: All core components implemented 327 | - [ ] Task 2: RAG system integrated 328 | - [ ] Task 3: Error handling added 329 | - [ ] Task 4: Configuration system working 330 | - [ ] Task 5: Optimizations implemented 331 | - [ ] Mini project: Production-ready system 332 | - [ ] All code is well-documented 333 | - [ ] Error handling comprehensive 334 | - [ ] Tested with real documents 335 | - [ ] Code follows best practices 336 | 337 | **Remember:** Building from scratch teaches you everything! 338 | 339 | **Good luck!** 🚀 340 | 341 | -------------------------------------------------------------------------------- /Day-10: Build & Deploy a RAG Application (FastAPI-Streamlit)/assignment.md: -------------------------------------------------------------------------------- 1 | # Day 10 — Assignment 2 | 3 | ## Instructions 4 | 5 | Build and deploy a complete RAG application with FastAPI backend and Streamlit frontend. This is your final project! Install required libraries: 6 | 7 | ```bash 8 | pip install fastapi uvicorn streamlit requests python-multipart 9 | ``` 10 | 11 | **Important:** 12 | - Build backend first 13 | - Then build frontend 14 | - Test locally 15 | - Deploy to cloud 16 | - Document everything 17 | 18 | --- 19 | 20 | ## Tasks 21 | 22 | ### Task 1: FastAPI Backend 23 | 24 | Create a complete FastAPI backend `rag_api.py`: 25 | 26 | **Endpoints:** 27 | 1. `GET /` - Root endpoint 28 | 2. `GET /health` - Health check 29 | 3. `POST /index` - Index document (file upload) 30 | 4. `POST /query` - Query RAG system 31 | 5. `GET /stats` - Get statistics 32 | 6. `DELETE /documents/{doc_id}` - Remove document 33 | 34 | **Requirements:** 35 | - Use Pydantic models for requests/responses 36 | - Handle file uploads 37 | - Integrate your RAG system 38 | - Add error handling 39 | - Include API documentation 40 | 41 | **Test with:** Postman or curl 42 | 43 | **Deliverable:** `task1_fastapi_backend.py` 44 | 45 | --- 46 | 47 | ### Task 2: Streamlit Frontend 48 | 49 | Create a Streamlit frontend `rag_ui.py`: 50 | 51 | **Features:** 52 | 1. Document upload section 53 | 2. Query interface 54 | 3. Answer display 55 | 4. Source display 56 | 5. Statistics view 57 | 6. 
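Chat history (optional)

A minimal sketch of how such a frontend can talk to the backend (the `/index` and `/query` endpoints follow Task 1; the `answer` and `sources` response fields are assumptions about your backend's schema):

```python
import requests
import streamlit as st

API_URL = "http://localhost:8000"  # the FastAPI backend from Task 1

st.title("RAG Application")

# Document upload section
uploaded = st.file_uploader("Upload a document", type=["pdf", "txt"])
if uploaded is not None and st.button("Index document"):
    files = {"file": (uploaded.name, uploaded.getvalue())}
    response = requests.post(f"{API_URL}/index", files=files)
    if response.ok:
        st.success("Document indexed")
    else:
        st.error(response.text)

# Query interface
question = st.text_input("Ask a question")
if st.button("Ask") and question:
    with st.spinner("Retrieving..."):
        response = requests.post(f"{API_URL}/query", json={"question": question, "k": 3})
    if response.ok:
        data = response.json()
        st.write(data["answer"])  # answer display
        for source in data.get("sources", []):
            st.caption(str(source))  # source display
    else:
        st.error(response.text)
```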
58 | 
59 | **Requirements:**
60 | - Connect to FastAPI backend
61 | - Handle errors gracefully
62 | - Add loading states
63 | - Make it user-friendly
64 | - Add styling
65 | 
66 | **Deliverable:** `task2_streamlit_frontend.py`
67 | 
68 | ---
69 | 
70 | ### Task 3: Complete Integration
71 | 
72 | Integrate backend and frontend `complete_app.py`:
73 | 
74 | **Features:**
75 | 1. Working API
76 | 2. Working UI
77 | 3. Full integration
78 | 4. Error handling
79 | 5. User feedback
80 | 
81 | **Requirements:**
82 | - Test all features
83 | - Handle edge cases
84 | - Add validation
85 | - Improve UX
86 | 
87 | **Deliverable:** `task3_complete_app/` (folder with both files)
88 | 
89 | ---
90 | 
91 | ### Task 4: Deployment Preparation
92 | 
93 | Prepare for deployment:
94 | 
95 | **Tasks:**
96 | 1. Create `requirements.txt`
97 | 2. Create `Dockerfile`
98 | 3. Create `docker-compose.yml`
99 | 4. Add environment variable handling
100 | 5. Create deployment documentation
101 | 
102 | **Requirements:**
103 | - All dependencies listed
104 | - Docker setup working
105 | - Environment variables documented
106 | - Deployment guide written
107 | 
108 | **Deliverable:** Deployment files + documentation
109 | 
110 | ---
111 | 
112 | ### Task 5: Cloud Deployment
113 | 
114 | Deploy to a cloud platform:
115 | 
116 | **Options:**
117 | - Heroku
118 | - Railway
119 | - Render
120 | - AWS/GCP/Azure (advanced)
121 | 
122 | **Requirements:**
123 | - Deploy backend API
124 | - Deploy frontend (or serve from backend)
125 | - Test deployed version
126 | - Document deployment process
127 | 
128 | **Deliverable:** Deployed application + deployment guide
129 | 
130 | ---
131 | 
132 | ## One Mini Project
133 | 
134 | ### 🚀 Build and Deploy a Complete RAG Application
135 | 
136 | Create a production-ready RAG application with full-stack implementation.
137 | 
138 | **Features:**
139 | 
140 | 1. **FastAPI Backend:**
141 |    - Complete REST API
142 |    - Document management
143 |    - Query endpoints
144 |    - Authentication (optional)
145 |    - Rate limiting (optional)
146 |    - CORS configuration
147 |    - API documentation
148 | 
149 | 2. **Streamlit Frontend:**
150 |    - Modern, clean UI
151 |    - Document upload
152 |    - Interactive query interface
153 |    - Answer display with formatting
154 |    - Source citations
155 |    - Chat history
156 |    - Settings panel
157 |    - Statistics dashboard
158 | 
159 | 3. **Complete Features:**
160 |    - Multiple document support
161 |    - Document management (add/remove)
162 |    - Query history
163 |    - Export results
164 |    - Configuration options
165 |    - Error handling
166 |    - Loading states
167 |    - User feedback
168 | 
169 | 4. **Deployment:**
170 |    - Docker containerization
171 |    - Environment configuration
172 |    - Cloud deployment
173 |    - Health checks
174 |    - Monitoring (optional)
175 | 
176 | 5. 
**Documentation:** 177 | - API documentation (auto-generated) 178 | - User guide 179 | - Deployment instructions 180 | - README with setup 181 | - Architecture diagram 182 | 183 | **Project Structure:** 184 | ``` 185 | rag_application/ 186 | ├── backend/ 187 | │ ├── main.py (FastAPI) 188 | │ ├── rag_system.py 189 | │ └── models.py 190 | ├── frontend/ 191 | │ └── app.py (Streamlit) 192 | ├── requirements.txt 193 | ├── Dockerfile 194 | ├── docker-compose.yml 195 | ├── .env.example 196 | ├── README.md 197 | └── DEPLOYMENT.md 198 | ``` 199 | 200 | **Requirements:** 201 | - Production-ready code 202 | - Comprehensive error handling 203 | - User-friendly interface 204 | - Well-documented 205 | - Deployed and accessible 206 | - Tested thoroughly 207 | 208 | **Example Usage:** 209 | ```bash 210 | # Backend 211 | cd backend 212 | uvicorn main:app --reload 213 | 214 | # Frontend 215 | cd frontend 216 | streamlit run app.py 217 | 218 | # Or with Docker 219 | docker-compose up 220 | ``` 221 | 222 | **Deliverables:** 223 | - Complete application code 224 | - Docker setup 225 | - Deployment configuration 226 | - Comprehensive documentation 227 | - Deployed application (URL) 228 | - Demo video/screenshots (optional) 229 | 230 | --- 231 | 232 | ## Expected Output Section 233 | 234 | ### Task 1 Expected Output: 235 | ```bash 236 | # Start API 237 | uvicorn rag_api:app --reload 238 | 239 | # Test endpoint 240 | curl http://localhost:8000/health 241 | # {"status":"healthy"} 242 | 243 | # Index document 244 | curl -X POST http://localhost:8000/index \ 245 | -F "file=@document.pdf" 246 | 247 | # Query 248 | curl -X POST http://localhost:8000/query \ 249 | -H "Content-Type: application/json" \ 250 | -d '{"question": "What is Python?", "k": 3}' 251 | ``` 252 | 253 | ### Task 2 Expected Output: 254 | ``` 255 | Streamlit app running at http://localhost:8501 256 | 257 | Features visible: 258 | - File upload widget 259 | - Query input 260 | - Answer display area 261 | - Sources section 262 | - Statistics panel 263 | ``` 264 | 265 | ### Task 3 Expected Output: 266 | ``` 267 | Complete integrated application: 268 | - Backend API running on :8000 269 | - Frontend UI running on :8501 270 | - Full functionality working 271 | - Error handling in place 272 | - User-friendly interface 273 | ``` 274 | 275 | ### Task 5 Expected Output: 276 | ``` 277 | Deployed application: 278 | - Backend: https://your-app.herokuapp.com 279 | - Frontend: https://your-app.herokuapp.com (or separate) 280 | - API docs: https://your-app.herokuapp.com/docs 281 | - Health: https://your-app.herokuapp.com/health 282 | ``` 283 | 284 | ### Mini Project Expected Output: 285 | 286 | The complete application should be: 287 | - Fully functional 288 | - Well-designed UI 289 | - Production-ready 290 | - Deployed and accessible 291 | - Well-documented 292 | 293 | **Example screens:** 294 | ``` 295 | ┌─────────────────────────────────┐ 296 | │ 🤖 RAG Application │ 297 | ├─────────────────────────────────┤ 298 | │ 📄 Upload Documents │ 299 | │ [Choose File] [Index] │ 300 | ├─────────────────────────────────┤ 301 | │ 💬 Ask a Question │ 302 | │ [Input box] │ 303 | │ [Ask Button] │ 304 | ├─────────────────────────────────┤ 305 | │ 📝 Answer │ 306 | │ [Generated answer displayed] │ 307 | ├─────────────────────────────────┤ 308 | │ 📚 Sources │ 309 | │ [Source 1] [Source 2] [Source 3]│ 310 | └─────────────────────────────────┘ 311 | ``` 312 | 313 | --- 314 | 315 | ## Submission Checklist 316 | 317 | - [ ] Task 1: FastAPI backend complete 318 | - [ ] Task 2: Streamlit 
frontend complete 319 | - [ ] Task 3: Integration working 320 | - [ ] Task 4: Deployment files ready 321 | - [ ] Task 5: Deployed to cloud 322 | - [ ] Mini project: Complete application 323 | - [ ] All endpoints tested 324 | - [ ] UI is user-friendly 325 | - [ ] Documentation complete 326 | - [ ] Application deployed and accessible 327 | 328 | **Final Checklist:** 329 | - [ ] Code is production-ready 330 | - [ ] Error handling comprehensive 331 | - [ ] Documentation is complete 332 | - [ ] Application is deployed 333 | - [ ] README is informative 334 | - [ ] You're proud of your work! 🎉 335 | 336 | **Congratulations on completing the 10-day RAG roadmap!** 🎊🚀 337 | 338 | **Good luck with your deployment!** 🌟 339 | 340 | -------------------------------------------------------------------------------- /Day-05: Embeddings & Vector Databases/assignment.md: -------------------------------------------------------------------------------- 1 | # Day 5 — Assignment 2 | 3 | ## Instructions 4 | 5 | Complete these tasks to master embeddings and vector databases. You'll work with OpenAI embeddings and ChromaDB. Install required libraries: 6 | 7 | ```bash 8 | pip install openai chromadb numpy 9 | ``` 10 | 11 | **Important:** 12 | - Store your OpenAI API key securely 13 | - Test with various texts to understand embeddings 14 | - Experiment with different similarity thresholds 15 | - Document your findings 16 | 17 | --- 18 | 19 | ## Tasks 20 | 21 | ### Task 1: Embedding Generator Tool 22 | 23 | Create a tool `embedding_generator.py` that: 24 | 25 | 1. Takes text input (single or batch) 26 | 2. Generates embeddings using OpenAI API 27 | 3. Returns embeddings with metadata 28 | 4. Handles errors and rate limits 29 | 5. Saves embeddings to file (optional) 30 | 31 | **Features:** 32 | - Support single text or list of texts 33 | - Show embedding dimensions 34 | - Display first few values 35 | - Calculate and display statistics (min, max, mean) 36 | 37 | **Test with:** Various texts (short, long, different topics) 38 | 39 | **Deliverable:** `task1_embedding_generator.py` 40 | 41 | --- 42 | 43 | ### Task 2: Similarity Calculator 44 | 45 | Build a similarity calculator `similarity_calculator.py`: 46 | 47 | 1. Takes two texts as input 48 | 2. Generates embeddings for both 49 | 3. Calculates cosine similarity 50 | 4. Provides interpretation of the score 51 | 5. Visualizes similarity (text-based or simple plot) 52 | 53 | **Features:** 54 | - Calculate cosine similarity 55 | - Provide similarity interpretation (very similar, somewhat similar, different) 56 | - Compare multiple text pairs 57 | - Show embedding values (first few dimensions) 58 | 59 | **Test with:** 60 | - Very similar texts ("dog" vs "puppy") 61 | - Somewhat similar ("dog" vs "animal") 62 | - Different texts ("dog" vs "computer") 63 | 64 | **Deliverable:** `task2_similarity_calculator.py` 65 | 66 | --- 67 | 68 | ### Task 3: ChromaDB Document Store 69 | 70 | Create a document storage system `chromadb_store.py`: 71 | 72 | 1. Initialize ChromaDB collection 73 | 2. Add documents with metadata 74 | 3. Query for similar documents 75 | 4. Retrieve documents by ID 76 | 5. 
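Get collection statistics

As a starting point, the core ChromaDB calls your class will wrap look roughly like this (collection name and storage path are illustrative):

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("documents")

# Add documents; ChromaDB embeds them with its default embedding
# function unless you supply precomputed embeddings yourself
collection.add(
    ids=["doc1", "doc2"],
    documents=["Python is a programming language.", "Dogs are loyal pets."],
    metadatas=[{"source": "tech"}, {"source": "animals"}],
)

# Query for similar documents, optionally filtered by metadata
results = collection.query(
    query_texts=["What programming language should I learn?"],
    n_results=2,
    where={"source": "tech"},
)
print(results["documents"], results["distances"])

# Simple statistics
print("Documents stored:", collection.count())
```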
77 | 
78 | **Requirements:**
79 | - Create a class `DocumentStore`
80 | - Methods: `add_documents()`, `search()`, `get_by_id()`, `get_stats()`
81 | - Support metadata filtering
82 | - Handle collection creation/loading
83 | 
84 | **Test with:** 20+ sample documents on various topics
85 | 
86 | **Deliverable:** `task3_chromadb_store.py`
87 | 
88 | ---
89 | 
90 | ### Task 4: Batch Embedding Processor
91 | 
92 | Build a batch processor `batch_processor.py`:
93 | 
94 | 1. Process multiple documents in batches
95 | 2. Generate embeddings efficiently
96 | 3. Store in ChromaDB
97 | 4. Show progress
98 | 5. Handle errors gracefully
99 | 
100 | **Features:**
101 | - Batch size configuration
102 | - Progress tracking
103 | - Error recovery (skip failed items, continue)
104 | - Summary report
105 | 
106 | **Test with:** 50+ documents
107 | 
108 | **Deliverable:** `task4_batch_processor.py`
109 | 
110 | ---
111 | 
112 | ### Task 5: Semantic Search Engine
113 | 
114 | Create a semantic search engine `semantic_search.py`:
115 | 
116 | 1. Index a collection of documents
117 | 2. Accept search queries
118 | 3. Return top K most similar documents
119 | 4. Display results with similarity scores
120 | 5. Support metadata filtering
121 | 
122 | **Features:**
123 | - Search interface (CLI)
124 | - Display top results with scores
125 | - Show metadata for each result
126 | - Highlight matching content (optional)
127 | - Export search results
128 | 
129 | **Test with:** A collection of 30+ documents
130 | 
131 | **Deliverable:** `task5_semantic_search.py`
132 | 
133 | ---
134 | 
135 | ## One Mini Project
136 | 
137 | ### 🔍 Build a Semantic Search Tool
138 | 
139 | Create a complete application `semantic_search_tool.py` that implements a semantic search system using embeddings and vector databases.
140 | 
141 | **Features:**
142 | 
143 | 1. **Document Indexing:**
144 |    - Load documents from files (PDF, TXT, etc.)
145 |    - Extract and chunk text
146 |    - Generate embeddings
147 |    - Store in ChromaDB with metadata
148 | 
149 | 2. **Search Interface:**
150 |    ```
151 |    === Semantic Search Tool ===
152 |    1. Index documents
153 |    2. Search
154 |    3. View indexed documents
155 |    4. Delete documents
156 |    5. Collection statistics
157 |    6. Export results
158 |    7. Exit
159 |    ```
160 | 
161 | 3. **Search Capabilities:**
162 |    - Natural language queries
163 |    - Top K results (configurable)
164 |    - Similarity score display
165 |    - Metadata filtering
166 |    - Search history
167 | 
168 | 4. **Advanced Features:**
169 |    - Multiple collections support
170 |    - Hybrid search (keyword + semantic)
171 |    - Result ranking and re-ranking
172 |    - Search analytics
173 |    - Export search results
174 | 
175 | 5. **Statistics and Analytics:**
176 |    - Total documents indexed
177 |    - Average document length
178 |    - Search performance metrics
179 |    - Most common queries
180 |    - Collection health
181 | 
182 | **Requirements:**
183 | - Use classes for organization
184 | - Support multiple file formats
185 | - Implement progress tracking
186 | - Add error handling
187 | - Create a user-friendly CLI
188 | - Store collections persistently
189 | - Generate detailed reports
190 | 
191 | **Example Usage:**
192 | ```bash
193 | python semantic_search_tool.py
194 | 
195 | === Semantic Search Tool ===
196 | Choose option: 1
197 | 
198 | Enter directory path: ./documents
199 | Chunk size [500]: 400
200 | Processing documents...
201 | ✓ Indexed 15 documents 202 | ✓ Created 42 chunks 203 | ✓ Generated embeddings 204 | 205 | Choose option: 2 206 | 207 | Enter search query: What is machine learning? 208 | Found 5 results: 209 | 210 | 1. [Score: 0.89] Machine learning is a subset of AI... 211 | Source: ai_textbook.pdf, Page: 3 212 | 213 | 2. [Score: 0.85] ML algorithms learn from data... 214 | Source: ml_guide.pdf, Page: 1 215 | ... 216 | ``` 217 | 218 | **Deliverables:** 219 | - `semantic_search_tool.py` - Main application 220 | - `requirements.txt` - Dependencies 221 | - `README_search.md` - Usage guide 222 | - Sample indexed collection 223 | - Example search results 224 | 225 | --- 226 | 227 | ## Expected Output Section 228 | 229 | ### Task 1 Expected Output: 230 | ```python 231 | embedding = generate_embedding("Python is a programming language") 232 | # Output: 233 | { 234 | "dimension": 1536, 235 | "first_5_values": [0.012, -0.034, 0.089, ...], 236 | "statistics": { 237 | "min": -0.523, 238 | "max": 0.891, 239 | "mean": 0.001 240 | } 241 | } 242 | ``` 243 | 244 | ### Task 2 Expected Output: 245 | ``` 246 | === Similarity Calculator === 247 | Text 1: "dog" 248 | Text 2: "puppy" 249 | 250 | Similarity: 0.847 251 | Interpretation: Very similar (same concept, different word) 252 | 253 | Text 1: "dog" 254 | Text 2: "car" 255 | 256 | Similarity: 0.312 257 | Interpretation: Different (unrelated concepts) 258 | ``` 259 | 260 | ### Task 3 Expected Output: 261 | ```python 262 | store = DocumentStore("my_collection") 263 | store.add_documents( 264 | texts=["Doc 1", "Doc 2"], 265 | metadatas=[{"source": "book1"}, {"source": "book2"}] 266 | ) 267 | 268 | results = store.search("programming", n_results=2) 269 | # Returns top 2 similar documents with metadata 270 | ``` 271 | 272 | ### Mini Project Expected Output: 273 | 274 | The semantic search tool should provide: 275 | - Fast indexing of documents 276 | - Accurate search results 277 | - Clear similarity scores 278 | - Rich metadata display 279 | - Professional interface 280 | 281 | **Example session:** 282 | ``` 283 | === Semantic Search Tool === 284 | Choose: 2 285 | 286 | Query: How does neural network work? 287 | Searching... 288 | 289 | Results (Top 5): 290 | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 291 | 1. [0.92] Neural networks are computing systems... 292 | 📄 Source: ai_book.pdf | 📄 Page: 45 293 | 294 | 2. [0.88] A neural network consists of layers... 295 | 📄 Source: ml_guide.pdf | 📄 Page: 12 296 | ... 297 | ``` 298 | 299 | --- 300 | 301 | ## Submission Checklist 302 | 303 | - [ ] Task 1: Embedding generator working 304 | - [ ] Task 2: Similarity calculator functional 305 | - [ ] Task 3: ChromaDB store implemented 306 | - [ ] Task 4: Batch processor complete 307 | - [ ] Task 5: Semantic search engine working 308 | - [ ] Mini project: Complete search tool 309 | - [ ] All code handles errors 310 | - [ ] Code is well-documented 311 | - [ ] Tested with real documents 312 | 313 | **Remember:** Embeddings and vector databases are the foundation of RAG retrieval! 314 | 315 | **Good luck!** 🚀 316 | 317 | -------------------------------------------------------------------------------- /Day-09: Advanced RAG (Reranking, Query Rewriting, Fusion)/assignment.md: -------------------------------------------------------------------------------- 1 | # Day 9 — Assignment 2 | 3 | ## Instructions 4 | 5 | Implement advanced RAG techniques to improve your system. These techniques make RAG production-ready. 
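One technique you will implement, Reciprocal Rank Fusion (Task 3), is compact enough to sketch up front. A minimal version, assuming each retriever returns a list of document IDs ranked best-first (k=60 is the conventional constant from the original RRF paper):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several rankings: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; duplicates across lists merge automatically
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["d3", "d1", "d2"]  # e.g. embedding-based results
keyword = ["d1", "d4", "d3"]   # e.g. BM25 results
print(reciprocal_rank_fusion([semantic, keyword]))  # ['d1', 'd3', 'd4', 'd2']
```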
Install required libraries:

```bash
pip install sentence-transformers rank-bm25 openai numpy
```

**Important:**
- Implement each technique separately first
- Compare before/after results
- Measure improvements
- Document your findings

---

## Tasks

### Task 1: Query Rewriting System

Build a query rewriting system in `query_rewriter.py`:

**Features:**
1. Generate query variations using an LLM
2. Extract key terms
3. Expand with synonyms
4. Create multiple query formulations

**Requirements:**
- Generate 3-5 variations per query
- Test whether variations improve retrieval
- Compare retrieval results
- Document improvements

**Test with:** Various question types

**Deliverable:** `task1_query_rewriter.py`

---

### Task 2: Reranking Implementation

Add reranking to your RAG system in `reranking_rag.py`:

**Features:**
1. Use a cross-encoder model for reranking
2. Rerank initial retrieval results
3. Compare before/after rankings
4. Measure improvement

**Requirements:**
- Install sentence-transformers
- Use a reranking model
- Rerank top 10, return top 5
- Show score improvements

**Test with:** Various queries and measure improvement

**Deliverable:** `task2_reranking_rag.py`

---

### Task 3: Fusion Techniques

Implement fusion in `fusion_rag.py`:

**Features:**
1. Reciprocal Rank Fusion (RRF)
2. Weighted fusion
3. Combine multiple retrieval results
4. Deduplication

**Requirements:**
- Implement the RRF algorithm
- Test with 2-3 different retrieval strategies
- Compare fused vs. single retrieval
- Measure improvement

**Deliverable:** `task3_fusion_rag.py`

---

### Task 4: Hybrid Search

Build hybrid search in `hybrid_search.py`:

**Features:**
1. Semantic search (embeddings)
2. Keyword search (BM25)
3. Combine both with weights
4. Tune the alpha parameter

**Requirements:**
- Implement BM25 keyword search
- Combine with semantic search
- Test different alpha values (0.0 to 1.0)
- Find the optimal balance (one way to fuse the two rankings is sketched after this task)

**Deliverable:** `task4_hybrid_search.py`
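
The core math behind Tasks 3 and 4 is small enough to sketch up front. Below is a minimal, illustrative implementation of RRF fusing a BM25 ranking with a vector ranking, plus the alpha-weighted blend from Task 4; the two sample rankings are hypothetical stand-ins for whatever your retrievers actually return.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first ranked lists of document IDs.

    Each document scores sum(1 / (k + rank)) over the lists that
    contain it; k=60 is the constant from the original RRF paper.
    Duplicates across lists collapse into a single entry automatically.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_score(bm25_score, vector_score, alpha=0.5):
    """Task 4 style weighted blend; both scores normalized to [0, 1]."""
    return alpha * vector_score + (1 - alpha) * bm25_score

# Hypothetical output of two retrievers over the same corpus:
bm25_rank = ["doc3", "doc1", "doc7", "doc2"]    # keyword (BM25) order
vector_rank = ["doc1", "doc3", "doc5", "doc7"]  # semantic order

print(reciprocal_rank_fusion([bm25_rank, vector_rank]))
# doc3 and doc1 surface first: both lists agree on them
```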
---

### Task 5: Complete Advanced RAG

Combine all techniques in `advanced_rag.py`:

**Pipeline:**
1. Query rewriting
2. Multiple retrievals (with variations)
3. Fusion (combine results)
4. Reranking (improve order)
5. Generation

**Requirements:**
- Integrate all techniques
- Make it configurable
- Compare with basic RAG
- Measure overall improvement

**Deliverable:** `task5_advanced_rag.py`

---

## One Mini Project

### 🚀 Build a Production-Ready Advanced RAG System

Create a complete advanced RAG application `production_advanced_rag.py` with all optimization techniques.

**Features:**

1. **Complete Advanced Pipeline:**
   - Query rewriting
   - Multiple retrieval strategies
   - Fusion
   - Reranking
   - Generation

2. **Configuration System:**
   - Enable/disable each technique
   - Tune parameters
   - A/B testing mode
   - Performance vs. quality trade-offs

3. **Multiple Retrieval Strategies:**
   - Semantic search
   - Keyword search
   - Hybrid search
   - Metadata filtering
   - Custom retrievers

4. **Advanced Features:**
   - Query expansion
   - Query decomposition
   - Multi-query fusion
   - Reranking with cross-encoders
   - Answer quality scoring

5. **Evaluation System:**
   - Compare techniques
   - Measure improvements
   - A/B testing
   - Performance metrics
   - Quality metrics

6. **Interactive Interface:**
   ```
   === Advanced RAG System ===
   1. Index documents
   2. Query (Basic RAG)
   3. Query (Advanced RAG)
   4. Compare techniques
   5. Configure settings
   6. Evaluation mode
   7. Statistics
   8. Exit
   ```

7. **Reporting:**
   - Technique comparison
   - Performance reports
   - Quality improvements
   - Recommendations

**Requirements:**
- All advanced techniques implemented
- Configurable and modular
- Comprehensive evaluation
- Production-ready quality
- Detailed documentation

**Example Usage:**
```python
rag = ProductionAdvancedRAG()

# Configure
rag.configure({
    "query_rewriting": True,
    "reranking": True,
    "fusion": True,
    "hybrid_search": True
})

# Query
result = rag.query("What is machine learning?")
print(result["answer"])
print(f"Improvement: {result['improvement_metrics']}")
```

**Deliverables:**
- `production_advanced_rag.py` - Main system
- `config_advanced.json` - Configuration
- `requirements.txt` - Dependencies
- `README_advanced.md` - Documentation
- Evaluation report template

---

## Expected Output Section

### Task 1 Expected Output:
```python
variations = rewrite_query("How does ML work?")
# Output:
[
    "How does machine learning work?",
    "What is the process of machine learning?",
    "How do ML algorithms learn?",
    "Explain machine learning mechanism",
    "How is machine learning implemented?"
]

# Test retrieval improvement
basic_results = retrieve("How does ML work?")
advanced_results = retrieve_multiple(variations)
# Advanced finds 40% more relevant documents
```

### Task 2 Expected Output:
```
Before Reranking:
1. Doc A (0.85)
2. Doc B (0.82)
3. Doc C (0.80)

After Reranking:
1. Doc C (0.92) ← Better match!
2. Doc A (0.88)
3. Doc B (0.85)

Improvement: Top result relevance increased by 8%
```

### Task 3 Expected Output:
```
Single Retrieval: Found 3 relevant docs
Fusion (3 strategies): Found 5 relevant docs
Improvement: 67% more relevant results
```

### Task 5 Expected Output:
```
=== Advanced RAG Query ===
Question: "What is Python?"

[Query Rewriting] Generated 4 variations
[Multiple Retrieval] Found 12 candidates
[Fusion] Combined to 8 unique results
[Reranking] Reordered top 5
[Generation] Generated answer

Answer: Python is a high-level programming language...
275 | 276 | Improvement Metrics: 277 | - Retrieval: +45% relevant docs 278 | - Answer quality: +23% improvement 279 | - Response time: +0.3s (acceptable) 280 | ``` 281 | 282 | ### Mini Project Expected Output: 283 | 284 | The advanced RAG system should demonstrate: 285 | - Significant quality improvements 286 | - Configurable techniques 287 | - Comprehensive evaluation 288 | - Production-ready features 289 | 290 | **Example session:** 291 | ``` 292 | === Advanced RAG System === 293 | Choose: 3 294 | 295 | Question: "Explain neural networks" 296 | 297 | [Advanced Pipeline Running...] 298 | ✓ Query rewritten: 4 variations 299 | ✓ Retrieved: 15 candidates 300 | ✓ Fused: 8 unique results 301 | ✓ Reranked: Top 5 selected 302 | ✓ Generated answer 303 | 304 | Answer: 305 | Neural networks are computing systems inspired by... 306 | 307 | Sources (Top 5, reranked): 308 | 1. [0.94] neural_networks.pdf | Page 3 309 | 2. [0.91] deep_learning.pdf | Page 1 310 | 3. [0.89] ai_basics.pdf | Page 7 311 | ... 312 | 313 | Comparison with Basic RAG: 314 | - Answer quality: +28% improvement 315 | - Source relevance: +35% improvement 316 | - Response time: +0.4s 317 | ``` 318 | 319 | --- 320 | 321 | ## Submission Checklist 322 | 323 | - [ ] Task 1: Query rewriting working 324 | - [ ] Task 2: Reranking implemented 325 | - [ ] Task 3: Fusion functional 326 | - [ ] Task 4: Hybrid search working 327 | - [ ] Task 5: Complete advanced pipeline 328 | - [ ] Mini project: Production system 329 | - [ ] All techniques tested 330 | - [ ] Improvements measured 331 | - [ ] Code well-documented 332 | 333 | **Remember:** Advanced techniques make the difference between a prototype and production system! 334 | 335 | **Good luck!** 🚀 336 | 337 | -------------------------------------------------------------------------------- /RAG Projects/readme.md: -------------------------------------------------------------------------------- 1 | # 🚀 10 Real-World RAG Projects 2 | 3 | **Practical Ideas from Beginner to Advanced** 4 | 5 | *Retrieval-Augmented Generation* 6 | 7 | --- 8 | 9 | ## 📋 Table of Contents 10 | 11 | - [Project 1: Legal Document Assistant](#project-1-legal-document-assistant) 12 | - [Project 2: Medical Research Summarizer](#project-2-medical-research-summarizer) 13 | - [Project 3: Customer Support Assistant](#project-3-customer-support-assistant) 14 | - [Project 4: Codebase Search & Explainer](#project-4-codebase-search--explainer) 15 | - [Project 5: Educational Q&A Tutor](#project-5-educational-qa-tutor) 16 | - [Project 6: Company Policy Assistant](#project-6-company-policy-assistant) 17 | - [Project 7: Financial Report Analyzer](#project-7-financial-report-analyzer) 18 | - [Project 8: Product Manual Assistant](#project-8-product-manual-assistant) 19 | - [Project 9: Academic Research Copilot](#project-9-academic-research-copilot) 20 | - [Project 10: News Contextualizer](#project-10-news-contextualizer) 21 | - [Recommended Tech Stack](#recommended-tech-stack) 22 | 23 | --- 24 | 25 | ## Project 1: Legal Document Assistant ⚖️ 26 | 27 | **Difficulty:** Intermediate 28 | 29 | **Description:** 30 | Help lawyers find and summarize relevant case laws from thousands of legal documents. Query with natural language and get cited precedents instantly. 31 | 32 | **Use Case:** 33 | Legal professionals can quickly search through extensive case law databases, legal documents, and precedents using natural language queries. 
The system retrieves relevant cases and provides summaries with proper citations, significantly reducing research time. 34 | 35 | **Tags:** 36 | - Legal Tech 37 | - Case Law 38 | - PDF Processing 39 | 40 | --- 41 | 42 | ## Project 2: Medical Research Summarizer 🏥 43 | 44 | **Difficulty:** Advanced 45 | 46 | **Description:** 47 | Summarize latest medical research for doctors and researchers. Query PubMed papers by symptoms or diseases and get readable clinical summaries. 48 | 49 | **Use Case:** 50 | Healthcare professionals and researchers can query medical literature from PubMed using symptoms, diseases, or research topics. The system retrieves relevant papers and provides readable clinical summaries, helping doctors stay updated with the latest research findings. 51 | 52 | **Tags:** 53 | - Healthcare 54 | - PubMed 55 | - Research 56 | 57 | --- 58 | 59 | ## Project 3: Customer Support Assistant 💬 60 | 61 | **Difficulty:** Beginner 62 | 63 | **Description:** 64 | Answer customer questions using internal knowledge base, FAQs, and support docs. Perfect for reducing support ticket volume. 65 | 66 | **Use Case:** 67 | Businesses can deploy an AI assistant that answers customer queries by retrieving information from internal knowledge bases, FAQ documents, and support documentation. This helps reduce support ticket volume and provides instant, accurate responses to common questions. 68 | 69 | **Tags:** 70 | - Support 71 | - Chatbot 72 | - Enterprise 73 | 74 | --- 75 | 76 | ## Project 4: Codebase Search & Explainer 💻 77 | 78 | **Difficulty:** Intermediate 79 | 80 | **Description:** 81 | Developer assistant that retrieves and explains code snippets from large codebases. Ask "How is auth implemented?" and get step-by-step answers. 82 | 83 | **Use Case:** 84 | Developers working with large codebases can query the system to find and understand how specific features are implemented. For example, asking "How is authentication implemented?" will retrieve relevant code snippets and provide step-by-step explanations, making onboarding and code navigation much easier. 85 | 86 | **Tags:** 87 | - DevTools 88 | - GitHub 89 | - Code Search 90 | 91 | --- 92 | 93 | ## Project 5: Educational Q&A Tutor 📚 94 | 95 | **Difficulty:** Beginner 96 | 97 | **Description:** 98 | AI tutor that retrieves textbook sections to answer student questions. Perfect for personalized learning and homework help. 99 | 100 | **Use Case:** 101 | Students can ask questions about course material, and the system retrieves relevant sections from textbooks and educational resources to provide comprehensive answers. This enables personalized learning experiences and helps with homework and exam preparation. 102 | 103 | **Tags:** 104 | - EdTech 105 | - Learning 106 | - Tutoring 107 | 108 | --- 109 | 110 | ## Project 6: Company Policy Assistant 🏢 111 | 112 | **Difficulty:** Beginner 113 | 114 | **Description:** 115 | Query internal HR policies like leave, reimbursement, and benefits. Get instant answers with document citations. 116 | 117 | **Use Case:** 118 | Employees can quickly find information about company policies, HR procedures, leave policies, reimbursement rules, and benefits by querying the system. The assistant retrieves relevant policy documents and provides answers with proper citations, reducing HR workload and improving employee experience. 
119 | 120 | **Tags:** 121 | - HR Tech 122 | - Internal Tools 123 | - Policies 124 | 125 | --- 126 | 127 | ## Project 7: Financial Report Analyzer 💰 128 | 129 | **Difficulty:** Intermediate 130 | 131 | **Description:** 132 | Analyze quarterly reports and generate insights. Query like "Summarize Tesla's Q3 2024 performance" and get business-friendly summaries. 133 | 134 | **Use Case:** 135 | Financial analysts, investors, and business professionals can query quarterly financial reports and get business-friendly summaries and insights. For example, asking "Summarize Tesla's Q3 2024 performance" will retrieve relevant sections from financial reports and provide comprehensive analysis. 136 | 137 | **Tags:** 138 | - FinTech 139 | - Analytics 140 | - Reports 141 | 142 | --- 143 | 144 | ## Project 8: Product Manual Assistant 📖 145 | 146 | **Difficulty:** Beginner 147 | 148 | **Description:** 149 | Help users troubleshoot products using manuals. Search through documentation and show step-by-step instructions. 150 | 151 | **Use Case:** 152 | Product users can troubleshoot issues by querying product manuals and documentation. The system retrieves relevant sections and provides step-by-step instructions, reducing support calls and improving user experience with self-service troubleshooting. 153 | 154 | **Tags:** 155 | - Support 156 | - Documentation 157 | - UX 158 | 159 | --- 160 | 161 | ## Project 9: Academic Research Copilot 🎓 162 | 163 | **Difficulty:** Advanced 164 | 165 | **Description:** 166 | Find, cite, and summarize scholarly papers. Create literature reviews and track recent trends in research topics. 167 | 168 | **Use Case:** 169 | Researchers and academics can use this tool to find relevant scholarly papers, generate citations, create literature reviews, and track recent trends in their research areas. The system searches through academic databases like arXiv and provides summaries and citations for papers. 170 | 171 | **Tags:** 172 | - Research 173 | - Academia 174 | - arXiv 175 | 176 | --- 177 | 178 | ## Project 10: News Contextualizer 📰 179 | 180 | **Difficulty:** Intermediate 181 | 182 | **Description:** 183 | Provide historical context for trending news. Query current events and get timeline summaries with fact-based background. 184 | 185 | **Use Case:** 186 | Journalists, researchers, and news consumers can query current events and receive historical context, timeline summaries, and fact-based background information. This helps understand the full picture of news stories by connecting them to past events and providing comprehensive context. 
187 | 188 | **Tags:** 189 | - News 190 | - Context 191 | - Archives 192 | 193 | --- 194 | 195 | ## Recommended Tech Stack 🛠️ 196 | 197 | All projects can be built using the following technologies: 198 | 199 | - **Python** - Primary programming language 200 | - **LangChain** - Framework for building LLM applications 201 | - **OpenAI** - LLM API provider 202 | - **FAISS** - Vector similarity search library 203 | - **Pinecone** - Managed vector database 204 | - **ChromaDB** - Open-source vector database 205 | 206 | ### Additional Tools & Libraries 207 | 208 | - **Streamlit** / **Gradio** - For building web interfaces 209 | - **PyPDF2** / **pdfplumber** - For PDF processing 210 | - **BeautifulSoup** / **Scrapy** - For web scraping 211 | - **Sentence Transformers** - For embeddings 212 | - **FastAPI** - For building APIs 213 | - **Docker** - For containerization 214 | 215 | --- 216 | 217 | ## 💡 Note 218 | 219 | All projects are scalable from MVP to Production. Start with a simple implementation and gradually add features like: 220 | - Advanced retrieval strategies (hybrid search, reranking) 221 | - Multi-modal support (images, tables) 222 | - User authentication and session management 223 | - Analytics and monitoring 224 | - Caching and optimization 225 | - Multi-language support 226 | 227 | --- 228 | 229 | ## Getting Started 230 | 231 | 1. Choose a project that matches your skill level 232 | 2. Set up your development environment with Python 3.8+ 233 | 3. Install required dependencies 234 | 4. Set up your vector database (FAISS, Pinecone, or ChromaDB) 235 | 5. Configure your LLM API keys 236 | 6. Start building! 237 | 238 | --- 239 | 240 | 241 | **Happy Building! 🚀** 242 | 243 | -------------------------------------------------------------------------------- /Day-06: RAG Fundamentals (Retrieval → Augmentation → Generation)/assignment.md: -------------------------------------------------------------------------------- 1 | # Day 6 — Assignment 2 | 3 | ## Instructions 4 | 5 | Complete these tasks to build your first RAG systems. You'll implement the complete RAG pipeline: Retrieval → Augmentation → Generation. Make sure you have all dependencies: 6 | 7 | ```bash 8 | pip install openai chromadb numpy 9 | ``` 10 | 11 | **Important:** 12 | - Test with real documents 13 | - Experiment with different K values 14 | - Try various prompt templates 15 | - Document what works best 16 | 17 | --- 18 | 19 | ## Tasks 20 | 21 | ### Task 1: Basic RAG System 22 | 23 | Build a complete RAG system `basic_rag.py`: 24 | 25 | **Components:** 26 | 1. Document storage (ChromaDB) 27 | 2. Retrieval function 28 | 3. Augmentation function 29 | 4. Generation function 30 | 5. Complete query pipeline 31 | 32 | **Requirements:** 33 | - Create a `BasicRAG` class 34 | - Methods: `add_documents()`, `query()` 35 | - Retrieve top 3 chunks 36 | - Simple prompt template 37 | - Return answer and sources 38 | 39 | **Test with:** 10+ documents on a specific topic 40 | 41 | **Deliverable:** `task1_basic_rag.py` 42 | 43 | --- 44 | 45 | ### Task 2: RAG with Source Citations 46 | 47 | Enhance your RAG to include citations `rag_with_citations.py`: 48 | 49 | **Features:** 50 | 1. Store source metadata with documents 51 | 2. Include source info in prompt 52 | 3. Generate answers with citations 53 | 4. Format: "According to [Source]..." 
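
One possible shape for the augmentation step is sketched below; the chunk dictionary layout (`text`, `source`, `page`) is an assumption, so adapt it to whatever your vector store actually returns:

```python
def build_cited_prompt(question, chunks):
    """Fold retrieved chunks into a prompt that asks for citations.

    Each chunk is assumed to look like:
    {"text": "...", "source": "doc1.pdf", "page": 3}
    """
    context = "\n\n".join(
        f"[Source: {c['source']}, page {c.get('page', '?')}]\n{c['text']}"
        for c in chunks
    )
    return (
        "Answer using ONLY the context below. When you state a fact, "
        "cite it as: According to [source]...\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```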
54 | 55 | **Requirements:** 56 | - Track document sources 57 | - Include in augmented prompt 58 | - LLM should cite sources in answer 59 | - Return structured results with citations 60 | 61 | **Test with:** Documents from different sources 62 | 63 | **Deliverable:** `task2_rag_citations.py` 64 | 65 | --- 66 | 67 | ### Task 3: Similarity Threshold Filtering 68 | 69 | Implement similarity-based filtering `filtered_rag.py`: 70 | 71 | **Features:** 72 | 1. Set similarity threshold 73 | 2. Filter retrieved chunks 74 | 3. Only use chunks above threshold 75 | 4. Handle case when no chunks pass threshold 76 | 77 | **Requirements:** 78 | - Configurable threshold (0.0 to 1.0) 79 | - Show similarity scores 80 | - Test with different thresholds 81 | - Compare results 82 | 83 | **Test with:** Various queries and thresholds 84 | 85 | **Deliverable:** `task3_filtered_rag.py` 86 | 87 | --- 88 | 89 | ### Task 4: Multi-Query RAG 90 | 91 | Implement query expansion `multi_query_rag.py`: 92 | 93 | **Features:** 94 | 1. Generate query variations 95 | 2. Search with each variation 96 | 3. Combine and deduplicate results 97 | 4. Use combined results for answer 98 | 99 | **Query expansion ideas:** 100 | - Paraphrase the question 101 | - Extract key terms 102 | - Generate related questions 103 | 104 | **Requirements:** 105 | - Create 2-3 query variations 106 | - Search with each 107 | - Merge results (remove duplicates) 108 | - Use merged chunks for answer 109 | 110 | **Deliverable:** `task4_multi_query_rag.py` 111 | 112 | --- 113 | 114 | ### Task 5: RAG Evaluation System 115 | 116 | Build an evaluation framework `rag_evaluator.py`: 117 | 118 | **Features:** 119 | 1. Test dataset (questions + expected answers) 120 | 2. Run RAG on test questions 121 | 3. Compare generated vs expected answers 122 | 4. Calculate metrics (accuracy, similarity) 123 | 124 | **Metrics to implement:** 125 | - Exact match 126 | - Semantic similarity (embedding-based) 127 | - Contains key terms 128 | - Answer length comparison 129 | 130 | **Requirements:** 131 | - Create test dataset (5-10 Q&A pairs) 132 | - Run evaluation 133 | - Calculate and display metrics 134 | - Identify failure cases 135 | 136 | **Deliverable:** `task5_rag_evaluator.py` 137 | 138 | --- 139 | 140 | ## One Mini Project 141 | 142 | ### 🚀 Build a Full RAG System From Scratch 143 | 144 | Create a complete RAG application `rag_system.py` that implements all the concepts learned. 145 | 146 | **Features:** 147 | 148 | 1. **Document Management:** 149 | - Load documents from files (PDF, TXT) 150 | - Extract and chunk text 151 | - Generate embeddings 152 | - Store in vector database 153 | - Manage multiple document collections 154 | 155 | 2. **RAG Pipeline:** 156 | - Complete retrieval system 157 | - Configurable K value 158 | - Similarity threshold filtering 159 | - Query expansion (optional) 160 | - Augmentation with citations 161 | - Generation with LLM 162 | 163 | 3. **Interactive Interface:** 164 | ``` 165 | === RAG System === 166 | 1. Add documents 167 | 2. Ask a question 168 | 3. View indexed documents 169 | 4. Configure settings 170 | 5. Evaluate system 171 | 6. Export results 172 | 7. Exit 173 | ``` 174 | 175 | 4. **Settings Configuration:** 176 | - K value (number of chunks) 177 | - Similarity threshold 178 | - LLM model selection 179 | - Temperature 180 | - Max tokens 181 | - Enable/disable query expansion 182 | 183 | 5. 
**Advanced Features:** 184 | - Multiple collections 185 | - Metadata filtering 186 | - Search history 187 | - Answer quality scoring 188 | - Source highlighting 189 | - Export conversations 190 | 191 | 6. **Evaluation Tools:** 192 | - Test with sample questions 193 | - Compare different configurations 194 | - Performance metrics 195 | - Quality assessment 196 | 197 | **Requirements:** 198 | - Use classes for organization 199 | - Support multiple file formats 200 | - Implement all RAG components 201 | - Add comprehensive error handling 202 | - Create user-friendly CLI 203 | - Store configurations 204 | - Generate detailed reports 205 | 206 | **Example Usage:** 207 | ```bash 208 | python rag_system.py 209 | 210 | === RAG System === 211 | Choose: 1 212 | 213 | Enter document path: ./documents 214 | Processing... 215 | ✓ Indexed 5 documents 216 | ✓ Created 23 chunks 217 | 218 | Choose: 2 219 | 220 | Question: What is machine learning? 221 | [Searching...] 222 | 223 | Answer: 224 | Machine learning is a subset of artificial intelligence that enables systems to learn from data... 225 | 226 | Sources: 227 | 1. [0.89] ai_textbook.pdf, Page 5 228 | 2. [0.85] ml_guide.pdf, Page 2 229 | 3. [0.82] intro_ai.pdf, Page 10 230 | 231 | [1] Ask another question 232 | [2] View full sources 233 | [3] Main menu 234 | ``` 235 | 236 | **Deliverables:** 237 | - `rag_system.py` - Main application 238 | - `config.json` - Configuration template 239 | - `requirements.txt` - Dependencies 240 | - `README_rag.md` - Usage guide 241 | - Sample test dataset 242 | - Example outputs 243 | 244 | --- 245 | 246 | ## Expected Output Section 247 | 248 | ### Task 1 Expected Output: 249 | ```python 250 | rag = BasicRAG() 251 | rag.add_documents(["Doc 1 text...", "Doc 2 text..."]) 252 | 253 | result = rag.query("What is Python?") 254 | # Output: 255 | { 256 | "answer": "Python is a high-level programming language...", 257 | "sources": [ 258 | "Python is a programming language created in 1991...", 259 | "Python supports multiple programming paradigms..." 260 | ] 261 | } 262 | ``` 263 | 264 | ### Task 2 Expected Output: 265 | ```python 266 | result = rag_with_citations.query("What is RAG?") 267 | # Output: 268 | { 269 | "answer": "According to document1.pdf, RAG stands for Retrieval-Augmented Generation...", 270 | "sources": [ 271 | {"text": "...", "source": "document1.pdf", "page": 3}, 272 | {"text": "...", "source": "document2.pdf", "page": 1} 273 | ] 274 | } 275 | ``` 276 | 277 | ### Task 3 Expected Output: 278 | ``` 279 | Query: "machine learning" 280 | Threshold: 0.7 281 | 282 | Retrieved 5 chunks, 3 above threshold (0.7) 283 | Using top 3 chunks for answer... 284 | 285 | Answer: [Generated answer using filtered chunks] 286 | ``` 287 | 288 | ### Mini Project Expected Output: 289 | 290 | The RAG system should provide: 291 | - Fast document indexing 292 | - Accurate retrieval 293 | - Clear, cited answers 294 | - Configurable settings 295 | - Professional interface 296 | 297 | **Example session:** 298 | ``` 299 | === RAG System === 300 | Choose: 2 301 | 302 | Question: How does neural network training work? 303 | 304 | [Retrieving relevant chunks...] 305 | [Generating answer...] 306 | 307 | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 308 | Answer: 309 | Neural network training involves feeding data through the network, calculating errors, and adjusting weights through backpropagation... 310 | 311 | Sources (Top 3): 312 | 1. [0.91] neural_networks.pdf | Page 12 313 | 2. [0.87] deep_learning.pdf | Page 5 314 | 3. 
[0.84] ai_basics.pdf | Page 8 315 | 316 | Similarity scores shown in brackets 317 | ``` 318 | 319 | --- 320 | 321 | ## Submission Checklist 322 | 323 | - [ ] Task 1: Basic RAG working 324 | - [ ] Task 2: Citations implemented 325 | - [ ] Task 3: Filtering functional 326 | - [ ] Task 4: Multi-query working 327 | - [ ] Task 5: Evaluation system complete 328 | - [ ] Mini project: Full RAG system 329 | - [ ] All components tested 330 | - [ ] Code is well-documented 331 | - [ ] Error handling implemented 332 | 333 | **Remember:** RAG combines retrieval and generation - both parts are important! 334 | 335 | **Good luck!** 🚀 336 | 337 | -------------------------------------------------------------------------------- /Day-02: Generative AI & LLM Basics/README.md: -------------------------------------------------------------------------------- 1 | # Day 2 — Generative AI & LLM Basics 2 | 3 | ## 1. Beginner-Friendly Introduction 4 | 5 | Welcome to the world of **Generative AI**! Today, you'll learn what Large Language Models (LLMs) are, how they work, and why they're the foundation of RAG systems. 6 | 7 | **What is Generative AI?** 8 | Generative AI refers to artificial intelligence systems that can create new content—text, images, code, etc.—based on patterns they've learned from training data. Unlike traditional AI that classifies or predicts, generative AI produces original outputs. 9 | 10 | **Why this matters for RAG:** 11 | - RAG systems use LLMs to generate answers 12 | - Understanding LLMs helps you craft better prompts 13 | - You'll interact with LLMs via APIs (like OpenAI) 14 | - LLMs have limitations that RAG solves (hallucination, outdated knowledge) 15 | 16 | **Real-world context:** 17 | Think of an LLM as a very knowledgeable assistant who has read millions of books but can't remember specific sources. RAG gives this assistant access to a "library" (your documents) so it can provide accurate, sourced answers. 18 | 19 | --- 20 | 21 | ## 2. Deep-Dive Explanation 22 | 23 | ### 2.1 What are Large Language Models (LLMs)? 24 | 25 | LLMs are neural networks trained on vast amounts of text data. They learn patterns, relationships, and language structure. 
26 | 27 | **Key characteristics:** 28 | - **Large**: Billions of parameters (weights) 29 | - **Language**: Understand and generate human language 30 | - **Models**: Mathematical representations of language patterns 31 | 32 | **How they work (simplified):** 33 | ``` 34 | Input Text → Neural Network → Probability Distribution → Generated Text 35 | ``` 36 | 37 | ### 2.2 Popular LLMs 38 | 39 | **OpenAI Models:** 40 | - **GPT-3.5/GPT-4**: General-purpose, powerful 41 | - **GPT-4 Turbo**: Faster, more efficient 42 | - **Embedding models**: Convert text to vectors 43 | 44 | **Other Models:** 45 | - **Claude** (Anthropic): Strong reasoning 46 | - **Llama** (Meta): Open-source alternative 47 | - **Gemini** (Google): Multimodal capabilities 48 | 49 | ### 2.3 Understanding Tokens 50 | 51 | LLMs process text in **tokens**, not words: 52 | - 1 token ≈ 4 characters (roughly) 53 | - "Hello world" = 2 tokens 54 | - "RAG system" = 3 tokens 55 | 56 | **Why it matters:** 57 | - API pricing is often per token 58 | - Models have token limits (context windows) 59 | - You need to manage token usage efficiently 60 | 61 | ### 2.4 The OpenAI API 62 | 63 | **Basic API Structure:** 64 | ``` 65 | Your Code → HTTP Request → OpenAI API → Response (JSON) → Your Code 66 | ``` 67 | 68 | **Key Components:** 69 | - **API Key**: Authentication 70 | - **Endpoint**: URL for the API 71 | - **Model**: Which LLM to use (e.g., "gpt-4") 72 | - **Messages**: Conversation format 73 | - **Parameters**: Temperature, max_tokens, etc. 74 | 75 | ### 2.5 API Parameters Explained 76 | 77 | **Temperature** (0-2): 78 | - Lower (0-0.3): More deterministic, focused 79 | - Higher (0.7-2): More creative, varied 80 | - Default: 0.7 81 | 82 | **Max Tokens:** 83 | - Maximum length of the response 84 | - Set based on your needs 85 | - Be careful not to exceed model limits 86 | 87 | **Top P** (0-1): 88 | - Nucleus sampling 89 | - Controls diversity 90 | - Alternative to temperature 91 | 92 | ### 2.6 Model Capabilities and Limitations 93 | 94 | **What LLMs are good at:** 95 | - Understanding context 96 | - Generating coherent text 97 | - Following instructions 98 | - Summarizing content 99 | - Answering questions (if trained on the topic) 100 | 101 | **What LLMs struggle with:** 102 | - **Hallucination**: Making up facts 103 | - **Outdated information**: Training data cutoff 104 | - **Specific knowledge**: Not trained on your documents 105 | - **Math/Logic**: Can make errors 106 | - **Real-time data**: No access to current events 107 | 108 | **This is why RAG exists!** RAG solves the "specific knowledge" and "outdated information" problems. 109 | 110 | --- 111 | 112 | ## 3. Instructor Examples 113 | 114 | ### Example 1: Basic OpenAI API Call 115 | 116 | ```python 117 | import openai 118 | import os 119 | 120 | # Set your API key (use environment variable in production!) 
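# Note: this and the following examples use the v1 OpenAI Python SDK
# (pip install "openai>=1.0"). Pre-1.0 SDKs exposed the same call as
# openai.ChatCompletion.create, which v1 removed.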
client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def simple_chat(prompt):
    """Send a simple prompt to GPT-3.5"""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": prompt}
        ],
        temperature=0.7,
        max_tokens=150
    )

    return response.choices[0].message.content

# Usage
answer = simple_chat("What is RAG?")
print(answer)
```

### Example 2: Conversation with Context

```python
def chat_with_context(messages, user_message):
    """Maintain conversation context"""
    messages.append({"role": "user", "content": user_message})

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0.7
    )

    assistant_message = response.choices[0].message.content
    messages.append({"role": "assistant", "content": assistant_message})

    return assistant_message, messages

# Usage
conversation = []
response, conversation = chat_with_context(
    conversation,
    "My name is Alice"
)
response, conversation = chat_with_context(
    conversation,
    "What's my name?"  # Model remembers context!
)
```

### Example 3: Using Different Models

```python
def compare_models(prompt):
    """Compare responses from different models"""
    models = ["gpt-3.5-turbo", "gpt-4"]
    results = {}

    for model in models:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
            max_tokens=200
        )
        results[model] = response.choices[0].message.content

    return results

# Usage
prompt = "Explain quantum computing in simple terms"
results = compare_models(prompt)
for model, answer in results.items():
    print(f"\n{model}:\n{answer}")
```

### Example 4: Controlling Output with Parameters

```python
def generate_with_settings(prompt, temperature=0.7, max_tokens=100):
    """Generate text with custom parameters"""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=max_tokens,
        top_p=0.9
    )

    return {
        "content": response.choices[0].message.content,
        "tokens_used": response.usage.total_tokens,
        "model": response.model
    }

# Usage - Creative writing (high temperature)
creative = generate_with_settings(
    "Write a short story about a robot",
    temperature=1.2,
    max_tokens=200
)

# Usage - Factual answer (low temperature)
factual = generate_with_settings(
    "What is the capital of France?",
    temperature=0.2,
    max_tokens=50
)
```

---

## 4. Student Practice Tasks

### Task 1: Basic API Setup
Set up your OpenAI API key and make your first API call. Print the response and the number of tokens used.

### Task 2: Temperature Experiment
Send the same prompt to the API with different temperature values (0.1, 0.7, 1.5). Observe how the responses differ. What do you notice?

### Task 3: Token Counting
Write a function that estimates token count for a given text.
Compare your estimate with the actual token count from the API response. 243 | 244 | ### Task 4: Conversation Memory 245 | Create a simple chatbot that maintains conversation history. The bot should remember what was discussed earlier in the conversation. 246 | 247 | ### Task 5: Model Comparison 248 | Compare responses from `gpt-3.5-turbo` and `gpt-4` for the same prompt. What differences do you observe in quality, detail, and token usage? 249 | 250 | ### Task 6: Error Handling 251 | Write a robust API wrapper that handles: 252 | - API key errors 253 | - Rate limiting 254 | - Network errors 255 | - Invalid model names 256 | 257 | --- 258 | 259 | ## 5. Summary / Key Takeaways 260 | 261 | - **LLMs** are neural networks trained on vast text data to understand and generate language 262 | - **Tokens** are the units LLMs process (not words); manage them carefully 263 | - **OpenAI API** provides access to powerful models via simple HTTP requests 264 | - **Temperature** controls creativity (low = focused, high = creative) 265 | - **Max tokens** limits response length 266 | - **LLMs have limitations**: hallucination, outdated info, no access to your documents 267 | - **RAG solves LLM limitations** by providing external knowledge 268 | - **Context matters**: LLMs use conversation history to maintain coherence 269 | - **Different models** have different capabilities and costs 270 | 271 | --- 272 | 273 | ## 6. Further Reading (Optional) 274 | 275 | - OpenAI API Documentation: [platform.openai.com/docs](https://platform.openai.com/docs) 276 | - "Attention Is All You Need" - The transformer paper (advanced) 277 | - OpenAI Cookbook: Examples and best practices 278 | - Token counting tools: tiktoken library 279 | 280 | --- 281 | 282 | **Next up:** Day 3 will teach you how to craft effective prompts! 283 | 284 | -------------------------------------------------------------------------------- /Day-01: Python Foundations for GenAI/README.md: -------------------------------------------------------------------------------- 1 | # Day 1 — Python Foundations for GenAI 2 | 3 | ## 1. Beginner-Friendly Introduction 4 | 5 | Welcome to Day 1! Before we dive into RAG and AI, we need to ensure you have a solid foundation in Python programming. Python is the primary language used in Generative AI development, and understanding its core concepts will make everything else much easier. 6 | 7 | **Why this matters for RAG:** 8 | 9 | - RAG systems are built primarily in Python 10 | - You'll need to work with data structures (lists, dictionaries) to handle documents 11 | - File handling is essential for reading PDFs, text files, and web content 12 | - API interactions are crucial for connecting to LLM services 13 | - Understanding functions and classes helps organize RAG code 14 | 15 | **Real-world context:** 16 | Think of Python as your toolkit. Just like a carpenter needs to know how to use a hammer before building a house, you need Python skills before building RAG systems. Every RAG application you'll build will use these fundamental concepts. 17 | 18 | --- 19 | 20 | ## 2. 
Deep-Dive Explanation 21 | 22 | ### Core Python Concepts for GenAI 23 | 24 | #### 2.1 Data Structures 25 | 26 | **Lists** - Ordered collections of items 27 | 28 | ```python 29 | documents = ["doc1.txt", "doc2.txt", "doc3.txt"] 30 | chunks = [] # Empty list to store text chunks 31 | ``` 32 | 33 | **Dictionaries** - Key-value pairs (perfect for storing metadata) 34 | 35 | ```python 36 | document_info = { 37 | "filename": "article.pdf", 38 | "page_count": 10, 39 | "author": "John Doe", 40 | "chunks": [] 41 | } 42 | ``` 43 | 44 | **Tuples** - Immutable ordered collections 45 | 46 | ```python 47 | api_config = ("https://api.openai.com", "v1", "gpt-4") 48 | ``` 49 | 50 | #### 2.2 File Handling 51 | 52 | Reading and writing files is essential for RAG: 53 | 54 | ``` 55 | File → Read → Process → Store 56 | ``` 57 | 58 | **Reading text files:** 59 | 60 | ```python 61 | with open("document.txt", "r", encoding="utf-8") as file: 62 | content = file.read() 63 | ``` 64 | 65 | **Writing to files:** 66 | 67 | ```python 68 | with open("output.txt", "w", encoding="utf-8") as file: 69 | file.write("Processed content") 70 | ``` 71 | 72 | #### 2.3 Functions and Classes 73 | 74 | **Functions** - Reusable blocks of code 75 | 76 | ```python 77 | def chunk_text(text, chunk_size=100): 78 | """Split text into chunks of specified size""" 79 | chunks = [] 80 | for i in range(0, len(text), chunk_size): 81 | chunks.append(text[i:i+chunk_size]) 82 | return chunks 83 | ``` 84 | 85 | **Classes** - Organizing related functionality 86 | 87 | ```python 88 | class DocumentProcessor: 89 | def __init__(self, filename): 90 | self.filename = filename 91 | self.content = "" 92 | 93 | def load(self): 94 | with open(self.filename, "r") as f: 95 | self.content = f.read() 96 | 97 | def get_word_count(self): 98 | return len(self.content.split()) 99 | ``` 100 | 101 | #### 2.4 Working with APIs 102 | 103 | RAG systems interact with APIs (like OpenAI): 104 | 105 | ``` 106 | Your Code → HTTP Request → API → Response → Your Code 107 | ``` 108 | 109 | **Basic API interaction pattern:** 110 | 111 | ```python 112 | import requests 113 | 114 | def call_api(url, data): 115 | response = requests.post(url, json=data) 116 | return response.json() 117 | ``` 118 | 119 | #### 2.5 List Comprehensions and Generators 120 | 121 | **List comprehensions** - Concise way to create lists 122 | 123 | ```python 124 | # Traditional way 125 | squares = [] 126 | for x in range(10): 127 | squares.append(x**2) 128 | 129 | # List comprehension 130 | squares = [x**2 for x in range(10)] 131 | ``` 132 | 133 | **Generators** - Memory-efficient iteration 134 | 135 | ```python 136 | def chunk_generator(text, chunk_size): 137 | for i in range(0, len(text), chunk_size): 138 | yield text[i:i+chunk_size] 139 | ``` 140 | 141 | --- 142 | 143 | ## 3. 
Instructor Examples 144 | 145 | ### Example 1: Reading and Processing a Text File 146 | 147 | ```python 148 | def read_and_process_file(filename): 149 | """Read a file and return processed content""" 150 | try: 151 | with open(filename, "r", encoding="utf-8") as file: 152 | content = file.read() 153 | 154 | # Basic processing 155 | lines = content.split("\n") 156 | word_count = len(content.split()) 157 | 158 | return { 159 | "content": content, 160 | "lines": len(lines), 161 | "words": word_count 162 | } 163 | except FileNotFoundError: 164 | print(f"File {filename} not found!") 165 | return None 166 | 167 | # Usage 168 | result = read_and_process_file("sample.txt") 169 | if result: 170 | print(f"Lines: {result['lines']}, Words: {result['words']}") 171 | ``` 172 | 173 | ### Example 2: Text Chunking Function 174 | 175 | ```python 176 | def chunk_text(text, chunk_size=200, overlap=50): 177 | """ 178 | Split text into overlapping chunks 179 | 180 | Args: 181 | text: Input text to chunk 182 | chunk_size: Size of each chunk 183 | overlap: Number of characters to overlap between chunks 184 | """ 185 | chunks = [] 186 | start = 0 187 | 188 | while start < len(text): 189 | end = start + chunk_size 190 | chunk = text[start:end] 191 | chunks.append(chunk) 192 | start = end - overlap # Overlap for context 193 | 194 | return chunks 195 | 196 | # Usage 197 | long_text = "This is a very long document..." * 100 198 | chunks = chunk_text(long_text, chunk_size=200, overlap=50) 199 | print(f"Created {len(chunks)} chunks") 200 | ``` 201 | 202 | ### Example 3: Working with Dictionaries for Document Metadata 203 | 204 | ```python 205 | class Document: 206 | def __init__(self, filename, content): 207 | self.filename = filename 208 | self.content = content 209 | self.metadata = { 210 | "word_count": len(content.split()), 211 | "char_count": len(content), 212 | "chunks": [] 213 | } 214 | 215 | def add_chunk(self, chunk_text, chunk_id): 216 | chunk_data = { 217 | "id": chunk_id, 218 | "text": chunk_text, 219 | "length": len(chunk_text) 220 | } 221 | self.metadata["chunks"].append(chunk_data) 222 | 223 | def get_summary(self): 224 | return { 225 | "filename": self.filename, 226 | "words": self.metadata["word_count"], 227 | "chunks": len(self.metadata["chunks"]) 228 | } 229 | 230 | # Usage 231 | doc = Document("article.txt", "This is the content of the article...") 232 | doc.add_chunk("First chunk", 1) 233 | doc.add_chunk("Second chunk", 2) 234 | print(doc.get_summary()) 235 | ``` 236 | 237 | ### Example 4: Simple API Request Pattern 238 | 239 | ```python 240 | import requests 241 | import json 242 | 243 | def make_api_request(url, payload, headers=None): 244 | """Make a POST request to an API""" 245 | default_headers = {"Content-Type": "application/json"} 246 | if headers: 247 | default_headers.update(headers) 248 | 249 | try: 250 | response = requests.post(url, json=payload, headers=default_headers) 251 | response.raise_for_status() # Raises exception for bad status codes 252 | return response.json() 253 | except requests.exceptions.RequestException as e: 254 | print(f"API request failed: {e}") 255 | return None 256 | 257 | # Usage pattern (you'll use this with OpenAI API later) 258 | # payload = {"prompt": "Hello, world!"} 259 | # result = make_api_request("https://api.example.com/endpoint", payload) 260 | ``` 261 | 262 | --- 263 | 264 | ## 4. 
Student Practice Tasks 265 | 266 | ### Task 1: File Reader Function 267 | 268 | Write a function that reads a file and returns: 269 | 270 | - The content as a string 271 | - The number of sentences (split by periods) 272 | - A list of unique words (lowercase) 273 | 274 | ### Task 2: Dictionary Manipulation 275 | 276 | Create a dictionary to store information about 3 documents. Each document should have: 277 | 278 | - `title` 279 | - `author` 280 | - `word_count` 281 | - `chunks` (a list) 282 | 283 | Then write a function that finds the document with the most words. 284 | 285 | ### Task 3: Text Processing 286 | 287 | Write a function that: 288 | 289 | 1. Takes a long string of text 290 | 2. Removes all punctuation 291 | 3. Converts to lowercase 292 | 4. Splits into words 293 | 5. Returns a dictionary with word frequencies 294 | 295 | ### Task 4: Chunking with Metadata 296 | 297 | Modify the chunking function to also return metadata for each chunk: 298 | 299 | - Chunk number 300 | - Start position 301 | - End position 302 | - Word count 303 | 304 | ### Task 5: Error Handling 305 | 306 | Write a robust file reader that handles: 307 | 308 | - File not found errors 309 | - Permission errors 310 | - Encoding errors 311 | - Empty files 312 | 313 | ### Task 6: List Comprehension Challenge 314 | 315 | Convert this code to use list comprehensions: 316 | 317 | ```python 318 | numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] 319 | even_squares = [] 320 | for num in numbers: 321 | if num % 2 == 0: 322 | even_squares.append(num ** 2) 323 | ``` 324 | 325 | --- 326 | 327 | ## 5. Summary / Key Takeaways 328 | 329 | - **Lists and dictionaries** are essential for storing documents and metadata in RAG systems 330 | - **File handling** with `with` statements ensures proper resource management 331 | - **Functions** help organize code and make it reusable 332 | - **Classes** provide structure for complex data and operations 333 | - **API interactions** will be crucial when connecting to LLM services 334 | - **List comprehensions** make code more Pythonic and readable 335 | - **Error handling** is important for robust applications 336 | - **Text processing** (chunking, splitting) is fundamental to RAG 337 | 338 | --- 339 | 340 | ## 6. Further Reading (Optional) 341 | 342 | - Python Official Documentation: [docs.python.org](https://docs.python.org/3/) 343 | - Real Python: Great tutorials on Python fundamentals 344 | - Python `requests` library documentation for API calls 345 | - PEP 8: Python style guide for writing clean code 346 | 347 | --- 348 | 349 | **Next up:** Day 2 will introduce you to Generative AI and Large Language Models! 350 | -------------------------------------------------------------------------------- /Day-08: RAG Using LangChain or LlamaIndex/README.md: -------------------------------------------------------------------------------- 1 | # Day 8 — RAG Using LangChain or LlamaIndex 2 | 3 | ## 1. Beginner-Friendly Introduction 4 | 5 | Now that you've built RAG from scratch, it's time to learn the frameworks that make it easier! **LangChain** and **LlamaIndex** are popular frameworks that abstract away the complexity and provide powerful features out of the box. 6 | 7 | **Why use frameworks?** 8 | - **Faster development**: Pre-built components 9 | - **Best practices**: Built-in optimizations 10 | - **More features**: Advanced capabilities 11 | - **Community**: Well-documented and supported 12 | - **Production-ready**: Battle-tested code 13 | 14 | **LangChain vs. 
LlamaIndex:**
- **LangChain**: General-purpose LLM framework, flexible
- **LlamaIndex**: Specialized for RAG, data-focused

**Today's goal:**
Learn to build RAG systems using these frameworks, understanding when to use which.

---

## 2. Deep-Dive Explanation

### 2.1 LangChain Overview

**What is LangChain?**
A framework for building LLM applications with:
- Document loaders
- Text splitters
- Vector stores
- Chains (workflows)
- Agents (autonomous systems)

**Key Components:**
- **Document Loaders**: Load from various sources
- **Text Splitters**: Chunk documents
- **Embeddings**: Generate embeddings
- **Vector Stores**: Store and search
- **Retrievers**: Retrieve relevant docs
- **Chains**: Combine components
- **LLMs**: Language models

### 2.2 LangChain RAG Pipeline

**Components:**
```
Document Loader → Text Splitter → Embeddings → Vector Store
                                                    ↓
User Query → Embeddings → Retriever → Context + Query → LLM → Answer
```

**LangChain Abstractions:**
- `Document`: Text with metadata
- `TextSplitter`: Chunks documents
- `Embeddings`: Embedding interface
- `VectorStore`: Vector database interface
- `Retriever`: Retrieval interface
- `Chain`: Composable workflows

### 2.3 LlamaIndex Overview

**What is LlamaIndex?**
A data framework for LLM applications, optimized for RAG:
- Data connectors
- Indexing
- Querying
- Retrieval
- Response synthesis

**Key Concepts:**
- **Index**: Structured data representation
- **Nodes**: Chunks with metadata
- **Retrievers**: Find relevant nodes
- **Query Engines**: Answer questions
- **Response Synthesizers**: Generate answers

### 2.4 LlamaIndex RAG Pipeline

**Components:**
```
Documents → Load Data → Parse → Build Index
                                    ↓
Query → Retrieve Nodes → Synthesize Response → Answer
```

**LlamaIndex Abstractions:**
- `Document`: Source document
- `Node`: Chunk with metadata
- `Index`: Structured data store
- `Retriever`: Retrieval logic
- `QueryEngine`: Query interface
- `ResponseSynthesizer`: Answer generation

### 2.5 When to Use Which?

**Use LangChain when:**
- Building general LLM applications
- Need flexibility and customization
- Want to combine multiple tools
- Building agents or complex workflows

**Use LlamaIndex when:**
- Focused on RAG applications
- Need advanced retrieval strategies
- Want optimized indexing
- Building data-centric applications

**You can use both!** They complement each other.

---

## 3. Instructor Examples

### Example 1: LangChain RAG

```python
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
import os

# Setup
os.environ["OPENAI_API_KEY"] = "your-key"

# 1. Load documents
loader = PyPDFLoader("document.pdf")
documents = loader.load()

# 2. Split text
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = text_splitter.split_documents(documents)

# 3. Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings
)

# 4. Create retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# 5. Create QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)

# 6. Query
result = qa_chain({"query": "What is the main topic?"})
print(result["result"])
print(f"Sources: {len(result['source_documents'])}")
```
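
The imports above match the pre-0.1 `langchain` package this example was written against. If `pip install langchain` gives you a newer split-package release, the same components live under different module paths; a sketch of the equivalents, assuming the `langchain-community`, `langchain-openai`, and `langchain-text-splitters` packages are installed:

```python
# Equivalent imports on post-0.1 LangChain; the rest of the example
# is unchanged apart from these module paths.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, OpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
```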
### Example 2: LangChain with Chat Models

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

# Use chat model
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")

# Add memory for conversation
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

# Create conversational chain
qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory
)

# Query with conversation
result = qa_chain({"question": "What is Python?"})
result = qa_chain({"question": "What are its main features?"})  # Remembers context
```

### Example 3: LlamaIndex RAG

```python
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms import OpenAI
import os

os.environ["OPENAI_API_KEY"] = "your-key"

# 1. Load documents
documents = SimpleDirectoryReader("./documents").load_data()

# 2. Create index (handles chunking, embedding, storage)
index = VectorStoreIndex.from_documents(documents)

# 3. Create query engine
query_engine = index.as_query_engine()

# 4.
Query 211 | response = query_engine.query("What is the main topic?") 212 | print(response) 213 | print(f"Source nodes: {len(response.source_nodes)}") 214 | ``` 215 | 216 | ### Example 4: LlamaIndex with Custom Settings 217 | 218 | ```python 219 | from llama_index import ( 220 | VectorStoreIndex, 221 | ServiceContext, 222 | StorageContext 223 | ) 224 | from llama_index.embeddings import OpenAIEmbedding 225 | from llama_index.node_parser import SimpleNodeParser 226 | from llama_index.vector_stores import ChromaVectorStore 227 | import chromadb 228 | 229 | # Custom service context 230 | service_context = ServiceContext.from_defaults( 231 | llm=OpenAI(temperature=0, model="gpt-3.5-turbo"), 232 | embed_model=OpenAIEmbedding(), 233 | node_parser=SimpleNodeParser.from_defaults(chunk_size=500) 234 | ) 235 | 236 | # Custom vector store 237 | chroma_client = chromadb.Client() 238 | chroma_collection = chroma_client.create_collection("rag_docs") 239 | vector_store = ChromaVectorStore(chroma_collection=chroma_collection) 240 | storage_context = StorageContext.from_defaults(vector_store=vector_store) 241 | 242 | # Create index with custom settings 243 | index = VectorStoreIndex.from_documents( 244 | documents, 245 | service_context=service_context, 246 | storage_context=storage_context 247 | ) 248 | 249 | # Query 250 | query_engine = index.as_query_engine(similarity_top_k=3) 251 | response = query_engine.query("Your question here") 252 | ``` 253 | 254 | ### Example 5: Comparing Both Frameworks 255 | 256 | ```python 257 | # LangChain approach 258 | from langchain.chains import RetrievalQA 259 | 260 | langchain_qa = RetrievalQA.from_chain_type( 261 | llm=llm, 262 | retriever=retriever, 263 | return_source_documents=True 264 | ) 265 | 266 | # LlamaIndex approach 267 | from llama_index import VectorStoreIndex 268 | 269 | llamaindex_index = VectorStoreIndex.from_documents(documents) 270 | llamaindex_qa = llamaindex_index.as_query_engine() 271 | 272 | # Both achieve similar results with different APIs 273 | ``` 274 | 275 | --- 276 | 277 | ## 4. Student Practice Tasks 278 | 279 | ### Task 1: LangChain RAG Setup 280 | Set up a basic LangChain RAG system: 281 | - Load documents 282 | - Create vector store 283 | - Build QA chain 284 | - Test with queries 285 | 286 | ### Task 2: LlamaIndex RAG Setup 287 | Set up a basic LlamaIndex RAG system: 288 | - Load documents 289 | - Create index 290 | - Build query engine 291 | - Test with queries 292 | 293 | ### Task 3: Custom Configuration 294 | Configure both frameworks with: 295 | - Custom chunk sizes 296 | - Different embedding models 297 | - Various LLM parameters 298 | - Compare results 299 | 300 | ### Task 4: Advanced Retrieval 301 | Experiment with: 302 | - Different retrieval strategies 303 | - Metadata filtering 304 | - Reranking 305 | - Hybrid search 306 | 307 | ### Task 5: Framework Comparison 308 | Build the same RAG system with both frameworks and compare: 309 | - Code complexity 310 | - Performance 311 | - Features 312 | - Ease of use 313 | 314 | ### Task 6: Integration 315 | Combine LangChain and LlamaIndex components in a single system. 316 | 317 | --- 318 | 319 | ## 5. 
Summary / Key Takeaways 320 | 321 | - **LangChain**: General-purpose LLM framework, flexible and composable 322 | - **LlamaIndex**: RAG-optimized framework, data-centric 323 | - **Both are powerful**: Choose based on your needs 324 | - **Pre-built components**: Save development time 325 | - **Best practices**: Frameworks include optimizations 326 | - **Active communities**: Well-documented and supported 327 | - **Production-ready**: Battle-tested code 328 | - **Can combine**: Use both frameworks together 329 | - **Learning curve**: Worth it for complex applications 330 | - **Abstraction**: Understand what's happening under the hood 331 | 332 | --- 333 | 334 | ## 6. Further Reading (Optional) 335 | 336 | - LangChain Documentation 337 | - LlamaIndex Documentation 338 | - Framework comparison articles 339 | - Community examples and tutorials 340 | 341 | --- 342 | 343 | **Next up:** Day 9 will cover advanced RAG techniques! 344 | 345 | -------------------------------------------------------------------------------- /Day-04: Chunking & Data Extraction (PDF-Web-Docs)/assignment.md: -------------------------------------------------------------------------------- 1 | # Day 4 — Assignment 2 | 3 | ## Instructions 4 | 5 | Complete these tasks to master data extraction and chunking. You'll work with PDFs, web pages, and implement various chunking strategies. Make sure you have the required libraries installed: 6 | 7 | ```bash 8 | pip install pypdf beautifulsoup4 requests 9 | ``` 10 | 11 | **Important:** 12 | - Test with real files/URLs 13 | - Handle errors gracefully 14 | - Consider edge cases (empty files, malformed HTML, etc.) 15 | - Document your chunking decisions 16 | 17 | --- 18 | 19 | ## Tasks 20 | 21 | ### Task 1: PDF Text Extractor 22 | 23 | Create a comprehensive PDF extractor `pdf_extractor.py` that: 24 | 25 | 1. Extracts text from PDF files 26 | 2. Returns structured data: 27 | - Full text 28 | - Text per page (list) 29 | - Total pages 30 | - Metadata (author, title if available) 31 | 3. Handles errors (corrupted files, password-protected, etc.) 32 | 4. Optionally extracts text from specific page ranges 33 | 34 | **Requirements:** 35 | - Use `pypdf` library 36 | - Add progress indication for large PDFs 37 | - Support batch processing (multiple PDFs) 38 | 39 | **Test with:** A multi-page PDF document 40 | 41 | **Deliverable:** `task1_pdf_extractor.py` 42 | 43 | --- 44 | 45 | ### Task 2: Web Content Scraper 46 | 47 | Build a web scraper `web_scraper.py` that: 48 | 49 | 1. Takes a URL as input 50 | 2. Extracts main content (removes navigation, ads, scripts) 51 | 3. Returns clean, readable text 52 | 4. Handles different website structures 53 | 5. Extracts metadata (title, author, date if available) 54 | 55 | **Requirements:** 56 | - Use `requests` and `BeautifulSoup` 57 | - Add proper headers (User-Agent) 58 | - Handle timeouts and errors 59 | - Support both single pages and article-style pages 60 | - Clean extracted text (remove extra whitespace, normalize) 61 | 62 | **Test with:** 63 | - A news article URL 64 | - A blog post URL 65 | - A Wikipedia page 66 | 67 | **Deliverable:** `task2_web_scraper.py` 68 | 69 | --- 70 | 71 | ### Task 3: Chunking Strategy Comparison 72 | 73 | Implement three different chunking strategies and compare them: 74 | 75 | 1. **Fixed-Size Chunking**: Split by character/word count 76 | 2. **Sentence-Aware Chunking**: Respect sentence boundaries 77 | 3. 
**Paragraph-Aware Chunking**: Respect paragraph boundaries 78 | 79 | **Requirements:** 80 | - All strategies should support overlap 81 | - Create a comparison function that: 82 | - Tests all three on the same text 83 | - Reports chunk count, average size, size variance 84 | - Shows sample chunks from each strategy 85 | - Visualize differences (print sample chunks side-by-side) 86 | 87 | **Test with:** A long text document (at least 2000 words) 88 | 89 | **Deliverable:** `task3_chunking_comparison.py` 90 | 91 | --- 92 | 93 | ### Task 4: Text Cleaning Pipeline 94 | 95 | Create a comprehensive text cleaning module `text_cleaner.py`: 96 | 97 | **Cleaning functions:** 98 | 1. Remove extra whitespace 99 | 2. Normalize line breaks 100 | 3. Remove special characters (configurable) 101 | 4. Remove headers/footers (detect common patterns) 102 | 5. Fix encoding issues 103 | 6. Remove URLs/email addresses (optional) 104 | 7. Normalize quotes and dashes 105 | 106 | **Requirements:** 107 | - Make each cleaning step optional/configurable 108 | - Create a `clean_text()` function that applies all steps 109 | - Test each step individually 110 | - Show before/after examples 111 | 112 | **Deliverable:** `task4_text_cleaner.py` 113 | 114 | --- 115 | 116 | ### Task 5: Chunk Metadata System 117 | 118 | Build a chunking system that stores rich metadata: 119 | 120 | **Metadata to include:** 121 | - `chunk_id`: Unique identifier 122 | - `source`: Source file/URL 123 | - `page_number`: Page number (for PDFs) 124 | - `chunk_index`: Position in document 125 | - `start_char`: Starting character position 126 | - `end_char`: Ending character position 127 | - `word_count`: Number of words 128 | - `char_count`: Number of characters 129 | - `timestamp`: When chunk was created 130 | - `preview`: First 50 characters (for quick preview) 131 | 132 | **Requirements:** 133 | - Create a `Chunk` class to store this data 134 | - Implement methods to: 135 | - Export chunks to JSON 136 | - Filter chunks by metadata 137 | - Get chunk statistics 138 | 139 | **Deliverable:** `task5_chunk_metadata.py` 140 | 141 | --- 142 | 143 | ## One Mini Project 144 | 145 | ### 📘 Build a PDF-to-Text Extractor and Chunker 146 | 147 | Create a complete application `document_processor.py` that processes documents and prepares them for RAG. 148 | 149 | **Features:** 150 | 151 | 1. **Multi-Format Support:** 152 | - PDF files 153 | - Text files (.txt, .md) 154 | - Web URLs 155 | - (Optional) Word documents (.docx) 156 | 157 | 2. **Processing Pipeline:** 158 | ``` 159 | Input → Extract → Clean → Chunk → Store → Report 160 | ``` 161 | 162 | 3. **Chunking Options:** 163 | - Chunk size (characters/words) 164 | - Overlap percentage 165 | - Strategy (fixed, sentence-aware, paragraph-aware) 166 | - Minimum chunk size 167 | 168 | 4. **Output Formats:** 169 | - JSON (structured chunks with metadata) 170 | - Text file (one chunk per line) 171 | - CSV (chunks with metadata columns) 172 | - Console display (formatted) 173 | 174 | 5. **Batch Processing:** 175 | - Process multiple files 176 | - Process entire directories 177 | - Progress tracking 178 | - Error reporting 179 | 180 | 6. **Statistics and Reporting:** 181 | - Total chunks created 182 | - Average chunk size 183 | - Size distribution 184 | - Processing time 185 | - Source information 186 | 187 | 7. **Interactive CLI:** 188 | ``` 189 | === Document Processor === 190 | 1. Process single file 191 | 2. Process directory 192 | 3. Process web URL 193 | 4. Configure chunking settings 194 | 5. View statistics 195 | 6. 
Export results 196 | 7. Exit 197 | ``` 198 | 199 | **Requirements:** 200 | - Use classes for organization 201 | - Implement proper error handling 202 | - Add progress bars for long operations 203 | - Support command-line arguments 204 | - Create a configuration system (JSON/YAML) 205 | - Generate detailed reports 206 | - Store results in organized folders 207 | 208 | **Example Usage:** 209 | ```bash 210 | # Command line 211 | python document_processor.py input.pdf --chunk-size 500 --overlap 50 --output json 212 | 213 | # Interactive mode 214 | python document_processor.py 215 | ``` 216 | 217 | **Example Output:** 218 | ``` 219 | Processing: document.pdf 220 | ✓ Extracted 15 pages 221 | ✓ Cleaned text (removed 234 extra spaces) 222 | ✓ Created 42 chunks 223 | ✓ Average chunk size: 487 words 224 | ✓ Processing time: 2.3 seconds 225 | 226 | Chunks saved to: output/document_chunks.json 227 | Statistics saved to: output/document_stats.txt 228 | ``` 229 | 230 | **Advanced Features (Bonus):** 231 | - OCR support for scanned PDFs 232 | - Table extraction 233 | - Image extraction 234 | - Language detection 235 | - Duplicate detection 236 | - Chunk quality scoring 237 | 238 | **Deliverables:** 239 | - `document_processor.py` - Main application 240 | - `config.json` - Configuration file template 241 | - `requirements.txt` - Dependencies 242 | - `README_processor.md` - Usage documentation 243 | - Sample output files demonstrating functionality 244 | 245 | --- 246 | 247 | ## Expected Output Section 248 | 249 | ### Task 1 Expected Output: 250 | ```python 251 | result = extract_pdf("document.pdf") 252 | # Output: 253 | { 254 | "full_text": "Complete text...", 255 | "pages": [ 256 | "Page 1 text...", 257 | "Page 2 text...", 258 | ... 259 | ], 260 | "total_pages": 15, 261 | "metadata": { 262 | "title": "Sample Document", 263 | "author": "John Doe" 264 | } 265 | } 266 | ``` 267 | 268 | ### Task 2 Expected Output: 269 | ```python 270 | content = scrape_web("https://example.com/article") 271 | # Output: 272 | { 273 | "title": "Article Title", 274 | "content": "Clean article text...", 275 | "author": "Author Name", 276 | "date": "2024-01-15", 277 | "word_count": 1234 278 | } 279 | ``` 280 | 281 | ### Task 3 Expected Output: 282 | ``` 283 | === Chunking Strategy Comparison === 284 | 285 | Text: 2500 words 286 | 287 | Fixed-Size Chunking: 288 | - Chunks: 5 289 | - Avg size: 500 words 290 | - Size variance: 0 words 291 | - Sample: "This is the first chunk of text..." 292 | 293 | Sentence-Aware Chunking: 294 | - Chunks: 6 295 | - Avg size: 417 words 296 | - Size variance: 45 words 297 | - Sample: "This is the first chunk. It respects..." 298 | 299 | Paragraph-Aware Chunking: 300 | - Chunks: 4 301 | - Avg size: 625 words 302 | - Size variance: 120 words 303 | - Sample: "This is a complete paragraph. It contains..." 304 | ``` 305 | 306 | ### Task 5 Expected Output: 307 | ```python 308 | chunks = chunk_with_metadata(text, source="doc.pdf") 309 | # Output: List of Chunk objects 310 | [ 311 | Chunk( 312 | chunk_id=1, 313 | source="doc.pdf", 314 | page_number=1, 315 | start_char=0, 316 | end_char=500, 317 | word_count=75, 318 | preview="This is the beginning of the chunk..." 319 | ), 320 | ... 
321 | ] 322 | ``` 323 | 324 | ### Mini Project Expected Output: 325 | 326 | The document processor should provide: 327 | - Clear progress indicators 328 | - Detailed statistics 329 | - Multiple output formats 330 | - Error handling and reporting 331 | - Professional CLI interface 332 | 333 | **Example session:** 334 | ``` 335 | === Document Processor === 336 | Choose option: 1 337 | 338 | Enter file path: document.pdf 339 | Chunk size [500]: 400 340 | Overlap [50]: 40 341 | Strategy [fixed/sentence/paragraph]: sentence 342 | 343 | [Processing...] 344 | ✓ Extracted 15 pages 345 | ✓ Created 38 chunks 346 | ✓ Saved to output/document_chunks.json 347 | 348 | Statistics: 349 | - Total words: 15,234 350 | - Chunks: 38 351 | - Avg chunk size: 401 words 352 | - Processing time: 2.1s 353 | 354 | [1] View chunks 355 | [2] Export to CSV 356 | [3] Process another file 357 | [4] Main menu 358 | ``` 359 | 360 | --- 361 | 362 | ## Submission Checklist 363 | 364 | - [ ] Task 1: PDF extractor working 365 | - [ ] Task 2: Web scraper functional 366 | - [ ] Task 3: Chunking comparison complete 367 | - [ ] Task 4: Text cleaning pipeline implemented 368 | - [ ] Task 5: Metadata system working 369 | - [ ] Mini project: Complete document processor 370 | - [ ] All code handles errors gracefully 371 | - [ ] Code is well-documented 372 | - [ ] Tested with real files/URLs 373 | 374 | **Remember:** Good data extraction and chunking are crucial for RAG quality! 375 | 376 | **Good luck!** 🚀 377 | 378 | -------------------------------------------------------------------------------- /Day-03: Prompt Engineering Essentials/assignment.md: -------------------------------------------------------------------------------- 1 | # Day 3 — Assignment 2 | 3 | ## Instructions 4 | 5 | Complete these tasks to master prompt engineering. You'll create various prompt templates and test them with the OpenAI API. Focus on: 6 | - Clarity and specificity 7 | - Proper structure 8 | - Effective use of examples 9 | - Context handling 10 | 11 | **Important:** 12 | - Test all prompts with actual API calls 13 | - Compare different prompt variations 14 | - Document what works and what doesn't 15 | - Save your best prompts as reusable templates 16 | 17 | --- 18 | 19 | ## Tasks 20 | 21 | ### Task 1: Prompt Improvement Challenge 22 | 23 | Take these 5 vague prompts and rewrite them to be specific, clear, and effective: 24 | 25 | 1. "Tell me about machine learning" 26 | 2. "Fix this code" 27 | 3. "Summarize this" 28 | 4. "What's the best way?" 29 | 5. 
"Explain this document" 30 | 31 | For each: 32 | - Write an improved version 33 | - Explain why your version is better 34 | - Test both versions with the API 35 | - Compare the results 36 | 37 | **Deliverable:** `task1_prompt_improvements.py` with both old and new prompts, plus comparison results 38 | 39 | --- 40 | 41 | ### Task 2: Role-Based Prompt System 42 | 43 | Create a system that generates prompts based on different AI roles: 44 | 45 | **Roles to implement:** 46 | - `coding_tutor`: Explains programming concepts to beginners 47 | - `business_analyst`: Analyzes business problems 48 | - `creative_writer`: Helps with creative writing 49 | - `data_scientist`: Explains data science concepts 50 | - `technical_writer`: Creates technical documentation 51 | 52 | **Requirements:** 53 | - Create a function `generate_role_prompt(role, user_input)` 54 | - Each role should have a distinct personality and style 55 | - Test each role with the same input to see how responses differ 56 | 57 | **Deliverable:** `task2_role_prompts.py` 58 | 59 | --- 60 | 61 | ### Task 3: Few-Shot Learning Templates 62 | 63 | Create few-shot prompt templates for these tasks: 64 | 65 | 1. **Text Classification**: Classify customer reviews as positive, negative, or neutral 66 | 2. **Format Conversion**: Convert informal text to formal business language 67 | 3. **Information Extraction**: Extract names, dates, and locations from text 68 | 4. **Code Translation**: Convert Python code to pseudocode 69 | 70 | **Requirements:** 71 | - Each template should have 3-5 examples 72 | - Create reusable functions 73 | - Test with new inputs 74 | 75 | **Deliverable:** `task3_fewshot_templates.py` 76 | 77 | --- 78 | 79 | ### Task 4: Chain-of-Thought Problem Solver 80 | 81 | Build a problem-solving system using chain-of-thought prompting: 82 | 83 | **Problem types to handle:** 84 | - Math word problems 85 | - Logic puzzles 86 | - Code debugging scenarios 87 | - Decision-making problems 88 | 89 | **Requirements:** 90 | - Create a function that formats problems with CoT instructions 91 | - The prompt should encourage step-by-step reasoning 92 | - Extract and display the reasoning steps from the response 93 | 94 | **Example:** 95 | ```python 96 | problem = "If 3 apples cost $2, how much do 9 apples cost?" 97 | solution = solve_with_cot(problem) 98 | # Should show: Step 1, Step 2, Step 3, Final Answer 99 | ``` 100 | 101 | **Deliverable:** `task4_cot_solver.py` 102 | 103 | --- 104 | 105 | ### Task 5: RAG Prompt Template Builder 106 | 107 | Create a comprehensive RAG prompt template system: 108 | 109 | **Features:** 110 | 1. **Context Injection**: Add retrieved documents to prompt 111 | 2. **Citation Support**: Include instructions for citing sources 112 | 3. **Answer Formatting**: Specify output format (paragraph, bullet points, JSON) 113 | 4. **Fallback Handling**: Instructions for when context is insufficient 114 | 5. 
**Multi-document Synthesis**: Handle multiple relevant documents 115 | 116 | **Requirements:** 117 | - Create a class `RAGPromptBuilder` 118 | - Methods: 119 | - `add_context(documents)` - Add retrieved documents 120 | - `set_question(question)` - Set the question 121 | - `set_format(format_type)` - Set output format 122 | - `enable_citations(enable)` - Toggle citation requirements 123 | - `build()` - Generate final prompt 124 | 125 | **Example usage:** 126 | ```python 127 | builder = RAGPromptBuilder() 128 | builder.add_context(["Doc 1 content", "Doc 2 content"]) 129 | builder.set_question("What is RAG?") 130 | builder.set_format("bullet_points") 131 | builder.enable_citations(True) 132 | prompt = builder.build() 133 | ``` 134 | 135 | **Deliverable:** `task5_rag_prompt_builder.py` 136 | 137 | --- 138 | 139 | ## One Mini Project 140 | 141 | ### 🎯 Build a Prompt Engineering Playground 142 | 143 | Create an interactive application `prompt_playground.py` that allows users to experiment with different prompt engineering techniques. 144 | 145 | **Features:** 146 | 147 | 1. **Main Menu:** 148 | ``` 149 | === Prompt Engineering Playground === 150 | 1. Basic Prompt Tester 151 | 2. Role-Based Prompts 152 | 3. Few-Shot Learning 153 | 4. Chain-of-Thought 154 | 5. RAG Prompt Builder 155 | 6. Prompt Comparison Tool 156 | 7. Save/Load Prompts 157 | 8. Exit 158 | ``` 159 | 160 | 2. **Basic Prompt Tester:** 161 | - Enter a prompt 162 | - Adjust parameters (temperature, max_tokens) 163 | - View response 164 | - Rate the response quality (1-5) 165 | - Save prompts and ratings 166 | 167 | 3. **Role-Based Prompts:** 168 | - Select from predefined roles 169 | - Enter your input 170 | - See how different roles respond 171 | - Create custom roles 172 | 173 | 4. **Few-Shot Learning:** 174 | - Add examples interactively 175 | - Test with new inputs 176 | - Compare results with/without examples 177 | - Save example sets 178 | 179 | 5. **Chain-of-Thought:** 180 | - Enter a problem 181 | - View step-by-step reasoning 182 | - Extract and highlight reasoning steps 183 | - Compare with direct answers 184 | 185 | 6. **RAG Prompt Builder:** 186 | - Add context documents 187 | - Set question 188 | - Configure options (citations, format, etc.) 189 | - Generate and test prompt 190 | - Save templates 191 | 192 | 7. **Prompt Comparison Tool:** 193 | - Enter multiple prompt variations 194 | - Test all with the same input 195 | - Side-by-side comparison 196 | - Quality scoring 197 | - Export comparison results 198 | 199 | 8. **Save/Load Prompts:** 200 | - Save successful prompts to JSON 201 | - Load saved prompts 202 | - Organize by category 203 | - Search prompts 204 | 205 | **Advanced Features:** 206 | - **A/B Testing**: Compare two prompts statistically 207 | - **Prompt Library**: Pre-built prompts for common tasks 208 | - **Response Analyzer**: Analyze response quality (length, structure, etc.) 
209 | - **Token Optimizer**: Suggest ways to reduce token usage
210 | - **Export Options**: Export prompts and results to various formats
211 | 
212 | **Requirements:**
213 | - Use classes for organization
214 | - Store prompts and results in JSON files
215 | - Implement a clean CLI interface
216 | - Add color coding for better UX (optional)
217 | - Include help/documentation
218 | - Handle errors gracefully
219 | 
220 | **Example Interaction:**
221 | ```
222 | === Prompt Engineering Playground ===
223 | Choose option: 1
224 | 
225 | Enter your prompt: Explain quantum computing
226 | Temperature [0.7]: 0.5
227 | Max tokens [200]: 150
228 | 
229 | [Processing...]
230 | 
231 | Response:
232 | Quantum computing uses quantum mechanical phenomena...
233 | 
234 | Tokens: 87
235 | Rate this response (1-5): 4
236 | 
237 | [1] Try different parameters
238 | [2] Save this prompt
239 | [3] Compare with another prompt
240 | [4] Main menu
241 | ```
242 | 
243 | **Deliverables:**
244 | - `prompt_playground.py` - Main application
245 | - `prompt_templates.json` - Saved prompt templates
246 | - `requirements.txt` - Dependencies
247 | - `README_playground.md` - Usage guide
248 | - Sample output demonstrating all features
249 | 
250 | ---
251 | 
252 | ## Expected Output Section
253 | 
254 | ### Task 1 Expected Output:
255 | ```
256 | === Prompt Comparison ===
257 | 
258 | Original: "Tell me about machine learning"
259 | Improved: "Explain machine learning in 3 paragraphs, covering:
260 | 1. What it is
261 | 2. Common applications
262 | 3. Key algorithms"
263 | 
264 | Results:
265 | - Original: Generic, unfocused response (156 tokens)
266 | - Improved: Structured, comprehensive response (203 tokens)
267 | - Quality: Improved version is 40% more informative
268 | ```
269 | 
270 | ### Task 2 Expected Output:
271 | ```
272 | === Role-Based Prompts ===
273 | Input: "How do I learn Python?"
274 | 
275 | Coding Tutor:
276 | "Start with basics: variables, data types, and functions.
277 | Practice daily with small projects..."
278 | 
279 | Business Analyst:
280 | "Python is valuable for data analysis. Focus on pandas
281 | and data visualization libraries..."
282 | 
283 | [Different perspectives based on role]
284 | ```
285 | 
286 | ### Task 4 Expected Output:
287 | ```
288 | === Chain-of-Thought Solver ===
289 | Problem: "If 3 apples cost $2, how much do 9 apples cost?"
290 | 
291 | Step 1: Identify what we need to find
292 | → Cost of 9 apples
293 | 
294 | Step 2: Find cost per apple
295 | → $2 ÷ 3 ≈ $0.67 per apple
296 | 
297 | Step 3: Calculate cost of 9 apples
298 | → ($2 ÷ 3) × 9 = $2 × 3 = $6
299 | 
300 | Final Answer: 9 apples cost $6
301 | ```
302 | 
303 | ### Task 5 Expected Output:
304 | ```python
305 | builder = RAGPromptBuilder()
306 | builder.add_context([
307 |     "RAG combines retrieval and generation...",
308 |     "Vector databases store embeddings..."
309 | ])
310 | builder.set_question("How does RAG work?")
311 | builder.set_format("bullet_points")
312 | builder.enable_citations(True)
313 | 
314 | Generated Prompt:
315 | """
316 | You are a helpful assistant...
317 | 
318 | Documents:
319 | [Document 1]
320 | RAG combines retrieval and generation...
321 | 
322 | [Document 2]
323 | Vector databases store embeddings...
324 | 
325 | Question: How does RAG work?
326 | 
327 | Instructions:
328 | - Answer using the documents above
329 | - Cite your sources
330 | - Format as bullet points
331 | ... 
332 | """ 333 | ``` 334 | 335 | ### Mini Project Expected Output: 336 | 337 | The playground should provide a comprehensive, user-friendly interface for experimenting with prompts: 338 | 339 | - Intuitive menu navigation 340 | - Real-time prompt testing 341 | - Side-by-side comparisons 342 | - Quality metrics and analysis 343 | - Export capabilities 344 | - Professional presentation 345 | 346 | **Example session:** 347 | ``` 348 | === Prompt Engineering Playground === 349 | 1. Basic Prompt Tester 350 | ... 351 | Choose: 6 352 | 353 | === Prompt Comparison Tool === 354 | Enter prompt 1: Explain AI simply 355 | Enter prompt 2: You are a teacher. Explain AI to a 10-year-old. 356 | 357 | [Testing both prompts...] 358 | 359 | Results: 360 | Prompt 1: 145 tokens, Generic explanation 361 | Prompt 2: 167 tokens, Age-appropriate, engaging explanation 362 | 363 | Winner: Prompt 2 (Better engagement, clearer structure) 364 | ``` 365 | 366 | --- 367 | 368 | ## Submission Checklist 369 | 370 | - [ ] Task 1: Prompt improvements completed and tested 371 | - [ ] Task 2: Role-based system functional 372 | - [ ] Task 3: Few-shot templates created 373 | - [ ] Task 4: CoT solver working 374 | - [ ] Task 5: RAG prompt builder implemented 375 | - [ ] Mini project: Full playground application 376 | - [ ] All prompts tested with API 377 | - [ ] Results documented and compared 378 | - [ ] Code is well-organized and commented 379 | 380 | **Remember:** Good prompts are the foundation of great RAG systems! 381 | 382 | **Good luck!** 🚀 383 | 384 | -------------------------------------------------------------------------------- /Day-03: Prompt Engineering Essentials/README.md: -------------------------------------------------------------------------------- 1 | # Day 3 — Prompt Engineering Essentials 2 | 3 | ## 1. Beginner-Friendly Introduction 4 | 5 | **Prompt Engineering** is the art and science of crafting instructions that get the best results from LLMs. Think of it as learning to communicate effectively with AI—the better your prompts, the better the AI's responses. 6 | 7 | **Why this matters for RAG:** 8 | - RAG systems rely heavily on well-crafted prompts 9 | - You'll need to prompt LLMs to answer questions based on retrieved context 10 | - Effective prompts improve answer quality and reduce hallucinations 11 | - Prompt engineering is a core skill for building production RAG applications 12 | 13 | **Real-world context:** 14 | Imagine asking a librarian a vague question vs. a specific one: 15 | - ❌ "Tell me about space" (too vague) 16 | - ✅ "What are the key differences between stars and planets? Explain in 3 bullet points." (specific and clear) 17 | 18 | Prompt engineering is about being that specific librarian question—clear, structured, and goal-oriented. 19 | 20 | --- 21 | 22 | ## 2. Deep-Dive Explanation 23 | 24 | ### 2.1 What is a Prompt? 25 | 26 | A **prompt** is the input text you send to an LLM. It can be: 27 | - A simple question 28 | - Instructions with examples 29 | - A conversation history 30 | - Structured templates 31 | 32 | **Prompt Structure:** 33 | ``` 34 | [System Message] + [Context] + [User Question] + [Format Instructions] 35 | ``` 36 | 37 | ### 2.2 Core Prompt Engineering Techniques 38 | 39 | #### 2.2.1 Be Specific and Clear 40 | 41 | **Bad:** 42 | ``` 43 | Tell me about Python. 44 | ``` 45 | 46 | **Good:** 47 | ``` 48 | Explain Python programming language in 3 sentences, focusing on: 49 | 1. What it's used for 50 | 2. Key features 51 | 3. 
Why it's popular for AI 52 | ``` 53 | 54 | #### 2.2.2 Use Role-Playing 55 | 56 | Assign a role to the AI: 57 | ``` 58 | You are an expert Python tutor. Explain variables to a beginner programmer. 59 | ``` 60 | 61 | #### 2.2.3 Provide Examples (Few-Shot Learning) 62 | 63 | Show the AI what you want: 64 | ``` 65 | Example 1: 66 | Input: "Python is easy" 67 | Output: "Python is beginner-friendly" 68 | 69 | Example 2: 70 | Input: "AI is powerful" 71 | Output: "AI has transformative capabilities" 72 | 73 | Now convert: "RAG is useful" 74 | ``` 75 | 76 | #### 2.2.4 Chain-of-Thought (CoT) 77 | 78 | Encourage step-by-step reasoning: 79 | ``` 80 | Solve: 15 * 23 81 | 82 | Let's think step by step: 83 | 1. First, multiply 15 by 20 = 300 84 | 2. Then, multiply 15 by 3 = 45 85 | 3. Finally, add 300 + 45 = 345 86 | ``` 87 | 88 | #### 2.2.5 Output Formatting 89 | 90 | Specify the format you want: 91 | ``` 92 | List 5 programming languages. Format as JSON: 93 | { 94 | "languages": [ 95 | {"name": "...", "year": "..."} 96 | ] 97 | } 98 | ``` 99 | 100 | ### 2.3 Prompt Patterns for RAG 101 | 102 | #### 2.3.1 Context Injection Pattern 103 | 104 | ``` 105 | Use the following context to answer the question: 106 | 107 | Context: 108 | {retrieved_documents} 109 | 110 | Question: {user_question} 111 | 112 | Answer based only on the provided context. If the context doesn't contain enough information, say "I don't have enough information." 113 | ``` 114 | 115 | #### 2.3.2 Answer with Citations 116 | 117 | ``` 118 | Based on the following documents, answer the question and cite your sources: 119 | 120 | Documents: 121 | {document_1} 122 | {document_2} 123 | 124 | Question: {question} 125 | 126 | Format your answer as: 127 | Answer: [your answer] 128 | Sources: [document numbers] 129 | ``` 130 | 131 | #### 2.3.3 Multi-Step Reasoning 132 | 133 | ``` 134 | Given the context below, follow these steps: 135 | 1. Identify key information 136 | 2. Analyze the relationships 137 | 3. Synthesize an answer 138 | 139 | Context: {context} 140 | Question: {question} 141 | ``` 142 | 143 | ### 2.4 Common Prompt Mistakes 144 | 145 | **Mistake 1: Being Too Vague** 146 | - ❌ "Explain this" 147 | - ✅ "Summarize the main points in 3 bullet points" 148 | 149 | **Mistake 2: Not Providing Context** 150 | - ❌ Asking about specific documents without including them 151 | - ✅ Including relevant context in the prompt 152 | 153 | **Mistake 3: Ambiguous Instructions** 154 | - ❌ "Make it better" 155 | - ✅ "Rewrite this sentence to be more concise and professional" 156 | 157 | **Mistake 4: Ignoring Token Limits** 158 | - ❌ Including too much context 159 | - ✅ Being selective about what context to include 160 | 161 | ### 2.5 Prompt Templates 162 | 163 | **Template 1: Question Answering** 164 | ``` 165 | Context: {context} 166 | 167 | Question: {question} 168 | 169 | Instructions: 170 | - Answer based only on the provided context 171 | - If the answer isn't in the context, say so 172 | - Be concise but complete 173 | ``` 174 | 175 | **Template 2: Summarization** 176 | ``` 177 | Summarize the following text in {number} sentences: 178 | 179 | {text} 180 | 181 | Focus on: {key_points} 182 | ``` 183 | 184 | **Template 3: Extraction** 185 | ``` 186 | Extract the following information from the text: 187 | - Names 188 | - Dates 189 | - Key facts 190 | 191 | Text: {text} 192 | 193 | Format as JSON. 194 | ``` 195 | 196 | --- 197 | 198 | ## 3. 
Instructor Examples
199 | 
200 | ### Example 1: Basic Prompt with Context
201 | ```python
202 | import openai  # reads OPENAI_API_KEY from your environment (see Day 2); shared by all examples below
203 | 
204 | def answer_with_context(context, question):
205 |     """Answer a question using provided context"""
206 |     prompt = f"""Use the following information to answer the question.
207 | 
208 | Information:
209 | {context}
210 | 
211 | Question: {question}
212 | 
213 | Answer the question based only on the information provided above.
214 | If the information doesn't contain the answer, say "I don't have enough information."
215 | """
216 |     response = openai.ChatCompletion.create(
217 |         model="gpt-3.5-turbo",
218 |         messages=[{"role": "user", "content": prompt}],
219 |         temperature=0.3  # Lower temperature for factual answers
220 |     )
221 | 
222 |     return response.choices[0].message.content
223 | 
224 | # Usage
225 | context = "Python was created by Guido van Rossum in 1991."
226 | question = "Who created Python?"
227 | answer = answer_with_context(context, question)
228 | print(answer)
229 | ```
230 | 
231 | ### Example 2: Few-Shot Learning
232 | 
233 | ```python
234 | def classify_sentiment_fewshot(text):
235 |     """Classify sentiment using few-shot examples"""
236 |     prompt = f"""Classify the sentiment of the following text as positive, negative, or neutral.
237 | 
238 | Examples:
239 | Text: "I love this product!"
240 | Sentiment: positive
241 | 
242 | Text: "This is terrible."
243 | Sentiment: negative
244 | 
245 | Text: "The weather is okay."
246 | Sentiment: neutral
247 | 
248 | Now classify:
249 | Text: "{text}"
250 | Sentiment:"""
251 | 
252 |     response = openai.ChatCompletion.create(
253 |         model="gpt-3.5-turbo",
254 |         messages=[{"role": "user", "content": prompt}],
255 |         temperature=0.1  # Very low for classification
256 |     )
257 | 
258 |     return response.choices[0].message.content.strip()
259 | 
260 | # Usage
261 | result = classify_sentiment_fewshot("This movie was amazing!")
262 | print(result)  # positive
263 | ```
264 | 
265 | ### Example 3: Chain-of-Thought Prompting
266 | 
267 | ```python
268 | def solve_problem_cot(problem):
269 |     """Solve a problem using chain-of-thought reasoning"""
270 |     prompt = f"""Solve the following problem step by step.
271 | 
272 | Problem: {problem}
273 | 
274 | Let's think through this step by step:
275 | 1. First, identify what we need to find
276 | 2. List the information we have
277 | 3. Determine the approach
278 | 4. Solve step by step
279 | 5. Verify the answer
280 | 
281 | Solution:"""
282 | 
283 |     response = openai.ChatCompletion.create(
284 |         model="gpt-3.5-turbo",
285 |         messages=[{"role": "user", "content": prompt}],
286 |         temperature=0.3,
287 |         max_tokens=300
288 |     )
289 | 
290 |     return response.choices[0].message.content
291 | 
292 | # Usage
293 | problem = "If a train travels 120 km in 2 hours, what's its average speed?"
294 | solution = solve_problem_cot(problem)
295 | print(solution)
296 | ```
297 | 
298 | ### Example 4: RAG-Style Prompt Template
299 | 
300 | ```python
301 | def rag_prompt_template(context_chunks, question):
302 |     """Create a RAG-style prompt with multiple context chunks"""
303 |     context_text = "\n\n".join([
304 |         f"[Document {i+1}]\n{chunk}"
305 |         for i, chunk in enumerate(context_chunks)
306 |     ])
307 | 
308 |     prompt = f"""You are a helpful assistant that answers questions based on provided documents.
309 | 
310 | Documents:
311 | {context_text}
312 | 
313 | Question: {question}
314 | 
315 | Instructions:
316 | 1. Answer the question using information from the documents above
317 | 2. 
If multiple documents are relevant, synthesize information from all of them 318 | 3. Cite which document(s) you used (e.g., "According to Document 1...") 319 | 4. If the documents don't contain enough information, say so clearly 320 | 5. Be specific and accurate 321 | 322 | Answer:""" 323 | 324 | return prompt 325 | 326 | # Usage 327 | chunks = [ 328 | "Python is a programming language created in 1991.", 329 | "RAG stands for Retrieval-Augmented Generation." 330 | ] 331 | question = "What is Python?" 332 | prompt = rag_prompt_template(chunks, question) 333 | # Use this prompt with OpenAI API 334 | ``` 335 | 336 | --- 337 | 338 | ## 4. Student Practice Tasks 339 | 340 | ### Task 1: Basic Prompt Improvement 341 | Take these vague prompts and rewrite them to be specific and clear: 342 | - "Tell me about AI" 343 | - "Explain this code" 344 | - "What should I do?" 345 | 346 | ### Task 2: Role-Playing Prompts 347 | Create prompts that assign different roles to the AI: 348 | - A coding tutor 349 | - A business consultant 350 | - A creative writer 351 | - A data analyst 352 | 353 | ### Task 3: Few-Shot Examples 354 | Create a few-shot prompt for: 355 | - Classifying emails as spam/not spam 356 | - Converting text to a specific format 357 | - Extracting key information 358 | 359 | ### Task 4: Chain-of-Thought 360 | Write a CoT prompt for: 361 | - Solving a math word problem 362 | - Debugging code 363 | - Making a decision 364 | 365 | ### Task 5: RAG Prompt Template 366 | Create a reusable RAG prompt template function that: 367 | - Takes context and question 368 | - Includes instructions 369 | - Specifies output format 370 | - Handles cases where context is insufficient 371 | 372 | ### Task 6: Prompt Comparison 373 | Test the same question with: 374 | - A basic prompt 375 | - An improved prompt with context 376 | - A prompt with examples 377 | Compare the quality of responses. 378 | 379 | --- 380 | 381 | ## 5. Summary / Key Takeaways 382 | 383 | - **Be specific**: Clear, detailed prompts get better results 384 | - **Use roles**: Assigning roles helps guide AI behavior 385 | - **Few-shot learning**: Examples teach the AI what you want 386 | - **Chain-of-thought**: Encourages step-by-step reasoning 387 | - **Format instructions**: Specify the output format you need 388 | - **Context matters**: In RAG, always include relevant context 389 | - **Temperature settings**: Lower for factual, higher for creative 390 | - **Iterate**: Prompt engineering is iterative—refine based on results 391 | - **Test variations**: Try different phrasings to find what works best 392 | 393 | --- 394 | 395 | ## 6. Further Reading (Optional) 396 | 397 | - OpenAI Prompt Engineering Guide 398 | - "Prompt Engineering for LLMs" by Lilian Weng 399 | - LangChain Prompt Templates documentation 400 | - Anthropic's Prompt Engineering resources 401 | 402 | --- 403 | 404 | **Next up:** Day 4 will teach you how to extract and chunk data from various sources! 405 | 406 | -------------------------------------------------------------------------------- /Day-10: Build & Deploy a RAG Application (FastAPI-Streamlit)/README.md: -------------------------------------------------------------------------------- 1 | # Day 10 — Build & Deploy a RAG Application (FastAPI/Streamlit) 2 | 3 | ## 1. Beginner-Friendly Introduction 4 | 5 | Congratulations! You've reached the final day. Today, you'll build and deploy a complete RAG application that others can use. You'll create a web interface and API so your RAG system is accessible and production-ready. 
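
To see the destination before you start, here is a minimal sketch of how any client will talk to the finished system. It calls the `/query` endpoint you will build in Example 1 below; the `localhost` URL, the port, and the `main` module name are assumptions for a local run:

```python
import requests

# Query the finished RAG API. Assumes the FastAPI backend from Example 1
# is running locally, e.g. via: uvicorn main:app --port 8000
response = requests.post(
    "http://localhost:8000/query",
    json={"question": "What is RAG?", "k": 3},  # matches the QueryRequest model below
)

result = response.json()
print(result["answer"])        # the generated answer
print(len(result["sources"]))  # how many source chunks were retrieved
```

Everything in today's lesson works toward making that single call possible.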
6 | 7 | **What you'll build:** 8 | 9 | - **FastAPI Backend**: REST API for your RAG system 10 | - **Streamlit Frontend**: User-friendly web interface 11 | - **Deployment**: Make it accessible to others 12 | 13 | **Why this matters:** 14 | 15 | - Real applications need interfaces 16 | - APIs allow integration with other systems 17 | - Web interfaces make systems accessible 18 | - Deployment makes your work usable 19 | 20 | **Real-world context:** 21 | Your RAG system is powerful, but it's just code. To make it useful, you need: 22 | 23 | - A way for users to interact (web UI) 24 | - A way for other systems to use it (API) 25 | - A way to access it from anywhere (deployment) 26 | 27 | --- 28 | 29 | ## 2. Deep-Dive Explanation 30 | 31 | ### 2.1 FastAPI for RAG Backend 32 | 33 | **What is FastAPI?** 34 | A modern Python web framework for building APIs: 35 | 36 | - Fast and performant 37 | - Automatic API documentation 38 | - Type hints support 39 | - Easy to use 40 | 41 | **Why FastAPI for RAG?** 42 | 43 | - Handles async operations well 44 | - Great for ML/AI applications 45 | - Automatic validation 46 | - Easy to deploy 47 | 48 | **Key Components:** 49 | 50 | - **Routes**: API endpoints 51 | - **Models**: Request/response schemas 52 | - **Dependencies**: Reusable components 53 | - **Middleware**: Cross-cutting concerns 54 | 55 | ### 2.2 Streamlit for RAG Frontend 56 | 57 | **What is Streamlit?** 58 | A Python framework for building web apps: 59 | 60 | - Simple and intuitive 61 | - Great for data/ML apps 62 | - No frontend knowledge needed 63 | - Fast development 64 | 65 | **Why Streamlit for RAG?** 66 | 67 | - Perfect for interactive AI apps 68 | - Easy to add file uploads 69 | - Simple chat interfaces 70 | - Quick prototyping 71 | 72 | **Key Components:** 73 | 74 | - **Widgets**: Inputs, buttons, displays 75 | - **Layout**: Organize your UI 76 | - **State**: Manage app state 77 | - **Session**: User sessions 78 | 79 | ### 2.3 Application Architecture 80 | 81 | **Complete System:** 82 | 83 | ``` 84 | ┌─────────────┐ 85 | │ Streamlit │ User Interface 86 | │ Frontend │ 87 | └──────┬──────┘ 88 | │ HTTP 89 | ▼ 90 | ┌─────────────┐ 91 | │ FastAPI │ REST API 92 | │ Backend │ 93 | └──────┬──────┘ 94 | │ 95 | ▼ 96 | ┌─────────────┐ 97 | │ RAG System │ Your RAG Pipeline 98 | └─────────────┘ 99 | ``` 100 | 101 | ### 2.4 API Design 102 | 103 | **Essential Endpoints:** 104 | 105 | - `POST /index` - Index documents 106 | - `POST /query` - Query RAG system 107 | - `GET /health` - Health check 108 | - `GET /stats` - System statistics 109 | - `DELETE /documents/{id}` - Remove document 110 | 111 | **Request/Response Models:** 112 | 113 | - Structured data 114 | - Validation 115 | - Type safety 116 | - Documentation 117 | 118 | ### 2.5 Deployment Options 119 | 120 | **Local Deployment:** 121 | 122 | - Run on your machine 123 | - Access via localhost 124 | - Good for testing 125 | 126 | **Cloud Deployment:** 127 | 128 | - **Heroku**: Easy, free tier 129 | - **Railway**: Simple deployment 130 | - **Render**: Free hosting 131 | - **AWS/GCP/Azure**: Production scale 132 | 133 | **Containerization:** 134 | 135 | - Docker for packaging 136 | - Easy deployment 137 | - Consistent environment 138 | 139 | --- 140 | 141 | ## 3. 
Instructor Examples 142 | 143 | ### Example 1: FastAPI RAG Backend 144 | 145 | ```python 146 | from fastapi import FastAPI, UploadFile, File, HTTPException 147 | from pydantic import BaseModel 148 | from typing import List, Optional 149 | import os 150 | 151 | app = FastAPI(title="RAG API", version="1.0.0") 152 | 153 | # Your RAG system (from previous days) 154 | from rag_system import RAGSystem 155 | 156 | rag = RAGSystem() 157 | 158 | # Request/Response Models 159 | class QueryRequest(BaseModel): 160 | question: str 161 | k: Optional[int] = 3 162 | 163 | class QueryResponse(BaseModel): 164 | answer: str 165 | sources: List[dict] 166 | processing_time: float 167 | 168 | class IndexResponse(BaseModel): 169 | message: str 170 | chunks_indexed: int 171 | 172 | # API Endpoints 173 | @app.get("/") 174 | async def root(): 175 | return {"message": "RAG API is running"} 176 | 177 | @app.get("/health") 178 | async def health(): 179 | return {"status": "healthy"} 180 | 181 | @app.post("/index", response_model=IndexResponse) 182 | async def index_document(file: UploadFile = File(...)): 183 | """Index a document""" 184 | try: 185 | # Save uploaded file 186 | file_path = f"temp/{file.filename}" 187 | os.makedirs("temp", exist_ok=True) 188 | 189 | with open(file_path, "wb") as f: 190 | content = await file.read() 191 | f.write(content) 192 | 193 | # Index document 194 | chunks = rag.index_document(file_path) 195 | 196 | # Clean up 197 | os.remove(file_path) 198 | 199 | return IndexResponse( 200 | message=f"Document indexed successfully", 201 | chunks_indexed=chunks 202 | ) 203 | except Exception as e: 204 | raise HTTPException(status_code=500, detail=str(e)) 205 | 206 | @app.post("/query", response_model=QueryResponse) 207 | async def query_rag(request: QueryRequest): 208 | """Query the RAG system""" 209 | import time 210 | start_time = time.time() 211 | 212 | try: 213 | result = rag.query(request.question, k=request.k) 214 | processing_time = time.time() - start_time 215 | 216 | return QueryResponse( 217 | answer=result["answer"], 218 | sources=result["sources"], 219 | processing_time=processing_time 220 | ) 221 | except Exception as e: 222 | raise HTTPException(status_code=500, detail=str(e)) 223 | 224 | @app.get("/stats") 225 | async def get_stats(): 226 | """Get system statistics""" 227 | stats = rag.get_statistics() 228 | return stats 229 | 230 | if __name__ == "__main__": 231 | import uvicorn 232 | uvicorn.run(app, host="0.0.0.0", port=8000) 233 | ``` 234 | 235 | ### Example 2: Streamlit RAG Frontend 236 | 237 | ```python 238 | import streamlit as st 239 | import requests 240 | import time 241 | 242 | # API URL 243 | API_URL = "http://localhost:8000" 244 | 245 | # Page config 246 | st.set_page_config( 247 | page_title="RAG Application", 248 | page_icon="🤖", 249 | layout="wide" 250 | ) 251 | 252 | # Title 253 | st.title("🤖 RAG Application") 254 | st.markdown("Ask questions about your documents!") 255 | 256 | # Sidebar 257 | with st.sidebar: 258 | st.header("📄 Document Management") 259 | 260 | # File upload 261 | uploaded_file = st.file_uploader( 262 | "Upload a document", 263 | type=["pdf", "txt"], 264 | help="Upload PDF or TXT files" 265 | ) 266 | 267 | if uploaded_file: 268 | if st.button("Index Document"): 269 | with st.spinner("Indexing document..."): 270 | files = {"file": uploaded_file.getvalue()} 271 | response = requests.post( 272 | f"{API_URL}/index", 273 | files=files 274 | ) 275 | 276 | if response.status_code == 200: 277 | result = response.json() 278 | st.success(f"✅ 
{result['message']}")
279 |                     st.info(f"Indexed {result['chunks_indexed']} chunks")
280 |                 else:
281 |                     st.error("Failed to index document")
282 | 
283 |     # Statistics
284 |     if st.button("View Statistics"):
285 |         response = requests.get(f"{API_URL}/stats")
286 |         if response.status_code == 200:
287 |             stats = response.json()
288 |             st.json(stats)
289 | 
290 | # Main area
291 | st.header("💬 Ask a Question")
292 | 
293 | # Query input
294 | question = st.text_input(
295 |     "Enter your question:",
296 |     placeholder="What is the main topic?",
297 |     key="question_input"
298 | )
299 | 
300 | # K value
301 | k = st.slider("Number of sources:", 1, 10, 3)
302 | 
303 | # Query button
304 | if st.button("Ask", type="primary") and question:
305 |     with st.spinner("Thinking..."):
306 |         # Make API request
307 |         response = requests.post(
308 |             f"{API_URL}/query",
309 |             json={"question": question, "k": k}
310 |         )
311 | 
312 |         if response.status_code == 200:
313 |             result = response.json()
314 | 
315 |             # Display answer
316 |             st.subheader("📝 Answer")
317 |             st.write(result["answer"])
318 | 
319 |             # Display sources (only format the score if the API returned one)
320 |             st.subheader("📚 Sources")
321 |             for i, source in enumerate(result["sources"], 1):
322 |                 with st.expander(f"Source {i} (Similarity: {source['similarity']:.2f})" if "similarity" in source else f"Source {i}"):
323 |                     st.write(source.get("text", ""))
324 |                     if "metadata" in source:
325 |                         st.caption(f"Source: {source['metadata']}")
326 | 
327 |             st.session_state.setdefault("chat_history", []).append((question, result["answer"]))  # save for history below
328 |             st.caption(f"⏱️ Processed in {result['processing_time']:.2f} seconds")
329 |         else:
330 |             st.error("Failed to get answer")
331 | 
332 | # Chat history (populated above after each successful query)
333 | if "chat_history" not in st.session_state:
334 |     st.session_state.chat_history = []
335 | 
336 | # Display history
337 | if st.session_state.chat_history:
338 |     st.header("💭 Chat History")
339 |     for i, (q, a) in enumerate(reversed(st.session_state.chat_history[-5:]), 1):
340 |         with st.expander(f"Q{i}: {q}"):
341 |             st.write(a)
342 | ```
343 | 
344 | ### Example 3: Complete Deployment Setup
345 | 
346 | ```
347 | # requirements.txt
348 | fastapi==0.104.1
349 | uvicorn==0.24.0
350 | streamlit==1.28.0
351 | openai==1.3.0
352 | chromadb==0.4.15
353 | pypdf==3.17.0
354 | requests==2.31.0
355 | python-multipart==0.0.6
356 | 
357 | # Dockerfile
358 | FROM python:3.10-slim
359 | 
360 | WORKDIR /app
361 | 
362 | COPY requirements.txt .
363 | RUN pip install --no-cache-dir -r requirements.txt
364 | 
365 | COPY . .
366 | 
367 | EXPOSE 8000
368 | 
369 | CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
370 | 
371 | # docker-compose.yml
372 | version: '3.8'
373 | services:
374 |   api:
375 |     build: .
376 |     ports:
377 |       - "8000:8000"
378 |     environment:
379 |       - OPENAI_API_KEY=${OPENAI_API_KEY}
380 |     volumes:
381 |       - ./data:/app/data
382 | ```
383 | 
384 | ---
385 | 
386 | ## 4. 
Student Practice Tasks 387 | 388 | ### Task 1: FastAPI Backend 389 | 390 | Build a FastAPI backend with: 391 | 392 | - Document indexing endpoint 393 | - Query endpoint 394 | - Health check 395 | - Statistics endpoint 396 | 397 | ### Task 2: Streamlit Frontend 398 | 399 | Create a Streamlit UI with: 400 | 401 | - File upload 402 | - Query interface 403 | - Answer display 404 | - Source display 405 | 406 | ### Task 3: Integration 407 | 408 | Connect Streamlit to FastAPI: 409 | 410 | - Make API calls 411 | - Handle errors 412 | - Display results 413 | - Add loading states 414 | 415 | ### Task 4: Error Handling 416 | 417 | Add comprehensive error handling: 418 | 419 | - API errors 420 | - File errors 421 | - Validation errors 422 | - User-friendly messages 423 | 424 | ### Task 5: Deployment 425 | 426 | Deploy your application: 427 | 428 | - Local deployment 429 | - Cloud deployment (choose one) 430 | - Docker containerization 431 | - Environment variables 432 | 433 | ### Task 6: Documentation 434 | 435 | Create documentation: 436 | 437 | - API documentation 438 | - User guide 439 | - Deployment instructions 440 | - README file 441 | 442 | --- 443 | 444 | ## 5. Summary / Key Takeaways 445 | 446 | - **FastAPI** provides a fast, modern API framework 447 | - **Streamlit** makes building UIs simple 448 | - **APIs** enable integration with other systems 449 | - **Web interfaces** make systems accessible 450 | - **Deployment** makes your work usable 451 | - **Docker** simplifies deployment 452 | - **Error handling** is crucial for production 453 | - **Documentation** helps users 454 | - **Testing** ensures reliability 455 | - **You've built a complete RAG system!** 🎉 456 | 457 | --- 458 | 459 | ## 6. Further Reading (Optional) 460 | 461 | - FastAPI Documentation 462 | - Streamlit Documentation 463 | - Docker Documentation 464 | - Deployment guides (Heroku, Railway, Render) 465 | - API design best practices 466 | 467 | --- 468 | 469 | **Congratulations on completing the 10-day RAG roadmap!** 🎊 470 | 471 | You now have: 472 | 473 | - ✅ Python foundations 474 | - ✅ LLM understanding 475 | - ✅ Prompt engineering skills 476 | - ✅ Data extraction capabilities 477 | - ✅ Embedding knowledge 478 | - ✅ RAG system building 479 | - ✅ Framework experience 480 | - ✅ Advanced techniques 481 | - ✅ Deployment skills 482 | 483 | **Keep building and learning!** 🚀 484 | -------------------------------------------------------------------------------- /Day-05: Embeddings & Vector Databases/README.md: -------------------------------------------------------------------------------- 1 | # Day 5 — Embeddings & Vector Databases 2 | 3 | ## 1. Beginner-Friendly Introduction 4 | 5 | Today, you'll learn about **embeddings** and **vector databases**—the technology that makes RAG retrieval possible. This is where the magic happens: converting text into numbers that capture meaning, and storing them so we can find similar content quickly. 6 | 7 | **What are Embeddings?** 8 | Embeddings are numerical representations of text that capture semantic meaning. Similar texts have similar embeddings (close numbers), allowing computers to understand meaning mathematically. 9 | 10 | **Why this matters for RAG:** 11 | - Embeddings convert text chunks into searchable vectors 12 | - Vector databases store and retrieve similar content efficiently 13 | - When you ask a question, we find the most relevant chunks using similarity search 14 | - This is the "Retrieval" part of RAG! 
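
To build intuition for what "similar embeddings" means, here is a minimal sketch using made-up 3-dimensional vectors; real embedding models (covered below) produce vectors with hundreds or thousands of dimensions, but the comparison works exactly the same way:

```python
import numpy as np

# Toy "embeddings": hand-picked 3-dimensional vectors, purely for intuition
dog   = np.array([0.9, 0.1, 0.0])
puppy = np.array([0.8, 0.2, 0.1])
car   = np.array([0.0, 0.9, 0.4])

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction (same meaning), near 0 = unrelated."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"dog vs puppy: {cosine(dog, puppy):.2f}")  # ~0.98 -> very similar
print(f"dog vs car:   {cosine(dog, car):.2f}")    # ~0.10 -> unrelated
```

The closer the score is to 1, the closer the meanings; running this comparison over thousands of stored vectors is exactly what a vector database does during retrieval.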
15 | 16 | **Real-world context:** 17 | Imagine a library where books are organized by meaning, not alphabetically. When you ask "Tell me about dogs," the system finds all books about dogs, even if they don't contain the exact word "dogs" (maybe they say "canines" or "pets"). Embeddings make this possible! 18 | 19 | --- 20 | 21 | ## 2. Deep-Dive Explanation 22 | 23 | ### 2.1 What are Embeddings? 24 | 25 | **Text Embeddings** are dense vectors (arrays of numbers) that represent text in a high-dimensional space. 26 | 27 | **Key Properties:** 28 | - **Semantic similarity**: Similar meanings → similar vectors 29 | - **Fixed dimensions**: Each embedding has the same length (e.g., 1536 for OpenAI) 30 | - **Dense**: Most values are non-zero (unlike sparse representations) 31 | 32 | **Example:** 33 | ``` 34 | "dog" → [0.2, -0.5, 0.8, ..., 0.1] (1536 numbers) 35 | "puppy" → [0.19, -0.48, 0.79, ..., 0.12] (very similar!) 36 | "car" → [-0.3, 0.6, -0.2, ..., -0.5] (very different!) 37 | ``` 38 | 39 | ### 2.2 How Embeddings Work 40 | 41 | **The Process:** 42 | ``` 43 | Text → Embedding Model → Vector (Array of Numbers) 44 | ``` 45 | 46 | **Embedding Models:** 47 | - **OpenAI**: `text-embedding-ada-002` or `text-embedding-3-small` 48 | - **Sentence Transformers**: Open-source alternatives 49 | - **Custom models**: Trained on specific domains 50 | 51 | **Dimensions:** 52 | - OpenAI ada-002: 1536 dimensions 53 | - OpenAI 3-small: 1536 dimensions 54 | - Sentence-BERT: 384 or 768 dimensions 55 | - More dimensions = more detail (but slower, more storage) 56 | 57 | ### 2.3 Similarity Search 58 | 59 | **Cosine Similarity:** 60 | Measures the angle between two vectors (0 to 1): 61 | - 1.0 = Identical meaning 62 | - 0.9 = Very similar 63 | - 0.5 = Somewhat related 64 | - 0.0 = Unrelated 65 | 66 | **Formula (simplified):** 67 | ``` 68 | similarity = dot_product(vec1, vec2) / (magnitude(vec1) * magnitude(vec2)) 69 | ``` 70 | 71 | **Why Cosine Similarity?** 72 | - Focuses on direction, not magnitude 73 | - Works well for text embeddings 74 | - Range: -1 to 1 (usually 0 to 1 for normalized embeddings) 75 | 76 | ### 2.4 Vector Databases 77 | 78 | **What is a Vector Database?** 79 | A specialized database optimized for storing and searching high-dimensional vectors. 80 | 81 | **Key Features:** 82 | - Fast similarity search 83 | - Handles millions of vectors 84 | - Supports metadata filtering 85 | - Efficient indexing (ANN - Approximate Nearest Neighbor) 86 | 87 | **Popular Vector Databases:** 88 | - **ChromaDB**: Simple, Python-native 89 | - **Pinecone**: Cloud-based, scalable 90 | - **Weaviate**: Open-source, feature-rich 91 | - **Qdrant**: Fast, Rust-based 92 | - **FAISS**: Facebook's library (not a full DB) 93 | 94 | ### 2.5 ChromaDB Basics 95 | 96 | **Why ChromaDB for Learning?** 97 | - Easy to use 98 | - No external services needed 99 | - Perfect for prototyping 100 | - Python-native 101 | 102 | **Core Concepts:** 103 | - **Collection**: Container for vectors and metadata 104 | - **Documents**: Your text chunks 105 | - **Embeddings**: Vector representations 106 | - **Metadata**: Additional info (source, page, etc.) 107 | 108 | **Basic Operations:** 109 | 1. Create collection 110 | 2. Add documents (auto-generates embeddings) 111 | 3. Query for similar documents 112 | 4. 
Retrieve with metadata 113 | 114 | ### 2.6 The Embedding Pipeline 115 | 116 | **Complete Flow:** 117 | ``` 118 | Text Chunks → Embedding Model → Vectors → Vector DB → Index 119 | ↓ 120 | Query → Embedding Model → Query Vector → Similarity Search → Top K Results 121 | ``` 122 | 123 | **Steps:** 124 | 1. **Chunk documents** (from Day 4) 125 | 2. **Generate embeddings** for each chunk 126 | 3. **Store in vector DB** with metadata 127 | 4. **Query**: Convert question to embedding 128 | 5. **Search**: Find most similar chunks 129 | 6. **Retrieve**: Get top K results 130 | 131 | --- 132 | 133 | ## 3. Instructor Examples 134 | 135 | ### Example 1: Generating Embeddings with OpenAI 136 | 137 | ```python 138 | import openai 139 | import os 140 | 141 | openai.api_key = os.getenv("OPENAI_API_KEY") 142 | 143 | def get_embedding(text, model="text-embedding-ada-002"): 144 | """Generate embedding for text""" 145 | text = text.replace("\n", " ") # Replace newlines 146 | 147 | response = openai.Embedding.create( 148 | model=model, 149 | input=text 150 | ) 151 | 152 | return response['data'][0]['embedding'] 153 | 154 | # Usage 155 | text = "Python is a programming language" 156 | embedding = get_embedding(text) 157 | print(f"Embedding dimension: {len(embedding)}") # 1536 158 | print(f"First 5 values: {embedding[:5]}") 159 | ``` 160 | 161 | ### Example 2: Batch Embedding Generation 162 | 163 | ```python 164 | def get_embeddings_batch(texts, model="text-embedding-ada-002"): 165 | """Generate embeddings for multiple texts""" 166 | # Clean texts 167 | texts = [text.replace("\n", " ") for text in texts] 168 | 169 | response = openai.Embedding.create( 170 | model=model, 171 | input=texts 172 | ) 173 | 174 | # Extract embeddings 175 | embeddings = [item['embedding'] for item in response['data']] 176 | return embeddings 177 | 178 | # Usage 179 | texts = [ 180 | "Python is a programming language", 181 | "Dogs are loyal pets", 182 | "Machine learning uses algorithms" 183 | ] 184 | embeddings = get_embeddings_batch(texts) 185 | print(f"Generated {len(embeddings)} embeddings") 186 | ``` 187 | 188 | ### Example 3: Simple Similarity Calculation 189 | 190 | ```python 191 | import numpy as np 192 | 193 | def cosine_similarity(vec1, vec2): 194 | """Calculate cosine similarity between two vectors""" 195 | vec1 = np.array(vec1) 196 | vec2 = np.array(vec2) 197 | 198 | dot_product = np.dot(vec1, vec2) 199 | norm1 = np.linalg.norm(vec1) 200 | norm2 = np.linalg.norm(vec2) 201 | 202 | if norm1 == 0 or norm2 == 0: 203 | return 0.0 204 | 205 | return dot_product / (norm1 * norm2) 206 | 207 | # Usage 208 | embedding1 = get_embedding("dog") 209 | embedding2 = get_embedding("puppy") 210 | embedding3 = get_embedding("car") 211 | 212 | similarity_dog_puppy = cosine_similarity(embedding1, embedding2) 213 | similarity_dog_car = cosine_similarity(embedding1, embedding3) 214 | 215 | print(f"Dog-Puppy similarity: {similarity_dog_puppy:.3f}") # ~0.85 216 | print(f"Dog-Car similarity: {similarity_dog_car:.3f}") # ~0.30 217 | ``` 218 | 219 | ### Example 4: ChromaDB Basics 220 | 221 | ```python 222 | import chromadb 223 | from chromadb.config import Settings 224 | 225 | # Initialize ChromaDB (in-memory for simplicity) 226 | client = chromadb.Client(Settings(anonymized_telemetry=False)) 227 | 228 | # Create or get a collection 229 | collection = client.create_collection(name="documents") 230 | 231 | # Add documents 232 | documents = [ 233 | "Python is a high-level programming language", 234 | "Dogs are loyal and friendly animals", 235 | "Machine 
learning is a subset of AI" 236 | ] 237 | 238 | ids = ["doc1", "doc2", "doc3"] 239 | metadatas = [ 240 | {"source": "python_book", "page": 1}, 241 | {"source": "animal_guide", "page": 5}, 242 | {"source": "ai_textbook", "page": 10} 243 | ] 244 | 245 | collection.add( 246 | documents=documents, 247 | ids=ids, 248 | metadatas=metadatas 249 | ) 250 | 251 | # Query for similar documents 252 | results = collection.query( 253 | query_texts=["programming languages"], 254 | n_results=2 255 | ) 256 | 257 | print("Similar documents:") 258 | for i, doc in enumerate(results['documents'][0]): 259 | print(f"{i+1}. {doc}") 260 | print(f" Metadata: {results['metadatas'][0][i]}") 261 | ``` 262 | 263 | ### Example 5: Complete Embedding Pipeline 264 | 265 | ```python 266 | class EmbeddingPipeline: 267 | def __init__(self, embedding_model="text-embedding-ada-002"): 268 | self.embedding_model = embedding_model 269 | self.client = chromadb.Client() 270 | self.collection = None 271 | 272 | def create_collection(self, name): 273 | """Create a new collection""" 274 | self.collection = self.client.create_collection(name=name) 275 | return self.collection 276 | 277 | def add_documents(self, texts, ids=None, metadatas=None): 278 | """Add documents to collection (ChromaDB auto-generates embeddings)""" 279 | if ids is None: 280 | ids = [f"doc_{i}" for i in range(len(texts))] 281 | 282 | self.collection.add( 283 | documents=texts, 284 | ids=ids, 285 | metadatas=metadatas 286 | ) 287 | 288 | def search(self, query_text, n_results=5, filter_metadata=None): 289 | """Search for similar documents""" 290 | query_params = { 291 | "query_texts": [query_text], 292 | "n_results": n_results 293 | } 294 | 295 | if filter_metadata: 296 | query_params["where"] = filter_metadata 297 | 298 | results = self.collection.query(**query_params) 299 | 300 | return { 301 | "documents": results['documents'][0], 302 | "metadatas": results['metadatas'][0], 303 | "distances": results['distances'][0] 304 | } 305 | 306 | def get_stats(self): 307 | """Get collection statistics""" 308 | count = self.collection.count() 309 | return {"total_documents": count} 310 | 311 | # Usage 312 | pipeline = EmbeddingPipeline() 313 | pipeline.create_collection("my_docs") 314 | 315 | # Add documents 316 | texts = ["Document 1 text...", "Document 2 text..."] 317 | metadatas = [{"source": "book1"}, {"source": "book2"}] 318 | pipeline.add_documents(texts, metadatas=metadatas) 319 | 320 | # Search 321 | results = pipeline.search("What is Python?", n_results=3) 322 | for doc, metadata in zip(results["documents"], results["metadatas"]): 323 | print(f"Found: {doc[:50]}... (Source: {metadata['source']})") 324 | ``` 325 | 326 | --- 327 | 328 | ## 4. 
Student Practice Tasks 329 | 330 | ### Task 1: Embedding Generator 331 | Create a function that: 332 | - Takes a list of texts 333 | - Generates embeddings for each 334 | - Returns embeddings with metadata 335 | - Handles API errors 336 | 337 | ### Task 2: Similarity Calculator 338 | Build a tool that: 339 | - Takes two texts 340 | - Generates embeddings 341 | - Calculates cosine similarity 342 | - Explains the similarity score 343 | 344 | ### Task 3: ChromaDB Setup 345 | Set up ChromaDB and: 346 | - Create a collection 347 | - Add 10 sample documents 348 | - Query for similar documents 349 | - Display results with metadata 350 | 351 | ### Task 4: Batch Processing 352 | Create a system that: 353 | - Processes multiple documents 354 | - Generates embeddings in batches 355 | - Stores in ChromaDB 356 | - Shows progress 357 | 358 | ### Task 5: Similarity Search 359 | Implement a search function that: 360 | - Takes a query 361 | - Finds top 5 most similar documents 362 | - Returns results with similarity scores 363 | - Filters by metadata if needed 364 | 365 | ### Task 6: Embedding Visualization 366 | (Advanced) Use dimensionality reduction (PCA/t-SNE) to visualize embeddings in 2D and see how similar texts cluster together. 367 | 368 | --- 369 | 370 | ## 5. Summary / Key Takeaways 371 | 372 | - **Embeddings** convert text to numerical vectors that capture meaning 373 | - **Similar texts** have similar embeddings (high cosine similarity) 374 | - **Embedding models** like OpenAI's ada-002 generate 1536-dimensional vectors 375 | - **Vector databases** (like ChromaDB) store and search embeddings efficiently 376 | - **Cosine similarity** measures how similar two embeddings are (ranges from -1 to 1; pairs of text embeddings usually score between 0 and 1) 377 | - **ChromaDB** is easy to use for learning and prototyping 378 | - **The pipeline**: Text → Embedding → Store → Query → Retrieve 379 | - **Metadata** helps filter and organize stored documents 380 | - **Batch processing** is more efficient than one-by-one 381 | - **Similarity search** finds relevant chunks for RAG queries 382 | 383 | --- 384 | 385 | ## 6. Further Reading (Optional) 386 | 387 | - OpenAI Embeddings Guide 388 | - ChromaDB Documentation 389 | - Sentence Transformers library 390 | - Vector Database Comparison articles 391 | - "The Illustrated Word2vec" (understanding embeddings conceptually) 392 | 393 | --- 394 | 395 | **Next up:** Day 6 will combine everything into a complete RAG system! 396 | 397 | -------------------------------------------------------------------------------- /Day-07: Implement RAG From Scratch (Pure Python)/README.md: -------------------------------------------------------------------------------- 1 | # Day 7 — Implement RAG From Scratch (Pure Python) 2 | 3 | ## 1. Beginner-Friendly Introduction 4 | 5 | Today, you'll build a complete RAG system using only pure Python—no frameworks! This deep dive will help you understand every component and how they work together. By building from scratch, you'll gain a solid foundation before using frameworks like LangChain. 6 | 7 | **Why build from scratch?** 8 | - Understand every component deeply 9 | - No "magic" - you see how everything works 10 | - Customize any part you want 11 | - Better debugging skills 12 | - Foundation for using frameworks later 13 | 14 | **What you'll build:** 15 | A complete, working RAG system with: 16 | - Document processing 17 | - Embedding generation 18 | - Vector storage and search 19 | - Prompt construction 20 | - LLM integration 21 | - Answer generation 22 | 23 | --- 24 | 25 | ## 2. 
Deep-Dive Explanation 26 | 27 | ### 2.1 System Architecture 28 | 29 | **Complete RAG System Components:** 30 | 31 | ``` 32 | ┌─────────────────┐ 33 | │ Document Loader │ 34 | └────────┬─────────┘ 35 | │ 36 | ▼ 37 | ┌─────────────────┐ 38 | │ Text Chunker │ 39 | └────────┬─────────┘ 40 | │ 41 | ▼ 42 | ┌─────────────────┐ 43 | │ Embedding Model │ 44 | └────────┬─────────┘ 45 | │ 46 | ▼ 47 | ┌─────────────────┐ 48 | │ Vector Store │ 49 | └────────┬─────────┘ 50 | │ 51 | ▼ 52 | ┌─────────────────┐ 53 | │ Query Handler │ 54 | └────────┬─────────┘ 55 | │ 56 | ▼ 57 | ┌─────────────────┐ 58 | │ RAG Pipeline │ 59 | └─────────────────┘ 60 | ``` 61 | 62 | ### 2.2 Component Design 63 | 64 | **1. Document Loader** 65 | - Read various file formats 66 | - Extract text 67 | - Handle errors 68 | 69 | **2. Text Chunker** 70 | - Split into manageable pieces 71 | - Preserve context 72 | - Add metadata 73 | 74 | **3. Embedding Generator** 75 | - Call OpenAI API 76 | - Handle batching 77 | - Cache embeddings 78 | 79 | **4. Vector Store** 80 | - Store embeddings 81 | - Implement similarity search 82 | - Manage metadata 83 | 84 | **5. Query Processor** 85 | - Convert query to embedding 86 | - Search vector store 87 | - Rank results 88 | 89 | **6. RAG Pipeline** 90 | - Orchestrate all components 91 | - Handle errors 92 | - Return formatted results 93 | 94 | ### 2.3 Implementation Strategy 95 | 96 | **Class Structure:** 97 | ```python 98 | class RAGSystem: 99 | - document_loader 100 | - chunker 101 | - embedding_generator 102 | - vector_store 103 | - llm_client 104 | 105 | Methods: 106 | - load_documents() 107 | - index_documents() 108 | - query() 109 | - get_stats() 110 | ``` 111 | 112 | **Error Handling:** 113 | - API failures 114 | - File errors 115 | - Empty results 116 | - Invalid inputs 117 | 118 | **Configuration:** 119 | - Chunk size 120 | - K value 121 | - Similarity threshold 122 | - LLM parameters 123 | 124 | --- 125 | 126 | ## 3. 
Instructor Examples 127 | 128 | ### Example 1: Complete RAG System Structure 129 | 130 | ```python 131 | import os 132 | import openai 133 | import json 134 | import numpy as np 135 | from typing import List, Dict, Optional 136 | 137 | class DocumentLoader: 138 | """Load documents from various sources""" 139 | 140 | def load_text_file(self, filepath: str) -> str: 141 | """Load text from .txt file""" 142 | with open(filepath, 'r', encoding='utf-8') as f: 143 | return f.read() 144 | 145 | def load_pdf(self, filepath: str) -> str: 146 | """Load text from PDF""" 147 | import pypdf 148 | text = "" 149 | with open(filepath, 'rb') as f: 150 | reader = pypdf.PdfReader(f) 151 | for page in reader.pages: 152 | text += page.extract_text() + "\n" 153 | return text 154 | 155 | class TextChunker: 156 | """Split text into chunks""" 157 | 158 | def __init__(self, chunk_size: int = 500, overlap: int = 50): 159 | self.chunk_size = chunk_size 160 | self.overlap = overlap 161 | 162 | def chunk_text(self, text: str, source: str = "unknown") -> List[Dict]: 163 | """Split text into chunks with metadata""" 164 | words = text.split() 165 | chunks = [] 166 | 167 | for i in range(0, len(words), self.chunk_size - self.overlap): 168 | chunk_words = words[i:i + self.chunk_size] 169 | chunk_text = " ".join(chunk_words) 170 | 171 | chunks.append({ 172 | "text": chunk_text, 173 | "source": source, 174 | "chunk_id": len(chunks) + 1, 175 | "word_count": len(chunk_words) 176 | }) 177 | 178 | return chunks 179 | 180 | class EmbeddingGenerator: 181 | """Generate embeddings using OpenAI""" 182 | 183 | def __init__(self, model: str = "text-embedding-ada-002"): 184 | self.model = model 185 | openai.api_key = os.getenv("OPENAI_API_KEY") 186 | 187 | def generate(self, text: str) -> List[float]: 188 | """Generate embedding for single text""" 189 | text = text.replace("\n", " ") 190 | response = openai.Embedding.create( 191 | model=self.model, 192 | input=text 193 | ) 194 | return response['data'][0]['embedding'] 195 | 196 | def generate_batch(self, texts: List[str]) -> List[List[float]]: 197 | """Generate embeddings for multiple texts""" 198 | texts = [t.replace("\n", " ") for t in texts] 199 | response = openai.Embedding.create( 200 | model=self.model, 201 | input=texts 202 | ) 203 | return [item['embedding'] for item in response['data']] 204 | 205 | class VectorStore: 206 | """Simple vector store using in-memory storage""" 207 | 208 | def __init__(self): 209 | self.embeddings = [] 210 | self.chunks = [] 211 | self.metadata = [] 212 | 213 | def add(self, embeddings: List[List[float]], chunks: List[Dict]): 214 | """Add embeddings and chunks to store""" 215 | self.embeddings.extend(embeddings) 216 | self.chunks.extend(chunks) 217 | self.metadata.extend([c.get("metadata", {}) for c in chunks]) 218 | 219 | def search(self, query_embedding: List[float], k: int = 3) -> List[Dict]: 220 | """Search for top K similar chunks""" 221 | if not self.embeddings: 222 | return [] 223 | 224 | # Calculate similarities 225 | similarities = [] 226 | query_vec = np.array(query_embedding) 227 | 228 | for emb in self.embeddings: 229 | emb_vec = np.array(emb) 230 | similarity = np.dot(query_vec, emb_vec) / ( 231 | np.linalg.norm(query_vec) * np.linalg.norm(emb_vec) 232 | ) 233 | similarities.append(similarity) 234 | 235 | # Get top K 236 | top_indices = np.argsort(similarities)[::-1][:k] 237 | 238 | results = [] 239 | for idx in top_indices: 240 | results.append({ 241 | "chunk": self.chunks[idx], 242 | "similarity": float(similarities[idx]), 243 | 
"metadata": self.metadata[idx] 244 | }) 245 | 246 | return results 247 | 248 | class RAGSystem: 249 | """Complete RAG system""" 250 | 251 | def __init__(self): 252 | self.loader = DocumentLoader() 253 | self.chunker = TextChunker() 254 | self.embedder = EmbeddingGenerator() 255 | self.vector_store = VectorStore() 256 | openai.api_key = os.getenv("OPENAI_API_KEY") 257 | 258 | def index_document(self, filepath: str): 259 | """Load, chunk, and index a document""" 260 | # Load 261 | if filepath.endswith('.txt'): 262 | text = self.loader.load_text_file(filepath) 263 | elif filepath.endswith('.pdf'): 264 | text = self.loader.load_pdf(filepath) 265 | else: 266 | raise ValueError(f"Unsupported file type: {filepath}") 267 | 268 | # Chunk 269 | chunks = self.chunker.chunk_text(text, source=filepath) 270 | 271 | # Generate embeddings 272 | chunk_texts = [c["text"] for c in chunks] 273 | embeddings = self.embedder.generate_batch(chunk_texts) 274 | 275 | # Store 276 | self.vector_store.add(embeddings, chunks) 277 | 278 | return len(chunks) 279 | 280 | def query(self, question: str, k: int = 3) -> Dict: 281 | """Complete RAG query""" 282 | # 1. Retrieve 283 | query_embedding = self.embedder.generate(question) 284 | results = self.vector_store.search(query_embedding, k) 285 | 286 | if not results: 287 | return {"answer": "No relevant documents found.", "sources": []} 288 | 289 | # 2. Augment 290 | context = "\n\n".join([ 291 | f"[Source: {r['chunk']['source']}]\n{r['chunk']['text']}" 292 | for r in results 293 | ]) 294 | 295 | prompt = f"""Answer the question using the following context. 296 | 297 | Context: 298 | {context} 299 | 300 | Question: {question} 301 | 302 | Answer based only on the provided context.""" 303 | 304 | # 3. Generate 305 | response = openai.ChatCompletion.create( 306 | model="gpt-3.5-turbo", 307 | messages=[{"role": "user", "content": prompt}], 308 | temperature=0.3, 309 | max_tokens=300 310 | ) 311 | 312 | answer = response.choices[0].message.content 313 | 314 | return { 315 | "answer": answer, 316 | "sources": [r['chunk'] for r in results], 317 | "similarities": [r['similarity'] for r in results] 318 | } 319 | 320 | # Usage 321 | rag = RAGSystem() 322 | rag.index_document("document.pdf") 323 | result = rag.query("What is the main topic?") 324 | print(result["answer"]) 325 | ``` 326 | 327 | ### Example 2: Enhanced RAG with Configuration 328 | 329 | ```python 330 | class ConfigurableRAG(RAGSystem): 331 | """RAG system with configuration options""" 332 | 333 | def __init__(self, config: Dict): 334 | super().__init__() 335 | self.config = config 336 | self.chunker = TextChunker( 337 | chunk_size=config.get("chunk_size", 500), 338 | overlap=config.get("overlap", 50) 339 | ) 340 | self.embedder = EmbeddingGenerator( 341 | model=config.get("embedding_model", "text-embedding-ada-002") 342 | ) 343 | 344 | def query(self, question: str, k: Optional[int] = None) -> Dict: 345 | """Query with configurable parameters""" 346 | k = k or self.config.get("k", 3) 347 | threshold = self.config.get("similarity_threshold", 0.0) 348 | 349 | # Retrieve 350 | query_embedding = self.embedder.generate(question) 351 | results = self.vector_store.search(query_embedding, k * 2) # Get more, filter 352 | 353 | # Filter by threshold 354 | filtered = [r for r in results if r['similarity'] >= threshold][:k] 355 | 356 | if not filtered: 357 | return {"answer": "No relevant documents found.", "sources": []} 358 | 359 | # Rest of the pipeline... 
360 | # (similar to previous example) 361 | ``` 362 | 363 | --- 364 | 365 | ## 4. Student Practice Tasks 366 | 367 | ### Task 1: Core Components 368 | Implement each component separately: 369 | - DocumentLoader 370 | - TextChunker 371 | - EmbeddingGenerator 372 | - VectorStore 373 | 374 | Test each independently. 375 | 376 | ### Task 2: Integration 377 | Combine all components into a RAGSystem class. Test the complete pipeline. 378 | 379 | ### Task 3: Error Handling 380 | Add comprehensive error handling: 381 | - API failures 382 | - File errors 383 | - Empty results 384 | - Invalid inputs 385 | 386 | ### Task 4: Configuration System 387 | Create a configuration system that allows: 388 | - Adjusting chunk size 389 | - Changing K value 390 | - Setting similarity threshold 391 | - Configuring LLM parameters 392 | 393 | ### Task 5: Performance Optimization 394 | Optimize your system: 395 | - Batch embedding generation 396 | - Cache embeddings 397 | - Efficient similarity search 398 | - Progress indicators 399 | 400 | ### Task 6: Testing 401 | Create test cases: 402 | - Unit tests for each component 403 | - Integration tests 404 | - End-to-end tests 405 | 406 | --- 407 | 408 | ## 5. Summary / Key Takeaways 409 | 410 | - **Building from scratch** deepens understanding 411 | - **Modular design** makes components reusable 412 | - **Error handling** is crucial for production 413 | - **Configuration** allows flexibility 414 | - **Vector search** uses cosine similarity 415 | - **Batching** improves efficiency 416 | - **Metadata** enables filtering and citations 417 | - **Testing** ensures reliability 418 | - **Pure Python** = no framework dependencies 419 | - **Foundation** for understanding frameworks 420 | 421 | --- 422 | 423 | ## 6. Further Reading (Optional) 424 | 425 | - NumPy documentation (for vector operations) 426 | - OpenAI API best practices 427 | - Software design patterns 428 | - Unit testing in Python 429 | 430 | --- 431 | 432 | **Next up:** Day 8 will introduce you to LangChain and LlamaIndex frameworks! 433 | 434 | -------------------------------------------------------------------------------- /Day-06: RAG Fundamentals (Retrieval → Augmentation → Generation)/README.md: -------------------------------------------------------------------------------- 1 | # Day 6 — RAG Fundamentals (Retrieval → Augmentation → Generation) 2 | 3 | ## 1. Beginner-Friendly Introduction 4 | 5 | Today, you'll learn the complete RAG pipeline! **Retrieval-Augmented Generation** combines the best of both worlds: the knowledge retrieval of search engines and the language understanding of LLMs. 6 | 7 | **What is RAG?** 8 | RAG is a technique that: 9 | 1. **Retrieves** relevant information from your documents 10 | 2. **Augments** the LLM's prompt with this context 11 | 3. **Generates** accurate, sourced answers 12 | 13 | **Why RAG matters:** 14 | - Solves LLM limitations (hallucination, outdated info) 15 | - Provides accurate, verifiable answers 16 | - Uses your own documents as knowledge base 17 | - Enables domain-specific AI applications 18 | 19 | **Real-world context:** 20 | Instead of asking an LLM "What's in my company handbook?" (which it doesn't know), RAG: 21 | 1. Searches your handbook documents 22 | 2. Finds relevant sections 23 | 3. Gives those sections to the LLM 24 | 4. LLM answers based on YOUR documents 25 | 26 | --- 27 | 28 | ## 2. Deep-Dive Explanation 29 | 30 | ### 2.1 The RAG Pipeline 31 | 32 | **Complete Flow:** 33 | ``` 34 | User Question 35 | ↓ 36 | [1. 
RETRIEVAL] 37 | ↓ 38 | Query Embedding → Vector Search → Top K Chunks 39 | ↓ 40 | [2. AUGMENTATION] 41 | ↓ 42 | Context + Question → Formatted Prompt 43 | ↓ 44 | [3. GENERATION] 45 | ↓ 46 | LLM → Answer with Sources 47 | ``` 48 | 49 | ### 2.2 Step 1: Retrieval 50 | 51 | **What happens:** 52 | 1. Convert user question to embedding 53 | 2. Search vector database for similar chunks 54 | 3. Retrieve top K most relevant chunks 55 | 4. Return chunks with metadata 56 | 57 | **Key decisions:** 58 | - **K value**: How many chunks? (typically 3-5) 59 | - **Similarity threshold**: Minimum similarity score? 60 | - **Metadata filtering**: Filter by source, date, etc.? 61 | 62 | **Example:** 63 | ``` 64 | Question: "What is Python?" 65 | → Embedding: [0.1, -0.3, 0.8, ...] 66 | → Search vector DB 67 | → Retrieve: [Chunk about Python, Chunk about programming, ...] 68 | ``` 69 | 70 | ### 2.3 Step 2: Augmentation 71 | 72 | **What happens:** 73 | 1. Combine retrieved chunks into context 74 | 2. Format context with clear structure 75 | 3. Add question to prompt 76 | 4. Include instructions for the LLM 77 | 78 | **Prompt Structure:** 79 | ``` 80 | System: You are a helpful assistant... 81 | Context: 82 | [Chunk 1] 83 | [Chunk 2] 84 | [Chunk 3] 85 | 86 | Question: {user_question} 87 | 88 | Answer based on the context above. 89 | ``` 90 | 91 | **Best practices:** 92 | - Clearly separate chunks 93 | - Include source information 94 | - Limit context size (token budget) 95 | - Add instructions for citation 96 | 97 | ### 2.4 Step 3: Generation 98 | 99 | **What happens:** 100 | 1. Send augmented prompt to LLM 101 | 2. LLM generates answer using context 102 | 3. Extract answer from response 103 | 4. Optionally extract citations 104 | 105 | **LLM Configuration:** 106 | - **Temperature**: Lower (0.3-0.5) for factual answers 107 | - **Max tokens**: Based on expected answer length 108 | - **Model**: GPT-3.5-turbo or GPT-4 109 | 110 | ### 2.5 Complete RAG System Components 111 | 112 | **Required Components:** 113 | 1. **Document Store**: Where chunks are stored 114 | 2. **Embedding Model**: Converts text to vectors 115 | 3. **Vector Database**: Stores and searches embeddings 116 | 4. **LLM**: Generates final answers 117 | 5. **Prompt Template**: Formats context + question 118 | 119 | **Data Flow:** 120 | ``` 121 | Documents → Chunks → Embeddings → Vector DB 122 | ↓ 123 | User Question → Embedding → Search → Retrieved Chunks 124 | ↓ 125 | Retrieved Chunks + Question → Prompt → LLM → Answer 126 | ``` 127 | 128 | ### 2.6 RAG vs. Traditional Search 129 | 130 | **Traditional Search:** 131 | - Keyword matching 132 | - Exact text search 133 | - May miss semantic matches 134 | 135 | **RAG:** 136 | - Semantic understanding 137 | - Finds conceptually similar content 138 | - Understands context and meaning 139 | 140 | **Example:** 141 | - Question: "How do I train a model?" 142 | - Keyword search: Finds "train" and "model" separately 143 | - RAG: Finds content about "machine learning training", "model training", etc. 144 | 145 | --- 146 | 147 | ## 3. 
Instructor Examples 148 | 149 | ### Example 1: Simple RAG Pipeline 150 | 151 | ```python 152 | import os  # for reading the API key from the environment below 153 | import openai 154 | import chromadb 155 | from chromadb.config import Settings 156 | class SimpleRAG: 157 | def __init__(self): 158 | self.client = chromadb.Client(Settings()) 159 | self.collection = None 160 | openai.api_key = os.getenv("OPENAI_API_KEY") 161 | 162 | def setup(self, collection_name="documents"): 163 | """Initialize collection""" 164 | self.collection = self.client.create_collection(name=collection_name) 165 | 166 | def add_documents(self, texts, ids=None, metadatas=None): 167 | """Add documents to the collection""" 168 | if ids is None: 169 | ids = [f"doc_{i}" for i in range(len(texts))] 170 | 171 | self.collection.add( 172 | documents=texts, 173 | ids=ids, 174 | metadatas=metadatas 175 | ) 176 | 177 | def retrieve(self, query, k=3): 178 | """Retrieve top K relevant chunks""" 179 | results = self.collection.query( 180 | query_texts=[query], 181 | n_results=k 182 | ) 183 | return results['documents'][0] 184 | 185 | def augment(self, context_chunks, question): 186 | """Create augmented prompt""" 187 | context = "\n\n".join([ 188 | f"[Document {i+1}]\n{chunk}" 189 | for i, chunk in enumerate(context_chunks) 190 | ]) 191 | 192 | prompt = f"""Use the following documents to answer the question. 193 | 194 | Documents: 195 | {context} 196 | 197 | Question: {question} 198 | 199 | Answer based only on the provided documents. If the documents don't contain enough information, say so.""" 200 | 201 | return prompt 202 | 203 | def generate(self, prompt): 204 | """Generate answer using LLM""" 205 | response = openai.ChatCompletion.create( 206 | model="gpt-3.5-turbo", 207 | messages=[{"role": "user", "content": prompt}], 208 | temperature=0.3, 209 | max_tokens=300 210 | ) 211 | return response.choices[0].message.content 212 | 213 | def query(self, question, k=3): 214 | """Complete RAG pipeline""" 215 | # 1. Retrieve 216 | chunks = self.retrieve(question, k) 217 | 218 | # 2. Augment 219 | prompt = self.augment(chunks, question) 220 | 221 | # 3. Generate 222 | answer = self.generate(prompt) 223 | 224 | return { 225 | "answer": answer, 226 | "sources": chunks 227 | } 228 | 229 | # Usage 230 | rag = SimpleRAG() 231 | rag.setup() 232 | 233 | # Add documents 234 | rag.add_documents([ 235 | "Python is a programming language created in 1991.", 236 | "RAG combines retrieval and generation.", 237 | "Machine learning uses algorithms to learn from data."
238 | ]) 239 | 240 | # Query 241 | result = rag.query("What is Python?") 242 | print(result["answer"]) 243 | print("\nSources:", result["sources"]) 244 | ``` 245 | 246 | ### Example 2: RAG with Metadata 247 | 248 | ```python 249 | class RAGWithMetadata(SimpleRAG): 250 | def retrieve_with_metadata(self, query, k=3): 251 | """Retrieve chunks with metadata""" 252 | results = self.collection.query( 253 | query_texts=[query], 254 | n_results=k 255 | ) 256 | return { 257 | "documents": results['documents'][0], 258 | "metadatas": results['metadatas'][0], 259 | "distances": results['distances'][0] 260 | } 261 | 262 | def augment_with_sources(self, retrieved_data, question): 263 | """Augment with source citations""" 264 | context_parts = [] 265 | for i, (doc, metadata) in enumerate(zip( 266 | retrieved_data['documents'], 267 | retrieved_data['metadatas'] 268 | )): 269 | source = metadata.get('source', f'Document {i+1}') 270 | context_parts.append(f"[Source: {source}]\n{doc}") 271 | 272 | context = "\n\n".join(context_parts) 273 | 274 | prompt = f"""Answer the question using the following sources. 275 | 276 | Sources: 277 | {context} 278 | 279 | Question: {question} 280 | 281 | Provide an answer and cite which source(s) you used.""" 282 | 283 | return prompt 284 | 285 | def query(self, question, k=3): 286 | """RAG with source citations""" 287 | # Retrieve with metadata 288 | retrieved = self.retrieve_with_metadata(question, k) 289 | 290 | # Augment 291 | prompt = self.augment_with_sources(retrieved, question) 292 | 293 | # Generate 294 | answer = self.generate(prompt) 295 | 296 | return { 297 | "answer": answer, 298 | "sources": [ 299 | {"text": doc, "metadata": meta} 300 | for doc, meta in zip( 301 | retrieved['documents'], 302 | retrieved['metadatas'] 303 | ) 304 | ] 305 | } 306 | ``` 307 | 308 | ### Example 3: RAG with Similarity Filtering 309 | 310 | ```python 311 | class FilteredRAG(SimpleRAG): 312 | def retrieve_with_threshold(self, query, k=5, threshold=0.7): 313 | """Retrieve only chunks above similarity threshold""" 314 | results = self.collection.query( 315 | query_texts=[query], 316 | n_results=k 317 | ) 318 | 319 | # ChromaDB returns distances, not similarities: lower distance = more similar. 320 | # NOTE: the 1 - distance conversion below assumes cosine distance; create the collection with metadata={"hnsw:space": "cosine"} so that it holds (Chroma defaults to L2 distance, for which this conversion is not valid). 321 | filtered_docs = [] 322 | filtered_metas = [] 323 | 324 | for doc, meta, distance in zip( 325 | results['documents'][0], 326 | results['metadatas'][0], 327 | results['distances'][0] 328 | ): 329 | similarity = 1 - distance # Valid for cosine distance (see note above) 330 | if similarity >= threshold: 331 | filtered_docs.append(doc) 332 | filtered_metas.append(meta) 333 | 334 | return filtered_docs, filtered_metas 335 | 336 | def query(self, question, k=5, threshold=0.7): 337 | """RAG with similarity filtering""" 338 | chunks, metadata = self.retrieve_with_threshold(question, k, threshold) 339 | 340 | if not chunks: 341 | return { 342 | "answer": "I couldn't find relevant information in the documents.", 343 | "sources": [] 344 | } 345 | 346 | prompt = self.augment(chunks, question) 347 | answer = self.generate(prompt) 348 | 349 | return { 350 | "answer": answer, 351 | "sources": chunks, 352 | "num_sources": len(chunks) 353 | } 354 | ``` 355 | 356 | --- 357 | 358 | ## 4. 
Student Practice Tasks 359 | 360 | ### Task 1: Basic RAG Implementation 361 | Build a simple RAG system that: 362 | - Stores documents in ChromaDB 363 | - Retrieves top 3 chunks for a query 364 | - Augments prompt with context 365 | - Generates answer using GPT-3.5 366 | 367 | ### Task 2: RAG with Citations 368 | Enhance your RAG to: 369 | - Include source information in answers 370 | - Format citations properly 371 | - Show which document each part came from 372 | 373 | ### Task 3: Similarity Threshold 374 | Add filtering to only use chunks above a similarity threshold. Test with different thresholds and observe how it affects answers. 375 | 376 | ### Task 4: Multi-Query RAG 377 | Implement query expansion: 378 | - Generate multiple query variations 379 | - Search with each variation 380 | - Combine results 381 | - Remove duplicates 382 | 383 | ### Task 5: RAG Evaluation 384 | Create a simple evaluation: 385 | - Test with known questions 386 | - Compare answers to expected answers 387 | - Calculate accuracy metrics 388 | 389 | ### Task 6: RAG with Metadata Filtering 390 | Add ability to filter by metadata (e.g., only search in specific documents, date ranges, etc.) 391 | 392 | --- 393 | 394 | ## 5. Summary / Key Takeaways 395 | 396 | - **RAG = Retrieval + Augmentation + Generation** 397 | - **Retrieval**: Find relevant chunks using semantic search 398 | - **Augmentation**: Combine chunks with question in a prompt 399 | - **Generation**: LLM creates answer from augmented context 400 | - **K value**: Number of chunks to retrieve (typically 3-5) 401 | - **Similarity threshold**: Filter low-quality matches 402 | - **Metadata**: Track sources for citations 403 | - **Prompt engineering**: Critical for good RAG results 404 | - **RAG solves**: Hallucination, outdated info, domain knowledge 405 | - **Complete pipeline**: Documents → Embeddings → Vector DB → Query → Retrieve → Augment → Generate 406 | 407 | --- 408 | 409 | ## 6. Further Reading (Optional) 410 | 411 | - "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (original RAG paper) 412 | - LangChain RAG documentation 413 | - LlamaIndex RAG guides 414 | - RAG evaluation metrics 415 | 416 | --- 417 | 418 | **Next up:** Day 7 will have you build a complete RAG system from scratch! 419 | 420 | -------------------------------------------------------------------------------- /Day-04: Chunking & Data Extraction (PDF-Web-Docs)/README.md: -------------------------------------------------------------------------------- 1 | # Day 4 — Chunking & Data Extraction (PDF/Web/Docs) 2 | 3 | ## 1. Beginner-Friendly Introduction 4 | 5 | Before you can build a RAG system, you need to extract and prepare your data. Today, you'll learn how to extract text from various sources (PDFs, websites, documents) and split it into manageable chunks—a crucial step in the RAG pipeline. 6 | 7 | **Why this matters for RAG:** 8 | - RAG systems need documents in text format 9 | - Documents must be split into chunks that fit LLM context windows 10 | - Different sources require different extraction methods 11 | - Proper chunking improves retrieval quality 12 | 13 | **Real-world context:** 14 | Imagine you have a 100-page PDF and want to answer questions about it. You can't send all 100 pages to an LLM at once (token limits!). Instead, you: 15 | 1. Extract text from the PDF 16 | 2. Split it into smaller chunks (e.g., 500 words each) 17 | 3. Store these chunks for retrieval 18 | 4. 
When a question comes, find relevant chunks and send only those to the LLM 19 | 20 | --- 21 | 22 | ## 2. Deep-Dive Explanation 23 | 24 | ### 2.1 Data Extraction Overview 25 | 26 | **The Pipeline:** 27 | ``` 28 | Source File → Extract Text → Clean Text → Chunk Text → Store Chunks 29 | ``` 30 | 31 | **Common Sources:** 32 | - PDF files 33 | - Web pages (HTML) 34 | - Text files (.txt, .md) 35 | - Word documents (.docx) 36 | - CSV files 37 | - JSON files 38 | 39 | ### 2.2 PDF Extraction 40 | 41 | **Libraries:** 42 | - `PyPDF2`: Basic PDF reading 43 | - `pypdf`: Modern alternative 44 | - `pdfplumber`: Better text extraction 45 | - `PyMuPDF` (fitz): Fast and accurate 46 | 47 | **Challenges:** 48 | - Scanned PDFs (need OCR) 49 | - Complex layouts 50 | - Tables and images 51 | - Multi-column text 52 | 53 | **Basic Extraction:** 54 | ```python 55 | import pypdf 56 | 57 | def extract_pdf_text(filepath): 58 | text = "" 59 | with open(filepath, "rb") as file: 60 | reader = pypdf.PdfReader(file) 61 | for page in reader.pages: 62 | text += page.extract_text() + "\n" 63 | return text 64 | ``` 65 | 66 | ### 2.3 Web Scraping 67 | 68 | **Libraries:** 69 | - `requests`: HTTP requests 70 | - `BeautifulSoup4`: HTML parsing 71 | - `selenium`: For JavaScript-heavy sites 72 | 73 | **Basic Web Scraping:** 74 | ```python 75 | import requests 76 | from bs4 import BeautifulSoup 77 | 78 | def extract_web_text(url): 79 | response = requests.get(url) 80 | soup = BeautifulSoup(response.content, "html.parser") 81 | 82 | # Remove script and style elements 83 | for script in soup(["script", "style"]): 84 | script.decompose() 85 | 86 | # Get text 87 | text = soup.get_text() 88 | 89 | # Clean up whitespace 90 | lines = (line.strip() for line in text.splitlines()) 91 | chunks = (phrase.strip() for line in lines for phrase in line.split(" ")) 92 | text = " ".join(chunk for chunk in chunks if chunk) 93 | 94 | return text 95 | ``` 96 | 97 | ### 2.4 Text Chunking Strategies 98 | 99 | #### 2.4.1 Fixed-Size Chunking 100 | 101 | Split text into chunks of fixed character/word count: 102 | ``` 103 | Text: [1000 chars] → Chunk1[500] + Chunk2[500] 104 | ``` 105 | 106 | **Pros:** Simple, predictable 107 | **Cons:** May split sentences/paragraphs 108 | 109 | #### 2.4.2 Sentence-Aware Chunking 110 | 111 | Split at sentence boundaries: 112 | ``` 113 | Text → Sentences → Group into chunks (respecting max size) 114 | ``` 115 | 116 | **Pros:** Preserves sentence integrity 117 | **Cons:** More complex 118 | 119 | #### 2.4.3 Paragraph-Aware Chunking 120 | 121 | Split at paragraph boundaries: 122 | ``` 123 | Text → Paragraphs → Group paragraphs into chunks 124 | ``` 125 | 126 | **Pros:** Preserves context 127 | **Cons:** Chunks may vary significantly in size 128 | 129 | #### 2.4.4 Overlapping Chunks 130 | 131 | Add overlap between chunks for context: 132 | ``` 133 | Chunk1: [0-500] → Chunk2: [450-950] → Chunk3: [900-1400] 134 | ``` 135 | 136 | **Pros:** Maintains context across boundaries 137 | **Cons:** More storage, potential redundancy 138 | 139 | ### 2.5 Chunking Best Practices 140 | 141 | **Considerations:** 142 | - **Chunk size**: 200-1000 tokens (depends on model) 143 | - **Overlap**: 10-20% of chunk size 144 | - **Boundaries**: Respect sentence/paragraph boundaries 145 | - **Metadata**: Store source, position, timestamp 146 | 147 | **Metadata to Store:** 148 | ```python 149 | chunk_metadata = { 150 | "chunk_id": 1, 151 | "source": "document.pdf", 152 | "page": 3, 153 | "start_char": 0, 154 | "end_char": 500, 155 | "word_count": 75 156 | } 
157 | ``` 158 | 159 | ### 2.6 Text Cleaning 160 | 161 | **Common Cleaning Steps:** 162 | 1. Remove extra whitespace 163 | 2. Remove special characters (if needed) 164 | 3. Normalize encoding 165 | 4. Remove headers/footers 166 | 5. Handle line breaks 167 | 168 | --- 169 | 170 | ## 3. Instructor Examples 171 | 172 | ### Example 1: PDF Text Extraction 173 | 174 | ```python 175 | import pypdf 176 | 177 | def extract_pdf_text(filepath): 178 | """Extract text from a PDF file""" 179 | text = "" 180 | try: 181 | with open(filepath, "rb") as file: 182 | pdf_reader = pypdf.PdfReader(file) 183 | num_pages = len(pdf_reader.pages) 184 | 185 | for page_num in range(num_pages): 186 | page = pdf_reader.pages[page_num] 187 | page_text = page.extract_text() 188 | text += f"\n--- Page {page_num + 1} ---\n" 189 | text += page_text 190 | 191 | return text 192 | except Exception as e: 193 | print(f"Error extracting PDF: {e}") 194 | return None 195 | 196 | # Usage 197 | text = extract_pdf_text("document.pdf") 198 | print(f"Extracted {len(text)} characters") 199 | ``` 200 | 201 | ### Example 2: Web Scraping 202 | 203 | ```python 204 | import requests 205 | from bs4 import BeautifulSoup 206 | 207 | def extract_web_content(url): 208 | """Extract main content from a webpage""" 209 | try: 210 | headers = { 211 | "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)" 212 | } 213 | response = requests.get(url, headers=headers, timeout=10) 214 | response.raise_for_status() 215 | 216 | soup = BeautifulSoup(response.content, "html.parser") 217 | 218 | # Remove unwanted elements 219 | for element in soup(["script", "style", "nav", "footer", "header"]): 220 | element.decompose() 221 | 222 | # Try to find main content 223 | main_content = soup.find("main") or soup.find("article") or soup.find("body") 224 | 225 | if main_content: 226 | text = main_content.get_text(separator=" ", strip=True) 227 | # Clean up multiple spaces 228 | text = " ".join(text.split()) 229 | return text 230 | else: 231 | return soup.get_text(separator=" ", strip=True) 232 | 233 | except Exception as e: 234 | print(f"Error scraping {url}: {e}") 235 | return None 236 | 237 | # Usage 238 | content = extract_web_content("https://example.com/article") 239 | ``` 240 | 241 | ### Example 3: Sentence-Aware Chunking 242 | 243 | ```python 244 | import re 245 | 246 | def chunk_text_sentences(text, chunk_size=500, overlap=50): 247 | """Chunk text respecting sentence boundaries""" 248 | # Split into sentences (simple approach) 249 | sentences = re.split(r'(?<=[.!?])\s+', text) 250 | 251 | chunks = [] 252 | current_chunk = [] 253 | current_size = 0 254 | 255 | for sentence in sentences: 256 | sentence_size = len(sentence) 257 | 258 | # If adding this sentence exceeds chunk size 259 | if current_size + sentence_size > chunk_size and current_chunk: 260 | # Save current chunk 261 | chunk_text = " ".join(current_chunk) 262 | chunks.append(chunk_text) 263 | 264 | # Start new chunk with overlap 265 | overlap_text = " ".join(current_chunk[-2:]) if len(current_chunk) >= 2 else "" 266 | current_chunk = [overlap_text, sentence] if overlap_text else [sentence] 267 | current_size = len(" ".join(current_chunk)) 268 | else: 269 | current_chunk.append(sentence) 270 | current_size += sentence_size + 1 # +1 for space 271 | 272 | # Add final chunk 273 | if current_chunk: 274 | chunks.append(" ".join(current_chunk)) 275 | 276 | return chunks 277 | 278 | # Usage 279 | long_text = "Sentence one. Sentence two. Sentence three..." 
* 50 280 | chunks = chunk_text_sentences(long_text, chunk_size=200) 281 | print(f"Created {len(chunks)} chunks") 282 | ``` 283 | 284 | ### Example 4: Complete Document Processor 285 | 286 | ```python 287 | class DocumentProcessor: 288 | def __init__(self, chunk_size=500, overlap=50): 289 | self.chunk_size = chunk_size 290 | self.overlap = overlap 291 | self.chunks = [] 292 | 293 | def extract_from_pdf(self, filepath): 294 | """Extract text from PDF""" 295 | import pypdf 296 | text = "" 297 | with open(filepath, "rb") as file: 298 | reader = pypdf.PdfReader(file) 299 | for page in reader.pages: 300 | text += page.extract_text() + "\n" 301 | return text 302 | 303 | def extract_from_web(self, url): 304 | """Extract text from webpage""" 305 | import requests 306 | from bs4 import BeautifulSoup 307 | 308 | response = requests.get(url) 309 | soup = BeautifulSoup(response.content, "html.parser") 310 | for script in soup(["script", "style"]): 311 | script.decompose() 312 | return soup.get_text() 313 | 314 | def clean_text(self, text): 315 | """Clean extracted text""" 316 | # Remove extra whitespace 317 | text = " ".join(text.split()) 318 | # Remove special characters (optional) 319 | # text = re.sub(r'[^\w\s]', '', text) 320 | return text 321 | 322 | def chunk_text(self, text, source="unknown"): 323 | """Chunk text and store with metadata""" 324 | words = text.split() 325 | chunks = [] 326 | 327 | for i in range(0, len(words), self.chunk_size - self.overlap): 328 | chunk_words = words[i:i + self.chunk_size] 329 | chunk_text = " ".join(chunk_words) 330 | 331 | chunk_data = { 332 | "chunk_id": len(chunks) + 1, 333 | "text": chunk_text, 334 | "source": source, 335 | "word_count": len(chunk_words), 336 | "start_word": i, 337 | "end_word": min(i + self.chunk_size, len(words)) 338 | } 339 | chunks.append(chunk_data) 340 | 341 | self.chunks.extend(chunks) 342 | return chunks 343 | 344 | def process_pdf(self, filepath): 345 | """Complete PDF processing pipeline""" 346 | text = self.extract_from_pdf(filepath) 347 | text = self.clean_text(text) 348 | chunks = self.chunk_text(text, source=filepath) 349 | return chunks 350 | 351 | # Usage 352 | processor = DocumentProcessor(chunk_size=200, overlap=20) 353 | chunks = processor.process_pdf("document.pdf") 354 | print(f"Processed {len(chunks)} chunks") 355 | ``` 356 | 357 | --- 358 | 359 | ## 4. Student Practice Tasks 360 | 361 | ### Task 1: PDF Extractor 362 | Write a function that extracts text from a PDF and returns: 363 | - Full text 364 | - Number of pages 365 | - Text per page (as a list) 366 | 367 | ### Task 2: Web Scraper 368 | Create a web scraper that: 369 | - Takes a URL 370 | - Extracts main content (removes nav, ads, etc.) 371 | - Returns clean text 372 | - Handles errors gracefully 373 | 374 | ### Task 3: Chunking Functions 375 | Implement three chunking strategies: 376 | - Fixed-size chunking 377 | - Sentence-aware chunking 378 | - Paragraph-aware chunking 379 | 380 | Compare results on the same text. 
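For Task 3, the paragraph-aware strategy (described in Section 2.4.3) is the one variant not shown in the examples above, so here is a minimal starter sketch. It assumes paragraphs are separated by blank lines; the function name and the greedy size-based grouping are illustrative choices, not code from this roadmap.

```python
def chunk_text_paragraphs(text, chunk_size=500):
    """Chunk text at paragraph boundaries (greedy grouping by size)."""
    # Assumes paragraphs are separated by blank lines
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks = []
    current = []
    current_size = 0

    for para in paragraphs:
        # Close the current chunk if adding this paragraph would overflow it
        if current and current_size + len(para) > chunk_size:
            chunks.append("\n\n".join(current))
            current, current_size = [], 0
        current.append(para)
        current_size += len(para) + 2  # +2 for the paragraph separator

    if current:
        chunks.append("\n\n".join(current))

    return chunks

# Usage
sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
print(chunk_text_paragraphs(sample, chunk_size=40))
```

Note that a single paragraph longer than `chunk_size` still becomes its own oversized chunk, which matches the size-variability caveat from Section 2.4.3.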
381 | 382 | ### Task 4: Text Cleaner 383 | Write a comprehensive text cleaning function that: 384 | - Removes extra whitespace 385 | - Handles encoding issues 386 | - Removes headers/footers (if patterns detected) 387 | - Normalizes line breaks 388 | 389 | ### Task 5: Chunk Metadata 390 | Enhance chunking to include rich metadata: 391 | - Source file 392 | - Page number (for PDFs) 393 | - Character positions 394 | - Word count 395 | - Timestamp 396 | 397 | ### Task 6: Multi-Format Processor 398 | Create a processor that handles: 399 | - PDF files 400 | - Text files 401 | - Web URLs 402 | - Returns standardized chunk format 403 | 404 | --- 405 | 406 | ## 5. Summary / Key Takeaways 407 | 408 | - **Data extraction** is the first step in RAG pipelines 409 | - **PDF extraction** requires libraries like `pypdf` or `pdfplumber` 410 | - **Web scraping** uses `requests` and `BeautifulSoup` 411 | - **Chunking strategies** vary: fixed-size, sentence-aware, paragraph-aware 412 | - **Overlapping chunks** preserve context across boundaries 413 | - **Metadata** is crucial for tracking chunk sources 414 | - **Text cleaning** improves chunk quality 415 | - **Chunk size** should match your LLM's context window 416 | - **Different sources** require different extraction methods 417 | 418 | --- 419 | 420 | ## 6. Further Reading (Optional) 421 | 422 | - PyPDF2/PyPDF documentation 423 | - BeautifulSoup documentation 424 | - LangChain document loaders 425 | - LlamaIndex data connectors 426 | - Text chunking best practices 427 | 428 | --- 429 | 430 | **Next up:** Day 5 will teach you about embeddings and vector databases! 431 | 432 | -------------------------------------------------------------------------------- /Day-09: Advanced RAG (Reranking, Query Rewriting, Fusion)/README.md: -------------------------------------------------------------------------------- 1 | # Day 9 — Advanced RAG (Reranking, Query Rewriting, Fusion) 2 | 3 | ## 1. Beginner-Friendly Introduction 4 | 5 | Today, you'll learn advanced techniques that make RAG systems production-ready! These techniques improve answer quality, handle complex queries, and make your system more robust. 6 | 7 | **What are Advanced RAG Techniques?** 8 | - **Reranking**: Improve retrieval by reordering results 9 | - **Query Rewriting**: Transform queries for better retrieval 10 | - **Fusion**: Combine multiple retrieval strategies 11 | - **Hybrid Search**: Mix semantic and keyword search 12 | 13 | **Why these matter:** 14 | - Basic RAG works, but advanced techniques make it better 15 | - Production systems need these optimizations 16 | - Handle edge cases and complex queries 17 | - Improve answer accuracy and relevance 18 | 19 | **Real-world context:** 20 | Think of basic RAG as a simple search engine. Advanced RAG is like Google—it uses multiple strategies, reranks results, understands query intent, and combines different signals to give you the best answer. 21 | 22 | --- 23 | 24 | ## 2. Deep-Dive Explanation 25 | 26 | ### 2.1 Reranking 27 | 28 | **What is Reranking?** 29 | After retrieving initial results, rerank them using a more sophisticated model to improve order. 
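To see the shape of this step in code before any models are involved, here is a hedged sketch using a deliberately crude custom scorer based on term overlap; the function name is illustrative, and Example 3 in Section 3 swaps this scorer for a real cross-encoder.

```python
def rerank_by_term_overlap(query, documents, top_k=5):
    """Toy reranker: rescore candidates against the query, then reorder."""
    query_terms = set(query.lower().split())

    def score(doc):
        # Fraction of query terms that appear in the candidate document
        doc_terms = set(doc.lower().split())
        return len(query_terms & doc_terms) / max(len(query_terms), 1)

    # Rescore the whole candidate set, keep only the best top_k
    return sorted(documents, key=score, reverse=True)[:top_k]

# Usage: take the top 10 from initial retrieval, keep the best 5 after reranking
# candidates = initial_retrieval_results  # e.g. 10 document strings
# best = rerank_by_term_overlap("What is Python used for?", candidates)
```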
30 | 31 | **Why Rerank?** 32 | - Initial retrieval may miss subtle relevance 33 | - Reranking models are more accurate 34 | - Better final results 35 | 36 | **How it works:** 37 | ``` 38 | Initial Retrieval (Top 10) → Reranking Model → Reordered Top 5 39 | ``` 40 | 41 | **Reranking Models:** 42 | - Cross-encoders (BERT-based) 43 | - Specialized reranking models 44 | - Custom scoring functions 45 | 46 | **Benefits:** 47 | - Better top results 48 | - Improved answer quality 49 | - More relevant context 50 | 51 | ### 2.2 Query Rewriting 52 | 53 | **What is Query Rewriting?** 54 | Transform user queries to improve retrieval: 55 | - Expand queries with synonyms 56 | - Generate multiple query variations 57 | - Reformulate for better matching 58 | - Extract key terms 59 | 60 | **Techniques:** 61 | 1. **Query Expansion**: Add related terms 62 | 2. **Query Decomposition**: Break into sub-queries 63 | 3. **Query Reformulation**: Rephrase for clarity 64 | 4. **Hybrid Queries**: Combine semantic + keyword 65 | 66 | **Example:** 67 | ``` 68 | Original: "How to train model?" 69 | Rewritten: ["How to train machine learning model?", "model training process", "train ML model"] 70 | ``` 71 | 72 | ### 2.3 Fusion Techniques 73 | 74 | **What is Fusion?** 75 | Combine results from multiple retrieval strategies: 76 | - Different embedding models 77 | - Keyword + semantic search 78 | - Multiple query variations 79 | - Different chunk sizes 80 | 81 | **Fusion Methods:** 82 | 1. **Reciprocal Rank Fusion (RRF)**: Combine rankings 83 | 2. **Weighted Fusion**: Weight different sources 84 | 3. **Deduplication**: Remove duplicates 85 | 4. **Score Normalization**: Normalize before combining 86 | 87 | **Benefits:** 88 | - More comprehensive retrieval 89 | - Better coverage 90 | - Reduces misses 91 | 92 | ### 2.4 Hybrid Search 93 | 94 | **What is Hybrid Search?** 95 | Combine semantic search (embeddings) with keyword search (BM25/TF-IDF): 96 | - Semantic: Understands meaning 97 | - Keyword: Exact matches 98 | - Best of both worlds 99 | 100 | **Implementation:** 101 | ``` 102 | Query → [Semantic Search] → Results 1 103 | → [Keyword Search] → Results 2 104 | → [Fusion] → Final Results 105 | ``` 106 | 107 | ### 2.5 Advanced RAG Pipeline 108 | 109 | **Enhanced Pipeline:** 110 | ``` 111 | User Query 112 | ↓ 113 | Query Rewriting (multiple variations) 114 | ↓ 115 | Multiple Retrieval Strategies 116 | ↓ 117 | Fusion (combine results) 118 | ↓ 119 | Reranking (improve order) 120 | ↓ 121 | Top K Chunks 122 | ↓ 123 | Augmentation + Generation 124 | ↓ 125 | Answer 126 | ``` 127 | 128 | --- 129 | 130 | ## 3. Instructor Examples 131 | 132 | ### Example 1: Query Rewriting 133 | 134 | ```python 135 | import openai 136 | 137 | def rewrite_query(original_query): 138 | """Generate query variations""" 139 | prompt = f"""Generate 3 different ways to ask this question that might retrieve different relevant documents. 140 | 141 | Original question: {original_query} 142 | 143 | Generate 3 variations, each on a new line:""" 144 | 145 | response = openai.ChatCompletion.create( 146 | model="gpt-3.5-turbo", 147 | messages=[{"role": "user", "content": prompt}], 148 | temperature=0.7, 149 | max_tokens=150 150 | ) 151 | 152 | variations = response.choices[0].message.content.strip().split("\n") 153 | variations = [v.strip("- ").strip() for v in variations if v.strip()] 154 | 155 | return [original_query] + variations 156 | 157 | # Usage 158 | query = "How does machine learning work?" 
159 | variations = rewrite_query(query) 160 | print(variations) 161 | # Output: [ 162 | # "How does machine learning work?", 163 | # "What is the process of machine learning?", 164 | # "How do ML algorithms learn from data?", 165 | # "Explain the mechanism of machine learning" 166 | # ] 167 | ``` 168 | 169 | ### Example 2: Reciprocal Rank Fusion 170 | 171 | ```python 172 | def reciprocal_rank_fusion(results_list, k=60): 173 | """ 174 | Combine multiple ranked result lists using RRF 175 | 176 | Args: 177 | results_list: List of result lists, each with (doc_id, score) 178 | k: RRF constant (typically 60) 179 | """ 180 | doc_scores = {} 181 | 182 | for results in results_list: 183 | for rank, (doc_id, score) in enumerate(results, 1): 184 | if doc_id not in doc_scores: 185 | doc_scores[doc_id] = 0 186 | doc_scores[doc_id] += 1 / (k + rank) 187 | 188 | # Sort by score 189 | fused_results = sorted( 190 | doc_scores.items(), 191 | key=lambda x: x[1], 192 | reverse=True 193 | ) 194 | 195 | return fused_results 196 | 197 | # Usage 198 | results1 = [("doc1", 0.9), ("doc2", 0.8), ("doc3", 0.7)] 199 | results2 = [("doc2", 0.85), ("doc1", 0.82), ("doc4", 0.75)] 200 | results3 = [("doc3", 0.88), ("doc1", 0.80), ("doc5", 0.72)] 201 | 202 | fused = reciprocal_rank_fusion([results1, results2, results3]) 203 | print(fused) 204 | # Output (approx.): [("doc1", 0.049), ("doc2", 0.033), ("doc3", 0.032), ...] 205 | ``` 206 | 207 | ### Example 3: Reranking with Cross-Encoder 208 | 209 | ```python 210 | from sentence_transformers import CrossEncoder 211 | 212 | class Reranker: 213 | def __init__(self, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"): 214 | self.model = CrossEncoder(model_name) 215 | 216 | def rerank(self, query, documents, top_k=5): 217 | """Rerank documents for a query""" 218 | # Create query-document pairs 219 | pairs = [[query, doc] for doc in documents] 220 | 221 | # Get scores 222 | scores = self.model.predict(pairs) 223 | 224 | # Sort by score 225 | ranked_indices = sorted( 226 | range(len(scores)), 227 | key=lambda i: scores[i], 228 | reverse=True 229 | ) 230 | 231 | # Return top K 232 | reranked = [ 233 | { 234 | "document": documents[i], 235 | "score": float(scores[i]), 236 | "rank": rank + 1 237 | } 238 | for rank, i in enumerate(ranked_indices[:top_k]) 239 | ] 240 | 241 | return reranked 242 | 243 | # Usage 244 | reranker = Reranker() 245 | documents = ["Doc 1 text...", "Doc 2 text...", "Doc 3 text..."] 246 | reranked = reranker.rerank("What is Python?", documents, top_k=3) 247 | for item in reranked: 248 | print(f"Rank {item['rank']}: Score {item['score']:.3f}") 249 | ``` 250 | 251 | ### Example 4: Hybrid Search 252 | 253 | ```python 254 | from rank_bm25 import BM25Okapi 255 | import numpy as np 256 | 257 | class HybridSearch: 258 | def __init__(self, documents, embeddings): 259 | self.documents = documents 260 | self.embeddings = embeddings 261 | 262 | # Setup BM25 (keyword search) 263 | tokenized_docs = [doc.split() for doc in documents] 264 | self.bm25 = BM25Okapi(tokenized_docs) 265 | 266 | def semantic_search(self, query_embedding, k=10): 267 | """Semantic search using embeddings""" 268 | similarities = [] 269 | for emb in self.embeddings: 270 | similarity = np.dot(query_embedding, emb) / ( 271 | np.linalg.norm(query_embedding) * np.linalg.norm(emb) 272 | ) 273 | similarities.append(similarity) 274 | 275 | top_indices = np.argsort(similarities)[::-1][:k] 276 | return [(idx, similarities[idx]) for idx in top_indices] 277 | 278 | def keyword_search(self, query, k=10): 279 | """Keyword 
search using BM25""" 280 | tokenized_query = query.split() 281 | scores = self.bm25.get_scores(tokenized_query) 282 | top_indices = np.argsort(scores)[::-1][:k] 283 | return [(idx, scores[idx]) for idx in top_indices] 284 | 285 | def hybrid_search(self, query, query_embedding, k=5, alpha=0.5): 286 | """Combine semantic and keyword search""" 287 | # Get results from both 288 | semantic_results = self.semantic_search(query_embedding, k*2) 289 | keyword_results = self.keyword_search(query, k*2) 290 | 291 | # Normalize scores 292 | semantic_scores = {idx: score for idx, score in semantic_results} 293 | keyword_scores = {idx: score for idx, score in keyword_results} 294 | 295 | # Normalize to 0-1 range 296 | max_sem = max(semantic_scores.values()) if semantic_scores else 1 297 | max_key = max(keyword_scores.values()) if keyword_scores else 1 298 | 299 | # Combine scores 300 | combined_scores = {} 301 | all_indices = set(semantic_scores.keys()) | set(keyword_scores.keys()) 302 | 303 | for idx in all_indices: 304 | sem_score = semantic_scores.get(idx, 0) / max_sem if max_sem > 0 else 0 305 | key_score = keyword_scores.get(idx, 0) / max_key if max_key > 0 else 0 306 | combined_scores[idx] = alpha * sem_score + (1 - alpha) * key_score 307 | 308 | # Return top K 309 | top_indices = sorted( 310 | combined_scores.items(), 311 | key=lambda x: x[1], 312 | reverse=True 313 | )[:k] 314 | 315 | return top_indices 316 | 317 | # Usage 318 | hybrid = HybridSearch(documents, embeddings) 319 | results = hybrid.hybrid_search("Python programming", query_embedding, k=5) 320 | ``` 321 | 322 | ### Example 5: Complete Advanced RAG Pipeline 323 | 324 | ```python 325 | class AdvancedRAG: 326 | def __init__(self): 327 | self.vector_store = None 328 | self.reranker = Reranker() 329 | # ... other components 330 | 331 | def query(self, question, k=5): 332 | """Advanced RAG with all techniques""" 333 | # 1. Query Rewriting 334 | query_variations = rewrite_query(question) 335 | 336 | # 2. Multiple Retrievals 337 | all_results = [] 338 | for query_var in query_variations: 339 | results = self.vector_store.search(query_var, k=k*2) 340 | all_results.append(results) 341 | 342 | # 3. Fusion 343 | fused_results = reciprocal_rank_fusion(all_results) 344 | 345 | # 4. Reranking 346 | top_docs = [self.vector_store.get_doc(doc_id) for doc_id, _ in fused_results[:k*2]] 347 | reranked = self.reranker.rerank(question, top_docs, top_k=k) 348 | 349 | # 5. Augment and Generate 350 | context = "\n\n".join([item["document"] for item in reranked]) 351 | answer = self.generate_answer(question, context) 352 | 353 | return { 354 | "answer": answer, 355 | "sources": reranked 356 | } 357 | ``` 358 | 359 | --- 360 | 361 | ## 4. 
Student Practice Tasks 362 | 363 | ### Task 1: Query Rewriting 364 | Implement query rewriting that: 365 | - Generates 3-5 query variations 366 | - Uses LLM to create variations 367 | - Tests if variations improve retrieval 368 | 369 | ### Task 2: Reranking 370 | Add reranking to your RAG system: 371 | - Use a reranking model 372 | - Compare results before/after reranking 373 | - Measure improvement 374 | 375 | ### Task 3: Fusion 376 | Implement fusion: 377 | - Combine results from multiple queries 378 | - Use RRF or weighted fusion 379 | - Compare fused vs single retrieval 380 | 381 | ### Task 4: Hybrid Search 382 | Build hybrid search: 383 | - Combine semantic + keyword search 384 | - Tune the alpha parameter 385 | - Compare with pure semantic search 386 | 387 | ### Task 5: Complete Advanced Pipeline 388 | Combine all techniques: 389 | - Query rewriting 390 | - Multiple retrievals 391 | - Fusion 392 | - Reranking 393 | - Generation 394 | 395 | ### Task 6: Evaluation 396 | Evaluate advanced techniques: 397 | - Compare answer quality 398 | - Measure retrieval improvement 399 | - Test with various queries 400 | 401 | --- 402 | 403 | ## 5. Summary / Key Takeaways 404 | 405 | - **Reranking** improves result order using sophisticated models 406 | - **Query rewriting** generates variations for better retrieval 407 | - **Fusion** combines multiple retrieval strategies 408 | - **Hybrid search** mixes semantic and keyword search 409 | - **Advanced techniques** significantly improve RAG quality 410 | - **Production systems** benefit from these optimizations 411 | - **Trade-offs**: More complexity vs better results 412 | - **Evaluation** is crucial to measure improvements 413 | - **Combining techniques** often works best 414 | - **Start simple**, add complexity as needed 415 | 416 | --- 417 | 418 | ## 6. Further Reading (Optional) 419 | 420 | - "Reciprocal Rank Fusion" paper 421 | - Cross-encoder models documentation 422 | - BM25 algorithm explanation 423 | - RAG evaluation metrics 424 | - Production RAG best practices 425 | 426 | --- 427 | 428 | **Next up:** Day 10 - Build and deploy a complete RAG application! 429 | 430 | --------------------------------------------------------------------------------