├── README.md
├── Day-01: Python Foundations for GenAI
│   ├── assignment.md
│   └── README.md
├── Day-02: Generative AI & LLM Basics
│   ├── assignment.md
│   └── README.md
├── Day-03: Prompt Engineering Essentials
│   ├── assignment.md
│   └── README.md
├── Day-04: Chunking & Data Extraction (PDF-Web-Docs)
│   ├── assignment.md
│   └── README.md
├── Day-05: Embeddings & Vector Databases
│   ├── assignment.md
│   └── README.md
├── Day-06: RAG Fundamentals (Retrieval → Augmentation → Generation)
│   ├── assignment.md
│   └── README.md
├── Day-07: Implement RAG From Scratch (Pure Python)
│   ├── assignment.md
│   └── README.md
├── Day-08: RAG Using LangChain or LlamaIndex
│   ├── assignment.md
│   └── README.md
├── Day-09: Advanced RAG (Reranking, Query Rewriting, Fusion)
│   ├── assignment.md
│   └── README.md
├── Day-10: Build & Deploy a RAG Application (FastAPI-Streamlit)
│   ├── assignment.md
│   └── README.md
└── RAG Projects
    └── readme.md

/README.md:
--------------------------------------------------------------------------------
1 | # 🚀 10-Day RAG Beginner Roadmap
2 | 
3 | Welcome to your comprehensive learning journey into **Retrieval-Augmented Generation (RAG)**! This repository is designed for absolute beginners who want to master RAG from the ground up in just 10 days.
4 | 
5 | ## 📖 Description
6 | 
7 | This roadmap takes you from Python fundamentals all the way to building and deploying a complete RAG application. Each day builds upon the previous one, ensuring you have a solid foundation before moving to more advanced concepts. By the end of 10 days, you'll have hands-on experience with:
8 | 
9 | - Python programming for AI applications
10 | - Large Language Models (LLMs) and their capabilities
11 | - Prompt engineering techniques
12 | - Data extraction and chunking strategies
13 | - Vector embeddings and databases
14 | - Building RAG systems from scratch
15 | - Using frameworks like LangChain and LlamaIndex
16 | - Advanced RAG techniques
17 | - Deploying production-ready RAG applications
18 | 
19 | ## 🎯 How to Use This Repository:
20 | 
21 | 1. **Study Day-by-Day**: Follow the roadmap sequentially, starting with Day 1
22 | 2. **Read the Notes**: Open each day's folder and read the `README.md` file thoroughly
23 | 3. **Complete Assignments**: Work through the `assignment.md` file for hands-on practice
24 | 4. **Practice Regularly**: Code along with the examples and complete all practice tasks
25 | 5. **Build Projects**: Each day includes a mini-project to reinforce your learning
26 | 
27 | ### Recommended Study Schedule:
28 | 
29 | - **Time per day**: 2-4 hours
30 | - **Read notes**: 30-60 minutes
31 | - **Complete assignments**: 1-2 hours
32 | - **Mini project**: 30-60 minutes
33 | 
34 | ## 🛠️ Technical Requirements:
35 | 
36 | ### Python Version
37 | - **Python 3.8 or higher** (Python 3.10+ recommended)
38 | 
39 | ### Required Libraries
40 | 
41 | You'll install these progressively throughout the roadmap:
42 | 
43 | ```bash
44 | # Core libraries
45 | pip install openai
46 | pip install langchain
47 | pip install llama-index
48 | pip install chromadb
49 | pip install sentence-transformers
50 | pip install pypdf
51 | pip install beautifulsoup4
52 | pip install requests
53 | pip install fastapi
54 | pip install streamlit
55 | pip install uvicorn
56 | ```
57 | 
58 | ### API Keys
59 | 
60 | You'll need API keys for certain days:
61 | 
62 | - **OpenAI API Key** (for Days 2, 3, and 5-10)
63 |   - Sign up at [platform.openai.com](https://platform.openai.com)
64 |   - Get your API key from the API keys section
65 |   - Store it securely (use environment variables)
66 | 
67 | ### Environment Setup
68 | 
69 | Create a `.env` file in the root directory:
70 | 
71 | ```env
72 | OPENAI_API_KEY=your_api_key_here
73 | ```
74 | 
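Once the `.env` file exists, a minimal sketch of loading the key in Python (this uses the `python-dotenv` package, installed with `pip install python-dotenv`, an extra dependency beyond the list above):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # copies values from .env into the process environment
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; check your .env file")
```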
75 | ## 📚 Roadmap Overview
76 | 
77 | | Day | Topic | Focus Area |
78 | |-----|-------|------------|
79 | | **Day 1** | Python Foundations for GenAI | Python basics, data structures, file handling, APIs |
80 | | **Day 2** | Generative AI & LLM Basics | Understanding LLMs, OpenAI API, model capabilities |
81 | | **Day 3** | Prompt Engineering Essentials | Crafting effective prompts, few-shot learning, chain-of-thought |
82 | | **Day 4** | Chunking & Data Extraction | PDF parsing, web scraping, document processing |
83 | | **Day 5** | Embeddings & Vector Databases | Vector embeddings, similarity search, ChromaDB |
84 | | **Day 6** | RAG Fundamentals | Retrieval → Augmentation → Generation pipeline |
85 | | **Day 7** | Implement RAG From Scratch | Building RAG system with pure Python |
86 | | **Day 8** | RAG Using LangChain or LlamaIndex | Using popular RAG frameworks |
87 | | **Day 9** | Advanced RAG | Reranking, query rewriting, fusion techniques |
88 | | **Day 10** | Build & Deploy RAG Application | FastAPI/Streamlit deployment, production considerations |
89 | 
90 | ## 🗂️ Repository Structure
91 | 
92 | ```
93 | rag-roadmap/
94 | │
95 | ├── Day-01: Python Foundations for GenAI/
96 | │   ├── README.md
97 | │   └── assignment.md
98 | │
99 | ├── Day-02: Generative AI & LLM Basics/
100 | │   ├── README.md
101 | │   └── assignment.md
102 | │
103 | ├── Day-03: Prompt Engineering Essentials/
104 | │   ├── README.md
105 | │   └── assignment.md
106 | │
107 | ├── Day-04: Chunking & Data Extraction (PDF-Web-Docs)/
108 | │   ├── README.md
109 | │   └── assignment.md
110 | │
111 | ├── Day-05: Embeddings & Vector Databases/
112 | │   ├── README.md
113 | │   └── assignment.md
114 | │
115 | ├── Day-06: RAG Fundamentals (Retrieval → Augmentation → Generation)/
116 | │   ├── README.md
117 | │   └── assignment.md
118 | │
119 | ├── Day-07: Implement RAG From Scratch (Pure Python)/
120 | │   ├── README.md
121 | │   └── assignment.md
122 | │
123 | ├── Day-08: RAG Using LangChain or LlamaIndex/
124 | │   ├── README.md
125 | │   └── assignment.md
126 | │
127 | ├── Day-09: Advanced RAG (Reranking, Query Rewriting, Fusion)/
128 | │   ├── README.md
129 | │   └── assignment.md
130 | │
131 | ├── Day-10: Build & Deploy a RAG Application (FastAPI-Streamlit)/
132 | │   ├── README.md
133 | │   └── assignment.md
134 | │
135 | ├── RAG Projects/
136 | │   └── readme.md
137 | │
138 | └── README.md (this file)
139 | ```
140 | 
141 | ## 💡 Learning Tips
142 | 
143 | 1. **Don't Skip Days**: Each day builds on previous concepts
144 | 2. **Code Along**: Type out the examples yourself, don't just read
145 | 3. **Experiment**: Modify examples to see what happens
146 | 4. **Ask Questions**: If something is unclear, research it
147 | 5. **Take Notes**: Write down key concepts in your own words
148 | 6. **Build Projects**: The mini-projects are crucial for understanding
149 | 
150 | ## 🎓 Prerequisites
151 | 
152 | - Basic understanding of programming concepts (variables, functions, loops)
153 | - Familiarity with command line/terminal
154 | - Willingness to learn and experiment
155 | - No prior AI/ML experience required!
156 | 
157 | ## 📝 Notes
158 | 
159 | - All code examples are beginner-friendly
160 | - Solutions are not provided for assignments (learning by doing!)
161 | - You can work at your own pace, but aim to complete one day's material per day
162 | - Feel free to revisit previous days if needed
163 | 
164 | ## 🙏 Credits
165 | 
166 | Created by **Chandra Sekhar**
167 | 
168 | ---
169 | 
170 | **Ready to start?** Navigate to `Day-01: Python Foundations for GenAI/` and begin your RAG journey! 🚀
171 | 

--------------------------------------------------------------------------------
/Day-01: Python Foundations for GenAI/assignment.md:
--------------------------------------------------------------------------------
1 | # Day 1 — Assignment
2 | 
3 | ## Instructions
4 | 
5 | Complete the following tasks to reinforce your Python foundations. Write all code in separate Python files (`.py`). Test your code thoroughly and make sure it runs without errors.
6 | 
7 | **Important:**
8 | - Use proper error handling
9 | - Add comments to explain your code
10 | - Follow Python naming conventions (snake_case for functions/variables)
11 | - Test with different inputs to ensure your code is robust
12 | 
13 | ---
14 | 
15 | ## Tasks
16 | 
17 | ### Task 1: Document Statistics Calculator
18 | 
19 | Create a function `calculate_document_stats(filename)` that:
20 | - Reads a text file
21 | - Returns a dictionary with:
22 |   - Total characters (including spaces)
23 |   - Total words
24 |   - Total sentences (split by `.`, `!`, `?`)
25 |   - Average words per sentence
26 |   - Most common word (and its frequency)
27 | 
28 | **Test file:** Create a `sample.txt` file with at least 5 sentences to test your function.
29 | 
30 | ---
31 | 
32 | ### Task 2: Text Chunker with Overlap
33 | 
34 | Implement a function `chunk_with_overlap(text, chunk_size, overlap)` that:
35 | - Splits text into chunks of `chunk_size` characters
36 | - Each chunk overlaps with the previous one by `overlap` characters
37 | - Returns a list of dictionaries, each containing:
38 |   - `chunk_id`: Sequential number (1, 2, 3...)
39 |   - `text`: The chunk text
40 |   - `start_pos`: Starting character position
41 |   - `end_pos`: Ending character position
42 |   - `word_count`: Number of words in chunk
43 | 
44 | **Example:**
45 | ```python
46 | text = "This is a sample text for chunking."
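# With chunk_size=10 and overlap=3, consecutive chunks start
# chunk_size - overlap = 7 characters apart, so start positions
# run 0, 7, 14, 21, 28, ... until the text is exhausted.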
47 | chunks = chunk_with_overlap(text, chunk_size=10, overlap=3) 48 | # Should create overlapping chunks 49 | ``` 50 | 51 | --- 52 | 53 | ### Task 3: Document Manager Class 54 | 55 | Create a `DocumentManager` class that: 56 | - Can load multiple documents 57 | - Stores each document with metadata (filename, content, word_count) 58 | - Has a method to find documents by keyword (searches in content) 59 | - Has a method to get statistics across all documents 60 | - Has a method to export all document info to a JSON file 61 | 62 | **Requirements:** 63 | - Use a dictionary to store documents (key: filename, value: document data) 64 | - Implement `add_document(filename, content)` 65 | - Implement `search_documents(keyword)` → returns list of matching filenames 66 | - Implement `get_all_stats()` → returns summary statistics 67 | - Implement `export_to_json(output_file)` → saves all document data 68 | 69 | --- 70 | 71 | ### Task 4: Text Preprocessing Function 72 | 73 | Write a function `preprocess_text(text)` that: 74 | - Converts text to lowercase 75 | - Removes all punctuation (keep spaces) 76 | - Removes extra whitespace (multiple spaces → single space) 77 | - Removes leading/trailing whitespace 78 | - Returns the cleaned text 79 | 80 | **Bonus:** Also create a function that removes stop words (common words like "the", "a", "an", "is", etc.) 81 | 82 | --- 83 | 84 | ### Task 5: File Batch Processor 85 | 86 | Create a function `process_multiple_files(file_list, output_dir)` that: 87 | - Takes a list of file paths 88 | - Reads each file 89 | - Processes it (calculate stats, chunk it, etc.) 90 | - Saves processed results to `output_dir` 91 | - Returns a summary report 92 | 93 | **Requirements:** 94 | - Handle errors gracefully (skip files that can't be read) 95 | - Create output directory if it doesn't exist 96 | - Save each file's stats as a separate JSON file 97 | - Return a dictionary with success/failure counts 98 | 99 | --- 100 | 101 | ## One Mini Project 102 | 103 | ### 📘 Build a Document Analyzer Tool 104 | 105 | Create a complete Python script `document_analyzer.py` that: 106 | 107 | 1. **Takes command-line arguments:** 108 | - Input file or directory 109 | - Output format (JSON, TXT, or both) 110 | - Chunk size (optional, default 200) 111 | 112 | 2. **For a single file:** 113 | - Reads the file 114 | - Calculates statistics (words, sentences, characters) 115 | - Chunks the text 116 | - Generates a word frequency report 117 | - Exports results 118 | 119 | 3. **For a directory:** 120 | - Processes all `.txt` files in the directory 121 | - Creates a summary report 122 | - Exports individual file reports 123 | 124 | 4. 
**Output includes:** 125 | - Document statistics 126 | - Top 10 most common words 127 | - Chunk information 128 | - Processing timestamp 129 | 130 | **Example usage:** 131 | ```bash 132 | python document_analyzer.py input.txt --output json --chunk-size 200 133 | python document_analyzer.py ./documents/ --output both 134 | ``` 135 | 136 | **Requirements:** 137 | - Use `argparse` for command-line arguments 138 | - Implement proper error handling 139 | - Use classes to organize your code 140 | - Include docstrings for all functions 141 | - Make it user-friendly with clear output messages 142 | 143 | **Deliverables:** 144 | - `document_analyzer.py` - Main script 145 | - `requirements.txt` - List of dependencies (if any) 146 | - Sample output files showing the results 147 | 148 | --- 149 | 150 | ## Expected Output Section 151 | 152 | ### Task 1 Expected Output: 153 | ```python 154 | stats = calculate_document_stats("sample.txt") 155 | # Output: { 156 | # 'characters': 245, 157 | # 'words': 42, 158 | # 'sentences': 5, 159 | # 'avg_words_per_sentence': 8.4, 160 | # 'most_common_word': ('the', 5) 161 | # } 162 | ``` 163 | 164 | ### Task 2 Expected Output: 165 | ```python 166 | chunks = chunk_with_overlap("Long text here...", 20, 5) 167 | # Output: [ 168 | # {'chunk_id': 1, 'text': 'Long text here...', 'start_pos': 0, 'end_pos': 20, 'word_count': 4}, 169 | # {'chunk_id': 2, 'text': 'here...more text', 'start_pos': 15, 'end_pos': 35, 'word_count': 3}, 170 | # ... 171 | # ] 172 | ``` 173 | 174 | ### Mini Project Expected Output: 175 | 176 | When you run the document analyzer: 177 | - Clear console output showing progress 178 | - Generated JSON/TXT files with analysis results 179 | - Summary statistics displayed in terminal 180 | - Error messages for any files that couldn't be processed 181 | 182 | **Example console output:** 183 | ``` 184 | Document Analyzer Tool 185 | ===================== 186 | Processing: sample.txt 187 | ✓ File processed successfully 188 | - Words: 1,234 189 | - Sentences: 45 190 | - Chunks: 12 191 | - Top word: 'the' (45 occurrences) 192 | Results saved to: output/sample_analysis.json 193 | ``` 194 | 195 | --- 196 | 197 | ## Submission Checklist 198 | 199 | - [ ] All 5 tasks completed and tested 200 | - [ ] Mini project fully functional 201 | - [ ] Code is well-commented 202 | - [ ] Error handling implemented 203 | - [ ] Code follows Python best practices 204 | - [ ] All files run without errors 205 | 206 | **Good luck!** 🚀 207 | 208 | -------------------------------------------------------------------------------- /Day-08: RAG Using LangChain or LlamaIndex/assignment.md: -------------------------------------------------------------------------------- 1 | # Day 8 — Assignment 2 | 3 | ## Instructions 4 | 5 | Build RAG systems using LangChain and LlamaIndex frameworks. Compare both approaches and understand their strengths. Install required libraries: 6 | 7 | ```bash 8 | pip install langchain llama-index openai chromadb 9 | ``` 10 | 11 | **Important:** 12 | - Try both frameworks 13 | - Compare approaches 14 | - Experiment with configurations 15 | - Document differences 16 | - Understand when to use which 17 | 18 | --- 19 | 20 | ## Tasks 21 | 22 | ### Task 1: LangChain RAG System 23 | 24 | Build a complete RAG system using LangChain `langchain_rag.py`: 25 | 26 | **Components:** 27 | 1. Document loading (PDF/TXT) 28 | 2. Text splitting 29 | 3. Vector store (ChromaDB) 30 | 4. Retrieval QA chain 31 | 5. 
Query interface 32 | 33 | **Requirements:** 34 | - Use LangChain components 35 | - Support multiple document types 36 | - Configurable chunking 37 | - Return sources with answers 38 | - Handle errors gracefully 39 | 40 | **Test with:** Multiple documents 41 | 42 | **Deliverable:** `task1_langchain_rag.py` 43 | 44 | --- 45 | 46 | ### Task 2: LlamaIndex RAG System 47 | 48 | Build a complete RAG system using LlamaIndex `llamaindex_rag.py`: 49 | 50 | **Components:** 51 | 1. Document loading 52 | 2. Index creation 53 | 3. Query engine 54 | 4. Response synthesis 55 | 5. Query interface 56 | 57 | **Requirements:** 58 | - Use LlamaIndex components 59 | - Custom service context 60 | - Configurable settings 61 | - Source retrieval 62 | - Error handling 63 | 64 | **Test with:** Same documents as Task 1 65 | 66 | **Deliverable:** `task2_llamaindex_rag.py` 67 | 68 | --- 69 | 70 | ### Task 3: Framework Configuration Comparison 71 | 72 | Create a comparison tool `framework_comparison.py`: 73 | 74 | **Compare:** 75 | - Code complexity (lines of code) 76 | - Setup time 77 | - Query performance 78 | - Feature availability 79 | - Ease of customization 80 | 81 | **Requirements:** 82 | - Build same RAG system with both 83 | - Measure performance metrics 84 | - Document differences 85 | - Create comparison report 86 | 87 | **Deliverable:** `task3_comparison.py` + comparison report 88 | 89 | --- 90 | 91 | ### Task 4: Advanced Features Exploration 92 | 93 | Explore advanced features in both frameworks: 94 | 95 | **LangChain:** 96 | - Conversational memory 97 | - Different chain types 98 | - Agents 99 | - Custom retrievers 100 | 101 | **LlamaIndex:** 102 | - Different index types 103 | - Advanced retrievers 104 | - Response modes 105 | - Node postprocessors 106 | 107 | **Requirements:** 108 | - Implement 2-3 advanced features from each 109 | - Document what each does 110 | - Show examples 111 | 112 | **Deliverable:** `task4_advanced_features.py` 113 | 114 | --- 115 | 116 | ### Task 5: Hybrid Approach 117 | 118 | Create a system that uses both frameworks `hybrid_rag.py`: 119 | 120 | **Ideas:** 121 | - Use LangChain for document loading 122 | - Use LlamaIndex for indexing 123 | - Combine retrieval strategies 124 | - Use best of both worlds 125 | 126 | **Requirements:** 127 | - Integrate both frameworks 128 | - Explain why you chose each component 129 | - Make it work seamlessly 130 | 131 | **Deliverable:** `task5_hybrid_rag.py` 132 | 133 | --- 134 | 135 | ## One Mini Project 136 | 137 | ### 🚀 Build a Framework Comparison RAG Application 138 | 139 | Create a comprehensive application `framework_rag_comparison.py` that demonstrates both LangChain and LlamaIndex. 140 | 141 | **Features:** 142 | 143 | 1. **Dual Framework Support:** 144 | - Switch between LangChain and LlamaIndex 145 | - Same documents, different frameworks 146 | - Side-by-side comparison 147 | 148 | 2. **Document Management:** 149 | - Load documents once 150 | - Index with both frameworks 151 | - Compare indexing time 152 | - Compare storage size 153 | 154 | 3. **Query Interface:** 155 | ``` 156 | === Framework RAG Comparison === 157 | 1. Load documents 158 | 2. Index with LangChain 159 | 3. Index with LlamaIndex 160 | 4. Query (LangChain) 161 | 5. Query (LlamaIndex) 162 | 6. Compare frameworks 163 | 7. Performance metrics 164 | 8. Exit 165 | ``` 166 | 167 | 4. 
**Comparison Features:** 168 | - Side-by-side query results 169 | - Performance metrics (time, tokens) 170 | - Answer quality comparison 171 | - Source comparison 172 | - Code complexity metrics 173 | 174 | 5. **Advanced Analysis:** 175 | - Response time comparison 176 | - Token usage comparison 177 | - Answer similarity 178 | - Source overlap 179 | - Quality scoring 180 | 181 | 6. **Reporting:** 182 | - Generate comparison reports 183 | - Export results 184 | - Visualize differences 185 | - Recommendations 186 | 187 | **Requirements:** 188 | - Clean, modular code 189 | - Both frameworks fully implemented 190 | - Fair comparison methodology 191 | - Detailed documentation 192 | - Performance metrics 193 | - User-friendly interface 194 | 195 | **Example Usage:** 196 | ```python 197 | app = FrameworkComparison() 198 | app.load_documents("./documents/") 199 | 200 | # Index with both 201 | app.index_langchain() 202 | app.index_llamaindex() 203 | 204 | # Compare 205 | results = app.compare_query("What is machine learning?") 206 | print("LangChain:", results["langchain"]["answer"]) 207 | print("LlamaIndex:", results["llamaindex"]["answer"]) 208 | print("Similarity:", results["similarity_score"]) 209 | ``` 210 | 211 | **Deliverables:** 212 | - `framework_rag_comparison.py` - Main application 213 | - `requirements.txt` - Dependencies 214 | - `README_frameworks.md` - Documentation 215 | - Comparison report template 216 | - Example outputs 217 | 218 | --- 219 | 220 | ## Expected Output Section 221 | 222 | ### Task 1 Expected Output: 223 | ```python 224 | # LangChain RAG 225 | from langchain.chains import RetrievalQA 226 | 227 | qa_chain = RetrievalQA.from_chain_type(...) 228 | result = qa_chain({"query": "What is Python?"}) 229 | 230 | # Output: 231 | { 232 | "result": "Python is a programming language...", 233 | "source_documents": [...] 234 | } 235 | ``` 236 | 237 | ### Task 2 Expected Output: 238 | ```python 239 | # LlamaIndex RAG 240 | index = VectorStoreIndex.from_documents(documents) 241 | query_engine = index.as_query_engine() 242 | response = query_engine.query("What is Python?") 243 | 244 | # Output: 245 | ResponseObject with: 246 | - response: "Python is a programming language..." 247 | - source_nodes: [...] 248 | ``` 249 | 250 | ### Task 3 Expected Output: 251 | ``` 252 | === Framework Comparison === 253 | 254 | LangChain: 255 | - Setup time: 2.3s 256 | - Query time: 1.2s 257 | - Code lines: 45 258 | - Features: High flexibility 259 | 260 | LlamaIndex: 261 | - Setup time: 1.8s 262 | - Query time: 0.9s 263 | - Code lines: 28 264 | - Features: RAG-optimized 265 | 266 | Recommendation: Use LlamaIndex for RAG-focused apps 267 | ``` 268 | 269 | ### Mini Project Expected Output: 270 | 271 | The comparison app should provide: 272 | - Fair side-by-side comparisons 273 | - Detailed metrics 274 | - Clear recommendations 275 | - Professional interface 276 | 277 | **Example session:** 278 | ``` 279 | === Framework RAG Comparison === 280 | Choose: 6 281 | 282 | Query: "What is RAG?" 283 | 284 | LangChain Result: 285 | Answer: RAG stands for Retrieval-Augmented Generation... 286 | Time: 1.2s | Tokens: 150 287 | 288 | LlamaIndex Result: 289 | Answer: RAG (Retrieval-Augmented Generation) is... 
290 | Time: 0.9s | Tokens: 145 291 | 292 | Comparison: 293 | - Answer similarity: 0.87 294 | - Time difference: 0.3s (LlamaIndex faster) 295 | - Token difference: 5 tokens 296 | - Source overlap: 2/3 documents 297 | ``` 298 | 299 | --- 300 | 301 | ## Submission Checklist 302 | 303 | - [ ] Task 1: LangChain RAG working 304 | - [ ] Task 2: LlamaIndex RAG working 305 | - [ ] Task 3: Comparison complete 306 | - [ ] Task 4: Advanced features explored 307 | - [ ] Task 5: Hybrid approach implemented 308 | - [ ] Mini project: Full comparison app 309 | - [ ] Both frameworks tested 310 | - [ ] Differences documented 311 | - [ ] Code is well-documented 312 | 313 | **Remember:** Frameworks save time, but understanding the fundamentals (Day 7) is crucial! 314 | 315 | **Good luck!** 🚀 316 | 317 | -------------------------------------------------------------------------------- /Day-02: Generative AI & LLM Basics/assignment.md: -------------------------------------------------------------------------------- 1 | # Day 2 — Assignment 2 | 3 | ## Instructions 4 | 5 | Complete these tasks to get hands-on experience with LLMs and the OpenAI API. Make sure you have: 6 | - An OpenAI API key (get one at platform.openai.com) 7 | - Python `openai` library installed: `pip install openai` 8 | - Your API key stored securely (use environment variables) 9 | 10 | **Important:** 11 | - Never commit your API key to version control 12 | - Use environment variables or `.env` files 13 | - Handle errors gracefully 14 | - Test with different prompts and parameters 15 | 16 | --- 17 | 18 | ## Tasks 19 | 20 | ### Task 1: API Setup and First Call 21 | 22 | Create a Python script that: 23 | 1. Loads your OpenAI API key from an environment variable 24 | 2. Makes a simple API call asking "What is Python?" 25 | 3. Prints the response 26 | 4. Displays the number of tokens used 27 | 5. Handles errors if the API key is missing or invalid 28 | 29 | **Deliverable:** `task1_first_call.py` 30 | 31 | --- 32 | 33 | ### Task 2: Temperature Comparison Tool 34 | 35 | Create a function that: 36 | - Takes a prompt as input 37 | - Sends the same prompt with 3 different temperature values: 0.1, 0.7, 1.5 38 | - Collects all responses 39 | - Returns a comparison showing how temperature affects output 40 | 41 | Test with prompts like: 42 | - "Write a haiku about coding" 43 | - "Explain machine learning in one sentence" 44 | - "Describe a futuristic city" 45 | 46 | **Deliverable:** `task2_temperature_comparison.py` 47 | 48 | --- 49 | 50 | ### Task 3: Token Counter 51 | 52 | Create a utility that: 53 | 1. Estimates token count for input text (rough estimate: 1 token ≈ 4 characters) 54 | 2. Makes an API call 55 | 3. Compares your estimate with the actual token count from the API response 56 | 4. Calculates the accuracy of your estimation 57 | 58 | **Bonus:** Use the `tiktoken` library for more accurate token counting. 
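For reference, a minimal sketch of the bonus using `tiktoken` (installed with `pip install tiktoken`; the fallback encoding name is an assumption that fits gpt-3.5-turbo-era models):

```python
import tiktoken

def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    """Count tokens exactly, using the tokenizer that matches the model."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model name: fall back to a widely used encoding
        encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

text = "Tokens are not the same as words or characters."
print("tiktoken count:", count_tokens(text))
print("rough estimate:", len(text) / 4)  # the 1 token ≈ 4 characters heuristic
```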
59 | 60 | **Deliverable:** `task3_token_counter.py` 61 | 62 | --- 63 | 64 | ### Task 4: Simple Chatbot 65 | 66 | Build a simple chatbot that: 67 | - Maintains conversation history 68 | - Allows multiple turns of conversation 69 | - Remembers context from previous messages 70 | - Has a command to clear history (type "clear" or "reset") 71 | - Has a command to exit (type "quit" or "exit") 72 | 73 | **Features:** 74 | - Greet the user 75 | - Show conversation history 76 | - Handle empty inputs 77 | - Display token usage after each response 78 | 79 | **Deliverable:** `task4_chatbot.py` 80 | 81 | --- 82 | 83 | ### Task 5: Model Comparison Tool 84 | 85 | Create a script that: 86 | - Takes a prompt as input 87 | - Sends it to both `gpt-3.5-turbo` and `gpt-4` 88 | - Compares: 89 | - Response quality (subjective) 90 | - Response length 91 | - Token usage 92 | - Response time (if possible) 93 | - Displays a side-by-side comparison 94 | 95 | Test with prompts requiring: 96 | - Simple factual answers 97 | - Creative writing 98 | - Complex reasoning 99 | - Code generation 100 | 101 | **Deliverable:** `task5_model_comparison.py` 102 | 103 | --- 104 | 105 | ## One Mini Project 106 | 107 | ### 🤖 Build an LLM Playground Application 108 | 109 | Create a comprehensive Python application `llm_playground.py` that allows users to experiment with different LLM settings interactively. 110 | 111 | **Features:** 112 | 113 | 1. **Interactive Menu:** 114 | ``` 115 | === LLM Playground === 116 | 1. Single Prompt 117 | 2. Conversation Mode 118 | 3. Compare Models 119 | 4. Parameter Tuning 120 | 5. View History 121 | 6. Export Results 122 | 7. Exit 123 | ``` 124 | 125 | 2. **Single Prompt Mode:** 126 | - Enter a prompt 127 | - Adjust temperature, max_tokens, model 128 | - View response 129 | - Save to history 130 | 131 | 3. **Conversation Mode:** 132 | - Multi-turn conversation 133 | - View full conversation history 134 | - Clear conversation option 135 | 136 | 4. **Compare Models:** 137 | - Enter a prompt 138 | - Automatically test with gpt-3.5-turbo and gpt-4 139 | - Show side-by-side comparison 140 | - Show cost comparison (if possible) 141 | 142 | 5. **Parameter Tuning:** 143 | - Test same prompt with different: 144 | - Temperature values (0.0 to 2.0) 145 | - Max tokens (50 to 500) 146 | - Top P values 147 | - Display all results for comparison 148 | 149 | 6. **View History:** 150 | - Show all previous prompts and responses 151 | - Filter by model 152 | - Show token usage statistics 153 | 154 | 7. **Export Results:** 155 | - Export conversation history to JSON 156 | - Export to text file 157 | - Include metadata (tokens, model, timestamp) 158 | 159 | **Requirements:** 160 | - Use classes to organize code 161 | - Store conversation history in memory (or file) 162 | - Implement proper error handling 163 | - Add input validation 164 | - Make it user-friendly with clear prompts 165 | - Display token usage and estimated costs 166 | - Use color coding for different types of output (optional) 167 | 168 | **Example Interaction:** 169 | ``` 170 | === LLM Playground === 171 | Choose an option: 1 172 | 173 | Enter your prompt: Explain RAG in simple terms 174 | Model (gpt-3.5-turbo/gpt-4) [gpt-3.5-turbo]: 175 | Temperature (0.0-2.0) [0.7]: 0.5 176 | Max tokens [150]: 200 177 | 178 | [Processing...] 179 | 180 | Response: 181 | RAG stands for Retrieval-Augmented Generation... 182 | 183 | Tokens used: 45 184 | Estimated cost: $0.0001 185 | 186 | Save to history? 
(y/n): y 187 | ``` 188 | 189 | **Deliverables:** 190 | - `llm_playground.py` - Main application 191 | - `requirements.txt` - Dependencies 192 | - `README_playground.md` - Brief usage instructions 193 | - Sample output showing the application in action 194 | 195 | --- 196 | 197 | ## Expected Output Section 198 | 199 | ### Task 1 Expected Output: 200 | ``` 201 | API Key loaded successfully. 202 | Making API call... 203 | 204 | Response: Python is a high-level programming language... 205 | 206 | Tokens used: 25 207 | ``` 208 | 209 | ### Task 2 Expected Output: 210 | ``` 211 | === Temperature Comparison === 212 | Prompt: "Write a haiku about coding" 213 | 214 | Temperature 0.1: 215 | Code flows like water, 216 | Functions dance in harmony, 217 | Logic finds its way. 218 | 219 | Temperature 0.7: 220 | Bits and bytes align, 221 | Algorithms come alive, 222 | Code becomes art form. 223 | 224 | Temperature 1.5: 225 | Electric dreams pulse, 226 | Syntax sings in midnight glow, 227 | Digital poetry blooms. 228 | 229 | [Notice how creativity increases with temperature] 230 | ``` 231 | 232 | ### Task 4 Expected Output: 233 | ``` 234 | === Simple Chatbot === 235 | Hello! I'm your AI assistant. Type 'quit' to exit, 'clear' to reset. 236 | 237 | You: Hello, my name is Bob 238 | Assistant: Hello Bob! Nice to meet you. How can I help you today? 239 | 240 | You: What's my name? 241 | Assistant: Your name is Bob! 242 | 243 | You: clear 244 | [Conversation cleared] 245 | 246 | You: What's my name? 247 | Assistant: I don't have that information. Could you tell me your name? 248 | 249 | You: quit 250 | Goodbye! 251 | ``` 252 | 253 | ### Mini Project Expected Output: 254 | 255 | The playground should provide a smooth, interactive experience: 256 | - Clear menu navigation 257 | - Real-time responses 258 | - Formatted output with proper spacing 259 | - Error messages for invalid inputs 260 | - History tracking and export functionality 261 | - Professional-looking interface 262 | 263 | **Example session:** 264 | ``` 265 | === LLM Playground === 266 | 1. Single Prompt 267 | 2. Conversation Mode 268 | ... 269 | Choose: 1 270 | 271 | Enter prompt: What is machine learning? 272 | Model [gpt-3.5-turbo]: 273 | Temperature [0.7]: 274 | Max tokens [150]: 275 | 276 | Response: 277 | Machine learning is a subset of artificial intelligence... 278 | 279 | Tokens: 42 | Cost: $0.00008 280 | 281 | [1] Try again 282 | [2] Save to history 283 | [3] Main menu 284 | ``` 285 | 286 | --- 287 | 288 | ## Submission Checklist 289 | 290 | - [ ] Task 1: API setup working 291 | - [ ] Task 2: Temperature comparison functional 292 | - [ ] Task 3: Token counter implemented 293 | - [ ] Task 4: Chatbot maintains conversation 294 | - [ ] Task 5: Model comparison working 295 | - [ ] Mini project: Full playground application 296 | - [ ] All code includes error handling 297 | - [ ] API keys stored securely (not in code) 298 | - [ ] Code is well-commented 299 | 300 | **Remember:** Keep your API key secret! Never share it or commit it to version control. 301 | 302 | **Good luck!** 🚀 303 | 304 | -------------------------------------------------------------------------------- /Day-07: Implement RAG From Scratch (Pure Python)/assignment.md: -------------------------------------------------------------------------------- 1 | # Day 7 — Assignment 2 | 3 | ## Instructions 4 | 5 | Build a complete RAG system from scratch using pure Python. No frameworks allowed! This will help you understand every component deeply. 
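Before installing anything, it is worth seeing how small the core retrieval math is. A minimal NumPy sketch of cosine similarity and top-K selection (function names here are illustrative, not required by the tasks):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); 1.0 means identical direction
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_indices(query_embedding, chunk_embeddings, k=3):
    # Score every stored chunk against the query, best matches first
    scores = [cosine_similarity(query_embedding, c) for c in chunk_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

Your `VectorStore` in Task 1 is essentially bookkeeping around these two functions.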
Install only basic dependencies: 6 | 7 | ```bash 8 | pip install openai numpy pypdf 9 | ``` 10 | 11 | **Important:** 12 | - Build each component separately first 13 | - Test thoroughly 14 | - Add error handling 15 | - Document your code 16 | - Make it production-ready 17 | 18 | --- 19 | 20 | ## Tasks 21 | 22 | ### Task 1: Core Component Implementation 23 | 24 | Implement each core component as a separate class: 25 | 26 | 1. **DocumentLoader** (`document_loader.py`) 27 | - Load .txt files 28 | - Load .pdf files 29 | - Handle errors 30 | - Return clean text 31 | 32 | 2. **TextChunker** (`text_chunker.py`) 33 | - Fixed-size chunking 34 | - Configurable overlap 35 | - Add metadata 36 | - Return structured chunks 37 | 38 | 3. **EmbeddingGenerator** (`embedding_generator.py`) 39 | - Single text embedding 40 | - Batch embedding 41 | - Error handling 42 | - API key management 43 | 44 | 4. **VectorStore** (`vector_store.py`) 45 | - Store embeddings 46 | - Store chunks 47 | - Similarity search 48 | - Top K retrieval 49 | 50 | **Test each component independently before integrating.** 51 | 52 | **Deliverables:** 53 | - `document_loader.py` 54 | - `text_chunker.py` 55 | - `embedding_generator.py` 56 | - `vector_store.py` 57 | 58 | --- 59 | 60 | ### Task 2: RAG System Integration 61 | 62 | Create `rag_system.py` that integrates all components: 63 | 64 | **Requirements:** 65 | - `RAGSystem` class 66 | - `index_document(filepath)` method 67 | - `query(question, k=3)` method 68 | - Complete pipeline: Load → Chunk → Embed → Store → Retrieve → Augment → Generate 69 | - Return structured results 70 | 71 | **Test with:** Multiple documents and various questions 72 | 73 | **Deliverable:** `task2_rag_system.py` 74 | 75 | --- 76 | 77 | ### Task 3: Error Handling & Robustness 78 | 79 | Enhance your RAG system with comprehensive error handling: 80 | 81 | **Handle:** 82 | - Missing API keys 83 | - API rate limits 84 | - Network errors 85 | - File not found 86 | - Empty documents 87 | - No search results 88 | - Invalid inputs 89 | 90 | **Requirements:** 91 | - Graceful error messages 92 | - Fallback behaviors 93 | - Logging (optional) 94 | - User-friendly error reporting 95 | 96 | **Deliverable:** `task3_robust_rag.py` 97 | 98 | --- 99 | 100 | ### Task 4: Configuration System 101 | 102 | Create a configuration system `config_rag.py`: 103 | 104 | **Configurable options:** 105 | - Chunk size 106 | - Overlap size 107 | - K value (retrieval) 108 | - Similarity threshold 109 | - Embedding model 110 | - LLM model 111 | - Temperature 112 | - Max tokens 113 | 114 | **Requirements:** 115 | - Load from JSON file 116 | - Default values 117 | - Validation 118 | - Easy to modify 119 | 120 | **Deliverable:** `task4_configurable_rag.py` + `config.json` 121 | 122 | --- 123 | 124 | ### Task 5: Performance Optimization 125 | 126 | Optimize your RAG system: 127 | 128 | **Optimizations:** 129 | 1. Batch embedding generation 130 | 2. Cache embeddings (save to file) 131 | 3. Efficient similarity search (use NumPy) 132 | 4. Progress indicators 133 | 5. Memory management 134 | 135 | **Requirements:** 136 | - Measure performance (time) 137 | - Compare before/after 138 | - Handle large documents 139 | - Show progress for long operations 140 | 141 | **Deliverable:** `task5_optimized_rag.py` 142 | 143 | --- 144 | 145 | ## One Mini Project 146 | 147 | ### 🏗️ Build a Production-Ready RAG System 148 | 149 | Create a complete, production-ready RAG system `production_rag.py` with all features. 150 | 151 | **Features:** 152 | 153 | 1. 
**Complete Component System:** 154 | - DocumentLoader (multiple formats) 155 | - TextChunker (multiple strategies) 156 | - EmbeddingGenerator (with caching) 157 | - VectorStore (persistent storage) 158 | - RAGPipeline (orchestration) 159 | 160 | 2. **Document Management:** 161 | - Index single documents 162 | - Index directories 163 | - Remove documents 164 | - List indexed documents 165 | - Document statistics 166 | 167 | 3. **Query System:** 168 | - Single queries 169 | - Batch queries 170 | - Query history 171 | - Result caching 172 | 173 | 4. **Configuration:** 174 | - JSON config file 175 | - Runtime configuration 176 | - Environment variables 177 | - Default values 178 | 179 | 5. **Error Handling:** 180 | - Comprehensive try-catch 181 | - User-friendly messages 182 | - Logging system 183 | - Recovery mechanisms 184 | 185 | 6. **Performance Features:** 186 | - Embedding caching 187 | - Batch processing 188 | - Progress tracking 189 | - Performance metrics 190 | 191 | 7. **CLI Interface:** 192 | ``` 193 | === Production RAG System === 194 | 1. Index document 195 | 2. Index directory 196 | 3. Query 197 | 4. View indexed documents 198 | 5. Remove document 199 | 6. Configuration 200 | 7. Statistics 201 | 8. Exit 202 | ``` 203 | 204 | 8. **Advanced Features:** 205 | - Multiple collections 206 | - Export/import data 207 | - Search with filters 208 | - Answer quality metrics 209 | - System health check 210 | 211 | **Requirements:** 212 | - Clean, modular code 213 | - Comprehensive documentation 214 | - Error handling throughout 215 | - Configuration system 216 | - Performance optimizations 217 | - User-friendly interface 218 | - Production-ready quality 219 | 220 | **Example Usage:** 221 | ```python 222 | from production_rag import ProductionRAG 223 | 224 | # Initialize 225 | rag = ProductionRAG(config_file="config.json") 226 | 227 | # Index documents 228 | rag.index_document("doc1.pdf") 229 | rag.index_directory("./documents/") 230 | 231 | # Query 232 | result = rag.query("What is machine learning?") 233 | print(result["answer"]) 234 | print(f"Sources: {len(result['sources'])}") 235 | 236 | # Statistics 237 | stats = rag.get_statistics() 238 | print(f"Total chunks: {stats['total_chunks']}") 239 | ``` 240 | 241 | **Deliverables:** 242 | - `production_rag.py` - Main system 243 | - `config.json` - Configuration template 244 | - `requirements.txt` - Dependencies 245 | - `README_production.md` - Documentation 246 | - Unit tests (optional but recommended) 247 | - Example usage script 248 | 249 | --- 250 | 251 | ## Expected Output Section 252 | 253 | ### Task 2 Expected Output: 254 | ```python 255 | rag = RAGSystem() 256 | rag.index_document("document.pdf") 257 | # Output: "Indexed document.pdf: 15 chunks" 258 | 259 | result = rag.query("What is the main topic?") 260 | # Output: 261 | { 262 | "answer": "The main topic is...", 263 | "sources": [ 264 | {"text": "...", "source": "document.pdf", "chunk_id": 1}, 265 | ... 
266 | ], 267 | "similarities": [0.89, 0.85, 0.82] 268 | } 269 | ``` 270 | 271 | ### Task 4 Expected Output: 272 | ```json 273 | // config.json 274 | { 275 | "chunk_size": 500, 276 | "overlap": 50, 277 | "k": 3, 278 | "similarity_threshold": 0.7, 279 | "embedding_model": "text-embedding-ada-002", 280 | "llm_model": "gpt-3.5-turbo", 281 | "temperature": 0.3, 282 | "max_tokens": 300 283 | } 284 | ``` 285 | 286 | ### Mini Project Expected Output: 287 | 288 | The production system should be: 289 | - Robust and error-resistant 290 | - Well-documented 291 | - Configurable 292 | - Performant 293 | - User-friendly 294 | 295 | **Example session:** 296 | ``` 297 | === Production RAG System === 298 | Choose: 1 299 | 300 | Enter document path: document.pdf 301 | [Indexing...] 302 | ✓ Loaded document 303 | ✓ Created 15 chunks 304 | ✓ Generated embeddings 305 | ✓ Stored in vector database 306 | Indexed successfully! 307 | 308 | Choose: 3 309 | 310 | Question: What is RAG? 311 | [Processing...] 312 | 313 | Answer: 314 | RAG stands for Retrieval-Augmented Generation... 315 | 316 | Sources (3): 317 | 1. [0.91] document.pdf, chunk 5 318 | 2. [0.87] document.pdf, chunk 8 319 | 3. [0.84] document.pdf, chunk 12 320 | ``` 321 | 322 | --- 323 | 324 | ## Submission Checklist 325 | 326 | - [ ] Task 1: All core components implemented 327 | - [ ] Task 2: RAG system integrated 328 | - [ ] Task 3: Error handling added 329 | - [ ] Task 4: Configuration system working 330 | - [ ] Task 5: Optimizations implemented 331 | - [ ] Mini project: Production-ready system 332 | - [ ] All code is well-documented 333 | - [ ] Error handling comprehensive 334 | - [ ] Tested with real documents 335 | - [ ] Code follows best practices 336 | 337 | **Remember:** Building from scratch teaches you everything! 338 | 339 | **Good luck!** 🚀 340 | 341 | -------------------------------------------------------------------------------- /Day-10: Build & Deploy a RAG Application (FastAPI-Streamlit)/assignment.md: -------------------------------------------------------------------------------- 1 | # Day 10 — Assignment 2 | 3 | ## Instructions 4 | 5 | Build and deploy a complete RAG application with FastAPI backend and Streamlit frontend. This is your final project! Install required libraries: 6 | 7 | ```bash 8 | pip install fastapi uvicorn streamlit requests python-multipart 9 | ``` 10 | 11 | **Important:** 12 | - Build backend first 13 | - Then build frontend 14 | - Test locally 15 | - Deploy to cloud 16 | - Document everything 17 | 18 | --- 19 | 20 | ## Tasks 21 | 22 | ### Task 1: FastAPI Backend 23 | 24 | Create a complete FastAPI backend `rag_api.py`: 25 | 26 | **Endpoints:** 27 | 1. `GET /` - Root endpoint 28 | 2. `GET /health` - Health check 29 | 3. `POST /index` - Index document (file upload) 30 | 4. `POST /query` - Query RAG system 31 | 5. `GET /stats` - Get statistics 32 | 6. `DELETE /documents/{doc_id}` - Remove document 33 | 34 | **Requirements:** 35 | - Use Pydantic models for requests/responses 36 | - Handle file uploads 37 | - Integrate your RAG system 38 | - Add error handling 39 | - Include API documentation 40 | 41 | **Test with:** Postman or curl 42 | 43 | **Deliverable:** `task1_fastapi_backend.py` 44 | 45 | --- 46 | 47 | ### Task 2: Streamlit Frontend 48 | 49 | Create a Streamlit frontend `rag_ui.py`: 50 | 51 | **Features:** 52 | 1. Document upload section 53 | 2. Query interface 54 | 3. Answer display 55 | 4. Source display 56 | 5. Statistics view 57 | 6. 
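Chat history (optional)

A minimal sketch of how such a frontend can talk to the backend (the `/index` and `/query` endpoints follow Task 1; the `answer` and `sources` response fields are assumptions about your backend's schema):

```python
import requests
import streamlit as st

API_URL = "http://localhost:8000"  # the FastAPI backend from Task 1

st.title("RAG Application")

# Document upload section
uploaded = st.file_uploader("Upload a document", type=["pdf", "txt"])
if uploaded is not None and st.button("Index document"):
    files = {"file": (uploaded.name, uploaded.getvalue())}
    response = requests.post(f"{API_URL}/index", files=files)
    if response.ok:
        st.success("Document indexed")
    else:
        st.error(response.text)

# Query interface
question = st.text_input("Ask a question")
if st.button("Ask") and question:
    with st.spinner("Retrieving..."):
        response = requests.post(f"{API_URL}/query", json={"question": question, "k": 3})
    if response.ok:
        data = response.json()
        st.write(data["answer"])  # answer display
        for source in data.get("sources", []):
            st.caption(str(source))  # source display
    else:
        st.error(response.text)
```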
58 | 
59 | **Requirements:**
60 | - Connect to FastAPI backend
61 | - Handle errors gracefully
62 | - Add loading states
63 | - Make it user-friendly
64 | - Add styling
65 | 
66 | **Deliverable:** `task2_streamlit_frontend.py`
67 | 
68 | ---
69 | 
70 | ### Task 3: Complete Integration
71 | 
72 | Integrate backend and frontend `complete_app.py`:
73 | 
74 | **Features:**
75 | 1. Working API
76 | 2. Working UI
77 | 3. Full integration
78 | 4. Error handling
79 | 5. User feedback
80 | 
81 | **Requirements:**
82 | - Test all features
83 | - Handle edge cases
84 | - Add validation
85 | - Improve UX
86 | 
87 | **Deliverable:** `task3_complete_app/` (folder with both files)
88 | 
89 | ---
90 | 
91 | ### Task 4: Deployment Preparation
92 | 
93 | Prepare for deployment:
94 | 
95 | **Tasks:**
96 | 1. Create `requirements.txt`
97 | 2. Create `Dockerfile`
98 | 3. Create `docker-compose.yml`
99 | 4. Add environment variable handling
100 | 5. Create deployment documentation
101 | 
102 | **Requirements:**
103 | - All dependencies listed
104 | - Docker setup working
105 | - Environment variables documented
106 | - Deployment guide written
107 | 
108 | **Deliverable:** Deployment files + documentation
109 | 
110 | ---
111 | 
112 | ### Task 5: Cloud Deployment
113 | 
114 | Deploy to a cloud platform:
115 | 
116 | **Options:**
117 | - Heroku
118 | - Railway
119 | - Render
120 | - AWS/GCP/Azure (advanced)
121 | 
122 | **Requirements:**
123 | - Deploy backend API
124 | - Deploy frontend (or serve from backend)
125 | - Test deployed version
126 | - Document deployment process
127 | 
128 | **Deliverable:** Deployed application + deployment guide
129 | 
130 | ---
131 | 
132 | ## One Mini Project
133 | 
134 | ### 🚀 Build and Deploy a Complete RAG Application
135 | 
136 | Create a production-ready RAG application with full-stack implementation.
137 | 
138 | **Features:**
139 | 
140 | 1. **FastAPI Backend:**
141 |    - Complete REST API
142 |    - Document management
143 |    - Query endpoints
144 |    - Authentication (optional)
145 |    - Rate limiting (optional)
146 |    - CORS configuration
147 |    - API documentation
148 | 
149 | 2. **Streamlit Frontend:**
150 |    - Modern, clean UI
151 |    - Document upload
152 |    - Interactive query interface
153 |    - Answer display with formatting
154 |    - Source citations
155 |    - Chat history
156 |    - Settings panel
157 |    - Statistics dashboard
158 | 
159 | 3. **Complete Features:**
160 |    - Multiple document support
161 |    - Document management (add/remove)
162 |    - Query history
163 |    - Export results
164 |    - Configuration options
165 |    - Error handling
166 |    - Loading states
167 |    - User feedback
168 | 
169 | 4. **Deployment:**
170 |    - Docker containerization
171 |    - Environment configuration
172 |    - Cloud deployment
173 |    - Health checks
174 |    - Monitoring (optional)
175 | 
176 | 5. 
**Documentation:** 177 | - API documentation (auto-generated) 178 | - User guide 179 | - Deployment instructions 180 | - README with setup 181 | - Architecture diagram 182 | 183 | **Project Structure:** 184 | ``` 185 | rag_application/ 186 | ├── backend/ 187 | │ ├── main.py (FastAPI) 188 | │ ├── rag_system.py 189 | │ └── models.py 190 | ├── frontend/ 191 | │ └── app.py (Streamlit) 192 | ├── requirements.txt 193 | ├── Dockerfile 194 | ├── docker-compose.yml 195 | ├── .env.example 196 | ├── README.md 197 | └── DEPLOYMENT.md 198 | ``` 199 | 200 | **Requirements:** 201 | - Production-ready code 202 | - Comprehensive error handling 203 | - User-friendly interface 204 | - Well-documented 205 | - Deployed and accessible 206 | - Tested thoroughly 207 | 208 | **Example Usage:** 209 | ```bash 210 | # Backend 211 | cd backend 212 | uvicorn main:app --reload 213 | 214 | # Frontend 215 | cd frontend 216 | streamlit run app.py 217 | 218 | # Or with Docker 219 | docker-compose up 220 | ``` 221 | 222 | **Deliverables:** 223 | - Complete application code 224 | - Docker setup 225 | - Deployment configuration 226 | - Comprehensive documentation 227 | - Deployed application (URL) 228 | - Demo video/screenshots (optional) 229 | 230 | --- 231 | 232 | ## Expected Output Section 233 | 234 | ### Task 1 Expected Output: 235 | ```bash 236 | # Start API 237 | uvicorn rag_api:app --reload 238 | 239 | # Test endpoint 240 | curl http://localhost:8000/health 241 | # {"status":"healthy"} 242 | 243 | # Index document 244 | curl -X POST http://localhost:8000/index \ 245 | -F "file=@document.pdf" 246 | 247 | # Query 248 | curl -X POST http://localhost:8000/query \ 249 | -H "Content-Type: application/json" \ 250 | -d '{"question": "What is Python?", "k": 3}' 251 | ``` 252 | 253 | ### Task 2 Expected Output: 254 | ``` 255 | Streamlit app running at http://localhost:8501 256 | 257 | Features visible: 258 | - File upload widget 259 | - Query input 260 | - Answer display area 261 | - Sources section 262 | - Statistics panel 263 | ``` 264 | 265 | ### Task 3 Expected Output: 266 | ``` 267 | Complete integrated application: 268 | - Backend API running on :8000 269 | - Frontend UI running on :8501 270 | - Full functionality working 271 | - Error handling in place 272 | - User-friendly interface 273 | ``` 274 | 275 | ### Task 5 Expected Output: 276 | ``` 277 | Deployed application: 278 | - Backend: https://your-app.herokuapp.com 279 | - Frontend: https://your-app.herokuapp.com (or separate) 280 | - API docs: https://your-app.herokuapp.com/docs 281 | - Health: https://your-app.herokuapp.com/health 282 | ``` 283 | 284 | ### Mini Project Expected Output: 285 | 286 | The complete application should be: 287 | - Fully functional 288 | - Well-designed UI 289 | - Production-ready 290 | - Deployed and accessible 291 | - Well-documented 292 | 293 | **Example screens:** 294 | ``` 295 | ┌─────────────────────────────────┐ 296 | │ 🤖 RAG Application │ 297 | ├─────────────────────────────────┤ 298 | │ 📄 Upload Documents │ 299 | │ [Choose File] [Index] │ 300 | ├─────────────────────────────────┤ 301 | │ 💬 Ask a Question │ 302 | │ [Input box] │ 303 | │ [Ask Button] │ 304 | ├─────────────────────────────────┤ 305 | │ 📝 Answer │ 306 | │ [Generated answer displayed] │ 307 | ├─────────────────────────────────┤ 308 | │ 📚 Sources │ 309 | │ [Source 1] [Source 2] [Source 3]│ 310 | └─────────────────────────────────┘ 311 | ``` 312 | 313 | --- 314 | 315 | ## Submission Checklist 316 | 317 | - [ ] Task 1: FastAPI backend complete 318 | - [ ] Task 2: Streamlit 
frontend complete 319 | - [ ] Task 3: Integration working 320 | - [ ] Task 4: Deployment files ready 321 | - [ ] Task 5: Deployed to cloud 322 | - [ ] Mini project: Complete application 323 | - [ ] All endpoints tested 324 | - [ ] UI is user-friendly 325 | - [ ] Documentation complete 326 | - [ ] Application deployed and accessible 327 | 328 | **Final Checklist:** 329 | - [ ] Code is production-ready 330 | - [ ] Error handling comprehensive 331 | - [ ] Documentation is complete 332 | - [ ] Application is deployed 333 | - [ ] README is informative 334 | - [ ] You're proud of your work! 🎉 335 | 336 | **Congratulations on completing the 10-day RAG roadmap!** 🎊🚀 337 | 338 | **Good luck with your deployment!** 🌟 339 | 340 | -------------------------------------------------------------------------------- /Day-05: Embeddings & Vector Databases/assignment.md: -------------------------------------------------------------------------------- 1 | # Day 5 — Assignment 2 | 3 | ## Instructions 4 | 5 | Complete these tasks to master embeddings and vector databases. You'll work with OpenAI embeddings and ChromaDB. Install required libraries: 6 | 7 | ```bash 8 | pip install openai chromadb numpy 9 | ``` 10 | 11 | **Important:** 12 | - Store your OpenAI API key securely 13 | - Test with various texts to understand embeddings 14 | - Experiment with different similarity thresholds 15 | - Document your findings 16 | 17 | --- 18 | 19 | ## Tasks 20 | 21 | ### Task 1: Embedding Generator Tool 22 | 23 | Create a tool `embedding_generator.py` that: 24 | 25 | 1. Takes text input (single or batch) 26 | 2. Generates embeddings using OpenAI API 27 | 3. Returns embeddings with metadata 28 | 4. Handles errors and rate limits 29 | 5. Saves embeddings to file (optional) 30 | 31 | **Features:** 32 | - Support single text or list of texts 33 | - Show embedding dimensions 34 | - Display first few values 35 | - Calculate and display statistics (min, max, mean) 36 | 37 | **Test with:** Various texts (short, long, different topics) 38 | 39 | **Deliverable:** `task1_embedding_generator.py` 40 | 41 | --- 42 | 43 | ### Task 2: Similarity Calculator 44 | 45 | Build a similarity calculator `similarity_calculator.py`: 46 | 47 | 1. Takes two texts as input 48 | 2. Generates embeddings for both 49 | 3. Calculates cosine similarity 50 | 4. Provides interpretation of the score 51 | 5. Visualizes similarity (text-based or simple plot) 52 | 53 | **Features:** 54 | - Calculate cosine similarity 55 | - Provide similarity interpretation (very similar, somewhat similar, different) 56 | - Compare multiple text pairs 57 | - Show embedding values (first few dimensions) 58 | 59 | **Test with:** 60 | - Very similar texts ("dog" vs "puppy") 61 | - Somewhat similar ("dog" vs "animal") 62 | - Different texts ("dog" vs "computer") 63 | 64 | **Deliverable:** `task2_similarity_calculator.py` 65 | 66 | --- 67 | 68 | ### Task 3: ChromaDB Document Store 69 | 70 | Create a document storage system `chromadb_store.py`: 71 | 72 | 1. Initialize ChromaDB collection 73 | 2. Add documents with metadata 74 | 3. Query for similar documents 75 | 4. Retrieve documents by ID 76 | 5. 
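Get collection statistics

As a starting point, the core ChromaDB calls your class will wrap look roughly like this (collection name and storage path are illustrative):

```python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("documents")

# Add documents; ChromaDB embeds them with its default embedding
# function unless you supply precomputed embeddings yourself
collection.add(
    ids=["doc1", "doc2"],
    documents=["Python is a programming language.", "Dogs are loyal pets."],
    metadatas=[{"source": "tech"}, {"source": "animals"}],
)

# Query for similar documents, optionally filtered by metadata
results = collection.query(
    query_texts=["What programming language should I learn?"],
    n_results=2,
    where={"source": "tech"},
)
print(results["documents"], results["distances"])

# Simple statistics
print("Documents stored:", collection.count())
```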
77 | 
78 | **Requirements:**
79 | - Create a class `DocumentStore`
80 | - Methods: `add_documents()`, `search()`, `get_by_id()`, `get_stats()`
81 | - Support metadata filtering
82 | - Handle collection creation/loading
83 | 
84 | **Test with:** 20+ sample documents on various topics
85 | 
86 | **Deliverable:** `task3_chromadb_store.py`
87 | 
88 | ---
89 | 
90 | ### Task 4: Batch Embedding Processor
91 | 
92 | Build a batch processor `batch_processor.py`:
93 | 
94 | 1. Process multiple documents in batches
95 | 2. Generate embeddings efficiently
96 | 3. Store in ChromaDB
97 | 4. Show progress
98 | 5. Handle errors gracefully
99 | 
100 | **Features:**
101 | - Batch size configuration
102 | - Progress tracking
103 | - Error recovery (skip failed items, continue)
104 | - Summary report
105 | 
106 | **Test with:** 50+ documents
107 | 
108 | **Deliverable:** `task4_batch_processor.py`
109 | 
110 | ---
111 | 
112 | ### Task 5: Semantic Search Engine
113 | 
114 | Create a semantic search engine `semantic_search.py`:
115 | 
116 | 1. Index a collection of documents
117 | 2. Accept search queries
118 | 3. Return top K most similar documents
119 | 4. Display results with similarity scores
120 | 5. Support metadata filtering
121 | 
122 | **Features:**
123 | - Search interface (CLI)
124 | - Display top results with scores
125 | - Show metadata for each result
126 | - Highlight matching content (optional)
127 | - Export search results
128 | 
129 | **Test with:** A collection of 30+ documents
130 | 
131 | **Deliverable:** `task5_semantic_search.py`
132 | 
133 | ---
134 | 
135 | ## One Mini Project
136 | 
137 | ### 🔍 Build a Semantic Search Tool
138 | 
139 | Create a complete application `semantic_search_tool.py` that implements a semantic search system using embeddings and vector databases.
140 | 
141 | **Features:**
142 | 
143 | 1. **Document Indexing:**
144 |    - Load documents from files (PDF, TXT, etc.)
145 |    - Extract and chunk text
146 |    - Generate embeddings
147 |    - Store in ChromaDB with metadata
148 | 
149 | 2. **Search Interface:**
150 |    ```
151 |    === Semantic Search Tool ===
152 |    1. Index documents
153 |    2. Search
154 |    3. View indexed documents
155 |    4. Delete documents
156 |    5. Collection statistics
157 |    6. Export results
158 |    7. Exit
159 |    ```
160 | 
161 | 3. **Search Capabilities:**
162 |    - Natural language queries
163 |    - Top K results (configurable)
164 |    - Similarity score display
165 |    - Metadata filtering
166 |    - Search history
167 | 
168 | 4. **Advanced Features:**
169 |    - Multiple collections support
170 |    - Hybrid search (keyword + semantic)
171 |    - Result ranking and re-ranking
172 |    - Search analytics
173 |    - Export search results
174 | 
175 | 5. **Statistics and Analytics:**
176 |    - Total documents indexed
177 |    - Average document length
178 |    - Search performance metrics
179 |    - Most common queries
180 |    - Collection health
181 | 
182 | **Requirements:**
183 | - Use classes for organization
184 | - Support multiple file formats
185 | - Implement progress tracking
186 | - Add error handling
187 | - Create a user-friendly CLI
188 | - Store collections persistently
189 | - Generate detailed reports
190 | 
191 | **Example Usage:**
192 | ```bash
193 | python semantic_search_tool.py
194 | 
195 | === Semantic Search Tool ===
196 | Choose option: 1
197 | 
198 | Enter directory path: ./documents
199 | Chunk size [500]: 400
200 | Processing documents...
201 | ✓ Indexed 15 documents 202 | ✓ Created 42 chunks 203 | ✓ Generated embeddings 204 | 205 | Choose option: 2 206 | 207 | Enter search query: What is machine learning? 208 | Found 5 results: 209 | 210 | 1. [Score: 0.89] Machine learning is a subset of AI... 211 | Source: ai_textbook.pdf, Page: 3 212 | 213 | 2. [Score: 0.85] ML algorithms learn from data... 214 | Source: ml_guide.pdf, Page: 1 215 | ... 216 | ``` 217 | 218 | **Deliverables:** 219 | - `semantic_search_tool.py` - Main application 220 | - `requirements.txt` - Dependencies 221 | - `README_search.md` - Usage guide 222 | - Sample indexed collection 223 | - Example search results 224 | 225 | --- 226 | 227 | ## Expected Output Section 228 | 229 | ### Task 1 Expected Output: 230 | ```python 231 | embedding = generate_embedding("Python is a programming language") 232 | # Output: 233 | { 234 | "dimension": 1536, 235 | "first_5_values": [0.012, -0.034, 0.089, ...], 236 | "statistics": { 237 | "min": -0.523, 238 | "max": 0.891, 239 | "mean": 0.001 240 | } 241 | } 242 | ``` 243 | 244 | ### Task 2 Expected Output: 245 | ``` 246 | === Similarity Calculator === 247 | Text 1: "dog" 248 | Text 2: "puppy" 249 | 250 | Similarity: 0.847 251 | Interpretation: Very similar (same concept, different word) 252 | 253 | Text 1: "dog" 254 | Text 2: "car" 255 | 256 | Similarity: 0.312 257 | Interpretation: Different (unrelated concepts) 258 | ``` 259 | 260 | ### Task 3 Expected Output: 261 | ```python 262 | store = DocumentStore("my_collection") 263 | store.add_documents( 264 | texts=["Doc 1", "Doc 2"], 265 | metadatas=[{"source": "book1"}, {"source": "book2"}] 266 | ) 267 | 268 | results = store.search("programming", n_results=2) 269 | # Returns top 2 similar documents with metadata 270 | ``` 271 | 272 | ### Mini Project Expected Output: 273 | 274 | The semantic search tool should provide: 275 | - Fast indexing of documents 276 | - Accurate search results 277 | - Clear similarity scores 278 | - Rich metadata display 279 | - Professional interface 280 | 281 | **Example session:** 282 | ``` 283 | === Semantic Search Tool === 284 | Choose: 2 285 | 286 | Query: How does neural network work? 287 | Searching... 288 | 289 | Results (Top 5): 290 | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 291 | 1. [0.92] Neural networks are computing systems... 292 | 📄 Source: ai_book.pdf | 📄 Page: 45 293 | 294 | 2. [0.88] A neural network consists of layers... 295 | 📄 Source: ml_guide.pdf | 📄 Page: 12 296 | ... 297 | ``` 298 | 299 | --- 300 | 301 | ## Submission Checklist 302 | 303 | - [ ] Task 1: Embedding generator working 304 | - [ ] Task 2: Similarity calculator functional 305 | - [ ] Task 3: ChromaDB store implemented 306 | - [ ] Task 4: Batch processor complete 307 | - [ ] Task 5: Semantic search engine working 308 | - [ ] Mini project: Complete search tool 309 | - [ ] All code handles errors 310 | - [ ] Code is well-documented 311 | - [ ] Tested with real documents 312 | 313 | **Remember:** Embeddings and vector databases are the foundation of RAG retrieval! 314 | 315 | **Good luck!** 🚀 316 | 317 | -------------------------------------------------------------------------------- /Day-09: Advanced RAG (Reranking, Query Rewriting, Fusion)/assignment.md: -------------------------------------------------------------------------------- 1 | # Day 9 — Assignment 2 | 3 | ## Instructions 4 | 5 | Implement advanced RAG techniques to improve your system. These techniques make RAG production-ready. 
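One technique you will implement, Reciprocal Rank Fusion (Task 3), is compact enough to sketch up front. A minimal version, assuming each retriever returns a list of document IDs ranked best-first (k=60 is the conventional constant from the original RRF paper):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several rankings: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; duplicates across lists merge automatically
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["d3", "d1", "d2"]  # e.g. embedding-based results
keyword = ["d1", "d4", "d3"]   # e.g. BM25 results
print(reciprocal_rank_fusion([semantic, keyword]))  # ['d1', 'd3', 'd4', 'd2']
```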
Install required libraries:

```bash
pip install sentence-transformers rank-bm25 openai numpy
```

**Important:**
- Implement each technique separately first
- Compare before/after results
- Measure improvements
- Document your findings

---

## Tasks

### Task 1: Query Rewriting System

Build a query rewriting system in `query_rewriter.py`:

**Features:**
1. Generate query variations using an LLM
2. Extract key terms
3. Expand with synonyms
4. Create multiple query formulations

**Requirements:**
- Generate 3-5 variations per query
- Test whether variations improve retrieval
- Compare retrieval results
- Document improvements

**Test with:** Various question types

**Deliverable:** `task1_query_rewriter.py`

---

### Task 2: Reranking Implementation

Add reranking to your RAG system in `reranking_rag.py`:

**Features:**
1. Use a cross-encoder model for reranking
2. Rerank initial retrieval results
3. Compare before/after rankings
4. Measure improvement

**Requirements:**
- Install sentence-transformers
- Use a reranking model
- Rerank top 10, return top 5
- Show score improvements

**Test with:** Various queries and measure improvement

**Deliverable:** `task2_reranking_rag.py`

---

### Task 3: Fusion Techniques

Implement fusion in `fusion_rag.py`:

**Features:**
1. Reciprocal Rank Fusion (RRF)
2. Weighted fusion
3. Combine multiple retrieval results
4. Deduplication

**Requirements:**
- Implement the RRF algorithm
- Test with 2-3 different retrieval strategies
- Compare fused vs. single retrieval
- Measure improvement

**Deliverable:** `task3_fusion_rag.py`

---

### Task 4: Hybrid Search

Build hybrid search in `hybrid_search.py`:

**Features:**
1. Semantic search (embeddings)
2. Keyword search (BM25)
3. Combine both with weights
4. Tune the alpha parameter

**Requirements:**
- Implement BM25 keyword search
- Combine with semantic search
- Test different alpha values (0.0 to 1.0)
- Find the optimal balance (one way to fuse the two rankings is sketched after this task)

**Deliverable:** `task4_hybrid_search.py`
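
The core math behind Tasks 3 and 4 is small enough to sketch up front. Below is a minimal, illustrative implementation of RRF fusing a BM25 ranking with a vector ranking, plus the alpha-weighted blend from Task 4; the two sample rankings are hypothetical stand-ins for whatever your retrievers actually return.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first ranked lists of document IDs.

    Each document scores sum(1 / (k + rank)) over the lists that
    contain it; k=60 is the constant from the original RRF paper.
    Duplicates across lists collapse into a single entry automatically.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_score(bm25_score, vector_score, alpha=0.5):
    """Task 4 style weighted blend; both scores normalized to [0, 1]."""
    return alpha * vector_score + (1 - alpha) * bm25_score

# Hypothetical output of two retrievers over the same corpus:
bm25_rank = ["doc3", "doc1", "doc7", "doc2"]    # keyword (BM25) order
vector_rank = ["doc1", "doc3", "doc5", "doc7"]  # semantic order

print(reciprocal_rank_fusion([bm25_rank, vector_rank]))
# doc3 and doc1 surface first: both lists agree on them
```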
---

### Task 5: Complete Advanced RAG

Combine all techniques in `advanced_rag.py`:

**Pipeline:**
1. Query rewriting
2. Multiple retrievals (with variations)
3. Fusion (combine results)
4. Reranking (improve order)
5. Generation

**Requirements:**
- Integrate all techniques
- Make it configurable
- Compare with basic RAG
- Measure overall improvement

**Deliverable:** `task5_advanced_rag.py`

---

## One Mini Project

### 🚀 Build a Production-Ready Advanced RAG System

Create a complete advanced RAG application `production_advanced_rag.py` with all optimization techniques.

**Features:**

1. **Complete Advanced Pipeline:**
   - Query rewriting
   - Multiple retrieval strategies
   - Fusion
   - Reranking
   - Generation

2. **Configuration System:**
   - Enable/disable each technique
   - Tune parameters
   - A/B testing mode
   - Performance vs. quality trade-offs

3. **Multiple Retrieval Strategies:**
   - Semantic search
   - Keyword search
   - Hybrid search
   - Metadata filtering
   - Custom retrievers

4. **Advanced Features:**
   - Query expansion
   - Query decomposition
   - Multi-query fusion
   - Reranking with cross-encoders
   - Answer quality scoring

5. **Evaluation System:**
   - Compare techniques
   - Measure improvements
   - A/B testing
   - Performance metrics
   - Quality metrics

6. **Interactive Interface:**
   ```
   === Advanced RAG System ===
   1. Index documents
   2. Query (Basic RAG)
   3. Query (Advanced RAG)
   4. Compare techniques
   5. Configure settings
   6. Evaluation mode
   7. Statistics
   8. Exit
   ```

7. **Reporting:**
   - Technique comparison
   - Performance reports
   - Quality improvements
   - Recommendations

**Requirements:**
- All advanced techniques implemented
- Configurable and modular
- Comprehensive evaluation
- Production-ready quality
- Detailed documentation

**Example Usage:**
```python
rag = ProductionAdvancedRAG()

# Configure
rag.configure({
    "query_rewriting": True,
    "reranking": True,
    "fusion": True,
    "hybrid_search": True
})

# Query
result = rag.query("What is machine learning?")
print(result["answer"])
print(f"Improvement: {result['improvement_metrics']}")
```

**Deliverables:**
- `production_advanced_rag.py` - Main system
- `config_advanced.json` - Configuration
- `requirements.txt` - Dependencies
- `README_advanced.md` - Documentation
- Evaluation report template

---

## Expected Output Section

### Task 1 Expected Output:
```python
variations = rewrite_query("How does ML work?")
# Output:
[
    "How does machine learning work?",
    "What is the process of machine learning?",
    "How do ML algorithms learn?",
    "Explain machine learning mechanism",
    "How is machine learning implemented?"
]

# Test retrieval improvement
basic_results = retrieve("How does ML work?")
advanced_results = retrieve_multiple(variations)
# Advanced finds 40% more relevant documents
```

### Task 2 Expected Output:
```
Before Reranking:
1. Doc A (0.85)
2. Doc B (0.82)
3. Doc C (0.80)

After Reranking:
1. Doc C (0.92) ← Better match!
2. Doc A (0.88)
3. Doc B (0.85)

Improvement: Top result relevance increased by 8%
```

### Task 3 Expected Output:
```
Single Retrieval: Found 3 relevant docs
Fusion (3 strategies): Found 5 relevant docs
Improvement: 67% more relevant results
```

### Task 5 Expected Output:
```
=== Advanced RAG Query ===
Question: "What is Python?"

[Query Rewriting] Generated 4 variations
[Multiple Retrieval] Found 12 candidates
[Fusion] Combined to 8 unique results
[Reranking] Reordered top 5
[Generation] Generated answer

Answer: Python is a high-level programming language...
275 | 276 | Improvement Metrics: 277 | - Retrieval: +45% relevant docs 278 | - Answer quality: +23% improvement 279 | - Response time: +0.3s (acceptable) 280 | ``` 281 | 282 | ### Mini Project Expected Output: 283 | 284 | The advanced RAG system should demonstrate: 285 | - Significant quality improvements 286 | - Configurable techniques 287 | - Comprehensive evaluation 288 | - Production-ready features 289 | 290 | **Example session:** 291 | ``` 292 | === Advanced RAG System === 293 | Choose: 3 294 | 295 | Question: "Explain neural networks" 296 | 297 | [Advanced Pipeline Running...] 298 | ✓ Query rewritten: 4 variations 299 | ✓ Retrieved: 15 candidates 300 | ✓ Fused: 8 unique results 301 | ✓ Reranked: Top 5 selected 302 | ✓ Generated answer 303 | 304 | Answer: 305 | Neural networks are computing systems inspired by... 306 | 307 | Sources (Top 5, reranked): 308 | 1. [0.94] neural_networks.pdf | Page 3 309 | 2. [0.91] deep_learning.pdf | Page 1 310 | 3. [0.89] ai_basics.pdf | Page 7 311 | ... 312 | 313 | Comparison with Basic RAG: 314 | - Answer quality: +28% improvement 315 | - Source relevance: +35% improvement 316 | - Response time: +0.4s 317 | ``` 318 | 319 | --- 320 | 321 | ## Submission Checklist 322 | 323 | - [ ] Task 1: Query rewriting working 324 | - [ ] Task 2: Reranking implemented 325 | - [ ] Task 3: Fusion functional 326 | - [ ] Task 4: Hybrid search working 327 | - [ ] Task 5: Complete advanced pipeline 328 | - [ ] Mini project: Production system 329 | - [ ] All techniques tested 330 | - [ ] Improvements measured 331 | - [ ] Code well-documented 332 | 333 | **Remember:** Advanced techniques make the difference between a prototype and production system! 334 | 335 | **Good luck!** 🚀 336 | 337 | -------------------------------------------------------------------------------- /RAG Projects/readme.md: -------------------------------------------------------------------------------- 1 | # 🚀 10 Real-World RAG Projects 2 | 3 | **Practical Ideas from Beginner to Advanced** 4 | 5 | *Retrieval-Augmented Generation* 6 | 7 | --- 8 | 9 | ## 📋 Table of Contents 10 | 11 | - [Project 1: Legal Document Assistant](#project-1-legal-document-assistant) 12 | - [Project 2: Medical Research Summarizer](#project-2-medical-research-summarizer) 13 | - [Project 3: Customer Support Assistant](#project-3-customer-support-assistant) 14 | - [Project 4: Codebase Search & Explainer](#project-4-codebase-search--explainer) 15 | - [Project 5: Educational Q&A Tutor](#project-5-educational-qa-tutor) 16 | - [Project 6: Company Policy Assistant](#project-6-company-policy-assistant) 17 | - [Project 7: Financial Report Analyzer](#project-7-financial-report-analyzer) 18 | - [Project 8: Product Manual Assistant](#project-8-product-manual-assistant) 19 | - [Project 9: Academic Research Copilot](#project-9-academic-research-copilot) 20 | - [Project 10: News Contextualizer](#project-10-news-contextualizer) 21 | - [Recommended Tech Stack](#recommended-tech-stack) 22 | 23 | --- 24 | 25 | ## Project 1: Legal Document Assistant ⚖️ 26 | 27 | **Difficulty:** Intermediate 28 | 29 | **Description:** 30 | Help lawyers find and summarize relevant case laws from thousands of legal documents. Query with natural language and get cited precedents instantly. 31 | 32 | **Use Case:** 33 | Legal professionals can quickly search through extensive case law databases, legal documents, and precedents using natural language queries. 
The system retrieves relevant cases and provides summaries with proper citations, significantly reducing research time. 34 | 35 | **Tags:** 36 | - Legal Tech 37 | - Case Law 38 | - PDF Processing 39 | 40 | --- 41 | 42 | ## Project 2: Medical Research Summarizer 🏥 43 | 44 | **Difficulty:** Advanced 45 | 46 | **Description:** 47 | Summarize latest medical research for doctors and researchers. Query PubMed papers by symptoms or diseases and get readable clinical summaries. 48 | 49 | **Use Case:** 50 | Healthcare professionals and researchers can query medical literature from PubMed using symptoms, diseases, or research topics. The system retrieves relevant papers and provides readable clinical summaries, helping doctors stay updated with the latest research findings. 51 | 52 | **Tags:** 53 | - Healthcare 54 | - PubMed 55 | - Research 56 | 57 | --- 58 | 59 | ## Project 3: Customer Support Assistant 💬 60 | 61 | **Difficulty:** Beginner 62 | 63 | **Description:** 64 | Answer customer questions using internal knowledge base, FAQs, and support docs. Perfect for reducing support ticket volume. 65 | 66 | **Use Case:** 67 | Businesses can deploy an AI assistant that answers customer queries by retrieving information from internal knowledge bases, FAQ documents, and support documentation. This helps reduce support ticket volume and provides instant, accurate responses to common questions. 68 | 69 | **Tags:** 70 | - Support 71 | - Chatbot 72 | - Enterprise 73 | 74 | --- 75 | 76 | ## Project 4: Codebase Search & Explainer 💻 77 | 78 | **Difficulty:** Intermediate 79 | 80 | **Description:** 81 | Developer assistant that retrieves and explains code snippets from large codebases. Ask "How is auth implemented?" and get step-by-step answers. 82 | 83 | **Use Case:** 84 | Developers working with large codebases can query the system to find and understand how specific features are implemented. For example, asking "How is authentication implemented?" will retrieve relevant code snippets and provide step-by-step explanations, making onboarding and code navigation much easier. 85 | 86 | **Tags:** 87 | - DevTools 88 | - GitHub 89 | - Code Search 90 | 91 | --- 92 | 93 | ## Project 5: Educational Q&A Tutor 📚 94 | 95 | **Difficulty:** Beginner 96 | 97 | **Description:** 98 | AI tutor that retrieves textbook sections to answer student questions. Perfect for personalized learning and homework help. 99 | 100 | **Use Case:** 101 | Students can ask questions about course material, and the system retrieves relevant sections from textbooks and educational resources to provide comprehensive answers. This enables personalized learning experiences and helps with homework and exam preparation. 102 | 103 | **Tags:** 104 | - EdTech 105 | - Learning 106 | - Tutoring 107 | 108 | --- 109 | 110 | ## Project 6: Company Policy Assistant 🏢 111 | 112 | **Difficulty:** Beginner 113 | 114 | **Description:** 115 | Query internal HR policies like leave, reimbursement, and benefits. Get instant answers with document citations. 116 | 117 | **Use Case:** 118 | Employees can quickly find information about company policies, HR procedures, leave policies, reimbursement rules, and benefits by querying the system. The assistant retrieves relevant policy documents and provides answers with proper citations, reducing HR workload and improving employee experience. 
119 | 120 | **Tags:** 121 | - HR Tech 122 | - Internal Tools 123 | - Policies 124 | 125 | --- 126 | 127 | ## Project 7: Financial Report Analyzer 💰 128 | 129 | **Difficulty:** Intermediate 130 | 131 | **Description:** 132 | Analyze quarterly reports and generate insights. Query like "Summarize Tesla's Q3 2024 performance" and get business-friendly summaries. 133 | 134 | **Use Case:** 135 | Financial analysts, investors, and business professionals can query quarterly financial reports and get business-friendly summaries and insights. For example, asking "Summarize Tesla's Q3 2024 performance" will retrieve relevant sections from financial reports and provide comprehensive analysis. 136 | 137 | **Tags:** 138 | - FinTech 139 | - Analytics 140 | - Reports 141 | 142 | --- 143 | 144 | ## Project 8: Product Manual Assistant 📖 145 | 146 | **Difficulty:** Beginner 147 | 148 | **Description:** 149 | Help users troubleshoot products using manuals. Search through documentation and show step-by-step instructions. 150 | 151 | **Use Case:** 152 | Product users can troubleshoot issues by querying product manuals and documentation. The system retrieves relevant sections and provides step-by-step instructions, reducing support calls and improving user experience with self-service troubleshooting. 153 | 154 | **Tags:** 155 | - Support 156 | - Documentation 157 | - UX 158 | 159 | --- 160 | 161 | ## Project 9: Academic Research Copilot 🎓 162 | 163 | **Difficulty:** Advanced 164 | 165 | **Description:** 166 | Find, cite, and summarize scholarly papers. Create literature reviews and track recent trends in research topics. 167 | 168 | **Use Case:** 169 | Researchers and academics can use this tool to find relevant scholarly papers, generate citations, create literature reviews, and track recent trends in their research areas. The system searches through academic databases like arXiv and provides summaries and citations for papers. 170 | 171 | **Tags:** 172 | - Research 173 | - Academia 174 | - arXiv 175 | 176 | --- 177 | 178 | ## Project 10: News Contextualizer 📰 179 | 180 | **Difficulty:** Intermediate 181 | 182 | **Description:** 183 | Provide historical context for trending news. Query current events and get timeline summaries with fact-based background. 184 | 185 | **Use Case:** 186 | Journalists, researchers, and news consumers can query current events and receive historical context, timeline summaries, and fact-based background information. This helps understand the full picture of news stories by connecting them to past events and providing comprehensive context. 
187 | 188 | **Tags:** 189 | - News 190 | - Context 191 | - Archives 192 | 193 | --- 194 | 195 | ## Recommended Tech Stack 🛠️ 196 | 197 | All projects can be built using the following technologies: 198 | 199 | - **Python** - Primary programming language 200 | - **LangChain** - Framework for building LLM applications 201 | - **OpenAI** - LLM API provider 202 | - **FAISS** - Vector similarity search library 203 | - **Pinecone** - Managed vector database 204 | - **ChromaDB** - Open-source vector database 205 | 206 | ### Additional Tools & Libraries 207 | 208 | - **Streamlit** / **Gradio** - For building web interfaces 209 | - **PyPDF2** / **pdfplumber** - For PDF processing 210 | - **BeautifulSoup** / **Scrapy** - For web scraping 211 | - **Sentence Transformers** - For embeddings 212 | - **FastAPI** - For building APIs 213 | - **Docker** - For containerization 214 | 215 | --- 216 | 217 | ## 💡 Note 218 | 219 | All projects are scalable from MVP to Production. Start with a simple implementation and gradually add features like: 220 | - Advanced retrieval strategies (hybrid search, reranking) 221 | - Multi-modal support (images, tables) 222 | - User authentication and session management 223 | - Analytics and monitoring 224 | - Caching and optimization 225 | - Multi-language support 226 | 227 | --- 228 | 229 | ## Getting Started 230 | 231 | 1. Choose a project that matches your skill level 232 | 2. Set up your development environment with Python 3.8+ 233 | 3. Install required dependencies 234 | 4. Set up your vector database (FAISS, Pinecone, or ChromaDB) 235 | 5. Configure your LLM API keys 236 | 6. Start building! 237 | 238 | --- 239 | 240 | 241 | **Happy Building! 🚀** 242 | 243 | -------------------------------------------------------------------------------- /Day-06: RAG Fundamentals (Retrieval → Augmentation → Generation)/assignment.md: -------------------------------------------------------------------------------- 1 | # Day 6 — Assignment 2 | 3 | ## Instructions 4 | 5 | Complete these tasks to build your first RAG systems. You'll implement the complete RAG pipeline: Retrieval → Augmentation → Generation. Make sure you have all dependencies: 6 | 7 | ```bash 8 | pip install openai chromadb numpy 9 | ``` 10 | 11 | **Important:** 12 | - Test with real documents 13 | - Experiment with different K values 14 | - Try various prompt templates 15 | - Document what works best 16 | 17 | --- 18 | 19 | ## Tasks 20 | 21 | ### Task 1: Basic RAG System 22 | 23 | Build a complete RAG system `basic_rag.py`: 24 | 25 | **Components:** 26 | 1. Document storage (ChromaDB) 27 | 2. Retrieval function 28 | 3. Augmentation function 29 | 4. Generation function 30 | 5. Complete query pipeline 31 | 32 | **Requirements:** 33 | - Create a `BasicRAG` class 34 | - Methods: `add_documents()`, `query()` 35 | - Retrieve top 3 chunks 36 | - Simple prompt template 37 | - Return answer and sources 38 | 39 | **Test with:** 10+ documents on a specific topic 40 | 41 | **Deliverable:** `task1_basic_rag.py` 42 | 43 | --- 44 | 45 | ### Task 2: RAG with Source Citations 46 | 47 | Enhance your RAG to include citations `rag_with_citations.py`: 48 | 49 | **Features:** 50 | 1. Store source metadata with documents 51 | 2. Include source info in prompt 52 | 3. Generate answers with citations 53 | 4. Format: "According to [Source]..." 
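
One possible shape for the augmentation step is sketched below; the chunk dictionary layout (`text`, `source`, `page`) is an assumption, so adapt it to whatever your vector store actually returns:

```python
def build_cited_prompt(question, chunks):
    """Fold retrieved chunks into a prompt that asks for citations.

    Each chunk is assumed to look like:
    {"text": "...", "source": "doc1.pdf", "page": 3}
    """
    context = "\n\n".join(
        f"[Source: {c['source']}, page {c.get('page', '?')}]\n{c['text']}"
        for c in chunks
    )
    return (
        "Answer using ONLY the context below. When you state a fact, "
        "cite it as: According to [source]...\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```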
54 | 55 | **Requirements:** 56 | - Track document sources 57 | - Include in augmented prompt 58 | - LLM should cite sources in answer 59 | - Return structured results with citations 60 | 61 | **Test with:** Documents from different sources 62 | 63 | **Deliverable:** `task2_rag_citations.py` 64 | 65 | --- 66 | 67 | ### Task 3: Similarity Threshold Filtering 68 | 69 | Implement similarity-based filtering `filtered_rag.py`: 70 | 71 | **Features:** 72 | 1. Set similarity threshold 73 | 2. Filter retrieved chunks 74 | 3. Only use chunks above threshold 75 | 4. Handle case when no chunks pass threshold 76 | 77 | **Requirements:** 78 | - Configurable threshold (0.0 to 1.0) 79 | - Show similarity scores 80 | - Test with different thresholds 81 | - Compare results 82 | 83 | **Test with:** Various queries and thresholds 84 | 85 | **Deliverable:** `task3_filtered_rag.py` 86 | 87 | --- 88 | 89 | ### Task 4: Multi-Query RAG 90 | 91 | Implement query expansion `multi_query_rag.py`: 92 | 93 | **Features:** 94 | 1. Generate query variations 95 | 2. Search with each variation 96 | 3. Combine and deduplicate results 97 | 4. Use combined results for answer 98 | 99 | **Query expansion ideas:** 100 | - Paraphrase the question 101 | - Extract key terms 102 | - Generate related questions 103 | 104 | **Requirements:** 105 | - Create 2-3 query variations 106 | - Search with each 107 | - Merge results (remove duplicates) 108 | - Use merged chunks for answer 109 | 110 | **Deliverable:** `task4_multi_query_rag.py` 111 | 112 | --- 113 | 114 | ### Task 5: RAG Evaluation System 115 | 116 | Build an evaluation framework `rag_evaluator.py`: 117 | 118 | **Features:** 119 | 1. Test dataset (questions + expected answers) 120 | 2. Run RAG on test questions 121 | 3. Compare generated vs expected answers 122 | 4. Calculate metrics (accuracy, similarity) 123 | 124 | **Metrics to implement:** 125 | - Exact match 126 | - Semantic similarity (embedding-based) 127 | - Contains key terms 128 | - Answer length comparison 129 | 130 | **Requirements:** 131 | - Create test dataset (5-10 Q&A pairs) 132 | - Run evaluation 133 | - Calculate and display metrics 134 | - Identify failure cases 135 | 136 | **Deliverable:** `task5_rag_evaluator.py` 137 | 138 | --- 139 | 140 | ## One Mini Project 141 | 142 | ### 🚀 Build a Full RAG System From Scratch 143 | 144 | Create a complete RAG application `rag_system.py` that implements all the concepts learned. 145 | 146 | **Features:** 147 | 148 | 1. **Document Management:** 149 | - Load documents from files (PDF, TXT) 150 | - Extract and chunk text 151 | - Generate embeddings 152 | - Store in vector database 153 | - Manage multiple document collections 154 | 155 | 2. **RAG Pipeline:** 156 | - Complete retrieval system 157 | - Configurable K value 158 | - Similarity threshold filtering 159 | - Query expansion (optional) 160 | - Augmentation with citations 161 | - Generation with LLM 162 | 163 | 3. **Interactive Interface:** 164 | ``` 165 | === RAG System === 166 | 1. Add documents 167 | 2. Ask a question 168 | 3. View indexed documents 169 | 4. Configure settings 170 | 5. Evaluate system 171 | 6. Export results 172 | 7. Exit 173 | ``` 174 | 175 | 4. **Settings Configuration:** 176 | - K value (number of chunks) 177 | - Similarity threshold 178 | - LLM model selection 179 | - Temperature 180 | - Max tokens 181 | - Enable/disable query expansion 182 | 183 | 5. 
**Advanced Features:** 184 | - Multiple collections 185 | - Metadata filtering 186 | - Search history 187 | - Answer quality scoring 188 | - Source highlighting 189 | - Export conversations 190 | 191 | 6. **Evaluation Tools:** 192 | - Test with sample questions 193 | - Compare different configurations 194 | - Performance metrics 195 | - Quality assessment 196 | 197 | **Requirements:** 198 | - Use classes for organization 199 | - Support multiple file formats 200 | - Implement all RAG components 201 | - Add comprehensive error handling 202 | - Create user-friendly CLI 203 | - Store configurations 204 | - Generate detailed reports 205 | 206 | **Example Usage:** 207 | ```bash 208 | python rag_system.py 209 | 210 | === RAG System === 211 | Choose: 1 212 | 213 | Enter document path: ./documents 214 | Processing... 215 | ✓ Indexed 5 documents 216 | ✓ Created 23 chunks 217 | 218 | Choose: 2 219 | 220 | Question: What is machine learning? 221 | [Searching...] 222 | 223 | Answer: 224 | Machine learning is a subset of artificial intelligence that enables systems to learn from data... 225 | 226 | Sources: 227 | 1. [0.89] ai_textbook.pdf, Page 5 228 | 2. [0.85] ml_guide.pdf, Page 2 229 | 3. [0.82] intro_ai.pdf, Page 10 230 | 231 | [1] Ask another question 232 | [2] View full sources 233 | [3] Main menu 234 | ``` 235 | 236 | **Deliverables:** 237 | - `rag_system.py` - Main application 238 | - `config.json` - Configuration template 239 | - `requirements.txt` - Dependencies 240 | - `README_rag.md` - Usage guide 241 | - Sample test dataset 242 | - Example outputs 243 | 244 | --- 245 | 246 | ## Expected Output Section 247 | 248 | ### Task 1 Expected Output: 249 | ```python 250 | rag = BasicRAG() 251 | rag.add_documents(["Doc 1 text...", "Doc 2 text..."]) 252 | 253 | result = rag.query("What is Python?") 254 | # Output: 255 | { 256 | "answer": "Python is a high-level programming language...", 257 | "sources": [ 258 | "Python is a programming language created in 1991...", 259 | "Python supports multiple programming paradigms..." 260 | ] 261 | } 262 | ``` 263 | 264 | ### Task 2 Expected Output: 265 | ```python 266 | result = rag_with_citations.query("What is RAG?") 267 | # Output: 268 | { 269 | "answer": "According to document1.pdf, RAG stands for Retrieval-Augmented Generation...", 270 | "sources": [ 271 | {"text": "...", "source": "document1.pdf", "page": 3}, 272 | {"text": "...", "source": "document2.pdf", "page": 1} 273 | ] 274 | } 275 | ``` 276 | 277 | ### Task 3 Expected Output: 278 | ``` 279 | Query: "machine learning" 280 | Threshold: 0.7 281 | 282 | Retrieved 5 chunks, 3 above threshold (0.7) 283 | Using top 3 chunks for answer... 284 | 285 | Answer: [Generated answer using filtered chunks] 286 | ``` 287 | 288 | ### Mini Project Expected Output: 289 | 290 | The RAG system should provide: 291 | - Fast document indexing 292 | - Accurate retrieval 293 | - Clear, cited answers 294 | - Configurable settings 295 | - Professional interface 296 | 297 | **Example session:** 298 | ``` 299 | === RAG System === 300 | Choose: 2 301 | 302 | Question: How does neural network training work? 303 | 304 | [Retrieving relevant chunks...] 305 | [Generating answer...] 306 | 307 | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 308 | Answer: 309 | Neural network training involves feeding data through the network, calculating errors, and adjusting weights through backpropagation... 310 | 311 | Sources (Top 3): 312 | 1. [0.91] neural_networks.pdf | Page 12 313 | 2. [0.87] deep_learning.pdf | Page 5 314 | 3. 
[0.84] ai_basics.pdf | Page 8 315 | 316 | Similarity scores shown in brackets 317 | ``` 318 | 319 | --- 320 | 321 | ## Submission Checklist 322 | 323 | - [ ] Task 1: Basic RAG working 324 | - [ ] Task 2: Citations implemented 325 | - [ ] Task 3: Filtering functional 326 | - [ ] Task 4: Multi-query working 327 | - [ ] Task 5: Evaluation system complete 328 | - [ ] Mini project: Full RAG system 329 | - [ ] All components tested 330 | - [ ] Code is well-documented 331 | - [ ] Error handling implemented 332 | 333 | **Remember:** RAG combines retrieval and generation - both parts are important! 334 | 335 | **Good luck!** 🚀 336 | 337 | -------------------------------------------------------------------------------- /Day-02: Generative AI & LLM Basics/README.md: -------------------------------------------------------------------------------- 1 | # Day 2 — Generative AI & LLM Basics 2 | 3 | ## 1. Beginner-Friendly Introduction 4 | 5 | Welcome to the world of **Generative AI**! Today, you'll learn what Large Language Models (LLMs) are, how they work, and why they're the foundation of RAG systems. 6 | 7 | **What is Generative AI?** 8 | Generative AI refers to artificial intelligence systems that can create new content—text, images, code, etc.—based on patterns they've learned from training data. Unlike traditional AI that classifies or predicts, generative AI produces original outputs. 9 | 10 | **Why this matters for RAG:** 11 | - RAG systems use LLMs to generate answers 12 | - Understanding LLMs helps you craft better prompts 13 | - You'll interact with LLMs via APIs (like OpenAI) 14 | - LLMs have limitations that RAG solves (hallucination, outdated knowledge) 15 | 16 | **Real-world context:** 17 | Think of an LLM as a very knowledgeable assistant who has read millions of books but can't remember specific sources. RAG gives this assistant access to a "library" (your documents) so it can provide accurate, sourced answers. 18 | 19 | --- 20 | 21 | ## 2. Deep-Dive Explanation 22 | 23 | ### 2.1 What are Large Language Models (LLMs)? 24 | 25 | LLMs are neural networks trained on vast amounts of text data. They learn patterns, relationships, and language structure. 
26 | 27 | **Key characteristics:** 28 | - **Large**: Billions of parameters (weights) 29 | - **Language**: Understand and generate human language 30 | - **Models**: Mathematical representations of language patterns 31 | 32 | **How they work (simplified):** 33 | ``` 34 | Input Text → Neural Network → Probability Distribution → Generated Text 35 | ``` 36 | 37 | ### 2.2 Popular LLMs 38 | 39 | **OpenAI Models:** 40 | - **GPT-3.5/GPT-4**: General-purpose, powerful 41 | - **GPT-4 Turbo**: Faster, more efficient 42 | - **Embedding models**: Convert text to vectors 43 | 44 | **Other Models:** 45 | - **Claude** (Anthropic): Strong reasoning 46 | - **Llama** (Meta): Open-source alternative 47 | - **Gemini** (Google): Multimodal capabilities 48 | 49 | ### 2.3 Understanding Tokens 50 | 51 | LLMs process text in **tokens**, not words: 52 | - 1 token ≈ 4 characters (roughly) 53 | - "Hello world" = 2 tokens 54 | - "RAG system" = 3 tokens 55 | 56 | **Why it matters:** 57 | - API pricing is often per token 58 | - Models have token limits (context windows) 59 | - You need to manage token usage efficiently 60 | 61 | ### 2.4 The OpenAI API 62 | 63 | **Basic API Structure:** 64 | ``` 65 | Your Code → HTTP Request → OpenAI API → Response (JSON) → Your Code 66 | ``` 67 | 68 | **Key Components:** 69 | - **API Key**: Authentication 70 | - **Endpoint**: URL for the API 71 | - **Model**: Which LLM to use (e.g., "gpt-4") 72 | - **Messages**: Conversation format 73 | - **Parameters**: Temperature, max_tokens, etc. 74 | 75 | ### 2.5 API Parameters Explained 76 | 77 | **Temperature** (0-2): 78 | - Lower (0-0.3): More deterministic, focused 79 | - Higher (0.7-2): More creative, varied 80 | - Default: 0.7 81 | 82 | **Max Tokens:** 83 | - Maximum length of the response 84 | - Set based on your needs 85 | - Be careful not to exceed model limits 86 | 87 | **Top P** (0-1): 88 | - Nucleus sampling 89 | - Controls diversity 90 | - Alternative to temperature 91 | 92 | ### 2.6 Model Capabilities and Limitations 93 | 94 | **What LLMs are good at:** 95 | - Understanding context 96 | - Generating coherent text 97 | - Following instructions 98 | - Summarizing content 99 | - Answering questions (if trained on the topic) 100 | 101 | **What LLMs struggle with:** 102 | - **Hallucination**: Making up facts 103 | - **Outdated information**: Training data cutoff 104 | - **Specific knowledge**: Not trained on your documents 105 | - **Math/Logic**: Can make errors 106 | - **Real-time data**: No access to current events 107 | 108 | **This is why RAG exists!** RAG solves the "specific knowledge" and "outdated information" problems. 109 | 110 | --- 111 | 112 | ## 3. Instructor Examples 113 | 114 | ### Example 1: Basic OpenAI API Call 115 | 116 | ```python 117 | import openai 118 | import os 119 | 120 | # Set your API key (use environment variable in production!) 
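# Note: this and the following examples use the v1 OpenAI Python SDK
# (pip install "openai>=1.0"). Pre-1.0 SDKs exposed the same call as
# openai.ChatCompletion.create, which v1 removed.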
client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def simple_chat(prompt):
    """Send a simple prompt to GPT-3.5"""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": prompt}
        ],
        temperature=0.7,
        max_tokens=150
    )

    return response.choices[0].message.content

# Usage
answer = simple_chat("What is RAG?")
print(answer)
```

### Example 2: Conversation with Context

```python
def chat_with_context(messages, user_message):
    """Maintain conversation context"""
    messages.append({"role": "user", "content": user_message})

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0.7
    )

    assistant_message = response.choices[0].message.content
    messages.append({"role": "assistant", "content": assistant_message})

    return assistant_message, messages

# Usage
conversation = []
response, conversation = chat_with_context(
    conversation,
    "My name is Alice"
)
response, conversation = chat_with_context(
    conversation,
    "What's my name?"  # Model remembers context!
)
```

### Example 3: Using Different Models

```python
def compare_models(prompt):
    """Compare responses from different models"""
    models = ["gpt-3.5-turbo", "gpt-4"]
    results = {}

    for model in models:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
            max_tokens=200
        )
        results[model] = response.choices[0].message.content

    return results

# Usage
prompt = "Explain quantum computing in simple terms"
results = compare_models(prompt)
for model, answer in results.items():
    print(f"\n{model}:\n{answer}")
```

### Example 4: Controlling Output with Parameters

```python
def generate_with_settings(prompt, temperature=0.7, max_tokens=100):
    """Generate text with custom parameters"""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=max_tokens,
        top_p=0.9
    )

    return {
        "content": response.choices[0].message.content,
        "tokens_used": response.usage.total_tokens,
        "model": response.model
    }

# Usage - Creative writing (high temperature)
creative = generate_with_settings(
    "Write a short story about a robot",
    temperature=1.2,
    max_tokens=200
)

# Usage - Factual answer (low temperature)
factual = generate_with_settings(
    "What is the capital of France?",
    temperature=0.2,
    max_tokens=50
)
```

---

## 4. Student Practice Tasks

### Task 1: Basic API Setup
Set up your OpenAI API key and make your first API call. Print the response and the number of tokens used.

### Task 2: Temperature Experiment
Send the same prompt to the API with different temperature values (0.1, 0.7, 1.5). Observe how the responses differ. What do you notice?

### Task 3: Token Counting
Write a function that estimates token count for a given text.
Compare your estimate with the actual token count from the API response. 243 | 244 | ### Task 4: Conversation Memory 245 | Create a simple chatbot that maintains conversation history. The bot should remember what was discussed earlier in the conversation. 246 | 247 | ### Task 5: Model Comparison 248 | Compare responses from `gpt-3.5-turbo` and `gpt-4` for the same prompt. What differences do you observe in quality, detail, and token usage? 249 | 250 | ### Task 6: Error Handling 251 | Write a robust API wrapper that handles: 252 | - API key errors 253 | - Rate limiting 254 | - Network errors 255 | - Invalid model names 256 | 257 | --- 258 | 259 | ## 5. Summary / Key Takeaways 260 | 261 | - **LLMs** are neural networks trained on vast text data to understand and generate language 262 | - **Tokens** are the units LLMs process (not words); manage them carefully 263 | - **OpenAI API** provides access to powerful models via simple HTTP requests 264 | - **Temperature** controls creativity (low = focused, high = creative) 265 | - **Max tokens** limits response length 266 | - **LLMs have limitations**: hallucination, outdated info, no access to your documents 267 | - **RAG solves LLM limitations** by providing external knowledge 268 | - **Context matters**: LLMs use conversation history to maintain coherence 269 | - **Different models** have different capabilities and costs 270 | 271 | --- 272 | 273 | ## 6. Further Reading (Optional) 274 | 275 | - OpenAI API Documentation: [platform.openai.com/docs](https://platform.openai.com/docs) 276 | - "Attention Is All You Need" - The transformer paper (advanced) 277 | - OpenAI Cookbook: Examples and best practices 278 | - Token counting tools: tiktoken library 279 | 280 | --- 281 | 282 | **Next up:** Day 3 will teach you how to craft effective prompts! 283 | 284 | -------------------------------------------------------------------------------- /Day-01: Python Foundations for GenAI/README.md: -------------------------------------------------------------------------------- 1 | # Day 1 — Python Foundations for GenAI 2 | 3 | ## 1. Beginner-Friendly Introduction 4 | 5 | Welcome to Day 1! Before we dive into RAG and AI, we need to ensure you have a solid foundation in Python programming. Python is the primary language used in Generative AI development, and understanding its core concepts will make everything else much easier. 6 | 7 | **Why this matters for RAG:** 8 | 9 | - RAG systems are built primarily in Python 10 | - You'll need to work with data structures (lists, dictionaries) to handle documents 11 | - File handling is essential for reading PDFs, text files, and web content 12 | - API interactions are crucial for connecting to LLM services 13 | - Understanding functions and classes helps organize RAG code 14 | 15 | **Real-world context:** 16 | Think of Python as your toolkit. Just like a carpenter needs to know how to use a hammer before building a house, you need Python skills before building RAG systems. Every RAG application you'll build will use these fundamental concepts. 17 | 18 | --- 19 | 20 | ## 2. 
Deep-Dive Explanation 21 | 22 | ### Core Python Concepts for GenAI 23 | 24 | #### 2.1 Data Structures 25 | 26 | **Lists** - Ordered collections of items 27 | 28 | ```python 29 | documents = ["doc1.txt", "doc2.txt", "doc3.txt"] 30 | chunks = [] # Empty list to store text chunks 31 | ``` 32 | 33 | **Dictionaries** - Key-value pairs (perfect for storing metadata) 34 | 35 | ```python 36 | document_info = { 37 | "filename": "article.pdf", 38 | "page_count": 10, 39 | "author": "John Doe", 40 | "chunks": [] 41 | } 42 | ``` 43 | 44 | **Tuples** - Immutable ordered collections 45 | 46 | ```python 47 | api_config = ("https://api.openai.com", "v1", "gpt-4") 48 | ``` 49 | 50 | #### 2.2 File Handling 51 | 52 | Reading and writing files is essential for RAG: 53 | 54 | ``` 55 | File → Read → Process → Store 56 | ``` 57 | 58 | **Reading text files:** 59 | 60 | ```python 61 | with open("document.txt", "r", encoding="utf-8") as file: 62 | content = file.read() 63 | ``` 64 | 65 | **Writing to files:** 66 | 67 | ```python 68 | with open("output.txt", "w", encoding="utf-8") as file: 69 | file.write("Processed content") 70 | ``` 71 | 72 | #### 2.3 Functions and Classes 73 | 74 | **Functions** - Reusable blocks of code 75 | 76 | ```python 77 | def chunk_text(text, chunk_size=100): 78 | """Split text into chunks of specified size""" 79 | chunks = [] 80 | for i in range(0, len(text), chunk_size): 81 | chunks.append(text[i:i+chunk_size]) 82 | return chunks 83 | ``` 84 | 85 | **Classes** - Organizing related functionality 86 | 87 | ```python 88 | class DocumentProcessor: 89 | def __init__(self, filename): 90 | self.filename = filename 91 | self.content = "" 92 | 93 | def load(self): 94 | with open(self.filename, "r") as f: 95 | self.content = f.read() 96 | 97 | def get_word_count(self): 98 | return len(self.content.split()) 99 | ``` 100 | 101 | #### 2.4 Working with APIs 102 | 103 | RAG systems interact with APIs (like OpenAI): 104 | 105 | ``` 106 | Your Code → HTTP Request → API → Response → Your Code 107 | ``` 108 | 109 | **Basic API interaction pattern:** 110 | 111 | ```python 112 | import requests 113 | 114 | def call_api(url, data): 115 | response = requests.post(url, json=data) 116 | return response.json() 117 | ``` 118 | 119 | #### 2.5 List Comprehensions and Generators 120 | 121 | **List comprehensions** - Concise way to create lists 122 | 123 | ```python 124 | # Traditional way 125 | squares = [] 126 | for x in range(10): 127 | squares.append(x**2) 128 | 129 | # List comprehension 130 | squares = [x**2 for x in range(10)] 131 | ``` 132 | 133 | **Generators** - Memory-efficient iteration 134 | 135 | ```python 136 | def chunk_generator(text, chunk_size): 137 | for i in range(0, len(text), chunk_size): 138 | yield text[i:i+chunk_size] 139 | ``` 140 | 141 | --- 142 | 143 | ## 3. 
Instructor Examples 144 | 145 | ### Example 1: Reading and Processing a Text File 146 | 147 | ```python 148 | def read_and_process_file(filename): 149 | """Read a file and return processed content""" 150 | try: 151 | with open(filename, "r", encoding="utf-8") as file: 152 | content = file.read() 153 | 154 | # Basic processing 155 | lines = content.split("\n") 156 | word_count = len(content.split()) 157 | 158 | return { 159 | "content": content, 160 | "lines": len(lines), 161 | "words": word_count 162 | } 163 | except FileNotFoundError: 164 | print(f"File {filename} not found!") 165 | return None 166 | 167 | # Usage 168 | result = read_and_process_file("sample.txt") 169 | if result: 170 | print(f"Lines: {result['lines']}, Words: {result['words']}") 171 | ``` 172 | 173 | ### Example 2: Text Chunking Function 174 | 175 | ```python 176 | def chunk_text(text, chunk_size=200, overlap=50): 177 | """ 178 | Split text into overlapping chunks 179 | 180 | Args: 181 | text: Input text to chunk 182 | chunk_size: Size of each chunk 183 | overlap: Number of characters to overlap between chunks 184 | """ 185 | chunks = [] 186 | start = 0 187 | 188 | while start < len(text): 189 | end = start + chunk_size 190 | chunk = text[start:end] 191 | chunks.append(chunk) 192 | start = end - overlap # Overlap for context 193 | 194 | return chunks 195 | 196 | # Usage 197 | long_text = "This is a very long document..." * 100 198 | chunks = chunk_text(long_text, chunk_size=200, overlap=50) 199 | print(f"Created {len(chunks)} chunks") 200 | ``` 201 | 202 | ### Example 3: Working with Dictionaries for Document Metadata 203 | 204 | ```python 205 | class Document: 206 | def __init__(self, filename, content): 207 | self.filename = filename 208 | self.content = content 209 | self.metadata = { 210 | "word_count": len(content.split()), 211 | "char_count": len(content), 212 | "chunks": [] 213 | } 214 | 215 | def add_chunk(self, chunk_text, chunk_id): 216 | chunk_data = { 217 | "id": chunk_id, 218 | "text": chunk_text, 219 | "length": len(chunk_text) 220 | } 221 | self.metadata["chunks"].append(chunk_data) 222 | 223 | def get_summary(self): 224 | return { 225 | "filename": self.filename, 226 | "words": self.metadata["word_count"], 227 | "chunks": len(self.metadata["chunks"]) 228 | } 229 | 230 | # Usage 231 | doc = Document("article.txt", "This is the content of the article...") 232 | doc.add_chunk("First chunk", 1) 233 | doc.add_chunk("Second chunk", 2) 234 | print(doc.get_summary()) 235 | ``` 236 | 237 | ### Example 4: Simple API Request Pattern 238 | 239 | ```python 240 | import requests 241 | import json 242 | 243 | def make_api_request(url, payload, headers=None): 244 | """Make a POST request to an API""" 245 | default_headers = {"Content-Type": "application/json"} 246 | if headers: 247 | default_headers.update(headers) 248 | 249 | try: 250 | response = requests.post(url, json=payload, headers=default_headers) 251 | response.raise_for_status() # Raises exception for bad status codes 252 | return response.json() 253 | except requests.exceptions.RequestException as e: 254 | print(f"API request failed: {e}") 255 | return None 256 | 257 | # Usage pattern (you'll use this with OpenAI API later) 258 | # payload = {"prompt": "Hello, world!"} 259 | # result = make_api_request("https://api.example.com/endpoint", payload) 260 | ``` 261 | 262 | --- 263 | 264 | ## 4. 
Student Practice Tasks 265 | 266 | ### Task 1: File Reader Function 267 | 268 | Write a function that reads a file and returns: 269 | 270 | - The content as a string 271 | - The number of sentences (split by periods) 272 | - A list of unique words (lowercase) 273 | 274 | ### Task 2: Dictionary Manipulation 275 | 276 | Create a dictionary to store information about 3 documents. Each document should have: 277 | 278 | - `title` 279 | - `author` 280 | - `word_count` 281 | - `chunks` (a list) 282 | 283 | Then write a function that finds the document with the most words. 284 | 285 | ### Task 3: Text Processing 286 | 287 | Write a function that: 288 | 289 | 1. Takes a long string of text 290 | 2. Removes all punctuation 291 | 3. Converts to lowercase 292 | 4. Splits into words 293 | 5. Returns a dictionary with word frequencies 294 | 295 | ### Task 4: Chunking with Metadata 296 | 297 | Modify the chunking function to also return metadata for each chunk: 298 | 299 | - Chunk number 300 | - Start position 301 | - End position 302 | - Word count 303 | 304 | ### Task 5: Error Handling 305 | 306 | Write a robust file reader that handles: 307 | 308 | - File not found errors 309 | - Permission errors 310 | - Encoding errors 311 | - Empty files 312 | 313 | ### Task 6: List Comprehension Challenge 314 | 315 | Convert this code to use list comprehensions: 316 | 317 | ```python 318 | numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] 319 | even_squares = [] 320 | for num in numbers: 321 | if num % 2 == 0: 322 | even_squares.append(num ** 2) 323 | ``` 324 | 325 | --- 326 | 327 | ## 5. Summary / Key Takeaways 328 | 329 | - **Lists and dictionaries** are essential for storing documents and metadata in RAG systems 330 | - **File handling** with `with` statements ensures proper resource management 331 | - **Functions** help organize code and make it reusable 332 | - **Classes** provide structure for complex data and operations 333 | - **API interactions** will be crucial when connecting to LLM services 334 | - **List comprehensions** make code more Pythonic and readable 335 | - **Error handling** is important for robust applications 336 | - **Text processing** (chunking, splitting) is fundamental to RAG 337 | 338 | --- 339 | 340 | ## 6. Further Reading (Optional) 341 | 342 | - Python Official Documentation: [docs.python.org](https://docs.python.org/3/) 343 | - Real Python: Great tutorials on Python fundamentals 344 | - Python `requests` library documentation for API calls 345 | - PEP 8: Python style guide for writing clean code 346 | 347 | --- 348 | 349 | **Next up:** Day 2 will introduce you to Generative AI and Large Language Models! 350 | -------------------------------------------------------------------------------- /Day-08: RAG Using LangChain or LlamaIndex/README.md: -------------------------------------------------------------------------------- 1 | # Day 8 — RAG Using LangChain or LlamaIndex 2 | 3 | ## 1. Beginner-Friendly Introduction 4 | 5 | Now that you've built RAG from scratch, it's time to learn the frameworks that make it easier! **LangChain** and **LlamaIndex** are popular frameworks that abstract away the complexity and provide powerful features out of the box. 6 | 7 | **Why use frameworks?** 8 | - **Faster development**: Pre-built components 9 | - **Best practices**: Built-in optimizations 10 | - **More features**: Advanced capabilities 11 | - **Community**: Well-documented and supported 12 | - **Production-ready**: Battle-tested code 13 | 14 | **LangChain vs. 
LlamaIndex:**
- **LangChain**: General-purpose LLM framework, flexible
- **LlamaIndex**: Specialized for RAG, data-focused

**Today's goal:**
Learn to build RAG systems using these frameworks, understanding when to use which.

---

## 2. Deep-Dive Explanation

### 2.1 LangChain Overview

**What is LangChain?**
A framework for building LLM applications with:
- Document loaders
- Text splitters
- Vector stores
- Chains (workflows)
- Agents (autonomous systems)

**Key Components:**
- **Document Loaders**: Load from various sources
- **Text Splitters**: Chunk documents
- **Embeddings**: Generate embeddings
- **Vector Stores**: Store and search
- **Retrievers**: Retrieve relevant docs
- **Chains**: Combine components
- **LLMs**: Language models

### 2.2 LangChain RAG Pipeline

**Components:**
```
Document Loader → Text Splitter → Embeddings → Vector Store
                                                    ↓
User Query → Embeddings → Retriever → Context + Query → LLM → Answer
```

**LangChain Abstractions:**
- `Document`: Text with metadata
- `TextSplitter`: Chunks documents
- `Embeddings`: Embedding interface
- `VectorStore`: Vector database interface
- `Retriever`: Retrieval interface
- `Chain`: Composable workflows

### 2.3 LlamaIndex Overview

**What is LlamaIndex?**
A data framework for LLM applications, optimized for RAG:
- Data connectors
- Indexing
- Querying
- Retrieval
- Response synthesis

**Key Concepts:**
- **Index**: Structured data representation
- **Nodes**: Chunks with metadata
- **Retrievers**: Find relevant nodes
- **Query Engines**: Answer questions
- **Response Synthesizers**: Generate answers

### 2.4 LlamaIndex RAG Pipeline

**Components:**
```
Documents → Load Data → Parse → Build Index
                                    ↓
Query → Retrieve Nodes → Synthesize Response → Answer
```

**LlamaIndex Abstractions:**
- `Document`: Source document
- `Node`: Chunk with metadata
- `Index`: Structured data store
- `Retriever`: Retrieval logic
- `QueryEngine`: Query interface
- `ResponseSynthesizer`: Answer generation

### 2.5 When to Use Which?

**Use LangChain when:**
- Building general LLM applications
- Need flexibility and customization
- Want to combine multiple tools
- Building agents or complex workflows

**Use LlamaIndex when:**
- Focused on RAG applications
- Need advanced retrieval strategies
- Want optimized indexing
- Building data-centric applications

**You can use both!** They complement each other.

---

## 3. Instructor Examples

### Example 1: LangChain RAG

```python
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
import os

# Setup
os.environ["OPENAI_API_KEY"] = "your-key"

# 1. Load documents
loader = PyPDFLoader("document.pdf")
documents = loader.load()

# 2. Split text
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = text_splitter.split_documents(documents)

# 3. Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings
)

# 4. Create retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# 5. Create QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)

# 6. Query
result = qa_chain({"query": "What is the main topic?"})
print(result["result"])
print(f"Sources: {len(result['source_documents'])}")
```
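
The imports above match the pre-0.1 `langchain` package this example was written against. If `pip install langchain` gives you a newer split-package release, the same components live under different module paths; a sketch of the equivalents, assuming the `langchain-community`, `langchain-openai`, and `langchain-text-splitters` packages are installed:

```python
# Equivalent imports on post-0.1 LangChain; the rest of the example
# is unchanged apart from these module paths.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, OpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
```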
### Example 2: LangChain with Chat Models

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

# Use chat model
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo")

# Add memory for conversation
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

# Create conversational chain
qa_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory
)

# Query with conversation
result = qa_chain({"question": "What is Python?"})
result = qa_chain({"question": "What are its main features?"})  # Remembers context
```

### Example 3: LlamaIndex RAG

```python
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms import OpenAI
import os

os.environ["OPENAI_API_KEY"] = "your-key"

# 1. Load documents
documents = SimpleDirectoryReader("./documents").load_data()

# 2. Create index (handles chunking, embedding, storage)
index = VectorStoreIndex.from_documents(documents)

# 3. Create query engine
query_engine = index.as_query_engine()

# 4.
Query 211 | response = query_engine.query("What is the main topic?") 212 | print(response) 213 | print(f"Source nodes: {len(response.source_nodes)}") 214 | ``` 215 | 216 | ### Example 4: LlamaIndex with Custom Settings 217 | 218 | ```python 219 | from llama_index import ( 220 | VectorStoreIndex, 221 | ServiceContext, 222 | StorageContext 223 | ) 224 | from llama_index.embeddings import OpenAIEmbedding 225 | from llama_index.node_parser import SimpleNodeParser 226 | from llama_index.vector_stores import ChromaVectorStore 227 | import chromadb 228 | 229 | # Custom service context 230 | service_context = ServiceContext.from_defaults( 231 | llm=OpenAI(temperature=0, model="gpt-3.5-turbo"), 232 | embed_model=OpenAIEmbedding(), 233 | node_parser=SimpleNodeParser.from_defaults(chunk_size=500) 234 | ) 235 | 236 | # Custom vector store 237 | chroma_client = chromadb.Client() 238 | chroma_collection = chroma_client.create_collection("rag_docs") 239 | vector_store = ChromaVectorStore(chroma_collection=chroma_collection) 240 | storage_context = StorageContext.from_defaults(vector_store=vector_store) 241 | 242 | # Create index with custom settings 243 | index = VectorStoreIndex.from_documents( 244 | documents, 245 | service_context=service_context, 246 | storage_context=storage_context 247 | ) 248 | 249 | # Query 250 | query_engine = index.as_query_engine(similarity_top_k=3) 251 | response = query_engine.query("Your question here") 252 | ``` 253 | 254 | ### Example 5: Comparing Both Frameworks 255 | 256 | ```python 257 | # LangChain approach 258 | from langchain.chains import RetrievalQA 259 | 260 | langchain_qa = RetrievalQA.from_chain_type( 261 | llm=llm, 262 | retriever=retriever, 263 | return_source_documents=True 264 | ) 265 | 266 | # LlamaIndex approach 267 | from llama_index import VectorStoreIndex 268 | 269 | llamaindex_index = VectorStoreIndex.from_documents(documents) 270 | llamaindex_qa = llamaindex_index.as_query_engine() 271 | 272 | # Both achieve similar results with different APIs 273 | ``` 274 | 275 | --- 276 | 277 | ## 4. Student Practice Tasks 278 | 279 | ### Task 1: LangChain RAG Setup 280 | Set up a basic LangChain RAG system: 281 | - Load documents 282 | - Create vector store 283 | - Build QA chain 284 | - Test with queries 285 | 286 | ### Task 2: LlamaIndex RAG Setup 287 | Set up a basic LlamaIndex RAG system: 288 | - Load documents 289 | - Create index 290 | - Build query engine 291 | - Test with queries 292 | 293 | ### Task 3: Custom Configuration 294 | Configure both frameworks with: 295 | - Custom chunk sizes 296 | - Different embedding models 297 | - Various LLM parameters 298 | - Compare results 299 | 300 | ### Task 4: Advanced Retrieval 301 | Experiment with: 302 | - Different retrieval strategies 303 | - Metadata filtering 304 | - Reranking 305 | - Hybrid search 306 | 307 | ### Task 5: Framework Comparison 308 | Build the same RAG system with both frameworks and compare: 309 | - Code complexity 310 | - Performance 311 | - Features 312 | - Ease of use 313 | 314 | ### Task 6: Integration 315 | Combine LangChain and LlamaIndex components in a single system. 316 | 317 | --- 318 | 319 | ## 5. 
Summary / Key Takeaways 320 | 321 | - **LangChain**: General-purpose LLM framework, flexible and composable 322 | - **LlamaIndex**: RAG-optimized framework, data-centric 323 | - **Both are powerful**: Choose based on your needs 324 | - **Pre-built components**: Save development time 325 | - **Best practices**: Frameworks include optimizations 326 | - **Active communities**: Well-documented and supported 327 | - **Production-ready**: Battle-tested code 328 | - **Can combine**: Use both frameworks together 329 | - **Learning curve**: Worth it for complex applications 330 | - **Abstraction**: Understand what's happening under the hood 331 | 332 | --- 333 | 334 | ## 6. Further Reading (Optional) 335 | 336 | - LangChain Documentation 337 | - LlamaIndex Documentation 338 | - Framework comparison articles 339 | - Community examples and tutorials 340 | 341 | --- 342 | 343 | **Next up:** Day 9 will cover advanced RAG techniques! 344 | 345 | -------------------------------------------------------------------------------- /Day-04: Chunking & Data Extraction (PDF-Web-Docs)/assignment.md: -------------------------------------------------------------------------------- 1 | # Day 4 — Assignment 2 | 3 | ## Instructions 4 | 5 | Complete these tasks to master data extraction and chunking. You'll work with PDFs, web pages, and implement various chunking strategies. Make sure you have the required libraries installed: 6 | 7 | ```bash 8 | pip install pypdf beautifulsoup4 requests 9 | ``` 10 | 11 | **Important:** 12 | - Test with real files/URLs 13 | - Handle errors gracefully 14 | - Consider edge cases (empty files, malformed HTML, etc.) 15 | - Document your chunking decisions 16 | 17 | --- 18 | 19 | ## Tasks 20 | 21 | ### Task 1: PDF Text Extractor 22 | 23 | Create a comprehensive PDF extractor `pdf_extractor.py` that: 24 | 25 | 1. Extracts text from PDF files 26 | 2. Returns structured data: 27 | - Full text 28 | - Text per page (list) 29 | - Total pages 30 | - Metadata (author, title if available) 31 | 3. Handles errors (corrupted files, password-protected, etc.) 32 | 4. Optionally extracts text from specific page ranges 33 | 34 | **Requirements:** 35 | - Use `pypdf` library 36 | - Add progress indication for large PDFs 37 | - Support batch processing (multiple PDFs) 38 | 39 | **Test with:** A multi-page PDF document 40 | 41 | **Deliverable:** `task1_pdf_extractor.py` 42 | 43 | --- 44 | 45 | ### Task 2: Web Content Scraper 46 | 47 | Build a web scraper `web_scraper.py` that: 48 | 49 | 1. Takes a URL as input 50 | 2. Extracts main content (removes navigation, ads, scripts) 51 | 3. Returns clean, readable text 52 | 4. Handles different website structures 53 | 5. Extracts metadata (title, author, date if available) 54 | 55 | **Requirements:** 56 | - Use `requests` and `BeautifulSoup` 57 | - Add proper headers (User-Agent) 58 | - Handle timeouts and errors 59 | - Support both single pages and article-style pages 60 | - Clean extracted text (remove extra whitespace, normalize) 61 | 62 | **Test with:** 63 | - A news article URL 64 | - A blog post URL 65 | - A Wikipedia page 66 | 67 | **Deliverable:** `task2_web_scraper.py` 68 | 69 | --- 70 | 71 | ### Task 3: Chunking Strategy Comparison 72 | 73 | Implement three different chunking strategies and compare them: 74 | 75 | 1. **Fixed-Size Chunking**: Split by character/word count 76 | 2. **Sentence-Aware Chunking**: Respect sentence boundaries 77 | 3. 
**Paragraph-Aware Chunking**: Respect paragraph boundaries 78 | 79 | **Requirements:** 80 | - All strategies should support overlap 81 | - Create a comparison function that: 82 | - Tests all three on the same text 83 | - Reports chunk count, average size, size variance 84 | - Shows sample chunks from each strategy 85 | - Visualize differences (print sample chunks side-by-side) 86 | 87 | **Test with:** A long text document (at least 2000 words) 88 | 89 | **Deliverable:** `task3_chunking_comparison.py` 90 | 91 | --- 92 | 93 | ### Task 4: Text Cleaning Pipeline 94 | 95 | Create a comprehensive text cleaning module `text_cleaner.py`: 96 | 97 | **Cleaning functions:** 98 | 1. Remove extra whitespace 99 | 2. Normalize line breaks 100 | 3. Remove special characters (configurable) 101 | 4. Remove headers/footers (detect common patterns) 102 | 5. Fix encoding issues 103 | 6. Remove URLs/email addresses (optional) 104 | 7. Normalize quotes and dashes 105 | 106 | **Requirements:** 107 | - Make each cleaning step optional/configurable 108 | - Create a `clean_text()` function that applies all steps 109 | - Test each step individually 110 | - Show before/after examples 111 | 112 | **Deliverable:** `task4_text_cleaner.py` 113 | 114 | --- 115 | 116 | ### Task 5: Chunk Metadata System 117 | 118 | Build a chunking system that stores rich metadata: 119 | 120 | **Metadata to include:** 121 | - `chunk_id`: Unique identifier 122 | - `source`: Source file/URL 123 | - `page_number`: Page number (for PDFs) 124 | - `chunk_index`: Position in document 125 | - `start_char`: Starting character position 126 | - `end_char`: Ending character position 127 | - `word_count`: Number of words 128 | - `char_count`: Number of characters 129 | - `timestamp`: When chunk was created 130 | - `preview`: First 50 characters (for quick preview) 131 | 132 | **Requirements:** 133 | - Create a `Chunk` class to store this data 134 | - Implement methods to: 135 | - Export chunks to JSON 136 | - Filter chunks by metadata 137 | - Get chunk statistics 138 | 139 | **Deliverable:** `task5_chunk_metadata.py` 140 | 141 | --- 142 | 143 | ## One Mini Project 144 | 145 | ### 📘 Build a PDF-to-Text Extractor and Chunker 146 | 147 | Create a complete application `document_processor.py` that processes documents and prepares them for RAG. 148 | 149 | **Features:** 150 | 151 | 1. **Multi-Format Support:** 152 | - PDF files 153 | - Text files (.txt, .md) 154 | - Web URLs 155 | - (Optional) Word documents (.docx) 156 | 157 | 2. **Processing Pipeline:** 158 | ``` 159 | Input → Extract → Clean → Chunk → Store → Report 160 | ``` 161 | 162 | 3. **Chunking Options:** 163 | - Chunk size (characters/words) 164 | - Overlap percentage 165 | - Strategy (fixed, sentence-aware, paragraph-aware) 166 | - Minimum chunk size 167 | 168 | 4. **Output Formats:** 169 | - JSON (structured chunks with metadata) 170 | - Text file (one chunk per line) 171 | - CSV (chunks with metadata columns) 172 | - Console display (formatted) 173 | 174 | 5. **Batch Processing:** 175 | - Process multiple files 176 | - Process entire directories 177 | - Progress tracking 178 | - Error reporting 179 | 180 | 6. **Statistics and Reporting:** 181 | - Total chunks created 182 | - Average chunk size 183 | - Size distribution 184 | - Processing time 185 | - Source information 186 | 187 | 7. **Interactive CLI:** 188 | ``` 189 | === Document Processor === 190 | 1. Process single file 191 | 2. Process directory 192 | 3. Process web URL 193 | 4. Configure chunking settings 194 | 5. View statistics 195 | 6. 
Export results 196 | 7. Exit 197 | ``` 198 | 199 | **Requirements:** 200 | - Use classes for organization 201 | - Implement proper error handling 202 | - Add progress bars for long operations 203 | - Support command-line arguments 204 | - Create a configuration system (JSON/YAML) 205 | - Generate detailed reports 206 | - Store results in organized folders 207 | 208 | **Example Usage:** 209 | ```bash 210 | # Command line 211 | python document_processor.py input.pdf --chunk-size 500 --overlap 50 --output json 212 | 213 | # Interactive mode 214 | python document_processor.py 215 | ``` 216 | 217 | **Example Output:** 218 | ``` 219 | Processing: document.pdf 220 | ✓ Extracted 15 pages 221 | ✓ Cleaned text (removed 234 extra spaces) 222 | ✓ Created 42 chunks 223 | ✓ Average chunk size: 487 words 224 | ✓ Processing time: 2.3 seconds 225 | 226 | Chunks saved to: output/document_chunks.json 227 | Statistics saved to: output/document_stats.txt 228 | ``` 229 | 230 | **Advanced Features (Bonus):** 231 | - OCR support for scanned PDFs 232 | - Table extraction 233 | - Image extraction 234 | - Language detection 235 | - Duplicate detection 236 | - Chunk quality scoring 237 | 238 | **Deliverables:** 239 | - `document_processor.py` - Main application 240 | - `config.json` - Configuration file template 241 | - `requirements.txt` - Dependencies 242 | - `README_processor.md` - Usage documentation 243 | - Sample output files demonstrating functionality 244 | 245 | --- 246 | 247 | ## Expected Output Section 248 | 249 | ### Task 1 Expected Output: 250 | ```python 251 | result = extract_pdf("document.pdf") 252 | # Output: 253 | { 254 | "full_text": "Complete text...", 255 | "pages": [ 256 | "Page 1 text...", 257 | "Page 2 text...", 258 | ... 259 | ], 260 | "total_pages": 15, 261 | "metadata": { 262 | "title": "Sample Document", 263 | "author": "John Doe" 264 | } 265 | } 266 | ``` 267 | 268 | ### Task 2 Expected Output: 269 | ```python 270 | content = scrape_web("https://example.com/article") 271 | # Output: 272 | { 273 | "title": "Article Title", 274 | "content": "Clean article text...", 275 | "author": "Author Name", 276 | "date": "2024-01-15", 277 | "word_count": 1234 278 | } 279 | ``` 280 | 281 | ### Task 3 Expected Output: 282 | ``` 283 | === Chunking Strategy Comparison === 284 | 285 | Text: 2500 words 286 | 287 | Fixed-Size Chunking: 288 | - Chunks: 5 289 | - Avg size: 500 words 290 | - Size variance: 0 words 291 | - Sample: "This is the first chunk of text..." 292 | 293 | Sentence-Aware Chunking: 294 | - Chunks: 6 295 | - Avg size: 417 words 296 | - Size variance: 45 words 297 | - Sample: "This is the first chunk. It respects..." 298 | 299 | Paragraph-Aware Chunking: 300 | - Chunks: 4 301 | - Avg size: 625 words 302 | - Size variance: 120 words 303 | - Sample: "This is a complete paragraph. It contains..." 304 | ``` 305 | 306 | ### Task 5 Expected Output: 307 | ```python 308 | chunks = chunk_with_metadata(text, source="doc.pdf") 309 | # Output: List of Chunk objects 310 | [ 311 | Chunk( 312 | chunk_id=1, 313 | source="doc.pdf", 314 | page_number=1, 315 | start_char=0, 316 | end_char=500, 317 | word_count=75, 318 | preview="This is the beginning of the chunk..." 319 | ), 320 | ... 
321 | ] 322 | ``` 323 | 324 | ### Mini Project Expected Output: 325 | 326 | The document processor should provide: 327 | - Clear progress indicators 328 | - Detailed statistics 329 | - Multiple output formats 330 | - Error handling and reporting 331 | - Professional CLI interface 332 | 333 | **Example session:** 334 | ``` 335 | === Document Processor === 336 | Choose option: 1 337 | 338 | Enter file path: document.pdf 339 | Chunk size [500]: 400 340 | Overlap [50]: 40 341 | Strategy [fixed/sentence/paragraph]: sentence 342 | 343 | [Processing...] 344 | ✓ Extracted 15 pages 345 | ✓ Created 38 chunks 346 | ✓ Saved to output/document_chunks.json 347 | 348 | Statistics: 349 | - Total words: 15,234 350 | - Chunks: 38 351 | - Avg chunk size: 401 words 352 | - Processing time: 2.1s 353 | 354 | [1] View chunks 355 | [2] Export to CSV 356 | [3] Process another file 357 | [4] Main menu 358 | ``` 359 | 360 | --- 361 | 362 | ## Submission Checklist 363 | 364 | - [ ] Task 1: PDF extractor working 365 | - [ ] Task 2: Web scraper functional 366 | - [ ] Task 3: Chunking comparison complete 367 | - [ ] Task 4: Text cleaning pipeline implemented 368 | - [ ] Task 5: Metadata system working 369 | - [ ] Mini project: Complete document processor 370 | - [ ] All code handles errors gracefully 371 | - [ ] Code is well-documented 372 | - [ ] Tested with real files/URLs 373 | 374 | **Remember:** Good data extraction and chunking are crucial for RAG quality! 375 | 376 | **Good luck!** 🚀 377 | 378 | -------------------------------------------------------------------------------- /Day-03: Prompt Engineering Essentials/assignment.md: -------------------------------------------------------------------------------- 1 | # Day 3 — Assignment 2 | 3 | ## Instructions 4 | 5 | Complete these tasks to master prompt engineering. You'll create various prompt templates and test them with the OpenAI API. Focus on: 6 | - Clarity and specificity 7 | - Proper structure 8 | - Effective use of examples 9 | - Context handling 10 | 11 | **Important:** 12 | - Test all prompts with actual API calls 13 | - Compare different prompt variations 14 | - Document what works and what doesn't 15 | - Save your best prompts as reusable templates 16 | 17 | --- 18 | 19 | ## Tasks 20 | 21 | ### Task 1: Prompt Improvement Challenge 22 | 23 | Take these 5 vague prompts and rewrite them to be specific, clear, and effective: 24 | 25 | 1. "Tell me about machine learning" 26 | 2. "Fix this code" 27 | 3. "Summarize this" 28 | 4. "What's the best way?" 29 | 5. 
"Explain this document" 30 | 31 | For each: 32 | - Write an improved version 33 | - Explain why your version is better 34 | - Test both versions with the API 35 | - Compare the results 36 | 37 | **Deliverable:** `task1_prompt_improvements.py` with both old and new prompts, plus comparison results 38 | 39 | --- 40 | 41 | ### Task 2: Role-Based Prompt System 42 | 43 | Create a system that generates prompts based on different AI roles: 44 | 45 | **Roles to implement:** 46 | - `coding_tutor`: Explains programming concepts to beginners 47 | - `business_analyst`: Analyzes business problems 48 | - `creative_writer`: Helps with creative writing 49 | - `data_scientist`: Explains data science concepts 50 | - `technical_writer`: Creates technical documentation 51 | 52 | **Requirements:** 53 | - Create a function `generate_role_prompt(role, user_input)` 54 | - Each role should have a distinct personality and style 55 | - Test each role with the same input to see how responses differ 56 | 57 | **Deliverable:** `task2_role_prompts.py` 58 | 59 | --- 60 | 61 | ### Task 3: Few-Shot Learning Templates 62 | 63 | Create few-shot prompt templates for these tasks: 64 | 65 | 1. **Text Classification**: Classify customer reviews as positive, negative, or neutral 66 | 2. **Format Conversion**: Convert informal text to formal business language 67 | 3. **Information Extraction**: Extract names, dates, and locations from text 68 | 4. **Code Translation**: Convert Python code to pseudocode 69 | 70 | **Requirements:** 71 | - Each template should have 3-5 examples 72 | - Create reusable functions 73 | - Test with new inputs 74 | 75 | **Deliverable:** `task3_fewshot_templates.py` 76 | 77 | --- 78 | 79 | ### Task 4: Chain-of-Thought Problem Solver 80 | 81 | Build a problem-solving system using chain-of-thought prompting: 82 | 83 | **Problem types to handle:** 84 | - Math word problems 85 | - Logic puzzles 86 | - Code debugging scenarios 87 | - Decision-making problems 88 | 89 | **Requirements:** 90 | - Create a function that formats problems with CoT instructions 91 | - The prompt should encourage step-by-step reasoning 92 | - Extract and display the reasoning steps from the response 93 | 94 | **Example:** 95 | ```python 96 | problem = "If 3 apples cost $2, how much do 9 apples cost?" 97 | solution = solve_with_cot(problem) 98 | # Should show: Step 1, Step 2, Step 3, Final Answer 99 | ``` 100 | 101 | **Deliverable:** `task4_cot_solver.py` 102 | 103 | --- 104 | 105 | ### Task 5: RAG Prompt Template Builder 106 | 107 | Create a comprehensive RAG prompt template system: 108 | 109 | **Features:** 110 | 1. **Context Injection**: Add retrieved documents to prompt 111 | 2. **Citation Support**: Include instructions for citing sources 112 | 3. **Answer Formatting**: Specify output format (paragraph, bullet points, JSON) 113 | 4. **Fallback Handling**: Instructions for when context is insufficient 114 | 5. 
**Multi-document Synthesis**: Handle multiple relevant documents 115 | 116 | **Requirements:** 117 | - Create a class `RAGPromptBuilder` 118 | - Methods: 119 | - `add_context(documents)` - Add retrieved documents 120 | - `set_question(question)` - Set the question 121 | - `set_format(format_type)` - Set output format 122 | - `enable_citations(enable)` - Toggle citation requirements 123 | - `build()` - Generate final prompt 124 | 125 | **Example usage:** 126 | ```python 127 | builder = RAGPromptBuilder() 128 | builder.add_context(["Doc 1 content", "Doc 2 content"]) 129 | builder.set_question("What is RAG?") 130 | builder.set_format("bullet_points") 131 | builder.enable_citations(True) 132 | prompt = builder.build() 133 | ``` 134 | 135 | **Deliverable:** `task5_rag_prompt_builder.py` 136 | 137 | --- 138 | 139 | ## One Mini Project 140 | 141 | ### 🎯 Build a Prompt Engineering Playground 142 | 143 | Create an interactive application `prompt_playground.py` that allows users to experiment with different prompt engineering techniques. 144 | 145 | **Features:** 146 | 147 | 1. **Main Menu:** 148 | ``` 149 | === Prompt Engineering Playground === 150 | 1. Basic Prompt Tester 151 | 2. Role-Based Prompts 152 | 3. Few-Shot Learning 153 | 4. Chain-of-Thought 154 | 5. RAG Prompt Builder 155 | 6. Prompt Comparison Tool 156 | 7. Save/Load Prompts 157 | 8. Exit 158 | ``` 159 | 160 | 2. **Basic Prompt Tester:** 161 | - Enter a prompt 162 | - Adjust parameters (temperature, max_tokens) 163 | - View response 164 | - Rate the response quality (1-5) 165 | - Save prompts and ratings 166 | 167 | 3. **Role-Based Prompts:** 168 | - Select from predefined roles 169 | - Enter your input 170 | - See how different roles respond 171 | - Create custom roles 172 | 173 | 4. **Few-Shot Learning:** 174 | - Add examples interactively 175 | - Test with new inputs 176 | - Compare results with/without examples 177 | - Save example sets 178 | 179 | 5. **Chain-of-Thought:** 180 | - Enter a problem 181 | - View step-by-step reasoning 182 | - Extract and highlight reasoning steps 183 | - Compare with direct answers 184 | 185 | 6. **RAG Prompt Builder:** 186 | - Add context documents 187 | - Set question 188 | - Configure options (citations, format, etc.) 189 | - Generate and test prompt 190 | - Save templates 191 | 192 | 7. **Prompt Comparison Tool:** 193 | - Enter multiple prompt variations 194 | - Test all with the same input 195 | - Side-by-side comparison 196 | - Quality scoring 197 | - Export comparison results 198 | 199 | 8. **Save/Load Prompts:** 200 | - Save successful prompts to JSON 201 | - Load saved prompts 202 | - Organize by category 203 | - Search prompts 204 | 205 | **Advanced Features:** 206 | - **A/B Testing**: Compare two prompts statistically 207 | - **Prompt Library**: Pre-built prompts for common tasks 208 | - **Response Analyzer**: Analyze response quality (length, structure, etc.) 
209 | - **Token Optimizer**: Suggest ways to reduce token usage
210 | - **Export Options**: Export prompts and results to various formats
211 | 
212 | **Requirements:**
213 | - Use classes for organization
214 | - Store prompts and results in JSON files
215 | - Implement a clean CLI interface
216 | - Add color coding for better UX (optional)
217 | - Include help/documentation
218 | - Handle errors gracefully
219 | 
220 | **Example Interaction:**
221 | ```
222 | === Prompt Engineering Playground ===
223 | Choose option: 1
224 | 
225 | Enter your prompt: Explain quantum computing
226 | Temperature [0.7]: 0.5
227 | Max tokens [200]: 150
228 | 
229 | [Processing...]
230 | 
231 | Response:
232 | Quantum computing uses quantum mechanical phenomena...
233 | 
234 | Tokens: 87
235 | Rate this response (1-5): 4
236 | 
237 | [1] Try different parameters
238 | [2] Save this prompt
239 | [3] Compare with another prompt
240 | [4] Main menu
241 | ```
242 | 
243 | **Deliverables:**
244 | - `prompt_playground.py` - Main application
245 | - `prompt_templates.json` - Saved prompt templates
246 | - `requirements.txt` - Dependencies
247 | - `README_playground.md` - Usage guide
248 | - Sample output demonstrating all features
249 | 
250 | ---
251 | 
252 | ## Expected Output Section
253 | 
254 | ### Task 1 Expected Output:
255 | ```
256 | === Prompt Comparison ===
257 | 
258 | Original: "Tell me about machine learning"
259 | Improved: "Explain machine learning in 3 paragraphs, covering:
260 | 1. What it is
261 | 2. Common applications
262 | 3. Key algorithms"
263 | 
264 | Results:
265 | - Original: Generic, unfocused response (156 tokens)
266 | - Improved: Structured, comprehensive response (203 tokens)
267 | - Quality: Improved version is 40% more informative
268 | ```
269 | 
270 | ### Task 2 Expected Output:
271 | ```
272 | === Role-Based Prompts ===
273 | Input: "How do I learn Python?"
274 | 
275 | Coding Tutor:
276 | "Start with basics: variables, data types, and functions.
277 | Practice daily with small projects..."
278 | 
279 | Business Analyst:
280 | "Python is valuable for data analysis. Focus on pandas
281 | and data visualization libraries..."
282 | 
283 | [Different perspectives based on role]
284 | ```
285 | 
286 | ### Task 4 Expected Output:
287 | ```
288 | === Chain-of-Thought Solver ===
289 | Problem: "If 3 apples cost $2, how much do 9 apples cost?"
290 | 
291 | Step 1: Identify what we need to find
292 | → Cost of 9 apples
293 | 
294 | Step 2: Find cost per apple
295 | → $2 ÷ 3 ≈ $0.67 per apple
296 | 
297 | Step 3: Calculate cost of 9 apples
298 | → ($2 ÷ 3) × 9 = $2 × 3 = $6
299 | 
300 | Final Answer: 9 apples cost $6
301 | ```
302 | 
303 | ### Task 5 Expected Output:
304 | ```python
305 | builder = RAGPromptBuilder()
306 | builder.add_context([
307 |     "RAG combines retrieval and generation...",
308 |     "Vector databases store embeddings..."
309 | ])
310 | builder.set_question("How does RAG work?")
311 | builder.set_format("bullet_points")
312 | builder.enable_citations(True)
313 | 
314 | Generated Prompt:
315 | """
316 | You are a helpful assistant...
317 | 
318 | Documents:
319 | [Document 1]
320 | RAG combines retrieval and generation...
321 | 
322 | [Document 2]
323 | Vector databases store embeddings...
324 | 
325 | Question: How does RAG work?
326 | 
327 | Instructions:
328 | - Answer using the documents above
329 | - Cite your sources
330 | - Format as bullet points
331 | ... 
332 | """ 333 | ``` 334 | 335 | ### Mini Project Expected Output: 336 | 337 | The playground should provide a comprehensive, user-friendly interface for experimenting with prompts: 338 | 339 | - Intuitive menu navigation 340 | - Real-time prompt testing 341 | - Side-by-side comparisons 342 | - Quality metrics and analysis 343 | - Export capabilities 344 | - Professional presentation 345 | 346 | **Example session:** 347 | ``` 348 | === Prompt Engineering Playground === 349 | 1. Basic Prompt Tester 350 | ... 351 | Choose: 6 352 | 353 | === Prompt Comparison Tool === 354 | Enter prompt 1: Explain AI simply 355 | Enter prompt 2: You are a teacher. Explain AI to a 10-year-old. 356 | 357 | [Testing both prompts...] 358 | 359 | Results: 360 | Prompt 1: 145 tokens, Generic explanation 361 | Prompt 2: 167 tokens, Age-appropriate, engaging explanation 362 | 363 | Winner: Prompt 2 (Better engagement, clearer structure) 364 | ``` 365 | 366 | --- 367 | 368 | ## Submission Checklist 369 | 370 | - [ ] Task 1: Prompt improvements completed and tested 371 | - [ ] Task 2: Role-based system functional 372 | - [ ] Task 3: Few-shot templates created 373 | - [ ] Task 4: CoT solver working 374 | - [ ] Task 5: RAG prompt builder implemented 375 | - [ ] Mini project: Full playground application 376 | - [ ] All prompts tested with API 377 | - [ ] Results documented and compared 378 | - [ ] Code is well-organized and commented 379 | 380 | **Remember:** Good prompts are the foundation of great RAG systems! 381 | 382 | **Good luck!** 🚀 383 | 384 | -------------------------------------------------------------------------------- /Day-03: Prompt Engineering Essentials/README.md: -------------------------------------------------------------------------------- 1 | # Day 3 — Prompt Engineering Essentials 2 | 3 | ## 1. Beginner-Friendly Introduction 4 | 5 | **Prompt Engineering** is the art and science of crafting instructions that get the best results from LLMs. Think of it as learning to communicate effectively with AI—the better your prompts, the better the AI's responses. 6 | 7 | **Why this matters for RAG:** 8 | - RAG systems rely heavily on well-crafted prompts 9 | - You'll need to prompt LLMs to answer questions based on retrieved context 10 | - Effective prompts improve answer quality and reduce hallucinations 11 | - Prompt engineering is a core skill for building production RAG applications 12 | 13 | **Real-world context:** 14 | Imagine asking a librarian a vague question vs. a specific one: 15 | - ❌ "Tell me about space" (too vague) 16 | - ✅ "What are the key differences between stars and planets? Explain in 3 bullet points." (specific and clear) 17 | 18 | Prompt engineering is about being that specific librarian question—clear, structured, and goal-oriented. 19 | 20 | --- 21 | 22 | ## 2. Deep-Dive Explanation 23 | 24 | ### 2.1 What is a Prompt? 25 | 26 | A **prompt** is the input text you send to an LLM. It can be: 27 | - A simple question 28 | - Instructions with examples 29 | - A conversation history 30 | - Structured templates 31 | 32 | **Prompt Structure:** 33 | ``` 34 | [System Message] + [Context] + [User Question] + [Format Instructions] 35 | ``` 36 | 37 | ### 2.2 Core Prompt Engineering Techniques 38 | 39 | #### 2.2.1 Be Specific and Clear 40 | 41 | **Bad:** 42 | ``` 43 | Tell me about Python. 44 | ``` 45 | 46 | **Good:** 47 | ``` 48 | Explain Python programming language in 3 sentences, focusing on: 49 | 1. What it's used for 50 | 2. Key features 51 | 3. 
Why it's popular for AI 52 | ``` 53 | 54 | #### 2.2.2 Use Role-Playing 55 | 56 | Assign a role to the AI: 57 | ``` 58 | You are an expert Python tutor. Explain variables to a beginner programmer. 59 | ``` 60 | 61 | #### 2.2.3 Provide Examples (Few-Shot Learning) 62 | 63 | Show the AI what you want: 64 | ``` 65 | Example 1: 66 | Input: "Python is easy" 67 | Output: "Python is beginner-friendly" 68 | 69 | Example 2: 70 | Input: "AI is powerful" 71 | Output: "AI has transformative capabilities" 72 | 73 | Now convert: "RAG is useful" 74 | ``` 75 | 76 | #### 2.2.4 Chain-of-Thought (CoT) 77 | 78 | Encourage step-by-step reasoning: 79 | ``` 80 | Solve: 15 * 23 81 | 82 | Let's think step by step: 83 | 1. First, multiply 15 by 20 = 300 84 | 2. Then, multiply 15 by 3 = 45 85 | 3. Finally, add 300 + 45 = 345 86 | ``` 87 | 88 | #### 2.2.5 Output Formatting 89 | 90 | Specify the format you want: 91 | ``` 92 | List 5 programming languages. Format as JSON: 93 | { 94 | "languages": [ 95 | {"name": "...", "year": "..."} 96 | ] 97 | } 98 | ``` 99 | 100 | ### 2.3 Prompt Patterns for RAG 101 | 102 | #### 2.3.1 Context Injection Pattern 103 | 104 | ``` 105 | Use the following context to answer the question: 106 | 107 | Context: 108 | {retrieved_documents} 109 | 110 | Question: {user_question} 111 | 112 | Answer based only on the provided context. If the context doesn't contain enough information, say "I don't have enough information." 113 | ``` 114 | 115 | #### 2.3.2 Answer with Citations 116 | 117 | ``` 118 | Based on the following documents, answer the question and cite your sources: 119 | 120 | Documents: 121 | {document_1} 122 | {document_2} 123 | 124 | Question: {question} 125 | 126 | Format your answer as: 127 | Answer: [your answer] 128 | Sources: [document numbers] 129 | ``` 130 | 131 | #### 2.3.3 Multi-Step Reasoning 132 | 133 | ``` 134 | Given the context below, follow these steps: 135 | 1. Identify key information 136 | 2. Analyze the relationships 137 | 3. Synthesize an answer 138 | 139 | Context: {context} 140 | Question: {question} 141 | ``` 142 | 143 | ### 2.4 Common Prompt Mistakes 144 | 145 | **Mistake 1: Being Too Vague** 146 | - ❌ "Explain this" 147 | - ✅ "Summarize the main points in 3 bullet points" 148 | 149 | **Mistake 2: Not Providing Context** 150 | - ❌ Asking about specific documents without including them 151 | - ✅ Including relevant context in the prompt 152 | 153 | **Mistake 3: Ambiguous Instructions** 154 | - ❌ "Make it better" 155 | - ✅ "Rewrite this sentence to be more concise and professional" 156 | 157 | **Mistake 4: Ignoring Token Limits** 158 | - ❌ Including too much context 159 | - ✅ Being selective about what context to include 160 | 161 | ### 2.5 Prompt Templates 162 | 163 | **Template 1: Question Answering** 164 | ``` 165 | Context: {context} 166 | 167 | Question: {question} 168 | 169 | Instructions: 170 | - Answer based only on the provided context 171 | - If the answer isn't in the context, say so 172 | - Be concise but complete 173 | ``` 174 | 175 | **Template 2: Summarization** 176 | ``` 177 | Summarize the following text in {number} sentences: 178 | 179 | {text} 180 | 181 | Focus on: {key_points} 182 | ``` 183 | 184 | **Template 3: Extraction** 185 | ``` 186 | Extract the following information from the text: 187 | - Names 188 | - Dates 189 | - Key facts 190 | 191 | Text: {text} 192 | 193 | Format as JSON. 194 | ``` 195 | 196 | --- 197 | 198 | ## 3. 
Instructor Examples
199 | 
200 | ### Example 1: Basic Prompt with Context
201 | ```python
202 | import openai  # reads OPENAI_API_KEY from your environment (see Day 2); shared by all examples below
203 | 
204 | def answer_with_context(context, question):
205 |     """Answer a question using provided context"""
206 |     prompt = f"""Use the following information to answer the question.
207 | 
208 | Information:
209 | {context}
210 | 
211 | Question: {question}
212 | 
213 | Answer the question based only on the information provided above.
214 | If the information doesn't contain the answer, say "I don't have enough information."
215 | """
216 |     response = openai.ChatCompletion.create(
217 |         model="gpt-3.5-turbo",
218 |         messages=[{"role": "user", "content": prompt}],
219 |         temperature=0.3  # Lower temperature for factual answers
220 |     )
221 | 
222 |     return response.choices[0].message.content
223 | 
224 | # Usage
225 | context = "Python was created by Guido van Rossum in 1991."
226 | question = "Who created Python?"
227 | answer = answer_with_context(context, question)
228 | print(answer)
229 | ```
230 | 
231 | ### Example 2: Few-Shot Learning
232 | 
233 | ```python
234 | def classify_sentiment_fewshot(text):
235 |     """Classify sentiment using few-shot examples"""
236 |     prompt = f"""Classify the sentiment of the following text as positive, negative, or neutral.
237 | 
238 | Examples:
239 | Text: "I love this product!"
240 | Sentiment: positive
241 | 
242 | Text: "This is terrible."
243 | Sentiment: negative
244 | 
245 | Text: "The weather is okay."
246 | Sentiment: neutral
247 | 
248 | Now classify:
249 | Text: "{text}"
250 | Sentiment:"""
251 | 
252 |     response = openai.ChatCompletion.create(
253 |         model="gpt-3.5-turbo",
254 |         messages=[{"role": "user", "content": prompt}],
255 |         temperature=0.1  # Very low for classification
256 |     )
257 | 
258 |     return response.choices[0].message.content.strip()
259 | 
260 | # Usage
261 | result = classify_sentiment_fewshot("This movie was amazing!")
262 | print(result)  # positive
263 | ```
264 | 
265 | ### Example 3: Chain-of-Thought Prompting
266 | 
267 | ```python
268 | def solve_problem_cot(problem):
269 |     """Solve a problem using chain-of-thought reasoning"""
270 |     prompt = f"""Solve the following problem step by step.
271 | 
272 | Problem: {problem}
273 | 
274 | Let's think through this step by step:
275 | 1. First, identify what we need to find
276 | 2. List the information we have
277 | 3. Determine the approach
278 | 4. Solve step by step
279 | 5. Verify the answer
280 | 
281 | Solution:"""
282 | 
283 |     response = openai.ChatCompletion.create(
284 |         model="gpt-3.5-turbo",
285 |         messages=[{"role": "user", "content": prompt}],
286 |         temperature=0.3,
287 |         max_tokens=300
288 |     )
289 | 
290 |     return response.choices[0].message.content
291 | 
292 | # Usage
293 | problem = "If a train travels 120 km in 2 hours, what's its average speed?"
294 | solution = solve_problem_cot(problem)
295 | print(solution)
296 | ```
297 | 
298 | ### Example 4: RAG-Style Prompt Template
299 | 
300 | ```python
301 | def rag_prompt_template(context_chunks, question):
302 |     """Create a RAG-style prompt with multiple context chunks"""
303 |     context_text = "\n\n".join([
304 |         f"[Document {i+1}]\n{chunk}"
305 |         for i, chunk in enumerate(context_chunks)
306 |     ])
307 | 
308 |     prompt = f"""You are a helpful assistant that answers questions based on provided documents.
309 | 
310 | Documents:
311 | {context_text}
312 | 
313 | Question: {question}
314 | 
315 | Instructions:
316 | 1. Answer the question using information from the documents above
317 | 2. 
If multiple documents are relevant, synthesize information from all of them 318 | 3. Cite which document(s) you used (e.g., "According to Document 1...") 319 | 4. If the documents don't contain enough information, say so clearly 320 | 5. Be specific and accurate 321 | 322 | Answer:""" 323 | 324 | return prompt 325 | 326 | # Usage 327 | chunks = [ 328 | "Python is a programming language created in 1991.", 329 | "RAG stands for Retrieval-Augmented Generation." 330 | ] 331 | question = "What is Python?" 332 | prompt = rag_prompt_template(chunks, question) 333 | # Use this prompt with OpenAI API 334 | ``` 335 | 336 | --- 337 | 338 | ## 4. Student Practice Tasks 339 | 340 | ### Task 1: Basic Prompt Improvement 341 | Take these vague prompts and rewrite them to be specific and clear: 342 | - "Tell me about AI" 343 | - "Explain this code" 344 | - "What should I do?" 345 | 346 | ### Task 2: Role-Playing Prompts 347 | Create prompts that assign different roles to the AI: 348 | - A coding tutor 349 | - A business consultant 350 | - A creative writer 351 | - A data analyst 352 | 353 | ### Task 3: Few-Shot Examples 354 | Create a few-shot prompt for: 355 | - Classifying emails as spam/not spam 356 | - Converting text to a specific format 357 | - Extracting key information 358 | 359 | ### Task 4: Chain-of-Thought 360 | Write a CoT prompt for: 361 | - Solving a math word problem 362 | - Debugging code 363 | - Making a decision 364 | 365 | ### Task 5: RAG Prompt Template 366 | Create a reusable RAG prompt template function that: 367 | - Takes context and question 368 | - Includes instructions 369 | - Specifies output format 370 | - Handles cases where context is insufficient 371 | 372 | ### Task 6: Prompt Comparison 373 | Test the same question with: 374 | - A basic prompt 375 | - An improved prompt with context 376 | - A prompt with examples 377 | Compare the quality of responses. 378 | 379 | --- 380 | 381 | ## 5. Summary / Key Takeaways 382 | 383 | - **Be specific**: Clear, detailed prompts get better results 384 | - **Use roles**: Assigning roles helps guide AI behavior 385 | - **Few-shot learning**: Examples teach the AI what you want 386 | - **Chain-of-thought**: Encourages step-by-step reasoning 387 | - **Format instructions**: Specify the output format you need 388 | - **Context matters**: In RAG, always include relevant context 389 | - **Temperature settings**: Lower for factual, higher for creative 390 | - **Iterate**: Prompt engineering is iterative—refine based on results 391 | - **Test variations**: Try different phrasings to find what works best 392 | 393 | --- 394 | 395 | ## 6. Further Reading (Optional) 396 | 397 | - OpenAI Prompt Engineering Guide 398 | - "Prompt Engineering for LLMs" by Lilian Weng 399 | - LangChain Prompt Templates documentation 400 | - Anthropic's Prompt Engineering resources 401 | 402 | --- 403 | 404 | **Next up:** Day 4 will teach you how to extract and chunk data from various sources! 405 | 406 | -------------------------------------------------------------------------------- /Day-10: Build & Deploy a RAG Application (FastAPI-Streamlit)/README.md: -------------------------------------------------------------------------------- 1 | # Day 10 — Build & Deploy a RAG Application (FastAPI/Streamlit) 2 | 3 | ## 1. Beginner-Friendly Introduction 4 | 5 | Congratulations! You've reached the final day. Today, you'll build and deploy a complete RAG application that others can use. You'll create a web interface and API so your RAG system is accessible and production-ready. 
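
To see the destination before you start, here is a minimal sketch of how any client will talk to the finished system. It calls the `/query` endpoint you will build in Example 1 below; the `localhost` URL, the port, and the `main` module name are assumptions for a local run:

```python
import requests

# Query the finished RAG API. Assumes the FastAPI backend from Example 1
# is running locally, e.g. via: uvicorn main:app --port 8000
response = requests.post(
    "http://localhost:8000/query",
    json={"question": "What is RAG?", "k": 3},  # matches the QueryRequest model below
)

result = response.json()
print(result["answer"])        # the generated answer
print(len(result["sources"]))  # how many source chunks were retrieved
```

Everything in today's lesson works toward making that single call possible.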
6 | 7 | **What you'll build:** 8 | 9 | - **FastAPI Backend**: REST API for your RAG system 10 | - **Streamlit Frontend**: User-friendly web interface 11 | - **Deployment**: Make it accessible to others 12 | 13 | **Why this matters:** 14 | 15 | - Real applications need interfaces 16 | - APIs allow integration with other systems 17 | - Web interfaces make systems accessible 18 | - Deployment makes your work usable 19 | 20 | **Real-world context:** 21 | Your RAG system is powerful, but it's just code. To make it useful, you need: 22 | 23 | - A way for users to interact (web UI) 24 | - A way for other systems to use it (API) 25 | - A way to access it from anywhere (deployment) 26 | 27 | --- 28 | 29 | ## 2. Deep-Dive Explanation 30 | 31 | ### 2.1 FastAPI for RAG Backend 32 | 33 | **What is FastAPI?** 34 | A modern Python web framework for building APIs: 35 | 36 | - Fast and performant 37 | - Automatic API documentation 38 | - Type hints support 39 | - Easy to use 40 | 41 | **Why FastAPI for RAG?** 42 | 43 | - Handles async operations well 44 | - Great for ML/AI applications 45 | - Automatic validation 46 | - Easy to deploy 47 | 48 | **Key Components:** 49 | 50 | - **Routes**: API endpoints 51 | - **Models**: Request/response schemas 52 | - **Dependencies**: Reusable components 53 | - **Middleware**: Cross-cutting concerns 54 | 55 | ### 2.2 Streamlit for RAG Frontend 56 | 57 | **What is Streamlit?** 58 | A Python framework for building web apps: 59 | 60 | - Simple and intuitive 61 | - Great for data/ML apps 62 | - No frontend knowledge needed 63 | - Fast development 64 | 65 | **Why Streamlit for RAG?** 66 | 67 | - Perfect for interactive AI apps 68 | - Easy to add file uploads 69 | - Simple chat interfaces 70 | - Quick prototyping 71 | 72 | **Key Components:** 73 | 74 | - **Widgets**: Inputs, buttons, displays 75 | - **Layout**: Organize your UI 76 | - **State**: Manage app state 77 | - **Session**: User sessions 78 | 79 | ### 2.3 Application Architecture 80 | 81 | **Complete System:** 82 | 83 | ``` 84 | ┌─────────────┐ 85 | │ Streamlit │ User Interface 86 | │ Frontend │ 87 | └──────┬──────┘ 88 | │ HTTP 89 | ▼ 90 | ┌─────────────┐ 91 | │ FastAPI │ REST API 92 | │ Backend │ 93 | └──────┬──────┘ 94 | │ 95 | ▼ 96 | ┌─────────────┐ 97 | │ RAG System │ Your RAG Pipeline 98 | └─────────────┘ 99 | ``` 100 | 101 | ### 2.4 API Design 102 | 103 | **Essential Endpoints:** 104 | 105 | - `POST /index` - Index documents 106 | - `POST /query` - Query RAG system 107 | - `GET /health` - Health check 108 | - `GET /stats` - System statistics 109 | - `DELETE /documents/{id}` - Remove document 110 | 111 | **Request/Response Models:** 112 | 113 | - Structured data 114 | - Validation 115 | - Type safety 116 | - Documentation 117 | 118 | ### 2.5 Deployment Options 119 | 120 | **Local Deployment:** 121 | 122 | - Run on your machine 123 | - Access via localhost 124 | - Good for testing 125 | 126 | **Cloud Deployment:** 127 | 128 | - **Heroku**: Easy, free tier 129 | - **Railway**: Simple deployment 130 | - **Render**: Free hosting 131 | - **AWS/GCP/Azure**: Production scale 132 | 133 | **Containerization:** 134 | 135 | - Docker for packaging 136 | - Easy deployment 137 | - Consistent environment 138 | 139 | --- 140 | 141 | ## 3. 
Instructor Examples 142 | 143 | ### Example 1: FastAPI RAG Backend 144 | 145 | ```python 146 | from fastapi import FastAPI, UploadFile, File, HTTPException 147 | from pydantic import BaseModel 148 | from typing import List, Optional 149 | import os 150 | 151 | app = FastAPI(title="RAG API", version="1.0.0") 152 | 153 | # Your RAG system (from previous days) 154 | from rag_system import RAGSystem 155 | 156 | rag = RAGSystem() 157 | 158 | # Request/Response Models 159 | class QueryRequest(BaseModel): 160 | question: str 161 | k: Optional[int] = 3 162 | 163 | class QueryResponse(BaseModel): 164 | answer: str 165 | sources: List[dict] 166 | processing_time: float 167 | 168 | class IndexResponse(BaseModel): 169 | message: str 170 | chunks_indexed: int 171 | 172 | # API Endpoints 173 | @app.get("/") 174 | async def root(): 175 | return {"message": "RAG API is running"} 176 | 177 | @app.get("/health") 178 | async def health(): 179 | return {"status": "healthy"} 180 | 181 | @app.post("/index", response_model=IndexResponse) 182 | async def index_document(file: UploadFile = File(...)): 183 | """Index a document""" 184 | try: 185 | # Save uploaded file 186 | file_path = f"temp/{file.filename}" 187 | os.makedirs("temp", exist_ok=True) 188 | 189 | with open(file_path, "wb") as f: 190 | content = await file.read() 191 | f.write(content) 192 | 193 | # Index document 194 | chunks = rag.index_document(file_path) 195 | 196 | # Clean up 197 | os.remove(file_path) 198 | 199 | return IndexResponse( 200 | message=f"Document indexed successfully", 201 | chunks_indexed=chunks 202 | ) 203 | except Exception as e: 204 | raise HTTPException(status_code=500, detail=str(e)) 205 | 206 | @app.post("/query", response_model=QueryResponse) 207 | async def query_rag(request: QueryRequest): 208 | """Query the RAG system""" 209 | import time 210 | start_time = time.time() 211 | 212 | try: 213 | result = rag.query(request.question, k=request.k) 214 | processing_time = time.time() - start_time 215 | 216 | return QueryResponse( 217 | answer=result["answer"], 218 | sources=result["sources"], 219 | processing_time=processing_time 220 | ) 221 | except Exception as e: 222 | raise HTTPException(status_code=500, detail=str(e)) 223 | 224 | @app.get("/stats") 225 | async def get_stats(): 226 | """Get system statistics""" 227 | stats = rag.get_statistics() 228 | return stats 229 | 230 | if __name__ == "__main__": 231 | import uvicorn 232 | uvicorn.run(app, host="0.0.0.0", port=8000) 233 | ``` 234 | 235 | ### Example 2: Streamlit RAG Frontend 236 | 237 | ```python 238 | import streamlit as st 239 | import requests 240 | import time 241 | 242 | # API URL 243 | API_URL = "http://localhost:8000" 244 | 245 | # Page config 246 | st.set_page_config( 247 | page_title="RAG Application", 248 | page_icon="🤖", 249 | layout="wide" 250 | ) 251 | 252 | # Title 253 | st.title("🤖 RAG Application") 254 | st.markdown("Ask questions about your documents!") 255 | 256 | # Sidebar 257 | with st.sidebar: 258 | st.header("📄 Document Management") 259 | 260 | # File upload 261 | uploaded_file = st.file_uploader( 262 | "Upload a document", 263 | type=["pdf", "txt"], 264 | help="Upload PDF or TXT files" 265 | ) 266 | 267 | if uploaded_file: 268 | if st.button("Index Document"): 269 | with st.spinner("Indexing document..."): 270 | files = {"file": uploaded_file.getvalue()} 271 | response = requests.post( 272 | f"{API_URL}/index", 273 | files=files 274 | ) 275 | 276 | if response.status_code == 200: 277 | result = response.json() 278 | st.success(f"✅ 
{result['message']}")
279 |                     st.info(f"Indexed {result['chunks_indexed']} chunks")
280 |                 else:
281 |                     st.error("Failed to index document")
282 | 
283 |     # Statistics
284 |     if st.button("View Statistics"):
285 |         response = requests.get(f"{API_URL}/stats")
286 |         if response.status_code == 200:
287 |             stats = response.json()
288 |             st.json(stats)
289 | 
290 | # Main area
291 | st.header("💬 Ask a Question")
292 | 
293 | # Query input
294 | question = st.text_input(
295 |     "Enter your question:",
296 |     placeholder="What is the main topic?",
297 |     key="question_input"
298 | )
299 | 
300 | # K value
301 | k = st.slider("Number of sources:", 1, 10, 3)
302 | 
303 | # Query button
304 | if st.button("Ask", type="primary") and question:
305 |     with st.spinner("Thinking..."):
306 |         # Make API request
307 |         response = requests.post(
308 |             f"{API_URL}/query",
309 |             json={"question": question, "k": k}
310 |         )
311 | 
312 |         if response.status_code == 200:
313 |             result = response.json()
314 | 
315 |             # Display answer
316 |             st.subheader("📝 Answer")
317 |             st.write(result["answer"])
318 | 
319 |             # Display sources (only format the score if the API returned one)
320 |             st.subheader("📚 Sources")
321 |             for i, source in enumerate(result["sources"], 1):
322 |                 with st.expander(f"Source {i} (Similarity: {source['similarity']:.2f})" if "similarity" in source else f"Source {i}"):
323 |                     st.write(source.get("text", ""))
324 |                     if "metadata" in source:
325 |                         st.caption(f"Source: {source['metadata']}")
326 | 
327 |             st.session_state.setdefault("chat_history", []).append((question, result["answer"]))  # save for history below
328 |             st.caption(f"⏱️ Processed in {result['processing_time']:.2f} seconds")
329 |         else:
330 |             st.error("Failed to get answer")
331 | 
332 | # Chat history (populated above after each successful query)
333 | if "chat_history" not in st.session_state:
334 |     st.session_state.chat_history = []
335 | 
336 | # Display history
337 | if st.session_state.chat_history:
338 |     st.header("💭 Chat History")
339 |     for i, (q, a) in enumerate(reversed(st.session_state.chat_history[-5:]), 1):
340 |         with st.expander(f"Q{i}: {q}"):
341 |             st.write(a)
342 | ```
343 | 
344 | ### Example 3: Complete Deployment Setup
345 | 
346 | ```
347 | # requirements.txt
348 | fastapi==0.104.1
349 | uvicorn==0.24.0
350 | streamlit==1.28.0
351 | openai==1.3.0
352 | chromadb==0.4.15
353 | pypdf==3.17.0
354 | requests==2.31.0
355 | python-multipart==0.0.6
356 | 
357 | # Dockerfile
358 | FROM python:3.10-slim
359 | 
360 | WORKDIR /app
361 | 
362 | COPY requirements.txt .
363 | RUN pip install --no-cache-dir -r requirements.txt
364 | 
365 | COPY . .
366 | 
367 | EXPOSE 8000
368 | 
369 | CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
370 | 
371 | # docker-compose.yml
372 | version: '3.8'
373 | services:
374 |   api:
375 |     build: .
376 |     ports:
377 |       - "8000:8000"
378 |     environment:
379 |       - OPENAI_API_KEY=${OPENAI_API_KEY}
380 |     volumes:
381 |       - ./data:/app/data
382 | ```
383 | 
384 | ---
385 | 
386 | ## 4. 
Student Practice Tasks 387 | 388 | ### Task 1: FastAPI Backend 389 | 390 | Build a FastAPI backend with: 391 | 392 | - Document indexing endpoint 393 | - Query endpoint 394 | - Health check 395 | - Statistics endpoint 396 | 397 | ### Task 2: Streamlit Frontend 398 | 399 | Create a Streamlit UI with: 400 | 401 | - File upload 402 | - Query interface 403 | - Answer display 404 | - Source display 405 | 406 | ### Task 3: Integration 407 | 408 | Connect Streamlit to FastAPI: 409 | 410 | - Make API calls 411 | - Handle errors 412 | - Display results 413 | - Add loading states 414 | 415 | ### Task 4: Error Handling 416 | 417 | Add comprehensive error handling: 418 | 419 | - API errors 420 | - File errors 421 | - Validation errors 422 | - User-friendly messages 423 | 424 | ### Task 5: Deployment 425 | 426 | Deploy your application: 427 | 428 | - Local deployment 429 | - Cloud deployment (choose one) 430 | - Docker containerization 431 | - Environment variables 432 | 433 | ### Task 6: Documentation 434 | 435 | Create documentation: 436 | 437 | - API documentation 438 | - User guide 439 | - Deployment instructions 440 | - README file 441 | 442 | --- 443 | 444 | ## 5. Summary / Key Takeaways 445 | 446 | - **FastAPI** provides a fast, modern API framework 447 | - **Streamlit** makes building UIs simple 448 | - **APIs** enable integration with other systems 449 | - **Web interfaces** make systems accessible 450 | - **Deployment** makes your work usable 451 | - **Docker** simplifies deployment 452 | - **Error handling** is crucial for production 453 | - **Documentation** helps users 454 | - **Testing** ensures reliability 455 | - **You've built a complete RAG system!** 🎉 456 | 457 | --- 458 | 459 | ## 6. Further Reading (Optional) 460 | 461 | - FastAPI Documentation 462 | - Streamlit Documentation 463 | - Docker Documentation 464 | - Deployment guides (Heroku, Railway, Render) 465 | - API design best practices 466 | 467 | --- 468 | 469 | **Congratulations on completing the 10-day RAG roadmap!** 🎊 470 | 471 | You now have: 472 | 473 | - ✅ Python foundations 474 | - ✅ LLM understanding 475 | - ✅ Prompt engineering skills 476 | - ✅ Data extraction capabilities 477 | - ✅ Embedding knowledge 478 | - ✅ RAG system building 479 | - ✅ Framework experience 480 | - ✅ Advanced techniques 481 | - ✅ Deployment skills 482 | 483 | **Keep building and learning!** 🚀 484 | -------------------------------------------------------------------------------- /Day-05: Embeddings & Vector Databases/README.md: -------------------------------------------------------------------------------- 1 | # Day 5 — Embeddings & Vector Databases 2 | 3 | ## 1. Beginner-Friendly Introduction 4 | 5 | Today, you'll learn about **embeddings** and **vector databases**—the technology that makes RAG retrieval possible. This is where the magic happens: converting text into numbers that capture meaning, and storing them so we can find similar content quickly. 6 | 7 | **What are Embeddings?** 8 | Embeddings are numerical representations of text that capture semantic meaning. Similar texts have similar embeddings (close numbers), allowing computers to understand meaning mathematically. 9 | 10 | **Why this matters for RAG:** 11 | - Embeddings convert text chunks into searchable vectors 12 | - Vector databases store and retrieve similar content efficiently 13 | - When you ask a question, we find the most relevant chunks using similarity search 14 | - This is the "Retrieval" part of RAG! 
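
To build intuition for what "similar embeddings" means, here is a minimal sketch using made-up 3-dimensional vectors; real embedding models (covered below) produce vectors with hundreds or thousands of dimensions, but the comparison works exactly the same way:

```python
import numpy as np

# Toy "embeddings": hand-picked 3-dimensional vectors, purely for intuition
dog   = np.array([0.9, 0.1, 0.0])
puppy = np.array([0.8, 0.2, 0.1])
car   = np.array([0.0, 0.9, 0.4])

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction (same meaning), near 0 = unrelated."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(f"dog vs puppy: {cosine(dog, puppy):.2f}")  # ~0.98 -> very similar
print(f"dog vs car:   {cosine(dog, car):.2f}")    # ~0.10 -> unrelated
```

The closer the score is to 1, the closer the meanings; running this comparison over thousands of stored vectors is exactly what a vector database does during retrieval.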
15 | 16 | **Real-world context:** 17 | Imagine a library where books are organized by meaning, not alphabetically. When you ask "Tell me about dogs," the system finds all books about dogs, even if they don't contain the exact word "dogs" (maybe they say "canines" or "pets"). Embeddings make this possible! 18 | 19 | --- 20 | 21 | ## 2. Deep-Dive Explanation 22 | 23 | ### 2.1 What are Embeddings? 24 | 25 | **Text Embeddings** are dense vectors (arrays of numbers) that represent text in a high-dimensional space. 26 | 27 | **Key Properties:** 28 | - **Semantic similarity**: Similar meanings → similar vectors 29 | - **Fixed dimensions**: Each embedding has the same length (e.g., 1536 for OpenAI) 30 | - **Dense**: Most values are non-zero (unlike sparse representations) 31 | 32 | **Example:** 33 | ``` 34 | "dog" → [0.2, -0.5, 0.8, ..., 0.1] (1536 numbers) 35 | "puppy" → [0.19, -0.48, 0.79, ..., 0.12] (very similar!) 36 | "car" → [-0.3, 0.6, -0.2, ..., -0.5] (very different!) 37 | ``` 38 | 39 | ### 2.2 How Embeddings Work 40 | 41 | **The Process:** 42 | ``` 43 | Text → Embedding Model → Vector (Array of Numbers) 44 | ``` 45 | 46 | **Embedding Models:** 47 | - **OpenAI**: `text-embedding-ada-002` or `text-embedding-3-small` 48 | - **Sentence Transformers**: Open-source alternatives 49 | - **Custom models**: Trained on specific domains 50 | 51 | **Dimensions:** 52 | - OpenAI ada-002: 1536 dimensions 53 | - OpenAI 3-small: 1536 dimensions 54 | - Sentence-BERT: 384 or 768 dimensions 55 | - More dimensions = more detail (but slower, more storage) 56 | 57 | ### 2.3 Similarity Search 58 | 59 | **Cosine Similarity:** 60 | Measures the angle between two vectors (0 to 1): 61 | - 1.0 = Identical meaning 62 | - 0.9 = Very similar 63 | - 0.5 = Somewhat related 64 | - 0.0 = Unrelated 65 | 66 | **Formula (simplified):** 67 | ``` 68 | similarity = dot_product(vec1, vec2) / (magnitude(vec1) * magnitude(vec2)) 69 | ``` 70 | 71 | **Why Cosine Similarity?** 72 | - Focuses on direction, not magnitude 73 | - Works well for text embeddings 74 | - Range: -1 to 1 (usually 0 to 1 for normalized embeddings) 75 | 76 | ### 2.4 Vector Databases 77 | 78 | **What is a Vector Database?** 79 | A specialized database optimized for storing and searching high-dimensional vectors. 80 | 81 | **Key Features:** 82 | - Fast similarity search 83 | - Handles millions of vectors 84 | - Supports metadata filtering 85 | - Efficient indexing (ANN - Approximate Nearest Neighbor) 86 | 87 | **Popular Vector Databases:** 88 | - **ChromaDB**: Simple, Python-native 89 | - **Pinecone**: Cloud-based, scalable 90 | - **Weaviate**: Open-source, feature-rich 91 | - **Qdrant**: Fast, Rust-based 92 | - **FAISS**: Facebook's library (not a full DB) 93 | 94 | ### 2.5 ChromaDB Basics 95 | 96 | **Why ChromaDB for Learning?** 97 | - Easy to use 98 | - No external services needed 99 | - Perfect for prototyping 100 | - Python-native 101 | 102 | **Core Concepts:** 103 | - **Collection**: Container for vectors and metadata 104 | - **Documents**: Your text chunks 105 | - **Embeddings**: Vector representations 106 | - **Metadata**: Additional info (source, page, etc.) 107 | 108 | **Basic Operations:** 109 | 1. Create collection 110 | 2. Add documents (auto-generates embeddings) 111 | 3. Query for similar documents 112 | 4. 
Retrieve with metadata 113 | 114 | ### 2.6 The Embedding Pipeline 115 | 116 | **Complete Flow:** 117 | ``` 118 | Text Chunks → Embedding Model → Vectors → Vector DB → Index 119 | ↓ 120 | Query → Embedding Model → Query Vector → Similarity Search → Top K Results 121 | ``` 122 | 123 | **Steps:** 124 | 1. **Chunk documents** (from Day 4) 125 | 2. **Generate embeddings** for each chunk 126 | 3. **Store in vector DB** with metadata 127 | 4. **Query**: Convert question to embedding 128 | 5. **Search**: Find most similar chunks 129 | 6. **Retrieve**: Get top K results 130 | 131 | --- 132 | 133 | ## 3. Instructor Examples 134 | 135 | ### Example 1: Generating Embeddings with OpenAI 136 | 137 | ```python 138 | import openai 139 | import os 140 | 141 | openai.api_key = os.getenv("OPENAI_API_KEY") 142 | 143 | def get_embedding(text, model="text-embedding-ada-002"): 144 | """Generate embedding for text""" 145 | text = text.replace("\n", " ") # Replace newlines 146 | 147 | response = openai.Embedding.create( 148 | model=model, 149 | input=text 150 | ) 151 | 152 | return response['data'][0]['embedding'] 153 | 154 | # Usage 155 | text = "Python is a programming language" 156 | embedding = get_embedding(text) 157 | print(f"Embedding dimension: {len(embedding)}") # 1536 158 | print(f"First 5 values: {embedding[:5]}") 159 | ``` 160 | 161 | ### Example 2: Batch Embedding Generation 162 | 163 | ```python 164 | def get_embeddings_batch(texts, model="text-embedding-ada-002"): 165 | """Generate embeddings for multiple texts""" 166 | # Clean texts 167 | texts = [text.replace("\n", " ") for text in texts] 168 | 169 | response = openai.Embedding.create( 170 | model=model, 171 | input=texts 172 | ) 173 | 174 | # Extract embeddings 175 | embeddings = [item['embedding'] for item in response['data']] 176 | return embeddings 177 | 178 | # Usage 179 | texts = [ 180 | "Python is a programming language", 181 | "Dogs are loyal pets", 182 | "Machine learning uses algorithms" 183 | ] 184 | embeddings = get_embeddings_batch(texts) 185 | print(f"Generated {len(embeddings)} embeddings") 186 | ``` 187 | 188 | ### Example 3: Simple Similarity Calculation 189 | 190 | ```python 191 | import numpy as np 192 | 193 | def cosine_similarity(vec1, vec2): 194 | """Calculate cosine similarity between two vectors""" 195 | vec1 = np.array(vec1) 196 | vec2 = np.array(vec2) 197 | 198 | dot_product = np.dot(vec1, vec2) 199 | norm1 = np.linalg.norm(vec1) 200 | norm2 = np.linalg.norm(vec2) 201 | 202 | if norm1 == 0 or norm2 == 0: 203 | return 0.0 204 | 205 | return dot_product / (norm1 * norm2) 206 | 207 | # Usage 208 | embedding1 = get_embedding("dog") 209 | embedding2 = get_embedding("puppy") 210 | embedding3 = get_embedding("car") 211 | 212 | similarity_dog_puppy = cosine_similarity(embedding1, embedding2) 213 | similarity_dog_car = cosine_similarity(embedding1, embedding3) 214 | 215 | print(f"Dog-Puppy similarity: {similarity_dog_puppy:.3f}") # ~0.85 216 | print(f"Dog-Car similarity: {similarity_dog_car:.3f}") # ~0.30 217 | ``` 218 | 219 | ### Example 4: ChromaDB Basics 220 | 221 | ```python 222 | import chromadb 223 | from chromadb.config import Settings 224 | 225 | # Initialize ChromaDB (in-memory for simplicity) 226 | client = chromadb.Client(Settings(anonymized_telemetry=False)) 227 | 228 | # Create or get a collection 229 | collection = client.create_collection(name="documents") 230 | 231 | # Add documents 232 | documents = [ 233 | "Python is a high-level programming language", 234 | "Dogs are loyal and friendly animals", 235 | "Machine 
learning is a subset of AI" 236 | ] 237 | 238 | ids = ["doc1", "doc2", "doc3"] 239 | metadatas = [ 240 | {"source": "python_book", "page": 1}, 241 | {"source": "animal_guide", "page": 5}, 242 | {"source": "ai_textbook", "page": 10} 243 | ] 244 | 245 | collection.add( 246 | documents=documents, 247 | ids=ids, 248 | metadatas=metadatas 249 | ) 250 | 251 | # Query for similar documents 252 | results = collection.query( 253 | query_texts=["programming languages"], 254 | n_results=2 255 | ) 256 | 257 | print("Similar documents:") 258 | for i, doc in enumerate(results['documents'][0]): 259 | print(f"{i+1}. {doc}") 260 | print(f" Metadata: {results['metadatas'][0][i]}") 261 | ``` 262 | 263 | ### Example 5: Complete Embedding Pipeline 264 | 265 | ```python 266 | class EmbeddingPipeline: 267 | def __init__(self, embedding_model="text-embedding-ada-002"): 268 | self.embedding_model = embedding_model 269 | self.client = chromadb.Client() 270 | self.collection = None 271 | 272 | def create_collection(self, name): 273 | """Create a new collection""" 274 | self.collection = self.client.create_collection(name=name) 275 | return self.collection 276 | 277 | def add_documents(self, texts, ids=None, metadatas=None): 278 | """Add documents to collection (ChromaDB auto-generates embeddings)""" 279 | if ids is None: 280 | ids = [f"doc_{i}" for i in range(len(texts))] 281 | 282 | self.collection.add( 283 | documents=texts, 284 | ids=ids, 285 | metadatas=metadatas 286 | ) 287 | 288 | def search(self, query_text, n_results=5, filter_metadata=None): 289 | """Search for similar documents""" 290 | query_params = { 291 | "query_texts": [query_text], 292 | "n_results": n_results 293 | } 294 | 295 | if filter_metadata: 296 | query_params["where"] = filter_metadata 297 | 298 | results = self.collection.query(**query_params) 299 | 300 | return { 301 | "documents": results['documents'][0], 302 | "metadatas": results['metadatas'][0], 303 | "distances": results['distances'][0] 304 | } 305 | 306 | def get_stats(self): 307 | """Get collection statistics""" 308 | count = self.collection.count() 309 | return {"total_documents": count} 310 | 311 | # Usage 312 | pipeline = EmbeddingPipeline() 313 | pipeline.create_collection("my_docs") 314 | 315 | # Add documents 316 | texts = ["Document 1 text...", "Document 2 text..."] 317 | metadatas = [{"source": "book1"}, {"source": "book2"}] 318 | pipeline.add_documents(texts, metadatas=metadatas) 319 | 320 | # Search 321 | results = pipeline.search("What is Python?", n_results=3) 322 | for doc, metadata in zip(results["documents"], results["metadatas"]): 323 | print(f"Found: {doc[:50]}... (Source: {metadata['source']})") 324 | ``` 325 | 326 | --- 327 | 328 | ## 4. 
Student Practice Tasks 329 | 330 | ### Task 1: Embedding Generator 331 | Create a function that: 332 | - Takes a list of texts 333 | - Generates embeddings for each 334 | - Returns embeddings with metadata 335 | - Handles API errors 336 | 337 | ### Task 2: Similarity Calculator 338 | Build a tool that: 339 | - Takes two texts 340 | - Generates embeddings 341 | - Calculates cosine similarity 342 | - Explains the similarity score 343 | 344 | ### Task 3: ChromaDB Setup 345 | Set up ChromaDB and: 346 | - Create a collection 347 | - Add 10 sample documents 348 | - Query for similar documents 349 | - Display results with metadata 350 | 351 | ### Task 4: Batch Processing 352 | Create a system that: 353 | - Processes multiple documents 354 | - Generates embeddings in batches 355 | - Stores in ChromaDB 356 | - Shows progress 357 | 358 | ### Task 5: Similarity Search 359 | Implement a search function that: 360 | - Takes a query 361 | - Finds top 5 most similar documents 362 | - Returns results with similarity scores 363 | - Filters by metadata if needed 364 | 365 | ### Task 6: Embedding Visualization 366 | (Advanced) Use dimensionality reduction (PCA/t-SNE) to visualize embeddings in 2D and see how similar texts cluster together. 367 | 368 | --- 369 | 370 | ## 5. Summary / Key Takeaways 371 | 372 | - **Embeddings** convert text to numerical vectors that capture meaning 373 | - **Similar texts** have similar embeddings (high cosine similarity) 374 | - **Embedding models** like OpenAI's ada-002 generate 1536-dimensional vectors 375 | - **Vector databases** (like ChromaDB) store and search embeddings efficiently 376 | - **Cosine similarity** measures how similar two embeddings are (ranges from -1 to 1; pairs of text embeddings usually score between 0 and 1) 377 | - **ChromaDB** is easy to use for learning and prototyping 378 | - **The pipeline**: Text → Embedding → Store → Query → Retrieve 379 | - **Metadata** helps filter and organize stored documents 380 | - **Batch processing** is more efficient than one-by-one 381 | - **Similarity search** finds relevant chunks for RAG queries 382 | 383 | --- 384 | 385 | ## 6. Further Reading (Optional) 386 | 387 | - OpenAI Embeddings Guide 388 | - ChromaDB Documentation 389 | - Sentence Transformers library 390 | - Vector Database Comparison articles 391 | - "The Illustrated Word2vec" (understanding embeddings conceptually) 392 | 393 | --- 394 | 395 | **Next up:** Day 6 will combine everything into a complete RAG system! 396 | 397 | -------------------------------------------------------------------------------- /Day-07: Implement RAG From Scratch (Pure Python)/README.md: -------------------------------------------------------------------------------- 1 | # Day 7 — Implement RAG From Scratch (Pure Python) 2 | 3 | ## 1. Beginner-Friendly Introduction 4 | 5 | Today, you'll build a complete RAG system using only pure Python—no frameworks! This deep dive will help you understand every component and how they work together. By building from scratch, you'll gain a solid foundation before using frameworks like LangChain. 6 | 7 | **Why build from scratch?** 8 | - Understand every component deeply 9 | - No "magic" - you see how everything works 10 | - Customize any part you want 11 | - Better debugging skills 12 | - Foundation for using frameworks later 13 | 14 | **What you'll build:** 15 | A complete, working RAG system with: 16 | - Document processing 17 | - Embedding generation 18 | - Vector storage and search 19 | - Prompt construction 20 | - LLM integration 21 | - Answer generation 22 | 23 | --- 24 | 25 | ## 2. 
Deep-Dive Explanation 26 | 27 | ### 2.1 System Architecture 28 | 29 | **Complete RAG System Components:** 30 | 31 | ``` 32 | ┌─────────────────┐ 33 | │ Document Loader │ 34 | └────────┬─────────┘ 35 | │ 36 | ▼ 37 | ┌─────────────────┐ 38 | │ Text Chunker │ 39 | └────────┬─────────┘ 40 | │ 41 | ▼ 42 | ┌─────────────────┐ 43 | │ Embedding Model │ 44 | └────────┬─────────┘ 45 | │ 46 | ▼ 47 | ┌─────────────────┐ 48 | │ Vector Store │ 49 | └────────┬─────────┘ 50 | │ 51 | ▼ 52 | ┌─────────────────┐ 53 | │ Query Handler │ 54 | └────────┬─────────┘ 55 | │ 56 | ▼ 57 | ┌─────────────────┐ 58 | │ RAG Pipeline │ 59 | └─────────────────┘ 60 | ``` 61 | 62 | ### 2.2 Component Design 63 | 64 | **1. Document Loader** 65 | - Read various file formats 66 | - Extract text 67 | - Handle errors 68 | 69 | **2. Text Chunker** 70 | - Split into manageable pieces 71 | - Preserve context 72 | - Add metadata 73 | 74 | **3. Embedding Generator** 75 | - Call OpenAI API 76 | - Handle batching 77 | - Cache embeddings 78 | 79 | **4. Vector Store** 80 | - Store embeddings 81 | - Implement similarity search 82 | - Manage metadata 83 | 84 | **5. Query Processor** 85 | - Convert query to embedding 86 | - Search vector store 87 | - Rank results 88 | 89 | **6. RAG Pipeline** 90 | - Orchestrate all components 91 | - Handle errors 92 | - Return formatted results 93 | 94 | ### 2.3 Implementation Strategy 95 | 96 | **Class Structure:** 97 | ```python 98 | class RAGSystem: 99 | - document_loader 100 | - chunker 101 | - embedding_generator 102 | - vector_store 103 | - llm_client 104 | 105 | Methods: 106 | - load_documents() 107 | - index_documents() 108 | - query() 109 | - get_stats() 110 | ``` 111 | 112 | **Error Handling:** 113 | - API failures 114 | - File errors 115 | - Empty results 116 | - Invalid inputs 117 | 118 | **Configuration:** 119 | - Chunk size 120 | - K value 121 | - Similarity threshold 122 | - LLM parameters 123 | 124 | --- 125 | 126 | ## 3. 
Instructor Examples 127 | 128 | ### Example 1: Complete RAG System Structure 129 | 130 | ```python 131 | import os 132 | import openai 133 | import json 134 | import numpy as np 135 | from typing import List, Dict, Optional 136 | 137 | class DocumentLoader: 138 | """Load documents from various sources""" 139 | 140 | def load_text_file(self, filepath: str) -> str: 141 | """Load text from .txt file""" 142 | with open(filepath, 'r', encoding='utf-8') as f: 143 | return f.read() 144 | 145 | def load_pdf(self, filepath: str) -> str: 146 | """Load text from PDF""" 147 | import pypdf 148 | text = "" 149 | with open(filepath, 'rb') as f: 150 | reader = pypdf.PdfReader(f) 151 | for page in reader.pages: 152 | text += page.extract_text() + "\n" 153 | return text 154 | 155 | class TextChunker: 156 | """Split text into chunks""" 157 | 158 | def __init__(self, chunk_size: int = 500, overlap: int = 50): 159 | self.chunk_size = chunk_size 160 | self.overlap = overlap 161 | 162 | def chunk_text(self, text: str, source: str = "unknown") -> List[Dict]: 163 | """Split text into chunks with metadata""" 164 | words = text.split() 165 | chunks = [] 166 | 167 | for i in range(0, len(words), self.chunk_size - self.overlap): 168 | chunk_words = words[i:i + self.chunk_size] 169 | chunk_text = " ".join(chunk_words) 170 | 171 | chunks.append({ 172 | "text": chunk_text, 173 | "source": source, 174 | "chunk_id": len(chunks) + 1, 175 | "word_count": len(chunk_words) 176 | }) 177 | 178 | return chunks 179 | 180 | class EmbeddingGenerator: 181 | """Generate embeddings using OpenAI""" 182 | 183 | def __init__(self, model: str = "text-embedding-ada-002"): 184 | self.model = model 185 | openai.api_key = os.getenv("OPENAI_API_KEY") 186 | 187 | def generate(self, text: str) -> List[float]: 188 | """Generate embedding for single text""" 189 | text = text.replace("\n", " ") 190 | response = openai.Embedding.create( 191 | model=self.model, 192 | input=text 193 | ) 194 | return response['data'][0]['embedding'] 195 | 196 | def generate_batch(self, texts: List[str]) -> List[List[float]]: 197 | """Generate embeddings for multiple texts""" 198 | texts = [t.replace("\n", " ") for t in texts] 199 | response = openai.Embedding.create( 200 | model=self.model, 201 | input=texts 202 | ) 203 | return [item['embedding'] for item in response['data']] 204 | 205 | class VectorStore: 206 | """Simple vector store using in-memory storage""" 207 | 208 | def __init__(self): 209 | self.embeddings = [] 210 | self.chunks = [] 211 | self.metadata = [] 212 | 213 | def add(self, embeddings: List[List[float]], chunks: List[Dict]): 214 | """Add embeddings and chunks to store""" 215 | self.embeddings.extend(embeddings) 216 | self.chunks.extend(chunks) 217 | self.metadata.extend([c.get("metadata", {}) for c in chunks]) 218 | 219 | def search(self, query_embedding: List[float], k: int = 3) -> List[Dict]: 220 | """Search for top K similar chunks""" 221 | if not self.embeddings: 222 | return [] 223 | 224 | # Calculate similarities 225 | similarities = [] 226 | query_vec = np.array(query_embedding) 227 | 228 | for emb in self.embeddings: 229 | emb_vec = np.array(emb) 230 | similarity = np.dot(query_vec, emb_vec) / ( 231 | np.linalg.norm(query_vec) * np.linalg.norm(emb_vec) 232 | ) 233 | similarities.append(similarity) 234 | 235 | # Get top K 236 | top_indices = np.argsort(similarities)[::-1][:k] 237 | 238 | results = [] 239 | for idx in top_indices: 240 | results.append({ 241 | "chunk": self.chunks[idx], 242 | "similarity": float(similarities[idx]), 243 | 
"metadata": self.metadata[idx] 244 | }) 245 | 246 | return results 247 | 248 | class RAGSystem: 249 | """Complete RAG system""" 250 | 251 | def __init__(self): 252 | self.loader = DocumentLoader() 253 | self.chunker = TextChunker() 254 | self.embedder = EmbeddingGenerator() 255 | self.vector_store = VectorStore() 256 | openai.api_key = os.getenv("OPENAI_API_KEY") 257 | 258 | def index_document(self, filepath: str): 259 | """Load, chunk, and index a document""" 260 | # Load 261 | if filepath.endswith('.txt'): 262 | text = self.loader.load_text_file(filepath) 263 | elif filepath.endswith('.pdf'): 264 | text = self.loader.load_pdf(filepath) 265 | else: 266 | raise ValueError(f"Unsupported file type: {filepath}") 267 | 268 | # Chunk 269 | chunks = self.chunker.chunk_text(text, source=filepath) 270 | 271 | # Generate embeddings 272 | chunk_texts = [c["text"] for c in chunks] 273 | embeddings = self.embedder.generate_batch(chunk_texts) 274 | 275 | # Store 276 | self.vector_store.add(embeddings, chunks) 277 | 278 | return len(chunks) 279 | 280 | def query(self, question: str, k: int = 3) -> Dict: 281 | """Complete RAG query""" 282 | # 1. Retrieve 283 | query_embedding = self.embedder.generate(question) 284 | results = self.vector_store.search(query_embedding, k) 285 | 286 | if not results: 287 | return {"answer": "No relevant documents found.", "sources": []} 288 | 289 | # 2. Augment 290 | context = "\n\n".join([ 291 | f"[Source: {r['chunk']['source']}]\n{r['chunk']['text']}" 292 | for r in results 293 | ]) 294 | 295 | prompt = f"""Answer the question using the following context. 296 | 297 | Context: 298 | {context} 299 | 300 | Question: {question} 301 | 302 | Answer based only on the provided context.""" 303 | 304 | # 3. Generate 305 | response = openai.ChatCompletion.create( 306 | model="gpt-3.5-turbo", 307 | messages=[{"role": "user", "content": prompt}], 308 | temperature=0.3, 309 | max_tokens=300 310 | ) 311 | 312 | answer = response.choices[0].message.content 313 | 314 | return { 315 | "answer": answer, 316 | "sources": [r['chunk'] for r in results], 317 | "similarities": [r['similarity'] for r in results] 318 | } 319 | 320 | # Usage 321 | rag = RAGSystem() 322 | rag.index_document("document.pdf") 323 | result = rag.query("What is the main topic?") 324 | print(result["answer"]) 325 | ``` 326 | 327 | ### Example 2: Enhanced RAG with Configuration 328 | 329 | ```python 330 | class ConfigurableRAG(RAGSystem): 331 | """RAG system with configuration options""" 332 | 333 | def __init__(self, config: Dict): 334 | super().__init__() 335 | self.config = config 336 | self.chunker = TextChunker( 337 | chunk_size=config.get("chunk_size", 500), 338 | overlap=config.get("overlap", 50) 339 | ) 340 | self.embedder = EmbeddingGenerator( 341 | model=config.get("embedding_model", "text-embedding-ada-002") 342 | ) 343 | 344 | def query(self, question: str, k: Optional[int] = None) -> Dict: 345 | """Query with configurable parameters""" 346 | k = k or self.config.get("k", 3) 347 | threshold = self.config.get("similarity_threshold", 0.0) 348 | 349 | # Retrieve 350 | query_embedding = self.embedder.generate(question) 351 | results = self.vector_store.search(query_embedding, k * 2) # Get more, filter 352 | 353 | # Filter by threshold 354 | filtered = [r for r in results if r['similarity'] >= threshold][:k] 355 | 356 | if not filtered: 357 | return {"answer": "No relevant documents found.", "sources": []} 358 | 359 | # Rest of the pipeline... 
360 | # (similar to previous example) 361 | ``` 362 | 363 | --- 364 | 365 | ## 4. Student Practice Tasks 366 | 367 | ### Task 1: Core Components 368 | Implement each component separately: 369 | - DocumentLoader 370 | - TextChunker 371 | - EmbeddingGenerator 372 | - VectorStore 373 | 374 | Test each independently. 375 | 376 | ### Task 2: Integration 377 | Combine all components into a RAGSystem class. Test the complete pipeline. 378 | 379 | ### Task 3: Error Handling 380 | Add comprehensive error handling: 381 | - API failures 382 | - File errors 383 | - Empty results 384 | - Invalid inputs 385 | 386 | ### Task 4: Configuration System 387 | Create a configuration system that allows: 388 | - Adjusting chunk size 389 | - Changing K value 390 | - Setting similarity threshold 391 | - Configuring LLM parameters 392 | 393 | ### Task 5: Performance Optimization 394 | Optimize your system: 395 | - Batch embedding generation 396 | - Cache embeddings 397 | - Efficient similarity search 398 | - Progress indicators 399 | 400 | ### Task 6: Testing 401 | Create test cases: 402 | - Unit tests for each component 403 | - Integration tests 404 | - End-to-end tests 405 | 406 | --- 407 | 408 | ## 5. Summary / Key Takeaways 409 | 410 | - **Building from scratch** deepens understanding 411 | - **Modular design** makes components reusable 412 | - **Error handling** is crucial for production 413 | - **Configuration** allows flexibility 414 | - **Vector search** uses cosine similarity 415 | - **Batching** improves efficiency 416 | - **Metadata** enables filtering and citations 417 | - **Testing** ensures reliability 418 | - **Pure Python** = no framework dependencies 419 | - **Foundation** for understanding frameworks 420 | 421 | --- 422 | 423 | ## 6. Further Reading (Optional) 424 | 425 | - NumPy documentation (for vector operations) 426 | - OpenAI API best practices 427 | - Software design patterns 428 | - Unit testing in Python 429 | 430 | --- 431 | 432 | **Next up:** Day 8 will introduce you to LangChain and LlamaIndex frameworks! 433 | 434 | -------------------------------------------------------------------------------- /Day-06: RAG Fundamentals (Retrieval → Augmentation → Generation)/README.md: -------------------------------------------------------------------------------- 1 | # Day 6 — RAG Fundamentals (Retrieval → Augmentation → Generation) 2 | 3 | ## 1. Beginner-Friendly Introduction 4 | 5 | Today, you'll learn the complete RAG pipeline! **Retrieval-Augmented Generation** combines the best of both worlds: the knowledge retrieval of search engines and the language understanding of LLMs. 6 | 7 | **What is RAG?** 8 | RAG is a technique that: 9 | 1. **Retrieves** relevant information from your documents 10 | 2. **Augments** the LLM's prompt with this context 11 | 3. **Generates** accurate, sourced answers 12 | 13 | **Why RAG matters:** 14 | - Solves LLM limitations (hallucination, outdated info) 15 | - Provides accurate, verifiable answers 16 | - Uses your own documents as knowledge base 17 | - Enables domain-specific AI applications 18 | 19 | **Real-world context:** 20 | Instead of asking an LLM "What's in my company handbook?" (which it doesn't know), RAG: 21 | 1. Searches your handbook documents 22 | 2. Finds relevant sections 23 | 3. Gives those sections to the LLM 24 | 4. LLM answers based on YOUR documents 25 | 26 | --- 27 | 28 | ## 2. Deep-Dive Explanation 29 | 30 | ### 2.1 The RAG Pipeline 31 | 32 | **Complete Flow:** 33 | ``` 34 | User Question 35 | ↓ 36 | [1. 
RETRIEVAL] 37 | ↓ 38 | Query Embedding → Vector Search → Top K Chunks 39 | ↓ 40 | [2. AUGMENTATION] 41 | ↓ 42 | Context + Question → Formatted Prompt 43 | ↓ 44 | [3. GENERATION] 45 | ↓ 46 | LLM → Answer with Sources 47 | ``` 48 | 49 | ### 2.2 Step 1: Retrieval 50 | 51 | **What happens:** 52 | 1. Convert user question to embedding 53 | 2. Search vector database for similar chunks 54 | 3. Retrieve top K most relevant chunks 55 | 4. Return chunks with metadata 56 | 57 | **Key decisions:** 58 | - **K value**: How many chunks? (typically 3-5) 59 | - **Similarity threshold**: Minimum similarity score? 60 | - **Metadata filtering**: Filter by source, date, etc.? 61 | 62 | **Example:** 63 | ``` 64 | Question: "What is Python?" 65 | → Embedding: [0.1, -0.3, 0.8, ...] 66 | → Search vector DB 67 | → Retrieve: [Chunk about Python, Chunk about programming, ...] 68 | ``` 69 | 70 | ### 2.3 Step 2: Augmentation 71 | 72 | **What happens:** 73 | 1. Combine retrieved chunks into context 74 | 2. Format context with clear structure 75 | 3. Add question to prompt 76 | 4. Include instructions for the LLM 77 | 78 | **Prompt Structure:** 79 | ``` 80 | System: You are a helpful assistant... 81 | Context: 82 | [Chunk 1] 83 | [Chunk 2] 84 | [Chunk 3] 85 | 86 | Question: {user_question} 87 | 88 | Answer based on the context above. 89 | ``` 90 | 91 | **Best practices:** 92 | - Clearly separate chunks 93 | - Include source information 94 | - Limit context size (token budget) 95 | - Add instructions for citation 96 | 97 | ### 2.4 Step 3: Generation 98 | 99 | **What happens:** 100 | 1. Send augmented prompt to LLM 101 | 2. LLM generates answer using context 102 | 3. Extract answer from response 103 | 4. Optionally extract citations 104 | 105 | **LLM Configuration:** 106 | - **Temperature**: Lower (0.3-0.5) for factual answers 107 | - **Max tokens**: Based on expected answer length 108 | - **Model**: GPT-3.5-turbo or GPT-4 109 | 110 | ### 2.5 Complete RAG System Components 111 | 112 | **Required Components:** 113 | 1. **Document Store**: Where chunks are stored 114 | 2. **Embedding Model**: Converts text to vectors 115 | 3. **Vector Database**: Stores and searches embeddings 116 | 4. **LLM**: Generates final answers 117 | 5. **Prompt Template**: Formats context + question 118 | 119 | **Data Flow:** 120 | ``` 121 | Documents → Chunks → Embeddings → Vector DB 122 | ↓ 123 | User Question → Embedding → Search → Retrieved Chunks 124 | ↓ 125 | Retrieved Chunks + Question → Prompt → LLM → Answer 126 | ``` 127 | 128 | ### 2.6 RAG vs. Traditional Search 129 | 130 | **Traditional Search:** 131 | - Keyword matching 132 | - Exact text search 133 | - May miss semantic matches 134 | 135 | **RAG:** 136 | - Semantic understanding 137 | - Finds conceptually similar content 138 | - Understands context and meaning 139 | 140 | **Example:** 141 | - Question: "How do I train a model?" 142 | - Keyword search: Finds "train" and "model" separately 143 | - RAG: Finds content about "machine learning training", "model training", etc. 144 | 145 | --- 146 | 147 | ## 3. 
Instructor Examples 148 | 149 | ### Example 1: Simple RAG Pipeline 150 | 151 | ```python 152 | import os  # for reading the API key from the environment below 153 | import openai 154 | import chromadb 155 | from chromadb.config import Settings 156 | class SimpleRAG: 157 | def __init__(self): 158 | self.client = chromadb.Client(Settings()) 159 | self.collection = None 160 | openai.api_key = os.getenv("OPENAI_API_KEY") 161 | 162 | def setup(self, collection_name="documents"): 163 | """Initialize collection""" 164 | self.collection = self.client.create_collection(name=collection_name) 165 | 166 | def add_documents(self, texts, ids=None, metadatas=None): 167 | """Add documents to the collection""" 168 | if ids is None: 169 | ids = [f"doc_{i}" for i in range(len(texts))] 170 | 171 | self.collection.add( 172 | documents=texts, 173 | ids=ids, 174 | metadatas=metadatas 175 | ) 176 | 177 | def retrieve(self, query, k=3): 178 | """Retrieve top K relevant chunks""" 179 | results = self.collection.query( 180 | query_texts=[query], 181 | n_results=k 182 | ) 183 | return results['documents'][0] 184 | 185 | def augment(self, context_chunks, question): 186 | """Create augmented prompt""" 187 | context = "\n\n".join([ 188 | f"[Document {i+1}]\n{chunk}" 189 | for i, chunk in enumerate(context_chunks) 190 | ]) 191 | 192 | prompt = f"""Use the following documents to answer the question. 193 | 194 | Documents: 195 | {context} 196 | 197 | Question: {question} 198 | 199 | Answer based only on the provided documents. If the documents don't contain enough information, say so.""" 200 | 201 | return prompt 202 | 203 | def generate(self, prompt): 204 | """Generate answer using LLM""" 205 | response = openai.ChatCompletion.create( 206 | model="gpt-3.5-turbo", 207 | messages=[{"role": "user", "content": prompt}], 208 | temperature=0.3, 209 | max_tokens=300 210 | ) 211 | return response.choices[0].message.content 212 | 213 | def query(self, question, k=3): 214 | """Complete RAG pipeline""" 215 | # 1. Retrieve 216 | chunks = self.retrieve(question, k) 217 | 218 | # 2. Augment 219 | prompt = self.augment(chunks, question) 220 | 221 | # 3. Generate 222 | answer = self.generate(prompt) 223 | 224 | return { 225 | "answer": answer, 226 | "sources": chunks 227 | } 228 | 229 | # Usage 230 | rag = SimpleRAG() 231 | rag.setup() 232 | 233 | # Add documents 234 | rag.add_documents([ 235 | "Python is a programming language created in 1991.", 236 | "RAG combines retrieval and generation.", 237 | "Machine learning uses algorithms to learn from data."
238 | ]) 239 | 240 | # Query 241 | result = rag.query("What is Python?") 242 | print(result["answer"]) 243 | print("\nSources:", result["sources"]) 244 | ``` 245 | 246 | ### Example 2: RAG with Metadata 247 | 248 | ```python 249 | class RAGWithMetadata(SimpleRAG): 250 | def retrieve_with_metadata(self, query, k=3): 251 | """Retrieve chunks with metadata""" 252 | results = self.collection.query( 253 | query_texts=[query], 254 | n_results=k 255 | ) 256 | return { 257 | "documents": results['documents'][0], 258 | "metadatas": results['metadatas'][0], 259 | "distances": results['distances'][0] 260 | } 261 | 262 | def augment_with_sources(self, retrieved_data, question): 263 | """Augment with source citations""" 264 | context_parts = [] 265 | for i, (doc, metadata) in enumerate(zip( 266 | retrieved_data['documents'], 267 | retrieved_data['metadatas'] 268 | )): 269 | source = metadata.get('source', f'Document {i+1}') 270 | context_parts.append(f"[Source: {source}]\n{doc}") 271 | 272 | context = "\n\n".join(context_parts) 273 | 274 | prompt = f"""Answer the question using the following sources. 275 | 276 | Sources: 277 | {context} 278 | 279 | Question: {question} 280 | 281 | Provide an answer and cite which source(s) you used.""" 282 | 283 | return prompt 284 | 285 | def query(self, question, k=3): 286 | """RAG with source citations""" 287 | # Retrieve with metadata 288 | retrieved = self.retrieve_with_metadata(question, k) 289 | 290 | # Augment 291 | prompt = self.augment_with_sources(retrieved, question) 292 | 293 | # Generate 294 | answer = self.generate(prompt) 295 | 296 | return { 297 | "answer": answer, 298 | "sources": [ 299 | {"text": doc, "metadata": meta} 300 | for doc, meta in zip( 301 | retrieved['documents'], 302 | retrieved['metadatas'] 303 | ) 304 | ] 305 | } 306 | ``` 307 | 308 | ### Example 3: RAG with Similarity Filtering 309 | 310 | ```python 311 | class FilteredRAG(SimpleRAG): 312 | def retrieve_with_threshold(self, query, k=5, threshold=0.7): 313 | """Retrieve only chunks above similarity threshold""" 314 | results = self.collection.query( 315 | query_texts=[query], 316 | n_results=k 317 | ) 318 | 319 | # ChromaDB returns distances, not similarities: lower distance = more similar. 320 | # NOTE: the 1 - distance conversion below assumes cosine distance; create the collection with metadata={"hnsw:space": "cosine"} so that it holds (Chroma defaults to L2 distance, for which this conversion is not valid). 321 | filtered_docs = [] 322 | filtered_metas = [] 323 | 324 | for doc, meta, distance in zip( 325 | results['documents'][0], 326 | results['metadatas'][0], 327 | results['distances'][0] 328 | ): 329 | similarity = 1 - distance # Valid for cosine distance (see note above) 330 | if similarity >= threshold: 331 | filtered_docs.append(doc) 332 | filtered_metas.append(meta) 333 | 334 | return filtered_docs, filtered_metas 335 | 336 | def query(self, question, k=5, threshold=0.7): 337 | """RAG with similarity filtering""" 338 | chunks, metadata = self.retrieve_with_threshold(question, k, threshold) 339 | 340 | if not chunks: 341 | return { 342 | "answer": "I couldn't find relevant information in the documents.", 343 | "sources": [] 344 | } 345 | 346 | prompt = self.augment(chunks, question) 347 | answer = self.generate(prompt) 348 | 349 | return { 350 | "answer": answer, 351 | "sources": chunks, 352 | "num_sources": len(chunks) 353 | } 354 | ``` 355 | 356 | --- 357 | 358 | ## 4. 
Student Practice Tasks 359 | 360 | ### Task 1: Basic RAG Implementation 361 | Build a simple RAG system that: 362 | - Stores documents in ChromaDB 363 | - Retrieves top 3 chunks for a query 364 | - Augments prompt with context 365 | - Generates answer using GPT-3.5 366 | 367 | ### Task 2: RAG with Citations 368 | Enhance your RAG to: 369 | - Include source information in answers 370 | - Format citations properly 371 | - Show which document each part came from 372 | 373 | ### Task 3: Similarity Threshold 374 | Add filtering to only use chunks above a similarity threshold. Test with different thresholds and observe how it affects answers. 375 | 376 | ### Task 4: Multi-Query RAG 377 | Implement query expansion: 378 | - Generate multiple query variations 379 | - Search with each variation 380 | - Combine results 381 | - Remove duplicates 382 | 383 | ### Task 5: RAG Evaluation 384 | Create a simple evaluation: 385 | - Test with known questions 386 | - Compare answers to expected answers 387 | - Calculate accuracy metrics 388 | 389 | ### Task 6: RAG with Metadata Filtering 390 | Add ability to filter by metadata (e.g., only search in specific documents, date ranges, etc.) 391 | 392 | --- 393 | 394 | ## 5. Summary / Key Takeaways 395 | 396 | - **RAG = Retrieval + Augmentation + Generation** 397 | - **Retrieval**: Find relevant chunks using semantic search 398 | - **Augmentation**: Combine chunks with question in a prompt 399 | - **Generation**: LLM creates answer from augmented context 400 | - **K value**: Number of chunks to retrieve (typically 3-5) 401 | - **Similarity threshold**: Filter low-quality matches 402 | - **Metadata**: Track sources for citations 403 | - **Prompt engineering**: Critical for good RAG results 404 | - **RAG solves**: Hallucination, outdated info, domain knowledge 405 | - **Complete pipeline**: Documents → Embeddings → Vector DB → Query → Retrieve → Augment → Generate 406 | 407 | --- 408 | 409 | ## 6. Further Reading (Optional) 410 | 411 | - "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (original RAG paper) 412 | - LangChain RAG documentation 413 | - LlamaIndex RAG guides 414 | - RAG evaluation metrics 415 | 416 | --- 417 | 418 | **Next up:** Day 7 will have you build a complete RAG system from scratch! 419 | 420 | -------------------------------------------------------------------------------- /Day-04: Chunking & Data Extraction (PDF-Web-Docs)/README.md: -------------------------------------------------------------------------------- 1 | # Day 4 — Chunking & Data Extraction (PDF/Web/Docs) 2 | 3 | ## 1. Beginner-Friendly Introduction 4 | 5 | Before you can build a RAG system, you need to extract and prepare your data. Today, you'll learn how to extract text from various sources (PDFs, websites, documents) and split it into manageable chunks—a crucial step in the RAG pipeline. 6 | 7 | **Why this matters for RAG:** 8 | - RAG systems need documents in text format 9 | - Documents must be split into chunks that fit LLM context windows 10 | - Different sources require different extraction methods 11 | - Proper chunking improves retrieval quality 12 | 13 | **Real-world context:** 14 | Imagine you have a 100-page PDF and want to answer questions about it. You can't send all 100 pages to an LLM at once (token limits!). Instead, you: 15 | 1. Extract text from the PDF 16 | 2. Split it into smaller chunks (e.g., 500 words each) 17 | 3. Store these chunks for retrieval 18 | 4. 
When a question comes, find relevant chunks and send only those to the LLM 19 | 20 | --- 21 | 22 | ## 2. Deep-Dive Explanation 23 | 24 | ### 2.1 Data Extraction Overview 25 | 26 | **The Pipeline:** 27 | ``` 28 | Source File → Extract Text → Clean Text → Chunk Text → Store Chunks 29 | ``` 30 | 31 | **Common Sources:** 32 | - PDF files 33 | - Web pages (HTML) 34 | - Text files (.txt, .md) 35 | - Word documents (.docx) 36 | - CSV files 37 | - JSON files 38 | 39 | ### 2.2 PDF Extraction 40 | 41 | **Libraries:** 42 | - `PyPDF2`: Basic PDF reading 43 | - `pypdf`: Modern alternative 44 | - `pdfplumber`: Better text extraction 45 | - `PyMuPDF` (fitz): Fast and accurate 46 | 47 | **Challenges:** 48 | - Scanned PDFs (need OCR) 49 | - Complex layouts 50 | - Tables and images 51 | - Multi-column text 52 | 53 | **Basic Extraction:** 54 | ```python 55 | import pypdf 56 | 57 | def extract_pdf_text(filepath): 58 | text = "" 59 | with open(filepath, "rb") as file: 60 | reader = pypdf.PdfReader(file) 61 | for page in reader.pages: 62 | text += page.extract_text() + "\n" 63 | return text 64 | ``` 65 | 66 | ### 2.3 Web Scraping 67 | 68 | **Libraries:** 69 | - `requests`: HTTP requests 70 | - `BeautifulSoup4`: HTML parsing 71 | - `selenium`: For JavaScript-heavy sites 72 | 73 | **Basic Web Scraping:** 74 | ```python 75 | import requests 76 | from bs4 import BeautifulSoup 77 | 78 | def extract_web_text(url): 79 | response = requests.get(url) 80 | soup = BeautifulSoup(response.content, "html.parser") 81 | 82 | # Remove script and style elements 83 | for script in soup(["script", "style"]): 84 | script.decompose() 85 | 86 | # Get text 87 | text = soup.get_text() 88 | 89 | # Clean up whitespace 90 | lines = (line.strip() for line in text.splitlines()) 91 | chunks = (phrase.strip() for line in lines for phrase in line.split(" ")) 92 | text = " ".join(chunk for chunk in chunks if chunk) 93 | 94 | return text 95 | ``` 96 | 97 | ### 2.4 Text Chunking Strategies 98 | 99 | #### 2.4.1 Fixed-Size Chunking 100 | 101 | Split text into chunks of fixed character/word count: 102 | ``` 103 | Text: [1000 chars] → Chunk1[500] + Chunk2[500] 104 | ``` 105 | 106 | **Pros:** Simple, predictable 107 | **Cons:** May split sentences/paragraphs 108 | 109 | #### 2.4.2 Sentence-Aware Chunking 110 | 111 | Split at sentence boundaries: 112 | ``` 113 | Text → Sentences → Group into chunks (respecting max size) 114 | ``` 115 | 116 | **Pros:** Preserves sentence integrity 117 | **Cons:** More complex 118 | 119 | #### 2.4.3 Paragraph-Aware Chunking 120 | 121 | Split at paragraph boundaries: 122 | ``` 123 | Text → Paragraphs → Group paragraphs into chunks 124 | ``` 125 | 126 | **Pros:** Preserves context 127 | **Cons:** Chunks may vary significantly in size 128 | 129 | #### 2.4.4 Overlapping Chunks 130 | 131 | Add overlap between chunks for context: 132 | ``` 133 | Chunk1: [0-500] → Chunk2: [450-950] → Chunk3: [900-1400] 134 | ``` 135 | 136 | **Pros:** Maintains context across boundaries 137 | **Cons:** More storage, potential redundancy 138 | 139 | ### 2.5 Chunking Best Practices 140 | 141 | **Considerations:** 142 | - **Chunk size**: 200-1000 tokens (depends on model) 143 | - **Overlap**: 10-20% of chunk size 144 | - **Boundaries**: Respect sentence/paragraph boundaries 145 | - **Metadata**: Store source, position, timestamp 146 | 147 | **Metadata to Store:** 148 | ```python 149 | chunk_metadata = { 150 | "chunk_id": 1, 151 | "source": "document.pdf", 152 | "page": 3, 153 | "start_char": 0, 154 | "end_char": 500, 155 | "word_count": 75 156 | } 
157 | ``` 158 | 159 | ### 2.6 Text Cleaning 160 | 161 | **Common Cleaning Steps:** 162 | 1. Remove extra whitespace 163 | 2. Remove special characters (if needed) 164 | 3. Normalize encoding 165 | 4. Remove headers/footers 166 | 5. Handle line breaks 167 | 168 | --- 169 | 170 | ## 3. Instructor Examples 171 | 172 | ### Example 1: PDF Text Extraction 173 | 174 | ```python 175 | import pypdf 176 | 177 | def extract_pdf_text(filepath): 178 | """Extract text from a PDF file""" 179 | text = "" 180 | try: 181 | with open(filepath, "rb") as file: 182 | pdf_reader = pypdf.PdfReader(file) 183 | num_pages = len(pdf_reader.pages) 184 | 185 | for page_num in range(num_pages): 186 | page = pdf_reader.pages[page_num] 187 | page_text = page.extract_text() 188 | text += f"\n--- Page {page_num + 1} ---\n" 189 | text += page_text 190 | 191 | return text 192 | except Exception as e: 193 | print(f"Error extracting PDF: {e}") 194 | return None 195 | 196 | # Usage 197 | text = extract_pdf_text("document.pdf") 198 | print(f"Extracted {len(text)} characters") 199 | ``` 200 | 201 | ### Example 2: Web Scraping 202 | 203 | ```python 204 | import requests 205 | from bs4 import BeautifulSoup 206 | 207 | def extract_web_content(url): 208 | """Extract main content from a webpage""" 209 | try: 210 | headers = { 211 | "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)" 212 | } 213 | response = requests.get(url, headers=headers, timeout=10) 214 | response.raise_for_status() 215 | 216 | soup = BeautifulSoup(response.content, "html.parser") 217 | 218 | # Remove unwanted elements 219 | for element in soup(["script", "style", "nav", "footer", "header"]): 220 | element.decompose() 221 | 222 | # Try to find main content 223 | main_content = soup.find("main") or soup.find("article") or soup.find("body") 224 | 225 | if main_content: 226 | text = main_content.get_text(separator=" ", strip=True) 227 | # Clean up multiple spaces 228 | text = " ".join(text.split()) 229 | return text 230 | else: 231 | return soup.get_text(separator=" ", strip=True) 232 | 233 | except Exception as e: 234 | print(f"Error scraping {url}: {e}") 235 | return None 236 | 237 | # Usage 238 | content = extract_web_content("https://example.com/article") 239 | ``` 240 | 241 | ### Example 3: Sentence-Aware Chunking 242 | 243 | ```python 244 | import re 245 | 246 | def chunk_text_sentences(text, chunk_size=500, overlap=50): 247 | """Chunk text respecting sentence boundaries""" 248 | # Split into sentences (simple approach) 249 | sentences = re.split(r'(?<=[.!?])\s+', text) 250 | 251 | chunks = [] 252 | current_chunk = [] 253 | current_size = 0 254 | 255 | for sentence in sentences: 256 | sentence_size = len(sentence) 257 | 258 | # If adding this sentence exceeds chunk size 259 | if current_size + sentence_size > chunk_size and current_chunk: 260 | # Save current chunk 261 | chunk_text = " ".join(current_chunk) 262 | chunks.append(chunk_text) 263 | 264 | # Start new chunk with overlap 265 | overlap_text = " ".join(current_chunk[-2:]) if len(current_chunk) >= 2 else "" 266 | current_chunk = [overlap_text, sentence] if overlap_text else [sentence] 267 | current_size = len(" ".join(current_chunk)) 268 | else: 269 | current_chunk.append(sentence) 270 | current_size += sentence_size + 1 # +1 for space 271 | 272 | # Add final chunk 273 | if current_chunk: 274 | chunks.append(" ".join(current_chunk)) 275 | 276 | return chunks 277 | 278 | # Usage 279 | long_text = "Sentence one. Sentence two. Sentence three..." 
* 50 280 | chunks = chunk_text_sentences(long_text, chunk_size=200) 281 | print(f"Created {len(chunks)} chunks") 282 | ``` 283 | 284 | ### Example 4: Complete Document Processor 285 | 286 | ```python 287 | class DocumentProcessor: 288 | def __init__(self, chunk_size=500, overlap=50): 289 | self.chunk_size = chunk_size 290 | self.overlap = overlap 291 | self.chunks = [] 292 | 293 | def extract_from_pdf(self, filepath): 294 | """Extract text from PDF""" 295 | import pypdf 296 | text = "" 297 | with open(filepath, "rb") as file: 298 | reader = pypdf.PdfReader(file) 299 | for page in reader.pages: 300 | text += page.extract_text() + "\n" 301 | return text 302 | 303 | def extract_from_web(self, url): 304 | """Extract text from webpage""" 305 | import requests 306 | from bs4 import BeautifulSoup 307 | 308 | response = requests.get(url) 309 | soup = BeautifulSoup(response.content, "html.parser") 310 | for script in soup(["script", "style"]): 311 | script.decompose() 312 | return soup.get_text() 313 | 314 | def clean_text(self, text): 315 | """Clean extracted text""" 316 | # Remove extra whitespace 317 | text = " ".join(text.split()) 318 | # Remove special characters (optional) 319 | # text = re.sub(r'[^\w\s]', '', text) 320 | return text 321 | 322 | def chunk_text(self, text, source="unknown"): 323 | """Chunk text and store with metadata""" 324 | words = text.split() 325 | chunks = [] 326 | 327 | for i in range(0, len(words), self.chunk_size - self.overlap): 328 | chunk_words = words[i:i + self.chunk_size] 329 | chunk_text = " ".join(chunk_words) 330 | 331 | chunk_data = { 332 | "chunk_id": len(chunks) + 1, 333 | "text": chunk_text, 334 | "source": source, 335 | "word_count": len(chunk_words), 336 | "start_word": i, 337 | "end_word": min(i + self.chunk_size, len(words)) 338 | } 339 | chunks.append(chunk_data) 340 | 341 | self.chunks.extend(chunks) 342 | return chunks 343 | 344 | def process_pdf(self, filepath): 345 | """Complete PDF processing pipeline""" 346 | text = self.extract_from_pdf(filepath) 347 | text = self.clean_text(text) 348 | chunks = self.chunk_text(text, source=filepath) 349 | return chunks 350 | 351 | # Usage 352 | processor = DocumentProcessor(chunk_size=200, overlap=20) 353 | chunks = processor.process_pdf("document.pdf") 354 | print(f"Processed {len(chunks)} chunks") 355 | ``` 356 | 357 | --- 358 | 359 | ## 4. Student Practice Tasks 360 | 361 | ### Task 1: PDF Extractor 362 | Write a function that extracts text from a PDF and returns: 363 | - Full text 364 | - Number of pages 365 | - Text per page (as a list) 366 | 367 | ### Task 2: Web Scraper 368 | Create a web scraper that: 369 | - Takes a URL 370 | - Extracts main content (removes nav, ads, etc.) 371 | - Returns clean text 372 | - Handles errors gracefully 373 | 374 | ### Task 3: Chunking Functions 375 | Implement three chunking strategies: 376 | - Fixed-size chunking 377 | - Sentence-aware chunking 378 | - Paragraph-aware chunking 379 | 380 | Compare results on the same text. 
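For Task 3, the paragraph-aware strategy (described in Section 2.4.3) is the one variant not shown in the examples above, so here is a minimal starter sketch. It assumes paragraphs are separated by blank lines; the function name and the greedy size-based grouping are illustrative choices, not code from this roadmap.

```python
def chunk_text_paragraphs(text, chunk_size=500):
    """Chunk text at paragraph boundaries (greedy grouping by size)."""
    # Assumes paragraphs are separated by blank lines
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks = []
    current = []
    current_size = 0

    for para in paragraphs:
        # Close the current chunk if adding this paragraph would overflow it
        if current and current_size + len(para) > chunk_size:
            chunks.append("\n\n".join(current))
            current, current_size = [], 0
        current.append(para)
        current_size += len(para) + 2  # +2 for the paragraph separator

    if current:
        chunks.append("\n\n".join(current))

    return chunks

# Usage
sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
print(chunk_text_paragraphs(sample, chunk_size=40))
```

Note that a single paragraph longer than `chunk_size` still becomes its own oversized chunk, which matches the size-variability caveat from Section 2.4.3.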
381 | 382 | ### Task 4: Text Cleaner 383 | Write a comprehensive text cleaning function that: 384 | - Removes extra whitespace 385 | - Handles encoding issues 386 | - Removes headers/footers (if patterns detected) 387 | - Normalizes line breaks 388 | 389 | ### Task 5: Chunk Metadata 390 | Enhance chunking to include rich metadata: 391 | - Source file 392 | - Page number (for PDFs) 393 | - Character positions 394 | - Word count 395 | - Timestamp 396 | 397 | ### Task 6: Multi-Format Processor 398 | Create a processor that handles: 399 | - PDF files 400 | - Text files 401 | - Web URLs 402 | - Returns standardized chunk format 403 | 404 | --- 405 | 406 | ## 5. Summary / Key Takeaways 407 | 408 | - **Data extraction** is the first step in RAG pipelines 409 | - **PDF extraction** requires libraries like `pypdf` or `pdfplumber` 410 | - **Web scraping** uses `requests` and `BeautifulSoup` 411 | - **Chunking strategies** vary: fixed-size, sentence-aware, paragraph-aware 412 | - **Overlapping chunks** preserve context across boundaries 413 | - **Metadata** is crucial for tracking chunk sources 414 | - **Text cleaning** improves chunk quality 415 | - **Chunk size** should match your LLM's context window 416 | - **Different sources** require different extraction methods 417 | 418 | --- 419 | 420 | ## 6. Further Reading (Optional) 421 | 422 | - PyPDF2/PyPDF documentation 423 | - BeautifulSoup documentation 424 | - LangChain document loaders 425 | - LlamaIndex data connectors 426 | - Text chunking best practices 427 | 428 | --- 429 | 430 | **Next up:** Day 5 will teach you about embeddings and vector databases! 431 | 432 | -------------------------------------------------------------------------------- /Day-09: Advanced RAG (Reranking, Query Rewriting, Fusion)/README.md: -------------------------------------------------------------------------------- 1 | # Day 9 — Advanced RAG (Reranking, Query Rewriting, Fusion) 2 | 3 | ## 1. Beginner-Friendly Introduction 4 | 5 | Today, you'll learn advanced techniques that make RAG systems production-ready! These techniques improve answer quality, handle complex queries, and make your system more robust. 6 | 7 | **What are Advanced RAG Techniques?** 8 | - **Reranking**: Improve retrieval by reordering results 9 | - **Query Rewriting**: Transform queries for better retrieval 10 | - **Fusion**: Combine multiple retrieval strategies 11 | - **Hybrid Search**: Mix semantic and keyword search 12 | 13 | **Why these matter:** 14 | - Basic RAG works, but advanced techniques make it better 15 | - Production systems need these optimizations 16 | - Handle edge cases and complex queries 17 | - Improve answer accuracy and relevance 18 | 19 | **Real-world context:** 20 | Think of basic RAG as a simple search engine. Advanced RAG is like Google—it uses multiple strategies, reranks results, understands query intent, and combines different signals to give you the best answer. 21 | 22 | --- 23 | 24 | ## 2. Deep-Dive Explanation 25 | 26 | ### 2.1 Reranking 27 | 28 | **What is Reranking?** 29 | After retrieving initial results, rerank them using a more sophisticated model to improve order. 
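To see the shape of this step in code before any models are involved, here is a hedged sketch using a deliberately crude custom scorer based on term overlap; the function name is illustrative, and Example 3 in Section 3 swaps this scorer for a real cross-encoder.

```python
def rerank_by_term_overlap(query, documents, top_k=5):
    """Toy reranker: rescore candidates against the query, then reorder."""
    query_terms = set(query.lower().split())

    def score(doc):
        # Fraction of query terms that appear in the candidate document
        doc_terms = set(doc.lower().split())
        return len(query_terms & doc_terms) / max(len(query_terms), 1)

    # Rescore the whole candidate set, keep only the best top_k
    return sorted(documents, key=score, reverse=True)[:top_k]

# Usage: take the top 10 from initial retrieval, keep the best 5 after reranking
# candidates = initial_retrieval_results  # e.g. 10 document strings
# best = rerank_by_term_overlap("What is Python used for?", candidates)
```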
30 | 31 | **Why Rerank?** 32 | - Initial retrieval may miss subtle relevance 33 | - Reranking models are more accurate 34 | - Better final results 35 | 36 | **How it works:** 37 | ``` 38 | Initial Retrieval (Top 10) → Reranking Model → Reordered Top 5 39 | ``` 40 | 41 | **Reranking Models:** 42 | - Cross-encoders (BERT-based) 43 | - Specialized reranking models 44 | - Custom scoring functions 45 | 46 | **Benefits:** 47 | - Better top results 48 | - Improved answer quality 49 | - More relevant context 50 | 51 | ### 2.2 Query Rewriting 52 | 53 | **What is Query Rewriting?** 54 | Transform user queries to improve retrieval: 55 | - Expand queries with synonyms 56 | - Generate multiple query variations 57 | - Reformulate for better matching 58 | - Extract key terms 59 | 60 | **Techniques:** 61 | 1. **Query Expansion**: Add related terms 62 | 2. **Query Decomposition**: Break into sub-queries 63 | 3. **Query Reformulation**: Rephrase for clarity 64 | 4. **Hybrid Queries**: Combine semantic + keyword 65 | 66 | **Example:** 67 | ``` 68 | Original: "How to train model?" 69 | Rewritten: ["How to train machine learning model?", "model training process", "train ML model"] 70 | ``` 71 | 72 | ### 2.3 Fusion Techniques 73 | 74 | **What is Fusion?** 75 | Combine results from multiple retrieval strategies: 76 | - Different embedding models 77 | - Keyword + semantic search 78 | - Multiple query variations 79 | - Different chunk sizes 80 | 81 | **Fusion Methods:** 82 | 1. **Reciprocal Rank Fusion (RRF)**: Combine rankings 83 | 2. **Weighted Fusion**: Weight different sources 84 | 3. **Deduplication**: Remove duplicates 85 | 4. **Score Normalization**: Normalize before combining 86 | 87 | **Benefits:** 88 | - More comprehensive retrieval 89 | - Better coverage 90 | - Reduces misses 91 | 92 | ### 2.4 Hybrid Search 93 | 94 | **What is Hybrid Search?** 95 | Combine semantic search (embeddings) with keyword search (BM25/TF-IDF): 96 | - Semantic: Understands meaning 97 | - Keyword: Exact matches 98 | - Best of both worlds 99 | 100 | **Implementation:** 101 | ``` 102 | Query → [Semantic Search] → Results 1 103 | → [Keyword Search] → Results 2 104 | → [Fusion] → Final Results 105 | ``` 106 | 107 | ### 2.5 Advanced RAG Pipeline 108 | 109 | **Enhanced Pipeline:** 110 | ``` 111 | User Query 112 | ↓ 113 | Query Rewriting (multiple variations) 114 | ↓ 115 | Multiple Retrieval Strategies 116 | ↓ 117 | Fusion (combine results) 118 | ↓ 119 | Reranking (improve order) 120 | ↓ 121 | Top K Chunks 122 | ↓ 123 | Augmentation + Generation 124 | ↓ 125 | Answer 126 | ``` 127 | 128 | --- 129 | 130 | ## 3. Instructor Examples 131 | 132 | ### Example 1: Query Rewriting 133 | 134 | ```python 135 | import openai 136 | 137 | def rewrite_query(original_query): 138 | """Generate query variations""" 139 | prompt = f"""Generate 3 different ways to ask this question that might retrieve different relevant documents. 140 | 141 | Original question: {original_query} 142 | 143 | Generate 3 variations, each on a new line:""" 144 | 145 | response = openai.ChatCompletion.create( 146 | model="gpt-3.5-turbo", 147 | messages=[{"role": "user", "content": prompt}], 148 | temperature=0.7, 149 | max_tokens=150 150 | ) 151 | 152 | variations = response.choices[0].message.content.strip().split("\n") 153 | variations = [v.strip("- ").strip() for v in variations if v.strip()] 154 | 155 | return [original_query] + variations 156 | 157 | # Usage 158 | query = "How does machine learning work?" 
159 | variations = rewrite_query(query) 160 | print(variations) 161 | # Output: [ 162 | # "How does machine learning work?", 163 | # "What is the process of machine learning?", 164 | # "How do ML algorithms learn from data?", 165 | # "Explain the mechanism of machine learning" 166 | # ] 167 | ``` 168 | 169 | ### Example 2: Reciprocal Rank Fusion 170 | 171 | ```python 172 | def reciprocal_rank_fusion(results_list, k=60): 173 | """ 174 | Combine multiple ranked result lists using RRF 175 | 176 | Args: 177 | results_list: List of result lists, each with (doc_id, score) 178 | k: RRF constant (typically 60) 179 | """ 180 | doc_scores = {} 181 | 182 | for results in results_list: 183 | for rank, (doc_id, score) in enumerate(results, 1): 184 | if doc_id not in doc_scores: 185 | doc_scores[doc_id] = 0 186 | doc_scores[doc_id] += 1 / (k + rank) 187 | 188 | # Sort by score 189 | fused_results = sorted( 190 | doc_scores.items(), 191 | key=lambda x: x[1], 192 | reverse=True 193 | ) 194 | 195 | return fused_results 196 | 197 | # Usage 198 | results1 = [("doc1", 0.9), ("doc2", 0.8), ("doc3", 0.7)] 199 | results2 = [("doc2", 0.85), ("doc1", 0.82), ("doc4", 0.75)] 200 | results3 = [("doc3", 0.88), ("doc1", 0.80), ("doc5", 0.72)] 201 | 202 | fused = reciprocal_rank_fusion([results1, results2, results3]) 203 | print(fused) 204 | # Output (approx.): [("doc1", 0.049), ("doc2", 0.033), ("doc3", 0.032), ...] 205 | ``` 206 | 207 | ### Example 3: Reranking with Cross-Encoder 208 | 209 | ```python 210 | from sentence_transformers import CrossEncoder 211 | 212 | class Reranker: 213 | def __init__(self, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"): 214 | self.model = CrossEncoder(model_name) 215 | 216 | def rerank(self, query, documents, top_k=5): 217 | """Rerank documents for a query""" 218 | # Create query-document pairs 219 | pairs = [[query, doc] for doc in documents] 220 | 221 | # Get scores 222 | scores = self.model.predict(pairs) 223 | 224 | # Sort by score 225 | ranked_indices = sorted( 226 | range(len(scores)), 227 | key=lambda i: scores[i], 228 | reverse=True 229 | ) 230 | 231 | # Return top K 232 | reranked = [ 233 | { 234 | "document": documents[i], 235 | "score": float(scores[i]), 236 | "rank": rank + 1 237 | } 238 | for rank, i in enumerate(ranked_indices[:top_k]) 239 | ] 240 | 241 | return reranked 242 | 243 | # Usage 244 | reranker = Reranker() 245 | documents = ["Doc 1 text...", "Doc 2 text...", "Doc 3 text..."] 246 | reranked = reranker.rerank("What is Python?", documents, top_k=3) 247 | for item in reranked: 248 | print(f"Rank {item['rank']}: Score {item['score']:.3f}") 249 | ``` 250 | 251 | ### Example 4: Hybrid Search 252 | 253 | ```python 254 | from rank_bm25 import BM25Okapi 255 | import numpy as np 256 | 257 | class HybridSearch: 258 | def __init__(self, documents, embeddings): 259 | self.documents = documents 260 | self.embeddings = embeddings 261 | 262 | # Setup BM25 (keyword search) 263 | tokenized_docs = [doc.split() for doc in documents] 264 | self.bm25 = BM25Okapi(tokenized_docs) 265 | 266 | def semantic_search(self, query_embedding, k=10): 267 | """Semantic search using embeddings""" 268 | similarities = [] 269 | for emb in self.embeddings: 270 | similarity = np.dot(query_embedding, emb) / ( 271 | np.linalg.norm(query_embedding) * np.linalg.norm(emb) 272 | ) 273 | similarities.append(similarity) 274 | 275 | top_indices = np.argsort(similarities)[::-1][:k] 276 | return [(idx, similarities[idx]) for idx in top_indices] 277 | 278 | def keyword_search(self, query, k=10): 279 | """Keyword 
search using BM25""" 280 | tokenized_query = query.split() 281 | scores = self.bm25.get_scores(tokenized_query) 282 | top_indices = np.argsort(scores)[::-1][:k] 283 | return [(idx, scores[idx]) for idx in top_indices] 284 | 285 | def hybrid_search(self, query, query_embedding, k=5, alpha=0.5): 286 | """Combine semantic and keyword search""" 287 | # Get results from both 288 | semantic_results = self.semantic_search(query_embedding, k*2) 289 | keyword_results = self.keyword_search(query, k*2) 290 | 291 | # Normalize scores 292 | semantic_scores = {idx: score for idx, score in semantic_results} 293 | keyword_scores = {idx: score for idx, score in keyword_results} 294 | 295 | # Normalize to 0-1 range 296 | max_sem = max(semantic_scores.values()) if semantic_scores else 1 297 | max_key = max(keyword_scores.values()) if keyword_scores else 1 298 | 299 | # Combine scores 300 | combined_scores = {} 301 | all_indices = set(semantic_scores.keys()) | set(keyword_scores.keys()) 302 | 303 | for idx in all_indices: 304 | sem_score = semantic_scores.get(idx, 0) / max_sem if max_sem > 0 else 0 305 | key_score = keyword_scores.get(idx, 0) / max_key if max_key > 0 else 0 306 | combined_scores[idx] = alpha * sem_score + (1 - alpha) * key_score 307 | 308 | # Return top K 309 | top_indices = sorted( 310 | combined_scores.items(), 311 | key=lambda x: x[1], 312 | reverse=True 313 | )[:k] 314 | 315 | return top_indices 316 | 317 | # Usage 318 | hybrid = HybridSearch(documents, embeddings) 319 | results = hybrid.hybrid_search("Python programming", query_embedding, k=5) 320 | ``` 321 | 322 | ### Example 5: Complete Advanced RAG Pipeline 323 | 324 | ```python 325 | class AdvancedRAG: 326 | def __init__(self): 327 | self.vector_store = None 328 | self.reranker = Reranker() 329 | # ... other components 330 | 331 | def query(self, question, k=5): 332 | """Advanced RAG with all techniques""" 333 | # 1. Query Rewriting 334 | query_variations = rewrite_query(question) 335 | 336 | # 2. Multiple Retrievals 337 | all_results = [] 338 | for query_var in query_variations: 339 | results = self.vector_store.search(query_var, k=k*2) 340 | all_results.append(results) 341 | 342 | # 3. Fusion 343 | fused_results = reciprocal_rank_fusion(all_results) 344 | 345 | # 4. Reranking 346 | top_docs = [self.vector_store.get_doc(doc_id) for doc_id, _ in fused_results[:k*2]] 347 | reranked = self.reranker.rerank(question, top_docs, top_k=k) 348 | 349 | # 5. Augment and Generate 350 | context = "\n\n".join([item["document"] for item in reranked]) 351 | answer = self.generate_answer(question, context) 352 | 353 | return { 354 | "answer": answer, 355 | "sources": reranked 356 | } 357 | ``` 358 | 359 | --- 360 | 361 | ## 4. 
Student Practice Tasks 362 | 363 | ### Task 1: Query Rewriting 364 | Implement query rewriting that: 365 | - Generates 3-5 query variations 366 | - Uses LLM to create variations 367 | - Tests if variations improve retrieval 368 | 369 | ### Task 2: Reranking 370 | Add reranking to your RAG system: 371 | - Use a reranking model 372 | - Compare results before/after reranking 373 | - Measure improvement 374 | 375 | ### Task 3: Fusion 376 | Implement fusion: 377 | - Combine results from multiple queries 378 | - Use RRF or weighted fusion 379 | - Compare fused vs single retrieval 380 | 381 | ### Task 4: Hybrid Search 382 | Build hybrid search: 383 | - Combine semantic + keyword search 384 | - Tune the alpha parameter 385 | - Compare with pure semantic search 386 | 387 | ### Task 5: Complete Advanced Pipeline 388 | Combine all techniques: 389 | - Query rewriting 390 | - Multiple retrievals 391 | - Fusion 392 | - Reranking 393 | - Generation 394 | 395 | ### Task 6: Evaluation 396 | Evaluate advanced techniques: 397 | - Compare answer quality 398 | - Measure retrieval improvement 399 | - Test with various queries 400 | 401 | --- 402 | 403 | ## 5. Summary / Key Takeaways 404 | 405 | - **Reranking** improves result order using sophisticated models 406 | - **Query rewriting** generates variations for better retrieval 407 | - **Fusion** combines multiple retrieval strategies 408 | - **Hybrid search** mixes semantic and keyword search 409 | - **Advanced techniques** significantly improve RAG quality 410 | - **Production systems** benefit from these optimizations 411 | - **Trade-offs**: More complexity vs better results 412 | - **Evaluation** is crucial to measure improvements 413 | - **Combining techniques** often works best 414 | - **Start simple**, add complexity as needed 415 | 416 | --- 417 | 418 | ## 6. Further Reading (Optional) 419 | 420 | - "Reciprocal Rank Fusion" paper 421 | - Cross-encoder models documentation 422 | - BM25 algorithm explanation 423 | - RAG evaluation metrics 424 | - Production RAG best practices 425 | 426 | --- 427 | 428 | **Next up:** Day 10 - Build and deploy a complete RAG application! 429 | 430 | --------------------------------------------------------------------------------