├── .gitignore ├── 01-intro ├── README.md ├── documents-llm.json ├── documents.json ├── elastic-search.md ├── open-ai-alternatives.md ├── parse-faq.ipynb └── rag-intro.ipynb ├── 02-vector-search └── README.md ├── 03-evaluation └── README.md ├── 04-monitoring └── README.md ├── 05-best-practices ├── README.md ├── documents-with-ids.json ├── ground-truth-data.csv ├── hybrid-search-and-reranking-es.ipynb ├── hybrid-search-langchain.ipynb └── llm-zoomcamp-best-practicies.pdf ├── 06-project-example ├── README.md └── content-processing-summary.md ├── README.md ├── after-sign-up.md ├── asking-questions.md ├── awesome-llms.md ├── cohorts ├── 2024 │ ├── 01-intro │ │ └── homework.md │ ├── 02-open-source │ │ ├── README.md │ │ ├── docker-compose.yaml │ │ ├── homework.md │ │ ├── huggingface-flan-t5.ipynb │ │ ├── huggingface-mistral-7b.ipynb │ │ ├── huggingface-phi3.ipynb │ │ ├── ollama.ipynb │ │ ├── prompt.md │ │ ├── qa_faq.py │ │ ├── rag-intro.ipynb │ │ ├── serving-hugging-face-models.md │ │ └── starter.ipynb │ ├── 03-vector-search │ │ ├── README.md │ │ ├── demo_es.ipynb │ │ ├── homework.md │ │ └── homework_solution.ipynb │ ├── 04-monitoring │ │ ├── README.md │ │ ├── app │ │ │ ├── .env │ │ │ ├── Dockerfile │ │ │ ├── README.MD │ │ │ ├── app.py │ │ │ ├── assistant.py │ │ │ ├── db.py │ │ │ ├── docker-compose.yaml │ │ │ ├── generate_data.py │ │ │ ├── prep.py │ │ │ └── requirements.txt │ │ ├── code.md │ │ ├── dashboard.json │ │ ├── data │ │ │ ├── evaluations-aqa.csv │ │ │ ├── evaluations-qa.csv │ │ │ ├── results-gpt35-cosine.csv │ │ │ ├── results-gpt35.csv │ │ │ ├── results-gpt4o-cosine.csv │ │ │ ├── results-gpt4o-mini-cosine.csv │ │ │ ├── results-gpt4o-mini.csv │ │ │ └── results-gpt4o.csv │ │ ├── grafana.md │ │ ├── homework.md │ │ ├── offline-rag-evaluation.ipynb │ │ └── solution.ipynb │ ├── 05-orchestration │ │ ├── README.md │ │ ├── code │ │ │ └── 06_retrieval.py │ │ ├── homework.md │ │ └── parse-faq-llm.ipynb │ ├── README.md │ ├── competition │ │ ├── README.md │ │ ├── data │ │ │ ├── test.csv │ │ │ └── train.csv │ │ ├── scorer.py │ │ ├── starter_notebook.ipynb │ │ └── starter_notebook_submission.csv │ ├── project.md │ └── workshops │ │ └── dlt.md └── 2025 │ ├── 01-intro │ └── homework.md │ ├── README.md │ ├── course-launch-stream-summary.md │ ├── pre-course-q-a-stream-summary.md │ └── project.md ├── etc └── chunking.md ├── images ├── llm-zoomcamp-2025.jpg ├── llm-zoomcamp.jpg ├── qdrant.png └── saturn-cloud.png ├── learning-in-public.md └── project.md /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints/ 2 | __pycache__/ 3 | .venv 4 | .envrc 5 | -------------------------------------------------------------------------------- /01-intro/README.md: -------------------------------------------------------------------------------- 1 | # Module 1: Introduction 2 | 3 | In this module, we will learn what LLM and RAG are and 4 | implement a simple RAG pipeline to answer questions about 5 | the FAQ Documents from our Zoomcamp courses 6 | 7 | What we will do: 8 | 9 | * Index Zoomcamp FAQ documents 10 | * DE Zoomcamp: https://docs.google.com/document/d/19bnYs80DwuUimHM65UV3sylsCn2j1vziPOwzBwQrebw/edit 11 | * ML Zoomcamp: https://docs.google.com/document/d/1LpPanc33QJJ6BSsyxVg-pWNMplal84TdZtq10naIhD8/edit 12 | * MLOps Zoomcamp: https://docs.google.com/document/d/12TlBfhIiKtyBv8RnsoJR6F72bkPDGEvPOItJIxaEzE0/edit 13 | * Create a Q&A system for answering questions about these documents 14 | 15 | ## 1.1 Introduction to LLM and RAG 16 | 17 | 18 | 19 | 20 | 21 | 
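At a high level, the RAG flow we build in this module has three steps: retrieve the most relevant FAQ records, put them into a prompt, and ask an LLM to generate the answer. Below is a rough preview of that loop, assuming the minsearch index from 1.3 and the OpenAI client from 1.4 — the model name, boosts, and prompt wording are only illustrative; the full code is in [rag-intro.ipynb](rag-intro.ipynb):

```python
import minsearch
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# `documents` is the list of FAQ records produced by parse-faq.ipynb
index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"],
)
index.fit(documents)

def rag(query):
    # 1. Retrieval: find the FAQ records most relevant to the query
    results = index.search(
        query,
        filter_dict={"course": "data-engineering-zoomcamp"},
        boost_dict={"question": 3.0, "section": 0.5},
        num_results=5,
    )

    # 2. Prompt building: put the retrieved records into the context
    context = "\n\n".join(f"Q: {d['question']}\nA: {d['text']}" for d in results)
    prompt = (
        "Answer the QUESTION using only the facts from the CONTEXT.\n\n"
        f"QUESTION: {query}\n\nCONTEXT:\n{context}"
    )

    # 3. Generation: ask the LLM to answer based on the prompt
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```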
* LLM 22 | * RAG 23 | * RAG architecture 24 | * Course outcome 25 | 26 | 27 | ## 1.2 Preparing the Environment 28 | 29 | 30 | 31 | 32 | 33 | * Installing libraries 34 | * Alternative: installing anaconda or miniconda 35 | 36 | ```bash 37 | pip install tqdm notebook==7.1.2 openai elasticsearch==8.13.0 pandas scikit-learn ipywidgets 38 | ``` 39 | 40 | ## 1.3 Retrieval 41 | 42 | 43 | 44 | 45 | 46 | Note: as of now, you can install minsearch with pip: 47 | 48 | ```bash 49 | pip install minsearch 50 | ``` 51 | 52 | * We will use the search engine we build in the [build-your-own-search-engine workshop](https://github.com/alexeygrigorev/build-your-own-search-engine): [minsearch](https://github.com/alexeygrigorev/minsearch) 53 | * Indexing the documents 54 | * Peforming the search 55 | 56 | 57 | ## 1.4 Generation with OpenAI 58 | 59 | 60 | 61 | 62 | 63 | * Invoking OpenAI API 64 | * Building the prompt 65 | * Getting the answer 66 | 67 | 68 | If you don't want to use a service, you can run an LLM locally 69 | refer to [module 2](../02-open-source/) for more details. 70 | 71 | In particular, check "2.7 Ollama - Running LLMs on a CPU" - 72 | it can work with OpenAI API, so to make the example from 1.4 73 | work locally, you only need to change a few lines of code. 74 | 75 | 76 | ## 1.4.2 OpenAI API Alternatives 77 | 78 | 79 | 80 | 81 | 82 | [Open AI Alternatives](../awesome-llms.md#openai-api-alternatives) 83 | 84 | 85 | ## 1.5 Cleaned RAG flow 86 | 87 | 88 | 89 | 90 | 91 | * Cleaning the code we wrote so far 92 | * Making it modular 93 | 94 | ## 1.6 Searching with ElasticSearch 95 | 96 | 97 | 98 | 99 | 100 | * Run ElasticSearch with Docker 101 | * Index the documents 102 | * Replace MinSearch with ElasticSearch 103 | 104 | Running ElasticSearch: 105 | 106 | ```bash 107 | docker run -it \ 108 | --rm \ 109 | --name elasticsearch \ 110 | -m 4GB \ 111 | -p 9200:9200 \ 112 | -p 9300:9300 \ 113 | -e "discovery.type=single-node" \ 114 | -e "xpack.security.enabled=false" \ 115 | docker.elastic.co/elasticsearch/elasticsearch:8.4.3 116 | ``` 117 | 118 | If the previous command doesn't work (i.e. you see "error pulling image configuration"), try to run ElasticSearch directly from Docker Hub: 119 | 120 | ```bash 121 | docker run -it \ 122 | --rm \ 123 | --name elasticsearch \ 124 | -p 9200:9200 \ 125 | -p 9300:9300 \ 126 | -e "discovery.type=single-node" \ 127 | -e "xpack.security.enabled=false" \ 128 | elasticsearch:8.4.3 129 | ``` 130 | 131 | Index settings: 132 | 133 | ```python 134 | { 135 | "settings": { 136 | "number_of_shards": 1, 137 | "number_of_replicas": 0 138 | }, 139 | "mappings": { 140 | "properties": { 141 | "text": {"type": "text"}, 142 | "section": {"type": "text"}, 143 | "question": {"type": "text"}, 144 | "course": {"type": "keyword"} 145 | } 146 | } 147 | } 148 | ``` 149 | 150 | Query: 151 | 152 | ```python 153 | { 154 | "size": 5, 155 | "query": { 156 | "bool": { 157 | "must": { 158 | "multi_match": { 159 | "query": query, 160 | "fields": ["question^3", "text", "section"], 161 | "type": "best_fields" 162 | } 163 | }, 164 | "filter": { 165 | "term": { 166 | "course": "data-engineering-zoomcamp" 167 | } 168 | } 169 | } 170 | } 171 | } 172 | ``` 173 | 174 | We use `"type": "best_fields"`. You can read more about 175 | different types of `multi_match` search in [elastic-search.md](elastic-search.md). 176 | 177 | # 1.7 Homework 178 | More information [here](../cohorts/2025/01-intro/homework.md). 
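For reference, this is roughly how the index settings and the `best_fields` query from 1.6 come together with the official Python client. The index name and the hard-coded course filter below are illustrative; see [rag-intro.ipynb](rag-intro.ipynb) for the version used in the videos:

```python
# A sketch of indexing and querying the FAQ with the Python Elasticsearch client.
from elasticsearch import Elasticsearch

es_client = Elasticsearch("http://localhost:9200")

index_name = "course-questions"
index_settings = {
    "settings": {"number_of_shards": 1, "number_of_replicas": 0},
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
        }
    },
}
es_client.indices.create(index=index_name, body=index_settings)

# Index every parsed FAQ record (see parse-faq.ipynb for how `documents` is built)
for doc in documents:
    es_client.index(index=index_name, document=doc)

def elastic_search(query):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields",
                    }
                },
                "filter": {"term": {"course": "data-engineering-zoomcamp"}},
            }
        },
    }
    response = es_client.search(index=index_name, body=search_query)
    return [hit["_source"] for hit in response["hits"]["hits"]]
```

Swapping minsearch for Elasticsearch only changes the search function — the prompt building and generation parts of the RAG flow stay the same.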
179 | 180 | 181 | # Extra materials 182 | 183 | * If you're curious to know how the code for parsing the FAQ works, check [this video](https://www.loom.com/share/ff54d898188b402d880dbea2a7cb8064) 184 | 185 | # Open-Source LLMs (optional) 186 | 187 | It's also possible to run LLMs locally. For that, we 188 | can use Ollama. Check these videos from LLM Zoomcamp 2024 189 | if you're interested in learning more about it: 190 | 191 | * [Ollama - Running LLMs on a CPU](https://www.youtube.com/watch?v=PVpBGs_iSjY&list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R) 192 | * [Ollama & Phi3 + Elastic in Docker-Compose](https://www.youtube.com/watch?v=4juoo_jk96U&list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R) 193 | * [UI for RAG](https://www.youtube.com/watch?v=R6L8PZ-7bGo&list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R) 194 | 195 | To see the command lines used in the videos, 196 | see [2024 cohort folder](../cohorts/2024/02-open-source#27-ollama---running-llms-on-a-cpu) 197 | 198 | # Notes 199 | 200 | * [Notes by slavaheroes](https://github.com/slavaheroes/llm-zoomcamp/blob/homeworks/01-intro/notes.md) 201 | * [Notes by Pham Nguyen Hung](https://hung.bearblog.dev/llm-zoomcamp-1-rag/) 202 | * [Notes by dimzachar](https://github.com/dimzachar/llm_zoomcamp/blob/master/notes/01-intro/README.md) 203 | * [Notes by Olawale Ogundeji](https://github.com/presiZHai/LLM-Zoomcamp/blob/main/01-intro/notes.md) 204 | * [Notes by Uchechukwu](https://medium.com/@njokuuchechi/an-intro-to-large-language-models-llms-0c51c09abe10) 205 | * [Notes by Kamal](https://github.com/mk-hassan/llm-zoomcamp/blob/main/Module-1%3A%20Introduction%20to%20LLMs%20and%20RAG/README.md) 206 | * [Notes by Marat](https://machine-mind-ml.medium.com/discovering-semantic-search-and-rag-with-large-language-models-be7d9ba5bef4) 207 | * [Notes by Waleed](https://waleedayoub.com/post/llmzoomcamp_week1-intro_notes/) 208 | * Did you take notes? Add them above this line (Send a PR with *links* to your notes) 209 | -------------------------------------------------------------------------------- /01-intro/elastic-search.md: -------------------------------------------------------------------------------- 1 | # Elastic Search 2 | 3 | This document contains useful things about Elasticsearch 4 | 5 | # `multi_match` Query in Elasticsearch 6 | 7 | The `multi_match` query is used to search for a given text across multiple fields in an Elasticsearch index. 8 | 9 | It provides various types to control how the matching is executed and scored. 10 | 11 | There are multiple types of `multi_match` queries: 12 | 13 | - `best_fields`: Returns the highest score from any one field. 14 | - `most_fields`: Combines the scores from all fields. 15 | - `cross_fields`: Treats fields as one big field for scoring. 16 | - `phrase`: Searches for the query as an exact phrase. 17 | - `phrase_prefix`: Searches for the query as a prefix of a phrase. 18 | 19 | 20 | ## `best_fields` 21 | 22 | The `best_fields` type searches each field separately and returns the highest score from any one of the fields. 23 | 24 | This type is useful when you want to find documents where at least one field matches the query well. 25 | 26 | 27 | ```json 28 | { 29 | "size": 5, 30 | "query": { 31 | "bool": { 32 | "must": { 33 | "multi_match": { 34 | "query": "How do I run docker on Windows?", 35 | "fields": ["question", "text"], 36 | "type": "best_fields" 37 | } 38 | } 39 | } 40 | } 41 | } 42 | ``` 43 | 44 | ## `most_fields` 45 | 46 | The `most_fields` type searches each field and combines the scores from all fields. 
47 | 48 | This is useful when the relevance of a document increases with more matching fields. 49 | 50 | ```json 51 | { 52 | "multi_match": { 53 | "query": "How do I run docker on Windows?", 54 | "fields": ["question^4", "text"], 55 | "type": "most_fields" 56 | } 57 | } 58 | ``` 59 | 60 | ## `cross_fields` 61 | 62 | The `cross_fields` type treats fields as though they were one big field. 63 | 64 | It is suitable for cases where you have fields representing the same text in different ways, such as synonyms. 65 | 66 | ```json 67 | { 68 | "multi_match": { 69 | "query": "How do I run docker on Windows?", 70 | "fields": ["question", "text"], 71 | "type": "cross_fields" 72 | } 73 | } 74 | ``` 75 | 76 | ## `phrase` 77 | 78 | The `phrase` type looks for the query as an exact phrase within the fields. 79 | 80 | It is useful for exact match searches. 81 | 82 | ```json 83 | { 84 | "multi_match": { 85 | "query": "How do I run docker on Windows?", 86 | "fields": ["question", "text"], 87 | "type": "phrase" 88 | } 89 | } 90 | ``` 91 | 92 | ## `phrase_prefix` 93 | 94 | The `phrase_prefix` type searches for documents that contain the query as a prefix of a phrase. 95 | 96 | This is useful for autocomplete or typeahead functionality. 97 | 98 | 99 | ```json 100 | { 101 | "multi_match": { 102 | "query": "How do I run docker on Windows?", 103 | "fields": ["question", "text"], 104 | "type": "phrase_prefix" 105 | } 106 | } 107 | ``` -------------------------------------------------------------------------------- /01-intro/open-ai-alternatives.md: -------------------------------------------------------------------------------- 1 | moved [here](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/awesome-llms.md#openai-api-alternatives) 2 | -------------------------------------------------------------------------------- /01-intro/parse-faq.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 8, 6 | "id": "4cd1eaa8-3424-41ad-9cf2-3e8548712865", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import io\n", 11 | "\n", 12 | "import requests\n", 13 | "import docx" 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": 24, 19 | "id": "8180e7e4-b90d-4900-a59b-d22e5d6537c4", 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "def clean_line(line):\n", 24 | " line = line.strip()\n", 25 | " line = line.strip('\\uFEFF')\n", 26 | " return line\n", 27 | "\n", 28 | "def read_faq(file_id):\n", 29 | " url = f'https://docs.google.com/document/d/{file_id}/export?format=docx'\n", 30 | " \n", 31 | " response = requests.get(url)\n", 32 | " response.raise_for_status()\n", 33 | " \n", 34 | " with io.BytesIO(response.content) as f_in:\n", 35 | " doc = docx.Document(f_in)\n", 36 | "\n", 37 | " questions = []\n", 38 | "\n", 39 | " question_heading_style = 'heading 2'\n", 40 | " section_heading_style = 'heading 1'\n", 41 | " \n", 42 | " heading_id = ''\n", 43 | " section_title = ''\n", 44 | " question_title = ''\n", 45 | " answer_text_so_far = ''\n", 46 | " \n", 47 | " for p in doc.paragraphs:\n", 48 | " style = p.style.name.lower()\n", 49 | " p_text = clean_line(p.text)\n", 50 | " \n", 51 | " if len(p_text) == 0:\n", 52 | " continue\n", 53 | " \n", 54 | " if style == section_heading_style:\n", 55 | " section_title = p_text\n", 56 | " continue\n", 57 | " \n", 58 | " if style == question_heading_style:\n", 59 | " answer_text_so_far = answer_text_so_far.strip()\n", 60 | " if 
answer_text_so_far != '' and section_title != '' and question_title != '':\n", 61 | " questions.append({\n", 62 | " 'text': answer_text_so_far,\n", 63 | " 'section': section_title,\n", 64 | " 'question': question_title,\n", 65 | " })\n", 66 | " answer_text_so_far = ''\n", 67 | " \n", 68 | " question_title = p_text\n", 69 | " continue\n", 70 | " \n", 71 | " answer_text_so_far += '\\n' + p_text\n", 72 | " \n", 73 | " answer_text_so_far = answer_text_so_far.strip()\n", 74 | " if answer_text_so_far != '' and section_title != '' and question_title != '':\n", 75 | " questions.append({\n", 76 | " 'text': answer_text_so_far,\n", 77 | " 'section': section_title,\n", 78 | " 'question': question_title,\n", 79 | " })\n", 80 | "\n", 81 | " return questions" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 25, 87 | "id": "7d3c2dd7-f64a-4dc7-a4e3-3e8aadfa720f", 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "faq_documents = {\n", 92 | " 'data-engineering-zoomcamp': '19bnYs80DwuUimHM65UV3sylsCn2j1vziPOwzBwQrebw',\n", 93 | " 'machine-learning-zoomcamp': '1LpPanc33QJJ6BSsyxVg-pWNMplal84TdZtq10naIhD8',\n", 94 | " 'mlops-zoomcamp': '12TlBfhIiKtyBv8RnsoJR6F72bkPDGEvPOItJIxaEzE0',\n", 95 | "}" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 27, 101 | "id": "f94efe26-05e8-4ae5-a0fa-0a8e16852816", 102 | "metadata": {}, 103 | "outputs": [ 104 | { 105 | "name": "stdout", 106 | "output_type": "stream", 107 | "text": [ 108 | "data-engineering-zoomcamp\n", 109 | "machine-learning-zoomcamp\n", 110 | "mlops-zoomcamp\n" 111 | ] 112 | } 113 | ], 114 | "source": [ 115 | "documents = []\n", 116 | "\n", 117 | "for course, file_id in faq_documents.items():\n", 118 | " print(course)\n", 119 | " course_documents = read_faq(file_id)\n", 120 | " documents.append({'course': course, 'documents': course_documents})" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 29, 126 | "id": "06b8d8be-f656-4cc3-893f-b159be8fda21", 127 | "metadata": {}, 128 | "outputs": [], 129 | "source": [ 130 | "import json" 131 | ] 132 | }, 133 | { 134 | "cell_type": "code", 135 | "execution_count": 32, 136 | "id": "30d50bc1-8d26-44ee-8734-cafce05e0523", 137 | "metadata": {}, 138 | "outputs": [], 139 | "source": [ 140 | "with open('documents.json', 'wt') as f_out:\n", 141 | " json.dump(documents, f_out, indent=2)" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 33, 147 | "id": "0eabb1c6-5cc6-4d4d-a6da-e27d41cea546", 148 | "metadata": {}, 149 | "outputs": [ 150 | { 151 | "name": "stdout", 152 | "output_type": "stream", 153 | "text": [ 154 | "[\n", 155 | " {\n", 156 | " \"course\": \"data-engineering-zoomcamp\",\n", 157 | " \"documents\": [\n", 158 | " {\n", 159 | " \"text\": \"The purpose of this document is to capture frequently asked technical questions\\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. 
The course will start with the first \\u201cOffice Hours'' live.1\\nSubscribe to course public Google Calendar (it works from Desktop only).\\nRegister before the course starts using this link.\\nJoin the course Telegram channel with announcements.\\nDon\\u2019t forget to register in DataTalks.Club's Slack and join the channel.\",\n", 160 | " \"section\": \"General course-related questions\",\n", 161 | " \"question\": \"Course - When will the course start?\"\n", 162 | " },\n", 163 | " {\n" 164 | ] 165 | } 166 | ], 167 | "source": [ 168 | "!head documents.json" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": null, 174 | "id": "1b21af5c-2f6d-49e7-92e9-ca229e2473b9", 175 | "metadata": {}, 176 | "outputs": [], 177 | "source": [] 178 | } 179 | ], 180 | "metadata": { 181 | "kernelspec": { 182 | "display_name": "Python 3 (ipykernel)", 183 | "language": "python", 184 | "name": "python3" 185 | }, 186 | "language_info": { 187 | "codemirror_mode": { 188 | "name": "ipython", 189 | "version": 3 190 | }, 191 | "file_extension": ".py", 192 | "mimetype": "text/x-python", 193 | "name": "python", 194 | "nbconvert_exporter": "python", 195 | "pygments_lexer": "ipython3", 196 | "version": "3.9.13" 197 | } 198 | }, 199 | "nbformat": 4, 200 | "nbformat_minor": 5 201 | } 202 | -------------------------------------------------------------------------------- /02-vector-search/README.md: -------------------------------------------------------------------------------- 1 | # Vector Search 2 | 3 | TBA 4 | 5 | ## Homework 6 | 7 | See [here](../cohorts/2025/02-vector-search/homework.md) 8 | 9 | 10 | # Notes 11 | 12 | * Notes from [2024 edition](../cohorts/2024/03-vector-search/) 13 | * Did you take notes? Add them above this line (Send a PR with *links* to your notes) 14 | -------------------------------------------------------------------------------- /03-evaluation/README.md: -------------------------------------------------------------------------------- 1 | # RAG and LLM Evaluation 2 | 3 | TBA 4 | 5 | ## Homework 6 | 7 | TBA 8 | 9 | # Notes 10 | 11 | * Did you take notes? Add them above this line (Send a PR with *links* to your notes) 12 | -------------------------------------------------------------------------------- /04-monitoring/README.md: -------------------------------------------------------------------------------- 1 | # Module 4: Evaluation and Monitoring 2 | 3 | In this module, we will learn how to evaluate and monitor our LLM and RAG system. 4 | 5 | In the evaluation part, we assess the quality of our entire RAG 6 | system before it goes live. 7 | 8 | In the monitoring part, we collect, store and visualize 9 | metrics to assess the answer quality of a deployed LLM. We also 10 | collect chat history and user feedback. 11 | 12 | 13 | TBA 14 | 15 | # Notes 16 | 17 | * Notes from [2024 edition](../cohorts/2024/04-monitoring/) 18 | * Did you take notes? Add them above this line (Send a PR with *links* to your notes) 19 | -------------------------------------------------------------------------------- /05-best-practices/README.md: -------------------------------------------------------------------------------- 1 | # Module 6: Best practices 2 | 3 | In this module, we'll cover the techniques that could improve your RAG pipeline. 
4 | 5 | ## 6.1 Techniques to Improve RAG Pipeline 6 | 7 | 8 | 9 | 10 | 11 | * Small-to-Big chunk retrieval 12 | * Leveraging document metadata 13 | * Hybrid search 14 | * User query rewriting 15 | * Document reranking 16 | 17 | Links: 18 | * [Slides](llm-zoomcamp-best-practicies.pdf) 19 | * [Five Techniques for Improving RAG Chatbots - Nikita Kozodoi [Video]](https://www.youtube.com/watch?v=xPYmClWk5O8) 20 | * [Survey on RAG techniques [Article]](https://arxiv.org/abs/2312.10997) 21 | 22 | 23 | ## 6.2 Hybrid search 24 | 25 | 26 | 27 | 28 | 29 | * Hybrid search strategy 30 | * Hybrid search in Elasticsearch 31 | 32 | Links: 33 | * [Notebook](hybrid-search-and-reranking-es.ipynb) 34 | * [Hybrid search [Elasticsearch Guide]](https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html#_combine_approximate_knn_with_other_features) 35 | * [Hybrid search [Tutorial]](https://www.elastic.co/search-labs/tutorials/search-tutorial/vector-search/hybrid-search) 36 | 37 | 38 | ## 6.3 Document Reranking 39 | 40 | 41 | 42 | 43 | 44 | * Reranking concept and metrics 45 | * Reciprocal Rank Fusion (RRF) 46 | * Handmade raranking implementation 47 | 48 | Links: 49 | * [Reciprocal Rank Fusion (RRF) method [Elasticsearch Guide]](https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html) 50 | * [RRF method [Article]](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) 51 | * [Elasticsearch subscription plans](https://www.elastic.co/subscriptions) 52 | 53 | We should pull and run a docker container with Elasticsearch 8.9.0 or higher in order to use reranking based on RRF algorithm: 54 | 55 | ```bash 56 | docker run -it \ 57 | --rm \ 58 | --name elasticsearch \ 59 | -m 4GB \ 60 | -p 9200:9200 \ 61 | -p 9300:9300 \ 62 | -e "discovery.type=single-node" \ 63 | -e "xpack.security.enabled=false" \ 64 | docker.elastic.co/elasticsearch/elasticsearch:8.9.0 65 | ``` 66 | 67 | 68 | ## 6.4 Hybrid search with LangChain 69 | 70 | 71 | 72 | 73 | 74 | * LangChain: Introduction 75 | * ElasticsearchRetriever 76 | * Hybrid search implementation 77 | 78 | ```bash 79 | pip install -qU langchain langchain-elasticsearch langchain-huggingface 80 | ``` 81 | 82 | Links: 83 | * [Notebook](hybrid-search-langchain.ipynb) 84 | * [Chatbot Implementation [Tutorial]](https://www.elastic.co/search-labs/tutorials/chatbot-tutorial/implementation) 85 | * [ElasticsearchRetriever](https://python.langchain.com/v0.2/docs/integrations/retrievers/elasticsearch_retriever/) 86 | 87 | 88 | ## Homework 89 | 90 | TBD 91 | 92 | # Notes 93 | 94 | * First link goes here 95 | * Did you take notes? Add them above this line (Send a PR with *links* to your notes) 96 | -------------------------------------------------------------------------------- /05-best-practices/llm-zoomcamp-best-practicies.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/b9d8bd63621736c75fa8d6d0def5ad6656c0b981/05-best-practices/llm-zoomcamp-best-practicies.pdf -------------------------------------------------------------------------------- /06-project-example/README.md: -------------------------------------------------------------------------------- 1 | # 7. 
End-to-End Project Example 2 | 3 | Links: 4 | 5 | * [Project alexeygrigorev/fitness-assistant](https://github.com/alexeygrigorev/fitness-assistant) 6 | * [Project criteria](../project.md#evaluation-criteria) 7 | 8 | 9 | Note: check the final result, it's a bit different 10 | from what we showed in the videos: we further improved it 11 | by doing some small things here and there, like improved 12 | README, code readability, etc. 13 | 14 | 15 | ## 7.1. Fitness assistant project 16 | 17 | 18 | 19 | 20 | 21 | * Generating data for the project 22 | * Setting up the project 23 | * Implementing the initial version of the RAG flow 24 | 25 | ## 7.2. Evaluating retrieval 26 | 27 | 28 | 29 | 30 | 31 | * Preparing the README file 32 | * Generating gold standard evaluation data 33 | * Evaluating retrieval 34 | * Finding the best boosting coefficients 35 | 36 | 37 | ## 7.3 Evaluating RAG 38 | 39 | 40 | 41 | 42 | 43 | * Using LLM-as-a-Judge (type 2) 44 | * Comparing gpt-4o-mini with gpt-4o 45 | 46 | ## 7.4 Interface and ingestion pipeline 47 | 48 | 49 | 50 | 51 | 52 | * Turning the Jupyter notebook into a script 53 | * Creating the ingestion pipeline 54 | * Creating the API interface with Flask 55 | * Improving README 56 | 57 | 58 | ## 7.5 Monitoring and containerization 59 | 60 | 61 | 62 | 63 | 64 | * Creating a Docker image for our application 65 | * Putting everything in docker compose 66 | * Logging all the information for monitoring purposes 67 | 68 | 69 | ## 7.6 Summary and closing remarks 70 | 71 | 72 | 73 | 74 | 75 | * Changes between 7.5 and 7.6 (postgres logging, grafana, cli.py, etc) 76 | * README file improvements 77 | * Total cost of the project (~$2) and how to lower it 78 | * Using generated data for real-life projects 79 | 80 | 81 | ## 7.7 Chunking for longer texts 82 | 83 | 84 | 85 | 86 | 87 | * Different chunking strategies 88 | * [Use cases: multiple articles, one article, slide decks](content-processing-summary.md) 89 | 90 | Links: 91 | 92 | * https://chatgpt.com/share/a4616f6b-43f4-4225-9d03-bb69c723c210 93 | * https://chatgpt.com/share/74217c02-95e6-46ae-b5a5-ca79f9a07084 94 | * https://chatgpt.com/share/8cf0ebde-c53f-4c6f-82ae-c6cc52b2fd0b 95 | 96 | # Notes 97 | 98 | * First link goes here 99 | * Did you take notes? Add them above this line (Send a PR with *links* to your notes) 100 | -------------------------------------------------------------------------------- /06-project-example/content-processing-summary.md: -------------------------------------------------------------------------------- 1 | # Content Processing Cases and Steps 2 | 3 | ## Case: Multiple Articles 4 | 5 | - Assign each article a document id 6 | - Chunk the articles 7 | - Assign each chunk a unique chunk id (could be doc_id + chunk_number) 8 | - Evaluate retrieval: separate hitrate for both doc_id and chunk_id 9 | - Evaluate RAG: LLM as a Judge 10 | - Tuning chunk size: use metrics from Evaluate RAG 11 | 12 | Example JSON structure for a chunk: 13 | ```json 14 | { 15 | "doc_id": "ashdiasdh", 16 | "chunk_id": "ashdiasdh_1", 17 | "text": "actual text" 18 | } 19 | ``` 20 | 21 | ## Case: Single Article / Transcript / Etc.
22 | 23 | Example: the user provides YouTubeID, you initialize the system and now you can talk to it 24 | 25 | - Chunk it 26 | - Evaluation as for multiple articles 27 | 28 | 29 | ## Case: Book or Very Long Form Content 30 | 31 | - Experiment with it 32 | - Each chapter / section can be a separate document 33 | - Use LLM as a Judge to see which approach works best 34 | 35 | ## Case: Images 36 | 37 | - Describe the images using gpt-4o-mini 38 | - [CLIP](https://openai.com/index/clip/) 39 | - Each image is a separate document 40 | 41 | ## Case: Slides 42 | 43 | - Same as with images + multiple articles 44 | - "Chunking": slide deck = document, slide = chunk 45 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 |

2 | LLM Zoomcamp 3 |

4 | 5 |

6 | LLM Zoomcamp: A Free Course on Real-Life Applications of LLMs 7 |

8 | 9 |

10 | In 10 weeks, learn how to build AI systems that answer questions about your knowledge base. Gain hands-on experience with LLMs, RAG, vector search, evaluation, monitoring, and more. 11 |

12 | 13 |

14 | 15 |

16 | 17 |

18 | Join Slack • 19 | #course-llm-zoomcamp Channel • 20 | Telegram Announcements • 21 | Course Playlist • 22 | FAQ 23 |

24 | 25 | ## How to Take LLM Zoomcamp 26 | 27 | ### 2025 Cohort 28 | - **Start Date**: June 2, 2025, 17:00 CET 29 | - **Register Here**: [Sign up](https://airtable.com/appPPxkgYLH06Mvbw/shr7WtxHEPXxaui0Q) 30 | 31 | 2025 cohort checklist: 32 | - Subscribe to our [Google Calendar](https://calendar.google.com/calendar/?cid=NjkxOThkOGFhZmUyZmQwMzZjNDFkNmE2ZDIyNjE5YjdiMmQyZDVjZTYzOGMxMzQyZmNkYjE5Y2VkNDYxOTUxY0Bncm91cC5jYWxlbmRhci5nb29nbGUuY29t) 33 | - Check [2025 cohort folder](https://github.com/DataTalksClub/llm-zoomcamp/tree/main/cohorts/2025) to stay updated 34 | - Watch [live Q&A](https://youtube.com/live/8lgiOLMMKcY) about the course 35 | - Watch [live course launch](https://www.youtube.com/live/FgnelhEJFj0) stream 36 | - Save the [2025 course playlist](https://youtube.com/playlist?list=PL3MmuxUbc_hIoBpuc900htYF4uhEAbaT-&si=n7CuD0DEgPtnbtsI) on YouTube 37 | - Check course content by navigating to the right module on GitHub 38 | - Share this course with a friend! 39 | 40 | ### Self-Paced Learning 41 | You can follow the course at your own pace: 42 | 1. Watch the course videos. 43 | 2. Complete the homework assignments. 44 | 3. Work on a project and share it in Slack for feedback. 45 | 46 | ## Syllabus 47 | 48 | ### Pre-course Workshops 49 | - [Build a Search Engine](https://www.youtube.com/watch?v=nMrGK5QgPVE) ([Code](https://github.com/alexeygrigorev/build-your-own-search-engine)) 50 | 51 | ### Modules 52 | 53 | #### [Module 1: Introduction to LLMs and RAG](01-intro/) 54 | - Basics of LLMs and Retrieval-Augmented Generation (RAG) 55 | - OpenAI API and text search with Elasticsearch 56 | 57 | #### [Module 2: Vector Search](02-vector-search/) 58 | 59 | - Vector search and embeddings 60 | - Indexing and retrieving data efficiently 61 | - Using Qdrant as the vestor database 62 | 63 | #### [Module 3: Evaluation](03-evaluation/) 64 | 65 | - Search evaluation 66 | - Online vs offline evaluation 67 | - LLM as a Judge 68 | 69 | #### [Module 4: Monitoring](04-monitoring/) 70 | 71 | - Online evaluation techniques 72 | - Monitoring user feedback with dashboards 73 | 74 | 75 | #### [Module 5: Best Practices](05-best-practices/) 76 | - Hybrid search 77 | - Document reranking 78 | 79 | #### [Module 6: Bonus - End-to-End Project](06-project-example/) 80 | - Build a fitness assistant using LLMs 81 | 82 | ### [Capstone Project](project.md) 83 | 84 | Put eveything you learned into practice 85 | 86 | ## Meet the Instructors 87 | - [Alexey Grigorev](https://linkedin.com/in/agrigorev/) 88 | - [Timur Kamaliev](https://www.linkedin.com/in/timurkamaliev/) 89 | 90 | ## Community & Support 91 | 92 | ### **Getting Help on Slack** 93 | Join the [`#course-llm-zoomcamp`](https://app.slack.com/client/T01ATQK62F8/C06TEGTGM3J) channel on [DataTalks.Club Slack](https://datatalks.club/slack.html) for discussions, troubleshooting, and networking. 94 | 95 | To keep discussions organized: 96 | - Follow [our guidelines](asking-questions.md) when posting questions. 97 | - Review the [community guidelines](https://datatalks.club/slack/guidelines.html). 98 | 99 | ## Sponsors & Supporters 100 | A special thanks to our course sponsors for making this initiative possible! 101 | 102 |

103 | 104 | 105 | 106 |

107 | 108 |

109 | 110 | 111 | 112 |

113 | 114 |

115 | 116 | 117 | 118 |

119 | 120 | 121 | Interested in supporting our community? Reach out to [alexey@datatalks.club](mailto:alexey@datatalks.club). 122 | 123 | ## About DataTalks.Club 124 | 125 |

126 | DataTalks.Club 127 |

128 | 129 |

130 | DataTalks.Club is a global online community of data enthusiasts. It's a place to discuss data, learn, share knowledge, ask and answer questions, and support each other. 131 |

132 | 133 |

134 | Website • 135 | Join Slack Community • 136 | Newsletter • 137 | Upcoming Events • 138 | Google Calendar • 139 | YouTube • 140 | GitHub • 141 | LinkedIn • 142 | Twitter 143 |

144 | 145 | All the activity at DataTalks.Club mainly happens on [Slack](https://datatalks.club/slack.html). We post updates there and discuss different aspects of data, career questions, and more. 146 | 147 | At DataTalksClub, we organize online events, community activities, and free courses. You can learn more about what we do at [DataTalksClub Community Navigation](https://www.notion.so/DataTalksClub-Community-Navigation-bf070ad27ba44bf6bbc9222082f0e5a8?pvs=21). 148 | -------------------------------------------------------------------------------- /after-sign-up.md: -------------------------------------------------------------------------------- 1 | ## Thank you! 2 | 3 | Thanks for signining up for the course. 4 | 5 | Here are some things you should do before you start the course: 6 | 7 | - Register in [DataTalks.Club's Slack](https://datatalks.club/slack.html) 8 | - Join the [`#course-llm-zoomcamp`](https://app.slack.com/client/T01ATQK62F8/C06TEGTGM3J) channel 9 | - Join the [course Telegram channel with announcements](https://t.me/llm_zoomcamp) 10 | - Subscribe to [DataTalks.Club's YouTube channel](https://www.youtube.com/c/DataTalksClub) and check [the course playlist](https://www.youtube.com/playlist?list=PL3MmuxUbc_hKiIVNf7DeEt_tGjypOYtKV) 11 | - Subscribe to our [Course Calendar](https://calendar.google.com/calendar/?cid=NjkxOThkOGFhZmUyZmQwMzZjNDFkNmE2ZDIyNjE5YjdiMmQyZDVjZTYzOGMxMzQyZmNkYjE5Y2VkNDYxOTUxY0Bncm91cC5jYWxlbmRhci5nb29nbGUuY29t) 12 | - Check our [Technical FAQ](https://docs.google.com/document/d/1m2KexowAXTmexfC5rVTCSnaShvdUQ8Ag2IEiwBDHxN0/edit?usp=sharing) if you have questions 13 | 14 | See you in the course! 15 | -------------------------------------------------------------------------------- /asking-questions.md: -------------------------------------------------------------------------------- 1 | ## Asking questions 2 | 3 | If you have any questions, ask them 4 | in the [`#course-llm-zoomcamp`](https://app.slack.com/client/T01ATQK62F8/C06TEGTGM3J) channel in [DataTalks.Club](https://datatalks.club) slack. 5 | 6 | To keep our discussion in Slack more organized, we ask you to follow these suggestions: 7 | 8 | * Before asking a question, check [FAQ](https://docs.google.com/document/d/1m2KexowAXTmexfC5rVTCSnaShvdUQ8Ag2IEiwBDHxN0/edit?usp=sharing). 9 | * Use threads. When you have a problem, first describe the problem shortly 10 | and then put the actual error in the thread - so it doesn't take the entire screen. 11 | * Instead of screenshots, it's better to copy-paste the error you're getting in text. 12 | Use ` ``` ` for formatting your code. 13 | It's very difficult to read text from screenshots. 14 | * Please don't take pictures of your code with a phone. It's even harder to read. Follow the previous suggestion, 15 | and in rare cases when you need to show what happens on your screen, take a screenshot. 16 | * You don't need to tag the instructors when you have a problem. We will see it eventually. 17 | * If somebody helped you with your problem and it's not in [FAQ](https://docs.google.com/document/d/1m2KexowAXTmexfC5rVTCSnaShvdUQ8Ag2IEiwBDHxN0/edit?usp=sharing), please add it there. 18 | It'll help other students. 
19 | -------------------------------------------------------------------------------- /awesome-llms.md: -------------------------------------------------------------------------------- 1 | # Awesome LLMs 2 | 3 | In this file, we will collect all interesting links 4 | 5 | ## OpenAI API Alternatives 6 | 7 | OpenAI and GPT are not the only hosted LLMs that we can use. 8 | There are other services that we can use 9 | 10 | 11 | * [mistral.ai](https://mistral.ai) (5€ free credit on sign up) 12 | * [Groq](https://console.groq.com) (can inference from open source LLMs with rate limits) 13 | * [TogetherAI](https://api.together.ai) (can inference from variety of open source LLMs, 25$ free credit on sign up) 14 | * [Google Gemini](https://ai.google.dev/gemini-api/docs/get-started/tutorial?lang=python) (2 months unlimited access) 15 | * [OpenRouterAI](https://openrouter.ai/) (some small open-source models, such as Gemma 7B, are free) 16 | * [HuggingFace API](https://huggingface.co/docs/api-inference/index) (over 150,000 open-source models, rate-limited and free) 17 | * [Cohere](https://cohere.com/) (provides a developer trail key which allows upto 100 reqs/min for generating, summarizing, and classifying text. Read more [here](https://cohere.com/blog/free-developer-tier-announcement)) 18 | * [wit](https://wit.ai/) (Facebook AI Afiliate - free) 19 | * [Anthropic API](https://www.anthropic.com/pricing#anthropic-api) (starting from $0.25 / MTok for input and $1.25 / MTok for the output for the most affordable model) 20 | * [AI21Labs API](https://www.ai21.com/pricing#foundation-models) (Free trial including $10 credits for 3 months) 21 | * [Replicate](https://replicate.com/) (faster inference, can host any ML model. charges 0.10$ per 1M input tokens for llama/Mistral model) 22 | 23 | 24 | ## Local LLMs on CPUs 25 | 26 | These services help run LLMs locally, also without GPUs 27 | 28 | - [ollama](https://github.com/ollama/ollama) 29 | - [Jan.AI](https://jan.ai/) 30 | - [h2oGPT](https://github.com/h2oai/h2ogpt) 31 | 32 | 33 | ## Applications 34 | - **Text Generation** 35 | - [OpenAI GPT-3 Playground](https://platform.openai.com/playground) 36 | - [AI Dungeon](https://play.aidungeon.io/) 37 | - **Chatbots** 38 | - [Rasa](https://rasa.com/) 39 | - [Microsoft Bot Framework](https://dev.botframework.com/) 40 | - **Sentiment Analysis** 41 | - [VADER Sentiment Analysis](https://github.com/cjhutto/vaderSentiment) 42 | - [TextBlob](https://textblob.readthedocs.io/en/dev/) 43 | - **Summarization** 44 | - [Sumy](https://github.com/miso-belica/sumy) 45 | - [Hugging Face Transformers Summarization](https://huggingface.co/transformers/task_summary.html) 46 | - **Translation** 47 | - [MarianMT by Hugging Face](https://huggingface.co/transformers/model_doc/marian.html) 48 | 49 | ## Fine-Tuning 50 | - **Guides and Tutorials** 51 | - [Fine-Tuning GPT-3](https://platform.openai.com/docs/guides/fine-tuning) 52 | - [Hugging Face Fine-Tuning Tutorial](https://huggingface.co/transformers/training.html) 53 | - **Tools and Frameworks** 54 | - [Hugging Face Trainer](https://huggingface.co/transformers/main_classes/trainer.html) 55 | - [Fastai](https://docs.fast.ai/text.learner.html) 56 | - **Colab Notebooks** 57 | - [Fine-Tuning BERT on Colab](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb) 58 | - [Fine-Tuning GPT-2 on Colab](https://colab.research.google.com/github/fastai/course-v3/blob/master/nbs/dl2/12a_ulmfit.ipynb) 59 | 60 | ## Prompt Engineering 61 | - **Techniques 
and Best Practices** 62 | - [OpenAI Prompt Engineering Guide](https://platform.openai.com/docs/guides/completions/best-practices) 63 | - [Prompt Design for GPT-3](https://beta.openai.com/docs/guides/prompt-design) 64 | - **Tools** 65 | - [Prompt Designer](https://promptdesigner.com/) 66 | - [Prompt Engineering Toolkit](https://github.com/prompt-engineering/awesome-prompt-engineering) 67 | - **Examples and Case Studies** 68 | - [Awesome ChatGPT Prompts](https://github.com/f/awesome-chatgpt-prompts) 69 | - [GPT-3 Prompt Engineering Examples](https://github.com/shreyashankar/gpt-3-sandbox) 70 | 71 | ## Deployment 72 | - **Hosting Services** 73 | - [Hugging Face Inference API](https://huggingface.co/inference-api) 74 | - [AWS SageMaker](https://aws.amazon.com/sagemaker/) 75 | - **Serverless Deployments** 76 | - [Serverless GPT-3 with AWS Lambda](https://towardsdatascience.com/building-serverless-gpt-3-powered-apis-with-aws-lambda-f2d4b8a91058) 77 | - [Deploying on Vercel](https://vercel.com/guides/deploying-next-and-vercel-api-with-openai-gpt-3) 78 | - **Containerization** 79 | - [Dockerizing a GPT Model](https://medium.com/swlh/dockerize-your-gpt-3-chatbot-28dd48c19c91) 80 | - [Kubernetes for ML Deployments](https://towardsdatascience.com/kubernetes-for-machine-learning-6c7f5c5466a2) 81 | 82 | ## Monitoring and Logging 83 | - **Best Practices** 84 | - [Logging and Monitoring AI Models](https://www.dominodatalab.com/resources/whitepapers/logging-and-monitoring-for-machine-learning) 85 | - [Monitor Your NLP Models](https://towardsdatascience.com/monitor-your-nlp-models-40c2fb141a51) 86 | 87 | ## Ethics and Bias 88 | - **Frameworks and Guidelines** 89 | - [AI Ethics Guidelines Global Inventory](https://algorithmwatch.org/en/project/ai-ethics-guidelines-global-inventory/) 90 | - [Google AI Principles](https://ai.google/principles/) 91 | - **Tools** 92 | - [Fairness Indicators](https://www.tensorflow.org/tfx/guide/fairness_indicators) 93 | - [IBM AI Fairness 360](https://aif360.mybluemix.net/) 94 | - **Research Papers** 95 | - [Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification](http://gendershades.org/overview.html) 96 | - [AI Fairness and Bias](https://arxiv.org/abs/1908.09635) 97 | 98 | 99 | 100 | -------------------------------------------------------------------------------- /cohorts/2024/01-intro/homework.md: -------------------------------------------------------------------------------- 1 | ## Homework: Introduction 2 | 3 | In this homework, we'll learn more about search and use Elastic Search for practice. 4 | 5 | > It's possible that your answers won't match exactly. If it's the case, select the closest one. 6 | 7 | ## Q1. Running Elastic 8 | 9 | Run Elastic Search 8.4.3, and get the cluster information. If you run it on localhost, this is how you do it: 10 | 11 | ```bash 12 | curl localhost:9200 13 | ``` 14 | 15 | What's the `version.build_hash` value? 16 | 17 | 18 | ## Getting the data 19 | 20 | Now let's get the FAQ data. 
You can run this snippet: 21 | 22 | ```python 23 | import requests 24 | 25 | docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1' 26 | docs_response = requests.get(docs_url) 27 | documents_raw = docs_response.json() 28 | 29 | documents = [] 30 | 31 | for course in documents_raw: 32 | course_name = course['course'] 33 | 34 | for doc in course['documents']: 35 | doc['course'] = course_name 36 | documents.append(doc) 37 | ``` 38 | 39 | Note that you need to have the `requests` library: 40 | 41 | ```bash 42 | pip install requests 43 | ``` 44 | 45 | ## Q2. Indexing the data 46 | 47 | Index the data in the same way as was shown in the course videos. Make the `course` field a keyword and the rest should be text. 48 | 49 | Don't forget to install the ElasticSearch client for Python: 50 | 51 | ```bash 52 | pip install elasticsearch 53 | ``` 54 | 55 | Which function do you use for adding your data to elastic? 56 | 57 | * `insert` 58 | * `index` 59 | * `put` 60 | * `add` 61 | 62 | ## Q3. Searching 63 | 64 | Now let's search in our index. 65 | 66 | We will execute a query "How do I execute a command in a running docker container?". 67 | 68 | Use only `question` and `text` fields and give `question` a boost of 4, and use `"type": "best_fields"`. 69 | 70 | What's the score for the top ranking result? 71 | 72 | * 94.05 73 | * 84.05 74 | * 74.05 75 | * 64.05 76 | 77 | Look at the `_score` field. 78 | 79 | ## Q4. Filtering 80 | 81 | Now let's only limit the questions to `machine-learning-zoomcamp`. 82 | 83 | Return 3 results. What's the 3rd question returned by the search engine? 84 | 85 | * How do I debug a docker container? 86 | * How do I copy files from a different folder into docker container’s working directory? 87 | * How do Lambda container images work? 88 | * How can I annotate a graph? 89 | 90 | ## Q5. Building a prompt 91 | 92 | Now we're ready to build a prompt to send to an LLM. 93 | 94 | Take the records returned from Elasticsearch in Q4 and use this template to build the context. Separate context entries by two linebreaks (`\n\n`) 95 | ```python 96 | context_template = """ 97 | Q: {question} 98 | A: {text} 99 | """.strip() 100 | ``` 101 | 102 | Now use the context you just created along with the "How do I execute a command in a running docker container?" question 103 | to construct a prompt using the template below: 104 | 105 | ``` 106 | prompt_template = """ 107 | You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database. 108 | Use only the facts from the CONTEXT when answering the QUESTION. 109 | 110 | QUESTION: {question} 111 | 112 | CONTEXT: 113 | {context} 114 | """.strip() 115 | ``` 116 | 117 | What's the length of the resulting prompt? (use the `len` function) 118 | 119 | * 962 120 | * 1462 121 | * 1962 122 | * 2462 123 | 124 | ## Q6. Tokens 125 | 126 | When we use the OpenAI Platform, we're charged by the number of 127 | tokens we send in our prompt and receive in the response. 128 | 129 | The OpenAI python package uses `tiktoken` for tokenization: 130 | 131 | ```bash 132 | pip install tiktoken 133 | ``` 134 | 135 | Let's calculate the number of tokens in our query: 136 | 137 | ```python 138 | encoding = tiktoken.encoding_for_model("gpt-4o") 139 | ``` 140 | 141 | Use the `encode` function. How many tokens does our prompt have? 
142 | 143 | * 122 144 | * 222 145 | * 322 146 | * 422 147 | 148 | Note: to decode back a token into a word, you can use the `decode_single_token_bytes` function: 149 | 150 | ```python 151 | encoding.decode_single_token_bytes(63842) 152 | ``` 153 | 154 | ## Bonus: generating the answer (ungraded) 155 | 156 | Let's send the prompt to OpenAI. What's the response? 157 | 158 | Note: you can replace OpenAI with Ollama. See module 2. 159 | 160 | ## Bonus: calculating the costs (ungraded) 161 | 162 | Suppose that on average per request we send 150 tokens and receive back 250 tokens. 163 | 164 | How much will it cost to run 1000 requests? 165 | 166 | You can see the prices [here](https://openai.com/api/pricing/) 167 | 168 | On June 17, the prices for gpt4o are: 169 | 170 | * Input: $0.005 / 1K tokens 171 | * Output: $0.015 / 1K tokens 172 | 173 | You can redo the calculations with the values you got in Q6 and Q7. 174 | 175 | 176 | ## Submit the results 177 | 178 | * Submit your results here: https://courses.datatalks.club/llm-zoomcamp-2025/homework/hw1 179 | * It's possible that your answers won't match exactly. If it's the case, select the closest one. 180 | -------------------------------------------------------------------------------- /cohorts/2024/02-open-source/README.md: -------------------------------------------------------------------------------- 1 | # 2. Open-Source LLMs 2 | 3 | In the previous module, we used OpenAI via OpenAI API. It's 4 | a very convenient way to use an LLM, but you have to pay 5 | for the usage, and you don't have control over the 6 | model you get to use. 7 | 8 | In this module, we'll look at using open-source LLMs instead. 9 | 10 | ## 2.1 Open-Source LLMs - Introduction 11 | 12 | 13 | 14 | 15 | 16 | * Open-Source LLMs 17 | * Replacing the LLM box in the RAG flow 18 | 19 | ## 2.2 Using a GPU in Saturn Cloud 20 | 21 | 22 | 23 | 24 | 25 | * Registering in Saturn Cloud 26 | * Configuring secrets and git 27 | * Creating an instance with a GPU 28 | 29 | ```bash 30 | pip install -U transformers accelerate bitsandbytes sentencepiece 31 | ``` 32 | 33 | Links: 34 | 35 | * https://saturncloud.io/ 36 | * https://github.com/DataTalksClub/llm-zoomcamp-saturncloud 37 | 38 | Google Colab as an alternative: 39 | 40 | * [Video](https://www.loom.com/share/591f39e4e231486bbfc3fbd316ec03c5) 41 | * [Notebook](https://colab.research.google.com/drive/1XmxUZutZXoAEdQZU45EXWPseBX9s2NRd) 42 | 43 | ## 2.3 FLAN-T5 44 | 45 | 46 | 47 | 48 | 49 | * Model: `google/flan-t5-xl` 50 | * Notebook: [huggingface-flan-t5.ipynb](huggingface-flan-t5.ipynb) 51 | 52 | ```bash 53 | import os 54 | os.environ['HF_HOME'] = '/run/cache/' 55 | ``` 56 | 57 | Links: 58 | 59 | * https://huggingface.co/google/flan-t5-xl 60 | * https://huggingface.co/docs/transformers/en/model_doc/flan-t5 61 | 62 | Explanation of Parameters: 63 | 64 | * `max_length`: Set this to a higher value if you want longer responses. For example, `max_length=300`. 65 | * `num_beams`: Increasing this can lead to more thorough exploration of possible sequences. Typical values are between 5 and 10. 66 | * `do_sample`: Set this to `True` to use sampling methods. This can produce more diverse responses. 67 | * `temperature`: Lowering this value makes the model more confident and deterministic, while higher values increase diversity. Typical values range from 0.7 to 1.5. 68 | * `top_k` and `top_p`: These parameters control nucleus sampling. 
`top_k` limits the sampling pool to the top `k` tokens, while `top_p` uses cumulative probability to cut off the sampling pool. Adjust these based on the desired level of randomness. 69 | 70 | 71 | ## 2.4 Phi 3 Mini 72 | 73 | 74 | 75 | 76 | 77 | * Model: `microsoft/Phi-3-mini-128k-instruct` 78 | * Notebook: [huggingface-phi3.ipynb](huggingface-phi3.ipynb) 79 | 80 | 81 | Links: 82 | 83 | * https://huggingface.co/microsoft/Phi-3-mini-128k-instruct 84 | 85 | ## 2.5 Mistral-7B and HuggingFace Hub Authentication 86 | 87 | 88 | 89 | 90 | 91 | * Model: `mistralai/Mistral-7B-v0.1` 92 | * Notebook: [huggingface-mistral-7b.ipynb](huggingface-mistral-7b.ipynb) 93 | 94 | [ChatGPT instructions for serving](serving-hugging-face-models.md) 95 | 96 | 97 | Links: 98 | 99 | * https://huggingface.co/docs/transformers/en/llm_tutorial 100 | * https://huggingface.co/settings/tokens 101 | * https://huggingface.co/mistralai/Mistral-7B-v0.1 102 | 103 | 104 | ## 2.6 Other models 105 | 106 | 107 | 108 | 109 | 110 | * [`LLM360/Amber`](https://huggingface.co/LLM360/Amber) 111 | * [Gemma-7B](https://huggingface.co/blog/gemma) 112 | * [SaulLM-7B](https://huggingface.co/papers/2403.03883) 113 | * [Granite-7B](https://huggingface.co/ibm-granite/granite-7b-base) 114 | * [MPT-7B](https://huggingface.co/mosaicml/mpt-7b) 115 | * [OpenLLaMA-7B](https://huggingface.co/openlm-research/open_llama_7b) 116 | 117 | Where to find them: 118 | 119 | * Leaderboards 120 | * Google 121 | * ChatGPT 122 | 123 | Links: 124 | 125 | * https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard 126 | * https://huggingface.co/spaces/optimum/llm-perf-leaderboard 127 | 128 | 129 | ## 2.7 Ollama - Running LLMs on a CPU 130 | 131 | 132 | 133 | 134 | 135 | * The easiest way to run an LLM without a GPU is using [Ollama](https://github.com/ollama/ollama) 136 | * Notebook [ollama.ipynb](ollama.ipynb) 137 | 138 | For Linux: 139 | 140 | ```bash 141 | curl -fsSL https://ollama.com/install.sh | sh 142 | 143 | ollama start 144 | ollama pull phi3 145 | ollama run phi3 146 | ``` 147 | 148 | [Prompt example](prompt.md) 149 | 150 | Connecting to it with OpenAI API: 151 | 152 | ```python 153 | from openai import OpenAI 154 | 155 | client = OpenAI( 156 | base_url='http://localhost:11434/v1/', 157 | api_key='ollama', 158 | ) 159 | ``` 160 | 161 | Docker 162 | 163 | ```bash 164 | docker run -it \ 165 | -v ollama:/root/.ollama \ 166 | -p 11434:11434 \ 167 | --name ollama \ 168 | ollama/ollama 169 | ``` 170 | 171 | Pulling the model 172 | 173 | ```bash 174 | docker exec -it ollama bash 175 | ollama pull phi3 176 | ``` 177 | 178 | 179 | ## 2.8 Ollama & Phi3 + Elastic in Docker-Compose 180 | 181 | 182 | 183 | 184 | 185 | * Creating a Docker-Compose file 186 | * Re-running the module 1 notebook 187 | 188 | * Notebook: [rag-intro.ipynb](rag-intro.ipynb) 189 | 190 | ## 2.9 UI for RAG 191 | 192 | 193 | 194 | 195 | 196 | * Putting it in Streamlit 197 | * [Code](qa_faq.py) 198 | 199 | If you want to learn more about streamlit, you can 200 | use [this material](https://github.com/DataTalksClub/project-of-the-week/blob/main/2022-08-14-frontend.md) 201 | from [our repository with projects of the week](https://github.com/DataTalksClub/project-of-the-week/tree/main). 
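As a quick sanity check for the Ollama setup from 2.7 and 2.8, you can talk to the local model through its OpenAI-compatible endpoint. A minimal sketch — the prompt and temperature are just for illustration, and `phi3` must already be pulled:

```python
# Calling a local Ollama model through the OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama",  # placeholder value; Ollama ignores it
)

def llm(prompt, model="phi3"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # keep the output close to deterministic for testing
    )
    return response.choices[0].message.content

print(llm("10 * 10"))
```

Because the interface is the same as OpenAI's, the `llm()` function from module 1 works unchanged — only the `base_url`, `api_key`, and model name differ.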
202 | 203 | ## Homework 204 | 205 | See [here](../cohorts/2024/02-open-source/homework.md) 206 | 207 | # Notes 208 | 209 | * [Workaround by Pham Nguyen Hung to use ElasticSearch container with Saturn Cloud & Google Colab instead of minsearch](https://hung.bearblog.dev/llm-zoomcamp-zrok/) 210 | * [Notes by slavaheroes](https://github.com/slavaheroes/llm-zoomcamp/blob/homeworks/02-open-source/notes.md) 211 | * [Notes by Pham Nguyen Hung](https://hung.bearblog.dev/llm-zoomcamp-2-os/) 212 | * [Notes by Marat on Open-Sourced and Closed-Sourced Models and ways to run them](https://machine-mind-ml.medium.com/open-sourced-vs-closed-sourced-llms-2392c7db6e10) 213 | * [Notes by dimzachar](https://github.com/dimzachar/llm_zoomcamp/blob/master/notes/02-open-source/README.md) 214 | * [Notes by Waleed](https://waleedayoub.com/post/llmzoomcamp_week2-open-source_notes/) 215 | * Did you take notes? Add them above this line (Send a PR with *links* to your notes) 216 | -------------------------------------------------------------------------------- /cohorts/2024/02-open-source/docker-compose.yaml: -------------------------------------------------------------------------------- 1 | version: '3.8' 2 | 3 | services: 4 | elasticsearch: 5 | image: docker.elastic.co/elasticsearch/elasticsearch:8.4.3 6 | container_name: elasticsearch 7 | environment: 8 | - discovery.type=single-node 9 | - xpack.security.enabled=false 10 | ports: 11 | - "9200:9200" 12 | - "9300:9300" 13 | 14 | ollama: 15 | image: ollama/ollama 16 | container_name: ollama 17 | volumes: 18 | - ollama:/root/.ollama 19 | ports: 20 | - "11434:11434" 21 | 22 | volumes: 23 | ollama: 24 | -------------------------------------------------------------------------------- /cohorts/2024/02-open-source/homework.md: -------------------------------------------------------------------------------- 1 | ## Homework: Open-Source LLMs 2 | 3 | In this homework, we'll experiment more with Ollama 4 | 5 | > It's possible that your answers won't match exactly. If it's the case, select the closest one. 6 | 7 | Solution: https://www.loom.com/share/f04a63aaf0db4bf58194ba425f1fcffa 8 | 9 | ## Q1. Running Ollama with Docker 10 | 11 | Let's run ollama with Docker. We will need to execute the 12 | same command as in the lectures: 13 | 14 | ```bash 15 | docker run -it \ 16 | --rm \ 17 | -v ollama:/root/.ollama \ 18 | -p 11434:11434 \ 19 | --name ollama \ 20 | ollama/ollama 21 | ``` 22 | 23 | What's the version of ollama client? 24 | 25 | To find out, enter the container and execute `ollama` with the `-v` flag. 26 | 27 | 28 | ## Q2. Downloading an LLM 29 | 30 | We will donwload a smaller LLM - gemma:2b. 31 | 32 | Again let's enter the container and pull the model: 33 | 34 | ```bash 35 | ollama pull gemma:2b 36 | ``` 37 | 38 | In docker, it saved the results into `/root/.ollama` 39 | 40 | We're interested in the metadata about this model. You can find 41 | it in `models/manifests/registry.ollama.ai/library` 42 | 43 | What's the content of the file related to gemma? 44 | 45 | ## Q3. Running the LLM 46 | 47 | Test the following prompt: "10 * 10". What's the answer? 48 | 49 | ## Q4. Donwloading the weights 50 | 51 | We don't want to pull the weights every time we run 52 | a docker container. Let's do it once and have them available 53 | every time we start a container. 54 | 55 | First, we will need to change how we run the container. 
56 | 57 | Instead of mapping the `/root/.ollama` folder to a named volume, 58 | let's map it to a local directory: 59 | 60 | ```bash 61 | mkdir ollama_files 62 | 63 | docker run -it \ 64 | --rm \ 65 | -v ./ollama_files:/root/.ollama \ 66 | -p 11434:11434 \ 67 | --name ollama \ 68 | ollama/ollama 69 | ``` 70 | 71 | Now pull the model: 72 | 73 | ```bash 74 | docker exec -it ollama ollama pull gemma:2b 75 | ``` 76 | 77 | What's the size of the `ollama_files/models` folder? 78 | 79 | * 0.6G 80 | * 1.2G 81 | * 1.7G 82 | * 2.2G 83 | 84 | Hint: on linux, you can use `du -h` for that. 85 | 86 | ## Q5. Adding the weights 87 | 88 | Let's now stop the container and add the weights 89 | to a new image 90 | 91 | For that, let's create a `Dockerfile`: 92 | 93 | ```dockerfile 94 | FROM ollama/ollama 95 | 96 | COPY ... 97 | ``` 98 | 99 | What do you put after `COPY`? 100 | 101 | ## Q6. Serving it 102 | 103 | Let's build it: 104 | 105 | ```bash 106 | docker build -t ollama-gemma2b . 107 | ``` 108 | 109 | And run it: 110 | 111 | ```bash 112 | docker run -it --rm -p 11434:11434 ollama-gemma2b 113 | ``` 114 | 115 | We can connect to it using the OpenAI client 116 | 117 | Let's test it with the following prompt: 118 | 119 | ```python 120 | prompt = "What's the formula for energy?" 121 | ``` 122 | 123 | Also, to make results reproducible, set the `temperature` parameter to 0: 124 | 125 | ```bash 126 | response = client.chat.completions.create( 127 | #... 128 | temperature=0.0 129 | ) 130 | ``` 131 | 132 | How many completion tokens did you get in response? 133 | 134 | * 304 135 | * 604 136 | * 904 137 | * 1204 138 | 139 | ## Submit the results 140 | 141 | * Submit your results here: https://courses.datatalks.club/llm-zoomcamp-2024/homework/hw2 142 | * It's possible that your answers won't match exactly. If it's the case, select the closest one. 143 | -------------------------------------------------------------------------------- /cohorts/2024/02-open-source/ollama.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 2, 6 | "id": "506fab2a-a50c-42bd-a106-c83a9d2828ea", 7 | "metadata": {}, 8 | "outputs": [ 9 | { 10 | "name": "stderr", 11 | "output_type": "stream", 12 | "text": [ 13 | "--2024-06-13 13:53:24-- https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py\n", 14 | "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...\n", 15 | "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.\n", 16 | "HTTP request sent, awaiting response... 200 OK\n", 17 | "Length: 3832 (3.7K) [text/plain]\n", 18 | "Saving to: 'minsearch.py'\n", 19 | "\n", 20 | " 0K ... 
100% 579K=0.006s\n", 21 | "\n", 22 | "2024-06-13 13:53:24 (579 KB/s) - 'minsearch.py' saved [3832/3832]\n", 23 | "\n" 24 | ] 25 | } 26 | ], 27 | "source": [ 28 | "!rm -f minsearch.py\n", 29 | "!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 3, 35 | "id": "3ac947de-effd-4b61-8792-a6d7a133f347", 36 | "metadata": {}, 37 | "outputs": [ 38 | { 39 | "data": { 40 | "text/plain": [ 41 | "" 42 | ] 43 | }, 44 | "execution_count": 3, 45 | "metadata": {}, 46 | "output_type": "execute_result" 47 | } 48 | ], 49 | "source": [ 50 | "import requests \n", 51 | "import minsearch\n", 52 | "\n", 53 | "docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'\n", 54 | "docs_response = requests.get(docs_url)\n", 55 | "documents_raw = docs_response.json()\n", 56 | "\n", 57 | "documents = []\n", 58 | "\n", 59 | "for course in documents_raw:\n", 60 | " course_name = course['course']\n", 61 | "\n", 62 | " for doc in course['documents']:\n", 63 | " doc['course'] = course_name\n", 64 | " documents.append(doc)\n", 65 | "\n", 66 | "index = minsearch.Index(\n", 67 | " text_fields=[\"question\", \"text\", \"section\"],\n", 68 | " keyword_fields=[\"course\"]\n", 69 | ")\n", 70 | "\n", 71 | "index.fit(documents)" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 4, 77 | "id": "8f087272-b44d-4738-9ea2-175ec63a058b", 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "def search(query):\n", 82 | " boost = {'question': 3.0, 'section': 0.5}\n", 83 | "\n", 84 | " results = index.search(\n", 85 | " query=query,\n", 86 | " filter_dict={'course': 'data-engineering-zoomcamp'},\n", 87 | " boost_dict=boost,\n", 88 | " num_results=5\n", 89 | " )\n", 90 | "\n", 91 | " return results" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 8, 97 | "id": "742ab881-499a-4675-83c4-2013ea1377b9", 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "def build_prompt(query, search_results):\n", 102 | " prompt_template = \"\"\"\n", 103 | "You're a course teaching assistant. 
Answer the QUESTION based on the CONTEXT from the FAQ database.\n", 104 | "Use only the facts from the CONTEXT when answering the QUESTION.\n", 105 | "\n", 106 | "QUESTION: {question}\n", 107 | "\n", 108 | "CONTEXT: \n", 109 | "{context}\n", 110 | "\"\"\".strip()\n", 111 | "\n", 112 | " context = \"\"\n", 113 | " \n", 114 | " for doc in search_results:\n", 115 | " context = context + f\"section: {doc['section']}\\nquestion: {doc['question']}\\nanswer: {doc['text']}\\n\\n\"\n", 116 | " \n", 117 | " prompt = prompt_template.format(question=query, context=context).strip()\n", 118 | " return prompt\n", 119 | "\n", 120 | "def llm(prompt):\n", 121 | " response = client.chat.completions.create(\n", 122 | " model='phi3',\n", 123 | " messages=[{\"role\": \"user\", \"content\": prompt}]\n", 124 | " )\n", 125 | " \n", 126 | " return response.choices[0].message.content" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 6, 132 | "id": "fe8bff3e-b672-42be-866b-f2d9bb217106", 133 | "metadata": {}, 134 | "outputs": [], 135 | "source": [ 136 | "def rag(query):\n", 137 | " search_results = search(query)\n", 138 | " prompt = build_prompt(query, search_results)\n", 139 | " answer = llm(prompt)\n", 140 | " return answer" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 7, 146 | "id": "091a77e6-936b-448e-a04b-bad1001f5bb0", 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "from openai import OpenAI\n", 151 | "\n", 152 | "client = OpenAI(\n", 153 | " base_url='http://localhost:11434/v1/',\n", 154 | " api_key='ollama',\n", 155 | ")" 156 | ] 157 | }, 158 | { 159 | "cell_type": "code", 160 | "execution_count": 12, 161 | "id": "9ee527a3-3331-4f4e-b6c8-f659ffc113f5", 162 | "metadata": {}, 163 | "outputs": [ 164 | { 165 | "data": { 166 | "text/plain": [ 167 | "' This statement serves as an example to verify the functionality of various systems, such as text processing software or programming functions. It\\'s commonly used by developers during debugging sessions to ensure commands are working correctly without producing any unintended output.\\n\\nHere\\'s how you might include it in different contexts:\\n\\n**1. Using it as a command line test in a script:**\\nIf writing a shell script or using command-line tools, the statement can be inserted directly to demonstrate functionality. For instance, using `echo` on Unix-like systems:\\n```bash\\n#!/bin/bash\\necho \"This is a test\"\\necho \"This is also a test for confirmation.\"\\n```\\n\\n**2. Inserting it into a programming function as a placeholder or comment (in Python):**\\nAs a comment in code to remind future developers that the block can be replaced with actual implementation:\\n```python\\ndef process_text(input_string):\\n # Test input: \"This is a test\"\\n print(\"Testing...\")\\n # Replace this line with your processing logic\\n return input_string.upper() # Example operation\\n```\\n\\n**3. 
Using in documentation or comments within software development code:**\\nDemonstrate how the statement can be used to clarify intentions when developing software, such as in a README file or inline comment:\\n```markdown\\n# Test Command Functionality\\nThis section contains commands that serve to test system functionality.\\n`echo \"This is a test\"` - A simple command to check output behavior.\\n```\\n\\nIn each case, the statement `This is a test` fulfills its role as a straightforward demonstration or placeholder within development and testing workflows.'" 168 | ] 169 | }, 170 | "execution_count": 12, 171 | "metadata": {}, 172 | "output_type": "execute_result" 173 | } 174 | ], 175 | "source": [ 176 | "llm('write that this is a test')" 177 | ] 178 | }, 179 | { 180 | "cell_type": "code", 181 | "execution_count": 13, 182 | "id": "21aa255e-c971-44ca-9826-a721df3ad063", 183 | "metadata": {}, 184 | "outputs": [ 185 | { 186 | "name": "stdout", 187 | "output_type": "stream", 188 | "text": [ 189 | " This statement serves as an example to verify the functionality of various systems, such as text processing software or programming functions. It's commonly used by developers during debugging sessions to ensure commands are working correctly without producing any unintended output.\n", 190 | "\n", 191 | "Here's how you might include it in different contexts:\n", 192 | "\n", 193 | "**1. Using it as a command line test in a script:**\n", 194 | "If writing a shell script or using command-line tools, the statement can be inserted directly to demonstrate functionality. For instance, using `echo` on Unix-like systems:\n", 195 | "```bash\n", 196 | "#!/bin/bash\n", 197 | "echo \"This is a test\"\n", 198 | "echo \"This is also a test for confirmation.\"\n", 199 | "```\n", 200 | "\n", 201 | "**2. Inserting it into a programming function as a placeholder or comment (in Python):**\n", 202 | "As a comment in code to remind future developers that the block can be replaced with actual implementation:\n", 203 | "```python\n", 204 | "def process_text(input_string):\n", 205 | " # Test input: \"This is a test\"\n", 206 | " print(\"Testing...\")\n", 207 | " # Replace this line with your processing logic\n", 208 | " return input_string.upper() # Example operation\n", 209 | "```\n", 210 | "\n", 211 | "**3. 
Using in documentation or comments within software development code:**\n", 212 | "Demonstrate how the statement can be used to clarify intentions when developing software, such as in a README file or inline comment:\n", 213 | "```markdown\n", 214 | "# Test Command Functionality\n", 215 | "This section contains commands that serve to test system functionality.\n", 216 | "`echo \"This is a test\"` - A simple command to check output behavior.\n", 217 | "```\n", 218 | "\n", 219 | "In each case, the statement `This is a test` fulfills its role as a straightforward demonstration or placeholder within development and testing workflows.\n" 220 | ] 221 | } 222 | ], 223 | "source": [ 224 | "print(_)" 225 | ] 226 | }, 227 | { 228 | "cell_type": "code", 229 | "execution_count": null, 230 | "id": "988ece59-951a-4b32-ba3f-cb8efb66a9bb", 231 | "metadata": {}, 232 | "outputs": [], 233 | "source": [] 234 | } 235 | ], 236 | "metadata": { 237 | "kernelspec": { 238 | "display_name": "Python 3 (ipykernel)", 239 | "language": "python", 240 | "name": "python3" 241 | }, 242 | "language_info": { 243 | "codemirror_mode": { 244 | "name": "ipython", 245 | "version": 3 246 | }, 247 | "file_extension": ".py", 248 | "mimetype": "text/x-python", 249 | "name": "python", 250 | "nbconvert_exporter": "python", 251 | "pygments_lexer": "ipython3", 252 | "version": "3.11.9" 253 | } 254 | }, 255 | "nbformat": 4, 256 | "nbformat_minor": 5 257 | } 258 | -------------------------------------------------------------------------------- /cohorts/2024/02-open-source/prompt.md: -------------------------------------------------------------------------------- 1 | 2 | Question: I just discovered the couse. can i still enrol 3 | 4 | Context: 5 | 6 | Course - Can I still join the course after the start date? 7 | Yes, even if you don't register, you're still eligible to submit the homeworks. 8 | Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute. 9 | 10 | Environment - Is Python 3.9 still the recommended version to use in 2024? 11 | Yes, for simplicity (of troubleshooting against the recorded videos) and stability. [source] 12 | But Python 3.10 and 3.11 should work fine. 13 | 14 | How can we contribute to the course? 15 | Star the repo! Share it with friends if you find it useful ❣️ 16 | Create a PR if you see you can improve the text or the structure of the repository. 17 | 18 | Are we still using the NYC Trip data for January 2021? Or are we using the 2022 data? 19 | We will use the same data, as the project will essentially remain the same as last year’s. The data is available here 20 | 21 | Docker-Compose - docker-compose still not available after changing .bashrc 22 | This is happen to me after following 1.4.1 video where we are installing docker compose in our Google Cloud VM. In my case, the docker-compose file downloaded from github named docker-compose-linux-x86_64 while it is more convenient to use docker-compose command instead. So just change the docker-compose-linux-x86_64 into docker-compose. 
23 | 24 | Answer: -------------------------------------------------------------------------------- /cohorts/2024/02-open-source/qa_faq.py: -------------------------------------------------------------------------------- 1 | import streamlit as st 2 | import time 3 | 4 | from elasticsearch import Elasticsearch 5 | from openai import OpenAI 6 | 7 | client = OpenAI( 8 | base_url='http://localhost:11434/v1/', 9 | api_key='ollama', 10 | ) 11 | 12 | es_client = Elasticsearch('http://localhost:9200') 13 | 14 | 15 | def elastic_search(query, index_name = "course-questions"): 16 | search_query = { 17 | "size": 5, 18 | "query": { 19 | "bool": { 20 | "must": { 21 | "multi_match": { 22 | "query": query, 23 | "fields": ["question^3", "text", "section"], 24 | "type": "best_fields" 25 | } 26 | }, 27 | "filter": { 28 | "term": { 29 | "course": "data-engineering-zoomcamp" 30 | } 31 | } 32 | } 33 | } 34 | } 35 | 36 | response = es_client.search(index=index_name, body=search_query) 37 | 38 | result_docs = [] 39 | 40 | for hit in response['hits']['hits']: 41 | result_docs.append(hit['_source']) 42 | 43 | return result_docs 44 | 45 | 46 | def build_prompt(query, search_results): 47 | prompt_template = """ 48 | You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database. 49 | Use only the facts from the CONTEXT when answering the QUESTION. 50 | 51 | QUESTION: {question} 52 | 53 | CONTEXT: 54 | {context} 55 | """.strip() 56 | 57 | context = "" 58 | 59 | for doc in search_results: 60 | context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n" 61 | 62 | prompt = prompt_template.format(question=query, context=context).strip() 63 | return prompt 64 | 65 | def llm(prompt): 66 | response = client.chat.completions.create( 67 | model='phi3', 68 | messages=[{"role": "user", "content": prompt}] 69 | ) 70 | 71 | return response.choices[0].message.content 72 | 73 | 74 | def rag(query): 75 | search_results = elastic_search(query) 76 | prompt = build_prompt(query, search_results) 77 | answer = llm(prompt) 78 | return answer 79 | 80 | 81 | def main(): 82 | st.title("RAG Function Invocation") 83 | 84 | user_input = st.text_input("Enter your input:") 85 | 86 | if st.button("Ask"): 87 | with st.spinner('Processing...'): 88 | output = rag(user_input) 89 | st.success("Completed!") 90 | st.write(output) 91 | 92 | if __name__ == "__main__": 93 | main() 94 | -------------------------------------------------------------------------------- /cohorts/2024/02-open-source/rag-intro.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 3, 6 | "id": "ef8e9cdc-dfd4-4e54-a332-4b9bde4e6047", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "from openai import OpenAI\n", 11 | "\n", 12 | "client = OpenAI(\n", 13 | " base_url='http://localhost:11434/v1/',\n", 14 | " api_key='ollama',\n", 15 | ")" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 7, 21 | "id": "2c05052f-a85a-4137-8398-0fd0be678599", 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "from elasticsearch import Elasticsearch" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 8, 31 | "id": "a78df1cc-5a5a-40b4-b673-19c7f0319453", 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "es_client = Elasticsearch('http://localhost:9200') " 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": 9, 41 | "id": 
"c9367c18-41ad-495e-9920-1a0c552f0d18", 42 | "metadata": {}, 43 | "outputs": [ 44 | { 45 | "data": { 46 | "text/plain": [ 47 | "ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})" 48 | ] 49 | }, 50 | "execution_count": 9, 51 | "metadata": {}, 52 | "output_type": "execute_result" 53 | } 54 | ], 55 | "source": [ 56 | "index_settings = {\n", 57 | " \"settings\": {\n", 58 | " \"number_of_shards\": 1,\n", 59 | " \"number_of_replicas\": 0\n", 60 | " },\n", 61 | " \"mappings\": {\n", 62 | " \"properties\": {\n", 63 | " \"text\": {\"type\": \"text\"},\n", 64 | " \"section\": {\"type\": \"text\"},\n", 65 | " \"question\": {\"type\": \"text\"},\n", 66 | " \"course\": {\"type\": \"keyword\"} \n", 67 | " }\n", 68 | " }\n", 69 | "}\n", 70 | "\n", 71 | "index_name = \"course-questions\"\n", 72 | "\n", 73 | "es_client.indices.create(index=index_name, body=index_settings)" 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 10, 79 | "id": "e131e3a3-4051-4fd7-8e4d-d17c2af2ad75", 80 | "metadata": {}, 81 | "outputs": [], 82 | "source": [ 83 | "import requests \n", 84 | "\n", 85 | "docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'\n", 86 | "docs_response = requests.get(docs_url)\n", 87 | "documents_raw = docs_response.json()\n", 88 | "\n", 89 | "documents = []\n", 90 | "\n", 91 | "for course in documents_raw:\n", 92 | " course_name = course['course']\n", 93 | "\n", 94 | " for doc in course['documents']:\n", 95 | " doc['course'] = course_name\n", 96 | " documents.append(doc)" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 12, 102 | "id": "5c230059-e219-4a13-a7f8-ede4cf1b028f", 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "from tqdm.auto import tqdm" 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": 13, 112 | "id": "70fe3c97-916d-42c0-bd7b-4f42d9056409", 113 | "metadata": {}, 114 | "outputs": [ 115 | { 116 | "data": { 117 | "application/vnd.jupyter.widget-view+json": { 118 | "model_id": "d74c0925b2eb48b8b301b1f418b6938a", 119 | "version_major": 2, 120 | "version_minor": 0 121 | }, 122 | "text/plain": [ 123 | " 0%| | 0/948 [00:00" 42 | ] 43 | }, 44 | "execution_count": 3, 45 | "metadata": {}, 46 | "output_type": "execute_result" 47 | } 48 | ], 49 | "source": [ 50 | "import requests \n", 51 | "import minsearch\n", 52 | "\n", 53 | "docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'\n", 54 | "docs_response = requests.get(docs_url)\n", 55 | "documents_raw = docs_response.json()\n", 56 | "\n", 57 | "documents = []\n", 58 | "\n", 59 | "for course in documents_raw:\n", 60 | " course_name = course['course']\n", 61 | "\n", 62 | " for doc in course['documents']:\n", 63 | " doc['course'] = course_name\n", 64 | " documents.append(doc)\n", 65 | "\n", 66 | "index = minsearch.Index(\n", 67 | " text_fields=[\"question\", \"text\", \"section\"],\n", 68 | " keyword_fields=[\"course\"]\n", 69 | ")\n", 70 | "\n", 71 | "index.fit(documents)" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 4, 77 | "id": "8f087272-b44d-4738-9ea2-175ec63a058b", 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "def search(query):\n", 82 | " boost = {'question': 3.0, 'section': 0.5}\n", 83 | "\n", 84 | " results = index.search(\n", 85 | " query=query,\n", 86 | " filter_dict={'course': 'data-engineering-zoomcamp'},\n", 87 | " boost_dict=boost,\n", 88 | " num_results=5\n", 89 | " 
)\n", 90 | "\n", 91 | " return results" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": 8, 97 | "id": "742ab881-499a-4675-83c4-2013ea1377b9", 98 | "metadata": {}, 99 | "outputs": [], 100 | "source": [ 101 | "def build_prompt(query, search_results):\n", 102 | " prompt_template = \"\"\"\n", 103 | "You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.\n", 104 | "Use only the facts from the CONTEXT when answering the QUESTION.\n", 105 | "\n", 106 | "QUESTION: {question}\n", 107 | "\n", 108 | "CONTEXT: \n", 109 | "{context}\n", 110 | "\"\"\".strip()\n", 111 | "\n", 112 | " context = \"\"\n", 113 | " \n", 114 | " for doc in search_results:\n", 115 | " context = context + f\"section: {doc['section']}\\nquestion: {doc['question']}\\nanswer: {doc['text']}\\n\\n\"\n", 116 | " \n", 117 | " prompt = prompt_template.format(question=query, context=context).strip()\n", 118 | " return prompt\n", 119 | "\n", 120 | "def llm(prompt):\n", 121 | " response = client.chat.completions.create(\n", 122 | " model='gpt-4o',\n", 123 | " messages=[{\"role\": \"user\", \"content\": prompt}]\n", 124 | " )\n", 125 | " \n", 126 | " return response.choices[0].message.content" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 6, 132 | "id": "fe8bff3e-b672-42be-866b-f2d9bb217106", 133 | "metadata": {}, 134 | "outputs": [], 135 | "source": [ 136 | "def rag(query):\n", 137 | " search_results = search(query)\n", 138 | " prompt = build_prompt(query, search_results)\n", 139 | " answer = llm(prompt)\n", 140 | " return answer" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "id": "988ece59-951a-4b32-ba3f-cb8efb66a9bb", 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [] 150 | } 151 | ], 152 | "metadata": { 153 | "kernelspec": { 154 | "display_name": "Python 3 (ipykernel)", 155 | "language": "python", 156 | "name": "python3" 157 | }, 158 | "language_info": { 159 | "codemirror_mode": { 160 | "name": "ipython", 161 | "version": 3 162 | }, 163 | "file_extension": ".py", 164 | "mimetype": "text/x-python", 165 | "name": "python", 166 | "nbconvert_exporter": "python", 167 | "pygments_lexer": "ipython3", 168 | "version": "3.11.9" 169 | } 170 | }, 171 | "nbformat": 4, 172 | "nbformat_minor": 5 173 | } 174 | -------------------------------------------------------------------------------- /cohorts/2024/03-vector-search/README.md: -------------------------------------------------------------------------------- 1 | # Vector Search 2 | 3 | ## 3.1 Introduction to Vector Search 4 | 5 | 6 | 7 | 8 | 9 | * [Slides](https://github.com/dataML007/elastic_search/blob/main/Introduction%20to%20Vector%20DB.pdf) 10 | 11 | 12 | ## 3.2 Semantic Search with Elasticsearch 13 | 14 | 15 | 16 | 17 | 18 | * Notebook: [demo_es.ipynb](demo_es.ipynb) 19 | 20 | ### 3.2.2 Advanced Semantic Search 21 | 22 | 23 | 24 | 25 | 26 | 27 | ## 3.3 Evaluating Retrieval 28 | 29 | ### 3.3.1 Introduction 30 | 31 | 32 | 33 | 34 | 35 | Plan for the section: 36 | 37 | * Why do we need evaluation 38 | * [Evaluation metrics](eval/evaluation-metrics.md) 39 | * Ground truth / gold standard data 40 | * Generating ground truth with LLM 41 | * Evaluating the search resuls 42 | 43 | 44 | ### 3.3.2 Getting ground truth data 45 | 46 | 47 | 48 | 49 | 50 | * Approaches for getting evaluation data 51 | * Using OpenAI to generate evaluation data 52 | 53 | Links: 54 | 55 | * [notebook](eval/ground-truth-data.ipynb) 56 | * [documents with 
ids](eval/documents-with-ids.json) 57 | * [queries generated by OpenAI (pickle)](eval/results.bin) 58 | * [ground truth dataset](eval/ground-truth-data.csv) 59 | 60 | 61 | ### 3.3.3 Ranking evaluation: text search 62 | 63 | 64 | 65 | 66 | 67 | * Elasticsearch with text results 68 | * minsearch 69 | 70 | Links: 71 | 72 | * [Notebook](eval/evaluate-text.ipynb) 73 | 74 | ### 3.3.4 Ranking evaluation: vector search 75 | 76 | 77 | 78 | 79 | 80 | * Elasticsearch with vector search 81 | * Ranking with question, answer, question+answer embeddings 82 | 83 | Links: 84 | 85 | * [Notebook](eval/evaluate-vector.ipynb) 86 | 87 | ## Homework 88 | 89 | See [here](../cohorts/2024/03-vector-search/homework.md) 90 | 91 | 92 | # Notes 93 | 94 | * [Notes by dimzachar](https://github.com/dimzachar/llm_zoomcamp/blob/master/notes/03-vector-search/README.md) 95 | * [Notes by slavaheroes](https://github.com/slavaheroes/llm-zoomcamp/blob/homeworks/03-vector-search/notes.md) 96 | * [Notes on Vector Search and Vector Databases by Marat](https://machine-mind-ml.medium.com/enhancing-llm-performance-with-vector-search-and-vector-databases-1f20eb1cc650) 97 | * Did you take notes? Add them above this line (Send a PR with *links* to your notes) 98 | -------------------------------------------------------------------------------- /cohorts/2024/03-vector-search/homework.md: -------------------------------------------------------------------------------- 1 | ## Homework: Vector Search 2 | 3 | In this homework, we'll experiemnt with vector with and without Elasticsearch 4 | 5 | > It's possible that your answers won't match exactly. If it's the case, select the closest one. 6 | 7 | Solution: 8 | 9 | * Video: https://www.loom.com/share/979b9e1f0a964fff88cd62d37e7515f9 10 | * Notebook: [homework_solution.ipynb](homework_solution.ipynb) 11 | 12 | 13 | 14 | ## Q1. Getting the embeddings model 15 | 16 | First, we will get the embeddings model `multi-qa-distilbert-cos-v1` from 17 | [the Sentence Transformer library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview) 18 | 19 | ```bash 20 | from sentence_transformers import SentenceTransformer 21 | embedding_model = SentenceTransformer(model_name) 22 | ``` 23 | 24 | Create the embedding for this user question: 25 | 26 | ```python 27 | user_question = "I just discovered the course. Can I still join it?" 28 | ``` 29 | 30 | What's the first value of the resulting vector? 31 | 32 | * -0.24 33 | * -0.04 34 | * 0.07 35 | * 0.27 36 | 37 | 38 | ## Prepare the documents 39 | 40 | Now we will create the embeddings for the documents. 41 | 42 | Load the documents with ids that we prepared in the module: 43 | 44 | ```python 45 | import requests 46 | 47 | base_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main' 48 | relative_url = '03-vector-search/eval/documents-with-ids.json' 49 | docs_url = f'{base_url}/{relative_url}?raw=1' 50 | docs_response = requests.get(docs_url) 51 | documents = docs_response.json() 52 | ``` 53 | 54 | We will use only a subset of the questions - the questions 55 | for `"machine-learning-zoomcamp"`. After filtering, you should 56 | have only 375 documents 57 | 58 | ## Q2. Creating the embeddings 59 | 60 | Now for each document, we will create an embedding for both question and answer fields. 
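Before spelling the steps out, here is a minimal sketch of what this could look like, assuming `embedding_model` is the model loaded in Q1 and `documents` is the filtered list of machine-learning-zoomcamp documents (the variable names here are just an illustration):

```python
import numpy as np

embeddings = []

for doc in documents:
    # combine the question and the answer text of each FAQ record
    qa_text = f"{doc['question']} {doc['text']}"
    # encode the combined text with the embedding model from Q1
    embeddings.append(embedding_model.encode(qa_text))

# stack all document vectors into a single matrix
X = np.array(embeddings)
```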
61 | 
62 | We want to put all of them into a single matrix `X`:
63 | 
64 | - Create a list `embeddings`
65 | - Iterate over each document
66 | - `qa_text = f'{question} {text}'`
67 | - compute the embedding for `qa_text`, append to `embeddings`
68 | - At the end, let `X = np.array(embeddings)` (`import numpy as np`)
69 | 
70 | What's the shape of X? (`X.shape`). Include the parentheses.
71 | 
72 | 
73 | 
74 | ## Q3. Search
75 | 
76 | We have the embeddings and the query vector. Now let's compute the
77 | cosine similarity between the vector from Q1 (let's call it `v`) and the matrix from Q2.
78 | 
79 | The vectors returned from the embedding model are already
80 | normalized (you can check it by computing a dot product of a vector
81 | with itself - it should return something very close to 1.0). This means that in order
82 | to compute the cosine similarity, it's sufficient to
83 | multiply the matrix `X` by the vector `v`:
84 | 
85 | 
86 | ```python
87 | scores = X.dot(v)
88 | ```
89 | 
90 | What's the highest score in the results?
91 | 
92 | - 65.0
93 | - 6.5
94 | - 0.65
95 | - 0.065
96 | 
97 | 
98 | ## Vector search
99 | 
100 | We can now compute the similarity between a query vector and all the embeddings.
101 | 
102 | Let's use this to implement our own vector search:
103 | 
104 | ```python
105 | class VectorSearchEngine():
106 |     def __init__(self, documents, embeddings):
107 |         self.documents = documents
108 |         self.embeddings = embeddings
109 | 
110 |     def search(self, v_query, num_results=10):
111 |         scores = self.embeddings.dot(v_query)
112 |         idx = np.argsort(-scores)[:num_results]
113 |         return [self.documents[i] for i in idx]
114 | 
115 | search_engine = VectorSearchEngine(documents=documents, embeddings=X)
116 | search_engine.search(v, num_results=5)
117 | ```
118 | 
119 | If you don't understand how the `search` function works:
120 | 
121 | * Ask ChatGPT or any other LLM of your choice to explain the code
122 | * Check our pre-course workshop about implementing a search engine [here](https://github.com/alexeygrigorev/build-your-own-search-engine)
123 | 
124 | (Note: you can replace `argsort` with `argpartition` to make it a lot faster)
125 | 
126 | 
127 | ## Q4. Hit-rate for our search engine
128 | 
129 | Let's evaluate the performance of our own search engine. We will
130 | use the hitrate metric for evaluation.
131 | 
132 | First, load the ground truth dataset:
133 | 
134 | ```python
135 | import pandas as pd
136 | 
137 | base_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main'
138 | relative_url = '03-vector-search/eval/ground-truth-data.csv'
139 | ground_truth_url = f'{base_url}/{relative_url}?raw=1'
140 | 
141 | df_ground_truth = pd.read_csv(ground_truth_url)
142 | df_ground_truth = df_ground_truth[df_ground_truth.course == 'machine-learning-zoomcamp']
143 | ground_truth = df_ground_truth.to_dict(orient='records')
144 | ```
145 | 
146 | Now use the code from the module to calculate the hitrate of
147 | `VectorSearchEngine` with `num_results=5`.
148 | 
149 | What did you get?
150 | 
151 | * 0.93
152 | * 0.73
153 | * 0.53
154 | * 0.33
155 | 
156 | ## Q5. Indexing with Elasticsearch
157 | 
158 | Now let's index these documents with Elasticsearch:
159 | 
160 | * Create the index with the same settings as in the module (but change the dimensions)
161 | * Index the embeddings (note: you've already computed them)
162 | 
163 | After indexing, let's perform the search with the same query from Q1.
164 | 
165 | What's the ID of the document with the highest score?
166 | 
167 | ## Q6. 
Hit-rate for Elasticsearch 168 | 169 | The search engine we used in Q4 computed the similarity between 170 | the query and ALL the vectors in our database. Usually this is 171 | not practical, as we may have a lot of data. 172 | 173 | Elasticsearch uses approximate techniques to make it faster. 174 | 175 | Let's evaluate how worse the results are when we switch from 176 | exact search (as in Q4) to approximate search with Elastic. 177 | 178 | What's hitrate for our dataset for Elastic? 179 | 180 | * 0.93 181 | * 0.73 182 | * 0.53 183 | * 0.33 184 | 185 | 186 | ## Submit the results 187 | 188 | * Submit your results here: https://courses.datatalks.club/llm-zoomcamp-2024/homework/hw3 189 | * It's possible that your answers won't match exactly. If it's the case, select the closest one. 190 | -------------------------------------------------------------------------------- /cohorts/2024/04-monitoring/README.md: -------------------------------------------------------------------------------- 1 | # Module 4: Evaluation and Monitoring 2 | 3 | In this module, we will learn how to evaluate and monitor our LLM and RAG system. 4 | 5 | In the evaluation part, we assess the quality of our entire RAG 6 | system before it goes live. 7 | 8 | In the monitoring part, we collect, store and visualize 9 | metrics to assess the answer quality of a deployed LLM. We also 10 | collect chat history and user feedback. 11 | 12 | 13 | ## 4.1 Introduction to monitoring answer quality 14 | 15 | 16 | 17 | 18 | 19 | * Why monitoring LLM systems? 20 | * Monitoring answer quality of LLMs 21 | * Monitoring answer quality with user feedback 22 | * What else to monitor, that is not covered by this module? 23 | 24 | 25 | ## 4.2 Offline vs Online (RAG) evaluation 26 | 27 | 28 | 29 | 30 | 31 | * Modules recap 32 | * Online vs offline evaluation 33 | * Offline evaluation metrics 34 | 35 | 36 | ## 4.3 Generating data for offline RAG evaluation 37 | 38 | 39 | 40 | 41 | 42 | Links: 43 | 44 | * [notebook](offline-rag-evaluation.ipynb) 45 | * [results-gpt4o.csv](data/results-gpt4o.csv) (answers from GPT-4o) 46 | * [results-gpt35.csv](data/results-gpt35.csv) (answers from GPT-3.5-Turbo) 47 | 48 | 49 | ## 4.4 Offline RAG evaluation: cosine similarity 50 | 51 | 52 | 53 | 54 | 55 | Content 56 | 57 | * A->Q->A' cosine similarity 58 | * Evaluating gpt-4o 59 | * Evaluating gpt-3.5-turbo 60 | * Evaluating gpt-4o-mini 61 | 62 | Links: 63 | 64 | * [notebook](offline-rag-evaluation.ipynb) 65 | * [results-gpt4o-cosine.csv](data/results-gpt4o-cosine.csv) (answers with cosine calculated from GPT-4o) 66 | * [results-gpt35-cosine.csv](data/results-gpt35-cosine.csv) (answers with cosine calculated from GPT-3.5-Turbo) 67 | * [results-gpt4o-mini.csv](data/results-gpt4o-mini.csv) (answers from GPT-4o-mini) 68 | * [results-gpt4o-mini-cosine.csv](data/results-gpt4o-mini-cosine.csv) (answers with cosine calculated from GPT-4o-mini) 69 | 70 | 71 | ## 4.5 Offline RAG evaluation: LLM as a judge 72 | 73 | 74 | 75 | 76 | 77 | * LLM as a judge 78 | * A->Q->A' evaluation 79 | * Q->A evaluation 80 | 81 | 82 | Links: 83 | 84 | * [notebook](offline-rag-evaluation.ipynb) 85 | * [evaluations-aqa.csv](data/evaluations-aqa.csv) (A->Q->A evaluation results) 86 | * [evaluations-qa.csv](data/evaluations-qa.csv) (Q->A evaluation results) 87 | https://youtu.be/ 88 | 89 | 90 | ## 4.6 Capturing user feedback 91 | 92 | 93 | 94 | 95 | 96 | > You can see the prompts and the output from claude [here](code.md) 97 | 98 | Content 99 | 100 | * Adding +1 and -1 buttons 101 | * Setting up 
a postgres database 102 | * Putting everything in docker compose 103 | 104 | ```bash 105 | pip install pgcli 106 | pgcli -h localhost -U your_username -d course_assistant -W 107 | ``` 108 | 109 | 110 | Links: 111 | 112 | * [final code](app/) 113 | * [intermediate code from claude](code.md#46-capturing-user-feedback) 114 | 115 | 116 | ### 4.6.2 Capturing user feedback: part 2 117 | 118 | 119 | 120 | 121 | 122 | * adding vector search 123 | * adding OpenAI 124 | 125 | Links: 126 | 127 | * [final code](app/) 128 | * [intermediate code from claude](code.md#462-capturing-user-feedback-part-2) 129 | 130 | 131 | ## 4.7 Monitoring the system 132 | 133 | 134 | 135 | 136 | 137 | * Setting up Grafana 138 | * Tokens and costs 139 | * QA relevance 140 | * User feedback 141 | * Other metrics 142 | 143 | Links: 144 | 145 | * [final code](app/) 146 | * [SQL queries for Grafana](grafana.md) 147 | * [intermediate code from claude](code.md#47-monitoring) 148 | 149 | ### 4.7.2 Extra Grafana video 150 | 151 | 152 | 153 | 154 | 155 | * Grafana variables 156 | * Exporting and importing dashboards 157 | 158 | Links: 159 | 160 | * [SQL queries for Grafana](grafana.md) 161 | * [Grafana dashboard](dashboard.json) 162 | 163 | ## Homework 164 | 165 | See [here](../cohorts/2024/04-monitoring/homework.md) 166 | 167 | 168 | ## Extra resources 169 | 170 | ### Overview of the module 171 | 172 | ![image](https://github.com/user-attachments/assets/0c5e932b-4fca-4d51-8b1c-93f4600861dc) 173 | 174 | https://www.loom.com/share/1dd375ec4b0d458fabdfc2b841089031 175 | 176 | # Notes 177 | 178 | * [Notes by slavaheroes](https://github.com/slavaheroes/llm-zoomcamp/blob/homeworks/04-monitoring/notes.md) 179 | * Did you take notes? Add them above this line (Send a PR with *links* to your notes) 180 | -------------------------------------------------------------------------------- /cohorts/2024/04-monitoring/app/.env: -------------------------------------------------------------------------------- 1 | # PostgreSQL Configuration 2 | POSTGRES_HOST=postgres 3 | POSTGRES_DB=course_assistant 4 | POSTGRES_USER=your_username 5 | POSTGRES_PASSWORD=your_password 6 | POSTGRES_PORT=5432 7 | 8 | # Elasticsearch Configuration 9 | ELASTIC_URL_LOCAL=http://localhost:9200 10 | ELASTIC_URL=http://elasticsearch:9200 11 | ELASTIC_PORT=9200 12 | 13 | # Ollama Configuration 14 | OLLAMA_PORT=11434 15 | 16 | # Streamlit Configuration 17 | STREAMLIT_PORT=8501 18 | 19 | # Other Configuration 20 | MODEL_NAME=multi-qa-MiniLM-L6-cos-v1 21 | INDEX_NAME=course-questions 22 | -------------------------------------------------------------------------------- /cohorts/2024/04-monitoring/app/Dockerfile: -------------------------------------------------------------------------------- 1 | FROM python:3.9-slim 2 | 3 | ENV PYTHONUNBUFFERED=1 4 | 5 | WORKDIR /app 6 | 7 | COPY requirements.txt . 8 | RUN pip install --no-cache-dir -r requirements.txt 9 | 10 | COPY . . 
11 | 12 | CMD ["streamlit", "run", "app.py"] -------------------------------------------------------------------------------- /cohorts/2024/04-monitoring/app/README.MD: -------------------------------------------------------------------------------- 1 | ## Additional notes for those trying the streamlit/grafana out 2 | 3 | 1) The following packages are required when you run some of .py scripts 4 | 5 | ``` 6 | pip install psycopg2-binary python-dotenv 7 | pip install pgcli 8 | ``` 9 | 10 | 11 | 2) To download the phi3 model to the container 12 | ``` 13 | docker-compose up -d 14 | docker-compose exec ollama ollama pull phi3 15 | ``` 16 | -------------------------------------------------------------------------------- /cohorts/2024/04-monitoring/app/app.py: -------------------------------------------------------------------------------- 1 | import streamlit as st 2 | import time 3 | import uuid 4 | 5 | from assistant import get_answer 6 | from db import ( 7 | save_conversation, 8 | save_feedback, 9 | get_recent_conversations, 10 | get_feedback_stats, 11 | ) 12 | 13 | 14 | def print_log(message): 15 | print(message, flush=True) 16 | 17 | 18 | def main(): 19 | print_log("Starting the Course Assistant application") 20 | st.title("Course Assistant") 21 | 22 | # Session state initialization 23 | if "conversation_id" not in st.session_state: 24 | st.session_state.conversation_id = str(uuid.uuid4()) 25 | print_log( 26 | f"New conversation started with ID: {st.session_state.conversation_id}" 27 | ) 28 | if "count" not in st.session_state: 29 | st.session_state.count = 0 30 | print_log("Feedback count initialized to 0") 31 | 32 | # Course selection 33 | course = st.selectbox( 34 | "Select a course:", 35 | ["machine-learning-zoomcamp", "data-engineering-zoomcamp", "mlops-zoomcamp"], 36 | ) 37 | print_log(f"User selected course: {course}") 38 | 39 | # Model selection 40 | model_choice = st.selectbox( 41 | "Select a model:", 42 | ["ollama/phi3", "openai/gpt-3.5-turbo", "openai/gpt-4o", "openai/gpt-4o-mini"], 43 | ) 44 | print_log(f"User selected model: {model_choice}") 45 | 46 | # Search type selection 47 | search_type = st.radio("Select search type:", ["Text", "Vector"]) 48 | print_log(f"User selected search type: {search_type}") 49 | 50 | # User input 51 | user_input = st.text_input("Enter your question:") 52 | 53 | if st.button("Ask"): 54 | print_log(f"User asked: '{user_input}'") 55 | with st.spinner("Processing..."): 56 | print_log( 57 | f"Getting answer from assistant using {model_choice} model and {search_type} search" 58 | ) 59 | start_time = time.time() 60 | answer_data = get_answer(user_input, course, model_choice, search_type) 61 | end_time = time.time() 62 | print_log(f"Answer received in {end_time - start_time:.2f} seconds") 63 | st.success("Completed!") 64 | st.write(answer_data["answer"]) 65 | 66 | # Display monitoring information 67 | st.write(f"Response time: {answer_data['response_time']:.2f} seconds") 68 | st.write(f"Relevance: {answer_data['relevance']}") 69 | st.write(f"Model used: {answer_data['model_used']}") 70 | st.write(f"Total tokens: {answer_data['total_tokens']}") 71 | if answer_data["openai_cost"] > 0: 72 | st.write(f"OpenAI cost: ${answer_data['openai_cost']:.4f}") 73 | 74 | # Save conversation to database 75 | print_log("Saving conversation to database") 76 | save_conversation( 77 | st.session_state.conversation_id, user_input, answer_data, course 78 | ) 79 | print_log("Conversation saved successfully") 80 | # Generate a new conversation ID for next question 81 | 
st.session_state.conversation_id = str(uuid.uuid4()) 82 | 83 | # Feedback buttons 84 | col1, col2 = st.columns(2) 85 | with col1: 86 | if st.button("+1"): 87 | st.session_state.count += 1 88 | print_log( 89 | f"Positive feedback received. New count: {st.session_state.count}" 90 | ) 91 | save_feedback(st.session_state.conversation_id, 1) 92 | print_log("Positive feedback saved to database") 93 | with col2: 94 | if st.button("-1"): 95 | st.session_state.count -= 1 96 | print_log( 97 | f"Negative feedback received. New count: {st.session_state.count}" 98 | ) 99 | save_feedback(st.session_state.conversation_id, -1) 100 | print_log("Negative feedback saved to database") 101 | 102 | st.write(f"Current count: {st.session_state.count}") 103 | 104 | # Display recent conversations 105 | st.subheader("Recent Conversations") 106 | relevance_filter = st.selectbox( 107 | "Filter by relevance:", ["All", "RELEVANT", "PARTLY_RELEVANT", "NON_RELEVANT"] 108 | ) 109 | recent_conversations = get_recent_conversations( 110 | limit=5, relevance=relevance_filter if relevance_filter != "All" else None 111 | ) 112 | for conv in recent_conversations: 113 | st.write(f"Q: {conv['question']}") 114 | st.write(f"A: {conv['answer']}") 115 | st.write(f"Relevance: {conv['relevance']}") 116 | st.write(f"Model: {conv['model_used']}") 117 | st.write("---") 118 | 119 | # Display feedback stats 120 | feedback_stats = get_feedback_stats() 121 | st.subheader("Feedback Statistics") 122 | st.write(f"Thumbs up: {feedback_stats['thumbs_up']}") 123 | st.write(f"Thumbs down: {feedback_stats['thumbs_down']}") 124 | 125 | 126 | print_log("Streamlit app loop completed") 127 | 128 | 129 | if __name__ == "__main__": 130 | print_log("Course Assistant application started") 131 | main() 132 | -------------------------------------------------------------------------------- /cohorts/2024/04-monitoring/app/assistant.py: -------------------------------------------------------------------------------- 1 | import os 2 | import time 3 | import json 4 | 5 | from openai import OpenAI 6 | 7 | from elasticsearch import Elasticsearch 8 | from sentence_transformers import SentenceTransformer 9 | 10 | 11 | ELASTIC_URL = os.getenv("ELASTIC_URL", "http://elasticsearch:9200") 12 | OLLAMA_URL = os.getenv("OLLAMA_URL", "http://ollama:11434/v1/") 13 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "your-api-key-here") 14 | 15 | 16 | es_client = Elasticsearch(ELASTIC_URL) 17 | ollama_client = OpenAI(base_url=OLLAMA_URL, api_key="ollama") 18 | openai_client = OpenAI(api_key=OPENAI_API_KEY) 19 | 20 | model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1") 21 | 22 | 23 | def elastic_search_text(query, course, index_name="course-questions"): 24 | search_query = { 25 | "size": 5, 26 | "query": { 27 | "bool": { 28 | "must": { 29 | "multi_match": { 30 | "query": query, 31 | "fields": ["question^3", "text", "section"], 32 | "type": "best_fields", 33 | } 34 | }, 35 | "filter": {"term": {"course": course}}, 36 | } 37 | }, 38 | } 39 | 40 | response = es_client.search(index=index_name, body=search_query) 41 | return [hit["_source"] for hit in response["hits"]["hits"]] 42 | 43 | 44 | def elastic_search_knn(field, vector, course, index_name="course-questions"): 45 | knn = { 46 | "field": field, 47 | "query_vector": vector, 48 | "k": 5, 49 | "num_candidates": 10000, 50 | "filter": {"term": {"course": course}}, 51 | } 52 | 53 | search_query = { 54 | "knn": knn, 55 | "_source": ["text", "section", "question", "course", "id"], 56 | } 57 | 58 | es_results = 
es_client.search(index=index_name, body=search_query) 59 | 60 | return [hit["_source"] for hit in es_results["hits"]["hits"]] 61 | 62 | 63 | def build_prompt(query, search_results): 64 | prompt_template = """ 65 | You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database. 66 | Use only the facts from the CONTEXT when answering the QUESTION. 67 | 68 | QUESTION: {question} 69 | 70 | CONTEXT: 71 | {context} 72 | """.strip() 73 | 74 | context = "\n\n".join( 75 | [ 76 | f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}" 77 | for doc in search_results 78 | ] 79 | ) 80 | return prompt_template.format(question=query, context=context).strip() 81 | 82 | 83 | def llm(prompt, model_choice): 84 | start_time = time.time() 85 | if model_choice.startswith('ollama/'): 86 | response = ollama_client.chat.completions.create( 87 | model=model_choice.split('/')[-1], 88 | messages=[{"role": "user", "content": prompt}] 89 | ) 90 | answer = response.choices[0].message.content 91 | tokens = { 92 | 'prompt_tokens': response.usage.prompt_tokens, 93 | 'completion_tokens': response.usage.completion_tokens, 94 | 'total_tokens': response.usage.total_tokens 95 | } 96 | elif model_choice.startswith('openai/'): 97 | response = openai_client.chat.completions.create( 98 | model=model_choice.split('/')[-1], 99 | messages=[{"role": "user", "content": prompt}] 100 | ) 101 | answer = response.choices[0].message.content 102 | tokens = { 103 | 'prompt_tokens': response.usage.prompt_tokens, 104 | 'completion_tokens': response.usage.completion_tokens, 105 | 'total_tokens': response.usage.total_tokens 106 | } 107 | else: 108 | raise ValueError(f"Unknown model choice: {model_choice}") 109 | 110 | end_time = time.time() 111 | response_time = end_time - start_time 112 | 113 | return answer, tokens, response_time 114 | 115 | 116 | def evaluate_relevance(question, answer): 117 | prompt_template = """ 118 | You are an expert evaluator for a Retrieval-Augmented Generation (RAG) system. 119 | Your task is to analyze the relevance of the generated answer to the given question. 120 | Based on the relevance of the generated answer, you will classify it 121 | as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT". 
122 | 123 | Here is the data for evaluation: 124 | 125 | Question: {question} 126 | Generated Answer: {answer} 127 | 128 | Please analyze the content and context of the generated answer in relation to the question 129 | and provide your evaluation in parsable JSON without using code blocks: 130 | 131 | {{ 132 | "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT", 133 | "Explanation": "[Provide a brief explanation for your evaluation]" 134 | }} 135 | """.strip() 136 | 137 | prompt = prompt_template.format(question=question, answer=answer) 138 | evaluation, tokens, _ = llm(prompt, 'openai/gpt-4o-mini') 139 | 140 | try: 141 | json_eval = json.loads(evaluation) 142 | return json_eval['Relevance'], json_eval['Explanation'], tokens 143 | except json.JSONDecodeError: 144 | return "UNKNOWN", "Failed to parse evaluation", tokens 145 | 146 | 147 | def calculate_openai_cost(model_choice, tokens): 148 | openai_cost = 0 149 | 150 | if model_choice == 'openai/gpt-3.5-turbo': 151 | openai_cost = (tokens['prompt_tokens'] * 0.0015 + tokens['completion_tokens'] * 0.002) / 1000 152 | elif model_choice in ['openai/gpt-4o', 'openai/gpt-4o-mini']: 153 | openai_cost = (tokens['prompt_tokens'] * 0.03 + tokens['completion_tokens'] * 0.06) / 1000 154 | 155 | return openai_cost 156 | 157 | 158 | def get_answer(query, course, model_choice, search_type): 159 | if search_type == 'Vector': 160 | vector = model.encode(query) 161 | search_results = elastic_search_knn('question_text_vector', vector, course) 162 | else: 163 | search_results = elastic_search_text(query, course) 164 | 165 | prompt = build_prompt(query, search_results) 166 | answer, tokens, response_time = llm(prompt, model_choice) 167 | 168 | relevance, explanation, eval_tokens = evaluate_relevance(query, answer) 169 | 170 | openai_cost = calculate_openai_cost(model_choice, tokens) 171 | 172 | return { 173 | 'answer': answer, 174 | 'response_time': response_time, 175 | 'relevance': relevance, 176 | 'relevance_explanation': explanation, 177 | 'model_used': model_choice, 178 | 'prompt_tokens': tokens['prompt_tokens'], 179 | 'completion_tokens': tokens['completion_tokens'], 180 | 'total_tokens': tokens['total_tokens'], 181 | 'eval_prompt_tokens': eval_tokens['prompt_tokens'], 182 | 'eval_completion_tokens': eval_tokens['completion_tokens'], 183 | 'eval_total_tokens': eval_tokens['total_tokens'], 184 | 'openai_cost': openai_cost 185 | } -------------------------------------------------------------------------------- /cohorts/2024/04-monitoring/app/db.py: -------------------------------------------------------------------------------- 1 | import os 2 | import psycopg2 3 | from psycopg2.extras import DictCursor 4 | from datetime import datetime 5 | from zoneinfo import ZoneInfo 6 | 7 | tz = ZoneInfo("Europe/Berlin") 8 | 9 | 10 | def get_db_connection(): 11 | return psycopg2.connect( 12 | host=os.getenv("POSTGRES_HOST", "postgres"), 13 | database=os.getenv("POSTGRES_DB", "course_assistant"), 14 | user=os.getenv("POSTGRES_USER", "your_username"), 15 | password=os.getenv("POSTGRES_PASSWORD", "your_password"), 16 | ) 17 | 18 | 19 | def init_db(): 20 | conn = get_db_connection() 21 | try: 22 | with conn.cursor() as cur: 23 | cur.execute("DROP TABLE IF EXISTS feedback") 24 | cur.execute("DROP TABLE IF EXISTS conversations") 25 | 26 | cur.execute(""" 27 | CREATE TABLE conversations ( 28 | id TEXT PRIMARY KEY, 29 | question TEXT NOT NULL, 30 | answer TEXT NOT NULL, 31 | course TEXT NOT NULL, 32 | model_used TEXT NOT NULL, 33 | response_time FLOAT NOT NULL, 34 | 
relevance TEXT NOT NULL, 35 | relevance_explanation TEXT NOT NULL, 36 | prompt_tokens INTEGER NOT NULL, 37 | completion_tokens INTEGER NOT NULL, 38 | total_tokens INTEGER NOT NULL, 39 | eval_prompt_tokens INTEGER NOT NULL, 40 | eval_completion_tokens INTEGER NOT NULL, 41 | eval_total_tokens INTEGER NOT NULL, 42 | openai_cost FLOAT NOT NULL, 43 | timestamp TIMESTAMP WITH TIME ZONE NOT NULL 44 | ) 45 | """) 46 | cur.execute(""" 47 | CREATE TABLE feedback ( 48 | id SERIAL PRIMARY KEY, 49 | conversation_id TEXT REFERENCES conversations(id), 50 | feedback INTEGER NOT NULL, 51 | timestamp TIMESTAMP WITH TIME ZONE NOT NULL 52 | ) 53 | """) 54 | conn.commit() 55 | finally: 56 | conn.close() 57 | 58 | 59 | def save_conversation(conversation_id, question, answer_data, course, timestamp=None): 60 | if timestamp is None: 61 | timestamp = datetime.now(tz) 62 | 63 | conn = get_db_connection() 64 | try: 65 | with conn.cursor() as cur: 66 | cur.execute( 67 | """ 68 | INSERT INTO conversations 69 | (id, question, answer, course, model_used, response_time, relevance, 70 | relevance_explanation, prompt_tokens, completion_tokens, total_tokens, 71 | eval_prompt_tokens, eval_completion_tokens, eval_total_tokens, openai_cost, timestamp) 72 | VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, COALESCE(%s, CURRENT_TIMESTAMP)) 73 | """, 74 | ( 75 | conversation_id, 76 | question, 77 | answer_data["answer"], 78 | course, 79 | answer_data["model_used"], 80 | answer_data["response_time"], 81 | answer_data["relevance"], 82 | answer_data["relevance_explanation"], 83 | answer_data["prompt_tokens"], 84 | answer_data["completion_tokens"], 85 | answer_data["total_tokens"], 86 | answer_data["eval_prompt_tokens"], 87 | answer_data["eval_completion_tokens"], 88 | answer_data["eval_total_tokens"], 89 | answer_data["openai_cost"], 90 | timestamp, 91 | ), 92 | ) 93 | conn.commit() 94 | finally: 95 | conn.close() 96 | 97 | 98 | def save_feedback(conversation_id, feedback, timestamp=None): 99 | if timestamp is None: 100 | timestamp = datetime.now(tz) 101 | 102 | conn = get_db_connection() 103 | try: 104 | with conn.cursor() as cur: 105 | cur.execute( 106 | "INSERT INTO feedback (conversation_id, feedback, timestamp) VALUES (%s, %s, COALESCE(%s, CURRENT_TIMESTAMP))", 107 | (conversation_id, feedback, timestamp), 108 | ) 109 | conn.commit() 110 | finally: 111 | conn.close() 112 | 113 | 114 | def get_recent_conversations(limit=5, relevance=None): 115 | conn = get_db_connection() 116 | try: 117 | with conn.cursor(cursor_factory=DictCursor) as cur: 118 | query = """ 119 | SELECT c.*, f.feedback 120 | FROM conversations c 121 | LEFT JOIN feedback f ON c.id = f.conversation_id 122 | """ 123 | if relevance: 124 | query += f" WHERE c.relevance = '{relevance}'" 125 | query += " ORDER BY c.timestamp DESC LIMIT %s" 126 | 127 | cur.execute(query, (limit,)) 128 | return cur.fetchall() 129 | finally: 130 | conn.close() 131 | 132 | 133 | def get_feedback_stats(): 134 | conn = get_db_connection() 135 | try: 136 | with conn.cursor(cursor_factory=DictCursor) as cur: 137 | cur.execute(""" 138 | SELECT 139 | SUM(CASE WHEN feedback > 0 THEN 1 ELSE 0 END) as thumbs_up, 140 | SUM(CASE WHEN feedback < 0 THEN 1 ELSE 0 END) as thumbs_down 141 | FROM feedback 142 | """) 143 | return cur.fetchone() 144 | finally: 145 | conn.close() 146 | -------------------------------------------------------------------------------- /cohorts/2024/04-monitoring/app/docker-compose.yaml: 
-------------------------------------------------------------------------------- 1 | version: '3.8' 2 | 3 | services: 4 | elasticsearch: 5 | image: docker.elastic.co/elasticsearch/elasticsearch:8.4.3 6 | container_name: elasticsearch 7 | environment: 8 | - discovery.type=single-node 9 | - xpack.security.enabled=false 10 | ports: 11 | - "${ELASTIC_PORT:-9200}:9200" 12 | - "9300:9300" 13 | volumes: 14 | - elasticsearch_data:/usr/share/elasticsearch/data 15 | 16 | ollama: 17 | image: ollama/ollama 18 | container_name: ollama 19 | volumes: 20 | - ollama_data:/root/.ollama 21 | ports: 22 | - "${OLLAMA_PORT:-11434}:11434" 23 | 24 | postgres: 25 | image: postgres:13 26 | container_name: postgres 27 | environment: 28 | POSTGRES_DB: ${POSTGRES_DB} 29 | POSTGRES_USER: ${POSTGRES_USER} 30 | POSTGRES_PASSWORD: ${POSTGRES_PASSWORD} 31 | ports: 32 | - "${POSTGRES_PORT:-5432}:5432" 33 | volumes: 34 | - postgres_data:/var/lib/postgresql/data 35 | 36 | streamlit: 37 | build: 38 | context: . 39 | dockerfile: Dockerfile 40 | container_name: streamlit 41 | environment: 42 | - ELASTIC_URL=http://elasticsearch:${ELASTIC_PORT:-9200} 43 | - OLLAMA_URL=http://ollama:${OLLAMA_PORT:-11434}/v1/ 44 | - POSTGRES_HOST=postgres 45 | - POSTGRES_DB=${POSTGRES_DB} 46 | - POSTGRES_USER=${POSTGRES_USER} 47 | - POSTGRES_PASSWORD=${POSTGRES_PASSWORD} 48 | - MODEL_NAME=${MODEL_NAME} 49 | - INDEX_NAME=${INDEX_NAME} 50 | - OPENAI_API_KEY=${OPENAI_API_KEY} 51 | ports: 52 | - "${STREAMLIT_PORT:-8501}:8501" 53 | depends_on: 54 | - elasticsearch 55 | - ollama 56 | - postgres 57 | 58 | grafana: 59 | image: grafana/grafana:latest 60 | container_name: grafana 61 | ports: 62 | - "3000:3000" 63 | volumes: 64 | - grafana_data:/var/lib/grafana 65 | environment: 66 | - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-admin} 67 | depends_on: 68 | - postgres 69 | 70 | volumes: 71 | elasticsearch_data: 72 | ollama_data: 73 | postgres_data: 74 | grafana_data: 75 | -------------------------------------------------------------------------------- /cohorts/2024/04-monitoring/app/generate_data.py: -------------------------------------------------------------------------------- 1 | import time 2 | import random 3 | import uuid 4 | from datetime import datetime, timedelta 5 | from zoneinfo import ZoneInfo 6 | from db import save_conversation, save_feedback, get_db_connection 7 | 8 | # Set the timezone to CET (Europe/Berlin) 9 | tz = ZoneInfo("Europe/Berlin") 10 | 11 | # List of sample questions and answers 12 | SAMPLE_QUESTIONS = [ 13 | "What is machine learning?", 14 | "How does linear regression work?", 15 | "Explain the concept of overfitting.", 16 | "What is the difference between supervised and unsupervised learning?", 17 | "How does cross-validation help in model evaluation?", 18 | ] 19 | 20 | SAMPLE_ANSWERS = [ 21 | "Machine learning is a subset of artificial intelligence that focuses on the development of algorithms and statistical models that enable computer systems to improve their performance on a specific task through experience.", 22 | "Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.", 23 | "Overfitting occurs when a machine learning model learns the training data too well, including its noise and fluctuations, resulting in poor generalization to new, unseen data.", 24 | "Supervised learning involves training models on labeled data, while unsupervised learning deals with finding patterns in unlabeled 
data without predefined outputs.", 25 | "Cross-validation is a technique used to assess how well a model will generalize to an independent dataset. It involves partitioning the data into subsets, training the model on a subset, and validating it on the remaining data.", 26 | ] 27 | 28 | COURSES = ["machine-learning-zoomcamp", "data-engineering-zoomcamp", "mlops-zoomcamp"] 29 | MODELS = ["ollama/phi3", "openai/gpt-3.5-turbo", "openai/gpt-4o", "openai/gpt-4o-mini"] 30 | RELEVANCE = ["RELEVANT", "PARTLY_RELEVANT", "NON_RELEVANT"] 31 | 32 | 33 | def generate_synthetic_data(start_time, end_time): 34 | current_time = start_time 35 | conversation_count = 0 36 | print(f"Starting historical data generation from {start_time} to {end_time}") 37 | while current_time < end_time: 38 | conversation_id = str(uuid.uuid4()) 39 | question = random.choice(SAMPLE_QUESTIONS) 40 | answer = random.choice(SAMPLE_ANSWERS) 41 | course = random.choice(COURSES) 42 | model = random.choice(MODELS) 43 | relevance = random.choice(RELEVANCE) 44 | 45 | openai_cost = 0 46 | 47 | if model.startswith("openai/"): 48 | openai_cost = random.uniform(0.001, 0.1) 49 | 50 | answer_data = { 51 | "answer": answer, 52 | "response_time": random.uniform(0.5, 5.0), 53 | "relevance": relevance, 54 | "relevance_explanation": f"This answer is {relevance.lower()} to the question.", 55 | "model_used": model, 56 | "prompt_tokens": random.randint(50, 200), 57 | "completion_tokens": random.randint(50, 300), 58 | "total_tokens": random.randint(100, 500), 59 | "eval_prompt_tokens": random.randint(50, 150), 60 | "eval_completion_tokens": random.randint(20, 100), 61 | "eval_total_tokens": random.randint(70, 250), 62 | "openai_cost": openai_cost, 63 | } 64 | 65 | save_conversation(conversation_id, question, answer_data, course, current_time) 66 | print( 67 | f"Saved conversation: ID={conversation_id}, Time={current_time}, Course={course}, Model={model}" 68 | ) 69 | 70 | if random.random() < 0.7: 71 | feedback = 1 if random.random() < 0.8 else -1 72 | save_feedback(conversation_id, feedback, current_time) 73 | print( 74 | f"Saved feedback for conversation {conversation_id}: {'Positive' if feedback > 0 else 'Negative'}" 75 | ) 76 | 77 | current_time += timedelta(minutes=random.randint(1, 15)) 78 | conversation_count += 1 79 | if conversation_count % 10 == 0: 80 | print(f"Generated {conversation_count} conversations so far...") 81 | 82 | print( 83 | f"Historical data generation complete. 
Total conversations: {conversation_count}" 84 | ) 85 | 86 | 87 | def generate_live_data(): 88 | conversation_count = 0 89 | print("Starting live data generation...") 90 | while True: 91 | current_time = datetime.now(tz) 92 | # current_time = None 93 | conversation_id = str(uuid.uuid4()) 94 | question = random.choice(SAMPLE_QUESTIONS) 95 | answer = random.choice(SAMPLE_ANSWERS) 96 | course = random.choice(COURSES) 97 | model = random.choice(MODELS) 98 | relevance = random.choice(RELEVANCE) 99 | 100 | openai_cost = 0 101 | 102 | if model.startswith("openai/"): 103 | openai_cost = random.uniform(0.001, 0.1) 104 | 105 | answer_data = { 106 | "answer": answer, 107 | "response_time": random.uniform(0.5, 5.0), 108 | "relevance": relevance, 109 | "relevance_explanation": f"This answer is {relevance.lower()} to the question.", 110 | "model_used": model, 111 | "prompt_tokens": random.randint(50, 200), 112 | "completion_tokens": random.randint(50, 300), 113 | "total_tokens": random.randint(100, 500), 114 | "eval_prompt_tokens": random.randint(50, 150), 115 | "eval_completion_tokens": random.randint(20, 100), 116 | "eval_total_tokens": random.randint(70, 250), 117 | "openai_cost": openai_cost, 118 | } 119 | 120 | save_conversation(conversation_id, question, answer_data, course, current_time) 121 | print( 122 | f"Saved live conversation: ID={conversation_id}, Time={current_time}, Course={course}, Model={model}" 123 | ) 124 | 125 | if random.random() < 0.7: 126 | feedback = 1 if random.random() < 0.8 else -1 127 | save_feedback(conversation_id, feedback, current_time) 128 | print( 129 | f"Saved feedback for live conversation {conversation_id}: {'Positive' if feedback > 0 else 'Negative'}" 130 | ) 131 | 132 | conversation_count += 1 133 | if conversation_count % 10 == 0: 134 | print(f"Generated {conversation_count} live conversations so far...") 135 | 136 | time.sleep(1) 137 | 138 | 139 | if __name__ == "__main__": 140 | print(f"Script started at {datetime.now(tz)}") 141 | end_time = datetime.now(tz) 142 | start_time = end_time - timedelta(hours=6) 143 | print(f"Generating historical data from {start_time} to {end_time}") 144 | generate_synthetic_data(start_time, end_time) 145 | print("Historical data generation complete.") 146 | 147 | print("Starting live data generation... 
Press Ctrl+C to stop.") 148 | try: 149 | generate_live_data() 150 | except KeyboardInterrupt: 151 | print(f"Live data generation stopped at {datetime.now(tz)}.") 152 | finally: 153 | print(f"Script ended at {datetime.now(tz)}") 154 | -------------------------------------------------------------------------------- /cohorts/2024/04-monitoring/app/prep.py: -------------------------------------------------------------------------------- 1 | import os 2 | import requests 3 | import pandas as pd 4 | from sentence_transformers import SentenceTransformer 5 | from elasticsearch import Elasticsearch 6 | from tqdm.auto import tqdm 7 | from dotenv import load_dotenv 8 | 9 | from db import init_db 10 | 11 | load_dotenv() 12 | 13 | ELASTIC_URL = os.getenv("ELASTIC_URL_LOCAL") 14 | MODEL_NAME = os.getenv("MODEL_NAME") 15 | INDEX_NAME = os.getenv("INDEX_NAME") 16 | 17 | BASE_URL = "https://github.com/DataTalksClub/llm-zoomcamp/blob/main" 18 | 19 | 20 | def fetch_documents(): 21 | print("Fetching documents...") 22 | relative_url = "03-vector-search/eval/documents-with-ids.json" 23 | docs_url = f"{BASE_URL}/{relative_url}?raw=1" 24 | docs_response = requests.get(docs_url) 25 | documents = docs_response.json() 26 | print(f"Fetched {len(documents)} documents") 27 | return documents 28 | 29 | 30 | def fetch_ground_truth(): 31 | print("Fetching ground truth data...") 32 | relative_url = "03-vector-search/eval/ground-truth-data.csv" 33 | ground_truth_url = f"{BASE_URL}/{relative_url}?raw=1" 34 | df_ground_truth = pd.read_csv(ground_truth_url) 35 | df_ground_truth = df_ground_truth[ 36 | df_ground_truth.course == "machine-learning-zoomcamp" 37 | ] 38 | ground_truth = df_ground_truth.to_dict(orient="records") 39 | print(f"Fetched {len(ground_truth)} ground truth records") 40 | return ground_truth 41 | 42 | 43 | def load_model(): 44 | print(f"Loading model: {MODEL_NAME}") 45 | return SentenceTransformer(MODEL_NAME) 46 | 47 | 48 | def setup_elasticsearch(): 49 | print("Setting up Elasticsearch...") 50 | es_client = Elasticsearch(ELASTIC_URL) 51 | 52 | index_settings = { 53 | "settings": {"number_of_shards": 1, "number_of_replicas": 0}, 54 | "mappings": { 55 | "properties": { 56 | "text": {"type": "text"}, 57 | "section": {"type": "text"}, 58 | "question": {"type": "text"}, 59 | "course": {"type": "keyword"}, 60 | "id": {"type": "keyword"}, 61 | "question_text_vector": { 62 | "type": "dense_vector", 63 | "dims": 384, 64 | "index": True, 65 | "similarity": "cosine", 66 | }, 67 | } 68 | }, 69 | } 70 | 71 | es_client.indices.delete(index=INDEX_NAME, ignore_unavailable=True) 72 | es_client.indices.create(index=INDEX_NAME, body=index_settings) 73 | print(f"Elasticsearch index '{INDEX_NAME}' created") 74 | return es_client 75 | 76 | 77 | def index_documents(es_client, documents, model): 78 | print("Indexing documents...") 79 | for doc in tqdm(documents): 80 | question = doc["question"] 81 | text = doc["text"] 82 | doc["question_text_vector"] = model.encode(question + " " + text).tolist() 83 | es_client.index(index=INDEX_NAME, document=doc) 84 | print(f"Indexed {len(documents)} documents") 85 | 86 | 87 | def main(): 88 | # you may consider to comment 89 | # if you just want to init the db or didn't want to re-index 90 | print("Starting the indexing process...") 91 | 92 | documents = fetch_documents() 93 | ground_truth = fetch_ground_truth() 94 | model = load_model() 95 | es_client = setup_elasticsearch() 96 | index_documents(es_client, documents, model) 97 | # you may consider to comment 98 | 99 | print("Initializing 
database...") 100 | init_db() 101 | 102 | print("Indexing process completed successfully!") 103 | 104 | 105 | if __name__ == "__main__": 106 | main() 107 | -------------------------------------------------------------------------------- /cohorts/2024/04-monitoring/app/requirements.txt: -------------------------------------------------------------------------------- 1 | streamlit 2 | elasticsearch==8.14.0 3 | psycopg2-binary==2.9.9 4 | python-dotenv 5 | openai==1.35.7 6 | sentence-transformers==2.7.0 7 | numpy==1.26.4 8 | 9 | --find-links https://download.pytorch.org/whl/cpu/torch_stable.html 10 | torch==2.3.1+cpu -------------------------------------------------------------------------------- /cohorts/2024/04-monitoring/grafana.md: -------------------------------------------------------------------------------- 1 | ## Original queries 2 | 3 | Response Time Panel: 4 | 5 | ```sql 6 | SELECT 7 | timestamp AS time, 8 | response_time 9 | FROM conversations 10 | ORDER BY timestamp 11 | ``` 12 | 13 | Relevance Distribution Panel: 14 | 15 | ```sql 16 | SELECT 17 | relevance, 18 | COUNT(*) as count 19 | FROM conversations 20 | GROUP BY relevance 21 | ``` 22 | 23 | Model Usage Panel: 24 | 25 | ```sql 26 | SELECT 27 | model_used, 28 | COUNT(*) as count 29 | FROM conversations 30 | GROUP BY model_used 31 | ``` 32 | 33 | 34 | Token Usage Panel: 35 | 36 | ```sql 37 | SELECT 38 | timestamp AS time, 39 | total_tokens 40 | FROM conversations 41 | ORDER BY timestamp 42 | ``` 43 | 44 | OpenAI Cost Panel: 45 | 46 | ```sql 47 | SELECT 48 | timestamp AS time, 49 | openai_cost 50 | FROM conversations 51 | WHERE openai_cost > 0 52 | ORDER BY timestamp 53 | ``` 54 | 55 | Recent Conversations Panel: 56 | 57 | ```sql 58 | SELECT 59 | timestamp AS time, 60 | question, 61 | answer, 62 | relevance 63 | FROM conversations 64 | ORDER BY timestamp DESC 65 | LIMIT 5 66 | ``` 67 | 68 | Feedback Statistics Panel: 69 | 70 | ```sql 71 | SELECT 72 | SUM(CASE WHEN feedback > 0 THEN 1 ELSE 0 END) as thumbs_up, 73 | SUM(CASE WHEN feedback < 0 THEN 1 ELSE 0 END) as thumbs_down 74 | FROM feedback 75 | ``` 76 | 77 | ## Revised queries 78 | 79 | We can (and should) also use special grafana variables 80 | 81 | - `$__timeFrom()` and `$__timeTo()`: Start and end of the selected time range 82 | - `$__timeGroup(timestamp, $__interval)`: Groups results by time intervals automatically calculated by Grafana 83 | 84 | ### 1. Response Time Panel 85 | 86 | This query shows the response time for each conversation within the selected time range: 87 | 88 | ```sql 89 | SELECT 90 | timestamp AS time, 91 | response_time 92 | FROM conversations 93 | WHERE timestamp BETWEEN $__timeFrom() AND $__timeTo() 94 | ORDER BY timestamp 95 | ``` 96 | 97 | ### 2. Relevance Distribution Panel 98 | 99 | This query counts the number of conversations for each relevance category within the selected time range: 100 | 101 | ```sql 102 | SELECT 103 | relevance, 104 | COUNT(*) as count 105 | FROM conversations 106 | WHERE timestamp BETWEEN $__timeFrom() AND $__timeTo() 107 | GROUP BY relevance 108 | ``` 109 | 110 | ### 3. Model Usage Panel 111 | 112 | This query counts the number of times each model was used within the selected time range: 113 | 114 | ```sql 115 | SELECT 116 | model_used, 117 | COUNT(*) as count 118 | FROM conversations 119 | WHERE timestamp BETWEEN $__timeFrom() AND $__timeTo() 120 | GROUP BY model_used 121 | ``` 122 | 123 | ### 4. 
Token Usage Panel 124 | 125 | This query shows the average token usage over time, grouped by Grafana's automatically calculated interval: 126 | 127 | ```sql 128 | SELECT 129 | $__timeGroup(timestamp, $__interval) AS time, 130 | AVG(total_tokens) AS avg_tokens 131 | FROM conversations 132 | WHERE timestamp BETWEEN $__timeFrom() AND $__timeTo() 133 | GROUP BY 1 134 | ORDER BY 1 135 | ``` 136 | 137 | ### 5. OpenAI Cost Panel 138 | 139 | This query shows the total OpenAI cost over time, grouped by Grafana's automatically calculated interval: 140 | 141 | ```sql 142 | SELECT 143 | $__timeGroup(timestamp, $__interval) AS time, 144 | SUM(openai_cost) AS total_cost 145 | FROM conversations 146 | WHERE timestamp BETWEEN $__timeFrom() AND $__timeTo() 147 | AND openai_cost > 0 148 | GROUP BY 1 149 | ORDER BY 1 150 | ``` 151 | 152 | ### 6. Recent Conversations Panel 153 | 154 | This query retrieves the 5 most recent conversations within the selected time range: 155 | 156 | ```sql 157 | SELECT 158 | timestamp AS time, 159 | question, 160 | answer, 161 | relevance 162 | FROM conversations 163 | WHERE timestamp BETWEEN $__timeFrom() AND $__timeTo() 164 | ORDER BY timestamp DESC 165 | LIMIT 5 166 | ``` 167 | 168 | ### 7. Feedback Statistics Panel 169 | 170 | This query calculates the total number of positive and negative feedback within the selected time range: 171 | 172 | ```sql 173 | SELECT 174 | SUM(CASE WHEN feedback > 0 THEN 1 ELSE 0 END) as thumbs_up, 175 | SUM(CASE WHEN feedback < 0 THEN 1 ELSE 0 END) as thumbs_down 176 | FROM feedback 177 | WHERE timestamp BETWEEN $__timeFrom() AND $__timeTo() 178 | ``` -------------------------------------------------------------------------------- /cohorts/2024/04-monitoring/homework.md: -------------------------------------------------------------------------------- 1 | ## Homework: Evaluation and Monitoring 2 | 3 | Solution: [solution.ipynb](solution.ipynb) 4 | 5 | In this homework, we'll evaluate the quality of our RAG system. 6 | 7 | > It's possible that your answers won't match exactly. If it's the case, select the closest one. 8 | 9 | Solution: 10 | 11 | * Video: TBA 12 | * Notebook: TBA 13 | 14 | ## Getting the data 15 | 16 | Let's start by getting the dataset. We will use the data we generated in the module. 17 | 18 | In particular, we'll evaluate the quality of our RAG system 19 | with [gpt-4o-mini](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv) 20 | 21 | 22 | Read it: 23 | 24 | ```python 25 | url = f'{github_url}?raw=1' 26 | df = pd.read_csv(url) 27 | ``` 28 | 29 | We will use only the first 300 documents: 30 | 31 | 32 | ```python 33 | df = df.iloc[:300] 34 | ``` 35 | 36 | ## Q1. Getting the embeddings model 37 | 38 | Now, get the embeddings model `multi-qa-mpnet-base-dot-v1` from 39 | [the Sentence Transformer library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview) 40 | 41 | > Note: this is not the same model as in HW3 42 | 43 | ```bash 44 | from sentence_transformers import SentenceTransformer 45 | embedding_model = SentenceTransformer(model_name) 46 | ``` 47 | 48 | Create the embeddings for the first LLM answer: 49 | 50 | ```python 51 | answer_llm = df.iloc[0].answer_llm 52 | ``` 53 | 54 | What's the first value of the resulting vector? 55 | 56 | * -0.42 57 | * -0.22 58 | * -0.02 59 | * 0.21 60 | 61 | 62 | ## Q2. 
Computing the dot product 63 | 64 | 65 | Now for each answer pair, let's create embeddings and compute dot product between them 66 | 67 | We will put the results (scores) into the `evaluations` list 68 | 69 | What's the 75% percentile of the score? 70 | 71 | * 21.67 72 | * 31.67 73 | * 41.67 74 | * 51.67 75 | 76 | ## Q3. Computing the cosine 77 | 78 | From Q2, we can see that the results are not within the [0, 1] range. It's because the vectors coming from this model are not normalized. 79 | 80 | So we need to normalize them. 81 | 82 | To do it, we 83 | 84 | * Compute the norm of a vector 85 | * Divide each element by this norm 86 | 87 | So, for vector `v`, it'll be `v / ||v||` 88 | 89 | In numpy, this is how you do it: 90 | 91 | ```python 92 | norm = np.sqrt((v * v).sum()) 93 | v_norm = v / norm 94 | ``` 95 | 96 | Let's put it into a function and then compute dot product 97 | between normalized vectors. This will give us cosine similarity 98 | 99 | What's the 75% cosine in the scores? 100 | 101 | * 0.63 102 | * 0.73 103 | * 0.83 104 | * 0.93 105 | 106 | ## Q4. Rouge 107 | 108 | Now we will explore an alternative metric - the ROUGE score. 109 | 110 | This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs. 111 | 112 | It can give a more nuanced view of text similarity than just cosine similarity alone. 113 | 114 | We don't need to implement it ourselves, there's a python package for it: 115 | 116 | ```bash 117 | pip install rouge 118 | ``` 119 | 120 | (The latest version at the moment of writing is `1.0.1`) 121 | 122 | Let's compute the ROUGE score between the answers at the index 10 of our dataframe (`doc_id=5170565b`) 123 | 124 | ``` 125 | from rouge import Rouge 126 | rouge_scorer = Rouge() 127 | 128 | scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0] 129 | ``` 130 | 131 | There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, and precision, recall and F1 score for each. 132 | 133 | * `rouge-1` - the overlap of unigrams, 134 | * `rouge-2` - bigrams, 135 | * `rouge-l` - the longest common subsequence 136 | 137 | What's the F score for `rouge-1`? 138 | 139 | - 0.35 140 | - 0.45 141 | - 0.55 142 | - 0.65 143 | 144 | ## Q5. Average rouge score 145 | 146 | Let's compute the average F-score between `rouge-1`, `rouge-2` and `rouge-l` for the same record from Q4 147 | 148 | - 0.35 149 | - 0.45 150 | - 0.55 151 | - 0.65 152 | 153 | ## Q6. Average rouge score for all the data points 154 | 155 | Now let's compute the F-score for all the records and create a dataframe from them. 156 | 157 | What's the average F-score in `rouge_2` across all the records? 158 | 159 | - 0.10 160 | - 0.20 161 | - 0.30 162 | - 0.40 163 | 164 | 165 | 166 | ## Submit the results 167 | 168 | * Submit your results here: https://courses.datatalks.club/llm-zoomcamp-2024/homework/hw4 169 | * It's possible that your answers won't match exactly. If it's the case, select the closest one. 170 | -------------------------------------------------------------------------------- /cohorts/2024/05-orchestration/README.md: -------------------------------------------------------------------------------- 1 | # Data Preparation in RAG 2 | 3 | ## Getting started 4 | 5 | 1. Clone [repository](https://github.com/mage-ai/rag-project) 6 | ```bash 7 | git clone https://github.com/mage-ai/rag-project 8 | cd rag-project 9 | ``` 10 | 3. navigate to the `rag-project/llm` directory, add `spacy` to the requirements.txt. 11 | 4. 
Then update the `Dockerfile` found in the `rag-project` directory with the following: 12 | ```YAML 13 | RUN python -m spacy download en_core_web_sm 14 | ``` 15 | 4. Run 16 | 17 | ```bash 18 | `./scripts/start.sh` 19 | ``` 20 | 21 | Once started, go to [http://localhost:6789/](http://localhost:6789/) 22 | 23 | For more setup information, refer to these [instructions](https://docs.mage.ai/getting-started/setup#docker-compose-template) 24 | 25 | 26 | ## 0. Module overview 27 | 28 | 29 | 30 | 31 | 32 | ## 1. Ingest 33 | 34 | In this section, we cover the ingestion of documents from a single data source. 35 | 36 | 37 | 38 | 39 | 40 | * [Code](https://github.com/mage-ai/rag-project/blob/master/llm/rager/data_loaders/runic_oblivion.py) 41 | * [Document link for API Data Loader](https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/01-intro/documents.json) 42 | 43 | ## 2. Chunk 44 | 45 | Once data is ingested, we break it into manageable chunks. 46 | 47 | The Q&A data is already chunked - the texts are small 48 | and easy to process and index. But other datasets might 49 | not be (book texts, transcripts, etc). 50 | 51 | In this video, we will talk about turning large texts 52 | into smaller documents - i.e. chunking. 53 | 54 | 55 | 56 | 57 | 58 | 59 | [Code](https://github.com/mage-ai/rag-project/blob/master/llm/rager/transformers/radiant_photon.py) 60 | 61 | ## 3. Tokenization 62 | 63 | Tokenization is a crucial step in text processing and preparing the data for effective retrieval. 64 | 65 | 66 | 67 | 68 | 69 | [Code](https://github.com/mage-ai/rag-project/blob/master/llm/rager/transformers/vivid_nexus.py) 70 | 71 | ## 4. Embed 72 | 73 | Embedding data translates text into numerical vectors that can be processed by models. 74 | 75 | Previously we used sentence transformers for that. In this video we show a different strategy for it. 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | [Code](https://github.com/mage-ai/rag-project/blob/master/llm/rager/transformers/prismatic_axiom.py) 84 | 85 | 86 | ## 5. Export 87 | 88 | After processing, data needs to be exported for storage so that it can be retrieved for better contextualization of user queries. 89 | 90 | Here we will save the embeddings to elasticsearch 91 | 92 | please make sure to use the name given to your elasticsearch service in your docker compose file followed by the port as the connection string, e.g below 93 | 94 | `` http://elasticsearch:9200 95 | 96 | 97 | 98 | 99 | 100 | 101 | [Code](https://github.com/mage-ai/rag-project/blob/master/llm/rager/data_exporters/numinous_fission.py) 102 | 103 | ## 6. Retrieval: Test Vector Search Query 104 | 105 | After exporting the chunks and embeddings, we can test the search query to retrieve relevant documents on sample queries. 106 | 107 | 108 | 109 | 110 | 111 | [Code](code/06_retrieval.py) 112 | 113 | ## 7. Trigger Daily Runs 114 | 115 | Automation is key to maintaining and updating your system. 116 | This section demonstrates how to schedule and trigger daily runs for your data pipelines, ensuring up-to-date and consistent data processing. 117 | 118 | 119 | 120 | 121 | 122 | ## Homework 123 | 124 | See [here](../cohorts/2024/05-orchestration/homework.md). 125 | 126 | # Notes 127 | 128 | * First link goes here 129 | * [Notes by Abiodun Mage RAG error fixes](https://github.com/AOGbadamosi2018/llm-zoomcamp/blob/main/06%20-%20orchestration/mage_rag_notes.md). 130 | * Did you take notes? 
Add them above this line (Send a PR with *links* to your notes) 131 | -------------------------------------------------------------------------------- /cohorts/2024/05-orchestration/code/06_retrieval.py: -------------------------------------------------------------------------------- 1 | from typing import Dict, List, Union 2 | 3 | import numpy as np 4 | from elasticsearch import Elasticsearch, exceptions 5 | 6 | 7 | SAMPLE__EMBEDDINGS = [ 8 | [-0.1465761959552765, -0.4822517931461334, 0.07130702584981918, -0.25872930884361267, -0.1563894897699356, 0.16641047596931458, 0.24484659731388092, 0.2410498708486557, 0.008032954297959805, 0.17045290768146515, -0.009397129528224468, 0.09619587659835815, -0.22729521989822388, 0.10254761576652527, 0.016890447586774826, -0.13290464878082275, 0.11240798979997635, -0.11204371601343155, -0.057132963091135025, -0.011206787079572678, -0.007982085458934307, 0.279083788394928, 0.20115645229816437, -0.1427406221628189, -0.19398854672908783, -0.035979654639959335, 0.20723149180412292, 0.29891034960746765, 0.21407313644886017, 0.09746530652046204, 0.1671638935804367, 0.08161208778619766, 0.3090828061103821, -0.20648667216300964, 0.48498260974884033, -0.12691514194011688, 0.518856406211853, -0.26291757822036743, -0.0949832871556282, 0.09556109458208084, -0.20844918489456177, 0.2685297429561615, 0.053442806005477905, 0.05103180184960365, 0.1029752567410469, 0.04935301095247269, -0.11679927259683609, -0.012528933584690094, -0.08489680290222168, 0.013589601963758469, -0.32059246301651, 0.10357264429330826, -0.09533575177192688, 0.02984568662941456, 0.2793693542480469, -0.2653750777244568, -0.24152781069278717, -0.3563413619995117, 0.09674381464719772, -0.26155123114585876, -0.1397126317024231, -0.009133181534707546, 0.05972130224108696, -0.10438819974660873, 0.21889159083366394, 0.0694752112030983, -0.1312003880739212, -0.31072548031806946, -0.002836169209331274, 0.2468366175889969, 0.09420009702444077, 0.1284026801586151, -0.03227006644010544, -0.012532072141766548, 0.6650756597518921, -0.14863784611225128, 0.005239118821918964, -0.3317912817001343, 0.16372767090797424, -0.20166568458080292, 0.029721004888415337, -0.18536655604839325, -0.3608534038066864, -0.18234892189502716, 0.019248824566602707, 0.25257956981658936, 0.09671413153409958, 0.15569280087947845, -0.38228726387023926, 0.37017977237701416, 0.03356296569108963, -0.21182948350906372, 0.48848846554756165, 0.18350018560886383, -0.23519110679626465, -0.17464864253997803], [-0.18246106803417206, -0.36036479473114014, 0.3282334506511688, -0.230922132730484, 0.09600532799959183, 0.6859422326087952, 0.0581890344619751, 0.4913463294506073, 0.1536773443222046, -0.2965141832828522, 0.08466599136590958, 0.319297194480896, -0.15651769936084747, -0.043428342789411545, 0.014402368105947971, 0.16681505739688873, 0.22521673142910004, -0.2715776264667511, -0.11033261567354202, -0.04398636147379875, 0.3480629622936249, 0.11897992342710495, 0.8724615573883057, 0.10258488357067108, -0.5719427466392517, -0.03029855526983738, 0.23351268470287323, 0.20660561323165894, 0.575685441493988, -0.12116186320781708, 0.18459142744541168, -0.12865227460861206, 0.3948173522949219, -0.34464019536972046, 0.6699116230010986, -0.45167359709739685, 1.1505522727966309, -0.4498964548110962, -0.3248189687728882, -0.29674994945526123, -0.3570491075515747, 0.5436431765556335, 0.49576905369758606, -0.11180296540260315, -0.02045607566833496, -0.22768598794937134, -0.37912657856941223, -0.30414703488349915, -0.48289090394973755, 
-0.04158346354961395, -0.3547952473163605, 0.0687602087855339, 0.041512664407491684, 0.33524179458618164, 0.21826978027820587, -0.443082332611084, -0.5049593448638916, -0.5298929810523987, -0.02618088759481907, -0.2748631536960602, -0.1986193209886551, 0.35475826263427734, 0.22456413507461548, -0.29532068967819214, 0.25150877237319946, 0.243370920419693, -0.29938358068466187, -0.2128247618675232, -0.15292000770568848, -0.14813245832920074, -0.06183856353163719, -0.1251668632030487, 0.14256533980369568, -0.22781267762184143, 0.8101184964179993, 0.19796361029148102, 0.09104947745800018, -0.4860817790031433, 0.3078012764453888, -0.27373194694519043, 0.11800770461559296, -0.45869407057762146, 0.09508189558982849, -0.23971715569496155, -0.27427223324775696, 0.5139415264129639, 0.1871502846479416, 0.06647063046693802, -0.4054469168186188, 0.4751380681991577, 0.17067894339561462, 0.12443914264440536, 0.3577817678451538, 0.10574143379926682, -0.3181760311126709, -0.23804502189159393] 9 | ] 10 | 11 | 12 | @data_loader 13 | def search(*args, **kwargs) -> List[Dict]: 14 | """ 15 | query_embedding: Union[List[int], np.ndarray] 16 | """ 17 | 18 | connection_string = kwargs.get('connection_string', 'http://localhost:9200') 19 | index_name = kwargs.get('index_name', 'documents') 20 | source = kwargs.get('source', "cosineSimilarity(params.query_vector, 'embedding') + 1.0") 21 | top_k = kwargs.get('top_k', 5) 22 | chunk_column = kwargs.get('chunk_column', 'content') 23 | 24 | query_embedding = None 25 | if len(args): 26 | query_embedding = args[0] 27 | if not query_embedding: 28 | query_embedding = SAMPLE__EMBEDDINGS[0] 29 | 30 | if isinstance(query_embedding, np.ndarray): 31 | query_embedding = query_embedding.tolist() 32 | 33 | script_query = { 34 | "script_score": { 35 | "query": {"match_all": {}}, 36 | "script": { 37 | "source": source, 38 | "params": {"query_vector": query_embedding}, 39 | } 40 | } 41 | } 42 | 43 | print("Sending script query:", script_query) 44 | 45 | es_client = Elasticsearch(connection_string) 46 | 47 | try: 48 | response = es_client.search( 49 | index=index_name, 50 | body={ 51 | "size": top_k, 52 | "query": script_query, 53 | "_source": [chunk_column], 54 | }, 55 | ) 56 | 57 | print("Raw response from Elasticsearch:", response) 58 | 59 | return [hit['_source'][chunk_column] for hit in response['hits']['hits']] 60 | 61 | except exceptions.BadRequestError as e: 62 | print(f"BadRequestError: {e.info}") 63 | return [] 64 | except Exception as e: 65 | print(f"Unexpected error: {e}") 66 | return [] 67 | -------------------------------------------------------------------------------- /cohorts/2024/05-orchestration/homework.md: -------------------------------------------------------------------------------- 1 | ## Homework: LLM Orchestration and Ingestion 2 | 3 | > It's possible that your answers won't match exactly. If it's the case, select the closest one. 4 | 5 | Our FAQ documents change with time: students add more records 6 | and edit existing ones. We need to keep our index in sync. 7 | 8 | There are two ways of doing it: 9 | 10 | 1. Incremental: you only update records that got changed, created or deleted 11 | 2. Full update: you recreate the entire index from scratch 12 | 13 | In this homework, we'll look at full update. We will run our 14 | indexing pipeline daily and re-create the index from scracth 15 | each time we run. 
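To make the full-update idea concrete, here is a rough sketch of what "recreate the index from scratch" looks like outside of Mage (illustrative only — the index settings and the timestamped name mirror what we set up in Q4 below, and the actual pipeline is built as Mage blocks in the questions that follow):

```python
from datetime import datetime

from elasticsearch import Elasticsearch

es_client = Elasticsearch("http://localhost:9200")

# every run writes into a brand-new, timestamped index instead of patching the old one
index_name = f"documents_{datetime.now().strftime('%Y%m%d_%M%S')}"

index_settings = {
    "settings": {"number_of_shards": 1, "number_of_replicas": 0},
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
            "document_id": {"type": "keyword"},
        }
    },
}

es_client.indices.create(index=index_name, body=index_settings)

# a single made-up record for illustration; in the pipeline this is the full list of parsed FAQ documents
documents = [
    {
        "course": "llm-zoomcamp",
        "section": "General course-related questions",
        "question": "When is the next cohort?",
        "text": "Check the course page for the next start date.",
        "document_id": "abc12345",
    }
]

for doc in documents:
    es_client.index(index=index_name, document=doc)

print(f"Indexed {len(documents)} documents into {index_name}")
```

In Mage, the same steps are split across reader, transformer and exporter blocks — that's what the questions below walk through.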
16 | 17 | 18 | For that, we created two FAQ documents for LLM Zoomcamp 19 | 20 | * [version 1](https://docs.google.com/document/d/1qZjwHkvP0lXHiE4zdbWyUXSVfmVGzougDD6N37bat3E/edit) 21 | * [version 2](https://docs.google.com/document/d/1T3MdwUvqCL3jrh3d3VCXQ8xE0UqRzI3bfgpfBq3ZWG0/edit) 22 | 23 | First, we will run our ingestion pipeline with version 1 24 | and then with version 2. 25 | 26 | ## Q1. Running Mage 27 | 28 | Clone the same repo we used in the module and run mage: 29 | 30 | 31 | ```bash 32 | git clone https://github.com/mage-ai/rag-project 33 | ``` 34 | 35 | Add the following libraries to the requirements document: 36 | 37 | ``` 38 | python-docx 39 | elasticsearch 40 | ``` 41 | 42 | Make sure you use the latest version of mage: 43 | 44 | ```bash 45 | docker pull mageai/mageai:llm 46 | ``` 47 | 48 | Start it: 49 | 50 | ```bash 51 | ./scripts/start.sh 52 | ``` 53 | 54 | Now mage is running on [http://localhost:6789/](http://localhost:6789/) 55 | 56 | What's the version of mage? 57 | 58 | ## Creating a RAG pipeline 59 | 60 | Create a RAG pipeline 61 | 62 | 63 | ## Q2. Reading the documents 64 | 65 | Now we can ingest the documents. Create a custom code ingestion 66 | block 67 | 68 | Let's read the documents. We will use the same code we used 69 | for parsing FAQ: [parse-faq-llm.ipynb](parse-faq-llm.ipynb) 70 | 71 | 72 | Use the following document_id: 1qZjwHkvP0lXHiE4zdbWyUXSVfmVGzougDD6N37bat3E 73 | 74 | Which is the document ID of 75 | [LLM FAQ version 1](https://docs.google.com/document/d/1qZjwHkvP0lXHiE4zdbWyUXSVfmVGzougDD6N37bat3E/edit) 76 | 77 | Copy the code to the editor 78 | How many FAQ documents we processed? 79 | 80 | * 1 81 | * 2 82 | * 3 83 | * 4 84 | 85 | ## Q3. Chunking 86 | 87 | We don't really need to do any chuncking because our documents 88 | already have well-specified boundaries. So we just need 89 | to return the documents without any changes. 90 | 91 | So let's go to the transformation part and add a custom code 92 | chunking block: 93 | 94 | ```python 95 | documents = [] 96 | 97 | for doc in data['documents']: 98 | doc['course'] = data['course'] 99 | # previously we used just "id" for document ID 100 | doc['document_id'] = generate_document_id(doc) 101 | documents.append(doc) 102 | 103 | print(len(documents)) 104 | 105 | return documents 106 | ``` 107 | 108 | 109 | Where `data` is the input parameter to the transformer. 110 | 111 | And the `generate_document_id` is defined in the same way 112 | as in module 4: 113 | 114 | ```python 115 | import hashlib 116 | 117 | def generate_document_id(doc): 118 | combined = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}" 119 | hash_object = hashlib.md5(combined.encode()) 120 | hash_hex = hash_object.hexdigest() 121 | document_id = hash_hex[:8] 122 | return document_id 123 | ``` 124 | 125 | Note: if instead of a single dictionary you get a list, 126 | add a for loop: 127 | 128 | ```python 129 | for course_dict in data: 130 | ... 131 | ``` 132 | 133 | You can check the type of `data` with this code: 134 | 135 | ```python 136 | print(type(data)) 137 | ``` 138 | 139 | How many documents (chunks) do we have in the output? 140 | 141 | * 66 142 | * 76 143 | * 86 144 | * 96 145 | 146 | 147 | 148 | ## Tokenization and embeddings 149 | 150 | We don't need any tokenization, so we skip it. 
151 | 152 | Because currently it's required in mage, we can create 153 | a dummy code block: 154 | 155 | * Create a custom code block 156 | * Don't change it 157 | 158 | Because we will use text search, we also don't need embeddings, 159 | so skip it too. 160 | 161 | If you want to use sentence transformers - the ones from module 162 | 3 - you don't need tokenization, but need embeddings 163 | (you don't need it for this homework) 164 | 165 | 166 | ## Q4. Export 167 | 168 | Now we're ready to index the data with elasticsearch. For that, 169 | we use the Export part of the pipeline 170 | 171 | * Go to the Export part 172 | * Select vector databases -> Elasticsearch 173 | * Open the code for editing 174 | 175 | Because we won't use vector search, but usual text search, we 176 | will need to adjust the code. 177 | 178 | First, let's change the line where we read the index name: 179 | 180 | ```python 181 | index_name = kwargs.get('index_name', 'documents') 182 | ``` 183 | 184 | To `index_name_prefix` - we will parametrize it with the day 185 | and time we run the pipeline 186 | 187 | ```python 188 | from datetime import datetime 189 | 190 | index_name_prefix = kwargs.get('index_name', 'documents') 191 | current_time = datetime.now().strftime("%Y%m%d_%M%S") 192 | index_name = f"{index_name_prefix}_{current_time}" 193 | print("index name:", index_name) 194 | ``` 195 | 196 | 197 | We will need to save the name in a global variable, so it can be accessible in other code blocks 198 | 199 | ```python 200 | from mage_ai.data_preparation.variable_manager import set_global_variable 201 | 202 | set_global_variable('YOUR_PIPELINE_NAME', 'index_name', index_name) 203 | ``` 204 | 205 | Where your pipeline name is the name of the pipeline, e.g. 206 | `transcendent_nexus` (replace the space with underscore `_`) 207 | 208 | 209 | 210 | Replace index settings with the settings we used previously: 211 | 212 | ```python 213 | index_settings = { 214 | "settings": { 215 | "number_of_shards": number_of_shards, 216 | "number_of_replicas": number_of_replicas 217 | }, 218 | "mappings": { 219 | "properties": { 220 | "text": {"type": "text"}, 221 | "section": {"type": "text"}, 222 | "question": {"type": "text"}, 223 | "course": {"type": "keyword"}, 224 | "document_id": {"type": "keyword"} 225 | } 226 | } 227 | } 228 | ``` 229 | 230 | Remove the embeddings line: 231 | 232 | ```python 233 | if isinstance(document[vector_column_name], np.ndarray): 234 | document[vector_column_name] = document[vector_column_name].tolist() 235 | ``` 236 | 237 | At the end (outside of the indexing for loop), print the last document: 238 | 239 | ```python 240 | print(document) 241 | ``` 242 | 243 | Now execute the block. 244 | 245 | What's the last document id? 246 | 247 | Also note the index name. 248 | 249 | 250 | ## Q5. Testing the retrieval 251 | 252 | Now let's test the retrieval. Use mage or jupyter notebook to 253 | test it. 254 | 255 | Let's use the following query: "When is the next cohort?" 256 | 257 | What's the ID of the top matching result? 258 | 259 | 260 | ## Q6. Reindexing 261 | 262 | Our FAQ document changes: every day course participants add 263 | new records or improve existing ones. 264 | 265 | Imagine some time passed and the document changed. For that we have another version of the FAQ document: [version 2](https://docs.google.com/document/d/1T3MdwUvqCL3jrh3d3VCXQ8xE0UqRzI3bfgpfBq3ZWG0/edit). 266 | 267 | The ID of this document is `1T3MdwUvqCL3jrh3d3VCXQ8xE0UqRzI3bfgpfBq3ZWG0`. 
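If you reuse the same reader block as in Q2, switching to version 2 only means swapping the document ID in the `faq_documents` mapping (a sketch, following the structure used in [parse-faq-llm.ipynb](parse-faq-llm.ipynb); everything else in the ingestion block stays unchanged):

```python
faq_documents = {
    # version 2 of the LLM Zoomcamp FAQ document
    'llm-zoomcamp': '1T3MdwUvqCL3jrh3d3VCXQ8xE0UqRzI3bfgpfBq3ZWG0',
}
```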
268 | 269 | Let's re-execute the entire pipeline with the updated data. 270 | 271 | For the same query "When is the next cohort?". What's the ID of the top matching result? 272 | 273 | 274 | 275 | ## Submit the results 276 | 277 | * Submit your results here: https://courses.datatalks.club/llm-zoomcamp-2024/homework/hw5 278 | * It's possible that your answers won't match exactly. If it's the case, select the closest one. 279 | -------------------------------------------------------------------------------- /cohorts/2024/05-orchestration/parse-faq-llm.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "4cd1eaa8-3424-41ad-9cf2-3e8548712865", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import io\n", 11 | "\n", 12 | "import requests\n", 13 | "import docx" 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": 2, 19 | "id": "8180e7e4-b90d-4900-a59b-d22e5d6537c4", 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "def clean_line(line):\n", 24 | " line = line.strip()\n", 25 | " line = line.strip('\\uFEFF')\n", 26 | " return line\n", 27 | "\n", 28 | "def read_faq(file_id):\n", 29 | " url = f'https://docs.google.com/document/d/{file_id}/export?format=docx'\n", 30 | " \n", 31 | " response = requests.get(url)\n", 32 | " response.raise_for_status()\n", 33 | " \n", 34 | " with io.BytesIO(response.content) as f_in:\n", 35 | " doc = docx.Document(f_in)\n", 36 | "\n", 37 | " questions = []\n", 38 | "\n", 39 | " question_heading_style = 'heading 2'\n", 40 | " section_heading_style = 'heading 1'\n", 41 | " \n", 42 | " heading_id = ''\n", 43 | " section_title = ''\n", 44 | " question_title = ''\n", 45 | " answer_text_so_far = ''\n", 46 | " \n", 47 | " for p in doc.paragraphs:\n", 48 | " style = p.style.name.lower()\n", 49 | " p_text = clean_line(p.text)\n", 50 | " \n", 51 | " if len(p_text) == 0:\n", 52 | " continue\n", 53 | " \n", 54 | " if style == section_heading_style:\n", 55 | " section_title = p_text\n", 56 | " continue\n", 57 | " \n", 58 | " if style == question_heading_style:\n", 59 | " answer_text_so_far = answer_text_so_far.strip()\n", 60 | " if answer_text_so_far != '' and section_title != '' and question_title != '':\n", 61 | " questions.append({\n", 62 | " 'text': answer_text_so_far,\n", 63 | " 'section': section_title,\n", 64 | " 'question': question_title,\n", 65 | " })\n", 66 | " answer_text_so_far = ''\n", 67 | " \n", 68 | " question_title = p_text\n", 69 | " continue\n", 70 | " \n", 71 | " answer_text_so_far += '\\n' + p_text\n", 72 | " \n", 73 | " answer_text_so_far = answer_text_so_far.strip()\n", 74 | " if answer_text_so_far != '' and section_title != '' and question_title != '':\n", 75 | " questions.append({\n", 76 | " 'text': answer_text_so_far,\n", 77 | " 'section': section_title,\n", 78 | " 'question': question_title,\n", 79 | " })\n", 80 | "\n", 81 | " return questions" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 3, 87 | "id": "7d3c2dd7-f64a-4dc7-a4e3-3e8aadfa720f", 88 | "metadata": {}, 89 | "outputs": [], 90 | "source": [ 91 | "faq_documents = {\n", 92 | " 'llm-zoomcamp': '1m2KexowAXTmexfC5rVTCSnaShvdUQ8Ag2IEiwBDHxN0',\n", 93 | "}" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 4, 99 | "id": "f94efe26-05e8-4ae5-a0fa-0a8e16852816", 100 | "metadata": {}, 101 | "outputs": [ 102 | { 103 | "name": "stdout", 104 | "output_type": "stream", 105 | "text": [ 106 | 
"llm-zoomcamp\n" 107 | ] 108 | } 109 | ], 110 | "source": [ 111 | "documents = []\n", 112 | "\n", 113 | "for course, file_id in faq_documents.items():\n", 114 | " print(course)\n", 115 | " course_documents = read_faq(file_id)\n", 116 | " documents.append({'course': course, 'documents': course_documents})" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": null, 122 | "id": "1b21af5c-2f6d-49e7-92e9-ca229e2473b9", 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [] 126 | } 127 | ], 128 | "metadata": { 129 | "kernelspec": { 130 | "display_name": "Python 3 (ipykernel)", 131 | "language": "python", 132 | "name": "python3" 133 | }, 134 | "language_info": { 135 | "codemirror_mode": { 136 | "name": "ipython", 137 | "version": 3 138 | }, 139 | "file_extension": ".py", 140 | "mimetype": "text/x-python", 141 | "name": "python", 142 | "nbconvert_exporter": "python", 143 | "pygments_lexer": "ipython3", 144 | "version": "3.12.3" 145 | } 146 | }, 147 | "nbformat": 4, 148 | "nbformat_minor": 5 149 | } 150 | -------------------------------------------------------------------------------- /cohorts/2024/README.md: -------------------------------------------------------------------------------- 1 | # LLM Zoomcamp 2024 Edition 2 | 3 | 4 | Important links: 5 | 6 | - [Pre-Course Q&A Stream](https://www.youtube.com/watch?v=YuxVHZ88hfg&list=PL3MmuxUbc_hKiIVNf7DeEt_tGjypOYtKV) 7 | - [Course Launch Stream](https://www.youtube.com/watch?v=ifpqpB1ksGc) 8 | - [2024 Edition Youtube Playlist](https://www.youtube.com/playlist?list=PL3MmuxUbc_hKiIVNf7DeEt_tGjypOYtKV) 9 | - [Course management platform](https://courses.datatalks.club/llm-zoomcamp-2024/) 10 | 11 | [**LLM Zoomcamp 2024 Competition**](https://github.com/DataTalksClub/llm-zoomcamp/tree/main/cohorts/2024/competition#llm-zoomcamp-2024-competition) 12 | 13 | * Deadline: September 30, 2024 14 | * Practice and receive additional points for the course! 15 | 16 | [**Module 1: Introduction to LLMs and RAG**](01-intro) 17 | 18 | * [Homework](01-intro/homework.md) 19 | * [Solution](01-intro/homework_solution.ipynb) 20 | 21 | [**Module 2: Open-source LLMs and self-hosting LLMs**](02-open-source) 22 | 23 | * [Homework and solution](02-open-source/homework.md) 24 | 25 | [**Module 3: Vector databases**](03-vector-search) 26 | 27 | * [Homework and solution](03-vector-search/homework.md) 28 | 29 | [**Module 4: Evaluation and monitoring**](04-monitoring) 30 | 31 | * [Module explainer](https://www.loom.com/share/1dd375ec4b0d458fabdfc2b841089031) 32 | * [Homework](04-monitoring/homework.md) 33 | 34 | **Office hours** 35 | 36 | * [Office hours 1](https://www.youtube.com/watch?v=q4Mb4SN-doo) 37 | * [Office hours 2](https://www.youtube.com/watch?v=lre6h7vqz7A) 38 | * [Office hours 3 and 4](https://www.loom.com/share/b57b995f79364da68da1d1826a766794) 39 | -------------------------------------------------------------------------------- /cohorts/2024/competition/README.md: -------------------------------------------------------------------------------- 1 | # LLM Zoomcamp 2024 Competition 2 | 3 | In the competition, you need to use LLMs to solve high school mathematics problems. 4 | Your task is to develop models that can accurately solve these problems and submit your predictions. 5 | 6 | For more details, visit the [competition page](https://www.kaggle.com/competitions/llm-zoomcamp-2024-competition/overview). 
7 | 8 | ## Getting started 9 | 10 | Getting started code: [starter_notebook.ipynb](starter_notebook.ipynb) 11 | 12 | You will need to install langchain for this code: 13 | 14 | ```bash 15 | pip install -qU langchain-openai langchain 16 | ``` 17 | 18 | Thanks [Blaq](https://www.linkedin.com/in/chinonsoodiaka/) for contributing the notebook! 19 | 20 | Note that the solution is not deterministic: when you run 21 | it again on the same record, sometimes the solution is different 22 | from the initial run. 23 | 24 | Which might be more a feature than a bug: 25 | what if you run it multiple times and get the majority vote? 26 | 27 | ## Community solutions 28 | 29 | - Pastor Solo - getting started notebook on Kaggle - https://www.kaggle.com/code/pastorsoto/starter-notebook 30 | - Slava Shen - chain of thoughts - https://www.kaggle.com/code/vyacheslavshen/double-check-with-llms 31 | 32 | ## Evaluation 33 | 34 | We use accuracy as the evaluation metric. Sometimes multiple 35 | answers are correct. In this case, a solution is correct if 36 | it matches at least one of the possible answers. 37 | 38 | You can find the code for evaluation in [scorer.py](scorer.py) - it's taken [from kaggle](https://www.kaggle.com/code/dremovd/accuracy-multiple-correct?scriptVersionId=158029538) 39 | and this is exactly the code we use in the competition. 40 | -------------------------------------------------------------------------------- /cohorts/2024/competition/scorer.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import numpy as np 3 | 4 | 5 | class ParticipantVisibleError(Exception): 6 | # If you want an error message to be shown to participants, you must raise the error as a ParticipantVisibleError 7 | # All other errors will only be shown to the competition host. This helps prevent unintentional leakage of solution data. 8 | pass 9 | 10 | 11 | def score(solution: pd.DataFrame, submission: pd.DataFrame, row_id_column_name: str) -> float: 12 | ''' 13 | Accuracy that works with multiple correct answers. 
14 | ''' 15 | solution = solution.set_index(row_id_column_name, drop=True) 16 | submission = submission.set_index(row_id_column_name, drop=True) 17 | submission = submission.loc[solution.index] 18 | 19 | target_column = 'answer' 20 | assert target_column in solution.columns 21 | assert target_column in submission.columns 22 | 23 | # This fix is needed because submission is loaded with default parameters 24 | # Pandas magically converts string column into float 25 | def fix_suffix(value): 26 | if value.endswith('.0'): 27 | return value[:-2] 28 | else: 29 | return value 30 | 31 | submission[target_column] = submission[target_column].astype(str) 32 | submission[target_column] = submission[target_column].apply(fix_suffix) 33 | 34 | 35 | def convert_to_list(value): 36 | values = [v.strip() for v in value.strip().lstrip('[').rstrip(']').split(',')] 37 | return values 38 | 39 | solution[target_column] = solution[target_column].astype(str).apply(convert_to_list) 40 | 41 | correct = [ 42 | submit_answer in correct_answer 43 | for correct_answer, submit_answer in zip( 44 | solution[target_column].values, 45 | submission[target_column].values 46 | ) 47 | ] 48 | 49 | return np.mean(correct) -------------------------------------------------------------------------------- /cohorts/2024/competition/starter_notebook_submission.csv: -------------------------------------------------------------------------------- 1 | problem_id,answer 2 | 11919,12.0 3 | 8513,285.0 4 | 7887,4.0 5 | 5272,6.0 6 | 8295,13.0 7 | 3219,15.0 8 | 7235,55.0 9 | 3688,21.0 10 | 6116,2412.0 11 | 4720,34.0 12 | 12122,40.8 13 | 4311,4.0 14 | 8283,7.0 15 | 8347,2.0 16 | 4170,220.0 17 | 4,21.0 18 | 8307,150.0 19 | 7108,544.0 20 | 7775,800.0 21 | 7680,216.0 22 | 8780,1680.0 23 | 8707,2.25 24 | 7863,5.0 25 | 12036,66.0 26 | 4569,3.0 27 | 8134,21.0 28 | 160,3.0 29 | 2869,70.0 30 | 3309,10.0 31 | 6941,0.462 32 | 4880,6.0 33 | 5126,0.5 34 | 7059,0.0 35 | 10649,45.0 36 | 10357,1.0 37 | 2445,10.0 38 | 5755,27.9 39 | 4849,0.0 40 | 1216,46.0 41 | 7787,77.0 42 | 2405,3.0 43 | 3041,6.0 44 | 10411,222240.0 45 | 7081,328.0 46 | 6018,4.0 47 | 8948,80.0 48 | 4658,54.0 49 | 11679,64.0 50 | 5390,99.2 51 | 5319,-4.0 52 | 6907,2.5 53 | 9695,5.0 54 | 12092,100.0 55 | 8458,20.0 56 | 7136,31.0 57 | 11922,8.4 58 | 6101,1350.0 59 | 245,63.0 60 | 7880,10.0 61 | 12166,4.0 62 | 8116,12.0 63 | 13554,520.0 64 | 4249,66.0 65 | 12184,3.0 66 | 5232,-4.0 67 | 9499,40980.0 68 | 4908,24.0 69 | 7452,5.0 70 | 13512,10.75 71 | 9253,6000.0 72 | 12338,4.0 73 | 4478,3.0 74 | 5170,-5.0 75 | 3122,515.0 76 | 10457,15.0 77 | 2370,10.0 78 | 8670,6.25 79 | 9446,40.0 80 | 8083,1.0 81 | 5321,-6.0 82 | 9388,0.0 83 | 8443,8.0 84 | 8138,600.0 85 | 12084,23.0 86 | 3305,11.0 87 | 4810,23.0 88 | 25,0.361 89 | 217,10.0 90 | 1421,14563.0 91 | 6914,1.0 92 | 4724,23.0 93 | 10460,26.0 94 | 1324,40.0 95 | 5268,3.0 96 | 5164,6.0 97 | 3519,650.0 98 | 7934,12.0 99 | 9390,0.0 100 | 7137,22.0 101 | 5914,300.0 102 | -------------------------------------------------------------------------------- /cohorts/2024/project.md: -------------------------------------------------------------------------------- 1 | ## Course Project 2 | 3 | The goal of this project is to apply everything we learned 4 | in this course and build an end-to-end RAG project. 5 | 6 | Remember that to pass the project, you must evaluate 3 peers. If you don't do that, your project can't be considered compelete. 
7 | 8 | 9 | ## Submitting 10 | 11 | ### Project Attempt #1 12 | 13 | * Project: https://courses.datatalks.club/llm-zoomcamp-2024/project/project1 14 | * Review: https://courses.datatalks.club/llm-zoomcamp-2024/project/project1/eval 15 | 16 | 17 | ### Project Attempt #2 18 | 19 | * Project: https://courses.datatalks.club/llm-zoomcamp-2024/project/project2 20 | * Review: https://courses.datatalks.club/llm-zoomcamp-2024/project/project2/eval 21 | 22 | 23 | > **Important**: update your "Certificate name" here: https://courses.datatalks.club/llm-zoomcamp-2024/enrollment - 24 | this is what we will use when generating certificates for you. 25 | 26 | 27 | ## Evaluation criteria 28 | 29 | See [here](../../project.md#evaluation-criteria) -------------------------------------------------------------------------------- /cohorts/2024/workshops/dlt.md: -------------------------------------------------------------------------------- 1 | # Open source data ingestion for RAGs with dlt 2 | 3 | Video: https://www.youtube.com/watch?v=qUNyfR_X2Mo 4 | 5 | Homework solution: https://drive.google.com/file/d/1M1dKtAO-v3oYIztqMS8fXlLsBnqcmA-O/view?usp=sharing 6 | 7 | In this hands-on workshop, we’ll learn how to build a data ingestion pipeline using dlt to load data from a REST API into LanceDB so you can have an always up to date RAG. 8 | 9 | ​We’ll cover the following steps: 10 | 11 | * Extract data from REST APIs 12 | * Loading and vectorizing into LanceDB, which unlike other vector DBs stores the data _and_ the embeddings 13 | * Incremental loading 14 | 15 | ​By the end of this workshop, you’ll be able to write a portable, OSS data pipeline for your RAG that you can deploy anywhere, such as python notebooks, virtual machines, or orchestrators like Airflow, Dagster or Mage. 16 | 17 | 18 | # Resources 19 | 20 | * Slides: [dlt-LLM-Zoomcamp.pdf](https://github.com/user-attachments/files/16131729/dlt.LLM.Zoomcamp.pdf) 21 | * [Google Colab notebook](https://colab.research.google.com/drive/1nNOybHdWQiwUUuJFZu__xvJxL_ADU3xl?usp=sharing) - make a copy to follow along! 22 | 23 | --- 24 | 25 | # Homework 26 | 27 | In the workshop, we extracted contents from two pages in notion titled "Workshop: Benefits and Perks" and "Workshop: Working hours, PTO, and Vacation". 28 | 29 | Repeat the same process for a third page titled "Homework: Employee handbook" (hidden from public view, but accessible via API key): 30 | 31 | 1. Modify the REST API source to extract only this page. 32 | 2. Write the output into a separate table called "homework". 33 | 3. Remember to update the table name in all cells where you connect to a lancedb table. 34 | 35 | To do this you can use the [workshop Colab](https://colab.research.google.com/drive/1nNOybHdWQiwUUuJFZu__xvJxL_ADU3xl?usp=sharing) as a basis. 36 | 37 | Now, answer the following questions: 38 | 39 | ## Q1. Rows in LanceDB 40 | 41 | How many rows does the lancedb table "notion_pages__homework" have? 42 | 43 | * 14 44 | * 15 45 | * 16 46 | * 17 47 | 48 | ## Q2. Running the Pipeline: Last edited time 49 | 50 | In the demo, we created an incremental dlt resource `rest_api_notion_incremental` that keeps track of `last_edited_time`. What value does it store after you've run your pipeline once? 
(Hint: you will be able to get this value by performing some aggregation function on the column `last_edited_time` of the table) 51 | 52 | * `Timestamp('2024-07-05 22:34:00+0000', tz='UTC') (OR "2024-07-05T22:34:00.000Z")` 53 | * `Timestamp('2024-07-05 23:33:00+0000', tz='UTC') (OR "2024-07-05T23:33:00.000Z")` 54 | * `Timestamp('2024-07-05 23:52:00+0000', tz='UTC') (OR "2024-07-05T23:52:00.000Z")` 55 | * `Timestamp('2024-07-05 22:56:00+0000', tz='UTC') (OR "2024-07-05T22:56:00.000Z")` 56 | 57 | 58 | ## Q3. Ask the Assistant 59 | 60 | Find out with the help of the AI assistant: how many PTO days are the employees entitled to in a year? 61 | 62 | * 20 63 | * 25 64 | * 30 65 | * 35 66 | 67 | ## Submit the results 68 | 69 | * Submit your results here: https://courses.datatalks.club/llm-zoomcamp-2024/homework/workshop1 70 | * It's possible that your answers won't match exactly. If it's the case, select the closest one. 71 | -------------------------------------------------------------------------------- /cohorts/2025/01-intro/homework.md: -------------------------------------------------------------------------------- 1 | ## Homework: Introduction 2 | 3 | In this homework, we'll learn more about search and use Elastic Search for practice. 4 | 5 | > It's possible that your answers won't match exactly. If it's the case, select the closest one. 6 | 7 | 8 | ## Q1. Running Elastic 9 | 10 | Run Elastic Search 8.17.6, and get the cluster information. If you run it on localhost, this is how you do it: 11 | 12 | ```bash 13 | curl localhost:9200 14 | ``` 15 | 16 | What's the `version.build_hash` value? 17 | 18 | 19 | ## Getting the data 20 | 21 | Now let's get the FAQ data. You can run this snippet: 22 | 23 | ```python 24 | import requests 25 | 26 | docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1' 27 | docs_response = requests.get(docs_url) 28 | documents_raw = docs_response.json() 29 | 30 | documents = [] 31 | 32 | for course in documents_raw: 33 | course_name = course['course'] 34 | 35 | for doc in course['documents']: 36 | doc['course'] = course_name 37 | documents.append(doc) 38 | ``` 39 | 40 | Note that you need to have the `requests` library: 41 | 42 | ```bash 43 | pip install requests 44 | ``` 45 | 46 | ## Q2. Indexing the data 47 | 48 | Index the data in the same way as was shown in the course videos. Make the `course` field a keyword and the rest should be text. 49 | 50 | Don't forget to install the ElasticSearch client for Python: 51 | 52 | ```bash 53 | pip install elasticsearch 54 | ``` 55 | 56 | Which function do you use for adding your data to elastic? 57 | 58 | * `insert` 59 | * `index` 60 | * `put` 61 | * `add` 62 | 63 | ## Q3. Searching 64 | 65 | Now let's search in our index. 66 | 67 | We will execute a query "How do execute a command on a Kubernetes pod?". 68 | 69 | Use only `question` and `text` fields and give `question` a boost of 4, and use `"type": "best_fields"`. 70 | 71 | What's the score for the top ranking result? 72 | 73 | * 84.50 74 | * 64.50 75 | * 44.50 76 | * 24.50 77 | 78 | Look at the `_score` field. 79 | 80 | ## Q4. Filtering 81 | 82 | Now ask a different question: "How do copy a file to a Docker container?". 83 | 84 | This time we are only interested in questions from `machine-learning-zoomcamp`. 85 | 86 | Return 3 results. What's the 3rd question returned by the search engine? 87 | 88 | * How do I debug a docker container? 89 | * How do I copy files from a different folder into docker container’s working directory? 
90 | * How do Lambda container images work? 91 | * How can I annotate a graph? 92 | 93 | ## Q5. Building a prompt 94 | 95 | Now we're ready to build a prompt to send to an LLM. 96 | 97 | Take the records returned from Elasticsearch in Q4 and use this template to build the context. Separate context entries by two linebreaks (`\n\n`) 98 | ```python 99 | context_template = """ 100 | Q: {question} 101 | A: {text} 102 | """.strip() 103 | ``` 104 | 105 | Now use the context you just created along with the "How do I execute a command in a running docker container?" question 106 | to construct a prompt using the template below: 107 | 108 | ``` 109 | prompt_template = """ 110 | You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database. 111 | Use only the facts from the CONTEXT when answering the QUESTION. 112 | 113 | QUESTION: {question} 114 | 115 | CONTEXT: 116 | {context} 117 | """.strip() 118 | ``` 119 | 120 | What's the length of the resulting prompt? (use the `len` function) 121 | 122 | * 946 123 | * 1446 124 | * 1946 125 | * 2446 126 | 127 | ## Q6. Tokens 128 | 129 | When we use the OpenAI Platform, we're charged by the number of 130 | tokens we send in our prompt and receive in the response. 131 | 132 | The OpenAI python package uses `tiktoken` for tokenization: 133 | 134 | ```bash 135 | pip install tiktoken 136 | ``` 137 | 138 | Let's calculate the number of tokens in our query: 139 | 140 | ```python 141 | encoding = tiktoken.encoding_for_model("gpt-4o") 142 | ``` 143 | 144 | Use the `encode` function. How many tokens does our prompt have? 145 | 146 | * 120 147 | * 220 148 | * 320 149 | * 420 150 | 151 | Note: to decode back a token into a word, you can use the `decode_single_token_bytes` function: 152 | 153 | ```python 154 | encoding.decode_single_token_bytes(63842) 155 | ``` 156 | 157 | ## Bonus: generating the answer (ungraded) 158 | 159 | Let's send the prompt to OpenAI. What's the response? 160 | 161 | Note: you can replace OpenAI with Ollama. See module 2. 162 | 163 | ## Bonus: calculating the costs (ungraded) 164 | 165 | Suppose that on average per request we send 150 tokens and receive back 250 tokens. 166 | 167 | How much will it cost to run 1000 requests? 168 | 169 | You can see the prices [here](https://openai.com/api/pricing/) 170 | 171 | On June 17, the prices for gpt4o are: 172 | 173 | * Input: $0.005 / 1K tokens 174 | * Output: $0.015 / 1K tokens 175 | 176 | You can redo the calculations with the values you got in Q6 and Q7. 177 | 178 | 179 | ## Submit the results 180 | 181 | * Submit your results here: https://courses.datatalks.club/llm-zoomcamp-2025/homework/hw1 182 | * It's possible that your answers won't match exactly. If it's the case, select the closest one. 
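For the ungraded cost bonus above, here is a quick back-of-the-envelope check with the June 17 gpt-4o prices quoted there (substitute your own token counts from Q6 and the generated answer):

```python
input_price_per_token = 0.005 / 1000   # $0.005 per 1K input tokens
output_price_per_token = 0.015 / 1000  # $0.015 per 1K output tokens

n_requests = 1000
avg_input_tokens = 150   # assumed average tokens sent per request
avg_output_tokens = 250  # assumed average tokens received per request

total_cost = n_requests * (
    avg_input_tokens * input_price_per_token
    + avg_output_tokens * output_price_per_token
)
print(f"${total_cost:.2f}")  # $4.50
```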
183 | -------------------------------------------------------------------------------- /cohorts/2025/README.md: -------------------------------------------------------------------------------- 1 | # LLM Zoomcamp 2025 Edition 2 | 3 | Important links: 4 | 5 | - Pre-Course Q&A Stream: [video](https://www.youtube.com/live/8lgiOLMMKcY), [summary](/cohorts/2025/pre-course-q-a-stream-summary.md) 6 | - Course Launch Stream: [video](https://www.youtube.com/live/FgnelhEJFj0), [summary](/cohorts/2025/course-launch-stream-summary.md) 7 | - [2025 Edition Youtube Playlist](https://www.youtube.com/playlist?list=PL3MmuxUbc_hIoBpuc900htYF4uhEAbaT-) 8 | - [Course management platform](https://courses.datatalks.club/llm-zoomcamp-2025/) 9 | 10 | 11 | [**Module 1: Introduction to LLMs and RAG**](01-intro) 12 | 13 | * [Homework](01-intro/homework.md) 14 | 15 | 16 | -------------------------------------------------------------------------------- /cohorts/2025/course-launch-stream-summary.md: -------------------------------------------------------------------------------- 1 | # Key Takeaways from the LLM Zoomcamp 2025 Launch Stream 2 | 3 | [![Watch the Launch Stream](https://img.youtube.com/vi/FgnelhEJFj0/maxresdefault.jpg)](https://youtu.be/FgnelhEJFj0) 4 | 5 | > **[Watch the LLM Zoomcamp 2025 Launch Stream](https://youtu.be/FgnelhEJFj0)** 6 | 7 | --- 8 | 9 | ## Table of Contents 10 | 11 | - [Quick Summary](#quick-summary) 12 | - [Meet the Team](#meet-the-2025-team) 13 | - [Prerequisites](#prerequisites) 14 | - [Course Architecture](#course-architecture) 15 | - [Module Breakdown](#module-breakdown) 16 | - [Timeline & Deadlines](#timeline--deadlines) 17 | - [Getting Help](#getting-help) 18 | - [Earning Points](#earning-points) 19 | - [Communication Channels](#communication-channels) 20 | - [Sponsors & Support](#sponsors--support) 21 | - [FAQ Highlights](#faq-highlights) 22 | - [Next Steps](#next-steps) 23 | 24 | --- 25 | 26 | ## Quick Summary 27 | 28 |
29 | Key Takeaways (Click to expand) 30 | 31 | * **Central project**: Build a production-ready RAG chatbot 32 | * **Skills required**: Python, CLI, Git, Docker—nothing more 33 | * **Support workflow**: FAQ → Slack search → bot → channel (no tags) 34 | * **Earn points** by contributing to FAQ and sharing progress publicly 35 | * **No GPUs needed** for the main camp; open-source spin-off covers that 36 | * **Capstone & peer review** are mandatory for certificate—start gathering data now 37 | * **Budget**: ~$10 OpenAI credit covers the entire course 38 | 39 |
40 | 41 | ## Meet the 2025 Team 42 | 43 | | Instructor | Role | Expertise | 44 | |------------|------|-----------| 45 | | **[Alexey Grigorev](https://github.com/alexeygrigorev)** | Host & General Guidance | Course Creator, ML Engineering | 46 | | **[Kacper Łukowski](https://github.com/kacperlukawski)** | Vector Search Module | Qdrant Expert | 47 | | **Timur S** | Best Practices Module | Production ML Systems | 48 | | **Phoenix Expert** | Monitoring Module | ML Observability (name TBA) | 49 | 50 | ## Prerequisites 51 | 52 | ### Required Skills 53 | - **Python basics** (one-day refresher is sufficient) 54 | - **Command-line & Git** confidence 55 | - **Docker** installed and working 56 | 57 | ### Need a Refresher? 58 | - **Python**: Any weekend tutorial will do 59 | - **Docker**: Check out the [Data Engineering Zoomcamp Docker lesson](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/01-docker-terraform) 60 | - **Git**: [GitHub's Git Handbook](https://guides.github.com/introduction/git-handbook/) 61 | 62 | ## Course Architecture 63 | 64 | The entire Zoomcamp revolves around building a **production-grade RAG (Retrieval-Augmented Generation) chatbot** that answers student questions from the course FAQ. 65 | 66 | ### What You'll Build: 67 | 1. **Document Ingestion** → Text + vector indexing 68 | 2. **Smart Retrieval** → Find most relevant snippets 69 | 3. **LLM Integration** → Generate context-aware answers 70 | 4. **Production Monitoring** → Track quality, latency, and costs 71 | 72 | ## Module Breakdown 73 | 74 | | Week | Module | Focus | 75 | |------|--------|-------| 76 | | 1 | **Intro & RAG Foundations** | Core concepts, basic implementation | 77 | | 2 | **Vector vs Text Search** | Hybrid search strategies | 78 | | 3 | **Evaluation Techniques** | Measuring RAG performance | 79 | | 4 | **Monitoring with Phoenix** | Production observability | 80 | | 5 | **Best Practices & Guardrails** | Security, reliability | 81 | | 6-7 | **Capstone Project** | Build your own RAG system | 82 | | 8 | **Peer Review** | Evaluate classmates' projects | 83 | ## Timeline & Deadlines 84 | 85 | ### Content Delivery 86 | - **Pre-recorded videos** in organized playlists: 87 | - **[Main 2025 Playlist](https://youtube.com/playlist?list=PL3MmuxUbc_hIoBpuc900htYF4uhEAbaT-&si=uwC6I0wFePjdmLdp)** 88 | - **[Legacy 2024 Content](https://youtube.com/playlist?list=PL3MmuxUbc_hKiIVNf7DeEt_tGjypOYtKV&si=l7lTHbVAUHks2AMP)** 89 | 90 | ### Homework & Submissions 91 | - **Platform**: [DataTalks.Club Course Platform](https://courses.datatalks.club/llm-zoomcamp-2025/) 92 | - **Scoring**: Points-based system with public leaderboard 93 | - **Deadline**: 3 weeks after each module launch 94 | 95 | ### Certificate Requirements 96 | - Complete all homework assignments 97 | - Build and submit capstone project (2 weeks) 98 | - Complete peer review process (1 week) 99 | - Minimum point threshold (TBA) 100 | 101 | ## Getting Help 102 | 103 | ### Support Workflow (Follow This Order!) 104 | 105 | 1. **Search the [FAQ Document](https://docs.google.com/document/d/1m2KexowAXTmexfC5rVTCSnaShvdUQ8Ag2IEiwBDHxN0/edit?usp=sharing)** first 106 | 2. **Search [Slack](https://datatalks.club/slack.html) history** for similar questions 107 | 3. **Ask the Slack bot** (uses the same RAG pipeline you're building!) 108 | 4. 
**Post in `#course-llm-zoomcamp`** (**NEVER tag instructors directly**) 109 | 110 | ### How to Ask Good Questions 111 | - Include error messages and code snippets 112 | - Mention what you've already tried 113 | - Use thread replies to keep channels organized 114 | 115 | ## Earning Points 116 | 117 | ### Ways to Boost Your Score: 118 | - **Contribute to FAQ**: Add solved issues and solutions 119 | - **Share publicly**: Post progress on LinkedIn/Twitter with **#LLMZoomcamp** 120 | - **Limit**: Up to 7 social media links per module count toward score 121 | - **Quality over quantity**: Thoughtful posts get more engagement 122 | 123 | ### Content Ideas: 124 | - Weekly progress updates 125 | - Code snippets and explanations 126 | - Challenges you overcame 127 | - Creative applications of course concepts 128 | 129 | ## Communication Channels 130 | 131 | | Channel | Purpose | Link | 132 | |---------|---------|------| 133 | | **Telegram** | Announcements only | [Join Channel](https://t.me/llm_zoomcamp) | 134 | | **Slack** | Questions & peer help | [Join Workspace](https://datatalks.club/slack.html) | 135 | | **YouTube** | Video content | [Course Playlist](https://youtube.com/playlist?list=PL3MmuxUbc_hIoBpuc900htYF4uhEAbaT-&si=uwC6I0wFePjdmLdp) | 136 | | **GitHub** | Course materials | [Main Repository](https://github.com/DataTalksClub/llm-zoomcamp) | 137 | 138 | ### Communication Etiquette: 139 | - Always reply in **threads** to keep channels clean 140 | - Search before asking 141 | - Be respectful and helpful to peers 142 | - Don't tag instructors directly 143 | 144 | ## Sponsors & Support 145 | 146 | ### Course Sponsors 147 | - **[Arize AI](https://github.com/Arize-ai/phoenix)** 148 | - **[dltHub](https://github.com/dlt-hub/dlt)** 149 | - **[Qdrant](https://github.com/qdrant/qdrant)** 150 | 151 | *These sponsors keep the course completely free for everyone!* 152 | 153 | ### Support Alexey 154 | If you find value in this course, consider supporting via [GitHub Sponsors](https://github.com/sponsors/alexeygrigorev). 155 | 156 | ## FAQ Highlights 157 | 158 |
159 | Why no agents in the main course? 160 | 161 | **Current situation**: Agent frameworks evolve "every day," making it risky to lock the course to unstable APIs. 162 | 163 | **Future plans**: 164 | - **AI-Dev Tools Course** planned for **September 2025** 165 | - **2,100+ sign-ups** already confirmed interest 166 | - **Will cover**: Agentic search, LangGraph, CrewAI, agent guardrails 167 | - **Expect**: Workshop teasers during this cohort 168 | 169 | **More info**: [AI Dev Tools Zoomcamp Repository](https://github.com/DataTalksClub/ai-dev-tools-zoomcamp) 170 | 171 |
172 | 173 |
174 | Do I need GPUs? 175 | 176 | **Main LLM Zoomcamp**: 177 | - **No GPUs needed** 178 | - **~$10 OpenAI credit** covers everything 179 | - **Uses hosted models**: OpenAI, Anthropic, Groq, etc. 180 | 181 | **Open-Source LLM Mini-Course** (June/July 2025): 182 | - **Free GPU quotas** provided by Saturn Cloud & AMD 183 | - **Local models**: Llama 3, DeepSeek, etc. 184 | - **Topics**: Quantization, vLLM, LoRA fine-tuning 185 | 186 | **More info**: [Open-Source LLM Zoomcamp](https://github.com/DataTalksClub/open-source-llm-zoomcamp) 187 | 188 |
189 | 190 |
191 | How is LLM monitoring different from traditional MLOps? 192 | 193 | **Shared aspects**: 194 | - Uptime & latency tracking 195 | - Performance regression detection 196 | - Post-deployment monitoring 197 | 198 | **LLM-specific metrics**: 199 | - **Cost per call/token** 200 | - **Hallucination detection** (via eval sets, heuristics, human feedback) 201 | - **Prompt/response drift** (style/length changes indicating model updates) 202 | - **Content risk**: PII leaks, toxicity detection 203 | 204 | **Tools**: Phoenix (open-source) for dashboards and budget alerts 205 | 206 |
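To make the cost metric concrete, here is a minimal sketch (not from the course materials) of estimating per-call cost from token usage; the prices and token counts below are placeholder assumptions, not current provider pricing:

```python
# Illustrative sketch only: estimate the cost of one LLM call from its token usage.
# PRICE_* values are placeholder assumptions -- check your provider's current pricing.
PRICE_PER_1K_INPUT_TOKENS = 0.00015
PRICE_PER_1K_OUTPUT_TOKENS = 0.0006

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated dollar cost of a single call."""
    return (
        input_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
        + output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS
    )

# Example: a RAG answer with a large retrieved-context prompt and a short completion
print(f"${call_cost(input_tokens=2500, output_tokens=300):.6f}")  # $0.000555
```

Logging this value per request, alongside latency and the prompt/response pair, is enough raw data to feed the dashboards and budget alerts mentioned above.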
207 | 208 |
209 | Local vs Hosted Models: Which to choose? 210 | 211 | | Aspect | Hosted API | Local/Self-hosted | 212 | |--------|------------|-------------------| 213 | | **Setup** | Single HTTP call | Download 4-40GB weights | 214 | | **Scaling** | Provider handles it | You manage infrastructure | 215 | | **Cost** | Pay per usage | Higher upfront, lower long-term | 216 | | **Control** | Limited customization | Full control, no data sharing | 217 | | **Models** | Latest frontier models | Open-source alternatives | 218 | | **Customization** | API parameters only | LoRA fine-tuning possible | 219 | 220 | **Course approach**: Prototype with GPT-4o, then compare with local Llama 3 221 | 222 |
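As a rough illustration of the setup difference, here is a minimal sketch that sends the same question through one OpenAI-compatible client to a hosted API and to a local server; the model names and the local Ollama URL are assumptions to adapt to your own environment:

```python
# Illustrative sketch: the same client code can target a hosted API or a local
# OpenAI-compatible server (e.g. Ollama). Model names and the URL are assumptions.
from openai import OpenAI

hosted = OpenAI()  # reads OPENAI_API_KEY from the environment
local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # default Ollama endpoint

question = "In one sentence, what is Retrieval-Augmented Generation?"

for label, client, model in [("hosted", hosted, "gpt-4o-mini"), ("local", local, "llama3")]:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    print(f"{label}: {response.choices[0].message.content}")
```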
223 | 224 |
225 | Job market advice for LLM engineers 226 | 227 | **Show, don't tell**: 228 | - Publish notebooks and blog posts 229 | - Create short demo videos (Loom) 230 | - Deploy working RAG pipelines 231 | 232 | **Specialize wisely**: 233 | - RAG evaluation techniques 234 | - Retrieval optimization 235 | - LLM cost monitoring 236 | - Content safety & guardrails 237 | 238 | **Network via "learning in public"**: 239 | - Use **#LLMZoomCamp** hashtag 240 | - Consistent LinkedIn/Twitter posts 241 | - Engage with course community 242 | 243 | **Portfolio > certificates**: 244 | - Capstone GitHub repo 245 | - Regular social media updates 246 | - Video walkthroughs of projects 247 | 248 |
249 | 250 |
251 | Capstone project guidelines 252 | 253 | **Start early**: 254 | - Gather domain corpus (docs, Slack dumps, PDFs) 255 | - Begin data cleaning and chunking 256 | - Choose a problem you're passionate about 257 | 258 | **Requirements preview**: 259 | - Working RAG demonstration 260 | - Evaluation notebook with metrics 261 | - README with cost analysis 262 | - Monitoring screenshots 263 | - 3-minute video walkthrough 264 | 265 | **Peer review process**: 266 | - 2 weeks for building 267 | - 1 week for reviewing 3 classmates' projects 268 | - Mandatory for certification 269 | 270 | **Detailed rubric**: Will be published mid-cohort on the course platform 271 | 272 |
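For the "evaluation notebook with metrics" requirement above, here is a minimal sketch (the data layout is an assumption, not an official template) of two standard retrieval metrics, hit rate and MRR, the latter also mentioned in the pre-course Q&A:

```python
# Illustrative sketch: hit rate and MRR over retrieval results.
# `relevance` is assumed to hold, per query, one boolean per retrieved document
# indicating whether it is the ground-truth document for that query.
def hit_rate(relevance: list[list[bool]]) -> float:
    return sum(any(row) for row in relevance) / len(relevance)

def mrr(relevance: list[list[bool]]) -> float:
    total = 0.0
    for row in relevance:
        for rank, is_relevant in enumerate(row, start=1):
            if is_relevant:
                total += 1 / rank
                break
    return total / len(relevance)

# Example: 3 queries, top-5 retrieved documents each
relevance = [
    [False, True, False, False, False],   # found at rank 2
    [True, False, False, False, False],   # found at rank 1
    [False, False, False, False, False],  # not found
]
print(hit_rate(relevance), round(mrr(relevance), 2))  # ~0.67, 0.5
```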
273 | 274 | ## Next Steps 275 | 276 | 1. **[Star the GitHub repo](https://github.com/DataTalksClub/llm-zoomcamp)** (helps with visibility!) 277 | 2. **Skim Module 1** content to get familiar 278 | 3. **Install Docker** and verify it works 279 | 4. **Join communication channels**: 280 | - [Slack workspace](https://datatalks.club/slack.html) 281 | - [Telegram channel](https://t.me/llm_zoomcamp) 282 | 5. **Bookmark the [FAQ document](https://docs.google.com/document/d/1m2KexowAXTmexfC5rVTCSnaShvdUQ8Ag2IEiwBDHxN0/edit?usp=sharing)** 283 | 284 | ### Week 1 Goals: 285 | - **Watch Module 1 videos** 286 | - **Complete first homework** 287 | - **Start thinking about capstone dataset** 288 | - **Share your journey** with #LLMZoomCamp 289 | 290 | ### Long-term Success: 291 | - **Consistent engagement** with course materials 292 | - **Active participation** in community discussions 293 | - **Regular progress sharing** on social media 294 | - **Early capstone planning** and data preparation 295 | 296 | ## Quick Links Reference 297 | 298 | | Resource | Link | 299 | |----------|------| 300 | | **Main Repository** | https://github.com/DataTalksClub/llm-zoomcamp | 301 | | **Course Platform** | https://courses.datatalks.club/llm-zoomcamp-2025/ | 302 | | **2025 Playlist** | https://youtube.com/playlist?list=PL3MmuxUbc_hIoBpuc900htYF4uhEAbaT-&si=uwC6I0wFePjdmLdp | 303 | | **FAQ Document** | https://docs.google.com/document/d/1m2KexowAXTmexfC5rVTCSnaShvdUQ8Ag2IEiwBDHxN0/edit?usp=sharing | 304 | | **Slack Workspace** | https://datatalks.club/slack.html | 305 | | **Telegram Channel** | https://t.me/llm_zoomcamp | 306 | | **Project Guidelines** | https://github.com/DataTalksClub/llm-zoomcamp/blob/main/cohorts/2025/project.md | 307 | | **Support Alexey** | https://github.com/sponsors/alexeygrigorev | 308 | -------------------------------------------------------------------------------- /cohorts/2025/pre-course-q-a-stream-summary.md: -------------------------------------------------------------------------------- 1 | # Key Takeaways from the LLM Zoomcamp 2025 Pre-Course Live Q&A 2 | 3 | [![Watch the Launch Stream](https://img.youtube.com/vi/8lgiOLMMKcY/maxresdefault.jpg)](https://youtu.be/8lgiOLMMKcY) 4 | 5 | > **[Watch the LLM Zoomcamp 2025 Pre-Course Live Q&A](https://youtu.be/8lgiOLMMKcY)** 6 | 7 | ## 1. Do you think LLMs are a lasting technology or are they just a passing trend like the metaverse or NFTs? 8 | 9 | **Answer:** While there’s certainly hype around LLMs, they’ve already become deeply integrated into daily workflows—used for coding, personal productivity, and prototype development. Even after the hype subsides, the underlying tools and techniques (like RAG pipelines and vector search) will remain valuable. Future models and interfaces may evolve, but the core capabilities of LLMs are here to stay. 10 | 11 | ## 2. What prerequisites would set me up for success in the course and help me get the best out of it? 12 | 13 | **Answer:** You should be comfortable with: 14 | 15 | * General programming (ideally in Python) 16 | * Command-line tools 17 | * Connecting services (e.g., spinning up ElasticSearch or similar) 18 | No deep machine-learning or advanced software-engineering background is required, but familiarity with basic scripting and package installation will make the coursework smoother. 19 | 20 | ## 3. Is it beneficial if I do the course again this year even though I completed it last year? 21 | 22 | **Answer:** Yes. 
The curriculum has been updated with fewer but deeper modules, new evaluation/monitoring tooling (Phoenix instead of Grafana), and likely a new vector-search backend. You’ll also get to experiment with the latest LLMs and compete in a fresh challenge exercise.
23 | 
24 | ## 4. Will the course help me implement RAG from a live database?
25 | 
26 | **Answer:** Most likely yes. Whether your data is in a transactional database or a knowledge-base store, the course teaches you how to connect to your data source, chunk and index content, and build a RAG application. The exact workflow may vary by database type, but the principles carry over.
27 | 
28 | ## 5. Are we going to do any agentic AI development in this course?
29 | 
30 | **Answer:** Not as part of the core modules. There **will** likely be an optional parallel workshop on agentic workflows in June, but agent development is not formally included in this year’s curriculum.
31 | 
32 | ## 6. Will this course include MLOps (LLM Ops) content?
33 | 
34 | **Answer:** To a degree. The monitoring module covers key LLM Ops practices (metrics, cost tracking, query logging), but full deployment and continuous-training pipelines are reserved for the separate Open-Source LLM Zoomcamp.
35 | 
36 | ## 7. What will be the infrastructure cost, and can I run everything locally without cloud resources?
37 | 
38 | **Answer:**
39 | 
40 | * **Local:** 100% feasible—you can use tools like Llama.cpp or Groq to run models on your own machine.
41 | * **Cloud APIs:** Costs are modest. For OpenAI, \~1 million tokens in/out costs around \$10 total. Groq offers a free tier for basic experimentation.
42 | 
43 | ## 8. Will this course cover MCP (Model Context Protocol)?
44 | 
45 | **Answer:** No, MCP (and other emerging protocols) are not in this year’s syllabus. As with agentic AI, these topics may surface in a future specialized offering once industry practices stabilize.
46 | 
47 | ## 9. Do we discuss evaluating LLM-based applications, generating metrics, and setting up guardrails?
48 | 
49 | **Answer:**
50 | 
51 | * **Evaluation metrics:** Yes. You’ll learn classical IR metrics (e.g., MRR, recall) for search and how to use LLMs themselves as “judges” for end-to-end RAG evaluation.
52 | * **Guardrails:** No formal guardrail framework is included, though monitoring best practices will help you detect and respond to undesired behavior.
53 | 
54 | ## 10. Will we cover chunking techniques in the course?
55 | 
56 | **Answer:** Yes and no. You’ll see how data is pre-chunked in the example repo and learn best practices for chunk size and strategy—but actual implementation and experimentation with chunking are left as part of your hands-on project work.
57 | 
58 | ## 11. Is the RAG pipeline included in the course?
59 | 
60 | **Answer:** Absolutely. RAG is the central focus: indexing, retrieval, prompt construction, and response handling are all core modules, and you’ll build full pipelines from scratch.
61 | 
62 | ## 12. Can different programming languages like JavaScript be used for this course?
63 | 
64 | **Answer:** The taught examples use Python, but you’re welcome to implement your project in another language (e.g., JavaScript). You’ll just need to provide clear installation and usage instructions (npm commands, environment setup) so peers can run and review your work.
65 | 
66 | ## 13. Can we expect a complex project, and learn from examples?
67 | 
68 | **Answer:** Project complexity is up to you. The course provides exemplar student projects (food-search RAG, recipe recommenders, etc.) to illustrate scope and quality. You then pick your own data and extend the RAG concepts to a domain of your choice.
69 | 
70 | ## 14. As someone new, how do I navigate the GitHub repo structure?
71 | 
72 | **Answer:**
73 | 
74 | 1. Open the repo’s README and follow the module links in order.
75 | 2. Click each module’s folder to access videos and homework.
76 | 3. Use the “Project Attempts” section to browse past student projects and peer-review guidelines.
77 | 
78 | ## 15. What is the estimated weekly time investment for the course?
79 | 
80 | **Answer:** Roughly **10 hours per week**:
81 | 
82 | * Module videos + readings: \~4 hours
83 | * Homework/project work: \~6 hours
84 | 
85 | ## 16. How long does the course run and when does it end?
86 | 
87 | **Answer:** It spans **10 weeks** total:
88 | 
89 | * 2 weeks for the Intro module
90 | * 1 week each for Modules 2–5
91 | * 3 weeks for the final project period
92 | Expect to finish by late summer.
93 | 
94 | ## 17. How much will it cost to complete the course using Groq or OpenAI APIs?
95 | 
96 | **Answer:**
97 | 
98 | * **Groq:** Free tier available, suitable for initial experiments.
99 | * **OpenAI:** Approximately \$10–\$20 for 1 million tokens in + out; in practice you’ll spend far less than your API quota.
100 | 
101 | ## 18. Can you explain the steps for enrolling and completing the course?
102 | 
103 | **Answer:**
104 | 
105 | 1. **Star & Fork** the GitHub repo to show support.
106 | 2. **Sign up** on the repo’s landing page (click “Sign up” button).
107 | 3. **Join** the community (Telegram/Slack) for announcements.
108 | 4. **Attend** the launch stream (May 27) for a content overview.
109 | 5. **Weekly:** Watch that week’s module videos, complete homework, submit via the platform.
110 | 6. **Final project:** Build and document your RAG application; peer-review others.
111 | 7. **Certificate:** Earned upon successful project submission and peer reviews.
112 | 
--------------------------------------------------------------------------------
/cohorts/2025/project.md:
--------------------------------------------------------------------------------
1 | ## Course Project
2 | 
3 | The goal of this project is to apply everything we learned
4 | in this course and build an end-to-end RAG project.
5 | 
6 | Remember that to pass the project, you must evaluate 3 peers. If you don't do that, your project can't be considered complete.
7 | 
8 | 
9 | ## Submitting
10 | 
11 | ### Project Attempt #1
12 | 
13 | * Project: https://courses.datatalks.club/llm-zoomcamp-2025/project/project1
14 | * Review: https://courses.datatalks.club/llm-zoomcamp-2025/project/project1/eval
15 | 
16 | 
17 | ### Project Attempt #2
18 | 
19 | * Project: https://courses.datatalks.club/llm-zoomcamp-2025/project/project2
20 | * Review: https://courses.datatalks.club/llm-zoomcamp-2025/project/project2/eval
21 | 
22 | 
23 | > **Important**: update your "Certificate name" here: https://courses.datatalks.club/llm-zoomcamp-2025/enrollment - 
24 | this is what we will use when generating certificates for you.
25 | 26 | 27 | ## Evaluation criteria 28 | 29 | See [here](../../project.md#evaluation-criteria) -------------------------------------------------------------------------------- /etc/chunking.md: -------------------------------------------------------------------------------- 1 | ## Chunking 2 | 3 | I asked ChatGPT how we can chunk a YouTube transcript 4 | and this is what it answered: 5 | 6 | Retrieval Augmented Generation (RAG) is a powerful method for enhancing the capabilities of Language Models by providing them access to external data sources. This approach is particularly effective when dealing with unstructured datasets like YouTube video transcripts, where the content is not organized in a clear question-and-answer format. Let's explore various strategies for chunking these transcripts, both using heuristics and leveraging LLMs for intelligent chunking. 7 | 8 | ### Chunking Strategies 9 | 10 | Chunking is the process of breaking down a large document into smaller, manageable pieces that can be processed by a language model. Effective chunking is crucial for RAG because it directly influences the relevance and accuracy of the information retrieved. 11 | 12 | #### Heuristic-based Chunking 13 | 14 | 1. **Fixed-size Chunking** 15 | - **Fixed Word Count:** Divide the transcript into chunks of a specific number of words (e.g., 200-500 words). This method is simple and ensures uniform chunk sizes. 16 | - **Fixed Time Intervals:** For video transcripts, segment the text based on time intervals (e.g., every 2 minutes of video content). 17 | - **Fixed Sentence Count:** Break the transcript into chunks containing a set number of sentences (e.g., 5-10 sentences). 18 | 19 | 2. **Semantic Boundaries** 20 | - **Pause-based Chunking:** Utilize pauses or silences in the video (detected by timestamps in the transcript) to delineate chunks, assuming pauses often indicate topic shifts. 21 | - **Speaker Turn Chunking:** Divide the transcript at speaker changes, which can be useful in interviews or panel discussions. 22 | 23 | 3. **Structural Segmentation** 24 | - **Paragraph-based Chunking:** Use natural paragraph breaks as chunk boundaries, assuming these breaks reflect changes in topic or subtopics. 25 | - **Section Headers:** If available, use any section headers or titles in the transcript to define chunk boundaries. 26 | 27 | #### LLM-based Intelligent Chunking 28 | 29 | 1. **Thematic Chunking** 30 | - **Topic Modeling:** Use LLMs to identify and segment the transcript into coherent topics or themes. Each chunk would then represent a distinct topic. 31 | - **Content Summarization:** Summarize the transcript using LLMs to identify key points and divide the text based on these summary insights. 32 | 33 | 2. **Contextual Segmentation** 34 | - **Semantic Similarity:** Use embeddings to compute semantic similarity and create chunks where the text remains contextually consistent. LLMs can help refine these chunks by ensuring thematic coherence. 35 | - **Dynamic Chunking:** Allow the LLM to dynamically adjust chunk sizes based on content density, ensuring that dense or complex sections are adequately represented. 36 | 37 | 3. **Hierarchical Structuring** 38 | - **Sectioning:** Ask the LLM to automatically divide the transcript into sections with headers, providing an outline-like structure. This can involve creating a hierarchical representation of the content. 
39 | - **Content Categorization:** LLMs can categorize content into predefined categories (e.g., introduction, explanation, conclusion), structuring chunks accordingly. 40 | 41 | 4. **Intent-based Chunking** 42 | - **Question Generation:** Use LLMs to generate potential questions from the transcript and organize chunks based on these questions, mimicking a Q&A format. 43 | - **Information Retrieval Intent:** Analyze user queries to tailor chunks that directly address likely information retrieval needs, improving retrieval efficiency. 44 | 45 | ### Implementing Chunking for RAG 46 | 47 | Here’s how you can implement these chunking strategies in a RAG workflow: 48 | 49 | 1. **Preprocessing:** 50 | - **Transcription:** Convert audio to text using a speech-to-text model if starting from raw video/audio. 51 | - **Cleaning:** Remove unnecessary filler words, irrelevant content, and transcription errors. 52 | 53 | 2. **Chunking:** 54 | - Apply one or more chunking strategies to segment the transcript into meaningful units. 55 | - Consider using a combination of heuristic and LLM-based methods for a balanced approach. 56 | 57 | 3. **Indexing:** 58 | - Convert chunks into embeddings using a suitable model (e.g., BERT, Sentence Transformers) and index them for efficient retrieval. 59 | 60 | 4. **Retrieval:** 61 | - For a given query, retrieve relevant chunks based on semantic similarity, keyword matching, or a combination of retrieval methods. 62 | 63 | 5. **Generation:** 64 | - Use an LLM to generate answers or outputs by combining retrieved chunks with the query context. 65 | 66 | ### Example Workflow for YouTube Transcript 67 | 68 | Let's walk through an example workflow using a YouTube video transcript: 69 | 70 | 1. **Transcription & Cleaning:** 71 | - Convert video to text, clean up non-essential parts, and ensure accuracy. 72 | 73 | 2. **Chunking:** 74 | - **Heuristic:** Break the text every 2 minutes of video time and by speaker turns. 75 | - **Intelligent:** Use an LLM to divide the text into sections with headings like "Introduction," "Key Concepts," "Case Studies," etc. 76 | 77 | 3. **Indexing:** 78 | - Create embeddings for each chunk and index them in a database like Elasticsearch or a vector store. 79 | 80 | 4. **Query Handling:** 81 | - When a user submits a query, retrieve relevant chunks using semantic search. 82 | 83 | 5. **Answer Generation:** 84 | - The LLM processes the query and the retrieved chunks to generate a cohesive and accurate response. 85 | 86 | ### Tools and Techniques for Implementation 87 | 88 | - **Language Models:** GPT-4, BERT, T5, or domain-specific LLMs for intelligent chunking. 89 | - **Vector Stores:** FAISS, Milvus, Pinecone for indexing and retrieval. 90 | - **Embedding Models:** Sentence-BERT, Universal Sentence Encoder for chunk embedding. 91 | - **Speech-to-Text:** Google Speech-to-Text, Whisper by OpenAI for initial transcription. 92 | 93 | ### Conclusion 94 | 95 | Effective chunking strategies are key to harnessing the full potential of RAG in unstructured datasets like YouTube transcripts. By using both heuristic and intelligent approaches, you can ensure that the language model receives relevant context for generating accurate and informative responses. The choice of chunking strategy should align with the specific use case and characteristics of the dataset, allowing for flexible and efficient retrieval processes. 
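### Appendix: A Minimal Chunking Sketch

To make the heuristic strategies above concrete, here is a minimal sketch of fixed-word-count chunking with overlap; the chunk size, overlap, and the `transcript.txt` file name are assumptions to adjust for your own data:

```python
# Illustrative sketch of fixed-size chunking with overlap (sizes are assumptions).
def chunk_by_words(text: str, chunk_size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of roughly `chunk_size` words."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
    return chunks

# Hypothetical usage: a raw transcript saved to a local file
with open("transcript.txt") as f:
    transcript = f.read()

for i, chunk in enumerate(chunk_by_words(transcript)):
    print(i, len(chunk.split()), chunk[:60] + "...")
```

The same loop can be swapped for time-interval or speaker-turn splitting, and the resulting chunks fed directly into the indexing step described above.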
96 | 
--------------------------------------------------------------------------------
/images/llm-zoomcamp-2025.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/b9d8bd63621736c75fa8d6d0def5ad6656c0b981/images/llm-zoomcamp-2025.jpg
--------------------------------------------------------------------------------
/images/llm-zoomcamp.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/b9d8bd63621736c75fa8d6d0def5ad6656c0b981/images/llm-zoomcamp.jpg
--------------------------------------------------------------------------------
/images/qdrant.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/b9d8bd63621736c75fa8d6d0def5ad6656c0b981/images/qdrant.png
--------------------------------------------------------------------------------
/images/saturn-cloud.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/b9d8bd63621736c75fa8d6d0def5ad6656c0b981/images/saturn-cloud.png
--------------------------------------------------------------------------------
/learning-in-public.md:
--------------------------------------------------------------------------------
1 | # Learning in public
2 | 
3 | Most people learn in private: they consume content but don't tell
4 | anyone about it. There's nothing wrong with it.
5 | 
6 | But we want to encourage you to document your progress and
7 | share it publicly on social media.
8 | 
9 | It helps you get noticed and will lead to:
10 | 
11 | * Expanding your network: meeting new people and making new friends
12 | * Being invited to meetups, conferences and podcasts
13 | * Landing a job or getting clients
14 | * Many other good things
15 | 
16 | Here's a more comprehensive reading on why you want to do it: https://github.com/readme/guides/publishing-your-work
17 | 
18 | 
19 | ## Learning in Public for Zoomcamps
20 | 
21 | When you submit your homework or project, you can also submit
22 | learning in public posts:
23 | 
24 | 
25 | 
26 | You can watch this video to see what your learning in public posts may look like:
27 | 
28 | 
29 | 
30 | 
31 | 
32 | 
33 | Send a PR if you want to suggest improvements for this document
--------------------------------------------------------------------------------
/project.md:
--------------------------------------------------------------------------------
1 | # Projects
2 | 
3 | * Video: https://www.loom.com/share/8f99d25893de4fb8aaa95c0395c740b6
4 | * Office hours: https://www.youtube.com/watch?v=pA9S1mTqAwU
5 | 
6 | In order to receive a certificate of completion for the course, you need
7 | to deliver a project. There are two attempts for that.
8 | 
9 | [Submit your project here](cohorts/2025/project.md)
10 | 
11 | 
12 | ## Objective
13 | 
14 | The goal of this project is to apply everything we have learned
15 | in this course to build an end-to-end RAG application.
16 | 
17 | 
18 | ## Problem statement
19 | 
20 | For the project, we ask you to build an end-to-end RAG project. 
21 | 
22 | For that, you need:
23 | 
24 | * Select a dataset that you're interested in (see [Datasets](#datasets) for examples and ideas)
25 | * Ingest the data into a knowledge base
26 | * Implement the RAG flow: query the knowledge base, build the prompt, send the prompt to an LLM
27 | * Evaluate the performance of your RAG flow
28 | * Create an interface for the application
29 | * Collect user feedback and monitor your application
30 | 
31 | 
32 | ## Project Documentation
33 | 
34 | Your project rises or falls with its documentation. Hence, here are some general recommendations:
35 | 
36 | * **Write for a Broader Audience 📝**: Assume the reader has no prior knowledge of the course materials. This way, your documentation will be accessible not only to evaluators but also to anyone interested in your project.
37 | * **Include Evaluation Criteria 🎯**: Make it easier for evaluators to assess your work by clearly mentioning each criterion in your README. Include relevant screenshots to visually support your points.
38 | * **Think of Future Opportunities 🚀**: Imagine that potential hiring managers will look at your projects. Make it straightforward for them to understand what the project is about and what you contributed. Highlight key features and your role in the project.
39 | * **Be Detailed and Comprehensive 📋**: Include as much detail as possible in the README file. Explain the setup, the functionality, and the workflow of your project. Tools like ChatGPT or other LLMs can assist you in expanding and refining your documentation.
40 | * **Provide Clear Setup Instructions ⚙️**: Include step-by-step instructions on how to set up and run your project locally. Make sure to cover dependencies, configurations, and any other requirements needed to get your project up and running.
41 | * **Use Visuals and Examples 🖼️**: Wherever possible, include diagrams, screenshots, or GIFs to illustrate key points. Use examples to show how to use your project, demonstrate common use cases, and provide sample inputs and expected outputs.
42 | * **App Preview Video 🎥**: Consider adding a short preview video of your app in action to the README. For example, if you're using Streamlit, you can easily record a screencast from the app's top-right menu ([Streamlit Guide](https://docs.streamlit.io/develop/concepts/architecture/app-chrome)). Once you've saved the video file locally, you can just drag & drop it into the online GitHub editor of your README to add it ([Ref](https://stackoverflow.com/a/4279746)).
43 | * **Organize with Sub-Files 🗂️**: If your documentation becomes lengthy, consider splitting it into sub-files and linking them in your README. This keeps the main README clean and neat while providing additional detailed information in separate files (e.g., `setup.md`, `usage.md`, `contributing.md`).
44 | * **Keep It Updated 🔄**: As your project evolves, make sure your documentation reflects any changes or updates. Outdated documentation can confuse readers and diminish the credibility of your project.
45 | 
46 | Remember, clear and comprehensive documentation not only helps others but is also a valuable reference for yourself in the future.
47 | 
48 | 
49 | ## Technologies
50 | 
51 | You don't have to limit yourself to technologies covered in the course. 
You can use alternatives as well:
52 | 
53 | * LLM: OpenAI, Ollama, Groq, AWS Bedrock, etc
54 | * Knowledge base: any text, relational or vector database, including in-memory ones like we implemented in the course or SQLite
55 | * Monitoring: Grafana, Kibana, Streamlit, dash, etc
56 | * Interface: Streamlit, dash, Flask, FastAPI, Django, etc (could be UI or API)
57 | * Ingestion pipeline: Mage, dlt, Airflow, Prefect, python script, etc
58 | 
59 | If you use a tool that wasn't covered in the course, be sure to give a very detailed explanation
60 | of what that tool does and how to use it.
61 | 
62 | If you're not certain about some tools, ask in Slack.
63 | 
64 | ## Tips and best practices
65 | 
66 | * It's better to create a separate GitHub repository for your project
67 | * Give your project a meaningful title, e.g. "DataTalksClub Zoomcamp Q&A system" or "Nutrition Facts Chat"
68 | 
69 | 
70 | ## Peer reviewing
71 | 
72 | > [!IMPORTANT]
73 | > To evaluate the projects, we'll use peer reviewing. This is a great opportunity for you to learn from each other.
74 | > * To get points for your project, you need to evaluate 3 projects of your peers
75 | > * You get 3 extra points for each evaluation
76 | 
77 | ### Review Tips
78 | 
79 | * The reviewer is given a public GitHub repo link and a `commit-hash`
80 | * To see the code state of the repo at the provided commit hash, use the following URL:
81 |     * `https://github.com/{username}/{repo-name}/tree/{commit-hash}`
82 | * It's recommended to clone the repository for the review. To clone the project at the commit hash:
83 | ```bash
84 | git clone https://github.com/{username}/{repo-name}.git
85 | cd {repo-name} && git reset --hard {commit-hash}
86 | ```
87 | 
88 | ## Evaluation Criteria
89 | 
90 | * Problem description
91 |     * 0 points: The problem is not described
92 |     * 1 point: The problem is described but briefly or unclearly
93 |     * 2 points: The problem is well-described and it's clear what problem the project solves
94 | * RAG flow
95 |     * 0 points: No knowledge base or LLM is used
96 |     * 1 point: No knowledge base is used, and the LLM is queried directly
97 |     * 2 points: Both a knowledge base and an LLM are used in the RAG flow
98 | * Retrieval evaluation
99 |     * 0 points: No evaluation of retrieval is provided
100 |     * 1 point: Only one retrieval approach is evaluated
101 |     * 2 points: Multiple retrieval approaches are evaluated, and the best one is used
102 | * RAG evaluation
103 |     * 0 points: No evaluation of RAG is provided
104 |     * 1 point: Only one RAG approach (e.g., one prompt) is evaluated
105 |     * 2 points: Multiple RAG approaches are evaluated, and the best one is used
106 | * Interface
107 |     * 0 points: No way to interact with the application at all
108 |     * 1 point: Command line interface, a script, or a Jupyter notebook
109 |     * 2 points: UI (e.g., Streamlit), web application (e.g., Django), or an API (e.g., built with FastAPI)
110 | * Ingestion pipeline
111 |     * 0 points: No ingestion
112 |     * 1 point: Semi-automated ingestion of the dataset into the knowledge base, e.g., with a Jupyter notebook
113 |     * 2 points: Automated ingestion with a Python script or a special tool (e.g., Mage, dlt, Airflow, Prefect)
114 | * Monitoring
115 |     * 0 points: No monitoring
116 |     * 1 point: User feedback is collected OR there's a monitoring dashboard
117 |     * 2 points: User feedback is collected and there's a dashboard with at least 5 charts
118 | * Containerization
119 |     * 0 points: No containerization
120 |     * 1 point: Dockerfile is provided for the main application OR there's a 
docker-compose for the dependencies only
121 |     * 2 points: Everything is in docker-compose
122 | * Reproducibility
123 |     * 0 points: No instructions on how to run the code, the data is missing, or it's unclear how to access it
124 |     * 1 point: Some instructions are provided but are incomplete, OR instructions are clear and complete, the code works, but the data is missing
125 |     * 2 points: Instructions are clear, the dataset is accessible, it's easy to run the code, and it works. The versions for all dependencies are specified.
126 | * Best practices
127 |     * [ ] Hybrid search: combining both text and vector search (at least evaluating it) (1 point)
128 |     * [ ] Document re-ranking (1 point)
129 |     * [ ] User query rewriting (1 point)
130 | * Bonus points (not covered in the course)
131 |     * [ ] Deployment to the cloud (2 points)
132 |     * [ ] Up to 3 extra bonus points if you want to award for something extra (write in feedback for what)
133 | 
134 | 
135 | ## Project ideas
136 | 
137 | ### Datasets
138 | 
139 | Here are some datasets for your projects and potential things you can do with them:
140 | 
141 | * DTC data:
142 |     * Slack dump: book of the week channel, course channels, career questions, etc
143 |     * DTC website with book of the week archives
144 |     * DTC Podcast: transcripts
145 | * Wikis
146 |     * any subsets of Wikipedia
147 |     * any wiki-like data source
148 |     * notion notes
149 | * Articles
150 |     * Index and answer questions from one or multiple articles
151 | * Transcripts
152 |     * Podcast transcripts
153 |     * YouTube video transcripts
154 | * Books
155 |     * Sci-fi, fiction, or non-fiction books
156 | * Slide Decks and pictures
157 |     * OCR and index slide decks (gpt-4o-mini can do that)
158 |     * Describe and index pictures
159 | * Add more here above this line - send a PR!
160 | * Or just ask ChatGPT (see more ideas [here in this example](https://chatgpt.com/share/70b51c12-e41c-4312-831d-04f489a17f1e))
161 | 
162 | You can also generate a dataset with an LLM:
163 | 
164 | * If you have a dataset but you can't publicly release it, you can generate a similar one with an LLM
165 | * Or you can simply think what kind of a dataset you want to have and generate it
166 | 
167 | Note that your dataset doesn't have to be in the Q&A form. Check [etc/chunking.md](etc/chunking.md) to learn more about chunking.
168 | 
169 | 
170 | ## Cheating and plagiarism
171 | 
172 | Plagiarism in any form is not allowed. Examples of plagiarism:
173 | 
174 | * Taking somebody else's notebooks and projects (in full or partly) and using it for your project
175 | * Re-using your own projects (in full or partly) from other courses and bootcamps
176 | * Re-using your attempt 1 project as attempt 2 if you passed attempt 1
177 | * Re-using your project from the previous iterations of the course
178 | 
179 | Violating any of these rules will result in 0 points for this project.
180 | 
181 | Re-using some parts of the code from the course is allowed.
182 | 
--------------------------------------------------------------------------------