├── .gitignore
├── 01-intro
│   ├── README.md
│   ├── documents-llm.json
│   ├── documents.json
│   ├── elastic-search.md
│   ├── open-ai-alternatives.md
│   ├── parse-faq.ipynb
│   └── rag-intro.ipynb
├── 02-vector-search
│   └── README.md
├── 03-evaluation
│   └── README.md
├── 04-monitoring
│   └── README.md
├── 05-best-practices
│   ├── README.md
│   ├── documents-with-ids.json
│   ├── ground-truth-data.csv
│   ├── hybrid-search-and-reranking-es.ipynb
│   ├── hybrid-search-langchain.ipynb
│   └── llm-zoomcamp-best-practicies.pdf
├── 06-project-example
│   ├── README.md
│   └── content-processing-summary.md
├── README.md
├── after-sign-up.md
├── asking-questions.md
├── awesome-llms.md
├── cohorts
│   ├── 2024
│   │   ├── 01-intro
│   │   │   └── homework.md
│   │   ├── 02-open-source
│   │   │   ├── README.md
│   │   │   ├── docker-compose.yaml
│   │   │   ├── homework.md
│   │   │   ├── huggingface-flan-t5.ipynb
│   │   │   ├── huggingface-mistral-7b.ipynb
│   │   │   ├── huggingface-phi3.ipynb
│   │   │   ├── ollama.ipynb
│   │   │   ├── prompt.md
│   │   │   ├── qa_faq.py
│   │   │   ├── rag-intro.ipynb
│   │   │   ├── serving-hugging-face-models.md
│   │   │   └── starter.ipynb
│   │   ├── 03-vector-search
│   │   │   ├── README.md
│   │   │   ├── demo_es.ipynb
│   │   │   ├── homework.md
│   │   │   └── homework_solution.ipynb
│   │   ├── 04-monitoring
│   │   │   ├── README.md
│   │   │   ├── app
│   │   │   │   ├── .env
│   │   │   │   ├── Dockerfile
│   │   │   │   ├── README.MD
│   │   │   │   ├── app.py
│   │   │   │   ├── assistant.py
│   │   │   │   ├── db.py
│   │   │   │   ├── docker-compose.yaml
│   │   │   │   ├── generate_data.py
│   │   │   │   ├── prep.py
│   │   │   │   └── requirements.txt
│   │   │   ├── code.md
│   │   │   ├── dashboard.json
│   │   │   ├── data
│   │   │   │   ├── evaluations-aqa.csv
│   │   │   │   ├── evaluations-qa.csv
│   │   │   │   ├── results-gpt35-cosine.csv
│   │   │   │   ├── results-gpt35.csv
│   │   │   │   ├── results-gpt4o-cosine.csv
│   │   │   │   ├── results-gpt4o-mini-cosine.csv
│   │   │   │   ├── results-gpt4o-mini.csv
│   │   │   │   └── results-gpt4o.csv
│   │   │   ├── grafana.md
│   │   │   ├── homework.md
│   │   │   ├── offline-rag-evaluation.ipynb
│   │   │   └── solution.ipynb
│   │   ├── 05-orchestration
│   │   │   ├── README.md
│   │   │   ├── code
│   │   │   │   └── 06_retrieval.py
│   │   │   ├── homework.md
│   │   │   └── parse-faq-llm.ipynb
│   │   ├── README.md
│   │   ├── competition
│   │   │   ├── README.md
│   │   │   ├── data
│   │   │   │   ├── test.csv
│   │   │   │   └── train.csv
│   │   │   ├── scorer.py
│   │   │   ├── starter_notebook.ipynb
│   │   │   └── starter_notebook_submission.csv
│   │   ├── project.md
│   │   └── workshops
│   │       └── dlt.md
│   └── 2025
│       ├── 01-intro
│       │   └── homework.md
│       ├── README.md
│       ├── course-launch-stream-summary.md
│       ├── pre-course-q-a-stream-summary.md
│       └── project.md
├── etc
│   └── chunking.md
├── images
│   ├── llm-zoomcamp-2025.jpg
│   ├── llm-zoomcamp.jpg
│   ├── qdrant.png
│   └── saturn-cloud.png
├── learning-in-public.md
└── project.md
/.gitignore:
--------------------------------------------------------------------------------
1 | .ipynb_checkpoints/
2 | __pycache__/
3 | .venv
4 | .envrc
5 |
--------------------------------------------------------------------------------
/01-intro/README.md:
--------------------------------------------------------------------------------
1 | # Module 1: Introduction
2 |
3 | In this module, we will learn what LLMs and RAG are and
4 | implement a simple RAG pipeline to answer questions about
5 | the FAQ documents from our Zoomcamp courses.
6 |
7 | What we will do:
8 |
9 | * Index Zoomcamp FAQ documents
10 | * DE Zoomcamp: https://docs.google.com/document/d/19bnYs80DwuUimHM65UV3sylsCn2j1vziPOwzBwQrebw/edit
11 | * ML Zoomcamp: https://docs.google.com/document/d/1LpPanc33QJJ6BSsyxVg-pWNMplal84TdZtq10naIhD8/edit
12 | * MLOps Zoomcamp: https://docs.google.com/document/d/12TlBfhIiKtyBv8RnsoJR6F72bkPDGEvPOItJIxaEzE0/edit
13 | * Create a Q&A system for answering questions about these documents
14 |
15 | ## 1.1 Introduction to LLM and RAG
16 |
17 |
18 |
19 |
20 |
21 | * LLM
22 | * RAG
23 | * RAG architecture
24 | * Course outcome
25 |
26 |
27 | ## 1.2 Preparing the Environment
28 |
29 |
30 |
31 |
32 |
33 | * Installing libraries
34 | * Alternative: installing anaconda or miniconda
35 |
36 | ```bash
37 | pip install tqdm notebook==7.1.2 openai elasticsearch==8.13.0 pandas scikit-learn ipywidgets
38 | ```
39 |
40 | ## 1.3 Retrieval
41 |
42 |
43 |
44 |
45 |
46 | Note: as of now, you can install minsearch with pip:
47 |
48 | ```bash
49 | pip install minsearch
50 | ```
51 |
52 | * We will use the search engine we built in the [build-your-own-search-engine workshop](https://github.com/alexeygrigorev/build-your-own-search-engine): [minsearch](https://github.com/alexeygrigorev/minsearch)
53 | * Indexing the documents
54 | * Performing the search (see the sketch below)
55 |
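A minimal sketch of indexing and searching the FAQ documents with minsearch (the same API is used in the module notebooks; the query and boost values here are only examples):

```python
import minsearch

# `documents` is the flat list of FAQ records loaded from documents.json
index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)
index.fit(documents)

results = index.search(
    query="The course has already started, can I still enroll?",
    filter_dict={"course": "data-engineering-zoomcamp"},
    boost_dict={"question": 3.0, "section": 0.5},
    num_results=5
)
```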
56 |
57 | ## 1.4 Generation with OpenAI
58 |
59 |
60 |
61 |
62 |
63 | * Invoking OpenAI API
64 | * Building the prompt
65 | * Getting the answer
66 |
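A minimal sketch of invoking the OpenAI API, assuming a `build_prompt` helper that combines the question with the retrieved context (the model name is only an example):

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

prompt = build_prompt(query, results)  # question + retrieved context

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)
answer = response.choices[0].message.content
```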
67 |
68 | If you don't want to use a service, you can run an LLM locally;
69 | refer to [the open-source LLMs module](../cohorts/2024/02-open-source/) from the 2024 cohort for more details.
70 | 
71 | In particular, check "2.7 Ollama - Running LLMs on a CPU" -
72 | it works with the OpenAI API, so to make the example from 1.4
73 | work locally, you only need to change a few lines of code.
74 |
75 |
76 | ## 1.4.2 OpenAI API Alternatives
77 |
78 |
79 |
80 |
81 |
82 | [Open AI Alternatives](../awesome-llms.md#openai-api-alternatives)
83 |
84 |
85 | ## 1.5 Cleaned RAG flow
86 |
87 |
88 |
89 |
90 |
91 | * Cleaning the code we wrote so far
92 | * Making it modular
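
A minimal sketch of the modular flow, assuming `search`, `build_prompt` and `llm` helpers like the ones sketched above (function names are illustrative):

```python
def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer
```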
93 |
94 | ## 1.6 Searching with ElasticSearch
95 |
96 |
97 |
98 |
99 |
100 | * Run ElasticSearch with Docker
101 | * Index the documents
102 | * Replace MinSearch with ElasticSearch
103 |
104 | Running ElasticSearch:
105 |
106 | ```bash
107 | docker run -it \
108 | --rm \
109 | --name elasticsearch \
110 | -m 4GB \
111 | -p 9200:9200 \
112 | -p 9300:9300 \
113 | -e "discovery.type=single-node" \
114 | -e "xpack.security.enabled=false" \
115 | docker.elastic.co/elasticsearch/elasticsearch:8.4.3
116 | ```
117 |
118 | If the previous command doesn't work (e.g. you see "error pulling image configuration"), try to run ElasticSearch directly from Docker Hub:
119 |
120 | ```bash
121 | docker run -it \
122 | --rm \
123 | --name elasticsearch \
124 | -p 9200:9200 \
125 | -p 9300:9300 \
126 | -e "discovery.type=single-node" \
127 | -e "xpack.security.enabled=false" \
128 | elasticsearch:8.4.3
129 | ```
130 |
131 | Index settings:
132 |
133 | ```python
134 | {
135 | "settings": {
136 | "number_of_shards": 1,
137 | "number_of_replicas": 0
138 | },
139 | "mappings": {
140 | "properties": {
141 | "text": {"type": "text"},
142 | "section": {"type": "text"},
143 | "question": {"type": "text"},
144 | "course": {"type": "keyword"}
145 | }
146 | }
147 | }
148 | ```
149 |
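A minimal sketch of creating the index and loading the documents with the Python client (the index name is just an example):

```python
from elasticsearch import Elasticsearch
from tqdm.auto import tqdm

es_client = Elasticsearch("http://localhost:9200")

index_name = "course-questions"
es_client.indices.create(index=index_name, body=index_settings)

for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)
```
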
150 | Query:
151 |
152 | ```python
153 | {
154 | "size": 5,
155 | "query": {
156 | "bool": {
157 | "must": {
158 | "multi_match": {
159 | "query": query,
160 | "fields": ["question^3", "text", "section"],
161 | "type": "best_fields"
162 | }
163 | },
164 | "filter": {
165 | "term": {
166 | "course": "data-engineering-zoomcamp"
167 | }
168 | }
169 | }
170 | }
171 | }
172 | ```
173 |
174 | We use `"type": "best_fields"`. You can read more about
175 | different types of `multi_match` search in [elastic-search.md](elastic-search.md).
176 |
177 | # 1.7 Homework
178 | More information [here](../cohorts/2025/01-intro/homework.md).
179 |
180 |
181 | # Extra materials
182 |
183 | * If you're curious to know how the code for parsing the FAQ works, check [this video](https://www.loom.com/share/ff54d898188b402d880dbea2a7cb8064)
184 |
185 | # Open-Source LLMs (optional)
186 |
187 | It's also possible to run LLMs locally. For that, we
188 | can use Ollama. Check these videos from LLM Zoomcamp 2024
189 | if you're interested in learning more about it:
190 |
191 | * [Ollama - Running LLMs on a CPU](https://www.youtube.com/watch?v=PVpBGs_iSjY&list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R)
192 | * [Ollama & Phi3 + Elastic in Docker-Compose](https://www.youtube.com/watch?v=4juoo_jk96U&list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R)
193 | * [UI for RAG](https://www.youtube.com/watch?v=R6L8PZ-7bGo&list=PL3MmuxUbc_hIB4fSqLy_0AfTjVLpgjV3R)
194 |
195 | To see the commands used in the videos,
196 | see the [2024 cohort folder](../cohorts/2024/02-open-source#27-ollama---running-llms-on-a-cpu).
197 |
198 | # Notes
199 |
200 | * [Notes by slavaheroes](https://github.com/slavaheroes/llm-zoomcamp/blob/homeworks/01-intro/notes.md)
201 | * [Notes by Pham Nguyen Hung](https://hung.bearblog.dev/llm-zoomcamp-1-rag/)
202 | * [Notes by dimzachar](https://github.com/dimzachar/llm_zoomcamp/blob/master/notes/01-intro/README.md)
203 | * [Notes by Olawale Ogundeji](https://github.com/presiZHai/LLM-Zoomcamp/blob/main/01-intro/notes.md)
204 | * [Notes by Uchechukwu](https://medium.com/@njokuuchechi/an-intro-to-large-language-models-llms-0c51c09abe10)
205 | * [Notes by Kamal](https://github.com/mk-hassan/llm-zoomcamp/blob/main/Module-1%3A%20Introduction%20to%20LLMs%20and%20RAG/README.md)
206 | * [Notes by Marat](https://machine-mind-ml.medium.com/discovering-semantic-search-and-rag-with-large-language-models-be7d9ba5bef4)
207 | * [Notes by Waleed](https://waleedayoub.com/post/llmzoomcamp_week1-intro_notes/)
208 | * Did you take notes? Add them above this line (Send a PR with *links* to your notes)
209 |
--------------------------------------------------------------------------------
/01-intro/elastic-search.md:
--------------------------------------------------------------------------------
1 | # Elastic Search
2 |
3 | This document contains useful notes about Elasticsearch.
4 |
5 | # `multi_match` Query in Elasticsearch
6 |
7 | The `multi_match` query is used to search for a given text across multiple fields in an Elasticsearch index.
8 |
9 | It provides various types to control how the matching is executed and scored.
10 |
11 | There are multiple types of `multi_match` queries:
12 |
13 | - `best_fields`: Returns the highest score from any one field.
14 | - `most_fields`: Combines the scores from all fields.
15 | - `cross_fields`: Treats fields as one big field for scoring.
16 | - `phrase`: Searches for the query as an exact phrase.
17 | - `phrase_prefix`: Searches for the query as a prefix of a phrase.
18 |
19 |
20 | ## `best_fields`
21 |
22 | The `best_fields` type searches each field separately and returns the highest score from any one of the fields.
23 |
24 | This type is useful when you want to find documents where at least one field matches the query well.
25 |
26 |
27 | ```json
28 | {
29 | "size": 5,
30 | "query": {
31 | "bool": {
32 | "must": {
33 | "multi_match": {
34 | "query": "How do I run docker on Windows?",
35 | "fields": ["question", "text"],
36 | "type": "best_fields"
37 | }
38 | }
39 | }
40 | }
41 | }
42 | ```
43 |
44 | ## `most_fields`
45 |
46 | The `most_fields` type searches each field and combines the scores from all fields.
47 |
48 | This is useful when the relevance of a document increases with more matching fields.
49 |
50 | ```json
51 | {
52 | "multi_match": {
53 | "query": "How do I run docker on Windows?",
54 | "fields": ["question^4", "text"],
55 | "type": "most_fields"
56 | }
57 | }
58 | ```
59 |
60 | ## `cross_fields`
61 |
62 | The `cross_fields` type treats fields as though they were one big field.
63 |
64 | It is suitable for cases where you have fields representing the same text in different ways, such as synonyms.
65 |
66 | ```json
67 | {
68 | "multi_match": {
69 | "query": "How do I run docker on Windows?",
70 | "fields": ["question", "text"],
71 | "type": "cross_fields"
72 | }
73 | }
74 | ```
75 |
76 | ## `phrase`
77 |
78 | The `phrase` type looks for the query as an exact phrase within the fields.
79 |
80 | It is useful for exact match searches.
81 |
82 | ```json
83 | {
84 | "multi_match": {
85 | "query": "How do I run docker on Windows?",
86 | "fields": ["question", "text"],
87 | "type": "phrase"
88 | }
89 | }
90 | ```
91 |
92 | ## `phrase_prefix`
93 |
94 | The `phrase_prefix` type searches for documents that contain the query as a prefix of a phrase.
95 |
96 | This is useful for autocomplete or typeahead functionality.
97 |
98 |
99 | ```json
100 | {
101 | "multi_match": {
102 | "query": "How do I run docker on Windows?",
103 | "fields": ["question", "text"],
104 | "type": "phrase_prefix"
105 | }
106 | }
107 | ```
--------------------------------------------------------------------------------
/01-intro/open-ai-alternatives.md:
--------------------------------------------------------------------------------
1 | moved [here](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/awesome-llms.md#openai-api-alternatives)
2 |
--------------------------------------------------------------------------------
/01-intro/parse-faq.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 8,
6 | "id": "4cd1eaa8-3424-41ad-9cf2-3e8548712865",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "import io\n",
11 | "\n",
12 | "import requests\n",
13 | "import docx"
14 | ]
15 | },
16 | {
17 | "cell_type": "code",
18 | "execution_count": 24,
19 | "id": "8180e7e4-b90d-4900-a59b-d22e5d6537c4",
20 | "metadata": {},
21 | "outputs": [],
22 | "source": [
23 | "def clean_line(line):\n",
24 | " line = line.strip()\n",
25 | " line = line.strip('\\uFEFF')\n",
26 | " return line\n",
27 | "\n",
28 | "def read_faq(file_id):\n",
29 | " url = f'https://docs.google.com/document/d/{file_id}/export?format=docx'\n",
30 | " \n",
31 | " response = requests.get(url)\n",
32 | " response.raise_for_status()\n",
33 | " \n",
34 | " with io.BytesIO(response.content) as f_in:\n",
35 | " doc = docx.Document(f_in)\n",
36 | "\n",
37 | " questions = []\n",
38 | "\n",
39 | " question_heading_style = 'heading 2'\n",
40 | " section_heading_style = 'heading 1'\n",
41 | " \n",
42 | " heading_id = ''\n",
43 | " section_title = ''\n",
44 | " question_title = ''\n",
45 | " answer_text_so_far = ''\n",
46 | " \n",
47 | " for p in doc.paragraphs:\n",
48 | " style = p.style.name.lower()\n",
49 | " p_text = clean_line(p.text)\n",
50 | " \n",
51 | " if len(p_text) == 0:\n",
52 | " continue\n",
53 | " \n",
54 | " if style == section_heading_style:\n",
55 | " section_title = p_text\n",
56 | " continue\n",
57 | " \n",
58 | " if style == question_heading_style:\n",
59 | " answer_text_so_far = answer_text_so_far.strip()\n",
60 | " if answer_text_so_far != '' and section_title != '' and question_title != '':\n",
61 | " questions.append({\n",
62 | " 'text': answer_text_so_far,\n",
63 | " 'section': section_title,\n",
64 | " 'question': question_title,\n",
65 | " })\n",
66 | " answer_text_so_far = ''\n",
67 | " \n",
68 | " question_title = p_text\n",
69 | " continue\n",
70 | " \n",
71 | " answer_text_so_far += '\\n' + p_text\n",
72 | " \n",
73 | " answer_text_so_far = answer_text_so_far.strip()\n",
74 | " if answer_text_so_far != '' and section_title != '' and question_title != '':\n",
75 | " questions.append({\n",
76 | " 'text': answer_text_so_far,\n",
77 | " 'section': section_title,\n",
78 | " 'question': question_title,\n",
79 | " })\n",
80 | "\n",
81 | " return questions"
82 | ]
83 | },
84 | {
85 | "cell_type": "code",
86 | "execution_count": 25,
87 | "id": "7d3c2dd7-f64a-4dc7-a4e3-3e8aadfa720f",
88 | "metadata": {},
89 | "outputs": [],
90 | "source": [
91 | "faq_documents = {\n",
92 | " 'data-engineering-zoomcamp': '19bnYs80DwuUimHM65UV3sylsCn2j1vziPOwzBwQrebw',\n",
93 | " 'machine-learning-zoomcamp': '1LpPanc33QJJ6BSsyxVg-pWNMplal84TdZtq10naIhD8',\n",
94 | " 'mlops-zoomcamp': '12TlBfhIiKtyBv8RnsoJR6F72bkPDGEvPOItJIxaEzE0',\n",
95 | "}"
96 | ]
97 | },
98 | {
99 | "cell_type": "code",
100 | "execution_count": 27,
101 | "id": "f94efe26-05e8-4ae5-a0fa-0a8e16852816",
102 | "metadata": {},
103 | "outputs": [
104 | {
105 | "name": "stdout",
106 | "output_type": "stream",
107 | "text": [
108 | "data-engineering-zoomcamp\n",
109 | "machine-learning-zoomcamp\n",
110 | "mlops-zoomcamp\n"
111 | ]
112 | }
113 | ],
114 | "source": [
115 | "documents = []\n",
116 | "\n",
117 | "for course, file_id in faq_documents.items():\n",
118 | " print(course)\n",
119 | " course_documents = read_faq(file_id)\n",
120 | " documents.append({'course': course, 'documents': course_documents})"
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": 29,
126 | "id": "06b8d8be-f656-4cc3-893f-b159be8fda21",
127 | "metadata": {},
128 | "outputs": [],
129 | "source": [
130 | "import json"
131 | ]
132 | },
133 | {
134 | "cell_type": "code",
135 | "execution_count": 32,
136 | "id": "30d50bc1-8d26-44ee-8734-cafce05e0523",
137 | "metadata": {},
138 | "outputs": [],
139 | "source": [
140 | "with open('documents.json', 'wt') as f_out:\n",
141 | " json.dump(documents, f_out, indent=2)"
142 | ]
143 | },
144 | {
145 | "cell_type": "code",
146 | "execution_count": 33,
147 | "id": "0eabb1c6-5cc6-4d4d-a6da-e27d41cea546",
148 | "metadata": {},
149 | "outputs": [
150 | {
151 | "name": "stdout",
152 | "output_type": "stream",
153 | "text": [
154 | "[\n",
155 | " {\n",
156 | " \"course\": \"data-engineering-zoomcamp\",\n",
157 | " \"documents\": [\n",
158 | " {\n",
159 | " \"text\": \"The purpose of this document is to capture frequently asked technical questions\\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first \\u201cOffice Hours'' live.1\\nSubscribe to course public Google Calendar (it works from Desktop only).\\nRegister before the course starts using this link.\\nJoin the course Telegram channel with announcements.\\nDon\\u2019t forget to register in DataTalks.Club's Slack and join the channel.\",\n",
160 | " \"section\": \"General course-related questions\",\n",
161 | " \"question\": \"Course - When will the course start?\"\n",
162 | " },\n",
163 | " {\n"
164 | ]
165 | }
166 | ],
167 | "source": [
168 | "!head documents.json"
169 | ]
170 | },
171 | {
172 | "cell_type": "code",
173 | "execution_count": null,
174 | "id": "1b21af5c-2f6d-49e7-92e9-ca229e2473b9",
175 | "metadata": {},
176 | "outputs": [],
177 | "source": []
178 | }
179 | ],
180 | "metadata": {
181 | "kernelspec": {
182 | "display_name": "Python 3 (ipykernel)",
183 | "language": "python",
184 | "name": "python3"
185 | },
186 | "language_info": {
187 | "codemirror_mode": {
188 | "name": "ipython",
189 | "version": 3
190 | },
191 | "file_extension": ".py",
192 | "mimetype": "text/x-python",
193 | "name": "python",
194 | "nbconvert_exporter": "python",
195 | "pygments_lexer": "ipython3",
196 | "version": "3.9.13"
197 | }
198 | },
199 | "nbformat": 4,
200 | "nbformat_minor": 5
201 | }
202 |
--------------------------------------------------------------------------------
/02-vector-search/README.md:
--------------------------------------------------------------------------------
1 | # Vector Search
2 |
3 | TBA
4 |
5 | ## Homework
6 |
7 | See [here](../cohorts/2025/02-vector-search/homework.md)
8 |
9 |
10 | # Notes
11 |
12 | * Notes from [2024 edition](../cohorts/2024/03-vector-search/)
13 | * Did you take notes? Add them above this line (Send a PR with *links* to your notes)
14 |
--------------------------------------------------------------------------------
/03-evaluation/README.md:
--------------------------------------------------------------------------------
1 | # RAG and LLM Evaluation
2 |
3 | TBA
4 |
5 | ## Homework
6 |
7 | TBA
8 |
9 | # Notes
10 |
11 | * Did you take notes? Add them above this line (Send a PR with *links* to your notes)
12 |
--------------------------------------------------------------------------------
/04-monitoring/README.md:
--------------------------------------------------------------------------------
1 | # Module 4: Evaluation and Monitoring
2 |
3 | In this module, we will learn how to evaluate and monitor our LLM and RAG system.
4 |
5 | In the evaluation part, we assess the quality of our entire RAG
6 | system before it goes live.
7 |
8 | In the monitoring part, we collect, store and visualize
9 | metrics to assess the answer quality of a deployed LLM. We also
10 | collect chat history and user feedback.
11 |
12 |
13 | TBA
14 |
15 | # Notes
16 |
17 | * Notes from [2024 edition](../cohorts/2024/04-monitoring/)
18 | * Did you take notes? Add them above this line (Send a PR with *links* to your notes)
19 |
--------------------------------------------------------------------------------
/05-best-practices/README.md:
--------------------------------------------------------------------------------
1 | # Module 6: Best practices
2 |
3 | In this module, we'll cover techniques that can improve your RAG pipeline.
4 |
5 | ## 6.1 Techniques to Improve RAG Pipeline
6 |
7 |
8 |
9 |
10 |
11 | * Small-to-Big chunk retrieval
12 | * Leveraging document metadata
13 | * Hybrid search
14 | * User query rewriting
15 | * Document reranking
16 |
17 | Links:
18 | * [Slides](llm-zoomcamp-best-practicies.pdf)
19 | * [Five Techniques for Improving RAG Chatbots - Nikita Kozodoi [Video]](https://www.youtube.com/watch?v=xPYmClWk5O8)
20 | * [Survey on RAG techniques [Article]](https://arxiv.org/abs/2312.10997)
21 |
22 |
23 | ## 6.2 Hybrid search
24 |
25 |
26 |
27 |
28 |
29 | * Hybrid search strategy
30 | * Hybrid search in Elasticsearch (see the sketch below)
31 |
32 | Links:
33 | * [Notebook](hybrid-search-and-reranking-es.ipynb)
34 | * [Hybrid search [Elasticsearch Guide]](https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html#_combine_approximate_knn_with_other_features)
35 | * [Hybrid search [Tutorial]](https://www.elastic.co/search-labs/tutorials/search-tutorial/vector-search/hybrid-search)
36 |
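A minimal sketch of a hybrid request that combines a keyword query with an approximate kNN clause in a single search call (the `question_text_vector` field name and the boosts are assumptions; see the notebook for the full version):

```python
knn_query = {
    "field": "question_text_vector",
    "query_vector": query_vector,  # embedding of the user question
    "k": 5,
    "num_candidates": 10000,
    "boost": 0.5,
}

keyword_query = {
    "bool": {
        "must": {
            "multi_match": {
                "query": query,
                "fields": ["question", "text", "section"],
                "type": "best_fields",
                "boost": 0.5,
            }
        },
        "filter": {"term": {"course": "data-engineering-zoomcamp"}},
    }
}

response = es_client.search(
    index=index_name,
    query=keyword_query,
    knn=knn_query,
    size=5,
)
```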
37 |
38 | ## 6.3 Document Reranking
39 |
40 |
41 |
42 |
43 |
44 | * Reranking concept and metrics
45 | * Reciprocal Rank Fusion (RRF)
46 | * Handmade reranking implementation (see the sketch at the end of this section)
47 |
48 | Links:
49 | * [Reciprocal Rank Fusion (RRF) method [Elasticsearch Guide]](https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html)
50 | * [RRF method [Article]](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf)
51 | * [Elasticsearch subscription plans](https://www.elastic.co/subscriptions)
52 |
53 | To use reranking based on the RRF algorithm, pull and run a Docker container with Elasticsearch 8.9.0 or higher:
54 |
55 | ```bash
56 | docker run -it \
57 | --rm \
58 | --name elasticsearch \
59 | -m 4GB \
60 | -p 9200:9200 \
61 | -p 9300:9300 \
62 | -e "discovery.type=single-node" \
63 | -e "xpack.security.enabled=false" \
64 | docker.elastic.co/elasticsearch/elasticsearch:8.9.0
65 | ```
66 |
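A minimal sketch of a handmade RRF step, assuming two ranked lists of document ids coming from the keyword and vector searches (`k=60` is the constant commonly used in the literature):

```python
def compute_rrf(rank, k=60):
    """Reciprocal Rank Fusion score for a single ranking position."""
    return 1.0 / (k + rank)

def rrf_rerank(keyword_ids, vector_ids, k=60):
    scores = {}
    for ranking in [keyword_ids, vector_ids]:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + compute_rrf(rank, k)
    # documents with the highest combined score come first
    return sorted(scores, key=scores.get, reverse=True)
```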
67 |
68 | ## 6.4 Hybrid search with LangChain
69 |
70 |
71 |
72 |
73 |
74 | * LangChain: Introduction
75 | * ElasticsearchRetriever
76 | * Hybrid search implementation
77 |
78 | ```bash
79 | pip install -qU langchain langchain-elasticsearch langchain-huggingface
80 | ```
81 |
82 | Links:
83 | * [Notebook](hybrid-search-langchain.ipynb)
84 | * [Chatbot Implementation [Tutorial]](https://www.elastic.co/search-labs/tutorials/chatbot-tutorial/implementation)
85 | * [ElasticsearchRetriever](https://python.langchain.com/v0.2/docs/integrations/retrievers/elasticsearch_retriever/)
86 |
87 |
88 | ## Homework
89 |
90 | TBD
91 |
92 | # Notes
93 |
94 | * First link goes here
95 | * Did you take notes? Add them above this line (Send a PR with *links* to your notes)
96 |
--------------------------------------------------------------------------------
/05-best-practices/llm-zoomcamp-best-practicies.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/b9d8bd63621736c75fa8d6d0def5ad6656c0b981/05-best-practices/llm-zoomcamp-best-practicies.pdf
--------------------------------------------------------------------------------
/06-project-example/README.md:
--------------------------------------------------------------------------------
1 | # 7. End-to-End Project Example
2 |
3 | Links:
4 |
5 | * [Project alexeygrigorev/fitness-assistant](https://github.com/alexeygrigorev/fitness-assistant)
6 | * [Project criteria](../project.md#evaluation-criteria)
7 |
8 |
9 | Note: check the final result; it's a bit different
10 | from what we showed in the videos. We further improved it
11 | with small changes here and there, like an improved
12 | README, better code readability, etc.
13 |
14 |
15 | ## 7.1. Fitness assistant project
16 |
17 |
18 |
19 |
20 |
21 | * Generating data for the project
22 | * Setting up the project
23 | * Implementing the initial version of the RAG flow
24 |
25 | ## 7.2. Evaluating retrieval
26 |
27 |
28 |
29 |
30 |
31 | * Preparing the README file
32 | * Generating gold standard evaluation data
33 | * Evaluating retrieval
34 | * Finding the best boosting coefficients
35 |
36 |
37 | ## 7.3 Evaluating RAG
38 |
39 |
40 |
41 |
42 |
43 | * Using LLM-as-a-Judge (type 2)
44 | * Comparing gpt-4o-mini with gpt-4o
45 |
46 | ## 7.4 Interface and ingestion pipeline
47 |
48 |
49 |
50 |
51 |
52 | * Turning the Jupyter notebook into a script
53 | * Creating the ingestion pipeline
54 | * Creating the API interface with Flask (see the sketch below)
55 | * Improving README
56 |
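A minimal sketch of what the Flask API interface can look like (the endpoint name and the `rag` helper are illustrative, not the exact project code):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/question", methods=["POST"])
def handle_question():
    question = request.json["question"]
    answer = rag(question)  # hypothetical RAG pipeline from the previous sections
    return jsonify({"answer": answer})

if __name__ == "__main__":
    app.run(debug=True)
```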
57 |
58 | ## 7.5 Monitoring and containerization
59 |
60 |
61 |
62 |
63 |
64 | * Creating a Docker image for our application
65 | * Putting everything in docker compose
66 | * Logging all the information for monitoring purposes
67 |
68 |
69 | ## 7.6 Summary and closing remarks
70 |
71 |
72 |
73 |
74 |
75 | * Changes between 7.5 and 7.6 (Postgres logging, Grafana, cli.py, etc.)
76 | * README file improvements
77 | * Total cost of the project (~$2) and how to lower it
78 | * Using generated data for real-life projects
79 |
80 |
81 | ## 7.7 Chunking for longer texts
82 |
83 |
84 |
85 |
86 |
87 | * Different chunking strategies
88 | * [Use cases: multiple articles, one article, slide decks](content-processing-summary.md)
89 |
90 | Links:
91 |
92 | * https://chatgpt.com/share/a4616f6b-43f4-4225-9d03-bb69c723c210
93 | * https://chatgpt.com/share/74217c02-95e6-46ae-b5a5-ca79f9a07084
94 | * https://chatgpt.com/share/8cf0ebde-c53f-4c6f-82ae-c6cc52b2fd0b
95 |
96 | # Notes
97 |
98 | * First link goes here
99 | * Did you take notes? Add them above this line (Send a PR with *links* to your notes)
100 |
--------------------------------------------------------------------------------
/06-project-example/content-processing-summary.md:
--------------------------------------------------------------------------------
1 | # Content Processing Cases and Steps
2 |
3 | ## Case: Multiple Articles
4 |
5 | - Assign each article a document id
6 | - Chunk the articles
7 | - Assign each chunk a unique chunk id (could be doc_id + chunk_number)
8 | - Evaluate retrieval: compute separate hit rates for doc_id and chunk_id
9 | - Evaluate RAG: LLM as a Judge
10 | - Tuning chunk size: use metrics from Evaluate RAG
11 |
12 | Example JSON structure for a chunk:
13 | ```json
14 | {
15 | "doc_id": "ashdiasdh",
16 | "chunk_id": "ashdiasdh_1",
17 | "text": "actual text"
18 | }
19 | ```
20 |
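A minimal sketch of producing such chunks from an article (fixed-size splitting by words; the chunk size is just an example and should be tuned with the evaluation metrics above):

```python
def chunk_article(doc_id, text, chunk_size=300):
    words = text.split()
    chunks = []
    for i, start in enumerate(range(0, len(words), chunk_size)):
        chunks.append({
            "doc_id": doc_id,
            "chunk_id": f"{doc_id}_{i}",
            "text": " ".join(words[start:start + chunk_size]),
        })
    return chunks
```
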
21 | ## Case: Single Article / Transcript / Etc.
22 |
23 | Example: the user provides a YouTube ID, you initialize the system, and now you can talk to it.
24 |
25 | - Chunk it
26 | - Evaluation as for multiple articles
27 |
28 |
29 | ## Case: Book or Very Long Form Content
30 |
31 | - Experiment with it
32 | - Each chapter / section can be a separate document
33 | - Use LLM as a Judge to see which approach works best
34 |
35 | ## Case: Images
36 |
37 | - Describe the images using gpt-4o-mini
38 | - [CLIP](https://openai.com/index/clip/)
39 | - Each image is a separate document
40 |
41 | ## Case: Slides
42 |
43 | - Same as with images + multiple articles
44 | - "Chunking": slide deck = document, slide = chunk
45 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 | LLM Zoomcamp: A Free Course on Real-Life Applications of LLMs
7 |
8 |
9 |
10 | In 10 weeks, learn how to build AI systems that answer questions about your knowledge base. Gain hands-on experience with LLMs, RAG, vector search, evaluation, monitoring, and more.
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 | Join Slack •
19 | #course-llm-zoomcamp Channel •
20 | Telegram Announcements •
21 | Course Playlist •
22 | FAQ
23 |
24 |
25 | ## How to Take LLM Zoomcamp
26 |
27 | ### 2025 Cohort
28 | - **Start Date**: June 2, 2025, 17:00 CET
29 | - **Register Here**: [Sign up](https://airtable.com/appPPxkgYLH06Mvbw/shr7WtxHEPXxaui0Q)
30 |
31 | 2025 cohort checklist:
32 | - Subscribe to our [Google Calendar](https://calendar.google.com/calendar/?cid=NjkxOThkOGFhZmUyZmQwMzZjNDFkNmE2ZDIyNjE5YjdiMmQyZDVjZTYzOGMxMzQyZmNkYjE5Y2VkNDYxOTUxY0Bncm91cC5jYWxlbmRhci5nb29nbGUuY29t)
33 | - Check [2025 cohort folder](https://github.com/DataTalksClub/llm-zoomcamp/tree/main/cohorts/2025) to stay updated
34 | - Watch [live Q&A](https://youtube.com/live/8lgiOLMMKcY) about the course
35 | - Watch [live course launch](https://www.youtube.com/live/FgnelhEJFj0) stream
36 | - Save the [2025 course playlist](https://youtube.com/playlist?list=PL3MmuxUbc_hIoBpuc900htYF4uhEAbaT-&si=n7CuD0DEgPtnbtsI) on YouTube
37 | - Check course content by navigating to the right module on GitHub
38 | - Share this course with a friend!
39 |
40 | ### Self-Paced Learning
41 | You can follow the course at your own pace:
42 | 1. Watch the course videos.
43 | 2. Complete the homework assignments.
44 | 3. Work on a project and share it in Slack for feedback.
45 |
46 | ## Syllabus
47 |
48 | ### Pre-course Workshops
49 | - [Build a Search Engine](https://www.youtube.com/watch?v=nMrGK5QgPVE) ([Code](https://github.com/alexeygrigorev/build-your-own-search-engine))
50 |
51 | ### Modules
52 |
53 | #### [Module 1: Introduction to LLMs and RAG](01-intro/)
54 | - Basics of LLMs and Retrieval-Augmented Generation (RAG)
55 | - OpenAI API and text search with Elasticsearch
56 |
57 | #### [Module 2: Vector Search](02-vector-search/)
58 |
59 | - Vector search and embeddings
60 | - Indexing and retrieving data efficiently
61 | - Using Qdrant as the vector database
62 |
63 | #### [Module 3: Evaluation](03-evaluation/)
64 |
65 | - Search evaluation
66 | - Online vs offline evaluation
67 | - LLM as a Judge
68 |
69 | #### [Module 4: Monitoring](04-monitoring/)
70 |
71 | - Online evaluation techniques
72 | - Monitoring user feedback with dashboards
73 |
74 |
75 | #### [Module 5: Best Practices](05-best-practices/)
76 | - Hybrid search
77 | - Document reranking
78 |
79 | #### [Module 6: Bonus - End-to-End Project](06-project-example/)
80 | - Build a fitness assistant using LLMs
81 |
82 | ### [Capstone Project](project.md)
83 |
84 | Put everything you learned into practice.
85 |
86 | ## Meet the Instructors
87 | - [Alexey Grigorev](https://linkedin.com/in/agrigorev/)
88 | - [Timur Kamaliev](https://www.linkedin.com/in/timurkamaliev/)
89 |
90 | ## Community & Support
91 |
92 | ### **Getting Help on Slack**
93 | Join the [`#course-llm-zoomcamp`](https://app.slack.com/client/T01ATQK62F8/C06TEGTGM3J) channel on [DataTalks.Club Slack](https://datatalks.club/slack.html) for discussions, troubleshooting, and networking.
94 |
95 | To keep discussions organized:
96 | - Follow [our guidelines](asking-questions.md) when posting questions.
97 | - Review the [community guidelines](https://datatalks.club/slack/guidelines.html).
98 |
99 | ## Sponsors & Supporters
100 | A special thanks to our course sponsors for making this initiative possible!
101 |
102 |
103 |
104 |
105 |
106 |
107 |
108 |
109 |
110 |
111 |
112 |
113 |
114 |
115 |
116 |
117 |
118 |
119 |
120 |
121 | Interested in supporting our community? Reach out to [alexey@datatalks.club](mailto:alexey@datatalks.club).
122 |
123 | ## About DataTalks.Club
124 |
125 |
126 |
127 |
128 |
129 |
130 | DataTalks.Club is a global online community of data enthusiasts. It's a place to discuss data, learn, share knowledge, ask and answer questions, and support each other.
131 |
132 |
133 |
134 | Website •
135 | Join Slack Community •
136 | Newsletter •
137 | Upcoming Events •
138 | Google Calendar •
139 | YouTube •
140 | GitHub •
141 | LinkedIn •
142 | Twitter
143 |
144 |
145 | All the activity at DataTalks.Club mainly happens on [Slack](https://datatalks.club/slack.html). We post updates there and discuss different aspects of data, career questions, and more.
146 |
147 | At DataTalksClub, we organize online events, community activities, and free courses. You can learn more about what we do at [DataTalksClub Community Navigation](https://www.notion.so/DataTalksClub-Community-Navigation-bf070ad27ba44bf6bbc9222082f0e5a8?pvs=21).
148 |
--------------------------------------------------------------------------------
/after-sign-up.md:
--------------------------------------------------------------------------------
1 | ## Thank you!
2 |
3 | Thanks for signing up for the course.
4 |
5 | Here are some things you should do before you start the course:
6 |
7 | - Register in [DataTalks.Club's Slack](https://datatalks.club/slack.html)
8 | - Join the [`#course-llm-zoomcamp`](https://app.slack.com/client/T01ATQK62F8/C06TEGTGM3J) channel
9 | - Join the [course Telegram channel with announcements](https://t.me/llm_zoomcamp)
10 | - Subscribe to [DataTalks.Club's YouTube channel](https://www.youtube.com/c/DataTalksClub) and check [the course playlist](https://www.youtube.com/playlist?list=PL3MmuxUbc_hKiIVNf7DeEt_tGjypOYtKV)
11 | - Subscribe to our [Course Calendar](https://calendar.google.com/calendar/?cid=NjkxOThkOGFhZmUyZmQwMzZjNDFkNmE2ZDIyNjE5YjdiMmQyZDVjZTYzOGMxMzQyZmNkYjE5Y2VkNDYxOTUxY0Bncm91cC5jYWxlbmRhci5nb29nbGUuY29t)
12 | - Check our [Technical FAQ](https://docs.google.com/document/d/1m2KexowAXTmexfC5rVTCSnaShvdUQ8Ag2IEiwBDHxN0/edit?usp=sharing) if you have questions
13 |
14 | See you in the course!
15 |
--------------------------------------------------------------------------------
/asking-questions.md:
--------------------------------------------------------------------------------
1 | ## Asking questions
2 |
3 | If you have any questions, ask them
4 | in the [`#course-llm-zoomcamp`](https://app.slack.com/client/T01ATQK62F8/C06TEGTGM3J) channel in [DataTalks.Club](https://datatalks.club) slack.
5 |
6 | To keep our discussion in Slack more organized, we ask you to follow these suggestions:
7 |
8 | * Before asking a question, check [FAQ](https://docs.google.com/document/d/1m2KexowAXTmexfC5rVTCSnaShvdUQ8Ag2IEiwBDHxN0/edit?usp=sharing).
9 | * Use threads. When you have a problem, first describe the problem briefly
10 | and then put the actual error in the thread - so it doesn't take the entire screen.
11 | * Instead of screenshots, it's better to copy-paste the error you're getting in text.
12 | Use ` ``` ` for formatting your code.
13 | It's very difficult to read text from screenshots.
14 | * Please don't take pictures of your code with a phone. It's even harder to read. Follow the previous suggestion,
15 | and in rare cases when you need to show what happens on your screen, take a screenshot.
16 | * You don't need to tag the instructors when you have a problem. We will see it eventually.
17 | * If somebody helped you with your problem and it's not in [FAQ](https://docs.google.com/document/d/1m2KexowAXTmexfC5rVTCSnaShvdUQ8Ag2IEiwBDHxN0/edit?usp=sharing), please add it there.
18 | It'll help other students.
19 |
--------------------------------------------------------------------------------
/awesome-llms.md:
--------------------------------------------------------------------------------
1 | # Awesome LLMs
2 |
3 | In this file, we collect interesting LLM-related links.
4 |
5 | ## OpenAI API Alternatives
6 |
7 | OpenAI and GPT are not the only hosted LLMs that we can use.
8 | There are other services that we can use:
9 |
10 |
11 | * [mistral.ai](https://mistral.ai) (5€ free credit on sign up)
12 | * [Groq](https://console.groq.com) (inference for open-source LLMs, with rate limits)
13 | * [TogetherAI](https://api.together.ai) (inference for a variety of open-source LLMs, $25 free credit on sign up)
14 | * [Google Gemini](https://ai.google.dev/gemini-api/docs/get-started/tutorial?lang=python) (2 months unlimited access)
15 | * [OpenRouterAI](https://openrouter.ai/) (some small open-source models, such as Gemma 7B, are free)
16 | * [HuggingFace API](https://huggingface.co/docs/api-inference/index) (over 150,000 open-source models, rate-limited and free)
17 | * [Cohere](https://cohere.com/) (provides a developer trial key which allows up to 100 reqs/min for generating, summarizing, and classifying text. Read more [here](https://cohere.com/blog/free-developer-tier-announcement))
18 | * [wit](https://wit.ai/) (Facebook AI Affiliate - free)
19 | * [Anthropic API](https://www.anthropic.com/pricing#anthropic-api) (starting from $0.25 / MTok for input and $1.25 / MTok for the output for the most affordable model)
20 | * [AI21Labs API](https://www.ai21.com/pricing#foundation-models) (Free trial including $10 credits for 3 months)
21 | * [Replicate](https://replicate.com/) (faster inference, can host any ML model; charges $0.10 per 1M input tokens for Llama/Mistral models)
22 |
23 |
24 | ## Local LLMs on CPUs
25 |
26 | These services help run LLMs locally, also without GPUs
27 |
28 | - [ollama](https://github.com/ollama/ollama)
29 | - [Jan.AI](https://jan.ai/)
30 | - [h2oGPT](https://github.com/h2oai/h2ogpt)
31 |
32 |
33 | ## Applications
34 | - **Text Generation**
35 | - [OpenAI GPT-3 Playground](https://platform.openai.com/playground)
36 | - [AI Dungeon](https://play.aidungeon.io/)
37 | - **Chatbots**
38 | - [Rasa](https://rasa.com/)
39 | - [Microsoft Bot Framework](https://dev.botframework.com/)
40 | - **Sentiment Analysis**
41 | - [VADER Sentiment Analysis](https://github.com/cjhutto/vaderSentiment)
42 | - [TextBlob](https://textblob.readthedocs.io/en/dev/)
43 | - **Summarization**
44 | - [Sumy](https://github.com/miso-belica/sumy)
45 | - [Hugging Face Transformers Summarization](https://huggingface.co/transformers/task_summary.html)
46 | - **Translation**
47 | - [MarianMT by Hugging Face](https://huggingface.co/transformers/model_doc/marian.html)
48 |
49 | ## Fine-Tuning
50 | - **Guides and Tutorials**
51 | - [Fine-Tuning GPT-3](https://platform.openai.com/docs/guides/fine-tuning)
52 | - [Hugging Face Fine-Tuning Tutorial](https://huggingface.co/transformers/training.html)
53 | - **Tools and Frameworks**
54 | - [Hugging Face Trainer](https://huggingface.co/transformers/main_classes/trainer.html)
55 | - [Fastai](https://docs.fast.ai/text.learner.html)
56 | - **Colab Notebooks**
57 | - [Fine-Tuning BERT on Colab](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb)
58 | - [Fine-Tuning GPT-2 on Colab](https://colab.research.google.com/github/fastai/course-v3/blob/master/nbs/dl2/12a_ulmfit.ipynb)
59 |
60 | ## Prompt Engineering
61 | - **Techniques and Best Practices**
62 | - [OpenAI Prompt Engineering Guide](https://platform.openai.com/docs/guides/completions/best-practices)
63 | - [Prompt Design for GPT-3](https://beta.openai.com/docs/guides/prompt-design)
64 | - **Tools**
65 | - [Prompt Designer](https://promptdesigner.com/)
66 | - [Prompt Engineering Toolkit](https://github.com/prompt-engineering/awesome-prompt-engineering)
67 | - **Examples and Case Studies**
68 | - [Awesome ChatGPT Prompts](https://github.com/f/awesome-chatgpt-prompts)
69 | - [GPT-3 Prompt Engineering Examples](https://github.com/shreyashankar/gpt-3-sandbox)
70 |
71 | ## Deployment
72 | - **Hosting Services**
73 | - [Hugging Face Inference API](https://huggingface.co/inference-api)
74 | - [AWS SageMaker](https://aws.amazon.com/sagemaker/)
75 | - **Serverless Deployments**
76 | - [Serverless GPT-3 with AWS Lambda](https://towardsdatascience.com/building-serverless-gpt-3-powered-apis-with-aws-lambda-f2d4b8a91058)
77 | - [Deploying on Vercel](https://vercel.com/guides/deploying-next-and-vercel-api-with-openai-gpt-3)
78 | - **Containerization**
79 | - [Dockerizing a GPT Model](https://medium.com/swlh/dockerize-your-gpt-3-chatbot-28dd48c19c91)
80 | - [Kubernetes for ML Deployments](https://towardsdatascience.com/kubernetes-for-machine-learning-6c7f5c5466a2)
81 |
82 | ## Monitoring and Logging
83 | - **Best Practices**
84 | - [Logging and Monitoring AI Models](https://www.dominodatalab.com/resources/whitepapers/logging-and-monitoring-for-machine-learning)
85 | - [Monitor Your NLP Models](https://towardsdatascience.com/monitor-your-nlp-models-40c2fb141a51)
86 |
87 | ## Ethics and Bias
88 | - **Frameworks and Guidelines**
89 | - [AI Ethics Guidelines Global Inventory](https://algorithmwatch.org/en/project/ai-ethics-guidelines-global-inventory/)
90 | - [Google AI Principles](https://ai.google/principles/)
91 | - **Tools**
92 | - [Fairness Indicators](https://www.tensorflow.org/tfx/guide/fairness_indicators)
93 | - [IBM AI Fairness 360](https://aif360.mybluemix.net/)
94 | - **Research Papers**
95 | - [Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification](http://gendershades.org/overview.html)
96 | - [AI Fairness and Bias](https://arxiv.org/abs/1908.09635)
97 |
98 |
99 |
100 |
--------------------------------------------------------------------------------
/cohorts/2024/01-intro/homework.md:
--------------------------------------------------------------------------------
1 | ## Homework: Introduction
2 |
3 | In this homework, we'll learn more about search and use Elastic Search for practice.
4 |
5 | > It's possible that your answers won't match exactly. If it's the case, select the closest one.
6 |
7 | ## Q1. Running Elastic
8 |
9 | Run Elastic Search 8.4.3, and get the cluster information. If you run it on localhost, this is how you do it:
10 |
11 | ```bash
12 | curl localhost:9200
13 | ```
14 |
15 | What's the `version.build_hash` value?
16 |
17 |
18 | ## Getting the data
19 |
20 | Now let's get the FAQ data. You can run this snippet:
21 |
22 | ```python
23 | import requests
24 |
25 | docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'
26 | docs_response = requests.get(docs_url)
27 | documents_raw = docs_response.json()
28 |
29 | documents = []
30 |
31 | for course in documents_raw:
32 | course_name = course['course']
33 |
34 | for doc in course['documents']:
35 | doc['course'] = course_name
36 | documents.append(doc)
37 | ```
38 |
39 | Note that you need to have the `requests` library:
40 |
41 | ```bash
42 | pip install requests
43 | ```
44 |
45 | ## Q2. Indexing the data
46 |
47 | Index the data in the same way as was shown in the course videos. Make the `course` field a keyword and the rest should be text.
48 |
49 | Don't forget to install the ElasticSearch client for Python:
50 |
51 | ```bash
52 | pip install elasticsearch
53 | ```
54 |
55 | Which function do you use for adding your data to elastic?
56 |
57 | * `insert`
58 | * `index`
59 | * `put`
60 | * `add`
61 |
62 | ## Q3. Searching
63 |
64 | Now let's search in our index.
65 |
66 | We will execute a query "How do I execute a command in a running docker container?".
67 |
68 | Use only `question` and `text` fields and give `question` a boost of 4, and use `"type": "best_fields"`.
69 |
70 | What's the score for the top ranking result?
71 |
72 | * 94.05
73 | * 84.05
74 | * 74.05
75 | * 64.05
76 |
77 | Look at the `_score` field.
78 |
79 | ## Q4. Filtering
80 |
81 | Now let's only limit the questions to `machine-learning-zoomcamp`.
82 |
83 | Return 3 results. What's the 3rd question returned by the search engine?
84 |
85 | * How do I debug a docker container?
86 | * How do I copy files from a different folder into docker container’s working directory?
87 | * How do Lambda container images work?
88 | * How can I annotate a graph?
89 |
90 | ## Q5. Building a prompt
91 |
92 | Now we're ready to build a prompt to send to an LLM.
93 |
94 | Take the records returned from Elasticsearch in Q4 and use this template to build the context. Separate context entries by two linebreaks (`\n\n`)
95 | ```python
96 | context_template = """
97 | Q: {question}
98 | A: {text}
99 | """.strip()
100 | ```
101 |
102 | Now use the context you just created along with the "How do I execute a command in a running docker container?" question
103 | to construct a prompt using the template below:
104 |
105 | ```
106 | prompt_template = """
107 | You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
108 | Use only the facts from the CONTEXT when answering the QUESTION.
109 |
110 | QUESTION: {question}
111 |
112 | CONTEXT:
113 | {context}
114 | """.strip()
115 | ```
116 |
117 | What's the length of the resulting prompt? (use the `len` function)
118 |
119 | * 962
120 | * 1462
121 | * 1962
122 | * 2462
123 |
124 | ## Q6. Tokens
125 |
126 | When we use the OpenAI Platform, we're charged by the number of
127 | tokens we send in our prompt and receive in the response.
128 |
129 | The OpenAI python package uses `tiktoken` for tokenization:
130 |
131 | ```bash
132 | pip install tiktoken
133 | ```
134 |
135 | Let's calculate the number of tokens in our query:
136 |
137 | ```python
138 | encoding = tiktoken.encoding_for_model("gpt-4o")
139 | ```
140 |
141 | Use the `encode` function. How many tokens does our prompt have?
142 |
143 | * 122
144 | * 222
145 | * 322
146 | * 422
147 |
148 | Note: to decode back a token into a word, you can use the `decode_single_token_bytes` function:
149 |
150 | ```python
151 | encoding.decode_single_token_bytes(63842)
152 | ```
153 |
154 | ## Bonus: generating the answer (ungraded)
155 |
156 | Let's send the prompt to OpenAI. What's the response?
157 |
158 | Note: you can replace OpenAI with Ollama. See module 2.
159 |
160 | ## Bonus: calculating the costs (ungraded)
161 |
162 | Suppose that on average per request we send 150 tokens and receive back 250 tokens.
163 |
164 | How much will it cost to run 1000 requests?
165 |
166 | You can see the prices [here](https://openai.com/api/pricing/)
167 |
168 | On June 17, the prices for gpt4o are:
169 |
170 | * Input: $0.005 / 1K tokens
171 | * Output: $0.015 / 1K tokens
172 |
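For example, with these prices one request costs 150/1000 × $0.005 + 250/1000 × $0.015 = $0.0045, so 1000 requests cost about $4.50.
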
173 | You can redo the calculations with the values you got in Q6 and Q7.
174 |
175 |
176 | ## Submit the results
177 |
178 | * Submit your results here: https://courses.datatalks.club/llm-zoomcamp-2025/homework/hw1
179 | * It's possible that your answers won't match exactly. If it's the case, select the closest one.
180 |
--------------------------------------------------------------------------------
/cohorts/2024/02-open-source/README.md:
--------------------------------------------------------------------------------
1 | # 2. Open-Source LLMs
2 |
3 | In the previous module, we used OpenAI via the OpenAI API. It's
4 | a very convenient way to use an LLM, but you have to pay
5 | for the usage, and you don't have control over the
6 | model you get to use.
7 |
8 | In this module, we'll look at using open-source LLMs instead.
9 |
10 | ## 2.1 Open-Source LLMs - Introduction
11 |
12 |
13 |
14 |
15 |
16 | * Open-Source LLMs
17 | * Replacing the LLM box in the RAG flow
18 |
19 | ## 2.2 Using a GPU in Saturn Cloud
20 |
21 |
22 |
23 |
24 |
25 | * Registering in Saturn Cloud
26 | * Configuring secrets and git
27 | * Creating an instance with a GPU
28 |
29 | ```bash
30 | pip install -U transformers accelerate bitsandbytes sentencepiece
31 | ```
32 |
33 | Links:
34 |
35 | * https://saturncloud.io/
36 | * https://github.com/DataTalksClub/llm-zoomcamp-saturncloud
37 |
38 | Google Colab as an alternative:
39 |
40 | * [Video](https://www.loom.com/share/591f39e4e231486bbfc3fbd316ec03c5)
41 | * [Notebook](https://colab.research.google.com/drive/1XmxUZutZXoAEdQZU45EXWPseBX9s2NRd)
42 |
43 | ## 2.3 FLAN-T5
44 |
45 |
46 |
47 |
48 |
49 | * Model: `google/flan-t5-xl`
50 | * Notebook: [huggingface-flan-t5.ipynb](huggingface-flan-t5.ipynb)
51 |
52 | ```python
53 | import os
54 | os.environ['HF_HOME'] = '/run/cache/'
55 | ```
56 |
57 | Links:
58 |
59 | * https://huggingface.co/google/flan-t5-xl
60 | * https://huggingface.co/docs/transformers/en/model_doc/flan-t5
61 |
62 | Explanation of Parameters:
63 |
64 | * `max_length`: Set this to a higher value if you want longer responses. For example, `max_length=300`.
65 | * `num_beams`: Increasing this can lead to more thorough exploration of possible sequences. Typical values are between 5 and 10.
66 | * `do_sample`: Set this to `True` to use sampling methods. This can produce more diverse responses.
67 | * `temperature`: Lowering this value makes the model more confident and deterministic, while higher values increase diversity. Typical values range from 0.7 to 1.5.
68 | * `top_k` and `top_p`: These parameters restrict sampling: `top_k` limits the sampling pool to the top `k` tokens, while `top_p` (nucleus sampling) uses cumulative probability to cut off the sampling pool. Adjust these based on the desired level of randomness (see the usage sketch below).
69 |
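A minimal sketch of how these parameters are passed to `generate()` using the standard transformers API, assuming a `prompt` string built from the question and context (see the notebook for the full version used in the video):

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl", device_map="auto")

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(
    input_ids,
    max_length=100,  # longer values allow longer answers
    num_beams=5,     # beam search; higher values explore more sequences
)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
```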
70 |
71 | ## 2.4 Phi 3 Mini
72 |
73 |
74 |
75 |
76 |
77 | * Model: `microsoft/Phi-3-mini-128k-instruct`
78 | * Notebook: [huggingface-phi3.ipynb](huggingface-phi3.ipynb)
79 |
80 |
81 | Links:
82 |
83 | * https://huggingface.co/microsoft/Phi-3-mini-128k-instruct
84 |
85 | ## 2.5 Mistral-7B and HuggingFace Hub Authentication
86 |
87 |
88 |
89 |
90 |
91 | * Model: `mistralai/Mistral-7B-v0.1`
92 | * Notebook: [huggingface-mistral-7b.ipynb](huggingface-mistral-7b.ipynb)
93 |
94 | [ChatGPT instructions for serving](serving-hugging-face-models.md)
95 |
96 |
97 | Links:
98 |
99 | * https://huggingface.co/docs/transformers/en/llm_tutorial
100 | * https://huggingface.co/settings/tokens
101 | * https://huggingface.co/mistralai/Mistral-7B-v0.1
102 |
103 |
104 | ## 2.6 Other models
105 |
106 |
107 |
108 |
109 |
110 | * [`LLM360/Amber`](https://huggingface.co/LLM360/Amber)
111 | * [Gemma-7B](https://huggingface.co/blog/gemma)
112 | * [SaulLM-7B](https://huggingface.co/papers/2403.03883)
113 | * [Granite-7B](https://huggingface.co/ibm-granite/granite-7b-base)
114 | * [MPT-7B](https://huggingface.co/mosaicml/mpt-7b)
115 | * [OpenLLaMA-7B](https://huggingface.co/openlm-research/open_llama_7b)
116 |
117 | Where to find them:
118 |
119 | * Leaderboards
120 | * Google
121 | * ChatGPT
122 |
123 | Links:
124 |
125 | * https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
126 | * https://huggingface.co/spaces/optimum/llm-perf-leaderboard
127 |
128 |
129 | ## 2.7 Ollama - Running LLMs on a CPU
130 |
131 |
132 |
133 |
134 |
135 | * The easiest way to run an LLM without a GPU is using [Ollama](https://github.com/ollama/ollama)
136 | * Notebook [ollama.ipynb](ollama.ipynb)
137 |
138 | For Linux:
139 |
140 | ```bash
141 | curl -fsSL https://ollama.com/install.sh | sh
142 |
143 | ollama start
144 | ollama pull phi3
145 | ollama run phi3
146 | ```
147 |
148 | [Prompt example](prompt.md)
149 |
150 | Connecting to it with the OpenAI API:
151 |
152 | ```python
153 | from openai import OpenAI
154 |
155 | client = OpenAI(
156 | base_url='http://localhost:11434/v1/',
157 | api_key='ollama',
158 | )
159 | ```
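
A quick sanity check through the same client (the model has to be pulled first; the prompt is just an example):

```python
response = client.chat.completions.create(
    model='phi3',
    messages=[{"role": "user", "content": "Write a haiku about retrieval"}]
)
print(response.choices[0].message.content)
```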
160 |
161 | Docker
162 |
163 | ```bash
164 | docker run -it \
165 | -v ollama:/root/.ollama \
166 | -p 11434:11434 \
167 | --name ollama \
168 | ollama/ollama
169 | ```
170 |
171 | Pulling the model
172 |
173 | ```bash
174 | docker exec -it ollama bash
175 | ollama pull phi3
176 | ```
177 |
178 |
179 | ## 2.8 Ollama & Phi3 + Elastic in Docker-Compose
180 |
181 |
182 |
183 |
184 |
185 | * Creating a Docker-Compose file
186 | * Re-running the module 1 notebook
187 |
188 | * Notebook: [rag-intro.ipynb](rag-intro.ipynb)
189 |
190 | ## 2.9 UI for RAG
191 |
192 |
193 |
194 |
195 |
196 | * Putting it in Streamlit
197 | * [Code](qa_faq.py)
198 |
199 | If you want to learn more about streamlit, you can
200 | use [this material](https://github.com/DataTalksClub/project-of-the-week/blob/main/2022-08-14-frontend.md)
201 | from [our repository with projects of the week](https://github.com/DataTalksClub/project-of-the-week/tree/main).
202 |
203 | ## Homework
204 |
205 | See [here](homework.md)
206 |
207 | # Notes
208 |
209 | * [Workaround by Pham Nguyen Hung to use ElasticSearch container with Saturn Cloud & Google Colab instead of minsearch](https://hung.bearblog.dev/llm-zoomcamp-zrok/)
210 | * [Notes by slavaheroes](https://github.com/slavaheroes/llm-zoomcamp/blob/homeworks/02-open-source/notes.md)
211 | * [Notes by Pham Nguyen Hung](https://hung.bearblog.dev/llm-zoomcamp-2-os/)
212 | * [Notes by Marat on Open-Sourced and Closed-Sourced Models and ways to run them](https://machine-mind-ml.medium.com/open-sourced-vs-closed-sourced-llms-2392c7db6e10)
213 | * [Notes by dimzachar](https://github.com/dimzachar/llm_zoomcamp/blob/master/notes/02-open-source/README.md)
214 | * [Notes by Waleed](https://waleedayoub.com/post/llmzoomcamp_week2-open-source_notes/)
215 | * Did you take notes? Add them above this line (Send a PR with *links* to your notes)
216 |
--------------------------------------------------------------------------------
/cohorts/2024/02-open-source/docker-compose.yaml:
--------------------------------------------------------------------------------
1 | version: '3.8'
2 |
3 | services:
4 | elasticsearch:
5 | image: docker.elastic.co/elasticsearch/elasticsearch:8.4.3
6 | container_name: elasticsearch
7 | environment:
8 | - discovery.type=single-node
9 | - xpack.security.enabled=false
10 | ports:
11 | - "9200:9200"
12 | - "9300:9300"
13 |
14 | ollama:
15 | image: ollama/ollama
16 | container_name: ollama
17 | volumes:
18 | - ollama:/root/.ollama
19 | ports:
20 | - "11434:11434"
21 |
22 | volumes:
23 | ollama:
24 |
--------------------------------------------------------------------------------
/cohorts/2024/02-open-source/homework.md:
--------------------------------------------------------------------------------
1 | ## Homework: Open-Source LLMs
2 |
3 | In this homework, we'll experiment more with Ollama
4 |
5 | > It's possible that your answers won't match exactly. If it's the case, select the closest one.
6 |
7 | Solution: https://www.loom.com/share/f04a63aaf0db4bf58194ba425f1fcffa
8 |
9 | ## Q1. Running Ollama with Docker
10 |
11 | Let's run ollama with Docker. We will need to execute the
12 | same command as in the lectures:
13 |
14 | ```bash
15 | docker run -it \
16 | --rm \
17 | -v ollama:/root/.ollama \
18 | -p 11434:11434 \
19 | --name ollama \
20 | ollama/ollama
21 | ```
22 |
23 | What's the version of ollama client?
24 |
25 | To find out, enter the container and execute `ollama` with the `-v` flag.
26 |
27 |
28 | ## Q2. Downloading an LLM
29 |
30 | We will download a smaller LLM - gemma:2b.
31 |
32 | Again let's enter the container and pull the model:
33 |
34 | ```bash
35 | ollama pull gemma:2b
36 | ```
37 |
38 | In docker, it saved the results into `/root/.ollama`
39 |
40 | We're interested in the metadata about this model. You can find
41 | it in `models/manifests/registry.ollama.ai/library`
42 |
43 | What's the content of the file related to gemma?
44 |
45 | ## Q3. Running the LLM
46 |
47 | Test the following prompt: "10 * 10". What's the answer?
48 |
49 | ## Q4. Downloading the weights
50 |
51 | We don't want to pull the weights every time we run
52 | a docker container. Let's do it once and have them available
53 | every time we start a container.
54 |
55 | First, we will need to change how we run the container.
56 |
57 | Instead of mapping the `/root/.ollama` folder to a named volume,
58 | let's map it to a local directory:
59 |
60 | ```bash
61 | mkdir ollama_files
62 |
63 | docker run -it \
64 | --rm \
65 | -v ./ollama_files:/root/.ollama \
66 | -p 11434:11434 \
67 | --name ollama \
68 | ollama/ollama
69 | ```
70 |
71 | Now pull the model:
72 |
73 | ```bash
74 | docker exec -it ollama ollama pull gemma:2b
75 | ```
76 |
77 | What's the size of the `ollama_files/models` folder?
78 |
79 | * 0.6G
80 | * 1.2G
81 | * 1.7G
82 | * 2.2G
83 |
84 | Hint: on linux, you can use `du -h` for that.
85 |
86 | ## Q5. Adding the weights
87 |
88 | Let's now stop the container and add the weights
89 | to a new image
90 |
91 | For that, let's create a `Dockerfile`:
92 |
93 | ```dockerfile
94 | FROM ollama/ollama
95 |
96 | COPY ...
97 | ```
98 |
99 | What do you put after `COPY`?
100 |
101 | ## Q6. Serving it
102 |
103 | Let's build it:
104 |
105 | ```bash
106 | docker build -t ollama-gemma2b .
107 | ```
108 |
109 | And run it:
110 |
111 | ```bash
112 | docker run -it --rm -p 11434:11434 ollama-gemma2b
113 | ```
114 |
115 | We can connect to it using the OpenAI client
116 |
117 | Let's test it with the following prompt:
118 |
119 | ```python
120 | prompt = "What's the formula for energy?"
121 | ```
122 |
123 | Also, to make results reproducible, set the `temperature` parameter to 0:
124 |
125 | ```python
126 | response = client.chat.completions.create(
127 | #...
128 | temperature=0.0
129 | )
130 | ```
131 |
132 | How many completion tokens did you get in response?
133 |
134 | * 304
135 | * 604
136 | * 904
137 | * 1204
138 |
139 | ## Submit the results
140 |
141 | * Submit your results here: https://courses.datatalks.club/llm-zoomcamp-2024/homework/hw2
142 | * It's possible that your answers won't match exactly. If it's the case, select the closest one.
143 |
--------------------------------------------------------------------------------
/cohorts/2024/02-open-source/ollama.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 2,
6 | "id": "506fab2a-a50c-42bd-a106-c83a9d2828ea",
7 | "metadata": {},
8 | "outputs": [
9 | {
10 | "name": "stderr",
11 | "output_type": "stream",
12 | "text": [
13 | "--2024-06-13 13:53:24-- https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py\n",
14 | "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...\n",
15 | "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.\n",
16 | "HTTP request sent, awaiting response... 200 OK\n",
17 | "Length: 3832 (3.7K) [text/plain]\n",
18 | "Saving to: 'minsearch.py'\n",
19 | "\n",
20 | " 0K ... 100% 579K=0.006s\n",
21 | "\n",
22 | "2024-06-13 13:53:24 (579 KB/s) - 'minsearch.py' saved [3832/3832]\n",
23 | "\n"
24 | ]
25 | }
26 | ],
27 | "source": [
28 | "!rm -f minsearch.py\n",
29 | "!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": 3,
35 | "id": "3ac947de-effd-4b61-8792-a6d7a133f347",
36 | "metadata": {},
37 | "outputs": [
38 | {
39 | "data": {
40 | "text/plain": [
41 | ""
42 | ]
43 | },
44 | "execution_count": 3,
45 | "metadata": {},
46 | "output_type": "execute_result"
47 | }
48 | ],
49 | "source": [
50 | "import requests \n",
51 | "import minsearch\n",
52 | "\n",
53 | "docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'\n",
54 | "docs_response = requests.get(docs_url)\n",
55 | "documents_raw = docs_response.json()\n",
56 | "\n",
57 | "documents = []\n",
58 | "\n",
59 | "for course in documents_raw:\n",
60 | " course_name = course['course']\n",
61 | "\n",
62 | " for doc in course['documents']:\n",
63 | " doc['course'] = course_name\n",
64 | " documents.append(doc)\n",
65 | "\n",
66 | "index = minsearch.Index(\n",
67 | " text_fields=[\"question\", \"text\", \"section\"],\n",
68 | " keyword_fields=[\"course\"]\n",
69 | ")\n",
70 | "\n",
71 | "index.fit(documents)"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": 4,
77 | "id": "8f087272-b44d-4738-9ea2-175ec63a058b",
78 | "metadata": {},
79 | "outputs": [],
80 | "source": [
81 | "def search(query):\n",
82 | " boost = {'question': 3.0, 'section': 0.5}\n",
83 | "\n",
84 | " results = index.search(\n",
85 | " query=query,\n",
86 | " filter_dict={'course': 'data-engineering-zoomcamp'},\n",
87 | " boost_dict=boost,\n",
88 | " num_results=5\n",
89 | " )\n",
90 | "\n",
91 | " return results"
92 | ]
93 | },
94 | {
95 | "cell_type": "code",
96 | "execution_count": 8,
97 | "id": "742ab881-499a-4675-83c4-2013ea1377b9",
98 | "metadata": {},
99 | "outputs": [],
100 | "source": [
101 | "def build_prompt(query, search_results):\n",
102 | " prompt_template = \"\"\"\n",
103 | "You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.\n",
104 | "Use only the facts from the CONTEXT when answering the QUESTION.\n",
105 | "\n",
106 | "QUESTION: {question}\n",
107 | "\n",
108 | "CONTEXT: \n",
109 | "{context}\n",
110 | "\"\"\".strip()\n",
111 | "\n",
112 | " context = \"\"\n",
113 | " \n",
114 | " for doc in search_results:\n",
115 | " context = context + f\"section: {doc['section']}\\nquestion: {doc['question']}\\nanswer: {doc['text']}\\n\\n\"\n",
116 | " \n",
117 | " prompt = prompt_template.format(question=query, context=context).strip()\n",
118 | " return prompt\n",
119 | "\n",
120 | "def llm(prompt):\n",
121 | " response = client.chat.completions.create(\n",
122 | " model='phi3',\n",
123 | " messages=[{\"role\": \"user\", \"content\": prompt}]\n",
124 | " )\n",
125 | " \n",
126 | " return response.choices[0].message.content"
127 | ]
128 | },
129 | {
130 | "cell_type": "code",
131 | "execution_count": 6,
132 | "id": "fe8bff3e-b672-42be-866b-f2d9bb217106",
133 | "metadata": {},
134 | "outputs": [],
135 | "source": [
136 | "def rag(query):\n",
137 | " search_results = search(query)\n",
138 | " prompt = build_prompt(query, search_results)\n",
139 | " answer = llm(prompt)\n",
140 | " return answer"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": 7,
146 | "id": "091a77e6-936b-448e-a04b-bad1001f5bb0",
147 | "metadata": {},
148 | "outputs": [],
149 | "source": [
150 | "from openai import OpenAI\n",
151 | "\n",
152 | "client = OpenAI(\n",
153 | " base_url='http://localhost:11434/v1/',\n",
154 | " api_key='ollama',\n",
155 | ")"
156 | ]
157 | },
158 | {
159 | "cell_type": "code",
160 | "execution_count": 12,
161 | "id": "9ee527a3-3331-4f4e-b6c8-f659ffc113f5",
162 | "metadata": {},
163 | "outputs": [
164 | {
165 | "data": {
166 | "text/plain": [
167 | "' This statement serves as an example to verify the functionality of various systems, such as text processing software or programming functions. It\\'s commonly used by developers during debugging sessions to ensure commands are working correctly without producing any unintended output.\\n\\nHere\\'s how you might include it in different contexts:\\n\\n**1. Using it as a command line test in a script:**\\nIf writing a shell script or using command-line tools, the statement can be inserted directly to demonstrate functionality. For instance, using `echo` on Unix-like systems:\\n```bash\\n#!/bin/bash\\necho \"This is a test\"\\necho \"This is also a test for confirmation.\"\\n```\\n\\n**2. Inserting it into a programming function as a placeholder or comment (in Python):**\\nAs a comment in code to remind future developers that the block can be replaced with actual implementation:\\n```python\\ndef process_text(input_string):\\n # Test input: \"This is a test\"\\n print(\"Testing...\")\\n # Replace this line with your processing logic\\n return input_string.upper() # Example operation\\n```\\n\\n**3. Using in documentation or comments within software development code:**\\nDemonstrate how the statement can be used to clarify intentions when developing software, such as in a README file or inline comment:\\n```markdown\\n# Test Command Functionality\\nThis section contains commands that serve to test system functionality.\\n`echo \"This is a test\"` - A simple command to check output behavior.\\n```\\n\\nIn each case, the statement `This is a test` fulfills its role as a straightforward demonstration or placeholder within development and testing workflows.'"
168 | ]
169 | },
170 | "execution_count": 12,
171 | "metadata": {},
172 | "output_type": "execute_result"
173 | }
174 | ],
175 | "source": [
176 | "llm('write that this is a test')"
177 | ]
178 | },
179 | {
180 | "cell_type": "code",
181 | "execution_count": 13,
182 | "id": "21aa255e-c971-44ca-9826-a721df3ad063",
183 | "metadata": {},
184 | "outputs": [
185 | {
186 | "name": "stdout",
187 | "output_type": "stream",
188 | "text": [
189 | " This statement serves as an example to verify the functionality of various systems, such as text processing software or programming functions. It's commonly used by developers during debugging sessions to ensure commands are working correctly without producing any unintended output.\n",
190 | "\n",
191 | "Here's how you might include it in different contexts:\n",
192 | "\n",
193 | "**1. Using it as a command line test in a script:**\n",
194 | "If writing a shell script or using command-line tools, the statement can be inserted directly to demonstrate functionality. For instance, using `echo` on Unix-like systems:\n",
195 | "```bash\n",
196 | "#!/bin/bash\n",
197 | "echo \"This is a test\"\n",
198 | "echo \"This is also a test for confirmation.\"\n",
199 | "```\n",
200 | "\n",
201 | "**2. Inserting it into a programming function as a placeholder or comment (in Python):**\n",
202 | "As a comment in code to remind future developers that the block can be replaced with actual implementation:\n",
203 | "```python\n",
204 | "def process_text(input_string):\n",
205 | " # Test input: \"This is a test\"\n",
206 | " print(\"Testing...\")\n",
207 | " # Replace this line with your processing logic\n",
208 | " return input_string.upper() # Example operation\n",
209 | "```\n",
210 | "\n",
211 | "**3. Using in documentation or comments within software development code:**\n",
212 | "Demonstrate how the statement can be used to clarify intentions when developing software, such as in a README file or inline comment:\n",
213 | "```markdown\n",
214 | "# Test Command Functionality\n",
215 | "This section contains commands that serve to test system functionality.\n",
216 | "`echo \"This is a test\"` - A simple command to check output behavior.\n",
217 | "```\n",
218 | "\n",
219 | "In each case, the statement `This is a test` fulfills its role as a straightforward demonstration or placeholder within development and testing workflows.\n"
220 | ]
221 | }
222 | ],
223 | "source": [
224 | "print(_)"
225 | ]
226 | },
227 | {
228 | "cell_type": "code",
229 | "execution_count": null,
230 | "id": "988ece59-951a-4b32-ba3f-cb8efb66a9bb",
231 | "metadata": {},
232 | "outputs": [],
233 | "source": []
234 | }
235 | ],
236 | "metadata": {
237 | "kernelspec": {
238 | "display_name": "Python 3 (ipykernel)",
239 | "language": "python",
240 | "name": "python3"
241 | },
242 | "language_info": {
243 | "codemirror_mode": {
244 | "name": "ipython",
245 | "version": 3
246 | },
247 | "file_extension": ".py",
248 | "mimetype": "text/x-python",
249 | "name": "python",
250 | "nbconvert_exporter": "python",
251 | "pygments_lexer": "ipython3",
252 | "version": "3.11.9"
253 | }
254 | },
255 | "nbformat": 4,
256 | "nbformat_minor": 5
257 | }
258 |
--------------------------------------------------------------------------------
/cohorts/2024/02-open-source/prompt.md:
--------------------------------------------------------------------------------
1 |
2 | Question: I just discovered the couse. can i still enrol
3 |
4 | Context:
5 |
6 | Course - Can I still join the course after the start date?
7 | Yes, even if you don't register, you're still eligible to submit the homeworks.
8 | Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.
9 |
10 | Environment - Is Python 3.9 still the recommended version to use in 2024?
11 | Yes, for simplicity (of troubleshooting against the recorded videos) and stability. [source]
12 | But Python 3.10 and 3.11 should work fine.
13 |
14 | How can we contribute to the course?
15 | Star the repo! Share it with friends if you find it useful ❣️
16 | Create a PR if you see you can improve the text or the structure of the repository.
17 |
18 | Are we still using the NYC Trip data for January 2021? Or are we using the 2022 data?
19 | We will use the same data, as the project will essentially remain the same as last year’s. The data is available here
20 |
21 | Docker-Compose - docker-compose still not available after changing .bashrc
22 | This is happen to me after following 1.4.1 video where we are installing docker compose in our Google Cloud VM. In my case, the docker-compose file downloaded from github named docker-compose-linux-x86_64 while it is more convenient to use docker-compose command instead. So just change the docker-compose-linux-x86_64 into docker-compose.
23 |
24 | Answer:
--------------------------------------------------------------------------------
/cohorts/2024/02-open-source/qa_faq.py:
--------------------------------------------------------------------------------
1 | import streamlit as st
2 | import time
3 |
4 | from elasticsearch import Elasticsearch
5 | from openai import OpenAI
6 |
7 | client = OpenAI(
8 | base_url='http://localhost:11434/v1/',
9 | api_key='ollama',
10 | )
11 |
12 | es_client = Elasticsearch('http://localhost:9200')
13 |
14 |
15 | def elastic_search(query, index_name = "course-questions"):
16 | search_query = {
17 | "size": 5,
18 | "query": {
19 | "bool": {
20 | "must": {
21 | "multi_match": {
22 | "query": query,
23 | "fields": ["question^3", "text", "section"],
24 | "type": "best_fields"
25 | }
26 | },
27 | "filter": {
28 | "term": {
29 | "course": "data-engineering-zoomcamp"
30 | }
31 | }
32 | }
33 | }
34 | }
35 |
36 | response = es_client.search(index=index_name, body=search_query)
37 |
38 | result_docs = []
39 |
40 | for hit in response['hits']['hits']:
41 | result_docs.append(hit['_source'])
42 |
43 | return result_docs
44 |
45 |
46 | def build_prompt(query, search_results):
47 | prompt_template = """
48 | You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
49 | Use only the facts from the CONTEXT when answering the QUESTION.
50 |
51 | QUESTION: {question}
52 |
53 | CONTEXT:
54 | {context}
55 | """.strip()
56 |
57 | context = ""
58 |
59 | for doc in search_results:
60 | context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"
61 |
62 | prompt = prompt_template.format(question=query, context=context).strip()
63 | return prompt
64 |
65 | def llm(prompt):
66 | response = client.chat.completions.create(
67 | model='phi3',
68 | messages=[{"role": "user", "content": prompt}]
69 | )
70 |
71 | return response.choices[0].message.content
72 |
73 |
74 | def rag(query):
75 | search_results = elastic_search(query)
76 | prompt = build_prompt(query, search_results)
77 | answer = llm(prompt)
78 | return answer
79 |
80 |
81 | def main():
82 | st.title("RAG Function Invocation")
83 |
84 | user_input = st.text_input("Enter your input:")
85 |
86 | if st.button("Ask"):
87 | with st.spinner('Processing...'):
88 | output = rag(user_input)
89 | st.success("Completed!")
90 | st.write(output)
91 |
92 | if __name__ == "__main__":
93 | main()
94 |
--------------------------------------------------------------------------------
/cohorts/2024/02-open-source/rag-intro.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 3,
6 | "id": "ef8e9cdc-dfd4-4e54-a332-4b9bde4e6047",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "from openai import OpenAI\n",
11 | "\n",
12 | "client = OpenAI(\n",
13 | " base_url='http://localhost:11434/v1/',\n",
14 | " api_key='ollama',\n",
15 | ")"
16 | ]
17 | },
18 | {
19 | "cell_type": "code",
20 | "execution_count": 7,
21 | "id": "2c05052f-a85a-4137-8398-0fd0be678599",
22 | "metadata": {},
23 | "outputs": [],
24 | "source": [
25 | "from elasticsearch import Elasticsearch"
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": 8,
31 | "id": "a78df1cc-5a5a-40b4-b673-19c7f0319453",
32 | "metadata": {},
33 | "outputs": [],
34 | "source": [
35 | "es_client = Elasticsearch('http://localhost:9200') "
36 | ]
37 | },
38 | {
39 | "cell_type": "code",
40 | "execution_count": 9,
41 | "id": "c9367c18-41ad-495e-9920-1a0c552f0d18",
42 | "metadata": {},
43 | "outputs": [
44 | {
45 | "data": {
46 | "text/plain": [
47 | "ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})"
48 | ]
49 | },
50 | "execution_count": 9,
51 | "metadata": {},
52 | "output_type": "execute_result"
53 | }
54 | ],
55 | "source": [
56 | "index_settings = {\n",
57 | " \"settings\": {\n",
58 | " \"number_of_shards\": 1,\n",
59 | " \"number_of_replicas\": 0\n",
60 | " },\n",
61 | " \"mappings\": {\n",
62 | " \"properties\": {\n",
63 | " \"text\": {\"type\": \"text\"},\n",
64 | " \"section\": {\"type\": \"text\"},\n",
65 | " \"question\": {\"type\": \"text\"},\n",
66 | " \"course\": {\"type\": \"keyword\"} \n",
67 | " }\n",
68 | " }\n",
69 | "}\n",
70 | "\n",
71 | "index_name = \"course-questions\"\n",
72 | "\n",
73 | "es_client.indices.create(index=index_name, body=index_settings)"
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": 10,
79 | "id": "e131e3a3-4051-4fd7-8e4d-d17c2af2ad75",
80 | "metadata": {},
81 | "outputs": [],
82 | "source": [
83 | "import requests \n",
84 | "\n",
85 | "docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'\n",
86 | "docs_response = requests.get(docs_url)\n",
87 | "documents_raw = docs_response.json()\n",
88 | "\n",
89 | "documents = []\n",
90 | "\n",
91 | "for course in documents_raw:\n",
92 | " course_name = course['course']\n",
93 | "\n",
94 | " for doc in course['documents']:\n",
95 | " doc['course'] = course_name\n",
96 | " documents.append(doc)"
97 | ]
98 | },
99 | {
100 | "cell_type": "code",
101 | "execution_count": 12,
102 | "id": "5c230059-e219-4a13-a7f8-ede4cf1b028f",
103 | "metadata": {},
104 | "outputs": [],
105 | "source": [
106 | "from tqdm.auto import tqdm"
107 | ]
108 | },
109 | {
110 | "cell_type": "code",
111 | "execution_count": 13,
112 | "id": "70fe3c97-916d-42c0-bd7b-4f42d9056409",
113 | "metadata": {},
114 | "outputs": [
115 | {
116 | "data": {
117 | "application/vnd.jupyter.widget-view+json": {
118 | "model_id": "d74c0925b2eb48b8b301b1f418b6938a",
119 | "version_major": 2,
120 | "version_minor": 0
121 | },
122 | "text/plain": [
123 | " 0%| | 0/948 [00:00, ?it/s]"
124 | ]
125 | },
126 | "metadata": {},
127 | "output_type": "display_data"
128 | }
129 | ],
130 | "source": [
131 | "for doc in tqdm(documents):\n",
132 | " es_client.index(index=index_name, document=doc)"
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": 17,
138 | "id": "7c72e000-910b-4fb5-aa88-2561e7bc39f2",
139 | "metadata": {},
140 | "outputs": [],
141 | "source": [
142 | "def elastic_search(query):\n",
143 | " search_query = {\n",
144 | " \"size\": 5,\n",
145 | " \"query\": {\n",
146 | " \"bool\": {\n",
147 | " \"must\": {\n",
148 | " \"multi_match\": {\n",
149 | " \"query\": query,\n",
150 | " \"fields\": [\"question^3\", \"text\", \"section\"],\n",
151 | " \"type\": \"best_fields\"\n",
152 | " }\n",
153 | " },\n",
154 | " \"filter\": {\n",
155 | " \"term\": {\n",
156 | " \"course\": \"data-engineering-zoomcamp\"\n",
157 | " }\n",
158 | " }\n",
159 | " }\n",
160 | " }\n",
161 | " }\n",
162 | "\n",
163 | " response = es_client.search(index=index_name, body=search_query)\n",
164 | " \n",
165 | " result_docs = []\n",
166 | " \n",
167 | " for hit in response['hits']['hits']:\n",
168 | " result_docs.append(hit['_source'])\n",
169 | " \n",
170 | " return result_docs"
171 | ]
172 | },
173 | {
174 | "cell_type": "code",
175 | "execution_count": 18,
176 | "id": "bdc51143-f861-46e2-8a8f-e5c6a324f53b",
177 | "metadata": {},
178 | "outputs": [],
179 | "source": [
180 | "def build_prompt(query, search_results):\n",
181 | " prompt_template = \"\"\"\n",
182 | "You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.\n",
183 | "Use only the facts from the CONTEXT when answering the QUESTION.\n",
184 | "\n",
185 | "QUESTION: {question}\n",
186 | "\n",
187 | "CONTEXT: \n",
188 | "{context}\n",
189 | "\"\"\".strip()\n",
190 | "\n",
191 | " context = \"\"\n",
192 | " \n",
193 | " for doc in search_results:\n",
194 | " context = context + f\"section: {doc['section']}\\nquestion: {doc['question']}\\nanswer: {doc['text']}\\n\\n\"\n",
195 | " \n",
196 | " prompt = prompt_template.format(question=query, context=context).strip()\n",
197 | " return prompt\n",
198 | "\n",
199 | "def llm(prompt):\n",
200 | " response = client.chat.completions.create(\n",
201 | " model='phi3',\n",
202 | " messages=[{\"role\": \"user\", \"content\": prompt}]\n",
203 | " )\n",
204 | " \n",
205 | " return response.choices[0].message.content"
206 | ]
207 | },
208 | {
209 | "cell_type": "code",
210 | "execution_count": 19,
211 | "id": "81abecbc-eb6b-428f-ab7d-7e21f58b64de",
212 | "metadata": {},
213 | "outputs": [],
214 | "source": [
215 | "def rag(query):\n",
216 | " search_results = elastic_search(query)\n",
217 | " prompt = build_prompt(query, search_results)\n",
218 | " answer = llm(prompt)\n",
219 | " return answer"
220 | ]
221 | },
222 | {
223 | "cell_type": "code",
224 | "execution_count": 21,
225 | "id": "0ea9315a-a619-4066-9e90-8c260f2c8450",
226 | "metadata": {},
227 | "outputs": [
228 | {
229 | "data": {
230 | "text/plain": [
231 | "' Yes, you can still join the course even if you discover it after the start date. There will be deadlines for turning in final projects, but materials and support are available for those who start later. Additionally, we keep all the materials after the course finishes, so you can follow the course at your own pace.'"
232 | ]
233 | },
234 | "execution_count": 21,
235 | "metadata": {},
236 | "output_type": "execute_result"
237 | }
238 | ],
239 | "source": [
240 | "query = 'I just disovered the course. Can I still join it?'\n",
241 | "rag(query)"
242 | ]
243 | },
244 | {
245 | "cell_type": "code",
246 | "execution_count": null,
247 | "id": "d8095274-c9cd-4fd5-80d2-069fc951834f",
248 | "metadata": {},
249 | "outputs": [],
250 | "source": []
251 | }
252 | ],
253 | "metadata": {
254 | "kernelspec": {
255 | "display_name": "Python 3 (ipykernel)",
256 | "language": "python",
257 | "name": "python3"
258 | },
259 | "language_info": {
260 | "codemirror_mode": {
261 | "name": "ipython",
262 | "version": 3
263 | },
264 | "file_extension": ".py",
265 | "mimetype": "text/x-python",
266 | "name": "python",
267 | "nbconvert_exporter": "python",
268 | "pygments_lexer": "ipython3",
269 | "version": "3.11.9"
270 | }
271 | },
272 | "nbformat": 4,
273 | "nbformat_minor": 5
274 | }
275 |
--------------------------------------------------------------------------------
/cohorts/2024/02-open-source/serving-hugging-face-models.md:
--------------------------------------------------------------------------------
1 | # Using Mistral-7B Model in Production
2 |
3 | To use the Mistral-7B model from Hugging Face Hub in production without relying on the Hugging Face servers, you can download the model weights and set up a local environment to serve the model. Here’s a step-by-step guide on how to do this:
4 |
5 | ### Step 1: Authenticate and Accept the User Agreement
6 |
7 | Before downloading the model, you need to authenticate with Hugging Face and accept the user agreement for the Mistral-7B model.
8 |
9 | 1. **Create a Hugging Face account** if you don't already have one.
10 | 2. **Accept the model's user agreement** by visiting the model page on Hugging Face Hub and clicking on "Agree and Access".
11 |
12 | ### Step 2: Install Hugging Face Transformers and Other Dependencies
13 |
14 | You need to install the `transformers` library along with `torch` (PyTorch) to download and use the model. You can install these using pip:
15 |
16 | ```bash
17 | pip install transformers torch
18 | ```
19 |
20 | ### Step 3: Authenticate with Hugging Face CLI
21 |
22 | You need to log in to the Hugging Face CLI to download the model weights.
23 |
24 | ```bash
25 | huggingface-cli login
26 | ```
27 |
28 | Follow the instructions to authenticate using your Hugging Face credentials.
29 |
30 | ### Step 4: Download the Model
31 |
32 | Use the `transformers` library to download the model and tokenizer. This will download the model weights to your local machine.
33 |
34 | ```python
35 | from transformers import AutoModelForCausalLM, AutoTokenizer
36 |
37 | model_name = "mistralai/Mistral-7B-v0.1"
38 |
39 | # Download and cache the model and tokenizer
40 | model = AutoModelForCausalLM.from_pretrained(model_name)
41 | tokenizer = AutoTokenizer.from_pretrained(model_name)
42 |
43 | # Save the model and tokenizer locally
44 | model.save_pretrained("./mistral-7b-model")
45 | tokenizer.save_pretrained("./mistral-7b-tokenizer")
46 | ```
47 |
48 | ### Step 5: Load the Model Locally
49 |
50 | Now, you can load the model from the saved directory without needing to connect to Hugging Face.
51 |
52 | ```python
53 | from transformers import AutoModelForCausalLM, AutoTokenizer
54 |
55 | model_dir = "./mistral-7b-model"
56 | tokenizer_dir = "./mistral-7b-tokenizer"
57 |
58 | # Load the model and tokenizer from the local directory
59 | model = AutoModelForCausalLM.from_pretrained(model_dir)
60 | tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
61 | ```
62 |
63 | ### Step 6: Serve the Model
64 |
65 | To serve the model in production, you can set up a web service using frameworks like Flask, FastAPI, or any other preferred web server framework. Here’s an example using FastAPI:
66 |
67 | ```python
68 | from fastapi import FastAPI, Request
69 | from pydantic import BaseModel
70 | from transformers import AutoModelForCausalLM, AutoTokenizer
71 | import torch
72 |
73 | app = FastAPI()
74 |
75 | model_dir = "./mistral-7b-model"
76 | tokenizer_dir = "./mistral-7b-tokenizer"
77 |
78 | model = AutoModelForCausalLM.from_pretrained(model_dir)
79 | tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
80 |
81 | class TextGenerationRequest(BaseModel):
82 | text: str
83 | max_length: int = 50
84 |
85 | @app.post("/generate")
86 | async def generate_text(request: TextGenerationRequest):
87 | inputs = tokenizer(request.text, return_tensors="pt")
88 | outputs = model.generate(inputs["input_ids"], max_length=request.max_length)
89 | generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
90 | return {"generated_text": generated_text}
91 |
92 | if __name__ == "__main__":
93 | import uvicorn
94 | uvicorn.run(app, host="0.0.0.0", port=8000)
95 | ```
96 |
97 | ### Step 7: Deploy the Service
98 |
99 | You can deploy this FastAPI app using a web server like Uvicorn or Gunicorn, and use a process manager like Supervisor or systemd to keep it running in production. You may also consider containerizing your application using Docker for easier deployment and scalability.
100 |
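For example, here is a minimal Dockerfile sketch, assuming the FastAPI code above is saved as `main.py` next to the saved model and tokenizer directories:

```dockerfile
FROM python:3.11-slim

WORKDIR /app

RUN pip install --no-cache-dir fastapi uvicorn transformers torch

# copy the locally saved weights and the serving code
COPY mistral-7b-model ./mistral-7b-model
COPY mistral-7b-tokenizer ./mistral-7b-tokenizer
COPY main.py .

EXPOSE 8000

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
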
101 | ### Summary
102 |
103 | By following these steps, you can download the model weights, set up a local environment, and serve the model in production without depending on Hugging Face's servers. This ensures that your service remains available even if Hugging Face experiences downtime.
104 |
--------------------------------------------------------------------------------
/cohorts/2024/02-open-source/starter.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 2,
6 | "id": "506fab2a-a50c-42bd-a106-c83a9d2828ea",
7 | "metadata": {},
8 | "outputs": [
9 | {
10 | "name": "stderr",
11 | "output_type": "stream",
12 | "text": [
13 | "--2024-06-13 13:53:24-- https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py\n",
14 | "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...\n",
15 | "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.\n",
16 | "HTTP request sent, awaiting response... 200 OK\n",
17 | "Length: 3832 (3.7K) [text/plain]\n",
18 | "Saving to: 'minsearch.py'\n",
19 | "\n",
20 | " 0K ... 100% 579K=0.006s\n",
21 | "\n",
22 | "2024-06-13 13:53:24 (579 KB/s) - 'minsearch.py' saved [3832/3832]\n",
23 | "\n"
24 | ]
25 | }
26 | ],
27 | "source": [
28 | "!rm -f minsearch.py\n",
29 | "!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": 3,
35 | "id": "3ac947de-effd-4b61-8792-a6d7a133f347",
36 | "metadata": {},
37 | "outputs": [
38 | {
39 | "data": {
40 | "text/plain": [
41 | ""
42 | ]
43 | },
44 | "execution_count": 3,
45 | "metadata": {},
46 | "output_type": "execute_result"
47 | }
48 | ],
49 | "source": [
50 | "import requests \n",
51 | "import minsearch\n",
52 | "\n",
53 | "docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'\n",
54 | "docs_response = requests.get(docs_url)\n",
55 | "documents_raw = docs_response.json()\n",
56 | "\n",
57 | "documents = []\n",
58 | "\n",
59 | "for course in documents_raw:\n",
60 | " course_name = course['course']\n",
61 | "\n",
62 | " for doc in course['documents']:\n",
63 | " doc['course'] = course_name\n",
64 | " documents.append(doc)\n",
65 | "\n",
66 | "index = minsearch.Index(\n",
67 | " text_fields=[\"question\", \"text\", \"section\"],\n",
68 | " keyword_fields=[\"course\"]\n",
69 | ")\n",
70 | "\n",
71 | "index.fit(documents)"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": 4,
77 | "id": "8f087272-b44d-4738-9ea2-175ec63a058b",
78 | "metadata": {},
79 | "outputs": [],
80 | "source": [
81 | "def search(query):\n",
82 | " boost = {'question': 3.0, 'section': 0.5}\n",
83 | "\n",
84 | " results = index.search(\n",
85 | " query=query,\n",
86 | " filter_dict={'course': 'data-engineering-zoomcamp'},\n",
87 | " boost_dict=boost,\n",
88 | " num_results=5\n",
89 | " )\n",
90 | "\n",
91 | " return results"
92 | ]
93 | },
94 | {
95 | "cell_type": "code",
96 | "execution_count": 8,
97 | "id": "742ab881-499a-4675-83c4-2013ea1377b9",
98 | "metadata": {},
99 | "outputs": [],
100 | "source": [
101 | "def build_prompt(query, search_results):\n",
102 | " prompt_template = \"\"\"\n",
103 | "You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.\n",
104 | "Use only the facts from the CONTEXT when answering the QUESTION.\n",
105 | "\n",
106 | "QUESTION: {question}\n",
107 | "\n",
108 | "CONTEXT: \n",
109 | "{context}\n",
110 | "\"\"\".strip()\n",
111 | "\n",
112 | " context = \"\"\n",
113 | " \n",
114 | " for doc in search_results:\n",
115 | " context = context + f\"section: {doc['section']}\\nquestion: {doc['question']}\\nanswer: {doc['text']}\\n\\n\"\n",
116 | " \n",
117 | " prompt = prompt_template.format(question=query, context=context).strip()\n",
118 | " return prompt\n",
119 | "\n",
120 | "def llm(prompt):\n",
121 | " response = client.chat.completions.create(\n",
122 | " model='gpt-4o',\n",
123 | " messages=[{\"role\": \"user\", \"content\": prompt}]\n",
124 | " )\n",
125 | " \n",
126 | " return response.choices[0].message.content"
127 | ]
128 | },
129 | {
130 | "cell_type": "code",
131 | "execution_count": 6,
132 | "id": "fe8bff3e-b672-42be-866b-f2d9bb217106",
133 | "metadata": {},
134 | "outputs": [],
135 | "source": [
136 | "def rag(query):\n",
137 | " search_results = search(query)\n",
138 | " prompt = build_prompt(query, search_results)\n",
139 | " answer = llm(prompt)\n",
140 | " return answer"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": null,
146 | "id": "988ece59-951a-4b32-ba3f-cb8efb66a9bb",
147 | "metadata": {},
148 | "outputs": [],
149 | "source": []
150 | }
151 | ],
152 | "metadata": {
153 | "kernelspec": {
154 | "display_name": "Python 3 (ipykernel)",
155 | "language": "python",
156 | "name": "python3"
157 | },
158 | "language_info": {
159 | "codemirror_mode": {
160 | "name": "ipython",
161 | "version": 3
162 | },
163 | "file_extension": ".py",
164 | "mimetype": "text/x-python",
165 | "name": "python",
166 | "nbconvert_exporter": "python",
167 | "pygments_lexer": "ipython3",
168 | "version": "3.11.9"
169 | }
170 | },
171 | "nbformat": 4,
172 | "nbformat_minor": 5
173 | }
174 |
--------------------------------------------------------------------------------
/cohorts/2024/03-vector-search/README.md:
--------------------------------------------------------------------------------
1 | # Vector Search
2 |
3 | ## 3.1 Introduction to Vector Search
4 |
5 |
6 |
7 |
8 |
9 | * [Slides](https://github.com/dataML007/elastic_search/blob/main/Introduction%20to%20Vector%20DB.pdf)
10 |
11 |
12 | ## 3.2 Semantic Search with Elasticsearch
13 |
14 |
15 |
16 |
17 |
18 | * Notebook: [demo_es.ipynb](demo_es.ipynb)
19 |
20 | ### 3.2.2 Advanced Semantic Search
21 |
22 |
23 |
24 |
25 |
26 |
27 | ## 3.3 Evaluating Retrieval
28 |
29 | ### 3.3.1 Introduction
30 |
31 |
32 |
33 |
34 |
35 | Plan for the section:
36 |
37 | * Why do we need evaluation
38 | * [Evaluation metrics](eval/evaluation-metrics.md)
39 | * Ground truth / gold standard data
40 | * Generating ground truth with LLM
41 | * Evaluating the search results
42 |
43 |
44 | ### 3.3.2 Getting ground truth data
45 |
46 |
47 |
48 |
49 |
50 | * Approaches for getting evaluation data
51 | * Using OpenAI to generate evaluation data
52 |
53 | Links:
54 |
55 | * [notebook](eval/ground-truth-data.ipynb)
56 | * [documents with ids](eval/documents-with-ids.json)
57 | * [queries generated by OpenAI (pickle)](eval/results.bin)
58 | * [ground truth dataset](eval/ground-truth-data.csv)
59 |
60 |
61 | ### 3.3.3 Ranking evaluation: text search
62 |
63 |
64 |
65 |
66 |
67 | * Elasticsearch with text results
68 | * minsearch
69 |
70 | Links:
71 |
72 | * [Notebook](eval/evaluate-text.ipynb)
73 |
74 | ### 3.3.4 Ranking evaluation: vector search
75 |
76 |
77 |
78 |
79 |
80 | * Elasticsearch with vector search
81 | * Ranking with question, answer, question+answer embeddings
82 |
83 | Links:
84 |
85 | * [Notebook](eval/evaluate-vector.ipynb)
86 |
87 | ## Homework
88 |
89 | See [here](../cohorts/2024/03-vector-search/homework.md)
90 |
91 |
92 | # Notes
93 |
94 | * [Notes by dimzachar](https://github.com/dimzachar/llm_zoomcamp/blob/master/notes/03-vector-search/README.md)
95 | * [Notes by slavaheroes](https://github.com/slavaheroes/llm-zoomcamp/blob/homeworks/03-vector-search/notes.md)
96 | * [Notes on Vector Search and Vector Databases by Marat](https://machine-mind-ml.medium.com/enhancing-llm-performance-with-vector-search-and-vector-databases-1f20eb1cc650)
97 | * Did you take notes? Add them above this line (Send a PR with *links* to your notes)
98 |
--------------------------------------------------------------------------------
/cohorts/2024/03-vector-search/homework.md:
--------------------------------------------------------------------------------
1 | ## Homework: Vector Search
2 |
3 | In this homework, we'll experiment with vector search, with and without Elasticsearch
4 | 
5 | > It's possible that your answers won't match exactly. If that's the case, select the closest one.
6 |
7 | Solution:
8 |
9 | * Video: https://www.loom.com/share/979b9e1f0a964fff88cd62d37e7515f9
10 | * Notebook: [homework_solution.ipynb](homework_solution.ipynb)
11 |
12 |
13 |
14 | ## Q1. Getting the embeddings model
15 |
16 | First, we will get the embeddings model `multi-qa-distilbert-cos-v1` from
17 | [the Sentence Transformer library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview)
18 |
19 | ```python
20 | from sentence_transformers import SentenceTransformer
21 | embedding_model = SentenceTransformer('multi-qa-distilbert-cos-v1')
22 | ```
23 |
24 | Create the embedding for this user question:
25 |
26 | ```python
27 | user_question = "I just discovered the course. Can I still join it?"
28 | ```
29 |
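To get it, encode the question and look at the first component (a sketch):

```python
v = embedding_model.encode(user_question)
print(v[0])
```
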
30 | What's the first value of the resulting vector?
31 |
32 | * -0.24
33 | * -0.04
34 | * 0.07
35 | * 0.27
36 |
37 |
38 | ## Prepare the documents
39 |
40 | Now we will create the embeddings for the documents.
41 |
42 | Load the documents with ids that we prepared in the module:
43 |
44 | ```python
45 | import requests
46 |
47 | base_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main'
48 | relative_url = '03-vector-search/eval/documents-with-ids.json'
49 | docs_url = f'{base_url}/{relative_url}?raw=1'
50 | docs_response = requests.get(docs_url)
51 | documents = docs_response.json()
52 | ```
53 |
54 | We will use only a subset of the questions: the ones
55 | for `"machine-learning-zoomcamp"`. After filtering, you should
56 | have only 375 documents.
57 |
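A possible filter (a sketch):

```python
documents = [d for d in documents if d['course'] == 'machine-learning-zoomcamp']
len(documents)  # should be 375
```
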
58 | ## Q2. Creating the embeddings
59 |
60 | Now for each document, we will create an embedding for both question and answer fields.
61 |
62 | We want to put all of them into a single matrix `X` (a sketch follows the list):
63 |
64 | - Create a list `embeddings`
65 | - Iterate over each document
66 | - `qa_text = f'{question} {text}'`
67 | - compute the embedding for `qa_text`, append to `embeddings`
68 | - At the end, let `X = np.array(embeddings)` (`import numpy as np`)
69 |
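A minimal sketch of these steps, assuming `documents` and `embedding_model` from above:

```python
import numpy as np

embeddings = []

for doc in documents:
    qa_text = f"{doc['question']} {doc['text']}"
    embeddings.append(embedding_model.encode(qa_text))

X = np.array(embeddings)
X.shape
```
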
70 | What's the shape of X? (`X.shape`). Include the parentheses.
71 |
72 |
73 |
74 | ## Q3. Search
75 |
76 | We have the embeddings and the query vector. Now let's compute the
77 | cosine similarity between the vector from Q1 (let's call it `v`) and the matrix from Q2.
78 |
79 | The vectors returned from the embedding model are already
80 | normalized (you can check it by computing a dot product of a vector
81 | with itself - it should return something very close to 1.0). This means that in order
82 | to compute the cosine similarity, it's sufficient to
83 | multiply the matrix `X` by the vector `v`:
84 |
85 |
86 | ```python
87 | scores = X.dot(v)
88 | ```
89 |
90 | What's the highest score in the results?
91 |
92 | - 65.0
93 | - 6.5
94 | - 0.65
95 | - 0.065
96 |
97 |
98 | ## Vector search
99 |
100 | We can now compute the similarity between a query vector and all the embeddings.
101 |
102 | Let's use this to implement our own vector search
103 |
104 | ```python
105 | class VectorSearchEngine():
106 | def __init__(self, documents, embeddings):
107 | self.documents = documents
108 | self.embeddings = embeddings
109 |
110 | def search(self, v_query, num_results=10):
111 | scores = self.embeddings.dot(v_query)
112 | idx = np.argsort(-scores)[:num_results]
113 | return [self.documents[i] for i in idx]
114 |
115 | search_engine = VectorSearchEngine(documents=documents, embeddings=X)
116 | search_engine.search(v, num_results=5)
117 | ```
118 |
119 | If you don't understand how the `search` function works:
120 | 
121 | * Ask ChatGPT or any other LLM of your choice to explain the code
122 | * Check our pre-course workshop about implementing a search engine [here](https://github.com/alexeygrigorev/build-your-own-search-engine)
123 |
124 | (Note: you can replace `argsort` with `argpartition` to make it a lot faster)
125 |
126 |
127 | ## Q4. Hit-rate for our search engine
128 |
129 | Let's evaluate the performance of our own search engine. We will
130 | use the hitrate metric for evaluation.
131 |
132 | First, load the ground truth dataset:
133 |
134 | ```python
135 | import pandas as pd
136 |
137 | base_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main'
138 | relative_url = '03-vector-search/eval/ground-truth-data.csv'
139 | ground_truth_url = f'{base_url}/{relative_url}?raw=1'
140 |
141 | df_ground_truth = pd.read_csv(ground_truth_url)
142 | df_ground_truth = df_ground_truth[df_ground_truth.course == 'machine-learning-zoomcamp']
143 | ground_truth = df_ground_truth.to_dict(orient='records')
144 | ```
145 |
146 | Now use the code from the module to calculate the hitrate of
147 | `VectorSearchEngine` with `num_results=5`.
148 |
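If you don't have the module code at hand, here is a minimal hit-rate sketch. It assumes each ground-truth record has a `document` field with the expected document id and each indexed document has an `id` field:

```python
def hit_rate(relevance_total):
    cnt = 0
    for line in relevance_total:
        if True in line:
            cnt = cnt + 1
    return cnt / len(relevance_total)

relevance_total = []

for q in ground_truth:
    doc_id = q['document']
    v_q = embedding_model.encode(q['question'])
    results = search_engine.search(v_q, num_results=5)
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)

hit_rate(relevance_total)
```
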
149 | What did you get?
150 |
151 | * 0.93
152 | * 0.73
153 | * 0.53
154 | * 0.33
155 |
156 | ## Q5. Indexing with Elasticsearch
157 |
158 | Now let's index these documents with Elasticsearch.
159 | 
160 | * Create the index with the same settings as in the module, but change the dimensions (a mapping sketch is shown below)
161 | * Index the embeddings (note: you've already computed them)
162 |
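A mapping sketch, assuming the `es_client` setup from the module and that you store the combined question+answer embedding in a single `question_text_vector` field (`multi-qa-distilbert-cos-v1` produces 768-dimensional vectors):

```python
index_settings = {
    "settings": {"number_of_shards": 1, "number_of_replicas": 0},
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
            "id": {"type": "keyword"},
            "question_text_vector": {
                "type": "dense_vector",
                "dims": 768,
                "index": True,
                "similarity": "cosine"
            }
        }
    }
}

index_name = "course-questions"
es_client.indices.create(index=index_name, body=index_settings)
```
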
163 | After indexing, let's perform the search of the same query from Q1.
164 |
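A knn query sketch in the same style as the module code (field name and filter assumed as above):

```python
search_query = {
    "knn": {
        "field": "question_text_vector",
        "query_vector": v,
        "k": 5,
        "num_candidates": 10000,
        "filter": {"term": {"course": "machine-learning-zoomcamp"}}
    },
    "_source": ["text", "section", "question", "course", "id"]
}

response = es_client.search(index=index_name, body=search_query)
result_docs = [hit["_source"] for hit in response["hits"]["hits"]]
```
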
165 | What's the ID of the document with the highest score?
166 |
167 | ## Q6. Hit-rate for Elasticsearch
168 |
169 | The search engine we used in Q4 computed the similarity between
170 | the query and ALL the vectors in our database. Usually this is
171 | not practical, as we may have a lot of data.
172 |
173 | Elasticsearch uses approximate techniques to make it faster.
174 |
175 | Let's evaluate how much worse the results are when we switch from
176 | exact search (as in Q4) to approximate search with Elastic.
177 |
178 | What's the hit-rate for our dataset for Elastic?
179 |
180 | * 0.93
181 | * 0.73
182 | * 0.53
183 | * 0.33
184 |
185 |
186 | ## Submit the results
187 |
188 | * Submit your results here: https://courses.datatalks.club/llm-zoomcamp-2024/homework/hw3
189 | * It's possible that your answers won't match exactly. If that's the case, select the closest one.
190 |
--------------------------------------------------------------------------------
/cohorts/2024/04-monitoring/README.md:
--------------------------------------------------------------------------------
1 | # Module 4: Evaluation and Monitoring
2 |
3 | In this module, we will learn how to evaluate and monitor our LLM and RAG system.
4 |
5 | In the evaluation part, we assess the quality of our entire RAG
6 | system before it goes live.
7 |
8 | In the monitoring part, we collect, store and visualize
9 | metrics to assess the answer quality of a deployed LLM. We also
10 | collect chat history and user feedback.
11 |
12 |
13 | ## 4.1 Introduction to monitoring answer quality
14 |
15 |
16 |
17 |
18 |
19 | * Why monitor LLM systems?
20 | * Monitoring answer quality of LLMs
21 | * Monitoring answer quality with user feedback
22 | * What else to monitor that is not covered by this module?
23 |
24 |
25 | ## 4.2 Offline vs Online (RAG) evaluation
26 |
27 |
28 |
29 |
30 |
31 | * Modules recap
32 | * Online vs offline evaluation
33 | * Offline evaluation metrics
34 |
35 |
36 | ## 4.3 Generating data for offline RAG evaluation
37 |
38 |
39 |
40 |
41 |
42 | Links:
43 |
44 | * [notebook](offline-rag-evaluation.ipynb)
45 | * [results-gpt4o.csv](data/results-gpt4o.csv) (answers from GPT-4o)
46 | * [results-gpt35.csv](data/results-gpt35.csv) (answers from GPT-3.5-Turbo)
47 |
48 |
49 | ## 4.4 Offline RAG evaluation: cosine similarity
50 |
51 |
52 |
53 |
54 |
55 | Content
56 |
57 | * A->Q->A' cosine similarity
58 | * Evaluating gpt-4o
59 | * Evaluating gpt-3.5-turbo
60 | * Evaluating gpt-4o-mini
61 |
62 | Links:
63 |
64 | * [notebook](offline-rag-evaluation.ipynb)
65 | * [results-gpt4o-cosine.csv](data/results-gpt4o-cosine.csv) (answers with cosine calculated from GPT-4o)
66 | * [results-gpt35-cosine.csv](data/results-gpt35-cosine.csv) (answers with cosine calculated from GPT-3.5-Turbo)
67 | * [results-gpt4o-mini.csv](data/results-gpt4o-mini.csv) (answers from GPT-4o-mini)
68 | * [results-gpt4o-mini-cosine.csv](data/results-gpt4o-mini-cosine.csv) (answers with cosine calculated from GPT-4o-mini)
69 |
70 |
71 | ## 4.5 Offline RAG evaluation: LLM as a judge
72 |
73 |
74 |
75 |
76 |
77 | * LLM as a judge
78 | * A->Q->A' evaluation
79 | * Q->A evaluation
80 |
81 |
82 | Links:
83 |
84 | * [notebook](offline-rag-evaluation.ipynb)
85 | * [evaluations-aqa.csv](data/evaluations-aqa.csv) (A->Q->A evaluation results)
86 | * [evaluations-qa.csv](data/evaluations-qa.csv) (Q->A evaluation results)
88 |
89 |
90 | ## 4.6 Capturing user feedback
91 |
92 |
93 |
94 |
95 |
96 | > You can see the prompts and the output from claude [here](code.md)
97 |
98 | Content
99 |
100 | * Adding +1 and -1 buttons
101 | * Setting up a postgres database
102 | * Putting everything in docker compose
103 |
104 | ```bash
105 | pip install pgcli
106 | pgcli -h localhost -U your_username -d course_assistant -W
107 | ```
108 |
109 |
110 | Links:
111 |
112 | * [final code](app/)
113 | * [intermediate code from claude](code.md#46-capturing-user-feedback)
114 |
115 |
116 | ### 4.6.2 Capturing user feedback: part 2
117 |
118 |
119 |
120 |
121 |
122 | * adding vector search
123 | * adding OpenAI
124 |
125 | Links:
126 |
127 | * [final code](app/)
128 | * [intermediate code from claude](code.md#462-capturing-user-feedback-part-2)
129 |
130 |
131 | ## 4.7 Monitoring the system
132 |
133 |
134 |
135 |
136 |
137 | * Setting up Grafana
138 | * Tokens and costs
139 | * QA relevance
140 | * User feedback
141 | * Other metrics
142 |
143 | Links:
144 |
145 | * [final code](app/)
146 | * [SQL queries for Grafana](grafana.md)
147 | * [intermediate code from claude](code.md#47-monitoring)
148 |
149 | ### 4.7.2 Extra Grafana video
150 |
151 |
152 |
153 |
154 |
155 | * Grafana variables
156 | * Exporting and importing dashboards
157 |
158 | Links:
159 |
160 | * [SQL queries for Grafana](grafana.md)
161 | * [Grafana dashboard](dashboard.json)
162 |
163 | ## Homework
164 |
165 | See [here](../cohorts/2024/04-monitoring/homework.md)
166 |
167 |
168 | ## Extra resources
169 |
170 | ### Overview of the module
171 |
172 | 
173 |
174 | https://www.loom.com/share/1dd375ec4b0d458fabdfc2b841089031
175 |
176 | # Notes
177 |
178 | * [Notes by slavaheroes](https://github.com/slavaheroes/llm-zoomcamp/blob/homeworks/04-monitoring/notes.md)
179 | * Did you take notes? Add them above this line (Send a PR with *links* to your notes)
180 |
--------------------------------------------------------------------------------
/cohorts/2024/04-monitoring/app/.env:
--------------------------------------------------------------------------------
1 | # PostgreSQL Configuration
2 | POSTGRES_HOST=postgres
3 | POSTGRES_DB=course_assistant
4 | POSTGRES_USER=your_username
5 | POSTGRES_PASSWORD=your_password
6 | POSTGRES_PORT=5432
7 |
8 | # Elasticsearch Configuration
9 | ELASTIC_URL_LOCAL=http://localhost:9200
10 | ELASTIC_URL=http://elasticsearch:9200
11 | ELASTIC_PORT=9200
12 |
13 | # Ollama Configuration
14 | OLLAMA_PORT=11434
15 |
16 | # Streamlit Configuration
17 | STREAMLIT_PORT=8501
18 |
19 | # Other Configuration
20 | MODEL_NAME=multi-qa-MiniLM-L6-cos-v1
21 | INDEX_NAME=course-questions
22 |
--------------------------------------------------------------------------------
/cohorts/2024/04-monitoring/app/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM python:3.9-slim
2 |
3 | ENV PYTHONUNBUFFERED=1
4 |
5 | WORKDIR /app
6 |
7 | COPY requirements.txt .
8 | RUN pip install --no-cache-dir -r requirements.txt
9 |
10 | COPY . .
11 |
12 | CMD ["streamlit", "run", "app.py"]
--------------------------------------------------------------------------------
/cohorts/2024/04-monitoring/app/README.MD:
--------------------------------------------------------------------------------
1 | ## Additional notes for those trying out the streamlit/grafana setup
2 | 
3 | 1) The following packages are required when you run some of the .py scripts
4 |
5 | ```
6 | pip install psycopg2-binary python-dotenv
7 | pip install pgcli
8 | ```
9 |
10 |
11 | 2) To download the phi3 model to the container
12 | ```
13 | docker-compose up -d
14 | docker-compose exec ollama ollama pull phi3
15 | ```
16 |
--------------------------------------------------------------------------------
/cohorts/2024/04-monitoring/app/app.py:
--------------------------------------------------------------------------------
1 | import streamlit as st
2 | import time
3 | import uuid
4 |
5 | from assistant import get_answer
6 | from db import (
7 | save_conversation,
8 | save_feedback,
9 | get_recent_conversations,
10 | get_feedback_stats,
11 | )
12 |
13 |
14 | def print_log(message):
15 | print(message, flush=True)
16 |
17 |
18 | def main():
19 | print_log("Starting the Course Assistant application")
20 | st.title("Course Assistant")
21 |
22 | # Session state initialization
23 | if "conversation_id" not in st.session_state:
24 | st.session_state.conversation_id = str(uuid.uuid4())
25 | print_log(
26 | f"New conversation started with ID: {st.session_state.conversation_id}"
27 | )
28 | if "count" not in st.session_state:
29 | st.session_state.count = 0
30 | print_log("Feedback count initialized to 0")
31 |
32 | # Course selection
33 | course = st.selectbox(
34 | "Select a course:",
35 | ["machine-learning-zoomcamp", "data-engineering-zoomcamp", "mlops-zoomcamp"],
36 | )
37 | print_log(f"User selected course: {course}")
38 |
39 | # Model selection
40 | model_choice = st.selectbox(
41 | "Select a model:",
42 | ["ollama/phi3", "openai/gpt-3.5-turbo", "openai/gpt-4o", "openai/gpt-4o-mini"],
43 | )
44 | print_log(f"User selected model: {model_choice}")
45 |
46 | # Search type selection
47 | search_type = st.radio("Select search type:", ["Text", "Vector"])
48 | print_log(f"User selected search type: {search_type}")
49 |
50 | # User input
51 | user_input = st.text_input("Enter your question:")
52 |
53 | if st.button("Ask"):
54 | print_log(f"User asked: '{user_input}'")
55 | with st.spinner("Processing..."):
56 | print_log(
57 | f"Getting answer from assistant using {model_choice} model and {search_type} search"
58 | )
59 | start_time = time.time()
60 | answer_data = get_answer(user_input, course, model_choice, search_type)
61 | end_time = time.time()
62 | print_log(f"Answer received in {end_time - start_time:.2f} seconds")
63 | st.success("Completed!")
64 | st.write(answer_data["answer"])
65 |
66 | # Display monitoring information
67 | st.write(f"Response time: {answer_data['response_time']:.2f} seconds")
68 | st.write(f"Relevance: {answer_data['relevance']}")
69 | st.write(f"Model used: {answer_data['model_used']}")
70 | st.write(f"Total tokens: {answer_data['total_tokens']}")
71 | if answer_data["openai_cost"] > 0:
72 | st.write(f"OpenAI cost: ${answer_data['openai_cost']:.4f}")
73 |
74 | # Save conversation to database
75 | print_log("Saving conversation to database")
76 | save_conversation(
77 | st.session_state.conversation_id, user_input, answer_data, course
78 | )
79 | print_log("Conversation saved successfully")
80 | # Generate a new conversation ID for next question
81 | st.session_state.conversation_id = str(uuid.uuid4())
82 |
83 | # Feedback buttons
84 | col1, col2 = st.columns(2)
85 | with col1:
86 | if st.button("+1"):
87 | st.session_state.count += 1
88 | print_log(
89 | f"Positive feedback received. New count: {st.session_state.count}"
90 | )
91 | save_feedback(st.session_state.conversation_id, 1)
92 | print_log("Positive feedback saved to database")
93 | with col2:
94 | if st.button("-1"):
95 | st.session_state.count -= 1
96 | print_log(
97 | f"Negative feedback received. New count: {st.session_state.count}"
98 | )
99 | save_feedback(st.session_state.conversation_id, -1)
100 | print_log("Negative feedback saved to database")
101 |
102 | st.write(f"Current count: {st.session_state.count}")
103 |
104 | # Display recent conversations
105 | st.subheader("Recent Conversations")
106 | relevance_filter = st.selectbox(
107 | "Filter by relevance:", ["All", "RELEVANT", "PARTLY_RELEVANT", "NON_RELEVANT"]
108 | )
109 | recent_conversations = get_recent_conversations(
110 | limit=5, relevance=relevance_filter if relevance_filter != "All" else None
111 | )
112 | for conv in recent_conversations:
113 | st.write(f"Q: {conv['question']}")
114 | st.write(f"A: {conv['answer']}")
115 | st.write(f"Relevance: {conv['relevance']}")
116 | st.write(f"Model: {conv['model_used']}")
117 | st.write("---")
118 |
119 | # Display feedback stats
120 | feedback_stats = get_feedback_stats()
121 | st.subheader("Feedback Statistics")
122 | st.write(f"Thumbs up: {feedback_stats['thumbs_up']}")
123 | st.write(f"Thumbs down: {feedback_stats['thumbs_down']}")
124 |
125 |
126 | print_log("Streamlit app loop completed")
127 |
128 |
129 | if __name__ == "__main__":
130 | print_log("Course Assistant application started")
131 | main()
132 |
--------------------------------------------------------------------------------
/cohorts/2024/04-monitoring/app/assistant.py:
--------------------------------------------------------------------------------
1 | import os
2 | import time
3 | import json
4 |
5 | from openai import OpenAI
6 |
7 | from elasticsearch import Elasticsearch
8 | from sentence_transformers import SentenceTransformer
9 |
10 |
11 | ELASTIC_URL = os.getenv("ELASTIC_URL", "http://elasticsearch:9200")
12 | OLLAMA_URL = os.getenv("OLLAMA_URL", "http://ollama:11434/v1/")
13 | OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "your-api-key-here")
14 |
15 |
16 | es_client = Elasticsearch(ELASTIC_URL)
17 | ollama_client = OpenAI(base_url=OLLAMA_URL, api_key="ollama")
18 | openai_client = OpenAI(api_key=OPENAI_API_KEY)
19 |
20 | model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
21 |
22 |
23 | def elastic_search_text(query, course, index_name="course-questions"):
24 | search_query = {
25 | "size": 5,
26 | "query": {
27 | "bool": {
28 | "must": {
29 | "multi_match": {
30 | "query": query,
31 | "fields": ["question^3", "text", "section"],
32 | "type": "best_fields",
33 | }
34 | },
35 | "filter": {"term": {"course": course}},
36 | }
37 | },
38 | }
39 |
40 | response = es_client.search(index=index_name, body=search_query)
41 | return [hit["_source"] for hit in response["hits"]["hits"]]
42 |
43 |
44 | def elastic_search_knn(field, vector, course, index_name="course-questions"):
45 | knn = {
46 | "field": field,
47 | "query_vector": vector,
48 | "k": 5,
49 | "num_candidates": 10000,
50 | "filter": {"term": {"course": course}},
51 | }
52 |
53 | search_query = {
54 | "knn": knn,
55 | "_source": ["text", "section", "question", "course", "id"],
56 | }
57 |
58 | es_results = es_client.search(index=index_name, body=search_query)
59 |
60 | return [hit["_source"] for hit in es_results["hits"]["hits"]]
61 |
62 |
63 | def build_prompt(query, search_results):
64 | prompt_template = """
65 | You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
66 | Use only the facts from the CONTEXT when answering the QUESTION.
67 |
68 | QUESTION: {question}
69 |
70 | CONTEXT:
71 | {context}
72 | """.strip()
73 |
74 | context = "\n\n".join(
75 | [
76 | f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}"
77 | for doc in search_results
78 | ]
79 | )
80 | return prompt_template.format(question=query, context=context).strip()
81 |
82 |
83 | def llm(prompt, model_choice):
84 | start_time = time.time()
85 | if model_choice.startswith('ollama/'):
86 | response = ollama_client.chat.completions.create(
87 | model=model_choice.split('/')[-1],
88 | messages=[{"role": "user", "content": prompt}]
89 | )
90 | answer = response.choices[0].message.content
91 | tokens = {
92 | 'prompt_tokens': response.usage.prompt_tokens,
93 | 'completion_tokens': response.usage.completion_tokens,
94 | 'total_tokens': response.usage.total_tokens
95 | }
96 | elif model_choice.startswith('openai/'):
97 | response = openai_client.chat.completions.create(
98 | model=model_choice.split('/')[-1],
99 | messages=[{"role": "user", "content": prompt}]
100 | )
101 | answer = response.choices[0].message.content
102 | tokens = {
103 | 'prompt_tokens': response.usage.prompt_tokens,
104 | 'completion_tokens': response.usage.completion_tokens,
105 | 'total_tokens': response.usage.total_tokens
106 | }
107 | else:
108 | raise ValueError(f"Unknown model choice: {model_choice}")
109 |
110 | end_time = time.time()
111 | response_time = end_time - start_time
112 |
113 | return answer, tokens, response_time
114 |
115 |
116 | def evaluate_relevance(question, answer):
117 | prompt_template = """
118 | You are an expert evaluator for a Retrieval-Augmented Generation (RAG) system.
119 | Your task is to analyze the relevance of the generated answer to the given question.
120 | Based on the relevance of the generated answer, you will classify it
121 | as "NON_RELEVANT", "PARTLY_RELEVANT", or "RELEVANT".
122 |
123 | Here is the data for evaluation:
124 |
125 | Question: {question}
126 | Generated Answer: {answer}
127 |
128 | Please analyze the content and context of the generated answer in relation to the question
129 | and provide your evaluation in parsable JSON without using code blocks:
130 |
131 | {{
132 | "Relevance": "NON_RELEVANT" | "PARTLY_RELEVANT" | "RELEVANT",
133 | "Explanation": "[Provide a brief explanation for your evaluation]"
134 | }}
135 | """.strip()
136 |
137 | prompt = prompt_template.format(question=question, answer=answer)
138 | evaluation, tokens, _ = llm(prompt, 'openai/gpt-4o-mini')
139 |
140 | try:
141 | json_eval = json.loads(evaluation)
142 | return json_eval['Relevance'], json_eval['Explanation'], tokens
143 | except json.JSONDecodeError:
144 | return "UNKNOWN", "Failed to parse evaluation", tokens
145 |
146 |
147 | def calculate_openai_cost(model_choice, tokens):
148 | openai_cost = 0
149 |
150 | if model_choice == 'openai/gpt-3.5-turbo':
151 | openai_cost = (tokens['prompt_tokens'] * 0.0015 + tokens['completion_tokens'] * 0.002) / 1000
152 | elif model_choice in ['openai/gpt-4o', 'openai/gpt-4o-mini']:
153 | openai_cost = (tokens['prompt_tokens'] * 0.03 + tokens['completion_tokens'] * 0.06) / 1000
154 |
155 | return openai_cost
156 |
157 |
158 | def get_answer(query, course, model_choice, search_type):
159 | if search_type == 'Vector':
160 | vector = model.encode(query)
161 | search_results = elastic_search_knn('question_text_vector', vector, course)
162 | else:
163 | search_results = elastic_search_text(query, course)
164 |
165 | prompt = build_prompt(query, search_results)
166 | answer, tokens, response_time = llm(prompt, model_choice)
167 |
168 | relevance, explanation, eval_tokens = evaluate_relevance(query, answer)
169 |
170 | openai_cost = calculate_openai_cost(model_choice, tokens)
171 |
172 | return {
173 | 'answer': answer,
174 | 'response_time': response_time,
175 | 'relevance': relevance,
176 | 'relevance_explanation': explanation,
177 | 'model_used': model_choice,
178 | 'prompt_tokens': tokens['prompt_tokens'],
179 | 'completion_tokens': tokens['completion_tokens'],
180 | 'total_tokens': tokens['total_tokens'],
181 | 'eval_prompt_tokens': eval_tokens['prompt_tokens'],
182 | 'eval_completion_tokens': eval_tokens['completion_tokens'],
183 | 'eval_total_tokens': eval_tokens['total_tokens'],
184 | 'openai_cost': openai_cost
185 | }
--------------------------------------------------------------------------------
/cohorts/2024/04-monitoring/app/db.py:
--------------------------------------------------------------------------------
1 | import os
2 | import psycopg2
3 | from psycopg2.extras import DictCursor
4 | from datetime import datetime
5 | from zoneinfo import ZoneInfo
6 |
7 | tz = ZoneInfo("Europe/Berlin")
8 |
9 |
10 | def get_db_connection():
11 | return psycopg2.connect(
12 | host=os.getenv("POSTGRES_HOST", "postgres"),
13 | database=os.getenv("POSTGRES_DB", "course_assistant"),
14 | user=os.getenv("POSTGRES_USER", "your_username"),
15 | password=os.getenv("POSTGRES_PASSWORD", "your_password"),
16 | )
17 |
18 |
19 | def init_db():
20 | conn = get_db_connection()
21 | try:
22 | with conn.cursor() as cur:
23 | cur.execute("DROP TABLE IF EXISTS feedback")
24 | cur.execute("DROP TABLE IF EXISTS conversations")
25 |
26 | cur.execute("""
27 | CREATE TABLE conversations (
28 | id TEXT PRIMARY KEY,
29 | question TEXT NOT NULL,
30 | answer TEXT NOT NULL,
31 | course TEXT NOT NULL,
32 | model_used TEXT NOT NULL,
33 | response_time FLOAT NOT NULL,
34 | relevance TEXT NOT NULL,
35 | relevance_explanation TEXT NOT NULL,
36 | prompt_tokens INTEGER NOT NULL,
37 | completion_tokens INTEGER NOT NULL,
38 | total_tokens INTEGER NOT NULL,
39 | eval_prompt_tokens INTEGER NOT NULL,
40 | eval_completion_tokens INTEGER NOT NULL,
41 | eval_total_tokens INTEGER NOT NULL,
42 | openai_cost FLOAT NOT NULL,
43 | timestamp TIMESTAMP WITH TIME ZONE NOT NULL
44 | )
45 | """)
46 | cur.execute("""
47 | CREATE TABLE feedback (
48 | id SERIAL PRIMARY KEY,
49 | conversation_id TEXT REFERENCES conversations(id),
50 | feedback INTEGER NOT NULL,
51 | timestamp TIMESTAMP WITH TIME ZONE NOT NULL
52 | )
53 | """)
54 | conn.commit()
55 | finally:
56 | conn.close()
57 |
58 |
59 | def save_conversation(conversation_id, question, answer_data, course, timestamp=None):
60 | if timestamp is None:
61 | timestamp = datetime.now(tz)
62 |
63 | conn = get_db_connection()
64 | try:
65 | with conn.cursor() as cur:
66 | cur.execute(
67 | """
68 | INSERT INTO conversations
69 | (id, question, answer, course, model_used, response_time, relevance,
70 | relevance_explanation, prompt_tokens, completion_tokens, total_tokens,
71 | eval_prompt_tokens, eval_completion_tokens, eval_total_tokens, openai_cost, timestamp)
72 | VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, COALESCE(%s, CURRENT_TIMESTAMP))
73 | """,
74 | (
75 | conversation_id,
76 | question,
77 | answer_data["answer"],
78 | course,
79 | answer_data["model_used"],
80 | answer_data["response_time"],
81 | answer_data["relevance"],
82 | answer_data["relevance_explanation"],
83 | answer_data["prompt_tokens"],
84 | answer_data["completion_tokens"],
85 | answer_data["total_tokens"],
86 | answer_data["eval_prompt_tokens"],
87 | answer_data["eval_completion_tokens"],
88 | answer_data["eval_total_tokens"],
89 | answer_data["openai_cost"],
90 | timestamp,
91 | ),
92 | )
93 | conn.commit()
94 | finally:
95 | conn.close()
96 |
97 |
98 | def save_feedback(conversation_id, feedback, timestamp=None):
99 | if timestamp is None:
100 | timestamp = datetime.now(tz)
101 |
102 | conn = get_db_connection()
103 | try:
104 | with conn.cursor() as cur:
105 | cur.execute(
106 | "INSERT INTO feedback (conversation_id, feedback, timestamp) VALUES (%s, %s, COALESCE(%s, CURRENT_TIMESTAMP))",
107 | (conversation_id, feedback, timestamp),
108 | )
109 | conn.commit()
110 | finally:
111 | conn.close()
112 |
113 |
114 | def get_recent_conversations(limit=5, relevance=None):
115 | conn = get_db_connection()
116 | try:
117 | with conn.cursor(cursor_factory=DictCursor) as cur:
118 | query = """
119 | SELECT c.*, f.feedback
120 | FROM conversations c
121 | LEFT JOIN feedback f ON c.id = f.conversation_id
122 | """
123 | if relevance:
124 | query += f" WHERE c.relevance = '{relevance}'"
125 | query += " ORDER BY c.timestamp DESC LIMIT %s"
126 |
127 | cur.execute(query, (limit,))
128 | return cur.fetchall()
129 | finally:
130 | conn.close()
131 |
132 |
133 | def get_feedback_stats():
134 | conn = get_db_connection()
135 | try:
136 | with conn.cursor(cursor_factory=DictCursor) as cur:
137 | cur.execute("""
138 | SELECT
139 | SUM(CASE WHEN feedback > 0 THEN 1 ELSE 0 END) as thumbs_up,
140 | SUM(CASE WHEN feedback < 0 THEN 1 ELSE 0 END) as thumbs_down
141 | FROM feedback
142 | """)
143 | return cur.fetchone()
144 | finally:
145 | conn.close()
146 |
--------------------------------------------------------------------------------
/cohorts/2024/04-monitoring/app/docker-compose.yaml:
--------------------------------------------------------------------------------
1 | version: '3.8'
2 |
3 | services:
4 | elasticsearch:
5 | image: docker.elastic.co/elasticsearch/elasticsearch:8.4.3
6 | container_name: elasticsearch
7 | environment:
8 | - discovery.type=single-node
9 | - xpack.security.enabled=false
10 | ports:
11 | - "${ELASTIC_PORT:-9200}:9200"
12 | - "9300:9300"
13 | volumes:
14 | - elasticsearch_data:/usr/share/elasticsearch/data
15 |
16 | ollama:
17 | image: ollama/ollama
18 | container_name: ollama
19 | volumes:
20 | - ollama_data:/root/.ollama
21 | ports:
22 | - "${OLLAMA_PORT:-11434}:11434"
23 |
24 | postgres:
25 | image: postgres:13
26 | container_name: postgres
27 | environment:
28 | POSTGRES_DB: ${POSTGRES_DB}
29 | POSTGRES_USER: ${POSTGRES_USER}
30 | POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
31 | ports:
32 | - "${POSTGRES_PORT:-5432}:5432"
33 | volumes:
34 | - postgres_data:/var/lib/postgresql/data
35 |
36 | streamlit:
37 | build:
38 | context: .
39 | dockerfile: Dockerfile
40 | container_name: streamlit
41 | environment:
42 | - ELASTIC_URL=http://elasticsearch:${ELASTIC_PORT:-9200}
43 | - OLLAMA_URL=http://ollama:${OLLAMA_PORT:-11434}/v1/
44 | - POSTGRES_HOST=postgres
45 | - POSTGRES_DB=${POSTGRES_DB}
46 | - POSTGRES_USER=${POSTGRES_USER}
47 | - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
48 | - MODEL_NAME=${MODEL_NAME}
49 | - INDEX_NAME=${INDEX_NAME}
50 | - OPENAI_API_KEY=${OPENAI_API_KEY}
51 | ports:
52 | - "${STREAMLIT_PORT:-8501}:8501"
53 | depends_on:
54 | - elasticsearch
55 | - ollama
56 | - postgres
57 |
58 | grafana:
59 | image: grafana/grafana:latest
60 | container_name: grafana
61 | ports:
62 | - "3000:3000"
63 | volumes:
64 | - grafana_data:/var/lib/grafana
65 | environment:
66 | - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-admin}
67 | depends_on:
68 | - postgres
69 |
70 | volumes:
71 | elasticsearch_data:
72 | ollama_data:
73 | postgres_data:
74 | grafana_data:
75 |
--------------------------------------------------------------------------------
/cohorts/2024/04-monitoring/app/generate_data.py:
--------------------------------------------------------------------------------
1 | import time
2 | import random
3 | import uuid
4 | from datetime import datetime, timedelta
5 | from zoneinfo import ZoneInfo
6 | from db import save_conversation, save_feedback, get_db_connection
7 |
8 | # Set the timezone to CET (Europe/Berlin)
9 | tz = ZoneInfo("Europe/Berlin")
10 |
11 | # List of sample questions and answers
12 | SAMPLE_QUESTIONS = [
13 | "What is machine learning?",
14 | "How does linear regression work?",
15 | "Explain the concept of overfitting.",
16 | "What is the difference between supervised and unsupervised learning?",
17 | "How does cross-validation help in model evaluation?",
18 | ]
19 |
20 | SAMPLE_ANSWERS = [
21 | "Machine learning is a subset of artificial intelligence that focuses on the development of algorithms and statistical models that enable computer systems to improve their performance on a specific task through experience.",
22 | "Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.",
23 | "Overfitting occurs when a machine learning model learns the training data too well, including its noise and fluctuations, resulting in poor generalization to new, unseen data.",
24 | "Supervised learning involves training models on labeled data, while unsupervised learning deals with finding patterns in unlabeled data without predefined outputs.",
25 | "Cross-validation is a technique used to assess how well a model will generalize to an independent dataset. It involves partitioning the data into subsets, training the model on a subset, and validating it on the remaining data.",
26 | ]
27 |
28 | COURSES = ["machine-learning-zoomcamp", "data-engineering-zoomcamp", "mlops-zoomcamp"]
29 | MODELS = ["ollama/phi3", "openai/gpt-3.5-turbo", "openai/gpt-4o", "openai/gpt-4o-mini"]
30 | RELEVANCE = ["RELEVANT", "PARTLY_RELEVANT", "NON_RELEVANT"]
31 |
32 |
33 | def generate_synthetic_data(start_time, end_time):
34 | current_time = start_time
35 | conversation_count = 0
36 | print(f"Starting historical data generation from {start_time} to {end_time}")
37 | while current_time < end_time:
38 | conversation_id = str(uuid.uuid4())
39 | question = random.choice(SAMPLE_QUESTIONS)
40 | answer = random.choice(SAMPLE_ANSWERS)
41 | course = random.choice(COURSES)
42 | model = random.choice(MODELS)
43 | relevance = random.choice(RELEVANCE)
44 |
45 | openai_cost = 0
46 |
47 | if model.startswith("openai/"):
48 | openai_cost = random.uniform(0.001, 0.1)
49 |
50 | answer_data = {
51 | "answer": answer,
52 | "response_time": random.uniform(0.5, 5.0),
53 | "relevance": relevance,
54 | "relevance_explanation": f"This answer is {relevance.lower()} to the question.",
55 | "model_used": model,
56 | "prompt_tokens": random.randint(50, 200),
57 | "completion_tokens": random.randint(50, 300),
58 | "total_tokens": random.randint(100, 500),
59 | "eval_prompt_tokens": random.randint(50, 150),
60 | "eval_completion_tokens": random.randint(20, 100),
61 | "eval_total_tokens": random.randint(70, 250),
62 | "openai_cost": openai_cost,
63 | }
64 |
65 | save_conversation(conversation_id, question, answer_data, course, current_time)
66 | print(
67 | f"Saved conversation: ID={conversation_id}, Time={current_time}, Course={course}, Model={model}"
68 | )
69 |
70 | if random.random() < 0.7:
71 | feedback = 1 if random.random() < 0.8 else -1
72 | save_feedback(conversation_id, feedback, current_time)
73 | print(
74 | f"Saved feedback for conversation {conversation_id}: {'Positive' if feedback > 0 else 'Negative'}"
75 | )
76 |
77 | current_time += timedelta(minutes=random.randint(1, 15))
78 | conversation_count += 1
79 | if conversation_count % 10 == 0:
80 | print(f"Generated {conversation_count} conversations so far...")
81 |
82 | print(
83 | f"Historical data generation complete. Total conversations: {conversation_count}"
84 | )
85 |
86 |
87 | def generate_live_data():
88 | conversation_count = 0
89 | print("Starting live data generation...")
90 | while True:
91 | current_time = datetime.now(tz)
92 | # current_time = None
93 | conversation_id = str(uuid.uuid4())
94 | question = random.choice(SAMPLE_QUESTIONS)
95 | answer = random.choice(SAMPLE_ANSWERS)
96 | course = random.choice(COURSES)
97 | model = random.choice(MODELS)
98 | relevance = random.choice(RELEVANCE)
99 |
100 | openai_cost = 0
101 |
102 | if model.startswith("openai/"):
103 | openai_cost = random.uniform(0.001, 0.1)
104 |
105 | answer_data = {
106 | "answer": answer,
107 | "response_time": random.uniform(0.5, 5.0),
108 | "relevance": relevance,
109 | "relevance_explanation": f"This answer is {relevance.lower()} to the question.",
110 | "model_used": model,
111 | "prompt_tokens": random.randint(50, 200),
112 | "completion_tokens": random.randint(50, 300),
113 | "total_tokens": random.randint(100, 500),
114 | "eval_prompt_tokens": random.randint(50, 150),
115 | "eval_completion_tokens": random.randint(20, 100),
116 | "eval_total_tokens": random.randint(70, 250),
117 | "openai_cost": openai_cost,
118 | }
119 |
120 | save_conversation(conversation_id, question, answer_data, course, current_time)
121 | print(
122 | f"Saved live conversation: ID={conversation_id}, Time={current_time}, Course={course}, Model={model}"
123 | )
124 |
125 | if random.random() < 0.7:
126 | feedback = 1 if random.random() < 0.8 else -1
127 | save_feedback(conversation_id, feedback, current_time)
128 | print(
129 | f"Saved feedback for live conversation {conversation_id}: {'Positive' if feedback > 0 else 'Negative'}"
130 | )
131 |
132 | conversation_count += 1
133 | if conversation_count % 10 == 0:
134 | print(f"Generated {conversation_count} live conversations so far...")
135 |
136 | time.sleep(1)
137 |
138 |
139 | if __name__ == "__main__":
140 | print(f"Script started at {datetime.now(tz)}")
141 | end_time = datetime.now(tz)
142 | start_time = end_time - timedelta(hours=6)
143 | print(f"Generating historical data from {start_time} to {end_time}")
144 | generate_synthetic_data(start_time, end_time)
145 | print("Historical data generation complete.")
146 |
147 | print("Starting live data generation... Press Ctrl+C to stop.")
148 | try:
149 | generate_live_data()
150 | except KeyboardInterrupt:
151 | print(f"Live data generation stopped at {datetime.now(tz)}.")
152 | finally:
153 | print(f"Script ended at {datetime.now(tz)}")
154 |
--------------------------------------------------------------------------------
/cohorts/2024/04-monitoring/app/prep.py:
--------------------------------------------------------------------------------
1 | import os
2 | import requests
3 | import pandas as pd
4 | from sentence_transformers import SentenceTransformer
5 | from elasticsearch import Elasticsearch
6 | from tqdm.auto import tqdm
7 | from dotenv import load_dotenv
8 |
9 | from db import init_db
10 |
11 | load_dotenv()
12 |
13 | ELASTIC_URL = os.getenv("ELASTIC_URL_LOCAL")
14 | MODEL_NAME = os.getenv("MODEL_NAME")
15 | INDEX_NAME = os.getenv("INDEX_NAME")
16 |
17 | BASE_URL = "https://github.com/DataTalksClub/llm-zoomcamp/blob/main"
18 |
19 |
20 | def fetch_documents():
21 | print("Fetching documents...")
22 | relative_url = "03-vector-search/eval/documents-with-ids.json"
23 | docs_url = f"{BASE_URL}/{relative_url}?raw=1"
24 | docs_response = requests.get(docs_url)
25 | documents = docs_response.json()
26 | print(f"Fetched {len(documents)} documents")
27 | return documents
28 |
29 |
30 | def fetch_ground_truth():
31 | print("Fetching ground truth data...")
32 | relative_url = "03-vector-search/eval/ground-truth-data.csv"
33 | ground_truth_url = f"{BASE_URL}/{relative_url}?raw=1"
34 | df_ground_truth = pd.read_csv(ground_truth_url)
35 | df_ground_truth = df_ground_truth[
36 | df_ground_truth.course == "machine-learning-zoomcamp"
37 | ]
38 | ground_truth = df_ground_truth.to_dict(orient="records")
39 | print(f"Fetched {len(ground_truth)} ground truth records")
40 | return ground_truth
41 |
42 |
43 | def load_model():
44 | print(f"Loading model: {MODEL_NAME}")
45 | return SentenceTransformer(MODEL_NAME)
46 |
47 |
48 | def setup_elasticsearch():
49 | print("Setting up Elasticsearch...")
50 | es_client = Elasticsearch(ELASTIC_URL)
51 |
52 | index_settings = {
53 | "settings": {"number_of_shards": 1, "number_of_replicas": 0},
54 | "mappings": {
55 | "properties": {
56 | "text": {"type": "text"},
57 | "section": {"type": "text"},
58 | "question": {"type": "text"},
59 | "course": {"type": "keyword"},
60 | "id": {"type": "keyword"},
61 | "question_text_vector": {
62 | "type": "dense_vector",
63 | "dims": 384,
64 | "index": True,
65 | "similarity": "cosine",
66 | },
67 | }
68 | },
69 | }
70 |
71 | es_client.indices.delete(index=INDEX_NAME, ignore_unavailable=True)
72 | es_client.indices.create(index=INDEX_NAME, body=index_settings)
73 | print(f"Elasticsearch index '{INDEX_NAME}' created")
74 | return es_client
75 |
76 |
77 | def index_documents(es_client, documents, model):
78 | print("Indexing documents...")
79 | for doc in tqdm(documents):
80 | question = doc["question"]
81 | text = doc["text"]
82 | doc["question_text_vector"] = model.encode(question + " " + text).tolist()
83 | es_client.index(index=INDEX_NAME, document=doc)
84 | print(f"Indexed {len(documents)} documents")
85 |
86 |
87 | def main():
88 |     # You may want to comment out the block below (fetching and indexing)
89 |     # if you only want to init the db or don't want to re-index.
90 | print("Starting the indexing process...")
91 |
92 | documents = fetch_documents()
93 | ground_truth = fetch_ground_truth()
94 | model = load_model()
95 | es_client = setup_elasticsearch()
96 | index_documents(es_client, documents, model)
97 |     # end of the block you may want to comment out
98 |
99 | print("Initializing database...")
100 | init_db()
101 |
102 | print("Indexing process completed successfully!")
103 |
104 |
105 | if __name__ == "__main__":
106 | main()
107 |
--------------------------------------------------------------------------------
/cohorts/2024/04-monitoring/app/requirements.txt:
--------------------------------------------------------------------------------
1 | streamlit
2 | elasticsearch==8.14.0
3 | psycopg2-binary==2.9.9
4 | python-dotenv
5 | openai==1.35.7
6 | sentence-transformers==2.7.0
7 | numpy==1.26.4
8 |
9 | --find-links https://download.pytorch.org/whl/cpu/torch_stable.html
10 | torch==2.3.1+cpu
--------------------------------------------------------------------------------
/cohorts/2024/04-monitoring/grafana.md:
--------------------------------------------------------------------------------
1 | ## Original queries
2 |
3 | Response Time Panel:
4 |
5 | ```sql
6 | SELECT
7 | timestamp AS time,
8 | response_time
9 | FROM conversations
10 | ORDER BY timestamp
11 | ```
12 |
13 | Relevance Distribution Panel:
14 |
15 | ```sql
16 | SELECT
17 | relevance,
18 | COUNT(*) as count
19 | FROM conversations
20 | GROUP BY relevance
21 | ```
22 |
23 | Model Usage Panel:
24 |
25 | ```sql
26 | SELECT
27 | model_used,
28 | COUNT(*) as count
29 | FROM conversations
30 | GROUP BY model_used
31 | ```
32 |
33 |
34 | Token Usage Panel:
35 |
36 | ```sql
37 | SELECT
38 | timestamp AS time,
39 | total_tokens
40 | FROM conversations
41 | ORDER BY timestamp
42 | ```
43 |
44 | OpenAI Cost Panel:
45 |
46 | ```sql
47 | SELECT
48 | timestamp AS time,
49 | openai_cost
50 | FROM conversations
51 | WHERE openai_cost > 0
52 | ORDER BY timestamp
53 | ```
54 |
55 | Recent Conversations Panel:
56 |
57 | ```sql
58 | SELECT
59 | timestamp AS time,
60 | question,
61 | answer,
62 | relevance
63 | FROM conversations
64 | ORDER BY timestamp DESC
65 | LIMIT 5
66 | ```
67 |
68 | Feedback Statistics Panel:
69 |
70 | ```sql
71 | SELECT
72 | SUM(CASE WHEN feedback > 0 THEN 1 ELSE 0 END) as thumbs_up,
73 | SUM(CASE WHEN feedback < 0 THEN 1 ELSE 0 END) as thumbs_down
74 | FROM feedback
75 | ```
76 |
77 | ## Revised queries
78 |
79 | We can (and should) also use special Grafana variables:
80 |
81 | - `$__timeFrom()` and `$__timeTo()`: Start and end of the selected time range
82 | - `$__timeGroup(timestamp, $__interval)`: Groups results by time intervals automatically calculated by Grafana
83 |
84 | ### 1. Response Time Panel
85 |
86 | This query shows the response time for each conversation within the selected time range:
87 |
88 | ```sql
89 | SELECT
90 | timestamp AS time,
91 | response_time
92 | FROM conversations
93 | WHERE timestamp BETWEEN $__timeFrom() AND $__timeTo()
94 | ORDER BY timestamp
95 | ```
96 |
97 | ### 2. Relevance Distribution Panel
98 |
99 | This query counts the number of conversations for each relevance category within the selected time range:
100 |
101 | ```sql
102 | SELECT
103 | relevance,
104 | COUNT(*) as count
105 | FROM conversations
106 | WHERE timestamp BETWEEN $__timeFrom() AND $__timeTo()
107 | GROUP BY relevance
108 | ```
109 |
110 | ### 3. Model Usage Panel
111 |
112 | This query counts the number of times each model was used within the selected time range:
113 |
114 | ```sql
115 | SELECT
116 | model_used,
117 | COUNT(*) as count
118 | FROM conversations
119 | WHERE timestamp BETWEEN $__timeFrom() AND $__timeTo()
120 | GROUP BY model_used
121 | ```
122 |
123 | ### 4. Token Usage Panel
124 |
125 | This query shows the average token usage over time, grouped by Grafana's automatically calculated interval:
126 |
127 | ```sql
128 | SELECT
129 | $__timeGroup(timestamp, $__interval) AS time,
130 | AVG(total_tokens) AS avg_tokens
131 | FROM conversations
132 | WHERE timestamp BETWEEN $__timeFrom() AND $__timeTo()
133 | GROUP BY 1
134 | ORDER BY 1
135 | ```
136 |
137 | ### 5. OpenAI Cost Panel
138 |
139 | This query shows the total OpenAI cost over time, grouped by Grafana's automatically calculated interval:
140 |
141 | ```sql
142 | SELECT
143 | $__timeGroup(timestamp, $__interval) AS time,
144 | SUM(openai_cost) AS total_cost
145 | FROM conversations
146 | WHERE timestamp BETWEEN $__timeFrom() AND $__timeTo()
147 | AND openai_cost > 0
148 | GROUP BY 1
149 | ORDER BY 1
150 | ```
151 |
152 | ### 6. Recent Conversations Panel
153 |
154 | This query retrieves the 5 most recent conversations within the selected time range:
155 |
156 | ```sql
157 | SELECT
158 | timestamp AS time,
159 | question,
160 | answer,
161 | relevance
162 | FROM conversations
163 | WHERE timestamp BETWEEN $__timeFrom() AND $__timeTo()
164 | ORDER BY timestamp DESC
165 | LIMIT 5
166 | ```
167 |
168 | ### 7. Feedback Statistics Panel
169 |
170 | This query calculates the total number of positive and negative feedback within the selected time range:
171 |
172 | ```sql
173 | SELECT
174 | SUM(CASE WHEN feedback > 0 THEN 1 ELSE 0 END) as thumbs_up,
175 | SUM(CASE WHEN feedback < 0 THEN 1 ELSE 0 END) as thumbs_down
176 | FROM feedback
177 | WHERE timestamp BETWEEN $__timeFrom() AND $__timeTo()
178 | ```
--------------------------------------------------------------------------------
/cohorts/2024/04-monitoring/homework.md:
--------------------------------------------------------------------------------
1 | ## Homework: Evaluation and Monitoring
2 |
3 | Solution: [solution.ipynb](solution.ipynb)
4 |
5 | In this homework, we'll evaluate the quality of our RAG system.
6 |
7 | > It's possible that your answers won't match exactly. If it's the case, select the closest one.
8 |
9 | Solution:
10 |
11 | * Video: TBA
12 | * Notebook: TBA
13 |
14 | ## Getting the data
15 |
16 | Let's start by getting the dataset. We will use the data we generated in the module.
17 |
18 | In particular, we'll evaluate the quality of our RAG system
19 | with [gpt-4o-mini](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/04-monitoring/data/results-gpt4o-mini.csv)
20 |
21 |
22 | Read it:
23 |
24 | ```python
25 | url = f'{github_url}?raw=1'  # github_url is the link to results-gpt4o-mini.csv above
26 | df = pd.read_csv(url)
27 | ```
28 |
29 | We will use only the first 300 documents:
30 |
31 |
32 | ```python
33 | df = df.iloc[:300]
34 | ```
35 |
36 | ## Q1. Getting the embeddings model
37 |
38 | Now, get the embeddings model `multi-qa-mpnet-base-dot-v1` from
39 | [the Sentence Transformer library](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html#model-overview)
40 |
41 | > Note: this is not the same model as in HW3
42 |
43 | ```python
44 | from sentence_transformers import SentenceTransformer
45 | embedding_model = SentenceTransformer('multi-qa-mpnet-base-dot-v1')
46 | ```
47 |
48 | Create the embeddings for the first LLM answer:
49 |
50 | ```python
51 | answer_llm = df.iloc[0].answer_llm
52 | ```
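A minimal sketch of the next step (assuming `embedding_model` was created as above):

```python
# encode the first LLM answer and look at the first component of the vector
v = embedding_model.encode(answer_llm)
print(v[0])
```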
53 |
54 | What's the first value of the resulting vector?
55 |
56 | * -0.42
57 | * -0.22
58 | * -0.02
59 | * 0.21
60 |
61 |
62 | ## Q2. Computing the dot product
63 |
64 |
65 | Now, for each answer pair, let's create embeddings and compute the dot product between them.
66 |
67 | We will put the results (scores) into the `evaluations` list
68 |
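As a rough sketch, this could look like the following (assuming the dataframe has `answer_llm` and `answer_orig` columns, as used in the ROUGE questions below):

```python
# dot product between the embeddings of each (LLM answer, original answer) pair
evaluations = []

for _, row in df.iterrows():
    v_llm = embedding_model.encode(row.answer_llm)
    v_orig = embedding_model.encode(row.answer_orig)
    evaluations.append(v_llm.dot(v_orig))
```

The 75th percentile can then be read off with `np.percentile(evaluations, 75)` or `pd.Series(evaluations).describe()`.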
69 | What's the 75th percentile of the scores?
70 |
71 | * 21.67
72 | * 31.67
73 | * 41.67
74 | * 51.67
75 |
76 | ## Q3. Computing the cosine
77 |
78 | From Q2, we can see that the results are not within the [0, 1] range. It's because the vectors coming from this model are not normalized.
79 |
80 | So we need to normalize them.
81 |
82 | To do it, we
83 |
84 | * Compute the norm of a vector
85 | * Divide each element by this norm
86 |
87 | So, for vector `v`, it'll be `v / ||v||`
88 |
89 | In numpy, this is how you do it:
90 |
91 | ```python
92 | norm = np.sqrt((v * v).sum())
93 | v_norm = v / norm
94 | ```
95 |
96 | Let's put it into a function and then compute dot product
97 | between normalized vectors. This will give us cosine similarity
98 |
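A minimal sketch of such a function, following the numpy snippet above:

```python
import numpy as np

def normalize(v):
    # divide the vector by its L2 norm so its length becomes 1
    norm = np.sqrt((v * v).sum())
    return v / norm

def cosine(u, v):
    # dot product of two normalized vectors is their cosine similarity
    return normalize(u).dot(normalize(v))
```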
99 | What's the 75th percentile of the cosine similarity scores?
100 |
101 | * 0.63
102 | * 0.73
103 | * 0.83
104 | * 0.93
105 |
106 | ## Q4. Rouge
107 |
108 | Now we will explore an alternative metric - the ROUGE score.
109 |
110 | This is a set of metrics that compares two answers based on the overlap of n-grams, word sequences, and word pairs.
111 |
112 | It can give a more nuanced view of text similarity than just cosine similarity alone.
113 |
114 | We don't need to implement it ourselves, there's a python package for it:
115 |
116 | ```bash
117 | pip install rouge
118 | ```
119 |
120 | (The latest version at the moment of writing is `1.0.1`)
121 |
122 | Let's compute the ROUGE score between the answers at the index 10 of our dataframe (`doc_id=5170565b`)
123 |
124 | ```python
125 | from rouge import Rouge
126 | rouge_scorer = Rouge()
127 | r = df.iloc[10]
128 | scores = rouge_scorer.get_scores(r['answer_llm'], r['answer_orig'])[0]
129 | ```
130 |
131 | There are three scores: `rouge-1`, `rouge-2` and `rouge-l`, and precision, recall and F1 score for each.
132 |
133 | * `rouge-1` - the overlap of unigrams,
134 | * `rouge-2` - bigrams,
135 | * `rouge-l` - the longest common subsequence
136 |
137 | What's the F score for `rouge-1`?
138 |
139 | - 0.35
140 | - 0.45
141 | - 0.55
142 | - 0.65
143 |
144 | ## Q5. Average rouge score
145 |
146 | Let's compute the average F-score between `rouge-1`, `rouge-2` and `rouge-l` for the same record from Q4
147 |
148 | - 0.35
149 | - 0.45
150 | - 0.55
151 | - 0.65
152 |
153 | ## Q6. Average rouge score for all the data points
154 |
155 | Now let's compute the F-score for all the records and create a dataframe from them.
156 |
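A rough sketch of this computation (assuming `rouge_scorer` from Q4 and the same `answer_llm` / `answer_orig` columns):

```python
import pandas as pd

# collect the three ROUGE F-scores for every record
rouge_scores = []

for _, row in df.iterrows():
    scores = rouge_scorer.get_scores(row.answer_llm, row.answer_orig)[0]
    rouge_scores.append({
        'rouge_1': scores['rouge-1']['f'],
        'rouge_2': scores['rouge-2']['f'],
        'rouge_l': scores['rouge-l']['f'],
    })

df_rouge = pd.DataFrame(rouge_scores)
print(df_rouge['rouge_2'].mean())
```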
157 | What's the average F-score in `rouge_2` across all the records?
158 |
159 | - 0.10
160 | - 0.20
161 | - 0.30
162 | - 0.40
163 |
164 |
165 |
166 | ## Submit the results
167 |
168 | * Submit your results here: https://courses.datatalks.club/llm-zoomcamp-2024/homework/hw4
169 | * It's possible that your answers won't match exactly. If it's the case, select the closest one.
170 |
--------------------------------------------------------------------------------
/cohorts/2024/05-orchestration/README.md:
--------------------------------------------------------------------------------
1 | # Data Preparation in RAG
2 |
3 | ## Getting started
4 |
5 | 1. Clone [repository](https://github.com/mage-ai/rag-project)
6 | ```bash
7 | git clone https://github.com/mage-ai/rag-project
8 | cd rag-project
9 | ```
10 | 2. Navigate to the `rag-project/llm` directory and add `spacy` to its `requirements.txt`.
11 | 3. Then update the `Dockerfile` found in the `rag-project` directory with the following:
12 | ```dockerfile
13 | RUN python -m spacy download en_core_web_sm
14 | ```
15 | 4. Run:
16 | 
17 | ```bash
18 | ./scripts/start.sh
19 | ```
20 |
21 | Once started, go to [http://localhost:6789/](http://localhost:6789/)
22 |
23 | For more setup information, refer to these [instructions](https://docs.mage.ai/getting-started/setup#docker-compose-template)
24 |
25 |
26 | ## 0. Module overview
27 |
28 |
29 |
30 |
31 |
32 | ## 1. Ingest
33 |
34 | In this section, we cover the ingestion of documents from a single data source.
35 |
36 |
37 |
38 |
39 |
40 | * [Code](https://github.com/mage-ai/rag-project/blob/master/llm/rager/data_loaders/runic_oblivion.py)
41 | * [Document link for API Data Loader](https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/01-intro/documents.json)
42 |
43 | ## 2. Chunk
44 |
45 | Once data is ingested, we break it into manageable chunks.
46 |
47 | The Q&A data is already chunked - the texts are small
48 | and easy to process and index. But other datasets might
49 | not be (book texts, transcripts, etc).
50 |
51 | In this video, we will talk about turning large texts
52 | into smaller documents - i.e. chunking.
53 |
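The actual implementation lives in the Mage transformer linked below; purely as an illustration of the idea, a naive fixed-size chunker with overlap could look like this (the function name and sizes are made up for the example):

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split a long text into overlapping character chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```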
54 |
55 |
56 |
57 |
58 |
59 | [Code](https://github.com/mage-ai/rag-project/blob/master/llm/rager/transformers/radiant_photon.py)
60 |
61 | ## 3. Tokenization
62 |
63 | Tokenization is a crucial step in text processing and preparing the data for effective retrieval.
64 |
65 |
66 |
67 |
68 |
69 | [Code](https://github.com/mage-ai/rag-project/blob/master/llm/rager/transformers/vivid_nexus.py)
70 |
71 | ## 4. Embed
72 |
73 | Embedding data translates text into numerical vectors that can be processed by models.
74 |
75 | Previously we used sentence transformers for that. In this video we show a different strategy for it.
76 |
77 |
78 |
79 |
80 |
81 |
82 |
83 | [Code](https://github.com/mage-ai/rag-project/blob/master/llm/rager/transformers/prismatic_axiom.py)
84 |
85 |
86 | ## 5. Export
87 |
88 | After processing, data needs to be exported for storage so that it can be retrieved for better contextualization of user queries.
89 |
90 | Here we will save the embeddings to Elasticsearch.
91 | 
92 | Please make sure to use the name of your Elasticsearch service from your Docker Compose file, followed by the port, as the connection string, e.g.:
93 | 
94 | `http://elasticsearch:9200`
95 |
96 |
97 |
98 |
99 |
100 |
101 | [Code](https://github.com/mage-ai/rag-project/blob/master/llm/rager/data_exporters/numinous_fission.py)
102 |
103 | ## 6. Retrieval: Test Vector Search Query
104 |
105 | After exporting the chunks and embeddings, we can test the search query to retrieve relevant documents on sample queries.
106 |
107 |
108 |
109 |
110 |
111 | [Code](code/06_retrieval.py)
112 |
113 | ## 7. Trigger Daily Runs
114 |
115 | Automation is key to maintaining and updating your system.
116 | This section demonstrates how to schedule and trigger daily runs for your data pipelines, ensuring up-to-date and consistent data processing.
117 |
118 |
119 |
120 |
121 |
122 | ## Homework
123 |
124 | See [here](../cohorts/2024/05-orchestration/homework.md).
125 |
126 | # Notes
127 |
128 | * First link goes here
129 | * [Notes by Abiodun: Mage RAG error fixes](https://github.com/AOGbadamosi2018/llm-zoomcamp/blob/main/06%20-%20orchestration/mage_rag_notes.md).
130 | * Did you take notes? Add them above this line (Send a PR with *links* to your notes)
131 |
--------------------------------------------------------------------------------
/cohorts/2024/05-orchestration/code/06_retrieval.py:
--------------------------------------------------------------------------------
1 | from typing import Dict, List, Union
2 |
3 | import numpy as np
4 | from elasticsearch import Elasticsearch, exceptions
5 | if 'data_loader' not in globals():
6 |     from mage_ai.data_preparation.decorators import data_loader  # provided by Mage at runtime
7 | SAMPLE__EMBEDDINGS = [
8 | [-0.1465761959552765, -0.4822517931461334, 0.07130702584981918, -0.25872930884361267, -0.1563894897699356, 0.16641047596931458, 0.24484659731388092, 0.2410498708486557, 0.008032954297959805, 0.17045290768146515, -0.009397129528224468, 0.09619587659835815, -0.22729521989822388, 0.10254761576652527, 0.016890447586774826, -0.13290464878082275, 0.11240798979997635, -0.11204371601343155, -0.057132963091135025, -0.011206787079572678, -0.007982085458934307, 0.279083788394928, 0.20115645229816437, -0.1427406221628189, -0.19398854672908783, -0.035979654639959335, 0.20723149180412292, 0.29891034960746765, 0.21407313644886017, 0.09746530652046204, 0.1671638935804367, 0.08161208778619766, 0.3090828061103821, -0.20648667216300964, 0.48498260974884033, -0.12691514194011688, 0.518856406211853, -0.26291757822036743, -0.0949832871556282, 0.09556109458208084, -0.20844918489456177, 0.2685297429561615, 0.053442806005477905, 0.05103180184960365, 0.1029752567410469, 0.04935301095247269, -0.11679927259683609, -0.012528933584690094, -0.08489680290222168, 0.013589601963758469, -0.32059246301651, 0.10357264429330826, -0.09533575177192688, 0.02984568662941456, 0.2793693542480469, -0.2653750777244568, -0.24152781069278717, -0.3563413619995117, 0.09674381464719772, -0.26155123114585876, -0.1397126317024231, -0.009133181534707546, 0.05972130224108696, -0.10438819974660873, 0.21889159083366394, 0.0694752112030983, -0.1312003880739212, -0.31072548031806946, -0.002836169209331274, 0.2468366175889969, 0.09420009702444077, 0.1284026801586151, -0.03227006644010544, -0.012532072141766548, 0.6650756597518921, -0.14863784611225128, 0.005239118821918964, -0.3317912817001343, 0.16372767090797424, -0.20166568458080292, 0.029721004888415337, -0.18536655604839325, -0.3608534038066864, -0.18234892189502716, 0.019248824566602707, 0.25257956981658936, 0.09671413153409958, 0.15569280087947845, -0.38228726387023926, 0.37017977237701416, 0.03356296569108963, -0.21182948350906372, 0.48848846554756165, 0.18350018560886383, -0.23519110679626465, -0.17464864253997803], [-0.18246106803417206, -0.36036479473114014, 0.3282334506511688, -0.230922132730484, 0.09600532799959183, 0.6859422326087952, 0.0581890344619751, 0.4913463294506073, 0.1536773443222046, -0.2965141832828522, 0.08466599136590958, 0.319297194480896, -0.15651769936084747, -0.043428342789411545, 0.014402368105947971, 0.16681505739688873, 0.22521673142910004, -0.2715776264667511, -0.11033261567354202, -0.04398636147379875, 0.3480629622936249, 0.11897992342710495, 0.8724615573883057, 0.10258488357067108, -0.5719427466392517, -0.03029855526983738, 0.23351268470287323, 0.20660561323165894, 0.575685441493988, -0.12116186320781708, 0.18459142744541168, -0.12865227460861206, 0.3948173522949219, -0.34464019536972046, 0.6699116230010986, -0.45167359709739685, 1.1505522727966309, -0.4498964548110962, -0.3248189687728882, -0.29674994945526123, -0.3570491075515747, 0.5436431765556335, 0.49576905369758606, -0.11180296540260315, -0.02045607566833496, -0.22768598794937134, -0.37912657856941223, -0.30414703488349915, -0.48289090394973755, -0.04158346354961395, -0.3547952473163605, 0.0687602087855339, 0.041512664407491684, 0.33524179458618164, 0.21826978027820587, -0.443082332611084, -0.5049593448638916, -0.5298929810523987, -0.02618088759481907, -0.2748631536960602, -0.1986193209886551, 0.35475826263427734, 0.22456413507461548, -0.29532068967819214, 0.25150877237319946, 0.243370920419693, -0.29938358068466187, -0.2128247618675232, -0.15292000770568848, -0.14813245832920074, 
-0.06183856353163719, -0.1251668632030487, 0.14256533980369568, -0.22781267762184143, 0.8101184964179993, 0.19796361029148102, 0.09104947745800018, -0.4860817790031433, 0.3078012764453888, -0.27373194694519043, 0.11800770461559296, -0.45869407057762146, 0.09508189558982849, -0.23971715569496155, -0.27427223324775696, 0.5139415264129639, 0.1871502846479416, 0.06647063046693802, -0.4054469168186188, 0.4751380681991577, 0.17067894339561462, 0.12443914264440536, 0.3577817678451538, 0.10574143379926682, -0.3181760311126709, -0.23804502189159393]
9 | ]
10 |
11 |
12 | @data_loader
13 | def search(*args, **kwargs) -> List[Dict]:
14 | """
15 | query_embedding: Union[List[int], np.ndarray]
16 | """
17 |
18 | connection_string = kwargs.get('connection_string', 'http://localhost:9200')
19 | index_name = kwargs.get('index_name', 'documents')
20 | source = kwargs.get('source', "cosineSimilarity(params.query_vector, 'embedding') + 1.0")
21 | top_k = kwargs.get('top_k', 5)
22 | chunk_column = kwargs.get('chunk_column', 'content')
23 |
24 | query_embedding = None
25 | if len(args):
26 | query_embedding = args[0]
27 | if not query_embedding:
28 | query_embedding = SAMPLE__EMBEDDINGS[0]
29 |
30 | if isinstance(query_embedding, np.ndarray):
31 | query_embedding = query_embedding.tolist()
32 |
33 | script_query = {
34 | "script_score": {
35 | "query": {"match_all": {}},
36 | "script": {
37 | "source": source,
38 | "params": {"query_vector": query_embedding},
39 | }
40 | }
41 | }
42 |
43 | print("Sending script query:", script_query)
44 |
45 | es_client = Elasticsearch(connection_string)
46 |
47 | try:
48 | response = es_client.search(
49 | index=index_name,
50 | body={
51 | "size": top_k,
52 | "query": script_query,
53 | "_source": [chunk_column],
54 | },
55 | )
56 |
57 | print("Raw response from Elasticsearch:", response)
58 |
59 | return [hit['_source'][chunk_column] for hit in response['hits']['hits']]
60 |
61 | except exceptions.BadRequestError as e:
62 | print(f"BadRequestError: {e.info}")
63 | return []
64 | except Exception as e:
65 | print(f"Unexpected error: {e}")
66 | return []
67 |
--------------------------------------------------------------------------------
/cohorts/2024/05-orchestration/homework.md:
--------------------------------------------------------------------------------
1 | ## Homework: LLM Orchestration and Ingestion
2 |
3 | > It's possible that your answers won't match exactly. If it's the case, select the closest one.
4 |
5 | Our FAQ documents change with time: students add more records
6 | and edit existing ones. We need to keep our index in sync.
7 |
8 | There are two ways of doing it:
9 |
10 | 1. Incremental: you only update records that got changed, created or deleted
11 | 2. Full update: you recreate the entire index from scratch
12 |
13 | In this homework, we'll look at full update. We will run our
14 | indexing pipeline daily and re-create the index from scratch
15 | each time we run.
16 |
17 |
18 | For that, we created two FAQ documents for LLM Zoomcamp
19 |
20 | * [version 1](https://docs.google.com/document/d/1qZjwHkvP0lXHiE4zdbWyUXSVfmVGzougDD6N37bat3E/edit)
21 | * [version 2](https://docs.google.com/document/d/1T3MdwUvqCL3jrh3d3VCXQ8xE0UqRzI3bfgpfBq3ZWG0/edit)
22 |
23 | First, we will run our ingestion pipeline with version 1
24 | and then with version 2.
25 |
26 | ## Q1. Running Mage
27 |
28 | Clone the same repo we used in the module and run mage:
29 |
30 |
31 | ```bash
32 | git clone https://github.com/mage-ai/rag-project
33 | ```
34 |
35 | Add the following libraries to the requirements document:
36 |
37 | ```
38 | python-docx
39 | elasticsearch
40 | ```
41 |
42 | Make sure you use the latest version of mage:
43 |
44 | ```bash
45 | docker pull mageai/mageai:llm
46 | ```
47 |
48 | Start it:
49 |
50 | ```bash
51 | ./scripts/start.sh
52 | ```
53 |
54 | Now mage is running on [http://localhost:6789/](http://localhost:6789/)
55 |
56 | What's the version of mage?
57 |
58 | ## Creating a RAG pipeline
59 |
60 | Create a RAG pipeline
61 |
62 |
63 | ## Q2. Reading the documents
64 |
65 | Now we can ingest the documents. Create a custom code ingestion
66 | block
67 |
68 | Let's read the documents. We will use the same code we used
69 | for parsing FAQ: [parse-faq-llm.ipynb](parse-faq-llm.ipynb)
70 |
71 |
72 | Use the following document_id: 1qZjwHkvP0lXHiE4zdbWyUXSVfmVGzougDD6N37bat3E
73 |
74 | Which is the document ID of
75 | [LLM FAQ version 1](https://docs.google.com/document/d/1qZjwHkvP0lXHiE4zdbWyUXSVfmVGzougDD6N37bat3E/edit)
76 |
77 | Copy the code to the editor
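As a sketch, the ingestion block might look like this (assuming `read_faq` and its helpers from the notebook are pasted above the decorated function; the function name is arbitrary):

```python
@data_loader
def ingest_faq(*args, **kwargs):
    faq_documents = {
        'llm-zoomcamp': '1qZjwHkvP0lXHiE4zdbWyUXSVfmVGzougDD6N37bat3E',  # FAQ version 1
    }

    documents = []
    for course, file_id in faq_documents.items():
        course_documents = read_faq(file_id)
        documents.append({'course': course, 'documents': course_documents})

    print(len(documents))  # number of FAQ documents processed
    return documents
```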
78 | How many FAQ documents did we process?
79 |
80 | * 1
81 | * 2
82 | * 3
83 | * 4
84 |
85 | ## Q3. Chunking
86 |
87 | We don't really need to do any chunking because our documents
88 | already have well-specified boundaries. So we just need
89 | to return the documents without any changes.
90 |
91 | So let's go to the transformation part and add a custom code
92 | chunking block:
93 |
94 | ```python
95 | documents = []
96 |
97 | for doc in data['documents']:
98 | doc['course'] = data['course']
99 | # previously we used just "id" for document ID
100 | doc['document_id'] = generate_document_id(doc)
101 | documents.append(doc)
102 |
103 | print(len(documents))
104 |
105 | return documents
106 | ```
107 |
108 |
109 | Where `data` is the input parameter to the transformer.
110 |
111 | And the `generate_document_id` is defined in the same way
112 | as in module 4:
113 |
114 | ```python
115 | import hashlib
116 |
117 | def generate_document_id(doc):
118 | combined = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}"
119 | hash_object = hashlib.md5(combined.encode())
120 | hash_hex = hash_object.hexdigest()
121 | document_id = hash_hex[:8]
122 | return document_id
123 | ```
124 |
125 | Note: if instead of a single dictionary you get a list,
126 | add a for loop:
127 |
128 | ```python
129 | for course_dict in data:
130 | ...
131 | ```
132 |
133 | You can check the type of `data` with this code:
134 |
135 | ```python
136 | print(type(data))
137 | ```
138 |
139 | How many documents (chunks) do we have in the output?
140 |
141 | * 66
142 | * 76
143 | * 86
144 | * 96
145 |
146 |
147 |
148 | ## Tokenization and embeddings
149 |
150 | We don't need any tokenization, so we skip it.
151 |
152 | Because this step is currently required in mage, we can create
153 | a dummy code block:
154 |
155 | * Create a custom code block
156 | * Don't change it
157 |
158 | Because we will use text search, we also don't need embeddings,
159 | so skip it too.
160 |
161 | If you want to use sentence transformers - the ones from module
162 | 3 - you don't need tokenization, but need embeddings
163 | (you don't need it for this homework)
164 |
165 |
166 | ## Q4. Export
167 |
168 | Now we're ready to index the data with elasticsearch. For that,
169 | we use the Export part of the pipeline
170 |
171 | * Go to the Export part
172 | * Select vector databases -> Elasticsearch
173 | * Open the code for editing
174 |
175 | Because we won't use vector search, but usual text search, we
176 | will need to adjust the code.
177 |
178 | First, let's change the line where we read the index name:
179 |
180 | ```python
181 | index_name = kwargs.get('index_name', 'documents')
182 | ```
183 |
184 | We rename it to `index_name_prefix` - we will parametrize the index name with the day
185 | and time we run the pipeline
186 |
187 | ```python
188 | from datetime import datetime
189 |
190 | index_name_prefix = kwargs.get('index_name', 'documents')
191 | current_time = datetime.now().strftime("%Y%m%d_%M%S")
192 | index_name = f"{index_name_prefix}_{current_time}"
193 | print("index name:", index_name)
194 | ```
195 |
196 |
197 | We will need to save the name in a global variable, so it can be accessible in other code blocks
198 |
199 | ```python
200 | from mage_ai.data_preparation.variable_manager import set_global_variable
201 |
202 | set_global_variable('YOUR_PIPELINE_NAME', 'index_name', index_name)
203 | ```
204 |
205 | Where your pipeline name is the name of the pipeline, e.g.
206 | `transcendent_nexus` (replace the space with underscore `_`)
207 |
208 |
209 |
210 | Replace index settings with the settings we used previously:
211 |
212 | ```python
213 | index_settings = {
214 | "settings": {
215 | "number_of_shards": number_of_shards,
216 | "number_of_replicas": number_of_replicas
217 | },
218 | "mappings": {
219 | "properties": {
220 | "text": {"type": "text"},
221 | "section": {"type": "text"},
222 | "question": {"type": "text"},
223 | "course": {"type": "keyword"},
224 | "document_id": {"type": "keyword"}
225 | }
226 | }
227 | }
228 | ```
229 |
230 | Remove the embeddings line:
231 |
232 | ```python
233 | if isinstance(document[vector_column_name], np.ndarray):
234 | document[vector_column_name] = document[vector_column_name].tolist()
235 | ```
236 |
237 | At the end (outside of the indexing for loop), print the last document:
238 |
239 | ```python
240 | print(document)
241 | ```
242 |
243 | Now execute the block.
244 |
245 | What's the last document id?
246 |
247 | Also note the index name.
248 |
249 |
250 | ## Q5. Testing the retrieval
251 |
252 | Now let's test the retrieval. Use mage or jupyter notebook to
253 | test it.
254 |
255 | Let's use the following query: "When is the next cohort?"
256 |
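A sketch of how this could be checked from a notebook (assuming a local Elasticsearch and the index name you noted in Q4 stored in `index_name`):

```python
from elasticsearch import Elasticsearch

es_client = Elasticsearch('http://localhost:9200')

query = "When is the next cohort?"

search_query = {
    "size": 1,
    "query": {
        "multi_match": {
            "query": query,
            "fields": ["question^3", "text", "section"],
            "type": "best_fields"
        }
    }
}

response = es_client.search(index=index_name, body=search_query)
print(response['hits']['hits'][0]['_source']['document_id'])
```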
257 | What's the ID of the top matching result?
258 |
259 |
260 | ## Q6. Reindexing
261 |
262 | Our FAQ document changes: every day course participants add
263 | new records or improve existing ones.
264 |
265 | Imagine some time passed and the document changed. For that we have another version of the FAQ document: [version 2](https://docs.google.com/document/d/1T3MdwUvqCL3jrh3d3VCXQ8xE0UqRzI3bfgpfBq3ZWG0/edit).
266 |
267 | The ID of this document is `1T3MdwUvqCL3jrh3d3VCXQ8xE0UqRzI3bfgpfBq3ZWG0`.
268 |
269 | Let's re-execute the entire pipeline with the updated data.
270 |
271 | For the same query "When is the next cohort?". What's the ID of the top matching result?
272 |
273 |
274 |
275 | ## Submit the results
276 |
277 | * Submit your results here: https://courses.datatalks.club/llm-zoomcamp-2024/homework/hw5
278 | * It's possible that your answers won't match exactly. If it's the case, select the closest one.
279 |
--------------------------------------------------------------------------------
/cohorts/2024/05-orchestration/parse-faq-llm.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "id": "4cd1eaa8-3424-41ad-9cf2-3e8548712865",
7 | "metadata": {},
8 | "outputs": [],
9 | "source": [
10 | "import io\n",
11 | "\n",
12 | "import requests\n",
13 | "import docx"
14 | ]
15 | },
16 | {
17 | "cell_type": "code",
18 | "execution_count": 2,
19 | "id": "8180e7e4-b90d-4900-a59b-d22e5d6537c4",
20 | "metadata": {},
21 | "outputs": [],
22 | "source": [
23 | "def clean_line(line):\n",
24 | " line = line.strip()\n",
25 | " line = line.strip('\\uFEFF')\n",
26 | " return line\n",
27 | "\n",
28 | "def read_faq(file_id):\n",
29 | " url = f'https://docs.google.com/document/d/{file_id}/export?format=docx'\n",
30 | " \n",
31 | " response = requests.get(url)\n",
32 | " response.raise_for_status()\n",
33 | " \n",
34 | " with io.BytesIO(response.content) as f_in:\n",
35 | " doc = docx.Document(f_in)\n",
36 | "\n",
37 | " questions = []\n",
38 | "\n",
39 | " question_heading_style = 'heading 2'\n",
40 | " section_heading_style = 'heading 1'\n",
41 | " \n",
42 | " heading_id = ''\n",
43 | " section_title = ''\n",
44 | " question_title = ''\n",
45 | " answer_text_so_far = ''\n",
46 | " \n",
47 | " for p in doc.paragraphs:\n",
48 | " style = p.style.name.lower()\n",
49 | " p_text = clean_line(p.text)\n",
50 | " \n",
51 | " if len(p_text) == 0:\n",
52 | " continue\n",
53 | " \n",
54 | " if style == section_heading_style:\n",
55 | " section_title = p_text\n",
56 | " continue\n",
57 | " \n",
58 | " if style == question_heading_style:\n",
59 | " answer_text_so_far = answer_text_so_far.strip()\n",
60 | " if answer_text_so_far != '' and section_title != '' and question_title != '':\n",
61 | " questions.append({\n",
62 | " 'text': answer_text_so_far,\n",
63 | " 'section': section_title,\n",
64 | " 'question': question_title,\n",
65 | " })\n",
66 | " answer_text_so_far = ''\n",
67 | " \n",
68 | " question_title = p_text\n",
69 | " continue\n",
70 | " \n",
71 | " answer_text_so_far += '\\n' + p_text\n",
72 | " \n",
73 | " answer_text_so_far = answer_text_so_far.strip()\n",
74 | " if answer_text_so_far != '' and section_title != '' and question_title != '':\n",
75 | " questions.append({\n",
76 | " 'text': answer_text_so_far,\n",
77 | " 'section': section_title,\n",
78 | " 'question': question_title,\n",
79 | " })\n",
80 | "\n",
81 | " return questions"
82 | ]
83 | },
84 | {
85 | "cell_type": "code",
86 | "execution_count": 3,
87 | "id": "7d3c2dd7-f64a-4dc7-a4e3-3e8aadfa720f",
88 | "metadata": {},
89 | "outputs": [],
90 | "source": [
91 | "faq_documents = {\n",
92 | " 'llm-zoomcamp': '1m2KexowAXTmexfC5rVTCSnaShvdUQ8Ag2IEiwBDHxN0',\n",
93 | "}"
94 | ]
95 | },
96 | {
97 | "cell_type": "code",
98 | "execution_count": 4,
99 | "id": "f94efe26-05e8-4ae5-a0fa-0a8e16852816",
100 | "metadata": {},
101 | "outputs": [
102 | {
103 | "name": "stdout",
104 | "output_type": "stream",
105 | "text": [
106 | "llm-zoomcamp\n"
107 | ]
108 | }
109 | ],
110 | "source": [
111 | "documents = []\n",
112 | "\n",
113 | "for course, file_id in faq_documents.items():\n",
114 | " print(course)\n",
115 | " course_documents = read_faq(file_id)\n",
116 | " documents.append({'course': course, 'documents': course_documents})"
117 | ]
118 | },
119 | {
120 | "cell_type": "code",
121 | "execution_count": null,
122 | "id": "1b21af5c-2f6d-49e7-92e9-ca229e2473b9",
123 | "metadata": {},
124 | "outputs": [],
125 | "source": []
126 | }
127 | ],
128 | "metadata": {
129 | "kernelspec": {
130 | "display_name": "Python 3 (ipykernel)",
131 | "language": "python",
132 | "name": "python3"
133 | },
134 | "language_info": {
135 | "codemirror_mode": {
136 | "name": "ipython",
137 | "version": 3
138 | },
139 | "file_extension": ".py",
140 | "mimetype": "text/x-python",
141 | "name": "python",
142 | "nbconvert_exporter": "python",
143 | "pygments_lexer": "ipython3",
144 | "version": "3.12.3"
145 | }
146 | },
147 | "nbformat": 4,
148 | "nbformat_minor": 5
149 | }
150 |
--------------------------------------------------------------------------------
/cohorts/2024/README.md:
--------------------------------------------------------------------------------
1 | # LLM Zoomcamp 2024 Edition
2 |
3 |
4 | Important links:
5 |
6 | - [Pre-Course Q&A Stream](https://www.youtube.com/watch?v=YuxVHZ88hfg&list=PL3MmuxUbc_hKiIVNf7DeEt_tGjypOYtKV)
7 | - [Course Launch Stream](https://www.youtube.com/watch?v=ifpqpB1ksGc)
8 | - [2024 Edition Youtube Playlist](https://www.youtube.com/playlist?list=PL3MmuxUbc_hKiIVNf7DeEt_tGjypOYtKV)
9 | - [Course management platform](https://courses.datatalks.club/llm-zoomcamp-2024/)
10 |
11 | [**LLM Zoomcamp 2024 Competition**](https://github.com/DataTalksClub/llm-zoomcamp/tree/main/cohorts/2024/competition#llm-zoomcamp-2024-competition)
12 |
13 | * Deadline: September 30, 2024
14 | * Practice and receive additional points for the course!
15 |
16 | [**Module 1: Introduction to LLMs and RAG**](01-intro)
17 |
18 | * [Homework](01-intro/homework.md)
19 | * [Solution](01-intro/homework_solution.ipynb)
20 |
21 | [**Module 2: Open-source LLMs and self-hosting LLMs**](02-open-source)
22 |
23 | * [Homework and solution](02-open-source/homework.md)
24 |
25 | [**Module 3: Vector databases**](03-vector-search)
26 |
27 | * [Homework and solution](03-vector-search/homework.md)
28 |
29 | [**Module 4: Evaluation and monitoring**](04-monitoring)
30 |
31 | * [Module explainer](https://www.loom.com/share/1dd375ec4b0d458fabdfc2b841089031)
32 | * [Homework](04-monitoring/homework.md)
33 |
34 | **Office hours**
35 |
36 | * [Office hours 1](https://www.youtube.com/watch?v=q4Mb4SN-doo)
37 | * [Office hours 2](https://www.youtube.com/watch?v=lre6h7vqz7A)
38 | * [Office hours 3 and 4](https://www.loom.com/share/b57b995f79364da68da1d1826a766794)
39 |
--------------------------------------------------------------------------------
/cohorts/2024/competition/README.md:
--------------------------------------------------------------------------------
1 | # LLM Zoomcamp 2024 Competition
2 |
3 | In the competition, you need to use LLMs to solve high school mathematics problems.
4 | Your task is to develop models that can accurately solve these problems and submit your predictions.
5 |
6 | For more details, visit the [competition page](https://www.kaggle.com/competitions/llm-zoomcamp-2024-competition/overview).
7 |
8 |
9 | ## Getting started
10 |
11 | Getting started code: [starter_notebook.ipynb](starter_notebook.ipynb)
12 |
13 | You will need to install langchain for this code:
14 |
15 | ```bash
16 | pip install -qU langchain-openai langchain
17 | ```
18 |
19 | Thanks [Blaq](https://www.linkedin.com/in/chinonsoodiaka/) for contributing the notebook!
20 |
21 | Note that the solution is not deterministic: when you run
22 | it again on the same record, sometimes the solution is different
23 | from the initial run.
24 |
25 | This might be more of a feature than a bug:
26 | what if you run it multiple times and take the majority vote?
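A rough sketch of that idea (here `solve` is a placeholder for one run of the LLM solution from the starter notebook):

```python
from collections import Counter

def majority_vote(problem, n_runs=5):
    # run the non-deterministic solver several times and keep the most frequent answer
    answers = [solve(problem) for _ in range(n_runs)]
    return Counter(answers).most_common(1)[0][0]
```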
27 |
28 | ## Community solutions
29 |
30 | - Pastor Solo - getting started notebook on Kaggle - https://www.kaggle.com/code/pastorsoto/starter-notebook
31 | - Slava Shen - chain of thoughts - https://www.kaggle.com/code/vyacheslavshen/double-check-with-llms
32 |
33 | ## Evaluation
34 |
35 | We use accuracy as the evaluation metric. Sometimes multiple
36 | answers are correct. In this case, a solution is correct if
37 | it matches at least one of the possible answers.
38 |
39 | You can find the code for evaluation in [scorer.py](scorer.py) - it's taken [from kaggle](https://www.kaggle.com/code/dremovd/accuracy-multiple-correct?scriptVersionId=158029538)
40 | and this is exactly the code we use in the competition.
41 |
--------------------------------------------------------------------------------
/cohorts/2024/competition/scorer.py:
--------------------------------------------------------------------------------
1 | import pandas as pd
2 | import numpy as np
3 |
4 |
5 | class ParticipantVisibleError(Exception):
6 | # If you want an error message to be shown to participants, you must raise the error as a ParticipantVisibleError
7 | # All other errors will only be shown to the competition host. This helps prevent unintentional leakage of solution data.
8 | pass
9 |
10 |
11 | def score(solution: pd.DataFrame, submission: pd.DataFrame, row_id_column_name: str) -> float:
12 | '''
13 | Accuracy that works with multiple correct answers.
14 | '''
15 | solution = solution.set_index(row_id_column_name, drop=True)
16 | submission = submission.set_index(row_id_column_name, drop=True)
17 | submission = submission.loc[solution.index]
18 |
19 | target_column = 'answer'
20 | assert target_column in solution.columns
21 | assert target_column in submission.columns
22 |
23 | # This fix is needed because submission is loaded with default parameters
24 | # Pandas magically converts string column into float
25 | def fix_suffix(value):
26 | if value.endswith('.0'):
27 | return value[:-2]
28 | else:
29 | return value
30 |
31 | submission[target_column] = submission[target_column].astype(str)
32 | submission[target_column] = submission[target_column].apply(fix_suffix)
33 |
34 |
35 | def convert_to_list(value):
36 | values = [v.strip() for v in value.strip().lstrip('[').rstrip(']').split(',')]
37 | return values
38 |
39 | solution[target_column] = solution[target_column].astype(str).apply(convert_to_list)
40 |
41 | correct = [
42 | submit_answer in correct_answer
43 | for correct_answer, submit_answer in zip(
44 | solution[target_column].values,
45 | submission[target_column].values
46 | )
47 | ]
48 |
49 | return np.mean(correct)
--------------------------------------------------------------------------------
/cohorts/2024/competition/starter_notebook_submission.csv:
--------------------------------------------------------------------------------
1 | problem_id,answer
2 | 11919,12.0
3 | 8513,285.0
4 | 7887,4.0
5 | 5272,6.0
6 | 8295,13.0
7 | 3219,15.0
8 | 7235,55.0
9 | 3688,21.0
10 | 6116,2412.0
11 | 4720,34.0
12 | 12122,40.8
13 | 4311,4.0
14 | 8283,7.0
15 | 8347,2.0
16 | 4170,220.0
17 | 4,21.0
18 | 8307,150.0
19 | 7108,544.0
20 | 7775,800.0
21 | 7680,216.0
22 | 8780,1680.0
23 | 8707,2.25
24 | 7863,5.0
25 | 12036,66.0
26 | 4569,3.0
27 | 8134,21.0
28 | 160,3.0
29 | 2869,70.0
30 | 3309,10.0
31 | 6941,0.462
32 | 4880,6.0
33 | 5126,0.5
34 | 7059,0.0
35 | 10649,45.0
36 | 10357,1.0
37 | 2445,10.0
38 | 5755,27.9
39 | 4849,0.0
40 | 1216,46.0
41 | 7787,77.0
42 | 2405,3.0
43 | 3041,6.0
44 | 10411,222240.0
45 | 7081,328.0
46 | 6018,4.0
47 | 8948,80.0
48 | 4658,54.0
49 | 11679,64.0
50 | 5390,99.2
51 | 5319,-4.0
52 | 6907,2.5
53 | 9695,5.0
54 | 12092,100.0
55 | 8458,20.0
56 | 7136,31.0
57 | 11922,8.4
58 | 6101,1350.0
59 | 245,63.0
60 | 7880,10.0
61 | 12166,4.0
62 | 8116,12.0
63 | 13554,520.0
64 | 4249,66.0
65 | 12184,3.0
66 | 5232,-4.0
67 | 9499,40980.0
68 | 4908,24.0
69 | 7452,5.0
70 | 13512,10.75
71 | 9253,6000.0
72 | 12338,4.0
73 | 4478,3.0
74 | 5170,-5.0
75 | 3122,515.0
76 | 10457,15.0
77 | 2370,10.0
78 | 8670,6.25
79 | 9446,40.0
80 | 8083,1.0
81 | 5321,-6.0
82 | 9388,0.0
83 | 8443,8.0
84 | 8138,600.0
85 | 12084,23.0
86 | 3305,11.0
87 | 4810,23.0
88 | 25,0.361
89 | 217,10.0
90 | 1421,14563.0
91 | 6914,1.0
92 | 4724,23.0
93 | 10460,26.0
94 | 1324,40.0
95 | 5268,3.0
96 | 5164,6.0
97 | 3519,650.0
98 | 7934,12.0
99 | 9390,0.0
100 | 7137,22.0
101 | 5914,300.0
102 |
--------------------------------------------------------------------------------
/cohorts/2024/project.md:
--------------------------------------------------------------------------------
1 | ## Course Project
2 |
3 | The goal of this project is to apply everything we learned
4 | in this course and build an end-to-end RAG project.
5 |
6 | Remember that to pass the project, you must evaluate 3 peers. If you don't do that, your project can't be considered complete.
7 |
8 |
9 | ## Submitting
10 |
11 | ### Project Attempt #1
12 |
13 | * Project: https://courses.datatalks.club/llm-zoomcamp-2024/project/project1
14 | * Review: https://courses.datatalks.club/llm-zoomcamp-2024/project/project1/eval
15 |
16 |
17 | ### Project Attempt #2
18 |
19 | * Project: https://courses.datatalks.club/llm-zoomcamp-2024/project/project2
20 | * Review: https://courses.datatalks.club/llm-zoomcamp-2024/project/project2/eval
21 |
22 |
23 | > **Important**: update your "Certificate name" here: https://courses.datatalks.club/llm-zoomcamp-2024/enrollment -
24 | this is what we will use when generating certificates for you.
25 |
26 |
27 | ## Evaluation criteria
28 |
29 | See [here](../../project.md#evaluation-criteria)
--------------------------------------------------------------------------------
/cohorts/2024/workshops/dlt.md:
--------------------------------------------------------------------------------
1 | # Open source data ingestion for RAGs with dlt
2 |
3 | Video: https://www.youtube.com/watch?v=qUNyfR_X2Mo
4 |
5 | Homework solution: https://drive.google.com/file/d/1M1dKtAO-v3oYIztqMS8fXlLsBnqcmA-O/view?usp=sharing
6 |
7 | In this hands-on workshop, we’ll learn how to build a data ingestion pipeline using dlt to load data from a REST API into LanceDB so you can have an always up-to-date RAG.
8 |
9 | We’ll cover the following steps:
10 |
11 | * Extracting data from REST APIs
12 | * Loading and vectorizing the data into LanceDB, which, unlike other vector DBs, stores the data _and_ the embeddings
13 | * Incremental loading
14 |
15 | By the end of this workshop, you’ll be able to write a portable, OSS data pipeline for your RAG that you can deploy anywhere, such as Python notebooks, virtual machines, or orchestrators like Airflow, Dagster or Mage.
16 |
17 |
18 | # Resources
19 |
20 | * Slides: [dlt-LLM-Zoomcamp.pdf](https://github.com/user-attachments/files/16131729/dlt.LLM.Zoomcamp.pdf)
21 | * [Google Colab notebook](https://colab.research.google.com/drive/1nNOybHdWQiwUUuJFZu__xvJxL_ADU3xl?usp=sharing) - make a copy to follow along!
22 |
23 | ---
24 |
25 | # Homework
26 |
27 | In the workshop, we extracted contents from two pages in notion titled "Workshop: Benefits and Perks" and "Workshop: Working hours, PTO, and Vacation".
28 |
29 | Repeat the same process for a third page titled "Homework: Employee handbook" (hidden from public view, but accessible via API key):
30 |
31 | 1. Modify the REST API source to extract only this page.
32 | 2. Write the output into a separate table called "homework".
33 | 3. Remember to update the table name in all cells where you connect to a lancedb table.
34 |
35 | To do this you can use the [workshop Colab](https://colab.research.google.com/drive/1nNOybHdWQiwUUuJFZu__xvJxL_ADU3xl?usp=sharing) as a basis.
36 |
37 | Now, answer the following questions:
38 |
39 | ## Q1. Rows in LanceDB
40 |
41 | How many rows does the lancedb table "notion_pages__homework" have?
42 |
43 | * 14
44 | * 15
45 | * 16
46 | * 17
47 |
48 | ## Q2. Running the Pipeline: Last edited time
49 |
50 | In the demo, we created an incremental dlt resource `rest_api_notion_incremental` that keeps track of `last_edited_time`. What value does it store after you've run your pipeline once? (Hint: you will be able to get this value by performing some aggregation function on the column `last_edited_time` of the table)
51 |
52 | * `Timestamp('2024-07-05 22:34:00+0000', tz='UTC') (OR "2024-07-05T22:34:00.000Z")`
53 | * `Timestamp('2024-07-05 23:33:00+0000', tz='UTC') (OR "2024-07-05T23:33:00.000Z")`
54 | * `Timestamp('2024-07-05 23:52:00+0000', tz='UTC') (OR "2024-07-05T23:52:00.000Z")`
55 | * `Timestamp('2024-07-05 22:56:00+0000', tz='UTC') (OR "2024-07-05T22:56:00.000Z")`
56 |
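57 | If you want to sanity-check Q1 and Q2 locally, here is a minimal inspection sketch (the connection path is an assumption and depends on how you configured the dlt LanceDB destination; the table name is the one from Q1):
58 | 
59 | ```python
60 | import lancedb
61 | 
62 | db = lancedb.connect(".lancedb")  # adjust to your LanceDB location
63 | tbl = db.open_table("notion_pages__homework")
64 | 
65 | print(tbl.count_rows())  # Q1: number of rows
66 | 
67 | df = tbl.to_pandas()
68 | print(df["last_edited_time"].max())  # Q2: latest edit time tracked by the incremental resource
69 | ```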
57 |
58 | ## Q3. Ask the Assistant
59 |
60 | Find out with the help of the AI assistant: how many PTO days are the employees entitled to in a year?
61 |
62 | * 20
63 | * 25
64 | * 30
65 | * 35
66 |
67 | ## Submit the results
68 |
69 | * Submit your results here: https://courses.datatalks.club/llm-zoomcamp-2024/homework/workshop1
70 | * It's possible that your answers won't match exactly. If it's the case, select the closest one.
71 |
--------------------------------------------------------------------------------
/cohorts/2025/01-intro/homework.md:
--------------------------------------------------------------------------------
1 | ## Homework: Introduction
2 |
3 | In this homework, we'll learn more about search and use Elastic Search for practice.
4 |
5 | > It's possible that your answers won't match exactly. If it's the case, select the closest one.
6 |
7 |
8 | ## Q1. Running Elastic
9 |
10 | Run Elastic Search 8.17.6, and get the cluster information. If you run it on localhost, this is how you do it:
11 |
12 | ```bash
13 | curl localhost:9200
14 | ```
15 |
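16 | If you don't have Elasticsearch running yet, one common way to start version 8.17.6 locally is with Docker (single-node, security disabled; for local experiments only):
17 | 
18 | ```bash
19 | docker run -it \
20 |     --rm \
21 |     --name elasticsearch \
22 |     -m 4GB \
23 |     -p 9200:9200 \
24 |     -p 9300:9300 \
25 |     -e "discovery.type=single-node" \
26 |     -e "xpack.security.enabled=false" \
27 |     docker.elastic.co/elasticsearch/elasticsearch:8.17.6
28 | ```
29 | 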
16 | What's the `version.build_hash` value?
17 |
18 |
19 | ## Getting the data
20 |
21 | Now let's get the FAQ data. You can run this snippet:
22 |
23 | ```python
24 | import requests
25 |
26 | docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'
27 | docs_response = requests.get(docs_url)
28 | documents_raw = docs_response.json()
29 |
30 | documents = []
31 |
32 | for course in documents_raw:
33 | course_name = course['course']
34 |
35 | for doc in course['documents']:
36 | doc['course'] = course_name
37 | documents.append(doc)
38 | ```
39 |
40 | Note that you need to have the `requests` library:
41 |
42 | ```bash
43 | pip install requests
44 | ```
45 |
46 | ## Q2. Indexing the data
47 |
48 | Index the data in the same way as was shown in the course videos. Make the `course` field a keyword and the rest should be text.
49 |
50 | Don't forget to install the ElasticSearch client for Python:
51 |
52 | ```bash
53 | pip install elasticsearch
54 | ```
55 |
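56 | For reference, a minimal index-definition sketch along the lines of the course videos (the index name `course-questions`, the client variable, and the settings are assumptions):
57 | 
58 | ```python
59 | from elasticsearch import Elasticsearch
60 | 
61 | es_client = Elasticsearch("http://localhost:9200")
62 | 
63 | index_settings = {
64 |     "settings": {"number_of_shards": 1, "number_of_replicas": 0},
65 |     "mappings": {
66 |         "properties": {
67 |             "text": {"type": "text"},
68 |             "section": {"type": "text"},
69 |             "question": {"type": "text"},
70 |             "course": {"type": "keyword"},  # keyword, so we can filter on it later
71 |         }
72 |     },
73 | }
74 | 
75 | es_client.indices.create(index="course-questions", body=index_settings)
76 | ```
77 | 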
56 | Which function do you use for adding your data to elastic?
57 |
58 | * `insert`
59 | * `index`
60 | * `put`
61 | * `add`
62 |
63 | ## Q3. Searching
64 |
65 | Now let's search in our index.
66 |
67 | We will execute a query "How do execute a command on a Kubernetes pod?".
68 |
69 | Use only the `question` and `text` fields, give `question` a boost of 4, and use `"type": "best_fields"`.
70 |
71 | What's the score for the top ranking result?
72 |
73 | * 84.50
74 | * 64.50
75 | * 44.50
76 | * 24.50
77 |
78 | Look at the `_score` field.
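79 | 
80 | A sketch of the query body, reusing `es_client` and the `course-questions` index from the Q2 sketch (the exact call style may differ slightly between Elasticsearch client versions):
81 | 
82 | ```python
83 | query = "How do execute a command on a Kubernetes pod?"
84 | 
85 | search_query = {
86 |     "size": 5,
87 |     "query": {
88 |         "multi_match": {
89 |             "query": query,
90 |             "fields": ["question^4", "text"],  # boost `question` by 4
91 |             "type": "best_fields",
92 |         }
93 |     },
94 | }
95 | 
96 | response = es_client.search(index="course-questions", body=search_query)
97 | print(response["hits"]["hits"][0]["_score"])
98 | ```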
79 |
80 | ## Q4. Filtering
81 |
82 | Now ask a different question: "How do copy a file to a Docker container?".
83 |
84 | This time we are only interested in questions from `machine-learning-zoomcamp`.
85 |
86 | Return 3 results. What's the 3rd question returned by the search engine?
87 |
88 | * How do I debug a docker container?
89 | * How do I copy files from a different folder into docker container’s working directory?
90 | * How do Lambda container images work?
91 | * How can I annotate a graph?
92 |
93 | ## Q5. Building a prompt
94 |
95 | Now we're ready to build a prompt to send to an LLM.
96 |
97 | Take the records returned from Elasticsearch in Q4 and use this template to build the context. Separate context entries by two linebreaks (`\n\n`):
98 | ```python
99 | context_template = """
100 | Q: {question}
101 | A: {text}
102 | """.strip()
103 | ```
104 |
105 | Now use the context you just created along with the "How do I execute a command in a running docker container?" question
106 | to construct a prompt using the template below:
107 |
108 | ```
109 | prompt_template = """
110 | You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
111 | Use only the facts from the CONTEXT when answering the QUESTION.
112 |
113 | QUESTION: {question}
114 |
115 | CONTEXT:
116 | {context}
117 | """.strip()
118 | ```
119 |
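120 | A minimal sketch of putting the two templates together (assuming `results` holds the documents returned in Q4, each with `question` and `text` fields):
121 | 
122 | ```python
123 | question = "How do I execute a command in a running docker container?"
124 | 
125 | context = "\n\n".join(
126 |     context_template.format(question=doc["question"], text=doc["text"])
127 |     for doc in results
128 | )
129 | 
130 | prompt = prompt_template.format(question=question, context=context)
131 | print(len(prompt))
132 | ```
133 | 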
120 | What's the length of the resulting prompt? (use the `len` function)
121 |
122 | * 946
123 | * 1446
124 | * 1946
125 | * 2446
126 |
127 | ## Q6. Tokens
128 |
129 | When we use the OpenAI Platform, we're charged by the number of
130 | tokens we send in our prompt and receive in the response.
131 |
132 | The OpenAI python package uses `tiktoken` for tokenization:
133 |
134 | ```bash
135 | pip install tiktoken
136 | ```
137 |
138 | Let's calculate the number of tokens in our prompt:
139 |
140 | ```python
141 | import tiktoken
142 | 
143 | encoding = tiktoken.encoding_for_model("gpt-4o")
142 | ```
143 |
144 | Use the `encode` function. How many tokens does our prompt have?
145 |
146 | * 120
147 | * 220
148 | * 320
149 | * 420
150 |
151 | Note: to decode back a token into a word, you can use the `decode_single_token_bytes` function:
152 |
153 | ```python
154 | encoding.decode_single_token_bytes(63842)
155 | ```
156 |
157 | ## Bonus: generating the answer (ungraded)
158 |
159 | Let's send the prompt to OpenAI. What's the response?
160 |
161 | Note: you can replace OpenAI with Ollama. See module 2.
162 |
163 | ## Bonus: calculating the costs (ungraded)
164 |
165 | Suppose that on average per request we send 150 tokens and receive back 250 tokens.
166 |
167 | How much will it cost to run 1000 requests?
168 |
169 | You can see the prices [here](https://openai.com/api/pricing/)
170 |
171 | On June 17, the prices for gpt-4o are:
172 |
173 | * Input: $0.005 / 1K tokens
174 | * Output: $0.015 / 1K tokens
175 |
176 | You can redo the calculations with the actual number of tokens you got in Q6.
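177 | 
178 | With the example numbers above (150 input and 250 output tokens per request), a quick back-of-the-envelope sketch:
179 | 
180 | ```python
181 | input_cost = 150 / 1000 * 0.005   # $ per request for input tokens
182 | output_cost = 250 / 1000 * 0.015  # $ per request for output tokens
183 | 
184 | print(1000 * (input_cost + output_cost))  # 4.5 -> about $4.50 for 1000 requests
185 | ```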
177 |
178 |
179 | ## Submit the results
180 |
181 | * Submit your results here: https://courses.datatalks.club/llm-zoomcamp-2025/homework/hw1
182 | * It's possible that your answers won't match exactly. If it's the case, select the closest one.
183 |
--------------------------------------------------------------------------------
/cohorts/2025/README.md:
--------------------------------------------------------------------------------
1 | # LLM Zoomcamp 2025 Edition
2 |
3 | Important links:
4 |
5 | - Pre-Course Q&A Stream: [video](https://www.youtube.com/live/8lgiOLMMKcY), [summary](/cohorts/2025/pre-course-q-a-stream-summary.md)
6 | - Course Launch Stream: [video](https://www.youtube.com/live/FgnelhEJFj0), [summary](/cohorts/2025/course-launch-stream-summary.md)
7 | - [2025 Edition Youtube Playlist](https://www.youtube.com/playlist?list=PL3MmuxUbc_hIoBpuc900htYF4uhEAbaT-)
8 | - [Course management platform](https://courses.datatalks.club/llm-zoomcamp-2025/)
9 |
10 |
11 | [**Module 1: Introduction to LLMs and RAG**](01-intro)
12 |
13 | * [Homework](01-intro/homework.md)
14 |
15 |
16 |
--------------------------------------------------------------------------------
/cohorts/2025/course-launch-stream-summary.md:
--------------------------------------------------------------------------------
1 | # Key Takeaways from the LLM Zoomcamp 2025 Launch Stream
2 |
3 |
4 |
5 | > **[Watch the LLM Zoomcamp 2025 Launch Stream](https://youtu.be/FgnelhEJFj0)**
6 |
7 | ---
8 |
9 | ## Table of Contents
10 |
11 | - [Quick Summary](#quick-summary)
12 | - [Meet the Team](#meet-the-2025-team)
13 | - [Prerequisites](#prerequisites)
14 | - [Course Architecture](#course-architecture)
15 | - [Module Breakdown](#module-breakdown)
16 | - [Timeline & Deadlines](#timeline--deadlines)
17 | - [Getting Help](#getting-help)
18 | - [Earning Points](#earning-points)
19 | - [Communication Channels](#communication-channels)
20 | - [Sponsors & Support](#sponsors--support)
21 | - [FAQ Highlights](#faq-highlights)
22 | - [Next Steps](#next-steps)
23 |
24 | ---
25 |
26 | ## Quick Summary
27 |
28 |
29 | **Key Takeaways**
30 |
31 | * **Central project**: Build a production-ready RAG chatbot
32 | * **Skills required**: Python, CLI, Git, Docker—nothing more
33 | * **Support workflow**: FAQ → Slack search → bot → channel (no tags)
34 | * **Earn points** by contributing to FAQ and sharing progress publicly
35 | * **No GPUs needed** for the main camp; open-source spin-off covers that
36 | * **Capstone & peer review** are mandatory for certificate—start gathering data now
37 | * **Budget**: ~$10 OpenAI credit covers the entire course
38 |
39 |
40 |
41 | ## Meet the 2025 Team
42 |
43 | | Instructor | Role | Expertise |
44 | |------------|------|-----------|
45 | | **[Alexey Grigorev](https://github.com/alexeygrigorev)** | Host & General Guidance | Course Creator, ML Engineering |
46 | | **[Kacper Łukawski](https://github.com/kacperlukawski)** | Vector Search Module | Qdrant Expert |
47 | | **Timur S** | Best Practices Module | Production ML Systems |
48 | | **Phoenix Expert** | Monitoring Module | ML Observability (name TBA) |
49 |
50 | ## Prerequisites
51 |
52 | ### Required Skills
53 | - **Python basics** (one-day refresher is sufficient)
54 | - **Command-line & Git** confidence
55 | - **Docker** installed and working
56 |
57 | ### Need a Refresher?
58 | - **Python**: Any weekend tutorial will do
59 | - **Docker**: Check out the [Data Engineering Zoomcamp Docker lesson](https://github.com/DataTalksClub/data-engineering-zoomcamp/tree/main/01-docker-terraform)
60 | - **Git**: [GitHub's Git Handbook](https://guides.github.com/introduction/git-handbook/)
61 |
62 | ## Course Architecture
63 |
64 | The entire Zoomcamp revolves around building a **production-grade RAG (Retrieval-Augmented Generation) chatbot** that answers student questions from the course FAQ.
65 |
66 | ### What You'll Build:
67 | 1. **Document Ingestion** → Text + vector indexing
68 | 2. **Smart Retrieval** → Find most relevant snippets
69 | 3. **LLM Integration** → Generate context-aware answers
70 | 4. **Production Monitoring** → Track quality, latency, and costs
71 |
72 | ## Module Breakdown
73 |
74 | | Week | Module | Focus |
75 | |------|--------|-------|
76 | | 1 | **Intro & RAG Foundations** | Core concepts, basic implementation |
77 | | 2 | **Vector vs Text Search** | Hybrid search strategies |
78 | | 3 | **Evaluation Techniques** | Measuring RAG performance |
79 | | 4 | **Monitoring with Phoenix** | Production observability |
80 | | 5 | **Best Practices & Guardrails** | Security, reliability |
81 | | 6-7 | **Capstone Project** | Build your own RAG system |
82 | | 8 | **Peer Review** | Evaluate classmates' projects |
83 | ## Timeline & Deadlines
84 |
85 | ### Content Delivery
86 | - **Pre-recorded videos** in organized playlists:
87 | - **[Main 2025 Playlist](https://youtube.com/playlist?list=PL3MmuxUbc_hIoBpuc900htYF4uhEAbaT-&si=uwC6I0wFePjdmLdp)**
88 | - **[Legacy 2024 Content](https://youtube.com/playlist?list=PL3MmuxUbc_hKiIVNf7DeEt_tGjypOYtKV&si=l7lTHbVAUHks2AMP)**
89 |
90 | ### Homework & Submissions
91 | - **Platform**: [DataTalks.Club Course Platform](https://courses.datatalks.club/llm-zoomcamp-2025/)
92 | - **Scoring**: Points-based system with public leaderboard
93 | - **Deadline**: 3 weeks after each module launch
94 |
95 | ### Certificate Requirements
96 | - Complete all homework assignments
97 | - Build and submit capstone project (2 weeks)
98 | - Complete peer review process (1 week)
99 | - Minimum point threshold (TBA)
100 |
101 | ## Getting Help
102 |
103 | ### Support Workflow (Follow This Order!)
104 |
105 | 1. **Search the [FAQ Document](https://docs.google.com/document/d/1m2KexowAXTmexfC5rVTCSnaShvdUQ8Ag2IEiwBDHxN0/edit?usp=sharing)** first
106 | 2. **Search [Slack](https://datatalks.club/slack.html) history** for similar questions
107 | 3. **Ask the Slack bot** (uses the same RAG pipeline you're building!)
108 | 4. **Post in `#course-llm-zoomcamp`** (**NEVER tag instructors directly**)
109 |
110 | ### How to Ask Good Questions
111 | - Include error messages and code snippets
112 | - Mention what you've already tried
113 | - Use thread replies to keep channels organized
114 |
115 | ## Earning Points
116 |
117 | ### Ways to Boost Your Score:
118 | - **Contribute to FAQ**: Add solved issues and solutions
119 | - **Share publicly**: Post progress on LinkedIn/Twitter with **#LLMZoomcamp**
120 | - **Limit**: Up to 7 social media links per module count toward score
121 | - **Quality over quantity**: Thoughtful posts get more engagement
122 |
123 | ### Content Ideas:
124 | - Weekly progress updates
125 | - Code snippets and explanations
126 | - Challenges you overcame
127 | - Creative applications of course concepts
128 |
129 | ## Communication Channels
130 |
131 | | Channel | Purpose | Link |
132 | |---------|---------|------|
133 | | **Telegram** | Announcements only | [Join Channel](https://t.me/llm_zoomcamp) |
134 | | **Slack** | Questions & peer help | [Join Workspace](https://datatalks.club/slack.html) |
135 | | **YouTube** | Video content | [Course Playlist](https://youtube.com/playlist?list=PL3MmuxUbc_hIoBpuc900htYF4uhEAbaT-&si=uwC6I0wFePjdmLdp) |
136 | | **GitHub** | Course materials | [Main Repository](https://github.com/DataTalksClub/llm-zoomcamp) |
137 |
138 | ### Communication Etiquette:
139 | - Always reply in **threads** to keep channels clean
140 | - Search before asking
141 | - Be respectful and helpful to peers
142 | - Don't tag instructors directly
143 |
144 | ## Sponsors & Support
145 |
146 | ### Course Sponsors
147 | - **[Arize AI](https://github.com/Arize-ai/phoenix)**
148 | - **[dltHub](https://github.com/dlt-hub/dlt)**
149 | - **[Qdrant](https://github.com/qdrant/qdrant)**
150 |
151 | *These sponsors keep the course completely free for everyone!*
152 |
153 | ### Support Alexey
154 | If you find value in this course, consider supporting via [GitHub Sponsors](https://github.com/sponsors/alexeygrigorev).
155 |
156 | ## FAQ Highlights
157 |
158 |
159 | ### Why no agents in the main course?
160 |
161 | **Current situation**: Agent frameworks evolve "every day," making it risky to lock the course to unstable APIs.
162 |
163 | **Future plans**:
164 | - **AI-Dev Tools Course** planned for **September 2025**
165 | - **2,100+ sign-ups** already confirmed interest
166 | - **Will cover**: Agentic search, LangGraph, CrewAI, agent guardrails
167 | - **Expect**: Workshop teasers during this cohort
168 |
169 | **More info**: [AI Dev Tools Zoomcamp Repository](https://github.com/DataTalksClub/ai-dev-tools-zoomcamp)
170 |
171 |
172 |
173 |
174 | ### Do I need GPUs?
175 |
176 | **Main LLM Zoomcamp**:
177 | - **No GPUs needed**
178 | - **~$10 OpenAI credit** covers everything
179 | - **Uses hosted models**: OpenAI, Anthropic, Groq, etc.
180 |
181 | **Open-Source LLM Mini-Course** (June/July 2025):
182 | - **Free GPU quotas** provided by Saturn Cloud & AMD
183 | - **Local models**: Llama 3, DeepSeek, etc.
184 | - **Topics**: Quantization, vLLM, LoRA fine-tuning
185 |
186 | **More info**: [Open-Source LLM Zoomcamp](https://github.com/DataTalksClub/open-source-llm-zoomcamp)
187 |
188 |
189 |
190 |
191 | ### How is LLM monitoring different from traditional MLOps?
192 |
193 | **Shared aspects**:
194 | - Uptime & latency tracking
195 | - Performance regression detection
196 | - Post-deployment monitoring
197 |
198 | **LLM-specific metrics**:
199 | - **Cost per call/token**
200 | - **Hallucination detection** (via eval sets, heuristics, human feedback)
201 | - **Prompt/response drift** (style/length changes indicating model updates)
202 | - **Content risk**: PII leaks, toxicity detection
203 |
204 | **Tools**: Phoenix (open-source) for dashboards and budget alerts
205 |
206 |
207 |
208 |
209 | ### Local vs Hosted Models: Which to choose?
210 |
211 | | Aspect | Hosted API | Local/Self-hosted |
212 | |--------|------------|-------------------|
213 | | **Setup** | Single HTTP call | Download 4-40GB weights |
214 | | **Scaling** | Provider handles it | You manage infrastructure |
215 | | **Cost** | Pay per usage | Higher upfront, lower long-term |
216 | | **Control** | Limited customization | Full control, no data sharing |
217 | | **Models** | Latest frontier models | Open-source alternatives |
218 | | **Customization** | API parameters only | LoRA fine-tuning possible |
219 |
220 | **Course approach**: Prototype with GPT-4o, then compare with local Llama 3
221 |
222 |
223 |
224 |
225 | ### Job market advice for LLM engineers
226 |
227 | **Show, don't tell**:
228 | - Publish notebooks and blog posts
229 | - Create short demo videos (Loom)
230 | - Deploy working RAG pipelines
231 |
232 | **Specialize wisely**:
233 | - RAG evaluation techniques
234 | - Retrieval optimization
235 | - LLM cost monitoring
236 | - Content safety & guardrails
237 |
238 | **Network via "learning in public"**:
239 | - Use **#LLMZoomCamp** hashtag
240 | - Consistent LinkedIn/Twitter posts
241 | - Engage with course community
242 |
243 | **Portfolio > certificates**:
244 | - Capstone GitHub repo
245 | - Regular social media updates
246 | - Video walkthroughs of projects
247 |
248 |
249 |
250 |
251 | ### Capstone project guidelines
252 |
253 | **Start early**:
254 | - Gather domain corpus (docs, Slack dumps, PDFs)
255 | - Begin data cleaning and chunking
256 | - Choose a problem you're passionate about
257 |
258 | **Requirements preview**:
259 | - Working RAG demonstration
260 | - Evaluation notebook with metrics
261 | - README with cost analysis
262 | - Monitoring screenshots
263 | - 3-minute video walkthrough
264 |
265 | **Peer review process**:
266 | - 2 weeks for building
267 | - 1 week for reviewing 3 classmates' projects
268 | - Mandatory for certification
269 |
270 | **Detailed rubric**: Will be published mid-cohort on the course platform
271 |
272 |
273 |
274 | ## Next Steps
275 |
276 | 1. **[Star the GitHub repo](https://github.com/DataTalksClub/llm-zoomcamp)** (helps with visibility!)
277 | 2. **Skim Module 1** content to get familiar
278 | 3. **Install Docker** and verify it works
279 | 4. **Join communication channels**:
280 | - [Slack workspace](https://datatalks.club/slack.html)
281 | - [Telegram channel](https://t.me/llm_zoomcamp)
282 | 5. **Bookmark the [FAQ document](https://docs.google.com/document/d/1m2KexowAXTmexfC5rVTCSnaShvdUQ8Ag2IEiwBDHxN0/edit?usp=sharing)**
283 |
284 | ### Week 1 Goals:
285 | - **Watch Module 1 videos**
286 | - **Complete first homework**
287 | - **Start thinking about capstone dataset**
288 | - **Share your journey** with #LLMZoomCamp
289 |
290 | ### Long-term Success:
291 | - **Consistent engagement** with course materials
292 | - **Active participation** in community discussions
293 | - **Regular progress sharing** on social media
294 | - **Early capstone planning** and data preparation
295 |
296 | ## Quick Links Reference
297 |
298 | | Resource | Link |
299 | |----------|------|
300 | | **Main Repository** | https://github.com/DataTalksClub/llm-zoomcamp |
301 | | **Course Platform** | https://courses.datatalks.club/llm-zoomcamp-2025/ |
302 | | **2025 Playlist** | https://youtube.com/playlist?list=PL3MmuxUbc_hIoBpuc900htYF4uhEAbaT-&si=uwC6I0wFePjdmLdp |
303 | | **FAQ Document** | https://docs.google.com/document/d/1m2KexowAXTmexfC5rVTCSnaShvdUQ8Ag2IEiwBDHxN0/edit?usp=sharing |
304 | | **Slack Workspace** | https://datatalks.club/slack.html |
305 | | **Telegram Channel** | https://t.me/llm_zoomcamp |
306 | | **Project Guidelines** | https://github.com/DataTalksClub/llm-zoomcamp/blob/main/cohorts/2025/project.md |
307 | | **Support Alexey** | https://github.com/sponsors/alexeygrigorev |
308 |
--------------------------------------------------------------------------------
/cohorts/2025/pre-course-q-a-stream-summary.md:
--------------------------------------------------------------------------------
1 | # Key Takeaways from the LLM Zoomcamp 2025 Pre-Course Live Q&A
2 |
3 |
4 |
5 | > **[Watch the LLM Zoomcamp 2025 Pre-Course Live Q&A](https://youtu.be/8lgiOLMMKcY)**
6 |
7 | ## 1. Do you think LLMs are a lasting technology or are they just a passing trend like the metaverse or NFTs?
8 |
9 | **Answer:** While there’s certainly hype around LLMs, they’ve already become deeply integrated into daily workflows—used for coding, personal productivity, and prototype development. Even after the hype subsides, the underlying tools and techniques (like RAG pipelines and vector search) will remain valuable. Future models and interfaces may evolve, but the core capabilities of LLMs are here to stay.
10 |
11 | ## 2. What prerequisites would set me up for success in the course and help me get the best out of it?
12 |
13 | **Answer:** You should be comfortable with:
14 |
15 | * General programming (ideally in Python)
16 | * Command-line tools
17 | * Connecting services (e.g., spinning up ElasticSearch or similar)
18 | No deep machine-learning or advanced software-engineering background is required, but familiarity with basic scripting and package installation will make the coursework smoother.
19 |
20 | ## 3. Is it beneficial if I do the course again this year even though I completed it last year?
21 |
22 | **Answer:** Yes. The curriculum has been updated with fewer but deeper modules, new evaluation/monitoring tooling (Phoenix instead of Grafana), and likely a new vector-search backend. You’ll also get to experiment with the latest LLMs and compete in a fresh challenge exercise.
23 |
24 | ## 4. Will the course help me implement RAG from a live database?
25 |
26 | **Answer:** Most likely yes. Whether your data is in a transactional database or a knowledge-base store, the course teaches you how to connect to your data source, chunk and index content, and build a RAG application. The exact workflow may vary by database type, but the principles carry over.
27 |
28 | ## 5. Are we going to do any agentic AI development in this course?
29 |
30 | **Answer:** Not as part of the core modules. There **will** likely be an optional parallel workshop on agentic workflows in June, but agent development is not formally included in this year’s curriculum.
31 |
32 | ## 6. Will this course include MLOps (LM Ops) content?
33 |
34 | **Answer:** To a degree. The monitoring module covers key LM Ops practices (metrics, cost tracking, query logging), but full deployment and continuous-training pipelines are reserved for the separate Open Source LLM Zoom Camp.
35 |
36 | ## 7. What will be the infrastructure cost, and can I run everything locally without cloud resources?
37 |
38 | **Answer:**
39 |
40 | * **Local:** 100% feasible. You can use tools like Llama.cpp or Groq to run models on your own machine.
41 | * **Cloud APIs:** Costs are modest. For OpenAI, \~1 million tokens in/out costs around \$10 total. Groq offers a free tier for basic experimentation.
42 |
43 | ## 8. Will this course cover MCP (Model Context Protocol)?
44 |
45 | **Answer:** No, MCP (and other emerging protocols) are not in this year’s syllabus. As with agentic AI, these topics may surface in a future specialized offering once industry practices stabilize.
46 |
47 | ## 9. Do we discuss evaluating LLM-based applications, generating metrics, and setting up guardrails?
48 |
49 | **Answer:**
50 |
51 | * **Evaluation metrics:** Yes. You’ll learn classical IR metrics (e.g., MRR, recall) for search and how to use LLMs themselves as “judges” for end-to-end RAG evaluation.
52 | * **Guardrails:** No formal guardrail framework is included, though monitoring best practices will help you detect and respond to undesired behavior.
53 |
54 | ## 10. Will we cover chunking techniques in the course?
55 |
56 | **Answer:** Yes and no. You’ll see how data is pre-chunked in the example repo and learn best practices for chunk size and strategy—but actual implementation and experimentation with chunking are left as part of your hands-on project work.
57 |
58 | ## 11. Is the RAG pipeline included in the course?
59 |
60 | **Answer:** Absolutely. RAG is the central focus: indexing, retrieval, prompt construction, and response handling are all core modules, and you’ll build full pipelines from scratch.
61 |
62 | ## 12. Can different programming languages like JavaScript be used for this course?
63 |
64 | **Answer:** The taught examples use Python, but you’re welcome to implement your project in another language (e.g., JavaScript). You’ll just need to provide clear installation and usage instructions (npm commands, environment setup) so peers can run and review your work.
65 |
66 | ## 13. Can we expect a complex project, and learn from examples?
67 |
68 | **Answer:** Project complexity is up to you. The course provides exemplar student projects (food-search RAG, recipe recommenders, etc.) to illustrate scope and quality. You then pick your own data and extend the RAG concepts to a domain of your choice.
69 |
70 | ## 14. As someone new, how do I navigate the GitHub repo structure?
71 |
72 | **Answer:**
73 |
74 | 1. Open the repo’s README and follow the module links in order.
75 | 2. Click each module’s folder to access videos and homework.
76 | 3. Use the “Project Attempts” section to browse past student projects and peer-review guidelines.
77 |
78 | ## 15. What is the estimated weekly time investment for the course?
79 |
80 | **Answer:** Roughly **10 hours per week**:
81 |
82 | * Module videos + readings: \~4 hours
83 | * Homework/project work: \~6 hours
84 |
85 | ## 16. How long does the course run and when does it end?
86 |
87 | **Answer:** It spans **10 weeks** total:
88 |
89 | * 2 weeks for the Intro module
90 | * 1 week each for Modules 2–5
91 | * 3 weeks for the final project period
92 | Expect to finish by late summer.
93 |
94 | ## 17. How much will it cost to complete the course using Groq or OpenAI APIs?
95 |
96 | **Answer:**
97 |
98 | * **Groq:** Free tier available, suitable for initial experiments.
99 | * **OpenAI:** Approximately \$10–\$20 for 1 million tokens in + out; in practice you’ll spend far less than your API quota.
100 |
101 | ## 18. Can you explain the steps for enrolling and completing the course?
102 |
103 | **Answer:**
104 |
105 | 1. **Star & Fork** the GitHub repo to show support.
106 | 2. **Sign up** on the repo’s landing page (click “Sign up” button).
107 | 3. **Join** the community (Telegram/Slack) for announcements.
108 | 4. **Attend** the launch stream (May 27) for a content overview.
109 | 5. **Weekly:** Watch that week’s module videos, complete homework, submit via the platform.
110 | 6. **Final project:** Build and document your RAG application; peer-review others.
111 | 7. **Certificate:** Earned upon successful project submission and peer reviews.
112 |
--------------------------------------------------------------------------------
/cohorts/2025/project.md:
--------------------------------------------------------------------------------
1 | ## Course Project
2 |
3 | The goal of this project is to apply everything we learned
4 | in this course and build an end-to-end RAG project.
5 |
6 | Remember that to pass the project, you must evaluate 3 peers. If you don't do that, your project can't be considered complete.
7 |
8 |
9 | ## Submitting
10 |
11 | ### Project Attempt #1
12 |
13 | * Project: https://courses.datatalks.club/llm-zoomcamp-2025/project/project1
14 | * Review: https://courses.datatalks.club/llm-zoomcamp-2025/project/project1/eval
15 |
16 |
17 | ### Project Attempt #2
18 |
19 | * Project: https://courses.datatalks.club/llm-zoomcamp-2025/project/project2
20 | * Review: https://courses.datatalks.club/llm-zoomcamp-2025/project/project2/eval
21 |
22 |
23 | > **Important**: update your "Certificate name" here: https://courses.datatalks.club/llm-zoomcamp-2025/enrollment -
24 | this is what we will use when generating certificates for you.
25 |
26 |
27 | ## Evaluation criteria
28 |
29 | See [here](../../project.md#evaluation-criteria)
--------------------------------------------------------------------------------
/etc/chunking.md:
--------------------------------------------------------------------------------
1 | ## Chunking
2 |
3 | I asked ChatGPT how we can chunk a YouTube transcript
4 | and this is what it answered:
5 |
6 | Retrieval Augmented Generation (RAG) is a powerful method for enhancing the capabilities of Language Models by providing them access to external data sources. This approach is particularly effective when dealing with unstructured datasets like YouTube video transcripts, where the content is not organized in a clear question-and-answer format. Let's explore various strategies for chunking these transcripts, both using heuristics and leveraging LLMs for intelligent chunking.
7 |
8 | ### Chunking Strategies
9 |
10 | Chunking is the process of breaking down a large document into smaller, manageable pieces that can be processed by a language model. Effective chunking is crucial for RAG because it directly influences the relevance and accuracy of the information retrieved.
11 |
12 | #### Heuristic-based Chunking
13 |
14 | 1. **Fixed-size Chunking**
15 |    - **Fixed Word Count:** Divide the transcript into chunks of a specific number of words (e.g., 200-500 words). This method is simple and ensures uniform chunk sizes (see the sketch after this list).
16 | - **Fixed Time Intervals:** For video transcripts, segment the text based on time intervals (e.g., every 2 minutes of video content).
17 | - **Fixed Sentence Count:** Break the transcript into chunks containing a set number of sentences (e.g., 5-10 sentences).
18 |
19 | 2. **Semantic Boundaries**
20 | - **Pause-based Chunking:** Utilize pauses or silences in the video (detected by timestamps in the transcript) to delineate chunks, assuming pauses often indicate topic shifts.
21 | - **Speaker Turn Chunking:** Divide the transcript at speaker changes, which can be useful in interviews or panel discussions.
22 |
23 | 3. **Structural Segmentation**
24 | - **Paragraph-based Chunking:** Use natural paragraph breaks as chunk boundaries, assuming these breaks reflect changes in topic or subtopics.
25 | - **Section Headers:** If available, use any section headers or titles in the transcript to define chunk boundaries.
26 |
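27 | As a concrete example, here is a minimal fixed-word-count chunker with a small overlap between consecutive chunks (the sizes are illustrative, not recommendations):
28 | 
29 | ```python
30 | def chunk_by_words(text, chunk_size=300, overlap=50):
31 |     """Split a transcript into overlapping chunks of roughly `chunk_size` words."""
32 |     words = text.split()
33 |     chunks = []
34 |     step = chunk_size - overlap
35 |     for start in range(0, len(words), step):
36 |         chunk = " ".join(words[start:start + chunk_size])
37 |         if chunk:
38 |             chunks.append(chunk)
39 |     return chunks
40 | ```
41 | 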
27 | #### LLM-based Intelligent Chunking
28 |
29 | 1. **Thematic Chunking**
30 | - **Topic Modeling:** Use LLMs to identify and segment the transcript into coherent topics or themes. Each chunk would then represent a distinct topic.
31 | - **Content Summarization:** Summarize the transcript using LLMs to identify key points and divide the text based on these summary insights.
32 |
33 | 2. **Contextual Segmentation**
34 | - **Semantic Similarity:** Use embeddings to compute semantic similarity and create chunks where the text remains contextually consistent. LLMs can help refine these chunks by ensuring thematic coherence.
35 | - **Dynamic Chunking:** Allow the LLM to dynamically adjust chunk sizes based on content density, ensuring that dense or complex sections are adequately represented.
36 |
37 | 3. **Hierarchical Structuring**
38 | - **Sectioning:** Ask the LLM to automatically divide the transcript into sections with headers, providing an outline-like structure. This can involve creating a hierarchical representation of the content.
39 | - **Content Categorization:** LLMs can categorize content into predefined categories (e.g., introduction, explanation, conclusion), structuring chunks accordingly.
40 |
41 | 4. **Intent-based Chunking**
42 | - **Question Generation:** Use LLMs to generate potential questions from the transcript and organize chunks based on these questions, mimicking a Q&A format.
43 | - **Information Retrieval Intent:** Analyze user queries to tailor chunks that directly address likely information retrieval needs, improving retrieval efficiency.
44 |
45 | ### Implementing Chunking for RAG
46 |
47 | Here’s how you can implement these chunking strategies in a RAG workflow:
48 |
49 | 1. **Preprocessing:**
50 | - **Transcription:** Convert audio to text using a speech-to-text model if starting from raw video/audio.
51 | - **Cleaning:** Remove unnecessary filler words, irrelevant content, and transcription errors.
52 |
53 | 2. **Chunking:**
54 | - Apply one or more chunking strategies to segment the transcript into meaningful units.
55 | - Consider using a combination of heuristic and LLM-based methods for a balanced approach.
56 |
57 | 3. **Indexing:**
58 |    - Convert chunks into embeddings using a suitable model (e.g., BERT, Sentence Transformers) and index them for efficient retrieval (a minimal sketch follows this list).
59 |
60 | 4. **Retrieval:**
61 | - For a given query, retrieve relevant chunks based on semantic similarity, keyword matching, or a combination of retrieval methods.
62 |
63 | 5. **Generation:**
64 | - Use an LLM to generate answers or outputs by combining retrieved chunks with the query context.
65 |
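66 | A minimal sketch of the indexing and retrieval steps with an in-memory index (the `all-MiniLM-L6-v2` model is just one common choice, and `chunk_by_words` is the helper sketched earlier; in practice you would store the embeddings in a vector store as described below):
67 | 
68 | ```python
69 | import numpy as np
70 | from sentence_transformers import SentenceTransformer
71 | 
72 | model = SentenceTransformer("all-MiniLM-L6-v2")
73 | 
74 | transcript = "..."  # the cleaned transcript text from the preprocessing step
75 | chunks = chunk_by_words(transcript)  # reusing the word-count chunker sketched earlier
76 | chunk_embeddings = model.encode(chunks, normalize_embeddings=True)
77 | 
78 | def retrieve(query, top_k=3):
79 |     query_embedding = model.encode(query, normalize_embeddings=True)
80 |     scores = chunk_embeddings @ query_embedding  # cosine similarity: vectors are normalized
81 |     top_ids = np.argsort(scores)[::-1][:top_k]
82 |     return [chunks[i] for i in top_ids]
83 | ```
84 | 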
66 | ### Example Workflow for YouTube Transcript
67 |
68 | Let's walk through an example workflow using a YouTube video transcript:
69 |
70 | 1. **Transcription & Cleaning:**
71 | - Convert video to text, clean up non-essential parts, and ensure accuracy.
72 |
73 | 2. **Chunking:**
74 | - **Heuristic:** Break the text every 2 minutes of video time and by speaker turns.
75 | - **Intelligent:** Use an LLM to divide the text into sections with headings like "Introduction," "Key Concepts," "Case Studies," etc.
76 |
77 | 3. **Indexing:**
78 | - Create embeddings for each chunk and index them in a database like Elasticsearch or a vector store.
79 |
80 | 4. **Query Handling:**
81 | - When a user submits a query, retrieve relevant chunks using semantic search.
82 |
83 | 5. **Answer Generation:**
84 | - The LLM processes the query and the retrieved chunks to generate a cohesive and accurate response.
85 |
86 | ### Tools and Techniques for Implementation
87 |
88 | - **Language Models:** GPT-4, BERT, T5, or domain-specific LLMs for intelligent chunking.
89 | - **Vector Stores:** FAISS, Milvus, Pinecone for indexing and retrieval.
90 | - **Embedding Models:** Sentence-BERT, Universal Sentence Encoder for chunk embedding.
91 | - **Speech-to-Text:** Google Speech-to-Text, Whisper by OpenAI for initial transcription.
92 |
93 | ### Conclusion
94 |
95 | Effective chunking strategies are key to harnessing the full potential of RAG in unstructured datasets like YouTube transcripts. By using both heuristic and intelligent approaches, you can ensure that the language model receives relevant context for generating accurate and informative responses. The choice of chunking strategy should align with the specific use case and characteristics of the dataset, allowing for flexible and efficient retrieval processes.
96 |
--------------------------------------------------------------------------------
/images/llm-zoomcamp-2025.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/b9d8bd63621736c75fa8d6d0def5ad6656c0b981/images/llm-zoomcamp-2025.jpg
--------------------------------------------------------------------------------
/images/llm-zoomcamp.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/b9d8bd63621736c75fa8d6d0def5ad6656c0b981/images/llm-zoomcamp.jpg
--------------------------------------------------------------------------------
/images/qdrant.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/b9d8bd63621736c75fa8d6d0def5ad6656c0b981/images/qdrant.png
--------------------------------------------------------------------------------
/images/saturn-cloud.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/b9d8bd63621736c75fa8d6d0def5ad6656c0b981/images/saturn-cloud.png
--------------------------------------------------------------------------------
/learning-in-public.md:
--------------------------------------------------------------------------------
1 | # Learning in public
2 |
3 | Most people learn in private: they consume content but don't tell
4 | anyone about it. There's nothing wrong with it.
5 |
6 | But we want to encourage you to document your progress and
7 | share it publicly on social media.
8 |
9 | It helps you get noticed and will lead to:
10 |
11 | * Expanding your network: meeting new people and making new friends
12 | * Being invited to meetups, conferences and podcasts
13 | * Landing a job or getting clients
14 | * Many other good things
15 |
16 | Here's a more comprehensive reading on why you want to do it: https://github.com/readme/guides/publishing-your-work
17 |
18 |
19 | ## Learning in Public for Zoomcamps
20 |
21 | When you submit your homework or project, you can also submit
22 | learning in public posts:
23 |
24 |
25 |
26 | You can watch this video to see what your learning in public posts may look like:
27 |
28 |
29 |
30 |
31 |
32 |
33 | Send a PR if you want to suggest improvements for this document
--------------------------------------------------------------------------------
/project.md:
--------------------------------------------------------------------------------
1 | # Projects
2 |
3 | * Video: https://www.loom.com/share/8f99d25893de4fb8aaa95c0395c740b6
4 | * Office hours: https://www.youtube.com/watch?v=pA9S1mTqAwU
5 |
6 | In order to receive a certificate of completion for the course, you need
7 | to deliver a project. There are two attempts for that.
8 |
9 | [Submit your project here](cohorts/2025/project.md)
10 |
11 |
12 | ## Objective
13 |
14 | The goal of this project is to apply everything we have learned
15 | in this course to build an end-to-end RAG application.
16 |
17 |
18 | ## Problem statement
19 |
20 | For the project, we ask you to build an end-to-end RAG project.
21 |
22 | For that, you need:
23 |
24 | * Select a dataset that you're interested in (see [Datasets](#datasets) for examples and ideas)
25 | * Ingest the data into a knowledge base
26 | * Implement the RAG flow: query the knowledge base, build the prompt, send the prompt to an LLM
27 | * Evaluate the performance of your RAG flow
28 | * Create an interface for the application
29 | * Collect user feedback and monitor your application
30 |
31 |
32 | ## Project Documentation
33 |
34 | Your project rises or falls with its documentation. Hence, here are some general recommendations:
35 |
36 | * **Write for a Broader Audience 📝**: Assume the reader has no prior knowledge of the course materials. This way, your documentation will be accessible not only to evaluators but also to anyone interested in your project.
37 | * **Include Evaluation Criteria 🎯**: Make it easier for evaluators to assess your work by clearly mentioning each criterion in your README. Include relevant screenshots to visually support your points.
38 | * **Think of Future Opportunities 🚀**: Imagine that potential hiring managers will look at your projects. Make it straightforward for them to understand what the project is about and what you contributed. Highlight key features and your role in the project.
39 | * **Be Detailed and Comprehensive 📋**: Include as much detail as possible in the README file. Explain the setup, the functionality, and the workflow of your project. Tools like ChatGPT or other LLMs can assist you in expanding and refining your documentation.
40 | * **Provide Clear Setup Instructions ⚙️**: Include step-by-step instructions on how to set up and run your project locally. Make sure to cover dependencies, configurations, and any other requirements needed to get your project up and running.
41 | * **Use Visuals and Examples 🖼️**: Wherever possible, include diagrams, screenshots, or GIFs to illustrate key points. Use examples to show how to use your project, demonstrate common use cases, and provide sample inputs and expected outputs.
42 | * **App Preview Video 🎥**: Consider adding a short preview video of your app in action to the README. For example, if you're using Streamlit, you can easily record a screencast from the app's top-right menu ([Streamlit Guide](https://docs.streamlit.io/develop/concepts/architecture/app-chrome)). Once you saved the video file locally, you can just drag & drop it into the online GitHub editor of your README to add it ([Ref](https://stackoverflow.com/a/4279746)).
43 | * **Organize with Sub-Files 🗂️**: If your documentation becomes lengthy, consider splitting it into sub-files and linking them in your README. This keeps the main README clean and neat while providing additional detailed information in separate files (e.g., `setup.md`, `usage.md`, `contributing.md`).
44 | * **Keep It Updated 🔄**: As your project evolves, make sure your documentation reflects any changes or updates. Outdated documentation can confuse readers and diminish the credibility of your project.
45 |
46 | Remember, clear and comprehensive documentation not only helps others but is also a valuable reference for yourself in the future.
47 |
48 |
49 | ## Technologies
50 |
51 | You don't have to limit yourself to technologies covered in the course. You can use alternatives as well:
52 |
53 | * LLM: OpenAI, Ollama, Groq, AWS Bedrock, etc
54 | * Knowledge base: any text, relational, or vector database, including in-memory ones like the one we implemented in the course, or SQLite
55 | * Monitoring: Grafana, Kibana, Streamlit, dash, etc
56 | * Interface: Streamlit, dash, Flask, FastAPI, Django, etc (could be UI or API)
57 | * Ingestion pipeline: Mage, dlt, Airflow, Prefect, python script, etc
58 |
59 | If you use a tool that wasn't covered in the course, be sure to give a very detailed explanation
60 | of what that tool does and how to use it.
61 |
62 | If you're not certain about some tools, ask in Slack.
63 |
64 | ## Tips and best practices
65 |
66 | * It's better to create a separate GitHub repository for your project
67 | * Give your project a meaningful title, e.g. "DataTalksClub Zoomcamp Q&A system" or "Nutrition Facts Chat"
68 |
69 |
70 | ## Peer reviewing
71 |
72 | > [!IMPORTANT]
73 | > To evaluate the projects, we'll use peer reviewing. This is a great opportunity for you to learn from each other.
74 | > * To get points for your project, you need to evaluate 3 projects of your peers
75 | > * You get 3 extra points for each evaluation
76 |
77 | ### Review Tips
78 |
79 | * The reviewer is given a public GitHub repo link and a `commit-hash`
80 | * to see the code state of the repo at the provided commit hash, use the following URL:
81 | * `https://github.com/{username}/{repo-name}/tree/{commit-hash}`
82 | * It's recommended to clone the repository for the review. To clone the project at the commit hash:
83 | ```bash
84 | git clone https://github.com/{username}/{repo-name}.git
85 | git reset --hard {commit-hash}
86 | ```
87 |
88 | ## Evaluation Criteria
89 |
90 | * Problem description
91 | * 0 points: The problem is not described
92 | * 1 point: The problem is described but briefly or unclearly
93 | * 2 points: The problem is well-described and it's clear what problem the project solves
94 | * RAG flow
95 | * 0 points: No knowledge base or LLM is used
96 | * 1 point: No knowledge base is used, and the LLM is queried directly
97 | * 2 points: Both a knowledge base and an LLM are used in the RAG flow
98 | * Retrieval evaluation
99 | * 0 points: No evaluation of retrieval is provided
100 | * 1 point: Only one retrieval approach is evaluated
101 | * 2 points: Multiple retrieval approaches are evaluated, and the best one is used
102 | * RAG evaluation
103 | * 0 points: No evaluation of RAG is provided
104 | * 1 point: Only one RAG approach (e.g., one prompt) is evaluated
105 | * 2 points: Multiple RAG approaches are evaluated, and the best one is used
106 | * Interface
107 | * 0 points: No way to interact with the application at all
108 | * 1 point: Command line interface, a script, or a Jupyter notebook
109 | * 2 points: UI (e.g., Streamlit), web application (e.g., Django), or an API (e.g., built with FastAPI)
110 | * Ingestion pipeline
111 | * 0 points: No ingestion
112 | * 1 point: Semi-automated ingestion of the dataset into the knowledge base, e.g., with a Jupyter notebook
113 | * 2 points: Automated ingestion with a Python script or a special tool (e.g., Mage, dlt, Airflow, Prefect)
114 | * Monitoring
115 | * 0 points: No monitoring
116 | * 1 point: User feedback is collected OR there's a monitoring dashboard
117 | * 2 points: User feedback is collected and there's a dashboard with at least 5 charts
118 | * Containerization
119 | * 0 points: No containerization
120 | * 1 point: Dockerfile is provided for the main application OR there's a docker-compose for the dependencies only
121 | * 2 points: Everything is in docker-compose
122 | * Reproducibility
123 | * 0 points: No instructions on how to run the code, the data is missing, or it's unclear how to access it
124 | * 1 point: Some instructions are provided but are incomplete, OR instructions are clear and complete, the code works, but the data is missing
125 | * 2 points: Instructions are clear, the dataset is accessible, it's easy to run the code, and it works. The versions for all dependencies are specified.
126 | * Best practices
127 | * [ ] Hybrid search: combining both text and vector search (at least evaluating it) (1 point)
128 | * [ ] Document re-ranking (1 point)
129 | * [ ] User query rewriting (1 point)
130 | * Bonus points (not covered in the course)
131 | * [ ] Deployment to the cloud (2 points)
132 | * [ ] Up to 3 extra bonus points if you want to award for something extra (write in feedback for what)
133 |
134 |
135 | ## Project ideas
136 |
137 | ### Datasets
138 |
139 | Here are some datasets for your projects and potential things you can do with them
140 |
141 | * DTC data:
142 | * Slack dump: book of the week channel, course channels, career questions, etc
143 | * DTC website with book of the week archives
144 | * DTC Podcast: transcripts
145 | * Wikis
146 | * any subsets of Wikipedia
147 | * any wiki-like data source
148 | * notion notes
149 | * Articles
150 | * Index and answer questions from one or multiple articles
151 | * Transcripts
152 | * Podcast transcripts
153 | * YouTube video transcripts
154 | * Books
155 | * Sci-fi, fiction, or non-fiction books
156 | * Slide Decks and pictures
157 | * OCR and index slide decks (gpt-4o-mini can do that)
158 | * Describe and index pictures
159 | * Add more here above this line - send a PR!
160 | * Or just ask ChatGPT (see more ideas [here in this example](https://chatgpt.com/share/70b51c12-e41c-4312-831d-04f489a17f1e))
161 |
162 | You can also generate a dataset with an LLM:
163 |
164 | * If you have a dataset but you can't publicly release it, you can generate a similar one with an LLM
165 | * Or you can simply think about what kind of dataset you want to have and generate it
166 |
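167 | For example, a minimal sketch of generating synthetic FAQ-style records with the OpenAI Python client (the model, prompt, and output format are placeholders; any LLM provider works):
168 | 
169 | ```python
170 | from openai import OpenAI
171 | 
172 | client = OpenAI()
173 | 
174 | prompt = (
175 |     "Generate 10 question-answer pairs that could appear in a FAQ "
176 |     "for a home coffee-brewing assistant. Return them as a JSON list "
177 |     "of objects with 'question' and 'answer' fields."
178 | )
179 | 
180 | response = client.chat.completions.create(
181 |     model="gpt-4o-mini",
182 |     messages=[{"role": "user", "content": prompt}],
183 | )
184 | 
185 | print(response.choices[0].message.content)
186 | ```
187 | 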
167 | Note that your dataset doesn't have to be in the Q&A form. Check [etc/chunking.md](etc/chunking.md) to learn more about chunking.
168 |
169 |
170 | ## Cheating and plagiarism
171 |
172 | Plagiarism in any form is not allowed. Examples of plagiarism:
173 |
174 | * Taking somebody else's notebooks or projects (in full or in part) and using them for your project
175 | * Re-using your own projects (in full or in part) from other courses and bootcamps
176 | * Re-using your attempt 1 project as attempt 2 if you passed attempt 1
177 | * Re-using your project from the previous iterations of the course
178 |
179 | Violating any of this will result in 0 points for this project.
180 |
181 | Re-using some parts of the code from the course is allowed.
182 |
--------------------------------------------------------------------------------