├── .env.example ├── .gitignore ├── LICENSE ├── README.md ├── align_llm.ipynb ├── compare.ipynb ├── data └── chroma_docs.json ├── environment.yml ├── functions ├── chroma.py ├── embed.py ├── evaluate.py ├── llm.py ├── utils.py └── visualize.py ├── generate_benchmark.ipynb ├── img └── card.png ├── pyproject.toml ├── replicate_results.ipynb ├── requirements.txt └── results ├── 2025-03-31--13-59-25.json ├── 2025-03-31--14-01-03.json ├── 2025-03-31--14-08-55.json └── 2025-03-31--14-10-29.json /.env.example: -------------------------------------------------------------------------------- 1 | OPENAI_API_KEY= 2 | ANTHROPIC_API_KEY= 3 | 4 | CHROMA_TENANT= 5 | X_CHROMA_TOKEN= 6 | DATABASE_NAME= 7 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .env 2 | .DS_Store 3 | __pycache__ 4 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 Chroma 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Generative Benchmarking 2 | 3 | This project provides a comprehensive toolkit for generating custom benchmarks and replicating the results outlined in our [technical report](https://research.trychroma.com/generative-benchmarking). 4 | 5 | ![tech report card](img/card.png) 6 | 7 | ## Motivation 8 | 9 | Benchmarking is used to evaluate how well a model is performing, with the aim to generalize that performance to broader real-world scenarios. However, the widely-used benchmarks today often rely on artificially clean datasets and generic domains, with the added concern that they have likely already been seen by embedding models in training. 10 | 11 | We introduce generative benchmarking as a way to address these limitations. Given a set of documents, we synthetically generate queries that are representative of the ground truth. 
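In practice, the helpers in `functions/` tie this together: documents are first filtered for quality, queries are then generated from the documents that pass, and the generated queries are used to score retrieval against the original corpus. A minimal sketch of that loop is shown below (the full, runnable walkthrough is `generate_benchmark.ipynb`; the client, filtered documents, query embeddings, and qrels are assumed to already be set up as in that notebook, and the model name is only an example):

```python
# Minimal sketch of the generative-benchmarking loop (see generate_benchmark.ipynb
# for the complete version; openai_client, filtered_documents, filtered_ids,
# query_embeddings, corpus_collection, and qrels are assumed to exist already).
from functions.llm import create_golden_dataset
from functions.evaluate import run_benchmark

# 1. Generate one representative query per quality-filtered document.
golden_df = create_golden_dataset(
    client=openai_client,          # initialized OpenAI client (assumption)
    model="gpt-4o",                # any chat-completion model works here (assumption)
    documents=filtered_documents,
    ids=filtered_ids,
    context="Technical support bot for a vector database",
    example_queries="how to add to a collection\nfilter by metadata",
)

# 2. Score an embedding model by retrieving each generated query's source document.
metrics = run_benchmark(
    query_embeddings_lookup=query_embeddings,  # {query_id: {"text": ..., "embedding": ...}}
    collection=corpus_collection,              # Chroma collection holding the embedded corpus
    qrels=qrels,                               # DataFrame with query-id / corpus-id / score columns
)
```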
12 | 13 | 14 | ## Overview 15 | This repository offers tools to: 16 | - **Generate Custom Benchmarks:** Generate benchmarks tailored to your data and use case 17 | - **Align LLMs:** Align your LLM judge to human preferences using an adapted version of [EvalGen](https://arxiv.org/pdf/2404.12272) 18 | - **Compare Results:** Compare metrics from your generated benchmark 19 | - **Replicate Results:** Follow our guided notebooks to replicate the results presented in our technical report 20 | 21 | ## Repository Structure 22 | 23 | - **`generate_benchmark.ipynb`** 24 |   A comprehensive guide to generating a custom benchmark based on your data 25 | 26 | - **`align_llm.ipynb`** 27 |   A walkthrough of aligning the LLM judge to your specific use case 28 | 29 | - **`compare.ipynb`** 30 |   A framework for comparing results, which is useful when evaluating different embedding models or configurations 31 | 32 | - **`replicate_results.ipynb`** 33 |   A guide to replicating the experiments and results from our technical report 34 | 35 | - **`data/`** 36 |   Example data for trying out the notebooks immediately 37 | 38 | - **`functions/`** 39 |   Functions used to run the notebooks, including various embedding functions and LLM prompts 40 | 41 | - **`results/`** 42 |   Folder for saving benchmark results, including results produced from the example data 43 | 44 | 45 | ## Installation 46 | 47 | ### pip 48 | 49 | ```bash 50 | pip install -r requirements.txt 51 | ``` 52 | 53 | ### poetry 54 | ```bash 55 | poetry install 56 | ``` 57 | 58 | ### conda 59 | ```bash 60 | conda env create -f environment.yml 61 | conda activate generative-benchmarking-env 62 | ``` 63 | 64 | ## Citation 65 | 66 | If you use this package in your research, please cite our technical report: 67 | 68 | ``` 69 | @techreport{hong2025benchmarking, 70 | title = {Generative Benchmarking}, 71 | author = {Hong, Kelly and Troynikov, Anton and Huber, Jeff and McGuire, Morgan}, 72 | year = {2025}, 73 | month = {April}, 74 | institution = {Chroma}, 75 | url = {https://research.trychroma.com/generative-benchmarking}, 76 | } 77 | ``` -------------------------------------------------------------------------------- /align_llm.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Align LLM Judge\n", 8 | "\n", 9 | "This notebook walks through how to align your LLM for document quality filtering. \n", 10 | "\n", 11 | "We use our adaptation of the [EvalGen](https://arxiv.org/pdf/2404.12272) framework, a systematic approach to aligning your LLM judge with human preferences." 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "## 1. Setup" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "### 1.1 Install & Import\n", 26 | "\n", 27 | "Install the necessary packages." 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": null, 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "!pip install -r requirements.txt" 37 | ] 38 | }, 39 | { 40 | "cell_type": "markdown", 41 | "metadata": {}, 42 | "source": [ 43 | "Import modules." 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 3, 49 | "metadata": {}, 50 | "outputs": [ 51 | { 52 | "name": "stdout", 53 | "output_type": "stream", 54 | "text": [ 55 | "The autoreload extension is already loaded. 
To reload it, use:\n", 56 | "  %reload_ext autoreload\n" 57 | ] 58 | } 59 | ], 60 | "source": [ 61 | "%load_ext autoreload\n", 62 | "%autoreload 2\n", 63 | "\n", 64 | "import pandas as pd\n", 65 | "import numpy as np\n", 66 | "import json\n", "import os\n", 67 | "from openai import OpenAI as OpenAIClient\n", 68 | "from anthropic import Anthropic as AnthropicClient\n", 69 | "from functions.llm import *\n", 70 | "from functions.evaluate import *" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": {}, 76 | "source": [ 77 | "### 1.2 Set Client" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": {}, 84 | "outputs": [], 85 | "source": [ 86 | "from dotenv import load_dotenv\n", 87 | "\n", 88 | "load_dotenv()\n", 89 | "\n", 90 | "ANTHROPIC_API_KEY = os.environ.get(\"ANTHROPIC_API_KEY\", \"OR_ENTER_YOUR_KEY_HERE\")\n", 91 | "\n", 92 | "anthropic_client = AnthropicClient(api_key=ANTHROPIC_API_KEY)" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "### 1.3 Load in Labeled Data\n", 100 | "\n", 101 | "Load in your manually labeled data and your entire corpus of documents.\n", 102 | "- Reference data schema in `data/human_labeled_data.json` and `data/chroma_docs.json`.\n", 103 | "\n", 104 | "We recommend ~200 labeled entries to start with." 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": null, 110 | "metadata": {}, 111 | "outputs": [], 112 | "source": [ 113 | "with open('data/human_labeled_data.json', 'r') as f:\n", 114 | "    human_labeled_documents = json.load(f)\n", 115 | "\n", 116 | "with open('data/chroma_docs.json', 'r') as f:\n", 117 | "    all_documents = json.load(f)" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [ 126 | "labeled_ids = list(human_labeled_documents.keys())\n", 127 | "labeled_documents = [all_documents[id] for id in labeled_ids]\n", 128 | "\n", 129 | "unlabeled_ids = [key for key in all_documents if key not in labeled_ids]\n", 130 | "unlabeled_documents = [all_documents[id] for id in unlabeled_ids]" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "## 2. Baseline" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "### 2.1 Define Criteria\n", 145 | "\n", 146 | "We define our baseline criteria that we can iterate on." 147 | ] 148 | }, 149 | { 150 | "cell_type": "markdown", 151 | "metadata": {}, 152 | "source": [ 153 | "Fill in `context` and `user_intent` according to your use case." 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": 2, 159 | "metadata": {}, 160 | "outputs": [], 161 | "source": [ 162 | "context = \"FILL IN WITH YOUR CONTEXT\"\n", 163 | "user_intent = \"FILL IN WITH YOUR USER'S INTENT (e.g. seeking help with technical issues)\"" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "Feel free to modify/add criteria as you see fit."
171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "relevance = f\"The document is relevant and something that users would search for considering the following context: {context}\"\n", 180 | "\n", 181 | "completeness = \"The document is complete, meaning that it contains useful information to answer queries and does not only serve as an introduction or summary for the main content that users may be looking for.\"\n", 182 | "\n", 183 | "intent = f\"The document would be relevant in the case of a user {user_intent}\"\n", 184 | "\n", 185 | "criteria = [relevance, completeness, intent]\n", 186 | "criteria_labels = [\"relevant\", \"complete\", \"intent\"]" 187 | ] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "metadata": {}, 192 | "source": [ 193 | "### 2.2 Get LLM Labels\n", 194 | "\n", 195 | "We create a batch request for our LLM calls (this is cheaper and typically faster)." 196 | ] 197 | }, 198 | { 199 | "cell_type": "code", 200 | "execution_count": null, 201 | "metadata": {}, 202 | "outputs": [], 203 | "source": [ 204 | "filtered_documents_v1_id = create_document_filter_batch(\n", 205 | "    client=anthropic_client,\n", 206 | "    documents=labeled_documents,\n", 207 | "    ids=labeled_ids,\n", 208 | "    criteria=criteria,\n", 209 | "    criteria_labels=criteria_labels\n", 210 | ")" 211 | ] 212 | }, 213 | { 214 | "cell_type": "markdown", 215 | "metadata": {}, 216 | "source": [ 217 | "You can check the status of your batch through the [Anthropic Console](https://console.anthropic.com/workspaces/default/batches).\n", 218 | "\n", 219 | "Retrieve the batch once it is finished." 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": null, 225 | "metadata": {}, 226 | "outputs": [], 227 | "source": [ 228 | "filtered_documents_v1 = retrieve_document_filter_batch(\n", 229 | "    client=anthropic_client,\n", 230 | "    batch_id=filtered_documents_v1_id\n", 231 | ")" 232 | ] 233 | }, 234 | { 235 | "cell_type": "markdown", 236 | "metadata": {}, 237 | "source": [ 238 | "### 2.3 Compare LLM vs Human Labels\n", 239 | "\n", 240 | "We take our LLM-labeled data and compare it with our manual labeling.\n", 241 | "\n", 242 | "`criteria_threshold` indicates the number of criteria that must be met in order for a document to be considered \"good quality\"." 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": null, 248 | "metadata": {}, 249 | "outputs": [], 250 | "source": [ 251 | "llm_vs_human(\n", 252 | "    llm_judgements=filtered_documents_v1,\n", 253 | "    human_judgements=human_labeled_documents,\n", 254 | "    documents_mapping=all_documents,\n", 255 | "    criteria_labels=criteria_labels,\n", 256 | "    criteria_threshold=2\n", 257 | ")" 258 | ] 259 | }, 260 | { 261 | "cell_type": "markdown", 262 | "metadata": {}, 263 | "source": [ 264 | "## 3. 
Iterate\n", 265 | "\n", 266 | "Based on the results above, improve your LLM vs Human alignment score by iterating on your criteria:\n", 267 | "- Modify prompts\n", 268 | "- Add/remove criteria\n", 269 | "- Notice how the overall alignment score and criterion-specific scores change" 270 | ] 271 | } 272 | ], 273 | "metadata": { 274 | "kernelspec": { 275 | "display_name": "Python 3", 276 | "language": "python", 277 | "name": "python3" 278 | }, 279 | "language_info": { 280 | "codemirror_mode": { 281 | "name": "ipython", 282 | "version": 3 283 | }, 284 | "file_extension": ".py", 285 | "mimetype": "text/x-python", 286 | "name": "python", 287 | "nbconvert_exporter": "python", 288 | "pygments_lexer": "ipython3", 289 | "version": "3.9.6" 290 | } 291 | }, 292 | "nbformat": 4, 293 | "nbformat_minor": 2 294 | } 295 | -------------------------------------------------------------------------------- /compare.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Compare Embedding Models\n", 8 | "\n", 9 | "This notebook walks through how to compare various embedding models with your custom benchmark results.\n", 10 | "\n", 11 | "NOTE: what else would be useful here?" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "## 1. Setup" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "### 1.1 Install & Import\n", 26 | "\n", 27 | "Install the necessary packages." 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": null, 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "%pip install -r requirements.txt" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [ 45 | "%load_ext autoreload\n", 46 | "%autoreload 2\n", 47 | "\n", 48 | "import pandas as pd\n", 49 | "import numpy as np\n", 50 | "import json\n", 51 | "import os\n", 52 | "from pathlib import Path\n", 53 | "from functions.utils import *\n", 54 | "from functions.visualize import *" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "### 1.2 Load in Results" 62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 13, 67 | "metadata": {}, 68 | "outputs": [], 69 | "source": [ 70 | "results_dir = Path(\"results\")\n", 71 | "\n", 72 | "with open(os.path.join(results_dir, \"2025-03-31--14-01-03.json\"), \"r\") as f:\n", 73 | " openai_small_results = json.load(f)\n", 74 | "\n", 75 | "with open(os.path.join(results_dir, \"2025-03-31--13-59-25.json\"), \"r\") as f:\n", 76 | " openai_large_results = json.load(f)\n", 77 | " \n", 78 | "with open(os.path.join(results_dir, \"2025-03-31--14-08-55.json\"), \"r\") as f:\n", 79 | " jina_results = json.load(f)\n", 80 | "\n", 81 | "with open(os.path.join(results_dir, \"2025-03-31--14-10-29.json\"), \"r\") as f:\n", 82 | " voyage_results = json.load(f)\n", 83 | "\n", 84 | "# Load in the results you wish to compare" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "metadata": {}, 91 | "outputs": [], 92 | "source": [ 93 | "results_list = [openai_small_results, openai_large_results, jina_results, voyage_results] # Add as many results as you want to compare\n", 94 | "\n", 95 | "# Create a dataframe of the results\n", 96 | "metrics_df = create_metrics_dataframe(results_list)\n", 97 | "\n", 98 | "metrics_df" 99 | 
] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "## 2. Compare" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": null, 111 | "metadata": {}, 112 | "outputs": [], 113 | "source": [ 114 | "compare_embedding_models(\n", 115 | " metrics_df = metrics_df,\n", 116 | " metric = \"Recall@3\",\n", 117 | " title = \"Recall@3 Scores by Model\"\n", 118 | ")" 119 | ] 120 | } 121 | ], 122 | "metadata": { 123 | "kernelspec": { 124 | "display_name": "Python 3", 125 | "language": "python", 126 | "name": "python3" 127 | }, 128 | "language_info": { 129 | "codemirror_mode": { 130 | "name": "ipython", 131 | "version": 3 132 | }, 133 | "file_extension": ".py", 134 | "mimetype": "text/x-python", 135 | "name": "python", 136 | "nbconvert_exporter": "python", 137 | "pygments_lexer": "ipython3", 138 | "version": "3.9.6" 139 | } 140 | }, 141 | "nbformat": 4, 142 | "nbformat_minor": 2 143 | } 144 | -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | name: generative-benchmarking-env 2 | channels: 3 | - defaults 4 | - conda-forge 5 | dependencies: 6 | - python>=3.9.6 7 | - pandas 8 | - numpy 9 | - tqdm 10 | - matplotlib 11 | - pip 12 | - notebook 13 | - pytrec_eval 14 | - pip: 15 | - chromadb 16 | - datasets 17 | - sentence-transformers 18 | - voyageai 19 | - openai 20 | - anthropic 21 | - dotenv 22 | -------------------------------------------------------------------------------- /functions/chroma.py: -------------------------------------------------------------------------------- 1 | from concurrent.futures import ThreadPoolExecutor 2 | import os 3 | import multiprocessing 4 | from typing import List, Any, Dict 5 | from tqdm import tqdm 6 | 7 | def collection_add_in_batches( 8 | collection: Any, 9 | ids: List[str], 10 | texts: List[str], 11 | embeddings: List[List[float]], 12 | metadatas: List[Dict[str, Any]] = None 13 | ) -> None: 14 | BATCH_SIZE = 100 15 | LEN = len(embeddings) 16 | N_THREADS = min(os.cpu_count() or multiprocessing.cpu_count(), 20) 17 | 18 | def add_batch(start: int, end: int) -> None: 19 | id_batch = ids[start:end] 20 | doc_batch = texts[start:end] 21 | 22 | print(f"Adding {start} to {end}") 23 | 24 | try: 25 | if metadatas: 26 | collection.add(ids=id_batch, documents=doc_batch, embeddings=embeddings[start:end], metadatas=metadatas[start:end]) 27 | else: 28 | collection.add(ids=id_batch, documents=doc_batch, embeddings=embeddings[start:end]) 29 | except Exception as e: 30 | print(f"Error adding {start} to {end}") 31 | print(e) 32 | 33 | threadpool = ThreadPoolExecutor(max_workers=N_THREADS) 34 | 35 | for i in range(0, LEN, BATCH_SIZE): 36 | threadpool.submit(add_batch, i, min(i + BATCH_SIZE, LEN)) 37 | 38 | threadpool.shutdown(wait=True) 39 | 40 | def get_collection_items( 41 | collection: Any, 42 | ) -> dict: 43 | BATCH_SIZE = 100 44 | collection_size = collection.count() 45 | items = collection.get(include=["metadatas"]) 46 | 47 | ids = items['ids'] 48 | 49 | embeddings_lookup = dict() 50 | 51 | for i in tqdm(range(0, collection_size, BATCH_SIZE), desc="Processing batches"): 52 | batch_ids = ids[i:i + BATCH_SIZE] 53 | result = collection.get(ids=batch_ids, include=["embeddings", "documents"]) 54 | 55 | retrieved_ids = result["ids"] 56 | retrieved_embeddings = result["embeddings"] 57 | retrieved_documents = result["documents"] 58 | 59 | for id, embedding, document in zip(retrieved_ids, 
retrieved_embeddings, retrieved_documents): 60 | embeddings_lookup[id] = { 61 | 'embedding': embedding, 62 | 'document': document 63 | } 64 | 65 | return embeddings_lookup -------------------------------------------------------------------------------- /functions/embed.py: -------------------------------------------------------------------------------- 1 | from typing import List, Any 2 | from tqdm import tqdm 3 | import requests 4 | import json 5 | from voyageai import Client as VoyageClient 6 | from openai import OpenAI as OpenAIClient 7 | 8 | def minilm_embed( 9 | model: Any, 10 | texts: List[str], 11 | ) -> List[List[float]]: 12 | embeddings = model.encode(texts) 13 | return embeddings 14 | 15 | def minilm_embed_in_batches( 16 | model: Any, 17 | texts: List[str], 18 | batch_size: int = 100 19 | ) -> List[List[float]]: 20 | all_embeddings = [] 21 | 22 | for i in tqdm(range(0, len(texts), batch_size), desc="Processing MiniLM batches"): 23 | batch = texts[i:i + batch_size] 24 | batch_embeddings = minilm_embed(model, batch) 25 | all_embeddings.extend(batch_embeddings) 26 | 27 | return all_embeddings 28 | 29 | 30 | def openai_embed( 31 | openai_client: OpenAIClient, 32 | texts: List[str], 33 | model: str 34 | ) -> List[List[float]]: 35 | try: 36 | return [response.embedding for response in openai_client.embeddings.create(model=model, input = texts).data] 37 | except Exception as e: 38 | print(f"Error embedding: {e}") 39 | return [[0.0]*1024 for _ in texts] 40 | 41 | def openai_embed_in_batches( 42 | openai_client: OpenAIClient, 43 | texts: List[str], 44 | model: str, 45 | batch_size: int = 100 46 | ) -> List[List[float]]: 47 | all_embeddings = [] 48 | 49 | for i in tqdm(range(0, len(texts), batch_size), desc="Processing OpenAI batches"): 50 | batch = texts[i:i + batch_size] 51 | batch_embeddings = openai_embed(openai_client, batch, model) 52 | all_embeddings.extend(batch_embeddings) 53 | 54 | return all_embeddings 55 | 56 | 57 | def jina_embed( 58 | JINA_API_KEY: str, 59 | input_type: str, 60 | texts: List[str] 61 | ) -> List[List[float]]: 62 | try: 63 | url = "https://api.jina.ai/v1/embeddings" 64 | headers = { 65 | "Content-Type": "application/json", 66 | "Authorization": f"Bearer {JINA_API_KEY}" 67 | } 68 | 69 | data = { 70 | "model": "jina-embeddings-v3", 71 | "task": input_type, 72 | "late_chunking": False, 73 | "dimensions": 1024, 74 | "embedding_type": "float", 75 | "input": texts 76 | } 77 | 78 | response = requests.post(url, headers=headers, json=data) 79 | response_dict = json.loads(response.text) 80 | embeddings = [item["embedding"] for item in response_dict["data"]] 81 | 82 | return embeddings 83 | 84 | except Exception as e: 85 | print(f"Error embedding batch: {e}") 86 | return [[0.0]*1024 for _ in texts] 87 | 88 | def jina_embed_in_batches( 89 | JINA_API_KEY: str, 90 | input_type: str, 91 | texts: List[str], 92 | batch_size: int = 100 93 | ) -> List[List[float]]: 94 | all_embeddings = [] 95 | 96 | for i in tqdm(range(0, len(texts), batch_size), desc="Processing Jina batches"): 97 | batch = texts[i:i + batch_size] 98 | batch_embeddings = jina_embed(JINA_API_KEY, input_type, batch) 99 | all_embeddings.extend(batch_embeddings) 100 | 101 | return all_embeddings 102 | 103 | 104 | def voyage_embed( 105 | voyage_client: VoyageClient, 106 | input_type: str, 107 | texts: List[str] 108 | ) -> List[List[float]]: 109 | try: 110 | response = voyage_client.embed(texts, model="voyage-3-large", input_type=input_type) 111 | return response.embeddings 112 | 113 | except Exception as e: 114 | 
print(f"Error embedding batch: {e}") 115 | return [[0.0]*1024 for _ in texts] 116 | 117 | def voyage_embed_in_batches( 118 | voyage_client: VoyageClient, 119 | input_type: str, 120 | texts: List[str], 121 | batch_size: int = 100 122 | ) -> List[List[float]]: 123 | all_embeddings = [] 124 | 125 | for i in tqdm(range(0, len(texts), batch_size), desc="Processing Voyage batches"): 126 | batch = texts[i:i + batch_size] 127 | 128 | batch_embeddings = voyage_embed(voyage_client, input_type, batch) 129 | 130 | all_embeddings.extend(batch_embeddings) 131 | 132 | return all_embeddings -------------------------------------------------------------------------------- /functions/evaluate.py: -------------------------------------------------------------------------------- 1 | import pytrec_eval 2 | from tqdm import tqdm 3 | import pandas as pd 4 | import numpy as np 5 | from typing import Any, Dict, List 6 | import matplotlib.pyplot as plt 7 | import chromadb 8 | from functions.visualize import * 9 | 10 | # Benchmarking 11 | def query_collection( 12 | collection: Any, 13 | query_text: List[str], 14 | query_ids: List[str], 15 | query_embeddings: List[List[float]], 16 | n_results: int = 10 17 | ) -> Dict[str, Dict[str, Any]]: 18 | BATCH_SIZE = 100 19 | results = dict() 20 | 21 | for i in tqdm(range(0, len(query_embeddings), BATCH_SIZE), desc="Processing batches"): 22 | batch_text = query_text[i:i + BATCH_SIZE] 23 | batch_ids = query_ids[i:i + BATCH_SIZE] 24 | batch_embeddings = query_embeddings[i:i + BATCH_SIZE] 25 | 26 | query_results = collection.query( 27 | query_embeddings=batch_embeddings, 28 | query_texts=batch_text, 29 | n_results=n_results 30 | ) 31 | 32 | for idx, (query_id, query_embedding) in enumerate(zip(batch_ids, batch_embeddings)): 33 | results[query_id] = { 34 | "query_embedding": query_embedding, 35 | "retrieved_corpus_ids": query_results["ids"][idx], 36 | "retrieved_corpus_text": query_results["documents"][idx], 37 | "all_scores": [1 - d for d in query_results["distances"][idx]] 38 | } 39 | 40 | return results 41 | 42 | def get_metrics( 43 | qrels: Dict[str, Dict[str, int]], 44 | results: Dict[str, Dict[str, float]], 45 | k_values: List[int] 46 | ) -> Dict[str, Dict[str, float]]: 47 | recall = dict() 48 | precision = dict() 49 | map = dict() 50 | ndcg = dict() 51 | 52 | for k in k_values: 53 | recall[f"Recall@{k}"] = 0.0 54 | precision[f"P@{k}"] = 0.0 55 | map[f"MAP@{k}"] = 0.0 56 | ndcg[f"NDCG@{k}"] = 0.0 57 | 58 | recall_string = "recall." + ",".join([str(k) for k in k_values]) 59 | precision_string = "P." + ",".join([str(k) for k in k_values]) 60 | map_string = "map_cut." + ",".join([str(k) for k in k_values]) 61 | ndcg_string = "ndcg_cut." 
+ ",".join([str(k) for k in k_values]) 62 | 63 | evaluator = pytrec_eval.RelevanceEvaluator(qrels, {map_string, ndcg_string, recall_string, precision_string}) 64 | 65 | scores = evaluator.evaluate(results) 66 | 67 | for query_id in scores.keys(): 68 | for k in k_values: 69 | ndcg[f"NDCG@{k}"] += scores[query_id]["ndcg_cut_" + str(k)] 70 | map[f"MAP@{k}"] += scores[query_id]["map_cut_" + str(k)] 71 | recall[f"Recall@{k}"] += scores[query_id]["recall_" + str(k)] 72 | precision[f"P@{k}"] += scores[query_id]["P_"+ str(k)] 73 | 74 | for k in k_values: 75 | ndcg[f"NDCG@{k}"] = round(ndcg[f"NDCG@{k}"]/len(scores), 5) 76 | map[f"MAP@{k}"] = round(map[f"MAP@{k}"]/len(scores), 5) 77 | recall[f"Recall@{k}"] = round(recall[f"Recall@{k}"]/len(scores), 5) 78 | precision[f"P@{k}"] = round(precision[f"P@{k}"]/len(scores), 5) 79 | 80 | return ndcg, map, recall, precision 81 | 82 | def evaluate( 83 | k_values: List[int], 84 | qrels_df: pd.DataFrame, 85 | results_dict: Dict[str, Dict[str, float]] 86 | ) -> Dict[str, Dict[str, float]]: 87 | qrels = qrels_df.groupby("query-id").apply(lambda g: dict(zip(g["corpus-id"], g["score"]))).to_dict() 88 | 89 | qrels = { 90 | qid: {doc_id: int(score) for doc_id, score in doc_dict.items()} 91 | for qid, doc_dict in qrels.items() 92 | } 93 | 94 | results = {} 95 | for query_id, query_data in results_dict.items(): 96 | results[query_id] = {} 97 | for doc_id, score in zip(query_data['retrieved_corpus_ids'], query_data['all_scores']): 98 | results[query_id][doc_id] = score 99 | 100 | ndcg, map, recall, precision = get_metrics( 101 | qrels=qrels, 102 | results=results, 103 | k_values=k_values 104 | ) 105 | 106 | final_result = { 107 | "NDCG": ndcg, 108 | "MAP": map, 109 | "Recall": recall, 110 | "Precision": precision 111 | } 112 | 113 | return final_result 114 | 115 | def run_benchmark( 116 | query_embeddings_lookup: Dict[str, Dict[str, float]], 117 | collection: Any, 118 | qrels: pd.DataFrame, 119 | k_values: List[int] = [1,3,5,10] 120 | ) -> Dict[str, Dict[str, float]]: 121 | query_ids = list(query_embeddings_lookup.keys()) 122 | queries = [query_embeddings_lookup[query_id]["text"] for query_id in query_ids] 123 | query_embeddings = [query_embeddings_lookup[query_id]["embedding"] for query_id in query_ids] 124 | 125 | query_results = query_collection( 126 | collection=collection, 127 | query_text=queries, 128 | query_ids=query_ids, 129 | query_embeddings=query_embeddings, 130 | n_results=20 131 | ) 132 | 133 | metrics = evaluate( 134 | k_values=k_values, 135 | qrels_df=qrels, 136 | results_dict=query_results 137 | ) 138 | 139 | for _, value in metrics.items(): 140 | for k, v in value.items(): 141 | print(f"{k}: {v}") 142 | 143 | return metrics 144 | 145 | def cosine_similarity( 146 | vec1: np.ndarray, 147 | vec2: np.ndarray 148 | ) -> float: 149 | return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)) 150 | 151 | # Alinging LLM Judge 152 | def llm_vs_human( 153 | llm_judgements: Dict[str, Dict[str, str]], 154 | human_judgements: Dict[str, Dict[str, str]], 155 | documents_mapping: Dict[str, str], 156 | criteria_labels: List[str], 157 | criteria_threshold: int 158 | ) -> None: 159 | results = {} 160 | aligned = [] 161 | not_aligned = [] 162 | met_threshold = [] 163 | overall_alignment = 0 164 | 165 | for criterion in criteria_labels: 166 | results[criterion] = 0.0 167 | 168 | for key, value in llm_judgements.items(): 169 | human_judgement = human_judgements[key] 170 | 171 | num_criteria_met = 0 172 | 173 | for criterion in criteria_labels: 174 | if 
value[criterion] and human_judgement: 175 | results[criterion] += 1 176 | num_criteria_met += 1 177 | elif not value[criterion] and not human_judgement: 178 | results[criterion] += 1 179 | num_criteria_met += 1 180 | 181 | if num_criteria_met == len(criteria_labels): 182 | aligned.append(documents_mapping[key]) 183 | elif num_criteria_met == 0: 184 | not_aligned.append(documents_mapping[key]) 185 | 186 | predicted = (num_criteria_met >= criteria_threshold) 187 | if predicted: 188 | met_threshold.append({ 189 | 'id': key, 190 | 'document': documents_mapping[key] 191 | }) 192 | 193 | if predicted == human_judgement: 194 | overall_alignment += 1 195 | 196 | for criterion in criteria_labels: 197 | results[criterion] = results[criterion] / len(llm_judgements) 198 | 199 | print(results) 200 | 201 | overall_alignment_score = (overall_alignment / len(llm_judgements)) * 100 202 | 203 | print(f"Documents aligned with Human Judgement: {overall_alignment}, {overall_alignment_score}%") 204 | print("\n") 205 | print(f"Number of documents meeting threshold: {len(met_threshold)}") 206 | print(f"Number of documents 100% aligned: {len(aligned)}") 207 | print(f"Number of documents 0% aligned: {len(not_aligned)}") 208 | 209 | for aligned_doc in aligned[:5]: 210 | print(f"Aligned: {aligned_doc}") 211 | print("\n") 212 | 213 | for not_aligned_doc in not_aligned[:5]: 214 | print(f"Not aligned: {not_aligned_doc}") 215 | print("\n") 216 | 217 | def score_query_query( 218 | qrels: pd.DataFrame, 219 | query_embeddings_1: dict, 220 | query_embeddings_2: dict, 221 | column_name: str, 222 | output_path: str = None 223 | ) -> pd.DataFrame: 224 | similarity_scores = [] 225 | 226 | # Check if inputs are lists (from evaluate_and_visualize_wandb) 227 | using_lists = isinstance(query_embeddings_1, list) and isinstance(query_embeddings_2, list) 228 | 229 | for idx, row in qrels.iterrows(): 230 | query_id = row['query-id'] 231 | 232 | if using_lists: 233 | # Assume the lists are in the same order as the qrels DataFrame 234 | col1_embedding = query_embeddings_1[idx] 235 | col2_embedding = query_embeddings_2[idx] 236 | else: 237 | # Use the dictionary lookup 238 | col1_embedding = query_embeddings_1[query_id]['embedding'] 239 | col2_embedding = query_embeddings_2[query_id]['embedding'] 240 | 241 | similarity = cosine_similarity(col1_embedding, col2_embedding).item() 242 | similarity_scores.append(similarity) 243 | 244 | scores_df = qrels.copy() 245 | scores_df[column_name] = similarity_scores 246 | 247 | if output_path: 248 | scores_df.to_csv(output_path) 249 | 250 | return scores_df 251 | 252 | 253 | def score_query_document( 254 | qrels: pd.DataFrame, 255 | query_embeddings_dict: dict, 256 | corpus_embeddings_dict: dict, 257 | column_name: str, 258 | output_path: str = None 259 | ) -> pd.DataFrame: 260 | similarity_scores = [] 261 | 262 | for _, row in qrels.iterrows(): 263 | query_id = row['query-id'] 264 | corpus_id = row['corpus-id'] 265 | 266 | query_embedding = query_embeddings_dict[query_id]['embedding'] 267 | corpus_embedding = corpus_embeddings_dict[corpus_id]['embedding'] 268 | 269 | similarity = cosine_similarity(query_embedding, corpus_embedding).item() 270 | similarity_scores.append(similarity) 271 | 272 | scores_df = qrels.copy() 273 | scores_df[column_name] = similarity_scores 274 | 275 | if output_path: 276 | scores_df.to_parquet(output_path) 277 | 278 | return scores_df 279 | 280 | def evaluate_and_visualize( 281 | ground_truth_query_dict: Dict[str, Dict[str, float]], 282 | generated_query_dict: Dict[str, 
Dict[str, float]], 283 | corpus_embeddings_dict: Dict[str, Dict[str, float]], 284 | qrels: pd.DataFrame, 285 | collection: Any, 286 | dataset_name: str, 287 | model_name: str, 288 | k_values: List[int] = [1,3,5,10] 289 | ) -> Dict[str, Dict[str, float]]: 290 | 291 | query_ids = list(ground_truth_query_dict.keys()) 292 | 293 | ground_truth_queries = [ground_truth_query_dict[query_id]["text"] for query_id in query_ids] 294 | generated_queries = [generated_query_dict[query_id]["text"] for query_id in query_ids] 295 | 296 | ground_truth_query_embeddings = [ground_truth_query_dict[query_id]["embedding"] for query_id in query_ids] 297 | generated_query_embeddings = [generated_query_dict[query_id]["embedding"] for query_id in query_ids] 298 | 299 | ground_truth_query_results = query_collection( 300 | collection=collection, 301 | query_text=ground_truth_queries, 302 | query_ids=query_ids, 303 | query_embeddings=ground_truth_query_embeddings 304 | ) 305 | generated_query_results = query_collection( 306 | collection=collection, 307 | query_text=generated_queries, 308 | query_ids=query_ids, 309 | query_embeddings=generated_query_embeddings 310 | ) 311 | 312 | ground_truth_query_metrics = evaluate( 313 | k_values=k_values, 314 | qrels_df=qrels, 315 | results_dict=ground_truth_query_results 316 | ) 317 | generated_query_metrics = evaluate( 318 | k_values=k_values, 319 | qrels_df=qrels, 320 | results_dict=generated_query_results 321 | ) 322 | 323 | print(f"Ground Truth Query Metrics:") 324 | print(ground_truth_query_metrics) 325 | print(f"\nGenerated Query Metrics:") 326 | print(generated_query_metrics) 327 | 328 | ground_truth_document_scores = score_query_document( 329 | qrels=qrels, 330 | query_embeddings_dict=ground_truth_query_dict, 331 | corpus_embeddings_dict=corpus_embeddings_dict, 332 | column_name="ground-truth-document" 333 | ) 334 | 335 | generated_document_scores = score_query_document( 336 | qrels=qrels, 337 | query_embeddings_dict=generated_query_dict, 338 | corpus_embeddings_dict=corpus_embeddings_dict, 339 | column_name="generated-document" 340 | ) 341 | 342 | plot_overlaid_distribution( 343 | df_1=ground_truth_document_scores, 344 | df_2=generated_document_scores, 345 | column_1="ground-truth-document", 346 | column_2="generated-document", 347 | title=f"{dataset_name} - {model_name} (Query <> Document)", 348 | xlabel="Cosine Similarity", 349 | ylabel="Normalized Frequency" 350 | ) 351 | 352 | return { 353 | "ground_truth_metrics": ground_truth_query_metrics, 354 | "generated_metrics": generated_query_metrics 355 | } -------------------------------------------------------------------------------- /functions/llm.py: -------------------------------------------------------------------------------- 1 | from anthropic import Anthropic as AnthropicClient 2 | from openai import OpenAI as OpenAIClient 3 | from anthropic.types.message_create_params import MessageCreateParamsNonStreaming 4 | from anthropic.types.messages.batch_create_params import Request 5 | from typing import Dict, List, Any 6 | from tqdm import tqdm 7 | import pandas as pd 8 | import requests 9 | import re 10 | 11 | # Document Filtering 12 | def filter_documents( 13 | client: OpenAIClient, 14 | model: str, 15 | documents: List[str], 16 | ids: List[str], 17 | criteria: List[str], 18 | criteria_labels: List[str] 19 | ) -> List[str]: 20 | 21 | SYSTEM_INSTRUCTION = """ 22 | You are an assistant specialized in filtering documents based on specific criteria. 
23 | 24 | Given a document and a criterion, evaluate whether the document meets the criterion and output a single word: "yes" if the document meets the criterion, or "no" if it does not. Do not include any extra text or formatting, simply "yes" or "no". 25 | """ 26 | 27 | labels = {} 28 | filtered_document_ids = [] 29 | 30 | for document, id in tqdm(zip(documents, ids), total=len(documents), desc="Filtering documents"): 31 | labels[id] = {} 32 | 33 | for criterion, criterion_label in zip(criteria, criteria_labels): 34 | PROMPT = f""" 35 | Evaluate the following document with the criterion below. 36 | 37 | Criterion: {criterion} 38 | 39 | Document: {document} 40 | 41 | Output a single word: "yes" if the document meets the criterion, or "no" if it does not. Do not include any extra text or formatting, simply "yes" or "no". 42 | """ 43 | 44 | completion = client.chat.completions.create( 45 | model=model, 46 | messages=[ 47 | {"role": "system", "content": SYSTEM_INSTRUCTION}, 48 | {"role": "user", "content": PROMPT} 49 | ] 50 | ) 51 | 52 | if completion.choices[0].message.content == "yes": 53 | labels[id][criterion_label] = True 54 | else: 55 | labels[id][criterion_label] = False 56 | 57 | passed_all = True 58 | 59 | for criterion_label in criteria_labels: 60 | if not labels[id][criterion_label]: 61 | passed_all = False 62 | break 63 | 64 | if passed_all: 65 | filtered_document_ids.append(id) 66 | 67 | return filtered_document_ids 68 | 69 | def create_document_filter_batch( 70 | client: AnthropicClient, 71 | documents: List[str], 72 | ids: List[str], 73 | criteria: List[str], 74 | criteria_labels: List[str] 75 | ) -> str: 76 | 77 | SYSTEM_INSTRUCTION = """ 78 | You are an assistant specialized in filtering documents based on specific criteria. 79 | 80 | Given a document and a criterion, evaluate whether the document meets the criterion and output a single word: "yes" if the document meets the criterion, or "no" if it does not. Do not include any extra text or formatting, simply "yes" or "no". 81 | """ 82 | 83 | requests = [] 84 | 85 | for document, id in zip(documents, ids): 86 | for criterion, criterion_label in zip(criteria, criteria_labels): 87 | request_id = f"{id}_{criterion_label}" 88 | 89 | PROMPT = f""" 90 | Evaluate the following document with the criterion below. 91 | 92 | Criterion: {criterion} 93 | 94 | Document: {document} 95 | 96 | Output a single word: "yes" if the document meets the criterion, or "no" if it does not. Do not include any extra text or formatting, simply "yes" or "no". 
97 | """ 98 | 99 | requests.append(Request( 100 | custom_id=request_id, 101 | params=MessageCreateParamsNonStreaming( 102 | model="claude-3-5-sonnet-20241022", 103 | max_tokens=8192, 104 | temperature=0.2, 105 | system=SYSTEM_INSTRUCTION, 106 | messages=[ 107 | { 108 | "role": "user", 109 | "content": [ 110 | { 111 | "type": "text", 112 | "text": PROMPT 113 | } 114 | ] 115 | } 116 | ] 117 | ) 118 | )) 119 | 120 | batch = client.messages.batches.create(requests=requests) 121 | 122 | print(f"Batch (id: {batch.id}) created successfully") 123 | 124 | return batch.id 125 | 126 | def retrieve_document_filter_batch( 127 | client: AnthropicClient, 128 | batch_id: str 129 | ) -> Dict[str, Dict[str, str]]: 130 | batch = client.messages.batches.results(batch_id) 131 | 132 | results = {} 133 | 134 | for item in batch: 135 | id = item.custom_id.split("_")[0] 136 | criterion = item.custom_id.split("_")[1] 137 | 138 | if id not in results: 139 | results[id] = {} 140 | 141 | if item.result.message.content[0].text == "yes": 142 | results[id][criterion] = True 143 | else: 144 | results[id][criterion] = False 145 | 146 | return results 147 | 148 | def retrieve_document_filter_batch_df( 149 | client: AnthropicClient, 150 | batch_id: str 151 | ) -> Dict[str, Dict[str, str]]: 152 | batch = client.messages.batches.results(batch_id) 153 | 154 | ids = [] 155 | criteria = [] 156 | classification = [] 157 | 158 | for item in batch: 159 | id = item.custom_id.split("_")[0] 160 | criterion = item.custom_id.split("_")[1] 161 | 162 | ids.append(id) 163 | criteria.append(criterion) 164 | if item.result.message.content[0].text == "yes": 165 | classification.append(True) 166 | else: 167 | classification.append(False) 168 | 169 | result_df = pd.DataFrame({"id": ids, "criterion": criteria, "classification": classification}) 170 | 171 | return result_df 172 | 173 | def get_filtered_ids( 174 | filtered_documents_batch_df: pd.DataFrame, 175 | ) -> List[str]: 176 | grouped = filtered_documents_batch_df.groupby('id') 177 | filtered_ids = grouped.filter(lambda x: x['classification'].all()).id.unique() 178 | 179 | return filtered_ids 180 | 181 | # Query Generation 182 | def create_golden_dataset( 183 | client: OpenAIClient, 184 | model: str, 185 | documents: List[str], 186 | ids: List[str], 187 | context: str, 188 | example_queries: str 189 | ) -> pd.DataFrame: 190 | 191 | if len(ids) != len(documents): 192 | raise ValueError("Length of ids must match length of documents") 193 | 194 | queries = [] 195 | 196 | SYSTEM_INSTRUCTION = f""" 197 | You are an assistant specialized in generating queries to curate a high-quality synthetic dataset. 198 | 199 | Simply output the query without any additional words or formatting. 200 | """ 201 | 202 | for id, document in tqdm(zip(ids, documents), total=len(ids), desc="Generating queries"): 203 | PROMPT = f""" 204 | Consider the context: 205 | {context} 206 | 207 | Based on the following piece of text: 208 | 209 | {document} 210 | 211 | 212 | Please generate a realistic query that a user may ask relevant to the information provided above. 213 | 214 | Here are some example queries that users have asked which you should consider when generating your query: 215 | 216 | {example_queries} 217 | 218 | 219 | Do not repeat the example queries, they are only provided to give you an idea of the type of queries that users ask. 220 | Make your query relevant to the information provided above and keep it in a similar style to the example queries, which may not always be in a complete question format. 
221 | 222 | Simply output the query without any additional words. 223 | """ 224 | 225 | completion = client.chat.completions.create( 226 | model=model, 227 | messages=[ 228 | {"role": "system", "content": SYSTEM_INSTRUCTION}, 229 | {"role": "user", "content": PROMPT} 230 | ] 231 | ) 232 | 233 | queries.append(completion.choices[0].message.content) 234 | 235 | queries_df = pd.DataFrame({"id": ids, "query": queries}) 236 | 237 | return queries_df 238 | 239 | def create_golden_dataset_batch( 240 | client: AnthropicClient, 241 | model: str, 242 | documents: List[str], 243 | ids: List[str], 244 | context: str, 245 | example_queries: str 246 | ) -> str: 247 | 248 | if len(ids) != len(documents): 249 | raise ValueError("Length of ids must match length of documents") 250 | 251 | SYSTEM_INSTRUCTION = f""" 252 | You are an assistant specialized in generating queries to curate a high-quality synthetic dataset. 253 | 254 | Simply output the query without any additional words or formatting. 255 | """ 256 | 257 | requests = [] 258 | 259 | for id, document in zip(ids, documents): 260 | PROMPT = f""" 261 | Consider the context: 262 | {context} 263 | 264 | Based on the following piece of text: 265 | 266 | {document} 267 | 268 | 269 | Please generate a realistic query that a user may ask relevant to the information provided above. 270 | 271 | Here are some example queries that users have asked which you should consider when generating your query: 272 | 273 | {example_queries} 274 | 275 | 276 | Do not repeat the example queries, they are only provided to give you an idea of the type of queries that users ask. 277 | Make your query relevant to the information provided above and keep it in a similar style to the example queries, which may not always be in a complete question format. 278 | 279 | Simply output the query without any additional words in this format: 280 | 281 | [query] 282 | 283 | """ 284 | 285 | requests.append(Request( 286 | custom_id=id, 287 | params=MessageCreateParamsNonStreaming( 288 | model=model, 289 | max_tokens=8192, 290 | temperature=1, 291 | system=SYSTEM_INSTRUCTION, 292 | messages=[ 293 | { 294 | "role": "user", 295 | "content": [ 296 | { 297 | "type": "text", 298 | "text": PROMPT 299 | } 300 | ] 301 | } 302 | ] 303 | ) 304 | )) 305 | 306 | batch = client.messages.batches.create(requests=requests) 307 | 308 | print(f"Batch (id: {batch.id}) created successfully") 309 | 310 | return batch.id 311 | 312 | 313 | def retrieve_batch( 314 | client: AnthropicClient, 315 | batch_id: str 316 | ) -> pd.DataFrame: 317 | batch = client.messages.batches.results(batch_id) 318 | 319 | ids = [] 320 | queries = [] 321 | 322 | for item in batch: 323 | ids.append(item.custom_id) 324 | queries.append(item.result.message.content[0].text) 325 | 326 | result_df = pd.DataFrame({"id": ids, "query": queries}) 327 | 328 | return result_df 329 | 330 | # Results Replication 331 | def create_naive_query_batch( 332 | client: AnthropicClient, 333 | model: str, 334 | documents: List[str], 335 | ids: List[str] 336 | ) -> str: 337 | if len(ids) != len(documents): 338 | raise ValueError("Length of ids must match length of documents") 339 | 340 | SYSTEM_INSTRUCTION = "You are an assistant specialized in generating queries to curate a high-quality synthetic dataset" 341 | 342 | requests = [] 343 | for id, document in zip(ids, documents): 344 | PROMPT = f""" 345 | Based on the following piece of information: 346 | 347 | {document} 348 | 349 | 350 | Please generate a query relevant to the information provided above. 
351 | 352 | Simply output the query without any additional words in this format: 353 | 354 | [query] 355 | 356 | """ 357 | 358 | requests.append(Request( 359 | custom_id=id, 360 | params=MessageCreateParamsNonStreaming( 361 | model=model, 362 | max_tokens=8192, 363 | temperature=1, 364 | system=SYSTEM_INSTRUCTION, 365 | messages=[ 366 | { 367 | "role": "user", 368 | "content": [ 369 | { 370 | "type": "text", 371 | "text": PROMPT 372 | } 373 | ] 374 | } 375 | ] 376 | ) 377 | )) 378 | 379 | batch = client.messages.batches.create(requests=requests) 380 | 381 | print(f"Batch (id: {batch.id}) created successfully") 382 | 383 | return batch.id 384 | 385 | def create_naive_query_multilingual_batch( 386 | client: AnthropicClient, 387 | model: str, 388 | documents: List[str], 389 | ids: List[str], 390 | language: str 391 | ) -> str: 392 | 393 | if len(ids) != len(documents): 394 | raise ValueError("Length of ids must match length of documents") 395 | 396 | SYSTEM_INSTRUCTION = f"You are an assistant specialized in generating queries to curate a high-quality synthetic dataset in {language}" 397 | 398 | requests = [] 399 | for id, document in zip(ids, documents): 400 | PROMPT = f""" 401 | Based on the following piece of information: 402 | 403 | {document} 404 | 405 | 406 | Please generate a query relevant to the information provided above in {language}. 407 | 408 | Simply output the query without any additional words in this format: 409 | 410 | [query] 411 | 412 | """ 413 | 414 | requests.append(Request( 415 | custom_id=id, 416 | params=MessageCreateParamsNonStreaming( 417 | model=model, 418 | max_tokens=8192, 419 | temperature=1, 420 | system=SYSTEM_INSTRUCTION, 421 | messages=[ 422 | { 423 | "role": "user", 424 | "content": [ 425 | { 426 | "type": "text", 427 | "text": PROMPT 428 | } 429 | ] 430 | } 431 | ] 432 | ) 433 | )) 434 | 435 | batch = client.messages.batches.create(requests=requests) 436 | 437 | print(f"Batch (id: {batch.id}) created successfully") 438 | 439 | return batch.id 440 | 441 | def create_distinct_query_batch( 442 | client: AnthropicClient, 443 | model: str, 444 | documents: List[str], 445 | ids: List[str], 446 | queries: List[str] 447 | ) -> str: 448 | if len(ids) != len(documents): 449 | raise ValueError("Length of ids must match length of documents") 450 | 451 | SYSTEM_INSTRUCTION = "You are an assistant specialized in generating queries to curate a high-quality synthetic dataset" 452 | 453 | requests = [] 454 | for id, document, query in zip(ids, documents, queries): 455 | PROMPT = f""" 456 | Based on the following information: 457 | 458 | {document} 459 | 460 | 461 | This would be an example query that would be good for this kind of context: 462 | 463 | {query} 464 | 465 | 466 | Please generate one additional query that is distinct from the example, but is still relevant to the corpus. This point is very important, ensure that the generated query does not repeat the given example query. 
467 | 468 | Simply output the query without any additional words in this format: 469 | 470 | [query] 471 | 472 | """ 473 | 474 | requests.append(Request( 475 | custom_id=id, 476 | params=MessageCreateParamsNonStreaming( 477 | model=model, 478 | max_tokens=8192, 479 | temperature=1, 480 | system=SYSTEM_INSTRUCTION, 481 | messages=[ 482 | { 483 | "role": "user", 484 | "content": [ 485 | { 486 | "type": "text", 487 | "text": PROMPT 488 | } 489 | ] 490 | } 491 | ] 492 | ) 493 | )) 494 | 495 | batch = client.messages.batches.create(requests=requests) 496 | 497 | print(f"Batch (id: {batch.id}) created successfully") 498 | 499 | return batch.id 500 | 501 | def clean_id_for_batching(id_str: str) -> str: 502 | cleaned = re.sub(r'[^a-zA-Z0-9_-]', '_', str(id_str)) 503 | return cleaned 504 | 505 | def revert_id_from_batching(id_str: str) -> str: 506 | reverted = re.sub(r'_', '.', str(id_str), count=1) 507 | return reverted -------------------------------------------------------------------------------- /functions/utils.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from typing import List, Dict 3 | 4 | def combined_datasets_dataframes( 5 | queries: pd.DataFrame, 6 | corpus: pd.DataFrame, 7 | qrels: pd.DataFrame 8 | ) -> pd.DataFrame: 9 | qrels = qrels.merge(queries, left_on="query-id", right_on="_id", how="left") 10 | qrels.rename(columns={"text": "query-text"}, inplace=True) 11 | qrels.drop(columns=["_id"], inplace=True) 12 | qrels = qrels.merge(corpus, left_on="corpus-id", right_on="_id", how="left") 13 | qrels.rename(columns={"text": "corpus-text"}, inplace=True) 14 | qrels.drop(columns=["_id", "title"], inplace=True) 15 | 16 | return qrels 17 | 18 | def create_metrics_dataframe( 19 | results_list: List[Dict[str, Dict[str, float]]] 20 | ) -> pd.DataFrame: 21 | all_metrics = [] 22 | 23 | for result in results_list: 24 | model = result["model"] 25 | results = result["results"] 26 | 27 | all_metrics.append((model, results)) 28 | 29 | rows = [] 30 | 31 | for model, metrics in all_metrics: 32 | row = { 33 | 'Model': model, 34 | 'Recall@1': metrics['Recall']['Recall@1'], 35 | 'Recall@3': metrics['Recall']['Recall@3'], 36 | 'Recall@5': metrics['Recall']['Recall@5'], 37 | 'Recall@10': metrics['Recall']['Recall@10'], 38 | 'Precision@3': metrics['Precision']['P@3'], 39 | 'Precision@5': metrics['Precision']['P@5'], 40 | 'Precision@10': metrics['Precision']['P@10'], 41 | 'NDCG@3': metrics['NDCG']['NDCG@3'], 42 | 'NDCG@5': metrics['NDCG']['NDCG@5'], 43 | 'NDCG@10': metrics['NDCG']['NDCG@10'], 44 | 'MAP@3': metrics['MAP']['MAP@3'], 45 | 'MAP@5': metrics['MAP']['MAP@5'], 46 | 'MAP@10': metrics['MAP']['MAP@10'], 47 | } 48 | rows.append(row) 49 | 50 | metrics_df = pd.DataFrame(rows) 51 | 52 | return metrics_df -------------------------------------------------------------------------------- /functions/visualize.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import matplotlib.pyplot as plt 3 | import numpy as np 4 | 5 | def plot_single_distribution( 6 | df: pd.DataFrame, 7 | column: str, 8 | title: str = '', 9 | xlabel: str = '', 10 | ylabel: str = '', 11 | bins: int = 30, 12 | alpha: float = 0.5, 13 | edgecolor: str = 'black', 14 | range: tuple = (0, 1) 15 | ) -> None: 16 | counts, bin_edges = np.histogram(df[column], bins=bins, range=range) 17 | total = counts.sum() 18 | normalized_counts = counts / total 19 | 20 | bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2 21 | 22 | 
plt.figure(figsize=(8, 5)) 23 |     plt.bar(bin_centers, normalized_counts, width=bin_edges[1] - bin_edges[0], 24 |             alpha=alpha, edgecolor=edgecolor, label="Normalized Frequency") 25 |     plt.xlabel(xlabel) 26 |     plt.ylabel(ylabel) 27 |     plt.title(title) 28 |     plt.legend() 29 |     plt.grid(True) 30 |     plt.show() 31 | 32 | 33 | def plot_overlaid_distribution( 34 |     df_1: pd.DataFrame, 35 |     df_2: pd.DataFrame, 36 |     column_1: str, 37 |     column_2: str, 38 |     title: str = '', 39 |     xlabel: str = '', 40 |     ylabel: str = '', 41 |     bins: int = 30, 42 |     alpha: float = 0.5, 43 |     edgecolor: str = 'black', 44 |     range: tuple = (0, 1) 45 | ) -> None: 46 |     counts_1, bin_edges_1 = np.histogram(df_1[column_1], bins=bins, range=range) 47 |     counts_2, bin_edges_2 = np.histogram(df_2[column_2], bins=bins, range=range) 48 |     total_1 = counts_1.sum() 49 |     total_2 = counts_2.sum() 50 | 51 |     bin_centers_1 = (bin_edges_1[:-1] + bin_edges_1[1:]) / 2 52 |     bin_centers_2 = (bin_edges_2[:-1] + bin_edges_2[1:]) / 2 53 | 54 |     normalized_counts_1 = counts_1 / total_1 55 |     normalized_counts_2 = counts_2 / total_2 56 | 57 |     plt.figure(figsize=(8, 5)) 58 |     plt.bar(bin_centers_1, normalized_counts_1, width=bin_edges_1[1] - bin_edges_1[0], 59 |             alpha=alpha, edgecolor=edgecolor, label=column_1) 60 |     plt.bar(bin_centers_2, normalized_counts_2, width=bin_edges_2[1] - bin_edges_2[0], 61 |             alpha=alpha, edgecolor=edgecolor, label=column_2) 62 |     plt.xlabel(xlabel) 63 |     plt.ylabel(ylabel) 64 |     plt.title(title) 65 |     plt.legend() 66 |     plt.grid(True) 67 |     plt.show() 68 | 69 | 70 | def compare_embedding_models( 71 |     metrics_df: pd.DataFrame, 72 |     metric: str = 'Recall@3', 73 |     title: str = 'Recall@3 Scores by Model' 74 | ) -> None: 75 | 76 |     models = metrics_df['Model'].tolist() 77 |     x = np.arange(len(models)) 78 |     width = 0.4 79 | 80 | 81 |     _, ax = plt.subplots(figsize=(12, 6)) 82 |     ax.bar(x, metrics_df[metric], width, label='Score', color="#327eff") 83 | 84 |     ax.set_ylabel(metric) 85 |     ax.set_xlabel('Model') 86 |     ax.set_title(title) 87 |     ax.set_xticks(x) 88 |     ax.set_xticklabels(models, rotation=45, ha='right') 89 |     ax.legend() 90 |     ax.grid(True, alpha=0.3) 91 | 92 |     plt.tight_layout() 93 |     plt.show() -------------------------------------------------------------------------------- /generate_benchmark.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Generate Custom Benchmark\n", 8 | "\n", 9 | "This notebook walks through how to generate a custom benchmark based on your data.\n", 10 | "\n", 11 | "We will be using OpenAI for our embedding model and LLM, but this can easily be switched out:\n", 12 | "- Various embedding functions are provided in `functions/embed.py`\n", 13 | "- LLM prompts are provided in `functions/llm.py`\n", 14 | "\n", 15 | "NOTE: When switching out embedding models, you will need to make a new collection for your new embeddings. Then, embed the same documents and queries with the embedding model of your choice. \n", 16 | "\n", 17 | "Use the same golden dataset of queries when comparing embedding models on the same data.\n", 18 | "\n", 19 | "Cells that should be modified when switching out embedding models are labeled as **[Modify]**" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "## 1. 
Setup" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "### 1.1 Install & Import\n", 34 | "\n", 35 | "Install the necessary packages." 36 | ] 37 | }, 38 | { 39 | "cell_type": "code", 40 | "execution_count": null, 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "%pip install -r requirements.txt" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "metadata": {}, 50 | "source": [ 51 | "Import modules." 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "%load_ext autoreload\n", 61 | "%autoreload 2\n", 62 | "\n", 63 | "import chromadb\n", 64 | "import pandas as pd\n", 65 | "import numpy as np\n", 66 | "import json\n", 67 | "import os\n", 68 | "from pathlib import Path\n", 69 | "from datetime import datetime\n", 70 | "from dotenv import load_dotenv\n", 71 | "from openai import OpenAI as OpenAIClient\n", 72 | "from anthropic import Anthropic as AnthropicClient\n", 73 | "from functions.llm import *\n", 74 | "from functions.embed import *\n", 75 | "from functions.chroma import *\n", 76 | "from functions.evaluate import *\n", 77 | "from functions.visualize import *\n", 78 | "\n", 79 | "load_dotenv()" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "### 1.2 Set Variables" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "We use pre-chunked [Chroma Docs](https://docs.trychroma.com/docs/overview/introduction) as an example. To run this notebook with your own data, uncomment the commented out lines and fill in.\n", 94 | "\n", 95 | "**[Modify]** `COLLECTION_NAME` when you change your embedding model" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 3, 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "with open('data/chroma_docs.json', 'r') as f:\n", 105 | "    corpus = json.load(f)\n", 106 | "\n", 107 | "context = \"This is a technical support bot for Chroma, a vector database company often used by developers for building AI applications.\"\n", 108 | "example_queries = \"\"\"\n", 109 | "    how to add to a collection\n", 110 | "    filter by metadata\n", 111 | "    retrieve embeddings when querying\n", 112 | "    how to use openai embedding function when adding to collection\n", 113 | "    \"\"\"\n", 114 | "\n", 115 | "COLLECTION_NAME = \"chroma-docs-openai-large\" # change this collection name whenever you switch embedding models\n", 116 | "\n", 117 | "# Generate a Benchmark with your own data:\n", 118 | "\n", 119 | "# with open('filepath/to/your/data.json', 'r') as f:\n", 120 | "#     corpus = json.load(f)\n", 121 | "\n", 122 | "# context = \"FILL IN WITH CONTEXT RELEVANT TO YOUR USE CASE\"\n", 123 | "# example_queries = \"FILL IN WITH EXAMPLE QUERIES\"\n", 124 | "\n", 125 | "# COLLECTION_NAME = \"YOUR COLLECTION NAME\"" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "### 1.3 Load API Keys\n", 133 | "\n", 134 | "To use Chroma Cloud, you can sign up for a Chroma Cloud account [here](https://www.trychroma.com/) and create a new database."
135 | ] 136 | }, 137 | { 138 | "cell_type": "code", 139 | "execution_count": null, 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "# Embedding Model & LLM\n", 144 | "OPENAI_API_KEY = os.getenv(\"OPENAI_API_KEY\", \"OR_ENTER_YOUR_KEY_HERE\")\n", 145 | "\n", 146 | "# If you want to use Chroma Cloud, uncomment and fill in the following:\n", 147 | "# CHROMA_TENANT = os.getenv(\"CHROMA_TENANT\", \"OR_ENTER_YOUR_TENANT_ID_HERE\")\n", 148 | "# X_CHROMA_TOKEN = os.getenv(\"X_CHROMA_TOKEN\", \"OR_ENTER_YOUR_TOKEN_HERE\")\n", 149 | "# DATABASE_NAME = os.getenv(\"DATABASE_NAME\", \"OR_ENTER_YOUR_DATABASE_NAME_HERE\")" 150 | ] 151 | }, 152 | { 153 | "cell_type": "markdown", 154 | "metadata": {}, 155 | "source": [ 156 | "### 1.3 Set Clients\n", 157 | "\n", 158 | "Initialize the clients." 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": null, 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [ 167 | "chroma_client = chromadb.Client()\n", 168 | "\n", 169 | "# If you want to use Chroma Cloud, uncomment the following line:\n", 170 | "# chroma_client = chromadb.HttpClient(\n", 171 | "# ssl=True,\n", 172 | "# host='api.trychroma.com',\n", 173 | "# tenant=CHROMA_TENANT,\n", 174 | "# database=DATABASE_NAME,\n", 175 | "# headers={\n", 176 | "# 'x-chroma-token': X_CHROMA_TOKEN\n", 177 | "# }\n", 178 | "# )\n", 179 | "\n", 180 | "openai_client = OpenAIClient(api_key=OPENAI_API_KEY)" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "## 2. Create Chroma Collection\n", 188 | "\n", 189 | "If you already have a Chroma Collection for your data, skip to **2.3**." 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "### 2.1 Load in Your Data" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 7, 202 | "metadata": {}, 203 | "outputs": [], 204 | "source": [ 205 | "corpus_ids = list(corpus.keys())\n", 206 | "corpus_documents = [corpus[key] for key in corpus_ids]" 207 | ] 208 | }, 209 | { 210 | "cell_type": "markdown", 211 | "metadata": {}, 212 | "source": [ 213 | "### 2.2 Embed Data & Add to Chroma Collection\n", 214 | "\n", 215 | "Embed your documents using an embedding model of your choice. We use OpenAI's text-embedding-3-large here, but have other functions available in `embed.py`. 
You may also define your own embedding function.\n", 216 | "\n", 217 | "We use batching and multi-threading for efficiency.\n", 218 | "\n", 219 | "**[Modify]** embedding function (`openai_embed_in_batches`) to the embedding model you wish to use" 220 | ] 221 | }, 222 | { 223 | "cell_type": "code", 224 | "execution_count": null, 225 | "metadata": {}, 226 | "outputs": [], 227 | "source": [ 228 | "corpus_embeddings = openai_embed_in_batches(\n", 229 | " openai_client=openai_client,\n", 230 | " texts=corpus_documents,\n", 231 | " model=\"text-embedding-3-large\",\n", 232 | ")\n", 233 | "\n", 234 | "corpus_collection = chroma_client.get_or_create_collection(\n", 235 | " name=COLLECTION_NAME,\n", 236 | " metadata={\"hnsw:space\": \"cosine\"}\n", 237 | ")\n", 238 | "\n", 239 | "collection_add_in_batches(\n", 240 | " collection=corpus_collection,\n", 241 | " ids=corpus_ids,\n", 242 | " texts=corpus_documents,\n", 243 | " embeddings=corpus_embeddings,\n", 244 | ")\n", 245 | "\n", 246 | "corpus = {\n", 247 | " id: {\n", 248 | " 'document': document,\n", 249 | " 'embedding': embedding\n", 250 | " }\n", 251 | " for id, document, embedding in zip(corpus_ids, corpus_documents, corpus_embeddings)\n", 252 | "}" 253 | ] 254 | }, 255 | { 256 | "cell_type": "code", 257 | "execution_count": null, 258 | "metadata": {}, 259 | "outputs": [], 260 | "source": [ 261 | "corpus_collection = chroma_client.get_collection(\n", 262 | " name=COLLECTION_NAME\n", 263 | ")\n", 264 | "\n", 265 | "corpus = get_collection_items(\n", 266 | " collection=corpus_collection\n", 267 | ")\n", 268 | "\n", 269 | "corpus_ids = [key for key in corpus.keys()]\n", 270 | "corpus_documents = [corpus[key]['document'] for key in corpus_ids]" 271 | ] 272 | }, 273 | { 274 | "cell_type": "markdown", 275 | "metadata": {}, 276 | "source": [ 277 | "## 3. Filter Documents for Quality\n", 278 | "\n", 279 | "We begin by filtering our documents prior to query generation, this step ensures that we avoid generating queries from irrelevant or incomplete documents." 280 | ] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "metadata": {}, 285 | "source": [ 286 | "### 3.1 Set Criteria\n", 287 | "\n", 288 | "We use the following criteria:\n", 289 | "- `relevance` checks whether the document is relevant to the specified context\n", 290 | "- `completeness` checks for overall quality of the document\n", 291 | "\n", 292 | "You can modify the criteria as you see fit." 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": 9, 298 | "metadata": {}, 299 | "outputs": [], 300 | "source": [ 301 | "relevance = f\"The document is relevant to the following context: {context}\"\n", 302 | "completeness = \"The document is complete, meaning that it contains useful information to answer queries and does not only serve as an introduction to the main content that users may be looking for.\"\n", 303 | "\n", 304 | "criteria = [relevance, completeness]\n", 305 | "criteria_labels = [\"relevance\", \"completeness\"]" 306 | ] 307 | }, 308 | { 309 | "cell_type": "markdown", 310 | "metadata": {}, 311 | "source": [ 312 | "### 3.2 Filter Documents\n", 313 | "\n", 314 | "We filter our documents using gpt-4o-mini. Batching functions are also available in `llm.py`." 
315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": null, 320 | "metadata": {}, 321 | "outputs": [], 322 | "source": [ 323 | "filtered_document_ids = filter_documents(\n", 324 | " client=openai_client,\n", 325 | " model=\"gpt-4o-mini\",\n", 326 | " documents=corpus_documents,\n", 327 | " ids=corpus_ids,\n", 328 | " criteria=criteria,\n", 329 | " criteria_labels=criteria_labels\n", 330 | ")" 331 | ] 332 | }, 333 | { 334 | "cell_type": "code", 335 | "execution_count": 11, 336 | "metadata": {}, 337 | "outputs": [], 338 | "source": [ 339 | "passed_documents = [corpus[id]['document'] for id in filtered_document_ids]\n", 340 | "\n", 341 | "failed_document_ids = [id for id in corpus_ids if id not in filtered_document_ids]" 342 | ] 343 | }, 344 | { 345 | "cell_type": "markdown", 346 | "metadata": {}, 347 | "source": [ 348 | "### 3.3 View Results" 349 | ] 350 | }, 351 | { 352 | "cell_type": "code", 353 | "execution_count": null, 354 | "metadata": {}, 355 | "outputs": [], 356 | "source": [ 357 | "print(f\"Number of documents passed: {len(filtered_document_ids)}\")\n", 358 | "print(f\"Number of documents failed: {len(failed_document_ids)}\")\n", 359 | "print(\"-\"*80)\n", 360 | "print(\"Example of passed document:\")\n", 361 | "print(corpus[filtered_document_ids[0]]['document'])\n", 362 | "print(\"-\"*80)\n", 363 | "print(\"Example of failed document:\")\n", 364 | "print(corpus[failed_document_ids[0]]['document'])\n", 365 | "print(\"-\"*80)" 366 | ] 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "metadata": {}, 371 | "source": [ 372 | "## 4. Generate Golden Dataset\n", 373 | "\n", 374 | "Using our filtered documents, we can generate a golden dataset of queries." 375 | ] 376 | }, 377 | { 378 | "cell_type": "markdown", 379 | "metadata": {}, 380 | "source": [ 381 | "### 4.1 Create Custom Prompt\n", 382 | "\n", 383 | "We will use `context` and `example_queries` for query generation." 384 | ] 385 | }, 386 | { 387 | "cell_type": "markdown", 388 | "metadata": {}, 389 | "source": [ 390 | "### 4.2 Generate Queries\n", 391 | "\n", 392 | "Generate queries with gpt-4o. Batching functions are available in `llm.py`." 393 | ] 394 | }, 395 | { 396 | "cell_type": "code", 397 | "execution_count": null, 398 | "metadata": {}, 399 | "outputs": [], 400 | "source": [ 401 | "golden_dataset = create_golden_dataset(\n", 402 | " client=openai_client,\n", 403 | " model=\"gpt-4o\",\n", 404 | " documents=passed_documents,\n", 405 | " ids=filtered_document_ids,\n", 406 | " context=context,\n", 407 | " example_queries=example_queries\n", 408 | ")\n", 409 | "\n", 410 | "golden_dataset.head()" 411 | ] 412 | }, 413 | { 414 | "cell_type": "markdown", 415 | "metadata": {}, 416 | "source": [ 417 | "## 5. Evaluate\n", 418 | "\n", 419 | "Now that we have our golden dataset, we can run our evaluation." 
420 | ] 421 | }, 422 | { 423 | "cell_type": "markdown", 424 | "metadata": {}, 425 | "source": [ 426 | "### 5.1 Prepare Inputs" 427 | ] 428 | }, 429 | { 430 | "cell_type": "code", 431 | "execution_count": 14, 432 | "metadata": {}, 433 | "outputs": [], 434 | "source": [ 435 | "queries = golden_dataset['query'].tolist()\n", 436 | "ids = golden_dataset['id'].tolist()" 437 | ] 438 | }, 439 | { 440 | "cell_type": "markdown", 441 | "metadata": {}, 442 | "source": [ 443 | "Embed generated queries.\n", 444 | "\n", 445 | "**[Modify]** embedding function (`openai_embed_in_batches`) to the embedding model you wish to use" 446 | ] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "execution_count": null, 451 | "metadata": {}, 452 | "outputs": [], 453 | "source": [ 454 | "query_embeddings = openai_embed_in_batches(\n", 455 | " openai_client=openai_client,\n", 456 | " texts=queries,\n", 457 | " model=\"text-embedding-3-large\"\n", 458 | ")\n", 459 | "\n", 460 | "query_embeddings_lookup = {\n", 461 | " id: {\n", 462 | " \"text\": query,\n", 463 | " \"embedding\": embedding\n", 464 | " }\n", 465 | " for id, query, embedding in zip(ids, queries, query_embeddings)\n", 466 | "}" 467 | ] 468 | }, 469 | { 470 | "cell_type": "markdown", 471 | "metadata": {}, 472 | "source": [ 473 | "Create our qrels (query relevance labels) dataframe. In this case, each query and its corresponding document share the same id." 474 | ] 475 | }, 476 | { 477 | "cell_type": "code", 478 | "execution_count": 16, 479 | "metadata": {}, 480 | "outputs": [], 481 | "source": [ 482 | "qrels = pd.DataFrame(\n", 483 | " {\n", 484 | " \"query-id\": ids,\n", 485 | " \"corpus-id\": ids,\n", 486 | " \"score\": 1\n", 487 | " }\n", 488 | ")" 489 | ] 490 | }, 491 | { 492 | "cell_type": "markdown", 493 | "metadata": {}, 494 | "source": [ 495 | "### 5.2 Run Benchmark" 496 | ] 497 | }, 498 | { 499 | "cell_type": "code", 500 | "execution_count": null, 501 | "metadata": {}, 502 | "outputs": [], 503 | "source": [ 504 | "results = run_benchmark(\n", 505 | " query_embeddings_lookup=query_embeddings_lookup,\n", 506 | " collection=corpus_collection,\n", 507 | " qrels=qrels\n", 508 | ")" 509 | ] 510 | }, 511 | { 512 | "cell_type": "markdown", 513 | "metadata": {}, 514 | "source": [ 515 | "Save results.\n", 516 | "\n", 517 | "This is helpful for comparison (e.g. 
comparing different embedding models).\n", 518 | "\n", 519 | "**[Modify]** \"model\" to the model you are using" 520 | ] 521 | }, 522 | { 523 | "cell_type": "code", 524 | "execution_count": 32, 525 | "metadata": {}, 526 | "outputs": [], 527 | "source": [ 528 | "timestamp = datetime.now().strftime(\"%Y-%m-%d--%H-%M-%S\")\n", 529 | "results_to_save = {\n", 530 | " \"model\": \"text-embedding-3-large\",\n", 531 | " \"results\": results\n", 532 | "}" 533 | ] 534 | }, 535 | { 536 | "cell_type": "code", 537 | "execution_count": 33, 538 | "metadata": {}, 539 | "outputs": [], 540 | "source": [ 541 | "results_dir = Path(\"results\")\n", 542 | "\n", 543 | "with open(os.path.join(results_dir, f'{timestamp}.json'), 'w') as f:\n", 544 | " json.dump(results_to_save, f)" 545 | ] 546 | } 547 | ], 548 | "metadata": { 549 | "kernelspec": { 550 | "display_name": "Python 3", 551 | "language": "python", 552 | "name": "python3" 553 | }, 554 | "language_info": { 555 | "codemirror_mode": { 556 | "name": "ipython", 557 | "version": 3 558 | }, 559 | "file_extension": ".py", 560 | "mimetype": "text/x-python", 561 | "name": "python", 562 | "nbconvert_exporter": "python", 563 | "pygments_lexer": "ipython3", 564 | "version": "3.9.6" 565 | } 566 | }, 567 | "nbformat": 4, 568 | "nbformat_minor": 2 569 | } 570 | -------------------------------------------------------------------------------- /img/card.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/chroma-core/generative-benchmarking/6558a90d595b00ca22d0d8014155bb065cbf6a9b/img/card.png -------------------------------------------------------------------------------- /pyproject.toml: -------------------------------------------------------------------------------- 1 | [tool.poetry] 2 | name = "generative-benchmarking" 3 | version = "0.1.0" 4 | description = "Chroma Generative Benchmarking Notebooks" 5 | package-mode=false 6 | 7 | [tool.poetry.dependencies] 8 | python = "^3.9.6" 9 | chromadb = "*" 10 | pandas = "*" 11 | numpy = "*" 12 | tqdm = "*" 13 | matplotlib = "*" 14 | datasets = "*" 15 | sentence-transformers = "*" 16 | voyageai = "*" 17 | openai = "*" 18 | anthropic = "*" 19 | pytrec_eval = "*" 20 | notebook = "*" 21 | python-dotenv = "*" 22 | 23 | [build-system] 24 | requires = ["poetry-core"] 25 | build-backend = "poetry.core.masonry.api" 26 | -------------------------------------------------------------------------------- /replicate_results.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Replicate Results\n", 8 | "\n", 9 | "This notebook demonstrates how to replicate our results for:\n", 10 | "- **Generating representative benchmarks** with public datasets using the English subset of the [multilingual Wikipedia dataset](https://huggingface.co/datasets/ellamind/wikipedia-2023-11-retrieval-multilingual-queries)\n", 11 | "\n", 12 | "- **Aligning our LLM Judge** for document quality filtering with [Weights and Biases](https://wandb.ai/site/) data\n", 13 | "\n", 14 | "The rest of our results can be replicated with through replacing the dataset and embedding model." 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "## 1. Setup" 22 | ] 23 | }, 24 | { 25 | "cell_type": "markdown", 26 | "metadata": {}, 27 | "source": [ 28 | "### 1.1 Install & Import\n", 29 | "\n", 30 | "Install the necessary packages." 
31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "!pip install -r requirements.txt" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "Import modules." 47 | ] 48 | }, 49 | { 50 | "cell_type": "code", 51 | "execution_count": 11, 52 | "metadata": {}, 53 | "outputs": [ 54 | { 55 | "name": "stdout", 56 | "output_type": "stream", 57 | "text": [ 58 | "The autoreload extension is already loaded. To reload it, use:\n", 59 | " %reload_ext autoreload\n" 60 | ] 61 | } 62 | ], 63 | "source": [ 64 | "%load_ext autoreload\n", 65 | "%autoreload 2\n", 66 | "\n", 67 | "import chromadb\n", 68 | "import pandas as pd\n", 69 | "import numpy as np\n", 70 | "import datasets\n", 71 | "from openai import OpenAI as OpenAIClient\n", 72 | "from anthropic import Anthropic as AnthropicClient\n", 73 | "from functions.llm import *\n", 74 | "from functions.embed import *\n", 75 | "from functions.chroma import *\n", 76 | "from functions.evaluate import *\n", 77 | "from functions.utils import *\n", 78 | "from functions.visualize import *" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "metadata": {}, 84 | "source": [ 85 | "### 1.2 Load API Keys\n", 86 | "\n", 87 | "To use Chroma Cloud, you can sign up for a Chroma Cloud account [here](https://www.trychroma.com/) and create a new database. If you want to use local Chroma, skip this step and simply input `OPENAI_API_KEY` and `CLAUDE_API_KEY`." 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": {}, 94 | "outputs": [], 95 | "source": [ 96 | "# Chroma Cloud\n", 97 | "CHROMA_TENANT = \"YOUR CHROMA TENANT ID\"\n", 98 | "X_CHROMA_TOKEN = \"YOUR CHROMA API KEY\"\n", 99 | "DATABASE_NAME = \"YOUR CHROMA DATABASE NAME\"\n", 100 | "\n", 101 | "# Embedding Model\n", 102 | "OPENAI_API_KEY = \"YOUR OPENAI API KEY\"\n", 103 | "\n", 104 | "# LLM\n", 105 | "CLAUDE_API_KEY = \"YOUR CLAUDE API KEY\"" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "### 1.3 Set Clients\n", 113 | "\n", 114 | "Initialize the clients." 
115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | "execution_count": null, 120 | "metadata": {}, 121 | "outputs": [], 122 | "source": [ 123 | "chroma_client = chromadb.HttpClient(\n", 124 | " ssl=True,\n", 125 | " host='api.trychroma.com',\n", 126 | " tenant=CHROMA_TENANT,\n", 127 | " database=DATABASE_NAME,\n", 128 | " headers={\n", 129 | " 'x-chroma-token': X_CHROMA_TOKEN\n", 130 | " }\n", 131 | ")\n", 132 | "\n", 133 | "# If you want to use local Chroma instead, uncomment the following line:\n", 134 | "# chroma_client = chromadb.Client()\n", 135 | "\n", 136 | "openai_client = OpenAIClient(api_key=OPENAI_API_KEY)\n", 137 | "anthropic_client = AnthropicClient(api_key=CLAUDE_API_KEY)" 138 | ] 139 | }, 140 | { 141 | "cell_type": "markdown", 142 | "metadata": {}, 143 | "source": [ 144 | "### 1.4 Load Data\n", 145 | "\n", 146 | "We'll use the English subset of the [multilingual Wikipedia dataset](https://huggingface.co/datasets/ellamind/wikipedia-2023-11-retrieval-multilingual-queries).\n", 147 | "\n", 148 | "We use the `test` split for this demonstration, which contains:\n", 149 | "- 1500 queries\n", 150 | "- 1500 query-document relevance judgments (qrels)\n", 151 | "- 13500 corpus documents" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "First, we'll load the queries, documents, and query-document relevance judgments." 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": null, 164 | "metadata": {}, 165 | "outputs": [], 166 | "source": [ 167 | "wiki_queries = datasets.load_dataset(\"ellamind/wikipedia-2023-11-retrieval-multilingual-queries\", \"en\")[\"test\"].to_pandas()\n", 168 | "wiki_corpus = datasets.load_dataset(\"ellamind/wikipedia-2023-11-retrieval-multilingual-corpus\", \"en\")[\"test\"].to_pandas()\n", 169 | "wiki_qrels = datasets.load_dataset(\"ellamind/wikipedia-2023-11-retrieval-multilingual-qrels\", \"en\")[\"test\"].to_pandas()" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": {}, 175 | "source": [ 176 | "For this specific dataset, the query-document relevance judgments include distractors, indicated by a score of 0.5, and target matches, indicated by a score of 1.0.\n", 177 | "\n", 178 | "We'll filter the query-document relevance judgments to only include target matches. Then, we'll combine the queries, documents, and query-document relevance judgments into a single dataframe for convenience." 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": null, 184 | "metadata": {}, 185 | "outputs": [], 186 | "source": [ 187 | "wiki_qrels = wiki_qrels[wiki_qrels[\"score\"] == 1.0]\n", 188 | "\n", 189 | "wiki_qrels = combined_datasets_dataframes(wiki_queries, wiki_corpus, wiki_qrels)" 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "## 2. 
Embed Corpus & Store in Chroma" 197 | ] 198 | }, 199 | { 200 | "cell_type": "markdown", 201 | "metadata": {}, 202 | "source": [ 203 | "### 2.1 Embed" 204 | ] 205 | }, 206 | { 207 | "cell_type": "code", 208 | "execution_count": null, 209 | "metadata": {}, 210 | "outputs": [], 211 | "source": [ 212 | "wiki_corpus_ids = wiki_corpus[\"_id\"].tolist()\n", 213 | "wiki_corpus_texts = wiki_corpus[\"text\"].tolist()\n", 214 | "\n", 215 | "wiki_corpus_embeddings = openai_embed_in_batches(\n", 216 | " openai_client=openai_client, \n", 217 | " texts=wiki_corpus_texts, \n", 218 | " model=\"text-embedding-3-small\"\n", 219 | ")\n", 220 | "\n", 221 | "wiki_corpus_lookup = {\n", 222 | " id: {\n", 223 | " \"text\": text,\n", 224 | " \"embedding\": embedding\n", 225 | " } for id, text, embedding in zip(wiki_corpus_ids, wiki_corpus_texts, wiki_corpus_embeddings)\n", 226 | "}" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "### 2.2 Create & Add to Chroma Collection" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": null, 239 | "metadata": {}, 240 | "outputs": [], 241 | "source": [ 242 | "wiki_collection = chroma_client.get_or_create_collection(\n", 243 | " name=\"wiki-text-embedding-3-small\",\n", 244 | " metadata={\"hnsw:space\": \"cosine\"}\n", 245 | ")\n", 246 | "\n", 247 | "collection_add_in_batches(\n", 248 | " collection=wiki_collection, \n", 249 | " ids=wiki_corpus_ids, \n", 250 | " texts=wiki_corpus_texts, \n", 251 | " embeddings=wiki_corpus_embeddings\n", 252 | ")" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "## 3. Naive Query Generation\n", 260 | "\n", 261 | "We will demonstrate that LLMs have memorized a substantial portion of public benchmarks, which limits their ability to reliably generate unseen queries." 262 | ] 263 | }, 264 | { 265 | "cell_type": "markdown", 266 | "metadata": {}, 267 | "source": [ 268 | "### 3.1 Generate Queries\n", 269 | "\n", 270 | "Generate 1500 queries, only including the document as context.\n", 271 | "\n", 272 | "We batch the LLM calls for efficiency\n", 273 | "- ids are converted to align with Anthropic's batch id formatting\n", 274 | "- batch processing status can be viewed through [Anthropic's Console](https://console.anthropic.com/workspaces/default/batches)" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": null, 280 | "metadata": {}, 281 | "outputs": [], 282 | "source": [ 283 | "corpus_ids_qrels = wiki_qrels[\"corpus-id\"].tolist()\n", 284 | "corpus_texts_qrels = wiki_qrels[\"corpus-text\"].tolist()\n", 285 | "\n", 286 | "ids_for_batching = [clean_id_for_batching(id) for id in corpus_ids_qrels]\n", 287 | "\n", 288 | "naive_queries_batch_id = create_naive_query_batch(\n", 289 | " client=anthropic_client,\n", 290 | " model=\"claude-3-5-sonnet-20241022\",\n", 291 | " documents=corpus_texts_qrels,\n", 292 | " ids=ids_for_batching\n", 293 | ")" 294 | ] 295 | }, 296 | { 297 | "cell_type": "markdown", 298 | "metadata": {}, 299 | "source": [ 300 | "Retrieve the generated queries once the batch is complete and merge with `wiki_qrels`." 
301 | ] 302 | }, 303 | { 304 | "cell_type": "code", 305 | "execution_count": null, 306 | "metadata": {}, 307 | "outputs": [], 308 | "source": [ 309 | "naive_queries_df = retrieve_batch(\n", 310 | " client=anthropic_client,\n", 311 | " batch_id=naive_queries_batch_id\n", 312 | ")\n", 313 | "\n", 314 | "naive_queries_df[\"id\"] = naive_queries_df[\"id\"].apply(revert_id_from_batching)\n", 315 | "\n", 316 | "wiki_qrels = wiki_qrels.merge(naive_queries_df, left_on=\"corpus-id\", right_on=\"id\", how=\"left\")\n", 317 | "wiki_qrels.rename(columns={\"query\": \"naively-generated-query\"}, inplace=True)\n", 318 | "wiki_qrels.drop(columns=[\"id\"], inplace=True)\n", 319 | "wiki_qrels.head()" 320 | ] 321 | }, 322 | { 323 | "cell_type": "markdown", 324 | "metadata": {}, 325 | "source": [ 326 | "### 3.2 Compare with Ground Truth Queries\n", 327 | "\n", 328 | "Embed the ground truth queries and generated queries." 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": null, 334 | "metadata": {}, 335 | "outputs": [], 336 | "source": [ 337 | "ground_truth_queries = wiki_qrels[\"query-text\"].tolist()\n", 338 | "query_ids = wiki_qrels[\"query-id\"].tolist()\n", 339 | "naive_queries = wiki_qrels[\"naively-generated-query\"].tolist()\n", 340 | "\n", 341 | "ground_truth_query_embeddings = openai_embed_in_batches(\n", 342 | " openai_client=openai_client, \n", 343 | " texts=ground_truth_queries, \n", 344 | " model=\"text-embedding-3-small\"\n", 345 | ")\n", 346 | "\n", 347 | "naive_query_embeddings = openai_embed_in_batches(\n", 348 | " openai_client=openai_client, \n", 349 | " texts=naive_queries, \n", 350 | " model=\"text-embedding-3-small\"\n", 351 | ")\n", 352 | "\n", 353 | "ground_truth_query_lookup = {\n", 354 | " id: {\n", 355 | " \"text\": text,\n", 356 | " \"embedding\": embedding\n", 357 | " } for id, text, embedding in zip(query_ids, ground_truth_queries, ground_truth_query_embeddings)\n", 358 | "}\n", 359 | "\n", 360 | "naive_query_lookup = {\n", 361 | " id: {\n", 362 | " \"text\": text,\n", 363 | " \"embedding\": embedding\n", 364 | " } for id, text, embedding in zip(query_ids, naive_queries, naive_query_embeddings)\n", 365 | "}" 366 | ] 367 | }, 368 | { 369 | "cell_type": "markdown", 370 | "metadata": {}, 371 | "source": [ 372 | "Here, we plot the cosine similarity scores between each ground truth query and its corresponding generated query:" 373 | ] 374 | }, 375 | { 376 | "cell_type": "code", 377 | "execution_count": null, 378 | "metadata": {}, 379 | "outputs": [], 380 | "source": [ 381 | "naive_query_comparison = score_query_query(\n", 382 | " qrels=wiki_qrels, \n", 383 | " query_embeddings_1=ground_truth_query_lookup, \n", 384 | " query_embeddings_2=naive_query_lookup,\n", 385 | " column_name=\"naive-query-score\"\n", 386 | ")\n", 387 | "\n", 388 | "plot_single_distribution(\n", 389 | " df=naive_query_comparison, \n", 390 | " column=\"naive-query-score\", \n", 391 | " title=\"Ground Truth vs Naively Generated Queries\", \n", 392 | " xlabel=\"Cosine Similarity\", \n", 393 | " ylabel=\"Normalized Frequency\"\n", 394 | ")" 395 | ] 396 | }, 397 | { 398 | "cell_type": "markdown", 399 | "metadata": {}, 400 | "source": [ 401 | "With further investigation, we can see that identical queries have been generated:" 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": null, 407 | "metadata": {}, 408 | "outputs": [], 409 | "source": [ 410 | "naive_query_comparison.sort_values(by=\"naive-query-score\", ascending=False, inplace=True)\n", 411 | "\n", 412 | "for i, 
row in naive_query_comparison.head(10).iterrows():\n", 413 | " print(f\"Score: {row['naive-query-score']:.4f}\")\n", 414 | " print(f\"Original Query: {row['query-text']}\")\n", 415 | " print(f\"Generated Query: {row['naively-generated-query']}\")\n", 416 | " print(\"-\" * 80)" 417 | ] 418 | }, 419 | { 420 | "cell_type": "markdown", 421 | "metadata": {}, 422 | "source": [ 423 | "## 4. Distinct Query Generation\n", 424 | "\n", 425 | "Since models have memorized these public benchmarks, we will generate unseen queries by explicitly prompting the model to generate a distinct query. \n", 426 | "\n", 427 | "Then, we will demonstrate that these newly generated distinct queries are also representative of the ground truth dataset." 428 | ] 429 | }, 430 | { 431 | "cell_type": "markdown", 432 | "metadata": {}, 433 | "source": [ 434 | "### 4.1 Generate Queries\n", 435 | "\n", 436 | "We generate 1500 queries, now including both the ground truth query and the corpus document as context." 437 | ] 438 | }, 439 | { 440 | "cell_type": "code", 441 | "execution_count": null, 442 | "metadata": {}, 443 | "outputs": [], 444 | "source": [ 445 | "distinct_batch_id = create_distinct_query_batch(\n", 446 | " client=anthropic_client,\n", 447 | " model=\"claude-3-5-sonnet-20241022\",\n", 448 | " documents=corpus_texts_qrels,\n", 449 | " ids=ids_for_batching,\n", 450 | " queries=ground_truth_queries\n", 451 | ")" 452 | ] 453 | }, 454 | { 455 | "cell_type": "code", 456 | "execution_count": null, 457 | "metadata": {}, 458 | "outputs": [], 459 | "source": [ 460 | "distinct_queries_df = retrieve_batch(\n", 461 | " client=anthropic_client,\n", 462 | " batch_id=distinct_batch_id\n", 463 | ")\n", 464 | "distinct_queries_df[\"id\"] = distinct_queries_df[\"id\"].apply(revert_id_from_batching)\n", 465 | "\n", 466 | "wiki_qrels = wiki_qrels.merge(distinct_queries_df, left_on=\"corpus-id\", right_on=\"id\", how=\"left\")\n", 467 | "wiki_qrels.rename(columns={\"query\": \"distinct-generated-query\"}, inplace=True)\n", 468 | "wiki_qrels.drop(columns=[\"id\"], inplace=True)" 469 | ] 470 | }, 471 | { 472 | "cell_type": "markdown", 473 | "metadata": {}, 474 | "source": [ 475 | "### 4.2 Compare with Ground Truth Queries" 476 | ] 477 | }, 478 | { 479 | "cell_type": "code", 480 | "execution_count": null, 481 | "metadata": {}, 482 | "outputs": [], 483 | "source": [ 484 | "distinct_queries = wiki_qrels[\"distinct-generated-query\"].tolist()\n", 485 | "\n", 486 | "distinct_query_embeddings = openai_embed_in_batches(\n", 487 | " openai_client=openai_client, \n", 488 | " texts=distinct_queries, \n", 489 | " model=\"text-embedding-3-small\"\n", 490 | ")\n", 491 | "\n", 492 | "distinct_query_lookup = {\n", 493 | " id: {\n", 494 | " \"text\": text,\n", 495 | " \"embedding\": embedding\n", 496 | " } for id, text, embedding in zip(query_ids, distinct_queries, distinct_query_embeddings)\n", 497 | "}" 498 | ] 499 | }, 500 | { 501 | "cell_type": "markdown", 502 | "metadata": {}, 503 | "source": [ 504 | "We plot the cosine similarity scores between each ground truth query and its corresponding generated (distinct) query, and compare with our previous plot:" 505 | ] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "execution_count": null, 510 | "metadata": {}, 511 | "outputs": [], 512 | "source": [ 513 | "distinct_query_comparison = score_query_query(\n", 514 | " qrels=wiki_qrels, \n", 515 | " query_embeddings_1=ground_truth_query_lookup, \n", 516 | " query_embeddings_2=distinct_query_lookup,\n", 517 | " column_name=\"distinct-query-score\"\n", 
519 | "\n", 520 | "plot_overlaid_distribution(\n", 521 | " df_1=naive_query_comparison, \n", 522 | " df_2=distinct_query_comparison, \n", 523 | " column_1=\"naive-query-score\", \n", 524 | " column_2=\"distinct-query-score\", \n", 525 | " title=\"Distinct vs Ground Truth Queries\", \n", 526 | " xlabel=\"Cosine Similarity\", \n", 527 | " ylabel=\"Normalized Frequency\"\n", 528 | ")" 529 | ] 530 | }, 531 | { 532 | "cell_type": "markdown", 533 | "metadata": {}, 534 | "source": [ 535 | "We can look at the most similar query-query scores and see that no identical queries have been generated:" 536 | ] 537 | }, 538 | { 539 | "cell_type": "code", 540 | "execution_count": null, 541 | "metadata": {}, 542 | "outputs": [], 543 | "source": [ 544 | "distinct_query_comparison.sort_values(by=\"distinct-query-score\", ascending=False, inplace=True)\n", 545 | "\n", 546 | "for i, row in distinct_query_comparison.head(10).iterrows():\n", 547 | " print(f\"Score: {row['distinct-query-score']:.4f}\")\n", 548 | " print(f\"Original Query: {row['query-text']}\")\n", 549 | " print(f\"Generated Query: {row['distinct-generated-query']}\")\n", 550 | " print(\"-\" * 80)" 551 | ] 552 | }, 553 | { 554 | "cell_type": "markdown", 555 | "metadata": {}, 556 | "source": [ 557 | "Compare Metrics:" 558 | ] 559 | }, 560 | { 561 | "cell_type": "code", 562 | "execution_count": null, 563 | "metadata": {}, 564 | "outputs": [], 565 | "source": [ 566 | "wiki_metrics = evaluate_and_visualize(\n", 567 | " ground_truth_query_dict=ground_truth_query_lookup,\n", 568 | " generated_query_dict=distinct_query_lookup,\n", 569 | " corpus_embeddings_dict=wiki_corpus_lookup,\n", 570 | " qrels=wiki_qrels,\n", 571 | " collection=wiki_collection,\n", 572 | " dataset_name=\"Wikipedia (English)\",\n", 573 | " model_name=\"text-embedding-3-small\"\n", 574 | ")" 575 | ] 576 | }, 577 | { 578 | "cell_type": "markdown", 579 | "metadata": {}, 580 | "source": [ 581 | "## 5. Align LLM Judge\n", 582 | "\n", 583 | "We used labeled documents from [wandbot](https://github.com/wandb/wandbot), a technical support bot for Weights & Biases' AI developer tools.\n", 584 | "\n", 585 | "These documents were manually labeled `true` or `false` based on whether they are good for generating relevant queires from." 586 | ] 587 | }, 588 | { 589 | "cell_type": "markdown", 590 | "metadata": {}, 591 | "source": [ 592 | "### 5.1 Load in Data" 593 | ] 594 | }, 595 | { 596 | "cell_type": "code", 597 | "execution_count": 14, 598 | "metadata": {}, 599 | "outputs": [], 600 | "source": [ 601 | "with open(\"data/wandb_human_labels.json\", \"r\") as f:\n", 602 | " human_labeled_documents = json.load(f)\n", 603 | "\n", 604 | "with open(\"data/wandb_docs.json\", \"r\") as f:\n", 605 | " all_documents = json.load(f)" 606 | ] 607 | }, 608 | { 609 | "cell_type": "code", 610 | "execution_count": 15, 611 | "metadata": {}, 612 | "outputs": [], 613 | "source": [ 614 | "labeled_ids = list(human_labeled_documents.keys())\n", 615 | "labeled_documents = [all_documents[id] for id in labeled_ids]" 616 | ] 617 | }, 618 | { 619 | "cell_type": "markdown", 620 | "metadata": {}, 621 | "source": [ 622 | "### 5.2 Set Baseline Criteria" 623 | ] 624 | }, 625 | { 626 | "cell_type": "code", 627 | "execution_count": 29, 628 | "metadata": {}, 629 | "outputs": [], 630 | "source": [ 631 | "relevance_v1 = \"The document is relevant and contains information that users would search for in the context of a question-answering bot for Weights & Biases. 
It should address topics that are useful to machine learning practitioners.\"\n", 632 | "\n", 633 | "completeness_v1 = \"The document is complete, meaning it provides comprehensive information to answer queries rather than merely serving as an introduction.\"\n", 634 | "\n", 635 | "clarity_v1 = \"The document contains clear ideas and is comprehensible.\"\n", 636 | "\n", 637 | "criteria_v1 = [relevance_v1, completeness_v1, clarity_v1]\n", 638 | "criteria_labels_v1 = [\"relevant\", \"complete\", \"clear\"]" 639 | ] 640 | }, 641 | { 642 | "cell_type": "markdown", 643 | "metadata": {}, 644 | "source": [ 645 | "### 5.3 Get LLM Labels" 646 | ] 647 | }, 648 | { 649 | "cell_type": "code", 650 | "execution_count": null, 651 | "metadata": {}, 652 | "outputs": [], 653 | "source": [ 654 | "batch_v1_id = create_document_filter_batch(\n", 655 | " client=anthropic_client,\n", 656 | " documents=labeled_documents,\n", 657 | " ids=labeled_ids,\n", 658 | " criteria=criteria_v1,\n", 659 | " criteria_labels=criteria_labels_v1\n", 660 | ")" 661 | ] 662 | }, 663 | { 664 | "cell_type": "code", 665 | "execution_count": null, 666 | "metadata": {}, 667 | "outputs": [], 668 | "source": [ 669 | "batch_v1 = retrieve_document_filter_batch(\n", 670 | " client=anthropic_client,\n", 671 | " batch_id=batch_v1_id\n", 672 | ")" 673 | ] 674 | }, 675 | { 676 | "cell_type": "markdown", 677 | "metadata": {}, 678 | "source": [ 679 | "We take our LLM-labeled data and compare with our manual labling.\n", 680 | "\n", 681 | "`criteria_threshold` indicates the number of criterion that must be met in order for a document to be considered \"good quality\"." 682 | ] 683 | }, 684 | { 685 | "cell_type": "code", 686 | "execution_count": 30, 687 | "metadata": {}, 688 | "outputs": [ 689 | { 690 | "name": "stdout", 691 | "output_type": "stream", 692 | "text": [ 693 | "{'relevant': 0.652, 'complete': 0.528, 'clear': 0.508}\n", 694 | "Documents aligned with Human Judgement: 114, 45.6%\n", 695 | "\n", 696 | "\n", 697 | "Number of documents meeting threshold: 163\n", 698 | "Number of documents 100% aligned: 23\n", 699 | "Number of documents 0% aligned: 14\n", 700 | "Aligned: ## Challenges and Limitations of Q-learning\n", 701 | "* Slow Convergence and High Computational Requirements - Q-learning can take significant time to converge, especially in complex environments. It may require substantial computational resources, making it less feasible for real-time applications.\n", 702 | "* Curse of Dimensionality - The performance of Q-learning can deteriorate in high-dimensional state and action spaces, leading to increased computational complexity and reduced efficiency.\n", 703 | "* Lack of Generalisation - Q-learning tends to focus on specific states and actions, potentially leading to difficulties in generalizing learned policies to new, unseen environments and increased susceptibility to overfitting.\n", 704 | "* Exploration vs. Exploitation Trade-off - Striking the right balance between exploration (trying new actions) and exploitation (choosing the best-known actions) can be tricky. Over-exploration can lead to inefficiency, while under-exploitation can prevent discovering better strategies.\n", 705 | "* Handling Continuous State and Action Spaces - Q-learning is primarily designed for discrete state and action spaces. 
Adapting it for continuous spaces involves complex discretization techniques and can lead to suboptimal results.\n", 706 | "* Sensitivity to Hyperparameters - Q-learning's performance can depend highly on the choice of hyperparameters, such as the learning rate and discount factor. Finding the right values can be challenging.\n", 707 | "* Lack of Prior Knowledge - Q-learning doesn't incorporate prior knowledge about the environment, making it less efficient when some level of pre-existing understanding is available.\n", 708 | "* Non-Markovian Environments - Q-learning assumes that the environment follows the Markov property, meaning the future depends only on the current state and action. In non-Markovian environments, it may not perform optimally.\n", 709 | "\n", 710 | "\n", 711 | "Aligned: \" \n", 712 | "\n", 713 | "# The Woven Planet (Lyft) Level 5 Dataset \n", 714 | "\n", 715 | "Description: In this article, we'll be exploring the Woven Planet (Lyft) Level 5 dataset. We'll look at what it is as well as the autonomous vehicle tasks and techniques it supports \n", 716 | "\n", 717 | "Body: \n", 718 | "\n", 719 | "# What Is The Woven Planet Level 5 Dataset? \n", 720 | "\n", 721 | "[The ](https://level-5.global/) is the largest [autonomous-driving dataset](https://wandb.ai/av-datasets/av-dataset/reports/The-Many-Datasets-of-Autonomous-Driving--VmlldzoyNjU1OTg0) for motion planning and prediction tasks. It contains over 1,000 hours of data collected by 20 self-driving cars and is annotated with semantic maps and high-definition aerial views. \n", 722 | "\n", 723 | "There are 15,242 labeled elements in the dataset for [autonomous driving-related machine learning tasks](https://wandb.ai/av-team/av-tasks/reports/The-ML-Tasks-Of-Autonomous-Vehicle-Development--VmlldzoyNTc2Nzkx), such as motion forecasting, motion planning, and simulation. \n", 724 | "\n", 725 | "## What We're Covering About The Level 5 Dataset \n", 726 | "\n", 727 | "# Recommended Reading \n", 728 | "\n", 729 | "\"\n", 730 | "\n", 731 | "\n", 732 | "Aligned: ## Dedicated Weights & Biases Hook\n", 733 | "### Checkpointing\n", 734 | "MMDetection uses MMCV's CheckpointHook to periodically save model checkpoints. The period is determined by checkpoint_config.interval. However, these checkpoints are saved locally and might get overwritten by a new experiment. \n", 735 | "\n", 736 | "You can reliably store these checkpoints as W&B Artifacts by using the log_checkpoint=True argument. By saving them as W&B Artifacts, you can easily transfer the checkpoints across machines, keep different model ideas separately, and compare them across variants. \n", 737 | "\n", 738 | "Here are a few things worth noting: \n", 739 | "\n", 740 | "* There are 3 versions of checkpoints in the UI as shown above. That's because the model was trained for 12 epochs with `checkpoint_config.interval=4`.\n", 741 | "* Each version has an alias `epoch_x` where `x` is the current epoch.\n", 742 | "* The last checkpoint is marked with the alias `latest`. \n", 743 | "\n", 744 | "We recommend you set the checkpoint interval with caution to save both local and W&B storage space.\n", 745 | "\n", 746 | "\n", 747 | "Aligned: ## Transcript\n", 748 | "### Leadership lessons that Jensen has learned\n", 749 | "\n", 750 | "Lukas: \n", 751 | "\n", 752 | "> You've been running NVIDIA for quite a long time. I was curious how you feel you've changed as a leader over the decades of running the company. 
\n", 753 | "\n", 754 | "Jensen: \n", 755 | "\n", 756 | "> You know, you're almost asking the wrong person. You could ask almost anybody else around me. \n", 757 | "\n", 758 | "Lukas: \n", 759 | "\n", 760 | "> Fair enough. How has your experience changed? \n", 761 | "\n", 762 | "Jensen: \n", 763 | "\n", 764 | "> That's an easier question for me. \n", 765 | "\n", 766 | "When I was 30 years old, I didn't know anything about being CEO. I did a lot of learning on the job. There were many management techniques that were just really dumb, and I don't use them anymore. \n", 767 | "\n", 768 | "Lukas: \n", 769 | "\n", 770 | "> Like what? \n", 771 | "\n", 772 | "Jensen: \n", 773 | "\n", 774 | "> Well, alright. I'll give you a couple. \n", 775 | "\n", 776 | "Lukas: \n", 777 | "\n", 778 | "> Awesome. Thank you. \n", 779 | "\n", 780 | "Jensen:\n", 781 | "\n", 782 | "\n", 783 | "Aligned: ## About SBX Robotics\n", 784 | "\n", 785 | "Working on a computer vision problem? We can help. \n", 786 | "\n", 787 | "At [SBX Robotics](http://sbxrobotics.com/) we are experts in using synthetic data to bootstrap and improve computer vision systems. \n", 788 | "\n", 789 | "Our clients send us ~25 images from their production setting, and we generate 25,000 synthetic training samples proven to work on the original validation data. All of our datasets ship with: \n", 790 | "\n", 791 | "* Our best benchmark model trained on the synthetic data, tested on your validation data, and loaded into a Google Colab.\n", 792 | "* Code to evaluate the benchmark model, , and our with code snippets \n", 793 | "\n", 794 | "### Ready to try synthetic data for your project? \n", 795 | "\n", 796 | "[Use this link](https://www.sbxrobotics.com/?ref=wandb1#get-started) to submit 25-50 images from your production setting, or contact us at info@sbxrobotics.com. \n", 797 | "\n", 798 | ">\n", 799 | "Mention \"Weights & Biases Tables Tutorial\" for 20% off your first synthetic dataset.\n", 800 | "\n", 801 | "\n", 802 | "Not aligned: ## 🕶 What is HellaSwag?\n", 803 | "### 😎 Swag\n", 804 | "#### Adversarial Filtering\n", 805 | "The point of constructing this formula is to show, if we want an adversarial dataset, then we expect high empirical error on . In other words, in an ideal case, none of the examples generalize to another example within the dataset. \n", 806 | "\n", 807 | "Definitions for adversarial-filtered dataset: \n", 808 | "\n", 809 | "* for each instance , there is one positive instance and many negative instances where so\n", 810 | "* we filter these negative instances for each instance to such that\n", 811 | "* thus, we construct a set of assignments (list of indices)\n", 812 | "* our adversarial-filtered dataset: \n", 813 | "\n", 814 | "Great! We have a formal understanding of how this should be framed. How does it look algorithmically? \n", 815 | "\n", 816 | "We initialize and maintain set of assignments where . This is iteratively updated via Algorithm 1 (Adversarial Filtering).\n", 817 | "\n", 818 | "\n", 819 | "Not aligned: ## Performing In-Context Training\n", 820 | "### Demonstration Design\n", 821 | "#### Scoring Function\n", 822 | "The takeaway from these different methods is that there is still a lot of work to be done in creating a scoring function that mitigates sensitivity and reduces bias. In a nutshell, it isn't easy to calibrate how well in-context learning performs. This field is still very new, and standard metrics have not been established. 
\n", 823 | "\n", 824 | "Above is a table of factors that play into in-context learning. \n", 825 | "\n", 826 | "They've concluded the following findings in the pretraining stage: \n", 827 | "\n", 828 | "* domain source is more important than corpus size\n", 829 | "* corpora related to downstream tasks don't necessarily improve ICL ability\n", 830 | "* lower perplexity != better ICL\n", 831 | "* ICL emergent ability comes after some number of pretraining steps and model size \n", 832 | "\n", 833 | "During the inference stage: \n", 834 | "\n", 835 | "* input-label formatting matters\n", 836 | "* exposure of label space (what labels you use as examples)\n", 837 | "* input distribution\n", 838 | "* order of examples\n", 839 | "* examples with embeddings close to query embedding \n", 840 | "\n", 841 | "The takeaway from numerous works attempting to understand why in-context learning works is:\n", 842 | "\n", 843 | "\n", 844 | "Not aligned: ## 🦄 A Brief Overview of Stable Diffusion XL\n", 845 | "* SDXL leverages a larger UNet backbone. Three times the size, in fact. The increase in parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder.\n", 846 | "* SDXL leverages multiple novel conditioning schemes and is trained on multiple aspect ratios.\n", 847 | "* The image generation pipeline also leverages a specialized high-resolution refiner model which is used to improve the visual fidelity of samples generated by SDXL using an image-to-image diffusion technique.\n", 848 | "* The base diffusion model generates initial latent tensors of size 128x128, which can be passed through a loss to generate the high-resolution image.\n", 849 | "* The latent tensors could also be passed on to the refiner model that applies , using the same prompt. Although the base SDXL model is capable of generating stunning images with high fidelity, using the refiner model useful in many cases, especially to refine samples of low local quality such as deformed faces, eyes, lips, etc.\n", 850 | "* SDXL and the refinement model use the same autoencoder.\n", 851 | "\n", 852 | "\n", 853 | "Not aligned: ## How to Train Your Dragons Models\n", 854 | "### Zero-DCE: Zero-Reference Deep Curve Estimation for Low-Light Image Enhancement\n", 855 | "#### Non-Reference Loss Functions\n", 856 | "where and represent the horizontal and vertical gradient operations, respectively. \n", 857 | "\n", 858 | "Spatial Constancy Loss: The spatial consistency loss encourages spatial coherence of the enhanced image by preserving the difference of neighboring regions between the input image and its enhanced version. The Spatial Constancy Loss is given by: \n", 859 | "\n", 860 | "$$\n", 861 | "\\Large{L_{s p a}=\\frac{1}{K} \\sum_{i=1}^K \\sum_{j \\in \\Omega(i)}\\left(\\left|\\left(Y_i-Y_j\\right)\\right|-\\left|\\left(I_i-I_j\\right)\\right|\\right)^2}\n", 862 | "$$ \n", 863 | "\n", 864 | "where... \n", 865 | "\n", 866 | "* is the four neighboring regions (top, down, left, right) centered at the region\n", 867 | "* and denote the average intensity value of the local region in the enhanced version and input image, respectively. 
\n", 868 | "\n", 869 | "All these loss functions have been implemented as part of the [restorers.losses](https://github.com/soumik12345/restorers/tree/main/restorers/losses) API and are automatically initialized when calling [model.compile](https://www.tensorflow.org/api_docs/python/tf/keras/Model#compile).\n", 870 | "\n", 871 | "\n", 872 | "Not aligned: ## Types of Language Models\n", 873 | "### Neural Language Models\n", 874 | "#### LSTM Language Model\n", 875 | "* The input gate controls the new information to add to the memory cell. It uses a to determine which values from the current input should be added to the memory cell.\n", 876 | "* The forget gate controls the amount of old information to forget. It also uses a sigmoid function to determine this.\n", 877 | "* The memory cell stores the information from the current and previous inputs, processed by the input and forgets gates.\n", 878 | "* The output gate controls the amount of information to output as the prediction. It uses a sigmoid function to determine this, followed by a tanh activation function to squish the values between -1 and 1. \n", 879 | "\n", 880 | "This allows LSTM to effectively handle long sequences of data and make predictions based on the context of the entire sequence. \n", 881 | "\n", 882 | "The cell state here is a memory cell that maintains information across time steps in a sequence, and the hidden state is the output of the LSTM unit at each time step that is passed to the next time step in a sequence.\n", 883 | "\n", 884 | "\n" 885 | ] 886 | } 887 | ], 888 | "source": [ 889 | "llm_vs_human(\n", 890 | " llm_judgements=batch_v1,\n", 891 | " human_judgements=human_labeled_documents,\n", 892 | " documents_mapping=all_documents,\n", 893 | " criteria_labels=criteria_labels_v1,\n", 894 | " criteria_threshold=2\n", 895 | ")" 896 | ] 897 | }, 898 | { 899 | "cell_type": "markdown", 900 | "metadata": {}, 901 | "source": [ 902 | "### 5.4 Iterate" 903 | ] 904 | }, 905 | { 906 | "cell_type": "markdown", 907 | "metadata": {}, 908 | "source": [ 909 | "Based on the baseline results, we iterate on our criteria." 
910 | ] 911 | }, 912 | { 913 | "cell_type": "code", 914 | "execution_count": 22, 915 | "metadata": {}, 916 | "outputs": [], 917 | "source": [ 918 | "relevance_v2 = \"\"\"\n", 919 | " The document is relevant and something that users would search for considering the following context: \n", 920 | " We are building a question-answering bot designed specifically for Weights & Biases, an AI developer for training, fine-tuning, and managing models.\n", 921 | " Any information that would be useful to a user working in machine learning is considered as relevant.\n", 922 | " \"\"\"\n", 923 | "\n", 924 | "completeness_v2 = \"The document is complete, meaning that it contains useful information to answer queries and does not only serve as an introduction to the main content that users may be looking for.\"\n", 925 | "\n", 926 | "intent_v2 = \"The document would be relevant in the use case of a user working in machine learning, who may be seeking help or learn more about Weights & Biases or machine learning in general.\"\n", 927 | "\n", 928 | "criteria_v2 = [relevance_v2, completeness_v2, intent_v2]\n", 929 | "criteria_labels_v2 = [\"relevant\", \"complete\", \"intent\"]" 930 | ] 931 | }, 932 | { 933 | "cell_type": "code", 934 | "execution_count": null, 935 | "metadata": {}, 936 | "outputs": [], 937 | "source": [ 938 | "batch_v2_id = create_document_filter_batch(\n", 939 | " client=anthropic_client,\n", 940 | " documents=labeled_documents,\n", 941 | " ids=labeled_ids,\n", 942 | " criteria=criteria_v2,\n", 943 | " criteria_labels=criteria_labels_v2\n", 944 | ")" 945 | ] 946 | }, 947 | { 948 | "cell_type": "code", 949 | "execution_count": 24, 950 | "metadata": {}, 951 | "outputs": [], 952 | "source": [ 953 | "batch_v2 = retrieve_document_filter_batch(\n", 954 | " client=anthropic_client,\n", 955 | " batch_id=batch_v2_id\n", 956 | ")" 957 | ] 958 | }, 959 | { 960 | "cell_type": "code", 961 | "execution_count": 28, 962 | "metadata": {}, 963 | "outputs": [ 964 | { 965 | "name": "stdout", 966 | "output_type": "stream", 967 | "text": [ 968 | "{'relevant': 0.64, 'complete': 0.656, 'intent': 0.568}\n", 969 | "Documents aligned with Human Judgement: 188, 75.2%\n", 970 | "\n", 971 | "\n", 972 | "Number of documents meeting threshold: 159\n", 973 | "Number of documents 100% aligned: 65\n", 974 | "Number of documents 0% aligned: 8\n", 975 | "Aligned: ## Data Preparation and Annotation\n", 976 | "### Annotating a Summarization Dataset for Fine-Tuning\n", 977 | "\n", 978 | "There is a given format required by ChatGPT in order to fine-tune the model on. This format includes 3 sections: \n", 979 | "\n", 980 | "* System: This is the prompt that you will pass to ChatGPT. In our case, the prompt would be “GPT is a great and to-the-point dialogue summarization tool.”\n", 981 | "* User: This is the question asked to the model. In our case, it would be the text that we are required to summarize.\n", 982 | "* Assistant: This is the answer that our model would return. In this case, it would be a brief summary of the text.\n", 983 | "\n", 984 | "\n", 985 | "Aligned: ## Applications of Q-learning\n", 986 | "* The agent can be the program controlling a robot. In this scenario, the agent observes the environment (the real world) through sensors like cameras and touch sensors and acts accordingly by sending signals to the motors. 
It gets positive rewards for efficient navigation towards the goal and negative rewards for detours or time wastage.\n", 987 | "* The agent can be the program playing a board game like Go or chess.\n", 988 | "* The agent can be the program controlling Ms. Pac-Man where the environment is a simulation of the Atari game, actions are the nine joystick positions, and the rewards are the game points.\n", 989 | "* The agent can observe stock market prices and take action on whether to buy or sell.\n", 990 | "* The agent can be a smart thermostat that learns to understand human needs, getting positive rewards whenever it is close to the target temperature and saves energy, and getting negative rewards when humans need to tweak the temperature.\n", 991 | "* The agent can be a program solving a maze where it gets negative rewards for every time step, so it has to reach the exit as soon as possible.\n", 992 | "* There are various other tasks where it is well suited, such as driving cars, recommender systems, or placing ads on a web page.\n", 993 | "\n", 994 | "\n", 995 | "Aligned: ## Challenges and Limitations of Q-learning\n", 996 | "* Slow Convergence and High Computational Requirements - Q-learning can take significant time to converge, especially in complex environments. It may require substantial computational resources, making it less feasible for real-time applications.\n", 997 | "* Curse of Dimensionality - The performance of Q-learning can deteriorate in high-dimensional state and action spaces, leading to increased computational complexity and reduced efficiency.\n", 998 | "* Lack of Generalisation - Q-learning tends to focus on specific states and actions, potentially leading to difficulties in generalizing learned policies to new, unseen environments and increased susceptibility to overfitting.\n", 999 | "* Exploration vs. Exploitation Trade-off - Striking the right balance between exploration (trying new actions) and exploitation (choosing the best-known actions) can be tricky. Over-exploration can lead to inefficiency, while under-exploitation can prevent discovering better strategies.\n", 1000 | "* Handling Continuous State and Action Spaces - Q-learning is primarily designed for discrete state and action spaces. Adapting it for continuous spaces involves complex discretization techniques and can lead to suboptimal results.\n", 1001 | "* Sensitivity to Hyperparameters - Q-learning's performance can depend highly on the choice of hyperparameters, such as the learning rate and discount factor. Finding the right values can be challenging.\n", 1002 | "* Lack of Prior Knowledge - Q-learning doesn't incorporate prior knowledge about the environment, making it less efficient when some level of pre-existing understanding is available.\n", 1003 | "* Non-Markovian Environments - Q-learning assumes that the environment follows the Markov property, meaning the future depends only on the current state and action. In non-Markovian environments, it may not perform optimally.\n", 1004 | "\n", 1005 | "\n", 1006 | "Aligned: ## Fundamentals of Neural Networks\n", 1007 | "### 1. Basic Neural Network Structure\n", 1008 | "#### Hidden Layers and Neurons per Hidden Layers\n", 1009 | "* The number of hidden layers is highly dependent on the problem and the architecture of your neural network. You’re essentially trying to Goldilocks your way into the perfect neural network architecture – not too big, not too small, just right.\n", 1010 | "* Generally, 1-5 hidden layers will serve you well for most problems. 
When working with image or speech data, you’d want your network to have dozens-hundreds of layers, not all of which might be fully connected. For these use cases, there are pre-trained models (, , ) that allow you to use large parts of their networks, and train your model on top of these networks to learn only the higher order features. In this case, your model will still have only a few layers to train.\n", 1011 | "* In general using the same number of neurons for all hidden layers will suffice. For some datasets, having a large first layer and following it up with smaller layers will lead to better performance as the first layer can learn a lot of lower-level features that can feed into a few higher order features in the subsequent layers.\n", 1012 | "\n", 1013 | "\n", 1014 | "Aligned: ## Setting up Evaluation using LlamaIndex\n", 1015 | "### Generating Questions using LlamaIndex\n", 1016 | "* Evaluating response for hallucination: Is the generated response coming from the provided context, or is it making up things?\n", 1017 | "* Relevance of the retrieved chunks: Evaluate each retrieved chunk (node) against the generated response to see if that node contains the answer to the query.\n", 1018 | "* Evaluating the answer quality: Does the query + generated response come from the provided context? \n", 1019 | "\n", 1020 | "Using DatasetGenerator is easy, and one can pass the loaded documents to the DatasetGenerator.from_documents method. Calling generate_questions_from_nodes() on the object's instance will generate N questions per chunk. The default chunk size is 512, and N is 10. You might quickly realize that it will take a long time and a lot of API calls to generate a lot of questions. Let's customize the data generation process.\n", 1021 | "\n", 1022 | "\n", 1023 | "Not aligned: ## Example App: NewsTrackr\n", 1024 | "\n", 1025 | "In this section, we'll learn how to build a web app that can extract NEWS data on a given topic (stock/index in our case) using NEWS API, then perform Aspect Based Sentiment Analysis (ABSA) to determine the sentiment of different aspects related to a stock or an index. We'll use [AI21](https://studio.ai21.com/overview) to build our model, [Streamlit](https://streamlit.io/) as our web framework, and [NewsAPI](https://newsapi.org/) to collect news articles. \n", 1026 | "\n", 1027 | "But wait, why do we need to determine the sentiment of different aspects related to a stock? More importantly, what's ABSA? \n", 1028 | "\n", 1029 | "## Problem Statement and Workflow \n", 1030 | "\n", 1031 | "> \n", 1032 | "\n", 1033 | "Let's say that a friend of mine is new to trading and needs some guidance on how to make informed trading decisions. \n", 1034 | "\n", 1035 | "The right way to start trading would involve understanding the basics of technical and fundamental analysis, which are two primary methods used to analyze the markets and make informed investment decisions. \n", 1036 | "\n", 1037 | "## ABSA Model Fine-Tune Pipeline\n", 1038 | "\n", 1039 | "\n", 1040 | "Not aligned: ## An Introduction To HuggingFace Transformers for NLP\n", 1041 | "### What Are Transformers in Machine Learning?\n", 1042 | "After converting our data into a more understandable format, the embedded data is passed into the next layer, known as the self-attention layer. \n", 1043 | "\n", 1044 | "By utilizing self-attention, a transformer is capable of detecting distant data relations and resolving vanishing gradients. 
Meaning that a given transformer model will still be able to study a given relationship between two related words even if both these words are too far away from each other in a given context. \n", 1045 | "\n", 1046 | "The self-attention process represents how relevant a specific word is in relation to its neighboring words in a given sentence. This relation is then represented as what we call an attention vector. \n", 1047 | "\n", 1048 | "There are three additional types of vectors created in the self-attention layer which are key, query, and value vectors. Each vector is then multiplied by the input vector in order to return a weighted value.\n", 1049 | "\n", 1050 | "\n", 1051 | "Not aligned: ## Human Dynamics\n", 1052 | "### Reward Design\n", 1053 | "#### Good Alignment with Scene Objects\n", 1054 | "* Distance: ensures that the simulated end-effector (hands and feet) are in contact with the desired object. If is the position of the end-effector and is the target zone of the target object (in our case, an area of a quarter on the center of the surface), this reward is calculated using:\n", 1055 | "* Alignment favors the alignment of the character and the object when in contact. If is a unit vector along the frontal axis of the pelvis, this reward is calculated using:\n", 1056 | "* Center of Mass: informs the suitability of the trajectory of the character. If is the distance between the center of mass and the end-effector on landing time and is the distance between the expected center of mass and the expected landing position, this reward is calculated using: \n", 1057 | "\n", 1058 | "> \n", 1059 | "\n", 1060 | "Thus, the total Scene Loss is given by: \n", 1061 | "\n", 1062 | "$$\n", 1063 | "\\huge r_{\\text{scene}} = w_{\\text{dist}}r_{\\text{dist}} + w_{\\text{align}}r_{\\text{align}} + w_{\\text{com}}r_{\\text{com}}\n", 1064 | "$$\n", 1065 | "\n", 1066 | "\n", 1067 | "Not aligned: ## This week in AI: Meta LLaMA 2, Meta-Transformer, StabilityAI FreeWilly\n", 1068 | "### Meta-Transformer\n", 1069 | "\n", 1070 | "Meta-Transformer, not from Meta, is a unified, multi-modal Transformer architecture! \n", 1071 | "\n", 1072 | "It's safe to say this transformer is really multi-modal, not just text and images. Their website has a great video, below, walking through their paper's method. \n", 1073 | "\n", 1074 | "The overall architecture of their Meta-[Transformer](http://wandb.ai/fully-connected/blog/transformer) consists of a data-to-sequence tokenizer layer which, itself, consists of multiple modality-specific tokenizers. The tokenized input enters a shared token space which can all be fed into the unified model. The output of this unified model is fed into task-specific models. \n", 1075 | "\n", 1076 | "They benchmarked their model across dozens of benchmarks and other models!\n", 1077 | "\n", 1078 | "\n", 1079 | "Not aligned: ## Part 2: Applications\n", 1080 | "### Converting Text into Dataframes\n", 1081 | "#### Defining the Data Structures\n", 1082 | "The RowData class represents a single row of data in the dataframe. It contains a row attribute for the values in each row and a citation attribute for the citation from the original source data. \n", 1083 | "\n", 1084 | "The Dataframe class represents a dataframe and consists of a name attribute, a list of RowData objects in the data attribute, and a list of column names in the columns attribute. It also provides a to_pandas method to convert the dataframe into a Pandas DataFrame. 
\n", 1085 | "\n", 1086 | "The Database class represents a set of tables in a database. It contains a list of Dataframe objects in the tables attribute. \n", 1087 | "\n", 1088 | "Now we can define our own extraction function as usual and see what happens.\n", 1089 | "\n", 1090 | "\n" 1091 | ] 1092 | } 1093 | ], 1094 | "source": [ 1095 | "llm_vs_human(\n", 1096 | " llm_judgements=batch_v2,\n", 1097 | " human_judgements=human_labeled_documents,\n", 1098 | " documents_mapping=all_documents,\n", 1099 | " criteria_labels=criteria_labels_v2,\n", 1100 | " criteria_threshold=2\n", 1101 | ")" 1102 | ] 1103 | }, 1104 | { 1105 | "cell_type": "markdown", 1106 | "metadata": {}, 1107 | "source": [ 1108 | "We notice an improvement in our overall LLM vs Human alignment, as well as individual criterion categories." 1109 | ] 1110 | } 1111 | ], 1112 | "metadata": { 1113 | "kernelspec": { 1114 | "display_name": "Python 3", 1115 | "language": "python", 1116 | "name": "python3" 1117 | }, 1118 | "language_info": { 1119 | "codemirror_mode": { 1120 | "name": "ipython", 1121 | "version": 3 1122 | }, 1123 | "file_extension": ".py", 1124 | "mimetype": "text/x-python", 1125 | "name": "python", 1126 | "nbconvert_exporter": "python", 1127 | "pygments_lexer": "ipython3", 1128 | "version": "3.9.6" 1129 | } 1130 | }, 1131 | "nbformat": 4, 1132 | "nbformat_minor": 2 1133 | } 1134 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | chromadb 2 | pandas 3 | numpy 4 | tqdm 5 | matplotlib 6 | datasets 7 | sentence-transformers 8 | voyageai 9 | openai 10 | anthropic 11 | pytrec_eval 12 | notebook 13 | python-dotenv 14 | -------------------------------------------------------------------------------- /results/2025-03-31--13-59-25.json: -------------------------------------------------------------------------------- 1 | {"model": "text-embedding-3-large", "results": {"NDCG": {"NDCG@1": 0.59524, "NDCG@3": 0.661, "NDCG@5": 0.67125, "NDCG@10": 0.68821}, "MAP": {"MAP@1": 0.59524, "MAP@3": 0.64286, "MAP@5": 0.64881, "MAP@10": 0.65675}, "Recall": {"Recall@1": 0.59524, "Recall@3": 0.71429, "Recall@5": 0.7381, "Recall@10": 0.78571}, "Precision": {"P@1": 0.59524, "P@3": 0.2381, "P@5": 0.14762, "P@10": 0.07857}}} -------------------------------------------------------------------------------- /results/2025-03-31--14-01-03.json: -------------------------------------------------------------------------------- 1 | {"model": "text-embedding-3-small", "results": {"NDCG": {"NDCG@1": 0.5, "NDCG@3": 0.59892, "NDCG@5": 0.60917, "NDCG@10": 0.64652}, "MAP": {"MAP@1": 0.5, "MAP@3": 0.5754, "MAP@5": 0.58135, "MAP@10": 0.59613}, "Recall": {"Recall@1": 0.5, "Recall@3": 0.66667, "Recall@5": 0.69048, "Recall@10": 0.80952}, "Precision": {"P@1": 0.5, "P@3": 0.22222, "P@5": 0.1381, "P@10": 0.08095}}} -------------------------------------------------------------------------------- /results/2025-03-31--14-08-55.json: -------------------------------------------------------------------------------- 1 | {"model": "jina-embeddings-v3", "results": {"NDCG": {"NDCG@1": 0.52381, "NDCG@3": 0.56576, "NDCG@5": 0.60469, "NDCG@10": 0.62722}, "MAP": {"MAP@1": 0.52381, "MAP@3": 0.55556, "MAP@5": 0.57698, "MAP@10": 0.58598}, "Recall": {"Recall@1": 0.52381, "Recall@3": 0.59524, "Recall@5": 0.69048, "Recall@10": 0.7619}, "Precision": {"P@1": 0.52381, "P@3": 0.19841, "P@5": 0.1381, "P@10": 0.07619}}} 
-------------------------------------------------------------------------------- /results/2025-03-31--14-10-29.json: -------------------------------------------------------------------------------- 1 | {"model": "voyage-3-large", "results": {"NDCG": {"NDCG@1": 0.64286, "NDCG@3": 0.7055, "NDCG@5": 0.71575, "NDCG@10": 0.73956}, "MAP": {"MAP@1": 0.64286, "MAP@3": 0.68651, "MAP@5": 0.69246, "MAP@10": 0.70266}, "Recall": {"Recall@1": 0.64286, "Recall@3": 0.7619, "Recall@5": 0.78571, "Recall@10": 0.85714}, "Precision": {"P@1": 0.64286, "P@3": 0.25397, "P@5": 0.15714, "P@10": 0.08571}}} --------------------------------------------------------------------------------