├── .env.example
├── README.md
├── .gitignore
└── weave_reasoning_tests.ipynb

/.env.example:
--------------------------------------------------------------------------------
1 | WANDB_API_KEY=#add your key from https://wandb.ai/authorize
2 | OPENAI_API_KEY=#add your key from https://platform.openai.com/api-keys
3 | OPENROUTER_API_KEY=#add your key from https://openrouter.ai/settings/keys
4 | ANTHROPIC_API_KEY=#add your key from https://console.anthropic.com
5 | GEMINI_API_KEY=#add your key from https://aistudio.google.com/app/apikey
6 | CEREBRAS_API_KEY=#add your key from https://cloud.cerebras.ai
7 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # o3-tests – Weave Reasoning Evaluation Suite
2 | 
3 | o3‑tests is a **Weave** space (dashboard + artifacts) that tracks how the latest Large‑Language‑Model releases perform on *really* hard reasoning tasks.
4 | It started as a small playground for the **AI Explained SimpleBench** benchmark but quickly evolved into a public evaluation hub covering:
5 | 
6 | * GPT‑4.1 family (full, mini, and nano)
7 | * `o3` and `o4‑mini`
8 | * Gemini 2.5 ✨ *(some runs failed because of RPM limits – see the results tables)*
9 | 
10 | The goal is to answer a single question: **“Which new model should I reach for when my prompt demands serious reasoning?”**
11 | 
12 | 

13 | *Weave dashboard screenshot*
14 | 

15 | 
16 | ---
17 | 
18 | ## What’s inside this repo?
19 | 
20 | | Path | Purpose |
21 | |------|---------|
22 | | `weave_reasoning_tests.ipynb` | Jupyter notebook that kicks off all evaluations, logs them to Weave and renders comparison tables/plots. |
23 | 
24 | That’s it – everything else lives in the hosted Weave space.
25 | 
26 | ## Quick start
27 | 
28 | 1. **Clone** the repo and `cd` inside it.
29 | 
30 | 2. Copy the example environment file and fill in your API keys:
31 | 
32 |    ```bash
33 |    cp .env.example .env
34 |    ```
35 | 
36 |    Open `.env` and replace the placeholder values with your provider API keys, including your Weights & Biases key from https://wandb.ai/authorize.
37 | 
38 | 3. Create a Python env with the same stack you use for your LLM work (≥ Python 3.9):
39 | 
40 |    ```bash
41 |    python -m venv .venv
42 |    source .venv/bin/activate
43 |    pip install -U weave set-env-colab-kaggle-dotenv llm-json openai anthropic google-genai  # plus whichever other model SDKs you want
44 |    ```
45 | 
46 |    (The notebook’s first cell also installs these via `uv pip`, so this step mostly matters if you skip that cell.)
47 | 
48 | 4. Launch the notebook:
49 | 
50 |    ```bash
51 |    jupyter notebook weave_reasoning_tests.ipynb
52 |    ```
53 | 
54 | 5. Run all cells.
55 |    The setup cell loads `WANDB_API_KEY` and the provider keys (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GEMINI_API_KEY`, `OPENROUTER_API_KEY`, `CEREBRAS_API_KEY`) from your `.env` via `set_env`, and every evaluation result is logged to your personal Weave project; no manual login is needed.
56 | 
57 | ## Adding another model
58 | 
59 | The notebook is intentionally tiny – the heavy lifting is done by Weave’s `Evaluation` harness, with SimpleBench supplying the question set.
60 | 
61 | 1. Append a new entry to the `models` list inside `run_evaluation()` (a minimal sketch follows below).
62 | 2. Provide the model‑specific invocation parameters / client.
63 | 3. Re‑run the evaluation cell (`await run_evaluation()`).
64 | 
65 | Once finished, the new runs will appear in the Weave comparison view automatically.
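66 | 
67 | For illustration, here is a minimal, hypothetical sketch of such an entry for an OpenAI‑compatible provider. The class name, `MY_PROVIDER_API_KEY`, base URL and model id are placeholders (swap in your provider’s real details); `SYSTEM_MESSAGE` and the lenient `json` parser are already defined in earlier notebook cells:
68 | 
69 | ```python
70 | import os
71 | 
72 | import weave
73 | from openai import OpenAI
74 | from weave import Model
75 | 
76 | 
77 | class MyProviderModel(Model):
78 |     model: str = "my-provider/my-model"  # placeholder model id
79 | 
80 |     @weave.op()
81 |     def predict(self, question: str) -> dict:
82 |         # Placeholder endpoint – any OpenAI-compatible API works the same way.
83 |         provider_client = OpenAI(
84 |             api_key=os.getenv("MY_PROVIDER_API_KEY"),
85 |             base_url="https://api.my-provider.example/v1",
86 |         )
87 |         response = provider_client.chat.completions.create(
88 |             model=self.model,
89 |             messages=[
90 |                 {"role": "system", "content": SYSTEM_MESSAGE},
91 |                 {"role": "user", "content": question},
92 |             ],
93 |         )
94 |         # The system prompt asks for bare JSON, so parse the reply directly.
95 |         return json.loads(response.choices[0].message.content.strip())
96 | ```
97 | 
98 | and the matching `models` entry:
99 | 
100 | ```python
101 | {
102 |     "model": MyProviderModel(),
103 |     "display_name": "My Provider",
104 | },
105 | ```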
106 | 
107 | ## Contributing
108 | 
109 | Help is very welcome!
110 | If you notice a regression, want to add a benchmark, or see a model missing from the list, please open an issue or PR.
111 | 
112 | 1. Fork → feature branch → PR.
113 | 2. Keep the notebook clear of personal API keys – load them from environment variables.
114 | 3. Explain your change in the PR description and, if possible, link to the Weave run that validates it.
115 | 
116 | ## Acknowledgements
117 | 
118 | * **[Weave](https://github.com/wandb/weave)** for the slick dashboarding & data‑ops layer.
119 | * **[AI Explained SimpleBench](https://github.com/simple-bench/SimpleBench)** – we sponsor the project and rely on it for the question set.
120 | 
121 | ---
122 | 
123 | Made with 🧠 + ❤️ – ping me on X [@altryne](https://x.com/altryne) if you think of another model worth tracking!
124 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 | 
6 | # C extensions
7 | *.so
8 | 
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | share/python-wheels/
24 | *.egg-info/
25 | .installed.cfg
26 | *.egg
27 | MANIFEST
28 | 
29 | # PyInstaller
30 | # Usually these files are written by a python script from a template
31 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
32 | *.manifest
33 | *.spec
34 | 
35 | # Installer logs
36 | pip-log.txt
37 | pip-delete-this-directory.txt
38 | 
39 | # Unit test / coverage reports
40 | htmlcov/
41 | .tox/
42 | .nox/
43 | .coverage
44 | .coverage.*
45 | .cache
46 | nosetests.xml
47 | coverage.xml
48 | *.cover
49 | *.py,cover
50 | .hypothesis/
51 | .pytest_cache/
52 | cover/
53 | 
54 | # Translations
55 | *.mo
56 | *.pot
57 | 
58 | # Django stuff:
59 | *.log
60 | local_settings.py
61 | db.sqlite3
62 | db.sqlite3-journal
63 | 
64 | # Flask stuff:
65 | instance/
66 | .webassets-cache
67 | 
68 | # Scrapy stuff:
69 | .scrapy
70 | 
71 | # Sphinx documentation
72 | docs/_build/
73 | 
74 | # PyBuilder
75 | .pybuilder/
76 | target/
77 | 
78 | # Jupyter Notebook
79 | .ipynb_checkpoints
80 | 
81 | # IPython
82 | profile_default/
83 | ipython_config.py
84 | 
85 | # pyenv
86 | # For a library or package, you might want to ignore these files since the code is
87 | # intended to run in multiple environments; otherwise, check them in:
88 | # .python-version
89 | 
90 | # pipenv
91 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
92 | # However, in case of collaboration, if having platform-specific dependencies or dependencies
93 | # having no cross-platform support, pipenv may install dependencies that don't work, or not
94 | # install all needed dependencies.
95 | #Pipfile.lock
96 | 
97 | # UV
98 | # Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
99 | # This is especially recommended for binary packages to ensure reproducibility, and is more
100 | # commonly ignored for libraries.
101 | #uv.lock
102 | 
103 | # poetry
104 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
105 | # This is especially recommended for binary packages to ensure reproducibility, and is more
106 | # commonly ignored for libraries.
107 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
108 | #poetry.lock
109 | 
110 | # pdm
111 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
112 | #pdm.lock
113 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
114 | # in version control.
115 | # https://pdm.fming.dev/latest/usage/project/#working-with-version-control
116 | .pdm.toml
117 | .pdm-python
118 | .pdm-build/
119 | 
120 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
121 | __pypackages__/
122 | 
123 | # Celery stuff
124 | celerybeat-schedule
125 | celerybeat.pid
126 | 
127 | # SageMath parsed files
128 | *.sage.py
129 | 
130 | # Environments
131 | .env
132 | .venv
133 | env/
134 | venv/
135 | ENV/
136 | env.bak/
137 | venv.bak/
138 | 
139 | # Spyder project settings
140 | .spyderproject
141 | .spyproject
142 | 
143 | # Rope project settings
144 | .ropeproject
145 | 
146 | # mkdocs documentation
147 | /site
148 | 
149 | # mypy
150 | .mypy_cache/
151 | .dmypy.json
152 | dmypy.json
153 | 
154 | # Pyre type checker
155 | .pyre/
156 | 
157 | # pytype static type analyzer
158 | .pytype/
159 | 
160 | # Cython debug symbols
161 | cython_debug/
162 | 
163 | # PyCharm
164 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
165 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
166 | # and can be added to the global gitignore or merged into this file.  For a more nuclear
167 | # option (not recommended) you can uncomment the following to ignore the entire idea folder.
168 | #.idea/
169 | 
170 | # Ruff stuff:
171 | .ruff_cache/
172 | 
173 | # PyPI configuration file
174 | .pypirc
175 | 
--------------------------------------------------------------------------------
/weave_reasoning_tests.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "markdown",
5 |    "metadata": {},
6 |    "source": [
7 |     "# SimpleBench evals for all models - wandb.me/4dot1evals\n",
8 |     "\n",
9 |     "\n",
10 |     "Using the SimpleBench questions to evaluate the reasoning and non-reasoning performance of the latest AI models."
11 |    ]
12 |   },
13 |   {
14 |    "cell_type": "code",
15 |    "execution_count": null,
16 |    "metadata": {},
17 |    "outputs": [],
18 |    "source": [
19 |     "# Install the packages required by the notebook.\n",
20 |     "print('⏳ Installing packages')\n",
21 |     "!uv pip install -qU weave set-env-colab-kaggle-dotenv tqdm ipywidgets requests openai google-genai google-generativeai llm-json anthropic\n",
22 |     "print('✅ Packages installed')"
23 |    ]
24 |   },
25 |   {
26 |    "cell_type": "code",
27 |    "execution_count": null,
28 |    "metadata": {},
29 |    "outputs": [],
30 |    "source": [
31 |     "from set_env import set_env\n",
32 |     "from openai import OpenAI\n",
33 |     "from google import genai\n",
34 |     "from google.genai import types\n",
35 |     "import weave\n",
36 |     "from llm_json import json\n",
37 |     "import random\n",
38 |     "import os\n",
39 |     "set_env(\"WANDB_API_KEY\")\n",
40 |     "set_env(\"OPENAI_API_KEY\")\n",
41 |     "set_env(\"GEMINI_API_KEY\")\n",
42 |     "set_env(\"OPENROUTER_API_KEY\")\n",
43 |     "set_env(\"CEREBRAS_API_KEY\")\n",
44 |     "set_env(\"ANTHROPIC_API_KEY\")\n",
45 |     "\n",
46 |     "print('✅ Environment variables set')\n",
47 |     "\n",
48 |     "# Initialize Weave and the default OpenAI client\n",
49 |     "weave_client = weave.init('o3-tests')\n",
50 |     "client = OpenAI()"
51 |    ]
52 |   },
53 |   {
54 |    "cell_type": "code",
55 |    "execution_count": null,
56 |    "metadata": {},
57 |    "outputs": [],
58 |    "source": [
59 |     "SYSTEM_MESSAGE = \"\"\"\n",
60 |     "You're given a multiple-choice trick question. The answer will not be the obvious one.\n",
61 |     "The question may contain, and is likely phrased with, many distractions. Your task is to identify the basic flaws and reason about the true answer.\n",
62 |     "It may help to think carefully about the location/physics of everything mentioned in the question.\n",
63 |     "Many of these questions may be variations of riddles you have been trained on in the past, but containing a trick that is easy for humans to identify.\n",
64 |     "\n",
65 |     "Start by rephrasing the original question, taking special notice of details that seem innocuous but can subvert its original meaning.\n",
66 |     "Then extract a few facts in the form of numbered statements as factual_statements to help you reason towards the correct answer.\n",
67 |     "From there deduce the best answer – not the answer that would make the question make sense, but the correct answer for that specific weird scenario.\n",
\n", 68 | "\n", 69 | "Your response should be strictly in the JSON format with no backticks or other formatting:\n", 70 | " {\n", 71 | " \"rewritten_question\": \"rewritten_question\",\n", 72 | " \"factual_statements\": [\"statement1\", \"statement2\", \"statement3\"],\n", 73 | " \"explanation\": \"explanation\",\n", 74 | " \"correct_answer\": \"X\"\n", 75 | " }\n", 76 | "where X is one of the letters A, B, C, D, E, or F.\n", 77 | "\"\"\"\n" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": {}, 84 | "outputs": [], 85 | "source": [ 86 | "from weave import Model, Evaluation\n", 87 | "from llm_json import json\n", 88 | "import os\n", 89 | "os.environ['WEAVE_PARALLELISM'] = '1'\n", 90 | "\n", 91 | "# Load the dataset\n", 92 | "def load_dataset():\n", 93 | " dataset = weave.ref(\"weave:///simplebench/simple_bench_public/object/competition_dataset:qNJnkgpMqoyc48GwlFCSpVypn3D8x77N7lGCBIab4XQ\").get()\n", 94 | " # Convert the data into the format we need\n", 95 | " return [{\n", 96 | " \"question\": item[\"prompt\"],\n", 97 | " \"question_id\": item[\"question_id\"],\n", 98 | " \"expected\": item[\"answer\"]\n", 99 | " } for item in dataset]\n", 100 | "\n", 101 | "class gpt4point1(Model):\n", 102 | " model: str = \"gpt-4.1-2025-04-14\"\n", 103 | " sandwich: bool = False\n", 104 | "\n", 105 | " @weave.op()\n", 106 | " def predict(self, question: str) -> dict:\n", 107 | " messages = [\n", 108 | " {\"role\": \"developer\", \"content\": SYSTEM_MESSAGE},\n", 109 | " {\"role\": \"user\", \"content\": question}\n", 110 | " ]\n", 111 | " if self.sandwich:\n", 112 | " messages.append({\n", 113 | " \"role\": \"developer\", \n", 114 | " \"content\": SYSTEM_MESSAGE\n", 115 | " })\n", 116 | " response = client.chat.completions.create(\n", 117 | " model=self.model,\n", 118 | " messages=messages\n", 119 | " )\n", 120 | " json_response = json.loads(response.choices[0].message.content.strip())\n", 121 | " return json_response\n", 122 | "\n", 123 | "# Define the model class with different reasoning efforts\n", 124 | "class o3Reasoning(Model):\n", 125 | " model: str = \"o3-mini-2025-01-31\"\n", 126 | " reasoning_effort: str # \"low\", \"medium\", or \"high\"\n", 127 | "\n", 128 | " @weave.op()\n", 129 | " def predict(self, question: str) -> dict:\n", 130 | "\n", 131 | " response = client.chat.completions.create(\n", 132 | " model=self.model,\n", 133 | " reasoning_effort=self.reasoning_effort,\n", 134 | " messages=[\n", 135 | " {\"role\": \"developer\", \"content\": SYSTEM_MESSAGE},\n", 136 | " {\"role\": \"user\", \"content\": question}\n", 137 | " ]\n", 138 | " )\n", 139 | " json_response = json.loads(response.choices[0].message.content.strip())\n", 140 | " return json_response\n", 141 | " # Define the model class with different reasoning efforts\n", 142 | "class o1Reasoning(Model):\n", 143 | " model: str = \"o1-2024-12-17\"\n", 144 | " reasoning_effort: str # \"low\", \"medium\", or \"high\"\n", 145 | "\n", 146 | " @weave.op()\n", 147 | " def predict(self, question: str) -> dict:\n", 148 | "\n", 149 | " response = client.chat.completions.create(\n", 150 | " model=self.model,\n", 151 | " reasoning_effort=self.reasoning_effort,\n", 152 | " messages=[\n", 153 | " {\"role\": \"developer\", \"content\": SYSTEM_MESSAGE},\n", 154 | " {\"role\": \"user\", \"content\": question}\n", 155 | " ]\n", 156 | " )\n", 157 | " json_response = json.loads(response.choices[0].message.content.strip())\n", 158 | " return json_response\n", 159 | "\n", 160 | "class 
161 |     "    model: str = \"gemini-2.5-pro-exp-03-25\"\n",
162 |     "    thinking_budget: int = 1000  # You can tweak this variable to any value between 0 and 24k\n",
163 |     "    thinking: bool = False\n",
164 |     "\n",
165 |     "    @weave.op()\n",
166 |     "    def predict(self, question: str) -> dict:\n",
167 |     "        from google import genai\n",
168 |     "        from google.genai import types\n",
169 |     "        MODEL_ID = self.model\n",
170 |     "        prompt = question\n",
171 |     "        budget = self.thinking_budget\n",
172 |     "\n",
173 |     "        client = genai.Client(api_key=os.getenv(\"GEMINI_API_KEY\"))\n",
174 |     "        response = client.models.generate_content(\n",
175 |     "            model=MODEL_ID,\n",
176 |     "            contents=f\"{SYSTEM_MESSAGE}\\n\\n{prompt}\",\n",
177 |     "            config=types.GenerateContentConfig(\n",
178 |     "                thinking_config=types.ThinkingConfig(\n",
179 |     "                    include_thoughts=self.thinking,\n",
180 |     "                    thinking_budget=budget\n",
181 |     "                )\n",
182 |     "            )\n",
183 |     "        )\n",
184 |     "        json_response = json.loads(response.text.strip())\n",
185 |     "        return json_response\n",
186 |     "\n",
187 |     "class CerebrasModel(Model):\n",
188 |     "    model: str = \"deepseek-r1-distill-llama-70b\"\n",
189 |     "\n",
190 |     "\n",
191 |     "    @weave.op()\n",
192 |     "    def predict(self, question: str) -> dict:\n",
193 |     "        cerebras_client = OpenAI(api_key=os.getenv(\"CEREBRAS_API_KEY\"), base_url=\"https://api.cerebras.ai/v1\")\n",
194 |     "        response = cerebras_client.chat.completions.create(\n",
195 |     "            model=self.model,\n",
196 |     "            messages=[\n",
197 |     "                {\"role\": \"system\", \"content\": SYSTEM_MESSAGE},\n",
198 |     "                {\"role\": \"user\", \"content\": question}\n",
199 |     "            ]\n",
200 |     "        )\n",
201 |     "        json_response = json.loads(response.choices[0].message.content.strip())\n",
202 |     "        return json_response\n",
203 |     "\n",
204 |     "\n",
205 |     "class DeepSeekV3Model(Model):\n",
206 |     "    model: str = \"deepseek/deepseek-chat-v3-0324\"\n",
207 |     "\n",
208 |     "    @weave.op()\n",
209 |     "    def predict(self, question: str) -> dict:\n",
210 |     "        openrouterclient = OpenAI(\n",
211 |     "            api_key=os.getenv(\"OPENROUTER_API_KEY\"),\n",
212 |     "            base_url=\"https://openrouter.ai/api/v1\")\n",
213 |     "\n",
214 |     "        chat_completion = openrouterclient.chat.completions.create(\n",
215 |     "            model=self.model,\n",
216 |     "            messages=[\n",
217 |     "                {\"role\": \"system\", \"content\": SYSTEM_MESSAGE},\n",
218 |     "                {\"role\": \"user\", \"content\": question},\n",
219 |     "            ],\n",
220 |     "            temperature=0.7,\n",
221 |     "            max_tokens=1024,\n",
222 |     "            # extra_body={\n",
223 |     "            #     \"provider\": {\n",
224 |     "            #         \"order\": [\"DeepSeek\"],\n",
225 |     "            #         \"allow_fallbacks\": False,\n",
226 |     "            #     }\n",
227 |     "            # }\n",
228 |     "        )\n",
229 |     "        print(chat_completion)  # debug: inspect the raw OpenRouter response\n",
230 |     "        json_response = json.loads(chat_completion.choices[0].message.content.strip())\n",
231 |     "        return json_response\n",
232 |     "\n",
233 |     "class DeepSeekR1Model(Model):\n",
234 |     "    model: str = \"deepseek/deepseek-r1:free\"\n",
235 |     "\n",
236 |     "    @weave.op()\n",
237 |     "    def predict(self, question: str) -> dict:\n",
238 |     "        openrouterclient = OpenAI(\n",
239 |     "            api_key=os.getenv(\"OPENROUTER_API_KEY\"),\n",
240 |     "            base_url=\"https://openrouter.ai/api/v1\")\n",
241 |     "\n",
242 |     "        # Send the question to OpenRouter with reasoning enabled\n",
243 |     "        chat_completion = openrouterclient.chat.completions.create(\n",
244 |     "            model=self.model,\n",
245 |     "            messages=[\n",
246 |     "                {\"role\": \"system\", \"content\": SYSTEM_MESSAGE},\n",
247 |     "                {\"role\": \"user\", \"content\": question},\n",
248 |     "            ],\n",
249 |     "            extra_body={\n",
250 |     "                \"include_reasoning\": True,\n",
\"provider\": {\n", 252 | " \"order\": [\"Chutes\", \"Together\"],\n", 253 | " \"allow_fallbacks\": True,\n", 254 | " }\n", 255 | " })\n", 256 | " # remove the think part of the response between and \n", 257 | " text = chat_completion.choices[0].message.content.strip()\n", 258 | "\n", 259 | " json_response = json.loads(text)\n", 260 | " return json_response\n", 261 | "\n", 262 | "# Define the scorer\n", 263 | "@weave.op()\n", 264 | "def accuracy_scorer(expected: str, model_output: dict) -> dict:\n", 265 | " return {\n", 266 | " \"correct\": expected == model_output[\"correct_answer\"]\n", 267 | " }\n", 268 | "\n", 269 | "# Create the evaluation\n", 270 | "async def run_evaluation():\n", 271 | " # Create dataset\n", 272 | " dataset = load_dataset()\n", 273 | " \n", 274 | " # Create models with different reasoning efforts\n", 275 | " models = [\n", 276 | " # {\n", 277 | " # \"model\": DeepSeekR1Model(model=\"deepseek/deepseek-r1:free\"),\n", 278 | " # \"display_name\": \"DeepSeek R1\"\n", 279 | " # },\n", 280 | " # {\n", 281 | " # \"model\": o1Reasoning(reasoning_effort=\"low\"),\n", 282 | " # \"display_name\": \"O1 (Low)\"\n", 283 | " # },\n", 284 | " # {\n", 285 | " # \"model\": o1Reasoning(reasoning_effort=\"medium\"),\n", 286 | " # \"display_name\": \"O1 (Medium)\"\n", 287 | " # },\n", 288 | " # {\n", 289 | " # \"model\": o1Reasoning(reasoning_effort=\"high\"),\n", 290 | " # \"display_name\": \"O1 (High)\"\n", 291 | " # },\n", 292 | " # {\n", 293 | " # \"model\": o3Reasoning(reasoning_effort=\"low\", model='o3'),\n", 294 | " # \"display_name\": \"O3 (Low)\"\n", 295 | " # },\n", 296 | " # {\n", 297 | " # \"model\": o3Reasoning(reasoning_effort=\"medium\", model='o3'),\n", 298 | " # \"display_name\": \"O3 (Medium)\"\n", 299 | " # },\n", 300 | " # {\n", 301 | " # \"model\": o3Reasoning(reasoning_effort=\"low\", model='o4-mini'),\n", 302 | " # \"display_name\": \"O4-mini (Low)\"\n", 303 | " # },\n", 304 | " # {\n", 305 | " # \"model\": o3Reasoning(reasoning_effort=\"medium\", model='o4-mini'),\n", 306 | " # \"display_name\": \"O4-mini (Medium)\"\n", 307 | " # },\n", 308 | " # {\n", 309 | " # \"model\": o3Reasoning(reasoning_effort=\"high\", model='o4-mini'),\n", 310 | " # \"display_name\": \"O4-mini (High)\"\n", 311 | " # },\n", 312 | " # {\n", 313 | " # \"model\": o3Reasoning(reasoning_effort=\"medium\"), \n", 314 | " # \"display_name\": \"O3 Mini (Medium)\"\n", 315 | " # },\n", 316 | " # {\n", 317 | " # \"model\": o3Reasoning(reasoning_effort=\"high\"),\n", 318 | " # \"display_name\": \"O3 Mini (High)\"\n", 319 | " # },\n", 320 | " # {\n", 321 | " # \"model\": GeminiReasoningModel(),\n", 322 | " # \"display_name\": \"Gemini 2.5\"\n", 323 | " # },\n", 324 | " {\n", 325 | " \"model\": GeminiReasoningModel(thinking_budget=0, thinking=False),\n", 326 | " \"display_name\": \"2.5 Flash\"\n", 327 | " },\n", 328 | " {\n", 329 | " \"model\": GeminiReasoningModel(thinking_budget=1000, thinking=True),\n", 330 | " \"display_name\": \"2.5 Flash (1000 thinking)\"\n", 331 | " },\n", 332 | " {\n", 333 | " \"model\": GeminiReasoningModel(thinking_budget=12000, thinking=True),\n", 334 | " \"display_name\": \"2.5 Flash (12k thinking)\"\n", 335 | " },\n", 336 | " # {\n", 337 | " # \"model\": gpt4point1(model=\"gpt-4.1-2025-04-14\", sandwich=True),\n", 338 | " # \"display_name\": \"GPT-4.1 🥪\"\n", 339 | " # },\n", 340 | " # {\n", 341 | " # \"model\": gpt4point1(model=\"gpt-4.1-mini-2025-04-14\", sandwich=True),\n", 342 | " # \"display_name\": \"4.1-mini 🥪\"\n", 343 | " # },\n", 344 | " # {\n", 345 | " # 
\"model\": gpt4point1(model=\"gpt-4.1-nano-2025-04-14\", sandwich=True),\n", 346 | " # \"display_name\": \"4.1-nano 🥪\"\n", 347 | " # },\n", 348 | " # {\n", 349 | " # \"model\": gpt4point1(model=\"gpt-4o\"),\n", 350 | " # \"display_name\": \"GPT 4o\"\n", 351 | " # },\n", 352 | " # {\n", 353 | " # \"model\": gpt4point1(model=\"gpt-4o-mini\"),\n", 354 | " # \"display_name\": \"GPT 4o-mini\"\n", 355 | " # },\n", 356 | " # {\n", 357 | " # \"model\": gpt4point1(model=\"gpt-4.5-preview\"),\n", 358 | " # \"display_name\": \"GPT 4.5\"\n", 359 | " # },\n", 360 | " # {\n", 361 | " # \"model\": DeepSeekV3Model(model=\"deepseek/deepseek-chat-v3-0324\"),\n", 362 | " # \"display_name\": \"DS V3-0324\"\n", 363 | " # },\n", 364 | " # {\n", 365 | " # \"model\": CerebrasModel(model=\"deepseek-r1-distill-llama-70b\"),\n", 366 | " # \"display_name\": \"Cerebras (Thinking)\"\n", 367 | " # }\n", 368 | " ]\n", 369 | " \n", 370 | " # Create evaluation\n", 371 | " evaluation = Evaluation(\n", 372 | " dataset=dataset,\n", 373 | " trials=3,\n", 374 | " scorers=[accuracy_scorer]\n", 375 | " )\n", 376 | " \n", 377 | " # Run evaluation for each model\n", 378 | " for model in models:\n", 379 | " await evaluation.evaluate(\n", 380 | " model[\"model\"],\n", 381 | " __weave={\"display_name\": model[\"display_name\"]}\n", 382 | " )\n", 383 | "\n", 384 | "# Run the evaluation\n", 385 | "await run_evaluation()" 386 | ] 387 | } 388 | ], 389 | "metadata": { 390 | "kernelspec": { 391 | "display_name": ".venv", 392 | "language": "python", 393 | "name": "python3" 394 | }, 395 | "language_info": { 396 | "codemirror_mode": { 397 | "name": "ipython", 398 | "version": 3 399 | }, 400 | "file_extension": ".py", 401 | "mimetype": "text/x-python", 402 | "name": "python", 403 | "nbconvert_exporter": "python", 404 | "pygments_lexer": "ipython3", 405 | "version": "3.13.3" 406 | } 407 | }, 408 | "nbformat": 4, 409 | "nbformat_minor": 2 410 | } 411 | --------------------------------------------------------------------------------