├── .env.example
├── README.md
├── .gitignore
└── weave_reasoning_tests.ipynb
/.env.example:
--------------------------------------------------------------------------------
1 | WANDB_API_KEY=#add your key from https://wandb.ai/authorize
2 | OPENAI_API_KEY=#add your key from https://platform.openai.com/api-keys
3 | OPENROUTER_API_KEY=#add your key from https://openrouter.ai/keys
4 | ANTHROPIC_API_KEY=#add your key from https://console.anthropic.com
5 | GEMINI_API_KEY=#add your key from https://aistudio.google.com/app/apikey
6 | CEREBRAS_API_KEY=#add your key from https://cloud.cerebras.ai
7 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # o3-tests – Weave Reasoning Evaluation Suite
2 |
3 | o3‑tests is a **Weave** space (dashboard + artifacts) that tracks how the latest large‑language‑model releases perform on *really* hard reasoning tasks.
4 | It started as a small playground for the **AI Explained SimpleBench** benchmark but quickly evolved into a public evaluation hub covering:
5 |
6 | * GPT‑4.1 family (full, mini, and nano)
7 | * `o3` and `o4‑mini`
8 | * Gemini 2.5 ✨ *(some runs failed because of RPM limits – see the results tables)*
9 |
10 | The goal is to answer a single question: **“Which new model should I reach for when my prompt demands serious reasoning?”**
11 |
12 |
13 |
14 |
15 |
16 | ---
17 |
18 | ## What’s inside this repo?
19 |
20 | | Path | Purpose |
21 | |------|---------|
22 | | `weave_reasoning_tests.ipynb` | Jupyter notebook that kicks off all evaluations, logs them to Weave and renders comparison tables/plots. |
23 |
24 | That’s it – everything else lives in the hosted Weave space.
25 |
26 | ## Quick start
27 |
28 | 1. **Clone** the repo and `cd` inside it.
29 |
30 | 2. Copy the example environment file and fill in your API keys:
31 |
32 | ```bash
33 | cp .env.example .env
34 | ```
35 |
36 | Open `.env` and replace the placeholder values with your provider API keys.
37 |
38 | 3. Create a Python virtual environment (Python ≥ 3.9) and install the dependencies:
39 |
40 | ```bash
41 | python -m venv .venv
42 | source .venv/bin/activate
43 |    pip install -U weave openai anthropic google-genai llm-json set-env-colab-kaggle-dotenv  # the notebook's first cell also installs these
44 | ```
45 |
46 | 4. Launch the notebook:
47 |
48 | ```bash
49 | jupyter notebook weave_reasoning_tests.ipynb
50 | ```
51 |
52 | 5. Set your Weights & Biases API key:
53 | Visit https://wandb.ai/authorize, copy your key, and add it to the `WANDB_API_KEY` field in your `.env`.
54 |
55 | 6. Confirm your provider keys (e.g. `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GEMINI_API_KEY`) are present in `.env` – the notebook's setup cell loads them with `set_env`.
56 |
57 | 7. Run all cells.
58 | The notebook will send every evaluation result to your personal Weave project using the `WANDB_API_KEY` from your `.env` file; no manual login is needed.
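59 | 
60 |    Under the hood the notebook just loads the key and calls `weave.init` – a minimal sketch mirroring its setup cell:
61 | 
62 |    ```python
63 |    from set_env import set_env
64 |    import weave
65 |    set_env("WANDB_API_KEY")               # reads the key from your .env
66 |    weave_client = weave.init("o3-tests")  # evaluation runs land in this project
67 |    ```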
68 | 
69 | ## Adding another model
70 | 
71 | The notebook is intentionally tiny – the heavy lifting is done by Weave's `Evaluation` harness, and the questions come from SimpleBench.
72 | 
73 | 1. Append a new entry to the `models` list inside `run_evaluation` (see the sketch below this list).
74 | 2. Provide the model‑specific invocation parameters / client.
75 | 3. Re‑run the *Run Benchmarks* section.
76 | 
77 | Once finished, the new runs will appear in the Weave comparison view automatically.
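78 | 
79 | For example, a hypothetical entry reusing a model class the notebook already defines (the display name here is illustrative):
80 | 
81 | ```python
82 | {
83 |     "model": o3Reasoning(reasoning_effort="high", model="o3"),
84 |     "display_name": "O3 (High)"
85 | },
86 | ```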
87 | 
88 | ## Contributing
89 | 
90 | Help is very welcome!
91 | If you notice a regression, want to add a benchmark, or see a model missing from the list, please open an issue or PR.
92 | 
93 | 1. Fork → feature branch → PR.
94 | 2. Keep the notebook clear of personal API keys – load them from environment variables.
95 | 3. Explain your change in the PR description and, if possible, link to the Weave run that validates it.
96 | 
97 | ## Acknowledgements
98 | 
99 | * **[Weave](https://github.com/wandb/weave)** for the slick dashboarding & data‑ops layer.
100 | * **[AI Explained SimpleBench](https://github.com/AIExplained/bench)** – we sponsor the project and rely on it for the question set.
101 | 
102 | ---
103 | 
104 | Made with 🧠 + ❤️ – ping me on X [@altryne](https://x.com/@altryne) if you think of another model worth tracking!
105 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | share/python-wheels/
24 | *.egg-info/
25 | .installed.cfg
26 | *.egg
27 | MANIFEST
28 |
29 | # PyInstaller
30 | # Usually these files are written by a python script from a template
31 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
32 | *.manifest
33 | *.spec
34 |
35 | # Installer logs
36 | pip-log.txt
37 | pip-delete-this-directory.txt
38 |
39 | # Unit test / coverage reports
40 | htmlcov/
41 | .tox/
42 | .nox/
43 | .coverage
44 | .coverage.*
45 | .cache
46 | nosetests.xml
47 | coverage.xml
48 | *.cover
49 | *.py,cover
50 | .hypothesis/
51 | .pytest_cache/
52 | cover/
53 |
54 | # Translations
55 | *.mo
56 | *.pot
57 |
58 | # Django stuff:
59 | *.log
60 | local_settings.py
61 | db.sqlite3
62 | db.sqlite3-journal
63 |
64 | # Flask stuff:
65 | instance/
66 | .webassets-cache
67 |
68 | # Scrapy stuff:
69 | .scrapy
70 |
71 | # Sphinx documentation
72 | docs/_build/
73 |
74 | # PyBuilder
75 | .pybuilder/
76 | target/
77 |
78 | # Jupyter Notebook
79 | .ipynb_checkpoints
80 |
81 | # IPython
82 | profile_default/
83 | ipython_config.py
84 |
85 | # pyenv
86 | # For a library or package, you might want to ignore these files since the code is
87 | # intended to run in multiple environments; otherwise, check them in:
88 | # .python-version
89 |
90 | # pipenv
91 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
92 | # However, in case of collaboration, if having platform-specific dependencies or dependencies
93 | # having no cross-platform support, pipenv may install dependencies that don't work, or not
94 | # install all needed dependencies.
95 | #Pipfile.lock
96 |
97 | # UV
98 | # Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
99 | # This is especially recommended for binary packages to ensure reproducibility, and is more
100 | # commonly ignored for libraries.
101 | #uv.lock
102 |
103 | # poetry
104 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
105 | # This is especially recommended for binary packages to ensure reproducibility, and is more
106 | # commonly ignored for libraries.
107 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
108 | #poetry.lock
109 |
110 | # pdm
111 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
112 | #pdm.lock
113 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
114 | # in version control.
115 | # https://pdm.fming.dev/latest/usage/project/#working-with-version-control
116 | .pdm.toml
117 | .pdm-python
118 | .pdm-build/
119 |
120 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
121 | __pypackages__/
122 |
123 | # Celery stuff
124 | celerybeat-schedule
125 | celerybeat.pid
126 |
127 | # SageMath parsed files
128 | *.sage.py
129 |
130 | # Environments
131 | .env
132 | .venv
133 | env/
134 | venv/
135 | ENV/
136 | env.bak/
137 | venv.bak/
138 |
139 | # Spyder project settings
140 | .spyderproject
141 | .spyproject
142 |
143 | # Rope project settings
144 | .ropeproject
145 |
146 | # mkdocs documentation
147 | /site
148 |
149 | # mypy
150 | .mypy_cache/
151 | .dmypy.json
152 | dmypy.json
153 |
154 | # Pyre type checker
155 | .pyre/
156 |
157 | # pytype static type analyzer
158 | .pytype/
159 |
160 | # Cython debug symbols
161 | cython_debug/
162 |
163 | # PyCharm
164 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
165 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
166 | # and can be added to the global gitignore or merged into this file. For a more nuclear
167 | # option (not recommended) you can uncomment the following to ignore the entire idea folder.
168 | #.idea/
169 |
170 | # Ruff stuff:
171 | .ruff_cache/
172 |
173 | # PyPI configuration file
174 | .pypirc
175 |
--------------------------------------------------------------------------------
/weave_reasoning_tests.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# SimpleBench evals for all models - wandb.me/4dot1evals\n",
8 | "\n",
9 | "\n",
10 | "Using the SimpleBench questions to evaluate the reasoning and non-reasoning performance of the latest AI models."
11 | ]
12 | },
13 | {
14 | "cell_type": "code",
15 | "execution_count": null,
16 | "metadata": {},
17 | "outputs": [],
18 | "source": [
19 | "# Install the packages the notebook needs (model SDKs, Weave, helpers).\n",
20 | "print('⏳ Installing packages')\n",
21 | "!uv pip install -q -U weave set-env-colab-kaggle-dotenv tqdm ipywidgets requests openai google-genai google-generativeai llm-json anthropic\n",
22 | "print('✅ Packages installed')"
23 | ]
24 | },
25 | {
26 | "cell_type": "code",
27 | "execution_count": null,
28 | "metadata": {},
29 | "outputs": [],
30 | "source": [
31 | "from set_env import set_env\n",
32 | "from openai import OpenAI\n",
33 | "from google import genai\n",
34 | "from google.genai import types\n",
35 | "import weave\n",
36 | "from llm_json import json\n",
37 | "import random\n",
38 | "import os\n",
39 | "set_env(\"WANDB_API_KEY\")\n",
40 | "set_env(\"OPENAI_API_KEY\")\n",
41 | "set_env(\"GEMINI_API_KEY\")\n",
42 | "set_env(\"OPENROUTER_API_KEY\")\n",
43 | "set_env(\"CEREBRAS_API_KEY\")\n",
44 | "set_env(\"ANTHROPIC_API_KEY\")\n",
45 | "\n",
46 | "print('✅ Environment variables set')\n",
47 | "\n",
48 | "# Initialize weave and OpenAI client\n",
49 | "weave_client = weave.init('o3-tests')\n",
50 | "client = OpenAI()"
51 | ]
52 | },
53 | {
54 | "cell_type": "code",
55 | "execution_count": null,
56 | "metadata": {},
57 | "outputs": [],
58 | "source": [
59 | "SYSTEM_MESSAGE = \"\"\"\n",
60 | "You're given a multiple choice trick question. The answer will not be the obvious one.\n",
61 | "The question is likely phrased in a way that contains many distractions. Your task is to identify the basic flaws and reason about the true answer.\n",
62 | "It may help to think carefully about the location/physics of everything mentioned in the question.\n",
63 | "Many of these questions may be variations of riddles you have been trained on in the past, but contain a trick that is easy for humans to identify.\n",
64 | "\n",
65 | "Start by rephrasing the original question, taking special notice of details that could seem innocuous but actually subvert the original meaning of the question.\n",
66 | "Then extract a few facts in the form of numbered statements as factual_statements to help you reason toward the correct answer.\n",
67 | "From there, deduce the best answer – not the answer that would make the question make sense, but the correct answer for that specific weird scenario.\n",
68 | "\n",
69 | "Your response should be strictly in the JSON format with no backticks or other formatting:\n",
70 | " {\n",
71 | " \"rewritten_question\": \"rewritten_question\",\n",
72 | " \"factual_statements\": [\"statement1\", \"statement2\", \"statement3\"],\n",
73 | " \"explanation\": \"explanation\",\n",
74 | " \"correct_answer\": \"X\"\n",
75 | " }\n",
76 | "where X is one of the letters A, B, C, D, E, or F.\n",
77 | "\"\"\"\n"
78 | ]
79 | },
80 | {
81 | "cell_type": "code",
82 | "execution_count": null,
83 | "metadata": {},
84 | "outputs": [],
85 | "source": [
86 | "from weave import Model, Evaluation\n",
87 | "from llm_json import json\n",
88 | "import os\n",
89 | "os.environ['WEAVE_PARALLELISM'] = '1'\n",
90 | "\n",
91 | "# Load the dataset\n",
92 | "def load_dataset():\n",
93 | " dataset = weave.ref(\"weave:///simplebench/simple_bench_public/object/competition_dataset:qNJnkgpMqoyc48GwlFCSpVypn3D8x77N7lGCBIab4XQ\").get()\n",
94 | " # Convert the data into the format we need\n",
95 | " return [{\n",
96 | " \"question\": item[\"prompt\"],\n",
97 | " \"question_id\": item[\"question_id\"],\n",
98 | " \"expected\": item[\"answer\"]\n",
99 | " } for item in dataset]\n",
100 | "\n",
101 | "class gpt4point1(Model):\n",
102 | " model: str = \"gpt-4.1-2025-04-14\"\n",
103 | " sandwich: bool = False\n",
104 | "\n",
105 | " @weave.op()\n",
106 | " def predict(self, question: str) -> dict:\n",
107 | " messages = [\n",
108 | " {\"role\": \"developer\", \"content\": SYSTEM_MESSAGE},\n",
109 | " {\"role\": \"user\", \"content\": question}\n",
110 | " ]\n",
111 | " if self.sandwich:\n",
112 | " messages.append({\n",
113 | " \"role\": \"developer\", \n",
114 | " \"content\": SYSTEM_MESSAGE\n",
115 | " })\n",
116 | " response = client.chat.completions.create(\n",
117 | " model=self.model,\n",
118 | " messages=messages\n",
119 | " )\n",
120 | " json_response = json.loads(response.choices[0].message.content.strip())\n",
121 | " return json_response\n",
122 | "\n",
123 | "# Define the model class with different reasoning efforts\n",
124 | "class o3Reasoning(Model):\n",
125 | " model: str = \"o3-mini-2025-01-31\"\n",
126 | " reasoning_effort: str # \"low\", \"medium\", or \"high\"\n",
127 | "\n",
128 | " @weave.op()\n",
129 | " def predict(self, question: str) -> dict:\n",
130 | "\n",
131 | " response = client.chat.completions.create(\n",
132 | " model=self.model,\n",
133 | " reasoning_effort=self.reasoning_effort,\n",
134 | " messages=[\n",
135 | " {\"role\": \"developer\", \"content\": SYSTEM_MESSAGE},\n",
136 | " {\"role\": \"user\", \"content\": question}\n",
137 | " ]\n",
138 | " )\n",
139 | " json_response = json.loads(response.choices[0].message.content.strip())\n",
140 | " return json_response\n",
141 | "# Define the o1 model class with different reasoning efforts\n",
142 | "class o1Reasoning(Model):\n",
143 | " model: str = \"o1-2024-12-17\"\n",
144 | " reasoning_effort: str # \"low\", \"medium\", or \"high\"\n",
145 | "\n",
146 | " @weave.op()\n",
147 | " def predict(self, question: str) -> dict:\n",
148 | "\n",
149 | " response = client.chat.completions.create(\n",
150 | " model=self.model,\n",
151 | " reasoning_effort=self.reasoning_effort,\n",
152 | " messages=[\n",
153 | " {\"role\": \"developer\", \"content\": SYSTEM_MESSAGE},\n",
154 | " {\"role\": \"user\", \"content\": question}\n",
155 | " ]\n",
156 | " )\n",
157 | " json_response = json.loads(response.choices[0].message.content.strip())\n",
158 | " return json_response\n",
159 | "\n",
160 | "class GeminiReasoningModel(Model):\n",
161 | "    model: str = \"gemini-2.5-flash-preview-04-17\"  # matches the '2.5 Flash' entries in the eval list; Flash supports a configurable thinking budget\n",
162 | "    thinking_budget: int = 1000 # any value between 0 and 24576 (~24k) thinking tokens\n",
163 | " thinking: bool = False\n",
164 | "\n",
165 | " @weave.op()\n",
166 | " def predict(self, question: str) -> dict:\n",
167 | " from google import genai\n",
168 | " from google.genai import types\n",
169 | " MODEL_ID = self.model\n",
170 | " prompt = question\n",
171 | " budget = self.thinking_budget\n",
172 | " \n",
173 | " client = genai.Client(api_key=os.getenv(\"GEMINI_API_KEY\"))\n",
174 | " response = client.models.generate_content(\n",
175 | " model=MODEL_ID,\n",
176 | " contents=f\"{SYSTEM_MESSAGE}\\n\\n{prompt}\",\n",
177 | " config=types.GenerateContentConfig(\n",
178 | " thinking_config=types.ThinkingConfig(\n",
179 | " include_thoughts=self.thinking,\n",
180 | " thinking_budget=budget\n",
181 | " )\n",
182 | " )\n",
183 | " )\n",
184 | " json_response = json.loads(response.text.strip())\n",
185 | " return json_response\n",
186 | "\n",
187 | "class CerebrasModel(Model):\n",
188 | " model: str = \"deepseek-r1-distill-llama-70b\"\n",
189 | " \n",
190 | "\n",
191 | " @weave.op()\n",
192 | " def predict(self, question: str) -> dict:\n",
193 | " cerebras_client = OpenAI(api_key=os.getenv(\"CEREBRAS_API_KEY\"), base_url=\"https://api.cerebras.ai/v1\")\n",
194 | " response = cerebras_client.chat.completions.create(\n",
195 | " model=self.model,\n",
196 | " messages=[\n",
197 | " {\"role\": \"system\", \"content\": SYSTEM_MESSAGE},\n",
198 | " {\"role\": \"user\", \"content\": question}\n",
199 | " ]\n",
200 | " )\n",
201 | " json_response = json.loads(response.choices[0].message.content.strip())\n",
202 | " return json_response\n",
203 | "\n",
204 | "\n",
205 | "class DeepSeekV3Model(Model):\n",
206 | " model: str = \"deepseek/deepseek-chat-v3-0324\"\n",
207 | " \n",
208 | " @weave.op()\n",
209 | " def predict(self, question: str) -> dict:\n",
210 | " openrouterclient = OpenAI(\n",
211 | " api_key=os.getenv(\"OPENROUTER_API_KEY\"), \n",
212 | " base_url=\"https://openrouter.ai/api/v1\")\n",
213 | " \n",
214 | " chat_completion = openrouterclient.chat.completions.create(\n",
215 | " model=self.model,\n",
216 | " messages=[\n",
217 | " {\"role\": \"system\", \"content\": SYSTEM_MESSAGE},\n",
218 | " {\"role\": \"user\", \"content\": question},\n",
219 | " ],\n",
220 | " temperature=0.7,\n",
221 | " max_tokens=1024,\n",
222 | " # extra_body={\n",
223 | " # \"provider\": {\n",
224 | " # \"order\": [\"DeepSeek\"],\n",
225 | " # \"allow_fallbacks\": False,\n",
226 | " # }\n",
227 | " # }\n",
228 | " )\n",
229 | " print(chat_completion)\n",
230 | " json_response = json.loads(chat_completion.choices[0].message.content.strip())\n",
231 | " return json_response\n",
232 | "\n",
233 | "class DeepSeekR1Model(Model):\n",
234 | "    model: str = \"deepseek/deepseek-r1:free\"\n",
235 | "\n",
236 | " @weave.op()\n",
237 | " def predict(self, question: str) -> dict:\n",
238 | " openrouterclient = OpenAI(\n",
239 | " api_key=os.getenv(\"OPENROUTER_API_KEY\"), \n",
240 | " base_url=\"https://openrouter.ai/api/v1\")\n",
241 | " \n",
242 | " # Format the question and options\n",
243 | " chat_completion = openrouterclient.chat.completions.create(\n",
244 | " model=self.model,\n",
245 | " messages=[\n",
246 | " {\"role\": \"system\", \"content\": SYSTEM_MESSAGE},\n",
247 | " {\"role\": \"user\", \"content\": question},\n",
248 | " ],\n",
249 | " extra_body={\n",
250 | " \"include_reasoning\": True,\n",
251 | " \"provider\": {\n",
252 | " \"order\": [\"Chutes\", \"Together\"],\n",
253 | " \"allow_fallbacks\": True,\n",
254 | " }\n",
255 | " })\n",
256 | "        # strip the <think>...</think> reasoning block that some providers prepend before the JSON\n",
257 | "        text = chat_completion.choices[0].message.content.strip()\n",
258 | "        text = text.split(\"</think>\", 1)[-1].strip()  # no-op when the tag is absent\n",
259 | "        json_response = json.loads(text)\n",
260 | " return json_response\n",
261 | "\n",
262 | "# Define the scorer\n",
263 | "@weave.op()\n",
264 | "def accuracy_scorer(expected: str, model_output: dict) -> dict:\n",
265 | " return {\n",
266 | " \"correct\": expected == model_output[\"correct_answer\"]\n",
267 | " }\n",
268 | "\n",
269 | "# Create the evaluation\n",
270 | "async def run_evaluation():\n",
271 | " # Create dataset\n",
272 | " dataset = load_dataset()\n",
273 | " \n",
274 | " # Create models with different reasoning efforts\n",
275 | " models = [\n",
276 | " # {\n",
277 | " # \"model\": DeepSeekR1Model(model=\"deepseek/deepseek-r1:free\"),\n",
278 | " # \"display_name\": \"DeepSeek R1\"\n",
279 | " # },\n",
280 | " # {\n",
281 | " # \"model\": o1Reasoning(reasoning_effort=\"low\"),\n",
282 | " # \"display_name\": \"O1 (Low)\"\n",
283 | " # },\n",
284 | " # {\n",
285 | " # \"model\": o1Reasoning(reasoning_effort=\"medium\"),\n",
286 | " # \"display_name\": \"O1 (Medium)\"\n",
287 | " # },\n",
288 | " # {\n",
289 | " # \"model\": o1Reasoning(reasoning_effort=\"high\"),\n",
290 | " # \"display_name\": \"O1 (High)\"\n",
291 | " # },\n",
292 | " # {\n",
293 | " # \"model\": o3Reasoning(reasoning_effort=\"low\", model='o3'),\n",
294 | " # \"display_name\": \"O3 (Low)\"\n",
295 | " # },\n",
296 | " # {\n",
297 | " # \"model\": o3Reasoning(reasoning_effort=\"medium\", model='o3'),\n",
298 | " # \"display_name\": \"O3 (Medium)\"\n",
299 | " # },\n",
300 | " # {\n",
301 | " # \"model\": o3Reasoning(reasoning_effort=\"low\", model='o4-mini'),\n",
302 | " # \"display_name\": \"O4-mini (Low)\"\n",
303 | " # },\n",
304 | " # {\n",
305 | " # \"model\": o3Reasoning(reasoning_effort=\"medium\", model='o4-mini'),\n",
306 | " # \"display_name\": \"O4-mini (Medium)\"\n",
307 | " # },\n",
308 | " # {\n",
309 | " # \"model\": o3Reasoning(reasoning_effort=\"high\", model='o4-mini'),\n",
310 | " # \"display_name\": \"O4-mini (High)\"\n",
311 | " # },\n",
312 | " # {\n",
313 | " # \"model\": o3Reasoning(reasoning_effort=\"medium\"), \n",
314 | " # \"display_name\": \"O3 Mini (Medium)\"\n",
315 | " # },\n",
316 | " # {\n",
317 | " # \"model\": o3Reasoning(reasoning_effort=\"high\"),\n",
318 | " # \"display_name\": \"O3 Mini (High)\"\n",
319 | " # },\n",
320 | " # {\n",
321 | " # \"model\": GeminiReasoningModel(),\n",
322 | " # \"display_name\": \"Gemini 2.5\"\n",
323 | " # },\n",
324 | " {\n",
325 | " \"model\": GeminiReasoningModel(thinking_budget=0, thinking=False),\n",
326 | " \"display_name\": \"2.5 Flash\"\n",
327 | " },\n",
328 | " {\n",
329 | " \"model\": GeminiReasoningModel(thinking_budget=1000, thinking=True),\n",
330 | " \"display_name\": \"2.5 Flash (1000 thinking)\"\n",
331 | " },\n",
332 | " {\n",
333 | " \"model\": GeminiReasoningModel(thinking_budget=12000, thinking=True),\n",
334 | " \"display_name\": \"2.5 Flash (12k thinking)\"\n",
335 | " },\n",
336 | " # {\n",
337 | " # \"model\": gpt4point1(model=\"gpt-4.1-2025-04-14\", sandwich=True),\n",
338 | " # \"display_name\": \"GPT-4.1 🥪\"\n",
339 | " # },\n",
340 | " # {\n",
341 | " # \"model\": gpt4point1(model=\"gpt-4.1-mini-2025-04-14\", sandwich=True),\n",
342 | " # \"display_name\": \"4.1-mini 🥪\"\n",
343 | " # },\n",
344 | " # {\n",
345 | " # \"model\": gpt4point1(model=\"gpt-4.1-nano-2025-04-14\", sandwich=True),\n",
346 | " # \"display_name\": \"4.1-nano 🥪\"\n",
347 | " # },\n",
348 | " # {\n",
349 | " # \"model\": gpt4point1(model=\"gpt-4o\"),\n",
350 | " # \"display_name\": \"GPT 4o\"\n",
351 | " # },\n",
352 | " # {\n",
353 | " # \"model\": gpt4point1(model=\"gpt-4o-mini\"),\n",
354 | " # \"display_name\": \"GPT 4o-mini\"\n",
355 | " # },\n",
356 | " # {\n",
357 | " # \"model\": gpt4point1(model=\"gpt-4.5-preview\"),\n",
358 | " # \"display_name\": \"GPT 4.5\"\n",
359 | " # },\n",
360 | " # {\n",
361 | " # \"model\": DeepSeekV3Model(model=\"deepseek/deepseek-chat-v3-0324\"),\n",
362 | " # \"display_name\": \"DS V3-0324\"\n",
363 | " # },\n",
364 | " # {\n",
365 | " # \"model\": CerebrasModel(model=\"deepseek-r1-distill-llama-70b\"),\n",
366 | " # \"display_name\": \"Cerebras (Thinking)\"\n",
367 | " # }\n",
368 | " ]\n",
369 | " \n",
370 | " # Create evaluation\n",
371 | " evaluation = Evaluation(\n",
372 | " dataset=dataset,\n",
373 | " trials=3,\n",
374 | " scorers=[accuracy_scorer]\n",
375 | " )\n",
376 | " \n",
377 | " # Run evaluation for each model\n",
378 | " for model in models:\n",
379 | " await evaluation.evaluate(\n",
380 | " model[\"model\"],\n",
381 | " __weave={\"display_name\": model[\"display_name\"]}\n",
382 | " )\n",
383 | "\n",
384 | "# Run the evaluation\n",
385 | "await run_evaluation()"
386 | ]
387 | }
388 | ],
389 | "metadata": {
390 | "kernelspec": {
391 | "display_name": ".venv",
392 | "language": "python",
393 | "name": "python3"
394 | },
395 | "language_info": {
396 | "codemirror_mode": {
397 | "name": "ipython",
398 | "version": 3
399 | },
400 | "file_extension": ".py",
401 | "mimetype": "text/x-python",
402 | "name": "python",
403 | "nbconvert_exporter": "python",
404 | "pygments_lexer": "ipython3",
405 | "version": "3.13.3"
406 | }
407 | },
408 | "nbformat": 4,
409 | "nbformat_minor": 2
410 | }
411 |
--------------------------------------------------------------------------------