├── .env.example
├── README.md
├── .gitignore
└── weave_reasoning_tests.ipynb

/.env.example:
--------------------------------------------------------------------------------
1 | WANDB_API_KEY=#add your key from https://wandb.ai/authorize
2 | OPENAI_API_KEY=#add your key from https://platform.openai.com/api-keys
3 | OPENROUTER_API_KEY=#add your key from https://openrouter.ai/settings/keys
4 | ANTHROPIC_API_KEY=#add your key from https://console.anthropic.com
5 | GEMINI_API_KEY=#add your key from https://aistudio.google.com/app/apikey
6 | CEREBRAS_API_KEY=#add your key from https://cloud.cerebras.ai
7 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # o3-tests – Weave Reasoning Evaluation Suite
2 | 
3 | o3‑tests is a **Weave** space (dashboard + artifacts) that tracks how the latest Large‑Language‑Model releases perform on *really* hard reasoning tasks.
4 | It started as a small playground for the **AI Explained SimpleBench** benchmark but quickly evolved into a public evaluation hub covering:
5 | 
6 | * GPT‑4.1 family (full, mini, and nano)
7 | * `o3` and `o4‑mini`
8 | * Gemini 2.5 ✨ *(some runs failed because of RPM limits – see the results tables)*
9 | 
10 | The goal is to answer a single question: **“Which new model should I reach for when my prompt demands serious reasoning?”**
11 | 
12 | 

13 | *Weave dashboard screenshot*
14 | 

15 | 
16 | ---
17 | 
18 | ## What’s inside this repo?
19 | 
20 | | Path | Purpose |
21 | |------|---------|
22 | | `weave_reasoning_tests.ipynb` | Jupyter notebook that kicks off all evaluations, logs them to Weave and renders comparison tables/plots. |
23 | 
24 | That’s it – everything else lives in the hosted Weave space.
25 | 
26 | ## Quick start
27 | 
28 | 1. **Clone** the repo and `cd` inside it.
29 | 
30 | 2. Copy the example environment file and fill in your API keys:
31 | 
32 |    ```bash
33 |    cp .env.example .env
34 |    ```
35 | 
36 |    Open `.env` and replace the placeholder values with your provider API keys, including your Weights & Biases key from https://wandb.ai/authorize.
37 | 
38 | 3. Create a Python env with the same stack you use for your LLM work (≥ Python 3.9):
39 | 
40 |    ```bash
41 |    python -m venv .venv
42 |    source .venv/bin/activate
43 |    pip install -U weave set-env-colab-kaggle-dotenv llm-json openai anthropic google-genai  # plus whichever other model SDKs you want
44 |    ```
45 | 
46 |    (The notebook’s first cell also installs these via `uv pip`, so this step mostly matters if you skip that cell.)
47 | 
48 | 4. Launch the notebook:
49 | 
50 |    ```bash
51 |    jupyter notebook weave_reasoning_tests.ipynb
52 |    ```
53 | 
54 | 5. Run all cells.
55 |    The setup cell loads `WANDB_API_KEY` and the provider keys (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GEMINI_API_KEY`, `OPENROUTER_API_KEY`, `CEREBRAS_API_KEY`) from your `.env` via `set_env`, and every evaluation result is logged to your personal Weave project; no manual login is needed.
56 | 
57 | ## Adding another model
58 | 
59 | The notebook is intentionally tiny – the heavy lifting is done by Weave’s `Evaluation` harness, with SimpleBench supplying the question set.
60 | 
61 | 1. Append a new entry to the `models` list inside `run_evaluation()` (a minimal sketch follows below).
62 | 2. Provide the model‑specific invocation parameters / client.
63 | 3. Re‑run the evaluation cell (`await run_evaluation()`).
64 | 
65 | Once finished, the new runs will appear in the Weave comparison view automatically.
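66 | 
67 | For illustration, here is a minimal, hypothetical sketch of such an entry for an OpenAI‑compatible provider. The class name, `MY_PROVIDER_API_KEY`, base URL and model id are placeholders (swap in your provider’s real details); `SYSTEM_MESSAGE` and the lenient `json` parser are already defined in earlier notebook cells:
68 | 
69 | ```python
70 | import os
71 | 
72 | import weave
73 | from openai import OpenAI
74 | from weave import Model
75 | 
76 | 
77 | class MyProviderModel(Model):
78 |     model: str = "my-provider/my-model"  # placeholder model id
79 | 
80 |     @weave.op()
81 |     def predict(self, question: str) -> dict:
82 |         # Placeholder endpoint – any OpenAI-compatible API works the same way.
83 |         provider_client = OpenAI(
84 |             api_key=os.getenv("MY_PROVIDER_API_KEY"),
85 |             base_url="https://api.my-provider.example/v1",
86 |         )
87 |         response = provider_client.chat.completions.create(
88 |             model=self.model,
89 |             messages=[
90 |                 {"role": "system", "content": SYSTEM_MESSAGE},
91 |                 {"role": "user", "content": question},
92 |             ],
93 |         )
94 |         # The system prompt asks for bare JSON, so parse the reply directly.
95 |         return json.loads(response.choices[0].message.content.strip())
96 | ```
97 | 
98 | and the matching `models` entry:
99 | 
100 | ```python
101 | {
102 |     "model": MyProviderModel(),
103 |     "display_name": "My Provider",
104 | },
105 | ```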
106 | 
107 | ## Contributing
108 | 
109 | Help is very welcome!
110 | If you notice a regression, want to add a benchmark, or see a model missing from the list, please open an issue or PR.
111 | 
112 | 1. Fork → feature branch → PR.
113 | 2. Keep the notebook clear of personal API keys – load them from environment variables.
114 | 3. Explain your change in the PR description and, if possible, link to the Weave run that validates it.
115 | 
116 | ## Acknowledgements
117 | 
118 | * **[Weave](https://github.com/wandb/weave)** for the slick dashboarding & data‑ops layer.
119 | * **[AI Explained SimpleBench](https://github.com/simple-bench/SimpleBench)** – we sponsor the project and rely on it for the question set.
120 | 
121 | ---
122 | 
123 | Made with 🧠 + ❤️ – ping me on X [@altryne](https://x.com/altryne) if you think of another model worth tracking!
124 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 | 
6 | # C extensions
7 | *.so
8 | 
9 | # Distribution / packaging
10 | .Python
11 | build/
12 | develop-eggs/
13 | dist/
14 | downloads/
15 | eggs/
16 | .eggs/
17 | lib/
18 | lib64/
19 | parts/
20 | sdist/
21 | var/
22 | wheels/
23 | share/python-wheels/
24 | *.egg-info/
25 | .installed.cfg
26 | *.egg
27 | MANIFEST
28 | 
29 | # PyInstaller
30 | # Usually these files are written by a python script from a template
31 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
32 | *.manifest
33 | *.spec
34 | 
35 | # Installer logs
36 | pip-log.txt
37 | pip-delete-this-directory.txt
38 | 
39 | # Unit test / coverage reports
40 | htmlcov/
41 | .tox/
42 | .nox/
43 | .coverage
44 | .coverage.*
45 | .cache
46 | nosetests.xml
47 | coverage.xml
48 | *.cover
49 | *.py,cover
50 | .hypothesis/
51 | .pytest_cache/
52 | cover/
53 | 
54 | # Translations
55 | *.mo
56 | *.pot
57 | 
58 | # Django stuff:
59 | *.log
60 | local_settings.py
61 | db.sqlite3
62 | db.sqlite3-journal
63 | 
64 | # Flask stuff:
65 | instance/
66 | .webassets-cache
67 | 
68 | # Scrapy stuff:
69 | .scrapy
70 | 
71 | # Sphinx documentation
72 | docs/_build/
73 | 
74 | # PyBuilder
75 | .pybuilder/
76 | target/
77 | 
78 | # Jupyter Notebook
79 | .ipynb_checkpoints
80 | 
81 | # IPython
82 | profile_default/
83 | ipython_config.py
84 | 
85 | # pyenv
86 | # For a library or package, you might want to ignore these files since the code is
87 | # intended to run in multiple environments; otherwise, check them in:
88 | # .python-version
89 | 
90 | # pipenv
91 | # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
92 | # However, in case of collaboration, if having platform-specific dependencies or dependencies
93 | # having no cross-platform support, pipenv may install dependencies that don't work, or not
94 | # install all needed dependencies.
95 | #Pipfile.lock
96 | 
97 | # UV
98 | # Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
99 | # This is especially recommended for binary packages to ensure reproducibility, and is more
100 | # commonly ignored for libraries.
101 | #uv.lock
102 | 
103 | # poetry
104 | # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
105 | # This is especially recommended for binary packages to ensure reproducibility, and is more
106 | # commonly ignored for libraries.
107 | # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
108 | #poetry.lock
109 | 
110 | # pdm
111 | # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
112 | #pdm.lock
113 | # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
114 | # in version control.
115 | # https://pdm.fming.dev/latest/usage/project/#working-with-version-control
116 | .pdm.toml
117 | .pdm-python
118 | .pdm-build/
119 | 
120 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
121 | __pypackages__/
122 | 
123 | # Celery stuff
124 | celerybeat-schedule
125 | celerybeat.pid
126 | 
127 | # SageMath parsed files
128 | *.sage.py
129 | 
130 | # Environments
131 | .env
132 | .venv
133 | env/
134 | venv/
135 | ENV/
136 | env.bak/
137 | venv.bak/
138 | 
139 | # Spyder project settings
140 | .spyderproject
141 | .spyproject
142 | 
143 | # Rope project settings
144 | .ropeproject
145 | 
146 | # mkdocs documentation
147 | /site
148 | 
149 | # mypy
150 | .mypy_cache/
151 | .dmypy.json
152 | dmypy.json
153 | 
154 | # Pyre type checker
155 | .pyre/
156 | 
157 | # pytype static type analyzer
158 | .pytype/
159 | 
160 | # Cython debug symbols
161 | cython_debug/
162 | 
163 | # PyCharm
164 | # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
165 | # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
166 | # and can be added to the global gitignore or merged into this file.  For a more nuclear
167 | # option (not recommended) you can uncomment the following to ignore the entire idea folder.
168 | #.idea/
169 | 
170 | # Ruff stuff:
171 | .ruff_cache/
172 | 
173 | # PyPI configuration file
174 | .pypirc
175 | 
--------------------------------------------------------------------------------
/weave_reasoning_tests.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "markdown",
5 |    "metadata": {},
6 |    "source": [
7 |     "# SimpleBench evals for all models - wandb.me/4dot1evals\n",
8 |     "\n",
9 |     "\n",
10 |     "Using the SimpleBench questions to evaluate the reasoning and non-reasoning performance of the latest AI models."
11 |    ]
12 |   },
13 |   {
14 |    "cell_type": "code",
15 |    "execution_count": null,
16 |    "metadata": {},
17 |    "outputs": [],
18 |    "source": [
19 |     "# Install the packages required by the notebook.\n",
20 |     "print('⏳ Installing packages')\n",
21 |     "!uv pip install -qU weave set-env-colab-kaggle-dotenv tqdm ipywidgets requests openai google-genai google-generativeai llm-json anthropic\n",
22 |     "print('✅ Packages installed')"
23 |    ]
24 |   },
25 |   {
26 |    "cell_type": "code",
27 |    "execution_count": null,
28 |    "metadata": {},
29 |    "outputs": [],
30 |    "source": [
31 |     "from set_env import set_env\n",
32 |     "from openai import OpenAI\n",
33 |     "from google import genai\n",
34 |     "from google.genai import types\n",
35 |     "import weave\n",
36 |     "from llm_json import json\n",
37 |     "import random\n",
38 |     "import os\n",
39 |     "set_env(\"WANDB_API_KEY\")\n",
40 |     "set_env(\"OPENAI_API_KEY\")\n",
41 |     "set_env(\"GEMINI_API_KEY\")\n",
42 |     "set_env(\"OPENROUTER_API_KEY\")\n",
43 |     "set_env(\"CEREBRAS_API_KEY\")\n",
44 |     "set_env(\"ANTHROPIC_API_KEY\")\n",
45 |     "\n",
46 |     "print('✅ Environment variables set')\n",
47 |     "\n",
48 |     "# Initialize Weave and the default OpenAI client\n",
49 |     "weave_client = weave.init('o3-tests')\n",
50 |     "client = OpenAI()"
51 |    ]
52 |   },
53 |   {
54 |    "cell_type": "code",
55 |    "execution_count": null,
56 |    "metadata": {},
57 |    "outputs": [],
58 |    "source": [
59 |     "SYSTEM_MESSAGE = \"\"\"\n",
60 |     "You're given a multiple-choice trick question. The answer will not be the obvious one.\n",
61 |     "The question may contain, and is likely phrased with, many distractions. Your task is to identify the basic flaws and reason about the true answer.\n",
62 |     "It may help to think carefully about the location/physics of everything mentioned in the question.\n",
63 |     "Many of these questions may be variations of riddles you have been trained on in the past, but containing a trick that is easy for humans to identify.\n",
64 |     "\n",
65 |     "Start by rephrasing the original question, taking special notice of details that seem innocuous but can subvert its original meaning.\n",
66 |     "Then extract a few facts in the form of numbered statements as factual_statements to help you reason towards the correct answer.\n",
67 |     "From there deduce the best answer – not the answer that would make the question make sense, but the correct answer for that specific weird scenario.\n",
\n", 68 | "\n", 69 | "Your response should be strictly in the JSON format with no backticks or other formatting:\n", 70 | " {\n", 71 | " \"rewritten_question\": \"rewritten_question\",\n", 72 | " \"factual_statements\": [\"statement1\", \"statement2\", \"statement3\"],\n", 73 | " \"explanation\": \"explanation\",\n", 74 | " \"correct_answer\": \"X\"\n", 75 | " }\n", 76 | "where X is one of the letters A, B, C, D, E, or F.\n", 77 | "\"\"\"\n" 78 | ] 79 | }, 80 | { 81 | "cell_type": "code", 82 | "execution_count": null, 83 | "metadata": {}, 84 | "outputs": [], 85 | "source": [ 86 | "from weave import Model, Evaluation\n", 87 | "from llm_json import json\n", 88 | "import os\n", 89 | "os.environ['WEAVE_PARALLELISM'] = '1'\n", 90 | "\n", 91 | "# Load the dataset\n", 92 | "def load_dataset():\n", 93 | " dataset = weave.ref(\"weave:///simplebench/simple_bench_public/object/competition_dataset:qNJnkgpMqoyc48GwlFCSpVypn3D8x77N7lGCBIab4XQ\").get()\n", 94 | " # Convert the data into the format we need\n", 95 | " return [{\n", 96 | " \"question\": item[\"prompt\"],\n", 97 | " \"question_id\": item[\"question_id\"],\n", 98 | " \"expected\": item[\"answer\"]\n", 99 | " } for item in dataset]\n", 100 | "\n", 101 | "class gpt4point1(Model):\n", 102 | " model: str = \"gpt-4.1-2025-04-14\"\n", 103 | " sandwich: bool = False\n", 104 | "\n", 105 | " @weave.op()\n", 106 | " def predict(self, question: str) -> dict:\n", 107 | " messages = [\n", 108 | " {\"role\": \"developer\", \"content\": SYSTEM_MESSAGE},\n", 109 | " {\"role\": \"user\", \"content\": question}\n", 110 | " ]\n", 111 | " if self.sandwich:\n", 112 | " messages.append({\n", 113 | " \"role\": \"developer\", \n", 114 | " \"content\": SYSTEM_MESSAGE\n", 115 | " })\n", 116 | " response = client.chat.completions.create(\n", 117 | " model=self.model,\n", 118 | " messages=messages\n", 119 | " )\n", 120 | " json_response = json.loads(response.choices[0].message.content.strip())\n", 121 | " return json_response\n", 122 | "\n", 123 | "# Define the model class with different reasoning efforts\n", 124 | "class o3Reasoning(Model):\n", 125 | " model: str = \"o3-mini-2025-01-31\"\n", 126 | " reasoning_effort: str # \"low\", \"medium\", or \"high\"\n", 127 | "\n", 128 | " @weave.op()\n", 129 | " def predict(self, question: str) -> dict:\n", 130 | "\n", 131 | " response = client.chat.completions.create(\n", 132 | " model=self.model,\n", 133 | " reasoning_effort=self.reasoning_effort,\n", 134 | " messages=[\n", 135 | " {\"role\": \"developer\", \"content\": SYSTEM_MESSAGE},\n", 136 | " {\"role\": \"user\", \"content\": question}\n", 137 | " ]\n", 138 | " )\n", 139 | " json_response = json.loads(response.choices[0].message.content.strip())\n", 140 | " return json_response\n", 141 | " # Define the model class with different reasoning efforts\n", 142 | "class o1Reasoning(Model):\n", 143 | " model: str = \"o1-2024-12-17\"\n", 144 | " reasoning_effort: str # \"low\", \"medium\", or \"high\"\n", 145 | "\n", 146 | " @weave.op()\n", 147 | " def predict(self, question: str) -> dict:\n", 148 | "\n", 149 | " response = client.chat.completions.create(\n", 150 | " model=self.model,\n", 151 | " reasoning_effort=self.reasoning_effort,\n", 152 | " messages=[\n", 153 | " {\"role\": \"developer\", \"content\": SYSTEM_MESSAGE},\n", 154 | " {\"role\": \"user\", \"content\": question}\n", 155 | " ]\n", 156 | " )\n", 157 | " json_response = json.loads(response.choices[0].message.content.strip())\n", 158 | " return json_response\n", 159 | "\n", 160 | "class 
161 |     "    model: str = \"gemini-2.5-pro-exp-03-25\"\n",
162 |     "    thinking_budget: int = 1000  # You can tweak this variable to any value between 0 and 24k\n",
163 |     "    thinking: bool = False\n",
164 |     "\n",
165 |     "    @weave.op()\n",
166 |     "    def predict(self, question: str) -> dict:\n",
167 |     "        from google import genai\n",
168 |     "        from google.genai import types\n",
169 |     "        MODEL_ID = self.model\n",
170 |     "        prompt = question\n",
171 |     "        budget = self.thinking_budget\n",
172 |     "\n",
173 |     "        client = genai.Client(api_key=os.getenv(\"GEMINI_API_KEY\"))\n",
174 |     "        response = client.models.generate_content(\n",
175 |     "            model=MODEL_ID,\n",
176 |     "            contents=f\"{SYSTEM_MESSAGE}\\n\\n{prompt}\",\n",
177 |     "            config=types.GenerateContentConfig(\n",
178 |     "                thinking_config=types.ThinkingConfig(\n",
179 |     "                    include_thoughts=self.thinking,\n",
180 |     "                    thinking_budget=budget\n",
181 |     "                )\n",
182 |     "            )\n",
183 |     "        )\n",
184 |     "        json_response = json.loads(response.text.strip())\n",
185 |     "        return json_response\n",
186 |     "\n",
187 |     "class CerebrasModel(Model):\n",
188 |     "    model: str = \"deepseek-r1-distill-llama-70b\"\n",
189 |     "\n",
190 |     "\n",
191 |     "    @weave.op()\n",
192 |     "    def predict(self, question: str) -> dict:\n",
193 |     "        cerebras_client = OpenAI(api_key=os.getenv(\"CEREBRAS_API_KEY\"), base_url=\"https://api.cerebras.ai/v1\")\n",
194 |     "        response = cerebras_client.chat.completions.create(\n",
195 |     "            model=self.model,\n",
196 |     "            messages=[\n",
197 |     "                {\"role\": \"system\", \"content\": SYSTEM_MESSAGE},\n",
198 |     "                {\"role\": \"user\", \"content\": question}\n",
199 |     "            ]\n",
200 |     "        )\n",
201 |     "        json_response = json.loads(response.choices[0].message.content.strip())\n",
202 |     "        return json_response\n",
203 |     "\n",
204 |     "\n",
205 |     "class DeepSeekV3Model(Model):\n",
206 |     "    model: str = \"deepseek/deepseek-chat-v3-0324\"\n",
207 |     "\n",
208 |     "    @weave.op()\n",
209 |     "    def predict(self, question: str) -> dict:\n",
210 |     "        openrouterclient = OpenAI(\n",
211 |     "            api_key=os.getenv(\"OPENROUTER_API_KEY\"),\n",
212 |     "            base_url=\"https://openrouter.ai/api/v1\")\n",
213 |     "\n",
214 |     "        chat_completion = openrouterclient.chat.completions.create(\n",
215 |     "            model=self.model,\n",
216 |     "            messages=[\n",
217 |     "                {\"role\": \"system\", \"content\": SYSTEM_MESSAGE},\n",
218 |     "                {\"role\": \"user\", \"content\": question},\n",
219 |     "            ],\n",
220 |     "            temperature=0.7,\n",
221 |     "            max_tokens=1024,\n",
222 |     "            # extra_body={\n",
223 |     "            #     \"provider\": {\n",
224 |     "            #         \"order\": [\"DeepSeek\"],\n",
225 |     "            #         \"allow_fallbacks\": False,\n",
226 |     "            #     }\n",
227 |     "            # }\n",
228 |     "        )\n",
229 |     "        print(chat_completion)  # debug: inspect the raw OpenRouter response\n",
230 |     "        json_response = json.loads(chat_completion.choices[0].message.content.strip())\n",
231 |     "        return json_response\n",
232 |     "\n",
233 |     "class DeepSeekR1Model(Model):\n",
234 |     "    model: str = \"deepseek/deepseek-r1:free\"\n",
235 |     "\n",
236 |     "    @weave.op()\n",
237 |     "    def predict(self, question: str) -> dict:\n",
238 |     "        openrouterclient = OpenAI(\n",
239 |     "            api_key=os.getenv(\"OPENROUTER_API_KEY\"),\n",
240 |     "            base_url=\"https://openrouter.ai/api/v1\")\n",
241 |     "\n",
242 |     "        # Send the question to OpenRouter with reasoning enabled\n",
243 |     "        chat_completion = openrouterclient.chat.completions.create(\n",
244 |     "            model=self.model,\n",
245 |     "            messages=[\n",
246 |     "                {\"role\": \"system\", \"content\": SYSTEM_MESSAGE},\n",
247 |     "                {\"role\": \"user\", \"content\": question},\n",
248 |     "            ],\n",
249 |     "            extra_body={\n",
250 |     "                \"include_reasoning\": True,\n",
\"provider\": {\n", 252 | " \"order\": [\"Chutes\", \"Together\"],\n", 253 | " \"allow_fallbacks\": True,\n", 254 | " }\n", 255 | " })\n", 256 | " # remove the think part of the response between and \n", 257 | " text = chat_completion.choices[0].message.content.strip()\n", 258 | "\n", 259 | " json_response = json.loads(text)\n", 260 | " return json_response\n", 261 | "\n", 262 | "# Define the scorer\n", 263 | "@weave.op()\n", 264 | "def accuracy_scorer(expected: str, model_output: dict) -> dict:\n", 265 | " return {\n", 266 | " \"correct\": expected == model_output[\"correct_answer\"]\n", 267 | " }\n", 268 | "\n", 269 | "# Create the evaluation\n", 270 | "async def run_evaluation():\n", 271 | " # Create dataset\n", 272 | " dataset = load_dataset()\n", 273 | " \n", 274 | " # Create models with different reasoning efforts\n", 275 | " models = [\n", 276 | " # {\n", 277 | " # \"model\": DeepSeekR1Model(model=\"deepseek/deepseek-r1:free\"),\n", 278 | " # \"display_name\": \"DeepSeek R1\"\n", 279 | " # },\n", 280 | " # {\n", 281 | " # \"model\": o1Reasoning(reasoning_effort=\"low\"),\n", 282 | " # \"display_name\": \"O1 (Low)\"\n", 283 | " # },\n", 284 | " # {\n", 285 | " # \"model\": o1Reasoning(reasoning_effort=\"medium\"),\n", 286 | " # \"display_name\": \"O1 (Medium)\"\n", 287 | " # },\n", 288 | " # {\n", 289 | " # \"model\": o1Reasoning(reasoning_effort=\"high\"),\n", 290 | " # \"display_name\": \"O1 (High)\"\n", 291 | " # },\n", 292 | " # {\n", 293 | " # \"model\": o3Reasoning(reasoning_effort=\"low\", model='o3'),\n", 294 | " # \"display_name\": \"O3 (Low)\"\n", 295 | " # },\n", 296 | " # {\n", 297 | " # \"model\": o3Reasoning(reasoning_effort=\"medium\", model='o3'),\n", 298 | " # \"display_name\": \"O3 (Medium)\"\n", 299 | " # },\n", 300 | " # {\n", 301 | " # \"model\": o3Reasoning(reasoning_effort=\"low\", model='o4-mini'),\n", 302 | " # \"display_name\": \"O4-mini (Low)\"\n", 303 | " # },\n", 304 | " # {\n", 305 | " # \"model\": o3Reasoning(reasoning_effort=\"medium\", model='o4-mini'),\n", 306 | " # \"display_name\": \"O4-mini (Medium)\"\n", 307 | " # },\n", 308 | " # {\n", 309 | " # \"model\": o3Reasoning(reasoning_effort=\"high\", model='o4-mini'),\n", 310 | " # \"display_name\": \"O4-mini (High)\"\n", 311 | " # },\n", 312 | " # {\n", 313 | " # \"model\": o3Reasoning(reasoning_effort=\"medium\"), \n", 314 | " # \"display_name\": \"O3 Mini (Medium)\"\n", 315 | " # },\n", 316 | " # {\n", 317 | " # \"model\": o3Reasoning(reasoning_effort=\"high\"),\n", 318 | " # \"display_name\": \"O3 Mini (High)\"\n", 319 | " # },\n", 320 | " # {\n", 321 | " # \"model\": GeminiReasoningModel(),\n", 322 | " # \"display_name\": \"Gemini 2.5\"\n", 323 | " # },\n", 324 | " {\n", 325 | " \"model\": GeminiReasoningModel(thinking_budget=0, thinking=False),\n", 326 | " \"display_name\": \"2.5 Flash\"\n", 327 | " },\n", 328 | " {\n", 329 | " \"model\": GeminiReasoningModel(thinking_budget=1000, thinking=True),\n", 330 | " \"display_name\": \"2.5 Flash (1000 thinking)\"\n", 331 | " },\n", 332 | " {\n", 333 | " \"model\": GeminiReasoningModel(thinking_budget=12000, thinking=True),\n", 334 | " \"display_name\": \"2.5 Flash (12k thinking)\"\n", 335 | " },\n", 336 | " # {\n", 337 | " # \"model\": gpt4point1(model=\"gpt-4.1-2025-04-14\", sandwich=True),\n", 338 | " # \"display_name\": \"GPT-4.1 🥪\"\n", 339 | " # },\n", 340 | " # {\n", 341 | " # \"model\": gpt4point1(model=\"gpt-4.1-mini-2025-04-14\", sandwich=True),\n", 342 | " # \"display_name\": \"4.1-mini 🥪\"\n", 343 | " # },\n", 344 | " # {\n", 345 | " # 
\"model\": gpt4point1(model=\"gpt-4.1-nano-2025-04-14\", sandwich=True),\n", 346 | " # \"display_name\": \"4.1-nano 🥪\"\n", 347 | " # },\n", 348 | " # {\n", 349 | " # \"model\": gpt4point1(model=\"gpt-4o\"),\n", 350 | " # \"display_name\": \"GPT 4o\"\n", 351 | " # },\n", 352 | " # {\n", 353 | " # \"model\": gpt4point1(model=\"gpt-4o-mini\"),\n", 354 | " # \"display_name\": \"GPT 4o-mini\"\n", 355 | " # },\n", 356 | " # {\n", 357 | " # \"model\": gpt4point1(model=\"gpt-4.5-preview\"),\n", 358 | " # \"display_name\": \"GPT 4.5\"\n", 359 | " # },\n", 360 | " # {\n", 361 | " # \"model\": DeepSeekV3Model(model=\"deepseek/deepseek-chat-v3-0324\"),\n", 362 | " # \"display_name\": \"DS V3-0324\"\n", 363 | " # },\n", 364 | " # {\n", 365 | " # \"model\": CerebrasModel(model=\"deepseek-r1-distill-llama-70b\"),\n", 366 | " # \"display_name\": \"Cerebras (Thinking)\"\n", 367 | " # }\n", 368 | " ]\n", 369 | " \n", 370 | " # Create evaluation\n", 371 | " evaluation = Evaluation(\n", 372 | " dataset=dataset,\n", 373 | " trials=3,\n", 374 | " scorers=[accuracy_scorer]\n", 375 | " )\n", 376 | " \n", 377 | " # Run evaluation for each model\n", 378 | " for model in models:\n", 379 | " await evaluation.evaluate(\n", 380 | " model[\"model\"],\n", 381 | " __weave={\"display_name\": model[\"display_name\"]}\n", 382 | " )\n", 383 | "\n", 384 | "# Run the evaluation\n", 385 | "await run_evaluation()" 386 | ] 387 | } 388 | ], 389 | "metadata": { 390 | "kernelspec": { 391 | "display_name": ".venv", 392 | "language": "python", 393 | "name": "python3" 394 | }, 395 | "language_info": { 396 | "codemirror_mode": { 397 | "name": "ipython", 398 | "version": 3 399 | }, 400 | "file_extension": ".py", 401 | "mimetype": "text/x-python", 402 | "name": "python", 403 | "nbconvert_exporter": "python", 404 | "pygments_lexer": "ipython3", 405 | "version": "3.13.3" 406 | } 407 | }, 408 | "nbformat": 4, 409 | "nbformat_minor": 2 410 | } 411 | --------------------------------------------------------------------------------