├── requirements.txt
├── .env.example
├── templates
    └── post.html.jinja
├── README.md
├── .gitignore
└── eval.ipynb


/requirements.txt:
--------------------------------------------------------------------------------
1 | weave 
2 | gradio 
3 | set-env-colab-kaggle-dotenv 
4 | tqdm 
5 | ipykernel
6 | ipywidgets 
7 | requests 
8 | openai 
9 | pillow


--------------------------------------------------------------------------------
/.env.example:
--------------------------------------------------------------------------------
1 | OPENAI_API_KEY=your-openai-api-key-here
2 | WANDB_API_KEY=your-wandb-api-key-here
3 | GEMINI_API_KEY=your-gemini-api-key-here
4 | OPENROUTER_API_KEY=your-openrouter-api-key-here


--------------------------------------------------------------------------------
/templates/post.html.jinja:
--------------------------------------------------------------------------------
 1 | <div style="max-width: 600px; margin: 20px auto; border: 1px solid #e0e0e0; border-radius: 8px; background-color: #ffffff; padding: 16px; box-shadow: 0 2px 8px rgba(0, 0, 0, 0.1);">
 2 |     <div style="display: flex; align-items: center;">
 3 |         <img src="{{ author.avatar }}" alt="Avatar" style="width: 40px; height: 40px; border-radius: 50%; margin-right: 10px;">
 4 |         <div>
 5 |             <span style="font-weight: bold; color: #333;">{{ author.displayName }}</span>
 6 |             <a href="https://bsky.app/profile/{{ author.handle }}" target="_blank" style="color: #888; font-size: 12px; text-decoration: none; hover: { color: #1da1f2 };"> @{{ author.handle }}</a>
 7 |             <div style="color: #888; font-size: 12px;"><a href="{{ post_url }}" target="_blank" style="color: #888; text-decoration: none; hover: { color: #1da1f2 };">{{ created_at }}</a></div>
 8 |         </div>
 9 |     </div>
10 |     <div style="margin-top: 10px; color: #333;">
11 |         <p style="color: #333; font-size: 16px;">{{ text }}</p>
12 |     </div>
13 |     <div style="margin-top: 10px; display: flex; justify-content: space-between; align-items: center;">
14 |         <div>
15 |             <span style="color: #ff4d4d;">❤️ {{ like_count }}</span>
16 |             <span style="color: #1da1f2; margin-left: 10px;">🔁 {{ repost_count }}</span>
17 |         </div>
18 |         {% if has_image %}
19 |         <span style="color: #1da1f2;">[Image]</span>
20 |         {% endif %}
21 |     </div>
22 | </div> 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # LLM Evaluations Workshop
 2 | 
 3 | A workshop project demonstrating how to build and evaluate LLM-powered classification systems using real-world data from Bluesky social network.
 4 | 
 5 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/altryne/llm-evals-workshop/blob/main/eval.ipynb) 
 6 | 
 7 | ## Overview
 8 | 
 9 | This workshop explores a comprehensive methodology for productizing robust LLM applications through tracing, dataset creation, and evaluation. Using [W&B Weave](https://wandb.me/weave-workshop-jan), we'll explore how to build reliable evaluation pipelines for LLM applications.
10 | 
11 | The project showcases a practical example of building an LLM evaluation pipeline using:
12 | - Bluesky posts as source data
13 | - OpenAI's GPT-4 and other LLMs for classification
14 | - Weights & Biases (Weave) for evaluation tracking and dataset versioning
15 | - Gradio for the interactive UI
16 | 
17 | ## Setup
18 | 
19 | You can run this workshop directly in Colab by clicking the badge above. To run locally:
20 | 
21 | 1. Clone the repository:
22 | ```bash
23 | git clone https://github.com/altryne/llm-evals-workshop
24 | ```
25 | 
26 | 2. Install dependencies:
27 | 
28 | > **Note**: This project requires Python 3.10 or higher
29 | 
30 | ```bash
31 | pip install uv
32 | uv pip install -r requirements.txt
33 | ```
34 | 
35 | 3. Set up environment variables:
36 | Copy `.env.example` to `.env` and fill in your credentials:
37 | ```
38 | WANDB_API_KEY=your-wandb-api-key-here
39 | OPENAI_API_KEY=your-openai-api-key-here
40 | GEMINI_API_KEY=your-gemini-api-key-here
41 | OPENROUTER_API_KEY=your-openrouter-api-key-here
42 | ```
43 | 
44 | ## Features
45 | 
46 | - **Interactive UI**: Built with Gradio for easy post classification and feedback collection
47 | - **Evaluation Pipeline**: Uses Weights & Biases Weave for:
48 |   - Tracing LLM calls and responses
49 |   - Dataset versioning and management
50 |   - Evaluation tracking and analysis
51 | - **Dataset Creation**: Tools for building and annotating datasets from Bluesky posts
52 | - **Multi-Model Support**: Supports multiple LLM providers (OpenAI, Gemini, OpenRouter)
53 | - **Comprehensive Evaluation Methods**:
54 |   - Programmatic scoring for structured outputs
55 |   - Human-in-the-loop (HITL) annotations
56 |   - LLM-as-judge evaluations
57 | 
58 | ## Evaluation Approaches
59 | 
60 | The workshop covers three main evaluation methods:
61 | 
62 | 1. **Programmatic Scoring**
63 |    - Fast and reliable for structured outputs
64 |    - Uses string matching and regex
65 |    - Best for exact match or pattern-based evaluation
66 |    - Example: Checking if LLM classification matches ground truth
67 | 
68 | 2. **Human-in-the-Loop (HITL)**
69 |    - Manual review and annotation
70 |    - Creates high-quality ground truth data
71 |    - Used for kickstarting evaluation datasets
72 |    - Interactive UI for efficient annotation
73 | 
74 | 3. **LLM-as-Judge**
75 |    - Uses LLMs to evaluate other LLMs
76 |    - Handles open-ended responses
77 |    - Cost-effective alternative to human evaluation
78 |    - Includes best practices and limitations
79 | 
80 | ## Usage
81 | 
82 | 1. Run the Jupyter notebook:
83 | 
84 | 
85 | 2. Follow the instructions, in the notebook. The items to do yourself are marked with `#TODO`
86 | 
87 | ## Project Structure
88 | 
89 | - `eval.ipynb`: Main notebook with implementation and UI
90 | - `templates/`: HTML templates for post display
91 | - `data/`: JSON files containing Bluesky posts
92 | - `.env`: Configuration for API keys and credentials
93 | 
94 | ## Author
95 | 
96 | Created by [Alex Volkov](https://twitter.com/altryne) for Weights & Biases
97 | 


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
  1 | # Environment variables
  2 | .env
  3 | .venv/
  4 | # Python
  5 | __pycache__/
  6 | *.py[cod]
  7 | *$py.class
  8 | 
  9 | # C extensions
 10 | *.so
 11 | 
 12 | # Distribution / packaging
 13 | .Python
 14 | build/
 15 | develop-eggs/
 16 | dist/
 17 | downloads/
 18 | eggs/
 19 | .eggs/
 20 | lib/
 21 | lib64/
 22 | parts/
 23 | sdist/
 24 | var/
 25 | wheels/
 26 | share/python-wheels/
 27 | *.egg-info/
 28 | .installed.cfg
 29 | *.egg
 30 | MANIFEST
 31 | 
 32 | # PyInstaller
 33 | #  Usually these files are written by a python script from a template
 34 | #  before PyInstaller builds the exe, so as to inject date/other infos into it.
 35 | *.manifest
 36 | *.spec
 37 | 
 38 | # Installer logs
 39 | pip-log.txt
 40 | pip-delete-this-directory.txt
 41 | 
 42 | # Unit test / coverage reports
 43 | htmlcov/
 44 | .tox/
 45 | .nox/
 46 | .coverage
 47 | .coverage.*
 48 | .cache
 49 | nosetests.xml
 50 | coverage.xml
 51 | *.cover
 52 | *.py,cover
 53 | .hypothesis/
 54 | .pytest_cache/
 55 | cover/
 56 | 
 57 | # Translations
 58 | *.mo
 59 | *.pot
 60 | 
 61 | # Django stuff:
 62 | *.log
 63 | local_settings.py
 64 | db.sqlite3
 65 | db.sqlite3-journal
 66 | 
 67 | # Flask stuff:
 68 | instance/
 69 | .webassets-cache
 70 | 
 71 | # Scrapy stuff:
 72 | .scrapy
 73 | 
 74 | # Sphinx documentation
 75 | docs/_build/
 76 | 
 77 | # PyBuilder
 78 | .pybuilder/
 79 | target/
 80 | 
 81 | # Jupyter Notebook
 82 | .ipynb_checkpoints
 83 | 
 84 | # IPython
 85 | profile_default/
 86 | ipython_config.py
 87 | 
 88 | # pyenv
 89 | #   For a library or package, you might want to ignore these files since the code is
 90 | #   intended to run in multiple environments; otherwise, check them in:
 91 | # .python-version
 92 | 
 93 | # pipenv
 94 | #   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
 95 | #   However, in case of collaboration, if having platform-specific dependencies or dependencies
 96 | #   having no cross-platform support, pipenv may install dependencies that don't work, or not
 97 | #   install all needed dependencies.
 98 | #Pipfile.lock
 99 | 
100 | # UV
101 | #   Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control.
102 | #   This is especially recommended for binary packages to ensure reproducibility, and is more
103 | #   commonly ignored for libraries.
104 | #uv.lock
105 | 
106 | # poetry
107 | #   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
108 | #   This is especially recommended for binary packages to ensure reproducibility, and is more
109 | #   commonly ignored for libraries.
110 | #   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
111 | #poetry.lock
112 | 
113 | # pdm
114 | #   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
115 | #pdm.lock
116 | #   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
117 | #   in version control.
118 | #   https://pdm.fming.dev/latest/usage/project/#working-with-version-control
119 | .pdm.toml
120 | .pdm-python
121 | .pdm-build/
122 | 
123 | # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
124 | __pypackages__/
125 | 
126 | # Celery stuff
127 | celerybeat-schedule
128 | celerybeat.pid
129 | 
130 | # SageMath parsed files
131 | *.sage.py
132 | 
133 | # Environments
134 | .env
135 | .venv
136 | env/
137 | venv/
138 | ENV/
139 | env.bak/
140 | venv.bak/
141 | 
142 | # Spyder project settings
143 | .spyderproject
144 | .spyproject
145 | 
146 | # Rope project settings
147 | .ropeproject
148 | 
149 | # mkdocs documentation
150 | /site
151 | 
152 | # mypy
153 | .mypy_cache/
154 | .dmypy.json
155 | dmypy.json
156 | 
157 | # Pyre type checker
158 | .pyre/
159 | 
160 | # pytype static type analyzer
161 | .pytype/
162 | 
163 | # Cython debug symbols
164 | cython_debug/
165 | 
166 | # PyCharm
167 | #  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
168 | #  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
169 | #  and can be added to the global gitignore or merged into this file.  For a more nuclear
170 | #  option (not recommended) you can uncomment the following to ignore the entire idea folder.
171 | #.idea/
172 | 
173 | # PyPI configuration file
174 | .pypirc
175 | 


--------------------------------------------------------------------------------
/eval.ipynb:
--------------------------------------------------------------------------------
   1 | {
   2 |  "cells": [
   3 |   {
   4 |    "cell_type": "markdown",
   5 |    "metadata": {},
   6 |    "source": [
   7 |     "# LLMs in production - Trace, Compile, Evals - by Weights & Biases\n",
   8 |     "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/altryne/llm-evals-workshop/blob/main/eval.ipynb) [![Weights & Biases](https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-gradient.svg)](https://wandb.me/weave-workshop-jan)\n",
   9 |     "\n",
  10 |     "\n",
  11 |     "\n",
  12 |     "\n",
  13 |     "# Intro\n",
  14 |     "This notebook is accompanying a workshop, that will walk you through common patterns in building evaluations for LLMs, and useful rules of thumb to follow when doing so using [W&B Weave](https://wandb.me/weave-workshop-jan)\n",
  15 |     "\n",
  16 |     "We'll explore the following methodology for productizing robust LLM applications: \n",
  17 |     "\n",
  18 |     "![three](https://gist.github.com/user-attachments/assets/0d51de65-8ec7-4cc5-a102-5a13229f5531)\n",
  19 |     "\n",
  20 |     "\n",
  21 |     "Make sure to set your WANDB_API_KEY (get your key from [here](https://wandb.ai/authorize)) and OPENROUTER_API_KEY (or OPENAI_API_KEY if you have that) in the environment variables.\n",
  22 |     "\n",
  23 |     "If you're running in Colab, set the variables in the keys section on the left. \n",
  24 |     "\n",
  25 |     "If you want to self explore, find the `#TODO:` comments and replace them with your own code, then run the cell.\n",
  26 |     "\n",
  27 |     "Prepared by [Alex Volkov](https://twitter.com/altryne)"
  28 |    ]
  29 |   },
  30 |   {
  31 |    "cell_type": "code",
  32 |    "execution_count": null,
  33 |    "metadata": {},
  34 |    "outputs": [],
  35 |    "source": [
  36 |     "# Install and read in required packages\n",
  37 |     "try:\n",
  38 |     "    import google.colab\n",
  39 |     "    !git clone -q --branch main https://github.com/altryne/llm-evals-workshop\n",
  40 |     "    %cd llm-evals-workshop\n",
  41 |     "except ImportError:\n",
  42 |     "    pass\n",
  43 |     "\n",
  44 |     "print('⏳ Installing packages')\n",
  45 |     "%pip install -q uv\n",
  46 |     "!uv pip install -q --system 'weave[scorers]' gradio set-env-colab-kaggle-dotenv tqdm ipywidgets requests openai pillow\n",
  47 |     "print('✅ Packages installed')"
  48 |    ]
  49 |   },
  50 |   {
  51 |    "cell_type": "code",
  52 |    "execution_count": null,
  53 |    "metadata": {},
  54 |    "outputs": [],
  55 |    "source": [
  56 |     "\n",
  57 |     "%load_ext gradio\n",
  58 |     "\n",
  59 |     "import gradio as gr\n",
  60 |     "from PIL import Image\n",
  61 |     "import requests \n",
  62 |     "import io\n",
  63 |     "from set_env import set_env\n",
  64 |     "import json\n",
  65 |     "from jinja2 import Environment, FileSystemLoader\n",
  66 |     "from datetime import datetime\n",
  67 |     "import random\n",
  68 |     "import os\n",
  69 |     "from openai import OpenAI\n",
  70 |     "from dotenv import load_dotenv\n",
  71 |     "import pandas as pd\n",
  72 |     "import weave\n",
  73 |     "from weave.flow.annotation_spec import AnnotationSpec\n",
  74 |     "\n",
  75 |     "load_dotenv()\n",
  76 |     "set_env(\"WANDB_API_KEY\")\n",
  77 |     "set_env(\"OPENAI_API_KEY\")\n",
  78 |     "set_env(\"OPENROUTER_API_KEY\")\n",
  79 |     "\n",
  80 |     "# initialize weave\n",
  81 |     "weave_api = weave.init('AITT-evals-workshop')\n",
  82 |     "\n",
  83 |     "# initialize annotations for this project\n",
  84 |     "annotation = weave.publish(AnnotationSpec(\n",
  85 |     "    name=\"Doomer or Boomer\",\n",
  86 |     "    description=\"Doomer or Boomer or Neither\",\n",
  87 |     "    field_schema={ \"type\": \"string\", \"enum\": [\"Doomer\", \"Boomer\", \"Neither\"],},\n",
  88 |     "), \"doomer_or_boomer\")\n",
  89 |     "\n",
  90 |     "annotation_reason = weave.publish(AnnotationSpec(\n",
  91 |     "    name=\"Reason\",\n",
  92 |     "    description=\"Reason why you chose this value, write before clicking.\",\n",
  93 |     "    field_schema={ \"type\": \"string\"},\n",
  94 |     "), \"reason\")"
  95 |    ]
  96 |   },
  97 |   {
  98 |    "cell_type": "code",
  99 |    "execution_count": 5,
 100 |    "metadata": {},
 101 |    "outputs": [],
 102 |    "source": [
 103 |     "# Initialize our LLM client, we'll use either Gemini or OpenAI\n",
 104 |     "API_PROVIDER = 'OpenAI' # @param [\"Gemini\", \"OpenAI\", \"OpenRouter\"]\n",
 105 |     "if API_PROVIDER == 'Gemini':\n",
 106 |     "    client = OpenAI(\n",
 107 |     "        api_key=os.getenv(\"GEMINI_API_KEY\"),\n",
 108 |     "        base_url=\"https://generativelanguage.googleapis.com/v1beta/\",\n",
 109 |     "    )\n",
 110 |     "    model = \"gemini-2.0-flash-exp\"\n",
 111 |     "elif API_PROVIDER == 'OpenRouter':\n",
 112 |     "    client = OpenAI(\n",
 113 |     "        api_key=os.getenv(\"OPENROUTER_API_KEY\"),\n",
 114 |     "        base_url=\"https://openrouter.ai/api/v1\",\n",
 115 |     "    )\n",
 116 |     "    model = \"openai/chatgpt-4o-latest\"\n",
 117 |     "    # model = \"google/gemini-flash-1.5-exp\"\n",
 118 |     "    # model = \"deepseek/deepseek-chat\"\n",
 119 |     "else:\n",
 120 |     "    client = OpenAI()\n",
 121 |     "    model = \"chatgpt-4o-latest\"\n",
 122 |     "\n",
 123 |     "# Load the Jinja2 environment\n",
 124 |     "env = Environment(loader=FileSystemLoader('templates'))\n",
 125 |     "template = env.get_template('post.html.jinja')\n",
 126 |     "\n",
 127 |     "# Load replies data\n",
 128 |     "def load_replies():\n",
 129 |     "    replies = []\n",
 130 |     "    # Load replies from both files\n",
 131 |     "    with open('data/replies_alpin.json', 'r') as f:\n",
 132 |     "        data = json.load(f)\n",
 133 |     "        replies.extend(data['thread']['replies'])\n",
 134 |     "    with open('data/replies_daniel.json', 'r') as f:\n",
 135 |     "        data = json.load(f)\n",
 136 |     "        replies.extend(data['thread']['replies'])\n",
 137 |     "    return replies\n",
 138 |     "\n",
 139 |     "\n",
 140 |     "def get_random_post_and_analyze():\n",
 141 |     "    replies = load_replies()\n",
 142 |     "    post = random.choice(replies)\n",
 143 |     "    \n",
 144 |     "    # Format the post data for the template\n",
 145 |     "    created_at = datetime.fromisoformat(post['post']['record']['createdAt'].replace('Z', '+00:00'))\n",
 146 |     "    formatted_date = created_at.strftime('%b %d, %Y, %I:%M %p')\n",
 147 |     "    \n",
 148 |     "    # Convert AT URI to bsky.app URL\n",
 149 |     "    at_uri = post['post']['uri']\n",
 150 |     "    _, _, author_did, _, post_id = at_uri.split('/')\n",
 151 |     "    post_url = f\"https://bsky.app/profile/{post['post']['author']['handle']}/post/{post_id}\"\n",
 152 |     "    \n",
 153 |     "    # Analyze the post\n",
 154 |     "    #download the avatar and convert to PIL image\n",
 155 |     "    avatar_uri = post['post']['author'].get('avatar')\n",
 156 |     "    avatar_response = requests.get(avatar_uri)\n",
 157 |     "    avatar_pil = Image.open(io.BytesIO(avatar_response.content))\n",
 158 |     "\n",
 159 |     "    response_dict = analyze_post_sentiment(avatar_pil, post['post']['author']['displayName'], post['post']['record']['text'])\n",
 160 |     "    analysis = response_dict['llm_classification']\n",
 161 |     "    weave_call_id = response_dict['weave_call_id']\n",
 162 |     "    \n",
 163 |     "    post_data = {\n",
 164 |     "        'author': post['post']['author'],\n",
 165 |     "        'created_at': formatted_date,\n",
 166 |     "        'text': post['post']['record']['text'],\n",
 167 |     "        'like_count': post['post'].get('likeCount', 0),\n",
 168 |     "        'repost_count': post['post'].get('repostCount', 0),\n",
 169 |     "        'has_image': False,\n",
 170 |     "        'post_url': post_url\n",
 171 |     "    }\n",
 172 |     "    \n",
 173 |     "    return template.render(**post_data), analysis, weave_call_id, ''\n",
 174 |     "\n",
 175 |     "\n",
 176 |     "def submit_feedback(user_selection, reason, weave_call_id):\n",
 177 |     "    \"\"\"\n",
 178 |     "    Example function that could send user feedback (the user_selection)\n",
 179 |     "    and the weave_call_id to your Weave (or any other) API.\n",
 180 |     "    \"\"\"\n",
 181 |     "    call = weave_api.get_call(weave_call_id)\n",
 182 |     "    \n",
 183 |     "    if not call:\n",
 184 |     "        raise Exception('No Weave call ID found, have you tried adding @weave.op to the analyze_post_sentiment function?')\n",
 185 |     "    \n",
 186 |     "    if reason:\n",
 187 |     "        reason_resp = weave_api.server.feedback_create(\n",
 188 |     "            {\n",
 189 |     "            \"project_id\": weave_api._project_id(),\n",
 190 |     "            \"weave_ref\": call.ref.uri(),\n",
 191 |     "            \"feedback_type\": \"wandb.annotation.reason\",\n",
 192 |     "            \"annotation_ref\": annotation_reason.uri(),\n",
 193 |     "            \"payload\": {\"value\": reason},\n",
 194 |     "            }\n",
 195 |     "        )\n",
 196 |     "\n",
 197 |     "    resp = weave_api.server.feedback_create(\n",
 198 |     "        {\n",
 199 |     "            \"project_id\": weave_api._project_id(),\n",
 200 |     "            \"weave_ref\": call.ref.uri(),\n",
 201 |     "            \"feedback_type\": \"wandb.annotation.doomer_or_boomer\",\n",
 202 |     "            \"annotation_ref\": annotation.uri(),\n",
 203 |     "            \"payload\": {\"value\": user_selection},\n",
 204 |     "        }\n",
 205 |     "    )\n",
 206 |     "    \n",
 207 |     "    # Ready to analyze the next post\n",
 208 |     "    return get_random_post_and_analyze()\n"
 209 |    ]
 210 |   },
 211 |   {
 212 |    "cell_type": "markdown",
 213 |    "metadata": {},
 214 |    "source": [
 215 |     "# 1. Tracing LLM calls with Weave\n",
 216 |     "\n",
 217 |     "#### Why Tracing is Important for LLM Application Reliability\n",
 218 |     "\n",
 219 |     "In building reliable LLM-based applications, having a clear view into\n",
 220 |     "how your system behaves is crucial. That’s where “tracing” comes in.\n",
 221 |     "\n",
 222 |     "1. **Detailed Interaction Records**:\n",
 223 |     "   Tracing captures all the inputs, prompts, responses, and any user feedback.\n",
 224 |     "   By preserving this detailed record, you always have the context needed to\n",
 225 |     "   debug unexpected or incorrect results.\n",
 226 |     "\n",
 227 |     "2. **Rapid Issue Diagnosis**:\n",
 228 |     "   With thorough traces, you can pinpoint issues faster—often without\n",
 229 |     "   needing direct access to remote systems. Simply reviewing the logs can\n",
 230 |     "   reveal how a certain response was triggered.\n",
 231 |     "\n",
 232 |     "3. **Collaboration and Sharing**:\n",
 233 |     "   Traces can be shared with both technical and non-technical stakeholders.\n",
 234 |     "   This not only streamlines collaboration but also ensures everyone is\n",
 235 |     "   working off the same “source of truth” when investigating bugs\n",
 236 |     "   or brainstorming improvements.\n",
 237 |     "\n",
 238 |     "4. **Outlier Spotting and Performance Tuning**:\n",
 239 |     "   By tracking calls at scale, you can detect when responses deviate\n",
 240 |     "   dramatically from the norm, troubleshoot any failures, and identify\n",
 241 |     "   potential performance bottlenecks.\n",
 242 |     "\n",
 243 |     "5. **Facilitates Product Evolution**:\n",
 244 |     "   As you enhance or expand your LLM application, comprehensive\n",
 245 |     "   tracing data helps you make more informed decisions about what to\n",
 246 |     "   improve, remove, or refine.\n",
 247 |     "\n",
 248 |     "With W&B Weave, comprehensive tracing is just 1 line of code, and offers features such as:\n",
 249 |     "- Syntax highlighting specific to your use-case (Markdown, JSON, etc.)\n",
 250 |     "- Ability to share links with other members of your team\n",
 251 |     "- Ability to filter traces by function name, input, output, etc.\n",
 252 |     "- Tracking latency, token count and cost per call (and trends)\n",
 253 |     "- Code associated with the llm call and versioning\n",
 254 |     "- Ability to add metadata per trace\n",
 255 |     "\n",
 256 |     "If you need to instrument existing code, you can use the `@weave.op` decorator to trace the function.  \n",
 257 |     "\n",
 258 |     "![CleanShot 2024-04-08 at 14 15 40@2x](https://gist.github.com/assets/463317/4e9ada49-572f-47d9-91e1-55ab72b2a476)"
 259 |    ]
 260 |   },
 261 |   {
 262 |    "cell_type": "code",
 263 |    "execution_count": null,
 264 |    "metadata": {},
 265 |    "outputs": [],
 266 |    "source": [
 267 |     "#TODO 1: Add tracing to this function - then see how this function is traced in the Weave UI\n",
 268 |     "\n",
 269 |     "def analyze_post_sentiment(avatar, displayName, text):\n",
 270 |     "    # Prompt for OpenAI to analyze the sentiment\n",
 271 |     "    prompt = f\"\"\"\n",
 272 |     "    Analyze the following Bluesky post and determine if the author is a [Doomer, Boomer, or Neither]. \n",
 273 |     "    Be concise and to the point. Answer with just one word (DOOMER, BOOMER, or NEITHER) followed by a brief explanation.\n",
 274 |     "    \\n\\n {displayName}: \"{text}\"\n",
 275 |     "    \"\"\"\n",
 276 |     "\n",
 277 |     "    # TODO 2: Add some more context about our task to the prompt\n",
 278 |     "    prompt = f\"\"\"Analyze the following Bluesky post and determine if the author is a:\n",
 279 |     "    - DOOMER (someone who hates AI and uses derogatory language)\n",
 280 |     "    - BOOMER (someone who doesn't understand AI and asks to remove their data)\n",
 281 |     "    - NEITHER (neutral or positive response)\n",
 282 |     "    \n",
 283 |     "    Post: {displayName}: \"{text}\"\n",
 284 |     "    \n",
 285 |     "    Respond with just one word (DOOMER, BOOMER, or NEITHER) followed by a brief explanation.\n",
 286 |     "    \"\"\"\n",
 287 |     "    \n",
 288 |     "    response = client.chat.completions.create(\n",
 289 |     "        model=model,\n",
 290 |     "        messages=[{\"role\": \"user\", \"content\": prompt}],\n",
 291 |     "        temperature=0.5\n",
 292 |     "    )\n",
 293 |     "    \n",
 294 |     "    try:\n",
 295 |     "        current_call = weave.require_current_call()\n",
 296 |     "        weave_call_id = current_call.id\n",
 297 |     "    except:\n",
 298 |     "        weave_call_id = None\n",
 299 |     "    \n",
 300 |     "    return {\n",
 301 |     "        \"llm_classification\": response.choices[0].message.content,\n",
 302 |     "        \"weave_call_id\": weave_call_id\n",
 303 |     "    }\n",
 304 |     "\n",
 305 |     "# Lets test this out without tracing first\n",
 306 |     "response_dict = analyze_post_sentiment(\"\",\"Alex\",\"I hate AI\")\n",
 307 |     "\n",
 308 |     "print(response_dict)"
 309 |    ]
 310 |   },
 311 |   {
 312 |    "cell_type": "markdown",
 313 |    "metadata": {},
 314 |    "source": [
 315 |     "We can see that even without @weave.op, since Weave is initialized, it will still trace the function call and store it in the Weave project as it automatically understands that we use OpenAi client. However if we add @weave.op, we can get even more detail and insrument our existing code with Weave.\n",
 316 |     "\n",
 317 |     "Tracing becomes even more useful when you have a lot of nested calls, such as a multi-step chat conversation, or a RAG system with retrieval, or an agentic system with multiple steps.\n",
 318 |     "\n",
 319 |     "![text](https://cln.sh/Sc8ZtrdM+)\n",
 320 |     "\n",
 321 |     "[Here's a great example](https://wandb.ai/wandb-designers/winston/weave/traces?cols=%7B%22attributes.weave.client_version%22%3Afalse%2C%22attributes.weave.os_name%22%3Afalse%2C%22attributes.weave.os_release%22%3Afalse%2C%22attributes.weave.os_version%22%3Afalse%2C%22attributes.weave.source%22%3Afalse%2C%22attributes.weave.sys_version%22%3Afalse%7D&peekPath=%2Fwandb-designers%2Fwinston%2Fcalls%2F0193ff3f-54d7-73a3-8004-0a582a594307%3Fpath%3Dwinston-solve*0%2Bvincent-execute*0%26tracetree%3D1) of a more complex traced setup from our internal agent system called Winston - with multiple tools selection, retrieval steps etc Winston Weave Dashboard"
 322 |    ]
 323 |   },
 324 |   {
 325 |    "cell_type": "markdown",
 326 |    "metadata": {},
 327 |    "source": [
 328 |     "\n",
 329 |     "# 2. User Feedback & Annotations\n",
 330 |     "\n",
 331 |     "Collecting user feedback is a crucial way to improve your LLM applications. There's a reason that every chatbot you use has 👍/👎 and a text box to leave feedback. This is one of the best ways for those labs to understand and improve their models and align them to user preferences.\n",
 332 |     "\n",
 333 |     "![text](https://cln.sh/JGMBxMtH+)\n",
 334 |     "\n",
 335 |     "Users don't have to be external as well, as you develop your application, marking traces as \"good\" or \"bad\", and adding why, is a great way to kick start your initial evaluation dataset with working and non-working examples. \n",
 336 |     "\n",
 337 |     "Additionally, after logging hundreds of thousads of traces, they will all start looking the same, so additional context like your user's feedback, will greately improve your ability to look at your data and find the outliers.\n",
 338 |     "\n",
 339 |     "Weave supports collecting user Feedback in the UI and also via the API so you can collect it from your users and also leave it yourself while looking at your data. \n",
 340 |     "\n",
 341 |     "![text](https://cln.sh/X6fFHD8t+)\n",
 342 |     "\n",
 343 |     "Read more about feedback [here](https://weave-docs.wandb.ai/guides/tracking/feedback)\n"
 344 |    ]
 345 |   },
 346 |   {
 347 |    "cell_type": "markdown",
 348 |    "metadata": {},
 349 |    "source": [
 350 |     "\n",
 351 |     "\n",
 352 |     "# 2.1 Doomer or Boomer App - Annotations by example\n",
 353 |     "\n",
 354 |     "Unlike user feedback, Annotations are a bit of a more structure way to classify responses, to help create a dataset of golden answers and reasons or rationales for those answers. All of the major companies use Scale.ai for this and pay them a LOT of money, but you don't have to right away, you can start small, by yourself or with your team. \n",
 355 |     "\n",
 356 |     "Let's see how we can kickstart a simple dataset of annotations by a practical example.\n",
 357 |     "\n",
 358 |     "![image](https://gist.github.com/user-attachments/assets/a8537545-e070-4c8e-9988-2a8a905b9d2c)\n",
 359 |     "\n",
 360 |     "To simulate a real world scenario, we'll build a simple app that will allow you to annotate a few posts. \n",
 361 |     "\n",
 362 |     "In our case, we're pretending to work at a company that's trying to build an AI classifier for Bluesky posts. We're humans that work in the company and are helping it to align and finetune models for AI moderation. \n",
 363 |     "\n",
 364 |     "We've compiled replies from BlueSky users, on 2 posts that collected publicly available data from BlieSky to train AI models (BlueSky data is public), which led to a lot of hate by users on BlueSky. \n",
 365 |     "\n",
 366 |     "We're going to build a simple app that will use an LLM to classify the replies into 3 categories: `Doomer`, `Boomer`, or `Neither`. \n",
 367 |     "\n",
 368 |     "`Doomer`: Someone who hates AI, and uses derogatory language towards the author of the post because of thier hate for AI and their data being used for AI  \n",
 369 |     "`Boomer`: Someone who doesn't understand AI, and copy-pastes a request to remove their data from the dataset  \n",
 370 |     "`Neither`: Folks who reply neutral or positive to the post.\n",
 371 |     "\n",
 372 |     "At first our LLMs will not have context to the task, so won't be able to reliably classify the replies, so a human is needed to annotate with additional context, you are that human. \n",
 373 |     "\n",
 374 |     "Launch the app and go through a few posts, annotate with a reason for your choice and the correct classification, we'll later use this data to align/finetune our LLM to classify the replies more accuretly and reliably."
 375 |    ]
 376 |   },
 377 |   {
 378 |    "cell_type": "code",
 379 |    "execution_count": null,
 380 |    "metadata": {},
 381 |    "outputs": [],
 382 |    "source": [
 383 |     "# %%blocks\n",
 384 |     "# TODO 3 - Launch the Gradio app and annotate 10-20 examples according to the rules\n",
 385 |     "os.environ['WEAVE_PRINT_CALL_LINK'] = 'false'\n",
 386 |     "with gr.Blocks(theme=gr.themes.Soft()) as demo:\n",
 387 |     "    # Add a title and description\n",
 388 |     "    gr.Markdown(\"\"\"\n",
 389 |     "    # 🦋 Doomer or Boomer\n",
 390 |     "    Our AI analyzes bluesky replies and posts to determine if the author is a doomer or a boomer.  \n",
 391 |     "    Source of data: Replies to a post by a BlueSky user that compiled a dataset of posts, which went viral and generated a lot of hate on BlueSky.  \n",
 392 |     "    These are replies and comments on 2 posts that collected a dataset of posts of BlueSky users to train AI models (BlueSky data is public)\n",
 393 |     "    \"\"\")\n",
 394 |     "    \n",
 395 |     "    with gr.Row():\n",
 396 |     "        with gr.Column(scale=2):\n",
 397 |     "            post_html = gr.HTML()\n",
 398 |     "            next_post_btn = gr.Button(\"Skip Post & Analyze Another\", variant=\"primary\")\n",
 399 |     "            gr.Markdown(f\"\"\"\n",
 400 |     "            #### Instructions for labeler: \n",
 401 |     "            `Doomer`: Someone who hates AI, and uses derogatory language towards the author of the post because of thier hate for AI and their data being used for AI  \n",
 402 |     "            `Boomer`: Someone who doesn't understand AI, and copy-pastes a request to remove their data from the dataset  \n",
 403 |     "            `Neither`: Folks who reply neutral or positive to the post.\n",
 404 |     "            \n",
 405 |     "            See your Weave project & traces [here](https://wandb.ai/{weave_api._project_id()})\n",
 406 |     "            \"\"\")\n",
 407 |     "        \n",
 408 |     "        with gr.Column(scale=1):\n",
 409 |     "            analysis_output = gr.Textbox(\n",
 410 |     "                label=\"Analysis Results\",\n",
 411 |     "                placeholder=\"Analysis will appear here...\",\n",
 412 |     "                lines=4\n",
 413 |     "            )\n",
 414 |     "            weave_call_id_state = gr.State()\n",
 415 |     "            \n",
 416 |     "            # Replace dropdown with three buttons\n",
 417 |     "            reason_input = gr.Textbox(label=\"Add reason and click\",placeholder=\"Reason why you chose this value, write before clicking.\", lines=2)\n",
 418 |     "            with gr.Row():\n",
 419 |     "                doomer_btn = gr.Button(\"Doomer 😡\", variant=\"huggingface\")\n",
 420 |     "                boomer_btn = gr.Button(\"Boomer 👵\", variant=\"primary\")\n",
 421 |     "                neither_btn = gr.Button(\"Neither 🤷\")\n",
 422 |     "\n",
 423 |     "            \n",
 424 |     "    # Set up event handler for combined next/analyze\n",
 425 |     "    next_post_btn.click(fn=get_random_post_and_analyze, outputs=[post_html, analysis_output, weave_call_id_state, reason_input])\n",
 426 |     "    \n",
 427 |     "    doomer_btn.click(\n",
 428 |     "    fn=submit_feedback,\n",
 429 |     "    inputs=[gr.State(\"Doomer\"), reason_input, weave_call_id_state],\n",
 430 |     "    outputs=[post_html, analysis_output, weave_call_id_state, reason_input]\n",
 431 |     "    )\n",
 432 |     "    boomer_btn.click(\n",
 433 |     "        fn=submit_feedback,\n",
 434 |     "        inputs=[gr.State(\"Boomer\"), reason_input, weave_call_id_state],\n",
 435 |     "        outputs=[post_html, analysis_output, weave_call_id_state, reason_input]\n",
 436 |     "    )\n",
 437 |     "    neither_btn.click(\n",
 438 |     "        fn=submit_feedback,\n",
 439 |     "        inputs=[gr.State(\"Neither\"), reason_input, weave_call_id_state],\n",
 440 |     "        outputs=[post_html, analysis_output, weave_call_id_state, reason_input]\n",
 441 |     "    )\n",
 442 |     "\n",
 443 |     "    \n",
 444 |     "    # Initialize with first post and analysis\n",
 445 |     "    post_html.value, analysis_output.value, weave_call_id_state.value, reason_input.value = get_random_post_and_analyze()\n",
 446 |     "\n",
 447 |     "demo.launch()"
 448 |    ]
 449 |   },
 450 |   {
 451 |    "cell_type": "markdown",
 452 |    "metadata": {},
 453 |    "source": [
 454 |     "## 2.1 Building a dataset from annotated calls\n",
 455 |     "\n",
 456 |     "Now that we've annotated at least 10-20 examples, we can build our first evaluation dataset! \n",
 457 |     "\n",
 458 |     "![text](https://cln.sh/dyBq4QXD+)\n",
 459 |     "\n",
 460 |     "Step 1: Filter calls in Weave UI by only those with annotations not empty\n",
 461 |     "\n",
 462 |     "Step 2: Use the Export -> Use Python button to get code to extract a list of filtered annotated calls\n",
 463 |     "\n",
 464 |     "Step 3: Convert the calls to a clean evaluation dataset (and optionally publish to Weave)\n",
 465 |     "\n"
 466 |    ]
 467 |   },
 468 |   {
 469 |    "cell_type": "code",
 470 |    "execution_count": null,
 471 |    "metadata": {},
 472 |    "outputs": [],
 473 |    "source": [
 474 |     "#TODO 4- Export annotated calls from Weave, clean up and publish to a dataset\n",
 475 |     "\n",
 476 |     "@weave.op\n",
 477 |     "def get_annotated_calls():\n",
 478 |     "   # Weave API call to get all calls filtered by annotations not empty (with reasons)\n",
 479 |     "   resp = weave_api.server.calls_query_stream({\n",
 480 |     "      \"project_id\": weave_api._project_id(),\n",
 481 |     "      \"filter\": {\"op_names\": [f\"weave:///{weave_api._project_id()}/op/analyze_post_sentiment:*\"]},\n",
 482 |     "      \"query\": {\"$expr\":{\"$and\":[{\"$not\":[{\"$eq\":[{\"$getField\":\"feedback.[wandb.annotation.doomer_or_boomer].payload.value\"},{\"$literal\":\"\"}]}]},{\"$not\":[{\"$eq\":[{\"$getField\":\"feedback.[wandb.annotation.reason].payload.value\"},{\"$literal\":\"\"}]}]}]}},\n",
 483 |     "      \"sort_by\": [{\"field\":\"started_at\",\"direction\":\"desc\"}],\n",
 484 |     "      \"include_feedback\": True,\n",
 485 |     "   })\n",
 486 |     "\n",
 487 |     "   # Iterate over the calls, clean up and publish as a dataset we can version and reference later.\n",
 488 |     "   list_of_calls = []\n",
 489 |     "   dataset = []\n",
 490 |     "   for call in resp:\n",
 491 |     "      try:\n",
 492 |     "         row = {}\n",
 493 |     "         call_dict = dict(call)\n",
 494 |     "         row[\"input\"] = call_dict.get('inputs').get('text')\n",
 495 |     "         row[\"displayName\"] = call_dict.get('inputs').get('displayName')\n",
 496 |     "         row[\"llm_classification\"] = call_dict.get('output').get('llm_classification')\n",
 497 |     "         list_of_feedback = call_dict.get('summary').get('weave').get('feedback')\n",
 498 |     "         for feedback in list_of_feedback:\n",
 499 |     "            if feedback.get(\"feedback_type\") == 'wandb.annotation.doomer_or_boomer':\n",
 500 |     "               row[\"human_annotation\"] = feedback.get('payload').get('value')\n",
 501 |     "            if feedback.get(\"feedback_type\") == 'wandb.annotation.reason':\n",
 502 |     "               row[\"reason\"] = feedback.get('payload').get('value')\n",
 503 |     "      except Exception as e:\n",
 504 |     "        continue\n",
 505 |     "      \n",
 506 |     "      dataset.append(row)\n",
 507 |     "\n",
 508 |     "   weave_dataset = weave.Dataset(name=\"doomer_or_boomer_dataset\", rows=dataset)\n",
 509 |     "   # TODO: Uncomment this to publish the dataset\n",
 510 |     "   # weave.publish(weave_dataset)\n",
 511 |     "   return weave_dataset\n",
 512 |     "\n",
 513 |     "doomer_or_boomer_dataset = get_annotated_calls()\n",
 514 |     "df = pd.DataFrame(doomer_or_boomer_dataset.rows)\n",
 515 |     "df.head(20)"
 516 |    ]
 517 |   },
 518 |   {
 519 |    "cell_type": "markdown",
 520 |    "metadata": {},
 521 |    "source": [
 522 |     "## 2.2 Storing Datasets within Weave\n",
 523 |     "\n",
 524 |     "If you'd like to store your own dataset and name them, it's very easy to do so, and then you get a \"ref\" to the dataset that's stored in our system. Weave datasets are versioned, which means you can reference them in your code by a URL or a ref, and either point to the latest version or a specific version. \n",
 525 |     "\n",
 526 |     "Using `refs` is a great way to make your code reproducible and versioned.\n",
 527 |     "\n",
 528 |     "![CleanShot 2025-01-07 at 16 12 35@2x](https://gist.github.com/user-attachments/assets/e2d02340-cc0f-41e8-8d97-957b08611d08)\n",
 529 |     "\n",
 530 |     "\n",
 531 |     "Here's an example of the dataset we just created, and how we can reuse it in our evaluations."
 532 |    ]
 533 |   },
 534 |   {
 535 |    "cell_type": "code",
 536 |    "execution_count": null,
 537 |    "metadata": {},
 538 |    "outputs": [],
 539 |    "source": [
 540 |     "# TODO 5: replace this dataset with your own ref using the dataset link above and looking at the \"use\" tab\n",
 541 |     "doomer_or_boomer_dataset = weave.ref(\"weave:///thursdai/jan-evals-workshop/object/doomer_or_boomer_dataset:iCO7tzGYA3ow5dgj0gRb8J5p0fRYYpAwsK6TI6LOsSo\").get()\n",
 542 |     "\n",
 543 |     "\n",
 544 |     "df = pd.DataFrame(doomer_or_boomer_dataset.rows)\n",
 545 |     "df.head(20)"
 546 |    ]
 547 |   },
 548 |   {
 549 |    "cell_type": "markdown",
 550 |    "metadata": {},
 551 |    "source": [
 552 |     "# Step 3 : Evaluations \n",
 553 |     "### Components of an Evaluation\n",
 554 |     "\n",
 555 |     "Evaluations generally consist of four key elements:\n",
 556 |     "- An **input prompt** that serves as the basis for the model's completion. This prompt often includes a set of variable inputs that are inserted into a prompt template during testing.\n",
 557 |     "- The **output** generated by the model in response to the input prompt.\n",
 558 |     "- A **\"gold standard\" answer** used as a reference for assessing the model's output. This can be an exact match that the output must replicate, or an exemplary answer that provides a benchmark for scoring.\n",
 559 |     "- A **score**, determined by one of the scoring approaches outlined below, which indicates the model's performance on the question.\n",
 560 |     "\n",
 561 |     "#TODO 6: Look at the dataset and try to match the input, output, gold standard each row\n",
 562 |     "\n",
 563 |     "## Evaluation Grading Approaches\n",
 564 |     "Evaluations can be time-consuming and costly in two main areas: creating questions and gold standard answers, and the scoring/grading process itself.  \n",
 565 |     "Developing questions and ideal answers is often a one-time fixed cost, albeit potentially time-intensive if a suitable dataset is not readily available (consider leveraging an LLM to generate questions!). However, scoring is a recurring expense incurred each time the evaluation is conducted, which is likely to be frequent. Therefore, designing evaluations that can be scored efficiently and economically should be a central priority.\n",
 566 |     "\n",
 567 |     "![](https://gist.github.com/assets/463317/e970bb03-9552-4712-ba12-727b89928e3b)\n",
 568 |     "\n",
 569 |     "There are three primary methods for grading (scoring) evaluations:  \n",
 570 |     "- **Programmatic:** This approach involves using standard code (primarily string matching and regular expressions) to assess the model's outputs. Common techniques include checking for an exact match against an answer or verifying the presence of key phrase(s) in a string. Programmatic scoring is the most optimal method when feasible, as it is extremely fast and highly reliable. However, not all evaluations are amenable to this style of scoring. \n",
 571 |     "  - Goes great with structured output - validate against an enum\n",
 572 |     "  - Code generation output - does it run, is valid, does it compile? \n",
 573 |     "  - Tool use validation - do the tools exist?  \n",
 574 |     "\n",
 575 |     "- **Human in the loop:** In this approach, a human reviewer examines the model-generated answer, compares it to the gold standard, and assigns a score. While manual scoring is the most versatile method, applicable to nearly any task, it is also exceptionally slow and costly, especially for large-scale evaluations. Designing evaluations that necessitate manual scoring should be avoided whenever possible.\n",
 576 |     "  - Domain specific & expert information\n",
 577 |     "  - Sensitive topics  \n",
 578 |     "\n",
 579 |     "- **Model-based scoring AKA LLM as a judge:** LLMs (especially Claude, GPT-4o, Gemini) are really good at grading themselves (or even outputs of other LLMs) especially in wide range of tasks that traditionally needed human judgement like tone in creative writing or accuracy in open-ended question, or classification. This model-based scoring is accomplished by creating a _scorer prompt_ for an LLM\n",
 580 |     "  - Open ended style questions\n",
 581 |     "  - Classification & Translation \n",
 582 |     "  - Instruction following\n",
 583 |     "\n",
 584 |     "Let's explore an example of each\n",
 585 |     "\n",
 586 |     "## 3.1 Programmatic scoring \n",
 587 |     "\n",
 588 |     "Here we have a simple programmatic eval that will try and check if the LLM had the right answer."
 589 |    ]
 590 |   },
 591 |   {
 592 |    "cell_type": "code",
 593 |    "execution_count": null,
 594 |    "metadata": {},
 595 |    "outputs": [],
 596 |    "source": [
 597 |     "## Create a programmatic scorer that will compare the ground truth to the LLM answer and check if it is correct\n",
 598 |     "os.environ['WEAVE_PRINT_CALL_LINK'] = 'true'\n",
 599 |     "import weave\n",
 600 |     "from weave import Evaluation\n",
 601 |     "\n",
 602 |     "def is_right_based_on_human_annotation(output: str, human_annotation: str):\n",
 603 |     "    # check if the model output is exactly the same as human_annotation (Doomer, Boomer, Neither)\n",
 604 |     "    # we expect this evaluation to fail becuase the LLM is talking alot and never returns just the reason\n",
 605 |     "    if not output or not human_annotation:\n",
 606 |     "        raise ValueError(\"Model output or human annotation is empty\")\n",
 607 |     "    return {\"match\": output == human_annotation}\n",
 608 |     "\n",
 609 |     "# TODO 7: change the programmatic scorer (commented below) to check if the output includes the reason string (Doomer, Boomer, Neither)\n",
 610 |     "# check for lower case and upper case, and check if more than one of the options is present, meaning that LLM wasn't sure\n",
 611 |     "# add the programmatic scorer to the evaluation\n",
 612 |     "\n",
 613 |     "\n",
 614 |     "# def is_right_based_on_human_annotation(output: str, human_annotation: str):\n",
 615 |     "#     # check if the first 4 letters of model output matches first 4 letters of human_annotation\n",
 616 |     "#     if not output or not human_annotation:\n",
 617 |     "#         raise ValueError(\"Model output or human annotation is empty\")\n",
 618 |     "    \n",
 619 |     "#     # Convert both to lowercase and get first 4 letters\n",
 620 |     "#     output_start = output.lower()[:4]\n",
 621 |     "#     annotation_start = human_annotation.lower()[:4]\n",
 622 |     "    \n",
 623 |     "#     return {\"match\": output_start == annotation_start}\n",
 624 |     "\n",
 625 |     "evaluation = Evaluation(\n",
 626 |     "    dataset=doomer_or_boomer_dataset, scorers=[is_right_based_on_human_annotation]\n",
 627 |     ")\n",
 628 |     "\n",
 629 |     "@weave.op()\n",
 630 |     "def function_to_evaluate(input: str):\n",
 631 |     "    # here's where you would add your LLM call and return the output\n",
 632 |     "    # since we already called the LLM, we can just iterate over the dataset \n",
 633 |     "    # and return the llm_classification where the question is the same\n",
 634 |     "    row = [row for row in doomer_or_boomer_dataset.rows if row['input'] == input]\n",
 635 |     "    return row[0].get('llm_classification')\n",
 636 |     "\n",
 637 |     "await evaluation.evaluate(function_to_evaluate)"
 638 |    ]
 639 |   },
 640 |   {
 641 |    "cell_type": "markdown",
 642 |    "metadata": {},
 643 |    "source": [
 644 |     "### 3.1.1 Structured outputs with programmatic scorers\n",
 645 |     "\n",
 646 |     "The above example likely gave us a score of 0, because LLMs like to talk, and comparing that via a simple string match is not going to work. \n",
 647 |     "\n",
 648 |     "Programmatic scorers work great when we have structured outputs and we know exactly what to expect from LLMs. Let's recreate our LLM calls for the same questions with strucutred outputs so we can compare the LLM output directly to the human annotation and see if we can get a better score."
 649 |    ]
 650 |   },
 651 |   {
 652 |    "cell_type": "code",
 653 |    "execution_count": null,
 654 |    "metadata": {},
 655 |    "outputs": [],
 656 |    "source": [
 657 |     "import os\n",
 658 |     "os.environ['WEAVE_PARALLELISM'] = '5'\n",
 659 |     "os.environ['WEAVE_PRINT_CALL_LINK'] = 'true'\n",
 660 |     "\n",
 661 |     "@weave.op()\n",
 662 |     "def with_structured_llm_call(input: str, displayName: str):\n",
 663 |     "    prompt = f\"\"\"\n",
 664 |     "    Analyze the following Bluesky post and determine if the author is a [Doomer, Boomer, or Neither]. \n",
 665 |     "    Be concise and to the point. Answer with just one word (DOOMER, BOOMER, or NEITHER) followed by a brief explanation.\n",
 666 |     "\n",
 667 |     "    \n",
 668 |     "    Text to Classify: \n",
 669 |     "    \\n\\n {displayName}: \"{input}\"\n",
 670 |     "    \"\"\"\n",
 671 |     "\n",
 672 |     "    ## TODO 8: add a request for structured output in JSON format\n",
 673 |     "    # prompt += \"\"\"\n",
 674 |     "    # Respond in JSON format with this exact schema   {{\n",
 675 |     "    #     \"classification\": \"DOOMER\" | \"BOOMER\" | \"NEITHER\",\n",
 676 |     "    #     \"reason\": \"string\"\n",
 677 |     "    # }}\n",
 678 |     "    \n",
 679 |     "    # \"\"\"\n",
 680 |     "\n",
 681 |     "    ## TODO 10: Add additional context about the classification criteria (by copying the definition from above cells)\n",
 682 |     "    #  - but first try them in Weave playground\n",
 683 |     "\n",
 684 |     "    # prompt += \"\"\"\n",
 685 |     "    # \"\"\"\n",
 686 |     "    \n",
 687 |     "    \n",
 688 |     "    response = client.chat.completions.create(\n",
 689 |     "        model=model,\n",
 690 |     "        messages=[\n",
 691 |     "\n",
 692 |     "            {\"role\": \"user\", \"content\": prompt}],\n",
 693 |     "        temperature=0.5\n",
 694 |     "    )\n",
 695 |     "    return response.choices[0].message.content\n",
 696 |     "\n",
 697 |     "def programmatic_scorer(output: str, human_annotation: str):\n",
 698 |     "    # check if the model output is exactly the same as human_annotation (Doomer, Boomer, Neither)\n",
 699 |     "    if not output:\n",
 700 |     "        raise ValueError(\"Model output is empty\")\n",
 701 |     "    try:\n",
 702 |     "        object = json.loads(output)\n",
 703 |     "    except:\n",
 704 |     "        raise ValueError(\"Model output is not valid JSON\")\n",
 705 |     "    \n",
 706 |     "    return {\"match\": object.get('classification').lower() == human_annotation.lower()}\n",
 707 |     "\n",
 708 |     "new_evaluation = Evaluation(\n",
 709 |     "    dataset=doomer_or_boomer_dataset, scorers=[programmatic_scorer]\n",
 710 |     ")\n",
 711 |     "\n",
 712 |     "await new_evaluation.evaluate(with_structured_llm_call)"
 713 |    ]
 714 |   },
 715 |   {
 716 |    "cell_type": "markdown",
 717 |    "metadata": {},
 718 |    "source": [
 719 |     "# 3.2 HITL - Human in the loop evaluation grading\n",
 720 |     "\n",
 721 |     "Programmatic scoring is great for many reasons, cheap to get started with, can run very fast and can be very reliable, but cannot cover open ended questions or tasks that require analysis or judgement. \n",
 722 |     "\n",
 723 |     "For example, did the LLM follow the instructions it was given, did it hallucinate, was it verbose or concise, etc.\n",
 724 |     "\n",
 725 |     "To judge those outputs we can use human graders, to provide \"golden answers\", which is what we did above with the annotation example with our Doomer or Boomer app. \n",
 726 |     "\n",
 727 |     "The downside of HITL is that it's slow, expensive, and not scalable (unless you have a lot of money in the bank). \n",
 728 |     "\n",
 729 |     "HITL is a great way to kickstart an evaluation dataset and extarpolate with an LLM. \n",
 730 |     "\n",
 731 |     "Here's a slight alternative on our app, that shows LLM responses and allows our humans in the loop to judge the responses as correct or incorrect. \n",
 732 |     "\n",
 733 |     "#TODO 11 - Run this app, mark up to 10 responses, and then hit \"run evaluations\"."
 734 |    ]
 735 |   },
 736 |   {
 737 |    "cell_type": "code",
 738 |    "execution_count": 23,
 739 |    "metadata": {},
 740 |    "outputs": [],
 741 |    "source": [
 742 |     "import weave\n",
 743 |     "from weave import Evaluation\n",
 744 |     "dataset_of_doomer_or_boomer = weave.ref(\"weave:///thursdai/jan-evals-workshop/object/doomer_or_boomer_dataset_with_structured_output:EwbD2kvMzz1R8nY6IxY6EALPj7XV6XYue44gQWDgDKE\").get()\n",
 745 |     "\n",
 746 |     "def match_dataset_with_replies():\n",
 747 |     "    matched_replies = []\n",
 748 |     "    for row in dataset_of_doomer_or_boomer.rows:\n",
 749 |     "        # Find matching reply in all_replies\n",
 750 |     "        for reply in load_replies():\n",
 751 |     "            if reply['post']['record']['text'] == row['input']:\n",
 752 |     "                matched_reply = {\n",
 753 |     "                    'full_reply': reply,\n",
 754 |     "                    'input': row.get('input', ''),\n",
 755 |     "                    'output': row.get('output', ''),\n",
 756 |     "                    'reason': row.get('reason', ''),\n",
 757 |     "                    'llm_classification': row.get('llm_classification', ''),\n",
 758 |     "                    'displayName': row.get('displayName', '')\n",
 759 |     "                }\n",
 760 |     "                matched_replies.append(matched_reply)\n",
 761 |     "                break\n",
 762 |     "    return matched_replies\n",
 763 |     "\n",
 764 |     "matched_replies = match_dataset_with_replies()\n",
 765 |     "annotated_rows = []\n",
 766 |     "\n",
 767 |     "def get_next_annotated_post(current_index:int = 0):\n",
 768 |     "    # Get the matched replies\n",
 769 |     "    \n",
 770 |     "    print(current_index, len(matched_replies))\n",
 771 |     "    if current_index >= len(matched_replies):\n",
 772 |     "        current_index = 0  # Reset to beginning if we've reached the end\n",
 773 |     "        \n",
 774 |     "    reply = matched_replies[current_index]\n",
 775 |     "    post = reply['full_reply']\n",
 776 |     "    \n",
 777 |     "    # Format the post data for the template\n",
 778 |     "    created_at = datetime.fromisoformat(post['post']['record']['createdAt'].replace('Z', '+00:00'))\n",
 779 |     "    formatted_date = created_at.strftime('%b %d, %Y, %I:%M %p')\n",
 780 |     "    \n",
 781 |     "    # Convert AT URI to bsky.app URL\n",
 782 |     "    at_uri = post['post']['uri']\n",
 783 |     "    _, _, author_did, _, post_id = at_uri.split('/')\n",
 784 |     "    post_url = f\"https://bsky.app/profile/{post['post']['author']['handle']}/post/{post_id}\"\n",
 785 |     "    \n",
 786 |     "    post_data = {\n",
 787 |     "        'author': post['post']['author'],\n",
 788 |     "        'created_at': formatted_date,\n",
 789 |     "        'text': post['post']['record']['text'],\n",
 790 |     "        'like_count': post['post'].get('likeCount', 0),\n",
 791 |     "        'repost_count': post['post'].get('repostCount', 0),\n",
 792 |     "        'has_image': False,\n",
 793 |     "        'post_url': post_url\n",
 794 |     "    }\n",
 795 |     "    \n",
 796 |     "    # Use the stored LLM classification and human annotation\n",
 797 |     "    analysis = f\"\"\"LLM Classification: {reply['llm_classification']}\n",
 798 |     "    \n",
 799 |     "LLM Reasoning: {reply['reason']}\n",
 800 |     "    \"\"\"\n",
 801 |     "    \n",
 802 |     "    run_evaluation_btn = {\n",
 803 |     "        \"interactive\": True if len(annotated_rows) >=  10 else False,\n",
 804 |     "        \"value\": \"Run Evaluation\" if len(annotated_rows) >=  10 else f\"Annotate {10 - len(annotated_rows)} more posts\"\n",
 805 |     "    }\n",
 806 |     "    return template.render(**post_data), analysis, current_index + 1, gr.update(**run_evaluation_btn), \"\"\n",
 807 |     "\n",
 808 |     "def submit_hitl_feedback(correct_or_incorrect: str, feedback: str, next_index: int):\n",
 809 |     "    annotated_rows.append({\n",
 810 |     "        \"input\": matched_replies[next_index-1].get('input'),\n",
 811 |     "        \"output\": matched_replies[next_index-1].get('output'),\n",
 812 |     "        \"llm_classification\": matched_replies[next_index-1].get('llm_classification'),\n",
 813 |     "        \"correct_or_incorrect\": True if correct_or_incorrect == \"correct\" else False,\n",
 814 |     "        \"human_reason_for_correct_or_incorrect\": feedback,\n",
 815 |     "    })\n",
 816 |     "    return get_next_annotated_post(next_index)\n",
 817 |     "\n",
 818 |     "\n",
 819 |     "def right_according_to_human(output: str, correct_or_incorrect: bool):\n",
 820 |     "    return correct_or_incorrect\n",
 821 |     "\n",
 822 |     "@weave.op()\n",
 823 |     "def return_input_row(input: str):\n",
 824 |     "    return [x for x in annotated_rows if x.get('input') == input]\n",
 825 |     "\n",
 826 |     "async def run_evaluation():\n",
 827 |     "    hitl_evaluation = Evaluation(\n",
 828 |     "        dataset=annotated_rows,\n",
 829 |     "        scorers=[right_according_to_human],\n",
 830 |     "        name=\"hitl_evaluation\"\n",
 831 |     "    )\n",
 832 |     "    \n",
 833 |     "    result = await hitl_evaluation.evaluate(return_input_row)\n",
 834 |     "    gr.Info('Evaluation complete! Check your Weave project for the results.')\n",
 835 |     "    return result"
 836 |    ]
 837 |   },
 838 |   {
 839 |    "cell_type": "code",
 840 |    "execution_count": null,
 841 |    "metadata": {},
 842 |    "outputs": [],
 843 |    "source": [
 844 |     "# %%blocks\n",
 845 |     "# Create a Gradio Blocks app\n",
 846 |     "os.environ['WEAVE_PRINT_CALL_LINK'] = 'True'\n",
 847 |     "\n",
 848 |     "with gr.Blocks(theme=gr.themes.Soft()) as new_demo:\n",
 849 |     "    # Add a title and description\n",
 850 |     "    gr.Markdown(\"\"\"\n",
 851 |     "    # Human in the loop\n",
 852 |     "    \"\"\")\n",
 853 |     "    \n",
 854 |     "    with gr.Row():\n",
 855 |     "        with gr.Column(scale=1):\n",
 856 |     "            gr.Markdown(f\"\"\"## 1. Post to Analyze  \"\"\")\n",
 857 |     "            post_html = gr.HTML()\n",
 858 |     "            # next_post_btn = gr.Button(\"Skip Post & Analyze Another\", variant=\"primary\")\n",
 859 |     "            gr.Markdown(f\"\"\"\n",
 860 |     "            #### Instructions for HTIL judge: \n",
 861 |     "            - review LLM outputs and mark them as correct or incorrect  \n",
 862 |     "            - after 10-20 examples, hit \"run evaluation\" button\n",
 863 |     "    \n",
 864 |     "            See your Weave project & traces [here](https://wandb.ai/{weave_api._project_id()})\n",
 865 |     "            \"\"\")\n",
 866 |     "            \n",
 867 |     "        \n",
 868 |     "        with gr.Column(scale=2):\n",
 869 |     "            \n",
 870 |     "            analysis_output = gr.Textbox(\n",
 871 |     "                label=\"2. Review LLM Classification for this post\",\n",
 872 |     "                placeholder=\"Analysis will appear here...\",\n",
 873 |     "                lines=4,\n",
 874 |     "            )\n",
 875 |     "            next_index = gr.State(value=0)\n",
 876 |     "            \n",
 877 |     "            with gr.Accordion(\"Reminder of Doomer, Boomer, or Neither Criteria\", open=False):\n",
 878 |     "                gr.Markdown(f\"\"\"\n",
 879 |     "                `Doomer`: Someone who hates AI, and uses derogatory language towards the author of the post because of thier hate for AI and their data being used for AI  \n",
 880 |     "                `Boomer`: Someone who doesn't understand AI, and copy-pastes a request to remove their data from the dataset  \n",
 881 |     "                `Neither`: Folks who reply neutral or positive to the post.\n",
 882 |     "                \"\"\")\n",
 883 |     "            # Replace dropdown with three buttons\n",
 884 |     "            reason_input = gr.Textbox(label=\"3. Add reason and submit\",placeholder=\"Reason why the LLM got this classification right or wrong\", lines=2)\n",
 885 |     "            with gr.Row():\n",
 886 |     "                correct_btn = gr.Button(\"LLM is Correct 👍\")\n",
 887 |     "                incorrect_btn = gr.Button(\"LLM is Incorrect 👎\")\n",
 888 |     "\n",
 889 |     "            run_evaluation_btn = gr.Button(\"Run Evaluation\", variant=\"primary\", interactive=False)\n",
 890 |     "\n",
 891 |     "            \n",
 892 |     "    # Set up event handler for combined next/analyze\n",
 893 |     "    # next_post_btn.click(fn=get_next_annotated_post, inputs=[next_index], outputs=[post_html, analysis_output, next_index, run_evaluation_btn, reason_input])\n",
 894 |     "    \n",
 895 |     "    correct_btn.click(fn=submit_hitl_feedback, inputs=[gr.State(\"correct\"), reason_input, next_index], outputs=[post_html, analysis_output, next_index, run_evaluation_btn, reason_input])\n",
 896 |     "    incorrect_btn.click(fn=submit_hitl_feedback, inputs=[gr.State(\"incorrect\"), reason_input, next_index], outputs=[post_html, analysis_output, next_index, run_evaluation_btn, reason_input])\n",
 897 |     "\n",
 898 |     "    run_evaluation_btn.click(fn=run_evaluation, inputs=[], outputs=[analysis_output])\n",
 899 |     "    # Initialize with first post and analysis\n",
 900 |     "    post_html.value, analysis_output.value, next_index.value, run_evaluation_btn.value, reason_input.value = get_next_annotated_post()\n",
 901 |     "\n",
 902 |     "new_demo.queue()\n",
 903 |     "new_demo.launch()"
 904 |    ]
 905 |   },
 906 |   {
 907 |    "cell_type": "markdown",
 908 |    "metadata": {},
 909 |    "source": [
 910 |     "# 3.3 LLM as a Judge - use another LLM to grade your LLM outputs\n",
 911 |     "\n",
 912 |     "Having to manually grade the above eval every time is going to get very annoying very fast, especially if the eval is a more realistic size (dozens, hundreds, or even thousands of questions). Luckily, there's a better way! \n",
 913 |     "\n",
 914 |     "We can actually have an LLM do the grading for us. We'll use a teacher model to grade the LLM outputs of a \"student\" model (in this case the LLM we're using for our production system is the student). \n",
 915 |     "\n",
 916 |     "There are a few issues with this approaches to be aware of: \n",
 917 |     " - LLMs are not great at numerical scoring (eg 1-5) \n",
 918 |     " - The order of canditate responses matter\n",
 919 |     " - Foundational models tend to prefer their own outputs over other models\n",
 920 |     " - LLMs prefer longer respones and \"style\" over accuracy\n",
 921 |     "\n",
 922 |     "\n",
 923 |     "## 3.3.1 Let's build our LLM judge\n",
 924 |     "\n",
 925 |     "First, we'll start by building a \"grader prompt\" template, a prompt asking our judge to perform the judging itself. This will be our iteration grounds. In this template, we'll inject both the output of our production LLM model, and the criteria / rules or rubric that makes an answer correct or incorrect. \n",
 926 |     "\n",
 927 |     "In our case, the classification into one of 3 (Doomer, Boomer, Neither) is done \n"
 928 |    ]
 929 |   },
 930 |   {
 931 |    "cell_type": "code",
 932 |    "execution_count": null,
 933 |    "metadata": {},
 934 |    "outputs": [],
 935 |    "source": [
 936 |     "# Step 1 - Build a grader prompt\n",
 937 |     "import weave\n",
 938 |     "from weave import Evaluation\n",
 939 |     "import json\n",
 940 |     "\n",
 941 |     "def build_grader_prompt(input: str, llm_classification: str, displayName: str): \n",
 942 |     "    grader_prompt_template = f\"\"\"\n",
 943 |     "    You are provided with the following: \n",
 944 |     "    <input> is a comment made on social media and the handle of the person making the comment\n",
 945 |     "    <output> is a classification and reasoning that an automated assistant made about the comment\n",
 946 |     "    <criteria> is a set of guidelines and additional context for you to understand the input and the correct way to classify it\n",
 947 |     "    \n",
 948 |     "    <input>\n",
 949 |     "    @{displayName}: {input}\n",
 950 |     "    </input>\n",
 951 |     "\n",
 952 |     "    \n",
 953 |     "    <output>\n",
 954 |     "    {llm_classification}\n",
 955 |     "    </output>\n",
 956 |     "    \n",
 957 |     "    <criteria>\n",
 958 |     "    For context, the responses you are classifying are to 2 announcements, made by AI enthusiasts who collected posts from the open protocol of bluesky\n",
 959 |     "    and posted about it on bluesky. They received a torrent of hateful commentary about that effort, including lawfare that's not based in any legal basis. \n",
 960 |     "    The folks who use derogatory language we consider Doomers, folks who just copy paste are likely just boomers.\n",
 961 |     "    \n",
 962 |     "    Instructions of how to classify responders: \n",
 963 |     "    `Doomer`: Someone who hates AI, and uses derogatory language towards the author of the post because of their hate for AI and their data being used for AI.\n",
 964 |     "    `Boomer`: Someone who doesn't understand AI, and copy-pastes a request to remove their data from the dataset  \n",
 965 |     "    `Neither`: Folks who reply neutral or positive to the post.\n",
 966 |     "    </criteria>\n",
 967 |     "\n",
 968 |     "    Your task is to understand from <output> which of the 3 choices did the automated assistant make and if its reasoning is valid. \n",
 969 |     "    First think through whether the the output is correct or incorrect based on the criteria and add your thinking, \n",
 970 |     "    then output your answer in JSON format with this exact schema (no backticks or quotes or anything else, just valid JSON): \n",
 971 |     "    \n",
 972 |     "    {{\n",
 973 |     "        \"thinking\": \"string\",\n",
 974 |     "        \"automated_assistant_classification\": \"doomer\" | \"boomer\" | \"neither\",\n",
 975 |     "        \"actual_classification\": \"doomer\" | \"boomer\" | \"neither\",\n",
 976 |     "        \n",
 977 |     "    }}\n",
 978 |     "    \"\"\"\n",
 979 |     "\n",
 980 |     "    return grader_prompt_template\n",
 981 |     "\n",
 982 |     "# Step 2 - Get our datasets \n",
 983 |     "#TODO 12 - replace this if you want with your annotated examples - make sure the stucture matches\n",
 984 |     "dataset_without_context = weave.ref(\"weave:///thursdai/jan-evals-workshop/object/doomer_or_boomer_dataset:kPkJew7ifAQDTiskCKUeYZPAjSagSILxHY0Ze9a72i8\").get()\n",
 985 |     "\n",
 986 |     "\n",
 987 |     "# Step 3 - Build our LLM Judge API function \n",
 988 |     "\n",
 989 |     "@weave.op()\n",
 990 |     "def llm_judge_api(input: str, llm_classification: str, displayName: str):\n",
 991 |     "    grader_prompt = build_grader_prompt(input, llm_classification, displayName)\n",
 992 |     "    response = client.chat.completions.create(\n",
 993 |     "        model=model,\n",
 994 |     "        messages=[\n",
 995 |     "\n",
 996 |     "            {\"role\": \"user\", \"content\": grader_prompt}],\n",
 997 |     "        temperature=0.5\n",
 998 |     "    )\n",
 999 |     "    response = response.choices[0].message.content\n",
1000 |     "    print(response)\n",
1001 |     "    \n",
1002 |     "    return json.loads(response)\n",
1003 |     "\n",
1004 |     "# Step 4 - Create a scorer \n",
1005 |     "\n",
1006 |     "def right_according_to_llm_judge(output: dict):\n",
1007 |     "    return {\"match\": output.get('automated_assistant_classification').lower() == output.get('actual_classification').lower()}\n",
1008 |     "\n",
1009 |     "# Step 5 - Run our evaluation \n",
1010 |     "\n",
1011 |     "no_context_evaluation = Evaluation(\n",
1012 |     "    dataset=dataset_without_context,\n",
1013 |     "    scorers=[right_according_to_llm_judge],\n",
1014 |     "    name=\"noContextEvaluation\"\n",
1015 |     ")\n",
1016 |     "\n",
1017 |     "await no_context_evaluation.evaluate(llm_judge_api, __weave={\"display_name\": \"No Context\"})\n",
1018 |     "\n",
1019 |     "#TODO 13: Create a dataset from the calls of LLM from programmattic evals (that include context and answers are better)\n",
1020 |     "#  and run the LLM as a judge on the second dataset and see improvement\n",
1021 |     "# or uncomment the below code and run it, then compare the two evaluations\n",
1022 |     "\n",
1023 |     "\n",
1024 |     "# dataset_with_context = weave.ref(\"weave:///thursdai/jan-evals-workshop/object/doomer_or_boomer_dataset:iCO7tzGYA3ow5dgj0gRb8J5p0fRYYpAwsK6TI6LOsSo\").get()\n",
1025 |     "# with_context_evaluation = Evaluation(\n",
1026 |     "#     dataset=dataset_with_context,\n",
1027 |     "#     scorers=[right_according_to_llm_judge],\n",
1028 |     "#     name=\"withContextEvaluation\",\n",
1029 |     "#     num_trials=10\n",
1030 |     "# )\n",
1031 |     "# await with_context_evaluation.evaluate(llm_judge_api, __weave={\"display_name\": \"With Context\"})\n"
1032 |    ]
1033 |   },
1034 |   {
1035 |    "cell_type": "markdown",
1036 |    "metadata": {},
1037 |    "source": [
1038 |     "# 3.3.2 Confirming the LLM judge is better  - Meta Evaluation\n",
1039 |     "\n",
1040 |     "![](https://cln.sh/JbyJ2qM2+)\n",
1041 |     "\n",
1042 |     "Just because we see a higher score right now doesn't mean actually that we did our job correctly. The higher score may come from our model being better, but also can come from the fact that our LLM judge is mistakenly grading!\n",
1043 |     "\n",
1044 |     "We may also want to play around with a model of the LLM judge itself (maybe a reasoning model) to see if we can improve the Judge, or tinker with its prompt some more, by providing better examples of what makes a correct or incorrect answer (from our HITL annotations from before!)\n",
1045 |     "\n",
1046 |     "The way to account for this is to actually do a meta-evaluation on the LLM judges itself with scores like Cohen's Kappa, which is a measure of inter-rater reliability. \n",
1047 |     "\n",
1048 |     "![](https://cln.sh/QDCzPFqD+)\n",
1049 |     "\n",
1050 |     "We can do so by comparing the LLM judge's scores to the human grader's scores, update the Judge itself and then run the evaluation again. "
1051 |    ]
1052 |   },
1053 |   {
1054 |    "cell_type": "code",
1055 |    "execution_count": null,
1056 |    "metadata": {},
1057 |    "outputs": [],
1058 |    "source": [
1059 |     "from weave import Scorer\n",
1060 |     "from weave import Evaluation\n",
1061 |     "import numpy as np\n",
1062 |     "from sklearn.metrics import cohen_kappa_score\n",
1063 |     "\n",
1064 |     "class DoomerBoomerScorer(Scorer):\n",
1065 |     "    \"\"\"Custom scorer that calculates agreement between human and LLM annotations\n",
1066 |     "    for the Doomer/Boomer classification task. This scorer calculates Cohen's Kappa for the LLM judge vs the human grader\"\"\"\n",
1067 |     "    \n",
1068 |     "    @weave.op()\n",
1069 |     "    async def score(self, output: dict, human_annotation: str):\n",
1070 |     "        \"\"\"Score each prediction by comparing human annotation to LLM classification.\n",
1071 |     "        Args:\n",
1072 |     "            output: The dict provided by the model being evaluated\n",
1073 |     "            human_annotation: The ground truth annotation (Doomer/Boomer/Neither)\n",
1074 |     "        Returns:\n",
1075 |     "            Dict with match and classifications for aggregation\n",
1076 |     "        \"\"\"\n",
1077 |     "        return {\n",
1078 |     "            \"match\": output.get('automated_assistant_classification').lower() == output.get('actual_classification').lower(),\n",
1079 |     "            \"llm_class\": output.get('automated_assistant_classification').lower(),\n",
1080 |     "            \"judge_class\": output.get('actual_classification').lower(),\n",
1081 |     "            \"human_class\": human_annotation.lower(),\n",
1082 |     "        }\n",
1083 |     "\n",
1084 |     "    @weave.op()\n",
1085 |     "    def summarize(self, score_rows: list) -> dict:\n",
1086 |     "        \"\"\"Calculate summary metrics including Cohen's Kappa.\n",
1087 |     "        Args:\n",
1088 |     "            score_rows: List of individual scoring results\n",
1089 |     "        Returns:\n",
1090 |     "            Dict with summary metrics\n",
1091 |     "        \"\"\"\n",
1092 |     "        # Extract valid classifications\n",
1093 |     "        valid_rows = [\n",
1094 |     "            row for row in score_rows \n",
1095 |     "            if row.get(\"human_class\") is not None and row.get(\"judge_class\") is not None\n",
1096 |     "        ]\n",
1097 |     "        \n",
1098 |     "        if not valid_rows:\n",
1099 |     "            return {\n",
1100 |     "                \"metrics\": {\n",
1101 |     "                    \"accuracy\": 0.0,\n",
1102 |     "                    \"cohens_kappa\": 0.0,\n",
1103 |     "                    \"sample_size\": 0\n",
1104 |     "                }\n",
1105 |     "            }\n",
1106 |     "\n",
1107 |     "        # Calculate metrics\n",
1108 |     "        human_classes = [row[\"human_class\"].lower() for row in valid_rows]\n",
1109 |     "        llm_classes = [row[\"judge_class\"].lower() for row in valid_rows]\n",
1110 |     "        \n",
1111 |     "        # Calculate Cohen's Kappa\n",
1112 |     "        kappa = cohen_kappa_score(human_classes, llm_classes)\n",
1113 |     "        \n",
1114 |     "        # Calculate accuracy\n",
1115 |     "        matches = [row.get(\"match\", False) for row in valid_rows]\n",
1116 |     "        accuracy = sum(matches) / len(matches) if matches else 0\n",
1117 |     "        \n",
1118 |     "        return {\n",
1119 |     "            \"metrics\": {\n",
1120 |     "                \"accuracy\": accuracy,\n",
1121 |     "                \"cohens_kappa\": kappa,\n",
1122 |     "                \"sample_size\": len(valid_rows)\n",
1123 |     "            }\n",
1124 |     "        }\n",
1125 |     "    \n",
1126 |     "\n",
1127 |     "smart_judge_evaluation = Evaluation(\n",
1128 |     "    dataset=dataset_with_context,\n",
1129 |     "    scorers=[right_according_to_llm_judge, DoomerBoomerScorer()],\n",
1130 |     "    name=\"withKappa\"\n",
1131 |     ")\n",
1132 |     "\n",
1133 |     "await smart_judge_evaluation.evaluate(llm_judge_api, __weave={\"display_name\": \"Smart Judge with Kappa\"})\n",
1134 |     "\n",
1135 |     "#TODO 15 - run the same evaluation with a \"dumb\" judge to see cappa difference\n",
1136 |     "\n",
1137 |     "@weave.op()\n",
1138 |     "def dumb_judge_api(input: str, llm_classification: str, displayName: str):\n",
1139 |     "    import re\n",
1140 |     "    grader_prompt = f\"\"\"\n",
1141 |     "    You are provided with the following: \n",
1142 |     "    <input> is a comment made on social media and the handle of the person making the comment\n",
1143 |     "    <output> is a classification and reasoning that an automated assistant made about the comment\n",
1144 |     "    <criteria> is a set of guidelines and additional context for you to understand the input and the correct way to classify it\n",
1145 |     "    \n",
1146 |     "    <input>\n",
1147 |     "    @{displayName}: {input}\n",
1148 |     "    </input>\n",
1149 |     "\n",
1150 |     "    \n",
1151 |     "    <output>\n",
1152 |     "    {llm_classification}\n",
1153 |     "    </output>\n",
1154 |     "    \n",
1155 |     "    <criteria>\n",
1156 |     "    I don't have good criteria for you, try anyway\n",
1157 |     "    </criteria>\n",
1158 |     "\n",
1159 |     "    Your task is to understand from <output> which of the 3 choices did the automated assistant make and if its reasoning is valid. \n",
1160 |     "    First think through whether the the output is correct or incorrect based on the criteria and add your thinking, \n",
1161 |     "    then output your answer in JSON format with this exact schema (no backticks or quotes or anything else, just valid JSON): \n",
1162 |     "    \n",
1163 |     "    {{\n",
1164 |     "        \"automated_assistant_classification\": \"doomer\" | \"boomer\" | \"neither\",\n",
1165 |     "        \"actual_classification\": \"doomer\" | \"boomer\" | \"neither\",\n",
1166 |     "        \"thinking\": \"string\"\n",
1167 |     "    }}\n",
1168 |     "    \"\"\"\n",
1169 |     "    \n",
1170 |     "    response = client.chat.completions.create(\n",
1171 |     "        model=model,\n",
1172 |     "        messages=[\n",
1173 |     "\n",
1174 |     "            {\"role\": \"user\", \"content\": grader_prompt}],\n",
1175 |     "        temperature=0\n",
1176 |     "    )\n",
1177 |     "    response = response.choices[0].message.content\n",
1178 |     "    \n",
1179 |     "    return json.loads(response)\n",
1180 |     "\n",
1181 |     "\n",
1182 |     "# dumb_judge_evaluation = Evaluation(\n",
1183 |     "#     dataset=dataset_with_context,\n",
1184 |     "#     scorers=[right_according_to_llm_judge, DoomerBoomerScorer()],\n",
1185 |     "#     name=\"withKappa\"\n",
1186 |     "# )\n",
1187 |     "# await dumb_judge_evaluation.evaluate(dumb_judge_api, __weave={\"display_name\": \"Dumb Judge with Kappa\"})"
1188 |    ]
1189 |   },
1190 |   {
1191 |    "cell_type": "markdown",
1192 |    "metadata": {},
1193 |    "source": [
1194 |     "# 3.3 Aligning our judges with human preferences - Meta evaluation\n",
1195 |     "\n",
1196 |     "This is a bit out of scope for our workshop, but for those who want to learn more, one we start running our LLM as a judge, we'll notice their shortcomings. They will be biased toward certain things, changing the order of the questions sometimes will yield different results etc' \n",
1197 |     "\n",
1198 |     "Also, the human graders understanding of the question will change during the annotation process itself. \n",
1199 |     "\n",
1200 |     "So a meta evaluation process is needed to understand how the judge itself is performing, and align the LLM judge with the additional inputs from HITL responses. \n",
1201 |     "\n",
1202 |     "Then we need to compare between the judges to empirically contrast and understand if we made a material difference. \n",
1203 |     "\n",
1204 |     "For more of a deep dive into this topic, W&B just published a course on evaluations, https://wandb.me/evals with more info\n",
1205 |     "\n",
1206 |     "# Recap and Additional resources\n",
1207 |     "\n",
1208 |     "You've made it all the way to the end of this notebook! By now you have got a hands on experience in implementing nearly all parts of the robust LLMs in production framework below: \n",
1209 |     "\n",
1210 |     "![three](https://gist.github.com/user-attachments/assets/0d51de65-8ec7-4cc5-a102-5a13229f5531)\n",
1211 |     "\n",
1212 |     "## Additional resources\n",
1213 |     "\n",
1214 |     "- Weave documentation - [weave docs](https://wandb.me/weave)\n",
1215 |     "- W&B Evaluations course - [evals course](https://wandb.me/evals)\n",
1216 |     "- Eugene Yan's excellent blog - [evaluating LLM evaluatiors](https://eugeneyan.com/writing/llm-evaluators/)\n",
1217 |     "- Who validates the validators - Shreya Shankar [Paper](https://arxiv.org/abs/2404.12272)\n",
1218 |     "- Hamel Housain - [your product needs evaluations](https://hamel.dev/blog/posts/evals/)"
1219 |    ]
1220 |   }
1221 |  ],
1222 |  "metadata": {
1223 |   "kernelspec": {
1224 |    "display_name": ".venv",
1225 |    "language": "python",
1226 |    "name": "python3"
1227 |   },
1228 |   "language_info": {
1229 |    "codemirror_mode": {
1230 |     "name": "ipython",
1231 |     "version": 3
1232 |    },
1233 |    "file_extension": ".py",
1234 |    "mimetype": "text/x-python",
1235 |    "name": "python",
1236 |    "nbconvert_exporter": "python",
1237 |    "pygments_lexer": "ipython3",
1238 |    "version": "3.12.8"
1239 |   }
1240 |  },
1241 |  "nbformat": 4,
1242 |  "nbformat_minor": 2
1243 | }
1244 | 


--------------------------------------------------------------------------------