├── README.md ├── dspy_breakdown.ipynb ├── media ├── advan_metrics.png ├── auto_fewshot.png ├── auto_ft.png ├── auto_instr.png ├── better_together.png ├── bootstrap_fewshot.png ├── bootstrap_finetune_diagram.png ├── bsfswrs.png ├── bsfswrs_diagram.png ├── copro_diagram.png ├── cot_module.png ├── dspy.png ├── dspy_workflow.png ├── ensemble_diagram.png ├── input_type.png ├── inter_metrics.png ├── knn_diagram.png ├── labeled_few_shot.png ├── majority.png ├── mermaid.png ├── metrics.png ├── mipro_diagram.png ├── modules.png ├── multi_chain.png ├── multiple_signature.png ├── optimizers.png ├── program_of_thought.png ├── program_transform.png ├── react.png ├── signatures.png └── simple_metrics.png └── optimized ├── bsfs_twt_sentiment.json ├── bsfswrs_twt_sentiment.json ├── bsft_twt_sentiment.pkl ├── copro_twt_sentiment.json ├── ensemble_twt_sentiment.json ├── knn_twt_sentiment.json ├── lfs_twt_sentiment.json ├── mipro_bsft_twt_sentiment.pkl └── mipro_twt_sentiment.json /README.md: -------------------------------------------------------------------------------- 1 | # Programming (Not Prompting) Your LLM with DSPy 2 | 3 | DSPy (Declarative Self-improving Python) is a framework from Stanford NLP that treats language models as programmable functions rather than prompt templates. It provides a PyTorch-like interface for defining, composing, and optimizing LLM operations. Instead of writing and maintaining complex prompts, developers specify input/output signatures and let DSPy handle prompt engineering and optimization. The framework enables systematic improvement of LLM pipelines through techniques like automatic prompt tuning and self-improvement. 4 | 5 | 6 | 7 | The DSPy workflow follows 4 main steps: 8 | 1. Define your program using signatures and modules 9 | 2. Create measurable success metrics that clearly show your program's performance 10 | 3. Compile your program and optimize towards success metrics 11 | 4. Collect additional data and iterate 12 | 13 | We'll look through and apply all the various approaches DSPy offers across these steps in this notebook! 14 | -------------------------------------------------------------------------------- /dspy_breakdown.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "afbec709-c5b0-4c68-a4db-7a0f5c0ae228", 6 | "metadata": {}, 7 | "source": [ 8 | "# Programming (Not Prompting) Your LLM with DSPy\n", 9 | "\n", 10 | "DSPy (Declarative Self-improving Python) is a framework from Stanford NLP that treats language models as programmable functions rather than prompt templates. It provides a PyTorch-like interface for defining, composing, and optimizing LLM operations. Instead of writing and maintaining complex prompts, developers specify input/output signatures and let DSPy handle prompt engineering and optimization. The framework enables systematic improvement of LLM pipelines through techniques like automatic prompt tuning and self-improvement.\n", 11 | "\n", 12 | "\n", 13 | "\n", 14 | "The DSPy workflow follows 4 main steps:\n", 15 | "1. Define your program using signatures and modules\n", 16 | "2. Create measurable success metrics that clearly show your program's performance\n", 17 | "3. Compile your program and optimize towards success metrics\n", 18 | "4. Collect additional data and iterate\n", 19 | "\n", 20 | "We'll look through and apply all the various approaches DSPy offers across these steps in this notebook!" 
21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "id": "5737a40f-6a4d-4035-80b1-61604cc9a6b0", 26 | "metadata": {}, 27 | "source": [ 28 | "---\n", 29 | "## Setup\n", 30 | "\n", 31 | "" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": 1, 37 | "id": "fdad4261-30d2-491c-a42d-ed3c522a9a44", 38 | "metadata": {}, 39 | "outputs": [], 40 | "source": [ 41 | "import dspy" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "id": "504e5e46-4f4c-4d11-b9ee-19a380cb36d1", 47 | "metadata": {}, 48 | "source": [ 49 | "Configure LLM" 50 | ] 51 | }, 52 | { 53 | "cell_type": "markdown", 54 | "id": "ef325de5-72b8-44c9-92e4-7f5ae602aec1", 55 | "metadata": {}, 56 | "source": [ 57 | "**Configure LLM**\n", 58 | "\n", 59 | "DSPy by default caches responses and models across your environment. Unless explicitly stated otherwise, configuring a language model will use that language model for all subsequent calls." 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": 2, 65 | "id": "83eb594d-b7a6-4c73-883f-a2ccf671b5c4", 66 | "metadata": {}, 67 | "outputs": [], 68 | "source": [ 69 | "lm = dspy.LM('openai/gpt-4o-mini')\n", 70 | "dspy.configure(lm=lm)" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 9, 76 | "id": "44e58448-f7f3-4857-a5af-697bc6787bc2", 77 | "metadata": {}, 78 | "outputs": [ 79 | { 80 | "data": { 81 | "text/plain": [ 82 | "['This is a test! How can I assist you further?']" 83 | ] 84 | }, 85 | "execution_count": 9, 86 | "metadata": {}, 87 | "output_type": "execute_result" 88 | } 89 | ], 90 | "source": [ 91 | "lm(messages=[{\"role\": \"user\", \"content\": \"Say this is a test!\"}])" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "id": "e580490e-d9c9-4c49-8a8a-1b366e66b9b4", 97 | "metadata": {}, 98 | "source": [ 99 | "---\n", 100 | "## Signatures\n", 101 | "\n", 102 | "DSPy Signatures follow the same approach as regular function signatures but are defined in natural language. This is the core of the \"prompting\" that DSPy aims to replace. Instead of telling the LLM what to do, we take the approach of declaring what the LLM will do.\n", 103 | "\n", 104 | "The format looks like:\n", 105 | "\n", 106 | "```python \n", 107 | "'input -> output' \n", 108 | "```\n", 109 | "\n", 110 | "Where your `input` and `output` can be anything you'd like. It's also possible to define multiple inputs, outputs, types, or more well defined schemas.\n", 111 | "\n", 112 | "\n", 113 | "\n", 114 | "Behind the scenes, this is still a language model prompt, but it aims to be more modular than static, changing wording and structure based on your natural language signature. While this may seem counter intuitive as we're abstracting away from prompting, DSPy has set this up in a way that allows for easy switching in and out of models, and algorithmic optimizations that we will highlight later." 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "id": "9fa1e67a-77b9-4d48-b405-e21f45d0b5d3", 120 | "metadata": {}, 121 | "source": [ 122 | "### Simple Input & Output" 123 | ] 124 | }, 125 | { 126 | "cell_type": "code", 127 | "execution_count": 108, 128 | "id": "5245d195-9a30-4017-bb17-d2000ff97be9", 129 | "metadata": {}, 130 | "outputs": [ 131 | { 132 | "name": "stdout", 133 | "output_type": "stream", 134 | "text": [ 135 | "Response: The sky appears blue due to a phenomenon called Rayleigh scattering. When sunlight enters the Earth's atmosphere, it collides with molecules and small particles in the air. 
Sunlight is made up of different colors, each with varying wavelengths. Blue light has a shorter wavelength and is scattered in all directions more than other colors with longer wavelengths, such as red or yellow. This scattering causes the sky to look predominantly blue to our eyes during the day.\n" 136 | ] 137 | } 138 | ], 139 | "source": [ 140 | "qna = dspy.Predict('question -> answer')\n", 141 | "\n", 142 | "response = qna(question=\"Why is the sky blue?\")\n", 143 | "\n", 144 | "print(\"Response: \", response.answer)" 145 | ] 146 | }, 147 | { 148 | "cell_type": "code", 149 | "execution_count": 110, 150 | "id": "54e1c151-4eba-4ba6-82f6-1c1982561639", 151 | "metadata": {}, 152 | "outputs": [ 153 | { 154 | "name": "stdout", 155 | "output_type": "stream", 156 | "text": [ 157 | "Summary: The market for our products is highly competitive, driven by rapid technological advancements and changing industry standards. Key competitive factors include product performance, range of offerings, customer access, distribution channels, software support, adherence to industry standards, manufacturing capabilities, pricing, and overall system costs. Our competitiveness hinges on our ability to predict customer demands and deliver quality products at competitive prices. We anticipate increased competition from both established players and new entrants, potentially offering lower prices or superior features. Additionally, competition may arise from companies specializing in GPUs, CPUs, DPUs, and high-performance interconnect products. Some competitors may possess greater resources, making it challenging for us to keep pace with market changes. The competitive landscape is expected to intensify in the future.\n" 158 | ] 159 | } 160 | ], 161 | "source": [ 162 | "sum = dspy.Predict('document -> summary')\n", 163 | "\n", 164 | "document = \"\"\"\n", 165 | "The market for our products is intensely competitive and is characterized by rapid technological change and evolving industry standards. \n", 166 | "We believe that theprincipal competitive factors in this market are performance, breadth of product offerings, access to customers and partners and distribution channels, softwaresupport, conformity to industry standard APIs, manufacturing capabilities, processor pricing, and total system costs. \n", 167 | "We believe that our ability to remain competitive will depend on how well we are able to anticipate the features and functions that customers and partners will demand and whether we are able todeliver consistent volumes of our products at acceptable levels of quality and at competitive prices. \n", 168 | "We expect competition to increase from both existing competitors and new market entrants with products that may be lower priced than ours or may provide better performance or additional features not provided by our products. \n", 169 | "In addition, it is possible that new competitors or alliances among competitors could emerge and acquire significant market share.\n", 170 | "A significant source of competition comes from companies that provide or intend to provide GPUs, CPUs, DPUs, embedded SoCs, and other accelerated, AI computing processor products, and providers of semiconductor-based high-performance interconnect products based on InfiniBand, Ethernet, Fibre Channel,and proprietary technologies. \n", 171 | "Some of our competitors may have greater marketing, financial, distribution and manufacturing resources than we do and may bemore able to adapt to customers or technological changes. 
\n", 172 | "We expect an increasingly competitive environment in the future.\n", 173 | "\"\"\"\n", 174 | "\n", 175 | "response = sum(document=document)\n", 176 | "\n", 177 | "print(\"Summary: \", response.summary)" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "id": "c9b6d5b4-9b8e-415c-b8aa-e5304e99fd6c", 183 | "metadata": {}, 184 | "source": [ 185 | "### Multiple Inputs and Outputs\n", 186 | "\n", 187 | "" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": 112, 193 | "id": "268ea25f-3f78-4b3a-82ed-ea7b67a3f800", 194 | "metadata": {}, 195 | "outputs": [ 196 | { 197 | "name": "stdout", 198 | "output_type": "stream", 199 | "text": [ 200 | "Answer: Your name is Adam Lucek.\n", 201 | "\n", 202 | "Citation: Context provided by the user.\n" 203 | ] 204 | } 205 | ], 206 | "source": [ 207 | "multi = dspy.Predict('question, context -> answer, citation')\n", 208 | "\n", 209 | "question = \"What's my name?\"\n", 210 | "context = \"The user you're talking to is Adam Lucek, AI youtuber extraordinaire\"\n", 211 | "\n", 212 | "response = multi(question=question, context=context)\n", 213 | "\n", 214 | "print(\"Answer: \", response.answer)\n", 215 | "print(\"\\nCitation: \", response.citation)" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "id": "972af06a-bca0-4e00-97c5-935b3c26d21e", 221 | "metadata": {}, 222 | "source": [ 223 | "### Type Hints with Outputs\n", 224 | "\n", 225 | "" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": 114, 231 | "id": "eb8da6a7-2dc2-4703-86a9-922066b7247f", 232 | "metadata": {}, 233 | "outputs": [ 234 | { 235 | "name": "stdout", 236 | "output_type": "stream", 237 | "text": [ 238 | "Sentiment Classification: negative\n", 239 | "\n", 240 | "Confidence: 0.85\n", 241 | "\n", 242 | "Reasoning: The phrase \"I didn't really like it\" clearly indicates a negative sentiment towards whatever is being discussed. The use of \"didn't like\" suggests dissatisfaction, and the uncertainty expressed by \"I don't quite know\" reinforces a lack of positive feelings. The confidence level is high at 0.85 due to the explicit negative language used.\n" 243 | ] 244 | } 245 | ], 246 | "source": [ 247 | "emotion = dspy.Predict('input -> sentiment: str, confidence: float, reasoning: str')\n", 248 | "\n", 249 | "text = \"I don't quite know, I didn't really like it\"\n", 250 | "\n", 251 | "response = emotion(input=text)\n", 252 | "\n", 253 | "print(\"Sentiment Classification: \", response.sentiment)\n", 254 | "print(\"\\nConfidence: \", response.confidence)\n", 255 | "print(\"\\nReasoning: \", response.reasoning)" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "id": "ba8ef8c6-f2bb-4d8f-8ca0-0c5381304a7f", 261 | "metadata": {}, 262 | "source": [ 263 | "### Class Based Signatures\n", 264 | "\n", 265 | "For more advanced signatures, DSPy allows you to define a pydantic class or data structure schema instead of the simple inline string approach. These classes inherit from `dspy.Signature` to start, but you must define your inputs with `dspy.InputField()` and outputs with `dspy.OutputField()`.\n", 266 | "\n", 267 | "An optional `desc` argument can be passed within each field to add additional context as a description." 
268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": 116, 273 | "id": "d9b8c1ba-09f3-4f1b-8c6c-496e7804df93", 274 | "metadata": {}, 275 | "outputs": [ 276 | { 277 | "name": "stdout", 278 | "output_type": "stream", 279 | "text": [ 280 | "Transformed Text: In a quaint coffee shop, where dreams brew and swirl, \n", 281 | "The finest lattes dance, a creamy, frothy whirl. \n", 282 | "A new barista, skilled, with hands that weave delight, \n", 283 | "Crafts magic with the espresso, morning's purest light.\n", 284 | "\n", 285 | "Style Metrics: {'formality': 0.7, 'complexity': 0.6, 'emotiveness': 0.8}\n", 286 | "\n", 287 | "Preserved Keywords: ['coffee shop', 'lattes', 'barista', 'espresso machine']\n" 288 | ] 289 | } 290 | ], 291 | "source": [ 292 | "from typing import Literal\n", 293 | "\n", 294 | "class TextStyleTransfer(dspy.Signature):\n", 295 | " \"\"\"Transfer text between different writing styles while preserving content.\"\"\"\n", 296 | " text: str = dspy.InputField()\n", 297 | " source_style: Literal[\"academic\", \"casual\", \"business\", \"poetic\"] = dspy.InputField()\n", 298 | " target_style: Literal[\"academic\", \"casual\", \"business\", \"poetic\"] = dspy.InputField()\n", 299 | " preserved_keywords: list[str] = dspy.OutputField()\n", 300 | " transformed_text: str = dspy.OutputField()\n", 301 | " style_metrics: dict[str, float] = dspy.OutputField(desc=\"Scores for formality, complexity, emotiveness\")\n", 302 | "\n", 303 | "\n", 304 | "text = \"This coffee shop makes the best lattes ever! Their new barista really knows what he's doing with the espresso machine.\"\n", 305 | "\n", 306 | "style_transfer = dspy.Predict(TextStyleTransfer)\n", 307 | "\n", 308 | "response = style_transfer(\n", 309 | " text=text,\n", 310 | " source_style=\"casual\",\n", 311 | " target_style=\"poetic\"\n", 312 | ")\n", 313 | "\n", 314 | "print(\"Transformed Text: \", response.transformed_text)\n", 315 | "print(\"\\nStyle Metrics: \", response.style_metrics)\n", 316 | "print(\"\\nPreserved Keywords: \", response.preserved_keywords)" 317 | ] 318 | }, 319 | { 320 | "cell_type": "markdown", 321 | "id": "d5e87a8c-27cb-4d81-ad87-ebe2f901569f", 322 | "metadata": {}, 323 | "source": [ 324 | "---\n", 325 | "## Modules\n", 326 | "\n", 327 | "\n", 328 | "\n", 329 | "Modules are where we apply different prompting frameworks to signatures. We've already been using the basic `Predict` module in our signature examples prior, but there exist many more popular strategies and variants. Here are the current available modules: \n", 330 | "\n", 331 | "* `ChainOfThought`: Implements chain-of-thought prompting by prepending a reasoning step before generating outputs. The module automatically adds a \"Let's think step by step\" prefix to encourage structured thinking. Use this when you need the model to break down complex problems into smaller steps.\n", 332 | "\n", 333 | "* `ProgramOfThought`: Generates executable Python code to solve problems, with built-in error handling and code regeneration capabilities. Use this for mathematical or algorithmic problems that are better solved through actual code execution.\n", 334 | "\n", 335 | "* `ReAct`: Implements Reasoning + Acting by interleaving thoughts, actions (via tools), and observations in a structured loop. 
Use this when your task requires multi-step reasoning and interaction with external tools or APIs.\n", 336 | "\n", 337 | "And a few helpers:\n", 338 | "\n", 339 | "* `MultiChainComparison`: Takes multiple reasoning attempts (default 3) and combines them into a single, more accurate response by comparing different reasoning paths. Use this when you need higher accuracy and can afford multiple attempts at solving a problem.\n", 340 | "\n", 341 | "* `majority`: A utility function that takes multiple completions and returns the most common response after normalizing the text. Use this when you want to implement simple voting among multiple completion attempts to increase reliability.\n" 342 | ] 343 | }, 344 | { 345 | "cell_type": "markdown", 346 | "id": "10e865e5-3600-4091-8c5b-05b9ec883754", 347 | "metadata": {}, 348 | "source": [ 349 | "### [Chain of Thought](https://github.com/stanfordnlp/dspy/blob/main/dspy/predict/chain_of_thought.py)\n", 350 | "\n", 351 | "\n", 352 | "\n", 353 | "ChainOfThought works by modifying the prompt signature to include an explicit reasoning step before the output. When initialized with a signature, it creates an extended signature by prepending a \"reasoning\" field with the prefix \"Reasoning: Let's think step by step in order to\". This reasoning field forces the language model to write out its thought process before providing the final answer." 354 | ] 355 | }, 356 | { 357 | "cell_type": "code", 358 | "execution_count": 118, 359 | "id": "aee04b2f-1e45-479b-909a-c24b7133a71e", 360 | "metadata": {}, 361 | "outputs": [ 362 | { 363 | "name": "stdout", 364 | "output_type": "stream", 365 | "text": [ 366 | "Sentiment: Mixed\n", 367 | "\n", 368 | "Reasoning: The statement expresses a conflicting sentiment. The word \"phenomenal\" indicates a strong positive reaction, suggesting that the experience was impressive or outstanding. However, the phrase \"but I hated it\" introduces a negative sentiment, indicating a strong dislike or aversion to the same experience. This juxtaposition creates a complex emotional response, where the speaker acknowledges something as remarkable while simultaneously expressing a strong negative feeling towards it.\n" 369 | ] 370 | } 371 | ], 372 | "source": [ 373 | "# Define the Signature and Module\n", 374 | "cot_emotion = dspy.ChainOfThought('input -> sentiment: str')\n", 375 | "\n", 376 | "# Example\n", 377 | "text = \"That was phenomenal, but I hated it!\"\n", 378 | "\n", 379 | "# Run\n", 380 | "cot_response = cot_emotion(input=text)\n", 381 | "\n", 382 | "# Output\n", 383 | "print(\"Sentiment: \", cot_response.sentiment)\n", 384 | "# Inherently added reasoning\n", 385 | "print(\"\\nReasoning: \", cot_response.reasoning)" 386 | ] 387 | }, 388 | { 389 | "cell_type": "markdown", 390 | "id": "e2372287-82c1-4a36-b933-cdae87034ed9", 391 | "metadata": {}, 392 | "source": [ 393 | "### [Program of Thought](https://github.com/stanfordnlp/dspy/blob/main/dspy/predict/program_of_thought.py)\n", 394 | "\n", 395 | "\n", 396 | "\n", 397 | "ProgramOfThought solves tasks by generating executable Python code rather than working directly with natural language outputs. When given a task, PoT first generates Python code using a ChainOfThought predictor, then executes that code in an isolated Python interpreter. If the code generates any errors, PoT enters a refinement loop where it shows the error to the language model, gets corrected code, and tries executing again, for up to a maximum number of iterations (default 3). 
The final output comes from actually running the successful code rather than from the language model directly. " 398 | ] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "execution_count": 120, 403 | "id": "96a677a8-c6b6-4efa-97ef-f8f295b11b81", 404 | "metadata": {}, 405 | "outputs": [ 406 | { 407 | "name": "stdout", 408 | "output_type": "stream", 409 | "text": [ 410 | "Error in code execution\n" 411 | ] 412 | } 413 | ], 414 | "source": [ 415 | "# Define the Signature\n", 416 | "class MathAnalysis(dspy.Signature):\n", 417 | " \"\"\"Analyze a dataset and compute various statistical metrics.\"\"\"\n", 418 | " \n", 419 | " numbers: list[float] = dspy.InputField(desc=\"List of numerical values to analyze\")\n", 420 | " required_metrics: list[str] = dspy.InputField(desc=\"List of metrics to calculate (e.g. ['mean', 'variance', 'quartiles'])\")\n", 421 | " analysis_results: dict[str, float] = dspy.OutputField(desc=\"Dictionary containing the calculated metrics\")\n", 422 | "\n", 423 | "# Create the module\n", 424 | "math_analyzer = dspy.ProgramOfThought(MathAnalysis)\n", 425 | "\n", 426 | "# Example\n", 427 | "data = [1.5, 2.8, 3.2, 4.7, 5.1, 2.3, 3.9]\n", 428 | "metrics = ['mean', 'median']\n", 429 | "\n", 430 | "# Run\n", 431 | "pot_response = math_analyzer(\n", 432 | " numbers=data,\n", 433 | " required_metrics=metrics\n", 434 | ")" 435 | ] 436 | }, 437 | { 438 | "cell_type": "code", 439 | "execution_count": 121, 440 | "id": "559da509-f4e0-42ee-931b-d9c1da78c277", 441 | "metadata": {}, 442 | "outputs": [ 443 | { 444 | "name": "stdout", 445 | "output_type": "stream", 446 | "text": [ 447 | "Reasoning: The provided code correctly calculates the mean and median of the given list of numbers. The mean is computed by summing all the numbers and dividing by the count of numbers, while the median is determined by sorting the list and finding the middle value (or the average of the two middle values if the count is even). The output matches the expected results for both metrics.\n", 448 | "\n", 449 | "Results: {'mean': 3.357142857142857, 'median': 3.2}\n" 450 | ] 451 | } 452 | ], 453 | "source": [ 454 | "print(\"Reasoning: \", pot_response.reasoning)\n", 455 | "print(\"\\nResults: \", pot_response.analysis_results)" 456 | ] 457 | }, 458 | { 459 | "cell_type": "markdown", 460 | "id": "ebcd1149-5e3e-422a-a1f3-f73d0d3178f3", 461 | "metadata": {}, 462 | "source": [ 463 | "### [Reasoning + Acting (ReAct)](https://github.com/stanfordnlp/dspy/blob/main/dspy/predict/react.py)\n", 464 | "\n", 465 | "\n", 466 | "\n", 467 | "ReAct enables interactive problem-solving by combining reasoning with tool usage. It works by maintaining a trajectory of thought-action pairs, where at each step the model explains its reasoning, selects a tool to use, provides arguments for that tool, and then observes the tool's output to inform its next step. Each iteration consists of four parts: a thought explaining the strategy, selection of a tool name from the available tools, arguments to pass to that tool, and the observation from running the tool. This continues until either the model chooses to \"finish\" or reaches the maximum number of iterations. 
Here's a simple example:" 468 | ] 469 | }, 470 | { 471 | "cell_type": "code", 472 | "execution_count": 123, 473 | "id": "7fbf0317-40ee-4664-8bb3-27bcb0961599", 474 | "metadata": {}, 475 | "outputs": [ 476 | { 477 | "name": "stdout", 478 | "output_type": "stream", 479 | "text": [ 480 | "Answer: The Baltimore Orioles won the World Series in 1983, and England won the World Cup in 1966.\n", 481 | "\n", 482 | "Reasoning: The Baltimore Orioles won the 1983 World Series, defeating the Philadelphia Phillies four games to one. Additionally, England won the 1966 FIFA World Cup, beating West Germany 4–2 in the final match.\n" 483 | ] 484 | } 485 | ], 486 | "source": [ 487 | "# Define a Tool\n", 488 | "def wikipedia_search(query: str) -> list[str]:\n", 489 | " \"\"\"Retrieves abstracts from Wikipedia.\"\"\"\n", 490 | " # Existing Wikipedia Abstracts Server\n", 491 | " results = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')(query, k=3) \n", 492 | " return [x['text'] for x in results]\n", 493 | "\n", 494 | "# Define ReAct Module\n", 495 | "react_module = dspy.ReAct('question -> response', tools=[wikipedia_search])\n", 496 | "\n", 497 | "# Example\n", 498 | "text = \"Who won the world series in 1983 and who won the world cup in 1966?\"\n", 499 | "\n", 500 | "# Run\n", 501 | "react_response = react_module(question=text)\n", 502 | "\n", 503 | "print(\"Answer: \", react_response.response)\n", 504 | "print(\"\\nReasoning: \", react_response.reasoning)" 505 | ] 506 | }, 507 | { 508 | "cell_type": "markdown", 509 | "id": "67a8f2ed-8fb6-4fac-a476-0c8056dc0a13", 510 | "metadata": {}, 511 | "source": [ 512 | "### [Multi Chain Comparison](https://github.com/stanfordnlp/dspy/blob/main/dspy/predict/multi_chain_comparison.py)\n", 513 | "\n", 514 | "\n", 515 | "\n", 516 | "MultiChainComparison is a meta-predictor that synthesizes multiple existing completions into a single, more robust prediction. It doesn't generate predictions itself, but instead takes M different completions (default 3) from other predictors - these could be from the same predictor with different temperatures, different predictors entirely, or repeated calls with the same settings. These completions are formatted as \"Student Attempt #1:\", \"Student Attempt #2:\", etc., with each attempt packaged as «I'm trying to \\[rationale] I'm not sure but my prediction is \\[answer]». The module then prompts the model to analyze these attempts holistically with \"Accurate Reasoning: Thank you everyone. Let's now holistically...\" to synthesize a final answer. This approach helps mitigate individual prediction errors by having the model explicitly compare and critique multiple solution paths before making its final decision." 517 | ] 518 | }, 519 | { 520 | "cell_type": "code", 521 | "execution_count": 126, 522 | "id": "5d346cb9-e837-472f-be07-974645808183", 523 | "metadata": {}, 524 | "outputs": [ 525 | { 526 | "name": "stdout", 527 | "output_type": "stream", 528 | "text": [ 529 | "Sentiment: Positive\n", 530 | "\n", 531 | "Reasoning: The phrase \"That was phenomenal!\" clearly indicates strong positive feelings. The word \"phenomenal\" is a superlative that suggests something is remarkable or exceptional, reinforcing the positive sentiment expressed by the speaker. All reasoning attempts correctly identify this sentiment as positive.\n", 532 | "\n", 533 | "Completion 1: Prediction(\n", 534 | " reasoning='The phrase \"That was phenomenal!\" expresses strong positive feelings about an experience or event. 
The use of the word \"phenomenal\" indicates that the subject exceeded expectations and was highly impressive.',\n", 535 | " sentiment='Positive'\n", 536 | ")\n", 537 | "\n", 538 | "Completion 2: Prediction(\n", 539 | " reasoning='The phrase \"That was phenomenal!\" expresses a strong positive reaction. The use of the word \"phenomenal\" indicates that the speaker is extremely impressed or pleased with something. This suggests a high level of enthusiasm and admiration.',\n", 540 | " sentiment='Positive'\n", 541 | ")\n", 542 | "\n", 543 | "Completion 3: Prediction(\n", 544 | " reasoning='The phrase \"That was phenomenal!\" expresses a strong positive reaction to an experience or event. The use of the word \"phenomenal\" conveys excitement and high praise, indicating that the speaker found something to be extraordinary or outstanding.',\n", 545 | " sentiment='positive'\n", 546 | ")\n" 547 | ] 548 | } 549 | ], 550 | "source": [ 551 | "# Run CoT completions with increasing temperatures\n", 552 | "text = \"That was phenomenal!\"\n", 553 | "\n", 554 | "cot_completions = []\n", 555 | "for i in range(3):\n", 556 | " # Temperature increases: 0.7, 0.8, 0.9\n", 557 | " temp_config = dict(temperature=0.7 + (0.1 * i))\n", 558 | " completion = cot_emotion(input=text, config=temp_config)\n", 559 | " cot_completions.append(completion)\n", 560 | "\n", 561 | "# Synthesize with MultiChainComparison\n", 562 | "mcot_emotion = dspy.MultiChainComparison('input -> sentiment', M=3)\n", 563 | "final_result = mcot_emotion(completions=cot_completions, input=text)\n", 564 | "\n", 565 | "print(f\"Sentiment: {final_result.sentiment}\")\n", 566 | "print(f\"\\nReasoning: {final_result.rationale}\")\n", 567 | "\n", 568 | "for i in range(3):\n", 569 | " print(f\"\\nCompletion {i+1}: \", cot_completions[i])" 570 | ] 571 | }, 572 | { 573 | "cell_type": "markdown", 574 | "id": "10dce03c-432f-4541-bf29-27a8a6798b5d", 575 | "metadata": {}, 576 | "source": [ 577 | "### [Majority](https://github.com/stanfordnlp/dspy/blob/main/dspy/predict/aggregation.py)\n", 578 | "\n", 579 | "\n", 580 | "\n", 581 | "Majority is a utility function that implements a basic voting mechanism across multiple completions to determine the most common answer. It works by taking either a Prediction object (which contains completions) or a list of completions directly, then normalizes their values for the target field (either specified or defaults to the last output field). The normalization process, handled by normalize_text, helps manage slight variations in text that should be considered the same answer (returning None for answers that should be ignored). In cases of ties, earlier completions are prioritized. The function is particularly useful when combined with modules that generate multiple completions (like running predictors with different temperatures) and you want a simple way to find the most common response. The function returns a new Prediction object containing just the winning completion." 
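As a quick, self-contained illustration (a hypothetical sketch reusing the temperature-sweep pattern from the MultiChainComparison example above), you can sample several completions yourself and let `majority` vote on the target field:

```python
# Sample several chain-of-thought completions at different temperatures,
# then take a majority vote over the predicted sentiment field
classify = dspy.ChainOfThought('input -> sentiment')

completions = [
    classify(input="Best purchase I've made all year.", config=dict(temperature=0.7 + 0.1 * i))
    for i in range(5)
]

voted = dspy.majority(completions, field='sentiment')
print(voted.sentiment)
```

The next cell applies the same idea to the `cot_completions` we already collected.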
582 | ] 583 | }, 584 | { 585 | "cell_type": "code", 586 | "execution_count": 128, 587 | "id": "75c0d15a-74c1-4379-a2ea-4948ea23f562", 588 | "metadata": {}, 589 | "outputs": [ 590 | { 591 | "name": "stdout", 592 | "output_type": "stream", 593 | "text": [ 594 | "Most common sentiment: Positive\n" 595 | ] 596 | } 597 | ], 598 | "source": [ 599 | "# Example Completions From Prior Multi-Chain\n", 600 | "majority_result = dspy.majority(cot_completions, field='sentiment')\n", 601 | "\n", 602 | "# Results\n", 603 | "print(f\"Most common sentiment: {majority_result.sentiment}\")" 604 | ] 605 | }, 606 | { 607 | "cell_type": "markdown", 608 | "id": "d88698ff-7227-4a65-ac74-ca6e915603de", 609 | "metadata": {}, 610 | "source": [ 611 | "---\n", 612 | "## Evaluators\n", 613 | "\n", 614 | "While modules are the building blocks of your program, you may have realized there's limited ability to actually tune or change your modules directly like you would iterate on prompt chains. This is where DSPy starts to differentiate itself, as it aims to tune performance of your modules through measuring against defined metrics.\n", 615 | "\n", 616 | "As such, you need to deeply consider the optimal state of your LLM output and how you would measure it. This can be as simple as accuracy for classification tasks, or more complex like faithfulness to retrieved context." 617 | ] 618 | }, 619 | { 620 | "cell_type": "markdown", 621 | "id": "7fb19ae3-476c-46bf-99c2-29c3a5952891", 622 | "metadata": {}, 623 | "source": [ 624 | "### Example Data Type\n", 625 | "\n", 626 | "The data type for DSPy evaluators and metrics is the `Example` object. In essence it's just a `dict` but handles the formatting that the DSPy backend expects. The fields can be anything you'd like, but make sure they match up to your current input and output formatting for your module.\n", 627 | "\n", 628 | "Your training set data will consist of a list of examples." 
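For instance, a small training set for the earlier question-answering signature might look like the following (a hypothetical sketch; `.with_inputs()`, which marks which fields are inputs, is explained just below):

```python
# A tiny hypothetical trainset: a list of dspy.Example objects
trainset = [
    dspy.Example(question="Why is the sky blue?", answer="Because of Rayleigh scattering of sunlight.").with_inputs("question"),
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
    dspy.Example(question="Who wrote Hamlet?", answer="William Shakespeare").with_inputs("question"),
]
```

The cells below first build individual `Example` objects so we can inspect their fields.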
629 | ] 630 | }, 631 | { 632 | "cell_type": "code", 633 | "execution_count": 120, 634 | "id": "8b62e852-043a-42df-8c80-b49d92f631c2", 635 | "metadata": {}, 636 | "outputs": [ 637 | { 638 | "name": "stdout", 639 | "output_type": "stream", 640 | "text": [ 641 | "Example({'question': 'What is my name?', 'answer': 'Your name is Adam Lucek'}) (input_keys=None)\n", 642 | "What is my name?\n", 643 | "Your name is Adam Lucek\n" 644 | ] 645 | } 646 | ], 647 | "source": [ 648 | "qa_pair = dspy.Example(question=\"What is my name?\", answer=\"Your name is Adam Lucek\")\n", 649 | "\n", 650 | "print(qa_pair)\n", 651 | "print(qa_pair.question)\n", 652 | "print(qa_pair.answer)" 653 | ] 654 | }, 655 | { 656 | "cell_type": "code", 657 | "execution_count": 123, 658 | "id": "3ebbc2bf-6495-4cb4-aec5-dbf59ba74f96", 659 | "metadata": {}, 660 | "outputs": [ 661 | { 662 | "name": "stdout", 663 | "output_type": "stream", 664 | "text": [ 665 | "Example({'excerpt': 'I really love programming!', 'classification': 'Positive', 'confidence': 0.95}) (input_keys=None)\n", 666 | "I really love programming!\n", 667 | "Positive\n", 668 | "0.95\n" 669 | ] 670 | } 671 | ], 672 | "source": [ 673 | "classification_pair = dspy.Example(excerpt=\"I really love programming!\", classification=\"Positive\", confidence=0.95)\n", 674 | "\n", 675 | "print(classification_pair)\n", 676 | "print(classification_pair.excerpt)\n", 677 | "print(classification_pair.classification)\n", 678 | "print(classification_pair.confidence)" 679 | ] 680 | }, 681 | { 682 | "cell_type": "markdown", 683 | "id": "eddb0021-4117-4ed7-940f-321390b337f9", 684 | "metadata": {}, 685 | "source": [ 686 | "You may also explicitly label `inputs` and `labels` using the `.with_inputs()` method. Anything not specified in `.with_inputs()` is then expected to either be labels or metadata." 687 | ] 688 | }, 689 | { 690 | "cell_type": "code", 691 | "execution_count": 5, 692 | "id": "4a82e7a2-7f61-414d-b16a-d628731431b4", 693 | "metadata": {}, 694 | "outputs": [ 695 | { 696 | "name": "stdout", 697 | "output_type": "stream", 698 | "text": [ 699 | "Example with Input fields only: Example({'article': 'Placeholder for Article'}) (input_keys={'article'})\n", 700 | "\n", 701 | "Example object Non-Input fields only: Example({'summary': 'Expected Summary'}) (input_keys=None)\n" 702 | ] 703 | } 704 | ], 705 | "source": [ 706 | "article_summary = dspy.Example(article = \"Placeholder for Article\", summary= \"Expected Summary\").with_inputs(\"article\")\n", 707 | "\n", 708 | "input_key_only = article_summary.inputs()\n", 709 | "non_input_key_only = article_summary.labels()\n", 710 | "\n", 711 | "print(\"Example with Input fields only:\", article_summary.inputs())\n", 712 | "print(\"\\nExample object Non-Input fields only:\", article_summary.labels())" 713 | ] 714 | }, 715 | { 716 | "cell_type": "markdown", 717 | "id": "56367549-d4a2-4aef-8aa6-82d1769291bb", 718 | "metadata": {}, 719 | "source": [ 720 | "### Metrics\n", 721 | "\n", 722 | "\n", 723 | "\n", 724 | "Now that we understand the data format, we must consider our metrics. Metrics are critical to DSPy as the framework will optimize your modules towards defined metrics.\n", 725 | "\n", 726 | "DSPy defines metrics concisely *A metric is just a function that will take examples from your data and the output of your system and return a score that quantifies how good the output is. 
What makes outputs from your system good or bad?*" 727 | ] 728 | }, 729 | { 730 | "cell_type": "markdown", 731 | "id": "8601d33c-8668-418e-851d-3e176977bed8", 732 | "metadata": {}, 733 | "source": [ 734 | "#### Simple Metrics\n", 735 | "\n", 736 | "\n", 737 | "\n", 738 | "Starting simply, we set up and run validation for exact matches across a sentiment classification module." 739 | ] 740 | }, 741 | { 742 | "cell_type": "markdown", 743 | "id": "5401ccf6-34da-4710-9e35-ecf3a2652510", 744 | "metadata": {}, 745 | "source": [ 746 | "**Setup Module**" 747 | ] 748 | }, 749 | { 750 | "cell_type": "code", 751 | "execution_count": 11, 752 | "id": "f05e5d2c-dde9-4de6-a788-7ffa6375d44b", 753 | "metadata": {}, 754 | "outputs": [], 755 | "source": [ 756 | "# Simple Tweet Sentiment Classification Module\n", 757 | "from typing import Literal\n", 758 | "\n", 759 | "class TwtSentiment(dspy.Signature):\n", 760 | " tweet: str = dspy.InputField(desc=\"Candidate tweet for classification\")\n", 761 | " sentiment: Literal[\"positive\", \"negative\", \"neutral\"] = dspy.OutputField()\n", 762 | "\n", 763 | "twt_sentiment = dspy.ChainOfThought(TwtSentiment)" 764 | ] 765 | }, 766 | { 767 | "cell_type": "markdown", 768 | "id": "4911e354-a446-428f-9628-46dca7567db4", 769 | "metadata": {}, 770 | "source": [ 771 | "**Format Dataset**\n", 772 | "\n", 773 | "We'll grab some example tweet and sentiment pairs from the [MTEB Tweet Sentiment Extraction](https://huggingface.co/datasets/mteb/tweet_sentiment_extraction) dataset. This will be the dataset we validate against." 774 | ] 775 | }, 776 | { 777 | "cell_type": "code", 778 | "execution_count": 18, 779 | "id": "3ac4fac1-2c9e-4f82-9746-3350530be540", 780 | "metadata": {}, 781 | "outputs": [], 782 | "source": [ 783 | "import json\n", 784 | "\n", 785 | "# Formatting Examples\n", 786 | "examples = []\n", 787 | "num_examples = 50\n", 788 | "\n", 789 | "with open(\"./datasets/tweets.jsonl\", 'r', encoding='utf-8') as f:\n", 790 | " for i, line in enumerate(f):\n", 791 | " if num_examples and i >= num_examples:\n", 792 | " break\n", 793 | " \n", 794 | " data = json.loads(line.strip())\n", 795 | " example = dspy.Example(\n", 796 | " tweet=data['text'],\n", 797 | " sentiment=data['label_text']\n", 798 | " ).with_inputs(\"tweet\")\n", 799 | " examples.append(example)" 800 | ] 801 | }, 802 | { 803 | "cell_type": "markdown", 804 | "id": "606ea004-dee1-40da-9c2b-eca82718f4e3", 805 | "metadata": {}, 806 | "source": [ 807 | "**Defining Metric**\n", 808 | "\n", 809 | "The metric takes in an example, a prediction, and an optional trace (we'll discuss the trace at a later point). In this case, it will return `True` or `False` depending on whether the LLM-predicted sentiment matches our ground truth label." 810 | ] 811 | }, 812 | { 813 | "cell_type": "code", 814 | "execution_count": 19, 815 | "id": "710d6e00-5c7e-448d-b1f7-6689e9a31167", 816 | "metadata": {}, 817 | "outputs": [], 818 | "source": [ 819 | "def validate_answer(example, pred, trace=None):\n", 820 | " return example.sentiment.lower() == pred.sentiment.lower()" 821 | ] 822 | }, 823 | { 824 | "cell_type": "markdown", 825 | "id": "1901b202-c843-42af-a771-2e8cf23427c2", 826 | "metadata": {}, 827 | "source": [ 828 | "**Running A Manual Evaluation**\n", 829 | "\n", 830 | "For each tweet in the examples, we run a prediction with the example's defined inputs (the tweet). The prediction is then run through our `validate_answer` metric, which returns True or False, and the result is stored in our scores list."
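DSPy also provides a built-in `Evaluate` utility that wraps this loop with threading and progress reporting; a minimal sketch, assuming the argument names below (the hand-written loop in the next cell does the same thing more explicitly):

```python
from dspy.evaluate import Evaluate

# Score twt_sentiment over our examples with the validate_answer metric
evaluator = Evaluate(devset=examples, metric=validate_answer, num_threads=4, display_progress=True)
evaluator(twt_sentiment)
```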
831 | ] 832 | }, 833 | { 834 | "cell_type": "code", 835 | "execution_count": 20, 836 | "id": "6fa9b3dd-753a-47f6-83fc-1a889a856a61", 837 | "metadata": {}, 838 | "outputs": [], 839 | "source": [ 840 | "scores = []\n", 841 | "for x in examples:\n", 842 | " pred = twt_sentiment(**x.inputs())\n", 843 | " score = validate_answer(x, pred)\n", 844 | " scores.append(score)" 845 | ] 846 | }, 847 | { 848 | "cell_type": "code", 849 | "execution_count": 23, 850 | "id": "8a0d6d7c-ef85-4e13-b281-62ccf4d536b4", 851 | "metadata": {}, 852 | "outputs": [ 853 | { 854 | "name": "stdout", 855 | "output_type": "stream", 856 | "text": [ 857 | "Baseline Accuracy: 0.76\n" 858 | ] 859 | } 860 | ], 861 | "source": [ 862 | "accuracy = sum(scores) / len(scores)\n", 863 | "print(\"Baseline Accuracy: \", accuracy)" 864 | ] 865 | }, 866 | { 867 | "cell_type": "markdown", 868 | "id": "a78f2fff-d437-460e-aed3-387a358779b3", 869 | "metadata": {}, 870 | "source": [ 871 | "#### Intermediate Metrics\n", 872 | "\n", 873 | "\n", 874 | "\n", 875 | "While these direct ground truth comparisons are good, we've seen the introduction of LLM-as-a-judge approaches assist in comparing and judging long form outputs.\n", 876 | "\n", 877 | "Let's implement some LLM based metrics:" 878 | ] 879 | }, 880 | { 881 | "cell_type": "markdown", 882 | "id": "f61a28ba-5be9-4d23-9016-9e727f33400e", 883 | "metadata": {}, 884 | "source": [ 885 | "**Setup Module**" 886 | ] 887 | }, 888 | { 889 | "cell_type": "code", 890 | "execution_count": 34, 891 | "id": "180432ca-6b0d-483a-9b33-94f66257ef31", 892 | "metadata": {}, 893 | "outputs": [], 894 | "source": [ 895 | "# CoT For Summarizing a Dialogue\n", 896 | "\n", 897 | "dialog_sum = dspy.ChainOfThought(\"dialogue: str -> summary: str\")" 898 | ] 899 | }, 900 | { 901 | "cell_type": "markdown", 902 | "id": "832c4f3c-22f0-4db9-8a25-644dc1f77b67", 903 | "metadata": {}, 904 | "source": [ 905 | "**Format Dataset**\n", 906 | "\n", 907 | "Our dataset for this example comes from [DialogSum](https://github.com/cylnlp/dialogsum), a collection of dialogues and corresponding summaries. We can use their summaries as the \"gold\" standard to test against with fuzzy metrics from an LLM." 
908 | ] 909 | }, 910 | { 911 | "cell_type": "code", 912 | "execution_count": 32, 913 | "id": "348b0d4e-106d-41a5-95c2-7d3b3fb820d7", 914 | "metadata": {}, 915 | "outputs": [], 916 | "source": [ 917 | "import pandas as pd\n", 918 | "\n", 919 | "num_examples = 20\n", 920 | "df = pd.read_csv(\"./datasets/dialogsum.csv\")\n", 921 | " \n", 922 | "# Limit the number of examples\n", 923 | "if num_examples:\n", 924 | " df = df.head(num_examples)\n", 925 | "\n", 926 | "dialogsum_examples = []\n", 927 | "\n", 928 | "for _, row in df.iterrows():\n", 929 | " example = dspy.Example(\n", 930 | " dialogue=row['dialogue'],\n", 931 | " summary=row['summary']\n", 932 | " ).with_inputs('dialogue')\n", 933 | " \n", 934 | " dialogsum_examples.append(example)" 935 | ] 936 | }, 937 | { 938 | "cell_type": "markdown", 939 | "id": "07eef3ee-ef51-47cb-9645-f9bf39119e11", 940 | "metadata": {}, 941 | "source": [ 942 | "**Metric Signature**\n", 943 | "\n", 944 | "Now that we're using modules within our metrics, we need a dynamic signature that we can apply to metric predictions." 945 | ] 946 | }, 947 | { 948 | "cell_type": "code", 949 | "execution_count": 90, 950 | "id": "3bb1fcbc-3a35-4850-b715-8a5dcf021f86", 951 | "metadata": {}, 952 | "outputs": [], 953 | "source": [ 954 | "# Define the signature for automatic assessments.\n", 955 | "class Assess(dspy.Signature):\n", 956 | " \"\"\"Assess the quality of a dialog summary along the specified dimension.\"\"\"\n", 957 | "\n", 958 | " assessed_text = dspy.InputField()\n", 959 | " assessment_question = dspy.InputField()\n", 960 | " assessment_answer: bool = dspy.OutputField()" 961 | ] 962 | }, 963 | { 964 | "cell_type": "markdown", 965 | "id": "431562b1-3a95-4d8e-8cac-87919ff7b484", 966 | "metadata": {}, 967 | "source": [ 968 | "**Metric Definition**\n", 969 | "\n", 970 | "We'll be using an LLM to assess whether the generated summary accurately represents the original dialogue, and whether it is appropriately concise in comparison to the expected (gold) summary."
971 | ] 972 | }, 973 | { 974 | "cell_type": "code", 975 | "execution_count": 91, 976 | "id": "ecae2695-43c4-49dd-b40e-c15cd01fff64", 977 | "metadata": {}, 978 | "outputs": [], 979 | "source": [ 980 | "def dialog_metric(gold, pred, trace=None):\n", 981 | " dialogue, gold_summary, generated_summary = gold.dialogue, gold.summary, pred.summary\n", 982 | " \n", 983 | " # Define Assessment Questions\n", 984 | " accurate_question = f\"Given this original dialog: '{dialogue}', does the summary accurately represent what was discussed without adding or changing information?\"\n", 985 | " \n", 986 | " concise_question = f\"\"\"Compare the level of detail in the generated summary with the gold summary:\n", 987 | " Gold summary: '{gold_summary}'\n", 988 | " Is the generated summary appropriately detailed - neither too sparse nor too verbose compared to the gold summary?\"\"\"\n", 989 | "\n", 990 | " # Run Predictions\n", 991 | " accurate = dspy.Predict(Assess)(assessed_text=generated_summary, assessment_question=accurate_question)\n", 992 | " concise = dspy.Predict(Assess)(assessed_text=generated_summary, assessment_question=concise_question)\n", 993 | " \n", 994 | " # Extract boolean assessment answers\n", 995 | " accurate, concise = [m.assessment_answer for m in [accurate, concise]]\n", 996 | " \n", 997 | " # Calculate score - accuracy is required for any points\n", 998 | " score = (accurate + concise) if accurate else 0\n", 999 | " \n", 1000 | " if trace is not None:\n", 1001 | " return score >= 2\n", 1002 | " \n", 1003 | " return score / 2.0" 1004 | ] 1005 | }, 1006 | { 1007 | "cell_type": "markdown", 1008 | "id": "0928f747-b822-457f-8ffe-42322316d2a8", 1009 | "metadata": {}, 1010 | "source": [ 1011 | "**Running Evaluation**\n", 1012 | "\n", 1013 | "Similar manual evaluation to what we did earlier!" 1014 | ] 1015 | }, 1016 | { 1017 | "cell_type": "code", 1018 | "execution_count": 39, 1019 | "id": "5f0d7289-f9a4-4003-9d7a-92d1a9f8a23e", 1020 | "metadata": {}, 1021 | "outputs": [], 1022 | "source": [ 1023 | "intermediate_scores = []\n", 1024 | "for x in dialogsum_examples:\n", 1025 | " pred = dialog_sum(**x.inputs())\n", 1026 | " score = dialog_metric(x, pred)\n", 1027 | " intermediate_scores.append(score)" 1028 | ] 1029 | }, 1030 | { 1031 | "cell_type": "code", 1032 | "execution_count": 42, 1033 | "id": "7bc22618-4874-4ebc-92fc-ad812b05d9a1", 1034 | "metadata": {}, 1035 | "outputs": [ 1036 | { 1037 | "name": "stdout", 1038 | "output_type": "stream", 1039 | "text": [ 1040 | "Dialog Metric Score: 0.85\n" 1041 | ] 1042 | } 1043 | ], 1044 | "source": [ 1045 | "final_score = sum(intermediate_scores) / len(intermediate_scores)\n", 1046 | "print(\"Dialog Metric Score: \", final_score)" 1047 | ] 1048 | }, 1049 | { 1050 | "cell_type": "markdown", 1051 | "id": "37e7856a-473c-464b-b5ac-601bbf65546d", 1052 | "metadata": {}, 1053 | "source": [ 1054 | "#### Advanced Metrics with Tracing in DSPy\n", 1055 | "\n", 1056 | "\n", 1057 | "\n", 1058 | "DSPy's documentation highlights two key points about using modules as metrics:\n", 1059 | "\n", 1060 | "1. If your metric is itself a DSPy program, one of the most powerful ways to iterate is to compile (optimize) your metric itself. That's usually easy because the output of the metric is usually a simple value (e.g., a score out of 5) so the metric's metric is easy to define and optimize by collecting a few examples.\n", 1061 | "\n", 1062 | "2. When your metric is used during evaluation runs, DSPy will not try to track the steps of your program. 
But during compiling (optimization), DSPy will trace your LM calls. The trace will contain inputs/outputs to each DSPy predictor and you can leverage that to validate intermediate steps for optimization.\n", 1063 | "\n", 1064 | "Digging into the second point with our prior example, the metric operates in two modes:\n", 1065 | "\n", 1066 | "**Standard Evaluation (trace=None)**: Returns a normalized score (0-1) based on accuracy and conciseness of the summary, requiring factual accuracy as a gating factor.\n", 1067 | "\n", 1068 | "**Compilation Mode (trace available)**: During compilation, DSPy provides us with the trace of our ChainOfThought module `(dialog_sum)`. While our standard evaluation returns a normalized score between 0-1, in compilation mode we alter the return logic to instead provide a binary success criterion `(score >= 2)`. This binary signal helps DSPy optimize more effectively during compilation by providing a clear success/failure signal for each example.\n", 1069 | "\n", 1070 | "```python\n", 1071 | "def dialog_metric(gold, pred, trace=None):\n", 1072 | " dialogue, gold_summary, generated_summary = gold.dialogue, gold.summary, pred.summary\n", 1073 | " \n", 1074 | " # LLM-based assessment using Assess signature\n", 1075 | " accurate = dspy.Predict(Assess)(assessed_text=generated_summary, assessment_question=accurate_question)\n", 1076 | " concise = dspy.Predict(Assess)(assessed_text=generated_summary, assessment_question=concise_question)\n", 1077 | " \n", 1078 | " if trace is not None:\n", 1079 | " # During compilation: Can access and validate CoT reasoning steps\n", 1080 | " # We're not doing anything with it currently but you can access in this way\n", 1081 | " reasoning_steps = [output.reasoning for *_, output in trace if hasattr(output, 'reasoning')]\n", 1082 | " # Return binary success criteria for optimization\n", 1083 | " return score >= 2 # Requires both accuracy and conciseness\n", 1084 | " \n", 1085 | " return score / 2.0 # Normalized evaluation score\n", 1086 | "```\n", 1087 | "\n", 1088 | "The trace functionality is particularly valuable for complex modules like our ChainOfThought implementation as it alters how DSPy handles optimization. During compilation, instead of returning normalized scores, we provide binary success signals based on specific criteria (score >= 2). This binary feedback helps DSPy more effectively optimize the model by providing clear success/failure signals for each example.\n", 1089 | "\n", 1090 | "This dual-mode evaluation strategy serves two distinct purposes. During normal evaluation, we get detailed normalized scores to assess model performance. During compilation, we switch to binary success criteria to guide optimization more effectively. This approach helps us maintain rich evaluation metrics while providing clearer signals for model improvement during the compilation phase. We could also further complicate this by including signals from intermediate steps that are generally obfuscated." 1091 | ] 1092 | }, 1093 | { 1094 | "cell_type": "markdown", 1095 | "id": "7902c886-81fd-45d5-97e6-943ff66c548f", 1096 | "metadata": {}, 1097 | "source": [ 1098 | "---\n", 1099 | "## Optimization\n", 1100 | "\n", 1101 | "\n", 1102 | "\n", 1103 | "So now that we have some modules and metrics we're measuring against, we can take the final step of optimizing our programs. 
This takes the guesswork out of tweaking and editing prompts by automatically testing, assessing and iterating against measurable values.\n", 1104 | "\n", 1105 | "DSPy offers a few ways to optimize your programs, copied over [from the docs](https://dspy.ai/learn/optimization/optimizers/):\n", 1106 | "\n", 1107 | "**Automatic Few-Shot Learning**\n", 1108 | "These optimizers extend the signature by automatically generating and including optimized examples within the prompt sent to the model, implementing few-shot learning.\n", 1109 | "\n", 1110 | "- `LabeledFewShot`: Simply constructs few-shot examples (demos) from provided labeled input and output data points. Requires k (number of examples for the prompt) and trainset to randomly select k examples from.\n", 1111 | "\n", 1112 | "- `BootstrapFewShot`: Uses a teacher module (which defaults to your program) to generate complete demonstrations for every stage of your program, along with labeled examples in trainset. Parameters include max_labeled_demos (the number of demonstrations randomly selected from the trainset) and max_bootstrapped_demos (the number of additional examples generated by the teacher). The bootstrapping process employs the metric to validate demonstrations, including only those that pass the metric in the \"compiled\" prompt. Advanced: Supports using a teacher program that is a different DSPy program that has compatible structure, for harder tasks.\n", 1113 | "\n", 1114 | "- `BootstrapFewShotWithRandomSearch`: Applies BootstrapFewShot several times with random search over generated demonstrations, and selects the best program over the optimization. Parameters mirror those of BootstrapFewShot, with the addition of num_candidate_programs, which specifies the number of random programs evaluated over the optimization, including candidates of the uncompiled program, LabeledFewShot optimized program, BootstrapFewShot compiled program with unshuffled examples and num_candidate_programs of BootstrapFewShot compiled programs with randomized example sets.\n", 1115 | "\n", 1116 | "- `KNNFewShot`: Uses k-Nearest Neighbors algorithm to find the nearest training example demonstrations for a given input example. These nearest neighbor demonstrations are then used as the trainset for the BootstrapFewShot optimization process. See this notebook for an example.\n", 1117 | "\n", 1118 | "**Automatic Instruction Optimization**\n", 1119 | "These optimizers produce optimal instructions for the prompt and, in the case of MIPROv2 can also optimize the set of few-shot demonstrations.\n", 1120 | "\n", 1121 | "- `COPRO`: Generates and refines new instructions for each step, and optimizes them with coordinate ascent (hill-climbing using the metric function and the trainset). Parameters include depth which is the number of iterations of prompt improvement the optimizer runs over.\n", 1122 | "\n", 1123 | "- `MIPROv2`: Generates instructions and few-shot examples in each step. The instruction generation is data-aware and demonstration-aware. Uses Bayesian Optimization to effectively search over the space of generation instructions/demonstrations across your modules.\n", 1124 | "\n", 1125 | "**Automatic Finetuning**\n", 1126 | "This optimizer is used to fine-tune the underlying LLM(s).\n", 1127 | "\n", 1128 | "- `BootstrapFinetune`: Distills a prompt-based DSPy program into weight updates. 
The output is a DSPy program that has the same steps, but where each step is conducted by a finetuned model instead of a prompted LM.\n", 1129 | "\n", 1130 | "**Program Transformations**\n", 1131 | "- `Ensemble`: Ensembles a set of DSPy programs and either uses the full set or randomly samples a subset into a single program." 1132 | ] 1133 | }, 1134 | { 1135 | "cell_type": "markdown", 1136 | "id": "063e1119-620f-4cee-bf2a-8c7cf60da190", 1137 | "metadata": {}, 1138 | "source": [ 1139 | "**Loading Train and Test for Tweets**\n", 1140 | "\n", 1141 | "For our examples, we'll be optimizing the tweet sentiment classification module from before. While classification tasks are not the best examples for LLM applications, it will still allow us to understand in a lightweight way what's going on behind each optimizer so we can better apply them to more advanced programs. " 1142 | ] 1143 | }, 1144 | { 1145 | "cell_type": "code", 1146 | "execution_count": 60, 1147 | "id": "a30d0ce0-e86c-45b4-9511-5988b6c5286f", 1148 | "metadata": {}, 1149 | "outputs": [], 1150 | "source": [ 1151 | "import json\n", 1152 | "\n", 1153 | "# Formatting Examples\n", 1154 | "twitter_train = []\n", 1155 | "twitter_test = []\n", 1156 | "train_size = 100 # how many for train \n", 1157 | "test_size = 200 # how many for test\n", 1158 | "\n", 1159 | "with open(\"./datasets/tweets.jsonl\", 'r', encoding='utf-8') as f:\n", 1160 | " for i, line in enumerate(f):\n", 1161 | " if i >= (train_size + test_size):\n", 1162 | " break\n", 1163 | " \n", 1164 | " data = json.loads(line.strip())\n", 1165 | " example = dspy.Example(\n", 1166 | " tweet=data['text'],\n", 1167 | " sentiment=data['label_text']\n", 1168 | " ).with_inputs(\"tweet\")\n", 1169 | " \n", 1170 | " if i < train_size:\n", 1171 | " twitter_train.append(example)\n", 1172 | " else:\n", 1173 | " twitter_test.append(example)" 1174 | ] 1175 | }, 1176 | { 1177 | "cell_type": "markdown", 1178 | "id": "8d022044-e3e8-4ced-bcb7-b9eebefc3b9d", 1179 | "metadata": {}, 1180 | "source": [ 1181 | "**Candidate Program**" 1182 | ] 1183 | }, 1184 | { 1185 | "cell_type": "code", 1186 | "execution_count": 3, 1187 | "id": "8fbdbe86-ed1c-4147-afb0-8b19164e2bc0", 1188 | "metadata": {}, 1189 | "outputs": [], 1190 | "source": [ 1191 | "# Simple Tweet Sentiment Classification Module\n", 1192 | "from typing import Literal\n", 1193 | "\n", 1194 | "class TwtSentiment(dspy.Signature):\n", 1195 | " tweet: str = dspy.InputField(desc=\"Candidate tweet for classificaiton\")\n", 1196 | " sentiment: Literal[\"positive\", \"negative\", \"neutral\"] = dspy.OutputField()\n", 1197 | "\n", 1198 | "base_twt_sentiment = dspy.Predict(TwtSentiment)" 1199 | ] 1200 | }, 1201 | { 1202 | "cell_type": "markdown", 1203 | "id": "d58fe87c-ed1b-4882-a26b-7408846baab6", 1204 | "metadata": {}, 1205 | "source": [ 1206 | "**Simple Metrics**" 1207 | ] 1208 | }, 1209 | { 1210 | "cell_type": "code", 1211 | "execution_count": 4, 1212 | "id": "daf38c22-e76c-456c-aecb-876ae4a9bd3c", 1213 | "metadata": {}, 1214 | "outputs": [], 1215 | "source": [ 1216 | "def validate_answer(example, pred, trace=None):\n", 1217 | " return example.sentiment.lower() == pred.sentiment.lower()" 1218 | ] 1219 | }, 1220 | { 1221 | "cell_type": "markdown", 1222 | "id": "cb80f619-c994-4d93-be40-8e213edd6d4b", 1223 | "metadata": {}, 1224 | "source": [ 1225 | "**Baseline Score**" 1226 | ] 1227 | }, 1228 | { 1229 | "cell_type": "code", 1230 | "execution_count": 7, 1231 | "id": "98020703-6a83-40c1-88d8-432a4cdb7244", 1232 | "metadata": {}, 1233 | "outputs": [ 1234 | { 
1235 | "name": "stdout", 1236 | "output_type": "stream", 1237 | "text": [ 1238 | "Baseline Accuracy: 0.69\n" 1239 | ] 1240 | } 1241 | ], 1242 | "source": [ 1243 | "baseline_scores = []\n", 1244 | "for x in twitter_test:\n", 1245 | " pred = base_twt_sentiment(**x.inputs())\n", 1246 | " score = validate_answer(x, pred)\n", 1247 | " baseline_scores.append(score)\n", 1248 | "\n", 1249 | "base_accuracy = baseline_scores.count(True) / len(baseline_scores)\n", 1250 | "print(\"Baseline Accuracy: \", base_accuracy)" 1251 | ] 1252 | }, 1253 | { 1254 | "cell_type": "markdown", 1255 | "id": "3dc33d56-3516-4b78-b198-97a5b52921fa", 1256 | "metadata": {}, 1257 | "source": [ 1258 | "**Example Tweet We'll Run Each Program Through**" 1259 | ] 1260 | }, 1261 | { 1262 | "cell_type": "code", 1263 | "execution_count": 132, 1264 | "id": "eda44269-775f-45b2-9062-74657de17d40", 1265 | "metadata": {}, 1266 | "outputs": [], 1267 | "source": [ 1268 | "# Expected Positive Label\n", 1269 | "example_tweet = \"Hi! Waking up, and not lazy at all. You would be proud of me, 8 am here!!! Btw, nice colour, not burnt.\"" 1270 | ] 1271 | }, 1272 | { 1273 | "cell_type": "markdown", 1274 | "id": "2955e0d2-2740-4f8b-a747-94b0d142002d", 1275 | "metadata": {}, 1276 | "source": [ 1277 | "### Automatic Few Shot Learning\n", 1278 | "\n", 1279 | "\n", 1280 | "\n", 1281 | "These optimizers are focused around providing the best examples either by finding similar examples for your query in the training data during inference, or by generating optimized examples to use from the program itself." 1282 | ] 1283 | }, 1284 | { 1285 | "cell_type": "markdown", 1286 | "id": "1328534c-1aa2-465a-94d8-371fd3ae9c06", 1287 | "metadata": {}, 1288 | "source": [ 1289 | "#### LabeledFewShot\n", 1290 | "\n", 1291 | "\n", 1292 | "\n", 1293 | "The simplest optimizer. Randomly selects k examples from your training data to use as demonstrations." 
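To make the mechanism concrete before running it, here is a minimal sketch (not DSPy's actual implementation; the helper name and seed handling are illustrative) of what `LabeledFewShot` amounts to: sample k labeled examples from the trainset and attach them verbatim as demonstrations, with no generation or filtering.

```python
import random

def labeled_few_shot_sketch(trainset, k=16, seed=0):
    """Illustrative only: pick k labeled examples to use as prompt demos."""
    rng = random.Random(seed)
    # Each item is assumed to be a dspy.Example like those built above,
    # carrying `tweet` and `sentiment` fields.
    return rng.sample(trainset, min(k, len(trainset)))

# demos = labeled_few_shot_sketch(twitter_train, k=16)
# These are the kinds of entries that end up in the "demos" list of
# optimized/lfs_twt_sentiment.json later in this repo.
```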
1294 | ] 1295 | }, 1296 | { 1297 | "cell_type": "code", 1298 | "execution_count": 9, 1299 | "id": "b1ce3158-a10c-4aef-b820-e9980a716496", 1300 | "metadata": {}, 1301 | "outputs": [], 1302 | "source": [ 1303 | "from dspy.teleprompt import LabeledFewShot\n", 1304 | "\n", 1305 | "lfs_optimizer = LabeledFewShot(k=16) # Use 16 examples in prompts\n", 1306 | "\n", 1307 | "lfs_twt_sentiment = lfs_optimizer.compile(base_twt_sentiment, trainset=twitter_train)" 1308 | ] 1309 | }, 1310 | { 1311 | "cell_type": "code", 1312 | "execution_count": 11, 1313 | "id": "60b3937a-2920-45f3-bdf5-6bb8c2ad7509", 1314 | "metadata": {}, 1315 | "outputs": [ 1316 | { 1317 | "name": "stdout", 1318 | "output_type": "stream", 1319 | "text": [ 1320 | "Labeled Few Shot Accuracy: 0.695\n" 1321 | ] 1322 | } 1323 | ], 1324 | "source": [ 1325 | "lfs_scores = []\n", 1326 | "for x in twitter_test:\n", 1327 | " pred = lfs_twt_sentiment(**x.inputs())\n", 1328 | " score = validate_answer(x, pred)\n", 1329 | " lfs_scores.append(score)\n", 1330 | "\n", 1331 | "lfs_accuracy = lfs_scores.count(True) / len(lfs_scores)\n", 1332 | "print(\"Labeled Few Shot Accuracy: \", lfs_accuracy)" 1333 | ] 1334 | }, 1335 | { 1336 | "cell_type": "code", 1337 | "execution_count": 12, 1338 | "id": "de9c155d-eed3-4e23-9336-e735d434470f", 1339 | "metadata": {}, 1340 | "outputs": [], 1341 | "source": [ 1342 | "lfs_twt_sentiment.save(\"./optimized/lfs_twt_sentiment.json\")" 1343 | ] 1344 | }, 1345 | { 1346 | "cell_type": "code", 1347 | "execution_count": 134, 1348 | "id": "d1630be8-c24e-490b-a3b5-8d54ada53c38", 1349 | "metadata": {}, 1350 | "outputs": [ 1351 | { 1352 | "name": "stdout", 1353 | "output_type": "stream", 1354 | "text": [ 1355 | "positive\n" 1356 | ] 1357 | } 1358 | ], 1359 | "source": [ 1360 | "print(lfs_twt_sentiment(tweet=example_tweet).sentiment)" 1361 | ] 1362 | }, 1363 | { 1364 | "cell_type": "markdown", 1365 | "id": "726fefdf-eba3-45f8-8e10-346a64855aca", 1366 | "metadata": {}, 1367 | "source": [ 1368 | "#### BootstrapFewShot \n", 1369 | "\n", 1370 | "\n", 1371 | "\n", 1372 | "Generates high-quality examples by executing your program and keeping only successful runs." 
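As a rough mental model (a simplified sketch, not DSPy's real implementation; the helper name is hypothetical), the bootstrapping step looks like the loop below: run the teacher program over training examples, keep only traces that the metric accepts, and stop once enough demonstrations have been collected.

```python
def bootstrap_demos_sketch(program, trainset, metric, max_bootstrapped_demos=4):
    demos = []
    for example in trainset:
        pred = program(**example.inputs())   # the teacher defaults to the program itself
        if metric(example, pred):            # e.g. validate_answer defined above
            demos.append({"tweet": example.tweet, "sentiment": pred.sentiment})
        if len(demos) >= max_bootstrapped_demos:
            break
    return demos

# bootstrap_demos_sketch(base_twt_sentiment, twitter_train, validate_answer)
# mirrors the "Bootstrapped 4 full traces after 4 examples" message printed below.
```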
1373 | ] 1374 | }, 1375 | { 1376 | "cell_type": "code", 1377 | "execution_count": 15, 1378 | "id": "c4e5087f-25d0-4c4d-b474-a50b1ba7e484", 1379 | "metadata": {}, 1380 | "outputs": [ 1381 | { 1382 | "name": "stderr", 1383 | "output_type": "stream", 1384 | "text": [ 1385 | " 4%|█▋ | 4/100 [00:00<00:00, 199.89it/s]" 1386 | ] 1387 | }, 1388 | { 1389 | "name": "stdout", 1390 | "output_type": "stream", 1391 | "text": [ 1392 | "Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.\n" 1393 | ] 1394 | }, 1395 | { 1396 | "name": "stderr", 1397 | "output_type": "stream", 1398 | "text": [ 1399 | "\n" 1400 | ] 1401 | } 1402 | ], 1403 | "source": [ 1404 | "from dspy.teleprompt import BootstrapFewShot\n", 1405 | "\n", 1406 | "bsfs_optimizer = BootstrapFewShot(\n", 1407 | " metric=validate_answer, # Function to evaluate quality\n", 1408 | " max_bootstrapped_demos=4, # Generated examples\n", 1409 | " max_labeled_demos=16, # Examples from training data\n", 1410 | " metric_threshold=1 # Minimum quality threshold\n", 1411 | ")\n", 1412 | "\n", 1413 | "bsfs_twt_sentiment = bsfs_optimizer.compile(base_twt_sentiment, trainset=twitter_train)" 1414 | ] 1415 | }, 1416 | { 1417 | "cell_type": "code", 1418 | "execution_count": 16, 1419 | "id": "dd4424fe-00e9-427d-83cb-43d76d46b73a", 1420 | "metadata": {}, 1421 | "outputs": [ 1422 | { 1423 | "name": "stdout", 1424 | "output_type": "stream", 1425 | "text": [ 1426 | "Bootstrap Few Shot Accuracy: 0.715\n" 1427 | ] 1428 | } 1429 | ], 1430 | "source": [ 1431 | "bsfs_scores = []\n", 1432 | "for x in twitter_test:\n", 1433 | " pred = bsfs_twt_sentiment(**x.inputs())\n", 1434 | " score = validate_answer(x, pred)\n", 1435 | " bsfs_scores.append(score)\n", 1436 | "\n", 1437 | "bsfs_accuracy = bsfs_scores.count(True) / len(bsfs_scores)\n", 1438 | "print(\"Bootstrap Few Shot Accuracy: \", bsfs_accuracy)" 1439 | ] 1440 | }, 1441 | { 1442 | "cell_type": "code", 1443 | "execution_count": 17, 1444 | "id": "98dfaacd-8b6b-419b-b143-4399b11766f2", 1445 | "metadata": {}, 1446 | "outputs": [], 1447 | "source": [ 1448 | "bsfs_twt_sentiment.save(\"./optimized/bsfs_twt_sentiment.json\")" 1449 | ] 1450 | }, 1451 | { 1452 | "cell_type": "code", 1453 | "execution_count": 135, 1454 | "id": "1295d6d0-0a3c-423e-927e-793d1f507f4d", 1455 | "metadata": {}, 1456 | "outputs": [ 1457 | { 1458 | "name": "stdout", 1459 | "output_type": "stream", 1460 | "text": [ 1461 | "positive\n" 1462 | ] 1463 | } 1464 | ], 1465 | "source": [ 1466 | "print(bsfs_twt_sentiment(tweet=example_tweet).sentiment)" 1467 | ] 1468 | }, 1469 | { 1470 | "cell_type": "markdown", 1471 | "id": "b8e406d2-f445-4395-87ed-20848f5bd5ba", 1472 | "metadata": {}, 1473 | "source": [ 1474 | "#### BootstrapFewShotWithRandomSearch\n", 1475 | "\n", 1476 | "\n", 1477 | "\n", 1478 | "Extends BootstrapFewShot by trying multiple random sets of examples to find the best performing combination."
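Conceptually, the random search reduces to the loop sketched below: build several bootstrapped candidates from shuffled demo sets, score each on a held-out split, and keep the winner. This is a simplified sketch rather than DSPy's implementation; the real optimizer also evaluates the uncompiled and LabeledFewShot baselines, and `valset` here is a hypothetical held-out list of examples, not something defined in this notebook.

```python
import random
from dspy.teleprompt import BootstrapFewShot

def random_search_sketch(base_program, trainset, valset, metric, num_candidate_programs=4):
    best_program, best_score = None, -1.0
    for seed in range(num_candidate_programs):
        # Each candidate bootstraps demos from a differently shuffled trainset
        shuffled = trainset[:]
        random.Random(seed).shuffle(shuffled)
        candidate = BootstrapFewShot(
            metric=metric, max_bootstrapped_demos=4, max_labeled_demos=16
        ).compile(base_program, trainset=shuffled)
        # Score the candidate program on the held-out split
        score = sum(metric(x, candidate(**x.inputs())) for x in valset) / len(valset)
        if score > best_score:
            best_program, best_score = candidate, score
    return best_program
```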
1479 | ] 1480 | }, 1481 | { 1482 | "cell_type": "code", 1483 | "execution_count": null, 1484 | "id": "e90ef5e8-6ae9-485a-bb04-150262146ab9", 1485 | "metadata": {}, 1486 | "outputs": [], 1487 | "source": [ 1488 | "from dspy.teleprompt import BootstrapFewShotWithRandomSearch\n", 1489 | "\n", 1490 | "bsfswrs_optimizer = BootstrapFewShotWithRandomSearch(\n", 1491 | " metric=validate_answer,\n", 1492 | " num_candidate_programs=16,\n", 1493 | " max_bootstrapped_demos=4,\n", 1494 | " max_labeled_demos=16\n", 1495 | ")\n", 1496 | "\n", 1497 | "bsfswrs_twt_sentiment = bsfswrs_optimizer.compile(base_twt_sentiment, trainset=twitter_train)" 1498 | ] 1499 | }, 1500 | { 1501 | "cell_type": "markdown", 1502 | "id": "e37f3146-8c97-4772-b739-39cf97409a0c", 1503 | "metadata": {}, 1504 | "source": [ 1505 | "" 1506 | ] 1507 | }, 1508 | { 1509 | "cell_type": "code", 1510 | "execution_count": 20, 1511 | "id": "0788556f-d7cb-4884-b7c2-b3f4fead1ff2", 1512 | "metadata": {}, 1513 | "outputs": [ 1514 | { 1515 | "name": "stdout", 1516 | "output_type": "stream", 1517 | "text": [ 1518 | "Bootstrap Few Shot With Random Search Accuracy: 0.7\n" 1519 | ] 1520 | } 1521 | ], 1522 | "source": [ 1523 | "bsfswrs_scores = []\n", 1524 | "for x in twitter_test:\n", 1525 | " pred = bsfswrs_twt_sentiment(**x.inputs())\n", 1526 | " score = validate_answer(x, pred)\n", 1527 | " bsfswrs_scores.append(score)\n", 1528 | "\n", 1529 | "bsfswrs_accuracy = bsfswrs_scores.count(True) / len(bsfswrs_scores)\n", 1530 | "print(\"Bootstrap Few Shot With Random Search Accuracy: \", bsfswrs_accuracy)" 1531 | ] 1532 | }, 1533 | { 1534 | "cell_type": "code", 1535 | "execution_count": 21, 1536 | "id": "d35f2dfa-a39c-4cd9-bdea-080866a8eafc", 1537 | "metadata": {}, 1538 | "outputs": [], 1539 | "source": [ 1540 | "bsfswrs_twt_sentiment.save(\"./optimized/bsfswrs_twt_sentiment.json\")" 1541 | ] 1542 | }, 1543 | { 1544 | "cell_type": "code", 1545 | "execution_count": 137, 1546 | "id": "e2a2e40e-0149-4460-9636-52558459de6b", 1547 | "metadata": {}, 1548 | "outputs": [ 1549 | { 1550 | "name": "stdout", 1551 | "output_type": "stream", 1552 | "text": [ 1553 | "positive\n" 1554 | ] 1555 | } 1556 | ], 1557 | "source": [ 1558 | "print(bsfswrs_twt_sentiment(tweet=example_tweet).sentiment)" 1559 | ] 1560 | }, 1561 | { 1562 | "cell_type": "markdown", 1563 | "id": "76d73c88-f8c9-49f1-996b-5efb5da74f64", 1564 | "metadata": {}, 1565 | "source": [ 1566 | "#### KNNFewShot\n", 1567 | "\n", 1568 | "\n", 1569 | "\n", 1570 | "Dynamically selects relevant examples based on similarity to the input." 1571 | ] 1572 | }, 1573 | { 1574 | "cell_type": "markdown", 1575 | "id": "6c02bc4e-c009-4865-aa38-417aaf3c8205", 1576 | "metadata": {}, 1577 | "source": [ 1578 | "**Defining an Embedding Function**\n", 1579 | "\n", 1580 | "As KNN retrieval relies on vector similarity, we need a quick embedding function. This is a very simple setup that uses OpenAI's api." 
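The embedding function itself is defined in the next cell. Before that, here is a rough sketch of what the KNN selection step does at inference time: embed the incoming tweet, rank training examples by cosine similarity, and hand the top-k matches to the bootstrapping step as its trainset. This is a conceptual illustration rather than DSPy's internal code, and `vectorizer` stands in for any embedding function like the one below.

```python
import numpy as np

def knn_select_demos_sketch(query_tweet, trainset, vectorizer, k=5):
    query_vec = vectorizer(query_tweet)
    # A real implementation would pre-compute and cache these embeddings
    train_vecs = vectorizer([ex.tweet for ex in trainset])
    # Cosine similarity between the query and every training tweet
    sims = train_vecs @ query_vec / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8
    )
    top_idx = np.argsort(-sims)[:k]
    return [trainset[i] for i in top_idx]   # nearest examples become the demo pool
```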
1581 | ] 1582 | }, 1583 | { 1584 | "cell_type": "code", 1585 | "execution_count": 22, 1586 | "id": "37169d2d-3136-4837-8270-833e698243d3", 1587 | "metadata": {}, 1588 | "outputs": [], 1589 | "source": [ 1590 | "from openai import OpenAI\n", 1591 | "import numpy as np\n", 1592 | "\n", 1593 | "client = OpenAI()\n", 1594 | "\n", 1595 | "def openai_embeddings(texts):\n", 1596 | " if isinstance(texts, str):\n", 1597 | " texts = [texts]\n", 1598 | " \n", 1599 | " response = client.embeddings.create(\n", 1600 | " model=\"text-embedding-3-small\",\n", 1601 | " input=texts\n", 1602 | " )\n", 1603 | " \n", 1604 | " # Convert to numpy array\n", 1605 | " embeddings = np.array([embedding.embedding for embedding in response.data], dtype=np.float32)\n", 1606 | " \n", 1607 | " # If single text, return single embedding\n", 1608 | " if len(embeddings) == 1:\n", 1609 | " return embeddings[0]\n", 1610 | " return embeddings" 1611 | ] 1612 | }, 1613 | { 1614 | "cell_type": "code", 1615 | "execution_count": 24, 1616 | "id": "2ebf5a88-c119-4626-b2cf-ceecfbc5f7bf", 1617 | "metadata": {}, 1618 | "outputs": [], 1619 | "source": [ 1620 | "from dspy.teleprompt import KNNFewShot\n", 1621 | "\n", 1622 | "knn_optimizer = KNNFewShot(\n", 1623 | " k=5, # Number of neighbors to use\n", 1624 | " trainset=twitter_train, # Dataset for finding neighbors\n", 1625 | " vectorizer=openai_embeddings # Function to convert inputs to vectors\n", 1626 | ")\n", 1627 | "\n", 1628 | "knn_twt_sentiment = knn_optimizer.compile(base_twt_sentiment, trainset=twitter_train)" 1629 | ] 1630 | }, 1631 | { 1632 | "cell_type": "code", 1633 | "execution_count": null, 1634 | "id": "0bb82a5c-62c9-4653-b27f-e19a3cd47563", 1635 | "metadata": { 1636 | "scrolled": true 1637 | }, 1638 | "outputs": [], 1639 | "source": [ 1640 | "knn_scores = []\n", 1641 | "for x in twitter_test:\n", 1642 | " pred = knn_twt_sentiment(**x.inputs())\n", 1643 | " score = validate_answer(x, pred)\n", 1644 | " knn_scores.append(score)" 1645 | ] 1646 | }, 1647 | { 1648 | "cell_type": "code", 1649 | "execution_count": 27, 1650 | "id": "05ed64de-325a-47d5-90e3-3f3afecad10a", 1651 | "metadata": {}, 1652 | "outputs": [ 1653 | { 1654 | "name": "stdout", 1655 | "output_type": "stream", 1656 | "text": [ 1657 | "KNN Few Shot Accuracy: 0.7\n" 1658 | ] 1659 | } 1660 | ], 1661 | "source": [ 1662 | "knn_accuracy = knn_scores.count(True) / len(knn_scores)\n", 1663 | "print(\"KNN Few Shot Accuracy: \", knn_accuracy)" 1664 | ] 1665 | }, 1666 | { 1667 | "cell_type": "code", 1668 | "execution_count": 28, 1669 | "id": "63a2ee91-273f-4395-9410-3bd002ef1517", 1670 | "metadata": {}, 1671 | "outputs": [], 1672 | "source": [ 1673 | "knn_twt_sentiment.save(\"./optimized/knn_twt_sentiment.json\")" 1674 | ] 1675 | }, 1676 | { 1677 | "cell_type": "code", 1678 | "execution_count": 139, 1679 | "id": "9fba2b9a-58ed-4228-8520-553f4eaa7290", 1680 | "metadata": {}, 1681 | "outputs": [ 1682 | { 1683 | "name": "stderr", 1684 | "output_type": "stream", 1685 | "text": [ 1686 | " 80%|████████████████████████████████████ | 4/5 [00:02<00:00, 1.69it/s]\n" 1687 | ] 1688 | }, 1689 | { 1690 | "name": "stdout", 1691 | "output_type": "stream", 1692 | "text": [ 1693 | "Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.\n", 1694 | "positive\n" 1695 | ] 1696 | } 1697 | ], 1698 | "source": [ 1699 | "print(knn_twt_sentiment(tweet=example_tweet).sentiment)" 1700 | ] 1701 | }, 1702 | { 1703 | "cell_type": "markdown", 1704 | "id": "6d2e86af-1c2e-4db3-af5d-5f289943e8f7", 1705 | "metadata": {}, 1706 
| "source": [ 1707 | "### Instruction Optimization\n", 1708 | "\n", 1709 | "\n", 1710 | "\n", 1711 | "These optimizers improve the actual instructions and prompts given to the model, enhancing zero-shot performance rather than the few-shot setups shown above." 1712 | ] 1713 | }, 1714 | { 1715 | "cell_type": "markdown", 1716 | "id": "e8f241e6-2a53-4536-99fb-aadb72d7d0c9", 1717 | "metadata": {}, 1718 | "source": [ 1719 | "#### COPRO (Coordinate Prompt Optimization)\n", 1720 | "\n", 1721 | "\n", 1722 | "\n", 1723 | "Generates and refines new instructions for each step, and optimizes them with coordinate ascent (hill-climbing using the metric function and the trainset). Parameters include depth which is the number of iterations of prompt improvement the optimizer runs over." 1724 | ] 1725 | }, 1726 | { 1727 | "cell_type": "code", 1728 | "execution_count": null, 1729 | "id": "7e9b82ca-a0c9-4138-9f47-a1c522cbf43d", 1730 | "metadata": { 1731 | "scrolled": true 1732 | }, 1733 | "outputs": [], 1734 | "source": [ 1735 | "from dspy.teleprompt import COPRO\n", 1736 | "\n", 1737 | "copro_optimizer = COPRO(\n", 1738 | " metric=validate_answer, # Metric to Optimize Against\n", 1739 | " prompt_model= dspy.LM('openai/gpt-4o'), # Different Model for Prompt Generation\n", 1740 | " breadth=10, # New prompts per iteration\n", 1741 | " depth=3, # Number of improvement rounds\n", 1742 | " init_temperature=1.4 # Creativity in generation\n", 1743 | ")\n", 1744 | "\n", 1745 | "copro_twt_sentiment = copro_optimizer.compile(base_twt_sentiment, trainset=twitter_train, eval_kwargs={'num_threads': 6, 'display_progress': True})" 1746 | ] 1747 | }, 1748 | { 1749 | "cell_type": "code", 1750 | "execution_count": 36, 1751 | "id": "83e3e525-c203-4972-b2f2-223938102aa1", 1752 | "metadata": {}, 1753 | "outputs": [ 1754 | { 1755 | "name": "stdout", 1756 | "output_type": "stream", 1757 | "text": [ 1758 | "COPRO Accuracy: 0.71\n" 1759 | ] 1760 | } 1761 | ], 1762 | "source": [ 1763 | "copro_scores = []\n", 1764 | "for x in twitter_test:\n", 1765 | " pred = copro_twt_sentiment(**x.inputs())\n", 1766 | " score = validate_answer(x, pred)\n", 1767 | " copro_scores.append(score)\n", 1768 | "\n", 1769 | "copro_accuracy = copro_scores.count(True) / len(copro_scores)\n", 1770 | "print(\"COPRO Accuracy: \", copro_accuracy)" 1771 | ] 1772 | }, 1773 | { 1774 | "cell_type": "code", 1775 | "execution_count": 37, 1776 | "id": "f2854fac-3397-47e2-abf1-67e83cd6d02b", 1777 | "metadata": {}, 1778 | "outputs": [], 1779 | "source": [ 1780 | "copro_twt_sentiment.save(\"./optimized/copro_twt_sentiment.json\")" 1781 | ] 1782 | }, 1783 | { 1784 | "cell_type": "code", 1785 | "execution_count": 141, 1786 | "id": "174c5403-2455-4d80-9cdb-66db486d6f29", 1787 | "metadata": {}, 1788 | "outputs": [ 1789 | { 1790 | "name": "stdout", 1791 | "output_type": "stream", 1792 | "text": [ 1793 | "positive\n" 1794 | ] 1795 | } 1796 | ], 1797 | "source": [ 1798 | "print(copro_twt_sentiment(tweet=example_tweet).sentiment)" 1799 | ] 1800 | }, 1801 | { 1802 | "cell_type": "markdown", 1803 | "id": "40895940-3c94-45c4-b445-195a557d53e4", 1804 | "metadata": {}, 1805 | "source": [ 1806 | "#### MIPROv2 (Multiprompt Instruction Proposal Optimizer Version 2)\n", 1807 | "\n", 1808 | "\n", 1809 | "\n", 1810 | "Generates instructions and few-shot examples in each step. The instruction generation is data-aware and demonstration-aware. Uses Bayesian Optimization to effectively search over the space of generation instructions/demonstrations across your modules."
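To ground what these instruction optimizers are searching over, here is a heavily simplified sketch: propose candidate instruction strings, score each resulting program on the trainset with the metric, and keep the best. COPRO wraps this idea in iterative coordinate ascent, while MIPROv2 replaces the naive loop with Bayesian optimization and also proposes demos. Treat the `deepcopy()` and `with_instructions()` calls as assumptions about the signature API, used purely for illustration; the real optimizers manage instructions per module internally.

```python
def avg_metric(program, dataset, metric):
    # Fraction of examples where the program's prediction passes the metric
    return sum(metric(x, program(**x.inputs())) for x in dataset) / len(dataset)

def instruction_search_sketch(base_program, candidate_instructions, trainset, metric):
    best_program = base_program
    best_score = avg_metric(base_program, trainset, metric)
    for instruction in candidate_instructions:
        candidate = base_program.deepcopy()
        # Assumed API for illustration: swap the instruction text on the copied signature
        candidate.signature = candidate.signature.with_instructions(instruction)
        score = avg_metric(candidate, trainset, metric)
        if score > best_score:
            best_program, best_score = candidate, score
    return best_program
```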
1811 | ] 1812 | }, 1813 | { 1814 | "cell_type": "code", 1815 | "execution_count": null, 1816 | "id": "03a3594f-fa3d-4fbf-bf34-121ebd79bfab", 1817 | "metadata": { 1818 | "scrolled": true 1819 | }, 1820 | "outputs": [], 1821 | "source": [ 1822 | "from dspy.teleprompt import MIPROv2\n", 1823 | "\n", 1824 | "mipro_optimizer = MIPROv2(\n", 1825 | " metric=validate_answer,\n", 1826 | " prompt_model= dspy.LM('openai/gpt-4o'), # Different Model for Prompt Generation\n", 1827 | " num_candidates=10, # Instructions to try\n", 1828 | ")\n", 1829 | "\n", 1830 | "mipro_twt_sentiment = mipro_optimizer.compile(base_twt_sentiment, trainset=twitter_train, valset=twitter_test)" 1831 | ] 1832 | }, 1833 | { 1834 | "cell_type": "code", 1835 | "execution_count": 62, 1836 | "id": "d07efad0-e1c1-4c2e-a233-91d53fa7e0ce", 1837 | "metadata": {}, 1838 | "outputs": [ 1839 | { 1840 | "name": "stdout", 1841 | "output_type": "stream", 1842 | "text": [ 1843 | "MIPRO Accuracy: 0.715\n" 1844 | ] 1845 | } 1846 | ], 1847 | "source": [ 1848 | "mipro_scores = []\n", 1849 | "for x in twitter_test:\n", 1850 | " pred = mipro_twt_sentiment(**x.inputs())\n", 1851 | " score = validate_answer(x, pred)\n", 1852 | " mipro_scores.append(score)\n", 1853 | "\n", 1854 | "mipro_accuracy = mipro_scores.count(True) / len(mipro_scores)\n", 1855 | "print(\"MIPRO Accuracy: \", mipro_accuracy)" 1856 | ] 1857 | }, 1858 | { 1859 | "cell_type": "code", 1860 | "execution_count": 63, 1861 | "id": "e303d3b1-6a3b-4056-b27a-8612f8fc4071", 1862 | "metadata": {}, 1863 | "outputs": [], 1864 | "source": [ 1865 | "mipro_twt_sentiment.save(\"./optimized/mipro_twt_sentiment.json\")" 1866 | ] 1867 | }, 1868 | { 1869 | "cell_type": "code", 1870 | "execution_count": 153, 1871 | "id": "7de48564-3c5d-4d91-a5f9-f175870c3065", 1872 | "metadata": {}, 1873 | "outputs": [ 1874 | { 1875 | "name": "stdout", 1876 | "output_type": "stream", 1877 | "text": [ 1878 | "positive\n" 1879 | ] 1880 | } 1881 | ], 1882 | "source": [ 1883 | "print(mipro_twt_sentiment(tweet=example_tweet).sentiment)" 1884 | ] 1885 | }, 1886 | { 1887 | "cell_type": "markdown", 1888 | "id": "abcc4488-71b8-4480-a04e-39816bcf07ee", 1889 | "metadata": {}, 1890 | "source": [ 1891 | "### Automatic Finetuning\n", 1892 | "\n", 1893 | "\n", 1894 | "\n", 1895 | "Once you have a well optimized program, you may want to start looking for even further optimizations. Ideally, you would use a large and expensive model first to get the best performance, then transfer that knowledge to an optimized smaller model (or continually train an existing model)\n", 1896 | "\n", 1897 | "DSPy offers a solution to automatically use your best programs to create training data for downstream finetuning with `BootstrapFinetune`." 1898 | ] 1899 | }, 1900 | { 1901 | "cell_type": "markdown", 1902 | "id": "bd130178-ddcb-4af8-8f7d-0a860b55211a", 1903 | "metadata": {}, 1904 | "source": [ 1905 | "#### BootstrapFinetune\n", 1906 | "\n", 1907 | "\n", 1908 | "\n", 1909 | "Creates fine-tuned versions of language models based on successful program executions. In this example we'll instill our best performing program from MIPROv2 directly into gpt-4o-mini." 
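A simplified sketch of the data-collection half of this process is below: run the teacher over the trainset, keep only metric-passing traces, and serialize them as chat-style fine-tuning records. The exact message formatting is an assumption for illustration, and the provider-specific upload and training steps that appear in the logs further down are omitted.

```python
import json

def build_finetune_records_sketch(teacher, trainset, metric):
    records = []
    for example in trainset:
        pred = teacher(**example.inputs())
        if not metric(example, pred):
            continue   # this filtering is why only 362 of 500 examples remain in the run below
        records.append({
            "messages": [
                {"role": "user", "content": f"Classify the sentiment of this tweet: {example.tweet}"},
                {"role": "assistant", "content": pred.sentiment},
            ]
        })
    return records

# Hypothetical output path for the resulting JSONL:
# with open("bootstrap_finetune_data.jsonl", "w") as f:
#     f.writelines(json.dumps(r) + "\n" for r in records)
```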
1910 | ] 1911 | }, 1912 | { 1913 | "cell_type": "code", 1914 | "execution_count": 64, 1915 | "id": "30dcedb7-58c0-4ac7-9e2d-ec2d73534b50", 1916 | "metadata": {}, 1917 | "outputs": [], 1918 | "source": [ 1919 | "dspy.settings.experimental = True" 1920 | ] 1921 | }, 1922 | { 1923 | "cell_type": "markdown", 1924 | "id": "2fa389d0-9a6a-4813-85c4-d4219c34e663", 1925 | "metadata": {}, 1926 | "source": [ 1927 | "**Grabbing some additional data**" 1928 | ] 1929 | }, 1930 | { 1931 | "cell_type": "code", 1932 | "execution_count": 82, 1933 | "id": "3bba076e-0d87-4758-9283-74f7808eec62", 1934 | "metadata": { 1935 | "jupyter": { 1936 | "source_hidden": true 1937 | } 1938 | }, 1939 | "outputs": [], 1940 | "source": [ 1941 | "import json\n", 1942 | "\n", 1943 | "# Formatting Examples\n", 1944 | "bsft_twitter_train = []\n", 1945 | "bsft_twitter_test = []\n", 1946 | "train_size = 500 # how many for train \n", 1947 | "test_size = 200 # how many for test\n", 1948 | "\n", 1949 | "with open(\"./datasets/tweets.jsonl\", 'r', encoding='utf-8') as f:\n", 1950 | " for i, line in enumerate(f):\n", 1951 | " if i >= (train_size + test_size):\n", 1952 | " break\n", 1953 | " \n", 1954 | " data = json.loads(line.strip())\n", 1955 | " example = dspy.Example(\n", 1956 | " tweet=data['text'],\n", 1957 | " sentiment=data['label_text']\n", 1958 | " ).with_inputs(\"tweet\")\n", 1959 | " \n", 1960 | " if i < train_size:\n", 1961 | " bsft_twitter_train.append(example)\n", 1962 | " else:\n", 1963 | " bsft_twitter_test.append(example)" 1964 | ] 1965 | }, 1966 | { 1967 | "cell_type": "markdown", 1968 | "id": "228c3889-ec33-40b5-9d07-a9214a3d0dd0", 1969 | "metadata": {}, 1970 | "source": [ 1971 | "**Teacher and Student**\n", 1972 | "\n", 1973 | "At its core, `BootstrapFinetune` uses our best optimized program to create training data for fine-tuning a language model. As such, we need a teacher program that will be run across our data to create the examples, and a student program with a target model to be fine-tuned."
1974 | ] 1975 | }, 1976 | { 1977 | "cell_type": "code", 1978 | "execution_count": 83, 1979 | "id": "88c88c7a-a3cc-4049-b502-55f820ee43ad", 1980 | "metadata": {}, 1981 | "outputs": [], 1982 | "source": [ 1983 | "# First make a deep copy of your optimized MIPRO program as the teacher\n", 1984 | "teacher = mipro_twt_sentiment.deepcopy()\n", 1985 | "\n", 1986 | "# Create student as a copy but with your target model\n", 1987 | "student = mipro_twt_sentiment.deepcopy()\n", 1988 | "student.set_lm(dspy.LM(\"gpt-4o-mini-2024-07-18\")) # e.g., mistral or whatever model you want to fine-tune" 1989 | ] 1990 | }, 1991 | { 1992 | "cell_type": "code", 1993 | "execution_count": 84, 1994 | "id": "ae4f2657-2964-46f7-a2f3-68b2e94923d7", 1995 | "metadata": {}, 1996 | "outputs": [ 1997 | { 1998 | "name": "stdout", 1999 | "output_type": "stream", 2000 | "text": [ 2001 | "[BootstrapFinetune] Preparing the student and teacher programs...\n", 2002 | "[BootstrapFinetune] Bootstrapping data...\n", 2003 | "Average Metric: 362.00 / 500 (72.4%): 100%|██| 500/500 [00:00<00:00, 628.33it/s]\n" 2004 | ] 2005 | }, 2006 | { 2007 | "name": "stderr", 2008 | "output_type": "stream", 2009 | "text": [ 2010 | "2024/12/30 00:58:17 INFO dspy.evaluate.evaluate: Average Metric: 362 / 500 (72.4%)\n" 2011 | ] 2012 | }, 2013 | { 2014 | "name": "stdout", 2015 | "output_type": "stream", 2016 | "text": [ 2017 | "[BootstrapFinetune] Preparing the train data...\n", 2018 | "[BootstrapFinetune] Collected data for 500 examples\n", 2019 | "[BootstrapFinetune] After filtering with the metric, 362 examples remain\n", 2020 | "[BootstrapFinetune] Using 362 data points for fine-tuning the model: gpt-4o-mini-2024-07-18\n", 2021 | "[BootstrapFinetune] Starting LM fine-tuning...\n", 2022 | "[BootstrapFinetune] 1 fine-tuning job(s) to start\n", 2023 | "[BootstrapFinetune] Starting 1 fine-tuning job(s)...\n", 2024 | "[OpenAI Provider] Validating the data format\n", 2025 | "[OpenAI Provider] Saving the data to a file\n", 2026 | "[OpenAI Provider] Data saved to /Users/adamlucek/.dspy_cache/finetune/798b39e1a18373a3.jsonl\n", 2027 | "[OpenAI Provider] Uploading the data to the provider\n", 2028 | "[OpenAI Provider] Starting remote training\n", 2029 | "[OpenAI Provider] Job started with the OpenAI Job ID ftjob-L8D3vni8wlEyuCOAhIgzuFHF\n", 2030 | "[OpenAI Provider] Waiting for training to complete\n", 2031 | "[OpenAI Provider] 2024-12-30 00:58:23 Validating training file: file-Sh4DqQsYEY5UaqJEHGy37y\n", 2032 | "[OpenAI Provider] 2024-12-30 01:02:36 Fine-tuning job started\n", 2033 | "[OpenAI Provider] The OpenAI estimated time remaining is: 0:09:13.388291\n", 2034 | "[OpenAI Provider] 2024-12-30 01:05:02 Step 11/1086: training loss=0.14\n", 2035 | "[OpenAI Provider] 2024-12-30 01:05:27 Step 32/1086: training loss=0.00\n", 2036 | "[OpenAI Provider] 2024-12-30 01:05:45 Step 53/1086: training loss=0.00\n", 2037 | "[OpenAI Provider] 2024-12-30 01:06:01 Step 66/1086: training loss=0.00\n", 2038 | "[OpenAI Provider] 2024-12-30 01:06:29 Step 94/1086: training loss=0.00\n", 2039 | "[OpenAI Provider] 2024-12-30 01:06:44 Step 110/1086: training loss=0.00\n", 2040 | "[OpenAI Provider] 2024-12-30 01:07:10 Step 132/1086: training loss=0.00\n", 2041 | "[OpenAI Provider] 2024-12-30 01:07:28 Step 152/1086: training loss=0.00\n", 2042 | "[OpenAI Provider] 2024-12-30 01:07:53 Step 177/1086: training loss=0.00\n", 2043 | "[OpenAI Provider] 2024-12-30 01:08:12 Step 194/1086: training loss=0.00\n", 2044 | "[OpenAI Provider] 2024-12-30 01:08:28 Step 211/1086: training loss=0.00\n", 
2045 | "[OpenAI Provider] 2024-12-30 01:08:53 Step 232/1086: training loss=0.00\n", 2046 | "[OpenAI Provider] 2024-12-30 01:09:13 Step 253/1086: training loss=0.00\n", 2047 | "[OpenAI Provider] 2024-12-30 01:09:38 Step 279/1086: training loss=0.00\n", 2048 | "[OpenAI Provider] 2024-12-30 01:09:56 Step 297/1086: training loss=0.00\n", 2049 | "[OpenAI Provider] 2024-12-30 01:10:18 Step 317/1086: training loss=0.00\n", 2050 | "[OpenAI Provider] 2024-12-30 01:10:38 Step 334/1086: training loss=0.00\n", 2051 | "[OpenAI Provider] 2024-12-30 01:10:56 Step 353/1086: training loss=0.00\n", 2052 | "[OpenAI Provider] 2024-12-30 01:11:05 Step 361/1086: training loss=0.00\n", 2053 | "[OpenAI Provider] 2024-12-30 01:11:41 Step 363/1086: training loss=0.00\n", 2054 | "[OpenAI Provider] 2024-12-30 01:11:57 Step 375/1086: training loss=0.00\n", 2055 | "[OpenAI Provider] 2024-12-30 01:12:26 Step 403/1086: training loss=0.00\n", 2056 | "[OpenAI Provider] 2024-12-30 01:12:44 Step 421/1086: training loss=0.00\n", 2057 | "[OpenAI Provider] 2024-12-30 01:13:00 Step 437/1086: training loss=0.00\n", 2058 | "[OpenAI Provider] 2024-12-30 01:13:25 Step 457/1086: training loss=0.00\n", 2059 | "[OpenAI Provider] 2024-12-30 01:13:43 Step 478/1086: training loss=0.00\n", 2060 | "[OpenAI Provider] 2024-12-30 01:14:09 Step 501/1086: training loss=0.00\n", 2061 | "[OpenAI Provider] 2024-12-30 01:14:26 Step 519/1086: training loss=0.00\n", 2062 | "[OpenAI Provider] 2024-12-30 01:14:52 Step 543/1086: training loss=0.00\n", 2063 | "[OpenAI Provider] 2024-12-30 01:15:07 Step 556/1086: training loss=0.00\n", 2064 | "[OpenAI Provider] 2024-12-30 01:15:34 Step 583/1086: training loss=0.00\n", 2065 | "[OpenAI Provider] 2024-12-30 01:15:50 Step 599/1086: training loss=0.00\n", 2066 | "[OpenAI Provider] 2024-12-30 01:16:16 Step 621/1086: training loss=0.00\n", 2067 | "[OpenAI Provider] 2024-12-30 01:16:34 Step 641/1086: training loss=0.00\n", 2068 | "[OpenAI Provider] 2024-12-30 01:16:59 Step 665/1086: training loss=0.00\n", 2069 | "[OpenAI Provider] 2024-12-30 01:17:17 Step 683/1086: training loss=0.01\n", 2070 | "[OpenAI Provider] 2024-12-30 01:17:32 Step 699/1086: training loss=0.00\n", 2071 | "[OpenAI Provider] 2024-12-30 01:17:57 Step 720/1086: training loss=0.00\n", 2072 | "[OpenAI Provider] 2024-12-30 01:18:41 Step 740/1086: training loss=0.00\n", 2073 | "[OpenAI Provider] 2024-12-30 01:18:59 Step 758/1086: training loss=0.00\n", 2074 | "[OpenAI Provider] 2024-12-30 01:19:17 Step 772/1086: training loss=0.00\n", 2075 | "[OpenAI Provider] 2024-12-30 01:19:42 Step 796/1086: training loss=0.00\n", 2076 | "[OpenAI Provider] 2024-12-30 01:20:01 Step 816/1086: training loss=0.00\n", 2077 | "[OpenAI Provider] 2024-12-30 01:20:21 Step 835/1086: training loss=0.03\n", 2078 | "[OpenAI Provider] 2024-12-30 01:20:46 Step 857/1086: training loss=0.00\n", 2079 | "[OpenAI Provider] 2024-12-30 01:21:04 Step 877/1086: training loss=0.00\n", 2080 | "[OpenAI Provider] 2024-12-30 01:21:30 Step 901/1086: training loss=0.00\n", 2081 | "[OpenAI Provider] 2024-12-30 01:21:47 Step 919/1086: training loss=0.00\n", 2082 | "[OpenAI Provider] 2024-12-30 01:22:04 Step 934/1086: training loss=0.00\n", 2083 | "[OpenAI Provider] 2024-12-30 01:22:29 Step 957/1086: training loss=0.00\n", 2084 | "[OpenAI Provider] 2024-12-30 01:22:47 Step 977/1086: training loss=0.00\n", 2085 | "[OpenAI Provider] 2024-12-30 01:23:13 Step 1001/1086: training loss=0.00\n", 2086 | "[OpenAI Provider] 2024-12-30 01:23:30 Step 1019/1086: training loss=0.00\n", 2087 | "[OpenAI 
Provider] 2024-12-30 01:23:56 Step 1043/1086: training loss=0.00\n", 2088 | "[OpenAI Provider] 2024-12-30 01:24:12 Step 1056/1086: training loss=0.00\n", 2089 | "[OpenAI Provider] 2024-12-30 01:24:39 Step 1085/1086: training loss=0.00\n", 2090 | "[OpenAI Provider] Attempting to retrieve the trained model\n", 2091 | "[OpenAI Provider] Model retrieved: ft:gpt-4o-mini-2024-07-18:personal::AjxtAzey\n", 2092 | "[BootstrapFinetune] Job 1/1 is done\n", 2093 | "[BootstrapFinetune] Updating the student program with the fine-tuned LMs...\n", 2094 | "[BootstrapFinetune] BootstrapFinetune has finished compiling the student program\n" 2095 | ] 2096 | } 2097 | ], 2098 | "source": [ 2099 | "from dspy.teleprompt import BootstrapFinetune\n", 2100 | "\n", 2101 | "bsft_optimizer = BootstrapFinetune(\n", 2102 | " metric=validate_answer, # Used to filter training data\n", 2103 | " num_threads=16 # For parallel processing\n", 2104 | ")\n", 2105 | "\n", 2106 | "bsft_twt_sentiment = bsft_optimizer.compile(\n", 2107 | " student=student,\n", 2108 | " trainset=bsft_twitter_train,\n", 2109 | " teacher=teacher\n", 2110 | ")" 2111 | ] 2112 | }, 2113 | { 2114 | "cell_type": "code", 2115 | "execution_count": 86, 2116 | "id": "ecf21fde-43c7-4f83-99a5-32aee4cd014a", 2117 | "metadata": {}, 2118 | "outputs": [ 2119 | { 2120 | "name": "stdout", 2121 | "output_type": "stream", 2122 | "text": [ 2123 | "Bootstrap Fine Tune Accuracy: 0.725\n" 2124 | ] 2125 | } 2126 | ], 2127 | "source": [ 2128 | "bsft_scores = []\n", 2129 | "for x in bsft_twitter_test:\n", 2130 | " pred = bsft_twt_sentiment(**x.inputs())\n", 2131 | " score = validate_answer(x, pred)\n", 2132 | " bsft_scores.append(score)\n", 2133 | "\n", 2134 | "bsft_accuracy = bsft_scores.count(True) / len(bsft_scores)\n", 2135 | "print(\"Bootstrap Fine Tune Accuracy: \", bsft_accuracy)" 2136 | ] 2137 | }, 2138 | { 2139 | "cell_type": "code", 2140 | "execution_count": 85, 2141 | "id": "cdbce889-116d-4000-ae0e-03ac488f107f", 2142 | "metadata": {}, 2143 | "outputs": [], 2144 | "source": [ 2145 | "bsft_twt_sentiment.save(\"./optimized/bsft_twt_sentiment.pkl\")" 2146 | ] 2147 | }, 2148 | { 2149 | "cell_type": "code", 2150 | "execution_count": 151, 2151 | "id": "bb535b64-324e-48bd-97af-757cd8e33aaf", 2152 | "metadata": {}, 2153 | "outputs": [ 2154 | { 2155 | "name": "stdout", 2156 | "output_type": "stream", 2157 | "text": [ 2158 | "positive\n" 2159 | ] 2160 | } 2161 | ], 2162 | "source": [ 2163 | "print(bsft_twt_sentiment(tweet=example_tweet).sentiment)" 2164 | ] 2165 | }, 2166 | { 2167 | "cell_type": "markdown", 2168 | "id": "19dcfe42-f555-4027-9f37-867e9432e7af", 2169 | "metadata": {}, 2170 | "source": [ 2171 | "### Choosing an Optimizer\n", 2172 | "\n", 2173 | "From DSPy's [Documentation](https://dspy.ai/learn/optimization/optimizers):\n", 2174 | "\n", 2175 | "- If you have very few examples (around 10), start with `BootstrapFewShot`.\n", 2176 | "- If you have more data (50 examples or more), try `BootstrapFewShotWithRandomSearch`.\n", 2177 | "- If you prefer to do instruction optimization only (i.e. you want to keep your prompt 0-shot), use `MIPROv2` configured for 0-shot optimization to optimize.\n", 2178 | "- If you’re willing to use more inference calls to perform longer optimization runs (e.g. 40 trials or more), and have enough data (e.g. 
200 examples or more to prevent overfitting) then try `MIPROv2`.\n", 2179 | "- If you have been able to use one of these with a large LM (e.g., 7B parameters or above) and need a very efficient program, finetune a small LM for your task with `BootstrapFinetune`.\n", 2180 | "\n", 2181 | "Can't choose one? Try the [Ensemble](https://github.com/stanfordnlp/dspy/blob/main/dspy/teleprompt/ensemble.py) compiler to combine multiple optimized programs together, then process the outputs in some way (e.g. majority vote, weighted majority, etc.) to get to a final output! \n", 2182 | "\n", 2183 | "" 2184 | ] 2185 | }, 2186 | { 2187 | "cell_type": "markdown", 2188 | "id": "405d6fa5-8189-47f2-9841-ae76f48abae2", 2189 | "metadata": {}, 2190 | "source": [ 2191 | "### Optimizing Optimized Programs\n", 2192 | "\n", 2193 | "As emphasized, running just one iteration of optimization is usually not enough. Iterate on your metrics, your programs, and the way metrics are used inside your programs!\n", 2194 | "\n", 2195 | "DSPy has a built-in optimizer that encourages this, **[BetterTogether](https://github.com/stanfordnlp/dspy/blob/main/dspy/teleprompt/bettertogether.py)**\n", 2196 | "\n", 2197 | "\n", 2198 | "\n", 2199 | "But we'll go ahead and do it manually to see if it makes a difference!" 2200 | ] 2201 | }, 2202 | { 2203 | "cell_type": "markdown", 2204 | "id": "192e8175-d747-4ce7-b6e6-788e87c3001c", 2205 | "metadata": {}, 2206 | "source": [ 2207 | "**Grabbing Unseen Data**" 2208 | ] 2209 | }, 2210 | { 2211 | "cell_type": "code", 2212 | "execution_count": 100, 2213 | "id": "79112317-00eb-4b07-b9b5-e4e74c71be5b", 2214 | "metadata": { 2215 | "jupyter": { 2216 | "source_hidden": true 2217 | } 2218 | }, 2219 | "outputs": [], 2220 | "source": [ 2221 | "import json\n", 2222 | "# Formatting Examples\n", 2223 | "final_twitter_train = []\n", 2224 | "final_twitter_test = []\n", 2225 | "train_size = 300 # how many for train \n", 2226 | "test_size = 500 # how many for test\n", 2227 | "start_row = 1500 # start reading from this row\n", 2228 | "\n", 2229 | "with open(\"./datasets/tweets.jsonl\", 'r', encoding='utf-8') as f:\n", 2230 | " for i, line in enumerate(f):\n", 2231 | " # Skip until we reach start_row\n", 2232 | " if i < start_row:\n", 2233 | " continue\n", 2234 | " \n", 2235 | " # Adjust the index for our collection logic\n", 2236 | " collection_index = i - start_row\n", 2237 | " \n", 2238 | " if collection_index >= (train_size + test_size):\n", 2239 | " break\n", 2240 | " \n", 2241 | " data = json.loads(line.strip())\n", 2242 | " example = dspy.Example(\n", 2243 | " tweet=data['text'],\n", 2244 | " sentiment=data['label_text']\n", 2245 | " ).with_inputs(\"tweet\")\n", 2246 | " \n", 2247 | " if collection_index < train_size:\n", 2248 | " final_twitter_train.append(example)\n", 2249 | " else:\n", 2250 | " final_twitter_test.append(example)" 2251 | ] 2252 | }, 2253 | { 2254 | "cell_type": "markdown", 2255 | "id": "85cdaeb4-a4f9-4947-9e63-b96b716dff0b", 2256 | "metadata": {}, 2257 | "source": [ 2258 | "**Optimizing our Fine Tuned Program with MIPROv2**" 2259 | ] 2260 | }, 2261 | { 2262 | "cell_type": "code", 2263 | "execution_count": null, 2264 | "id": "c3b9117e-5710-4f97-be8b-e920a7a0e8b3", 2265 | "metadata": { 2266 | "scrolled": true 2267 | }, 2268 | "outputs": [], 2269 | "source": [ 2270 | "mipro_optimizer = MIPROv2(\n", 2271 | " metric=validate_answer,\n", 2272 | " prompt_model= dspy.LM('openai/gpt-4o'), # Different Model for Prompt Generation\n", 2273 | " num_candidates=10, # Instructions to try\n", 2274 | ")\n", 2275 | "\n", 2276 |
"mipro_bsft_twt_sentiment = mipro_optimizer.compile(bsft_twt_sentiment, trainset=final_twitter_train, valset=final_twitter_test)" 2277 | ] 2278 | }, 2279 | { 2280 | "cell_type": "code", 2281 | "execution_count": 103, 2282 | "id": "038447be-c965-4b64-9183-09a46a21802e", 2283 | "metadata": {}, 2284 | "outputs": [ 2285 | { 2286 | "name": "stdout", 2287 | "output_type": "stream", 2288 | "text": [ 2289 | "MIPROv2 After Bootstrap Fine Tune Accuracy: 0.744\n" 2290 | ] 2291 | } 2292 | ], 2293 | "source": [ 2294 | "final_scores = []\n", 2295 | "for x in final_twitter_test:\n", 2296 | " pred = mipro_bsft_twt_sentiment(**x.inputs())\n", 2297 | " score = validate_answer(x, pred)\n", 2298 | " final_scores.append(score)\n", 2299 | "\n", 2300 | "mipro_bsft_accuracy = final_scores.count(True) / len(final_scores)\n", 2301 | "print(\"MIPROv2 After Bootstrap Fine Tune Accuracy: \", mipro_bsft_accuracy)" 2302 | ] 2303 | }, 2304 | { 2305 | "cell_type": "code", 2306 | "execution_count": 98, 2307 | "id": "937124eb-a41a-4ade-bc83-7bd63fd919c3", 2308 | "metadata": {}, 2309 | "outputs": [], 2310 | "source": [ 2311 | "mipro_bsft_twt_sentiment.save(\"./optimized/mipro_bsft_twt_sentiment.pkl\")" 2312 | ] 2313 | }, 2314 | { 2315 | "cell_type": "code", 2316 | "execution_count": 149, 2317 | "id": "4bffb932-7700-404d-8159-86e65337829c", 2318 | "metadata": {}, 2319 | "outputs": [ 2320 | { 2321 | "name": "stdout", 2322 | "output_type": "stream", 2323 | "text": [ 2324 | "positive\n" 2325 | ] 2326 | } 2327 | ], 2328 | "source": [ 2329 | "print(mipro_bsft_twt_sentiment(tweet=example_tweet).sentiment)" 2330 | ] 2331 | }, 2332 | { 2333 | "cell_type": "markdown", 2334 | "id": "db021148-d0b7-4695-8b5b-51d133ae3f79", 2335 | "metadata": {}, 2336 | "source": [ 2337 | "---\n", 2338 | "## Final Thoughts" 2339 | ] 2340 | }, 2341 | { 2342 | "cell_type": "markdown", 2343 | "id": "32bba4b2-c462-407f-a1c5-bd21570738ab", 2344 | "metadata": {}, 2345 | "source": [ 2346 | "Check out DSPy's [official documentation](https://dspy.ai/), which this notebook is essentially a code forward exploration of. They have plenty more [tutorials](https://dspy.ai/tutorials/) and [guides](https://dspy.ai/learn/) that are actively being updated as part of their latest (Dec 2024) release!\n", 2347 | "\n", 2348 | "Overall DSPy provides an interesting approach to applying language models within programs, abstracting away from trial and error via prompting by adding rigour around clear metric definition and optimization. Rather than work with difficult to interpret or tune text strings, they offer a clean base template that can be further optimized through algorithmic approaches, applying automated ways to coordinate or generate few shot examples, directly change the instructions given to the LLM, or a combination of the two.\n", 2349 | "\n", 2350 | "Inspired by deep learning frameworks, DSPy offers a powerful way to reliably optimize and iterate on LLM applications in a systematic and controlled way, with the entire ecosystem growing by the day. Go give [the DSPy repo](https://github.com/stanfordnlp/dspy/tree/main) a star!" 
2351 | ] 2352 | }, 2353 | { 2354 | "cell_type": "code", 2355 | "execution_count": null, 2356 | "id": "948f1044-3022-49a5-a5ac-07fd5010eb00", 2357 | "metadata": {}, 2358 | "outputs": [], 2359 | "source": [] 2360 | } 2361 | ], 2362 | "metadata": { 2363 | "kernelspec": { 2364 | "display_name": "Python 3 (ipykernel)", 2365 | "language": "python", 2366 | "name": "python3" 2367 | }, 2368 | "language_info": { 2369 | "codemirror_mode": { 2370 | "name": "ipython", 2371 | "version": 3 2372 | }, 2373 | "file_extension": ".py", 2374 | "mimetype": "text/x-python", 2375 | "name": "python", 2376 | "nbconvert_exporter": "python", 2377 | "pygments_lexer": "ipython3", 2378 | "version": "3.12.0" 2379 | } 2380 | }, 2381 | "nbformat": 4, 2382 | "nbformat_minor": 5 2383 | } 2384 | -------------------------------------------------------------------------------- /media/advan_metrics.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/advan_metrics.png -------------------------------------------------------------------------------- /media/auto_fewshot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/auto_fewshot.png -------------------------------------------------------------------------------- /media/auto_ft.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/auto_ft.png -------------------------------------------------------------------------------- /media/auto_instr.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/auto_instr.png -------------------------------------------------------------------------------- /media/better_together.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/better_together.png -------------------------------------------------------------------------------- /media/bootstrap_fewshot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/bootstrap_fewshot.png -------------------------------------------------------------------------------- /media/bootstrap_finetune_diagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/bootstrap_finetune_diagram.png -------------------------------------------------------------------------------- /media/bsfswrs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/bsfswrs.png -------------------------------------------------------------------------------- /media/bsfswrs_diagram.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/bsfswrs_diagram.png -------------------------------------------------------------------------------- /media/copro_diagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/copro_diagram.png -------------------------------------------------------------------------------- /media/cot_module.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/cot_module.png -------------------------------------------------------------------------------- /media/dspy.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/dspy.png -------------------------------------------------------------------------------- /media/dspy_workflow.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/dspy_workflow.png -------------------------------------------------------------------------------- /media/ensemble_diagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/ensemble_diagram.png -------------------------------------------------------------------------------- /media/input_type.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/input_type.png -------------------------------------------------------------------------------- /media/inter_metrics.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/inter_metrics.png -------------------------------------------------------------------------------- /media/knn_diagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/knn_diagram.png -------------------------------------------------------------------------------- /media/labeled_few_shot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/labeled_few_shot.png -------------------------------------------------------------------------------- /media/majority.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/majority.png -------------------------------------------------------------------------------- /media/mermaid.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/mermaid.png 
-------------------------------------------------------------------------------- /media/metrics.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/metrics.png -------------------------------------------------------------------------------- /media/mipro_diagram.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/mipro_diagram.png -------------------------------------------------------------------------------- /media/modules.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/modules.png -------------------------------------------------------------------------------- /media/multi_chain.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/multi_chain.png -------------------------------------------------------------------------------- /media/multiple_signature.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/multiple_signature.png -------------------------------------------------------------------------------- /media/optimizers.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/optimizers.png -------------------------------------------------------------------------------- /media/program_of_thought.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/program_of_thought.png -------------------------------------------------------------------------------- /media/program_transform.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/program_transform.png -------------------------------------------------------------------------------- /media/react.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/react.png -------------------------------------------------------------------------------- /media/signatures.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/signatures.png -------------------------------------------------------------------------------- /media/simple_metrics.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/media/simple_metrics.png -------------------------------------------------------------------------------- /optimized/bsfs_twt_sentiment.json: 
-------------------------------------------------------------------------------- 1 | { 2 | "lm": null, 3 | "traces": [], 4 | "train": [], 5 | "demos": [ 6 | { 7 | "augmented": true, 8 | "tweet": "Last session of the day http:\/\/twitpic.com\/67ezh", 9 | "sentiment": "neutral" 10 | }, 11 | { 12 | "augmented": true, 13 | "tweet": " Shanghai is also really exciting (precisely -- skyscrapers galore). Good tweeps in China: (SH) (BJ).", 14 | "sentiment": "positive" 15 | }, 16 | { 17 | "augmented": true, 18 | "tweet": "Recession hit Veronique Branquinho, she has to quit her company, such a shame!", 19 | "sentiment": "negative" 20 | }, 21 | { 22 | "augmented": true, 23 | "tweet": " happy bday!", 24 | "sentiment": "positive" 25 | }, 26 | { 27 | "tweet": " Ur so cute..I`m a fan of Dream A Little Dream, This Kiss and appearances like in Dawson`s Creek Make more flicks!", 28 | "sentiment": "positive" 29 | }, 30 | { 31 | "tweet": "No AC, the fan doesnt swing our way ... we are sweating it out on a hot humid day", 32 | "sentiment": "negative" 33 | }, 34 | { 35 | "tweet": "Cramps . . .", 36 | "sentiment": "negative" 37 | }, 38 | { 39 | "tweet": " haaaw..well i get out of class at 10:50..i hope i make it", 40 | "sentiment": "positive" 41 | }, 42 | { 43 | "tweet": "Happy Mothers Day!!!", 44 | "sentiment": "positive" 45 | }, 46 | { 47 | "tweet": " soooooo wish i could, but im in school and myspace is completely blocked", 48 | "sentiment": "negative" 49 | }, 50 | { 51 | "tweet": " you should totally come get me and bring me to kelslaws house with you.", 52 | "sentiment": "neutral" 53 | }, 54 | { 55 | "tweet": " im really sorry i know wallah how u feel this life is shittttttttt", 56 | "sentiment": "negative" 57 | }, 58 | { 59 | "tweet": " I can`t call Mitch! Im from sweden!", 60 | "sentiment": "neutral" 61 | }, 62 | { 63 | "tweet": " thats so cool", 64 | "sentiment": "positive" 65 | }, 66 | { 67 | "tweet": "Happy Birthday Snickers!!!! ? I hope you have the best day ever! Let`s go shopping!!!", 68 | "sentiment": "positive" 69 | }, 70 | { 71 | "tweet": " did he ask for your Twitter ID? Your sun sign?", 72 | "sentiment": "neutral" 73 | } 74 | ], 75 | "signature": { 76 | "instructions": "Given the fields `tweet`, produce the fields `sentiment`.", 77 | "fields": [ 78 | { 79 | "prefix": "Tweet:", 80 | "description": "Candidate tweet for classificaiton" 81 | }, 82 | { 83 | "prefix": "Sentiment:", 84 | "description": "${sentiment}" 85 | } 86 | ] 87 | }, 88 | "metadata": { 89 | "dependency_versions": { 90 | "python": "3.12.0", 91 | "dspy": "2.5.43", 92 | "cloudpickle": "3.0.0" 93 | } 94 | } 95 | } -------------------------------------------------------------------------------- /optimized/bsfswrs_twt_sentiment.json: -------------------------------------------------------------------------------- 1 | { 2 | "lm": null, 3 | "traces": [], 4 | "train": [], 5 | "demos": [ 6 | { 7 | "augmented": true, 8 | "tweet": "Last session of the day http:\/\/twitpic.com\/67ezh", 9 | "sentiment": "neutral" 10 | }, 11 | { 12 | "tweet": "Terminator Salvation... by myself.", 13 | "sentiment": "neutral" 14 | }, 15 | { 16 | "tweet": " I checked. 
We didn`t win", 17 | "sentiment": "neutral" 18 | }, 19 | { 20 | "tweet": "I`m so very tired...and have insomnia.", 21 | "sentiment": "negative" 22 | }, 23 | { 24 | "tweet": ": hmmm, wrong link, ignore my tweet", 25 | "sentiment": "negative" 26 | }, 27 | { 28 | "tweet": " soooooo wish i could, but im in school and myspace is completely blocked", 29 | "sentiment": "negative" 30 | }, 31 | { 32 | "tweet": "Tracy and Berwick breaks my achy breaky heart They split ways in the hallways.", 33 | "sentiment": "negative" 34 | }, 35 | { 36 | "tweet": " Shanghai is also really exciting (precisely -- skyscrapers galore). Good tweeps in China: (SH) (BJ).", 37 | "sentiment": "positive" 38 | }, 39 | { 40 | "tweet": " look who I found just for you ---> http:\/\/twitter.com\/DJT2009", 41 | "sentiment": "positive" 42 | }, 43 | { 44 | "tweet": "So I really need to put the laptop down & start getting ready for shindig...But I`ve missed my TwitterLoves all day", 45 | "sentiment": "neutral" 46 | }, 47 | { 48 | "tweet": "feels sorry every time I`m printing out, I use like 200 new papers", 49 | "sentiment": "negative" 50 | }, 51 | { 52 | "tweet": "Gnight shar <(` `<)Vega(>` `)>", 53 | "sentiment": "neutral" 54 | }, 55 | { 56 | "tweet": "Is watching acoustic performances! & In the mood for a good 'FRIENDS' episode! I miss that show", 57 | "sentiment": "neutral" 58 | }, 59 | { 60 | "tweet": " for...the...loss. dumbface ...him, not u. what u up to on the wknd? i wanna seeeeeee ya!", 61 | "sentiment": "negative" 62 | }, 63 | { 64 | "tweet": "reality needs to check in. schools over. time to party not tonight tho, im going to bed. night, night.", 65 | "sentiment": "neutral" 66 | }, 67 | { 68 | "tweet": " http:\/\/twitpic.com\/4w75p - I like it!!", 69 | "sentiment": "positive" 70 | } 71 | ], 72 | "signature": { 73 | "instructions": "Given the fields `tweet`, produce the fields `sentiment`.", 74 | "fields": [ 75 | { 76 | "prefix": "Tweet:", 77 | "description": "Candidate tweet for classificaiton" 78 | }, 79 | { 80 | "prefix": "Sentiment:", 81 | "description": "${sentiment}" 82 | } 83 | ] 84 | }, 85 | "metadata": { 86 | "dependency_versions": { 87 | "python": "3.12.0", 88 | "dspy": "2.5.43", 89 | "cloudpickle": "3.0.0" 90 | } 91 | } 92 | } -------------------------------------------------------------------------------- /optimized/bsft_twt_sentiment.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/optimized/bsft_twt_sentiment.pkl -------------------------------------------------------------------------------- /optimized/copro_twt_sentiment.json: -------------------------------------------------------------------------------- 1 | { 2 | "lm": null, 3 | "traces": [], 4 | "train": [], 5 | "demos": [], 6 | "signature": { 7 | "instructions": "Given a social media post, determine the sentiment it expresses, categorizing it as either \"positive,\" \"negative,\" or \"neutral.\" Use the textual emotions within the post content to make an accurate judgment. 
Your goal is to provide an objective evaluation based on the tone and context of the post.", 8 | "fields": [ 9 | { 10 | "prefix": "Tweet:", 11 | "description": "Candidate tweet for classificaiton" 12 | }, 13 | { 14 | "prefix": "Analyzed Sentiment:", 15 | "description": "${sentiment}" 16 | } 17 | ] 18 | }, 19 | "metadata": { 20 | "dependency_versions": { 21 | "python": "3.12.0", 22 | "dspy": "2.5.43", 23 | "cloudpickle": "3.0.0" 24 | } 25 | } 26 | } -------------------------------------------------------------------------------- /optimized/ensemble_twt_sentiment.json: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/optimized/ensemble_twt_sentiment.json -------------------------------------------------------------------------------- /optimized/knn_twt_sentiment.json: -------------------------------------------------------------------------------- 1 | { 2 | "lm": null, 3 | "traces": [], 4 | "train": [], 5 | "demos": [], 6 | "signature": { 7 | "instructions": "Given the fields `tweet`, produce the fields `sentiment`.", 8 | "fields": [ 9 | { 10 | "prefix": "Tweet:", 11 | "description": "Candidate tweet for classificaiton" 12 | }, 13 | { 14 | "prefix": "Sentiment:", 15 | "description": "${sentiment}" 16 | } 17 | ] 18 | }, 19 | "metadata": { 20 | "dependency_versions": { 21 | "python": "3.12.0", 22 | "dspy": "2.5.43", 23 | "cloudpickle": "3.0.0" 24 | } 25 | } 26 | } -------------------------------------------------------------------------------- /optimized/lfs_twt_sentiment.json: -------------------------------------------------------------------------------- 1 | { 2 | "lm": null, 3 | "traces": [], 4 | "train": [], 5 | "demos": [ 6 | { 7 | "tweet": "Not happy", 8 | "sentiment": "negative" 9 | }, 10 | { 11 | "tweet": "Take antibacterial to school to clean your hands when you cant go the loos", 12 | "sentiment": "neutral" 13 | }, 14 | { 15 | "tweet": " Thanks Ennio", 16 | "sentiment": "positive" 17 | }, 18 | { 19 | "tweet": " that`s great!! weee!! visitors!", 20 | "sentiment": "positive" 21 | }, 22 | { 23 | "tweet": " did he ask for your Twitter ID? Your sun sign?", 24 | "sentiment": "neutral" 25 | }, 26 | { 27 | "tweet": "I just cried whilst watching hollyoaks .. i need a life! lol", 28 | "sentiment": "negative" 29 | }, 30 | { 31 | "tweet": "Change of plans. I am staying in Brandon. No Papaya Salad for me.", 32 | "sentiment": "neutral" 33 | }, 34 | { 35 | "tweet": " I`ll oscillate from one to the other.", 36 | "sentiment": "neutral" 37 | }, 38 | { 39 | "tweet": " I`m sorry at least it`s Friday?", 40 | "sentiment": "negative" 41 | }, 42 | { 43 | "tweet": "_xo dang it, so its not certain ? ? ? are you okay?", 44 | "sentiment": "neutral" 45 | }, 46 | { 47 | "tweet": " Hope ur havin fun in da club", 48 | "sentiment": "positive" 49 | }, 50 | { 51 | "tweet": "_LaMont yr very young looking dude", 52 | "sentiment": "positive" 53 | }, 54 | { 55 | "tweet": "Huh, another ScarePoint coding Sunday", 56 | "sentiment": "neutral" 57 | }, 58 | { 59 | "tweet": "at starbucks with my love. eff school. i have work later too.", 60 | "sentiment": "neutral" 61 | }, 62 | { 63 | "tweet": "Cramps . . 
.", 64 | "sentiment": "negative" 65 | }, 66 | { 67 | "tweet": "Happy mothers day mumm xoxo", 68 | "sentiment": "positive" 69 | } 70 | ], 71 | "signature": { 72 | "instructions": "Given the fields `tweet`, produce the fields `sentiment`.", 73 | "fields": [ 74 | { 75 | "prefix": "Tweet:", 76 | "description": "Candidate tweet for classificaiton" 77 | }, 78 | { 79 | "prefix": "Sentiment:", 80 | "description": "${sentiment}" 81 | } 82 | ] 83 | }, 84 | "metadata": { 85 | "dependency_versions": { 86 | "python": "3.12.0", 87 | "dspy": "2.5.43", 88 | "cloudpickle": "3.0.0" 89 | } 90 | } 91 | } -------------------------------------------------------------------------------- /optimized/mipro_bsft_twt_sentiment.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ALucek/dspy-breakdown/455264b71b767fecae0e039d9ca79b3124f2e3d7/optimized/mipro_bsft_twt_sentiment.pkl -------------------------------------------------------------------------------- /optimized/mipro_twt_sentiment.json: -------------------------------------------------------------------------------- 1 | { 2 | "lm": null, 3 | "traces": [], 4 | "train": [], 5 | "demos": [ 6 | { 7 | "augmented": true, 8 | "tweet": "The underwire in my bra is sticking out and poking me in the armpit", 9 | "sentiment": "negative" 10 | }, 11 | { 12 | "augmented": true, 13 | "tweet": " I always forget SOMETHING when I travel. I am at Newark airport.", 14 | "sentiment": "neutral" 15 | }, 16 | { 17 | "augmented": true, 18 | "tweet": " that`s great!! weee!! visitors!", 19 | "sentiment": "positive" 20 | }, 21 | { 22 | "tweet": " Miss you", 23 | "sentiment": "negative" 24 | }, 25 | { 26 | "tweet": " Thank you! I`m working on `s", 27 | "sentiment": "positive" 28 | }, 29 | { 30 | "tweet": "Guess what? mom adopted a kitty today (11 months) His name is Corky", 31 | "sentiment": "neutral" 32 | }, 33 | { 34 | "tweet": "thinks SG is wonderful", 35 | "sentiment": "positive" 36 | }, 37 | { 38 | "tweet": " .. and you`re on twitter! Did the tavern bore you that much?", 39 | "sentiment": "neutral" 40 | }, 41 | { 42 | "tweet": "So hot today =_= don`t like it and i hate my new timetable, having such a bad week", 43 | "sentiment": "negative" 44 | }, 45 | { 46 | "tweet": "http:\/\/twitpic.com\/4wp8s - My ear hurts, and THIS is my medicine. GUM", 47 | "sentiment": "negative" 48 | }, 49 | { 50 | "tweet": "Happy star wars day! May the fourth be with you", 51 | "sentiment": "positive" 52 | }, 53 | { 54 | "tweet": " you are lame go make me breakfast!!", 55 | "sentiment": "negative" 56 | }, 57 | { 58 | "tweet": "Last session of the day http:\/\/twitpic.com\/67ezh", 59 | "sentiment": "neutral" 60 | }, 61 | { 62 | "tweet": "Terminator Salvation... by myself.", 63 | "sentiment": "neutral" 64 | }, 65 | { 66 | "tweet": "On the monday, so i wont be able to be with you! i love you", 67 | "sentiment": "positive" 68 | }, 69 | { 70 | "tweet": "Recession hit Veronique Branquinho, she has to quit her company, such a shame!", 71 | "sentiment": "negative" 72 | } 73 | ], 74 | "signature": { 75 | "instructions": "Analyze the sentiment of the provided tweet and determine whether it conveys a positive, negative, or neutral emotion. Consider the context, tone, and language used in the tweet to accurately classify the sentiment. 
Provide the sentiment label as the output.", 76 | "fields": [ 77 | { 78 | "prefix": "Tweet:", 79 | "description": "Candidate tweet for classificaiton" 80 | }, 81 | { 82 | "prefix": "Sentiment:", 83 | "description": "${sentiment}" 84 | } 85 | ] 86 | }, 87 | "metadata": { 88 | "dependency_versions": { 89 | "python": "3.12.0", 90 | "dspy": "2.5.43", 91 | "cloudpickle": "3.0.0" 92 | } 93 | } 94 | } --------------------------------------------------------------------------------
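
The `optimized/` files above are the serialized state DSPy writes out when a compiled program is saved: the few-shot `demos` the optimizer selected or bootstrapped, the `signature` block with its `instructions` and field prefixes (note that COPRO and MIPRO rewrite the instructions, while the purely few-shot optimizers keep the default "Given the fields `tweet`, produce the fields `sentiment`."), and `metadata` recording the Python/DSPy versions the artifact was produced with. The fine-tuning runs (`bsft_twt_sentiment.pkl`, `mipro_bsft_twt_sentiment.pkl`) are stored as pickles instead, presumably because a fine-tuned program's state doesn't reduce to demos and instructions alone. As a minimal sketch of reloading one of these JSON artifacts — assuming DSPy's standard module `save()`/`load()` interface (consistent with the `dependency_versions` metadata) and using a stand-in string signature in place of whatever signature the notebook actually defines:

```python
import dspy

# Rebuild a predictor with the same shape as the saved one, then restore its
# optimized state (selected demos + instructions) from the JSON artifact.
# 'tweet -> sentiment' is a stand-in for the signature defined in the notebook.
classifier = dspy.Predict('tweet -> sentiment')
classifier.load('optimized/mipro_twt_sentiment.json')

# Inspect what the optimizer changed: the rewritten instructions and the
# chosen demonstrations now live on the predictor itself.
print(classifier.signature.instructions)
print(f"{len(classifier.demos)} few-shot demos attached")

# With a language model configured, the reloaded program is called like any
# other DSPy module, e.g. classifier(tweet="Happy mothers day mumm xoxo").
```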