├── .gitignore ├── README.md ├── benchmarks ├── README.md ├── arc.py ├── datasets │ ├── math_test.csv │ └── mmlu.csv ├── gpqa.py ├── gsm8k.py ├── mathematics.py └── mmlu.py ├── honeycomb ├── README.md ├── critique.txt ├── prompt.txt ├── queries.csv ├── queries.ipynb ├── queries.py └── utils.py ├── langchain ├── .env.example ├── .gitignore ├── README.md ├── inspect_langchain.py ├── wikipedia.jsonl └── wikipedia.py ├── requirements.txt └── slides ├── images ├── inspect-honeycomb-validate.png ├── inspect-mathmatics.png ├── inspect-notebook-eval.png ├── inspect-view-answers.png ├── inspect-view-honeycomb-critique-score.png ├── inspect-view-honeycomb-critique.png ├── inspect-view-honeycomb-validate-prompt-2.png ├── inspect-view-honeycomb-validate-prompt.png ├── inspect-view-honeycomb-validate.png ├── inspect-view-messages.png ├── inspect-view-scoring.png ├── inspect-wikipedia-eval.png ├── inspect-wikipedia-explanation.png ├── inspect-wikipedia-messages.png ├── inspect-wikipedia-scoring.png └── log-incorrect.png ├── inspect.qmd └── intro-to-inspect.pdf /.gitignore: -------------------------------------------------------------------------------- 1 | 2 | .venv/ 3 | .env 4 | logs/ 5 | __pycache__/ 6 | .mypy_cache 7 | .DS_Store 8 | .ipynb_checkpoints/ 9 | 10 | /.quarto/ 11 | *_files/ 12 | *.html 13 | *-present.pdf 14 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Inspect: An OSS framework for LLM evals 2 | 3 | This repository includes the slides and code for the workshop on Inspect given at [Mastering LLMs: A Conference for Developers and Data Scientists](https://maven.com/parlance-labs/fine-tuning). 4 | 5 | ### Slides 6 | 7 | [Intro to Inspect: Open Source Framework for LLM Evals](https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/main/slides/intro-to-inspect.pdf) 8 | 9 | ### Code 10 | 11 | | Directory | Description | 12 | |---------------------|---------------------------------------------------| 13 | | [honeycomb](honeycomb/) | Evals for for Honeycomb Natural Langaguge Query generator | 14 | | [langchain](langchain/) | Demonstrates evaluating a LangChain Agent using Inspect | 15 | | [benchmarks](benchmarks/) | Implemetnations of some popular benchmarks using Inspect | 16 | | [slides](slides/) | Source code and PDF version of workshop slides | 17 | 18 | To ensure that you can run the example code in this repo, please install the dependencies using the [requirements.txt](requirements.txt) file in the root directory. 19 | -------------------------------------------------------------------------------- /benchmarks/README.md: -------------------------------------------------------------------------------- 1 | ## Benchmarks 2 | 3 | This directory contains evals for several benchmarks. Note that some benchmark datasets are included in the `datasets/` directory and some are downloaded from Hugging Face (you should be sure to `pip install datasets` before attempting to run these benchmarks). 
4 | 5 | | Benchmark | Reference | Code | Dataset | 6 | |-----------------------------|---------------|--------------:|---------------| 7 | | MMLU: Measuring Massive Multitask Language Understanding | | [mmlu.py](mmlu.py) | Local | 8 | | MATH: Measuring Mathematical Problem Solving With the MATH Dataset | | [mathematics.py](mathematics.py) | Local | 9 | | GPQA: A Graduate-Level Google-Proof Q&A Benchmark | | [gpqa.py](gpqa.py) | Hugging Face | 10 | | ARC: AI2 Reasoning Challenge | | [arc.py](arc.py) | Hugging Face | 11 | | GSM8K: Training Verifiers to Solve Math Word Problems | | [gsm8k.py](gsm8k.py) | Hugging Face | -------------------------------------------------------------------------------- /benchmarks/arc.py: -------------------------------------------------------------------------------- 1 | """ 2 | Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge 3 | 4 | Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord 5 | https://arxiv.org/abs/1803.05457 6 | 7 | # run all subsets 8 | inspect eval arc.py 9 | 10 | # run specific subsets 11 | inspect eval arc.py@arc_easy 12 | inspect eval arc.py@arc_challenge 13 | """ 14 | 15 | from inspect_ai import Task, task 16 | from inspect_ai.dataset import Sample, hf_dataset 17 | from inspect_ai.scorer import answer 18 | from inspect_ai.solver import multiple_choice 19 | 20 | 21 | def record_to_sample(record): 22 | # read the labels and text 23 | choices = record["choices"] 24 | choices = dict(zip(choices["label"], choices["text"])) 25 | 26 | # determine the target then normalize to letter 27 | answerKey = record["answerKey"] 28 | target = list(choices.keys()).index(answerKey) 29 | target = chr(ord("A") + int(target)) 30 | 31 | # return sample 32 | return Sample( 33 | input=record["question"], choices=list(choices.values()), target=target 34 | ) 35 | 36 | 37 | def arc_task(dataset_name): 38 | return Task( 39 | dataset=hf_dataset( 40 | path="allenai/ai2_arc", 41 | name=dataset_name, 42 | split="test", 43 | sample_fields=record_to_sample, 44 | ), 45 | plan=multiple_choice(), 46 | scorer=answer("letter"), 47 | ) 48 | 49 | 50 | @task 51 | def arc_easy(): 52 | return arc_task("ARC-Easy") 53 | 54 | 55 | @task 56 | def arc_challenge(): 57 | return arc_task("ARC-Challenge") 58 | -------------------------------------------------------------------------------- /benchmarks/gpqa.py: -------------------------------------------------------------------------------- 1 | """ 2 | GPQA: A Graduate-Level Google-Proof Q&A Benchmark 3 | 4 | David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard 5 | Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. 
Bowman 6 | https://arxiv.org/abs/2311.12022 7 | 8 | Based on: https://github.com/openai/simple-evals/blob/main/gpqa_eval.py 9 | 10 | # eval for default epochs (4) 11 | inspect eval gpqa.py 12 | 13 | # eval with 1 epoch 14 | inspect eval gpqa.py --epochs 1 15 | 16 | # without chain of thought 17 | inspect eval gpqa.py -T cot=false 18 | """ 19 | 20 | 21 | from inspect_ai import Task, task 22 | from inspect_ai.dataset import Sample, csv_dataset 23 | from inspect_ai.model import GenerateConfig 24 | from inspect_ai.scorer import answer 25 | from inspect_ai.solver import multiple_choice 26 | 27 | # default epochs to run eval for 28 | DEFAULT_EPOCHS = 4 29 | 30 | 31 | # map records to inspect samples (note that target is always "A" in the, 32 | # dataset, we will shuffle the presentation of options to mitigate this) 33 | def record_to_sample(record): 34 | return Sample( 35 | input=record["Question"], 36 | choices=[ 37 | str(record["Correct Answer"]), 38 | str(record["Incorrect Answer 1"]), 39 | str(record["Incorrect Answer 2"]), 40 | str(record["Incorrect Answer 3"]), 41 | ], 42 | target="A", 43 | id=record["Record ID"], 44 | ) 45 | 46 | 47 | @task 48 | def gpqa_diamond(cot=True): 49 | return Task( 50 | dataset=csv_dataset( 51 | csv_file="https://openaipublic.blob.core.windows.net/simple-evals/gpqa_diamond.csv", 52 | sample_fields=record_to_sample, 53 | ), 54 | plan=[ 55 | multiple_choice(cot=cot, shuffle=True), 56 | ], 57 | scorer=answer("letter"), 58 | config=GenerateConfig(temperature=0.5), 59 | epochs=DEFAULT_EPOCHS, 60 | ) 61 | -------------------------------------------------------------------------------- /benchmarks/gsm8k.py: -------------------------------------------------------------------------------- 1 | """ 2 | Training Verifiers to Solve Math Word Problems 3 | 4 | Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman 5 | https://arxiv.org/abs/2110.14168 6 | 7 | # run with default fewshots (10) 8 | inspect eval gsm8k.py 9 | 10 | # run with less or no fewshots 11 | inspect eval gsm8k.py -T fewshot=5 12 | inspect eval gsm8k.py -T fewshot=false 13 | """ 14 | 15 | from inspect_ai import Task, task 16 | from inspect_ai.dataset import Sample, hf_dataset 17 | from inspect_ai.scorer import match 18 | from inspect_ai.solver import generate, prompt_template, system_message 19 | 20 | 21 | def record_to_sample(record): 22 | DELIM = "####" 23 | input = record["question"] 24 | answer = record["answer"].split(DELIM) 25 | target = answer.pop().strip() 26 | reasoning = DELIM.join(answer) 27 | return Sample(input=input, target=target, metadata={"reasoning": reasoning.strip()}) 28 | 29 | 30 | def sample_to_fewshot(sample): 31 | return ( 32 | f"{sample.input}\n\nReasoning:\n" 33 | + f"{sample.metadata['reasoning']}\n\n" 34 | + f"ANSWER: {sample.target}" 35 | ) 36 | 37 | 38 | # setup for problem + instructions for providing answer 39 | MATH_PROMPT_TEMPLATE = """ 40 | Solve the following math problem step by step. The last line of your response should be of the form "ANSWER: $ANSWER" (without quotes) where $ANSWER is the answer to the problem. 41 | 42 | {prompt} 43 | 44 | Remember to put your answer on its own line at the end in the form "ANSWER: $ANSWER" (without quotes) where $ANSWER is the answer to the problem, and you do not need to use a \\boxed command. 
45 | 46 | Reasoning: 47 | """.strip() 48 | 49 | 50 | @task 51 | def gsm8k(fewshot=10, fewshot_seed=42): 52 | # build plan dynamically (may or may not be doing fewshot) 53 | plan = [prompt_template(MATH_PROMPT_TEMPLATE), generate()] 54 | if fewshot: 55 | fewshots = hf_dataset( 56 | path="gsm8k", 57 | data_dir="main", 58 | split="train", 59 | sample_fields=record_to_sample, 60 | shuffle=True, 61 | seed=fewshot_seed, 62 | limit=fewshot, 63 | ) 64 | plan.insert( 65 | 0, 66 | system_message( 67 | "\n\n".join([sample_to_fewshot(sample) for sample in fewshots]) 68 | ), 69 | ) 70 | 71 | # define task 72 | return Task( 73 | dataset=hf_dataset( 74 | path="gsm8k", 75 | data_dir="main", 76 | split="test", 77 | sample_fields=record_to_sample, 78 | ), 79 | plan=plan, 80 | scorer=match(numeric=True), 81 | ) 82 | -------------------------------------------------------------------------------- /benchmarks/mathematics.py: -------------------------------------------------------------------------------- 1 | """ 2 | Measuring Mathematical Problem Solving With the MATH Dataset 3 | 4 | Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, 5 | Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt 6 | https://arxiv.org/abs/2103.03874 7 | 8 | Based on: https://github.com/openai/simple-evals/blob/main/math_eval.py 9 | """ 10 | 11 | import re 12 | 13 | from inspect_ai import Task, task 14 | from inspect_ai.dataset import FieldSpec, csv_dataset 15 | from inspect_ai.model import GenerateConfig, get_model 16 | from inspect_ai.scorer import ( 17 | CORRECT, 18 | INCORRECT, 19 | AnswerPattern, 20 | Score, 21 | Target, 22 | accuracy, 23 | bootstrap_std, 24 | scorer, 25 | ) 26 | from inspect_ai.solver import TaskState, generate, prompt_template 27 | 28 | # setup for problem + instructions for providing answer 29 | PROMPT_TEMPLATE = """ 30 | Solve the following math problem step by step. The last line of your response should be of the form "ANSWER: $ANSWER" (without quotes) where $ANSWER is the answer to the problem. 31 | 32 | {prompt} 33 | 34 | Remember to put your answer on its own line at the end in the form "ANSWER: $ANSWER" (without quotes) where $ANSWER is the answer to the problem, and you do not need to use a \\boxed command. 
35 | """.strip() 36 | 37 | 38 | @task 39 | def math(shuffle=True): 40 | return Task( 41 | dataset=csv_dataset( 42 | csv_file="datasets/math_test.csv", 43 | sample_fields=FieldSpec(input="Question", target="Answer"), 44 | shuffle=shuffle, 45 | ), 46 | plan=[ 47 | prompt_template(PROMPT_TEMPLATE), 48 | generate(), 49 | ], 50 | scorer=expression_equivalance(), 51 | config=GenerateConfig(temperature=0.5), 52 | ) 53 | 54 | 55 | @scorer(metrics=[accuracy(), bootstrap_std()]) 56 | def expression_equivalance(): 57 | async def score(state: TaskState, target: Target): 58 | # extract answer 59 | match = re.search(AnswerPattern.LINE, state.output.completion) 60 | if match: 61 | # ask the model to judge equivalance 62 | answer = match.group(1) 63 | prompt = EQUIVALANCE_TEMPLATE % ( 64 | {"expression1": target.text, "expression2": answer} 65 | ) 66 | result = await get_model().generate(prompt) 67 | 68 | # return the score 69 | correct = result.completion.lower() == "yes" 70 | return Score( 71 | value=CORRECT if correct else INCORRECT, 72 | answer=answer, 73 | explanation=state.output.completion, 74 | ) 75 | else: 76 | return Score( 77 | value=INCORRECT, 78 | explanation="Answer not found in model output: " 79 | + f"{state.output.completion}", 80 | ) 81 | 82 | return score 83 | 84 | 85 | EQUIVALANCE_TEMPLATE = r""" 86 | Look at the following two expressions (answers to a math problem) and judge whether they are equivalent. Only perform trivial simplifications 87 | 88 | Examples: 89 | 90 | Expression 1: $2x+3$ 91 | Expression 2: $3+2x$ 92 | 93 | Yes 94 | 95 | Expression 1: 3/2 96 | Expression 2: 1.5 97 | 98 | Yes 99 | 100 | Expression 1: $x^2+2x+1$ 101 | Expression 2: $y^2+2y+1$ 102 | 103 | No 104 | 105 | Expression 1: $x^2+2x+1$ 106 | Expression 2: $(x+1)^2$ 107 | 108 | Yes 109 | 110 | Expression 1: 3245/5 111 | Expression 2: 649 112 | 113 | No 114 | (these are actually equal, don't mark them equivalent if you need to 115 | do nontrivial simplifications) 116 | 117 | Expression 1: 2/(-3) 118 | Expression 2: -2/3 119 | 120 | Yes 121 | (trivial simplifications are allowed) 122 | 123 | Expression 1: 72 degrees 124 | Expression 2: 72 125 | 126 | Yes 127 | (give benefit of the doubt to units) 128 | 129 | Expression 1: 64 130 | Expression 2: 64 square feet 131 | 132 | Yes 133 | (give benefit of the doubt to units) 134 | 135 | --- 136 | 137 | YOUR TASK 138 | 139 | 140 | Respond with only "Yes" or "No" (without quotes). Do not include a rationale. 
141 | 142 | Expression 1: %(expression1)s 143 | Expression 2: %(expression2)s 144 | """.strip() 145 | -------------------------------------------------------------------------------- /benchmarks/mmlu.py: -------------------------------------------------------------------------------- 1 | """ 2 | Measuring Massive Multitask Language Understanding 3 | 4 | Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, 5 | Mantas Mazeika, Dawn Song, Jacob Steinhardt 6 | https://arxiv.org/abs/2009.03300 7 | 8 | Based on: https://github.com/openai/simple-evals/blob/main/mmlu_eval.py 9 | 10 | # eval all subjects w/ 500 randomly selected samples 11 | inspect eval mmlu.py@mmlu --limit 500 12 | 13 | # add chain of thought 14 | inspect eval mmlu.py@mmlu --limit 500 -T cot=true 15 | 16 | # eval selected subjects 17 | inspect eval mmlu.py@mmlu -T subjects=anatomy 18 | inspect eval mmlu.py@mmlu -T subjects=astronomy 19 | inspect eval mmlu.py@mmlu -T subjects=anatomy,astronomy 20 | """ 21 | 22 | from inspect_ai import Task, task 23 | from inspect_ai.dataset import Sample, csv_dataset 24 | from inspect_ai.model import GenerateConfig 25 | from inspect_ai.scorer import answer 26 | from inspect_ai.solver import multiple_choice 27 | 28 | 29 | # map records to inspect sample 30 | def record_to_sample(record): 31 | return Sample( 32 | input=record["Question"], 33 | choices=[ 34 | str(record["A"]), 35 | str(record["B"]), 36 | str(record["C"]), 37 | str(record["D"]), 38 | ], 39 | target=record["Answer"], 40 | metadata={"subject": record["Subject"]}, 41 | ) 42 | 43 | 44 | # read dataset globally so it can be shared by all of the tasks 45 | # (shuffle so that --limit draws from multiple subjects) 46 | dataset = csv_dataset( 47 | csv_file="datasets/mmlu.csv", 48 | sample_fields=record_to_sample, 49 | shuffle=True, 50 | ) 51 | 52 | 53 | @task 54 | def mmlu(subjects=[], cot=False): 55 | # filter dataset if requested 56 | subjects = subjects if isinstance(subjects, list) else [subjects] 57 | if len(subjects) > 0: 58 | task_dataset = dataset.filter( 59 | name=f"{dataset.name}-{'-'.join(subjects)}", 60 | predicate=lambda sample: sample.metadata["subject"] in subjects, 61 | ) 62 | else: 63 | task_dataset = dataset 64 | 65 | # return task 66 | return Task( 67 | dataset=task_dataset, 68 | plan=multiple_choice(cot=cot), 69 | scorer=answer("letter"), 70 | config=GenerateConfig(temperature=0.5), 71 | ) 72 | -------------------------------------------------------------------------------- /honeycomb/README.md: -------------------------------------------------------------------------------- 1 | ## Honeycomb Query Evals 2 | 3 | ### Overview 4 | 5 | Evals for for Honeycomb Natural Langaguge Query generator from the [Fine Tuning LLMs](https://maven.com/parlance-labs/fine-tuning) course. Related notebooks from the course can be found at . 6 | 7 | The [queries.csv](queries.csv) dataset contains \~ 2,300 example queries (along with per-query column schemas generated offline via RAG). There are two scoring methods supported 8 | (corresponding to the two @task definitions below): 9 | 10 | 1. @validate - score using the validity checker from the course (utils.py) 11 | 2. @critique - score using the critique prompt from the course (critique.txt) 12 | 13 | ### Eval Script 14 | 15 | Use the `inspect eval` command with the eval script ([queries.py](queries.py)) to run evals on a variety of models. 
16 | 17 | To evaluate all of the queries using both scorers on Claude Opus: 18 | 19 | ```bash 20 | $ inspect eval queries.py --model anthropic/claude-3-opus-20240229 21 | ``` 22 | 23 | To evaluate a random subset of 200 queries using both scorers on GPT 4 Turbo: 24 | 25 | ```bash 26 | $ inspect eval queries.py --model openai/gpt-4-turbo --limit 200 27 | ``` 28 | 29 | To use the validate scorer only with a local Google Gemma 2B (via HF): 30 | 31 | ```bash 32 | $ inspect eval queries.py@validate --model hf/google/gemma-2b 33 | ``` 34 | 35 | To use the critique scorer only with a local Ollma Lllama3 model 36 | 37 | ```bash 38 | $ inspect eval queries.py@critique --model ollma/llama3 39 | ``` 40 | 41 | See `inspect eval --help` for details on all available options. 42 | 43 | ### Eval Notebook 44 | 45 | There is a also a notebook version of the evals ([queries.ipynb](queries.ipynb)). Rather than using `inspect eval` as it done with the script, the notebook uses Inspect's `eval()` function to execute evals within the notebok runtime. -------------------------------------------------------------------------------- /honeycomb/critique.txt: -------------------------------------------------------------------------------- 1 | ## Background 2 | 3 | Honeycomb is an observability platform that allows you to write queries to inspect trace data. 4 | The specification of the Honeycomb query language is as follows: 5 | 6 | QUERY SPEC: 7 | All top-level keys are optional. 8 | 9 | ```json 10 | "calculations":[ 11 | // ops: COUNT, CONCURRENCY, COUNT_DISTINCT, HEATMAP, SUM, AVG, MAX, MIN, P001, P01, P05, P10, P25, P50, P75, P90, P95, P99, P999, RATE_AVG, RATE_SUM, RATE_MAX 12 | {"op": "COUNT"},// COUNT and CONCURRENCY are just op 13 | {"op": "HEATMAP", "column": "name"} 14 | ], 15 | "filters":[ 16 | // ops: =, !=, >, >=, <, <=, starts-with, does-not-start-with, exists, does-not-exist, contains, does-not-contain, in, not-in 17 | {"column": "name", "op": "exists"}, // exists and does-not-exist ops only have column 18 | {"column": "name", "op": "=", "value": "something"} 19 | ], 20 | "filter_combination": "AND", // AND or OR 21 | "breakdowns":[ 22 | // columns in COLUMNS 23 | "column1","column2" 24 | ], 25 | "orders":[ 26 | // HEATMAP not allowed 27 | // Must come from breakdowns or calculations 28 | {"op": "op_in_calculation", "column": "column_in_calculation", "order": "ascending"}, 29 | {"op": "COUNT", "order": "descending"}, // COUNT and CONCURRENCY have no column 30 | {"column": "column1", "order": "descending"}, 31 | ], 32 | "havings":[ 33 | // HEATMAP not allowed 34 | {"calculate_op": "op_in_calculation", "column": "name", "op": "OPNAME", "value": 100}, 35 | {"calculate_op": "COUNT", "op": ">", "value": 10}, // COUNT and CONCURRENCY have no column 36 | ], 37 | "time_range": 7200, // Relative time range in seconds. 38 | "start_time": 1234567890, // UNIX timestamp 39 | "end_time": 1234567890, // UNIX timestamp 40 | ``` 41 | 42 | QUERY SPEC TIPS: 43 | 44 | COUNT counts events/spans. COUNT_DISTINCT counts unique values of columns. 45 | HEATMAP shows value distributions. 46 | trace.parent_id does-not-exist shows root span of a trace. Use that to count requests and traces. 47 | name is a span or span event. 48 | parent_name is the name of a span that created a span event. 49 | 50 | When the user input is asking about a time range (such as "yesterday" or "since last week"), 51 | always use the time_range, start_time, and end_time fields. 
time_range 52 | is relative and can be combined with either start_time or end_time but not both. 53 | 54 | --- 55 | 56 | ## Instructions 57 | 58 | You are an EXPERT query evaluator that has advanced capabilities to judge if a query good or not. You understand the nuances of the Honeycomb query language, including what is likely to be most useful from an analytics perspective. 59 | You are given the following three inputs: (1) NLQ, (2) A list of candidate columns (COLUMNS) that are allowed to be in the query, and (3) The query (QUERY). Your job is to evaluate and critique the QUERY relative to the provided NLQ and COLUMNS. 60 | 61 | The critiques must be provided in the same json format as provided in the examples below: 62 | 63 | --- 64 | 65 | NLQ: show me slowest trace 66 | 67 | COLUMNS: ['trace.trace_id', 'trace.span_id', 'trace.parent_id', 'duration_ms', 'name', 'faas.instance', 'faas.id', 'filter', 'telemetry.instrumentation_library', 'library.name', 'faas.name', 'span.kind', 'type', 'http.wrote_bytes', 'http.url', 'service.name', 'http.flavor', 'span.num_links', 'span.num_events', 'net.host.name', 'library.version', 'http.scheme', 'net.peer.name', 'http.method', 'meta.signal_type', 'cloud.region', 'cloud.provider', 'faas.version', 'http.read_bytes', 'http.user_agent', 'cloud.account.id', 'organization_id', 'cloud.platform', 'net.sock.peer.addr', 'page_size', 'net.sock.peer.port', 'page_token', 'status_code', 'http.client_ip', 'http.status_code', 'http.route'] 68 | 69 | QUERY: {"calculations":[{"column":"duration_ms","op":"MAX"}],"filters":[{"column":"trace.parent_id","op":"does-not-exist","join_column":""}],"orders":[{"column":"duration_ms","op":"MAX","order":"descending"}],"limit":1,"time_range":7200} 70 | 71 | {"critique": "The response is nearly correct, as it is looking for the slowest trace by using MAX(duration_ms) and ordering by duration_ms in descending order, which is appropriate for finding the 'slowest' trace. Additionally, filtering with trace.parent_id does-not-exist correctly identifies root spans. However, the query should be grouping by trace.trace_id to actually show the slowest trace. 
Without that grouping, the query only shows the MAX(duration_ms) measurement over time, irrespective of which trace is responsible for that measurement.", "outcome": "bad"} 72 | 73 | --- 74 | 75 | NLQ: slow requests 76 | 77 | COLUMNS: ['duration_ms', 'deprecatedSource', 'error', 'severity', 'flags', 'rpc.method', 'deprecatedCount', 'ip', 'rpc.service', 'interface', 'metadata', 'reason', 'container.cpu.time', 'deprecatedLastTimestamp', 'name', 'regarding', 'namespace', 'message.type', 'watch-type', 'severity_text', 'apiVersion', 'kind', 'service.name', 'rpc.system', 'container.uptime', 'type', 'uid', 'span.kind', 'db.name', 'note', 'body', 'resourceVersion', 'library.version', 'db.statement', 'status_code', 'event.domain', 'managedFields', 'deprecatedFirstTimestamp', 'container.memory.page_faults', 'event.name', 'container.memory.available', 'k8s.container.cpu_request', 'container.id', 'library.name', 'container.memory.rss', 'message.uncompressed_size', 'rpc.grpc.status_code', 'net.peer.port', 'k8s.container.memory_request', 'container.memory.working_set', 'trace.parent_id', 'http.route', 'service_name', ''] 78 | 79 | QUERY: {"breakdowns":["name"],"calculations":[{"column":"duration_ms","op":"HEATMAP"},{"column":"duration_ms","op":"MAX"}],"filters":[{"column":"trace.parent_id","op":"does-not-exist","join_column":""}],"orders":[{"column":"duration_ms","op":"MAX","order":"descending"}],"time_range":7200} 80 | 81 | {"critique": "This response is adequate. The inclusion of a HEATMAP and a MAX calculation on the duration_ms column is appropriate for identifying slow requests. Correctly including a filter on 'trace.parent_id' with 'does-not-exist' ensures that only root spans (requests) are considered. Ordering by the max duration_ms in descending order is sensible to quickly highlight slower requests. The use of 'name' in a breakdown is good, because it groups results into span names. This can help you quickly narrow down behavior to a particular part of a system. It might be more insightful to break down by 'http.route', because 'http.route' will also break down more broadly by API call. Having both 'http.route' and 'name' included can give both a broader and narrower breadkdown of the data, and both views can be helpful at different times.. 
Overall, the query aligns well with the NLQ but could benefit from a more specific breakdown column.", "outcome": "good"} 82 | 83 | --- 84 | NLQ: records of clientid 5fcedb77-267c-44e3-8ea4-4d21786b2618 85 | 86 | COLUMNS: ['clientIds', 'sys-client-id', 'principal.clientId', 'client.id', 'clientName', 'metadata.clientId', 'payload.clientId', 'clientExternalId', 'informationReportingEvent.clientId', 'principalClientId', 'metadata.clientTxId', 'targetClientId', 'e2eId', 'client', 'transactionIds', 'clientId', 'client.external_id', 'metadata.clientName', 'sys-service-provider-id', 'transactionId', 'paymentRequestIds.identifying', 'serviceProviderId', 'service_provider.id', 'externalId', 'traceId', 'event.id', 'accounts-deleted', 'jobExecutionId', 'accounts-updated', 'syncContext.clientDto.id', '$metadata.requestId', 'subscriptionId', 'paymentExternalId', 'activeIntegrationIds', 'fileId', 'tracerId', '$metadata.extendedRequestId', 'moneyTransferId', 'paymentRequestIds', 'vendor.id', 'triggeredNumberOfClients', 'payload.eventId', 'existedTransactionIds', 'completedTransactionIds', 'metadata.paymentRequestExternalId', 'syncContext.clientDto.externalId', 'createdTransactionIds', 'transactionsDeletionServiceProviderId', 'idempotency_key', 'RequestId', 'status', 'error', 'duration_ms', 'http.route', 'name', 'http_path', 'trace.parent_id', 'http.status_code', 'meta.req.query', 'service_name', 'request_time'] 87 | 88 | QUERY: {"filters":[{"column":"client.id","op":"=","value":"5fcedb77-267c-44e3-8ea4-4d21786b2618"}]} 89 | 90 | {"critique": "Poor response. While the filter by client.id is correct to find records for clientid '5fcedb77-267c-44e3-8ea4-4d21786b2618', the query lacks breakdowns, calculations, or order to make the results more insightful. It would be more beneficial to include calculations such as COUNT to understand the volume of records.", "outcome": "bad"} 91 | 92 | --- 93 | 94 | NLQ: Find the traces for the slowest requests in the past 3 hours where http.host = internal-pcmws-alkqa-routing-black-1164727359.us-east-1.elb.amazonaws.com 95 | 96 | COLUMNS: ['trace.trace_id', 'http.host', 'exception.stacktrace', 'pcm.drawtraffic', 'SLI_geo-95-percentile-latency', 'pcm.server.worker-load-balancer', 'pcm.responsetime', 'aws.requestId', 'http.query-length', 'http.response-uri', 'trace.span_id', 'net.host.name', 'http-500-error-percent', 'http.url', 'net.host.port', 'pcm.server.worker-app-pool', 'http.request-headers', 'duration_ms', 'pcm.restful', 'http.response-version', 'trace.parent_id', 'pcm.initialmaprequest', 'http.response-headers', 'http.method', 'aws.region', 'service.instance.id', 'http.response-method', 'wcf.channel.path', 'pcm.server', 'pcm.userip', 'http.version', 'pcm.server-name', 'pcm.server-color', 'aws.service', 'pcm.stopcount', 'rpc.method', 'pcm.splcs', 'peer.service', 'wcf.channel.scheme', 'service.name', 'http.response_content_length', 'pcm.foserver', 'SLI_99.9-percent-without-500-error', 'pcm.httpcode', 'pcm.errortext', 'http.status-description', 'pcm.login', 'pcm.apikey', 'rpc.service', 'pcm.date', 'db.statement', 'http.route', 'parent_name', 'name', 'http.status_code', 'exception.message', 'error'] 97 | 98 | QUERY: {"breakdowns":["trace.trace_id"],"calculations":[{"column":"duration_ms","op":"MAX"}],"filters":[{"column":"http.host","op":"=","value":"internal-pcmws-alkqa-routing-black-1164727359.us-east-1.elb.amazonaws.com","join_column":""}],"filter_combination":"AND","orders":[{"column":"duration_ms","op":"MAX","order":"descending"}],"time_range":10800} 99 | 100 | 
{"critique": "The query correctly uses time_range to indicate the past 3 hours and filters by the specific http.host value. However, it should also include a filter where trace.parent_id does-not-exist to identify the root spans or initial requests, which represent the start of a trace. Without this, the query might include intermediate spans that are not the start of the trace. The breakdown by trace.trace_id is correct, and ordering by the max duration is appropriate to find the slowest requests. It could be improved by including a calculation of COUNT to gauge the number of slow traces for each trace id, making it thorough for understanding general patterns or anomalies.", "outcome": "bad"} 101 | 102 | --- 103 | 104 | NLQ: show me traces where ip is 10.0.2.90 105 | 106 | COLUMNS: ['net.host.ip', 'trace.trace_id', 'net.peer.ip', 'net.transport', 'net.host.name', 'http.client_ip', 'http.host', 'host.name', 'trace.span_id', 'net.host.port', 'k8s.pod.ip', 'sli_route_latency_1500', 'ip', 'net.peer.port', 'trace.parent_id', 'sli_route_latency', 'http.server_name', 'service.name', 'service.instance.id', 'rpc.user_agent', 'rpc.service', 'http.route', 'db.name', 'http.url', 'area', 'requestID', 'service.version', 'http.user_agent', 'http.target', 'host.arch', 'orderid', 'process.pid', 'rpc.system', 'os.type', 'rpc.method', 'otlp.exporter.exported', 'runtime.jvm.gc.time', 'http.scheme', 'thread.name', 'name', 'runtime.jvm.memory.pool', 'message.type', 'runtime.jvm.memory.area', 'thread.id', 'enduser.id', 'error', 'userid', 'http.flavor', 'pool', 'telemetry.instrumentation_library', 'http.status_code', 'service_name', 'parent_name', 'duration_ms', 'db.statement', 'exception.message'] 107 | 108 | QUERY: {"breakdowns":["trace.trace_id"],"calculations":[{"op":"COUNT"}],"filters":[{"column":"net.host.ip","op":"=","value":"10.0.2.90"}]} 109 | 110 | {"critique": "The query satisfies the NLQ. It correctly filters for traces with an IP address of 10.0.2.90 and counts the occurrences of those traces, grouped by trace.trace_id, which is what the NLQ requests. 
The response is good as it meets the requirement of showing traces from a specific IP address without additional complexities.", "outcome": "good"} 111 | 112 | --- 113 | 114 | NLQ: count of moderation.flags.create api calls 115 | 116 | COLUMNS: ['response_flags', 'job_counts_MULTI_INDEX_CREATE', 'http_count', 'found_count', 'is_creation', 'is_create', 'job_counts_MULTI_INDEX_BATCH_CREATE', 'event_count', 'job_counts_MULTI_INDEX_CREATE_BE', 'job_counts_SEND_SLACK_UNFURLS_UPDATE', 'queued_count', 'keys_count', 'agenda_req_count', 'flags', 'job_counts_SEND_SLACK_DOC_THREAD_UPDATE', 'update_count', 'log_count', 'old_activity_badge_count', 'total_message_count', 'job_counts_SEND_SLACK_USER_MENTIONS_UPDATE', 'job_counts_UNIQUE_INDEX_CREATE', 'request_attempt_count', 'job_counts_ADD_TO_SLACK_DOCS_COMPANY', 'args.thread_counts_by_channel', 'job_counts_FINALIZE_EDIT_HISTORY', 'mc_queries_count', 'hitCount', 'api:warning', 'job_counts_MULTI_INDEX_BATCH_CREATE_BE', 'file_count', 'job_counts_UPDATE_THREAD_UNREAD_COUNTS_EF4', 'api:method', 'executing_count', 'job_counts_SEND_SLACK_FILES_UPDATE', 'failed_count', 'mpdm_count', 'count', 'grpc_req_count', 'job_counts_UPDATE_SLACK_REFERENCES_IN_THREAD', 'mc_queries_multi_count', 'channel_ids_count', 'job_counts_UPDATE_THREAD_PHRASES_ON_EDIT', 'total_participant_count', 'job_counts_MULTI_INDEX_DELETE', 'job_counts_UPDATE_THREAD_UNREAD_COUNTS_BE', 'kafkagate_count', 'authorized_apps_count', 'redirect_count', 'apps_count', 'bulk_generate_calls', 'duration_ms', 'slath', 'name', 'execution_context', 'trace_parent_id', 'db.statement', 'namespace', 'kind', 'error', 'message', 'reason', 'http.status_code', 'service_name', 'quip:call_path'] 117 | 118 | QUERY: {"breakdowns":["slath"],"calculations":[{"op":"COUNT"}],"filters":[{"column":"name","op":"=","value":"moderation.flags.create","join_column":""}],"time_range":7200} 119 | 120 | {"critique": "The query groups by 'slath' which is extraneous information - there is nothing in NLQ that indicates 'slath' is related to NLQ. Additionally, the filter clause should check for the 'contains' op instead of '='. While it is likely good to check for 'name' as that represents an operation name, which means it will likely have data NLQ is referring to, there is no guarantee that the actual name of the operation is exactly what was input in NLQ. 
Furthermore, the result should include 'name' in the breakdowns so that the results show the actual values of 'name'.", "outcome": "bad"} 121 | 122 | --- 123 | 124 | NLQ: What kind of events are contributing towards my rate limits 125 | 126 | COLUMNS: ['up', 'k8s.pod.memory.major_page_faults', 'k8s.pod.memory_limit_utilization', 'k8s.pod.cpu_limit_utilization', 'k8s.node.memory.page_faults', 'k8s.container.cpu_limit', 'k8s.node.memory.major_page_faults', 'k8s.pod.memory.page_faults', 'k8s.container.memory_limit', 'k8s.node.filesystem.usage', 'otelcol_processor_batch_batch_send_size.avg', 'k8s.pod.memory.rss', 'k8s.pod.filesystem.usage', 'k8s.node.network.io.transmit', 'prometheus.io/port', 'scrape_duration_seconds', 'eks.amazonaws.com/compute-type', 'otelcol_processor_batch_timeout_trigger_send', 'otelcol_receiver_refused_metric_points', 'k8s.node.memory.rss', 'otelcol_processor_batch_batch_send_size.count', 'eks.amazonaws.com/component', 'otelcol_processor_dropped_metric_points', 'k8s.pod.network.io.transmit', 'otelcol_processor_refused_metric_points', 'job.runner.gitlab.com/url', 'scrape_samples_scraped', 'prometheus.io/scrape', 'k8s.pod.memory.usage', 'k8s.node.cpu.time', 'otelcol_processor_batch_batch_send_size.p50', 'k8s.node.network.io.receive', 'k8s.node.filesystem.capacity', 'k8s.node.memory.usage', 'otelcol_processor_dropped_log_records', 'k8s.container.restarts', 'k8s-app', 'otelcol_exporter_queue_capacity', 'k8s.pod.memory_request_utilization', 'k8s.pod.network.errors.transmit', 'otelcol_processor_batch_batch_send_size.max', 'otelcol_receiver_accepted_metric_points', 'otelcol_exporter_sent_metric_points', 'k8s.pod.filesystem.capacity', 'project.runner.gitlab.com/id', 'processor', 'otelcol_scraper_errored_metric_points', 'otelcol_processor_batch_batch_send_size.p05', 'interface', 'otelcol_process_memory_rss', 'net.host.name', 'name', 'trace.parent_id', 'duration_ms', 'http.route', 'service_name'] 127 | 128 | QUERY: {"breakdowns":["name"],"calculations":[{"column":"otelcol_processor_refused_metric_points","op":"SUM"},{"column":"otelcol_processor_dropped_metric_points","op":"SUM"},{"column":"otelcol_processor_dropped_log_records","op":"SUM"}],"filters":[{"column":"otelcol_processor_refused_metric_points","op":">","value":0},{"column":"otelcol_processor_dropped_metric_points","op":">","value":0},{"column":"otelcol_processor_dropped_log_records","op":">","value":0}]} 129 | 130 | {"critique": "Good response. The query is grouping by 'name' to identify different event types and summing up refused, dropped metric points and dropped log records, which are potential indicators of hitting rate limits. Furthermore, it properly filters each of those to be greater than zero to only include relevant events. 
This helps to understand what events are contributing towards rate limits being exceeded.", "outcome": "good"} 131 | 132 | --- 133 | 134 | NLQ: count window-hash where window-hash exists per hour 135 | 136 | COLUMNS: ['window-hash', 'window_hash_get', 'prop-window_hash', 'fingerprint_window_stack_traces_hash', 'prop-window_hash_get', 'fingerprint_window_overloaded_hash', 'fingerprint_window_document_keys_hash', 'fingerprint_window_keys_hash', 'fingerprint_window_client_tools_hash', 'fingerprint_window_messages_hash', 'headers-hash', 'fingerprint_time_zone_anomaly_hash', 'best_hash_get', 'w_count', 'legacy-hash', 'fingerprint_window_stack_traces', 'fingerprint_time_zone_hash', 'siteverify_nojs-hdict_count', 'fingerprint_time_zone_region_hash', 'fingerprint_window_missing_features_hash', 'fingerprint_timing_hash', 'fetch-hash', 'client_timestamp_check', 'canvas-hash', 'fingerprint_unique_keys_stable_hash_hash', 'fingerprint_window_document_keys', 'fingerprint_window_keys', 'task_request_count', 'fingerprint_web_worker_stack_size_hash', 'fingerprint_window_overloaded', 'fingerprint_window_version', 'fingerprint_time_zone_anomaly', 'token_rate_1m_tarpit', 'r_count', 'art_up__asn_window', 'fingerprint_unique_keys_stable_hash', 'sk_wgl_canv_hash', 'fingerprint_dom_rect_decimal_hash', 'fingerprint_window_messages', 'webrtc-hash', 'legacy_hash_get', 'fingerprint_window_sus_chrome_key', 'md_hash', 'webgl_hash_get', 'fingerprint_unique_keys_hash', 'checkcaptcha-elapsed-diff', 'fingerprint_analytics_hash', 'fingerprint_label_performance_start_bucket', 'art__asn_window', 'fingerprint_unique_keys_meta_hash_hash', 'getcaptcha-host', 'parent_name', 'exception.message', 'service_name', 'base_url', 'app.exception_string', 'request.path', 'response.status_code', 'duration_ms', 'db.statement', 'score_reason', 'sk', 's3_upload_uri', 'trace.parent_id', 'name', 'art_ms', 'shard_id'] 137 | 138 | QUERY: {"breakdowns":["window-hash"],"calculations":[{"op":"COUNT"}],"filters":[{"column":"window-hash","op":"exists","join_column":""}],"filter_combination":"AND","orders":[{"op":"COUNT","order":"descending"}],"time_range":3600} 139 | 140 | {"critique": "The response is mostly good, but misinterprets the time range and granularity settings. The QUERY correctly applies a COUNT calculation on events where 'window-hash' exists, which matches the NLQ requirement to count 'window-hash'. It breaks down by the 'window-hash' column for granularity and orders the results in descending order of the count, which is useful for identifying the most frequent 'window-hash' values. However, when NLQ says "per hour", it is likely seeking to show results with a granularity of 1 hour. 
As per query rules, this would require the 'time_range' to be at least '36000', which is 10 times the granularity value in seconds.", "outcome": "bad"} 141 | 142 | --- 143 | 144 | NLQ: latency distribution by status code 145 | 146 | COLUMNS: ['status_code', 'severity_code', 'status_message', 'rpc.grpc.status_code', 'duration_ms', 'container.uptime', 'rpc.method', 'message.type', 'rpc.service', 'rpc.system', 'severity_text', 'apiVersion', 'opencensus.resourcetype', 'k8s.node.network.io.transmit', 'meta.signal_type', 'ip', 'container.memory.major_page_faults', 'container.memory.page_faults', 'error', 'container.cpu.utilization', 'library.version', 'net.peer.port', 'container.memory.available', 'deprecatedLastTimestamp', 'container.memory.working_set', 'severity', 'k8s.pod.network.io.transmit', 'container.cpu.time', 'k8s.node.uptime', 'k8s.pod.uptime', 'container.filesystem.usage', 'container.filesystem.available', 'trace.trace_id', 'interface', 'deprecatedFirstTimestamp', 'deprecatedCount', 'span.kind', 'resourceVersion', 'k8s.node.network.errors.transmit', 'metadata', 'event.domain', 'trace.span_id', 'container.memory.usage', 'Kubernetes'] 147 | 148 | QUERY: {"breakdowns":["status_code"],"calculations":[{"column":"duration_ms","op":"HEATMAP"},{"column":"duration_ms","op":"P95"}],"filters":[{"column":"status_code","op":"exists","join_column":""}],"time_range":7200} 149 | 150 | {"critique": "Almost a good response. Having a HEATMAP(duration_ms) is excellent for visualizing the distribution of response times, and P95 gives a useful statistical measure of latency under typical conditions. However, filtering and breaking down by 'status_code' is unfortunately incorrect, as this does not refer to any system status code, but rather a span status. Instead, 'rpc.grpc.status_code' should be used instead. If there were an 'http.status_code' column, or an 'http.response.status_code' column, that could be a good grouping too.", "outcome": "bad"} 151 | 152 | --- 153 | 154 | NLQ: events by name 155 | 156 | COLUMNS: ['app.event_name', 'app.event_id', 'name', 'span.num_events', 'db.name', 'library.name', 'app.subscriber_name', 'parent_name', 'process.runtime.name', 'service.name', 'app.collection', 'job.enqueued_at', 'net.peer.name', 'db.system', 'http.route', 'error', 'job.created_at', 'exception.stacktrace', 'http.target', 'telemetry.sdk.name', 'http.host', 'messaging.system', 'db.statement', 'span.kind', 'type', 'http.method', 'service.version', 'messaging.destination', 'exception.message', 'exception.type', 'app.queued_times', 'process.pid', 'service.ownership', 'process.command', 'messaging.operation', 'duration_ms', 'library.version', 'meta.refinery.reason', 'messaging.protocol', 'telemetry.sdk.language', 'status_message', 'job.latency_ms', 'meta.annotation_type', 'messaging.rabbitmq.routing_key', 'deployment.environment', 'http.scheme', 'trace.link.trace_id', 'process.runtime.description', 'telemetry.instrumentation_library', 'trace.trace_id', 'trace.parent_id', 'http.status_code'] 157 | 158 | QUERY: {"breakdowns":["app.event_name"],"calculations":[{"op":"COUNT"}],"filters":[{"column":"app.event_name","op":"=","value":"Event drop","join_column":""}],"filter_combination":"AND","time_range":7200} 159 | 160 | {"critique": "This is a bad response. The NLQ asks for events by name, which implies a need to count occurrences of events broken down by their name. The query correctly uses COUNT to aggregate events and breaks it down by app.event_name. 
However, it also includes a filter for a specific app.event_name value, which is unrelated to what NLQ asks for.", "outcome": "bad"} 161 | 162 | --- 163 | 164 | NLQ: /api/external_storage/gdrive/notify 165 | 166 | COLUMNS: ['document.api_status.dropbox_download_url', 'http.payment_endpoints_no_errors', 'http.document_upload_successful', 'http.document_upload_success_rate', 'http.is_document_upload', 'http.auth_endpoints_no_errors', 'status_message', 'document_processing_event.extension', 'document.s3_dir', 'http.space_endpoints_successful', 'http.grpc_requests_no_errors', 'document_processing_event.upload_source', 'http.is_payment', 'rpc.method', 'graphql.document', 'flipper.file_requests_in_dashboard', 'link.downloadable', 'rpc.service', 'messaging.operation', 'http.payment_success_rate', 'document.upload_error_rate', 'messaging.destination', 'link.re_signable', 'message', 'http.is_auth_endpoint', 'flipper.space_analytics', 'http.url', 'flipper.space_analytics_backfill', 'http.space_endpoints_success_rate', 'http.presentation_view_endpoints_successful', 'net.transport', 'messaging.system', 'link.type', 'document_processing_event.delayed_job_queue', 'request.id', 'http.grpc_requests_success_rate', 'http.elaine_web_request', 'http.link_successful_requests_rate', 'http.esign_successful_requests', 'link.watermark_enabled', 'service.component', 'user.login_success', 'document_processing_event.client', 'flipper.space_audit_log', 'status_code', 'document.document_group_id', 'http.link_successful_requests', 'http.target', 'link.recipient_kind', 'document.upload_unknown_error_rate', 'duration_ms', 'service.name', 'http.route', 'db.statement', 'error', 'trace.parent_id', 'http.status_code', 'name', 'user.signup_referer_path', 'exception.message', 'parent_name'] 167 | 168 | QUERY: {"breakdowns":["name"],"calculations":[{"column":"duration_ms","op":"HEATMAP"}],"filters":[{"column":"http.route","op":"=","value":"/api/external_storage/gdrive/notify","join_column":""}],"time_range":7200} 169 | 170 | {"critique": "The query correctly filters for the specific route '/api/external_storage/gdrive/notify'. Although it just shows a heatmap, because NLQ did not ask for anything specific, the heatmap is the best generat calculation to use. It may also be helpful to add a calculation for MAX or P95 to identify the slowest events explicitly. The breakdown by 'name' is generic and might not provide meaningful insights for this specific route unless 'name' is related to some aspect of the functionality at the route, but it is not a bad breakdown. 
It may also be helpful to filter to 'trace.parent_id' 'does-not-exist' so that it shows only requests to the endpoint, but since NLQ did not ask for requests, this isn't strictly necessary.", "outcome": "good"} 171 | 172 | --- 173 | 174 | NLQ: find the traces with the most spans 175 | 176 | COLUMNS: ['trace.span_id', 'trace.trace_id', 'trace_flags', 'trace_id', 'span_id', 'exception.stacktrace', 'trace.parent_id', 'span.num_links', 'span.num_events', 'span.kind', 'flags', 'nodejs_gc_duration_seconds.max', 'nodejs_gc_duration_seconds.avg', 'nodejs_gc_duration_seconds.p01', 'container.memory.usage.max', 'scrape_samples_post_metric_relabeling', 'container.duration', 'extensions', 'nodejs_gc_duration_seconds.sum', 'container.memory.utilized', 'nodejs_gc_duration_seconds.p001', 'nodejs_gc_duration_seconds.p99', 'scrape_samples_scraped', 'path', 'nodejs_gc_duration_seconds.p05', 'source', 'nodejs_eventloop_lag_mean_seconds', 'nodejs_eventloop_lag_max_seconds', 'nodejs_gc_duration_seconds.count', 'nodejs_gc_duration_seconds.min', 'container.cpu.onlines', 'up', 'partial_last', 'container.cpu.utilized', 'graphql.source', 'ecs.task.memory.usage.max', 'nodejs_gc_duration_seconds.p95', 'ecs.task.cpu.onlines', 'container'] 177 | 178 | QUERY: {"breakdowns":["trace.trace_id"],"calculations":[{"op":"COUNT"}],"orders":[{"op":"COUNT","order":"descending"}]} 179 | 180 | {"critique": "Bad response. The COUNT operation is incorrectly used because it does not count the number of spans per trace, but the total events within a window. Instead, the calculations should use a COUNT_DISTINCT(trace.span_id). Breaking down by 'trace.trace_id' is good, however, because that lets the user see each trace with a span count. Finally, the query should have an ordering of COUNT_DISTINCT(trace.span_id) 'desc' so that results are ordered by most spans.", "outcome": "bad"} 181 | 182 | --- 183 | 184 | NLQ: api calls that are failing 185 | 186 | COLUMNS: ['cronofy.oauth_client_notification_subscriptions.failed', 'apirequestjournal.failed_journal_reads', 'cronofy.notifier_client.api_call_status', 'auth.oauthinator.go_to.api_call_count', 'api.event.conferencing.error', 'cronofy.notifier_client.api_call_path', 'cronofy.notifier_client.api_call_ms', 'api.event.conferencing.sync.error', 'auth.oauthinator.zoom.api_call_count', 'cronofy.service_account.relinks.failed', 'auth.failure', 'cronofy.api.rate_limits.oauth_client_account_exceeded', 'cronofy.notifier_client.api_call_method', 'auth.oauthinator.ms_teams.api_call_count', 'cronofy.api.rate_limits.oauth_client_exceeded', 'cronofy.embedded_auth_token.errors', 'apirequestjournal.failed_dir_enumerations', 'indeed.put.error', 'apirequestjournal.read_dir.total_calls', 'integrations.indeed.api_response.status', 'api.free_busy.caching.hit', 'shipper.ingress.failed_open', 'error.service_unavailable_error?', 'auth.oauthinator.go_to.refresh_failure_count', 'auth.provider_error_code', 'error', 'auth.graph_access_token_attempt', 'apirequestjournal.results', 'eightbyeight.put.error', 'integrations.slack.api_response.status', 'sli.apirequestjournal.dashboard.ok', 'integrations.hubspot.api_response.status', 'integrations.api.operation', 'enterprise_connect.request_context.failure_key', 'api.params.google_event_ids', 'eightbyeight.persistent.get.error', 'auth.error_code', 'auth.graph_me_attempt', 'auth.op', 'eightbyeight.persistent_cache.get.error', 'cronofy.calendar_api', 'auth.oauthinator.zoom.refresh_failure_count', 'shared.put.error', 'filehousekeeper.delete.failures', 
'api.event.conferencing.dial_in', 'api.free_busy.total_pages', 'integrations.zendesk.api_response.status', 'apirequestjournal.index_job.total_indexing_errors', 'ms_graph.failed_delta_link_fallback', 'api.free_busy.page_events', 'name', 'shipper.ingress.full_path', 'cronofy.account_profile', 'trace.parent_id', 'http.status', 'scheduler.build_availability_query.type', 'active_job.class', 'enterprise_connect.request_context.error', 'cronofy.oauth_client', 'duration_ms', 'sync.state', 'cronofy.environment.name', 'sync.sync_type', 'http.path', 'service_name'] 187 | 188 | QUERY: {"breakdowns":["http.path],"calculations":[{"op":"COUNT"}],"filters":[{"column":"http.status","op":"exists","join_column":""},{"column":"http.status","op":"!=","value":"200","join_column":""}],"orders":[{"op":"COUNT","order":"descending"}],"time_range":7200} 189 | 190 | {"critique": "This query is too specific. The columns 'http.path' and 'http.status' are more generic columns, meaning that they capture information about more endpoints than the columns selected ('cronofy.notifier_client.api_call_status' and 'cronofy.notifier_client.api_call_path'). By using those more specific columns, the query makes an assumption that NLQ is referring to these specific columns, when there is no evidence that NLQ is doing that. Although the structure of this response is good, it should use the more generic columns 'http.status' and 'http.path' instead.", "outcome": "bad"} 191 | 192 | --- 193 | 194 | NLQ: ListWalletMultichainTokenHoldings 195 | 196 | COLUMNS: ['service_dc_2', 'service_dc_1', 'service_dc_3', 'service_dc_5', 'service_dc_4', 'success_rate_v3', 'error', 'success_rate_demo', 'duration_ms', 'name', 'dc_simple_refinery_reason', 'client-uuid', 'meta.stressed', 'go.tb', 'ip', 'span.kind', 'db.statement.fingerprint', 'db.statement', 'span.num_links', 'parent_name', 'dc_log10_duration', 'http.url', 'db.rowcount', 'meta.refinery.reason', 'span.num_events', 'host.name', 'dc_db_system_or_type', 'log10_duration_ms', 'service.name', 'type', 'meta.signal_type', 'error.object', 'trace.span_id', 'status_code', 'meta.annotation_type', 'trace.trace_id', 'meta.refinery.sample_key', 'http.method', 'db.canonical_name', 'dc_is_root_span', 'grpc.source', 'trace.parent_id', 'duration_per_rowcount', 'meta.refinery.original_sample_rate', 'go.tb.exists', 'opencensus.exporterversion', 'dc_is_success', 'meta.refinery.send_reason', 'http.status_code', 'dc_ensure_nonroot_server_span', 'http.route'] 197 | 198 | QUERY: {"breakdowns":["name"],"filters":[{"column":"name","op":"=","value":"ListWalletMultichainTokenHoldings"}],"calculations":[{"op":"COUNT"}],"orders":[{"op":"COUNT","order":"descending"}],"time_range":7200} 199 | 200 | {"critique": "The query correctly filters on the 'name' column to isolate the 'ListWalletMultichainTokenHoldings' service. Using COUNT as a calculation is appropriate to get the number of occurrences, and ordering by count descending will properly rank the results. 
It could be improved by adding a calculation that includes the 'duration_ms' column such as AVG or MAX to assess the latency, as the NLQ suggests a potential interest in performance metrics for this particular service.", "outcome": "good"} 201 | 202 | --- 203 | 204 | NLQ: overall distribution of latencies 205 | 206 | COLUMNS: ['duration_ms', 'container.uptime', 'container.cpu.time', 'event.domain', 'container.cpu.utilization', 'ip', 'regarding', 'container.memory.page_faults', 'severity', 'metadata', 'interface', 'container.memory.major_page_faults', 'span.kind', 'namespace', 'k8s.node.cpu.time', 'error', 'container.filesystem.usage', 'uid', 'container.memory.working_set', 'span.num_events', 'container.memory.usage', 'flags', 'k8s.node.uptime', 'event.name', 'k8s.pod.cpu.time', 'k8s.pod.uptime', 'container.filesystem.capacity', 'name', 'container.memory.available', 'k8s.node.network.io.transmit', 'severity_text', 'k8s.node.cpu.utilization', 'body', 'reason', 'note', 'kind', 'net.peer.port', 'k8s.pod.cpu.utilization', 'type', 'Kubernetes'] 207 | 208 | QUERY: {"calculations":[{"column":"duration_ms","op":"HEATMAP"},{"column":"duration_ms","op":"P95"}],"breakdowns":[{"namespace"},"orders":[{"op":"P95","column:"duration_ms","order":"descending"}]] 209 | 210 | {"critique": "This is a bad query. While it includes relevant calculations, there are far too many, which is an overwhelming response. Only a HEATMAP and a common aggregation like P95 should be used. Additionally, an ideal query would break down by a column like 'namespace' even though NLQ did not ask for it, because that can yield more interesting results.", "outcome": "bad"} 211 | 212 | --- 213 | 214 | NLQ: error 215 | 216 | COLUMNS: ['error', 'env', 'exception.message', 'type', 'name', 'http.error_message', 'exception.type', 'http.error_name', 'exception.stacktrace', 'status_code', 'status_message', 'span.kind', 'net.transport', 'library.name', 'tls.authorized', 'http.target', 'service.name', 'http.url', 'tls.protocol', 'duration_ms', 'process.owner', 'library.version', 'process.command', 'parent_name', 'process.pid', 'http.flavor', 'deployment.environment', 'http.method', 'next.bubble', 'http.host', 'next.route', 'net.host.ip', 'http.scheme', 'telemetry.sdk.language', 'deployment.name', 'trace.parent_id', 'tls.alpnProtocol', 'trace.span_id', 'http.response_content_length_uncompressed', 'net.host.name', 'trace.trace_id', 'net.host.port', 'http.status_code', 'http.response_content_length', 'process.command_args', 'net.peer.name', 'http.user_agent', 'telemetry.instrumentation_library', 'net.peer.port', 'process.runtime.description', 'http.route'] 217 | 218 | QUERY: {"breakdowns":["name"],"calculations":[{"op":"COUNT"}],"filters":[{"column":"error","op":"=","value":true,"join_column":""}],"orders":[{"op":"COUNT","order":"descending"}],"time_range":7200} 219 | 220 | {"critique": "Bad response. The query correctly filters by errors and uses COUNT, but it breaks down by 'error' instead of a more useful column like 'name', which is the name of an operation. 
Although NLQ is vague, it's best to be generally helpful and show more information than less.", "outcome": "bad"} 221 | 222 | --- 223 | 224 | NLQ: I would like to pull a number from a string 225 | 226 | COLUMNS: ['github.pull_requests.0.number', 'github.run_number', 'github.job.step.number', 'github.pull_requests.0.id', 'library.name', 'github.pull_requests.0.head.sha', 'github.pull_requests.0.url', 'name', 'github.pull_requests.0.base.sha', 'github.pull_requests.0.head.repo.id', 'github.pull_requests.0.head.repo.name', 'github.pull_requests.0.base.ref', 'github.job.name', 'service.name', 'github.pull_requests.0.head.ref', 'github.pull_requests.0.head.repo.url', 'github.job.id', 'github.job.step.name', 'telemetry.sdk.language', 'github.pull_requests.0.base.repo.name', 'github.job.conclusion', 'github.conclusion', 'span.kind', 'service.version', 'github.pull_requests.0.base.repo.url', 'error', 'github.pull_requests.0.base.repo.id', 'github.job.step.conclusion', 'status_code', 'span.num_events', 'service.instance.id', 'github.job.runner_name', 'github.job.run_id', 'github.workflow', 'github.html_url', 'github.run_attempt', 'github.event', 'github.job.runner_group_id', 'github.job.run_attempt', 'telemetry.sdk.name', 'github.workflow_id', 'github.head_commit.id', 'service.namespace', 'telemetry.sdk.version', 'github.author_name', 'github.head_commit.message', 'github.head_sha', 'trace.span_id', 'github.base_sha', 'github.workflow_url', 'trace.parent_id', 'duration_ms', 'http.route'] 227 | 228 | QUERY: 229 | 230 | {"critique": "This query is good because it does a best effort at aligning with a vague NLQ. By checking if 'name' contains 'string', the query is about as close to an interpretation of NLQ as-is possible without further clarification. A COUNT of events broken down by 'name' where 'name' contains 'string' is the best way to interpret a vague NLQ like this.", "outcome": "good"} 231 | 232 | For the below NLQ, QUERY and COLUMNS provide a critique as JSON in the format {{"critique": "...", "outcome": "good"|"bad"}} as shown above. Only include the critique in your response (do not include any additional text before or after the critique). 233 | 234 | 235 | NLQ: {{prompt}} 236 | 237 | COLUMNS: {{columns}} 238 | 239 | QUERY: {{query}} 240 | --- -------------------------------------------------------------------------------- /honeycomb/prompt.txt: -------------------------------------------------------------------------------- 1 | COLUMNS:{{columns}} 2 | 3 | QUERY SPEC: 4 | All top-level keys are optional. 
5 | 6 | ```json 7 | "calculations":[ 8 | // ops: COUNT, CONCURRENCY, COUNT_DISTINCT, HEATMAP, SUM, AVG, MAX, MIN, P001, P01, P05, P10, P25, P50, P75, P90, P95, P99, P999, RATE_AVG, RATE_SUM, RATE_MAX 9 | {"op": "COUNT"},// COUNT and CONCURRENCY are just op 10 | {"op": "HEATMAP", "column": "name"} 11 | ], 12 | "filters":[ 13 | // ops: =, !=, >, >=, <, <=, starts-with, does-not-start-with, exists, does-not-exist, contains, does-not-contain, in, not-in 14 | {"column": "name", "op": "exists"}, // exists and does-not-exist ops only have column 15 | {"column": "name", "op": "=", "value": "something"} 16 | ], 17 | "filter_combination": "AND", // AND or OR 18 | "breakdowns":[ 19 | // columns in COLUMNS 20 | "column1","column2" 21 | ], 22 | "orders":[ 23 | // HEATMAP not allowed 24 | // Must come from breakdowns or calculations 25 | {"op": "op_in_calculation", "column": "column_in_calculation", "order": "ascending"}, 26 | {"op": "COUNT", "order": "descending"}, // COUNT and CONCURRENCY have no column 27 | {"column": "column1", "order": "descending"}, 28 | ], 29 | "havings":[ 30 | // HEATMAP not allowed 31 | {"calculate_op": "op_in_calculation", "column": "name", "op": "OPNAME", "value": 100}, 32 | {"calculate_op": "COUNT", "op": ">", "value": 10}, // COUNT and CONCURRENCY have no column 33 | ], 34 | "time_range": 7200, // Relative time range in seconds. 35 | "start_time": 1234567890, // UNIX timestamp 36 | "end_time": 1234567890, // UNIX timestamp 37 | ``` 38 | 39 | TIPS: 40 | COUNT counts events/spans. COUNT_DISTINCT counts unique values of columns. 41 | HEATMAP shows value distributions. 42 | trace.parent_id does-not-exist shows root span of a trace. Use that to count requests and traces. 43 | name is a span or span event. 44 | parent_name is the name of a span that created a span event. 45 | 46 | When the user input is asking about a time range (such as "yesterday" or "since last week"), 47 | always use the time_range, start_time, and end_time fields. time_range 48 | is relative and can be combined with either start_time or end_time but not both. 49 | 50 | Modify existing query or create a new query based on NLQ. Only include the query in your response (do not include any additional text before or after the query). 
51 | 52 | NLQ: Exception count by exception and caller 53 | EXISTING QUERY: 54 | {"breakdowns":["exception.message","parent_name"],"calculations":[{"op":"COUNT"}],"filters":[{"column":"exception.message","op":"exists","join_column":""},{"column":"parent_name","op":"exists","join_column":""}],"orders":[{"op":"COUNT","order":"descending"}],"time_range":7200} 55 | 56 | NLQ: Error count 57 | EXISTING QUERY: 58 | {"breakdowns":["error"],"calculations":[{"op":"COUNT"}],"filters":[{"column":"error","op":"=","value":true,"join_column":""}],"orders":[{"op":"COUNT","order":"descending"}],"time_range":7200} 59 | 60 | NLQ: Error rate 61 | EXISTING QUERY: 62 | {"breakdowns":["error"],"calculations":[{"op":"COUNT"}],"orders":[{"op":"COUNT","order":"descending"}],"time_range":7200} 63 | 64 | NLQ: Slow requests 65 | EXISTING QUERY: 66 | {"breakdowns":["http.route"],"calculations":[{"column":"duration_ms","op":"HEATMAP"},{"column":"duration_ms","op":"MAX"}],"filters":[{"column":"trace.parent_id","op":"does-not-exist","join_column":""}],"orders":[{"column":"duration_ms","op":"MAX","order":"descending"}],"time_range":7200} 67 | 68 | NLQ: Latency distribution by status code 69 | EXISTING QUERY: 70 | {"breakdowns":["http.status_code"],"calculations":[{"column":"duration_ms","op":"HEATMAP"},{"column":"duration_ms","op":"P95"}],"filters":[{"column":"http.status_code","op":"exists","join_column":""}],"time_range":7200} 71 | 72 | NLQ: Most time consuming db queries 73 | EXISTING QUERY: 74 | {"breakdowns":["name","db.statement"],"calculations":[{"column":"duration_ms","op":"SUM"}],"filters":[{"column":"db.statement","op":"exists","join_column":""}],"orders":[{"column":"duration_ms","op":"SUM","order":"descending"}],"time_range":7200} 75 | 76 | NLQ: Slow db queries 77 | EXISTING QUERY: 78 | {"breakdowns":["db.statement"],"calculations":[{"column":"duration_ms","op":"HEATMAP"},{"column":"duration_ms","op":"P95"}],"filters":[{"column":"db.statement","op":"exists","join_column":""}],"orders":[{"column":"duration_ms","op":"P95","order":"descending"}],"time_range":604800} 79 | 80 | NLQ: by endpoint and operation 81 | EXISTING QUERY:{"calculations":[{"column":"duration_ms","op":"HEATMAP"},{"column":"duration_ms","op":"MAX"}],"filters":[{"column":"trace.parent_id","op":"does-not-exist","join_column":""}],"orders":[{"column":"duration_ms","op":"MAX","order":"descending"}]} 82 | {"breakdowns":["http.route","name"],"calculations":[{"column":"duration_ms","op":"HEATMAP"},{"column":"duration_ms","op":"MAX"}],"filters":[{"column":"trace.parent_id","op":"does-not-exist","join_column":""}],"orders":[{"column":"duration_ms","op":"MAX","order":"descending"}]} 83 | 84 | NLQ: by service 85 | EXISTING QUERY:{"calculations":[{"column":"duration_ms","op":"HEATMAP"}]} 86 | {"breakdowns":["service.name"],"calculations":[{"column":"duration_ms","op":"HEATMAP"}]} 87 | 88 | NLQ: where span is client_request 89 | EXISTING QUERY:{"calculations":[{"column":"duration_ms","op":"HEATMAP"}],"filters":[{"column":"trace.parent_id","op":"does-not-exist","join_column":""}],"time_range":604800} 90 | {"calculations":[{"column":"duration_ms","op":"HEATMAP"}],"filters":[{"column":"trace.parent_id","op":"does-not-exist","join_column":""},{"column":"name","op":"=","value":"client_request","join_column":""}],"time_range":604800} 91 | 92 | NLQ: add count 93 | EXISTING QUERY:{"breakdowns":["name"],"calculations":[{"column":"duration_ms","op":"HEATMAP"}],"filters":[{"column":"http.route","op":"exists","join_column":""}]} 94 | 
{"breakdowns":["name"],"calculations":[{"column":"duration_ms","op":"HEATMAP"},{"op":"COUNT"}],"filters":[{"column":"http.route","op":"exists","join_column":""}]} 95 | 96 | NLQ: remove count 97 | EXISTING QUERY:{"breakdowns":["name"],"calculations":[{"column":"duration_ms","op":"HEATMAP"},{"op":"COUNT"}],"filters":[{"column":"service.name","op":"exists","join_column":""}]} 98 | {"breakdowns":["name"],"calculations":[{"column":"duration_ms","op":"HEATMAP"}],"filters":[{"column":"service.name","op":"exists","join_column":""}]} 99 | 100 | NLQ: add count 101 | EXISTING QUERY: 102 | {"calculations":[{"op":"COUNT"}]} 103 | 104 | NLQ: slow requests 105 | EXISTING QUERY:{"breakdowns":["name"],"filters":[{"column":"error","op":"exists","join_column":""}]} 106 | {"breakdowns":["http.route"],"calculations":[{"column":"duration_ms","op":"HEATMAP"}]} 107 | 108 | 109 | NLQ:all pods with name cron-no-medication 110 | EXISTING QUERY:{"calculations":[{"op":"COUNT"}],"filters":[{"column":"k8s.pod.name","op":"starts-with","value":"cron-no-medication-observation-received-cron","join_column":""}],"filter_combination":"AND","limit":100,"time_range":604800,"granularity":0} 111 | 112 | NLQ:{{prompt}} 113 | EXISTING QUERY: -------------------------------------------------------------------------------- /honeycomb/queries.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Honeycomb Query Evals\n", 8 | "\n", 9 | "Evals for for Honeycomb Natural Langaguge Query generator from the [Fine Tuning LLMs](https://maven.com/parlance-labs/fine-tuning) course. Related notebooks from the course can be found at .\n", 10 | "\n", 11 | "The [queries.csv](queries.csv) dataset contains \\~ 2,300 example queries (along with per-query column schemas generated offline via RAG). There are two scoring methods supported\n", 12 | "(corresponding to the two @task definitions below):\n", 13 | "\n", 14 | "1. validate - score using the validity checker from the course (utils.py)\n", 15 | "2. 
critique - score using the critique prompt from the course (critique.txt)" 16 |    ] 17 |   }, 18 |   { 19 |    "cell_type": "markdown", 20 |    "metadata": {}, 21 |    "source": [ 22 |     "### Dataset\n", 23 |     "\n", 24 |     "Inspect uses a standard schema for [Datasets](https://ukgovernmentbeis.github.io/inspect_ai/datasets.html), so we'll map the raw data into that schema when reading it (note that \"columns\" are saved as metadata so we can use them for prompt engineering)" 25 |    ] 26 |   }, 27 |   { 28 |    "cell_type": "code", 29 |    "execution_count": null, 30 |    "metadata": {}, 31 |    "outputs": [], 32 |    "source": [ 33 |     "from inspect_ai.dataset import csv_dataset, FieldSpec\n", 34 |     "\n", 35 |     "dataset = csv_dataset(\n", 36 |     "    csv_file=\"queries.csv\",\n", 37 |     "    sample_fields=FieldSpec(input=\"user_input\", metadata=[\"columns\"]),\n", 38 |     "    shuffle=True\n", 39 |     ")" 40 |    ] 41 |   }, 42 |   { 43 |    "cell_type": "markdown", 44 |    "metadata": {}, 45 |    "source": [ 46 |     "### Solver\n", 47 |     "\n", 48 |     "To build the prompt, we'll create a custom [Solver](https://ukgovernmentbeis.github.io/inspect_ai/solvers.html) that merges the user query/prompt and the RAG-retrieved column list into our prompt template:" 49 |    ] 50 |   }, 51 |   { 52 |    "cell_type": "code", 53 |    "execution_count": null, 54 |    "metadata": {}, 55 |    "outputs": [], 56 |    "source": [ 57 |     "from inspect_ai.solver import solver\n", 58 |     "from inspect_ai.util import resource\n", 59 |     "\n", 60 |     "@solver\n", 61 |     "def prompt_with_schema():\n", 62 |     "\n", 63 |     "    prompt_template = resource(\"prompt.txt\")\n", 64 |     "\n", 65 |     "    async def solve(state, generate):\n", 66 |     "        # build the prompt\n", 67 |     "        state.user_prompt.text = prompt_template.replace(\n", 68 |     "            \"{{prompt}}\", state.user_prompt.text\n", 69 |     "        ).replace(\n", 70 |     "            \"{{columns}}\", state.metadata[\"columns\"]\n", 71 |     "        )\n", 72 |     "        return state\n", 73 |     "\n", 74 |     "    return solve\n" 75 |    ] 76 |   }, 77 |   { 78 |    "cell_type": "markdown", 79 |    "metadata": {}, 80 |    "source": [ 81 |     "### Scorer\n", 82 |     "\n", 83 |     "To score the model's responses to the prompt, we'll create a custom [Scorer](https://ukgovernmentbeis.github.io/inspect_ai/scorers.html) that calls the `is_valid()` function to determine whether a valid query has been constructed:" 84 |    ] 85 |   }, 86 |   { 87 |    "cell_type": "code", 88 |    "execution_count": null, 89 |    "metadata": {}, 90 |    "outputs": [], 91 |    "source": [ 92 |     "from inspect_ai.scorer import accuracy, scorer, Score, CORRECT, INCORRECT\n", 93 |     "from utils import is_valid, json_completion\n", 94 |     "\n", 95 |     "@scorer(metrics=[accuracy()])\n", 96 |     "def validate_scorer():\n", 97 |     "\n", 98 |     "    async def score(state, target):\n", 99 |     "        \n", 100 |     "        # check for valid query\n", 101 |     "        query = json_completion(state.output.completion)\n", 102 |     "        if is_valid(query, state.metadata[\"columns\"]):\n", 103 |     "            value=CORRECT\n", 104 |     "        else: \n", 105 |     "            value=INCORRECT\n", 106 |     "        \n", 107 |     "        # return score w/ query that was extracted\n", 108 |     "        return Score(value=value, answer=query)\n", 109 |     "\n", 110 |     "    return score\n" 111 |    ] 112 |   }, 113 |   { 114 |    "cell_type": "markdown", 115 |    "metadata": {}, 116 |    "source": [ 117 |     "Note that the `json_completion()` function takes care of some details around extracting JSON from a model completion (e.g. 
removing the surrounding backtick code block emitted by some models)" 118 |    ] 119 |   }, 120 |   { 121 |    "cell_type": "markdown", 122 |    "metadata": {}, 123 |    "source": [ 124 |     "### Validate Task\n", 125 |     "\n", 126 |     "Now we'll put all of this together to create an evaluation task:" 127 |    ] 128 |   }, 129 |   { 130 |    "cell_type": "code", 131 |    "execution_count": null, 132 |    "metadata": {}, 133 |    "outputs": [], 134 |    "source": [ 135 |     "from inspect_ai import eval, task, Task\n", 136 |     "from inspect_ai.solver import system_message, generate\n", 137 |     "\n", 138 |     "@task\n", 139 |     "def validate():\n", 140 |     "    return Task(\n", 141 |     "        dataset=dataset,\n", 142 |     "        plan=[\n", 143 |     "            system_message(\"Honeycomb AI suggests queries based on user input.\"),\n", 144 |     "            prompt_with_schema(),\n", 145 |     "            generate()\n", 146 |     "        ],\n", 147 |     "        scorer=validate_scorer()\n", 148 |     "    )" 149 |    ] 150 |   }, 151 |   { 152 |    "cell_type": "markdown", 153 |    "metadata": {}, 154 |    "source": [ 155 |     "We can run the task using Inspect's `eval()` function (limiting to 100 samples):" 156 |    ] 157 |   }, 158 |   { 159 |    "cell_type": "code", 160 |    "execution_count": null, 161 |    "metadata": {}, 162 |    "outputs": [], 163 |    "source": [ 164 |     "if __name__ == '__main__':\n", 165 |     "    eval(validate, model=\"openai/gpt-4-turbo\", limit=100)" 166 |    ] 167 |   }, 168 |   { 169 |    "cell_type": "markdown", 170 |    "metadata": {}, 171 |    "source": [ 172 |     "We use the `__name__ == '__main__'` conditional to indicate that we only want to run this cell in interactive contexts. As a result, we can also use the notebook as a module callable from `inspect eval`. For example:\n", 173 |     "\n", 174 |     "```bash\n", 175 |     "$ inspect eval queries.ipynb@validate \n", 176 |     "```" 177 |    ] 178 |   }, 179 |   { 180 |    "cell_type": "markdown", 181 |    "metadata": {}, 182 |    "source": [ 183 |     "### Critique Task\n", 184 |     "\n", 185 |     "Now we'll create a critique task. 
For this we'll need an LLM-based scorer that uses a critique template to prompt for whether the generated query is \"good\" or \"bad\":" 186 |    ] 187 |   }, 188 |   { 189 |    "cell_type": "code", 190 |    "execution_count": null, 191 |    "metadata": {}, 192 |    "outputs": [], 193 |    "source": [ 194 |     "import json\n", 195 |     "from inspect_ai.model import get_model\n", 196 |     "\n", 197 |     "@scorer(metrics=[accuracy()])\n", 198 |     "def critique_scorer(model = \"anthropic/claude-3-opus-20240229\"):\n", 199 |     "\n", 200 |     "    async def score(state, target):\n", 201 |     "        \n", 202 |     "        # build the critic prompt\n", 203 |     "        query = state.output.completion.strip()\n", 204 |     "        critic_prompt = resource(\"critique.txt\").replace(\n", 205 |     "            \"{{prompt}}\", state.user_prompt.text\n", 206 |     "        ).replace(\n", 207 |     "            \"{{columns}}\", state.metadata[\"columns\"]\n", 208 |     "        ).replace(\n", 209 |     "            \"{{query}}\", query\n", 210 |     "        )\n", 211 |     "        \n", 212 |     "        # run the critique\n", 213 |     "        result = await get_model(model).generate(critic_prompt)\n", 214 |     "        try:\n", 215 |     "            parsed = json.loads(json_completion(result.completion))\n", 216 |     "            value = CORRECT if parsed[\"outcome\"] == \"good\" else INCORRECT\n", 217 |     "            explanation = parsed[\"critique\"]\n", 218 |     "        except (json.JSONDecodeError, KeyError):\n", 219 |     "            value = INCORRECT\n", 220 |     "            explanation = f\"JSON parsing error:\\n{result.completion}\"\n", 221 |     "        \n", 222 |     "        # return value and explanation (critique text)\n", 223 |     "        return Score(value=value, explanation=explanation)\n", 224 |     "\n", 225 |     "    return score" 226 |    ] 227 |   }, 228 |   { 229 |    "cell_type": "markdown", 230 |    "metadata": {}, 231 |    "source": [ 232 |     "Now we use this scorer in a critique task definition:" 233 |    ] 234 |   }, 235 |   { 236 |    "cell_type": "code", 237 |    "execution_count": null, 238 |    "metadata": {}, 239 |    "outputs": [], 240 |    "source": [ 241 |     "@task\n", 242 |     "def critique():\n", 243 |     "    return Task(\n", 244 |     "        dataset=dataset,\n", 245 |     "        plan=[\n", 246 |     "            system_message(\"Honeycomb AI suggests queries based on user input.\"),\n", 247 |     "            prompt_with_schema(),\n", 248 |     "            generate()\n", 249 |     "        ],\n", 250 |     "        scorer=critique_scorer()\n", 251 |     "    )" 252 |    ] 253 |   }, 254 |   { 255 |    "cell_type": "markdown", 256 |    "metadata": {}, 257 |    "source": [ 258 |     "And then run the task using `eval()` (limiting to 25 samples):" 259 |    ] 260 |   }, 261 |   { 262 |    "cell_type": "code", 263 |    "execution_count": null, 264 |    "metadata": {}, 265 |    "outputs": [], 266 |    "source": [ 267 |     "if __name__ == '__main__':\n", 268 |     "    eval(critique, model=\"openai/gpt-4-turbo\", limit=25)" 269 |    ] 270 |   } 271 |  ], 272 |  "metadata": { 273 |   "kernelspec": { 274 |    "display_name": ".venv", 275 |    "language": "python", 276 |    "name": "python3" 277 |   }, 278 |   "language_info": { 279 |    "codemirror_mode": { 280 |     "name": "ipython", 281 |     "version": 3 282 |    }, 283 |    "file_extension": ".py", 284 |    "mimetype": "text/x-python", 285 |    "name": "python", 286 |    "nbconvert_exporter": "python", 287 |    "pygments_lexer": "ipython3", 288 |    "version": "3.11.6" 289 |   } 290 |  }, 291 |  "nbformat": 4, 292 |  "nbformat_minor": 2 293 | } 294 | -------------------------------------------------------------------------------- /honeycomb/queries.py: -------------------------------------------------------------------------------- 1 | """ 2 | Evals for the Honeycomb Natural Language Query generator from the Fine Tuning 3 | LLMs course (https://maven.com/parlance-labs/fine-tuning) 4 | 5 | Notebooks from the course: https://github.com/parlance-labs/ftcourse 6 | 7 | The 
queries.csv dataset contains ~ 2,300 example queries (along with column 8 | schemas generated offline via RAG). There are two scoring methods supported 9 | (corresponding to the two @task definitions below): 10 | 11 | 1. validate - score using the validity checker from the course (utils.py) 12 | 2. critique - score using the critique prompt from the course (critique.txt) 13 | """ 14 | 15 | import json 16 | 17 | from inspect_ai import task, Task 18 | from inspect_ai.dataset import csv_dataset, FieldSpec 19 | from inspect_ai.model import get_model 20 | from inspect_ai.scorer import accuracy, scorer, Score, CORRECT, INCORRECT 21 | from inspect_ai.solver import system_message, generate, solver 22 | from inspect_ai.util import resource 23 | 24 | from utils import is_valid, json_completion 25 | 26 | 27 | @task 28 | def validate(): 29 | return eval_task(scorer=validate_scorer()) 30 | 31 | 32 | @task 33 | def critique(): 34 | return eval_task(scorer=critique_scorer()) 35 | 36 | 37 | # shared task implementation parmaeterized by scorer 38 | def eval_task(scorer): 39 | 40 | # read dataset 41 | dataset = csv_dataset( 42 | csv_file="queries.csv", 43 | sample_fields=FieldSpec( 44 | input="user_input", 45 | metadata=["columns"] 46 | ), 47 | shuffle=True 48 | ) 49 | 50 | # create eval task 51 | return Task( 52 | dataset=dataset, 53 | plan=[ 54 | system_message("Honeycomb AI suggests queries based on user input."), 55 | prompt_with_schema(), 56 | generate() 57 | ], 58 | scorer=scorer 59 | ) 60 | 61 | 62 | @solver 63 | def prompt_with_schema(): 64 | 65 | prompt_template = resource("prompt.txt") 66 | 67 | async def solve(state, generate): 68 | # build the prompt 69 | state.user_prompt.text = prompt_template.replace( 70 | "{{prompt}}", state.user_prompt.text 71 | ).replace( 72 | "{{columns}}", state.metadata["columns"] 73 | ) 74 | return state 75 | 76 | return solve 77 | 78 | 79 | @scorer(metrics=[accuracy()]) 80 | def validate_scorer(): 81 | 82 | async def score(state, target): 83 | 84 | # check for valid query 85 | query = json_completion(state.output.completion) 86 | if is_valid(query, state.metadata["columns"]): 87 | value=CORRECT 88 | else: 89 | value=INCORRECT 90 | 91 | # return score w/ query that was extracted 92 | return Score(value=value, answer=query) 93 | 94 | return score 95 | 96 | 97 | @scorer(metrics=[accuracy()]) 98 | def critique_scorer(model = "openai/gpt-4-turbo"): 99 | 100 | async def score(state, target): 101 | 102 | # build the critic prompt 103 | query = state.output.completion.strip() 104 | critic_prompt = resource("critique.txt").replace( 105 | "{{prompt}}", state.user_prompt.text 106 | ).replace( 107 | "{{columns}}", state.metadata["columns"] 108 | ).replace( 109 | "{{query}}", query 110 | ) 111 | 112 | # run the critique 113 | result = await get_model(model).generate(critic_prompt) 114 | try: 115 | parsed = json.loads(json_completion(result.completion)) 116 | value = CORRECT if parsed["outcome"] == "good" else INCORRECT 117 | explanation = parsed["critique"] 118 | except (json.JSONDecodeError, KeyError): 119 | value = INCORRECT 120 | explanation = f"JSON parsing error:\n{result.completion}" 121 | 122 | # return value and explanation (critique text) 123 | return Score(value=value, explanation=explanation) 124 | 125 | return score 126 | 127 | -------------------------------------------------------------------------------- /honeycomb/utils.py: -------------------------------------------------------------------------------- 1 | import re 2 | import json 3 | 4 | 5 | # sometimes models 
will enclose the JSON in markdown! (e.g. ```json) 6 | # this function removes those delimiters should they be there 7 | def json_completion(completion): 8 | completion = re.sub(r'^```json\n', '', completion.strip()) 9 | completion = re.sub(r'\n```$', '', completion) 10 | return completion 11 | 12 | 13 | 14 | class InvalidQueryException(Exception): 15 | def __init__(self, message, query=None): 16 | self.message = message 17 | self.query = query 18 | if query: 19 | self.message += f"\nQuery: {self.query}" 20 | super().__init__(self.message) 21 | 22 | 23 | def is_valid(query_spec:str, columns:str, check_runnable=True): 24 | "Test if a query is valid" 25 | try: 26 | check_query(query_spec, columns, check_runnable) 27 | return True 28 | except (KeyError, InvalidQueryException): 29 | return False 30 | 31 | def check_query(query_spec:str, columns:str, check_runnable=True): 32 | "Raise an exception if a query is invalid." 33 | query_spec = query_spec.replace("'", '"') 34 | try: 35 | spec = json.loads(query_spec) 36 | except json.decoder.JSONDecodeError: 37 | raise InvalidQueryException(f"JSON parsing error:\n{query_spec}", query_spec) 38 | 39 | valid_calculate_ops = [ 40 | "COUNT", 41 | "COUNT_DISTINCT", 42 | "HEATMAP", 43 | "CONCURRENCY", 44 | "SUM", 45 | "AVG", 46 | "MAX", 47 | "MIN", 48 | "P001", 49 | "P01", 50 | "P05", 51 | "P10", 52 | "P25", 53 | "P50", 54 | "P75", 55 | "P90", 56 | "P95", 57 | "P99", 58 | "P999", 59 | "RATE_AVG", 60 | "RATE_SUM", 61 | "RATE_MAX", 62 | ] 63 | 64 | valid_filter_ops = [ 65 | "=", 66 | "!=", 67 | ">", 68 | ">=", 69 | "<", 70 | "<=", 71 | "starts-with", 72 | "does-not-start-with", 73 | "exists", 74 | "does-not-exist", 75 | "contains", 76 | "does-not-contain", 77 | "in", 78 | "not-in", 79 | ] 80 | 81 | if spec == {} or isinstance(spec, float): 82 | raise InvalidQueryException("Query spec cannot be empty.", query_spec) 83 | 84 | if isinstance(spec, str): 85 | raise InvalidQueryException("Query spec was not parsed to json.", query_spec) 86 | 87 | if "calculations" in spec: 88 | for calc in spec["calculations"]: 89 | if "op" not in calc: 90 | raise InvalidQueryException(f"{calc}: Calculation must have an op.", query_spec) 91 | 92 | if calc["op"] not in valid_calculate_ops: 93 | raise InvalidQueryException(f"Invalid calculation: {calc['op']}", query_spec) 94 | 95 | if calc["op"] == "COUNT" or calc["op"] == "CONCURRENCY": 96 | if "column" in calc: 97 | raise InvalidQueryException(f"{calc}: {calc['op']} cannot take a column as input.", query_spec) 98 | else: 99 | if "column" not in calc: 100 | raise InvalidQueryException(f"{calc}: {calc['op']} must take a column as input.", query_spec) 101 | 102 | if check_runnable and calc["column"] not in columns: 103 | raise InvalidQueryException(f"Invalid column: {calc['column']}", query_spec) 104 | 105 | 106 | if "filters" in spec: 107 | for filter in spec["filters"]: 108 | if not isinstance(filter, dict): 109 | raise InvalidQueryException("filter of type other than dict found in query.", query_spec) 110 | if "op" not in filter: 111 | raise InvalidQueryException("No op found in filter.", query_spec) 112 | if filter["op"] not in valid_filter_ops: 113 | raise InvalidQueryException(f"Invalid filter: {filter['op']}", query_spec) 114 | 115 | 116 | if check_runnable and filter["column"] not in columns: 117 | raise InvalidQueryException(f"Invalid column: {filter['column']}", query_spec) 118 | 119 | 120 | if filter["op"] == "exists" or filter["op"] == "does-not-exist": 121 | if "value" in filter: 122 | raise 
InvalidQueryException(f"{filter}: {filter['op']} cannot take a value as input.", query_spec) 123 | 124 | else: 125 | if filter["op"] == "in" or filter["op"] == "not-in": 126 | if not isinstance(filter["value"], list): 127 | raise InvalidQueryException(f"{filter}: {filter['op']} must take a list as input.", query_spec) 128 | 129 | else: 130 | if "value" not in filter: 131 | raise InvalidQueryException(f"{filter}: {filter['op']} must take a value as input.", query_spec) 132 | 133 | if "filter_combination" in spec: 134 | if isinstance(spec["filter_combination"], str) and spec[ 135 | "filter_combination" 136 | ].lower() not in ["and", "or"]: 137 | raise InvalidQueryException(f"Invalid filter combination: {spec['filter_combination']}", query_spec) 138 | 139 | 140 | if "breakdowns" in spec: 141 | for breakdown in spec["breakdowns"]: 142 | if check_runnable and breakdown not in columns: 143 | raise InvalidQueryException(f"Invalid column: {breakdown}", query_spec) 144 | 145 | 146 | if "orders" in spec: 147 | for order in spec["orders"]: 148 | if "order" not in order: 149 | raise InvalidQueryException(f"Invalid order without orders key: {query_spec}") 150 | if order["order"] != "ascending" and order["order"] != "descending": 151 | raise InvalidQueryException(f"Invalid order: {order['order']}", query_spec) 152 | 153 | if "op" in order: 154 | if order["op"] not in valid_calculate_ops: 155 | raise InvalidQueryException(f"Invalid order: {order['op']}", query_spec) 156 | 157 | 158 | if not any(calc["op"] == order["op"] for calc in spec.get("calculations", [])): 159 | raise InvalidQueryException(f"{order}: Order op must be present in calculations: {order['op']}", query_spec) 160 | 161 | if order["op"] == "COUNT" or order["op"] == "CONCURRENCY": 162 | if "column" in order: 163 | raise InvalidQueryException(f"{order}: {order['op']} cannot take a column as input.", query_spec) 164 | 165 | else: 166 | if "column" not in order: 167 | raise InvalidQueryException(f"{order}: {order['op']} must take a column as input.", query_spec) 168 | 169 | if check_runnable and order["column"] not in columns: 170 | raise InvalidQueryException(f"{order}: Invalid column in order: {order['column']}", query_spec) 171 | 172 | else: 173 | if "column" not in order: 174 | raise InvalidQueryException(f"{order}: Order must take a column or op as input.", query_spec) 175 | 176 | if check_runnable and order["column"] not in columns: 177 | raise InvalidQueryException(f"{order}: Invalid column in order: {order['column']}", query_spec) 178 | 179 | 180 | if "havings" in spec: 181 | for having in spec["havings"]: 182 | if "calculate_op" not in having: 183 | raise InvalidQueryException(f"{having}: Having must have a calculate_op.", query_spec) 184 | 185 | if "value" not in having: 186 | raise InvalidQueryException(f"{having}: Having must have a value.", query_spec) 187 | 188 | if "op" not in having: 189 | raise InvalidQueryException(f"{having}: Having must have an op.", query_spec) 190 | 191 | if having["calculate_op"] == "HEATMAP": 192 | raise InvalidQueryException("HEATMAP is not supported in having.", query_spec) 193 | 194 | if ( 195 | having["calculate_op"] == "COUNT" 196 | or having["calculate_op"] == "CONCURRENCY" 197 | ): 198 | if "column" in having: 199 | raise InvalidQueryException(f"{having}: {having['calculate_op']} cannot take a column as input.", query_spec) 200 | 201 | else: 202 | if "column" not in having: 203 | raise InvalidQueryException(f"{having}: {having['calculate_op']} must take a column as input.", query_spec) 204 | 205 
| if check_runnable and having["column"] not in columns: 206 | raise InvalidQueryException(f"{having}: Invalid column in having: {having['column']}", query_spec) 207 | 208 | 209 | if "time_range" in spec: 210 | if "start_time" in spec and "end_time" in spec: 211 | raise InvalidQueryException("Time range cannot be specified with start_time and end_time.", query_spec) 212 | 213 | if not isinstance(spec["time_range"], int): 214 | raise InvalidQueryException(f"time_range must be an int: {spec['time_range']}", query_spec) 215 | 216 | 217 | if "start_time" in spec: 218 | if not isinstance(spec["start_time"], int): 219 | raise InvalidQueryException(f"start_time must be an int: {spec['start_time']}", query_spec) 220 | 221 | 222 | if "end_time" in spec: 223 | if not isinstance(spec["end_time"], int): 224 | raise InvalidQueryException(f"end_time must be an int: {spec['end_time']}", query_spec) 225 | 226 | 227 | if "granularity" in spec: 228 | if not isinstance(spec["granularity"], int): 229 | raise InvalidQueryException(f"granularity must be an int: {spec['granularity']}", query_spec) 230 | 231 | 232 | time_range = ( 233 | spec["time_range"] 234 | if "time_range" in spec 235 | else spec["end_time"] - spec["start_time"] 236 | if "start_time" in spec and "end_time" in spec 237 | else 7200 238 | ) 239 | if spec["granularity"] > time_range / 10: 240 | raise InvalidQueryException(f"granularity must be <= time_range / 10: {spec['granularity']}", query_spec) 241 | 242 | if spec["granularity"] < time_range / 1000: 243 | raise InvalidQueryException(f"granularity must be >= time_range / 1000: {spec['granularity']}", query_spec) 244 | 245 | if "limit" in spec: 246 | if not isinstance(spec["limit"], int): 247 | raise InvalidQueryException(f"limit must be an int: {spec['limit']}", query_spec) 248 | -------------------------------------------------------------------------------- /langchain/.env.example: -------------------------------------------------------------------------------- 1 | TAVILY_API_KEY=your-tavily-api-key 2 | 3 | -------------------------------------------------------------------------------- /langchain/.gitignore: -------------------------------------------------------------------------------- 1 | .env 2 | .venv/ 3 | -------------------------------------------------------------------------------- /langchain/README.md: -------------------------------------------------------------------------------- 1 | ## LangChain Agent 2 | 3 | This example demonstrates creating a custom solver that utilises a LangChain agent to perform QA using Wikipedia. The example includes the following source files: 4 | 5 | | File | Description | 6 | |------------------------|-------------------------------------------------------------------------------------------------| 7 | | `.gitignore` | Ignore the `.venv` directory and the `.env` file containing environment variables for the eval. | 8 | | `.env.example` | Prototype of `.env` file (copy this to `.env` and provide your `TAVILY_API_KEY`). | 9 | | `inspect_langchain.py` | Utilities for creating inspect solvers that use LangChain agents. | 10 | | `wikipedia.py` | Evaluation task and custom solver that uses the search agent. | 11 | | `wikipedia.jsonl` | Dataset with questions and ideal answers. | 12 | 13 | To run this example, first, be sure you provide a `.env` file that defines a `TAVILY_API_KEY` ([Tavily](https://tavily.com/) is a search API for LLM agents). Note that `.env` files should always be included in `.gitignore` as they often contain secrets! 
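For example, one way to set this up (a minimal sketch, assuming a Unix-like shell) is to copy the provided template and then edit it to add your key:

``` bash
$ cp .env.example .env   # then open .env and set TAVILY_API_KEY to your key
```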
14 | 15 | Next, be sure that you have installed the dependencies in the `requirements.txt` file in the root of this repository. To do this in a virtual environment: 16 | 17 | ``` bash 18 | $ python3 -m venv .venv 19 | $ source .venv/bin/activate 20 | $ pip install -r requirements.txt 21 | ``` 22 | 23 | Then, switch to the `langchain` directory and run the example as follows: 24 | 25 | ``` bash 26 | $ cd langchain 27 | $ inspect eval wikipedia.py --model openai/gpt-4 28 | ``` 29 | 30 | This example will run with any model provider that supports tool use (so Anthropic, Google Gemini, and Mistral will all work as well). 31 | 32 | If you want to run in verbose mode (to see the agent's queries printed out), pass the `verbose` task parameter: 33 | 34 | ``` bash 35 | $ inspect eval wikipedia.py --model openai/gpt-4 -T verbose=true --limit 1 36 | ``` 37 | 38 | Note that we specify `--limit 1` so that the verbose output from multiple samples is not intermixed. -------------------------------------------------------------------------------- /langchain/inspect_langchain.py: -------------------------------------------------------------------------------- 1 | import json 2 | from typing import Any, Dict, Protocol, cast, runtime_checkable 3 | 4 | from langchain_core.callbacks import ( 5 |     AsyncCallbackManagerForLLMRun, 6 |     CallbackManagerForLLMRun, 7 | ) 8 | from langchain_core.language_models import BaseChatModel 9 | from langchain_core.messages import ( 10 |     AIMessage, 11 |     BaseMessage, 12 |     FunctionMessage, 13 |     HumanMessage, 14 |     SystemMessage, 15 |     ToolMessage, 16 | ) 17 | from langchain_core.messages import ToolCall as LCToolCall 18 | from langchain_core.outputs import ( 19 |     ChatGeneration, 20 |     ChatResult, 21 | ) 22 | from pydantic.v1 import Field 23 | from typing_extensions import override 24 | 25 | from inspect_ai.model import ( 26 |     ChatMessage, 27 |     ChatMessageAssistant, 28 |     ChatMessageSystem, 29 |     ChatMessageTool, 30 |     ChatMessageUser, 31 |     Content, 32 |     ContentImage, 33 |     ContentText, 34 |     GenerateConfig, 35 |     ModelName, 36 |     ModelOutput, 37 |     ToolCall, 38 |     ToolChoice, 39 |     ToolInfo, 40 |     ToolParam, 41 |     get_model, 42 | ) 43 | from inspect_ai.solver import Generate, Solver, TaskState 44 | 45 | 46 | @runtime_checkable 47 | class LangChainAgent(Protocol): 48 |     async def __call__( 49 |         self, llm: BaseChatModel, input: dict[str, Any] 50 |     ) -> str | list[str | dict[str, Any]]: 51 |         ...
52 | 53 | 54 | def langchain_solver(agent: LangChainAgent) -> Solver: 55 | async def solve(state: TaskState, generate: Generate) -> TaskState: 56 | # create the inspect model api bridge 57 | llm = InspectChatModel() 58 | 59 | # call the agent 60 | await agent( 61 | llm=llm, 62 | input=dict( 63 | input=state.user_prompt.text, 64 | chat_history=as_langchain_chat_history(state.messages[1:]), 65 | ), 66 | ) 67 | 68 | # collect output from llm interface 69 | state.messages = llm.messages 70 | state.output = llm.output 71 | 72 | # return state 73 | return state 74 | 75 | return solve 76 | 77 | 78 | class InspectChatModel(BaseChatModel): 79 | # track messages and model output so we can update 80 | # the inspect task state when we are complete 81 | messages: list[ChatMessage] = Field(default=[], exclude=True) 82 | output: ModelOutput = Field(default=ModelOutput(), exclude=True) 83 | 84 | @property 85 | def _llm_type(self) -> str: 86 | return f"Inspect ({ModelName(get_model()).api})" 87 | 88 | @property 89 | def _identifying_params(self) -> Dict[str, Any]: 90 | return { 91 | "model_name": str(ModelName(get_model()).name), 92 | } 93 | 94 | @override 95 | def _generate( 96 | self, 97 | messages: list[BaseMessage], 98 | stop: list[str] | None = None, 99 | run_manager: CallbackManagerForLLMRun | None = None, 100 | **kwargs: Any, 101 | ) -> ChatResult: 102 | # inspect uses async exclusively 103 | raise NotImplementedError 104 | 105 | @override 106 | async def _agenerate( 107 | self, 108 | messages: list[BaseMessage], 109 | stop: list[str] | None = None, 110 | run_manager: AsyncCallbackManagerForLLMRun | None = None, 111 | **kwargs: dict[str, Any], 112 | ) -> ChatResult: 113 | # extract tools from kwargs 114 | tools: list[ToolInfo] = [] 115 | tool_choice: ToolChoice | None = None 116 | lc_tools = cast(list[dict[str, Any]] | None, kwargs.get("tools", None)) 117 | if lc_tools: 118 | tools = [ 119 | ToolInfo( 120 | name=tool["function"]["name"], 121 | description=tool["function"]["description"], 122 | params=as_inspect_tool_params(tool["function"]["parameters"]), 123 | ) 124 | for tool in lc_tools 125 | ] 126 | tool_choice = "auto" 127 | 128 | # generate 129 | input = [as_inspect_message(message) for message in messages] 130 | result = await get_model().generate( 131 | input=input, 132 | tools=tools, 133 | tool_choice=tool_choice, 134 | config=GenerateConfig(stop_seqs=stop), 135 | ) 136 | 137 | # track last messages / model output 138 | self.messages = input 139 | self.messages.append(result.choices[0].message) 140 | self.output = result 141 | 142 | # extract choices 143 | generations = [ 144 | ChatGeneration(message=as_langchain_message(choice.message)) 145 | for choice in result.choices 146 | ] 147 | 148 | # return 149 | return ChatResult(generations=generations) 150 | 151 | 152 | def as_inspect_message(message: BaseMessage) -> ChatMessage: 153 | if isinstance(message, SystemMessage): 154 | return ChatMessageSystem(content=as_inspect_content(message.content)) 155 | elif isinstance(message, HumanMessage): 156 | return ChatMessageUser(content=as_inspect_content(message.content)) 157 | elif isinstance(message, AIMessage): 158 | return ChatMessageAssistant( 159 | content=as_inspect_content(message.content), 160 | tool_calls=( 161 | [ 162 | ToolCall( 163 | type="function", 164 | function=call["name"], 165 | id=call["id"] or call["name"], 166 | arguments=call["args"], 167 | ) 168 | for call in message.tool_calls 169 | ] 170 | if message.tool_calls and len(message.tool_calls) > 0 171 | else None 172 | ), 173 | ) 
174 | elif isinstance(message, ToolMessage): 175 | return ChatMessageTool( 176 | content=as_inspect_content(message.content), 177 | tool_call_id=message.tool_call_id, 178 | ) 179 | elif isinstance(message, FunctionMessage): 180 | return ChatMessageTool( 181 | content=as_inspect_content(message.content), tool_call_id=message.name 182 | ) 183 | else: 184 | raise ValueError(f"Unexpected message type: {type(message)}") 185 | 186 | 187 | def as_langchain_message(message: ChatMessage) -> BaseMessage: 188 | if isinstance(message, ChatMessageSystem): 189 | return SystemMessage(content=as_langchain_content(message.content)) 190 | elif isinstance(message, ChatMessageUser): 191 | return HumanMessage(content=as_langchain_content(message.content)) 192 | elif isinstance(message, ChatMessageAssistant): 193 | additional_kwargs: dict[str, Any] = {} 194 | if message.tool_calls and len(message.tool_calls) > 0: 195 | additional_kwargs["tool_calls"] = [ 196 | dict( 197 | id=call.id, name=call.function, arguments=json.dumps(call.arguments) 198 | ) 199 | for call in message.tool_calls 200 | ] 201 | 202 | return AIMessage( 203 | content=as_langchain_content(message.content), 204 | tool_calls=( 205 | [ 206 | LCToolCall(id=call.id, name=call.function, args=call.arguments) 207 | for call in message.tool_calls 208 | ] 209 | if message.tool_calls 210 | else [] 211 | ), 212 | additional_kwargs=additional_kwargs, 213 | ) 214 | elif isinstance(message, ChatMessageTool): 215 | return ToolMessage( 216 | content=as_langchain_content(message.content), 217 | tool_call_id=message.tool_call_id or "", 218 | ) 219 | else: 220 | raise ValueError(f"Unexpected message type: {type(message)}") 221 | 222 | 223 | def as_langchain_chat_history(messages: list[ChatMessage]) -> list[dict[str, Any]]: 224 | return [dict(role=message.role, content=message.text) for message in messages] 225 | 226 | 227 | def as_inspect_content( 228 | content: str | list[str | dict[str, Any]], 229 | ) -> str | list[Content]: 230 | if isinstance(content, str): 231 | return content 232 | else: 233 | return [ 234 | ( 235 | ContentText(text=c) 236 | if isinstance(c, str) 237 | else ( 238 | ContentText(text=c["text"]) 239 | if c["type"] == "text" 240 | else ContentImage(image=c["image"]) 241 | ) 242 | ) 243 | for c in content 244 | ] 245 | 246 | 247 | def as_inspect_tool_params(parameters: dict[str, Any]) -> list[ToolParam]: 248 | params: list[ToolParam] = [] 249 | for key, param in parameters["properties"].items(): 250 | params.append( 251 | ToolParam( 252 | name=key, 253 | type=param["type"], 254 | description=param.get("description", param.get("title")), 255 | optional=key not in parameters["required"], 256 | ) 257 | ) 258 | return params 259 | 260 | 261 | def as_langchain_content( 262 | content: str | list[Content], 263 | ) -> str | list[str | dict[str, Any]]: 264 | if isinstance(content, str): 265 | return content 266 | else: 267 | return [c if isinstance(c, str) else c.model_dump() for c in content] 268 | -------------------------------------------------------------------------------- /langchain/wikipedia.jsonl: -------------------------------------------------------------------------------- 1 | {"input":[{"role":"user","content":"List the ten episode titles from the sixth season of \"Game of Thrones\" in broadcast order."}],"target":"The Red Woman, Home, Oathbreaker, Book of the Stranger, The Door, Blood of My Blood, The Broken Man, No One, Battle of the Bastards, The Winds of Winter"} 2 | {"input":[{"role":"user","content":"What's the difference between 
tennis and pickleball?"}],"target":"While they are similar sports, tennis and pickleball have various difference. First, the court size for pickleball is about half the size of a tennis court. Second, pickleball is played with a ball that resembles a whiffle ball. Third, pickleball is played with paddles as opposed to rackets. Finally, the scoring system is quite different as you play for points which can only be scored when you or your team are serving."} 3 | {"input":[{"role":"user","content":"Which types of fish contain the lowest levels of mercury?"}],"target":"The following types of fish contain low levels of mercury: salmon, flounder, Atlantic mackerel, anchovies, pollock, catfish, and shellfish (e.g., clams, scallops, mussels)."} 4 | -------------------------------------------------------------------------------- /langchain/wikipedia.py: -------------------------------------------------------------------------------- 1 | from typing import Any, cast 2 | 3 | from inspect_langchain import langchain_solver 4 | from langchain import hub 5 | from langchain.agents import ( 6 | AgentExecutor, 7 | BaseMultiActionAgent, 8 | create_openai_tools_agent, 9 | ) 10 | from langchain_community.agent_toolkits.load_tools import load_tools 11 | from langchain_community.tools.tavily_search.tool import TavilySearchResults 12 | from langchain_community.utilities.tavily_search import TavilySearchAPIWrapper 13 | from langchain_core.language_models import BaseChatModel 14 | 15 | from inspect_ai import Task, task 16 | from inspect_ai.dataset import json_dataset 17 | from inspect_ai.scorer import model_graded_fact 18 | from inspect_ai.solver import Solver, solver 19 | 20 | @task 21 | def wikipedia() -> Task: 22 | return Task( 23 | dataset=json_dataset("wikipedia.jsonl"), 24 | plan=wikipedia_search(), 25 | scorer=model_graded_fact(), 26 | ) 27 | 28 | @solver 29 | def wikipedia_search( 30 | max_iterations: int | None = 15, 31 | max_execution_time: float | None = None 32 | ) -> Solver: 33 | # standard prompt for functions agent 34 | prompt = hub.pull("hwchase17/openai-tools-agent") 35 | 36 | # tavily and wikipedia tools 37 | tavily_api = TavilySearchAPIWrapper() # type: ignore 38 | tools = [TavilySearchResults(api_wrapper=tavily_api)] + load_tools(["wikipedia"]) 39 | 40 | # agent function 41 | async def agent(llm: BaseChatModel, input: dict[str, Any]): 42 | # create agent -- cast needed due to: 43 | # https://github.com/langchain-ai/langchain/issues/13075 44 | tools_agent = create_openai_tools_agent(llm, tools, prompt) 45 | agent_executor = AgentExecutor.from_agent_and_tools( 46 | agent=cast(BaseMultiActionAgent, tools_agent), 47 | tools=tools, 48 | name="wikipedia_search", 49 | max_iterations=max_iterations, 50 | max_execution_time=max_execution_time, 51 | ) 52 | 53 | # execute the agent and return output 54 | result = await agent_executor.ainvoke(input) 55 | return result["output"] 56 | 57 | # return agent function as inspect solver 58 | return langchain_solver(agent) 59 | 60 | 61 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | inspect_ai 2 | openai 3 | anthropic 4 | google-generativeai 5 | datasets 6 | ipywidgets 7 | langchain==0.2.1 8 | langchainhub==0.1.16 9 | langchain-community==0.2.1 10 | wikipedia==1.4.0 11 | 12 | -------------------------------------------------------------------------------- /slides/images/inspect-honeycomb-validate.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-honeycomb-validate.png -------------------------------------------------------------------------------- /slides/images/inspect-mathmatics.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-mathmatics.png -------------------------------------------------------------------------------- /slides/images/inspect-notebook-eval.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-notebook-eval.png -------------------------------------------------------------------------------- /slides/images/inspect-view-answers.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-view-answers.png -------------------------------------------------------------------------------- /slides/images/inspect-view-honeycomb-critique-score.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-view-honeycomb-critique-score.png -------------------------------------------------------------------------------- /slides/images/inspect-view-honeycomb-critique.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-view-honeycomb-critique.png -------------------------------------------------------------------------------- /slides/images/inspect-view-honeycomb-validate-prompt-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-view-honeycomb-validate-prompt-2.png -------------------------------------------------------------------------------- /slides/images/inspect-view-honeycomb-validate-prompt.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-view-honeycomb-validate-prompt.png -------------------------------------------------------------------------------- /slides/images/inspect-view-honeycomb-validate.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-view-honeycomb-validate.png -------------------------------------------------------------------------------- /slides/images/inspect-view-messages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-view-messages.png 
-------------------------------------------------------------------------------- /slides/images/inspect-view-scoring.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-view-scoring.png -------------------------------------------------------------------------------- /slides/images/inspect-wikipedia-eval.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-wikipedia-eval.png -------------------------------------------------------------------------------- /slides/images/inspect-wikipedia-explanation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-wikipedia-explanation.png -------------------------------------------------------------------------------- /slides/images/inspect-wikipedia-messages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-wikipedia-messages.png -------------------------------------------------------------------------------- /slides/images/inspect-wikipedia-scoring.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-wikipedia-scoring.png -------------------------------------------------------------------------------- /slides/images/log-incorrect.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/log-incorrect.png -------------------------------------------------------------------------------- /slides/inspect.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Intro to Inspect" 3 | subtitle: | 4 | Open Source Framework for LLM Evals 5 | author: J.J. Allaire 6 | date: today 7 | format: 8 | revealjs: 9 | width: 1350 10 | height: 759.375 11 | menu: false 12 | slide-number: c/t 13 | pdf-separate-fragments: true 14 | --- 15 | 16 | ## Inspect 17 | 18 | - A Python package (`inspect_ai`) used to create LLM evaluations 19 | 20 | - Developed and maintained by the [UK AI Safety Institute](https://www.aisi.gov.uk/) 21 | 22 | - Similar in function to the eval frameworks embedded in benchmark suites (e.g. Open AI Evals, Eluether LM Eval Harness, etc.) 
but designed from the ground up for development of more complex evals 23 | 24 | - Focus on bridging research and production: provide a great development experience for researchers that results in evals that can be reproducibly run at scale 25 | 26 | 27 | ## {background-image="images/inspect-honeycomb-validate.png" background-size="contain"} 28 | 29 | ## Core Design 30 | 31 | ::: {style="margin-top: 2rem;"} 32 | | | | 33 | |-----------------|------------------------------------------------------| 34 | | **Dataset** | List of samples with `input` and `target` | 35 | | **Solvers** | Functions that transform dataset inputs, call the model for generation, and act further on model output. Can be composed together as layers, or can be a single layer with higher internal complexity | 36 | | **Scorer** | Evaluates final output of solvers. May use text comparisons, model grading, or other custom schemes | 37 | ::: 38 | 39 | ## Hello, World 40 | 41 | ``` {.python code-line-numbers="|11|12-16|17|20"} 42 | from inspect_ai import Task, eval, task 43 | from inspect_ai.dataset import example_dataset 44 | from inspect_ai.scorer import model_graded_fact 45 | from inspect_ai.solver import ( 46 | chain_of_thought, generate, self_critique 47 | ) 48 | 49 | @task 50 | def theory_of_mind(): 51 | return Task( 52 | dataset=example_dataset("theory_of_mind"), 53 | plan=[ 54 | chain_of_thought(), 55 | generate(), 56 | self_critique() 57 | ], 58 | scorer=model_graded_fact(), 59 | ) 60 | 61 | eval(theory_of_mind, model="openai/gpt-4") 62 | ``` 63 | 64 | 65 | ## Honeycomb Eval: `validate()` 66 | 67 | ```{.python code-line-numbers="|3-10|16-20|21"} 68 | @task 69 | def validate(): 70 | # read dataset 71 | dataset = csv_dataset( 72 | csv_file="queries.csv", 73 | sample_fields=FieldSpec( 74 | input="user_input", 75 | metadata=["columns"] 76 | ), 77 | shuffle=True 78 | ) 79 | 80 | # create eval task 81 | return Task( 82 | dataset=dataset, 83 | plan=[ 84 | system_message("Honeycomb AI suggests queries based on user input."), 85 | prompt_with_schema(), 86 | generate() 87 | ], 88 | scorer=validate_scorer() 89 | ) 90 | ``` 91 | 92 | ## Solver: `prompt_with_schema()` 93 | 94 | Simple prompt template that substitutes the user query and the RAG generated column schema. 95 | 96 | ```{.python code-line-numbers="|7-12"} 97 | @solver 98 | def prompt_with_schema(): 99 | 100 | prompt_template = resource("prompt.txt") 101 | 102 | async def solve(state, generate): 103 | # build the prompt 104 | state.user_prompt.text = prompt_template.replace( 105 | "{{prompt}}", state.user_prompt.text 106 | ).replace( 107 | "{{columns}}", state.metadata["columns"] 108 | ) 109 | return state 110 | 111 | return solve 112 | ``` 113 | 114 | 115 | ## Scorer: `validate_scorer()` 116 | 117 | Call the `is_valid()` function w/ the column schema to determine if a valid query was generated. 
118 | 119 | ```{.python code-line-numbers="|6-11"} 120 | @scorer(metrics=[accuracy()]) 121 | def validate_scorer(): 122 | 123 | async def score(state, target): 124 | 125 | # check for valid query 126 | query = json_completion(state.output.completion) 127 | if is_valid(query, state.metadata["columns"]): 128 | value=CORRECT 129 | else: 130 | value=INCORRECT 131 | 132 | # return score w/ query that was extracted 133 | return Score(value=value, answer=query) 134 | 135 | return score 136 | ``` 137 | 138 | ## {background-image="images/inspect-view-honeycomb-validate.png" background-size="contain"} 139 | 140 | ## {background-image="images/inspect-view-honeycomb-validate-prompt.png" background-size="contain"} 141 | 142 | ## {background-image="images/inspect-view-honeycomb-validate-prompt-2.png" background-size="contain"} 143 | 144 | 145 | ## Honeycomb Eval: `critique()` 146 | 147 | ```{.python code-line-numbers="|21"} 148 | @task 149 | def critique(): 150 | # read dataset 151 | dataset = csv_dataset( 152 | csv_file="queries.csv", 153 | sample_fields=FieldSpec( 154 | input="user_input", 155 | metadata=["columns"] 156 | ), 157 | shuffle=True 158 | ) 159 | 160 | # create eval task 161 | return Task( 162 | dataset=dataset, 163 | plan=[ 164 | system_message("Honeycomb AI suggests queries based on user input."), 165 | prompt_with_schema(), 166 | generate() 167 | ], 168 | scorer=critique_scorer() 169 | ) 170 | ``` 171 | 172 | 173 | ## Scorer: `critique_scorer()` 174 | 175 | ```{.python code-line-numbers="|2|5-13|15-19"} 176 | @scorer(metrics=[accuracy()]) 177 | def critique_scorer(model = "openai/gpt-4-turbo"): 178 | 179 | async def score(state, target): 180 | # build the critic prompt 181 | query = state.output.completion.strip() 182 | critic_prompt = resource("critique.txt").replace( 183 | "{{prompt}}", state.user_prompt.text 184 | ).replace( 185 | "{{columns}}", state.metadata["columns"] 186 | ).replace( 187 | "{{query}}", query 188 | ) 189 | 190 | # run the critique 191 | result = await get_model(model).generate(critic_prompt) 192 | parsed = json.loads(json_completion(result.completion)) 193 | value = CORRECT if parsed["outcome"] == "good" else INCORRECT 194 | explanation = parsed["critique"] 195 | 196 | return Score(value=value, explanation=explanation) 197 | 198 | return score 199 | ``` 200 | 201 | ## {background-image="images/inspect-view-honeycomb-critique.png" background-size="contain"} 202 | 203 | ## {background-image="images/inspect-view-honeycomb-critique-score.png" background-size="contain"} 204 | 205 | ## Solvers 206 | 207 | A Solver is a Python function that tasks a `TaskState` and transforms it in some useful fashion 208 | 209 | ::: aside 210 | Solver concept was borrowed from Open AI Evals and made more general in Inspect 211 | ::: 212 | 213 | ::: fragment 214 | **TaskState** (initialised from sample) 215 | 216 | ``` python 217 | class TaskState: 218 | messages: list[ChatMessage] 219 | output: ModelOutput 220 | ... 221 | ``` 222 | ::: 223 | 224 | ::: fragment 225 | **Solver Function** 226 | 227 | ``` python 228 | async def solve(state: TaskState, generate: Generate) -> TaskState: 229 | # do something useful with state (prompt engineering, 230 | # generating model output, critique and regenerate, etc.) 
231 | return state 232 | ``` 233 | ::: 234 | 235 | ## Baseline Solvers 236 | 237 | ::: fragment 238 | `prompt_template()` 239 | 240 | ``` python 241 | async def solve(state: TaskState, generate: Generate) -> TaskState: 242 | prompt = state.user_prompt 243 | prompt.text = prompt_template.format(prompt=prompt.text, **params) 244 | return state 245 | ``` 246 | 247 | Modifies the existing prompt by passing it through a template 248 | ::: 249 | 250 | ::: {.fragment style="margin-top: 2rem;"} 251 | `generate()` 252 | 253 | ``` python 254 | async def solve(state: TaskState, generate: Generate) -> TaskState: 255 | return await generate(state) 256 | ``` 257 | 258 | Calls the model, appends the assistant message, and updates the model output 259 | ::: 260 | 261 | ## Solver: `multiple_choice()` 262 | 263 | Prompt with several choices (optionally shuffled) 264 | 265 | ``` python 266 | async def solve(state: TaskState, generate: Generate) -> TaskState: 267 | 268 | # build choices str and key 269 | choices_str, choices_key = make_choices(choices=state.choices) 270 | 271 | # re-write prompt with A,B,C,... choices 272 | state.user_prompt.text = template.format( 273 | question=state.user_prompt.text, 274 | choices=choices_str, 275 | ) 276 | 277 | # generate 278 | state = await generate(state, temperature=0.0, max_tokens=1) 279 | 280 | # map the output back to the right index and return 281 | state.output.completion = choices_key[state.output.completion] 282 | 283 | return state 284 | ``` 285 | 286 | ## Solver: `self_critique()` 287 | 288 | Critique the generated response (possibly with another model), then re-generate in response to the critique. 289 | 290 | ```{.python code-line-numbers="|3-8|10-16|18"} 291 | async def solve(state: TaskState, generate: Generate) -> TaskState: 292 | 293 | critique = await model.generate( 294 | critique_template.format( 295 | question=state.input_text, 296 | completion=state.output.completion, 297 | ) 298 | ) 299 | 300 | state.messages.append(ChatMessageUser( 301 | content=completion_template.format( 302 | question=state.input_text, 303 | completion=state.output.completion, 304 | critique=critique.completion, 305 | ), 306 | )) 307 | 308 | return await generate(state) 309 | ``` 310 | 311 | ## Composition 312 | 313 | Eval development frequently involves creating custom solvers and scorers. 
If made available in a Python package, these can be re-used across many evals. 314 | 315 | ::: fragment 316 | Some jailbreaking solvers from an internal **sheppard** package: 317 | 318 | | | | 319 | |-----------------------|-------------------------------------| 320 | | `encode()`            | Message obfuscation jailbreak       | 321 | | `pap_jailbreak()`     | Persuasion Adversarial Prompt (PAP) | 322 | | `payload_splitting()` | PARROT jailbreak                    | 323 | | `cr_jailbreak()`      | Content reinforcement               | 324 | ::: 325 | 326 | ## Composition 327 | 328 | Using **sheppard** to provide jailbreaks for a security eval: 329 | 330 | ```{.python code-line-numbers="|5|11-15"} 331 | from inspect_ai import Task, eval, task 332 | from inspect_ai.scorer import model_graded_fact 333 | from inspect_ai.solver import generate, system_message 334 | 335 | from sheppard import pap_jailbreak 336 | 337 | @task 338 | def security_guide(): 339 |     return Task( 340 |         dataset=example_dataset("security_guide"), 341 |         plan=[ 342 |             system_message("system.txt"), 343 |             pap_jailbreak(), 344 |             generate() 345 |         ], 346 |         scorer=model_graded_fact(model="openai/gpt-4"), 347 |     ) 348 | ``` 349 | 350 | ## Tool Use 351 | 352 | ::: fragment 353 | `TaskState` also includes tools: 354 | 355 | ``` {.python code-line-numbers="3-4"} 356 | class TaskState: 357 |     messages: list[ChatMessage] 358 |     tools: list[ToolDef] 359 |     tool_choice: ToolChoice 360 |     output: ModelOutput 361 |     ... 362 | ``` 363 | ::: 364 | 365 | ::: fragment 366 | `use_tools()` makes tools available to `generate()`: 367 | 368 | ``` {.python code-line-numbers="3-6"} 369 | return Task( 370 |     dataset=example_dataset("biology_qa"), 371 |     plan=[ 372 |         use_tools(web_search()), 373 |         generate() 374 |     ], 375 |     scorer=model_graded_qa(template=GRADER_TEMPLATE), 376 | ) 377 | ``` 378 | ::: 379 | 380 | 381 | ## Agents and Tools 382 | 383 | - Many permutations of agents and tool use are possible 384 | 385 | > - Bespoke agent logic inside a solver (swapping various tools in and out; see the sketch below) 386 | 387 | > - Bridges to various agent libraries are available as solvers (e.g. `langchain_agent()`, `langroid_agent()`, etc.)
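As a rough sketch (hypothetical code, not taken from the workshop repo), bespoke agent logic wrapped in a solver might look something like the following; the `max_turns` cap and the `"ANSWER:"` stopping convention are assumptions made up for this illustration:

``` python
from inspect_ai.solver import Generate, TaskState, solver

@solver
def simple_agent(max_turns: int = 5):
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        # call the model repeatedly, stopping once it emits a final answer
        # (the "ANSWER:" marker is an illustrative convention, not part of Inspect)
        for _ in range(max_turns):
            state = await generate(state)
            if "ANSWER:" in state.output.completion:
                break
        return state
    return solve
```

A real agent of this kind would typically also include `use_tools(...)` earlier in the plan so that tools are available to `generate()` on each turn.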
388 | 389 | 390 | 391 | ## Agent: Capture the Flag 392 | 393 | Cybersecurity eval using a hand-rolled agent loop (custom agents and agent frameworks can both be embedded in solvers) 394 | 395 | ``` {.python code-line-numbers="|4-8|12"} 396 | Plan( 397 | steps=[ 398 | init_challenge(), 399 | use_tools([ 400 | command_exec(), create_file(), 401 | decompile(), disassemble(), 402 | check_flag(), 403 | ]), 404 | system_message("prompts/system.txt"), 405 | initial_user_message(), 406 | generate(), 407 | check_for_flag_or_continue() 408 | ], 409 | cleanup=exit_challenge() 410 | ) 411 | ``` 412 | 413 | ## Agent: LangChain 414 | 415 | Convert any LangChain agent into a Solver 416 | 417 | ```{.python code-line-numbers="|4-6|8-15|17"} 418 | @solver 419 | def wikipedia_search() -> Solver: 420 | 421 | tavily_api = TavilySearchAPIWrapper() 422 | tools = ([TavilySearchResults(api_wrapper=tavily_api)] + 423 | load_tools(["wikipedia"])) 424 | 425 | async def agent(llm: BaseChatModel, input: dict[str, Any]): 426 | tools_agent = create_openai_tools_agent(llm, tools, prompt) 427 | agent_executor = AgentExecutor.from_agent_and_tools( 428 | agent=tools_agent, 429 | tools=tools 430 | ) 431 | result = await agent_executor.ainvoke(input) 432 | return result["output"] 433 | 434 | return langchain_solver(agent) 435 | ``` 436 | 437 | ## {background-image="images/inspect-wikipedia-eval.png" background-size="contain"} 438 | 439 | ## {background-image="images/inspect-wikipedia-messages.png" background-size="contain"} 440 | 441 | ## {background-image="images/inspect-wikipedia-scoring.png" background-size="contain"} 442 | 443 | ## {background-image="images/inspect-wikipedia-explanation.png" background-size="contain"} 444 | 445 | ## Scoring 446 | 447 | ::: incremental 448 | - Conventional pattern matching / templated answer-based scoring built in. 449 | 450 | - Model graded scorer built-in (often heavily customized). 451 | 452 | - Scorers also pluggable (i.e. provided from other packages). We expect lots of innovation in model graded scoring! 453 | 454 | - Offline / human scoring workflow is supported. 455 | 456 | - Plan to build tools to help rigorously evaluate model graded scorers against human baselines.
457 | ::: 458 | 459 | ## {background-image="images/inspect-mathmatics.png" background-size="contain"} 460 | 461 | ## Scorer: `expression_equivalence()` 462 | 463 | ```{.python code-line-numbers="|8-13"} 464 | @scorer(metrics=[accuracy(), bootstrap_std()]) 465 | def expression_equivalence(): 466 | async def score(state: TaskState, target: Target): 467 | 468 | # extract answer 469 | match = re.search(AnswerPattern.LINE, state.output.completion) 470 | 471 | # ask the model to judge equivalence 472 | answer = match.group(1) 473 | prompt = EQUIVALANCE_TEMPLATE % ( 474 | {"expression1": target.text, "expression2": answer} 475 | ) 476 | result = await get_model().generate(prompt) 477 | 478 | # return the score 479 | correct = result.completion.lower() == "yes" 480 | return Score( 481 | value=CORRECT if correct else INCORRECT, 482 | answer=answer, 483 | explanation=state.output.completion, 484 | ) 485 | 486 | return score 487 | ``` 488 | 489 | ## Logging 490 | 491 | - Capture all context required to debug, analyse, and reproduce evaluations 492 | 493 | - Python API for computing on log file contents 494 | 495 | - Log viewer for interactive exploration of eval results 496 | 497 | ## `EvalLog` {.smaller} 498 | 499 | | | | | 500 | |-------------------|-------------------|----------------------------------| 501 | | `status` | `str` | Status of evaluation | 502 | | `eval` | `EvalSpec` | Top level eval details including task, model, creation time, etc. | 503 | | `plan` | `EvalPlan` | List of solvers and model generation config used for the eval. | 504 | | `samples` | `list[EvalSample]` | Each sample evaluated, including its input, output, target, and score. | 505 | | `results` | `EvalResults` | Aggregated scorer results | 506 | | `stats` | `EvalStats` | Model token usage stats | 507 | | `logging` | `list[LoggingMessage]` | Logging messages (e.g. from `log.info()`, `log.debug()`, etc.)
| 508 | | `error` | `EvalError` | Error information | 509 | 510 | ## Log Viewer: Samples 511 | 512 | ![](images/inspect-view-answers.png){.border} 513 | 514 | ## Log Viewer: Messages 515 | 516 | ![](images/inspect-view-messages.png){.border} 517 | 518 | 519 | ## Log Viewer: Scoring 520 | 521 | ![](images/inspect-view-scoring.png){.border} 522 | 523 | ## Models {.smaller} 524 | 525 | 526 | | Provider | Model Name | Docs | 527 | |-------------------|---------------------------|---------------------------| 528 | | OpenAI | `openai/gpt-3.5-turbo` | [OpenAI Models](https://platform.openai.com/docs/models/overview) | 529 | | Anthropic | `anthropic/claude-3-sonnet-20240229` | [Anthropic Models](https://docs.anthropic.com/claude/docs/models-overview) | 530 | | Google | `google/gemini-1.0-pro` | [Google Models](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models) | 531 | | Mistral | `mistral/mistral-large-latest` | [Mistral Models](https://docs.mistral.ai/platform/endpoints/) | 532 | | Hugging Face | `hf/openai-community/gpt2` | [Hugging Face Models](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending) | 533 | | Ollama | `ollama/llama3` | [Ollama Models](https://ollama.com/library) | 534 | | TogetherAI | `together/lmsys/vicuna-13b-v1.5` | [TogetherAI Models](https://docs.together.ai/docs/inference-models#chat-models) | 535 | | AWS Bedrock | `bedrock/meta.llama2-70b-chat-v1` | [AWS Bedrock Models](https://aws.amazon.com/bedrock/) | 536 | | Azure AI | `azureai/azure-deployment-name` | [Azure AI Models](https://ai.azure.com/explore/models) | 537 | | Cloudflare | `cf/meta/llama-2-7b-chat-fp16` | [Cloudflare Models](https://developers.cloudflare.com/workers-ai/models/#text-generation) | 538 | 539 | : {tbl-colwidths="\[18,45,37\]"} 540 | 541 | ::: {.fragment style="margin-top: 30px; font-weight: 400;"} 542 | Interface with any other model by creating a custom model provider... 
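For example, any provider/model string from the table can be passed straight to `eval()`; here `my_task()` is a hypothetical stand-in for whichever registered `@task` you want to run:

``` python
# run the same task against two different providers just by
# changing the model string (my_task is a placeholder @task)
logs = eval(my_task(), model="anthropic/claude-3-sonnet-20240229")
logs = eval(my_task(), model="ollama/llama3")
```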
543 | ::: 544 | 545 | ## Workflow 546 | 547 | - Lots of interactive exploration occurs during eval development, so critical to have good support for iterating in a Notebook / REPL 548 | 549 | - Eventually though, evals need to end up in a form that enables reproducibly running them in an eval suite 550 | 551 | - Need to support a continuum of workflows that transition well into each other 552 | 553 | - Provide good tooling in Jupyter and VS Code for entire spectrum 554 | 555 | ## {background-image="images/inspect-notebook-eval.png" background-size="contain"} 556 | 557 | ## Interactive Exploration 558 | 559 | Ad-hoc exploration of an eval in a Notebook/REPL 560 | 561 | ```{.python code-line-numbers="|1-6|8-12|14-15"} 562 | params = { 563 | "system": ["devops.txt", "researcher.txt"], 564 | "grader": ["hacker.txt", "expert.txt"], 565 | "grader_model": ["openai/gpt-4", "google/gemini-1.0-pro"] 566 | } 567 | params = list(product(*(params[name] for name in params))) 568 | 569 | tasks = [Task( 570 | dataset=json_dataset("security_guide.jsonl"), 571 | plan=[system_message(system), generate()], 572 | scorer=model_graded_fact(template=grader, model=grader_model) 573 | ) for system, grader, grader_model in params] 574 | 575 | logs = eval(tasks, model = "mistral/mistral-large-latest") 576 | plot_results(logs) 577 | ``` 578 | 579 | ## Task Parameters 580 | 581 | Formalise variation with a parameterised `@task` function: 582 | 583 | ```{.python code-line-numbers="|1-7|9-16"} 584 | @task 585 | def security_guide(system="devops.txt", grader="expert.txt"): 586 | return Task( 587 | dataset = json_dataset("security_guide.jsonl"), 588 | plan=[system_message(system), generate()], 589 | scorer=model_graded_fact(template=grader, model="openai/gpt-4") 590 | ) 591 | 592 | params = { 593 | "system": ["devops.txt", "researcher.txt"], 594 | "grader": ["hacker.txt", "expert.txt"] 595 | } 596 | params = list(product(*(params[name] for name in params))) 597 | 598 | eval([security_guide(system,grader) for system, grader in params], 599 | model = "mistral/mistral-large-latest") 600 | ``` 601 | 602 | ## Task Parameters 603 | 604 | `@task` functions are registered and addressable by external driver programs (step one in development =\> production) 605 | 606 | ``` python 607 | @task 608 | def security_guide(system="devops.txt", grader="expert.txt"): 609 | return Task( 610 | dataset = json_dataset("security_guide.jsonl"), 611 | plan=[system_message(system), generate()], 612 | scorer=model_graded_fact( 613 | template=grader, 614 | model="openai/gpt-4" 615 | ) 616 | ) 617 | ``` 618 | 619 | ::: fragment 620 | Now we can vary the parameters externally: 621 | 622 | ``` {.bash code-line-numbers="true"} 623 | $ inspect eval security_guide.py -T system=devops.txt 624 | $ inspect eval security_guide.py -T grader=hacker.txt 625 | ``` 626 | ::: 627 | 628 | ::: fragment 629 | Same workflow available for tasks in a notebook: 630 | 631 | ``` {.bash code-line-numbers="true"} 632 | $ inspect eval security_guide.ipynb -T system=devops.txt 633 | $ inspect eval security_guide.ipynb -T grader=hacker.txt 634 | ``` 635 | ::: 636 | 637 | ## Task Variants 638 | 639 | We may discover that we *always* want to vary a parameter when running a full evaluation suite: 640 | 641 | ``` python 642 | def security_guide(system, grader="expert.txt"): 643 | return Task( 644 | dataset = json_dataset("security_guide.jsonl"), 645 | plan=[system_message(system), generate()], 646 | scorer=model_graded_fact(template=grader, model="openai/gpt-4") 647 | ) 648 | 649 | 
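# register the variants we always want in a suite as named @task functions
# (each can then be invoked directly, e.g. security_guide.py@devops)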
@task 650 | def devops(): 651 | return security_guide("devops.txt") 652 | 653 | @task 654 | def researcher(): 655 | return security_guide("researcher.txt") 656 | ``` 657 | 658 | 659 | ::: fragment 660 | Invoke by task name 661 | 662 | ``` {.bash code-line-numbers="true"} 663 | $ inspect eval security_guide.py@devops 664 | $ inspect eval security_guide.py@researcher 665 | ``` 666 | ::: 667 | 668 | 669 | ## Eval Suites 670 | 671 | We want to allow for arbitrary source code organisation but still be able to discover and enumerate tasks for a suite 672 | 673 | ::: columns 674 | ::: {.column .fragment} 675 | ``` {.bash code-line-numbers="true"} 676 | security/ 677 | jeopardy/ 678 | import.py 679 | analyze.py 680 | task.py 681 | attack_defense/ 682 | import.py 683 | analyze.py 684 | task.py 685 | ``` 686 | ::: 687 | 688 | ::: {.column .fragment} 689 | ``` {.python code-line-numbers="true"} 690 | list_tasks("security") 691 | 692 | jeopardy/task.py@crypto 693 | jeopardy/task.py@decompile 694 | jeopardy/task.py@packet 695 | jeopardy/task.py@heap_trouble 696 | attack_defense/task.py@saar 697 | attack_defense/task.py@bank 698 | attack_defense/task.py@voting 699 | ``` 700 | ::: 701 | ::: 702 | 703 | ::: {.fragment style="margin-top: 25px;"} 704 | Run them all 705 | 706 | ``` {.python code-line-numbers="true"} 707 | eval(list_tasks("security"), model="mistral/mistral-large-latest") 708 | ``` 709 | ::: 710 | 711 | ## Resiliency 712 | 713 | The production version would look more like this: 714 | 715 | ``` python 716 | # setup log context 717 | os.environ["INSPECT_LOG_DIR"] = "./security-suite_04-07-2024" 718 | 719 | # run the eval suite 720 | tasks = list_tasks("security") 721 | eval(tasks, model="mistral/mistral-large-latest") 722 | 723 | # ...later, in another process that also has INSPECT_LOG_DIR 724 | error_logs = list_eval_logs(status == "error") 725 | eval_retry(error_logs) 726 | ``` 727 | 728 | ::: {.fragment style="margin-top: 30px;"} 729 | Somewhat oversimplified, as we'd also want to enhance the logic around analysing the cause of errors and adopting optimal recovery strategies 730 | ::: 731 | 732 | ## Provenance 733 | 734 | If you run an eval from a Git repository, you should be able to reproduce the eval with only its log file as context 735 | 736 | ::: fragment 737 | ``` {.python code-line-numbers="|1-4|6-11"} 738 | # read the log and extract the origin and commit 739 | log = read_eval_log("security-log.json") 740 | origin = log.eval.revision.origin 741 | commit = log.eval.revision.commit 742 | 743 | # clone the repo, checkout the commit, install deps, and run 744 | run(["git", "clone", origin, "eval-dir"]) 745 | with chdir("eval-dir"): 746 | run(["git", "checkout", commit]) 747 | run(["pip", "install", "-r", "requirements.txt"]) 748 | eval(log) 749 | ``` 750 | ::: 751 | 752 | 753 | ## Learning More 754 | 755 | - Docs: 756 | 757 | - GitHub: 758 | 759 | - Slides/Code: 760 | 761 | #### Questions? 762 | 763 | 764 | 765 | ```{=html} 766 | 788 | ``` -------------------------------------------------------------------------------- /slides/intro-to-inspect.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/intro-to-inspect.pdf --------------------------------------------------------------------------------