├── .gitignore ├── README.md ├── benchmarks ├── README.md ├── arc.py ├── datasets │ ├── math_test.csv │ └── mmlu.csv ├── gpqa.py ├── gsm8k.py ├── mathematics.py └── mmlu.py ├── honeycomb ├── README.md ├── critique.txt ├── prompt.txt ├── queries.csv ├── queries.ipynb ├── queries.py └── utils.py ├── langchain ├── .env.example ├── .gitignore ├── README.md ├── inspect_langchain.py ├── wikipedia.jsonl └── wikipedia.py ├── requirements.txt └── slides ├── images ├── inspect-honeycomb-validate.png ├── inspect-mathmatics.png ├── inspect-notebook-eval.png ├── inspect-view-answers.png ├── inspect-view-honeycomb-critique-score.png ├── inspect-view-honeycomb-critique.png ├── inspect-view-honeycomb-validate-prompt-2.png ├── inspect-view-honeycomb-validate-prompt.png ├── inspect-view-honeycomb-validate.png ├── inspect-view-messages.png ├── inspect-view-scoring.png ├── inspect-wikipedia-eval.png ├── inspect-wikipedia-explanation.png ├── inspect-wikipedia-messages.png ├── inspect-wikipedia-scoring.png └── log-incorrect.png ├── inspect.qmd └── intro-to-inspect.pdf /.gitignore: -------------------------------------------------------------------------------- 1 | 2 | .venv/ 3 | .env 4 | logs/ 5 | __pycache__/ 6 | .mypy_cache 7 | .DS_Store 8 | .ipynb_checkpoints/ 9 | 10 | /.quarto/ 11 | *_files/ 12 | *.html 13 | *-present.pdf 14 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## Inspect: An OSS framework for LLM evals 2 | 3 | This repository includes the slides and code for the workshop on Inspect given at [Mastering LLMs: A Conference for Developers and Data Scientists](https://maven.com/parlance-labs/fine-tuning). 4 | 5 | ### Slides 6 | 7 | [Intro to Inspect: Open Source Framework for LLM Evals](https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/main/slides/intro-to-inspect.pdf) 8 | 9 | ### Code 10 | 11 | | Directory | Description | 12 | |---------------------|---------------------------------------------------| 13 | | [honeycomb](honeycomb/) | Evals for for Honeycomb Natural Langaguge Query generator | 14 | | [langchain](langchain/) | Demonstrates evaluating a LangChain Agent using Inspect | 15 | | [benchmarks](benchmarks/) | Implemetnations of some popular benchmarks using Inspect | 16 | | [slides](slides/) | Source code and PDF version of workshop slides | 17 | 18 | To ensure that you can run the example code in this repo, please install the dependencies using the [requirements.txt](requirements.txt) file in the root directory. 19 | -------------------------------------------------------------------------------- /benchmarks/README.md: -------------------------------------------------------------------------------- 1 | ## Benchmarks 2 | 3 | This directory contains evals for several benchmarks. Note that some benchmark datasets are included in the `datasets/` directory and some are downloaded from Hugging Face (you should be sure to `pip install datasets` before attempting to run these benchmarks). 
4 | 5 | | Benchmark | Reference | Code | Dataset | 6 | |-----------------------------|---------------|--------------:|---------------| 7 | | MMLU: Measuring Massive Multitask Language Understanding | | [mmlu.py](mmlu.py) | Local | 8 | | MATH: Measuring Mathematical Problem Solving With the MATH Dataset | | [mathematics.py](mathematics.py) | Local | 9 | | GPQA: A Graduate-Level Google-Proof Q&A Benchmark | | [gpqa.py](gpqa.py) | Hugging Face | 10 | | ARC: AI2 Reasoning Challenge | | [arc.py](arc.py) | Hugging Face | 11 | | GSM8K: Training Verifiers to Solve Math Word Problems | | [gsm8k.py](gsm8k.py) | Hugging Face | -------------------------------------------------------------------------------- /benchmarks/arc.py: -------------------------------------------------------------------------------- 1 | """ 2 | Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge 3 | 4 | Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord 5 | https://arxiv.org/abs/1803.05457 6 | 7 | # run all subsets 8 | inspect eval arc.py 9 | 10 | # run specific subsets 11 | inspect eval arc.py@arc_easy 12 | inspect eval arc.py@arc_challenge 13 | """ 14 | 15 | from inspect_ai import Task, task 16 | from inspect_ai.dataset import Sample, hf_dataset 17 | from inspect_ai.scorer import answer 18 | from inspect_ai.solver import multiple_choice 19 | 20 | 21 | def record_to_sample(record): 22 | # read the labels and text 23 | choices = record["choices"] 24 | choices = dict(zip(choices["label"], choices["text"])) 25 | 26 | # determine the target then normalize to letter 27 | answerKey = record["answerKey"] 28 | target = list(choices.keys()).index(answerKey) 29 | target = chr(ord("A") + int(target)) 30 | 31 | # return sample 32 | return Sample( 33 | input=record["question"], choices=list(choices.values()), target=target 34 | ) 35 | 36 | 37 | def arc_task(dataset_name): 38 | return Task( 39 | dataset=hf_dataset( 40 | path="allenai/ai2_arc", 41 | name=dataset_name, 42 | split="test", 43 | sample_fields=record_to_sample, 44 | ), 45 | plan=multiple_choice(), 46 | scorer=answer("letter"), 47 | ) 48 | 49 | 50 | @task 51 | def arc_easy(): 52 | return arc_task("ARC-Easy") 53 | 54 | 55 | @task 56 | def arc_challenge(): 57 | return arc_task("ARC-Challenge") 58 | -------------------------------------------------------------------------------- /benchmarks/gpqa.py: -------------------------------------------------------------------------------- 1 | """ 2 | GPQA: A Graduate-Level Google-Proof Q&A Benchmark 3 | 4 | David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard 5 | Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. 
Bowman 6 | https://arxiv.org/abs/2311.12022 7 | 8 | Based on: https://github.com/openai/simple-evals/blob/main/gpqa_eval.py 9 | 10 | # eval for default epochs (4) 11 | inspect eval gpqa.py 12 | 13 | # eval with 1 epoch 14 | inspect eval gpqa.py --epochs 1 15 | 16 | # without chain of thought 17 | inspect eval gpqa.py -T cot=false 18 | """ 19 | 20 | 21 | from inspect_ai import Task, task 22 | from inspect_ai.dataset import Sample, csv_dataset 23 | from inspect_ai.model import GenerateConfig 24 | from inspect_ai.scorer import answer 25 | from inspect_ai.solver import multiple_choice 26 | 27 | # default epochs to run eval for 28 | DEFAULT_EPOCHS = 4 29 | 30 | 31 | # map records to inspect samples (note that target is always "A" in the, 32 | # dataset, we will shuffle the presentation of options to mitigate this) 33 | def record_to_sample(record): 34 | return Sample( 35 | input=record["Question"], 36 | choices=[ 37 | str(record["Correct Answer"]), 38 | str(record["Incorrect Answer 1"]), 39 | str(record["Incorrect Answer 2"]), 40 | str(record["Incorrect Answer 3"]), 41 | ], 42 | target="A", 43 | id=record["Record ID"], 44 | ) 45 | 46 | 47 | @task 48 | def gpqa_diamond(cot=True): 49 | return Task( 50 | dataset=csv_dataset( 51 | csv_file="https://openaipublic.blob.core.windows.net/simple-evals/gpqa_diamond.csv", 52 | sample_fields=record_to_sample, 53 | ), 54 | plan=[ 55 | multiple_choice(cot=cot, shuffle=True), 56 | ], 57 | scorer=answer("letter"), 58 | config=GenerateConfig(temperature=0.5), 59 | epochs=DEFAULT_EPOCHS, 60 | ) 61 | -------------------------------------------------------------------------------- /benchmarks/gsm8k.py: -------------------------------------------------------------------------------- 1 | """ 2 | Training Verifiers to Solve Math Word Problems 3 | 4 | Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman 5 | https://arxiv.org/abs/2110.14168 6 | 7 | # run with default fewshots (10) 8 | inspect eval gsm8k.py 9 | 10 | # run with less or no fewshots 11 | inspect eval gsm8k.py -T fewshot=5 12 | inspect eval gsm8k.py -T fewshot=false 13 | """ 14 | 15 | from inspect_ai import Task, task 16 | from inspect_ai.dataset import Sample, hf_dataset 17 | from inspect_ai.scorer import match 18 | from inspect_ai.solver import generate, prompt_template, system_message 19 | 20 | 21 | def record_to_sample(record): 22 | DELIM = "####" 23 | input = record["question"] 24 | answer = record["answer"].split(DELIM) 25 | target = answer.pop().strip() 26 | reasoning = DELIM.join(answer) 27 | return Sample(input=input, target=target, metadata={"reasoning": reasoning.strip()}) 28 | 29 | 30 | def sample_to_fewshot(sample): 31 | return ( 32 | f"{sample.input}\n\nReasoning:\n" 33 | + f"{sample.metadata['reasoning']}\n\n" 34 | + f"ANSWER: {sample.target}" 35 | ) 36 | 37 | 38 | # setup for problem + instructions for providing answer 39 | MATH_PROMPT_TEMPLATE = """ 40 | Solve the following math problem step by step. The last line of your response should be of the form "ANSWER: $ANSWER" (without quotes) where $ANSWER is the answer to the problem. 41 | 42 | {prompt} 43 | 44 | Remember to put your answer on its own line at the end in the form "ANSWER: $ANSWER" (without quotes) where $ANSWER is the answer to the problem, and you do not need to use a \\boxed command. 
45 | 46 | Reasoning: 47 | """.strip() 48 | 49 | 50 | @task 51 | def gsm8k(fewshot=10, fewshot_seed=42): 52 | # build plan dynamically (may or may not be doing fewshot) 53 | plan = [prompt_template(MATH_PROMPT_TEMPLATE), generate()] 54 | if fewshot: 55 | fewshots = hf_dataset( 56 | path="gsm8k", 57 | data_dir="main", 58 | split="train", 59 | sample_fields=record_to_sample, 60 | shuffle=True, 61 | seed=fewshot_seed, 62 | limit=fewshot, 63 | ) 64 | plan.insert( 65 | 0, 66 | system_message( 67 | "\n\n".join([sample_to_fewshot(sample) for sample in fewshots]) 68 | ), 69 | ) 70 | 71 | # define task 72 | return Task( 73 | dataset=hf_dataset( 74 | path="gsm8k", 75 | data_dir="main", 76 | split="test", 77 | sample_fields=record_to_sample, 78 | ), 79 | plan=plan, 80 | scorer=match(numeric=True), 81 | ) 82 | -------------------------------------------------------------------------------- /benchmarks/mathematics.py: -------------------------------------------------------------------------------- 1 | """ 2 | Measuring Mathematical Problem Solving With the MATH Dataset 3 | 4 | Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, 5 | Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt 6 | https://arxiv.org/abs/2103.03874 7 | 8 | Based on: https://github.com/openai/simple-evals/blob/main/math_eval.py 9 | """ 10 | 11 | import re 12 | 13 | from inspect_ai import Task, task 14 | from inspect_ai.dataset import FieldSpec, csv_dataset 15 | from inspect_ai.model import GenerateConfig, get_model 16 | from inspect_ai.scorer import ( 17 | CORRECT, 18 | INCORRECT, 19 | AnswerPattern, 20 | Score, 21 | Target, 22 | accuracy, 23 | bootstrap_std, 24 | scorer, 25 | ) 26 | from inspect_ai.solver import TaskState, generate, prompt_template 27 | 28 | # setup for problem + instructions for providing answer 29 | PROMPT_TEMPLATE = """ 30 | Solve the following math problem step by step. The last line of your response should be of the form "ANSWER: $ANSWER" (without quotes) where $ANSWER is the answer to the problem. 31 | 32 | {prompt} 33 | 34 | Remember to put your answer on its own line at the end in the form "ANSWER: $ANSWER" (without quotes) where $ANSWER is the answer to the problem, and you do not need to use a \\boxed command. 
35 | """.strip() 36 | 37 | 38 | @task 39 | def math(shuffle=True): 40 | return Task( 41 | dataset=csv_dataset( 42 | csv_file="datasets/math_test.csv", 43 | sample_fields=FieldSpec(input="Question", target="Answer"), 44 | shuffle=shuffle, 45 | ), 46 | plan=[ 47 | prompt_template(PROMPT_TEMPLATE), 48 | generate(), 49 | ], 50 | scorer=expression_equivalance(), 51 | config=GenerateConfig(temperature=0.5), 52 | ) 53 | 54 | 55 | @scorer(metrics=[accuracy(), bootstrap_std()]) 56 | def expression_equivalance(): 57 | async def score(state: TaskState, target: Target): 58 | # extract answer 59 | match = re.search(AnswerPattern.LINE, state.output.completion) 60 | if match: 61 | # ask the model to judge equivalance 62 | answer = match.group(1) 63 | prompt = EQUIVALANCE_TEMPLATE % ( 64 | {"expression1": target.text, "expression2": answer} 65 | ) 66 | result = await get_model().generate(prompt) 67 | 68 | # return the score 69 | correct = result.completion.lower() == "yes" 70 | return Score( 71 | value=CORRECT if correct else INCORRECT, 72 | answer=answer, 73 | explanation=state.output.completion, 74 | ) 75 | else: 76 | return Score( 77 | value=INCORRECT, 78 | explanation="Answer not found in model output: " 79 | + f"{state.output.completion}", 80 | ) 81 | 82 | return score 83 | 84 | 85 | EQUIVALANCE_TEMPLATE = r""" 86 | Look at the following two expressions (answers to a math problem) and judge whether they are equivalent. Only perform trivial simplifications 87 | 88 | Examples: 89 | 90 | Expression 1: $2x+3$ 91 | Expression 2: $3+2x$ 92 | 93 | Yes 94 | 95 | Expression 1: 3/2 96 | Expression 2: 1.5 97 | 98 | Yes 99 | 100 | Expression 1: $x^2+2x+1$ 101 | Expression 2: $y^2+2y+1$ 102 | 103 | No 104 | 105 | Expression 1: $x^2+2x+1$ 106 | Expression 2: $(x+1)^2$ 107 | 108 | Yes 109 | 110 | Expression 1: 3245/5 111 | Expression 2: 649 112 | 113 | No 114 | (these are actually equal, don't mark them equivalent if you need to 115 | do nontrivial simplifications) 116 | 117 | Expression 1: 2/(-3) 118 | Expression 2: -2/3 119 | 120 | Yes 121 | (trivial simplifications are allowed) 122 | 123 | Expression 1: 72 degrees 124 | Expression 2: 72 125 | 126 | Yes 127 | (give benefit of the doubt to units) 128 | 129 | Expression 1: 64 130 | Expression 2: 64 square feet 131 | 132 | Yes 133 | (give benefit of the doubt to units) 134 | 135 | --- 136 | 137 | YOUR TASK 138 | 139 | 140 | Respond with only "Yes" or "No" (without quotes). Do not include a rationale. 
141 | 142 | Expression 1: %(expression1)s 143 | Expression 2: %(expression2)s 144 | """.strip() 145 | -------------------------------------------------------------------------------- /benchmarks/mmlu.py: -------------------------------------------------------------------------------- 1 | """ 2 | Measuring Massive Multitask Language Understanding 3 | 4 | Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, 5 | Mantas Mazeika, Dawn Song, Jacob Steinhardt 6 | https://arxiv.org/abs/2009.03300 7 | 8 | Based on: https://github.com/openai/simple-evals/blob/main/mmlu_eval.py 9 | 10 | # eval all subjects w/ 500 randomly selected samples 11 | inspect eval mmlu.py@mmlu --limit 500 12 | 13 | # add chain of thought 14 | inspect eval mmlu.py@mmlu --limit 500 -T cot=true 15 | 16 | # eval selected subjects 17 | inspect eval mmlu.py@mmlu -T subjects=anatomy 18 | inspect eval mmlu.py@mmlu -T subjects=astronomy 19 | inspect eval mmlu.py@mmlu -T subjects=anatomy,astronomy 20 | """ 21 | 22 | from inspect_ai import Task, task 23 | from inspect_ai.dataset import Sample, csv_dataset 24 | from inspect_ai.model import GenerateConfig 25 | from inspect_ai.scorer import answer 26 | from inspect_ai.solver import multiple_choice 27 | 28 | 29 | # map records to inspect sample 30 | def record_to_sample(record): 31 | return Sample( 32 | input=record["Question"], 33 | choices=[ 34 | str(record["A"]), 35 | str(record["B"]), 36 | str(record["C"]), 37 | str(record["D"]), 38 | ], 39 | target=record["Answer"], 40 | metadata={"subject": record["Subject"]}, 41 | ) 42 | 43 | 44 | # read dataset globally so it can be shared by all of the tasks 45 | # (shuffle so that --limit draws from multiple subjects) 46 | dataset = csv_dataset( 47 | csv_file="datasets/mmlu.csv", 48 | sample_fields=record_to_sample, 49 | shuffle=True, 50 | ) 51 | 52 | 53 | @task 54 | def mmlu(subjects=[], cot=False): 55 | # filter dataset if requested 56 | subjects = subjects if isinstance(subjects, list) else [subjects] 57 | if len(subjects) > 0: 58 | task_dataset = dataset.filter( 59 | name=f"{dataset.name}-{'-'.join(subjects)}", 60 | predicate=lambda sample: sample.metadata["subject"] in subjects, 61 | ) 62 | else: 63 | task_dataset = dataset 64 | 65 | # return task 66 | return Task( 67 | dataset=task_dataset, 68 | plan=multiple_choice(cot=cot), 69 | scorer=answer("letter"), 70 | config=GenerateConfig(temperature=0.5), 71 | ) 72 | -------------------------------------------------------------------------------- /honeycomb/README.md: -------------------------------------------------------------------------------- 1 | ## Honeycomb Query Evals 2 | 3 | ### Overview 4 | 5 | Evals for for Honeycomb Natural Langaguge Query generator from the [Fine Tuning LLMs](https://maven.com/parlance-labs/fine-tuning) course. Related notebooks from the course can be found at . 6 | 7 | The [queries.csv](queries.csv) dataset contains \~ 2,300 example queries (along with per-query column schemas generated offline via RAG). There are two scoring methods supported 8 | (corresponding to the two @task definitions below): 9 | 10 | 1. @validate - score using the validity checker from the course (utils.py) 11 | 2. @critique - score using the critique prompt from the course (critique.txt) 12 | 13 | ### Eval Script 14 | 15 | Use the `inspect eval` command with the eval script ([queries.py](queries.py)) to run evals on a variety of models. 
16 | 17 | To evaluate all of the queries using both scorers on Claude Opus: 18 | 19 | ```bash 20 | $ inspect eval queries.py --model anthropic/claude-3-opus-20240229 21 | ``` 22 | 23 | To evaluate a random subset of 200 queries using both scorers on GPT 4 Turbo: 24 | 25 | ```bash 26 | $ inspect eval queries.py --model openai/gpt-4-turbo --limit 200 27 | ``` 28 | 29 | To use the validate scorer only with a local Google Gemma 2B (via HF): 30 | 31 | ```bash 32 | $ inspect eval queries.py@validate --model hf/google/gemma-2b 33 | ``` 34 | 35 | To use the critique scorer only with a local Ollma Lllama3 model 36 | 37 | ```bash 38 | $ inspect eval queries.py@critique --model ollma/llama3 39 | ``` 40 | 41 | See `inspect eval --help` for details on all available options. 42 | 43 | ### Eval Notebook 44 | 45 | There is a also a notebook version of the evals ([queries.ipynb](queries.ipynb)). Rather than using `inspect eval` as it done with the script, the notebook uses Inspect's `eval()` function to execute evals within the notebok runtime. -------------------------------------------------------------------------------- /honeycomb/critique.txt: -------------------------------------------------------------------------------- 1 | ## Background 2 | 3 | Honeycomb is an observability platform that allows you to write queries to inspect trace data. 4 | The specification of the Honeycomb query language is as follows: 5 | 6 | QUERY SPEC: 7 | All top-level keys are optional. 8 | 9 | ```json 10 | "calculations":[ 11 | // ops: COUNT, CONCURRENCY, COUNT_DISTINCT, HEATMAP, SUM, AVG, MAX, MIN, P001, P01, P05, P10, P25, P50, P75, P90, P95, P99, P999, RATE_AVG, RATE_SUM, RATE_MAX 12 | {"op": "COUNT"},// COUNT and CONCURRENCY are just op 13 | {"op": "HEATMAP", "column": "name"} 14 | ], 15 | "filters":[ 16 | // ops: =, !=, >, >=, <, <=, starts-with, does-not-start-with, exists, does-not-exist, contains, does-not-contain, in, not-in 17 | {"column": "name", "op": "exists"}, // exists and does-not-exist ops only have column 18 | {"column": "name", "op": "=", "value": "something"} 19 | ], 20 | "filter_combination": "AND", // AND or OR 21 | "breakdowns":[ 22 | // columns in COLUMNS 23 | "column1","column2" 24 | ], 25 | "orders":[ 26 | // HEATMAP not allowed 27 | // Must come from breakdowns or calculations 28 | {"op": "op_in_calculation", "column": "column_in_calculation", "order": "ascending"}, 29 | {"op": "COUNT", "order": "descending"}, // COUNT and CONCURRENCY have no column 30 | {"column": "column1", "order": "descending"}, 31 | ], 32 | "havings":[ 33 | // HEATMAP not allowed 34 | {"calculate_op": "op_in_calculation", "column": "name", "op": "OPNAME", "value": 100}, 35 | {"calculate_op": "COUNT", "op": ">", "value": 10}, // COUNT and CONCURRENCY have no column 36 | ], 37 | "time_range": 7200, // Relative time range in seconds. 38 | "start_time": 1234567890, // UNIX timestamp 39 | "end_time": 1234567890, // UNIX timestamp 40 | ``` 41 | 42 | QUERY SPEC TIPS: 43 | 44 | COUNT counts events/spans. COUNT_DISTINCT counts unique values of columns. 45 | HEATMAP shows value distributions. 46 | trace.parent_id does-not-exist shows root span of a trace. Use that to count requests and traces. 47 | name is a span or span event. 48 | parent_name is the name of a span that created a span event. 49 | 50 | When the user input is asking about a time range (such as "yesterday" or "since last week"), 51 | always use the time_range, start_time, and end_time fields. 
time_range 52 | is relative and can be combined with either start_time or end_time but not both. 53 | 54 | --- 55 | 56 | ## Instructions 57 | 58 | You are an EXPERT query evaluator that has advanced capabilities to judge if a query good or not. You understand the nuances of the Honeycomb query language, including what is likely to be most useful from an analytics perspective. 59 | You are given the following three inputs: (1) NLQ, (2) A list of candidate columns (COLUMNS) that are allowed to be in the query, and (3) The query (QUERY). Your job is to evaluate and critique the QUERY relative to the provided NLQ and COLUMNS. 60 | 61 | The critiques must be provided in the same json format as provided in the examples below: 62 | 63 | --- 64 | 65 | NLQ: show me slowest trace 66 | 67 | COLUMNS: ['trace.trace_id', 'trace.span_id', 'trace.parent_id', 'duration_ms', 'name', 'faas.instance', 'faas.id', 'filter', 'telemetry.instrumentation_library', 'library.name', 'faas.name', 'span.kind', 'type', 'http.wrote_bytes', 'http.url', 'service.name', 'http.flavor', 'span.num_links', 'span.num_events', 'net.host.name', 'library.version', 'http.scheme', 'net.peer.name', 'http.method', 'meta.signal_type', 'cloud.region', 'cloud.provider', 'faas.version', 'http.read_bytes', 'http.user_agent', 'cloud.account.id', 'organization_id', 'cloud.platform', 'net.sock.peer.addr', 'page_size', 'net.sock.peer.port', 'page_token', 'status_code', 'http.client_ip', 'http.status_code', 'http.route'] 68 | 69 | QUERY: {"calculations":[{"column":"duration_ms","op":"MAX"}],"filters":[{"column":"trace.parent_id","op":"does-not-exist","join_column":""}],"orders":[{"column":"duration_ms","op":"MAX","order":"descending"}],"limit":1,"time_range":7200} 70 | 71 | {"critique": "The response is nearly correct, as it is looking for the slowest trace by using MAX(duration_ms) and ordering by duration_ms in descending order, which is appropriate for finding the 'slowest' trace. Additionally, filtering with trace.parent_id does-not-exist correctly identifies root spans. However, the query should be grouping by trace.trace_id to actually show the slowest trace. 
Without that grouping, the query only shows the MAX(duration_ms) measurement over time, irrespective of which trace is responsible for that measurement.", "outcome": "bad"} 72 | 73 | --- 74 | 75 | NLQ: slow requests 76 | 77 | COLUMNS: ['duration_ms', 'deprecatedSource', 'error', 'severity', 'flags', 'rpc.method', 'deprecatedCount', 'ip', 'rpc.service', 'interface', 'metadata', 'reason', 'container.cpu.time', 'deprecatedLastTimestamp', 'name', 'regarding', 'namespace', 'message.type', 'watch-type', 'severity_text', 'apiVersion', 'kind', 'service.name', 'rpc.system', 'container.uptime', 'type', 'uid', 'span.kind', 'db.name', 'note', 'body', 'resourceVersion', 'library.version', 'db.statement', 'status_code', 'event.domain', 'managedFields', 'deprecatedFirstTimestamp', 'container.memory.page_faults', 'event.name', 'container.memory.available', 'k8s.container.cpu_request', 'container.id', 'library.name', 'container.memory.rss', 'message.uncompressed_size', 'rpc.grpc.status_code', 'net.peer.port', 'k8s.container.memory_request', 'container.memory.working_set', 'trace.parent_id', 'http.route', 'service_name', ''] 78 | 79 | QUERY: {"breakdowns":["name"],"calculations":[{"column":"duration_ms","op":"HEATMAP"},{"column":"duration_ms","op":"MAX"}],"filters":[{"column":"trace.parent_id","op":"does-not-exist","join_column":""}],"orders":[{"column":"duration_ms","op":"MAX","order":"descending"}],"time_range":7200} 80 | 81 | {"critique": "This response is adequate. The inclusion of a HEATMAP and a MAX calculation on the duration_ms column is appropriate for identifying slow requests. Correctly including a filter on 'trace.parent_id' with 'does-not-exist' ensures that only root spans (requests) are considered. Ordering by the max duration_ms in descending order is sensible to quickly highlight slower requests. The use of 'name' in a breakdown is good, because it groups results into span names. This can help you quickly narrow down behavior to a particular part of a system. It might be more insightful to break down by 'http.route', because 'http.route' will also break down more broadly by API call. Having both 'http.route' and 'name' included can give both a broader and narrower breadkdown of the data, and both views can be helpful at different times.. 
Overall, the query aligns well with the NLQ but could benefit from a more specific breakdown column.", "outcome": "good"} 82 | 83 | --- 84 | NLQ: records of clientid 5fcedb77-267c-44e3-8ea4-4d21786b2618 85 | 86 | COLUMNS: ['clientIds', 'sys-client-id', 'principal.clientId', 'client.id', 'clientName', 'metadata.clientId', 'payload.clientId', 'clientExternalId', 'informationReportingEvent.clientId', 'principalClientId', 'metadata.clientTxId', 'targetClientId', 'e2eId', 'client', 'transactionIds', 'clientId', 'client.external_id', 'metadata.clientName', 'sys-service-provider-id', 'transactionId', 'paymentRequestIds.identifying', 'serviceProviderId', 'service_provider.id', 'externalId', 'traceId', 'event.id', 'accounts-deleted', 'jobExecutionId', 'accounts-updated', 'syncContext.clientDto.id', '$metadata.requestId', 'subscriptionId', 'paymentExternalId', 'activeIntegrationIds', 'fileId', 'tracerId', '$metadata.extendedRequestId', 'moneyTransferId', 'paymentRequestIds', 'vendor.id', 'triggeredNumberOfClients', 'payload.eventId', 'existedTransactionIds', 'completedTransactionIds', 'metadata.paymentRequestExternalId', 'syncContext.clientDto.externalId', 'createdTransactionIds', 'transactionsDeletionServiceProviderId', 'idempotency_key', 'RequestId', 'status', 'error', 'duration_ms', 'http.route', 'name', 'http_path', 'trace.parent_id', 'http.status_code', 'meta.req.query', 'service_name', 'request_time'] 87 | 88 | QUERY: {"filters":[{"column":"client.id","op":"=","value":"5fcedb77-267c-44e3-8ea4-4d21786b2618"}]} 89 | 90 | {"critique": "Poor response. While the filter by client.id is correct to find records for clientid '5fcedb77-267c-44e3-8ea4-4d21786b2618', the query lacks breakdowns, calculations, or order to make the results more insightful. It would be more beneficial to include calculations such as COUNT to understand the volume of records.", "outcome": "bad"} 91 | 92 | --- 93 | 94 | NLQ: Find the traces for the slowest requests in the past 3 hours where http.host = internal-pcmws-alkqa-routing-black-1164727359.us-east-1.elb.amazonaws.com 95 | 96 | COLUMNS: ['trace.trace_id', 'http.host', 'exception.stacktrace', 'pcm.drawtraffic', 'SLI_geo-95-percentile-latency', 'pcm.server.worker-load-balancer', 'pcm.responsetime', 'aws.requestId', 'http.query-length', 'http.response-uri', 'trace.span_id', 'net.host.name', 'http-500-error-percent', 'http.url', 'net.host.port', 'pcm.server.worker-app-pool', 'http.request-headers', 'duration_ms', 'pcm.restful', 'http.response-version', 'trace.parent_id', 'pcm.initialmaprequest', 'http.response-headers', 'http.method', 'aws.region', 'service.instance.id', 'http.response-method', 'wcf.channel.path', 'pcm.server', 'pcm.userip', 'http.version', 'pcm.server-name', 'pcm.server-color', 'aws.service', 'pcm.stopcount', 'rpc.method', 'pcm.splcs', 'peer.service', 'wcf.channel.scheme', 'service.name', 'http.response_content_length', 'pcm.foserver', 'SLI_99.9-percent-without-500-error', 'pcm.httpcode', 'pcm.errortext', 'http.status-description', 'pcm.login', 'pcm.apikey', 'rpc.service', 'pcm.date', 'db.statement', 'http.route', 'parent_name', 'name', 'http.status_code', 'exception.message', 'error'] 97 | 98 | QUERY: {"breakdowns":["trace.trace_id"],"calculations":[{"column":"duration_ms","op":"MAX"}],"filters":[{"column":"http.host","op":"=","value":"internal-pcmws-alkqa-routing-black-1164727359.us-east-1.elb.amazonaws.com","join_column":""}],"filter_combination":"AND","orders":[{"column":"duration_ms","op":"MAX","order":"descending"}],"time_range":10800} 99 | 100 | 
{"critique": "The query correctly uses time_range to indicate the past 3 hours and filters by the specific http.host value. However, it should also include a filter where trace.parent_id does-not-exist to identify the root spans or initial requests, which represent the start of a trace. Without this, the query might include intermediate spans that are not the start of the trace. The breakdown by trace.trace_id is correct, and ordering by the max duration is appropriate to find the slowest requests. It could be improved by including a calculation of COUNT to gauge the number of slow traces for each trace id, making it thorough for understanding general patterns or anomalies.", "outcome": "bad"} 101 | 102 | --- 103 | 104 | NLQ: show me traces where ip is 10.0.2.90 105 | 106 | COLUMNS: ['net.host.ip', 'trace.trace_id', 'net.peer.ip', 'net.transport', 'net.host.name', 'http.client_ip', 'http.host', 'host.name', 'trace.span_id', 'net.host.port', 'k8s.pod.ip', 'sli_route_latency_1500', 'ip', 'net.peer.port', 'trace.parent_id', 'sli_route_latency', 'http.server_name', 'service.name', 'service.instance.id', 'rpc.user_agent', 'rpc.service', 'http.route', 'db.name', 'http.url', 'area', 'requestID', 'service.version', 'http.user_agent', 'http.target', 'host.arch', 'orderid', 'process.pid', 'rpc.system', 'os.type', 'rpc.method', 'otlp.exporter.exported', 'runtime.jvm.gc.time', 'http.scheme', 'thread.name', 'name', 'runtime.jvm.memory.pool', 'message.type', 'runtime.jvm.memory.area', 'thread.id', 'enduser.id', 'error', 'userid', 'http.flavor', 'pool', 'telemetry.instrumentation_library', 'http.status_code', 'service_name', 'parent_name', 'duration_ms', 'db.statement', 'exception.message'] 107 | 108 | QUERY: {"breakdowns":["trace.trace_id"],"calculations":[{"op":"COUNT"}],"filters":[{"column":"net.host.ip","op":"=","value":"10.0.2.90"}]} 109 | 110 | {"critique": "The query satisfies the NLQ. It correctly filters for traces with an IP address of 10.0.2.90 and counts the occurrences of those traces, grouped by trace.trace_id, which is what the NLQ requests. 
The response is good as it meets the requirement of showing traces from a specific IP address without additional complexities.", "outcome": "good"} 111 | 112 | --- 113 | 114 | NLQ: count of moderation.flags.create api calls 115 | 116 | COLUMNS: ['response_flags', 'job_counts_MULTI_INDEX_CREATE', 'http_count', 'found_count', 'is_creation', 'is_create', 'job_counts_MULTI_INDEX_BATCH_CREATE', 'event_count', 'job_counts_MULTI_INDEX_CREATE_BE', 'job_counts_SEND_SLACK_UNFURLS_UPDATE', 'queued_count', 'keys_count', 'agenda_req_count', 'flags', 'job_counts_SEND_SLACK_DOC_THREAD_UPDATE', 'update_count', 'log_count', 'old_activity_badge_count', 'total_message_count', 'job_counts_SEND_SLACK_USER_MENTIONS_UPDATE', 'job_counts_UNIQUE_INDEX_CREATE', 'request_attempt_count', 'job_counts_ADD_TO_SLACK_DOCS_COMPANY', 'args.thread_counts_by_channel', 'job_counts_FINALIZE_EDIT_HISTORY', 'mc_queries_count', 'hitCount', 'api:warning', 'job_counts_MULTI_INDEX_BATCH_CREATE_BE', 'file_count', 'job_counts_UPDATE_THREAD_UNREAD_COUNTS_EF4', 'api:method', 'executing_count', 'job_counts_SEND_SLACK_FILES_UPDATE', 'failed_count', 'mpdm_count', 'count', 'grpc_req_count', 'job_counts_UPDATE_SLACK_REFERENCES_IN_THREAD', 'mc_queries_multi_count', 'channel_ids_count', 'job_counts_UPDATE_THREAD_PHRASES_ON_EDIT', 'total_participant_count', 'job_counts_MULTI_INDEX_DELETE', 'job_counts_UPDATE_THREAD_UNREAD_COUNTS_BE', 'kafkagate_count', 'authorized_apps_count', 'redirect_count', 'apps_count', 'bulk_generate_calls', 'duration_ms', 'slath', 'name', 'execution_context', 'trace_parent_id', 'db.statement', 'namespace', 'kind', 'error', 'message', 'reason', 'http.status_code', 'service_name', 'quip:call_path'] 117 | 118 | QUERY: {"breakdowns":["slath"],"calculations":[{"op":"COUNT"}],"filters":[{"column":"name","op":"=","value":"moderation.flags.create","join_column":""}],"time_range":7200} 119 | 120 | {"critique": "The query groups by 'slath' which is extraneous information - there is nothing in NLQ that indicates 'slath' is related to NLQ. Additionally, the filter clause should check for the 'contains' op instead of '='. While it is likely good to check for 'name' as that represents an operation name, which means it will likely have data NLQ is referring to, there is no guarantee that the actual name of the operation is exactly what was input in NLQ. 
Furthermore, the result should include 'name' in the breakdowns so that the results show the actual values of 'name'.", "outcome": "bad"} 121 | 122 | --- 123 | 124 | NLQ: What kind of events are contributing towards my rate limits 125 | 126 | COLUMNS: ['up', 'k8s.pod.memory.major_page_faults', 'k8s.pod.memory_limit_utilization', 'k8s.pod.cpu_limit_utilization', 'k8s.node.memory.page_faults', 'k8s.container.cpu_limit', 'k8s.node.memory.major_page_faults', 'k8s.pod.memory.page_faults', 'k8s.container.memory_limit', 'k8s.node.filesystem.usage', 'otelcol_processor_batch_batch_send_size.avg', 'k8s.pod.memory.rss', 'k8s.pod.filesystem.usage', 'k8s.node.network.io.transmit', 'prometheus.io/port', 'scrape_duration_seconds', 'eks.amazonaws.com/compute-type', 'otelcol_processor_batch_timeout_trigger_send', 'otelcol_receiver_refused_metric_points', 'k8s.node.memory.rss', 'otelcol_processor_batch_batch_send_size.count', 'eks.amazonaws.com/component', 'otelcol_processor_dropped_metric_points', 'k8s.pod.network.io.transmit', 'otelcol_processor_refused_metric_points', 'job.runner.gitlab.com/url', 'scrape_samples_scraped', 'prometheus.io/scrape', 'k8s.pod.memory.usage', 'k8s.node.cpu.time', 'otelcol_processor_batch_batch_send_size.p50', 'k8s.node.network.io.receive', 'k8s.node.filesystem.capacity', 'k8s.node.memory.usage', 'otelcol_processor_dropped_log_records', 'k8s.container.restarts', 'k8s-app', 'otelcol_exporter_queue_capacity', 'k8s.pod.memory_request_utilization', 'k8s.pod.network.errors.transmit', 'otelcol_processor_batch_batch_send_size.max', 'otelcol_receiver_accepted_metric_points', 'otelcol_exporter_sent_metric_points', 'k8s.pod.filesystem.capacity', 'project.runner.gitlab.com/id', 'processor', 'otelcol_scraper_errored_metric_points', 'otelcol_processor_batch_batch_send_size.p05', 'interface', 'otelcol_process_memory_rss', 'net.host.name', 'name', 'trace.parent_id', 'duration_ms', 'http.route', 'service_name'] 127 | 128 | QUERY: {"breakdowns":["name"],"calculations":[{"column":"otelcol_processor_refused_metric_points","op":"SUM"},{"column":"otelcol_processor_dropped_metric_points","op":"SUM"},{"column":"otelcol_processor_dropped_log_records","op":"SUM"}],"filters":[{"column":"otelcol_processor_refused_metric_points","op":">","value":0},{"column":"otelcol_processor_dropped_metric_points","op":">","value":0},{"column":"otelcol_processor_dropped_log_records","op":">","value":0}]} 129 | 130 | {"critique": "Good response. The query is grouping by 'name' to identify different event types and summing up refused, dropped metric points and dropped log records, which are potential indicators of hitting rate limits. Furthermore, it properly filters each of those to be greater than zero to only include relevant events. 
This helps to understand what events are contributing towards rate limits being exceeded.", "outcome": "good"} 131 | 132 | --- 133 | 134 | NLQ: count window-hash where window-hash exists per hour 135 | 136 | COLUMNS: ['window-hash', 'window_hash_get', 'prop-window_hash', 'fingerprint_window_stack_traces_hash', 'prop-window_hash_get', 'fingerprint_window_overloaded_hash', 'fingerprint_window_document_keys_hash', 'fingerprint_window_keys_hash', 'fingerprint_window_client_tools_hash', 'fingerprint_window_messages_hash', 'headers-hash', 'fingerprint_time_zone_anomaly_hash', 'best_hash_get', 'w_count', 'legacy-hash', 'fingerprint_window_stack_traces', 'fingerprint_time_zone_hash', 'siteverify_nojs-hdict_count', 'fingerprint_time_zone_region_hash', 'fingerprint_window_missing_features_hash', 'fingerprint_timing_hash', 'fetch-hash', 'client_timestamp_check', 'canvas-hash', 'fingerprint_unique_keys_stable_hash_hash', 'fingerprint_window_document_keys', 'fingerprint_window_keys', 'task_request_count', 'fingerprint_web_worker_stack_size_hash', 'fingerprint_window_overloaded', 'fingerprint_window_version', 'fingerprint_time_zone_anomaly', 'token_rate_1m_tarpit', 'r_count', 'art_up__asn_window', 'fingerprint_unique_keys_stable_hash', 'sk_wgl_canv_hash', 'fingerprint_dom_rect_decimal_hash', 'fingerprint_window_messages', 'webrtc-hash', 'legacy_hash_get', 'fingerprint_window_sus_chrome_key', 'md_hash', 'webgl_hash_get', 'fingerprint_unique_keys_hash', 'checkcaptcha-elapsed-diff', 'fingerprint_analytics_hash', 'fingerprint_label_performance_start_bucket', 'art__asn_window', 'fingerprint_unique_keys_meta_hash_hash', 'getcaptcha-host', 'parent_name', 'exception.message', 'service_name', 'base_url', 'app.exception_string', 'request.path', 'response.status_code', 'duration_ms', 'db.statement', 'score_reason', 'sk', 's3_upload_uri', 'trace.parent_id', 'name', 'art_ms', 'shard_id'] 137 | 138 | QUERY: {"breakdowns":["window-hash"],"calculations":[{"op":"COUNT"}],"filters":[{"column":"window-hash","op":"exists","join_column":""}],"filter_combination":"AND","orders":[{"op":"COUNT","order":"descending"}],"time_range":3600} 139 | 140 | {"critique": "The response is mostly good, but misinterprets the time range and granularity settings. The QUERY correctly applies a COUNT calculation on events where 'window-hash' exists, which matches the NLQ requirement to count 'window-hash'. It breaks down by the 'window-hash' column for granularity and orders the results in descending order of the count, which is useful for identifying the most frequent 'window-hash' values. However, when NLQ says "per hour", it is likely seeking to show results with a granularity of 1 hour. 
As per query rules, this would require the 'time_range' to be at least '36000', which is 10 times the granularity value in seconds.", "outcome": "bad"} 141 | 142 | --- 143 | 144 | NLQ: latency distribution by status code 145 | 146 | COLUMNS: ['status_code', 'severity_code', 'status_message', 'rpc.grpc.status_code', 'duration_ms', 'container.uptime', 'rpc.method', 'message.type', 'rpc.service', 'rpc.system', 'severity_text', 'apiVersion', 'opencensus.resourcetype', 'k8s.node.network.io.transmit', 'meta.signal_type', 'ip', 'container.memory.major_page_faults', 'container.memory.page_faults', 'error', 'container.cpu.utilization', 'library.version', 'net.peer.port', 'container.memory.available', 'deprecatedLastTimestamp', 'container.memory.working_set', 'severity', 'k8s.pod.network.io.transmit', 'container.cpu.time', 'k8s.node.uptime', 'k8s.pod.uptime', 'container.filesystem.usage', 'container.filesystem.available', 'trace.trace_id', 'interface', 'deprecatedFirstTimestamp', 'deprecatedCount', 'span.kind', 'resourceVersion', 'k8s.node.network.errors.transmit', 'metadata', 'event.domain', 'trace.span_id', 'container.memory.usage', 'Kubernetes'] 147 | 148 | QUERY: {"breakdowns":["status_code"],"calculations":[{"column":"duration_ms","op":"HEATMAP"},{"column":"duration_ms","op":"P95"}],"filters":[{"column":"status_code","op":"exists","join_column":""}],"time_range":7200} 149 | 150 | {"critique": "Almost a good response. Having a HEATMAP(duration_ms) is excellent for visualizing the distribution of response times, and P95 gives a useful statistical measure of latency under typical conditions. However, filtering and breaking down by 'status_code' is unfortunately incorrect, as this does not refer to any system status code, but rather a span status. Instead, 'rpc.grpc.status_code' should be used instead. If there were an 'http.status_code' column, or an 'http.response.status_code' column, that could be a good grouping too.", "outcome": "bad"} 151 | 152 | --- 153 | 154 | NLQ: events by name 155 | 156 | COLUMNS: ['app.event_name', 'app.event_id', 'name', 'span.num_events', 'db.name', 'library.name', 'app.subscriber_name', 'parent_name', 'process.runtime.name', 'service.name', 'app.collection', 'job.enqueued_at', 'net.peer.name', 'db.system', 'http.route', 'error', 'job.created_at', 'exception.stacktrace', 'http.target', 'telemetry.sdk.name', 'http.host', 'messaging.system', 'db.statement', 'span.kind', 'type', 'http.method', 'service.version', 'messaging.destination', 'exception.message', 'exception.type', 'app.queued_times', 'process.pid', 'service.ownership', 'process.command', 'messaging.operation', 'duration_ms', 'library.version', 'meta.refinery.reason', 'messaging.protocol', 'telemetry.sdk.language', 'status_message', 'job.latency_ms', 'meta.annotation_type', 'messaging.rabbitmq.routing_key', 'deployment.environment', 'http.scheme', 'trace.link.trace_id', 'process.runtime.description', 'telemetry.instrumentation_library', 'trace.trace_id', 'trace.parent_id', 'http.status_code'] 157 | 158 | QUERY: {"breakdowns":["app.event_name"],"calculations":[{"op":"COUNT"}],"filters":[{"column":"app.event_name","op":"=","value":"Event drop","join_column":""}],"filter_combination":"AND","time_range":7200} 159 | 160 | {"critique": "This is a bad response. The NLQ asks for events by name, which implies a need to count occurrences of events broken down by their name. The query correctly uses COUNT to aggregate events and breaks it down by app.event_name. 
However, it also includes a filter for a specific app.event_name value, which is unrelated to what NLQ asks for.", "outcome": "bad"} 161 | 162 | --- 163 | 164 | NLQ: /api/external_storage/gdrive/notify 165 | 166 | COLUMNS: ['document.api_status.dropbox_download_url', 'http.payment_endpoints_no_errors', 'http.document_upload_successful', 'http.document_upload_success_rate', 'http.is_document_upload', 'http.auth_endpoints_no_errors', 'status_message', 'document_processing_event.extension', 'document.s3_dir', 'http.space_endpoints_successful', 'http.grpc_requests_no_errors', 'document_processing_event.upload_source', 'http.is_payment', 'rpc.method', 'graphql.document', 'flipper.file_requests_in_dashboard', 'link.downloadable', 'rpc.service', 'messaging.operation', 'http.payment_success_rate', 'document.upload_error_rate', 'messaging.destination', 'link.re_signable', 'message', 'http.is_auth_endpoint', 'flipper.space_analytics', 'http.url', 'flipper.space_analytics_backfill', 'http.space_endpoints_success_rate', 'http.presentation_view_endpoints_successful', 'net.transport', 'messaging.system', 'link.type', 'document_processing_event.delayed_job_queue', 'request.id', 'http.grpc_requests_success_rate', 'http.elaine_web_request', 'http.link_successful_requests_rate', 'http.esign_successful_requests', 'link.watermark_enabled', 'service.component', 'user.login_success', 'document_processing_event.client', 'flipper.space_audit_log', 'status_code', 'document.document_group_id', 'http.link_successful_requests', 'http.target', 'link.recipient_kind', 'document.upload_unknown_error_rate', 'duration_ms', 'service.name', 'http.route', 'db.statement', 'error', 'trace.parent_id', 'http.status_code', 'name', 'user.signup_referer_path', 'exception.message', 'parent_name'] 167 | 168 | QUERY: {"breakdowns":["name"],"calculations":[{"column":"duration_ms","op":"HEATMAP"}],"filters":[{"column":"http.route","op":"=","value":"/api/external_storage/gdrive/notify","join_column":""}],"time_range":7200} 169 | 170 | {"critique": "The query correctly filters for the specific route '/api/external_storage/gdrive/notify'. Although it just shows a heatmap, because NLQ did not ask for anything specific, the heatmap is the best generat calculation to use. It may also be helpful to add a calculation for MAX or P95 to identify the slowest events explicitly. The breakdown by 'name' is generic and might not provide meaningful insights for this specific route unless 'name' is related to some aspect of the functionality at the route, but it is not a bad breakdown. 
It may also be helpful to filter to 'trace.parent_id' 'does-not-exist' so that it shows only requests to the endpoint, but since NLQ did not ask for requests, this isn't strictly necessary.", "outcome": "good"} 171 | 172 | --- 173 | 174 | NLQ: find the traces with the most spans 175 | 176 | COLUMNS: ['trace.span_id', 'trace.trace_id', 'trace_flags', 'trace_id', 'span_id', 'exception.stacktrace', 'trace.parent_id', 'span.num_links', 'span.num_events', 'span.kind', 'flags', 'nodejs_gc_duration_seconds.max', 'nodejs_gc_duration_seconds.avg', 'nodejs_gc_duration_seconds.p01', 'container.memory.usage.max', 'scrape_samples_post_metric_relabeling', 'container.duration', 'extensions', 'nodejs_gc_duration_seconds.sum', 'container.memory.utilized', 'nodejs_gc_duration_seconds.p001', 'nodejs_gc_duration_seconds.p99', 'scrape_samples_scraped', 'path', 'nodejs_gc_duration_seconds.p05', 'source', 'nodejs_eventloop_lag_mean_seconds', 'nodejs_eventloop_lag_max_seconds', 'nodejs_gc_duration_seconds.count', 'nodejs_gc_duration_seconds.min', 'container.cpu.onlines', 'up', 'partial_last', 'container.cpu.utilized', 'graphql.source', 'ecs.task.memory.usage.max', 'nodejs_gc_duration_seconds.p95', 'ecs.task.cpu.onlines', 'container'] 177 | 178 | QUERY: {"breakdowns":["trace.trace_id"],"calculations":[{"op":"COUNT"}],"orders":[{"op":"COUNT","order":"descending"}]} 179 | 180 | {"critique": "Bad response. The COUNT operation is incorrectly used because it does not count the number of spans per trace, but the total events within a window. Instead, the calculations should use a COUNT_DISTINCT(trace.span_id). Breaking down by 'trace.trace_id' is good, however, because that lets the user see each trace with a span count. Finally, the query should have an ordering of COUNT_DISTINCT(trace.span_id) 'desc' so that results are ordered by most spans.", "outcome": "bad"} 181 | 182 | --- 183 | 184 | NLQ: api calls that are failing 185 | 186 | COLUMNS: ['cronofy.oauth_client_notification_subscriptions.failed', 'apirequestjournal.failed_journal_reads', 'cronofy.notifier_client.api_call_status', 'auth.oauthinator.go_to.api_call_count', 'api.event.conferencing.error', 'cronofy.notifier_client.api_call_path', 'cronofy.notifier_client.api_call_ms', 'api.event.conferencing.sync.error', 'auth.oauthinator.zoom.api_call_count', 'cronofy.service_account.relinks.failed', 'auth.failure', 'cronofy.api.rate_limits.oauth_client_account_exceeded', 'cronofy.notifier_client.api_call_method', 'auth.oauthinator.ms_teams.api_call_count', 'cronofy.api.rate_limits.oauth_client_exceeded', 'cronofy.embedded_auth_token.errors', 'apirequestjournal.failed_dir_enumerations', 'indeed.put.error', 'apirequestjournal.read_dir.total_calls', 'integrations.indeed.api_response.status', 'api.free_busy.caching.hit', 'shipper.ingress.failed_open', 'error.service_unavailable_error?', 'auth.oauthinator.go_to.refresh_failure_count', 'auth.provider_error_code', 'error', 'auth.graph_access_token_attempt', 'apirequestjournal.results', 'eightbyeight.put.error', 'integrations.slack.api_response.status', 'sli.apirequestjournal.dashboard.ok', 'integrations.hubspot.api_response.status', 'integrations.api.operation', 'enterprise_connect.request_context.failure_key', 'api.params.google_event_ids', 'eightbyeight.persistent.get.error', 'auth.error_code', 'auth.graph_me_attempt', 'auth.op', 'eightbyeight.persistent_cache.get.error', 'cronofy.calendar_api', 'auth.oauthinator.zoom.refresh_failure_count', 'shared.put.error', 'filehousekeeper.delete.failures', 
'api.event.conferencing.dial_in', 'api.free_busy.total_pages', 'integrations.zendesk.api_response.status', 'apirequestjournal.index_job.total_indexing_errors', 'ms_graph.failed_delta_link_fallback', 'api.free_busy.page_events', 'name', 'shipper.ingress.full_path', 'cronofy.account_profile', 'trace.parent_id', 'http.status', 'scheduler.build_availability_query.type', 'active_job.class', 'enterprise_connect.request_context.error', 'cronofy.oauth_client', 'duration_ms', 'sync.state', 'cronofy.environment.name', 'sync.sync_type', 'http.path', 'service_name'] 187 | 188 | QUERY: {"breakdowns":["http.path],"calculations":[{"op":"COUNT"}],"filters":[{"column":"http.status","op":"exists","join_column":""},{"column":"http.status","op":"!=","value":"200","join_column":""}],"orders":[{"op":"COUNT","order":"descending"}],"time_range":7200} 189 | 190 | {"critique": "This query is too specific. The columns 'http.path' and 'http.status' are more generic columns, meaning that they capture information about more endpoints than the columns selected ('cronofy.notifier_client.api_call_status' and 'cronofy.notifier_client.api_call_path'). By using those more specific columns, the query makes an assumption that NLQ is referring to these specific columns, when there is no evidence that NLQ is doing that. Although the structure of this response is good, it should use the more generic columns 'http.status' and 'http.path' instead.", "outcome": "bad"} 191 | 192 | --- 193 | 194 | NLQ: ListWalletMultichainTokenHoldings 195 | 196 | COLUMNS: ['service_dc_2', 'service_dc_1', 'service_dc_3', 'service_dc_5', 'service_dc_4', 'success_rate_v3', 'error', 'success_rate_demo', 'duration_ms', 'name', 'dc_simple_refinery_reason', 'client-uuid', 'meta.stressed', 'go.tb', 'ip', 'span.kind', 'db.statement.fingerprint', 'db.statement', 'span.num_links', 'parent_name', 'dc_log10_duration', 'http.url', 'db.rowcount', 'meta.refinery.reason', 'span.num_events', 'host.name', 'dc_db_system_or_type', 'log10_duration_ms', 'service.name', 'type', 'meta.signal_type', 'error.object', 'trace.span_id', 'status_code', 'meta.annotation_type', 'trace.trace_id', 'meta.refinery.sample_key', 'http.method', 'db.canonical_name', 'dc_is_root_span', 'grpc.source', 'trace.parent_id', 'duration_per_rowcount', 'meta.refinery.original_sample_rate', 'go.tb.exists', 'opencensus.exporterversion', 'dc_is_success', 'meta.refinery.send_reason', 'http.status_code', 'dc_ensure_nonroot_server_span', 'http.route'] 197 | 198 | QUERY: {"breakdowns":["name"],"filters":[{"column":"name","op":"=","value":"ListWalletMultichainTokenHoldings"}],"calculations":[{"op":"COUNT"}],"orders":[{"op":"COUNT","order":"descending"}],"time_range":7200} 199 | 200 | {"critique": "The query correctly filters on the 'name' column to isolate the 'ListWalletMultichainTokenHoldings' service. Using COUNT as a calculation is appropriate to get the number of occurrences, and ordering by count descending will properly rank the results. 
It could be improved by adding a calculation that includes the 'duration_ms' column such as AVG or MAX to assess the latency, as the NLQ suggests a potential interest in performance metrics for this particular service.", "outcome": "good"} 201 | 202 | --- 203 | 204 | NLQ: overall distribution of latencies 205 | 206 | COLUMNS: ['duration_ms', 'container.uptime', 'container.cpu.time', 'event.domain', 'container.cpu.utilization', 'ip', 'regarding', 'container.memory.page_faults', 'severity', 'metadata', 'interface', 'container.memory.major_page_faults', 'span.kind', 'namespace', 'k8s.node.cpu.time', 'error', 'container.filesystem.usage', 'uid', 'container.memory.working_set', 'span.num_events', 'container.memory.usage', 'flags', 'k8s.node.uptime', 'event.name', 'k8s.pod.cpu.time', 'k8s.pod.uptime', 'container.filesystem.capacity', 'name', 'container.memory.available', 'k8s.node.network.io.transmit', 'severity_text', 'k8s.node.cpu.utilization', 'body', 'reason', 'note', 'kind', 'net.peer.port', 'k8s.pod.cpu.utilization', 'type', 'Kubernetes'] 207 | 208 | QUERY: {"calculations":[{"column":"duration_ms","op":"HEATMAP"},{"column":"duration_ms","op":"P95"}],"breakdowns":[{"namespace"},"orders":[{"op":"P95","column:"duration_ms","order":"descending"}]] 209 | 210 | {"critique": "This is a bad query. While it includes relevant calculations, there are far too many, which is an overwhelming response. Only a HEATMAP and a common aggregation like P95 should be used. Additionally, an ideal query would break down by a column like 'namespace' even though NLQ did not ask for it, because that can yield more interesting results.", "outcome": "bad"} 211 | 212 | --- 213 | 214 | NLQ: error 215 | 216 | COLUMNS: ['error', 'env', 'exception.message', 'type', 'name', 'http.error_message', 'exception.type', 'http.error_name', 'exception.stacktrace', 'status_code', 'status_message', 'span.kind', 'net.transport', 'library.name', 'tls.authorized', 'http.target', 'service.name', 'http.url', 'tls.protocol', 'duration_ms', 'process.owner', 'library.version', 'process.command', 'parent_name', 'process.pid', 'http.flavor', 'deployment.environment', 'http.method', 'next.bubble', 'http.host', 'next.route', 'net.host.ip', 'http.scheme', 'telemetry.sdk.language', 'deployment.name', 'trace.parent_id', 'tls.alpnProtocol', 'trace.span_id', 'http.response_content_length_uncompressed', 'net.host.name', 'trace.trace_id', 'net.host.port', 'http.status_code', 'http.response_content_length', 'process.command_args', 'net.peer.name', 'http.user_agent', 'telemetry.instrumentation_library', 'net.peer.port', 'process.runtime.description', 'http.route'] 217 | 218 | QUERY: {"breakdowns":["name"],"calculations":[{"op":"COUNT"}],"filters":[{"column":"error","op":"=","value":true,"join_column":""}],"orders":[{"op":"COUNT","order":"descending"}],"time_range":7200} 219 | 220 | {"critique": "Bad response. The query correctly filters by errors and uses COUNT, but it breaks down by 'error' instead of a more useful column like 'name', which is the name of an operation. 
Although NLQ is vague, it's best to be generally helpful and show more information than less.", "outcome": "bad"} 221 | 222 | --- 223 | 224 | NLQ: I would like to pull a number from a string 225 | 226 | COLUMNS: ['github.pull_requests.0.number', 'github.run_number', 'github.job.step.number', 'github.pull_requests.0.id', 'library.name', 'github.pull_requests.0.head.sha', 'github.pull_requests.0.url', 'name', 'github.pull_requests.0.base.sha', 'github.pull_requests.0.head.repo.id', 'github.pull_requests.0.head.repo.name', 'github.pull_requests.0.base.ref', 'github.job.name', 'service.name', 'github.pull_requests.0.head.ref', 'github.pull_requests.0.head.repo.url', 'github.job.id', 'github.job.step.name', 'telemetry.sdk.language', 'github.pull_requests.0.base.repo.name', 'github.job.conclusion', 'github.conclusion', 'span.kind', 'service.version', 'github.pull_requests.0.base.repo.url', 'error', 'github.pull_requests.0.base.repo.id', 'github.job.step.conclusion', 'status_code', 'span.num_events', 'service.instance.id', 'github.job.runner_name', 'github.job.run_id', 'github.workflow', 'github.html_url', 'github.run_attempt', 'github.event', 'github.job.runner_group_id', 'github.job.run_attempt', 'telemetry.sdk.name', 'github.workflow_id', 'github.head_commit.id', 'service.namespace', 'telemetry.sdk.version', 'github.author_name', 'github.head_commit.message', 'github.head_sha', 'trace.span_id', 'github.base_sha', 'github.workflow_url', 'trace.parent_id', 'duration_ms', 'http.route'] 227 | 228 | QUERY: 229 | 230 | {"critique": "This query is good because it does a best effort at aligning with a vague NLQ. By checking if 'name' contains 'string', the query is about as close to an interpretation of NLQ as-is possible without further clarification. A COUNT of events broken down by 'name' where 'name' contains 'string' is the best way to interpret a vague NLQ like this.", "outcome": "good"} 231 | 232 | For the below NLQ, QUERY and COLUMNS provide a critique as JSON in the format {{"critique": "...", "outcome": "good"|"bad"}} as shown above. Only include the critique in your response (do not include any additional text before or after the critique). 233 | 234 | 235 | NLQ: {{prompt}} 236 | 237 | COLUMNS: {{columns}} 238 | 239 | QUERY: {{query}} 240 | --- -------------------------------------------------------------------------------- /honeycomb/prompt.txt: -------------------------------------------------------------------------------- 1 | COLUMNS:{{columns}} 2 | 3 | QUERY SPEC: 4 | All top-level keys are optional. 
5 | 6 | ```json 7 | "calculations":[ 8 | // ops: COUNT, CONCURRENCY, COUNT_DISTINCT, HEATMAP, SUM, AVG, MAX, MIN, P001, P01, P05, P10, P25, P50, P75, P90, P95, P99, P999, RATE_AVG, RATE_SUM, RATE_MAX 9 | {"op": "COUNT"},// COUNT and CONCURRENCY are just op 10 | {"op": "HEATMAP", "column": "name"} 11 | ], 12 | "filters":[ 13 | // ops: =, !=, >, >=, <, <=, starts-with, does-not-start-with, exists, does-not-exist, contains, does-not-contain, in, not-in 14 | {"column": "name", "op": "exists"}, // exists and does-not-exist ops only have column 15 | {"column": "name", "op": "=", "value": "something"} 16 | ], 17 | "filter_combination": "AND", // AND or OR 18 | "breakdowns":[ 19 | // columns in COLUMNS 20 | "column1","column2" 21 | ], 22 | "orders":[ 23 | // HEATMAP not allowed 24 | // Must come from breakdowns or calculations 25 | {"op": "op_in_calculation", "column": "column_in_calculation", "order": "ascending"}, 26 | {"op": "COUNT", "order": "descending"}, // COUNT and CONCURRENCY have no column 27 | {"column": "column1", "order": "descending"}, 28 | ], 29 | "havings":[ 30 | // HEATMAP not allowed 31 | {"calculate_op": "op_in_calculation", "column": "name", "op": "OPNAME", "value": 100}, 32 | {"calculate_op": "COUNT", "op": ">", "value": 10}, // COUNT and CONCURRENCY have no column 33 | ], 34 | "time_range": 7200, // Relative time range in seconds. 35 | "start_time": 1234567890, // UNIX timestamp 36 | "end_time": 1234567890, // UNIX timestamp 37 | ``` 38 | 39 | TIPS: 40 | COUNT counts events/spans. COUNT_DISTINCT counts unique values of columns. 41 | HEATMAP shows value distributions. 42 | trace.parent_id does-not-exist shows root span of a trace. Use that to count requests and traces. 43 | name is a span or span event. 44 | parent_name is the name of a span that created a span event. 45 | 46 | When the user input is asking about a time range (such as "yesterday" or "since last week"), 47 | always use the time_range, start_time, and end_time fields. time_range 48 | is relative and can be combined with either start_time or end_time but not both. 49 | 50 | Modify existing query or create a new query based on NLQ. Only include the query in your response (do not include any additional text before or after the query). 
51 | 52 | NLQ: Exception count by exception and caller 53 | EXISTING QUERY: 54 | {"breakdowns":["exception.message","parent_name"],"calculations":[{"op":"COUNT"}],"filters":[{"column":"exception.message","op":"exists","join_column":""},{"column":"parent_name","op":"exists","join_column":""}],"orders":[{"op":"COUNT","order":"descending"}],"time_range":7200} 55 | 56 | NLQ: Error count 57 | EXISTING QUERY: 58 | {"breakdowns":["error"],"calculations":[{"op":"COUNT"}],"filters":[{"column":"error","op":"=","value":true,"join_column":""}],"orders":[{"op":"COUNT","order":"descending"}],"time_range":7200} 59 | 60 | NLQ: Error rate 61 | EXISTING QUERY: 62 | {"breakdowns":["error"],"calculations":[{"op":"COUNT"}],"orders":[{"op":"COUNT","order":"descending"}],"time_range":7200} 63 | 64 | NLQ: Slow requests 65 | EXISTING QUERY: 66 | {"breakdowns":["http.route"],"calculations":[{"column":"duration_ms","op":"HEATMAP"},{"column":"duration_ms","op":"MAX"}],"filters":[{"column":"trace.parent_id","op":"does-not-exist","join_column":""}],"orders":[{"column":"duration_ms","op":"MAX","order":"descending"}],"time_range":7200} 67 | 68 | NLQ: Latency distribution by status code 69 | EXISTING QUERY: 70 | {"breakdowns":["http.status_code"],"calculations":[{"column":"duration_ms","op":"HEATMAP"},{"column":"duration_ms","op":"P95"}],"filters":[{"column":"http.status_code","op":"exists","join_column":""}],"time_range":7200} 71 | 72 | NLQ: Most time consuming db queries 73 | EXISTING QUERY: 74 | {"breakdowns":["name","db.statement"],"calculations":[{"column":"duration_ms","op":"SUM"}],"filters":[{"column":"db.statement","op":"exists","join_column":""}],"orders":[{"column":"duration_ms","op":"SUM","order":"descending"}],"time_range":7200} 75 | 76 | NLQ: Slow db queries 77 | EXISTING QUERY: 78 | {"breakdowns":["db.statement"],"calculations":[{"column":"duration_ms","op":"HEATMAP"},{"column":"duration_ms","op":"P95"}],"filters":[{"column":"db.statement","op":"exists","join_column":""}],"orders":[{"column":"duration_ms","op":"P95","order":"descending"}],"time_range":604800} 79 | 80 | NLQ: by endpoint and operation 81 | EXISTING QUERY:{"calculations":[{"column":"duration_ms","op":"HEATMAP"},{"column":"duration_ms","op":"MAX"}],"filters":[{"column":"trace.parent_id","op":"does-not-exist","join_column":""}],"orders":[{"column":"duration_ms","op":"MAX","order":"descending"}]} 82 | {"breakdowns":["http.route","name"],"calculations":[{"column":"duration_ms","op":"HEATMAP"},{"column":"duration_ms","op":"MAX"}],"filters":[{"column":"trace.parent_id","op":"does-not-exist","join_column":""}],"orders":[{"column":"duration_ms","op":"MAX","order":"descending"}]} 83 | 84 | NLQ: by service 85 | EXISTING QUERY:{"calculations":[{"column":"duration_ms","op":"HEATMAP"}]} 86 | {"breakdowns":["service.name"],"calculations":[{"column":"duration_ms","op":"HEATMAP"}]} 87 | 88 | NLQ: where span is client_request 89 | EXISTING QUERY:{"calculations":[{"column":"duration_ms","op":"HEATMAP"}],"filters":[{"column":"trace.parent_id","op":"does-not-exist","join_column":""}],"time_range":604800} 90 | {"calculations":[{"column":"duration_ms","op":"HEATMAP"}],"filters":[{"column":"trace.parent_id","op":"does-not-exist","join_column":""},{"column":"name","op":"=","value":"client_request","join_column":""}],"time_range":604800} 91 | 92 | NLQ: add count 93 | EXISTING QUERY:{"breakdowns":["name"],"calculations":[{"column":"duration_ms","op":"HEATMAP"}],"filters":[{"column":"http.route","op":"exists","join_column":""}]} 94 | 
{"breakdowns":["name"],"calculations":[{"column":"duration_ms","op":"HEATMAP"},{"op":"COUNT"}],"filters":[{"column":"http.route","op":"exists","join_column":""}]} 95 | 96 | NLQ: remove count 97 | EXISTING QUERY:{"breakdowns":["name"],"calculations":[{"column":"duration_ms","op":"HEATMAP"},{"op":"COUNT"}],"filters":[{"column":"service.name","op":"exists","join_column":""}]} 98 | {"breakdowns":["name"],"calculations":[{"column":"duration_ms","op":"HEATMAP"}],"filters":[{"column":"service.name","op":"exists","join_column":""}]} 99 | 100 | NLQ: add count 101 | EXISTING QUERY: 102 | {"calculations":[{"op":"COUNT"}]} 103 | 104 | NLQ: slow requests 105 | EXISTING QUERY:{"breakdowns":["name"],"filters":[{"column":"error","op":"exists","join_column":""}]} 106 | {"breakdowns":["http.route"],"calculations":[{"column":"duration_ms","op":"HEATMAP"}]} 107 | 108 | 109 | NLQ:all pods with name cron-no-medication 110 | EXISTING QUERY:{"calculations":[{"op":"COUNT"}],"filters":[{"column":"k8s.pod.name","op":"starts-with","value":"cron-no-medication-observation-received-cron","join_column":""}],"filter_combination":"AND","limit":100,"time_range":604800,"granularity":0} 111 | 112 | NLQ:{{prompt}} 113 | EXISTING QUERY: -------------------------------------------------------------------------------- /honeycomb/queries.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## Honeycomb Query Evals\n", 8 | "\n", 9 | "Evals for for Honeycomb Natural Langaguge Query generator from the [Fine Tuning LLMs](https://maven.com/parlance-labs/fine-tuning) course. Related notebooks from the course can be found at .\n", 10 | "\n", 11 | "The [queries.csv](queries.csv) dataset contains \\~ 2,300 example queries (along with per-query column schemas generated offline via RAG). There are two scoring methods supported\n", 12 | "(corresponding to the two @task definitions below):\n", 13 | "\n", 14 | "1. validate - score using the validity checker from the course (utils.py)\n", 15 | "2. 
critique - score using the critique prompt from the course (critique.txt)" 16 |    ] 17 |   }, 18 |   { 19 |    "cell_type": "markdown", 20 |    "metadata": {}, 21 |    "source": [ 22 |     "### Dataset\n", 23 |     "\n", 24 |     "Inspect uses a standard schema for [Datasets](https://ukgovernmentbeis.github.io/inspect_ai/datasets.html), so we'll map the raw data into that schema when reading it (note that \"columns\" are saved as metadata so we can use them for prompt engineering)" 25 |    ] 26 |   }, 27 |   { 28 |    "cell_type": "code", 29 |    "execution_count": null, 30 |    "metadata": {}, 31 |    "outputs": [], 32 |    "source": [ 33 |     "from inspect_ai.dataset import csv_dataset, FieldSpec\n", 34 |     "\n", 35 |     "dataset = csv_dataset(\n", 36 |     "    csv_file=\"queries.csv\",\n", 37 |     "    sample_fields=FieldSpec(input=\"user_input\", metadata=[\"columns\"]),\n", 38 |     "    shuffle=True\n", 39 |     ")" 40 |    ] 41 |   }, 42 |   { 43 |    "cell_type": "markdown", 44 |    "metadata": {}, 45 |    "source": [ 46 |     "### Solver\n", 47 |     "\n", 48 |     "To build the prompt, we'll create a custom [Solver](https://ukgovernmentbeis.github.io/inspect_ai/solvers.html) that merges the user query/prompt and the RAG-retrieved column list into our prompt template:" 49 |    ] 50 |   }, 51 |   { 52 |    "cell_type": "code", 53 |    "execution_count": null, 54 |    "metadata": {}, 55 |    "outputs": [], 56 |    "source": [ 57 |     "from inspect_ai.solver import solver\n", 58 |     "from inspect_ai.util import resource\n", 59 |     "\n", 60 |     "@solver\n", 61 |     "def prompt_with_schema():\n", 62 |     "\n", 63 |     "    prompt_template = resource(\"prompt.txt\")\n", 64 |     "\n", 65 |     "    async def solve(state, generate):\n", 66 |     "        # build the prompt\n", 67 |     "        state.user_prompt.text = prompt_template.replace(\n", 68 |     "            \"{{prompt}}\", state.user_prompt.text\n", 69 |     "        ).replace(\n", 70 |     "            \"{{columns}}\", state.metadata[\"columns\"]\n", 71 |     "        )\n", 72 |     "        return state\n", 73 |     "\n", 74 |     "    return solve\n" 75 |    ] 76 |   }, 77 |   { 78 |    "cell_type": "markdown", 79 |    "metadata": {}, 80 |    "source": [ 81 |     "### Scorer\n", 82 |     "\n", 83 |     "To score the model's responses to the prompt, we'll create a custom [Scorer](https://ukgovernmentbeis.github.io/inspect_ai/scorers.html) that calls the `is_valid()` function to determine whether a valid query has been constructed:" 84 |    ] 85 |   }, 86 |   { 87 |    "cell_type": "code", 88 |    "execution_count": null, 89 |    "metadata": {}, 90 |    "outputs": [], 91 |    "source": [ 92 |     "from inspect_ai.scorer import accuracy, scorer, Score, CORRECT, INCORRECT\n", 93 |     "from utils import is_valid, json_completion\n", 94 |     "\n", 95 |     "@scorer(metrics=[accuracy()])\n", 96 |     "def validate_scorer():\n", 97 |     "\n", 98 |     "    async def score(state, target):\n", 99 |     "        \n", 100 |     "        # check for valid query\n", 101 |     "        query = json_completion(state.output.completion)\n", 102 |     "        if is_valid(query, state.metadata[\"columns\"]):\n", 103 |     "            value=CORRECT\n", 104 |     "        else: \n", 105 |     "            value=INCORRECT\n", 106 |     "        \n", 107 |     "        # return score w/ query that was extracted\n", 108 |     "        return Score(value=value, answer=query)\n", 109 |     "\n", 110 |     "    return score\n" 111 |    ] 112 |   }, 113 |   { 114 |    "cell_type": "markdown", 115 |    "metadata": {}, 116 |    "source": [ 117 |     "Note that the `json_completion()` function takes care of some details around extracting JSON from a model completion (e.g. 
removing the surrounding backtick code block emitted by some models)" 118 |    ] 119 |   }, 120 |   { 121 |    "cell_type": "markdown", 122 |    "metadata": {}, 123 |    "source": [ 124 |     "### Validate Task\n", 125 |     "\n", 126 |     "Now we'll put all of this together to create an evaluation task:" 127 |    ] 128 |   }, 129 |   { 130 |    "cell_type": "code", 131 |    "execution_count": null, 132 |    "metadata": {}, 133 |    "outputs": [], 134 |    "source": [ 135 |     "from inspect_ai import eval, task, Task\n", 136 |     "from inspect_ai.solver import system_message, generate\n", 137 |     "\n", 138 |     "@task\n", 139 |     "def validate():\n", 140 |     "    return Task(\n", 141 |     "        dataset=dataset,\n", 142 |     "        plan=[\n", 143 |     "            system_message(\"Honeycomb AI suggests queries based on user input.\"),\n", 144 |     "            prompt_with_schema(),\n", 145 |     "            generate()\n", 146 |     "        ],\n", 147 |     "        scorer=validate_scorer()\n", 148 |     "    )" 149 |    ] 150 |   }, 151 |   { 152 |    "cell_type": "markdown", 153 |    "metadata": {}, 154 |    "source": [ 155 |     "We can run the task using Inspect's `eval()` function (limiting to 100 samples):" 156 |    ] 157 |   }, 158 |   { 159 |    "cell_type": "code", 160 |    "execution_count": null, 161 |    "metadata": {}, 162 |    "outputs": [], 163 |    "source": [ 164 |     "if __name__ == '__main__':\n", 165 |     "    eval(validate, model=\"openai/gpt-4-turbo\", limit=100)" 166 |    ] 167 |   }, 168 |   { 169 |    "cell_type": "markdown", 170 |    "metadata": {}, 171 |    "source": [ 172 |     "We use the `__name__ == '__main__'` conditional to indicate that we only want to run this cell in interactive contexts. As a result, we can also use the notebook as a module callable from `inspect eval`. For example:\n", 173 |     "\n", 174 |     "```bash\n", 175 |     "$ inspect eval queries.ipynb@validate \n", 176 |     "```" 177 |    ] 178 |   }, 179 |   { 180 |    "cell_type": "markdown", 181 |    "metadata": {}, 182 |    "source": [ 183 |     "### Critique Task\n", 184 |     "\n", 185 |     "Now we'll create a critique task. 
For this we'll need an LLM-based scorer that uses a critique template to prompt for whether the generated query is \"good\" or \"bad\":" 186 |    ] 187 |   }, 188 |   { 189 |    "cell_type": "code", 190 |    "execution_count": null, 191 |    "metadata": {}, 192 |    "outputs": [], 193 |    "source": [ 194 |     "import json\n", 195 |     "from inspect_ai.model import get_model\n", 196 |     "\n", 197 |     "@scorer(metrics=[accuracy()])\n", 198 |     "def critique_scorer(model = \"anthropic/claude-3-opus-20240229\"):\n", 199 |     "\n", 200 |     "    async def score(state, target):\n", 201 |     "        \n", 202 |     "        # build the critic prompt\n", 203 |     "        query = state.output.completion.strip()\n", 204 |     "        critic_prompt = resource(\"critique.txt\").replace(\n", 205 |     "            \"{{prompt}}\", state.user_prompt.text\n", 206 |     "        ).replace(\n", 207 |     "            \"{{columns}}\", state.metadata[\"columns\"]\n", 208 |     "        ).replace(\n", 209 |     "            \"{{query}}\", query\n", 210 |     "        )\n", 211 |     "        \n", 212 |     "        # run the critique\n", 213 |     "        result = await get_model(model).generate(critic_prompt)\n", 214 |     "        try:\n", 215 |     "            parsed = json.loads(json_completion(result.completion))\n", 216 |     "            value = CORRECT if parsed[\"outcome\"] == \"good\" else INCORRECT\n", 217 |     "            explanation = parsed[\"critique\"]\n", 218 |     "        except (json.JSONDecodeError, KeyError):\n", 219 |     "            value = INCORRECT\n", 220 |     "            explanation = f\"JSON parsing error:\\n{result.completion}\"\n", 221 |     "        \n", 222 |     "        # return value and explanation (critique text)\n", 223 |     "        return Score(value=value, explanation=explanation)\n", 224 |     "\n", 225 |     "    return score" 226 |    ] 227 |   }, 228 |   { 229 |    "cell_type": "markdown", 230 |    "metadata": {}, 231 |    "source": [ 232 |     "Now we use this scorer in a critique task definition:" 233 |    ] 234 |   }, 235 |   { 236 |    "cell_type": "code", 237 |    "execution_count": null, 238 |    "metadata": {}, 239 |    "outputs": [], 240 |    "source": [ 241 |     "@task\n", 242 |     "def critique():\n", 243 |     "    return Task(\n", 244 |     "        dataset=dataset,\n", 245 |     "        plan=[\n", 246 |     "            system_message(\"Honeycomb AI suggests queries based on user input.\"),\n", 247 |     "            prompt_with_schema(),\n", 248 |     "            generate()\n", 249 |     "        ],\n", 250 |     "        scorer=critique_scorer()\n", 251 |     "    )" 252 |    ] 253 |   }, 254 |   { 255 |    "cell_type": "markdown", 256 |    "metadata": {}, 257 |    "source": [ 258 |     "And then run the task using `eval()` (limiting to 25 samples):" 259 |    ] 260 |   }, 261 |   { 262 |    "cell_type": "code", 263 |    "execution_count": null, 264 |    "metadata": {}, 265 |    "outputs": [], 266 |    "source": [ 267 |     "if __name__ == '__main__':\n", 268 |     "    eval(critique, model=\"openai/gpt-4-turbo\", limit=25)" 269 |    ] 270 |   } 271 |  ], 272 |  "metadata": { 273 |   "kernelspec": { 274 |    "display_name": ".venv", 275 |    "language": "python", 276 |    "name": "python3" 277 |   }, 278 |   "language_info": { 279 |    "codemirror_mode": { 280 |     "name": "ipython", 281 |     "version": 3 282 |    }, 283 |    "file_extension": ".py", 284 |    "mimetype": "text/x-python", 285 |    "name": "python", 286 |    "nbconvert_exporter": "python", 287 |    "pygments_lexer": "ipython3", 288 |    "version": "3.11.6" 289 |   } 290 |  }, 291 |  "nbformat": 4, 292 |  "nbformat_minor": 2 293 | } 294 | -------------------------------------------------------------------------------- /honeycomb/queries.py: -------------------------------------------------------------------------------- 1 | """ 2 | Evals for the Honeycomb Natural Language Query generator from the Fine Tuning 3 | LLMs course (https://maven.com/parlance-labs/fine-tuning) 4 | 5 | Notebooks from the course: https://github.com/parlance-labs/ftcourse 6 | 7 | The 
queries.csv dataset contains ~ 2,300 example queries (along with column 8 | schemas generated offline via RAG). There are two scoring methods supported 9 | (corresponding to the two @task definitions below): 10 | 11 | 1. validate - score using the validity checker from the course (utils.py) 12 | 2. critique - score using the critique prompt from the course (critique.txt) 13 | """ 14 | 15 | import json 16 | 17 | from inspect_ai import task, Task 18 | from inspect_ai.dataset import csv_dataset, FieldSpec 19 | from inspect_ai.model import get_model 20 | from inspect_ai.scorer import accuracy, scorer, Score, CORRECT, INCORRECT 21 | from inspect_ai.solver import system_message, generate, solver 22 | from inspect_ai.util import resource 23 | 24 | from utils import is_valid, json_completion 25 | 26 | 27 | @task 28 | def validate(): 29 | return eval_task(scorer=validate_scorer()) 30 | 31 | 32 | @task 33 | def critique(): 34 | return eval_task(scorer=critique_scorer()) 35 | 36 | 37 | # shared task implementation parmaeterized by scorer 38 | def eval_task(scorer): 39 | 40 | # read dataset 41 | dataset = csv_dataset( 42 | csv_file="queries.csv", 43 | sample_fields=FieldSpec( 44 | input="user_input", 45 | metadata=["columns"] 46 | ), 47 | shuffle=True 48 | ) 49 | 50 | # create eval task 51 | return Task( 52 | dataset=dataset, 53 | plan=[ 54 | system_message("Honeycomb AI suggests queries based on user input."), 55 | prompt_with_schema(), 56 | generate() 57 | ], 58 | scorer=scorer 59 | ) 60 | 61 | 62 | @solver 63 | def prompt_with_schema(): 64 | 65 | prompt_template = resource("prompt.txt") 66 | 67 | async def solve(state, generate): 68 | # build the prompt 69 | state.user_prompt.text = prompt_template.replace( 70 | "{{prompt}}", state.user_prompt.text 71 | ).replace( 72 | "{{columns}}", state.metadata["columns"] 73 | ) 74 | return state 75 | 76 | return solve 77 | 78 | 79 | @scorer(metrics=[accuracy()]) 80 | def validate_scorer(): 81 | 82 | async def score(state, target): 83 | 84 | # check for valid query 85 | query = json_completion(state.output.completion) 86 | if is_valid(query, state.metadata["columns"]): 87 | value=CORRECT 88 | else: 89 | value=INCORRECT 90 | 91 | # return score w/ query that was extracted 92 | return Score(value=value, answer=query) 93 | 94 | return score 95 | 96 | 97 | @scorer(metrics=[accuracy()]) 98 | def critique_scorer(model = "openai/gpt-4-turbo"): 99 | 100 | async def score(state, target): 101 | 102 | # build the critic prompt 103 | query = state.output.completion.strip() 104 | critic_prompt = resource("critique.txt").replace( 105 | "{{prompt}}", state.user_prompt.text 106 | ).replace( 107 | "{{columns}}", state.metadata["columns"] 108 | ).replace( 109 | "{{query}}", query 110 | ) 111 | 112 | # run the critique 113 | result = await get_model(model).generate(critic_prompt) 114 | try: 115 | parsed = json.loads(json_completion(result.completion)) 116 | value = CORRECT if parsed["outcome"] == "good" else INCORRECT 117 | explanation = parsed["critique"] 118 | except (json.JSONDecodeError, KeyError): 119 | value = INCORRECT 120 | explanation = f"JSON parsing error:\n{result.completion}" 121 | 122 | # return value and explanation (critique text) 123 | return Score(value=value, explanation=explanation) 124 | 125 | return score 126 | 127 | -------------------------------------------------------------------------------- /honeycomb/utils.py: -------------------------------------------------------------------------------- 1 | import re 2 | import json 3 | 4 | 5 | # sometimes models 
will enclose the JSON in markdown! (e.g. ```json) 6 | # this function removes those delimiters should they be there 7 | def json_completion(completion): 8 | completion = re.sub(r'^```json\n', '', completion.strip()) 9 | completion = re.sub(r'\n```$', '', completion) 10 | return completion 11 | 12 | 13 | 14 | class InvalidQueryException(Exception): 15 | def __init__(self, message, query=None): 16 | self.message = message 17 | self.query = query 18 | if query: 19 | self.message += f"\nQuery: {self.query}" 20 | super().__init__(self.message) 21 | 22 | 23 | def is_valid(query_spec:str, columns:str, check_runnable=True): 24 | "Test if a query is valid" 25 | try: 26 | check_query(query_spec, columns, check_runnable) 27 | return True 28 | except (KeyError, InvalidQueryException): 29 | return False 30 | 31 | def check_query(query_spec:str, columns:str, check_runnable=True): 32 | "Raise an exception if a query is invalid." 33 | query_spec = query_spec.replace("'", '"') 34 | try: 35 | spec = json.loads(query_spec) 36 | except json.decoder.JSONDecodeError: 37 | raise InvalidQueryException(f"JSON parsing error:\n{query_spec}", query_spec) 38 | 39 | valid_calculate_ops = [ 40 | "COUNT", 41 | "COUNT_DISTINCT", 42 | "HEATMAP", 43 | "CONCURRENCY", 44 | "SUM", 45 | "AVG", 46 | "MAX", 47 | "MIN", 48 | "P001", 49 | "P01", 50 | "P05", 51 | "P10", 52 | "P25", 53 | "P50", 54 | "P75", 55 | "P90", 56 | "P95", 57 | "P99", 58 | "P999", 59 | "RATE_AVG", 60 | "RATE_SUM", 61 | "RATE_MAX", 62 | ] 63 | 64 | valid_filter_ops = [ 65 | "=", 66 | "!=", 67 | ">", 68 | ">=", 69 | "<", 70 | "<=", 71 | "starts-with", 72 | "does-not-start-with", 73 | "exists", 74 | "does-not-exist", 75 | "contains", 76 | "does-not-contain", 77 | "in", 78 | "not-in", 79 | ] 80 | 81 | if spec == {} or isinstance(spec, float): 82 | raise InvalidQueryException("Query spec cannot be empty.", query_spec) 83 | 84 | if isinstance(spec, str): 85 | raise InvalidQueryException("Query spec was not parsed to json.", query_spec) 86 | 87 | if "calculations" in spec: 88 | for calc in spec["calculations"]: 89 | if "op" not in calc: 90 | raise InvalidQueryException(f"{calc}: Calculation must have an op.", query_spec) 91 | 92 | if calc["op"] not in valid_calculate_ops: 93 | raise InvalidQueryException(f"Invalid calculation: {calc['op']}", query_spec) 94 | 95 | if calc["op"] == "COUNT" or calc["op"] == "CONCURRENCY": 96 | if "column" in calc: 97 | raise InvalidQueryException(f"{calc}: {calc['op']} cannot take a column as input.", query_spec) 98 | else: 99 | if "column" not in calc: 100 | raise InvalidQueryException(f"{calc}: {calc['op']} must take a column as input.", query_spec) 101 | 102 | if check_runnable and calc["column"] not in columns: 103 | raise InvalidQueryException(f"Invalid column: {calc['column']}", query_spec) 104 | 105 | 106 | if "filters" in spec: 107 | for filter in spec["filters"]: 108 | if not isinstance(filter, dict): 109 | raise InvalidQueryException("filter of type other than dict found in query.", query_spec) 110 | if "op" not in filter: 111 | raise InvalidQueryException("No op found in filter.", query_spec) 112 | if filter["op"] not in valid_filter_ops: 113 | raise InvalidQueryException(f"Invalid filter: {filter['op']}", query_spec) 114 | 115 | 116 | if check_runnable and filter["column"] not in columns: 117 | raise InvalidQueryException(f"Invalid column: {filter['column']}", query_spec) 118 | 119 | 120 | if filter["op"] == "exists" or filter["op"] == "does-not-exist": 121 | if "value" in filter: 122 | raise 
InvalidQueryException(f"{filter}: {filter['op']} cannot take a value as input.", query_spec) 123 | 124 | else: 125 | if filter["op"] == "in" or filter["op"] == "not-in": 126 | if not isinstance(filter["value"], list): 127 | raise InvalidQueryException(f"{filter}: {filter['op']} must take a list as input.", query_spec) 128 | 129 | else: 130 | if "value" not in filter: 131 | raise InvalidQueryException(f"{filter}: {filter['op']} must take a value as input.", query_spec) 132 | 133 | if "filter_combination" in spec: 134 | if isinstance(spec["filter_combination"], str) and spec[ 135 | "filter_combination" 136 | ].lower() not in ["and", "or"]: 137 | raise InvalidQueryException(f"Invalid filter combination: {spec['filter_combination']}", query_spec) 138 | 139 | 140 | if "breakdowns" in spec: 141 | for breakdown in spec["breakdowns"]: 142 | if check_runnable and breakdown not in columns: 143 | raise InvalidQueryException(f"Invalid column: {breakdown}", query_spec) 144 | 145 | 146 | if "orders" in spec: 147 | for order in spec["orders"]: 148 | if "order" not in order: 149 | raise InvalidQueryException(f"Invalid order without orders key: {query_spec}") 150 | if order["order"] != "ascending" and order["order"] != "descending": 151 | raise InvalidQueryException(f"Invalid order: {order['order']}", query_spec) 152 | 153 | if "op" in order: 154 | if order["op"] not in valid_calculate_ops: 155 | raise InvalidQueryException(f"Invalid order: {order['op']}", query_spec) 156 | 157 | 158 | if not any(calc["op"] == order["op"] for calc in spec.get("calculations", [])): 159 | raise InvalidQueryException(f"{order}: Order op must be present in calculations: {order['op']}", query_spec) 160 | 161 | if order["op"] == "COUNT" or order["op"] == "CONCURRENCY": 162 | if "column" in order: 163 | raise InvalidQueryException(f"{order}: {order['op']} cannot take a column as input.", query_spec) 164 | 165 | else: 166 | if "column" not in order: 167 | raise InvalidQueryException(f"{order}: {order['op']} must take a column as input.", query_spec) 168 | 169 | if check_runnable and order["column"] not in columns: 170 | raise InvalidQueryException(f"{order}: Invalid column in order: {order['column']}", query_spec) 171 | 172 | else: 173 | if "column" not in order: 174 | raise InvalidQueryException(f"{order}: Order must take a column or op as input.", query_spec) 175 | 176 | if check_runnable and order["column"] not in columns: 177 | raise InvalidQueryException(f"{order}: Invalid column in order: {order['column']}", query_spec) 178 | 179 | 180 | if "havings" in spec: 181 | for having in spec["havings"]: 182 | if "calculate_op" not in having: 183 | raise InvalidQueryException(f"{having}: Having must have a calculate_op.", query_spec) 184 | 185 | if "value" not in having: 186 | raise InvalidQueryException(f"{having}: Having must have a value.", query_spec) 187 | 188 | if "op" not in having: 189 | raise InvalidQueryException(f"{having}: Having must have an op.", query_spec) 190 | 191 | if having["calculate_op"] == "HEATMAP": 192 | raise InvalidQueryException("HEATMAP is not supported in having.", query_spec) 193 | 194 | if ( 195 | having["calculate_op"] == "COUNT" 196 | or having["calculate_op"] == "CONCURRENCY" 197 | ): 198 | if "column" in having: 199 | raise InvalidQueryException(f"{having}: {having['calculate_op']} cannot take a column as input.", query_spec) 200 | 201 | else: 202 | if "column" not in having: 203 | raise InvalidQueryException(f"{having}: {having['calculate_op']} must take a column as input.", query_spec) 204 | 205 
| if check_runnable and having["column"] not in columns: 206 | raise InvalidQueryException(f"{having}: Invalid column in having: {having['column']}", query_spec) 207 | 208 | 209 | if "time_range" in spec: 210 | if "start_time" in spec and "end_time" in spec: 211 | raise InvalidQueryException("Time range cannot be specified with start_time and end_time.", query_spec) 212 | 213 | if not isinstance(spec["time_range"], int): 214 | raise InvalidQueryException(f"time_range must be an int: {spec['time_range']}", query_spec) 215 | 216 | 217 | if "start_time" in spec: 218 | if not isinstance(spec["start_time"], int): 219 | raise InvalidQueryException(f"start_time must be an int: {spec['start_time']}", query_spec) 220 | 221 | 222 | if "end_time" in spec: 223 | if not isinstance(spec["end_time"], int): 224 | raise InvalidQueryException(f"end_time must be an int: {spec['end_time']}", query_spec) 225 | 226 | 227 | if "granularity" in spec: 228 | if not isinstance(spec["granularity"], int): 229 | raise InvalidQueryException(f"granularity must be an int: {spec['granularity']}", query_spec) 230 | 231 | 232 | time_range = ( 233 | spec["time_range"] 234 | if "time_range" in spec 235 | else spec["end_time"] - spec["start_time"] 236 | if "start_time" in spec and "end_time" in spec 237 | else 7200 238 | ) 239 | if spec["granularity"] > time_range / 10: 240 | raise InvalidQueryException(f"granularity must be <= time_range / 10: {spec['granularity']}", query_spec) 241 | 242 | if spec["granularity"] < time_range / 1000: 243 | raise InvalidQueryException(f"granularity must be >= time_range / 1000: {spec['granularity']}", query_spec) 244 | 245 | if "limit" in spec: 246 | if not isinstance(spec["limit"], int): 247 | raise InvalidQueryException(f"limit must be an int: {spec['limit']}", query_spec) 248 | -------------------------------------------------------------------------------- /langchain/.env.example: -------------------------------------------------------------------------------- 1 | TAVILY_API_KEY=your-tavily-api-key 2 | 3 | -------------------------------------------------------------------------------- /langchain/.gitignore: -------------------------------------------------------------------------------- 1 | .env 2 | .venv/ 3 | -------------------------------------------------------------------------------- /langchain/README.md: -------------------------------------------------------------------------------- 1 | ## LangChain Agent 2 | 3 | This example demonstrates creating a custom solver that utilises a LangChain agent to perform QA using Wikipedia. The example includes the following source files: 4 | 5 | | File | Description | 6 | |------------------------|-------------------------------------------------------------------------------------------------| 7 | | `.gitignore` | Ignore the `.venv` directory and the `.env` file containing environment variables for the eval. | 8 | | `.env.example` | Prototype of `.env` file (copy this to `.env` and provide your `TAVILY_API_KEY`). | 9 | | `inspect_langchain.py` | Utilities for creating inspect solvers that use LangChain agents. | 10 | | `wikipedia.py` | Evaluation task and custom solver that uses the search agent. | 11 | | `wikipedia.jsonl` | Dataset with questions and ideal answers. | 12 | 13 | To run this example, first, be sure you provide a `.env` file that defines a `TAVILY_API_KEY` ([Tavily](https://tavily.com/) is a search API for LLM agents). Note that `.env` files should always be included in `.gitignore` as they often contain secrets! 
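For example, one way to set this up (a minimal sketch, assuming a Unix-like shell) is to copy the provided template and then edit it to add your key:

``` bash
$ cp .env.example .env   # then open .env and set TAVILY_API_KEY to your key
```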
14 | 15 | Next, be sure that you have installed the dependencies in the `requirements.txt` file in the root of this repository. To do this in a virtual environment: 16 | 17 | ``` bash 18 | $ python3 -m venv .venv 19 | $ source .venv/bin/activate 20 | $ pip install -r requirements.txt 21 | ``` 22 | 23 | Then, switch to the `langchain` directory and run the example as follows: 24 | 25 | ``` bash 26 | $ cd langchain 27 | $ inspect eval wikipedia.py --model openai/gpt-4 28 | ``` 29 | 30 | This example will run with any model provider that supports tool use (so Anthropic, Google Gemini, and Mistral will all work as well). 31 | 32 | If you want to run in verbose mode (to see the agent's queries printed out), pass the `verbose` task parameter: 33 | 34 | ``` bash 35 | $ inspect eval wikipedia.py --model openai/gpt-4 -T verbose=true --limit 1 36 | ``` 37 | 38 | Note that we specify `--limit 1` so that the verbose output from multiple samples is not intermixed. -------------------------------------------------------------------------------- /langchain/inspect_langchain.py: -------------------------------------------------------------------------------- 1 | import json 2 | from typing import Any, Dict, Protocol, cast, runtime_checkable 3 | 4 | from langchain_core.callbacks import ( 5 |     AsyncCallbackManagerForLLMRun, 6 |     CallbackManagerForLLMRun, 7 | ) 8 | from langchain_core.language_models import BaseChatModel 9 | from langchain_core.messages import ( 10 |     AIMessage, 11 |     BaseMessage, 12 |     FunctionMessage, 13 |     HumanMessage, 14 |     SystemMessage, 15 |     ToolMessage, 16 | ) 17 | from langchain_core.messages import ToolCall as LCToolCall 18 | from langchain_core.outputs import ( 19 |     ChatGeneration, 20 |     ChatResult, 21 | ) 22 | from pydantic.v1 import Field 23 | from typing_extensions import override 24 | 25 | from inspect_ai.model import ( 26 |     ChatMessage, 27 |     ChatMessageAssistant, 28 |     ChatMessageSystem, 29 |     ChatMessageTool, 30 |     ChatMessageUser, 31 |     Content, 32 |     ContentImage, 33 |     ContentText, 34 |     GenerateConfig, 35 |     ModelName, 36 |     ModelOutput, 37 |     ToolCall, 38 |     ToolChoice, 39 |     ToolInfo, 40 |     ToolParam, 41 |     get_model, 42 | ) 43 | from inspect_ai.solver import Generate, Solver, TaskState 44 | 45 | 46 | @runtime_checkable 47 | class LangChainAgent(Protocol): 48 |     async def __call__( 49 |         self, llm: BaseChatModel, input: dict[str, Any] 50 |     ) -> str | list[str | dict[str, Any]]: 51 |         ...
52 | 53 | 54 | def langchain_solver(agent: LangChainAgent) -> Solver: 55 | async def solve(state: TaskState, generate: Generate) -> TaskState: 56 | # create the inspect model api bridge 57 | llm = InspectChatModel() 58 | 59 | # call the agent 60 | await agent( 61 | llm=llm, 62 | input=dict( 63 | input=state.user_prompt.text, 64 | chat_history=as_langchain_chat_history(state.messages[1:]), 65 | ), 66 | ) 67 | 68 | # collect output from llm interface 69 | state.messages = llm.messages 70 | state.output = llm.output 71 | 72 | # return state 73 | return state 74 | 75 | return solve 76 | 77 | 78 | class InspectChatModel(BaseChatModel): 79 | # track messages and model output so we can update 80 | # the inspect task state when we are complete 81 | messages: list[ChatMessage] = Field(default=[], exclude=True) 82 | output: ModelOutput = Field(default=ModelOutput(), exclude=True) 83 | 84 | @property 85 | def _llm_type(self) -> str: 86 | return f"Inspect ({ModelName(get_model()).api})" 87 | 88 | @property 89 | def _identifying_params(self) -> Dict[str, Any]: 90 | return { 91 | "model_name": str(ModelName(get_model()).name), 92 | } 93 | 94 | @override 95 | def _generate( 96 | self, 97 | messages: list[BaseMessage], 98 | stop: list[str] | None = None, 99 | run_manager: CallbackManagerForLLMRun | None = None, 100 | **kwargs: Any, 101 | ) -> ChatResult: 102 | # inspect uses async exclusively 103 | raise NotImplementedError 104 | 105 | @override 106 | async def _agenerate( 107 | self, 108 | messages: list[BaseMessage], 109 | stop: list[str] | None = None, 110 | run_manager: AsyncCallbackManagerForLLMRun | None = None, 111 | **kwargs: dict[str, Any], 112 | ) -> ChatResult: 113 | # extract tools from kwargs 114 | tools: list[ToolInfo] = [] 115 | tool_choice: ToolChoice | None = None 116 | lc_tools = cast(list[dict[str, Any]] | None, kwargs.get("tools", None)) 117 | if lc_tools: 118 | tools = [ 119 | ToolInfo( 120 | name=tool["function"]["name"], 121 | description=tool["function"]["description"], 122 | params=as_inspect_tool_params(tool["function"]["parameters"]), 123 | ) 124 | for tool in lc_tools 125 | ] 126 | tool_choice = "auto" 127 | 128 | # generate 129 | input = [as_inspect_message(message) for message in messages] 130 | result = await get_model().generate( 131 | input=input, 132 | tools=tools, 133 | tool_choice=tool_choice, 134 | config=GenerateConfig(stop_seqs=stop), 135 | ) 136 | 137 | # track last messages / model output 138 | self.messages = input 139 | self.messages.append(result.choices[0].message) 140 | self.output = result 141 | 142 | # extract choices 143 | generations = [ 144 | ChatGeneration(message=as_langchain_message(choice.message)) 145 | for choice in result.choices 146 | ] 147 | 148 | # return 149 | return ChatResult(generations=generations) 150 | 151 | 152 | def as_inspect_message(message: BaseMessage) -> ChatMessage: 153 | if isinstance(message, SystemMessage): 154 | return ChatMessageSystem(content=as_inspect_content(message.content)) 155 | elif isinstance(message, HumanMessage): 156 | return ChatMessageUser(content=as_inspect_content(message.content)) 157 | elif isinstance(message, AIMessage): 158 | return ChatMessageAssistant( 159 | content=as_inspect_content(message.content), 160 | tool_calls=( 161 | [ 162 | ToolCall( 163 | type="function", 164 | function=call["name"], 165 | id=call["id"] or call["name"], 166 | arguments=call["args"], 167 | ) 168 | for call in message.tool_calls 169 | ] 170 | if message.tool_calls and len(message.tool_calls) > 0 171 | else None 172 | ), 173 | ) 
174 | elif isinstance(message, ToolMessage): 175 | return ChatMessageTool( 176 | content=as_inspect_content(message.content), 177 | tool_call_id=message.tool_call_id, 178 | ) 179 | elif isinstance(message, FunctionMessage): 180 | return ChatMessageTool( 181 | content=as_inspect_content(message.content), tool_call_id=message.name 182 | ) 183 | else: 184 | raise ValueError(f"Unexpected message type: {type(message)}") 185 | 186 | 187 | def as_langchain_message(message: ChatMessage) -> BaseMessage: 188 | if isinstance(message, ChatMessageSystem): 189 | return SystemMessage(content=as_langchain_content(message.content)) 190 | elif isinstance(message, ChatMessageUser): 191 | return HumanMessage(content=as_langchain_content(message.content)) 192 | elif isinstance(message, ChatMessageAssistant): 193 | additional_kwargs: dict[str, Any] = {} 194 | if message.tool_calls and len(message.tool_calls) > 0: 195 | additional_kwargs["tool_calls"] = [ 196 | dict( 197 | id=call.id, name=call.function, arguments=json.dumps(call.arguments) 198 | ) 199 | for call in message.tool_calls 200 | ] 201 | 202 | return AIMessage( 203 | content=as_langchain_content(message.content), 204 | tool_calls=( 205 | [ 206 | LCToolCall(id=call.id, name=call.function, args=call.arguments) 207 | for call in message.tool_calls 208 | ] 209 | if message.tool_calls 210 | else [] 211 | ), 212 | additional_kwargs=additional_kwargs, 213 | ) 214 | elif isinstance(message, ChatMessageTool): 215 | return ToolMessage( 216 | content=as_langchain_content(message.content), 217 | tool_call_id=message.tool_call_id or "", 218 | ) 219 | else: 220 | raise ValueError(f"Unexpected message type: {type(message)}") 221 | 222 | 223 | def as_langchain_chat_history(messages: list[ChatMessage]) -> list[dict[str, Any]]: 224 | return [dict(role=message.role, content=message.text) for message in messages] 225 | 226 | 227 | def as_inspect_content( 228 | content: str | list[str | dict[str, Any]], 229 | ) -> str | list[Content]: 230 | if isinstance(content, str): 231 | return content 232 | else: 233 | return [ 234 | ( 235 | ContentText(text=c) 236 | if isinstance(c, str) 237 | else ( 238 | ContentText(text=c["text"]) 239 | if c["type"] == "text" 240 | else ContentImage(image=c["image"]) 241 | ) 242 | ) 243 | for c in content 244 | ] 245 | 246 | 247 | def as_inspect_tool_params(parameters: dict[str, Any]) -> list[ToolParam]: 248 | params: list[ToolParam] = [] 249 | for key, param in parameters["properties"].items(): 250 | params.append( 251 | ToolParam( 252 | name=key, 253 | type=param["type"], 254 | description=param.get("description", param.get("title")), 255 | optional=key not in parameters["required"], 256 | ) 257 | ) 258 | return params 259 | 260 | 261 | def as_langchain_content( 262 | content: str | list[Content], 263 | ) -> str | list[str | dict[str, Any]]: 264 | if isinstance(content, str): 265 | return content 266 | else: 267 | return [c if isinstance(c, str) else c.model_dump() for c in content] 268 | -------------------------------------------------------------------------------- /langchain/wikipedia.jsonl: -------------------------------------------------------------------------------- 1 | {"input":[{"role":"user","content":"List the ten episode titles from the sixth season of \"Game of Thrones\" in broadcast order."}],"target":"The Red Woman, Home, Oathbreaker, Book of the Stranger, The Door, Blood of My Blood, The Broken Man, No One, Battle of the Bastards, The Winds of Winter"} 2 | {"input":[{"role":"user","content":"What's the difference between 
tennis and pickleball?"}],"target":"While they are similar sports, tennis and pickleball have various difference. First, the court size for pickleball is about half the size of a tennis court. Second, pickleball is played with a ball that resembles a whiffle ball. Third, pickleball is played with paddles as opposed to rackets. Finally, the scoring system is quite different as you play for points which can only be scored when you or your team are serving."} 3 | {"input":[{"role":"user","content":"Which types of fish contain the lowest levels of mercury?"}],"target":"The following types of fish contain low levels of mercury: salmon, flounder, Atlantic mackerel, anchovies, pollock, catfish, and shellfish (e.g., clams, scallops, mussels)."} 4 | -------------------------------------------------------------------------------- /langchain/wikipedia.py: -------------------------------------------------------------------------------- 1 | from typing import Any, cast 2 | 3 | from inspect_langchain import langchain_solver 4 | from langchain import hub 5 | from langchain.agents import ( 6 | AgentExecutor, 7 | BaseMultiActionAgent, 8 | create_openai_tools_agent, 9 | ) 10 | from langchain_community.agent_toolkits.load_tools import load_tools 11 | from langchain_community.tools.tavily_search.tool import TavilySearchResults 12 | from langchain_community.utilities.tavily_search import TavilySearchAPIWrapper 13 | from langchain_core.language_models import BaseChatModel 14 | 15 | from inspect_ai import Task, task 16 | from inspect_ai.dataset import json_dataset 17 | from inspect_ai.scorer import model_graded_fact 18 | from inspect_ai.solver import Solver, solver 19 | 20 | @task 21 | def wikipedia() -> Task: 22 | return Task( 23 | dataset=json_dataset("wikipedia.jsonl"), 24 | plan=wikipedia_search(), 25 | scorer=model_graded_fact(), 26 | ) 27 | 28 | @solver 29 | def wikipedia_search( 30 | max_iterations: int | None = 15, 31 | max_execution_time: float | None = None 32 | ) -> Solver: 33 | # standard prompt for functions agent 34 | prompt = hub.pull("hwchase17/openai-tools-agent") 35 | 36 | # tavily and wikipedia tools 37 | tavily_api = TavilySearchAPIWrapper() # type: ignore 38 | tools = [TavilySearchResults(api_wrapper=tavily_api)] + load_tools(["wikipedia"]) 39 | 40 | # agent function 41 | async def agent(llm: BaseChatModel, input: dict[str, Any]): 42 | # create agent -- cast needed due to: 43 | # https://github.com/langchain-ai/langchain/issues/13075 44 | tools_agent = create_openai_tools_agent(llm, tools, prompt) 45 | agent_executor = AgentExecutor.from_agent_and_tools( 46 | agent=cast(BaseMultiActionAgent, tools_agent), 47 | tools=tools, 48 | name="wikipedia_search", 49 | max_iterations=max_iterations, 50 | max_execution_time=max_execution_time, 51 | ) 52 | 53 | # execute the agent and return output 54 | result = await agent_executor.ainvoke(input) 55 | return result["output"] 56 | 57 | # return agent function as inspect solver 58 | return langchain_solver(agent) 59 | 60 | 61 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | inspect_ai 2 | openai 3 | anthropic 4 | google-generativeai 5 | datasets 6 | ipywidgets 7 | langchain==0.2.1 8 | langchainhub==0.1.16 9 | langchain-community==0.2.1 10 | wikipedia==1.4.0 11 | 12 | -------------------------------------------------------------------------------- /slides/images/inspect-honeycomb-validate.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-honeycomb-validate.png -------------------------------------------------------------------------------- /slides/images/inspect-mathmatics.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-mathmatics.png -------------------------------------------------------------------------------- /slides/images/inspect-notebook-eval.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-notebook-eval.png -------------------------------------------------------------------------------- /slides/images/inspect-view-answers.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-view-answers.png -------------------------------------------------------------------------------- /slides/images/inspect-view-honeycomb-critique-score.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-view-honeycomb-critique-score.png -------------------------------------------------------------------------------- /slides/images/inspect-view-honeycomb-critique.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-view-honeycomb-critique.png -------------------------------------------------------------------------------- /slides/images/inspect-view-honeycomb-validate-prompt-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-view-honeycomb-validate-prompt-2.png -------------------------------------------------------------------------------- /slides/images/inspect-view-honeycomb-validate-prompt.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-view-honeycomb-validate-prompt.png -------------------------------------------------------------------------------- /slides/images/inspect-view-honeycomb-validate.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-view-honeycomb-validate.png -------------------------------------------------------------------------------- /slides/images/inspect-view-messages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-view-messages.png 
-------------------------------------------------------------------------------- /slides/images/inspect-view-scoring.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-view-scoring.png -------------------------------------------------------------------------------- /slides/images/inspect-wikipedia-eval.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-wikipedia-eval.png -------------------------------------------------------------------------------- /slides/images/inspect-wikipedia-explanation.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-wikipedia-explanation.png -------------------------------------------------------------------------------- /slides/images/inspect-wikipedia-messages.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-wikipedia-messages.png -------------------------------------------------------------------------------- /slides/images/inspect-wikipedia-scoring.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/inspect-wikipedia-scoring.png -------------------------------------------------------------------------------- /slides/images/log-incorrect.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/images/log-incorrect.png -------------------------------------------------------------------------------- /slides/inspect.qmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Intro to Inspect" 3 | subtitle: | 4 | Open Source Framework for LLM Evals 5 | author: J.J. Allaire 6 | date: today 7 | format: 8 | revealjs: 9 | width: 1350 10 | height: 759.375 11 | menu: false 12 | slide-number: c/t 13 | pdf-separate-fragments: true 14 | --- 15 | 16 | ## Inspect 17 | 18 | - A Python package (`inspect_ai`) used to create LLM evaluations 19 | 20 | - Developed and maintained by the [UK AI Safety Institute](https://www.aisi.gov.uk/) 21 | 22 | - Similar in function to the eval frameworks embedded in benchmark suites (e.g. Open AI Evals, Eluether LM Eval Harness, etc.) 
but designed from the ground up for development of more complex evals 23 | 24 | - Focus on bridging research and production: provide a great development experience for researchers that results in evals that can be reproducibly run at scale 25 | 26 | 27 | ## {background-image="images/inspect-honeycomb-validate.png" background-size="contain"} 28 | 29 | ## Core Design 30 | 31 | ::: {style="margin-top: 2rem;"} 32 | | | | 33 | |-----------------|------------------------------------------------------| 34 | | **Dataset** | List of samples with `input` and `target` | 35 | | **Solvers** | Functions that transform dataset inputs, call the model for generation, and act further on model output. Can be composed together as layers, or can be a single layer with higher internal complexity | 36 | | **Scorer** | Evaluates final output of solvers. May use text comparisons, model grading, or other custom schemes | 37 | ::: 38 | 39 | ## Hello, World 40 | 41 | ``` {.python code-line-numbers="|11|12-16|17|20"} 42 | from inspect_ai import Task, eval, task 43 | from inspect_ai.dataset import example_dataset 44 | from inspect_ai.scorer import model_graded_fact 45 | from inspect_ai.solver import ( 46 | chain_of_thought, generate, self_critique 47 | ) 48 | 49 | @task 50 | def theory_of_mind(): 51 | return Task( 52 | dataset=example_dataset("theory_of_mind"), 53 | plan=[ 54 | chain_of_thought(), 55 | generate(), 56 | self_critique() 57 | ], 58 | scorer=model_graded_fact(), 59 | ) 60 | 61 | eval(theory_of_mind, model="openai/gpt-4") 62 | ``` 63 | 64 | 65 | ## Honeycomb Eval: `validate()` 66 | 67 | ```{.python code-line-numbers="|3-10|16-20|21"} 68 | @task 69 | def validate(): 70 | # read dataset 71 | dataset = csv_dataset( 72 | csv_file="queries.csv", 73 | sample_fields=FieldSpec( 74 | input="user_input", 75 | metadata=["columns"] 76 | ), 77 | shuffle=True 78 | ) 79 | 80 | # create eval task 81 | return Task( 82 | dataset=dataset, 83 | plan=[ 84 | system_message("Honeycomb AI suggests queries based on user input."), 85 | prompt_with_schema(), 86 | generate() 87 | ], 88 | scorer=validate_scorer() 89 | ) 90 | ``` 91 | 92 | ## Solver: `prompt_with_schema()` 93 | 94 | Simple prompt template that substitutes the user query and the RAG generated column schema. 95 | 96 | ```{.python code-line-numbers="|7-12"} 97 | @solver 98 | def prompt_with_schema(): 99 | 100 | prompt_template = resource("prompt.txt") 101 | 102 | async def solve(state, generate): 103 | # build the prompt 104 | state.user_prompt.text = prompt_template.replace( 105 | "{{prompt}}", state.user_prompt.text 106 | ).replace( 107 | "{{columns}}", state.metadata["columns"] 108 | ) 109 | return state 110 | 111 | return solve 112 | ``` 113 | 114 | 115 | ## Scorer: `validate_scorer()` 116 | 117 | Call the `is_valid()` function w/ the column schema to determine if a valid query was generated. 
118 | 119 | ```{.python code-line-numbers="|6-11"} 120 | @scorer(metrics=[accuracy()]) 121 | def validate_scorer(): 122 | 123 | async def score(state, target): 124 | 125 | # check for valid query 126 | query = json_completion(state.output.completion) 127 | if is_valid(query, state.metadata["columns"]): 128 | value=CORRECT 129 | else: 130 | value=INCORRECT 131 | 132 | # return score w/ query that was extracted 133 | return Score(value=value, answer=query) 134 | 135 | return score 136 | ``` 137 | 138 | ## {background-image="images/inspect-view-honeycomb-validate.png" background-size="contain"} 139 | 140 | ## {background-image="images/inspect-view-honeycomb-validate-prompt.png" background-size="contain"} 141 | 142 | ## {background-image="images/inspect-view-honeycomb-validate-prompt-2.png" background-size="contain"} 143 | 144 | 145 | ## Honeycomb Eval: `critique()` 146 | 147 | ```{.python code-line-numbers="|21"} 148 | @task 149 | def critique(): 150 | # read dataset 151 | dataset = csv_dataset( 152 | csv_file="queries.csv", 153 | sample_fields=FieldSpec( 154 | input="user_input", 155 | metadata=["columns"] 156 | ), 157 | shuffle=True 158 | ) 159 | 160 | # create eval task 161 | return Task( 162 | dataset=dataset, 163 | plan=[ 164 | system_message("Honeycomb AI suggests queries based on user input."), 165 | prompt_with_schema(), 166 | generate() 167 | ], 168 | scorer=critique_scorer() 169 | ) 170 | ``` 171 | 172 | 173 | ## Scorer: `critique_scorer()` 174 | 175 | ```{.python code-line-numbers="|2|5-13|15-19"} 176 | @scorer(metrics=[accuracy()]) 177 | def critique_scorer(model = "openai/gpt-4-turbo"): 178 | 179 | async def score(state, target): 180 | # build the critic prompt 181 | query = state.output.completion.strip() 182 | critic_prompt = resource("critique.txt").replace( 183 | "{{prompt}}", state.user_prompt.text 184 | ).replace( 185 | "{{columns}}", state.metadata["columns"] 186 | ).replace( 187 | "{{query}}", query 188 | ) 189 | 190 | # run the critique 191 | result = await get_model(model).generate(critic_prompt) 192 | parsed = json.loads(json_completion(result.completion)) 193 | value = CORRECT if parsed["outcome"] == "good" else INCORRECT 194 | explanation = parsed["critique"] 195 | 196 | return Score(value=value, explanation=explanation) 197 | 198 | return score 199 | ``` 200 | 201 | ## {background-image="images/inspect-view-honeycomb-critique.png" background-size="contain"} 202 | 203 | ## {background-image="images/inspect-view-honeycomb-critique-score.png" background-size="contain"} 204 | 205 | ## Solvers 206 | 207 | A Solver is a Python function that tasks a `TaskState` and transforms it in some useful fashion 208 | 209 | ::: aside 210 | Solver concept was borrowed from Open AI Evals and made more general in Inspect 211 | ::: 212 | 213 | ::: fragment 214 | **TaskState** (initialised from sample) 215 | 216 | ``` python 217 | class TaskState: 218 | messages: list[ChatMessage] 219 | output: ModelOutput 220 | ... 221 | ``` 222 | ::: 223 | 224 | ::: fragment 225 | **Solver Function** 226 | 227 | ``` python 228 | async def solve(state: TaskState, generate: Generate) -> TaskState: 229 | # do something useful with state (prompt engineering, 230 | # generating model output, critique and regenerate, etc.) 
231 | return state 232 | ``` 233 | ::: 234 | 235 | ## Baseline Solvers 236 | 237 | ::: fragment 238 | `prompt_template()` 239 | 240 | ``` python 241 | async def solve(state: TaskState, generate: Generate) -> TaskState: 242 | prompt = state.user_prompt 243 | prompt.text = prompt_template.format(prompt=prompt.text, **params) 244 | return state 245 | ``` 246 | 247 | Modifies the existing prompt by passing it through a template 248 | ::: 249 | 250 | ::: {.fragment style="margin-top: 2rem;"} 251 | `generate()` 252 | 253 | ``` python 254 | async def solve(state: TaskState, generate: Generate) -> TaskState: 255 | return await generate(state) 256 | ``` 257 | 258 | Calls the model, appends the assistant message, and updates the model output 259 | ::: 260 | 261 | ## Solver: `multiple_choice()` 262 | 263 | Prompt with several choices (optionally shuffled) 264 | 265 | ``` python 266 | async def solve(state: TaskState, generate: Generate) -> TaskState: 267 | 268 | # build choices str and key 269 | choices_str, choices_key = make_choices(choices=state.choices) 270 | 271 | # re-write prompt with A,B,C,... choices 272 | state.user_prompt.text = template.format( 273 | question=state.user_prompt.text, 274 | choices=choices_str, 275 | ) 276 | 277 | # generate 278 | state = await generate(state, temperature=0.0, max_tokens=1) 279 | 280 | # map the output back to the right index and return 281 | state.output.completion = choices_key[state.output.completion] 282 | 283 | return state 284 | ``` 285 | 286 | ## Solver: `self_critique()` 287 | 288 | Critique the generated response (possibly with another model), then re-generate in response to the critique. 289 | 290 | ```{.python code-line-numbers="|3-8|10-16|18"} 291 | async def solve(state: TaskState, generate: Generate) -> TaskState: 292 | 293 | critique = await model.generate( 294 | critique_template.format( 295 | question=state.input_text, 296 | completion=state.output.completion, 297 | ) 298 | ) 299 | 300 | state.messages.append(ChatMessageUser( 301 | content=completion_template.format( 302 | question=state.input_text, 303 | completion=state.output.completion, 304 | critique=critique.completion, 305 | ), 306 | )) 307 | 308 | return await generate(state) 309 | ``` 310 | 311 | ## Composition 312 | 313 | Eval development frequently involves creating custom solvers and scorers. 
If made available in a Python package, these can be re-used across many evals. 314 | 315 | ::: fragment 316 | Some jailbreaking solvers from an internal **sheppard** package: 317 | 318 | | | | 319 | |-----------------------|-------------------------------------| 320 | | `encode()`            | Message obfuscation jailbreak       | 321 | | `pap_jailbreak()`     | Persuasion Adversarial Prompt (PAP) | 322 | | `payload_splitting()` | PARROT jailbreak                    | 323 | | `cr_jailbreak()`      | Content reinforcement               | 324 | ::: 325 | 326 | ## Composition 327 | 328 | Using **sheppard** to provide jailbreaks for a security eval: 329 | 330 | ```{.python code-line-numbers="|5|11-15"} 331 | from inspect_ai import Task, eval, task 332 | from inspect_ai.scorer import model_graded_fact 333 | from inspect_ai.solver import generate, system_message 334 | 335 | from sheppard import pap_jailbreak 336 | 337 | @task 338 | def security_guide(): 339 |     return Task( 340 |         dataset=example_dataset("security_guide"), 341 |         plan=[ 342 |             system_message("system.txt"), 343 |             pap_jailbreak(), 344 |             generate() 345 |         ], 346 |         scorer=model_graded_fact(model="openai/gpt-4"), 347 |     ) 348 | ``` 349 | 350 | ## Tool Use 351 | 352 | ::: fragment 353 | `TaskState` also includes tools: 354 | 355 | ``` {.python code-line-numbers="3-4"} 356 | class TaskState: 357 |     messages: list[ChatMessage] 358 |     tools: list[ToolDef] 359 |     tool_choice: ToolChoice 360 |     output: ModelOutput 361 |     ... 362 | ``` 363 | ::: 364 | 365 | ::: fragment 366 | `use_tools()` makes tools available to `generate()`: 367 | 368 | ``` {.python code-line-numbers="3-6"} 369 | return Task( 370 |     dataset=example_dataset("biology_qa"), 371 |     plan=[ 372 |         use_tools(web_search()), 373 |         generate() 374 |     ], 375 |     scorer=model_graded_qa(template=GRADER_TEMPLATE), 376 | ) 377 | ``` 378 | ::: 379 | 380 | 381 | ## Agents and Tools 382 | 383 | - Many permutations of agents and tool use are possible 384 | 385 | > - Bespoke agent logic inside a solver (swapping various tools in and out; see the sketch below) 386 | 387 | > - Bridges to various agent libraries are available as solvers (e.g. `langchain_agent()`, `langroid_agent()`, etc.)
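As a rough sketch (hypothetical code, not taken from the workshop repo), bespoke agent logic wrapped in a solver might look something like the following; the `max_turns` cap and the `"ANSWER:"` stopping convention are assumptions made up for this illustration:

``` python
from inspect_ai.solver import Generate, TaskState, solver

@solver
def simple_agent(max_turns: int = 5):
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        # call the model repeatedly, stopping once it emits a final answer
        # (the "ANSWER:" marker is an illustrative convention, not part of Inspect)
        for _ in range(max_turns):
            state = await generate(state)
            if "ANSWER:" in state.output.completion:
                break
        return state
    return solve
```

A real agent of this kind would typically also include `use_tools(...)` earlier in the plan so that tools are available to `generate()` on each turn.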
388 | 389 | 390 | 391 | ## Agent: Capture the Flag 392 | 393 | Cybersecurity eval using a hand-rolled agent loop (custom agents and agent frameworks can both be embedded in solvers) 394 | 395 | ``` {.python code-line-numbers="|4-8|12"} 396 | Plan( 397 | steps=[ 398 | init_challenge(), 399 | use_tools([ 400 | command_exec(), create_file(), 401 | decompile(), disassemble(), 402 | check_flag(), 403 | ]), 404 | system_message("prompts/system.txt"), 405 | initial_user_message(), 406 | generate(), 407 | check_for_flag_or_continue() 408 | ], 409 | cleanup=exit_challenge() 410 | ) 411 | ``` 412 | 413 | ## Agent: LangChain 414 | 415 | Convert any LangChain agent into a Solver 416 | 417 | ```{.python code-line-numbers="|4-6|8-15|17"} 418 | @solver 419 | def wikipedia_search() -> Solver: 420 | 421 | tavily_api = TavilySearchAPIWrapper() 422 | tools = ([TavilySearchResults(api_wrapper=tavily_api)] + 423 | load_tools(["wikipedia"])) 424 | 425 | async def agent(llm: BaseChatModel, input: dict[str, Any]): 426 | tools_agent = create_openai_tools_agent(llm, tools, prompt) 427 | agent_executor = AgentExecutor.from_agent_and_tools( 428 | agent=tools_agent, 429 | tools=tools 430 | ) 431 | result = await agent_executor.ainvoke(input) 432 | return result["output"] 433 | 434 | return langchain_solver(agent) 435 | ``` 436 | 437 | ## {background-image="images/inspect-wikipedia-eval.png" background-size="contain"} 438 | 439 | ## {background-image="images/inspect-wikipedia-messages.png" background-size="contain"} 440 | 441 | ## {background-image="images/inspect-wikipedia-scoring.png" background-size="contain"} 442 | 443 | ## {background-image="images/inspect-wikipedia-explanation.png" background-size="contain"} 444 | 445 | ## Scoring 446 | 447 | ::: incremental 448 | - Conventional pattern matching / templated answer-based scoring built in. 449 | 450 | - Model graded scorer built-in (often heavily customized). 451 | 452 | - Scorers also pluggable (i.e. provided from other packages). We expect lots of innovation in model graded scoring! 453 | 454 | - Offline / human scoring workflow is supported. 455 | 456 | - Plan to build tools to help rigorously evaluate model graded scorers against human baselines.
457 | ::: 458 | 459 | ## {background-image="images/inspect-mathmatics.png" background-size="contain"} 460 | 461 | ## Scorer: `expression_equivalence()` 462 | 463 | ```{.python code-line-numbers="|8-13"} 464 | @scorer(metrics=[accuracy(), bootstrap_std()]) 465 | def expression_equivalence(): 466 | async def score(state: TaskState, target: Target): 467 | 468 | # extract answer 469 | match = re.search(AnswerPattern.LINE, state.output.completion) 470 | 471 | # ask the model to judge equivalence 472 | answer = match.group(1) 473 | prompt = EQUIVALANCE_TEMPLATE % ( 474 | {"expression1": target.text, "expression2": answer} 475 | ) 476 | result = await get_model().generate(prompt) 477 | 478 | # return the score 479 | correct = result.completion.lower() == "yes" 480 | return Score( 481 | value=CORRECT if correct else INCORRECT, 482 | answer=answer, 483 | explanation=state.output.completion, 484 | ) 485 | 486 | return score 487 | ``` 488 | 489 | ## Logging 490 | 491 | - Capture all context required to debug, analyse, and reproduce evaluations 492 | 493 | - Python API for computing on log file contents 494 | 495 | - Log viewer for interactive exploration of eval results 496 | 497 | ## `EvalLog` {.smaller} 498 | 499 | | | | | 500 | |-------------------|-------------------|----------------------------------| 501 | | `status` | `str` | Status of evaluation | 502 | | `eval` | `EvalSpec` | Top level eval details including task, model, creation time, etc. | 503 | | `plan` | `EvalPlan` | List of solvers and model generation config used for the eval. | 504 | | `samples` | `list[EvalSample]` | Each sample evaluated, including its input, output, target, and score. | 505 | | `results` | `EvalResults` | Aggregated scorer results | 506 | | `stats` | `EvalStats` | Model token usage stats | 507 | | `logging` | `list[LoggingMessage]` | Logging messages (e.g. from `log.info()`, `log.debug()`, etc.)
| 508 | | `error` | `EvalError` | Error information | 509 | 510 | ## Log Viewer: Samples 511 | 512 | ![](images/inspect-view-answers.png){.border} 513 | 514 | ## Log Viewer: Messages 515 | 516 | ![](images/inspect-view-messages.png){.border} 517 | 518 | 519 | ## Log Viewer: Scoring 520 | 521 | ![](images/inspect-view-scoring.png){.border} 522 | 523 | ## Models {.smaller} 524 | 525 | 526 | | Provider | Model Name | Docs | 527 | |-------------------|---------------------------|---------------------------| 528 | | OpenAI | `openai/gpt-3.5-turbo` | [OpenAI Models](https://platform.openai.com/docs/models/overview) | 529 | | Anthropic | `anthropic/claude-3-sonnet-20240229` | [Anthropic Models](https://docs.anthropic.com/claude/docs/models-overview) | 530 | | Google | `google/gemini-1.0-pro` | [Google Models](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models) | 531 | | Mistral | `mistral/mistral-large-latest` | [Mistral Models](https://docs.mistral.ai/platform/endpoints/) | 532 | | Hugging Face | `hf/openai-community/gpt2` | [Hugging Face Models](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending) | 533 | | Ollama | `ollama/llama3` | [Ollama Models](https://ollama.com/library) | 534 | | TogetherAI | `together/lmsys/vicuna-13b-v1.5` | [TogetherAI Models](https://docs.together.ai/docs/inference-models#chat-models) | 535 | | AWS Bedrock | `bedrock/meta.llama2-70b-chat-v1` | [AWS Bedrock Models](https://aws.amazon.com/bedrock/) | 536 | | Azure AI | `azureai/azure-deployment-name` | [Azure AI Models](https://ai.azure.com/explore/models) | 537 | | Cloudflare | `cf/meta/llama-2-7b-chat-fp16` | [Cloudflare Models](https://developers.cloudflare.com/workers-ai/models/#text-generation) | 538 | 539 | : {tbl-colwidths="\[18,45,37\]"} 540 | 541 | ::: {.fragment style="margin-top: 30px; font-weight: 400;"} 542 | Interface with any other model by creating a custom model provider... 
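For example, any provider/model string from the table can be passed straight to `eval()`; here `my_task()` is a hypothetical stand-in for whichever registered `@task` you want to run:

``` python
# run the same task against two different providers just by
# changing the model string (my_task is a placeholder @task)
logs = eval(my_task(), model="anthropic/claude-3-sonnet-20240229")
logs = eval(my_task(), model="ollama/llama3")
```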
543 | ::: 544 | 545 | ## Workflow 546 | 547 | - Lots of interactive exploration occurs during eval development, so critical to have good support for iterating in a Notebook / REPL 548 | 549 | - Eventually though, evals need to end up in a form that enables reproducibly running them in an eval suite 550 | 551 | - Need to support a continuum of workflows that transition well into each other 552 | 553 | - Provide good tooling in Jupyter and VS Code for entire spectrum 554 | 555 | ## {background-image="images/inspect-notebook-eval.png" background-size="contain"} 556 | 557 | ## Interactive Exploration 558 | 559 | Ad-hoc exploration of an eval in a Notebook/REPL 560 | 561 | ```{.python code-line-numbers="|1-6|8-12|14-15"} 562 | params = { 563 | "system": ["devops.txt", "researcher.txt"], 564 | "grader": ["hacker.txt", "expert.txt"], 565 | "grader_model": ["openai/gpt-4", "google/gemini-1.0-pro"] 566 | } 567 | params = list(product(*(params[name] for name in params))) 568 | 569 | tasks = [Task( 570 | dataset=json_dataset("security_guide.jsonl"), 571 | plan=[system_message(system), generate()], 572 | scorer=model_graded_fact(template=grader, model=grader_model) 573 | ) for system, grader, grader_model in params] 574 | 575 | logs = eval(tasks, model = "mistral/mistral-large-latest") 576 | plot_results(logs) 577 | ``` 578 | 579 | ## Task Parameters 580 | 581 | Formalise variation with a parameterised `@task` function: 582 | 583 | ```{.python code-line-numbers="|1-7|9-16"} 584 | @task 585 | def security_guide(system="devops.txt", grader="expert.txt"): 586 | return Task( 587 | dataset = json_dataset("security_guide.jsonl"), 588 | plan=[system_message(system), generate()], 589 | scorer=model_graded_fact(template=grader, model="openai/gpt-4") 590 | ) 591 | 592 | params = { 593 | "system": ["devops.txt", "researcher.txt"], 594 | "grader": ["hacker.txt", "expert.txt"] 595 | } 596 | params = list(product(*(params[name] for name in params))) 597 | 598 | eval([security_guide(system,grader) for system, grader in params], 599 | model = "mistral/mistral-large-latest") 600 | ``` 601 | 602 | ## Task Parameters 603 | 604 | `@task` functions are registered and addressable by external driver programs (step one in development =\> production) 605 | 606 | ``` python 607 | @task 608 | def security_guide(system="devops.txt", grader="expert.txt"): 609 | return Task( 610 | dataset = json_dataset("security_guide.jsonl"), 611 | plan=[system_message(system), generate()], 612 | scorer=model_graded_fact( 613 | template=grader, 614 | model="openai/gpt-4" 615 | ) 616 | ) 617 | ``` 618 | 619 | ::: fragment 620 | Now we can vary the parameters externally: 621 | 622 | ``` {.bash code-line-numbers="true"} 623 | $ inspect eval security_guide.py -T system=devops.txt 624 | $ inspect eval security_guide.py -T grader=hacker.txt 625 | ``` 626 | ::: 627 | 628 | ::: fragment 629 | Same workflow available for tasks in a notebook: 630 | 631 | ``` {.bash code-line-numbers="true"} 632 | $ inspect eval security_guide.ipynb -T system=devops.txt 633 | $ inspect eval security_guide.ipynb -T grader=hacker.txt 634 | ``` 635 | ::: 636 | 637 | ## Task Variants 638 | 639 | We may discover that we *always* want to vary a parameter when running a full evaluation suite: 640 | 641 | ``` python 642 | def security_guide(system, grader="expert.txt"): 643 | return Task( 644 | dataset = json_dataset("security_guide.jsonl"), 645 | plan=[system_message(system), generate()], 646 | scorer=model_graded_fact(template=grader, model="openai/gpt-4") 647 | ) 648 | 649 | 
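# register the variants we always want in a suite as named @task functions
# (each can then be invoked directly, e.g. security_guide.py@devops)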
@task 650 | def devops(): 651 | return security_guide("devops.txt") 652 | 653 | @task 654 | def researcher(): 655 | return security_guide("researcher.txt") 656 | ``` 657 | 658 | 659 | ::: fragment 660 | Invoke by task name 661 | 662 | ``` {.bash code-line-numbers="true"} 663 | $ inspect eval security_guide.py@devops 664 | $ inspect eval security_guide.py@researcher 665 | ``` 666 | ::: 667 | 668 | 669 | ## Eval Suites 670 | 671 | We want to allow for arbitrary source code organisation but still be able to discover and enumerate tasks for a suite 672 | 673 | ::: columns 674 | ::: {.column .fragment} 675 | ``` {.bash code-line-numbers="true"} 676 | security/ 677 | jeopardy/ 678 | import.py 679 | analyze.py 680 | task.py 681 | attack_defense/ 682 | import.py 683 | analyze.py 684 | task.py 685 | ``` 686 | ::: 687 | 688 | ::: {.column .fragment} 689 | ``` {.python code-line-numbers="true"} 690 | list_tasks("security") 691 | 692 | jeopardy/task.py@crypto 693 | jeopardy/task.py@decompile 694 | jeopardy/task.py@packet 695 | jeopardy/task.py@heap_trouble 696 | attack_defense/task.py@saar 697 | attack_defense/task.py@bank 698 | attack_defense/task.py@voting 699 | ``` 700 | ::: 701 | ::: 702 | 703 | ::: {.fragment style="margin-top: 25px;"} 704 | Run them all 705 | 706 | ``` {.python code-line-numbers="true"} 707 | eval(list_tasks("security"), model="mistral/mistral-large-latest") 708 | ``` 709 | ::: 710 | 711 | ## Resiliency 712 | 713 | The production version would look more like this: 714 | 715 | ``` python 716 | # setup log context 717 | os.environ["INSPECT_LOG_DIR"] = "./security-suite_04-07-2024" 718 | 719 | # run the eval suite 720 | tasks = list_tasks("security") 721 | eval(tasks, model="mistral/mistral-large-latest") 722 | 723 | # ...later, in another process that also has INSPECT_LOG_DIR 724 | error_logs = list_eval_logs(status == "error") 725 | eval_retry(error_logs) 726 | ``` 727 | 728 | ::: {.fragment style="margin-top: 30px;"} 729 | Somewhat oversimplified, as we'd also want to enhance the logic around analysing the cause of errors and adopting optimal recovery strategies 730 | ::: 731 | 732 | ## Provenance 733 | 734 | If you run an eval from a Git repository, you should be able to reproduce the eval with only its log file as context 735 | 736 | ::: fragment 737 | ``` {.python code-line-numbers="|1-4|6-11"} 738 | # read the log and extract the origin and commit 739 | log = read_eval_log("security-log.json") 740 | origin = log.eval.revision.origin 741 | commit = log.eval.revision.commit 742 | 743 | # clone the repo, checkout the commit, install deps, and run 744 | run(["git", "clone", origin, "eval-dir"]) 745 | with chdir("eval-dir"): 746 | run(["git", "checkout", commit]) 747 | run(["pip", "install", "-r", "requirements.txt"]) 748 | eval(log) 749 | ``` 750 | ::: 751 | 752 | 753 | ## Learning More 754 | 755 | - Docs: 756 | 757 | - GitHub: 758 | 759 | - Slides/Code: 760 | 761 | #### Questions? 762 | 763 | 764 | 765 | ```{=html} 766 | 788 | ``` -------------------------------------------------------------------------------- /slides/intro-to-inspect.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jjallaire/inspect-llm-workshop/86d00ca6d79b4754266ba36c10be6d5a108a1695/slides/intro-to-inspect.pdf --------------------------------------------------------------------------------