├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── Notebooks └── evaluating-large-language-models-using-llm-as-a-judge-with-amazon-bedrock.ipynb └── README.md /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | ## Code of Conduct 2 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 3 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 4 | opensource-codeofconduct@amazon.com with any additional questions or comments. -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing Guidelines 2 | 3 | Thank you for your interest in contributing to our project. Whether it's a bug report, new feature, correction, or additional 4 | documentation, we greatly value feedback and contributions from our community. 5 | 6 | Please read through this document before submitting any issues or pull requests to ensure we have all the necessary 7 | information to effectively respond to your bug report or contribution. 8 | 9 | 10 | ## Reporting Bugs/Feature Requests 11 | 12 | We welcome you to use the GitHub issue tracker to report bugs or suggest features. 13 | 14 | When filing an issue, please check existing open, or recently closed, issues to make sure somebody else hasn't already 15 | reported the issue. Please try to include as much information as you can. Details like these are incredibly useful: 16 | 17 | * A reproducible test case or series of steps 18 | * The version of our code being used 19 | * Any modifications you've made relevant to the bug 20 | * Anything unusual about your environment or deployment 21 | 22 | 23 | ## Contributing via Pull Requests 24 | Contributions via pull requests are much appreciated. 
Before sending us a pull request, please ensure that: 25 | 26 | 1. You are working against the latest source on the *main* branch. 27 | 2. You check existing open, and recently merged, pull requests to make sure someone else hasn't addressed the problem already. 28 | 3. You open an issue to discuss any significant work - we would hate for your time to be wasted. 29 | 30 | To send us a pull request, please: 31 | 32 | 1. Fork the repository. 33 | 2. Modify the source; please focus on the specific change you are contributing. If you also reformat all the code, it will be hard for us to focus on your change. 34 | 3. Ensure local tests pass. 35 | 4. Commit to your fork using clear commit messages. 36 | 5. Send us a pull request, answering any default questions in the pull request interface. 37 | 6. Pay attention to any automated CI failures reported in the pull request, and stay involved in the conversation. 38 | 39 | GitHub provides additional documentation on [forking a repository](https://help.github.com/articles/fork-a-repo/) and 40 | [creating a pull request](https://help.github.com/articles/creating-a-pull-request/). 41 | 42 | 43 | ## Finding contributions to work on 44 | Looking at the existing issues is a great way to find something to work on. As our projects, by default, use the default GitHub issue labels (enhancement/bug/duplicate/help wanted/invalid/question/wontfix), looking at any 'help wanted' issues is a great place to start. 45 | 46 | 47 | ## Code of Conduct 48 | This project has adopted the [Amazon Open Source Code of Conduct](https://aws.github.io/code-of-conduct). 49 | For more information see the [Code of Conduct FAQ](https://aws.github.io/code-of-conduct-faq) or contact 50 | opensource-codeofconduct@amazon.com with any additional questions or comments. 
51 | 52 | 53 | ## Security issue notifications 54 | If you discover a potential security issue in this project we ask that you notify AWS/Amazon Security via our [vulnerability reporting page](http://aws.amazon.com/security/vulnerability-reporting/). Please do **not** create a public github issue. 55 | 56 | 57 | ## Licensing 58 | 59 | See the [LICENSE](LICENSE) file for our project's licensing. We will ask you to confirm the licensing of your contribution. -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT-0 licence 2 | 3 | MIT No Attribution 4 | 5 | Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. 6 | 7 | Permission is hereby granted, free of charge, to any person obtaining a copy of 8 | this software and associated documentation files (the "Software"), to deal in 9 | the Software without restriction, including without limitation the rights to 10 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 11 | the Software, and to permit persons to whom the Software is furnished to do so. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 15 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 16 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 17 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 18 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
19 | -------------------------------------------------------------------------------- /Notebooks/evaluating-large-language-models-using-llm-as-a-judge-with-amazon-bedrock.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "3c825696-de59-4e70-8461-af86a02d812b", 6 | "metadata": { 7 | "tags": [] 8 | }, 9 | "source": [ 10 | "# Evaluating Large Language Models using LLM-as-a-Judge with Amazon Bedrock" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "id": "90832ab8-1405-4c45-b185-e65cf8b3b13e", 16 | "metadata": {}, 17 | "source": [ 18 | "This notebook serves as a base for evaluating Large Language Models using LLM-as-a-Judge with Amazon Bedrock. \n", 19 | "\n", 20 | "> This notebook should work well with the Data Science 3.0 kernel in SageMaker Studio\n", 21 | "\n", 22 | "\n", 23 | "Evaluating large language models (LLM) is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, strong LLMs are used as judges to evaluate these models on more open-ended questions. The agreement between LLM judges and human preferences has been verified by introducing two benchmarks: [Multi Turn (MT)-bench](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge/data/mt_bench), a multi-turn question set, and [Chatbot Arena](https://arena.lmsys.org/), a crowdsourced battle platform. The results reveal that strong LLM judges can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. 
This makes LLM-as-a-judge a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain.\n", 24 | "\n", 25 | "> ℹ️ **Note:** The evaluation steps in this lab are based on the paper [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/pdf/2306.05685.pdf).\n", 26 | "\n", 27 | "This lab addresses this challenge by providing a practical solution for evaluating LLMs using LLM-as-a-Judge with Amazon Bedrock. This is relevant for developers and researchers who evaluate LLM-based applications. The notebook guides you through using MT-Bench questions to generate test answers and evaluating them with single-answer grading using the Bedrock API, Python, and LangChain. For demonstration purposes, Claude Instant is evaluated and Claude 3 Sonnet serves as the strong LLM judge. The notebook consists of the following chapters: \n", 28 | "\n", 29 | "1) [Setup of the environment](#1.-Setup-of-the-environment)\n", 30 | "2) [Load MT-Bench question set](#2.-Load-MT-Bench-question-set)\n", 31 | "3) [Generate test answers from LLM which should be evaluated](#3.-Generate-test-answers-from-LLM-which-should-be-evaluated)\n", 32 | "4) [Evaluate answers with strong LLM-as-a-judge](#4.-Evaluate-answers-with-strong-LLM-as-a-judge)\n", 33 | "5) [Generate explanation for average rating score](#5.-Generate-explanation-for-average-rating-score)" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "id": "8c45dfee-35e1-467e-b831-f3b72f97f017", 39 | "metadata": {}, 40 | "source": [ 41 | "## 1. Setup of the environment" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "id": "cd06d4c4-500f-4dd0-ba19-6559f85fde9b", 47 | "metadata": {}, 48 | "source": [ 49 | "We start by installing the required libraries." 
50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 2, 55 | "id": "bb52e396-d5b3-4a6c-bc85-5a577efd33cf", 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "%%capture \n", 60 | "%pip install langchain==0.1.10 boto3 tqdm" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "id": "6f81966d-ec79-42f1-9d05-446b5814b42f", 66 | "metadata": {}, 67 | "source": [ 68 | "## 2. Load MT-Bench question set" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "id": "eb150ca6-5373-48d9-8472-2266c1468e30", 74 | "metadata": {}, 75 | "source": [ 76 | "This lab uses the MT-Bench question set, which consists of 80 high-quality multi-turn questions. They are designed to test multi-turn conversation and instruction-following ability, covering common use cases and focusing on challenging questions to differentiate models.\n", 77 | "\n", 78 | "To evaluate custom applications or fine-tuned LLMs, questions should be adjusted or created according to the use cases. They should focus on covering common use cases and usage patterns.\n", 79 | "\n", 80 | "We download the questions to use them for evaluation." 
81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 3, 86 | "id": "c19cfdf0-b995-4898-9d93-032c714bd645", 87 | "metadata": { 88 | "tags": [] 89 | }, 90 | "outputs": [], 91 | "source": [ 92 | "# Import necessary libraries\n", 93 | "import requests\n", 94 | "import json\n", 95 | "\n", 96 | "# Download MT-Bench questions\n", 97 | "url = \"https://raw.githubusercontent.com/lm-sys/FastChat/main/fastchat/llm_judge/data/mt_bench/question.jsonl\"\n", 98 | "response = requests.get(url)\n", 99 | "lines = response.text.split(\"\\n\") \n", 100 | "\n", 101 | "# Iterate through lines and append them to questions array as json\n", 102 | "questions = []\n", 103 | "for line in lines:\n", 104 | " if line:\n", 105 | " questions.append(json.loads(line))" 106 | ] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "id": "4133b71d-9ee9-4808-85f3-f096f5e7e0fe", 111 | "metadata": {}, 112 | "source": [ 113 | "## 3. Generate test answers from LLM which should be evaluated" 114 | ] 115 | }, 116 | { 117 | "cell_type": "markdown", 118 | "id": "ded3bed1-f6b7-4775-9f8b-6bbee2b962eb", 119 | "metadata": {}, 120 | "source": [ 121 | "Now that we have the questions stored, we use the LLM which should be evaluated to generate the answers to these questions. First we create a prompt template which we use in a second step to generate each answer." 
122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 4, 127 | "id": "d8896ed0-4d83-4774-bb9f-bd4d6b5fb088", 128 | "metadata": {}, 129 | "outputs": [], 130 | "source": [ 131 | "# Import necessary libraries\n", 132 | "from langchain.prompts import PromptTemplate\n", 133 | "from langchain_community.chat_models import BedrockChat\n", 134 | "import boto3\n", 135 | "\n", 136 | "# Create bedrock client\n", 137 | "boto3_bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 5, 143 | "id": "3ff03e7e-31c2-4090-a3f1-b2bd013138a0", 144 | "metadata": { 145 | "tags": [] 146 | }, 147 | "outputs": [], 148 | "source": [ 149 | "# Create a prompt template used to answer each MT-Bench question, given the conversation history so far\n", 150 | "initial_question_prompt_template = PromptTemplate(\n", 151 | " input_variables=[\"input\",\"history\"],\n", 152 | " template=\"\"\"HUMAN:\n", 153 | " You are an artificial intelligence assistant answering questions from a curious user.\n", 154 | " Give helpful, detailed, and polite answers to the user's questions.\n", 155 | " \n", 156 | " Current conversation:\n", 157 | " \n", 158 | " {history}\n", 159 | " \n", 160 | " \n", 161 | " Here is the human's next reply:\n", 162 | " \n", 163 | " {input}\n", 164 | " \n", 165 | "\n", 166 | " ANSWER:\"\"\")" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 6, 172 | "id": "e20f225d-8821-4524-bf35-791ebf07caff", 173 | "metadata": { 174 | "tags": [] 175 | }, 176 | "outputs": [], 177 | "source": [ 178 | "# For each model provider there are different parameters to define when inferencing against the model. 
These depend on the use case.\n", 179 | "inference_modifier = {\n", 180 | " \"temperature\": 0.5,\n", 181 | " \"top_k\": 250,\n", 182 | " \"top_p\": 1,\n", 183 | " }\n", 184 | " \n", 185 | "\n", 186 | "evaluate_llm = BedrockChat(model_id = \"anthropic.claude-instant-v1\",\n", 187 | " client = boto3_bedrock, \n", 188 | " model_kwargs = inference_modifier \n", 189 | " )" 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": 7, 195 | "id": "92bc75f6-8dcb-488b-954a-c8bd1611e1da", 196 | "metadata": { 197 | "tags": [] 198 | }, 199 | "outputs": [], 200 | "source": [ 201 | "# Import necessary libraries\n", 202 | "from langchain.memory import ConversationBufferMemory\n", 203 | "from langchain.chains import ConversationChain\n", 204 | "from tqdm.auto import tqdm" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "id": "7f67a224-e317-42d6-82a4-9af60b033210", 210 | "metadata": {}, 211 | "source": [ 212 | "ℹ️ **Note:** The next step takes several minutes to complete." 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": 8, 218 | "id": "d81da9c9-dddd-43e7-a268-d7d4d26ed49a", 219 | "metadata": { 220 | "tags": [] 221 | }, 222 | "outputs": [ 223 | { 224 | "data": { 225 | "application/vnd.jupyter.widget-view+json": { 226 | "model_id": "86dfc4c88a7e4c289ae35ca1377a03bf", 227 | "version_major": 2, 228 | "version_minor": 0 229 | }, 230 | "text/plain": [ 231 | " 0%| | 0/80 [00:00Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below.\n", 287 | " Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Your evaluation should focus on the assistant's answer to the second user question. Begin your evaluation by providing a short explanation. Be as objective as possible. 
After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \\\"\\\", for example: \\\"Rating: 5\\ \n", 288 | " \n", 289 | " \n", 290 | " Human:\n", 291 | " \n", 292 | " {question1}\n", 293 | " \n", 294 | "\n", 295 | " Assistant:\n", 296 | " \n", 297 | " {answer1}\n", 298 | " \n", 299 | "\n", 300 | " Human:\n", 301 | " \n", 302 | " {question2}\n", 303 | " \n", 304 | "\n", 305 | " Assistant:\n", 306 | " \n", 307 | " {answer2}\n", 308 | " \n", 309 | " \n", 310 | "\n", 311 | " ANSWER:\"\"\")" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": 10, 317 | "id": "63d18812-ed8b-48b2-a520-448b74a99e03", 318 | "metadata": { 319 | "tags": [] 320 | }, 321 | "outputs": [], 322 | "source": [ 323 | "# For each model provider there are different parameters to define when inferencing against the model. These depend on the use case.\n", 324 | "eval_inference_modifier = {\n", 325 | " \"temperature\": 0.5,\n", 326 | " \"top_k\": 250,\n", 327 | " \"top_p\": 1,\n", 328 | " }\n", 329 | " \n", 330 | "\n", 331 | "# Pass the judge's own parameter set (eval_inference_modifier) rather than the one defined for the evaluated model\n", 332 | "eval_llm = BedrockChat(model_id = \"anthropic.claude-3-sonnet-20240229-v1:0\",\n", 333 | " client = boto3_bedrock, \n", 334 | " model_kwargs = eval_inference_modifier \n", 335 | " )" 336 | ] 337 | }, 338 | { 339 | "cell_type": "markdown", 340 | "id": "e0e0ac00-172c-4760-93ca-9b8de10345bd", 341 | "metadata": {}, 342 | "source": [ 343 | "ℹ️ **Note:** The next step takes several minutes to complete." 
343 | ] 344 | }, 345 | { 346 | "cell_type": "code", 347 | "execution_count": 11, 348 | "id": "90722ec0-9a6c-4345-a4f4-1c41cb029afb", 349 | "metadata": { 350 | "tags": [] 351 | }, 352 | "outputs": [ 353 | { 354 | "data": { 355 | "application/vnd.jupyter.widget-view+json": { 356 | "model_id": "7b44f895bfcd4623a52d933260af4053", 357 | "version_major": 2, 358 | "version_minor": 0 359 | }, 360 | "text/plain": [ 361 | " 0%| | 0/80 [00:00(.*?)\"\n", 374 | "explanation_rating = []\n", 375 | "for question in tqdm(questions):\n", 376 | " question1 = question['answers'][0]['input']\n", 377 | " question2 = question['answers'][1]['input']\n", 378 | " answer1 = question['answers'][0]['response']\n", 379 | " answer2 = question['answers'][1]['response']\n", 380 | " question['rating_text'] = eval_llm.invoke(eval_prompt_template.format(question1 = question1, answer1 = answer1, question2=question2, answer2=answer2)).content\n", 381 | " tag_value = re.search(reg_str, question['rating_text'])\n", 382 | " if tag_value: \n", 383 | " question['rating_score'] = tag_value.group(1)\n", 384 | " explanation_rating.append(question['rating_text'])\n", 385 | " amount_questions = amount_questions + 1\n", 386 | " ratings_add_up = ratings_add_up + int(question['rating_score'])" 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "execution_count": 12, 392 | "id": "ad0bf06d-dd94-4cdc-9855-41a596328302", 393 | "metadata": { 394 | "tags": [] 395 | }, 396 | "outputs": [ 397 | { 398 | "name": "stdout", 399 | "output_type": "stream", 400 | "text": [ 401 | "The average rating score is: 8.6125\n" 402 | ] 403 | } 404 | ], 405 | "source": [ 406 | "# Calculate the average rating score\n", 407 | "average_rating = ratings_add_up/amount_questions\n", 408 | "print(\"The average rating score is: {}\".format(average_rating))" 409 | ] 410 | }, 411 | { 412 | "cell_type": "code", 413 | "execution_count": 13, 414 | "id": "7917b47a-4c7c-4ec2-94fb-f290a1d2addb", 415 | "metadata": {}, 416 | "outputs": [ 417 | { 418 | 
"data": { 419 | "image/png": "", 420 | "text/plain": [ 421 | "
" 422 | ] 423 | }, 424 | "metadata": {}, 425 | "output_type": "display_data" 426 | } 427 | ], 428 | "source": [ 429 | "# Display rating scores in bar chart\n", 430 | "import matplotlib.pyplot as plt\n", 431 | "from operator import countOf\n", 432 | "\n", 433 | "rating_scores = []\n", 434 | "for question in questions:\n", 435 | " if 'rating_score' in question: \n", 436 | " rating_scores.append(question['rating_score'])\n", 437 | "\n", 438 | "bar_labels = []\n", 439 | "rating_scores_count = []\n", 440 | "\n", 441 | "for x in range(11):\n", 442 | " bar_labels.append(str(x))\n", 443 | " rating_scores_count.append(countOf(rating_scores,str(x)))\n", 444 | "\n", 445 | "fig, ax = plt.subplots()\n", 446 | "ax.bar(bar_labels, rating_scores_count)\n", 447 | "ax.set_ylabel('Distribution')\n", 448 | "ax.set_xlabel('Rating score')\n", 449 | "plt.axvline(x=average_rating, color='tab:red')\n", 450 | "plt.legend(['Average rating score','Distribution rating score'])\n", 451 | "plt.show()\n" 452 | ] 453 | }, 454 | { 455 | "cell_type": "markdown", 456 | "id": "2e4c565e-ca5e-4a7a-9587-7d036228c609", 457 | "metadata": {}, 458 | "source": [ 459 | "## 5. Generate explanation for average rating score" 460 | ] 461 | }, 462 | { 463 | "cell_type": "markdown", 464 | "id": "019df54a-1c4d-4096-9508-f4ee356feb08", 465 | "metadata": {}, 466 | "source": [ 467 | "To explain the average rating score, the individual rating explanations can be summarized to identify areas for improvement and further optimize the application." 
468 | ] 469 | }, 470 | { 471 | "cell_type": "code", 472 | "execution_count": 14, 473 | "id": "bfa61bdc-e5dc-4ccf-833f-2c07414d8b9e", 474 | "metadata": {}, 475 | "outputs": [], 476 | "source": [ 477 | "# Define prompt to summarize all ratings to explain the given average rating\n", 478 | "summary_prompt_template = \"\"\"HUMAN:\n", 479 | "Please act as an impartial summarizer and condense the following explanations from an LLM acting as a judge into one single statement.\n", 480 | "Explain the main areas for improvement. Also, write a concise summary of these explanations that accounts for the average rating the LLM judge gave.\n", 481 | "\n", 482 | "{average_rating}\n", 483 | "{explanations}\n", 484 | "ANSWER:\"\"\"\n", 485 | "summary_prompt = PromptTemplate.from_template(summary_prompt_template)" 486 | ] 487 | }, 488 | { 489 | "cell_type": "code", 490 | "execution_count": 15, 491 | "id": "fa67becf-81e3-4ae3-aab5-b5dcf5094863", 492 | "metadata": {}, 493 | "outputs": [], 494 | "source": [ 495 | "explanation_avg_rating = eval_llm.invoke(summary_prompt.format(average_rating=average_rating, explanations=explanation_rating)).content" 496 | ] 497 | }, 498 | { 499 | "cell_type": "code", 500 | "execution_count": 16, 501 | "id": "cf44bb82-9c23-49d2-b2fe-c4041d030c6a", 502 | "metadata": {}, 503 | "outputs": [ 504 | { 505 | "name": "stdout", 506 | "output_type": "stream", 507 | "text": [ 508 | "Based on the explanations provided, the main areas for improvement seem to be:\n", 509 | "\n", 510 | "1. Providing more concise summaries or conclusions at times to reinforce the key points.\n", 511 | "2. Expanding on certain aspects with additional details, examples, or context where relevant.\n", 512 | "3. 
Analyzing complexities like time/space complexity, alternative approaches, or potential limitations in more depth for some technical responses.\n", 513 | "\n", 514 | "As for the average rating of 8.6125 given by the LLM judge, the explanations suggest that the assistant's responses were generally of high quality, demonstrating strong understanding, accuracy, relevance, and helpfulness in addressing the given tasks or questions. The judge commended the assistant's creativity, level of detail, clear explanations, and ability to provide well-reasoned and insightful solutions across a diverse range of topics and scenarios. However, there were occasional opportunities for improvement in areas like conciseness, depth of analysis, and considering additional nuances or perspectives, which likely prevented some responses from achieving a perfect 10/10 rating.\n" 515 | ] 516 | } 517 | ], 518 | "source": [ 519 | "print(explanation_avg_rating)" 520 | ] 521 | }, 522 | { 523 | "cell_type": "markdown", 524 | "id": "cea3c480-9d28-4d75-9248-abcae16c45c2", 525 | "metadata": {}, 526 | "source": [ 527 | "## Conclusion" 528 | ] 529 | }, 530 | { 531 | "cell_type": "markdown", 532 | "id": "a3f5d0d6-4a5a-49fe-8ff6-8d5f6ddc3fea", 533 | "metadata": {}, 534 | "source": [ 535 | "The lab demonstrates a practical approach for evaluating large language models (LLMs) using the LLM-as-a-Judge technique with Amazon Bedrock. This method addresses the challenges in evaluating LLMs due to their broad capabilities and the limitations of existing benchmarks in measuring human preferences.\n", 536 | "\n", 537 | "By leveraging strong LLM judges, such as Claude 3 Sonnet in this notebook, the lab showcases how to assess the performance of LLMs like Claude Instant on the Multi Turn (MT)-Bench, a benchmark designed to measure alignment with human preferences. 
\n", 538 | "\n", 539 | "This approach makes LLM-as-a-judge a scalable and explainable way to approximate human preferences, which are otherwise very costly to obtain. The notebook provides a step-by-step guide on setting up the environment, loading the MT-Bench questions, generating test answers from the LLM under evaluation, using the Bedrock API to assess the answers with the LLM judge, and visualizing the results.\n", 540 | "\n", 541 | "This demonstration of the LLM-as-a-Judge methodology with Amazon Bedrock shows how developers and researchers working on LLM-based applications can adopt this evaluation technique. By understanding the alignment of their models with human preferences, they can make more informed decisions and continue to improve the capabilities of these language models." 542 | ] 543 | } 544 | ], 545 | "metadata": { 546 | "availableInstances": [ 547 | { 548 | "_defaultOrder": 0, 549 | "_isFastLaunch": true, 550 | "category": "General purpose", 551 | "gpuNum": 0, 552 | "hideHardwareSpecs": false, 553 | "memoryGiB": 4, 554 | "name": "ml.t3.medium", 555 | "vcpuNum": 2 556 | }, 557 | { 558 | "_defaultOrder": 1, 559 | "_isFastLaunch": false, 560 | "category": "General purpose", 561 | "gpuNum": 0, 562 | "hideHardwareSpecs": false, 563 | "memoryGiB": 8, 564 | "name": "ml.t3.large", 565 | "vcpuNum": 2 566 | }, 567 | { 568 | "_defaultOrder": 2, 569 | "_isFastLaunch": false, 570 | "category": "General purpose", 571 | "gpuNum": 0, 572 | "hideHardwareSpecs": false, 573 | "memoryGiB": 16, 574 | "name": "ml.t3.xlarge", 575 | "vcpuNum": 4 576 | }, 577 | { 578 | "_defaultOrder": 3, 579 | "_isFastLaunch": false, 580 | "category": "General purpose", 581 | "gpuNum": 0, 582 | "hideHardwareSpecs": false, 583 | "memoryGiB": 32, 584 | "name": "ml.t3.2xlarge", 585 | "vcpuNum": 8 586 | }, 587 | { 588 | "_defaultOrder": 4, 589 | "_isFastLaunch": true, 590 | "category": "General purpose", 591 | "gpuNum": 0, 592 | 
"hideHardwareSpecs": false, 593 | "memoryGiB": 8, 594 | "name": "ml.m5.large", 595 | "vcpuNum": 2 596 | }, 597 | { 598 | "_defaultOrder": 5, 599 | "_isFastLaunch": false, 600 | "category": "General purpose", 601 | "gpuNum": 0, 602 | "hideHardwareSpecs": false, 603 | "memoryGiB": 16, 604 | "name": "ml.m5.xlarge", 605 | "vcpuNum": 4 606 | }, 607 | { 608 | "_defaultOrder": 6, 609 | "_isFastLaunch": false, 610 | "category": "General purpose", 611 | "gpuNum": 0, 612 | "hideHardwareSpecs": false, 613 | "memoryGiB": 32, 614 | "name": "ml.m5.2xlarge", 615 | "vcpuNum": 8 616 | }, 617 | { 618 | "_defaultOrder": 7, 619 | "_isFastLaunch": false, 620 | "category": "General purpose", 621 | "gpuNum": 0, 622 | "hideHardwareSpecs": false, 623 | "memoryGiB": 64, 624 | "name": "ml.m5.4xlarge", 625 | "vcpuNum": 16 626 | }, 627 | { 628 | "_defaultOrder": 8, 629 | "_isFastLaunch": false, 630 | "category": "General purpose", 631 | "gpuNum": 0, 632 | "hideHardwareSpecs": false, 633 | "memoryGiB": 128, 634 | "name": "ml.m5.8xlarge", 635 | "vcpuNum": 32 636 | }, 637 | { 638 | "_defaultOrder": 9, 639 | "_isFastLaunch": false, 640 | "category": "General purpose", 641 | "gpuNum": 0, 642 | "hideHardwareSpecs": false, 643 | "memoryGiB": 192, 644 | "name": "ml.m5.12xlarge", 645 | "vcpuNum": 48 646 | }, 647 | { 648 | "_defaultOrder": 10, 649 | "_isFastLaunch": false, 650 | "category": "General purpose", 651 | "gpuNum": 0, 652 | "hideHardwareSpecs": false, 653 | "memoryGiB": 256, 654 | "name": "ml.m5.16xlarge", 655 | "vcpuNum": 64 656 | }, 657 | { 658 | "_defaultOrder": 11, 659 | "_isFastLaunch": false, 660 | "category": "General purpose", 661 | "gpuNum": 0, 662 | "hideHardwareSpecs": false, 663 | "memoryGiB": 384, 664 | "name": "ml.m5.24xlarge", 665 | "vcpuNum": 96 666 | }, 667 | { 668 | "_defaultOrder": 12, 669 | "_isFastLaunch": false, 670 | "category": "General purpose", 671 | "gpuNum": 0, 672 | "hideHardwareSpecs": false, 673 | "memoryGiB": 8, 674 | "name": "ml.m5d.large", 675 | "vcpuNum": 2 
676 | }, 677 | { 678 | "_defaultOrder": 13, 679 | "_isFastLaunch": false, 680 | "category": "General purpose", 681 | "gpuNum": 0, 682 | "hideHardwareSpecs": false, 683 | "memoryGiB": 16, 684 | "name": "ml.m5d.xlarge", 685 | "vcpuNum": 4 686 | }, 687 | { 688 | "_defaultOrder": 14, 689 | "_isFastLaunch": false, 690 | "category": "General purpose", 691 | "gpuNum": 0, 692 | "hideHardwareSpecs": false, 693 | "memoryGiB": 32, 694 | "name": "ml.m5d.2xlarge", 695 | "vcpuNum": 8 696 | }, 697 | { 698 | "_defaultOrder": 15, 699 | "_isFastLaunch": false, 700 | "category": "General purpose", 701 | "gpuNum": 0, 702 | "hideHardwareSpecs": false, 703 | "memoryGiB": 64, 704 | "name": "ml.m5d.4xlarge", 705 | "vcpuNum": 16 706 | }, 707 | { 708 | "_defaultOrder": 16, 709 | "_isFastLaunch": false, 710 | "category": "General purpose", 711 | "gpuNum": 0, 712 | "hideHardwareSpecs": false, 713 | "memoryGiB": 128, 714 | "name": "ml.m5d.8xlarge", 715 | "vcpuNum": 32 716 | }, 717 | { 718 | "_defaultOrder": 17, 719 | "_isFastLaunch": false, 720 | "category": "General purpose", 721 | "gpuNum": 0, 722 | "hideHardwareSpecs": false, 723 | "memoryGiB": 192, 724 | "name": "ml.m5d.12xlarge", 725 | "vcpuNum": 48 726 | }, 727 | { 728 | "_defaultOrder": 18, 729 | "_isFastLaunch": false, 730 | "category": "General purpose", 731 | "gpuNum": 0, 732 | "hideHardwareSpecs": false, 733 | "memoryGiB": 256, 734 | "name": "ml.m5d.16xlarge", 735 | "vcpuNum": 64 736 | }, 737 | { 738 | "_defaultOrder": 19, 739 | "_isFastLaunch": false, 740 | "category": "General purpose", 741 | "gpuNum": 0, 742 | "hideHardwareSpecs": false, 743 | "memoryGiB": 384, 744 | "name": "ml.m5d.24xlarge", 745 | "vcpuNum": 96 746 | }, 747 | { 748 | "_defaultOrder": 20, 749 | "_isFastLaunch": false, 750 | "category": "General purpose", 751 | "gpuNum": 0, 752 | "hideHardwareSpecs": true, 753 | "memoryGiB": 0, 754 | "name": "ml.geospatial.interactive", 755 | "supportedImageNames": [ 756 | "sagemaker-geospatial-v1-0" 757 | ], 758 | "vcpuNum": 0 
759 | }, 760 | { 761 | "_defaultOrder": 21, 762 | "_isFastLaunch": true, 763 | "category": "Compute optimized", 764 | "gpuNum": 0, 765 | "hideHardwareSpecs": false, 766 | "memoryGiB": 4, 767 | "name": "ml.c5.large", 768 | "vcpuNum": 2 769 | }, 770 | { 771 | "_defaultOrder": 22, 772 | "_isFastLaunch": false, 773 | "category": "Compute optimized", 774 | "gpuNum": 0, 775 | "hideHardwareSpecs": false, 776 | "memoryGiB": 8, 777 | "name": "ml.c5.xlarge", 778 | "vcpuNum": 4 779 | }, 780 | { 781 | "_defaultOrder": 23, 782 | "_isFastLaunch": false, 783 | "category": "Compute optimized", 784 | "gpuNum": 0, 785 | "hideHardwareSpecs": false, 786 | "memoryGiB": 16, 787 | "name": "ml.c5.2xlarge", 788 | "vcpuNum": 8 789 | }, 790 | { 791 | "_defaultOrder": 24, 792 | "_isFastLaunch": false, 793 | "category": "Compute optimized", 794 | "gpuNum": 0, 795 | "hideHardwareSpecs": false, 796 | "memoryGiB": 32, 797 | "name": "ml.c5.4xlarge", 798 | "vcpuNum": 16 799 | }, 800 | { 801 | "_defaultOrder": 25, 802 | "_isFastLaunch": false, 803 | "category": "Compute optimized", 804 | "gpuNum": 0, 805 | "hideHardwareSpecs": false, 806 | "memoryGiB": 72, 807 | "name": "ml.c5.9xlarge", 808 | "vcpuNum": 36 809 | }, 810 | { 811 | "_defaultOrder": 26, 812 | "_isFastLaunch": false, 813 | "category": "Compute optimized", 814 | "gpuNum": 0, 815 | "hideHardwareSpecs": false, 816 | "memoryGiB": 96, 817 | "name": "ml.c5.12xlarge", 818 | "vcpuNum": 48 819 | }, 820 | { 821 | "_defaultOrder": 27, 822 | "_isFastLaunch": false, 823 | "category": "Compute optimized", 824 | "gpuNum": 0, 825 | "hideHardwareSpecs": false, 826 | "memoryGiB": 144, 827 | "name": "ml.c5.18xlarge", 828 | "vcpuNum": 72 829 | }, 830 | { 831 | "_defaultOrder": 28, 832 | "_isFastLaunch": false, 833 | "category": "Compute optimized", 834 | "gpuNum": 0, 835 | "hideHardwareSpecs": false, 836 | "memoryGiB": 192, 837 | "name": "ml.c5.24xlarge", 838 | "vcpuNum": 96 839 | }, 840 | { 841 | "_defaultOrder": 29, 842 | "_isFastLaunch": true, 843 | 
"category": "Accelerated computing", 844 | "gpuNum": 1, 845 | "hideHardwareSpecs": false, 846 | "memoryGiB": 16, 847 | "name": "ml.g4dn.xlarge", 848 | "vcpuNum": 4 849 | }, 850 | { 851 | "_defaultOrder": 30, 852 | "_isFastLaunch": false, 853 | "category": "Accelerated computing", 854 | "gpuNum": 1, 855 | "hideHardwareSpecs": false, 856 | "memoryGiB": 32, 857 | "name": "ml.g4dn.2xlarge", 858 | "vcpuNum": 8 859 | }, 860 | { 861 | "_defaultOrder": 31, 862 | "_isFastLaunch": false, 863 | "category": "Accelerated computing", 864 | "gpuNum": 1, 865 | "hideHardwareSpecs": false, 866 | "memoryGiB": 64, 867 | "name": "ml.g4dn.4xlarge", 868 | "vcpuNum": 16 869 | }, 870 | { 871 | "_defaultOrder": 32, 872 | "_isFastLaunch": false, 873 | "category": "Accelerated computing", 874 | "gpuNum": 1, 875 | "hideHardwareSpecs": false, 876 | "memoryGiB": 128, 877 | "name": "ml.g4dn.8xlarge", 878 | "vcpuNum": 32 879 | }, 880 | { 881 | "_defaultOrder": 33, 882 | "_isFastLaunch": false, 883 | "category": "Accelerated computing", 884 | "gpuNum": 4, 885 | "hideHardwareSpecs": false, 886 | "memoryGiB": 192, 887 | "name": "ml.g4dn.12xlarge", 888 | "vcpuNum": 48 889 | }, 890 | { 891 | "_defaultOrder": 34, 892 | "_isFastLaunch": false, 893 | "category": "Accelerated computing", 894 | "gpuNum": 1, 895 | "hideHardwareSpecs": false, 896 | "memoryGiB": 256, 897 | "name": "ml.g4dn.16xlarge", 898 | "vcpuNum": 64 899 | }, 900 | { 901 | "_defaultOrder": 35, 902 | "_isFastLaunch": false, 903 | "category": "Accelerated computing", 904 | "gpuNum": 1, 905 | "hideHardwareSpecs": false, 906 | "memoryGiB": 61, 907 | "name": "ml.p3.2xlarge", 908 | "vcpuNum": 8 909 | }, 910 | { 911 | "_defaultOrder": 36, 912 | "_isFastLaunch": false, 913 | "category": "Accelerated computing", 914 | "gpuNum": 4, 915 | "hideHardwareSpecs": false, 916 | "memoryGiB": 244, 917 | "name": "ml.p3.8xlarge", 918 | "vcpuNum": 32 919 | }, 920 | { 921 | "_defaultOrder": 37, 922 | "_isFastLaunch": false, 923 | "category": "Accelerated 
computing", 924 | "gpuNum": 8, 925 | "hideHardwareSpecs": false, 926 | "memoryGiB": 488, 927 | "name": "ml.p3.16xlarge", 928 | "vcpuNum": 64 929 | }, 930 | { 931 | "_defaultOrder": 38, 932 | "_isFastLaunch": false, 933 | "category": "Accelerated computing", 934 | "gpuNum": 8, 935 | "hideHardwareSpecs": false, 936 | "memoryGiB": 768, 937 | "name": "ml.p3dn.24xlarge", 938 | "vcpuNum": 96 939 | }, 940 | { 941 | "_defaultOrder": 39, 942 | "_isFastLaunch": false, 943 | "category": "Memory Optimized", 944 | "gpuNum": 0, 945 | "hideHardwareSpecs": false, 946 | "memoryGiB": 16, 947 | "name": "ml.r5.large", 948 | "vcpuNum": 2 949 | }, 950 | { 951 | "_defaultOrder": 40, 952 | "_isFastLaunch": false, 953 | "category": "Memory Optimized", 954 | "gpuNum": 0, 955 | "hideHardwareSpecs": false, 956 | "memoryGiB": 32, 957 | "name": "ml.r5.xlarge", 958 | "vcpuNum": 4 959 | }, 960 | { 961 | "_defaultOrder": 41, 962 | "_isFastLaunch": false, 963 | "category": "Memory Optimized", 964 | "gpuNum": 0, 965 | "hideHardwareSpecs": false, 966 | "memoryGiB": 64, 967 | "name": "ml.r5.2xlarge", 968 | "vcpuNum": 8 969 | }, 970 | { 971 | "_defaultOrder": 42, 972 | "_isFastLaunch": false, 973 | "category": "Memory Optimized", 974 | "gpuNum": 0, 975 | "hideHardwareSpecs": false, 976 | "memoryGiB": 128, 977 | "name": "ml.r5.4xlarge", 978 | "vcpuNum": 16 979 | }, 980 | { 981 | "_defaultOrder": 43, 982 | "_isFastLaunch": false, 983 | "category": "Memory Optimized", 984 | "gpuNum": 0, 985 | "hideHardwareSpecs": false, 986 | "memoryGiB": 256, 987 | "name": "ml.r5.8xlarge", 988 | "vcpuNum": 32 989 | }, 990 | { 991 | "_defaultOrder": 44, 992 | "_isFastLaunch": false, 993 | "category": "Memory Optimized", 994 | "gpuNum": 0, 995 | "hideHardwareSpecs": false, 996 | "memoryGiB": 384, 997 | "name": "ml.r5.12xlarge", 998 | "vcpuNum": 48 999 | }, 1000 | { 1001 | "_defaultOrder": 45, 1002 | "_isFastLaunch": false, 1003 | "category": "Memory Optimized", 1004 | "gpuNum": 0, 1005 | "hideHardwareSpecs": false, 1006 | 
"memoryGiB": 512, 1007 | "name": "ml.r5.16xlarge", 1008 | "vcpuNum": 64 1009 | }, 1010 | { 1011 | "_defaultOrder": 46, 1012 | "_isFastLaunch": false, 1013 | "category": "Memory Optimized", 1014 | "gpuNum": 0, 1015 | "hideHardwareSpecs": false, 1016 | "memoryGiB": 768, 1017 | "name": "ml.r5.24xlarge", 1018 | "vcpuNum": 96 1019 | }, 1020 | { 1021 | "_defaultOrder": 47, 1022 | "_isFastLaunch": false, 1023 | "category": "Accelerated computing", 1024 | "gpuNum": 1, 1025 | "hideHardwareSpecs": false, 1026 | "memoryGiB": 16, 1027 | "name": "ml.g5.xlarge", 1028 | "vcpuNum": 4 1029 | }, 1030 | { 1031 | "_defaultOrder": 48, 1032 | "_isFastLaunch": false, 1033 | "category": "Accelerated computing", 1034 | "gpuNum": 1, 1035 | "hideHardwareSpecs": false, 1036 | "memoryGiB": 32, 1037 | "name": "ml.g5.2xlarge", 1038 | "vcpuNum": 8 1039 | }, 1040 | { 1041 | "_defaultOrder": 49, 1042 | "_isFastLaunch": false, 1043 | "category": "Accelerated computing", 1044 | "gpuNum": 1, 1045 | "hideHardwareSpecs": false, 1046 | "memoryGiB": 64, 1047 | "name": "ml.g5.4xlarge", 1048 | "vcpuNum": 16 1049 | }, 1050 | { 1051 | "_defaultOrder": 50, 1052 | "_isFastLaunch": false, 1053 | "category": "Accelerated computing", 1054 | "gpuNum": 1, 1055 | "hideHardwareSpecs": false, 1056 | "memoryGiB": 128, 1057 | "name": "ml.g5.8xlarge", 1058 | "vcpuNum": 32 1059 | }, 1060 | { 1061 | "_defaultOrder": 51, 1062 | "_isFastLaunch": false, 1063 | "category": "Accelerated computing", 1064 | "gpuNum": 1, 1065 | "hideHardwareSpecs": false, 1066 | "memoryGiB": 256, 1067 | "name": "ml.g5.16xlarge", 1068 | "vcpuNum": 64 1069 | }, 1070 | { 1071 | "_defaultOrder": 52, 1072 | "_isFastLaunch": false, 1073 | "category": "Accelerated computing", 1074 | "gpuNum": 4, 1075 | "hideHardwareSpecs": false, 1076 | "memoryGiB": 192, 1077 | "name": "ml.g5.12xlarge", 1078 | "vcpuNum": 48 1079 | }, 1080 | { 1081 | "_defaultOrder": 53, 1082 | "_isFastLaunch": false, 1083 | "category": "Accelerated computing", 1084 | "gpuNum": 4, 1085 | 
"hideHardwareSpecs": false, 1086 | "memoryGiB": 384, 1087 | "name": "ml.g5.24xlarge", 1088 | "vcpuNum": 96 1089 | }, 1090 | { 1091 | "_defaultOrder": 54, 1092 | "_isFastLaunch": false, 1093 | "category": "Accelerated computing", 1094 | "gpuNum": 8, 1095 | "hideHardwareSpecs": false, 1096 | "memoryGiB": 768, 1097 | "name": "ml.g5.48xlarge", 1098 | "vcpuNum": 192 1099 | }, 1100 | { 1101 | "_defaultOrder": 55, 1102 | "_isFastLaunch": false, 1103 | "category": "Accelerated computing", 1104 | "gpuNum": 8, 1105 | "hideHardwareSpecs": false, 1106 | "memoryGiB": 1152, 1107 | "name": "ml.p4d.24xlarge", 1108 | "vcpuNum": 96 1109 | }, 1110 | { 1111 | "_defaultOrder": 56, 1112 | "_isFastLaunch": false, 1113 | "category": "Accelerated computing", 1114 | "gpuNum": 8, 1115 | "hideHardwareSpecs": false, 1116 | "memoryGiB": 1152, 1117 | "name": "ml.p4de.24xlarge", 1118 | "vcpuNum": 96 1119 | }, 1120 | { 1121 | "_defaultOrder": 57, 1122 | "_isFastLaunch": false, 1123 | "category": "Accelerated computing", 1124 | "gpuNum": 0, 1125 | "hideHardwareSpecs": false, 1126 | "memoryGiB": 32, 1127 | "name": "ml.trn1.2xlarge", 1128 | "vcpuNum": 8 1129 | }, 1130 | { 1131 | "_defaultOrder": 58, 1132 | "_isFastLaunch": false, 1133 | "category": "Accelerated computing", 1134 | "gpuNum": 0, 1135 | "hideHardwareSpecs": false, 1136 | "memoryGiB": 512, 1137 | "name": "ml.trn1.32xlarge", 1138 | "vcpuNum": 128 1139 | }, 1140 | { 1141 | "_defaultOrder": 59, 1142 | "_isFastLaunch": false, 1143 | "category": "Accelerated computing", 1144 | "gpuNum": 0, 1145 | "hideHardwareSpecs": false, 1146 | "memoryGiB": 512, 1147 | "name": "ml.trn1n.32xlarge", 1148 | "vcpuNum": 128 1149 | } 1150 | ], 1151 | "instance_type": "ml.t3.medium", 1152 | "kernelspec": { 1153 | "display_name": "Python 3 (ipykernel)", 1154 | "language": "python", 1155 | "name": "python3" 1156 | }, 1157 | "language_info": { 1158 | "codemirror_mode": { 1159 | "name": "ipython", 1160 | "version": 3 1161 | }, 1162 | "file_extension": ".py", 1163 | 
"mimetype": "text/x-python", 1164 | "name": "python", 1165 | "nbconvert_exporter": "python", 1166 | "pygments_lexer": "ipython3", 1167 | "version": "3.10.13" 1168 | } 1169 | }, 1170 | "nbformat": 4, 1171 | "nbformat_minor": 5 1172 | } 1173 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | **Update** Amazon Bedrock now supports LLM-as-a-judge capabilities - [AWS Blog Post](https://aws.amazon.com/blogs/aws/new-rag-evaluation-and-llm-as-a-judge-capabilities-in-amazon-bedrock/) 2 | 3 | # Evaluating Large Language Models using LLM-as-a-Judge 4 | 5 | Evaluating large language models (LLM) is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, strong LLMs are used as judges to evaluate these models on more open-ended questions. The agreement between LLM judges and human preferences has been verified by introducing two benchmarks: Multi Turn (MT)-bench, a multi-turn question set, and Chatbot Arena, a crowdsourced battle platform. The results reveal that strong LLM judges can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans This makes LLM-as-a-judge a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain. 6 | 7 | > ℹ️ **Note:** The evaluation steps in this lab are based on the paper [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/pdf/2306.05685.pdf). 8 | 9 | This lab addresses this challenge by providing a practical solution for evaluating LLMs using LLM-as-a-Judge with Amazon Bedrock. This is relevant for developers and researchers working on evaluating LLM based applications. 
The notebook guides you through using MT-Bench questions to generate test answers and evaluating them with single-answer grading, using the Bedrock API, Python, and LangChain. The notebook consists of the following chapters: 10 | 11 | 1) Set up the environment 12 | 2) Load MT-Bench questions 13 | 3) Generate test answers from the LLM to be evaluated 14 | 4) Evaluate the answers with a strong LLM-as-a-judge 15 | 5) Generate an explanation for the average rating score 16 | 17 | 18 | ## Getting started 19 | 20 | ### Choose a notebook environment 21 | 22 | This lab is presented as a **Python notebook**, which you can run from the environment of your choice: 23 | 24 | - [SageMaker Studio](https://aws.amazon.com/sagemaker/studio/) is a web-based integrated development environment (IDE) for machine learning. To get started quickly, refer to the [instructions for domain quick setup](https://docs.aws.amazon.com/sagemaker/latest/dg/onboard-quick-start.html). 25 | - [SageMaker Notebook Instance](https://docs.aws.amazon.com/sagemaker/latest/dg/howitworks-create-ws.html) is a machine learning (ML) compute instance running the Jupyter Notebook App. 26 | - To use your existing (local or other) notebook environment, make sure it has [credentials for calling AWS](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html). 27 | 28 | 29 | ### Enable AWS IAM permissions for Bedrock 30 | 31 | The AWS identity you assume from your notebook environment (the [*Studio/notebook Execution Role*](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) from SageMaker, or a role or IAM User for self-managed notebooks) must have sufficient [AWS IAM permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies.html) to call the Amazon Bedrock service.
32 | 33 | To grant Bedrock access to your identity: 34 | 35 | - Open the [AWS IAM Console](https://us-east-1.console.aws.amazon.com/iam/home?#) 36 | - Find your [Role](https://us-east-1.console.aws.amazon.com/iamv2/home?#/roles) (if using SageMaker or otherwise assuming an IAM Role), or else [User](https://us-east-1.console.aws.amazon.com/iamv2/home?#/users) 37 | - Select *Add Permissions > Create Inline Policy* to attach new inline permissions, open the *JSON* editor and paste in the example policy below: 38 | 39 | ``` 40 | { 41 | "Version": "2012-10-17", 42 | "Statement": { 43 | "Sid": "AllowInference", 44 | "Effect": "Allow", 45 | "Action": [ 46 | "bedrock:InvokeModel" 47 | ], 48 | "Resource": "arn:aws:bedrock:*::foundation-model/*" 49 | } 50 | } 51 | ``` 52 | 53 | > ℹ️ **Note:** With Amazon SageMaker, your notebook execution role is typically *separate* from the user or role that you log in to the AWS Console with. If you want to explore the AWS Console for Amazon Bedrock, you need to grant permissions to your Console user/role too. You can run the notebooks anywhere as long as you have access to the Amazon Bedrock service and have appropriate credentials. 54 | 55 | For more information on the fine-grained action and resource permissions in Bedrock, check out the Bedrock Developer Guide. 56 | 57 | 58 | ### Clone and use the notebooks 59 | 60 | > ℹ️ **Note:** In SageMaker Studio, you can open a "System Terminal" to run these commands by clicking *File > New > Terminal* 61 | 62 | Once your notebook environment is set up, clone this workshop repository into it. 63 | 64 | ```sh 65 | sudo yum install -y unzip 66 | git clone git@github.com:aws-samples/evaluating-large-language-models-using-llm-as-a-judge.git 67 | cd evaluating-large-language-models-using-llm-as-a-judge 68 | ``` 69 | 70 | You're now ready to explore the lab notebook! You will be guided through connecting the notebook to Amazon Bedrock for large language model access.
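
As a rough illustration of the single-answer grading step that the notebook walks through, the sketch below builds a judge prompt, extracts the numeric rating from a judge's reply, and averages the ratings. It is a minimal sketch and not the notebook's actual code: the template wording and the names `JUDGE_TEMPLATE`, `build_judge_prompt`, `parse_rating`, and `average_rating` are assumptions made for this example. The `[[rating]]` output format follows the convention described in the MT-Bench paper. No call to Amazon Bedrock is made here; the judge replies are passed in as plain strings.

```python
import re
from statistics import mean

# Hypothetical prompt template, loosely modeled on the single-answer grading
# prompt described in the MT-Bench paper; the notebook's wording may differ.
JUDGE_TEMPLATE = (
    "Please act as an impartial judge and evaluate the quality of the response "
    "provided by an AI assistant to the user question displayed below. "
    "Rate the response on a scale of 1 to 10 by strictly following this format: "
    '"[[rating]]", for example: "Rating: [[5]]".\n\n'
    "[Question]\n{question}\n\n"
    "[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]"
)

# Matches a rating wrapped in double square brackets, e.g. [[7]] or [[7.5]].
RATING_PATTERN = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")


def build_judge_prompt(question, answer):
    """Fill the judge template with one question and the answer to grade."""
    return JUDGE_TEMPLATE.format(question=question, answer=answer)


def parse_rating(judgment):
    """Return the first [[rating]] in the judge's reply as a float.

    Returns None if no rating is found or if it falls outside 1 to 10.
    """
    match = RATING_PATTERN.search(judgment)
    if match is None:
        return None
    rating = float(match.group(1))
    return rating if 1 <= rating <= 10 else None


def average_rating(judgments):
    """Average all parseable ratings; return None if there are none."""
    ratings = [r for r in map(parse_rating, judgments) if r is not None]
    return mean(ratings) if ratings else None
```

In practice, the string returned by `build_judge_prompt` would be sent to the judge model, and each reply would be passed to `parse_rating`. Replies without a usable rating are skipped rather than counted as zero, so a malformed judgment does not drag the average down.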
71 | 72 | 73 | ## Contributing 74 | 75 | See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information. 76 | 77 | ## License 78 | This library is licensed under the MIT-0 License. See the [LICENSE](LICENSE) file. 79 | 80 | --------------------------------------------------------------------------------