├── PROMPTS.md ├── README.md ├── accuracy_bar_chart.png ├── accuracy_bar_chart_progression.png ├── parameter_summary_prompt_007.csv ├── publication_tables.ipynb ├── run_exam.py ├── sample_session_log.html └── score_exam.py /PROMPTS.md: -------------------------------------------------------------------------------- 1 | # Prompt Styles 2 | 3 | ### Single choice only 4 | ``` 5 | Imagine you are answering a Bar Exam question related to {row["question_category"]}. 6 | Please respond with this format: 7 | Answer: 8 | 9 | Question: {question_text} 10 | (A) {row["choice_a"].strip()} 11 | (B) {row["choice_b"].strip()} 12 | (C) {row["choice_c"].strip()} 13 | (D) {row["choice_d"].strip()} 14 | ``` 15 | 16 | ### Single choice and explanation 17 | ``` 18 | Imagine you are answering a Bar Exam question related to {row["question_category"]}. 19 | Please respond with this format: 20 | Answer: 21 | Reason: 22 | 23 | Question: {question_text} 24 | (A) {row["choice_a"].strip()} 25 | (B) {row["choice_b"].strip()} 26 | (C) {row["choice_c"].strip()} 27 | (D) {row["choice_d"].strip()} 28 | ``` 29 | 30 | ### Top two choices only 31 | ``` 32 | Imagine you are answering a Bar Exam question related to {row["question_category"]}. 33 | Please respond with this format: 34 | Answer: 35 | Backup Answer: 36 | 37 | Question: {question_text} 38 | (A) {row["choice_a"].strip()} 39 | (B) {row["choice_b"].strip()} 40 | (C) {row["choice_c"].strip()} 41 | (D) {row["choice_d"].strip()} 42 | ``` 43 | 44 | ### Top two choices and explanation 45 | ``` 46 | Imagine you are answering a Bar Exam question related to {row["question_category"]}. 
47 | Please respond with this format: 48 | Answer: 49 | Backup Answer: 50 | Reason: 51 | 52 | Question: {question_text} 53 | (A) {row["choice_a"].strip()} 54 | (B) {row["choice_b"].strip()} 55 | (C) {row["choice_c"].strip()} 56 | (D) {row["choice_d"].strip()} 57 | ``` 58 | 59 | ### Top two choices and re-prompt 60 | * Initial Prompt 61 | ``` 62 | Please answer the following Bar Exam question in the following format: 63 | First Choice: 64 | Second Choice: 65 | 66 | Question: {question_text} 67 | (A) {row["choice_a"].strip()} 68 | (B) {row["choice_b"].strip()} 69 | (C) {row["choice_c"].strip()} 70 | (D) {row["choice_d"].strip()} 71 | ``` 72 | 73 | * Re-prompt 74 | ``` 75 | Please answer the following Bar Exam question in the following format: 76 | Choice: 77 | 78 | Question: {question_text} 79 | (A) {row["first_choice"].strip()} 80 | (B) {row["second_choice"].strip()} 81 | ``` 82 | 83 | 84 | ### Rank order all choices 85 | ``` 86 | Please answer the following Bar Exam question in the following rank order format: 87 | First Choice: 88 | Second Choice: 89 | Third Choice: 90 | Fourth Choice: 91 | 92 | Question: {question_text} 93 | (A) {row["choice_a"].strip()} 94 | (B) {row["choice_b"].strip()} 95 | (C) {row["choice_c"].strip()} 96 | (D) {row["choice_d"].strip()} 97 | ``` 98 | 99 | ### Rank order top three choices 100 | ``` 101 | Please answer the following Bar Exam question in the following rank order format: 102 | First Choice: 103 | Second Choice: 104 | Third Choice: 105 | 106 | Question: {question_text} 107 | (A) {row["choice_a"].strip()} 108 | (B) {row["choice_b"].strip()} 109 | (C) {row["choice_c"].strip()} 110 | (D) {row["choice_d"].strip()} 111 | ``` 112 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | GPT Takes the Bar - Supplementary Information 2 | ================== 3 | * __N.B.__: This is a preprint. 
4 | * __Title__: GPT Takes the Bar 5 | * __Authors__: [Michael Bommarito](https://www.linkedin.com/in/bommarito/), [Daniel Martin Katz](https://www.linkedin.com/in/daniel-katz-3b001539/) 6 | * __Publication URL__: [arXiv:2212.14402](https://arxiv.org/abs/2212.14402), [SSRN](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4314839) 7 | * __Publication Date__: 2022-12-29 8 | 9 | ## Abstract 10 | ``` 11 | Nearly all jurisdictions in the United States require a professional license exam, commonly referred to as 12 | “the Bar Exam,” as a precondition for law practice. To even sit for the exam, most jurisdictions require 13 | that an applicant completes at least seven years of post-secondary education, including three years at an 14 | accredited law school. In addition, most test-takers also undergo weeks to months of further, exam-specific 15 | preparation. Despite this significant investment of time and capital, approximately one in five test-takers 16 | still score under the rate required to pass the exam on their first try. In the face of a complex task that 17 | requires such depth of knowledge, what, then, should we expect of the state of the art in “AI?” In this 18 | research, we document our experimental evaluation of the performance of OpenAI’s text-davinci-003 model, 19 | often-referred to as GPT-3.5, on the multistate multiple choice (MBE) section of the exam. While we find no 20 | benefit in fine-tuning over GPT-3.5’s zero-shot performance at the scale of our training data, we do find that 21 | hyperparameter optimization and prompt engineering positively impacted GPT-3.5’s zero-shot performance. For 22 | best prompt and parameters, GPT-3.5 achieves a headline correct rate of 50.3% on a complete NCBE MBE 23 | practice exam, significantly in excess of the 25% baseline guessing rate, and performs at a passing rate 24 | for both Evidence and Torts. 
GPT-3.5’s ranking of responses is also highly correlated with correctness; 25 | its top two and top three choices are correct 71% and 88% of the time, respectively, indicating very strong 26 | non-entailment performance. While our ability to interpret these results is limited by nascent scientific 27 | understanding of LLMs and the proprietary nature of GPT, we believe that these results strongly suggest that 28 | an LLM will pass the MBE component of the Bar Exam in the near future. 29 | ``` 30 | 31 | ### Table of Contents 32 | 33 | * [Jupyter Notebook with Tables and Figures](publication_tables.ipynb) 34 | * [Prompt Examples](PROMPTS.md) 35 | * [Example Session Log](sample_session_log.html) 36 | 37 | ## Progression of Models over Time 38 | 39 | 40 | 41 | 42 | 43 | ## `text-davinci-003` Performance by Question Category 44 | 45 | 46 | 47 | 48 | 49 | 50 | -------------------------------------------------------------------------------- /accuracy_bar_chart.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mjbommar/gpt-takes-the-bar-exam/f20fc42e9e0d3f8318394c62b828a8b3211d180a/accuracy_bar_chart.png -------------------------------------------------------------------------------- /accuracy_bar_chart_progression.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/mjbommar/gpt-takes-the-bar-exam/f20fc42e9e0d3f8318394c62b828a8b3211d180a/accuracy_bar_chart_progression.png -------------------------------------------------------------------------------- /parameter_summary_prompt_007.csv: -------------------------------------------------------------------------------- 1 | Temperature,Top P,Best Of,Count,Correct Mean,Top Two Correct Mean,Top Three Correct Mean 2 | 1.0,0.75,2,3,0.5119617224880383,0.7049441786283892,0.8755980861244019 3 | 0.0,1.0,2,1,0.507177033492823,0.7177033492822966,0.8899521531100478 4 | 
0.5,0.75,4,3,0.507177033492823,0.7097288676236044,0.8803827751196173 5 | 0.5,1.0,4,3,0.5039872408293461,0.7145135566188198,0.8803827751196173 6 | 0.0,0.75,2,1,0.5023923444976076,0.722488038277512,0.8947368421052632 7 | 0.5,1.0,1,3,0.5007974481658692,0.7113237639553429,0.8787878787878788 8 | 0.5,1.0,2,3,0.5007974481658692,0.7145135566188198,0.8899521531100478 9 | 1.0,0.75,1,3,0.5007974481658692,0.7145135566188198,0.8803827751196173 10 | 0.5,0.75,1,3,0.49920255183413076,0.7017543859649122,0.8771929824561403 11 | 0.5,0.75,2,3,0.49920255183413076,0.7113237639553429,0.8851674641148325 12 | 1.0,1.0,4,3,0.49920255183413076,0.7097288676236044,0.8787878787878788 13 | 0.0,0.75,1,1,0.49760765550239233,0.7177033492822966,0.8899521531100478 14 | 1.0,0.75,4,3,0.49760765550239233,0.7113237639553429,0.8740031897926634 15 | 1.0,1.0,2,3,0.49760765550239233,0.7129186602870813,0.8692185007974481 16 | 0.0,1.0,1,2,0.49282296650717705,0.715311004784689,0.8875598086124402 17 | 1.0,1.0,1,3,0.4800637958532695,0.6858054226475279,0.8389154704944178 18 | -------------------------------------------------------------------------------- /publication_tables.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 2, 6 | "id": "970c359a-ab35-4699-b52d-30ddf96b2148", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "# imports\n", 11 | "import sys\n", 12 | "\n", 13 | "# relative to project root\n", 14 | "sys.path.append(\"publication/\")\n", 15 | "from session_data import *\n", 16 | "\n", 17 | "# packages\n", 18 | "import pandas\n", 19 | "from IPython.display import display, display_html, display_latex" 20 | ] 21 | }, 22 | { 23 | "cell_type": "code", 24 | "execution_count": 5, 25 | "id": "b345e836-9540-4d3f-af21-9fa656503586", 26 | "metadata": {}, 27 | "outputs": [ 28 | { 29 | "data": { 30 | "text/plain": " Exam Session ID Question Category Question Number GPT Answer \\\n0 bar-exam-001 
Civil Procedure 1 D \n1 bar-exam-001 Civil Procedure 2 D \n2 bar-exam-001 Civil Procedure 3 C \n3 bar-exam-001 Civil Procedure 4 C \n4 bar-exam-001 Civil Procedure 5 C \n\n GPT Second Answer GPT Third Answer Correct Answer Correct Second Correct \\\n0 A B D True False \n1 B A D True False \n2 D A D False True \n3 D B A False False \n4 D B C True False \n\n Third Correct Top Two Correct Top Three Correct Temperature Max Tokens \\\n0 False True True 0.0 16 \n1 False True True 0.0 16 \n2 False True True 0.0 16 \n3 False False False 0.0 16 \n4 False True True 0.0 16 \n\n Top P Best Of Frequency Penalty Presence Penalty Session Duration \n0 1.0 1 0 0 208.769812 \n1 1.0 1 0 0 208.769812 \n2 1.0 1 0 0 208.769812 \n3 1.0 1 0 0 208.769812 \n4 1.0 1 0 0 208.769812 ", 31 | "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Exam Session IDQuestion CategoryQuestion NumberGPT AnswerGPT Second AnswerGPT Third AnswerCorrect AnswerCorrectSecond CorrectThird CorrectTop Two CorrectTop Three CorrectTemperatureMax TokensTop PBest OfFrequency PenaltyPresence PenaltySession Duration
0bar-exam-001Civil Procedure1DABDTrueFalseFalseTrueTrue0.0161.0100208.769812
1bar-exam-001Civil Procedure2DBADTrueFalseFalseTrueTrue0.0161.0100208.769812
2bar-exam-001Civil Procedure3CDADFalseTrueFalseTrueTrue0.0161.0100208.769812
3bar-exam-001Civil Procedure4CDBAFalseFalseFalseFalseFalse0.0161.0100208.769812
4bar-exam-001Civil Procedure5CDBCTrueFalseFalseTrueTrue0.0161.0100208.769812
\n
" 32 | }, 33 | "execution_count": 5, 34 | "metadata": {}, 35 | "output_type": "execute_result" 36 | } 37 | ], 38 | "source": [ 39 | "# read all session data\n", 40 | "session_df = get_session_data()\n", 41 | "session_df.head()" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "id": "a24a8bde-bc71-496f-8d0a-bd1904556868", 47 | "metadata": {}, 48 | "source": [ 49 | "## Headline Accuracy" 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": 22, 55 | "id": "fb5d1057-9ec5-4356-9c6b-50cf0dc44a59", 56 | "metadata": {}, 57 | "outputs": [ 58 | { 59 | "name": "stdout", 60 | "output_type": "stream", 61 | "text": [ 62 | "\\begin{tabular}{lr}\n", 63 | " & Accuracy (%) \\\\\n", 64 | "Correct Rate & 49.970000 \\\\\n", 65 | "Top Two Correct Rate & 70.970000 \\\\\n", 66 | "Top Three Correct Rate & 87.750000 \\\\\n", 67 | "\\end{tabular}\n", 68 | "\n" 69 | ] 70 | }, 71 | { 72 | "data": { 73 | "text/plain": " Accuracy (%)\nCorrect Rate 50\nTop Two Correct Rate 71\nTop Three Correct Rate 88", 74 | "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Accuracy (%)
Correct Rate50
Top Two Correct Rate71
Top Three Correct Rate88
\n
" 75 | }, 76 | "metadata": {}, 77 | "output_type": "display_data" 78 | } 79 | ], 80 | "source": [ 81 | "performance_df = pandas.DataFrame({\n", 82 | " \"Correct Rate\": session_df[\"Correct\"].mean() * 100.0,\n", 83 | " \"Top Two Correct Rate\": session_df[\"Top Two Correct\"].mean() * 100.0,\n", 84 | " \"Top Three Correct Rate\": session_df[\"Top Three Correct\"].mean() * 100.0\n", 85 | "}, index=[\"Accuracy (%)\"]).T\n", 86 | "\n", 87 | "with pandas.option_context(\"float_format\", \"{:2.0f}\".format):\n", 88 | " print(performance_df.round(2).style.to_latex())\n", 89 | " display(performance_df)" 90 | ] 91 | }, 92 | { 93 | "cell_type": "markdown", 94 | "id": "8c3015df-158f-4ef0-ba3e-87dc3aeaf97f", 95 | "metadata": {}, 96 | "source": [ 97 | "## NCBE Rates" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": 23, 103 | "id": "9074f5f6-8f9b-400a-a56e-f7917a164103", 104 | "metadata": {}, 105 | "outputs": [ 106 | { 107 | "data": { 108 | "text/plain": " Accuracy (%)\nCivil Procedure 59.0\nConstitutional Law 72.0\nContracts 70.0\nCriminal Law and Procedure 71.0\nEvidence 65.0\nReal Property 65.0\nTorts 71.0", 109 | "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Accuracy (%)
Civil Procedure59.0
Constitutional Law72.0
Contracts70.0
Criminal Law and Procedure71.0
Evidence65.0
Real Property65.0
Torts71.0
\n
" 110 | }, 111 | "execution_count": 23, 112 | "metadata": {}, 113 | "output_type": "execute_result" 114 | } 115 | ], 116 | "source": [ 117 | "ncbe_df = pandas.DataFrame(pandas.Series(NCBE_CATEGORY_CORRECT_RATES) * 100.0, columns=[\"Accuracy (%)\"])\n", 118 | "ncbe_df" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "id": "49dec6fe-fc6e-44e4-addd-597019a41b91", 124 | "metadata": {}, 125 | "source": [ 126 | "## Accuracy by Question Category" 127 | ] 128 | }, 129 | { 130 | "cell_type": "code", 131 | "execution_count": 29, 132 | "id": "e24d425f-6d59-4e04-9285-603d7c21310c", 133 | "metadata": {}, 134 | "outputs": [ 135 | { 136 | "name": "stdout", 137 | "output_type": "stream", 138 | "text": [ 139 | "\\begin{tabular}{lrrrr}\n", 140 | " & Correct Rate & Top Two Correct Rate & Top Three Correct Rate & NCBE Rate \\\\\n", 141 | "Evidence & 62.760000 & 84.470000 & 98.050000 & 65.000000 \\\\\n", 142 | "Torts & 61.650000 & 71.830000 & 93.860000 & 71.000000 \\\\\n", 143 | "Civil Procedure & 52.030000 & 62.680000 & 78.700000 & 59.000000 \\\\\n", 144 | "Constitutional Law & 49.020000 & 66.750000 & 86.830000 & 72.000000 \\\\\n", 145 | "Real Property & 44.960000 & 71.630000 & 84.800000 & 65.000000 \\\\\n", 146 | "Contracts & 44.720000 & 77.320000 & 85.850000 & 70.000000 \\\\\n", 147 | "Criminal Law and Procedure & 35.040000 & 62.110000 & 86.340000 & 71.000000 \\\\\n", 148 | "\\end{tabular}\n", 149 | "\n" 150 | ] 151 | }, 152 | { 153 | "data": { 154 | "text/plain": " Correct Rate Top Two Correct Rate \\\nEvidence 63 84 \nTorts 62 72 \nCivil Procedure 52 63 \nConstitutional Law 49 67 \nReal Property 45 72 \nContracts 45 77 \nCriminal Law and Procedure 35 62 \n\n Top Three Correct Rate NCBE Rate \nEvidence 98 65 \nTorts 94 71 \nCivil Procedure 79 59 \nConstitutional Law 87 72 \nReal Property 85 65 \nContracts 86 70 \nCriminal Law and Procedure 86 71 ", 155 | "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Correct RateTop Two Correct RateTop Three Correct RateNCBE Rate
Evidence63849865
Torts62729471
Civil Procedure52637959
Constitutional Law49678772
Real Property45728565
Contracts45778670
Criminal Law and Procedure35628671
\n
" 156 | }, 157 | "metadata": {}, 158 | "output_type": "display_data" 159 | } 160 | ], 161 | "source": [ 162 | "performance_by_category_df = pandas.DataFrame({\n", 163 | " \"Correct Rate\": session_df.groupby(\"Question Category\")[\"Correct\"].mean() * 100.0,\n", 164 | " \"Top Two Correct Rate\": session_df.groupby(\"Question Category\")[\"Top Two Correct\"].mean() * 100.0,\n", 165 | " \"Top Three Correct Rate\": session_df.groupby(\"Question Category\")[\"Top Three Correct\"].mean() * 100.0,\n", 166 | " \"NCBE Rate\": ncbe_df[\"Accuracy (%)\"],\n", 167 | "})\\\n", 168 | " .sort_values(\"Correct Rate\", ascending=False)\n", 169 | "\n", 170 | "\n", 171 | "with pandas.option_context(\"float_format\", \"{:2.0f}\".format):\n", 172 | " print(performance_by_category_df.round(2).style.to_latex())\n", 173 | " display(performance_by_category_df)" 174 | ] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "source": [ 179 | "## Hyperparameters - Temperature" 180 | ], 181 | "metadata": { 182 | "collapsed": false 183 | } 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 31, 188 | "id": "a43319e0-5c42-4045-b8cd-d40ec424c0e8", 189 | "metadata": {}, 190 | "outputs": [ 191 | { 192 | "name": "stdout", 193 | "output_type": "stream", 194 | "text": [ 195 | "\\begin{tabular}{lrrrr}\n", 196 | " & Correct Rate & Top Two Correct Rate & Top Three Correct Rate & Samples \\\\\n", 197 | "Temperature & & & & \\\\\n", 198 | "0.000000 & 49.860000 & 71.770000 & 89.000000 & 500.000000 \\\\\n", 199 | "0.500000 & 50.190000 & 71.050000 & 88.200000 & 1800.000000 \\\\\n", 200 | "1.000000 & 49.790000 & 70.650000 & 86.950000 & 1800.000000 \\\\\n", 201 | "\\end{tabular}\n", 202 | "\n" 203 | ] 204 | }, 205 | { 206 | "data": { 207 | "text/plain": " Correct Rate Top Two Correct Rate Top Three Correct Rate \\\nTemperature \n0.00% 49.86% 71.77% 89.00% \n50.00% 50.19% 71.05% 88.20% \n100.00% 49.79% 70.65% 86.95% \n\n Samples \nTemperature \n0.00% 5 \n50.00% 18 \n100.00% 18 ", 208 | 
"text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Correct RateTop Two Correct RateTop Three Correct RateSamples
Temperature
0.00%49.86%71.77%89.00%5
50.00%50.19%71.05%88.20%18
100.00%49.79%70.65%86.95%18
\n
" 209 | }, 210 | "metadata": {}, 211 | "output_type": "display_data" 212 | } 213 | ], 214 | "source": [ 215 | "performance_by_temperature_df = pandas.DataFrame({\n", 216 | " \"Correct Rate\": session_df.groupby(\"Temperature\")[\"Correct\"].mean(),\n", 217 | " \"Top Two Correct Rate\": session_df.groupby(\"Temperature\")[\"Top Two Correct\"].mean(),\n", 218 | " \"Top Three Correct Rate\": session_df.groupby(\"Temperature\")[\"Top Three Correct\"].mean(),\n", 219 | " \"Samples\": session_df.groupby(\"Temperature\")[\"Exam Session ID\"].nunique(),\n", 220 | "})\\\n", 221 | " .sort_values(\"Temperature\", ascending=True)\n", 222 | "\n", 223 | "\n", 224 | "with pandas.option_context(\"float_format\", \"{:.2%}\".format):\n", 225 | " print((100.0 * performance_by_temperature_df).round(2).style.to_latex())\n", 226 | " display(performance_by_temperature_df)" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "source": [ 232 | "## Hyperparameters - Best Of" 233 | ], 234 | "metadata": { 235 | "collapsed": false 236 | } 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 32, 241 | "id": "5df84467-9190-4d88-951d-f992e517bf66", 242 | "metadata": {}, 243 | "outputs": [ 244 | { 245 | "name": "stdout", 246 | "output_type": "stream", 247 | "text": [ 248 | "\\begin{tabular}{lrrrr}\n", 249 | " & Correct Rate & Top Two Correct Rate & Top Three Correct Rate & Samples \\\\\n", 250 | "Best Of & & & & \\\\\n", 251 | "1 & 49.510000 & 70.590000 & 87.270000 & 1500.000000 \\\\\n", 252 | "2 & 50.270000 & 71.220000 & 88.170000 & 1400.000000 \\\\\n", 253 | "4 & 50.200000 & 71.130000 & 87.840000 & 1200.000000 \\\\\n", 254 | "\\end{tabular}\n", 255 | "\n" 256 | ] 257 | }, 258 | { 259 | "data": { 260 | "text/plain": " Correct Rate Top Two Correct Rate Top Three Correct Rate Samples\nBest Of \n1 49.51% 70.59% 87.27% 15\n2 50.27% 71.22% 88.17% 14\n4 50.20% 71.13% 87.84% 12", 261 | "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Correct RateTop Two Correct RateTop Three Correct RateSamples
Best Of
149.51%70.59%87.27%15
250.27%71.22%88.17%14
450.20%71.13%87.84%12
\n
" 262 | }, 263 | "metadata": {}, 264 | "output_type": "display_data" 265 | } 266 | ], 267 | "source": [ 268 | "performance_by_bestof_df = pandas.DataFrame({\n", 269 | " \"Correct Rate\": session_df.groupby(\"Best Of\")[\"Correct\"].mean(),\n", 270 | " \"Top Two Correct Rate\": session_df.groupby(\"Best Of\")[\"Top Two Correct\"].mean(),\n", 271 | " \"Top Three Correct Rate\": session_df.groupby(\"Best Of\")[\"Top Three Correct\"].mean(),\n", 272 | " \"Samples\": session_df.groupby(\"Best Of\")[\"Exam Session ID\"].nunique(),\n", 273 | "})\\\n", 274 | " .sort_values(\"Best Of\", ascending=True)\n", 275 | "\n", 276 | "\n", 277 | "with pandas.option_context(\"float_format\", \"{:.2%}\".format):\n", 278 | " print((100.0 * performance_by_bestof_df).round(2).style.to_latex())\n", 279 | " display(performance_by_bestof_df)" 280 | ] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "source": [ 285 | "## Hyperparameter Surface" 286 | ], 287 | "metadata": { 288 | "collapsed": false 289 | } 290 | }, 291 | { 292 | "cell_type": "code", 293 | "execution_count": 39, 294 | "outputs": [ 295 | { 296 | "name": "stdout", 297 | "output_type": "stream", 298 | "text": [ 299 | "Correct Rate\n" 300 | ] 301 | }, 302 | { 303 | "data": { 304 | "text/plain": "Best Of 1 2 4\nTemperature \n0.0 0.494418 0.504785 NaN\n0.5 0.500000 0.500000 0.505582\n1.0 0.490431 0.504785 0.498405", 305 | "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Best Of124
Temperature
0.00.4944180.504785NaN
0.50.5000000.5000000.505582
1.00.4904310.5047850.498405
\n
" 306 | }, 307 | "metadata": {}, 308 | "output_type": "display_data" 309 | }, 310 | { 311 | "name": "stdout", 312 | "output_type": "stream", 313 | "text": [ 314 | "Correct Rate - Standard Error of the Mean\n" 315 | ] 316 | }, 317 | { 318 | "data": { 319 | "text/plain": "Best Of 1 2 4\nTemperature \n0.0 0.019983 0.024484 NaN\n0.5 0.014125 0.014125 0.014124\n1.0 0.014123 0.014125 0.014125", 320 | "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Best Of124
Temperature
0.00.0199830.024484NaN
0.50.0141250.0141250.014124
1.00.0141230.0141250.014125
\n
" 321 | }, 322 | "metadata": {}, 323 | "output_type": "display_data" 324 | }, 325 | { 326 | "name": "stdout", 327 | "output_type": "stream", 328 | "text": [ 329 | "Top Two Correct Rate\n" 330 | ] 331 | }, 332 | { 333 | "data": { 334 | "text/plain": "Best Of 1 2 4\nTemperature \n0.0 0.716108 0.720096 NaN\n0.5 0.706539 0.712919 0.712121\n1.0 0.700159 0.708931 0.710526", 335 | "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Best Of124
Temperature
0.00.7161080.720096NaN
0.50.7065390.7129190.712121
1.00.7001590.7089310.710526
\n
" 336 | }, 337 | "metadata": {}, 338 | "output_type": "display_data" 339 | }, 340 | { 341 | "name": "stdout", 342 | "output_type": "stream", 343 | "text": [ 344 | "Top Three Correct Rate\n" 345 | ] 346 | }, 347 | { 348 | "data": { 349 | "text/plain": "Best Of 1 2 4\nTemperature \n0.0 0.888357 0.892344 NaN\n0.5 0.877990 0.887560 0.880383\n1.0 0.859649 0.872408 0.876396", 350 | "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Best Of124
Temperature
0.00.8883570.892344NaN
0.50.8779900.8875600.880383
1.00.8596490.8724080.876396
\n
" 351 | }, 352 | "metadata": {}, 353 | "output_type": "display_data" 354 | }, 355 | { 356 | "name": "stdout", 357 | "output_type": "stream", 358 | "text": [ 359 | "Samples\n" 360 | ] 361 | }, 362 | { 363 | "data": { 364 | "text/plain": "Best Of 1 2 4\nTemperature \n0.0 3.0 2.0 NaN\n0.5 6.0 6.0 6.0\n1.0 6.0 6.0 6.0", 365 | "text/html": "
\n\n\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n
Best Of124
Temperature
0.03.02.0NaN
0.56.06.06.0
1.06.06.06.0
\n
" 366 | }, 367 | "metadata": {}, 368 | "output_type": "display_data" 369 | } 370 | ], 371 | "source": [ 372 | "performance_by_temp_bestof = pandas.DataFrame({\n", 373 | " \"Correct Rate\": session_df.groupby([\"Temperature\", \"Best Of\"])[\"Correct\"].mean(),\n", 374 | " \"Correct Rate SEM\": session_df.groupby([\"Temperature\", \"Best Of\"])[\"Correct\"].sem(),\n", 375 | " \"Top Two Correct Rate\": session_df.groupby([\"Temperature\", \"Best Of\"])[\"Top Two Correct\"].mean(),\n", 376 | " \"Top Three Correct Rate\": session_df.groupby([\"Temperature\", \"Best Of\"])[\"Top Three Correct\"].mean(),\n", 377 | " \"Samples\": session_df.groupby([\"Temperature\", \"Best Of\"])[\"Exam Session ID\"].nunique(),\n", 378 | "})\n", 379 | "\n", 380 | "print(\"Correct Rate\")\n", 381 | "display(performance_by_temp_bestof[\"Correct Rate\"].unstack())\n", 382 | "\n", 383 | "print(\"Correct Rate - Standard Error of the Mean\")\n", 384 | "display(performance_by_temp_bestof[\"Correct Rate SEM\"].unstack())\n", 385 | "\n", 386 | "print(\"Top Two Correct Rate\")\n", 387 | "display(performance_by_temp_bestof[\"Top Two Correct Rate\"].unstack())\n", 388 | "\n", 389 | "print(\"Top Three Correct Rate\")\n", 390 | "display(performance_by_temp_bestof[\"Top Three Correct Rate\"].unstack())\n", 391 | "\n", 392 | "print(\"Samples\")\n", 393 | "display(performance_by_temp_bestof[\"Samples\"].unstack())" 394 | ], 395 | "metadata": { 396 | "collapsed": false 397 | } 398 | }, 399 | { 400 | "cell_type": "code", 401 | "execution_count": null, 402 | "id": "d2c8fe96-c374-45ed-871c-8bcae5d47f3a", 403 | "metadata": {}, 404 | "outputs": [], 405 | "source": [] 406 | } 407 | ], 408 | "metadata": { 409 | "kernelspec": { 410 | "display_name": "Python 3 (ipykernel)", 411 | "language": "python", 412 | "name": "python3" 413 | }, 414 | "language_info": { 415 | "codemirror_mode": { 416 | "name": "ipython", 417 | "version": 3 418 | }, 419 | "file_extension": ".py", 420 | "mimetype": "text/x-python", 421 | "name": 
"python", 422 | "nbconvert_exporter": "python", 423 | "pygments_lexer": "ipython3", 424 | "version": "3.10.6" 425 | } 426 | }, 427 | "nbformat": 4, 428 | "nbformat_minor": 5 429 | } 430 | -------------------------------------------------------------------------------- /run_exam.py: -------------------------------------------------------------------------------- 1 | """ 2 | run a bar exam questionnaire where we ask the model to: 3 | 1. rank order its top three choices 4 | """ 5 | 6 | # imports 7 | import datetime 8 | import json 9 | import time 10 | from pathlib import Path 11 | from typing import Iterator 12 | 13 | # packages 14 | import pandas 15 | import openai 16 | import tqdm 17 | 18 | # set the key 19 | openai.api_key = (Path(__file__).parent / ".openai_key").read_text() 20 | 21 | 22 | def generate_prompt(row: dict) -> str: 23 | """Generate a prompt from a row of the question spreadsheet.""" 24 | question_text = row["question_prompt"] 25 | question_text = question_text[(question_text.find(". 
") + 1) :].strip() 26 | prompt = f"""Please answer the following Bar Exam question in the following rank order format: 27 | First Choice: 28 | Second Choice: 29 | Third Choice: 30 | 31 | Question: {question_text} 32 | (A) {row["choice_a"].strip()} 33 | (B) {row["choice_b"].strip()} 34 | (C) {row["choice_c"].strip()} 35 | (D) {row["choice_d"].strip()}\nAnswer: """.strip() 36 | 37 | return prompt 38 | 39 | 40 | def get_parameter_sets() -> Iterator[dict]: 41 | """Generate a set of parameter sets.""" 42 | for temperature in [0.0, 0.5, 1.0]: 43 | for max_tokens in [ 44 | 16, 45 | ]: 46 | for top_p in [1, 0.75]: 47 | for best_of in [1, 2, 4]: 48 | for frequency_penalty in [ 49 | 0, 50 | ]: 51 | for presence_penalty in [ 52 | 0, 53 | ]: 54 | yield { 55 | "temperature": temperature, 56 | "max_tokens": max_tokens, 57 | "top_p": top_p, 58 | "best_of": best_of, 59 | "frequency_penalty": frequency_penalty, 60 | "presence_penalty": presence_penalty, 61 | } 62 | 63 | 64 | def get_next_session_path() -> Path: 65 | """Get the next session path.""" 66 | session_number = 1 67 | 68 | while True: 69 | session_id = f"bar-exam-{session_number:03d}" 70 | session_path = Path(__file__).parent / "sessions-008" 71 | session_path.mkdir(exist_ok=True) 72 | session_path = session_path / session_id 73 | 74 | # skip if exists 75 | if session_path.exists(): 76 | session_number += 1 77 | continue 78 | 79 | # otherwise continue 80 | session_path.mkdir(exist_ok=True) 81 | return session_path 82 | 83 | 84 | def main(): 85 | """ 86 | run a bar exam session 87 | """ 88 | 89 | # set samples per value 90 | num_samples_per_set = 3 91 | 92 | # iterate through parameter values 93 | for parameter_kwargs in get_parameter_sets(): 94 | print(f"Running with parameters: {parameter_kwargs}") 95 | for sample_id in range(num_samples_per_set): 96 | # set up the session path iteratively 97 | session_path = get_next_session_path() 98 | 99 | # load the questions 100 | question_df = pandas.read_csv( 101 | 
Path(__file__).parent.parent / "data" / "questions.csv" 102 | ) 103 | print(f"Loaded {len(question_df)} questions.") 104 | 105 | # generate the prompts 106 | exam_data = { 107 | "parameters": parameter_kwargs, 108 | "start_time": datetime.datetime.now().isoformat(), 109 | "questions": [], 110 | } 111 | for row_id, row in tqdm.tqdm( 112 | question_df.iterrows(), total=question_df.shape[0] 113 | ): 114 | question_exam_data = { 115 | "question_input": row.to_dict(), 116 | "model_prompt": generate_prompt(row.to_dict()), 117 | "model_response": None, 118 | } 119 | 120 | try: 121 | question_exam_data["model_response"] = openai.Completion.create( 122 | model="text-davinci-003", 123 | prompt=question_exam_data["model_prompt"], 124 | **parameter_kwargs, 125 | ) 126 | print(question_exam_data["model_response"]["choices"][0]["text"]) 127 | except Exception as e: 128 | # try once more inside the loop after a brief pause 129 | print( 130 | f"Error while submitting question {row['question_number']}: {e}" 131 | ) 132 | print("Pausing and retrying...") 133 | time.sleep(10) 134 | try: 135 | question_exam_data["model_response"] = openai.Completion.create( 136 | model="text-davinci-003", 137 | prompt=question_exam_data["model_prompt"], 138 | **parameter_kwargs, 139 | ) 140 | print( 141 | question_exam_data["model_response"]["choices"][0]["text"] 142 | ) 143 | except Exception as f: 144 | print( 145 | f"Second error while submitting question {row['question_number']}: {f}" 146 | ) 147 | question_exam_data["model_response"] = None 148 | finally: 149 | # log the current state of the exam 150 | exam_data["questions"].append(question_exam_data) 151 | with open( 152 | session_path / "exam_data.json", "wt", encoding="utf-8" 153 | ) as output_file: 154 | json.dump(exam_data, output_file) 155 | # save final state 156 | exam_data["end_time"] = datetime.datetime.now().isoformat() 157 | with open( 158 | session_path / "exam_data.json", "wt", encoding="utf-8" 159 | ) as output_file: 160 | 
json.dump(exam_data, output_file) 161 | 162 | 163 | if __name__ == "__main__": 164 | main() 165 | -------------------------------------------------------------------------------- /score_exam.py: -------------------------------------------------------------------------------- 1 | """ 2 | read the exam session JSON output and output a CSV file with the question category, number, 3 | selected choice, and explanation if available 4 | """ 5 | 6 | # imports 7 | import datetime 8 | import json 9 | from pathlib import Path 10 | 11 | # packages 12 | import pandas 13 | import tqdm 14 | 15 | 16 | def load_answer_key(answer_key_path: Path) -> pandas.DataFrame: 17 | """ 18 | load a copy of the answer key for comparison 19 | :param answer_key_path: 20 | :return: 21 | """ 22 | answer_key_df = pandas.read_csv(answer_key_path, encoding="utf-8", low_memory=False) 23 | answer_key_df.columns = ["question_category", "question_number", "correct_answer"] 24 | return answer_key_df 25 | 26 | 27 | def parse_gpt_response(response: str) -> dict: 28 | """parse the GPT response like: 29 | First Choice: (C) 30 | Second Choice: (D) 31 | Third Choice: (A) 32 | 33 | to return 34 | 35 | { 36 | "answer": "C", 37 | "second_answer": "D", 38 | "third_answer": "A" 
39 | } 40 | """ 41 | response_data = { 42 | "answer": None, 43 | "second_answer": None, 44 | "third_answer": None, 45 | "reason": None, 46 | } 47 | response_lines = response.strip().splitlines() 48 | 49 | for i, line in enumerate(response_lines): 50 | line = line.strip() 51 | 52 | if line.startswith("First Choice"): 53 | response_data["answer"] = ( 54 | line.split() 55 | .pop() 56 | .replace("(", "") 57 | .replace(")", "") 58 | .replace(".", "") 59 | .strip() 60 | ) 61 | elif line.startswith("Second Choice"): 62 | response_data["second_answer"] = ( 63 | line.split() 64 | .pop() 65 | .replace("(", "") 66 | .replace(")", "") 67 | .replace(".", "") 68 | .strip() 69 | ) 70 | elif line.startswith("Third Choice"): 71 | response_data["third_answer"] = ( 72 | line.split() 73 | .pop() 74 | .replace("(", "") 75 | .replace(")", "") 76 | .replace(".", "") 77 | .strip() 78 | ) 79 | 80 | return response_data 81 | 82 | 83 | def get_complete_session_folders() -> list[Path]: 84 | """ 85 | get a list of completed session folders 86 | :return: 87 | """ 88 | session_path = Path(__file__).parent / "sessions-008" 89 | session_list = [] 90 | for session_id in session_path.iterdir(): 91 | if (session_id / "exam_data.json").exists(): 92 | session_list.append(session_id) 93 | return sorted(session_list) 94 | 95 | 96 | if __name__ == "__main__": 97 | # load the answer key 98 | answer_key_df = load_answer_key( 99 | Path(__file__).parent.parent / "data" / "answer_key_category.csv" 100 | ) 101 | 102 | # get the list of completed sessions 103 | session_list = get_complete_session_folders() 104 | exam_session_output = [] 105 | for session_path in tqdm.tqdm(session_list): 106 | session_name = session_path.name 107 | session_file = session_path / "exam_data.json" 108 | if not session_file.exists(): 109 | raise ValueError("Session file does not exist") 110 | session_data = json.loads(session_file.read_text()) 111 | 112 | # get parameters from the session data 113 | session_parameters = 
session_data["parameters"]
114 |         try:
115 |             session_duration = (
116 |                 datetime.datetime.fromisoformat(session_data["end_time"])
117 |                 - datetime.datetime.fromisoformat(session_data["start_time"])
118 |             ).total_seconds()
119 |         except (KeyError, ValueError):
120 |             session_duration = None
121 | 
122 |         for question in session_data["questions"]:
123 |             # get the correct answer first
124 |             question_category = question["question_input"]["question_category"]
125 |             question_number = question["question_input"]["question_number"]
126 |             answer_key_match = answer_key_df.loc[
127 |                 (answer_key_df["question_category"] == question_category)
128 |                 & (answer_key_df["question_number"] == question_number)
129 |             ]
130 |             if answer_key_match.shape[0] > 1:
131 |                 raise ValueError(
132 |                     f"Answer key match is not unique for category={question_category},"
133 |                     f" number={question_number}"
134 |                 )
135 |             elif answer_key_match.shape[0] == 0:
136 |                 raise ValueError(
137 |                     f"No answer key match found for category={question_category},"
138 |                     f" number={question_number}"
139 |                 )
140 |             correct_answer = answer_key_match["correct_answer"].values[0]
141 | 
142 |             if question["model_response"] is not None:
143 |                 # get the raw response
144 |                 if len(question["model_response"]["choices"]) != 1:
145 |                     print(
146 |                         f"category={question_category}, number={question_number} has more than one choice response."
147 | ) 148 | continue 149 | 150 | # get the text and parse it 151 | response_text = question["model_response"]["choices"][0]["text"] 152 | question_response_data = parse_gpt_response(response_text) 153 | 154 | exam_session_output.append( 155 | ( 156 | session_name, 157 | question_category, 158 | question_number, 159 | question_response_data["answer"], 160 | question_response_data["second_answer"], 161 | question_response_data["third_answer"], 162 | correct_answer, 163 | # first, second, and third correct booleans 164 | question_response_data["answer"] == correct_answer, 165 | question_response_data["second_answer"] == correct_answer, 166 | question_response_data["third_answer"] == correct_answer, 167 | # top two correct 168 | (question_response_data["answer"] == correct_answer) 169 | or (question_response_data["second_answer"] == correct_answer), 170 | # top three correct 171 | (question_response_data["answer"] == correct_answer) 172 | or (question_response_data["second_answer"] == correct_answer) 173 | or (question_response_data["third_answer"] == correct_answer), 174 | # add the parameters here 175 | session_parameters["temperature"], 176 | session_parameters["max_tokens"], 177 | session_parameters["top_p"], 178 | session_parameters["best_of"], 179 | session_parameters["frequency_penalty"], 180 | session_parameters["presence_penalty"], 181 | session_duration, 182 | ) 183 | ) 184 | else: 185 | exam_session_output.append( 186 | ( 187 | session_name, 188 | question["question_input"]["question_category"], 189 | question["question_input"]["question_number"], 190 | None, 191 | None, 192 | None, 193 | correct_answer, 194 | False, 195 | False, 196 | False, 197 | False, 198 | False, 199 | session_parameters["temperature"], 200 | session_parameters["max_tokens"], 201 | session_parameters["top_p"], 202 | session_parameters["best_of"], 203 | session_parameters["frequency_penalty"], 204 | session_parameters["presence_penalty"], 205 | session_duration, 206 | ) 207 | ) 208 | 
209 | # save the exam session output 210 | exam_session_output_df = pandas.DataFrame( 211 | exam_session_output, 212 | columns=[ 213 | "exam_session", 214 | "category", 215 | "number", 216 | "answer", 217 | "second_answer", 218 | "third_answer", 219 | "correct_answer", 220 | "first_correct", 221 | "second_correct", 222 | "third_correct", 223 | "top_two_correct", 224 | "top_three_correct", 225 | "temperature", 226 | "max_tokens", 227 | "top_p", 228 | "best_of", 229 | "frequency_penalty", 230 | "presence_penalty", 231 | "session_duration", 232 | ], 233 | ) 234 | exam_session_output_df.to_csv( 235 | Path(__file__).parent / "all_exam_summary_008.csv", index=False 236 | ) 237 | --------------------------------------------------------------------------------