├── README.md
├── demo.ipynb
├── instrusum.svg
└── prompts
    ├── llmcompare.irrelevant.txt
    ├── llmcompare.missing.txt
    ├── llmcompare.overall.txt
    ├── llmeval.irrelevant.txt
    ├── llmeval.missing.txt
    ├── llmeval.overall.txt
    ├── llmrank.irrelevant.txt
    ├── llmrank.missing.txt
    ├── llmrank.overall.txt
    ├── llmscore.irrelevant.txt
    ├── llmscore.missing.txt
    └── llmscore.overall.txt

/README.md:
--------------------------------------------------------------------------------
1 | # InstruSum
2 |
3 | This is a repository for our paper ["Benchmarking Generation and Evaluation Capabilities of Large Language
4 | Models for Instruction Controllable Summarization"](https://arxiv.org/abs/2311.09184).
5 |
6 |
7 |
8 |
9 |
10 | ## Quick Links
11 |
12 | - [Benchmark Dataset](#benchmark-dataset)
13 |   - [Subset: dataset](#dataset)
14 |   - [Subset: human_eval](#human_eval)
15 |   - [Subset: llm_eval](#llm_eval)
16 |   - [Subset: system_outputs](#system_outputs)
17 | - [Prompts for LLM-based Evaluation Methods](#prompts-for-llm-based-evaluation-methods)
18 | - [Citation](#citation)
19 |
20 | ## Benchmark Dataset
21 |
22 | InstruSum can be downloaded with Hugging Face Datasets under [`Salesforce/InstruSum`](https://huggingface.co/datasets/Salesforce/InstruSum).
23 | We provide a notebook, [demo.ipynb](demo.ipynb), for exploring the dataset and performing some basic analysis.
24 |
25 | InstruSum contains four subsets: `dataset`, `human_eval`, `llm_eval`, and `system_outputs`.
26 |
27 | ### dataset
28 |
29 | The `dataset` subset contains 100 data examples written by us.
30 | Each example contains an article, a summary instruction, an LLM-generated summary, and a hybrid LLM-human summary.
31 |
32 | ### human_eval
33 |
34 | This subset contains human evaluation results for the 100 examples in the `dataset` subset.
35 | There are 5 systems evaluated: OpenAI's `text-davinci-002`, `text-davinci-003`, `gpt-3.5-turbo-0301`, `gpt-4-0314`, along with the `hybrid` LLM-human summary.
36 | We evaluated 4 aspects:
37 | - **Overall Quality**: This rating assesses the overall quality of the summary in relation to the summary requirement.
38 | - **Missing Information**: Does the summary omit any crucial information from the article concerning the summary requirement?
39 | - **Irrelevant Information**: Does the summary include any information that is not relevant to the summary requirement?
40 | - **Factual Consistency**: Is the summary consistent with the facts presented in the article, without contradicting or misrepresenting any information?
41 |
42 | ### llm_eval
43 |
44 | This subset contains LLM-based automatic evaluation results for the 100 examples in the `dataset` subset.
45 |
46 | We used 11 LLMs in our evaluation and 4 evaluation protocols:
47 |
48 | - `LLMRank`: listwise ranking
49 | - `LLMCompare`: pairwise comparison
50 | - `LLMEval`: pointwise scoring by text completion
51 | - `LLMScore`: pointwise scoring by model-predicted log-likelihood
52 |
53 | In total, we evaluated 40 LLM-based evaluation methods over three quality aspects:
54 |
55 | | LLM                      | LLMRank | LLMCompare | LLMEval | LLMScore |
56 | |--------------------------|---------|------------|---------|----------|
57 | | `text-davinci-002`       | ✅      | ✅         | ✅      | ✅       |
58 | | `text-davinci-003`       | ✅      | ✅         | ✅      | ✅       |
59 | | `gpt-3.5-turbo-0301`     | ✅      | ✅         | ✅      | ❌       |
60 | | `gpt-3.5-turbo-0613`     | ✅      | ✅         | ✅      | ❌       |
61 | | `gpt-3.5-turbo-instruct` | ✅      | ✅         | ✅      | ✅       |
62 | | `gpt-4-0314`             | ✅      | ✅         | ✅      | ❌       |
63 | | `gpt-4-1106-preview`     | ✅      | ✅         | ✅      | ❌       |
64 | | `llama-2-7b-chat`        | ✅      | ✅         | ✅      | ✅       |
65 | | `llama-2-13b-chat`       | ✅      | ✅         | ✅      | ✅       |
66 | | `llama-2-70b-chat`       | ✅      | ✅         | ✅      | ✅       |
67 | | `mistral-instruct`       | ✅      | ✅         | ✅      | ✅       |
68 |
69 | ### system_outputs
70 |
71 | This subset contains the system outputs for the 100 examples in the `dataset` subset from 11 LLMs (the same ones as in the `llm_eval` subset).
72 |
73 | ## Prompts for LLM-based Evaluation Methods
74 |
75 | We provide the prompts for the 4 LLM-based evaluation protocols across 3 quality aspects ("overall", "missing", "irrelevant") in the [`prompts`](prompts) folder.
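The prompt files in the directory listing appear to follow a `<protocol>.<aspect>.txt` naming pattern. A minimal sketch of looking one up, assuming that pattern holds for all 12 files and that the code runs from the repository root; `prompt_path` and `load_prompt` are hypothetical helper names, not part of this repository:

```python
from pathlib import Path

# Protocol and aspect names as they appear in the prompt file names.
PROTOCOLS = ("llmrank", "llmcompare", "llmeval", "llmscore")
ASPECTS = ("overall", "missing", "irrelevant")


def prompt_path(protocol: str, aspect: str, root: str = "prompts") -> Path:
    """Return the path of the prompt file for a protocol/aspect pair."""
    protocol = protocol.lower()  # accept "LLMRank" as well as "llmrank"
    if protocol not in PROTOCOLS:
        raise ValueError(f"unknown protocol: {protocol!r}")
    if aspect not in ASPECTS:
        raise ValueError(f"unknown aspect: {aspect!r}")
    return Path(root) / f"{protocol}.{aspect}.txt"


def load_prompt(protocol: str, aspect: str, root: str = "prompts") -> str:
    """Read a prompt file as text (hypothetical helper)."""
    return prompt_path(protocol, aspect, root).read_text(encoding="utf-8")


# Enumerate every protocol/aspect combination: 4 protocols x 3 aspects.
all_paths = [prompt_path(p, a) for p in PROTOCOLS for a in ASPECTS]
print(len(all_paths))
```

Factual consistency is deliberately rejected here, since the prompts cover only the three aspects listed above; for example, `prompt_path("LLMRank", "overall")` resolves to `prompts/llmrank.overall.txt`.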
76 |
77 |
78 | ## Citation
79 |
80 | Please cite our paper if you use InstruSum in your work:
81 |
82 | ```bibtex
83 | @article{liu2023benchmarking,
84 |   title={Benchmarking generation and evaluation capabilities of large language models for instruction controllable summarization},
85 |   author={Liu, Yixin and Fabbri, Alexander R and Chen, Jiawen and Zhao, Yilun and Han, Simeng and Joty, Shafiq and Liu, Pengfei and Radev, Dragomir and Wu, Chien-Sheng and Cohan, Arman},
86 |   journal={arXiv preprint arXiv:2311.09184},
87 |   year={2023}
88 | }
89 | ```
90 |
91 |
--------------------------------------------------------------------------------
/demo.ipynb:
--------------------------------------------------------------------------------
1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# InstruSum Data Analysis" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## Load datasets" 15 | ] 16 | }, 17 | { 18 | "cell_type": "code", 19 | "execution_count": null, 20 | "metadata": {}, 21 | "outputs": [], 22 | "source": [ 23 | "%pip install datasets\n", 24 | "%pip install tabulate\n", 25 | "%pip install scipy" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 15, 31 | "metadata": {}, 32 | "outputs": [], 33 | "source": [ 34 | "from datasets import load_dataset\n", 35 | "from tabulate import tabulate" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "### Load data examples" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": null, 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "dataset_name = \"Salesforce/InstruSum\"" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "metadata": {}, 58 | "outputs": [], 59 | "source": [ 60 | "dataset = load_dataset(dataset_name, \"dataset\")[\"data\"]" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 |
"source": [ 67 | "#### Check one data example" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": 3, 73 | "metadata": {}, 74 | "outputs": [ 75 | { 76 | "data": { 77 | "text/plain": [ 78 | "{'hybrid_summary': \"After being contacted by BBC Money Box, Lloyds started a new investigation and concluded that its initial response was wrong. They agreed to refund all of Margaret's money, plus interest and £600 by way of compensation. Vodafone also said they would provide the details of an individual who may have used Margaret's account to the police. Sue is grateful for the refund but hopes that someone will be held accountable for the fraud.\",\n", 79 | " 'article': '\"I was shaking with rage and stress, I couldn\\'t believe this had happened.\" By Dan WhitworthMoney Box reporter Sue is describing the moment she discovered that her late mother, Margaret (not their real names), who\\'d spent the last years of her life battling dementia, had had more than £14,000 stolen through direct debit fraud. \"To be told that that amount of money had been taken... I was outraged that someone could steal off my mother,\" Sue says. And she is not alone in her concern. The charity Action on Elder Abuse is warning about the dangers of direct debits being fraudulently set up in the name of vulnerable victims. The charity says it\\'s concerned about loopholes and a lack of transparency within the current system. But the Direct Debit scheme says its guarantee means companies that use it to take payments directly from customers\\' bank accounts are carefully vetted. \\'Legitimately\\' set up After being diagnosed with dementia in 2010, Margaret moved into a nursing home. Three years later, at a point when Margaret could no longer care for herself, two direct debits were set up using her bank account details. 
Over the next four years more than £14,000 of Margaret\\'s money was stolen to pay the direct debits and it was only after she died in 2017 that her daughter Sue discovered what had happened. Sue began trying to find out what had happened but was told by her mother\\'s bank, Lloyds, that it had carried out an investigation and it had concluded the direct debits had been \"legitimately\" set up so it would not be refunding any money. Most of the money stolen from Margaret\\'s account was used to pay Vodafone, but the company told Sue it was unable to help or provide any details of who was receiving its services because of \"data protection\" rules. Sue also contacted her local police force. It referred her to Action Fraud which said it was unlikely any further action would be taken. Sue described the reaction from her bank as \"disgusting\". \"The whole thing was taking over my life. I didn\\'t know where to go for help, I couldn\\'t sleep. All day long I was on the internet trying to find out who else I could go to for help but there was nothing.\" \"I wrote and explained that my mum couldn\\'t have set up these direct debits. \"I explained she couldn\\'t feed herself, she couldn\\'t go to the bathroom on her own, she was monitored all the time. \"She didn\\'t have the capacity in her mind to think about setting up a direct debit and nobody listened. It was like [I] was being ignored and I had the feeling that because my mum was dead they [Lloyds] couldn\\'t care less.\" Direct Debit offers a guarantee which explains that companies wishing to use it to take payments directly from people\\'s bank accounts have to go through a careful vetting process. 
A spokesperson for the Direct Debit scheme said: \"The billers [companies] are required to carry out payer verification checks when a Direct Debit Instruction is set up - details of the verification checks used by billers cannot be shared for obvious reasons.\" The safeguards supposedly in place to protect vulnerable people, as well as the loopholes in the system, is something that Veronica Gray from Action on Elder Abuse says need tackling. \"This particular case highlights a lack of transparency in how the system operates. This level of passing the buck when elderly or vulnerable people fall between the gaps is just not good enough. \"The Financial Abuse Code of Practice, which is a voluntary code but which many banks have signed up to, is very clear about how financial institutions should treat vulnerable customers. Clearly this has not been used in this case. \"[Bank] staff are struggling to know what signs to look for and clearly don\\'t have the skills to and expertise to identify patterns of abuse when they see them.\" When it was contacted by BBC Money Box, Lloyds started a new investigation which concluded that its initial response was wrong and it would be refunding all of Margaret\\'s money, plus interest and £600 by way of compensation. A Lloyds spokesperson said: \"We were very sorry to hear of the difficulties experienced by Sue when dealing with her late mother\\'s account. While we were not informed back in 2010 that Margaret had moved into a nursing home, it should have been clear when her daughter contacted us in 2017 - following her mother\\'s death - that Margaret would not have been in a position to arrange these Direct Debits. 
\"We would like to apologise for the distress and inconvenience caused by our handling of this case and have now arranged for a full refund of all the payments.\" Warning signs Vodafone said in a statement that it was also looking again at the case and would be providing the details of an individual who may have used Margaret\\'s account to the police. It added there were a \"wide range of security verification and fraud checks when opening a new account\", but that people can subsequently change the direct debit details. It also said it would welcome any initiative that further strengthened the direct debit system. Whilst Sue is grateful that Lloyds have decided to refund the money stolen from her mother\\'s account she just wants to make sure this can\\'t happen to someone else. \"I really would like someone to be accountable for doing this. You know, for the police or somebody to find out who did this - in case they\\'re doing this to somebody else. \"Lloyds should have looked into the fact that this account had laid dormant for years and then all of a sudden this money is coming out of it - surely that would ring a bell, that something\\'s wrong there? \"And once you say someone\\'s in a home with dementia and these things have happened surely that should mean something?\" You can hear more on BBC Radio 4\\'s Money Box programme on Saturday at 12pm or listen again here. Follow Money Box and Dan on twitter.',\n", 80 | " 'requirement': 'Summarize the conclusion of the fraud case.',\n", 81 | " 'llm_summary': \"After being contacted by BBC Money Box, Lloyds started a new investigation and concluded that its initial response was wrong. They agreed to refund all of Margaret's money, plus interest and £600 by way of compensation. Vodafone also said they would provide the details of an individual who may have used Margaret's account to the police. 
Sue is grateful for the refund but hopes that someone will be held accountable for the fraud.\"}" 82 | ] 83 | }, 84 | "execution_count": 3, 85 | "metadata": {}, 86 | "output_type": "execute_result" 87 | } 88 | ], 89 | "source": [ 90 | "dataset[0]" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "### Load human evaluation data" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": {}, 104 | "outputs": [], 105 | "source": [ 106 | "human_eval = load_dataset(dataset_name, \"human_eval\")[\"data\"]" 107 | ] 108 | }, 109 | { 110 | "cell_type": "markdown", 111 | "metadata": {}, 112 | "source": [ 113 | "#### Explore the data structure" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 5, 119 | "metadata": {}, 120 | "outputs": [ 121 | { 122 | "data": { 123 | "text/plain": [ 124 | "dict_keys(['annotations', 'article', 'requirement'])" 125 | ] 126 | }, 127 | "execution_count": 5, 128 | "metadata": {}, 129 | "output_type": "execute_result" 130 | } 131 | ], 132 | "source": [ 133 | "human_eval[0].keys()" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": 6, 139 | "metadata": {}, 140 | "outputs": [ 141 | { 142 | "data": { 143 | "text/plain": [ 144 | "dict_keys(['gpt-3.5-turbo-0301', 'gpt-4-0314', 'hybrid', 'text-davinci-002', 'text-davinci-003'])" 145 | ] 146 | }, 147 | "execution_count": 6, 148 | "metadata": {}, 149 | "output_type": "execute_result" 150 | } 151 | ], 152 | "source": [ 153 | "human_eval[0][\"annotations\"].keys()" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": 7, 159 | "metadata": {}, 160 | "outputs": [ 161 | { 162 | "data": { 163 | "text/plain": [ 164 | "{'score': {'factual': 1.0,\n", 165 | " 'irrelevant': 3.0,\n", 166 | " 'missing': 3.6666666666666665,\n", 167 | " 'overall': 3.0},\n", 168 | " 'summary': \"Lloyds Bank has apologized and refunded over £14,000, plus interest and £600 in 
compensation, to a woman whose late mother's account was targeted by direct debit fraud. The bank initially claimed the direct debits were legitimately set up and refused to refund the money. After being contacted by BBC Money Box, Lloyds conducted a new investigation and admitted its initial response was wrong. Vodafone, which received most of the stolen funds, is also reviewing the case and will provide details of a possible suspect to the police.\"}" 169 | ] 170 | }, 171 | "execution_count": 7, 172 | "metadata": {}, 173 | "output_type": "execute_result" 174 | } 175 | ], 176 | "source": [ 177 | "human_eval[0][\"annotations\"][\"gpt-4-0314\"]" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": {}, 183 | "source": [ 184 | "#### Compute the average system scores" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": 18, 190 | "metadata": {}, 191 | "outputs": [ 192 | { 193 | "name": "stdout", 194 | "output_type": "stream", 195 | "text": [ 196 | "Model                 overall    missing    irrelevant    factual\n", 197 | "------------------  ---------  ---------  ------------  ---------\n", 198 | "text-davinci-002        2.344      2.595         3.443      0.640\n", 199 | "text-davinci-003        3.239      3.702         3.708      0.710\n", 200 | "gpt-3.5-turbo-0301      2.897      3.473         2.958      0.800\n", 201 | "gpt-4-0314              3.970      4.067         4.205      0.860\n", 202 | "hybrid                  3.873      3.948         4.359      0.860\n" 203 | ] 204 | } 205 | ], 206 | "source": [ 207 | "models = [\"text-davinci-002\", \"text-davinci-003\", \"gpt-3.5-turbo-0301\", \"gpt-4-0314\", \"hybrid\"]\n", 208 | "aspects = [\"overall\", \"missing\", \"irrelevant\", \"factual\"]\n", 209 | "scores = {model: {aspect: [] for aspect in aspects} for model in models}\n", 210 | "for row in human_eval:\n", 211 | "    for model in models:\n", 212 | "        for aspect in aspects:\n", 213 | "            scores[model][aspect].append(row[\"annotations\"][model][\"score\"][aspect])\n", 214 | "for model in models:\n", 215 | "    for aspect in aspects:\n", 216 | "        scores[model][aspect] = 
sum(scores[model][aspect]) / len(scores[model][aspect])\n", 217 | "table = [[\"Model\"] + aspects]\n", 218 | "for model in models:\n", 219 | " table.append([model] + [scores[model][aspect] for aspect in aspects])\n", 220 | "print(tabulate(table, headers=\"firstrow\", floatfmt=\".3f\"))" 221 | ] 222 | }, 223 | { 224 | "cell_type": "markdown", 225 | "metadata": {}, 226 | "source": [ 227 | "### Load LLM-based evaluation data" 228 | ] 229 | }, 230 | { 231 | "cell_type": "code", 232 | "execution_count": null, 233 | "metadata": {}, 234 | "outputs": [], 235 | "source": [ 236 | "llm_eval = load_dataset(dataset_name, \"llm_eval\")[\"data\"]" 237 | ] 238 | }, 239 | { 240 | "cell_type": "markdown", 241 | "metadata": {}, 242 | "source": [ 243 | "#### Explore the data structure" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": 9, 249 | "metadata": {}, 250 | "outputs": [ 251 | { 252 | "data": { 253 | "text/plain": [ 254 | "dict_keys(['system_outputs', 'article', 'requirement', 'llm_scores'])" 255 | ] 256 | }, 257 | "execution_count": 9, 258 | "metadata": {}, 259 | "output_type": "execute_result" 260 | } 261 | ], 262 | "source": [ 263 | "llm_eval[0].keys()" 264 | ] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "metadata": {}, 269 | "source": [ 270 | "#### There are 3 evaluation aspects." 271 | ] 272 | }, 273 | { 274 | "cell_type": "code", 275 | "execution_count": 10, 276 | "metadata": {}, 277 | "outputs": [ 278 | { 279 | "data": { 280 | "text/plain": [ 281 | "dict_keys(['irrelevant', 'missing', 'overall'])" 282 | ] 283 | }, 284 | "execution_count": 10, 285 | "metadata": {}, 286 | "output_type": "execute_result" 287 | } 288 | ], 289 | "source": [ 290 | "llm_eval[0][\"llm_scores\"].keys()" 291 | ] 292 | }, 293 | { 294 | "cell_type": "markdown", 295 | "metadata": {}, 296 | "source": [ 297 | "#### There are 11 LLMs in total." 
298 | ] 299 | }, 300 | { 301 | "cell_type": "code", 302 | "execution_count": 11, 303 | "metadata": {}, 304 | "outputs": [ 305 | { 306 | "data": { 307 | "text/plain": [ 308 | "dict_keys(['gpt-3.5-turbo-0301', 'gpt-3.5-turbo-0613', 'gpt-3.5-turbo-instruct', 'gpt-4-0314', 'gpt-4-1106-preview', 'llama-2-13b-chat', 'llama-2-70b-chat', 'llama-2-7b-chat', 'mistral-instruct', 'text-davinci-002', 'text-davinci-003'])" 309 | ] 310 | }, 311 | "execution_count": 11, 312 | "metadata": {}, 313 | "output_type": "execute_result" 314 | } 315 | ], 316 | "source": [ 317 | "llm_eval[0][\"llm_scores\"][\"overall\"].keys()" 318 | ] 319 | }, 320 | { 321 | "cell_type": "markdown", 322 | "metadata": {}, 323 | "source": [ 324 | "#### Each LLM is used with different evaluation protocols." 325 | ] 326 | }, 327 | { 328 | "cell_type": "code", 329 | "execution_count": 19, 330 | "metadata": {}, 331 | "outputs": [ 332 | { 333 | "data": { 334 | "text/plain": [ 335 | "dict_keys(['llmcompare', 'llmeval', 'llmrank'])" 336 | ] 337 | }, 338 | "execution_count": 19, 339 | "metadata": {}, 340 | "output_type": "execute_result" 341 | } 342 | ], 343 | "source": [ 344 | "llm_eval[0][\"llm_scores\"][\"overall\"]['gpt-3.5-turbo-0301'].keys()" 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "metadata": {}, 350 | "source": [ 351 | "#### Let's check the LLMCompare scores." 
352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": 20, 357 | "metadata": {}, 358 | "outputs": [ 359 | { 360 | "data": { 361 | "text/plain": [ 362 | "{'gpt-3.5-turbo-0301': 1.75,\n", 363 | " 'gpt-4-0314': 1.75,\n", 364 | " 'hybrid': 0.25,\n", 365 | " 'text-davinci-002': 0.75,\n", 366 | " 'text-davinci-003': 0.5}" 367 | ] 368 | }, 369 | "execution_count": 20, 370 | "metadata": {}, 371 | "output_type": "execute_result" 372 | } 373 | ], 374 | "source": [ 375 | "llm_eval[0][\"llm_scores\"][\"overall\"]['gpt-3.5-turbo-0301'][\"llmcompare\"]" 376 | ] 377 | }, 378 | { 379 | "cell_type": "markdown", 380 | "metadata": {}, 381 | "source": [ 382 | "#### Compute the system-level correlation of different LLM-based evaluation methods on the overall quality aspect" 383 | ] 384 | }, 385 | { 386 | "cell_type": "code", 387 | "execution_count": 21, 388 | "metadata": {}, 389 | "outputs": [ 390 | { 391 | "name": "stdout", 392 | "output_type": "stream", 393 | "text": [ 394 | "Model llmrank llmcompare llmeval\n", 395 | "---------------------- --------- ------------ ---------\n", 396 | "text-davinci-002 -0.200 0.400 0.738\n", 397 | "text-davinci-003 0.400 0.400 0.949\n", 398 | "gpt-3.5-turbo-0301 0.738 0.400 0.600\n", 399 | "gpt-3.5-turbo-0613 0.600 0.527 0.527\n", 400 | "gpt-3.5-turbo-instruct 0.400 0.600 0.738\n", 401 | "gpt-4-0314 0.800 1.000 1.000\n", 402 | "gpt-4-1106-preview 0.400 0.800 0.800\n", 403 | "llama-2-7b-chat 0.200 0.527 0.527\n", 404 | "llama-2-13b-chat 0.105 0.400 1.000\n", 405 | "llama-2-70b-chat -0.316 0.400 0.949\n", 406 | "mistral-instruct -0.400 0.105 0.447\n" 407 | ] 408 | } 409 | ], 410 | "source": [ 411 | "from scipy.stats import kendalltau\n", 412 | "\n", 413 | "models = [\n", 414 | " \"text-davinci-002\",\n", 415 | " \"text-davinci-003\",\n", 416 | " \"gpt-3.5-turbo-0301\",\n", 417 | " \"gpt-3.5-turbo-0613\",\n", 418 | " \"gpt-3.5-turbo-instruct\",\n", 419 | " \"gpt-4-0314\",\n", 420 | " \"gpt-4-1106-preview\",\n", 421 | " 
\"llama-2-7b-chat\",\n", 422 | " \"llama-2-13b-chat\",\n", 423 | " \"llama-2-70b-chat\",\n", 424 | " \"mistral-instruct\",\n", 425 | "]\n", 426 | "systems = [\"text-davinci-002\", \"text-davinci-003\", \"gpt-3.5-turbo-0301\", \"gpt-4-0314\", \"hybrid\"]\n", 427 | "methods = [\"llmrank\", \"llmcompare\", \"llmeval\"]\n", 428 | "llm_eval_results = {model: dict() for model in models}\n", 429 | "\n", 430 | "for model in models:\n", 431 | " for method in methods:\n", 432 | " scores = {s: [] for s in systems}\n", 433 | " for row in llm_eval:\n", 434 | " for s in systems:\n", 435 | " scores[s].append(row[\"llm_scores\"][\"overall\"][model][method][s])\n", 436 | " for s in systems:\n", 437 | " scores[s] = sum(scores[s]) / len(scores[s])\n", 438 | " llm_eval_results[model][method] = scores\n", 439 | "\n", 440 | "human_scores = {s: [] for s in systems}\n", 441 | "for row in human_eval:\n", 442 | " for s in systems:\n", 443 | " human_scores[s].append(row[\"annotations\"][s][\"score\"][\"overall\"])\n", 444 | "for s in systems:\n", 445 | " human_scores[s] = sum(human_scores[s]) / len(human_scores[s])\n", 446 | "human_scores = [human_scores[s] for s in systems]\n", 447 | "\n", 448 | "correlations = {model: dict() for model in models}\n", 449 | "for model in models:\n", 450 | " for method in methods:\n", 451 | " llm_scores = [llm_eval_results[model][method][s] for s in systems]\n", 452 | " corr, _ = kendalltau(human_scores, llm_scores)\n", 453 | " correlations[model][method] = corr\n", 454 | "\n", 455 | "table = [[\"Model\"] + methods]\n", 456 | "for model in models:\n", 457 | " table.append([model] + [correlations[model][method] for method in methods])\n", 458 | "print(tabulate(table, headers=\"firstrow\", floatfmt=\".3f\"))" 459 | ] 460 | }, 461 | { 462 | "cell_type": "markdown", 463 | "metadata": {}, 464 | "source": [ 465 | "### Load system outputs" 466 | ] 467 | }, 468 | { 469 | "cell_type": "code", 470 | "execution_count": 13, 471 | "metadata": {}, 472 | "outputs": [], 473 
| "source": [ 474 | "system_outputs = load_dataset(dataset_name, \"system_outputs\")[\"data\"]" 475 | ] 476 | }, 477 | { 478 | "cell_type": "markdown", 479 | "metadata": {}, 480 | "source": [ 481 | "#### Check one data example" 482 | ] 483 | }, 484 | { 485 | "cell_type": "code", 486 | "execution_count": 14, 487 | "metadata": {}, 488 | "outputs": [ 489 | { 490 | "data": { 491 | "text/plain": [ 492 | "{'system_outputs': {'gpt-3.5-turbo-0301': 'Lloyds Bank has refunded more than £14,000 stolen from the account of a dementia sufferer after direct debits were fraudulently set up. The bank initially refused to refund the money, claiming the direct debits had been \"legitimately\" set up. The victim\\'s daughter, Sue, contacted the police and Action Fraud, but was told it was unlikely any further action would be taken. Lloyds has now apologised and refunded the money, plus interest and £600 compensation. Vodafone, which received most of the stolen money, said it was looking again at the case and would provide details of an individual who may have used the account to the police.',\n", 493 | " 'gpt-3.5-turbo-0613': \"In this fraud case, Sue discovered that over £14,000 had been stolen from her late mother's bank account through direct debit fraud. Despite reporting the incident to her mother's bank, Lloyds, and contacting the police, Sue received little help or support. However, after the BBC Money Box program intervened, Lloyds conducted a new investigation and admitted their initial response was wrong. They agreed to refund all the stolen money, along with interest and compensation. Vodafone, the company receiving the stolen funds, also stated that they would provide details to the police. 
Sue hopes that someone will be held accountable for the fraud and wants to prevent similar incidents from happening to others.\",\n", 494 | " 'gpt-3.5-turbo-instruct': 'The article discusses a case where a woman\\'s late mother, who suffered from dementia, had over £14,000 stolen through direct debit fraud. Despite the victim\\'s daughter\\'s efforts to seek help from the bank and authorities, she was initially told that the direct debits were \"legitimately\" set up and no refund would be given. However, after further investigation and media involvement, the bank admitted their mistake and refunded the stolen money. The article highlights the need for better safeguards and transparency in the direct debit system to protect vulnerable individuals from financial abuse.',\n", 495 | " 'gpt-4-0314': \"Lloyds Bank has apologized and refunded over £14,000, plus interest and £600 in compensation, to a woman whose late mother's account was targeted by direct debit fraud. The bank initially claimed the direct debits were legitimately set up and refused to refund the money. After being contacted by BBC Money Box, Lloyds conducted a new investigation and admitted its initial response was wrong. Vodafone, which received most of the stolen funds, is also reviewing the case and will provide details of a possible suspect to the police.\",\n", 496 | " 'gpt-4-1106-preview': \"After Sue, the daughter of a late dementia patient named Margaret, discovered that more than £14,000 had been fraudulently taken from her mother's account through direct debits, she faced initial resistance from Lloyds Bank, which claimed the transactions were legitimate. However, following a new investigation prompted by BBC Money Box's contact, Lloyds acknowledged their error and agreed to refund all the stolen money, plus interest and £600 in compensation. Vodafone, which received most of the stolen funds, also agreed to re-examine the case and provide information to the police. 
Sue hopes that by bringing attention to this issue, similar fraud against vulnerable individuals can be prevented in the future.\",\n", 497 | " 'llama-2-13b-chat': \"In conclusion, Sue's mother Margaret had £14,000 stolen through direct debit fraud after she was diagnosed with dementia and moved into a nursing home. The bank, Lloyds, initially refused to refund the money, but after the case was brought to light through the BBC's Money Box program, Lloyds launched a new investigation and concluded that its initial response was wrong. Lloyds will be refunding all of Margaret's money, plus interest, and paying £600 in compensation. The case highlights the dangers of direct debit fraud and the need for greater transparency and accountability in the system to protect vulnerable victims.\",\n", 498 | " 'llama-2-70b-chat': 'The article discusses a case of direct debit fraud where a woman\\'s mother, who had dementia, had over £14,000 stolen from her bank account through fraudulently set up direct debits. The bank, Lloyds, initially refused to refund the money, claiming that the direct debits had been \"legitimately\" set up. However, after the woman\\'s daughter, Sue, contacted the bank again and explained that her mother was unable to set up the direct debits due to her dementia, Lloyds launched a new investigation and decided to refund the money, plus interest and compensation. The article highlights the need for greater transparency and accountability in the direct debit system to prevent such fraud cases.',\n", 499 | " 'llama-2-7b-chat': 'After a BBC Money Box investigation, Lloyds Bank has refunded over £14,000 stolen from the account of a woman with dementia, Margaret. The bank had initially claimed that the direct debits were \"legitimately\" set up, but after further investigation, Lloyds found that the payments were made without Margaret\\'s knowledge or consent. 
The bank has apologized for the distress and inconvenience caused and has refunded all of Margaret\\'s money, plus interest and £600 in compensation. Vodafone, the company that received the majority of the stolen money, is also reviewing the case and will provide the details of an individual who may have used Margaret\\'s account to the police. The case highlights concerns about the lack of transparency in the direct debit system and the need for better safeguards to protect vulnerable people.',\n", 500 | " 'mistral-instruct': \"The article describes a case of direct debit fraud against a vulnerable elderly woman, Margaret, who had dementia and was living in a nursing home. The fraudulent direct debits were set up in her name and resulted in the theft of over £14,000. The article highlights the dangers of direct debit fraud against vulnerable victims and the loopholes and lack of transparency in the current system. The Direct Debit scheme states that it has safeguards in place to protect vulnerable people, but the case highlights the need for these safeguards to be improved. The article concludes that the bank, Lloyds, has now refunded all of Margaret's money, plus interest and compensation, after it was found that its initial response was wrong.\",\n", 501 | " 'text-davinci-002': \"After being contacted by BBC Money Box, Lloyds started a new investigation and concluded that its initial response was wrong. They agreed to refund all of Margaret's money, plus interest and £600 by way of compensation. Vodafone also said they would provide the details of an individual who may have used Margaret's account to the police. Sue is grateful for the refund but hopes that someone will be held accountable for the fraud.\",\n", 502 | " 'text-davinci-003': \"After being contacted by BBC Money Box, Lloyds started a new investigation and concluded that its initial response was wrong. They agreed to refund all of Margaret's money, plus interest and £600 by way of compensation. 
Vodafone also said they would provide the details of an individual who may have used Margaret's account to the police. Sue is grateful for the refund but hopes that someone will be held accountable for the fraud.\"},\n", 503 | " 'article': '\"I was shaking with rage and stress, I couldn\\'t believe this had happened.\" By Dan WhitworthMoney Box reporter Sue is describing the moment she discovered that her late mother, Margaret (not their real names), who\\'d spent the last years of her life battling dementia, had had more than £14,000 stolen through direct debit fraud. \"To be told that that amount of money had been taken... I was outraged that someone could steal off my mother,\" Sue says. And she is not alone in her concern. The charity Action on Elder Abuse is warning about the dangers of direct debits being fraudulently set up in the name of vulnerable victims. The charity says it\\'s concerned about loopholes and a lack of transparency within the current system. But the Direct Debit scheme says its guarantee means companies that use it to take payments directly from customers\\' bank accounts are carefully vetted. \\'Legitimately\\' set up After being diagnosed with dementia in 2010, Margaret moved into a nursing home. Three years later, at a point when Margaret could no longer care for herself, two direct debits were set up using her bank account details. Over the next four years more than £14,000 of Margaret\\'s money was stolen to pay the direct debits and it was only after she died in 2017 that her daughter Sue discovered what had happened. Sue began trying to find out what had happened but was told by her mother\\'s bank, Lloyds, that it had carried out an investigation and it had concluded the direct debits had been \"legitimately\" set up so it would not be refunding any money. 
Most of the money stolen from Margaret\\'s account was used to pay Vodafone, but the company told Sue it was unable to help or provide any details of who was receiving its services because of \"data protection\" rules. Sue also contacted her local police force. It referred her to Action Fraud which said it was unlikely any further action would be taken. Sue described the reaction from her bank as \"disgusting\". \"The whole thing was taking over my life. I didn\\'t know where to go for help, I couldn\\'t sleep. All day long I was on the internet trying to find out who else I could go to for help but there was nothing.\" \"I wrote and explained that my mum couldn\\'t have set up these direct debits. \"I explained she couldn\\'t feed herself, she couldn\\'t go to the bathroom on her own, she was monitored all the time. \"She didn\\'t have the capacity in her mind to think about setting up a direct debit and nobody listened. It was like [I] was being ignored and I had the feeling that because my mum was dead they [Lloyds] couldn\\'t care less.\" Direct Debit offers a guarantee which explains that companies wishing to use it to take payments directly from people\\'s bank accounts have to go through a careful vetting process. A spokesperson for the Direct Debit scheme said: \"The billers [companies] are required to carry out payer verification checks when a Direct Debit Instruction is set up - details of the verification checks used by billers cannot be shared for obvious reasons.\" The safeguards supposedly in place to protect vulnerable people, as well as the loopholes in the system, is something that Veronica Gray from Action on Elder Abuse says need tackling. \"This particular case highlights a lack of transparency in how the system operates. This level of passing the buck when elderly or vulnerable people fall between the gaps is just not good enough. 
\"The Financial Abuse Code of Practice, which is a voluntary code but which many banks have signed up to, is very clear about how financial institutions should treat vulnerable customers. Clearly this has not been used in this case. \"[Bank] staff are struggling to know what signs to look for and clearly don\\'t have the skills to and expertise to identify patterns of abuse when they see them.\" When it was contacted by BBC Money Box, Lloyds started a new investigation which concluded that its initial response was wrong and it would be refunding all of Margaret\\'s money, plus interest and £600 by way of compensation. A Lloyds spokesperson said: \"We were very sorry to hear of the difficulties experienced by Sue when dealing with her late mother\\'s account. While we were not informed back in 2010 that Margaret had moved into a nursing home, it should have been clear when her daughter contacted us in 2017 - following her mother\\'s death - that Margaret would not have been in a position to arrange these Direct Debits. \"We would like to apologise for the distress and inconvenience caused by our handling of this case and have now arranged for a full refund of all the payments.\" Warning signs Vodafone said in a statement that it was also looking again at the case and would be providing the details of an individual who may have used Margaret\\'s account to the police. It added there were a \"wide range of security verification and fraud checks when opening a new account\", but that people can subsequently change the direct debit details. It also said it would welcome any initiative that further strengthened the direct debit system. Whilst Sue is grateful that Lloyds have decided to refund the money stolen from her mother\\'s account she just wants to make sure this can\\'t happen to someone else. \"I really would like someone to be accountable for doing this. 
You know, for the police or somebody to find out who did this - in case they\\'re doing this to somebody else. \"Lloyds should have looked into the fact that this account had laid dormant for years and then all of a sudden this money is coming out of it - surely that would ring a bell, that something\\'s wrong there? \"And once you say someone\\'s in a home with dementia and these things have happened surely that should mean something?\" You can hear more on BBC Radio 4\\'s Money Box programme on Saturday at 12pm or listen again here. Follow Money Box and Dan on twitter.',\n", 504 | " 'requirement': 'Summarize the conclusion of the fraud case.'}" 505 | ] 506 | }, 507 | "execution_count": 14, 508 | "metadata": {}, 509 | "output_type": "execute_result" 510 | } 511 | ], 512 | "source": [ 513 | "system_outputs[0]" 514 | ] 515 | } 516 | ], 517 | "metadata": { 518 | "kernelspec": { 519 | "display_name": "Python 3", 520 | "language": "python", 521 | "name": "python3" 522 | }, 523 | "language_info": { 524 | "codemirror_mode": { 525 | "name": "ipython", 526 | "version": 3 527 | }, 528 | "file_extension": ".py", 529 | "mimetype": "text/x-python", 530 | "name": "python", 531 | "nbconvert_exporter": "python", 532 | "pygments_lexer": "ipython3", 533 | "version": "3.8.17" 534 | } 535 | }, 536 | "nbformat": 4, 537 | "nbformat_minor": 2 538 | } 539 | -------------------------------------------------------------------------------- /prompts/llmcompare.irrelevant.txt: -------------------------------------------------------------------------------- 1 | In this task, you will be provided with a news article, a specific summary requirement, and two summaries. 2 | 3 | The summaries are crafted to meet a specific summary requirement. Note that there may be identical summaries. 
4 | 5 | Your task is to compare the quality of these two summaries concerning whether they include any information that is not relevant to the summary requirement and pick the one that is better (there can be a tie). 6 | First you will give an explanation of your decision then you will provide your decision in the format of 1 or 2 or tie. 7 | 8 | Please refer to the example below for the format of your response. 9 | 10 | Example Response: 11 | Explanation: "Your explanation here". 12 | Decision: 1 or 2 or tie. 13 | 14 | Here are the actual article, the summary requirement, and two summaries: 15 | 16 | Article: 17 | {{Article}} 18 | 19 | Summary Requirement: 20 | {{Requirement}} 21 | 22 | Summary 1: 23 | 24 | {{Summary 1}} 25 | 26 | Summary 2: 27 | 28 | {{Summary 2}} -------------------------------------------------------------------------------- /prompts/llmcompare.missing.txt: -------------------------------------------------------------------------------- 1 | In this task, you will be provided with a news article, a specific summary requirement, and two summaries. 2 | 3 | The summaries are crafted to meet a specific summary requirement. Note that there may be identical summaries. 4 | 5 | Your task is to compare the quality of these two summaries concerning whether they omit any crucial information from the article with respect to the summary requirement and pick the one that is better (there can be a tie). Crucial information refers to key details or facts that are essential to understanding the article and meeting the summary requirement. 6 | First you will give an explanation of your decision then you will provide your decision in the format of 1 or 2 or tie. 7 | 8 | Please refer to the example below for the format of your response. 9 | 10 | Example Response: 11 | Explanation: "Your explanation here". 12 | Decision: 1 or 2 or tie. 
13 | 14 | Here are the actual article, the summary requirement, and two summaries: 15 | 16 | Article: 17 | {{Article}} 18 | 19 | Summary Requirement: 20 | {{Requirement}} 21 | 22 | Summary 1: 23 | 24 | {{Summary 1}} 25 | 26 | Summary 2: 27 | 28 | {{Summary 2}} -------------------------------------------------------------------------------- /prompts/llmcompare.overall.txt: -------------------------------------------------------------------------------- 1 | In this task, you will be provided with a news article, a specific summary requirement, and two summaries. 2 | 3 | The summaries are crafted to meet a specific summary requirement. Note that there may be identical summaries. 4 | 5 | Your task is to compare the overall quality of these two summaries concerning the summary requirement and pick the one that is better (there can be a tie). 6 | First you will give an explanation of your decision then you will provide your decision in the format of 1 or 2 or tie. 7 | 8 | Please refer to the example below for the format of your response. 9 | 10 | Example Response: 11 | Explanation: "Your explanation here". 12 | Decision: 1 or 2 or tie. 13 | 14 | Here are the actual article, the summary requirement, and two summaries: 15 | 16 | Article: 17 | {{Article}} 18 | 19 | Summary Requirement: 20 | {{Requirement}} 21 | 22 | Summary 1: 23 | 24 | {{Summary 1}} 25 | 26 | Summary 2: 27 | 28 | {{Summary 2}} 29 | 30 | Please provide your response. -------------------------------------------------------------------------------- /prompts/llmeval.irrelevant.txt: -------------------------------------------------------------------------------- 1 | In this task, you will be provided with a news article, a specific summary requirement, and a summary. 2 | 3 | Your task is to rate the quality of the summary with a score from 1 to 5, based on whether it includes any information that is not relevant to the summary requirement. Here, 1 is the least amount of irrelevant information and 5 is the most. 
4 | 5 | Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed. 6 | 7 | Example Response: 8 | Evaluation Form (scores ONLY): 9 | - Irrelevant Information (1-5): 3 10 | 11 | Here are the actual article, the summary requirement, and the summary: 12 | 13 | Article: 14 | {{Article}} 15 | 16 | Summary Requirement: 17 | {{Requirement}} 18 | 19 | Summary: 20 | {{SUMMARY}} 21 | 22 | Evaluation Form (scores ONLY): 23 | - Irrelevant Information (1-5): -------------------------------------------------------------------------------- /prompts/llmeval.missing.txt: -------------------------------------------------------------------------------- 1 | In this task, you will be provided with a news article, a specific summary requirement, and a summary. 2 | 3 | Your task is to rate the quality of the summary with a score from 1 to 5, based on whether it omits any crucial information from the article concerning the summary requirement. Crucial information refers to key details or facts that are essential to understanding the article and meeting the summary requirement. Here, 1 is the least amount of missing information and 5 is the most. 4 | 5 | Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed. 
6 | 7 | Example Response: 8 | Evaluation Form (scores ONLY): 9 | - Missing Information (1-5): 3 10 | 11 | Here are the actual article, the summary requirement, and the summary: 12 | 13 | Article: 14 | {{Article}} 15 | 16 | Summary Requirement: 17 | {{Requirement}} 18 | 19 | Summary: 20 | {{SUMMARY}} 21 | 22 | Evaluation Form (scores ONLY): 23 | - Missing Information (1-5): -------------------------------------------------------------------------------- /prompts/llmeval.overall.txt: -------------------------------------------------------------------------------- 1 | In this task, you will be provided with a news article, a specific summary requirement, and a summary. 2 | 3 | Your task is to rate the overall quality of the summary with a score from 1 to 5 concerning the summary requirement, where 1 is the lowest and 5 is the highest. 4 | 5 | Please make sure you read and understand these instructions carefully. Please keep this document open while reviewing, and refer to it as needed. 6 | 7 | Example Response: 8 | Evaluation Form (scores ONLY): 9 | - Overall Quality (1-5): 3 10 | 11 | Here are the actual article, the summary requirement, and the summary: 12 | 13 | Article: 14 | {{Article}} 15 | 16 | Summary Requirement: 17 | {{Requirement}} 18 | 19 | Summary: 20 | {{SUMMARY}} 21 | 22 | Evaluation Form (scores ONLY): 23 | - Overall Quality (1-5): -------------------------------------------------------------------------------- /prompts/llmrank.irrelevant.txt: -------------------------------------------------------------------------------- 1 | In this task, you will be provided with a news article, a specific summary requirement, and a list of summaries numbered as follows: 1. Summary 1, 2. Summary 2, and so on. 2 | 3 | The summaries are crafted to meet a specific summary requirement. Note that there may be identical summaries within the list. 
4 | 5 | Your task is to evaluate and rank the summaries in ascending order of their quality concerning whether they include any information that is not relevant to the summary requirement. The ranking should be a number between 1 and 5, where 1 indicates the least amount of irrelevant information and 5 indicates the most. First, you will explain your ranking, and then you will provide the ranking of each summary. 6 | 7 | Note: In case of a tie, do not skip a rank. For example, if Summary 1 has ranking 1 and Summary 2 and 3 both have ranking 2, then Summary 4 should be assigned a ranking of 3, not 4. 8 | 9 | Please refer to the example below for the format of your response. 10 | 11 | Example Response: 12 | Explanation: "Your explanation of the ranking." 13 | Ranking: "The ranking, e.g., 1, 2, 2, 3, 4." 14 | 15 | Here are the actual article, the summary requirement, and the summaries: 16 | 17 | Article: 18 | {{Article}} 19 | 20 | Summary Requirement: 21 | {{Requirement}} 22 | 23 | Summaries: 24 | 25 | 1. Summary 1: {{Summary 1}} 26 | 27 | 2. Summary 2: {{Summary 2}} 28 | 29 | 3. Summary 3: {{Summary 3}} 30 | 31 | 4. Summary 4: {{Summary 4}} 32 | 33 | 5. Summary 5: {{Summary 5}} -------------------------------------------------------------------------------- /prompts/llmrank.missing.txt: -------------------------------------------------------------------------------- 1 | In this task, you will be provided with a news article, a specific summary requirement, and a list of summaries numbered as follows: 1. Summary 1, 2. Summary 2, and so on. 2 | 3 | The summaries are crafted to meet a specific summary requirement. Note that there may be identical summaries within the list. 4 | 5 | Your task is to evaluate and rank the summaries in ascending order of their quality concerning whether they omit any crucial information from the article with respect to the summary requirement. 
Crucial information refers to key details or facts that are essential to understanding the article and meeting the summary requirement. The ranking should be a number between 1 and 5, where 1 indicates the least amount of missing information and 5 indicates the most. First, you will explain your ranking, and then you will provide the ranking of each summary. 6 | 7 | Note: In case of a tie, do not skip a rank. For example, if Summary 1 has ranking 1 and Summary 2 and 3 both have ranking 2, then Summary 4 should be assigned a ranking of 3, not 4. 8 | 9 | Please refer to the example below for the format of your response. 10 | 11 | Example Response: 12 | Explanation: "Your explanation of the ranking." 13 | Ranking: "The ranking, e.g., 1, 2, 2, 3, 4." 14 | 15 | Here are the actual article, the summary requirement, and the summaries: 16 | 17 | Article: 18 | {{Article}} 19 | 20 | Summary Requirement: 21 | {{Requirement}} 22 | 23 | Summaries: 24 | 25 | 1. Summary 1: {{Summary 1}} 26 | 27 | 2. Summary 2: {{Summary 2}} 28 | 29 | 3. Summary 3: {{Summary 3}} 30 | 31 | 4. Summary 4: {{Summary 4}} 32 | 33 | 5. Summary 5: {{Summary 5}} -------------------------------------------------------------------------------- /prompts/llmrank.overall.txt: -------------------------------------------------------------------------------- 1 | In this task, you will be provided with a news article, a specific summary requirement, and a list of summaries numbered as follows: 1. Summary 1, 2. Summary 2, and so on. 2 | 3 | The summaries are crafted to meet a specific summary requirement. Note that there may be identical summaries within the list. 4 | 5 | Your task is to evaluate and rank the summaries in ascending order of their overall quality concerning the summary requirement. First, you will explain your ranking, and then you will provide the ranking of each summary. The ranking should be a number between 1 and 5, where 1 is the best and 5 is the worst. 
6 | 7 | Note: In case of a tie, do not skip a rank. For example, if Summary 1 has ranking 1 and Summary 2 and 3 both have ranking 2, then Summary 4 should be assigned a ranking of 3, not 4. 8 | 9 | Please refer to the example below for the format of your response. 10 | 11 | Example Response: 12 | Explanation: "Your explanation of the ranking." 13 | Ranking: "The ranking, e.g., 1, 2, 2, 3, 4." 14 | 15 | Here are the actual article, the summary requirement, and the summaries: 16 | 17 | Article: 18 | {{Article}} 19 | 20 | Summary Requirement: 21 | {{Requirement}} 22 | 23 | Summaries: 24 | 25 | 1. Summary 1: {{Summary 1}} 26 | 27 | 2. Summary 2: {{Summary 2}} 28 | 29 | 3. Summary 3: {{Summary 3}} 30 | 31 | 4. Summary 4: {{Summary 4}} 32 | 33 | 5. Summary 5: {{Summary 5}} -------------------------------------------------------------------------------- /prompts/llmscore.irrelevant.txt: -------------------------------------------------------------------------------- 1 | Answer the question based on the following article, a specific summary requirement, and a summary. 2 | Question: Does the summary include any information that is not relevant to the summary requirement? (a). Yes. (b). No. 3 | 4 | Article: 5 | {{Article}} 6 | 7 | Summary Requirement: 8 | {{Requirement}} 9 | 10 | Summary: 11 | {{SUMMARY}} 12 | 13 | Answer: Yes -------------------------------------------------------------------------------- /prompts/llmscore.missing.txt: -------------------------------------------------------------------------------- 1 | Answer the question based on the following article, a specific summary requirement, and a summary. 2 | Question: Does the summary omit any crucial information from the article concerning the summary requirement? Crucial information refers to key details or facts that are essential to understanding the article and meeting the summary requirement. (a). Yes. (b). No. 
3 | 4 | Article: 5 | {{Article}} 6 | 7 | Summary Requirement: 8 | {{Requirement}} 9 | 10 | Summary: 11 | {{SUMMARY}} 12 | 13 | Answer: Yes -------------------------------------------------------------------------------- /prompts/llmscore.overall.txt: -------------------------------------------------------------------------------- 1 | Answer the question based on the following article, a specific summary requirement, and a summary. 2 | Question: Is the summary of good overall quality in relation to both the article and the summary requirement? (a). Yes. (b). No. 3 | 4 | Article: 5 | {{Article}} 6 | 7 | Summary Requirement: 8 | {{Requirement}} 9 | 10 | Summary: 11 | {{SUMMARY}} 12 | 13 | Answer: Yes --------------------------------------------------------------------------------
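The prompt files above share a small set of placeholders (`{{Article}}`, `{{Requirement}}`, `{{SUMMARY}}`, `{{Summary 1}}` ... `{{Summary 5}}`) and a few fixed response formats (`Decision: ...`, `Ranking: ...`, and a 1-5 score). The following is a minimal sketch of how one might fill a template and parse a response. It is not the authors' evaluation code: the helper names `fill_prompt`, `parse_decision`, `parse_ranking`, and `parse_score` are hypothetical, and the parsing rules are assumptions inferred only from the "Example Response" sections of the prompts.

```python
import re


def fill_prompt(template: str, fields: dict) -> str:
    """Replace every {{Name}} placeholder in a prompt template.

    Substitution happens in a single pass, so text inside the article or
    summaries that happens to contain braces is not substituted again.
    """
    def substitute(match):
        name = match.group(1)
        if name not in fields:
            raise KeyError(f"no value supplied for placeholder {name!r}")
        return fields[name]

    return re.sub(r"\{\{(.+?)\}\}", substitute, template)


def parse_decision(response: str):
    """Parse an llmcompare-style response; returns '1', '2', 'tie', or None."""
    match = re.search(r"Decision:\s*(1|2|tie)\b", response, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None


def parse_ranking(response: str, n_summaries: int = 5):
    """Parse an llmrank-style response; returns a list of ints or None."""
    match = re.search(r"Ranking:\s*(.*)", response)
    if not match:
        return None
    ranks = [int(token) for token in re.findall(r"\d+", match.group(1))]
    # Assumption: a response with the wrong number of ranks is treated as invalid.
    return ranks if len(ranks) == n_summaries else None


def parse_score(response: str):
    """Parse an llmeval-style response; returns an int in 1..5 or None."""
    # Drop the literal "(1-5)" in case the model echoes the form line.
    cleaned = response.replace("(1-5)", "")
    match = re.search(r"\b([1-5])\b", cleaned)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Abbreviated stand-in for a prompt file, not the full template text.
    template = (
        "Article:\n{{Article}}\n\n"
        "Summary Requirement:\n{{Requirement}}\n\n"
        "Summary:\n{{SUMMARY}}\n"
    )
    print(fill_prompt(template, {
        "Article": "Some article text.",
        "Requirement": "Summarize the conclusion of the fraud case.",
        "SUMMARY": "Some summary text.",
    }))
```

In practice the template string would be read from one of the files under `prompts/`, and the filled prompt would be sent to whichever LLM is being used as the evaluator. Responses that do not follow the expected format yield `None` here, so the caller can decide whether to retry or discard them.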