├── qa_models
│   └── readme.md
├── requirements.txt
├── bart_qg
│   └── train.sh
├── README.md
├── run_feqa.ipynb
└── feqa.py

--------------------------------------------------------------------------------
/qa_models/readme.md:
--------------------------------------------------------------------------------
Download the squad1.0 folder from the Google Drive link and place it under this folder.

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
Cython==0.29.15
numpy==1.19.1
benepar==0.1.2
torch==1.5.0
fairseq==0.9.0
nltk==3.5
spacy==2.3.2
tensorflow==1.15.0
transformers==2.8.0

--------------------------------------------------------------------------------
/bart_qg/train.sh:
--------------------------------------------------------------------------------
TOTAL_NUM_UPDATES=20000
WARMUP_UPDATES=500
LR=3e-05
MAX_TOKENS=2048
UPDATE_FREQ=4
BART_PATH=~/models/bart_pretrained/bart.large/model.pt

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 fairseq-train ~/data/qg_data/bin \
    --restore-file $BART_PATH \
    --max-tokens $MAX_TOKENS \
    --task translation \
    --source-lang src --target-lang tgt \
    --truncate-source \
    --layernorm-embedding \
    --share-all-embeddings \
    --share-decoder-input-output-embed \
    --reset-optimizer --reset-dataloader --reset-meters \
    --required-batch-size-multiple 1 \
    --arch bart_large \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.01 --optimizer adam --adam-betas "(0.9, 0.999)" --adam-eps 1e-08 \
    --clip-norm 0.1 \
    --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
    --fp16 --update-freq $UPDATE_FREQ \
    --skip-invalid-size-inputs-valid-test \
    --no-epoch-checkpoints \
    --no-last-checkpoints \
    --find-unused-parameters;

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization

This repository contains code for the paper

> **FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization**.
> Esin Durmus, He He and Mona Diab
> In proceedings of ACL 2020.
> https://www.aclweb.org/anthology/2020.acl-main.454/

## Dependencies
- Python 3.6

Install all Python packages: `pip install -r requirements.txt`.

## Data
The faithfulness annotations we collected for CNNDM and XSum will be added soon.

## Code
Trained models for the question generation and question answering systems are under [this drive](https://drive.google.com/drive/folders/1GrnfJxaK35O2IEevv4VbiwYSwxBQVI2X?usp=sharing).

1. Download the **squad1.0** folder from Google Drive and place it under the **qa_models** directory.
2. Download the **checkpoints** folder and place it under the **bart_qg** directory.

**feqa.py**: contains the code to run the FEQA pipeline (question generation, question answering, and metric computation).

See the **run_feqa.ipynb** notebook for a worked example of running the pipeline on a set of documents and output summaries, and the minimal sketch below.
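For quick reference, here is a minimal usage sketch distilled from the notebook. It assumes the model folders above are in place and that the spaCy model `en_core_web_sm` has been installed (`python -m spacy download en_core_web_sm`); the document and summary strings are placeholders, not real data.

```python
import benepar
import nltk
from feqa import FEQA

# One-time downloads for the constituency parser and NLTK stopwords.
benepar.download('benepar_en2')
nltk.download('stopwords')

scorer = FEQA(use_gpu=False)  # set use_gpu=True to run the BART QG model on GPU

documents = ["<source document text>"]   # one entry per source document
summaries = ["<system summary text>"]    # aligned one-to-one with documents

# Returns one F1-based faithfulness score per summary;
# pass aggregate=True to get the mean over all summaries instead.
scores = scorer.compute_score(documents, summaries, aggregate=False)
```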
## Reference
If you use our code or annotations, please cite our paper:
```
@inproceedings{durmus-etal-2020-feqa,
    title = "{FEQA}: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization",
    author = "Durmus, Esin  and
      He, He  and
      Diab, Mona",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.454",
    doi = "10.18653/v1/2020.acl-main.454",
    pages = "5055--5070",
    abstract = "Neural abstractive summarization models are prone to generate content inconsistent with the source document, i.e. unfaithful. Existing automatic metrics do not capture such mistakes effectively. We tackle the problem of evaluating faithfulness of a generated summary given its source document. We first collected human annotations of faithfulness for outputs from numerous models on two datasets. We find that current models exhibit a trade-off between abstractiveness and faithfulness: outputs with less word overlap with the source document are more likely to be unfaithful. Next, we propose an automatic question answering (QA) based metric for faithfulness, FEQA, which leverages recent advances in reading comprehension. Given question-answer pairs generated from the summary, a QA model extracts answers from the document; non-matched answers indicate unfaithful information in the summary. Among metrics based on word overlap, embedding similarity, and learned language understanding models, our QA-based metric has significantly higher correlation with human faithfulness scores, especially on highly abstractive summaries.",
}
```

## Contact
If you have any questions or comments, you can send an email to ed459[at]cornell[dot]edu.
--------------------------------------------------------------------------------
/run_feqa.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import benepar\n",
    "import nltk\n",
    "from feqa import FEQA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package benepar_en2 to\n",
      "[nltk_data] /Users/esindurmus/nltk_data...\n",
      "[nltk_data] Package benepar_en2 is already up-to-date!\n",
      "[nltk_data] Downloading package stopwords to\n",
      "[nltk_data] /Users/esindurmus/nltk_data...\n",
      "[nltk_data] Package stopwords is already up-to-date!\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "benepar.download('benepar_en2')\n",
    "nltk.download('stopwords')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: en_core_web_sm==2.1.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz#egg=en_core_web_sm==2.1.0 in /opt/miniconda3/lib/python3.7/site-packages (2.1.0)\n",
      "\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n",
      "You can now load the model via spacy.load('en_core_web_sm')\n"
     ]
    }
   ],
   "source": [
    "!python -m spacy download en_core_web_sm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "loading archive file ./bart_qg/checkpoints/\n",
      "| [src] dictionary: 50264 types\n",
      "| [tgt] dictionary: 50264 types\n",
      "WARNING:tensorflow:From /opt/miniconda3/lib/python3.7/site-packages/benepar/base_parser.py:197: The name tf.GraphDef is deprecated. Please use tf.compat.v1.GraphDef instead.\n",
      "\n",
      "WARNING:tensorflow:From /opt/miniconda3/lib/python3.7/site-packages/benepar/base_parser.py:202: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "scorer = FEQA(use_gpu=False)"
   ]
  },
\\\n", 118 | " Born on March 5, 1898, the greatgrandmother had lived through two world \\\n", 119 | " wars, the invention of the television and the \\\n", 120 | " first successful powered aeroplane.\", \n", 121 | " \"The world's oldest person has died a \\\n", 122 | " few weeks after celebrating her 117th birthday. \\\n", 123 | " Born on March 5, 1898, the greatgrandmother had lived through two world \\\n", 124 | " wars, the invention of the television and the \\\n", 125 | " first successful powered aeroplane.\"]\n", 126 | "summaries = [\n", 127 | " \"The world's oldest person died in 1898\",\n", 128 | " \"The world's oldest person died after her 117th birthday\"]\n", 129 | "scorer.compute_score(documents, summaries, aggregate=False)" 130 | ] 131 | } 132 | ], 133 | "metadata": { 134 | "kernelspec": { 135 | "display_name": "Python 3", 136 | "language": "python", 137 | "name": "python3" 138 | }, 139 | "language_info": { 140 | "codemirror_mode": { 141 | "name": "ipython", 142 | "version": 3 143 | }, 144 | "file_extension": ".py", 145 | "mimetype": "text/x-python", 146 | "name": "python", 147 | "nbconvert_exporter": "python", 148 | "pygments_lexer": "ipython3", 149 | "version": "3.7.4" 150 | } 151 | }, 152 | "nbformat": 4, 153 | "nbformat_minor": 4 154 | } 155 | -------------------------------------------------------------------------------- /feqa.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | import nltk 3 | import spacy 4 | import benepar 5 | import torch 6 | import json 7 | import os 8 | import subprocess 9 | import numpy as np 10 | from tempfile import TemporaryDirectory 11 | from fairseq.models.bart import BARTModel 12 | from nltk import sent_tokenize,word_tokenize 13 | from nltk.corpus import stopwords 14 | from nltk.tree import Tree 15 | from nltk.tree import ParentedTree 16 | from benepar.spacy_plugin import BeneparComponent 17 | from collections import defaultdict, Counter 18 | 19 | 20 | class FEQA(object): 21 | def __init__(self, squad_dir='./qa_models/squad1.0', bart_qa_dir='./bart_qg/checkpoints/', use_gpu=False): 22 | self.qg_model = BARTModel.from_pretrained( 23 | bart_qa_dir, 24 | checkpoint_file = 'checkpoint_best.pt' 25 | ) 26 | 27 | if use_gpu: 28 | self.qg_model.cuda() 29 | self.qg_model.half() 30 | self.qg_model.eval() 31 | 32 | self.batch_size = 64 33 | self.beam_size = 10 34 | self.max_length = 100 35 | 36 | self.nlp = spacy.load('en_core_web_sm') 37 | self.parser = benepar.Parser("benepar_en2") 38 | self.stop_words = set(stopwords.words('english')) 39 | 40 | self.squad_cmd = ['python {}/run_squad.py'.format(squad_dir), 41 | '--model_type bert', 42 | '--model_name_or_path {}'.format(squad_dir), 43 | '--do_eval', 44 | '--overwrite_cache', 45 | '--do_lower_case', 46 | '--predict_file {}', 47 | '--per_gpu_train_batch_size 12', 48 | '--max_seq_length 384', 49 | '--doc_stride 128', 50 | '--output_dir {}'] 51 | 52 | self.squad_cmd = ' '.join(self.squad_cmd) 53 | 54 | def _get_entities(self, output_summary): 55 | entities = [X.text for X in self.nlp(output_summary).ents] 56 | return entities 57 | 58 | 59 | def _get_masked_phrases(self, output_summary, phrase_types=["NP"]): 60 | masked_phrases = [] 61 | parse_tree = self.parser.parse(output_summary) 62 | for subtree in parse_tree.subtrees(): 63 | phrases_list = [(subtree_.leaves(), subtree_.label()) for subtree_ in subtree if type(subtree_) == Tree and subtree_.label() in phrase_types] 64 | for phrase_tuple in phrases_list: 65 | phrase = phrase_tuple[0] 66 | phrase_type 
    def _generate_questions(self, summaries, entities=True, phrase_types=["NP"]):
        doc_ids = []
        qa_masks = []
        tokenized_phrases = []

        for id_, summary in enumerate(summaries):
            summary = summary.strip()
            all_masked_phrases = []
            if entities:
                all_masked_phrases.extend(self._get_entities(summary))
            all_masked_phrases.extend(self._get_masked_phrases(summary, phrase_types))
            all_masked_phrases = list(set(all_masked_phrases))

            for i, masked_phrase in enumerate(all_masked_phrases):
                tokenized_summary = " ".join(nltk.word_tokenize(summary.lower()))
                tokenized_phrase = " ".join(nltk.word_tokenize(masked_phrase.lower()))

                qa_masks.append(tokenized_summary + " [SEP] " + tokenized_phrase)
                doc_ids.append(str(id_))
                tokenized_phrases.append(tokenized_phrase)

        # Decode questions in batches with beam search.
        questions = []
        for i in range(0, len(qa_masks), self.batch_size):
            batch = qa_masks[i:i + self.batch_size]
            hypotheses = self.qg_model.sample(batch, beam=self.beam_size, lenpen=1.0, max_len_b=self.max_length, min_len=1, no_repeat_ngram_size=3)
            questions.extend(hypotheses)

        return doc_ids, questions, tokenized_phrases

    def _convert_to_squad_format(self, gold_answers, questions, doc_ids, documents):
        # Pack each source document as a SQuAD v1.1 "paragraph" whose questions
        # are the generated ones; the gold answer is the masked summary phrase.
        squad_format = {"data": []}

        id_questions = defaultdict(list)
        id_gold_answers = defaultdict(str)

        for idx in range(0, len(doc_ids)):
            id_questions[doc_ids[idx].strip()].append((questions[idx], gold_answers[idx]))

        for idx in id_questions:
            paragraphs = []
            context = documents[int(idx)].strip()

            title = "doc_" + str(idx)

            questions_list_input = []
            for q_id, question in enumerate(id_questions[idx]):
                gold_answer = question[1]
                question_text = question[0]
                answers_input = [{"text": gold_answer, "answer_start": 0}]
                questions_input = {
                    "question": question_text,
                    "answers": answers_input,
                    "id": str(idx).strip() + "-" + str(q_id)
                }
                questions_list_input.append(questions_input)
                id_gold_answers[questions_input["id"]] = gold_answer

            paragraphs.append({"context": " ".join(nltk.word_tokenize(context)).lower(), "qas": questions_list_input})
            squad_format["data"].append({"title": title, "paragraphs": paragraphs})

        squad_format["version"] = "1.1"

        return id_gold_answers, squad_format

    def _run_squad(self, squad_input):
        with TemporaryDirectory() as tmpdir:
            squad_input_file = os.path.join(tmpdir, 'squad_input.json')
            with open(squad_input_file, 'w') as fout:
                json.dump(squad_input, fout)
            cmd = self.squad_cmd.format(squad_input_file, tmpdir)
            ret = subprocess.check_output(cmd, shell=True)

            with open(os.path.join(tmpdir, 'predictions_.json')) as fin:
                squad_output = json.load(fin)

        return squad_output
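    # Illustrative worked example (ours, not in the original source): for a
    # gold answer "her 117th birthday" and a predicted answer "117th birthday",
    # _compute_f1 below counts 2 overlapping tokens, giving
    #     precision = 2/2 = 1.0, recall = 2/3,
    #     F1 = 2 * (1.0 * 2/3) / (1.0 + 2/3) = 0.8.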
    def _compute_f1(self, a_gold, a_pred):
        # Token-level F1 between gold and predicted answers,
        # as in the official SQuAD evaluation.
        gold_toks = nltk.word_tokenize(a_gold)
        pred_toks = nltk.word_tokenize(a_pred)
        common = Counter(gold_toks) & Counter(pred_toks)
        num_same = sum(common.values())
        if len(gold_toks) == 0 or len(pred_toks) == 0:
            return int(gold_toks == pred_toks)
        if num_same == 0:
            return 0
        precision = 1.0 * num_same / len(pred_toks)
        recall = 1.0 * num_same / len(gold_toks)
        f1 = (2 * precision * recall) / (precision + recall)
        return f1

    def _compute_f1_list(self, a_gold_list, a_pred_list):
        f1_list = []
        for a_gold, a_pred in zip(a_gold_list, a_pred_list):
            f1_list.append(self._compute_f1(a_gold, a_pred))
        return np.mean(f1_list)

    def compute_score(self, documents, summaries, aggregate=False):
        # Generate questions from the summaries.
        print("Generating questions...")
        doc_ids, questions, gold_answers = self._generate_questions(summaries)
        print("Getting answers...")
        # Run the QA system against the source documents.
        gold_answers_dict, squad_format = self._convert_to_squad_format(gold_answers, questions, doc_ids, documents)
        predictions_dict = self._run_squad(squad_format)

        doc_questions = defaultdict(dict)
        print("Computing metrics...")
        for qa_id in gold_answers_dict:
            doc_id, question_id = qa_id.split("-")
            prediction = predictions_dict[qa_id]
            if doc_id in doc_questions:
                doc_questions[doc_id]["preds"].append(prediction)
                doc_questions[doc_id]["gold"].append(gold_answers_dict[qa_id])
            else:
                doc_questions[doc_id] = {"preds": [prediction], "gold": [gold_answers_dict[qa_id]]}

        doc_f1 = defaultdict(float)

        for idx in range(0, len(documents)):
            idx = str(idx)
            try:
                f1 = self._compute_f1_list(doc_questions[idx]["gold"], doc_questions[idx]["preds"])
                doc_f1[idx] = f1
            except KeyError:
                # No questions were generated for this document.
                doc_f1[idx] = 0

        for id_, summary in enumerate(summaries):
            if str(id_) not in doc_f1:
                doc_f1[str(id_)] = 0

        if aggregate:
            return np.mean(list(doc_f1.values()))
        else:
            # Sort numerically: keys are stringified document indices, and a
            # plain lexicographic sort would misorder "10" before "2".
            return [doc_f1[k] for k in sorted(doc_f1.keys(), key=int)]
--------------------------------------------------------------------------------