# Summarizing Summarization
![](img/summarizing.jpg)

# EMNLP 2019
## Datasets

## Abstractive models


### [Text Summarization with Pretrained Encoders](https://arxiv.org/abs/1908.08345)
Liu and Lapata introduce a novel document-level encoder based on BERT for both extractive and abstractive summarization.
For extractive summarization, the model is built on top of this encoder by stacking several inter-sentence Transformer layers.
For abstractive summarization, they use a standard encoder-decoder architecture with the encoder initialized with BERT, and they introduce a two-staged fine-tuning approach that further boosts the quality of the generated summaries. To alleviate the mismatch between the pretrained encoder and the randomly initialized decoder, they design a new fine-tuning schedule that separates the optimizers of the encoder and the decoder.
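The separate-optimizer trick is easy to picture in code. Below is a minimal PyTorch sketch of the idea, not the authors' implementation: the model stand-ins and the learning-rate/warmup values are purely illustrative. The pretrained encoder gets a small peak learning rate with a long warmup, while the fresh decoder gets a larger rate with a shorter warmup, so the decoder can stabilize without distorting the encoder.

```python
import torch
import torch.nn as nn

# Toy stand-ins: in the paper the encoder is a pretrained BERT and the
# decoder is a randomly initialized Transformer decoder.
encoder = nn.Linear(768, 768)
decoder = nn.Linear(768, 30522)

def noam_lr(step, warmup, peak_lr):
    # Linear warmup followed by inverse-square-root decay.
    return peak_lr * min(step ** -0.5, step * warmup ** -1.5)

enc_opt = torch.optim.Adam(encoder.parameters())
dec_opt = torch.optim.Adam(decoder.parameters())

for step in range(1, 1001):
    x = torch.randn(8, 768)                   # dummy batch
    loss = decoder(encoder(x)).pow(2).mean()  # dummy loss
    loss.backward()
    # The trick: two optimizers with different schedules, gentler for
    # the pretrained encoder than for the randomly initialized decoder.
    for g in enc_opt.param_groups:
        g["lr"] = noam_lr(step, warmup=20_000, peak_lr=2e-3)
    for g in dec_opt.param_groups:
        g["lr"] = noam_lr(step, warmup=10_000, peak_lr=0.1)
    enc_opt.step()
    dec_opt.step()
    enc_opt.zero_grad()
    dec_opt.zero_grad()
```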
## Extractive models


## Evaluation

### [Earlier Isn't Always Better: Submodular Analysis on Corpus and System Biases in Summarization](https://www.aclweb.org/anthology/D19-1327.pdf)

The authors conduct an impressive number of experiments to analyse system and corpus biases in summarization along three sub-aspects: position, importance, and diversity. They evaluate state-of-the-art abstractive and extractive models on summarization corpora from various domains (e.g., news, academic papers, meeting minutes, movie scripts, books, posts). The paper offers a lot of useful analysis: for example, position exhibits substantial bias in news articles, but much less in academic papers and meeting minutes. Overall, this study provides useful lessons about which underlying sub-aspects to consider when collecting a new summarization dataset or developing a new system.




# ACL 2019

## New Data, More Data

### [BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization](http://arxiv.org/abs/1906.03741)

The authors introduce a novel dataset consisting of 1.3 million U.S. patent documents along with human-written abstractive summaries.

Characteristics:

- summaries contain a richer discourse structure with more recurring entities;

- longer input sequences (3,572.8 words on average vs. 789.9 for CNN/DM);

- salient content is evenly distributed in the input, while in popular news-based datasets it often concentrates in the first few sentences;

- fewer and shorter extractive fragments are present in the summaries.

The authors report results for various extractive and abstractive models on CNN/DM, NYT, and BIGPATENT. What seems really interesting is the divergence of results: [PointGen](http://arxiv.org/abs/1704.04368) compares favorably against the unsupervised extractive model TextRank on the news-based datasets, while obtaining worse results on BIGPATENT. This shows once more the importance of testing models on several, different datasets.

### [Multi-News: a Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model](http://arxiv.org/abs/1906.01749)

The authors present the first large-scale multi-document news summarization dataset. It consists of input articles from over 1,500 different websites along with 56,216 professional summaries of these articles, obtained from the site [newser.com](https://www.newser.com). Additionally, the authors propose an end-to-end model that achieves competitive results under both automatic and human evaluation on various multi-document datasets, including Multi-News.

## Multimodal Summarization

### [Talk-Summ: A Dataset and Scalable Annotation Method for Scientific Paper Summarization Based on Conference Talks](http://arxiv.org/abs/1906.01351)

Thanks to the recent trend of publishing videos of talks at academic conferences, Lev et al. collected 1,716 paper/video pairs and treat each video transcript as the summary of the corresponding paper. The proposed method of generating training data for scientific paper summarization is fully automatic, so the amount of training data should grow directly with the increasing rate of publication in the near future, without much additional effort. This would definitely help in keeping up with the impressive publication rate in NLP and other scientific fields!

### [Multimodal Abstractive Summarization for How2 Videos](http://arxiv.org/abs/1906.07901)

The authors explore the behaviour of several models for video summarization on the How2 dataset. They propose a multimodal approach using automatic transcripts, audio, and video latent representations, along with combinations of them through hierarchical attention. For the evaluation, in addition to ROUGE, the authors propose a variant that does not account for stopwords. Interestingly enough, the presented models include a video-only summarization model that performs competitively with a text-only model.

## Extractive models

### [Improving the Similarity Measure of Determinantal Point Processes for Extractive Multi-Document Summarization](http://arxiv.org/abs/1906.00072)

The authors propose to tackle multi-document summarization with Determinantal Point Processes (DPP), a learned extractive method, combined with capsule network components. Motivation: TF-IDF vectors fall short when it comes to modeling semantic similarity, a shortcoming that is particularly problematic for multi-document summarization. Solution: a similarity measure for pairs of sentences such that semantically similar sentences can receive high scores despite having very few words in common. The capsule network is trained in a binary classification setup on a dataset derived from CNN/DM: the authors map abstract sentences to the most similar article sentences (label = true) and use negative sampling (label = false).
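To make the DPP part concrete, here is a self-contained toy sketch, our illustration rather than the paper's trained system: the quality scores and the similarity kernel are fixed by hand, whereas the paper learns them (with the similarity coming from the capsule network). A DPP scores a set of sentences by the determinant of a kernel built from per-sentence quality and pairwise similarity, so redundant pairs shrink the determinant and diverse, high-quality sets win; greedy MAP inference then picks sentences one at a time.

```python
import numpy as np

def greedy_dpp(quality, sim, k):
    """Greedy MAP inference for an L-ensemble DPP.

    L = diag(quality) @ sim @ diag(quality): the determinant of a subset
    of L grows with sentence quality and shrinks when two selected
    sentences are similar, which enforces diversity.
    """
    L = np.outer(quality, quality) * sim
    selected = []
    for _ in range(k):
        best, best_logdet = None, -np.inf
        for i in range(len(quality)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best, best_logdet = i, logdet
        if best is None:
            break
        selected.append(best)
    return selected

# Sentences 0 and 1 are near-duplicates; 2 is distinct but lower quality.
quality = np.array([0.9, 0.85, 0.6])
sim = np.array([[1.0, 0.95, 0.1],
                [0.95, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
print(greedy_dpp(quality, sim, k=2))  # [0, 2]: diversity beats raw quality
```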
### [Self-Supervised Learning for Contextualized Extractive Summarization](http://arxiv.org/abs/1906.04466)

A method to train an extractive model in a self-supervised fashion: the sentence encoder is first pretrained on auxiliary tasks involving neighbouring sentences (next-sentence entailment, sentence replacement, and sentence switching). This allows for faster training and yields a slight improvement on CNN/DM. The proposed method could also lead to longer text representations learned in a self-supervised manner.

### [Answering while Summarizing: Multi-task Learning for Multi-hop QA with Evidence Extraction](http://arxiv.org/abs/1905.08511)

The study focuses on HotpotQA, an explainable multi-hop QA task: the system returns the answer together with the evidence sentences, obtained by reasoning over and gathering disjoint pieces of the reference texts. The proposed Query Focused Extractor (QFE) is inspired by the extractive summarization model of [Chen and Bansal](http://arxiv.org/abs/1805.11080). Instead of covering the important information in the source document with an extractive summary, the approach covers the question with the extracted evidence. The model compares favorably with the SOTA BERT-based model on evidence retrieval in the HotpotQA distractor setting, while not benefiting from any pretraining. In addition, it achieves SOTA performance on the [FEVER](https://arxiv.org/abs/1803.05355) dataset.

### [Sentence Centrality Revisited for Unsupervised Summarization](http://arxiv.org/abs/1906.03508)

The authors revisit classic unsupervised extractive summarization with graph-based ranking approaches, where the nodes are the sentences of a document, and leverage BERT to encode each sentence. One motivation is that popular supervised approaches are limited by their need for large-scale datasets and thus do not generalize well to other domains and languages. The model performs comparably to SOTA approaches on the popular CNN/DM and NYT datasets, as well as on TTNews, a Chinese news summarization corpus, showing its capability to adapt to different domains. A human assessment is conducted, based on a set of questions posed on the gold summary, evaluating how much relevant information is present in the generated summary under evaluation. The application to sentence selection in multi-document summarization is suggested as future work.
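As a rough sketch of the centrality idea (ours, and much simplified: plain undirected cosine centrality, whereas the paper makes the graph directed and position-aware, and uses BERT embeddings rather than the random vectors below):

```python
import numpy as np

def centrality_select(emb, k):
    """Unsupervised extraction: pick the k most central sentences.

    emb: (n_sentences, dim) sentence embeddings (BERT in the paper).
    A sentence's centrality is the sum of its cosine similarities to
    every other sentence in the document.
    """
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T
    np.fill_diagonal(sim, 0.0)    # ignore self-similarity
    centrality = sim.sum(axis=1)
    top = np.argsort(-centrality)[:k]
    return sorted(top.tolist())   # restore document order

# Toy usage with random vectors standing in for real embeddings:
rng = np.random.default_rng(0)
print(centrality_select(rng.normal(size=(10, 32)), k=3))
```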
### [HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization](http://arxiv.org/abs/1905.06566)

HIBERT stands for Hierarchical BERT. The idea is to use two pretrained Transformers (see figure below): the first, a standard BERT working at the token level, is used to represent the sentences; the second works at the sentence level, building on the representations from the former to encode the sentences of an entire document. Following BERT's masked pretraining method, the authors train the sentence-level Transformer by masking some sentences of the documents, and the final model achieves SOTA for summarization on the CNN/DM and NYT datasets. The authors also report informative ablations, using out-of-domain data, in-domain data, and a combination thereof for the pretraining. As a cherry on the cake, they adapt BERT to extractive supervised summarization (i.e. fine-tuning BERT in a classification setup to select the sentences to extract) and report the result as a baseline.

![HIBERT architecture](img/HIBERT.png)

### [STRASS: A Light and Effective Method for Extractive Summarization Based on Sentence Embeddings](https://aclweb.org/anthology/papers/P/P19/P19-2034/)
The authors leverage the semantic information in the sentence-embedding space to create an extractive summary in a computationally efficient manner. They also introduce a new dataset, CASS, built from judgments of the French Court of Cassation and the corresponding summaries.


## Abstractive models

### [Scoring Sentence Singletons and Pairs for Abstractive Summarization](http://arxiv.org/abs/1906.00077)

Abstractive summarizers tend to perform content selection and fusion implicitly, by learning to generate the text in an end-to-end manner. The proposed method instead summarizes documents in a two-stage process, the first stage extractive and the second abstractive. The motivation is that separating the summarization process into two explicit steps allows for more flexibility and explainability of each component. The extractive stage scores sentence singletons and pairs using BERT representations. The extracted singletons or pairs are then fed into a sequence-to-sequence model to generate the summary. Evaluations are reported for both the extractive methods and the full pipeline on three datasets (CNN/DM, DUC, and XSum).

### [Hierarchical Transformers for Multi-Document Summarization](http://arxiv.org/abs/1905.13164)

In the original WikiSum paper, the authors proposed a two-stage process: first extracting the most important sentences from all the input documents in order to get a shorter input, then learning to generate the output with a Transformer model. On top of that, Liu and Lapata propose to refine the extractive step with a hierarchical representation of the documents using attention, instead of simply concatenating the extracted sentences.

### [BiSET: Bi-directional Selective Encoding with Template for Abstractive Summarization](http://arxiv.org/abs/1906.05012)

Bi-directional Selective Encoding with Template (BiSET) is a new architecture for abstractive summarization, tested on the Gigaword dataset. Template-based summarization relies on the manual creation of templates; such an approach yields concise and coherent summaries without requiring training data, but it needs experts to build the templates. In this paper, an automatic method is proposed to retrieve high-quality templates from the training corpus. Given an input article, the model first retrieves the most similar articles using a TF-IDF-based method. A similarity measure computed by a neural network is then used to re-rank the retrieved articles, and the summary of the article most similar to the input is selected as the template. Finally, a sequence-to-sequence network is trained to generate the summary: the authors propose an architecture that learns the interaction between the source article and the selected template.
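The Retrieve step is straightforward to prototype. Here is a sketch with scikit-learn, our simplification under assumptions: the paper's exact preprocessing may differ, and the neural re-ranking stage is omitted, so candidates are returned in plain TF-IDF order.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_templates(input_article, train_articles, train_summaries, top_k=5):
    """Return the summaries of the top_k training articles most similar
    to the input under TF-IDF cosine similarity; these serve as candidate
    templates to be re-ranked and fed to the seq2seq model."""
    vectorizer = TfidfVectorizer(stop_words="english")
    train_tfidf = vectorizer.fit_transform(train_articles)
    query = vectorizer.transform([input_article])
    scores = cosine_similarity(query, train_tfidf)[0]
    best = scores.argsort()[::-1][:top_k]
    return [train_summaries[i] for i in best]
```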
### [Generating Summaries with Topic Templates and Structured Convolutional Decoders](http://arxiv.org/abs/1906.04687)

Most previous work on neural text generation represents the target summaries as a single long sequence. Assuming that documents are organized into topically coherent text segments, the authors propose a hierarchical model that encodes both documents and sentences, guided by the topic structure of the target summaries. The topic templates are obtained from the summaries via a trained Latent Dirichlet Allocation model. WikiCatSum, the dataset used for evaluation, is derived from WikiSum and focuses on three domains: Companies, Films, and Animals. The dataset is publicly [available](https://github.com/lauhaide/WikiCatSum).

### [Global Optimization under Length Constraint for Neural Text Summarization](https://www.aclweb.org/anthology/P19-1099)

Most abstractive summarization models do not control the length of the generated summary, instead learning it from the distribution of the examples seen during training. The authors propose a global optimization method under a length constraint. They report extensive experiments on CNN/DM using several models with different length constraints and optimization methods. In addition to ROUGE and length control, the authors report the average generation time, along with a human assessment.

## Evaluation

### [HighRES: Highlight-based Reference-less Evaluation of Summarization](http://arxiv.org/abs/1906.01361)

Automatic summarization evaluation is an open research question, and the current methods have several pitfalls. For this reason, most papers conduct human evaluations, a challenging and time-consuming task. The authors propose a new human evaluation methodology: first, a group of annotators highlights the salient content in the input article; then, other annotators are asked to score each summary for precision (only important information is present in the summary), recall (all important information is present in the summary), and linguistic quality (clarity and fluency). Major advantages of this method:

- highlights do not depend on the summaries being evaluated but only on the source documents, thus avoiding reference bias;

- it provides absolute instead of ranked evaluation, allowing for better interpretability;

- the highlight annotation needs to happen only once per document, and it can be reused to evaluate many system summaries.

Finally, the authors propose a version of ROUGE leveraging the highlight annotations. The UI (see figure below) is [open source](https://github.com/sheffieldnlp/highres).

![Figure 2 of Hardy et al.: the UI for content evaluation with highlights.](img/high.png)
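We can only guess at the exact metric, but a highlight-aware recall score might look something like the sketch below (ours, purely illustrative, not Hardy et al.'s formulation): each source token is weighted by the fraction of annotators who highlighted it, and a summary is rewarded for covering heavily highlighted content.

```python
def highlight_recall(summary_tokens, doc_tokens, highlight_counts, n_annotators):
    """Toy highlight-weighted unigram recall.

    highlight_counts[i] = number of annotators who highlighted doc_tokens[i].
    Content that every annotator marked as salient dominates the score.
    """
    summary_vocab = set(summary_tokens)
    covered = total = 0.0
    for token, count in zip(doc_tokens, highlight_counts):
        weight = count / n_annotators
        total += weight
        if token in summary_vocab:
            covered += weight
    return covered / total if total else 0.0

doc = "the cat sat on the mat".split()
counts = [0, 2, 2, 0, 0, 1]  # two annotators; both highlighted "cat" and "sat"
print(highlight_recall("a cat on a mat".split(), doc, counts, n_annotators=2))
# 0.6 -- "sat" was heavily highlighted but is missing from the summary
```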
### [A Simple Theoretical Model of Importance for Summarization](https://www.aclweb.org/anthology/P19-1101)

In this work, the author formalizes several simple but rigorous summary-related quantities, such as redundancy, relevance, and informativeness, under the unifying notion of *importance*. The paper includes several analyses to support the proposal and was recognized as an *outstanding* contribution. We look forward to seeing how the proposed framework will be adopted!
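To give a taste of the formalization (our paraphrase from memory, up to constants; see the paper for the precise definitions), the quantities are defined over probability distributions of semantic units:

```latex
% P_S, P_D, P_K: probability distributions over semantic units induced
% by the summary S, the source document D, and background knowledge K.
\begin{align*}
  \mathrm{Red}(S)   &= -\,\mathrm{H}(P_S)        && \text{(repetitive summaries have low entropy)} \\
  \mathrm{Rel}(S,D) &= -\,\mathrm{CE}(P_S, P_D)  && \text{(do not surprise a reader of } D\text{)} \\
  \mathrm{Inf}(S,K) &= \mathrm{CE}(P_S, P_K)     && \text{(do surprise given prior knowledge } K\text{)}
\end{align*}
% Since KL(P_S || P_D) = CE(P_S, P_D) - H(P_S), minimizing the summary's
% KL divergence from the document maximizes relevance while minimizing
% redundancy at the same time.
```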