├── img
│   ├── high.png
│   ├── HIBERT.png
│   └── summarizing.jpg
├── summarizing_summarization_ACL2019.pdf
└── README.md
/img/high.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/recitalAI/summarizing_summarization/HEAD/img/high.png
--------------------------------------------------------------------------------
/img/HIBERT.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/recitalAI/summarizing_summarization/HEAD/img/HIBERT.png
--------------------------------------------------------------------------------
/img/summarizing.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/recitalAI/summarizing_summarization/HEAD/img/summarizing.jpg
--------------------------------------------------------------------------------
/summarizing_summarization_ACL2019.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/recitalAI/summarizing_summarization/HEAD/summarizing_summarization_ACL2019.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Summarizing Summarization


# EMNLP 2019

## Datasets

## Abstractive models

### [Text Summarization with Pretrained Encoders](https://arxiv.org/abs/1908.08345)
Liu and Lapata introduce a novel document-level encoder based on BERT for both extractive and abstractive summarization.
For extractive summarization, the model is built on top of this encoder by stacking several inter-sentence Transformer layers.
For abstractive summarization, they introduce a two-stage fine-tuning approach that can further boost the quality of the generated summaries. Here they use a standard encoder-decoder architecture with the encoder initialized with BERT. To alleviate the mismatch between the pretrained encoder and the randomly initialized decoder, they design a new fine-tuning schedule that separates the optimizers of the encoder and the decoder (sketched below).
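Below is a minimal PyTorch sketch of the separate-optimizer idea; the modules and hyperparameter values are illustrative stand-ins (the paper fine-tunes BERT and a Transformer decoder), not the authors' code.

```python
import torch
from torch import nn

# Toy stand-ins: in the paper the encoder is pretrained BERT and the
# decoder is a randomly initialized Transformer.
encoder = nn.Linear(768, 768)   # placeholder for the pretrained encoder
decoder = nn.Linear(768, 768)   # placeholder for the fresh decoder

# Two Adam optimizers with separate learning rates.
opt_enc = torch.optim.Adam(encoder.parameters(), lr=2e-3)
opt_dec = torch.optim.Adam(decoder.parameters(), lr=2e-1)

def noam_lr(step, warmup):
    """Inverse-sqrt schedule with warmup; the encoder gets a longer warmup
    so its already-meaningful weights change more slowly than the decoder's."""
    step = max(step, 1)
    return min(step ** -0.5, step * warmup ** -1.5)

sched_enc = torch.optim.lr_scheduler.LambdaLR(opt_enc, lambda s: noam_lr(s, warmup=20000))
sched_dec = torch.optim.lr_scheduler.LambdaLR(opt_dec, lambda s: noam_lr(s, warmup=10000))

# In the training loop, both optimizers step on the same loss:
#   loss.backward(); opt_enc.step(); opt_dec.step()
#   sched_enc.step(); sched_dec.step()
```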

## Extractive models

## Evaluation

### [Earlier Isn’t Always Better: Submodular Analysis on Corpus and System Biases in Summarization](https://www.aclweb.org/anthology/D19-1327.pdf)

The authors conduct an impressive number of experiments to analyze system and corpus biases in summarization along three sub-aspects: position, importance, and diversity. They apply state-of-the-art abstractive and extractive models to summarization corpora from various domains (e.g., news, academic papers, meeting minutes, movie scripts, books, posts). The paper offers many useful analyses; for example, position exhibits substantial bias in news articles, but much less in academic papers and meeting minutes (a rough way to probe this sub-aspect is sketched below). Overall, this study provides useful lessons on the sub-aspects to consider when collecting a new summarization dataset or developing a new system.
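Here is a minimal sketch (ours, not the authors' code) of one way to probe the position sub-aspect: align each reference-summary sentence to its most lexically similar source sentence and inspect the distribution of matched positions.

```python
def position_bias(doc_sents, summary_sents):
    """For each summary sentence, find the most lexically similar document
    sentence and return its relative position in [0, 1]; a distribution
    skewed toward 0 indicates lead (position) bias."""
    def tokens(s):
        return {w.strip(".,!?") for w in s.lower().split()}
    n = len(doc_sents)
    positions = []
    for s in summary_sents:
        t = tokens(s)
        best = max(range(n), key=lambda i: len(t & tokens(doc_sents[i])))
        positions.append(best / max(n - 1, 1))
    return positions

doc = ["The fire started at dawn.", "Crews arrived quickly.",
       "No injuries were reported.", "The cause is under investigation."]
print(position_bias(doc, ["A dawn fire caused no injuries."]))  # -> [0.0]
```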

# ACL 2019

## New Data, More Data

### [BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization](http://arxiv.org/abs/1906.03741)

The authors introduce a novel dataset consisting of 1.3 million U.S. patent documents along with human-written abstractive summaries.

Characteristics:

- summaries contain a richer discourse structure with more recurring entities;

- longer input sequences (avg. 3,572.8 vs. 789.9 words for CNN/DM);

- salient content is evenly distributed in the input, while in popular news-based datasets it often concentrates in the first few sentences;

- fewer and shorter extractive fragments are present in the summaries.

The authors report results for various extractive and abstractive models on CNN/DM, NYT, and BigPatent. What seems really interesting is the divergence of results: [PointGen](http://arxiv.org/abs/1704.04368) compares favorably against the unsupervised extractive model [TextRank](https://www.aclweb.org/anthology/W04-3252/) on the news-based datasets, while obtaining worse results on BigPatent. This shows once more the importance of testing models on several, diverse datasets. A toy TextRank sketch is given below.
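For reference, here is a minimal TextRank implementation (our simplification of Mihalcea & Tarau, 2004): sentences are nodes, edges are weighted by normalized word overlap, and PageRank gives the ranking. It assumes `networkx` is available.

```python
import itertools
import math
import networkx as nx  # assumed dependency: pip install networkx

def textrank(sentences):
    """Toy TextRank: rank sentences by PageRank over a graph whose
    edge weights are length-normalized word-overlap similarities."""
    tokens = [set(s.lower().split()) for s in sentences]
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i, j in itertools.combinations(range(len(sentences)), 2):
        overlap = len(tokens[i] & tokens[j])
        if overlap:
            # normalize so long sentences don't dominate
            norm = math.log(len(tokens[i]) + 1) + math.log(len(tokens[j]) + 1)
            graph.add_edge(i, j, weight=overlap / norm)
    scores = nx.pagerank(graph, weight="weight")
    return sorted(range(len(sentences)), key=scores.get, reverse=True)
```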

### [Multi-News: a Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model](http://arxiv.org/abs/1906.01749)

The authors present the first large-scale multi-document summarization dataset based on news. It consists of input articles from over 1,500 different websites along with 56,216 professional summaries of these articles, obtained from the site [newser.com](https://www.newser.com). Additionally, the authors propose an end-to-end model that achieves competitive results under both automatic and human evaluation on various multi-document datasets, including Multi-News.

## Multimodal Summarization

### [Talk-Summ: A Dataset and Scalable Annotation Method for Scientific Paper Summarization Based on Conference Talks](http://arxiv.org/abs/1906.01351)

Thanks to the recent trend of publishing videos of talks at academic conferences, Lev et al. collected 1,716 paper/video pairs and consider the video transcript as a summary of the related paper. The proposed method of generating training data for scientific paper summarization is fully automatic. Hence, the amount of training data can grow with the increasing rate of published papers in the near future, without much additional effort. This would definitely help keep up with the impressive rate of publication in NLP and other scientific fields!
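The paper aligns transcripts to paper sentences with an HMM-based model; the sketch below is a much cruder similarity-based stand-in for that alignment idea, with all names and heuristics our own.

```python
from collections import Counter

def label_salient_sentences(paper_sents, transcript_sents, top_k=3):
    """Crude stand-in for TalkSumm's alignment: credit a paper sentence
    each time a transcript sentence is most similar to it, then treat the
    most-credited sentences as the (extractive) training summary."""
    def sim(a, b):
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / (len(ta | tb) or 1)  # Jaccard similarity
    credit = Counter()
    for t in transcript_sents:
        best = max(range(len(paper_sents)), key=lambda i: sim(t, paper_sents[i]))
        credit[best] += 1
    return [paper_sents[i] for i, _ in credit.most_common(top_k)]
```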

### [Multimodal Abstractive Summarization for How2 Videos](http://arxiv.org/abs/1906.07901)

The authors explore the behaviour of several models for video summarization on the How2 dataset. They propose a multimodal approach using automatic transcripts, audio, and video latent representations, combined through hierarchical attention. For the evaluation, in addition to ROUGE, the authors propose a variant that does not account for stopwords. Interestingly enough, the presented models include a video-only summarization model that performs competitively with a text-only model.
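A minimal sketch of hierarchical attention over modalities follows; dimensions are toy, and the learned query vectors stand in for the decoder states that would normally provide the queries. This is an illustration of the mechanism, not the paper's model.

```python
import torch
from torch import nn

class HierarchicalAttention(nn.Module):
    """Attend within each modality, then across the modality summaries."""
    def __init__(self, dim):
        super().__init__()
        self.step_query = nn.Parameter(torch.randn(dim))
        self.modality_query = nn.Parameter(torch.randn(dim))

    def attend(self, seq, query):
        # seq: (steps, dim) -> weighted average of shape (dim,)
        weights = torch.softmax(seq @ query, dim=0)
        return weights @ seq

    def forward(self, modalities):
        # modalities: list of (steps_m, dim) tensors (e.g. text, audio, video)
        contexts = torch.stack([self.attend(m, self.step_query) for m in modalities])
        return self.attend(contexts, self.modality_query)

fuse = HierarchicalAttention(dim=16)
text, video = torch.randn(10, 16), torch.randn(30, 16)
fused = fuse([text, video])  # single (16,) vector fed to the decoder
```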

## Extractive models

### [Improving the Similarity Measure of Determinantal Point Processes for Extractive Multi-Document Summarization](http://arxiv.org/abs/1906.00072)

The authors propose to tackle multi-document summarization with Determinantal Point Processes (DPP), a learned extractive method, combined with capsule network components. Motivation: TF-IDF vectors fall short when it comes to modeling semantic similarity, which is particularly problematic for multi-document summarization. Solution: a similarity measure for pairs of sentences such that semantically similar sentences can receive high scores despite having very few words in common. The capsule network is trained in a binary classification setup on a dataset derived from CNN/DM: the authors map abstract sentences to the most similar article sentences (positive labels) and use negative sampling for the rest (negative labels).
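A sketch of that training-pair construction is below; word overlap stands in for whatever similarity is used to align abstract sentences to article sentences, so treat it as an assumption-laden simplification.

```python
import random

def build_pairs(article_sents, abstract_sents, negatives_per_positive=1):
    """Build (abstract sentence, article sentence, label) training pairs:
    the most similar article sentence is a positive, random others negatives."""
    pairs = []
    for a in abstract_sents:
        a_tok = set(a.lower().split())
        best = max(article_sents, key=lambda s: len(a_tok & set(s.lower().split())))
        pairs.append((a, best, 1))  # most similar article sentence -> positive
        others = [s for s in article_sents if s is not best]
        for neg in random.sample(others, min(negatives_per_positive, len(others))):
            pairs.append((a, neg, 0))  # randomly sampled sentence -> negative
    return pairs
```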

### [Self-Supervised Learning for Contextualized Extractive Summarization](http://arxiv.org/abs/1906.04466)

A method to train an extractive model in a self-supervised fashion: the sentence encoder is first trained on pretext tasks over neighboring sentences, such as predicting the next sentence (entailment-like), detecting a replaced sentence, and detecting switched sentences. This allows faster training and yields a slight improvement on CNN/DM. The proposed method could also lead to longer text representations learned in a self-supervised manner.
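Below is a sketch of how such pretext examples can be generated from raw documents; the task names follow the description above, but the details are our simplification.

```python
import random

def make_pretext_examples(doc_sents, other_doc_sents):
    """Create self-supervised corruptions of a document's sentences
    (assumes the document has at least two sentences)."""
    sents = list(doc_sents)
    # Replace: swap one sentence with a sentence from another document.
    i = random.randrange(len(sents))
    replaced = sents.copy()
    replaced[i] = random.choice(other_doc_sents)
    # Switch: exchange the positions of two sentences within the document.
    j, k = random.sample(range(len(sents)), 2)
    switched = sents.copy()
    switched[j], switched[k] = switched[k], switched[j]
    # Each example pairs a corrupted document with the positions to detect.
    return [("replace", replaced, i), ("switch", switched, (j, k))]
```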

### [Answering while Summarizing: Multi-task Learning for Multi-hop QA with Evidence Extraction](http://arxiv.org/abs/1905.08511)

The study focuses on HotpotQA, an explainable multi-hop QA task: the system returns the answer together with the evidence sentences, gathered by reasoning over disjoint pieces of the reference texts. The Query Focused Extractor (QFE) is inspired by the extractive summarization model proposed in [Chen et al.](http://arxiv.org/abs/1805.11080). Instead of covering the important information in the source document with an extractive summary, the approach covers the question with the extracted evidence. The model compares favorably with the SOTA BERT-based model in the HotpotQA distractor setting for retrieving the evidence, while not benefiting from any pretraining. In addition, it achieves SOTA performance on the [FEVER](https://arxiv.org/abs/1803.05355) dataset.
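To make the "cover the question" intuition concrete, here is a greedy toy version; QFE itself is a learned, RNN-based extractor, so this only illustrates the coverage objective, not the model.

```python
def cover_query(query, sentences, max_evidence=2):
    """Greedily pick sentences that cover the most not-yet-covered
    query words, stopping when the query is covered or the budget is hit."""
    uncovered = set(query.lower().split())
    picked = []
    while uncovered and len(picked) < max_evidence:
        best = max(sentences, key=lambda s: len(uncovered & set(s.lower().split())))
        gain = uncovered & set(best.lower().split())
        if not gain:
            break                      # no sentence covers anything new
        picked.append(best)
        uncovered -= gain
    return picked
```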

### [Sentence Centrality Revisited for Unsupervised Summarization](http://arxiv.org/abs/1906.03508)

The authors revisit classic extractive unsupervised summarization using graph-based ranking approaches, where the nodes are the sentences of a document, and leverage BERT to encode each sentence. One motivation is that popular supervised approaches are limited by the need for large-scale datasets and thus do not generalize well to other domains and languages. The model performs comparably to SOTA approaches on the popular CNN/DM and NYT datasets, as well as on TTNews, a Chinese news summarization corpus, showing its ability to adapt to different domains. A human assessment is also conducted, based on a set of questions posed on the gold summary, evaluating how much relevant information is present in the generated summary. The application to sentence selection in multi-document summarization is suggested as future work.
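A toy version of centrality ranking over sentence embeddings is sketched below; the paper refines this with directed, position-aware edge weights, which we omit.

```python
import numpy as np

def centrality_rank(embeddings):
    """Rank sentences by degree centrality over pairwise cosine
    similarities of their (e.g. BERT) sentence embeddings."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize rows
    sim = X @ X.T                                      # cosine similarity matrix
    np.fill_diagonal(sim, 0.0)                         # ignore self-similarity
    centrality = sim.sum(axis=1)                       # similarity to all others
    return np.argsort(-centrality)                     # most central first

order = centrality_rank(np.random.randn(5, 768))       # 5 dummy sentence vectors
```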

### [HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization](http://arxiv.org/abs/1905.06566)

HIBERT stands for Hierarchical BERT. The idea is to use two pretrained Transformers (see figure below): the first, a standard BERT working at token level, is used to represent the sentences; the second, working at sentence level, leverages the representations from the former to encode the sentences of an entire document. Following BERT's masked pretraining method, the authors train the sentence-level Transformer by masking some sentences of the documents, and the final model achieves SOTA for summarization on the CNN/DM and NYT datasets. The authors also report informative ablations, using out-of-domain data, in-domain data, and a combination thereof for the pretraining. As a cherry on top, they adapt BERT to extractive supervised summarization (i.e., fine-tuning BERT in a classification setup to select the sentences to extract) and report the result as a baseline.



### [STRASS: A Light and Effective Method for Extractive Summarization Based on Sentence Embeddings](https://aclweb.org/anthology/papers/P/P19/P19-2034/)
The authors leverage the semantic information in the sentence embedding space to create an extractive summary in a computationally efficient manner. They also introduce a new dataset, CASS, built from judgments of the French Court of Cassation and the corresponding summaries.
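A stripped-down sketch of selection in embedding space follows; STRASS additionally learns an affine transformation of the document embedding, which this sketch deliberately skips.

```python
import numpy as np

def embedding_extract(sent_embs, top_k=3):
    """Select the sentences whose embeddings are closest (cosine)
    to the document embedding (here: the mean sentence embedding)."""
    E = np.asarray(sent_embs, dtype=float)
    doc = E.mean(axis=0)                                   # document embedding
    sims = E @ doc / (np.linalg.norm(E, axis=1) * np.linalg.norm(doc) + 1e-9)
    return np.argsort(-sims)[:top_k]                       # summary sentence indices
```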


## Abstractive models

### [Scoring Sentence Singletons and Pairs for Abstractive Summarization](http://arxiv.org/abs/1906.00077)

Abstractive summarizers tend to perform content selection and fusion implicitly, by learning to generate the text in an end-to-end manner. The proposed method, instead, summarizes documents in a two-stage process, the first stage extractive and the second abstractive. The motivation is that separating the summarization process into two explicit steps could allow for more flexibility and explainability for each component. The extractive stage scores sentence singletons and pairs using BERT representations. The extracted singletons and pairs are then fed into a sequence-to-sequence model to generate the summary (a toy enumeration of such candidates is sketched below). Evaluations are reported for both the extractive stage and the full pipeline on three datasets (CNN/DM, DUC, and XSum).
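In this sketch, `score` is a hypothetical stand-in for the paper's BERT-based scorer.

```python
import itertools

def candidates(sentences, score, top_k=5):
    """Enumerate sentence singletons and pairs and keep the best-scored
    ones; each kept candidate is then fused/compressed by a seq2seq model."""
    singles = [((i,), score([sentences[i]])) for i in range(len(sentences))]
    pairs = [((i, j), score([sentences[i], sentences[j]]))
             for i, j in itertools.combinations(range(len(sentences)), 2)]
    ranked = sorted(singles + pairs, key=lambda c: c[1], reverse=True)
    return ranked[:top_k]

# Toy scorer: prefer candidates whose sentences mention "fire".
top = candidates(["a fire broke out", "nobody was hurt", "markets rose"],
                 score=lambda ss: sum("fire" in s for s in ss))
```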

### [Hierarchical Transformers for Multi-Document Summarization](http://arxiv.org/abs/1905.13164)

In the original WikiSum paper, the authors proposed a two-stage process: first extracting the most important sentences from all the documents in order to get a shorter input, then learning to generate the output with a Transformer model. On top of that, Liu and Lapata propose to refine the extractive step with a hierarchical representation of the documents using attention, instead of just concatenating the extracted sentences.

### [BiSET: Bi-directional Selective Encoding with Template for Abstractive Summarization](http://arxiv.org/abs/1906.05012)

Bi-directional Selective Encoding with Template (BiSET) is a new architecture for abstractive summarization tested on the Gigaword dataset. Template-based summarization relies on the manual creation of templates. The advantage of such an approach is that it yields concise and coherent summaries without requiring training data; the drawback is that it requires experts to build the templates. In this paper, an automatic method is proposed to retrieve high-quality templates from the training corpus. Given an input article, the model first retrieves the most similar articles using a TF-IDF-based method. A similarity measure is then computed through a neural network in order to re-rank the retrieved articles, and the summary of the article most similar to the input is selected as the template. Finally, a sequence-to-sequence network is trained to generate the summary: the authors propose an architecture to learn the interaction between the source article and the selected template.
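The retrieval step can be sketched with scikit-learn; the neural re-ranking is replaced here by simply taking the TF-IDF top hit, so this is a simplification of the pipeline described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_template(input_article, train_articles, train_summaries, top_n=5):
    """Fetch the TF-IDF-most-similar training articles and return the
    best candidate's summary as the template (re-ranking omitted)."""
    vec = TfidfVectorizer().fit(train_articles + [input_article])
    sims = cosine_similarity(vec.transform([input_article]),
                             vec.transform(train_articles))[0]
    candidates = sims.argsort()[::-1][:top_n]
    best = candidates[0]          # the neural re-ranker would pick among top_n
    return train_summaries[best]
```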

### [Generating Summaries with Topic Templates and Structured Convolutional Decoders](http://arxiv.org/abs/1906.04687)

Most previous work on neural text generation represents the target summaries as a single long sequence. Assuming that documents are organized into topically coherent text segments, the authors propose a hierarchical model that encodes both documents and sentences, guided by the topic structure of the target summaries. The topic templates of the summaries are obtained via a trained Latent Dirichlet Allocation model. WikiCatSum, the dataset used for evaluation, is derived from WikiSum and focuses on three domains: Companies, Films, and Animals. The dataset is publicly [available](https://github.com/lauhaide/WikiCatSum).
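A toy version of the topic-template extraction follows; the paper trains LDA over the whole corpus of target summaries, while this sketch fits it per document for brevity.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def topic_template(summary_sents, n_topics=3):
    """Fit LDA over summary sentences and read off the dominant topic
    per sentence, yielding a topic-id sequence such as [0, 0, 2]."""
    counts = CountVectorizer().fit_transform(summary_sents)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topics = lda.fit_transform(counts)
    return doc_topics.argmax(axis=1)   # one topic id per sentence
```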

### [Global Optimization under Length Constraint for Neural Text Summarization](https://www.aclweb.org/anthology/P19-1099)

Most abstractive summarization models do not control the length of the generated summary and instead learn it from the distribution of the examples seen during training. The authors propose a global optimization method under a length constraint. They report extensive experiments on CNN/DM using several models with different length constraints and optimization methods. In addition to ROUGE and length control, the authors report the average generation time, along with a human assessment.
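To give the flavor of folding a length budget into a sequence-level objective, here is one simple reward shaping; the paper's exact formulation differs, so read this purely as an illustration of the constraint idea.

```python
def constrained_reward(candidate, reference, max_words, quality):
    """Zero out the quality score of any candidate that exceeds the
    length budget, so global optimization favors on-budget summaries."""
    if len(candidate.split()) > max_words:
        return 0.0                      # over-length candidates get no reward
    return quality(candidate, reference)

# Toy quality: unigram recall against the reference.
recall = lambda c, r: len(set(c.split()) & set(r.split())) / len(set(r.split()))
print(constrained_reward("the cat sat", "the cat sat on the mat", 10, recall))
```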

## Evaluation

### [HighRES: Highlight-based Reference-less Evaluation of Summarization](http://arxiv.org/abs/1906.01361)

Automatic summarization evaluation is an open research question, and the current methods have several pitfalls. For this reason, most papers conduct human evaluations, a challenging and time-consuming task. The authors propose a new human evaluation methodology: first, a group of annotators highlights the salient content in the input article; then, other annotators are asked to score the summary for precision (i.e., only important information is present in the summary), recall (i.e., all important information is present in the summary), and linguistic quality (clarity and fluency). Major advantages of this method:

- highlights do not depend on the summaries being evaluated but only on the source documents, thus avoiding reference bias;

- it provides absolute instead of ranked evaluation, allowing for better interpretability;

- the highlight annotation needs to happen only once per document, and it can be reused to evaluate many system summaries.

Finally, the authors propose a version of ROUGE leveraging the highlight annotations (a rough sketch of the idea follows the figure below). The UI (see figure below) is [open source](https://github.com/sheffieldnlp/highres).

![Figure 2 of Hardy et al.: the UI for content evaluation with highlights.](img/high.png)
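Below is our sketch of a highlight-weighted ROUGE-1; the paper's formulation is more elaborate, and the per-token `weights` (how often annotators highlighted each document token) are an assumed input.

```python
from collections import Counter

def highlighted_rouge1(summary, document_tokens, weights):
    """ROUGE-1-style precision/recall where document unigrams count in
    proportion to their highlight weight in [0, 1]."""
    ref = Counter()
    for tok, w in zip(document_tokens, weights):
        ref[tok.lower()] += w
    cand = Counter(t.lower() for t in summary.split())
    overlap = sum(min(cand[t], ref[t]) for t in cand)
    precision = overlap / (sum(cand.values()) or 1)
    recall = overlap / (sum(ref.values()) or 1)
    return precision, recall
```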

### [A Simple Theoretical Model of Importance for Summarization](https://www.aclweb.org/anthology/P19-1101)

In this work, the author formalizes several simple but rigorous summary-related metrics, such as redundancy, relevance, and informativeness, under the unifying notion of *importance*. The paper includes several analyses to support the proposal and was recognized as an *outstanding* contribution. We look forward to seeing how the proposed framework will be adopted!
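To make the notions above tangible, here is a toy instantiation over unigram distributions, in the spirit of the paper: low entropy of the summary distribution signals redundancy, and low cross-entropy with respect to the source signals relevance. The actual framework is defined over abstract semantic units, so this is only an illustrative approximation.

```python
import math
from collections import Counter

def _dist(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

def importance_quantities(summary, source, smoothing=1e-9):
    """Toy redundancy/relevance over unigram distributions."""
    ps, pd = _dist(summary.lower().split()), _dist(source.lower().split())
    entropy = -sum(p * math.log(p) for p in ps.values())
    cross_entropy = -sum(p * math.log(pd.get(w, smoothing)) for w, p in ps.items())
    return {"redundancy": -entropy, "relevance": -cross_entropy}
```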
--------------------------------------------------------------------------------