# Seahorse dataset and metrics

Seahorse is a dataset for multilingual, multifaceted summarization evaluation. It contains 96K summaries with human ratings along 6 quality dimensions: comprehensibility, repetition, grammar, attribution, main ideas, and conciseness, covering 6 languages, 9 systems, and 4 datasets.

More details can be found in the [paper](https://arxiv.org/abs/2305.13194), which can be cited as follows:
```
@misc{clark2023seahorse,
      title={SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation},
      author={Elizabeth Clark and Shruti Rijhwani and Sebastian Gehrmann and Joshua Maynez and Roee Aharoni and Vitaly Nikolaev and Thibault Sellam and Aditya Siddhant and Dipanjan Das and Ankur P. Parikh},
      year={2023},
      eprint={2305.13194},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
The Seahorse dataset is released under the [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) license.

You can download the dataset here: https://storage.googleapis.com/seahorse-public/seahorse_data.zip

New!! We have also released the learnt metrics trained on Seahorse on HuggingFace. They can be found [here](https://huggingface.co/collections/google/seahorse-release-6543b0c06d87d83c6d24193b).

## Dataset description

The dataset is split into 3 .tsv files: the train, validation, and test sets.

Each file contains the following information:
* `gem_id`: the ID of the article that was used to generate the summary (see [Retrieving articles from GEM](#retrieving-articles-from-gem) for more details)
* `worker_lang`: the language ID (de, es-ES, en-US, ru, tr, vi)
* `summary`: the generated summary
* `model`: the source of the summary (either the reference or the summarization model that produced it)
* `question1`–`question6`: 6 columns with annotator ratings, corresponding to the 6 dimensions of quality (comprehensibility, repetition, grammar, attribution, main idea(s), and conciseness). If `question1` = No, there will be no ratings for the remaining questions.

Here is an example entry:
```
xlsum_english-validation-6416 en-US Schools in England, Wales and Scotland are being urged to bring back overseas exchange trips. t5_base Yes Yes Yes Yes Yes Yes
```

There is also a directory called `duplicates`, which contains the items that received multiple annotations. Note that this data should NOT be used for training metrics, as there may be overlap with the train/dev/test sets.

## Retrieving articles from GEM

If you would like to access the articles that the Seahorse summaries are based on, you will need to retrieve them using their GEM IDs.

The [xsum](https://huggingface.co/datasets/GEM/xsum), [mlsum](https://huggingface.co/datasets/GEM/mlsum), and [xlsum](https://huggingface.co/datasets/GEM/xlsum) articles can all be retrieved through GEM on HuggingFace. The `gem_id` column points to the article in the GEM datasets.

The wikilingua article IDs come from a previous version of the GEM dataset and should be retrieved using [TensorFlow Datasets](https://www.tensorflow.org/datasets/catalog/gem).
Here's an example of how to load the English wikilingua dataset into a dataframe:

```
import tensorflow_datasets as tfds

lang = 'english_en'
orig_split = 'validation'

ds, info = tfds.load(f'huggingface:gem/wiki_lingua_{lang}', split=orig_split, with_info=True)
hfdf = tfds.as_dataframe(ds, info)
```

## Seahorse metrics

Metrics trained on Seahorse are available through HuggingFace. They can be found [here](https://huggingface.co/collections/google/seahorse-release-6543b0c06d87d83c6d24193b).
These are mT5-based metrics in two sizes (Large and XXL), each trained on one of the six dimensions of quality.
Please see the [paper](https://arxiv.org/abs/2305.13194) for more details about these metrics.

| | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 |
| ----------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
| mT5-XXL | [seahorse-xxl-q1](https://huggingface.co/google/seahorse-xxl-q1) | [seahorse-xxl-q2](https://huggingface.co/google/seahorse-xxl-q2) | [seahorse-xxl-q3](https://huggingface.co/google/seahorse-xxl-q3) | [seahorse-xxl-q4](https://huggingface.co/google/seahorse-xxl-q4) | [seahorse-xxl-q5](https://huggingface.co/google/seahorse-xxl-q5) | [seahorse-xxl-q6](https://huggingface.co/google/seahorse-xxl-q6) |
| mT5-Large | [seahorse-large-q1](https://huggingface.co/google/seahorse-large-q1) | [seahorse-large-q2](https://huggingface.co/google/seahorse-large-q2) | [seahorse-large-q3](https://huggingface.co/google/seahorse-large-q3) | [seahorse-large-q4](https://huggingface.co/google/seahorse-large-q4) | [seahorse-large-q5](https://huggingface.co/google/seahorse-large-q5) | [seahorse-large-q6](https://huggingface.co/google/seahorse-large-q6) |

**Note:** If you want to use the metrics for labeling summaries (as opposed to looking at correlations or ROC-AUC as we did in the paper), you will need to select a threshold value to classify the metric's scores.
The best threshold value will depend on your data and use case.

## Leaderboard

We are maintaining a leaderboard with official results on our test set.

We ask you **not** to incorporate any part of the Seahorse validation set into your training data; use it only for validation/hyperparameter tuning, as development sets are typically used.

We report results on two metrics: Pearson correlation ($\rho$) and area under the ROC curve (roc).
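Both quantities can be computed with scipy and scikit-learn. Here is a minimal sketch with made-up scores and labels (Yes/No ratings mapped to 1/0); none of these numbers come from the actual dataset:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

# Illustrative stand-ins: a metric's continuous scores and the
# corresponding binary human labels (1 = "Yes") for the same items.
labels = np.array([0, 0, 1, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.45, 0.5, 0.7, 0.8, 0.9])

rho = pearsonr(labels, scores)[0]    # Pearson correlation (rho)
roc = roc_auc_score(labels, scores)  # area under the ROC curve (roc)
```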
| Model | Link | Q1 $\rho$ | Q1 roc | Q2 $\rho$ | Q2 roc | Q3 $\rho$ | Q3 roc | Q4 $\rho$ | Q4 roc | Q5 $\rho$ | Q5 roc | Q6 $\rho$ | Q6 roc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mT5-seahorse | [Clark et al. 2023] | 0.52 | 0.90 | 0.86 | 0.98 | 0.45 | 0.84 | 0.59 | 0.85 | 0.50 | 0.80 | 0.52 | 0.81 |
| mT5-XNLI | [Honovich et al. 2022, Conneau et al. 2018] | - | - | - | - | - | - | 0.43 | 0.78 | - | - | - | - |
| ROUGE-L | [Lin et al. 2004] | 0.04 | 0.54 | 0.06 | 0.54 | -0.03 | 0.43 | 0.13 | 0.55 | 0.03 | 0.54 | 0.02 | 0.54 |
| Majority Class | - | - | 0.5 | - | 0.5 | - | 0.5 | - | 0.5 | - | 0.5 | - | 0.5 |
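
Returning to the note in the Seahorse metrics section: if you need hard Yes/No labels from a metric's continuous scores, one common heuristic (a sketch with illustrative numbers, not the procedure used in the paper) is to pick the threshold that maximizes Youden's J statistic on a validation split:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Illustrative validation data: metric scores and binary human labels.
labels = np.array([0, 0, 1, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.45, 0.5, 0.7, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(labels, scores)
best = thresholds[np.argmax(tpr - fpr)]  # Youden's J = TPR - FPR
predictions = (scores >= best).astype(int)
```

Thresholds chosen this way should be validated on held-out data before use, since the optimal cut-off varies across languages and quality dimensions.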