├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── mt_metrics_eval
    ├── __init__.py
    ├── codalab
    │   ├── eval.py
    │   └── metadata
    ├── converters
    │   ├── __init__.py
    │   ├── evalset_ratings_to_standalone.py
    │   ├── score_mqm.py
    │   ├── standalone_ratings_to_evalset.py
    │   └── verify_scores_file.py
    ├── data.py
    ├── data_test.py
    ├── meta_info.py
    ├── mt_metrics_eval.ipynb
    ├── mtme.py
    ├── pce.py
    ├── pce_test.py
    ├── ratings.py
    ├── ratings_test.py
    ├── standalone_ratings.py
    ├── stats.py
    ├── stats_test.py
    ├── tasks.py
    ├── tasks_test.py
    ├── tau_optimization.py
    ├── tau_optimization_test.py
    ├── ties_matter.ipynb
    ├── wmt22_metrics.ipynb
    └── wmt23_metrics.ipynb
└── setup.py


/CONTRIBUTING.md:
--------------------------------------------------------------------------------
# How to Contribute

We'd love to accept your patches and contributions to this project. There are
just a few small guidelines you need to follow.

## Contributor License Agreement

Contributions to this project must be accompanied by a Contributor License
Agreement. You (or your employer) retain the copyright to your contribution;
this simply gives us permission to use and redistribute your contributions as
part of the project. Head over to to see
your current agreements on file or to sign a new one.

You generally only need to submit a CLA once, so if you've already submitted one
(even if it was for a different project), you probably don't need to do it
again.

## Code reviews

All submissions, including submissions by project members, require review. We
use GitHub pull requests for this purpose. Consult
[GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
information on using pull requests.

## Community Guidelines

This project follows [Google's Open Source Community
Guidelines](https://opensource.google/conduct/).

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------

                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License.
Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution.
You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability.
In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format.
We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# MT Metrics Eval V2

MTME is a simple toolkit to evaluate the performance of Machine Translation
metrics on standard test sets such as those from the
[WMT Metrics Shared Tasks](https://wmt-metrics-task.github.io).
It bundles data relevant to metric development and evaluation for a
given test set and language pair, and lets you do the following:

- Access source, reference, and MT output text, along with associated
  meta-info, for the WMT metrics tasks from 2019 on. This can be done via
  software, or by directly accessing the files in a Linux directory
  structure, in a straightforward format.
- Access human and automatic metric scores for the above data, and MQM ratings
  for some language pairs.
- Reproduce the official results from the WMT metrics tasks. For
  WMT22 on, there are colabs to do this; other years require more work.
- Compute various correlations and perform significance tests on correlation
  differences between two metrics.

These can be done on the command line using a Python script, or from an
API.

## Installation

You need Python 3.10 or later. To install:

```bash
git clone https://github.com/google-research/mt-metrics-eval.git
cd mt-metrics-eval
pip install .
```

## Downloading the data

This must be done before using the toolkit. You can either use the mtme script:

```bash
alias mtme='python3 -m mt_metrics_eval.mtme'
mtme --download  # Puts ~2G of data into $HOME/.mt-metrics-eval.
```

Or download directly, if you're only interested in the data:

```bash
mkdir $HOME/.mt-metrics-eval
cd $HOME/.mt-metrics-eval
wget https://storage.googleapis.com/mt-metrics-eval/mt-metrics-eval-v2.tgz
tar xfz mt-metrics-eval-v2.tgz
```

Once data is downloaded, you can optionally test the install:

```bash
python3 -m unittest discover mt_metrics_eval "*_test.py"  # Takes ~70 seconds.
```

## Running from the command line

Here are some examples of things you can do with the mtme script. They assume
that the mtme alias above has been set up.

Get information about test sets:

```bash
mtme --list  # List available test sets.
mtme --list -t wmt22  # List language pairs for wmt22.
mtme --list -t wmt22 -l en-de  # List details for wmt22 en-de.
```

Get contents of test sets.
Paste doc-id, source, standard reference,
alternative reference to stdout:

```bash
mtme -t wmt22 -l en-de --echo doc,src,refA,refB
```

Outputs from all systems, sequentially, pasted with doc-ids, source, and
reference:

```bash
mtme -t wmt22 -l en-de --echosys doc,src,refA
```

Human and metric scores for all systems, at all granularities:

```bash
mtme -t wmt22 -l en-de --scores > wmt22.en-de.tsv
```

Evaluate metric score files containing tab-separated `system-name score`
entries. For system-level correlations, supply one score per system. For
document-level or segment-level correlations, supply one score per document or
segment, grouped by system, in the same order as text generated using `--echo`
(the same order as the WMT test-set file). Granularity is determined
automatically. Domain-level scores are currently not supported by
this command.

```bash
examples=$HOME/.mt-metrics-eval/mt-metrics-eval-v2/wmt22/metric-scores/en-de

mtme -t wmt22 -l en-de < $examples/BLEU-refA.sys.score
mtme -t wmt22 -l en-de < $examples/BLEU-refA.seg.score
```

Compare to WMT appraise gold scores instead of MQM gold scores:

```bash
mtme -t wmt22 -l en-de -g wmt-appraise < $examples/BLEU-refA.sys.score
mtme -t wmt22 -l en-de -g wmt-appraise < $examples/BLEU-refA.seg.score
```

Compute correlations for two metrics files, and perform tests to determine
whether they are significantly different:

```bash
mtme -t wmt22 -l en-de -i $examples/BLEU-refA.sys.score -c $examples/COMET-22-refA.sys.score
```

Compare all known metrics under specified conditions. This corresponds to one of
the "tasks" in the WMT22 metrics evaluation.
The first output line contains all
relevant parameter settings, and subsequent lines show metrics in descending
order of performance, followed by the rank of their significance cluster, the
value of the selected correlation statistic, and a vector of flags to indicate
significant differences with lower-ranked metrics. These examples use k_block=5
for demo purposes; using k_block=100 will approximately match official results
but can take minutes to hours to complete, depending on the task.

```bash
# System-level Pearson
mtme -t wmt22 -l en-de --matrix --k_block 5

# System-level paired-rank accuracy, pooling results across all MQM languages
mtme -t wmt22 -l en-de,zh-en,en-ru --matrix \
  --matrix_corr accuracy --k_block 5

# Segment-level item-wise averaged Kendall-Tau-Acc23 with optimal tie threshold
# using sampling rate of 1.0 (disabling significance testing for demo).
mtme -t wmt22 -l en-de --matrix --matrix_level seg --avg item \
  --matrix_corr KendallWithTiesOpt --matrix_perm_test pairs \
  --matrix_corr_args "{'variant':'acc23', 'sample_rate':1.0}" --k 0
```

## API and Colabs

The colab notebook `mt_metrics_eval.ipynb` contains examples that show how to
use the API to load and summarize data, and compare stored metrics (ones that
participated in the metrics shared tasks) using different criteria. It also
demonstrates how you can incorporate new metrics into these comparisons.

The notebooks `wmt22_metrics.ipynb` and `wmt23_metrics.ipynb` document how the
official results for these tasks were generated.
We will try to provide similar notebooks for future evaluations.
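
As a rough illustration of what the system-level paired-rank accuracy selected
with `--matrix_corr accuracy` in the command-line examples measures, here is a
minimal self-contained sketch. It is not MTME's implementation: the function
name `pairwise_accuracy` and the toy scores are hypothetical, and it assumes
the common definition (the fraction of system pairs that the metric orders the
same way as the human scores), with higher scores meaning better on both sides
and ties counted as disagreements.

```python
from itertools import combinations


def pairwise_accuracy(gold, metric):
    """Fraction of system pairs ordered the same way by gold and metric scores.

    Hypothetical helper for illustration only. `gold` and `metric` map
    system names to scores, and higher is assumed to mean better in both.
    """
    systems = sorted(set(gold) & set(metric))
    pairs = list(combinations(systems, 2))
    if not pairs:
        raise ValueError("need at least two systems scored by both")
    agree = 0
    for a, b in pairs:
        gold_diff = gold[a] - gold[b]
        metric_diff = metric[a] - metric[b]
        # A pair counts as correct only if both differences share a sign;
        # a tie on either side counts as a disagreement in this sketch.
        if gold_diff * metric_diff > 0:
            agree += 1
    return agree / len(pairs)


# Toy data: the metric swaps sysA and sysB relative to the human scores.
gold = {"sysA": 0.9, "sysB": 0.5, "sysC": 0.1}
metric = {"sysA": 30.0, "sysB": 35.0, "sysC": 20.0}
accuracy = pairwise_accuracy(gold, metric)
print(accuracy)
```

With the toy data, two of the three system pairs are ordered consistently, so
the sketch reports an accuracy of two thirds.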

The notebook `ties_matter.ipynb` contains the code to reproduce the results
from [Ties Matter: Meta-Evaluating Modern Metrics with Pairwise Accuracy and Tie Calibration](https://arxiv.org/abs/2305.14324).
It also contains examples for how to calculate the proposed pairwise accuracy
with tie calibration.

## MQM Ratings

MTME also supports representing MQM ratings.
The ratings are stored as `rating.Rating` objects in the `EvalSet`.
They can be accessed via the `EvalSet.Ratings()` function.
`Ratings()` returns a dictionary that maps between the name of a set of
ratings and the ratings themselves, one per segment.
Each entry can either represent:

- An individual rater's ratings, in which the key is the ID of the rater
- A metric's ratings, in which the key is the ID of the system that predicted the rating
- A combined set of ratings that come from different raters, in which the key
is the name for this group of ratings. This could be used if there was a logical
"round" of ratings from different raters, like a full round of ratings collected
as part of a WMT evaluation.

The IDs of the raters who rated the segments can be accessed via
`EvalSet.RaterIdsPerSeg()`. It returns a dict that is parallel to an entry
in `EvalSet.Ratings()` that lists the individual rater IDs for each rating or
`None` if there was no rating.
For an individual rater's ratings or a metric's ratings, these are typically
that rater's ID or the name of the metric. For a combined set of ratings, this
will contain the per-segment rater IDs.

For each year of WMT for which ratings are included in MTME, there is a rating
entry for each individual rater. If there was a logical grouping of ratings,
like a round of ratings that were collected at the same time, those are also
included.
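
To make the parallel layout described above more concrete, here is a small
sketch that uses plain Python dicts and lists as stand-ins. Nothing in it calls
MTME: the names `rater1`, `rater2`, `round1`, the placeholder strings, and the
helper `segments_rated_by` are all hypothetical, and the exact nesting of the
real return values of `EvalSet.Ratings()` and `EvalSet.RaterIdsPerSeg()` is an
assumption that may differ in detail.

```python
# Stand-in for the ratings dictionary: rating-set name -> one rating per
# segment (None where the segment was not rated). Placeholder strings are
# used here instead of real rating objects.
ratings = {
    "rater1": ["r1-seg0", None, "r1-seg2"],
    "rater2": [None, "r2-seg1", None],
    "round1": ["r1-seg0", "r2-seg1", "r1-seg2"],  # combined group of raters
}

# Stand-in for the rater IDs: same keys and same per-segment positions,
# holding the ID of whoever produced each rating, or None if unrated.
rater_ids_per_seg = {
    "rater1": ["rater1", None, "rater1"],
    "rater2": [None, "rater2", None],
    "round1": ["rater1", "rater2", "rater1"],
}


def segments_rated_by(group, rater_id):
    """Return the segment indices in `group` that `rater_id` rated."""
    return [
        i for i, rid in enumerate(rater_ids_per_seg[group]) if rid == rater_id
    ]


round1_by_rater1 = segments_rated_by("round1", "rater1")
print(round1_by_rater1)
```

Keeping the rater IDs in a structure parallel to the ratings means a combined
set such as the hypothetical `round1` can still be split back into per-rater
subsets by position.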

Here are the ratings that are currently available:

| Dataset | Language Pair | Ratings |
| ------- | ------------- | ------- |
| wmt20 | en-de | |
| wmt20 | zh-en | |
| wmt21.news | en-de | |
| wmt21.news | zh-en | |
| wmt21.tedtalks | en-de | |
| wmt21.tedtalks | zh-en | |
| wmt22 | en-de | |
| wmt22 | en-ru |