├── LICENSE
├── README.md
└── art
    ├── __init__.py
    ├── aggregators.py
    ├── scores.py
    ├── significance_tests.py
    ├── test
    │   ├── __init__.py
    │   ├── resources
    │   │   ├── example_scores
    │   │   └── example_scores_numerator_always_0
    │   ├── test_aggregators.py
    │   ├── test_scores.py
    │   └── test_significance_tests.py
    └── transform_conll_score_file.py

/LICENSE:
--------------------------------------------------------------------------------
The MIT License (MIT)

Copyright (c) 2014 Sebastian Martschat (sebastian.martschat at gmail dot com)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Approximate Randomization Testing

This repository contains a package for performing two-sided paired
approximate randomization tests to assess the statistical significance of the
difference in performance between two systems.

## Usage

To perform the test, create an `ApproximateRandomizationTest` object. Here is
an example:

```python
from art import aggregators
from art import scores
from art import significance_tests


test = significance_tests.ApproximateRandomizationTest(
    scores.Scores.from_file(open('system1_file')),
    scores.Scores.from_file(open('system2_file')),
    aggregators.f_1)
test.run()
```
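
`run()` returns the estimated significance level, i.e. an approximation of
the p-value for the observed difference. A minimal sketch of how the result
might be used, continuing the example above (the 0.05 threshold is purely an
illustration, not a recommendation):

```python
p_value = test.run()
if p_value < 0.05:
    print('The difference between the two systems is significant.')
```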

## Input Format

We assume that we want to check the statistical significance of the
difference in score between two systems S and T on the same corpus C. To
compute the score over the whole corpus, we first compute, for each document,
all quantities needed for the final score, and then aggregate these values
over the whole corpus to obtain an aggregated score.

Hence, we assume that the input files contain in the i-th line

```
score_1 score_2 ... score_n
```

for the i-th document in the corpus. That is, each line is a list of numbers
separated by whitespace.
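
For illustration, an input file for a corpus of three documents, with two
numbers per document, could look like this:

```
2 3
4 12
22 500
```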

## Examples

So far, three aggregation functions are implemented: average, dividing sums
(suitable for precision and recall) and F1 score. All of them live in
`aggregators.py`. We now briefly describe each function and the expected
input format for the `ApproximateRandomizationTest` object.

### Average

Aggregation function `average`. Expected input is a file containing one
number per line. The function just computes the average of all the numbers.

### Dividing sums

Aggregation function `enum_sum_div_by_denom_sum`. Expected input is a file
containing two numbers per line. The first number is interpreted as the
numerator, the second number as the denominator. The aggregated score is
computed by summing the numerators, summing the denominators, and dividing
the first sum by the second. One use case of this aggregation function is to
compute recall or precision, as in the sketch below.
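
For instance, with the three-document file shown above (stored, say, under
the purely illustrative name `recall_scores`), the aggregated score is
(2 + 4 + 22) / (3 + 12 + 500):

```python
from art import aggregators
from art import scores

recall = aggregators.enum_sum_div_by_denom_sum(
    scores.Scores.from_file(open('recall_scores')))
# recall == 28 / 515
```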

### F1

Aggregation function `f_1`. Expected input is a file containing four numbers
per line. The first two numbers are interpreted as numerator and denominator
for recall, the third and fourth number accordingly for precision. The
aggregated score is obtained by aggregating recall and precision
individually, by dividing sums, and then computing the F1 score from the two
aggregated values.
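
As a sketch of the computation (with numbers borrowed from the package's own
tests, using in-memory `Score` objects instead of a file):

```python
from art import aggregators
from art.scores import Score
from art.scores import Scores

# Per document: recall numerator and denominator, then precision
# numerator and denominator.
f1_scores = Scores([Score([2, 3, 7, 8]), Score([4, 12, 33, 50])])

# r = (2 + 4) / (3 + 12), p = (7 + 33) / (8 + 50), F1 = 2*p*r / (p + r)
print(aggregators.f_1(f1_scores))
```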

## Converting CoNLL Coreference Score Files

This repository also contains a module to create the required input from the
output of the CoNLL coreference scorer
(http://code.google.com/p/reference-coreference-scorers/).

First, score the output of a system using the scorer, for one metric, as in

```
$ perl scorer.pl muc key response > conll_score_file
```

Then employ `get_numerators_and_denominators` from
`transform_conll_score_file` to transform the file into an object which can
be used for scoring:

```python
from art import aggregators
from art import significance_tests
from art import transform_conll_score_file as transform


transformed_system1 = transform.get_numerators_and_denominators(
    open('conll_score_file'))
transformed_system2 = transform.get_numerators_and_denominators(
    open('another_conll_score_file'))

test = significance_tests.ApproximateRandomizationTest(
    transformed_system1,
    transformed_system2,
    aggregators.f_1)
test.run()
```

## Changelog

__05 May 2017__
Fixed a bug in transforming CoNLL coreference score files.

--------------------------------------------------------------------------------
/art/__init__.py:
--------------------------------------------------------------------------------
"""art: A package for significance testing via approximate randomization"""

__author__ = 'smartschat'

--------------------------------------------------------------------------------
/art/aggregators.py:
--------------------------------------------------------------------------------
"""Contains functions to aggregate scores over a corpus."""

import math

__author__ = 'smartschat'


def average(scores):
    """Compute the average of all scores.

    Args:
        scores: A Scores object. Each score in scores should contain only one
            number.

    Returns:
        The average of all numbers in scores.
    """
    return math.fsum([score.values[0] for score in scores])/len(scores)


def enum_sum_div_by_denom_sum(scores):
    """Sum up the first entry of all scores, then sum up the second entry.
    Divide the first sum by the second sum.

    Args:
        scores: A Scores object. Each score in scores should contain two
            numbers.

    Returns:
        The sum of the first entry of each score in scores, divided by the
        sum of the second entry of each score in scores.
    """
    return math.fsum([score.values[0] for score in scores])/math.fsum(
        [score.values[1] for score in scores])


def f_1(scores):
    """Compute the corpus-wide F1 score represented by the scores.

    Each score should contain four entries. Consider:
    - first/second entry numerator/denominator for recall,
    - third/fourth entry numerator/denominator for precision.

    Then define r = sum(first entries)/sum(second entries) and
    p = sum(third entries)/sum(fourth entries).

    The F1 score is then computed as F1 = 2pr/(p+r).

    Args:
        scores: A Scores object. Each score in scores should contain four
            numbers.

    Returns:
        The corpus-wide F1 score.
    """
    first_component = math.fsum(
        [score.values[0] for score in scores]
    )/math.fsum(
        [score.values[1] for score in scores])

    second_component = math.fsum(
        [score.values[2] for score in scores]
    )/math.fsum(
        [score.values[3] for score in scores])

    return 2 * first_component * second_component / (
        first_component + second_component)

--------------------------------------------------------------------------------
/art/scores.py:
--------------------------------------------------------------------------------
"""Contains classes for managing scores and lists of scores."""

__author__ = 'smartschat'


class Score(object):
    """A score for an individual document.

    Attributes:
        values: A list of floats, which constitutes the score for the
            document under consideration.
    """
    def __init__(self, score):
        """Create a score from a list of numbers.

        Args:
            score: a list of numbers.
        """
        self.values = [float(val) for val in score]

    def __str__(self):
        return ' '.join([str(val) for val in self.values])

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return self.values == other.values
        else:
            return False

    def __hash__(self):
        # Lists are unhashable, so hash a tuple of the values.
        return hash(tuple(self.values))


class Scores(object):
    """A collection of scores for a set of documents (a corpus).

    Attributes:
        scores: A list of Score objects.
    """
    def __init__(self, scores=None):
        """Init from a list of scores.

        Args:
            scores: A list of Score objects.
        """
        if not scores:
            scores = []

        self.scores = scores

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return self.scores == other.scores
        else:
            return False

    def __hash__(self):
        # Lists are unhashable, so hash a tuple of the scores.
        return hash(tuple(self.scores))

    def __len__(self):
        return len(self.scores)

    def __iter__(self):
        return iter(self.scores)

    def __str__(self):
        return '\n'.join([str(score) for score in self.scores])

    def append(self, score):
        """Append a score.

        Args:
            score: A Score object.
        """
        self.scores.append(score)

    @staticmethod
    def from_file(file):
        """Create a Scores object from a file, where each line in the file
        describes a score for one document.

        The file should contain a list of numbers in each line, separated by
        whitespace. The number of entries in each line should match. An
        example file looks like the following:

        1 2 3
        4 3 2.5
        11 1 0

        Args:
            file: A file containing a list of scores.

        Returns:
            A Scores object representing the scores in the file.
        """
        scores = []
        for line in file.readlines():
            scores.append(Score(line.split()))
        return Scores(scores)

--------------------------------------------------------------------------------
/art/significance_tests.py:
--------------------------------------------------------------------------------
"""Contains significance tests for differences between systems."""

from __future__ import division
import math
import random

from art.scores import Scores

__author__ = 'smartschat'


class ApproximateRandomizationTest(object):
    """A paired two-sided approximate randomization test.

    This class allows performing a paired two-sided approximate randomization
    test to assess the statistical significance of the difference in
    performance between two systems which are run and measured on the same
    corpus.

    Attributes:
        system1_scores: A Scores object, which represents the scores of the
            first system under consideration.
        system2_scores: A Scores object, which represents the scores of the
            second system under consideration.
        aggregator: An aggregator function, which aggregates all scores for
            individual documents to obtain a score for the whole corpus.
        trials: The number of iterations during the test.
    """
    def __init__(self,
                 system1_scores,
                 system2_scores,
                 aggregator,
                 trials=10000):
        """Inits a paired two-sided approximate randomization test.

        Args:
            system1_scores: A Scores object, which represents the scores of
                the first system under consideration.
            system2_scores: A Scores object, which represents the scores of
                the second system under consideration.
            aggregator: An aggregator function, which aggregates all scores
                for individual documents to obtain a score for the whole
                corpus.
            trials: The number of iterations during the test. Defaults to
                10000.
        """
        self.system1_scores = system1_scores
        self.system2_scores = system2_scores
        self.aggregator = aggregator
        self.trials = trials

    def run(self):
        """Compute the statistical significance of a difference between
        the systems via a paired two-sided approximate randomization test.

        Returns:
            An approximation of the probability of observing corpus-wide
            differences in scores at least as extreme as observed here, when
            there is no difference between the systems.
        """
        absolute_difference = math.fabs(
            self.aggregator(self.system1_scores) -
            self.aggregator(self.system2_scores))
        shuffled_was_at_least_as_high = 0

        for _ in range(self.trials):
            pseudo_system1_scores = Scores()
            pseudo_system2_scores = Scores()

            for score1, score2 in zip(self.system1_scores,
                                      self.system2_scores):
                if random.randint(0, 1) == 0:
                    pseudo_system1_scores.append(score1)
                    pseudo_system2_scores.append(score2)
                else:
                    pseudo_system1_scores.append(score2)
                    pseudo_system2_scores.append(score1)

            pseudo_difference = math.fabs(
                self.aggregator(pseudo_system1_scores) -
                self.aggregator(pseudo_system2_scores))

            if pseudo_difference >= absolute_difference:
                shuffled_was_at_least_as_high += 1
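
        # Add one to numerator and denominator ("add-one smoothing"): this
        # counts the identity permutation, which always attains the observed
        # difference, and guarantees a nonzero estimate.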
        significance_level = (shuffled_was_at_least_as_high + 1) / (
            self.trials + 1)

        return significance_level

--------------------------------------------------------------------------------
/art/test/__init__.py:
--------------------------------------------------------------------------------
__author__ = 'smartschat'

--------------------------------------------------------------------------------
/art/test/resources/example_scores:
--------------------------------------------------------------------------------
2 3
4 12
22 500
3.1 4.355
--------------------------------------------------------------------------------
/art/test/resources/example_scores_numerator_always_0:
--------------------------------------------------------------------------------
0 3
0 12
0 500
0 4.355
--------------------------------------------------------------------------------
/art/test/test_aggregators.py:
--------------------------------------------------------------------------------
import unittest

from art import aggregators
from art.scores import Score
from art.scores import Scores

__author__ = 'smartschat'


class TestAggregators(unittest.TestCase):
    def test_average(self):
        scores_for_average = Scores(
            [
                Score([1]),
                Score([5]),
                Score([3]),
                Score([0]),
            ]
        )
        self.assertEqual(9.0 / 4, aggregators.average(scores_for_average))

    def test_enum_sum_div_by_denom_sum(self):
        scores_for_enum_sum_div_by_denom_sum = Scores(
            [
                Score([2, 3]),
                Score([4, 12]),
                Score([22, 500]),
                Score([3.1, 4.355]),
            ]
        )
        self.assertEqual(31.1 / 519.355,
                         aggregators.enum_sum_div_by_denom_sum(
                             scores_for_enum_sum_div_by_denom_sum))

    def test_f_1(self):
        scores_for_f_1 = Scores(
            [
                Score([2, 3, 7, 8]),
                Score([4, 12, 33, 50]),
                Score([22, 500, 12.3, 15.9]),
                Score([3.1, 4.355, 1, 2]),
            ]
        )

        recall = 31.1 / 519.355
        precision = 53.3 / 75.9
        f_1 = 2 * recall * precision / (recall + precision)

        self.assertEqual(f_1, aggregators.f_1(scores_for_f_1))

if __name__ == '__main__':
    unittest.main()

--------------------------------------------------------------------------------
/art/test/test_scores.py:
--------------------------------------------------------------------------------
import os
import unittest

from art.scores import Score
from art.scores import Scores

__author__ = 'smartschat'


class TestScores(unittest.TestCase):
    def test_from_file(self):
        expected_scores = Scores(
            [
                Score([2, 3]),
                Score([4, 12]),
                Score([22, 500]),
                Score([3.1, 4.355]),
            ]
        )
        self.assertEqual(expected_scores, Scores.from_file(open(
            os.path.dirname(os.path.realpath(__file__)) +
            "/resources/example_scores")))


if __name__ == '__main__':
    unittest.main()

--------------------------------------------------------------------------------
/art/test/test_significance_tests.py:
--------------------------------------------------------------------------------
import os
import unittest

from art import aggregators
from art import scores
from art import significance_tests


__author__ = 'smartschat'


class TestApproximateRandomizationTest(unittest.TestCase):
    def test_run(self):
        directory = os.path.dirname(os.path.realpath(__file__))
        test = significance_tests.ApproximateRandomizationTest(
            scores.Scores.from_file(open(directory +
                                         "/resources/example_scores")),
            scores.Scores.from_file(open(
                directory + "/resources/example_scores_numerator_always_0")),
            aggregators.enum_sum_div_by_denom_sum
        )
        self.assertGreater(test.run(), 0)

    def test_run_with_same(self):
        directory = os.path.dirname(os.path.realpath(__file__))
        test = significance_tests.ApproximateRandomizationTest(
            scores.Scores.from_file(open(directory +
                                         "/resources/example_scores")),
            scores.Scores.from_file(open(directory +
                                         "/resources/example_scores")),
            aggregators.enum_sum_div_by_denom_sum
        )
        self.assertEqual(1.0, test.run())


if __name__ == '__main__':
    unittest.main()

--------------------------------------------------------------------------------
/art/transform_conll_score_file.py:
--------------------------------------------------------------------------------
"""Transform CoNLL scorer files into a suitable format."""

from art.scores import Score
from art.scores import Scores

__author__ = 'smartschat'


def get_numerators_and_denominators(score_file):
    """Transform score files obtained by the CoNLL scorer.

    This function transforms files obtained by the reference coreference
    scorer (https://code.google.com/p/reference-coreference-scorers/) into
    a format suitable for performing significance testing for differences in
    F1 score.

    Args:
        score_file: A file obtained via running the reference coreference
            scorer for a single metric, as in
            $ perl scorer.pl muc key response > conll_score_file

    Returns:
        A Scores object containing numerator/denominator for recall and
        precision for each document described in the score file.
    """
    scores_from_file = Scores()

    temp_mapping = {}
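
    # Note on the expected input: for each document, the scorer output is
    # assumed to contain an identifier line starting with "(", followed by a
    # line such as
    #
    #   Recall: (12 / 14) 85.71%   Precision: (12 / 16) 75%   F1: 80%
    #
    # so splitting a "Recall:" line on whitespace puts the recall numerator
    # and denominator at indices 1 and 3, and the precision numerator and
    # denominator at indices 6 and 8 (up to surrounding parentheses).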

    for line in score_file.readlines():
        # Lines from readlines() keep their trailing newline, so compare
        # against the stripped line (an exact equality check on the raw
        # line would never match).
        if line.strip() == '====== TOTALS =======':
            break
        elif line.startswith("("):
            identifier = line.strip()
        elif line.startswith('Recall:'):
            entries = line.split()
            recall_numerator = entries[1].replace("(", "")
            recall_denominator = entries[3].replace(")", "")
            precision_numerator = entries[6].replace("(", "")
            precision_denominator = entries[8].replace(")", "")

            temp_mapping[identifier] = [
                recall_numerator,
                recall_denominator,
                precision_numerator,
                precision_denominator
            ]

            identifier = None

    for identifier in sorted(temp_mapping.keys()):
        scores_from_file.append(
            Score(temp_mapping[identifier])
        )

    return scores_from_file

--------------------------------------------------------------------------------