├── LICENSE
├── README.md
└── art
    ├── __init__.py
    ├── aggregators.py
    ├── scores.py
    ├── significance_tests.py
    ├── test
    │   ├── __init__.py
    │   ├── resources
    │   │   ├── example_scores
    │   │   └── example_scores_numerator_always_0
    │   ├── test_aggregators.py
    │   ├── test_scores.py
    │   └── test_significance_tests.py
    └── transform_conll_score_file.py

/LICENSE:
--------------------------------------------------------------------------------
The MIT License (MIT)

Copyright (c) 2014 Sebastian Martschat (sebastian.martschat at gmail dot com)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Approximate Randomization Testing

This repository contains a package for performing two-sided paired
approximate randomization tests to assess the statistical significance of the
difference in performance between two systems.

## Usage

To perform the test, create an `ApproximateRandomizationTest` object. Here is
an example:

```python
from art import aggregators
from art import scores
from art import significance_tests


test = significance_tests.ApproximateRandomizationTest(
    scores.Scores.from_file(open('system1_file')),
    scores.Scores.from_file(open('system2_file')),
    aggregators.f_1)
test.run()
```
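
`run()` returns the estimated significance level, i.e. an approximation of
the p-value for the observed difference. A minimal sketch of how the result
might be used, continuing the example above (the 0.05 threshold is purely an
illustration, not a recommendation):

```python
p_value = test.run()
if p_value < 0.05:
    print('The difference between the two systems is significant.')
```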

## Input Format

We assume that we want to check the statistical significance of the
difference in score between two systems S and T on the same corpus C. To
compute the score over the whole corpus, we first compute, for each document,
all quantities needed for the final score, and then aggregate these values
over the whole corpus to obtain an aggregated score.

Hence, we assume that the input files contain in the i-th line

```
score_1 score_2 ... score_n
```

for the i-th document in the corpus. That is, each line is a list of numbers
separated by whitespace.
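
For illustration, an input file for a corpus of three documents, with two
numbers per document, could look like this:

```
2 3
4 12
22 500
```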

## Examples

So far, three aggregation functions are implemented: average, dividing sums
(suitable for precision and recall) and F1 score. All of them live in
`aggregators.py`. We now briefly describe each function and the expected
input format for the `ApproximateRandomizationTest` object.

### Average

Aggregation function `average`. Expected input is a file containing one
number per line. The function just computes the average of all the numbers.

### Dividing sums

Aggregation function `enum_sum_div_by_denom_sum`. Expected input is a file
containing two numbers per line. The first number is interpreted as the
numerator, the second number as the denominator. The aggregated score is
computed by summing the numerators, summing the denominators, and dividing
the first sum by the second. One use case of this aggregation function is to
compute recall or precision, as in the sketch below.
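
For instance, with the three-document file shown above (stored, say, under
the purely illustrative name `recall_scores`), the aggregated score is
(2 + 4 + 22) / (3 + 12 + 500):

```python
from art import aggregators
from art import scores

recall = aggregators.enum_sum_div_by_denom_sum(
    scores.Scores.from_file(open('recall_scores')))
# recall == 28 / 515
```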

### F1

Aggregation function `f_1`. Expected input is a file containing four numbers
per line. The first two numbers are interpreted as numerator and denominator
for recall, the third and fourth number accordingly for precision. The
aggregated score is obtained by aggregating recall and precision
individually, by dividing sums, and then computing the F1 score from the two
aggregated values.
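
As a sketch of the computation (with numbers borrowed from the package's own
tests, using in-memory `Score` objects instead of a file):

```python
from art import aggregators
from art.scores import Score
from art.scores import Scores

# Per document: recall numerator and denominator, then precision
# numerator and denominator.
f1_scores = Scores([Score([2, 3, 7, 8]), Score([4, 12, 33, 50])])

# r = (2 + 4) / (3 + 12), p = (7 + 33) / (8 + 50), F1 = 2*p*r / (p + r)
print(aggregators.f_1(f1_scores))
```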

## Converting CoNLL Coreference Score Files

This repository also contains a module to create the required input from the
output of the CoNLL coreference scorer
(http://code.google.com/p/reference-coreference-scorers/).

First, score the output of a system using the scorer, for one metric, as in

```
$ perl scorer.pl muc key response > conll_score_file
```

Then employ `get_numerators_and_denominators` from
`transform_conll_score_file` to transform the file into an object which can
be used for scoring:

```python
from art import aggregators
from art import significance_tests
from art import transform_conll_score_file as transform


transformed_system1 = transform.get_numerators_and_denominators(
    open('conll_score_file'))
transformed_system2 = transform.get_numerators_and_denominators(
    open('another_conll_score_file'))

test = significance_tests.ApproximateRandomizationTest(
    transformed_system1,
    transformed_system2,
    aggregators.f_1)
test.run()
```

## Changelog

__05 May 2017__
Fixed a bug in transforming CoNLL coreference score files.

--------------------------------------------------------------------------------
/art/__init__.py:
--------------------------------------------------------------------------------
"""art: A package for significance testing via approximate randomization"""

__author__ = 'smartschat'

--------------------------------------------------------------------------------
/art/aggregators.py:
--------------------------------------------------------------------------------
"""Contains functions to aggregate scores over a corpus."""

import math

__author__ = 'smartschat'


def average(scores):
    """Compute the average of all scores.

    Args:
        scores: A Scores object. Each score in scores should contain only one
            number.

    Returns:
        The average of all numbers in scores.
    """
    return math.fsum([score.values[0] for score in scores])/len(scores)


def enum_sum_div_by_denom_sum(scores):
    """Sum up the first entry of all scores, then sum up the second entry.
    Divide the first sum by the second sum.

    Args:
        scores: A Scores object. Each score in scores should contain two
            numbers.

    Returns:
        The sum of the first entry of each score in scores, divided by the
        sum of the second entry of each score in scores.
    """
    return math.fsum([score.values[0] for score in scores])/math.fsum(
        [score.values[1] for score in scores])


def f_1(scores):
    """Compute the corpus-wide F1 score represented by the scores.

    Each score should contain four entries. Consider:
    - first/second entry numerator/denominator for recall,
    - third/fourth entry numerator/denominator for precision.

    Then define r = sum(first entries)/sum(second entries) and
    p = sum(third entries)/sum(fourth entries).

    The F1 score is then computed as F1 = 2pr/(p+r).

    Args:
        scores: A Scores object. Each score in scores should contain four
            numbers.

    Returns:
        The corpus-wide F1 score.
    """
    first_component = math.fsum(
        [score.values[0] for score in scores]
    )/math.fsum(
        [score.values[1] for score in scores])

    second_component = math.fsum(
        [score.values[2] for score in scores]
    )/math.fsum(
        [score.values[3] for score in scores])

    return 2 * first_component * second_component / (
        first_component + second_component)

--------------------------------------------------------------------------------
/art/scores.py:
--------------------------------------------------------------------------------
"""Contains classes for managing scores and lists of scores."""

__author__ = 'smartschat'


class Score(object):
    """A score for an individual document.

    Attributes:
        values: A list of floats, which constitutes the score for the
            document under consideration.
    """
    def __init__(self, score):
        """Create a score from a list of numbers.

        Args:
            score: a list of numbers.
        """
        self.values = [float(val) for val in score]

    def __str__(self):
        return ' '.join([str(val) for val in self.values])

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return self.values == other.values
        else:
            return False

    def __hash__(self):
        # Lists are unhashable, so hash a tuple of the values.
        return hash(tuple(self.values))


class Scores(object):
    """A collection of scores for a set of documents (a corpus).

    Attributes:
        scores: A list of Score objects.
    """
    def __init__(self, scores=None):
        """Init from a list of scores.

        Args:
            scores: A list of Score objects.
        """
        if not scores:
            scores = []

        self.scores = scores

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return self.scores == other.scores
        else:
            return False

    def __hash__(self):
        # Lists are unhashable, so hash a tuple of the scores.
        return hash(tuple(self.scores))

    def __len__(self):
        return len(self.scores)

    def __iter__(self):
        return iter(self.scores)

    def __str__(self):
        return '\n'.join([str(score) for score in self.scores])

    def append(self, score):
        """Append a score.

        Args:
            score: A Score object.
        """
        self.scores.append(score)

    @staticmethod
    def from_file(file):
        """Create a Scores object from a file, where each line in the file
        describes a score for one document.

        The file should contain a list of numbers in each line, separated by
        whitespace. The number of entries in each line should match. An
        example file looks like the following:

        1 2 3
        4 3 2.5
        11 1 0

        Args:
            file: A file containing a list of scores.

        Returns:
            A Scores object representing the scores in the file.
        """
        scores = []
        for line in file.readlines():
            scores.append(Score(line.split()))
        return Scores(scores)

--------------------------------------------------------------------------------
/art/significance_tests.py:
--------------------------------------------------------------------------------
"""Contains significance tests for differences between systems."""

from __future__ import division
import math
import random

from art.scores import Scores

__author__ = 'smartschat'


class ApproximateRandomizationTest(object):
    """A paired two-sided approximate randomization test.

    This class allows performing a paired two-sided approximate randomization
    test to assess the statistical significance of the difference in
    performance between two systems which are run and measured on the same
    corpus.

    Attributes:
        system1_scores: A Scores object, which represents the scores of the
            first system under consideration.
        system2_scores: A Scores object, which represents the scores of the
            second system under consideration.
        aggregator: An aggregator function, which aggregates all scores for
            individual documents to obtain a score for the whole corpus.
        trials: The number of iterations during the test.
    """
    def __init__(self,
                 system1_scores,
                 system2_scores,
                 aggregator,
                 trials=10000):
        """Inits a paired two-sided approximate randomization test.

        Args:
            system1_scores: A Scores object, which represents the scores of
                the first system under consideration.
            system2_scores: A Scores object, which represents the scores of
                the second system under consideration.
            aggregator: An aggregator function, which aggregates all scores
                for individual documents to obtain a score for the whole
                corpus.
            trials: The number of iterations during the test. Defaults to
                10000.
        """
        self.system1_scores = system1_scores
        self.system2_scores = system2_scores
        self.aggregator = aggregator
        self.trials = trials

    def run(self):
        """Compute the statistical significance of a difference between
        the systems via a paired two-sided approximate randomization test.

        Returns:
            An approximation of the probability of observing corpus-wide
            differences in scores at least as extreme as observed here, when
            there is no difference between the systems.
        """
        absolute_difference = math.fabs(
            self.aggregator(self.system1_scores) -
            self.aggregator(self.system2_scores))
        shuffled_was_at_least_as_high = 0

        for _ in range(self.trials):
            pseudo_system1_scores = Scores()
            pseudo_system2_scores = Scores()

            for score1, score2 in zip(self.system1_scores,
                                      self.system2_scores):
                if random.randint(0, 1) == 0:
                    pseudo_system1_scores.append(score1)
                    pseudo_system2_scores.append(score2)
                else:
                    pseudo_system1_scores.append(score2)
                    pseudo_system2_scores.append(score1)

            pseudo_difference = math.fabs(
                self.aggregator(pseudo_system1_scores) -
                self.aggregator(pseudo_system2_scores))

            if pseudo_difference >= absolute_difference:
                shuffled_was_at_least_as_high += 1
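
        # Add one to numerator and denominator ("add-one smoothing"): this
        # counts the identity permutation, which always attains the observed
        # difference, and guarantees a nonzero estimate.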
        significance_level = (shuffled_was_at_least_as_high + 1) / (
            self.trials + 1)

        return significance_level

--------------------------------------------------------------------------------
/art/test/__init__.py:
--------------------------------------------------------------------------------
__author__ = 'smartschat'

--------------------------------------------------------------------------------
/art/test/resources/example_scores:
--------------------------------------------------------------------------------
2 3
4 12
22 500
3.1 4.355
--------------------------------------------------------------------------------
/art/test/resources/example_scores_numerator_always_0:
--------------------------------------------------------------------------------
0 3
0 12
0 500
0 4.355
--------------------------------------------------------------------------------
/art/test/test_aggregators.py:
--------------------------------------------------------------------------------
import unittest

from art import aggregators
from art.scores import Score
from art.scores import Scores

__author__ = 'smartschat'


class TestAggregators(unittest.TestCase):
    def test_average(self):
        scores_for_average = Scores(
            [
                Score([1]),
                Score([5]),
                Score([3]),
                Score([0]),
            ]
        )
        self.assertEqual(9.0 / 4, aggregators.average(scores_for_average))

    def test_enum_sum_div_by_denom_sum(self):
        scores_for_enum_sum_div_by_denom_sum = Scores(
            [
                Score([2, 3]),
                Score([4, 12]),
                Score([22, 500]),
                Score([3.1, 4.355]),
            ]
        )
        self.assertEqual(31.1 / 519.355,
                         aggregators.enum_sum_div_by_denom_sum(
                             scores_for_enum_sum_div_by_denom_sum))

    def test_f_1(self):
        scores_for_f_1 = Scores(
            [
                Score([2, 3, 7, 8]),
                Score([4, 12, 33, 50]),
                Score([22, 500, 12.3, 15.9]),
                Score([3.1, 4.355, 1, 2]),
            ]
        )

        recall = 31.1 / 519.355
        precision = 53.3 / 75.9
        f_1 = 2 * recall * precision / (recall + precision)

        self.assertEqual(f_1, aggregators.f_1(scores_for_f_1))

if __name__ == '__main__':
    unittest.main()

--------------------------------------------------------------------------------
/art/test/test_scores.py:
--------------------------------------------------------------------------------
import os
import unittest

from art.scores import Score
from art.scores import Scores

__author__ = 'smartschat'


class TestScores(unittest.TestCase):
    def test_from_file(self):
        expected_scores = Scores(
            [
                Score([2, 3]),
                Score([4, 12]),
                Score([22, 500]),
                Score([3.1, 4.355]),
            ]
        )
        self.assertEqual(expected_scores, Scores.from_file(open(
            os.path.dirname(os.path.realpath(__file__)) +
            "/resources/example_scores")))


if __name__ == '__main__':
    unittest.main()

--------------------------------------------------------------------------------
/art/test/test_significance_tests.py:
--------------------------------------------------------------------------------
import os
import unittest

from art import aggregators
from art import scores
from art import significance_tests


__author__ = 'smartschat'


class TestApproximateRandomizationTest(unittest.TestCase):
    def test_run(self):
        directory = os.path.dirname(os.path.realpath(__file__))
        test = significance_tests.ApproximateRandomizationTest(
            scores.Scores.from_file(open(directory +
                                         "/resources/example_scores")),
            scores.Scores.from_file(open(
                directory + "/resources/example_scores_numerator_always_0")),
            aggregators.enum_sum_div_by_denom_sum
        )
        self.assertGreater(test.run(), 0)

    def test_run_with_same(self):
        directory = os.path.dirname(os.path.realpath(__file__))
        test = significance_tests.ApproximateRandomizationTest(
            scores.Scores.from_file(open(directory +
                                         "/resources/example_scores")),
            scores.Scores.from_file(open(directory +
                                         "/resources/example_scores")),
            aggregators.enum_sum_div_by_denom_sum
        )
        self.assertEqual(1.0, test.run())


if __name__ == '__main__':
    unittest.main()

--------------------------------------------------------------------------------
/art/transform_conll_score_file.py:
--------------------------------------------------------------------------------
"""Transform CoNLL scorer files into a suitable format."""

from art.scores import Score
from art.scores import Scores

__author__ = 'smartschat'


def get_numerators_and_denominators(score_file):
    """Transform score files obtained by the CoNLL scorer.

    This function transforms files obtained by the reference coreference
    scorer (https://code.google.com/p/reference-coreference-scorers/) into
    a format suitable for performing significance testing for differences in
    F1 score.

    Args:
        score_file: A file obtained via running the reference coreference
            scorer for a single metric, as in
            $ perl scorer.pl muc key response > conll_score_file

    Returns:
        A Scores object containing numerator/denominator for recall and
        precision for each document described in the score file.
    """
    scores_from_file = Scores()

    temp_mapping = {}
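
    # Note on the expected input: for each document, the scorer output is
    # assumed to contain an identifier line starting with "(", followed by a
    # line such as
    #
    #   Recall: (12 / 14) 85.71%   Precision: (12 / 16) 75%   F1: 80%
    #
    # so splitting a "Recall:" line on whitespace puts the recall numerator
    # and denominator at indices 1 and 3, and the precision numerator and
    # denominator at indices 6 and 8 (up to surrounding parentheses).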

    for line in score_file.readlines():
        # Lines from readlines() keep their trailing newline, so compare
        # against the stripped line (an exact equality check on the raw
        # line would never match).
        if line.strip() == '====== TOTALS =======':
            break
        elif line.startswith("("):
            identifier = line.strip()
        elif line.startswith('Recall:'):
            entries = line.split()
            recall_numerator = entries[1].replace("(", "")
            recall_denominator = entries[3].replace(")", "")
            precision_numerator = entries[6].replace("(", "")
            precision_denominator = entries[8].replace(")", "")

            temp_mapping[identifier] = [
                recall_numerator,
                recall_denominator,
                precision_numerator,
                precision_denominator
            ]

            identifier = None

    for identifier in sorted(temp_mapping.keys()):
        scores_from_file.append(
            Score(temp_mapping[identifier])
        )

    return scores_from_file

--------------------------------------------------------------------------------