├── slides.pdf
├── requirements.txt
├── environment.yml
├── README.md
├── LICENSE
├── python_scripts
│   ├── 3_uncertainty_in_metrics_tutorial.py
│   ├── 2_roc_pr_curves_tutorial.py
│   └── 1_evaluation_tutorial.py
└── notebooks
    ├── 3_uncertainty_in_metrics_tutorial.ipynb
    ├── 2_roc_pr_curves_tutorial.ipynb
    └── 1_evaluation_tutorial.ipynb
/slides.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ArturoAmorQ/euroscipy_2022_evaluation/HEAD/slides.pdf
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | numpy==1.26.*
2 | scipy==1.11.*
3 | pandas==2.1.*
4 | matplotlib==3.7.*
5 | jupyter
6 | seaborn
7 | scikit-learn==1.3.*
8 |
--------------------------------------------------------------------------------
/environment.yml:
--------------------------------------------------------------------------------
1 | name: evaluation-tutorial
2 |
3 | dependencies:
4 | - python
5 | - scikit-learn
6 | - pandas
7 | - seaborn
8 | - jupyter
9 |   - matplotlib
10 |   - pip
11 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # EuroSciPy 2022 - Evaluating your ML models tutorial
2 |
3 | Follow the intro slides [here](https://github.com/ArturoAmorQ/euroscipy_2022_evaluation/blob/main/slides.pdf).
4 |
5 | ## Follow the tutorial online
6 |
7 | Launch an online notebook environment using [Binder](https://notebooks.gesis.org/binder/v2/gh/ArturoAmorQ/euroscipy_2022_evaluation/HEAD):
8 |
9 | - [1_evaluation_tutorial.ipynb](https://notebooks.gesis.org/binder/v2/gh/ArturoAmorQ/euroscipy_2022_evaluation/HEAD?filepath=/notebooks/1_evaluation_tutorial.ipynb)
10 | - [2_roc_pr_curves_tutorial.ipynb](https://notebooks.gesis.org/binder/v2/gh/ArturoAmorQ/euroscipy_2022_evaluation/HEAD?filepath=/notebooks/2_roc_pr_curves_tutorial.ipynb)
11 | - [3_uncertainty_in_metrics_tutorial.ipynb](https://notebooks.gesis.org/binder/v2/gh/ArturoAmorQ/euroscipy_2022_evaluation/HEAD?filepath=/notebooks/3_uncertainty_in_metrics_tutorial.ipynb)
12 |
13 | You need an internet connection but you will not have to install any package
14 | locally.
15 |
16 | ## Running the tutorial locally
17 |
18 | ### Dependencies
19 |
20 | The tutorials will require the following packages:
21 |
22 | * python
23 | * jupyter
24 | * pandas
25 | * matplotlib
26 | * seaborn
27 | * scikit-learn >= 1.2.0
28 |
29 | ### Local install
30 |
31 | We provide both `requirements.txt` and `environment.yml` to install packages.
32 |
33 | You can install the packages using `pip`:
34 |
35 | ```
36 | $ pip install -r requirements.txt
37 | ```
38 |
39 | You can create an `evaluation-tutorial` conda environment by executing:
40 |
41 | ```
42 | $ conda env create -f environment.yml
43 | ```
44 |
45 | and later activate the environment:
46 |
47 | ```
48 | $ conda activate evaluation-tutorial
49 | ```
50 |
51 | You can also update an existing environment (here located in `./env`) using:
52 |
53 | ```
54 | $ conda env update --prefix ./env --file environment.yml --prune
55 | ```
56 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Creative Commons Legal Code
2 |
3 | CC0 1.0 Universal
4 |
5 | CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
6 | LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN
7 | ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS
8 | INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES
9 | REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS
10 | PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM
11 | THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED
12 | HEREUNDER.
13 |
14 | Statement of Purpose
15 |
16 | The laws of most jurisdictions throughout the world automatically confer
17 | exclusive Copyright and Related Rights (defined below) upon the creator
18 | and subsequent owner(s) (each and all, an "owner") of an original work of
19 | authorship and/or a database (each, a "Work").
20 |
21 | Certain owners wish to permanently relinquish those rights to a Work for
22 | the purpose of contributing to a commons of creative, cultural and
23 | scientific works ("Commons") that the public can reliably and without fear
24 | of later claims of infringement build upon, modify, incorporate in other
25 | works, reuse and redistribute as freely as possible in any form whatsoever
26 | and for any purposes, including without limitation commercial purposes.
27 | These owners may contribute to the Commons to promote the ideal of a free
28 | culture and the further production of creative, cultural and scientific
29 | works, or to gain reputation or greater distribution for their Work in
30 | part through the use and efforts of others.
31 |
32 | For these and/or other purposes and motivations, and without any
33 | expectation of additional consideration or compensation, the person
34 | associating CC0 with a Work (the "Affirmer"), to the extent that he or she
35 | is an owner of Copyright and Related Rights in the Work, voluntarily
36 | elects to apply CC0 to the Work and publicly distribute the Work under its
37 | terms, with knowledge of his or her Copyright and Related Rights in the
38 | Work and the meaning and intended legal effect of CC0 on those rights.
39 |
40 | 1. Copyright and Related Rights. A Work made available under CC0 may be
41 | protected by copyright and related or neighboring rights ("Copyright and
42 | Related Rights"). Copyright and Related Rights include, but are not
43 | limited to, the following:
44 |
45 | i. the right to reproduce, adapt, distribute, perform, display,
46 | communicate, and translate a Work;
47 | ii. moral rights retained by the original author(s) and/or performer(s);
48 | iii. publicity and privacy rights pertaining to a person's image or
49 | likeness depicted in a Work;
50 | iv. rights protecting against unfair competition in regards to a Work,
51 | subject to the limitations in paragraph 4(a), below;
52 | v. rights protecting the extraction, dissemination, use and reuse of data
53 | in a Work;
54 | vi. database rights (such as those arising under Directive 96/9/EC of the
55 | European Parliament and of the Council of 11 March 1996 on the legal
56 | protection of databases, and under any national implementation
57 | thereof, including any amended or successor version of such
58 | directive); and
59 | vii. other similar, equivalent or corresponding rights throughout the
60 | world based on applicable law or treaty, and any national
61 | implementations thereof.
62 |
63 | 2. Waiver. To the greatest extent permitted by, but not in contravention
64 | of, applicable law, Affirmer hereby overtly, fully, permanently,
65 | irrevocably and unconditionally waives, abandons, and surrenders all of
66 | Affirmer's Copyright and Related Rights and associated claims and causes
67 | of action, whether now known or unknown (including existing as well as
68 | future claims and causes of action), in the Work (i) in all territories
69 | worldwide, (ii) for the maximum duration provided by applicable law or
70 | treaty (including future time extensions), (iii) in any current or future
71 | medium and for any number of copies, and (iv) for any purpose whatsoever,
72 | including without limitation commercial, advertising or promotional
73 | purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each
74 | member of the public at large and to the detriment of Affirmer's heirs and
75 | successors, fully intending that such Waiver shall not be subject to
76 | revocation, rescission, cancellation, termination, or any other legal or
77 | equitable action to disrupt the quiet enjoyment of the Work by the public
78 | as contemplated by Affirmer's express Statement of Purpose.
79 |
80 | 3. Public License Fallback. Should any part of the Waiver for any reason
81 | be judged legally invalid or ineffective under applicable law, then the
82 | Waiver shall be preserved to the maximum extent permitted taking into
83 | account Affirmer's express Statement of Purpose. In addition, to the
84 | extent the Waiver is so judged Affirmer hereby grants to each affected
85 | person a royalty-free, non transferable, non sublicensable, non exclusive,
86 | irrevocable and unconditional license to exercise Affirmer's Copyright and
87 | Related Rights in the Work (i) in all territories worldwide, (ii) for the
88 | maximum duration provided by applicable law or treaty (including future
89 | time extensions), (iii) in any current or future medium and for any number
90 | of copies, and (iv) for any purpose whatsoever, including without
91 | limitation commercial, advertising or promotional purposes (the
92 | "License"). The License shall be deemed effective as of the date CC0 was
93 | applied by Affirmer to the Work. Should any part of the License for any
94 | reason be judged legally invalid or ineffective under applicable law, such
95 | partial invalidity or ineffectiveness shall not invalidate the remainder
96 | of the License, and in such case Affirmer hereby affirms that he or she
97 | will not (i) exercise any of his or her remaining Copyright and Related
98 | Rights in the Work or (ii) assert any associated claims and causes of
99 | action with respect to the Work, in either case contrary to Affirmer's
100 | express Statement of Purpose.
101 |
102 | 4. Limitations and Disclaimers.
103 |
104 | a. No trademark or patent rights held by Affirmer are waived, abandoned,
105 | surrendered, licensed or otherwise affected by this document.
106 | b. Affirmer offers the Work as-is and makes no representations or
107 | warranties of any kind concerning the Work, express, implied,
108 | statutory or otherwise, including without limitation warranties of
109 | title, merchantability, fitness for a particular purpose, non
110 | infringement, or the absence of latent or other defects, accuracy, or
111 | the present or absence of errors, whether or not discoverable, all to
112 | the greatest extent permissible under applicable law.
113 | c. Affirmer disclaims responsibility for clearing rights of other persons
114 | that may apply to the Work or any use thereof, including without
115 | limitation any person's Copyright and Related Rights in the Work.
116 | Further, Affirmer disclaims responsibility for obtaining any necessary
117 | consents, permissions or other rights required for any use of the
118 | Work.
119 | d. Affirmer understands and acknowledges that Creative Commons is not a
120 | party to this document and has no duty or obligation with respect to
121 | this CC0 or use of the Work.
122 |
--------------------------------------------------------------------------------
/python_scripts/3_uncertainty_in_metrics_tutorial.py:
--------------------------------------------------------------------------------
1 | # %% [markdown]
2 | #
3 | # Uncertainty in evaluation metrics for classification
4 | # ====================================================
5 | #
6 | # Has it ever happened to you that one of your colleagues claims their model with
7 | # a test score of 0.8001 is better than your model with a test score of 0.7998?
8 | # Maybe they are not aware that model-evaluation procedures should gauge not
9 | # only the expected generalization performance, but also its variations. As
10 | # usual, let's build a toy dataset to illustrate this.
11 |
12 | # %%
13 | from sklearn.datasets import make_classification
14 |
15 | common_params = {
16 | "n_features": 2,
17 | "n_informative": 2,
18 | "n_redundant": 0,
19 | "n_classes": 2, # binary classification
20 | "random_state": 0,
21 | "weights": [0.55, 0.45],
22 | }
23 | X, y = make_classification(**common_params, n_samples=400)
24 |
25 | prevalence = y.mean()
26 | print(f"Percentage of samples in the positive class: {100*prevalence:.2f}%")
27 |
28 | # %% [markdown]
29 | # We are already familiar with using a train-test split to estimate the
30 | # generalization performance of a model. By default the `train_test_split` uses
31 | # `shuffle=True`. Let's see what happens if we set a particular seed.
32 |
33 | # %%
34 | from sklearn.model_selection import train_test_split
35 | from sklearn.linear_model import LogisticRegression
36 |
37 | X_train, X_test, y_train, y_test = train_test_split(
38 | X, y, test_size=0.2, random_state=1
39 | )
40 | classifier = LogisticRegression().fit(X_train, y_train)
41 | classifier.score(X_test, y_test)
42 |
43 | # %% [markdown]
44 | # Now let's see what happens when shuffling with a different seed:
45 |
46 | # %%
47 | X_train, X_test, y_train, y_test = train_test_split(
48 | X, y, test_size=0.2, random_state=42
49 | )
50 | classifier = LogisticRegression().fit(X_train, y_train)
51 | classifier.score(X_test, y_test)
52 |
53 | # %% [markdown]
54 | # It seems that 42 is indeed the Answer to the Ultimate Question of Life, the
55 | # Universe, and Everything! Or maybe the score of a model depends on the split:
56 | # - the train-test proportion;
57 | # - the representativeness of the elements in each set.
58 | #
59 | # A more systematic way of evaluating the generalization performance of a model
60 | # is through cross-validation, which consists of repeating the split such that
61 | # the training and testing sets are different for each evaluation.
62 |
63 | # %%
64 | from sklearn.model_selection import cross_val_score, ShuffleSplit
65 |
66 | classifier = LogisticRegression()
67 | cv = ShuffleSplit(n_splits=250, test_size=0.2)
68 |
69 | scores = cross_val_score(classifier, X, y, cv=cv)
70 | print(
71 | "The mean cross-validation accuracy is: "
72 | f"{scores.mean():.2f} ± {scores.std():.2f}."
73 | )
74 |
75 | # %% [markdown]
76 | # Scores have a variability. A simple probabilistic model gives the distribution
77 | # of the observed error: if the true classification rate is p, the number
78 | # of correct classifications on a set of size n follows a binomial distribution.
79 | # Let's create a function to easily visualize this:
80 |
81 | # %%
82 | import matplotlib.pyplot as plt
83 | import numpy as np
84 | import seaborn as sns
85 | from scipy import stats
86 |
87 |
88 | def plot_error_distrib(classifier, X, y, cv=5):
89 |
90 | n = len(X)
91 |
92 | scores = cross_val_score(classifier, X, y, cv=cv)
93 |     distrib = stats.binom(n=n, p=scores.mean())  # binomial model of the number of correct predictions
94 |
95 | plt.plot(
96 | np.linspace(0, 1, n),
97 |         n * distrib.pmf(np.arange(0, n)),  # scale the pmf to a density over [0, 1]
98 | linewidth=2,
99 | color="black",
100 | label="binomial distribution",
101 | )
102 | sns.histplot(scores, stat="density", label="empirical distribution")
103 | plt.xlim(0, 1)
104 | plt.title("Accuracy: " f"{scores.mean():.2f} ± {scores.std():.2f}.")
105 | plt.legend()
106 | plt.show()
107 |
108 |
109 | plot_error_distrib(classifier, X, y, cv=cv)
110 |
111 | # %% [markdown]
112 | # The empirical distribution is still broader than the theoretical one. This can
113 | # be explained by the fact that as we are retraining the model on each fold, it
114 | # actually fluctuates due to the sampling noise in the training data, while the
115 | # model above only accounts for sampling noise in the test data.
116 | #
117 | # The situation does get better with more data:
118 |
119 | # %%
120 | X, y = make_classification(**common_params, n_samples=1_000)
121 | plot_error_distrib(classifier, X, y, cv=cv)
122 |
123 | # %% [markdown]
124 | # Importantly, the standard error of the mean (SEM) across folds is not a good
125 | # measure of this error, as the different data folds are not independent. For
126 | # instance, doing many random splits reduces the variance arbitrarily, but does
127 | # not actually provide new data points.
128 |
129 | # %%
130 | cv = ShuffleSplit(n_splits=10, test_size=0.2)
131 | X, y = make_classification(**common_params, n_samples=400)
132 | scores = cross_val_score(classifier, X, y, cv=cv)
133 |
134 | print(
135 | f"Mean accuracy ± SEM with n_split={cv.get_n_splits()}: "
136 | f"{scores.mean():.3f} ± {stats.sem(scores):.3f}."
137 | )
138 |
139 | cv = ShuffleSplit(n_splits=100, test_size=0.2)
140 | scores = cross_val_score(classifier, X, y, cv=cv)
141 |
142 | print(
143 | f"Mean accuracy ± SEM with n_split={cv.get_n_splits()}: "
144 | f"{scores.mean():.3f} ± {stats.sem(scores):.3f}."
145 | )
146 |
147 | cv = ShuffleSplit(n_splits=500, test_size=0.2)
148 | scores = cross_val_score(classifier, X, y, cv=cv)
149 |
150 | print(
151 | f"Mean accuracy ± SEM with n_split={cv.get_n_splits()}: "
152 | f"{scores.mean():.3f} ± {stats.sem(scores):.3f}."
153 | )
154 |
155 | # %% [markdown]
156 | # Indeed, the SEM goes to zero as 1/sqrt(`n_splits`). Wrapping up:
157 | # - the more data the better;
158 | # - the more splits, the better the empirical distribution describes the
159 | #   variance, but keep in mind that more splits consume more computing
160 | #   power;
161 | # - use std instead of SEM to present your results.
162 | #
163 | # Now that we have an intuition on the variability of an evaluation metric, we
164 | # are ready to apply it to our original Diabetes problem:
165 |
166 | # %%
167 | from sklearn.tree import DecisionTreeClassifier
168 | from sklearn.inspection import DecisionBoundaryDisplay
169 |
170 | diabetes_params = {
171 | "n_samples": 10_000,
172 | "n_features": 2,
173 | "n_informative": 2,
174 | "n_redundant": 0,
175 | "n_classes": 2, # binary classification
176 | "shift": [4, 6],
177 | "scale": [10, 25],
178 | "random_state": 0,
179 | }
180 | X, y = make_classification(**diabetes_params, weights=[0.55, 0.45])
181 |
182 | X_train, X_plot, y_train, y_plot = train_test_split(
183 | X, y, stratify=y, test_size=0.1, random_state=0
184 | )
185 |
186 | estimator = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
187 |
188 | fig, ax = plt.subplots()
189 | disp = DecisionBoundaryDisplay.from_estimator(
190 | estimator,
191 | X_plot,
192 | response_method="predict",
193 | alpha=0.5,
194 | xlabel="age (years)",
195 | ylabel="blood sugar level (mg/dL)",
196 | ax=ax,
197 | )
198 | scatter = disp.ax_.scatter(X_plot[:, 0], X_plot[:, 1], c=y_plot, edgecolor="k")
199 | disp.ax_.set_title(f"Diabetes test with prevalence = {y.mean():.2f}")
200 | _ = disp.ax_.legend(*scatter.legend_elements())
201 |
202 | # %% [markdown]
203 | # Notice that the decision boundary changed with respect to the first notebook
204 | # we explored. Let's make a remark: models depend on the prevalence of
205 | # the data they were trained on. Therefore, all metrics (including likelihood ratios)
206 | # depend on prevalence as much as the model depends on it. The difference is that
207 | # likelihood ratios extrapolate through populations of different prevalence for
208 | # a **fixed model**.
209 | #
210 | # Let's compute all the metrics and assess their variability in this case:
211 |
212 | # %%
213 | from collections import defaultdict
214 | import pandas as pd
215 |
216 | cv = ShuffleSplit(n_splits=50, test_size=0.2)
217 |
218 | evaluation = defaultdict(list)
219 | scoring_strategies = [
220 | "accuracy",
221 | "balanced_accuracy",
222 | "recall",
223 | "precision",
224 | "matthews_corrcoef",
225 | # "positive_likelihood_ratio",
226 | # "neg_negative_likelihood_ratio",
227 | ]
228 |
229 | for score_name in scoring_strategies:
230 | scores = cross_val_score(estimator, X, y, cv=cv, scoring=score_name)
231 | evaluation[score_name] = scores
232 |
233 | evaluation = pd.DataFrame(evaluation).aggregate(["mean", "std"]).T
234 | evaluation["mean"].plot.barh(xerr=evaluation["std"]).set_xlabel("score")
235 | plt.show()
236 |
237 | # %% [markdown]
238 | # Notice that `"positive_likelihood_ratio"` is not bounded from above and
239 | # therefore it can't be directly compared with the other metrics on a single
240 | # plot. Similarly, the `"neg_negative_likelihood_ratio"` has a reversed sign (is
241 | # negative) to follow the scikit-learn convention for metrics for which a lower
242 | # score is better.
243 | #
244 | # In this case we trained the model on nearly balanced classes. Try changing the
245 | # prevalence and see how the variance of the metrics depends on data imbalance.
246 |
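# %% [markdown]
# As a minimal sketch (assuming scikit-learn >= 1.2, where these scorers are
# available), the likelihood-ratio scorers can be cross-validated on their own
# scale, separately from the bounded metrics above:

# %%
for score_name in ["positive_likelihood_ratio", "neg_negative_likelihood_ratio"]:
    scores = cross_val_score(estimator, X, y, cv=cv, scoring=score_name)
    print(f"{score_name}: {scores.mean():.2f} ± {scores.std():.2f}")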
--------------------------------------------------------------------------------
/python_scripts/2_roc_pr_curves_tutorial.py:
--------------------------------------------------------------------------------
1 | # %% [markdown]
2 | # Evaluation of non-thresholded prediction
3 | # ========================================
4 | #
5 | # All statistics that we presented up to now rely on `.predict` which outputs
6 | # the most likely label. We haven’t made use of the probability associated with
7 | # this prediction, which gives the confidence of the classifier in this
8 | # prediction. By default, the prediction of a classifier corresponds to a
9 | # threshold of 0.5 probability in a binary classification problem. Let's build a
10 | # toy dataset to illustrate this.
11 |
12 | # %%
13 | from sklearn.datasets import make_classification
14 | from sklearn.model_selection import train_test_split
15 |
16 | common_params = {
17 | "n_samples": 10_000,
18 | "n_features": 2,
19 | "n_informative": 2,
20 | "n_redundant": 0,
21 | "n_classes": 2, # binary classification
22 | "class_sep": 0.5,
23 | "random_state": 0,
24 | }
25 | X, y = make_classification(**common_params, weights=[0.6, 0.4])
26 |
27 | X_train, X_test, y_train, y_test = train_test_split(
28 | X, y, stratify=y, random_state=0, test_size=0.02
29 | )
30 |
31 | # %% [markdown]
32 | # We can quickly check the predicted probabilities of belonging to either class
33 | # using a `LogisticRegression`. To ease the visualization we select a subset
34 | # of `n_plot` samples.
35 |
36 | # %%
37 | import pandas as pd
38 | from sklearn.linear_model import LogisticRegression
39 |
40 | n_plot = 10
41 | classifier = LogisticRegression()
42 | classifier.fit(X_train, y_train)
43 |
44 | proba_predicted = pd.DataFrame(
45 | classifier.predict_proba(X_test), columns=classifier.classes_
46 | ).round(decimals=2)
47 | proba_predicted[:n_plot]
48 |
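# %% [markdown]
# As a quick sanity check (a sketch, not part of the original analysis):
# thresholding the predicted probability of the positive class at 0.5 recovers
# the output of `.predict` for this binary classifier:

# %%
import numpy as np

manual_pred = (classifier.predict_proba(X_test)[:, 1] >= 0.5).astype(int)
print(f"Matches .predict: {np.array_equal(manual_pred, classifier.predict(X_test))}")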
49 | # %% [markdown]
50 | # Probabilities sum to 1. In the binary case it suffices to retain the
51 | # probability of belonging to the positive class, here shown as an annotation in
52 | # the `DecisionBoundaryDisplay`. Notice that setting
53 | # `response_method="predict_proba"` shows the level curves of the 2D sigmoid
54 | # (logistic curve).
55 |
56 | # %%
57 | import matplotlib.pyplot as plt
58 | from matplotlib.colors import ListedColormap
59 | from sklearn.inspection import DecisionBoundaryDisplay
60 |
61 | fig, ax = plt.subplots()
62 | disp = DecisionBoundaryDisplay.from_estimator(
63 | classifier,
64 | X_test,
65 | response_method="predict_proba",
66 | cmap="RdBu",
67 | alpha=0.5,
68 | vmin=0,
69 | vmax=1,
70 | ax=ax,
71 | )
72 | DecisionBoundaryDisplay.from_estimator(
73 | classifier,
74 | X_test,
75 | response_method="predict_proba",
76 | plot_method="contour",
77 | alpha=0.2,
78 | levels=[0.5], # 0.5 probability contour line
79 | linestyles="--",
80 | linewidths=2,
81 | ax=ax,
82 | )
83 | scatter = disp.ax_.scatter(
84 | X_test[:n_plot, 0], X_test[:n_plot, 1], c=y_test[:n_plot],
85 | cmap=ListedColormap(["tab:red", "tab:blue"]),
86 | edgecolor="k"
87 | )
88 | disp.ax_.legend(*scatter.legend_elements(), title="True class", loc="lower right")
89 | for i, proba in enumerate(proba_predicted[:n_plot][1]):
90 | disp.ax_.annotate(proba, (X_test[i, 0], X_test[i, 1]), fontsize="large")
91 | plt.xlim(-2.0, 2.0)
92 | plt.ylim(-4.0, 4.0)
93 | plt.title(
94 | "Probability of belonging to the positive class\n(default decision threshold)"
95 | )
96 | plt.show()
97 |
98 | # %% [markdown]
99 | # Evaluation of different probability thresholds
100 | # ==============================================
101 | #
102 | # The default decision threshold (0.5) might not be the best threshold that
103 | # leads to optimal generalization performance of our classifier. One can vary
104 | # the decision threshold (and therefore the underlying prediction) and compute
105 | # some evaluation metrics as presented earlier.
106 | #
107 | # Receiver Operating Characteristic curve
108 | # ---------------------------------------
109 | #
110 | # One could be interested in the compromise between accurately discriminating
111 | # both the positive and the negative class. The statistics used for this
112 | # are sensitivity and specificity, which measure the proportion of correctly
113 | # classified samples per class.
114 | #
115 | # Sensitivity and specificity are generally plotted as a curve called the
116 | # Receiver Operating Characteristic (ROC) curve. Each point on the graph
117 | # corresponds to a specific decision threshold. Below is such a curve:
118 |
119 | # %%
120 | from sklearn.metrics import RocCurveDisplay
121 | from sklearn.dummy import DummyClassifier
122 |
123 | dummy_classifier = DummyClassifier(strategy="most_frequent")
124 | dummy_classifier.fit(X_train, y_train)
125 |
126 | disp = RocCurveDisplay.from_estimator(
127 | classifier, X_test, y_test, name="LogisticRegression", color="tab:green"
128 | )
129 | disp = RocCurveDisplay.from_estimator(
130 | dummy_classifier,
131 | X_test,
132 | y_test,
133 | name="chance level",
134 | color="tab:red",
135 | ax=disp.ax_,
136 | )
137 | plt.xlim(0, 1)
138 | plt.ylim(0, 1)
139 | plt.legend(loc="lower right")
140 | plt.title("ROC curve for LogisticRegression")
141 | plt.show()
142 |
143 | # %% [markdown]
144 | # ROC curves typically feature true positive rate on the Y axis, and false
145 | # positive rate on the X axis. This means that the top left corner of the plot
146 | # is the "ideal" point - a false positive rate of zero, and a true positive rate
147 | # of one. This is not very realistic, but it does mean that a larger area under
148 | # the curve (AUC) is usually better.
149 | #
150 | # We can compute the area under the ROC curve (using `roc_auc_score`) to
151 | # summarize the generalization performance of a model with a single number, or
152 | # to compare several models across thresholds.
153 |
154 | # %%
155 | from sklearn.ensemble import RandomForestClassifier
156 | from sklearn.ensemble import HistGradientBoostingClassifier
157 |
158 |
159 | classifiers = {
160 | "Hist Gradient Boosting": HistGradientBoostingClassifier(),
161 | "Random Forest": RandomForestClassifier(n_jobs=-1, random_state=1),
162 | "Logistic Regression": LogisticRegression(),
163 | "Chance": DummyClassifier(strategy="most_frequent"),
164 | }
165 |
166 | fig = plt.figure()
167 | ax = plt.axes([0.08, 0.15, 0.78, 0.78])
168 |
169 | for name, clf in classifiers.items():
170 | clf.fit(X_train, y_train)
171 | disp = RocCurveDisplay.from_estimator(clf, X_test, y_test, name=name, ax=ax)
172 | plt.xlabel("False positive rate")
173 | plt.ylabel("True positive rate ")
174 | plt.text(
175 | 0.098,
176 | 0.575,
177 | "= sensitivity or recall",
178 | transform=fig.transFigure,
179 | size=7,
180 | rotation="vertical",
181 | )
182 | plt.xlim(0, 1)
183 | plt.ylim(0, 1)
184 | plt.legend(loc="lower right")
185 | plt.title("ROC curves for several models")
186 | plt.show()
187 |
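# %% [markdown]
# For reference, a minimal sketch of computing the ROC-AUC directly with
# `roc_auc_score`, using the predicted probability of the positive class as the
# non-thresholded score:

# %%
from sklearn.metrics import roc_auc_score

for name, clf in classifiers.items():
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"ROC-AUC of {name}: {auc:.3f}")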
188 | # %% [markdown]
189 | # It is important to notice that the chance level for the ROC-AUC is 0.5, as
190 | # obtained by a non-informative classifier. Indeed, we plot the generalization
191 | # performance of a dummy classifier (the red line) as a baseline: any useful
192 | # model is expected to stay above this line.
193 |
194 | # %% [markdown]
195 | # Precision-Recall curves
196 | # -----------------------
197 | #
198 | # As mentioned above, maximizing the area under the ROC curve helps find a
199 | # compromise between accurately discriminating both the positive and the
200 | # negative class. If the interest is to focus mainly on the positive class, the
201 | # precision and recall metrics are more appropriate. Similarly to the ROC
202 | # curve, each point in the Precision-Recall curve corresponds to a level of
203 | # probability used as a decision threshold.
204 |
205 | # %%
206 | from sklearn.metrics import PrecisionRecallDisplay
207 |
208 | fig = plt.figure()
209 | ax = plt.axes([0.08, 0.15, 0.78, 0.78])
210 |
211 | for name, clf in classifiers.items():
212 | clf.fit(X_train, y_train)
213 | disp = PrecisionRecallDisplay.from_estimator(clf, X_test, y_test, name=name, ax=ax)
214 | plt.xlabel("Recall ")
215 | plt.text(0.45, 0.067, "= TPR or sensitivity", transform=fig.transFigure, size=7)
216 | plt.ylabel("Precision ")
217 | plt.text(0.1, 0.6, "= PPV", transform=fig.transFigure, size=7, rotation="vertical")
218 | plt.xlim(0, 1)
219 | plt.ylim(0, 1)
220 | plt.legend(loc="lower right")
221 | plt.title("Precision-recall curve for several models")
222 | plt.show()
223 |
224 | # %% [markdown]
225 | # A classifier with no false positives would have a precision of 1 for all
226 | # recall values. Similarly to the ROC-AUC, the area under the curve can be
227 | # used to characterize the curve in a single number and is named average
228 | # precision (AP). With an ideal classifier, the average precision would be 1.
229 | #
230 | # In this case, notice that the AP of a `DummyClassifier`, used as baseline to
231 | # define the chance level, coincides with the prevalence of the positive class.
232 | # This is analogous to the downside of the accuracy score as shown in the first
233 | # notebook.
234 |
235 | # %%
236 | prevalence = y.mean()
237 | print(f"Prevalence of the positive class: {prevalence:.3f}")
238 |
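# %% [markdown]
# As a minimal check, `average_precision_score` on the constant scores of the
# dummy classifier indeed recovers the prevalence of the positive class in
# `y_test`:

# %%
from sklearn.metrics import average_precision_score

ap_dummy = average_precision_score(
    y_test, dummy_classifier.predict_proba(X_test)[:, 1]
)
print(f"AP of the dummy classifier: {ap_dummy:.3f}")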
239 | # %% [markdown]
240 | # Let's see the effect of adding imbalance between classes on our set of models:
241 |
242 | # %%
243 | X, y = make_classification(**common_params, weights=[0.83, 0.17])
244 |
245 | X_train, X_test, y_train, y_test = train_test_split(
246 | X, y, stratify=y, random_state=0, test_size=0.02
247 | )
248 |
249 | fig = plt.figure()
250 | ax = plt.axes([0.08, 0.15, 0.78, 0.78])
251 |
252 | for name, clf in classifiers.items():
253 | clf.fit(X_train, y_train)
254 | disp = PrecisionRecallDisplay.from_estimator(clf, X_test, y_test, name=name, ax=ax)
255 | plt.xlabel("Recall ")
256 | plt.text(0.45, 0.067, "= TPR or sensitivity", transform=fig.transFigure, size=7)
257 | plt.ylabel("Precision ")
258 | plt.text(0.1, 0.6, "= PPV", transform=fig.transFigure, size=7, rotation="vertical")
259 | plt.xlim(0, 1)
260 | plt.ylim(0, 1)
261 | plt.legend(loc="upper right")
262 | plt.title("Precision-recall curve for several models\nw. imbalanced data")
263 | plt.show()
264 |
265 | # %% [markdown]
266 | # The AP of all models decreased, including the baseline defined by the dummy
267 | # classifier. Indeed, we confirm that AP depends on the prevalence.
268 | #
269 | # Conclusions
270 | # ===========
271 | #
272 | # - Consider the prevalence in your target population. It may be that the
273 | # prevalence in your testing sample is not representative of that of the
274 | # target population. In that case, aside from LR+ and LR-, performance metrics
275 | # computed from the testing sample will not be representative of those in the
276 | # target population.
277 | #
278 | # - Never trust a single summary metric (accuracy, balanced accuracy, ROC-AUC,
279 | # etc.), but rather look at all the individual metrics. Understand the
280 | #   implication of your choices to choose the right tradeoff.
281 |
--------------------------------------------------------------------------------
/python_scripts/1_evaluation_tutorial.py:
--------------------------------------------------------------------------------
1 | # %% [markdown]
2 | #
3 | # Accounting for imbalance in evaluation metrics for classification
4 | # =================================================================
5 | #
6 | # Suppose we have a population of subjects with features `X` that can hopefully
7 | # serve as indicators of a binary class `y` (known ground truth). Additionally,
8 | # suppose the class prevalence (the number of samples in the positive class
9 | # divided by the total number of samples) is very low.
10 | #
11 | # To fix ideas, let's use a medical analogy and think about diabetes. We only
12 | # use two features (age and blood sugar level) to keep the example as simple as
13 | # possible. We use `make_classification` to simulate the distribution of the
14 | # disease and to ensure **the data-generating process is always the same**. We
15 | # set the `weights=[0.99, 0.01]` to obtain a prevalence of around 1% which,
16 | # according to [The World
17 | # Bank](https://data.worldbank.org/indicator/SH.STA.DIAB.ZS?most_recent_value_desc=false),
18 | # is the case for the country with the lowest diabetes prevalence in 2022
19 | # (Benin).
20 | #
21 | # In practice, the ideas presented here can be applied in settings where the
22 | # data available to learn and evaluate a classifier has nearly balanced classes,
23 | # such as a case-control study, while the target application, i.e. the general
24 | # population, has very low prevalence.
25 |
26 | # %%
27 | from sklearn.datasets import make_classification
28 |
29 | common_params = {
30 | "n_samples": 10_000,
31 | "n_features": 2,
32 | "n_informative": 2,
33 | "n_redundant": 0,
34 | "n_classes": 2, # binary classification
35 | "shift": [4, 6],
36 | "scale": [10, 25],
37 | "random_state": 0,
38 | }
39 | X, y = make_classification(**common_params, weights=[0.99, 0.01])
40 | prevalence = y.mean()
41 | print(f"Percentage of people carrying the disease: {100*prevalence:.2f}%")
42 |
43 | # %% [markdown]
44 | # A simple model is trained to diagnose if a person is likely to have diabetes.
45 | # To estimate the generalization performance of such a model, we do a train-test
46 | # split.
47 |
48 | # %%
49 | from sklearn.model_selection import train_test_split
50 | from sklearn.tree import DecisionTreeClassifier
51 |
52 | X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
53 |
54 | estimator = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
55 |
56 | # %% [markdown]
57 | # We now show the decision boundary learned by the estimator. Notice that we
58 | # only plot a stratified subset of the original data.
59 |
60 | # %%
61 | import matplotlib.pyplot as plt
62 | from sklearn.inspection import DecisionBoundaryDisplay
63 |
64 | fig, ax = plt.subplots()
65 | disp = DecisionBoundaryDisplay.from_estimator(
66 | estimator,
67 | X_test,
68 | response_method="predict",
69 | alpha=0.5,
70 | xlabel="age (years)",
71 | ylabel="blood sugar level (mg/dL)",
72 | ax=ax,
73 | )
74 | scatter = disp.ax_.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolor="k")
75 | disp.ax_.set_title(f"Hypothetical diabetes test with prevalence = {y.mean():.2f}")
76 | _ = disp.ax_.legend(*scatter.legend_elements())
77 |
78 | # %% [markdown]
79 | # The most widely used summary metric is arguably accuracy. Its main advantage
80 | # is a natural interpretation: the proportion of correctly classified samples.
81 |
82 | # %%
83 | from sklearn import metrics
84 |
85 | y_pred = estimator.predict(X_test)
86 | accuracy = metrics.accuracy_score(y_test, y_pred)
87 | print(f"Accuracy on the test set: {accuracy:.3f}")
88 |
89 | # %% [markdown]
90 | # However, it is misleading when the data is imbalanced. Our model performs
91 | # as well as a trivial majority classifier.
92 |
93 | # %%
94 | from sklearn.dummy import DummyClassifier
95 |
96 | dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
97 | y_dummy = dummy.predict(X_test)
98 | accuracy_dummy = metrics.accuracy_score(y_test, y_dummy)
99 | print(f"Accuracy if Diabetes did not exist: {accuracy_dummy:.3f}")
100 |
101 | # %% [markdown]
102 | # Some of the other metrics are better at describing the flaws of our model:
103 |
104 | # %%
105 | sensitivity = metrics.recall_score(y_test, y_pred)
106 | specificity = metrics.recall_score(y_test, y_pred, pos_label=0)
107 | balanced_acc = metrics.balanced_accuracy_score(y_test, y_pred)
108 | matthews = metrics.matthews_corrcoef(y_test, y_pred)
109 | PPV = metrics.precision_score(y_test, y_pred)
110 |
111 | print(f"Sensitivity on the test set: {sensitivity:.2f}")
112 | print(f"Specificity on the test set: {specificity:.2f}")
113 | print(f"Balanced accuracy on the test set: {balanced_acc:.2f}")
114 | print(f"Matthews correlation coeff on the test set: {matthews:.2f}")
115 | print()
116 | print(f"Probability to have the disease given a positive test: {100*PPV:.2f}%")
117 |
118 | # %% [markdown]
119 | # Our classifier is not informative enough on the general population. The PPV
120 | # and NPV give the information of interest: P(D+ | T+) and P(D− | T−). However,
121 | # they are not intrinsic to the medical test (in other words, the trained ML
122 | # model): they also depend on the prevalence and thus on the target population.
123 | #
124 | # The class likelihood ratios (LR±) depend only on sensitivity and specificity
125 | # of the classifier, and not on the prevalence of the study population. For the
126 | # moment it suffices to recall that LR± is defined as
127 | #
128 | # LR± = P(T± | D+) / P(T± | D−)
129 |
130 | # %%
131 | pos_LR, neg_LR = metrics.class_likelihood_ratios(y_test, y_pred)
132 | print(f"LR+ on the test set: {pos_LR:.3f}") # higher is better
133 | print(f"LR- on the test set: {neg_LR:.3f}") # lower is better
134 |
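# %% [markdown]
# As a sanity check (a minimal sketch using the quantities computed above), the
# NPV mentioned earlier is the precision of the negative class, and the class
# likelihood ratios can be recovered as LR+ = sens / (1 - spec) and
# LR- = (1 - sens) / spec:

# %%
NPV = metrics.precision_score(y_test, y_pred, pos_label=0)
print(f"Probability to not have the disease given a negative test: {100*NPV:.2f}%")
print(f"LR+ from sensitivity and specificity: {sensitivity / (1 - specificity):.3f}")
print(f"LR- from sensitivity and specificity: {(1 - sensitivity) / specificity:.3f}")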
135 | # %% [markdown]
136 | #
137 | # **Caution!**
138 | # Please notice that if you want to use
139 | # `metrics.class_likelihood_ratios`, you require scikit-learn >= 1.2.
140 | #
141 | #
142 | #
143 | # Extrapolating between populations
144 | # ---------------------------------
145 | #
146 | # The prevalence can be variable (for instance the prevalence of an infectious
147 | # disease will be variable across time) and a given classifier may be intended
148 | # to be applied in various situations.
149 | #
150 | # According to the World Bank, the diabetes prevalence in the French Polynesia
151 | # in 2022 is above 25%. Let's now evaluate our previously trained model on a
152 | # **different population** with such prevalence and **the same data-generating
153 | # process**.
154 | # %%
155 | X, y = make_classification(**common_params, weights=[0.75, 0.25])
156 | X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
157 |
158 | fig, ax = plt.subplots()
159 | disp = DecisionBoundaryDisplay.from_estimator(
160 | estimator,
161 | X_test,
162 | response_method="predict",
163 | alpha=0.5,
164 | xlabel="age (years)",
165 | ylabel="blood sugar level (mg/dL)",
166 | ax=ax,
167 | )
168 | scatter = disp.ax_.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolor="k")
169 | disp.ax_.set_title(f"Hypothetical diabetes test with prevalence = {y.mean():.2f}")
170 | _ = disp.ax_.legend(*scatter.legend_elements())
171 |
172 | # %% [markdown]
173 | # We then compute the same metrics using a test set with the new
174 | # prevalence:
175 | # %%
176 | y_pred = estimator.predict(X_test)
177 | prevalence = y.mean()
178 | accuracy = metrics.accuracy_score(y_test, y_pred)
179 | sensitivity = metrics.recall_score(y_test, y_pred)
180 | specificity = metrics.recall_score(y_test, y_pred, pos_label=0)
181 | balanced_acc = metrics.balanced_accuracy_score(y_test, y_pred)
182 | matthews = metrics.matthews_corrcoef(y_test, y_pred)
183 | PPV = metrics.precision_score(y_test, y_pred)
184 |
185 | print(f"Accuracy on the test set: {accuracy:.2f}")
186 | print(f"Sensitivity on the test set: {sensitivity:.2f}")
187 | print(f"Specificity on the test set: {specificity:.2f}")
188 | print(f"Balanced accuracy on the test set: {balanced_acc:.2f}")
189 | print(f"Matthews correlation coeff on the test set: {matthews:.2f}")
190 | print()
191 | print(f"Probability to have the disease given a positive test: {100*PPV:.2f}%")
192 |
193 | # %% [markdown]
194 | # The same model seems to perform better on this new dataset. Notice in
195 | # particular that the probability to have the disease given a positive test
196 | # increased. The same blood sugar test is less predictive in Benin than in
197 | # the French Polynesia!
198 | #
199 | # If we really want to score the test and not the dataset, we need a metric that
200 | # does not depend on the prevalence of the study population.
201 |
202 | # %%
203 | pos_LR, neg_LR = metrics.class_likelihood_ratios(y_test, y_pred)
204 |
205 | print(f"LR+ on the test set: {pos_LR:.3f}")
206 | print(f"LR- on the test set: {neg_LR:.3f}")
207 |
208 | # %% [markdown]
209 | # Despite some variations due to residual dataset dependence, the class
210 | # likelihood ratios are mathematically invariant with respect to prevalence. See
211 | # [this example from the User
212 | # Guide](https://scikit-learn.org/dev/auto_examples/model_selection/plot_likelihood_ratios.html#invariance-with-respect-to-prevalence)
213 | # for a demonstration of this property.
214 | #
215 | # Pre-test vs. post-test odds
216 | # ---------------------------
217 | #
218 | # Both class likelihood ratios are interpretable in terms of odds:
219 | #
220 | # post-test odds = Likelihood ratio * pre-test odds
221 | #
222 | # The interpretation of LR+ in this case reads:
223 |
224 | # %%
225 | print("The post-test odds that the condition is truly present given a positive "
226 | f"test result are: {pos_LR:.3f} times larger than the pre-test odds.")
227 |
228 | # %% [markdown]
229 | # We found that the diagnostic tool is useful: the post-test odds are larger than the
230 | # pre-test odds. We now choose the pre-test probability to be the prevalence of
231 | # the disease in the held-out testing set.
232 |
233 | # %%
234 | pretest_odds = y_test.mean() / (1 - y_test.mean())
235 | posttest_odds = pretest_odds * pos_LR
236 |
237 | print(f"Observed pre-test odds: {pretest_odds:.3f}")
238 | print(f"Estimated post-test odds using LR+: {posttest_odds:.3f}")
239 |
240 | # %% [markdown]
241 | # The post-test probability is the probability that an individual truly has
242 | # the condition given a positive test result, i.e. the number of true positives
243 | # divided by the total number of positive predictions. In real life applications
244 | # this is unknown.
245 |
246 | # %%
247 | posttest_prob = posttest_odds / (1 + posttest_odds)
248 |
249 | print(f"Estimated post-test probability using LR+: {posttest_prob:.3f}")
250 |
251 | # %% [markdown]
252 | # We can verify that if we had had access to the true labels, we would have
253 | # obtained the same probabilities:
254 |
255 | # %%
256 | posttest_prob = y_test[y_pred == 1].mean()
257 |
258 | print(f"Observed post-test probability: {posttest_prob:.3f}")
259 |
260 | # %% [markdown]
261 | # Conclusion: if a salesperson from Benin were to sell the model to French Polynesia
262 | # by showing them the 59.84% probability to have the disease given a positive test,
263 | # French Polynesia would never have bought it, even though it would be quite
264 | # predictive for their own population. The right metrics to report are the LR±.
265 | #
266 | # Can you imagine what would happen if the model were trained on nearly balanced classes
267 | # and then extrapolated to other scenarios?
268 |
--------------------------------------------------------------------------------
/notebooks/3_uncertainty_in_metrics_tutorial.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "604de853",
6 | "metadata": {},
7 | "source": [
8 | "\n",
9 | "Uncertainty in evaluation metrics for classification\n",
10 | "====================================================\n",
11 | "\n",
12 | "Has it ever happened to you that one of your colleagues claims their model with\n",
13 | "a test score of 0.8001 is better than your model with a test score of 0.7998?\n",
14 | "Maybe they are not aware that model-evaluation procedures should gauge not\n",
15 | "only the expected generalization performance, but also its variations. As\n",
16 | "usual, let's build a toy dataset to illustrate this."
17 | ]
18 | },
19 | {
20 | "cell_type": "code",
21 | "execution_count": null,
22 | "id": "096f78da",
23 | "metadata": {},
24 | "outputs": [],
25 | "source": [
26 | "from sklearn.datasets import make_classification\n",
27 | "\n",
28 | "common_params = {\n",
29 | " \"n_features\": 2,\n",
30 | " \"n_informative\": 2,\n",
31 | " \"n_redundant\": 0,\n",
32 | " \"n_classes\": 2, # binary classification\n",
33 | " \"random_state\": 0,\n",
34 | " \"weights\": [0.55, 0.45],\n",
35 | "}\n",
36 | "X, y = make_classification(**common_params, n_samples=400)\n",
37 | "\n",
38 | "prevalence = y.mean()\n",
39 | "print(f\"Percentage of samples in the positive class: {100*prevalence:.2f}%\")"
40 | ]
41 | },
42 | {
43 | "cell_type": "markdown",
44 | "id": "98f132b7",
45 | "metadata": {},
46 | "source": [
47 | "We are already familiar with using a train-test split to estimate the\n",
48 | "generalization performance of a model. By default the `train_test_split` uses\n",
49 | "`shuffle=True`. Let's see what happens if we set a particular seed."
50 | ]
51 | },
52 | {
53 | "cell_type": "code",
54 | "execution_count": null,
55 | "id": "615abf52",
56 | "metadata": {},
57 | "outputs": [],
58 | "source": [
59 | "from sklearn.model_selection import train_test_split\n",
60 | "from sklearn.linear_model import LogisticRegression\n",
61 | "\n",
62 | "X_train, X_test, y_train, y_test = train_test_split(\n",
63 | " X, y, test_size=0.2, random_state=1\n",
64 | ")\n",
65 | "classifier = LogisticRegression().fit(X_train, y_train)\n",
66 | "classifier.score(X_test, y_test)"
67 | ]
68 | },
69 | {
70 | "cell_type": "markdown",
71 | "id": "79141bd1",
72 | "metadata": {},
73 | "source": [
74 | "Now let's see what happens when shuffling with a different seed:"
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": null,
80 | "id": "a75b8b30",
81 | "metadata": {},
82 | "outputs": [],
83 | "source": [
84 | "X_train, X_test, y_train, y_test = train_test_split(\n",
85 | " X, y, test_size=0.2, random_state=42\n",
86 | ")\n",
87 | "classifier = LogisticRegression().fit(X_train, y_train)\n",
88 | "classifier.score(X_test, y_test)"
89 | ]
90 | },
91 | {
92 | "cell_type": "markdown",
93 | "id": "bde0b08e",
94 | "metadata": {},
95 | "source": [
96 | "It seems that 42 is indeed the Answer to the Ultimate Question of Life, the\n",
97 | "Universe, and Everything! Or maybe the score of a model depends on the split:\n",
98 | " - the train-test proportion;\n",
99 | " - the representativeness of the elements in each set.\n",
100 | "\n",
101 | "A more systematic way of evaluating the generalization performance of a model\n",
102 | "is through cross-validation, which consists of repeating the split such that\n",
103 | "the training and testing sets are different for each evaluation."
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": null,
109 | "id": "3178a156",
110 | "metadata": {},
111 | "outputs": [],
112 | "source": [
113 | "from sklearn.model_selection import cross_val_score, ShuffleSplit\n",
114 | "\n",
115 | "classifier = LogisticRegression()\n",
116 | "cv = ShuffleSplit(n_splits=250, test_size=0.2)\n",
117 | "\n",
118 | "scores = cross_val_score(classifier, X, y, cv=cv)\n",
119 | "print(\n",
120 | " \"The mean cross-validation accuracy is: \"\n",
121 | " f\"{scores.mean():.2f} ± {scores.std():.2f}.\"\n",
122 | ")"
123 | ]
124 | },
125 | {
126 | "cell_type": "markdown",
127 | "id": "84a601e3",
128 | "metadata": {},
129 | "source": [
130 | "Scores have a variability. A simple probabilistic model gives the distribution\n",
131 | "of the observed error: if the true classification rate is p, the number\n",
132 | "of correct classifications on a set of size n follows a binomial distribution.\n",
133 | "Let's create a function to easily visualize this:"
134 | ]
135 | },
136 | {
137 | "cell_type": "code",
138 | "execution_count": null,
139 | "id": "99ee9c85",
140 | "metadata": {},
141 | "outputs": [],
142 | "source": [
143 | "import matplotlib.pyplot as plt\n",
144 | "import numpy as np\n",
145 | "import seaborn as sns\n",
146 | "from scipy import stats\n",
147 | "\n",
148 | "\n",
149 | "def plot_error_distrib(classifier, X, y, cv=5):\n",
150 | "\n",
151 | " n = len(X)\n",
152 | "\n",
153 | " scores = cross_val_score(classifier, X, y, cv=cv)\n",
154 | "    distrib = stats.binom(n=n, p=scores.mean())  # binomial model of the number of correct predictions\n",
155 | "\n",
156 | " plt.plot(\n",
157 | " np.linspace(0, 1, n),\n",
158 | "        n * distrib.pmf(np.arange(0, n)),  # scale the pmf to a density over [0, 1]\n",
159 | " linewidth=2,\n",
160 | " color=\"black\",\n",
161 | " label=\"binomial distribution\",\n",
162 | " )\n",
163 | " sns.histplot(scores, stat=\"density\", label=\"empirical distribution\")\n",
164 | " plt.xlim(0, 1)\n",
165 | " plt.title(\"Accuracy: \" f\"{scores.mean():.2f} ± {scores.std():.2f}.\")\n",
166 | " plt.legend()\n",
167 | " plt.show()\n",
168 | "\n",
169 | "\n",
170 | "plot_error_distrib(classifier, X, y, cv=cv)"
171 | ]
172 | },
173 | {
174 | "cell_type": "markdown",
175 | "id": "80fcb32e",
176 | "metadata": {},
177 | "source": [
178 | "The empirical distribution is still broader than the theoretical one. This can\n",
179 | "be explained by the fact that as we are retraining the model on each fold, it\n",
180 | "actually fluctuates due to the sampling noise in the training data, while the\n",
181 | "model above only accounts for sampling noise in the test data.\n",
182 | "\n",
183 | "The situation does get better with more data:"
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": null,
189 | "id": "442ddefb",
190 | "metadata": {},
191 | "outputs": [],
192 | "source": [
193 | "X, y = make_classification(**common_params, n_samples=1_000)\n",
194 | "plot_error_distrib(classifier, X, y, cv=cv)"
195 | ]
196 | },
197 | {
198 | "cell_type": "markdown",
199 | "id": "b3a7b727",
200 | "metadata": {},
201 | "source": [
202 | "Importantly, the standard error of the mean (SEM) across folds is not a good\n",
203 | "measure of this error, as the different data folds are not independent. For\n",
204 | "instance, doing many random splits reduces the variance arbitrarily, but does\n",
205 | "not actually provide new data points."
206 | ]
207 | },
208 | {
209 | "cell_type": "code",
210 | "execution_count": null,
211 | "id": "f5c2a788",
212 | "metadata": {},
213 | "outputs": [],
214 | "source": [
215 | "cv = ShuffleSplit(n_splits=10, test_size=0.2)\n",
216 | "X, y = make_classification(**common_params, n_samples=400)\n",
217 | "scores = cross_val_score(classifier, X, y, cv=cv)\n",
218 | "\n",
219 | "print(\n",
220 | " f\"Mean accuracy ± SEM with n_split={cv.get_n_splits()}: \"\n",
221 | " f\"{scores.mean():.3f} ± {stats.sem(scores):.3f}.\"\n",
222 | ")\n",
223 | "\n",
224 | "cv = ShuffleSplit(n_splits=100, test_size=0.2)\n",
225 | "scores = cross_val_score(classifier, X, y, cv=cv)\n",
226 | "\n",
227 | "print(\n",
228 | " f\"Mean accuracy ± SEM with n_split={cv.get_n_splits()}: \"\n",
229 | " f\"{scores.mean():.3f} ± {stats.sem(scores):.3f}.\"\n",
230 | ")\n",
231 | "\n",
232 | "cv = ShuffleSplit(n_splits=500, test_size=0.2)\n",
233 | "scores = cross_val_score(classifier, X, y, cv=cv)\n",
234 | "\n",
235 | "print(\n",
236 | " f\"Mean accuracy ± SEM with n_split={cv.get_n_splits()}: \"\n",
237 | " f\"{scores.mean():.3f} ± {stats.sem(scores):.3f}.\"\n",
238 | ")"
239 | ]
240 | },
241 | {
242 | "cell_type": "markdown",
243 | "id": "22457177",
244 | "metadata": {},
245 | "source": [
246 | "Indeed, the SEM goes to zero as 1/sqrt(`n_splits`). Wrapping up:\n",
247 | "- the more data the better;\n",
248 | "- the more splits, the better the empirical distribution describes the\n",
249 | "  variance, but keep in mind that more splits consume more computing\n",
250 | " power;\n",
251 | "- use std instead of SEM to present your results.\n",
252 | "\n",
253 | "Now that we have an intuition on the variability of an evaluation metric, we\n",
254 | "are ready to apply it to our original Diabetes problem:"
255 | ]
256 | },
257 | {
258 | "cell_type": "code",
259 | "execution_count": null,
260 | "id": "c2e55603",
261 | "metadata": {},
262 | "outputs": [],
263 | "source": [
264 | "from sklearn.tree import DecisionTreeClassifier\n",
265 | "from sklearn.inspection import DecisionBoundaryDisplay\n",
266 | "\n",
267 | "diabetes_params = {\n",
268 | " \"n_samples\": 10_000,\n",
269 | " \"n_features\": 2,\n",
270 | " \"n_informative\": 2,\n",
271 | " \"n_redundant\": 0,\n",
272 | " \"n_classes\": 2, # binary classification\n",
273 | " \"shift\": [4, 6],\n",
274 | " \"scale\": [10, 25],\n",
275 | " \"random_state\": 0,\n",
276 | "}\n",
277 | "X, y = make_classification(**diabetes_params, weights=[0.55, 0.45])\n",
278 | "\n",
279 | "X_train, X_plot, y_train, y_plot = train_test_split(\n",
280 | " X, y, stratify=y, test_size=0.1, random_state=0\n",
281 | ")\n",
282 | "\n",
283 | "estimator = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)\n",
284 | "\n",
285 | "fig, ax = plt.subplots()\n",
286 | "disp = DecisionBoundaryDisplay.from_estimator(\n",
287 | " estimator,\n",
288 | " X_plot,\n",
289 | " response_method=\"predict\",\n",
290 | " alpha=0.5,\n",
291 | " xlabel=\"age (years)\",\n",
292 | " ylabel=\"blood sugar level (mg/dL)\",\n",
293 | " ax=ax,\n",
294 | ")\n",
295 | "scatter = disp.ax_.scatter(X_plot[:, 0], X_plot[:, 1], c=y_plot, edgecolor=\"k\")\n",
296 | "disp.ax_.set_title(f\"Diabetes test with prevalence = {y.mean():.2f}\")\n",
297 | "_ = disp.ax_.legend(*scatter.legend_elements())"
298 | ]
299 | },
300 | {
301 | "cell_type": "markdown",
302 | "id": "27cad221",
303 | "metadata": {},
304 | "source": [
305 | "Notice that the decision boundary changed with respect to the first notebook\n",
306 | "we explored. Let's make a remark: models depend on the prevalence of\n",
307 | "the data they were trained on. Therefore, all metrics (including likelihood ratios)\n",
308 | "depend on prevalence as much as the model depends on it. The difference is that\n",
309 | "likelihood ratios extrapolate through populations of different prevalence for\n",
310 | "a **fixed model**.\n",
311 | "\n",
312 | "Let's compute all the metrics and assess their variability in this case:"
313 | ]
314 | },
315 | {
316 | "cell_type": "code",
317 | "execution_count": null,
318 | "id": "277a758a",
319 | "metadata": {},
320 | "outputs": [],
321 | "source": [
322 | "from collections import defaultdict\n",
323 | "import pandas as pd\n",
324 | "\n",
325 | "cv = ShuffleSplit(n_splits=50, test_size=0.2)\n",
326 | "\n",
327 | "evaluation = defaultdict(list)\n",
328 | "scoring_strategies = [\n",
329 | " \"accuracy\",\n",
330 | " \"balanced_accuracy\",\n",
331 | " \"recall\",\n",
332 | " \"precision\",\n",
333 | " \"matthews_corrcoef\",\n",
334 | " # \"positive_likelihood_ratio\",\n",
335 | " # \"neg_negative_likelihood_ratio\",\n",
336 | "]\n",
337 | "\n",
338 | "for score_name in scoring_strategies:\n",
339 | " scores = cross_val_score(estimator, X, y, cv=cv, scoring=score_name)\n",
340 | " evaluation[score_name] = scores\n",
341 | "\n",
342 | "evaluation = pd.DataFrame(evaluation).aggregate([\"mean\", \"std\"]).T\n",
343 | "evaluation[\"mean\"].plot.barh(xerr=evaluation[\"std\"]).set_xlabel(\"score\")\n",
344 | "plt.show()"
345 | ]
346 | },
347 | {
348 | "cell_type": "markdown",
349 | "id": "812bdbd6",
350 | "metadata": {},
351 | "source": [
352 | "Notice that `\"positive_likelihood_ratio\"` is not bounded from above and\n",
353 | "therefore it can't be directly compared with the other metrics on a single\n",
354 | "plot. Similarly, the `\"neg_negative_likelihood_ratio\"` has a reversed sign (is\n",
355 | "negative) to follow the scikit-learn convention for metrics for which a lower\n",
356 | "score is better.\n",
357 | "\n",
358 | "In this case we trained the model on nearly balanced classes. Try changing the\n",
359 | "prevalence and see how the variance of the metrics depends on data imbalance."
360 | ]
361 | }
362 | ],
363 | "metadata": {
364 | "jupytext": {
365 | "cell_metadata_filter": "-all",
366 | "main_language": "python",
367 | "notebook_metadata_filter": "-all"
368 | }
369 | },
370 | "nbformat": 4,
371 | "nbformat_minor": 5
372 | }
373 |
--------------------------------------------------------------------------------
/notebooks/2_roc_pr_curves_tutorial.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "1c0200d9",
6 | "metadata": {},
7 | "source": [
8 | "Evaluation of non-thresholded prediction\n",
9 | "========================================\n",
10 | "\n",
11 | "All statistics that we presented up to now rely on `.predict` which outputs\n",
12 | "the most likely label. We haven’t made use of the probability associated with\n",
13 | "this prediction, which gives the confidence of the classifier in this\n",
14 | "prediction. By default, the prediction of a classifier corresponds to a\n",
15 | "threshold of 0.5 probability in a binary classification problem. Let's build a\n",
16 | "toy dataset to illustrate this."
17 | ]
18 | },
19 | {
20 | "cell_type": "code",
21 | "execution_count": null,
22 | "id": "2cee4d2f",
23 | "metadata": {},
24 | "outputs": [],
25 | "source": [
26 | "from sklearn.datasets import make_classification\n",
27 | "from sklearn.model_selection import train_test_split\n",
28 | "\n",
29 | "common_params = {\n",
30 | " \"n_samples\": 10_000,\n",
31 | " \"n_features\": 2,\n",
32 | " \"n_informative\": 2,\n",
33 | " \"n_redundant\": 0,\n",
34 | " \"n_classes\": 2, # binary classification\n",
35 | " \"class_sep\": 0.5,\n",
36 | " \"random_state\": 0,\n",
37 | "}\n",
38 | "X, y = make_classification(**common_params, weights=[0.6, 0.4])\n",
39 | "\n",
40 | "X_train, X_test, y_train, y_test = train_test_split(\n",
41 | " X, y, stratify=y, random_state=0, test_size=0.02\n",
42 | ")"
43 | ]
44 | },
45 | {
46 | "cell_type": "markdown",
47 | "id": "aa2bc2df",
48 | "metadata": {},
49 | "source": [
50 | "We can quickly check the predicted probabilities to belong to either class\n",
51 | "using a `LogisticRegression`. To ease the visualization we select a subset\n",
52 | "of `n_plot` samples."
53 | ]
54 | },
55 | {
56 | "cell_type": "code",
57 | "execution_count": null,
58 | "id": "78a544d8",
59 | "metadata": {},
60 | "outputs": [],
61 | "source": [
62 | "import pandas as pd\n",
63 | "from sklearn.linear_model import LogisticRegression\n",
64 | "\n",
65 | "n_plot = 10\n",
66 | "classifier = LogisticRegression()\n",
67 | "classifier.fit(X_train, y_train)\n",
68 | "\n",
69 | "proba_predicted = pd.DataFrame(\n",
70 | " classifier.predict_proba(X_test), columns=classifier.classes_\n",
71 | ").round(decimals=2)\n",
72 | "proba_predicted[:n_plot]"
73 | ]
74 | },
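75 | {
76 | "cell_type": "markdown",
77 | "id": "1a2b3c4d",
78 | "metadata": {},
79 | "source": [
80 | "As a quick sanity check of the 0.5 threshold mentioned above (a sketch\n",
81 | "reusing `classifier` and `X_test`): `.predict` agrees with thresholding the\n",
82 | "positive-class probability at 0.5."
83 | ]
84 | },
85 | {
86 | "cell_type": "code",
87 | "execution_count": null,
88 | "id": "2b3c4d5e",
89 | "metadata": {},
90 | "outputs": [],
91 | "source": [
92 | "import numpy as np\n",
93 | "\n",
94 | "# Predict positive whenever the probability of class 1 exceeds 0.5; this\n",
95 | "# coincides with the hard predictions of `.predict`.\n",
96 | "thresholded = (classifier.predict_proba(X_test)[:, 1] > 0.5).astype(int)\n",
97 | "print(np.array_equal(classifier.predict(X_test), thresholded))"
98 | ]
99 | },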
75 | {
76 | "cell_type": "markdown",
77 | "id": "807a2305",
78 | "metadata": {},
79 | "source": [
80 | "Probabilites sum to 1. In the binary case it suffices to retain the\n",
81 | "probability of belonging to the positive class, here shown as an annotation in\n",
82 | "the `DecisionBoundaryDisplay`. Notice that setting\n",
83 | "`response_method=\"predict_proba\"` shows the level curves of the 2D sigmoid\n",
84 | "(logistic curve)."
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "execution_count": null,
90 | "id": "cbaf5e5d",
91 | "metadata": {},
92 | "outputs": [],
93 | "source": [
94 | "import matplotlib.pyplot as plt\n",
95 | "from matplotlib.colors import ListedColormap\n",
96 | "from sklearn.inspection import DecisionBoundaryDisplay\n",
97 | "\n",
98 | "fig, ax = plt.subplots()\n",
99 | "disp = DecisionBoundaryDisplay.from_estimator(\n",
100 | " classifier,\n",
101 | " X_test,\n",
102 | " response_method=\"predict_proba\",\n",
103 | " cmap=\"RdBu\",\n",
104 | " alpha=0.5,\n",
105 | " vmin=0,\n",
106 | " vmax=1,\n",
107 | " ax=ax,\n",
108 | ")\n",
109 | "DecisionBoundaryDisplay.from_estimator(\n",
110 | " classifier,\n",
111 | " X_test,\n",
112 | " response_method=\"predict_proba\",\n",
113 | " plot_method=\"contour\",\n",
114 | " alpha=0.2,\n",
115 | " levels=[0.5], # 0.5 probability contour line\n",
116 | " linestyles=\"--\",\n",
117 | " linewidths=2,\n",
118 | " ax=ax,\n",
119 | ")\n",
120 | "scatter = disp.ax_.scatter(\n",
121 | " X_test[:n_plot, 0], X_test[:n_plot, 1], c=y_test[:n_plot], \n",
122 | " cmap=ListedColormap([\"tab:red\", \"tab:blue\"]),\n",
123 | " edgecolor=\"k\"\n",
124 | ")\n",
125 | "disp.ax_.legend(*scatter.legend_elements(), title=\"True class\", loc=\"lower right\")\n",
126 | "for i, proba in enumerate(proba_predicted[:n_plot][1]):\n",
127 | " disp.ax_.annotate(proba, (X_test[i, 0], X_test[i, 1]), fontsize=\"large\")\n",
128 | "plt.xlim(-2.0, 2.0)\n",
129 | "plt.ylim(-4.0, 4.0)\n",
130 | "plt.title(\n",
131 | " \"Probability of belonging to the positive class\\n(default decision threshold)\"\n",
132 | ")\n",
133 | "plt.show()"
134 | ]
135 | },
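136 | {
137 | "cell_type": "markdown",
138 | "id": "5c6d7e8f",
139 | "metadata": {},
140 | "source": [
141 | "The 0.5 contour above is not the only possible cutoff. As a minimal sketch\n",
142 | "(the 0.3 threshold below is an arbitrary value for illustration), moving the\n",
143 | "threshold changes the hard predictions and hence the thresholded metrics:"
144 | ]
145 | },
146 | {
147 | "cell_type": "code",
148 | "execution_count": null,
149 | "id": "6d7e8f9a",
150 | "metadata": {},
151 | "outputs": [],
152 | "source": [
153 | "from sklearn.metrics import precision_score, recall_score\n",
154 | "\n",
155 | "# Recompute precision and recall at the default and at a lowered threshold.\n",
156 | "for threshold in [0.5, 0.3]:\n",
157 | "    y_pred_at_t = (classifier.predict_proba(X_test)[:, 1] >= threshold).astype(int)\n",
158 | "    print(\n",
159 | "        f\"threshold={threshold:.1f}: \"\n",
160 | "        f\"precision={precision_score(y_test, y_pred_at_t):.2f}, \"\n",
161 | "        f\"recall={recall_score(y_test, y_pred_at_t):.2f}\"\n",
162 | "    )"
163 | ]
164 | },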
136 | {
137 | "cell_type": "markdown",
138 | "id": "c6b3f802",
139 | "metadata": {},
140 | "source": [
141 | "Evaluation of different probability thresholds\n",
142 | "==============================================\n",
143 | "\n",
144 | "The default decision threshold (0.5) might not be the best threshold that\n",
145 | "leads to optimal generalization performance of our classifier. One can vary\n",
146 | "the decision threshold (and therefore the underlying prediction) and compute\n",
147 | "some evaluation metrics as presented earlier.\n",
148 | "\n",
149 | "Receiver Operating Characteristic curve\n",
150 | "---------------------------------------\n",
151 | "\n",
152 | "One could be interested in the compromise between accurately discriminating\n",
153 | "both the positive class and the negative classes. The statistics used for this\n",
154 | "are sensitivity and specificity, which measure the proportion of correctly\n",
155 | "classified samples per class.\n",
156 | "\n",
157 | "Sensitivity and specificity are generally plotted as a curve called the\n",
158 | "Receiver Operating Characteristic (ROC) curve. Each point on the graph\n",
159 | "corresponds to a specific decision threshold. Below is such a curve:"
160 | ]
161 | },
162 | {
163 | "cell_type": "code",
164 | "execution_count": null,
165 | "id": "dbed0c9e",
166 | "metadata": {},
167 | "outputs": [],
168 | "source": [
169 | "from sklearn.metrics import RocCurveDisplay\n",
170 | "from sklearn.dummy import DummyClassifier\n",
171 | "\n",
172 | "dummy_classifier = DummyClassifier(strategy=\"most_frequent\")\n",
173 | "dummy_classifier.fit(X_train, y_train)\n",
174 | "\n",
175 | "disp = RocCurveDisplay.from_estimator(\n",
176 | " classifier, X_test, y_test, name=\"LogisticRegression\", color=\"tab:green\"\n",
177 | ")\n",
178 | "disp = RocCurveDisplay.from_estimator(\n",
179 | " dummy_classifier,\n",
180 | " X_test,\n",
181 | " y_test,\n",
182 | " name=\"chance level\",\n",
183 | " color=\"tab:red\",\n",
184 | " ax=disp.ax_,\n",
185 | ")\n",
186 | "plt.xlim(0, 1)\n",
187 | "plt.ylim(0, 1)\n",
188 | "plt.legend(loc=\"lower right\")\n",
189 | "plt.title(\"ROC curve for LogisticRegression\")\n",
190 | "plt.show()"
191 | ]
192 | },
193 | {
194 | "cell_type": "markdown",
195 | "id": "1837bd36",
196 | "metadata": {},
197 | "source": [
198 | "ROC curves typically feature true positive rate on the Y axis, and false\n",
199 | "positive rate on the X axis. This means that the top left corner of the plot\n",
200 | "is the \"ideal\" point - a false positive rate of zero, and a true positive rate\n",
201 | "of one. This is not very realistic, but it does mean that a larger area under\n",
202 | "the curve (AUC) is usually better.\n",
203 | "\n",
204 | "We can compute the area under the ROC curve (using `roc_auc_score`) to\n",
205 | "summarize the generalization performance of a model with a single number, or\n",
206 | "to compare several models across thresholds."
207 | ]
208 | },
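209 | {
210 | "cell_type": "markdown",
211 | "id": "7e8f9a0b",
212 | "metadata": {},
213 | "source": [
214 | "For instance, a minimal sketch with the logistic regression fitted above\n",
215 | "(`roc_auc_score` takes the probabilities of the positive class, not the hard\n",
216 | "predictions):"
217 | ]
218 | },
219 | {
220 | "cell_type": "code",
221 | "execution_count": null,
222 | "id": "8f9a0b1c",
223 | "metadata": {},
224 | "outputs": [],
225 | "source": [
226 | "from sklearn.metrics import roc_auc_score\n",
227 | "\n",
228 | "# Summarize the ROC curve of the classifier with a single number.\n",
229 | "roc_auc_score(y_test, classifier.predict_proba(X_test)[:, 1])"
230 | ]
231 | },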
209 | {
210 | "cell_type": "code",
211 | "execution_count": null,
212 | "id": "72cb955a",
213 | "metadata": {},
214 | "outputs": [],
215 | "source": [
216 | "from sklearn.ensemble import RandomForestClassifier\n",
217 | "from sklearn.ensemble import HistGradientBoostingClassifier\n",
218 | "\n",
219 | "\n",
220 | "classifiers = {\n",
221 | " \"Hist Gradient Boosting\": HistGradientBoostingClassifier(),\n",
222 | " \"Random Forest\": RandomForestClassifier(n_jobs=-1, random_state=1),\n",
223 | " \"Logistic Regression\": LogisticRegression(),\n",
224 | " \"Chance\": DummyClassifier(strategy=\"most_frequent\"),\n",
225 | "}\n",
226 | "\n",
227 | "fig = plt.figure()\n",
228 | "ax = plt.axes([0.08, 0.15, 0.78, 0.78])\n",
229 | "\n",
230 | "for name, clf in classifiers.items():\n",
231 | " clf.fit(X_train, y_train)\n",
232 | " disp = RocCurveDisplay.from_estimator(clf, X_test, y_test, name=name, ax=ax)\n",
233 | "plt.xlabel(\"False positive rate\")\n",
234 | "plt.ylabel(\"True positive rate \")\n",
235 | "plt.text(\n",
236 | " 0.098,\n",
237 | " 0.575,\n",
238 | " \"= sensitivity or recall\",\n",
239 | " transform=fig.transFigure,\n",
240 | " size=7,\n",
241 | " rotation=\"vertical\",\n",
242 | ")\n",
243 | "plt.xlim(0, 1)\n",
244 | "plt.ylim(0, 1)\n",
245 | "plt.legend(loc=\"lower right\")\n",
246 | "plt.title(\"ROC curves for several models\")\n",
247 | "plt.show()"
248 | ]
249 | },
250 | {
251 | "cell_type": "markdown",
252 | "id": "e0ef75b4",
253 | "metadata": {},
254 | "source": [
255 | "It is important to notice that the lower bound of the ROC-AUC is 0.5,\n",
256 | "corresponding to chance level. Indeed, we show the generalization performance\n",
257 | "of a dummy classifier (the red line) to show that even the worst\n",
258 | "generalization performance obtained will be above this line."
259 | ]
260 | },
261 | {
262 | "cell_type": "markdown",
263 | "id": "df5f19d6",
264 | "metadata": {},
265 | "source": [
266 | "Precision-Recall curves\n",
267 | "-----------------------\n",
268 | "\n",
269 | "As mentioned above, maximizing the ROC curve helps finding a compromise\n",
270 | "between accurately discriminating both the positive class and the negative\n",
271 | "classes. If the interest is to focus mainly on the positive class, the\n",
272 | "precision and recall metrics are more appropriated. Similarly to the ROC\n",
273 | "curve, each point in the Precision-Recall curve corresponds to a level of\n",
274 | "probability which we used as a decision threshold."
275 | ]
276 | },
277 | {
278 | "cell_type": "code",
279 | "execution_count": null,
280 | "id": "44183db6",
281 | "metadata": {},
282 | "outputs": [],
283 | "source": [
284 | "from sklearn.metrics import PrecisionRecallDisplay\n",
285 | "\n",
286 | "fig = plt.figure()\n",
287 | "ax = plt.axes([0.08, 0.15, 0.78, 0.78])\n",
288 | "\n",
289 | "for name, clf in classifiers.items():\n",
290 | " clf.fit(X_train, y_train)\n",
291 | " disp = PrecisionRecallDisplay.from_estimator(clf, X_test, y_test, name=name, ax=ax)\n",
292 | "plt.xlabel(\"Recall \")\n",
293 | "plt.text(0.45, 0.067, \"= TPR or sensitivity\", transform=fig.transFigure, size=7)\n",
294 | "plt.ylabel(\"Precision \")\n",
295 | "plt.text(0.1, 0.6, \"= PPV\", transform=fig.transFigure, size=7, rotation=\"vertical\")\n",
296 | "plt.xlim(0, 1)\n",
297 | "plt.ylim(0, 1)\n",
298 | "plt.legend(loc=\"lower right\")\n",
299 | "plt.title(\"Precision-recall curve for several models\")\n",
300 | "plt.show()"
301 | ]
302 | },
303 | {
304 | "cell_type": "markdown",
305 | "id": "397e60b7",
306 | "metadata": {},
307 | "source": [
308 | "A classifier with no false positives would have a precision of 1 for all\n",
309 | "recall values. In like manner to the ROC-AUC, the area under the curve can be\n",
310 | "used to characterize the curve in a single number and is named average\n",
311 | "precision (AP). With an ideal classifier, the average precision would be 1.\n",
312 | "\n",
313 | "In this case, notice that the AP of a `DummyClassifier`, used as baseline to\n",
314 | "define the chance level, coincides with the prevalence of the positive class.\n",
315 | "This is analogous to the downside of the accuracy score as shown in the first\n",
316 | "notebook."
317 | ]
318 | },
319 | {
320 | "cell_type": "code",
321 | "execution_count": null,
322 | "id": "0475287c",
323 | "metadata": {},
324 | "outputs": [],
325 | "source": [
326 | "prevalence = y.mean()\n",
327 | "print(f\"Prevalence of the positive class: {prevalence:.3f}\")"
328 | ]
329 | },
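330 | {
331 | "cell_type": "markdown",
332 | "id": "9a0b1c2d",
333 | "metadata": {},
334 | "source": [
335 | "A minimal check of this claim (a sketch reusing the fitted `classifiers`\n",
336 | "dictionary and the current test split): the AP of the dummy classifier should\n",
337 | "match the prevalence, up to the test-split sampling."
338 | ]
339 | },
340 | {
341 | "cell_type": "code",
342 | "execution_count": null,
343 | "id": "0b1c2d3e",
344 | "metadata": {},
345 | "outputs": [],
346 | "source": [
347 | "from sklearn.metrics import average_precision_score\n",
348 | "\n",
349 | "# The dummy classifier outputs a constant score, so its AP reduces to the\n",
350 | "# fraction of positives in the test set.\n",
351 | "dummy_ap = average_precision_score(\n",
352 | "    y_test, classifiers[\"Chance\"].predict_proba(X_test)[:, 1]\n",
353 | ")\n",
354 | "print(f\"AP of the dummy classifier: {dummy_ap:.3f}\")"
355 | ]
356 | },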
330 | {
331 | "cell_type": "markdown",
332 | "id": "415c7cc9",
333 | "metadata": {},
334 | "source": [
335 | "Let's see the effect of adding umbalance between classes in our set of models:"
336 | ]
337 | },
338 | {
339 | "cell_type": "code",
340 | "execution_count": null,
341 | "id": "50944a15",
342 | "metadata": {},
343 | "outputs": [],
344 | "source": [
345 | "X, y = make_classification(**common_params, weights=[0.83, 0.17])\n",
346 | "\n",
347 | "X_train, X_test, y_train, y_test = train_test_split(\n",
348 | " X, y, stratify=y, random_state=0, test_size=0.02\n",
349 | ")\n",
350 | "\n",
351 | "fig = plt.figure()\n",
352 | "ax = plt.axes([0.08, 0.15, 0.78, 0.78])\n",
353 | "\n",
354 | "for name, clf in classifiers.items():\n",
355 | " clf.fit(X_train, y_train)\n",
356 | " disp = PrecisionRecallDisplay.from_estimator(clf, X_test, y_test, name=name, ax=ax)\n",
357 | "plt.xlabel(\"Recall \")\n",
358 | "plt.text(0.45, 0.067, \"= TPR or sensitivity\", transform=fig.transFigure, size=7)\n",
359 | "plt.ylabel(\"Precision \")\n",
360 | "plt.text(0.1, 0.6, \"= PPV\", transform=fig.transFigure, size=7, rotation=\"vertical\")\n",
361 | "plt.xlim(0, 1)\n",
362 | "plt.ylim(0, 1)\n",
363 | "plt.legend(loc=\"upper right\")\n",
364 | "plt.title(\"Precision-recall curve for several models\\nw. imbalanced data\")\n",
365 | "plt.show()"
366 | ]
367 | },
368 | {
369 | "cell_type": "markdown",
370 | "id": "f97d04b3",
371 | "metadata": {},
372 | "source": [
373 | "The AP of all models decreased, including the baseline defined by the dummy\n",
374 | "classifier. Indeed, we confirm that AP does not account for prevalence.\n",
375 | "\n",
376 | "Conclusions\n",
377 | "===========\n",
378 | "\n",
379 | "- Consider the prevalence in your target population. It may be that the\n",
380 | " prevalence in your testing sample is not representative of that of the\n",
381 | " target population. In that case, aside from LR+ and LR-, performance metrics\n",
382 | " computed from the testing sample will not be representative of those in the\n",
383 | " target population.\n",
384 | "\n",
385 | "- Never trust a single summary metric (accuracy, balanced accuracy, ROC-AUC,\n",
386 | " etc.), but rather look at all the individual metrics. Understand the\n",
387 | " implication of your choices to known the right tradeoff."
388 | ]
389 | }
390 | ],
391 | "metadata": {
392 | "jupytext": {
393 | "cell_metadata_filter": "-all",
394 | "main_language": "python",
395 | "notebook_metadata_filter": "-all"
396 | }
397 | },
398 | "nbformat": 4,
399 | "nbformat_minor": 5
400 | }
401 |
--------------------------------------------------------------------------------
/notebooks/1_evaluation_tutorial.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "a97407df",
6 | "metadata": {},
7 | "source": [
8 | "\n",
9 | "Accounting for imbalance in evaluation metrics for classification \n",
10 | "=================================================================\n",
11 | "\n",
12 | "Suppose we have a population of subjects with features `X` that can hopefully\n",
13 | "serve as indicators of a binary class `y` (known ground truth). Additionally,\n",
14 | "suppose the class prevalence (the number of samples in the positive class\n",
15 | "divided by the total number of samples) is very low.\n",
16 | "\n",
17 | "To fix ideas, let's use a medical analogy and think about diabetes. We only\n",
18 | "use two features -age and blood sugar level-, to keep the example as simple as\n",
19 | "possible. We use `make_classification` to simulate the distribution of the\n",
20 | "disease and to ensure **the data-generating process is always the same**. We\n",
21 | "set the `weights=[0.99, 0.01]` to obtain a prevalence of around 1% which,\n",
22 | "according to [The World\n",
23 | "Bank](https://data.worldbank.org/indicator/SH.STA.DIAB.ZS?most_recent_value_desc=false),\n",
24 | "is the case for the country with the lowest diabetes prevalence in 2022\n",
25 | "(Benin).\n",
26 | "\n",
27 | "In practice, the ideas presented here can be applied in settings where the\n",
28 | "data available to learn and evaluate a classifier has nearly balanced classes,\n",
29 | "such as a case-control study, while the target application, i.e. the general\n",
30 | "population, has very low prevalence."
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": null,
36 | "id": "ba2ba7a7",
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "from sklearn.datasets import make_classification\n",
41 | "\n",
42 | "common_params = {\n",
43 | " \"n_samples\": 10_000,\n",
44 | " \"n_features\": 2,\n",
45 | " \"n_informative\": 2,\n",
46 | " \"n_redundant\": 0,\n",
47 | " \"n_classes\": 2, # binary classification\n",
48 | " \"shift\": [4, 6],\n",
49 | " \"scale\": [10, 25],\n",
50 | " \"random_state\": 0,\n",
51 | "}\n",
52 | "X, y = make_classification(**common_params, weights=[0.99, 0.01])\n",
53 | "prevalence = y.mean()\n",
54 | "print(f\"Percentage of people carrying the disease: {100*prevalence:.2f}%\")"
55 | ]
56 | },
57 | {
58 | "cell_type": "markdown",
59 | "id": "68480c16",
60 | "metadata": {},
61 | "source": [
62 | "A simple model is trained to diagnose if a person is likely to have diabetes.\n",
63 | "To estimate the generalization performance of such model, we do a train-test\n",
64 | "split."
65 | ]
66 | },
67 | {
68 | "cell_type": "code",
69 | "execution_count": null,
70 | "id": "dbda2ce1",
71 | "metadata": {},
72 | "outputs": [],
73 | "source": [
74 | "from sklearn.model_selection import train_test_split\n",
75 | "from sklearn.tree import DecisionTreeClassifier\n",
76 | "\n",
77 | "X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)\n",
78 | "\n",
79 | "estimator = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)"
80 | ]
81 | },
82 | {
83 | "cell_type": "markdown",
84 | "id": "46de5abd",
85 | "metadata": {},
86 | "source": [
87 | "We now show the decision boundary learned by the estimator. Notice that we\n",
88 | "only plot an stratified subset of the original data."
89 | ]
90 | },
91 | {
92 | "cell_type": "code",
93 | "execution_count": null,
94 | "id": "42c34246",
95 | "metadata": {},
96 | "outputs": [],
97 | "source": [
98 | "import matplotlib.pyplot as plt\n",
99 | "from sklearn.inspection import DecisionBoundaryDisplay\n",
100 | "\n",
101 | "fig, ax = plt.subplots()\n",
102 | "disp = DecisionBoundaryDisplay.from_estimator(\n",
103 | " estimator,\n",
104 | " X_test,\n",
105 | " response_method=\"predict\",\n",
106 | " alpha=0.5,\n",
107 | " xlabel=\"age (years)\",\n",
108 | " ylabel=\"blood sugar level (mg/dL)\",\n",
109 | " ax=ax,\n",
110 | ")\n",
111 | "scatter = disp.ax_.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolor=\"k\")\n",
112 | "disp.ax_.set_title(f\"Hypothetical diabetes test with prevalence = {y.mean():.2f}\")\n",
113 | "_ = disp.ax_.legend(*scatter.legend_elements())"
114 | ]
115 | },
116 | {
117 | "cell_type": "markdown",
118 | "id": "0976dd3e",
119 | "metadata": {},
120 | "source": [
121 | "The most widely used summary metric is arguably accuracy. Its main advantage\n",
122 | "is a natural interpretation: the proportion of correctly classified samples."
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": null,
128 | "id": "2a42dfdf",
129 | "metadata": {},
130 | "outputs": [],
131 | "source": [
132 | "from sklearn import metrics\n",
133 | "\n",
134 | "y_pred = estimator.predict(X_test)\n",
135 | "accuracy = metrics.accuracy_score(y_test, y_pred)\n",
136 | "print(f\"Accuracy on the test set: {accuracy:.3f}\")"
137 | ]
138 | },
139 | {
140 | "cell_type": "markdown",
141 | "id": "460f1449",
142 | "metadata": {},
143 | "source": [
144 | "However, it is misleading when the data is imbalanced. Our model performs\n",
145 | "as well as a trivial majority classifier."
146 | ]
147 | },
148 | {
149 | "cell_type": "code",
150 | "execution_count": null,
151 | "id": "6fcf2b77",
152 | "metadata": {},
153 | "outputs": [],
154 | "source": [
155 | "from sklearn.dummy import DummyClassifier\n",
156 | "\n",
157 | "dummy = DummyClassifier(strategy=\"most_frequent\").fit(X_train, y_train)\n",
158 | "y_dummy = estimator.predict(X_test)\n",
159 | "accuracy_dummy = metrics.accuracy_score(y_test, y_dummy)\n",
160 | "print(f\"Accuracy if Diabetes did not exist: {accuracy_dummy:.3f}\")"
161 | ]
162 | },
163 | {
164 | "cell_type": "markdown",
165 | "id": "d2ce15e0",
166 | "metadata": {},
167 | "source": [
168 | "Some of the other metrics are better at describing the flaws of our model:"
169 | ]
170 | },
171 | {
172 | "cell_type": "code",
173 | "execution_count": null,
174 | "id": "3bc30b2b",
175 | "metadata": {},
176 | "outputs": [],
177 | "source": [
178 | "sensitivity = metrics.recall_score(y_test, y_pred)\n",
179 | "specificity = metrics.recall_score(y_test, y_pred, pos_label=0)\n",
180 | "balanced_acc = metrics.balanced_accuracy_score(y_test, y_pred)\n",
181 | "matthews = metrics.matthews_corrcoef(y_test, y_pred)\n",
182 | "PPV = metrics.precision_score(y_test, y_pred)\n",
183 | "\n",
184 | "print(f\"Sensitivity on the test set: {sensitivity:.2f}\")\n",
185 | "print(f\"Specificity on the test set: {specificity:.2f}\")\n",
186 | "print(f\"Balanced accuracy on the test set: {balanced_acc:.2f}\")\n",
187 | "print(f\"Matthews correlation coeff on the test set: {matthews:.2f}\")\n",
188 | "print()\n",
189 | "print(f\"Probability to have the disease given a positive test: {100*PPV:.2f}%\")"
190 | ]
191 | },
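192 | {
193 | "cell_type": "markdown",
194 | "id": "3e4f5a6b",
195 | "metadata": {},
196 | "source": [
197 | "The NPV can be obtained in the same spirit (a sketch reusing `y_test` and\n",
198 | "`y_pred`): it is the precision of the negative class."
199 | ]
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": null,
204 | "id": "4f5a6b7c",
205 | "metadata": {},
206 | "outputs": [],
207 | "source": [
208 | "# NPV = P(D- | T-): precision computed with the negative class as the\n",
209 | "# reference label.\n",
210 | "NPV = metrics.precision_score(y_test, y_pred, pos_label=0)\n",
211 | "print(f\"Probability to be disease-free given a negative test: {100*NPV:.2f}%\")"
212 | ]
213 | },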
192 | {
193 | "cell_type": "markdown",
194 | "id": "48b03751",
195 | "metadata": {},
196 | "source": [
197 | "Our classifier is not informative enough on the general population. The PPV\n",
198 | "and NPV give the information of interest: P(D+ | T+) and P(D− | T−). However,\n",
199 | "they are not intrinsic to the medical test (in other words the trained ML\n",
200 | "model) but also depend on the prevalence and thus on the target population.\n",
201 | "\n",
202 | "The class likelihood ratios (LR±) depend only on sensitivity and specificity\n",
203 | "of the classifier, and not on the prevalence of the study population. For the\n",
204 | "moment it suffice to recall that LR± is defined as\n",
205 | "\n",
206 | " LR± = P(D± | T+) / P(D± | T−)"
207 | ]
208 | },
209 | {
210 | "cell_type": "code",
211 | "execution_count": null,
212 | "id": "13b04b48",
213 | "metadata": {},
214 | "outputs": [],
215 | "source": [
216 | "pos_LR, neg_LR = metrics.class_likelihood_ratios(y_test, y_pred)\n",
217 | "print(f\"LR+ on the test set: {pos_LR:.3f}\") # higher is better\n",
218 | "print(f\"LR- on the test set: {neg_LR:.3f}\") # lower is better"
219 | ]
220 | },
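221 | {
222 | "cell_type": "markdown",
223 | "id": "5a6b7c8d",
224 | "metadata": {},
225 | "source": [
226 | "As a sanity check (a sketch reusing the `sensitivity` and `specificity`\n",
227 | "computed above, and assuming specificity < 1), the likelihood ratios follow\n",
228 | "directly from the definition:"
229 | ]
230 | },
231 | {
232 | "cell_type": "code",
233 | "execution_count": null,
234 | "id": "6b7c8d9e",
235 | "metadata": {},
236 | "outputs": [],
237 | "source": [
238 | "# LR+ = P(T+ | D+) / P(T+ | D-) = sensitivity / (1 - specificity)\n",
239 | "# LR- = P(T- | D+) / P(T- | D-) = (1 - sensitivity) / specificity\n",
240 | "print(f\"LR+ from sensitivity/specificity: {sensitivity / (1 - specificity):.3f}\")\n",
241 | "print(f\"LR- from sensitivity/specificity: {(1 - sensitivity) / specificity:.3f}\")"
242 | ]
243 | },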
221 | {
222 | "cell_type": "markdown",
223 | "id": "f999f3e2",
224 | "metadata": {},
225 | "source": [
226 | "\n",
227 | "
Caution
\n",
228 | "
Please notice that if you want to use the\n",
229 | "`metrics.class_likelihood_ratios`, you require scikit-learn > v.1.2.0.\n",
230 | "
\n",
231 | "
\n",
232 | "\n",
233 | "Extrapolating between populations\n",
234 | "---------------------------------\n",
235 | "\n",
236 | "The prevalence can be variable (for instance the prevalence of an infectious\n",
237 | "disease will be variable across time) and a given classifier may be intended\n",
238 | "to be applied in various situations.\n",
239 | "\n",
240 | "According to the World Bank, the diabetes prevalence in the French Polynesia\n",
241 | "in 2022 is above 25%. Let's now evaluate our previously trained model on a\n",
242 | "**different population** with such prevalence and **the same data-generating\n",
243 | "process**.\n",
244 | "\n",
245 | "X, y = make_classification(**common_params, weights=[0.75, 0.25])\n",
246 | "X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)\n",
247 | "\n",
248 | "fig, ax = plt.subplots()\n",
249 | "disp = DecisionBoundaryDisplay.from_estimator(\n",
250 | " estimator,\n",
251 | " X_test,\n",
252 | " response_method=\"predict\",\n",
253 | " alpha=0.5,\n",
254 | " xlabel=\"age (years)\",\n",
255 | " ylabel=\"blood sugar level (mg/dL)\",\n",
256 | " ax=ax,\n",
257 | ")\n",
258 | "scatter = disp.ax_.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolor=\"k\")\n",
259 | "disp.ax_.set_title(f\"Hypothetical diabetes test with prevalence = {y.mean():.2f}\")\n",
260 | "_ = disp.ax_.legend(*scatter.legend_elements())"
261 | ]
262 | },
263 | {
264 | "cell_type": "code",
265 | "execution_count": null,
266 | "id": "432b3b95",
267 | "metadata": {},
268 | "outputs": [],
269 | "source": [
270 | "# We then compute the same metrics using a test set with the new\n",
271 | "# prevalence:\n",
272 | "\n",
273 | "y_pred = estimator.predict(X_test)\n",
274 | "prevalence = y.mean()\n",
275 | "accuracy = metrics.accuracy_score(y_test, y_pred)\n",
276 | "sensitivity = metrics.recall_score(y_test, y_pred)\n",
277 | "specificity = metrics.recall_score(y_test, y_pred, pos_label=0)\n",
278 | "balanced_acc = metrics.balanced_accuracy_score(y_test, y_pred)\n",
279 | "matthews = metrics.matthews_corrcoef(y_test, y_pred)\n",
280 | "PPV = metrics.precision_score(y_test, y_pred)\n",
281 | "\n",
282 | "print(f\"Accuracy on the test set: {accuracy:.2f}\")\n",
283 | "print(f\"Sensitivity on the test set: {sensitivity:.2f}\")\n",
284 | "print(f\"Specificity on the test set: {specificity:.2f}\")\n",
285 | "print(f\"Balanced accuracy on the test set: {balanced_acc:.2f}\")\n",
286 | "print(f\"Matthews correlation coeff on the test set: {matthews:.2f}\")\n",
287 | "print()\n",
288 | "print(f\"Probability to have the disease given a positive test: {100*PPV:.2f}%\")"
289 | ]
290 | },
291 | {
292 | "cell_type": "markdown",
293 | "id": "60315d3b",
294 | "metadata": {},
295 | "source": [
296 | "The same model seems to perform better on this new dataset. Notice in\n",
297 | "particular that the probability to have the disease given a positive test\n",
298 | "increased. The same blood sugar test is less predictive in Benin than in\n",
299 | "the French Polynesia!\n",
300 | "\n",
301 | "If we really want to score the test and not the dataset, we need a metric that\n",
302 | "does not depend on the prevalence of the study population."
303 | ]
304 | },
305 | {
306 | "cell_type": "code",
307 | "execution_count": null,
308 | "id": "7395579a",
309 | "metadata": {},
310 | "outputs": [],
311 | "source": [
312 | "pos_LR, neg_LR = metrics.class_likelihood_ratios(y_test, y_pred)\n",
313 | "\n",
314 | "print(f\"LR+ on the test set: {pos_LR:.3f}\")\n",
315 | "print(f\"LR- on the test set: {neg_LR:.3f}\")"
316 | ]
317 | },
318 | {
319 | "cell_type": "markdown",
320 | "id": "d4e7c039",
321 | "metadata": {},
322 | "source": [
323 | "Despite some variations due to residual dataset dependence, the class\n",
324 | "likelihood ratios are mathematically invariant with respect to prevalence. See\n",
325 | "[this example from the User\n",
326 | "Guide](https://scikit-learn.org/dev/auto_examples/model_selection/plot_likelihood_ratios.html#invariance-with-respect-to-prevalence)\n",
327 | "for a demo regarding such property.\n",
328 | "\n",
329 | "Pre-test vs. post-test odds\n",
330 | "---------------------------\n",
331 | "\n",
332 | "Both class likelihood ratios are interpretable in terms of odds:\n",
333 | "\n",
334 | " post-test odds = Likelihood ratio * pre-test odds\n",
335 | "\n",
336 | "The interpretation of LR+ in this case reads:"
337 | ]
338 | },
339 | {
340 | "cell_type": "code",
341 | "execution_count": null,
342 | "id": "4983b942",
343 | "metadata": {},
344 | "outputs": [],
345 | "source": [
346 | "print(\"The post-test odds that the condition is truly present given a positive \"\n",
347 | " f\"test result are: {pos_LR:.3f} times larger than the pre-test odds.\")"
348 | ]
349 | },
350 | {
351 | "cell_type": "markdown",
352 | "id": "71a226b4",
353 | "metadata": {},
354 | "source": [
355 | "We found that diagnosis tool is useful: the post-test odds are larger than the\n",
356 | "pre-test odds. We now choose the pre-test probability to be the prevalence of\n",
357 | "the disease in the held-out testing set."
358 | ]
359 | },
360 | {
361 | "cell_type": "code",
362 | "execution_count": null,
363 | "id": "9967d431",
364 | "metadata": {},
365 | "outputs": [],
366 | "source": [
367 | "pretest_odds = y_test.mean() / (1 - y_test.mean())\n",
368 | "posttest_odds = pretest_odds * pos_LR\n",
369 | "\n",
370 | "print(f\"Observed pre-test odds: {pretest_odds:.3f}\")\n",
371 | "print(f\"Estimated post-test odds using LR+: {posttest_odds:.3f}\")"
372 | ]
373 | },
374 | {
375 | "cell_type": "markdown",
376 | "id": "0828ff62",
377 | "metadata": {},
378 | "source": [
379 | "The post-test probability is the probability of an individual to truly have\n",
380 | "the condition given a positive test result, i.e. the number of true positives\n",
381 | "divided by the total number of samples. In real life applications this is\n",
382 | "unknown."
383 | ]
384 | },
385 | {
386 | "cell_type": "code",
387 | "execution_count": null,
388 | "id": "bf4993a8",
389 | "metadata": {},
390 | "outputs": [],
391 | "source": [
392 | "posttest_prob = posttest_odds / (1 + posttest_odds)\n",
393 | "\n",
394 | "print(f\"Estimated post-test probability using LR+: {posttest_prob:.3f}\")"
395 | ]
396 | },
397 | {
398 | "cell_type": "markdown",
399 | "id": "34d66001",
400 | "metadata": {},
401 | "source": [
402 | "We can verify that if we had had access to the true labels, we would have\n",
403 | "obatined the same probabilities:"
404 | ]
405 | },
406 | {
407 | "cell_type": "code",
408 | "execution_count": null,
409 | "id": "32d96836",
410 | "metadata": {},
411 | "outputs": [],
412 | "source": [
413 | "posttest_prob = y_test[y_pred == 1].mean()\n",
414 | "\n",
415 | "print(f\"Observed post-test probability: {posttest_prob:.3f}\")"
416 | ]
417 | },
418 | {
419 | "cell_type": "markdown",
420 | "id": "de75d14b",
421 | "metadata": {},
422 | "source": [
423 | "Conclusion: If a Benin salesperson was to sell the model to the French Polynesia\n",
424 | "by showing them the 59.84% probability to have the disease given a positive test,\n",
425 | "the French Polynesia would have never bought it, even though it would be quite\n",
426 | "predictive for their own population. The right thing to report are the LR±.\n",
427 | "\n",
428 | "Can you imagine what would happen if the model is trained on nearly balanced classes\n",
429 | "and then extrapolated to other scenarios?"
430 | ]
431 | }
432 | ],
433 | "metadata": {
434 | "jupytext": {
435 | "cell_metadata_filter": "-all",
436 | "main_language": "python",
437 | "notebook_metadata_filter": "-all"
438 | }
439 | },
440 | "nbformat": 4,
441 | "nbformat_minor": 5
442 | }
443 |
--------------------------------------------------------------------------------