├── slides.pdf ├── requirements.txt ├── environment.yml ├── README.md ├── LICENSE ├── python_scripts ├── 3_uncertainty_in_metrics_tutorial.py ├── 2_roc_pr_curves_tutorial.py └── 1_evaluation_tutorial.py └── notebooks ├── 3_uncertainty_in_metrics_tutorial.ipynb ├── 2_roc_pr_curves_tutorial.ipynb └── 1_evaluation_tutorial.ipynb /slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ArturoAmorQ/euroscipy_2022_evaluation/HEAD/slides.pdf -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy==1.26.* 2 | scipy==1.11.* 3 | pandas==2.1.* 4 | matplotlib==3.7.* 5 | jupyter 6 | seaborn 7 | scikit-learn==1.3.* 8 | -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | name: evaluation-tutorial 2 | 3 | dependencies: 4 | - python 5 | - scikit-learn 6 | - pandas 7 | - seaborn 8 | - jupyter 9 | - pip 10 | 11 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # EuroSciPy 2022 - Evaluating your ML models tutorial 2 | 3 | Follow the intro slides [here](https://github.com/ArturoAmorQ/euroscipy_2022_evaluation/blob/main/slides.pdf). 4 | 5 | ## Follow the tutorial online 6 | 7 | Launch an online notebook environment using [![Binder](https://mybinder.org/badge_logo.svg)](https://notebooks.gesis.org/binder/v2/gh/ArturoAmorQ/euroscipy_2022_evaluation/HEAD) 8 | 9 | - [1_evaluation_tutorial.ipynb](https://notebooks.gesis.org/binder/v2/gh/ArturoAmorQ/euroscipy_2022_evaluation/HEAD?filepath=/notebooks/1_evaluation_tutorial.ipynb) 10 | - [2_roc_pr_curves_tutorial.ipynb](https://notebooks.gesis.org/binder/v2/gh/ArturoAmorQ/euroscipy_2022_evaluation/HEAD?filepath=/notebooks/2_roc_pr_curves_tutorial.ipynb) 11 | - [3_uncertainty_in_metrics_tutorial.ipynb ](https://notebooks.gesis.org/binder/v2/gh/ArturoAmorQ/euroscipy_2022_evaluation/HEAD?filepath=/notebooks/3_uncertainty_in_metrics_tutorial.ipynb) 12 | 13 | You need an internet connection but you will not have to install any package 14 | locally. 15 | 16 | ## Running the tutorial locally 17 | 18 | ### Dependencies 19 | 20 | The tutorials will require the following packages: 21 | 22 | * python 23 | * jupyter 24 | * pandas 25 | * matplotlib 26 | * seaborn 27 | * scikit-learn >= 1.2.0 28 | 29 | ### Local install 30 | 31 | We provide both `requirements.txt` and `environment.yml` to install packages. 
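Either option installs everything needed for the tutorials. Once the install has finished (see the commands below), you can quickly check that the scikit-learn version is recent enough, since `metrics.class_likelihood_ratios` (used in the first tutorial) needs scikit-learn >= 1.2.0:

```
$ python -c "import sklearn; print(sklearn.__version__)"
```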
32 | 33 | You can install the packages using `pip`: 34 | 35 | ``` 36 | $ pip install -r requirements.txt 37 | ``` 38 | 39 | You can create an `evaluation-tutorial` conda environment executing: 40 | 41 | ``` 42 | $ conda env create -f environment.yml 43 | ``` 44 | 45 | and later activate the environment: 46 | 47 | ``` 48 | $ conda activate evaluation-tutorial 49 | ``` 50 | 51 | You might also only update your current environment using: 52 | 53 | ``` 54 | $ conda env update --prefix ./env --file environment.yml --prune 55 | ``` 56 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Creative Commons Legal Code 2 | 3 | CC0 1.0 Universal 4 | 5 | CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE 6 | LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN 7 | ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS 8 | INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES 9 | REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS 10 | PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM 11 | THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED 12 | HEREUNDER. 13 | 14 | Statement of Purpose 15 | 16 | The laws of most jurisdictions throughout the world automatically confer 17 | exclusive Copyright and Related Rights (defined below) upon the creator 18 | and subsequent owner(s) (each and all, an "owner") of an original work of 19 | authorship and/or a database (each, a "Work"). 20 | 21 | Certain owners wish to permanently relinquish those rights to a Work for 22 | the purpose of contributing to a commons of creative, cultural and 23 | scientific works ("Commons") that the public can reliably and without fear 24 | of later claims of infringement build upon, modify, incorporate in other 25 | works, reuse and redistribute as freely as possible in any form whatsoever 26 | and for any purposes, including without limitation commercial purposes. 27 | These owners may contribute to the Commons to promote the ideal of a free 28 | culture and the further production of creative, cultural and scientific 29 | works, or to gain reputation or greater distribution for their Work in 30 | part through the use and efforts of others. 31 | 32 | For these and/or other purposes and motivations, and without any 33 | expectation of additional consideration or compensation, the person 34 | associating CC0 with a Work (the "Affirmer"), to the extent that he or she 35 | is an owner of Copyright and Related Rights in the Work, voluntarily 36 | elects to apply CC0 to the Work and publicly distribute the Work under its 37 | terms, with knowledge of his or her Copyright and Related Rights in the 38 | Work and the meaning and intended legal effect of CC0 on those rights. 39 | 40 | 1. Copyright and Related Rights. A Work made available under CC0 may be 41 | protected by copyright and related or neighboring rights ("Copyright and 42 | Related Rights"). Copyright and Related Rights include, but are not 43 | limited to, the following: 44 | 45 | i. the right to reproduce, adapt, distribute, perform, display, 46 | communicate, and translate a Work; 47 | ii. moral rights retained by the original author(s) and/or performer(s); 48 | iii. publicity and privacy rights pertaining to a person's image or 49 | likeness depicted in a Work; 50 | iv. 
rights protecting against unfair competition in regards to a Work, 51 | subject to the limitations in paragraph 4(a), below; 52 | v. rights protecting the extraction, dissemination, use and reuse of data 53 | in a Work; 54 | vi. database rights (such as those arising under Directive 96/9/EC of the 55 | European Parliament and of the Council of 11 March 1996 on the legal 56 | protection of databases, and under any national implementation 57 | thereof, including any amended or successor version of such 58 | directive); and 59 | vii. other similar, equivalent or corresponding rights throughout the 60 | world based on applicable law or treaty, and any national 61 | implementations thereof. 62 | 63 | 2. Waiver. To the greatest extent permitted by, but not in contravention 64 | of, applicable law, Affirmer hereby overtly, fully, permanently, 65 | irrevocably and unconditionally waives, abandons, and surrenders all of 66 | Affirmer's Copyright and Related Rights and associated claims and causes 67 | of action, whether now known or unknown (including existing as well as 68 | future claims and causes of action), in the Work (i) in all territories 69 | worldwide, (ii) for the maximum duration provided by applicable law or 70 | treaty (including future time extensions), (iii) in any current or future 71 | medium and for any number of copies, and (iv) for any purpose whatsoever, 72 | including without limitation commercial, advertising or promotional 73 | purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each 74 | member of the public at large and to the detriment of Affirmer's heirs and 75 | successors, fully intending that such Waiver shall not be subject to 76 | revocation, rescission, cancellation, termination, or any other legal or 77 | equitable action to disrupt the quiet enjoyment of the Work by the public 78 | as contemplated by Affirmer's express Statement of Purpose. 79 | 80 | 3. Public License Fallback. Should any part of the Waiver for any reason 81 | be judged legally invalid or ineffective under applicable law, then the 82 | Waiver shall be preserved to the maximum extent permitted taking into 83 | account Affirmer's express Statement of Purpose. In addition, to the 84 | extent the Waiver is so judged Affirmer hereby grants to each affected 85 | person a royalty-free, non transferable, non sublicensable, non exclusive, 86 | irrevocable and unconditional license to exercise Affirmer's Copyright and 87 | Related Rights in the Work (i) in all territories worldwide, (ii) for the 88 | maximum duration provided by applicable law or treaty (including future 89 | time extensions), (iii) in any current or future medium and for any number 90 | of copies, and (iv) for any purpose whatsoever, including without 91 | limitation commercial, advertising or promotional purposes (the 92 | "License"). The License shall be deemed effective as of the date CC0 was 93 | applied by Affirmer to the Work. Should any part of the License for any 94 | reason be judged legally invalid or ineffective under applicable law, such 95 | partial invalidity or ineffectiveness shall not invalidate the remainder 96 | of the License, and in such case Affirmer hereby affirms that he or she 97 | will not (i) exercise any of his or her remaining Copyright and Related 98 | Rights in the Work or (ii) assert any associated claims and causes of 99 | action with respect to the Work, in either case contrary to Affirmer's 100 | express Statement of Purpose. 101 | 102 | 4. Limitations and Disclaimers. 103 | 104 | a. 
No trademark or patent rights held by Affirmer are waived, abandoned, 105 | surrendered, licensed or otherwise affected by this document. 106 | b. Affirmer offers the Work as-is and makes no representations or 107 | warranties of any kind concerning the Work, express, implied, 108 | statutory or otherwise, including without limitation warranties of 109 | title, merchantability, fitness for a particular purpose, non 110 | infringement, or the absence of latent or other defects, accuracy, or 111 | the presence or absence of errors, whether or not discoverable, all to 112 | the greatest extent permissible under applicable law. 113 | c. Affirmer disclaims responsibility for clearing rights of other persons 114 | that may apply to the Work or any use thereof, including without 115 | limitation any person's Copyright and Related Rights in the Work. 116 | Further, Affirmer disclaims responsibility for obtaining any necessary 117 | consents, permissions or other rights required for any use of the 118 | Work. 119 | d. Affirmer understands and acknowledges that Creative Commons is not a 120 | party to this document and has no duty or obligation with respect to 121 | this CC0 or use of the Work. 122 | -------------------------------------------------------------------------------- /python_scripts/3_uncertainty_in_metrics_tutorial.py: -------------------------------------------------------------------------------- 1 | # %% [markdown] 2 | # 3 | # Uncertainty in evaluation metrics for classification 4 | # ==================================================== 5 | # 6 | # Has it ever happened to you that one of your colleagues claims their model with 7 | # test score of 0.8001 is better than your model with test score of 0.7998? 8 | # Maybe they are not aware that model-evaluation procedures should gauge not 9 | # only the expected generalization performance, but also its variations. As 10 | # usual, let's build a toy dataset to illustrate this. 11 | 12 | # %% 13 | from sklearn.datasets import make_classification 14 | 15 | common_params = { 16 | "n_features": 2, 17 | "n_informative": 2, 18 | "n_redundant": 0, 19 | "n_classes": 2, # binary classification 20 | "random_state": 0, 21 | "weights": [0.55, 0.45], 22 | } 23 | X, y = make_classification(**common_params, n_samples=400) 24 | 25 | prevalence = y.mean() 26 | print(f"Percentage of samples in the positive class: {100*prevalence:.2f}%") 27 | 28 | # %% [markdown] 29 | # We are already familiar with using a train-test split to estimate the 30 | # generalization performance of a model. By default, `train_test_split` uses 31 | # `shuffle=True`. Let's see what happens if we set a particular seed. 32 | 33 | # %% 34 | from sklearn.model_selection import train_test_split 35 | from sklearn.linear_model import LogisticRegression 36 | 37 | X_train, X_test, y_train, y_test = train_test_split( 38 | X, y, test_size=0.2, random_state=1 39 | ) 40 | classifier = LogisticRegression().fit(X_train, y_train) 41 | classifier.score(X_test, y_test) 42 | 43 | # %% [markdown] 44 | # Now let's see what happens when shuffling with a different seed: 45 | 46 | # %% 47 | X_train, X_test, y_train, y_test = train_test_split( 48 | X, y, test_size=0.2, random_state=42 49 | ) 50 | classifier = LogisticRegression().fit(X_train, y_train) 51 | classifier.score(X_test, y_test) 52 | 53 | # %% [markdown] 54 | # It seems that 42 is indeed the Answer to the Ultimate Question of Life, the 55 | # Universe, and Everything!
Or maybe the score of a model depends on the split: 56 | # - the train-test proportion; 57 | # - the representativeness of the elements in each set. 58 | # 59 | # A more systematic way of evaluating the generalization performance of a model 60 | # is through cross-validation, which consists of repeating the split such that 61 | # the training and testing sets are different for each evaluation. 62 | 63 | # %% 64 | from sklearn.model_selection import cross_val_score, ShuffleSplit 65 | 66 | classifier = LogisticRegression() 67 | cv = ShuffleSplit(n_splits=250, test_size=0.2) 68 | 69 | scores = cross_val_score(classifier, X, y, cv=cv) 70 | print( 71 | "The mean cross-validation accuracy is: " 72 | f"{scores.mean():.2f} ± {scores.std():.2f}." 73 | ) 74 | 75 | # %% [markdown] 76 | # Scores have a variability. A sample probabilistic model gives the distribution 77 | # of observed error: if the classification rate is p, the observed distribution 78 | # of correct classifications on a set of size follows a binomial distribution. 79 | # Let's create a function to easily visualize this: 80 | 81 | # %% 82 | import matplotlib.pyplot as plt 83 | import numpy as np 84 | import seaborn as sns 85 | from scipy import stats 86 | 87 | 88 | def plot_error_distrib(classifier, X, y, cv=5): 89 | 90 | n = len(X) 91 | 92 | scores = cross_val_score(classifier, X, y, cv=cv) 93 | distrib = stats.binom(n=n, p=scores.mean()) 94 | 95 | plt.plot( 96 | np.linspace(0, 1, n), 97 | n * distrib.pmf(np.arange(0, n)), 98 | linewidth=2, 99 | color="black", 100 | label="binomial distribution", 101 | ) 102 | sns.histplot(scores, stat="density", label="empirical distribution") 103 | plt.xlim(0, 1) 104 | plt.title("Accuracy: " f"{scores.mean():.2f} ± {scores.std():.2f}.") 105 | plt.legend() 106 | plt.show() 107 | 108 | 109 | plot_error_distrib(classifier, X, y, cv=cv) 110 | 111 | # %% [markdown] 112 | # The empirical distribution is still broader than the theoretical one. This can 113 | # be explained by the fact that as we are retraining the model on each fold, it 114 | # actually fluctuates due the sampling noise in the training data, while the 115 | # model above only accounts for sampling noise in the test data. 116 | # 117 | # The situation does get better with more data: 118 | 119 | # %% 120 | X, y = make_classification(**common_params, n_samples=1_000) 121 | plot_error_distrib(classifier, X, y, cv=cv) 122 | 123 | # %% [markdown] 124 | # Importantly, the standard error of the mean (SEM) across folds is not a good 125 | # measure of this error, as the different data folds are not independent. For 126 | # instance, doing many random splits reduces the variance arbitrarily, but does 127 | # not provide actually new data points. 128 | 129 | # %% 130 | cv = ShuffleSplit(n_splits=10, test_size=0.2) 131 | X, y = make_classification(**common_params, n_samples=400) 132 | scores = cross_val_score(classifier, X, y, cv=cv) 133 | 134 | print( 135 | f"Mean accuracy ± SEM with n_split={cv.get_n_splits()}: " 136 | f"{scores.mean():.3f} ± {stats.sem(scores):.3f}." 137 | ) 138 | 139 | cv = ShuffleSplit(n_splits=100, test_size=0.2) 140 | scores = cross_val_score(classifier, X, y, cv=cv) 141 | 142 | print( 143 | f"Mean accuracy ± SEM with n_split={cv.get_n_splits()}: " 144 | f"{scores.mean():.3f} ± {stats.sem(scores):.3f}." 
145 | ) 146 | 147 | cv = ShuffleSplit(n_splits=500, test_size=0.2) 148 | scores = cross_val_score(classifier, X, y, cv=cv) 149 | 150 | print( 151 | f"Mean accuracy ± SEM with n_split={cv.get_n_splits()}: " 152 | f"{scores.mean():.3f} ± {stats.sem(scores):.3f}." 153 | ) 154 | 155 | # %% [markdown] 156 | # Indeed, the SEM goes to zero as 1/sqrt(`n_splits`). Wrapping up: 157 | # - the more data the better; 158 | # - the more splits, the better the binomial distribution describes the 159 | # variance, but keep in mind that more splits consume more computing 160 | # power; 161 | # - use std instead of SEM to present your results. 162 | # 163 | # Now that we have an intuition about the variability of an evaluation metric, we 164 | # are ready to apply it to our original Diabetes problem: 165 | 166 | # %% 167 | from sklearn.tree import DecisionTreeClassifier 168 | from sklearn.inspection import DecisionBoundaryDisplay 169 | 170 | diabetes_params = { 171 | "n_samples": 10_000, 172 | "n_features": 2, 173 | "n_informative": 2, 174 | "n_redundant": 0, 175 | "n_classes": 2, # binary classification 176 | "shift": [4, 6], 177 | "scale": [10, 25], 178 | "random_state": 0, 179 | } 180 | X, y = make_classification(**diabetes_params, weights=[0.55, 0.45]) 181 | 182 | X_train, X_plot, y_train, y_plot = train_test_split( 183 | X, y, stratify=y, test_size=0.1, random_state=0 184 | ) 185 | 186 | estimator = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train) 187 | 188 | fig, ax = plt.subplots() 189 | disp = DecisionBoundaryDisplay.from_estimator( 190 | estimator, 191 | X_plot, 192 | response_method="predict", 193 | alpha=0.5, 194 | xlabel="age (years)", 195 | ylabel="blood sugar level (mg/dL)", 196 | ax=ax, 197 | ) 198 | scatter = disp.ax_.scatter(X_plot[:, 0], X_plot[:, 1], c=y_plot, edgecolor="k") 199 | disp.ax_.set_title(f"Diabetes test with prevalence = {y.mean():.2f}") 200 | _ = disp.ax_.legend(*scatter.legend_elements()) 201 | 202 | # %% [markdown] 203 | # Notice that the decision boundary changed with respect to the first notebook 204 | # we explored. Let's make a remark: models depend on the prevalence of 205 | # the data they were trained on. Therefore, all metrics (including likelihood ratios) 206 | # depend on prevalence as much as the model depends on it. The difference is that 207 | # likelihood ratios extrapolate through populations of different prevalence for 208 | # a **fixed model**. 209 | # 210 | # Let's compute all the metrics and assess their variability in this case: 211 | 212 | # %% 213 | from collections import defaultdict 214 | import pandas as pd 215 | 216 | cv = ShuffleSplit(n_splits=50, test_size=0.2) 217 | 218 | evaluation = defaultdict(list) 219 | scoring_strategies = [ 220 | "accuracy", 221 | "balanced_accuracy", 222 | "recall", 223 | "precision", 224 | "matthews_corrcoef", 225 | # "positive_likelihood_ratio", 226 | # "neg_negative_likelihood_ratio", 227 | ] 228 | 229 | for score_name in scoring_strategies: 230 | scores = cross_val_score(estimator, X, y, cv=cv, scoring=score_name) 231 | evaluation[score_name] = scores 232 | 233 | evaluation = pd.DataFrame(evaluation).aggregate(["mean", "std"]).T 234 | evaluation["mean"].plot.barh(xerr=evaluation["std"]).set_xlabel("score") 235 | plt.show() 236 | 237 | # %% [markdown] 238 | # Notice that `"positive_likelihood_ratio"` is not bounded from above and 239 | # therefore it can't be directly compared with the other metrics on a single 240 | # plot.
Similarly, the `"neg_negative_likelihood_ratio"` has a reversed sign (is 241 | # negative) to follow the scikit-learn convention for metrics for which a lower 242 | # score is better. 243 | # 244 | # In this case we trained the model on nearly balanced classes. Try changing the 245 | # prevalence and see how the variance of the metrics depend on data imbalance. 246 | -------------------------------------------------------------------------------- /python_scripts/2_roc_pr_curves_tutorial.py: -------------------------------------------------------------------------------- 1 | # %% [markdown] 2 | # Evaluation of non-thresholded prediction 3 | # ======================================== 4 | # 5 | # All statistics that we presented up to now rely on `.predict` which outputs 6 | # the most likely label. We haven’t made use of the probability associated with 7 | # this prediction, which gives the confidence of the classifier in this 8 | # prediction. By default, the prediction of a classifier corresponds to a 9 | # threshold of 0.5 probability in a binary classification problem. Let's build a 10 | # toy dataset to illustrate this. 11 | 12 | # %% 13 | from sklearn.datasets import make_classification 14 | from sklearn.model_selection import train_test_split 15 | 16 | common_params = { 17 | "n_samples": 10_000, 18 | "n_features": 2, 19 | "n_informative": 2, 20 | "n_redundant": 0, 21 | "n_classes": 2, # binary classification 22 | "class_sep": 0.5, 23 | "random_state": 0, 24 | } 25 | X, y = make_classification(**common_params, weights=[0.6, 0.4]) 26 | 27 | X_train, X_test, y_train, y_test = train_test_split( 28 | X, y, stratify=y, random_state=0, test_size=0.02 29 | ) 30 | 31 | # %% [markdown] 32 | # We can quickly check the predicted probabilities to belong to either class 33 | # using a `LogisticRegression`. To ease the visualization we select a subset 34 | # of `n_plot` samples. 35 | 36 | # %% 37 | import pandas as pd 38 | from sklearn.linear_model import LogisticRegression 39 | 40 | n_plot = 10 41 | classifier = LogisticRegression() 42 | classifier.fit(X_train, y_train) 43 | 44 | proba_predicted = pd.DataFrame( 45 | classifier.predict_proba(X_test), columns=classifier.classes_ 46 | ).round(decimals=2) 47 | proba_predicted[:n_plot] 48 | 49 | # %% [markdown] 50 | # Probabilites sum to 1. In the binary case it suffices to retain the 51 | # probability of belonging to the positive class, here shown as an annotation in 52 | # the `DecisionBoundaryDisplay`. Notice that setting 53 | # `response_method="predict_proba"` shows the level curves of the 2D sigmoid 54 | # (logistic curve). 
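# %% [markdown]
# To see this concretely, we can apply the 0.5 threshold ourselves to the
# probability of the positive class and compare the result with the labels
# returned by `.predict`; for this classifier both ways of predicting should
# agree.

# %%
# Threshold the positive-class probability at 0.5 by hand
manual_pred = (classifier.predict_proba(X_test)[:, 1] > 0.5).astype(int)
print(
    "Thresholding `predict_proba` at 0.5 matches `.predict`: "
    f"{(manual_pred == classifier.predict(X_test)).all()}"
)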
55 | 56 | # %% 57 | import matplotlib.pyplot as plt 58 | from matplotlib.colors import ListedColormap 59 | from sklearn.inspection import DecisionBoundaryDisplay 60 | 61 | fig, ax = plt.subplots() 62 | disp = DecisionBoundaryDisplay.from_estimator( 63 | classifier, 64 | X_test, 65 | response_method="predict_proba", 66 | cmap="RdBu", 67 | alpha=0.5, 68 | vmin=0, 69 | vmax=1, 70 | ax=ax, 71 | ) 72 | DecisionBoundaryDisplay.from_estimator( 73 | classifier, 74 | X_test, 75 | response_method="predict_proba", 76 | plot_method="contour", 77 | alpha=0.2, 78 | levels=[0.5], # 0.5 probability contour line 79 | linestyles="--", 80 | linewidths=2, 81 | ax=ax, 82 | ) 83 | scatter = disp.ax_.scatter( 84 | X_test[:n_plot, 0], X_test[:n_plot, 1], c=y_test[:n_plot], 85 | cmap=ListedColormap(["tab:red", "tab:blue"]), 86 | edgecolor="k" 87 | ) 88 | disp.ax_.legend(*scatter.legend_elements(), title="True class", loc="lower right") 89 | for i, proba in enumerate(proba_predicted[:n_plot][1]): 90 | disp.ax_.annotate(proba, (X_test[i, 0], X_test[i, 1]), fontsize="large") 91 | plt.xlim(-2.0, 2.0) 92 | plt.ylim(-4.0, 4.0) 93 | plt.title( 94 | "Probability of belonging to the positive class\n(default decision threshold)" 95 | ) 96 | plt.show() 97 | 98 | # %% [markdown] 99 | # Evaluation of different probability thresholds 100 | # ============================================== 101 | # 102 | # The default decision threshold (0.5) might not be the best threshold that 103 | # leads to optimal generalization performance of our classifier. One can vary 104 | # the decision threshold (and therefore the underlying prediction) and compute 105 | # some evaluation metrics as presented earlier. 106 | # 107 | # Receiver Operating Characteristic curve 108 | # --------------------------------------- 109 | # 110 | # One could be interested in the compromise between accurately discriminating 111 | # both the positive class and the negative classes. The statistics used for this 112 | # are sensitivity and specificity, which measure the proportion of correctly 113 | # classified samples per class. 114 | # 115 | # Sensitivity and specificity are generally plotted as a curve called the 116 | # Receiver Operating Characteristic (ROC) curve. Each point on the graph 117 | # corresponds to a specific decision threshold. Below is such a curve: 118 | 119 | # %% 120 | from sklearn.metrics import RocCurveDisplay 121 | from sklearn.dummy import DummyClassifier 122 | 123 | dummy_classifier = DummyClassifier(strategy="most_frequent") 124 | dummy_classifier.fit(X_train, y_train) 125 | 126 | disp = RocCurveDisplay.from_estimator( 127 | classifier, X_test, y_test, name="LogisticRegression", color="tab:green" 128 | ) 129 | disp = RocCurveDisplay.from_estimator( 130 | dummy_classifier, 131 | X_test, 132 | y_test, 133 | name="chance level", 134 | color="tab:red", 135 | ax=disp.ax_, 136 | ) 137 | plt.xlim(0, 1) 138 | plt.ylim(0, 1) 139 | plt.legend(loc="lower right") 140 | plt.title("ROC curve for LogisticRegression") 141 | plt.show() 142 | 143 | # %% [markdown] 144 | # ROC curves typically feature true positive rate on the Y axis, and false 145 | # positive rate on the X axis. This means that the top left corner of the plot 146 | # is the "ideal" point - a false positive rate of zero, and a true positive rate 147 | # of one. This is not very realistic, but it does mean that a larger area under 148 | # the curve (AUC) is usually better. 
149 | # 150 | # We can compute the area under the ROC curve (using `roc_auc_score`) to 151 | # summarize the generalization performance of a model with a single number, or 152 | # to compare several models across thresholds. 153 | 154 | # %% 155 | from sklearn.ensemble import RandomForestClassifier 156 | from sklearn.ensemble import HistGradientBoostingClassifier 157 | 158 | 159 | classifiers = { 160 | "Hist Gradient Boosting": HistGradientBoostingClassifier(), 161 | "Random Forest": RandomForestClassifier(n_jobs=-1, random_state=1), 162 | "Logistic Regression": LogisticRegression(), 163 | "Chance": DummyClassifier(strategy="most_frequent"), 164 | } 165 | 166 | fig = plt.figure() 167 | ax = plt.axes([0.08, 0.15, 0.78, 0.78]) 168 | 169 | for name, clf in classifiers.items(): 170 | clf.fit(X_train, y_train) 171 | disp = RocCurveDisplay.from_estimator(clf, X_test, y_test, name=name, ax=ax) 172 | plt.xlabel("False positive rate") 173 | plt.ylabel("True positive rate ") 174 | plt.text( 175 | 0.098, 176 | 0.575, 177 | "= sensitivity or recall", 178 | transform=fig.transFigure, 179 | size=7, 180 | rotation="vertical", 181 | ) 182 | plt.xlim(0, 1) 183 | plt.ylim(0, 1) 184 | plt.legend(loc="lower right") 185 | plt.title("ROC curves for several models") 186 | plt.show() 187 | 188 | # %% [markdown] 189 | # It is important to notice that the lower bound of the ROC-AUC is 0.5, 190 | # corresponding to chance level. Indeed, we show the generalization performance 191 | # of a dummy classifier (the red line) to show that even the worst 192 | # generalization performance obtained will be above this line. 193 | 194 | # %% [markdown] 195 | # Precision-Recall curves 196 | # ----------------------- 197 | # 198 | # As mentioned above, maximizing the ROC curve helps finding a compromise 199 | # between accurately discriminating both the positive class and the negative 200 | # classes. If the interest is to focus mainly on the positive class, the 201 | # precision and recall metrics are more appropriated. Similarly to the ROC 202 | # curve, each point in the Precision-Recall curve corresponds to a level of 203 | # probability which we used as a decision threshold. 204 | 205 | # %% 206 | from sklearn.metrics import PrecisionRecallDisplay 207 | 208 | fig = plt.figure() 209 | ax = plt.axes([0.08, 0.15, 0.78, 0.78]) 210 | 211 | for name, clf in classifiers.items(): 212 | clf.fit(X_train, y_train) 213 | disp = PrecisionRecallDisplay.from_estimator(clf, X_test, y_test, name=name, ax=ax) 214 | plt.xlabel("Recall ") 215 | plt.text(0.45, 0.067, "= TPR or sensitivity", transform=fig.transFigure, size=7) 216 | plt.ylabel("Precision ") 217 | plt.text(0.1, 0.6, "= PPV", transform=fig.transFigure, size=7, rotation="vertical") 218 | plt.xlim(0, 1) 219 | plt.ylim(0, 1) 220 | plt.legend(loc="lower right") 221 | plt.title("Precision-recall curve for several models") 222 | plt.show() 223 | 224 | # %% [markdown] 225 | # A classifier with no false positives would have a precision of 1 for all 226 | # recall values. In like manner to the ROC-AUC, the area under the curve can be 227 | # used to characterize the curve in a single number and is named average 228 | # precision (AP). With an ideal classifier, the average precision would be 1. 229 | # 230 | # In this case, notice that the AP of a `DummyClassifier`, used as baseline to 231 | # define the chance level, coincides with the prevalence of the positive class. 232 | # This is analogous to the downside of the accuracy score as shown in the first 233 | # notebook. 
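# %% [markdown]
# A quick way to convince ourselves is to compute the average precision of the
# dummy classifier directly (it is stored under the "Chance" key above) and
# compare it with the prevalence printed in the next cell; the two numbers
# should essentially coincide.

# %%
from sklearn.metrics import average_precision_score

# Constant scores from the dummy classifier: its AP reduces to the prevalence
chance_scores = classifiers["Chance"].predict_proba(X_test)[:, 1]
print(f"AP of the dummy classifier: {average_precision_score(y_test, chance_scores):.3f}")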
234 | 235 | # %% 236 | prevalence = y.mean() 237 | print(f"Prevalence of the positive class: {prevalence:.3f}") 238 | 239 | # %% [markdown] 240 | # Let's see the effect of adding imbalance between classes in our set of models: 241 | 242 | # %% 243 | X, y = make_classification(**common_params, weights=[0.83, 0.17]) 244 | 245 | X_train, X_test, y_train, y_test = train_test_split( 246 | X, y, stratify=y, random_state=0, test_size=0.02 247 | ) 248 | 249 | fig = plt.figure() 250 | ax = plt.axes([0.08, 0.15, 0.78, 0.78]) 251 | 252 | for name, clf in classifiers.items(): 253 | clf.fit(X_train, y_train) 254 | disp = PrecisionRecallDisplay.from_estimator(clf, X_test, y_test, name=name, ax=ax) 255 | plt.xlabel("Recall ") 256 | plt.text(0.45, 0.067, "= TPR or sensitivity", transform=fig.transFigure, size=7) 257 | plt.ylabel("Precision ") 258 | plt.text(0.1, 0.6, "= PPV", transform=fig.transFigure, size=7, rotation="vertical") 259 | plt.xlim(0, 1) 260 | plt.ylim(0, 1) 261 | plt.legend(loc="upper right") 262 | plt.title("Precision-recall curve for several models\nw. imbalanced data") 263 | plt.show() 264 | 265 | # %% [markdown] 266 | # The AP of all models decreased, including the baseline defined by the dummy 267 | # classifier. Indeed, we confirm that AP does not correct for prevalence: it decreases as the positive class becomes rarer. 268 | # 269 | # Conclusions 270 | # =========== 271 | # 272 | # - Consider the prevalence in your target population. It may be that the 273 | # prevalence in your testing sample is not representative of that of the 274 | # target population. In that case, aside from LR+ and LR-, performance metrics 275 | # computed from the testing sample will not be representative of those in the 276 | # target population. 277 | # 278 | # - Never trust a single summary metric (accuracy, balanced accuracy, ROC-AUC, 279 | # etc.), but rather look at all the individual metrics. Understand the 280 | # implications of your choices to find the right tradeoff. 281 | -------------------------------------------------------------------------------- /python_scripts/1_evaluation_tutorial.py: -------------------------------------------------------------------------------- 1 | # %% [markdown] 2 | # 3 | # Accounting for imbalance in evaluation metrics for classification 4 | # ================================================================= 5 | # 6 | # Suppose we have a population of subjects with features `X` that can hopefully 7 | # serve as indicators of a binary class `y` (known ground truth). Additionally, 8 | # suppose the class prevalence (the number of samples in the positive class 9 | # divided by the total number of samples) is very low. 10 | # 11 | # To fix ideas, let's use a medical analogy and think about diabetes. We only 12 | # use two features (age and blood sugar level) to keep the example as simple as 13 | # possible. We use `make_classification` to simulate the distribution of the 14 | # disease and to ensure **the data-generating process is always the same**. We 15 | # set `weights=[0.99, 0.01]` to obtain a prevalence of around 1% which, 16 | # according to [The World 17 | # Bank](https://data.worldbank.org/indicator/SH.STA.DIAB.ZS?most_recent_value_desc=false), 18 | # is the case for the country with the lowest diabetes prevalence in 2022 19 | # (Benin). 20 | # 21 | # In practice, the ideas presented here can be applied in settings where the 22 | # data available to learn and evaluate a classifier has nearly balanced classes, 23 | # such as a case-control study, while the target application, i.e.
the general 24 | # population, has very low prevalence. 25 | 26 | # %% 27 | from sklearn.datasets import make_classification 28 | 29 | common_params = { 30 | "n_samples": 10_000, 31 | "n_features": 2, 32 | "n_informative": 2, 33 | "n_redundant": 0, 34 | "n_classes": 2, # binary classification 35 | "shift": [4, 6], 36 | "scale": [10, 25], 37 | "random_state": 0, 38 | } 39 | X, y = make_classification(**common_params, weights=[0.99, 0.01]) 40 | prevalence = y.mean() 41 | print(f"Percentage of people carrying the disease: {100*prevalence:.2f}%") 42 | 43 | # %% [markdown] 44 | # A simple model is trained to diagnose if a person is likely to have diabetes. 45 | # To estimate the generalization performance of such a model, we do a train-test 46 | # split. 47 | 48 | # %% 49 | from sklearn.model_selection import train_test_split 50 | from sklearn.tree import DecisionTreeClassifier 51 | 52 | X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0) 53 | 54 | estimator = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train) 55 | 56 | # %% [markdown] 57 | # We now show the decision boundary learned by the estimator. Notice that we 58 | # only plot a stratified subset of the original data. 59 | 60 | # %% 61 | import matplotlib.pyplot as plt 62 | from sklearn.inspection import DecisionBoundaryDisplay 63 | 64 | fig, ax = plt.subplots() 65 | disp = DecisionBoundaryDisplay.from_estimator( 66 | estimator, 67 | X_test, 68 | response_method="predict", 69 | alpha=0.5, 70 | xlabel="age (years)", 71 | ylabel="blood sugar level (mg/dL)", 72 | ax=ax, 73 | ) 74 | scatter = disp.ax_.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolor="k") 75 | disp.ax_.set_title(f"Hypothetical diabetes test with prevalence = {y.mean():.2f}") 76 | _ = disp.ax_.legend(*scatter.legend_elements()) 77 | 78 | # %% [markdown] 79 | # The most widely used summary metric is arguably accuracy. Its main advantage 80 | # is a natural interpretation: the proportion of correctly classified samples. 81 | 82 | # %% 83 | from sklearn import metrics 84 | 85 | y_pred = estimator.predict(X_test) 86 | accuracy = metrics.accuracy_score(y_test, y_pred) 87 | print(f"Accuracy on the test set: {accuracy:.3f}") 88 | 89 | # %% [markdown] 90 | # However, it is misleading when the data is imbalanced. Our model performs 91 | # as well as a trivial majority classifier, as the following cells show.
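# %% [markdown]
# Before comparing with such a baseline, a glance at the confusion matrix makes
# the problem concrete: its entries count true negatives, false positives,
# false negatives and true positives, so we can see directly how many of the
# rare positive samples are detected.

# %%
# Rows are true classes, columns are predicted classes
print(metrics.confusion_matrix(y_test, y_pred))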
92 | 93 | # %% 94 | from sklearn.dummy import DummyClassifier 95 | 96 | dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train) 97 | y_dummy = dummy.predict(X_test) 98 | accuracy_dummy = metrics.accuracy_score(y_test, y_dummy) 99 | print(f"Accuracy if Diabetes did not exist: {accuracy_dummy:.3f}") 100 | 101 | # %% [markdown] 102 | # Some of the other metrics are better at describing the flaws of our model: 103 | 104 | # %% 105 | sensitivity = metrics.recall_score(y_test, y_pred) 106 | specificity = metrics.recall_score(y_test, y_pred, pos_label=0) 107 | balanced_acc = metrics.balanced_accuracy_score(y_test, y_pred) 108 | matthews = metrics.matthews_corrcoef(y_test, y_pred) 109 | PPV = metrics.precision_score(y_test, y_pred) 110 | 111 | print(f"Sensitivity on the test set: {sensitivity:.2f}") 112 | print(f"Specificity on the test set: {specificity:.2f}") 113 | print(f"Balanced accuracy on the test set: {balanced_acc:.2f}") 114 | print(f"Matthews correlation coeff on the test set: {matthews:.2f}") 115 | print() 116 | print(f"Probability to have the disease given a positive test: {100*PPV:.2f}%") 117 | 118 | # %% [markdown] 119 | # Our classifier is not informative enough on the general population. The PPV 120 | # and NPV give the information of interest: P(D+ | T+) and P(D− | T−). However, 121 | # they are not intrinsic to the medical test (in other words the trained ML 122 | # model) but also depend on the prevalence and thus on the target population. 123 | # 124 | # The class likelihood ratios (LR±) depend only on sensitivity and specificity 125 | # of the classifier, and not on the prevalence of the study population. For the 126 | # moment it suffices to recall that LR± is defined as 127 | # 128 | # LR± = P(T± | D+) / P(T± | D−) 129 | 130 | # %% 131 | pos_LR, neg_LR = metrics.class_likelihood_ratios(y_test, y_pred) 132 | print(f"LR+ on the test set: {pos_LR:.3f}") # higher is better 133 | print(f"LR- on the test set: {neg_LR:.3f}") # lower is better 134 | 135 | # %% [markdown] 136 | # 137 | # **Caution** 138 | # 139 | # Please notice that if you want to use `metrics.class_likelihood_ratios`, 140 | # you need scikit-learn >= 1.2.0. 141 | #
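# %% [markdown]
# As a consistency check, both ratios can be recomputed by hand from the
# sensitivity and specificity obtained above, since LR+ = sensitivity / (1 -
# specificity) and LR- = (1 - sensitivity) / specificity (assuming the
# classifier produces at least some false positives, so that 1 - specificity is
# not zero). Up to floating-point rounding, these values match the output of
# `metrics.class_likelihood_ratios`.

# %%
print(f"LR+ from sensitivity and specificity: {sensitivity / (1 - specificity):.3f}")
print(f"LR- from sensitivity and specificity: {(1 - sensitivity) / specificity:.3f}")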
142 | # 143 | # Extrapolating between populations 144 | # --------------------------------- 145 | # 146 | # The prevalence can be variable (for instance the prevalence of an infectious 147 | # disease will be variable across time) and a given classifier may be intended 148 | # to be applied in various situations. 149 | # 150 | # According to the World Bank, the diabetes prevalence in the French Polynesia 151 | # in 2022 is above 25%. Let's now evaluate our previously trained model on a 152 | # **different population** with such prevalence and **the same data-generating 153 | # process**. 154 | 155 | X, y = make_classification(**common_params, weights=[0.75, 0.25]) 156 | X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0) 157 | 158 | fig, ax = plt.subplots() 159 | disp = DecisionBoundaryDisplay.from_estimator( 160 | estimator, 161 | X_test, 162 | response_method="predict", 163 | alpha=0.5, 164 | xlabel="age (years)", 165 | ylabel="blood sugar level (mg/dL)", 166 | ax=ax, 167 | ) 168 | scatter = disp.ax_.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolor="k") 169 | disp.ax_.set_title(f"Hypothetical diabetes test with prevalence = {y.mean():.2f}") 170 | _ = disp.ax_.legend(*scatter.legend_elements()) 171 | 172 | # %% 173 | # We then compute the same metrics using a test set with the new 174 | # prevalence: 175 | 176 | y_pred = estimator.predict(X_test) 177 | prevalence = y.mean() 178 | accuracy = metrics.accuracy_score(y_test, y_pred) 179 | sensitivity = metrics.recall_score(y_test, y_pred) 180 | specificity = metrics.recall_score(y_test, y_pred, pos_label=0) 181 | balanced_acc = metrics.balanced_accuracy_score(y_test, y_pred) 182 | matthews = metrics.matthews_corrcoef(y_test, y_pred) 183 | PPV = metrics.precision_score(y_test, y_pred) 184 | 185 | print(f"Accuracy on the test set: {accuracy:.2f}") 186 | print(f"Sensitivity on the test set: {sensitivity:.2f}") 187 | print(f"Specificity on the test set: {specificity:.2f}") 188 | print(f"Balanced accuracy on the test set: {balanced_acc:.2f}") 189 | print(f"Matthews correlation coeff on the test set: {matthews:.2f}") 190 | print() 191 | print(f"Probability to have the disease given a positive test: {100*PPV:.2f}%") 192 | 193 | # %% [markdown] 194 | # The same model seems to perform better on this new dataset. Notice in 195 | # particular that the probability to have the disease given a positive test 196 | # increased. The same blood sugar test is less predictive in Benin than in 197 | # the French Polynesia! 198 | # 199 | # If we really want to score the test and not the dataset, we need a metric that 200 | # does not depend on the prevalence of the study population. 201 | 202 | # %% 203 | pos_LR, neg_LR = metrics.class_likelihood_ratios(y_test, y_pred) 204 | 205 | print(f"LR+ on the test set: {pos_LR:.3f}") 206 | print(f"LR- on the test set: {neg_LR:.3f}") 207 | 208 | # %% [markdown] 209 | # Despite some variations due to residual dataset dependence, the class 210 | # likelihood ratios are mathematically invariant with respect to prevalence. See 211 | # [this example from the User 212 | # Guide](https://scikit-learn.org/dev/auto_examples/model_selection/plot_likelihood_ratios.html#invariance-with-respect-to-prevalence) 213 | # for a demo regarding such property. 214 | # 215 | # Pre-test vs. 
post-test odds 216 | # --------------------------- 217 | # 218 | # Both class likelihood ratios are interpretable in terms of odds: 219 | # 220 | # post-test odds = Likelihood ratio * pre-test odds 221 | # 222 | # The interpretation of LR+ in this case reads: 223 | 224 | # %% 225 | print("The post-test odds that the condition is truly present given a positive " 226 | f"test result are: {pos_LR:.3f} times larger than the pre-test odds.") 227 | 228 | # %% [markdown] 229 | # We found that this diagnostic tool is useful: the post-test odds are larger than the 230 | # pre-test odds. We now choose the pre-test probability to be the prevalence of 231 | # the disease in the held-out testing set. 232 | 233 | # %% 234 | pretest_odds = y_test.mean() / (1 - y_test.mean()) 235 | posttest_odds = pretest_odds * pos_LR 236 | 237 | print(f"Observed pre-test odds: {pretest_odds:.3f}") 238 | print(f"Estimated post-test odds using LR+: {posttest_odds:.3f}") 239 | 240 | # %% [markdown] 241 | # The post-test probability is the probability that an individual truly has 242 | # the condition given a positive test result, i.e. the number of true positives 243 | # divided by the total number of positive test results. In real-life applications this is 244 | # unknown. 245 | 246 | # %% 247 | posttest_prob = posttest_odds / (1 + posttest_odds) 248 | 249 | print(f"Estimated post-test probability using LR+: {posttest_prob:.3f}") 250 | 251 | # %% [markdown] 252 | # We can verify that if we had had access to the true labels, we would have 253 | # obtained the same probabilities: 254 | 255 | # %% 256 | posttest_prob = y_test[y_pred == 1].mean() 257 | 258 | print(f"Observed post-test probability: {posttest_prob:.3f}") 259 | 260 | # %% [markdown] 261 | # Conclusion: if a salesperson from Benin were to sell the model to French Polynesia 262 | # by showing them the 59.84% probability of having the disease given a positive test, 263 | # French Polynesia would never have bought it, even though the model would be quite 264 | # predictive for their own population. The right thing to report is the LR±. 265 | # 266 | # Can you imagine what would happen if the model were trained on nearly balanced classes 267 | # and then extrapolated to other scenarios? 268 | -------------------------------------------------------------------------------- /notebooks/3_uncertainty_in_metrics_tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "604de853", 6 | "metadata": {}, 7 | "source": [ 8 | "\n", 9 | "Uncertainty in evaluation metrics for classification\n", 10 | "====================================================\n", 11 | "\n", 12 | "Has it ever happened to you that one of your colleagues claims their model with\n", 13 | "test score of 0.8001 is better than your model with test score of 0.7998?\n", 14 | "Maybe they are not aware that model-evaluation procedures should gauge not\n", 15 | "only the expected generalization performance, but also its variations. As\n", 16 | "usual, let's build a toy dataset to illustrate this."
17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "id": "096f78da", 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "from sklearn.datasets import make_classification\n", 27 | "\n", 28 | "common_params = {\n", 29 | " \"n_features\": 2,\n", 30 | " \"n_informative\": 2,\n", 31 | " \"n_redundant\": 0,\n", 32 | " \"n_classes\": 2, # binary classification\n", 33 | " \"random_state\": 0,\n", 34 | " \"weights\": [0.55, 0.45],\n", 35 | "}\n", 36 | "X, y = make_classification(**common_params, n_samples=400)\n", 37 | "\n", 38 | "prevalence = y.mean()\n", 39 | "print(f\"Percentage of samples in the positive class: {100*prevalence:.2f}%\")" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "id": "98f132b7", 45 | "metadata": {}, 46 | "source": [ 47 | "We are already familiar with using a a train-test split to estimate the\n", 48 | "generalization performance of a model. By default the `train_test_split` uses\n", 49 | "`shuffle=True`. Let's see what happens if we set a particular seed." 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": null, 55 | "id": "615abf52", 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "from sklearn.model_selection import train_test_split\n", 60 | "from sklearn.linear_model import LogisticRegression\n", 61 | "\n", 62 | "X_train, X_test, y_train, y_test = train_test_split(\n", 63 | " X, y, test_size=0.2, random_state=1\n", 64 | ")\n", 65 | "classifier = LogisticRegression().fit(X_train, y_train)\n", 66 | "classifier.score(X_test, y_test)" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "id": "79141bd1", 72 | "metadata": {}, 73 | "source": [ 74 | "Now let's see what happens when shuffling with a different seed:" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "id": "a75b8b30", 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "X_train, X_test, y_train, y_test = train_test_split(\n", 85 | " X, y, test_size=0.2, random_state=42\n", 86 | ")\n", 87 | "classifier = LogisticRegression().fit(X_train, y_train)\n", 88 | "classifier.score(X_test, y_test)" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "id": "bde0b08e", 94 | "metadata": {}, 95 | "source": [ 96 | "It seems that 42 is indeed the Ultimate answer to the Question of Life, the\n", 97 | "Universe, and Everything! Or maybe the score of a model depends on the split:\n", 98 | " - the train-test proportion;\n", 99 | " - the representativeness of the elements in each set.\n", 100 | "\n", 101 | "A more systematic way of evaluating the generalization performance of a model\n", 102 | "is through cross-validation, which consists of repeating the split such that\n", 103 | "the training and testing sets are different for each evaluation." 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "id": "3178a156", 110 | "metadata": {}, 111 | "outputs": [], 112 | "source": [ 113 | "from sklearn.model_selection import cross_val_score, ShuffleSplit\n", 114 | "\n", 115 | "classifier = LogisticRegression()\n", 116 | "cv = ShuffleSplit(n_splits=250, test_size=0.2)\n", 117 | "\n", 118 | "scores = cross_val_score(classifier, X, y, cv=cv)\n", 119 | "print(\n", 120 | " \"The mean cross-validation accuracy is: \"\n", 121 | " f\"{scores.mean():.2f} ± {scores.std():.2f}.\"\n", 122 | ")" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "id": "84a601e3", 128 | "metadata": {}, 129 | "source": [ 130 | "Scores have a variability. 
A sample probabilistic model gives the distribution\n", 131 | "of observed error: if the classification rate is p, the observed distribution\n", 132 | "of correct classifications on a set of size follows a binomial distribution.\n", 133 | "Let's create a function to easily visualize this:" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "id": "99ee9c85", 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "import matplotlib.pyplot as plt\n", 144 | "import numpy as np\n", 145 | "import seaborn as sns\n", 146 | "from scipy import stats\n", 147 | "\n", 148 | "\n", 149 | "def plot_error_distrib(classifier, X, y, cv=5):\n", 150 | "\n", 151 | " n = len(X)\n", 152 | "\n", 153 | " scores = cross_val_score(classifier, X, y, cv=cv)\n", 154 | " distrib = stats.binom(n=n, p=scores.mean())\n", 155 | "\n", 156 | " plt.plot(\n", 157 | " np.linspace(0, 1, n),\n", 158 | " n * distrib.pmf(np.arange(0, n)),\n", 159 | " linewidth=2,\n", 160 | " color=\"black\",\n", 161 | " label=\"binomial distribution\",\n", 162 | " )\n", 163 | " sns.histplot(scores, stat=\"density\", label=\"empirical distribution\")\n", 164 | " plt.xlim(0, 1)\n", 165 | " plt.title(\"Accuracy: \" f\"{scores.mean():.2f} ± {scores.std():.2f}.\")\n", 166 | " plt.legend()\n", 167 | " plt.show()\n", 168 | "\n", 169 | "\n", 170 | "plot_error_distrib(classifier, X, y, cv=cv)" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "id": "80fcb32e", 176 | "metadata": {}, 177 | "source": [ 178 | "The empirical distribution is still broader than the theoretical one. This can\n", 179 | "be explained by the fact that as we are retraining the model on each fold, it\n", 180 | "actually fluctuates due the sampling noise in the training data, while the\n", 181 | "model above only accounts for sampling noise in the test data.\n", 182 | "\n", 183 | "The situation does get better with more data:" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "id": "442ddefb", 190 | "metadata": {}, 191 | "outputs": [], 192 | "source": [ 193 | "X, y = make_classification(**common_params, n_samples=1_000)\n", 194 | "plot_error_distrib(classifier, X, y, cv=cv)" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "id": "b3a7b727", 200 | "metadata": {}, 201 | "source": [ 202 | "Importantly, the standard error of the mean (SEM) across folds is not a good\n", 203 | "measure of this error, as the different data folds are not independent. For\n", 204 | "instance, doing many random splits reduces the variance arbitrarily, but does\n", 205 | "not provide actually new data points." 
206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": null, 211 | "id": "f5c2a788", 212 | "metadata": {}, 213 | "outputs": [], 214 | "source": [ 215 | "cv = ShuffleSplit(n_splits=10, test_size=0.2)\n", 216 | "X, y = make_classification(**common_params, n_samples=400)\n", 217 | "scores = cross_val_score(classifier, X, y, cv=cv)\n", 218 | "\n", 219 | "print(\n", 220 | " f\"Mean accuracy ± SEM with n_split={cv.get_n_splits()}: \"\n", 221 | " f\"{scores.mean():.3f} ± {stats.sem(scores):.3f}.\"\n", 222 | ")\n", 223 | "\n", 224 | "cv = ShuffleSplit(n_splits=100, test_size=0.2)\n", 225 | "scores = cross_val_score(classifier, X, y, cv=cv)\n", 226 | "\n", 227 | "print(\n", 228 | " f\"Mean accuracy ± SEM with n_split={cv.get_n_splits()}: \"\n", 229 | " f\"{scores.mean():.3f} ± {stats.sem(scores):.3f}.\"\n", 230 | ")\n", 231 | "\n", 232 | "cv = ShuffleSplit(n_splits=500, test_size=0.2)\n", 233 | "scores = cross_val_score(classifier, X, y, cv=cv)\n", 234 | "\n", 235 | "print(\n", 236 | " f\"Mean accuracy ± SEM with n_split={cv.get_n_splits()}: \"\n", 237 | " f\"{scores.mean():.3f} ± {stats.sem(scores):.3f}.\"\n", 238 | ")" 239 | ] 240 | }, 241 | { 242 | "cell_type": "markdown", 243 | "id": "22457177", 244 | "metadata": {}, 245 | "source": [ 246 | "Indeed, the SEM goes to zero as 1/sqrt{`n_splits`}. Wraping-up:\n", 247 | "- the more data the better;\n", 248 | "- the more splits, the more descriptive of the variance is the binomial\n", 249 | " distribution, but keep in mind that more splits consume more computing\n", 250 | " power;\n", 251 | "- use std instead of SEM to present your results.\n", 252 | "\n", 253 | "Now that we have an intuition on the variability of an evaluation metric, we\n", 254 | "are ready to apply it to our original Diabetes problem:" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": null, 260 | "id": "c2e55603", 261 | "metadata": {}, 262 | "outputs": [], 263 | "source": [ 264 | "from sklearn.tree import DecisionTreeClassifier\n", 265 | "from sklearn.inspection import DecisionBoundaryDisplay\n", 266 | "\n", 267 | "diabetes_params = {\n", 268 | " \"n_samples\": 10_000,\n", 269 | " \"n_features\": 2,\n", 270 | " \"n_informative\": 2,\n", 271 | " \"n_redundant\": 0,\n", 272 | " \"n_classes\": 2, # binary classification\n", 273 | " \"shift\": [4, 6],\n", 274 | " \"scale\": [10, 25],\n", 275 | " \"random_state\": 0,\n", 276 | "}\n", 277 | "X, y = make_classification(**diabetes_params, weights=[0.55, 0.45])\n", 278 | "\n", 279 | "X_train, X_plot, y_train, y_plot = train_test_split(\n", 280 | " X, y, stratify=y, test_size=0.1, random_state=0\n", 281 | ")\n", 282 | "\n", 283 | "estimator = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)\n", 284 | "\n", 285 | "fig, ax = plt.subplots()\n", 286 | "disp = DecisionBoundaryDisplay.from_estimator(\n", 287 | " estimator,\n", 288 | " X_plot,\n", 289 | " response_method=\"predict\",\n", 290 | " alpha=0.5,\n", 291 | " xlabel=\"age (years)\",\n", 292 | " ylabel=\"blood sugar level (mg/dL)\",\n", 293 | " ax=ax,\n", 294 | ")\n", 295 | "scatter = disp.ax_.scatter(X_plot[:, 0], X_plot[:, 1], c=y_plot, edgecolor=\"k\")\n", 296 | "disp.ax_.set_title(f\"Diabetes test with prevalence = {y.mean():.2f}\")\n", 297 | "_ = disp.ax_.legend(*scatter.legend_elements())" 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "id": "27cad221", 303 | "metadata": {}, 304 | "source": [ 305 | "Notice that the decision boundary changed with respect to the first notebook\n", 306 | "we 
explored. Let's make a remark: models depend on the prevalence of\n", 307 | "the data they were trained on. Therefore, all metrics (including likelihood ratios)\n", 308 | "depend on prevalence as much as the model depends on it. The difference is that\n", 309 | "likelihood ratios extrapolate through populations of different prevalence for\n", 310 | "a **fixed model**.\n", 311 | "\n", 312 | "Let's compute all the metrics and assez their variability in this case:" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": null, 318 | "id": "277a758a", 319 | "metadata": {}, 320 | "outputs": [], 321 | "source": [ 322 | "from collections import defaultdict\n", 323 | "import pandas as pd\n", 324 | "\n", 325 | "cv = ShuffleSplit(n_splits=50, test_size=0.2)\n", 326 | "\n", 327 | "evaluation = defaultdict(list)\n", 328 | "scoring_strategies = [\n", 329 | " \"accuracy\",\n", 330 | " \"balanced_accuracy\",\n", 331 | " \"recall\",\n", 332 | " \"precision\",\n", 333 | " \"matthews_corrcoef\",\n", 334 | " # \"positive_likelihood_ratio\",\n", 335 | " # \"neg_negative_likelihood_ratio\",\n", 336 | "]\n", 337 | "\n", 338 | "for score_name in scoring_strategies:\n", 339 | " scores = cross_val_score(estimator, X, y, cv=cv, scoring=score_name)\n", 340 | " evaluation[score_name] = scores\n", 341 | "\n", 342 | "evaluation = pd.DataFrame(evaluation).aggregate([\"mean\", \"std\"]).T\n", 343 | "evaluation[\"mean\"].plot.barh(xerr=evaluation[\"std\"]).set_xlabel(\"score\")\n", 344 | "plt.show()" 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "id": "812bdbd6", 350 | "metadata": {}, 351 | "source": [ 352 | "Notice that `\"positive_likelihood_ratio\"` is not bounded from above and\n", 353 | "therefore it can't be directly compared with the other metrics on a single\n", 354 | "plot. Similarly, the `\"neg_negative_likelihood_ratio\"` has a reversed sign (is\n", 355 | "negative) to follow the scikit-learn convention for metrics for which a lower\n", 356 | "score is better.\n", 357 | "\n", 358 | "In this case we trained the model on nearly balanced classes. Try changing the\n", 359 | "prevalence and see how the variance of the metrics depend on data imbalance." 360 | ] 361 | } 362 | ], 363 | "metadata": { 364 | "jupytext": { 365 | "cell_metadata_filter": "-all", 366 | "main_language": "python", 367 | "notebook_metadata_filter": "-all" 368 | } 369 | }, 370 | "nbformat": 4, 371 | "nbformat_minor": 5 372 | } 373 | -------------------------------------------------------------------------------- /notebooks/2_roc_pr_curves_tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "1c0200d9", 6 | "metadata": {}, 7 | "source": [ 8 | "Evaluation of non-thresholded prediction\n", 9 | "========================================\n", 10 | "\n", 11 | "All statistics that we presented up to now rely on `.predict` which outputs\n", 12 | "the most likely label. We haven’t made use of the probability associated with\n", 13 | "this prediction, which gives the confidence of the classifier in this\n", 14 | "prediction. By default, the prediction of a classifier corresponds to a\n", 15 | "threshold of 0.5 probability in a binary classification problem. Let's build a\n", 16 | "toy dataset to illustrate this." 
17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "id": "2cee4d2f", 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "from sklearn.datasets import make_classification\n", 27 | "from sklearn.model_selection import train_test_split\n", 28 | "\n", 29 | "common_params = {\n", 30 | " \"n_samples\": 10_000,\n", 31 | " \"n_features\": 2,\n", 32 | " \"n_informative\": 2,\n", 33 | " \"n_redundant\": 0,\n", 34 | " \"n_classes\": 2, # binary classification\n", 35 | " \"class_sep\": 0.5,\n", 36 | " \"random_state\": 0,\n", 37 | "}\n", 38 | "X, y = make_classification(**common_params, weights=[0.6, 0.4])\n", 39 | "\n", 40 | "X_train, X_test, y_train, y_test = train_test_split(\n", 41 | " X, y, stratify=y, random_state=0, test_size=0.02\n", 42 | ")" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "id": "aa2bc2df", 48 | "metadata": {}, 49 | "source": [ 50 | "We can quickly check the predicted probabilities to belong to either class\n", 51 | "using a `LogisticRegression`. To ease the visualization we select a subset\n", 52 | "of `n_plot` samples." 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": null, 58 | "id": "78a544d8", 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "import pandas as pd\n", 63 | "from sklearn.linear_model import LogisticRegression\n", 64 | "\n", 65 | "n_plot = 10\n", 66 | "classifier = LogisticRegression()\n", 67 | "classifier.fit(X_train, y_train)\n", 68 | "\n", 69 | "proba_predicted = pd.DataFrame(\n", 70 | " classifier.predict_proba(X_test), columns=classifier.classes_\n", 71 | ").round(decimals=2)\n", 72 | "proba_predicted[:n_plot]" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "id": "807a2305", 78 | "metadata": {}, 79 | "source": [ 80 | "Probabilites sum to 1. In the binary case it suffices to retain the\n", 81 | "probability of belonging to the positive class, here shown as an annotation in\n", 82 | "the `DecisionBoundaryDisplay`. Notice that setting\n", 83 | "`response_method=\"predict_proba\"` shows the level curves of the 2D sigmoid\n", 84 | "(logistic curve)." 
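To see where the sigmoid shape of those level curves comes from, here is a small optional sketch: for a plain binary `LogisticRegression` (no calibration), the positive-class probability is the logistic sigmoid of `decision_function`, the signed distance to the decision boundary. The toy data below is arbitrary and only used to inspect the probability output.

```python
import numpy as np
from scipy.special import expit  # the logistic sigmoid
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Arbitrary toy data.
X, y = make_classification(
    n_samples=500, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)

# The two columns are complementary: each row sums to one.
print("rows sum to 1:", np.allclose(proba.sum(axis=1), 1.0))

# The positive-class probability is the sigmoid of the signed distance to the
# decision boundary, hence the 2D sigmoid level curves shown in the plot.
print(
    "P(y=1) == sigmoid(decision_function):",
    np.allclose(proba[:, 1], expit(clf.decision_function(X))),
)
```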
85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "id": "cbaf5e5d", 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "import matplotlib.pyplot as plt\n", 95 | "from matplotlib.colors import ListedColormap\n", 96 | "from sklearn.inspection import DecisionBoundaryDisplay\n", 97 | "\n", 98 | "fig, ax = plt.subplots()\n", 99 | "disp = DecisionBoundaryDisplay.from_estimator(\n", 100 | " classifier,\n", 101 | " X_test,\n", 102 | " response_method=\"predict_proba\",\n", 103 | " cmap=\"RdBu\",\n", 104 | " alpha=0.5,\n", 105 | " vmin=0,\n", 106 | " vmax=1,\n", 107 | " ax=ax,\n", 108 | ")\n", 109 | "DecisionBoundaryDisplay.from_estimator(\n", 110 | " classifier,\n", 111 | " X_test,\n", 112 | " response_method=\"predict_proba\",\n", 113 | " plot_method=\"contour\",\n", 114 | " alpha=0.2,\n", 115 | " levels=[0.5], # 0.5 probability contour line\n", 116 | " linestyles=\"--\",\n", 117 | " linewidths=2,\n", 118 | " ax=ax,\n", 119 | ")\n", 120 | "scatter = disp.ax_.scatter(\n", 121 | " X_test[:n_plot, 0], X_test[:n_plot, 1], c=y_test[:n_plot], \n", 122 | " cmap=ListedColormap([\"tab:red\", \"tab:blue\"]),\n", 123 | " edgecolor=\"k\"\n", 124 | ")\n", 125 | "disp.ax_.legend(*scatter.legend_elements(), title=\"True class\", loc=\"lower right\")\n", 126 | "for i, proba in enumerate(proba_predicted[:n_plot][1]):\n", 127 | " disp.ax_.annotate(proba, (X_test[i, 0], X_test[i, 1]), fontsize=\"large\")\n", 128 | "plt.xlim(-2.0, 2.0)\n", 129 | "plt.ylim(-4.0, 4.0)\n", 130 | "plt.title(\n", 131 | " \"Probability of belonging to the positive class\\n(default decision threshold)\"\n", 132 | ")\n", 133 | "plt.show()" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "id": "c6b3f802", 139 | "metadata": {}, 140 | "source": [ 141 | "Evaluation of different probability thresholds\n", 142 | "==============================================\n", 143 | "\n", 144 | "The default decision threshold (0.5) might not be the best threshold that\n", 145 | "leads to optimal generalization performance of our classifier. One can vary\n", 146 | "the decision threshold (and therefore the underlying prediction) and compute\n", 147 | "some evaluation metrics as presented earlier.\n", 148 | "\n", 149 | "Receiver Operating Characteristic curve\n", 150 | "---------------------------------------\n", 151 | "\n", 152 | "One could be interested in the compromise between accurately discriminating\n", 153 | "both the positive class and the negative classes. The statistics used for this\n", 154 | "are sensitivity and specificity, which measure the proportion of correctly\n", 155 | "classified samples per class.\n", 156 | "\n", 157 | "Sensitivity and specificity are generally plotted as a curve called the\n", 158 | "Receiver Operating Characteristic (ROC) curve. Each point on the graph\n", 159 | "corresponds to a specific decision threshold. 
Below is such a curve:" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": null, 165 | "id": "dbed0c9e", 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [ 169 | "from sklearn.metrics import RocCurveDisplay\n", 170 | "from sklearn.dummy import DummyClassifier\n", 171 | "\n", 172 | "dummy_classifier = DummyClassifier(strategy=\"most_frequent\")\n", 173 | "dummy_classifier.fit(X_train, y_train)\n", 174 | "\n", 175 | "disp = RocCurveDisplay.from_estimator(\n", 176 | " classifier, X_test, y_test, name=\"LogisticRegression\", color=\"tab:green\"\n", 177 | ")\n", 178 | "disp = RocCurveDisplay.from_estimator(\n", 179 | " dummy_classifier,\n", 180 | " X_test,\n", 181 | " y_test,\n", 182 | " name=\"chance level\",\n", 183 | " color=\"tab:red\",\n", 184 | " ax=disp.ax_,\n", 185 | ")\n", 186 | "plt.xlim(0, 1)\n", 187 | "plt.ylim(0, 1)\n", 188 | "plt.legend(loc=\"lower right\")\n", 189 | "plt.title(\"ROC curve for LogisticRegression\")\n", 190 | "plt.show()" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "id": "1837bd36", 196 | "metadata": {}, 197 | "source": [ 198 | "ROC curves typically feature true positive rate on the Y axis, and false\n", 199 | "positive rate on the X axis. This means that the top left corner of the plot\n", 200 | "is the \"ideal\" point - a false positive rate of zero, and a true positive rate\n", 201 | "of one. This is not very realistic, but it does mean that a larger area under\n", 202 | "the curve (AUC) is usually better.\n", 203 | "\n", 204 | "We can compute the area under the ROC curve (using `roc_auc_score`) to\n", 205 | "summarize the generalization performance of a model with a single number, or\n", 206 | "to compare several models across thresholds." 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": null, 212 | "id": "72cb955a", 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "from sklearn.ensemble import RandomForestClassifier\n", 217 | "from sklearn.ensemble import HistGradientBoostingClassifier\n", 218 | "\n", 219 | "\n", 220 | "classifiers = {\n", 221 | " \"Hist Gradient Boosting\": HistGradientBoostingClassifier(),\n", 222 | " \"Random Forest\": RandomForestClassifier(n_jobs=-1, random_state=1),\n", 223 | " \"Logistic Regression\": LogisticRegression(),\n", 224 | " \"Chance\": DummyClassifier(strategy=\"most_frequent\"),\n", 225 | "}\n", 226 | "\n", 227 | "fig = plt.figure()\n", 228 | "ax = plt.axes([0.08, 0.15, 0.78, 0.78])\n", 229 | "\n", 230 | "for name, clf in classifiers.items():\n", 231 | " clf.fit(X_train, y_train)\n", 232 | " disp = RocCurveDisplay.from_estimator(clf, X_test, y_test, name=name, ax=ax)\n", 233 | "plt.xlabel(\"False positive rate\")\n", 234 | "plt.ylabel(\"True positive rate \")\n", 235 | "plt.text(\n", 236 | " 0.098,\n", 237 | " 0.575,\n", 238 | " \"= sensitivity or recall\",\n", 239 | " transform=fig.transFigure,\n", 240 | " size=7,\n", 241 | " rotation=\"vertical\",\n", 242 | ")\n", 243 | "plt.xlim(0, 1)\n", 244 | "plt.ylim(0, 1)\n", 245 | "plt.legend(loc=\"lower right\")\n", 246 | "plt.title(\"ROC curves for several models\")\n", 247 | "plt.show()" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "id": "e0ef75b4", 253 | "metadata": {}, 254 | "source": [ 255 | "It is important to notice that the lower bound of the ROC-AUC is 0.5,\n", 256 | "corresponding to chance level. 
Indeed, we show the generalization performance\n", 257 | "of a dummy classifier (the red line) to show that even the worst\n", 258 | "generalization performance obtained will be above this line." 259 | ] 260 | }, 261 | { 262 | "cell_type": "markdown", 263 | "id": "df5f19d6", 264 | "metadata": {}, 265 | "source": [ 266 | "Precision-Recall curves\n", 267 | "-----------------------\n", 268 | "\n", 269 | "As mentioned above, maximizing the ROC curve helps finding a compromise\n", 270 | "between accurately discriminating both the positive class and the negative\n", 271 | "classes. If the interest is to focus mainly on the positive class, the\n", 272 | "precision and recall metrics are more appropriated. Similarly to the ROC\n", 273 | "curve, each point in the Precision-Recall curve corresponds to a level of\n", 274 | "probability which we used as a decision threshold." 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": null, 280 | "id": "44183db6", 281 | "metadata": {}, 282 | "outputs": [], 283 | "source": [ 284 | "from sklearn.metrics import PrecisionRecallDisplay\n", 285 | "\n", 286 | "fig = plt.figure()\n", 287 | "ax = plt.axes([0.08, 0.15, 0.78, 0.78])\n", 288 | "\n", 289 | "for name, clf in classifiers.items():\n", 290 | " clf.fit(X_train, y_train)\n", 291 | " disp = PrecisionRecallDisplay.from_estimator(clf, X_test, y_test, name=name, ax=ax)\n", 292 | "plt.xlabel(\"Recall \")\n", 293 | "plt.text(0.45, 0.067, \"= TPR or sensitivity\", transform=fig.transFigure, size=7)\n", 294 | "plt.ylabel(\"Precision \")\n", 295 | "plt.text(0.1, 0.6, \"= PPV\", transform=fig.transFigure, size=7, rotation=\"vertical\")\n", 296 | "plt.xlim(0, 1)\n", 297 | "plt.ylim(0, 1)\n", 298 | "plt.legend(loc=\"lower right\")\n", 299 | "plt.title(\"Precision-recall curve for several models\")\n", 300 | "plt.show()" 301 | ] 302 | }, 303 | { 304 | "cell_type": "markdown", 305 | "id": "397e60b7", 306 | "metadata": {}, 307 | "source": [ 308 | "A classifier with no false positives would have a precision of 1 for all\n", 309 | "recall values. In like manner to the ROC-AUC, the area under the curve can be\n", 310 | "used to characterize the curve in a single number and is named average\n", 311 | "precision (AP). With an ideal classifier, the average precision would be 1.\n", 312 | "\n", 313 | "In this case, notice that the AP of a `DummyClassifier`, used as baseline to\n", 314 | "define the chance level, coincides with the prevalence of the positive class.\n", 315 | "This is analogous to the downside of the accuracy score as shown in the first\n", 316 | "notebook." 
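The following optional snippet illustrates that claim numerically: a chance-level classifier that assigns the same score to every sample has an average precision equal to the prevalence of the positive class. The dataset parameters are simplified stand-ins for the tutorial's `common_params`, not the exact values used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score

# Stand-in data with roughly 40% positives.
X, y = make_classification(
    n_samples=10_000, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.6, 0.4], random_state=0
)

# A chance-level "classifier" gives every sample the same score: its precision
# is the prevalence whatever the recall, so its average precision equals the
# prevalence itself.
constant_scores = np.full(shape=y.shape, fill_value=0.5)

print(f"Prevalence of the positive class:      {y.mean():.3f}")
print(f"Average precision of a constant score: {average_precision_score(y, constant_scores):.3f}")
```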
317 | ]
318 | },
319 | {
320 | "cell_type": "code",
321 | "execution_count": null,
322 | "id": "0475287c",
323 | "metadata": {},
324 | "outputs": [],
325 | "source": [
326 | "prevalence = y.mean()\n",
327 | "print(f\"Prevalence of the positive class: {prevalence:.3f}\")"
328 | ]
329 | },
330 | {
331 | "cell_type": "markdown",
332 | "id": "415c7cc9",
333 | "metadata": {},
334 | "source": [
335 | "Let's see the effect of adding imbalance between the classes on our set of models:"
336 | ]
337 | },
338 | {
339 | "cell_type": "code",
340 | "execution_count": null,
341 | "id": "50944a15",
342 | "metadata": {},
343 | "outputs": [],
344 | "source": [
345 | "X, y = make_classification(**common_params, weights=[0.83, 0.17])\n",
346 | "\n",
347 | "X_train, X_test, y_train, y_test = train_test_split(\n",
348 | "    X, y, stratify=y, random_state=0, test_size=0.02\n",
349 | ")\n",
350 | "\n",
351 | "fig = plt.figure()\n",
352 | "ax = plt.axes([0.08, 0.15, 0.78, 0.78])\n",
353 | "\n",
354 | "for name, clf in classifiers.items():\n",
355 | "    clf.fit(X_train, y_train)\n",
356 | "    disp = PrecisionRecallDisplay.from_estimator(clf, X_test, y_test, name=name, ax=ax)\n",
357 | "plt.xlabel(\"Recall \")\n",
358 | "plt.text(0.45, 0.067, \"= TPR or sensitivity\", transform=fig.transFigure, size=7)\n",
359 | "plt.ylabel(\"Precision \")\n",
360 | "plt.text(0.1, 0.6, \"= PPV\", transform=fig.transFigure, size=7, rotation=\"vertical\")\n",
361 | "plt.xlim(0, 1)\n",
362 | "plt.ylim(0, 1)\n",
363 | "plt.legend(loc=\"upper right\")\n",
364 | "plt.title(\"Precision-recall curve for several models\\nw. imbalanced data\")\n",
365 | "plt.show()"
366 | ]
367 | },
368 | {
369 | "cell_type": "markdown",
370 | "id": "f97d04b3",
371 | "metadata": {},
372 | "source": [
373 | "The AP of all models decreased, including the baseline defined by the dummy\n",
374 | "classifier. Indeed, we confirm that AP is not invariant to prevalence: its chance-level baseline follows the prevalence of the positive class.\n",
375 | "\n",
376 | "Conclusions\n",
377 | "===========\n",
378 | "\n",
379 | "- Consider the prevalence in your target population. It may be that the\n",
380 | "  prevalence in your testing sample is not representative of that of the\n",
381 | "  target population. In that case, aside from LR+ and LR-, performance metrics\n",
382 | "  computed from the testing sample will not be representative of those in the\n",
383 | "  target population.\n",
384 | "\n",
385 | "- Never trust a single summary metric (accuracy, balanced accuracy, ROC-AUC,\n",
386 | "  etc.), but rather look at all the individual metrics. Understand the\n",
387 | "  implications of your choices to find the right tradeoff."
388 | ]
389 | }
390 | ],
391 | "metadata": {
392 | "jupytext": {
393 | "cell_metadata_filter": "-all",
394 | "main_language": "python",
395 | "notebook_metadata_filter": "-all"
396 | }
397 | },
398 | "nbformat": 4,
399 | "nbformat_minor": 5
400 | }
401 | --------------------------------------------------------------------------------
/notebooks/1_evaluation_tutorial.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "a97407df",
6 | "metadata": {},
7 | "source": [
8 | "\n",
9 | "Accounting for imbalance in evaluation metrics for classification\n",
10 | "=================================================================\n",
11 | "\n",
12 | "Suppose we have a population of subjects with features `X` that can hopefully\n",
13 | "serve as indicators of a binary class `y` (known ground truth). Additionally,\n",
14 | "suppose the class prevalence (the number of samples in the positive class\n",
15 | "divided by the total number of samples) is very low.\n",
16 | "\n",
17 | "To fix ideas, let's use a medical analogy and think about diabetes. We only\n",
18 | "use two features (age and blood sugar level) to keep the example as simple as\n",
19 | "possible. We use `make_classification` to simulate the distribution of the\n",
20 | "disease and to ensure **the data-generating process is always the same**. We\n",
21 | "set `weights=[0.99, 0.01]` to obtain a prevalence of around 1% which,\n",
22 | "according to [The World\n",
23 | "Bank](https://data.worldbank.org/indicator/SH.STA.DIAB.ZS?most_recent_value_desc=false),\n",
24 | "is the case for the country with the lowest diabetes prevalence in 2022\n",
25 | "(Benin).\n",
26 | "\n",
27 | "In practice, the ideas presented here can be applied in settings where the\n",
28 | "data available to learn and evaluate a classifier has nearly balanced classes,\n",
29 | "such as a case-control study, while the target application, i.e. the general\n",
30 | "population, has very low prevalence."
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": null,
36 | "id": "ba2ba7a7",
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "from sklearn.datasets import make_classification\n",
41 | "\n",
42 | "common_params = {\n",
43 | "    \"n_samples\": 10_000,\n",
44 | "    \"n_features\": 2,\n",
45 | "    \"n_informative\": 2,\n",
46 | "    \"n_redundant\": 0,\n",
47 | "    \"n_classes\": 2,  # binary classification\n",
48 | "    \"shift\": [4, 6],\n",
49 | "    \"scale\": [10, 25],\n",
50 | "    \"random_state\": 0,\n",
51 | "}\n",
52 | "X, y = make_classification(**common_params, weights=[0.99, 0.01])\n",
53 | "prevalence = y.mean()\n",
54 | "print(f\"Percentage of people carrying the disease: {100*prevalence:.2f}%\")"
55 | ]
56 | },
57 | {
58 | "cell_type": "markdown",
59 | "id": "68480c16",
60 | "metadata": {},
61 | "source": [
62 | "A simple model is trained to diagnose if a person is likely to have diabetes.\n",
63 | "To estimate the generalization performance of such a model, we do a train-test\n",
64 | "split."
65 | ]
66 | },
67 | {
68 | "cell_type": "code",
69 | "execution_count": null,
70 | "id": "dbda2ce1",
71 | "metadata": {},
72 | "outputs": [],
73 | "source": [
74 | "from sklearn.model_selection import train_test_split\n",
75 | "from sklearn.tree import DecisionTreeClassifier\n",
76 | "\n",
77 | "X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)\n",
78 | "\n",
79 | "estimator = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)"
80 | ]
81 | },
82 | {
83 | "cell_type": "markdown",
84 | "id": "46de5abd",
85 | "metadata": {},
86 | "source": [
87 | "We now show the decision boundary learned by the estimator. Notice that we\n",
88 | "only plot a stratified subset of the original data."
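As a side note, the sketch below shows why `stratify=y` matters here: both resulting splits keep roughly the original prevalence, which is important when positives are rare. The parameters mirror the cell above but are simplified stand-ins rather than the tutorial's exact configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Low-prevalence data, similar in spirit to the simulation above.
X, y = make_classification(
    n_samples=10_000, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.99, 0.01], random_state=0
)

# `stratify=y` makes both splits keep (approximately) the original prevalence,
# which matters when the positive class is rare.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

print(f"Prevalence overall:  {y.mean():.4f}")
print(f"Prevalence in train: {y_train.mean():.4f}")
print(f"Prevalence in test:  {y_test.mean():.4f}")
```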
89 | ]
90 | },
91 | {
92 | "cell_type": "code",
93 | "execution_count": null,
94 | "id": "42c34246",
95 | "metadata": {},
96 | "outputs": [],
97 | "source": [
98 | "import matplotlib.pyplot as plt\n",
99 | "from sklearn.inspection import DecisionBoundaryDisplay\n",
100 | "\n",
101 | "fig, ax = plt.subplots()\n",
102 | "disp = DecisionBoundaryDisplay.from_estimator(\n",
103 | "    estimator,\n",
104 | "    X_test,\n",
105 | "    response_method=\"predict\",\n",
106 | "    alpha=0.5,\n",
107 | "    xlabel=\"age (years)\",\n",
108 | "    ylabel=\"blood sugar level (mg/dL)\",\n",
109 | "    ax=ax,\n",
110 | ")\n",
111 | "scatter = disp.ax_.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolor=\"k\")\n",
112 | "disp.ax_.set_title(f\"Hypothetical diabetes test with prevalence = {y.mean():.2f}\")\n",
113 | "_ = disp.ax_.legend(*scatter.legend_elements())"
114 | ]
115 | },
116 | {
117 | "cell_type": "markdown",
118 | "id": "0976dd3e",
119 | "metadata": {},
120 | "source": [
121 | "The most widely used summary metric is arguably accuracy. Its main advantage\n",
122 | "is a natural interpretation: the proportion of correctly classified samples."
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": null,
128 | "id": "2a42dfdf",
129 | "metadata": {},
130 | "outputs": [],
131 | "source": [
132 | "from sklearn import metrics\n",
133 | "\n",
134 | "y_pred = estimator.predict(X_test)\n",
135 | "accuracy = metrics.accuracy_score(y_test, y_pred)\n",
136 | "print(f\"Accuracy on the test set: {accuracy:.3f}\")"
137 | ]
138 | },
139 | {
140 | "cell_type": "markdown",
141 | "id": "460f1449",
142 | "metadata": {},
143 | "source": [
144 | "However, it is misleading when the data is imbalanced. Our model performs\n",
145 | "as well as a trivial majority classifier."
146 | ]
147 | },
148 | {
149 | "cell_type": "code",
150 | "execution_count": null,
151 | "id": "6fcf2b77",
152 | "metadata": {},
153 | "outputs": [],
154 | "source": [
155 | "from sklearn.dummy import DummyClassifier\n",
156 | "\n",
157 | "dummy = DummyClassifier(strategy=\"most_frequent\").fit(X_train, y_train)\n",
158 | "y_dummy = dummy.predict(X_test)\n",
159 | "accuracy_dummy = metrics.accuracy_score(y_test, y_dummy)\n",
160 | "print(f\"Accuracy when always predicting 'no diabetes': {accuracy_dummy:.3f}\")"
161 | ]
162 | },
163 | {
164 | "cell_type": "markdown",
165 | "id": "d2ce15e0",
166 | "metadata": {},
167 | "source": [
168 | "Some of the other metrics are better at describing the flaws of our model:"
169 | ]
170 | },
171 | {
172 | "cell_type": "code",
173 | "execution_count": null,
174 | "id": "3bc30b2b",
175 | "metadata": {},
176 | "outputs": [],
177 | "source": [
178 | "sensitivity = metrics.recall_score(y_test, y_pred)\n",
179 | "specificity = metrics.recall_score(y_test, y_pred, pos_label=0)\n",
180 | "balanced_acc = metrics.balanced_accuracy_score(y_test, y_pred)\n",
181 | "matthews = metrics.matthews_corrcoef(y_test, y_pred)\n",
182 | "PPV = metrics.precision_score(y_test, y_pred)\n",
183 | "\n",
184 | "print(f\"Sensitivity on the test set: {sensitivity:.2f}\")\n",
185 | "print(f\"Specificity on the test set: {specificity:.2f}\")\n",
186 | "print(f\"Balanced accuracy on the test set: {balanced_acc:.2f}\")\n",
187 | "print(f\"Matthews correlation coeff on the test set: {matthews:.2f}\")\n",
188 | "print()\n",
189 | "print(f\"Probability to have the disease given a positive test: {100*PPV:.2f}%\")"
190 | ]
191 | },
192 | {
193 | "cell_type": "markdown",
194 | "id": "48b03751",
195 | "metadata": {},
196 | "source": [
197 | "Our classifier is not informative enough for the general population. The PPV\n",
198 | "and NPV give the information of interest: P(D+ | T+) and P(D− | T−). However,\n",
199 | "they are not intrinsic to the medical test (in other words, the trained ML\n",
200 | "model): they also depend on the prevalence and thus on the target population.\n",
201 | "\n",
202 | "The class likelihood ratios (LR±) depend only on sensitivity and specificity\n",
203 | "of the classifier, and not on the prevalence of the study population. For the\n",
204 | "moment it suffices to recall that the LR± are defined as\n",
205 | "\n",
206 | "    LR± = P(T± | D+) / P(T± | D−)"
207 | ]
208 | },
209 | {
210 | "cell_type": "code",
211 | "execution_count": null,
212 | "id": "13b04b48",
213 | "metadata": {},
214 | "outputs": [],
215 | "source": [
216 | "pos_LR, neg_LR = metrics.class_likelihood_ratios(y_test, y_pred)\n",
217 | "print(f\"LR+ on the test set: {pos_LR:.3f}\")  # higher is better\n",
218 | "print(f\"LR- on the test set: {neg_LR:.3f}\")  # lower is better"
219 | ]
220 | },
221 | {
222 | "cell_type": "markdown",
223 | "id": "f999f3e2",
224 | "metadata": {},
225 | "source": [
226 | "**Caution**\n",
227 | "\n",
228 | "Please notice that if you want to use\n",
229 | "`metrics.class_likelihood_ratios`, you need\n",
230 | "scikit-learn >= 1.2.\n",
231 | "\n",
232 | "\n",
233 | "Extrapolating between populations\n",
234 | "---------------------------------\n",
235 | "\n",
236 | "The prevalence can vary (for instance, the prevalence of an infectious\n",
237 | "disease changes across time) and a given classifier may be intended to be\n",
238 | "applied in various situations.\n",
239 | "\n",
240 | "According to the World Bank, the diabetes prevalence in French Polynesia\n",
241 | "in 2022 is above 25%. Let's now evaluate our previously trained model on a\n",
242 | "**different population** with such prevalence and **the same data-generating\n",
243 | "process**."
244 | ]
245 | },
246 | {
247 | "cell_type": "code",
248 | "execution_count": null,
249 | "id": "432b3b95",
250 | "metadata": {},
251 | "outputs": [],
252 | "source": [
253 | "X, y = make_classification(**common_params, weights=[0.75, 0.25])\n",
254 | "X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)\n",
255 | "\n",
256 | "fig, ax = plt.subplots()\n",
257 | "disp = DecisionBoundaryDisplay.from_estimator(\n",
258 | "    estimator,\n",
259 | "    X_test,\n",
260 | "    response_method=\"predict\",\n",
261 | "    alpha=0.5,\n",
262 | "    xlabel=\"age (years)\",\n",
263 | "    ylabel=\"blood sugar level (mg/dL)\",\n",
264 | "    ax=ax,\n",
265 | ")\n",
266 | "scatter = disp.ax_.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolor=\"k\")\n",
267 | "disp.ax_.set_title(f\"Hypothetical diabetes test with prevalence = {y.mean():.2f}\")\n",
268 | "_ = disp.ax_.legend(*scatter.legend_elements())\n",
269 | "\n",
270 | "# We then compute the same metrics using a test set with the new\n",
271 | "# prevalence:\n",
272 | "\n",
273 | "y_pred = estimator.predict(X_test)\n",
274 | "prevalence = y.mean()\n",
275 | "accuracy = metrics.accuracy_score(y_test, y_pred)\n",
276 | "sensitivity = metrics.recall_score(y_test, y_pred)\n",
277 | "specificity = metrics.recall_score(y_test, y_pred, pos_label=0)\n",
278 | "balanced_acc = metrics.balanced_accuracy_score(y_test, y_pred)\n",
279 | "matthews = metrics.matthews_corrcoef(y_test, y_pred)\n",
280 | "PPV = metrics.precision_score(y_test, y_pred)\n",
281 | "\n",
282 | "print(f\"Accuracy on the test set: {accuracy:.2f}\")\n",
283 | "print(f\"Sensitivity on the test set: {sensitivity:.2f}\")\n",
284 | "print(f\"Specificity on the test set: {specificity:.2f}\")\n",
285 | "print(f\"Balanced accuracy on the test set: {balanced_acc:.2f}\")\n",
286 | "print(f\"Matthews correlation coeff on the test set: {matthews:.2f}\")\n",
287 | "print()\n",
288 | "print(f\"Probability to have the disease given a positive test: {100*PPV:.2f}%\")"
289 | ]
290 | },
291 | {
292 | "cell_type": "markdown",
293 | "id": "60315d3b",
294 | "metadata": {},
295 | "source": [
296 | "The same model seems to perform better on this new dataset. Notice in\n",
297 | "particular that the probability of having the disease given a positive test\n",
298 | "increased. The same blood sugar test is less predictive in Benin than in\n",
299 | "French Polynesia!\n",
300 | "\n",
301 | "If we really want to score the test and not the dataset, we need a metric that\n",
302 | "does not depend on the prevalence of the study population."
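To make the prevalence-invariance argument concrete, here is an optional sketch that re-derives the likelihood ratios from sensitivity and specificity and compares them with `metrics.class_likelihood_ratios` (available in scikit-learn >= 1.2). The dataset and model are simplified stand-ins for the notebook's setup, not its exact parameters.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import class_likelihood_ratios, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Simplified stand-in for the tutorial's data-generating process.
X, y = make_classification(
    n_samples=10_000, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.75, 0.25], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
y_pred = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train).predict(X_test)

sensitivity = recall_score(y_test, y_pred)               # P(T+ | D+)
specificity = recall_score(y_test, y_pred, pos_label=0)  # P(T- | D-)

# The likelihood ratios only involve these class-conditional rates, which is
# why they do not depend on the prevalence of the evaluation sample:
#   LR+ = sensitivity / (1 - specificity),  LR- = (1 - sensitivity) / specificity
manual_pos_LR = sensitivity / (1 - specificity)
manual_neg_LR = (1 - sensitivity) / specificity

pos_LR, neg_LR = class_likelihood_ratios(y_test, y_pred)
print(f"LR+  manual: {manual_pos_LR:.3f}   sklearn: {pos_LR:.3f}")
print(f"LR-  manual: {manual_neg_LR:.3f}   sklearn: {neg_LR:.3f}")
```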
303 | ]
304 | },
305 | {
306 | "cell_type": "code",
307 | "execution_count": null,
308 | "id": "7395579a",
309 | "metadata": {},
310 | "outputs": [],
311 | "source": [
312 | "pos_LR, neg_LR = metrics.class_likelihood_ratios(y_test, y_pred)\n",
313 | "\n",
314 | "print(f\"LR+ on the test set: {pos_LR:.3f}\")\n",
315 | "print(f\"LR- on the test set: {neg_LR:.3f}\")"
316 | ]
317 | },
318 | {
319 | "cell_type": "markdown",
320 | "id": "d4e7c039",
321 | "metadata": {},
322 | "source": [
323 | "Despite some variations due to residual dataset dependence, the class\n",
324 | "likelihood ratios are mathematically invariant with respect to prevalence. See\n",
325 | "[this example from the User\n",
326 | "Guide](https://scikit-learn.org/dev/auto_examples/model_selection/plot_likelihood_ratios.html#invariance-with-respect-to-prevalence)\n",
327 | "for a demo of this property.\n",
328 | "\n",
329 | "Pre-test vs. post-test odds\n",
330 | "---------------------------\n",
331 | "\n",
332 | "Both class likelihood ratios are interpretable in terms of odds:\n",
333 | "\n",
334 | "    post-test odds = likelihood ratio * pre-test odds\n",
335 | "\n",
336 | "The interpretation of LR+ in this case reads:"
337 | ]
338 | },
339 | {
340 | "cell_type": "code",
341 | "execution_count": null,
342 | "id": "4983b942",
343 | "metadata": {},
344 | "outputs": [],
345 | "source": [
346 | "print(\"The post-test odds that the condition is truly present given a positive \"\n",
347 | "      f\"test result are {pos_LR:.3f} times larger than the pre-test odds.\")"
348 | ]
349 | },
350 | {
351 | "cell_type": "markdown",
352 | "id": "71a226b4",
353 | "metadata": {},
354 | "source": [
355 | "We found that the diagnostic tool is useful: the post-test odds are larger than the\n",
356 | "pre-test odds. We now choose the pre-test probability to be the prevalence of\n",
357 | "the disease in the held-out testing set."
358 | ]
359 | },
360 | {
361 | "cell_type": "code",
362 | "execution_count": null,
363 | "id": "9967d431",
364 | "metadata": {},
365 | "outputs": [],
366 | "source": [
367 | "pretest_odds = y_test.mean() / (1 - y_test.mean())\n",
368 | "posttest_odds = pretest_odds * pos_LR\n",
369 | "\n",
370 | "print(f\"Observed pre-test odds: {pretest_odds:.3f}\")\n",
371 | "print(f\"Estimated post-test odds using LR+: {posttest_odds:.3f}\")"
372 | ]
373 | },
374 | {
375 | "cell_type": "markdown",
376 | "id": "0828ff62",
377 | "metadata": {},
378 | "source": [
379 | "The post-test probability is the probability of an individual truly having\n",
380 | "the condition given a positive test result, i.e. the number of true positives\n",
381 | "divided by the number of samples with a positive test. In real life applications\n",
382 | "this is unknown."
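The following small example makes the bookkeeping explicit on hypothetical labels and predictions (the arrays below are made up for illustration): the post-test probability obtained from the confusion matrix, true positives over all positive tests, coincides with the one recovered from LR+ and the pre-test odds. It assumes scikit-learn >= 1.2 for `class_likelihood_ratios`.

```python
import numpy as np
from sklearn.metrics import class_likelihood_ratios, confusion_matrix

# Hypothetical labels and predictions, only to make the bookkeeping concrete.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Post-test probability P(D+ | T+): true positives over all positive tests.
posttest_prob_direct = tp / (tp + fp)

# The same number recovered through the odds route:
# post-test odds = LR+ * pre-test odds, then odds -> probability.
pos_LR, _ = class_likelihood_ratios(y_true, y_pred)
pretest_odds = y_true.mean() / (1 - y_true.mean())
posttest_odds = pos_LR * pretest_odds
posttest_prob_from_odds = posttest_odds / (1 + posttest_odds)

print(f"P(D+ | T+) from the confusion matrix:  {posttest_prob_direct:.3f}")
print(f"P(D+ | T+) from LR+ and pre-test odds: {posttest_prob_from_odds:.3f}")
```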
383 | ]
384 | },
385 | {
386 | "cell_type": "code",
387 | "execution_count": null,
388 | "id": "bf4993a8",
389 | "metadata": {},
390 | "outputs": [],
391 | "source": [
392 | "posttest_prob = posttest_odds / (1 + posttest_odds)\n",
393 | "\n",
394 | "print(f\"Estimated post-test probability using LR+: {posttest_prob:.3f}\")"
395 | ]
396 | },
397 | {
398 | "cell_type": "markdown",
399 | "id": "34d66001",
400 | "metadata": {},
401 | "source": [
402 | "We can verify that if we had had access to the true labels, we would have\n",
403 | "obtained the same probability:"
404 | ]
405 | },
406 | {
407 | "cell_type": "code",
408 | "execution_count": null,
409 | "id": "32d96836",
410 | "metadata": {},
411 | "outputs": [],
412 | "source": [
413 | "posttest_prob = y_test[y_pred == 1].mean()\n",
414 | "\n",
415 | "print(f\"Observed post-test probability: {posttest_prob:.3f}\")"
416 | ]
417 | },
418 | {
419 | "cell_type": "markdown",
420 | "id": "de75d14b",
421 | "metadata": {},
422 | "source": [
423 | "Conclusion: if a salesperson from Benin were to sell the model to French Polynesia\n",
424 | "by showing them the 59.84% probability of having the disease given a positive test,\n",
425 | "French Polynesia would never have bought it, even though it would be quite\n",
426 | "predictive for their own population. The right quantities to report are the LR±.\n",
427 | "\n",
428 | "Can you imagine what would happen if the model were trained on nearly balanced classes\n",
429 | "and then extrapolated to other scenarios?"
430 | ]
431 | }
432 | ],
433 | "metadata": {
434 | "jupytext": {
435 | "cell_metadata_filter": "-all",
436 | "main_language": "python",
437 | "notebook_metadata_filter": "-all"
438 | }
439 | },
440 | "nbformat": 4,
441 | "nbformat_minor": 5
442 | }
443 | --------------------------------------------------------------------------------
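As a closing illustration of that last question, here is a hedged sketch showing how a single, fixed likelihood ratio translates into very different post-test probabilities depending on the prevalence of the target population; the `pos_LR = 12.0` value is made up for the example and is not a result from the notebooks.

```python
# Hypothetical likelihood ratio of a fixed diagnostic model.
pos_LR = 12.0

def posttest_probability(prevalence, likelihood_ratio):
    """Turn a pre-test prevalence into a post-test probability via the odds."""
    pretest_odds = prevalence / (1 - prevalence)
    posttest_odds = likelihood_ratio * pretest_odds
    return posttest_odds / (1 + posttest_odds)

# The very same test is far more conclusive where the disease is common.
for prevalence in (0.01, 0.25, 0.50):
    prob = posttest_probability(prevalence, pos_LR)
    print(f"prevalence = {prevalence:4.0%} -> P(D+ | T+) = {prob:.1%}")
```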