├── slides.pdf ├── requirements.txt ├── environment.yml ├── README.md ├── LICENSE ├── python_scripts ├── 3_uncertainty_in_metrics_tutorial.py ├── 2_roc_pr_curves_tutorial.py └── 1_evaluation_tutorial.py └── notebooks ├── 3_uncertainty_in_metrics_tutorial.ipynb ├── 2_roc_pr_curves_tutorial.ipynb └── 1_evaluation_tutorial.ipynb /slides.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ArturoAmorQ/euroscipy_2022_evaluation/HEAD/slides.pdf -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | numpy==1.26.* 2 | scipy==1.11.* 3 | pandas==2.1.* 4 | matplotlib==3.7.* 5 | jupyter 6 | seaborn 7 | scikit-learn==1.3.* 8 | -------------------------------------------------------------------------------- /environment.yml: -------------------------------------------------------------------------------- 1 | name: evaluation-tutorial 2 | 3 | dependencies: 4 | - python 5 | - scikit-learn 6 | - pandas 7 | - seaborn 8 | - jupyter 9 | - pip 10 | 11 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # EuroSciPy 2022 - Evaluating your ML models tutorial 2 | 3 | Follow the intro slides [here](https://github.com/ArturoAmorQ/euroscipy_2022_evaluation/blob/main/slides.pdf). 4 | 5 | ## Follow the tutorial online 6 | 7 | Launch an online notebook environment using [![Binder](https://mybinder.org/badge_logo.svg)](https://notebooks.gesis.org/binder/v2/gh/ArturoAmorQ/euroscipy_2022_evaluation/HEAD) 8 | 9 | - [1_evaluation_tutorial.ipynb](https://notebooks.gesis.org/binder/v2/gh/ArturoAmorQ/euroscipy_2022_evaluation/HEAD?filepath=/notebooks/1_evaluation_tutorial.ipynb) 10 | - [2_roc_pr_curves_tutorial.ipynb](https://notebooks.gesis.org/binder/v2/gh/ArturoAmorQ/euroscipy_2022_evaluation/HEAD?filepath=/notebooks/2_roc_pr_curves_tutorial.ipynb) 11 | - [3_uncertainty_in_metrics_tutorial.ipynb ](https://notebooks.gesis.org/binder/v2/gh/ArturoAmorQ/euroscipy_2022_evaluation/HEAD?filepath=/notebooks/3_uncertainty_in_metrics_tutorial.ipynb) 12 | 13 | You need an internet connection but you will not have to install any package 14 | locally. 15 | 16 | ## Running the tutorial locally 17 | 18 | ### Dependencies 19 | 20 | The tutorials will require the following packages: 21 | 22 | * python 23 | * jupyter 24 | * pandas 25 | * matplotlib 26 | * seaborn 27 | * scikit-learn >= 1.2.0 28 | 29 | ### Local install 30 | 31 | We provide both `requirements.txt` and `environment.yml` to install packages. 
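Either option installs everything needed for the tutorials. Once the install has finished (see the commands below), you can quickly check that the scikit-learn version is recent enough, since `metrics.class_likelihood_ratios` (used in the first tutorial) needs scikit-learn >= 1.2.0:

```
$ python -c "import sklearn; print(sklearn.__version__)"
```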
32 | 33 | You can install the packages using `pip`: 34 | 35 | ``` 36 | $ pip install -r requirements.txt 37 | ``` 38 | 39 | You can create an `evaluation-tutorial` conda environment executing: 40 | 41 | ``` 42 | $ conda env create -f environment.yml 43 | ``` 44 | 45 | and later activate the environment: 46 | 47 | ``` 48 | $ conda activate evaluation-tutorial 49 | ``` 50 | 51 | You might also only update your current environment using: 52 | 53 | ``` 54 | $ conda env update --prefix ./env --file environment.yml --prune 55 | ``` 56 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Creative Commons Legal Code 2 | 3 | CC0 1.0 Universal 4 | 5 | CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE 6 | LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN 7 | ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS 8 | INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES 9 | REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS 10 | PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM 11 | THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED 12 | HEREUNDER. 13 | 14 | Statement of Purpose 15 | 16 | The laws of most jurisdictions throughout the world automatically confer 17 | exclusive Copyright and Related Rights (defined below) upon the creator 18 | and subsequent owner(s) (each and all, an "owner") of an original work of 19 | authorship and/or a database (each, a "Work"). 20 | 21 | Certain owners wish to permanently relinquish those rights to a Work for 22 | the purpose of contributing to a commons of creative, cultural and 23 | scientific works ("Commons") that the public can reliably and without fear 24 | of later claims of infringement build upon, modify, incorporate in other 25 | works, reuse and redistribute as freely as possible in any form whatsoever 26 | and for any purposes, including without limitation commercial purposes. 27 | These owners may contribute to the Commons to promote the ideal of a free 28 | culture and the further production of creative, cultural and scientific 29 | works, or to gain reputation or greater distribution for their Work in 30 | part through the use and efforts of others. 31 | 32 | For these and/or other purposes and motivations, and without any 33 | expectation of additional consideration or compensation, the person 34 | associating CC0 with a Work (the "Affirmer"), to the extent that he or she 35 | is an owner of Copyright and Related Rights in the Work, voluntarily 36 | elects to apply CC0 to the Work and publicly distribute the Work under its 37 | terms, with knowledge of his or her Copyright and Related Rights in the 38 | Work and the meaning and intended legal effect of CC0 on those rights. 39 | 40 | 1. Copyright and Related Rights. A Work made available under CC0 may be 41 | protected by copyright and related or neighboring rights ("Copyright and 42 | Related Rights"). Copyright and Related Rights include, but are not 43 | limited to, the following: 44 | 45 | i. the right to reproduce, adapt, distribute, perform, display, 46 | communicate, and translate a Work; 47 | ii. moral rights retained by the original author(s) and/or performer(s); 48 | iii. publicity and privacy rights pertaining to a person's image or 49 | likeness depicted in a Work; 50 | iv. 
rights protecting against unfair competition in regards to a Work, 51 | subject to the limitations in paragraph 4(a), below; 52 | v. rights protecting the extraction, dissemination, use and reuse of data 53 | in a Work; 54 | vi. database rights (such as those arising under Directive 96/9/EC of the 55 | European Parliament and of the Council of 11 March 1996 on the legal 56 | protection of databases, and under any national implementation 57 | thereof, including any amended or successor version of such 58 | directive); and 59 | vii. other similar, equivalent or corresponding rights throughout the 60 | world based on applicable law or treaty, and any national 61 | implementations thereof. 62 | 63 | 2. Waiver. To the greatest extent permitted by, but not in contravention 64 | of, applicable law, Affirmer hereby overtly, fully, permanently, 65 | irrevocably and unconditionally waives, abandons, and surrenders all of 66 | Affirmer's Copyright and Related Rights and associated claims and causes 67 | of action, whether now known or unknown (including existing as well as 68 | future claims and causes of action), in the Work (i) in all territories 69 | worldwide, (ii) for the maximum duration provided by applicable law or 70 | treaty (including future time extensions), (iii) in any current or future 71 | medium and for any number of copies, and (iv) for any purpose whatsoever, 72 | including without limitation commercial, advertising or promotional 73 | purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each 74 | member of the public at large and to the detriment of Affirmer's heirs and 75 | successors, fully intending that such Waiver shall not be subject to 76 | revocation, rescission, cancellation, termination, or any other legal or 77 | equitable action to disrupt the quiet enjoyment of the Work by the public 78 | as contemplated by Affirmer's express Statement of Purpose. 79 | 80 | 3. Public License Fallback. Should any part of the Waiver for any reason 81 | be judged legally invalid or ineffective under applicable law, then the 82 | Waiver shall be preserved to the maximum extent permitted taking into 83 | account Affirmer's express Statement of Purpose. In addition, to the 84 | extent the Waiver is so judged Affirmer hereby grants to each affected 85 | person a royalty-free, non transferable, non sublicensable, non exclusive, 86 | irrevocable and unconditional license to exercise Affirmer's Copyright and 87 | Related Rights in the Work (i) in all territories worldwide, (ii) for the 88 | maximum duration provided by applicable law or treaty (including future 89 | time extensions), (iii) in any current or future medium and for any number 90 | of copies, and (iv) for any purpose whatsoever, including without 91 | limitation commercial, advertising or promotional purposes (the 92 | "License"). The License shall be deemed effective as of the date CC0 was 93 | applied by Affirmer to the Work. Should any part of the License for any 94 | reason be judged legally invalid or ineffective under applicable law, such 95 | partial invalidity or ineffectiveness shall not invalidate the remainder 96 | of the License, and in such case Affirmer hereby affirms that he or she 97 | will not (i) exercise any of his or her remaining Copyright and Related 98 | Rights in the Work or (ii) assert any associated claims and causes of 99 | action with respect to the Work, in either case contrary to Affirmer's 100 | express Statement of Purpose. 101 | 102 | 4. Limitations and Disclaimers. 103 | 104 | a. 
No trademark or patent rights held by Affirmer are waived, abandoned, 105 | surrendered, licensed or otherwise affected by this document. 106 | b. Affirmer offers the Work as-is and makes no representations or 107 | warranties of any kind concerning the Work, express, implied, 108 | statutory or otherwise, including without limitation warranties of 109 | title, merchantability, fitness for a particular purpose, non 110 | infringement, or the absence of latent or other defects, accuracy, or 111 | the presence or absence of errors, whether or not discoverable, all to 112 | the greatest extent permissible under applicable law. 113 | c. Affirmer disclaims responsibility for clearing rights of other persons 114 | that may apply to the Work or any use thereof, including without 115 | limitation any person's Copyright and Related Rights in the Work. 116 | Further, Affirmer disclaims responsibility for obtaining any necessary 117 | consents, permissions or other rights required for any use of the 118 | Work. 119 | d. Affirmer understands and acknowledges that Creative Commons is not a 120 | party to this document and has no duty or obligation with respect to 121 | this CC0 or use of the Work. 122 | -------------------------------------------------------------------------------- /python_scripts/3_uncertainty_in_metrics_tutorial.py: -------------------------------------------------------------------------------- 1 | # %% [markdown] 2 | # 3 | # Uncertainty in evaluation metrics for classification 4 | # ==================================================== 5 | # 6 | # Has it ever happened to you that one of your colleagues claims their model with 7 | # test score of 0.8001 is better than your model with test score of 0.7998? 8 | # Maybe they are not aware that model-evaluation procedures should gauge not 9 | # only the expected generalization performance, but also its variations. As 10 | # usual, let's build a toy dataset to illustrate this. 11 | 12 | # %% 13 | from sklearn.datasets import make_classification 14 | 15 | common_params = { 16 | "n_features": 2, 17 | "n_informative": 2, 18 | "n_redundant": 0, 19 | "n_classes": 2, # binary classification 20 | "random_state": 0, 21 | "weights": [0.55, 0.45], 22 | } 23 | X, y = make_classification(**common_params, n_samples=400) 24 | 25 | prevalence = y.mean() 26 | print(f"Percentage of samples in the positive class: {100*prevalence:.2f}%") 27 | 28 | # %% [markdown] 29 | # We are already familiar with using a train-test split to estimate the 30 | # generalization performance of a model. By default, `train_test_split` uses 31 | # `shuffle=True`. Let's see what happens if we set a particular seed. 32 | 33 | # %% 34 | from sklearn.model_selection import train_test_split 35 | from sklearn.linear_model import LogisticRegression 36 | 37 | X_train, X_test, y_train, y_test = train_test_split( 38 | X, y, test_size=0.2, random_state=1 39 | ) 40 | classifier = LogisticRegression().fit(X_train, y_train) 41 | classifier.score(X_test, y_test) 42 | 43 | # %% [markdown] 44 | # Now let's see what happens when shuffling with a different seed: 45 | 46 | # %% 47 | X_train, X_test, y_train, y_test = train_test_split( 48 | X, y, test_size=0.2, random_state=42 49 | ) 50 | classifier = LogisticRegression().fit(X_train, y_train) 51 | classifier.score(X_test, y_test) 52 | 53 | # %% [markdown] 54 | # It seems that 42 is indeed the Answer to the Ultimate Question of Life, the 55 | # Universe, and Everything!
Or maybe the score of a model depends on the split: 56 | # - the train-test proportion; 57 | # - the representativeness of the elements in each set. 58 | # 59 | # A more systematic way of evaluating the generalization performance of a model 60 | # is through cross-validation, which consists of repeating the split such that 61 | # the training and testing sets are different for each evaluation. 62 | 63 | # %% 64 | from sklearn.model_selection import cross_val_score, ShuffleSplit 65 | 66 | classifier = LogisticRegression() 67 | cv = ShuffleSplit(n_splits=250, test_size=0.2) 68 | 69 | scores = cross_val_score(classifier, X, y, cv=cv) 70 | print( 71 | "The mean cross-validation accuracy is: " 72 | f"{scores.mean():.2f} ± {scores.std():.2f}." 73 | ) 74 | 75 | # %% [markdown] 76 | # Scores have a variability. A sample probabilistic model gives the distribution 77 | # of observed error: if the classification rate is p, the observed distribution 78 | # of correct classifications on a set of size follows a binomial distribution. 79 | # Let's create a function to easily visualize this: 80 | 81 | # %% 82 | import matplotlib.pyplot as plt 83 | import numpy as np 84 | import seaborn as sns 85 | from scipy import stats 86 | 87 | 88 | def plot_error_distrib(classifier, X, y, cv=5): 89 | 90 | n = len(X) 91 | 92 | scores = cross_val_score(classifier, X, y, cv=cv) 93 | distrib = stats.binom(n=n, p=scores.mean()) 94 | 95 | plt.plot( 96 | np.linspace(0, 1, n), 97 | n * distrib.pmf(np.arange(0, n)), 98 | linewidth=2, 99 | color="black", 100 | label="binomial distribution", 101 | ) 102 | sns.histplot(scores, stat="density", label="empirical distribution") 103 | plt.xlim(0, 1) 104 | plt.title("Accuracy: " f"{scores.mean():.2f} ± {scores.std():.2f}.") 105 | plt.legend() 106 | plt.show() 107 | 108 | 109 | plot_error_distrib(classifier, X, y, cv=cv) 110 | 111 | # %% [markdown] 112 | # The empirical distribution is still broader than the theoretical one. This can 113 | # be explained by the fact that as we are retraining the model on each fold, it 114 | # actually fluctuates due the sampling noise in the training data, while the 115 | # model above only accounts for sampling noise in the test data. 116 | # 117 | # The situation does get better with more data: 118 | 119 | # %% 120 | X, y = make_classification(**common_params, n_samples=1_000) 121 | plot_error_distrib(classifier, X, y, cv=cv) 122 | 123 | # %% [markdown] 124 | # Importantly, the standard error of the mean (SEM) across folds is not a good 125 | # measure of this error, as the different data folds are not independent. For 126 | # instance, doing many random splits reduces the variance arbitrarily, but does 127 | # not provide actually new data points. 128 | 129 | # %% 130 | cv = ShuffleSplit(n_splits=10, test_size=0.2) 131 | X, y = make_classification(**common_params, n_samples=400) 132 | scores = cross_val_score(classifier, X, y, cv=cv) 133 | 134 | print( 135 | f"Mean accuracy ± SEM with n_split={cv.get_n_splits()}: " 136 | f"{scores.mean():.3f} ± {stats.sem(scores):.3f}." 137 | ) 138 | 139 | cv = ShuffleSplit(n_splits=100, test_size=0.2) 140 | scores = cross_val_score(classifier, X, y, cv=cv) 141 | 142 | print( 143 | f"Mean accuracy ± SEM with n_split={cv.get_n_splits()}: " 144 | f"{scores.mean():.3f} ± {stats.sem(scores):.3f}." 
145 | ) 146 | 147 | cv = ShuffleSplit(n_splits=500, test_size=0.2) 148 | scores = cross_val_score(classifier, X, y, cv=cv) 149 | 150 | print( 151 | f"Mean accuracy ± SEM with n_split={cv.get_n_splits()}: " 152 | f"{scores.mean():.3f} ± {stats.sem(scores):.3f}." 153 | ) 154 | 155 | # %% [markdown] 156 | # Indeed, the SEM goes to zero as 1/sqrt(`n_splits`). Wrapping up: 157 | # - the more data the better; 158 | # - the more splits, the better the binomial distribution describes the 159 | # variance, but keep in mind that more splits consume more computing 160 | # power; 161 | # - use std instead of SEM to present your results. 162 | # 163 | # Now that we have an intuition about the variability of an evaluation metric, we 164 | # are ready to apply it to our original Diabetes problem: 165 | 166 | # %% 167 | from sklearn.tree import DecisionTreeClassifier 168 | from sklearn.inspection import DecisionBoundaryDisplay 169 | 170 | diabetes_params = { 171 | "n_samples": 10_000, 172 | "n_features": 2, 173 | "n_informative": 2, 174 | "n_redundant": 0, 175 | "n_classes": 2, # binary classification 176 | "shift": [4, 6], 177 | "scale": [10, 25], 178 | "random_state": 0, 179 | } 180 | X, y = make_classification(**diabetes_params, weights=[0.55, 0.45]) 181 | 182 | X_train, X_plot, y_train, y_plot = train_test_split( 183 | X, y, stratify=y, test_size=0.1, random_state=0 184 | ) 185 | 186 | estimator = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train) 187 | 188 | fig, ax = plt.subplots() 189 | disp = DecisionBoundaryDisplay.from_estimator( 190 | estimator, 191 | X_plot, 192 | response_method="predict", 193 | alpha=0.5, 194 | xlabel="age (years)", 195 | ylabel="blood sugar level (mg/dL)", 196 | ax=ax, 197 | ) 198 | scatter = disp.ax_.scatter(X_plot[:, 0], X_plot[:, 1], c=y_plot, edgecolor="k") 199 | disp.ax_.set_title(f"Diabetes test with prevalence = {y.mean():.2f}") 200 | _ = disp.ax_.legend(*scatter.legend_elements()) 201 | 202 | # %% [markdown] 203 | # Notice that the decision boundary changed with respect to the first notebook 204 | # we explored. Let's make a remark: models depend on the prevalence of 205 | # the data they were trained on. Therefore, all metrics (including likelihood ratios) 206 | # depend on prevalence as much as the model depends on it. The difference is that 207 | # likelihood ratios extrapolate through populations of different prevalence for 208 | # a **fixed model**. 209 | # 210 | # Let's compute all the metrics and assess their variability in this case: 211 | 212 | # %% 213 | from collections import defaultdict 214 | import pandas as pd 215 | 216 | cv = ShuffleSplit(n_splits=50, test_size=0.2) 217 | 218 | evaluation = defaultdict(list) 219 | scoring_strategies = [ 220 | "accuracy", 221 | "balanced_accuracy", 222 | "recall", 223 | "precision", 224 | "matthews_corrcoef", 225 | # "positive_likelihood_ratio", 226 | # "neg_negative_likelihood_ratio", 227 | ] 228 | 229 | for score_name in scoring_strategies: 230 | scores = cross_val_score(estimator, X, y, cv=cv, scoring=score_name) 231 | evaluation[score_name] = scores 232 | 233 | evaluation = pd.DataFrame(evaluation).aggregate(["mean", "std"]).T 234 | evaluation["mean"].plot.barh(xerr=evaluation["std"]).set_xlabel("score") 235 | plt.show() 236 | 237 | # %% [markdown] 238 | # Notice that `"positive_likelihood_ratio"` is not bounded from above and 239 | # therefore it can't be directly compared with the other metrics on a single 240 | # plot.
Similarly, the `"neg_negative_likelihood_ratio"` has a reversed sign (is 241 | # negative) to follow the scikit-learn convention for metrics for which a lower 242 | # score is better. 243 | # 244 | # In this case we trained the model on nearly balanced classes. Try changing the 245 | # prevalence and see how the variance of the metrics depend on data imbalance. 246 | -------------------------------------------------------------------------------- /python_scripts/2_roc_pr_curves_tutorial.py: -------------------------------------------------------------------------------- 1 | # %% [markdown] 2 | # Evaluation of non-thresholded prediction 3 | # ======================================== 4 | # 5 | # All statistics that we presented up to now rely on `.predict` which outputs 6 | # the most likely label. We haven’t made use of the probability associated with 7 | # this prediction, which gives the confidence of the classifier in this 8 | # prediction. By default, the prediction of a classifier corresponds to a 9 | # threshold of 0.5 probability in a binary classification problem. Let's build a 10 | # toy dataset to illustrate this. 11 | 12 | # %% 13 | from sklearn.datasets import make_classification 14 | from sklearn.model_selection import train_test_split 15 | 16 | common_params = { 17 | "n_samples": 10_000, 18 | "n_features": 2, 19 | "n_informative": 2, 20 | "n_redundant": 0, 21 | "n_classes": 2, # binary classification 22 | "class_sep": 0.5, 23 | "random_state": 0, 24 | } 25 | X, y = make_classification(**common_params, weights=[0.6, 0.4]) 26 | 27 | X_train, X_test, y_train, y_test = train_test_split( 28 | X, y, stratify=y, random_state=0, test_size=0.02 29 | ) 30 | 31 | # %% [markdown] 32 | # We can quickly check the predicted probabilities to belong to either class 33 | # using a `LogisticRegression`. To ease the visualization we select a subset 34 | # of `n_plot` samples. 35 | 36 | # %% 37 | import pandas as pd 38 | from sklearn.linear_model import LogisticRegression 39 | 40 | n_plot = 10 41 | classifier = LogisticRegression() 42 | classifier.fit(X_train, y_train) 43 | 44 | proba_predicted = pd.DataFrame( 45 | classifier.predict_proba(X_test), columns=classifier.classes_ 46 | ).round(decimals=2) 47 | proba_predicted[:n_plot] 48 | 49 | # %% [markdown] 50 | # Probabilites sum to 1. In the binary case it suffices to retain the 51 | # probability of belonging to the positive class, here shown as an annotation in 52 | # the `DecisionBoundaryDisplay`. Notice that setting 53 | # `response_method="predict_proba"` shows the level curves of the 2D sigmoid 54 | # (logistic curve). 
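# %% [markdown]
# To see this concretely, we can apply the 0.5 threshold ourselves to the
# probability of the positive class and compare the result with the labels
# returned by `.predict`; for this classifier both ways of predicting should
# agree.

# %%
# Threshold the positive-class probability at 0.5 by hand
manual_pred = (classifier.predict_proba(X_test)[:, 1] > 0.5).astype(int)
print(
    "Thresholding `predict_proba` at 0.5 matches `.predict`: "
    f"{(manual_pred == classifier.predict(X_test)).all()}"
)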
55 | 56 | # %% 57 | import matplotlib.pyplot as plt 58 | from matplotlib.colors import ListedColormap 59 | from sklearn.inspection import DecisionBoundaryDisplay 60 | 61 | fig, ax = plt.subplots() 62 | disp = DecisionBoundaryDisplay.from_estimator( 63 | classifier, 64 | X_test, 65 | response_method="predict_proba", 66 | cmap="RdBu", 67 | alpha=0.5, 68 | vmin=0, 69 | vmax=1, 70 | ax=ax, 71 | ) 72 | DecisionBoundaryDisplay.from_estimator( 73 | classifier, 74 | X_test, 75 | response_method="predict_proba", 76 | plot_method="contour", 77 | alpha=0.2, 78 | levels=[0.5], # 0.5 probability contour line 79 | linestyles="--", 80 | linewidths=2, 81 | ax=ax, 82 | ) 83 | scatter = disp.ax_.scatter( 84 | X_test[:n_plot, 0], X_test[:n_plot, 1], c=y_test[:n_plot], 85 | cmap=ListedColormap(["tab:red", "tab:blue"]), 86 | edgecolor="k" 87 | ) 88 | disp.ax_.legend(*scatter.legend_elements(), title="True class", loc="lower right") 89 | for i, proba in enumerate(proba_predicted[:n_plot][1]): 90 | disp.ax_.annotate(proba, (X_test[i, 0], X_test[i, 1]), fontsize="large") 91 | plt.xlim(-2.0, 2.0) 92 | plt.ylim(-4.0, 4.0) 93 | plt.title( 94 | "Probability of belonging to the positive class\n(default decision threshold)" 95 | ) 96 | plt.show() 97 | 98 | # %% [markdown] 99 | # Evaluation of different probability thresholds 100 | # ============================================== 101 | # 102 | # The default decision threshold (0.5) might not be the best threshold that 103 | # leads to optimal generalization performance of our classifier. One can vary 104 | # the decision threshold (and therefore the underlying prediction) and compute 105 | # some evaluation metrics as presented earlier. 106 | # 107 | # Receiver Operating Characteristic curve 108 | # --------------------------------------- 109 | # 110 | # One could be interested in the compromise between accurately discriminating 111 | # both the positive class and the negative classes. The statistics used for this 112 | # are sensitivity and specificity, which measure the proportion of correctly 113 | # classified samples per class. 114 | # 115 | # Sensitivity and specificity are generally plotted as a curve called the 116 | # Receiver Operating Characteristic (ROC) curve. Each point on the graph 117 | # corresponds to a specific decision threshold. Below is such a curve: 118 | 119 | # %% 120 | from sklearn.metrics import RocCurveDisplay 121 | from sklearn.dummy import DummyClassifier 122 | 123 | dummy_classifier = DummyClassifier(strategy="most_frequent") 124 | dummy_classifier.fit(X_train, y_train) 125 | 126 | disp = RocCurveDisplay.from_estimator( 127 | classifier, X_test, y_test, name="LogisticRegression", color="tab:green" 128 | ) 129 | disp = RocCurveDisplay.from_estimator( 130 | dummy_classifier, 131 | X_test, 132 | y_test, 133 | name="chance level", 134 | color="tab:red", 135 | ax=disp.ax_, 136 | ) 137 | plt.xlim(0, 1) 138 | plt.ylim(0, 1) 139 | plt.legend(loc="lower right") 140 | plt.title("ROC curve for LogisticRegression") 141 | plt.show() 142 | 143 | # %% [markdown] 144 | # ROC curves typically feature true positive rate on the Y axis, and false 145 | # positive rate on the X axis. This means that the top left corner of the plot 146 | # is the "ideal" point - a false positive rate of zero, and a true positive rate 147 | # of one. This is not very realistic, but it does mean that a larger area under 148 | # the curve (AUC) is usually better. 
149 | # 150 | # We can compute the area under the ROC curve (using `roc_auc_score`) to 151 | # summarize the generalization performance of a model with a single number, or 152 | # to compare several models across thresholds. 153 | 154 | # %% 155 | from sklearn.ensemble import RandomForestClassifier 156 | from sklearn.ensemble import HistGradientBoostingClassifier 157 | 158 | 159 | classifiers = { 160 | "Hist Gradient Boosting": HistGradientBoostingClassifier(), 161 | "Random Forest": RandomForestClassifier(n_jobs=-1, random_state=1), 162 | "Logistic Regression": LogisticRegression(), 163 | "Chance": DummyClassifier(strategy="most_frequent"), 164 | } 165 | 166 | fig = plt.figure() 167 | ax = plt.axes([0.08, 0.15, 0.78, 0.78]) 168 | 169 | for name, clf in classifiers.items(): 170 | clf.fit(X_train, y_train) 171 | disp = RocCurveDisplay.from_estimator(clf, X_test, y_test, name=name, ax=ax) 172 | plt.xlabel("False positive rate") 173 | plt.ylabel("True positive rate ") 174 | plt.text( 175 | 0.098, 176 | 0.575, 177 | "= sensitivity or recall", 178 | transform=fig.transFigure, 179 | size=7, 180 | rotation="vertical", 181 | ) 182 | plt.xlim(0, 1) 183 | plt.ylim(0, 1) 184 | plt.legend(loc="lower right") 185 | plt.title("ROC curves for several models") 186 | plt.show() 187 | 188 | # %% [markdown] 189 | # It is important to notice that the lower bound of the ROC-AUC is 0.5, 190 | # corresponding to chance level. Indeed, we show the generalization performance 191 | # of a dummy classifier (the red line) to show that even the worst 192 | # generalization performance obtained will be above this line. 193 | 194 | # %% [markdown] 195 | # Precision-Recall curves 196 | # ----------------------- 197 | # 198 | # As mentioned above, maximizing the ROC curve helps finding a compromise 199 | # between accurately discriminating both the positive class and the negative 200 | # classes. If the interest is to focus mainly on the positive class, the 201 | # precision and recall metrics are more appropriated. Similarly to the ROC 202 | # curve, each point in the Precision-Recall curve corresponds to a level of 203 | # probability which we used as a decision threshold. 204 | 205 | # %% 206 | from sklearn.metrics import PrecisionRecallDisplay 207 | 208 | fig = plt.figure() 209 | ax = plt.axes([0.08, 0.15, 0.78, 0.78]) 210 | 211 | for name, clf in classifiers.items(): 212 | clf.fit(X_train, y_train) 213 | disp = PrecisionRecallDisplay.from_estimator(clf, X_test, y_test, name=name, ax=ax) 214 | plt.xlabel("Recall ") 215 | plt.text(0.45, 0.067, "= TPR or sensitivity", transform=fig.transFigure, size=7) 216 | plt.ylabel("Precision ") 217 | plt.text(0.1, 0.6, "= PPV", transform=fig.transFigure, size=7, rotation="vertical") 218 | plt.xlim(0, 1) 219 | plt.ylim(0, 1) 220 | plt.legend(loc="lower right") 221 | plt.title("Precision-recall curve for several models") 222 | plt.show() 223 | 224 | # %% [markdown] 225 | # A classifier with no false positives would have a precision of 1 for all 226 | # recall values. In like manner to the ROC-AUC, the area under the curve can be 227 | # used to characterize the curve in a single number and is named average 228 | # precision (AP). With an ideal classifier, the average precision would be 1. 229 | # 230 | # In this case, notice that the AP of a `DummyClassifier`, used as baseline to 231 | # define the chance level, coincides with the prevalence of the positive class. 232 | # This is analogous to the downside of the accuracy score as shown in the first 233 | # notebook. 
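# %% [markdown]
# A quick way to convince ourselves is to compute the average precision of the
# dummy classifier directly (it is stored under the "Chance" key above) and
# compare it with the prevalence printed in the next cell; the two numbers
# should essentially coincide.

# %%
from sklearn.metrics import average_precision_score

# Constant scores from the dummy classifier: its AP reduces to the prevalence
chance_scores = classifiers["Chance"].predict_proba(X_test)[:, 1]
print(f"AP of the dummy classifier: {average_precision_score(y_test, chance_scores):.3f}")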
234 | 235 | # %% 236 | prevalence = y.mean() 237 | print(f"Prevalence of the positive class: {prevalence:.3f}") 238 | 239 | # %% [markdown] 240 | # Let's see the effect of adding imbalance between classes in our set of models: 241 | 242 | # %% 243 | X, y = make_classification(**common_params, weights=[0.83, 0.17]) 244 | 245 | X_train, X_test, y_train, y_test = train_test_split( 246 | X, y, stratify=y, random_state=0, test_size=0.02 247 | ) 248 | 249 | fig = plt.figure() 250 | ax = plt.axes([0.08, 0.15, 0.78, 0.78]) 251 | 252 | for name, clf in classifiers.items(): 253 | clf.fit(X_train, y_train) 254 | disp = PrecisionRecallDisplay.from_estimator(clf, X_test, y_test, name=name, ax=ax) 255 | plt.xlabel("Recall ") 256 | plt.text(0.45, 0.067, "= TPR or sensitivity", transform=fig.transFigure, size=7) 257 | plt.ylabel("Precision ") 258 | plt.text(0.1, 0.6, "= PPV", transform=fig.transFigure, size=7, rotation="vertical") 259 | plt.xlim(0, 1) 260 | plt.ylim(0, 1) 261 | plt.legend(loc="upper right") 262 | plt.title("Precision-recall curve for several models\nw. imbalanced data") 263 | plt.show() 264 | 265 | # %% [markdown] 266 | # The AP of all models decreased, including the baseline defined by the dummy 267 | # classifier. Indeed, we confirm that AP does not correct for prevalence: it decreases as the positive class becomes rarer. 268 | # 269 | # Conclusions 270 | # =========== 271 | # 272 | # - Consider the prevalence in your target population. It may be that the 273 | # prevalence in your testing sample is not representative of that of the 274 | # target population. In that case, aside from LR+ and LR-, performance metrics 275 | # computed from the testing sample will not be representative of those in the 276 | # target population. 277 | # 278 | # - Never trust a single summary metric (accuracy, balanced accuracy, ROC-AUC, 279 | # etc.), but rather look at all the individual metrics. Understand the 280 | # implications of your choices to find the right tradeoff. 281 | -------------------------------------------------------------------------------- /python_scripts/1_evaluation_tutorial.py: -------------------------------------------------------------------------------- 1 | # %% [markdown] 2 | # 3 | # Accounting for imbalance in evaluation metrics for classification 4 | # ================================================================= 5 | # 6 | # Suppose we have a population of subjects with features `X` that can hopefully 7 | # serve as indicators of a binary class `y` (known ground truth). Additionally, 8 | # suppose the class prevalence (the number of samples in the positive class 9 | # divided by the total number of samples) is very low. 10 | # 11 | # To fix ideas, let's use a medical analogy and think about diabetes. We only 12 | # use two features (age and blood sugar level) to keep the example as simple as 13 | # possible. We use `make_classification` to simulate the distribution of the 14 | # disease and to ensure **the data-generating process is always the same**. We 15 | # set `weights=[0.99, 0.01]` to obtain a prevalence of around 1% which, 16 | # according to [The World 17 | # Bank](https://data.worldbank.org/indicator/SH.STA.DIAB.ZS?most_recent_value_desc=false), 18 | # is the case for the country with the lowest diabetes prevalence in 2022 19 | # (Benin). 20 | # 21 | # In practice, the ideas presented here can be applied in settings where the 22 | # data available to learn and evaluate a classifier has nearly balanced classes, 23 | # such as a case-control study, while the target application, i.e.
the general 24 | # population, has very low prevalence. 25 | 26 | # %% 27 | from sklearn.datasets import make_classification 28 | 29 | common_params = { 30 | "n_samples": 10_000, 31 | "n_features": 2, 32 | "n_informative": 2, 33 | "n_redundant": 0, 34 | "n_classes": 2, # binary classification 35 | "shift": [4, 6], 36 | "scale": [10, 25], 37 | "random_state": 0, 38 | } 39 | X, y = make_classification(**common_params, weights=[0.99, 0.01]) 40 | prevalence = y.mean() 41 | print(f"Percentage of people carrying the disease: {100*prevalence:.2f}%") 42 | 43 | # %% [markdown] 44 | # A simple model is trained to diagnose if a person is likely to have diabetes. 45 | # To estimate the generalization performance of such a model, we do a train-test 46 | # split. 47 | 48 | # %% 49 | from sklearn.model_selection import train_test_split 50 | from sklearn.tree import DecisionTreeClassifier 51 | 52 | X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0) 53 | 54 | estimator = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train) 55 | 56 | # %% [markdown] 57 | # We now show the decision boundary learned by the estimator. Notice that we 58 | # only plot a stratified subset of the original data. 59 | 60 | # %% 61 | import matplotlib.pyplot as plt 62 | from sklearn.inspection import DecisionBoundaryDisplay 63 | 64 | fig, ax = plt.subplots() 65 | disp = DecisionBoundaryDisplay.from_estimator( 66 | estimator, 67 | X_test, 68 | response_method="predict", 69 | alpha=0.5, 70 | xlabel="age (years)", 71 | ylabel="blood sugar level (mg/dL)", 72 | ax=ax, 73 | ) 74 | scatter = disp.ax_.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolor="k") 75 | disp.ax_.set_title(f"Hypothetical diabetes test with prevalence = {y.mean():.2f}") 76 | _ = disp.ax_.legend(*scatter.legend_elements()) 77 | 78 | # %% [markdown] 79 | # The most widely used summary metric is arguably accuracy. Its main advantage 80 | # is a natural interpretation: the proportion of correctly classified samples. 81 | 82 | # %% 83 | from sklearn import metrics 84 | 85 | y_pred = estimator.predict(X_test) 86 | accuracy = metrics.accuracy_score(y_test, y_pred) 87 | print(f"Accuracy on the test set: {accuracy:.3f}") 88 | 89 | # %% [markdown] 90 | # However, it is misleading when the data is imbalanced. Our model performs 91 | # as well as a trivial majority classifier, as the following cells show.
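# %% [markdown]
# Before comparing with such a baseline, a glance at the confusion matrix makes
# the problem concrete: its entries count true negatives, false positives,
# false negatives and true positives, so we can see directly how many of the
# rare positive samples are detected.

# %%
# Rows are true classes, columns are predicted classes
print(metrics.confusion_matrix(y_test, y_pred))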
92 | 93 | # %% 94 | from sklearn.dummy import DummyClassifier 95 | 96 | dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train) 97 | y_dummy = dummy.predict(X_test) 98 | accuracy_dummy = metrics.accuracy_score(y_test, y_dummy) 99 | print(f"Accuracy if Diabetes did not exist: {accuracy_dummy:.3f}") 100 | 101 | # %% [markdown] 102 | # Some of the other metrics are better at describing the flaws of our model: 103 | 104 | # %% 105 | sensitivity = metrics.recall_score(y_test, y_pred) 106 | specificity = metrics.recall_score(y_test, y_pred, pos_label=0) 107 | balanced_acc = metrics.balanced_accuracy_score(y_test, y_pred) 108 | matthews = metrics.matthews_corrcoef(y_test, y_pred) 109 | PPV = metrics.precision_score(y_test, y_pred) 110 | 111 | print(f"Sensitivity on the test set: {sensitivity:.2f}") 112 | print(f"Specificity on the test set: {specificity:.2f}") 113 | print(f"Balanced accuracy on the test set: {balanced_acc:.2f}") 114 | print(f"Matthews correlation coeff on the test set: {matthews:.2f}") 115 | print() 116 | print(f"Probability to have the disease given a positive test: {100*PPV:.2f}%") 117 | 118 | # %% [markdown] 119 | # Our classifier is not informative enough on the general population. The PPV 120 | # and NPV give the information of interest: P(D+ | T+) and P(D− | T−). However, 121 | # they are not intrinsic to the medical test (in other words the trained ML 122 | # model) but also depend on the prevalence and thus on the target population. 123 | # 124 | # The class likelihood ratios (LR±) depend only on sensitivity and specificity 125 | # of the classifier, and not on the prevalence of the study population. For the 126 | # moment it suffices to recall that LR± is defined as 127 | # 128 | # LR± = P(T± | D+) / P(T± | D−) 129 | 130 | # %% 131 | pos_LR, neg_LR = metrics.class_likelihood_ratios(y_test, y_pred) 132 | print(f"LR+ on the test set: {pos_LR:.3f}") # higher is better 133 | print(f"LR- on the test set: {neg_LR:.3f}") # lower is better 134 | 135 | # %% [markdown] 136 | # 137 | # **Caution** 138 | # 139 | # Please notice that if you want to use `metrics.class_likelihood_ratios`, 140 | # you need scikit-learn >= 1.2.0. 141 | #
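# %% [markdown]
# As a consistency check, both ratios can be recomputed by hand from the
# sensitivity and specificity obtained above, since LR+ = sensitivity / (1 -
# specificity) and LR- = (1 - sensitivity) / specificity (assuming the
# classifier produces at least some false positives, so that 1 - specificity is
# not zero). Up to floating-point rounding, these values match the output of
# `metrics.class_likelihood_ratios`.

# %%
print(f"LR+ from sensitivity and specificity: {sensitivity / (1 - specificity):.3f}")
print(f"LR- from sensitivity and specificity: {(1 - sensitivity) / specificity:.3f}")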
142 | # 143 | # Extrapolating between populations 144 | # --------------------------------- 145 | # 146 | # The prevalence can be variable (for instance the prevalence of an infectious 147 | # disease will be variable across time) and a given classifier may be intended 148 | # to be applied in various situations. 149 | # 150 | # According to the World Bank, the diabetes prevalence in the French Polynesia 151 | # in 2022 is above 25%. Let's now evaluate our previously trained model on a 152 | # **different population** with such prevalence and **the same data-generating 153 | # process**. 154 | 155 | X, y = make_classification(**common_params, weights=[0.75, 0.25]) 156 | X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0) 157 | 158 | fig, ax = plt.subplots() 159 | disp = DecisionBoundaryDisplay.from_estimator( 160 | estimator, 161 | X_test, 162 | response_method="predict", 163 | alpha=0.5, 164 | xlabel="age (years)", 165 | ylabel="blood sugar level (mg/dL)", 166 | ax=ax, 167 | ) 168 | scatter = disp.ax_.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolor="k") 169 | disp.ax_.set_title(f"Hypothetical diabetes test with prevalence = {y.mean():.2f}") 170 | _ = disp.ax_.legend(*scatter.legend_elements()) 171 | 172 | # %% 173 | # We then compute the same metrics using a test set with the new 174 | # prevalence: 175 | 176 | y_pred = estimator.predict(X_test) 177 | prevalence = y.mean() 178 | accuracy = metrics.accuracy_score(y_test, y_pred) 179 | sensitivity = metrics.recall_score(y_test, y_pred) 180 | specificity = metrics.recall_score(y_test, y_pred, pos_label=0) 181 | balanced_acc = metrics.balanced_accuracy_score(y_test, y_pred) 182 | matthews = metrics.matthews_corrcoef(y_test, y_pred) 183 | PPV = metrics.precision_score(y_test, y_pred) 184 | 185 | print(f"Accuracy on the test set: {accuracy:.2f}") 186 | print(f"Sensitivity on the test set: {sensitivity:.2f}") 187 | print(f"Specificity on the test set: {specificity:.2f}") 188 | print(f"Balanced accuracy on the test set: {balanced_acc:.2f}") 189 | print(f"Matthews correlation coeff on the test set: {matthews:.2f}") 190 | print() 191 | print(f"Probability to have the disease given a positive test: {100*PPV:.2f}%") 192 | 193 | # %% [markdown] 194 | # The same model seems to perform better on this new dataset. Notice in 195 | # particular that the probability to have the disease given a positive test 196 | # increased. The same blood sugar test is less predictive in Benin than in 197 | # the French Polynesia! 198 | # 199 | # If we really want to score the test and not the dataset, we need a metric that 200 | # does not depend on the prevalence of the study population. 201 | 202 | # %% 203 | pos_LR, neg_LR = metrics.class_likelihood_ratios(y_test, y_pred) 204 | 205 | print(f"LR+ on the test set: {pos_LR:.3f}") 206 | print(f"LR- on the test set: {neg_LR:.3f}") 207 | 208 | # %% [markdown] 209 | # Despite some variations due to residual dataset dependence, the class 210 | # likelihood ratios are mathematically invariant with respect to prevalence. See 211 | # [this example from the User 212 | # Guide](https://scikit-learn.org/dev/auto_examples/model_selection/plot_likelihood_ratios.html#invariance-with-respect-to-prevalence) 213 | # for a demo regarding such property. 214 | # 215 | # Pre-test vs. 
post-test odds 216 | # --------------------------- 217 | # 218 | # Both class likelihood ratios are interpretable in terms of odds: 219 | # 220 | # post-test odds = Likelihood ratio * pre-test odds 221 | # 222 | # The interpretation of LR+ in this case reads: 223 | 224 | # %% 225 | print("The post-test odds that the condition is truly present given a positive " 226 | f"test result are: {pos_LR:.3f} times larger than the pre-test odds.") 227 | 228 | # %% [markdown] 229 | # We found that this diagnostic tool is useful: the post-test odds are larger than the 230 | # pre-test odds. We now choose the pre-test probability to be the prevalence of 231 | # the disease in the held-out testing set. 232 | 233 | # %% 234 | pretest_odds = y_test.mean() / (1 - y_test.mean()) 235 | posttest_odds = pretest_odds * pos_LR 236 | 237 | print(f"Observed pre-test odds: {pretest_odds:.3f}") 238 | print(f"Estimated post-test odds using LR+: {posttest_odds:.3f}") 239 | 240 | # %% [markdown] 241 | # The post-test probability is the probability that an individual truly has 242 | # the condition given a positive test result, i.e. the number of true positives 243 | # divided by the total number of positive test results. In real-life applications this is 244 | # unknown. 245 | 246 | # %% 247 | posttest_prob = posttest_odds / (1 + posttest_odds) 248 | 249 | print(f"Estimated post-test probability using LR+: {posttest_prob:.3f}") 250 | 251 | # %% [markdown] 252 | # We can verify that if we had had access to the true labels, we would have 253 | # obtained the same probabilities: 254 | 255 | # %% 256 | posttest_prob = y_test[y_pred == 1].mean() 257 | 258 | print(f"Observed post-test probability: {posttest_prob:.3f}") 259 | 260 | # %% [markdown] 261 | # Conclusion: if a salesperson from Benin were to sell the model to French Polynesia 262 | # by showing them the 59.84% probability of having the disease given a positive test, 263 | # French Polynesia would never have bought it, even though the model would be quite 264 | # predictive for their own population. The right thing to report is the LR±. 265 | # 266 | # Can you imagine what would happen if the model were trained on nearly balanced classes 267 | # and then extrapolated to other scenarios? 268 | -------------------------------------------------------------------------------- /notebooks/3_uncertainty_in_metrics_tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "604de853", 6 | "metadata": {}, 7 | "source": [ 8 | "\n", 9 | "Uncertainty in evaluation metrics for classification\n", 10 | "====================================================\n", 11 | "\n", 12 | "Has it ever happened to you that one of your colleagues claims their model with\n", 13 | "test score of 0.8001 is better than your model with test score of 0.7998?\n", 14 | "Maybe they are not aware that model-evaluation procedures should gauge not\n", 15 | "only the expected generalization performance, but also its variations. As\n", 16 | "usual, let's build a toy dataset to illustrate this."
17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "id": "096f78da", 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "from sklearn.datasets import make_classification\n", 27 | "\n", 28 | "common_params = {\n", 29 | " \"n_features\": 2,\n", 30 | " \"n_informative\": 2,\n", 31 | " \"n_redundant\": 0,\n", 32 | " \"n_classes\": 2, # binary classification\n", 33 | " \"random_state\": 0,\n", 34 | " \"weights\": [0.55, 0.45],\n", 35 | "}\n", 36 | "X, y = make_classification(**common_params, n_samples=400)\n", 37 | "\n", 38 | "prevalence = y.mean()\n", 39 | "print(f\"Percentage of samples in the positive class: {100*prevalence:.2f}%\")" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "id": "98f132b7", 45 | "metadata": {}, 46 | "source": [ 47 | "We are already familiar with using a a train-test split to estimate the\n", 48 | "generalization performance of a model. By default the `train_test_split` uses\n", 49 | "`shuffle=True`. Let's see what happens if we set a particular seed." 50 | ] 51 | }, 52 | { 53 | "cell_type": "code", 54 | "execution_count": null, 55 | "id": "615abf52", 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "from sklearn.model_selection import train_test_split\n", 60 | "from sklearn.linear_model import LogisticRegression\n", 61 | "\n", 62 | "X_train, X_test, y_train, y_test = train_test_split(\n", 63 | " X, y, test_size=0.2, random_state=1\n", 64 | ")\n", 65 | "classifier = LogisticRegression().fit(X_train, y_train)\n", 66 | "classifier.score(X_test, y_test)" 67 | ] 68 | }, 69 | { 70 | "cell_type": "markdown", 71 | "id": "79141bd1", 72 | "metadata": {}, 73 | "source": [ 74 | "Now let's see what happens when shuffling with a different seed:" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "id": "a75b8b30", 81 | "metadata": {}, 82 | "outputs": [], 83 | "source": [ 84 | "X_train, X_test, y_train, y_test = train_test_split(\n", 85 | " X, y, test_size=0.2, random_state=42\n", 86 | ")\n", 87 | "classifier = LogisticRegression().fit(X_train, y_train)\n", 88 | "classifier.score(X_test, y_test)" 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "id": "bde0b08e", 94 | "metadata": {}, 95 | "source": [ 96 | "It seems that 42 is indeed the Ultimate answer to the Question of Life, the\n", 97 | "Universe, and Everything! Or maybe the score of a model depends on the split:\n", 98 | " - the train-test proportion;\n", 99 | " - the representativeness of the elements in each set.\n", 100 | "\n", 101 | "A more systematic way of evaluating the generalization performance of a model\n", 102 | "is through cross-validation, which consists of repeating the split such that\n", 103 | "the training and testing sets are different for each evaluation." 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": null, 109 | "id": "3178a156", 110 | "metadata": {}, 111 | "outputs": [], 112 | "source": [ 113 | "from sklearn.model_selection import cross_val_score, ShuffleSplit\n", 114 | "\n", 115 | "classifier = LogisticRegression()\n", 116 | "cv = ShuffleSplit(n_splits=250, test_size=0.2)\n", 117 | "\n", 118 | "scores = cross_val_score(classifier, X, y, cv=cv)\n", 119 | "print(\n", 120 | " \"The mean cross-validation accuracy is: \"\n", 121 | " f\"{scores.mean():.2f} ± {scores.std():.2f}.\"\n", 122 | ")" 123 | ] 124 | }, 125 | { 126 | "cell_type": "markdown", 127 | "id": "84a601e3", 128 | "metadata": {}, 129 | "source": [ 130 | "Scores have a variability. 
A sample probabilistic model gives the distribution\n", 131 | "of observed error: if the classification rate is p, the observed distribution\n", 132 | "of correct classifications on a set of size follows a binomial distribution.\n", 133 | "Let's create a function to easily visualize this:" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "id": "99ee9c85", 140 | "metadata": {}, 141 | "outputs": [], 142 | "source": [ 143 | "import matplotlib.pyplot as plt\n", 144 | "import numpy as np\n", 145 | "import seaborn as sns\n", 146 | "from scipy import stats\n", 147 | "\n", 148 | "\n", 149 | "def plot_error_distrib(classifier, X, y, cv=5):\n", 150 | "\n", 151 | " n = len(X)\n", 152 | "\n", 153 | " scores = cross_val_score(classifier, X, y, cv=cv)\n", 154 | " distrib = stats.binom(n=n, p=scores.mean())\n", 155 | "\n", 156 | " plt.plot(\n", 157 | " np.linspace(0, 1, n),\n", 158 | " n * distrib.pmf(np.arange(0, n)),\n", 159 | " linewidth=2,\n", 160 | " color=\"black\",\n", 161 | " label=\"binomial distribution\",\n", 162 | " )\n", 163 | " sns.histplot(scores, stat=\"density\", label=\"empirical distribution\")\n", 164 | " plt.xlim(0, 1)\n", 165 | " plt.title(\"Accuracy: \" f\"{scores.mean():.2f} ± {scores.std():.2f}.\")\n", 166 | " plt.legend()\n", 167 | " plt.show()\n", 168 | "\n", 169 | "\n", 170 | "plot_error_distrib(classifier, X, y, cv=cv)" 171 | ] 172 | }, 173 | { 174 | "cell_type": "markdown", 175 | "id": "80fcb32e", 176 | "metadata": {}, 177 | "source": [ 178 | "The empirical distribution is still broader than the theoretical one. This can\n", 179 | "be explained by the fact that as we are retraining the model on each fold, it\n", 180 | "actually fluctuates due the sampling noise in the training data, while the\n", 181 | "model above only accounts for sampling noise in the test data.\n", 182 | "\n", 183 | "The situation does get better with more data:" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": null, 189 | "id": "442ddefb", 190 | "metadata": {}, 191 | "outputs": [], 192 | "source": [ 193 | "X, y = make_classification(**common_params, n_samples=1_000)\n", 194 | "plot_error_distrib(classifier, X, y, cv=cv)" 195 | ] 196 | }, 197 | { 198 | "cell_type": "markdown", 199 | "id": "b3a7b727", 200 | "metadata": {}, 201 | "source": [ 202 | "Importantly, the standard error of the mean (SEM) across folds is not a good\n", 203 | "measure of this error, as the different data folds are not independent. For\n", 204 | "instance, doing many random splits reduces the variance arbitrarily, but does\n", 205 | "not provide actually new data points." 
206 | ] 207 | }, 208 | { 209 | "cell_type": "code", 210 | "execution_count": null, 211 | "id": "f5c2a788", 212 | "metadata": {}, 213 | "outputs": [], 214 | "source": [ 215 | "cv = ShuffleSplit(n_splits=10, test_size=0.2)\n", 216 | "X, y = make_classification(**common_params, n_samples=400)\n", 217 | "scores = cross_val_score(classifier, X, y, cv=cv)\n", 218 | "\n", 219 | "print(\n", 220 | " f\"Mean accuracy ± SEM with n_split={cv.get_n_splits()}: \"\n", 221 | " f\"{scores.mean():.3f} ± {stats.sem(scores):.3f}.\"\n", 222 | ")\n", 223 | "\n", 224 | "cv = ShuffleSplit(n_splits=100, test_size=0.2)\n", 225 | "scores = cross_val_score(classifier, X, y, cv=cv)\n", 226 | "\n", 227 | "print(\n", 228 | " f\"Mean accuracy ± SEM with n_split={cv.get_n_splits()}: \"\n", 229 | " f\"{scores.mean():.3f} ± {stats.sem(scores):.3f}.\"\n", 230 | ")\n", 231 | "\n", 232 | "cv = ShuffleSplit(n_splits=500, test_size=0.2)\n", 233 | "scores = cross_val_score(classifier, X, y, cv=cv)\n", 234 | "\n", 235 | "print(\n", 236 | " f\"Mean accuracy ± SEM with n_split={cv.get_n_splits()}: \"\n", 237 | " f\"{scores.mean():.3f} ± {stats.sem(scores):.3f}.\"\n", 238 | ")" 239 | ] 240 | }, 241 | { 242 | "cell_type": "markdown", 243 | "id": "22457177", 244 | "metadata": {}, 245 | "source": [ 246 | "Indeed, the SEM goes to zero as 1/sqrt{`n_splits`}. Wraping-up:\n", 247 | "- the more data the better;\n", 248 | "- the more splits, the more descriptive of the variance is the binomial\n", 249 | " distribution, but keep in mind that more splits consume more computing\n", 250 | " power;\n", 251 | "- use std instead of SEM to present your results.\n", 252 | "\n", 253 | "Now that we have an intuition on the variability of an evaluation metric, we\n", 254 | "are ready to apply it to our original Diabetes problem:" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": null, 260 | "id": "c2e55603", 261 | "metadata": {}, 262 | "outputs": [], 263 | "source": [ 264 | "from sklearn.tree import DecisionTreeClassifier\n", 265 | "from sklearn.inspection import DecisionBoundaryDisplay\n", 266 | "\n", 267 | "diabetes_params = {\n", 268 | " \"n_samples\": 10_000,\n", 269 | " \"n_features\": 2,\n", 270 | " \"n_informative\": 2,\n", 271 | " \"n_redundant\": 0,\n", 272 | " \"n_classes\": 2, # binary classification\n", 273 | " \"shift\": [4, 6],\n", 274 | " \"scale\": [10, 25],\n", 275 | " \"random_state\": 0,\n", 276 | "}\n", 277 | "X, y = make_classification(**diabetes_params, weights=[0.55, 0.45])\n", 278 | "\n", 279 | "X_train, X_plot, y_train, y_plot = train_test_split(\n", 280 | " X, y, stratify=y, test_size=0.1, random_state=0\n", 281 | ")\n", 282 | "\n", 283 | "estimator = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)\n", 284 | "\n", 285 | "fig, ax = plt.subplots()\n", 286 | "disp = DecisionBoundaryDisplay.from_estimator(\n", 287 | " estimator,\n", 288 | " X_plot,\n", 289 | " response_method=\"predict\",\n", 290 | " alpha=0.5,\n", 291 | " xlabel=\"age (years)\",\n", 292 | " ylabel=\"blood sugar level (mg/dL)\",\n", 293 | " ax=ax,\n", 294 | ")\n", 295 | "scatter = disp.ax_.scatter(X_plot[:, 0], X_plot[:, 1], c=y_plot, edgecolor=\"k\")\n", 296 | "disp.ax_.set_title(f\"Diabetes test with prevalence = {y.mean():.2f}\")\n", 297 | "_ = disp.ax_.legend(*scatter.legend_elements())" 298 | ] 299 | }, 300 | { 301 | "cell_type": "markdown", 302 | "id": "27cad221", 303 | "metadata": {}, 304 | "source": [ 305 | "Notice that the decision boundary changed with respect to the first notebook\n", 306 | "we 
explored. Let's make a remark: models depend on the prevalence of\n", 307 | "the data they were trained on. Therefore, all metrics (including likelihood ratios)\n", 308 | "depend on prevalence as much as the model depends on it. The difference is that\n", 309 | "likelihood ratios extrapolate through populations of different prevalence for\n", 310 | "a **fixed model**.\n", 311 | "\n", 312 | "Let's compute all the metrics and assez their variability in this case:" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": null, 318 | "id": "277a758a", 319 | "metadata": {}, 320 | "outputs": [], 321 | "source": [ 322 | "from collections import defaultdict\n", 323 | "import pandas as pd\n", 324 | "\n", 325 | "cv = ShuffleSplit(n_splits=50, test_size=0.2)\n", 326 | "\n", 327 | "evaluation = defaultdict(list)\n", 328 | "scoring_strategies = [\n", 329 | " \"accuracy\",\n", 330 | " \"balanced_accuracy\",\n", 331 | " \"recall\",\n", 332 | " \"precision\",\n", 333 | " \"matthews_corrcoef\",\n", 334 | " # \"positive_likelihood_ratio\",\n", 335 | " # \"neg_negative_likelihood_ratio\",\n", 336 | "]\n", 337 | "\n", 338 | "for score_name in scoring_strategies:\n", 339 | " scores = cross_val_score(estimator, X, y, cv=cv, scoring=score_name)\n", 340 | " evaluation[score_name] = scores\n", 341 | "\n", 342 | "evaluation = pd.DataFrame(evaluation).aggregate([\"mean\", \"std\"]).T\n", 343 | "evaluation[\"mean\"].plot.barh(xerr=evaluation[\"std\"]).set_xlabel(\"score\")\n", 344 | "plt.show()" 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "id": "812bdbd6", 350 | "metadata": {}, 351 | "source": [ 352 | "Notice that `\"positive_likelihood_ratio\"` is not bounded from above and\n", 353 | "therefore it can't be directly compared with the other metrics on a single\n", 354 | "plot. Similarly, the `\"neg_negative_likelihood_ratio\"` has a reversed sign (is\n", 355 | "negative) to follow the scikit-learn convention for metrics for which a lower\n", 356 | "score is better.\n", 357 | "\n", 358 | "In this case we trained the model on nearly balanced classes. Try changing the\n", 359 | "prevalence and see how the variance of the metrics depend on data imbalance." 360 | ] 361 | } 362 | ], 363 | "metadata": { 364 | "jupytext": { 365 | "cell_metadata_filter": "-all", 366 | "main_language": "python", 367 | "notebook_metadata_filter": "-all" 368 | } 369 | }, 370 | "nbformat": 4, 371 | "nbformat_minor": 5 372 | } 373 | -------------------------------------------------------------------------------- /notebooks/2_roc_pr_curves_tutorial.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "1c0200d9", 6 | "metadata": {}, 7 | "source": [ 8 | "Evaluation of non-thresholded prediction\n", 9 | "========================================\n", 10 | "\n", 11 | "All statistics that we presented up to now rely on `.predict` which outputs\n", 12 | "the most likely label. We haven’t made use of the probability associated with\n", 13 | "this prediction, which gives the confidence of the classifier in this\n", 14 | "prediction. By default, the prediction of a classifier corresponds to a\n", 15 | "threshold of 0.5 probability in a binary classification problem. Let's build a\n", 16 | "toy dataset to illustrate this." 
17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": null, 22 | "id": "2cee4d2f", 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "from sklearn.datasets import make_classification\n", 27 | "from sklearn.model_selection import train_test_split\n", 28 | "\n", 29 | "common_params = {\n", 30 | " \"n_samples\": 10_000,\n", 31 | " \"n_features\": 2,\n", 32 | " \"n_informative\": 2,\n", 33 | " \"n_redundant\": 0,\n", 34 | " \"n_classes\": 2, # binary classification\n", 35 | " \"class_sep\": 0.5,\n", 36 | " \"random_state\": 0,\n", 37 | "}\n", 38 | "X, y = make_classification(**common_params, weights=[0.6, 0.4])\n", 39 | "\n", 40 | "X_train, X_test, y_train, y_test = train_test_split(\n", 41 | " X, y, stratify=y, random_state=0, test_size=0.02\n", 42 | ")" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "id": "aa2bc2df", 48 | "metadata": {}, 49 | "source": [ 50 | "We can quickly check the predicted probabilities to belong to either class\n", 51 | "using a `LogisticRegression`. To ease the visualization we select a subset\n", 52 | "of `n_plot` samples." 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": null, 58 | "id": "78a544d8", 59 | "metadata": {}, 60 | "outputs": [], 61 | "source": [ 62 | "import pandas as pd\n", 63 | "from sklearn.linear_model import LogisticRegression\n", 64 | "\n", 65 | "n_plot = 10\n", 66 | "classifier = LogisticRegression()\n", 67 | "classifier.fit(X_train, y_train)\n", 68 | "\n", 69 | "proba_predicted = pd.DataFrame(\n", 70 | " classifier.predict_proba(X_test), columns=classifier.classes_\n", 71 | ").round(decimals=2)\n", 72 | "proba_predicted[:n_plot]" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "id": "807a2305", 78 | "metadata": {}, 79 | "source": [ 80 | "Probabilites sum to 1. In the binary case it suffices to retain the\n", 81 | "probability of belonging to the positive class, here shown as an annotation in\n", 82 | "the `DecisionBoundaryDisplay`. Notice that setting\n", 83 | "`response_method=\"predict_proba\"` shows the level curves of the 2D sigmoid\n", 84 | "(logistic curve)." 
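To see where the sigmoid shape of those level curves comes from, here is a small optional sketch: for a plain binary `LogisticRegression` (no calibration), the positive-class probability is the logistic sigmoid of `decision_function`, the signed distance to the decision boundary. The toy data below is arbitrary and only used to inspect the probability output.

```python
import numpy as np
from scipy.special import expit  # the logistic sigmoid
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Arbitrary toy data.
X, y = make_classification(
    n_samples=500, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)

# The two columns are complementary: each row sums to one.
print("rows sum to 1:", np.allclose(proba.sum(axis=1), 1.0))

# The positive-class probability is the sigmoid of the signed distance to the
# decision boundary, hence the 2D sigmoid level curves shown in the plot.
print(
    "P(y=1) == sigmoid(decision_function):",
    np.allclose(proba[:, 1], expit(clf.decision_function(X))),
)
```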
85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": null, 90 | "id": "cbaf5e5d", 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "import matplotlib.pyplot as plt\n", 95 | "from matplotlib.colors import ListedColormap\n", 96 | "from sklearn.inspection import DecisionBoundaryDisplay\n", 97 | "\n", 98 | "fig, ax = plt.subplots()\n", 99 | "disp = DecisionBoundaryDisplay.from_estimator(\n", 100 | " classifier,\n", 101 | " X_test,\n", 102 | " response_method=\"predict_proba\",\n", 103 | " cmap=\"RdBu\",\n", 104 | " alpha=0.5,\n", 105 | " vmin=0,\n", 106 | " vmax=1,\n", 107 | " ax=ax,\n", 108 | ")\n", 109 | "DecisionBoundaryDisplay.from_estimator(\n", 110 | " classifier,\n", 111 | " X_test,\n", 112 | " response_method=\"predict_proba\",\n", 113 | " plot_method=\"contour\",\n", 114 | " alpha=0.2,\n", 115 | " levels=[0.5], # 0.5 probability contour line\n", 116 | " linestyles=\"--\",\n", 117 | " linewidths=2,\n", 118 | " ax=ax,\n", 119 | ")\n", 120 | "scatter = disp.ax_.scatter(\n", 121 | " X_test[:n_plot, 0], X_test[:n_plot, 1], c=y_test[:n_plot], \n", 122 | " cmap=ListedColormap([\"tab:red\", \"tab:blue\"]),\n", 123 | " edgecolor=\"k\"\n", 124 | ")\n", 125 | "disp.ax_.legend(*scatter.legend_elements(), title=\"True class\", loc=\"lower right\")\n", 126 | "for i, proba in enumerate(proba_predicted[:n_plot][1]):\n", 127 | " disp.ax_.annotate(proba, (X_test[i, 0], X_test[i, 1]), fontsize=\"large\")\n", 128 | "plt.xlim(-2.0, 2.0)\n", 129 | "plt.ylim(-4.0, 4.0)\n", 130 | "plt.title(\n", 131 | " \"Probability of belonging to the positive class\\n(default decision threshold)\"\n", 132 | ")\n", 133 | "plt.show()" 134 | ] 135 | }, 136 | { 137 | "cell_type": "markdown", 138 | "id": "c6b3f802", 139 | "metadata": {}, 140 | "source": [ 141 | "Evaluation of different probability thresholds\n", 142 | "==============================================\n", 143 | "\n", 144 | "The default decision threshold (0.5) might not be the best threshold that\n", 145 | "leads to optimal generalization performance of our classifier. One can vary\n", 146 | "the decision threshold (and therefore the underlying prediction) and compute\n", 147 | "some evaluation metrics as presented earlier.\n", 148 | "\n", 149 | "Receiver Operating Characteristic curve\n", 150 | "---------------------------------------\n", 151 | "\n", 152 | "One could be interested in the compromise between accurately discriminating\n", 153 | "both the positive class and the negative classes. The statistics used for this\n", 154 | "are sensitivity and specificity, which measure the proportion of correctly\n", 155 | "classified samples per class.\n", 156 | "\n", 157 | "Sensitivity and specificity are generally plotted as a curve called the\n", 158 | "Receiver Operating Characteristic (ROC) curve. Each point on the graph\n", 159 | "corresponds to a specific decision threshold. 
Below is such a curve:" 160 | ] 161 | }, 162 | { 163 | "cell_type": "code", 164 | "execution_count": null, 165 | "id": "dbed0c9e", 166 | "metadata": {}, 167 | "outputs": [], 168 | "source": [ 169 | "from sklearn.metrics import RocCurveDisplay\n", 170 | "from sklearn.dummy import DummyClassifier\n", 171 | "\n", 172 | "dummy_classifier = DummyClassifier(strategy=\"most_frequent\")\n", 173 | "dummy_classifier.fit(X_train, y_train)\n", 174 | "\n", 175 | "disp = RocCurveDisplay.from_estimator(\n", 176 | " classifier, X_test, y_test, name=\"LogisticRegression\", color=\"tab:green\"\n", 177 | ")\n", 178 | "disp = RocCurveDisplay.from_estimator(\n", 179 | " dummy_classifier,\n", 180 | " X_test,\n", 181 | " y_test,\n", 182 | " name=\"chance level\",\n", 183 | " color=\"tab:red\",\n", 184 | " ax=disp.ax_,\n", 185 | ")\n", 186 | "plt.xlim(0, 1)\n", 187 | "plt.ylim(0, 1)\n", 188 | "plt.legend(loc=\"lower right\")\n", 189 | "plt.title(\"ROC curve for LogisticRegression\")\n", 190 | "plt.show()" 191 | ] 192 | }, 193 | { 194 | "cell_type": "markdown", 195 | "id": "1837bd36", 196 | "metadata": {}, 197 | "source": [ 198 | "ROC curves typically feature true positive rate on the Y axis, and false\n", 199 | "positive rate on the X axis. This means that the top left corner of the plot\n", 200 | "is the \"ideal\" point - a false positive rate of zero, and a true positive rate\n", 201 | "of one. This is not very realistic, but it does mean that a larger area under\n", 202 | "the curve (AUC) is usually better.\n", 203 | "\n", 204 | "We can compute the area under the ROC curve (using `roc_auc_score`) to\n", 205 | "summarize the generalization performance of a model with a single number, or\n", 206 | "to compare several models across thresholds." 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": null, 212 | "id": "72cb955a", 213 | "metadata": {}, 214 | "outputs": [], 215 | "source": [ 216 | "from sklearn.ensemble import RandomForestClassifier\n", 217 | "from sklearn.ensemble import HistGradientBoostingClassifier\n", 218 | "\n", 219 | "\n", 220 | "classifiers = {\n", 221 | " \"Hist Gradient Boosting\": HistGradientBoostingClassifier(),\n", 222 | " \"Random Forest\": RandomForestClassifier(n_jobs=-1, random_state=1),\n", 223 | " \"Logistic Regression\": LogisticRegression(),\n", 224 | " \"Chance\": DummyClassifier(strategy=\"most_frequent\"),\n", 225 | "}\n", 226 | "\n", 227 | "fig = plt.figure()\n", 228 | "ax = plt.axes([0.08, 0.15, 0.78, 0.78])\n", 229 | "\n", 230 | "for name, clf in classifiers.items():\n", 231 | " clf.fit(X_train, y_train)\n", 232 | " disp = RocCurveDisplay.from_estimator(clf, X_test, y_test, name=name, ax=ax)\n", 233 | "plt.xlabel(\"False positive rate\")\n", 234 | "plt.ylabel(\"True positive rate \")\n", 235 | "plt.text(\n", 236 | " 0.098,\n", 237 | " 0.575,\n", 238 | " \"= sensitivity or recall\",\n", 239 | " transform=fig.transFigure,\n", 240 | " size=7,\n", 241 | " rotation=\"vertical\",\n", 242 | ")\n", 243 | "plt.xlim(0, 1)\n", 244 | "plt.ylim(0, 1)\n", 245 | "plt.legend(loc=\"lower right\")\n", 246 | "plt.title(\"ROC curves for several models\")\n", 247 | "plt.show()" 248 | ] 249 | }, 250 | { 251 | "cell_type": "markdown", 252 | "id": "e0ef75b4", 253 | "metadata": {}, 254 | "source": [ 255 | "It is important to notice that the lower bound of the ROC-AUC is 0.5,\n", 256 | "corresponding to chance level. 
Indeed, we show the generalization performance\n", 257 | "of a dummy classifier (the red line) to show that even the worst\n", 258 | "generalization performance obtained will be above this line." 259 | ] 260 | }, 261 | { 262 | "cell_type": "markdown", 263 | "id": "df5f19d6", 264 | "metadata": {}, 265 | "source": [ 266 | "Precision-Recall curves\n", 267 | "-----------------------\n", 268 | "\n", 269 | "As mentioned above, maximizing the ROC curve helps finding a compromise\n", 270 | "between accurately discriminating both the positive class and the negative\n", 271 | "classes. If the interest is to focus mainly on the positive class, the\n", 272 | "precision and recall metrics are more appropriated. Similarly to the ROC\n", 273 | "curve, each point in the Precision-Recall curve corresponds to a level of\n", 274 | "probability which we used as a decision threshold." 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": null, 280 | "id": "44183db6", 281 | "metadata": {}, 282 | "outputs": [], 283 | "source": [ 284 | "from sklearn.metrics import PrecisionRecallDisplay\n", 285 | "\n", 286 | "fig = plt.figure()\n", 287 | "ax = plt.axes([0.08, 0.15, 0.78, 0.78])\n", 288 | "\n", 289 | "for name, clf in classifiers.items():\n", 290 | " clf.fit(X_train, y_train)\n", 291 | " disp = PrecisionRecallDisplay.from_estimator(clf, X_test, y_test, name=name, ax=ax)\n", 292 | "plt.xlabel(\"Recall \")\n", 293 | "plt.text(0.45, 0.067, \"= TPR or sensitivity\", transform=fig.transFigure, size=7)\n", 294 | "plt.ylabel(\"Precision \")\n", 295 | "plt.text(0.1, 0.6, \"= PPV\", transform=fig.transFigure, size=7, rotation=\"vertical\")\n", 296 | "plt.xlim(0, 1)\n", 297 | "plt.ylim(0, 1)\n", 298 | "plt.legend(loc=\"lower right\")\n", 299 | "plt.title(\"Precision-recall curve for several models\")\n", 300 | "plt.show()" 301 | ] 302 | }, 303 | { 304 | "cell_type": "markdown", 305 | "id": "397e60b7", 306 | "metadata": {}, 307 | "source": [ 308 | "A classifier with no false positives would have a precision of 1 for all\n", 309 | "recall values. In like manner to the ROC-AUC, the area under the curve can be\n", 310 | "used to characterize the curve in a single number and is named average\n", 311 | "precision (AP). With an ideal classifier, the average precision would be 1.\n", 312 | "\n", 313 | "In this case, notice that the AP of a `DummyClassifier`, used as baseline to\n", 314 | "define the chance level, coincides with the prevalence of the positive class.\n", 315 | "This is analogous to the downside of the accuracy score as shown in the first\n", 316 | "notebook." 
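The following optional snippet illustrates that claim numerically: a chance-level classifier that assigns the same score to every sample has an average precision equal to the prevalence of the positive class. The dataset parameters are simplified stand-ins for the tutorial's `common_params`, not the exact values used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score

# Stand-in data with roughly 40% positives.
X, y = make_classification(
    n_samples=10_000, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.6, 0.4], random_state=0
)

# A chance-level "classifier" gives every sample the same score: its precision
# is the prevalence whatever the recall, so its average precision equals the
# prevalence itself.
constant_scores = np.full(shape=y.shape, fill_value=0.5)

print(f"Prevalence of the positive class:      {y.mean():.3f}")
print(f"Average precision of a constant score: {average_precision_score(y, constant_scores):.3f}")
```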
317 | ]
318 | },
319 | {
320 | "cell_type": "code",
321 | "execution_count": null,
322 | "id": "0475287c",
323 | "metadata": {},
324 | "outputs": [],
325 | "source": [
326 | "prevalence = y.mean()\n",
327 | "print(f\"Prevalence of the positive class: {prevalence:.3f}\")"
328 | ]
329 | },
330 | {
331 | "cell_type": "markdown",
332 | "id": "415c7cc9",
333 | "metadata": {},
334 | "source": [
335 | "Let's see the effect of adding imbalance between the classes on our set of models:"
336 | ]
337 | },
338 | {
339 | "cell_type": "code",
340 | "execution_count": null,
341 | "id": "50944a15",
342 | "metadata": {},
343 | "outputs": [],
344 | "source": [
345 | "X, y = make_classification(**common_params, weights=[0.83, 0.17])\n",
346 | "\n",
347 | "X_train, X_test, y_train, y_test = train_test_split(\n",
348 | "    X, y, stratify=y, random_state=0, test_size=0.02\n",
349 | ")\n",
350 | "\n",
351 | "fig = plt.figure()\n",
352 | "ax = plt.axes([0.08, 0.15, 0.78, 0.78])\n",
353 | "\n",
354 | "for name, clf in classifiers.items():\n",
355 | "    clf.fit(X_train, y_train)\n",
356 | "    disp = PrecisionRecallDisplay.from_estimator(clf, X_test, y_test, name=name, ax=ax)\n",
357 | "plt.xlabel(\"Recall \")\n",
358 | "plt.text(0.45, 0.067, \"= TPR or sensitivity\", transform=fig.transFigure, size=7)\n",
359 | "plt.ylabel(\"Precision \")\n",
360 | "plt.text(0.1, 0.6, \"= PPV\", transform=fig.transFigure, size=7, rotation=\"vertical\")\n",
361 | "plt.xlim(0, 1)\n",
362 | "plt.ylim(0, 1)\n",
363 | "plt.legend(loc=\"upper right\")\n",
364 | "plt.title(\"Precision-recall curve for several models\\nw. imbalanced data\")\n",
365 | "plt.show()"
366 | ]
367 | },
368 | {
369 | "cell_type": "markdown",
370 | "id": "f97d04b3",
371 | "metadata": {},
372 | "source": [
373 | "The AP of all models decreased, including the baseline defined by the dummy\n",
374 | "classifier. Indeed, we confirm that AP is not invariant to prevalence: its chance-level baseline follows the prevalence of the positive class.\n",
375 | "\n",
376 | "Conclusions\n",
377 | "===========\n",
378 | "\n",
379 | "- Consider the prevalence in your target population. It may be that the\n",
380 | "  prevalence in your testing sample is not representative of that of the\n",
381 | "  target population. In that case, aside from LR+ and LR-, performance metrics\n",
382 | "  computed from the testing sample will not be representative of those in the\n",
383 | "  target population.\n",
384 | "\n",
385 | "- Never trust a single summary metric (accuracy, balanced accuracy, ROC-AUC,\n",
386 | "  etc.), but rather look at all the individual metrics. Understand the\n",
387 | "  implications of your choices to find the right tradeoff."
388 | ]
389 | }
390 | ],
391 | "metadata": {
392 | "jupytext": {
393 | "cell_metadata_filter": "-all",
394 | "main_language": "python",
395 | "notebook_metadata_filter": "-all"
396 | }
397 | },
398 | "nbformat": 4,
399 | "nbformat_minor": 5
400 | }
401 | --------------------------------------------------------------------------------
/notebooks/1_evaluation_tutorial.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "a97407df",
6 | "metadata": {},
7 | "source": [
8 | "\n",
9 | "Accounting for imbalance in evaluation metrics for classification\n",
10 | "=================================================================\n",
11 | "\n",
12 | "Suppose we have a population of subjects with features `X` that can hopefully\n",
13 | "serve as indicators of a binary class `y` (known ground truth). Additionally,\n",
14 | "suppose the class prevalence (the number of samples in the positive class\n",
15 | "divided by the total number of samples) is very low.\n",
16 | "\n",
17 | "To fix ideas, let's use a medical analogy and think about diabetes. We only\n",
18 | "use two features (age and blood sugar level) to keep the example as simple as\n",
19 | "possible. We use `make_classification` to simulate the distribution of the\n",
20 | "disease and to ensure **the data-generating process is always the same**. We\n",
21 | "set `weights=[0.99, 0.01]` to obtain a prevalence of around 1% which,\n",
22 | "according to [The World\n",
23 | "Bank](https://data.worldbank.org/indicator/SH.STA.DIAB.ZS?most_recent_value_desc=false),\n",
24 | "is the case for the country with the lowest diabetes prevalence in 2022\n",
25 | "(Benin).\n",
26 | "\n",
27 | "In practice, the ideas presented here can be applied in settings where the\n",
28 | "data available to learn and evaluate a classifier has nearly balanced classes,\n",
29 | "such as a case-control study, while the target application, i.e. the general\n",
30 | "population, has very low prevalence."
31 | ]
32 | },
33 | {
34 | "cell_type": "code",
35 | "execution_count": null,
36 | "id": "ba2ba7a7",
37 | "metadata": {},
38 | "outputs": [],
39 | "source": [
40 | "from sklearn.datasets import make_classification\n",
41 | "\n",
42 | "common_params = {\n",
43 | "    \"n_samples\": 10_000,\n",
44 | "    \"n_features\": 2,\n",
45 | "    \"n_informative\": 2,\n",
46 | "    \"n_redundant\": 0,\n",
47 | "    \"n_classes\": 2,  # binary classification\n",
48 | "    \"shift\": [4, 6],\n",
49 | "    \"scale\": [10, 25],\n",
50 | "    \"random_state\": 0,\n",
51 | "}\n",
52 | "X, y = make_classification(**common_params, weights=[0.99, 0.01])\n",
53 | "prevalence = y.mean()\n",
54 | "print(f\"Percentage of people carrying the disease: {100*prevalence:.2f}%\")"
55 | ]
56 | },
57 | {
58 | "cell_type": "markdown",
59 | "id": "68480c16",
60 | "metadata": {},
61 | "source": [
62 | "A simple model is trained to diagnose if a person is likely to have diabetes.\n",
63 | "To estimate the generalization performance of such a model, we do a train-test\n",
64 | "split."
65 | ]
66 | },
67 | {
68 | "cell_type": "code",
69 | "execution_count": null,
70 | "id": "dbda2ce1",
71 | "metadata": {},
72 | "outputs": [],
73 | "source": [
74 | "from sklearn.model_selection import train_test_split\n",
75 | "from sklearn.tree import DecisionTreeClassifier\n",
76 | "\n",
77 | "X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)\n",
78 | "\n",
79 | "estimator = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)"
80 | ]
81 | },
82 | {
83 | "cell_type": "markdown",
84 | "id": "46de5abd",
85 | "metadata": {},
86 | "source": [
87 | "We now show the decision boundary learned by the estimator. Notice that we\n",
88 | "only plot a stratified subset of the original data."
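As a side note, the sketch below shows why `stratify=y` matters here: both resulting splits keep roughly the original prevalence, which is important when positives are rare. The parameters mirror the cell above but are simplified stand-ins rather than the tutorial's exact configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Low-prevalence data, similar in spirit to the simulation above.
X, y = make_classification(
    n_samples=10_000, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.99, 0.01], random_state=0
)

# `stratify=y` makes both splits keep (approximately) the original prevalence,
# which matters when the positive class is rare.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

print(f"Prevalence overall:  {y.mean():.4f}")
print(f"Prevalence in train: {y_train.mean():.4f}")
print(f"Prevalence in test:  {y_test.mean():.4f}")
```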
89 | ]
90 | },
91 | {
92 | "cell_type": "code",
93 | "execution_count": null,
94 | "id": "42c34246",
95 | "metadata": {},
96 | "outputs": [],
97 | "source": [
98 | "import matplotlib.pyplot as plt\n",
99 | "from sklearn.inspection import DecisionBoundaryDisplay\n",
100 | "\n",
101 | "fig, ax = plt.subplots()\n",
102 | "disp = DecisionBoundaryDisplay.from_estimator(\n",
103 | "    estimator,\n",
104 | "    X_test,\n",
105 | "    response_method=\"predict\",\n",
106 | "    alpha=0.5,\n",
107 | "    xlabel=\"age (years)\",\n",
108 | "    ylabel=\"blood sugar level (mg/dL)\",\n",
109 | "    ax=ax,\n",
110 | ")\n",
111 | "scatter = disp.ax_.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolor=\"k\")\n",
112 | "disp.ax_.set_title(f\"Hypothetical diabetes test with prevalence = {y.mean():.2f}\")\n",
113 | "_ = disp.ax_.legend(*scatter.legend_elements())"
114 | ]
115 | },
116 | {
117 | "cell_type": "markdown",
118 | "id": "0976dd3e",
119 | "metadata": {},
120 | "source": [
121 | "The most widely used summary metric is arguably accuracy. Its main advantage\n",
122 | "is a natural interpretation: the proportion of correctly classified samples."
123 | ]
124 | },
125 | {
126 | "cell_type": "code",
127 | "execution_count": null,
128 | "id": "2a42dfdf",
129 | "metadata": {},
130 | "outputs": [],
131 | "source": [
132 | "from sklearn import metrics\n",
133 | "\n",
134 | "y_pred = estimator.predict(X_test)\n",
135 | "accuracy = metrics.accuracy_score(y_test, y_pred)\n",
136 | "print(f\"Accuracy on the test set: {accuracy:.3f}\")"
137 | ]
138 | },
139 | {
140 | "cell_type": "markdown",
141 | "id": "460f1449",
142 | "metadata": {},
143 | "source": [
144 | "However, it is misleading when the data is imbalanced. Our model performs\n",
145 | "as well as a trivial majority classifier."
146 | ]
147 | },
148 | {
149 | "cell_type": "code",
150 | "execution_count": null,
151 | "id": "6fcf2b77",
152 | "metadata": {},
153 | "outputs": [],
154 | "source": [
155 | "from sklearn.dummy import DummyClassifier\n",
156 | "\n",
157 | "dummy = DummyClassifier(strategy=\"most_frequent\").fit(X_train, y_train)\n",
158 | "y_dummy = dummy.predict(X_test)\n",
159 | "accuracy_dummy = metrics.accuracy_score(y_test, y_dummy)\n",
160 | "print(f\"Accuracy when always predicting 'no diabetes': {accuracy_dummy:.3f}\")"
161 | ]
162 | },
163 | {
164 | "cell_type": "markdown",
165 | "id": "d2ce15e0",
166 | "metadata": {},
167 | "source": [
168 | "Some of the other metrics are better at describing the flaws of our model:"
169 | ]
170 | },
171 | {
172 | "cell_type": "code",
173 | "execution_count": null,
174 | "id": "3bc30b2b",
175 | "metadata": {},
176 | "outputs": [],
177 | "source": [
178 | "sensitivity = metrics.recall_score(y_test, y_pred)\n",
179 | "specificity = metrics.recall_score(y_test, y_pred, pos_label=0)\n",
180 | "balanced_acc = metrics.balanced_accuracy_score(y_test, y_pred)\n",
181 | "matthews = metrics.matthews_corrcoef(y_test, y_pred)\n",
182 | "PPV = metrics.precision_score(y_test, y_pred)\n",
183 | "\n",
184 | "print(f\"Sensitivity on the test set: {sensitivity:.2f}\")\n",
185 | "print(f\"Specificity on the test set: {specificity:.2f}\")\n",
186 | "print(f\"Balanced accuracy on the test set: {balanced_acc:.2f}\")\n",
187 | "print(f\"Matthews correlation coeff on the test set: {matthews:.2f}\")\n",
188 | "print()\n",
189 | "print(f\"Probability to have the disease given a positive test: {100*PPV:.2f}%\")"
190 | ]
191 | },
192 | {
193 | "cell_type": "markdown",
194 | "id": "48b03751",
195 | "metadata": {},
196 | "source": [
197 | "Our classifier is not informative enough for the general population. The PPV\n",
198 | "and NPV give the information of interest: P(D+ | T+) and P(D− | T−). However,\n",
199 | "they are not intrinsic to the medical test (in other words, the trained ML\n",
200 | "model): they also depend on the prevalence and thus on the target population.\n",
201 | "\n",
202 | "The class likelihood ratios (LR±) depend only on sensitivity and specificity\n",
203 | "of the classifier, and not on the prevalence of the study population. For the\n",
204 | "moment it suffices to recall that the LR± are defined as\n",
205 | "\n",
206 | "    LR± = P(T± | D+) / P(T± | D−)"
207 | ]
208 | },
209 | {
210 | "cell_type": "code",
211 | "execution_count": null,
212 | "id": "13b04b48",
213 | "metadata": {},
214 | "outputs": [],
215 | "source": [
216 | "pos_LR, neg_LR = metrics.class_likelihood_ratios(y_test, y_pred)\n",
217 | "print(f\"LR+ on the test set: {pos_LR:.3f}\")  # higher is better\n",
218 | "print(f\"LR- on the test set: {neg_LR:.3f}\")  # lower is better"
219 | ]
220 | },
221 | {
222 | "cell_type": "markdown",
223 | "id": "f999f3e2",
224 | "metadata": {},
225 | "source": [
226 | "**Caution**\n",
227 | "\n",
228 | "Please notice that if you want to use\n",
229 | "`metrics.class_likelihood_ratios`, you need\n",
230 | "scikit-learn >= 1.2.\n",
231 | "\n",
232 | "\n",
233 | "Extrapolating between populations\n",
234 | "---------------------------------\n",
235 | "\n",
236 | "The prevalence can vary (for instance, the prevalence of an infectious\n",
237 | "disease changes across time) and a given classifier may be intended to be\n",
238 | "applied in various situations.\n",
239 | "\n",
240 | "According to the World Bank, the diabetes prevalence in French Polynesia\n",
241 | "in 2022 is above 25%. Let's now evaluate our previously trained model on a\n",
242 | "**different population** with such prevalence and **the same data-generating\n",
243 | "process**."
244 | ]
245 | },
246 | {
247 | "cell_type": "code",
248 | "execution_count": null,
249 | "id": "432b3b95",
250 | "metadata": {},
251 | "outputs": [],
252 | "source": [
253 | "X, y = make_classification(**common_params, weights=[0.75, 0.25])\n",
254 | "X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)\n",
255 | "\n",
256 | "fig, ax = plt.subplots()\n",
257 | "disp = DecisionBoundaryDisplay.from_estimator(\n",
258 | "    estimator,\n",
259 | "    X_test,\n",
260 | "    response_method=\"predict\",\n",
261 | "    alpha=0.5,\n",
262 | "    xlabel=\"age (years)\",\n",
263 | "    ylabel=\"blood sugar level (mg/dL)\",\n",
264 | "    ax=ax,\n",
265 | ")\n",
266 | "scatter = disp.ax_.scatter(X_test[:, 0], X_test[:, 1], c=y_test, edgecolor=\"k\")\n",
267 | "disp.ax_.set_title(f\"Hypothetical diabetes test with prevalence = {y.mean():.2f}\")\n",
268 | "_ = disp.ax_.legend(*scatter.legend_elements())\n",
269 | "\n",
270 | "# We then compute the same metrics using a test set with the new\n",
271 | "# prevalence:\n",
272 | "\n",
273 | "y_pred = estimator.predict(X_test)\n",
274 | "prevalence = y.mean()\n",
275 | "accuracy = metrics.accuracy_score(y_test, y_pred)\n",
276 | "sensitivity = metrics.recall_score(y_test, y_pred)\n",
277 | "specificity = metrics.recall_score(y_test, y_pred, pos_label=0)\n",
278 | "balanced_acc = metrics.balanced_accuracy_score(y_test, y_pred)\n",
279 | "matthews = metrics.matthews_corrcoef(y_test, y_pred)\n",
280 | "PPV = metrics.precision_score(y_test, y_pred)\n",
281 | "\n",
282 | "print(f\"Accuracy on the test set: {accuracy:.2f}\")\n",
283 | "print(f\"Sensitivity on the test set: {sensitivity:.2f}\")\n",
284 | "print(f\"Specificity on the test set: {specificity:.2f}\")\n",
285 | "print(f\"Balanced accuracy on the test set: {balanced_acc:.2f}\")\n",
286 | "print(f\"Matthews correlation coeff on the test set: {matthews:.2f}\")\n",
287 | "print()\n",
288 | "print(f\"Probability to have the disease given a positive test: {100*PPV:.2f}%\")"
289 | ]
290 | },
291 | {
292 | "cell_type": "markdown",
293 | "id": "60315d3b",
294 | "metadata": {},
295 | "source": [
296 | "The same model seems to perform better on this new dataset. Notice in\n",
297 | "particular that the probability of having the disease given a positive test\n",
298 | "increased. The same blood sugar test is less predictive in Benin than in\n",
299 | "French Polynesia!\n",
300 | "\n",
301 | "If we really want to score the test and not the dataset, we need a metric that\n",
302 | "does not depend on the prevalence of the study population."
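To make the prevalence-invariance argument concrete, here is an optional sketch that re-derives the likelihood ratios from sensitivity and specificity and compares them with `metrics.class_likelihood_ratios` (available in scikit-learn >= 1.2). The dataset and model are simplified stand-ins for the notebook's setup, not its exact parameters.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import class_likelihood_ratios, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Simplified stand-in for the tutorial's data-generating process.
X, y = make_classification(
    n_samples=10_000, n_features=2, n_informative=2, n_redundant=0,
    weights=[0.75, 0.25], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
y_pred = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train).predict(X_test)

sensitivity = recall_score(y_test, y_pred)               # P(T+ | D+)
specificity = recall_score(y_test, y_pred, pos_label=0)  # P(T- | D-)

# The likelihood ratios only involve these class-conditional rates, which is
# why they do not depend on the prevalence of the evaluation sample:
#   LR+ = sensitivity / (1 - specificity),  LR- = (1 - sensitivity) / specificity
manual_pos_LR = sensitivity / (1 - specificity)
manual_neg_LR = (1 - sensitivity) / specificity

pos_LR, neg_LR = class_likelihood_ratios(y_test, y_pred)
print(f"LR+  manual: {manual_pos_LR:.3f}   sklearn: {pos_LR:.3f}")
print(f"LR-  manual: {manual_neg_LR:.3f}   sklearn: {neg_LR:.3f}")
```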
303 | ]
304 | },
305 | {
306 | "cell_type": "code",
307 | "execution_count": null,
308 | "id": "7395579a",
309 | "metadata": {},
310 | "outputs": [],
311 | "source": [
312 | "pos_LR, neg_LR = metrics.class_likelihood_ratios(y_test, y_pred)\n",
313 | "\n",
314 | "print(f\"LR+ on the test set: {pos_LR:.3f}\")\n",
315 | "print(f\"LR- on the test set: {neg_LR:.3f}\")"
316 | ]
317 | },
318 | {
319 | "cell_type": "markdown",
320 | "id": "d4e7c039",
321 | "metadata": {},
322 | "source": [
323 | "Despite some variations due to residual dataset dependence, the class\n",
324 | "likelihood ratios are mathematically invariant with respect to prevalence. See\n",
325 | "[this example from the User\n",
326 | "Guide](https://scikit-learn.org/dev/auto_examples/model_selection/plot_likelihood_ratios.html#invariance-with-respect-to-prevalence)\n",
327 | "for a demo of this property.\n",
328 | "\n",
329 | "Pre-test vs. post-test odds\n",
330 | "---------------------------\n",
331 | "\n",
332 | "Both class likelihood ratios are interpretable in terms of odds:\n",
333 | "\n",
334 | "    post-test odds = likelihood ratio * pre-test odds\n",
335 | "\n",
336 | "The interpretation of LR+ in this case reads:"
337 | ]
338 | },
339 | {
340 | "cell_type": "code",
341 | "execution_count": null,
342 | "id": "4983b942",
343 | "metadata": {},
344 | "outputs": [],
345 | "source": [
346 | "print(\"The post-test odds that the condition is truly present given a positive \"\n",
347 | "      f\"test result are {pos_LR:.3f} times larger than the pre-test odds.\")"
348 | ]
349 | },
350 | {
351 | "cell_type": "markdown",
352 | "id": "71a226b4",
353 | "metadata": {},
354 | "source": [
355 | "We found that the diagnostic tool is useful: the post-test odds are larger than the\n",
356 | "pre-test odds. We now choose the pre-test probability to be the prevalence of\n",
357 | "the disease in the held-out testing set."
358 | ]
359 | },
360 | {
361 | "cell_type": "code",
362 | "execution_count": null,
363 | "id": "9967d431",
364 | "metadata": {},
365 | "outputs": [],
366 | "source": [
367 | "pretest_odds = y_test.mean() / (1 - y_test.mean())\n",
368 | "posttest_odds = pretest_odds * pos_LR\n",
369 | "\n",
370 | "print(f\"Observed pre-test odds: {pretest_odds:.3f}\")\n",
371 | "print(f\"Estimated post-test odds using LR+: {posttest_odds:.3f}\")"
372 | ]
373 | },
374 | {
375 | "cell_type": "markdown",
376 | "id": "0828ff62",
377 | "metadata": {},
378 | "source": [
379 | "The post-test probability is the probability of an individual truly having\n",
380 | "the condition given a positive test result, i.e. the number of true positives\n",
381 | "divided by the number of samples with a positive test. In real life applications\n",
382 | "this is unknown."
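The following small example makes the bookkeeping explicit on hypothetical labels and predictions (the arrays below are made up for illustration): the post-test probability obtained from the confusion matrix, true positives over all positive tests, coincides with the one recovered from LR+ and the pre-test odds. It assumes scikit-learn >= 1.2 for `class_likelihood_ratios`.

```python
import numpy as np
from sklearn.metrics import class_likelihood_ratios, confusion_matrix

# Hypothetical labels and predictions, only to make the bookkeeping concrete.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Post-test probability P(D+ | T+): true positives over all positive tests.
posttest_prob_direct = tp / (tp + fp)

# The same number recovered through the odds route:
# post-test odds = LR+ * pre-test odds, then odds -> probability.
pos_LR, _ = class_likelihood_ratios(y_true, y_pred)
pretest_odds = y_true.mean() / (1 - y_true.mean())
posttest_odds = pos_LR * pretest_odds
posttest_prob_from_odds = posttest_odds / (1 + posttest_odds)

print(f"P(D+ | T+) from the confusion matrix:  {posttest_prob_direct:.3f}")
print(f"P(D+ | T+) from LR+ and pre-test odds: {posttest_prob_from_odds:.3f}")
```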
383 | ]
384 | },
385 | {
386 | "cell_type": "code",
387 | "execution_count": null,
388 | "id": "bf4993a8",
389 | "metadata": {},
390 | "outputs": [],
391 | "source": [
392 | "posttest_prob = posttest_odds / (1 + posttest_odds)\n",
393 | "\n",
394 | "print(f\"Estimated post-test probability using LR+: {posttest_prob:.3f}\")"
395 | ]
396 | },
397 | {
398 | "cell_type": "markdown",
399 | "id": "34d66001",
400 | "metadata": {},
401 | "source": [
402 | "We can verify that if we had had access to the true labels, we would have\n",
403 | "obtained the same probability:"
404 | ]
405 | },
406 | {
407 | "cell_type": "code",
408 | "execution_count": null,
409 | "id": "32d96836",
410 | "metadata": {},
411 | "outputs": [],
412 | "source": [
413 | "posttest_prob = y_test[y_pred == 1].mean()\n",
414 | "\n",
415 | "print(f\"Observed post-test probability: {posttest_prob:.3f}\")"
416 | ]
417 | },
418 | {
419 | "cell_type": "markdown",
420 | "id": "de75d14b",
421 | "metadata": {},
422 | "source": [
423 | "Conclusion: if a salesperson from Benin were to sell the model to French Polynesia\n",
424 | "by showing them the 59.84% probability of having the disease given a positive test,\n",
425 | "French Polynesia would never have bought it, even though it would be quite\n",
426 | "predictive for their own population. The right quantities to report are the LR±.\n",
427 | "\n",
428 | "Can you imagine what would happen if the model were trained on nearly balanced classes\n",
429 | "and then extrapolated to other scenarios?"
430 | ]
431 | }
432 | ],
433 | "metadata": {
434 | "jupytext": {
435 | "cell_metadata_filter": "-all",
436 | "main_language": "python",
437 | "notebook_metadata_filter": "-all"
438 | }
439 | },
440 | "nbformat": 4,
441 | "nbformat_minor": 5
442 | }
443 | --------------------------------------------------------------------------------
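As a closing illustration of that last question, here is a hedged sketch showing how a single, fixed likelihood ratio translates into very different post-test probabilities depending on the prevalence of the target population; the `pos_LR = 12.0` value is made up for the example and is not a result from the notebooks.

```python
# Hypothetical likelihood ratio of a fixed diagnostic model.
pos_LR = 12.0

def posttest_probability(prevalence, likelihood_ratio):
    """Turn a pre-test prevalence into a post-test probability via the odds."""
    pretest_odds = prevalence / (1 - prevalence)
    posttest_odds = likelihood_ratio * pretest_odds
    return posttest_odds / (1 + posttest_odds)

# The very same test is far more conclusive where the disease is common.
for prevalence in (0.01, 0.25, 0.50):
    prob = posttest_probability(prevalence, pos_LR)
    print(f"prevalence = {prevalence:4.0%} -> P(D+ | T+) = {prob:.1%}")
```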