├── plots
│   ├── overall_plot.png
│   ├── SHAP_final_14_0.png
│   ├── SHAP_final_15_0.png
│   ├── SHAP_final_16_0.png
│   └── individual_observation.png
├── SHAP_lesson_tutorial.ipynb
└── README.md

/plots/overall_plot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/helenaEH/SHAP_tutorial/HEAD/plots/overall_plot.png
--------------------------------------------------------------------------------
/plots/SHAP_final_14_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/helenaEH/SHAP_tutorial/HEAD/plots/SHAP_final_14_0.png
--------------------------------------------------------------------------------
/plots/SHAP_final_15_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/helenaEH/SHAP_tutorial/HEAD/plots/SHAP_final_15_0.png
--------------------------------------------------------------------------------
/plots/SHAP_final_16_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/helenaEH/SHAP_tutorial/HEAD/plots/SHAP_final_16_0.png
--------------------------------------------------------------------------------
/plots/individual_observation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/helenaEH/SHAP_tutorial/HEAD/plots/individual_observation.png
--------------------------------------------------------------------------------
/SHAP_lesson_tutorial.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SHAP (SHapley Additive exPlanations)\n",
    "#### Using SHAP to see feature contribution to the target variable\n",
    "Works with any sklearn tree-based model & XGBoost, LightGBM, CatBoost\n",
    "\n",
    "Library documentation: \n",
    "https://shap.readthedocs.io/en/latest/ \n",
    "https://github.com/slundberg/shap#citations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import shap\n",
    "import pandas as pd\n",
    "from sklearn.datasets import load_boston\n",
    "from sklearn.ensemble import RandomForestRegressor"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 1. Load data into a dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1. Load the Boston data into a variable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 2. See what we can extract from the sklearn toy dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 3. Dataframe with all features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 4. For simplicity, extract only the following columns\n",
    "include = ['CRIM', 'RM', 'AGE', 'DIS', 'LSTAT']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 5. Rename the columns\n",
    "rename = ['crime_rate', 'nr_of_rooms', 'pct_old_building', 'center_distance', 'pct_lower_income']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 6. Create the y data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2. Random Forest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1. Fit a random forest with 100 estimators"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 2. Score the model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 3. SHAP values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1. Initialize the JavaScript visualization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 2. Create a SHAP explainer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 3. Summary plot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 4. Individual prediction force plot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 5. See how every feature contributes to the model output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# 6. SHAP values for all predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# SHAP (SHapley Additive exPlanations)
#### Using SHAP to see the feature contribution to the target variable
TreeExplainer works with any sklearn tree-based model as well as XGBoost, LightGBM, and CatBoost. See the documentation for other model-based approaches.

Library documentation:
https://shap.readthedocs.io/en/latest/
https://github.com/slundberg/shap#citations

Shapley values come from cooperative game theory. There is a trade-off between machine learning model complexity and interpretability: simple models are easy to understand, but they are often not as accurate at predicting the target variable. More complex models tend to be more accurate, but they are notorious for being 'black boxes', which makes their predictions hard to interpret. The Python SHAP library is an easy-to-use visualization library that shows how important each feature is and in which direction (positive/negative) it pushes the target variable, both globally and for individual observations.

pip install shap
or
conda install -c conda-forge shap

```python
import shap
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score
```

### Step 1. Load the data into a dataframe


```python
boston = load_boston()

# Create a Pandas dataframe with all the features
X = pd.DataFrame(data=boston['data'], columns=boston['feature_names'])
y = boston['target']
```


### Step 2. Random Forest


```python
# Split the data
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)
```


```python
# Initialize and fit a Random Forest Regressor
rf_reg = RandomForestRegressor(n_estimators=100)
rf_reg.fit(Xtrain, ytrain)
```




    RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                          max_features='auto', max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
                          oob_score=False, random_state=None, verbose=0, warm_start=False)




```python
rf_train = rf_reg.score(Xtrain, ytrain)
rf_cv = cross_val_score(rf_reg, Xtrain, ytrain, cv=5).mean()
rf_test = rf_reg.score(Xtest, ytest)
print('Evaluation of the Random Forest performance\n')
print(f'Training score: {rf_train.round(4)}')
print(f'Cross validation score: {rf_cv.round(4)}')
print(f'Test score: {rf_test.round(4)}')
```

    Evaluation of the Random Forest performance

    Training score: 0.9788
    Cross validation score: 0.8396
    Test score: 0.7989


### Step 3. SHAP values


```python
# Initialize the JavaScript visualization - use a Jupyter notebook to see the interactive features of the plots
shap.initjs()
```

```python
# Create a TreeExplainer and extract the SHAP values from it - they will be used for plotting later
explainer = shap.TreeExplainer(rf_reg)
shap_values = explainer.shap_values(X)
```
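
Before moving on to the plots, it can be reassuring to confirm the additivity property that gives SHAP its name: the expected (base) value plus the SHAP values of one observation should reproduce the model's prediction for that observation. This optional check is only a minimal sketch; it reuses the `explainer`, `shap_values`, `rf_reg` and `X` objects defined above and adds nothing new to the analysis.

```python
import numpy as np

# explainer.expected_value can be a scalar or a length-one array depending on the shap version,
# so flatten it before use
base_value = np.ravel(explainer.expected_value)[0]

# Base value plus the sum of the SHAP values for the first observation ...
reconstructed = base_value + shap_values[0, :].sum()

# ... should roughly match the random forest's prediction for that observation
print('Base value + SHAP values:', round(reconstructed, 4))
print('Model prediction:        ', round(rf_reg.predict(X.iloc[[0]])[0], 4))
```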

```python
# SHAP force plot for the first prediction. Here we want to interpret the output value for the 1st observation in our dataframe.
shap.force_plot(explainer.expected_value, shap_values[0,:], X.iloc[0,:])
```
![png](plots/individual_observation.png)

```python
# SHAP values for all predictions and the direction of their impact
shap.force_plot(explainer.expected_value, shap_values, X)
```
![png](plots/overall_plot.png)

```python
# Effect of a single feature on its SHAP values; a second feature is selected automatically to color the points and reveal possible interactions
shap.dependence_plot('AGE', shap_values, X)
```


![png](plots/SHAP_final_14_0.png)



```python
# See how each feature contributes to the model output (one dot per observation, colored by the feature's value)
shap.summary_plot(shap_values, X)
```


![png](plots/SHAP_final_15_0.png)



```python
# Rank the features by their mean absolute SHAP value
shap.summary_plot(shap_values, X, plot_type="bar")
```


![png](plots/SHAP_final_16_0.png)
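
If you would rather read the ranking behind the bar chart as numbers, the quantity it plots (the mean absolute SHAP value per feature) can be computed directly. This is a minimal sketch that assumes the `shap_values` array and the `X` dataframe from the cells above are still in scope.

```python
import numpy as np
import pandas as pd

# Mean absolute SHAP value per feature - the quantity the bar summary plot is built from
mean_abs_shap = pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns)
print(mean_abs_shap.sort_values(ascending=False))
```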


Feature and dataset description

```python
print(boston.DESCR)
```

    .. _boston_dataset:

    Boston house prices dataset
    ---------------------------

    **Data Set Characteristics:**

    :Number of Instances: 506

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

    This is a copy of UCI ML housing dataset.
    https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


    This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

    The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
    prices and the demand for clean air', J. Environ. Economics & Management,
    vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
    ...', Wiley, 1980. N.B. Various transformations are used in the table on
    pages 244-261 of the latter.

    The Boston house-price data has been used in many machine learning papers that address regression
    problems.

    .. topic:: References

       - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
       - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

--------------------------------------------------------------------------------