├── plots
│   ├── overall_plot.png
│   ├── SHAP_final_14_0.png
│   ├── SHAP_final_15_0.png
│   ├── SHAP_final_16_0.png
│   └── individual_observation.png
├── SHAP_lesson_tutorial.ipynb
└── README.md

/plots/overall_plot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/helenaEH/SHAP_tutorial/HEAD/plots/overall_plot.png
--------------------------------------------------------------------------------
/plots/SHAP_final_14_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/helenaEH/SHAP_tutorial/HEAD/plots/SHAP_final_14_0.png
--------------------------------------------------------------------------------
/plots/SHAP_final_15_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/helenaEH/SHAP_tutorial/HEAD/plots/SHAP_final_15_0.png
--------------------------------------------------------------------------------
/plots/SHAP_final_16_0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/helenaEH/SHAP_tutorial/HEAD/plots/SHAP_final_16_0.png
--------------------------------------------------------------------------------
/plots/individual_observation.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/helenaEH/SHAP_tutorial/HEAD/plots/individual_observation.png
--------------------------------------------------------------------------------
/SHAP_lesson_tutorial.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SHAP (SHapley Additive exPlanations)\n",
    "#### Using SHAP to see feature contribution to the target variable\n",
    "Works with any sklearn tree-based model & XGBoost, LightGBM, CatBoost\n",
    "\n",
    "Library documentation: \n",
    "https://shap.readthedocs.io/en/latest/ \n",
    "https://github.com/slundberg/shap#citations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import shap\n",
    "import pandas as pd\n",
    "from sklearn.datasets import load_boston\n",
    "from sklearn.ensemble import RandomForestRegressor"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 1. Load data into a dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1. Load the Boston data into a variable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 2. See what we can extract from the sklearn toy dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 3. Dataframe with all features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 4. For simplicity, extract only the following columns\n",
    "include = ['CRIM', 'RM', 'AGE', 'DIS', 'LSTAT']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 5. Rename the columns\n",
    "rename = ['crime_rate', 'nr_of_rooms', 'pct_old_building', 'center_distance', 'pct_lower_income']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 6. Create the y data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 2. Random Forest"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1. Fit a random forest with 100 estimators"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 2. Score the model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Step 3. SHAP values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1. Initialize the JavaScript visualization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 2. Create a SHAP explainer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 3. Summary plot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 4. Individual prediction force plot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 5. See how every feature contributes to the model output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# 6. SHAP values for all predictions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# SHAP (SHapley Additive exPlanations)
#### Using SHAP to see the feature contribution to the target variable
TreeExplainer works with any sklearn tree-based model as well as XGBoost, LightGBM, and CatBoost. See the documentation for other model-based approaches.

Library documentation:
https://shap.readthedocs.io/en/latest/
https://github.com/slundberg/shap#citations

Shapley values come from cooperative game theory. There is a trade-off between machine learning model complexity and interpretability: simple models are easy to understand, but they are often not as accurate at predicting the target variable. More complex models tend to be more accurate, but they are notorious for being 'black boxes', which makes their predictions hard to interpret. The Python SHAP library is an easy-to-use visualization library that shows how important each feature is and in which direction (positive/negative) it pushes the target variable, both globally and for individual observations.

pip install shap
or
conda install -c conda-forge shap

```python
import shap
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score
```

### Step 1. Load the data into a dataframe


```python
boston = load_boston()

# Create a Pandas dataframe with all the features
X = pd.DataFrame(data=boston['data'], columns=boston['feature_names'])
y = boston['target']
```


### Step 2. Random Forest


```python
# Split the data
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)
```


```python
# Initialize and fit a Random Forest Regressor
rf_reg = RandomForestRegressor(n_estimators=100)
rf_reg.fit(Xtrain, ytrain)
```




    RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                          max_features='auto', max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
                          oob_score=False, random_state=None, verbose=0, warm_start=False)




```python
rf_train = rf_reg.score(Xtrain, ytrain)
rf_cv = cross_val_score(rf_reg, Xtrain, ytrain, cv=5).mean()
rf_test = rf_reg.score(Xtest, ytest)
print('Evaluation of the Random Forest performance\n')
print(f'Training score: {rf_train.round(4)}')
print(f'Cross validation score: {rf_cv.round(4)}')
print(f'Test score: {rf_test.round(4)}')
```

    Evaluation of the Random Forest performance

    Training score: 0.9788
    Cross validation score: 0.8396
    Test score: 0.7989


### Step 3. SHAP values


```python
# Initialize the JavaScript visualization - use a Jupyter notebook to see the interactive features of the plots
shap.initjs()
```

```python
# Create a TreeExplainer and extract the SHAP values from it - they will be used for plotting later
explainer = shap.TreeExplainer(rf_reg)
shap_values = explainer.shap_values(X)
```
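
Before moving on to the plots, it can be reassuring to confirm the additivity property that gives SHAP its name: the expected (base) value plus the SHAP values of one observation should reproduce the model's prediction for that observation. This optional check is only a minimal sketch; it reuses the `explainer`, `shap_values`, `rf_reg` and `X` objects defined above and adds nothing new to the analysis.

```python
import numpy as np

# explainer.expected_value can be a scalar or a length-one array depending on the shap version,
# so flatten it before use
base_value = np.ravel(explainer.expected_value)[0]

# Base value plus the sum of the SHAP values for the first observation ...
reconstructed = base_value + shap_values[0, :].sum()

# ... should roughly match the random forest's prediction for that observation
print('Base value + SHAP values:', round(reconstructed, 4))
print('Model prediction:        ', round(rf_reg.predict(X.iloc[[0]])[0], 4))
```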

```python
# SHAP force plot for the first prediction. Here we want to interpret the output value for the 1st observation in our dataframe.
shap.force_plot(explainer.expected_value, shap_values[0,:], X.iloc[0,:])
```
![png](plots/individual_observation.png)

```python
# SHAP values for all predictions and the direction of their impact
shap.force_plot(explainer.expected_value, shap_values, X)
```
![png](plots/overall_plot.png)

```python
# Effect of a single feature on its SHAP values; a second feature is selected automatically to color the points and reveal possible interactions
shap.dependence_plot('AGE', shap_values, X)
```


![png](plots/SHAP_final_14_0.png)



```python
# See how each feature contributes to the model output (one dot per observation, colored by the feature's value)
shap.summary_plot(shap_values, X)
```


![png](plots/SHAP_final_15_0.png)



```python
# Rank the features by their mean absolute SHAP value
shap.summary_plot(shap_values, X, plot_type="bar")
```


![png](plots/SHAP_final_16_0.png)
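
If you would rather read the ranking behind the bar chart as numbers, the quantity it plots (the mean absolute SHAP value per feature) can be computed directly. This is a minimal sketch that assumes the `shap_values` array and the `X` dataframe from the cells above are still in scope.

```python
import numpy as np
import pandas as pd

# Mean absolute SHAP value per feature - the quantity the bar summary plot is built from
mean_abs_shap = pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns)
print(mean_abs_shap.sort_values(ascending=False))
```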


Feature and dataset description

```python
print(boston.DESCR)
```

    .. _boston_dataset:

    Boston house prices dataset
    ---------------------------

    **Data Set Characteristics:**

    :Number of Instances: 506

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

    This is a copy of UCI ML housing dataset.
    https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


    This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

    The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
    prices and the demand for clean air', J. Environ. Economics & Management,
    vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
    ...', Wiley, 1980. N.B. Various transformations are used in the table on
    pages 244-261 of the latter.

    The Boston house-price data has been used in many machine learning papers that address regression
    problems.

    .. topic:: References

       - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
       - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

--------------------------------------------------------------------------------