├── .DS_Store ├── .gitignore ├── LICENSE ├── README.md ├── cqr ├── .DS_Store ├── __init__.py ├── helper.py ├── torch_models.py └── tune_params_cv.py ├── cqr_real_data_example.ipynb ├── cqr_synthetic_data_example_1.ipynb ├── cqr_synthetic_data_example_2.ipynb ├── datasets ├── .DS_Store ├── CASP.csv ├── Concrete_Data.csv ├── README.md ├── STAR.csv ├── bike_train.csv ├── communities.data ├── communities_attributes.csv ├── datasets.py └── facebook │ └── README.md ├── detect_prediction_bias_example.ipynb ├── equalized_coverage_example.ipynb ├── get_meps_data ├── README.md ├── base_dataset.py ├── download_data.R ├── main_clean_and_save_to_csv.py ├── meps_dataset_panel19_fy2015_reg.py ├── meps_dataset_panel20_fy2015_reg.py ├── meps_dataset_panel21_fy2016_reg.py ├── regression_dataset.py ├── save_dataset.py └── structured_dataset.py ├── nonconformist ├── .DS_Store ├── __init__.py ├── acp.py ├── base.py ├── cp.py ├── evaluation.py ├── icp.py ├── nc.py └── util.py ├── poster └── CQR_Poster.pdf └── reproducible_experiments ├── all_cqr_experiments.py ├── all_equalized_coverage_experiments.py ├── run_cqr_experiment.py └── run_equalized_coverage_experiment.py /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yromano/cqr/73267abb7ed7d3c6dad6ab4449154db7ec306535/.DS_Store -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | .ipynb_checkpoints/ 5 | .DS_Store 6 | 7 | # C extensions 8 | *.so 9 | 10 | # Distribution / packaging 11 | .Python 12 | env/ 13 | build/ 14 | develop-eggs/ 15 | dist/ 16 | downloads/ 17 | eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .coverage 41 | .cache 42 | nosetests.xml 43 | coverage.xml 44 | 45 | # Translations 46 | *.mo 47 | *.pot 48 | 49 | # Django stuff: 50 | *.log 51 | 52 | # Sphinx documentation 53 | docs/_build/ 54 | 55 | # PyBuilder 56 | target/ 57 | 58 | # PyCharm 59 | .idea -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | nonconformist package: 4 | Copyright (c) 2015 Henrik Linusson 5 | 6 | Other extensions: 7 | Copyright (c) 2019 Yaniv Romano 8 | 9 | Permission is hereby granted, free of charge, to any person obtaining a copy 10 | of this software and associated documentation files (the "Software"), to deal 11 | in the Software without restriction, including without limitation the rights 12 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 13 | copies of the Software, and to permit persons to whom the Software is 14 | furnished to do so, subject to the following conditions: 15 | 16 | The above copyright notice and this permission notice shall be included in all 17 | copies or substantial portions of the Software. 
18 | 19 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 20 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 21 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 22 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 23 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 24 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 25 | SOFTWARE. 26 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Reliable Predictive Inference 2 | 3 | An important factor in guaranteeing responsible use of data-driven recommendation systems is the ability to communicate their uncertainty to decision makers. This can be accomplished by constructing prediction intervals, which provide an intuitive measure of the limits of predictive performance. 4 | 5 | This package contains a Python implementation of the Conformalized Quantile Regression (CQR) [1] methodology for constructing marginal distribution-free prediction intervals. It also implements the equalized coverage framework [2], which builds valid group-conditional prediction intervals. 6 | 7 | # Conformalized Quantile Regression [1] 8 | 9 | CQR is a technique for constructing prediction intervals that attain valid coverage in finite samples, without making distributional assumptions. It combines the statistical efficiency of quantile regression with the distribution-free coverage guarantee of conformal prediction. On one hand, CQR is flexible in that it can wrap around any algorithm for quantile regression, including random forests and deep neural networks. On the other hand, a key strength of CQR is its rigorous control of the miscoverage rate, independent of the underlying regression algorithm. 10 | 11 | [1] Yaniv Romano, Evan Patterson, and Emmanuel J. Candès, [“Conformalized quantile regression.”](https://arxiv.org/abs/1905.03222) 2019. 12 | 13 | # Equalized Coverage [2] 14 | 15 | To support equitable treatment, the equalized coverage methodology forces the construction of the prediction intervals to be unbiased in the sense that their coverage must be equal across all protected groups of interest. Similar to CQR and conformal inference, equalized coverage offers rigorous distribution-free guarantees that hold in finite samples. This methodology can also be viewed as a wrapper around any predictive algorithm. 16 | 17 | [2] Y. Romano, R. F. Barber, C. Sabatti and E. J. Candès, [“With malice towards none: Assessing uncertainty via equalized coverage.”](https://statweb.stanford.edu/~candes/papers/EqualizedCoverage.pdf) 2019. 18 | 19 | ## Getting Started 20 | 21 | This package is self-contained and implemented in Python. 22 | 23 | Part of the code is taken from the nonconformist package available at https://github.com/donlnz/nonconformist. One may refer to the nonconformist repository to view other applications of conformal prediction. 24 | 25 | ### Prerequisites 26 | 27 | * python 28 | * numpy 29 | * scipy 30 | * scikit-learn 31 | * scikit-garden 32 | * pytorch 33 | * pandas 34 | 35 | ### Installing 36 | 37 | The development version is available here on GitHub: 38 | ```bash 39 | git clone https://github.com/yromano/cqr.git 40 | ``` 41 | 42 | ## Usage 43 | 44 | ### CQR 45 | 46 | Please refer to [cqr_real_data_example.ipynb](cqr_real_data_example.ipynb) for basic usage.
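For orientation, the following is a minimal, self-contained sketch of that basic flow, mirroring the real-data notebook: a quantile random forests learner (via `scikit-garden`) is wrapped by the split-conformal machinery and evaluated on held-out data. The synthetic data and all hyper-parameter values below are illustrative placeholders only.

```python
import numpy as np

from cqr import helper
from nonconformist.nc import RegressorNc, QuantileRegErrFunc

# illustrative synthetic data; replace with your own features and labels
np.random.seed(0)
n, p = 2000, 10
X = np.random.randn(n, p)
y = X[:, 0] + 0.5 * np.random.randn(n)
x_train, x_test, y_train, y_test = X[:1500], X[1500:], y[:1500], y[1500:]

# split the training data into a proper training set and a calibration set
idx = np.random.permutation(len(y_train))
idx_train, idx_cal = idx[:750], idx[750:]

alpha = 0.1  # target miscoverage rate

# quantile random forests base learner (quantile levels on a 0-100 scale)
params_qforest = {"n_estimators": 1000, "min_samples_leaf": 1,
                  "max_features": p, "CV": False, "random_state": 0}
quantile_estimator = helper.QuantileForestRegressorAdapter(
    model=None, fit_params=None, quantiles=[5, 95], params=params_qforest)

# conformalize: fit on the proper training set, calibrate, then predict
nc = RegressorNc(quantile_estimator, QuantileRegErrFunc())
y_lower, y_upper = helper.run_icp(nc, x_train, y_train, x_test,
                                  idx_train, idx_cal, alpha)

# average coverage and interval length on the test set
helper.compute_coverage(y_test, y_lower, y_upper, alpha, "CQR random forests")
```

Any other quantile regression adapter provided by this package (for example, the neural network adapter `AllQNet_RegressorAdapter`) can be dropped in place of `QuantileForestRegressorAdapter`.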
Comparisons to competitive methods and additional usage examples of this package can be found in [cqr_synthetic_data_example_1.ipynb](cqr_synthetic_data_example_1.ipynb) and [cqr_synthetic_data_example_2.ipynb](cqr_synthetic_data_example_2.ipynb). 47 | 48 | ### Equalized Coverage 49 | 50 | The notebook [detect_prediction_bias_example.ipynb](detect_prediction_bias_example.ipynb) performs a simple data analysis of the MEPS panel 21 data set and detects bias in the predictions. The notebook [equalized_coverage_example.ipynb](equalized_coverage_example.ipynb) illustrates how to run the methods proposed in [2] and construct prediction intervals with equal coverage across groups. A minimal sketch of this group-conditional usage appears at the end of this README. 51 | 52 | ## Reproducible Research 53 | 54 | The code available under /reproducible_experiments/ in the repository replicates the experimental results in [1] and [2]. 55 | 56 | ### Publicly Available Datasets 57 | 58 | * [Blog](https://archive.ics.uci.edu/ml/datasets/BlogFeedback): BlogFeedback data set. 59 | 60 | * [Bio](https://archive.ics.uci.edu/ml/datasets/Physicochemical+Properties+of+Protein+Tertiary+Structure): Physicochemical properties of protein tertiary structure data set. 61 | 62 | * [Bike](https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset): Bike sharing data set. 63 | 64 | * [Community](http://archive.ics.uci.edu/ml/datasets/communities+and+crime): Communities and crime data set. 65 | 66 | * [STAR](https://www.rdocumentation.org/packages/AER/versions/1.2-6/topics/STAR): C.M. Achilles, Helen Pate Bain, Fred Bellott, Jayne Boyd-Zaharias, Jeremy Finn, John Folger, John Johnston, and Elizabeth Word. Tennessee’s Student Teacher Achievement Ratio (STAR) project, 2008. 67 | 68 | * [Concrete](http://archive.ics.uci.edu/ml/datasets/concrete+compressive+strength): Concrete compressive strength data set. 69 | 70 | * [Facebook Variant 1 and Variant 2](https://archive.ics.uci.edu/ml/datasets/Facebook+Comment+Volume+Dataset): Facebook comment volume data set. 71 | 72 | ### Data subject to copyright/usage rules 73 | 74 | The Medical Expenditure Panel Survey (MEPS) data can be downloaded using the code in the folder /get_meps_data/ under this repository. It is based on [this explanation](https://github.com/yromano/cqr/blob/master/get_meps_data/README.md) (code provided by [IBM's AIF360](https://github.com/IBM/AIF360)). 75 | 76 | * [MEPS_19](https://meps.ahrq.gov/mepsweb/data_stats/download_data_files_detail.jsp?cboPufNumber=HC-181): Medical expenditure panel survey, panel 19. 77 | 78 | * [MEPS_20](https://meps.ahrq.gov/mepsweb/data_stats/download_data_files_detail.jsp?cboPufNumber=HC-181): Medical expenditure panel survey, panel 20. 79 | 80 | * [MEPS_21](https://meps.ahrq.gov/mepsweb/data_stats/download_data_files_detail.jsp?cboPufNumber=HC-192): Medical expenditure panel survey, panel 21. 81 | 82 | ## License 83 | 84 | This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
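As noted in the Equalized Coverage section above, `helper.run_icp` also accepts a `condition` argument that assigns each example to a group, and `helper.compute_coverage_per_sample` then reports coverage and interval length per group. The sketch below only illustrates this calling convention; the synthetic data, the use of column 0 as a binary protected attribute, and all parameter values are assumptions for illustration.

```python
import numpy as np

from cqr import helper
from nonconformist.nc import RegressorNc, QuantileRegErrFunc

# illustrative data in which column 0 plays the role of a binary protected attribute
np.random.seed(0)
n, p = 2000, 10
X = np.random.randn(n, p)
X[:, 0] = (X[:, 0] > 0).astype(float)
y = X[:, 1] + X[:, 0] + 0.5 * np.random.randn(n)
x_train, x_test, y_train, y_test = X[:1500], X[1500:], y[:1500], y[1500:]

idx = np.random.permutation(len(y_train))
idx_train, idx_cal = idx[:750], idx[750:]
alpha = 0.1

# condition receives a (features, label) pair and returns an integer group id;
# at test time the label entry may be None, so only the features are used here
condition = lambda z: int(z[0][0] > 0)

params_qforest = {"n_estimators": 1000, "min_samples_leaf": 1,
                  "max_features": p, "CV": False, "random_state": 0}
quantile_estimator = helper.QuantileForestRegressorAdapter(
    model=None, fit_params=None, quantiles=[5, 95], params=params_qforest)

# calibrate the conformal step within each group by passing `condition`
nc = RegressorNc(quantile_estimator, QuantileRegErrFunc())
y_lower, y_upper = helper.run_icp(nc, x_train, y_train, x_test,
                                  idx_train, idx_cal, alpha,
                                  condition=condition)

# report coverage and average interval length separately for each group
helper.compute_coverage_per_sample(y_test, y_lower, y_upper, alpha,
                                   "CQR random forests", x_test, condition)
```

To train a separate regressor per group rather than a single shared one, `helper.run_icp_sep` follows the same pattern but expects one nonconformity object per group.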
85 | -------------------------------------------------------------------------------- /cqr/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yromano/cqr/73267abb7ed7d3c6dad6ab4449154db7ec306535/cqr/.DS_Store -------------------------------------------------------------------------------- /cqr/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | -------------------------------------------------------------------------------- /cqr/helper.py: -------------------------------------------------------------------------------- 1 | 2 | import sys 3 | import torch 4 | import numpy as np 5 | from cqr import torch_models 6 | from functools import partial 7 | from cqr import tune_params_cv 8 | from nonconformist.cp import IcpRegressor 9 | from nonconformist.base import RegressorAdapter 10 | from skgarden import RandomForestQuantileRegressor 11 | 12 | if torch.cuda.is_available(): 13 | device = "cuda:0" 14 | else: 15 | device = "cpu" 16 | 17 | 18 | def compute_coverage_len(y_test, y_lower, y_upper): 19 | """ Compute average coverage and length of prediction intervals 20 | 21 | Parameters 22 | ---------- 23 | 24 | y_test : numpy array, true labels (n) 25 | y_lower : numpy array, estimated lower bound for the labels (n) 26 | y_upper : numpy array, estimated upper bound for the labels (n) 27 | 28 | Returns 29 | ------- 30 | 31 | coverage : float, average coverage 32 | avg_length : float, average length 33 | 34 | """ 35 | in_the_range = np.sum((y_test >= y_lower) & (y_test <= y_upper)) 36 | coverage = in_the_range / len(y_test) * 100 37 | avg_length = np.mean(abs(y_upper - y_lower)) 38 | return coverage, avg_length 39 | 40 | def run_icp(nc, X_train, y_train, X_test, idx_train, idx_cal, significance, condition=None): 41 | """ Run split conformal method 42 | 43 | Parameters 44 | ---------- 45 | 46 | nc : class of nonconformist object 47 | X_train : numpy array, training features (n1Xp) 48 | y_train : numpy array, training labels (n1) 49 | X_test : numpy array, testing features (n2Xp) 50 | idx_train : numpy array, indices of proper training set examples 51 | idx_cal : numpy array, indices of calibration set examples 52 | significance : float, significance level (e.g. 
0.1) 53 | condition : function, mapping feature vector to group id 54 | 55 | Returns 56 | ------- 57 | 58 | y_lower : numpy array, estimated lower bound for the labels (n2) 59 | y_upper : numpy array, estimated upper bound for the labels (n2) 60 | 61 | """ 62 | icp = IcpRegressor(nc,condition=condition) 63 | 64 | # Fit the ICP using the proper training set 65 | icp.fit(X_train[idx_train,:], y_train[idx_train]) 66 | 67 | # Calibrate the ICP using the calibration set 68 | icp.calibrate(X_train[idx_cal,:], y_train[idx_cal]) 69 | 70 | # Produce predictions for the test set, with confidence 90% 71 | predictions = icp.predict(X_test, significance=significance) 72 | 73 | y_lower = predictions[:,0] 74 | y_upper = predictions[:,1] 75 | 76 | return y_lower, y_upper 77 | 78 | 79 | def run_icp_sep(nc, X_train, y_train, X_test, idx_train, idx_cal, significance, condition): 80 | """ Run split conformal method, train a seperate regressor for each group 81 | 82 | Parameters 83 | ---------- 84 | 85 | nc : class of nonconformist object 86 | X_train : numpy array, training features (n1Xp) 87 | y_train : numpy array, training labels (n1) 88 | X_test : numpy array, testing features (n2Xp) 89 | idx_train : numpy array, indices of proper training set examples 90 | idx_cal : numpy array, indices of calibration set examples 91 | significance : float, significance level (e.g. 0.1) 92 | condition : function, mapping a feature vector to group id 93 | 94 | Returns 95 | ------- 96 | 97 | y_lower : numpy array, estimated lower bound for the labels (n2) 98 | y_upper : numpy array, estimated upper bound for the labels (n2) 99 | 100 | """ 101 | 102 | X_proper_train = X_train[idx_train,:] 103 | y_proper_train = y_train[idx_train] 104 | X_calibration = X_train[idx_cal,:] 105 | y_calibration = y_train[idx_cal] 106 | 107 | category_map_proper_train = np.array([condition((X_proper_train[i, :], y_proper_train[i])) for i in range(y_proper_train.size)]) 108 | category_map_calibration = np.array([condition((X_calibration[i, :], y_calibration[i])) for i in range(y_calibration.size)]) 109 | category_map_test = np.array([condition((X_test[i, :], None)) for i in range(X_test.shape[0])]) 110 | 111 | categories = np.unique(category_map_proper_train) 112 | 113 | y_lower = np.zeros(X_test.shape[0]) 114 | y_upper = np.zeros(X_test.shape[0]) 115 | 116 | cnt = 0 117 | 118 | for cond in categories: 119 | 120 | icp = IcpRegressor(nc[cnt]) 121 | 122 | idx_proper_train_group = category_map_proper_train == cond 123 | # Fit the ICP using the proper training set 124 | icp.fit(X_proper_train[idx_proper_train_group,:], y_proper_train[idx_proper_train_group]) 125 | 126 | idx_calibration_group = category_map_calibration == cond 127 | # Calibrate the ICP using the calibration set 128 | icp.calibrate(X_calibration[idx_calibration_group,:], y_calibration[idx_calibration_group]) 129 | 130 | idx_test_group = category_map_test == cond 131 | # Produce predictions for the test set, with confidence 90% 132 | predictions = icp.predict(X_test[idx_test_group,:], significance=significance) 133 | 134 | y_lower[idx_test_group] = predictions[:,0] 135 | y_upper[idx_test_group] = predictions[:,1] 136 | 137 | cnt = cnt + 1 138 | 139 | return y_lower, y_upper 140 | 141 | def compute_coverage(y_test,y_lower,y_upper,significance,name=""): 142 | """ Compute average coverage and length, and print results 143 | 144 | Parameters 145 | ---------- 146 | 147 | y_test : numpy array, true labels (n) 148 | y_lower : numpy array, estimated lower bound for the labels (n) 149 | y_upper : 
numpy array, estimated upper bound for the labels (n) 150 | significance : float, desired significance level 151 | name : string, optional output string (e.g. the method name) 152 | 153 | Returns 154 | ------- 155 | 156 | coverage : float, average coverage 157 | avg_length : float, average length 158 | 159 | """ 160 | in_the_range = np.sum((y_test >= y_lower) & (y_test <= y_upper)) 161 | coverage = in_the_range / len(y_test) * 100 162 | print("%s: Percentage in the range (expecting %.2f): %f" % (name, 100 - significance*100, coverage)) 163 | sys.stdout.flush() 164 | 165 | avg_length = abs(np.mean(y_lower - y_upper)) 166 | print("%s: Average length: %f" % (name, avg_length)) 167 | sys.stdout.flush() 168 | return coverage, avg_length 169 | 170 | def compute_coverage_per_sample(y_test,y_lower,y_upper,significance,name="",x_test=None,condition=None): 171 | """ Compute average coverage and length, and print results 172 | 173 | Parameters 174 | ---------- 175 | 176 | y_test : numpy array, true labels (n) 177 | y_lower : numpy array, estimated lower bound for the labels (n) 178 | y_upper : numpy array, estimated upper bound for the labels (n) 179 | significance : float, desired significance level 180 | name : string, optional output string (e.g. the method name) 181 | x_test : numpy array, test features 182 | condition : function, mapping a feature vector to group id 183 | 184 | Returns 185 | ------- 186 | 187 | coverage : float, average coverage 188 | avg_length : float, average length 189 | 190 | """ 191 | 192 | if condition is not None: 193 | 194 | category_map = np.array([condition((x_test[i, :], y_test[i])) for i in range(y_test.size)]) 195 | categories = np.unique(category_map) 196 | 197 | coverage = np.empty(len(categories), dtype=np.object) 198 | length = np.empty(len(categories), dtype=np.object) 199 | 200 | cnt = 0 201 | 202 | for cond in categories: 203 | 204 | idx = category_map == cond 205 | 206 | coverage[cnt] = (y_test[idx] >= y_lower[idx]) & (y_test[idx] <= y_upper[idx]) 207 | 208 | coverage_avg = np.sum( coverage[cnt] ) / len(y_test[idx]) * 100 209 | print("%s: Group %d : Percentage in the range (expecting %.2f): %f" % (name, cond, 100 - significance*100, coverage_avg)) 210 | sys.stdout.flush() 211 | 212 | length[cnt] = abs(y_upper[idx] - y_lower[idx]) 213 | print("%s: Group %d : Average length: %f" % (name, cond, np.mean(length[cnt]))) 214 | sys.stdout.flush() 215 | cnt = cnt + 1 216 | 217 | else: 218 | 219 | coverage = (y_test >= y_lower) & (y_test <= y_upper) 220 | coverage_avg = np.sum(coverage) / len(y_test) * 100 221 | print("%s: Percentage in the range (expecting %.2f): %f" % (name, 100 - significance*100, coverage_avg)) 222 | sys.stdout.flush() 223 | 224 | length = abs(y_upper - y_lower) 225 | print("%s: Average length: %f" % (name, np.mean(length))) 226 | sys.stdout.flush() 227 | 228 | return coverage, length 229 | 230 | 231 | def plot_func_data(y_test,y_lower,y_upper,name=""): 232 | """ Plot the test labels along with the constructed prediction band 233 | 234 | Parameters 235 | ---------- 236 | 237 | y_test : numpy array, true labels (n) 238 | y_lower : numpy array, estimated lower bound for the labels (n) 239 | y_upper : numpy array, estimated upper bound for the labels (n) 240 | name : string, optional output string (e.g. 
the method name) 241 | 242 | """ 243 | 244 | # allowed to import graphics 245 | import matplotlib.pyplot as plt 246 | 247 | interval = y_upper - y_lower 248 | sort_ind = np.argsort(interval) 249 | y_test_sorted = y_test[sort_ind] 250 | upper_sorted = y_upper[sort_ind] 251 | lower_sorted = y_lower[sort_ind] 252 | mean = (upper_sorted + lower_sorted) / 2 253 | 254 | # Center such that the mean of the prediction interval is at 0.0 255 | y_test_sorted -= mean 256 | upper_sorted -= mean 257 | lower_sorted -= mean 258 | 259 | plt.plot(y_test_sorted, "ro") 260 | plt.fill_between( 261 | np.arange(len(upper_sorted)), lower_sorted, upper_sorted, alpha=0.2, color="r", 262 | label="Pred. interval") 263 | plt.xlabel("Ordered samples") 264 | plt.ylabel("Values and prediction intervals") 265 | 266 | plt.title(name) 267 | plt.show() 268 | 269 | interval = y_upper - y_lower 270 | sort_ind = np.argsort(y_test) 271 | y_test_sorted = y_test[sort_ind] 272 | upper_sorted = y_upper[sort_ind] 273 | lower_sorted = y_lower[sort_ind] 274 | 275 | plt.plot(y_test_sorted, "ro") 276 | plt.fill_between( 277 | np.arange(len(upper_sorted)), lower_sorted, upper_sorted, alpha=0.2, color="r", 278 | label="Pred. interval") 279 | plt.xlabel("Ordered samples by response") 280 | plt.ylabel("Values and prediction intervals") 281 | 282 | plt.title(name) 283 | plt.show() 284 | 285 | ############################################################################### 286 | # Deep conditional mean regression 287 | # Minimizing MSE loss 288 | ############################################################################### 289 | 290 | class MSENet_RegressorAdapter(RegressorAdapter): 291 | """ Conditional mean estimator, formulated as neural net 292 | """ 293 | def __init__(self, 294 | model, 295 | fit_params=None, 296 | in_shape=1, 297 | hidden_size=1, 298 | learn_func=torch.optim.Adam, 299 | epochs=1000, 300 | batch_size=10, 301 | dropout=0.1, 302 | lr=0.01, 303 | wd=1e-6, 304 | test_ratio=0.2, 305 | random_state=0): 306 | 307 | """ Initialization 308 | 309 | Parameters 310 | ---------- 311 | model : unused parameter (for compatibility with nc class) 312 | fit_params : unused parameter (for compatibility with nc class) 313 | in_shape : integer, input signal dimension 314 | hidden_size : integer, hidden layer dimension 315 | learn_func : class of Pytorch's SGD optimizer 316 | epochs : integer, maximal number of epochs 317 | batch_size : integer, mini-batch size for SGD 318 | dropout : float, dropout rate 319 | lr : float, learning rate for SGD 320 | wd : float, weight decay 321 | test_ratio : float, ratio of held-out data, used in cross-validation 322 | random_state : integer, seed for splitting the data in cross-validation 323 | 324 | """ 325 | super(MSENet_RegressorAdapter, self).__init__(model, fit_params) 326 | # Instantiate model 327 | self.epochs = epochs 328 | self.batch_size = batch_size 329 | self.dropout = dropout 330 | self.lr = lr 331 | self.wd = wd 332 | self.test_ratio = test_ratio 333 | self.random_state = random_state 334 | self.model = torch_models.mse_model(in_shape=in_shape, hidden_size=hidden_size, dropout=dropout) 335 | self.loss_func = torch.nn.MSELoss() 336 | self.learner = torch_models.LearnerOptimized(self.model, 337 | partial(learn_func, lr=lr, weight_decay=wd), 338 | self.loss_func, 339 | device=device, 340 | test_ratio=self.test_ratio, 341 | random_state=self.random_state) 342 | 343 | def fit(self, x, y): 344 | """ Fit the model to data 345 | 346 | Parameters 347 | ---------- 348 | 349 | x : numpy array of training 
features (nXp) 350 | y : numpy array of training labels (n) 351 | 352 | """ 353 | self.learner.fit(x, y, self.epochs, batch_size=self.batch_size) 354 | 355 | def predict(self, x): 356 | """ Estimate the label given the features 357 | 358 | Parameters 359 | ---------- 360 | x : numpy array of training features (nXp) 361 | 362 | Returns 363 | ------- 364 | ret_val : numpy array of predicted labels (n) 365 | 366 | """ 367 | return self.learner.predict(x) 368 | 369 | ############################################################################### 370 | # Deep neural network for conditional quantile regression 371 | # Minimizing pinball loss 372 | ############################################################################### 373 | 374 | class AllQNet_RegressorAdapter(RegressorAdapter): 375 | """ Conditional quantile estimator, formulated as neural net 376 | """ 377 | def __init__(self, 378 | model, 379 | fit_params=None, 380 | in_shape=1, 381 | hidden_size=1, 382 | quantiles=[.05, .95], 383 | learn_func=torch.optim.Adam, 384 | epochs=1000, 385 | batch_size=10, 386 | dropout=0.1, 387 | lr=0.01, 388 | wd=1e-6, 389 | test_ratio=0.2, 390 | random_state=0, 391 | use_rearrangement=False): 392 | """ Initialization 393 | 394 | Parameters 395 | ---------- 396 | model : None, unused parameter (for compatibility with nc class) 397 | fit_params : None, unused parameter (for compatibility with nc class) 398 | in_shape : integer, input signal dimension 399 | hidden_size : integer, hidden layer dimension 400 | quantiles : numpy array, low and high quantile levels in range (0,1) 401 | learn_func : class of Pytorch's SGD optimizer 402 | epochs : integer, maximal number of epochs 403 | batch_size : integer, mini-batch size for SGD 404 | dropout : float, dropout rate 405 | lr : float, learning rate for SGD 406 | wd : float, weight decay 407 | test_ratio : float, ratio of held-out data, used in cross-validation 408 | random_state : integer, seed for splitting the data in cross-validation 409 | use_rearrangement : boolean, use the rearrangement algorithm (True) 410 | of not (False). See reference [1]. 411 | 412 | References 413 | ---------- 414 | .. [1] Chernozhukov, Victor, Iván Fernández‐Val, and Alfred Galichon. 415 | "Quantile and probability curves without crossing." 416 | Econometrica 78.3 (2010): 1093-1125. 
417 | 418 | """ 419 | super(AllQNet_RegressorAdapter, self).__init__(model, fit_params) 420 | # Instantiate model 421 | self.quantiles = quantiles 422 | if use_rearrangement: 423 | self.all_quantiles = torch.from_numpy(np.linspace(0.01,0.99,99)).float() 424 | else: 425 | self.all_quantiles = self.quantiles 426 | self.epochs = epochs 427 | self.batch_size = batch_size 428 | self.dropout = dropout 429 | self.lr = lr 430 | self.wd = wd 431 | self.test_ratio = test_ratio 432 | self.random_state = random_state 433 | self.model = torch_models.all_q_model(quantiles=self.all_quantiles, 434 | in_shape=in_shape, 435 | hidden_size=hidden_size, 436 | dropout=dropout) 437 | self.loss_func = torch_models.AllQuantileLoss(self.all_quantiles) 438 | self.learner = torch_models.LearnerOptimizedCrossing(self.model, 439 | partial(learn_func, lr=lr, weight_decay=wd), 440 | self.loss_func, 441 | device=device, 442 | test_ratio=self.test_ratio, 443 | random_state=self.random_state, 444 | qlow=self.quantiles[0], 445 | qhigh=self.quantiles[1], 446 | use_rearrangement=use_rearrangement) 447 | 448 | def fit(self, x, y): 449 | """ Fit the model to data 450 | 451 | Parameters 452 | ---------- 453 | 454 | x : numpy array of training features (nXp) 455 | y : numpy array of training labels (n) 456 | 457 | """ 458 | self.learner.fit(x, y, self.epochs, self.batch_size) 459 | 460 | def predict(self, x): 461 | """ Estimate the conditional low and high quantiles given the features 462 | 463 | Parameters 464 | ---------- 465 | x : numpy array of training features (nXp) 466 | 467 | Returns 468 | ------- 469 | ret_val : numpy array of estimated conditional quantiles (nX2) 470 | 471 | """ 472 | return self.learner.predict(x) 473 | 474 | 475 | ############################################################################### 476 | # Quantile random forests model 477 | ############################################################################### 478 | 479 | class QuantileForestRegressorAdapter(RegressorAdapter): 480 | """ Conditional quantile estimator, defined as quantile random forests (QRF) 481 | 482 | References 483 | ---------- 484 | .. [1] Meinshausen, Nicolai. "Quantile regression forests." 485 | Journal of Machine Learning Research 7.Jun (2006): 983-999. 486 | 487 | """ 488 | 489 | def __init__(self, 490 | model, 491 | fit_params=None, 492 | quantiles=[5, 95], 493 | params=None): 494 | """ Initialization 495 | 496 | Parameters 497 | ---------- 498 | model : None, unused parameter (for compatibility with nc class) 499 | fit_params : None, unused parameter (for compatibility with nc class) 500 | quantiles : numpy array, low and high quantile levels in range (0,100) 501 | params : dictionary of parameters 502 | params["random_state"] : integer, seed for splitting the data 503 | in cross-validation. 
Also used as the 504 | seed in quantile random forests (QRF) 505 | params["min_samples_leaf"] : integer, parameter of QRF 506 | params["n_estimators"] : integer, parameter of QRF 507 | params["max_features"] : integer, parameter of QRF 508 | params["CV"] : boolean, use cross-validation (True) or 509 | not (False) to tune the two QRF quantile levels 510 | to obtain the desired coverage 511 | params["test_ratio"] : float, ratio of held-out data, used 512 | in cross-validation 513 | params["coverage_factor"] : float, to avoid too conservative 514 | estimation of the prediction band, 515 | when tuning the two QRF quantile 516 | levels in cross-validation one may 517 | ask for prediction intervals with 518 | reduced average coverage, equal to 519 | coverage_factor*(q_high - q_low). 520 | params["range_vals"] : float, determines the lowest and highest 521 | quantile level parameters when tuning 522 | the quanitle levels bt cross-validation. 523 | The smallest value is equal to 524 | quantiles[0] - range_vals. 525 | Similarly, the largest is equal to 526 | quantiles[1] + range_vals. 527 | params["num_vals"] : integer, when tuning QRF's quantile 528 | parameters, sweep over a grid of length 529 | num_vals. 530 | 531 | """ 532 | super(QuantileForestRegressorAdapter, self).__init__(model, fit_params) 533 | # Instantiate model 534 | self.quantiles = quantiles 535 | self.cv_quantiles = self.quantiles 536 | self.params = params 537 | self.rfqr = RandomForestQuantileRegressor(random_state=params["random_state"], 538 | min_samples_leaf=params["min_samples_leaf"], 539 | n_estimators=params["n_estimators"], 540 | max_features=params["max_features"]) 541 | 542 | def fit(self, x, y): 543 | """ Fit the model to data 544 | 545 | Parameters 546 | ---------- 547 | 548 | x : numpy array of training features (nXp) 549 | y : numpy array of training labels (n) 550 | 551 | """ 552 | if self.params["CV"]: 553 | target_coverage = self.quantiles[1] - self.quantiles[0] 554 | coverage_factor = self.params["coverage_factor"] 555 | range_vals = self.params["range_vals"] 556 | num_vals = self.params["num_vals"] 557 | grid_q_low = np.linspace(self.quantiles[0],self.quantiles[0]+range_vals,num_vals).reshape(-1,1) 558 | grid_q_high = np.linspace(self.quantiles[1],self.quantiles[1]-range_vals,num_vals).reshape(-1,1) 559 | grid_q = np.concatenate((grid_q_low,grid_q_high),1) 560 | 561 | self.cv_quantiles = tune_params_cv.CV_quntiles_rf(self.params, 562 | x, 563 | y, 564 | target_coverage, 565 | grid_q, 566 | self.params["test_ratio"], 567 | self.params["random_state"], 568 | coverage_factor) 569 | 570 | self.rfqr.fit(x, y) 571 | 572 | def predict(self, x): 573 | """ Estimate the conditional low and high quantiles given the features 574 | 575 | Parameters 576 | ---------- 577 | x : numpy array of training features (nXp) 578 | 579 | Returns 580 | ------- 581 | ret_val : numpy array of estimated conditional quantiles (nX2) 582 | 583 | """ 584 | lower = self.rfqr.predict(x, quantile=self.cv_quantiles[0]) 585 | upper = self.rfqr.predict(x, quantile=self.cv_quantiles[1]) 586 | 587 | ret_val = np.zeros((len(lower),2)) 588 | ret_val[:,0] = lower 589 | ret_val[:,1] = upper 590 | return ret_val 591 | -------------------------------------------------------------------------------- /cqr/torch_models.py: -------------------------------------------------------------------------------- 1 | 2 | import sys 3 | import copy 4 | import torch 5 | import numpy as np 6 | import torch.nn as nn 7 | from cqr import helper 8 | from sklearn.model_selection 
import train_test_split 9 | 10 | 11 | if torch.cuda.is_available(): 12 | device = "cuda:0" 13 | else: 14 | device = "cpu" 15 | 16 | ############################################################################### 17 | # Helper functions 18 | ############################################################################### 19 | 20 | def epoch_internal_train(model, loss_func, x_train, y_train, batch_size, optimizer, cnt=0, best_cnt=np.Inf): 21 | """ Sweep over the data and update the model's parameters 22 | 23 | Parameters 24 | ---------- 25 | 26 | model : class of neural net model 27 | loss_func : class of loss function 28 | x_train : pytorch tensor n training features, each of dimension p (nXp) 29 | batch_size : integer, size of the mini-batch 30 | optimizer : class of SGD solver 31 | cnt : integer, counting the gradient steps 32 | best_cnt: integer, stop the training if current cnt > best_cnt 33 | 34 | Returns 35 | ------- 36 | 37 | epoch_loss : mean loss value 38 | cnt : integer, cumulative number of gradient steps 39 | 40 | """ 41 | 42 | model.train() 43 | shuffle_idx = np.arange(x_train.shape[0]) 44 | np.random.shuffle(shuffle_idx) 45 | x_train = x_train[shuffle_idx] 46 | y_train = y_train[shuffle_idx] 47 | epoch_losses = [] 48 | for idx in range(0, x_train.shape[0], batch_size): 49 | cnt = cnt + 1 50 | optimizer.zero_grad() 51 | batch_x = x_train[idx : min(idx + batch_size, x_train.shape[0]),:] 52 | batch_y = y_train[idx : min(idx + batch_size, y_train.shape[0])] 53 | preds = model(batch_x) 54 | loss = loss_func(preds, batch_y) 55 | loss.backward() 56 | optimizer.step() 57 | epoch_losses.append(loss.cpu().detach().numpy()) 58 | 59 | if cnt >= best_cnt: 60 | break 61 | 62 | epoch_loss = np.mean(epoch_losses) 63 | 64 | return epoch_loss, cnt 65 | 66 | def rearrange(all_quantiles, quantile_low, quantile_high, test_preds): 67 | """ Produce monotonic quantiles 68 | 69 | Parameters 70 | ---------- 71 | 72 | all_quantiles : numpy array (q), grid of quantile levels in the range (0,1) 73 | quantile_low : float, desired low quantile in the range (0,1) 74 | quantile_high : float, desired high quantile in the range (0,1) 75 | test_preds : numpy array of predicted quantile (nXq) 76 | 77 | Returns 78 | ------- 79 | 80 | q_fixed : numpy array (nX2), containing the rearranged estimates of the 81 | desired low and high quantile 82 | 83 | References 84 | ---------- 85 | .. [1] Chernozhukov, Victor, Iván Fernández‐Val, and Alfred Galichon. 86 | "Quantile and probability curves without crossing." 87 | Econometrica 78.3 (2010): 1093-1125. 
88 | 89 | """ 90 | scaling = all_quantiles[-1] - all_quantiles[0] 91 | low_val = (quantile_low - all_quantiles[0])/scaling 92 | high_val = (quantile_high - all_quantiles[0])/scaling 93 | q_fixed = np.quantile(test_preds,(low_val, high_val),interpolation='linear',axis=1) 94 | return q_fixed.T 95 | 96 | ############################################################################### 97 | # Deep conditional mean regression 98 | # Minimizing MSE loss 99 | ############################################################################### 100 | 101 | # Define the network 102 | class mse_model(nn.Module): 103 | """ Conditional mean estimator, formulated as neural net 104 | """ 105 | 106 | def __init__(self, 107 | in_shape=1, 108 | hidden_size=64, 109 | dropout=0.5): 110 | """ Initialization 111 | 112 | Parameters 113 | ---------- 114 | 115 | in_shape : integer, input signal dimension (p) 116 | hidden_size : integer, hidden layer dimension 117 | dropout : float, dropout rate 118 | 119 | """ 120 | 121 | super().__init__() 122 | self.in_shape = in_shape 123 | self.out_shape = 1 124 | self.hidden_size = hidden_size 125 | self.dropout = dropout 126 | self.build_model() 127 | self.init_weights() 128 | 129 | def build_model(self): 130 | """ Construct the network 131 | """ 132 | self.base_model = nn.Sequential( 133 | nn.Linear(self.in_shape, self.hidden_size), 134 | nn.ReLU(), 135 | nn.Dropout(self.dropout), 136 | nn.Linear(self.hidden_size, self.hidden_size), 137 | nn.ReLU(), 138 | nn.Dropout(self.dropout), 139 | nn.Linear(self.hidden_size, 1), 140 | ) 141 | 142 | def init_weights(self): 143 | """ Initialize the network parameters 144 | """ 145 | for m in self.base_model: 146 | if isinstance(m, nn.Linear): 147 | nn.init.orthogonal_(m.weight) 148 | nn.init.constant_(m.bias, 0) 149 | 150 | def forward(self, x): 151 | """ Run forward pass 152 | """ 153 | return torch.squeeze(self.base_model(x)) 154 | 155 | # Define the training procedure 156 | class LearnerOptimized: 157 | """ Fit a neural network (conditional mean) to training data 158 | """ 159 | def __init__(self, model, optimizer_class, loss_func, device='cpu', test_ratio=0.2, random_state=0): 160 | """ Initialization 161 | 162 | Parameters 163 | ---------- 164 | 165 | model : class of neural network model 166 | optimizer_class : class of SGD optimizer (e.g. 
Adam) 167 | loss_func : loss to minimize 168 | device : string, "cuda:0" or "cpu" 169 | test_ratio : float, test size used in cross-validation (CV) 170 | random_state : int, seed to be used in CV when splitting to train-test 171 | 172 | """ 173 | self.model = model.to(device) 174 | self.optimizer_class = optimizer_class 175 | self.optimizer = optimizer_class(self.model.parameters()) 176 | self.loss_func = loss_func.to(device) 177 | self.device = device 178 | self.test_ratio = test_ratio 179 | self.random_state = random_state 180 | self.loss_history = [] 181 | self.test_loss_history = [] 182 | self.full_loss_history = [] 183 | 184 | def fit(self, x, y, epochs, batch_size, verbose=False): 185 | """ Fit the model to data 186 | 187 | Parameters 188 | ---------- 189 | 190 | x : numpy array, containing the training features (nXp) 191 | y : numpy array, containing the training labels (n) 192 | epochs : integer, maximal number of epochs 193 | batch_size : integer, mini-batch size for SGD 194 | 195 | """ 196 | 197 | sys.stdout.flush() 198 | model = copy.deepcopy(self.model) 199 | model = model.to(device) 200 | optimizer = self.optimizer_class(model.parameters()) 201 | best_epoch = epochs 202 | 203 | x_train, xx, y_train, yy = train_test_split(x, y, test_size=self.test_ratio,random_state=self.random_state) 204 | 205 | x_train = torch.from_numpy(x_train).float().to(self.device).requires_grad_(False) 206 | xx = torch.from_numpy(xx).float().to(self.device).requires_grad_(False) 207 | y_train = torch.from_numpy(y_train).float().to(self.device).requires_grad_(False) 208 | yy = torch.from_numpy(yy).float().to(self.device).requires_grad_(False) 209 | 210 | best_cnt = 1e10 211 | best_test_epoch_loss = 1e10 212 | 213 | cnt = 0 214 | for e in range(epochs): 215 | epoch_loss, cnt = epoch_internal_train(model, self.loss_func, x_train, y_train, batch_size, optimizer, cnt) 216 | self.loss_history.append(epoch_loss) 217 | 218 | # test 219 | model.eval() 220 | preds = model(xx) 221 | test_preds = preds.cpu().detach().numpy() 222 | test_preds = np.squeeze(test_preds) 223 | test_epoch_loss = self.loss_func(preds, yy).cpu().detach().numpy() 224 | 225 | self.test_loss_history.append(test_epoch_loss) 226 | 227 | if (test_epoch_loss <= best_test_epoch_loss): 228 | best_test_epoch_loss = test_epoch_loss 229 | best_epoch = e 230 | best_cnt = cnt 231 | 232 | if (e+1) % 100 == 0 and verbose: 233 | print("CV: Epoch {}: Train {}, Test {}, Best epoch {}, Best loss {}".format(e+1, epoch_loss, test_epoch_loss, best_epoch, best_test_epoch_loss)) 234 | sys.stdout.flush() 235 | 236 | # use all the data to train the model, for best_cnt steps 237 | x = torch.from_numpy(x).float().to(self.device).requires_grad_(False) 238 | y = torch.from_numpy(y).float().to(self.device).requires_grad_(False) 239 | 240 | cnt = 0 241 | for e in range(best_epoch+1): 242 | if cnt > best_cnt: 243 | break 244 | 245 | epoch_loss, cnt = epoch_internal_train(self.model, self.loss_func, x, y, batch_size, self.optimizer, cnt, best_cnt) 246 | self.full_loss_history.append(epoch_loss) 247 | 248 | if (e+1) % 100 == 0 and verbose: 249 | print("Full: Epoch {}: {}, cnt {}".format(e+1, epoch_loss, cnt)) 250 | sys.stdout.flush() 251 | 252 | def predict(self, x): 253 | """ Estimate the label given the features 254 | 255 | Parameters 256 | ---------- 257 | x : numpy array of training features (nXp) 258 | 259 | Returns 260 | ------- 261 | ret_val : numpy array of predicted labels (n) 262 | 263 | """ 264 | self.model.eval() 265 | ret_val = 
self.model(torch.from_numpy(x).to(self.device).requires_grad_(False)).cpu().detach().numpy() 266 | return ret_val 267 | 268 | 269 | ############################################################################## 270 | # Quantile regression 271 | # Implementation inspired by: 272 | # https://github.com/ceshine/quantile-regression-tensorflow 273 | ############################################################################## 274 | 275 | class AllQuantileLoss(nn.Module): 276 | """ Pinball loss function 277 | """ 278 | def __init__(self, quantiles): 279 | """ Initialize 280 | 281 | Parameters 282 | ---------- 283 | quantiles : pytorch vector of quantile levels, each in the range (0,1) 284 | 285 | 286 | """ 287 | super().__init__() 288 | self.quantiles = quantiles 289 | 290 | def forward(self, preds, target): 291 | """ Compute the pinball loss 292 | 293 | Parameters 294 | ---------- 295 | preds : pytorch tensor of estimated labels (n) 296 | target : pytorch tensor of true labels (n) 297 | 298 | Returns 299 | ------- 300 | loss : cost function value 301 | 302 | """ 303 | assert not target.requires_grad 304 | assert preds.size(0) == target.size(0) 305 | losses = [] 306 | 307 | for i, q in enumerate(self.quantiles): 308 | errors = target - preds[:, i] 309 | losses.append(torch.max((q-1) * errors, q * errors).unsqueeze(1)) 310 | 311 | loss = torch.mean(torch.sum(torch.cat(losses, dim=1), dim=1)) 312 | return loss 313 | 314 | 315 | class all_q_model(nn.Module): 316 | """ Conditional quantile estimator, formulated as neural net 317 | """ 318 | def __init__(self, 319 | quantiles, 320 | in_shape=1, 321 | hidden_size=64, 322 | dropout=0.5): 323 | """ Initialization 324 | 325 | Parameters 326 | ---------- 327 | quantiles : numpy array of quantile levels (q), each in the range (0,1) 328 | in_shape : integer, input signal dimension (p) 329 | hidden_size : integer, hidden layer dimension 330 | dropout : float, dropout rate 331 | 332 | """ 333 | super().__init__() 334 | self.quantiles = quantiles 335 | self.num_quantiles = len(quantiles) 336 | self.hidden_size = hidden_size 337 | self.in_shape = in_shape 338 | self.out_shape = len(quantiles) 339 | self.dropout = dropout 340 | self.build_model() 341 | self.init_weights() 342 | 343 | def build_model(self): 344 | """ Construct the network 345 | """ 346 | self.base_model = nn.Sequential( 347 | nn.Linear(self.in_shape, self.hidden_size), 348 | nn.ReLU(), 349 | nn.Dropout(self.dropout), 350 | nn.Linear(self.hidden_size, self.hidden_size), 351 | nn.ReLU(), 352 | nn.Dropout(self.dropout), 353 | nn.Linear(self.hidden_size, self.num_quantiles), 354 | ) 355 | 356 | def init_weights(self): 357 | """ Initialize the network parameters 358 | """ 359 | for m in self.base_model: 360 | if isinstance(m, nn.Linear): 361 | nn.init.orthogonal_(m.weight) 362 | nn.init.constant_(m.bias, 0) 363 | 364 | def forward(self, x): 365 | """ Run forward pass 366 | """ 367 | return self.base_model(x) 368 | 369 | class LearnerOptimizedCrossing: 370 | """ Fit a neural network (conditional quantile) to training data 371 | """ 372 | def __init__(self, model, optimizer_class, loss_func, device='cpu', test_ratio=0.2, random_state=0, 373 | qlow=0.05, qhigh=0.95, use_rearrangement=False): 374 | """ Initialization 375 | 376 | Parameters 377 | ---------- 378 | 379 | model : class of neural network model 380 | optimizer_class : class of SGD optimizer (e.g. 
pytorch's Adam) 381 | loss_func : loss to minimize 382 | device : string, "cuda:0" or "cpu" 383 | test_ratio : float, test size used in cross-validation (CV) 384 | random_state : integer, seed used in CV when splitting to train-test 385 | qlow : float, low quantile level in the range (0,1) 386 | qhigh : float, high quantile level in the range (0,1) 387 | use_rearrangement : boolean, use the rearrangement algorithm (True) 388 | of not (False) 389 | 390 | """ 391 | self.model = model.to(device) 392 | self.use_rearrangement = use_rearrangement 393 | self.compute_coverage = True 394 | self.quantile_low = qlow 395 | self.quantile_high = qhigh 396 | self.target_coverage = 100.0*(self.quantile_high - self.quantile_low) 397 | self.all_quantiles = loss_func.quantiles 398 | self.optimizer_class = optimizer_class 399 | self.optimizer = optimizer_class(self.model.parameters()) 400 | self.loss_func = loss_func.to(device) 401 | self.device = device 402 | self.test_ratio = test_ratio 403 | self.random_state = random_state 404 | self.loss_history = [] 405 | self.test_loss_history = [] 406 | self.full_loss_history = [] 407 | 408 | def fit(self, x, y, epochs, batch_size, verbose=False): 409 | """ Fit the model to data 410 | 411 | Parameters 412 | ---------- 413 | 414 | x : numpy array of training features (nXp) 415 | y : numpy array of training labels (n) 416 | epochs : integer, maximal number of epochs 417 | batch_size : integer, mini-batch size used in SGD solver 418 | 419 | """ 420 | sys.stdout.flush() 421 | model = copy.deepcopy(self.model) 422 | model = model.to(device) 423 | optimizer = self.optimizer_class(model.parameters()) 424 | best_epoch = epochs 425 | 426 | x_train, xx, y_train, yy = train_test_split(x, 427 | y, 428 | test_size=self.test_ratio, 429 | random_state=self.random_state) 430 | 431 | x_train = torch.from_numpy(x_train).float().to(self.device).requires_grad_(False) 432 | xx = torch.from_numpy(xx).float().to(self.device).requires_grad_(False) 433 | y_train = torch.from_numpy(y_train).float().to(self.device).requires_grad_(False) 434 | yy_cpu = yy 435 | yy = torch.from_numpy(yy).float().to(self.device).requires_grad_(False) 436 | 437 | best_avg_length = 1e10 438 | best_coverage = 0 439 | best_cnt = 1e10 440 | 441 | cnt = 0 442 | for e in range(epochs): 443 | model.train() 444 | epoch_loss, cnt = epoch_internal_train(model, self.loss_func, x_train, y_train, batch_size, optimizer, cnt) 445 | self.loss_history.append(epoch_loss) 446 | 447 | model.eval() 448 | preds = model(xx) 449 | test_epoch_loss = self.loss_func(preds, yy).cpu().detach().numpy() 450 | self.test_loss_history.append(test_epoch_loss) 451 | 452 | test_preds = preds.cpu().detach().numpy() 453 | test_preds = np.squeeze(test_preds) 454 | 455 | if self.use_rearrangement: 456 | test_preds = rearrange(self.all_quantiles, self.quantile_low, self.quantile_high, test_preds) 457 | 458 | y_lower = test_preds[:,0] 459 | y_upper = test_preds[:,1] 460 | coverage, avg_length = helper.compute_coverage_len(yy_cpu, y_lower, y_upper) 461 | 462 | if (coverage >= self.target_coverage) and (avg_length < best_avg_length): 463 | best_avg_length = avg_length 464 | best_coverage = coverage 465 | best_epoch = e 466 | best_cnt = cnt 467 | 468 | if (e+1) % 100 == 0 and verbose: 469 | print("CV: Epoch {}: Train {}, Test {}, Best epoch {}, Best Coverage {} Best Length {} Cur Coverage {}".format(e+1, epoch_loss, test_epoch_loss, best_epoch, best_coverage, best_avg_length, coverage)) 470 | sys.stdout.flush() 471 | 472 | x = 
torch.from_numpy(x).float().to(self.device).requires_grad_(False) 473 | y = torch.from_numpy(y).float().to(self.device).requires_grad_(False) 474 | 475 | cnt = 0 476 | for e in range(best_epoch+1): 477 | if cnt > best_cnt: 478 | break 479 | epoch_loss, cnt = epoch_internal_train(self.model, self.loss_func, x, y, batch_size, self.optimizer, cnt, best_cnt) 480 | self.full_loss_history.append(epoch_loss) 481 | 482 | if (e+1) % 100 == 0 and verbose: 483 | print("Full: Epoch {}: {}, cnt {}".format(e+1, epoch_loss, cnt)) 484 | sys.stdout.flush() 485 | 486 | def predict(self, x): 487 | """ Estimate the conditional low and high quantile given the features 488 | 489 | Parameters 490 | ---------- 491 | x : numpy array of training features (nXp) 492 | 493 | Returns 494 | ------- 495 | test_preds : numpy array of predicted low and high quantiles (nX2) 496 | 497 | """ 498 | self.model.eval() 499 | test_preds = self.model(torch.from_numpy(x).to(self.device).requires_grad_(False)).cpu().detach().numpy() 500 | if self.use_rearrangement: 501 | test_preds = rearrange(self.all_quantiles, self.quantile_low, self.quantile_high, test_preds) 502 | else: 503 | test_preds[:,0] = np.min(test_preds,axis=1) 504 | test_preds[:,1] = np.max(test_preds,axis=1) 505 | return test_preds 506 | -------------------------------------------------------------------------------- /cqr/tune_params_cv.py: -------------------------------------------------------------------------------- 1 | 2 | from cqr import helper 3 | from skgarden import RandomForestQuantileRegressor 4 | from sklearn.model_selection import train_test_split 5 | 6 | 7 | def CV_quntiles_rf(params, 8 | X, 9 | y, 10 | target_coverage, 11 | grid_q, 12 | test_ratio, 13 | random_state, 14 | coverage_factor=0.9): 15 | """ Tune the low and high quantile level parameters of quantile random 16 | forests method, using cross-validation 17 | 18 | Parameters 19 | ---------- 20 | params : dictionary of parameters 21 | params["random_state"] : integer, seed for splitting the data 22 | in cross-validation. Also used as the 23 | seed in quantile random forest (QRF) 24 | params["min_samples_leaf"] : integer, parameter of QRF 25 | params["n_estimators"] : integer, parameter of QRF 26 | params["max_features"] : integer, parameter of QRF 27 | X : numpy array, containing the training features (nXp) 28 | y : numpy array, containing the training labels (n) 29 | target_coverage : desired coverage of prediction band. The output coverage 30 | may be smaller if coverage_factor <= 1, in this case the 31 | target will be modified to target_coverage*coverage_factor 32 | grid_q : numpy array, of low and high quantile levels to test 33 | test_ratio : float, test size of the held-out data 34 | random_state : integer, seed for splitting the data in cross-validation. 35 | Also used as the seed in QRF. 36 | coverage_factor : float, when tuning the two QRF quantile levels one may 37 | ask for prediction band with smaller average coverage, 38 | equal to coverage_factor*(q_high - q_low) to avoid too 39 | conservative estimation of the prediction band 40 | 41 | Returns 42 | ------- 43 | best_q : numpy array of low and high quantile levels (length 2) 44 | 45 | References 46 | ---------- 47 | .. [1] Meinshausen, Nicolai. "Quantile regression forests." 48 | Journal of Machine Learning Research 7.Jun (2006): 983-999. 
49 | 50 | """ 51 | target_coverage = coverage_factor*target_coverage 52 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_ratio,random_state=random_state) 53 | best_avg_length = 1e10 54 | best_q = grid_q[0] 55 | 56 | rf = RandomForestQuantileRegressor(random_state=params["random_state"], 57 | min_samples_leaf=params["min_samples_leaf"], 58 | n_estimators=params["n_estimators"], 59 | max_features=params["max_features"]) 60 | rf.fit(X_train, y_train) 61 | 62 | for q in grid_q: 63 | y_lower = rf.predict(X_test, quantile=q[0]) 64 | y_upper = rf.predict(X_test, quantile=q[1]) 65 | coverage, avg_length = helper.compute_coverage_len(y_test, y_lower, y_upper) 66 | if (coverage >= target_coverage) and (avg_length < best_avg_length): 67 | best_avg_length = avg_length 68 | best_q = q 69 | else: 70 | break 71 | return best_q 72 | -------------------------------------------------------------------------------- /cqr_real_data_example.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Conformalized quantile regression (CQR): Real data experiment\n", 8 | "\n", 9 | "In this tutorial we will load a real dataset and construct prediction intervals using CQR [1].\n", 10 | "\n", 11 | "[1] Yaniv Romano, Evan Patterson, and Emmanuel J. Candes, “Conformalized quantile regression.” 2019.\n", 12 | "\n", 13 | "## Prediction intervals\n", 14 | "\n", 15 | "Suppose we are given $ n $ training samples $ \\{(X_i, Y_i)\\}_{i=1}^n$ and we must now predict the unknown value of $Y_{n+1}$ at a test point $X_{n+1}$. We assume that all the samples $ \\{(X_i,Y_i)\\}_{i=1}^{n+1} $ are drawn exchangeably$-$for instance, they may be drawn i.i.d.$-$from an arbitrary joint distribution $P_{XY}$ over the feature vectors $ X\\in \\mathbb{R}^p $ and response variables $ Y\\in \\mathbb{R} $. We aim to construct a marginal distribution-free prediction interval $C(X_{n+1}) \\subseteq \\mathbb{R}$ that is likely to contain the unknown response $Y_{n+1} $. That is, given a desired miscoverage rate $ \\alpha $, we ask that\n", 16 | "$$ \\mathbb{P}\\{Y_{n+1} \\in C(X_{n+1})\\} \\geq 1-\\alpha $$\n", 17 | "for any joint distribution $ P_{XY} $ and any sample size $n$. The probability in this statement is marginal, being taken over all the samples $ \\{(X_i, Y_i)\\}_{i=1}^{n+1} $.\n", 18 | "\n", 19 | "To accomplish this, we build on the method of split conformal prediction. We first split the training data into two disjoint subsets, a proper training set and a calibration set. We fit two quantile regressors on the proper training set to obtain initial estimates of the lower and upper bounds of the prediction interval. Then, using the calibration set, we conformalize and, if necessary, correct this prediction interval. Unlike the original interval, the conformalized prediction interval is guaranteed to satisfy the coverage requirement regardless of the choice or accuracy of the quantile regression estimator.\n", 20 | "\n", 21 | "\n", 22 | "\n", 23 | "## A case study\n", 24 | "\n", 25 | "We start by importing several libraries, loading the real dataset and standardize its features and response. We set the target miscoverage rate $\\alpha$ to 0.1." 
26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": {}, 32 | "outputs": [ 33 | { 34 | "name": "stdout", 35 | "output_type": "stream", 36 | "text": [ 37 | "Dataset: community\n", 38 | "Dimensions: train set (n=1595, p=100) ; test set (n=399, p=100)\n" 39 | ] 40 | } 41 | ], 42 | "source": [ 43 | "import torch\n", 44 | "import random\n", 45 | "import numpy as np\n", 46 | "np.warnings.filterwarnings('ignore')\n", 47 | "\n", 48 | "from datasets import datasets\n", 49 | "from sklearn.preprocessing import StandardScaler\n", 50 | "from sklearn.model_selection import train_test_split\n", 51 | "\n", 52 | "seed = 1\n", 53 | "\n", 54 | "random_state_train_test = seed\n", 55 | "random.seed(seed)\n", 56 | "np.random.seed(seed)\n", 57 | "torch.manual_seed(seed)\n", 58 | "if torch.cuda.is_available():\n", 59 | " torch.cuda.manual_seed_all(seed)\n", 60 | " \n", 61 | "# desired miscoverage error\n", 62 | "alpha = 0.1\n", 63 | "\n", 64 | "# desired quanitile levels\n", 65 | "quantiles = [0.05, 0.95]\n", 66 | "\n", 67 | "# used to determine the size of test set\n", 68 | "test_ratio = 0.2\n", 69 | "\n", 70 | "# name of dataset\n", 71 | "dataset_base_path = \"./datasets/\"\n", 72 | "dataset_name = \"community\"\n", 73 | "\n", 74 | "# load the dataset\n", 75 | "X, y = datasets.GetDataset(dataset_name, dataset_base_path)\n", 76 | "\n", 77 | "# divide the dataset into test and train based on the test_ratio parameter\n", 78 | "x_train, x_test, y_train, y_test = train_test_split(X,\n", 79 | " y,\n", 80 | " test_size=test_ratio,\n", 81 | " random_state=random_state_train_test)\n", 82 | "\n", 83 | "# reshape the data\n", 84 | "x_train = np.asarray(x_train)\n", 85 | "y_train = np.asarray(y_train)\n", 86 | "x_test = np.asarray(x_test)\n", 87 | "y_test = np.asarray(y_test)\n", 88 | "\n", 89 | "# compute input dimensions\n", 90 | "n_train = x_train.shape[0]\n", 91 | "in_shape = x_train.shape[1]\n", 92 | "\n", 93 | "# display basic information\n", 94 | "print(\"Dataset: %s\" % (dataset_name))\n", 95 | "print(\"Dimensions: train set (n=%d, p=%d) ; test set (n=%d, p=%d)\" % \n", 96 | " (x_train.shape[0], x_train.shape[1], x_test.shape[0], x_test.shape[1]))" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "## Data splitting\n", 104 | "\n", 105 | "We begin by splitting the data into a proper training set and a calibration set. Recall that the main idea is to fit a regression model on the proper training samples, then use the residuals on a held-out validation set to quantify the uncertainty in future predictions." 
106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": 3, 111 | "metadata": {}, 112 | "outputs": [], 113 | "source": [ 114 | "# divide the data into proper training set and calibration set\n", 115 | "idx = np.random.permutation(n_train)\n", 116 | "n_half = int(np.floor(n_train/2))\n", 117 | "idx_train, idx_cal = idx[:n_half], idx[n_half:2*n_half]\n", 118 | "\n", 119 | "# zero mean and unit variance scaling \n", 120 | "scalerX = StandardScaler()\n", 121 | "scalerX = scalerX.fit(x_train[idx_train])\n", 122 | "\n", 123 | "# scale\n", 124 | "x_train = scalerX.transform(x_train)\n", 125 | "x_test = scalerX.transform(x_test)\n", 126 | "\n", 127 | "# scale the labels by dividing each by the mean absolute response\n", 128 | "mean_y_train = np.mean(np.abs(y_train[idx_train]))\n", 129 | "y_train = np.squeeze(y_train)/mean_y_train\n", 130 | "y_test = np.squeeze(y_test)/mean_y_train" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "## CQR random forests\n", 138 | "\n", 139 | "Given these two subsets, we now turn to conformalize the initial prediction interval constructed by quantile random forests [2]. Below, we set the hyper-parameters of the CQR random forests method.\n", 140 | "\n", 141 | "[2] Meinshausen Nicolai. \"Quantile regression forests.\" Journal of Machine Learning Research 7, no. Jun (2006): 983-999." 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 4, 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "#########################################################\n", 151 | "# Quantile random forests parameters\n", 152 | "# (See QuantileForestRegressorAdapter class in helper.py)\n", 153 | "#########################################################\n", 154 | "\n", 155 | "# the number of trees in the forest\n", 156 | "n_estimators = 1000\n", 157 | "\n", 158 | "# the minimum number of samples required to be at a leaf node\n", 159 | "# (default skgarden's parameter)\n", 160 | "min_samples_leaf = 1\n", 161 | "\n", 162 | "# the number of features to consider when looking for the best split\n", 163 | "# (default skgarden's parameter)\n", 164 | "max_features = x_train.shape[1]\n", 165 | "\n", 166 | "# target quantile levels\n", 167 | "quantiles_forest = [quantiles[0]*100, quantiles[1]*100]\n", 168 | "\n", 169 | "# use cross-validation to tune the quantile levels?\n", 170 | "cv_qforest = True\n", 171 | "\n", 172 | "# when tuning the two QRF quantile levels one may\n", 173 | "# ask for a prediction band with smaller average coverage\n", 174 | "# to avoid too conservative estimation of the prediction band\n", 175 | "# This would be equal to coverage_factor*(quantiles[1] - quantiles[0])\n", 176 | "coverage_factor = 0.85\n", 177 | "\n", 178 | "# ratio of held-out data, used in cross-validation\n", 179 | "cv_test_ratio = 0.05\n", 180 | "\n", 181 | "# seed for splitting the data in cross-validation.\n", 182 | "# Also used as the seed in quantile random forests function\n", 183 | "cv_random_state = 1\n", 184 | "\n", 185 | "# determines the lowest and highest quantile level parameters.\n", 186 | "# This is used when tuning the quanitle levels by cross-validation.\n", 187 | "# The smallest value is equal to quantiles[0] - range_vals.\n", 188 | "# Similarly, the largest value is equal to quantiles[1] + range_vals.\n", 189 | "cv_range_vals = 30\n", 190 | "\n", 191 | "# sweep over a grid of length num_vals when tuning QRF's quantile parameters \n", 192 | "cv_num_vals = 10" 193 | ] 194 | }, 195 | 
{ 196 | "cell_type": "markdown", 197 | "metadata": {}, 198 | "source": [ 199 | "### Symmetric nonconformity score \n", 200 | "\n", 201 | "In the following cell we run the entire CQR procedure. The class `QuantileForestRegressorAdapter` defines the underlying estimator. The class `RegressorNc` defines the CQR object, which uses `QuantileRegErrFunc` as the nonconformity score. The function `run_icp` fits the regression function to the proper training set, corrects (if required) the initial estimate of the prediction interval using the calibration set, and returns the conformal band. Lastly, we compute the average coverage and length on future test data using `compute_coverage`." 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 5, 207 | "metadata": {}, 208 | "outputs": [ 209 | { 210 | "name": "stdout", 211 | "output_type": "stream", 212 | "text": [ 213 | "CQR Random Forests: Percentage in the range (expecting 90.00): 91.228070\n", 214 | "CQR Random Forests: Average length: 1.355441\n" 215 | ] 216 | } 217 | ], 218 | "source": [ 219 | "from cqr import helper\n", 220 | "from nonconformist.nc import RegressorNc\n", 221 | "from nonconformist.nc import QuantileRegErrFunc\n", 222 | "\n", 223 | "# define the QRF's parameters \n", 224 | "params_qforest = dict()\n", 225 | "params_qforest[\"n_estimators\"] = n_estimators\n", 226 | "params_qforest[\"min_samples_leaf\"] = min_samples_leaf\n", 227 | "params_qforest[\"max_features\"] = max_features\n", 228 | "params_qforest[\"CV\"] = cv_qforest\n", 229 | "params_qforest[\"coverage_factor\"] = coverage_factor\n", 230 | "params_qforest[\"test_ratio\"] = cv_test_ratio\n", 231 | "params_qforest[\"random_state\"] = cv_random_state\n", 232 | "params_qforest[\"range_vals\"] = cv_range_vals\n", 233 | "params_qforest[\"num_vals\"] = cv_num_vals\n", 234 | "\n", 235 | "# define QRF model\n", 236 | "quantile_estimator = helper.QuantileForestRegressorAdapter(model=None,\n", 237 | " fit_params=None,\n", 238 | " quantiles=quantiles_forest,\n", 239 | " params=params_qforest)\n", 240 | " \n", 241 | "# define the CQR object\n", 242 | "nc = RegressorNc(quantile_estimator, QuantileRegErrFunc())\n", 243 | "\n", 244 | "# run CQR procedure\n", 245 | "y_lower, y_upper = helper.run_icp(nc, x_train, y_train, x_test, idx_train, idx_cal, alpha)\n", 246 | "\n", 247 | "# compute and print average coverage and average length\n", 248 | "coverage_cp_qforest, length_cp_qforest = helper.compute_coverage(y_test,\n", 249 | " y_lower,\n", 250 | " y_upper,\n", 251 | " alpha,\n", 252 | " \"CQR Random Forests\")" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "As can be seen, we obtained valid coverage.\n", 260 | "\n", 261 | "### Asymmetric nonconformity score \n", 262 | "\n", 263 | "The nonconformity score function `QuantileRegErrFunc` treats the left and right tails symmetrically, but if the error distribution is significantly skewed, one may choose to treat them asymmetrically. This can be done by replacing `QuantileRegErrFunc` with `QuantileRegAsymmetricErrFunc`, as implemented in the following cell."
264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": 6, 269 | "metadata": {}, 270 | "outputs": [ 271 | { 272 | "name": "stdout", 273 | "output_type": "stream", 274 | "text": [ 275 | "Asymmetric CQR Random Forests: Percentage in the range (expecting 90.00): 90.726817\n", 276 | "Asymmetric CQR Random Forests: Average length: 1.480756\n" 277 | ] 278 | } 279 | ], 280 | "source": [ 281 | "from nonconformist.nc import QuantileRegAsymmetricErrFunc\n", 282 | "\n", 283 | "# define QRF model\n", 284 | "quantile_estimator = helper.QuantileForestRegressorAdapter(model=None,\n", 285 | " fit_params=None,\n", 286 | " quantiles=quantiles_forest,\n", 287 | " params=params_qforest)\n", 288 | " \n", 289 | "# define the CQR object\n", 290 | "nc = RegressorNc(quantile_estimator, QuantileRegAsymmetricErrFunc())\n", 291 | "\n", 292 | "# run CQR procedure\n", 293 | "y_lower, y_upper = helper.run_icp(nc, x_train, y_train, x_test, idx_train, idx_cal, alpha)\n", 294 | "\n", 295 | "# compute and print average coverage and average length\n", 296 | "coverage_cp_qforest, length_cp_qforest = helper.compute_coverage(y_test,\n", 297 | " y_lower,\n", 298 | " y_upper,\n", 299 | " alpha,\n", 300 | " \"Asymmetric CQR Random Forests\")" 301 | ] 302 | }, 303 | { 304 | "cell_type": "markdown", 305 | "metadata": {}, 306 | "source": [ 307 | "Above, we also obtained valid coverage.\n", 308 | "\n", 309 | "\n", 310 | "## CQR neural net\n", 311 | "\n", 312 | "In what follows we will use a neural network as the underlying quantile regression method. Below, we set the hyper-parameters of the CQR neural network method." 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": 7, 318 | "metadata": {}, 319 | "outputs": [], 320 | "source": [ 321 | "#####################################################\n", 322 | "# Neural network parameters\n", 323 | "# (See AllQNet_RegressorAdapter class in helper.py)\n", 324 | "#####################################################\n", 325 | "\n", 326 | "# pytorch's optimizer object\n", 327 | "nn_learn_func = torch.optim.Adam\n", 328 | "\n", 329 | "# number of epochs\n", 330 | "epochs = 1000\n", 331 | "\n", 332 | "# learning rate\n", 333 | "lr = 0.0005\n", 334 | "\n", 335 | "# mini-batch size\n", 336 | "batch_size = 64\n", 337 | "\n", 338 | "# hidden dimension of the network\n", 339 | "hidden_size = 64\n", 340 | "\n", 341 | "# dropout regularization rate\n", 342 | "dropout = 0.1\n", 343 | "\n", 344 | "# weight decay regularization\n", 345 | "wd = 1e-6\n", 346 | "\n", 347 | "# Ask for a reduced coverage when tuning the network parameters by \n", 348 | "# cross-validation to avoid too conservative initial estimation of the \n", 349 | "# prediction interval. This estimation will be conformalized by CQR.\n", 350 | "quantiles_net = [0.1, 0.9]" 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "metadata": {}, 356 | "source": [ 357 | "We now turn to invoke the CQR procedure. The class `AllQNet_RegressorAdapter` defines the underlying neural network estimator. Just as before, `RegressorNc` defines the CQR object, which uses `QuantileRegErrFunc` as the nonconformity score. The function `run_icp` returns the conformal band, computed on test data. Lastly, we compute the average coverage and length using `compute_coverage`."
358 | ] 359 | }, 360 | { 361 | "cell_type": "code", 362 | "execution_count": 8, 363 | "metadata": {}, 364 | "outputs": [ 365 | { 366 | "name": "stdout", 367 | "output_type": "stream", 368 | "text": [ 369 | "CQR Neural Net: Percentage in the range (expecting 90.00): 90.225564\n", 370 | "CQR Neural Net: Average length: 1.502654\n" 371 | ] 372 | } 373 | ], 374 | "source": [ 375 | "# define quantile neural network model\n", 376 | "quantile_estimator = helper.AllQNet_RegressorAdapter(model=None,\n", 377 | " fit_params=None,\n", 378 | " in_shape=in_shape,\n", 379 | " hidden_size=hidden_size,\n", 380 | " quantiles=quantiles_net,\n", 381 | " learn_func=nn_learn_func,\n", 382 | " epochs=epochs,\n", 383 | " batch_size=batch_size,\n", 384 | " dropout=dropout,\n", 385 | " lr=lr,\n", 386 | " wd=wd,\n", 387 | " test_ratio=cv_test_ratio,\n", 388 | " random_state=cv_random_state,\n", 389 | " use_rearrangement=False)\n", 390 | "\n", 391 | "# define a CQR object, which computes the absolute residual error of points \n", 392 | "# located outside the estimated quantile neural network band \n", 393 | "nc = RegressorNc(quantile_estimator, QuantileRegErrFunc())\n", 394 | "\n", 395 | "# run CQR procedure\n", 396 | "y_lower, y_upper = helper.run_icp(nc, x_train, y_train, x_test, idx_train, idx_cal, alpha)\n", 397 | "\n", 398 | "# compute and print average coverage and average length\n", 399 | "coverage_cp_qnet, length_cp_qnet = helper.compute_coverage(y_test,\n", 400 | " y_lower,\n", 401 | " y_upper,\n", 402 | " alpha,\n", 403 | " \"CQR Neural Net\")" 404 | ] 405 | }, 406 | { 407 | "cell_type": "markdown", 408 | "metadata": {}, 409 | "source": [ 410 | "Above, we can see that the prediction interval constructed by CQR Neural Net is also valid. Notice the difference in the average length between the two methods (CQR Neural Net and CQR Random Forests). \n", 411 | "\n", 412 | "## CQR neural net with rearrangement\n", 413 | "\n", 414 | "Crossing quantiles is a longstanding problem in quantile regression. This issue does not affect the validity guarantee of CQR, as it holds regardless of the accuracy or choice of the quantile regression method. However, it may affect the efficiency of the resulting conformal band.\n", 415 | "\n", 416 | "Below we use the rearrangement method [3] to bypass the crossing quantile problem. Notice that we pass `use_rearrangement=True` as an argument to `AllQNet_RegressorAdapter`.\n", 417 | "\n", 418 | "[3] Chernozhukov Victor, Iván Fernández‐Val, and Alfred Galichon. “Quantile and probability curves without crossing.” Econometrica 78, no. 3 (2010): 1093-1125."
419 | ] 420 | }, 421 | { 422 | "cell_type": "code", 423 | "execution_count": 9, 424 | "metadata": {}, 425 | "outputs": [ 426 | { 427 | "name": "stdout", 428 | "output_type": "stream", 429 | "text": [ 430 | "CQR Rearrangement Neural Net: Percentage in the range (expecting 90.00): 89.974937\n", 431 | "CQR Rearrangement Neural Net: Average length: 1.476710\n" 432 | ] 433 | } 434 | ], 435 | "source": [ 436 | "# define quantile neural network model, using the rearrangement algorithm\n", 437 | "quantile_estimator = helper.AllQNet_RegressorAdapter(model=None,\n", 438 | " fit_params=None,\n", 439 | " in_shape=in_shape,\n", 440 | " hidden_size=hidden_size,\n", 441 | " quantiles=quantiles_net,\n", 442 | " learn_func=nn_learn_func,\n", 443 | " epochs=epochs,\n", 444 | " batch_size=batch_size,\n", 445 | " dropout=dropout,\n", 446 | " lr=lr,\n", 447 | " wd=wd,\n", 448 | " test_ratio=cv_test_ratio,\n", 449 | " random_state=cv_random_state,\n", 450 | " use_rearrangement=True)\n", 451 | "\n", 452 | "# define the CQR object, computing the absolute residual error of points \n", 453 | "# located outside the estimated quantile neural network band \n", 454 | "nc = RegressorNc(quantile_estimator, QuantileRegErrFunc())\n", 455 | "\n", 456 | "# run CQR procedure\n", 457 | "y_lower, y_upper = helper.run_icp(nc, x_train, y_train, x_test, idx_train, idx_cal, alpha)\n", 458 | "\n", 459 | "# compute and print average coverage and average length\n", 460 | "coverage_cp_re_qnet, length_cp_re_qnet = helper.compute_coverage(y_test,\n", 461 | " y_lower,\n", 462 | " y_upper,\n", 463 | " alpha,\n", 464 | " \"CQR Rearrangement Neural Net\")" 465 | ] 466 | } 467 | ], 468 | "metadata": { 469 | "kernelspec": { 470 | "display_name": "Python 3", 471 | "language": "python", 472 | "name": "python3" 473 | }, 474 | "language_info": { 475 | "codemirror_mode": { 476 | "name": "ipython", 477 | "version": 3 478 | }, 479 | "file_extension": ".py", 480 | "mimetype": "text/x-python", 481 | "name": "python", 482 | "nbconvert_exporter": "python", 483 | "pygments_lexer": "ipython3", 484 | "version": "3.7.3" 485 | } 486 | }, 487 | "nbformat": 4, 488 | "nbformat_minor": 2 489 | } 490 | -------------------------------------------------------------------------------- /datasets/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yromano/cqr/73267abb7ed7d3c6dad6ab4449154db7ec306535/datasets/.DS_Store -------------------------------------------------------------------------------- /datasets/README.md: -------------------------------------------------------------------------------- 1 | 2 | ## Publicly Available Datasets 3 | 4 | * Please download the file blogData_train.csv from [this link](https://archive.ics.uci.edu/ml/datasets/BlogFeedback), and save it in this directory. 5 | 6 | * Please download the files Features_Variant_1.csv and Features_Variant_2.csv from 7 | [this link](https://archive.ics.uci.edu/ml/datasets/Facebook+Comment+Volume+Dataset) and store the two under the ./facebook/ directory. 8 | 9 | ## Data subject to copyright/usage rules 10 | 11 | Please follow the instructions in [this README](https://github.com/yromano/cqr/blob/master/get_meps_data/README.md) file, which describes how to download and process the MEPS datasets. 12 | 13 | Once downloaded, copy the three files 'meps_19_reg.csv', 'meps_20_reg.csv', and 'meps_21_reg.csv' to this folder.
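Once the files above are in place, the datasets can be loaded through the `GetDataset` helper defined in `datasets.py` (included below). A minimal sketch, assuming Python 3, the repository root as the working directory, and that `blogData_train.csv` has already been downloaded to this folder:

```python
from datasets.datasets import GetDataset  # defined in datasets/datasets.py, shown below

# dataset names follow datasets.py, e.g. "blog_data", "facebook_1", "bio",
# "star", "concrete", "community" and "bike"
X, y = GetDataset("blog_data", "datasets/")
print(X.shape, y.shape)  # numpy float32 feature matrix and response vector
```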
14 | -------------------------------------------------------------------------------- /datasets/communities_attributes.csv: -------------------------------------------------------------------------------- 1 | attributes 2 | state 3 | county 4 | community 5 | communityname 6 | fold 7 | population 8 | householdsize 9 | racepctblack 10 | racePctWhite 11 | racePctAsian 12 | racePctHisp 13 | agePct12t21 14 | agePct12t29 15 | agePct16t24 16 | agePct65up 17 | numbUrban 18 | pctUrban 19 | medIncome 20 | pctWWage 21 | pctWFarmSelf 22 | pctWInvInc 23 | pctWSocSec 24 | pctWPubAsst 25 | pctWRetire 26 | medFamInc 27 | perCapInc 28 | whitePerCap 29 | blackPerCap 30 | indianPerCap 31 | AsianPerCap 32 | OtherPerCap 33 | HispPerCap 34 | NumUnderPov 35 | PctPopUnderPov 36 | PctLess9thGrade 37 | PctNotHSGrad 38 | PctBSorMore 39 | PctUnemployed 40 | PctEmploy 41 | PctEmplManu 42 | PctEmplProfServ 43 | PctOccupManu 44 | PctOccupMgmtProf 45 | MalePctDivorce 46 | MalePctNevMarr 47 | FemalePctDiv 48 | TotalPctDiv 49 | PersPerFam 50 | PctFam2Par 51 | PctKids2Par 52 | PctYoungKids2Par 53 | PctTeen2Par 54 | PctWorkMomYoungKids 55 | PctWorkMom 56 | NumIlleg 57 | PctIlleg 58 | NumImmig 59 | PctImmigRecent 60 | PctImmigRec5 61 | PctImmigRec8 62 | PctImmigRec10 63 | PctRecentImmig 64 | PctRecImmig5 65 | PctRecImmig8 66 | PctRecImmig10 67 | PctSpeakEnglOnly 68 | PctNotSpeakEnglWell 69 | PctLargHouseFam 70 | PctLargHouseOccup 71 | PersPerOccupHous 72 | PersPerOwnOccHous 73 | PersPerRentOccHous 74 | PctPersOwnOccup 75 | PctPersDenseHous 76 | PctHousLess3BR 77 | MedNumBR 78 | HousVacant 79 | PctHousOccup 80 | PctHousOwnOcc 81 | PctVacantBoarded 82 | PctVacMore6Mos 83 | MedYrHousBuilt 84 | PctHousNoPhone 85 | PctWOFullPlumb 86 | OwnOccLowQuart 87 | OwnOccMedVal 88 | OwnOccHiQuart 89 | RentLowQ 90 | RentMedian 91 | RentHighQ 92 | MedRent 93 | MedRentPctHousInc 94 | MedOwnCostPctInc 95 | MedOwnCostPctIncNoMtg 96 | NumInShelters 97 | NumStreet 98 | PctForeignBorn 99 | PctBornSameState 100 | PctSameHouse85 101 | PctSameCity85 102 | PctSameState85 103 | LemasSwornFT 104 | LemasSwFTPerPop 105 | LemasSwFTFieldOps 106 | LemasSwFTFieldPerPop 107 | LemasTotalReq 108 | LemasTotReqPerPop 109 | PolicReqPerOffic 110 | PolicPerPop 111 | RacialMatchCommPol 112 | PctPolicWhite 113 | PctPolicBlack 114 | PctPolicHisp 115 | PctPolicAsian 116 | PctPolicMinor 117 | OfficAssgnDrugUnits 118 | NumKindsDrugsSeiz 119 | PolicAveOTWorked 120 | LandArea 121 | PopDens 122 | PctUsePubTrans 123 | PolicCars 124 | PolicOperBudg 125 | LemasPctPolicOnPatr 126 | LemasGangUnitDeploy 127 | LemasPctOfficDrugUn 128 | PolicBudgPerPop 129 | ViolentCrimesPerPop 130 | -------------------------------------------------------------------------------- /datasets/datasets.py: -------------------------------------------------------------------------------- 1 | 2 | import numpy as np 3 | import pandas as pd 4 | 5 | 6 | def GetDataset(name, base_path): 7 | """ Load a dataset 8 | 9 | Parameters 10 | ---------- 11 | name : string, dataset name 12 | base_path : string, e.g. 
"path/to/datasets/directory/" 13 | 14 | Returns 15 | ------- 16 | X : features (nXp) 17 | y : labels (n) 18 | 19 | """ 20 | if name=="meps_19": 21 | df = pd.read_csv(base_path + 'meps_19_reg_fix.csv') 22 | column_names = df.columns 23 | response_name = "UTILIZATION_reg" 24 | column_names = column_names[column_names!=response_name] 25 | column_names = column_names[column_names!="Unnamed: 0"] 26 | 27 | col_names = ['AGE', 'PCS42', 'MCS42', 'K6SUM42', 'PERWT15F', 'REGION=1', 28 | 'REGION=2', 'REGION=3', 'REGION=4', 'SEX=1', 'SEX=2', 'MARRY=1', 29 | 'MARRY=2', 'MARRY=3', 'MARRY=4', 'MARRY=5', 'MARRY=6', 'MARRY=7', 30 | 'MARRY=8', 'MARRY=9', 'MARRY=10', 'FTSTU=-1', 'FTSTU=1', 'FTSTU=2', 31 | 'FTSTU=3', 'ACTDTY=1', 'ACTDTY=2', 'ACTDTY=3', 'ACTDTY=4', 32 | 'HONRDC=1', 'HONRDC=2', 'HONRDC=3', 'HONRDC=4', 'RTHLTH=-1', 33 | 'RTHLTH=1', 'RTHLTH=2', 'RTHLTH=3', 'RTHLTH=4', 'RTHLTH=5', 34 | 'MNHLTH=-1', 'MNHLTH=1', 'MNHLTH=2', 'MNHLTH=3', 'MNHLTH=4', 35 | 'MNHLTH=5', 'HIBPDX=-1', 'HIBPDX=1', 'HIBPDX=2', 'CHDDX=-1', 36 | 'CHDDX=1', 'CHDDX=2', 'ANGIDX=-1', 'ANGIDX=1', 'ANGIDX=2', 37 | 'MIDX=-1', 'MIDX=1', 'MIDX=2', 'OHRTDX=-1', 'OHRTDX=1', 'OHRTDX=2', 38 | 'STRKDX=-1', 'STRKDX=1', 'STRKDX=2', 'EMPHDX=-1', 'EMPHDX=1', 39 | 'EMPHDX=2', 'CHBRON=-1', 'CHBRON=1', 'CHBRON=2', 'CHOLDX=-1', 40 | 'CHOLDX=1', 'CHOLDX=2', 'CANCERDX=-1', 'CANCERDX=1', 'CANCERDX=2', 41 | 'DIABDX=-1', 'DIABDX=1', 'DIABDX=2', 'JTPAIN=-1', 'JTPAIN=1', 42 | 'JTPAIN=2', 'ARTHDX=-1', 'ARTHDX=1', 'ARTHDX=2', 'ARTHTYPE=-1', 43 | 'ARTHTYPE=1', 'ARTHTYPE=2', 'ARTHTYPE=3', 'ASTHDX=1', 'ASTHDX=2', 44 | 'ADHDADDX=-1', 'ADHDADDX=1', 'ADHDADDX=2', 'PREGNT=-1', 'PREGNT=1', 45 | 'PREGNT=2', 'WLKLIM=-1', 'WLKLIM=1', 'WLKLIM=2', 'ACTLIM=-1', 46 | 'ACTLIM=1', 'ACTLIM=2', 'SOCLIM=-1', 'SOCLIM=1', 'SOCLIM=2', 47 | 'COGLIM=-1', 'COGLIM=1', 'COGLIM=2', 'DFHEAR42=-1', 'DFHEAR42=1', 48 | 'DFHEAR42=2', 'DFSEE42=-1', 'DFSEE42=1', 'DFSEE42=2', 49 | 'ADSMOK42=-1', 'ADSMOK42=1', 'ADSMOK42=2', 'PHQ242=-1', 'PHQ242=0', 50 | 'PHQ242=1', 'PHQ242=2', 'PHQ242=3', 'PHQ242=4', 'PHQ242=5', 51 | 'PHQ242=6', 'EMPST=-1', 'EMPST=1', 'EMPST=2', 'EMPST=3', 'EMPST=4', 52 | 'POVCAT=1', 'POVCAT=2', 'POVCAT=3', 'POVCAT=4', 'POVCAT=5', 53 | 'INSCOV=1', 'INSCOV=2', 'INSCOV=3', 'RACE'] 54 | 55 | y = df[response_name].values 56 | X = df[col_names].values 57 | 58 | if name=="meps_20": 59 | df = pd.read_csv(base_path + 'meps_20_reg_fix.csv') 60 | column_names = df.columns 61 | response_name = "UTILIZATION_reg" 62 | column_names = column_names[column_names!=response_name] 63 | column_names = column_names[column_names!="Unnamed: 0"] 64 | 65 | col_names = ['AGE', 'PCS42', 'MCS42', 'K6SUM42', 'PERWT15F', 'REGION=1', 66 | 'REGION=2', 'REGION=3', 'REGION=4', 'SEX=1', 'SEX=2', 'MARRY=1', 67 | 'MARRY=2', 'MARRY=3', 'MARRY=4', 'MARRY=5', 'MARRY=6', 'MARRY=7', 68 | 'MARRY=8', 'MARRY=9', 'MARRY=10', 'FTSTU=-1', 'FTSTU=1', 'FTSTU=2', 69 | 'FTSTU=3', 'ACTDTY=1', 'ACTDTY=2', 'ACTDTY=3', 'ACTDTY=4', 70 | 'HONRDC=1', 'HONRDC=2', 'HONRDC=3', 'HONRDC=4', 'RTHLTH=-1', 71 | 'RTHLTH=1', 'RTHLTH=2', 'RTHLTH=3', 'RTHLTH=4', 'RTHLTH=5', 72 | 'MNHLTH=-1', 'MNHLTH=1', 'MNHLTH=2', 'MNHLTH=3', 'MNHLTH=4', 73 | 'MNHLTH=5', 'HIBPDX=-1', 'HIBPDX=1', 'HIBPDX=2', 'CHDDX=-1', 74 | 'CHDDX=1', 'CHDDX=2', 'ANGIDX=-1', 'ANGIDX=1', 'ANGIDX=2', 75 | 'MIDX=-1', 'MIDX=1', 'MIDX=2', 'OHRTDX=-1', 'OHRTDX=1', 'OHRTDX=2', 76 | 'STRKDX=-1', 'STRKDX=1', 'STRKDX=2', 'EMPHDX=-1', 'EMPHDX=1', 77 | 'EMPHDX=2', 'CHBRON=-1', 'CHBRON=1', 'CHBRON=2', 'CHOLDX=-1', 78 | 'CHOLDX=1', 'CHOLDX=2', 'CANCERDX=-1', 'CANCERDX=1', 'CANCERDX=2', 79 | 
'DIABDX=-1', 'DIABDX=1', 'DIABDX=2', 'JTPAIN=-1', 'JTPAIN=1', 80 | 'JTPAIN=2', 'ARTHDX=-1', 'ARTHDX=1', 'ARTHDX=2', 'ARTHTYPE=-1', 81 | 'ARTHTYPE=1', 'ARTHTYPE=2', 'ARTHTYPE=3', 'ASTHDX=1', 'ASTHDX=2', 82 | 'ADHDADDX=-1', 'ADHDADDX=1', 'ADHDADDX=2', 'PREGNT=-1', 'PREGNT=1', 83 | 'PREGNT=2', 'WLKLIM=-1', 'WLKLIM=1', 'WLKLIM=2', 'ACTLIM=-1', 84 | 'ACTLIM=1', 'ACTLIM=2', 'SOCLIM=-1', 'SOCLIM=1', 'SOCLIM=2', 85 | 'COGLIM=-1', 'COGLIM=1', 'COGLIM=2', 'DFHEAR42=-1', 'DFHEAR42=1', 86 | 'DFHEAR42=2', 'DFSEE42=-1', 'DFSEE42=1', 'DFSEE42=2', 87 | 'ADSMOK42=-1', 'ADSMOK42=1', 'ADSMOK42=2', 'PHQ242=-1', 'PHQ242=0', 88 | 'PHQ242=1', 'PHQ242=2', 'PHQ242=3', 'PHQ242=4', 'PHQ242=5', 89 | 'PHQ242=6', 'EMPST=-1', 'EMPST=1', 'EMPST=2', 'EMPST=3', 'EMPST=4', 90 | 'POVCAT=1', 'POVCAT=2', 'POVCAT=3', 'POVCAT=4', 'POVCAT=5', 91 | 'INSCOV=1', 'INSCOV=2', 'INSCOV=3', 'RACE'] 92 | 93 | y = df[response_name].values 94 | X = df[col_names].values 95 | 96 | if name=="meps_21": 97 | df = pd.read_csv(base_path + 'meps_21_reg_fix.csv') 98 | column_names = df.columns 99 | response_name = "UTILIZATION_reg" 100 | column_names = column_names[column_names!=response_name] 101 | column_names = column_names[column_names!="Unnamed: 0"] 102 | 103 | col_names = ['AGE', 'PCS42', 'MCS42', 'K6SUM42', 'PERWT16F', 'REGION=1', 104 | 'REGION=2', 'REGION=3', 'REGION=4', 'SEX=1', 'SEX=2', 'MARRY=1', 105 | 'MARRY=2', 'MARRY=3', 'MARRY=4', 'MARRY=5', 'MARRY=6', 'MARRY=7', 106 | 'MARRY=8', 'MARRY=9', 'MARRY=10', 'FTSTU=-1', 'FTSTU=1', 'FTSTU=2', 107 | 'FTSTU=3', 'ACTDTY=1', 'ACTDTY=2', 'ACTDTY=3', 'ACTDTY=4', 108 | 'HONRDC=1', 'HONRDC=2', 'HONRDC=3', 'HONRDC=4', 'RTHLTH=-1', 109 | 'RTHLTH=1', 'RTHLTH=2', 'RTHLTH=3', 'RTHLTH=4', 'RTHLTH=5', 110 | 'MNHLTH=-1', 'MNHLTH=1', 'MNHLTH=2', 'MNHLTH=3', 'MNHLTH=4', 111 | 'MNHLTH=5', 'HIBPDX=-1', 'HIBPDX=1', 'HIBPDX=2', 'CHDDX=-1', 112 | 'CHDDX=1', 'CHDDX=2', 'ANGIDX=-1', 'ANGIDX=1', 'ANGIDX=2', 113 | 'MIDX=-1', 'MIDX=1', 'MIDX=2', 'OHRTDX=-1', 'OHRTDX=1', 'OHRTDX=2', 114 | 'STRKDX=-1', 'STRKDX=1', 'STRKDX=2', 'EMPHDX=-1', 'EMPHDX=1', 115 | 'EMPHDX=2', 'CHBRON=-1', 'CHBRON=1', 'CHBRON=2', 'CHOLDX=-1', 116 | 'CHOLDX=1', 'CHOLDX=2', 'CANCERDX=-1', 'CANCERDX=1', 'CANCERDX=2', 117 | 'DIABDX=-1', 'DIABDX=1', 'DIABDX=2', 'JTPAIN=-1', 'JTPAIN=1', 118 | 'JTPAIN=2', 'ARTHDX=-1', 'ARTHDX=1', 'ARTHDX=2', 'ARTHTYPE=-1', 119 | 'ARTHTYPE=1', 'ARTHTYPE=2', 'ARTHTYPE=3', 'ASTHDX=1', 'ASTHDX=2', 120 | 'ADHDADDX=-1', 'ADHDADDX=1', 'ADHDADDX=2', 'PREGNT=-1', 'PREGNT=1', 121 | 'PREGNT=2', 'WLKLIM=-1', 'WLKLIM=1', 'WLKLIM=2', 'ACTLIM=-1', 122 | 'ACTLIM=1', 'ACTLIM=2', 'SOCLIM=-1', 'SOCLIM=1', 'SOCLIM=2', 123 | 'COGLIM=-1', 'COGLIM=1', 'COGLIM=2', 'DFHEAR42=-1', 'DFHEAR42=1', 124 | 'DFHEAR42=2', 'DFSEE42=-1', 'DFSEE42=1', 'DFSEE42=2', 125 | 'ADSMOK42=-1', 'ADSMOK42=1', 'ADSMOK42=2', 'PHQ242=-1', 'PHQ242=0', 126 | 'PHQ242=1', 'PHQ242=2', 'PHQ242=3', 'PHQ242=4', 'PHQ242=5', 127 | 'PHQ242=6', 'EMPST=-1', 'EMPST=1', 'EMPST=2', 'EMPST=3', 'EMPST=4', 128 | 'POVCAT=1', 'POVCAT=2', 'POVCAT=3', 'POVCAT=4', 'POVCAT=5', 129 | 'INSCOV=1', 'INSCOV=2', 'INSCOV=3', 'RACE'] 130 | 131 | y = df[response_name].values 132 | X = df[col_names].values 133 | 134 | if name=="star": 135 | df = pd.read_csv(base_path + 'STAR.csv') 136 | df.loc[df['gender'] == 'female', 'gender'] = 0 137 | df.loc[df['gender'] == 'male', 'gender'] = 1 138 | 139 | df.loc[df['ethnicity'] == 'cauc', 'ethnicity'] = 0 140 | df.loc[df['ethnicity'] == 'afam', 'ethnicity'] = 1 141 | df.loc[df['ethnicity'] == 'asian', 'ethnicity'] = 2 142 | df.loc[df['ethnicity'] == 'hispanic', 
'ethnicity'] = 3 143 | df.loc[df['ethnicity'] == 'amindian', 'ethnicity'] = 4 144 | df.loc[df['ethnicity'] == 'other', 'ethnicity'] = 5 145 | 146 | df.loc[df['stark'] == 'regular', 'stark'] = 0 147 | df.loc[df['stark'] == 'small', 'stark'] = 1 148 | df.loc[df['stark'] == 'regular+aide', 'stark'] = 2 149 | 150 | df.loc[df['star1'] == 'regular', 'star1'] = 0 151 | df.loc[df['star1'] == 'small', 'star1'] = 1 152 | df.loc[df['star1'] == 'regular+aide', 'star1'] = 2 153 | 154 | df.loc[df['star2'] == 'regular', 'star2'] = 0 155 | df.loc[df['star2'] == 'small', 'star2'] = 1 156 | df.loc[df['star2'] == 'regular+aide', 'star2'] = 2 157 | 158 | df.loc[df['star3'] == 'regular', 'star3'] = 0 159 | df.loc[df['star3'] == 'small', 'star3'] = 1 160 | df.loc[df['star3'] == 'regular+aide', 'star3'] = 2 161 | 162 | df.loc[df['lunchk'] == 'free', 'lunchk'] = 0 163 | df.loc[df['lunchk'] == 'non-free', 'lunchk'] = 1 164 | 165 | df.loc[df['lunch1'] == 'free', 'lunch1'] = 0 166 | df.loc[df['lunch1'] == 'non-free', 'lunch1'] = 1 167 | 168 | df.loc[df['lunch2'] == 'free', 'lunch2'] = 0 169 | df.loc[df['lunch2'] == 'non-free', 'lunch2'] = 1 170 | 171 | df.loc[df['lunch3'] == 'free', 'lunch3'] = 0 172 | df.loc[df['lunch3'] == 'non-free', 'lunch3'] = 1 173 | 174 | df.loc[df['schoolk'] == 'inner-city', 'schoolk'] = 0 175 | df.loc[df['schoolk'] == 'suburban', 'schoolk'] = 1 176 | df.loc[df['schoolk'] == 'rural', 'schoolk'] = 2 177 | df.loc[df['schoolk'] == 'urban', 'schoolk'] = 3 178 | 179 | df.loc[df['school1'] == 'inner-city', 'school1'] = 0 180 | df.loc[df['school1'] == 'suburban', 'school1'] = 1 181 | df.loc[df['school1'] == 'rural', 'school1'] = 2 182 | df.loc[df['school1'] == 'urban', 'school1'] = 3 183 | 184 | df.loc[df['school2'] == 'inner-city', 'school2'] = 0 185 | df.loc[df['school2'] == 'suburban', 'school2'] = 1 186 | df.loc[df['school2'] == 'rural', 'school2'] = 2 187 | df.loc[df['school2'] == 'urban', 'school2'] = 3 188 | 189 | df.loc[df['school3'] == 'inner-city', 'school3'] = 0 190 | df.loc[df['school3'] == 'suburban', 'school3'] = 1 191 | df.loc[df['school3'] == 'rural', 'school3'] = 2 192 | df.loc[df['school3'] == 'urban', 'school3'] = 3 193 | 194 | df.loc[df['degreek'] == 'bachelor', 'degreek'] = 0 195 | df.loc[df['degreek'] == 'master', 'degreek'] = 1 196 | df.loc[df['degreek'] == 'specialist', 'degreek'] = 2 197 | df.loc[df['degreek'] == 'master+', 'degreek'] = 3 198 | 199 | df.loc[df['degree1'] == 'bachelor', 'degree1'] = 0 200 | df.loc[df['degree1'] == 'master', 'degree1'] = 1 201 | df.loc[df['degree1'] == 'specialist', 'degree1'] = 2 202 | df.loc[df['degree1'] == 'phd', 'degree1'] = 3 203 | 204 | df.loc[df['degree2'] == 'bachelor', 'degree2'] = 0 205 | df.loc[df['degree2'] == 'master', 'degree2'] = 1 206 | df.loc[df['degree2'] == 'specialist', 'degree2'] = 2 207 | df.loc[df['degree2'] == 'phd', 'degree2'] = 3 208 | 209 | df.loc[df['degree3'] == 'bachelor', 'degree3'] = 0 210 | df.loc[df['degree3'] == 'master', 'degree3'] = 1 211 | df.loc[df['degree3'] == 'specialist', 'degree3'] = 2 212 | df.loc[df['degree3'] == 'phd', 'degree3'] = 3 213 | 214 | df.loc[df['ladderk'] == 'level1', 'ladderk'] = 0 215 | df.loc[df['ladderk'] == 'level2', 'ladderk'] = 1 216 | df.loc[df['ladderk'] == 'level3', 'ladderk'] = 2 217 | df.loc[df['ladderk'] == 'apprentice', 'ladderk'] = 3 218 | df.loc[df['ladderk'] == 'probation', 'ladderk'] = 4 219 | df.loc[df['ladderk'] == 'pending', 'ladderk'] = 5 220 | df.loc[df['ladderk'] == 'notladder', 'ladderk'] = 6 221 | 222 | 223 | df.loc[df['ladder1'] == 'level1', 'ladder1'] = 0 
224 | df.loc[df['ladder1'] == 'level2', 'ladder1'] = 1 225 | df.loc[df['ladder1'] == 'level3', 'ladder1'] = 2 226 | df.loc[df['ladder1'] == 'apprentice', 'ladder1'] = 3 227 | df.loc[df['ladder1'] == 'probation', 'ladder1'] = 4 228 | df.loc[df['ladder1'] == 'noladder', 'ladder1'] = 5 229 | df.loc[df['ladder1'] == 'notladder', 'ladder1'] = 6 230 | 231 | df.loc[df['ladder2'] == 'level1', 'ladder2'] = 0 232 | df.loc[df['ladder2'] == 'level2', 'ladder2'] = 1 233 | df.loc[df['ladder2'] == 'level3', 'ladder2'] = 2 234 | df.loc[df['ladder2'] == 'apprentice', 'ladder2'] = 3 235 | df.loc[df['ladder2'] == 'probation', 'ladder2'] = 4 236 | df.loc[df['ladder2'] == 'noladder', 'ladder2'] = 5 237 | df.loc[df['ladder2'] == 'notladder', 'ladder2'] = 6 238 | 239 | df.loc[df['ladder3'] == 'level1', 'ladder3'] = 0 240 | df.loc[df['ladder3'] == 'level2', 'ladder3'] = 1 241 | df.loc[df['ladder3'] == 'level3', 'ladder3'] = 2 242 | df.loc[df['ladder3'] == 'apprentice', 'ladder3'] = 3 243 | df.loc[df['ladder3'] == 'probation', 'ladder3'] = 4 244 | df.loc[df['ladder3'] == 'noladder', 'ladder3'] = 5 245 | df.loc[df['ladder3'] == 'notladder', 'ladder3'] = 6 246 | 247 | df.loc[df['tethnicityk'] == 'cauc', 'tethnicityk'] = 0 248 | df.loc[df['tethnicityk'] == 'afam', 'tethnicityk'] = 1 249 | 250 | df.loc[df['tethnicity1'] == 'cauc', 'tethnicity1'] = 0 251 | df.loc[df['tethnicity1'] == 'afam', 'tethnicity1'] = 1 252 | 253 | df.loc[df['tethnicity2'] == 'cauc', 'tethnicity2'] = 0 254 | df.loc[df['tethnicity2'] == 'afam', 'tethnicity2'] = 1 255 | 256 | df.loc[df['tethnicity3'] == 'cauc', 'tethnicity3'] = 0 257 | df.loc[df['tethnicity3'] == 'afam', 'tethnicity3'] = 1 258 | df.loc[df['tethnicity3'] == 'asian', 'tethnicity3'] = 2 259 | 260 | df = df.dropna() 261 | 262 | grade = df["readk"] + df["read1"] + df["read2"] + df["read3"] 263 | grade += df["mathk"] + df["math1"] + df["math2"] + df["math3"] 264 | 265 | 266 | names = df.columns 267 | target_names = names[8:16] 268 | data_names = np.concatenate((names[0:8],names[17:])) 269 | X = df.loc[:, data_names].values 270 | y = grade.values 271 | 272 | 273 | if name=="facebook_1": 274 | df = pd.read_csv(base_path + 'facebook/Features_Variant_1.csv') 275 | y = df.iloc[:,53].values 276 | X = df.iloc[:,0:53].values 277 | 278 | if name=="facebook_2": 279 | df = pd.read_csv(base_path + 'facebook/Features_Variant_2.csv') 280 | y = df.iloc[:,53].values 281 | X = df.iloc[:,0:53].values 282 | 283 | if name=="bio": 284 | #https://github.com/joefavergel/TertiaryPhysicochemicalProperties/blob/master/RMSD-ProteinTertiaryStructures.ipynb 285 | df = pd.read_csv(base_path + 'CASP.csv') 286 | y = df.iloc[:,0].values 287 | X = df.iloc[:,1:].values 288 | 289 | if name=='blog_data': 290 | # https://github.com/xinbinhuang/feature-selection_blogfeedback 291 | df = pd.read_csv(base_path + 'blogData_train.csv', header=None) 292 | X = df.iloc[:,0:280].values 293 | y = df.iloc[:,-1].values 294 | 295 | if name == "concrete": 296 | dataset = np.loadtxt(open(base_path + 'Concrete_Data.csv', "rb"), delimiter=",", skiprows=1) 297 | X = dataset[:, :-1] 298 | y = dataset[:, -1:] 299 | 300 | 301 | if name=="bike": 302 | # https://www.kaggle.com/rajmehra03/bike-sharing-demand-rmsle-0-3194 303 | df=pd.read_csv(base_path + 'bike_train.csv') 304 | 305 | # # seperating season as per values. this is bcoz this will enhance features. 306 | season=pd.get_dummies(df['season'],prefix='season') 307 | df=pd.concat([df,season],axis=1) 308 | 309 | # # # same for weather. this is bcoz this will enhance features. 
310 | weather=pd.get_dummies(df['weather'],prefix='weather') 311 | df=pd.concat([df,weather],axis=1) 312 | 313 | # # # now can drop weather and season. 314 | df.drop(['season','weather'],inplace=True,axis=1) 315 | df.head() 316 | 317 | df["hour"] = [t.hour for t in pd.DatetimeIndex(df.datetime)] 318 | df["day"] = [t.dayofweek for t in pd.DatetimeIndex(df.datetime)] 319 | df["month"] = [t.month for t in pd.DatetimeIndex(df.datetime)] 320 | df['year'] = [t.year for t in pd.DatetimeIndex(df.datetime)] 321 | df['year'] = df['year'].map({2011:0, 2012:1}) 322 | 323 | df.drop('datetime',axis=1,inplace=True) 324 | df.drop(['casual','registered'],axis=1,inplace=True) 325 | df.columns.to_series().groupby(df.dtypes).groups 326 | X = df.drop('count',axis=1).values 327 | y = df['count'].values 328 | 329 | if name=="community": 330 | # https://github.com/vbordalo/Communities-Crime/blob/master/Crime_v1.ipynb 331 | attrib = pd.read_csv(base_path + 'communities_attributes.csv', delim_whitespace = True) 332 | data = pd.read_csv(base_path + 'communities.data', names = attrib['attributes']) 333 | data = data.drop(columns=['state','county', 334 | 'community','communityname', 335 | 'fold'], axis=1) 336 | 337 | data = data.replace('?', np.nan) 338 | 339 | # Impute mean values for samples with missing values 340 | from sklearn.preprocessing import Imputer 341 | 342 | imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0) 343 | 344 | imputer = imputer.fit(data[['OtherPerCap']]) 345 | data[['OtherPerCap']] = imputer.transform(data[['OtherPerCap']]) 346 | data = data.dropna(axis=1) 347 | X = data.iloc[:, 0:100].values 348 | y = data.iloc[:, 100].values 349 | 350 | 351 | X = X.astype(np.float32) 352 | y = y.astype(np.float32) 353 | 354 | return X, y 355 | -------------------------------------------------------------------------------- /datasets/facebook/README.md: -------------------------------------------------------------------------------- 1 | 2 | Please download the files Features_Variant_1.csv and Features_Variant_2.csv from 3 | [this link](https://archive.ics.uci.edu/ml/datasets/Facebook+Comment+Volume+Dataset) and store the two in this directory. 4 | -------------------------------------------------------------------------------- /get_meps_data/README.md: -------------------------------------------------------------------------------- 1 | # Medical Expenditure Panel Survey data 2 | 3 | 4 | ## A quick guide: 5 | 6 | cd to the current code directory, and run 7 | 8 | ```Bash 9 | Rscript download_data.R 10 | ``` 11 | 12 | You should see the files h181.csv and h192.csv in the code directory. Then, to clean the raw files and create the datasets, run 13 | 14 | ```Bash 15 | python main_clean_and_save_to_csv.py 16 | ``` 17 | 18 | Now, you should see 3 new files: meps_19_reg.csv, meps_20_reg.csv, and meps_21_reg.csv. These are the csv files that we used in our experiments. 19 | 20 | The following sections provide more detailed explanation. 21 | 22 | ### Note: the code and the following text is copied from IBM's AIF360 package. 23 | 24 | The Medical Expenditure Panel Survey (MEPS) data consists of large scale surveys of families and individuals, medical providers, and employers, and collects data on health services used, costs & frequency of services, demographics, etc., of the respondents. 25 | 26 | Please refer to https://github.com/IBM/AIF360 for more details. 
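As a quick sanity check after running the two commands in the quick guide above, the produced files can be inspected with pandas. A minimal sketch (the `UTILIZATION_reg` response column is created by the cleaning scripts in this folder; `main_clean_and_save_to_csv.py` below does the same thing with an explicit list of feature columns):

```python
import pandas as pd

# assumes main_clean_and_save_to_csv.py has already written this file to the current folder
df = pd.read_csv('meps_19_reg.csv')
y = df['UTILIZATION_reg'].values                 # accumulated medical utilization (regression response)
X = df.drop(columns=['UTILIZATION_reg']).values  # all remaining columns
print(X.shape, y.shape)
```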
27 | 28 | ## Source / Data Set Description: 29 | 30 | 31 | * [2015 full Year Consolidated Data File](https://meps.ahrq.gov/mepsweb/data_stats/download_data_files_detail.jsp?cboPufNumber=HC-181): This file contains MEPS survey data for calendar year 2015 obtained in rounds 3, 4, and 5 of Panel 19, and rounds 1, 2, and 3 of Panel 20. 32 | 33 | * [2016 full Year Consolidated Data File](https://meps.ahrq.gov/mepsweb/data_stats/download_data_files_detail.jsp?cboPufNumber=HC-192): : This file contains MEPS survey data for calendar year 2016 obtained in rounds 3, 4, and 5 of Panel 20, and rounds 1, 2, and 3 of Panel 21. 34 | 35 | 36 | ## Data Use Agreement 37 | 38 | As the user of the data it is your responsibility to read and abide by any copyright/usage rules and restrictions as 39 | stated on the MEPS web site before downloading the data. 40 | 41 | - [Data Use Agreement (2015 Data File)](https://meps.ahrq.gov/data_stats/download_data/pufs/h181/h181doc.shtml#Data) 42 | - [Data Use Agreement (2016 Data File)](https://meps.ahrq.gov/data_stats/download_data/pufs/h192/h192doc.shtml#DataA) 43 | 44 | 45 | ## Download instructions 46 | 47 | In order to use the MEPS datasets, please follow the following directions to download the datafiles and convert into csv files. 48 | 49 | Follow either set of instructions below for using R or SPSS. Further instructions for SAS, and Stata, are available at 50 | the [AHRQ MEPS Github repository](https://github.com/HHS-AHRQ/MEPS). 51 | 52 | - **Generating CSV files with R** 53 | 54 | In the current folder run the R script `download_data.R`. R can be downloaded from [CRAN](https://cran.r-project.org). 55 | If you are working on Mac OS X the easiest way to get the R command line support is by installing it with 56 | [Homebrew](https://brew.sh/) `brew install R`. 57 | 58 | ```Bash 59 | Rscript download_data.R 60 | ``` 61 | 62 | Example output: 63 | 64 | ``` 65 | Loading required package: foreign 66 | 67 | trying URL 'https://meps.ahrq.gov/mepsweb/data_files/pufs/h181ssp.zip' 68 | Content type 'application/zip' length 13303652 bytes (12.7 MB) 69 | ================================================== 70 | downloaded 12.7 MB 71 | 72 | Loading dataframe from file: h181.ssp 73 | Exporting dataframe to file: h181.csv 74 | 75 | trying URL 'https://meps.ahrq.gov/mepsweb/data_files/pufs/h192ssp.zip' 76 | Content type 'application/zip' length 15505898 bytes (14.8 MB) 77 | ================================================== 78 | downloaded 14.8 MB 79 | 80 | Loading dataframe from file: h192.ssp 81 | Exporting dataframe to file: h192.csv 82 | ``` 83 | 84 | - **Generating CSV files with SPSS** 85 | 86 | The instructions below require the use of SPSS. 87 | 88 | 1. 2015 full Year Consolidated Data File 89 | * Download the [`Data File, ASCII format`](https://meps.ahrq.gov/mepsweb/data_files/pufs/h181dat.zip) 90 | * Extract the file `h181.dat` from downloaded zip archive 91 | * Convert the file to comma-delimited format, `h181.csv`, and save in this folder. 92 | * To convert the .dat file into csv format,download one of the programming statements files, such as the [SPSS Programming Statements](https://meps.ahrq.gov/mepsweb/data_stats/download_data/pufs/h181/h181spu.txt) file. 93 | * Edit this file to change the FILE HANDLE name to the complete path/name of the downloaded data file, execute the SPSS programming statements to load the data, and 'save as' a comma-delimited file called 'h181.csv' in the current folder. 94 | 95 | 2. 
2016 full Year Consolidated Data File 96 | * Download the [`Data File, ASCII format`](https://meps.ahrq.gov/mepsweb/data_files/pufs/h192dat.zip) 97 | * Extract the file `h192.dat` from downloaded zip archive 98 | * Convert the file to comma-delimited format, `h192.csv`, and save in current repository. 99 | * To convert the .dat file into csv format,download one of the programming statements files, such as the [SPSS Programming Statements](https://meps.ahrq.gov/mepsweb/data_stats/download_data/pufs/h192/h192spu.txt) file. 100 | * Edit this file to change the FILE HANDLE name to the complete path/name of the downloaded data file, execute the SPSS programming statements to load the data, and 'save as' a comma-delimited file called 'h192.csv' in this folder. 101 | 102 | ## Cleaning the Data 103 | 104 | To clean the raw files and create the 3 MEPS datasets used in the our paper, run 105 | 106 | ```Bash 107 | python main_clean_and_save_to_csv.py 108 | ``` 109 | 110 | which produces the files: 'meps_19_reg.csv', 'meps_20_reg.csv', and 'meps_21_reg.csv'. 111 | -------------------------------------------------------------------------------- /get_meps_data/base_dataset.py: -------------------------------------------------------------------------------- 1 | # Code copied from IBM's AIF360 package: 2 | # https://github.com/IBM/AIF360/blob/master/aif360/datasets/dataset.py 3 | 4 | from __future__ import absolute_import 5 | from __future__ import division 6 | from __future__ import print_function 7 | from __future__ import unicode_literals 8 | 9 | import abc 10 | import copy 11 | import sys 12 | 13 | if sys.version_info >= (3, 4): 14 | ABC = abc.ABC 15 | else: 16 | ABC = abc.ABCMeta(str('ABC'), (), {}) 17 | 18 | 19 | class BaseDataset(ABC): 20 | """Abstract base class for datasets.""" 21 | 22 | @abc.abstractmethod 23 | def __init__(self, **kwargs): 24 | self.metadata = kwargs.pop('metadata', dict()) or dict() 25 | self.metadata.update({ 26 | 'transformer': '{}.__init__'.format(type(self).__name__), 27 | 'params': kwargs, 28 | 'previous': [] 29 | }) 30 | self.validate_dataset() 31 | 32 | def validate_dataset(self): 33 | """Error checking and type validation.""" 34 | pass 35 | 36 | def copy(self, deepcopy=False): 37 | """Convenience method to return a copy of this dataset. 38 | 39 | Args: 40 | deepcopy (bool, optional): :func:`~copy.deepcopy` this dataset if 41 | `True`, shallow copy otherwise. 42 | 43 | Returns: 44 | Dataset: A new dataset with fields copied from this object and 45 | metadata set accordingly. 46 | """ 47 | cpy = copy.deepcopy(self) if deepcopy else copy.copy(self) 48 | # preserve any user-created fields 49 | cpy.metadata = cpy.metadata.copy() 50 | cpy.metadata.update({ 51 | 'transformer': '{}.copy'.format(type(self).__name__), 52 | 'params': {'deepcopy': deepcopy}, 53 | 'previous': [self] 54 | }) 55 | return cpy 56 | 57 | @abc.abstractmethod 58 | def export_dataset(self): 59 | """Save this Dataset to disk.""" 60 | raise NotImplementedError 61 | 62 | @abc.abstractmethod 63 | def split(self, num_or_size_splits, shuffle=False): 64 | """Split this dataset into multiple partitions. 65 | 66 | Args: 67 | num_or_size_splits (array or int): If `num_or_size_splits` is an 68 | int, *k*, the value is the number of equal-sized folds to make 69 | (if *k* does not evenly divide the dataset these folds are 70 | approximately equal-sized). If `num_or_size_splits` is an array 71 | of type int, the values are taken as the indices at which to 72 | split the dataset. 
If the values are floats (< 1.), they are 73 | considered to be fractional proportions of the dataset at which 74 | to split. 75 | shuffle (bool, optional): Randomly shuffle the dataset before 76 | splitting. 77 | 78 | Returns: 79 | list(Dataset): Splits. Contains *k* or `len(num_or_size_splits) + 1` 80 | datasets depending on `num_or_size_splits`. 81 | """ 82 | raise NotImplementedError 83 | -------------------------------------------------------------------------------- /get_meps_data/download_data.R: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env Rscript 2 | 3 | # Code copied from IBM's aif360 package, https://github.com/IBM/AIF360 4 | 5 | # This R script can be used to download the Medical Expenditure Panel Survey (MEPS) 6 | # data files for 2015 and 2016 and convert the files from SAS transport format into 7 | # standard CSV files. 8 | 9 | usage_note <- paste("", 10 | "By using this script you acknowledge the responsibility for reading and", 11 | "abiding by any copyright/usage rules and restrictions as stated on the", 12 | "MEPS web site (https://meps.ahrq.gov/data_stats/data_use.jsp).", 13 | "", 14 | "Continue [y/n]? > ", sep = "\n") 15 | 16 | cat(usage_note) 17 | answer <- scan("stdin", character(), n=1, quiet=TRUE) 18 | 19 | if (tolower(answer) != 'y') { 20 | opt <- options(show.error.messages=FALSE) 21 | on.exit(options(opt)) 22 | stop() 23 | } 24 | 25 | if (!require("foreign")) { 26 | install.packages("foreign") 27 | library(foreign) 28 | } 29 | 30 | convert <- function(ssp_file, csv_file) { 31 | message("Loading dataframe from file: ", ssp_file) 32 | df = read.xport(ssp_file) 33 | message("Exporting dataframe to file: ", csv_file) 34 | write.csv(df, file=csv_file, row.names=FALSE, quote=FALSE) 35 | } 36 | 37 | for (dataset in c("h181", "h192")) { 38 | zip_file <- paste(dataset, "ssp.zip", sep="") 39 | ssp_file <- paste(dataset, "ssp", sep=".") 40 | csv_file <- paste(dataset, "csv", sep=".") 41 | url <- paste("https://meps.ahrq.gov/mepsweb/data_files/pufs", zip_file, sep="/") 42 | 43 | # skip to next dataset if we already have the CSV file 44 | if (file.exists(csv_file)) { 45 | message(csv_file, " already exists") 46 | next 47 | } 48 | 49 | # download the zip file only if not downloaded before 50 | if (!file.exists(zip_file)) { 51 | download.file(url, destfile=zip_file) 52 | } 53 | 54 | # unzip and convert the dataset from SAS transport format to CSV 55 | unzip(zip_file) 56 | convert(ssp_file, csv_file) 57 | 58 | # clean up temporary files if we got the CSV file 59 | if (file.exists(csv_file)) { 60 | file.remove(zip_file) 61 | file.remove(ssp_file) 62 | } 63 | } 64 | -------------------------------------------------------------------------------- /get_meps_data/main_clean_and_save_to_csv.py: -------------------------------------------------------------------------------- 1 | 2 | # Code based on IBM's AIF360 software package, suggesting a simple modification 3 | # that accumulates the medical utilization variables without binarization 4 | 5 | # Load packages 6 | from meps_dataset_panel19_fy2015_reg import MEPSDataset19Reg 7 | from meps_dataset_panel20_fy2015_reg import MEPSDataset20Reg 8 | from meps_dataset_panel21_fy2016_reg import MEPSDataset21Reg 9 | 10 | import numpy as np 11 | 12 | print("Cleaning and saving MEPS 19, 20 and 21") 13 | 14 | # Load raw MEPS 19 data, extract and clean the features, then save to meps_19.csv 15 | MEPSDataset19Reg() 16 | 17 | # Load raw MEPS 20 data, extract and clean the features, then save to 
meps_20.csv 18 | MEPSDataset20Reg() 19 | 20 | # Load raw MEPS 21 data, extract and clean the features, then save to meps_21.csv 21 | MEPSDataset21Reg() 22 | 23 | 24 | print("Done.") 25 | 26 | ############################################################################### 27 | ############################################################################### 28 | 29 | 30 | # We now show how to load the processed csv file 31 | import pandas as pd 32 | 33 | print("Loading processed data and printing the dimensions") 34 | 35 | 36 | ############################################################################## 37 | # MEPS 19 38 | ############################################################################## 39 | 40 | # Load the processed meps_19_reg.csv, extract features X and response y 41 | df = pd.read_csv('meps_19_reg.csv') 42 | column_names = df.columns 43 | response_name = "UTILIZATION_reg" 44 | column_names = column_names[column_names!=response_name] 45 | column_names = column_names[column_names!="Unnamed: 0"] 46 | 47 | col_names = ['AGE', 'PCS42', 'MCS42', 'K6SUM42', 'PERWT15F', 'REGION=1', 48 | 'REGION=2', 'REGION=3', 'REGION=4', 'SEX=1', 'SEX=2', 'MARRY=1', 49 | 'MARRY=2', 'MARRY=3', 'MARRY=4', 'MARRY=5', 'MARRY=6', 'MARRY=7', 50 | 'MARRY=8', 'MARRY=9', 'MARRY=10', 'FTSTU=-1', 'FTSTU=1', 'FTSTU=2', 51 | 'FTSTU=3', 'ACTDTY=1', 'ACTDTY=2', 'ACTDTY=3', 'ACTDTY=4', 52 | 'HONRDC=1', 'HONRDC=2', 'HONRDC=3', 'HONRDC=4', 'RTHLTH=-1', 53 | 'RTHLTH=1', 'RTHLTH=2', 'RTHLTH=3', 'RTHLTH=4', 'RTHLTH=5', 54 | 'MNHLTH=-1', 'MNHLTH=1', 'MNHLTH=2', 'MNHLTH=3', 'MNHLTH=4', 55 | 'MNHLTH=5', 'HIBPDX=-1', 'HIBPDX=1', 'HIBPDX=2', 'CHDDX=-1', 56 | 'CHDDX=1', 'CHDDX=2', 'ANGIDX=-1', 'ANGIDX=1', 'ANGIDX=2', 57 | 'MIDX=-1', 'MIDX=1', 'MIDX=2', 'OHRTDX=-1', 'OHRTDX=1', 'OHRTDX=2', 58 | 'STRKDX=-1', 'STRKDX=1', 'STRKDX=2', 'EMPHDX=-1', 'EMPHDX=1', 59 | 'EMPHDX=2', 'CHBRON=-1', 'CHBRON=1', 'CHBRON=2', 'CHOLDX=-1', 60 | 'CHOLDX=1', 'CHOLDX=2', 'CANCERDX=-1', 'CANCERDX=1', 'CANCERDX=2', 61 | 'DIABDX=-1', 'DIABDX=1', 'DIABDX=2', 'JTPAIN=-1', 'JTPAIN=1', 62 | 'JTPAIN=2', 'ARTHDX=-1', 'ARTHDX=1', 'ARTHDX=2', 'ARTHTYPE=-1', 63 | 'ARTHTYPE=1', 'ARTHTYPE=2', 'ARTHTYPE=3', 'ASTHDX=1', 'ASTHDX=2', 64 | 'ADHDADDX=-1', 'ADHDADDX=1', 'ADHDADDX=2', 'PREGNT=-1', 'PREGNT=1', 65 | 'PREGNT=2', 'WLKLIM=-1', 'WLKLIM=1', 'WLKLIM=2', 'ACTLIM=-1', 66 | 'ACTLIM=1', 'ACTLIM=2', 'SOCLIM=-1', 'SOCLIM=1', 'SOCLIM=2', 67 | 'COGLIM=-1', 'COGLIM=1', 'COGLIM=2', 'DFHEAR42=-1', 'DFHEAR42=1', 68 | 'DFHEAR42=2', 'DFSEE42=-1', 'DFSEE42=1', 'DFSEE42=2', 69 | 'ADSMOK42=-1', 'ADSMOK42=1', 'ADSMOK42=2', 'PHQ242=-1', 'PHQ242=0', 70 | 'PHQ242=1', 'PHQ242=2', 'PHQ242=3', 'PHQ242=4', 'PHQ242=5', 71 | 'PHQ242=6', 'EMPST=-1', 'EMPST=1', 'EMPST=2', 'EMPST=3', 'EMPST=4', 72 | 'POVCAT=1', 'POVCAT=2', 'POVCAT=3', 'POVCAT=4', 'POVCAT=5', 73 | 'INSCOV=1', 'INSCOV=2', 'INSCOV=3', 'RACE'] 74 | 75 | y = df[response_name].values 76 | X = df[col_names].values 77 | 78 | print("MEPS 19: n = " + str(X.shape[0]) + " p = " + str(X.shape[1]) + " response len = " + str(y.shape[0])) 79 | 80 | 81 | ############################################################################## 82 | # MEPS 20 83 | ############################################################################## 84 | 85 | # Load the processed meps_20_reg.csv, extract features X and response y 86 | df = pd.read_csv('meps_20_reg.csv') 87 | column_names = df.columns 88 | response_name = "UTILIZATION_reg" 89 | column_names = column_names[column_names!=response_name] 90 | column_names = column_names[column_names!="Unnamed: 
0"] 91 | 92 | col_names = ['AGE', 'PCS42', 'MCS42', 'K6SUM42', 'PERWT15F', 'REGION=1', 93 | 'REGION=2', 'REGION=3', 'REGION=4', 'SEX=1', 'SEX=2', 'MARRY=1', 94 | 'MARRY=2', 'MARRY=3', 'MARRY=4', 'MARRY=5', 'MARRY=6', 'MARRY=7', 95 | 'MARRY=8', 'MARRY=9', 'MARRY=10', 'FTSTU=-1', 'FTSTU=1', 'FTSTU=2', 96 | 'FTSTU=3', 'ACTDTY=1', 'ACTDTY=2', 'ACTDTY=3', 'ACTDTY=4', 97 | 'HONRDC=1', 'HONRDC=2', 'HONRDC=3', 'HONRDC=4', 'RTHLTH=-1', 98 | 'RTHLTH=1', 'RTHLTH=2', 'RTHLTH=3', 'RTHLTH=4', 'RTHLTH=5', 99 | 'MNHLTH=-1', 'MNHLTH=1', 'MNHLTH=2', 'MNHLTH=3', 'MNHLTH=4', 100 | 'MNHLTH=5', 'HIBPDX=-1', 'HIBPDX=1', 'HIBPDX=2', 'CHDDX=-1', 101 | 'CHDDX=1', 'CHDDX=2', 'ANGIDX=-1', 'ANGIDX=1', 'ANGIDX=2', 102 | 'MIDX=-1', 'MIDX=1', 'MIDX=2', 'OHRTDX=-1', 'OHRTDX=1', 'OHRTDX=2', 103 | 'STRKDX=-1', 'STRKDX=1', 'STRKDX=2', 'EMPHDX=-1', 'EMPHDX=1', 104 | 'EMPHDX=2', 'CHBRON=-1', 'CHBRON=1', 'CHBRON=2', 'CHOLDX=-1', 105 | 'CHOLDX=1', 'CHOLDX=2', 'CANCERDX=-1', 'CANCERDX=1', 'CANCERDX=2', 106 | 'DIABDX=-1', 'DIABDX=1', 'DIABDX=2', 'JTPAIN=-1', 'JTPAIN=1', 107 | 'JTPAIN=2', 'ARTHDX=-1', 'ARTHDX=1', 'ARTHDX=2', 'ARTHTYPE=-1', 108 | 'ARTHTYPE=1', 'ARTHTYPE=2', 'ARTHTYPE=3', 'ASTHDX=1', 'ASTHDX=2', 109 | 'ADHDADDX=-1', 'ADHDADDX=1', 'ADHDADDX=2', 'PREGNT=-1', 'PREGNT=1', 110 | 'PREGNT=2', 'WLKLIM=-1', 'WLKLIM=1', 'WLKLIM=2', 'ACTLIM=-1', 111 | 'ACTLIM=1', 'ACTLIM=2', 'SOCLIM=-1', 'SOCLIM=1', 'SOCLIM=2', 112 | 'COGLIM=-1', 'COGLIM=1', 'COGLIM=2', 'DFHEAR42=-1', 'DFHEAR42=1', 113 | 'DFHEAR42=2', 'DFSEE42=-1', 'DFSEE42=1', 'DFSEE42=2', 114 | 'ADSMOK42=-1', 'ADSMOK42=1', 'ADSMOK42=2', 'PHQ242=-1', 'PHQ242=0', 115 | 'PHQ242=1', 'PHQ242=2', 'PHQ242=3', 'PHQ242=4', 'PHQ242=5', 116 | 'PHQ242=6', 'EMPST=-1', 'EMPST=1', 'EMPST=2', 'EMPST=3', 'EMPST=4', 117 | 'POVCAT=1', 'POVCAT=2', 'POVCAT=3', 'POVCAT=4', 'POVCAT=5', 118 | 'INSCOV=1', 'INSCOV=2', 'INSCOV=3', 'RACE'] 119 | 120 | 121 | y = df[response_name].values 122 | X = df[col_names].values 123 | 124 | print("MEPS 20: n = " + str(X.shape[0]) + " p = " + str(X.shape[1]) + " response len = " + str(y.shape[0])) 125 | 126 | 127 | ############################################################################## 128 | # MEPS 21 129 | ############################################################################## 130 | 131 | # Load the processed meps_21_reg.csv, extract features X and response y 132 | df = pd.read_csv('meps_21_reg.csv') 133 | column_names = df.columns 134 | response_name = "UTILIZATION_reg" 135 | column_names = column_names[column_names!=response_name] 136 | column_names = column_names[column_names!="Unnamed: 0"] 137 | 138 | col_names = ['AGE', 'PCS42', 'MCS42', 'K6SUM42', 'PERWT16F', 'REGION=1', 139 | 'REGION=2', 'REGION=3', 'REGION=4', 'SEX=1', 'SEX=2', 'MARRY=1', 140 | 'MARRY=2', 'MARRY=3', 'MARRY=4', 'MARRY=5', 'MARRY=6', 'MARRY=7', 141 | 'MARRY=8', 'MARRY=9', 'MARRY=10', 'FTSTU=-1', 'FTSTU=1', 'FTSTU=2', 142 | 'FTSTU=3', 'ACTDTY=1', 'ACTDTY=2', 'ACTDTY=3', 'ACTDTY=4', 143 | 'HONRDC=1', 'HONRDC=2', 'HONRDC=3', 'HONRDC=4', 'RTHLTH=-1', 144 | 'RTHLTH=1', 'RTHLTH=2', 'RTHLTH=3', 'RTHLTH=4', 'RTHLTH=5', 145 | 'MNHLTH=-1', 'MNHLTH=1', 'MNHLTH=2', 'MNHLTH=3', 'MNHLTH=4', 146 | 'MNHLTH=5', 'HIBPDX=-1', 'HIBPDX=1', 'HIBPDX=2', 'CHDDX=-1', 147 | 'CHDDX=1', 'CHDDX=2', 'ANGIDX=-1', 'ANGIDX=1', 'ANGIDX=2', 148 | 'MIDX=-1', 'MIDX=1', 'MIDX=2', 'OHRTDX=-1', 'OHRTDX=1', 'OHRTDX=2', 149 | 'STRKDX=-1', 'STRKDX=1', 'STRKDX=2', 'EMPHDX=-1', 'EMPHDX=1', 150 | 'EMPHDX=2', 'CHBRON=-1', 'CHBRON=1', 'CHBRON=2', 'CHOLDX=-1', 151 | 'CHOLDX=1', 'CHOLDX=2', 'CANCERDX=-1', 'CANCERDX=1', 
'CANCERDX=2', 152 | 'DIABDX=-1', 'DIABDX=1', 'DIABDX=2', 'JTPAIN=-1', 'JTPAIN=1', 153 | 'JTPAIN=2', 'ARTHDX=-1', 'ARTHDX=1', 'ARTHDX=2', 'ARTHTYPE=-1', 154 | 'ARTHTYPE=1', 'ARTHTYPE=2', 'ARTHTYPE=3', 'ASTHDX=1', 'ASTHDX=2', 155 | 'ADHDADDX=-1', 'ADHDADDX=1', 'ADHDADDX=2', 'PREGNT=-1', 'PREGNT=1', 156 | 'PREGNT=2', 'WLKLIM=-1', 'WLKLIM=1', 'WLKLIM=2', 'ACTLIM=-1', 157 | 'ACTLIM=1', 'ACTLIM=2', 'SOCLIM=-1', 'SOCLIM=1', 'SOCLIM=2', 158 | 'COGLIM=-1', 'COGLIM=1', 'COGLIM=2', 'DFHEAR42=-1', 'DFHEAR42=1', 159 | 'DFHEAR42=2', 'DFSEE42=-1', 'DFSEE42=1', 'DFSEE42=2', 160 | 'ADSMOK42=-1', 'ADSMOK42=1', 'ADSMOK42=2', 'PHQ242=-1', 'PHQ242=0', 161 | 'PHQ242=1', 'PHQ242=2', 'PHQ242=3', 'PHQ242=4', 'PHQ242=5', 162 | 'PHQ242=6', 'EMPST=-1', 'EMPST=1', 'EMPST=2', 'EMPST=3', 'EMPST=4', 163 | 'POVCAT=1', 'POVCAT=2', 'POVCAT=3', 'POVCAT=4', 'POVCAT=5', 164 | 'INSCOV=1', 'INSCOV=2', 'INSCOV=3', 'RACE'] 165 | 166 | 167 | y = df[response_name].values 168 | X = df[col_names].values 169 | 170 | print("MEPS 21: n = " + str(X.shape[0]) + " p = " + str(X.shape[1]) + " response len = " + str(y.shape[0])) 171 | -------------------------------------------------------------------------------- /get_meps_data/meps_dataset_panel19_fy2015_reg.py: -------------------------------------------------------------------------------- 1 | # This code is a variant of 2 | # https://github.com/IBM/AIF360/blob/master/aif360/datasets/meps_dataset_panel19_fy2015.py 3 | 4 | from __future__ import absolute_import 5 | from __future__ import division 6 | from __future__ import print_function 7 | from __future__ import unicode_literals 8 | 9 | import pandas as pd 10 | 11 | from save_dataset import SaveDataset 12 | 13 | default_mappings = { 14 | 'protected_attribute_maps': [{1.0: 'White', 0.0: 'Non-White'}] 15 | } 16 | 17 | def default_preprocessing(df): 18 | """ 19 | 1.Create a new column, RACE that is 'White' if RACEV2X = 1 and HISPANX = 2 i.e. non Hispanic White 20 | and 'non-White' otherwise 21 | 2. Restrict to Panel 19 22 | 3. RENAME all columns that are PANEL/ROUND SPECIFIC 23 | 4. Drop rows based on certain values of individual features that correspond to missing/unknown - generally < -1 24 | 5. 
Compute UTILIZATION 25 | """ 26 | def race(row): 27 | if ((row['HISPANX'] == 2) and (row['RACEV2X'] == 1)): #non-Hispanic Whites are marked as WHITE; all others as NON-WHITE 28 | return 'White' 29 | return 'Non-White' 30 | 31 | df['RACEV2X'] = df.apply(lambda row: race(row), axis=1) 32 | df = df.rename(columns = {'RACEV2X' : 'RACE'}) 33 | 34 | df = df[df['PANEL'] == 19] 35 | 36 | # RENAME COLUMNS 37 | df = df.rename(columns = {'FTSTU53X' : 'FTSTU', 'ACTDTY53' : 'ACTDTY', 'HONRDC53' : 'HONRDC', 'RTHLTH53' : 'RTHLTH', 38 | 'MNHLTH53' : 'MNHLTH', 'CHBRON53' : 'CHBRON', 'JTPAIN53' : 'JTPAIN', 'PREGNT53' : 'PREGNT', 39 | 'WLKLIM53' : 'WLKLIM', 'ACTLIM53' : 'ACTLIM', 'SOCLIM53' : 'SOCLIM', 'COGLIM53' : 'COGLIM', 40 | 'EMPST53' : 'EMPST', 'REGION53' : 'REGION', 'MARRY53X' : 'MARRY', 'AGE53X' : 'AGE', 41 | 'POVCAT15' : 'POVCAT', 'INSCOV15' : 'INSCOV'}) 42 | 43 | df = df[df['REGION'] >= 0] # remove values -1 44 | df = df[df['AGE'] >= 0] # remove values -1 45 | 46 | df = df[df['MARRY'] >= 0] # remove values -1, -7, -8, -9 47 | 48 | df = df[df['ASTHDX'] >= 0] # remove values -1, -7, -8, -9 49 | 50 | df = df[(df[['FTSTU','ACTDTY','HONRDC','RTHLTH','MNHLTH','HIBPDX','CHDDX','ANGIDX','EDUCYR','HIDEG', 51 | 'MIDX','OHRTDX','STRKDX','EMPHDX','CHBRON','CHOLDX','CANCERDX','DIABDX', 52 | 'JTPAIN','ARTHDX','ARTHTYPE','ASTHDX','ADHDADDX','PREGNT','WLKLIM', 53 | 'ACTLIM','SOCLIM','COGLIM','DFHEAR42','DFSEE42','ADSMOK42', 54 | 'PHQ242','EMPST','POVCAT','INSCOV']] >= -1).all(1)] #for all other categorical features, remove values < -1 55 | 56 | df = df[(df[['OBTOTV15', 'OPTOTV15', 'ERTOT15', 'IPNGTD15', 'HHTOTD15']]>=0).all(1)] 57 | 58 | def utilization(row): 59 | return row['OBTOTV15'] + row['OPTOTV15'] + row['ERTOT15'] + row['IPNGTD15'] + row['HHTOTD15'] 60 | 61 | df['TOTEXP15'] = df.apply(lambda row: utilization(row), axis=1) 62 | 63 | df = df.rename(columns = {'TOTEXP15' : 'UTILIZATION_reg'}) 64 | return df 65 | 66 | 67 | class MEPSDataset19Reg(SaveDataset): 68 | """MEPS Dataset. 
69 | """ 70 | 71 | def __init__(self, label_name='UTILIZATION_reg', favorable_classes=[1.0], 72 | protected_attribute_names=['RACE'], 73 | privileged_classes=[['White']], 74 | instance_weights_name='PERWT15F', 75 | categorical_features=['REGION','SEX','MARRY', 76 | 'FTSTU','ACTDTY','HONRDC','RTHLTH','MNHLTH','HIBPDX','CHDDX','ANGIDX', 77 | 'MIDX','OHRTDX','STRKDX','EMPHDX','CHBRON','CHOLDX','CANCERDX','DIABDX', 78 | 'JTPAIN','ARTHDX','ARTHTYPE','ASTHDX','ADHDADDX','PREGNT','WLKLIM', 79 | 'ACTLIM','SOCLIM','COGLIM','DFHEAR42','DFSEE42','ADSMOK42', 80 | 'PHQ242','EMPST','POVCAT','INSCOV'], 81 | features_to_keep=['REGION','AGE','SEX','RACE','MARRY', 82 | 'FTSTU','ACTDTY','HONRDC','RTHLTH','MNHLTH','HIBPDX','CHDDX','ANGIDX', 83 | 'MIDX','OHRTDX','STRKDX','EMPHDX','CHBRON','CHOLDX','CANCERDX','DIABDX', 84 | 'JTPAIN','ARTHDX','ARTHTYPE','ASTHDX','ADHDADDX','PREGNT','WLKLIM', 85 | 'ACTLIM','SOCLIM','COGLIM','DFHEAR42','DFSEE42','ADSMOK42','PCS42', 86 | 'MCS42','K6SUM42','PHQ242','EMPST','POVCAT','INSCOV','UTILIZATION_reg','PERWT15F'], 87 | features_to_drop=[], 88 | na_values=[], custom_preprocessing=default_preprocessing, 89 | metadata=default_mappings): 90 | 91 | filepath = './h181.csv' 92 | 93 | df = pd.read_csv(filepath, sep=',', na_values=na_values) 94 | 95 | super(MEPSDataset19Reg, self).__init__(df=df, label_name=label_name, 96 | favorable_classes=favorable_classes, 97 | protected_attribute_names=protected_attribute_names, 98 | privileged_classes=privileged_classes, 99 | instance_weights_name=instance_weights_name, 100 | categorical_features=categorical_features, 101 | features_to_keep=features_to_keep, 102 | features_to_drop=features_to_drop, na_values=na_values, 103 | custom_preprocessing=custom_preprocessing, metadata=metadata, dataset_name='meps_19_reg') 104 | -------------------------------------------------------------------------------- /get_meps_data/meps_dataset_panel20_fy2015_reg.py: -------------------------------------------------------------------------------- 1 | # This code is a variant of 2 | # https://github.com/IBM/AIF360/blob/master/aif360/datasets/meps_dataset_panel20_fy2015.py 3 | 4 | from __future__ import absolute_import 5 | from __future__ import division 6 | from __future__ import print_function 7 | from __future__ import unicode_literals 8 | 9 | import pandas as pd 10 | 11 | #from standard_datasets import StandardDataset 12 | from save_dataset import SaveDataset 13 | 14 | default_mappings = { 15 | 'protected_attribute_maps': [{1.0: 'White', 0.0: 'Non-White'}] 16 | } 17 | 18 | def default_preprocessing(df): 19 | """ 20 | 1.Create a new column, RACE that is 'White' if RACEV2X = 1 and HISPANX = 2 i.e. non Hispanic White 21 | and 'non-White' otherwise 22 | 2. Restrict to Panel 20 23 | 3. RENAME all columns that are PANEL/ROUND SPECIFIC 24 | 4. Drop rows based on certain values of individual features that correspond to missing/unknown - generally < -1 25 | 5. 
Compute UTILIZATION as the sum of the five utilization counts; in this regression variant it is kept as a continuous target (not binarized) 26 | """ 27 | def race(row): 28 | if ((row['HISPANX'] == 2) and (row['RACEV2X'] == 1)): #non-Hispanic Whites are marked as WHITE; all others as NON-WHITE 29 | return 'White' 30 | return 'Non-White' 31 | 32 | df['RACEV2X'] = df.apply(lambda row: race(row), axis=1) 33 | df = df.rename(columns = {'RACEV2X' : 'RACE'}) 34 | 35 | df = df[df['PANEL'] == 20] 36 | 37 | # RENAME COLUMNS 38 | df = df.rename(columns = {'FTSTU53X' : 'FTSTU', 'ACTDTY53' : 'ACTDTY', 'HONRDC53' : 'HONRDC', 'RTHLTH53' : 'RTHLTH', 39 | 'MNHLTH53' : 'MNHLTH', 'CHBRON53' : 'CHBRON', 'JTPAIN53' : 'JTPAIN', 'PREGNT53' : 'PREGNT', 40 | 'WLKLIM53' : 'WLKLIM', 'ACTLIM53' : 'ACTLIM', 'SOCLIM53' : 'SOCLIM', 'COGLIM53' : 'COGLIM', 41 | 'EMPST53' : 'EMPST', 'REGION53' : 'REGION', 'MARRY53X' : 'MARRY', 'AGE53X' : 'AGE', 42 | 'POVCAT15' : 'POVCAT', 'INSCOV15' : 'INSCOV'}) 43 | 44 | df = df[df['REGION'] >= 0] # remove values -1 45 | df = df[df['AGE'] >= 0] # remove values -1 46 | 47 | df = df[df['MARRY'] >= 0] # remove values -1, -7, -8, -9 48 | 49 | df = df[df['ASTHDX'] >= 0] # remove values -1, -7, -8, -9 50 | 51 | df = df[(df[['FTSTU','ACTDTY','HONRDC','RTHLTH','MNHLTH','HIBPDX','CHDDX','ANGIDX','EDUCYR','HIDEG', 52 | 'MIDX','OHRTDX','STRKDX','EMPHDX','CHBRON','CHOLDX','CANCERDX','DIABDX', 53 | 'JTPAIN','ARTHDX','ARTHTYPE','ASTHDX','ADHDADDX','PREGNT','WLKLIM', 54 | 'ACTLIM','SOCLIM','COGLIM','DFHEAR42','DFSEE42','ADSMOK42', 55 | 'PHQ242','EMPST','POVCAT','INSCOV']] >= -1).all(1)] #for all other categorical features, remove values < -1 56 | 57 | df = df[(df[['OBTOTV15', 'OPTOTV15', 'ERTOT15', 'IPNGTD15', 'HHTOTD15']]>=0).all(1)] 58 | 59 | def utilization(row): 60 | return row['OBTOTV15'] + row['OPTOTV15'] + row['ERTOT15'] + row['IPNGTD15'] + row['HHTOTD15'] 61 | 62 | df['TOTEXP15'] = df.apply(lambda row: utilization(row), axis=1) 63 | 64 | df = df.rename(columns = {'TOTEXP15' : 'UTILIZATION_reg'}) 65 | return df 66 | 67 | 68 | class MEPSDataset20Reg(SaveDataset): 69 | """MEPS Dataset.
70 | """ 71 | 72 | def __init__(self, label_name='UTILIZATION_reg', favorable_classes=[1.0], 73 | protected_attribute_names=['RACE'], 74 | privileged_classes=[['White']], 75 | instance_weights_name='PERWT15F', 76 | categorical_features=['REGION','SEX','MARRY', 77 | 'FTSTU','ACTDTY','HONRDC','RTHLTH','MNHLTH','HIBPDX','CHDDX','ANGIDX', 78 | 'MIDX','OHRTDX','STRKDX','EMPHDX','CHBRON','CHOLDX','CANCERDX','DIABDX', 79 | 'JTPAIN','ARTHDX','ARTHTYPE','ASTHDX','ADHDADDX','PREGNT','WLKLIM', 80 | 'ACTLIM','SOCLIM','COGLIM','DFHEAR42','DFSEE42', 'ADSMOK42', 'PHQ242', 81 | 'EMPST','POVCAT','INSCOV'], 82 | features_to_keep=['REGION','AGE','SEX','RACE','MARRY', 83 | 'FTSTU','ACTDTY','HONRDC','RTHLTH','MNHLTH','HIBPDX','CHDDX','ANGIDX', 84 | 'MIDX','OHRTDX','STRKDX','EMPHDX','CHBRON','CHOLDX','CANCERDX','DIABDX', 85 | 'JTPAIN','ARTHDX','ARTHTYPE','ASTHDX','ADHDADDX','PREGNT','WLKLIM', 86 | 'ACTLIM','SOCLIM','COGLIM','DFHEAR42','DFSEE42', 'ADSMOK42', 87 | 'PCS42', 88 | 'MCS42','K6SUM42','PHQ242','EMPST','POVCAT','INSCOV','UTILIZATION_reg', 'PERWT15F'], 89 | features_to_drop=[], 90 | na_values=[], custom_preprocessing=default_preprocessing, 91 | metadata=default_mappings): 92 | 93 | filepath = './h181.csv' 94 | 95 | df = pd.read_csv(filepath, sep=',', na_values=na_values) 96 | 97 | super(MEPSDataset20Reg, self).__init__(df=df, label_name=label_name, 98 | favorable_classes=favorable_classes, 99 | protected_attribute_names=protected_attribute_names, 100 | privileged_classes=privileged_classes, 101 | instance_weights_name=instance_weights_name, 102 | categorical_features=categorical_features, 103 | features_to_keep=features_to_keep, 104 | features_to_drop=features_to_drop, na_values=na_values, 105 | custom_preprocessing=custom_preprocessing, metadata=metadata, dataset_name='meps_20_reg') 106 | -------------------------------------------------------------------------------- /get_meps_data/meps_dataset_panel21_fy2016_reg.py: -------------------------------------------------------------------------------- 1 | # This code is a variant of 2 | # https://github.com/IBM/AIF360/blob/master/aif360/datasets/meps_dataset_panel21_fy2016.py 3 | 4 | from __future__ import absolute_import 5 | from __future__ import division 6 | from __future__ import print_function 7 | from __future__ import unicode_literals 8 | 9 | import pandas as pd 10 | 11 | #from standard_dataset import StandardDataset 12 | from save_dataset import SaveDataset 13 | 14 | default_mappings = { 15 | 'protected_attribute_maps': [{1.0: 'White', 0.0: 'Non-White'}] 16 | } 17 | 18 | def default_preprocessing(df): 19 | """ 20 | 1.Create a new column, RACE that is 'White' if RACEV2X = 1 and HISPANX = 2 i.e. non Hispanic White 21 | and 'Non-White' otherwise 22 | 2. Restrict to Panel 21 23 | 3. RENAME all columns that are PANEL/ROUND SPECIFIC 24 | 4. Drop rows based on certain values of individual features that correspond to missing/unknown - generally < -1 25 | 5. 
Compute UTILIZATION as the sum of the five utilization counts; in this regression variant it is kept as a continuous target (not binarized) 26 | """ 27 | def race(row): 28 | if ((row['HISPANX'] == 2) and (row['RACEV2X'] == 1)): #non-Hispanic Whites are marked as WHITE; all others as NON-WHITE 29 | return 'White' 30 | return 'Non-White' 31 | 32 | df['RACEV2X'] = df.apply(lambda row: race(row), axis=1) 33 | df = df.rename(columns = {'RACEV2X' : 'RACE'}) 34 | 35 | df = df[df['PANEL'] == 21] 36 | 37 | # RENAME COLUMNS 38 | df = df.rename(columns = {'FTSTU53X' : 'FTSTU', 'ACTDTY53' : 'ACTDTY', 'HONRDC53' : 'HONRDC', 'RTHLTH53' : 'RTHLTH', 39 | 'MNHLTH53' : 'MNHLTH', 'CHBRON53' : 'CHBRON', 'JTPAIN53' : 'JTPAIN', 'PREGNT53' : 'PREGNT', 40 | 'WLKLIM53' : 'WLKLIM', 'ACTLIM53' : 'ACTLIM', 'SOCLIM53' : 'SOCLIM', 'COGLIM53' : 'COGLIM', 41 | 'EMPST53' : 'EMPST', 'REGION53' : 'REGION', 'MARRY53X' : 'MARRY', 'AGE53X' : 'AGE', 42 | 'POVCAT16' : 'POVCAT', 'INSCOV16' : 'INSCOV'}) 43 | 44 | df = df[df['REGION'] >= 0] # remove values -1 45 | df = df[df['AGE'] >= 0] # remove values -1 46 | 47 | df = df[df['MARRY'] >= 0] # remove values -1, -7, -8, -9 48 | 49 | df = df[df['ASTHDX'] >= 0] # remove values -1, -7, -8, -9 50 | 51 | df = df[(df[['FTSTU','ACTDTY','HONRDC','RTHLTH','MNHLTH','HIBPDX','CHDDX','ANGIDX','EDUCYR','HIDEG', 52 | 'MIDX','OHRTDX','STRKDX','EMPHDX','CHBRON','CHOLDX','CANCERDX','DIABDX', 53 | 'JTPAIN','ARTHDX','ARTHTYPE','ASTHDX','ADHDADDX','PREGNT','WLKLIM', 54 | 'ACTLIM','SOCLIM','COGLIM','DFHEAR42','DFSEE42','ADSMOK42', 55 | 'PHQ242','EMPST','POVCAT','INSCOV']] >= -1).all(1)] #for all other categorical features, remove values < -1 56 | 57 | df = df[(df[['OBTOTV16', 'OPTOTV16', 'ERTOT16', 'IPNGTD16', 'HHTOTD16']]>=0).all(1)] 58 | 59 | def utilization(row): 60 | return row['OBTOTV16'] + row['OPTOTV16'] + row['ERTOT16'] + row['IPNGTD16'] + row['HHTOTD16'] 61 | 62 | df['TOTEXP16'] = df.apply(lambda row: utilization(row), axis=1) 63 | 64 | df = df.rename(columns = {'TOTEXP16' : 'UTILIZATION_reg'}) 65 | return df 66 | 67 | 68 | class MEPSDataset21Reg(SaveDataset): 69 | """MEPS Dataset.
70 | """ 71 | 72 | def __init__(self, label_name='UTILIZATION_reg', favorable_classes=[1.0], 73 | protected_attribute_names=['RACE'], 74 | privileged_classes=[['White']], 75 | instance_weights_name='PERWT16F', 76 | categorical_features=['REGION','SEX','MARRY', 77 | 'FTSTU','ACTDTY','HONRDC','RTHLTH','MNHLTH','HIBPDX','CHDDX','ANGIDX', 78 | 'MIDX','OHRTDX','STRKDX','EMPHDX','CHBRON','CHOLDX','CANCERDX','DIABDX', 79 | 'JTPAIN','ARTHDX','ARTHTYPE','ASTHDX','ADHDADDX','PREGNT','WLKLIM', 80 | 'ACTLIM','SOCLIM','COGLIM','DFHEAR42','DFSEE42', 'ADSMOK42', 'PHQ242', 81 | 'EMPST','POVCAT','INSCOV'], 82 | features_to_keep=['REGION','AGE','SEX','RACE','MARRY', 83 | 'FTSTU','ACTDTY','HONRDC','RTHLTH','MNHLTH','HIBPDX','CHDDX','ANGIDX', 84 | 'MIDX','OHRTDX','STRKDX','EMPHDX','CHBRON','CHOLDX','CANCERDX','DIABDX', 85 | 'JTPAIN','ARTHDX','ARTHTYPE','ASTHDX','ADHDADDX','PREGNT','WLKLIM', 86 | 'ACTLIM','SOCLIM','COGLIM','DFHEAR42','DFSEE42','ADSMOK42', 87 | 'PCS42', 88 | 'MCS42','K6SUM42','PHQ242','EMPST','POVCAT','INSCOV','UTILIZATION_reg', 'PERWT16F'], 89 | features_to_drop=[], 90 | na_values=[], custom_preprocessing=default_preprocessing, 91 | metadata=default_mappings): 92 | 93 | filepath = './h192.csv' 94 | df = pd.read_csv(filepath, sep=',', na_values=na_values) 95 | 96 | super(MEPSDataset21Reg, self).__init__(df=df, label_name=label_name, 97 | favorable_classes=favorable_classes, 98 | protected_attribute_names=protected_attribute_names, 99 | privileged_classes=privileged_classes, 100 | instance_weights_name=instance_weights_name, 101 | categorical_features=categorical_features, 102 | features_to_keep=features_to_keep, 103 | features_to_drop=features_to_drop, na_values=na_values, 104 | custom_preprocessing=custom_preprocessing, metadata=metadata, dataset_name='meps_21_reg') 105 | -------------------------------------------------------------------------------- /get_meps_data/regression_dataset.py: -------------------------------------------------------------------------------- 1 | # Code copied from IBM's AIF360 package 2 | # https://github.com/IBM/AIF360/blob/master/aif360/datasets/binary_label_dataset.py 3 | 4 | from __future__ import absolute_import 5 | from __future__ import division 6 | from __future__ import print_function 7 | from __future__ import unicode_literals 8 | 9 | import numpy as np 10 | 11 | from structured_dataset import StructuredDataset 12 | 13 | 14 | class RegressionDataset(StructuredDataset): 15 | """Base class for all structured datasets with binary labels.""" 16 | 17 | def __init__(self, favorable_label=1., unfavorable_label=0., **kwargs): 18 | """ 19 | Args: 20 | favorable_label (float): Label value which is considered favorable 21 | (i.e. "positive"). 22 | unfavorable_label (float): Label value which is considered 23 | unfavorable (i.e. "negative"). 24 | **kwargs: StructuredDataset arguments. 25 | """ 26 | self.favorable_label = float(favorable_label) 27 | self.unfavorable_label = float(unfavorable_label) 28 | 29 | super(RegressionDataset, self).__init__(**kwargs) 30 | 31 | def validate_dataset(self): 32 | """Error checking and type validation. 33 | 34 | Raises: 35 | ValueError: `labels` must be shape [n, 1]. 36 | ValueError: `favorable_label` and `unfavorable_label` must be the 37 | only values present in `labels`. 
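A tiny illustration (not repository code) of the value check described above: labels pass validation only when every entry equals the designated favorable or unfavorable value.

```python
import numpy as np

labels = np.array([[1.0], [0.0], [1.0]])   # shape [n, 1], as required
assert set(labels.ravel()) <= {1.0, 0.0}   # the check performed in validate_dataset below
```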
38 | """ 39 | super(RegressionDataset, self).validate_dataset() 40 | 41 | # =========================== SHAPE CHECKING =========================== 42 | # Verify if the labels are only 1 column 43 | if self.labels.shape[1] != 1: 44 | raise ValueError("BinaryLabelDataset only supports single-column " 45 | "labels:\n\tlabels.shape = {}".format(self.labels.shape)) 46 | 47 | # =========================== VALUE CHECKING =========================== 48 | # Check if the favorable and unfavorable labels match those in the dataset 49 | if (not set(self.labels.ravel()) <= 50 | set([self.favorable_label, self.unfavorable_label])): 51 | raise ValueError("The favorable and unfavorable labels provided do " 52 | "not match the labels in the dataset.") 53 | 54 | if np.all(self.scores == self.labels): 55 | self.scores = (self.scores == self.favorable_label).astype(np.float64) 56 | -------------------------------------------------------------------------------- /get_meps_data/save_dataset.py: -------------------------------------------------------------------------------- 1 | # Code copied from IBM's AIF360 package 2 | # https://github.com/IBM/AIF360/blob/master/aif360/datasets/standard_dataset.py 3 | 4 | from __future__ import absolute_import 5 | from __future__ import division 6 | from __future__ import print_function 7 | from __future__ import unicode_literals 8 | 9 | from logging import warning 10 | 11 | import numpy as np 12 | import pandas as pd 13 | 14 | from regression_dataset import RegressionDataset 15 | 16 | 17 | class SaveDataset(RegressionDataset): 18 | """Base class for every :obj:`RegressionDataset`. The code is similar 19 | to that of aif360. 20 | 21 | It is not strictly necessary to inherit this class when adding custom 22 | datasets but it may be useful. 23 | 24 | This class is very loosely based on code from 25 | https://github.com/algofairness/fairness-comparison. 26 | """ 27 | 28 | def __init__(self, df, label_name, favorable_classes, 29 | protected_attribute_names, privileged_classes, 30 | instance_weights_name='', scores_name='', 31 | categorical_features=[], features_to_keep=[], 32 | features_to_drop=[], na_values=[], custom_preprocessing=None, 33 | metadata=None, dataset_name='my_data'): 34 | """ 35 | Subclasses of StandardDataset should perform the following before 36 | calling `super().__init__`: 37 | 38 | 1. Load the dataframe from a raw file. 39 | 40 | Then, this class will go through a standard preprocessing routine which: 41 | 42 | 2. (optional) Performs some dataset-specific preprocessing (e.g. 43 | renaming columns/values, handling missing data). 44 | 45 | 3. Drops unrequested columns (see `features_to_keep` and 46 | `features_to_drop` for details). 47 | 48 | 4. Drops rows with NA values. 49 | 50 | 5. Creates a one-hot encoding of the categorical variables. 51 | 52 | 6. Maps protected attributes to binary privileged/unprivileged 53 | values (1/0). 54 | 55 | Args: 56 | df (pandas.DataFrame): DataFrame on which to perform standard 57 | processing. 58 | label_name: Name of the label column in `df`. 59 | favorable_classes (list or function): Label values which are 60 | considered favorable or a boolean function which returns `True` 61 | if favorable. All others are unfavorable. Label values are 62 | mapped to 1 (favorable) and 0 (unfavorable) if they are not 63 | already binary and numerical. 64 | protected_attribute_names (list): List of names corresponding to 65 | protected attribute columns in `df`. 
66 | privileged_classes (list(list or function)): Each element is 67 | a list of values which are considered privileged or a boolean 68 | function which return `True` if privileged for the corresponding 69 | column in `protected_attribute_names`. All others are 70 | unprivileged. Values are mapped to 1 (privileged) and 0 71 | (unprivileged) if they are not already numerical. 72 | instance_weights_name (optional): Name of the instance weights 73 | column in `df`. 74 | categorical_features (optional, list): List of column names in the 75 | DataFrame which are to be expanded into one-hot vectors. 76 | features_to_keep (optional, list): Column names to keep. All others 77 | are dropped except those present in `protected_attribute_names`, 78 | `categorical_features`, `label_name` or `instance_weights_name`. 79 | Defaults to all columns if not provided. 80 | features_to_drop (optional, list): Column names to drop. *Note: this 81 | overrides* `features_to_keep`. 82 | na_values (optional): Additional strings to recognize as NA. See 83 | :func:`pandas.read_csv` for details. 84 | custom_preprocessing (function): A function object which 85 | acts on and returns a DataFrame (f: DataFrame -> DataFrame). If 86 | `None`, no extra preprocessing is applied. 87 | metadata (optional): Additional metadata to append. 88 | """ 89 | # 2. Perform dataset-specific preprocessing 90 | if custom_preprocessing: 91 | df = custom_preprocessing(df) 92 | 93 | # 3. Drop unrequested columns 94 | features_to_keep = features_to_keep or df.columns.tolist() 95 | keep = (set(features_to_keep) | set(protected_attribute_names) 96 | | set(categorical_features) | set([label_name])) 97 | if instance_weights_name: 98 | keep |= set([instance_weights_name]) 99 | df = df[sorted(keep - set(features_to_drop), key=df.columns.get_loc)] 100 | categorical_features = sorted(set(categorical_features) - set(features_to_drop), key=df.columns.get_loc) 101 | 102 | # 4. Remove any rows that have missing data. 103 | dropped = df.dropna() 104 | count = df.shape[0] - dropped.shape[0] 105 | if count > 0: 106 | warning("Missing Data: {} rows removed from {}.".format(count, 107 | type(self).__name__)) 108 | df = dropped 109 | 110 | # 5. Create a one-hot encoding of the categorical variables. 111 | df = pd.get_dummies(df, columns=categorical_features, prefix_sep='=') 112 | 113 | # 6. Map protected attributes to privileged/unprivileged 114 | privileged_protected_attributes = [] 115 | unprivileged_protected_attributes = [] 116 | for attr, vals in zip(protected_attribute_names, privileged_classes): 117 | privileged_values = [1.] 118 | unprivileged_values = [0.] 
119 | if callable(vals): 120 | df[attr] = df[attr].apply(vals) 121 | elif np.issubdtype(df[attr].dtype, np.number): 122 | # this attribute is numeric; no remapping needed 123 | privileged_values = vals 124 | unprivileged_values = list(set(df[attr]).difference(vals)) 125 | else: 126 | # find all instances which match any of the attribute values 127 | priv = np.array([ ( el in vals ) for el in df[attr] ]) 128 | df.loc[priv, attr] = privileged_values[0] 129 | df.loc[~priv, attr] = unprivileged_values[0] 130 | 131 | privileged_protected_attributes.append( 132 | np.array(privileged_values, dtype=np.float64)) 133 | unprivileged_protected_attributes.append( 134 | np.array(unprivileged_values, dtype=np.float64)) 135 | 136 | full_name = dataset_name + ".csv" 137 | print("writing file: " + full_name) 138 | df.to_csv(full_name) 139 | -------------------------------------------------------------------------------- /nonconformist/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yromano/cqr/73267abb7ed7d3c6dad6ab4449154db7ec306535/nonconformist/.DS_Store -------------------------------------------------------------------------------- /nonconformist/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ 4 | docstring 5 | """ 6 | 7 | # Authors: Henrik Linusson 8 | # Yaniv Romano modified np.py file to include CQR 9 | 10 | __version__ = '2.1.0' 11 | 12 | __all__ = ['icp', 'nc', 'acp'] 13 | -------------------------------------------------------------------------------- /nonconformist/acp.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ 4 | Aggregated conformal predictors 5 | """ 6 | 7 | # Authors: Henrik Linusson 8 | 9 | import numpy as np 10 | from sklearn.cross_validation import KFold, StratifiedKFold 11 | from sklearn.cross_validation import ShuffleSplit, StratifiedShuffleSplit 12 | from sklearn.base import clone 13 | from nonconformist.base import BaseEstimator 14 | from nonconformist.util import calc_p 15 | 16 | 17 | # ----------------------------------------------------------------------------- 18 | # Sampling strategies 19 | # ----------------------------------------------------------------------------- 20 | class BootstrapSampler(object): 21 | """Bootstrap sampler. 22 | 23 | See also 24 | -------- 25 | CrossSampler, RandomSubSampler 26 | 27 | Examples 28 | -------- 29 | """ 30 | def gen_samples(self, y, n_samples, problem_type): 31 | for i in range(n_samples): 32 | idx = np.array(range(y.size)) 33 | train = np.random.choice(y.size, y.size, replace=True) 34 | cal_mask = np.array(np.ones(idx.size), dtype=bool) 35 | for j in train: 36 | cal_mask[j] = False 37 | cal = idx[cal_mask] 38 | 39 | yield train, cal 40 | 41 | 42 | class CrossSampler(object): 43 | """Cross-fold sampler. 44 | 45 | See also 46 | -------- 47 | BootstrapSampler, RandomSubSampler 48 | 49 | Examples 50 | -------- 51 | """ 52 | def gen_samples(self, y, n_samples, problem_type): 53 | if problem_type == 'classification': 54 | folds = StratifiedKFold(y, n_folds=n_samples) 55 | else: 56 | folds = KFold(y.size, n_folds=n_samples) 57 | for train, cal in folds: 58 | yield train, cal 59 | 60 | 61 | class RandomSubSampler(object): 62 | """Random subsample sampler. 63 | 64 | Parameters 65 | ---------- 66 | calibration_portion : float 67 | Ratio (0-1) of examples to use for calibration. 
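A short sketch (toy data, not repository code) of the sampler protocol consumed by `AggregatedCp` further down: every sampler yields `(train, calibration)` index pairs.

```python
import numpy as np
from nonconformist.acp import BootstrapSampler

y = np.arange(20, dtype=float)
sampler = BootstrapSampler()
train, cal = next(sampler.gen_samples(y, n_samples=1, problem_type='regression'))
# `train` holds indices drawn with replacement; `cal` holds the left-out (out-of-bag) indices.
```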
68 | 69 | See also 70 | -------- 71 | BootstrapSampler, CrossSampler 72 | 73 | Examples 74 | -------- 75 | """ 76 | def __init__(self, calibration_portion=0.3): 77 | self.cal_portion = calibration_portion 78 | 79 | def gen_samples(self, y, n_samples, problem_type): 80 | if problem_type == 'classification': 81 | splits = StratifiedShuffleSplit(y, 82 | n_iter=n_samples, 83 | test_size=self.cal_portion) 84 | else: 85 | splits = ShuffleSplit(y.size, 86 | n_iter=n_samples, 87 | test_size=self.cal_portion) 88 | 89 | for train, cal in splits: 90 | yield train, cal 91 | 92 | 93 | # ----------------------------------------------------------------------------- 94 | # Conformal ensemble 95 | # ----------------------------------------------------------------------------- 96 | class AggregatedCp(BaseEstimator): 97 | """Aggregated conformal predictor. 98 | 99 | Combines multiple IcpClassifier or IcpRegressor predictors into an 100 | aggregated model. 101 | 102 | Parameters 103 | ---------- 104 | predictor : object 105 | Prototype conformal predictor (e.g. IcpClassifier or IcpRegressor) 106 | used for defining conformal predictors included in the aggregate model. 107 | 108 | sampler : object 109 | Sampler object used to generate training and calibration examples 110 | for the underlying conformal predictors. 111 | 112 | aggregation_func : callable 113 | Function used to aggregate the predictions of the underlying 114 | conformal predictors. Defaults to ``numpy.mean``. 115 | 116 | n_models : int 117 | Number of models to aggregate. 118 | 119 | Attributes 120 | ---------- 121 | predictor : object 122 | Prototype conformal predictor. 123 | 124 | predictors : list 125 | List of underlying conformal predictors. 126 | 127 | sampler : object 128 | Sampler object used to generate training and calibration examples. 129 | 130 | agg_func : callable 131 | Function used to aggregate the predictions of the underlying 132 | conformal predictors 133 | 134 | References 135 | ---------- 136 | .. [1] Vovk, V. (2013). Cross-conformal predictors. Annals of Mathematics 137 | and Artificial Intelligence, 1-20. 138 | 139 | .. [2] Carlsson, L., Eklund, M., & Norinder, U. (2014). Aggregated 140 | Conformal Prediction. In Artificial Intelligence Applications and 141 | Innovations (pp. 231-240). Springer Berlin Heidelberg. 142 | 143 | Examples 144 | -------- 145 | """ 146 | def __init__(self, 147 | predictor, 148 | sampler=BootstrapSampler(), 149 | aggregation_func=None, 150 | n_models=10): 151 | self.predictors = [] 152 | self.n_models = n_models 153 | self.predictor = predictor 154 | self.sampler = sampler 155 | 156 | if aggregation_func is not None: 157 | self.agg_func = aggregation_func 158 | else: 159 | self.agg_func = lambda x: np.mean(x, axis=2) 160 | 161 | def fit(self, x, y): 162 | """Fit underlying conformal predictors. 163 | 164 | Parameters 165 | ---------- 166 | x : numpy array of shape [n_samples, n_features] 167 | Inputs of examples for fitting the underlying conformal predictors. 168 | 169 | y : numpy array of shape [n_samples] 170 | Outputs of examples for fitting the underlying conformal predictors. 
171 | 172 | Returns 173 | ------- 174 | None 175 | """ 176 | self.n_train = y.size 177 | self.predictors = [] 178 | idx = np.random.permutation(y.size) 179 | x, y = x[idx, :], y[idx] 180 | problem_type = self.predictor.__class__.get_problem_type() 181 | samples = self.sampler.gen_samples(y, 182 | self.n_models, 183 | problem_type) 184 | for train, cal in samples: 185 | predictor = clone(self.predictor) 186 | predictor.fit(x[train, :], y[train]) 187 | predictor.calibrate(x[cal, :], y[cal]) 188 | self.predictors.append(predictor) 189 | 190 | if problem_type == 'classification': 191 | self.classes = self.predictors[0].classes 192 | 193 | def predict(self, x, significance=None): 194 | """Predict the output values for a set of input patterns. 195 | 196 | Parameters 197 | ---------- 198 | x : numpy array of shape [n_samples, n_features] 199 | Inputs of patters for which to predict output values. 200 | 201 | significance : float or None 202 | Significance level (maximum allowed error rate) of predictions. 203 | Should be a float between 0 and 1. If ``None``, then the p-values 204 | are output rather than the predictions. Note: ``significance=None`` 205 | is applicable to classification problems only. 206 | 207 | Returns 208 | ------- 209 | p : numpy array of shape [n_samples, n_classes] or [n_samples, 2] 210 | For classification problems: If significance is ``None``, then p 211 | contains the p-values for each sample-class pair; if significance 212 | is a float between 0 and 1, then p is a boolean array denoting 213 | which labels are included in the prediction sets. 214 | 215 | For regression problems: Prediction interval (minimum and maximum 216 | boundaries) for the set of test patterns. 217 | """ 218 | is_regression =\ 219 | self.predictor.__class__.get_problem_type() == 'regression' 220 | 221 | n_examples = x.shape[0] 222 | 223 | if is_regression and significance is None: 224 | signs = np.arange(0.01, 1.0, 0.01) 225 | pred = np.zeros((n_examples, 2, signs.size)) 226 | for i, s in enumerate(signs): 227 | predictions = np.dstack([p.predict(x, s) 228 | for p in self.predictors]) 229 | predictions = self.agg_func(predictions) 230 | pred[:, :, i] = predictions 231 | return pred 232 | else: 233 | def f(p, x): 234 | return p.predict(x, significance if is_regression else None) 235 | predictions = np.dstack([f(p, x) for p in self.predictors]) 236 | predictions = self.agg_func(predictions) 237 | 238 | if significance and not is_regression: 239 | return predictions >= significance 240 | else: 241 | return predictions 242 | 243 | 244 | class CrossConformalClassifier(AggregatedCp): 245 | """Cross-conformal classifier. 246 | 247 | Combines multiple IcpClassifiers into a cross-conformal classifier. 248 | 249 | Parameters 250 | ---------- 251 | predictor : object 252 | Prototype conformal predictor (e.g. IcpClassifier or IcpRegressor) 253 | used for defining conformal predictors included in the aggregate model. 254 | 255 | aggregation_func : callable 256 | Function used to aggregate the predictions of the underlying 257 | conformal predictors. Defaults to ``numpy.mean``. 258 | 259 | n_models : int 260 | Number of models to aggregate. 261 | 262 | Attributes 263 | ---------- 264 | predictor : object 265 | Prototype conformal predictor. 266 | 267 | predictors : list 268 | List of underlying conformal predictors. 269 | 270 | sampler : object 271 | Sampler object used to generate training and calibration examples. 
272 | 273 | agg_func : callable 274 | Function used to aggregate the predictions of the underlying 275 | conformal predictors 276 | 277 | References 278 | ---------- 279 | .. [1] Vovk, V. (2013). Cross-conformal predictors. Annals of Mathematics 280 | and Artificial Intelligence, 1-20. 281 | 282 | Examples 283 | -------- 284 | """ 285 | def __init__(self, 286 | predictor, 287 | n_models=10): 288 | super(CrossConformalClassifier, self).__init__(predictor, 289 | CrossSampler(), 290 | n_models) 291 | 292 | def predict(self, x, significance=None): 293 | ncal_ngt_neq = np.stack([p._get_stats(x) for p in self.predictors], 294 | axis=3) 295 | ncal_ngt_neq = ncal_ngt_neq.sum(axis=3) 296 | 297 | p = calc_p(ncal_ngt_neq[:, :, 0], 298 | ncal_ngt_neq[:, :, 1], 299 | ncal_ngt_neq[:, :, 2], 300 | smoothing=self.predictors[0].smoothing) 301 | 302 | if significance: 303 | return p > significance 304 | else: 305 | return p 306 | 307 | 308 | class BootstrapConformalClassifier(AggregatedCp): 309 | """Bootstrap conformal classifier. 310 | 311 | Combines multiple IcpClassifiers into a bootstrap conformal classifier. 312 | 313 | Parameters 314 | ---------- 315 | predictor : object 316 | Prototype conformal predictor (e.g. IcpClassifier or IcpRegressor) 317 | used for defining conformal predictors included in the aggregate model. 318 | 319 | aggregation_func : callable 320 | Function used to aggregate the predictions of the underlying 321 | conformal predictors. Defaults to ``numpy.mean``. 322 | 323 | n_models : int 324 | Number of models to aggregate. 325 | 326 | Attributes 327 | ---------- 328 | predictor : object 329 | Prototype conformal predictor. 330 | 331 | predictors : list 332 | List of underlying conformal predictors. 333 | 334 | sampler : object 335 | Sampler object used to generate training and calibration examples. 336 | 337 | agg_func : callable 338 | Function used to aggregate the predictions of the underlying 339 | conformal predictors 340 | 341 | References 342 | ---------- 343 | .. [1] Vovk, V. (2013). Cross-conformal predictors. Annals of Mathematics 344 | and Artificial Intelligence, 1-20. 
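A minimal end-to-end sketch (toy data, not from the repository) of the aggregated conformal predictor defined above, here wrapping an inductive conformal regressor; the constructor forms follow the doctest style used elsewhere in this package.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from nonconformist.base import RegressorAdapter
from nonconformist.nc import RegressorNc, AbsErrorErrFunc
from nonconformist.icp import IcpRegressor
from nonconformist.acp import AggregatedCp, BootstrapSampler

x, y = np.random.rand(200, 4), np.random.rand(200)
icp = IcpRegressor(RegressorNc(RegressorAdapter(RandomForestRegressor()), AbsErrorErrFunc()))
acp = AggregatedCp(icp, sampler=BootstrapSampler(), n_models=5)
acp.fit(x[:150], y[:150])
intervals = acp.predict(x[150:], significance=0.1)   # shape [n_test, 2]: lower / upper interval bounds
```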
345 | 346 | Examples 347 | -------- 348 | """ 349 | def __init__(self, 350 | predictor, 351 | n_models=10): 352 | super(BootstrapConformalClassifier, self).__init__(predictor, 353 | BootstrapSampler(), 354 | n_models) 355 | 356 | def predict(self, x, significance=None): 357 | ncal_ngt_neq = np.stack([p._get_stats(x) for p in self.predictors], 358 | axis=3) 359 | ncal_ngt_neq = ncal_ngt_neq.sum(axis=3) 360 | 361 | p = calc_p(ncal_ngt_neq[:, :, 0] + ncal_ngt_neq[:, :, 0] / self.n_train, 362 | ncal_ngt_neq[:, :, 1] + ncal_ngt_neq[:, :, 0] / self.n_train, 363 | ncal_ngt_neq[:, :, 2], 364 | smoothing=self.predictors[0].smoothing) 365 | 366 | if significance: 367 | return p > significance 368 | else: 369 | return p 370 | -------------------------------------------------------------------------------- /nonconformist/base.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ 4 | docstring 5 | """ 6 | 7 | # Authors: Henrik Linusson 8 | 9 | import abc 10 | import numpy as np 11 | 12 | from sklearn.base import BaseEstimator 13 | 14 | 15 | class RegressorMixin(object): 16 | def __init__(self): 17 | super(RegressorMixin, self).__init__() 18 | 19 | @classmethod 20 | def get_problem_type(cls): 21 | return 'regression' 22 | 23 | 24 | class ClassifierMixin(object): 25 | def __init__(self): 26 | super(ClassifierMixin, self).__init__() 27 | 28 | @classmethod 29 | def get_problem_type(cls): 30 | return 'classification' 31 | 32 | 33 | class BaseModelAdapter(BaseEstimator): 34 | __metaclass__ = abc.ABCMeta 35 | 36 | def __init__(self, model, fit_params=None): 37 | super(BaseModelAdapter, self).__init__() 38 | 39 | self.model = model 40 | self.last_x, self.last_y = None, None 41 | self.clean = False 42 | self.fit_params = {} if fit_params is None else fit_params 43 | 44 | def fit(self, x, y): 45 | """Fits the model. 46 | 47 | Parameters 48 | ---------- 49 | x : numpy array of shape [n_samples, n_features] 50 | Inputs of examples for fitting the model. 51 | 52 | y : numpy array of shape [n_samples] 53 | Outputs of examples for fitting the model. 54 | 55 | Returns 56 | ------- 57 | None 58 | """ 59 | 60 | self.model.fit(x, y, **self.fit_params) 61 | self.clean = False 62 | 63 | def predict(self, x): 64 | """Returns the prediction made by the underlying model. 65 | 66 | Parameters 67 | ---------- 68 | x : numpy array of shape [n_samples, n_features] 69 | Inputs of test examples. 70 | 71 | Returns 72 | ------- 73 | y : numpy array of shape [n_samples] 74 | Predicted outputs of test examples. 75 | """ 76 | if ( 77 | not self.clean or 78 | self.last_x is None or 79 | self.last_y is None or 80 | not np.array_equal(self.last_x, x) 81 | ): 82 | self.last_x = x 83 | self.last_y = self._underlying_predict(x) 84 | self.clean = True 85 | 86 | return self.last_y.copy() 87 | 88 | @abc.abstractmethod 89 | def _underlying_predict(self, x): 90 | """Produces a prediction using the encapsulated model. 91 | 92 | Parameters 93 | ---------- 94 | x : numpy array of shape [n_samples, n_features] 95 | Inputs of test examples. 96 | 97 | Returns 98 | ------- 99 | y : numpy array of shape [n_samples] 100 | Predicted outputs of test examples. 
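For concreteness, a small sketch (toy data, not repository code) of the adapter layer defined in this module: adapters expose a uniform `fit`/`predict` interface around an underlying model and cache the last prediction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from nonconformist.base import RegressorAdapter

x, y = np.random.rand(50, 2), np.random.rand(50)
adapter = RegressorAdapter(LinearRegression())
adapter.fit(x, y)             # delegates to LinearRegression.fit
y_hat = adapter.predict(x)    # result is cached until the inputs change
```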
101 | """ 102 | pass 103 | 104 | 105 | class ClassifierAdapter(BaseModelAdapter): 106 | def __init__(self, model, fit_params=None): 107 | super(ClassifierAdapter, self).__init__(model, fit_params) 108 | 109 | def _underlying_predict(self, x): 110 | return self.model.predict_proba(x) 111 | 112 | 113 | class RegressorAdapter(BaseModelAdapter): 114 | def __init__(self, model, fit_params=None): 115 | super(RegressorAdapter, self).__init__(model, fit_params) 116 | 117 | def _underlying_predict(self, x): 118 | return self.model.predict(x) 119 | 120 | 121 | class OobMixin(object): 122 | def __init__(self, model, fit_params=None): 123 | super(OobMixin, self).__init__(model, fit_params) 124 | self.train_x = None 125 | 126 | def fit(self, x, y): 127 | super(OobMixin, self).fit(x, y) 128 | self.train_x = x 129 | 130 | def _underlying_predict(self, x): 131 | # TODO: sub-sampling of ensemble for test patterns 132 | oob = x == self.train_x 133 | 134 | if hasattr(oob, 'all'): 135 | oob = oob.all() 136 | 137 | if oob: 138 | return self._oob_prediction() 139 | else: 140 | return super(OobMixin, self)._underlying_predict(x) 141 | 142 | 143 | class OobClassifierAdapter(OobMixin, ClassifierAdapter): 144 | def __init__(self, model, fit_params=None): 145 | super(OobClassifierAdapter, self).__init__(model, fit_params) 146 | 147 | def _oob_prediction(self): 148 | return self.model.oob_decision_function_ 149 | 150 | 151 | class OobRegressorAdapter(OobMixin, RegressorAdapter): 152 | def __init__(self, model, fit_params=None): 153 | super(OobRegressorAdapter, self).__init__(model, fit_params) 154 | 155 | def _oob_prediction(self): 156 | return self.model.oob_prediction_ 157 | -------------------------------------------------------------------------------- /nonconformist/cp.py: -------------------------------------------------------------------------------- 1 | from nonconformist.icp import * 2 | 3 | # TODO: move contents from nonconformist.icp here 4 | 5 | # ----------------------------------------------------------------------------- 6 | # TcpClassifier 7 | # ----------------------------------------------------------------------------- 8 | class TcpClassifier(BaseEstimator, ClassifierMixin): 9 | """Transductive conformal classifier. 10 | 11 | Parameters 12 | ---------- 13 | nc_function : BaseScorer 14 | Nonconformity scorer object used to calculate nonconformity of 15 | calibration examples and test patterns. Should implement ``fit(x, y)`` 16 | and ``calc_nc(x, y)``. 17 | 18 | smoothing : boolean 19 | Decides whether to use stochastic smoothing of p-values. 20 | 21 | Attributes 22 | ---------- 23 | train_x : numpy array of shape [n_cal_examples, n_features] 24 | Inputs of training set. 25 | 26 | train_y : numpy array of shape [n_cal_examples] 27 | Outputs of calibration set. 28 | 29 | nc_function : BaseScorer 30 | Nonconformity scorer object used to calculate nonconformity scores. 31 | 32 | classes : numpy array of shape [n_classes] 33 | List of class labels, with indices corresponding to output columns 34 | of TcpClassifier.predict() 35 | 36 | See also 37 | -------- 38 | IcpClassifier 39 | 40 | References 41 | ---------- 42 | .. [1] Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic learning 43 | in a random world. Springer Science & Business Media. 
44 | 45 | Examples 46 | -------- 47 | >>> import numpy as np 48 | >>> from sklearn.datasets import load_iris 49 | >>> from sklearn.svm import SVC 50 | >>> from nonconformist.base import ClassifierAdapter 51 | >>> from nonconformist.cp import TcpClassifier 52 | >>> from nonconformist.nc import ClassifierNc, MarginErrFunc 53 | >>> iris = load_iris() 54 | >>> idx = np.random.permutation(iris.target.size) 55 | >>> train = idx[:int(idx.size / 2)] 56 | >>> test = idx[int(idx.size / 2):] 57 | >>> model = ClassifierAdapter(SVC(probability=True)) 58 | >>> nc = ClassifierNc(model, MarginErrFunc()) 59 | >>> tcp = TcpClassifier(nc) 60 | >>> tcp.fit(iris.data[train, :], iris.target[train]) 61 | >>> tcp.predict(iris.data[test, :], significance=0.10) 62 | ... # doctest: +SKIP 63 | array([[ True, False, False], 64 | [False, True, False], 65 | ..., 66 | [False, True, False], 67 | [False, True, False]], dtype=bool) 68 | """ 69 | 70 | def __init__(self, nc_function, condition=None, smoothing=True): 71 | self.train_x, self.train_y = None, None 72 | self.nc_function = nc_function 73 | super(TcpClassifier, self).__init__() 74 | 75 | # Check if condition-parameter is the default function (i.e., 76 | # lambda x: 0). This is so we can safely clone the object without 77 | # the clone accidentally having self.conditional = True. 78 | default_condition = lambda x: 0 79 | is_default = (callable(condition) and 80 | (condition.__code__.co_code == 81 | default_condition.__code__.co_code)) 82 | 83 | if is_default: 84 | self.condition = condition 85 | self.conditional = False 86 | elif callable(condition): 87 | self.condition = condition 88 | self.conditional = True 89 | else: 90 | self.condition = lambda x: 0 91 | self.conditional = False 92 | 93 | self.smoothing = smoothing 94 | 95 | self.base_icp = IcpClassifier( 96 | self.nc_function, 97 | self.condition, 98 | self.smoothing 99 | ) 100 | 101 | self.classes = None 102 | 103 | def fit(self, x, y): 104 | self.train_x, self.train_y = x, y 105 | self.classes = np.unique(y) 106 | 107 | def predict(self, x, significance=None): 108 | """Predict the output values for a set of input patterns. 109 | 110 | Parameters 111 | ---------- 112 | x : numpy array of shape [n_samples, n_features] 113 | Inputs of patters for which to predict output values. 114 | 115 | significance : float or None 116 | Significance level (maximum allowed error rate) of predictions. 117 | Should be a float between 0 and 1. If ``None``, then the p-values 118 | are output rather than the predictions. 119 | 120 | Returns 121 | ------- 122 | p : numpy array of shape [n_samples, n_classes] 123 | If significance is ``None``, then p contains the p-values for each 124 | sample-class pair; if significance is a float between 0 and 1, then 125 | p is a boolean array denoting which labels are included in the 126 | prediction sets. 
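To make the per-label loop in the method body below concrete, this is the arithmetic behind each entry of `p` (illustrative only; the exact smoothed version lives in nonconformist/util.py, which is not reproduced here):

```python
# For one test point and one candidate label:
ngt, neq, n_train = 12, 1, 99              # scores strictly greater / equal to the test score, and training-set size
p_value = (ngt + neq + 1) / (n_train + 1)  # 0.14: the standard (unsmoothed) conformal p-value
```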
127 | """ 128 | n_test = x.shape[0] 129 | n_train = self.train_x.shape[0] 130 | p = np.zeros((n_test, self.classes.size)) 131 | for i in range(n_test): 132 | for j, y in enumerate(self.classes): 133 | train_x = np.vstack([self.train_x, x[i, :]]) 134 | train_y = np.hstack([self.train_y, y]) 135 | self.base_icp.fit(train_x, train_y) 136 | scores = self.base_icp.nc_function.score(train_x, train_y) 137 | ngt = (scores[:-1] > scores[-1]).sum() 138 | neq = (scores[:-1] == scores[-1]).sum() 139 | 140 | p[i, j] = calc_p(n_train, ngt, neq, self.smoothing) 141 | 142 | if significance is not None: 143 | return p > significance 144 | else: 145 | return p 146 | 147 | def predict_conf(self, x): 148 | """Predict the output values for a set of input patterns, using 149 | the confidence-and-credibility output scheme. 150 | 151 | Parameters 152 | ---------- 153 | x : numpy array of shape [n_samples, n_features] 154 | Inputs of patters for which to predict output values. 155 | 156 | Returns 157 | ------- 158 | p : numpy array of shape [n_samples, 3] 159 | p contains three columns: the first column contains the most 160 | likely class for each test pattern; the second column contains 161 | the confidence in the predicted class label, and the third column 162 | contains the credibility of the prediction. 163 | """ 164 | p = self.predict(x, significance=None) 165 | label = p.argmax(axis=1) 166 | credibility = p.max(axis=1) 167 | for i, idx in enumerate(label): 168 | p[i, idx] = -np.inf 169 | confidence = 1 - p.max(axis=1) 170 | 171 | return np.array([label, confidence, credibility]).T 172 | -------------------------------------------------------------------------------- /nonconformist/evaluation.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ 4 | Evaluation of conformal predictors. 5 | """ 6 | 7 | # Authors: Henrik Linusson 8 | 9 | # TODO: cross_val_score/run_experiment should possibly allow multiple to be evaluated on identical folding 10 | 11 | from __future__ import division 12 | 13 | from nonconformist.base import RegressorMixin, ClassifierMixin 14 | 15 | import sys 16 | import numpy as np 17 | import pandas as pd 18 | 19 | from sklearn.cross_validation import StratifiedShuffleSplit 20 | from sklearn.cross_validation import KFold 21 | from sklearn.cross_validation import train_test_split 22 | from sklearn.base import clone, BaseEstimator 23 | 24 | 25 | class BaseIcpCvHelper(BaseEstimator): 26 | """Base class for cross validation helpers. 27 | """ 28 | def __init__(self, icp, calibration_portion): 29 | super(BaseIcpCvHelper, self).__init__() 30 | self.icp = icp 31 | self.calibration_portion = calibration_portion 32 | 33 | def predict(self, x, significance=None): 34 | return self.icp.predict(x, significance) 35 | 36 | 37 | class ClassIcpCvHelper(BaseIcpCvHelper, ClassifierMixin): 38 | """Helper class for running the ``cross_val_score`` evaluation 39 | method on IcpClassifiers. 
40 | 41 | See also 42 | -------- 43 | IcpRegCrossValHelper 44 | 45 | Examples 46 | -------- 47 | >>> from sklearn.datasets import load_iris 48 | >>> from sklearn.ensemble import RandomForestClassifier 49 | >>> from nonconformist.icp import IcpClassifier 50 | >>> from nonconformist.nc import ClassifierNc, MarginErrFunc 51 | >>> from nonconformist.evaluation import ClassIcpCvHelper 52 | >>> from nonconformist.evaluation import class_mean_errors 53 | >>> from nonconformist.evaluation import cross_val_score 54 | >>> data = load_iris() 55 | >>> nc = ProbEstClassifierNc(RandomForestClassifier(), MarginErrFunc()) 56 | >>> icp = IcpClassifier(nc) 57 | >>> icp_cv = ClassIcpCvHelper(icp) 58 | >>> cross_val_score(icp_cv, 59 | ... data.data, 60 | ... data.target, 61 | ... iterations=2, 62 | ... folds=2, 63 | ... scoring_funcs=[class_mean_errors], 64 | ... significance_levels=[0.1]) 65 | ... # doctest: +SKIP 66 | class_mean_errors fold iter significance 67 | 0 0.013333 0 0 0.1 68 | 1 0.080000 1 0 0.1 69 | 2 0.053333 0 1 0.1 70 | 3 0.080000 1 1 0.1 71 | """ 72 | def __init__(self, icp, calibration_portion=0.25): 73 | super(ClassIcpCvHelper, self).__init__(icp, calibration_portion) 74 | 75 | def fit(self, x, y): 76 | split = StratifiedShuffleSplit(y, n_iter=1, 77 | test_size=self.calibration_portion) 78 | for train, cal in split: 79 | self.icp.fit(x[train, :], y[train]) 80 | self.icp.calibrate(x[cal, :], y[cal]) 81 | 82 | 83 | class RegIcpCvHelper(BaseIcpCvHelper, RegressorMixin): 84 | """Helper class for running the ``cross_val_score`` evaluation 85 | method on IcpRegressors. 86 | 87 | See also 88 | -------- 89 | IcpClassCrossValHelper 90 | 91 | Examples 92 | -------- 93 | >>> from sklearn.datasets import load_boston 94 | >>> from sklearn.ensemble import RandomForestRegressor 95 | >>> from nonconformist.icp import IcpRegressor 96 | >>> from nonconformist.nc import RegressorNc, AbsErrorErrFunc 97 | >>> from nonconformist.evaluation import RegIcpCvHelper 98 | >>> from nonconformist.evaluation import reg_mean_errors 99 | >>> from nonconformist.evaluation import cross_val_score 100 | >>> data = load_boston() 101 | >>> nc = RegressorNc(RandomForestRegressor(), AbsErrorErrFunc()) 102 | >>> icp = IcpRegressor(nc) 103 | >>> icp_cv = RegIcpCvHelper(icp) 104 | >>> cross_val_score(icp_cv, 105 | ... data.data, 106 | ... data.target, 107 | ... iterations=2, 108 | ... folds=2, 109 | ... scoring_funcs=[reg_mean_errors], 110 | ... significance_levels=[0.1]) 111 | ... # doctest: +SKIP 112 | fold iter reg_mean_errors significance 113 | 0 0 0 0.185771 0.1 114 | 1 1 0 0.138340 0.1 115 | 2 0 1 0.071146 0.1 116 | 3 1 1 0.043478 0.1 117 | """ 118 | def __init__(self, icp, calibration_portion=0.25): 119 | super(RegIcpCvHelper, self).__init__(icp, calibration_portion) 120 | 121 | def fit(self, x, y): 122 | split = train_test_split(x, y, test_size=self.calibration_portion) 123 | x_tr, x_cal, y_tr, y_cal = split[0], split[1], split[2], split[3] 124 | self.icp.fit(x_tr, y_tr) 125 | self.icp.calibrate(x_cal, y_cal) 126 | 127 | 128 | # ----------------------------------------------------------------------------- 129 | # 130 | # ----------------------------------------------------------------------------- 131 | def cross_val_score(model,x, y, iterations=10, folds=10, fit_params=None, 132 | scoring_funcs=None, significance_levels=None, 133 | verbose=False): 134 | """Evaluates a conformal predictor using cross-validation. 135 | 136 | Parameters 137 | ---------- 138 | model : object 139 | Conformal predictor to evaluate. 
140 | 141 | x : numpy array of shape [n_samples, n_features] 142 | Inputs of data to use for evaluation. 143 | 144 | y : numpy array of shape [n_samples] 145 | Outputs of data to use for evaluation. 146 | 147 | iterations : int 148 | Number of iterations to use for evaluation. The data set is randomly 149 | shuffled before each iteration. 150 | 151 | folds : int 152 | Number of folds to use for evaluation. 153 | 154 | fit_params : dictionary 155 | Parameters to supply to the conformal prediction object on training. 156 | 157 | scoring_funcs : iterable 158 | List of evaluation functions to apply to the conformal predictor in each 159 | fold. Each evaluation function should have a signature 160 | ``scorer(prediction, y, significance)``. 161 | 162 | significance_levels : iterable 163 | List of significance levels at which to evaluate the conformal 164 | predictor. 165 | 166 | verbose : boolean 167 | Indicates whether to output progress information during evaluation. 168 | 169 | Returns 170 | ------- 171 | scores : pandas DataFrame 172 | Tabulated results for each iteration, fold and evaluation function. 173 | """ 174 | 175 | fit_params = fit_params if fit_params else {} 176 | significance_levels = (significance_levels if significance_levels 177 | is not None else np.arange(0.01, 1.0, 0.01)) 178 | 179 | df = pd.DataFrame() 180 | 181 | columns = ['iter', 182 | 'fold', 183 | 'significance', 184 | ] + [f.__name__ for f in scoring_funcs] 185 | for i in range(iterations): 186 | idx = np.random.permutation(y.size) 187 | x, y = x[idx, :], y[idx] 188 | cv = KFold(y.size, folds) 189 | for j, (train, test) in enumerate(cv): 190 | if verbose: 191 | sys.stdout.write('\riter {}/{} fold {}/{}'.format( 192 | i + 1, 193 | iterations, 194 | j + 1, 195 | folds 196 | )) 197 | m = clone(model) 198 | m.fit(x[train, :], y[train], **fit_params) 199 | prediction = m.predict(x[test, :], significance=None) 200 | for k, s in enumerate(significance_levels): 201 | scores = [scoring_func(prediction, y[test], s) 202 | for scoring_func in scoring_funcs] 203 | df_score = pd.DataFrame([[i, j, s] + scores], 204 | columns=columns) 205 | df = df.append(df_score, ignore_index=True) 206 | 207 | return df 208 | 209 | 210 | def run_experiment(models, csv_files, iterations=10, folds=10, fit_params=None, 211 | scoring_funcs=None, significance_levels=None, 212 | normalize=False, verbose=False, header=0): 213 | """Performs a cross-validation evaluation of one or several conformal 214 | predictors on a collection of data sets in csv format. 215 | 216 | Parameters 217 | ---------- 218 | models : object or iterable 219 | Conformal predictor(s) to evaluate. 220 | 221 | csv_files : iterable 222 | List of file names (with absolute paths) containing csv-data, used to 223 | evaluate the conformal predictor. 224 | 225 | iterations : int 226 | Number of iterations to use for evaluation. The data set is randomly 227 | shuffled before each iteration. 228 | 229 | folds : int 230 | Number of folds to use for evaluation. 231 | 232 | fit_params : dictionary 233 | Parameters to supply to the conformal prediction object on training. 234 | 235 | scoring_funcs : iterable 236 | List of evaluation functions to apply to the conformal predictor in each 237 | fold. Each evaluation function should have a signature 238 | ``scorer(prediction, y, significance)``. 239 | 240 | significance_levels : iterable 241 | List of significance levels at which to evaluate the conformal 242 | predictor. 
243 | 244 | verbose : boolean 245 | Indicates whether to output progress information during evaluation. 246 | 247 | Returns 248 | ------- 249 | scores : pandas DataFrame 250 | Tabulated results for each data set, iteration, fold and 251 | evaluation function. 252 | """ 253 | df = pd.DataFrame() 254 | if not hasattr(models, '__iter__'): 255 | models = [models] 256 | 257 | for model in models: 258 | is_regression = model.get_problem_type() == 'regression' 259 | 260 | n_data_sets = len(csv_files) 261 | for i, csv_file in enumerate(csv_files): 262 | if verbose: 263 | print('\n{} ({} / {})'.format(csv_file, i + 1, n_data_sets)) 264 | data = pd.read_csv(csv_file, header=header) 265 | x, y = data.values[:, :-1], data.values[:, -1] 266 | x = np.array(x, dtype=np.float64) 267 | if normalize: 268 | if is_regression: 269 | y = y - y.min() / (y.max() - y.min()) 270 | else: 271 | for j, y_ in enumerate(np.unique(y)): 272 | y[y == y_] = j 273 | 274 | scores = cross_val_score(model, x, y, iterations, folds, 275 | fit_params, scoring_funcs, 276 | significance_levels, verbose) 277 | 278 | ds_df = pd.DataFrame(scores) 279 | ds_df['model'] = model.__class__.__name__ 280 | try: 281 | ds_df['data_set'] = csv_file.split('/')[-1] 282 | except: 283 | ds_df['data_set'] = csv_file 284 | 285 | df = df.append(ds_df) 286 | 287 | return df 288 | 289 | 290 | # ----------------------------------------------------------------------------- 291 | # Validity measures 292 | # ----------------------------------------------------------------------------- 293 | def reg_n_correct(prediction, y, significance=None): 294 | """Calculates the number of correct predictions made by a conformal 295 | regression model. 296 | """ 297 | if significance is not None: 298 | idx = int(significance * 100 - 1) 299 | prediction = prediction[:, :, idx] 300 | 301 | low = y >= prediction[:, 0] 302 | high = y <= prediction[:, 1] 303 | correct = low * high 304 | 305 | return y[correct].size 306 | 307 | 308 | def reg_mean_errors(prediction, y, significance): 309 | """Calculates the average error rate of a conformal regression model. 310 | """ 311 | return 1 - reg_n_correct(prediction, y, significance) / y.size 312 | 313 | 314 | def class_n_correct(prediction, y, significance): 315 | """Calculates the number of correct predictions made by a conformal 316 | classification model. 317 | """ 318 | labels, y = np.unique(y, return_inverse=True) 319 | prediction = prediction > significance 320 | correct = np.zeros((y.size,), dtype=bool) 321 | for i, y_ in enumerate(y): 322 | correct[i] = prediction[i, int(y_)] 323 | return np.sum(correct) 324 | 325 | 326 | def class_mean_errors(prediction, y, significance=None): 327 | """Calculates the average error rate of a conformal classification model. 328 | """ 329 | return 1 - (class_n_correct(prediction, y, significance) / y.size) 330 | 331 | 332 | def class_one_err(prediction, y, significance=None): 333 | """Calculates the error rate of conformal classifier predictions containing 334 | only a single output label. 
335 | """ 336 | labels, y = np.unique(y, return_inverse=True) 337 | prediction = prediction > significance 338 | idx = np.arange(0, y.size, 1) 339 | idx = filter(lambda x: np.sum(prediction[x, :]) == 1, idx) 340 | errors = filter(lambda x: not prediction[x, int(y[x])], idx) 341 | 342 | if len(idx) > 0: 343 | return np.size(errors) / np.size(idx) 344 | else: 345 | return 0 346 | 347 | 348 | def class_mean_errors_one_class(prediction, y, significance, c=0): 349 | """Calculates the average error rate of a conformal classification model, 350 | considering only test examples belonging to class ``c``. Use 351 | ``functools.partial`` in order to test other classes. 352 | """ 353 | labels, y = np.unique(y, return_inverse=True) 354 | prediction = prediction > significance 355 | idx = np.arange(0, y.size, 1)[y == c] 356 | errs = np.sum(1 for _ in filter(lambda x: not prediction[x, c], idx)) 357 | 358 | if idx.size > 0: 359 | return errs / idx.size 360 | else: 361 | return 0 362 | 363 | 364 | def class_one_err_one_class(prediction, y, significance, c=0): 365 | """Calculates the error rate of conformal classifier predictions containing 366 | only a single output label. Considers only test examples belonging to 367 | class ``c``. Use ``functools.partial`` in order to test other classes. 368 | """ 369 | labels, y = np.unique(y, return_inverse=True) 370 | prediction = prediction > significance 371 | idx = np.arange(0, y.size, 1) 372 | idx = filter(lambda x: prediction[x, c], idx) 373 | idx = filter(lambda x: np.sum(prediction[x, :]) == 1, idx) 374 | errors = filter(lambda x: int(y[x]) != c, idx) 375 | 376 | if len(idx) > 0: 377 | return np.size(errors) / np.size(idx) 378 | else: 379 | return 0 380 | 381 | 382 | # ----------------------------------------------------------------------------- 383 | # Efficiency measures 384 | # ----------------------------------------------------------------------------- 385 | def _reg_interval_size(prediction, y, significance): 386 | idx = int(significance * 100 - 1) 387 | prediction = prediction[:, :, idx] 388 | 389 | return prediction[:, 1] - prediction[:, 0] 390 | 391 | 392 | def reg_min_size(prediction, y, significance): 393 | return np.min(_reg_interval_size(prediction, y, significance)) 394 | 395 | 396 | def reg_q1_size(prediction, y, significance): 397 | return np.percentile(_reg_interval_size(prediction, y, significance), 25) 398 | 399 | 400 | def reg_median_size(prediction, y, significance): 401 | return np.median(_reg_interval_size(prediction, y, significance)) 402 | 403 | 404 | def reg_q3_size(prediction, y, significance): 405 | return np.percentile(_reg_interval_size(prediction, y, significance), 75) 406 | 407 | 408 | def reg_max_size(prediction, y, significance): 409 | return np.max(_reg_interval_size(prediction, y, significance)) 410 | 411 | 412 | def reg_mean_size(prediction, y, significance): 413 | """Calculates the average prediction interval size of a conformal 414 | regression model. 415 | """ 416 | return np.mean(_reg_interval_size(prediction, y, significance)) 417 | 418 | 419 | def class_avg_c(prediction, y, significance): 420 | """Calculates the average number of classes per prediction of a conformal 421 | classification model. 422 | """ 423 | prediction = prediction > significance 424 | return np.sum(prediction) / prediction.shape[0] 425 | 426 | 427 | def class_mean_p_val(prediction, y, significance): 428 | """Calculates the mean of the p-values output by a conformal classification 429 | model. 
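A usage sketch for the regression validity/efficiency measures above (toy arrays, not repository code); predictions are assumed to come as an [n_samples, 2, 99] grid over the significance levels 0.01–0.99, as produced by `predict(x, significance=None)`:

```python
import numpy as np
from nonconformist.evaluation import reg_mean_errors, reg_mean_size

y_test = np.random.rand(30)
lower, upper = y_test - 0.2, y_test + 0.2
pred = np.stack([np.tile(lower[:, None], 99), np.tile(upper[:, None], 99)], axis=1)  # [30, 2, 99]
err_rate  = reg_mean_errors(pred, y_test, significance=0.1)  # picks column int(0.1 * 100 - 1) == 9
avg_width = reg_mean_size(pred, y_test, significance=0.1)    # ~0.4 for this toy grid
```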
430 | """ 431 | return np.mean(prediction) 432 | 433 | 434 | def class_one_c(prediction, y, significance): 435 | """Calculates the rate of singleton predictions (prediction sets containing 436 | only a single class label) of a conformal classification model. 437 | """ 438 | prediction = prediction > significance 439 | n_singletons = np.sum(1 for _ in filter(lambda x: np.sum(x) == 1, 440 | prediction)) 441 | return n_singletons / y.size 442 | 443 | 444 | def class_empty(prediction, y, significance): 445 | """Calculates the rate of singleton predictions (prediction sets containing 446 | only a single class label) of a conformal classification model. 447 | """ 448 | prediction = prediction > significance 449 | n_empty = np.sum(1 for _ in filter(lambda x: np.sum(x) == 0, 450 | prediction)) 451 | return n_empty / y.size 452 | 453 | 454 | def n_test(prediction, y, significance): 455 | """Provides the number of test patters used in the evaluation. 456 | """ 457 | return y.size -------------------------------------------------------------------------------- /nonconformist/icp.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ 4 | Inductive conformal predictors. 5 | """ 6 | 7 | # Authors: Henrik Linusson 8 | 9 | from __future__ import division 10 | from collections import defaultdict 11 | from functools import partial 12 | 13 | import numpy as np 14 | from sklearn.base import BaseEstimator 15 | 16 | from nonconformist.base import RegressorMixin, ClassifierMixin 17 | from nonconformist.util import calc_p 18 | 19 | 20 | # ----------------------------------------------------------------------------- 21 | # Base inductive conformal predictor 22 | # ----------------------------------------------------------------------------- 23 | class BaseIcp(BaseEstimator): 24 | """Base class for inductive conformal predictors. 25 | """ 26 | def __init__(self, nc_function, condition=None): 27 | self.cal_x, self.cal_y = None, None 28 | self.nc_function = nc_function 29 | 30 | # Check if condition-parameter is the default function (i.e., 31 | # lambda x: 0). This is so we can safely clone the object without 32 | # the clone accidentally having self.conditional = True. 33 | default_condition = lambda x: 0 34 | is_default = (callable(condition) and 35 | (condition.__code__.co_code == 36 | default_condition.__code__.co_code)) 37 | 38 | if is_default: 39 | self.condition = condition 40 | self.conditional = False 41 | elif callable(condition): 42 | self.condition = condition 43 | self.conditional = True 44 | else: 45 | self.condition = lambda x: 0 46 | self.conditional = False 47 | 48 | def fit(self, x, y): 49 | """Fit underlying nonconformity scorer. 50 | 51 | Parameters 52 | ---------- 53 | x : numpy array of shape [n_samples, n_features] 54 | Inputs of examples for fitting the nonconformity scorer. 55 | 56 | y : numpy array of shape [n_samples] 57 | Outputs of examples for fitting the nonconformity scorer. 58 | 59 | Returns 60 | ------- 61 | None 62 | """ 63 | # TODO: incremental? 64 | self.nc_function.fit(x, y) 65 | 66 | def calibrate(self, x, y, increment=False): 67 | """Calibrate conformal predictor based on underlying nonconformity 68 | scorer. 69 | 70 | Parameters 71 | ---------- 72 | x : numpy array of shape [n_samples, n_features] 73 | Inputs of examples for calibrating the conformal predictor. 74 | 75 | y : numpy array of shape [n_samples, n_features] 76 | Outputs of examples for calibrating the conformal predictor. 
77 | 78 | increment : boolean 79 | If ``True``, performs an incremental recalibration of the conformal 80 | predictor. The supplied ``x`` and ``y`` are added to the set of 81 | previously existing calibration examples, and the conformal 82 | predictor is then calibrated on both the old and new calibration 83 | examples. 84 | 85 | Returns 86 | ------- 87 | None 88 | """ 89 | self._calibrate_hook(x, y, increment) 90 | self._update_calibration_set(x, y, increment) 91 | 92 | if self.conditional: 93 | category_map = np.array([self.condition((x[i, :], y[i])) 94 | for i in range(y.size)]) 95 | self.categories = np.unique(category_map) 96 | self.cal_scores = defaultdict(partial(np.ndarray, 0)) 97 | 98 | for cond in self.categories: 99 | idx = category_map == cond 100 | cal_scores = self.nc_function.score(self.cal_x[idx, :], 101 | self.cal_y[idx]) 102 | self.cal_scores[cond] = np.sort(cal_scores,0)[::-1] 103 | else: 104 | self.categories = np.array([0]) 105 | cal_scores = self.nc_function.score(self.cal_x, self.cal_y) 106 | self.cal_scores = {0: np.sort(cal_scores,0)[::-1]} 107 | 108 | def _calibrate_hook(self, x, y, increment): 109 | pass 110 | 111 | def _update_calibration_set(self, x, y, increment): 112 | if increment and self.cal_x is not None and self.cal_y is not None: 113 | self.cal_x = np.vstack([self.cal_x, x]) 114 | self.cal_y = np.hstack([self.cal_y, y]) 115 | else: 116 | self.cal_x, self.cal_y = x, y 117 | 118 | 119 | # ----------------------------------------------------------------------------- 120 | # Inductive conformal classifier 121 | # ----------------------------------------------------------------------------- 122 | class IcpClassifier(BaseIcp, ClassifierMixin): 123 | """Inductive conformal classifier. 124 | 125 | Parameters 126 | ---------- 127 | nc_function : BaseScorer 128 | Nonconformity scorer object used to calculate nonconformity of 129 | calibration examples and test patterns. Should implement ``fit(x, y)`` 130 | and ``calc_nc(x, y)``. 131 | 132 | smoothing : boolean 133 | Decides whether to use stochastic smoothing of p-values. 134 | 135 | Attributes 136 | ---------- 137 | cal_x : numpy array of shape [n_cal_examples, n_features] 138 | Inputs of calibration set. 139 | 140 | cal_y : numpy array of shape [n_cal_examples] 141 | Outputs of calibration set. 142 | 143 | nc_function : BaseScorer 144 | Nonconformity scorer object used to calculate nonconformity scores. 145 | 146 | classes : numpy array of shape [n_classes] 147 | List of class labels, with indices corresponding to output columns 148 | of IcpClassifier.predict() 149 | 150 | See also 151 | -------- 152 | IcpRegressor 153 | 154 | References 155 | ---------- 156 | .. [1] Papadopoulos, H., & Haralambous, H. (2011). Reliable prediction 157 | intervals with regression neural networks. Neural Networks, 24(8), 158 | 842-851. 
159 | 160 | Examples 161 | -------- 162 | >>> import numpy as np 163 | >>> from sklearn.datasets import load_iris 164 | >>> from sklearn.tree import DecisionTreeClassifier 165 | >>> from nonconformist.base import ClassifierAdapter 166 | >>> from nonconformist.icp import IcpClassifier 167 | >>> from nonconformist.nc import ClassifierNc, MarginErrFunc 168 | >>> iris = load_iris() 169 | >>> idx = np.random.permutation(iris.target.size) 170 | >>> train = idx[:int(idx.size / 3)] 171 | >>> cal = idx[int(idx.size / 3):int(2 * idx.size / 3)] 172 | >>> test = idx[int(2 * idx.size / 3):] 173 | >>> model = ClassifierAdapter(DecisionTreeClassifier()) 174 | >>> nc = ClassifierNc(model, MarginErrFunc()) 175 | >>> icp = IcpClassifier(nc) 176 | >>> icp.fit(iris.data[train, :], iris.target[train]) 177 | >>> icp.calibrate(iris.data[cal, :], iris.target[cal]) 178 | >>> icp.predict(iris.data[test, :], significance=0.10) 179 | ... # doctest: +SKIP 180 | array([[ True, False, False], 181 | [False, True, False], 182 | ..., 183 | [False, True, False], 184 | [False, True, False]], dtype=bool) 185 | """ 186 | def __init__(self, nc_function, condition=None, smoothing=True): 187 | super(IcpClassifier, self).__init__(nc_function, condition) 188 | self.classes = None 189 | self.smoothing = smoothing 190 | 191 | def _calibrate_hook(self, x, y, increment=False): 192 | self._update_classes(y, increment) 193 | 194 | def _update_classes(self, y, increment): 195 | if self.classes is None or not increment: 196 | self.classes = np.unique(y) 197 | else: 198 | self.classes = np.unique(np.hstack([self.classes, y])) 199 | 200 | def predict(self, x, significance=None): 201 | """Predict the output values for a set of input patterns. 202 | 203 | Parameters 204 | ---------- 205 | x : numpy array of shape [n_samples, n_features] 206 | Inputs of patters for which to predict output values. 207 | 208 | significance : float or None 209 | Significance level (maximum allowed error rate) of predictions. 210 | Should be a float between 0 and 1. If ``None``, then the p-values 211 | are output rather than the predictions. 212 | 213 | Returns 214 | ------- 215 | p : numpy array of shape [n_samples, n_classes] 216 | If significance is ``None``, then p contains the p-values for each 217 | sample-class pair; if significance is a float between 0 and 1, then 218 | p is a boolean array denoting which labels are included in the 219 | prediction sets. 220 | """ 221 | # TODO: if x == self.last_x ... 222 | n_test_objects = x.shape[0] 223 | p = np.zeros((n_test_objects, self.classes.size)) 224 | 225 | ncal_ngt_neq = self._get_stats(x) 226 | 227 | for i in range(len(self.classes)): 228 | for j in range(n_test_objects): 229 | p[j, i] = calc_p(ncal_ngt_neq[j, i, 0], 230 | ncal_ngt_neq[j, i, 1], 231 | ncal_ngt_neq[j, i, 2], 232 | self.smoothing) 233 | 234 | if significance is not None: 235 | return p > significance 236 | else: 237 | return p 238 | 239 | def _get_stats(self, x): 240 | n_test_objects = x.shape[0] 241 | ncal_ngt_neq = np.zeros((n_test_objects, self.classes.size, 3)) 242 | for i, c in enumerate(self.classes): 243 | test_class = np.zeros(x.shape[0], dtype=self.classes.dtype) 244 | test_class.fill(c) 245 | 246 | # TODO: maybe calculate p-values using cython or similar 247 | # TODO: interpolated p-values 248 | 249 | # TODO: nc_function.calc_nc should take X * {y1, y2, ... 
,yn} 250 | test_nc_scores = self.nc_function.score(x, test_class) 251 | for j, nc in enumerate(test_nc_scores): 252 | cal_scores = self.cal_scores[self.condition((x[j, :], c))][::-1] 253 | n_cal = cal_scores.size 254 | 255 | idx_left = np.searchsorted(cal_scores, nc, 'left') 256 | idx_right = np.searchsorted(cal_scores, nc, 'right') 257 | 258 | ncal_ngt_neq[j, i, 0] = n_cal 259 | ncal_ngt_neq[j, i, 1] = n_cal - idx_right 260 | ncal_ngt_neq[j, i, 2] = idx_right - idx_left 261 | 262 | return ncal_ngt_neq 263 | 264 | def predict_conf(self, x): 265 | """Predict the output values for a set of input patterns, using 266 | the confidence-and-credibility output scheme. 267 | 268 | Parameters 269 | ---------- 270 | x : numpy array of shape [n_samples, n_features] 271 | Inputs of patters for which to predict output values. 272 | 273 | Returns 274 | ------- 275 | p : numpy array of shape [n_samples, 3] 276 | p contains three columns: the first column contains the most 277 | likely class for each test pattern; the second column contains 278 | the confidence in the predicted class label, and the third column 279 | contains the credibility of the prediction. 280 | """ 281 | p = self.predict(x, significance=None) 282 | label = p.argmax(axis=1) 283 | credibility = p.max(axis=1) 284 | for i, idx in enumerate(label): 285 | p[i, idx] = -np.inf 286 | confidence = 1 - p.max(axis=1) 287 | 288 | return np.array([label, confidence, credibility]).T 289 | 290 | 291 | # ----------------------------------------------------------------------------- 292 | # Inductive conformal regressor 293 | # ----------------------------------------------------------------------------- 294 | class IcpRegressor(BaseIcp, RegressorMixin): 295 | """Inductive conformal regressor. 296 | 297 | Parameters 298 | ---------- 299 | nc_function : BaseScorer 300 | Nonconformity scorer object used to calculate nonconformity of 301 | calibration examples and test patterns. Should implement ``fit(x, y)``, 302 | ``calc_nc(x, y)`` and ``predict(x, nc_scores, significance)``. 303 | 304 | Attributes 305 | ---------- 306 | cal_x : numpy array of shape [n_cal_examples, n_features] 307 | Inputs of calibration set. 308 | 309 | cal_y : numpy array of shape [n_cal_examples] 310 | Outputs of calibration set. 311 | 312 | nc_function : BaseScorer 313 | Nonconformity scorer object used to calculate nonconformity scores. 314 | 315 | See also 316 | -------- 317 | IcpClassifier 318 | 319 | References 320 | ---------- 321 | .. [1] Papadopoulos, H., Proedrou, K., Vovk, V., & Gammerman, A. (2002). 322 | Inductive confidence machines for regression. In Machine Learning: ECML 323 | 2002 (pp. 345-356). Springer Berlin Heidelberg. 324 | 325 | .. [2] Papadopoulos, H., & Haralambous, H. (2011). Reliable prediction 326 | intervals with regression neural networks. Neural Networks, 24(8), 327 | 842-851. 
328 | 329 | Examples 330 | -------- 331 | >>> import numpy as np 332 | >>> from sklearn.datasets import load_boston 333 | >>> from sklearn.tree import DecisionTreeRegressor 334 | >>> from nonconformist.base import RegressorAdapter 335 | >>> from nonconformist.icp import IcpRegressor 336 | >>> from nonconformist.nc import RegressorNc, AbsErrorErrFunc 337 | >>> boston = load_boston() 338 | >>> idx = np.random.permutation(boston.target.size) 339 | >>> train = idx[:int(idx.size / 3)] 340 | >>> cal = idx[int(idx.size / 3):int(2 * idx.size / 3)] 341 | >>> test = idx[int(2 * idx.size / 3):] 342 | >>> model = RegressorAdapter(DecisionTreeRegressor()) 343 | >>> nc = RegressorNc(model, AbsErrorErrFunc()) 344 | >>> icp = IcpRegressor(nc) 345 | >>> icp.fit(boston.data[train, :], boston.target[train]) 346 | >>> icp.calibrate(boston.data[cal, :], boston.target[cal]) 347 | >>> icp.predict(boston.data[test, :], significance=0.10) 348 | ... # doctest: +SKIP 349 | array([[ 5. , 20.6], 350 | [ 15.5, 31.1], 351 | ..., 352 | [ 14.2, 29.8], 353 | [ 11.6, 27.2]]) 354 | """ 355 | def __init__(self, nc_function, condition=None): 356 | super(IcpRegressor, self).__init__(nc_function, condition) 357 | 358 | def predict(self, x, significance=None): 359 | """Predict the output values for a set of input patterns. 360 | 361 | Parameters 362 | ---------- 363 | x : numpy array of shape [n_samples, n_features] 364 | Inputs of patters for which to predict output values. 365 | 366 | significance : float 367 | Significance level (maximum allowed error rate) of predictions. 368 | Should be a float between 0 and 1. If ``None``, then intervals for 369 | all significance levels (0.01, 0.02, ..., 0.99) are output in a 370 | 3d-matrix. 371 | 372 | Returns 373 | ------- 374 | p : numpy array of shape [n_samples, 2] or [n_samples, 2, 99} 375 | If significance is ``None``, then p contains the interval (minimum 376 | and maximum boundaries) for each test pattern, and each significance 377 | level (0.01, 0.02, ..., 0.99). If significance is a float between 378 | 0 and 1, then p contains the prediction intervals (minimum and 379 | maximum boundaries) for the set of test patterns at the chosen 380 | significance level. 
381 | """ 382 | # TODO: interpolated p-values 383 | 384 | n_significance = (99 if significance is None 385 | else np.array(significance).size) 386 | 387 | if n_significance > 1: 388 | prediction = np.zeros((x.shape[0], 2, n_significance)) 389 | else: 390 | prediction = np.zeros((x.shape[0], 2)) 391 | 392 | condition_map = np.array([self.condition((x[i, :], None)) 393 | for i in range(x.shape[0])]) 394 | 395 | for condition in self.categories: 396 | idx = condition_map == condition 397 | if np.sum(idx) > 0: 398 | p = self.nc_function.predict(x[idx, :], 399 | self.cal_scores[condition], 400 | significance) 401 | if n_significance > 1: 402 | prediction[idx, :, :] = p 403 | else: 404 | prediction[idx, :] = p 405 | 406 | return prediction 407 | 408 | 409 | class OobCpClassifier(IcpClassifier): 410 | def __init__(self, 411 | nc_function, 412 | condition=None, 413 | smoothing=True): 414 | super(OobCpClassifier, self).__init__(nc_function, 415 | condition, 416 | smoothing) 417 | 418 | def fit(self, x, y): 419 | super(OobCpClassifier, self).fit(x, y) 420 | super(OobCpClassifier, self).calibrate(x, y, False) 421 | 422 | def calibrate(self, x, y, increment=False): 423 | # Should throw exception (or really not be implemented for oob) 424 | pass 425 | 426 | 427 | class OobCpRegressor(IcpRegressor): 428 | def __init__(self, 429 | nc_function, 430 | condition=None): 431 | super(OobCpRegressor, self).__init__(nc_function, 432 | condition) 433 | 434 | def fit(self, x, y): 435 | super(OobCpRegressor, self).fit(x, y) 436 | super(OobCpRegressor, self).calibrate(x, y, False) 437 | 438 | def calibrate(self, x, y, increment=False): 439 | # Should throw exception (or really not be implemented for oob) 440 | pass 441 | -------------------------------------------------------------------------------- /nonconformist/nc.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ 4 | Nonconformity functions. 5 | """ 6 | 7 | # Authors: Henrik Linusson 8 | # Yaniv Romano modified RegressorNc class to include CQR 9 | 10 | from __future__ import division 11 | 12 | import abc 13 | import numpy as np 14 | import sklearn.base 15 | from nonconformist.base import ClassifierAdapter, RegressorAdapter 16 | from nonconformist.base import OobClassifierAdapter, OobRegressorAdapter 17 | 18 | # ----------------------------------------------------------------------------- 19 | # Error functions 20 | # ----------------------------------------------------------------------------- 21 | 22 | 23 | class ClassificationErrFunc(object): 24 | """Base class for classification model error functions. 25 | """ 26 | 27 | __metaclass__ = abc.ABCMeta 28 | 29 | def __init__(self): 30 | super(ClassificationErrFunc, self).__init__() 31 | 32 | @abc.abstractmethod 33 | def apply(self, prediction, y): 34 | """Apply the nonconformity function. 35 | 36 | Parameters 37 | ---------- 38 | prediction : numpy array of shape [n_samples, n_classes] 39 | Class probability estimates for each sample. 40 | 41 | y : numpy array of shape [n_samples] 42 | True output labels of each sample. 43 | 44 | Returns 45 | ------- 46 | nc : numpy array of shape [n_samples] 47 | Nonconformity scores of the samples. 48 | """ 49 | pass 50 | 51 | 52 | class RegressionErrFunc(object): 53 | """Base class for regression model error functions. 
54 | """ 55 | 56 | __metaclass__ = abc.ABCMeta 57 | 58 | def __init__(self): 59 | super(RegressionErrFunc, self).__init__() 60 | 61 | @abc.abstractmethod 62 | def apply(self, prediction, y):#, norm=None, beta=0): 63 | """Apply the nonconformity function. 64 | 65 | Parameters 66 | ---------- 67 | prediction : numpy array of shape [n_samples, n_classes] 68 | Class probability estimates for each sample. 69 | 70 | y : numpy array of shape [n_samples] 71 | True output labels of each sample. 72 | 73 | Returns 74 | ------- 75 | nc : numpy array of shape [n_samples] 76 | Nonconformity scores of the samples. 77 | """ 78 | pass 79 | 80 | @abc.abstractmethod 81 | def apply_inverse(self, nc, significance):#, norm=None, beta=0): 82 | """Apply the inverse of the nonconformity function (i.e., 83 | calculate prediction interval). 84 | 85 | Parameters 86 | ---------- 87 | nc : numpy array of shape [n_calibration_samples] 88 | Nonconformity scores obtained for conformal predictor. 89 | 90 | significance : float 91 | Significance level (0, 1). 92 | 93 | Returns 94 | ------- 95 | interval : numpy array of shape [n_samples, 2] 96 | Minimum and maximum interval boundaries for each prediction. 97 | """ 98 | pass 99 | 100 | 101 | class InverseProbabilityErrFunc(ClassificationErrFunc): 102 | """Calculates the probability of not predicting the correct class. 103 | 104 | For each correct output in ``y``, nonconformity is defined as 105 | 106 | .. math:: 107 | 1 - \hat{P}(y_i | x) \, . 108 | """ 109 | 110 | def __init__(self): 111 | super(InverseProbabilityErrFunc, self).__init__() 112 | 113 | def apply(self, prediction, y): 114 | prob = np.zeros(y.size, dtype=np.float32) 115 | for i, y_ in enumerate(y): 116 | if y_ >= prediction.shape[1]: 117 | prob[i] = 0 118 | else: 119 | prob[i] = prediction[i, int(y_)] 120 | return 1 - prob 121 | 122 | 123 | class MarginErrFunc(ClassificationErrFunc): 124 | """ 125 | Calculates the margin error. 126 | 127 | For each correct output in ``y``, nonconformity is defined as 128 | 129 | .. math:: 130 | 0.5 - \dfrac{\hat{P}(y_i | x) - max_{y \, != \, y_i} \hat{P}(y | x)}{2} 131 | """ 132 | 133 | def __init__(self): 134 | super(MarginErrFunc, self).__init__() 135 | 136 | def apply(self, prediction, y): 137 | prob = np.zeros(y.size, dtype=np.float32) 138 | for i, y_ in enumerate(y): 139 | if y_ >= prediction.shape[1]: 140 | prob[i] = 0 141 | else: 142 | prob[i] = prediction[i, int(y_)] 143 | prediction[i, int(y_)] = -np.inf 144 | return 0.5 - ((prob - prediction.max(axis=1)) / 2) 145 | 146 | 147 | class AbsErrorErrFunc(RegressionErrFunc): 148 | """Calculates absolute error nonconformity for regression problems. 149 | 150 | For each correct output in ``y``, nonconformity is defined as 151 | 152 | .. math:: 153 | | y_i - \hat{y}_i | 154 | """ 155 | 156 | def __init__(self): 157 | super(AbsErrorErrFunc, self).__init__() 158 | 159 | def apply(self, prediction, y): 160 | return np.abs(prediction - y) 161 | 162 | def apply_inverse(self, nc, significance): 163 | nc = np.sort(nc)[::-1] 164 | border = int(np.floor(significance * (nc.size + 1))) - 1 165 | # TODO: should probably warn against too few calibration examples 166 | border = min(max(border, 0), nc.size - 1) 167 | return np.vstack([nc[border], nc[border]]) 168 | 169 | 170 | class SignErrorErrFunc(RegressionErrFunc): 171 | """Calculates signed error nonconformity for regression problems. 172 | 173 | For each correct output in ``y``, nonconformity is defined as 174 | 175 | .. 
math:: 176 | y_i - \hat{y}_i 177 | 178 | References 179 | ---------- 180 | .. [1] Linusson, Henrik, Ulf Johansson, and Tuve Lofstrom. 181 | Signed-error conformal regression. Pacific-Asia Conference on Knowledge 182 | Discovery and Data Mining. Springer International Publishing, 2014. 183 | """ 184 | 185 | def __init__(self): 186 | super(SignErrorErrFunc, self).__init__() 187 | 188 | def apply(self, prediction, y): 189 | return (prediction - y) 190 | 191 | def apply_inverse(self, nc, significance): 192 | 193 | err_high = -nc 194 | err_low = nc 195 | 196 | err_high = np.reshape(err_high, (nc.shape[0],1)) 197 | err_low = np.reshape(err_low, (nc.shape[0],1)) 198 | 199 | nc = np.concatenate((err_low,err_high),1) 200 | 201 | nc = np.sort(nc,0) 202 | index = int(np.ceil((1 - significance / 2) * (nc.shape[0] + 1))) - 1 203 | index = min(max(index, 0), nc.shape[0] - 1) 204 | return np.vstack([nc[index,0], nc[index,1]]) 205 | 206 | # CQR symmetric error function 207 | class QuantileRegErrFunc(RegressionErrFunc): 208 | """Calculates conformalized quantile regression error. 209 | 210 | For each correct output in ``y``, nonconformity is defined as 211 | 212 | .. math:: 213 | max{\hat{q}_low - y, y - \hat{q}_high} 214 | 215 | """ 216 | def __init__(self): 217 | super(QuantileRegErrFunc, self).__init__() 218 | 219 | def apply(self, prediction, y): 220 | y_lower = prediction[:,0] 221 | y_upper = prediction[:,-1] 222 | error_low = y_lower - y 223 | error_high = y - y_upper 224 | err = np.maximum(error_high,error_low) 225 | return err 226 | 227 | def apply_inverse(self, nc, significance): 228 | nc = np.sort(nc,0) 229 | index = int(np.ceil((1 - significance) * (nc.shape[0] + 1))) - 1 230 | index = min(max(index, 0), nc.shape[0] - 1) 231 | return np.vstack([nc[index], nc[index]]) 232 | 233 | # CQR asymmetric error function 234 | class QuantileRegAsymmetricErrFunc(RegressionErrFunc): 235 | """Calculates conformalized quantile regression asymmetric error function. 236 | 237 | For each correct output in ``y``, nonconformity is defined as 238 | 239 | .. 
math:: 240 | E_low = \hat{q}_low - y 241 | E_high = y - \hat{q}_high 242 | 243 | """ 244 | def __init__(self): 245 | super(QuantileRegAsymmetricErrFunc, self).__init__() 246 | 247 | def apply(self, prediction, y): 248 | y_lower = prediction[:,0] 249 | y_upper = prediction[:,-1] 250 | 251 | error_high = y - y_upper 252 | error_low = y_lower - y 253 | 254 | err_high = np.reshape(error_high, (y_upper.shape[0],1)) 255 | err_low = np.reshape(error_low, (y_lower.shape[0],1)) 256 | 257 | return np.concatenate((err_low,err_high),1) 258 | 259 | def apply_inverse(self, nc, significance): 260 | nc = np.sort(nc,0) 261 | index = int(np.ceil((1 - significance / 2) * (nc.shape[0] + 1))) - 1 262 | index = min(max(index, 0), nc.shape[0] - 1) 263 | return np.vstack([nc[index,0], nc[index,1]]) 264 | 265 | # ----------------------------------------------------------------------------- 266 | # Base nonconformity scorer 267 | # ----------------------------------------------------------------------------- 268 | class BaseScorer(sklearn.base.BaseEstimator): 269 | __metaclass__ = abc.ABCMeta 270 | 271 | def __init__(self): 272 | super(BaseScorer, self).__init__() 273 | 274 | @abc.abstractmethod 275 | def fit(self, x, y): 276 | pass 277 | 278 | @abc.abstractmethod 279 | def score(self, x, y=None): 280 | pass 281 | 282 | 283 | class RegressorNormalizer(BaseScorer): 284 | def __init__(self, base_model, normalizer_model, err_func): 285 | super(RegressorNormalizer, self).__init__() 286 | self.base_model = base_model 287 | self.normalizer_model = normalizer_model 288 | self.err_func = err_func 289 | 290 | def fit(self, x, y): 291 | residual_prediction = self.base_model.predict(x) 292 | residual_error = np.abs(self.err_func.apply(residual_prediction, y)) 293 | 294 | ###################################################################### 295 | # Optional: use logarithmic function as in the original implementation 296 | # available in https://github.com/donlnz/nonconformist 297 | # 298 | # CODE: 299 | # residual_error += 0.00001 # Add small term to avoid log(0) 300 | # log_err = np.log(residual_error) 301 | ###################################################################### 302 | 303 | log_err = residual_error 304 | self.normalizer_model.fit(x, log_err) 305 | 306 | def score(self, x, y=None): 307 | 308 | ###################################################################### 309 | # Optional: use logarithmic function as in the original implementation 310 | # available in https://github.com/donlnz/nonconformist 311 | # 312 | # CODE: 313 | # norm = np.exp(self.normalizer_model.predict(x)) 314 | ###################################################################### 315 | 316 | norm = np.abs(self.normalizer_model.predict(x)) 317 | return norm 318 | 319 | 320 | class NcFactory(object): 321 | @staticmethod 322 | def create_nc(model, err_func=None, normalizer_model=None, oob=False): 323 | if normalizer_model is not None: 324 | normalizer_adapter = RegressorAdapter(normalizer_model) 325 | else: 326 | normalizer_adapter = None 327 | 328 | if isinstance(model, sklearn.base.ClassifierMixin): 329 | err_func = MarginErrFunc() if err_func is None else err_func 330 | if oob: 331 | c = sklearn.base.clone(model) 332 | c.fit([[0], [1]], [0, 1]) 333 | if hasattr(c, 'oob_decision_function_'): 334 | adapter = OobClassifierAdapter(model) 335 | else: 336 | raise AttributeError('Cannot use out-of-bag ' 337 | 'calibration with {}'.format( 338 | model.__class__.__name__ 339 | )) 340 | else: 341 | adapter = ClassifierAdapter(model) 342 | 343 | if 
normalizer_adapter is not None: 344 | normalizer = RegressorNormalizer(adapter, 345 | normalizer_adapter, 346 | err_func) 347 | return ClassifierNc(adapter, err_func, normalizer) 348 | else: 349 | return ClassifierNc(adapter, err_func) 350 | 351 | elif isinstance(model, sklearn.base.RegressorMixin): 352 | err_func = AbsErrorErrFunc() if err_func is None else err_func 353 | if oob: 354 | c = sklearn.base.clone(model) 355 | c.fit([[0], [1]], [0, 1]) 356 | if hasattr(c, 'oob_prediction_'): 357 | adapter = OobRegressorAdapter(model) 358 | else: 359 | raise AttributeError('Cannot use out-of-bag ' 360 | 'calibration with {}'.format( 361 | model.__class__.__name__ 362 | )) 363 | else: 364 | adapter = RegressorAdapter(model) 365 | 366 | if normalizer_adapter is not None: 367 | normalizer = RegressorNormalizer(adapter, 368 | normalizer_adapter, 369 | err_func) 370 | return RegressorNc(adapter, err_func, normalizer) 371 | else: 372 | return RegressorNc(adapter, err_func) 373 | 374 | 375 | class BaseModelNc(BaseScorer): 376 | """Base class for nonconformity scorers based on an underlying model. 377 | 378 | Parameters 379 | ---------- 380 | model : ClassifierAdapter or RegressorAdapter 381 | Underlying classification model used for calculating nonconformity 382 | scores. 383 | 384 | err_func : ClassificationErrFunc or RegressionErrFunc 385 | Error function object. 386 | 387 | normalizer : BaseScorer 388 | Normalization model. 389 | 390 | beta : float 391 | Normalization smoothing parameter. As the beta-value increases, 392 | the normalized nonconformity function approaches a non-normalized 393 | equivalent. 394 | """ 395 | def __init__(self, model, err_func, normalizer=None, beta=1e-6): 396 | super(BaseModelNc, self).__init__() 397 | self.err_func = err_func 398 | self.model = model 399 | self.normalizer = normalizer 400 | self.beta = beta 401 | 402 | # If we use sklearn.base.clone (e.g., during cross-validation), 403 | # object references get jumbled, so we need to make sure that the 404 | # normalizer has a reference to the proper model adapter, if applicable. 405 | if (self.normalizer is not None and 406 | hasattr(self.normalizer, 'base_model')): 407 | self.normalizer.base_model = self.model 408 | 409 | self.last_x, self.last_y = None, None 410 | self.last_prediction = None 411 | self.clean = False 412 | 413 | def fit(self, x, y): 414 | """Fits the underlying model of the nonconformity scorer. 415 | 416 | Parameters 417 | ---------- 418 | x : numpy array of shape [n_samples, n_features] 419 | Inputs of examples for fitting the underlying model. 420 | 421 | y : numpy array of shape [n_samples] 422 | Outputs of examples for fitting the underlying model. 423 | 424 | Returns 425 | ------- 426 | None 427 | """ 428 | self.model.fit(x, y) 429 | if self.normalizer is not None: 430 | self.normalizer.fit(x, y) 431 | self.clean = False 432 | 433 | def score(self, x, y=None): 434 | """Calculates the nonconformity score of a set of samples. 435 | 436 | Parameters 437 | ---------- 438 | x : numpy array of shape [n_samples, n_features] 439 | Inputs of examples for which to calculate a nonconformity score. 440 | 441 | y : numpy array of shape [n_samples] 442 | Outputs of examples for which to calculate a nonconformity score. 443 | 444 | Returns 445 | ------- 446 | nc : numpy array of shape [n_samples] 447 | Nonconformity scores of samples. 
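Note that when a ``normalizer`` is supplied, scores computed from one-dimensional predictions are divided by the predicted per-sample difficulty plus ``beta``, whereas scores computed from multi-column predictions (as produced by the CQR error functions) are returned unnormalized.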
448 | """ 449 | prediction = self.model.predict(x) 450 | n_test = x.shape[0] 451 | if self.normalizer is not None: 452 | norm = self.normalizer.score(x) + self.beta 453 | else: 454 | norm = np.ones(n_test) 455 | if prediction.ndim > 1: 456 | ret_val = self.err_func.apply(prediction, y) 457 | else: 458 | ret_val = self.err_func.apply(prediction, y) / norm 459 | return ret_val 460 | 461 | 462 | # ----------------------------------------------------------------------------- 463 | # Classification nonconformity scorers 464 | # ----------------------------------------------------------------------------- 465 | class ClassifierNc(BaseModelNc): 466 | """Nonconformity scorer using an underlying class probability estimating 467 | model. 468 | 469 | Parameters 470 | ---------- 471 | model : ClassifierAdapter 472 | Underlying classification model used for calculating nonconformity 473 | scores. 474 | 475 | err_func : ClassificationErrFunc 476 | Error function object. 477 | 478 | normalizer : BaseScorer 479 | Normalization model. 480 | 481 | beta : float 482 | Normalization smoothing parameter. As the beta-value increases, 483 | the normalized nonconformity function approaches a non-normalized 484 | equivalent. 485 | 486 | Attributes 487 | ---------- 488 | model : ClassifierAdapter 489 | Underlying model object. 490 | 491 | err_func : ClassificationErrFunc 492 | Scorer function used to calculate nonconformity scores. 493 | 494 | See also 495 | -------- 496 | RegressorNc, NormalizedRegressorNc 497 | """ 498 | def __init__(self, 499 | model, 500 | err_func=MarginErrFunc(), 501 | normalizer=None, 502 | beta=1e-6): 503 | super(ClassifierNc, self).__init__(model, 504 | err_func, 505 | normalizer, 506 | beta) 507 | 508 | 509 | # ----------------------------------------------------------------------------- 510 | # Regression nonconformity scorers 511 | # ----------------------------------------------------------------------------- 512 | class RegressorNc(BaseModelNc): 513 | """Nonconformity scorer using an underlying regression model. 514 | 515 | Parameters 516 | ---------- 517 | model : RegressorAdapter 518 | Underlying regression model used for calculating nonconformity scores. 519 | 520 | err_func : RegressionErrFunc 521 | Error function object. 522 | 523 | normalizer : BaseScorer 524 | Normalization model. 525 | 526 | beta : float 527 | Normalization smoothing parameter. As the beta-value increases, 528 | the normalized nonconformity function approaches a non-normalized 529 | equivalent. 530 | 531 | Attributes 532 | ---------- 533 | model : RegressorAdapter 534 | Underlying model object. 535 | 536 | err_func : RegressionErrFunc 537 | Scorer function used to calculate nonconformity scores. 538 | 539 | See also 540 | -------- 541 | ProbEstClassifierNc, NormalizedRegressorNc 542 | """ 543 | def __init__(self, 544 | model, 545 | err_func=AbsErrorErrFunc(), 546 | normalizer=None, 547 | beta=1e-6): 548 | super(RegressorNc, self).__init__(model, 549 | err_func, 550 | normalizer, 551 | beta) 552 | 553 | def predict(self, x, nc, significance=None): 554 | """Constructs prediction intervals for a set of test examples. 555 | 556 | Predicts the output of each test pattern using the underlying model, 557 | and applies the (partial) inverse nonconformity function to each 558 | prediction, resulting in a prediction interval for each test pattern. 559 | 560 | Parameters 561 | ---------- 562 | x : numpy array of shape [n_samples, n_features] 563 | Inputs of patters for which to predict output values. 
564 | 565 | significance : float 566 | Significance level (maximum allowed error rate) of predictions. 567 | Should be a float between 0 and 1. If ``None``, then intervals for 568 | all significance levels (0.01, 0.02, ..., 0.99) are output in a 569 | 3d-matrix. 570 | 571 | Returns 572 | ------- 573 | p : numpy array of shape [n_samples, 2] or [n_samples, 2, 99] 574 | If significance is ``None``, then p contains the interval (minimum 575 | and maximum boundaries) for each test pattern, and each significance 576 | level (0.01, 0.02, ..., 0.99). If significance is a float between 577 | 0 and 1, then p contains the prediction intervals (minimum and 578 | maximum boundaries) for the set of test patterns at the chosen 579 | significance level. 580 | """ 581 | n_test = x.shape[0] 582 | prediction = self.model.predict(x) 583 | if self.normalizer is not None: 584 | norm = self.normalizer.score(x) + self.beta 585 | else: 586 | norm = np.ones(n_test) 587 | 588 | if significance: 589 | intervals = np.zeros((x.shape[0], 2)) 590 | err_dist = self.err_func.apply_inverse(nc, significance) 591 | err_dist = np.hstack([err_dist] * n_test) 592 | if prediction.ndim > 1: # CQR 593 | intervals[:, 0] = prediction[:,0] - err_dist[0, :] 594 | intervals[:, 1] = prediction[:,-1] + err_dist[1, :] 595 | else: # regular conformal prediction 596 | err_dist *= norm 597 | intervals[:, 0] = prediction - err_dist[0, :] 598 | intervals[:, 1] = prediction + err_dist[1, :] 599 | 600 | return intervals 601 | else: # Not tested for CQR 602 | significance = np.arange(0.01, 1.0, 0.01) 603 | intervals = np.zeros((x.shape[0], 2, significance.size)) 604 | 605 | for i, s in enumerate(significance): 606 | err_dist = self.err_func.apply_inverse(nc, s) 607 | err_dist = np.hstack([err_dist] * n_test) 608 | err_dist *= norm 609 | 610 | intervals[:, 0, i] = prediction - err_dist[0, :] 611 | intervals[:, 1, i] = prediction + err_dist[0, :] 612 | 613 | return intervals 614 | -------------------------------------------------------------------------------- /nonconformist/util.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | import numpy as np 3 | 4 | def calc_p(ncal, ngt, neq, smoothing=False): 5 | if smoothing: 6 | return (ngt + (neq + 1) * np.random.uniform(0, 1)) / (ncal + 1) 7 | else: 8 | return (ngt + neq + 1) / (ncal + 1) 9 | -------------------------------------------------------------------------------- /poster/CQR_Poster.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yromano/cqr/73267abb7ed7d3c6dad6ab4449154db7ec306535/poster/CQR_Poster.pdf -------------------------------------------------------------------------------- /reproducible_experiments/all_cqr_experiments.py: -------------------------------------------------------------------------------- 1 | 2 | ############################################################################### 3 | # Script for reproducing the results of CQR paper 4 | ############################################################################### 5 | 6 | import numpy as np 7 | from reproducible_experiments.run_cqr_experiment import run_experiment 8 | #from run_cqr_experiment import run_experiment 9 | 10 | 11 | # list methods to test 12 | test_methods = ['linear', 13 | 'neural_net', 14 | 'random_forest', 15 | 'quantile_net', 16 | 'cqr_quantile_net', 17 | 'cqr_asymmetric_quantile_net', 18 | 'rearrangement', 19 | 'cqr_rearrangement', 20 | 
'cqr_asymmetric_rearrangement', 21 | 'quantile_forest', 22 | 'cqr_quantile_forest', 23 | 'cqr_asymmetric_quantile_forest'] 24 | 25 | # list of datasets 26 | dataset_names = ['meps_19', 27 | 'meps_20', 28 | 'meps_21', 29 | 'star', 30 | 'facebook_1', 31 | 'facebook_2', 32 | 'bio', 33 | 'blog_data', 34 | 'concrete', 35 | 'bike', 36 | 'community'] 37 | 38 | # vector of random seeds 39 | random_state_train_test = np.arange(20) 40 | 41 | for test_method_id in range(12): 42 | for dataset_name_id in range(11): 43 | for random_state_train_test_id in range(20): 44 | dataset_name = dataset_names[dataset_name_id] 45 | test_method = test_methods[test_method_id] 46 | random_state = random_state_train_test[random_state_train_test_id] 47 | 48 | # run an experiment and save average results to CSV file 49 | run_experiment(dataset_name, test_method, random_state) 50 | -------------------------------------------------------------------------------- /reproducible_experiments/all_equalized_coverage_experiments.py: -------------------------------------------------------------------------------- 1 | ############################################################################### 2 | # Script for reproducing the results of CQR paper 3 | ############################################################################### 4 | 5 | import numpy as np 6 | from reproducible_experiments.run_equalized_coverage_experiment import run_equalized_coverage_experiment 7 | #from run_equalized_coverage_experiment import run_equalized_coverage_experiment 8 | 9 | # list methods to test 10 | test_methods = ['net', 11 | 'qnet'] 12 | 13 | dataset_names = ["meps_21"] 14 | 15 | test_ratio_vec = [0.2] 16 | 17 | # vector of random seeds 18 | random_state_train_test = np.arange(40) 19 | 20 | for test_method_id in range(2): 21 | for random_state_train_test_id in range(40): 22 | for dataset_name_id in range(1): 23 | for test_ratio_id in range(1): 24 | test_ratio = test_ratio_vec[test_ratio_id] 25 | test_method = test_methods[test_method_id] 26 | random_state = random_state_train_test[random_state_train_test_id] 27 | dataset_name = dataset_names[dataset_name_id] 28 | 29 | # run an experiment and save average results to CSV file 30 | run_equalized_coverage_experiment(dataset_name, 31 | test_method, 32 | random_state, 33 | True, 34 | test_ratio) 35 | --------------------------------------------------------------------------------
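To make the wiring of the classes above concrete, here is a minimal end-to-end sketch of conformalized quantile regression built from the components in `nonconformist/icp.py` and `nonconformist/nc.py`. The `TwoQuantileAdapter` wrapper and the synthetic data are illustrative assumptions (they are not part of the package); the adapter only needs to expose `fit(x, y)` and a `predict(x)` that returns a two-column array of lower/upper quantile estimates, which is the shape `QuantileRegErrFunc` and the CQR branch of `RegressorNc.predict` operate on. In practice, use the adapters and example notebooks shipped with this repository instead.

```python
# Minimal CQR sketch (illustrative only; TwoQuantileAdapter is a hypothetical
# stand-in for the package's own quantile-model adapters).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

from nonconformist.icp import IcpRegressor
from nonconformist.nc import RegressorNc, QuantileRegErrFunc


class TwoQuantileAdapter(object):
    """Fits one quantile regressor per bound; predict() returns [lower, upper]."""
    def __init__(self, quantiles=(0.05, 0.95)):
        self.models = [GradientBoostingRegressor(loss='quantile', alpha=q)
                       for q in quantiles]

    def fit(self, x, y):
        for model in self.models:
            model.fit(x, y)

    def predict(self, x):
        # column 0: lower quantile estimate, column 1: upper quantile estimate
        return np.column_stack([model.predict(x) for model in self.models])


# Toy heteroscedastic data, split into proper training, calibration and test sets.
rng = np.random.RandomState(0)
x = rng.uniform(0, 5, size=(2000, 1))
y = x[:, 0] + (0.5 + x[:, 0]) * rng.randn(2000)
train, cal, test = np.split(rng.permutation(2000), [1000, 1500])

icp = IcpRegressor(RegressorNc(TwoQuantileAdapter(), QuantileRegErrFunc()))
icp.fit(x[train], y[train])        # fit the underlying quantile model
icp.calibrate(x[cal], y[cal])      # compute calibration nonconformity scores
intervals = icp.predict(x[test], significance=0.1)  # [n_test, 2] lower/upper bounds

coverage = np.mean((y[test] >= intervals[:, 0]) & (y[test] <= intervals[:, 1]))
print('empirical coverage: %.3f' % coverage)
```

Swapping `QuantileRegErrFunc` for `QuantileRegAsymmetricErrFunc` conformalizes the two interval end points separately while keeping the same `fit` / `calibrate` / `predict` flow.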