├── .DS_Store ├── .gitignore ├── LICENSE ├── README.md ├── cqr ├── .DS_Store ├── __init__.py ├── helper.py ├── torch_models.py └── tune_params_cv.py ├── cqr_real_data_example.ipynb ├── cqr_synthetic_data_example_1.ipynb ├── cqr_synthetic_data_example_2.ipynb ├── datasets ├── .DS_Store ├── CASP.csv ├── Concrete_Data.csv ├── README.md ├── STAR.csv ├── bike_train.csv ├── communities.data ├── communities_attributes.csv ├── datasets.py └── facebook │ └── README.md ├── detect_prediction_bias_example.ipynb ├── equalized_coverage_example.ipynb ├── get_meps_data ├── README.md ├── base_dataset.py ├── download_data.R ├── main_clean_and_save_to_csv.py ├── meps_dataset_panel19_fy2015_reg.py ├── meps_dataset_panel20_fy2015_reg.py ├── meps_dataset_panel21_fy2016_reg.py ├── regression_dataset.py ├── save_dataset.py └── structured_dataset.py ├── nonconformist ├── .DS_Store ├── __init__.py ├── acp.py ├── base.py ├── cp.py ├── evaluation.py ├── icp.py ├── nc.py └── util.py ├── poster └── CQR_Poster.pdf └── reproducible_experiments ├── all_cqr_experiments.py ├── all_equalized_coverage_experiments.py ├── run_cqr_experiment.py └── run_equalized_coverage_experiment.py /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yromano/cqr/73267abb7ed7d3c6dad6ab4449154db7ec306535/.DS_Store -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Byte-compiled / optimized / DLL files 2 | __pycache__/ 3 | *.py[cod] 4 | .ipynb_checkpoints/ 5 | .DS_Store 6 | 7 | # C extensions 8 | *.so 9 | 10 | # Distribution / packaging 11 | .Python 12 | env/ 13 | build/ 14 | develop-eggs/ 15 | dist/ 16 | downloads/ 17 | eggs/ 18 | lib/ 19 | lib64/ 20 | parts/ 21 | sdist/ 22 | var/ 23 | *.egg-info/ 24 | .installed.cfg 25 | *.egg 26 | 27 | # PyInstaller 28 | # Usually these files are written by a python script from a template 29 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 30 | *.manifest 31 | *.spec 32 | 33 | # Installer logs 34 | pip-log.txt 35 | pip-delete-this-directory.txt 36 | 37 | # Unit test / coverage reports 38 | htmlcov/ 39 | .tox/ 40 | .coverage 41 | .cache 42 | nosetests.xml 43 | coverage.xml 44 | 45 | # Translations 46 | *.mo 47 | *.pot 48 | 49 | # Django stuff: 50 | *.log 51 | 52 | # Sphinx documentation 53 | docs/_build/ 54 | 55 | # PyBuilder 56 | target/ 57 | 58 | # PyCharm 59 | .idea -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | nonconformist package: 4 | Copyright (c) 2015 Henrik Linusson 5 | 6 | Other extensions: 7 | Copyright (c) 2019 Yaniv Romano 8 | 9 | Permission is hereby granted, free of charge, to any person obtaining a copy 10 | of this software and associated documentation files (the "Software"), to deal 11 | in the Software without restriction, including without limitation the rights 12 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 13 | copies of the Software, and to permit persons to whom the Software is 14 | furnished to do so, subject to the following conditions: 15 | 16 | The above copyright notice and this permission notice shall be included in all 17 | copies or substantial portions of the Software. 
18 | 19 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 20 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 21 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 22 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 23 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 24 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 25 | SOFTWARE. 26 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Reliable Predictive Inference 2 | 3 | An important factor in guaranteeing responsible use of data-driven recommendation systems is the ability to communicate their uncertainty to decision makers. This can be accomplished by constructing prediction intervals, which provide an intuitive measure of the limits of predictive performance. 4 | 5 | This package contains a Python implementation of the Conformalized Quantile Regression (CQR) [1] methodology for constructing marginal distribution-free prediction intervals. It also implements the equalized coverage framework [2], which builds valid group-conditional prediction intervals. 6 | 7 | # Conformalized Quantile Regression [1] 8 | 9 | CQR is a technique for constructing prediction intervals that attain valid coverage in finite samples, without making distributional assumptions. It combines the statistical efficiency of quantile regression with the distribution-free coverage guarantee of conformal prediction. On one hand, CQR is flexible in that it can wrap around any algorithm for quantile regression, including random forests and deep neural networks. On the other hand, a key strength of CQR is its rigorous control of the miscoverage rate, independent of the underlying regression algorithm. 10 | 11 | [1] Yaniv Romano, Evan Patterson, and Emmanuel J. Candès, [“Conformalized quantile regression.”](https://arxiv.org/abs/1905.03222) 2019. 12 | 13 | # Equalized Coverage [2] 14 | 15 | To support equitable treatment, the equalized coverage methodology forces the construction of the prediction intervals to be unbiased in the sense that their coverage must be equal across all protected groups of interest. Similar to CQR and conformal inference, equalized coverage offers rigorous distribution-free guarantees that hold in finite samples. This methodology can also be viewed as a wrapper around any predictive algorithm. 16 | 17 | [2] Y. Romano, R. F. Barber, C. Sabatti and E. J. Candès, [“With malice towards none: Assessing uncertainty via equalized coverage.”](https://statweb.stanford.edu/~candes/papers/EqualizedCoverage.pdf) 2019. 18 | 19 | ## Getting Started 20 | 21 | This package is self-contained and implemented in Python. 22 | 23 | Part of the code is taken from the nonconformist package available at https://github.com/donlnz/nonconformist. One may refer to the nonconformist repository to view other applications of conformal prediction. 24 | 25 | ### Prerequisites 26 | 27 | * python 28 | * numpy 29 | * scipy 30 | * scikit-learn 31 | * scikit-garden 32 | * pytorch 33 | * pandas 34 | 35 | ### Installing 36 | 37 | The development version is available here on GitHub: 38 | ```bash 39 | git clone https://github.com/yromano/cqr.git 40 | ``` 41 | 42 | ## Usage 43 | 44 | ### CQR 45 | 46 | Please refer to [cqr_real_data_example.ipynb](cqr_real_data_example.ipynb) for basic usage.
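For orientation, the following is a minimal, self-contained sketch of that basic flow, mirroring the real-data notebook: a quantile random forests learner (via `scikit-garden`) is wrapped by the split-conformal machinery and evaluated on held-out data. The synthetic data and all hyper-parameter values below are illustrative placeholders only.

```python
import numpy as np

from cqr import helper
from nonconformist.nc import RegressorNc, QuantileRegErrFunc

# illustrative synthetic data; replace with your own features and labels
np.random.seed(0)
n, p = 2000, 10
X = np.random.randn(n, p)
y = X[:, 0] + 0.5 * np.random.randn(n)
x_train, x_test, y_train, y_test = X[:1500], X[1500:], y[:1500], y[1500:]

# split the training data into a proper training set and a calibration set
idx = np.random.permutation(len(y_train))
idx_train, idx_cal = idx[:750], idx[750:]

alpha = 0.1  # target miscoverage rate

# quantile random forests base learner (quantile levels on a 0-100 scale)
params_qforest = {"n_estimators": 1000, "min_samples_leaf": 1,
                  "max_features": p, "CV": False, "random_state": 0}
quantile_estimator = helper.QuantileForestRegressorAdapter(
    model=None, fit_params=None, quantiles=[5, 95], params=params_qforest)

# conformalize: fit on the proper training set, calibrate, then predict
nc = RegressorNc(quantile_estimator, QuantileRegErrFunc())
y_lower, y_upper = helper.run_icp(nc, x_train, y_train, x_test,
                                  idx_train, idx_cal, alpha)

# average coverage and interval length on the test set
helper.compute_coverage(y_test, y_lower, y_upper, alpha, "CQR random forests")
```

Any other quantile regression adapter provided by this package (for example, the neural network adapter `AllQNet_RegressorAdapter`) can be dropped in place of `QuantileForestRegressorAdapter`.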
Comparisons to competitive methods and additional usage examples of this package can be found in [cqr_synthetic_data_example_1.ipynb](cqr_synthetic_data_example_1.ipynb) and [cqr_synthetic_data_example_2.ipynb](cqr_synthetic_data_example_2.ipynb). 47 | 48 | ### Equalized Coverage 49 | 50 | The notebook [detect_prediction_bias_example.ipynb](detect_prediction_bias_example.ipynb) performs a simple data analysis of the MEPS panel 21 data set and detects bias in the predictions. The notebook [equalized_coverage_example.ipynb](equalized_coverage_example.ipynb) illustrates how to run the methods proposed in [2] and construct prediction intervals with equal coverage across groups. A minimal sketch of this group-conditional usage appears at the end of this README. 51 | 52 | ## Reproducible Research 53 | 54 | The code available under /reproducible_experiments/ in the repository replicates the experimental results in [1] and [2]. 55 | 56 | ### Publicly Available Datasets 57 | 58 | * [Blog](https://archive.ics.uci.edu/ml/datasets/BlogFeedback): BlogFeedback data set. 59 | 60 | * [Bio](https://archive.ics.uci.edu/ml/datasets/Physicochemical+Properties+of+Protein+Tertiary+Structure): Physicochemical properties of protein tertiary structure data set. 61 | 62 | * [Bike](https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset): Bike sharing data set. 63 | 64 | * [Community](http://archive.ics.uci.edu/ml/datasets/communities+and+crime): Communities and crime data set. 65 | 66 | * [STAR](https://www.rdocumentation.org/packages/AER/versions/1.2-6/topics/STAR): C.M. Achilles, Helen Pate Bain, Fred Bellott, Jayne Boyd-Zaharias, Jeremy Finn, John Folger, John Johnston, and Elizabeth Word. Tennessee’s Student Teacher Achievement Ratio (STAR) project, 2008. 67 | 68 | * [Concrete](http://archive.ics.uci.edu/ml/datasets/concrete+compressive+strength): Concrete compressive strength data set. 69 | 70 | * [Facebook Variant 1 and Variant 2](https://archive.ics.uci.edu/ml/datasets/Facebook+Comment+Volume+Dataset): Facebook comment volume data set. 71 | 72 | ### Data subject to copyright/usage rules 73 | 74 | The Medical Expenditure Panel Survey (MEPS) data can be downloaded using the code in the folder /get_meps_data/ under this repository. It is based on [this explanation](https://github.com/yromano/cqr/blob/master/get_meps_data/README.md) (code provided by [IBM's AIF360](https://github.com/IBM/AIF360)). 75 | 76 | * [MEPS_19](https://meps.ahrq.gov/mepsweb/data_stats/download_data_files_detail.jsp?cboPufNumber=HC-181): Medical expenditure panel survey, panel 19. 77 | 78 | * [MEPS_20](https://meps.ahrq.gov/mepsweb/data_stats/download_data_files_detail.jsp?cboPufNumber=HC-181): Medical expenditure panel survey, panel 20. 79 | 80 | * [MEPS_21](https://meps.ahrq.gov/mepsweb/data_stats/download_data_files_detail.jsp?cboPufNumber=HC-192): Medical expenditure panel survey, panel 21. 81 | 82 | ## License 83 | 84 | This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
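As noted in the Equalized Coverage section above, `helper.run_icp` also accepts a `condition` argument that assigns each example to a group, and `helper.compute_coverage_per_sample` then reports coverage and interval length per group. The sketch below only illustrates this calling convention; the synthetic data, the use of column 0 as a binary protected attribute, and all parameter values are assumptions for illustration.

```python
import numpy as np

from cqr import helper
from nonconformist.nc import RegressorNc, QuantileRegErrFunc

# illustrative data in which column 0 plays the role of a binary protected attribute
np.random.seed(0)
n, p = 2000, 10
X = np.random.randn(n, p)
X[:, 0] = (X[:, 0] > 0).astype(float)
y = X[:, 1] + X[:, 0] + 0.5 * np.random.randn(n)
x_train, x_test, y_train, y_test = X[:1500], X[1500:], y[:1500], y[1500:]

idx = np.random.permutation(len(y_train))
idx_train, idx_cal = idx[:750], idx[750:]
alpha = 0.1

# condition receives a (features, label) pair and returns an integer group id;
# at test time the label entry may be None, so only the features are used here
condition = lambda z: int(z[0][0] > 0)

params_qforest = {"n_estimators": 1000, "min_samples_leaf": 1,
                  "max_features": p, "CV": False, "random_state": 0}
quantile_estimator = helper.QuantileForestRegressorAdapter(
    model=None, fit_params=None, quantiles=[5, 95], params=params_qforest)

# calibrate the conformal step within each group by passing `condition`
nc = RegressorNc(quantile_estimator, QuantileRegErrFunc())
y_lower, y_upper = helper.run_icp(nc, x_train, y_train, x_test,
                                  idx_train, idx_cal, alpha,
                                  condition=condition)

# report coverage and average interval length separately for each group
helper.compute_coverage_per_sample(y_test, y_lower, y_upper, alpha,
                                   "CQR random forests", x_test, condition)
```

To train a separate regressor per group rather than a single shared one, `helper.run_icp_sep` follows the same pattern but expects one nonconformity object per group.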
85 | -------------------------------------------------------------------------------- /cqr/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yromano/cqr/73267abb7ed7d3c6dad6ab4449154db7ec306535/cqr/.DS_Store -------------------------------------------------------------------------------- /cqr/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | -------------------------------------------------------------------------------- /cqr/helper.py: -------------------------------------------------------------------------------- 1 | 2 | import sys 3 | import torch 4 | import numpy as np 5 | from cqr import torch_models 6 | from functools import partial 7 | from cqr import tune_params_cv 8 | from nonconformist.cp import IcpRegressor 9 | from nonconformist.base import RegressorAdapter 10 | from skgarden import RandomForestQuantileRegressor 11 | 12 | if torch.cuda.is_available(): 13 | device = "cuda:0" 14 | else: 15 | device = "cpu" 16 | 17 | 18 | def compute_coverage_len(y_test, y_lower, y_upper): 19 | """ Compute average coverage and length of prediction intervals 20 | 21 | Parameters 22 | ---------- 23 | 24 | y_test : numpy array, true labels (n) 25 | y_lower : numpy array, estimated lower bound for the labels (n) 26 | y_upper : numpy array, estimated upper bound for the labels (n) 27 | 28 | Returns 29 | ------- 30 | 31 | coverage : float, average coverage 32 | avg_length : float, average length 33 | 34 | """ 35 | in_the_range = np.sum((y_test >= y_lower) & (y_test <= y_upper)) 36 | coverage = in_the_range / len(y_test) * 100 37 | avg_length = np.mean(abs(y_upper - y_lower)) 38 | return coverage, avg_length 39 | 40 | def run_icp(nc, X_train, y_train, X_test, idx_train, idx_cal, significance, condition=None): 41 | """ Run split conformal method 42 | 43 | Parameters 44 | ---------- 45 | 46 | nc : class of nonconformist object 47 | X_train : numpy array, training features (n1Xp) 48 | y_train : numpy array, training labels (n1) 49 | X_test : numpy array, testing features (n2Xp) 50 | idx_train : numpy array, indices of proper training set examples 51 | idx_cal : numpy array, indices of calibration set examples 52 | significance : float, significance level (e.g. 
0.1) 53 | condition : function, mapping feature vector to group id 54 | 55 | Returns 56 | ------- 57 | 58 | y_lower : numpy array, estimated lower bound for the labels (n2) 59 | y_upper : numpy array, estimated upper bound for the labels (n2) 60 | 61 | """ 62 | icp = IcpRegressor(nc,condition=condition) 63 | 64 | # Fit the ICP using the proper training set 65 | icp.fit(X_train[idx_train,:], y_train[idx_train]) 66 | 67 | # Calibrate the ICP using the calibration set 68 | icp.calibrate(X_train[idx_cal,:], y_train[idx_cal]) 69 | 70 | # Produce predictions for the test set, with confidence 90% 71 | predictions = icp.predict(X_test, significance=significance) 72 | 73 | y_lower = predictions[:,0] 74 | y_upper = predictions[:,1] 75 | 76 | return y_lower, y_upper 77 | 78 | 79 | def run_icp_sep(nc, X_train, y_train, X_test, idx_train, idx_cal, significance, condition): 80 | """ Run split conformal method, train a seperate regressor for each group 81 | 82 | Parameters 83 | ---------- 84 | 85 | nc : class of nonconformist object 86 | X_train : numpy array, training features (n1Xp) 87 | y_train : numpy array, training labels (n1) 88 | X_test : numpy array, testing features (n2Xp) 89 | idx_train : numpy array, indices of proper training set examples 90 | idx_cal : numpy array, indices of calibration set examples 91 | significance : float, significance level (e.g. 0.1) 92 | condition : function, mapping a feature vector to group id 93 | 94 | Returns 95 | ------- 96 | 97 | y_lower : numpy array, estimated lower bound for the labels (n2) 98 | y_upper : numpy array, estimated upper bound for the labels (n2) 99 | 100 | """ 101 | 102 | X_proper_train = X_train[idx_train,:] 103 | y_proper_train = y_train[idx_train] 104 | X_calibration = X_train[idx_cal,:] 105 | y_calibration = y_train[idx_cal] 106 | 107 | category_map_proper_train = np.array([condition((X_proper_train[i, :], y_proper_train[i])) for i in range(y_proper_train.size)]) 108 | category_map_calibration = np.array([condition((X_calibration[i, :], y_calibration[i])) for i in range(y_calibration.size)]) 109 | category_map_test = np.array([condition((X_test[i, :], None)) for i in range(X_test.shape[0])]) 110 | 111 | categories = np.unique(category_map_proper_train) 112 | 113 | y_lower = np.zeros(X_test.shape[0]) 114 | y_upper = np.zeros(X_test.shape[0]) 115 | 116 | cnt = 0 117 | 118 | for cond in categories: 119 | 120 | icp = IcpRegressor(nc[cnt]) 121 | 122 | idx_proper_train_group = category_map_proper_train == cond 123 | # Fit the ICP using the proper training set 124 | icp.fit(X_proper_train[idx_proper_train_group,:], y_proper_train[idx_proper_train_group]) 125 | 126 | idx_calibration_group = category_map_calibration == cond 127 | # Calibrate the ICP using the calibration set 128 | icp.calibrate(X_calibration[idx_calibration_group,:], y_calibration[idx_calibration_group]) 129 | 130 | idx_test_group = category_map_test == cond 131 | # Produce predictions for the test set, with confidence 90% 132 | predictions = icp.predict(X_test[idx_test_group,:], significance=significance) 133 | 134 | y_lower[idx_test_group] = predictions[:,0] 135 | y_upper[idx_test_group] = predictions[:,1] 136 | 137 | cnt = cnt + 1 138 | 139 | return y_lower, y_upper 140 | 141 | def compute_coverage(y_test,y_lower,y_upper,significance,name=""): 142 | """ Compute average coverage and length, and print results 143 | 144 | Parameters 145 | ---------- 146 | 147 | y_test : numpy array, true labels (n) 148 | y_lower : numpy array, estimated lower bound for the labels (n) 149 | y_upper : 
numpy array, estimated upper bound for the labels (n) 150 | significance : float, desired significance level 151 | name : string, optional output string (e.g. the method name) 152 | 153 | Returns 154 | ------- 155 | 156 | coverage : float, average coverage 157 | avg_length : float, average length 158 | 159 | """ 160 | in_the_range = np.sum((y_test >= y_lower) & (y_test <= y_upper)) 161 | coverage = in_the_range / len(y_test) * 100 162 | print("%s: Percentage in the range (expecting %.2f): %f" % (name, 100 - significance*100, coverage)) 163 | sys.stdout.flush() 164 | 165 | avg_length = abs(np.mean(y_lower - y_upper)) 166 | print("%s: Average length: %f" % (name, avg_length)) 167 | sys.stdout.flush() 168 | return coverage, avg_length 169 | 170 | def compute_coverage_per_sample(y_test,y_lower,y_upper,significance,name="",x_test=None,condition=None): 171 | """ Compute average coverage and length, and print results 172 | 173 | Parameters 174 | ---------- 175 | 176 | y_test : numpy array, true labels (n) 177 | y_lower : numpy array, estimated lower bound for the labels (n) 178 | y_upper : numpy array, estimated upper bound for the labels (n) 179 | significance : float, desired significance level 180 | name : string, optional output string (e.g. the method name) 181 | x_test : numpy array, test features 182 | condition : function, mapping a feature vector to group id 183 | 184 | Returns 185 | ------- 186 | 187 | coverage : float, average coverage 188 | avg_length : float, average length 189 | 190 | """ 191 | 192 | if condition is not None: 193 | 194 | category_map = np.array([condition((x_test[i, :], y_test[i])) for i in range(y_test.size)]) 195 | categories = np.unique(category_map) 196 | 197 | coverage = np.empty(len(categories), dtype=np.object) 198 | length = np.empty(len(categories), dtype=np.object) 199 | 200 | cnt = 0 201 | 202 | for cond in categories: 203 | 204 | idx = category_map == cond 205 | 206 | coverage[cnt] = (y_test[idx] >= y_lower[idx]) & (y_test[idx] <= y_upper[idx]) 207 | 208 | coverage_avg = np.sum( coverage[cnt] ) / len(y_test[idx]) * 100 209 | print("%s: Group %d : Percentage in the range (expecting %.2f): %f" % (name, cond, 100 - significance*100, coverage_avg)) 210 | sys.stdout.flush() 211 | 212 | length[cnt] = abs(y_upper[idx] - y_lower[idx]) 213 | print("%s: Group %d : Average length: %f" % (name, cond, np.mean(length[cnt]))) 214 | sys.stdout.flush() 215 | cnt = cnt + 1 216 | 217 | else: 218 | 219 | coverage = (y_test >= y_lower) & (y_test <= y_upper) 220 | coverage_avg = np.sum(coverage) / len(y_test) * 100 221 | print("%s: Percentage in the range (expecting %.2f): %f" % (name, 100 - significance*100, coverage_avg)) 222 | sys.stdout.flush() 223 | 224 | length = abs(y_upper - y_lower) 225 | print("%s: Average length: %f" % (name, np.mean(length))) 226 | sys.stdout.flush() 227 | 228 | return coverage, length 229 | 230 | 231 | def plot_func_data(y_test,y_lower,y_upper,name=""): 232 | """ Plot the test labels along with the constructed prediction band 233 | 234 | Parameters 235 | ---------- 236 | 237 | y_test : numpy array, true labels (n) 238 | y_lower : numpy array, estimated lower bound for the labels (n) 239 | y_upper : numpy array, estimated upper bound for the labels (n) 240 | name : string, optional output string (e.g. 
the method name) 241 | 242 | """ 243 | 244 | # allowed to import graphics 245 | import matplotlib.pyplot as plt 246 | 247 | interval = y_upper - y_lower 248 | sort_ind = np.argsort(interval) 249 | y_test_sorted = y_test[sort_ind] 250 | upper_sorted = y_upper[sort_ind] 251 | lower_sorted = y_lower[sort_ind] 252 | mean = (upper_sorted + lower_sorted) / 2 253 | 254 | # Center such that the mean of the prediction interval is at 0.0 255 | y_test_sorted -= mean 256 | upper_sorted -= mean 257 | lower_sorted -= mean 258 | 259 | plt.plot(y_test_sorted, "ro") 260 | plt.fill_between( 261 | np.arange(len(upper_sorted)), lower_sorted, upper_sorted, alpha=0.2, color="r", 262 | label="Pred. interval") 263 | plt.xlabel("Ordered samples") 264 | plt.ylabel("Values and prediction intervals") 265 | 266 | plt.title(name) 267 | plt.show() 268 | 269 | interval = y_upper - y_lower 270 | sort_ind = np.argsort(y_test) 271 | y_test_sorted = y_test[sort_ind] 272 | upper_sorted = y_upper[sort_ind] 273 | lower_sorted = y_lower[sort_ind] 274 | 275 | plt.plot(y_test_sorted, "ro") 276 | plt.fill_between( 277 | np.arange(len(upper_sorted)), lower_sorted, upper_sorted, alpha=0.2, color="r", 278 | label="Pred. interval") 279 | plt.xlabel("Ordered samples by response") 280 | plt.ylabel("Values and prediction intervals") 281 | 282 | plt.title(name) 283 | plt.show() 284 | 285 | ############################################################################### 286 | # Deep conditional mean regression 287 | # Minimizing MSE loss 288 | ############################################################################### 289 | 290 | class MSENet_RegressorAdapter(RegressorAdapter): 291 | """ Conditional mean estimator, formulated as neural net 292 | """ 293 | def __init__(self, 294 | model, 295 | fit_params=None, 296 | in_shape=1, 297 | hidden_size=1, 298 | learn_func=torch.optim.Adam, 299 | epochs=1000, 300 | batch_size=10, 301 | dropout=0.1, 302 | lr=0.01, 303 | wd=1e-6, 304 | test_ratio=0.2, 305 | random_state=0): 306 | 307 | """ Initialization 308 | 309 | Parameters 310 | ---------- 311 | model : unused parameter (for compatibility with nc class) 312 | fit_params : unused parameter (for compatibility with nc class) 313 | in_shape : integer, input signal dimension 314 | hidden_size : integer, hidden layer dimension 315 | learn_func : class of Pytorch's SGD optimizer 316 | epochs : integer, maximal number of epochs 317 | batch_size : integer, mini-batch size for SGD 318 | dropout : float, dropout rate 319 | lr : float, learning rate for SGD 320 | wd : float, weight decay 321 | test_ratio : float, ratio of held-out data, used in cross-validation 322 | random_state : integer, seed for splitting the data in cross-validation 323 | 324 | """ 325 | super(MSENet_RegressorAdapter, self).__init__(model, fit_params) 326 | # Instantiate model 327 | self.epochs = epochs 328 | self.batch_size = batch_size 329 | self.dropout = dropout 330 | self.lr = lr 331 | self.wd = wd 332 | self.test_ratio = test_ratio 333 | self.random_state = random_state 334 | self.model = torch_models.mse_model(in_shape=in_shape, hidden_size=hidden_size, dropout=dropout) 335 | self.loss_func = torch.nn.MSELoss() 336 | self.learner = torch_models.LearnerOptimized(self.model, 337 | partial(learn_func, lr=lr, weight_decay=wd), 338 | self.loss_func, 339 | device=device, 340 | test_ratio=self.test_ratio, 341 | random_state=self.random_state) 342 | 343 | def fit(self, x, y): 344 | """ Fit the model to data 345 | 346 | Parameters 347 | ---------- 348 | 349 | x : numpy array of training 
features (nXp) 350 | y : numpy array of training labels (n) 351 | 352 | """ 353 | self.learner.fit(x, y, self.epochs, batch_size=self.batch_size) 354 | 355 | def predict(self, x): 356 | """ Estimate the label given the features 357 | 358 | Parameters 359 | ---------- 360 | x : numpy array of training features (nXp) 361 | 362 | Returns 363 | ------- 364 | ret_val : numpy array of predicted labels (n) 365 | 366 | """ 367 | return self.learner.predict(x) 368 | 369 | ############################################################################### 370 | # Deep neural network for conditional quantile regression 371 | # Minimizing pinball loss 372 | ############################################################################### 373 | 374 | class AllQNet_RegressorAdapter(RegressorAdapter): 375 | """ Conditional quantile estimator, formulated as neural net 376 | """ 377 | def __init__(self, 378 | model, 379 | fit_params=None, 380 | in_shape=1, 381 | hidden_size=1, 382 | quantiles=[.05, .95], 383 | learn_func=torch.optim.Adam, 384 | epochs=1000, 385 | batch_size=10, 386 | dropout=0.1, 387 | lr=0.01, 388 | wd=1e-6, 389 | test_ratio=0.2, 390 | random_state=0, 391 | use_rearrangement=False): 392 | """ Initialization 393 | 394 | Parameters 395 | ---------- 396 | model : None, unused parameter (for compatibility with nc class) 397 | fit_params : None, unused parameter (for compatibility with nc class) 398 | in_shape : integer, input signal dimension 399 | hidden_size : integer, hidden layer dimension 400 | quantiles : numpy array, low and high quantile levels in range (0,1) 401 | learn_func : class of Pytorch's SGD optimizer 402 | epochs : integer, maximal number of epochs 403 | batch_size : integer, mini-batch size for SGD 404 | dropout : float, dropout rate 405 | lr : float, learning rate for SGD 406 | wd : float, weight decay 407 | test_ratio : float, ratio of held-out data, used in cross-validation 408 | random_state : integer, seed for splitting the data in cross-validation 409 | use_rearrangement : boolean, use the rearrangement algorithm (True) 410 | of not (False). See reference [1]. 411 | 412 | References 413 | ---------- 414 | .. [1] Chernozhukov, Victor, Iván Fernández‐Val, and Alfred Galichon. 415 | "Quantile and probability curves without crossing." 416 | Econometrica 78.3 (2010): 1093-1125. 
417 | 418 | """ 419 | super(AllQNet_RegressorAdapter, self).__init__(model, fit_params) 420 | # Instantiate model 421 | self.quantiles = quantiles 422 | if use_rearrangement: 423 | self.all_quantiles = torch.from_numpy(np.linspace(0.01,0.99,99)).float() 424 | else: 425 | self.all_quantiles = self.quantiles 426 | self.epochs = epochs 427 | self.batch_size = batch_size 428 | self.dropout = dropout 429 | self.lr = lr 430 | self.wd = wd 431 | self.test_ratio = test_ratio 432 | self.random_state = random_state 433 | self.model = torch_models.all_q_model(quantiles=self.all_quantiles, 434 | in_shape=in_shape, 435 | hidden_size=hidden_size, 436 | dropout=dropout) 437 | self.loss_func = torch_models.AllQuantileLoss(self.all_quantiles) 438 | self.learner = torch_models.LearnerOptimizedCrossing(self.model, 439 | partial(learn_func, lr=lr, weight_decay=wd), 440 | self.loss_func, 441 | device=device, 442 | test_ratio=self.test_ratio, 443 | random_state=self.random_state, 444 | qlow=self.quantiles[0], 445 | qhigh=self.quantiles[1], 446 | use_rearrangement=use_rearrangement) 447 | 448 | def fit(self, x, y): 449 | """ Fit the model to data 450 | 451 | Parameters 452 | ---------- 453 | 454 | x : numpy array of training features (nXp) 455 | y : numpy array of training labels (n) 456 | 457 | """ 458 | self.learner.fit(x, y, self.epochs, self.batch_size) 459 | 460 | def predict(self, x): 461 | """ Estimate the conditional low and high quantiles given the features 462 | 463 | Parameters 464 | ---------- 465 | x : numpy array of training features (nXp) 466 | 467 | Returns 468 | ------- 469 | ret_val : numpy array of estimated conditional quantiles (nX2) 470 | 471 | """ 472 | return self.learner.predict(x) 473 | 474 | 475 | ############################################################################### 476 | # Quantile random forests model 477 | ############################################################################### 478 | 479 | class QuantileForestRegressorAdapter(RegressorAdapter): 480 | """ Conditional quantile estimator, defined as quantile random forests (QRF) 481 | 482 | References 483 | ---------- 484 | .. [1] Meinshausen, Nicolai. "Quantile regression forests." 485 | Journal of Machine Learning Research 7.Jun (2006): 983-999. 486 | 487 | """ 488 | 489 | def __init__(self, 490 | model, 491 | fit_params=None, 492 | quantiles=[5, 95], 493 | params=None): 494 | """ Initialization 495 | 496 | Parameters 497 | ---------- 498 | model : None, unused parameter (for compatibility with nc class) 499 | fit_params : None, unused parameter (for compatibility with nc class) 500 | quantiles : numpy array, low and high quantile levels in range (0,100) 501 | params : dictionary of parameters 502 | params["random_state"] : integer, seed for splitting the data 503 | in cross-validation. 
Also used as the 504 | seed in quantile random forests (QRF) 505 | params["min_samples_leaf"] : integer, parameter of QRF 506 | params["n_estimators"] : integer, parameter of QRF 507 | params["max_features"] : integer, parameter of QRF 508 | params["CV"] : boolean, use cross-validation (True) or 509 | not (False) to tune the two QRF quantile levels 510 | to obtain the desired coverage 511 | params["test_ratio"] : float, ratio of held-out data, used 512 | in cross-validation 513 | params["coverage_factor"] : float, to avoid too conservative 514 | estimation of the prediction band, 515 | when tuning the two QRF quantile 516 | levels in cross-validation one may 517 | ask for prediction intervals with 518 | reduced average coverage, equal to 519 | coverage_factor*(q_high - q_low). 520 | params["range_vals"] : float, determines the lowest and highest 521 | quantile level parameters when tuning 522 | the quanitle levels bt cross-validation. 523 | The smallest value is equal to 524 | quantiles[0] - range_vals. 525 | Similarly, the largest is equal to 526 | quantiles[1] + range_vals. 527 | params["num_vals"] : integer, when tuning QRF's quantile 528 | parameters, sweep over a grid of length 529 | num_vals. 530 | 531 | """ 532 | super(QuantileForestRegressorAdapter, self).__init__(model, fit_params) 533 | # Instantiate model 534 | self.quantiles = quantiles 535 | self.cv_quantiles = self.quantiles 536 | self.params = params 537 | self.rfqr = RandomForestQuantileRegressor(random_state=params["random_state"], 538 | min_samples_leaf=params["min_samples_leaf"], 539 | n_estimators=params["n_estimators"], 540 | max_features=params["max_features"]) 541 | 542 | def fit(self, x, y): 543 | """ Fit the model to data 544 | 545 | Parameters 546 | ---------- 547 | 548 | x : numpy array of training features (nXp) 549 | y : numpy array of training labels (n) 550 | 551 | """ 552 | if self.params["CV"]: 553 | target_coverage = self.quantiles[1] - self.quantiles[0] 554 | coverage_factor = self.params["coverage_factor"] 555 | range_vals = self.params["range_vals"] 556 | num_vals = self.params["num_vals"] 557 | grid_q_low = np.linspace(self.quantiles[0],self.quantiles[0]+range_vals,num_vals).reshape(-1,1) 558 | grid_q_high = np.linspace(self.quantiles[1],self.quantiles[1]-range_vals,num_vals).reshape(-1,1) 559 | grid_q = np.concatenate((grid_q_low,grid_q_high),1) 560 | 561 | self.cv_quantiles = tune_params_cv.CV_quntiles_rf(self.params, 562 | x, 563 | y, 564 | target_coverage, 565 | grid_q, 566 | self.params["test_ratio"], 567 | self.params["random_state"], 568 | coverage_factor) 569 | 570 | self.rfqr.fit(x, y) 571 | 572 | def predict(self, x): 573 | """ Estimate the conditional low and high quantiles given the features 574 | 575 | Parameters 576 | ---------- 577 | x : numpy array of training features (nXp) 578 | 579 | Returns 580 | ------- 581 | ret_val : numpy array of estimated conditional quantiles (nX2) 582 | 583 | """ 584 | lower = self.rfqr.predict(x, quantile=self.cv_quantiles[0]) 585 | upper = self.rfqr.predict(x, quantile=self.cv_quantiles[1]) 586 | 587 | ret_val = np.zeros((len(lower),2)) 588 | ret_val[:,0] = lower 589 | ret_val[:,1] = upper 590 | return ret_val 591 | -------------------------------------------------------------------------------- /cqr/torch_models.py: -------------------------------------------------------------------------------- 1 | 2 | import sys 3 | import copy 4 | import torch 5 | import numpy as np 6 | import torch.nn as nn 7 | from cqr import helper 8 | from sklearn.model_selection 
import train_test_split 9 | 10 | 11 | if torch.cuda.is_available(): 12 | device = "cuda:0" 13 | else: 14 | device = "cpu" 15 | 16 | ############################################################################### 17 | # Helper functions 18 | ############################################################################### 19 | 20 | def epoch_internal_train(model, loss_func, x_train, y_train, batch_size, optimizer, cnt=0, best_cnt=np.Inf): 21 | """ Sweep over the data and update the model's parameters 22 | 23 | Parameters 24 | ---------- 25 | 26 | model : class of neural net model 27 | loss_func : class of loss function 28 | x_train : pytorch tensor n training features, each of dimension p (nXp) 29 | batch_size : integer, size of the mini-batch 30 | optimizer : class of SGD solver 31 | cnt : integer, counting the gradient steps 32 | best_cnt: integer, stop the training if current cnt > best_cnt 33 | 34 | Returns 35 | ------- 36 | 37 | epoch_loss : mean loss value 38 | cnt : integer, cumulative number of gradient steps 39 | 40 | """ 41 | 42 | model.train() 43 | shuffle_idx = np.arange(x_train.shape[0]) 44 | np.random.shuffle(shuffle_idx) 45 | x_train = x_train[shuffle_idx] 46 | y_train = y_train[shuffle_idx] 47 | epoch_losses = [] 48 | for idx in range(0, x_train.shape[0], batch_size): 49 | cnt = cnt + 1 50 | optimizer.zero_grad() 51 | batch_x = x_train[idx : min(idx + batch_size, x_train.shape[0]),:] 52 | batch_y = y_train[idx : min(idx + batch_size, y_train.shape[0])] 53 | preds = model(batch_x) 54 | loss = loss_func(preds, batch_y) 55 | loss.backward() 56 | optimizer.step() 57 | epoch_losses.append(loss.cpu().detach().numpy()) 58 | 59 | if cnt >= best_cnt: 60 | break 61 | 62 | epoch_loss = np.mean(epoch_losses) 63 | 64 | return epoch_loss, cnt 65 | 66 | def rearrange(all_quantiles, quantile_low, quantile_high, test_preds): 67 | """ Produce monotonic quantiles 68 | 69 | Parameters 70 | ---------- 71 | 72 | all_quantiles : numpy array (q), grid of quantile levels in the range (0,1) 73 | quantile_low : float, desired low quantile in the range (0,1) 74 | quantile_high : float, desired high quantile in the range (0,1) 75 | test_preds : numpy array of predicted quantile (nXq) 76 | 77 | Returns 78 | ------- 79 | 80 | q_fixed : numpy array (nX2), containing the rearranged estimates of the 81 | desired low and high quantile 82 | 83 | References 84 | ---------- 85 | .. [1] Chernozhukov, Victor, Iván Fernández‐Val, and Alfred Galichon. 86 | "Quantile and probability curves without crossing." 87 | Econometrica 78.3 (2010): 1093-1125. 
88 | 89 | """ 90 | scaling = all_quantiles[-1] - all_quantiles[0] 91 | low_val = (quantile_low - all_quantiles[0])/scaling 92 | high_val = (quantile_high - all_quantiles[0])/scaling 93 | q_fixed = np.quantile(test_preds,(low_val, high_val),interpolation='linear',axis=1) 94 | return q_fixed.T 95 | 96 | ############################################################################### 97 | # Deep conditional mean regression 98 | # Minimizing MSE loss 99 | ############################################################################### 100 | 101 | # Define the network 102 | class mse_model(nn.Module): 103 | """ Conditional mean estimator, formulated as neural net 104 | """ 105 | 106 | def __init__(self, 107 | in_shape=1, 108 | hidden_size=64, 109 | dropout=0.5): 110 | """ Initialization 111 | 112 | Parameters 113 | ---------- 114 | 115 | in_shape : integer, input signal dimension (p) 116 | hidden_size : integer, hidden layer dimension 117 | dropout : float, dropout rate 118 | 119 | """ 120 | 121 | super().__init__() 122 | self.in_shape = in_shape 123 | self.out_shape = 1 124 | self.hidden_size = hidden_size 125 | self.dropout = dropout 126 | self.build_model() 127 | self.init_weights() 128 | 129 | def build_model(self): 130 | """ Construct the network 131 | """ 132 | self.base_model = nn.Sequential( 133 | nn.Linear(self.in_shape, self.hidden_size), 134 | nn.ReLU(), 135 | nn.Dropout(self.dropout), 136 | nn.Linear(self.hidden_size, self.hidden_size), 137 | nn.ReLU(), 138 | nn.Dropout(self.dropout), 139 | nn.Linear(self.hidden_size, 1), 140 | ) 141 | 142 | def init_weights(self): 143 | """ Initialize the network parameters 144 | """ 145 | for m in self.base_model: 146 | if isinstance(m, nn.Linear): 147 | nn.init.orthogonal_(m.weight) 148 | nn.init.constant_(m.bias, 0) 149 | 150 | def forward(self, x): 151 | """ Run forward pass 152 | """ 153 | return torch.squeeze(self.base_model(x)) 154 | 155 | # Define the training procedure 156 | class LearnerOptimized: 157 | """ Fit a neural network (conditional mean) to training data 158 | """ 159 | def __init__(self, model, optimizer_class, loss_func, device='cpu', test_ratio=0.2, random_state=0): 160 | """ Initialization 161 | 162 | Parameters 163 | ---------- 164 | 165 | model : class of neural network model 166 | optimizer_class : class of SGD optimizer (e.g. 
Adam) 167 | loss_func : loss to minimize 168 | device : string, "cuda:0" or "cpu" 169 | test_ratio : float, test size used in cross-validation (CV) 170 | random_state : int, seed to be used in CV when splitting to train-test 171 | 172 | """ 173 | self.model = model.to(device) 174 | self.optimizer_class = optimizer_class 175 | self.optimizer = optimizer_class(self.model.parameters()) 176 | self.loss_func = loss_func.to(device) 177 | self.device = device 178 | self.test_ratio = test_ratio 179 | self.random_state = random_state 180 | self.loss_history = [] 181 | self.test_loss_history = [] 182 | self.full_loss_history = [] 183 | 184 | def fit(self, x, y, epochs, batch_size, verbose=False): 185 | """ Fit the model to data 186 | 187 | Parameters 188 | ---------- 189 | 190 | x : numpy array, containing the training features (nXp) 191 | y : numpy array, containing the training labels (n) 192 | epochs : integer, maximal number of epochs 193 | batch_size : integer, mini-batch size for SGD 194 | 195 | """ 196 | 197 | sys.stdout.flush() 198 | model = copy.deepcopy(self.model) 199 | model = model.to(device) 200 | optimizer = self.optimizer_class(model.parameters()) 201 | best_epoch = epochs 202 | 203 | x_train, xx, y_train, yy = train_test_split(x, y, test_size=self.test_ratio,random_state=self.random_state) 204 | 205 | x_train = torch.from_numpy(x_train).float().to(self.device).requires_grad_(False) 206 | xx = torch.from_numpy(xx).float().to(self.device).requires_grad_(False) 207 | y_train = torch.from_numpy(y_train).float().to(self.device).requires_grad_(False) 208 | yy = torch.from_numpy(yy).float().to(self.device).requires_grad_(False) 209 | 210 | best_cnt = 1e10 211 | best_test_epoch_loss = 1e10 212 | 213 | cnt = 0 214 | for e in range(epochs): 215 | epoch_loss, cnt = epoch_internal_train(model, self.loss_func, x_train, y_train, batch_size, optimizer, cnt) 216 | self.loss_history.append(epoch_loss) 217 | 218 | # test 219 | model.eval() 220 | preds = model(xx) 221 | test_preds = preds.cpu().detach().numpy() 222 | test_preds = np.squeeze(test_preds) 223 | test_epoch_loss = self.loss_func(preds, yy).cpu().detach().numpy() 224 | 225 | self.test_loss_history.append(test_epoch_loss) 226 | 227 | if (test_epoch_loss <= best_test_epoch_loss): 228 | best_test_epoch_loss = test_epoch_loss 229 | best_epoch = e 230 | best_cnt = cnt 231 | 232 | if (e+1) % 100 == 0 and verbose: 233 | print("CV: Epoch {}: Train {}, Test {}, Best epoch {}, Best loss {}".format(e+1, epoch_loss, test_epoch_loss, best_epoch, best_test_epoch_loss)) 234 | sys.stdout.flush() 235 | 236 | # use all the data to train the model, for best_cnt steps 237 | x = torch.from_numpy(x).float().to(self.device).requires_grad_(False) 238 | y = torch.from_numpy(y).float().to(self.device).requires_grad_(False) 239 | 240 | cnt = 0 241 | for e in range(best_epoch+1): 242 | if cnt > best_cnt: 243 | break 244 | 245 | epoch_loss, cnt = epoch_internal_train(self.model, self.loss_func, x, y, batch_size, self.optimizer, cnt, best_cnt) 246 | self.full_loss_history.append(epoch_loss) 247 | 248 | if (e+1) % 100 == 0 and verbose: 249 | print("Full: Epoch {}: {}, cnt {}".format(e+1, epoch_loss, cnt)) 250 | sys.stdout.flush() 251 | 252 | def predict(self, x): 253 | """ Estimate the label given the features 254 | 255 | Parameters 256 | ---------- 257 | x : numpy array of training features (nXp) 258 | 259 | Returns 260 | ------- 261 | ret_val : numpy array of predicted labels (n) 262 | 263 | """ 264 | self.model.eval() 265 | ret_val = 
self.model(torch.from_numpy(x).to(self.device).requires_grad_(False)).cpu().detach().numpy() 266 | return ret_val 267 | 268 | 269 | ############################################################################## 270 | # Quantile regression 271 | # Implementation inspired by: 272 | # https://github.com/ceshine/quantile-regression-tensorflow 273 | ############################################################################## 274 | 275 | class AllQuantileLoss(nn.Module): 276 | """ Pinball loss function 277 | """ 278 | def __init__(self, quantiles): 279 | """ Initialize 280 | 281 | Parameters 282 | ---------- 283 | quantiles : pytorch vector of quantile levels, each in the range (0,1) 284 | 285 | 286 | """ 287 | super().__init__() 288 | self.quantiles = quantiles 289 | 290 | def forward(self, preds, target): 291 | """ Compute the pinball loss 292 | 293 | Parameters 294 | ---------- 295 | preds : pytorch tensor of estimated labels (n) 296 | target : pytorch tensor of true labels (n) 297 | 298 | Returns 299 | ------- 300 | loss : cost function value 301 | 302 | """ 303 | assert not target.requires_grad 304 | assert preds.size(0) == target.size(0) 305 | losses = [] 306 | 307 | for i, q in enumerate(self.quantiles): 308 | errors = target - preds[:, i] 309 | losses.append(torch.max((q-1) * errors, q * errors).unsqueeze(1)) 310 | 311 | loss = torch.mean(torch.sum(torch.cat(losses, dim=1), dim=1)) 312 | return loss 313 | 314 | 315 | class all_q_model(nn.Module): 316 | """ Conditional quantile estimator, formulated as neural net 317 | """ 318 | def __init__(self, 319 | quantiles, 320 | in_shape=1, 321 | hidden_size=64, 322 | dropout=0.5): 323 | """ Initialization 324 | 325 | Parameters 326 | ---------- 327 | quantiles : numpy array of quantile levels (q), each in the range (0,1) 328 | in_shape : integer, input signal dimension (p) 329 | hidden_size : integer, hidden layer dimension 330 | dropout : float, dropout rate 331 | 332 | """ 333 | super().__init__() 334 | self.quantiles = quantiles 335 | self.num_quantiles = len(quantiles) 336 | self.hidden_size = hidden_size 337 | self.in_shape = in_shape 338 | self.out_shape = len(quantiles) 339 | self.dropout = dropout 340 | self.build_model() 341 | self.init_weights() 342 | 343 | def build_model(self): 344 | """ Construct the network 345 | """ 346 | self.base_model = nn.Sequential( 347 | nn.Linear(self.in_shape, self.hidden_size), 348 | nn.ReLU(), 349 | nn.Dropout(self.dropout), 350 | nn.Linear(self.hidden_size, self.hidden_size), 351 | nn.ReLU(), 352 | nn.Dropout(self.dropout), 353 | nn.Linear(self.hidden_size, self.num_quantiles), 354 | ) 355 | 356 | def init_weights(self): 357 | """ Initialize the network parameters 358 | """ 359 | for m in self.base_model: 360 | if isinstance(m, nn.Linear): 361 | nn.init.orthogonal_(m.weight) 362 | nn.init.constant_(m.bias, 0) 363 | 364 | def forward(self, x): 365 | """ Run forward pass 366 | """ 367 | return self.base_model(x) 368 | 369 | class LearnerOptimizedCrossing: 370 | """ Fit a neural network (conditional quantile) to training data 371 | """ 372 | def __init__(self, model, optimizer_class, loss_func, device='cpu', test_ratio=0.2, random_state=0, 373 | qlow=0.05, qhigh=0.95, use_rearrangement=False): 374 | """ Initialization 375 | 376 | Parameters 377 | ---------- 378 | 379 | model : class of neural network model 380 | optimizer_class : class of SGD optimizer (e.g. 
pytorch's Adam) 381 | loss_func : loss to minimize 382 | device : string, "cuda:0" or "cpu" 383 | test_ratio : float, test size used in cross-validation (CV) 384 | random_state : integer, seed used in CV when splitting to train-test 385 | qlow : float, low quantile level in the range (0,1) 386 | qhigh : float, high quantile level in the range (0,1) 387 | use_rearrangement : boolean, use the rearrangement algorithm (True) 388 | of not (False) 389 | 390 | """ 391 | self.model = model.to(device) 392 | self.use_rearrangement = use_rearrangement 393 | self.compute_coverage = True 394 | self.quantile_low = qlow 395 | self.quantile_high = qhigh 396 | self.target_coverage = 100.0*(self.quantile_high - self.quantile_low) 397 | self.all_quantiles = loss_func.quantiles 398 | self.optimizer_class = optimizer_class 399 | self.optimizer = optimizer_class(self.model.parameters()) 400 | self.loss_func = loss_func.to(device) 401 | self.device = device 402 | self.test_ratio = test_ratio 403 | self.random_state = random_state 404 | self.loss_history = [] 405 | self.test_loss_history = [] 406 | self.full_loss_history = [] 407 | 408 | def fit(self, x, y, epochs, batch_size, verbose=False): 409 | """ Fit the model to data 410 | 411 | Parameters 412 | ---------- 413 | 414 | x : numpy array of training features (nXp) 415 | y : numpy array of training labels (n) 416 | epochs : integer, maximal number of epochs 417 | batch_size : integer, mini-batch size used in SGD solver 418 | 419 | """ 420 | sys.stdout.flush() 421 | model = copy.deepcopy(self.model) 422 | model = model.to(device) 423 | optimizer = self.optimizer_class(model.parameters()) 424 | best_epoch = epochs 425 | 426 | x_train, xx, y_train, yy = train_test_split(x, 427 | y, 428 | test_size=self.test_ratio, 429 | random_state=self.random_state) 430 | 431 | x_train = torch.from_numpy(x_train).float().to(self.device).requires_grad_(False) 432 | xx = torch.from_numpy(xx).float().to(self.device).requires_grad_(False) 433 | y_train = torch.from_numpy(y_train).float().to(self.device).requires_grad_(False) 434 | yy_cpu = yy 435 | yy = torch.from_numpy(yy).float().to(self.device).requires_grad_(False) 436 | 437 | best_avg_length = 1e10 438 | best_coverage = 0 439 | best_cnt = 1e10 440 | 441 | cnt = 0 442 | for e in range(epochs): 443 | model.train() 444 | epoch_loss, cnt = epoch_internal_train(model, self.loss_func, x_train, y_train, batch_size, optimizer, cnt) 445 | self.loss_history.append(epoch_loss) 446 | 447 | model.eval() 448 | preds = model(xx) 449 | test_epoch_loss = self.loss_func(preds, yy).cpu().detach().numpy() 450 | self.test_loss_history.append(test_epoch_loss) 451 | 452 | test_preds = preds.cpu().detach().numpy() 453 | test_preds = np.squeeze(test_preds) 454 | 455 | if self.use_rearrangement: 456 | test_preds = rearrange(self.all_quantiles, self.quantile_low, self.quantile_high, test_preds) 457 | 458 | y_lower = test_preds[:,0] 459 | y_upper = test_preds[:,1] 460 | coverage, avg_length = helper.compute_coverage_len(yy_cpu, y_lower, y_upper) 461 | 462 | if (coverage >= self.target_coverage) and (avg_length < best_avg_length): 463 | best_avg_length = avg_length 464 | best_coverage = coverage 465 | best_epoch = e 466 | best_cnt = cnt 467 | 468 | if (e+1) % 100 == 0 and verbose: 469 | print("CV: Epoch {}: Train {}, Test {}, Best epoch {}, Best Coverage {} Best Length {} Cur Coverage {}".format(e+1, epoch_loss, test_epoch_loss, best_epoch, best_coverage, best_avg_length, coverage)) 470 | sys.stdout.flush() 471 | 472 | x = 
torch.from_numpy(x).float().to(self.device).requires_grad_(False) 473 | y = torch.from_numpy(y).float().to(self.device).requires_grad_(False) 474 | 475 | cnt = 0 476 | for e in range(best_epoch+1): 477 | if cnt > best_cnt: 478 | break 479 | epoch_loss, cnt = epoch_internal_train(self.model, self.loss_func, x, y, batch_size, self.optimizer, cnt, best_cnt) 480 | self.full_loss_history.append(epoch_loss) 481 | 482 | if (e+1) % 100 == 0 and verbose: 483 | print("Full: Epoch {}: {}, cnt {}".format(e+1, epoch_loss, cnt)) 484 | sys.stdout.flush() 485 | 486 | def predict(self, x): 487 | """ Estimate the conditional low and high quantile given the features 488 | 489 | Parameters 490 | ---------- 491 | x : numpy array of training features (nXp) 492 | 493 | Returns 494 | ------- 495 | test_preds : numpy array of predicted low and high quantiles (nX2) 496 | 497 | """ 498 | self.model.eval() 499 | test_preds = self.model(torch.from_numpy(x).to(self.device).requires_grad_(False)).cpu().detach().numpy() 500 | if self.use_rearrangement: 501 | test_preds = rearrange(self.all_quantiles, self.quantile_low, self.quantile_high, test_preds) 502 | else: 503 | test_preds[:,0] = np.min(test_preds,axis=1) 504 | test_preds[:,1] = np.max(test_preds,axis=1) 505 | return test_preds 506 | -------------------------------------------------------------------------------- /cqr/tune_params_cv.py: -------------------------------------------------------------------------------- 1 | 2 | from cqr import helper 3 | from skgarden import RandomForestQuantileRegressor 4 | from sklearn.model_selection import train_test_split 5 | 6 | 7 | def CV_quntiles_rf(params, 8 | X, 9 | y, 10 | target_coverage, 11 | grid_q, 12 | test_ratio, 13 | random_state, 14 | coverage_factor=0.9): 15 | """ Tune the low and high quantile level parameters of quantile random 16 | forests method, using cross-validation 17 | 18 | Parameters 19 | ---------- 20 | params : dictionary of parameters 21 | params["random_state"] : integer, seed for splitting the data 22 | in cross-validation. Also used as the 23 | seed in quantile random forest (QRF) 24 | params["min_samples_leaf"] : integer, parameter of QRF 25 | params["n_estimators"] : integer, parameter of QRF 26 | params["max_features"] : integer, parameter of QRF 27 | X : numpy array, containing the training features (nXp) 28 | y : numpy array, containing the training labels (n) 29 | target_coverage : desired coverage of prediction band. The output coverage 30 | may be smaller if coverage_factor <= 1, in this case the 31 | target will be modified to target_coverage*coverage_factor 32 | grid_q : numpy array, of low and high quantile levels to test 33 | test_ratio : float, test size of the held-out data 34 | random_state : integer, seed for splitting the data in cross-validation. 35 | Also used as the seed in QRF. 36 | coverage_factor : float, when tuning the two QRF quantile levels one may 37 | ask for prediction band with smaller average coverage, 38 | equal to coverage_factor*(q_high - q_low) to avoid too 39 | conservative estimation of the prediction band 40 | 41 | Returns 42 | ------- 43 | best_q : numpy array of low and high quantile levels (length 2) 44 | 45 | References 46 | ---------- 47 | .. [1] Meinshausen, Nicolai. "Quantile regression forests." 48 | Journal of Machine Learning Research 7.Jun (2006): 983-999. 
49 | 50 | """ 51 | target_coverage = coverage_factor*target_coverage 52 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_ratio,random_state=random_state) 53 | best_avg_length = 1e10 54 | best_q = grid_q[0] 55 | 56 | rf = RandomForestQuantileRegressor(random_state=params["random_state"], 57 | min_samples_leaf=params["min_samples_leaf"], 58 | n_estimators=params["n_estimators"], 59 | max_features=params["max_features"]) 60 | rf.fit(X_train, y_train) 61 | 62 | for q in grid_q: 63 | y_lower = rf.predict(X_test, quantile=q[0]) 64 | y_upper = rf.predict(X_test, quantile=q[1]) 65 | coverage, avg_length = helper.compute_coverage_len(y_test, y_lower, y_upper) 66 | if (coverage >= target_coverage) and (avg_length < best_avg_length): 67 | best_avg_length = avg_length 68 | best_q = q 69 | else: 70 | break 71 | return best_q 72 | -------------------------------------------------------------------------------- /cqr_real_data_example.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Conformalized quantile regression (CQR): Real data experiment\n", 8 | "\n", 9 | "In this tutorial we will load a real dataset and construct prediction intervals using CQR [1].\n", 10 | "\n", 11 | "[1] Yaniv Romano, Evan Patterson, and Emmanuel J. Candes, “Conformalized quantile regression.” 2019.\n", 12 | "\n", 13 | "## Prediction intervals\n", 14 | "\n", 15 | "Suppose we are given $ n $ training samples $ \\{(X_i, Y_i)\\}_{i=1}^n$ and we must now predict the unknown value of $Y_{n+1}$ at a test point $X_{n+1}$. We assume that all the samples $ \\{(X_i,Y_i)\\}_{i=1}^{n+1} $ are drawn exchangeably$-$for instance, they may be drawn i.i.d.$-$from an arbitrary joint distribution $P_{XY}$ over the feature vectors $ X\\in \\mathbb{R}^p $ and response variables $ Y\\in \\mathbb{R} $. We aim to construct a marginal distribution-free prediction interval $C(X_{n+1}) \\subseteq \\mathbb{R}$ that is likely to contain the unknown response $Y_{n+1} $. That is, given a desired miscoverage rate $ \\alpha $, we ask that\n", 16 | "$$ \\mathbb{P}\\{Y_{n+1} \\in C(X_{n+1})\\} \\geq 1-\\alpha $$\n", 17 | "for any joint distribution $ P_{XY} $ and any sample size $n$. The probability in this statement is marginal, being taken over all the samples $ \\{(X_i, Y_i)\\}_{i=1}^{n+1} $.\n", 18 | "\n", 19 | "To accomplish this, we build on the method of split conformal prediction. We first split the training data into two disjoint subsets, a proper training set and a calibration set. We fit two quantile regressors on the proper training set to obtain initial estimates of the lower and upper bounds of the prediction interval. Then, using the calibration set, we conformalize and, if necessary, correct this prediction interval. Unlike the original interval, the conformalized prediction interval is guaranteed to satisfy the coverage requirement regardless of the choice or accuracy of the quantile regression estimator.\n", 20 | "\n", 21 | "\n", 22 | "\n", 23 | "## A case study\n", 24 | "\n", 25 | "We start by importing several libraries, loading the real dataset and standardize its features and response. We set the target miscoverage rate $\\alpha$ to 0.1." 
26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": {}, 32 | "outputs": [ 33 | { 34 | "name": "stdout", 35 | "output_type": "stream", 36 | "text": [ 37 | "Dataset: community\n", 38 | "Dimensions: train set (n=1595, p=100) ; test set (n=399, p=100)\n" 39 | ] 40 | } 41 | ], 42 | "source": [ 43 | "import torch\n", 44 | "import random\n", 45 | "import numpy as np\n", 46 | "np.warnings.filterwarnings('ignore')\n", 47 | "\n", 48 | "from datasets import datasets\n", 49 | "from sklearn.preprocessing import StandardScaler\n", 50 | "from sklearn.model_selection import train_test_split\n", 51 | "\n", 52 | "seed = 1\n", 53 | "\n", 54 | "random_state_train_test = seed\n", 55 | "random.seed(seed)\n", 56 | "np.random.seed(seed)\n", 57 | "torch.manual_seed(seed)\n", 58 | "if torch.cuda.is_available():\n", 59 | " torch.cuda.manual_seed_all(seed)\n", 60 | " \n", 61 | "# desired miscoverage error\n", 62 | "alpha = 0.1\n", 63 | "\n", 64 | "# desired quanitile levels\n", 65 | "quantiles = [0.05, 0.95]\n", 66 | "\n", 67 | "# used to determine the size of test set\n", 68 | "test_ratio = 0.2\n", 69 | "\n", 70 | "# name of dataset\n", 71 | "dataset_base_path = \"./datasets/\"\n", 72 | "dataset_name = \"community\"\n", 73 | "\n", 74 | "# load the dataset\n", 75 | "X, y = datasets.GetDataset(dataset_name, dataset_base_path)\n", 76 | "\n", 77 | "# divide the dataset into test and train based on the test_ratio parameter\n", 78 | "x_train, x_test, y_train, y_test = train_test_split(X,\n", 79 | " y,\n", 80 | " test_size=test_ratio,\n", 81 | " random_state=random_state_train_test)\n", 82 | "\n", 83 | "# reshape the data\n", 84 | "x_train = np.asarray(x_train)\n", 85 | "y_train = np.asarray(y_train)\n", 86 | "x_test = np.asarray(x_test)\n", 87 | "y_test = np.asarray(y_test)\n", 88 | "\n", 89 | "# compute input dimensions\n", 90 | "n_train = x_train.shape[0]\n", 91 | "in_shape = x_train.shape[1]\n", 92 | "\n", 93 | "# display basic information\n", 94 | "print(\"Dataset: %s\" % (dataset_name))\n", 95 | "print(\"Dimensions: train set (n=%d, p=%d) ; test set (n=%d, p=%d)\" % \n", 96 | " (x_train.shape[0], x_train.shape[1], x_test.shape[0], x_test.shape[1]))" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "metadata": {}, 102 | "source": [ 103 | "## Data splitting\n", 104 | "\n", 105 | "We begin by splitting the data into a proper training set and a calibration set. Recall that the main idea is to fit a regression model on the proper training samples, then use the residuals on a held-out validation set to quantify the uncertainty in future predictions." 
106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": 3, 111 | "metadata": {}, 112 | "outputs": [], 113 | "source": [ 114 | "# divide the data into proper training set and calibration set\n", 115 | "idx = np.random.permutation(n_train)\n", 116 | "n_half = int(np.floor(n_train/2))\n", 117 | "idx_train, idx_cal = idx[:n_half], idx[n_half:2*n_half]\n", 118 | "\n", 119 | "# zero mean and unit variance scaling \n", 120 | "scalerX = StandardScaler()\n", 121 | "scalerX = scalerX.fit(x_train[idx_train])\n", 122 | "\n", 123 | "# scale\n", 124 | "x_train = scalerX.transform(x_train)\n", 125 | "x_test = scalerX.transform(x_test)\n", 126 | "\n", 127 | "# scale the labels by dividing each by the mean absolute response\n", 128 | "mean_y_train = np.mean(np.abs(y_train[idx_train]))\n", 129 | "y_train = np.squeeze(y_train)/mean_y_train\n", 130 | "y_test = np.squeeze(y_test)/mean_y_train" 131 | ] 132 | }, 133 | { 134 | "cell_type": "markdown", 135 | "metadata": {}, 136 | "source": [ 137 | "## CQR random forests\n", 138 | "\n", 139 | "Given these two subsets, we now turn to conformalize the initial prediction interval constructed by quantile random forests [2]. Below, we set the hyper-parameters of the CQR random forests method.\n", 140 | "\n", 141 | "[2] Meinshausen Nicolai. \"Quantile regression forests.\" Journal of Machine Learning Research 7, no. Jun (2006): 983-999." 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 4, 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "#########################################################\n", 151 | "# Quantile random forests parameters\n", 152 | "# (See QuantileForestRegressorAdapter class in helper.py)\n", 153 | "#########################################################\n", 154 | "\n", 155 | "# the number of trees in the forest\n", 156 | "n_estimators = 1000\n", 157 | "\n", 158 | "# the minimum number of samples required to be at a leaf node\n", 159 | "# (default skgarden's parameter)\n", 160 | "min_samples_leaf = 1\n", 161 | "\n", 162 | "# the number of features to consider when looking for the best split\n", 163 | "# (default skgarden's parameter)\n", 164 | "max_features = x_train.shape[1]\n", 165 | "\n", 166 | "# target quantile levels\n", 167 | "quantiles_forest = [quantiles[0]*100, quantiles[1]*100]\n", 168 | "\n", 169 | "# use cross-validation to tune the quantile levels?\n", 170 | "cv_qforest = True\n", 171 | "\n", 172 | "# when tuning the two QRF quantile levels one may\n", 173 | "# ask for a prediction band with smaller average coverage\n", 174 | "# to avoid too conservative estimation of the prediction band\n", 175 | "# This would be equal to coverage_factor*(quantiles[1] - quantiles[0])\n", 176 | "coverage_factor = 0.85\n", 177 | "\n", 178 | "# ratio of held-out data, used in cross-validation\n", 179 | "cv_test_ratio = 0.05\n", 180 | "\n", 181 | "# seed for splitting the data in cross-validation.\n", 182 | "# Also used as the seed in quantile random forests function\n", 183 | "cv_random_state = 1\n", 184 | "\n", 185 | "# determines the lowest and highest quantile level parameters.\n", 186 | "# This is used when tuning the quanitle levels by cross-validation.\n", 187 | "# The smallest value is equal to quantiles[0] - range_vals.\n", 188 | "# Similarly, the largest value is equal to quantiles[1] + range_vals.\n", 189 | "cv_range_vals = 30\n", 190 | "\n", 191 | "# sweep over a grid of length num_vals when tuning QRF's quantile parameters \n", 192 | "cv_num_vals = 10" 193 | ] 194 | }, 195 | 
{ 196 | "cell_type": "markdown", 197 | "metadata": {}, 198 | "source": [ 199 | "### Symmetric nonconformity score \n", 200 | "\n", 201 | "In the following cell we run the entire CQR procedure. The class `QuantileForestRegressorAdapter` defines the underlying estimator. The class `RegressorNc` defines the CQR object, which uses `QuantileRegErrFunc` as the nonconformity score. The function `run_icp` fits the regression function to the proper training set, corrects (if required) the initial estimate of the prediction interval using the calibration set, and returns the conformal band. Lastly, we compute the average coverage and length on future test data using `compute_coverage`." 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 5, 207 | "metadata": {}, 208 | "outputs": [ 209 | { 210 | "name": "stdout", 211 | "output_type": "stream", 212 | "text": [ 213 | "CQR Random Forests: Percentage in the range (expecting 90.00): 91.228070\n", 214 | "CQR Random Forests: Average length: 1.355441\n" 215 | ] 216 | } 217 | ], 218 | "source": [ 219 | "from cqr import helper\n", 220 | "from nonconformist.nc import RegressorNc\n", 221 | "from nonconformist.nc import QuantileRegErrFunc\n", 222 | "\n", 223 | "# define the QRF's parameters \n", 224 | "params_qforest = dict()\n", 225 | "params_qforest[\"n_estimators\"] = n_estimators\n", 226 | "params_qforest[\"min_samples_leaf\"] = min_samples_leaf\n", 227 | "params_qforest[\"max_features\"] = max_features\n", 228 | "params_qforest[\"CV\"] = cv_qforest\n", 229 | "params_qforest[\"coverage_factor\"] = coverage_factor\n", 230 | "params_qforest[\"test_ratio\"] = cv_test_ratio\n", 231 | "params_qforest[\"random_state\"] = cv_random_state\n", 232 | "params_qforest[\"range_vals\"] = cv_range_vals\n", 233 | "params_qforest[\"num_vals\"] = cv_num_vals\n", 234 | "\n", 235 | "# define QRF model\n", 236 | "quantile_estimator = helper.QuantileForestRegressorAdapter(model=None,\n", 237 | " fit_params=None,\n", 238 | " quantiles=quantiles_forest,\n", 239 | " params=params_qforest)\n", 240 | " \n", 241 | "# define the CQR object\n", 242 | "nc = RegressorNc(quantile_estimator, QuantileRegErrFunc())\n", 243 | "\n", 244 | "# run CQR procedure\n", 245 | "y_lower, y_upper = helper.run_icp(nc, x_train, y_train, x_test, idx_train, idx_cal, alpha)\n", 246 | "\n", 247 | "# compute and print average coverage and average length\n", 248 | "coverage_cp_qforest, length_cp_qforest = helper.compute_coverage(y_test,\n", 249 | " y_lower,\n", 250 | " y_upper,\n", 251 | " alpha,\n", 252 | " \"CQR Random Forests\")" 253 | ] 254 | }, 255 | { 256 | "cell_type": "markdown", 257 | "metadata": {}, 258 | "source": [ 259 | "As can be seen, we obtained valid coverage.\n", 260 | "\n", 261 | "### Asymmetric nonconformity score \n", 262 | "\n", 263 | "The nonconformity score function `QuantileRegErrFunc` treats the left and right tails symmetrically, but if the error distribution is significantly skewed, one may choose to treat them asymmetrically. This can be done by replacing `QuantileRegErrFunc` with `QuantileRegAsymmetricErrFunc`, as implemented in the following cell."
264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": 6, 269 | "metadata": {}, 270 | "outputs": [ 271 | { 272 | "name": "stdout", 273 | "output_type": "stream", 274 | "text": [ 275 | "Asymmetric CQR Random Forests: Percentage in the range (expecting 90.00): 90.726817\n", 276 | "Asymmetric CQR Random Forests: Average length: 1.480756\n" 277 | ] 278 | } 279 | ], 280 | "source": [ 281 | "from nonconformist.nc import QuantileRegAsymmetricErrFunc\n", 282 | "\n", 283 | "# define QRF model\n", 284 | "quantile_estimator = helper.QuantileForestRegressorAdapter(model=None,\n", 285 | " fit_params=None,\n", 286 | " quantiles=quantiles_forest,\n", 287 | " params=params_qforest)\n", 288 | " \n", 289 | "# define the CQR object\n", 290 | "nc = RegressorNc(quantile_estimator, QuantileRegAsymmetricErrFunc())\n", 291 | "\n", 292 | "# run CQR procedure\n", 293 | "y_lower, y_upper = helper.run_icp(nc, x_train, y_train, x_test, idx_train, idx_cal, alpha)\n", 294 | "\n", 295 | "# compute and print average coverage and average length\n", 296 | "coverage_cp_qforest, length_cp_qforest = helper.compute_coverage(y_test,\n", 297 | " y_lower,\n", 298 | " y_upper,\n", 299 | " alpha,\n", 300 | " \"Asymmetric CQR Random Forests\")" 301 | ] 302 | }, 303 | { 304 | "cell_type": "markdown", 305 | "metadata": {}, 306 | "source": [ 307 | "Above, we also obtained valid coverage.\n", 308 | "\n", 309 | "\n", 310 | "## CQR neural net\n", 311 | "\n", 312 | "In what follows we will use a neural network as the underlying quantile regression method. Below, we set the hyper-parameters of the CQR neural network method." 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": 7, 318 | "metadata": {}, 319 | "outputs": [], 320 | "source": [ 321 | "#####################################################\n", 322 | "# Neural network parameters\n", 323 | "# (See AllQNet_RegressorAdapter class in helper.py)\n", 324 | "#####################################################\n", 325 | "\n", 326 | "# pytorch's optimizer object\n", 327 | "nn_learn_func = torch.optim.Adam\n", 328 | "\n", 329 | "# number of epochs\n", 330 | "epochs = 1000\n", 331 | "\n", 332 | "# learning rate\n", 333 | "lr = 0.0005\n", 334 | "\n", 335 | "# mini-batch size\n", 336 | "batch_size = 64\n", 337 | "\n", 338 | "# hidden dimension of the network\n", 339 | "hidden_size = 64\n", 340 | "\n", 341 | "# dropout regularization rate\n", 342 | "dropout = 0.1\n", 343 | "\n", 344 | "# weight decay regularization\n", 345 | "wd = 1e-6\n", 346 | "\n", 347 | "# Ask for a reduced coverage when tuning the network parameters by \n", 348 | "# cross-validation to avoid too conservative initial estimation of the \n", 349 | "# prediction interval. This estimation will be conformalized by CQR.\n", 350 | "quantiles_net = [0.1, 0.9]" 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "metadata": {}, 356 | "source": [ 357 | "We now turn to invoke the CQR procedure. The class `AllQNet_RegressorAdapter` defines the underlying neural network estimator. Just as before, `RegressorNc` defines the CQR object, which uses `QuantileRegErrFunc` as the nonconformity score. The function `run_icp` returns the conformal band, computed on test data. Lastly, we compute the average coverage and length using `compute_coverage`."
358 | ] 359 | }, 360 | { 361 | "cell_type": "code", 362 | "execution_count": 8, 363 | "metadata": {}, 364 | "outputs": [ 365 | { 366 | "name": "stdout", 367 | "output_type": "stream", 368 | "text": [ 369 | "CQR Neural Net: Percentage in the range (expecting 90.00): 90.225564\n", 370 | "CQR Neural Net: Average length: 1.502654\n" 371 | ] 372 | } 373 | ], 374 | "source": [ 375 | "# define quantile neural network model\n", 376 | "quantile_estimator = helper.AllQNet_RegressorAdapter(model=None,\n", 377 | " fit_params=None,\n", 378 | " in_shape=in_shape,\n", 379 | " hidden_size=hidden_size,\n", 380 | " quantiles=quantiles_net,\n", 381 | " learn_func=nn_learn_func,\n", 382 | " epochs=epochs,\n", 383 | " batch_size=batch_size,\n", 384 | " dropout=dropout,\n", 385 | " lr=lr,\n", 386 | " wd=wd,\n", 387 | " test_ratio=cv_test_ratio,\n", 388 | " random_state=cv_random_state,\n", 389 | " use_rearrangement=False)\n", 390 | "\n", 391 | "# define a CQR object, which computes the absolute residual error of points \n", 392 | "# located outside the estimated quantile neural network band \n", 393 | "nc = RegressorNc(quantile_estimator, QuantileRegErrFunc())\n", 394 | "\n", 395 | "# run CQR procedure\n", 396 | "y_lower, y_upper = helper.run_icp(nc, x_train, y_train, x_test, idx_train, idx_cal, alpha)\n", 397 | "\n", 398 | "# compute and print average coverage and average length\n", 399 | "coverage_cp_qnet, length_cp_qnet = helper.compute_coverage(y_test,\n", 400 | " y_lower,\n", 401 | " y_upper,\n", 402 | " alpha,\n", 403 | " \"CQR Neural Net\")" 404 | ] 405 | }, 406 | { 407 | "cell_type": "markdown", 408 | "metadata": {}, 409 | "source": [ 410 | "Above, we can see that the prediction interval constructed by CQR Neural Net is also valid. Notice the difference in the average length between the two methods (CQR Neural Net and CQR Random Forests). \n", 411 | "\n", 412 | "## CQR neural net with rearrangement\n", 413 | "\n", 414 | "Crossing quantiles is a longstanding problem in quantile regression. This issue does not affect the validity guarantee of CQR, as it holds regardless of the accuracy or choice of the quantile regression method. However, it may affect the efficiency of the resulting conformal band.\n", 415 | "\n", 416 | "Below we use the rearrangement method [3] to bypass the crossing quantile problem. Notice that we pass `use_rearrangement=True` as an argument to `AllQNet_RegressorAdapter`.\n", 417 | "\n", 418 | "[3] Chernozhukov Victor, Iván Fernández‐Val, and Alfred Galichon. “Quantile and probability curves without crossing.” Econometrica 78, no. 3 (2010): 1093-1125."
419 | ] 420 | }, 421 | { 422 | "cell_type": "code", 423 | "execution_count": 9, 424 | "metadata": {}, 425 | "outputs": [ 426 | { 427 | "name": "stdout", 428 | "output_type": "stream", 429 | "text": [ 430 | "CQR Rearrangement Neural Net: Percentage in the range (expecting 90.00): 89.974937\n", 431 | "CQR Rearrangement Neural Net: Average length: 1.476710\n" 432 | ] 433 | } 434 | ], 435 | "source": [ 436 | "# define quantile neural network model, using the rearrangement algorithm\n", 437 | "quantile_estimator = helper.AllQNet_RegressorAdapter(model=None,\n", 438 | " fit_params=None,\n", 439 | " in_shape=in_shape,\n", 440 | " hidden_size=hidden_size,\n", 441 | " quantiles=quantiles_net,\n", 442 | " learn_func=nn_learn_func,\n", 443 | " epochs=epochs,\n", 444 | " batch_size=batch_size,\n", 445 | " dropout=dropout,\n", 446 | " lr=lr,\n", 447 | " wd=wd,\n", 448 | " test_ratio=cv_test_ratio,\n", 449 | " random_state=cv_random_state,\n", 450 | " use_rearrangement=True)\n", 451 | "\n", 452 | "# define the CQR object, computing the absolute residual error of points \n", 453 | "# located outside the estimated quantile neural network band \n", 454 | "nc = RegressorNc(quantile_estimator, QuantileRegErrFunc())\n", 455 | "\n", 456 | "# run CQR procedure\n", 457 | "y_lower, y_upper = helper.run_icp(nc, x_train, y_train, x_test, idx_train, idx_cal, alpha)\n", 458 | "\n", 459 | "# compute and print average coverage and average length\n", 460 | "coverage_cp_re_qnet, length_cp_re_qnet = helper.compute_coverage(y_test,\n", 461 | " y_lower,\n", 462 | " y_upper,\n", 463 | " alpha,\n", 464 | " \"CQR Rearrangement Neural Net\")" 465 | ] 466 | } 467 | ], 468 | "metadata": { 469 | "kernelspec": { 470 | "display_name": "Python 3", 471 | "language": "python", 472 | "name": "python3" 473 | }, 474 | "language_info": { 475 | "codemirror_mode": { 476 | "name": "ipython", 477 | "version": 3 478 | }, 479 | "file_extension": ".py", 480 | "mimetype": "text/x-python", 481 | "name": "python", 482 | "nbconvert_exporter": "python", 483 | "pygments_lexer": "ipython3", 484 | "version": "3.7.3" 485 | } 486 | }, 487 | "nbformat": 4, 488 | "nbformat_minor": 2 489 | } 490 | -------------------------------------------------------------------------------- /datasets/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yromano/cqr/73267abb7ed7d3c6dad6ab4449154db7ec306535/datasets/.DS_Store -------------------------------------------------------------------------------- /datasets/README.md: -------------------------------------------------------------------------------- 1 | 2 | ## Publicly Available Datasets 3 | 4 | * Please download the file blogData_train.csv from [this link](https://archive.ics.uci.edu/ml/datasets/BlogFeedback), and save it in this directory. 5 | 6 | * Please download the files Features_Variant_1.csv and Features_Variant_2.csv from 7 | [this link](https://archive.ics.uci.edu/ml/datasets/Facebook+Comment+Volume+Dataset) and store the two under the ./facebook/ directory. 8 | 9 | ## Data subject to copyright/usage rules 10 | 11 | Please follow the instructions in [this README](https://github.com/yromano/cqr/blob/master/get_meps_data/README.md) file, which describes how to download and process the MEPS datasets. 12 | 13 | Once downloaded, copy the three files 'meps_19_reg.csv', 'meps_20_reg.csv', and 'meps_21_reg.csv' to this folder.
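Once the files above are in place, the datasets can be loaded through the `GetDataset` helper defined in `datasets.py` (included below). A minimal sketch, assuming Python 3, the repository root as the working directory, and that `blogData_train.csv` has already been downloaded to this folder:

```python
from datasets.datasets import GetDataset  # defined in datasets/datasets.py, shown below

# dataset names follow datasets.py, e.g. "blog_data", "facebook_1", "bio",
# "star", "concrete", "community" and "bike"
X, y = GetDataset("blog_data", "datasets/")
print(X.shape, y.shape)  # numpy float32 feature matrix and response vector
```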
14 | -------------------------------------------------------------------------------- /datasets/communities_attributes.csv: -------------------------------------------------------------------------------- 1 | attributes 2 | state 3 | county 4 | community 5 | communityname 6 | fold 7 | population 8 | householdsize 9 | racepctblack 10 | racePctWhite 11 | racePctAsian 12 | racePctHisp 13 | agePct12t21 14 | agePct12t29 15 | agePct16t24 16 | agePct65up 17 | numbUrban 18 | pctUrban 19 | medIncome 20 | pctWWage 21 | pctWFarmSelf 22 | pctWInvInc 23 | pctWSocSec 24 | pctWPubAsst 25 | pctWRetire 26 | medFamInc 27 | perCapInc 28 | whitePerCap 29 | blackPerCap 30 | indianPerCap 31 | AsianPerCap 32 | OtherPerCap 33 | HispPerCap 34 | NumUnderPov 35 | PctPopUnderPov 36 | PctLess9thGrade 37 | PctNotHSGrad 38 | PctBSorMore 39 | PctUnemployed 40 | PctEmploy 41 | PctEmplManu 42 | PctEmplProfServ 43 | PctOccupManu 44 | PctOccupMgmtProf 45 | MalePctDivorce 46 | MalePctNevMarr 47 | FemalePctDiv 48 | TotalPctDiv 49 | PersPerFam 50 | PctFam2Par 51 | PctKids2Par 52 | PctYoungKids2Par 53 | PctTeen2Par 54 | PctWorkMomYoungKids 55 | PctWorkMom 56 | NumIlleg 57 | PctIlleg 58 | NumImmig 59 | PctImmigRecent 60 | PctImmigRec5 61 | PctImmigRec8 62 | PctImmigRec10 63 | PctRecentImmig 64 | PctRecImmig5 65 | PctRecImmig8 66 | PctRecImmig10 67 | PctSpeakEnglOnly 68 | PctNotSpeakEnglWell 69 | PctLargHouseFam 70 | PctLargHouseOccup 71 | PersPerOccupHous 72 | PersPerOwnOccHous 73 | PersPerRentOccHous 74 | PctPersOwnOccup 75 | PctPersDenseHous 76 | PctHousLess3BR 77 | MedNumBR 78 | HousVacant 79 | PctHousOccup 80 | PctHousOwnOcc 81 | PctVacantBoarded 82 | PctVacMore6Mos 83 | MedYrHousBuilt 84 | PctHousNoPhone 85 | PctWOFullPlumb 86 | OwnOccLowQuart 87 | OwnOccMedVal 88 | OwnOccHiQuart 89 | RentLowQ 90 | RentMedian 91 | RentHighQ 92 | MedRent 93 | MedRentPctHousInc 94 | MedOwnCostPctInc 95 | MedOwnCostPctIncNoMtg 96 | NumInShelters 97 | NumStreet 98 | PctForeignBorn 99 | PctBornSameState 100 | PctSameHouse85 101 | PctSameCity85 102 | PctSameState85 103 | LemasSwornFT 104 | LemasSwFTPerPop 105 | LemasSwFTFieldOps 106 | LemasSwFTFieldPerPop 107 | LemasTotalReq 108 | LemasTotReqPerPop 109 | PolicReqPerOffic 110 | PolicPerPop 111 | RacialMatchCommPol 112 | PctPolicWhite 113 | PctPolicBlack 114 | PctPolicHisp 115 | PctPolicAsian 116 | PctPolicMinor 117 | OfficAssgnDrugUnits 118 | NumKindsDrugsSeiz 119 | PolicAveOTWorked 120 | LandArea 121 | PopDens 122 | PctUsePubTrans 123 | PolicCars 124 | PolicOperBudg 125 | LemasPctPolicOnPatr 126 | LemasGangUnitDeploy 127 | LemasPctOfficDrugUn 128 | PolicBudgPerPop 129 | ViolentCrimesPerPop 130 | -------------------------------------------------------------------------------- /datasets/datasets.py: -------------------------------------------------------------------------------- 1 | 2 | import numpy as np 3 | import pandas as pd 4 | 5 | 6 | def GetDataset(name, base_path): 7 | """ Load a dataset 8 | 9 | Parameters 10 | ---------- 11 | name : string, dataset name 12 | base_path : string, e.g. 
"path/to/datasets/directory/" 13 | 14 | Returns 15 | ------- 16 | X : features (nXp) 17 | y : labels (n) 18 | 19 | """ 20 | if name=="meps_19": 21 | df = pd.read_csv(base_path + 'meps_19_reg_fix.csv') 22 | column_names = df.columns 23 | response_name = "UTILIZATION_reg" 24 | column_names = column_names[column_names!=response_name] 25 | column_names = column_names[column_names!="Unnamed: 0"] 26 | 27 | col_names = ['AGE', 'PCS42', 'MCS42', 'K6SUM42', 'PERWT15F', 'REGION=1', 28 | 'REGION=2', 'REGION=3', 'REGION=4', 'SEX=1', 'SEX=2', 'MARRY=1', 29 | 'MARRY=2', 'MARRY=3', 'MARRY=4', 'MARRY=5', 'MARRY=6', 'MARRY=7', 30 | 'MARRY=8', 'MARRY=9', 'MARRY=10', 'FTSTU=-1', 'FTSTU=1', 'FTSTU=2', 31 | 'FTSTU=3', 'ACTDTY=1', 'ACTDTY=2', 'ACTDTY=3', 'ACTDTY=4', 32 | 'HONRDC=1', 'HONRDC=2', 'HONRDC=3', 'HONRDC=4', 'RTHLTH=-1', 33 | 'RTHLTH=1', 'RTHLTH=2', 'RTHLTH=3', 'RTHLTH=4', 'RTHLTH=5', 34 | 'MNHLTH=-1', 'MNHLTH=1', 'MNHLTH=2', 'MNHLTH=3', 'MNHLTH=4', 35 | 'MNHLTH=5', 'HIBPDX=-1', 'HIBPDX=1', 'HIBPDX=2', 'CHDDX=-1', 36 | 'CHDDX=1', 'CHDDX=2', 'ANGIDX=-1', 'ANGIDX=1', 'ANGIDX=2', 37 | 'MIDX=-1', 'MIDX=1', 'MIDX=2', 'OHRTDX=-1', 'OHRTDX=1', 'OHRTDX=2', 38 | 'STRKDX=-1', 'STRKDX=1', 'STRKDX=2', 'EMPHDX=-1', 'EMPHDX=1', 39 | 'EMPHDX=2', 'CHBRON=-1', 'CHBRON=1', 'CHBRON=2', 'CHOLDX=-1', 40 | 'CHOLDX=1', 'CHOLDX=2', 'CANCERDX=-1', 'CANCERDX=1', 'CANCERDX=2', 41 | 'DIABDX=-1', 'DIABDX=1', 'DIABDX=2', 'JTPAIN=-1', 'JTPAIN=1', 42 | 'JTPAIN=2', 'ARTHDX=-1', 'ARTHDX=1', 'ARTHDX=2', 'ARTHTYPE=-1', 43 | 'ARTHTYPE=1', 'ARTHTYPE=2', 'ARTHTYPE=3', 'ASTHDX=1', 'ASTHDX=2', 44 | 'ADHDADDX=-1', 'ADHDADDX=1', 'ADHDADDX=2', 'PREGNT=-1', 'PREGNT=1', 45 | 'PREGNT=2', 'WLKLIM=-1', 'WLKLIM=1', 'WLKLIM=2', 'ACTLIM=-1', 46 | 'ACTLIM=1', 'ACTLIM=2', 'SOCLIM=-1', 'SOCLIM=1', 'SOCLIM=2', 47 | 'COGLIM=-1', 'COGLIM=1', 'COGLIM=2', 'DFHEAR42=-1', 'DFHEAR42=1', 48 | 'DFHEAR42=2', 'DFSEE42=-1', 'DFSEE42=1', 'DFSEE42=2', 49 | 'ADSMOK42=-1', 'ADSMOK42=1', 'ADSMOK42=2', 'PHQ242=-1', 'PHQ242=0', 50 | 'PHQ242=1', 'PHQ242=2', 'PHQ242=3', 'PHQ242=4', 'PHQ242=5', 51 | 'PHQ242=6', 'EMPST=-1', 'EMPST=1', 'EMPST=2', 'EMPST=3', 'EMPST=4', 52 | 'POVCAT=1', 'POVCAT=2', 'POVCAT=3', 'POVCAT=4', 'POVCAT=5', 53 | 'INSCOV=1', 'INSCOV=2', 'INSCOV=3', 'RACE'] 54 | 55 | y = df[response_name].values 56 | X = df[col_names].values 57 | 58 | if name=="meps_20": 59 | df = pd.read_csv(base_path + 'meps_20_reg_fix.csv') 60 | column_names = df.columns 61 | response_name = "UTILIZATION_reg" 62 | column_names = column_names[column_names!=response_name] 63 | column_names = column_names[column_names!="Unnamed: 0"] 64 | 65 | col_names = ['AGE', 'PCS42', 'MCS42', 'K6SUM42', 'PERWT15F', 'REGION=1', 66 | 'REGION=2', 'REGION=3', 'REGION=4', 'SEX=1', 'SEX=2', 'MARRY=1', 67 | 'MARRY=2', 'MARRY=3', 'MARRY=4', 'MARRY=5', 'MARRY=6', 'MARRY=7', 68 | 'MARRY=8', 'MARRY=9', 'MARRY=10', 'FTSTU=-1', 'FTSTU=1', 'FTSTU=2', 69 | 'FTSTU=3', 'ACTDTY=1', 'ACTDTY=2', 'ACTDTY=3', 'ACTDTY=4', 70 | 'HONRDC=1', 'HONRDC=2', 'HONRDC=3', 'HONRDC=4', 'RTHLTH=-1', 71 | 'RTHLTH=1', 'RTHLTH=2', 'RTHLTH=3', 'RTHLTH=4', 'RTHLTH=5', 72 | 'MNHLTH=-1', 'MNHLTH=1', 'MNHLTH=2', 'MNHLTH=3', 'MNHLTH=4', 73 | 'MNHLTH=5', 'HIBPDX=-1', 'HIBPDX=1', 'HIBPDX=2', 'CHDDX=-1', 74 | 'CHDDX=1', 'CHDDX=2', 'ANGIDX=-1', 'ANGIDX=1', 'ANGIDX=2', 75 | 'MIDX=-1', 'MIDX=1', 'MIDX=2', 'OHRTDX=-1', 'OHRTDX=1', 'OHRTDX=2', 76 | 'STRKDX=-1', 'STRKDX=1', 'STRKDX=2', 'EMPHDX=-1', 'EMPHDX=1', 77 | 'EMPHDX=2', 'CHBRON=-1', 'CHBRON=1', 'CHBRON=2', 'CHOLDX=-1', 78 | 'CHOLDX=1', 'CHOLDX=2', 'CANCERDX=-1', 'CANCERDX=1', 'CANCERDX=2', 79 | 
'DIABDX=-1', 'DIABDX=1', 'DIABDX=2', 'JTPAIN=-1', 'JTPAIN=1', 80 | 'JTPAIN=2', 'ARTHDX=-1', 'ARTHDX=1', 'ARTHDX=2', 'ARTHTYPE=-1', 81 | 'ARTHTYPE=1', 'ARTHTYPE=2', 'ARTHTYPE=3', 'ASTHDX=1', 'ASTHDX=2', 82 | 'ADHDADDX=-1', 'ADHDADDX=1', 'ADHDADDX=2', 'PREGNT=-1', 'PREGNT=1', 83 | 'PREGNT=2', 'WLKLIM=-1', 'WLKLIM=1', 'WLKLIM=2', 'ACTLIM=-1', 84 | 'ACTLIM=1', 'ACTLIM=2', 'SOCLIM=-1', 'SOCLIM=1', 'SOCLIM=2', 85 | 'COGLIM=-1', 'COGLIM=1', 'COGLIM=2', 'DFHEAR42=-1', 'DFHEAR42=1', 86 | 'DFHEAR42=2', 'DFSEE42=-1', 'DFSEE42=1', 'DFSEE42=2', 87 | 'ADSMOK42=-1', 'ADSMOK42=1', 'ADSMOK42=2', 'PHQ242=-1', 'PHQ242=0', 88 | 'PHQ242=1', 'PHQ242=2', 'PHQ242=3', 'PHQ242=4', 'PHQ242=5', 89 | 'PHQ242=6', 'EMPST=-1', 'EMPST=1', 'EMPST=2', 'EMPST=3', 'EMPST=4', 90 | 'POVCAT=1', 'POVCAT=2', 'POVCAT=3', 'POVCAT=4', 'POVCAT=5', 91 | 'INSCOV=1', 'INSCOV=2', 'INSCOV=3', 'RACE'] 92 | 93 | y = df[response_name].values 94 | X = df[col_names].values 95 | 96 | if name=="meps_21": 97 | df = pd.read_csv(base_path + 'meps_21_reg_fix.csv') 98 | column_names = df.columns 99 | response_name = "UTILIZATION_reg" 100 | column_names = column_names[column_names!=response_name] 101 | column_names = column_names[column_names!="Unnamed: 0"] 102 | 103 | col_names = ['AGE', 'PCS42', 'MCS42', 'K6SUM42', 'PERWT16F', 'REGION=1', 104 | 'REGION=2', 'REGION=3', 'REGION=4', 'SEX=1', 'SEX=2', 'MARRY=1', 105 | 'MARRY=2', 'MARRY=3', 'MARRY=4', 'MARRY=5', 'MARRY=6', 'MARRY=7', 106 | 'MARRY=8', 'MARRY=9', 'MARRY=10', 'FTSTU=-1', 'FTSTU=1', 'FTSTU=2', 107 | 'FTSTU=3', 'ACTDTY=1', 'ACTDTY=2', 'ACTDTY=3', 'ACTDTY=4', 108 | 'HONRDC=1', 'HONRDC=2', 'HONRDC=3', 'HONRDC=4', 'RTHLTH=-1', 109 | 'RTHLTH=1', 'RTHLTH=2', 'RTHLTH=3', 'RTHLTH=4', 'RTHLTH=5', 110 | 'MNHLTH=-1', 'MNHLTH=1', 'MNHLTH=2', 'MNHLTH=3', 'MNHLTH=4', 111 | 'MNHLTH=5', 'HIBPDX=-1', 'HIBPDX=1', 'HIBPDX=2', 'CHDDX=-1', 112 | 'CHDDX=1', 'CHDDX=2', 'ANGIDX=-1', 'ANGIDX=1', 'ANGIDX=2', 113 | 'MIDX=-1', 'MIDX=1', 'MIDX=2', 'OHRTDX=-1', 'OHRTDX=1', 'OHRTDX=2', 114 | 'STRKDX=-1', 'STRKDX=1', 'STRKDX=2', 'EMPHDX=-1', 'EMPHDX=1', 115 | 'EMPHDX=2', 'CHBRON=-1', 'CHBRON=1', 'CHBRON=2', 'CHOLDX=-1', 116 | 'CHOLDX=1', 'CHOLDX=2', 'CANCERDX=-1', 'CANCERDX=1', 'CANCERDX=2', 117 | 'DIABDX=-1', 'DIABDX=1', 'DIABDX=2', 'JTPAIN=-1', 'JTPAIN=1', 118 | 'JTPAIN=2', 'ARTHDX=-1', 'ARTHDX=1', 'ARTHDX=2', 'ARTHTYPE=-1', 119 | 'ARTHTYPE=1', 'ARTHTYPE=2', 'ARTHTYPE=3', 'ASTHDX=1', 'ASTHDX=2', 120 | 'ADHDADDX=-1', 'ADHDADDX=1', 'ADHDADDX=2', 'PREGNT=-1', 'PREGNT=1', 121 | 'PREGNT=2', 'WLKLIM=-1', 'WLKLIM=1', 'WLKLIM=2', 'ACTLIM=-1', 122 | 'ACTLIM=1', 'ACTLIM=2', 'SOCLIM=-1', 'SOCLIM=1', 'SOCLIM=2', 123 | 'COGLIM=-1', 'COGLIM=1', 'COGLIM=2', 'DFHEAR42=-1', 'DFHEAR42=1', 124 | 'DFHEAR42=2', 'DFSEE42=-1', 'DFSEE42=1', 'DFSEE42=2', 125 | 'ADSMOK42=-1', 'ADSMOK42=1', 'ADSMOK42=2', 'PHQ242=-1', 'PHQ242=0', 126 | 'PHQ242=1', 'PHQ242=2', 'PHQ242=3', 'PHQ242=4', 'PHQ242=5', 127 | 'PHQ242=6', 'EMPST=-1', 'EMPST=1', 'EMPST=2', 'EMPST=3', 'EMPST=4', 128 | 'POVCAT=1', 'POVCAT=2', 'POVCAT=3', 'POVCAT=4', 'POVCAT=5', 129 | 'INSCOV=1', 'INSCOV=2', 'INSCOV=3', 'RACE'] 130 | 131 | y = df[response_name].values 132 | X = df[col_names].values 133 | 134 | if name=="star": 135 | df = pd.read_csv(base_path + 'STAR.csv') 136 | df.loc[df['gender'] == 'female', 'gender'] = 0 137 | df.loc[df['gender'] == 'male', 'gender'] = 1 138 | 139 | df.loc[df['ethnicity'] == 'cauc', 'ethnicity'] = 0 140 | df.loc[df['ethnicity'] == 'afam', 'ethnicity'] = 1 141 | df.loc[df['ethnicity'] == 'asian', 'ethnicity'] = 2 142 | df.loc[df['ethnicity'] == 'hispanic', 
'ethnicity'] = 3 143 | df.loc[df['ethnicity'] == 'amindian', 'ethnicity'] = 4 144 | df.loc[df['ethnicity'] == 'other', 'ethnicity'] = 5 145 | 146 | df.loc[df['stark'] == 'regular', 'stark'] = 0 147 | df.loc[df['stark'] == 'small', 'stark'] = 1 148 | df.loc[df['stark'] == 'regular+aide', 'stark'] = 2 149 | 150 | df.loc[df['star1'] == 'regular', 'star1'] = 0 151 | df.loc[df['star1'] == 'small', 'star1'] = 1 152 | df.loc[df['star1'] == 'regular+aide', 'star1'] = 2 153 | 154 | df.loc[df['star2'] == 'regular', 'star2'] = 0 155 | df.loc[df['star2'] == 'small', 'star2'] = 1 156 | df.loc[df['star2'] == 'regular+aide', 'star2'] = 2 157 | 158 | df.loc[df['star3'] == 'regular', 'star3'] = 0 159 | df.loc[df['star3'] == 'small', 'star3'] = 1 160 | df.loc[df['star3'] == 'regular+aide', 'star3'] = 2 161 | 162 | df.loc[df['lunchk'] == 'free', 'lunchk'] = 0 163 | df.loc[df['lunchk'] == 'non-free', 'lunchk'] = 1 164 | 165 | df.loc[df['lunch1'] == 'free', 'lunch1'] = 0 166 | df.loc[df['lunch1'] == 'non-free', 'lunch1'] = 1 167 | 168 | df.loc[df['lunch2'] == 'free', 'lunch2'] = 0 169 | df.loc[df['lunch2'] == 'non-free', 'lunch2'] = 1 170 | 171 | df.loc[df['lunch3'] == 'free', 'lunch3'] = 0 172 | df.loc[df['lunch3'] == 'non-free', 'lunch3'] = 1 173 | 174 | df.loc[df['schoolk'] == 'inner-city', 'schoolk'] = 0 175 | df.loc[df['schoolk'] == 'suburban', 'schoolk'] = 1 176 | df.loc[df['schoolk'] == 'rural', 'schoolk'] = 2 177 | df.loc[df['schoolk'] == 'urban', 'schoolk'] = 3 178 | 179 | df.loc[df['school1'] == 'inner-city', 'school1'] = 0 180 | df.loc[df['school1'] == 'suburban', 'school1'] = 1 181 | df.loc[df['school1'] == 'rural', 'school1'] = 2 182 | df.loc[df['school1'] == 'urban', 'school1'] = 3 183 | 184 | df.loc[df['school2'] == 'inner-city', 'school2'] = 0 185 | df.loc[df['school2'] == 'suburban', 'school2'] = 1 186 | df.loc[df['school2'] == 'rural', 'school2'] = 2 187 | df.loc[df['school2'] == 'urban', 'school2'] = 3 188 | 189 | df.loc[df['school3'] == 'inner-city', 'school3'] = 0 190 | df.loc[df['school3'] == 'suburban', 'school3'] = 1 191 | df.loc[df['school3'] == 'rural', 'school3'] = 2 192 | df.loc[df['school3'] == 'urban', 'school3'] = 3 193 | 194 | df.loc[df['degreek'] == 'bachelor', 'degreek'] = 0 195 | df.loc[df['degreek'] == 'master', 'degreek'] = 1 196 | df.loc[df['degreek'] == 'specialist', 'degreek'] = 2 197 | df.loc[df['degreek'] == 'master+', 'degreek'] = 3 198 | 199 | df.loc[df['degree1'] == 'bachelor', 'degree1'] = 0 200 | df.loc[df['degree1'] == 'master', 'degree1'] = 1 201 | df.loc[df['degree1'] == 'specialist', 'degree1'] = 2 202 | df.loc[df['degree1'] == 'phd', 'degree1'] = 3 203 | 204 | df.loc[df['degree2'] == 'bachelor', 'degree2'] = 0 205 | df.loc[df['degree2'] == 'master', 'degree2'] = 1 206 | df.loc[df['degree2'] == 'specialist', 'degree2'] = 2 207 | df.loc[df['degree2'] == 'phd', 'degree2'] = 3 208 | 209 | df.loc[df['degree3'] == 'bachelor', 'degree3'] = 0 210 | df.loc[df['degree3'] == 'master', 'degree3'] = 1 211 | df.loc[df['degree3'] == 'specialist', 'degree3'] = 2 212 | df.loc[df['degree3'] == 'phd', 'degree3'] = 3 213 | 214 | df.loc[df['ladderk'] == 'level1', 'ladderk'] = 0 215 | df.loc[df['ladderk'] == 'level2', 'ladderk'] = 1 216 | df.loc[df['ladderk'] == 'level3', 'ladderk'] = 2 217 | df.loc[df['ladderk'] == 'apprentice', 'ladderk'] = 3 218 | df.loc[df['ladderk'] == 'probation', 'ladderk'] = 4 219 | df.loc[df['ladderk'] == 'pending', 'ladderk'] = 5 220 | df.loc[df['ladderk'] == 'notladder', 'ladderk'] = 6 221 | 222 | 223 | df.loc[df['ladder1'] == 'level1', 'ladder1'] = 0 
224 | df.loc[df['ladder1'] == 'level2', 'ladder1'] = 1 225 | df.loc[df['ladder1'] == 'level3', 'ladder1'] = 2 226 | df.loc[df['ladder1'] == 'apprentice', 'ladder1'] = 3 227 | df.loc[df['ladder1'] == 'probation', 'ladder1'] = 4 228 | df.loc[df['ladder1'] == 'noladder', 'ladder1'] = 5 229 | df.loc[df['ladder1'] == 'notladder', 'ladder1'] = 6 230 | 231 | df.loc[df['ladder2'] == 'level1', 'ladder2'] = 0 232 | df.loc[df['ladder2'] == 'level2', 'ladder2'] = 1 233 | df.loc[df['ladder2'] == 'level3', 'ladder2'] = 2 234 | df.loc[df['ladder2'] == 'apprentice', 'ladder2'] = 3 235 | df.loc[df['ladder2'] == 'probation', 'ladder2'] = 4 236 | df.loc[df['ladder2'] == 'noladder', 'ladder2'] = 5 237 | df.loc[df['ladder2'] == 'notladder', 'ladder2'] = 6 238 | 239 | df.loc[df['ladder3'] == 'level1', 'ladder3'] = 0 240 | df.loc[df['ladder3'] == 'level2', 'ladder3'] = 1 241 | df.loc[df['ladder3'] == 'level3', 'ladder3'] = 2 242 | df.loc[df['ladder3'] == 'apprentice', 'ladder3'] = 3 243 | df.loc[df['ladder3'] == 'probation', 'ladder3'] = 4 244 | df.loc[df['ladder3'] == 'noladder', 'ladder3'] = 5 245 | df.loc[df['ladder3'] == 'notladder', 'ladder3'] = 6 246 | 247 | df.loc[df['tethnicityk'] == 'cauc', 'tethnicityk'] = 0 248 | df.loc[df['tethnicityk'] == 'afam', 'tethnicityk'] = 1 249 | 250 | df.loc[df['tethnicity1'] == 'cauc', 'tethnicity1'] = 0 251 | df.loc[df['tethnicity1'] == 'afam', 'tethnicity1'] = 1 252 | 253 | df.loc[df['tethnicity2'] == 'cauc', 'tethnicity2'] = 0 254 | df.loc[df['tethnicity2'] == 'afam', 'tethnicity2'] = 1 255 | 256 | df.loc[df['tethnicity3'] == 'cauc', 'tethnicity3'] = 0 257 | df.loc[df['tethnicity3'] == 'afam', 'tethnicity3'] = 1 258 | df.loc[df['tethnicity3'] == 'asian', 'tethnicity3'] = 2 259 | 260 | df = df.dropna() 261 | 262 | grade = df["readk"] + df["read1"] + df["read2"] + df["read3"] 263 | grade += df["mathk"] + df["math1"] + df["math2"] + df["math3"] 264 | 265 | 266 | names = df.columns 267 | target_names = names[8:16] 268 | data_names = np.concatenate((names[0:8],names[17:])) 269 | X = df.loc[:, data_names].values 270 | y = grade.values 271 | 272 | 273 | if name=="facebook_1": 274 | df = pd.read_csv(base_path + 'facebook/Features_Variant_1.csv') 275 | y = df.iloc[:,53].values 276 | X = df.iloc[:,0:53].values 277 | 278 | if name=="facebook_2": 279 | df = pd.read_csv(base_path + 'facebook/Features_Variant_2.csv') 280 | y = df.iloc[:,53].values 281 | X = df.iloc[:,0:53].values 282 | 283 | if name=="bio": 284 | #https://github.com/joefavergel/TertiaryPhysicochemicalProperties/blob/master/RMSD-ProteinTertiaryStructures.ipynb 285 | df = pd.read_csv(base_path + 'CASP.csv') 286 | y = df.iloc[:,0].values 287 | X = df.iloc[:,1:].values 288 | 289 | if name=='blog_data': 290 | # https://github.com/xinbinhuang/feature-selection_blogfeedback 291 | df = pd.read_csv(base_path + 'blogData_train.csv', header=None) 292 | X = df.iloc[:,0:280].values 293 | y = df.iloc[:,-1].values 294 | 295 | if name == "concrete": 296 | dataset = np.loadtxt(open(base_path + 'Concrete_Data.csv', "rb"), delimiter=",", skiprows=1) 297 | X = dataset[:, :-1] 298 | y = dataset[:, -1:] 299 | 300 | 301 | if name=="bike": 302 | # https://www.kaggle.com/rajmehra03/bike-sharing-demand-rmsle-0-3194 303 | df=pd.read_csv(base_path + 'bike_train.csv') 304 | 305 | # # seperating season as per values. this is bcoz this will enhance features. 306 | season=pd.get_dummies(df['season'],prefix='season') 307 | df=pd.concat([df,season],axis=1) 308 | 309 | # # # same for weather. this is bcoz this will enhance features. 
310 | weather=pd.get_dummies(df['weather'],prefix='weather') 311 | df=pd.concat([df,weather],axis=1) 312 | 313 | # # # now can drop weather and season. 314 | df.drop(['season','weather'],inplace=True,axis=1) 315 | df.head() 316 | 317 | df["hour"] = [t.hour for t in pd.DatetimeIndex(df.datetime)] 318 | df["day"] = [t.dayofweek for t in pd.DatetimeIndex(df.datetime)] 319 | df["month"] = [t.month for t in pd.DatetimeIndex(df.datetime)] 320 | df['year'] = [t.year for t in pd.DatetimeIndex(df.datetime)] 321 | df['year'] = df['year'].map({2011:0, 2012:1}) 322 | 323 | df.drop('datetime',axis=1,inplace=True) 324 | df.drop(['casual','registered'],axis=1,inplace=True) 325 | df.columns.to_series().groupby(df.dtypes).groups 326 | X = df.drop('count',axis=1).values 327 | y = df['count'].values 328 | 329 | if name=="community": 330 | # https://github.com/vbordalo/Communities-Crime/blob/master/Crime_v1.ipynb 331 | attrib = pd.read_csv(base_path + 'communities_attributes.csv', delim_whitespace = True) 332 | data = pd.read_csv(base_path + 'communities.data', names = attrib['attributes']) 333 | data = data.drop(columns=['state','county', 334 | 'community','communityname', 335 | 'fold'], axis=1) 336 | 337 | data = data.replace('?', np.nan) 338 | 339 | # Impute mean values for samples with missing values 340 | from sklearn.preprocessing import Imputer 341 | 342 | imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0) 343 | 344 | imputer = imputer.fit(data[['OtherPerCap']]) 345 | data[['OtherPerCap']] = imputer.transform(data[['OtherPerCap']]) 346 | data = data.dropna(axis=1) 347 | X = data.iloc[:, 0:100].values 348 | y = data.iloc[:, 100].values 349 | 350 | 351 | X = X.astype(np.float32) 352 | y = y.astype(np.float32) 353 | 354 | return X, y 355 | -------------------------------------------------------------------------------- /datasets/facebook/README.md: -------------------------------------------------------------------------------- 1 | 2 | Please download the files Features_Variant_1.csv and Features_Variant_2.csv from 3 | [this link](https://archive.ics.uci.edu/ml/datasets/Facebook+Comment+Volume+Dataset) and store the two in this directory. 4 | -------------------------------------------------------------------------------- /get_meps_data/README.md: -------------------------------------------------------------------------------- 1 | # Medical Expenditure Panel Survey data 2 | 3 | 4 | ## A quick guide: 5 | 6 | cd to the current code directory, and run 7 | 8 | ```Bash 9 | Rscript download_data.R 10 | ``` 11 | 12 | You should see the files h181.csv and h192.csv in the code directory. Then, to clean the raw files and create the datasets, run 13 | 14 | ```Bash 15 | python main_clean_and_save_to_csv.py 16 | ``` 17 | 18 | Now, you should see 3 new files: meps_19_reg.csv, meps_20_reg.csv, and meps_21_reg.csv. These are the csv files that we used in our experiments. 19 | 20 | The following sections provide more detailed explanation. 21 | 22 | ### Note: the code and the following text is copied from IBM's AIF360 package. 23 | 24 | The Medical Expenditure Panel Survey (MEPS) data consists of large scale surveys of families and individuals, medical providers, and employers, and collects data on health services used, costs & frequency of services, demographics, etc., of the respondents. 25 | 26 | Please refer to https://github.com/IBM/AIF360 for more details. 
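As a quick sanity check after running the two commands in the quick guide above, the produced files can be inspected with pandas. A minimal sketch (the `UTILIZATION_reg` response column is created by the cleaning scripts in this folder; `main_clean_and_save_to_csv.py` below does the same thing with an explicit list of feature columns):

```python
import pandas as pd

# assumes main_clean_and_save_to_csv.py has already written this file to the current folder
df = pd.read_csv('meps_19_reg.csv')
y = df['UTILIZATION_reg'].values                 # accumulated medical utilization (regression response)
X = df.drop(columns=['UTILIZATION_reg']).values  # all remaining columns
print(X.shape, y.shape)
```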
27 | 28 | ## Source / Data Set Description: 29 | 30 | 31 | * [2015 full Year Consolidated Data File](https://meps.ahrq.gov/mepsweb/data_stats/download_data_files_detail.jsp?cboPufNumber=HC-181): This file contains MEPS survey data for calendar year 2015 obtained in rounds 3, 4, and 5 of Panel 19, and rounds 1, 2, and 3 of Panel 20. 32 | 33 | * [2016 full Year Consolidated Data File](https://meps.ahrq.gov/mepsweb/data_stats/download_data_files_detail.jsp?cboPufNumber=HC-192): : This file contains MEPS survey data for calendar year 2016 obtained in rounds 3, 4, and 5 of Panel 20, and rounds 1, 2, and 3 of Panel 21. 34 | 35 | 36 | ## Data Use Agreement 37 | 38 | As the user of the data it is your responsibility to read and abide by any copyright/usage rules and restrictions as 39 | stated on the MEPS web site before downloading the data. 40 | 41 | - [Data Use Agreement (2015 Data File)](https://meps.ahrq.gov/data_stats/download_data/pufs/h181/h181doc.shtml#Data) 42 | - [Data Use Agreement (2016 Data File)](https://meps.ahrq.gov/data_stats/download_data/pufs/h192/h192doc.shtml#DataA) 43 | 44 | 45 | ## Download instructions 46 | 47 | In order to use the MEPS datasets, please follow the following directions to download the datafiles and convert into csv files. 48 | 49 | Follow either set of instructions below for using R or SPSS. Further instructions for SAS, and Stata, are available at 50 | the [AHRQ MEPS Github repository](https://github.com/HHS-AHRQ/MEPS). 51 | 52 | - **Generating CSV files with R** 53 | 54 | In the current folder run the R script `download_data.R`. R can be downloaded from [CRAN](https://cran.r-project.org). 55 | If you are working on Mac OS X the easiest way to get the R command line support is by installing it with 56 | [Homebrew](https://brew.sh/) `brew install R`. 57 | 58 | ```Bash 59 | Rscript download_data.R 60 | ``` 61 | 62 | Example output: 63 | 64 | ``` 65 | Loading required package: foreign 66 | 67 | trying URL 'https://meps.ahrq.gov/mepsweb/data_files/pufs/h181ssp.zip' 68 | Content type 'application/zip' length 13303652 bytes (12.7 MB) 69 | ================================================== 70 | downloaded 12.7 MB 71 | 72 | Loading dataframe from file: h181.ssp 73 | Exporting dataframe to file: h181.csv 74 | 75 | trying URL 'https://meps.ahrq.gov/mepsweb/data_files/pufs/h192ssp.zip' 76 | Content type 'application/zip' length 15505898 bytes (14.8 MB) 77 | ================================================== 78 | downloaded 14.8 MB 79 | 80 | Loading dataframe from file: h192.ssp 81 | Exporting dataframe to file: h192.csv 82 | ``` 83 | 84 | - **Generating CSV files with SPSS** 85 | 86 | The instructions below require the use of SPSS. 87 | 88 | 1. 2015 full Year Consolidated Data File 89 | * Download the [`Data File, ASCII format`](https://meps.ahrq.gov/mepsweb/data_files/pufs/h181dat.zip) 90 | * Extract the file `h181.dat` from downloaded zip archive 91 | * Convert the file to comma-delimited format, `h181.csv`, and save in this folder. 92 | * To convert the .dat file into csv format,download one of the programming statements files, such as the [SPSS Programming Statements](https://meps.ahrq.gov/mepsweb/data_stats/download_data/pufs/h181/h181spu.txt) file. 93 | * Edit this file to change the FILE HANDLE name to the complete path/name of the downloaded data file, execute the SPSS programming statements to load the data, and 'save as' a comma-delimited file called 'h181.csv' in the current folder. 94 | 95 | 2. 
2016 full Year Consolidated Data File 96 | * Download the [`Data File, ASCII format`](https://meps.ahrq.gov/mepsweb/data_files/pufs/h192dat.zip) 97 | * Extract the file `h192.dat` from downloaded zip archive 98 | * Convert the file to comma-delimited format, `h192.csv`, and save in current repository. 99 | * To convert the .dat file into csv format,download one of the programming statements files, such as the [SPSS Programming Statements](https://meps.ahrq.gov/mepsweb/data_stats/download_data/pufs/h192/h192spu.txt) file. 100 | * Edit this file to change the FILE HANDLE name to the complete path/name of the downloaded data file, execute the SPSS programming statements to load the data, and 'save as' a comma-delimited file called 'h192.csv' in this folder. 101 | 102 | ## Cleaning the Data 103 | 104 | To clean the raw files and create the 3 MEPS datasets used in the our paper, run 105 | 106 | ```Bash 107 | python main_clean_and_save_to_csv.py 108 | ``` 109 | 110 | which produces the files: 'meps_19_reg.csv', 'meps_20_reg.csv', and 'meps_21_reg.csv'. 111 | -------------------------------------------------------------------------------- /get_meps_data/base_dataset.py: -------------------------------------------------------------------------------- 1 | # Code copied from IBM's AIF360 package: 2 | # https://github.com/IBM/AIF360/blob/master/aif360/datasets/dataset.py 3 | 4 | from __future__ import absolute_import 5 | from __future__ import division 6 | from __future__ import print_function 7 | from __future__ import unicode_literals 8 | 9 | import abc 10 | import copy 11 | import sys 12 | 13 | if sys.version_info >= (3, 4): 14 | ABC = abc.ABC 15 | else: 16 | ABC = abc.ABCMeta(str('ABC'), (), {}) 17 | 18 | 19 | class BaseDataset(ABC): 20 | """Abstract base class for datasets.""" 21 | 22 | @abc.abstractmethod 23 | def __init__(self, **kwargs): 24 | self.metadata = kwargs.pop('metadata', dict()) or dict() 25 | self.metadata.update({ 26 | 'transformer': '{}.__init__'.format(type(self).__name__), 27 | 'params': kwargs, 28 | 'previous': [] 29 | }) 30 | self.validate_dataset() 31 | 32 | def validate_dataset(self): 33 | """Error checking and type validation.""" 34 | pass 35 | 36 | def copy(self, deepcopy=False): 37 | """Convenience method to return a copy of this dataset. 38 | 39 | Args: 40 | deepcopy (bool, optional): :func:`~copy.deepcopy` this dataset if 41 | `True`, shallow copy otherwise. 42 | 43 | Returns: 44 | Dataset: A new dataset with fields copied from this object and 45 | metadata set accordingly. 46 | """ 47 | cpy = copy.deepcopy(self) if deepcopy else copy.copy(self) 48 | # preserve any user-created fields 49 | cpy.metadata = cpy.metadata.copy() 50 | cpy.metadata.update({ 51 | 'transformer': '{}.copy'.format(type(self).__name__), 52 | 'params': {'deepcopy': deepcopy}, 53 | 'previous': [self] 54 | }) 55 | return cpy 56 | 57 | @abc.abstractmethod 58 | def export_dataset(self): 59 | """Save this Dataset to disk.""" 60 | raise NotImplementedError 61 | 62 | @abc.abstractmethod 63 | def split(self, num_or_size_splits, shuffle=False): 64 | """Split this dataset into multiple partitions. 65 | 66 | Args: 67 | num_or_size_splits (array or int): If `num_or_size_splits` is an 68 | int, *k*, the value is the number of equal-sized folds to make 69 | (if *k* does not evenly divide the dataset these folds are 70 | approximately equal-sized). If `num_or_size_splits` is an array 71 | of type int, the values are taken as the indices at which to 72 | split the dataset. 
If the values are floats (< 1.), they are 73 | considered to be fractional proportions of the dataset at which 74 | to split. 75 | shuffle (bool, optional): Randomly shuffle the dataset before 76 | splitting. 77 | 78 | Returns: 79 | list(Dataset): Splits. Contains *k* or `len(num_or_size_splits) + 1` 80 | datasets depending on `num_or_size_splits`. 81 | """ 82 | raise NotImplementedError 83 | -------------------------------------------------------------------------------- /get_meps_data/download_data.R: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env Rscript 2 | 3 | # Code copied from IBM's aif360 package, https://github.com/IBM/AIF360 4 | 5 | # This R script can be used to download the Medical Expenditure Panel Survey (MEPS) 6 | # data files for 2015 and 2016 and convert the files from SAS transport format into 7 | # standard CSV files. 8 | 9 | usage_note <- paste("", 10 | "By using this script you acknowledge the responsibility for reading and", 11 | "abiding by any copyright/usage rules and restrictions as stated on the", 12 | "MEPS web site (https://meps.ahrq.gov/data_stats/data_use.jsp).", 13 | "", 14 | "Continue [y/n]? > ", sep = "\n") 15 | 16 | cat(usage_note) 17 | answer <- scan("stdin", character(), n=1, quiet=TRUE) 18 | 19 | if (tolower(answer) != 'y') { 20 | opt <- options(show.error.messages=FALSE) 21 | on.exit(options(opt)) 22 | stop() 23 | } 24 | 25 | if (!require("foreign")) { 26 | install.packages("foreign") 27 | library(foreign) 28 | } 29 | 30 | convert <- function(ssp_file, csv_file) { 31 | message("Loading dataframe from file: ", ssp_file) 32 | df = read.xport(ssp_file) 33 | message("Exporting dataframe to file: ", csv_file) 34 | write.csv(df, file=csv_file, row.names=FALSE, quote=FALSE) 35 | } 36 | 37 | for (dataset in c("h181", "h192")) { 38 | zip_file <- paste(dataset, "ssp.zip", sep="") 39 | ssp_file <- paste(dataset, "ssp", sep=".") 40 | csv_file <- paste(dataset, "csv", sep=".") 41 | url <- paste("https://meps.ahrq.gov/mepsweb/data_files/pufs", zip_file, sep="/") 42 | 43 | # skip to next dataset if we already have the CSV file 44 | if (file.exists(csv_file)) { 45 | message(csv_file, " already exists") 46 | next 47 | } 48 | 49 | # download the zip file only if not downloaded before 50 | if (!file.exists(zip_file)) { 51 | download.file(url, destfile=zip_file) 52 | } 53 | 54 | # unzip and convert the dataset from SAS transport format to CSV 55 | unzip(zip_file) 56 | convert(ssp_file, csv_file) 57 | 58 | # clean up temporary files if we got the CSV file 59 | if (file.exists(csv_file)) { 60 | file.remove(zip_file) 61 | file.remove(ssp_file) 62 | } 63 | } 64 | -------------------------------------------------------------------------------- /get_meps_data/main_clean_and_save_to_csv.py: -------------------------------------------------------------------------------- 1 | 2 | # Code based on IBM's AIF360 software package, suggesting a simple modification 3 | # that accumulates the medical utilization variables without binarization 4 | 5 | # Load packages 6 | from meps_dataset_panel19_fy2015_reg import MEPSDataset19Reg 7 | from meps_dataset_panel20_fy2015_reg import MEPSDataset20Reg 8 | from meps_dataset_panel21_fy2016_reg import MEPSDataset21Reg 9 | 10 | import numpy as np 11 | 12 | print("Cleaning and saving MEPS 19, 20 and 21") 13 | 14 | # Load raw MEPS 19 data, extract and clean the features, then save to meps_19.csv 15 | MEPSDataset19Reg() 16 | 17 | # Load raw MEPS 20 data, extract and clean the features, then save to 
meps_20.csv 18 | MEPSDataset20Reg() 19 | 20 | # Load raw MEPS 21 data, extract and clean the features, then save to meps_21.csv 21 | MEPSDataset21Reg() 22 | 23 | 24 | print("Done.") 25 | 26 | ############################################################################### 27 | ############################################################################### 28 | 29 | 30 | # We now show how to load the processed csv file 31 | import pandas as pd 32 | 33 | print("Loading processed data and printing the dimensions") 34 | 35 | 36 | ############################################################################## 37 | # MEPS 19 38 | ############################################################################## 39 | 40 | # Load the processed meps_19_reg.csv, extract features X and response y 41 | df = pd.read_csv('meps_19_reg.csv') 42 | column_names = df.columns 43 | response_name = "UTILIZATION_reg" 44 | column_names = column_names[column_names!=response_name] 45 | column_names = column_names[column_names!="Unnamed: 0"] 46 | 47 | col_names = ['AGE', 'PCS42', 'MCS42', 'K6SUM42', 'PERWT15F', 'REGION=1', 48 | 'REGION=2', 'REGION=3', 'REGION=4', 'SEX=1', 'SEX=2', 'MARRY=1', 49 | 'MARRY=2', 'MARRY=3', 'MARRY=4', 'MARRY=5', 'MARRY=6', 'MARRY=7', 50 | 'MARRY=8', 'MARRY=9', 'MARRY=10', 'FTSTU=-1', 'FTSTU=1', 'FTSTU=2', 51 | 'FTSTU=3', 'ACTDTY=1', 'ACTDTY=2', 'ACTDTY=3', 'ACTDTY=4', 52 | 'HONRDC=1', 'HONRDC=2', 'HONRDC=3', 'HONRDC=4', 'RTHLTH=-1', 53 | 'RTHLTH=1', 'RTHLTH=2', 'RTHLTH=3', 'RTHLTH=4', 'RTHLTH=5', 54 | 'MNHLTH=-1', 'MNHLTH=1', 'MNHLTH=2', 'MNHLTH=3', 'MNHLTH=4', 55 | 'MNHLTH=5', 'HIBPDX=-1', 'HIBPDX=1', 'HIBPDX=2', 'CHDDX=-1', 56 | 'CHDDX=1', 'CHDDX=2', 'ANGIDX=-1', 'ANGIDX=1', 'ANGIDX=2', 57 | 'MIDX=-1', 'MIDX=1', 'MIDX=2', 'OHRTDX=-1', 'OHRTDX=1', 'OHRTDX=2', 58 | 'STRKDX=-1', 'STRKDX=1', 'STRKDX=2', 'EMPHDX=-1', 'EMPHDX=1', 59 | 'EMPHDX=2', 'CHBRON=-1', 'CHBRON=1', 'CHBRON=2', 'CHOLDX=-1', 60 | 'CHOLDX=1', 'CHOLDX=2', 'CANCERDX=-1', 'CANCERDX=1', 'CANCERDX=2', 61 | 'DIABDX=-1', 'DIABDX=1', 'DIABDX=2', 'JTPAIN=-1', 'JTPAIN=1', 62 | 'JTPAIN=2', 'ARTHDX=-1', 'ARTHDX=1', 'ARTHDX=2', 'ARTHTYPE=-1', 63 | 'ARTHTYPE=1', 'ARTHTYPE=2', 'ARTHTYPE=3', 'ASTHDX=1', 'ASTHDX=2', 64 | 'ADHDADDX=-1', 'ADHDADDX=1', 'ADHDADDX=2', 'PREGNT=-1', 'PREGNT=1', 65 | 'PREGNT=2', 'WLKLIM=-1', 'WLKLIM=1', 'WLKLIM=2', 'ACTLIM=-1', 66 | 'ACTLIM=1', 'ACTLIM=2', 'SOCLIM=-1', 'SOCLIM=1', 'SOCLIM=2', 67 | 'COGLIM=-1', 'COGLIM=1', 'COGLIM=2', 'DFHEAR42=-1', 'DFHEAR42=1', 68 | 'DFHEAR42=2', 'DFSEE42=-1', 'DFSEE42=1', 'DFSEE42=2', 69 | 'ADSMOK42=-1', 'ADSMOK42=1', 'ADSMOK42=2', 'PHQ242=-1', 'PHQ242=0', 70 | 'PHQ242=1', 'PHQ242=2', 'PHQ242=3', 'PHQ242=4', 'PHQ242=5', 71 | 'PHQ242=6', 'EMPST=-1', 'EMPST=1', 'EMPST=2', 'EMPST=3', 'EMPST=4', 72 | 'POVCAT=1', 'POVCAT=2', 'POVCAT=3', 'POVCAT=4', 'POVCAT=5', 73 | 'INSCOV=1', 'INSCOV=2', 'INSCOV=3', 'RACE'] 74 | 75 | y = df[response_name].values 76 | X = df[col_names].values 77 | 78 | print("MEPS 19: n = " + str(X.shape[0]) + " p = " + str(X.shape[1]) + " response len = " + str(y.shape[0])) 79 | 80 | 81 | ############################################################################## 82 | # MEPS 20 83 | ############################################################################## 84 | 85 | # Load the processed meps_20_reg.csv, extract features X and response y 86 | df = pd.read_csv('meps_20_reg.csv') 87 | column_names = df.columns 88 | response_name = "UTILIZATION_reg" 89 | column_names = column_names[column_names!=response_name] 90 | column_names = column_names[column_names!="Unnamed: 
0"] 91 | 92 | col_names = ['AGE', 'PCS42', 'MCS42', 'K6SUM42', 'PERWT15F', 'REGION=1', 93 | 'REGION=2', 'REGION=3', 'REGION=4', 'SEX=1', 'SEX=2', 'MARRY=1', 94 | 'MARRY=2', 'MARRY=3', 'MARRY=4', 'MARRY=5', 'MARRY=6', 'MARRY=7', 95 | 'MARRY=8', 'MARRY=9', 'MARRY=10', 'FTSTU=-1', 'FTSTU=1', 'FTSTU=2', 96 | 'FTSTU=3', 'ACTDTY=1', 'ACTDTY=2', 'ACTDTY=3', 'ACTDTY=4', 97 | 'HONRDC=1', 'HONRDC=2', 'HONRDC=3', 'HONRDC=4', 'RTHLTH=-1', 98 | 'RTHLTH=1', 'RTHLTH=2', 'RTHLTH=3', 'RTHLTH=4', 'RTHLTH=5', 99 | 'MNHLTH=-1', 'MNHLTH=1', 'MNHLTH=2', 'MNHLTH=3', 'MNHLTH=4', 100 | 'MNHLTH=5', 'HIBPDX=-1', 'HIBPDX=1', 'HIBPDX=2', 'CHDDX=-1', 101 | 'CHDDX=1', 'CHDDX=2', 'ANGIDX=-1', 'ANGIDX=1', 'ANGIDX=2', 102 | 'MIDX=-1', 'MIDX=1', 'MIDX=2', 'OHRTDX=-1', 'OHRTDX=1', 'OHRTDX=2', 103 | 'STRKDX=-1', 'STRKDX=1', 'STRKDX=2', 'EMPHDX=-1', 'EMPHDX=1', 104 | 'EMPHDX=2', 'CHBRON=-1', 'CHBRON=1', 'CHBRON=2', 'CHOLDX=-1', 105 | 'CHOLDX=1', 'CHOLDX=2', 'CANCERDX=-1', 'CANCERDX=1', 'CANCERDX=2', 106 | 'DIABDX=-1', 'DIABDX=1', 'DIABDX=2', 'JTPAIN=-1', 'JTPAIN=1', 107 | 'JTPAIN=2', 'ARTHDX=-1', 'ARTHDX=1', 'ARTHDX=2', 'ARTHTYPE=-1', 108 | 'ARTHTYPE=1', 'ARTHTYPE=2', 'ARTHTYPE=3', 'ASTHDX=1', 'ASTHDX=2', 109 | 'ADHDADDX=-1', 'ADHDADDX=1', 'ADHDADDX=2', 'PREGNT=-1', 'PREGNT=1', 110 | 'PREGNT=2', 'WLKLIM=-1', 'WLKLIM=1', 'WLKLIM=2', 'ACTLIM=-1', 111 | 'ACTLIM=1', 'ACTLIM=2', 'SOCLIM=-1', 'SOCLIM=1', 'SOCLIM=2', 112 | 'COGLIM=-1', 'COGLIM=1', 'COGLIM=2', 'DFHEAR42=-1', 'DFHEAR42=1', 113 | 'DFHEAR42=2', 'DFSEE42=-1', 'DFSEE42=1', 'DFSEE42=2', 114 | 'ADSMOK42=-1', 'ADSMOK42=1', 'ADSMOK42=2', 'PHQ242=-1', 'PHQ242=0', 115 | 'PHQ242=1', 'PHQ242=2', 'PHQ242=3', 'PHQ242=4', 'PHQ242=5', 116 | 'PHQ242=6', 'EMPST=-1', 'EMPST=1', 'EMPST=2', 'EMPST=3', 'EMPST=4', 117 | 'POVCAT=1', 'POVCAT=2', 'POVCAT=3', 'POVCAT=4', 'POVCAT=5', 118 | 'INSCOV=1', 'INSCOV=2', 'INSCOV=3', 'RACE'] 119 | 120 | 121 | y = df[response_name].values 122 | X = df[col_names].values 123 | 124 | print("MEPS 20: n = " + str(X.shape[0]) + " p = " + str(X.shape[1]) + " response len = " + str(y.shape[0])) 125 | 126 | 127 | ############################################################################## 128 | # MEPS 21 129 | ############################################################################## 130 | 131 | # Load the processed meps_21_reg.csv, extract features X and response y 132 | df = pd.read_csv('meps_21_reg.csv') 133 | column_names = df.columns 134 | response_name = "UTILIZATION_reg" 135 | column_names = column_names[column_names!=response_name] 136 | column_names = column_names[column_names!="Unnamed: 0"] 137 | 138 | col_names = ['AGE', 'PCS42', 'MCS42', 'K6SUM42', 'PERWT16F', 'REGION=1', 139 | 'REGION=2', 'REGION=3', 'REGION=4', 'SEX=1', 'SEX=2', 'MARRY=1', 140 | 'MARRY=2', 'MARRY=3', 'MARRY=4', 'MARRY=5', 'MARRY=6', 'MARRY=7', 141 | 'MARRY=8', 'MARRY=9', 'MARRY=10', 'FTSTU=-1', 'FTSTU=1', 'FTSTU=2', 142 | 'FTSTU=3', 'ACTDTY=1', 'ACTDTY=2', 'ACTDTY=3', 'ACTDTY=4', 143 | 'HONRDC=1', 'HONRDC=2', 'HONRDC=3', 'HONRDC=4', 'RTHLTH=-1', 144 | 'RTHLTH=1', 'RTHLTH=2', 'RTHLTH=3', 'RTHLTH=4', 'RTHLTH=5', 145 | 'MNHLTH=-1', 'MNHLTH=1', 'MNHLTH=2', 'MNHLTH=3', 'MNHLTH=4', 146 | 'MNHLTH=5', 'HIBPDX=-1', 'HIBPDX=1', 'HIBPDX=2', 'CHDDX=-1', 147 | 'CHDDX=1', 'CHDDX=2', 'ANGIDX=-1', 'ANGIDX=1', 'ANGIDX=2', 148 | 'MIDX=-1', 'MIDX=1', 'MIDX=2', 'OHRTDX=-1', 'OHRTDX=1', 'OHRTDX=2', 149 | 'STRKDX=-1', 'STRKDX=1', 'STRKDX=2', 'EMPHDX=-1', 'EMPHDX=1', 150 | 'EMPHDX=2', 'CHBRON=-1', 'CHBRON=1', 'CHBRON=2', 'CHOLDX=-1', 151 | 'CHOLDX=1', 'CHOLDX=2', 'CANCERDX=-1', 'CANCERDX=1', 
'CANCERDX=2', 152 | 'DIABDX=-1', 'DIABDX=1', 'DIABDX=2', 'JTPAIN=-1', 'JTPAIN=1', 153 | 'JTPAIN=2', 'ARTHDX=-1', 'ARTHDX=1', 'ARTHDX=2', 'ARTHTYPE=-1', 154 | 'ARTHTYPE=1', 'ARTHTYPE=2', 'ARTHTYPE=3', 'ASTHDX=1', 'ASTHDX=2', 155 | 'ADHDADDX=-1', 'ADHDADDX=1', 'ADHDADDX=2', 'PREGNT=-1', 'PREGNT=1', 156 | 'PREGNT=2', 'WLKLIM=-1', 'WLKLIM=1', 'WLKLIM=2', 'ACTLIM=-1', 157 | 'ACTLIM=1', 'ACTLIM=2', 'SOCLIM=-1', 'SOCLIM=1', 'SOCLIM=2', 158 | 'COGLIM=-1', 'COGLIM=1', 'COGLIM=2', 'DFHEAR42=-1', 'DFHEAR42=1', 159 | 'DFHEAR42=2', 'DFSEE42=-1', 'DFSEE42=1', 'DFSEE42=2', 160 | 'ADSMOK42=-1', 'ADSMOK42=1', 'ADSMOK42=2', 'PHQ242=-1', 'PHQ242=0', 161 | 'PHQ242=1', 'PHQ242=2', 'PHQ242=3', 'PHQ242=4', 'PHQ242=5', 162 | 'PHQ242=6', 'EMPST=-1', 'EMPST=1', 'EMPST=2', 'EMPST=3', 'EMPST=4', 163 | 'POVCAT=1', 'POVCAT=2', 'POVCAT=3', 'POVCAT=4', 'POVCAT=5', 164 | 'INSCOV=1', 'INSCOV=2', 'INSCOV=3', 'RACE'] 165 | 166 | 167 | y = df[response_name].values 168 | X = df[col_names].values 169 | 170 | print("MEPS 21: n = " + str(X.shape[0]) + " p = " + str(X.shape[1]) + " response len = " + str(y.shape[0])) 171 | -------------------------------------------------------------------------------- /get_meps_data/meps_dataset_panel19_fy2015_reg.py: -------------------------------------------------------------------------------- 1 | # This code is a variant of 2 | # https://github.com/IBM/AIF360/blob/master/aif360/datasets/meps_dataset_panel19_fy2015.py 3 | 4 | from __future__ import absolute_import 5 | from __future__ import division 6 | from __future__ import print_function 7 | from __future__ import unicode_literals 8 | 9 | import pandas as pd 10 | 11 | from save_dataset import SaveDataset 12 | 13 | default_mappings = { 14 | 'protected_attribute_maps': [{1.0: 'White', 0.0: 'Non-White'}] 15 | } 16 | 17 | def default_preprocessing(df): 18 | """ 19 | 1.Create a new column, RACE that is 'White' if RACEV2X = 1 and HISPANX = 2 i.e. non Hispanic White 20 | and 'non-White' otherwise 21 | 2. Restrict to Panel 19 22 | 3. RENAME all columns that are PANEL/ROUND SPECIFIC 23 | 4. Drop rows based on certain values of individual features that correspond to missing/unknown - generally < -1 24 | 5. 
Compute UTILIZATION 25 | """ 26 | def race(row): 27 | if ((row['HISPANX'] == 2) and (row['RACEV2X'] == 1)): #non-Hispanic Whites are marked as WHITE; all others as NON-WHITE 28 | return 'White' 29 | return 'Non-White' 30 | 31 | df['RACEV2X'] = df.apply(lambda row: race(row), axis=1) 32 | df = df.rename(columns = {'RACEV2X' : 'RACE'}) 33 | 34 | df = df[df['PANEL'] == 19] 35 | 36 | # RENAME COLUMNS 37 | df = df.rename(columns = {'FTSTU53X' : 'FTSTU', 'ACTDTY53' : 'ACTDTY', 'HONRDC53' : 'HONRDC', 'RTHLTH53' : 'RTHLTH', 38 | 'MNHLTH53' : 'MNHLTH', 'CHBRON53' : 'CHBRON', 'JTPAIN53' : 'JTPAIN', 'PREGNT53' : 'PREGNT', 39 | 'WLKLIM53' : 'WLKLIM', 'ACTLIM53' : 'ACTLIM', 'SOCLIM53' : 'SOCLIM', 'COGLIM53' : 'COGLIM', 40 | 'EMPST53' : 'EMPST', 'REGION53' : 'REGION', 'MARRY53X' : 'MARRY', 'AGE53X' : 'AGE', 41 | 'POVCAT15' : 'POVCAT', 'INSCOV15' : 'INSCOV'}) 42 | 43 | df = df[df['REGION'] >= 0] # remove values -1 44 | df = df[df['AGE'] >= 0] # remove values -1 45 | 46 | df = df[df['MARRY'] >= 0] # remove values -1, -7, -8, -9 47 | 48 | df = df[df['ASTHDX'] >= 0] # remove values -1, -7, -8, -9 49 | 50 | df = df[(df[['FTSTU','ACTDTY','HONRDC','RTHLTH','MNHLTH','HIBPDX','CHDDX','ANGIDX','EDUCYR','HIDEG', 51 | 'MIDX','OHRTDX','STRKDX','EMPHDX','CHBRON','CHOLDX','CANCERDX','DIABDX', 52 | 'JTPAIN','ARTHDX','ARTHTYPE','ASTHDX','ADHDADDX','PREGNT','WLKLIM', 53 | 'ACTLIM','SOCLIM','COGLIM','DFHEAR42','DFSEE42','ADSMOK42', 54 | 'PHQ242','EMPST','POVCAT','INSCOV']] >= -1).all(1)] #for all other categorical features, remove values < -1 55 | 56 | df = df[(df[['OBTOTV15', 'OPTOTV15', 'ERTOT15', 'IPNGTD15', 'HHTOTD15']]>=0).all(1)] 57 | 58 | def utilization(row): 59 | return row['OBTOTV15'] + row['OPTOTV15'] + row['ERTOT15'] + row['IPNGTD15'] + row['HHTOTD15'] 60 | 61 | df['TOTEXP15'] = df.apply(lambda row: utilization(row), axis=1) 62 | 63 | df = df.rename(columns = {'TOTEXP15' : 'UTILIZATION_reg'}) 64 | return df 65 | 66 | 67 | class MEPSDataset19Reg(SaveDataset): 68 | """MEPS Dataset. 
69 | """ 70 | 71 | def __init__(self, label_name='UTILIZATION_reg', favorable_classes=[1.0], 72 | protected_attribute_names=['RACE'], 73 | privileged_classes=[['White']], 74 | instance_weights_name='PERWT15F', 75 | categorical_features=['REGION','SEX','MARRY', 76 | 'FTSTU','ACTDTY','HONRDC','RTHLTH','MNHLTH','HIBPDX','CHDDX','ANGIDX', 77 | 'MIDX','OHRTDX','STRKDX','EMPHDX','CHBRON','CHOLDX','CANCERDX','DIABDX', 78 | 'JTPAIN','ARTHDX','ARTHTYPE','ASTHDX','ADHDADDX','PREGNT','WLKLIM', 79 | 'ACTLIM','SOCLIM','COGLIM','DFHEAR42','DFSEE42','ADSMOK42', 80 | 'PHQ242','EMPST','POVCAT','INSCOV'], 81 | features_to_keep=['REGION','AGE','SEX','RACE','MARRY', 82 | 'FTSTU','ACTDTY','HONRDC','RTHLTH','MNHLTH','HIBPDX','CHDDX','ANGIDX', 83 | 'MIDX','OHRTDX','STRKDX','EMPHDX','CHBRON','CHOLDX','CANCERDX','DIABDX', 84 | 'JTPAIN','ARTHDX','ARTHTYPE','ASTHDX','ADHDADDX','PREGNT','WLKLIM', 85 | 'ACTLIM','SOCLIM','COGLIM','DFHEAR42','DFSEE42','ADSMOK42','PCS42', 86 | 'MCS42','K6SUM42','PHQ242','EMPST','POVCAT','INSCOV','UTILIZATION_reg','PERWT15F'], 87 | features_to_drop=[], 88 | na_values=[], custom_preprocessing=default_preprocessing, 89 | metadata=default_mappings): 90 | 91 | filepath = './h181.csv' 92 | 93 | df = pd.read_csv(filepath, sep=',', na_values=na_values) 94 | 95 | super(MEPSDataset19Reg, self).__init__(df=df, label_name=label_name, 96 | favorable_classes=favorable_classes, 97 | protected_attribute_names=protected_attribute_names, 98 | privileged_classes=privileged_classes, 99 | instance_weights_name=instance_weights_name, 100 | categorical_features=categorical_features, 101 | features_to_keep=features_to_keep, 102 | features_to_drop=features_to_drop, na_values=na_values, 103 | custom_preprocessing=custom_preprocessing, metadata=metadata, dataset_name='meps_19_reg') 104 | -------------------------------------------------------------------------------- /get_meps_data/meps_dataset_panel20_fy2015_reg.py: -------------------------------------------------------------------------------- 1 | # This code is a variant of 2 | # https://github.com/IBM/AIF360/blob/master/aif360/datasets/meps_dataset_panel20_fy2015.py 3 | 4 | from __future__ import absolute_import 5 | from __future__ import division 6 | from __future__ import print_function 7 | from __future__ import unicode_literals 8 | 9 | import pandas as pd 10 | 11 | #from standard_datasets import StandardDataset 12 | from save_dataset import SaveDataset 13 | 14 | default_mappings = { 15 | 'protected_attribute_maps': [{1.0: 'White', 0.0: 'Non-White'}] 16 | } 17 | 18 | def default_preprocessing(df): 19 | """ 20 | 1.Create a new column, RACE that is 'White' if RACEV2X = 1 and HISPANX = 2 i.e. non Hispanic White 21 | and 'non-White' otherwise 22 | 2. Restrict to Panel 20 23 | 3. RENAME all columns that are PANEL/ROUND SPECIFIC 24 | 4. Drop rows based on certain values of individual features that correspond to missing/unknown - generally < -1 25 | 5. 
Compute UTILIZATION as the sum of the five utilization counts; in this regression variant it is kept as a continuous target (not binarized) 26 | """ 27 | def race(row): 28 | if ((row['HISPANX'] == 2) and (row['RACEV2X'] == 1)): #non-Hispanic Whites are marked as WHITE; all others as NON-WHITE 29 | return 'White' 30 | return 'Non-White' 31 | 32 | df['RACEV2X'] = df.apply(lambda row: race(row), axis=1) 33 | df = df.rename(columns = {'RACEV2X' : 'RACE'}) 34 | 35 | df = df[df['PANEL'] == 20] 36 | 37 | # RENAME COLUMNS 38 | df = df.rename(columns = {'FTSTU53X' : 'FTSTU', 'ACTDTY53' : 'ACTDTY', 'HONRDC53' : 'HONRDC', 'RTHLTH53' : 'RTHLTH', 39 | 'MNHLTH53' : 'MNHLTH', 'CHBRON53' : 'CHBRON', 'JTPAIN53' : 'JTPAIN', 'PREGNT53' : 'PREGNT', 40 | 'WLKLIM53' : 'WLKLIM', 'ACTLIM53' : 'ACTLIM', 'SOCLIM53' : 'SOCLIM', 'COGLIM53' : 'COGLIM', 41 | 'EMPST53' : 'EMPST', 'REGION53' : 'REGION', 'MARRY53X' : 'MARRY', 'AGE53X' : 'AGE', 42 | 'POVCAT15' : 'POVCAT', 'INSCOV15' : 'INSCOV'}) 43 | 44 | df = df[df['REGION'] >= 0] # remove values -1 45 | df = df[df['AGE'] >= 0] # remove values -1 46 | 47 | df = df[df['MARRY'] >= 0] # remove values -1, -7, -8, -9 48 | 49 | df = df[df['ASTHDX'] >= 0] # remove values -1, -7, -8, -9 50 | 51 | df = df[(df[['FTSTU','ACTDTY','HONRDC','RTHLTH','MNHLTH','HIBPDX','CHDDX','ANGIDX','EDUCYR','HIDEG', 52 | 'MIDX','OHRTDX','STRKDX','EMPHDX','CHBRON','CHOLDX','CANCERDX','DIABDX', 53 | 'JTPAIN','ARTHDX','ARTHTYPE','ASTHDX','ADHDADDX','PREGNT','WLKLIM', 54 | 'ACTLIM','SOCLIM','COGLIM','DFHEAR42','DFSEE42','ADSMOK42', 55 | 'PHQ242','EMPST','POVCAT','INSCOV']] >= -1).all(1)] #for all other categorical features, remove values < -1 56 | 57 | df = df[(df[['OBTOTV15', 'OPTOTV15', 'ERTOT15', 'IPNGTD15', 'HHTOTD15']]>=0).all(1)] 58 | 59 | def utilization(row): 60 | return row['OBTOTV15'] + row['OPTOTV15'] + row['ERTOT15'] + row['IPNGTD15'] + row['HHTOTD15'] 61 | 62 | df['TOTEXP15'] = df.apply(lambda row: utilization(row), axis=1) 63 | 64 | df = df.rename(columns = {'TOTEXP15' : 'UTILIZATION_reg'}) 65 | return df 66 | 67 | 68 | class MEPSDataset20Reg(SaveDataset): 69 | """MEPS Dataset.
70 | """ 71 | 72 | def __init__(self, label_name='UTILIZATION_reg', favorable_classes=[1.0], 73 | protected_attribute_names=['RACE'], 74 | privileged_classes=[['White']], 75 | instance_weights_name='PERWT15F', 76 | categorical_features=['REGION','SEX','MARRY', 77 | 'FTSTU','ACTDTY','HONRDC','RTHLTH','MNHLTH','HIBPDX','CHDDX','ANGIDX', 78 | 'MIDX','OHRTDX','STRKDX','EMPHDX','CHBRON','CHOLDX','CANCERDX','DIABDX', 79 | 'JTPAIN','ARTHDX','ARTHTYPE','ASTHDX','ADHDADDX','PREGNT','WLKLIM', 80 | 'ACTLIM','SOCLIM','COGLIM','DFHEAR42','DFSEE42', 'ADSMOK42', 'PHQ242', 81 | 'EMPST','POVCAT','INSCOV'], 82 | features_to_keep=['REGION','AGE','SEX','RACE','MARRY', 83 | 'FTSTU','ACTDTY','HONRDC','RTHLTH','MNHLTH','HIBPDX','CHDDX','ANGIDX', 84 | 'MIDX','OHRTDX','STRKDX','EMPHDX','CHBRON','CHOLDX','CANCERDX','DIABDX', 85 | 'JTPAIN','ARTHDX','ARTHTYPE','ASTHDX','ADHDADDX','PREGNT','WLKLIM', 86 | 'ACTLIM','SOCLIM','COGLIM','DFHEAR42','DFSEE42', 'ADSMOK42', 87 | 'PCS42', 88 | 'MCS42','K6SUM42','PHQ242','EMPST','POVCAT','INSCOV','UTILIZATION_reg', 'PERWT15F'], 89 | features_to_drop=[], 90 | na_values=[], custom_preprocessing=default_preprocessing, 91 | metadata=default_mappings): 92 | 93 | filepath = './h181.csv' 94 | 95 | df = pd.read_csv(filepath, sep=',', na_values=na_values) 96 | 97 | super(MEPSDataset20Reg, self).__init__(df=df, label_name=label_name, 98 | favorable_classes=favorable_classes, 99 | protected_attribute_names=protected_attribute_names, 100 | privileged_classes=privileged_classes, 101 | instance_weights_name=instance_weights_name, 102 | categorical_features=categorical_features, 103 | features_to_keep=features_to_keep, 104 | features_to_drop=features_to_drop, na_values=na_values, 105 | custom_preprocessing=custom_preprocessing, metadata=metadata, dataset_name='meps_20_reg') 106 | -------------------------------------------------------------------------------- /get_meps_data/meps_dataset_panel21_fy2016_reg.py: -------------------------------------------------------------------------------- 1 | # This code is a variant of 2 | # https://github.com/IBM/AIF360/blob/master/aif360/datasets/meps_dataset_panel21_fy2016.py 3 | 4 | from __future__ import absolute_import 5 | from __future__ import division 6 | from __future__ import print_function 7 | from __future__ import unicode_literals 8 | 9 | import pandas as pd 10 | 11 | #from standard_dataset import StandardDataset 12 | from save_dataset import SaveDataset 13 | 14 | default_mappings = { 15 | 'protected_attribute_maps': [{1.0: 'White', 0.0: 'Non-White'}] 16 | } 17 | 18 | def default_preprocessing(df): 19 | """ 20 | 1.Create a new column, RACE that is 'White' if RACEV2X = 1 and HISPANX = 2 i.e. non Hispanic White 21 | and 'Non-White' otherwise 22 | 2. Restrict to Panel 21 23 | 3. RENAME all columns that are PANEL/ROUND SPECIFIC 24 | 4. Drop rows based on certain values of individual features that correspond to missing/unknown - generally < -1 25 | 5. 
Compute UTILIZATION as the sum of the five utilization counts; in this regression variant it is kept as a continuous target (not binarized) 26 | """ 27 | def race(row): 28 | if ((row['HISPANX'] == 2) and (row['RACEV2X'] == 1)): #non-Hispanic Whites are marked as WHITE; all others as NON-WHITE 29 | return 'White' 30 | return 'Non-White' 31 | 32 | df['RACEV2X'] = df.apply(lambda row: race(row), axis=1) 33 | df = df.rename(columns = {'RACEV2X' : 'RACE'}) 34 | 35 | df = df[df['PANEL'] == 21] 36 | 37 | # RENAME COLUMNS 38 | df = df.rename(columns = {'FTSTU53X' : 'FTSTU', 'ACTDTY53' : 'ACTDTY', 'HONRDC53' : 'HONRDC', 'RTHLTH53' : 'RTHLTH', 39 | 'MNHLTH53' : 'MNHLTH', 'CHBRON53' : 'CHBRON', 'JTPAIN53' : 'JTPAIN', 'PREGNT53' : 'PREGNT', 40 | 'WLKLIM53' : 'WLKLIM', 'ACTLIM53' : 'ACTLIM', 'SOCLIM53' : 'SOCLIM', 'COGLIM53' : 'COGLIM', 41 | 'EMPST53' : 'EMPST', 'REGION53' : 'REGION', 'MARRY53X' : 'MARRY', 'AGE53X' : 'AGE', 42 | 'POVCAT16' : 'POVCAT', 'INSCOV16' : 'INSCOV'}) 43 | 44 | df = df[df['REGION'] >= 0] # remove values -1 45 | df = df[df['AGE'] >= 0] # remove values -1 46 | 47 | df = df[df['MARRY'] >= 0] # remove values -1, -7, -8, -9 48 | 49 | df = df[df['ASTHDX'] >= 0] # remove values -1, -7, -8, -9 50 | 51 | df = df[(df[['FTSTU','ACTDTY','HONRDC','RTHLTH','MNHLTH','HIBPDX','CHDDX','ANGIDX','EDUCYR','HIDEG', 52 | 'MIDX','OHRTDX','STRKDX','EMPHDX','CHBRON','CHOLDX','CANCERDX','DIABDX', 53 | 'JTPAIN','ARTHDX','ARTHTYPE','ASTHDX','ADHDADDX','PREGNT','WLKLIM', 54 | 'ACTLIM','SOCLIM','COGLIM','DFHEAR42','DFSEE42','ADSMOK42', 55 | 'PHQ242','EMPST','POVCAT','INSCOV']] >= -1).all(1)] #for all other categorical features, remove values < -1 56 | 57 | df = df[(df[['OBTOTV16', 'OPTOTV16', 'ERTOT16', 'IPNGTD16', 'HHTOTD16']]>=0).all(1)] 58 | 59 | def utilization(row): 60 | return row['OBTOTV16'] + row['OPTOTV16'] + row['ERTOT16'] + row['IPNGTD16'] + row['HHTOTD16'] 61 | 62 | df['TOTEXP16'] = df.apply(lambda row: utilization(row), axis=1) 63 | 64 | df = df.rename(columns = {'TOTEXP16' : 'UTILIZATION_reg'}) 65 | return df 66 | 67 | 68 | class MEPSDataset21Reg(SaveDataset): 69 | """MEPS Dataset.
70 | """ 71 | 72 | def __init__(self, label_name='UTILIZATION_reg', favorable_classes=[1.0], 73 | protected_attribute_names=['RACE'], 74 | privileged_classes=[['White']], 75 | instance_weights_name='PERWT16F', 76 | categorical_features=['REGION','SEX','MARRY', 77 | 'FTSTU','ACTDTY','HONRDC','RTHLTH','MNHLTH','HIBPDX','CHDDX','ANGIDX', 78 | 'MIDX','OHRTDX','STRKDX','EMPHDX','CHBRON','CHOLDX','CANCERDX','DIABDX', 79 | 'JTPAIN','ARTHDX','ARTHTYPE','ASTHDX','ADHDADDX','PREGNT','WLKLIM', 80 | 'ACTLIM','SOCLIM','COGLIM','DFHEAR42','DFSEE42', 'ADSMOK42', 'PHQ242', 81 | 'EMPST','POVCAT','INSCOV'], 82 | features_to_keep=['REGION','AGE','SEX','RACE','MARRY', 83 | 'FTSTU','ACTDTY','HONRDC','RTHLTH','MNHLTH','HIBPDX','CHDDX','ANGIDX', 84 | 'MIDX','OHRTDX','STRKDX','EMPHDX','CHBRON','CHOLDX','CANCERDX','DIABDX', 85 | 'JTPAIN','ARTHDX','ARTHTYPE','ASTHDX','ADHDADDX','PREGNT','WLKLIM', 86 | 'ACTLIM','SOCLIM','COGLIM','DFHEAR42','DFSEE42','ADSMOK42', 87 | 'PCS42', 88 | 'MCS42','K6SUM42','PHQ242','EMPST','POVCAT','INSCOV','UTILIZATION_reg', 'PERWT16F'], 89 | features_to_drop=[], 90 | na_values=[], custom_preprocessing=default_preprocessing, 91 | metadata=default_mappings): 92 | 93 | filepath = './h192.csv' 94 | df = pd.read_csv(filepath, sep=',', na_values=na_values) 95 | 96 | super(MEPSDataset21Reg, self).__init__(df=df, label_name=label_name, 97 | favorable_classes=favorable_classes, 98 | protected_attribute_names=protected_attribute_names, 99 | privileged_classes=privileged_classes, 100 | instance_weights_name=instance_weights_name, 101 | categorical_features=categorical_features, 102 | features_to_keep=features_to_keep, 103 | features_to_drop=features_to_drop, na_values=na_values, 104 | custom_preprocessing=custom_preprocessing, metadata=metadata, dataset_name='meps_21_reg') 105 | -------------------------------------------------------------------------------- /get_meps_data/regression_dataset.py: -------------------------------------------------------------------------------- 1 | # Code copied from IBM's AIF360 package 2 | # https://github.com/IBM/AIF360/blob/master/aif360/datasets/binary_label_dataset.py 3 | 4 | from __future__ import absolute_import 5 | from __future__ import division 6 | from __future__ import print_function 7 | from __future__ import unicode_literals 8 | 9 | import numpy as np 10 | 11 | from structured_dataset import StructuredDataset 12 | 13 | 14 | class RegressionDataset(StructuredDataset): 15 | """Base class for all structured datasets with binary labels.""" 16 | 17 | def __init__(self, favorable_label=1., unfavorable_label=0., **kwargs): 18 | """ 19 | Args: 20 | favorable_label (float): Label value which is considered favorable 21 | (i.e. "positive"). 22 | unfavorable_label (float): Label value which is considered 23 | unfavorable (i.e. "negative"). 24 | **kwargs: StructuredDataset arguments. 25 | """ 26 | self.favorable_label = float(favorable_label) 27 | self.unfavorable_label = float(unfavorable_label) 28 | 29 | super(RegressionDataset, self).__init__(**kwargs) 30 | 31 | def validate_dataset(self): 32 | """Error checking and type validation. 33 | 34 | Raises: 35 | ValueError: `labels` must be shape [n, 1]. 36 | ValueError: `favorable_label` and `unfavorable_label` must be the 37 | only values present in `labels`. 
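A tiny illustration (not repository code) of the value check described above: labels pass validation only when every entry equals the designated favorable or unfavorable value.

```python
import numpy as np

labels = np.array([[1.0], [0.0], [1.0]])   # shape [n, 1], as required
assert set(labels.ravel()) <= {1.0, 0.0}   # the check performed in validate_dataset below
```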
38 | """ 39 | super(RegressionDataset, self).validate_dataset() 40 | 41 | # =========================== SHAPE CHECKING =========================== 42 | # Verify if the labels are only 1 column 43 | if self.labels.shape[1] != 1: 44 | raise ValueError("BinaryLabelDataset only supports single-column " 45 | "labels:\n\tlabels.shape = {}".format(self.labels.shape)) 46 | 47 | # =========================== VALUE CHECKING =========================== 48 | # Check if the favorable and unfavorable labels match those in the dataset 49 | if (not set(self.labels.ravel()) <= 50 | set([self.favorable_label, self.unfavorable_label])): 51 | raise ValueError("The favorable and unfavorable labels provided do " 52 | "not match the labels in the dataset.") 53 | 54 | if np.all(self.scores == self.labels): 55 | self.scores = (self.scores == self.favorable_label).astype(np.float64) 56 | -------------------------------------------------------------------------------- /get_meps_data/save_dataset.py: -------------------------------------------------------------------------------- 1 | # Code copied from IBM's AIF360 package 2 | # https://github.com/IBM/AIF360/blob/master/aif360/datasets/standard_dataset.py 3 | 4 | from __future__ import absolute_import 5 | from __future__ import division 6 | from __future__ import print_function 7 | from __future__ import unicode_literals 8 | 9 | from logging import warning 10 | 11 | import numpy as np 12 | import pandas as pd 13 | 14 | from regression_dataset import RegressionDataset 15 | 16 | 17 | class SaveDataset(RegressionDataset): 18 | """Base class for every :obj:`RegressionDataset`. The code is similar 19 | to that of aif360. 20 | 21 | It is not strictly necessary to inherit this class when adding custom 22 | datasets but it may be useful. 23 | 24 | This class is very loosely based on code from 25 | https://github.com/algofairness/fairness-comparison. 26 | """ 27 | 28 | def __init__(self, df, label_name, favorable_classes, 29 | protected_attribute_names, privileged_classes, 30 | instance_weights_name='', scores_name='', 31 | categorical_features=[], features_to_keep=[], 32 | features_to_drop=[], na_values=[], custom_preprocessing=None, 33 | metadata=None, dataset_name='my_data'): 34 | """ 35 | Subclasses of StandardDataset should perform the following before 36 | calling `super().__init__`: 37 | 38 | 1. Load the dataframe from a raw file. 39 | 40 | Then, this class will go through a standard preprocessing routine which: 41 | 42 | 2. (optional) Performs some dataset-specific preprocessing (e.g. 43 | renaming columns/values, handling missing data). 44 | 45 | 3. Drops unrequested columns (see `features_to_keep` and 46 | `features_to_drop` for details). 47 | 48 | 4. Drops rows with NA values. 49 | 50 | 5. Creates a one-hot encoding of the categorical variables. 51 | 52 | 6. Maps protected attributes to binary privileged/unprivileged 53 | values (1/0). 54 | 55 | Args: 56 | df (pandas.DataFrame): DataFrame on which to perform standard 57 | processing. 58 | label_name: Name of the label column in `df`. 59 | favorable_classes (list or function): Label values which are 60 | considered favorable or a boolean function which returns `True` 61 | if favorable. All others are unfavorable. Label values are 62 | mapped to 1 (favorable) and 0 (unfavorable) if they are not 63 | already binary and numerical. 64 | protected_attribute_names (list): List of names corresponding to 65 | protected attribute columns in `df`. 
66 | privileged_classes (list(list or function)): Each element is 67 | a list of values which are considered privileged or a boolean 68 | function which return `True` if privileged for the corresponding 69 | column in `protected_attribute_names`. All others are 70 | unprivileged. Values are mapped to 1 (privileged) and 0 71 | (unprivileged) if they are not already numerical. 72 | instance_weights_name (optional): Name of the instance weights 73 | column in `df`. 74 | categorical_features (optional, list): List of column names in the 75 | DataFrame which are to be expanded into one-hot vectors. 76 | features_to_keep (optional, list): Column names to keep. All others 77 | are dropped except those present in `protected_attribute_names`, 78 | `categorical_features`, `label_name` or `instance_weights_name`. 79 | Defaults to all columns if not provided. 80 | features_to_drop (optional, list): Column names to drop. *Note: this 81 | overrides* `features_to_keep`. 82 | na_values (optional): Additional strings to recognize as NA. See 83 | :func:`pandas.read_csv` for details. 84 | custom_preprocessing (function): A function object which 85 | acts on and returns a DataFrame (f: DataFrame -> DataFrame). If 86 | `None`, no extra preprocessing is applied. 87 | metadata (optional): Additional metadata to append. 88 | """ 89 | # 2. Perform dataset-specific preprocessing 90 | if custom_preprocessing: 91 | df = custom_preprocessing(df) 92 | 93 | # 3. Drop unrequested columns 94 | features_to_keep = features_to_keep or df.columns.tolist() 95 | keep = (set(features_to_keep) | set(protected_attribute_names) 96 | | set(categorical_features) | set([label_name])) 97 | if instance_weights_name: 98 | keep |= set([instance_weights_name]) 99 | df = df[sorted(keep - set(features_to_drop), key=df.columns.get_loc)] 100 | categorical_features = sorted(set(categorical_features) - set(features_to_drop), key=df.columns.get_loc) 101 | 102 | # 4. Remove any rows that have missing data. 103 | dropped = df.dropna() 104 | count = df.shape[0] - dropped.shape[0] 105 | if count > 0: 106 | warning("Missing Data: {} rows removed from {}.".format(count, 107 | type(self).__name__)) 108 | df = dropped 109 | 110 | # 5. Create a one-hot encoding of the categorical variables. 111 | df = pd.get_dummies(df, columns=categorical_features, prefix_sep='=') 112 | 113 | # 6. Map protected attributes to privileged/unprivileged 114 | privileged_protected_attributes = [] 115 | unprivileged_protected_attributes = [] 116 | for attr, vals in zip(protected_attribute_names, privileged_classes): 117 | privileged_values = [1.] 118 | unprivileged_values = [0.] 
119 | if callable(vals): 120 | df[attr] = df[attr].apply(vals) 121 | elif np.issubdtype(df[attr].dtype, np.number): 122 | # this attribute is numeric; no remapping needed 123 | privileged_values = vals 124 | unprivileged_values = list(set(df[attr]).difference(vals)) 125 | else: 126 | # find all instances which match any of the attribute values 127 | priv = np.array([ ( el in vals ) for el in df[attr] ]) 128 | df.loc[priv, attr] = privileged_values[0] 129 | df.loc[~priv, attr] = unprivileged_values[0] 130 | 131 | privileged_protected_attributes.append( 132 | np.array(privileged_values, dtype=np.float64)) 133 | unprivileged_protected_attributes.append( 134 | np.array(unprivileged_values, dtype=np.float64)) 135 | 136 | full_name = dataset_name + ".csv" 137 | print("writing file: " + full_name) 138 | df.to_csv(full_name) 139 | -------------------------------------------------------------------------------- /nonconformist/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yromano/cqr/73267abb7ed7d3c6dad6ab4449154db7ec306535/nonconformist/.DS_Store -------------------------------------------------------------------------------- /nonconformist/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ 4 | docstring 5 | """ 6 | 7 | # Authors: Henrik Linusson 8 | # Yaniv Romano modified np.py file to include CQR 9 | 10 | __version__ = '2.1.0' 11 | 12 | __all__ = ['icp', 'nc', 'acp'] 13 | -------------------------------------------------------------------------------- /nonconformist/acp.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ 4 | Aggregated conformal predictors 5 | """ 6 | 7 | # Authors: Henrik Linusson 8 | 9 | import numpy as np 10 | from sklearn.cross_validation import KFold, StratifiedKFold 11 | from sklearn.cross_validation import ShuffleSplit, StratifiedShuffleSplit 12 | from sklearn.base import clone 13 | from nonconformist.base import BaseEstimator 14 | from nonconformist.util import calc_p 15 | 16 | 17 | # ----------------------------------------------------------------------------- 18 | # Sampling strategies 19 | # ----------------------------------------------------------------------------- 20 | class BootstrapSampler(object): 21 | """Bootstrap sampler. 22 | 23 | See also 24 | -------- 25 | CrossSampler, RandomSubSampler 26 | 27 | Examples 28 | -------- 29 | """ 30 | def gen_samples(self, y, n_samples, problem_type): 31 | for i in range(n_samples): 32 | idx = np.array(range(y.size)) 33 | train = np.random.choice(y.size, y.size, replace=True) 34 | cal_mask = np.array(np.ones(idx.size), dtype=bool) 35 | for j in train: 36 | cal_mask[j] = False 37 | cal = idx[cal_mask] 38 | 39 | yield train, cal 40 | 41 | 42 | class CrossSampler(object): 43 | """Cross-fold sampler. 44 | 45 | See also 46 | -------- 47 | BootstrapSampler, RandomSubSampler 48 | 49 | Examples 50 | -------- 51 | """ 52 | def gen_samples(self, y, n_samples, problem_type): 53 | if problem_type == 'classification': 54 | folds = StratifiedKFold(y, n_folds=n_samples) 55 | else: 56 | folds = KFold(y.size, n_folds=n_samples) 57 | for train, cal in folds: 58 | yield train, cal 59 | 60 | 61 | class RandomSubSampler(object): 62 | """Random subsample sampler. 63 | 64 | Parameters 65 | ---------- 66 | calibration_portion : float 67 | Ratio (0-1) of examples to use for calibration. 
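A short sketch (toy data, not repository code) of the sampler protocol consumed by `AggregatedCp` further down: every sampler yields `(train, calibration)` index pairs.

```python
import numpy as np
from nonconformist.acp import BootstrapSampler

y = np.arange(20, dtype=float)
sampler = BootstrapSampler()
train, cal = next(sampler.gen_samples(y, n_samples=1, problem_type='regression'))
# `train` holds indices drawn with replacement; `cal` holds the left-out (out-of-bag) indices.
```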
68 | 69 | See also 70 | -------- 71 | BootstrapSampler, CrossSampler 72 | 73 | Examples 74 | -------- 75 | """ 76 | def __init__(self, calibration_portion=0.3): 77 | self.cal_portion = calibration_portion 78 | 79 | def gen_samples(self, y, n_samples, problem_type): 80 | if problem_type == 'classification': 81 | splits = StratifiedShuffleSplit(y, 82 | n_iter=n_samples, 83 | test_size=self.cal_portion) 84 | else: 85 | splits = ShuffleSplit(y.size, 86 | n_iter=n_samples, 87 | test_size=self.cal_portion) 88 | 89 | for train, cal in splits: 90 | yield train, cal 91 | 92 | 93 | # ----------------------------------------------------------------------------- 94 | # Conformal ensemble 95 | # ----------------------------------------------------------------------------- 96 | class AggregatedCp(BaseEstimator): 97 | """Aggregated conformal predictor. 98 | 99 | Combines multiple IcpClassifier or IcpRegressor predictors into an 100 | aggregated model. 101 | 102 | Parameters 103 | ---------- 104 | predictor : object 105 | Prototype conformal predictor (e.g. IcpClassifier or IcpRegressor) 106 | used for defining conformal predictors included in the aggregate model. 107 | 108 | sampler : object 109 | Sampler object used to generate training and calibration examples 110 | for the underlying conformal predictors. 111 | 112 | aggregation_func : callable 113 | Function used to aggregate the predictions of the underlying 114 | conformal predictors. Defaults to ``numpy.mean``. 115 | 116 | n_models : int 117 | Number of models to aggregate. 118 | 119 | Attributes 120 | ---------- 121 | predictor : object 122 | Prototype conformal predictor. 123 | 124 | predictors : list 125 | List of underlying conformal predictors. 126 | 127 | sampler : object 128 | Sampler object used to generate training and calibration examples. 129 | 130 | agg_func : callable 131 | Function used to aggregate the predictions of the underlying 132 | conformal predictors 133 | 134 | References 135 | ---------- 136 | .. [1] Vovk, V. (2013). Cross-conformal predictors. Annals of Mathematics 137 | and Artificial Intelligence, 1-20. 138 | 139 | .. [2] Carlsson, L., Eklund, M., & Norinder, U. (2014). Aggregated 140 | Conformal Prediction. In Artificial Intelligence Applications and 141 | Innovations (pp. 231-240). Springer Berlin Heidelberg. 142 | 143 | Examples 144 | -------- 145 | """ 146 | def __init__(self, 147 | predictor, 148 | sampler=BootstrapSampler(), 149 | aggregation_func=None, 150 | n_models=10): 151 | self.predictors = [] 152 | self.n_models = n_models 153 | self.predictor = predictor 154 | self.sampler = sampler 155 | 156 | if aggregation_func is not None: 157 | self.agg_func = aggregation_func 158 | else: 159 | self.agg_func = lambda x: np.mean(x, axis=2) 160 | 161 | def fit(self, x, y): 162 | """Fit underlying conformal predictors. 163 | 164 | Parameters 165 | ---------- 166 | x : numpy array of shape [n_samples, n_features] 167 | Inputs of examples for fitting the underlying conformal predictors. 168 | 169 | y : numpy array of shape [n_samples] 170 | Outputs of examples for fitting the underlying conformal predictors. 
171 | 172 | Returns 173 | ------- 174 | None 175 | """ 176 | self.n_train = y.size 177 | self.predictors = [] 178 | idx = np.random.permutation(y.size) 179 | x, y = x[idx, :], y[idx] 180 | problem_type = self.predictor.__class__.get_problem_type() 181 | samples = self.sampler.gen_samples(y, 182 | self.n_models, 183 | problem_type) 184 | for train, cal in samples: 185 | predictor = clone(self.predictor) 186 | predictor.fit(x[train, :], y[train]) 187 | predictor.calibrate(x[cal, :], y[cal]) 188 | self.predictors.append(predictor) 189 | 190 | if problem_type == 'classification': 191 | self.classes = self.predictors[0].classes 192 | 193 | def predict(self, x, significance=None): 194 | """Predict the output values for a set of input patterns. 195 | 196 | Parameters 197 | ---------- 198 | x : numpy array of shape [n_samples, n_features] 199 | Inputs of patters for which to predict output values. 200 | 201 | significance : float or None 202 | Significance level (maximum allowed error rate) of predictions. 203 | Should be a float between 0 and 1. If ``None``, then the p-values 204 | are output rather than the predictions. Note: ``significance=None`` 205 | is applicable to classification problems only. 206 | 207 | Returns 208 | ------- 209 | p : numpy array of shape [n_samples, n_classes] or [n_samples, 2] 210 | For classification problems: If significance is ``None``, then p 211 | contains the p-values for each sample-class pair; if significance 212 | is a float between 0 and 1, then p is a boolean array denoting 213 | which labels are included in the prediction sets. 214 | 215 | For regression problems: Prediction interval (minimum and maximum 216 | boundaries) for the set of test patterns. 217 | """ 218 | is_regression =\ 219 | self.predictor.__class__.get_problem_type() == 'regression' 220 | 221 | n_examples = x.shape[0] 222 | 223 | if is_regression and significance is None: 224 | signs = np.arange(0.01, 1.0, 0.01) 225 | pred = np.zeros((n_examples, 2, signs.size)) 226 | for i, s in enumerate(signs): 227 | predictions = np.dstack([p.predict(x, s) 228 | for p in self.predictors]) 229 | predictions = self.agg_func(predictions) 230 | pred[:, :, i] = predictions 231 | return pred 232 | else: 233 | def f(p, x): 234 | return p.predict(x, significance if is_regression else None) 235 | predictions = np.dstack([f(p, x) for p in self.predictors]) 236 | predictions = self.agg_func(predictions) 237 | 238 | if significance and not is_regression: 239 | return predictions >= significance 240 | else: 241 | return predictions 242 | 243 | 244 | class CrossConformalClassifier(AggregatedCp): 245 | """Cross-conformal classifier. 246 | 247 | Combines multiple IcpClassifiers into a cross-conformal classifier. 248 | 249 | Parameters 250 | ---------- 251 | predictor : object 252 | Prototype conformal predictor (e.g. IcpClassifier or IcpRegressor) 253 | used for defining conformal predictors included in the aggregate model. 254 | 255 | aggregation_func : callable 256 | Function used to aggregate the predictions of the underlying 257 | conformal predictors. Defaults to ``numpy.mean``. 258 | 259 | n_models : int 260 | Number of models to aggregate. 261 | 262 | Attributes 263 | ---------- 264 | predictor : object 265 | Prototype conformal predictor. 266 | 267 | predictors : list 268 | List of underlying conformal predictors. 269 | 270 | sampler : object 271 | Sampler object used to generate training and calibration examples. 
272 | 273 | agg_func : callable 274 | Function used to aggregate the predictions of the underlying 275 | conformal predictors 276 | 277 | References 278 | ---------- 279 | .. [1] Vovk, V. (2013). Cross-conformal predictors. Annals of Mathematics 280 | and Artificial Intelligence, 1-20. 281 | 282 | Examples 283 | -------- 284 | """ 285 | def __init__(self, 286 | predictor, 287 | n_models=10): 288 | super(CrossConformalClassifier, self).__init__(predictor, 289 | CrossSampler(), 290 | n_models) 291 | 292 | def predict(self, x, significance=None): 293 | ncal_ngt_neq = np.stack([p._get_stats(x) for p in self.predictors], 294 | axis=3) 295 | ncal_ngt_neq = ncal_ngt_neq.sum(axis=3) 296 | 297 | p = calc_p(ncal_ngt_neq[:, :, 0], 298 | ncal_ngt_neq[:, :, 1], 299 | ncal_ngt_neq[:, :, 2], 300 | smoothing=self.predictors[0].smoothing) 301 | 302 | if significance: 303 | return p > significance 304 | else: 305 | return p 306 | 307 | 308 | class BootstrapConformalClassifier(AggregatedCp): 309 | """Bootstrap conformal classifier. 310 | 311 | Combines multiple IcpClassifiers into a bootstrap conformal classifier. 312 | 313 | Parameters 314 | ---------- 315 | predictor : object 316 | Prototype conformal predictor (e.g. IcpClassifier or IcpRegressor) 317 | used for defining conformal predictors included in the aggregate model. 318 | 319 | aggregation_func : callable 320 | Function used to aggregate the predictions of the underlying 321 | conformal predictors. Defaults to ``numpy.mean``. 322 | 323 | n_models : int 324 | Number of models to aggregate. 325 | 326 | Attributes 327 | ---------- 328 | predictor : object 329 | Prototype conformal predictor. 330 | 331 | predictors : list 332 | List of underlying conformal predictors. 333 | 334 | sampler : object 335 | Sampler object used to generate training and calibration examples. 336 | 337 | agg_func : callable 338 | Function used to aggregate the predictions of the underlying 339 | conformal predictors 340 | 341 | References 342 | ---------- 343 | .. [1] Vovk, V. (2013). Cross-conformal predictors. Annals of Mathematics 344 | and Artificial Intelligence, 1-20. 
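A minimal end-to-end sketch (toy data, not from the repository) of the aggregated conformal predictor defined above, here wrapping an inductive conformal regressor; the constructor forms follow the doctest style used elsewhere in this package.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from nonconformist.base import RegressorAdapter
from nonconformist.nc import RegressorNc, AbsErrorErrFunc
from nonconformist.icp import IcpRegressor
from nonconformist.acp import AggregatedCp, BootstrapSampler

x, y = np.random.rand(200, 4), np.random.rand(200)
icp = IcpRegressor(RegressorNc(RegressorAdapter(RandomForestRegressor()), AbsErrorErrFunc()))
acp = AggregatedCp(icp, sampler=BootstrapSampler(), n_models=5)
acp.fit(x[:150], y[:150])
intervals = acp.predict(x[150:], significance=0.1)   # shape [n_test, 2]: lower / upper interval bounds
```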
345 | 346 | Examples 347 | -------- 348 | """ 349 | def __init__(self, 350 | predictor, 351 | n_models=10): 352 | super(BootstrapConformalClassifier, self).__init__(predictor, 353 | BootstrapSampler(), 354 | n_models) 355 | 356 | def predict(self, x, significance=None): 357 | ncal_ngt_neq = np.stack([p._get_stats(x) for p in self.predictors], 358 | axis=3) 359 | ncal_ngt_neq = ncal_ngt_neq.sum(axis=3) 360 | 361 | p = calc_p(ncal_ngt_neq[:, :, 0] + ncal_ngt_neq[:, :, 0] / self.n_train, 362 | ncal_ngt_neq[:, :, 1] + ncal_ngt_neq[:, :, 0] / self.n_train, 363 | ncal_ngt_neq[:, :, 2], 364 | smoothing=self.predictors[0].smoothing) 365 | 366 | if significance: 367 | return p > significance 368 | else: 369 | return p 370 | -------------------------------------------------------------------------------- /nonconformist/base.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ 4 | docstring 5 | """ 6 | 7 | # Authors: Henrik Linusson 8 | 9 | import abc 10 | import numpy as np 11 | 12 | from sklearn.base import BaseEstimator 13 | 14 | 15 | class RegressorMixin(object): 16 | def __init__(self): 17 | super(RegressorMixin, self).__init__() 18 | 19 | @classmethod 20 | def get_problem_type(cls): 21 | return 'regression' 22 | 23 | 24 | class ClassifierMixin(object): 25 | def __init__(self): 26 | super(ClassifierMixin, self).__init__() 27 | 28 | @classmethod 29 | def get_problem_type(cls): 30 | return 'classification' 31 | 32 | 33 | class BaseModelAdapter(BaseEstimator): 34 | __metaclass__ = abc.ABCMeta 35 | 36 | def __init__(self, model, fit_params=None): 37 | super(BaseModelAdapter, self).__init__() 38 | 39 | self.model = model 40 | self.last_x, self.last_y = None, None 41 | self.clean = False 42 | self.fit_params = {} if fit_params is None else fit_params 43 | 44 | def fit(self, x, y): 45 | """Fits the model. 46 | 47 | Parameters 48 | ---------- 49 | x : numpy array of shape [n_samples, n_features] 50 | Inputs of examples for fitting the model. 51 | 52 | y : numpy array of shape [n_samples] 53 | Outputs of examples for fitting the model. 54 | 55 | Returns 56 | ------- 57 | None 58 | """ 59 | 60 | self.model.fit(x, y, **self.fit_params) 61 | self.clean = False 62 | 63 | def predict(self, x): 64 | """Returns the prediction made by the underlying model. 65 | 66 | Parameters 67 | ---------- 68 | x : numpy array of shape [n_samples, n_features] 69 | Inputs of test examples. 70 | 71 | Returns 72 | ------- 73 | y : numpy array of shape [n_samples] 74 | Predicted outputs of test examples. 75 | """ 76 | if ( 77 | not self.clean or 78 | self.last_x is None or 79 | self.last_y is None or 80 | not np.array_equal(self.last_x, x) 81 | ): 82 | self.last_x = x 83 | self.last_y = self._underlying_predict(x) 84 | self.clean = True 85 | 86 | return self.last_y.copy() 87 | 88 | @abc.abstractmethod 89 | def _underlying_predict(self, x): 90 | """Produces a prediction using the encapsulated model. 91 | 92 | Parameters 93 | ---------- 94 | x : numpy array of shape [n_samples, n_features] 95 | Inputs of test examples. 96 | 97 | Returns 98 | ------- 99 | y : numpy array of shape [n_samples] 100 | Predicted outputs of test examples. 
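For concreteness, a small sketch (toy data, not repository code) of the adapter layer defined in this module: adapters expose a uniform `fit`/`predict` interface around an underlying model and cache the last prediction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from nonconformist.base import RegressorAdapter

x, y = np.random.rand(50, 2), np.random.rand(50)
adapter = RegressorAdapter(LinearRegression())
adapter.fit(x, y)             # delegates to LinearRegression.fit
y_hat = adapter.predict(x)    # result is cached until the inputs change
```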
101 | """ 102 | pass 103 | 104 | 105 | class ClassifierAdapter(BaseModelAdapter): 106 | def __init__(self, model, fit_params=None): 107 | super(ClassifierAdapter, self).__init__(model, fit_params) 108 | 109 | def _underlying_predict(self, x): 110 | return self.model.predict_proba(x) 111 | 112 | 113 | class RegressorAdapter(BaseModelAdapter): 114 | def __init__(self, model, fit_params=None): 115 | super(RegressorAdapter, self).__init__(model, fit_params) 116 | 117 | def _underlying_predict(self, x): 118 | return self.model.predict(x) 119 | 120 | 121 | class OobMixin(object): 122 | def __init__(self, model, fit_params=None): 123 | super(OobMixin, self).__init__(model, fit_params) 124 | self.train_x = None 125 | 126 | def fit(self, x, y): 127 | super(OobMixin, self).fit(x, y) 128 | self.train_x = x 129 | 130 | def _underlying_predict(self, x): 131 | # TODO: sub-sampling of ensemble for test patterns 132 | oob = x == self.train_x 133 | 134 | if hasattr(oob, 'all'): 135 | oob = oob.all() 136 | 137 | if oob: 138 | return self._oob_prediction() 139 | else: 140 | return super(OobMixin, self)._underlying_predict(x) 141 | 142 | 143 | class OobClassifierAdapter(OobMixin, ClassifierAdapter): 144 | def __init__(self, model, fit_params=None): 145 | super(OobClassifierAdapter, self).__init__(model, fit_params) 146 | 147 | def _oob_prediction(self): 148 | return self.model.oob_decision_function_ 149 | 150 | 151 | class OobRegressorAdapter(OobMixin, RegressorAdapter): 152 | def __init__(self, model, fit_params=None): 153 | super(OobRegressorAdapter, self).__init__(model, fit_params) 154 | 155 | def _oob_prediction(self): 156 | return self.model.oob_prediction_ 157 | -------------------------------------------------------------------------------- /nonconformist/cp.py: -------------------------------------------------------------------------------- 1 | from nonconformist.icp import * 2 | 3 | # TODO: move contents from nonconformist.icp here 4 | 5 | # ----------------------------------------------------------------------------- 6 | # TcpClassifier 7 | # ----------------------------------------------------------------------------- 8 | class TcpClassifier(BaseEstimator, ClassifierMixin): 9 | """Transductive conformal classifier. 10 | 11 | Parameters 12 | ---------- 13 | nc_function : BaseScorer 14 | Nonconformity scorer object used to calculate nonconformity of 15 | calibration examples and test patterns. Should implement ``fit(x, y)`` 16 | and ``calc_nc(x, y)``. 17 | 18 | smoothing : boolean 19 | Decides whether to use stochastic smoothing of p-values. 20 | 21 | Attributes 22 | ---------- 23 | train_x : numpy array of shape [n_cal_examples, n_features] 24 | Inputs of training set. 25 | 26 | train_y : numpy array of shape [n_cal_examples] 27 | Outputs of calibration set. 28 | 29 | nc_function : BaseScorer 30 | Nonconformity scorer object used to calculate nonconformity scores. 31 | 32 | classes : numpy array of shape [n_classes] 33 | List of class labels, with indices corresponding to output columns 34 | of TcpClassifier.predict() 35 | 36 | See also 37 | -------- 38 | IcpClassifier 39 | 40 | References 41 | ---------- 42 | .. [1] Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic learning 43 | in a random world. Springer Science & Business Media. 
44 | 45 | Examples 46 | -------- 47 | >>> import numpy as np 48 | >>> from sklearn.datasets import load_iris 49 | >>> from sklearn.svm import SVC 50 | >>> from nonconformist.base import ClassifierAdapter 51 | >>> from nonconformist.cp import TcpClassifier 52 | >>> from nonconformist.nc import ClassifierNc, MarginErrFunc 53 | >>> iris = load_iris() 54 | >>> idx = np.random.permutation(iris.target.size) 55 | >>> train = idx[:int(idx.size / 2)] 56 | >>> test = idx[int(idx.size / 2):] 57 | >>> model = ClassifierAdapter(SVC(probability=True)) 58 | >>> nc = ClassifierNc(model, MarginErrFunc()) 59 | >>> tcp = TcpClassifier(nc) 60 | >>> tcp.fit(iris.data[train, :], iris.target[train]) 61 | >>> tcp.predict(iris.data[test, :], significance=0.10) 62 | ... # doctest: +SKIP 63 | array([[ True, False, False], 64 | [False, True, False], 65 | ..., 66 | [False, True, False], 67 | [False, True, False]], dtype=bool) 68 | """ 69 | 70 | def __init__(self, nc_function, condition=None, smoothing=True): 71 | self.train_x, self.train_y = None, None 72 | self.nc_function = nc_function 73 | super(TcpClassifier, self).__init__() 74 | 75 | # Check if condition-parameter is the default function (i.e., 76 | # lambda x: 0). This is so we can safely clone the object without 77 | # the clone accidentally having self.conditional = True. 78 | default_condition = lambda x: 0 79 | is_default = (callable(condition) and 80 | (condition.__code__.co_code == 81 | default_condition.__code__.co_code)) 82 | 83 | if is_default: 84 | self.condition = condition 85 | self.conditional = False 86 | elif callable(condition): 87 | self.condition = condition 88 | self.conditional = True 89 | else: 90 | self.condition = lambda x: 0 91 | self.conditional = False 92 | 93 | self.smoothing = smoothing 94 | 95 | self.base_icp = IcpClassifier( 96 | self.nc_function, 97 | self.condition, 98 | self.smoothing 99 | ) 100 | 101 | self.classes = None 102 | 103 | def fit(self, x, y): 104 | self.train_x, self.train_y = x, y 105 | self.classes = np.unique(y) 106 | 107 | def predict(self, x, significance=None): 108 | """Predict the output values for a set of input patterns. 109 | 110 | Parameters 111 | ---------- 112 | x : numpy array of shape [n_samples, n_features] 113 | Inputs of patters for which to predict output values. 114 | 115 | significance : float or None 116 | Significance level (maximum allowed error rate) of predictions. 117 | Should be a float between 0 and 1. If ``None``, then the p-values 118 | are output rather than the predictions. 119 | 120 | Returns 121 | ------- 122 | p : numpy array of shape [n_samples, n_classes] 123 | If significance is ``None``, then p contains the p-values for each 124 | sample-class pair; if significance is a float between 0 and 1, then 125 | p is a boolean array denoting which labels are included in the 126 | prediction sets. 
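To make the per-label loop in the method body below concrete, this is the arithmetic behind each entry of `p` (illustrative only; the exact smoothed version lives in nonconformist/util.py, which is not reproduced here):

```python
# For one test point and one candidate label:
ngt, neq, n_train = 12, 1, 99              # scores strictly greater / equal to the test score, and training-set size
p_value = (ngt + neq + 1) / (n_train + 1)  # 0.14: the standard (unsmoothed) conformal p-value
```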
127 | """ 128 | n_test = x.shape[0] 129 | n_train = self.train_x.shape[0] 130 | p = np.zeros((n_test, self.classes.size)) 131 | for i in range(n_test): 132 | for j, y in enumerate(self.classes): 133 | train_x = np.vstack([self.train_x, x[i, :]]) 134 | train_y = np.hstack([self.train_y, y]) 135 | self.base_icp.fit(train_x, train_y) 136 | scores = self.base_icp.nc_function.score(train_x, train_y) 137 | ngt = (scores[:-1] > scores[-1]).sum() 138 | neq = (scores[:-1] == scores[-1]).sum() 139 | 140 | p[i, j] = calc_p(n_train, ngt, neq, self.smoothing) 141 | 142 | if significance is not None: 143 | return p > significance 144 | else: 145 | return p 146 | 147 | def predict_conf(self, x): 148 | """Predict the output values for a set of input patterns, using 149 | the confidence-and-credibility output scheme. 150 | 151 | Parameters 152 | ---------- 153 | x : numpy array of shape [n_samples, n_features] 154 | Inputs of patters for which to predict output values. 155 | 156 | Returns 157 | ------- 158 | p : numpy array of shape [n_samples, 3] 159 | p contains three columns: the first column contains the most 160 | likely class for each test pattern; the second column contains 161 | the confidence in the predicted class label, and the third column 162 | contains the credibility of the prediction. 163 | """ 164 | p = self.predict(x, significance=None) 165 | label = p.argmax(axis=1) 166 | credibility = p.max(axis=1) 167 | for i, idx in enumerate(label): 168 | p[i, idx] = -np.inf 169 | confidence = 1 - p.max(axis=1) 170 | 171 | return np.array([label, confidence, credibility]).T 172 | -------------------------------------------------------------------------------- /nonconformist/evaluation.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ 4 | Evaluation of conformal predictors. 5 | """ 6 | 7 | # Authors: Henrik Linusson 8 | 9 | # TODO: cross_val_score/run_experiment should possibly allow multiple to be evaluated on identical folding 10 | 11 | from __future__ import division 12 | 13 | from nonconformist.base import RegressorMixin, ClassifierMixin 14 | 15 | import sys 16 | import numpy as np 17 | import pandas as pd 18 | 19 | from sklearn.cross_validation import StratifiedShuffleSplit 20 | from sklearn.cross_validation import KFold 21 | from sklearn.cross_validation import train_test_split 22 | from sklearn.base import clone, BaseEstimator 23 | 24 | 25 | class BaseIcpCvHelper(BaseEstimator): 26 | """Base class for cross validation helpers. 27 | """ 28 | def __init__(self, icp, calibration_portion): 29 | super(BaseIcpCvHelper, self).__init__() 30 | self.icp = icp 31 | self.calibration_portion = calibration_portion 32 | 33 | def predict(self, x, significance=None): 34 | return self.icp.predict(x, significance) 35 | 36 | 37 | class ClassIcpCvHelper(BaseIcpCvHelper, ClassifierMixin): 38 | """Helper class for running the ``cross_val_score`` evaluation 39 | method on IcpClassifiers. 
40 | 41 | See also 42 | -------- 43 | IcpRegCrossValHelper 44 | 45 | Examples 46 | -------- 47 | >>> from sklearn.datasets import load_iris 48 | >>> from sklearn.ensemble import RandomForestClassifier 49 | >>> from nonconformist.icp import IcpClassifier 50 | >>> from nonconformist.nc import ClassifierNc, MarginErrFunc 51 | >>> from nonconformist.evaluation import ClassIcpCvHelper 52 | >>> from nonconformist.evaluation import class_mean_errors 53 | >>> from nonconformist.evaluation import cross_val_score 54 | >>> data = load_iris() 55 | >>> nc = ProbEstClassifierNc(RandomForestClassifier(), MarginErrFunc()) 56 | >>> icp = IcpClassifier(nc) 57 | >>> icp_cv = ClassIcpCvHelper(icp) 58 | >>> cross_val_score(icp_cv, 59 | ... data.data, 60 | ... data.target, 61 | ... iterations=2, 62 | ... folds=2, 63 | ... scoring_funcs=[class_mean_errors], 64 | ... significance_levels=[0.1]) 65 | ... # doctest: +SKIP 66 | class_mean_errors fold iter significance 67 | 0 0.013333 0 0 0.1 68 | 1 0.080000 1 0 0.1 69 | 2 0.053333 0 1 0.1 70 | 3 0.080000 1 1 0.1 71 | """ 72 | def __init__(self, icp, calibration_portion=0.25): 73 | super(ClassIcpCvHelper, self).__init__(icp, calibration_portion) 74 | 75 | def fit(self, x, y): 76 | split = StratifiedShuffleSplit(y, n_iter=1, 77 | test_size=self.calibration_portion) 78 | for train, cal in split: 79 | self.icp.fit(x[train, :], y[train]) 80 | self.icp.calibrate(x[cal, :], y[cal]) 81 | 82 | 83 | class RegIcpCvHelper(BaseIcpCvHelper, RegressorMixin): 84 | """Helper class for running the ``cross_val_score`` evaluation 85 | method on IcpRegressors. 86 | 87 | See also 88 | -------- 89 | IcpClassCrossValHelper 90 | 91 | Examples 92 | -------- 93 | >>> from sklearn.datasets import load_boston 94 | >>> from sklearn.ensemble import RandomForestRegressor 95 | >>> from nonconformist.icp import IcpRegressor 96 | >>> from nonconformist.nc import RegressorNc, AbsErrorErrFunc 97 | >>> from nonconformist.evaluation import RegIcpCvHelper 98 | >>> from nonconformist.evaluation import reg_mean_errors 99 | >>> from nonconformist.evaluation import cross_val_score 100 | >>> data = load_boston() 101 | >>> nc = RegressorNc(RandomForestRegressor(), AbsErrorErrFunc()) 102 | >>> icp = IcpRegressor(nc) 103 | >>> icp_cv = RegIcpCvHelper(icp) 104 | >>> cross_val_score(icp_cv, 105 | ... data.data, 106 | ... data.target, 107 | ... iterations=2, 108 | ... folds=2, 109 | ... scoring_funcs=[reg_mean_errors], 110 | ... significance_levels=[0.1]) 111 | ... # doctest: +SKIP 112 | fold iter reg_mean_errors significance 113 | 0 0 0 0.185771 0.1 114 | 1 1 0 0.138340 0.1 115 | 2 0 1 0.071146 0.1 116 | 3 1 1 0.043478 0.1 117 | """ 118 | def __init__(self, icp, calibration_portion=0.25): 119 | super(RegIcpCvHelper, self).__init__(icp, calibration_portion) 120 | 121 | def fit(self, x, y): 122 | split = train_test_split(x, y, test_size=self.calibration_portion) 123 | x_tr, x_cal, y_tr, y_cal = split[0], split[1], split[2], split[3] 124 | self.icp.fit(x_tr, y_tr) 125 | self.icp.calibrate(x_cal, y_cal) 126 | 127 | 128 | # ----------------------------------------------------------------------------- 129 | # 130 | # ----------------------------------------------------------------------------- 131 | def cross_val_score(model,x, y, iterations=10, folds=10, fit_params=None, 132 | scoring_funcs=None, significance_levels=None, 133 | verbose=False): 134 | """Evaluates a conformal predictor using cross-validation. 135 | 136 | Parameters 137 | ---------- 138 | model : object 139 | Conformal predictor to evaluate. 
140 | 141 | x : numpy array of shape [n_samples, n_features] 142 | Inputs of data to use for evaluation. 143 | 144 | y : numpy array of shape [n_samples] 145 | Outputs of data to use for evaluation. 146 | 147 | iterations : int 148 | Number of iterations to use for evaluation. The data set is randomly 149 | shuffled before each iteration. 150 | 151 | folds : int 152 | Number of folds to use for evaluation. 153 | 154 | fit_params : dictionary 155 | Parameters to supply to the conformal prediction object on training. 156 | 157 | scoring_funcs : iterable 158 | List of evaluation functions to apply to the conformal predictor in each 159 | fold. Each evaluation function should have a signature 160 | ``scorer(prediction, y, significance)``. 161 | 162 | significance_levels : iterable 163 | List of significance levels at which to evaluate the conformal 164 | predictor. 165 | 166 | verbose : boolean 167 | Indicates whether to output progress information during evaluation. 168 | 169 | Returns 170 | ------- 171 | scores : pandas DataFrame 172 | Tabulated results for each iteration, fold and evaluation function. 173 | """ 174 | 175 | fit_params = fit_params if fit_params else {} 176 | significance_levels = (significance_levels if significance_levels 177 | is not None else np.arange(0.01, 1.0, 0.01)) 178 | 179 | df = pd.DataFrame() 180 | 181 | columns = ['iter', 182 | 'fold', 183 | 'significance', 184 | ] + [f.__name__ for f in scoring_funcs] 185 | for i in range(iterations): 186 | idx = np.random.permutation(y.size) 187 | x, y = x[idx, :], y[idx] 188 | cv = KFold(y.size, folds) 189 | for j, (train, test) in enumerate(cv): 190 | if verbose: 191 | sys.stdout.write('\riter {}/{} fold {}/{}'.format( 192 | i + 1, 193 | iterations, 194 | j + 1, 195 | folds 196 | )) 197 | m = clone(model) 198 | m.fit(x[train, :], y[train], **fit_params) 199 | prediction = m.predict(x[test, :], significance=None) 200 | for k, s in enumerate(significance_levels): 201 | scores = [scoring_func(prediction, y[test], s) 202 | for scoring_func in scoring_funcs] 203 | df_score = pd.DataFrame([[i, j, s] + scores], 204 | columns=columns) 205 | df = df.append(df_score, ignore_index=True) 206 | 207 | return df 208 | 209 | 210 | def run_experiment(models, csv_files, iterations=10, folds=10, fit_params=None, 211 | scoring_funcs=None, significance_levels=None, 212 | normalize=False, verbose=False, header=0): 213 | """Performs a cross-validation evaluation of one or several conformal 214 | predictors on a collection of data sets in csv format. 215 | 216 | Parameters 217 | ---------- 218 | models : object or iterable 219 | Conformal predictor(s) to evaluate. 220 | 221 | csv_files : iterable 222 | List of file names (with absolute paths) containing csv-data, used to 223 | evaluate the conformal predictor. 224 | 225 | iterations : int 226 | Number of iterations to use for evaluation. The data set is randomly 227 | shuffled before each iteration. 228 | 229 | folds : int 230 | Number of folds to use for evaluation. 231 | 232 | fit_params : dictionary 233 | Parameters to supply to the conformal prediction object on training. 234 | 235 | scoring_funcs : iterable 236 | List of evaluation functions to apply to the conformal predictor in each 237 | fold. Each evaluation function should have a signature 238 | ``scorer(prediction, y, significance)``. 239 | 240 | significance_levels : iterable 241 | List of significance levels at which to evaluate the conformal 242 | predictor. 
243 | 244 | verbose : boolean 245 | Indicates whether to output progress information during evaluation. 246 | 247 | Returns 248 | ------- 249 | scores : pandas DataFrame 250 | Tabulated results for each data set, iteration, fold and 251 | evaluation function. 252 | """ 253 | df = pd.DataFrame() 254 | if not hasattr(models, '__iter__'): 255 | models = [models] 256 | 257 | for model in models: 258 | is_regression = model.get_problem_type() == 'regression' 259 | 260 | n_data_sets = len(csv_files) 261 | for i, csv_file in enumerate(csv_files): 262 | if verbose: 263 | print('\n{} ({} / {})'.format(csv_file, i + 1, n_data_sets)) 264 | data = pd.read_csv(csv_file, header=header) 265 | x, y = data.values[:, :-1], data.values[:, -1] 266 | x = np.array(x, dtype=np.float64) 267 | if normalize: 268 | if is_regression: 269 | y = y - y.min() / (y.max() - y.min()) 270 | else: 271 | for j, y_ in enumerate(np.unique(y)): 272 | y[y == y_] = j 273 | 274 | scores = cross_val_score(model, x, y, iterations, folds, 275 | fit_params, scoring_funcs, 276 | significance_levels, verbose) 277 | 278 | ds_df = pd.DataFrame(scores) 279 | ds_df['model'] = model.__class__.__name__ 280 | try: 281 | ds_df['data_set'] = csv_file.split('/')[-1] 282 | except: 283 | ds_df['data_set'] = csv_file 284 | 285 | df = df.append(ds_df) 286 | 287 | return df 288 | 289 | 290 | # ----------------------------------------------------------------------------- 291 | # Validity measures 292 | # ----------------------------------------------------------------------------- 293 | def reg_n_correct(prediction, y, significance=None): 294 | """Calculates the number of correct predictions made by a conformal 295 | regression model. 296 | """ 297 | if significance is not None: 298 | idx = int(significance * 100 - 1) 299 | prediction = prediction[:, :, idx] 300 | 301 | low = y >= prediction[:, 0] 302 | high = y <= prediction[:, 1] 303 | correct = low * high 304 | 305 | return y[correct].size 306 | 307 | 308 | def reg_mean_errors(prediction, y, significance): 309 | """Calculates the average error rate of a conformal regression model. 310 | """ 311 | return 1 - reg_n_correct(prediction, y, significance) / y.size 312 | 313 | 314 | def class_n_correct(prediction, y, significance): 315 | """Calculates the number of correct predictions made by a conformal 316 | classification model. 317 | """ 318 | labels, y = np.unique(y, return_inverse=True) 319 | prediction = prediction > significance 320 | correct = np.zeros((y.size,), dtype=bool) 321 | for i, y_ in enumerate(y): 322 | correct[i] = prediction[i, int(y_)] 323 | return np.sum(correct) 324 | 325 | 326 | def class_mean_errors(prediction, y, significance=None): 327 | """Calculates the average error rate of a conformal classification model. 328 | """ 329 | return 1 - (class_n_correct(prediction, y, significance) / y.size) 330 | 331 | 332 | def class_one_err(prediction, y, significance=None): 333 | """Calculates the error rate of conformal classifier predictions containing 334 | only a single output label. 
335 | """ 336 | labels, y = np.unique(y, return_inverse=True) 337 | prediction = prediction > significance 338 | idx = np.arange(0, y.size, 1) 339 | idx = filter(lambda x: np.sum(prediction[x, :]) == 1, idx) 340 | errors = filter(lambda x: not prediction[x, int(y[x])], idx) 341 | 342 | if len(idx) > 0: 343 | return np.size(errors) / np.size(idx) 344 | else: 345 | return 0 346 | 347 | 348 | def class_mean_errors_one_class(prediction, y, significance, c=0): 349 | """Calculates the average error rate of a conformal classification model, 350 | considering only test examples belonging to class ``c``. Use 351 | ``functools.partial`` in order to test other classes. 352 | """ 353 | labels, y = np.unique(y, return_inverse=True) 354 | prediction = prediction > significance 355 | idx = np.arange(0, y.size, 1)[y == c] 356 | errs = np.sum(1 for _ in filter(lambda x: not prediction[x, c], idx)) 357 | 358 | if idx.size > 0: 359 | return errs / idx.size 360 | else: 361 | return 0 362 | 363 | 364 | def class_one_err_one_class(prediction, y, significance, c=0): 365 | """Calculates the error rate of conformal classifier predictions containing 366 | only a single output label. Considers only test examples belonging to 367 | class ``c``. Use ``functools.partial`` in order to test other classes. 368 | """ 369 | labels, y = np.unique(y, return_inverse=True) 370 | prediction = prediction > significance 371 | idx = np.arange(0, y.size, 1) 372 | idx = filter(lambda x: prediction[x, c], idx) 373 | idx = filter(lambda x: np.sum(prediction[x, :]) == 1, idx) 374 | errors = filter(lambda x: int(y[x]) != c, idx) 375 | 376 | if len(idx) > 0: 377 | return np.size(errors) / np.size(idx) 378 | else: 379 | return 0 380 | 381 | 382 | # ----------------------------------------------------------------------------- 383 | # Efficiency measures 384 | # ----------------------------------------------------------------------------- 385 | def _reg_interval_size(prediction, y, significance): 386 | idx = int(significance * 100 - 1) 387 | prediction = prediction[:, :, idx] 388 | 389 | return prediction[:, 1] - prediction[:, 0] 390 | 391 | 392 | def reg_min_size(prediction, y, significance): 393 | return np.min(_reg_interval_size(prediction, y, significance)) 394 | 395 | 396 | def reg_q1_size(prediction, y, significance): 397 | return np.percentile(_reg_interval_size(prediction, y, significance), 25) 398 | 399 | 400 | def reg_median_size(prediction, y, significance): 401 | return np.median(_reg_interval_size(prediction, y, significance)) 402 | 403 | 404 | def reg_q3_size(prediction, y, significance): 405 | return np.percentile(_reg_interval_size(prediction, y, significance), 75) 406 | 407 | 408 | def reg_max_size(prediction, y, significance): 409 | return np.max(_reg_interval_size(prediction, y, significance)) 410 | 411 | 412 | def reg_mean_size(prediction, y, significance): 413 | """Calculates the average prediction interval size of a conformal 414 | regression model. 415 | """ 416 | return np.mean(_reg_interval_size(prediction, y, significance)) 417 | 418 | 419 | def class_avg_c(prediction, y, significance): 420 | """Calculates the average number of classes per prediction of a conformal 421 | classification model. 422 | """ 423 | prediction = prediction > significance 424 | return np.sum(prediction) / prediction.shape[0] 425 | 426 | 427 | def class_mean_p_val(prediction, y, significance): 428 | """Calculates the mean of the p-values output by a conformal classification 429 | model. 
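A usage sketch for the regression validity/efficiency measures above (toy arrays, not repository code); predictions are assumed to come as an [n_samples, 2, 99] grid over the significance levels 0.01–0.99, as produced by `predict(x, significance=None)`:

```python
import numpy as np
from nonconformist.evaluation import reg_mean_errors, reg_mean_size

y_test = np.random.rand(30)
lower, upper = y_test - 0.2, y_test + 0.2
pred = np.stack([np.tile(lower[:, None], 99), np.tile(upper[:, None], 99)], axis=1)  # [30, 2, 99]
err_rate  = reg_mean_errors(pred, y_test, significance=0.1)  # picks column int(0.1 * 100 - 1) == 9
avg_width = reg_mean_size(pred, y_test, significance=0.1)    # ~0.4 for this toy grid
```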
430 | """ 431 | return np.mean(prediction) 432 | 433 | 434 | def class_one_c(prediction, y, significance): 435 | """Calculates the rate of singleton predictions (prediction sets containing 436 | only a single class label) of a conformal classification model. 437 | """ 438 | prediction = prediction > significance 439 | n_singletons = np.sum(1 for _ in filter(lambda x: np.sum(x) == 1, 440 | prediction)) 441 | return n_singletons / y.size 442 | 443 | 444 | def class_empty(prediction, y, significance): 445 | """Calculates the rate of singleton predictions (prediction sets containing 446 | only a single class label) of a conformal classification model. 447 | """ 448 | prediction = prediction > significance 449 | n_empty = np.sum(1 for _ in filter(lambda x: np.sum(x) == 0, 450 | prediction)) 451 | return n_empty / y.size 452 | 453 | 454 | def n_test(prediction, y, significance): 455 | """Provides the number of test patters used in the evaluation. 456 | """ 457 | return y.size -------------------------------------------------------------------------------- /nonconformist/icp.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ 4 | Inductive conformal predictors. 5 | """ 6 | 7 | # Authors: Henrik Linusson 8 | 9 | from __future__ import division 10 | from collections import defaultdict 11 | from functools import partial 12 | 13 | import numpy as np 14 | from sklearn.base import BaseEstimator 15 | 16 | from nonconformist.base import RegressorMixin, ClassifierMixin 17 | from nonconformist.util import calc_p 18 | 19 | 20 | # ----------------------------------------------------------------------------- 21 | # Base inductive conformal predictor 22 | # ----------------------------------------------------------------------------- 23 | class BaseIcp(BaseEstimator): 24 | """Base class for inductive conformal predictors. 25 | """ 26 | def __init__(self, nc_function, condition=None): 27 | self.cal_x, self.cal_y = None, None 28 | self.nc_function = nc_function 29 | 30 | # Check if condition-parameter is the default function (i.e., 31 | # lambda x: 0). This is so we can safely clone the object without 32 | # the clone accidentally having self.conditional = True. 33 | default_condition = lambda x: 0 34 | is_default = (callable(condition) and 35 | (condition.__code__.co_code == 36 | default_condition.__code__.co_code)) 37 | 38 | if is_default: 39 | self.condition = condition 40 | self.conditional = False 41 | elif callable(condition): 42 | self.condition = condition 43 | self.conditional = True 44 | else: 45 | self.condition = lambda x: 0 46 | self.conditional = False 47 | 48 | def fit(self, x, y): 49 | """Fit underlying nonconformity scorer. 50 | 51 | Parameters 52 | ---------- 53 | x : numpy array of shape [n_samples, n_features] 54 | Inputs of examples for fitting the nonconformity scorer. 55 | 56 | y : numpy array of shape [n_samples] 57 | Outputs of examples for fitting the nonconformity scorer. 58 | 59 | Returns 60 | ------- 61 | None 62 | """ 63 | # TODO: incremental? 64 | self.nc_function.fit(x, y) 65 | 66 | def calibrate(self, x, y, increment=False): 67 | """Calibrate conformal predictor based on underlying nonconformity 68 | scorer. 69 | 70 | Parameters 71 | ---------- 72 | x : numpy array of shape [n_samples, n_features] 73 | Inputs of examples for calibrating the conformal predictor. 74 | 75 | y : numpy array of shape [n_samples, n_features] 76 | Outputs of examples for calibrating the conformal predictor. 
77 | 78 | increment : boolean 79 | If ``True``, performs an incremental recalibration of the conformal 80 | predictor. The supplied ``x`` and ``y`` are added to the set of 81 | previously existing calibration examples, and the conformal 82 | predictor is then calibrated on both the old and new calibration 83 | examples. 84 | 85 | Returns 86 | ------- 87 | None 88 | """ 89 | self._calibrate_hook(x, y, increment) 90 | self._update_calibration_set(x, y, increment) 91 | 92 | if self.conditional: 93 | category_map = np.array([self.condition((x[i, :], y[i])) 94 | for i in range(y.size)]) 95 | self.categories = np.unique(category_map) 96 | self.cal_scores = defaultdict(partial(np.ndarray, 0)) 97 | 98 | for cond in self.categories: 99 | idx = category_map == cond 100 | cal_scores = self.nc_function.score(self.cal_x[idx, :], 101 | self.cal_y[idx]) 102 | self.cal_scores[cond] = np.sort(cal_scores,0)[::-1] 103 | else: 104 | self.categories = np.array([0]) 105 | cal_scores = self.nc_function.score(self.cal_x, self.cal_y) 106 | self.cal_scores = {0: np.sort(cal_scores,0)[::-1]} 107 | 108 | def _calibrate_hook(self, x, y, increment): 109 | pass 110 | 111 | def _update_calibration_set(self, x, y, increment): 112 | if increment and self.cal_x is not None and self.cal_y is not None: 113 | self.cal_x = np.vstack([self.cal_x, x]) 114 | self.cal_y = np.hstack([self.cal_y, y]) 115 | else: 116 | self.cal_x, self.cal_y = x, y 117 | 118 | 119 | # ----------------------------------------------------------------------------- 120 | # Inductive conformal classifier 121 | # ----------------------------------------------------------------------------- 122 | class IcpClassifier(BaseIcp, ClassifierMixin): 123 | """Inductive conformal classifier. 124 | 125 | Parameters 126 | ---------- 127 | nc_function : BaseScorer 128 | Nonconformity scorer object used to calculate nonconformity of 129 | calibration examples and test patterns. Should implement ``fit(x, y)`` 130 | and ``calc_nc(x, y)``. 131 | 132 | smoothing : boolean 133 | Decides whether to use stochastic smoothing of p-values. 134 | 135 | Attributes 136 | ---------- 137 | cal_x : numpy array of shape [n_cal_examples, n_features] 138 | Inputs of calibration set. 139 | 140 | cal_y : numpy array of shape [n_cal_examples] 141 | Outputs of calibration set. 142 | 143 | nc_function : BaseScorer 144 | Nonconformity scorer object used to calculate nonconformity scores. 145 | 146 | classes : numpy array of shape [n_classes] 147 | List of class labels, with indices corresponding to output columns 148 | of IcpClassifier.predict() 149 | 150 | See also 151 | -------- 152 | IcpRegressor 153 | 154 | References 155 | ---------- 156 | .. [1] Papadopoulos, H., & Haralambous, H. (2011). Reliable prediction 157 | intervals with regression neural networks. Neural Networks, 24(8), 158 | 842-851. 
159 | 160 | Examples 161 | -------- 162 | >>> import numpy as np 163 | >>> from sklearn.datasets import load_iris 164 | >>> from sklearn.tree import DecisionTreeClassifier 165 | >>> from nonconformist.base import ClassifierAdapter 166 | >>> from nonconformist.icp import IcpClassifier 167 | >>> from nonconformist.nc import ClassifierNc, MarginErrFunc 168 | >>> iris = load_iris() 169 | >>> idx = np.random.permutation(iris.target.size) 170 | >>> train = idx[:int(idx.size / 3)] 171 | >>> cal = idx[int(idx.size / 3):int(2 * idx.size / 3)] 172 | >>> test = idx[int(2 * idx.size / 3):] 173 | >>> model = ClassifierAdapter(DecisionTreeClassifier()) 174 | >>> nc = ClassifierNc(model, MarginErrFunc()) 175 | >>> icp = IcpClassifier(nc) 176 | >>> icp.fit(iris.data[train, :], iris.target[train]) 177 | >>> icp.calibrate(iris.data[cal, :], iris.target[cal]) 178 | >>> icp.predict(iris.data[test, :], significance=0.10) 179 | ... # doctest: +SKIP 180 | array([[ True, False, False], 181 | [False, True, False], 182 | ..., 183 | [False, True, False], 184 | [False, True, False]], dtype=bool) 185 | """ 186 | def __init__(self, nc_function, condition=None, smoothing=True): 187 | super(IcpClassifier, self).__init__(nc_function, condition) 188 | self.classes = None 189 | self.smoothing = smoothing 190 | 191 | def _calibrate_hook(self, x, y, increment=False): 192 | self._update_classes(y, increment) 193 | 194 | def _update_classes(self, y, increment): 195 | if self.classes is None or not increment: 196 | self.classes = np.unique(y) 197 | else: 198 | self.classes = np.unique(np.hstack([self.classes, y])) 199 | 200 | def predict(self, x, significance=None): 201 | """Predict the output values for a set of input patterns. 202 | 203 | Parameters 204 | ---------- 205 | x : numpy array of shape [n_samples, n_features] 206 | Inputs of patters for which to predict output values. 207 | 208 | significance : float or None 209 | Significance level (maximum allowed error rate) of predictions. 210 | Should be a float between 0 and 1. If ``None``, then the p-values 211 | are output rather than the predictions. 212 | 213 | Returns 214 | ------- 215 | p : numpy array of shape [n_samples, n_classes] 216 | If significance is ``None``, then p contains the p-values for each 217 | sample-class pair; if significance is a float between 0 and 1, then 218 | p is a boolean array denoting which labels are included in the 219 | prediction sets. 220 | """ 221 | # TODO: if x == self.last_x ... 222 | n_test_objects = x.shape[0] 223 | p = np.zeros((n_test_objects, self.classes.size)) 224 | 225 | ncal_ngt_neq = self._get_stats(x) 226 | 227 | for i in range(len(self.classes)): 228 | for j in range(n_test_objects): 229 | p[j, i] = calc_p(ncal_ngt_neq[j, i, 0], 230 | ncal_ngt_neq[j, i, 1], 231 | ncal_ngt_neq[j, i, 2], 232 | self.smoothing) 233 | 234 | if significance is not None: 235 | return p > significance 236 | else: 237 | return p 238 | 239 | def _get_stats(self, x): 240 | n_test_objects = x.shape[0] 241 | ncal_ngt_neq = np.zeros((n_test_objects, self.classes.size, 3)) 242 | for i, c in enumerate(self.classes): 243 | test_class = np.zeros(x.shape[0], dtype=self.classes.dtype) 244 | test_class.fill(c) 245 | 246 | # TODO: maybe calculate p-values using cython or similar 247 | # TODO: interpolated p-values 248 | 249 | # TODO: nc_function.calc_nc should take X * {y1, y2, ... 
,yn} 250 | test_nc_scores = self.nc_function.score(x, test_class) 251 | for j, nc in enumerate(test_nc_scores): 252 | cal_scores = self.cal_scores[self.condition((x[j, :], c))][::-1] 253 | n_cal = cal_scores.size 254 | 255 | idx_left = np.searchsorted(cal_scores, nc, 'left') 256 | idx_right = np.searchsorted(cal_scores, nc, 'right') 257 | 258 | ncal_ngt_neq[j, i, 0] = n_cal 259 | ncal_ngt_neq[j, i, 1] = n_cal - idx_right 260 | ncal_ngt_neq[j, i, 2] = idx_right - idx_left 261 | 262 | return ncal_ngt_neq 263 | 264 | def predict_conf(self, x): 265 | """Predict the output values for a set of input patterns, using 266 | the confidence-and-credibility output scheme. 267 | 268 | Parameters 269 | ---------- 270 | x : numpy array of shape [n_samples, n_features] 271 | Inputs of patters for which to predict output values. 272 | 273 | Returns 274 | ------- 275 | p : numpy array of shape [n_samples, 3] 276 | p contains three columns: the first column contains the most 277 | likely class for each test pattern; the second column contains 278 | the confidence in the predicted class label, and the third column 279 | contains the credibility of the prediction. 280 | """ 281 | p = self.predict(x, significance=None) 282 | label = p.argmax(axis=1) 283 | credibility = p.max(axis=1) 284 | for i, idx in enumerate(label): 285 | p[i, idx] = -np.inf 286 | confidence = 1 - p.max(axis=1) 287 | 288 | return np.array([label, confidence, credibility]).T 289 | 290 | 291 | # ----------------------------------------------------------------------------- 292 | # Inductive conformal regressor 293 | # ----------------------------------------------------------------------------- 294 | class IcpRegressor(BaseIcp, RegressorMixin): 295 | """Inductive conformal regressor. 296 | 297 | Parameters 298 | ---------- 299 | nc_function : BaseScorer 300 | Nonconformity scorer object used to calculate nonconformity of 301 | calibration examples and test patterns. Should implement ``fit(x, y)``, 302 | ``calc_nc(x, y)`` and ``predict(x, nc_scores, significance)``. 303 | 304 | Attributes 305 | ---------- 306 | cal_x : numpy array of shape [n_cal_examples, n_features] 307 | Inputs of calibration set. 308 | 309 | cal_y : numpy array of shape [n_cal_examples] 310 | Outputs of calibration set. 311 | 312 | nc_function : BaseScorer 313 | Nonconformity scorer object used to calculate nonconformity scores. 314 | 315 | See also 316 | -------- 317 | IcpClassifier 318 | 319 | References 320 | ---------- 321 | .. [1] Papadopoulos, H., Proedrou, K., Vovk, V., & Gammerman, A. (2002). 322 | Inductive confidence machines for regression. In Machine Learning: ECML 323 | 2002 (pp. 345-356). Springer Berlin Heidelberg. 324 | 325 | .. [2] Papadopoulos, H., & Haralambous, H. (2011). Reliable prediction 326 | intervals with regression neural networks. Neural Networks, 24(8), 327 | 842-851. 
328 | 329 | Examples 330 | -------- 331 | >>> import numpy as np 332 | >>> from sklearn.datasets import load_boston 333 | >>> from sklearn.tree import DecisionTreeRegressor 334 | >>> from nonconformist.base import RegressorAdapter 335 | >>> from nonconformist.icp import IcpRegressor 336 | >>> from nonconformist.nc import RegressorNc, AbsErrorErrFunc 337 | >>> boston = load_boston() 338 | >>> idx = np.random.permutation(boston.target.size) 339 | >>> train = idx[:int(idx.size / 3)] 340 | >>> cal = idx[int(idx.size / 3):int(2 * idx.size / 3)] 341 | >>> test = idx[int(2 * idx.size / 3):] 342 | >>> model = RegressorAdapter(DecisionTreeRegressor()) 343 | >>> nc = RegressorNc(model, AbsErrorErrFunc()) 344 | >>> icp = IcpRegressor(nc) 345 | >>> icp.fit(boston.data[train, :], boston.target[train]) 346 | >>> icp.calibrate(boston.data[cal, :], boston.target[cal]) 347 | >>> icp.predict(boston.data[test, :], significance=0.10) 348 | ... # doctest: +SKIP 349 | array([[ 5. , 20.6], 350 | [ 15.5, 31.1], 351 | ..., 352 | [ 14.2, 29.8], 353 | [ 11.6, 27.2]]) 354 | """ 355 | def __init__(self, nc_function, condition=None): 356 | super(IcpRegressor, self).__init__(nc_function, condition) 357 | 358 | def predict(self, x, significance=None): 359 | """Predict the output values for a set of input patterns. 360 | 361 | Parameters 362 | ---------- 363 | x : numpy array of shape [n_samples, n_features] 364 | Inputs of patters for which to predict output values. 365 | 366 | significance : float 367 | Significance level (maximum allowed error rate) of predictions. 368 | Should be a float between 0 and 1. If ``None``, then intervals for 369 | all significance levels (0.01, 0.02, ..., 0.99) are output in a 370 | 3d-matrix. 371 | 372 | Returns 373 | ------- 374 | p : numpy array of shape [n_samples, 2] or [n_samples, 2, 99} 375 | If significance is ``None``, then p contains the interval (minimum 376 | and maximum boundaries) for each test pattern, and each significance 377 | level (0.01, 0.02, ..., 0.99). If significance is a float between 378 | 0 and 1, then p contains the prediction intervals (minimum and 379 | maximum boundaries) for the set of test patterns at the chosen 380 | significance level. 
381 | """ 382 | # TODO: interpolated p-values 383 | 384 | n_significance = (99 if significance is None 385 | else np.array(significance).size) 386 | 387 | if n_significance > 1: 388 | prediction = np.zeros((x.shape[0], 2, n_significance)) 389 | else: 390 | prediction = np.zeros((x.shape[0], 2)) 391 | 392 | condition_map = np.array([self.condition((x[i, :], None)) 393 | for i in range(x.shape[0])]) 394 | 395 | for condition in self.categories: 396 | idx = condition_map == condition 397 | if np.sum(idx) > 0: 398 | p = self.nc_function.predict(x[idx, :], 399 | self.cal_scores[condition], 400 | significance) 401 | if n_significance > 1: 402 | prediction[idx, :, :] = p 403 | else: 404 | prediction[idx, :] = p 405 | 406 | return prediction 407 | 408 | 409 | class OobCpClassifier(IcpClassifier): 410 | def __init__(self, 411 | nc_function, 412 | condition=None, 413 | smoothing=True): 414 | super(OobCpClassifier, self).__init__(nc_function, 415 | condition, 416 | smoothing) 417 | 418 | def fit(self, x, y): 419 | super(OobCpClassifier, self).fit(x, y) 420 | super(OobCpClassifier, self).calibrate(x, y, False) 421 | 422 | def calibrate(self, x, y, increment=False): 423 | # Should throw exception (or really not be implemented for oob) 424 | pass 425 | 426 | 427 | class OobCpRegressor(IcpRegressor): 428 | def __init__(self, 429 | nc_function, 430 | condition=None): 431 | super(OobCpRegressor, self).__init__(nc_function, 432 | condition) 433 | 434 | def fit(self, x, y): 435 | super(OobCpRegressor, self).fit(x, y) 436 | super(OobCpRegressor, self).calibrate(x, y, False) 437 | 438 | def calibrate(self, x, y, increment=False): 439 | # Should throw exception (or really not be implemented for oob) 440 | pass 441 | -------------------------------------------------------------------------------- /nonconformist/nc.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | """ 4 | Nonconformity functions. 5 | """ 6 | 7 | # Authors: Henrik Linusson 8 | # Yaniv Romano modified RegressorNc class to include CQR 9 | 10 | from __future__ import division 11 | 12 | import abc 13 | import numpy as np 14 | import sklearn.base 15 | from nonconformist.base import ClassifierAdapter, RegressorAdapter 16 | from nonconformist.base import OobClassifierAdapter, OobRegressorAdapter 17 | 18 | # ----------------------------------------------------------------------------- 19 | # Error functions 20 | # ----------------------------------------------------------------------------- 21 | 22 | 23 | class ClassificationErrFunc(object): 24 | """Base class for classification model error functions. 25 | """ 26 | 27 | __metaclass__ = abc.ABCMeta 28 | 29 | def __init__(self): 30 | super(ClassificationErrFunc, self).__init__() 31 | 32 | @abc.abstractmethod 33 | def apply(self, prediction, y): 34 | """Apply the nonconformity function. 35 | 36 | Parameters 37 | ---------- 38 | prediction : numpy array of shape [n_samples, n_classes] 39 | Class probability estimates for each sample. 40 | 41 | y : numpy array of shape [n_samples] 42 | True output labels of each sample. 43 | 44 | Returns 45 | ------- 46 | nc : numpy array of shape [n_samples] 47 | Nonconformity scores of the samples. 48 | """ 49 | pass 50 | 51 | 52 | class RegressionErrFunc(object): 53 | """Base class for regression model error functions. 
54 | """ 55 | 56 | __metaclass__ = abc.ABCMeta 57 | 58 | def __init__(self): 59 | super(RegressionErrFunc, self).__init__() 60 | 61 | @abc.abstractmethod 62 | def apply(self, prediction, y):#, norm=None, beta=0): 63 | """Apply the nonconformity function. 64 | 65 | Parameters 66 | ---------- 67 | prediction : numpy array of shape [n_samples, n_classes] 68 | Class probability estimates for each sample. 69 | 70 | y : numpy array of shape [n_samples] 71 | True output labels of each sample. 72 | 73 | Returns 74 | ------- 75 | nc : numpy array of shape [n_samples] 76 | Nonconformity scores of the samples. 77 | """ 78 | pass 79 | 80 | @abc.abstractmethod 81 | def apply_inverse(self, nc, significance):#, norm=None, beta=0): 82 | """Apply the inverse of the nonconformity function (i.e., 83 | calculate prediction interval). 84 | 85 | Parameters 86 | ---------- 87 | nc : numpy array of shape [n_calibration_samples] 88 | Nonconformity scores obtained for conformal predictor. 89 | 90 | significance : float 91 | Significance level (0, 1). 92 | 93 | Returns 94 | ------- 95 | interval : numpy array of shape [n_samples, 2] 96 | Minimum and maximum interval boundaries for each prediction. 97 | """ 98 | pass 99 | 100 | 101 | class InverseProbabilityErrFunc(ClassificationErrFunc): 102 | """Calculates the probability of not predicting the correct class. 103 | 104 | For each correct output in ``y``, nonconformity is defined as 105 | 106 | .. math:: 107 | 1 - \hat{P}(y_i | x) \, . 108 | """ 109 | 110 | def __init__(self): 111 | super(InverseProbabilityErrFunc, self).__init__() 112 | 113 | def apply(self, prediction, y): 114 | prob = np.zeros(y.size, dtype=np.float32) 115 | for i, y_ in enumerate(y): 116 | if y_ >= prediction.shape[1]: 117 | prob[i] = 0 118 | else: 119 | prob[i] = prediction[i, int(y_)] 120 | return 1 - prob 121 | 122 | 123 | class MarginErrFunc(ClassificationErrFunc): 124 | """ 125 | Calculates the margin error. 126 | 127 | For each correct output in ``y``, nonconformity is defined as 128 | 129 | .. math:: 130 | 0.5 - \dfrac{\hat{P}(y_i | x) - max_{y \, != \, y_i} \hat{P}(y | x)}{2} 131 | """ 132 | 133 | def __init__(self): 134 | super(MarginErrFunc, self).__init__() 135 | 136 | def apply(self, prediction, y): 137 | prob = np.zeros(y.size, dtype=np.float32) 138 | for i, y_ in enumerate(y): 139 | if y_ >= prediction.shape[1]: 140 | prob[i] = 0 141 | else: 142 | prob[i] = prediction[i, int(y_)] 143 | prediction[i, int(y_)] = -np.inf 144 | return 0.5 - ((prob - prediction.max(axis=1)) / 2) 145 | 146 | 147 | class AbsErrorErrFunc(RegressionErrFunc): 148 | """Calculates absolute error nonconformity for regression problems. 149 | 150 | For each correct output in ``y``, nonconformity is defined as 151 | 152 | .. math:: 153 | | y_i - \hat{y}_i | 154 | """ 155 | 156 | def __init__(self): 157 | super(AbsErrorErrFunc, self).__init__() 158 | 159 | def apply(self, prediction, y): 160 | return np.abs(prediction - y) 161 | 162 | def apply_inverse(self, nc, significance): 163 | nc = np.sort(nc)[::-1] 164 | border = int(np.floor(significance * (nc.size + 1))) - 1 165 | # TODO: should probably warn against too few calibration examples 166 | border = min(max(border, 0), nc.size - 1) 167 | return np.vstack([nc[border], nc[border]]) 168 | 169 | 170 | class SignErrorErrFunc(RegressionErrFunc): 171 | """Calculates signed error nonconformity for regression problems. 172 | 173 | For each correct output in ``y``, nonconformity is defined as 174 | 175 | .. 
math:: 176 | y_i - \hat{y}_i 177 | 178 | References 179 | ---------- 180 | .. [1] Linusson, Henrik, Ulf Johansson, and Tuve Lofstrom. 181 | Signed-error conformal regression. Pacific-Asia Conference on Knowledge 182 | Discovery and Data Mining. Springer International Publishing, 2014. 183 | """ 184 | 185 | def __init__(self): 186 | super(SignErrorErrFunc, self).__init__() 187 | 188 | def apply(self, prediction, y): 189 | return (prediction - y) 190 | 191 | def apply_inverse(self, nc, significance): 192 | 193 | err_high = -nc 194 | err_low = nc 195 | 196 | err_high = np.reshape(err_high, (nc.shape[0],1)) 197 | err_low = np.reshape(err_low, (nc.shape[0],1)) 198 | 199 | nc = np.concatenate((err_low,err_high),1) 200 | 201 | nc = np.sort(nc,0) 202 | index = int(np.ceil((1 - significance / 2) * (nc.shape[0] + 1))) - 1 203 | index = min(max(index, 0), nc.shape[0] - 1) 204 | return np.vstack([nc[index,0], nc[index,1]]) 205 | 206 | # CQR symmetric error function 207 | class QuantileRegErrFunc(RegressionErrFunc): 208 | """Calculates conformalized quantile regression error. 209 | 210 | For each correct output in ``y``, nonconformity is defined as 211 | 212 | .. math:: 213 | max{\hat{q}_low - y, y - \hat{q}_high} 214 | 215 | """ 216 | def __init__(self): 217 | super(QuantileRegErrFunc, self).__init__() 218 | 219 | def apply(self, prediction, y): 220 | y_lower = prediction[:,0] 221 | y_upper = prediction[:,-1] 222 | error_low = y_lower - y 223 | error_high = y - y_upper 224 | err = np.maximum(error_high,error_low) 225 | return err 226 | 227 | def apply_inverse(self, nc, significance): 228 | nc = np.sort(nc,0) 229 | index = int(np.ceil((1 - significance) * (nc.shape[0] + 1))) - 1 230 | index = min(max(index, 0), nc.shape[0] - 1) 231 | return np.vstack([nc[index], nc[index]]) 232 | 233 | # CQR asymmetric error function 234 | class QuantileRegAsymmetricErrFunc(RegressionErrFunc): 235 | """Calculates conformalized quantile regression asymmetric error function. 236 | 237 | For each correct output in ``y``, nonconformity is defined as 238 | 239 | .. 
math:: 240 | E_low = \hat{q}_low - y 241 | E_high = y - \hat{q}_high 242 | 243 | """ 244 | def __init__(self): 245 | super(QuantileRegAsymmetricErrFunc, self).__init__() 246 | 247 | def apply(self, prediction, y): 248 | y_lower = prediction[:,0] 249 | y_upper = prediction[:,-1] 250 | 251 | error_high = y - y_upper 252 | error_low = y_lower - y 253 | 254 | err_high = np.reshape(error_high, (y_upper.shape[0],1)) 255 | err_low = np.reshape(error_low, (y_lower.shape[0],1)) 256 | 257 | return np.concatenate((err_low,err_high),1) 258 | 259 | def apply_inverse(self, nc, significance): 260 | nc = np.sort(nc,0) 261 | index = int(np.ceil((1 - significance / 2) * (nc.shape[0] + 1))) - 1 262 | index = min(max(index, 0), nc.shape[0] - 1) 263 | return np.vstack([nc[index,0], nc[index,1]]) 264 | 265 | # ----------------------------------------------------------------------------- 266 | # Base nonconformity scorer 267 | # ----------------------------------------------------------------------------- 268 | class BaseScorer(sklearn.base.BaseEstimator): 269 | __metaclass__ = abc.ABCMeta 270 | 271 | def __init__(self): 272 | super(BaseScorer, self).__init__() 273 | 274 | @abc.abstractmethod 275 | def fit(self, x, y): 276 | pass 277 | 278 | @abc.abstractmethod 279 | def score(self, x, y=None): 280 | pass 281 | 282 | 283 | class RegressorNormalizer(BaseScorer): 284 | def __init__(self, base_model, normalizer_model, err_func): 285 | super(RegressorNormalizer, self).__init__() 286 | self.base_model = base_model 287 | self.normalizer_model = normalizer_model 288 | self.err_func = err_func 289 | 290 | def fit(self, x, y): 291 | residual_prediction = self.base_model.predict(x) 292 | residual_error = np.abs(self.err_func.apply(residual_prediction, y)) 293 | 294 | ###################################################################### 295 | # Optional: use logarithmic function as in the original implementation 296 | # available in https://github.com/donlnz/nonconformist 297 | # 298 | # CODE: 299 | # residual_error += 0.00001 # Add small term to avoid log(0) 300 | # log_err = np.log(residual_error) 301 | ###################################################################### 302 | 303 | log_err = residual_error 304 | self.normalizer_model.fit(x, log_err) 305 | 306 | def score(self, x, y=None): 307 | 308 | ###################################################################### 309 | # Optional: use logarithmic function as in the original implementation 310 | # available in https://github.com/donlnz/nonconformist 311 | # 312 | # CODE: 313 | # norm = np.exp(self.normalizer_model.predict(x)) 314 | ###################################################################### 315 | 316 | norm = np.abs(self.normalizer_model.predict(x)) 317 | return norm 318 | 319 | 320 | class NcFactory(object): 321 | @staticmethod 322 | def create_nc(model, err_func=None, normalizer_model=None, oob=False): 323 | if normalizer_model is not None: 324 | normalizer_adapter = RegressorAdapter(normalizer_model) 325 | else: 326 | normalizer_adapter = None 327 | 328 | if isinstance(model, sklearn.base.ClassifierMixin): 329 | err_func = MarginErrFunc() if err_func is None else err_func 330 | if oob: 331 | c = sklearn.base.clone(model) 332 | c.fit([[0], [1]], [0, 1]) 333 | if hasattr(c, 'oob_decision_function_'): 334 | adapter = OobClassifierAdapter(model) 335 | else: 336 | raise AttributeError('Cannot use out-of-bag ' 337 | 'calibration with {}'.format( 338 | model.__class__.__name__ 339 | )) 340 | else: 341 | adapter = ClassifierAdapter(model) 342 | 343 | if 
normalizer_adapter is not None: 344 | normalizer = RegressorNormalizer(adapter, 345 | normalizer_adapter, 346 | err_func) 347 | return ClassifierNc(adapter, err_func, normalizer) 348 | else: 349 | return ClassifierNc(adapter, err_func) 350 | 351 | elif isinstance(model, sklearn.base.RegressorMixin): 352 | err_func = AbsErrorErrFunc() if err_func is None else err_func 353 | if oob: 354 | c = sklearn.base.clone(model) 355 | c.fit([[0], [1]], [0, 1]) 356 | if hasattr(c, 'oob_prediction_'): 357 | adapter = OobRegressorAdapter(model) 358 | else: 359 | raise AttributeError('Cannot use out-of-bag ' 360 | 'calibration with {}'.format( 361 | model.__class__.__name__ 362 | )) 363 | else: 364 | adapter = RegressorAdapter(model) 365 | 366 | if normalizer_adapter is not None: 367 | normalizer = RegressorNormalizer(adapter, 368 | normalizer_adapter, 369 | err_func) 370 | return RegressorNc(adapter, err_func, normalizer) 371 | else: 372 | return RegressorNc(adapter, err_func) 373 | 374 | 375 | class BaseModelNc(BaseScorer): 376 | """Base class for nonconformity scorers based on an underlying model. 377 | 378 | Parameters 379 | ---------- 380 | model : ClassifierAdapter or RegressorAdapter 381 | Underlying classification model used for calculating nonconformity 382 | scores. 383 | 384 | err_func : ClassificationErrFunc or RegressionErrFunc 385 | Error function object. 386 | 387 | normalizer : BaseScorer 388 | Normalization model. 389 | 390 | beta : float 391 | Normalization smoothing parameter. As the beta-value increases, 392 | the normalized nonconformity function approaches a non-normalized 393 | equivalent. 394 | """ 395 | def __init__(self, model, err_func, normalizer=None, beta=1e-6): 396 | super(BaseModelNc, self).__init__() 397 | self.err_func = err_func 398 | self.model = model 399 | self.normalizer = normalizer 400 | self.beta = beta 401 | 402 | # If we use sklearn.base.clone (e.g., during cross-validation), 403 | # object references get jumbled, so we need to make sure that the 404 | # normalizer has a reference to the proper model adapter, if applicable. 405 | if (self.normalizer is not None and 406 | hasattr(self.normalizer, 'base_model')): 407 | self.normalizer.base_model = self.model 408 | 409 | self.last_x, self.last_y = None, None 410 | self.last_prediction = None 411 | self.clean = False 412 | 413 | def fit(self, x, y): 414 | """Fits the underlying model of the nonconformity scorer. 415 | 416 | Parameters 417 | ---------- 418 | x : numpy array of shape [n_samples, n_features] 419 | Inputs of examples for fitting the underlying model. 420 | 421 | y : numpy array of shape [n_samples] 422 | Outputs of examples for fitting the underlying model. 423 | 424 | Returns 425 | ------- 426 | None 427 | """ 428 | self.model.fit(x, y) 429 | if self.normalizer is not None: 430 | self.normalizer.fit(x, y) 431 | self.clean = False 432 | 433 | def score(self, x, y=None): 434 | """Calculates the nonconformity score of a set of samples. 435 | 436 | Parameters 437 | ---------- 438 | x : numpy array of shape [n_samples, n_features] 439 | Inputs of examples for which to calculate a nonconformity score. 440 | 441 | y : numpy array of shape [n_samples] 442 | Outputs of examples for which to calculate a nonconformity score. 443 | 444 | Returns 445 | ------- 446 | nc : numpy array of shape [n_samples] 447 | Nonconformity scores of samples. 
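Note that when a ``normalizer`` is supplied, scores computed from one-dimensional predictions are divided by the predicted per-sample difficulty plus ``beta``, whereas scores computed from multi-column predictions (as produced by the CQR error functions) are returned unnormalized.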
448 | """ 449 | prediction = self.model.predict(x) 450 | n_test = x.shape[0] 451 | if self.normalizer is not None: 452 | norm = self.normalizer.score(x) + self.beta 453 | else: 454 | norm = np.ones(n_test) 455 | if prediction.ndim > 1: 456 | ret_val = self.err_func.apply(prediction, y) 457 | else: 458 | ret_val = self.err_func.apply(prediction, y) / norm 459 | return ret_val 460 | 461 | 462 | # ----------------------------------------------------------------------------- 463 | # Classification nonconformity scorers 464 | # ----------------------------------------------------------------------------- 465 | class ClassifierNc(BaseModelNc): 466 | """Nonconformity scorer using an underlying class probability estimating 467 | model. 468 | 469 | Parameters 470 | ---------- 471 | model : ClassifierAdapter 472 | Underlying classification model used for calculating nonconformity 473 | scores. 474 | 475 | err_func : ClassificationErrFunc 476 | Error function object. 477 | 478 | normalizer : BaseScorer 479 | Normalization model. 480 | 481 | beta : float 482 | Normalization smoothing parameter. As the beta-value increases, 483 | the normalized nonconformity function approaches a non-normalized 484 | equivalent. 485 | 486 | Attributes 487 | ---------- 488 | model : ClassifierAdapter 489 | Underlying model object. 490 | 491 | err_func : ClassificationErrFunc 492 | Scorer function used to calculate nonconformity scores. 493 | 494 | See also 495 | -------- 496 | RegressorNc, NormalizedRegressorNc 497 | """ 498 | def __init__(self, 499 | model, 500 | err_func=MarginErrFunc(), 501 | normalizer=None, 502 | beta=1e-6): 503 | super(ClassifierNc, self).__init__(model, 504 | err_func, 505 | normalizer, 506 | beta) 507 | 508 | 509 | # ----------------------------------------------------------------------------- 510 | # Regression nonconformity scorers 511 | # ----------------------------------------------------------------------------- 512 | class RegressorNc(BaseModelNc): 513 | """Nonconformity scorer using an underlying regression model. 514 | 515 | Parameters 516 | ---------- 517 | model : RegressorAdapter 518 | Underlying regression model used for calculating nonconformity scores. 519 | 520 | err_func : RegressionErrFunc 521 | Error function object. 522 | 523 | normalizer : BaseScorer 524 | Normalization model. 525 | 526 | beta : float 527 | Normalization smoothing parameter. As the beta-value increases, 528 | the normalized nonconformity function approaches a non-normalized 529 | equivalent. 530 | 531 | Attributes 532 | ---------- 533 | model : RegressorAdapter 534 | Underlying model object. 535 | 536 | err_func : RegressionErrFunc 537 | Scorer function used to calculate nonconformity scores. 538 | 539 | See also 540 | -------- 541 | ProbEstClassifierNc, NormalizedRegressorNc 542 | """ 543 | def __init__(self, 544 | model, 545 | err_func=AbsErrorErrFunc(), 546 | normalizer=None, 547 | beta=1e-6): 548 | super(RegressorNc, self).__init__(model, 549 | err_func, 550 | normalizer, 551 | beta) 552 | 553 | def predict(self, x, nc, significance=None): 554 | """Constructs prediction intervals for a set of test examples. 555 | 556 | Predicts the output of each test pattern using the underlying model, 557 | and applies the (partial) inverse nonconformity function to each 558 | prediction, resulting in a prediction interval for each test pattern. 559 | 560 | Parameters 561 | ---------- 562 | x : numpy array of shape [n_samples, n_features] 563 | Inputs of patters for which to predict output values. 
564 | 565 | significance : float 566 | Significance level (maximum allowed error rate) of predictions. 567 | Should be a float between 0 and 1. If ``None``, then intervals for 568 | all significance levels (0.01, 0.02, ..., 0.99) are output in a 569 | 3d-matrix. 570 | 571 | Returns 572 | ------- 573 | p : numpy array of shape [n_samples, 2] or [n_samples, 2, 99] 574 | If significance is ``None``, then p contains the interval (minimum 575 | and maximum boundaries) for each test pattern, and each significance 576 | level (0.01, 0.02, ..., 0.99). If significance is a float between 577 | 0 and 1, then p contains the prediction intervals (minimum and 578 | maximum boundaries) for the set of test patterns at the chosen 579 | significance level. 580 | """ 581 | n_test = x.shape[0] 582 | prediction = self.model.predict(x) 583 | if self.normalizer is not None: 584 | norm = self.normalizer.score(x) + self.beta 585 | else: 586 | norm = np.ones(n_test) 587 | 588 | if significance: 589 | intervals = np.zeros((x.shape[0], 2)) 590 | err_dist = self.err_func.apply_inverse(nc, significance) 591 | err_dist = np.hstack([err_dist] * n_test) 592 | if prediction.ndim > 1: # CQR 593 | intervals[:, 0] = prediction[:,0] - err_dist[0, :] 594 | intervals[:, 1] = prediction[:,-1] + err_dist[1, :] 595 | else: # regular conformal prediction 596 | err_dist *= norm 597 | intervals[:, 0] = prediction - err_dist[0, :] 598 | intervals[:, 1] = prediction + err_dist[1, :] 599 | 600 | return intervals 601 | else: # Not tested for CQR 602 | significance = np.arange(0.01, 1.0, 0.01) 603 | intervals = np.zeros((x.shape[0], 2, significance.size)) 604 | 605 | for i, s in enumerate(significance): 606 | err_dist = self.err_func.apply_inverse(nc, s) 607 | err_dist = np.hstack([err_dist] * n_test) 608 | err_dist *= norm 609 | 610 | intervals[:, 0, i] = prediction - err_dist[0, :] 611 | intervals[:, 1, i] = prediction + err_dist[0, :] 612 | 613 | return intervals 614 | -------------------------------------------------------------------------------- /nonconformist/util.py: -------------------------------------------------------------------------------- 1 | from __future__ import division 2 | import numpy as np 3 | 4 | def calc_p(ncal, ngt, neq, smoothing=False): 5 | if smoothing: 6 | return (ngt + (neq + 1) * np.random.uniform(0, 1)) / (ncal + 1) 7 | else: 8 | return (ngt + neq + 1) / (ncal + 1) 9 | -------------------------------------------------------------------------------- /poster/CQR_Poster.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yromano/cqr/73267abb7ed7d3c6dad6ab4449154db7ec306535/poster/CQR_Poster.pdf -------------------------------------------------------------------------------- /reproducible_experiments/all_cqr_experiments.py: -------------------------------------------------------------------------------- 1 | 2 | ############################################################################### 3 | # Script for reproducing the results of CQR paper 4 | ############################################################################### 5 | 6 | import numpy as np 7 | from reproducible_experiments.run_cqr_experiment import run_experiment 8 | #from run_cqr_experiment import run_experiment 9 | 10 | 11 | # list methods to test 12 | test_methods = ['linear', 13 | 'neural_net', 14 | 'random_forest', 15 | 'quantile_net', 16 | 'cqr_quantile_net', 17 | 'cqr_asymmetric_quantile_net', 18 | 'rearrangement', 19 | 'cqr_rearrangement', 20 | 
'cqr_asymmetric_rearrangement', 21 | 'quantile_forest', 22 | 'cqr_quantile_forest', 23 | 'cqr_asymmetric_quantile_forest'] 24 | 25 | # list of datasets 26 | dataset_names = ['meps_19', 27 | 'meps_20', 28 | 'meps_21', 29 | 'star', 30 | 'facebook_1', 31 | 'facebook_2', 32 | 'bio', 33 | 'blog_data', 34 | 'concrete', 35 | 'bike', 36 | 'community'] 37 | 38 | # vector of random seeds 39 | random_state_train_test = np.arange(20) 40 | 41 | for test_method_id in range(12): 42 | for dataset_name_id in range(11): 43 | for random_state_train_test_id in range(20): 44 | dataset_name = dataset_names[dataset_name_id] 45 | test_method = test_methods[test_method_id] 46 | random_state = random_state_train_test[random_state_train_test_id] 47 | 48 | # run an experiment and save average results to CSV file 49 | run_experiment(dataset_name, test_method, random_state) 50 | -------------------------------------------------------------------------------- /reproducible_experiments/all_equalized_coverage_experiments.py: -------------------------------------------------------------------------------- 1 | ############################################################################### 2 | # Script for reproducing the results of CQR paper 3 | ############################################################################### 4 | 5 | import numpy as np 6 | from reproducible_experiments.run_equalized_coverage_experiment import run_equalized_coverage_experiment 7 | #from run_equalized_coverage_experiment import run_equalized_coverage_experiment 8 | 9 | # list methods to test 10 | test_methods = ['net', 11 | 'qnet'] 12 | 13 | dataset_names = ["meps_21"] 14 | 15 | test_ratio_vec = [0.2] 16 | 17 | # vector of random seeds 18 | random_state_train_test = np.arange(40) 19 | 20 | for test_method_id in range(2): 21 | for random_state_train_test_id in range(40): 22 | for dataset_name_id in range(1): 23 | for test_ratio_id in range(1): 24 | test_ratio = test_ratio_vec[test_ratio_id] 25 | test_method = test_methods[test_method_id] 26 | random_state = random_state_train_test[random_state_train_test_id] 27 | dataset_name = dataset_names[dataset_name_id] 28 | 29 | # run an experiment and save average results to CSV file 30 | run_equalized_coverage_experiment(dataset_name, 31 | test_method, 32 | random_state, 33 | True, 34 | test_ratio) 35 | --------------------------------------------------------------------------------
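To make the wiring of the classes above concrete, here is a minimal end-to-end sketch of conformalized quantile regression built from the components in `nonconformist/icp.py` and `nonconformist/nc.py`. The `TwoQuantileAdapter` wrapper and the synthetic data are illustrative assumptions (they are not part of the package); the adapter only needs to expose `fit(x, y)` and a `predict(x)` that returns a two-column array of lower/upper quantile estimates, which is the shape `QuantileRegErrFunc` and the CQR branch of `RegressorNc.predict` operate on. In practice, use the adapters and example notebooks shipped with this repository instead.

```python
# Minimal CQR sketch (illustrative only; TwoQuantileAdapter is a hypothetical
# stand-in for the package's own quantile-model adapters).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

from nonconformist.icp import IcpRegressor
from nonconformist.nc import RegressorNc, QuantileRegErrFunc


class TwoQuantileAdapter(object):
    """Fits one quantile regressor per bound; predict() returns [lower, upper]."""
    def __init__(self, quantiles=(0.05, 0.95)):
        self.models = [GradientBoostingRegressor(loss='quantile', alpha=q)
                       for q in quantiles]

    def fit(self, x, y):
        for model in self.models:
            model.fit(x, y)

    def predict(self, x):
        # column 0: lower quantile estimate, column 1: upper quantile estimate
        return np.column_stack([model.predict(x) for model in self.models])


# Toy heteroscedastic data, split into proper training, calibration and test sets.
rng = np.random.RandomState(0)
x = rng.uniform(0, 5, size=(2000, 1))
y = x[:, 0] + (0.5 + x[:, 0]) * rng.randn(2000)
train, cal, test = np.split(rng.permutation(2000), [1000, 1500])

icp = IcpRegressor(RegressorNc(TwoQuantileAdapter(), QuantileRegErrFunc()))
icp.fit(x[train], y[train])        # fit the underlying quantile model
icp.calibrate(x[cal], y[cal])      # compute calibration nonconformity scores
intervals = icp.predict(x[test], significance=0.1)  # [n_test, 2] lower/upper bounds

coverage = np.mean((y[test] >= intervals[:, 0]) & (y[test] <= intervals[:, 1]))
print('empirical coverage: %.3f' % coverage)
```

Swapping `QuantileRegErrFunc` for `QuantileRegAsymmetricErrFunc` conformalizes the two interval end points separately while keeping the same `fit` / `calibrate` / `predict` flow.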