├── Algorithms
│   ├── README.md
│   ├── SPO2CART.py
│   ├── SPOForest.py
│   ├── SPO_tree_greedy.py
│   ├── leaf_model.py
│   ├── mtp.py
│   └── mtp_SPO2CART.py
├── Applications
│   ├── Illustrative Example
│   │   ├── decision_problem_solver.py
│   │   └── illustrative_example.py
│   ├── README.md
│   ├── Shortest Path
│   │   ├── SPOForest_nonlinear.py
│   │   ├── SPOgreedy_nonlinear.py
│   │   ├── SPOopt_nonlinear.py
│   │   ├── decision_problem_solver.py
│   │   └── spo_opt_tree_nonlinear.py
│   └── Yahoo News
│       ├── Data Preprocessing
│       │   ├── DataPreprocessing1.ipynb
│       │   ├── DataPreprocessing2.ipynb
│       │   ├── README.md
│       │   └── YahooNewsDataExtraction.py
│       ├── SPOForest_news.py
│       ├── SPOgreedy_news.py
│       ├── SPOopt_news.py
│       ├── decision_problem_solver.py
│       └── spo_opt_tree_news.py
├── LICENSE.txt
└── README.md

/Algorithms/README.md:
--------------------------------------------------------------------------------
1 | # Algorithms
2 | 
3 | This folder contains implementations for training (greedy) SPO Trees (SPO_tree_greedy.py) and SPO Forests (SPOForest.py) for general predict-then-optimize problems. The implementation of the SPO Tree MILP approach is tailored to the specific applications of the paper and is therefore provided in the relevant applications folder.
4 | 
5 | The SPO Tree / SPO Forest classes consist of the following methods:
6 | * \_\_init\_\_(): initializes the algorithm
7 | * fit(): trains the algorithm on data: contextual features X, cost vectors C
8 | * traverse(): prints out the learned tree/forest
9 | * prune(): prunes the SPO Tree on a held-out validation set to prevent overfitting, using the CART pruning method. (Not implemented for SPO Forests.)
10 | * est_decision(): outputs estimated optimal decisions given new contextual features Xnew
11 | * est_cost(): outputs estimated cost vectors given new contextual features Xnew
12 | 
13 | Further documentation is provided in the headers of SPO_tree_greedy.py and SPOForest.py.
14 | 
15 | The structure of the decision-making problem of interest should be encoded in a file called decision_problem_solver.py. This file should contain two functions specified by the practitioner:
16 | * get_num_decisions(): returns the number of decision variables (i.e., the dimension of the cost vector for the underlying decision problem)
17 | * find_opt_decision(): given a matrix of cost vectors (one per row), returns the corresponding optimal decisions.
18 | 
19 | Any (optional) arguments for these functions may be passed as keyword arguments to the fit() functions of the SPO Tree/Forest classes. An example is given in the Yahoo News application. The shortest path applications provide an additional example of the specification of decision_problem_solver.py.
20 | 
21 | Code currently only supports Python 2.7 (not Python 3).
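To make the interface concrete, below is a minimal sketch of a toy decision_problem_solver.py (choose the cheaper of two edges) followed by a typical SPO Tree workflow. Everything here is illustrative: the toy solver and simulated data are placeholders (the applications in this repository formulate their decision problems with gurobipy), while the class and keyword-argument names follow the Illustrative Example script.

```python
# decision_problem_solver.py -- toy sketch: choose the cheaper of two edges
import numpy as np

def get_num_decisions():
    return 2  # dimension of the cost vector

def find_opt_decision(cost):
    # cost: numpy array with one cost vector per row
    weights = np.zeros(cost.shape)
    weights[np.arange(cost.shape[0]), np.argmin(cost, axis=1)] = 1.0  # pick the cheaper edge
    objective = np.min(cost, axis=1)                                  # optimal objective values
    return {'weights': weights, 'objective': objective}
```

With the file above saved as decision_problem_solver.py in the same folder:

```python
# usage sketch for the greedy SPO Tree (arguments as in the Illustrative Example script)
import numpy as np
from SPO_tree_greedy import SPOTree

X = np.random.rand(1000, 3)                               # contextual features (rows = observations)
C = np.column_stack((2.0*X[:, 0] + 1.0, 5.0*X[:, 1]**2))  # toy cost vectors (rows = observations)

my_tree = SPOTree(max_depth=3, min_weights_per_node=20, quant_discret=0.01,
                  debias_splits=False, SPO_weight_param=1.0, SPO_full_error=True)
my_tree.fit(X[:800], C[:800], feats_continuous=True, verbose=False)
my_tree.prune(X[800:], C[800:], one_SE_rule=False)   # CART-style pruning on held-out data
my_tree.traverse()                                   # print the learned tree
decisions = my_tree.est_decision(X[800:])            # estimated optimal decisions
costs = my_tree.est_cost(X[800:])                    # estimated cost vectors
```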
22 | Package Dependencies: gurobipy (with valid Gurobi license), numpy, pandas, scipy, joblib 23 | -------------------------------------------------------------------------------- /Algorithms/SPO2CART.py: -------------------------------------------------------------------------------- 1 | """ 2 | Encodes SPOT MILP as the structure of a CART tree in order to apply CART's pruning method 3 | Also supports traverse() which traverses the tree 4 | """ 5 | import numpy as np 6 | from mtp_SPO2CART import MTP_SPO2CART 7 | from decision_problem_solver import* 8 | from scipy.spatial import distance 9 | 10 | 11 | def truncate_train_x(train_x, train_x_precision): 12 | return(np.around(train_x, decimals=train_x_precision)) 13 | 14 | class SPO2CART(object): 15 | ''' 16 | This function initializes the SPO tree 17 | 18 | Parameters: 19 | 20 | max_depth: the maximum depth of the pre-pruned tree (default = Inf: no depth limit) 21 | 22 | min_weight_per_node: the mininum number of observations (with respect to cumulative weight) per node 23 | 24 | min_depth: the minimum depth of the pre-pruned tree (default: set equal to max_depth) 25 | 26 | min_diff: if depth > min_depth, stop splitting if improvement in fit does not exceed min_diff 27 | 28 | binary_splits: if True, use binary splits when building the tree, else consider multiway splits 29 | (i.e., when splitting on a variable, split on all unique vals) 30 | 31 | debias_splits/frac_debias_set/min_debias_set_size: Additional params when binary_splits = True. If debias_splits = True, then in each node, 32 | hold out frac_debias_set of the training set (w.r.t. case weights) to evaluate the error of the best splitting point for each feature. 33 | Stop bias-correcting when we have insufficient data; i.e. the total weight in the debias set < min_debias_set_size. 34 | Note: after finding best split point, we then refit the model on all training data and recalculate the training error 35 | 36 | quant_discret: continuous variable split points are chosen from quantiles of the variable corresponding to quant_discret,2*quant_discret,3*quant_discret, etc.. 37 | 38 | run_in_parallel: if set to True, enables parallel computing among num_workers threads. If num_workers is not 39 | specified, uses the number of cpu cores available. 40 | ''' 41 | def __init__(self, a,b,**kwargs): 42 | 43 | kwargs["SPO_weight_param"] = 1.0 44 | 45 | if "SPO_full_error" not in kwargs: 46 | kwargs["SPO_full_error"] = True 47 | 48 | self.SPO_weight_param = kwargs["SPO_weight_param"] 49 | self.SPO_full_error = kwargs["SPO_full_error"] 50 | self.tree = MTP_SPO2CART(a,b,**kwargs) 51 | 52 | ''' 53 | This function fits the tree on data (X,C,weights). 54 | 55 | X: The feature data used in tree splits. Can either be a pandas data frame or numpy array, with: 56 | (a) rows of X = observations 57 | (b) columns of X = features 58 | C: the cost vectors used in the leaf node models. Must be a numpy array, with: 59 | (a) rows of C = observations 60 | (b) columns of C = cost vector components 61 | weights: a numpy array of case weights. Is 1-dimensional, with weights[i] yielding weight of observation i 62 | feats_continuous: If False, all feature are treated as categorical. If True, all feature are treated as continuous. 
63 | feats_continuous can also be a boolean vector of dimension = num_features specifying how to treat each feature 64 | verbose: if verbose=True, prints out progress in tree fitting procedure 65 | ''' 66 | def fit(self, X, C, train_x_precision, 67 | weights=None, feats_continuous=True, verbose=False, refit_leaves=False, 68 | **kwargs): 69 | self.decision_kwargs = kwargs 70 | X = truncate_train_x(X, train_x_precision) 71 | num_obs = C.shape[0] 72 | 73 | A = np.array(range(num_obs)) 74 | if self.SPO_full_error == True and self.SPO_weight_param != 0.0: 75 | for i in range(num_obs): 76 | A[i] = find_opt_decision(C[i,:].reshape(1,-1),**kwargs)['objective'][0] 77 | 78 | if self.SPO_weight_param != 0.0 and self.SPO_weight_param != 1.0: 79 | if self.SPO_full_error == True: 80 | SPO_loss_bound = -float("inf") 81 | for i in range(num_obs): 82 | SPO_loss = -find_opt_decision(-C[i,:].reshape(1,-1),**kwargs)['objective'][0] - A[i] 83 | if SPO_loss >= SPO_loss_bound: 84 | SPO_loss_bound = SPO_loss 85 | 86 | else: 87 | c_max = np.max(C,axis=0) 88 | SPO_loss_bound = -find_opt_decision(-c_max.reshape(1,-1),**kwargs)['objective'][0] 89 | 90 | #Upper bound for MSE loss: maximum pairwise difference between any two elements 91 | dists = distance.cdist(C, C, 'sqeuclidean') 92 | MSE_loss_bound = np.max(dists) 93 | 94 | else: 95 | SPO_loss_bound = 1.0 96 | MSE_loss_bound = 1.0 97 | 98 | #kwargs["SPO_loss_bound"] = SPO_loss_bound 99 | #kwargs["MSE_loss_bound"] = MSE_loss_bound 100 | self.tree.fit(X,A,C, 101 | weights=weights, feats_continuous=feats_continuous, verbose=verbose, refit_leaves=refit_leaves, 102 | SPO_loss_bound = SPO_loss_bound, MSE_loss_bound = MSE_loss_bound, 103 | **kwargs) 104 | 105 | ''' 106 | Prints out the tree. 107 | Required: call tree fit() method first 108 | Prints pruned tree if prune() method has been called, else prints unpruned tree 109 | verbose=True prints additional statistics within each leaf 110 | ''' 111 | def traverse(self, verbose=False): 112 | self.tree.traverse(verbose=verbose) 113 | 114 | ''' 115 | Prunes the tree. Set verbose=True to track progress 116 | ''' 117 | def prune(self, Xval, Cval, 118 | weights_val=None, one_SE_rule=True,verbose=False,approx_pruning=False): 119 | num_obs = Cval.shape[0] 120 | 121 | Aval = np.array(range(num_obs)) 122 | if self.SPO_full_error == True and self.SPO_weight_param != 0.0: 123 | for i in range(num_obs): 124 | Aval[i] = find_opt_decision(Cval[i,:].reshape(1,-1),**self.decision_kwargs)['objective'][0] 125 | self.tree.prune(Xval,Aval,Cval, 126 | weights_val=weights_val,one_SE_rule=one_SE_rule,verbose=verbose,approx_pruning=approx_pruning) 127 | 128 | 129 | ''' 130 | Produces decision given data Xnew 131 | Required: call tree fit() method first 132 | Uses pruned tree if pruning method has been called, else uses unpruned tree 133 | Argument alpha controls level of pruning. If not specified, uses alpha trained from the prune() method 134 | 135 | As a step in finding the estimated decisions for data (Xnew), this function first finds 136 | the leaf node locations corresponding to each row of Xnew. It does so by a top-down search 137 | starting at the root node 0. 138 | If return_loc=True, est_decision will also return the leaf node locations for the data, in addition to the decision. 
139 | ''' 140 | def est_decision(self, Xnew, alpha=None, return_loc=False): 141 | return self.tree.predict(Xnew, np.array(range(0,Xnew.shape[0])), alpha=alpha, return_loc=return_loc) 142 | 143 | def est_cost(self, Xnew, alpha=None, return_loc=False): 144 | return self.tree.predict(Xnew, np.array(range(0,Xnew.shape[0])), alpha=alpha, return_loc=return_loc, get_cost=True) -------------------------------------------------------------------------------- /Algorithms/SPOForest.py: -------------------------------------------------------------------------------- 1 | """ 2 | SPO RANDOM FOREST IMPLEMENTATION 3 | 4 | This code will work for general predict-then-optimize applications. Fits SPO Forest to dataset of feature-cost pairs. 5 | 6 | The structure of the decision-making problem of interest should be encoded in a file called decision_problem_solver.py. 7 | Specifically, this code requires two functions: 8 | get_num_decisions(): returns number of decision variables (i.e., dimension of cost vector for underlying decision problem) 9 | find_opt_decision(): returns for a matrix of cost vectors the corresponding optimal decisions for those cost vectors 10 | """ 11 | import numpy as np 12 | from mtp import MTP 13 | from decision_problem_solver import* 14 | from scipy.spatial import distance 15 | from SPO_tree_greedy import SPOTree 16 | from joblib import Parallel, delayed 17 | from collections import Counter 18 | 19 | class SPOForest(object): 20 | ''' 21 | This function initializes the SPO forest 22 | 23 | FOREST PARAMETERS: 24 | 25 | n_estimators: number of SPO trees in the random forest 26 | 27 | max_features: number of features to consider when looking for the best split in each node 28 | 29 | run_in_parallel, num_workers: if run_in_parallel is set to True, enables parallel computing among num_workers threads. 30 | If num_workers is not specified, uses the number of cpu cores available. The task of computing each SPO tree in the forest 31 | is distributed among the available cores. (each tree may only use 1 core and thus this arg is set to None in SPOTree class) 32 | 33 | TREE PARAMETERS (DIRECTLY PASSED TO SPOTree CLASS): 34 | 35 | max_depth: maximum training depth of each tree in the forest (default = Inf: no depth limit) 36 | 37 | min_weight_per_node: the mininum number of observations (with respect to cumulative weight) per node for each tree in the forest 38 | 39 | quant_discret: continuous variable split points are chosen from quantiles of the variable corresponding to quant_discret,2*quant_discret,3*quant_discret, etc.. 40 | 41 | SPO_weight_param: splits are decided through loss = SPO_loss*SPO_weight_param + MSE_loss*(1-SPO_weight_param) 42 | SPO_weight_param = 1.0 -> SPO loss 43 | SPO_weight_param = 0.0 -> MSE loss (i.e., CART) 44 | 45 | SPO_full_error: if SPO error is used, are the full errors computed for split evaluation, 46 | i.e. are the alg's decision losses subtracted by the optimal decision losses? 
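A typical configuration and call pattern, mirroring the shortest-path experiments in this repository (the values and variable names below are illustrative placeholders):
        forest = SPOForest(n_estimators=100, max_features=3, run_in_parallel=True, num_workers=8,
                           max_depth=1000, min_weights_per_node=20, quant_discret=0.01, debias_splits=False,
                           SPO_weight_param=1.0, SPO_full_error=True)
        forest.fit(train_x, train_cost, feats_continuous=True, verbose_forest=True, seed=0)
        decisions = forest.est_decision(test_x, method="mean")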
47 | 48 | Keep all other parameter values as default 49 | ''' 50 | def __init__(self, n_estimators=10, run_in_parallel=False, num_workers=None, **kwargs): 51 | self.n_estimators = n_estimators 52 | if (run_in_parallel == False): 53 | num_workers = 1 54 | if num_workers is None: 55 | num_workers = -1 #this uses all available cpu cores 56 | self.run_in_parallel = run_in_parallel 57 | self.num_workers = num_workers 58 | 59 | self.forest = [None]*n_estimators 60 | for t in range(n_estimators): 61 | self.forest[t] = SPOTree(**kwargs) 62 | 63 | ''' 64 | This function fits the SPO forest on data (X,C,weights). 65 | 66 | X: The feature data used in tree splits. Can either be a pandas data frame or numpy array, with: 67 | (a) rows of X = observations 68 | (b) columns of X = features 69 | C: the cost vectors used in the leaf node models. Must be a numpy array, with: 70 | (a) rows of C = observations 71 | (b) columns of C = cost vector components 72 | weights: a numpy array of case weights. Is 1-dimensional, with weights[i] yielding weight of observation i 73 | feats_continuous: If False, all feature are treated as categorical. If True, all feature are treated as continuous. 74 | feats_continuous can also be a boolean vector of dimension = num_features specifying how to treat each feature 75 | verbose: if verbose=True, prints out progress in tree fitting procedure 76 | verbose_forest: if verbose_forest=True, prints out progress in the forest fitting procedure 77 | seed: seed for rng 78 | ''' 79 | def fit(self, X, C, weights=None, verbose_forest=False, seed=None, 80 | feats_continuous=False, verbose=False, refit_leaves=False, 81 | **kwargs): 82 | 83 | self.decision_kwargs = kwargs 84 | 85 | num_obs = C.shape[0] 86 | 87 | if weights is None: 88 | weights = np.ones([num_obs]) 89 | 90 | if seed is not None: 91 | np.random.seed(seed) 92 | tree_seeds = np.random.randint(0, high=2**32-1, size=self.n_estimators) 93 | 94 | 95 | if self.num_workers == 1: 96 | for t in range(self.n_estimators): 97 | if verbose_forest == True: 98 | print("Fitting tree " + str(t+1) + "out of " + str(self.n_estimators)) 99 | np.random.seed(tree_seeds[t]) 100 | bootstrap_inds = np.random.choice(range(num_obs), size=num_obs, replace=True) 101 | Xb = np.copy(X[bootstrap_inds]) 102 | Cb = np.copy(C[bootstrap_inds]) 103 | weights_b = np.copy(weights[bootstrap_inds]) 104 | self.forest[t].fit(Xb, Cb, weights=weights_b, seed=tree_seeds[t], 105 | feats_continuous=feats_continuous, verbose=verbose, refit_leaves=refit_leaves, 106 | **kwargs) 107 | 108 | else: 109 | self.forest = Parallel(n_jobs=self.num_workers, max_nbytes=1e5)(delayed(_fit_tree)(t, self.n_estimators, self.forest[t], X, C, weights, verbose_forest, tree_seeds[t], feats_continuous=feats_continuous, verbose=verbose, refit_leaves=refit_leaves, **kwargs) for t in range(self.n_estimators)) 110 | 111 | 112 | ''' 113 | Prints all trees in the forest 114 | Required: call forest fit() method first 115 | ''' 116 | def traverse(self): 117 | for t in range(self.n_estimators): 118 | print("Printing Tree " + str(t+1) + "out of " + str(self.n_estimators)) 119 | self.forest[t].traverse() 120 | print("\n\n\n") 121 | 122 | ''' 123 | Predicts decisions or costs given data Xnew 124 | Required: call tree fit() method first 125 | 126 | method: method for aggregating decisions from each of the individual trees in the forest. 
Two approaches: 127 | (1) "mean": averages predicted cost vectors from each tree, then finds decision with respect to average cost vector 128 | (2) "mode": each tree in the forest estimates an optimal decision; take the most-recommended decision 129 | 130 | NOTE: return_loc argument not supported: 131 | (If return_loc=True, est_decision will also return the leaf node locations for the data, in addition to the decision.) 132 | ''' 133 | def est_decision(self, Xnew, method="mean"): 134 | if method == "mean": 135 | forest_costs = self.est_cost(Xnew) 136 | forest_decisions = find_opt_decision(forest_costs,**self.decision_kwargs)['weights'] 137 | 138 | elif method == "mode": 139 | num_obs = Xnew.shape[0] 140 | tree_decisions = [None]*self.n_estimators 141 | for t in range(self.n_estimators): 142 | tree_decisions[t] = self.forest[t].est_decision(Xnew) 143 | tree_decisions = np.array(tree_decisions) 144 | forest_decisions = np.zeros((num_obs,tree_decisions.shape[2])) 145 | for i in range(num_obs): 146 | forest_decisions[i] = _get_mode_row(tree_decisions[:,i,:]) 147 | 148 | return forest_decisions 149 | 150 | def est_cost(self, Xnew): 151 | tree_costs = [None]*self.n_estimators 152 | for t in range(self.n_estimators): 153 | tree_costs[t] = self.forest[t].est_cost(Xnew) 154 | tree_costs = np.array(tree_costs) 155 | forest_costs = np.mean(tree_costs,axis=0) 156 | return forest_costs 157 | 158 | ''' 159 | Helper methods (ignore) 160 | ''' 161 | 162 | def _fit_tree(t, n_estimators, tree, X, C, weights, verbose_forest, tree_seed, **kwargs): 163 | if verbose_forest == True: 164 | print("Fitting tree " + str(t+1) + "out of " + str(n_estimators)) 165 | 166 | num_obs = C.shape[0] 167 | np.random.seed(tree_seed) 168 | bootstrap_inds = np.random.choice(range(num_obs), size=num_obs, replace=True) 169 | Xb = np.copy(X[bootstrap_inds]) 170 | Cb = np.copy(C[bootstrap_inds]) 171 | weights_b = np.copy(weights[bootstrap_inds]) 172 | tree.fit(Xb, Cb, weights=weights_b, seed=tree_seed, **kwargs) 173 | return(tree) 174 | 175 | def _get_mode_row(a): 176 | return(np.array(Counter(map(tuple, a)).most_common()[0][0])) -------------------------------------------------------------------------------- /Algorithms/SPO_tree_greedy.py: -------------------------------------------------------------------------------- 1 | """ 2 | SPO GREEDY TREE IMPLEMENTATION 3 | 4 | This code will work for general predict-then-optimize applications. Fits SPO (greedy) tree to dataset of feature-cost pairs. 5 | 6 | The structure of the decision-making problem of interest should be encoded in a file called decision_problem_solver.py. 
7 | Specifically, this code requires two functions: 8 | get_num_decisions(): returns number of decision variables (i.e., dimension of cost vector for underlying decision problem) 9 | find_opt_decision(): returns for a matrix of cost vectors the corresponding optimal decisions for those cost vectors 10 | 11 | """ 12 | import numpy as np 13 | from mtp import MTP 14 | from decision_problem_solver import* 15 | from scipy.spatial import distance 16 | 17 | class SPOTree(object): 18 | ''' 19 | This function initializes the SPO tree 20 | 21 | Parameters: 22 | 23 | max_depth: maximum training depth of each tree in the forest (default = Inf: no depth limit) 24 | 25 | min_weight_per_node: the mininum number of observations (with respect to cumulative weight) per node for each tree in the forest 26 | 27 | quant_discret: continuous variable split points are chosen from quantiles of the variable corresponding to quant_discret,2*quant_discret,3*quant_discret, etc.. 28 | 29 | SPO_weight_param: splits are decided through loss = SPO_loss*SPO_weight_param + MSE_loss*(1-SPO_weight_param) 30 | SPO_weight_param = 1.0 -> SPO loss 31 | SPO_weight_param = 0.0 -> MSE loss (i.e., CART) 32 | 33 | SPO_full_error: if SPO error is used, are the full errors computed for split evaluation, 34 | i.e. are the alg's decision losses subtracted by the optimal decision losses? 35 | 36 | run_in_parallel: if set to True, enables parallel computing among num_workers threads. If num_workers is not 37 | specified, uses the number of cpu cores available. 38 | 39 | max_features: number of features to consider when looking for the best split in each node. Useful when building random forests. Default equal to total num features 40 | 41 | Keep all other parameter values as default 42 | ''' 43 | def __init__(self, **kwargs): 44 | self.SPO_weight_param = kwargs["SPO_weight_param"] 45 | self.SPO_full_error = kwargs["SPO_full_error"] 46 | self.tree = MTP(**kwargs) 47 | 48 | ''' 49 | This function fits the tree on data (X,C,weights). 50 | 51 | X: The feature data used in tree splits. Can either be a pandas data frame or numpy array, with: 52 | (a) rows of X = observations 53 | (b) columns of X = features 54 | C: the cost vectors used in the leaf node models. Must be a numpy array, with: 55 | (a) rows of C = observations 56 | (b) columns of C = cost vector components 57 | weights: a numpy array of case weights. Is 1-dimensional, with weights[i] yielding weight of observation i 58 | feats_continuous: If False, all feature are treated as categorical. If True, all feature are treated as continuous. 
59 | feats_continuous can also be a boolean vector of dimension = num_features specifying how to treat each feature 60 | verbose: if verbose=True, prints out progress in tree fitting procedure 61 | 62 | Keep all other parameter values as default 63 | ''' 64 | def fit(self, X, C, 65 | weights=None, feats_continuous=False, verbose=False, refit_leaves=False, seed=None, 66 | **kwargs): 67 | self.pruned = False 68 | self.decision_kwargs = kwargs 69 | num_obs = C.shape[0] 70 | 71 | A = np.array(range(num_obs)) 72 | if self.SPO_full_error == True and self.SPO_weight_param != 0.0: 73 | for i in range(num_obs): 74 | A[i] = find_opt_decision(C[i,:].reshape(1,-1),**kwargs)['objective'][0] 75 | 76 | if self.SPO_weight_param != 0.0 and self.SPO_weight_param != 1.0: 77 | if self.SPO_full_error == True: 78 | SPO_loss_bound = -float("inf") 79 | for i in range(num_obs): 80 | SPO_loss = -find_opt_decision(-C[i,:].reshape(1,-1),**kwargs)['objective'][0] - A[i] 81 | if SPO_loss >= SPO_loss_bound: 82 | SPO_loss_bound = SPO_loss 83 | 84 | else: 85 | c_max = np.max(C,axis=0) 86 | SPO_loss_bound = -find_opt_decision(-c_max.reshape(1,-1),**kwargs)['objective'][0] 87 | 88 | #Upper bound for MSE loss: maximum pairwise difference between any two elements 89 | dists = distance.cdist(C, C, 'sqeuclidean') 90 | MSE_loss_bound = np.max(dists) 91 | 92 | else: 93 | SPO_loss_bound = 1.0 94 | MSE_loss_bound = 1.0 95 | 96 | #kwargs["SPO_loss_bound"] = SPO_loss_bound 97 | #kwargs["MSE_loss_bound"] = MSE_loss_bound 98 | self.tree.fit(X,A,C, 99 | weights=weights, feats_continuous=feats_continuous, verbose=verbose, refit_leaves=refit_leaves, seed=seed, 100 | SPO_loss_bound = SPO_loss_bound, MSE_loss_bound = MSE_loss_bound, 101 | **kwargs) 102 | 103 | ''' 104 | Prints out the tree. 105 | Required: call tree fit() method first 106 | Prints pruned tree if prune() method has been called, else prints unpruned tree 107 | verbose=True prints additional statistics within each leaf 108 | ''' 109 | def traverse(self, verbose=False): 110 | self.tree.traverse(verbose=verbose) 111 | 112 | ''' 113 | Prunes the tree. Set verbose=True to track progress 114 | ''' 115 | def prune(self, Xval, Cval, 116 | weights_val=None, one_SE_rule=True,verbose=False,approx_pruning=False): 117 | num_obs = Cval.shape[0] 118 | 119 | Aval = np.array(range(num_obs)) 120 | if self.SPO_full_error == True and self.SPO_weight_param != 0.0: 121 | for i in range(num_obs): 122 | Aval[i] = find_opt_decision(Cval[i,:].reshape(1,-1),**self.decision_kwargs)['objective'][0] 123 | 124 | self.tree.prune(Xval,Aval,Cval, 125 | weights_val=weights_val,one_SE_rule=one_SE_rule,verbose=verbose,approx_pruning=approx_pruning) 126 | self.pruned = True 127 | 128 | 129 | ''' 130 | Produces decision or cost given data Xnew 131 | Required: call tree fit() method first 132 | Uses pruned tree if pruning method has been called, else uses unpruned tree 133 | Argument alpha controls level of pruning. If not specified, uses alpha trained from the prune() method 134 | 135 | As a step in finding the estimated decisions for data (Xnew), this function first finds 136 | the leaf node locations corresponding to each row of Xnew. It does so by a top-down search 137 | starting at the root node 0. 138 | If return_loc=True, est_decision will also return the leaf node locations for the data, in addition to the decision. 
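A brief usage sketch (illustrative; assumes fit(), and optionally prune(), have already been called on my_tree, and that Xnew and test_cost are numpy arrays with one row per observation):
        pred_decision = my_tree.est_decision(Xnew)    # one decision vector per row of Xnew
        pred_cost = my_tree.est_cost(Xnew)            # one predicted cost vector per row of Xnew
        incurred_cost = np.sum(test_cost * pred_decision, axis=1)   # realized cost of each decision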
139 | ''' 140 | def est_decision(self, Xnew, alpha=None, return_loc=False): 141 | return self.tree.predict(Xnew, np.array(range(0,Xnew.shape[0])), alpha=alpha, return_loc=return_loc) 142 | 143 | def est_cost(self, Xnew, alpha=None, return_loc=False): 144 | return self.tree.predict(Xnew, np.array(range(0,Xnew.shape[0])), alpha=alpha, return_loc=return_loc, get_cost=True) 145 | 146 | ''' 147 | Other methods (ignore) 148 | ''' 149 | def get_tree_encoding(self, x_train=None): 150 | return self.tree.get_tree_encoding(x_train=x_train) 151 | 152 | def get_pruning_alpha(self): 153 | if self.pruned == True: 154 | return self.tree.alpha_best 155 | else: 156 | return(0) -------------------------------------------------------------------------------- /Algorithms/leaf_model.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Helper class for mtp.py 3 | 4 | Defines the leaf nodes of the tree, specifically 5 | - the computation of the predicted cost vectors and decisions within the given leaf of the tree 6 | - the SPO/MSE loss from using the predicted decision within the leaf 7 | ''' 8 | 9 | import numpy as np 10 | from decision_problem_solver import* 11 | #from scipy.spatial import distance 12 | 13 | ''' 14 | mtp.py depends on the classes and functions below. 15 | These classes/methods are used to define the model object in each leaf node, 16 | as well as helper functions for certain operations in the tree fitting procedure. 17 | 18 | Summary of methods and functions to specify: 19 | Methods as a part of class LeafModel: fit(), predict(), to_string(), error(), error_pruning() 20 | Other helper functions: get_sub(), are_Ys_diverse() 21 | 22 | ''' 23 | 24 | ''' 25 | LeafModel: the model used in each leaf. 26 | Has five methods: fit, predict, to_string, error, error_pruning 27 | 28 | SPO_weight_param: number between 0 and 1: 29 | Error metric: SPO_loss*SPO_weight_param + MSE_loss*(1-SPO_weight_param) 30 | ''' 31 | class LeafModel(object): 32 | 33 | #Any additional args passed to mtp's init() function are directly passed here 34 | def __init__(self,*args,**kwargs): 35 | self.SPO_weight_param = kwargs["SPO_weight_param"] 36 | self.SPO_full_error = kwargs["SPO_full_error"] 37 | return 38 | 39 | ''' 40 | This function trains the leaf node model on the data (A,Y,weights). 41 | 42 | A and Y can take any form (lists, matrices, vectors, etc.). For our applications, I recommend making Y 43 | the response data (e.g., choices) and A alternative-specific data (e.g., features, choice sets) 44 | 45 | weights: a numpy array of case weights. Is 1-dimensional, with weights[i] yielding 46 | weight of observation/customer i. If you know you will not be using case weights 47 | in your particular application, you can ignore this input entirely. 48 | 49 | Returns 0 or 1. 50 | 0: No errors occurred when fitting leaf node model 51 | 1: An error occurred when fitting the leaf node model (probably due to insufficient data) 52 | If fit returns 1, then the tree will not consider the split that led to this leaf node model 53 | 54 | fit_init is a LeafModel object which represents a previously-trained leaf node model. 55 | If specified, fit_init is used for initialization when training this current LeafModel object. 56 | Useful for faster computation when fit_init's coefficients are close to the optimal solution of the new data. 
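In this particular leaf model, fit() computes the (case-weight-averaged) mean cost vector over the observations mapped to the leaf, c_bar = sum_i w_i*c_i / sum_i w_i, stores it as self.mean_cost, and stores as self.decision the decision returned by find_opt_decision(c_bar).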
57 | 58 | For those interested in defining their own leaf node functions: 59 | (1) It is not required to use the fit_init argument in your code 60 | (2) All edge cases must be handled in code below (ex: arguments 61 | consist of a single entry, weights are all zero, Y has one unique choice, etc.). 62 | In these cases, either hard-code a model that works with these edge-cases (e.g., 63 | if all Ys = 1, predict 1 with probability one), or have the fit function return 1 (error) 64 | (3) Store the fitted model as an attribute to the self object. You can name the attribute 65 | anything you want (i.e., it does not have to be self.model_obj and self.model_coef below), 66 | as long as its consistent with your predict_prob() and to_string() methods 67 | 68 | Any additional args passed to mtp's fit() function are directly passed here 69 | ''' 70 | def fit(self, A, Y, weights, fit_init=None, refit=False, SPO_loss_bound=None, MSE_loss_bound=None, **kwargs): 71 | #no need to refit this model since it is already fit to optimality 72 | #note: change this behavior if debias=TRUE 73 | if refit == True: 74 | return(0) 75 | 76 | self.SPO_loss_bound = SPO_loss_bound 77 | self.MSE_loss_bound = MSE_loss_bound 78 | 79 | def fast_row_avg(X,weights): 80 | return (np.matmul(weights,X)/sum(weights)).reshape(-1) 81 | 82 | #if no observations are mapped to this leaf, then assign any feasible cost vector here 83 | if sum(weights) == 0: 84 | self.mean_cost = np.ones(get_num_decisions(**kwargs)) 85 | else: 86 | self.mean_cost = fast_row_avg(Y,weights) 87 | self.decision = find_opt_decision(self.mean_cost.reshape(1,-1),**kwargs)['weights'].reshape(-1) 88 | 89 | return(0) 90 | 91 | ''' 92 | This function applies model from fit() to predict choice data given new data A. 93 | Returns a list/numpy array of choices (one list entry per observation, i.e. l[i] yields prediction for ith obs.). 94 | Note: make sure to call fit() first before this method. 95 | 96 | Any additional args passed to mtp's predict() function are directly passed here 97 | ''' 98 | def predict(self, A, get_cost=False, *args,**kwargs): 99 | if get_cost==True: 100 | #Returns predicted cost corresponding to this leaf node 101 | return np.array([self.mean_cost]*len(A)) 102 | else: 103 | #Returns predicted decision corresponding to this leaf node 104 | return np.array([self.decision]*len(A)) 105 | ''' 106 | This function outputs the errors for each observation in pair (A,Y). 107 | Used in training when comparing different tree splits. 108 | Ex: mean-squared-error between observed data Y and predict(A) 109 | 110 | How to pass additional arguments to this function: simply pass these arguments to the init()/fit() functions and store them 111 | in the self object. 
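Concretely, for an observation with true cost vector c: the SPO loss is the dot product of c with the leaf's stored decision (with the precomputed optimal objective A subtracted off when SPO_full_error=True), and the MSE loss is the squared Euclidean distance between c and the leaf's mean cost vector. When 0 < SPO_weight_param < 1, the two losses are normalized by SPO_loss_bound and MSE_loss_bound and combined as a weighted sum.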
112 | ''' 113 | def error(self,A,Y): 114 | def MSEloss(C,Cpred): 115 | #return distance.cdist(C, Cpred, 'sqeuclidean').reshape(-1) 116 | MSE = (C**2).sum(axis=1)[:, None] - 2 * C.dot(Cpred.transpose()) + ((Cpred**2).sum(axis=1)[None, :]) 117 | return MSE.reshape(-1) 118 | 119 | def SPOloss(C,decision): 120 | return np.matmul(C,decision).reshape(-1) 121 | 122 | if self.SPO_weight_param == 1.0: 123 | if self.SPO_full_error == True: 124 | SPO_loss = SPOloss(Y,self.decision) - A 125 | else: 126 | SPO_loss = SPOloss(Y,self.decision) 127 | return SPO_loss 128 | elif self.SPO_weight_param == 0.0: 129 | MSE_loss = MSEloss(Y, self.mean_cost.reshape(1,-1)) 130 | return MSE_loss 131 | else: 132 | if self.SPO_full_error == True: 133 | SPO_loss = SPOloss(Y,self.decision) - A 134 | else: 135 | SPO_loss = SPOloss(Y,self.decision) 136 | MSE_loss = MSEloss(Y, self.mean_cost.reshape(1,-1)) 137 | return self.SPO_weight_param*SPO_loss/self.SPO_loss_bound+(1.0-self.SPO_weight_param)*MSE_loss/self.MSE_loss_bound 138 | 139 | ''' 140 | This function outputs the errors for each observation in pair (A,Y). 141 | Used in pruning to determine the best tree subset. 142 | Ex: mean-squared-error between observed data Y and predict(A) 143 | 144 | How to pass additional arguments to this function: simply pass these arguments to the init()/fit() functions and store them 145 | in the self object. 146 | ''' 147 | def error_pruning(self,A,Y): 148 | return self.error(A,Y) 149 | 150 | ''' 151 | This function returns the string representation of the fitted model 152 | Used in traverse() method, which traverses the tree and prints out all terminal node models 153 | 154 | Any additional args passed to mtp's traverse() function are directly passed here 155 | ''' 156 | def to_string(self,*leafargs,**leafkwargs): 157 | return "Mean cost vector: \n" + str(self.mean_cost) +"\n"+"decision: \n"+str(self.decision) 158 | 159 | 160 | ''' 161 | Given attribute data A, choice data Y, and observation indices data_inds, 162 | extract those observations of A and Y corresponding to data_inds 163 | 164 | If only attribute data A is given, returns A. 165 | If only choice data Y is given, returns Y. 166 | 167 | Used to partition the data in the tree-fitting procedure 168 | ''' 169 | def get_sub(data_inds,A=None,Y=None,is_boolvec=False): 170 | if A is None: 171 | return Y[data_inds] 172 | if Y is None: 173 | return A[data_inds] 174 | else: 175 | return A[data_inds],Y[data_inds] 176 | 177 | ''' 178 | This function takes as input choice data Y and outputs a boolean corresponding 179 | to whether all of the choices in Y are the same. 180 | 181 | It is used as a test for whether we should make a node a leaf. If are_Ys_diverse(Y)=False, 182 | then the node will become a leaf. Otherwise, if the node passes the other tests (doesn't exceed 183 | max depth, etc), we will consider splitting on the node. 
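For example, are_Ys_diverse(np.array([[1,2],[1,2]])) returns False (all cost vectors identical), while are_Ys_diverse(np.array([[1,2],[1,3]])) returns True.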
184 | ''' 185 | def are_Ys_diverse(Y): 186 | #return False iff all cost vectors (rows of Y) are the same 187 | tmp = [len(np.unique(Y[:,j])) for j in range(Y.shape[1])] 188 | return (np.max(tmp) > 1) 189 | 190 | -------------------------------------------------------------------------------- /Applications/Illustrative Example/decision_problem_solver.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Generic file to set up the decision problem (i.e., optimization problem) under consideration 3 | Must have functions: 4 | get_num_decisions(): returns number of decision variables (i.e., dimension of cost vector) 5 | find_opt_decision(): returns for a matrix of cost vectors the corresponding optimal decisions for those cost vectors 6 | 7 | This particular file sets up a two-road shortest path decision problem 8 | ''' 9 | 10 | from gurobipy import * 11 | import numpy as np 12 | 13 | dim = 2 #(creates dim * dim grid, where dim = number of vertices) 14 | Edge_list = [(i,i+1) for i in range(1, dim**2 + 1) if i % dim != 0] 15 | Edge_list += [(i, i + dim) for i in range(1, dim**2 + 1) if i <= dim**2 - dim] 16 | Edge_dict = {} #(assigns each edge to a unique integer from 0 to number-of-edges) 17 | for index, edge in enumerate(Edge_list): 18 | Edge_dict[edge] = index 19 | D = len(Edge_list) # D = number of decisions 20 | 21 | def get_num_decisions(): 22 | return D 23 | 24 | Edges = tuplelist(Edge_list) 25 | # Find the optimal total cost for an observation in the context of shortes path 26 | m_shortest_path = Model('shortest_path') 27 | m_shortest_path.Params.OutputFlag = 0 28 | flow = m_shortest_path.addVars(Edges, ub = 1, name = 'flow') 29 | m_shortest_path.addConstrs((quicksum(flow[i,j] for i,j in Edges.select(i,'*')) - quicksum(flow[k, i] for k,i in Edges.select('*', 30 | i)) == 0 for i in range(2, dim**2)), name = 'inner_nodes') 31 | m_shortest_path.addConstr((quicksum(flow[i,j] for i,j in Edges.select(1, '*')) == 1), name = 'start_node') 32 | m_shortest_path.addConstr((quicksum(flow[i,j] for i,j in Edges.select('*', dim**2)) == 1), name = 'end_node') 33 | 34 | def shortest_path(cost): 35 | # m_shortest_path.setObjective(quicksum(flow[i,j] * cost[Edge_dict[(i,j)]] for i,j in Edges), GRB.MINIMIZE) 36 | m_shortest_path.setObjective(LinExpr( [ (cost[Edge_dict[(i,j)]],flow[i,j] ) for i,j in Edges]), GRB.MINIMIZE) 37 | m_shortest_path.optimize() 38 | return {'weights': m_shortest_path.getAttr('x', flow), 'objective': m_shortest_path.objVal} 39 | 40 | def find_opt_decision(cost): 41 | weights = np.zeros(cost.shape) 42 | objective = np.zeros(cost.shape[0]) 43 | for i in range(cost.shape[0]): 44 | temp = shortest_path(cost[i,:]) 45 | for edge in Edges: 46 | weights[i, Edge_dict[edge]] = temp['weights'][edge] 47 | objective[i] = temp['objective'] 48 | return {'weights': weights, 'objective':objective} 49 | -------------------------------------------------------------------------------- /Applications/Illustrative Example/illustrative_example.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2 2 | # -*- coding: utf-8 -*- 3 | """ 4 | ILLUSTRATIVE EXAMPLE 5 | 6 | Generates a two-road instance of the shortest paths problem. 7 | Runs SPO Tree (greedy) and CART on this dataset and compares their performance. 8 | Produces plots visualizing predicted costs and normalized SPO loss incurred by SPOT and CART. 9 | To run code below, put SPO_tree_greedy.py into the same folder as this script. 
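Run as (Python 2.7): python illustrative_example.py
The resulting figures are written as .png files to the working directory.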
10 | """ 11 | 12 | import numpy as np 13 | from SPO_tree_greedy import SPOTree 14 | from decision_problem_solver import* 15 | import matplotlib as mpl 16 | #mpl.use('Agg') 17 | import matplotlib.pyplot as plt 18 | #plt.ioff() 19 | 20 | plt.rcParams.update({'font.size': 12}) 21 | figsize = (5.2, 4.3) 22 | np.random.seed(0) 23 | dpi=450 24 | 25 | #SIMULATED DATASET FUNCTIONS 26 | def get_costs(X): 27 | X = X.reshape(-1) 28 | mat = np.zeros((len(X),4)) 29 | for i in range(len(X)): 30 | mat[i,0] = (X[i] + 0.8)*5-2.1 31 | mat[i,1] = (5*X[i]+0.4)**2 32 | return(mat) 33 | 34 | def gen_dataset(n): 35 | x = np.random.rand(n,1) #generate random features in [0,1] 36 | costs = get_costs(x) 37 | return(x,costs) 38 | 39 | def get_step_func_rep(x, costs): 40 | change_inds = np.where(costs[1:]-costs[:-1] > 0)[0] 41 | x_change_points = (x[change_inds.tolist()]+x[(change_inds+1).tolist()])/2.0 42 | x_min = np.append(np.array(x[0]),x_change_points) 43 | x_max = np.append(x_change_points,np.array(x[-1])) 44 | change_inds = change_inds.tolist() 45 | change_inds.append(len(x)-1) 46 | y = costs[change_inds] 47 | return(y, x_min, x_max) 48 | 49 | def get_decision_boundary(x, costs): 50 | tmp = costs[:,1] > costs[:,0] 51 | if not any(tmp) == True: 52 | return None 53 | return(min(x[tmp])) 54 | 55 | def plot_costs(plot_x, true_costs, est_costs, color_est_costs, est_name, fname): 56 | true_costs_0 = true_costs[:,0] 57 | true_costs_1 = true_costs[:,1] 58 | est_costs_0 = est_costs[:,0] 59 | est_costs_1 = est_costs[:,1] 60 | fig,ax = plt.subplots(1, figsize=figsize) 61 | line_true_costs_0, = ax.plot(plot_x, true_costs_0, linewidth=2.0, color='grey', linestyle='-', label='Edge 1 Cost (True)') 62 | line_true_costs_1, = ax.plot(plot_x, true_costs_1, linewidth=2.0, color='grey', linestyle='--', label='Edge 2 Cost (True)') 63 | 64 | costs,xmin,xmax = get_step_func_rep(plot_x, est_costs_0) 65 | line_est_costs_0 = ax.hlines(costs, xmin, xmax, linewidth=2.0, color=color_est_costs, linestyle='-', label='Edge 1 Cost ('+est_name+')') 66 | costs,xmin,xmax = get_step_func_rep(plot_x, est_costs_1) 67 | line_est_costs_1 = ax.hlines(costs, xmin, xmax, linewidth=2.0, color=color_est_costs, linestyle='--', label='Edge 2 Cost ('+est_name+')') 68 | 69 | plt.xlabel('x') 70 | plt.ylabel('Edge Cost') 71 | #plt.ylim(top=21) 72 | 73 | bdry = get_decision_boundary(plot_x, est_costs) 74 | line_est_bdry = ax.axvline(x=bdry, linewidth=1.5, color=color_est_costs, linestyle=':', label='Decision Boundary ('+est_name+')') 75 | #_,xmin,_ = get_step_func_rep(plot_x, est_costs_0) 76 | #xbds = xmin[1:] 77 | #for xbd in xbds: 78 | # plt.axvline(x=xbd, linewidth=1.0, color='grey', linestyle=':') 79 | 80 | bdry = get_decision_boundary(plot_x, true_costs) 81 | line_true_bdry = ax.axvline(x=bdry, linewidth=1.5, color='grey', linestyle=':', label='Decision Boundary (True)') 82 | 83 | plt.legend(handles=[line_true_costs_0, line_true_costs_1, line_true_bdry, 84 | line_est_costs_0, line_est_costs_1, line_est_bdry], loc='upper left') 85 | #plt.show() 86 | plt.savefig(fname, format='png', dpi=dpi, bbox_inches='tight', pad_inches=0) 87 | plt.clf() 88 | 89 | 90 | plot_x = np.linspace(0,1,num=1000).reshape(1000,1) 91 | true_costs = get_costs(plot_x) 92 | 93 | #SIMULATED DATA PARAMETERS 94 | n_train = 10000; 95 | n_valid = 2000; 96 | n_test = 5000; 97 | 98 | #GENERATE TRAINING DATA 99 | train_x, train_cost = gen_dataset(n_train) 100 | #GENERATE VALIDATION SET DATA 101 | valid_x, valid_cost = gen_dataset(n_valid) 102 | #GENERATE TESTING DATA 103 | test_x, test_cost = 
gen_dataset(n_test) 104 | 105 | ################################################################### 106 | #FIT SPO Tree ALGORITHM 107 | #SPO_weight_param: number between 0 and 1: 108 | #Error metric: SPO_loss*SPO_weight_param + MSE_loss*(1-SPO_weight_param) 109 | my_tree = SPOTree(max_depth = 1, min_weights_per_node = 20, quant_discret = 0.01, debias_splits=False, SPO_weight_param=1.0, SPO_full_error=True) 110 | my_tree.fit(train_x,train_cost,verbose=False,feats_continuous=True); #verbose specifies whether fitting procedure should print progress 111 | #my_tree.traverse() #prints out the unpruned tree 112 | 113 | #PRUNE DECISION TREE USING VALIDATION SET 114 | my_tree.prune(valid_x, valid_cost, verbose=False, one_SE_rule=False) #verbose specifies whether pruning procedure should print progress 115 | #my_tree.traverse() #prints out the pruned tree 116 | 117 | #FIND TEST SET SPO LOSS 118 | opt_decision = find_opt_decision(test_cost)['weights'] 119 | pred_decision = my_tree.est_decision(test_x) 120 | 121 | incurred_test_cost = np.sum([np.sum(test_cost[i] * pred_decision[i]) for i in range(0,len(pred_decision))]) 122 | opt_test_cost = np.sum(np.sum(test_cost * opt_decision,axis=1)) 123 | 124 | #percent error: 125 | print("SPO Tree: Test Set Normalized SPO Loss: ") 126 | SPO_error = 100.0*(incurred_test_cost-opt_test_cost)/opt_test_cost 127 | print(str(SPO_error)+" percent error") 128 | est_costs = my_tree.est_cost(plot_x) 129 | plot_costs(plot_x, true_costs, est_costs, 'blue', 'SPOT', 'casestudySPOTdpi'+str(dpi)+'.png') 130 | 131 | ################################################################### 132 | #FIT MSE Tree ALGORITHM 133 | MSE_tree_depths = [1,2,3,4,5] 134 | MSE_tree_depths_errors = np.zeros(len(MSE_tree_depths)) 135 | for max_depth_ind,max_depth in enumerate(MSE_tree_depths): 136 | #SPO_weight_param: number between 0 and 1: 137 | #Error metric: SPO_loss*SPO_weight_param + MSE_loss*(1-SPO_weight_param) 138 | my_tree = SPOTree(max_depth = max_depth, min_weights_per_node = 20, quant_discret = 0.01, debias_splits=False, SPO_weight_param=0.0, SPO_full_error=True) 139 | my_tree.fit(train_x,train_cost,verbose=False,feats_continuous=True); #verbose specifies whether fitting procedure should print progress 140 | #my_tree.traverse() #prints out the unpruned tree 141 | 142 | #PRUNE DECISION TREE USING VALIDATION SET 143 | my_tree.prune(valid_x, valid_cost, verbose=False, one_SE_rule=False) #verbose specifies whether pruning procedure should print progress 144 | my_tree.traverse() #prints out the pruned tree 145 | 146 | #FIND TEST SET SPO LOSS 147 | opt_decision = find_opt_decision(test_cost)['weights'] 148 | pred_decision = my_tree.est_decision(test_x) 149 | 150 | incurred_test_cost = np.sum([np.sum(test_cost[i] * pred_decision[i]) for i in range(0,len(pred_decision))]) 151 | opt_test_cost = np.sum(np.sum(test_cost * opt_decision,axis=1)) 152 | 153 | #percent error: 154 | print("MSE Tree Depth "+str(max_depth)+": Test Set Normalized SPO Loss: ") 155 | MSE_tree_depths_errors[max_depth_ind] = 100.0*(incurred_test_cost-opt_test_cost)/opt_test_cost 156 | print(str(MSE_tree_depths_errors[max_depth_ind])+" percent error") 157 | est_costs = my_tree.est_cost(plot_x) 158 | plot_costs(plot_x, true_costs, est_costs, 'orange', 'CART', 'casestudyCART'+str(max_depth)+'dpi'+str(dpi)+'.png') 159 | 160 | SPO_tree_depths_errors = [SPO_error]*len(MSE_tree_depths) 161 | fig,ax = plt.subplots(1, figsize=figsize) 162 | ax.plot(MSE_tree_depths, MSE_tree_depths_errors, linewidth=2.0, color='orange', label='CART') 163 
| ax.plot(MSE_tree_depths, SPO_tree_depths_errors, linewidth=2.0, color='blue', label='SPOT') 164 | plt.xlabel("Training Depth") 165 | plt.ylabel("Norm. Extra Travel Time (%)") 166 | plt.legend(loc='upper right') 167 | plt.xticks(MSE_tree_depths) 168 | #plt.show() 169 | plt.savefig('casestudyCARTerrorsdpi'+str(dpi)+'.png', format='png', dpi=dpi, bbox_inches='tight', pad_inches=0) 170 | plt.clf() 171 | -------------------------------------------------------------------------------- /Applications/README.md: -------------------------------------------------------------------------------- 1 | # Applications 2 | 3 | This folder contains all code for reproducing the three numerical experiments (applications) covered in the paper: 4 | * Illustrative Example: A two-road shortest path decision problem. 5 | * Shortest Path: A shortest path decision problem over a 4 x 4 grid network, where driver starts in 6 | southwest corner and tries to find shortest path to northeast corner. 7 | * Yahoo News: A news article recommendation decision problem constructed from the Yahoo! Front Page Today Module dataset. 8 | 9 | Datasets used in the numerical experiments may be found here: 10 | * Shortest Path: https://archive.org/details/spotree_shortestpathdata 11 | * Yahoo News: The Yahoo! Front Page Today dataset may be found at https://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did=49. A license must be obtained from Yahoo to access the dataset. The data may be used only for academic research purposes and may not be used for any commercial purposes or by any commercial entity. Preprocessing scripts are included to format the raw dataset to match the one used in our numerical experiments. 12 | 13 | To reproduce the numerical experiments, merge into a single folder all codes in the Algorithms folder with the codes + data files (unzipped) corresponding to the application of interest. Then, run the relevant application Python script. 14 | 15 | The headers of the application scripts contains all experimental parameter settings used in the paper. 16 | 17 | Code currently only supports Python 2.7 (not Python 3). 18 | Package Dependencies: gurobipy (with valid Gurobi license), numpy, pandas, scipy, joblib 19 | * The Illustrative Example application also depends on matplotlib 20 | -------------------------------------------------------------------------------- /Applications/Shortest Path/SPOForest_nonlinear.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Runs random forest algorithm on shortest path dataset with nonlinear mapping from features to costs ("nonlinear") 3 | This code considers two methods of aggregating individual tree predictions to obtain a forest decision: 4 | - "mean": averages cost predictions for each tree in the forest; outputs decision associated with average cost 5 | - "mode": outputs the mode decision recommended by the trees in the forest 6 | Outputs decision costs for each test-set instance as pickle file 7 | Takes multiple input arguments: 8 | (1) n_train: number of training observations. can take values 200, 10000 9 | (2) eps: parameter (\bar{\epsilon}) in the paper controlling noise in mapping from features to costs. 10 | n_train = 200: can take values 0, 0.25 11 | n_train = 10000: can take values 0, 0.5 12 | (3) deg_set_str: set of deg parameters to try, e.g. "2-10". 13 | deg = parameter "degree" in the paper controlling nonlinearity in mapping from features to costs. 
14 | can try values in {2,10} 15 | (4) reps_st, reps_end: we provide 10 total datasets corresponding to different generated B values (matrix mapping features to costs). 16 | script will run code on problem instances reps_st to reps_end 17 | (5) max_depth_set_str: sequence of training depths tuned using cross validation, e.g. "2-4-5" 18 | (6) min_samples_leaf_set_str: sequence of "min. (weighted) observations per leaf" tuned using cross validation, e.g. "20-50-100" 19 | (7) n_estimators_set_str: sequence of number of trees in forest tuned using cross validation, e.g. "20-50-100" 20 | (8) max_features_set_str: sequence of number of features used in feature bagging tuned using cross validation, e.g. "2-3-4" 21 | (9) aggr_method: method for aggregating individual tree cost predictions to arrive at forest decisions. Either "mean" or "mode" 22 | (10) algtype: set equal to "MSE" (CART forest) or "SPO" (SPOT forest) 23 | (11) number of workers to use in parallel processing (i.e., fitting individual trees in the forest in parallel) 24 | Values of input arguments used in paper: 25 | (1) n_train: consider values 200, 10000 26 | (2) eps: 27 | n_train = 200: considered values 0, 0.25 28 | n_train = 10000: considered values 0, 0.5 29 | (3) deg_set_str: "2-10" 30 | (4) reps_st, reps_end: reps_st = 0, reps_end = 10 31 | (5) max_depth_set_str: "1000" 32 | (6) min_samples_leaf_set_str: "20" 33 | (7) n_estimators_set_str: "100" 34 | (8) max_features_set_str: "2-3-4-5" 35 | (9) aggr_method: "mean" 36 | (10) algtype: "MSE" (CART forest) or "SPO" (SPOT forest) 37 | (11) number of workers to use in parallel processing: 8 38 | ''' 39 | 40 | import time 41 | 42 | import numpy as np 43 | import pickle 44 | from gurobipy import* 45 | from SPOForest import SPOForest 46 | from decision_problem_solver import* 47 | ############################################## 48 | forest_seed = 0 #seed to set random forest rng 49 | ############################################## 50 | import sys 51 | #problem parameters 52 | n_train = int(sys.argv[1])#200 53 | eps = float(sys.argv[2])#0 54 | deg_set_str = sys.argv[3] 55 | deg_set=[int(k) for k in deg_set_str.split('-')]#[2,4,6,8,10] 56 | #evaluate algs of dataset replications from rep_st to rep_end 57 | reps_st = int(sys.argv[4])#0 #can be as low as 0 58 | reps_end = int(sys.argv[5])#1 #can be as high as 50 59 | valid_frac = 0.2 60 | ######################################## 61 | #training parameters 62 | max_depth_set_str = sys.argv[6] 63 | max_depth_set=[int(k) for k in max_depth_set_str.split('-')]#[None] 64 | min_samples_leaf_set_str = sys.argv[7] 65 | min_samples_leaf_set=[int(k) for k in min_samples_leaf_set_str.split('-')]#[5] 66 | n_estimators_set_str = sys.argv[8] 67 | n_estimators_set=[int(k) for k in n_estimators_set_str.split('-')]#[100,500] 68 | max_features_set_str = sys.argv[9] 69 | max_features_set=[int(k) for k in max_features_set_str.split('-')]#[3] 70 | aggr_method=sys.argv[10] #either "mean" or "mode" 71 | algtype=sys.argv[11] #either "MSE" or "SPO" 72 | #number of workers 73 | if sys.argv[12] == "1": 74 | run_in_parallel = False 75 | num_workers = None 76 | else: 77 | run_in_parallel = True 78 | num_workers = int(sys.argv[12]) 79 | ######################################## 80 | #output filename 81 | fname_out = 
algtype+"Forest_nonlin_costs_tr"+str(n_train)+"_eps"+str(eps)+"_degSet"+deg_set_str+"_repsSt"+str(reps_st)+"_repsEnd"+str(reps_end)+"_depthSet"+max_depth_set_str+"_minObsSet"+min_samples_leaf_set_str+"_nEstSet"+n_estimators_set_str+"_mFeatSet"+max_features_set_str+"_aMethod"+aggr_method+".pkl"; 82 | ############################################################################# 83 | ############################################################################# 84 | ############################################################################# 85 | #data = pickle.load(open('non_linear_big_data_dim4.p','rb')) 86 | #'non_linear_data_dim4.p' has the following options: 87 | #n_train: 200, 400, 800 88 | #nonlinear degrees: 8, 2, 4, 10, 6 89 | #eps: 0, 0.25, 0.5 90 | #50 replications of the experiment (0-49) 91 | #dataset characteristics: 5 continuous features x, dimension 4 grid c, 1000 test set observations 92 | if n_train == 10000: 93 | data = pickle.load(open('non_linear_bigdata10000_dim4.p','rb')) 94 | else: 95 | data = pickle.load(open('non_linear_data_dim4.p','rb')) 96 | n_test = 1000 97 | ############################################################################# 98 | ############################################################################# 99 | ############################################################################# 100 | assert(reps_st >= 0) 101 | assert(reps_end <= 50) 102 | n_reps = reps_end-reps_st 103 | 104 | def forest_traintest(train_x,train_cost,test_x,test_cost,n_estimators,max_depth,min_samples_leaf,max_features,run_in_parallel,num_workers,aggr_method, algtype): 105 | if algtype == "MSE": 106 | SPO_weight_param=0.0 107 | elif algtype == "SPO": 108 | SPO_weight_param=1.0 109 | regr = SPOForest(n_estimators=n_estimators,run_in_parallel=run_in_parallel,num_workers=num_workers, 110 | max_depth=max_depth, min_weights_per_node=min_samples_leaf, quant_discret=0.01, debias_splits=False, 111 | max_features=max_features, 112 | SPO_weight_param=SPO_weight_param, SPO_full_error=True) 113 | regr.fit(train_x, train_cost, verbose_forest=True, verbose=False, feats_continuous=True, seed=forest_seed) 114 | pred_decision = regr.est_decision(test_x, method=aggr_method) 115 | return regr, np.mean([np.sum(test_cost[i] * pred_decision[i]) for i in range(0,len(pred_decision))]) 116 | 117 | def forest_tuneparams(train_x,train_cost,valid_x,valid_cost,n_estimators_set,max_depth_set,min_samples_leaf_set,max_features_set,run_in_parallel,num_workers,aggr_method, algtype): 118 | best_err = np.float("inf") 119 | for n_estimators in n_estimators_set: 120 | for max_depth in max_depth_set: 121 | for min_samples_leaf in min_samples_leaf_set: 122 | for max_features in max_features_set: 123 | regr, err = forest_traintest(train_x,train_cost,valid_x,valid_cost,n_estimators,max_depth,min_samples_leaf,max_features,run_in_parallel,num_workers,aggr_method, algtype) 124 | if err <= best_err: 125 | best_regr, best_err, best_n_estimators,best_max_depth,best_min_samples_leaf,best_max_features = regr, err, n_estimators,max_depth,min_samples_leaf,max_features 126 | 127 | print("Best n_estimators: " + str(best_n_estimators)) 128 | print("Best max_depth: " + str(best_max_depth)) 129 | print("Best min_samples_leaf: " + str(best_min_samples_leaf)) 130 | print("Best max_features: " + str(best_max_features)) 131 | return best_regr, best_err, best_n_estimators,best_max_depth,best_min_samples_leaf,best_max_features 132 | 133 | #costs_deg[deg] yields a n_reps*n_test matrix of costs corresponding to the experimental data for deg, 
i.e. 134 | #costs_deg[deg][i][j] gives the observed cost on test set i (0-49) example j (0-(n_test-1)) 135 | costs_deg = {} 136 | 137 | for deg in deg_set: 138 | costs_deg[deg] = np.zeros((n_reps,n_test)) 139 | 140 | for trial_num in range(reps_st,reps_end): 141 | train_x,train_cost,test_x,test_cost = data[n_train][deg][eps][trial_num] 142 | print "Deg "+str(deg)+", Trial Number "+str(trial_num)+" out of " + str(reps_end) 143 | 144 | #split up training data into train/valid split 145 | n_valid = int(np.floor(n_train*valid_frac)) 146 | valid_x = train_x[:n_valid] 147 | valid_cost = train_cost[:n_valid] 148 | train_x = train_x[n_valid:] 149 | train_cost = train_cost[n_valid:] 150 | 151 | start = time.time() 152 | 153 | #FIT FOREST 154 | regr,_,_,_,_,_ = forest_tuneparams(train_x,train_cost,valid_x,valid_cost,n_estimators_set,max_depth_set,min_samples_leaf_set,max_features_set, run_in_parallel, num_workers, aggr_method, algtype) 155 | 156 | end = time.time() 157 | print "Elapsed time: " + str(end-start) 158 | 159 | #FIND TEST SET COST 160 | pred_decision = regr.est_decision(test_x, method=aggr_method) 161 | incurred_test_cost = [np.sum(test_cost[i] * pred_decision[i]) for i in range(0,len(pred_decision))] 162 | 163 | print "Average test set cost: " + str(np.mean(incurred_test_cost)) 164 | 165 | costs_deg[deg][trial_num] = incurred_test_cost 166 | 167 | # Saving the objects occasionally: 168 | if trial_num % 25 == 0: 169 | with open(fname_out, 'wb') as output: 170 | pickle.dump(costs_deg, output, pickle.HIGHEST_PROTOCOL) 171 | 172 | with open(fname_out, 'wb') as output: 173 | pickle.dump(costs_deg, output, pickle.HIGHEST_PROTOCOL) 174 | 175 | # Getting back the objects: 176 | #with open(fname_out, 'rb') as input: 177 | # costs_deg = pickle.load(input) -------------------------------------------------------------------------------- /Applications/Shortest Path/SPOgreedy_nonlinear.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Runs SPOT (greedy) / CART algorithm on shortest path dataset with nonlinear mapping from features to costs ("nonlinear") 3 | Outputs algorithm decision costs for each test-set instance as pickle file 4 | Also outputs optimal decision costs for each test-set instance as pickle file 5 | Takes multiple input arguments: 6 | (1) n_train: number of training observations. can take values 200, 10000 7 | (2) eps: parameter (\bar{\epsilon}) in the paper controlling noise in mapping from features to costs. 8 | n_train = 200: can take values 0, 0.25 9 | n_train = 10000: can take values 0, 0.5 10 | (3) deg_set_str: set of deg parameters to try, e.g. "2-10". 11 | deg = parameter "degree" in the paper controlling nonlinearity in mapping from features to costs. 12 | can try values in {2,10} 13 | (4) reps_st, reps_end: we provide 10 total datasets corresponding to different generated B values (matrix mapping features to costs). 14 | script will run code on problem instances reps_st to reps_end 15 | (5) max_depth: training depth of tree, e.g. "5" 16 | (6) min_weights_per_node: min. number of (weighted) observations per leaf, e.g. 
"100" 17 | (7) algtype: set equal to "MSE" (CART) or "SPO" (SPOT greedy) 18 | Values of input arguments used in paper: 19 | (1) n_train: consider values 200, 10000 20 | (2) eps: 21 | n_train = 200: considered values 0, 0.25 22 | n_train = 10000: considered values 0, 0.5 23 | (3) deg_set_str: "2-10" 24 | (4) reps_st, reps_end: reps_st = 0, reps_end = 10 25 | (5) max_depth: 26 | n_train = 200: considered depths of 1, 2, 3, 1000 27 | n_train = 10000: considered depths of 2, 4, 6, 1000 28 | (6) min_weights_per_node: 20 29 | (7) algtype: "MSE" (CART) or "SPO" (SPOT greedy) 30 | ''' 31 | 32 | import time 33 | 34 | import numpy as np 35 | import pickle 36 | from SPO_tree_greedy import SPOTree 37 | from decision_problem_solver import* 38 | import sys 39 | #problem parameters 40 | n_train = int(sys.argv[1])#200 41 | eps = float(sys.argv[2])#0 42 | deg_set_str = sys.argv[3] 43 | deg_set=[int(k) for k in deg_set_str.split('-')]#[2,4,6,8,10] 44 | #evaluate algs of dataset replications from rep_st to rep_end 45 | reps_st = int(sys.argv[4])#0 #can be as low as 0 46 | reps_end = int(sys.argv[5])#1 #can be as high as 50 47 | valid_frac = 0.2 #set aside valid_frac of training data for validation 48 | ######################################## 49 | #training parameters 50 | max_depth = int(sys.argv[6])#3 51 | min_weights_per_node = int(sys.argv[7])#20 52 | algtype = sys.argv[8] #either "MSE" or "SPO" 53 | ######################################## 54 | #output filename for alg 55 | fname_out = algtype+"greedy_nonlin_costs_tr"+str(n_train)+"_eps"+str(eps)+"_degSet"+deg_set_str+"_repsSt"+str(reps_st)+"_repsEnd"+str(reps_end)+"_depth"+str(max_depth)+"_minObs"+str(min_weights_per_node)+".pkl"; 56 | #output filename for opt costs 57 | fname_out_opt = "Opt_nonlin_costs_tr"+str(n_train)+"_eps"+str(eps)+"_degSet"+deg_set_str+"_repsSt"+str(reps_st)+"_repsEnd"+str(reps_end)+".pkl"; 58 | ############################################################################# 59 | ############################################################################# 60 | ############################################################################# 61 | #data = pickle.load(open('non_linear_big_data_dim4.p','rb')) 62 | #'non_linear_data_dim4.p' has the following options: 63 | #n_train: 200, 400, 800 64 | #nonlinear degrees: 8, 2, 4, 10, 6 65 | #eps: 0, 0.25, 0.5 66 | #50 replications of the experiment (0-49) 67 | #dataset characteristics: 5 continuous features x, dimension 4 grid c, 1000 test set observations 68 | if n_train == 10000: 69 | data = pickle.load(open('non_linear_bigdata10000_dim4.p','rb')) 70 | else: 71 | data = pickle.load(open('non_linear_data_dim4.p','rb')) 72 | n_test = 1000 73 | ############################################################################# 74 | ############################################################################# 75 | ############################################################################# 76 | assert(reps_st >= 0) 77 | assert(reps_end <= 50) 78 | n_reps = reps_end-reps_st 79 | 80 | #costs_deg[deg] yields a n_reps*n_test matrix of costs corresponding to the experimental data for deg, i.e. 
81 | #costs_deg[deg][i][j] gives the observed cost on test set i (0-49) example j (0-(n_test-1)) 82 | costs_deg = {} 83 | optcosts_deg = {} #optimal costs 84 | 85 | for deg in deg_set: 86 | costs_deg[deg] = np.zeros((n_reps,n_test)) 87 | optcosts_deg[deg] = np.zeros((n_reps,n_test)) 88 | 89 | for trial_num in range(reps_st,reps_end): 90 | train_x,train_cost,test_x,test_cost = data[n_train][deg][eps][trial_num] 91 | print "Deg "+str(deg)+", Trial Number "+str(trial_num)+" out of " + str(reps_end) 92 | 93 | #split up training data into train/valid split 94 | n_valid = int(np.floor(n_train*valid_frac)) 95 | valid_x = train_x[:n_valid] 96 | valid_cost = train_cost[:n_valid] 97 | train_x = train_x[n_valid:] 98 | train_cost = train_cost[n_valid:] 99 | 100 | start = time.time() 101 | 102 | #FIT ALGORITHM 103 | #SPO_weight_param: number between 0 and 1: 104 | #Error metric: SPO_loss*SPO_weight_param + MSE_loss*(1-SPO_weight_param) 105 | if algtype == "MSE": 106 | SPO_weight_param=0.0 107 | elif algtype == "SPO": 108 | SPO_weight_param=1.0 109 | my_tree = SPOTree(max_depth = max_depth, min_weights_per_node = min_weights_per_node, quant_discret = 0.01, debias_splits=False, SPO_weight_param=SPO_weight_param, SPO_full_error=True) 110 | my_tree.fit(train_x,train_cost,verbose=False,feats_continuous=True); #verbose specifies whether fitting procedure should print progress 111 | #my_tree.traverse() #prints out the unpruned tree 112 | 113 | end = time.time() 114 | print "Elapsed time: " + str(end-start) 115 | 116 | #PRUNE DECISION TREE USING VALIDATION SET 117 | my_tree.prune(valid_x, valid_cost, verbose=False, one_SE_rule=False) #verbose specifies whether pruning procedure should print progress 118 | #my_tree.traverse() #prints out the pruned tree 119 | 120 | #FIND TEST SET ALGORITHM COST AND OPTIMAL COST 121 | opt_decision = find_opt_decision(test_cost)['weights'] 122 | pred_decision = my_tree.est_decision(test_x) 123 | costs_deg[deg][trial_num] = [np.sum(test_cost[i] * pred_decision[i]) for i in range(0,len(pred_decision))] 124 | optcosts_deg[deg][trial_num] = [np.sum(test_cost[i] * opt_decision[i,:]) for i in range(0,opt_decision.shape[0])] 125 | 126 | # Saving the objects occasionally: 127 | if trial_num % 25 == 0: 128 | with open(fname_out, 'wb') as output: 129 | pickle.dump(costs_deg, output, pickle.HIGHEST_PROTOCOL) 130 | with open(fname_out_opt, 'wb') as output: 131 | pickle.dump(optcosts_deg, output, pickle.HIGHEST_PROTOCOL) 132 | 133 | with open(fname_out, 'wb') as output: 134 | pickle.dump(costs_deg, output, pickle.HIGHEST_PROTOCOL) 135 | with open(fname_out_opt, 'wb') as output: 136 | pickle.dump(optcosts_deg, output, pickle.HIGHEST_PROTOCOL) 137 | 138 | # Getting back the objects: 139 | #with open(fname_out, 'rb') as input: 140 | # costs_deg = pickle.load(input) -------------------------------------------------------------------------------- /Applications/Shortest Path/SPOopt_nonlinear.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Runs SPOT (MILP) algorithm on shortest path dataset with nonlinear mapping from features to costs ("nonlinear") 3 | Outputs decision costs for each test-set instance as pickle file 4 | Note on notation: the paper uses r to denote the binary variables which map training observations to leaves. This code uses z rather than r. 5 | Takes multiple input arguments: 6 | (1) n_train: number of training observations. 
can take values 200, 10000 7 | (2) eps: parameter (\bar{\epsilon}) in the paper controlling noise in mapping from features to costs. 8 | n_train = 200: can take values 0, 0.25 9 | n_train = 10000: can take values 0, 0.5 10 | (3) deg_set_str: set of deg parameters to try, e.g. "2-10". 11 | deg = parameter "degree" in the paper controlling nonlinearity in mapping from features to costs. 12 | can try values in {2,10} 13 | (4) reps_st, reps_end: we provide 10 total datasets corresponding to different generated B values (matrix mapping features to costs). 14 | script will run code on problem instances reps_st to reps_end 15 | (5) H: training depth of tree, e.g. "5" 16 | (6) N_min: min. number of (weighted) observations per leaf, e.g. "100" 17 | (7) train_x_precision: contextual features x are rounded to train_x_precision before fitting MILP (e.g., 2 = two decimal places) 18 | higher values of train_x_precision will be more precise but take more computational time 19 | (8) reg_set_str: sequence of regularization parameters to try (tuned using cross validation), e.g. "0.001-0.01-0.1" 20 | if "None", fits MILP using no regularizaiton and then prunes using CART pruning procedure (with SPO loss as pruning metric) 21 | (9) solver_time_limit: MILP solver is terminated after solver_time_limit seconds, returning best-found solution 22 | Values of input arguments used in paper: 23 | (1) n_train: consider values 200, 10000 24 | (2) eps: 25 | n_train = 200: considered values 0, 0.25 26 | n_train = 10000: considered values 0, 0.5 27 | (3) deg_set_str: "2-10" 28 | (4) reps_st, reps_end: reps_st = 0, reps_end = 10 29 | (5) H: 30 | n_train = 200: considered depths of 1, 2, 3, 1000 31 | n_train = 10000: considered depths of 2, 4, 6, 1000 32 | (6) N_min: 20 33 | (7) train_x_precision: 2 34 | (8) reg_set_str: "None" 35 | (9) solver_time_limit: 16200 36 | ''' 37 | 38 | import time 39 | 40 | import numpy as np 41 | import pickle 42 | from spo_opt_tree_nonlinear import* 43 | from SPO2CART import SPO2CART 44 | from SPO_tree_greedy import SPOTree 45 | from decision_problem_solver import* 46 | import sys 47 | #problem parameters 48 | n_train = int(sys.argv[1])#200 49 | eps = float(sys.argv[2])#0 50 | deg_set_str = sys.argv[3] 51 | deg_set=[int(k) for k in deg_set_str.split('-')]#[2,4,6,8,10] 52 | #evaluate algs of dataset replications from rep_st to rep_end 53 | reps_st = int(sys.argv[4])#0 #can be as low as 0 54 | reps_end = int(sys.argv[5])#1 #can be as high as 50 55 | valid_frac = 0.2 56 | ######################################## 57 | #training parameters 58 | #optimal tree params 59 | H = int(sys.argv[6])#2 #H = max tree depth 60 | N_min = int(sys.argv[7])#4 #N_min = minimum number of observations per leaf node 61 | #higher values of train_x_precision will be more precise but take more computational time 62 | #values >= 8 might cause numerical errors 63 | train_x_precision = int(sys.argv[8])#2 64 | reg_set_str = sys.argv[9]#"None" 65 | if reg_set_str == "None": 66 | reg_set = None 67 | else: 68 | reg_set = [int(k) for k in reg_set_str.split('-')]#None 69 | #reg_set = [0.001] #if reg_set = None, uses CART to prune tree 70 | solver_time_limit = int(sys.argv[10]) 71 | ######################################## 72 | #output filename 73 | fname_out = "SPOopt_nonlin_costs_tr"+str(n_train)+"_eps"+str(eps)+"_degSet"+deg_set_str+"_repsSt"+str(reps_st)+"_repsEnd"+str(reps_end)+"_depth"+str(H)+"_minObs"+str(N_min)+"_prec"+str(train_x_precision)+"_regset"+reg_set_str+"_tLim"+str(solver_time_limit)+".pkl"; 74 | 
############################################################################# 75 | ############################################################################# 76 | ############################################################################# 77 | #data = pickle.load(open('non_linear_big_data_dim4.p','rb')) 78 | #'non_linear_data_dim4.p' has the following options: 79 | #n_train: 200, 400, 800 80 | #nonlinear degrees: 8, 2, 4, 10, 6 81 | #eps: 0, 0.25, 0.5 82 | #50 replications of the experiment (0-49) 83 | #dataset characteristics: 5 continuous features x, dimension 4 grid c, 1000 test set observations 84 | if n_train == 10000: 85 | data = pickle.load(open('non_linear_bigdata10000_dim4.p','rb')) 86 | else: 87 | data = pickle.load(open('non_linear_data_dim4.p','rb')) 88 | n_test = 1000 89 | ############################################################################# 90 | ############################################################################# 91 | ############################################################################# 92 | assert(reps_st >= 0) 93 | assert(reps_end <= 50) 94 | n_reps = reps_end-reps_st 95 | 96 | #costs_deg[deg] yields a n_reps*n_test matrix of costs corresponding to the experimental data for deg, i.e. 97 | #costs_deg[deg][i][j] gives the observed cost on test set i (0-49) example j (0-(n_test-1)) 98 | costs_deg = {} 99 | 100 | for deg in deg_set: 101 | costs_deg[deg] = np.zeros((n_reps,n_test)) 102 | 103 | for trial_num in range(reps_st,reps_end): 104 | train_x,train_cost,test_x,test_cost = data[n_train][deg][eps][trial_num] 105 | print "Deg "+str(deg)+", Trial Number "+str(trial_num)+" out of " + str(reps_end) 106 | 107 | #split up training data into train/valid split 108 | n_valid = int(np.floor(n_train*valid_frac)) 109 | valid_x = train_x[:n_valid] 110 | valid_cost = train_cost[:n_valid] 111 | train_x = train_x[n_valid:] 112 | train_cost = train_cost[n_valid:] 113 | 114 | start = time.time() 115 | 116 | #FIT SPO OPTIMAL TREE 117 | if reg_set is None: 118 | 119 | #FIT SPO GREEDY TREE AS INITIAL SOLUTION 120 | def truncate_train_x(train_x, train_x_precision): 121 | return(np.around(train_x, decimals=train_x_precision)) 122 | train_x_truncated = truncate_train_x(train_x, train_x_precision) 123 | #SPO_weight_param: number between 0 and 1: 124 | #Error metric: SPO_loss*SPO_weight_param + MSE_loss*(1-SPO_weight_param) 125 | my_tree = SPOTree(max_depth = H, min_weights_per_node = N_min, quant_discret = 0.01, debias_splits=False, SPO_weight_param=1.0, SPO_full_error=True) 126 | my_tree.fit(train_x_truncated,train_cost,verbose=False,feats_continuous=True); 127 | 128 | #PRUNE SPO GREEDY TREE USING TRAINING SET (TO GET RID OF REDUNDANT LEAVES) 129 | my_tree.prune(train_x_truncated, train_cost, verbose=False, one_SE_rule=False) #verbose specifies whether pruning procedure should print progress 130 | spo_greedy_a, spo_greedy_b, spo_greedy_z = my_tree.get_tree_encoding(x_train=train_x_truncated) 131 | 132 | #(OPTIONAL) PRUNE SPO GREEDY TREE USING VALIDATION SET 133 | # my_tree.prune(valid_x, valid_cost, verbose=False, one_SE_rule=False) #verbose specifies whether pruning procedure should print progress 134 | # spo_greedy_a, spo_greedy_b, spo_greedy_z = my_tree.get_tree_encoding(x_train=train_x_truncated) 135 | # alpha = my_tree.get_pruning_alpha() 136 | 137 | #FIT SPO OPTIMAL TREE USING FOUND INITIAL SOLUTION 138 | reg_param = 1e-4 #introduce very small amount of regularization to ensure leaves with zero predictive power are aggregated 139 | spo_dt_a, spo_dt_b, spo_dt_w, 
spo_dt_y, spo_dt_z, spo_dt_l, spo_dt_d = spo_opt_tree(train_cost,train_x,train_x_precision,reg_param, N_min, H, 140 | a_start=spo_greedy_a, z_start=spo_greedy_z, 141 | Presolve=2, Seed=0, TimeLimit=solver_time_limit, 142 | returnAllOptvars=True) 143 | 144 | end = time.time() 145 | print "Elapsed time: " + str(end-start) 146 | 147 | #(IF NOT USING POSTPRUNING) FIND TEST SET COST 148 | # path = decision_path(test_x,spo_dt_a,spo_dt_b) 149 | # costs_deg[deg][trial_num] = apply_leaf_decision(test_cost,path,spo_dt_w, subtract_optimal=False) 150 | 151 | #PRUNE MILP TREE USING CART PRUNING METHOD ON VALIDATION SET 152 | spo2cart = SPO2CART(spo_dt_a, spo_dt_b) 153 | spo2cart.fit(train_x,train_cost,train_x_precision,verbose=False,feats_continuous=True) 154 | spo2cart.prune(valid_x, valid_cost, verbose=False, one_SE_rule=False) #verbose specifies whether pruning procedure should print progress 155 | #my_tree.traverse() #prints out the pruned tree 156 | #(IF PRUNED) FIND TEST SET COST 157 | pred_decision = spo2cart.est_decision(test_x) 158 | costs_deg[deg][trial_num] = [np.sum(test_cost[i] * pred_decision[i]) for i in range(0,len(pred_decision))] 159 | 160 | else: 161 | spo_dt_a, spo_dt_b, spo_dt_w, _, best_alpha = spo_opt_tunealpha(train_x,train_cost,valid_x,valid_cost,train_x_precision,reg_set, N_min, H) 162 | print("Best Alpha: " + best_alpha) 163 | 164 | end = time.time() 165 | print "Elapsed time: " + str(end-start) 166 | 167 | #FIND TEST SET COST 168 | path = decision_path(test_x,spo_dt_a,spo_dt_b) 169 | costs_deg[deg][trial_num] = apply_leaf_decision(test_cost,path,spo_dt_w, subtract_optimal=False) 170 | 171 | # Saving the objects occasionally: 172 | if trial_num % 5 == 0: 173 | with open(fname_out, 'wb') as output: 174 | pickle.dump(costs_deg, output, pickle.HIGHEST_PROTOCOL) 175 | 176 | with open(fname_out, 'wb') as output: 177 | pickle.dump(costs_deg, output, pickle.HIGHEST_PROTOCOL) 178 | 179 | # Getting back the objects: 180 | #with open(fname_out, 'rb') as input: 181 | # costs_deg = pickle.load(input) -------------------------------------------------------------------------------- /Applications/Shortest Path/decision_problem_solver.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Generic file to set up the decision problem (i.e., optimization problem) under consideration 3 | Must have functions: 4 | get_num_decisions(): returns number of decision variables (i.e., dimension of cost vector) 5 | find_opt_decision(): returns for a matrix of cost vectors the corresponding optimal decisions for those cost vectors 6 | 7 | This particular file sets up a shortest path decision problem over a 4 x 4 grid network, where driver starts in 8 | southwest corner and tries to find shortest path to northeast corner. 
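A minimal usage sketch (illustrative only; the random cost matrix below is hypothetical, with one row per
observation and one column per edge of the 4 x 4 grid, i.e. D = 24 edges as enumerated in Edge_list below):
    costs = np.random.rand(5, get_num_decisions())  # 5 hypothetical observations, one nonnegative cost per edge
    sol = find_opt_decision(costs)
    sol['weights']    # array of shape (5, 24): flow placed on each edge by the optimal (shortest) path
    sol['objective']  # array of shape (5,): total cost of the optimal path under each cost vector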
9 | ''' 10 | 11 | from gurobipy import * 12 | import numpy as np 13 | 14 | dim = 4 #(creates dim * dim grid, where dim = number of vertices) 15 | Edge_list = [(i,i+1) for i in range(1, dim**2 + 1) if i % dim != 0] 16 | Edge_list += [(i, i + dim) for i in range(1, dim**2 + 1) if i <= dim**2 - dim] 17 | Edge_dict = {} #(assigns each edge to a unique integer from 0 to number-of-edges) 18 | for index, edge in enumerate(Edge_list): 19 | Edge_dict[edge] = index 20 | D = len(Edge_list) # D = number of decisions 21 | 22 | def get_num_decisions(): 23 | return D 24 | 25 | Edges = tuplelist(Edge_list) 26 | # Find the optimal total cost for an observation in the context of shortes path 27 | m_shortest_path = Model('shortest_path') 28 | m_shortest_path.Params.OutputFlag = 0 29 | flow = m_shortest_path.addVars(Edges, ub = 1, name = 'flow') 30 | m_shortest_path.addConstrs((quicksum(flow[i,j] for i,j in Edges.select(i,'*')) - quicksum(flow[k, i] for k,i in Edges.select('*', 31 | i)) == 0 for i in range(2, dim**2)), name = 'inner_nodes') 32 | m_shortest_path.addConstr((quicksum(flow[i,j] for i,j in Edges.select(1, '*')) == 1), name = 'start_node') 33 | m_shortest_path.addConstr((quicksum(flow[i,j] for i,j in Edges.select('*', dim**2)) == 1), name = 'end_node') 34 | 35 | def shortest_path(cost): 36 | # m_shortest_path.setObjective(quicksum(flow[i,j] * cost[Edge_dict[(i,j)]] for i,j in Edges), GRB.MINIMIZE) 37 | m_shortest_path.setObjective(LinExpr( [ (cost[Edge_dict[(i,j)]],flow[i,j] ) for i,j in Edges]), GRB.MINIMIZE) 38 | m_shortest_path.optimize() 39 | return {'weights': m_shortest_path.getAttr('x', flow), 'objective': m_shortest_path.objVal} 40 | 41 | def find_opt_decision(cost): 42 | weights = np.zeros(cost.shape) 43 | objective = np.zeros(cost.shape[0]) 44 | for i in range(cost.shape[0]): 45 | temp = shortest_path(cost[i,:]) 46 | for edge in Edges: 47 | weights[i, Edge_dict[edge]] = temp['weights'][edge] 48 | objective[i] = temp['objective'] 49 | return {'weights': weights, 'objective':objective} 50 | -------------------------------------------------------------------------------- /Applications/Shortest Path/spo_opt_tree_nonlinear.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Code for fitting SPOT MILP on shortest paths dataset. 3 | Note that this code is specifically designed for the shortest paths application and 4 | will not work for other applications without modifying the constraints. 5 | Note on notation: the paper uses r to denote the binary variables which map training observations to leaves. This code uses z rather than r. 
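A minimal end-to-end sketch of the entry points defined below (hyperparameter values are illustrative, similar
to those used by SPOopt_nonlinear.py; train_x, train_cost, test_x, test_cost are placeholder numpy arrays shaped
as in the driver scripts, with features in [0,1] and nonnegative costs as asserted below):
    # (a, b) encode the splits at the branch nodes; w holds the leaf decisions (edge flows)
    a, b, w = spo_opt_tree(train_cost, train_x, train_x_precision=2, spo_opt_tree_reg=1e-4, N_min=20, H=2)
    # route each test observation to its leaf and evaluate that leaf's decision against the realized costs
    paths = decision_path(test_x, a, b)
    test_decision_costs = apply_leaf_decision(test_cost, paths, w, subtract_optimal=False)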
6 | ''' 7 | 8 | 9 | import numpy as np 10 | import pandas as pd 11 | from gurobipy import* 12 | import pickle 13 | from decision_problem_solver import* 14 | 15 | # Helper functions for tree structure 16 | def find_parent_index(t): 17 | return (t+1)//2 - 1 18 | 19 | def find_ancestors(t): 20 | l= [] 21 | r = [] 22 | if t == 0: 23 | return 24 | else: 25 | while find_parent_index(t) !=0: 26 | parent = find_parent_index(t) 27 | if (t+1)% (1+parent) ==1: 28 | r.append(parent) 29 | else: 30 | l.append(parent) 31 | t = parent 32 | if t==2: 33 | r.append(0) 34 | else: 35 | l.append(0) 36 | return[l,r] 37 | 38 | #truncate training set features to desired precision 39 | def truncate_train_x(train_x, train_x_precision): 40 | return(np.around(train_x, decimals=train_x_precision)) 41 | 42 | 43 | #trains an optimal tree model on train_cost, train_x, and reg parameter spo_opt_tree_reg (scalar) 44 | #returns parameter encoding of the opimal tree (a,b,w). (a,b) encode splits, (w) encode leaf decisions 45 | #optimal tree params: 46 | #N_min = minimum number of observations per leaf node 47 | #H = max tree depth 48 | #def spo_opt_tree(train_cost, train_x,spo_opt_tree_reg): 49 | def spo_opt_tree(train_cost, train_x, train_x_precision, spo_opt_tree_reg, N_min, H, returnAllOptvars=False, 50 | a_start=None, b_start=None, w_start=None, y_start=None, z_start=None, l_start=None, d_start=None, 51 | threads=None, MIPGap=None, MIPFocus=None, verbose=False, Seed=None, TimeLimit=None, 52 | Presolve=None, ImproveStartTime=None, VarBranch=None, Cuts=None, 53 | tune=False, TuneCriterion=None, TuneJobs=None, TuneTimeLimit=None, TuneTrials=None, tune_foutpref=None): 54 | assert(spo_opt_tree_reg >= 1e-4) 55 | # We label all nodes of the tree by 0, 1, 2, ... 2**(H+1) - 1. 56 | T_B = 2**H - 1 57 | T_L = 2**H 58 | #Edge_list comes from importing shortest_path_solver 59 | Edges_w_t = tuplelist([(i,j,t) for i,j in Edge_list for t in range(T_L)]) 60 | 61 | n_train, P = train_x.shape 62 | #truncate x features so eps (below) will not be too small 63 | train_x = truncate_train_x(train_x, train_x_precision) 64 | 65 | assert(np.all(train_x >= 0)) 66 | assert(np.all(train_x <= 1)) 67 | assert(np.all(train_cost >= 0)) #assert nonnegative costs 68 | assert(np.all(train_cost.shape[0] == train_x.shape[0])) 69 | # Instantiate optimization model 70 | # Compute average optimal cost across all training set observations 71 | # (Although irrelevant for the optimization problem, it helps in interpreting alpha) 72 | optimal_costs = np.zeros(train_x.shape[0]) 73 | for i in range(train_x.shape[0]): 74 | optimal_costs[i] = find_opt_decision(train_cost[i,:].reshape(1,-1))['objective'][0] 75 | sum_optimal_cost = sum(optimal_costs) 76 | 77 | # Compute big M constant 78 | M = 0 79 | for i in range(train_x.shape[0]): 80 | longest_path_cost = -find_opt_decision(-train_cost[i,:].reshape(1,-1))['objective'][0] 81 | if longest_path_cost >= M: 82 | M = longest_path_cost 83 | #M = train_cost.max()*(dim-1)*2 84 | spo = Model('spo_opt_tree') 85 | if verbose == False: 86 | spo.Params.OutputFlag = 0 87 | #compute epsilon constants 88 | #eps = np.float("inf") 89 | #for j in range(train_x.shape[1]): 90 | #ordered_feat = np.sort(train_x[:,j]) 91 | #diffs = ordered_feat[1:]-ordered_feat[:-1] 92 | #nonzero_diffs = diffs[diffs > 0] 93 | #if min(nonzero_diffs) <= eps: 94 | #eps = min(nonzero_diffs) 95 | #one_plus_eps = 1 + eps 96 | 97 | eps = np.array([np.float("inf")]*train_x.shape[1]) 98 | for j in range(train_x.shape[1]): 99 | ordered_feat = np.sort(train_x[:,j]) 100 | 
diffs = ordered_feat[1:]-ordered_feat[:-1] 101 | nonzero_diffs = diffs[diffs > 0] 102 | eps[j] = min(nonzero_diffs) 103 | one_plus_eps_max = 1 + max(eps) 104 | 105 | #run params 106 | if threads is not None: 107 | spo.Params.Threads = threads 108 | if MIPGap is not None: 109 | spo.Params.MIPGap = MIPGap # default = 1e-4, try 1e-2 110 | if MIPFocus is not None: 111 | spo.Params.MIPFocus = MIPFocus 112 | if Seed is not None: 113 | spo.Params.Seed = Seed 114 | if TimeLimit is not None: 115 | spo.Params.TimeLimit = TimeLimit 116 | if Presolve is not None: 117 | spo.Params.Presolve = Presolve 118 | if ImproveStartTime is not None: 119 | spo.Params.ImproveStartTime = ImproveStartTime 120 | if VarBranch is not None: 121 | spo.Params.VarBranch = VarBranch 122 | if Cuts is not None: 123 | spo.Params.Cuts = Cuts 124 | 125 | #tune params 126 | if tune == True and TuneCriterion is not None: 127 | spo.Params.TuneCriterion = TuneCriterion 128 | if tune == True and TuneJobs is not None: 129 | spo.Params.TuneJobs = TuneJobs 130 | if tune == True and TuneTimeLimit is not None: 131 | spo.Params.TuneTimeLimit = TuneTimeLimit 132 | if tune == True and TuneTrials is not None: 133 | spo.Params.TuneTrials = TuneTrials 134 | 135 | # Add variables 136 | y = spo.addVars(tuplelist([(i, t) for i in range(n_train) for t in range(T_L)]), lb = 0,name = 'y') 137 | z = spo.addVars(tuplelist([(i, t) for i in range(n_train) for t in range(T_L)]), vtype=GRB.BINARY, name = 'z') 138 | #z = spo.addVars(tuplelist([(i, t) for i in range(n_train) for t in range(T_L)]), lb = 0, ub = 1, name = 'z') 139 | w = spo.addVars(tuplelist([(t, j) for t in range(T_L) for j in range(D)]), lb = 0,name = 'w') 140 | l = spo.addVars(tuplelist([i for i in range(T_L)]), vtype=GRB.BINARY, name = 'l') 141 | d = spo.addVars(tuplelist([i for i in range(T_B)]), vtype=GRB.BINARY, name = 'd') 142 | a = spo.addVars(tuplelist([(j,t) for j in range(P) for t in range(T_B)]), vtype=GRB.BINARY, name = 'a') 143 | #b = spo.addVars(tuplelist([i for i in range(T_B)]), lb = 0, name = 'b') 144 | b = spo.addVars(tuplelist([i for i in range(T_B)]), ub = 1, name = 'b') 145 | 146 | if a_start is not None: 147 | for i in range(P): 148 | for j in range(T_B): 149 | a[i,j].start = a_start[i,j] 150 | 151 | if b_start is not None: 152 | for i in range(T_B): 153 | b[i].start = b_start[i] 154 | 155 | if w_start is not None: 156 | for i in range(T_L): 157 | for j in range(D): 158 | w[i,j].start = w_start[i,j] 159 | 160 | if y_start is not None: 161 | for i in range(n_train): 162 | for j in range(T_L): 163 | y[i,j].start = y_start[i,j] 164 | 165 | if z_start is not None: 166 | for i in range(n_train): 167 | for j in range(T_L): 168 | z[i,j].start = z_start[i,j] 169 | 170 | if l_start is not None: 171 | for i in range(T_L): 172 | l[i].start = l_start[i] 173 | 174 | if d_start is not None: 175 | for i in range(T_B): 176 | d[i].start = d_start[i] 177 | 178 | spo.update() #for initial values to be written immediately 179 | 180 | # if a_start is not None: 181 | # for i in range(P): 182 | # for j in range(T_B): 183 | # print(a[i,j].start) 184 | # 185 | # if b_start is not None: 186 | # for i in range(T_B): 187 | # print(b[i].start) 188 | # 189 | # if w_start is not None: 190 | # for i in range(T_L): 191 | # for j in range(D): 192 | # print(w[i,j].start) 193 | # 194 | # if y_start is not None: 195 | # for i in range(n_train): 196 | # for j in range(T_L): 197 | # print(y[i,j].start) 198 | # 199 | # if z_start is not None: 200 | # for i in range(n_train): 201 | # for j in range(T_L): 202 | 
# print(z[i,j].start) 203 | # 204 | # if l_start is not None: 205 | # for i in range(T_L): 206 | # print(l[i].start) 207 | # 208 | # if d_start is not None: 209 | # for i in range(T_B): 210 | # print(d[i].start) 211 | 212 | 213 | 214 | # Add constraints 215 | # Const 7b 216 | for i in range(n_train): 217 | for t in range(T_L): 218 | #expr_constraint = LinExpr([ (train_cost[i,j], w[t,j]) for key,j in Edge_dict.items()]) 219 | expr_constraint = LinExpr([ (train_cost[i,j], w[t,j]) for j in range(D)]) 220 | spo.addConstr(y[i,t] >= expr_constraint - M * (1 - z[i,t])) 221 | # spo.addConstr(y[i,t] >= quicksum(train_cost[i,j] * w[t,j] for key,j in Edge_dict.items())- M * (1 - z[i,t])) 222 | 223 | # # Const 7c (genreral constraint for feasibility of nominal problem Aw <= B) 224 | # for t in range(T_L): 225 | # for i in range(K): 226 | # spo.addConstr(quicksum(A[i,j] * w[t,j] for j in range(D)) <= B[i] ) 227 | 228 | # Const 7c (constraint for feasibility of shortest_path problem) 229 | flow = spo.addVars(Edges_w_t, lb = 0, name = 'flow') 230 | spo.addConstrs((quicksum(flow[i,j,t] for i,j,t in Edges_w_t.select(i,'*',t)) - quicksum(flow[k,i,t] for k,i,t in Edges_w_t.select('*',i,t)) == 0 231 | for i in range(2, dim**2) for t in range(T_L) )) 232 | spo.addConstrs((quicksum(flow[i,j,t] for i,j,t in Edges_w_t.select(1, '*',t)) == 1 for t in range(T_L))) 233 | spo.addConstrs((quicksum(flow[i,j,t] for i,j,t in Edges_w_t.select('*', dim**2,t)) == 1 for t in range(T_L))) 234 | spo.addConstrs( w[t,Edge_dict[(i,j)]] - flow[i,j,t] == 0 for i,j,t in Edges_w_t) # Map shortest path flow to w_t 235 | 236 | # Const 7d 237 | for i in range(n_train): 238 | # spo.addConstr(quicksum(z[i,t] for t in range(T_L)) == 1) 239 | spo.addConstr(LinExpr([(1,z[i,t]) for t in range(T_L)]) == 1) 240 | 241 | # Const 7e 242 | for i in range(n_train): 243 | for t in range(T_L): 244 | spo.addConstr(z[i,t] <= l[t]) 245 | 246 | # Const 7f 247 | for t in range(T_L): 248 | # spo.addConstr(quicksum(z[i,t] for i in range(n_train))>= N_min * l[t]) 249 | spo.addConstr(LinExpr([(1,z[i,t]) for i in range(n_train)])>= N_min * l[t]) 250 | 251 | # Const 7g 252 | for i in range(n_train): 253 | for t in range(T_L): 254 | left, right = find_ancestors(t + T_B) 255 | for m in right: 256 | # spo.addConstr(quicksum(a[p,m]* train_x[i,p] for p in range(P)) >= b[m]- (1 - z[i,t] )) 257 | spo.addConstr(LinExpr([(train_x[i,p],a[p,m]) for p in range(P)]) >= b[m]- (1 - z[i,t] )) 258 | 259 | # Const 7h 260 | for m in left: 261 | # spo.addConstr(quicksum(a[p,m]* (x[i,p] + eps[p]) for p in range(P))<= b[m] + (1+eps_max)*(1-z[i,t] )) 262 | #spo.addConstr(quicksum(a[p,m]* train_x[i,p] for p in range(P)) +0.0001<= b[m] + (1-z[i,t] )) 263 | #spo.addConstr(LinExpr([(train_x[i,p],a[p,m]) for p in range(P)]) + eps <= b[m] + (1+eps)*(1 - z[i,t] )) 264 | spo.addConstr(LinExpr([(train_x[i,p]+ eps[p],a[p,m]) for p in range(P)]) <= b[m] + one_plus_eps_max*(1 - z[i,t] )) 265 | #spo.addConstr(LinExpr([(train_x[i,p],a[p,m]) for p in range(P)]) <= b[m] + 1 - one_plus_eps*z[i,t]) 266 | 267 | # Const 7i 268 | for t in range(T_B): 269 | # spo.addConstr(quicksum(a[p,t] for p in range(P)) == d[t]) 270 | spo.addConstr(LinExpr([(1,a[p,t]) for p in range(P)]) == d[t]) 271 | 272 | # Const 7j 273 | for t in range(T_B): 274 | #spo.addConstr(b[t] <= d[t]) 275 | spo.addConstr(b[t] >= 1 - d[t]) 276 | 277 | # Const 7k 278 | for t in range(1,T_B): 279 | spo.addConstr(d[t] <= d[find_parent_index(t)]) 280 | 281 | # Const 7l (optional): ensures LP relaxation of problem has obj >= 0 282 | for i in 
range(n_train): 283 | spo.addConstr(LinExpr([(1,y[i,t]) for t in range(T_L)]) >= optimal_costs[i]) 284 | #for t in range(T_L): 285 | #expr_constraint = LinExpr([ (train_cost[i,j], w[t,j]) for key,j in Edge_dict.items()]) 286 | #expr_constraint = LinExpr([ (train_cost[i,j], w[t,j]) for j in range(D)]) 287 | #spo.addConstr(expr_constraint >= optimal_costs[i]) 288 | #spo.addConstr(LinExpr([(1,y[i,t]) for t in range(T_L)]) >= optimal_costs[i]) 289 | 290 | # Add objective 291 | # spo.setObjective( quicksum(y[i,t] for i in range(n_train) for t in range(T_L))/n_train + spo_opt_tree_reg* quicksum(d[t] for t in range(T_B) ), GRB.MINIMIZE) 292 | expr_objective = LinExpr([(1, y[i,t]) for i in range(n_train) for t in range(T_L) ]) - sum_optimal_cost 293 | #expr_objective = LinExpr([(1, y[i,t]) for i in range(n_train) for t in range(T_L) ]) 294 | if spo_opt_tree_reg > 0: 295 | expr_objective.add(LinExpr([(1, d[t]) for t in range(T_B)])*spo_opt_tree_reg*n_train) 296 | spo.setObjective(expr_objective, GRB.MINIMIZE) 297 | 298 | 299 | # Solve optimization 300 | if tune == True: 301 | spo.tune() 302 | if tune_foutpref is None: 303 | tune_foutpref='tune' 304 | for i in range(spo.tuneResultCount): 305 | spo.getTuneResult(i) 306 | spo.write(tune_foutpref+str(i)+'.prm') 307 | spo.optimize() 308 | 309 | # Get values of objective and variables 310 | # print('Obj=') 311 | # print(spo.getObjective().getValue()) 312 | # 313 | # z_ = np.zeros((n,T_L)) 314 | # z_res = spo.getAttr('X', z) 315 | # for i,j in z_res: 316 | # z_[i,j] = z_res[i,j] 317 | # print(z_) 318 | spo_dt_a = np.zeros((P,T_B)) 319 | a_res = spo.getAttr('X', a) 320 | for i in range(P): 321 | for j in range(T_B): 322 | spo_dt_a[i,j] = a_res[i,j] 323 | #for i,j in a_res: 324 | #spo_dt_a[i,j] = a_res[i,j] 325 | 326 | spo_dt_b = np.zeros(T_B) 327 | b_res = spo.getAttr('X', b) 328 | for i in range(T_B): 329 | spo_dt_b[i] = b_res[i] 330 | #for i in b_res: 331 | #spo_dt_b[i] = b_res[i] 332 | 333 | spo_dt_w = np.zeros((T_L,D)) 334 | w_res = spo.getAttr('X', w) 335 | for i in range(T_L): 336 | for j in range(D): 337 | spo_dt_w[i,j] = w_res[i,j] 338 | #for i,j in w_res: 339 | #spo_dt_w[i,j] = w_res[i,j] 340 | 341 | spo_dt_y = np.zeros((n_train,T_L)) 342 | y_res = spo.getAttr('X', y) 343 | for i in range(n_train): 344 | for j in range(T_L): 345 | spo_dt_y[i,j] = y_res[i,j] 346 | spo_dt_z = np.zeros((n_train,T_L)) 347 | z_res = spo.getAttr('X', z) 348 | for i in range(n_train): 349 | for j in range(T_L): 350 | spo_dt_z[i,j] = z_res[i,j] 351 | spo_dt_l = np.zeros(T_L) 352 | l_res = spo.getAttr('X', l) 353 | for i in range(T_L): 354 | spo_dt_l[i] = l_res[i] 355 | spo_dt_d = np.zeros(T_B) 356 | d_res = spo.getAttr('X', d) 357 | for i in range(T_B): 358 | spo_dt_d[i] = d_res[i] 359 | 360 | if returnAllOptvars == False: 361 | return spo_dt_a, spo_dt_b, spo_dt_w 362 | else: 363 | return spo_dt_a, spo_dt_b, spo_dt_w, spo_dt_y, spo_dt_z, spo_dt_l, spo_dt_d 364 | 365 | # Given a tree defined by a and b for all interior nodes, find the path (including the leaf node in which it lies) of observations using its features 366 | def decision_path(x,a,b): 367 | T_B = len(b) 368 | if len(x.shape) == 1: 369 | n = 1 370 | P = x.size 371 | else: 372 | n, P = x.shape 373 | res = [] 374 | for i in range(n): 375 | node = 0 376 | path = [0] 377 | T_B = a.shape[1] 378 | while node < T_B: 379 | if np.dot(x[i,:], a[:,node]) < b[node]: 380 | node = (node+1)*2 - 1 381 | else: 382 | node = (node+1)*2 383 | path.append(node) 384 | res.append(path) 385 | return np.array(res) 386 | 387 | 388 | # 
Given the path of an observation (including the leaf node in which it lies), find the predicted total cost for that observation 389 | def apply_leaf_decision(c,path, w, subtract_optimal=False): 390 | T_L, D = w.shape 391 | n = c.shape[0] 392 | paths = path[:, -1] 393 | actual_cost = [] 394 | for i in range(n): 395 | decision_node = paths[i] - T_L +1 396 | cost_decision = np.dot(c[i,:], w[decision_node,:]) 397 | if subtract_optimal == True: 398 | cost_optimal = find_opt_decision(c[i,:].reshape(1,-1))['objective'][0] 399 | actual_cost.append(cost_decision-cost_optimal) 400 | else: 401 | actual_cost.append(cost_decision) 402 | return np.array(actual_cost) 403 | 404 | def spo_opt_traintest(train_x,train_cost,test_x,test_cost,train_x_precision,spo_opt_tree_reg, N_min, H): 405 | spo_dt_a,spo_dt_b, spo_dt_w = spo_opt_tree(train_cost,train_x,train_x_precision,spo_opt_tree_reg, N_min, H) 406 | path = decision_path(test_x,spo_dt_a,spo_dt_b) 407 | return spo_dt_a,spo_dt_b, spo_dt_w, np.mean(apply_leaf_decision(test_cost,path, spo_dt_w, subtract_optimal=True)) 408 | 409 | def spo_opt_tunealpha(train_x,train_cost,valid_x,valid_cost,train_x_precision,reg_set, N_min, H): 410 | best_err = np.float("inf") 411 | for alpha in reg_set: 412 | spo_dt_a,spo_dt_b, spo_dt_w, err = spo_opt_traintest(train_x,train_cost,valid_x,valid_cost,train_x_precision,alpha, N_min, H) 413 | if err <= best_err: 414 | best_spo_dt_a, best_spo_dt_b, best_spo_dt_w, best_err, best_alpha = spo_dt_a,spo_dt_b, spo_dt_w, err, alpha 415 | return best_spo_dt_a, best_spo_dt_b, best_spo_dt_w, best_err, best_alpha 416 | 417 | #def cv_spo_opt_traintest(cost, X, train_x_precision,reg_set, splits = 4): 418 | # dic = {reg:0 for reg in reg_set} 419 | # n, n_edges = cost.shape 420 | # K = X.shape[1] 421 | # kf = KFold(n_splits = splits) 422 | # for train, test in kf.split(X): 423 | # X_train, X_test, cost_train, cost_test = X[train], X[test], cost[train], cost[test] 424 | # opt_cost = find_opt_decision(cost_test)['objective'] 425 | # for spo_opt_tree_reg in reg_set: 426 | # actual_cost = spo_opt_traintest(X_train, cost_train, X_test, cost_test,train_x_precision,spo_opt_tree_reg) 427 | # dic[spo_opt_tree_reg] += sum(actual_cost - opt_cost) 428 | # return smallest_dic_value(dic) 429 | # 430 | #def smallest_dic_value(dic): 431 | # reverse = dict() 432 | # for key in dic.keys(): 433 | # reverse[dic[key]] = key 434 | # return reverse[min(reverse.keys())] 435 | -------------------------------------------------------------------------------- /Applications/Yahoo News/Data Preprocessing/DataPreprocessing1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import YahooNewsDataExtraction as tool\n", 10 | "from sklearn.cluster import KMeans\n", 11 | "import numpy as np\n", 12 | "import pandas as pd\n", 13 | "import pickle" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "# Read all data and save in separate files" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "'''The process time for each raw Yahoo news data file is around 10 - 30 minutes.'''\n", 30 | "count = 0\n", 31 | "adsDict = dict()\n", 32 | "recordDict = dict() \n", 33 | "for day in range(1,11):\n", 34 | " if day == 10:\n", 35 | " '''Use raw Yahoo News data path and filename'''\n", 36 | " 
filename = '../ydata-fp-td-clicks-v1_0.200905' + str(day)\n", 37 | " else:\n", 38 | " filename = '../ydata-fp-td-clicks-v1_0.2009050' + str(day)\n", 39 | " with open(filename) as fp:\n", 40 | " line = fp.readline()\n", 41 | " while line:\n", 42 | " timestamp, offered_ad_id, click, user_feat, eligible_ads_ids, eligible_ads_feat = tool.parse_line(line)\n", 43 | " recordDict.update({count:np.hstack([user_feat[1:], offered_ad_id, click])})\n", 44 | " adsDict.update(dict(zip(eligible_ads_ids,eligible_ads_feat[:,1:])))\n", 45 | " count += 1\n", 46 | " line = fp.readline()\n", 47 | " \n", 48 | " recordDF = pd.DataFrame(recordDict).T\n", 49 | " filename = 'day' + str(day) + '_records.npy'\n", 50 | " np.save(filename, recordDF.values)\n", 51 | " \n", 52 | " filename = 'day' + str(day) + '_adsDict.p'\n", 53 | " pickle.dump(adsDict, open(filename,'wb'))\n", 54 | " count = 0\n", 55 | " adsDict = dict()\n", 56 | " recordDict = dict()\n", 57 | " print('Completed processing: ', filename)" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "# Clustering user data" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": null, 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "'''Load train data. Note that this data will be further split into the training set and validation set via random sampling.'''\n", 81 | "train_records = np.vstack([np.load('day' + str(day) + '_records.npy') for day in range(1,6)]) \n" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "'''Cluster users in train data according to user features (excluding constant feature)'''\n", 91 | "from sklearn.cluster import MiniBatchKMeans\n", 92 | "n_userClusters = 10000\n", 93 | "user_kmeans = MiniBatchKMeans(n_clusters=n_userClusters,random_state=0,batch_size=3000, init_size= n_userClusters)\n", 94 | "user_kmeans.fit(train_records[:,:5])\n", 95 | "filename = 'train_userCluster.p'\n", 96 | "pickle.dump(user_kmeans,open(filename,'wb'))" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "'''Load test data.'''\n", 106 | "test_records = np.vstack([np.load('day' + str(day) + '_records.npy') for day in range(6,11)]) " 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "'''Cluster test users via cluster trained using training set user data '''\n", 116 | "\n", 117 | "filename = 'train_userCluster.p'\n", 118 | "user_kmeans = pickle.load(open(filename,'rb'))\n", 119 | "test_usertype_predictions = user_kmeans.predict(test_records[:,:5])\n", 120 | "\n", 121 | "filename = 'test_userTypePredictions.npy'\n", 122 | "np.save(filename,test_usertype_predictions)" 123 | ] 124 | } 125 | ], 126 | "metadata": { 127 | "kernelspec": { 128 | "display_name": "Python 3", 129 | "language": "python", 130 | "name": "python3" 131 | }, 132 | "language_info": { 133 | "codemirror_mode": { 134 | "name": "ipython", 135 | "version": 3 136 | }, 137 | "file_extension": ".py", 138 | "mimetype": "text/x-python", 139 | "name": "python", 140 | "nbconvert_exporter": "python", 141 | "pygments_lexer": "ipython3", 142 | "version": "3.6.3" 143 | } 144 | }, 145 | "nbformat": 4, 146 | "nbformat_minor": 2 147 | } 148 
| -------------------------------------------------------------------------------- /Applications/Yahoo News/Data Preprocessing/DataPreprocessing2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import YahooNewsDataExtraction as tool\n", 10 | "from sklearn.cluster import KMeans\n", 11 | "import numpy as np\n", 12 | "import pandas as pd\n", 13 | "import pickle" 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": null, 19 | "metadata": {}, 20 | "outputs": [], 21 | "source": [ 22 | "threshold = 50\n", 23 | "train_val_split = 0.5" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "# Training set Processing" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "'''Read in train data'''\n", 40 | "'''Col0 - Col4 are features, Col5 is adID, Col6 is click binary indicator.'''\n", 41 | "train_records = np.vstack([np.load('day' + str(day) + '_records.npy') for day in range(1,6)]) \n", 42 | "\n", 43 | "\n", 44 | "'''Read in ads dictionary.'''\n", 45 | "train_adsDict = dict()\n", 46 | "for day in range(1,6):\n", 47 | " train_adsDict.update(pickle.load(open('day' + str(day) + '_adsDict.p','rb')))\n", 48 | "\n", 49 | " \n", 50 | "'''Add user type to each user interaction record.'''\n", 51 | "filename = 'train_userCluster.p'\n", 52 | "user_kmeans = pickle.load(open(filename,'rb'))\n", 53 | "train_recordsDF = pd.DataFrame(np.hstack([train_records,user_kmeans.labels_.reshape(-1,1)])).rename(columns \n", 54 | " ={5:'adID', 6: 'click',7: 'userType'})" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "'''Cluster articles into different types.'''\n", 64 | "train_adsDF = pd.DataFrame(train_adsDict).T\n", 65 | "n_adsClusters = 7\n", 66 | "ads_kmeans = KMeans(n_clusters=n_adsClusters, random_state=300)\n", 67 | "ads_kmeans.fit(train_adsDF)\n", 68 | "ads_kmeans.labels_\n", 69 | "train_adsType = dict(zip(train_adsDF.dropna().index, ads_kmeans.labels_))\n", 70 | "\n", 71 | "filename = 'train_adsCluster_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.p'\n", 72 | "pickle.dump(ads_kmeans,open(filename,'wb'))" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": null, 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "'''Add ad type to each user interaction record.'''\n", 82 | "train_validation_recordDFwType= pd.concat([train_recordsDF, \n", 83 | " train_recordsDF['adID'].map(train_adsType).rename('adsType')],axis =1).dropna()\n" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": null, 89 | "metadata": {}, 90 | "outputs": [], 91 | "source": [ 92 | "'''Split train data into training and validation sets. 
Length ratio is given by train_val_split.'''\n", 93 | "from sklearn.model_selection import train_test_split\n", 94 | "train_recordDFwType, validation_recordDFwType, = train_test_split(train_validation_recordDFwType,\n", 95 | " test_size=train_val_split, random_state=42)" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": null, 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "'''Calculate training set click probability for each pair of user type and article type in training set.'''\n", 105 | "train_recordDFwType['adsType'] = train_recordDFwType['adsType'].astype(int)\n", 106 | "train_Y_clickProb = train_recordDFwType[['click','userType','adsType']].groupby(['userType','adsType'])['click'].agg({'clickProb':'mean',\n", 107 | " 'n_obs':'count'})" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "'''Drop article type 3 and several interaction records so average click probabilities for \n", 117 | "each user and article type were calculated with at least 50 interaction records. '''\n", 118 | "\n", 119 | "train_tmp = train_Y_clickProb['n_obs'].unstack().drop([3], axis =1).min(axis = 1)\n", 120 | "train_tmp_index = train_tmp[train_tmp >= threshold].index" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "metadata": {}, 127 | "outputs": [], 128 | "source": [ 129 | "filename = 'filtered_train_clickprob_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.npy'\n", 130 | "np.save(filename, train_Y_clickProb['clickProb'].unstack().loc[train_tmp_index].drop([3],axis = 1).values) \n", 131 | "\n", 132 | "\n", 133 | "filename = 'filtered_train_usernumobserv_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.npy'\n", 134 | "np.save(filename, train_recordDFwType.groupby('userType').size().loc[train_tmp_index].values) \n", 135 | "\n", 136 | "\n", 137 | "train_X_user = train_recordDFwType.groupby(['userType']).mean().drop(['adID','click','adsType'],axis =1)\n", 138 | "filename = 'filtered_train_userFeat_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.npy'\n", 139 | "np.save(filename, train_X_user.loc[train_tmp_index].values)\n", 140 | "\n", 141 | "\n", 142 | "filename = 'filtered_train_featMap_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.npy'\n", 143 | "pickle.dump(dict(zip(list(train_X_user.index), train_X_user.values)), open(filename, 'wb'))\n" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "# Validation set Processing" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [ 159 | "filename = 'filtered_train_featMap_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.npy'\n", 160 | "train_featMap = pickle.load(open(filename, 'rb'))\n", 161 | "validation_recordDFwType['adsType'] = validation_recordDFwType['adID'].map(train_adsType)" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": {}, 168 | "outputs": [], 169 | "source": [ 170 | "'''Calculate training set click probability for each pair of user type and article type in validation set.'''\n", 171 | "validation_Y_clickProb = validation_recordDFwType[['click','userType','adsType']].groupby(['userType','adsType'])['click'].agg({'clickProb':'mean',\n", 172 | " 'n_obs':'count'})" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | 
"execution_count": null, 178 | "metadata": {}, 179 | "outputs": [], 180 | "source": [ 181 | "'''Drop article type 3 and several interaction records so average click probabilities for \n", 182 | "each user and article type were calculated with at least 50 interaction records. '''\n", 183 | "validation_tmp = validation_Y_clickProb['n_obs'].unstack().drop([3], axis =1).min(axis = 1)\n", 184 | "validation_tmp_index = validation_tmp[validation_tmp >= threshold].index" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": null, 190 | "metadata": {}, 191 | "outputs": [], 192 | "source": [ 193 | "filename = 'filtered_validation_clickprob_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.npy'\n", 194 | "np.save(filename, validation_Y_clickProb['clickProb'].unstack().loc[validation_tmp_index].drop([3],axis = 1).values) \n", 195 | "\n", 196 | "filename = 'filtered_validation_usernumobserv_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.npy'\n", 197 | "np.save(filename, validation_recordDFwType.groupby('userType').size().loc[validation_tmp_index].values) \n", 198 | "\n", 199 | "validation_X_user = np.vstack([train_featMap[x] for x in list(validation_tmp_index)])\n", 200 | "filename = 'filtered_validation_userFeat_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.npy'\n", 201 | "np.save(filename, validation_X_user) " 202 | ] 203 | }, 204 | { 205 | "cell_type": "markdown", 206 | "metadata": {}, 207 | "source": [ 208 | "# Test set Processing" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": null, 214 | "metadata": {}, 215 | "outputs": [], 216 | "source": [ 217 | "'''Read in test data'''\n", 218 | "'''Col0 - Col4 are features, Col5 is adID, Col6 is click binary indicator.'''\n", 219 | "test_records = np.vstack([np.load('day' + str(day) + '_records.npy') for day in range(6,11)]) \n", 220 | "\n", 221 | "'''Read in ads dictionary.'''\n", 222 | "test_adsDict = dict()\n", 223 | "for day in range(6,11):\n", 224 | " test_adsDict.update(pickle.load(open('day' + str(day) + '_adsDict.p','rb')))\n", 225 | "\n", 226 | "filename = 'train_userCluster.p'\n", 227 | "user_kmeans = pickle.load(open(filename,'rb'))\n", 228 | "\n", 229 | "filename = 'test_userTypePredictions.npy'\n", 230 | "test_usertype_predictions = np.load(filename)\n", 231 | "\n", 232 | "filename = 'train_adsCluster_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.p'\n", 233 | "ads_kmeans = pickle.load(open(filename,'rb'))\n", 234 | "\n", 235 | "filename = 'filtered_train_featMap_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.npy'\n", 236 | "train_featMap = pickle.load(open(filename, 'rb'))" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": null, 242 | "metadata": {}, 243 | "outputs": [], 244 | "source": [ 245 | "test_adsDF = pd.DataFrame(test_adsDict).T\n", 246 | "test_recordDFwType = pd.DataFrame(np.hstack([test_records,test_usertype_predictions.reshape(-1,1)])).rename(columns = \n", 247 | " {5:'adID', 6: 'click',7: 'userType'})\n", 248 | "test_adsType = dict(zip(test_adsDF.index, ads_kmeans.predict(test_adsDF)))\n", 249 | "test_recordDFwType['adsType'] = test_recordDFwType['adID'].map(test_adsType)" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": null, 255 | "metadata": {}, 256 | "outputs": [], 257 | "source": [ 258 | "'''Calculate training set click probability for each pair of user type and article type in test set.'''\n", 259 | "test_Y_clickProb = 
test_recordDFwType[['click','userType','adsType']].groupby(['userType','adsType'])['click'].agg({'clickProb':'mean',\n", 260 | " 'n_obs':'count'})" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": null, 266 | "metadata": {}, 267 | "outputs": [], 268 | "source": [ 269 | "'''Drop article type 3 and several interaction records. '''\n", 270 | "test_tmp = test_Y_clickProb['n_obs'].unstack().drop([3], axis =1).min(axis = 1)\n", 271 | "test_tmp_index = test_tmp[test_tmp >= threshold].index\n" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": null, 277 | "metadata": {}, 278 | "outputs": [], 279 | "source": [ 280 | "filename = 'filtered_test_clickprob_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.npy'\n", 281 | "np.save(filename, test_Y_clickProb['clickProb'].unstack().loc[test_tmp_index].drop([3],axis = 1).values) \n", 282 | "\n", 283 | "filename = 'filtered_test_usernumobserv_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.npy'\n", 284 | "np.save(filename, test_recordDFwType.groupby('userType').size().loc[test_tmp_index].values) \n", 285 | "\n", 286 | "test_X_user = np.vstack([train_featMap[x] for x in list(test_tmp_index)])\n", 287 | "filename = 'filtered_test_userFeat_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.npy'\n", 288 | "np.save(filename, test_X_user) " 289 | ] 290 | } 291 | ], 292 | "metadata": { 293 | "kernelspec": { 294 | "display_name": "Python 3", 295 | "language": "python", 296 | "name": "python3" 297 | }, 298 | "language_info": { 299 | "codemirror_mode": { 300 | "name": "ipython", 301 | "version": 3 302 | }, 303 | "file_extension": ".py", 304 | "mimetype": "text/x-python", 305 | "name": "python", 306 | "nbconvert_exporter": "python", 307 | "pygments_lexer": "ipython3", 308 | "version": "3.6.3" 309 | } 310 | }, 311 | "nbformat": 4, 312 | "nbformat_minor": 2 313 | } 314 | -------------------------------------------------------------------------------- /Applications/Yahoo News/Data Preprocessing/README.md: -------------------------------------------------------------------------------- 1 | # Yahoo News Dataset Preprocessing 2 | 3 | Please follow the steps below to obtain the preprocessed Yahoo! Front Page Today Module dataset used in our numerical experiments: 4 | * Download the raw Yahoo! Front Page Today dataset at https://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did=49. A license must be obtained from Yahoo to access the dataset. The data may be used only for academic research purposes and may not be used for any commercial purposes or by any commercial entity. 5 | * Run DataPreprocessing1.ipynb. This may take at least 24 hours to complete. 6 | * Run DataPreprocessing2.ipynb. 
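* Note on file locations: the first processing cell of DataPreprocessing1.ipynb reads the ten raw daily files via relative paths of the form `../ydata-fp-td-clicks-v1_0.200905XX` (days 01 through 10), so place the raw files one directory above the notebooks or edit the `filename` assignments accordingly. Running both notebooks produces the `filtered_train_*`, `filtered_validation_*`, and `filtered_test_*` `.npy` files that the scripts in the parent Yahoo News folder (e.g., SPOForest_news.py) load.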
-------------------------------------------------------------------------------- /Applications/Yahoo News/Data Preprocessing/YahooNewsDataExtraction.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | #given vector of strings like ['2:0.306008', '3:0.000450', '4:0.077048', '5:0.230439', '6:0.386055', '1:1.000000'], 4 | #extract features into numpy vector 5 | def extract_features(strv): 6 | feat = np.array([0.0]*6) 7 | for i in range(6): 8 | tmp = strv[i].split(":") 9 | feat_index = int(tmp[0]) 10 | feat_value = float(tmp[1]) 11 | feat[feat_index-1] = feat_value 12 | return(feat) 13 | 14 | 15 | #returns timestamp, offered ad id, click (binary), user features, vector of eligible ad ids, vector of eligible ad id features 16 | def parse_line(line): 17 | #remove \n at the end of the line 18 | line = line.strip() 19 | line = line.split("|") 20 | 21 | #get timestamp, offered ad id, click (binary) 22 | decision_info = line[0].split(" ") 23 | timestamp = int(decision_info[0]) 24 | offered_ad_id = int(decision_info[1]) 25 | click = int(decision_info[2]) 26 | 27 | #get user features 28 | user_info = line[1].split(" ")[1:] 29 | user_feat = extract_features(user_info) 30 | 31 | #get eligible ads + features 32 | ad_info = line[2].split(" ") 33 | eligible_ads_ids = np.array([int(ad_info[0])]) 34 | eligible_ads_feat = np.array([extract_features(ad_info[1:])]) 35 | for i in range(3, len(line)): 36 | ad_info = line[i].split(" ") 37 | if len(ad_info[1:]) >= 6: 38 | eligible_ads_ids = np.append(eligible_ads_ids,np.array([int(ad_info[0])])) 39 | eligible_ads_feat = np.append(eligible_ads_feat,np.array([extract_features(ad_info[1:])]), axis=0) 40 | #else: 41 | #print('Ad Feature Amomaly: ' + line[i]) 42 | 43 | sorted_inds = np.argsort(eligible_ads_ids) 44 | eligible_ads_ids = eligible_ads_ids[sorted_inds] 45 | eligible_ads_feat = eligible_ads_feat[sorted_inds] 46 | 47 | return(timestamp, offered_ad_id, click, user_feat, eligible_ads_ids, eligible_ads_feat) 48 | 49 | #Read Data 50 | path="/R6/ydata-fp-td-clicks-v1_0.20090501" 51 | 52 | 53 | 54 | 55 | -------------------------------------------------------------------------------- /Applications/Yahoo News/SPOForest_news.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Runs random forest algorithm on Yahoo News dataset 3 | This code considers two methods of aggregating individual tree predictions to obtain a forest decision: 4 | - "mean": averages cost predictions for each tree in the forest; outputs decision associated with average cost 5 | - "mode": outputs the mode decision recommended by the trees in the forest 6 | Outputs decision costs for each test-set instance as pickle file 7 | Takes multiple input arguments: 8 | (1) max_depth_set_str: sequence of training depths tuned using cross validation, e.g. "2-4-5" 9 | (2) min_samples_leaf_set_str: sequence of "min. (weighted) observations per leaf" tuned using cross validation, e.g. "20-50-100" 10 | (3) n_estimators_set_str: sequence of number of trees in forest tuned using cross validation, e.g. "20-50-100" 11 | (4) max_features_set_str: sequence of number of features used in feature bagging tuned using cross validation, e.g. 
"2-3-4" 12 | (5) algtype: set equal to "MSE" (CART forest) or "SPO" (SPOT forest) 13 | (6) number of workers to use in parallel processing (i.e., fitting individual trees in the forest in parallel) 14 | (7) decision_problem_seed: seed controlling generation of constraints in article recommendation problem (-1 = no constraints) 15 | (8) train_size: number of random obs. to extract from the training data. 16 | Only useful in limiting the size of the training data (-1 = use full training data) 17 | (9) quant_discret: continuous variable split points in the trees are chosen from quantiles of the variable corresponding to quant_discret,2*quant_discret,3*quant_discret, etc.. 18 | Values of input arguments used in paper: 19 | (1) max_depth_set_str: "1000" 20 | (2) min_samples_leaf_set_str: "10000" 21 | (3) n_estimators_set_str: "50" 22 | (4) max_features_set_str: "2-3-4-5" 23 | (5) algtype: "MSE" for CART forest, "SPO" for SPOT forest 24 | (6) number of workers to use in parallel processing: 10 25 | (7) decision_problem_seed: ran 9 constraint instances corresponding to seeds of 10, 11, 12, ..., 18 26 | (8) train_size: -1 27 | (9) quant_discret: 0.05 28 | ''' 29 | 30 | import time 31 | 32 | import numpy as np 33 | import pickle 34 | from gurobipy import* 35 | from decision_problem_solver import* 36 | from SPOForest import SPOForest 37 | import sys 38 | 39 | ######################################## 40 | #SEEDS FOR RANDOM NUMBER GENERATORS 41 | #seed for rngs in random forest 42 | forest_seed = 0 43 | #seed controlling random subset of training data used (if full training data is not being used) 44 | select_train_seed = 0 45 | ######################################## 46 | #training parameters 47 | max_depth_set_str = sys.argv[1] 48 | max_depth_set=[int(k) for k in max_depth_set_str.split('-')]#[None] 49 | min_samples_leaf_set_str = sys.argv[2] 50 | min_samples_leaf_set=[int(k) for k in min_samples_leaf_set_str.split('-')]#[5] 51 | n_estimators_set_str = sys.argv[3] 52 | n_estimators_set=[int(k) for k in n_estimators_set_str.split('-')]#[100,500] 53 | max_features_set_str = sys.argv[4] 54 | max_features_set=[int(k) for k in max_features_set_str.split('-')]#[3] 55 | algtype=sys.argv[5] #either "MSE" or "SPO" 56 | #number of workers 57 | if sys.argv[6] == "1": 58 | run_in_parallel = False 59 | num_workers = None 60 | else: 61 | run_in_parallel = True 62 | num_workers = int(sys.argv[6]) 63 | #ineligible_seeds = [27,28,29,32,39] #considered range = 10-40 inclusive 64 | decision_problem_seed=int(sys.argv[7]) #if -1, use no constraints in decision problem 65 | train_size=int(sys.argv[8]) #if you want to limit the size of the training data (-1 = no limit) 66 | quant_discret=float(sys.argv[9]) 67 | ######################################## 68 | #output filename 69 | fname_out_mean = algtype+"Forest_news_costs_depthSet"+max_depth_set_str+"_minObsSet"+min_samples_leaf_set_str+"_nEstSet"+n_estimators_set_str+"_mFeatSet"+max_features_set_str+"_aMethod"+"mean"+"_probSeed"+str(decision_problem_seed)+"_nTrain"+str(train_size)+"_qd"+str(quant_discret)+".pkl"; 70 | fname_out_mode = algtype+"Forest_news_costs_depthSet"+max_depth_set_str+"_minObsSet"+min_samples_leaf_set_str+"_nEstSet"+n_estimators_set_str+"_mFeatSet"+max_features_set_str+"_aMethod"+"mode"+"_probSeed"+str(decision_problem_seed)+"_nTrain"+str(train_size)+"_qd"+str(quant_discret)+".pkl"; 71 | ############################################################################# 72 | ############################################################################# 73 | 
############################################################################# 74 | 75 | #generate decision problem 76 | num_constr = 5 #number of Aw <= b constraints 77 | num_dec = 6 #number of decisions 78 | #ineligible_seeds = [27,28,29,32,39] #considered range = 10-40 inclusive 79 | if decision_problem_seed == -1: 80 | #no budget constraint case 81 | A_constr = np.zeros((num_constr,num_dec)) 82 | b_constr = np.ones(num_constr) 83 | l_constr = np.zeros(num_dec) 84 | u_constr = np.ones(num_dec) 85 | else: 86 | np.random.seed(decision_problem_seed) 87 | A_constr = np.random.exponential(scale=1.0, size=(num_constr,num_dec)) 88 | b_constr = np.ones(num_constr) 89 | l_constr = np.zeros(num_dec) 90 | u_constr = np.ones(num_dec) 91 | 92 | ############################################## 93 | 94 | thresh = "50" 95 | valid_size = "50.0%" 96 | 97 | train_x = np.load('filtered_train_userFeat_'+valid_size+'_'+thresh+'.npy') 98 | valid_x = np.load('filtered_validation_userFeat_'+valid_size+'_'+thresh+'.npy') 99 | test_x = np.load('filtered_test_userFeat_'+valid_size+'_'+thresh+'.npy') 100 | 101 | #make negative to turn into minimization problem 102 | train_cost = np.load('filtered_train_clickprob_'+valid_size+'_'+thresh+'.npy')*-1.0 103 | valid_cost = np.load('filtered_validation_clickprob_'+valid_size+'_'+thresh+'.npy')*-1.0 104 | test_cost = np.load('filtered_test_clickprob_'+valid_size+'_'+thresh+'.npy')*-1.0 105 | 106 | train_weights = np.load('filtered_train_usernumobserv_'+valid_size+'_'+thresh+'.npy') 107 | valid_weights = np.load('filtered_validation_usernumobserv_'+valid_size+'_'+thresh+'.npy') 108 | test_weights = np.load('filtered_test_usernumobserv_'+valid_size+'_'+thresh+'.npy') 109 | 110 | ############################################## 111 | #limit size of training data if specified 112 | if train_size != -1 and train_size <= train_x.shape[0] and train_size <= valid_x.shape[0]: 113 | np.random.seed(select_train_seed) 114 | sel_inds = np.random.choice(range(train_x.shape[0]), size = train_size, replace=False) 115 | train_x = train_x[sel_inds] 116 | train_cost = train_cost[sel_inds] 117 | train_weights = train_weights[sel_inds] 118 | sel_inds = np.random.choice(range(valid_x.shape[0]), size = train_size, replace=False) 119 | valid_x = valid_x[sel_inds] 120 | valid_cost = valid_cost[sel_inds] 121 | valid_weights = valid_weights[sel_inds] 122 | 123 | ############################################################################# 124 | ############################################################################# 125 | ############################################################################# 126 | 127 | def forest_traintest(train_x,train_cost,train_weights,test_x,test_cost,test_weights,n_estimators,max_depth,min_samples_leaf,max_features,run_in_parallel,num_workers,algtype, A_constr, b_constr, l_constr, u_constr): 128 | if algtype == "MSE": 129 | SPO_weight_param=0.0 130 | elif algtype == "SPO": 131 | SPO_weight_param=1.0 132 | regr = SPOForest(n_estimators=n_estimators,run_in_parallel=run_in_parallel,num_workers=num_workers, 133 | max_depth=max_depth, min_weights_per_node=min_samples_leaf, quant_discret=quant_discret, debias_splits=False, 134 | max_features=max_features, 135 | SPO_weight_param=SPO_weight_param, SPO_full_error=True) 136 | regr.fit(train_x, train_cost, train_weights, verbose_forest=True, verbose=False, feats_continuous=True, seed=forest_seed, 137 | A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr) 138 | pred_decision_mean = regr.est_decision(test_x, 
method="mean") 139 | pred_decision_mode = regr.est_decision(test_x, method="mode") 140 | alg_costs_mean = [np.sum(test_cost[i] * pred_decision_mean[i]) for i in range(0,len(pred_decision_mean))] 141 | alg_costs_mode = [np.sum(test_cost[i] * pred_decision_mode[i]) for i in range(0,len(pred_decision_mode))] 142 | return regr, np.dot(alg_costs_mean,test_weights)/np.sum(test_weights), np.dot(alg_costs_mode,test_weights)/np.sum(test_weights) 143 | 144 | def forest_tuneparams(train_x,train_cost,train_weights,valid_x,valid_cost,valid_weights,n_estimators_set,max_depth_set,min_samples_leaf_set,max_features_set,run_in_parallel,num_workers,algtype, A_constr, b_constr, l_constr, u_constr): 145 | best_err_mean = np.float("inf") 146 | best_err_mode = np.float("inf") 147 | for n_estimators in n_estimators_set: 148 | for max_depth in max_depth_set: 149 | for min_samples_leaf in min_samples_leaf_set: 150 | for max_features in max_features_set: 151 | regr, err_mean, err_mode = forest_traintest(train_x,train_cost,train_weights,valid_x,valid_cost,valid_weights,n_estimators,max_depth,min_samples_leaf,max_features,run_in_parallel,num_workers,algtype, A_constr, b_constr, l_constr, u_constr) 152 | if err_mean <= best_err_mean: 153 | best_regr_mean, best_err_mean, best_n_estimators_mean,best_max_depth_mean,best_min_samples_leaf_mean,best_max_features_mean = regr, err_mean, n_estimators,max_depth,min_samples_leaf,max_features 154 | if err_mode <= best_err_mode: 155 | best_regr_mode, best_err_mode, best_n_estimators_mode,best_max_depth_mode,best_min_samples_leaf_mode,best_max_features_mode = regr, err_mode, n_estimators,max_depth,min_samples_leaf,max_features 156 | 157 | print("Best n_estimators (mean method): " + str(best_n_estimators_mean)) 158 | print("Best max_depth (mean method): " + str(best_max_depth_mean)) 159 | print("Best min_samples_leaf (mean method): " + str(best_min_samples_leaf_mean)) 160 | print("Best max_features (mean method): " + str(best_max_features_mean)) 161 | 162 | print("Best n_estimators (mode method): " + str(best_n_estimators_mode)) 163 | print("Best max_depth (mode method): " + str(best_max_depth_mode)) 164 | print("Best min_samples_leaf (mode method): " + str(best_min_samples_leaf_mode)) 165 | print("Best max_features (mode method): " + str(best_max_features_mode)) 166 | 167 | return best_regr_mean, best_err_mean, best_n_estimators_mean,best_max_depth_mean,best_min_samples_leaf_mean,best_max_features_mean, best_regr_mode, best_err_mode, best_n_estimators_mode,best_max_depth_mode,best_min_samples_leaf_mode,best_max_features_mode 168 | 169 | start = time.time() 170 | 171 | #FIT FOREST 172 | regr_mean,_,_,_,_,_,regr_mode,_,_,_,_,_ = forest_tuneparams(train_x,train_cost,train_weights,valid_x,valid_cost,valid_weights,n_estimators_set,max_depth_set,min_samples_leaf_set,max_features_set, run_in_parallel, num_workers, algtype, A_constr, b_constr, l_constr, u_constr) 173 | 174 | end = time.time() 175 | print "Elapsed time: " + str(end-start) 176 | 177 | #FIND TEST SET COST 178 | pred_decision_mean = regr_mean.est_decision(test_x, method="mean") 179 | pred_decision_mode = regr_mode.est_decision(test_x, method="mode") 180 | costs_mean = [np.sum(test_cost[i] * pred_decision_mean[i]) for i in range(0,len(pred_decision_mean))] 181 | costs_mode = [np.sum(test_cost[i] * pred_decision_mode[i]) for i in range(0,len(pred_decision_mode))] 182 | 183 | print "Average test set cost (mean method) (max is better): " + str(-1.0*np.mean(costs_mean)) 184 | print "Average test set weighted cost (mean method) (max 
is better): " + str(-1.0*np.dot(costs_mean,test_weights)/np.sum(test_weights)) 185 | print "Average test set cost (mode method) (max is better): " + str(-1.0*np.mean(costs_mode)) 186 | print "Average test set weighted cost (mode method) (max is better): " + str(-1.0*np.dot(costs_mode,test_weights)/np.sum(test_weights)) 187 | 188 | with open(fname_out_mean, 'wb') as output: 189 | pickle.dump(costs_mean, output, pickle.HIGHEST_PROTOCOL) 190 | with open(fname_out_mode, 'wb') as output: 191 | pickle.dump(costs_mode, output, pickle.HIGHEST_PROTOCOL) 192 | 193 | # Getting back the objects: 194 | #with open(fname_out, 'rb') as input: 195 | # costs_deg = pickle.load(input) -------------------------------------------------------------------------------- /Applications/Yahoo News/SPOgreedy_news.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Runs SPOT (greedy) / CART algorithm on Yahoo News dataset 3 | Outputs algorithm decision costs for each test-set instance as pickle file 4 | Also outputs optimal decision costs for each test-set instance as pickle file 5 | Takes multiple input arguments: 6 | (1) max_depth: training depth of tree, e.g. "5" 7 | (2) min_weights_per_node: min. number of (weighted) observations per leaf, e.g. "100" 8 | (3) algtype: set equal to "MSE" (CART) or "SPO" (SPOT greedy) 9 | (4) decision_problem_seed: seed controlling generation of constraints in article recommendation problem (-1 = no constraints) 10 | (5) train_size: number of random obs. to extract from the training data. 11 | Only useful in limiting the size of the training data (-1 = use full training data) 12 | (6) quant_discret: continuous variable split points in the tree is chosen from quantiles of the variable corresponding to quant_discret,2*quant_discret,3*quant_discret, etc.. 
13 | Values of input arguments used in paper: 14 | (1) max_depth: considered depths of 2, 4, 6, 1000 15 | (2) min_weights_per_node: "10000" 16 | (3) algtype: "MSE" (CART) or "SPO" (SPOT greedy) 17 | (4) decision_problem_seed: ran 9 constraint instances corresponding to seeds of 10, 11, 12, ..., 18 18 | (5) train_size: -1 19 | (6) quant_discret: 0.05 20 | ''' 21 | 22 | import time 23 | 24 | import numpy as np 25 | import pickle 26 | from decision_problem_solver import* 27 | from SPO_tree_greedy import SPOTree 28 | import sys 29 | 30 | ############################################## 31 | #seed controlling random subset of training data used (if full training data is not being used) 32 | select_train_seed = 0 33 | ######################################## 34 | #training parameters 35 | max_depth = int(sys.argv[1])#3 36 | min_weights_per_node = int(sys.argv[2])#20 37 | algtype = sys.argv[3] #either "MSE" or "SPO" 38 | #ineligible_seeds = [27,28,29,32,39] #considered range = 10-40 inclusive 39 | decision_problem_seed=int(sys.argv[4]) #if -1, use no constraints in decision problem 40 | train_size=int(sys.argv[5]) #if you want to limit the size of the training data (-1 = no limit) 41 | quant_discret=float(sys.argv[6]) 42 | ######################################## 43 | #output filename for alg 44 | fname_out = algtype+"greedy_news_costs_depth"+str(max_depth)+"_minObs"+str(min_weights_per_node)+"_probSeed"+str(decision_problem_seed)+"_nTrain"+str(train_size)+"_qd"+str(quant_discret)+".pkl"; 45 | #output filename for opt costs 46 | fname_out_opt = "Opt_news_costs"+"_probSeed"+str(decision_problem_seed)+".pkl"; 47 | ############################################################################# 48 | ############################################################################# 49 | ############################################################################# 50 | 51 | #generate decision problem 52 | #ineligible_seeds = [27,28,29,32,39] #considered range = 10-40 inclusive 53 | num_constr = 5 #number of Aw <= b constraints 54 | num_dec = 6 #number of decisions 55 | if decision_problem_seed == -1: 56 | #no budget constraint case 57 | A_constr = np.zeros((num_constr,num_dec)) 58 | b_constr = np.ones(num_constr) 59 | l_constr = np.zeros(num_dec) 60 | u_constr = np.ones(num_dec) 61 | else: 62 | np.random.seed(decision_problem_seed) 63 | A_constr = np.random.exponential(scale=1.0, size=(num_constr,num_dec)) 64 | b_constr = np.ones(num_constr) 65 | l_constr = np.zeros(num_dec) 66 | u_constr = np.ones(num_dec) 67 | 68 | ############################################## 69 | 70 | thresh = "50" 71 | valid_size = "50.0%" 72 | 73 | train_x = np.load('filtered_train_userFeat_'+valid_size+'_'+thresh+'.npy') 74 | valid_x = np.load('filtered_validation_userFeat_'+valid_size+'_'+thresh+'.npy') 75 | test_x = np.load('filtered_test_userFeat_'+valid_size+'_'+thresh+'.npy') 76 | 77 | #make negative to turn into minimization problem 78 | train_cost = np.load('filtered_train_clickprob_'+valid_size+'_'+thresh+'.npy')*-1.0 79 | valid_cost = np.load('filtered_validation_clickprob_'+valid_size+'_'+thresh+'.npy')*-1.0 80 | test_cost = np.load('filtered_test_clickprob_'+valid_size+'_'+thresh+'.npy')*-1.0 81 | 82 | train_weights = np.load('filtered_train_usernumobserv_'+valid_size+'_'+thresh+'.npy') 83 | valid_weights = np.load('filtered_validation_usernumobserv_'+valid_size+'_'+thresh+'.npy') 84 | test_weights = np.load('filtered_test_usernumobserv_'+valid_size+'_'+thresh+'.npy') 85 | 86 | ############################################## 
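# Note on the sign convention above (illustrative numbers only): the loaded click probabilities are
# multiplied by -1.0 so that article recommendation becomes a cost-minimization problem. For example,
# with two hypothetical candidate articles having click probabilities (0.2, 0.5), the cost vector is
# (-0.2, -0.5); the decision w = (0, 1) then has cost -0.5, and the evaluation code further below
# reports -1.0 * (-0.5) = 0.5, i.e., the expected click probability (hence "max is better" in the
# printed output).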
87 | #limit size of training data if specified 88 | if train_size != -1 and train_size <= train_x.shape[0] and train_size <= valid_x.shape[0]: 89 | np.random.seed(select_train_seed) 90 | sel_inds = np.random.choice(range(train_x.shape[0]), size = train_size, replace=False) 91 | train_x = train_x[sel_inds] 92 | train_cost = train_cost[sel_inds] 93 | train_weights = train_weights[sel_inds] 94 | sel_inds = np.random.choice(range(valid_x.shape[0]), size = train_size, replace=False) 95 | valid_x = valid_x[sel_inds] 96 | valid_cost = valid_cost[sel_inds] 97 | valid_weights = valid_weights[sel_inds] 98 | 99 | ############################################################################# 100 | ############################################################################# 101 | ############################################################################# 102 | 103 | start = time.time() 104 | 105 | #FIT ALGORITHM 106 | #SPO_weight_param: number between 0 and 1: 107 | #Error metric: SPO_loss*SPO_weight_param + MSE_loss*(1-SPO_weight_param) 108 | if algtype == "MSE": 109 | SPO_weight_param=0.0 110 | elif algtype == "SPO": 111 | SPO_weight_param=1.0 112 | my_tree = SPOTree(max_depth = max_depth, min_weights_per_node = min_weights_per_node, quant_discret = quant_discret, debias_splits=False, SPO_weight_param=SPO_weight_param, SPO_full_error=True) 113 | my_tree.fit(train_x,train_cost,weights=train_weights,verbose=False,feats_continuous=True, 114 | A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr); #verbose specifies whether fitting procedure should print progress 115 | 116 | my_tree.traverse() #prints out the unpruned tree 117 | 118 | end = time.time() 119 | print "Elapsed time: " + str(end-start) 120 | 121 | ################## 122 | 123 | start = time.time() 124 | 125 | #PRUNE DECISION TREE USING VALIDATION SET 126 | my_tree.prune(valid_x, valid_cost, weights_val=valid_weights, verbose=True, one_SE_rule=False) #verbose specifies whether pruning procedure should print progress 127 | 128 | end = time.time() 129 | print "Elapsed time: " + str(end-start) 130 | 131 | my_tree.traverse() #prints out the pruned tree 132 | 133 | ################## 134 | 135 | #FIND TEST SET ALGORITHM COST AND OPTIMAL COST 136 | opt_decision = find_opt_decision(test_cost, A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr)['weights'] 137 | pred_cost = my_tree.est_cost(test_x) 138 | cost_pred_error = [np.mean(abs(pred_cost[i] - test_cost[i])) for i in range(0,pred_cost.shape[0])] 139 | pred_decision = my_tree.est_decision(test_x) 140 | costs = [np.sum(test_cost[i] * pred_decision[i]) for i in range(0,pred_decision.shape[0])] 141 | optcosts = [np.sum(test_cost[i] * opt_decision[i,:]) for i in range(0,opt_decision.shape[0])] 142 | 143 | print "Average test set cost (max is better): " + str(-1.0*np.mean(costs)) 144 | print "Average test set weighted cost (max is better): " + str(-1.0*np.dot(costs,test_weights)/np.sum(test_weights)) 145 | print "Average optimal test set cost (max is better): " + str(-1.0*np.mean(optcosts)) 146 | print "Average optimal test set weighted cost (max is better): " + str(-1.0*np.dot(optcosts,test_weights)/np.sum(test_weights)) 147 | 148 | with open(fname_out, 'wb') as output: 149 | pickle.dump(costs, output, pickle.HIGHEST_PROTOCOL) 150 | with open(fname_out_opt, 'wb') as output: 151 | pickle.dump(optcosts, output, pickle.HIGHEST_PROTOCOL) -------------------------------------------------------------------------------- /Applications/Yahoo News/SPOopt_news.py: 
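For reference, an illustrative (hypothetical) invocation of SPOopt_news.py using the parameter values documented in its header below would be:

    python SPOopt_news.py 6 10000 3 None 43200 10 -1 0.05

i.e., depth 6, a minimum of 10000 weighted observations per leaf, feature precision 3, CART-style post-pruning in place of a regularization grid, a 43200-second solver limit, constraint seed 10, the full training set, and quant_discret 0.05.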
-------------------------------------------------------------------------------- 1 | ''' 2 | Runs SPOT (MILP) algorithm on Yahoo News dataset 3 | Outputs decision costs for each test-set instance as pickle file 4 | Note on notation: the paper uses r to denote the binary variables which map training observations to leaves. This code uses z rather than r. 5 | Takes multiple input arguments: 6 | (1) H: training depth of tree, e.g. "5" 7 | (2) N_min: min. number of (weighted) observations per leaf, e.g. "100" 8 | (3) train_x_precision: contextual features x are rounded to train_x_precision before fitting MILP (e.g., 2 = two decimal places) 9 | higher values of train_x_precision will be more precise but take more computational time 10 | (4) reg_set_str: sequence of regularization parameters to try (tuned using cross validation), e.g. "0.001-0.01-0.1" 11 | if "None", fits MILP using no regularization and then prunes using CART pruning procedure (with SPO loss as pruning metric) 12 | (5) solver_time_limit: MILP solver is terminated after solver_time_limit seconds, returning best-found solution 13 | (6) decision_problem_seed: seed controlling generation of constraints in article recommendation problem (-1 = no constraints) 14 | (7) train_size: number of random obs. to extract from the training data. 15 | Only useful in limiting the size of the training data (-1 = use full training data) 16 | (8) quant_discret: Used to fit the SPOT tree used in warm-starting the MILP. 17 | Continuous variable split points in the tree are chosen from quantiles of the variable corresponding to quant_discret,2*quant_discret,3*quant_discret, etc.. 18 | Values of input arguments used in paper: 19 | (1) H: considered depths of 2, 4, 6 20 | (2) N_min: "10000" 21 | (3) train_x_precision: 3 22 | (4) reg_set_str: "None" 23 | (5) solver_time_limit: 43200 24 | (6) decision_problem_seed: ran 9 constraint instances corresponding to seeds of 10, 11, 12, ..., 18 25 | (7) train_size: -1 26 | (8) quant_discret: 0.05 27 | ''' 28 | 29 | import time 30 | 31 | import numpy as np 32 | import pickle 33 | from spo_opt_tree_news import* 34 | from SPO2CART import SPO2CART 35 | from SPO_tree_greedy import SPOTree 36 | from decision_problem_solver import* 37 | 38 | import sys 39 | ############################################## 40 | #seed controlling random subset of training data used (if full training data is not being used) 41 | select_train_seed = 0 42 | ######################################## 43 | #training parameters 44 | #optimal tree params 45 | H = int(sys.argv[1])#2 #H = max tree depth 46 | N_min = int(sys.argv[2])#4 #N_min = minimum number of observations per leaf node 47 | #higher values of train_x_precision will be more precise but take more computational time 48 | #values >= 8 might cause numerical errors 49 | train_x_precision = int(sys.argv[3])#2 50 | reg_set_str = sys.argv[4]#"None" 51 | if reg_set_str == "None": 52 | reg_set = None 53 | else: 54 | reg_set = [float(k) for k in reg_set_str.split('-')] #regularization params may be fractional (e.g., "0.001-0.01-0.1"), so parse as floats 55 | #reg_set = [0.001] #if reg_set = None, uses CART to prune tree 56 | solver_time_limit = int(sys.argv[5]) 57 | #ineligible_seeds = [27,28,29,32,39] #considered range = 10-40 inclusive 58 | decision_problem_seed=int(sys.argv[6]) #if -1, use no constraints in decision problem 59 | train_size=int(sys.argv[7]) #if you want to limit the size of the training data (-1 = no limit) 60 | quant_discret=float(sys.argv[8]) 61 | ######################################## 62 | #output filename for alg 63 | fname_out =
"SPOopt_news_costs_depth"+str(H)+"_minObs"+str(N_min)+"_prec"+str(train_x_precision)+"_regset"+reg_set_str+"_tLim"+str(solver_time_limit)+"_probSeed"+str(decision_problem_seed)+"_nTrain"+str(train_size)+"_qd"+str(quant_discret)+".pkl"; 64 | ############################################################################# 65 | ############################################################################# 66 | ############################################################################# 67 | 68 | #generate decision problem 69 | #ineligible_seeds = [27,28,29,32,39] #considered range = 10-40 inclusive 70 | num_constr = 5 #number of Aw <= b constraints 71 | num_dec = 6 #number of decisions 72 | if decision_problem_seed == -1: 73 | #no budget constraint case 74 | A_constr = np.zeros((num_constr,num_dec)) 75 | b_constr = np.ones(num_constr) 76 | l_constr = np.zeros(num_dec) 77 | u_constr = np.ones(num_dec) 78 | else: 79 | np.random.seed(decision_problem_seed) 80 | A_constr = np.random.exponential(scale=1.0, size=(num_constr,num_dec)) 81 | b_constr = np.ones(num_constr) 82 | l_constr = np.zeros(num_dec) 83 | u_constr = np.ones(num_dec) 84 | 85 | ############################################## 86 | 87 | thresh = "50" 88 | valid_size = "50.0%" 89 | 90 | train_x = np.load('filtered_train_userFeat_'+valid_size+'_'+thresh+'.npy') 91 | valid_x = np.load('filtered_validation_userFeat_'+valid_size+'_'+thresh+'.npy') 92 | test_x = np.load('filtered_test_userFeat_'+valid_size+'_'+thresh+'.npy') 93 | 94 | #make negative to turn into minimization problem 95 | train_cost = np.load('filtered_train_clickprob_'+valid_size+'_'+thresh+'.npy')*-1.0 96 | valid_cost = np.load('filtered_validation_clickprob_'+valid_size+'_'+thresh+'.npy')*-1.0 97 | test_cost = np.load('filtered_test_clickprob_'+valid_size+'_'+thresh+'.npy')*-1.0 98 | 99 | train_weights = np.load('filtered_train_usernumobserv_'+valid_size+'_'+thresh+'.npy') 100 | valid_weights = np.load('filtered_validation_usernumobserv_'+valid_size+'_'+thresh+'.npy') 101 | test_weights = np.load('filtered_test_usernumobserv_'+valid_size+'_'+thresh+'.npy') 102 | 103 | ############################################## 104 | #limit size of training data if specified 105 | if train_size != -1 and train_size <= train_x.shape[0] and train_size <= valid_x.shape[0]: 106 | np.random.seed(select_train_seed) 107 | sel_inds = np.random.choice(range(train_x.shape[0]), size = train_size, replace=False) 108 | train_x = train_x[sel_inds] 109 | train_cost = train_cost[sel_inds] 110 | train_weights = train_weights[sel_inds] 111 | sel_inds = np.random.choice(range(valid_x.shape[0]), size = train_size, replace=False) 112 | valid_x = valid_x[sel_inds] 113 | valid_cost = valid_cost[sel_inds] 114 | valid_weights = valid_weights[sel_inds] 115 | 116 | ############################################################################# 117 | ############################################################################# 118 | ############################################################################# 119 | 120 | start = time.time() 121 | 122 | #FIT SPO OPTIMAL TREE 123 | if reg_set is None: 124 | 125 | #FIT SPO GREEDY TREE AS INITIAL SOLUTION 126 | def truncate_train_x(train_x, train_x_precision): 127 | return(np.around(train_x, decimals=train_x_precision)) 128 | train_x_truncated = truncate_train_x(train_x, train_x_precision) 129 | #SPO_weight_param: number between 0 and 1: 130 | #Error metric: SPO_loss*SPO_weight_param + MSE_loss*(1-SPO_weight_param) 131 | my_tree = SPOTree(max_depth = H, 
min_weights_per_node = N_min, quant_discret = quant_discret, debias_splits=False, SPO_weight_param=1.0, SPO_full_error=True) 132 | my_tree.fit(train_x_truncated,train_cost,weights=train_weights,verbose=False,feats_continuous=True, 133 | A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr); 134 | 135 | #PRUNE SPO GREEDY TREE USING TRAINING SET (TO GET RID OF REDUNDANT LEAVES) 136 | my_tree.prune(train_x_truncated, train_cost, weights_val=train_weights, verbose=False, one_SE_rule=False) #verbose specifies whether pruning procedure should print progress 137 | my_tree.traverse() 138 | spo_greedy_a, spo_greedy_b, spo_greedy_z = my_tree.get_tree_encoding(x_train=train_x_truncated) 139 | 140 | #(OPTIONAL) PRUNE SPO GREEDY TREE USING VALIDATION SET 141 | # my_tree.prune(valid_x, valid_cost, verbose=False, one_SE_rule=False) #verbose specifies whether pruning procedure should print progress 142 | # spo_greedy_a, spo_greedy_b, spo_greedy_z = my_tree.get_tree_encoding(x_train=train_x_truncated) 143 | # alpha = my_tree.get_pruning_alpha() 144 | 145 | #FIT SPO OPTIMAL TREE USING FOUND INITIAL SOLUTION 146 | reg_param = 1e-5 #introduce very small amount of regularization to ensure leaves with zero predictive power are aggregated 147 | spo_dt_a, spo_dt_b, spo_dt_w, spo_dt_y, spo_dt_z, spo_dt_l, spo_dt_d = spo_opt_tree(train_cost,train_x,train_x_precision,reg_param, N_min, H, 148 | weights=train_weights, 149 | a_start=spo_greedy_a, z_start=spo_greedy_z, 150 | Presolve=2, Seed=0, TimeLimit=solver_time_limit, 151 | returnAllOptvars=True, 152 | A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr) 153 | 154 | end = time.time() 155 | print "Elapsed time: " + str(end-start) 156 | 157 | #(IF NOT USING POSTPRUNING) FIND TEST SET COST 158 | # path = decision_path(test_x,spo_dt_a,spo_dt_b) 159 | # costs_deg[deg][trial_num] = apply_leaf_decision(test_cost,path,spo_dt_w, subtract_optimal=False, A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr) 160 | 161 | #PRUNE MILP TREE USING CART PRUNING METHOD ON VALIDATION SET 162 | spo2cart = SPO2CART(spo_dt_a, spo_dt_b) 163 | spo2cart.fit(train_x,train_cost,train_x_precision,weights=train_weights,verbose=False,feats_continuous=True, 164 | A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr) 165 | spo2cart.traverse() #prints out the unpruned tree found by MILP 166 | spo2cart.prune(valid_x, valid_cost, weights_val=valid_weights, verbose=False, one_SE_rule=False) #verbose specifies whether pruning procedure should print progress 167 | spo2cart.traverse() #prints out the pruned tree 168 | #FIND TEST SET COST 169 | pred_decision = spo2cart.est_decision(test_x) 170 | costs = [np.sum(test_cost[i] * pred_decision[i]) for i in range(0,len(pred_decision))] 171 | 172 | else: 173 | spo_dt_a, spo_dt_b, spo_dt_w, _, best_alpha = spo_opt_tunealpha(train_x,train_cost,train_weights,valid_x,valid_cost,valid_weights,train_x_precision,reg_set, N_min, H, A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr) 174 | print("Best Alpha: " + str(best_alpha)) 175 | 176 | end = time.time() 177 | print "Elapsed time: " + str(end-start) 178 | 179 | #FIND TEST SET COST 180 | path = decision_path(test_x,spo_dt_a,spo_dt_b) 181 | costs = apply_leaf_decision(test_cost,path,spo_dt_w, subtract_optimal=False, A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr) 182 | 183 | with open(fname_out, 'wb') as output: 184 | pickle.dump(costs, output, pickle.HIGHEST_PROTOCOL) 185 | 186 |
print "Average test set cost (max is better): " + str(-1.0*np.mean(costs)) 187 | print "Average test set weighted cost (max is better): " + str(-1.0*np.dot(costs,test_weights)/np.sum(test_weights)) -------------------------------------------------------------------------------- /Applications/Yahoo News/decision_problem_solver.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Generic file to set up the decision problem (i.e., optimization problem) under consideration 3 | Must have functions: 4 | get_num_decisions(): returns number of decision variables (i.e., dimension of cost vector) 5 | find_opt_decision(): returns for a matrix of cost vectors the corresponding optimal decisions for those cost vectors 6 | 7 | This particular file sets up a news article recommendation decision problem 8 | ''' 9 | 10 | from gurobipy import * 11 | import numpy as np 12 | import sys 13 | 14 | def get_num_decisions(A_constr=None, b_constr=None, l_constr=None, u_constr=None): 15 | num_constr, num_dec = A_constr.shape 16 | return num_dec 17 | 18 | def find_opt_decision(p, A_constr=None, b_constr=None, l_constr=None, u_constr=None): 19 | num_constr, num_dec = A_constr.shape 20 | '''input matrix p, such that each row corresponds to an instance''' 21 | weights = np.zeros(p.shape) 22 | objective = np.zeros(p.shape[0]) 23 | 24 | if (p.shape[1] != num_dec): 25 | return 'Shape inconsistent, check input dimensions.' 26 | 27 | model = Model() 28 | model.Params.outputflag = 0 29 | w = model.addVars(num_dec, lb = l_constr, ub= u_constr) 30 | #w = model.addVars(num_dec, lb =0) 31 | model.addConstrs((quicksum(A_constr[i][j]*w[j] for j in range(num_dec)) <= b_constr[i] for i in range(num_constr))) 32 | model.addConstr(quicksum(w[j] for j in range(num_dec)) == 1) 33 | 34 | for inst in range(p.shape[0]): 35 | #model.setObjective(quicksum(p[inst,j]*w[j] for j in range(num_dec)), GRB.MAXIMIZE) 36 | model.setObjective(quicksum(p[inst,j]*w[j] for j in range(num_dec)), GRB.MINIMIZE) 37 | model.optimize() 38 | if model.status == GRB.OPTIMAL: 39 | weights[inst,:] = np.array([w[j].X for j in range(num_dec)]) 40 | objective[inst] = model.ObjVal 41 | else: 42 | print(inst, "Infeasible!") 43 | sys.exit("Decision problem infeasible") 44 | # print(model.status) 45 | return {'weights': weights, 'objective':objective} 46 | 47 | 48 | '''Example input''' 49 | #np.random.seed(0) 50 | #gen_decision_problem() 51 | #inst_num = 10 52 | #p = np.random.rand(inst_num,num_dec)*-1.0 53 | #w = find_opt_decision(p) 54 | #w = w['weights'] 55 | #print(w) 56 | #print(p) 57 | -------------------------------------------------------------------------------- /Applications/Yahoo News/spo_opt_tree_news.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Code for fitting SPOT MILP on Yahoo News dataset. 3 | Note that this code is specifically designed for the Yahoo News application and 4 | will not work for other applications without modifying the constraints. 5 | Note on notation: the paper uses r to denote the binary variables which map training observations to leaves. This code uses z rather than r. 
6 | ''' 7 | 8 | 9 | import numpy as np 10 | import pandas as pd 11 | from gurobipy import* 12 | import pickle 13 | from sklearn.model_selection import KFold 14 | from decision_problem_solver import* 15 | 16 | # Helper functions for tree structure 17 | def find_parent_index(t): 18 | return (t+1)//2 - 1 19 | 20 | def find_ancestors(t): 21 | l= [] 22 | r = [] 23 | if t == 0: 24 | return 25 | else: 26 | while find_parent_index(t) !=0: 27 | parent = find_parent_index(t) 28 | if (t+1)% (1+parent) ==1: 29 | r.append(parent) 30 | else: 31 | l.append(parent) 32 | t = parent 33 | if t==2: 34 | r.append(0) 35 | else: 36 | l.append(0) 37 | return[l,r] 38 | 39 | #truncate training set features to desired precision 40 | def truncate_train_x(train_x, train_x_precision): 41 | return(np.around(train_x, decimals=train_x_precision)) 42 | 43 | 44 | #trains an optimal tree model on train_cost, train_x, and reg parameter spo_opt_tree_reg (scalar) 45 | #returns parameter encoding of the optimal tree (a,b,w). (a,b) encode splits, (w) encode leaf decisions 46 | #optimal tree params: 47 | #N_min = minimum number of observations per leaf node 48 | #H = max tree depth 49 | #def spo_opt_tree(train_cost, train_x,spo_opt_tree_reg): 50 | def spo_opt_tree(train_cost, train_x, train_x_precision, spo_opt_tree_reg, N_min, H, 51 | weights=None, 52 | returnAllOptvars=False, 53 | a_start=None, b_start=None, w_start=None, y_start=None, z_start=None, l_start=None, d_start=None, 54 | threads=None, MIPGap=None, MIPFocus=None, verbose=False, Seed=None, TimeLimit=None, 55 | Presolve=None, ImproveStartTime=None, VarBranch=None, Cuts=None, 56 | tune=False, TuneCriterion=None, TuneJobs=None, TuneTimeLimit=None, TuneTrials=None, tune_foutpref=None, 57 | A_constr=None, b_constr=None, l_constr=None, u_constr=None): 58 | 59 | ################################ 60 | #assert(spo_opt_tree_reg >= 1e-4) 61 | ################################ 62 | assert(A_constr is not None) 63 | assert(b_constr is not None) 64 | assert(l_constr is not None) 65 | assert(u_constr is not None) 66 | assert(weights is not None) 67 | 68 | #currently only lower bound = 0, upper bound = 1 constraints supported 69 | assert(len(np.unique(l_constr)) == 1 and np.unique(l_constr)[0] == 0) 70 | assert(len(np.unique(u_constr)) == 1 and np.unique(u_constr)[0] == 1) 71 | 72 | num_constr, D = A_constr.shape 73 | 74 | # We label all nodes of the tree by 0, 1, 2, ..., 2**(H+1) - 2.
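# Worked example of this labeling (for illustration): with H = 2 there are T_B = 3 branch nodes
# (0, 1, 2) and T_L = 4 leaf nodes (3, 4, 5, 6). Leaf index t = 1 corresponds to node t + T_B = 4,
# and find_ancestors(4) returns [[0], [1]]: the path to that leaf branches left at the root (node 0)
# and right at node 1.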
75 | T_B = 2**H - 1 76 | T_L = 2**H 77 | 78 | n_train, P = train_x.shape 79 | #truncate x features so eps (below) will not be too small 80 | train_x = truncate_train_x(train_x, train_x_precision) 81 | 82 | assert(np.all(train_x >= 0)) 83 | assert(np.all(train_x <= 1)) 84 | assert(np.all(train_cost <= 0)) #assert nonpositive costs 85 | assert(np.all(train_cost.shape[0] == train_x.shape[0])) 86 | # Instantiate optimization model 87 | # Compute average optimal cost across all training set observations 88 | # (Although irrelevant for the optimization problem, it helps in interpreting alpha) 89 | optimal_costs = np.zeros(train_x.shape[0]) 90 | for i in range(train_x.shape[0]): 91 | optimal_costs[i] = find_opt_decision(train_cost[i,:].reshape(1,-1),A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr)['objective'][0] 92 | if weights is not None: 93 | sum_optimal_cost = np.dot(optimal_costs, weights) 94 | else: 95 | sum_optimal_cost = sum(optimal_costs) 96 | 97 | # Compute big M constant 98 | negM = 0 99 | for i in range(train_x.shape[0]): 100 | min_decision_cost = find_opt_decision(train_cost[i,:].reshape(1,-1),A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr)['objective'][0] 101 | if min_decision_cost <= negM: 102 | negM = min_decision_cost 103 | 104 | #M = train_cost.max()*(dim-1)*2 105 | spo = Model('spo_opt_tree') 106 | if verbose == False: 107 | spo.Params.OutputFlag = 0 108 | #compute epsilon constants 109 | #eps = np.float("inf") 110 | #for j in range(train_x.shape[1]): 111 | #ordered_feat = np.sort(train_x[:,j]) 112 | #diffs = ordered_feat[1:]-ordered_feat[:-1] 113 | #nonzero_diffs = diffs[diffs > 0] 114 | #if min(nonzero_diffs) <= eps: 115 | #eps = min(nonzero_diffs) 116 | #one_plus_eps = 1 + eps 117 | 118 | eps = np.array([np.float("inf")]*train_x.shape[1]) 119 | for j in range(train_x.shape[1]): 120 | ordered_feat = np.sort(train_x[:,j]) 121 | diffs = ordered_feat[1:]-ordered_feat[:-1] 122 | nonzero_diffs = diffs[diffs > 0] 123 | eps[j] = min(nonzero_diffs) 124 | one_plus_eps_max = 1 + max(eps) 125 | 126 | #run params 127 | if threads is not None: 128 | spo.Params.Threads = threads 129 | if MIPGap is not None: 130 | spo.Params.MIPGap = MIPGap # default = 1e-4, try 1e-2 131 | if MIPFocus is not None: 132 | spo.Params.MIPFocus = MIPFocus 133 | if Seed is not None: 134 | spo.Params.Seed = Seed 135 | if TimeLimit is not None: 136 | spo.Params.TimeLimit = TimeLimit 137 | if Presolve is not None: 138 | spo.Params.Presolve = Presolve 139 | if ImproveStartTime is not None: 140 | spo.Params.ImproveStartTime = ImproveStartTime 141 | if VarBranch is not None: 142 | spo.Params.VarBranch = VarBranch 143 | if Cuts is not None: 144 | spo.Params.Cuts = Cuts 145 | 146 | #tune params 147 | if tune == True and TuneCriterion is not None: 148 | spo.Params.TuneCriterion = TuneCriterion 149 | if tune == True and TuneJobs is not None: 150 | spo.Params.TuneJobs = TuneJobs 151 | if tune == True and TuneTimeLimit is not None: 152 | spo.Params.TuneTimeLimit = TuneTimeLimit 153 | if tune == True and TuneTrials is not None: 154 | spo.Params.TuneTrials = TuneTrials 155 | 156 | # Add variables 157 | y = spo.addVars(tuplelist([(i, t) for i in range(n_train) for t in range(T_L)]), lb = -GRB.INFINITY, ub = 0, name = 'y') 158 | z = spo.addVars(tuplelist([(i, t) for i in range(n_train) for t in range(T_L)]), vtype=GRB.BINARY, name = 'z') 159 | #z = spo.addVars(tuplelist([(i, t) for i in range(n_train) for t in range(T_L)]), lb = 0, ub = 1, name = 'z') 160 | w = 
spo.addVars(tuplelist([(t, j) for t in range(T_L) for j in range(D)]), lb = 0, ub= 1,name = 'w') 161 | l = spo.addVars(tuplelist([i for i in range(T_L)]), vtype=GRB.BINARY, name = 'l') 162 | d = spo.addVars(tuplelist([i for i in range(T_B)]), vtype=GRB.BINARY, name = 'd') 163 | a = spo.addVars(tuplelist([(j,t) for j in range(P) for t in range(T_B)]), vtype=GRB.BINARY, name = 'a') 164 | #b = spo.addVars(tuplelist([i for i in range(T_B)]), lb = 0, name = 'b') 165 | b = spo.addVars(tuplelist([i for i in range(T_B)]), lb = 0, ub = 1, name = 'b') 166 | 167 | if a_start is not None: 168 | for i in range(P): 169 | for j in range(T_B): 170 | a[i,j].start = a_start[i,j] 171 | 172 | if b_start is not None: 173 | for i in range(T_B): 174 | b[i].start = b_start[i] 175 | 176 | if w_start is not None: 177 | for i in range(T_L): 178 | for j in range(D): 179 | w[i,j].start = w_start[i,j] 180 | 181 | if y_start is not None: 182 | for i in range(n_train): 183 | for j in range(T_L): 184 | y[i,j].start = y_start[i,j] 185 | 186 | if z_start is not None: 187 | for i in range(n_train): 188 | for j in range(T_L): 189 | z[i,j].start = z_start[i,j] 190 | 191 | if l_start is not None: 192 | for i in range(T_L): 193 | l[i].start = l_start[i] 194 | 195 | if d_start is not None: 196 | for i in range(T_B): 197 | d[i].start = d_start[i] 198 | 199 | spo.update() #for initial values to be written immediately 200 | 201 | # if a_start is not None: 202 | # for i in range(P): 203 | # for j in range(T_B): 204 | # print(a[i,j].start) 205 | # 206 | # if b_start is not None: 207 | # for i in range(T_B): 208 | # print(b[i].start) 209 | # 210 | # if w_start is not None: 211 | # for i in range(T_L): 212 | # for j in range(D): 213 | # print(w[i,j].start) 214 | # 215 | # if y_start is not None: 216 | # for i in range(n_train): 217 | # for j in range(T_L): 218 | # print(y[i,j].start) 219 | # 220 | # if z_start is not None: 221 | # for i in range(n_train): 222 | # for j in range(T_L): 223 | # print(z[i,j].start) 224 | # 225 | # if l_start is not None: 226 | # for i in range(T_L): 227 | # print(l[i].start) 228 | # 229 | # if d_start is not None: 230 | # for i in range(T_B): 231 | # print(d[i].start) 232 | 233 | 234 | # Add constraints 235 | # Const 236 | for i in range(n_train): 237 | for t in range(T_L): 238 | #expr_constraint = LinExpr([ (train_cost[i,j], w[t,j]) for key,j in Edge_dict.items()]) 239 | expr_constraint = LinExpr([ (train_cost[i,j], w[t,j]) for j in range(D)]) 240 | spo.addConstr(y[i,t] >= expr_constraint) 241 | spo.addConstr(y[i,t] >= negM * z[i,t]) 242 | # spo.addConstr(y[i,t] >= quicksum(train_cost[i,j] * w[t,j] for key,j in Edge_dict.items())- M * (1 - z[i,t])) 243 | 244 | # # (genreral constraint for feasibility of nominal problem Aw <= B) 245 | # for t in range(T_L): 246 | # for i in range(K): 247 | # spo.addConstr(quicksum(A[i,j] * w[t,j] for j in range(D)) <= B[i] ) 248 | 249 | # Const 250 | #constraint for feasibility of decision problem) 251 | for t in range(T_L): 252 | spo.addConstrs((quicksum(A_constr[i][j]*w[t,j] for j in range(D)) <= b_constr[i] for i in range(num_constr))) 253 | spo.addConstr(quicksum(w[t,j] for j in range(D)) == 1) 254 | 255 | # Const 256 | for i in range(n_train): 257 | # spo.addConstr(quicksum(z[i,t] for t in range(T_L)) == 1) 258 | spo.addConstr(LinExpr([(1,z[i,t]) for t in range(T_L)]) == 1) 259 | 260 | # Const 261 | for i in range(n_train): 262 | for t in range(T_L): 263 | spo.addConstr(z[i,t] <= l[t]) 264 | 265 | # Const 266 | for t in range(T_L): 267 | # spo.addConstr(quicksum(z[i,t] 
for i in range(n_train))>= N_min * l[t]) 268 | if weights is not None: 269 | spo.addConstr(LinExpr([(weights[i],z[i,t]) for i in range(n_train)])>= N_min * l[t]) 270 | else: 271 | spo.addConstr(LinExpr([(1,z[i,t]) for i in range(n_train)])>= N_min * l[t]) 272 | 273 | # Const 274 | for i in range(n_train): 275 | for t in range(T_L): 276 | left, right = find_ancestors(t + T_B) 277 | for m in right: 278 | # spo.addConstr(quicksum(a[p,m]* train_x[i,p] for p in range(P)) >= b[m]- (1 - z[i,t] )) 279 | spo.addConstr(LinExpr([(train_x[i,p],a[p,m]) for p in range(P)]) >= b[m]- (1 - z[i,t] )) 280 | 281 | # Const 282 | for m in left: 283 | # spo.addConstr(quicksum(a[p,m]* (x[i,p] + eps[p]) for p in range(P))<= b[m] + (1+eps_max)*(1-z[i,t] )) 284 | #spo.addConstr(quicksum(a[p,m]* train_x[i,p] for p in range(P)) +0.0001<= b[m] + (1-z[i,t] )) 285 | #spo.addConstr(LinExpr([(train_x[i,p],a[p,m]) for p in range(P)]) + eps <= b[m] + (1+eps)*(1 - z[i,t] )) 286 | spo.addConstr(LinExpr([(train_x[i,p]+ eps[p],a[p,m]) for p in range(P)]) <= b[m] + one_plus_eps_max*(1 - z[i,t] )) 287 | #spo.addConstr(LinExpr([(train_x[i,p],a[p,m]) for p in range(P)]) <= b[m] + 1 - one_plus_eps*z[i,t]) 288 | 289 | # Const 290 | for t in range(T_B): 291 | # spo.addConstr(quicksum(a[p,t] for p in range(P)) == d[t]) 292 | spo.addConstr(LinExpr([(1,a[p,t]) for p in range(P)]) == d[t]) 293 | 294 | # Const 295 | for t in range(T_B): 296 | #spo.addConstr(b[t] <= d[t]) 297 | spo.addConstr(b[t] >= 1 - d[t]) 298 | 299 | # Const 300 | for t in range(1,T_B): 301 | spo.addConstr(d[t] <= d[find_parent_index(t)]) 302 | 303 | # Const (optional): ensures LP relaxation of problem has y's defined sensibly 304 | for i in range(n_train): 305 | spo.addConstr(LinExpr([(1,y[i,t]) for t in range(T_L)]) >= optimal_costs[i]) 306 | #for t in range(T_L): 307 | #expr_constraint = LinExpr([ (train_cost[i,j], w[t,j]) for key,j in Edge_dict.items()]) 308 | #expr_constraint = LinExpr([ (train_cost[i,j], w[t,j]) for j in range(D)]) 309 | #spo.addConstr(expr_constraint >= optimal_costs[i]) 310 | #spo.addConstr(LinExpr([(1,y[i,t]) for t in range(T_L)]) >= optimal_costs[i]) 311 | 312 | # Add objective 313 | # spo.setObjective( quicksum(y[i,t] for i in range(n_train) for t in range(T_L))/n_train + spo_opt_tree_reg* quicksum(d[t] for t in range(T_B) ), GRB.MINIMIZE) 314 | if weights is not None: 315 | expr_objective = LinExpr([(weights[i], y[i,t]) for i in range(n_train) for t in range(T_L) ]) - sum_optimal_cost 316 | else: 317 | expr_objective = LinExpr([(1, y[i,t]) for i in range(n_train) for t in range(T_L) ]) - sum_optimal_cost 318 | #expr_objective = LinExpr([(1, y[i,t]) for i in range(n_train) for t in range(T_L) ]) 319 | if spo_opt_tree_reg > 0: 320 | if weights is not None: 321 | sum_weights = sum(weights) 322 | expr_objective.add(LinExpr([(1, d[t]) for t in range(T_B)])*spo_opt_tree_reg*sum_weights) 323 | else: 324 | expr_objective.add(LinExpr([(1, d[t]) for t in range(T_B)])*spo_opt_tree_reg*n_train) 325 | spo.setObjective(expr_objective, GRB.MINIMIZE) 326 | 327 | 328 | # Solve optimization 329 | if tune == True: 330 | spo.tune() 331 | if tune_foutpref is None: 332 | tune_foutpref='tune' 333 | for i in range(spo.tuneResultCount): 334 | spo.getTuneResult(i) 335 | spo.write(tune_foutpref+str(i)+'.prm') 336 | spo.optimize() 337 | 338 | #############################33333 339 | # if spo.status == GRB.OPTIMAL: 340 | # print("Objective Value:") 341 | # print(spo.ObjVal) 342 | # print("Reg term of objective:") 343 | # if weights is not None: 344 | # 
print(spo_opt_tree_reg*sum_weights) 345 | # else: 346 | # print(spo_opt_tree_reg*n_train) 347 | # else: 348 | # import sys 349 | # print("Infeasible!") 350 | # sys.exit("Decision problem infeasible") 351 | #############################33333 352 | 353 | # Get values of objective and variables 354 | # print('Obj=') 355 | # print(spo.getObjective().getValue()) 356 | # 357 | # z_ = np.zeros((n,T_L)) 358 | # z_res = spo.getAttr('X', z) 359 | # for i,j in z_res: 360 | # z_[i,j] = z_res[i,j] 361 | # print(z_) 362 | spo_dt_a = np.zeros((P,T_B)) 363 | a_res = spo.getAttr('X', a) 364 | for i in range(P): 365 | for j in range(T_B): 366 | spo_dt_a[i,j] = a_res[i,j] 367 | #for i,j in a_res: 368 | #spo_dt_a[i,j] = a_res[i,j] 369 | 370 | spo_dt_b = np.zeros(T_B) 371 | b_res = spo.getAttr('X', b) 372 | for i in range(T_B): 373 | spo_dt_b[i] = b_res[i] 374 | #for i in b_res: 375 | #spo_dt_b[i] = b_res[i] 376 | 377 | spo_dt_w = np.zeros((T_L,D)) 378 | w_res = spo.getAttr('X', w) 379 | for i in range(T_L): 380 | for j in range(D): 381 | spo_dt_w[i,j] = w_res[i,j] 382 | #for i,j in w_res: 383 | #spo_dt_w[i,j] = w_res[i,j] 384 | 385 | spo_dt_y = np.zeros((n_train,T_L)) 386 | y_res = spo.getAttr('X', y) 387 | for i in range(n_train): 388 | for j in range(T_L): 389 | spo_dt_y[i,j] = y_res[i,j] 390 | spo_dt_z = np.zeros((n_train,T_L)) 391 | z_res = spo.getAttr('X', z) 392 | for i in range(n_train): 393 | for j in range(T_L): 394 | spo_dt_z[i,j] = z_res[i,j] 395 | spo_dt_l = np.zeros(T_L) 396 | l_res = spo.getAttr('X', l) 397 | for i in range(T_L): 398 | spo_dt_l[i] = l_res[i] 399 | spo_dt_d = np.zeros(T_B) 400 | d_res = spo.getAttr('X', d) 401 | for i in range(T_B): 402 | spo_dt_d[i] = d_res[i] 403 | 404 | if returnAllOptvars == False: 405 | return spo_dt_a, spo_dt_b, spo_dt_w 406 | else: 407 | return spo_dt_a, spo_dt_b, spo_dt_w, spo_dt_y, spo_dt_z, spo_dt_l, spo_dt_d 408 | 409 | # Given a tree defined by a and b for all interior nodes, find the path (including the leaf node in which it lies) of observations using its features 410 | def decision_path(x,a,b): 411 | T_B = len(b) 412 | if len(x.shape) == 1: 413 | n = 1 414 | P = x.size 415 | else: 416 | n, P = x.shape 417 | res = [] 418 | for i in range(n): 419 | node = 0 420 | path = [0] 421 | T_B = a.shape[1] 422 | while node < T_B: 423 | if np.dot(x[i,:], a[:,node]) < b[node]: 424 | node = (node+1)*2 - 1 425 | else: 426 | node = (node+1)*2 427 | path.append(node) 428 | res.append(path) 429 | return np.array(res) 430 | 431 | 432 | # Given the path of an observation (including the leaf node in which it lies), find the predicted total cost for that observation 433 | def apply_leaf_decision(c,path, w, subtract_optimal=False, A_constr=None, b_constr=None, l_constr=None, u_constr=None): 434 | T_L, D = w.shape 435 | n = c.shape[0] 436 | paths = path[:, -1] 437 | actual_cost = [] 438 | for i in range(n): 439 | decision_node = paths[i] - T_L +1 440 | cost_decision = np.dot(c[i,:], w[decision_node,:]) 441 | if subtract_optimal == True: 442 | cost_optimal = find_opt_decision(c[i,:].reshape(1,-1), A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr)['objective'][0] 443 | actual_cost.append(cost_decision-cost_optimal) 444 | else: 445 | actual_cost.append(cost_decision) 446 | return np.array(actual_cost) 447 | 448 | def spo_opt_traintest(train_x,train_cost,train_weights,test_x,test_cost,test_weights,train_x_precision,spo_opt_tree_reg, N_min, H, A_constr=None, b_constr=None, l_constr=None, u_constr=None): 449 | spo_dt_a,spo_dt_b, spo_dt_w = 
spo_opt_tree(train_cost,train_x,train_x_precision,spo_opt_tree_reg, N_min, H, 450 | weights=train_weights, A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr) 451 | path = decision_path(test_x,spo_dt_a,spo_dt_b) 452 | costs = apply_leaf_decision(test_cost,path, spo_dt_w, subtract_optimal=True, A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr) 453 | return spo_dt_a,spo_dt_b, spo_dt_w, np.dot(costs,test_weights)/np.sum(test_weights) 454 | 455 | def spo_opt_tunealpha(train_x,train_cost,train_weights,valid_x,valid_cost,valid_weights,train_x_precision,reg_set, N_min, H, A_constr=None, b_constr=None, l_constr=None, u_constr=None): 456 | best_err = np.float("inf") 457 | for alpha in reg_set: 458 | spo_dt_a,spo_dt_b, spo_dt_w, err = spo_opt_traintest(train_x,train_cost,train_weights,valid_x,valid_cost,valid_weights,train_x_precision,alpha, N_min, H, A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr) 459 | if err <= best_err: 460 | best_spo_dt_a, best_spo_dt_b, best_spo_dt_w, best_err, best_alpha = spo_dt_a,spo_dt_b, spo_dt_w, err, alpha 461 | return best_spo_dt_a, best_spo_dt_b, best_spo_dt_w, best_err, best_alpha 462 | 463 | #def cv_spo_opt_traintest(cost, X, train_x_precision,reg_set, splits = 4): 464 | # dic = {reg:0 for reg in reg_set} 465 | # n, n_edges = cost.shape 466 | # K = X.shape[1] 467 | # kf = KFold(n_splits = splits) 468 | # for train, test in kf.split(X): 469 | # X_train, X_test, cost_train, cost_test = X[train], X[test], cost[train], cost[test] 470 | # opt_cost = find_opt_decision(cost_test, A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr)['objective'] 471 | # for spo_opt_tree_reg in reg_set: 472 | # actual_cost = spo_opt_traintest(X_train, cost_train, X_test, cost_test,train_x_precision,spo_opt_tree_reg) 473 | # dic[spo_opt_tree_reg] += sum(actual_cost - opt_cost) 474 | # return smallest_dic_value(dic) 475 | # 476 | #def smallest_dic_value(dic): 477 | # reverse = dict() 478 | # for key in dic.keys(): 479 | # reverse[dic[key]] = key 480 | # return reverse[min(reverse.keys())] 481 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Adam N. Elmachtoub, Jason Cheuk Nam Liang, Ryan McNellis 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
-------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # SPO Tree 2 | 3 | This is a Python code base for training "SPO Trees" (SPOTs) from the paper "Decision Trees for Decision-Making under the Predict-then-Optimize Framework". 4 | 5 | The "Algorithms" folder contains implementations for training (greedy) SPO Trees (SPO_tree_greedy.py) and SPO Forests (SPOForest.py) for general predict-then-optimize problems. 6 | 7 | The "Applications" folder contains all data + code for reproducing the three numerical experiments (applications) covered in the paper: 8 | * Illustrative Example: A two-road shortest path decision problem. 9 | * Shortest Path: A shortest path decision problem over a 4 x 4 grid network, where the driver starts in 10 | the southwest corner and tries to find the shortest path to the northeast corner. 11 | * Yahoo News: A news article recommendation decision problem constructed from the Yahoo! Front Page Today Module dataset. 12 | 13 | Code currently only supports Python 2.7 (not Python 3). 14 | Package Dependencies: gurobipy (with valid Gurobi license), numpy, pandas, scipy, joblib 15 | * The Illustrative Example application also depends on matplotlib 16 | --------------------------------------------------------------------------------
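For convenience, a minimal usage sketch of the greedy SPO Tree interface (as exercised by the Yahoo News scripts above) is shown below. The toy feature and cost matrices are hypothetical placeholders, and a real run additionally requires a valid Gurobi license plus a decision_problem_solver.py appropriate to the application; any solver-specific keyword arguments (such as the constraint matrices used in the Yahoo News scripts) would be passed through fit().

```python
import numpy as np
from SPO_tree_greedy import SPOTree

# Hypothetical toy data: 100 observations, 5 contextual features, 6-dimensional cost vectors
np.random.seed(0)
X = np.random.rand(100, 5)
C = -np.random.rand(100, 6)  # e.g., negated click probabilities, as in the Yahoo News scripts

# SPO_weight_param=1.0 trains with the SPO loss; 0.0 would instead give a CART-style (MSE) tree
tree = SPOTree(max_depth=3, min_weights_per_node=10, quant_discret=0.05,
               debias_splits=False, SPO_weight_param=1.0, SPO_full_error=True)
tree.fit(X, C, weights=np.ones(100), verbose=False, feats_continuous=True)
tree.traverse()                    # print the fitted tree
decisions = tree.est_decision(X)   # estimated optimal decisions for each context
cost_preds = tree.est_cost(X)      # estimated cost vectors for each context
```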