├── Algorithms
│   ├── README.md
│   ├── SPO2CART.py
│   ├── SPOForest.py
│   ├── SPO_tree_greedy.py
│   ├── leaf_model.py
│   ├── mtp.py
│   └── mtp_SPO2CART.py
├── Applications
│   ├── Illustrative Example
│   │   ├── decision_problem_solver.py
│   │   └── illustrative_example.py
│   ├── README.md
│   ├── Shortest Path
│   │   ├── SPOForest_nonlinear.py
│   │   ├── SPOgreedy_nonlinear.py
│   │   ├── SPOopt_nonlinear.py
│   │   ├── decision_problem_solver.py
│   │   └── spo_opt_tree_nonlinear.py
│   └── Yahoo News
│       ├── Data Preprocessing
│       │   ├── DataPreprocessing1.ipynb
│       │   ├── DataPreprocessing2.ipynb
│       │   ├── README.md
│       │   └── YahooNewsDataExtraction.py
│       ├── SPOForest_news.py
│       ├── SPOgreedy_news.py
│       ├── SPOopt_news.py
│       ├── decision_problem_solver.py
│       └── spo_opt_tree_news.py
├── LICENSE.txt
└── README.md

/Algorithms/README.md:
--------------------------------------------------------------------------------
1 | # Algorithms
2 | 
3 | This folder contains implementations for training (greedy) SPO Trees (SPO_tree_greedy.py) and SPO Forests (SPOForest.py) for general predict-then-optimize problems. The implementation of the SPO Tree MILP approach is tailored to the specific applications of the paper and is therefore provided in the relevant applications folder.
4 | 
5 | The SPO Tree / SPO Forest classes consist of the following methods:
6 | * \_\_init\_\_(): initializes the algorithm
7 | * fit(): trains the algorithm on data: contextual features X, cost vectors C
8 | * traverse(): prints out the learned tree/forest
9 | * prune(): prunes the SPO Tree on a held-out validation set to prevent overfitting, using the CART pruning method. (Not implemented for SPO Forests.)
10 | * est_decision(): outputs estimated optimal decisions given new contextual features Xnew
11 | * est_cost(): outputs estimated cost vectors given new contextual features Xnew
12 | 
13 | Further documentation is provided in the headers of SPO_tree_greedy.py and SPOForest.py.
14 | 
15 | The structure of the decision-making problem of interest should be encoded in a file called decision_problem_solver.py. This file should contain two functions specified by the practitioner:
16 | * get_num_decisions(): returns the number of decision variables (i.e., the dimension of the cost vector for the underlying decision problem)
17 | * find_opt_decision(): given a matrix of cost vectors (one per row), returns the corresponding optimal decisions.
18 | 
19 | Any (optional) arguments for these functions may be passed as keyword arguments to the fit() functions of the SPO Tree/Forest classes. An example is given in the Yahoo News application. The shortest path applications provide an additional example of the specification of decision_problem_solver.py.
20 | 
21 | Code currently only supports Python 2.7 (not Python 3).
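To make the interface concrete, below is a minimal sketch of a toy decision_problem_solver.py (choose the cheaper of two edges) followed by a typical SPO Tree workflow. Everything here is illustrative: the toy solver and simulated data are placeholders (the applications in this repository formulate their decision problems with gurobipy), while the class and keyword-argument names follow the Illustrative Example script.

```python
# decision_problem_solver.py -- toy sketch: choose the cheaper of two edges
import numpy as np

def get_num_decisions():
    return 2  # dimension of the cost vector

def find_opt_decision(cost):
    # cost: numpy array with one cost vector per row
    weights = np.zeros(cost.shape)
    weights[np.arange(cost.shape[0]), np.argmin(cost, axis=1)] = 1.0  # pick the cheaper edge
    objective = np.min(cost, axis=1)                                  # optimal objective values
    return {'weights': weights, 'objective': objective}
```

With the file above saved as decision_problem_solver.py in the same folder:

```python
# usage sketch for the greedy SPO Tree (arguments as in the Illustrative Example script)
import numpy as np
from SPO_tree_greedy import SPOTree

X = np.random.rand(1000, 3)                               # contextual features (rows = observations)
C = np.column_stack((2.0*X[:, 0] + 1.0, 5.0*X[:, 1]**2))  # toy cost vectors (rows = observations)

my_tree = SPOTree(max_depth=3, min_weights_per_node=20, quant_discret=0.01,
                  debias_splits=False, SPO_weight_param=1.0, SPO_full_error=True)
my_tree.fit(X[:800], C[:800], feats_continuous=True, verbose=False)
my_tree.prune(X[800:], C[800:], one_SE_rule=False)   # CART-style pruning on held-out data
my_tree.traverse()                                   # print the learned tree
decisions = my_tree.est_decision(X[800:])            # estimated optimal decisions
costs = my_tree.est_cost(X[800:])                    # estimated cost vectors
```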
22 | Package Dependencies: gurobipy (with valid Gurobi license), numpy, pandas, scipy, joblib 23 | -------------------------------------------------------------------------------- /Algorithms/SPO2CART.py: -------------------------------------------------------------------------------- 1 | """ 2 | Encodes SPOT MILP as the structure of a CART tree in order to apply CART's pruning method 3 | Also supports traverse() which traverses the tree 4 | """ 5 | import numpy as np 6 | from mtp_SPO2CART import MTP_SPO2CART 7 | from decision_problem_solver import* 8 | from scipy.spatial import distance 9 | 10 | 11 | def truncate_train_x(train_x, train_x_precision): 12 | return(np.around(train_x, decimals=train_x_precision)) 13 | 14 | class SPO2CART(object): 15 | ''' 16 | This function initializes the SPO tree 17 | 18 | Parameters: 19 | 20 | max_depth: the maximum depth of the pre-pruned tree (default = Inf: no depth limit) 21 | 22 | min_weight_per_node: the mininum number of observations (with respect to cumulative weight) per node 23 | 24 | min_depth: the minimum depth of the pre-pruned tree (default: set equal to max_depth) 25 | 26 | min_diff: if depth > min_depth, stop splitting if improvement in fit does not exceed min_diff 27 | 28 | binary_splits: if True, use binary splits when building the tree, else consider multiway splits 29 | (i.e., when splitting on a variable, split on all unique vals) 30 | 31 | debias_splits/frac_debias_set/min_debias_set_size: Additional params when binary_splits = True. If debias_splits = True, then in each node, 32 | hold out frac_debias_set of the training set (w.r.t. case weights) to evaluate the error of the best splitting point for each feature. 33 | Stop bias-correcting when we have insufficient data; i.e. the total weight in the debias set < min_debias_set_size. 34 | Note: after finding best split point, we then refit the model on all training data and recalculate the training error 35 | 36 | quant_discret: continuous variable split points are chosen from quantiles of the variable corresponding to quant_discret,2*quant_discret,3*quant_discret, etc.. 37 | 38 | run_in_parallel: if set to True, enables parallel computing among num_workers threads. If num_workers is not 39 | specified, uses the number of cpu cores available. 40 | ''' 41 | def __init__(self, a,b,**kwargs): 42 | 43 | kwargs["SPO_weight_param"] = 1.0 44 | 45 | if "SPO_full_error" not in kwargs: 46 | kwargs["SPO_full_error"] = True 47 | 48 | self.SPO_weight_param = kwargs["SPO_weight_param"] 49 | self.SPO_full_error = kwargs["SPO_full_error"] 50 | self.tree = MTP_SPO2CART(a,b,**kwargs) 51 | 52 | ''' 53 | This function fits the tree on data (X,C,weights). 54 | 55 | X: The feature data used in tree splits. Can either be a pandas data frame or numpy array, with: 56 | (a) rows of X = observations 57 | (b) columns of X = features 58 | C: the cost vectors used in the leaf node models. Must be a numpy array, with: 59 | (a) rows of C = observations 60 | (b) columns of C = cost vector components 61 | weights: a numpy array of case weights. Is 1-dimensional, with weights[i] yielding weight of observation i 62 | feats_continuous: If False, all feature are treated as categorical. If True, all feature are treated as continuous. 
63 | feats_continuous can also be a boolean vector of dimension = num_features specifying how to treat each feature 64 | verbose: if verbose=True, prints out progress in tree fitting procedure 65 | ''' 66 | def fit(self, X, C, train_x_precision, 67 | weights=None, feats_continuous=True, verbose=False, refit_leaves=False, 68 | **kwargs): 69 | self.decision_kwargs = kwargs 70 | X = truncate_train_x(X, train_x_precision) 71 | num_obs = C.shape[0] 72 | 73 | A = np.array(range(num_obs)) 74 | if self.SPO_full_error == True and self.SPO_weight_param != 0.0: 75 | for i in range(num_obs): 76 | A[i] = find_opt_decision(C[i,:].reshape(1,-1),**kwargs)['objective'][0] 77 | 78 | if self.SPO_weight_param != 0.0 and self.SPO_weight_param != 1.0: 79 | if self.SPO_full_error == True: 80 | SPO_loss_bound = -float("inf") 81 | for i in range(num_obs): 82 | SPO_loss = -find_opt_decision(-C[i,:].reshape(1,-1),**kwargs)['objective'][0] - A[i] 83 | if SPO_loss >= SPO_loss_bound: 84 | SPO_loss_bound = SPO_loss 85 | 86 | else: 87 | c_max = np.max(C,axis=0) 88 | SPO_loss_bound = -find_opt_decision(-c_max.reshape(1,-1),**kwargs)['objective'][0] 89 | 90 | #Upper bound for MSE loss: maximum pairwise difference between any two elements 91 | dists = distance.cdist(C, C, 'sqeuclidean') 92 | MSE_loss_bound = np.max(dists) 93 | 94 | else: 95 | SPO_loss_bound = 1.0 96 | MSE_loss_bound = 1.0 97 | 98 | #kwargs["SPO_loss_bound"] = SPO_loss_bound 99 | #kwargs["MSE_loss_bound"] = MSE_loss_bound 100 | self.tree.fit(X,A,C, 101 | weights=weights, feats_continuous=feats_continuous, verbose=verbose, refit_leaves=refit_leaves, 102 | SPO_loss_bound = SPO_loss_bound, MSE_loss_bound = MSE_loss_bound, 103 | **kwargs) 104 | 105 | ''' 106 | Prints out the tree. 107 | Required: call tree fit() method first 108 | Prints pruned tree if prune() method has been called, else prints unpruned tree 109 | verbose=True prints additional statistics within each leaf 110 | ''' 111 | def traverse(self, verbose=False): 112 | self.tree.traverse(verbose=verbose) 113 | 114 | ''' 115 | Prunes the tree. Set verbose=True to track progress 116 | ''' 117 | def prune(self, Xval, Cval, 118 | weights_val=None, one_SE_rule=True,verbose=False,approx_pruning=False): 119 | num_obs = Cval.shape[0] 120 | 121 | Aval = np.array(range(num_obs)) 122 | if self.SPO_full_error == True and self.SPO_weight_param != 0.0: 123 | for i in range(num_obs): 124 | Aval[i] = find_opt_decision(Cval[i,:].reshape(1,-1),**self.decision_kwargs)['objective'][0] 125 | self.tree.prune(Xval,Aval,Cval, 126 | weights_val=weights_val,one_SE_rule=one_SE_rule,verbose=verbose,approx_pruning=approx_pruning) 127 | 128 | 129 | ''' 130 | Produces decision given data Xnew 131 | Required: call tree fit() method first 132 | Uses pruned tree if pruning method has been called, else uses unpruned tree 133 | Argument alpha controls level of pruning. If not specified, uses alpha trained from the prune() method 134 | 135 | As a step in finding the estimated decisions for data (Xnew), this function first finds 136 | the leaf node locations corresponding to each row of Xnew. It does so by a top-down search 137 | starting at the root node 0. 138 | If return_loc=True, est_decision will also return the leaf node locations for the data, in addition to the decision. 
139 | ''' 140 | def est_decision(self, Xnew, alpha=None, return_loc=False): 141 | return self.tree.predict(Xnew, np.array(range(0,Xnew.shape[0])), alpha=alpha, return_loc=return_loc) 142 | 143 | def est_cost(self, Xnew, alpha=None, return_loc=False): 144 | return self.tree.predict(Xnew, np.array(range(0,Xnew.shape[0])), alpha=alpha, return_loc=return_loc, get_cost=True) -------------------------------------------------------------------------------- /Algorithms/SPOForest.py: -------------------------------------------------------------------------------- 1 | """ 2 | SPO RANDOM FOREST IMPLEMENTATION 3 | 4 | This code will work for general predict-then-optimize applications. Fits SPO Forest to dataset of feature-cost pairs. 5 | 6 | The structure of the decision-making problem of interest should be encoded in a file called decision_problem_solver.py. 7 | Specifically, this code requires two functions: 8 | get_num_decisions(): returns number of decision variables (i.e., dimension of cost vector for underlying decision problem) 9 | find_opt_decision(): returns for a matrix of cost vectors the corresponding optimal decisions for those cost vectors 10 | """ 11 | import numpy as np 12 | from mtp import MTP 13 | from decision_problem_solver import* 14 | from scipy.spatial import distance 15 | from SPO_tree_greedy import SPOTree 16 | from joblib import Parallel, delayed 17 | from collections import Counter 18 | 19 | class SPOForest(object): 20 | ''' 21 | This function initializes the SPO forest 22 | 23 | FOREST PARAMETERS: 24 | 25 | n_estimators: number of SPO trees in the random forest 26 | 27 | max_features: number of features to consider when looking for the best split in each node 28 | 29 | run_in_parallel, num_workers: if run_in_parallel is set to True, enables parallel computing among num_workers threads. 30 | If num_workers is not specified, uses the number of cpu cores available. The task of computing each SPO tree in the forest 31 | is distributed among the available cores. (each tree may only use 1 core and thus this arg is set to None in SPOTree class) 32 | 33 | TREE PARAMETERS (DIRECTLY PASSED TO SPOTree CLASS): 34 | 35 | max_depth: maximum training depth of each tree in the forest (default = Inf: no depth limit) 36 | 37 | min_weight_per_node: the mininum number of observations (with respect to cumulative weight) per node for each tree in the forest 38 | 39 | quant_discret: continuous variable split points are chosen from quantiles of the variable corresponding to quant_discret,2*quant_discret,3*quant_discret, etc.. 40 | 41 | SPO_weight_param: splits are decided through loss = SPO_loss*SPO_weight_param + MSE_loss*(1-SPO_weight_param) 42 | SPO_weight_param = 1.0 -> SPO loss 43 | SPO_weight_param = 0.0 -> MSE loss (i.e., CART) 44 | 45 | SPO_full_error: if SPO error is used, are the full errors computed for split evaluation, 46 | i.e. are the alg's decision losses subtracted by the optimal decision losses? 
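A typical configuration and call pattern, mirroring the shortest-path experiments in this repository (the values and variable names below are illustrative placeholders):
        forest = SPOForest(n_estimators=100, max_features=3, run_in_parallel=True, num_workers=8,
                           max_depth=1000, min_weights_per_node=20, quant_discret=0.01, debias_splits=False,
                           SPO_weight_param=1.0, SPO_full_error=True)
        forest.fit(train_x, train_cost, feats_continuous=True, verbose_forest=True, seed=0)
        decisions = forest.est_decision(test_x, method="mean")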
47 | 48 | Keep all other parameter values as default 49 | ''' 50 | def __init__(self, n_estimators=10, run_in_parallel=False, num_workers=None, **kwargs): 51 | self.n_estimators = n_estimators 52 | if (run_in_parallel == False): 53 | num_workers = 1 54 | if num_workers is None: 55 | num_workers = -1 #this uses all available cpu cores 56 | self.run_in_parallel = run_in_parallel 57 | self.num_workers = num_workers 58 | 59 | self.forest = [None]*n_estimators 60 | for t in range(n_estimators): 61 | self.forest[t] = SPOTree(**kwargs) 62 | 63 | ''' 64 | This function fits the SPO forest on data (X,C,weights). 65 | 66 | X: The feature data used in tree splits. Can either be a pandas data frame or numpy array, with: 67 | (a) rows of X = observations 68 | (b) columns of X = features 69 | C: the cost vectors used in the leaf node models. Must be a numpy array, with: 70 | (a) rows of C = observations 71 | (b) columns of C = cost vector components 72 | weights: a numpy array of case weights. Is 1-dimensional, with weights[i] yielding weight of observation i 73 | feats_continuous: If False, all feature are treated as categorical. If True, all feature are treated as continuous. 74 | feats_continuous can also be a boolean vector of dimension = num_features specifying how to treat each feature 75 | verbose: if verbose=True, prints out progress in tree fitting procedure 76 | verbose_forest: if verbose_forest=True, prints out progress in the forest fitting procedure 77 | seed: seed for rng 78 | ''' 79 | def fit(self, X, C, weights=None, verbose_forest=False, seed=None, 80 | feats_continuous=False, verbose=False, refit_leaves=False, 81 | **kwargs): 82 | 83 | self.decision_kwargs = kwargs 84 | 85 | num_obs = C.shape[0] 86 | 87 | if weights is None: 88 | weights = np.ones([num_obs]) 89 | 90 | if seed is not None: 91 | np.random.seed(seed) 92 | tree_seeds = np.random.randint(0, high=2**32-1, size=self.n_estimators) 93 | 94 | 95 | if self.num_workers == 1: 96 | for t in range(self.n_estimators): 97 | if verbose_forest == True: 98 | print("Fitting tree " + str(t+1) + "out of " + str(self.n_estimators)) 99 | np.random.seed(tree_seeds[t]) 100 | bootstrap_inds = np.random.choice(range(num_obs), size=num_obs, replace=True) 101 | Xb = np.copy(X[bootstrap_inds]) 102 | Cb = np.copy(C[bootstrap_inds]) 103 | weights_b = np.copy(weights[bootstrap_inds]) 104 | self.forest[t].fit(Xb, Cb, weights=weights_b, seed=tree_seeds[t], 105 | feats_continuous=feats_continuous, verbose=verbose, refit_leaves=refit_leaves, 106 | **kwargs) 107 | 108 | else: 109 | self.forest = Parallel(n_jobs=self.num_workers, max_nbytes=1e5)(delayed(_fit_tree)(t, self.n_estimators, self.forest[t], X, C, weights, verbose_forest, tree_seeds[t], feats_continuous=feats_continuous, verbose=verbose, refit_leaves=refit_leaves, **kwargs) for t in range(self.n_estimators)) 110 | 111 | 112 | ''' 113 | Prints all trees in the forest 114 | Required: call forest fit() method first 115 | ''' 116 | def traverse(self): 117 | for t in range(self.n_estimators): 118 | print("Printing Tree " + str(t+1) + "out of " + str(self.n_estimators)) 119 | self.forest[t].traverse() 120 | print("\n\n\n") 121 | 122 | ''' 123 | Predicts decisions or costs given data Xnew 124 | Required: call tree fit() method first 125 | 126 | method: method for aggregating decisions from each of the individual trees in the forest. 
Two approaches: 127 | (1) "mean": averages predicted cost vectors from each tree, then finds decision with respect to average cost vector 128 | (2) "mode": each tree in the forest estimates an optimal decision; take the most-recommended decision 129 | 130 | NOTE: return_loc argument not supported: 131 | (If return_loc=True, est_decision will also return the leaf node locations for the data, in addition to the decision.) 132 | ''' 133 | def est_decision(self, Xnew, method="mean"): 134 | if method == "mean": 135 | forest_costs = self.est_cost(Xnew) 136 | forest_decisions = find_opt_decision(forest_costs,**self.decision_kwargs)['weights'] 137 | 138 | elif method == "mode": 139 | num_obs = Xnew.shape[0] 140 | tree_decisions = [None]*self.n_estimators 141 | for t in range(self.n_estimators): 142 | tree_decisions[t] = self.forest[t].est_decision(Xnew) 143 | tree_decisions = np.array(tree_decisions) 144 | forest_decisions = np.zeros((num_obs,tree_decisions.shape[2])) 145 | for i in range(num_obs): 146 | forest_decisions[i] = _get_mode_row(tree_decisions[:,i,:]) 147 | 148 | return forest_decisions 149 | 150 | def est_cost(self, Xnew): 151 | tree_costs = [None]*self.n_estimators 152 | for t in range(self.n_estimators): 153 | tree_costs[t] = self.forest[t].est_cost(Xnew) 154 | tree_costs = np.array(tree_costs) 155 | forest_costs = np.mean(tree_costs,axis=0) 156 | return forest_costs 157 | 158 | ''' 159 | Helper methods (ignore) 160 | ''' 161 | 162 | def _fit_tree(t, n_estimators, tree, X, C, weights, verbose_forest, tree_seed, **kwargs): 163 | if verbose_forest == True: 164 | print("Fitting tree " + str(t+1) + "out of " + str(n_estimators)) 165 | 166 | num_obs = C.shape[0] 167 | np.random.seed(tree_seed) 168 | bootstrap_inds = np.random.choice(range(num_obs), size=num_obs, replace=True) 169 | Xb = np.copy(X[bootstrap_inds]) 170 | Cb = np.copy(C[bootstrap_inds]) 171 | weights_b = np.copy(weights[bootstrap_inds]) 172 | tree.fit(Xb, Cb, weights=weights_b, seed=tree_seed, **kwargs) 173 | return(tree) 174 | 175 | def _get_mode_row(a): 176 | return(np.array(Counter(map(tuple, a)).most_common()[0][0])) -------------------------------------------------------------------------------- /Algorithms/SPO_tree_greedy.py: -------------------------------------------------------------------------------- 1 | """ 2 | SPO GREEDY TREE IMPLEMENTATION 3 | 4 | This code will work for general predict-then-optimize applications. Fits SPO (greedy) tree to dataset of feature-cost pairs. 5 | 6 | The structure of the decision-making problem of interest should be encoded in a file called decision_problem_solver.py. 
7 | Specifically, this code requires two functions: 8 | get_num_decisions(): returns number of decision variables (i.e., dimension of cost vector for underlying decision problem) 9 | find_opt_decision(): returns for a matrix of cost vectors the corresponding optimal decisions for those cost vectors 10 | 11 | """ 12 | import numpy as np 13 | from mtp import MTP 14 | from decision_problem_solver import* 15 | from scipy.spatial import distance 16 | 17 | class SPOTree(object): 18 | ''' 19 | This function initializes the SPO tree 20 | 21 | Parameters: 22 | 23 | max_depth: maximum training depth of each tree in the forest (default = Inf: no depth limit) 24 | 25 | min_weight_per_node: the mininum number of observations (with respect to cumulative weight) per node for each tree in the forest 26 | 27 | quant_discret: continuous variable split points are chosen from quantiles of the variable corresponding to quant_discret,2*quant_discret,3*quant_discret, etc.. 28 | 29 | SPO_weight_param: splits are decided through loss = SPO_loss*SPO_weight_param + MSE_loss*(1-SPO_weight_param) 30 | SPO_weight_param = 1.0 -> SPO loss 31 | SPO_weight_param = 0.0 -> MSE loss (i.e., CART) 32 | 33 | SPO_full_error: if SPO error is used, are the full errors computed for split evaluation, 34 | i.e. are the alg's decision losses subtracted by the optimal decision losses? 35 | 36 | run_in_parallel: if set to True, enables parallel computing among num_workers threads. If num_workers is not 37 | specified, uses the number of cpu cores available. 38 | 39 | max_features: number of features to consider when looking for the best split in each node. Useful when building random forests. Default equal to total num features 40 | 41 | Keep all other parameter values as default 42 | ''' 43 | def __init__(self, **kwargs): 44 | self.SPO_weight_param = kwargs["SPO_weight_param"] 45 | self.SPO_full_error = kwargs["SPO_full_error"] 46 | self.tree = MTP(**kwargs) 47 | 48 | ''' 49 | This function fits the tree on data (X,C,weights). 50 | 51 | X: The feature data used in tree splits. Can either be a pandas data frame or numpy array, with: 52 | (a) rows of X = observations 53 | (b) columns of X = features 54 | C: the cost vectors used in the leaf node models. Must be a numpy array, with: 55 | (a) rows of C = observations 56 | (b) columns of C = cost vector components 57 | weights: a numpy array of case weights. Is 1-dimensional, with weights[i] yielding weight of observation i 58 | feats_continuous: If False, all feature are treated as categorical. If True, all feature are treated as continuous. 
59 | feats_continuous can also be a boolean vector of dimension = num_features specifying how to treat each feature 60 | verbose: if verbose=True, prints out progress in tree fitting procedure 61 | 62 | Keep all other parameter values as default 63 | ''' 64 | def fit(self, X, C, 65 | weights=None, feats_continuous=False, verbose=False, refit_leaves=False, seed=None, 66 | **kwargs): 67 | self.pruned = False 68 | self.decision_kwargs = kwargs 69 | num_obs = C.shape[0] 70 | 71 | A = np.array(range(num_obs)) 72 | if self.SPO_full_error == True and self.SPO_weight_param != 0.0: 73 | for i in range(num_obs): 74 | A[i] = find_opt_decision(C[i,:].reshape(1,-1),**kwargs)['objective'][0] 75 | 76 | if self.SPO_weight_param != 0.0 and self.SPO_weight_param != 1.0: 77 | if self.SPO_full_error == True: 78 | SPO_loss_bound = -float("inf") 79 | for i in range(num_obs): 80 | SPO_loss = -find_opt_decision(-C[i,:].reshape(1,-1),**kwargs)['objective'][0] - A[i] 81 | if SPO_loss >= SPO_loss_bound: 82 | SPO_loss_bound = SPO_loss 83 | 84 | else: 85 | c_max = np.max(C,axis=0) 86 | SPO_loss_bound = -find_opt_decision(-c_max.reshape(1,-1),**kwargs)['objective'][0] 87 | 88 | #Upper bound for MSE loss: maximum pairwise difference between any two elements 89 | dists = distance.cdist(C, C, 'sqeuclidean') 90 | MSE_loss_bound = np.max(dists) 91 | 92 | else: 93 | SPO_loss_bound = 1.0 94 | MSE_loss_bound = 1.0 95 | 96 | #kwargs["SPO_loss_bound"] = SPO_loss_bound 97 | #kwargs["MSE_loss_bound"] = MSE_loss_bound 98 | self.tree.fit(X,A,C, 99 | weights=weights, feats_continuous=feats_continuous, verbose=verbose, refit_leaves=refit_leaves, seed=seed, 100 | SPO_loss_bound = SPO_loss_bound, MSE_loss_bound = MSE_loss_bound, 101 | **kwargs) 102 | 103 | ''' 104 | Prints out the tree. 105 | Required: call tree fit() method first 106 | Prints pruned tree if prune() method has been called, else prints unpruned tree 107 | verbose=True prints additional statistics within each leaf 108 | ''' 109 | def traverse(self, verbose=False): 110 | self.tree.traverse(verbose=verbose) 111 | 112 | ''' 113 | Prunes the tree. Set verbose=True to track progress 114 | ''' 115 | def prune(self, Xval, Cval, 116 | weights_val=None, one_SE_rule=True,verbose=False,approx_pruning=False): 117 | num_obs = Cval.shape[0] 118 | 119 | Aval = np.array(range(num_obs)) 120 | if self.SPO_full_error == True and self.SPO_weight_param != 0.0: 121 | for i in range(num_obs): 122 | Aval[i] = find_opt_decision(Cval[i,:].reshape(1,-1),**self.decision_kwargs)['objective'][0] 123 | 124 | self.tree.prune(Xval,Aval,Cval, 125 | weights_val=weights_val,one_SE_rule=one_SE_rule,verbose=verbose,approx_pruning=approx_pruning) 126 | self.pruned = True 127 | 128 | 129 | ''' 130 | Produces decision or cost given data Xnew 131 | Required: call tree fit() method first 132 | Uses pruned tree if pruning method has been called, else uses unpruned tree 133 | Argument alpha controls level of pruning. If not specified, uses alpha trained from the prune() method 134 | 135 | As a step in finding the estimated decisions for data (Xnew), this function first finds 136 | the leaf node locations corresponding to each row of Xnew. It does so by a top-down search 137 | starting at the root node 0. 138 | If return_loc=True, est_decision will also return the leaf node locations for the data, in addition to the decision. 
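A brief usage sketch (illustrative; assumes fit(), and optionally prune(), have already been called on my_tree, and that Xnew and test_cost are numpy arrays with one row per observation):
        pred_decision = my_tree.est_decision(Xnew)    # one decision vector per row of Xnew
        pred_cost = my_tree.est_cost(Xnew)            # one predicted cost vector per row of Xnew
        incurred_cost = np.sum(test_cost * pred_decision, axis=1)   # realized cost of each decision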
139 | ''' 140 | def est_decision(self, Xnew, alpha=None, return_loc=False): 141 | return self.tree.predict(Xnew, np.array(range(0,Xnew.shape[0])), alpha=alpha, return_loc=return_loc) 142 | 143 | def est_cost(self, Xnew, alpha=None, return_loc=False): 144 | return self.tree.predict(Xnew, np.array(range(0,Xnew.shape[0])), alpha=alpha, return_loc=return_loc, get_cost=True) 145 | 146 | ''' 147 | Other methods (ignore) 148 | ''' 149 | def get_tree_encoding(self, x_train=None): 150 | return self.tree.get_tree_encoding(x_train=x_train) 151 | 152 | def get_pruning_alpha(self): 153 | if self.pruned == True: 154 | return self.tree.alpha_best 155 | else: 156 | return(0) -------------------------------------------------------------------------------- /Algorithms/leaf_model.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Helper class for mtp.py 3 | 4 | Defines the leaf nodes of the tree, specifically 5 | - the computation of the predicted cost vectors and decisions within the given leaf of the tree 6 | - the SPO/MSE loss from using the predicted decision within the leaf 7 | ''' 8 | 9 | import numpy as np 10 | from decision_problem_solver import* 11 | #from scipy.spatial import distance 12 | 13 | ''' 14 | mtp.py depends on the classes and functions below. 15 | These classes/methods are used to define the model object in each leaf node, 16 | as well as helper functions for certain operations in the tree fitting procedure. 17 | 18 | Summary of methods and functions to specify: 19 | Methods as a part of class LeafModel: fit(), predict(), to_string(), error(), error_pruning() 20 | Other helper functions: get_sub(), are_Ys_diverse() 21 | 22 | ''' 23 | 24 | ''' 25 | LeafModel: the model used in each leaf. 26 | Has five methods: fit, predict, to_string, error, error_pruning 27 | 28 | SPO_weight_param: number between 0 and 1: 29 | Error metric: SPO_loss*SPO_weight_param + MSE_loss*(1-SPO_weight_param) 30 | ''' 31 | class LeafModel(object): 32 | 33 | #Any additional args passed to mtp's init() function are directly passed here 34 | def __init__(self,*args,**kwargs): 35 | self.SPO_weight_param = kwargs["SPO_weight_param"] 36 | self.SPO_full_error = kwargs["SPO_full_error"] 37 | return 38 | 39 | ''' 40 | This function trains the leaf node model on the data (A,Y,weights). 41 | 42 | A and Y can take any form (lists, matrices, vectors, etc.). For our applications, I recommend making Y 43 | the response data (e.g., choices) and A alternative-specific data (e.g., features, choice sets) 44 | 45 | weights: a numpy array of case weights. Is 1-dimensional, with weights[i] yielding 46 | weight of observation/customer i. If you know you will not be using case weights 47 | in your particular application, you can ignore this input entirely. 48 | 49 | Returns 0 or 1. 50 | 0: No errors occurred when fitting leaf node model 51 | 1: An error occurred when fitting the leaf node model (probably due to insufficient data) 52 | If fit returns 1, then the tree will not consider the split that led to this leaf node model 53 | 54 | fit_init is a LeafModel object which represents a previously-trained leaf node model. 55 | If specified, fit_init is used for initialization when training this current LeafModel object. 56 | Useful for faster computation when fit_init's coefficients are close to the optimal solution of the new data. 
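In this particular leaf model, fit() computes the (case-weight-averaged) mean cost vector over the observations mapped to the leaf, c_bar = sum_i w_i*c_i / sum_i w_i, stores it as self.mean_cost, and stores as self.decision the decision returned by find_opt_decision(c_bar).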
57 | 58 | For those interested in defining their own leaf node functions: 59 | (1) It is not required to use the fit_init argument in your code 60 | (2) All edge cases must be handled in code below (ex: arguments 61 | consist of a single entry, weights are all zero, Y has one unique choice, etc.). 62 | In these cases, either hard-code a model that works with these edge-cases (e.g., 63 | if all Ys = 1, predict 1 with probability one), or have the fit function return 1 (error) 64 | (3) Store the fitted model as an attribute to the self object. You can name the attribute 65 | anything you want (i.e., it does not have to be self.model_obj and self.model_coef below), 66 | as long as its consistent with your predict_prob() and to_string() methods 67 | 68 | Any additional args passed to mtp's fit() function are directly passed here 69 | ''' 70 | def fit(self, A, Y, weights, fit_init=None, refit=False, SPO_loss_bound=None, MSE_loss_bound=None, **kwargs): 71 | #no need to refit this model since it is already fit to optimality 72 | #note: change this behavior if debias=TRUE 73 | if refit == True: 74 | return(0) 75 | 76 | self.SPO_loss_bound = SPO_loss_bound 77 | self.MSE_loss_bound = MSE_loss_bound 78 | 79 | def fast_row_avg(X,weights): 80 | return (np.matmul(weights,X)/sum(weights)).reshape(-1) 81 | 82 | #if no observations are mapped to this leaf, then assign any feasible cost vector here 83 | if sum(weights) == 0: 84 | self.mean_cost = np.ones(get_num_decisions(**kwargs)) 85 | else: 86 | self.mean_cost = fast_row_avg(Y,weights) 87 | self.decision = find_opt_decision(self.mean_cost.reshape(1,-1),**kwargs)['weights'].reshape(-1) 88 | 89 | return(0) 90 | 91 | ''' 92 | This function applies model from fit() to predict choice data given new data A. 93 | Returns a list/numpy array of choices (one list entry per observation, i.e. l[i] yields prediction for ith obs.). 94 | Note: make sure to call fit() first before this method. 95 | 96 | Any additional args passed to mtp's predict() function are directly passed here 97 | ''' 98 | def predict(self, A, get_cost=False, *args,**kwargs): 99 | if get_cost==True: 100 | #Returns predicted cost corresponding to this leaf node 101 | return np.array([self.mean_cost]*len(A)) 102 | else: 103 | #Returns predicted decision corresponding to this leaf node 104 | return np.array([self.decision]*len(A)) 105 | ''' 106 | This function outputs the errors for each observation in pair (A,Y). 107 | Used in training when comparing different tree splits. 108 | Ex: mean-squared-error between observed data Y and predict(A) 109 | 110 | How to pass additional arguments to this function: simply pass these arguments to the init()/fit() functions and store them 111 | in the self object. 
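Concretely, for an observation with true cost vector c: the SPO loss is the dot product of c with the leaf's stored decision (with the precomputed optimal objective A subtracted off when SPO_full_error=True), and the MSE loss is the squared Euclidean distance between c and the leaf's mean cost vector. When 0 < SPO_weight_param < 1, the two losses are normalized by SPO_loss_bound and MSE_loss_bound and combined as a weighted sum.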
112 | ''' 113 | def error(self,A,Y): 114 | def MSEloss(C,Cpred): 115 | #return distance.cdist(C, Cpred, 'sqeuclidean').reshape(-1) 116 | MSE = (C**2).sum(axis=1)[:, None] - 2 * C.dot(Cpred.transpose()) + ((Cpred**2).sum(axis=1)[None, :]) 117 | return MSE.reshape(-1) 118 | 119 | def SPOloss(C,decision): 120 | return np.matmul(C,decision).reshape(-1) 121 | 122 | if self.SPO_weight_param == 1.0: 123 | if self.SPO_full_error == True: 124 | SPO_loss = SPOloss(Y,self.decision) - A 125 | else: 126 | SPO_loss = SPOloss(Y,self.decision) 127 | return SPO_loss 128 | elif self.SPO_weight_param == 0.0: 129 | MSE_loss = MSEloss(Y, self.mean_cost.reshape(1,-1)) 130 | return MSE_loss 131 | else: 132 | if self.SPO_full_error == True: 133 | SPO_loss = SPOloss(Y,self.decision) - A 134 | else: 135 | SPO_loss = SPOloss(Y,self.decision) 136 | MSE_loss = MSEloss(Y, self.mean_cost.reshape(1,-1)) 137 | return self.SPO_weight_param*SPO_loss/self.SPO_loss_bound+(1.0-self.SPO_weight_param)*MSE_loss/self.MSE_loss_bound 138 | 139 | ''' 140 | This function outputs the errors for each observation in pair (A,Y). 141 | Used in pruning to determine the best tree subset. 142 | Ex: mean-squared-error between observed data Y and predict(A) 143 | 144 | How to pass additional arguments to this function: simply pass these arguments to the init()/fit() functions and store them 145 | in the self object. 146 | ''' 147 | def error_pruning(self,A,Y): 148 | return self.error(A,Y) 149 | 150 | ''' 151 | This function returns the string representation of the fitted model 152 | Used in traverse() method, which traverses the tree and prints out all terminal node models 153 | 154 | Any additional args passed to mtp's traverse() function are directly passed here 155 | ''' 156 | def to_string(self,*leafargs,**leafkwargs): 157 | return "Mean cost vector: \n" + str(self.mean_cost) +"\n"+"decision: \n"+str(self.decision) 158 | 159 | 160 | ''' 161 | Given attribute data A, choice data Y, and observation indices data_inds, 162 | extract those observations of A and Y corresponding to data_inds 163 | 164 | If only attribute data A is given, returns A. 165 | If only choice data Y is given, returns Y. 166 | 167 | Used to partition the data in the tree-fitting procedure 168 | ''' 169 | def get_sub(data_inds,A=None,Y=None,is_boolvec=False): 170 | if A is None: 171 | return Y[data_inds] 172 | if Y is None: 173 | return A[data_inds] 174 | else: 175 | return A[data_inds],Y[data_inds] 176 | 177 | ''' 178 | This function takes as input choice data Y and outputs a boolean corresponding 179 | to whether all of the choices in Y are the same. 180 | 181 | It is used as a test for whether we should make a node a leaf. If are_Ys_diverse(Y)=False, 182 | then the node will become a leaf. Otherwise, if the node passes the other tests (doesn't exceed 183 | max depth, etc), we will consider splitting on the node. 
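For example, are_Ys_diverse(np.array([[1,2],[1,2]])) returns False (all cost vectors identical), while are_Ys_diverse(np.array([[1,2],[1,3]])) returns True.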
184 | ''' 185 | def are_Ys_diverse(Y): 186 | #return False iff all cost vectors (rows of Y) are the same 187 | tmp = [len(np.unique(Y[:,j])) for j in range(Y.shape[1])] 188 | return (np.max(tmp) > 1) 189 | 190 | -------------------------------------------------------------------------------- /Applications/Illustrative Example/decision_problem_solver.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Generic file to set up the decision problem (i.e., optimization problem) under consideration 3 | Must have functions: 4 | get_num_decisions(): returns number of decision variables (i.e., dimension of cost vector) 5 | find_opt_decision(): returns for a matrix of cost vectors the corresponding optimal decisions for those cost vectors 6 | 7 | This particular file sets up a two-road shortest path decision problem 8 | ''' 9 | 10 | from gurobipy import * 11 | import numpy as np 12 | 13 | dim = 2 #(creates dim * dim grid, where dim = number of vertices) 14 | Edge_list = [(i,i+1) for i in range(1, dim**2 + 1) if i % dim != 0] 15 | Edge_list += [(i, i + dim) for i in range(1, dim**2 + 1) if i <= dim**2 - dim] 16 | Edge_dict = {} #(assigns each edge to a unique integer from 0 to number-of-edges) 17 | for index, edge in enumerate(Edge_list): 18 | Edge_dict[edge] = index 19 | D = len(Edge_list) # D = number of decisions 20 | 21 | def get_num_decisions(): 22 | return D 23 | 24 | Edges = tuplelist(Edge_list) 25 | # Find the optimal total cost for an observation in the context of shortes path 26 | m_shortest_path = Model('shortest_path') 27 | m_shortest_path.Params.OutputFlag = 0 28 | flow = m_shortest_path.addVars(Edges, ub = 1, name = 'flow') 29 | m_shortest_path.addConstrs((quicksum(flow[i,j] for i,j in Edges.select(i,'*')) - quicksum(flow[k, i] for k,i in Edges.select('*', 30 | i)) == 0 for i in range(2, dim**2)), name = 'inner_nodes') 31 | m_shortest_path.addConstr((quicksum(flow[i,j] for i,j in Edges.select(1, '*')) == 1), name = 'start_node') 32 | m_shortest_path.addConstr((quicksum(flow[i,j] for i,j in Edges.select('*', dim**2)) == 1), name = 'end_node') 33 | 34 | def shortest_path(cost): 35 | # m_shortest_path.setObjective(quicksum(flow[i,j] * cost[Edge_dict[(i,j)]] for i,j in Edges), GRB.MINIMIZE) 36 | m_shortest_path.setObjective(LinExpr( [ (cost[Edge_dict[(i,j)]],flow[i,j] ) for i,j in Edges]), GRB.MINIMIZE) 37 | m_shortest_path.optimize() 38 | return {'weights': m_shortest_path.getAttr('x', flow), 'objective': m_shortest_path.objVal} 39 | 40 | def find_opt_decision(cost): 41 | weights = np.zeros(cost.shape) 42 | objective = np.zeros(cost.shape[0]) 43 | for i in range(cost.shape[0]): 44 | temp = shortest_path(cost[i,:]) 45 | for edge in Edges: 46 | weights[i, Edge_dict[edge]] = temp['weights'][edge] 47 | objective[i] = temp['objective'] 48 | return {'weights': weights, 'objective':objective} 49 | -------------------------------------------------------------------------------- /Applications/Illustrative Example/illustrative_example.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python2 2 | # -*- coding: utf-8 -*- 3 | """ 4 | ILLUSTRATIVE EXAMPLE 5 | 6 | Generates a two-road instance of the shortest paths problem. 7 | Runs SPO Tree (greedy) and CART on this dataset and compares their performance. 8 | Produces plots visualizing predicted costs and normalized SPO loss incurred by SPOT and CART. 9 | To run code below, put SPO_tree_greedy.py into the same folder as this script. 
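Run as (Python 2.7): python illustrative_example.py
The resulting figures are written as .png files to the working directory.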
10 | """ 11 | 12 | import numpy as np 13 | from SPO_tree_greedy import SPOTree 14 | from decision_problem_solver import* 15 | import matplotlib as mpl 16 | #mpl.use('Agg') 17 | import matplotlib.pyplot as plt 18 | #plt.ioff() 19 | 20 | plt.rcParams.update({'font.size': 12}) 21 | figsize = (5.2, 4.3) 22 | np.random.seed(0) 23 | dpi=450 24 | 25 | #SIMULATED DATASET FUNCTIONS 26 | def get_costs(X): 27 | X = X.reshape(-1) 28 | mat = np.zeros((len(X),4)) 29 | for i in range(len(X)): 30 | mat[i,0] = (X[i] + 0.8)*5-2.1 31 | mat[i,1] = (5*X[i]+0.4)**2 32 | return(mat) 33 | 34 | def gen_dataset(n): 35 | x = np.random.rand(n,1) #generate random features in [0,1] 36 | costs = get_costs(x) 37 | return(x,costs) 38 | 39 | def get_step_func_rep(x, costs): 40 | change_inds = np.where(costs[1:]-costs[:-1] > 0)[0] 41 | x_change_points = (x[change_inds.tolist()]+x[(change_inds+1).tolist()])/2.0 42 | x_min = np.append(np.array(x[0]),x_change_points) 43 | x_max = np.append(x_change_points,np.array(x[-1])) 44 | change_inds = change_inds.tolist() 45 | change_inds.append(len(x)-1) 46 | y = costs[change_inds] 47 | return(y, x_min, x_max) 48 | 49 | def get_decision_boundary(x, costs): 50 | tmp = costs[:,1] > costs[:,0] 51 | if not any(tmp) == True: 52 | return None 53 | return(min(x[tmp])) 54 | 55 | def plot_costs(plot_x, true_costs, est_costs, color_est_costs, est_name, fname): 56 | true_costs_0 = true_costs[:,0] 57 | true_costs_1 = true_costs[:,1] 58 | est_costs_0 = est_costs[:,0] 59 | est_costs_1 = est_costs[:,1] 60 | fig,ax = plt.subplots(1, figsize=figsize) 61 | line_true_costs_0, = ax.plot(plot_x, true_costs_0, linewidth=2.0, color='grey', linestyle='-', label='Edge 1 Cost (True)') 62 | line_true_costs_1, = ax.plot(plot_x, true_costs_1, linewidth=2.0, color='grey', linestyle='--', label='Edge 2 Cost (True)') 63 | 64 | costs,xmin,xmax = get_step_func_rep(plot_x, est_costs_0) 65 | line_est_costs_0 = ax.hlines(costs, xmin, xmax, linewidth=2.0, color=color_est_costs, linestyle='-', label='Edge 1 Cost ('+est_name+')') 66 | costs,xmin,xmax = get_step_func_rep(plot_x, est_costs_1) 67 | line_est_costs_1 = ax.hlines(costs, xmin, xmax, linewidth=2.0, color=color_est_costs, linestyle='--', label='Edge 2 Cost ('+est_name+')') 68 | 69 | plt.xlabel('x') 70 | plt.ylabel('Edge Cost') 71 | #plt.ylim(top=21) 72 | 73 | bdry = get_decision_boundary(plot_x, est_costs) 74 | line_est_bdry = ax.axvline(x=bdry, linewidth=1.5, color=color_est_costs, linestyle=':', label='Decision Boundary ('+est_name+')') 75 | #_,xmin,_ = get_step_func_rep(plot_x, est_costs_0) 76 | #xbds = xmin[1:] 77 | #for xbd in xbds: 78 | # plt.axvline(x=xbd, linewidth=1.0, color='grey', linestyle=':') 79 | 80 | bdry = get_decision_boundary(plot_x, true_costs) 81 | line_true_bdry = ax.axvline(x=bdry, linewidth=1.5, color='grey', linestyle=':', label='Decision Boundary (True)') 82 | 83 | plt.legend(handles=[line_true_costs_0, line_true_costs_1, line_true_bdry, 84 | line_est_costs_0, line_est_costs_1, line_est_bdry], loc='upper left') 85 | #plt.show() 86 | plt.savefig(fname, format='png', dpi=dpi, bbox_inches='tight', pad_inches=0) 87 | plt.clf() 88 | 89 | 90 | plot_x = np.linspace(0,1,num=1000).reshape(1000,1) 91 | true_costs = get_costs(plot_x) 92 | 93 | #SIMULATED DATA PARAMETERS 94 | n_train = 10000; 95 | n_valid = 2000; 96 | n_test = 5000; 97 | 98 | #GENERATE TRAINING DATA 99 | train_x, train_cost = gen_dataset(n_train) 100 | #GENERATE VALIDATION SET DATA 101 | valid_x, valid_cost = gen_dataset(n_valid) 102 | #GENERATE TESTING DATA 103 | test_x, test_cost = 
gen_dataset(n_test) 104 | 105 | ################################################################### 106 | #FIT SPO Tree ALGORITHM 107 | #SPO_weight_param: number between 0 and 1: 108 | #Error metric: SPO_loss*SPO_weight_param + MSE_loss*(1-SPO_weight_param) 109 | my_tree = SPOTree(max_depth = 1, min_weights_per_node = 20, quant_discret = 0.01, debias_splits=False, SPO_weight_param=1.0, SPO_full_error=True) 110 | my_tree.fit(train_x,train_cost,verbose=False,feats_continuous=True); #verbose specifies whether fitting procedure should print progress 111 | #my_tree.traverse() #prints out the unpruned tree 112 | 113 | #PRUNE DECISION TREE USING VALIDATION SET 114 | my_tree.prune(valid_x, valid_cost, verbose=False, one_SE_rule=False) #verbose specifies whether pruning procedure should print progress 115 | #my_tree.traverse() #prints out the pruned tree 116 | 117 | #FIND TEST SET SPO LOSS 118 | opt_decision = find_opt_decision(test_cost)['weights'] 119 | pred_decision = my_tree.est_decision(test_x) 120 | 121 | incurred_test_cost = np.sum([np.sum(test_cost[i] * pred_decision[i]) for i in range(0,len(pred_decision))]) 122 | opt_test_cost = np.sum(np.sum(test_cost * opt_decision,axis=1)) 123 | 124 | #percent error: 125 | print("SPO Tree: Test Set Normalized SPO Loss: ") 126 | SPO_error = 100.0*(incurred_test_cost-opt_test_cost)/opt_test_cost 127 | print(str(SPO_error)+" percent error") 128 | est_costs = my_tree.est_cost(plot_x) 129 | plot_costs(plot_x, true_costs, est_costs, 'blue', 'SPOT', 'casestudySPOTdpi'+str(dpi)+'.png') 130 | 131 | ################################################################### 132 | #FIT MSE Tree ALGORITHM 133 | MSE_tree_depths = [1,2,3,4,5] 134 | MSE_tree_depths_errors = np.zeros(len(MSE_tree_depths)) 135 | for max_depth_ind,max_depth in enumerate(MSE_tree_depths): 136 | #SPO_weight_param: number between 0 and 1: 137 | #Error metric: SPO_loss*SPO_weight_param + MSE_loss*(1-SPO_weight_param) 138 | my_tree = SPOTree(max_depth = max_depth, min_weights_per_node = 20, quant_discret = 0.01, debias_splits=False, SPO_weight_param=0.0, SPO_full_error=True) 139 | my_tree.fit(train_x,train_cost,verbose=False,feats_continuous=True); #verbose specifies whether fitting procedure should print progress 140 | #my_tree.traverse() #prints out the unpruned tree 141 | 142 | #PRUNE DECISION TREE USING VALIDATION SET 143 | my_tree.prune(valid_x, valid_cost, verbose=False, one_SE_rule=False) #verbose specifies whether pruning procedure should print progress 144 | my_tree.traverse() #prints out the pruned tree 145 | 146 | #FIND TEST SET SPO LOSS 147 | opt_decision = find_opt_decision(test_cost)['weights'] 148 | pred_decision = my_tree.est_decision(test_x) 149 | 150 | incurred_test_cost = np.sum([np.sum(test_cost[i] * pred_decision[i]) for i in range(0,len(pred_decision))]) 151 | opt_test_cost = np.sum(np.sum(test_cost * opt_decision,axis=1)) 152 | 153 | #percent error: 154 | print("MSE Tree Depth "+str(max_depth)+": Test Set Normalized SPO Loss: ") 155 | MSE_tree_depths_errors[max_depth_ind] = 100.0*(incurred_test_cost-opt_test_cost)/opt_test_cost 156 | print(str(MSE_tree_depths_errors[max_depth_ind])+" percent error") 157 | est_costs = my_tree.est_cost(plot_x) 158 | plot_costs(plot_x, true_costs, est_costs, 'orange', 'CART', 'casestudyCART'+str(max_depth)+'dpi'+str(dpi)+'.png') 159 | 160 | SPO_tree_depths_errors = [SPO_error]*len(MSE_tree_depths) 161 | fig,ax = plt.subplots(1, figsize=figsize) 162 | ax.plot(MSE_tree_depths, MSE_tree_depths_errors, linewidth=2.0, color='orange', label='CART') 163 
| ax.plot(MSE_tree_depths, SPO_tree_depths_errors, linewidth=2.0, color='blue', label='SPOT') 164 | plt.xlabel("Training Depth") 165 | plt.ylabel("Norm. Extra Travel Time (%)") 166 | plt.legend(loc='upper right') 167 | plt.xticks(MSE_tree_depths) 168 | #plt.show() 169 | plt.savefig('casestudyCARTerrorsdpi'+str(dpi)+'.png', format='png', dpi=dpi, bbox_inches='tight', pad_inches=0) 170 | plt.clf() 171 | -------------------------------------------------------------------------------- /Applications/README.md: -------------------------------------------------------------------------------- 1 | # Applications 2 | 3 | This folder contains all code for reproducing the three numerical experiments (applications) covered in the paper: 4 | * Illustrative Example: A two-road shortest path decision problem. 5 | * Shortest Path: A shortest path decision problem over a 4 x 4 grid network, where driver starts in 6 | southwest corner and tries to find shortest path to northeast corner. 7 | * Yahoo News: A news article recommendation decision problem constructed from the Yahoo! Front Page Today Module dataset. 8 | 9 | Datasets used in the numerical experiments may be found here: 10 | * Shortest Path: https://archive.org/details/spotree_shortestpathdata 11 | * Yahoo News: The Yahoo! Front Page Today dataset may be found at https://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did=49. A license must be obtained from Yahoo to access the dataset. The data may be used only for academic research purposes and may not be used for any commercial purposes or by any commercial entity. Preprocessing scripts are included to format the raw dataset to match the one used in our numerical experiments. 12 | 13 | To reproduce the numerical experiments, merge into a single folder all codes in the Algorithms folder with the codes + data files (unzipped) corresponding to the application of interest. Then, run the relevant application Python script. 14 | 15 | The headers of the application scripts contains all experimental parameter settings used in the paper. 16 | 17 | Code currently only supports Python 2.7 (not Python 3). 18 | Package Dependencies: gurobipy (with valid Gurobi license), numpy, pandas, scipy, joblib 19 | * The Illustrative Example application also depends on matplotlib 20 | -------------------------------------------------------------------------------- /Applications/Shortest Path/SPOForest_nonlinear.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Runs random forest algorithm on shortest path dataset with nonlinear mapping from features to costs ("nonlinear") 3 | This code considers two methods of aggregating individual tree predictions to obtain a forest decision: 4 | - "mean": averages cost predictions for each tree in the forest; outputs decision associated with average cost 5 | - "mode": outputs the mode decision recommended by the trees in the forest 6 | Outputs decision costs for each test-set instance as pickle file 7 | Takes multiple input arguments: 8 | (1) n_train: number of training observations. can take values 200, 10000 9 | (2) eps: parameter (\bar{\epsilon}) in the paper controlling noise in mapping from features to costs. 10 | n_train = 200: can take values 0, 0.25 11 | n_train = 10000: can take values 0, 0.5 12 | (3) deg_set_str: set of deg parameters to try, e.g. "2-10". 13 | deg = parameter "degree" in the paper controlling nonlinearity in mapping from features to costs. 
14 | can try values in {2,10} 15 | (4) reps_st, reps_end: we provide 10 total datasets corresponding to different generated B values (matrix mapping features to costs). 16 | script will run code on problem instances reps_st to reps_end 17 | (5) max_depth_set_str: sequence of training depths tuned using cross validation, e.g. "2-4-5" 18 | (6) min_samples_leaf_set_str: sequence of "min. (weighted) observations per leaf" tuned using cross validation, e.g. "20-50-100" 19 | (7) n_estimators_set_str: sequence of number of trees in forest tuned using cross validation, e.g. "20-50-100" 20 | (8) max_features_set_str: sequence of number of features used in feature bagging tuned using cross validation, e.g. "2-3-4" 21 | (9) aggr_method: method for aggregating individual tree cost predictions to arrive at forest decisions. Either "mean" or "mode" 22 | (10) algtype: set equal to "MSE" (CART forest) or "SPO" (SPOT forest) 23 | (11) number of workers to use in parallel processing (i.e., fitting individual trees in the forest in parallel) 24 | Values of input arguments used in paper: 25 | (1) n_train: consider values 200, 10000 26 | (2) eps: 27 | n_train = 200: considered values 0, 0.25 28 | n_train = 10000: considered values 0, 0.5 29 | (3) deg_set_str: "2-10" 30 | (4) reps_st, reps_end: reps_st = 0, reps_end = 10 31 | (5) max_depth_set_str: "1000" 32 | (6) min_samples_leaf_set_str: "20" 33 | (7) n_estimators_set_str: "100" 34 | (8) max_features_set_str: "2-3-4-5" 35 | (9) aggr_method: "mean" 36 | (10) algtype: "MSE" (CART forest) or "SPO" (SPOT forest) 37 | (11) number of workers to use in parallel processing: 8 38 | ''' 39 | 40 | import time 41 | 42 | import numpy as np 43 | import pickle 44 | from gurobipy import* 45 | from SPOForest import SPOForest 46 | from decision_problem_solver import* 47 | ############################################## 48 | forest_seed = 0 #seed to set random forest rng 49 | ############################################## 50 | import sys 51 | #problem parameters 52 | n_train = int(sys.argv[1])#200 53 | eps = float(sys.argv[2])#0 54 | deg_set_str = sys.argv[3] 55 | deg_set=[int(k) for k in deg_set_str.split('-')]#[2,4,6,8,10] 56 | #evaluate algs of dataset replications from rep_st to rep_end 57 | reps_st = int(sys.argv[4])#0 #can be as low as 0 58 | reps_end = int(sys.argv[5])#1 #can be as high as 50 59 | valid_frac = 0.2 60 | ######################################## 61 | #training parameters 62 | max_depth_set_str = sys.argv[6] 63 | max_depth_set=[int(k) for k in max_depth_set_str.split('-')]#[None] 64 | min_samples_leaf_set_str = sys.argv[7] 65 | min_samples_leaf_set=[int(k) for k in min_samples_leaf_set_str.split('-')]#[5] 66 | n_estimators_set_str = sys.argv[8] 67 | n_estimators_set=[int(k) for k in n_estimators_set_str.split('-')]#[100,500] 68 | max_features_set_str = sys.argv[9] 69 | max_features_set=[int(k) for k in max_features_set_str.split('-')]#[3] 70 | aggr_method=sys.argv[10] #either "mean" or "mode" 71 | algtype=sys.argv[11] #either "MSE" or "SPO" 72 | #number of workers 73 | if sys.argv[12] == "1": 74 | run_in_parallel = False 75 | num_workers = None 76 | else: 77 | run_in_parallel = True 78 | num_workers = int(sys.argv[12]) 79 | ######################################## 80 | #output filename 81 | fname_out = 
algtype+"Forest_nonlin_costs_tr"+str(n_train)+"_eps"+str(eps)+"_degSet"+deg_set_str+"_repsSt"+str(reps_st)+"_repsEnd"+str(reps_end)+"_depthSet"+max_depth_set_str+"_minObsSet"+min_samples_leaf_set_str+"_nEstSet"+n_estimators_set_str+"_mFeatSet"+max_features_set_str+"_aMethod"+aggr_method+".pkl"; 82 | ############################################################################# 83 | ############################################################################# 84 | ############################################################################# 85 | #data = pickle.load(open('non_linear_big_data_dim4.p','rb')) 86 | #'non_linear_data_dim4.p' has the following options: 87 | #n_train: 200, 400, 800 88 | #nonlinear degrees: 8, 2, 4, 10, 6 89 | #eps: 0, 0.25, 0.5 90 | #50 replications of the experiment (0-49) 91 | #dataset characteristics: 5 continuous features x, dimension 4 grid c, 1000 test set observations 92 | if n_train == 10000: 93 | data = pickle.load(open('non_linear_bigdata10000_dim4.p','rb')) 94 | else: 95 | data = pickle.load(open('non_linear_data_dim4.p','rb')) 96 | n_test = 1000 97 | ############################################################################# 98 | ############################################################################# 99 | ############################################################################# 100 | assert(reps_st >= 0) 101 | assert(reps_end <= 50) 102 | n_reps = reps_end-reps_st 103 | 104 | def forest_traintest(train_x,train_cost,test_x,test_cost,n_estimators,max_depth,min_samples_leaf,max_features,run_in_parallel,num_workers,aggr_method, algtype): 105 | if algtype == "MSE": 106 | SPO_weight_param=0.0 107 | elif algtype == "SPO": 108 | SPO_weight_param=1.0 109 | regr = SPOForest(n_estimators=n_estimators,run_in_parallel=run_in_parallel,num_workers=num_workers, 110 | max_depth=max_depth, min_weights_per_node=min_samples_leaf, quant_discret=0.01, debias_splits=False, 111 | max_features=max_features, 112 | SPO_weight_param=SPO_weight_param, SPO_full_error=True) 113 | regr.fit(train_x, train_cost, verbose_forest=True, verbose=False, feats_continuous=True, seed=forest_seed) 114 | pred_decision = regr.est_decision(test_x, method=aggr_method) 115 | return regr, np.mean([np.sum(test_cost[i] * pred_decision[i]) for i in range(0,len(pred_decision))]) 116 | 117 | def forest_tuneparams(train_x,train_cost,valid_x,valid_cost,n_estimators_set,max_depth_set,min_samples_leaf_set,max_features_set,run_in_parallel,num_workers,aggr_method, algtype): 118 | best_err = np.float("inf") 119 | for n_estimators in n_estimators_set: 120 | for max_depth in max_depth_set: 121 | for min_samples_leaf in min_samples_leaf_set: 122 | for max_features in max_features_set: 123 | regr, err = forest_traintest(train_x,train_cost,valid_x,valid_cost,n_estimators,max_depth,min_samples_leaf,max_features,run_in_parallel,num_workers,aggr_method, algtype) 124 | if err <= best_err: 125 | best_regr, best_err, best_n_estimators,best_max_depth,best_min_samples_leaf,best_max_features = regr, err, n_estimators,max_depth,min_samples_leaf,max_features 126 | 127 | print("Best n_estimators: " + str(best_n_estimators)) 128 | print("Best max_depth: " + str(best_max_depth)) 129 | print("Best min_samples_leaf: " + str(best_min_samples_leaf)) 130 | print("Best max_features: " + str(best_max_features)) 131 | return best_regr, best_err, best_n_estimators,best_max_depth,best_min_samples_leaf,best_max_features 132 | 133 | #costs_deg[deg] yields a n_reps*n_test matrix of costs corresponding to the experimental data for deg, 
i.e. 134 | #costs_deg[deg][i][j] gives the observed cost on test set i (0-49) example j (0-(n_test-1)) 135 | costs_deg = {} 136 | 137 | for deg in deg_set: 138 | costs_deg[deg] = np.zeros((n_reps,n_test)) 139 | 140 | for trial_num in range(reps_st,reps_end): 141 | train_x,train_cost,test_x,test_cost = data[n_train][deg][eps][trial_num] 142 | print "Deg "+str(deg)+", Trial Number "+str(trial_num)+" out of " + str(reps_end) 143 | 144 | #split up training data into train/valid split 145 | n_valid = int(np.floor(n_train*valid_frac)) 146 | valid_x = train_x[:n_valid] 147 | valid_cost = train_cost[:n_valid] 148 | train_x = train_x[n_valid:] 149 | train_cost = train_cost[n_valid:] 150 | 151 | start = time.time() 152 | 153 | #FIT FOREST 154 | regr,_,_,_,_,_ = forest_tuneparams(train_x,train_cost,valid_x,valid_cost,n_estimators_set,max_depth_set,min_samples_leaf_set,max_features_set, run_in_parallel, num_workers, aggr_method, algtype) 155 | 156 | end = time.time() 157 | print "Elapsed time: " + str(end-start) 158 | 159 | #FIND TEST SET COST 160 | pred_decision = regr.est_decision(test_x, method=aggr_method) 161 | incurred_test_cost = [np.sum(test_cost[i] * pred_decision[i]) for i in range(0,len(pred_decision))] 162 | 163 | print "Average test set cost: " + str(np.mean(incurred_test_cost)) 164 | 165 | costs_deg[deg][trial_num] = incurred_test_cost 166 | 167 | # Saving the objects occasionally: 168 | if trial_num % 25 == 0: 169 | with open(fname_out, 'wb') as output: 170 | pickle.dump(costs_deg, output, pickle.HIGHEST_PROTOCOL) 171 | 172 | with open(fname_out, 'wb') as output: 173 | pickle.dump(costs_deg, output, pickle.HIGHEST_PROTOCOL) 174 | 175 | # Getting back the objects: 176 | #with open(fname_out, 'rb') as input: 177 | # costs_deg = pickle.load(input) -------------------------------------------------------------------------------- /Applications/Shortest Path/SPOgreedy_nonlinear.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Runs SPOT (greedy) / CART algorithm on shortest path dataset with nonlinear mapping from features to costs ("nonlinear") 3 | Outputs algorithm decision costs for each test-set instance as pickle file 4 | Also outputs optimal decision costs for each test-set instance as pickle file 5 | Takes multiple input arguments: 6 | (1) n_train: number of training observations. can take values 200, 10000 7 | (2) eps: parameter (\bar{\epsilon}) in the paper controlling noise in mapping from features to costs. 8 | n_train = 200: can take values 0, 0.25 9 | n_train = 10000: can take values 0, 0.5 10 | (3) deg_set_str: set of deg parameters to try, e.g. "2-10". 11 | deg = parameter "degree" in the paper controlling nonlinearity in mapping from features to costs. 12 | can try values in {2,10} 13 | (4) reps_st, reps_end: we provide 10 total datasets corresponding to different generated B values (matrix mapping features to costs). 14 | script will run code on problem instances reps_st to reps_end 15 | (5) max_depth: training depth of tree, e.g. "5" 16 | (6) min_weights_per_node: min. number of (weighted) observations per leaf, e.g. 
"100" 17 | (7) algtype: set equal to "MSE" (CART) or "SPO" (SPOT greedy) 18 | Values of input arguments used in paper: 19 | (1) n_train: consider values 200, 10000 20 | (2) eps: 21 | n_train = 200: considered values 0, 0.25 22 | n_train = 10000: considered values 0, 0.5 23 | (3) deg_set_str: "2-10" 24 | (4) reps_st, reps_end: reps_st = 0, reps_end = 10 25 | (5) max_depth: 26 | n_train = 200: considered depths of 1, 2, 3, 1000 27 | n_train = 10000: considered depths of 2, 4, 6, 1000 28 | (6) min_weights_per_node: 20 29 | (7) algtype: "MSE" (CART) or "SPO" (SPOT greedy) 30 | ''' 31 | 32 | import time 33 | 34 | import numpy as np 35 | import pickle 36 | from SPO_tree_greedy import SPOTree 37 | from decision_problem_solver import* 38 | import sys 39 | #problem parameters 40 | n_train = int(sys.argv[1])#200 41 | eps = float(sys.argv[2])#0 42 | deg_set_str = sys.argv[3] 43 | deg_set=[int(k) for k in deg_set_str.split('-')]#[2,4,6,8,10] 44 | #evaluate algs of dataset replications from rep_st to rep_end 45 | reps_st = int(sys.argv[4])#0 #can be as low as 0 46 | reps_end = int(sys.argv[5])#1 #can be as high as 50 47 | valid_frac = 0.2 #set aside valid_frac of training data for validation 48 | ######################################## 49 | #training parameters 50 | max_depth = int(sys.argv[6])#3 51 | min_weights_per_node = int(sys.argv[7])#20 52 | algtype = sys.argv[8] #either "MSE" or "SPO" 53 | ######################################## 54 | #output filename for alg 55 | fname_out = algtype+"greedy_nonlin_costs_tr"+str(n_train)+"_eps"+str(eps)+"_degSet"+deg_set_str+"_repsSt"+str(reps_st)+"_repsEnd"+str(reps_end)+"_depth"+str(max_depth)+"_minObs"+str(min_weights_per_node)+".pkl"; 56 | #output filename for opt costs 57 | fname_out_opt = "Opt_nonlin_costs_tr"+str(n_train)+"_eps"+str(eps)+"_degSet"+deg_set_str+"_repsSt"+str(reps_st)+"_repsEnd"+str(reps_end)+".pkl"; 58 | ############################################################################# 59 | ############################################################################# 60 | ############################################################################# 61 | #data = pickle.load(open('non_linear_big_data_dim4.p','rb')) 62 | #'non_linear_data_dim4.p' has the following options: 63 | #n_train: 200, 400, 800 64 | #nonlinear degrees: 8, 2, 4, 10, 6 65 | #eps: 0, 0.25, 0.5 66 | #50 replications of the experiment (0-49) 67 | #dataset characteristics: 5 continuous features x, dimension 4 grid c, 1000 test set observations 68 | if n_train == 10000: 69 | data = pickle.load(open('non_linear_bigdata10000_dim4.p','rb')) 70 | else: 71 | data = pickle.load(open('non_linear_data_dim4.p','rb')) 72 | n_test = 1000 73 | ############################################################################# 74 | ############################################################################# 75 | ############################################################################# 76 | assert(reps_st >= 0) 77 | assert(reps_end <= 50) 78 | n_reps = reps_end-reps_st 79 | 80 | #costs_deg[deg] yields a n_reps*n_test matrix of costs corresponding to the experimental data for deg, i.e. 
81 | #costs_deg[deg][i][j] gives the observed cost on test set i (0-49) example j (0-(n_test-1)) 82 | costs_deg = {} 83 | optcosts_deg = {} #optimal costs 84 | 85 | for deg in deg_set: 86 | costs_deg[deg] = np.zeros((n_reps,n_test)) 87 | optcosts_deg[deg] = np.zeros((n_reps,n_test)) 88 | 89 | for trial_num in range(reps_st,reps_end): 90 | train_x,train_cost,test_x,test_cost = data[n_train][deg][eps][trial_num] 91 | print "Deg "+str(deg)+", Trial Number "+str(trial_num)+" out of " + str(reps_end) 92 | 93 | #split up training data into train/valid split 94 | n_valid = int(np.floor(n_train*valid_frac)) 95 | valid_x = train_x[:n_valid] 96 | valid_cost = train_cost[:n_valid] 97 | train_x = train_x[n_valid:] 98 | train_cost = train_cost[n_valid:] 99 | 100 | start = time.time() 101 | 102 | #FIT ALGORITHM 103 | #SPO_weight_param: number between 0 and 1: 104 | #Error metric: SPO_loss*SPO_weight_param + MSE_loss*(1-SPO_weight_param) 105 | if algtype == "MSE": 106 | SPO_weight_param=0.0 107 | elif algtype == "SPO": 108 | SPO_weight_param=1.0 109 | my_tree = SPOTree(max_depth = max_depth, min_weights_per_node = min_weights_per_node, quant_discret = 0.01, debias_splits=False, SPO_weight_param=SPO_weight_param, SPO_full_error=True) 110 | my_tree.fit(train_x,train_cost,verbose=False,feats_continuous=True); #verbose specifies whether fitting procedure should print progress 111 | #my_tree.traverse() #prints out the unpruned tree 112 | 113 | end = time.time() 114 | print "Elapsed time: " + str(end-start) 115 | 116 | #PRUNE DECISION TREE USING VALIDATION SET 117 | my_tree.prune(valid_x, valid_cost, verbose=False, one_SE_rule=False) #verbose specifies whether pruning procedure should print progress 118 | #my_tree.traverse() #prints out the pruned tree 119 | 120 | #FIND TEST SET ALGORITHM COST AND OPTIMAL COST 121 | opt_decision = find_opt_decision(test_cost)['weights'] 122 | pred_decision = my_tree.est_decision(test_x) 123 | costs_deg[deg][trial_num] = [np.sum(test_cost[i] * pred_decision[i]) for i in range(0,len(pred_decision))] 124 | optcosts_deg[deg][trial_num] = [np.sum(test_cost[i] * opt_decision[i,:]) for i in range(0,opt_decision.shape[0])] 125 | 126 | # Saving the objects occasionally: 127 | if trial_num % 25 == 0: 128 | with open(fname_out, 'wb') as output: 129 | pickle.dump(costs_deg, output, pickle.HIGHEST_PROTOCOL) 130 | with open(fname_out_opt, 'wb') as output: 131 | pickle.dump(optcosts_deg, output, pickle.HIGHEST_PROTOCOL) 132 | 133 | with open(fname_out, 'wb') as output: 134 | pickle.dump(costs_deg, output, pickle.HIGHEST_PROTOCOL) 135 | with open(fname_out_opt, 'wb') as output: 136 | pickle.dump(optcosts_deg, output, pickle.HIGHEST_PROTOCOL) 137 | 138 | # Getting back the objects: 139 | #with open(fname_out, 'rb') as input: 140 | # costs_deg = pickle.load(input) -------------------------------------------------------------------------------- /Applications/Shortest Path/SPOopt_nonlinear.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Runs SPOT (MILP) algorithm on shortest path dataset with nonlinear mapping from features to costs ("nonlinear") 3 | Outputs decision costs for each test-set instance as pickle file 4 | Note on notation: the paper uses r to denote the binary variables which map training observations to leaves. This code uses z rather than r. 5 | Takes multiple input arguments: 6 | (1) n_train: number of training observations. 
can take values 200, 10000 7 | (2) eps: parameter (\bar{\epsilon}) in the paper controlling noise in mapping from features to costs. 8 | n_train = 200: can take values 0, 0.25 9 | n_train = 10000: can take values 0, 0.5 10 | (3) deg_set_str: set of deg parameters to try, e.g. "2-10". 11 | deg = parameter "degree" in the paper controlling nonlinearity in mapping from features to costs. 12 | can try values in {2,10} 13 | (4) reps_st, reps_end: we provide 10 total datasets corresponding to different generated B values (matrix mapping features to costs). 14 | script will run code on problem instances reps_st to reps_end 15 | (5) H: training depth of tree, e.g. "5" 16 | (6) N_min: min. number of (weighted) observations per leaf, e.g. "100" 17 | (7) train_x_precision: contextual features x are rounded to train_x_precision before fitting MILP (e.g., 2 = two decimal places) 18 | higher values of train_x_precision will be more precise but take more computational time 19 | (8) reg_set_str: sequence of regularization parameters to try (tuned using cross validation), e.g. "0.001-0.01-0.1" 20 | if "None", fits MILP using no regularizaiton and then prunes using CART pruning procedure (with SPO loss as pruning metric) 21 | (9) solver_time_limit: MILP solver is terminated after solver_time_limit seconds, returning best-found solution 22 | Values of input arguments used in paper: 23 | (1) n_train: consider values 200, 10000 24 | (2) eps: 25 | n_train = 200: considered values 0, 0.25 26 | n_train = 10000: considered values 0, 0.5 27 | (3) deg_set_str: "2-10" 28 | (4) reps_st, reps_end: reps_st = 0, reps_end = 10 29 | (5) H: 30 | n_train = 200: considered depths of 1, 2, 3, 1000 31 | n_train = 10000: considered depths of 2, 4, 6, 1000 32 | (6) N_min: 20 33 | (7) train_x_precision: 2 34 | (8) reg_set_str: "None" 35 | (9) solver_time_limit: 16200 36 | ''' 37 | 38 | import time 39 | 40 | import numpy as np 41 | import pickle 42 | from spo_opt_tree_nonlinear import* 43 | from SPO2CART import SPO2CART 44 | from SPO_tree_greedy import SPOTree 45 | from decision_problem_solver import* 46 | import sys 47 | #problem parameters 48 | n_train = int(sys.argv[1])#200 49 | eps = float(sys.argv[2])#0 50 | deg_set_str = sys.argv[3] 51 | deg_set=[int(k) for k in deg_set_str.split('-')]#[2,4,6,8,10] 52 | #evaluate algs of dataset replications from rep_st to rep_end 53 | reps_st = int(sys.argv[4])#0 #can be as low as 0 54 | reps_end = int(sys.argv[5])#1 #can be as high as 50 55 | valid_frac = 0.2 56 | ######################################## 57 | #training parameters 58 | #optimal tree params 59 | H = int(sys.argv[6])#2 #H = max tree depth 60 | N_min = int(sys.argv[7])#4 #N_min = minimum number of observations per leaf node 61 | #higher values of train_x_precision will be more precise but take more computational time 62 | #values >= 8 might cause numerical errors 63 | train_x_precision = int(sys.argv[8])#2 64 | reg_set_str = sys.argv[9]#"None" 65 | if reg_set_str == "None": 66 | reg_set = None 67 | else: 68 | reg_set = [int(k) for k in reg_set_str.split('-')]#None 69 | #reg_set = [0.001] #if reg_set = None, uses CART to prune tree 70 | solver_time_limit = int(sys.argv[10]) 71 | ######################################## 72 | #output filename 73 | fname_out = "SPOopt_nonlin_costs_tr"+str(n_train)+"_eps"+str(eps)+"_degSet"+deg_set_str+"_repsSt"+str(reps_st)+"_repsEnd"+str(reps_end)+"_depth"+str(H)+"_minObs"+str(N_min)+"_prec"+str(train_x_precision)+"_regset"+reg_set_str+"_tLim"+str(solver_time_limit)+".pkl"; 74 | 
############################################################################# 75 | ############################################################################# 76 | ############################################################################# 77 | #data = pickle.load(open('non_linear_big_data_dim4.p','rb')) 78 | #'non_linear_data_dim4.p' has the following options: 79 | #n_train: 200, 400, 800 80 | #nonlinear degrees: 8, 2, 4, 10, 6 81 | #eps: 0, 0.25, 0.5 82 | #50 replications of the experiment (0-49) 83 | #dataset characteristics: 5 continuous features x, dimension 4 grid c, 1000 test set observations 84 | if n_train == 10000: 85 | data = pickle.load(open('non_linear_bigdata10000_dim4.p','rb')) 86 | else: 87 | data = pickle.load(open('non_linear_data_dim4.p','rb')) 88 | n_test = 1000 89 | ############################################################################# 90 | ############################################################################# 91 | ############################################################################# 92 | assert(reps_st >= 0) 93 | assert(reps_end <= 50) 94 | n_reps = reps_end-reps_st 95 | 96 | #costs_deg[deg] yields a n_reps*n_test matrix of costs corresponding to the experimental data for deg, i.e. 97 | #costs_deg[deg][i][j] gives the observed cost on test set i (0-49) example j (0-(n_test-1)) 98 | costs_deg = {} 99 | 100 | for deg in deg_set: 101 | costs_deg[deg] = np.zeros((n_reps,n_test)) 102 | 103 | for trial_num in range(reps_st,reps_end): 104 | train_x,train_cost,test_x,test_cost = data[n_train][deg][eps][trial_num] 105 | print "Deg "+str(deg)+", Trial Number "+str(trial_num)+" out of " + str(reps_end) 106 | 107 | #split up training data into train/valid split 108 | n_valid = int(np.floor(n_train*valid_frac)) 109 | valid_x = train_x[:n_valid] 110 | valid_cost = train_cost[:n_valid] 111 | train_x = train_x[n_valid:] 112 | train_cost = train_cost[n_valid:] 113 | 114 | start = time.time() 115 | 116 | #FIT SPO OPTIMAL TREE 117 | if reg_set is None: 118 | 119 | #FIT SPO GREEDY TREE AS INITIAL SOLUTION 120 | def truncate_train_x(train_x, train_x_precision): 121 | return(np.around(train_x, decimals=train_x_precision)) 122 | train_x_truncated = truncate_train_x(train_x, train_x_precision) 123 | #SPO_weight_param: number between 0 and 1: 124 | #Error metric: SPO_loss*SPO_weight_param + MSE_loss*(1-SPO_weight_param) 125 | my_tree = SPOTree(max_depth = H, min_weights_per_node = N_min, quant_discret = 0.01, debias_splits=False, SPO_weight_param=1.0, SPO_full_error=True) 126 | my_tree.fit(train_x_truncated,train_cost,verbose=False,feats_continuous=True); 127 | 128 | #PRUNE SPO GREEDY TREE USING TRAINING SET (TO GET RID OF REDUNDANT LEAVES) 129 | my_tree.prune(train_x_truncated, train_cost, verbose=False, one_SE_rule=False) #verbose specifies whether pruning procedure should print progress 130 | spo_greedy_a, spo_greedy_b, spo_greedy_z = my_tree.get_tree_encoding(x_train=train_x_truncated) 131 | 132 | #(OPTIONAL) PRUNE SPO GREEDY TREE USING VALIDATION SET 133 | # my_tree.prune(valid_x, valid_cost, verbose=False, one_SE_rule=False) #verbose specifies whether pruning procedure should print progress 134 | # spo_greedy_a, spo_greedy_b, spo_greedy_z = my_tree.get_tree_encoding(x_train=train_x_truncated) 135 | # alpha = my_tree.get_pruning_alpha() 136 | 137 | #FIT SPO OPTIMAL TREE USING FOUND INITIAL SOLUTION 138 | reg_param = 1e-4 #introduce very small amount of regularization to ensure leaves with zero predictive power are aggregated 139 | spo_dt_a, spo_dt_b, spo_dt_w, 
spo_dt_y, spo_dt_z, spo_dt_l, spo_dt_d = spo_opt_tree(train_cost,train_x,train_x_precision,reg_param, N_min, H, 140 | a_start=spo_greedy_a, z_start=spo_greedy_z, 141 | Presolve=2, Seed=0, TimeLimit=solver_time_limit, 142 | returnAllOptvars=True) 143 | 144 | end = time.time() 145 | print "Elapsed time: " + str(end-start) 146 | 147 | #(IF NOT USING POSTPRUNING) FIND TEST SET COST 148 | # path = decision_path(test_x,spo_dt_a,spo_dt_b) 149 | # costs_deg[deg][trial_num] = apply_leaf_decision(test_cost,path,spo_dt_w, subtract_optimal=False) 150 | 151 | #PRUNE MILP TREE USING CART PRUNING METHOD ON VALIDATION SET 152 | spo2cart = SPO2CART(spo_dt_a, spo_dt_b) 153 | spo2cart.fit(train_x,train_cost,train_x_precision,verbose=False,feats_continuous=True) 154 | spo2cart.prune(valid_x, valid_cost, verbose=False, one_SE_rule=False) #verbose specifies whether pruning procedure should print progress 155 | #my_tree.traverse() #prints out the pruned tree 156 | #(IF PRUNED) FIND TEST SET COST 157 | pred_decision = spo2cart.est_decision(test_x) 158 | costs_deg[deg][trial_num] = [np.sum(test_cost[i] * pred_decision[i]) for i in range(0,len(pred_decision))] 159 | 160 | else: 161 | spo_dt_a, spo_dt_b, spo_dt_w, _, best_alpha = spo_opt_tunealpha(train_x,train_cost,valid_x,valid_cost,train_x_precision,reg_set, N_min, H) 162 | print("Best Alpha: " + best_alpha) 163 | 164 | end = time.time() 165 | print "Elapsed time: " + str(end-start) 166 | 167 | #FIND TEST SET COST 168 | path = decision_path(test_x,spo_dt_a,spo_dt_b) 169 | costs_deg[deg][trial_num] = apply_leaf_decision(test_cost,path,spo_dt_w, subtract_optimal=False) 170 | 171 | # Saving the objects occasionally: 172 | if trial_num % 5 == 0: 173 | with open(fname_out, 'wb') as output: 174 | pickle.dump(costs_deg, output, pickle.HIGHEST_PROTOCOL) 175 | 176 | with open(fname_out, 'wb') as output: 177 | pickle.dump(costs_deg, output, pickle.HIGHEST_PROTOCOL) 178 | 179 | # Getting back the objects: 180 | #with open(fname_out, 'rb') as input: 181 | # costs_deg = pickle.load(input) -------------------------------------------------------------------------------- /Applications/Shortest Path/decision_problem_solver.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Generic file to set up the decision problem (i.e., optimization problem) under consideration 3 | Must have functions: 4 | get_num_decisions(): returns number of decision variables (i.e., dimension of cost vector) 5 | find_opt_decision(): returns for a matrix of cost vectors the corresponding optimal decisions for those cost vectors 6 | 7 | This particular file sets up a shortest path decision problem over a 4 x 4 grid network, where driver starts in 8 | southwest corner and tries to find shortest path to northeast corner. 
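A minimal usage sketch (illustrative only; the random cost matrix below is hypothetical, with one row per
observation and one column per edge of the 4 x 4 grid, i.e. D = 24 edges as enumerated in Edge_list below):
    costs = np.random.rand(5, get_num_decisions())  # 5 hypothetical observations, one nonnegative cost per edge
    sol = find_opt_decision(costs)
    sol['weights']    # array of shape (5, 24): flow placed on each edge by the optimal (shortest) path
    sol['objective']  # array of shape (5,): total cost of the optimal path under each cost vector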
9 | ''' 10 | 11 | from gurobipy import * 12 | import numpy as np 13 | 14 | dim = 4 #(creates dim * dim grid, where dim = number of vertices) 15 | Edge_list = [(i,i+1) for i in range(1, dim**2 + 1) if i % dim != 0] 16 | Edge_list += [(i, i + dim) for i in range(1, dim**2 + 1) if i <= dim**2 - dim] 17 | Edge_dict = {} #(assigns each edge to a unique integer from 0 to number-of-edges) 18 | for index, edge in enumerate(Edge_list): 19 | Edge_dict[edge] = index 20 | D = len(Edge_list) # D = number of decisions 21 | 22 | def get_num_decisions(): 23 | return D 24 | 25 | Edges = tuplelist(Edge_list) 26 | # Find the optimal total cost for an observation in the context of shortes path 27 | m_shortest_path = Model('shortest_path') 28 | m_shortest_path.Params.OutputFlag = 0 29 | flow = m_shortest_path.addVars(Edges, ub = 1, name = 'flow') 30 | m_shortest_path.addConstrs((quicksum(flow[i,j] for i,j in Edges.select(i,'*')) - quicksum(flow[k, i] for k,i in Edges.select('*', 31 | i)) == 0 for i in range(2, dim**2)), name = 'inner_nodes') 32 | m_shortest_path.addConstr((quicksum(flow[i,j] for i,j in Edges.select(1, '*')) == 1), name = 'start_node') 33 | m_shortest_path.addConstr((quicksum(flow[i,j] for i,j in Edges.select('*', dim**2)) == 1), name = 'end_node') 34 | 35 | def shortest_path(cost): 36 | # m_shortest_path.setObjective(quicksum(flow[i,j] * cost[Edge_dict[(i,j)]] for i,j in Edges), GRB.MINIMIZE) 37 | m_shortest_path.setObjective(LinExpr( [ (cost[Edge_dict[(i,j)]],flow[i,j] ) for i,j in Edges]), GRB.MINIMIZE) 38 | m_shortest_path.optimize() 39 | return {'weights': m_shortest_path.getAttr('x', flow), 'objective': m_shortest_path.objVal} 40 | 41 | def find_opt_decision(cost): 42 | weights = np.zeros(cost.shape) 43 | objective = np.zeros(cost.shape[0]) 44 | for i in range(cost.shape[0]): 45 | temp = shortest_path(cost[i,:]) 46 | for edge in Edges: 47 | weights[i, Edge_dict[edge]] = temp['weights'][edge] 48 | objective[i] = temp['objective'] 49 | return {'weights': weights, 'objective':objective} 50 | -------------------------------------------------------------------------------- /Applications/Shortest Path/spo_opt_tree_nonlinear.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Code for fitting SPOT MILP on shortest paths dataset. 3 | Note that this code is specifically designed for the shortest paths application and 4 | will not work for other applications without modifying the constraints. 5 | Note on notation: the paper uses r to denote the binary variables which map training observations to leaves. This code uses z rather than r. 
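A minimal end-to-end sketch of the entry points defined below (hyperparameter values are illustrative, similar
to those used by SPOopt_nonlinear.py; train_x, train_cost, test_x, test_cost are placeholder numpy arrays shaped
as in the driver scripts, with features in [0,1] and nonnegative costs as asserted below):
    # (a, b) encode the splits at the branch nodes; w holds the leaf decisions (edge flows)
    a, b, w = spo_opt_tree(train_cost, train_x, train_x_precision=2, spo_opt_tree_reg=1e-4, N_min=20, H=2)
    # route each test observation to its leaf and evaluate that leaf's decision against the realized costs
    paths = decision_path(test_x, a, b)
    test_decision_costs = apply_leaf_decision(test_cost, paths, w, subtract_optimal=False)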
6 | ''' 7 | 8 | 9 | import numpy as np 10 | import pandas as pd 11 | from gurobipy import* 12 | import pickle 13 | from decision_problem_solver import* 14 | 15 | # Helper functions for tree structure 16 | def find_parent_index(t): 17 | return (t+1)//2 - 1 18 | 19 | def find_ancestors(t): 20 | l= [] 21 | r = [] 22 | if t == 0: 23 | return 24 | else: 25 | while find_parent_index(t) !=0: 26 | parent = find_parent_index(t) 27 | if (t+1)% (1+parent) ==1: 28 | r.append(parent) 29 | else: 30 | l.append(parent) 31 | t = parent 32 | if t==2: 33 | r.append(0) 34 | else: 35 | l.append(0) 36 | return[l,r] 37 | 38 | #truncate training set features to desired precision 39 | def truncate_train_x(train_x, train_x_precision): 40 | return(np.around(train_x, decimals=train_x_precision)) 41 | 42 | 43 | #trains an optimal tree model on train_cost, train_x, and reg parameter spo_opt_tree_reg (scalar) 44 | #returns parameter encoding of the opimal tree (a,b,w). (a,b) encode splits, (w) encode leaf decisions 45 | #optimal tree params: 46 | #N_min = minimum number of observations per leaf node 47 | #H = max tree depth 48 | #def spo_opt_tree(train_cost, train_x,spo_opt_tree_reg): 49 | def spo_opt_tree(train_cost, train_x, train_x_precision, spo_opt_tree_reg, N_min, H, returnAllOptvars=False, 50 | a_start=None, b_start=None, w_start=None, y_start=None, z_start=None, l_start=None, d_start=None, 51 | threads=None, MIPGap=None, MIPFocus=None, verbose=False, Seed=None, TimeLimit=None, 52 | Presolve=None, ImproveStartTime=None, VarBranch=None, Cuts=None, 53 | tune=False, TuneCriterion=None, TuneJobs=None, TuneTimeLimit=None, TuneTrials=None, tune_foutpref=None): 54 | assert(spo_opt_tree_reg >= 1e-4) 55 | # We label all nodes of the tree by 0, 1, 2, ... 2**(H+1) - 1. 56 | T_B = 2**H - 1 57 | T_L = 2**H 58 | #Edge_list comes from importing shortest_path_solver 59 | Edges_w_t = tuplelist([(i,j,t) for i,j in Edge_list for t in range(T_L)]) 60 | 61 | n_train, P = train_x.shape 62 | #truncate x features so eps (below) will not be too small 63 | train_x = truncate_train_x(train_x, train_x_precision) 64 | 65 | assert(np.all(train_x >= 0)) 66 | assert(np.all(train_x <= 1)) 67 | assert(np.all(train_cost >= 0)) #assert nonnegative costs 68 | assert(np.all(train_cost.shape[0] == train_x.shape[0])) 69 | # Instantiate optimization model 70 | # Compute average optimal cost across all training set observations 71 | # (Although irrelevant for the optimization problem, it helps in interpreting alpha) 72 | optimal_costs = np.zeros(train_x.shape[0]) 73 | for i in range(train_x.shape[0]): 74 | optimal_costs[i] = find_opt_decision(train_cost[i,:].reshape(1,-1))['objective'][0] 75 | sum_optimal_cost = sum(optimal_costs) 76 | 77 | # Compute big M constant 78 | M = 0 79 | for i in range(train_x.shape[0]): 80 | longest_path_cost = -find_opt_decision(-train_cost[i,:].reshape(1,-1))['objective'][0] 81 | if longest_path_cost >= M: 82 | M = longest_path_cost 83 | #M = train_cost.max()*(dim-1)*2 84 | spo = Model('spo_opt_tree') 85 | if verbose == False: 86 | spo.Params.OutputFlag = 0 87 | #compute epsilon constants 88 | #eps = np.float("inf") 89 | #for j in range(train_x.shape[1]): 90 | #ordered_feat = np.sort(train_x[:,j]) 91 | #diffs = ordered_feat[1:]-ordered_feat[:-1] 92 | #nonzero_diffs = diffs[diffs > 0] 93 | #if min(nonzero_diffs) <= eps: 94 | #eps = min(nonzero_diffs) 95 | #one_plus_eps = 1 + eps 96 | 97 | eps = np.array([np.float("inf")]*train_x.shape[1]) 98 | for j in range(train_x.shape[1]): 99 | ordered_feat = np.sort(train_x[:,j]) 100 | 
diffs = ordered_feat[1:]-ordered_feat[:-1] 101 | nonzero_diffs = diffs[diffs > 0] 102 | eps[j] = min(nonzero_diffs) 103 | one_plus_eps_max = 1 + max(eps) 104 | 105 | #run params 106 | if threads is not None: 107 | spo.Params.Threads = threads 108 | if MIPGap is not None: 109 | spo.Params.MIPGap = MIPGap # default = 1e-4, try 1e-2 110 | if MIPFocus is not None: 111 | spo.Params.MIPFocus = MIPFocus 112 | if Seed is not None: 113 | spo.Params.Seed = Seed 114 | if TimeLimit is not None: 115 | spo.Params.TimeLimit = TimeLimit 116 | if Presolve is not None: 117 | spo.Params.Presolve = Presolve 118 | if ImproveStartTime is not None: 119 | spo.Params.ImproveStartTime = ImproveStartTime 120 | if VarBranch is not None: 121 | spo.Params.VarBranch = VarBranch 122 | if Cuts is not None: 123 | spo.Params.Cuts = Cuts 124 | 125 | #tune params 126 | if tune == True and TuneCriterion is not None: 127 | spo.Params.TuneCriterion = TuneCriterion 128 | if tune == True and TuneJobs is not None: 129 | spo.Params.TuneJobs = TuneJobs 130 | if tune == True and TuneTimeLimit is not None: 131 | spo.Params.TuneTimeLimit = TuneTimeLimit 132 | if tune == True and TuneTrials is not None: 133 | spo.Params.TuneTrials = TuneTrials 134 | 135 | # Add variables 136 | y = spo.addVars(tuplelist([(i, t) for i in range(n_train) for t in range(T_L)]), lb = 0,name = 'y') 137 | z = spo.addVars(tuplelist([(i, t) for i in range(n_train) for t in range(T_L)]), vtype=GRB.BINARY, name = 'z') 138 | #z = spo.addVars(tuplelist([(i, t) for i in range(n_train) for t in range(T_L)]), lb = 0, ub = 1, name = 'z') 139 | w = spo.addVars(tuplelist([(t, j) for t in range(T_L) for j in range(D)]), lb = 0,name = 'w') 140 | l = spo.addVars(tuplelist([i for i in range(T_L)]), vtype=GRB.BINARY, name = 'l') 141 | d = spo.addVars(tuplelist([i for i in range(T_B)]), vtype=GRB.BINARY, name = 'd') 142 | a = spo.addVars(tuplelist([(j,t) for j in range(P) for t in range(T_B)]), vtype=GRB.BINARY, name = 'a') 143 | #b = spo.addVars(tuplelist([i for i in range(T_B)]), lb = 0, name = 'b') 144 | b = spo.addVars(tuplelist([i for i in range(T_B)]), ub = 1, name = 'b') 145 | 146 | if a_start is not None: 147 | for i in range(P): 148 | for j in range(T_B): 149 | a[i,j].start = a_start[i,j] 150 | 151 | if b_start is not None: 152 | for i in range(T_B): 153 | b[i].start = b_start[i] 154 | 155 | if w_start is not None: 156 | for i in range(T_L): 157 | for j in range(D): 158 | w[i,j].start = w_start[i,j] 159 | 160 | if y_start is not None: 161 | for i in range(n_train): 162 | for j in range(T_L): 163 | y[i,j].start = y_start[i,j] 164 | 165 | if z_start is not None: 166 | for i in range(n_train): 167 | for j in range(T_L): 168 | z[i,j].start = z_start[i,j] 169 | 170 | if l_start is not None: 171 | for i in range(T_L): 172 | l[i].start = l_start[i] 173 | 174 | if d_start is not None: 175 | for i in range(T_B): 176 | d[i].start = d_start[i] 177 | 178 | spo.update() #for initial values to be written immediately 179 | 180 | # if a_start is not None: 181 | # for i in range(P): 182 | # for j in range(T_B): 183 | # print(a[i,j].start) 184 | # 185 | # if b_start is not None: 186 | # for i in range(T_B): 187 | # print(b[i].start) 188 | # 189 | # if w_start is not None: 190 | # for i in range(T_L): 191 | # for j in range(D): 192 | # print(w[i,j].start) 193 | # 194 | # if y_start is not None: 195 | # for i in range(n_train): 196 | # for j in range(T_L): 197 | # print(y[i,j].start) 198 | # 199 | # if z_start is not None: 200 | # for i in range(n_train): 201 | # for j in range(T_L): 202 | 
# print(z[i,j].start) 203 | # 204 | # if l_start is not None: 205 | # for i in range(T_L): 206 | # print(l[i].start) 207 | # 208 | # if d_start is not None: 209 | # for i in range(T_B): 210 | # print(d[i].start) 211 | 212 | 213 | 214 | # Add constraints 215 | # Const 7b 216 | for i in range(n_train): 217 | for t in range(T_L): 218 | #expr_constraint = LinExpr([ (train_cost[i,j], w[t,j]) for key,j in Edge_dict.items()]) 219 | expr_constraint = LinExpr([ (train_cost[i,j], w[t,j]) for j in range(D)]) 220 | spo.addConstr(y[i,t] >= expr_constraint - M * (1 - z[i,t])) 221 | # spo.addConstr(y[i,t] >= quicksum(train_cost[i,j] * w[t,j] for key,j in Edge_dict.items())- M * (1 - z[i,t])) 222 | 223 | # # Const 7c (genreral constraint for feasibility of nominal problem Aw <= B) 224 | # for t in range(T_L): 225 | # for i in range(K): 226 | # spo.addConstr(quicksum(A[i,j] * w[t,j] for j in range(D)) <= B[i] ) 227 | 228 | # Const 7c (constraint for feasibility of shortest_path problem) 229 | flow = spo.addVars(Edges_w_t, lb = 0, name = 'flow') 230 | spo.addConstrs((quicksum(flow[i,j,t] for i,j,t in Edges_w_t.select(i,'*',t)) - quicksum(flow[k,i,t] for k,i,t in Edges_w_t.select('*',i,t)) == 0 231 | for i in range(2, dim**2) for t in range(T_L) )) 232 | spo.addConstrs((quicksum(flow[i,j,t] for i,j,t in Edges_w_t.select(1, '*',t)) == 1 for t in range(T_L))) 233 | spo.addConstrs((quicksum(flow[i,j,t] for i,j,t in Edges_w_t.select('*', dim**2,t)) == 1 for t in range(T_L))) 234 | spo.addConstrs( w[t,Edge_dict[(i,j)]] - flow[i,j,t] == 0 for i,j,t in Edges_w_t) # Map shortest path flow to w_t 235 | 236 | # Const 7d 237 | for i in range(n_train): 238 | # spo.addConstr(quicksum(z[i,t] for t in range(T_L)) == 1) 239 | spo.addConstr(LinExpr([(1,z[i,t]) for t in range(T_L)]) == 1) 240 | 241 | # Const 7e 242 | for i in range(n_train): 243 | for t in range(T_L): 244 | spo.addConstr(z[i,t] <= l[t]) 245 | 246 | # Const 7f 247 | for t in range(T_L): 248 | # spo.addConstr(quicksum(z[i,t] for i in range(n_train))>= N_min * l[t]) 249 | spo.addConstr(LinExpr([(1,z[i,t]) for i in range(n_train)])>= N_min * l[t]) 250 | 251 | # Const 7g 252 | for i in range(n_train): 253 | for t in range(T_L): 254 | left, right = find_ancestors(t + T_B) 255 | for m in right: 256 | # spo.addConstr(quicksum(a[p,m]* train_x[i,p] for p in range(P)) >= b[m]- (1 - z[i,t] )) 257 | spo.addConstr(LinExpr([(train_x[i,p],a[p,m]) for p in range(P)]) >= b[m]- (1 - z[i,t] )) 258 | 259 | # Const 7h 260 | for m in left: 261 | # spo.addConstr(quicksum(a[p,m]* (x[i,p] + eps[p]) for p in range(P))<= b[m] + (1+eps_max)*(1-z[i,t] )) 262 | #spo.addConstr(quicksum(a[p,m]* train_x[i,p] for p in range(P)) +0.0001<= b[m] + (1-z[i,t] )) 263 | #spo.addConstr(LinExpr([(train_x[i,p],a[p,m]) for p in range(P)]) + eps <= b[m] + (1+eps)*(1 - z[i,t] )) 264 | spo.addConstr(LinExpr([(train_x[i,p]+ eps[p],a[p,m]) for p in range(P)]) <= b[m] + one_plus_eps_max*(1 - z[i,t] )) 265 | #spo.addConstr(LinExpr([(train_x[i,p],a[p,m]) for p in range(P)]) <= b[m] + 1 - one_plus_eps*z[i,t]) 266 | 267 | # Const 7i 268 | for t in range(T_B): 269 | # spo.addConstr(quicksum(a[p,t] for p in range(P)) == d[t]) 270 | spo.addConstr(LinExpr([(1,a[p,t]) for p in range(P)]) == d[t]) 271 | 272 | # Const 7j 273 | for t in range(T_B): 274 | #spo.addConstr(b[t] <= d[t]) 275 | spo.addConstr(b[t] >= 1 - d[t]) 276 | 277 | # Const 7k 278 | for t in range(1,T_B): 279 | spo.addConstr(d[t] <= d[find_parent_index(t)]) 280 | 281 | # Const 7l (optional): ensures LP relaxation of problem has obj >= 0 282 | for i in 
range(n_train): 283 | spo.addConstr(LinExpr([(1,y[i,t]) for t in range(T_L)]) >= optimal_costs[i]) 284 | #for t in range(T_L): 285 | #expr_constraint = LinExpr([ (train_cost[i,j], w[t,j]) for key,j in Edge_dict.items()]) 286 | #expr_constraint = LinExpr([ (train_cost[i,j], w[t,j]) for j in range(D)]) 287 | #spo.addConstr(expr_constraint >= optimal_costs[i]) 288 | #spo.addConstr(LinExpr([(1,y[i,t]) for t in range(T_L)]) >= optimal_costs[i]) 289 | 290 | # Add objective 291 | # spo.setObjective( quicksum(y[i,t] for i in range(n_train) for t in range(T_L))/n_train + spo_opt_tree_reg* quicksum(d[t] for t in range(T_B) ), GRB.MINIMIZE) 292 | expr_objective = LinExpr([(1, y[i,t]) for i in range(n_train) for t in range(T_L) ]) - sum_optimal_cost 293 | #expr_objective = LinExpr([(1, y[i,t]) for i in range(n_train) for t in range(T_L) ]) 294 | if spo_opt_tree_reg > 0: 295 | expr_objective.add(LinExpr([(1, d[t]) for t in range(T_B)])*spo_opt_tree_reg*n_train) 296 | spo.setObjective(expr_objective, GRB.MINIMIZE) 297 | 298 | 299 | # Solve optimization 300 | if tune == True: 301 | spo.tune() 302 | if tune_foutpref is None: 303 | tune_foutpref='tune' 304 | for i in range(spo.tuneResultCount): 305 | spo.getTuneResult(i) 306 | spo.write(tune_foutpref+str(i)+'.prm') 307 | spo.optimize() 308 | 309 | # Get values of objective and variables 310 | # print('Obj=') 311 | # print(spo.getObjective().getValue()) 312 | # 313 | # z_ = np.zeros((n,T_L)) 314 | # z_res = spo.getAttr('X', z) 315 | # for i,j in z_res: 316 | # z_[i,j] = z_res[i,j] 317 | # print(z_) 318 | spo_dt_a = np.zeros((P,T_B)) 319 | a_res = spo.getAttr('X', a) 320 | for i in range(P): 321 | for j in range(T_B): 322 | spo_dt_a[i,j] = a_res[i,j] 323 | #for i,j in a_res: 324 | #spo_dt_a[i,j] = a_res[i,j] 325 | 326 | spo_dt_b = np.zeros(T_B) 327 | b_res = spo.getAttr('X', b) 328 | for i in range(T_B): 329 | spo_dt_b[i] = b_res[i] 330 | #for i in b_res: 331 | #spo_dt_b[i] = b_res[i] 332 | 333 | spo_dt_w = np.zeros((T_L,D)) 334 | w_res = spo.getAttr('X', w) 335 | for i in range(T_L): 336 | for j in range(D): 337 | spo_dt_w[i,j] = w_res[i,j] 338 | #for i,j in w_res: 339 | #spo_dt_w[i,j] = w_res[i,j] 340 | 341 | spo_dt_y = np.zeros((n_train,T_L)) 342 | y_res = spo.getAttr('X', y) 343 | for i in range(n_train): 344 | for j in range(T_L): 345 | spo_dt_y[i,j] = y_res[i,j] 346 | spo_dt_z = np.zeros((n_train,T_L)) 347 | z_res = spo.getAttr('X', z) 348 | for i in range(n_train): 349 | for j in range(T_L): 350 | spo_dt_z[i,j] = z_res[i,j] 351 | spo_dt_l = np.zeros(T_L) 352 | l_res = spo.getAttr('X', l) 353 | for i in range(T_L): 354 | spo_dt_l[i] = l_res[i] 355 | spo_dt_d = np.zeros(T_B) 356 | d_res = spo.getAttr('X', d) 357 | for i in range(T_B): 358 | spo_dt_d[i] = d_res[i] 359 | 360 | if returnAllOptvars == False: 361 | return spo_dt_a, spo_dt_b, spo_dt_w 362 | else: 363 | return spo_dt_a, spo_dt_b, spo_dt_w, spo_dt_y, spo_dt_z, spo_dt_l, spo_dt_d 364 | 365 | # Given a tree defined by a and b for all interior nodes, find the path (including the leaf node in which it lies) of observations using its features 366 | def decision_path(x,a,b): 367 | T_B = len(b) 368 | if len(x.shape) == 1: 369 | n = 1 370 | P = x.size 371 | else: 372 | n, P = x.shape 373 | res = [] 374 | for i in range(n): 375 | node = 0 376 | path = [0] 377 | T_B = a.shape[1] 378 | while node < T_B: 379 | if np.dot(x[i,:], a[:,node]) < b[node]: 380 | node = (node+1)*2 - 1 381 | else: 382 | node = (node+1)*2 383 | path.append(node) 384 | res.append(path) 385 | return np.array(res) 386 | 387 | 388 | # 
Given the path of an observation (including the leaf node in which it lies), find the predicted total cost for that observation 389 | def apply_leaf_decision(c,path, w, subtract_optimal=False): 390 | T_L, D = w.shape 391 | n = c.shape[0] 392 | paths = path[:, -1] 393 | actual_cost = [] 394 | for i in range(n): 395 | decision_node = paths[i] - T_L +1 396 | cost_decision = np.dot(c[i,:], w[decision_node,:]) 397 | if subtract_optimal == True: 398 | cost_optimal = find_opt_decision(c[i,:].reshape(1,-1))['objective'][0] 399 | actual_cost.append(cost_decision-cost_optimal) 400 | else: 401 | actual_cost.append(cost_decision) 402 | return np.array(actual_cost) 403 | 404 | def spo_opt_traintest(train_x,train_cost,test_x,test_cost,train_x_precision,spo_opt_tree_reg, N_min, H): 405 | spo_dt_a,spo_dt_b, spo_dt_w = spo_opt_tree(train_cost,train_x,train_x_precision,spo_opt_tree_reg, N_min, H) 406 | path = decision_path(test_x,spo_dt_a,spo_dt_b) 407 | return spo_dt_a,spo_dt_b, spo_dt_w, np.mean(apply_leaf_decision(test_cost,path, spo_dt_w, subtract_optimal=True)) 408 | 409 | def spo_opt_tunealpha(train_x,train_cost,valid_x,valid_cost,train_x_precision,reg_set, N_min, H): 410 | best_err = np.float("inf") 411 | for alpha in reg_set: 412 | spo_dt_a,spo_dt_b, spo_dt_w, err = spo_opt_traintest(train_x,train_cost,valid_x,valid_cost,train_x_precision,alpha, N_min, H) 413 | if err <= best_err: 414 | best_spo_dt_a, best_spo_dt_b, best_spo_dt_w, best_err, best_alpha = spo_dt_a,spo_dt_b, spo_dt_w, err, alpha 415 | return best_spo_dt_a, best_spo_dt_b, best_spo_dt_w, best_err, best_alpha 416 | 417 | #def cv_spo_opt_traintest(cost, X, train_x_precision,reg_set, splits = 4): 418 | # dic = {reg:0 for reg in reg_set} 419 | # n, n_edges = cost.shape 420 | # K = X.shape[1] 421 | # kf = KFold(n_splits = splits) 422 | # for train, test in kf.split(X): 423 | # X_train, X_test, cost_train, cost_test = X[train], X[test], cost[train], cost[test] 424 | # opt_cost = find_opt_decision(cost_test)['objective'] 425 | # for spo_opt_tree_reg in reg_set: 426 | # actual_cost = spo_opt_traintest(X_train, cost_train, X_test, cost_test,train_x_precision,spo_opt_tree_reg) 427 | # dic[spo_opt_tree_reg] += sum(actual_cost - opt_cost) 428 | # return smallest_dic_value(dic) 429 | # 430 | #def smallest_dic_value(dic): 431 | # reverse = dict() 432 | # for key in dic.keys(): 433 | # reverse[dic[key]] = key 434 | # return reverse[min(reverse.keys())] 435 | -------------------------------------------------------------------------------- /Applications/Yahoo News/Data Preprocessing/DataPreprocessing1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import YahooNewsDataExtraction as tool\n", 10 | "from sklearn.cluster import KMeans\n", 11 | "import numpy as np\n", 12 | "import pandas as pd\n", 13 | "import pickle" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "# Read all data and save in separate files" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": null, 26 | "metadata": {}, 27 | "outputs": [], 28 | "source": [ 29 | "'''The process time for each raw Yahoo news data file is around 10 - 30 minutes.'''\n", 30 | "count = 0\n", 31 | "adsDict = dict()\n", 32 | "recordDict = dict() \n", 33 | "for day in range(1,11):\n", 34 | " if day == 10:\n", 35 | " '''Use raw Yahoo News data path and filename'''\n", 36 | " 
filename = '../ydata-fp-td-clicks-v1_0.200905' + str(day)\n", 37 | " else:\n", 38 | " filename = '../ydata-fp-td-clicks-v1_0.2009050' + str(day)\n", 39 | " with open(filename) as fp:\n", 40 | " line = fp.readline()\n", 41 | " while line:\n", 42 | " timestamp, offered_ad_id, click, user_feat, eligible_ads_ids, eligible_ads_feat = tool.parse_line(line)\n", 43 | " recordDict.update({count:np.hstack([user_feat[1:], offered_ad_id, click])})\n", 44 | " adsDict.update(dict(zip(eligible_ads_ids,eligible_ads_feat[:,1:])))\n", 45 | " count += 1\n", 46 | " line = fp.readline()\n", 47 | " \n", 48 | " recordDF = pd.DataFrame(recordDict).T\n", 49 | " filename = 'day' + str(day) + '_records.npy'\n", 50 | " np.save(filename, recordDF.values)\n", 51 | " \n", 52 | " filename = 'day' + str(day) + '_adsDict.p'\n", 53 | " pickle.dump(adsDict, open(filename,'wb'))\n", 54 | " count = 0\n", 55 | " adsDict = dict()\n", 56 | " recordDict = dict()\n", 57 | " print('Completed processing: ', filename)" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "metadata": {}, 64 | "outputs": [], 65 | "source": [] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "# Clustering user data" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": null, 77 | "metadata": {}, 78 | "outputs": [], 79 | "source": [ 80 | "'''Load train data. Note that this data will be further split into the training set and validation set via random sampling.'''\n", 81 | "train_records = np.vstack([np.load('day' + str(day) + '_records.npy') for day in range(1,6)]) \n" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "'''Cluster users in train data according to user features (excluding constant feature)'''\n", 91 | "from sklearn.cluster import MiniBatchKMeans\n", 92 | "n_userClusters = 10000\n", 93 | "user_kmeans = MiniBatchKMeans(n_clusters=n_userClusters,random_state=0,batch_size=3000, init_size= n_userClusters)\n", 94 | "user_kmeans.fit(train_records[:,:5])\n", 95 | "filename = 'train_userCluster.p'\n", 96 | "pickle.dump(user_kmeans,open(filename,'wb'))" 97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": null, 102 | "metadata": {}, 103 | "outputs": [], 104 | "source": [ 105 | "'''Load test data.'''\n", 106 | "test_records = np.vstack([np.load('day' + str(day) + '_records.npy') for day in range(6,11)]) " 107 | ] 108 | }, 109 | { 110 | "cell_type": "code", 111 | "execution_count": null, 112 | "metadata": {}, 113 | "outputs": [], 114 | "source": [ 115 | "'''Cluster test users via cluster trained using training set user data '''\n", 116 | "\n", 117 | "filename = 'train_userCluster.p'\n", 118 | "user_kmeans = pickle.load(open(filename,'rb'))\n", 119 | "test_usertype_predictions = user_kmeans.predict(test_records[:,:5])\n", 120 | "\n", 121 | "filename = 'test_userTypePredictions.npy'\n", 122 | "np.save(filename,test_usertype_predictions)" 123 | ] 124 | } 125 | ], 126 | "metadata": { 127 | "kernelspec": { 128 | "display_name": "Python 3", 129 | "language": "python", 130 | "name": "python3" 131 | }, 132 | "language_info": { 133 | "codemirror_mode": { 134 | "name": "ipython", 135 | "version": 3 136 | }, 137 | "file_extension": ".py", 138 | "mimetype": "text/x-python", 139 | "name": "python", 140 | "nbconvert_exporter": "python", 141 | "pygments_lexer": "ipython3", 142 | "version": "3.6.3" 143 | } 144 | }, 145 | "nbformat": 4, 146 | "nbformat_minor": 2 147 | } 148 
| -------------------------------------------------------------------------------- /Applications/Yahoo News/Data Preprocessing/DataPreprocessing2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import YahooNewsDataExtraction as tool\n", 10 | "from sklearn.cluster import KMeans\n", 11 | "import numpy as np\n", 12 | "import pandas as pd\n", 13 | "import pickle" 14 | ] 15 | }, 16 | { 17 | "cell_type": "code", 18 | "execution_count": null, 19 | "metadata": {}, 20 | "outputs": [], 21 | "source": [ 22 | "threshold = 50\n", 23 | "train_val_split = 0.5" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "# Training set Processing" 31 | ] 32 | }, 33 | { 34 | "cell_type": "code", 35 | "execution_count": null, 36 | "metadata": {}, 37 | "outputs": [], 38 | "source": [ 39 | "'''Read in train data'''\n", 40 | "'''Col0 - Col4 are features, Col5 is adID, Col6 is click binary indicator.'''\n", 41 | "train_records = np.vstack([np.load('day' + str(day) + '_records.npy') for day in range(1,6)]) \n", 42 | "\n", 43 | "\n", 44 | "'''Read in ads dictionary.'''\n", 45 | "train_adsDict = dict()\n", 46 | "for day in range(1,6):\n", 47 | " train_adsDict.update(pickle.load(open('day' + str(day) + '_adsDict.p','rb')))\n", 48 | "\n", 49 | " \n", 50 | "'''Add user type to each user interaction record.'''\n", 51 | "filename = 'train_userCluster.p'\n", 52 | "user_kmeans = pickle.load(open(filename,'rb'))\n", 53 | "train_recordsDF = pd.DataFrame(np.hstack([train_records,user_kmeans.labels_.reshape(-1,1)])).rename(columns \n", 54 | " ={5:'adID', 6: 'click',7: 'userType'})" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": null, 60 | "metadata": {}, 61 | "outputs": [], 62 | "source": [ 63 | "'''Cluster articles into different types.'''\n", 64 | "train_adsDF = pd.DataFrame(train_adsDict).T\n", 65 | "n_adsClusters = 7\n", 66 | "ads_kmeans = KMeans(n_clusters=n_adsClusters, random_state=300)\n", 67 | "ads_kmeans.fit(train_adsDF)\n", 68 | "ads_kmeans.labels_\n", 69 | "train_adsType = dict(zip(train_adsDF.dropna().index, ads_kmeans.labels_))\n", 70 | "\n", 71 | "filename = 'train_adsCluster_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.p'\n", 72 | "pickle.dump(ads_kmeans,open(filename,'wb'))" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": null, 78 | "metadata": {}, 79 | "outputs": [], 80 | "source": [ 81 | "'''Add ad type to each user interaction record.'''\n", 82 | "train_validation_recordDFwType= pd.concat([train_recordsDF, \n", 83 | " train_recordsDF['adID'].map(train_adsType).rename('adsType')],axis =1).dropna()\n" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": null, 89 | "metadata": {}, 90 | "outputs": [], 91 | "source": [ 92 | "'''Split train data into training and validation sets. 
Length ratio is given by train_val_split.'''\n", 93 | "from sklearn.model_selection import train_test_split\n", 94 | "train_recordDFwType, validation_recordDFwType, = train_test_split(train_validation_recordDFwType,\n", 95 | " test_size=train_val_split, random_state=42)" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": null, 101 | "metadata": {}, 102 | "outputs": [], 103 | "source": [ 104 | "'''Calculate training set click probability for each pair of user type and article type in training set.'''\n", 105 | "train_recordDFwType['adsType'] = train_recordDFwType['adsType'].astype(int)\n", 106 | "train_Y_clickProb = train_recordDFwType[['click','userType','adsType']].groupby(['userType','adsType'])['click'].agg({'clickProb':'mean',\n", 107 | " 'n_obs':'count'})" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "'''Drop article type 3 and several interaction records so average click probabilities for \n", 117 | "each user and article type were calculated with at least 50 interaction records. '''\n", 118 | "\n", 119 | "train_tmp = train_Y_clickProb['n_obs'].unstack().drop([3], axis =1).min(axis = 1)\n", 120 | "train_tmp_index = train_tmp[train_tmp >= threshold].index" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": null, 126 | "metadata": {}, 127 | "outputs": [], 128 | "source": [ 129 | "filename = 'filtered_train_clickprob_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.npy'\n", 130 | "np.save(filename, train_Y_clickProb['clickProb'].unstack().loc[train_tmp_index].drop([3],axis = 1).values) \n", 131 | "\n", 132 | "\n", 133 | "filename = 'filtered_train_usernumobserv_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.npy'\n", 134 | "np.save(filename, train_recordDFwType.groupby('userType').size().loc[train_tmp_index].values) \n", 135 | "\n", 136 | "\n", 137 | "train_X_user = train_recordDFwType.groupby(['userType']).mean().drop(['adID','click','adsType'],axis =1)\n", 138 | "filename = 'filtered_train_userFeat_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.npy'\n", 139 | "np.save(filename, train_X_user.loc[train_tmp_index].values)\n", 140 | "\n", 141 | "\n", 142 | "filename = 'filtered_train_featMap_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.npy'\n", 143 | "pickle.dump(dict(zip(list(train_X_user.index), train_X_user.values)), open(filename, 'wb'))\n" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "# Validation set Processing" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": null, 156 | "metadata": {}, 157 | "outputs": [], 158 | "source": [ 159 | "filename = 'filtered_train_featMap_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.npy'\n", 160 | "train_featMap = pickle.load(open(filename, 'rb'))\n", 161 | "validation_recordDFwType['adsType'] = validation_recordDFwType['adID'].map(train_adsType)" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": {}, 168 | "outputs": [], 169 | "source": [ 170 | "'''Calculate training set click probability for each pair of user type and article type in validation set.'''\n", 171 | "validation_Y_clickProb = validation_recordDFwType[['click','userType','adsType']].groupby(['userType','adsType'])['click'].agg({'clickProb':'mean',\n", 172 | " 'n_obs':'count'})" 173 | ] 174 | }, 175 | { 176 | "cell_type": "code", 177 | 
"execution_count": null, 178 | "metadata": {}, 179 | "outputs": [], 180 | "source": [ 181 | "'''Drop article type 3 and several interaction records so average click probabilities for \n", 182 | "each user and article type were calculated with at least 50 interaction records. '''\n", 183 | "validation_tmp = validation_Y_clickProb['n_obs'].unstack().drop([3], axis =1).min(axis = 1)\n", 184 | "validation_tmp_index = validation_tmp[validation_tmp >= threshold].index" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": null, 190 | "metadata": {}, 191 | "outputs": [], 192 | "source": [ 193 | "filename = 'filtered_validation_clickprob_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.npy'\n", 194 | "np.save(filename, validation_Y_clickProb['clickProb'].unstack().loc[validation_tmp_index].drop([3],axis = 1).values) \n", 195 | "\n", 196 | "filename = 'filtered_validation_usernumobserv_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.npy'\n", 197 | "np.save(filename, validation_recordDFwType.groupby('userType').size().loc[validation_tmp_index].values) \n", 198 | "\n", 199 | "validation_X_user = np.vstack([train_featMap[x] for x in list(validation_tmp_index)])\n", 200 | "filename = 'filtered_validation_userFeat_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.npy'\n", 201 | "np.save(filename, validation_X_user) " 202 | ] 203 | }, 204 | { 205 | "cell_type": "markdown", 206 | "metadata": {}, 207 | "source": [ 208 | "# Test set Processing" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": null, 214 | "metadata": {}, 215 | "outputs": [], 216 | "source": [ 217 | "'''Read in test data'''\n", 218 | "'''Col0 - Col4 are features, Col5 is adID, Col6 is click binary indicator.'''\n", 219 | "test_records = np.vstack([np.load('day' + str(day) + '_records.npy') for day in range(6,11)]) \n", 220 | "\n", 221 | "'''Read in ads dictionary.'''\n", 222 | "test_adsDict = dict()\n", 223 | "for day in range(6,11):\n", 224 | " test_adsDict.update(pickle.load(open('day' + str(day) + '_adsDict.p','rb')))\n", 225 | "\n", 226 | "filename = 'train_userCluster.p'\n", 227 | "user_kmeans = pickle.load(open(filename,'rb'))\n", 228 | "\n", 229 | "filename = 'test_userTypePredictions.npy'\n", 230 | "test_usertype_predictions = np.load(filename)\n", 231 | "\n", 232 | "filename = 'train_adsCluster_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.p'\n", 233 | "ads_kmeans = pickle.load(open(filename,'rb'))\n", 234 | "\n", 235 | "filename = 'filtered_train_featMap_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.npy'\n", 236 | "train_featMap = pickle.load(open(filename, 'rb'))" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": null, 242 | "metadata": {}, 243 | "outputs": [], 244 | "source": [ 245 | "test_adsDF = pd.DataFrame(test_adsDict).T\n", 246 | "test_recordDFwType = pd.DataFrame(np.hstack([test_records,test_usertype_predictions.reshape(-1,1)])).rename(columns = \n", 247 | " {5:'adID', 6: 'click',7: 'userType'})\n", 248 | "test_adsType = dict(zip(test_adsDF.index, ads_kmeans.predict(test_adsDF)))\n", 249 | "test_recordDFwType['adsType'] = test_recordDFwType['adID'].map(test_adsType)" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": null, 255 | "metadata": {}, 256 | "outputs": [], 257 | "source": [ 258 | "'''Calculate training set click probability for each pair of user type and article type in test set.'''\n", 259 | "test_Y_clickProb = 
test_recordDFwType[['click','userType','adsType']].groupby(['userType','adsType'])['click'].agg({'clickProb':'mean',\n", 260 | " 'n_obs':'count'})" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": null, 266 | "metadata": {}, 267 | "outputs": [], 268 | "source": [ 269 | "'''Drop article type 3 and several interaction records. '''\n", 270 | "test_tmp = test_Y_clickProb['n_obs'].unstack().drop([3], axis =1).min(axis = 1)\n", 271 | "test_tmp_index = test_tmp[test_tmp >= threshold].index\n" 272 | ] 273 | }, 274 | { 275 | "cell_type": "code", 276 | "execution_count": null, 277 | "metadata": {}, 278 | "outputs": [], 279 | "source": [ 280 | "filename = 'filtered_test_clickprob_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.npy'\n", 281 | "np.save(filename, test_Y_clickProb['clickProb'].unstack().loc[test_tmp_index].drop([3],axis = 1).values) \n", 282 | "\n", 283 | "filename = 'filtered_test_usernumobserv_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.npy'\n", 284 | "np.save(filename, test_recordDFwType.groupby('userType').size().loc[test_tmp_index].values) \n", 285 | "\n", 286 | "test_X_user = np.vstack([train_featMap[x] for x in list(test_tmp_index)])\n", 287 | "filename = 'filtered_test_userFeat_' + str(train_val_split * 100) + '%'+'_'+ str(threshold) + '.npy'\n", 288 | "np.save(filename, test_X_user) " 289 | ] 290 | } 291 | ], 292 | "metadata": { 293 | "kernelspec": { 294 | "display_name": "Python 3", 295 | "language": "python", 296 | "name": "python3" 297 | }, 298 | "language_info": { 299 | "codemirror_mode": { 300 | "name": "ipython", 301 | "version": 3 302 | }, 303 | "file_extension": ".py", 304 | "mimetype": "text/x-python", 305 | "name": "python", 306 | "nbconvert_exporter": "python", 307 | "pygments_lexer": "ipython3", 308 | "version": "3.6.3" 309 | } 310 | }, 311 | "nbformat": 4, 312 | "nbformat_minor": 2 313 | } 314 | -------------------------------------------------------------------------------- /Applications/Yahoo News/Data Preprocessing/README.md: -------------------------------------------------------------------------------- 1 | # Yahoo News Dataset Preprocessing 2 | 3 | Please follow the steps below to obtain the preprocessed Yahoo! Front Page Today Module dataset used in our numerical experiments: 4 | * Download the raw Yahoo! Front Page Today dataset at https://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did=49. A license must be obtained from Yahoo to access the dataset. The data may be used only for academic research purposes and may not be used for any commercial purposes or by any commercial entity. 5 | * Run DataPreprocessing1.ipynb. This may take at least 24 hours to complete. 6 | * Run DataPreprocessing2.ipynb. 
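* Note on file locations: the first processing cell of DataPreprocessing1.ipynb reads the ten raw daily files via relative paths of the form `../ydata-fp-td-clicks-v1_0.200905XX` (days 01 through 10), so place the raw files one directory above the notebooks or edit the `filename` assignments accordingly. Running both notebooks produces the `filtered_train_*`, `filtered_validation_*`, and `filtered_test_*` `.npy` files that the scripts in the parent Yahoo News folder (e.g., SPOForest_news.py) load.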
-------------------------------------------------------------------------------- /Applications/Yahoo News/Data Preprocessing/YahooNewsDataExtraction.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | 3 | #given vector of strings like ['2:0.306008', '3:0.000450', '4:0.077048', '5:0.230439', '6:0.386055', '1:1.000000'], 4 | #extract features into numpy vector 5 | def extract_features(strv): 6 | feat = np.array([0.0]*6) 7 | for i in range(6): 8 | tmp = strv[i].split(":") 9 | feat_index = int(tmp[0]) 10 | feat_value = float(tmp[1]) 11 | feat[feat_index-1] = feat_value 12 | return(feat) 13 | 14 | 15 | #returns timestamp, offered ad id, click (binary), user features, vector of eligible ad ids, vector of eligible ad id features 16 | def parse_line(line): 17 | #remove \n at the end of the line 18 | line = line.strip() 19 | line = line.split("|") 20 | 21 | #get timestamp, offered ad id, click (binary) 22 | decision_info = line[0].split(" ") 23 | timestamp = int(decision_info[0]) 24 | offered_ad_id = int(decision_info[1]) 25 | click = int(decision_info[2]) 26 | 27 | #get user features 28 | user_info = line[1].split(" ")[1:] 29 | user_feat = extract_features(user_info) 30 | 31 | #get eligible ads + features 32 | ad_info = line[2].split(" ") 33 | eligible_ads_ids = np.array([int(ad_info[0])]) 34 | eligible_ads_feat = np.array([extract_features(ad_info[1:])]) 35 | for i in range(3, len(line)): 36 | ad_info = line[i].split(" ") 37 | if len(ad_info[1:]) >= 6: 38 | eligible_ads_ids = np.append(eligible_ads_ids,np.array([int(ad_info[0])])) 39 | eligible_ads_feat = np.append(eligible_ads_feat,np.array([extract_features(ad_info[1:])]), axis=0) 40 | #else: 41 | #print('Ad Feature Amomaly: ' + line[i]) 42 | 43 | sorted_inds = np.argsort(eligible_ads_ids) 44 | eligible_ads_ids = eligible_ads_ids[sorted_inds] 45 | eligible_ads_feat = eligible_ads_feat[sorted_inds] 46 | 47 | return(timestamp, offered_ad_id, click, user_feat, eligible_ads_ids, eligible_ads_feat) 48 | 49 | #Read Data 50 | path="/R6/ydata-fp-td-clicks-v1_0.20090501" 51 | 52 | 53 | 54 | 55 | -------------------------------------------------------------------------------- /Applications/Yahoo News/SPOForest_news.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Runs random forest algorithm on Yahoo News dataset 3 | This code considers two methods of aggregating individual tree predictions to obtain a forest decision: 4 | - "mean": averages cost predictions for each tree in the forest; outputs decision associated with average cost 5 | - "mode": outputs the mode decision recommended by the trees in the forest 6 | Outputs decision costs for each test-set instance as pickle file 7 | Takes multiple input arguments: 8 | (1) max_depth_set_str: sequence of training depths tuned using cross validation, e.g. "2-4-5" 9 | (2) min_samples_leaf_set_str: sequence of "min. (weighted) observations per leaf" tuned using cross validation, e.g. "20-50-100" 10 | (3) n_estimators_set_str: sequence of number of trees in forest tuned using cross validation, e.g. "20-50-100" 11 | (4) max_features_set_str: sequence of number of features used in feature bagging tuned using cross validation, e.g. 
"2-3-4" 12 | (5) algtype: set equal to "MSE" (CART forest) or "SPO" (SPOT forest) 13 | (6) number of workers to use in parallel processing (i.e., fitting individual trees in the forest in parallel) 14 | (7) decision_problem_seed: seed controlling generation of constraints in article recommendation problem (-1 = no constraints) 15 | (8) train_size: number of random obs. to extract from the training data. 16 | Only useful in limiting the size of the training data (-1 = use full training data) 17 | (9) quant_discret: continuous variable split points in the trees are chosen from quantiles of the variable corresponding to quant_discret,2*quant_discret,3*quant_discret, etc.. 18 | Values of input arguments used in paper: 19 | (1) max_depth_set_str: "1000" 20 | (2) min_samples_leaf_set_str: "10000" 21 | (3) n_estimators_set_str: "50" 22 | (4) max_features_set_str: "2-3-4-5" 23 | (5) algtype: "MSE" for CART forest, "SPO" for SPOT forest 24 | (6) number of workers to use in parallel processing: 10 25 | (7) decision_problem_seed: ran 9 constraint instances corresponding to seeds of 10, 11, 12, ..., 18 26 | (8) train_size: -1 27 | (9) quant_discret: 0.05 28 | ''' 29 | 30 | import time 31 | 32 | import numpy as np 33 | import pickle 34 | from gurobipy import* 35 | from decision_problem_solver import* 36 | from SPOForest import SPOForest 37 | import sys 38 | 39 | ######################################## 40 | #SEEDS FOR RANDOM NUMBER GENERATORS 41 | #seed for rngs in random forest 42 | forest_seed = 0 43 | #seed controlling random subset of training data used (if full training data is not being used) 44 | select_train_seed = 0 45 | ######################################## 46 | #training parameters 47 | max_depth_set_str = sys.argv[1] 48 | max_depth_set=[int(k) for k in max_depth_set_str.split('-')]#[None] 49 | min_samples_leaf_set_str = sys.argv[2] 50 | min_samples_leaf_set=[int(k) for k in min_samples_leaf_set_str.split('-')]#[5] 51 | n_estimators_set_str = sys.argv[3] 52 | n_estimators_set=[int(k) for k in n_estimators_set_str.split('-')]#[100,500] 53 | max_features_set_str = sys.argv[4] 54 | max_features_set=[int(k) for k in max_features_set_str.split('-')]#[3] 55 | algtype=sys.argv[5] #either "MSE" or "SPO" 56 | #number of workers 57 | if sys.argv[6] == "1": 58 | run_in_parallel = False 59 | num_workers = None 60 | else: 61 | run_in_parallel = True 62 | num_workers = int(sys.argv[6]) 63 | #ineligible_seeds = [27,28,29,32,39] #considered range = 10-40 inclusive 64 | decision_problem_seed=int(sys.argv[7]) #if -1, use no constraints in decision problem 65 | train_size=int(sys.argv[8]) #if you want to limit the size of the training data (-1 = no limit) 66 | quant_discret=float(sys.argv[9]) 67 | ######################################## 68 | #output filename 69 | fname_out_mean = algtype+"Forest_news_costs_depthSet"+max_depth_set_str+"_minObsSet"+min_samples_leaf_set_str+"_nEstSet"+n_estimators_set_str+"_mFeatSet"+max_features_set_str+"_aMethod"+"mean"+"_probSeed"+str(decision_problem_seed)+"_nTrain"+str(train_size)+"_qd"+str(quant_discret)+".pkl"; 70 | fname_out_mode = algtype+"Forest_news_costs_depthSet"+max_depth_set_str+"_minObsSet"+min_samples_leaf_set_str+"_nEstSet"+n_estimators_set_str+"_mFeatSet"+max_features_set_str+"_aMethod"+"mode"+"_probSeed"+str(decision_problem_seed)+"_nTrain"+str(train_size)+"_qd"+str(quant_discret)+".pkl"; 71 | ############################################################################# 72 | ############################################################################# 73 | 
############################################################################# 74 | 75 | #generate decision problem 76 | num_constr = 5 #number of Aw <= b constraints 77 | num_dec = 6 #number of decisions 78 | #ineligible_seeds = [27,28,29,32,39] #considered range = 10-40 inclusive 79 | if decision_problem_seed == -1: 80 | #no budget constraint case 81 | A_constr = np.zeros((num_constr,num_dec)) 82 | b_constr = np.ones(num_constr) 83 | l_constr = np.zeros(num_dec) 84 | u_constr = np.ones(num_dec) 85 | else: 86 | np.random.seed(decision_problem_seed) 87 | A_constr = np.random.exponential(scale=1.0, size=(num_constr,num_dec)) 88 | b_constr = np.ones(num_constr) 89 | l_constr = np.zeros(num_dec) 90 | u_constr = np.ones(num_dec) 91 | 92 | ############################################## 93 | 94 | thresh = "50" 95 | valid_size = "50.0%" 96 | 97 | train_x = np.load('filtered_train_userFeat_'+valid_size+'_'+thresh+'.npy') 98 | valid_x = np.load('filtered_validation_userFeat_'+valid_size+'_'+thresh+'.npy') 99 | test_x = np.load('filtered_test_userFeat_'+valid_size+'_'+thresh+'.npy') 100 | 101 | #make negative to turn into minimization problem 102 | train_cost = np.load('filtered_train_clickprob_'+valid_size+'_'+thresh+'.npy')*-1.0 103 | valid_cost = np.load('filtered_validation_clickprob_'+valid_size+'_'+thresh+'.npy')*-1.0 104 | test_cost = np.load('filtered_test_clickprob_'+valid_size+'_'+thresh+'.npy')*-1.0 105 | 106 | train_weights = np.load('filtered_train_usernumobserv_'+valid_size+'_'+thresh+'.npy') 107 | valid_weights = np.load('filtered_validation_usernumobserv_'+valid_size+'_'+thresh+'.npy') 108 | test_weights = np.load('filtered_test_usernumobserv_'+valid_size+'_'+thresh+'.npy') 109 | 110 | ############################################## 111 | #limit size of training data if specified 112 | if train_size != -1 and train_size <= train_x.shape[0] and train_size <= valid_x.shape[0]: 113 | np.random.seed(select_train_seed) 114 | sel_inds = np.random.choice(range(train_x.shape[0]), size = train_size, replace=False) 115 | train_x = train_x[sel_inds] 116 | train_cost = train_cost[sel_inds] 117 | train_weights = train_weights[sel_inds] 118 | sel_inds = np.random.choice(range(valid_x.shape[0]), size = train_size, replace=False) 119 | valid_x = valid_x[sel_inds] 120 | valid_cost = valid_cost[sel_inds] 121 | valid_weights = valid_weights[sel_inds] 122 | 123 | ############################################################################# 124 | ############################################################################# 125 | ############################################################################# 126 | 127 | def forest_traintest(train_x,train_cost,train_weights,test_x,test_cost,test_weights,n_estimators,max_depth,min_samples_leaf,max_features,run_in_parallel,num_workers,algtype, A_constr, b_constr, l_constr, u_constr): 128 | if algtype == "MSE": 129 | SPO_weight_param=0.0 130 | elif algtype == "SPO": 131 | SPO_weight_param=1.0 132 | regr = SPOForest(n_estimators=n_estimators,run_in_parallel=run_in_parallel,num_workers=num_workers, 133 | max_depth=max_depth, min_weights_per_node=min_samples_leaf, quant_discret=quant_discret, debias_splits=False, 134 | max_features=max_features, 135 | SPO_weight_param=SPO_weight_param, SPO_full_error=True) 136 | regr.fit(train_x, train_cost, train_weights, verbose_forest=True, verbose=False, feats_continuous=True, seed=forest_seed, 137 | A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr) 138 | pred_decision_mean = regr.est_decision(test_x, 
method="mean") 139 | pred_decision_mode = regr.est_decision(test_x, method="mode") 140 | alg_costs_mean = [np.sum(test_cost[i] * pred_decision_mean[i]) for i in range(0,len(pred_decision_mean))] 141 | alg_costs_mode = [np.sum(test_cost[i] * pred_decision_mode[i]) for i in range(0,len(pred_decision_mode))] 142 | return regr, np.dot(alg_costs_mean,test_weights)/np.sum(test_weights), np.dot(alg_costs_mode,test_weights)/np.sum(test_weights) 143 | 144 | def forest_tuneparams(train_x,train_cost,train_weights,valid_x,valid_cost,valid_weights,n_estimators_set,max_depth_set,min_samples_leaf_set,max_features_set,run_in_parallel,num_workers,algtype, A_constr, b_constr, l_constr, u_constr): 145 | best_err_mean = np.float("inf") 146 | best_err_mode = np.float("inf") 147 | for n_estimators in n_estimators_set: 148 | for max_depth in max_depth_set: 149 | for min_samples_leaf in min_samples_leaf_set: 150 | for max_features in max_features_set: 151 | regr, err_mean, err_mode = forest_traintest(train_x,train_cost,train_weights,valid_x,valid_cost,valid_weights,n_estimators,max_depth,min_samples_leaf,max_features,run_in_parallel,num_workers,algtype, A_constr, b_constr, l_constr, u_constr) 152 | if err_mean <= best_err_mean: 153 | best_regr_mean, best_err_mean, best_n_estimators_mean,best_max_depth_mean,best_min_samples_leaf_mean,best_max_features_mean = regr, err_mean, n_estimators,max_depth,min_samples_leaf,max_features 154 | if err_mode <= best_err_mode: 155 | best_regr_mode, best_err_mode, best_n_estimators_mode,best_max_depth_mode,best_min_samples_leaf_mode,best_max_features_mode = regr, err_mode, n_estimators,max_depth,min_samples_leaf,max_features 156 | 157 | print("Best n_estimators (mean method): " + str(best_n_estimators_mean)) 158 | print("Best max_depth (mean method): " + str(best_max_depth_mean)) 159 | print("Best min_samples_leaf (mean method): " + str(best_min_samples_leaf_mean)) 160 | print("Best max_features (mean method): " + str(best_max_features_mean)) 161 | 162 | print("Best n_estimators (mode method): " + str(best_n_estimators_mode)) 163 | print("Best max_depth (mode method): " + str(best_max_depth_mode)) 164 | print("Best min_samples_leaf (mode method): " + str(best_min_samples_leaf_mode)) 165 | print("Best max_features (mode method): " + str(best_max_features_mode)) 166 | 167 | return best_regr_mean, best_err_mean, best_n_estimators_mean,best_max_depth_mean,best_min_samples_leaf_mean,best_max_features_mean, best_regr_mode, best_err_mode, best_n_estimators_mode,best_max_depth_mode,best_min_samples_leaf_mode,best_max_features_mode 168 | 169 | start = time.time() 170 | 171 | #FIT FOREST 172 | regr_mean,_,_,_,_,_,regr_mode,_,_,_,_,_ = forest_tuneparams(train_x,train_cost,train_weights,valid_x,valid_cost,valid_weights,n_estimators_set,max_depth_set,min_samples_leaf_set,max_features_set, run_in_parallel, num_workers, algtype, A_constr, b_constr, l_constr, u_constr) 173 | 174 | end = time.time() 175 | print "Elapsed time: " + str(end-start) 176 | 177 | #FIND TEST SET COST 178 | pred_decision_mean = regr_mean.est_decision(test_x, method="mean") 179 | pred_decision_mode = regr_mode.est_decision(test_x, method="mode") 180 | costs_mean = [np.sum(test_cost[i] * pred_decision_mean[i]) for i in range(0,len(pred_decision_mean))] 181 | costs_mode = [np.sum(test_cost[i] * pred_decision_mode[i]) for i in range(0,len(pred_decision_mode))] 182 | 183 | print "Average test set cost (mean method) (max is better): " + str(-1.0*np.mean(costs_mean)) 184 | print "Average test set weighted cost (mean method) (max 
is better): " + str(-1.0*np.dot(costs_mean,test_weights)/np.sum(test_weights)) 185 | print "Average test set cost (mode method) (max is better): " + str(-1.0*np.mean(costs_mode)) 186 | print "Average test set weighted cost (mode method) (max is better): " + str(-1.0*np.dot(costs_mode,test_weights)/np.sum(test_weights)) 187 | 188 | with open(fname_out_mean, 'wb') as output: 189 | pickle.dump(costs_mean, output, pickle.HIGHEST_PROTOCOL) 190 | with open(fname_out_mode, 'wb') as output: 191 | pickle.dump(costs_mode, output, pickle.HIGHEST_PROTOCOL) 192 | 193 | # Getting back the objects: 194 | #with open(fname_out, 'rb') as input: 195 | # costs_deg = pickle.load(input) -------------------------------------------------------------------------------- /Applications/Yahoo News/SPOgreedy_news.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Runs SPOT (greedy) / CART algorithm on Yahoo News dataset 3 | Outputs algorithm decision costs for each test-set instance as pickle file 4 | Also outputs optimal decision costs for each test-set instance as pickle file 5 | Takes multiple input arguments: 6 | (1) max_depth: training depth of tree, e.g. "5" 7 | (2) min_weights_per_node: min. number of (weighted) observations per leaf, e.g. "100" 8 | (3) algtype: set equal to "MSE" (CART) or "SPO" (SPOT greedy) 9 | (4) decision_problem_seed: seed controlling generation of constraints in article recommendation problem (-1 = no constraints) 10 | (5) train_size: number of random obs. to extract from the training data. 11 | Only useful in limiting the size of the training data (-1 = use full training data) 12 | (6) quant_discret: continuous variable split points in the tree is chosen from quantiles of the variable corresponding to quant_discret,2*quant_discret,3*quant_discret, etc.. 
13 | Values of input arguments used in paper: 14 | (1) max_depth: considered depths of 2, 4, 6, 1000 15 | (2) min_weights_per_node: "10000" 16 | (3) algtype: "MSE" (CART) or "SPO" (SPOT greedy) 17 | (4) decision_problem_seed: ran 9 constraint instances corresponding to seeds of 10, 11, 12, ..., 18 18 | (5) train_size: -1 19 | (6) quant_discret: 0.05 20 | ''' 21 | 22 | import time 23 | 24 | import numpy as np 25 | import pickle 26 | from decision_problem_solver import* 27 | from SPO_tree_greedy import SPOTree 28 | import sys 29 | 30 | ############################################## 31 | #seed controlling random subset of training data used (if full training data is not being used) 32 | select_train_seed = 0 33 | ######################################## 34 | #training parameters 35 | max_depth = int(sys.argv[1])#3 36 | min_weights_per_node = int(sys.argv[2])#20 37 | algtype = sys.argv[3] #either "MSE" or "SPO" 38 | #ineligible_seeds = [27,28,29,32,39] #considered range = 10-40 inclusive 39 | decision_problem_seed=int(sys.argv[4]) #if -1, use no constraints in decision problem 40 | train_size=int(sys.argv[5]) #if you want to limit the size of the training data (-1 = no limit) 41 | quant_discret=float(sys.argv[6]) 42 | ######################################## 43 | #output filename for alg 44 | fname_out = algtype+"greedy_news_costs_depth"+str(max_depth)+"_minObs"+str(min_weights_per_node)+"_probSeed"+str(decision_problem_seed)+"_nTrain"+str(train_size)+"_qd"+str(quant_discret)+".pkl"; 45 | #output filename for opt costs 46 | fname_out_opt = "Opt_news_costs"+"_probSeed"+str(decision_problem_seed)+".pkl"; 47 | ############################################################################# 48 | ############################################################################# 49 | ############################################################################# 50 | 51 | #generate decision problem 52 | #ineligible_seeds = [27,28,29,32,39] #considered range = 10-40 inclusive 53 | num_constr = 5 #number of Aw <= b constraints 54 | num_dec = 6 #number of decisions 55 | if decision_problem_seed == -1: 56 | #no budget constraint case 57 | A_constr = np.zeros((num_constr,num_dec)) 58 | b_constr = np.ones(num_constr) 59 | l_constr = np.zeros(num_dec) 60 | u_constr = np.ones(num_dec) 61 | else: 62 | np.random.seed(decision_problem_seed) 63 | A_constr = np.random.exponential(scale=1.0, size=(num_constr,num_dec)) 64 | b_constr = np.ones(num_constr) 65 | l_constr = np.zeros(num_dec) 66 | u_constr = np.ones(num_dec) 67 | 68 | ############################################## 69 | 70 | thresh = "50" 71 | valid_size = "50.0%" 72 | 73 | train_x = np.load('filtered_train_userFeat_'+valid_size+'_'+thresh+'.npy') 74 | valid_x = np.load('filtered_validation_userFeat_'+valid_size+'_'+thresh+'.npy') 75 | test_x = np.load('filtered_test_userFeat_'+valid_size+'_'+thresh+'.npy') 76 | 77 | #make negative to turn into minimization problem 78 | train_cost = np.load('filtered_train_clickprob_'+valid_size+'_'+thresh+'.npy')*-1.0 79 | valid_cost = np.load('filtered_validation_clickprob_'+valid_size+'_'+thresh+'.npy')*-1.0 80 | test_cost = np.load('filtered_test_clickprob_'+valid_size+'_'+thresh+'.npy')*-1.0 81 | 82 | train_weights = np.load('filtered_train_usernumobserv_'+valid_size+'_'+thresh+'.npy') 83 | valid_weights = np.load('filtered_validation_usernumobserv_'+valid_size+'_'+thresh+'.npy') 84 | test_weights = np.load('filtered_test_usernumobserv_'+valid_size+'_'+thresh+'.npy') 85 | 86 | ############################################## 
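# Note on the sign convention above (illustrative numbers only): the loaded click probabilities are
# multiplied by -1.0 so that article recommendation becomes a cost-minimization problem. For example,
# with two hypothetical candidate articles having click probabilities (0.2, 0.5), the cost vector is
# (-0.2, -0.5); the decision w = (0, 1) then has cost -0.5, and the evaluation code further below
# reports -1.0 * (-0.5) = 0.5, i.e., the expected click probability (hence "max is better" in the
# printed output).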
87 | #limit size of training data if specified 88 | if train_size != -1 and train_size <= train_x.shape[0] and train_size <= valid_x.shape[0]: 89 | np.random.seed(select_train_seed) 90 | sel_inds = np.random.choice(range(train_x.shape[0]), size = train_size, replace=False) 91 | train_x = train_x[sel_inds] 92 | train_cost = train_cost[sel_inds] 93 | train_weights = train_weights[sel_inds] 94 | sel_inds = np.random.choice(range(valid_x.shape[0]), size = train_size, replace=False) 95 | valid_x = valid_x[sel_inds] 96 | valid_cost = valid_cost[sel_inds] 97 | valid_weights = valid_weights[sel_inds] 98 | 99 | ############################################################################# 100 | ############################################################################# 101 | ############################################################################# 102 | 103 | start = time.time() 104 | 105 | #FIT ALGORITHM 106 | #SPO_weight_param: number between 0 and 1: 107 | #Error metric: SPO_loss*SPO_weight_param + MSE_loss*(1-SPO_weight_param) 108 | if algtype == "MSE": 109 | SPO_weight_param=0.0 110 | elif algtype == "SPO": 111 | SPO_weight_param=1.0 112 | my_tree = SPOTree(max_depth = max_depth, min_weights_per_node = min_weights_per_node, quant_discret = quant_discret, debias_splits=False, SPO_weight_param=SPO_weight_param, SPO_full_error=True) 113 | my_tree.fit(train_x,train_cost,weights=train_weights,verbose=False,feats_continuous=True, 114 | A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr); #verbose specifies whether fitting procedure should print progress 115 | 116 | my_tree.traverse() #prints out the unpruned tree 117 | 118 | end = time.time() 119 | print "Elapsed time: " + str(end-start) 120 | 121 | ################## 122 | 123 | start = time.time() 124 | 125 | #PRUNE DECISION TREE USING VALIDATION SET 126 | my_tree.prune(valid_x, valid_cost, weights_val=valid_weights, verbose=True, one_SE_rule=False) #verbose specifies whether pruning procedure should print progress 127 | 128 | end = time.time() 129 | print "Elapsed time: " + str(end-start) 130 | 131 | my_tree.traverse() #prints out the pruned tree 132 | 133 | ################## 134 | 135 | #FIND TEST SET ALGORITHM COST AND OPTIMAL COST 136 | opt_decision = find_opt_decision(test_cost, A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr)['weights'] 137 | pred_cost = my_tree.est_cost(test_x) 138 | cost_pred_error = [np.mean(abs(pred_cost[i] - test_cost[i])) for i in range(0,pred_cost.shape[0])] 139 | pred_decision = my_tree.est_decision(test_x) 140 | costs = [np.sum(test_cost[i] * pred_decision[i]) for i in range(0,pred_decision.shape[0])] 141 | optcosts = [np.sum(test_cost[i] * opt_decision[i,:]) for i in range(0,opt_decision.shape[0])] 142 | 143 | print "Average test set cost (max is better): " + str(-1.0*np.mean(costs)) 144 | print "Average test set weighted cost (max is better): " + str(-1.0*np.dot(costs,test_weights)/np.sum(test_weights)) 145 | print "Average optimal test set cost (max is better): " + str(-1.0*np.mean(optcosts)) 146 | print "Average optimal test set weighted cost (max is better): " + str(-1.0*np.dot(optcosts,test_weights)/np.sum(test_weights)) 147 | 148 | with open(fname_out, 'wb') as output: 149 | pickle.dump(costs, output, pickle.HIGHEST_PROTOCOL) 150 | with open(fname_out_opt, 'wb') as output: 151 | pickle.dump(optcosts, output, pickle.HIGHEST_PROTOCOL) -------------------------------------------------------------------------------- /Applications/Yahoo News/SPOopt_news.py: 
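For reference, an illustrative (hypothetical) invocation of SPOopt_news.py using the parameter values documented in its header below would be:

    python SPOopt_news.py 6 10000 3 None 43200 10 -1 0.05

i.e., depth 6, a minimum of 10000 weighted observations per leaf, feature precision 3, CART-style post-pruning in place of a regularization grid, a 43200-second solver limit, constraint seed 10, the full training set, and quant_discret 0.05.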
-------------------------------------------------------------------------------- 1 | ''' 2 | Runs SPOT (MILP) algorithm on Yahoo News dataset 3 | Outputs decision costs for each test-set instance as pickle file 4 | Note on notation: the paper uses r to denote the binary variables which map training observations to leaves. This code uses z rather than r. 5 | Takes multiple input arguments: 6 | (1) H: training depth of tree, e.g. "5" 7 | (2) N_min: min. number of (weighted) observations per leaf, e.g. "100" 8 | (3) train_x_precision: contextual features x are rounded to train_x_precision before fitting MILP (e.g., 2 = two decimal places) 9 | higher values of train_x_precision will be more precise but take more computational time 10 | (4) reg_set_str: sequence of regularization parameters to try (tuned using cross validation), e.g. "0.001-0.01-0.1" 11 | if "None", fits MILP using no regularization and then prunes using CART pruning procedure (with SPO loss as pruning metric) 12 | (5) solver_time_limit: MILP solver is terminated after solver_time_limit seconds, returning best-found solution 13 | (6) decision_problem_seed: seed controlling generation of constraints in article recommendation problem (-1 = no constraints) 14 | (7) train_size: number of random obs. to extract from the training data. 15 | Only useful in limiting the size of the training data (-1 = use full training data) 16 | (8) quant_discret: Used to fit the SPOT tree used in warm-starting the MILP. 17 | Continuous variable split points in the tree are chosen from quantiles of the variable corresponding to quant_discret,2*quant_discret,3*quant_discret, etc.. 18 | Values of input arguments used in paper: 19 | (1) H: considered depths of 2, 4, 6 20 | (2) N_min: "10000" 21 | (3) train_x_precision: 3 22 | (4) reg_set_str: "None" 23 | (5) solver_time_limit: 43200 24 | (6) decision_problem_seed: ran 9 constraint instances corresponding to seeds of 10, 11, 12, ..., 18 25 | (7) train_size: -1 26 | (8) quant_discret: 0.05 27 | ''' 28 | 29 | import time 30 | 31 | import numpy as np 32 | import pickle 33 | from spo_opt_tree_news import* 34 | from SPO2CART import SPO2CART 35 | from SPO_tree_greedy import SPOTree 36 | from decision_problem_solver import* 37 | 38 | import sys 39 | ############################################## 40 | #seed controlling random subset of training data used (if full training data is not being used) 41 | select_train_seed = 0 42 | ######################################## 43 | #training parameters 44 | #optimal tree params 45 | H = int(sys.argv[1])#2 #H = max tree depth 46 | N_min = int(sys.argv[2])#4 #N_min = minimum number of observations per leaf node 47 | #higher values of train_x_precision will be more precise but take more computational time 48 | #values >= 8 might cause numerical errors 49 | train_x_precision = int(sys.argv[3])#2 50 | reg_set_str = sys.argv[4]#"None" 51 | if reg_set_str == "None": 52 | reg_set = None 53 | else: 54 | reg_set = [float(k) for k in reg_set_str.split('-')] #regularization params may be fractional (e.g., "0.001-0.01-0.1"), so parse as floats 55 | #reg_set = [0.001] #if reg_set = None, uses CART to prune tree 56 | solver_time_limit = int(sys.argv[5]) 57 | #ineligible_seeds = [27,28,29,32,39] #considered range = 10-40 inclusive 58 | decision_problem_seed=int(sys.argv[6]) #if -1, use no constraints in decision problem 59 | train_size=int(sys.argv[7]) #if you want to limit the size of the training data (-1 = no limit) 60 | quant_discret=float(sys.argv[8]) 61 | ######################################## 62 | #output filename for alg 63 | fname_out =
"SPOopt_news_costs_depth"+str(H)+"_minObs"+str(N_min)+"_prec"+str(train_x_precision)+"_regset"+reg_set_str+"_tLim"+str(solver_time_limit)+"_probSeed"+str(decision_problem_seed)+"_nTrain"+str(train_size)+"_qd"+str(quant_discret)+".pkl"; 64 | ############################################################################# 65 | ############################################################################# 66 | ############################################################################# 67 | 68 | #generate decision problem 69 | #ineligible_seeds = [27,28,29,32,39] #considered range = 10-40 inclusive 70 | num_constr = 5 #number of Aw <= b constraints 71 | num_dec = 6 #number of decisions 72 | if decision_problem_seed == -1: 73 | #no budget constraint case 74 | A_constr = np.zeros((num_constr,num_dec)) 75 | b_constr = np.ones(num_constr) 76 | l_constr = np.zeros(num_dec) 77 | u_constr = np.ones(num_dec) 78 | else: 79 | np.random.seed(decision_problem_seed) 80 | A_constr = np.random.exponential(scale=1.0, size=(num_constr,num_dec)) 81 | b_constr = np.ones(num_constr) 82 | l_constr = np.zeros(num_dec) 83 | u_constr = np.ones(num_dec) 84 | 85 | ############################################## 86 | 87 | thresh = "50" 88 | valid_size = "50.0%" 89 | 90 | train_x = np.load('filtered_train_userFeat_'+valid_size+'_'+thresh+'.npy') 91 | valid_x = np.load('filtered_validation_userFeat_'+valid_size+'_'+thresh+'.npy') 92 | test_x = np.load('filtered_test_userFeat_'+valid_size+'_'+thresh+'.npy') 93 | 94 | #make negative to turn into minimization problem 95 | train_cost = np.load('filtered_train_clickprob_'+valid_size+'_'+thresh+'.npy')*-1.0 96 | valid_cost = np.load('filtered_validation_clickprob_'+valid_size+'_'+thresh+'.npy')*-1.0 97 | test_cost = np.load('filtered_test_clickprob_'+valid_size+'_'+thresh+'.npy')*-1.0 98 | 99 | train_weights = np.load('filtered_train_usernumobserv_'+valid_size+'_'+thresh+'.npy') 100 | valid_weights = np.load('filtered_validation_usernumobserv_'+valid_size+'_'+thresh+'.npy') 101 | test_weights = np.load('filtered_test_usernumobserv_'+valid_size+'_'+thresh+'.npy') 102 | 103 | ############################################## 104 | #limit size of training data if specified 105 | if train_size != -1 and train_size <= train_x.shape[0] and train_size <= valid_x.shape[0]: 106 | np.random.seed(select_train_seed) 107 | sel_inds = np.random.choice(range(train_x.shape[0]), size = train_size, replace=False) 108 | train_x = train_x[sel_inds] 109 | train_cost = train_cost[sel_inds] 110 | train_weights = train_weights[sel_inds] 111 | sel_inds = np.random.choice(range(valid_x.shape[0]), size = train_size, replace=False) 112 | valid_x = valid_x[sel_inds] 113 | valid_cost = valid_cost[sel_inds] 114 | valid_weights = valid_weights[sel_inds] 115 | 116 | ############################################################################# 117 | ############################################################################# 118 | ############################################################################# 119 | 120 | start = time.time() 121 | 122 | #FIT SPO OPTIMAL TREE 123 | if reg_set is None: 124 | 125 | #FIT SPO GREEDY TREE AS INITIAL SOLUTION 126 | def truncate_train_x(train_x, train_x_precision): 127 | return(np.around(train_x, decimals=train_x_precision)) 128 | train_x_truncated = truncate_train_x(train_x, train_x_precision) 129 | #SPO_weight_param: number between 0 and 1: 130 | #Error metric: SPO_loss*SPO_weight_param + MSE_loss*(1-SPO_weight_param) 131 | my_tree = SPOTree(max_depth = H, 
min_weights_per_node = N_min, quant_discret = quant_discret, debias_splits=False, SPO_weight_param=1.0, SPO_full_error=True) 132 | my_tree.fit(train_x_truncated,train_cost,weights=train_weights,verbose=False,feats_continuous=True, 133 | A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr); 134 | 135 | #PRUNE SPO GREEDY TREE USING TRAINING SET (TO GET RID OF REDUNDANT LEAVES) 136 | my_tree.prune(train_x_truncated, train_cost, weights_val=train_weights, verbose=False, one_SE_rule=False) #verbose specifies whether pruning procedure should print progress 137 | my_tree.traverse() 138 | spo_greedy_a, spo_greedy_b, spo_greedy_z = my_tree.get_tree_encoding(x_train=train_x_truncated) 139 | 140 | #(OPTIONAL) PRUNE SPO GREEDY TREE USING VALIDATION SET 141 | # my_tree.prune(valid_x, valid_cost, verbose=False, one_SE_rule=False) #verbose specifies whether pruning procedure should print progress 142 | # spo_greedy_a, spo_greedy_b, spo_greedy_z = my_tree.get_tree_encoding(x_train=train_x_truncated) 143 | # alpha = my_tree.get_pruning_alpha() 144 | 145 | #FIT SPO OPTIMAL TREE USING FOUND INITIAL SOLUTION 146 | reg_param = 1e-5 #introduce very small amount of regularization to ensure leaves with zero predictive power are aggregated 147 | spo_dt_a, spo_dt_b, spo_dt_w, spo_dt_y, spo_dt_z, spo_dt_l, spo_dt_d = spo_opt_tree(train_cost,train_x,train_x_precision,reg_param, N_min, H, 148 | weights=train_weights, 149 | a_start=spo_greedy_a, z_start=spo_greedy_z, 150 | Presolve=2, Seed=0, TimeLimit=solver_time_limit, 151 | returnAllOptvars=True, 152 | A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr) 153 | 154 | end = time.time() 155 | print "Elapsed time: " + str(end-start) 156 | 157 | #(IF NOT USING POSTPRUNING) FIND TEST SET COST 158 | # path = decision_path(test_x,spo_dt_a,spo_dt_b) 159 | # costs_deg[deg][trial_num] = apply_leaf_decision(test_cost,path,spo_dt_w, subtract_optimal=False, A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr) 160 | 161 | #PRUNE MILP TREE USING CART PRUNING METHOD ON VALIDATION SET 162 | spo2cart = SPO2CART(spo_dt_a, spo_dt_b) 163 | spo2cart.fit(train_x,train_cost,train_x_precision,weights=train_weights,verbose=False,feats_continuous=True, 164 | A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr) 165 | spo2cart.traverse() #prints out the unpruned tree found by MILP 166 | spo2cart.prune(valid_x, valid_cost, weights_val=valid_weights, verbose=False, one_SE_rule=False) #verbose specifies whether pruning procedure should print progress 167 | spo2cart.traverse() #prints out the pruned tree 168 | #FIND TEST SET COST 169 | pred_decision = spo2cart.est_decision(test_x) 170 | costs = [np.sum(test_cost[i] * pred_decision[i]) for i in range(0,len(pred_decision))] 171 | 172 | else: 173 | spo_dt_a, spo_dt_b, spo_dt_w, _, best_alpha = spo_opt_tunealpha(train_x,train_cost,train_weights,valid_x,valid_cost,valid_weights,train_x_precision,reg_set, N_min, H, A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr) 174 | print("Best Alpha: " + str(best_alpha)) 175 | 176 | end = time.time() 177 | print "Elapsed time: " + str(end-start) 178 | 179 | #FIND TEST SET COST 180 | path = decision_path(test_x,spo_dt_a,spo_dt_b) 181 | costs = apply_leaf_decision(test_cost,path,spo_dt_w, subtract_optimal=False, A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr) 182 | 183 | with open(fname_out, 'wb') as output: 184 | pickle.dump(costs, output, pickle.HIGHEST_PROTOCOL) 185 | 186 |
print "Average test set cost (max is better): " + str(-1.0*np.mean(costs)) 187 | print "Average test set weighted cost (max is better): " + str(-1.0*np.dot(costs,test_weights)/np.sum(test_weights)) -------------------------------------------------------------------------------- /Applications/Yahoo News/decision_problem_solver.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Generic file to set up the decision problem (i.e., optimization problem) under consideration 3 | Must have functions: 4 | get_num_decisions(): returns number of decision variables (i.e., dimension of cost vector) 5 | find_opt_decision(): returns for a matrix of cost vectors the corresponding optimal decisions for those cost vectors 6 | 7 | This particular file sets up a news article recommendation decision problem 8 | ''' 9 | 10 | from gurobipy import * 11 | import numpy as np 12 | import sys 13 | 14 | def get_num_decisions(A_constr=None, b_constr=None, l_constr=None, u_constr=None): 15 | num_constr, num_dec = A_constr.shape 16 | return num_dec 17 | 18 | def find_opt_decision(p, A_constr=None, b_constr=None, l_constr=None, u_constr=None): 19 | num_constr, num_dec = A_constr.shape 20 | '''input matrix p, such that each row corresponds to an instance''' 21 | weights = np.zeros(p.shape) 22 | objective = np.zeros(p.shape[0]) 23 | 24 | if (p.shape[1] != num_dec): 25 | return 'Shape inconsistent, check input dimensions.' 26 | 27 | model = Model() 28 | model.Params.outputflag = 0 29 | w = model.addVars(num_dec, lb = l_constr, ub= u_constr) 30 | #w = model.addVars(num_dec, lb =0) 31 | model.addConstrs((quicksum(A_constr[i][j]*w[j] for j in range(num_dec)) <= b_constr[i] for i in range(num_constr))) 32 | model.addConstr(quicksum(w[j] for j in range(num_dec)) == 1) 33 | 34 | for inst in range(p.shape[0]): 35 | #model.setObjective(quicksum(p[inst,j]*w[j] for j in range(num_dec)), GRB.MAXIMIZE) 36 | model.setObjective(quicksum(p[inst,j]*w[j] for j in range(num_dec)), GRB.MINIMIZE) 37 | model.optimize() 38 | if model.status == GRB.OPTIMAL: 39 | weights[inst,:] = np.array([w[j].X for j in range(num_dec)]) 40 | objective[inst] = model.ObjVal 41 | else: 42 | print(inst, "Infeasible!") 43 | sys.exit("Decision problem infeasible") 44 | # print(model.status) 45 | return {'weights': weights, 'objective':objective} 46 | 47 | 48 | '''Example input''' 49 | #np.random.seed(0) 50 | #gen_decision_problem() 51 | #inst_num = 10 52 | #p = np.random.rand(inst_num,num_dec)*-1.0 53 | #w = find_opt_decision(p) 54 | #w = w['weights'] 55 | #print(w) 56 | #print(p) 57 | -------------------------------------------------------------------------------- /Applications/Yahoo News/spo_opt_tree_news.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Code for fitting SPOT MILP on Yahoo News dataset. 3 | Note that this code is specifically designed for the Yahoo News application and 4 | will not work for other applications without modifying the constraints. 5 | Note on notation: the paper uses r to denote the binary variables which map training observations to leaves. This code uses z rather than r. 
6 | ''' 7 | 8 | 9 | import numpy as np 10 | import pandas as pd 11 | from gurobipy import* 12 | import pickle 13 | from sklearn.model_selection import KFold 14 | from decision_problem_solver import* 15 | 16 | # Helper functions for tree structure 17 | def find_parent_index(t): 18 | return (t+1)//2 - 1 19 | 20 | def find_ancestors(t): 21 | l= [] 22 | r = [] 23 | if t == 0: 24 | return 25 | else: 26 | while find_parent_index(t) !=0: 27 | parent = find_parent_index(t) 28 | if (t+1)% (1+parent) ==1: 29 | r.append(parent) 30 | else: 31 | l.append(parent) 32 | t = parent 33 | if t==2: 34 | r.append(0) 35 | else: 36 | l.append(0) 37 | return[l,r] 38 | 39 | #truncate training set features to desired precision 40 | def truncate_train_x(train_x, train_x_precision): 41 | return(np.around(train_x, decimals=train_x_precision)) 42 | 43 | 44 | #trains an optimal tree model on train_cost, train_x, and reg parameter spo_opt_tree_reg (scalar) 45 | #returns parameter encoding of the optimal tree (a,b,w). (a,b) encode splits, (w) encode leaf decisions 46 | #optimal tree params: 47 | #N_min = minimum number of observations per leaf node 48 | #H = max tree depth 49 | #def spo_opt_tree(train_cost, train_x,spo_opt_tree_reg): 50 | def spo_opt_tree(train_cost, train_x, train_x_precision, spo_opt_tree_reg, N_min, H, 51 | weights=None, 52 | returnAllOptvars=False, 53 | a_start=None, b_start=None, w_start=None, y_start=None, z_start=None, l_start=None, d_start=None, 54 | threads=None, MIPGap=None, MIPFocus=None, verbose=False, Seed=None, TimeLimit=None, 55 | Presolve=None, ImproveStartTime=None, VarBranch=None, Cuts=None, 56 | tune=False, TuneCriterion=None, TuneJobs=None, TuneTimeLimit=None, TuneTrials=None, tune_foutpref=None, 57 | A_constr=None, b_constr=None, l_constr=None, u_constr=None): 58 | 59 | ################################ 60 | #assert(spo_opt_tree_reg >= 1e-4) 61 | ################################ 62 | assert(A_constr is not None) 63 | assert(b_constr is not None) 64 | assert(l_constr is not None) 65 | assert(u_constr is not None) 66 | assert(weights is not None) 67 | 68 | #currently only lower bound = 0, upper bound = 1 constraints supported 69 | assert(len(np.unique(l_constr)) == 1 and np.unique(l_constr)[0] == 0) 70 | assert(len(np.unique(u_constr)) == 1 and np.unique(u_constr)[0] == 1) 71 | 72 | num_constr, D = A_constr.shape 73 | 74 | # We label all nodes of the tree by 0, 1, 2, ..., 2**(H+1) - 2.
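# Worked example of this labeling (for illustration): with H = 2 there are T_B = 3 branch nodes
# (0, 1, 2) and T_L = 4 leaf nodes (3, 4, 5, 6). Leaf index t = 1 corresponds to node t + T_B = 4,
# and find_ancestors(4) returns [[0], [1]]: the path to that leaf branches left at the root (node 0)
# and right at node 1.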
75 | T_B = 2**H - 1 76 | T_L = 2**H 77 | 78 | n_train, P = train_x.shape 79 | #truncate x features so eps (below) will not be too small 80 | train_x = truncate_train_x(train_x, train_x_precision) 81 | 82 | assert(np.all(train_x >= 0)) 83 | assert(np.all(train_x <= 1)) 84 | assert(np.all(train_cost <= 0)) #assert nonpositive costs 85 | assert(np.all(train_cost.shape[0] == train_x.shape[0])) 86 | # Instantiate optimization model 87 | # Compute average optimal cost across all training set observations 88 | # (Although irrelevant for the optimization problem, it helps in interpreting alpha) 89 | optimal_costs = np.zeros(train_x.shape[0]) 90 | for i in range(train_x.shape[0]): 91 | optimal_costs[i] = find_opt_decision(train_cost[i,:].reshape(1,-1),A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr)['objective'][0] 92 | if weights is not None: 93 | sum_optimal_cost = np.dot(optimal_costs, weights) 94 | else: 95 | sum_optimal_cost = sum(optimal_costs) 96 | 97 | # Compute big M constant 98 | negM = 0 99 | for i in range(train_x.shape[0]): 100 | min_decision_cost = find_opt_decision(train_cost[i,:].reshape(1,-1),A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr)['objective'][0] 101 | if min_decision_cost <= negM: 102 | negM = min_decision_cost 103 | 104 | #M = train_cost.max()*(dim-1)*2 105 | spo = Model('spo_opt_tree') 106 | if verbose == False: 107 | spo.Params.OutputFlag = 0 108 | #compute epsilon constants 109 | #eps = np.float("inf") 110 | #for j in range(train_x.shape[1]): 111 | #ordered_feat = np.sort(train_x[:,j]) 112 | #diffs = ordered_feat[1:]-ordered_feat[:-1] 113 | #nonzero_diffs = diffs[diffs > 0] 114 | #if min(nonzero_diffs) <= eps: 115 | #eps = min(nonzero_diffs) 116 | #one_plus_eps = 1 + eps 117 | 118 | eps = np.array([np.float("inf")]*train_x.shape[1]) 119 | for j in range(train_x.shape[1]): 120 | ordered_feat = np.sort(train_x[:,j]) 121 | diffs = ordered_feat[1:]-ordered_feat[:-1] 122 | nonzero_diffs = diffs[diffs > 0] 123 | eps[j] = min(nonzero_diffs) 124 | one_plus_eps_max = 1 + max(eps) 125 | 126 | #run params 127 | if threads is not None: 128 | spo.Params.Threads = threads 129 | if MIPGap is not None: 130 | spo.Params.MIPGap = MIPGap # default = 1e-4, try 1e-2 131 | if MIPFocus is not None: 132 | spo.Params.MIPFocus = MIPFocus 133 | if Seed is not None: 134 | spo.Params.Seed = Seed 135 | if TimeLimit is not None: 136 | spo.Params.TimeLimit = TimeLimit 137 | if Presolve is not None: 138 | spo.Params.Presolve = Presolve 139 | if ImproveStartTime is not None: 140 | spo.Params.ImproveStartTime = ImproveStartTime 141 | if VarBranch is not None: 142 | spo.Params.VarBranch = VarBranch 143 | if Cuts is not None: 144 | spo.Params.Cuts = Cuts 145 | 146 | #tune params 147 | if tune == True and TuneCriterion is not None: 148 | spo.Params.TuneCriterion = TuneCriterion 149 | if tune == True and TuneJobs is not None: 150 | spo.Params.TuneJobs = TuneJobs 151 | if tune == True and TuneTimeLimit is not None: 152 | spo.Params.TuneTimeLimit = TuneTimeLimit 153 | if tune == True and TuneTrials is not None: 154 | spo.Params.TuneTrials = TuneTrials 155 | 156 | # Add variables 157 | y = spo.addVars(tuplelist([(i, t) for i in range(n_train) for t in range(T_L)]), lb = -GRB.INFINITY, ub = 0, name = 'y') 158 | z = spo.addVars(tuplelist([(i, t) for i in range(n_train) for t in range(T_L)]), vtype=GRB.BINARY, name = 'z') 159 | #z = spo.addVars(tuplelist([(i, t) for i in range(n_train) for t in range(T_L)]), lb = 0, ub = 1, name = 'z') 160 | w = 
spo.addVars(tuplelist([(t, j) for t in range(T_L) for j in range(D)]), lb = 0, ub= 1,name = 'w') 161 | l = spo.addVars(tuplelist([i for i in range(T_L)]), vtype=GRB.BINARY, name = 'l') 162 | d = spo.addVars(tuplelist([i for i in range(T_B)]), vtype=GRB.BINARY, name = 'd') 163 | a = spo.addVars(tuplelist([(j,t) for j in range(P) for t in range(T_B)]), vtype=GRB.BINARY, name = 'a') 164 | #b = spo.addVars(tuplelist([i for i in range(T_B)]), lb = 0, name = 'b') 165 | b = spo.addVars(tuplelist([i for i in range(T_B)]), lb = 0, ub = 1, name = 'b') 166 | 167 | if a_start is not None: 168 | for i in range(P): 169 | for j in range(T_B): 170 | a[i,j].start = a_start[i,j] 171 | 172 | if b_start is not None: 173 | for i in range(T_B): 174 | b[i].start = b_start[i] 175 | 176 | if w_start is not None: 177 | for i in range(T_L): 178 | for j in range(D): 179 | w[i,j].start = w_start[i,j] 180 | 181 | if y_start is not None: 182 | for i in range(n_train): 183 | for j in range(T_L): 184 | y[i,j].start = y_start[i,j] 185 | 186 | if z_start is not None: 187 | for i in range(n_train): 188 | for j in range(T_L): 189 | z[i,j].start = z_start[i,j] 190 | 191 | if l_start is not None: 192 | for i in range(T_L): 193 | l[i].start = l_start[i] 194 | 195 | if d_start is not None: 196 | for i in range(T_B): 197 | d[i].start = d_start[i] 198 | 199 | spo.update() #for initial values to be written immediately 200 | 201 | # if a_start is not None: 202 | # for i in range(P): 203 | # for j in range(T_B): 204 | # print(a[i,j].start) 205 | # 206 | # if b_start is not None: 207 | # for i in range(T_B): 208 | # print(b[i].start) 209 | # 210 | # if w_start is not None: 211 | # for i in range(T_L): 212 | # for j in range(D): 213 | # print(w[i,j].start) 214 | # 215 | # if y_start is not None: 216 | # for i in range(n_train): 217 | # for j in range(T_L): 218 | # print(y[i,j].start) 219 | # 220 | # if z_start is not None: 221 | # for i in range(n_train): 222 | # for j in range(T_L): 223 | # print(z[i,j].start) 224 | # 225 | # if l_start is not None: 226 | # for i in range(T_L): 227 | # print(l[i].start) 228 | # 229 | # if d_start is not None: 230 | # for i in range(T_B): 231 | # print(d[i].start) 232 | 233 | 234 | # Add constraints 235 | # Const 236 | for i in range(n_train): 237 | for t in range(T_L): 238 | #expr_constraint = LinExpr([ (train_cost[i,j], w[t,j]) for key,j in Edge_dict.items()]) 239 | expr_constraint = LinExpr([ (train_cost[i,j], w[t,j]) for j in range(D)]) 240 | spo.addConstr(y[i,t] >= expr_constraint) 241 | spo.addConstr(y[i,t] >= negM * z[i,t]) 242 | # spo.addConstr(y[i,t] >= quicksum(train_cost[i,j] * w[t,j] for key,j in Edge_dict.items())- M * (1 - z[i,t])) 243 | 244 | # # (genreral constraint for feasibility of nominal problem Aw <= B) 245 | # for t in range(T_L): 246 | # for i in range(K): 247 | # spo.addConstr(quicksum(A[i,j] * w[t,j] for j in range(D)) <= B[i] ) 248 | 249 | # Const 250 | #constraint for feasibility of decision problem) 251 | for t in range(T_L): 252 | spo.addConstrs((quicksum(A_constr[i][j]*w[t,j] for j in range(D)) <= b_constr[i] for i in range(num_constr))) 253 | spo.addConstr(quicksum(w[t,j] for j in range(D)) == 1) 254 | 255 | # Const 256 | for i in range(n_train): 257 | # spo.addConstr(quicksum(z[i,t] for t in range(T_L)) == 1) 258 | spo.addConstr(LinExpr([(1,z[i,t]) for t in range(T_L)]) == 1) 259 | 260 | # Const 261 | for i in range(n_train): 262 | for t in range(T_L): 263 | spo.addConstr(z[i,t] <= l[t]) 264 | 265 | # Const 266 | for t in range(T_L): 267 | # spo.addConstr(quicksum(z[i,t] 
for i in range(n_train))>= N_min * l[t]) 268 | if weights is not None: 269 | spo.addConstr(LinExpr([(weights[i],z[i,t]) for i in range(n_train)])>= N_min * l[t]) 270 | else: 271 | spo.addConstr(LinExpr([(1,z[i,t]) for i in range(n_train)])>= N_min * l[t]) 272 | 273 | # Const 274 | for i in range(n_train): 275 | for t in range(T_L): 276 | left, right = find_ancestors(t + T_B) 277 | for m in right: 278 | # spo.addConstr(quicksum(a[p,m]* train_x[i,p] for p in range(P)) >= b[m]- (1 - z[i,t] )) 279 | spo.addConstr(LinExpr([(train_x[i,p],a[p,m]) for p in range(P)]) >= b[m]- (1 - z[i,t] )) 280 | 281 | # Const 282 | for m in left: 283 | # spo.addConstr(quicksum(a[p,m]* (x[i,p] + eps[p]) for p in range(P))<= b[m] + (1+eps_max)*(1-z[i,t] )) 284 | #spo.addConstr(quicksum(a[p,m]* train_x[i,p] for p in range(P)) +0.0001<= b[m] + (1-z[i,t] )) 285 | #spo.addConstr(LinExpr([(train_x[i,p],a[p,m]) for p in range(P)]) + eps <= b[m] + (1+eps)*(1 - z[i,t] )) 286 | spo.addConstr(LinExpr([(train_x[i,p]+ eps[p],a[p,m]) for p in range(P)]) <= b[m] + one_plus_eps_max*(1 - z[i,t] )) 287 | #spo.addConstr(LinExpr([(train_x[i,p],a[p,m]) for p in range(P)]) <= b[m] + 1 - one_plus_eps*z[i,t]) 288 | 289 | # Const 290 | for t in range(T_B): 291 | # spo.addConstr(quicksum(a[p,t] for p in range(P)) == d[t]) 292 | spo.addConstr(LinExpr([(1,a[p,t]) for p in range(P)]) == d[t]) 293 | 294 | # Const 295 | for t in range(T_B): 296 | #spo.addConstr(b[t] <= d[t]) 297 | spo.addConstr(b[t] >= 1 - d[t]) 298 | 299 | # Const 300 | for t in range(1,T_B): 301 | spo.addConstr(d[t] <= d[find_parent_index(t)]) 302 | 303 | # Const (optional): ensures LP relaxation of problem has y's defined sensibly 304 | for i in range(n_train): 305 | spo.addConstr(LinExpr([(1,y[i,t]) for t in range(T_L)]) >= optimal_costs[i]) 306 | #for t in range(T_L): 307 | #expr_constraint = LinExpr([ (train_cost[i,j], w[t,j]) for key,j in Edge_dict.items()]) 308 | #expr_constraint = LinExpr([ (train_cost[i,j], w[t,j]) for j in range(D)]) 309 | #spo.addConstr(expr_constraint >= optimal_costs[i]) 310 | #spo.addConstr(LinExpr([(1,y[i,t]) for t in range(T_L)]) >= optimal_costs[i]) 311 | 312 | # Add objective 313 | # spo.setObjective( quicksum(y[i,t] for i in range(n_train) for t in range(T_L))/n_train + spo_opt_tree_reg* quicksum(d[t] for t in range(T_B) ), GRB.MINIMIZE) 314 | if weights is not None: 315 | expr_objective = LinExpr([(weights[i], y[i,t]) for i in range(n_train) for t in range(T_L) ]) - sum_optimal_cost 316 | else: 317 | expr_objective = LinExpr([(1, y[i,t]) for i in range(n_train) for t in range(T_L) ]) - sum_optimal_cost 318 | #expr_objective = LinExpr([(1, y[i,t]) for i in range(n_train) for t in range(T_L) ]) 319 | if spo_opt_tree_reg > 0: 320 | if weights is not None: 321 | sum_weights = sum(weights) 322 | expr_objective.add(LinExpr([(1, d[t]) for t in range(T_B)])*spo_opt_tree_reg*sum_weights) 323 | else: 324 | expr_objective.add(LinExpr([(1, d[t]) for t in range(T_B)])*spo_opt_tree_reg*n_train) 325 | spo.setObjective(expr_objective, GRB.MINIMIZE) 326 | 327 | 328 | # Solve optimization 329 | if tune == True: 330 | spo.tune() 331 | if tune_foutpref is None: 332 | tune_foutpref='tune' 333 | for i in range(spo.tuneResultCount): 334 | spo.getTuneResult(i) 335 | spo.write(tune_foutpref+str(i)+'.prm') 336 | spo.optimize() 337 | 338 | #############################33333 339 | # if spo.status == GRB.OPTIMAL: 340 | # print("Objective Value:") 341 | # print(spo.ObjVal) 342 | # print("Reg term of objective:") 343 | # if weights is not None: 344 | # 
print(spo_opt_tree_reg*sum_weights) 345 | # else: 346 | # print(spo_opt_tree_reg*n_train) 347 | # else: 348 | # import sys 349 | # print("Infeasible!") 350 | # sys.exit("Decision problem infeasible") 351 | #############################33333 352 | 353 | # Get values of objective and variables 354 | # print('Obj=') 355 | # print(spo.getObjective().getValue()) 356 | # 357 | # z_ = np.zeros((n,T_L)) 358 | # z_res = spo.getAttr('X', z) 359 | # for i,j in z_res: 360 | # z_[i,j] = z_res[i,j] 361 | # print(z_) 362 | spo_dt_a = np.zeros((P,T_B)) 363 | a_res = spo.getAttr('X', a) 364 | for i in range(P): 365 | for j in range(T_B): 366 | spo_dt_a[i,j] = a_res[i,j] 367 | #for i,j in a_res: 368 | #spo_dt_a[i,j] = a_res[i,j] 369 | 370 | spo_dt_b = np.zeros(T_B) 371 | b_res = spo.getAttr('X', b) 372 | for i in range(T_B): 373 | spo_dt_b[i] = b_res[i] 374 | #for i in b_res: 375 | #spo_dt_b[i] = b_res[i] 376 | 377 | spo_dt_w = np.zeros((T_L,D)) 378 | w_res = spo.getAttr('X', w) 379 | for i in range(T_L): 380 | for j in range(D): 381 | spo_dt_w[i,j] = w_res[i,j] 382 | #for i,j in w_res: 383 | #spo_dt_w[i,j] = w_res[i,j] 384 | 385 | spo_dt_y = np.zeros((n_train,T_L)) 386 | y_res = spo.getAttr('X', y) 387 | for i in range(n_train): 388 | for j in range(T_L): 389 | spo_dt_y[i,j] = y_res[i,j] 390 | spo_dt_z = np.zeros((n_train,T_L)) 391 | z_res = spo.getAttr('X', z) 392 | for i in range(n_train): 393 | for j in range(T_L): 394 | spo_dt_z[i,j] = z_res[i,j] 395 | spo_dt_l = np.zeros(T_L) 396 | l_res = spo.getAttr('X', l) 397 | for i in range(T_L): 398 | spo_dt_l[i] = l_res[i] 399 | spo_dt_d = np.zeros(T_B) 400 | d_res = spo.getAttr('X', d) 401 | for i in range(T_B): 402 | spo_dt_d[i] = d_res[i] 403 | 404 | if returnAllOptvars == False: 405 | return spo_dt_a, spo_dt_b, spo_dt_w 406 | else: 407 | return spo_dt_a, spo_dt_b, spo_dt_w, spo_dt_y, spo_dt_z, spo_dt_l, spo_dt_d 408 | 409 | # Given a tree defined by a and b for all interior nodes, find the path (including the leaf node in which it lies) of observations using its features 410 | def decision_path(x,a,b): 411 | T_B = len(b) 412 | if len(x.shape) == 1: 413 | n = 1 414 | P = x.size 415 | else: 416 | n, P = x.shape 417 | res = [] 418 | for i in range(n): 419 | node = 0 420 | path = [0] 421 | T_B = a.shape[1] 422 | while node < T_B: 423 | if np.dot(x[i,:], a[:,node]) < b[node]: 424 | node = (node+1)*2 - 1 425 | else: 426 | node = (node+1)*2 427 | path.append(node) 428 | res.append(path) 429 | return np.array(res) 430 | 431 | 432 | # Given the path of an observation (including the leaf node in which it lies), find the predicted total cost for that observation 433 | def apply_leaf_decision(c,path, w, subtract_optimal=False, A_constr=None, b_constr=None, l_constr=None, u_constr=None): 434 | T_L, D = w.shape 435 | n = c.shape[0] 436 | paths = path[:, -1] 437 | actual_cost = [] 438 | for i in range(n): 439 | decision_node = paths[i] - T_L +1 440 | cost_decision = np.dot(c[i,:], w[decision_node,:]) 441 | if subtract_optimal == True: 442 | cost_optimal = find_opt_decision(c[i,:].reshape(1,-1), A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr)['objective'][0] 443 | actual_cost.append(cost_decision-cost_optimal) 444 | else: 445 | actual_cost.append(cost_decision) 446 | return np.array(actual_cost) 447 | 448 | def spo_opt_traintest(train_x,train_cost,train_weights,test_x,test_cost,test_weights,train_x_precision,spo_opt_tree_reg, N_min, H, A_constr=None, b_constr=None, l_constr=None, u_constr=None): 449 | spo_dt_a,spo_dt_b, spo_dt_w = 
spo_opt_tree(train_cost,train_x,train_x_precision,spo_opt_tree_reg, N_min, H, 450 | weights=train_weights, A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr) 451 | path = decision_path(test_x,spo_dt_a,spo_dt_b) 452 | costs = apply_leaf_decision(test_cost,path, spo_dt_w, subtract_optimal=True, A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr) 453 | return spo_dt_a,spo_dt_b, spo_dt_w, np.dot(costs,test_weights)/np.sum(test_weights) 454 | 455 | def spo_opt_tunealpha(train_x,train_cost,train_weights,valid_x,valid_cost,valid_weights,train_x_precision,reg_set, N_min, H, A_constr=None, b_constr=None, l_constr=None, u_constr=None): 456 | best_err = np.float("inf") 457 | for alpha in reg_set: 458 | spo_dt_a,spo_dt_b, spo_dt_w, err = spo_opt_traintest(train_x,train_cost,train_weights,valid_x,valid_cost,valid_weights,train_x_precision,alpha, N_min, H, A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr) 459 | if err <= best_err: 460 | best_spo_dt_a, best_spo_dt_b, best_spo_dt_w, best_err, best_alpha = spo_dt_a,spo_dt_b, spo_dt_w, err, alpha 461 | return best_spo_dt_a, best_spo_dt_b, best_spo_dt_w, best_err, best_alpha 462 | 463 | #def cv_spo_opt_traintest(cost, X, train_x_precision,reg_set, splits = 4): 464 | # dic = {reg:0 for reg in reg_set} 465 | # n, n_edges = cost.shape 466 | # K = X.shape[1] 467 | # kf = KFold(n_splits = splits) 468 | # for train, test in kf.split(X): 469 | # X_train, X_test, cost_train, cost_test = X[train], X[test], cost[train], cost[test] 470 | # opt_cost = find_opt_decision(cost_test, A_constr=A_constr, b_constr=b_constr, l_constr=l_constr, u_constr=u_constr)['objective'] 471 | # for spo_opt_tree_reg in reg_set: 472 | # actual_cost = spo_opt_traintest(X_train, cost_train, X_test, cost_test,train_x_precision,spo_opt_tree_reg) 473 | # dic[spo_opt_tree_reg] += sum(actual_cost - opt_cost) 474 | # return smallest_dic_value(dic) 475 | # 476 | #def smallest_dic_value(dic): 477 | # reverse = dict() 478 | # for key in dic.keys(): 479 | # reverse[dic[key]] = key 480 | # return reverse[min(reverse.keys())] 481 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Adam N. Elmachtoub, Jason Cheuk Nam Liang, Ryan McNellis 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
-------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # SPO Tree 2 | 3 | This is a Python code base for training "SPO Trees" (SPOTs) from the paper "Decision Trees for Decision-Making under the Predict-then-Optimize Framework". 4 | 5 | The "Algorithms" folder contains implementations for training (greedy) SPO Trees (SPO_tree_greedy.py) and SPO Forests (SPOForest.py) for general predict-then-optimize problems. 6 | 7 | The "Applications" folder contains all data + code for reproducing the three numerical experiments (applications) covered in the paper: 8 | * Illustrative Example: A two-road shortest path decision problem. 9 | * Shortest Path: A shortest path decision problem over a 4 x 4 grid network, where the driver starts in 10 | the southwest corner and tries to find the shortest path to the northeast corner. 11 | * Yahoo News: A news article recommendation decision problem constructed from the Yahoo! Front Page Today Module dataset. 12 | 13 | Code currently only supports Python 2.7 (not Python 3). 14 | Package Dependencies: gurobipy (with valid Gurobi license), numpy, pandas, scipy, joblib 15 | * The Illustrative Example application also depends on matplotlib 16 | --------------------------------------------------------------------------------
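For convenience, a minimal usage sketch of the greedy SPO Tree interface (as exercised by the Yahoo News scripts above) is shown below. The toy feature and cost matrices are hypothetical placeholders, and a real run additionally requires a valid Gurobi license plus a decision_problem_solver.py appropriate to the application; any solver-specific keyword arguments (such as the constraint matrices used in the Yahoo News scripts) would be passed through fit().

```python
import numpy as np
from SPO_tree_greedy import SPOTree

# Hypothetical toy data: 100 observations, 5 contextual features, 6-dimensional cost vectors
np.random.seed(0)
X = np.random.rand(100, 5)
C = -np.random.rand(100, 6)  # e.g., negated click probabilities, as in the Yahoo News scripts

# SPO_weight_param=1.0 trains with the SPO loss; 0.0 would instead give a CART-style (MSE) tree
tree = SPOTree(max_depth=3, min_weights_per_node=10, quant_discret=0.05,
               debias_splits=False, SPO_weight_param=1.0, SPO_full_error=True)
tree.fit(X, C, weights=np.ones(100), verbose=False, feats_continuous=True)
tree.traverse()                    # print the fitted tree
decisions = tree.est_decision(X)   # estimated optimal decisions for each context
cost_preds = tree.est_cost(X)      # estimated cost vectors for each context
```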