├── README.md
├── example.py
├── hp_search.py
└── rocstar.py

/README.md:
--------------------------------------------------------------------------------

# Roc-star : An objective function for ROC-AUC that actually works.

For binary classification, everybody loves the Area Under the Curve (AUC) metric, but nobody directly targets it in their loss function. Instead folks use a proxy function like Binary Cross Entropy (BCE).

This works fairly well, most of the time. But we're left with a nagging question : could we get a higher score with a loss function closer in nature to AUC?

It seems likely, since BCE really bears very little relation to AUC. There have been many attempts to find a loss function that more directly targets AUC. (One common tactic is some form of rank-loss function such as Hinge Rank-Loss.) In practice, however, no clear winner has ever emerged. There's been no serious challenge to BCE.

There are also considerations beyond performance. Since BCE is essentially different from AUC, BCE tends to misbehave in the final stretch of training, where we are trying to steer it toward the highest AUC score.

A good deal of the AUC optimization actually ends up occurring in the tuning of hyper-parameters. Early Stopping becomes an uncomfortable necessity, as the model may diverge sharply at any time from its high score.

We'd like a loss function that gives us higher scores and less trouble.

We present such a function here.


## The Problem : AUC is bumpy

My favorite working definition of AUC is this: Let's call the binary class labels "Black" (0) and "White" (1). Pick one black element at random and let *x* be its predicted value. Now pick a random white element with value *y*. Then,

AUC = the probability that the elements are in the right order. That is, *x* < *y*.

That's it. For any given set of points like the Training Set, we can get this probability by doing a brute-force calculation. Scan the set of all possible black/white pairs, and count the portion that are right-ordered.

We can see that the AUC score is not differentiable (a smooth curve with respect to any single *x* or *y*). Take an element (any color) and move it slightly enough that it doesn't hit a neighboring element. The AUC stays the same. Once the point does cross a neighbor, we have a chance of flipping one of the *x* < *y* comparisons - and the AUC jumps. There is no in-between value for gradient descent to follow.

## From counting pairs to a loss : Yan et al.

Write **B** for the set of black predictions and **W** for the set of white ones. The brute-force pair count above is

    Σ (over x in B, y in W) of  1[ x < y ]

This is really just straining mathematical notation to say 'count the right-ordered pairs.' If we divide this sum by the total number of pairs, |**B**| * |**W**|, we get exactly the AUC metric. (Historically, this is called the normalized Wilcoxon-Mann-Whitney (WMW) statistic.)

To make a loss function from this, we could just flip the x < y comparison to x > y in order to penalize wrong-ordered pairs. The problem, of course, is that discontinuous jump when *x* crosses *y*.

*Yan et al.* survey - and then reject - past work-arounds using continuous approximations to the step (Heaviside) function, such as a Sigmoid curve. Then they pull this out of a hat :

    (1)    Loss  =  Σ (over x in B, y in W) of  max( 0, Γ - (y - x) )^p

Yan got this formula by applying a series of changes to the WMW:

1. The reward for each right-ordered pair is flipped into a penalty on each pair that is wrong-ordered, or not right-ordered by enough.
2. The step function is replaced by the size of the violation itself, raised to a power *p*. Unlike the step function, max(0, z)^p for p>1 is differentiable everywhere. OK, so p>1.
3. A padding term Γ > 0 is added, so a pair escapes penalty only when the white prediction beats the black one by at least Γ.

Now back to Γ : Γ provides a 'padding' which is enforced between two points. We penalize not only wrong-ordered pairs, but also right-ordered pairs which are *too close*. If a right-ordered pair is too close, its elements are at risk of getting swapped in the future by the random jiggling of a stochastic neural net.
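To make the formula concrete, here is a tiny, purely illustrative PyTorch sketch of loss (1), computed densely over a handful of invented predictions (no batching or sub-sampling yet; that machinery comes below):

    import torch

    x = torch.tensor([0.2, 0.6, 0.4])        # black (label 0) predictions, invented for the example
    y = torch.tensor([0.5, 0.9, 0.3, 0.7])   # white (label 1) predictions, invented for the example
    gamma, p = 0.2, 2                         # the padding and exponent from (1)

    # Pairwise x - y + gamma; a pair is penalized exactly when y - x < gamma.
    diff = x.view(-1, 1) - y.view(1, -1) + gamma
    loss = torch.clamp(diff, min=0).pow(p).sum()
    print(loss)

With Γ = 0 only wrong-ordered pairs would be penalized; the padding is what also pushes apart right-ordered pairs that are merely too close.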
And that's the basic idea as outlined in the paper. We now make some refinements regarding Γ and p.


## About that Γ and p


Here we break a bit with the paper. *Yan et al.* seem a little squeamish on the topic of choosing Γ and p, offering only that *p* = 2 or *p* = 3 seems good and that Γ should be somewhere between 0.10 and 0.70. Yan essentially wishes us luck with these parameters and bows out.

First, we permanently fix *p* = 2, because any self-respecting loss function should be a sum-of-squares. (One reason for this is that it ensures the loss function is not only differentiable, but also *convex*.)

Second, and more importantly, let's take a look at Γ. The heuristic of 'somewhere from 0.10 to 0.70' looks strange on the face of it; even if the predictions were normalized to be 0 < *x*, *y* < 1, that is a huge range, and the right amount of padding clearly depends on how scrambled the current predictions are. So instead of fixing Γ directly, we pin it down through a new constant δ, by requiring that the right-ordered pairs which are *too close* stay in a fixed proportion to the pairs which are outright wrong-ordered:

    |**pairs** where 0 < y - x < Γ|  =  δ |**pairs** where y < x|

(as before, x is a black prediction and y a white one). In our experiments we found that δ can range from 0.5 to 2.0, and 1.0 is a good default choice.

So we set δ to 1, p to 2, and forget about Γ altogether; it is recomputed from the data at the end of every epoch.

## Let's make code

Our loss function (1) looks dismally expensive to compute. It requires that we scan the entire training set for each individual prediction.

We bypass this problem with a performance tweak :

Suppose we are calculating the loss function for a given white data point, *y*. To calculate its contribution to (1), we need to compare *y* against the entire training set of black predictions, *x*.
We take a short-cut and use a random sub-sample of the black data points. If we set the size of the sub-sample to be, say, 1000 - we get a very (very) close approximation to the true loss function. [1]

Similar reasoning applies to the loss function of a black data-point; we use a random sub-sample of all white training elements.

In this way, white and black subsamples fit easily into GPU memory. By reusing the same sub-sample throughout a given batch, we can parallelize the operation in batches. We end up with a loss function that's about as fast as BCE.

Here's the batch-loss function in PyTorch:

    def roc_star_loss( _y_true, y_pred, gamma, _epoch_true, epoch_pred):
        """
        Nearly direct loss function for AUC.
        See article,
        C. Reiss, "Roc-star : An objective function for ROC-AUC that actually works."
        https://github.com/iridiumblue/articles/blob/master/roc_star.md
        _y_true: `Tensor`. Targets (labels). Float either 0.0 or 1.0 .
        y_pred: `Tensor` . Predictions.
        gamma : `Float` Gamma, as derived from last epoch.
        _epoch_true: `Tensor`. Targets (labels) from last epoch.
        epoch_pred : `Tensor`. Predictions from last epoch.
        """
        # convert labels to boolean
        y_true = (_y_true>=0.50)
        epoch_true = (_epoch_true>=0.50)

        # if the batch is all one class, return a tiny stub value that still depends on y_pred
        if torch.sum(y_true)==0 or torch.sum(y_true) == y_true.shape[0]: return torch.sum(y_pred)*1e-8

        pos = y_pred[y_true]
        neg = y_pred[~y_true]

        epoch_pos = epoch_pred[epoch_true]
        epoch_neg = epoch_pred[~epoch_true]

        # Take random subsamples of the training set, both positive and negative.
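        # torch.rand_like draws a uniform [0,1) value per element, so each stored prediction
        # below survives with probability max_pos/cap_pos (resp. max_neg/cap_neg), i.e. roughly
        # max_pos (or max_neg) survivors in expectation, enough for a close approximation
        # of the full sum in (1).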
        max_pos = 1000 # Max number of positive training samples to keep
        max_neg = 1000 # Max number of negative training samples to keep
        cap_pos = epoch_pos.shape[0]
        cap_neg = epoch_neg.shape[0]
        epoch_pos = epoch_pos[torch.rand_like(epoch_pos) < max_pos/cap_pos]
        epoch_neg = epoch_neg[torch.rand_like(epoch_neg) < max_neg/cap_neg]

        ln_pos = pos.shape[0]
        ln_neg = neg.shape[0]

        # sum positive batch elements against (subsampled) negative elements
        if ln_pos>0 :
            pos_expand = pos.view(-1,1).expand(-1,epoch_neg.shape[0]).reshape(-1)
            neg_expand = epoch_neg.repeat(ln_pos)

            diff2 = neg_expand - pos_expand + gamma
            l2 = diff2[diff2>0]
            m2 = l2 * l2
            len2 = l2.shape[0]
        else:
            m2 = torch.tensor([0], dtype=torch.float).cuda()
            len2 = 0

        # Similarly, compare negative batch elements against (subsampled) positive elements
        if ln_neg>0 :
            pos_expand = epoch_pos.view(-1,1).expand(-1, ln_neg).reshape(-1)
            neg_expand = neg.repeat(epoch_pos.shape[0])

            diff3 = neg_expand - pos_expand + gamma
            l3 = diff3[diff3>0]
            m3 = l3*l3
            len3 = l3.shape[0]
        else:
            m3 = torch.tensor([0], dtype=torch.float).cuda()
            len3 = 0

        if (torch.sum(m2)+torch.sum(m3))!=0 :
            res2 = torch.sum(m2)/max_pos+torch.sum(m3)/max_neg
            #code.interact(local=dict(globals(), **locals()))
        else:
            res2 = torch.sum(m2)+torch.sum(m3)

        res2 = torch.where(torch.isnan(res2), torch.zeros_like(res2), res2)

        return res2

Note that there are some extra parameters. We are passing in the training set from the *last epoch*. Since the entire training set doesn't change much from one epoch to the next, the loss function can compare each prediction against a slightly out-of-date training set. This simplifies debugging, and appears to benefit performance, as the 'background' epoch isn't changing from one batch to the next.

Similarly, Γ is an expensive calculation. We again use the sub-sampling trick, but increase the size of the sub-samples (the listing below uses 2,000 per class, giving up to ~4 million pairs) to ensure an accurate estimate. To keep performance clipping along, we recompute this value only once per epoch. Here's the function to do that :

    def epoch_update_gamma(y_true, y_pred, epoch=-1, delta=2):
        """
        Calculate gamma from last epoch's targets and predictions.
        Gamma is updated at the end of each epoch.
        y_true: `Tensor`. Targets (labels). Float either 0.0 or 1.0 .
        y_pred: `Tensor` . Predictions.
        """
        DELTA = delta
        SUB_SAMPLE_SIZE = 2000.0
        pos = y_pred[y_true==1]
        neg = y_pred[y_true==0] # yo pytorch, no boolean tensors or operators? Wassap?
        # subsample the training set for performance
        cap_pos = pos.shape[0]
        cap_neg = neg.shape[0]
        pos = pos[torch.rand_like(pos) < SUB_SAMPLE_SIZE/cap_pos]
        neg = neg[torch.rand_like(neg) < SUB_SAMPLE_SIZE/cap_neg]
        ln_pos = pos.shape[0]
        ln_neg = neg.shape[0]
        pos_expand = pos.view(-1,1).expand(-1,ln_neg).reshape(-1)
        neg_expand = neg.repeat(ln_pos)
        diff = neg_expand - pos_expand
        ln_All = diff.shape[0]
        Lp = diff[diff>0] # because we're taking positive diffs, we got pos and neg flipped.
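        # Lp holds the wrong-ordered pairs (neg prediction above pos). diff_neg, built next,
        # holds the gaps of the right-ordered pairs, sorted ascending. Indexing it at about
        # DELTA * |Lp| makes gamma the padding inside which roughly delta-times-as-many
        # right-ordered pairs fall as there are wrong-ordered pairs; this is the defining
        # relation for Γ given earlier.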
        ln_Lp = Lp.shape[0]-1
        diff_neg = -1.0 * diff[diff<0]
        diff_neg = diff_neg.sort()[0]
        ln_neg = diff_neg.shape[0]-1
        ln_neg = max([ln_neg, 0])
        left_wing = int(ln_Lp*DELTA)
        left_wing = max([0,left_wing])
        left_wing = min([ln_neg,left_wing])
        default_gamma = torch.tensor(0.2, dtype=torch.float).cuda()
        if diff_neg.shape[0] > 0 :
            gamma = diff_neg[left_wing]
        else:
            gamma = default_gamma
        L1 = diff[diff>-1.0*gamma]
        ln_L1 = L1.shape[0]
        if epoch > -1 :
            return gamma
        else :
            return default_gamma

Here's the helicopter view showing how to use the two functions as we loop on epochs, then on batches :

    train_ds = CatDogDataset(train_files, transform)
    train_dl = DataLoader(train_ds, batch_size=BATCH_SIZE)

    # initialize 'last epoch' with random values
    last_epoch_y_pred = torch.tensor( 1.0-numpy.random.rand(len(train_ds))/2.0, dtype=torch.float).cuda()
    last_epoch_y_t = torch.tensor([o for o in train_tt], dtype=torch.float).cuda()   # train_tt : the training labels
    epoch_gamma = 0.20
    for epoch in range(epoches):
        epoch_y_pred=[]
        epoch_y_t=[]
        for X, y in train_dl:
            preds = model(X)
            # .
            # .
            loss = roc_star_loss(y, preds, epoch_gamma, last_epoch_y_t, last_epoch_y_pred)
            # .
            # .
            epoch_y_pred.extend(preds)
            epoch_y_t.extend(y)
        last_epoch_y_pred = torch.tensor(epoch_y_pred).cuda()
        last_epoch_y_t = torch.tensor(epoch_y_t).cuda()
        epoch_gamma = epoch_update_gamma(last_epoch_y_t, last_epoch_y_pred, epoch)
    #...

A complete working example can be found here: [example.py](https://github.com/iridiumblue/roc-star/blob/master/example.py)
For a faster jump-start you can fork this kernel on Kaggle : [kernel](https://www.kaggle.com/iridiumblue/roc-star-an-auc-loss-function-to-challenge-bxe)

Below we chart the performance of roc-star against the same model using BCE. Experience shows that roc-star can often simply be swapped into any model using BCE, with a good chance at a performance increase.


--------------------------------------------------------------------------------
/example.py:
--------------------------------------------------------------------------------
'''
A demonstration of a new and experimental loss function which directly targets AUC/ROC,
and is seen to outperform BxE in early testing. See the write-up here -
https://github.com/iridiumblue/roc-star.

The test is a simple sentiment-analysis binary classifier turned loose on tweets from Twitter.
Text embeddings have been precomputed and pickled for speed.
TRUNC truncates the training set; set it to -1 to use all 1.6M sample tweets.

Note that for the first epoch (only), the loss function is BxE. That is just to kickstart
the new loss function roc_star_loss. That's a good practice to stick to if you want
to give this a try for your model. Note also that roc_star_loss requires a call to
epoch_update_gamma at the end of each epoch.
14 | ''' 15 | 16 | 17 | from warnings import simplefilter 18 | import time 19 | from copy import copy 20 | import torch 21 | import torch.nn.functional as F 22 | import torch.nn as nn 23 | from torch.nn.utils.rnn import PackedSequence 24 | import numpy as np 25 | import _pickle 26 | import gc 27 | from pathlib2 import Path 28 | from sklearn.metrics import roc_auc_score 29 | from trains import Task 30 | from trains import StorageManager 31 | from pkbar import Kbar as Progbar 32 | import argparse 33 | import os 34 | from tempfile import gettempdir 35 | 36 | 37 | # ignore all future warnings 38 | simplefilter(action='ignore', category=FutureWarning) 39 | simplefilter(action='ignore', category=UserWarning) 40 | 41 | x_train_torch,x_valid_torch,y_train_torch,y_valid_torch = None,None,None,None 42 | embedding_matrix = None 43 | task=None 44 | logger=None 45 | best_result={} 46 | max_features = 200000 47 | embed_size = 300 48 | 49 | 50 | def init(h_params): 51 | global x_train_torch, x_valid_torch, y_train_torch, y_valid_torch 52 | global embedding_matrix 53 | global task, logger 54 | 55 | print("Recovering tokenized text from pickle ...") 56 | tokenized = StorageManager.get_local_copy( 57 | remote_url="https://allegro-datasets.s3.amazonaws.com/roc_star_data" 58 | "/tokenized.pkl.zip", 59 | name="tokenized", 60 | ) 61 | embedding = StorageManager.get_local_copy( 62 | remote_url="https://allegro-datasets.s3.amazonaws.com/roc_star_data" 63 | "/embedding.pkl.zip", 64 | name="embedding", 65 | ) 66 | 67 | x_train, x_valid, y_train, y_valid = _pickle.load(open(Path(tokenized, "tokenized.pkl"), "rb")) 68 | print("Reusing pickled embedding ...") 69 | embedding_matrix = _pickle.load(open(Path(embedding, "embedding.pkl"), "rb")) 70 | 71 | print("Moving data to GPU ...") 72 | if h_params.trunc>-1 : 73 | print(f"\r\r * * WARNING training set truncated to first {h_params.trunc} items.\r\r") 74 | 75 | x_train_torch = torch.tensor(x_train[:h_params.trunc], dtype=torch.long).cuda() 76 | x_valid_torch = torch.tensor(x_valid, dtype=torch.long).cuda() 77 | y_train_torch = torch.tensor(y_train[:h_params.trunc], dtype=torch.float32).cuda() 78 | y_valid_torch = torch.tensor(y_valid, dtype=torch.float32).cuda() 79 | del x_train, y_train, x_valid, y_valid; gc.collect(2) 80 | 81 | task_name = "ROC" if h_params.use_roc_star else "BxE" 82 | task = Task.init(project_name='Roc-star Loss', task_name='Roc star') 83 | logger = task.get_logger() 84 | 85 | 86 | def epoch_update_gamma(y_true, y_pred, epoch=-1, delta=2): 87 | """ 88 | Calculate gamma from last epoch's targets and predictions. 89 | Gamma is updated at the end of each epoch. 90 | y_true: `Tensor`. Targets (labels). Float either 0.0 or 1.0 . 91 | y_pred: `Tensor` . Predictions. 92 | """ 93 | sub_sample_size = 2000.0 94 | pos = y_pred[y_true==1] 95 | neg = y_pred[y_true==0] # yo pytorch, no boolean tensors or operators? Wassap? 96 | # subsample the training set for performance 97 | cap_pos = pos.shape[0] 98 | cap_neg = neg.shape[0] 99 | pos = pos[torch.rand_like(pos) < sub_sample_size/cap_pos] 100 | neg = neg[torch.rand_like(neg) < sub_sample_size/cap_neg] 101 | ln_pos = pos.shape[0] 102 | ln_neg = neg.shape[0] 103 | pos_expand = pos.view(-1,1).expand(-1,ln_neg).reshape(-1) 104 | neg_expand = neg.repeat(ln_pos) 105 | diff = neg_expand - pos_expand 106 | Lp = diff[diff>0] # because we're taking positive diffs, we got pos and neg flipped. 
107 | ln_Lp = Lp.shape[0]-1 108 | diff_neg = -1.0 * diff[diff<0] 109 | diff_neg = diff_neg.sort()[0] 110 | ln_neg = diff_neg.shape[0]-1 111 | ln_neg = max([ln_neg, 0]) 112 | left_wing = int(ln_Lp*delta) 113 | left_wing = max([0,left_wing]) 114 | left_wing = min([ln_neg,left_wing]) 115 | default_gamma = torch.tensor(0.2, dtype=torch.float).cuda() 116 | if diff_neg.shape[0] > 0 : 117 | gamma = diff_neg[left_wing] 118 | else: 119 | gamma = default_gamma # default=torch.tensor(0.2, dtype=torch.float).cuda() #zoink 120 | L1 = diff[diff>-1.0*gamma] 121 | if epoch > -1 : 122 | return gamma 123 | else : 124 | return default_gamma 125 | 126 | 127 | def roc_star_loss(_y_true, y_pred, gamma, _epoch_true, epoch_pred): 128 | """ 129 | Nearly direct loss function for AUC. 130 | See article, 131 | C. Reiss, "Roc-star : An objective function for ROC-AUC that actually works." 132 | https://github.com/iridiumblue/articles/blob/master/roc_star.md 133 | _y_true: `Tensor`. Targets (labels). Float either 0.0 or 1.0 . 134 | y_pred: `Tensor` . Predictions. 135 | gamma : `Float` Gamma, as derived from last epoch. 136 | _epoch_true: `Tensor`. Targets (labels) from last epoch. 137 | epoch_pred : `Tensor`. Predicions from last epoch. 138 | """ 139 | #convert labels to boolean 140 | y_true = (_y_true>=0.50) 141 | epoch_true = (_epoch_true>=0.50) 142 | 143 | # if batch is either all true or false return small random stub value. 144 | if torch.sum(y_true)==0 or torch.sum(y_true) == y_true.shape[0]: return torch.sum(y_pred)*1e-8 145 | 146 | pos = y_pred[y_true] 147 | neg = y_pred[~y_true] 148 | 149 | epoch_pos = epoch_pred[epoch_true] 150 | epoch_neg = epoch_pred[~epoch_true] 151 | 152 | # Take random subsamples of the training set, both positive and negative. 153 | max_pos = 1000 # Max number of positive training samples 154 | max_neg = 1000 # Max number of positive training samples 155 | cap_pos = epoch_pos.shape[0] 156 | epoch_pos = epoch_pos[torch.rand_like(epoch_pos) < max_pos/cap_pos] 157 | epoch_neg = epoch_neg[torch.rand_like(epoch_neg) < max_neg/cap_pos] 158 | 159 | ln_pos = pos.shape[0] 160 | ln_neg = neg.shape[0] 161 | 162 | # sum positive batch elements agaionst (subsampled) negative elements 163 | if ln_pos>0 : 164 | pos_expand = pos.view(-1,1).expand(-1,epoch_neg.shape[0]).reshape(-1) 165 | neg_expand = epoch_neg.repeat(ln_pos) 166 | 167 | diff2 = neg_expand - pos_expand + gamma 168 | l2 = diff2[diff2>0] 169 | m2 = l2 * l2 170 | else: 171 | m2 = torch.tensor([0], dtype=torch.float).cuda() 172 | 173 | # Similarly, compare negative batch elements against (subsampled) positive elements 174 | if ln_neg>0 : 175 | pos_expand = epoch_pos.view(-1,1).expand(-1, ln_neg).reshape(-1) 176 | neg_expand = neg.repeat(epoch_pos.shape[0]) 177 | 178 | diff3 = neg_expand - pos_expand + gamma 179 | l3 = diff3[diff3>0] 180 | m3 = l3*l3 181 | else: 182 | m3 = torch.tensor([0], dtype=torch.float).cuda() 183 | 184 | if (torch.sum(m2)+torch.sum(m3))!=0 : 185 | res2 = torch.sum(m2)/max_pos+torch.sum(m3)/max_neg 186 | else: 187 | res2 = torch.sum(m2)+torch.sum(m3) 188 | 189 | res2 = torch.where(torch.isnan(res2), torch.zeros_like(res2), res2) 190 | 191 | return res2 192 | 193 | 194 | #https://github.com/keitakurita/Better_LSTM_PyTorch/blob/master/better_lstm/model.py 195 | class VariationalDropout(nn.Module): 196 | """ 197 | Applies the same dropout mask across the temporal dimension 198 | See https://arxiv.org/abs/1512.05287 for more details. 
199 | Note that this is not applied to the recurrent activations in the LSTM like the above paper. 200 | Instead, it is applied to the inputs and outputs of the recurrent layer. 201 | """ 202 | def __init__(self, dropout, batch_first=False): 203 | super().__init__() 204 | self.dropout = dropout 205 | self.batch_first = batch_first 206 | 207 | def forward(self, x: torch.Tensor) -> torch.Tensor: 208 | if not self.training or self.dropout <= 0.: 209 | return x 210 | 211 | is_packed = isinstance(x, PackedSequence) 212 | if is_packed: 213 | x, batch_sizes = x 214 | max_batch_size = int(batch_sizes[0]) 215 | else: 216 | batch_sizes = None 217 | max_batch_size = x.size(0) 218 | 219 | # Drop same mask across entire sequence 220 | if self.batch_first: 221 | m = x.new_empty(max_batch_size, 1, x.size(2), requires_grad=False).bernoulli_(1 - self.dropout) 222 | else: 223 | m = x.new_empty(1, max_batch_size, x.size(2), requires_grad=False).bernoulli_(1 - self.dropout) 224 | x = x.masked_fill(m == 0, 0) / (1 - self.dropout) 225 | 226 | if is_packed: 227 | return PackedSequence(x, batch_sizes) 228 | else: 229 | return x 230 | 231 | 232 | #https://github.com/keitakurita/Better_LSTM_PyTorch/blob/master/better_lstm/model.py 233 | class LSTM(nn.LSTM): 234 | def __init__(self, *args, dropouti: float=0., 235 | dropoutw: float=0., dropouto: float=0., 236 | batch_first=True, unit_forget_bias=True, **kwargs): 237 | super().__init__(*args, **kwargs, batch_first=batch_first) 238 | self.unit_forget_bias = unit_forget_bias 239 | self.dropoutw = dropoutw 240 | self.input_drop = VariationalDropout(dropouti, 241 | batch_first=batch_first) 242 | self.output_drop = VariationalDropout(dropouto, 243 | batch_first=batch_first) 244 | self._init_weights() 245 | 246 | def _init_weights(self): 247 | """ 248 | Use orthogonal init for recurrent layers, xavier uniform for input layers 249 | Bias is 0 except for forget gate 250 | """ 251 | for name, param in self.named_parameters(): 252 | if "weight_hh" in name: 253 | nn.init.orthogonal_(param.data) 254 | elif "weight_ih" in name: 255 | nn.init.xavier_uniform_(param.data) 256 | elif "bias" in name and self.unit_forget_bias: 257 | nn.init.zeros_(param.data) 258 | param.data[self.hidden_size:2 * self.hidden_size] = 1 259 | 260 | def _drop_weights(self): 261 | for name, param in self.named_parameters(): 262 | if "weight_hh" in name: 263 | getattr(self, name).data = \ 264 | torch.nn.functional.dropout(param.data, p=self.dropoutw, 265 | training=self.training).contiguous() 266 | 267 | def forward(self, input, hx=None): 268 | self._drop_weights() 269 | input = self.input_drop(input) 270 | seq, state = super().forward(input, hx=hx) 271 | return self.output_drop(seq), state 272 | 273 | 274 | 275 | class NeuralNet(nn.Module): 276 | def __init__(self, embedding_matrix,h_params): 277 | super(NeuralNet, self).__init__() 278 | embed_size = embedding_matrix.shape[1] 279 | 280 | self.embedding = nn.Embedding(max_features, embed_size) 281 | self.embedding.weight = nn.Parameter(torch.tensor(embedding_matrix, dtype=torch.float32)) 282 | self.embedding.weight.requires_grad = False 283 | self.h_params = copy(h_params) 284 | 285 | self.c1 = nn.Conv1d(300,kernel_size=2,out_channels=300,padding=1) 286 | LSTM_UNITS=h_params.lstm_units 287 | BIDIR = h_params.bidirectional 288 | 289 | LSTM_OUT = 2* LSTM_UNITS if BIDIR else LSTM_UNITS 290 | 291 | self.lstm1 = LSTM(embed_size, LSTM_UNITS, dropouti=h_params.dropout_i,dropoutw=h_params.dropout_w, dropouto=h_params.dropout_o,bidirectional=BIDIR, batch_first=True) 
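        # lstm2 consumes lstm1's output, which is LSTM_OUT wide (2*LSTM_UNITS when bidirectional).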
292 | self.lstm2 = LSTM(LSTM_OUT, LSTM_UNITS, dropouti=h_params.dropout_i,dropoutw=h_params.dropout_w, dropouto=h_params.dropout_o,bidirectional=BIDIR, batch_first=True) 293 | 294 | 295 | self.linear1 = nn.Linear(2*LSTM_OUT, 2*LSTM_OUT) 296 | self.linear2 = nn.Linear(2*LSTM_OUT, 2*LSTM_OUT) 297 | 298 | self.hey_norm = nn.LayerNorm(2*LSTM_OUT) 299 | 300 | self.linear_out = nn.Linear(2*LSTM_OUT, h_params.dense_hidden_units) 301 | self.linear_xtra = nn.Linear(h_params.dense_hidden_units,int(h_params.dense_hidden_units/2)) 302 | self.linear_xtra2 = nn.Linear(int(h_params.dense_hidden_units/2),int(h_params.dense_hidden_units/4)) 303 | self.linear_out2= nn.Linear(int(h_params.dense_hidden_units/4), 1) 304 | 305 | 306 | 307 | def forward(self, x): 308 | h_embedding = self.embedding(x) 309 | h1 = h_embedding.permute(0, 2, 1) 310 | 311 | q1 = self.c1(h1) 312 | f1 = q1.permute(0, 2, 1) 313 | h_lstm1, _ = self.lstm1(f1) 314 | h_lstm2, _ = self.lstm2(h_lstm1) 315 | 316 | # global average pooling 317 | avg_pool = torch.mean(h_lstm2, 1) 318 | # global max pooling 319 | max_pool, _ = torch.max(h_lstm2, 1) 320 | 321 | h_conc = torch.cat((max_pool, avg_pool), 1) 322 | 323 | h_conc_linear1 = self.linear1(h_conc) 324 | h_conc_linear2 = self.linear2(h_conc) 325 | 326 | hidden = h_conc + h_conc_linear1 + h_conc_linear2 327 | hidden = F.selu(self.linear_out(hidden)) 328 | hidden = F.selu(self.linear_xtra(hidden)) 329 | hidden = F.selu(self.linear_xtra2(hidden)) 330 | hidden = F.sigmoid(self.linear_out2(hidden)) 331 | 332 | result=hidden.flatten() 333 | 334 | return result 335 | 336 | 337 | 338 | def train_model(h_params, model, x_train, x_valid, y_train, y_valid, lr, 339 | batch_size=1000, n_epochs=20, title='', graph=''): 340 | global best_result 341 | param_lrs = [{'params': param, 'lr': lr} for param in model.parameters()] 342 | optimizer = torch.optim.AdamW(param_lrs, lr=h_params.initial_lr, 343 | betas=(0.9, 0.999), 344 | eps=1e-6, 345 | amsgrad=False 346 | ) 347 | 348 | train_loader = torch.utils.data.DataLoader(torch.utils.data.TensorDataset(x_train, y_train), batch_size=batch_size, shuffle=True) 349 | valid_loader = torch.utils.data.DataLoader(torch.utils.data.TensorDataset(x_valid, y_valid), batch_size=batch_size, shuffle=False) 350 | 351 | num_batches = len(x_train)/batch_size 352 | print(len(x_train)) 353 | results=[] 354 | 355 | for epoch in range(n_epochs): 356 | start_time = time.time() 357 | model.train() 358 | avg_loss = 0. 
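        # whole_y_pred / whole_y_t (below) accumulate this epoch's predictions and targets;
        # at epoch end they become last_whole_y_pred / last_whole_y_t, the 'background' set
        # handed to roc_star_loss and epoch_update_gamma on the next epoch.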
359 | 360 | 361 | progbar =Progbar(num_batches, stateful_metrics=['train-auc']) 362 | 363 | whole_y_pred=np.array([]) 364 | whole_y_t=np.array([]) 365 | 366 | for i,data in enumerate(train_loader): 367 | x_batch = data[:-1][0] 368 | y_batch = data[-1] 369 | 370 | y_pred = model(x_batch) 371 | 372 | if h_params.use_roc_star and epoch>0 : 373 | 374 | if i==0 : print('*Using Loss Roc-star') 375 | loss = roc_star_loss(y_batch,y_pred,epoch_gamma, last_whole_y_t, last_whole_y_pred) 376 | 377 | else: 378 | if i==0 : print('*Using Loss BxE') 379 | loss = F.binary_cross_entropy(y_pred, 1.0*y_batch) 380 | 381 | logger.report_scalar(title="Loss", series="trains loss", 382 | value=loss, iteration=epoch * len(x_train) + i) 383 | 384 | optimizer.zero_grad() 385 | loss.backward() 386 | # To prevent gradient explosions resulting in NaNs 387 | # https://discuss.pytorch.org/t/nan-loss-in-rnn-model/655/8 388 | # https://github.com/pytorch/examples/blob/master/word_language_model/main.py 389 | torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5) 390 | 391 | optimizer.step() 392 | 393 | whole_y_pred = np.append(whole_y_pred, y_pred.clone().detach().cpu().numpy()) 394 | whole_y_t = np.append(whole_y_t, y_batch.clone().detach().cpu().numpy()) 395 | 396 | if i>0: 397 | if i%50==1 : 398 | try: 399 | train_roc_val = roc_auc_score(whole_y_t>=0.5, whole_y_pred) 400 | except: 401 | 402 | train_roc_val=-1 403 | 404 | progbar.update( 405 | i, 406 | values=[ 407 | ("loss", np.mean(loss.detach().cpu().numpy())), 408 | ("train-auc", train_roc_val) 409 | ] 410 | ) 411 | 412 | model.eval() 413 | last_whole_y_t = torch.tensor(whole_y_t).cuda() 414 | last_whole_y_pred = torch.tensor(whole_y_pred).cuda() 415 | 416 | all_valid_preds = np.array([]) 417 | all_valid_t = np.array([]) 418 | for i, valid_data in enumerate(valid_loader): 419 | x_batch = valid_data[:-1] 420 | y_batch = valid_data[-1] 421 | 422 | y_pred = model(*x_batch).detach().cpu().numpy() 423 | y_t = y_batch.detach().cpu().numpy() 424 | 425 | all_valid_preds=np.concatenate([all_valid_preds,y_pred],axis=0) 426 | all_valid_t = np.concatenate([all_valid_t,y_t],axis=0) 427 | 428 | epoch_gamma = epoch_update_gamma(last_whole_y_t, last_whole_y_pred, epoch,h_params.delta) 429 | 430 | try: 431 | valid_auc = roc_auc_score(all_valid_t>=0.5, all_valid_preds) 432 | except: 433 | valid_auc=-1 434 | 435 | try: 436 | train_roc_val = roc_auc_score(whole_y_t>=0.5, whole_y_pred) 437 | except: 438 | train_roc_val=-1 439 | 440 | elapsed_time = time.time() - start_time 441 | if epoch==0 : 442 | print("\n\n* * * * * * * * * * * Params :", title," :: ", graph) 443 | print('\nEpoch {}/{} \t loss={:.4f} \t time={:.2f}s'.format( 444 | epoch + 1, n_epochs, avg_loss, elapsed_time)) 445 | 446 | print("Gamma = ", epoch_gamma) 447 | print("Validation AUC = ", valid_auc) 448 | if not ('auc' in best_result) or best_result['auc']0] # because we're taking positive diffs, we got pos and neg flipped. 
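    # The sorted gaps of the right-ordered pairs are indexed at ~ DELTA * (number of
    # wrong-ordered pairs), so gamma becomes the padding inside which about delta times
    # as many right-ordered pairs fall as there are wrong-ordered ones.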
24 | ln_Lp = Lp.shape[0]-1 25 | diff_neg = -1.0 * diff[diff<0] 26 | diff_neg = diff_neg.sort()[0] 27 | ln_neg = diff_neg.shape[0]-1 28 | ln_neg = max([ln_neg, 0]) 29 | left_wing = int(ln_Lp*DELTA) 30 | left_wing = max([0,left_wing]) 31 | left_wing = min([ln_neg,left_wing]) 32 | default_gamma=torch.tensor(0.2, dtype=torch.float).cuda() 33 | if diff_neg.shape[0] > 0 : 34 | gamma = diff_neg[left_wing] 35 | else: 36 | gamma = default_gamma # default=torch.tensor(0.2, dtype=torch.float).cuda() #zoink 37 | L1 = diff[diff>-1.0*gamma] 38 | ln_L1 = L1.shape[0] 39 | if epoch > -1 : 40 | return gamma 41 | else : 42 | return default_gamma 43 | 44 | 45 | 46 | def roc_star_loss( _y_true, y_pred, gamma, _epoch_true, epoch_pred): 47 | """ 48 | Nearly direct loss function for AUC. 49 | See article, 50 | C. Reiss, "Roc-star : An objective function for ROC-AUC that actually works." 51 | https://github.com/iridiumblue/articles/blob/master/roc_star.md 52 | _y_true: `Tensor`. Targets (labels). Float either 0.0 or 1.0 . 53 | y_pred: `Tensor` . Predictions. 54 | gamma : `Float` Gamma, as derived from last epoch. 55 | _epoch_true: `Tensor`. Targets (labels) from last epoch. 56 | epoch_pred : `Tensor`. Predicions from last epoch. 57 | """ 58 | #convert labels to boolean 59 | y_true = (_y_true>=0.50) 60 | epoch_true = (_epoch_true>=0.50) 61 | 62 | # if batch is either all true or false return small random stub value. 63 | if torch.sum(y_true)==0 or torch.sum(y_true) == y_true.shape[0]: return torch.sum(y_pred)*1e-8 64 | 65 | pos = y_pred[y_true] 66 | neg = y_pred[~y_true] 67 | 68 | epoch_pos = epoch_pred[epoch_true] 69 | epoch_neg = epoch_pred[~epoch_true] 70 | 71 | # Take random subsamples of the training set, both positive and negative. 72 | max_pos = 1000 # Max number of positive training samples 73 | max_neg = 1000 # Max number of positive training samples 74 | cap_pos = epoch_pos.shape[0] 75 | cap_neg = epoch_neg.shape[0] 76 | epoch_pos = epoch_pos[torch.rand_like(epoch_pos) < max_pos/cap_pos] 77 | epoch_neg = epoch_neg[torch.rand_like(epoch_neg) < max_neg/cap_pos] 78 | 79 | ln_pos = pos.shape[0] 80 | ln_neg = neg.shape[0] 81 | 82 | # sum positive batch elements agaionst (subsampled) negative elements 83 | if ln_pos>0 : 84 | pos_expand = pos.view(-1,1).expand(-1,epoch_neg.shape[0]).reshape(-1) 85 | neg_expand = epoch_neg.repeat(ln_pos) 86 | 87 | diff2 = neg_expand - pos_expand + gamma 88 | l2 = diff2[diff2>0] 89 | m2 = l2 * l2 90 | len2 = l2.shape[0] 91 | else: 92 | m2 = torch.tensor([0], dtype=torch.float).cuda() 93 | len2 = 0 94 | 95 | # Similarly, compare negative batch elements against (subsampled) positive elements 96 | if ln_neg>0 : 97 | pos_expand = epoch_pos.view(-1,1).expand(-1, ln_neg).reshape(-1) 98 | neg_expand = neg.repeat(epoch_pos.shape[0]) 99 | 100 | diff3 = neg_expand - pos_expand + gamma 101 | l3 = diff3[diff3>0] 102 | m3 = l3*l3 103 | len3 = l3.shape[0] 104 | else: 105 | m3 = torch.tensor([0], dtype=torch.float).cuda() 106 | len3=0 107 | 108 | if (torch.sum(m2)+torch.sum(m3))!=0 : 109 | res2 = torch.sum(m2)/max_pos+torch.sum(m3)/max_neg 110 | #code.interact(local=dict(globals(), **locals())) 111 | else: 112 | res2 = torch.sum(m2)+torch.sum(m3) 113 | 114 | res2 = torch.where(torch.isnan(res2), torch.zeros_like(res2), res2) 115 | 116 | return res2 --------------------------------------------------------------------------------