├── README.md
├── data
│   ├── 0.54824_mixedkmeans_k2_elasticnetcv_pca8_5pct_residuals_discarded.csv
│   ├── PCs_vs_clusters.png
│   ├── final_submission_0.53019_mixedkmeans_k2_elasticnetcv_pca8_5pct_residuals_discarded.csv
│   ├── numerai_predictions_format.csv
│   ├── numerai_tournament_data.csv
│   └── numerai_training_data.csv
├── instructions.txt
└── src
    ├── models.py
    ├── numerai.py
    └── scoring.py


/README.md:
--------------------------------------------------------------------------------
I competed in the [numerai](https://numer.ai/) data modeling challenge as user [Pequod](https://numer.ai/ai?pequod), finishing 59th out of 209 competitors. Not great, but I'm satisfied with the result given the limited time I put into it.

The model is very simple:

1. PCA dimension reduction from 14 to 8 continuous variables
2. [Dummy code](http://www.psychstat.missouristate.edu/multibook/mlt08m.html) the categorical variable
3. K-means clustering into 6 clusters
4. Separate [ElasticNetCV linear estimators](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNetCV.html) for each cluster


At one point I was at the head of the leaderboard, and I held the #2 spot until very near the end of the contest. The competition organizers then made it clear that final rankings would be based on a different dataset from the one used for the leaderboard. At that point it became very difficult to gauge how well my solution might do, because my results against the validation set often did not line up with my scores on the leaderboard dataset. In other words, my solution was not generalizing well.

Rather than trying to maintain leaderboard position, I used n-fold cross-validation to identify the combination of model parameters that produced the highest average AUC score across the CV sets while minimizing the spread between the highest- and lowest-scoring sets.
I figured that this way I could have more confidence in how my solution would ultimately perform on the final test set. I chose 8 principal components and 6 k-means-derived clusters (see the cyan highlights in the image below).

![](data/PCs_vs_clusters.png)

My final submission had an AUC of 0.53019 on the leaderboard test set. Even though my highest-ranking solution scored an AUC of 0.54824, I kept the lower-scoring solution posted in the hope that it would generalize well, and that many players above me would be sorely surprised by their results on the final test set because they had overfit to the leaderboard set. My solution scored 0.52854 on the final test set, very close to its score on the leaderboard set.


--------------------------------------------------------------------------------
/data/PCs_vs_clusters.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/dbjohnson/numerai/767fa11cd9fedf6d2c2ba979345ee2fa8cb51a8e/data/PCs_vs_clusters.png


--------------------------------------------------------------------------------
/data/numerai_predictions_format.csv:
--------------------------------------------------------------------------------
"t_id","probability"
13843,0.54
13677,0.51
547,0.59

--------------------------------------------------------------------------------
/instructions.txt:
--------------------------------------------------------------------------------
http://numer.ai


# Numerai Tournament Mechanics

Here we discuss data, tournament evaluation, and payouts.

## Training Data

#### numerai_training_data.csv
Use this dataset to train your machine learning algorithm. The first fourteen
columns (`f1` - `f14`) are integer features.
Column `c1` is a categorical
feature, column `validation` indicates a dataset that you can use to validate
your model, and `target` is the binary class you’re trying to predict.

## Tournament Data

#### numerai_tournament_data.csv
Once you’ve built your model, you can use it to generate probability estimates
on this new dataset.

## Upload Predictions

You can then upload your predictions and see how well your model performed
versus the other tournament participants. The format of your prediction upload
should be a CSV with two columns, `t_id` and `probability`.

To upload your predictions, log in and click the *Upload Predictions* button on
the homepage. Your score will appear on the leaderboard.

## Evaluation

To be successful, a lot of your predictions need to be correct, and where your
model makes its most confident predictions, it needs to be especially correct.
We evaluate your models with the [Area Under The Receiver Operating
Characteristic Curve measure](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve).
We only report your AUC score on 1/3rd of the predictions you submit. This
prevents overfitting.

## Payouts

The leaderboard will change frequently as new predictions are uploaded, and so
payouts will change too. To receive payment, you need to have placed in the top
10 when the timer runs out, and furthermore you need to come back to apply your
model to a new live dataset to claim your prize. We will email you a link to a
live encrypted dataset to predict on when the timer runs out. You’ll have 24
hours to use your model to predict on this dataset. By requiring that you upload
predictions, instead of the code related to your model, we ensure that you can
retain all intellectual property rights to your model.
You never have to tell us
how you built your model. It’s yours forever, and you can apply your ideas to
all of our subsequent tournaments, and soon submit your predictions directly to
our API.

We want you to be able to remain completely anonymous on Numerai. To facilitate
anonymity, we offer payments in Bitcoin as well as USD.


http://numer.ai

--------------------------------------------------------------------------------
/src/models.py:
--------------------------------------------------------------------------------
import functools
import pandas as pd
import numpy as np
import sklearn.linear_model
import sklearn.ensemble
import sklearn.metrics
from sklearn.cluster import KMeans

import scoring


class PCA(object):
    def __init__(self, n_princ_comp):
        self.n_princ_comp = n_princ_comp

    def fit(self, A):
        M = A - np.mean(A, axis=0)  # subtract the mean (along columns)
        self.eig_vals, self.eig_vecs = np.linalg.eig(np.cov(M.T))
        # argsort is ascending, so the last n_princ_comp indices pick the largest eigenvalues
        sorted_idx = np.argsort(self.eig_vals)
        self.eig_vecs_kept = self.eig_vecs[:, sorted_idx[-self.n_princ_comp:]]

    def transform(self, A):
        # project onto the kept components (the data is not re-centered here)
        return A.dot(self.eig_vecs_kept)


def clean_dataset(df, quantile=0.95):
    # drop the rows whose largest per-feature z-score is above the given quantile
    means = df[scoring.float_cols].mean()
    stds = df[scoring.float_cols].std()
    zscores = (df[scoring.float_cols] - means).abs() / stds
    rowmax_zscores = zscores.max(axis=1)
    thresh = rowmax_zscores.quantile(quantile)
    idx = rowmax_zscores > thresh
    return df.drop(df.index[idx])


def balance_dataset(df):
    # within each level of c1, drop rows from the majority class until both classes are equal in size
    dfs = []
    for c, dfc in df.groupby('c1'):
        target = dfc['target'].values
        ones = np.sum(target == 1)
        zeros = np.sum(target == 0)
        if ones != zeros:
            if ones > zeros:
                drop_idx = np.where(target == 1)[0][:ones - zeros]
            else:
                drop_idx = np.where(target == 0)[0][:zeros - ones]
            dfc = dfc.drop(dfc.index[drop_idx])
        dfs.append(dfc)
    return pd.concat(dfs)


class DummyCoder(object):
    def fit(self, series):
        self.unique_vals = set(series.unique())
        # drop one (arbitrary) level so it acts as the reference category
        self.unique_vals.pop()

    def encode(self, series):
        df_encoded = pd.DataFrame()
        for i, v in enumerate(self.unique_vals):
            df_encoded['d{}'.format(i)] = series.apply(lambda x: 1 if x == v else 0)
        return df_encoded.values


class MixedModelKmeans(object):
    def __init__(self, ycol='target', xcols=scoring.float_cols, categorical_col='c1',
                 base_estimator=functools.partial(sklearn.linear_model.ElasticNetCV, l1_ratio=[.1, .5, .7, .9, .95, .99, 1]),
                 n_princ_comp=8,
                 n_clusters=2):
        self.ycol = ycol
        self.xcols = xcols
        self.categorical_col = categorical_col
        self.n_clusters = n_clusters
        self.base_estimator = base_estimator
        self.n_princ_comp = n_princ_comp

    def fit(self, df):
        self.dummycoder = DummyCoder()
        self.dummycoder.fit(df[self.categorical_col])
        self.kmodel = KMeans(n_clusters=self.n_clusters, random_state=1332)

        clusters = self.kmodel.fit_predict(df[self.xcols])
        self.cluster_to_model = {}
        # clusters = df['c1'].values
        for cluster in set(clusters):
            # fit a separate PCA and linear estimator on the rows of each cluster
            dfc = df.iloc[clusters == cluster]
            pca = PCA(self.n_princ_comp)
            pca.fit(dfc[self.xcols])
            x = np.hstack((pca.transform(dfc[self.xcols]), self.dummycoder.encode(dfc[self.categorical_col])))
            y = dfc[self.ycol]
            model = self.base_estimator()
            model.fit(x, y)
            self.cluster_to_model[cluster] = model, pca

    def predict(self, df):
        clusters = self.kmodel.predict(df[self.xcols])
        res = np.zeros(len(df))
        # clusters = df['c1'].values  (disabled, as in fit(): it would discard the k-means assignments)
        for c in set(clusters):
            model, pca = self.cluster_to_model[c]
            idx = clusters == c
            x = pca.transform(df[self.xcols][idx])
            dc = self.dummycoder.encode(df[self.categorical_col][idx])
            x = np.hstack((x, dc))
            res[idx] = model.predict(x)
        return res

    def score(self, df):
        yhat = self.predict(df)
        return sklearn.metrics.roc_auc_score(df[self.ycol], yhat)

--------------------------------------------------------------------------------
/src/numerai.py:
--------------------------------------------------------------------------------
import numpy as np
from matplotlib import pylab as plt

import models
import scoring

try:
    from importlib import reload  # Python 3; on Python 2 reload is a builtin
except ImportError:
    pass


cross_validation_sets = scoring.make_cv_set(cv=4)


def explore(pc_range=range(1, len(scoring.float_cols)+1), cluster_range=range(1, 10), discard=0.05):
    reload(models)
    best_score = None
    min_aucs = []
    max_aucs = []
    for n_princ_comp in pc_range:
        min_aucs.append([])
        max_aucs.append([])
        for n_clusters in cluster_range:
            m, min_auc, max_auc = make_model(n_princ_comp=n_princ_comp, n_clusters=n_clusters, discard=discard)
            print('pc: {}, clusters: {}, discard: {}, min auc: {} max auc: {}'.format(n_princ_comp, n_clusters, discard, min_auc, max_auc))
            min_aucs[-1].append(min_auc)
            max_aucs[-1].append(max_auc)
            # keep the configuration whose worst cross-validation fold scores highest
            if not best_score or min_auc > best_score:
                print('new best!')
                best_score = min_auc
                # retrain model on full training set for submission
                fit_model(m, scoring.df, discard)
                scoring.make_submission(m, '_min{:1.5f}_max{:1.4f}_p{}_k{}_d{}.csv'.format(min_auc, max_auc, n_princ_comp, n_clusters, discard))

    plt.subplot(121)
    plt.imshow(min_aucs, interpolation='nearest')
    plt.subplot(122)
    plt.imshow(max_aucs, interpolation='nearest')


def make_model(n_princ_comp, n_clusters, discard):
    m = models.MixedModelKmeans(n_clusters=n_clusters, n_princ_comp=n_princ_comp)
    scores = []
    for df_train, df_validation in cross_validation_sets:
        fit_model(m, df_train, discard)
        scores.append(m.score(df_validation))
    return m, min(scores), max(scores)


def fit_model(m, df, discard):
    m.fit(df)
    if discard > 0:
        # refit after dropping the rows with the largest absolute residuals
        residuals = np.abs(df['target'] - m.predict(df))
        thresh = residuals.quantile(1 - discard)
        m.fit(df.drop(df.index[np.where(residuals > thresh)[0]]))

--------------------------------------------------------------------------------
/src/scoring.py:
--------------------------------------------------------------------------------
from collections import defaultdict
import pandas as pd
import numpy as np
from matplotlib import pylab as plt
from sklearn import metrics


df = pd.read_csv('numerai_training_data.csv')
float_cols = [c for c in df.columns if c.startswith('f')]
#df['target'] = df['target'].map(lambda x: 1 if x else -1)
df_train = df[df['validation'] == 0]
df_test = df[df['validation'] == 1]

df_tourney = pd.read_csv('numerai_tournament_data.csv')


def area_under_curve(model, use_trainset=False):
    df_eval = df_train if use_trainset else df_test
    scores = model.predict(df_eval)
    uniq = np.unique(np.append(scores, [0, 1]))
    uniq.sort()
    tprs = []
    fprs = []

    # sweep the threshold from high to low so that fpr and tpr are non-decreasing
    for t in reversed(uniq):
        tn = len(df_eval[(scores < t) & (df_eval['target'] != 1)])
        tp = len(df_eval[(scores >= t) & (df_eval['target'] == 1)])
        fn = len(df_eval[(scores < t) & (df_eval['target'] == 1)])
        fp = len(df_eval[(scores >= t) & (df_eval['target'] != 1)])
        tpr = float(tp) / max(1, tp + fn)
        fpr = float(fp) / max(1, tn + fp)
        tprs.append(tpr)
        fprs.append(fpr)

    auc = metrics.auc(fprs, tprs)

    pos_idx = (df_eval['target'] == 1).values
    neg_idx = ~pos_idx
    plt.subplot(121)
    plt.hist(scores[neg_idx], 50, color='r', alpha=0.5)
    plt.hist(scores[pos_idx], 50, color='b', alpha=0.5)
    plt.subplot(122)
    plt.plot(fprs, tprs)
    plt.title(auc)
    plt.plot([0, 1], [0, 1])
    print(auc)


def make_submission(model, fn='submission.csv'):
    scores = model.predict(df_tourney)
    df_tourney['probability'] = scores
    # index=False keeps the upload to the two required columns
    df_tourney[['t_id', 'probability']].to_csv(fn, index=False)
    print('Submission complete!')


def make_cv_set(cv=4, seed=1337):
    np.random.seed(seed)

    # assign folds separately within each level of c1 so every fold sees every category
    cv_to_train_dfcs = defaultdict(list)
    cv_to_test_dfcs = defaultdict(list)
    for x, dfx in df.groupby('c1'):
        cv_sets = np.random.choice(cv, size=len(dfx))
        for c in range(cv):
            train = dfx[cv_sets != c]
            test = dfx[cv_sets == c]
            cv_to_train_dfcs[c].append(train)
            cv_to_test_dfcs[c].append(test)

    train_validation_sets = []
    for c in range(cv):
        train = pd.concat(cv_to_train_dfcs[c])
        test = pd.concat(cv_to_test_dfcs[c])
        train_validation_sets.append((train, test))

    return train_validation_sets

--------------------------------------------------------------------------------
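The threshold sweep in `area_under_curve` can be cross-checked against the pairwise reading of AUC: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as half. The sketch below is not part of the repository; `rank_auc` is a hypothetical helper name, it uses only the standard library, and it is quadratic in the number of rows, so it is meant for spot checks on small samples rather than the full training set.

```python
def rank_auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win,
    which matches the trapezoidal area under the ROC curve.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Four made-up rows: three pairs are ordered correctly and one is tied.
print(rank_auc([1, 0, 1, 0], [0.59, 0.51, 0.54, 0.54]))  # 0.875
```

On the same labels and scores this should agree with `sklearn.metrics.roc_auc_score`, which `MixedModelKmeans.score` already uses.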