├── README.md
├── data
│   └── train.csv
├── dboostForest.R
├── dboostFun.R
├── dboostFunCommented.R
└── dboostStump.R
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # dboost: powered by OVER 9000 reg monkeys
2 | Stochastic Dummy Boosting
3 | 
4 | Michael S Kim (03/03/2016)
5 | 
6 | The base algorithm is about 30 lines of code once you take out the comments. I highly suggest reading those 30 lines to get the details; it should take you much less than 30 minutes. Given this is GitHub, where I assume most people can code, it's probably easiest to just read the code.
7 | 
8 | I randomly create dummy variables by randomly splitting feature columns at a threshold, which is basically a decision stump. There is an option to hash a column's values before thresholding, so that splits are not based only on numerical face-value rank. Ridge regression on these random dummies is my weak learner for boosting. From there it's just GBM. (A minimal sketch of one round is at the bottom of this README.)
9 | 
10 | This is not random tree embedding and then xgblinear. It is not feature engineering and then GBM. The space of all possible dummy encodings across all possible columns is not something one can enumerate in practice.
11 | 
12 | # Each iteration of dboost will have a very different dummy encoding, and the resulting learner will be very weak because it is random.
13 | 
14 | Think of all the possible ways you can dummy just one feature column with the three distinct values 3, 4, 5. You can dummy as:
15 | 
16 | - 3 is one, else zero
17 | 
18 | - 4 is one, else zero
19 | 
20 | - 5 is one, else zero
21 | 
22 | - 3 or 4 is one, else zero
23 | 
24 | - 4 or 5 is one, else zero
25 | 
26 | - 3 or 5 is one, else zero
27 | 
28 | That is 2^3 - 2 = 6 nontrivial encodings for one sample column with only 3 distinct values, and raw thresholds reach just 2 of them ({5} and {4,5}), which is why the hashing option, by reshuffling the value order every iteration, matters. A column could easily have thousands of distinct values, and you could have many columns. In practice, the possible ways to dummy are endless.
29 | 
30 | The intuition behind this method is that I have not been able to boost strong learners very well, so I looked towards a diverse set of uncorrelated weak learners. That is why you see random dummy variables and ridge regression. This method also happens to be very fast in theory (you can use online ridge, non-cryptographic hashing is cheap, etc.).
31 | 
32 | Read dboostFunCommented.R for the easy version of the code without some features. The other file, dboostFun.R, adds a few features that improve my cross-validated scores in my local experiments.
33 | 
34 | In my very limited local testing, tuned dboost should get scores around those of an untuned random forest. This may not hold in general, but this (hopefully new) meta algorithm is promising.
35 | 
36 | In my Kaggle experience it's basically XG >= NN > RF as the best level-1 algorithm. I usually try every algorithm in caret and the results are subpar. If dboost ends up at the level of RF (on even a subset of Kaggle problems), the double-bagged / meta-bagged version of XG+DB/NN should produce top-10 results. Again, more testing is required.
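For the impatient, here is a minimal sketch of one boosting round. This is an illustration, not code from this repo: `one_round`, its arguments, and its defaults are made up for this README, and it leaves out the row resampling, hashing, feature crosses, and the `step0` shrinkage update that the real loop in dboostFunCommented.R / dboostFun.R performs.

```r
# Hypothetical sketch of one dboost round (illustration only; see dboostFunCommented.R).
# Assumes X is a numeric matrix whose columns each have more than one distinct value
# (R's sample() misbehaves on length-1 inputs), and r is the current residual vector.
library(glmnet)

one_round = function(X, r, ncols = 5, lambda = 0.3) {
  cols = sample(ncol(X), ncols, replace = TRUE)   # random column subsample
  D = sapply(cols, function(k) {
    cutoff = sample(unique(X[, k]), 1)            # random threshold = one random dummy
    as.numeric(X[, k] > cutoff)                   # 0/1 dummy column
  })
  glmnet(x = D, y = r, family = "gaussian", alpha = 0, lambda = lambda)  # ridge weak learner
}
```

Boosting is then just a loop: fit `one_round` on the current residuals and add a small multiple (`step0`) of its predictions to the running fit.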
37 | -------------------------------------------------------------------------------- /data/train.csv: -------------------------------------------------------------------------------- 1 | feat1,feat2,feat3,feat4,feat5,feat6,feat7,feat8,feat9,feat10,feat11,feat12,feat13,feat14,feat15,feat16,feat17,feat18,feat19,feat20,feat21,feat22,feat23,feat24,feat25,feat26,feat27,feat28,feat29,feat30,feat31,feat32,feat33,feat34,feat35,feat36,feat37,feat38,feat39,feat40,target 2 | 14,2,2,4,4,5,5,2,2,5,5,5,5,2,5,5,0,0,0,0,0,3,1,2,1,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,3778622 3 | 21,2,2,4,5,4,4,2,2,5,4,4,5,4,4,5,0,0,0,0,0,5,5,3,3,5,0,0,0,0,4,3,0,0,0,0,0,0,0,0,7201784 4 | 4,1,3,2,2,4,4,2,1,5,4,5,5,3,5,5,0,0,0,0,0,3,4,1,2,2,0,0,0,0,1,3,0,0,0,0,0,0,0,0,1734635 5 | 23,2,2,9,6,6,6,4,4,10,8,10,10,8,10,7.5,0,0,0,0,0,25,15,15,3,20,0,0,0,0,10,2.5,0,0,0,0,0,0,0,0,3745136 6 | 17,1,2,4,5,3,5,2,2,5,4,4,5,4,4,5,0,0,0,0,0,2,1,1,1,1,0,0,0,0,3,3,0,0,0,0,0,0,0,0,6363241 7 | 17,1,2,4,5,4,4,1,3,5,4,4,5,4,4,5,0,0,0,0,0,1,4,1,2,1,0,0,0,0,4,3,0,0,0,0,0,0,0,0,4100886 8 | 17,1,3,12,7.5,6,6,2,10,10,10,10,10,4,10,7.5,3,10,15,3,12,10,9,3,2,5,8,8,10,2.5,7.5,7.5,5,15,20,2,12,3,16,4,3218919 9 | 4,1,2,4,5,4,4,1,2,5,5,5,5,1,5,5,0,0,0,0,0,3,2,1,3,2,0,0,0,0,3,3,0,0,0,0,0,0,0,0,6923132 10 | 32,2,3,6,4.5,6,6,4,4,10,8,10,10,8,10,7.5,6,4,9,3,12,20,12,6,1,10,2,2,2.5,2.5,2.5,7.5,25,12,10,6,18,12,12,6,2675511 11 | 4,1,3,3,5,4,5,3,4,5,4,4,4,5,4,4,1,1,2,1,1,5,5,1,3,2,2,2,2,1,3,3,5,2,3,2,2,2,3,3,2156098 12 | 13,2,3,3,4,4,4,2,2,5,4,5,5,3,5,5,3,5,5,4,4,5,5,1,1,5,2,4,1,5,1,3,5,1,2,2,4,5,5,4,5525735 13 | 18,1,3,3,5,5,3,1,2,5,5,5,5,2,5,5,2,2,4,2,4,2,5,2,4,2,1,1,1,4,3,3,5,5,4,2,2,4,3,3,4250758 14 | 26,2,3,2,4,5,4,1,1,3,4,4,4,4,2,3,0,0,0,0,0,1,2,2,1,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,4651867 15 | 8,2,2,1,1,4,3,1,1,1,4,5,5,3,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,4263629 16 | 17,1,2,3,5,4,4,2,5,5,5,5,5,2,5,5,0,0,0,0,0,4,5,2,2,1,0,0,0,0,4,3,0,0,0,0,0,0,0,0,9652351 17 | 17,1,3,2,4,4,4,2,5,5,4,5,4,3,5,4,4,4,5,5,5,2,4,1,2,1,5,5,5,1,2,3,5,5,5,3,4,4,5,3,5966193 18 | 1,2,3,4,5,4,3,1,3,5,5,5,5,1,5,5,2,2,4,2,4,3,3,1,4,4,3,3,3,2,2,3,4,5,5,3,3,4,4,3,6313222 19 | 18,1,3,3,5,4,4,2,2,5,3,4,5,5,4,5,3,2,3,1,4,4,5,5,3,5,5,2,5,3,5,1,4,3,3,2,3,4,3,1,2525376 20 | 21,2,2,3,4,4,4,2,3,5,5,5,5,1,5,5,0,0,0,0,0,2,3,1,2,2,0,0,0,0,2,3,0,0,0,0,0,0,0,0,2544858 21 | 17,1,2,5,5,4,5,1,4,5,3,4,4,5,3,4,0,0,0,0,0,5,5,5,5,4,0,0,0,0,5,0,0,0,0,0,0,0,0,0,16549065 22 | 28,2,3,1,1,3,3,2,1,1,5,5,5,2,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,3,0,0,0,0,0,0,0,0,3008199 23 | 17,1,2,4,5,4,5,2,4,5,4,4,4,3,3,4,0,0,0,0,0,5,5,3,5,3,0,0,0,0,5,1,0,0,0,0,0,0,0,0,4052734 24 | 25,2,2,2,3,4,4,2,1,5,5,5,5,1,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,2993069 25 | 29,2,2,2,4,4,4,2,2,5,5,5,5,2,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,3,0,0,0,0,0,0,0,0,1270499 26 | 17,1,3,12,7.5,6,6,2,8,10,8,8,8,8,8,6,12,10,12,9,9,20,15,6,4,25,10,8,10,2.5,12.5,2.5,20,9,15,4,12,12,12,4,5500819 27 | 5,2,2,6,6,6,4.5,2,8,10,10,10,8,4,10,6,0,0,0,0,0,5,15,3,2,5,0,0,0,0,5,7.5,0,0,0,0,0,0,0,0,3752885 28 | 18,1,2,4,5,4,4,2,3,5,4,4,5,5,4,5,0,0,0,0,0,3,2,2,1,1,0,0,0,0,3,3,0,0,0,0,0,0,0,0,9262754 29 | 1,2,3,4,5,4,3,1,2,5,4,4,5,5,4,5,2,2,4,2,4,5,5,4,3,5,3,3,3,4,3,2,4,1,3,3,3,4,3,3,3903884 30 | 26,2,2,1,2,5,4,1,2,1,5,4,4,2,4,3,0,0,0,0,0,1,1,1,1,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,3570392 31 | 4,1,3,9,7.5,6,7.5,8,6,10,10,8,8,4,6,6,3,2,9,3,12,25,12,6,2,10,4,4,5,2.5,7.5,2.5,20,9,20,4,18,12,12,2,3871345 32 | 4,1,2,4,5,4,5,2,2,5,3,4,5,5,4,5,0,0,0,0,0,5,5,5,5,5,0,0,0,0,5,0,0,0,0,0,0,0,0,0,3347768 33 | 
24,2,2,2,2,4,3,1,1,1,4,5,5,3,5,5,0,0,0,0,0,4,3,1,1,2,0,0,0,0,2,3,0,0,0,0,0,0,0,0,1882132 34 | 18,1,3,4,4,5,3,1,3,5,5,5,5,2,5,5,3,2,4,1,4,1,3,2,2,3,3,2,3,5,2,3,4,3,3,2,4,4,3,1,4250554 35 | 17,1,3,4,5,4,4,2,2,5,4,4,4,5,3,4,0,0,0,0,0,4,5,2,2,2,0,0,0,0,3,3,0,0,0,0,0,0,0,0,2390534 36 | 5,2,2,1,2,5,4,1,2,1,4,5,5,3,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,2,0,0,0,0,0,0,0,0,2025298 37 | 17,1,3,12,7.5,6,7.5,2,8,10,8,8,8,10,6,6,15,8,12,6,9,25,15,9,5,15,6,10,10,5,12.5,2.5,25,15,25,6,18,3,16,6,4286646 38 | 17,1,2,4,5,4,4,1,3,5,5,5,5,2,5,5,0,0,0,0,0,3,3,1,3,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,3600467 39 | 4,1,2,6,6,4.5,7.5,8,10,10,8,8,8,10,8,6,0,0,0,0,0,5,6,3,1,5,0,0,0,0,7.5,5,0,0,0,0,0,0,0,0,5017320 40 | 34,2,2,2,3,4,4,2,1,1,4,4,5,3,4,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,3,0,0,0,0,0,0,0,0,1763232 41 | 4,1,3,12,7.5,6,6,2,6,10,8,10,10,6,10,7.5,6,4,9,3,12,10,15,6,3,10,4,4,5,2.5,5,7.5,10,15,20,4,18,12,12,2,3164973 42 | 27,2,2,3,4,3,4,2,1,5,5,5,5,2,5,5,0,0,0,0,0,3,1,1,1,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,3376145 43 | 17,1,2,12,7.5,6,6,2,8,10,10,8,8,4,8,6,0,0,0,0,0,15,9,3,3,5,0,0,0,0,5,7.5,0,0,0,0,0,0,0,0,3426169 44 | 12,2,3,6,4.5,6,7.5,6,4,10,10,10,10,2,10,7.5,0,0,0,0,0,25,3,3,1,10,0,0,0,0,5,2.5,0,0,0,0,0,0,0,0,5444228 45 | 17,1,2,2,2,4,4,2,2,5,5,5,5,2,5,5,0,0,0,0,0,3,4,1,2,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,4590423 46 | 17,1,2,12,7.5,7.5,6,2,10,10,8,8,10,8,8,7.5,0,0,0,0,0,25,15,6,3,25,0,0,0,0,10,7.5,0,0,0,0,0,0,0,0,6782425 47 | 7,2,3,3,4,5,4,2,2,5,4,5,5,4,5,5,5,4,5,5,4,5,5,4,2,5,1,1,1,0,2,3,5,5,5,5,4,5,5,4,4758477 48 | 28,2,2,2,4,4,4,2,2,5,3,4,4,5,4,4,0,0,0,0,0,2,2,1,1,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,2999068 49 | 18,1,3,4,5,4,3,1,2,5,5,5,5,2,5,5,3,2,3,2,3,4,5,1,4,2,5,3,3,3,2,3,3,5,5,4,4,4,3,2,5337527 50 | 17,1,3,4,5,4,4,2,3,5,5,5,5,1,5,5,0,0,0,0,0,2,2,2,1,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,3780019 51 | 31,2,2,3,4,4,4,1,2,5,5,5,5,2,5,5,0,0,0,0,0,2,3,1,2,1,0,0,0,0,2,2,0,0,0,0,0,0,0,0,2097022 52 | 11,2,2,4,5,2,5,2,2,5,4,4,4,3,4,4,0,0,0,0,0,4,4,1,2,3,0,0,0,0,1,3,0,0,0,0,0,0,0,0,2738052 53 | 33,2,3,9,6,6,6,4,6,10,6,8,10,10,8,7.5,9,10,15,3,12,15,15,9,2,25,6,4,7.5,2.5,5,7.5,20,9,15,4,24,12,16,2,4780607 54 | 17,1,2,4,5,4,5,1,3,5,5,5,5,2,5,5,0,0,0,0,0,3,5,2,5,3,0,0,0,0,5,1,0,0,0,0,0,0,0,0,3727365 55 | 24,2,3,2,2,4,5,4,2,5,5,5,5,1,5,5,0,0,0,0,0,3,1,1,2,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,3452382 56 | 1,2,3,4,5,4,3,1,3,5,5,5,5,2,5,5,1,2,2,1,3,2,5,1,3,4,1,1,4,3,2,3,5,5,3,3,5,4,3,2,4467729 57 | 4,1,3,3,5,4,5,3,3,5,4,4,5,4,2,5,1,3,3,1,4,5,5,4,2,2,1,1,1,1,3,1,3,3,3,2,3,4,3,1,2018786 58 | 17,1,2,12,7.5,6,6,4,6,10,8,8,8,6,8,6,0,0,0,0,0,15,12,3,2,5,0,0,0,0,7.5,5,0,0,0,0,0,0,0,0,4350573 59 | 17,1,3,3,5,4,4,2,2,5,1,4,4,5,2,4,4,4,4,3,3,5,5,5,1,1,5,5,5,1,3,2,5,5,3,3,3,4,3,2,6836483 60 | 17,1,2,4,5,4,4,1,4,5,5,5,5,2,5,5,0,0,0,0,0,3,5,1,4,2,0,0,0,0,3,3,0,0,0,0,0,0,0,0,3982768 61 | 3,2,3,6,3,6,6,4,4,10,8,10,10,6,10,7.5,9,8,12,3,12,10,12,6,1,10,2,2,2.5,2.5,5,7.5,15,3,15,6,18,12,16,6,2954086 62 | 4,1,2,3,5,3,5,3,4,5,4,4,4,5,4,4,0,0,0,0,0,5,5,1,3,1,0,0,0,0,3,2,0,0,0,0,0,0,0,0,1904842 63 | 29,2,3,4,4,4,4,1,2,5,4,5,5,3,5,5,3,5,5,3,4,5,5,3,3,5,2,2,2,3,3,1,5,5,5,3,4,4,4,3,3248660 64 | 31,2,2,2,4,4,4,1,2,5,5,5,5,2,5,5,0,0,0,0,0,2,3,1,2,1,0,0,0,0,2,2,0,0,0,0,0,0,0,0,4429513 65 | 4,1,2,2,4,4,4,1,2,5,5,5,5,1,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,3,2,0,0,0,0,0,0,0,0,2383841 66 | 17,1,3,12,7.5,6,6,2,8,10,10,10,10,4,10,7.5,15,10,15,15,12,10,9,6,3,5,10,10,12.5,5,5,7.5,20,15,25,4,24,15,20,6,2551253 67 | 17,1,3,4,5,4,5,2,3,5,4,4,4,4,3,4,0,0,0,0,0,3,5,2,4,2,0,0,0,0,3,2,0,0,0,0,0,0,0,0,4491607 68 | 
17,1,3,12,7.5,6,6,4,4,10,8,10,10,6,10,7.5,0,0,0,0,0,25,15,6,4,20,0,0,0,0,7.5,5,0,0,0,0,0,0,0,0,4952255 69 | 28,2,3,2,4,4,4,2,2,5,5,4,4,1,3,4,0,0,0,0,0,2,1,1,1,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,2021934 70 | 17,1,2,5,5,4,4,1,3,5,5,4,4,2,4,4,0,0,0,0,0,5,5,3,4,2,0,0,0,0,5,2,0,0,0,0,0,0,0,0,5906596 71 | 29,2,2,2,3,4,4,2,2,5,5,5,5,1,5,5,0,0,0,0,0,3,1,1,2,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,2371202 72 | 17,1,3,4,5,4,4,2,3,5,4,4,4,4,4,4,4,4,3,3,3,3,5,1,2,2,3,3,3,1,2,2,4,1,2,2,3,4,3,2,3818055 73 | 17,1,3,4,5,4,4,1,3,5,5,4,4,2,4,4,3,4,4,2,4,5,5,2,4,3,5,3,4,2,4,2,3,5,5,2,3,5,4,4,4705946 74 | 17,1,3,1,1,4,3,1,2,1,5,5,5,1,5,5,1,1,1,1,1,2,1,1,1,1,1,1,1,5,1,3,5,4,4,3,3,4,2,3,2364479 75 | 5,2,2,3,3,6,6,4,2,2,8,10,10,6,10,7.5,0,0,0,0,0,5,3,3,1,5,0,0,0,0,2.5,7.5,0,0,0,0,0,0,0,0,4888774 76 | 18,1,2,2,4,4,4,2,3,5,5,5,4,2,5,4,0,0,0,0,0,2,2,1,1,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,3445077 77 | 4,1,1,1,3,0,5,5,5,1,5,5,5,2,5,5,0,0,0,0,0,1,2,2,1,2,0,0,0,0,2,3,0,0,0,0,0,0,0,0,3810008 78 | 29,2,2,3,4,4,3,1,2,5,5,5,5,2,5,5,0,0,0,0,0,2,4,2,1,1,0,0,0,0,3,3,0,0,0,0,0,0,0,0,5595268 79 | 9,2,2,2,2,4,4,1,2,5,5,5,5,2,5,5,0,0,0,0,0,3,3,1,3,2,0,0,0,0,2,3,0,0,0,0,0,0,0,0,1999098 80 | 17,1,2,3,4,4,5,2,2,5,5,5,5,2,5,5,0,0,0,0,0,1,3,1,3,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,3004429 81 | 20,2,3,2,2,4,4,2,2,4,4,5,5,4,5,5,0,0,0,0,0,5,4,2,1,2,0,0,0,0,1,3,0,0,0,0,0,0,0,0,3273041 82 | 11,2,3,2,4,2,5,2,3,5,5,5,5,2,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,3,0,0,0,0,0,0,0,0,2055379 83 | 16,2,3,3,3,4,4,2,2,5,4,5,5,3,5,5,1,5,4,2,4,5,5,2,1,4,3,3,3,2,2,3,5,5,5,4,2,5,3,2,4015750 84 | 22,2,2,2,2,4,4,3,1,4,5,5,5,2,5,5,0,0,0,0,0,2,1,1,1,1,0,0,0,0,1,3,0,0,0,0,0,0,0,0,1619683 85 | 17,1,3,4,5,5,4,1,5,5,3,4,5,5,3,5,3,4,5,5,4,5,5,5,5,5,5,3,5,1,5,1,5,5,4,3,3,4,4,1,4554238 86 | 17,1,3,12,7.5,6,6,2,10,10,10,10,10,4,10,7.5,9,8,12,3,12,10,9,6,3,5,8,8,10,10,10,7.5,5,6,10,6,18,12,12,6,4136426 87 | 9,2,2,3,4,4,4,2,2,5,5,5,5,1,5,5,0,0,0,0,0,3,2,1,2,1,0,0,0,0,3,3,0,0,0,0,0,0,0,0,7592272 88 | 17,1,3,4,5,4,4,1,3,5,5,5,5,2,5,5,2,3,3,1,3,4,2,1,2,1,5,2,5,2,2,3,4,3,3,2,3,4,3,3,2058644 89 | 17,1,3,4,5,4,4,2,4,5,4,4,4,5,3,4,5,5,5,5,5,5,5,5,3,1,5,5,5,1,5,1,5,5,4,2,5,5,5,3,13575225 90 | 29,2,2,2,3,4,4,2,2,5,5,5,5,1,5,5,3,2,4,1,3,3,2,1,2,1,5,1,2,5,1,3,4,4,3,4,4,4,3,5,3753721 91 | 6,2,2,1,2,5,3,1,2,1,5,5,5,1,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,2,0,0,0,0,0,0,0,0,2792031 92 | 18,1,2,2,4,4,4,2,2,5,4,4,5,3,4,5,0,0,0,0,0,2,2,1,1,1,0,0,0,0,3,3,0,0,0,0,0,0,0,0,8894599 93 | 17,1,2,12,7.5,6,6,2,8,10,10,10,10,4,10,7.5,0,0,0,0,0,15,3,3,3,5,0,0,0,0,7.5,7.5,0,0,0,0,0,0,0,0,8630683 94 | 4,1,3,3,5,4,5,2,3,5,3,5,4,5,5,4,3,4,4,1,4,5,5,2,3,2,2,2,2,1,3,3,5,5,4,3,3,4,3,1,2267426 95 | 17,1,2,5,5,4,4,2,2,5,4,4,4,4,3,4,0,0,0,0,0,4,5,1,3,1,0,0,0,0,3,1,0,0,0,0,0,0,0,0,1149870 96 | 9,2,3,3,5,4,4,1,2,5,5,5,5,1,5,5,0,0,0,0,0,3,2,1,2,2,0,0,0,0,1,3,0,0,0,0,0,0,0,0,2083447 97 | 17,1,3,12,7.5,7.5,4.5,2,10,10,10,10,10,4,10,7.5,3,4,9,3,3,10,15,3,4,5,6,4,7.5,5,10,5,25,6,10,4,18,12,12,2,1847826 98 | 17,1,3,2,4,4,4,2,5,5,5,5,5,2,5,5,2,2,5,2,4,2,5,1,1,3,5,2,3,5,3,3,5,5,4,2,3,4,4,2,5161371 99 | 15,2,3,3,4,3,4,2,2,5,5,5,5,2,5,5,2,1,2,1,4,2,2,1,2,1,2,3,3,5,1,3,5,1,3,2,3,4,3,3,4316716 100 | 19,2,2,2,2,4,3,2,1,5,4,5,5,5,5,5,0,0,0,0,0,5,4,2,2,2,0,0,0,0,2,3,0,0,0,0,0,0,0,0,3807496 101 | 31,2,3,3,4,4,4,1,2,5,4,4,5,4,4,5,5,5,3,2,4,3,5,1,3,1,2,2,2,1,2,2,3,1,4,3,4,5,5,3,3410879 102 | 9,2,2,4,5,4,3,1,2,5,4,4,5,4,4,5,0,0,0,0,0,5,5,5,5,5,0,0,0,0,4,3,0,0,0,0,0,0,0,0,5435277 103 | 17,1,3,6,4.5,6,6,4,8,10,10,10,10,4,10,7.5,6,4,12,6,12,5,9,3,2,5,8,8,10,12.5,5,7.5,25,15,25,4,18,12,16,8,4882986 104 | 
4,1,2,3,4,4,5,3,4,5,4,4,4,5,4,4,0,0,0,0,0,4,5,1,2,1,0,0,0,0,3,2,0,0,0,0,0,0,0,0,3199619 105 | 17,1,2,4,5,4,5,2,2,5,5,5,5,2,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,3,3,0,0,0,0,0,0,0,0,7217635 106 | 6,2,2,2,3,4,4,2,2,5,4,5,5,3,5,5,0,0,0,0,0,5,4,2,2,3,0,0,0,0,2,3,0,0,0,0,0,0,0,0,4067567 107 | 14,2,2,4,5,5,4,2,2,5,4,5,5,3,5,5,0,0,0,0,0,5,5,4,5,5,0,0,0,0,4,1,0,0,0,0,0,0,0,0,3939804 108 | 17,1,3,4,5,5,4,1,5,5,5,4,4,2,4,4,3,4,2,1,4,2,3,2,2,1,3,3,4,5,3,3,3,2,1,2,2,1,3,3,3784230 109 | 4,1,3,2,3,4,3,1,5,5,5,5,5,1,5,5,1,1,2,1,4,1,3,1,2,5,1,1,1,1,1,3,5,5,5,3,4,4,3,1,2740687 110 | 10,2,2,3,4,4,4,1,2,5,2,4,5,5,4,5,0,0,0,0,0,3,5,1,3,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,2344689 111 | 4,1,2,1,1,4,4,2,1,1,4,4,4,5,2,4,0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,3,0,0,0,0,0,0,0,0,3447891 112 | 17,1,2,2,4,4,4,1,4,5,5,5,5,1,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,3,3,0,0,0,0,0,0,0,0,6941174 113 | 9,2,2,3,3,4,4,2,2,5,4,5,5,3,5,5,0,0,0,0,0,3,5,1,3,2,0,0,0,0,2,3,0,0,0,0,0,0,0,0,3351383 114 | 28,2,2,2,4,4,4,2,3,5,5,5,4,2,5,4,4,1,2,2,4,3,1,1,1,1,3,3,3,1,2,3,1,1,3,2,3,1,3,3,5286213 115 | 4,1,2,2,4,3,5,4,5,5,4,4,4,5,4,4,0,0,0,0,0,1,2,1,1,1,0,0,0,0,3,2,0,0,0,0,0,0,0,0,4219263 116 | 21,2,2,9,6,6,6,4,6,10,10,10,10,2,10,7.5,0,0,0,0,0,15,12,3,3,10,0,0,0,0,7.5,7.5,0,0,0,0,0,0,0,0,3956086 117 | 18,1,2,3,5,4,3,1,2,5,5,5,5,1,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,3,0,0,0,0,0,0,0,0,1756069 118 | 5,2,2,2,3,4,3,1,4,5,5,5,5,2,5,5,0,0,0,0,0,1,2,1,1,1,0,0,0,0,3,2,0,0,0,0,0,0,0,0,3258837 119 | 17,1,2,2,4,4,5,1,3,5,4,5,5,3,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,8213525 120 | 17,1,3,2,4,3,4,2,3,5,5,5,4,1,5,4,2,2,2,1,4,2,1,1,2,1,5,3,5,5,2,3,5,5,5,4,4,4,3,4,3836721 121 | 17,1,3,5,5,3,5,2,2,5,5,5,4,2,5,4,4,4,4,5,4,5,5,2,2,3,2,2,2,5,3,2,5,5,4,4,4,4,5,2,8904084 122 | 4,1,2,2,4,5,4,1,2,5,5,5,5,2,5,5,0,0,0,0,0,1,1,1,2,1,0,0,0,0,1,3,0,0,0,0,0,0,0,0,2732646 123 | 17,1,2,3,5,4,4,2,5,5,4,5,4,3,5,4,0,0,0,0,0,1,4,1,1,1,0,0,0,0,2,2,0,0,0,0,0,0,0,0,5461701 124 | 4,1,3,2,3,5,3,1,5,5,5,5,5,1,5,5,1,1,3,1,4,1,3,1,2,1,1,1,1,1,2,3,3,5,5,2,3,4,3,4,4264176 125 | 17,1,2,3,5,4,4,1,4,5,4,5,4,4,5,4,0,0,0,0,0,4,5,2,4,5,0,0,0,0,5,1,0,0,0,0,0,0,0,0,6694797 126 | 11,2,2,4,5,2,4,2,2,5,4,4,4,3,4,4,0,0,0,0,0,3,2,1,2,3,0,0,0,0,2,2,0,0,0,0,0,0,0,0,6412623 127 | 18,1,3,4,5,5,3,1,2,5,4,5,5,3,5,5,4,4,3,5,4,3,5,3,5,5,4,4,4,4,5,3,4,1,2,2,3,5,5,1,7865428 128 | 17,1,2,2,4,4,4,1,3,5,4,5,5,3,5,5,0,0,0,0,0,2,4,2,2,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,4066619 129 | 2,2,3,1,1,4,4,1,2,1,5,5,5,1,5,5,1,1,2,1,4,1,1,1,1,1,4,4,4,2,2,3,4,5,5,3,4,5,4,5,4952498 130 | 14,2,3,4,5,5,4,2,2,5,4,5,5,3,5,5,3,4,3,1,3,5,5,4,5,5,3,3,3,2,4,1,5,3,3,2,3,4,3,1,4155436 131 | 30,2,2,3,4,3,4,3,2,5,4,4,4,3,4,4,0,0,0,0,0,5,1,1,1,1,0,0,0,0,1,3,0,0,0,0,0,0,0,0,3261924 132 | 17,1,3,2,3,4,4,1,5,5,5,5,5,2,5,5,3,4,4,3,4,2,4,1,2,1,5,4,4,5,1,3,4,5,2,2,3,5,4,4,5166635 133 | 4,1,2,3,5,5,3,1,5,5,5,5,5,2,5,5,0,0,0,0,0,2,5,2,4,3,0,0,0,0,3,3,0,0,0,0,0,0,0,0,3028268 134 | 17,1,3,4,5,4,4,2,2,5,4,5,5,3,5,5,1,2,2,2,4,5,4,1,3,3,1,1,1,4,2,3,5,3,4,5,5,4,3,4,5653753 135 | 33,2,2,2,3,3,5,4,2,4,4,4,4,4,4,4,0,0,0,0,0,4,3,2,1,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,5787594 136 | 17,1,2,3,5,4,4,2,5,5,4,5,4,3,5,4,0,0,0,0,0,1,4,1,1,1,0,0,0,0,2,2,0,0,0,0,0,0,0,0,19696939 137 | 17,1,2,4,5,4,4,1,3,5,5,5,5,2,5,5,0,0,0,0,0,3,4,1,2,1,0,0,0,0,3,3,0,0,0,0,0,0,0,0,4807747 138 | 17,1,2,3,5,4,4,2,5,5,4,5,4,3,5,4,0,0,0,0,0,1,4,1,1,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,7495093 139 | -------------------------------------------------------------------------------- /dboostForest.R: -------------------------------------------------------------------------------- 1 | #See dboostFunCommented.R for readable, 
commented, simpler code.
2 | #This dboostForest.R is a work in progress with various experimental features.
3 | require(glmnet)
4 | #require(freestats)
5 | #require(randomForest)
6 | require(data.table)
7 | set.seed(1)
8 | train = fread('/home/mikeskim/Desktop/tfiAlgo/train.csv',data.table=F)
9 | #download train data at https://www.kaggle.com/c/restaurant-revenue-prediction/data
10 | #License GPL-2 as it depends on glmnet
11 | 
12 | hashfun = function(x,seedj,bool1,gseed) {
13 | if (bool1) {
14 | set.seed(seedj+(gseed*999999))
15 | return((x*runif(1))%%1)
16 | }
17 | else {
18 | return(x)
19 | }
20 | }
21 | 
22 | unique1 = function(x,bool1) {
23 | if (bool1) {
24 | return(unique(x))
25 | }
26 | else {
27 | return(x)
28 | }
29 | }
30 | 
31 | train$Id=NULL
32 | train$`Open Date`=NULL
33 | train$City = as.numeric(as.factor(train$City))
34 | train$Type = as.numeric(as.factor(train$Type))
35 | train$`City Group`=as.numeric(as.factor(train$`City Group`))
36 | train$revenue = log(train$revenue)
37 | train = train[sample(nrow(train)),]
38 | test = train[1:37,]
39 | train = train[38:137,]
40 | 
41 | testX = as.matrix(test[,-ncol(test)])
42 | testY = as.matrix(test$revenue)
43 | 
44 | trainX = as.matrix(train[,-ncol(train)])
45 | trainY = as.matrix(train$revenue)
46 | 
47 | 
48 | 
49 | 
#Note: unlike dboostFun.R, this forest-style variant does not make an additive step0 update (step0 is currently unused);
#it collects every weak learner's predictions as columns of preds/testpreds and periodically ridge-stacks them (lambda1).
50 | dBoost = function(trainX,trainY,testX,COLN=40,ROWN=100,ntrees=50,step0=0.0038,lambda0=0.4,lambda1=0.2,crossNum=7,uni=T,hashB=T,gseed=6) {
51 | mmrows = nrow(testX)
52 | mrows = nrow(trainX)
53 | mcols = ncol(trainX)
54 | 
55 | preds = rep(mean(trainY),mrows)
56 | testpreds = rep(mean(trainY),mmrows)
57 | #modelList = list()
58 | 
59 | #COLN = 25#40
60 | #ROWN = 22#100
61 | for (j in 1:ntrees) {
62 | tmpR = sample(mrows,replace=T)[1:ROWN]
63 | tmpC = sample(mcols,replace=T)[1:COLN]
64 | tmpY = trainY[tmpR]
65 | tmpX = trainX[tmpR,]
66 | tmpX = tmpX[,tmpC]
67 | tmpXX = trainX[,tmpC]
68 | testXX = testX[,tmpC]
69 | tmpX0 = tmpX
70 | tmpXX0 = tmpXX
71 | testXX0 = testXX
72 | 
73 | for (k in 1:COLN) {
74 | tmpA = rep(0, ROWN)
75 | tmpX[,k] = hashfun(tmpX[,k],j,hashB,gseed)
76 | #tmpS = rep(-1,length(tmpY)); tmpM = quantile(tmpY,runif(1,0.15,0.85)); tmpS[tmpY>tmpM]=1
77 | #cutoff = decisionStump(X=tmpX,w=1/(tmpY^2),y=tmpS)$theta
78 | #cutoff = decisionStump(X=tmpX,w=rep(1,length(tmpS)),y=tmpS)$theta
79 | cutoff = sample(x=unique1(tmpX[,k],uni),size=1)
80 | tmpA[tmpX[,k]>cutoff]=1
81 | tmpX[,k] = tmpA
82 | 
83 | tmpA = rep(0, mrows)
84 | tmpXX[,k] = hashfun(tmpXX[,k],j,hashB,gseed)
85 | tmpA[tmpXX[,k]>cutoff]=1
86 | tmpXX[,k] = tmpA
87 | 
88 | tmpA = rep(0, mmrows)
89 | testXX[,k] = hashfun(testXX[,k],j,hashB,gseed)
90 | tmpA[testXX[,k]>cutoff]=1
91 | testXX[,k] = tmpA
92 | }
93 | 
94 | for (k in 1:crossNum) {
95 | k1 = sample(1:COLN,size=1)
96 | k2 = sample(1:COLN,size=1)
97 | tmpA = rep(0, ROWN)
98 | combined = tmpX0[,k1]*tmpX0[,k2]
99 | cutoff = sample(x=unique1(combined,uni),size=1)
100 | tmpA[combined>cutoff]=1
101 | tmpX = cbind(tmpX, tmpA)
102 | 
103 | tmpA = rep(0, mrows)
104 | combined = tmpXX0[,k1]*tmpXX0[,k2]
105 | tmpA[combined>cutoff]=1
106 | tmpXX = cbind(tmpXX, tmpA)
107 | 
108 | tmpA = rep(0, mmrows)
109 | combined = testXX0[,k1]*testXX0[,k2]
110 | tmpA[combined>cutoff]=1
111 | testXX = cbind(testXX, tmpA)
112 | }
113 | 
114 | 
115 | 
116 | modelx = glmnet(x=tmpX,y=tmpY,family="gaussian",alpha=0,lambda=lambda0)
117 | preds = cbind(preds,predict(modelx,tmpXX))
118 | testpreds = cbind(testpreds, predict(modelx,testXX))
#Every 3rd round: refit the ridge stack on the accumulated weak-learner predictions and print held-out MAE.
119 | if (j %%3==0) {
120 | modelx = glmnet(x=preds,y=trainY,family="gaussian",alpha=0,lambda=lambda1)
121 | 
tmpP = predict(modelx,testpreds) 122 | print(mean(abs(testY-tmpP))) 123 | #print(mean(abs(testY-apply(testpreds,1,mean)))) 124 | } 125 | 126 | } 127 | 128 | return(testpreds) 129 | } 130 | 131 | 132 | tmpP = dBoost(trainX,trainY,testX) 133 | #mean(abs(trainY-preds)) 134 | #mean(abs(testY-tmpP)) 135 | 136 | " 137 | COLN=40,ROWN=100,ntrees=50,step0=0.0038,lambda0=0.2,lambda1=0.35,crossNum=8,uni=T,hashB=T,gseed=6 138 | [1] 0.3505081 139 | [1] 0.300838 140 | [1] 0.3160683 141 | [1] 0.3303783 142 | [1] 0.3194279 143 | [1] 0.286952 144 | [1] 0.2843058 145 | [1] 0.2837281 146 | [1] 0.27747 147 | [1] 0.2778305 148 | [1] 0.2782543 149 | [1] 0.2785709 150 | [1] 0.2751876 151 | [1] 0.27808 152 | [1] 0.2752973 153 | [1] 0.2764192 154 | really easy to overfit even local cv... need large train. 155 | " 156 | 157 | #[1] 0.3394888 158 | # 0.3338347 159 | # 0.3350612 (better than some rf runs... tuned dboost>>>untuned rf???) 160 | # 0.3380544 hash uni=T 161 | #0.338174 with hash 162 | #0.3408058 without hash 163 | #comparable to rf.... sometimes. -------------------------------------------------------------------------------- /dboostFun.R: -------------------------------------------------------------------------------- 1 | #See dboostFunCommented.R for readable, commented, simpler code. 2 | #This dboostFun.R is a work in progress with various experimental features. 3 | require(glmnet) 4 | #require(randomForest) 5 | require(data.table) 6 | set.seed(3) 7 | train = fread('/home/mikeskim/Desktop/tfiAlgo/train.csv',data.table=F) 8 | #download train data at https://www.kaggle.com/c/restaurant-revenue-prediction/data 9 | #License GPL-2 as it depends on glmnet 10 | 11 | hashfun = function(x,seedj,bool1,gseed) { 12 | if (bool1) { 13 | set.seed(seedj+(gseed*999999)) 14 | return((x*runif(1))%%1) 15 | } 16 | else { 17 | return(x) 18 | } 19 | } 20 | 21 | unique1 = function(x,bool1) { 22 | if (bool1) { 23 | return(unique(x)) 24 | } 25 | else { 26 | return(x) 27 | } 28 | } 29 | 30 | train$Id=NULL 31 | train$`Open Date`=NULL 32 | train$City = as.numeric(as.factor(train$City)) 33 | train$Type = as.numeric(as.factor(train$Type)) 34 | train$`City Group`=as.numeric(as.factor(train$`City Group`)) 35 | train$revenue = log(train$revenue) 36 | train = train[sample(nrow(train)),] 37 | test = train[1:37,] 38 | train = train[38:137,] 39 | 40 | testX = as.matrix(test[,-ncol(test)]) 41 | testY = as.matrix(test$revenue) 42 | 43 | trainX = as.matrix(train[,-ncol(train)]) 44 | trainY = as.matrix(train$revenue) 45 | 46 | 47 | 48 | 49 | dBoost = function(trainX,trainY,testX,COLN=24,ROWN=22,ntrees=4300,step0=0.0038,lambda0=0.3,crossNum=7,uni=T,hashB=T,gseed=6) { 50 | mmrows = nrow(testX) 51 | mrows = nrow(trainX) 52 | mcols = ncol(trainX) 53 | 54 | preds = rep(0,mrows) 55 | testpreds = rep(0,mmrows) 56 | #modelList = list() 57 | 58 | #COLN = 25#40 59 | #ROWN = 22#100 60 | for (j in 1:ntrees) { 61 | tmpR = sample(mrows,replace=T)[1:ROWN] 62 | tmpC = sample(mcols,replace=T)[1:COLN] 63 | tmpY = trainY[tmpR] - preds[tmpR] 64 | tmpX = trainX[tmpR,] 65 | tmpX = tmpX[,tmpC] 66 | tmpXX = trainX[,tmpC] 67 | testXX = testX[,tmpC] 68 | tmpX0 = tmpX 69 | tmpXX0 = tmpXX 70 | testXX0 = testXX 71 | 72 | for (k in 1:COLN) { 73 | tmpA = rep(0, ROWN) 74 | tmpX[,k] = hashfun(tmpX[,k],j,hashB,gseed) 75 | cutoff = sample(x=unique1(tmpX[,k],uni),size=1) 76 | tmpA[tmpX[,k]>cutoff]=1 77 | tmpX[,k] = tmpA 78 | 79 | tmpA = rep(0, mrows) 80 | tmpXX[,k] = hashfun(tmpXX[,k],j,hashB,gseed) 81 | tmpA[tmpXX[,k]>cutoff]=1 82 | tmpXX[,k] = tmpA 83 | 84 | tmpA = rep(0, 
mmrows) 85 | testXX[,k] = hashfun(testXX[,k],j,hashB,gseed) 86 | tmpA[testXX[,k]>cutoff]=1 87 | testXX[,k] = tmpA 88 | } 89 | 90 | for (k in 1:crossNum) { 91 | k1 = sample(1:COLN,size=1) 92 | k2 = sample(1:COLN,size=1) 93 | tmpA = rep(0, ROWN) 94 | combined = tmpX0[,k1]*tmpX0[,k2] 95 | cutoff = sample(x=unique1(combined,uni),size=1) 96 | tmpA[combined>cutoff]=1 97 | tmpX = cbind(tmpX, tmpA) 98 | 99 | tmpA = rep(0, mrows) 100 | combined = tmpXX0[,k1]*tmpXX0[,k2] 101 | tmpA[combined>cutoff]=1 102 | tmpXX = cbind(tmpXX, tmpA) 103 | 104 | tmpA = rep(0, mmrows) 105 | combined = testXX0[,k1]*testXX0[,k2] 106 | tmpA[combined>cutoff]=1 107 | testXX = cbind(testXX, tmpA) 108 | } 109 | 110 | 111 | if (j ==1 ) { 112 | modelx = glmnet(x=trainX,y=trainY,family="gaussian",alpha=0,lambda=lambda0) 113 | preds = preds + step0*predict(modelx,trainX) 114 | testpreds = testpreds + step0*predict(modelx,testX) 115 | } 116 | else { 117 | modelx = glmnet(x=tmpX,y=tmpY,family="gaussian",alpha=0,lambda=lambda0) 118 | preds = preds + step0*predict(modelx,tmpXX) 119 | testpreds = testpreds + step0*predict(modelx,testXX) 120 | } 121 | } 122 | 123 | return(testpreds) 124 | } 125 | 126 | 127 | tmpP = dBoost(trainX,trainY,testX) 128 | #mean(abs(trainY-preds)) 129 | mean(abs(testY-tmpP)) 130 | # 0.3338347 131 | # 0.3350612 (better than some rf runs... tuned dboost>>>untuned rf???) 132 | # 0.3380544 hash uni=T 133 | #0.338174 with hash 134 | #0.3408058 without hash 135 | #comparable to rf.... sometimes. -------------------------------------------------------------------------------- /dboostFunCommented.R: -------------------------------------------------------------------------------- 1 | #Michael S Kim - mikeskim at gmail -dot- com 2 | #03/2016 3 | #License GPL-2 as it depends on glmnet 4 | #Simpler and easier to understand version of dboost with comments. 5 | #Some code is taken out to make it easier to read. 6 | #See dboostFun.R for a more complete version that gives better CV scores. 7 | 8 | #Write some helper functions to use later 9 | unique1 = function(x,bool1) { 10 | if (bool1) { 11 | return(unique(x)) 12 | } 13 | else { 14 | return(x) 15 | } 16 | } 17 | 18 | #Load glmnet package since we will do Ridge regression as our weak learner 19 | require(glmnet) 20 | 21 | #Set seed to reproduce results - this is stochastic boosting. 22 | set.seed(1) 23 | 24 | #Load small sample dataset 25 | train = read.csv('/home/mikeskim/Desktop/dboost/data/train.csv') 26 | 27 | #Mangle the data to get into train,test 28 | #Transform target variable with log to make it easier to fit a linear model to. 29 | train$target = log(train$target) 30 | 31 | #Shuffle all data for train and test split 32 | train = train[sample(nrow(train)),] 33 | 34 | #Test on first 37 rows 35 | test = train[1:37,] 36 | 37 | #Train on next 100 or so rows 38 | train = train[38:137,] 39 | 40 | #Split out target variable (last column) 41 | testX = as.matrix(test[,-ncol(test)]) 42 | testY = as.matrix(test[,ncol(test)]) 43 | 44 | trainX = as.matrix(train[,-ncol(train)]) 45 | trainY = as.matrix(train[,ncol(train)]) 46 | 47 | 48 | #This is the dboost function that takes input your training independent variables trainX, your training target trainY 49 | #And your test independent variables testX, testing target is left out for you to CV with later outside the function. 
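#Arguments: COLN and ROWN set the sizes of the random column and row subsamples drawn each round,
#ntrees is the number of boosting rounds, step0 is the shrinkage (learning rate) applied to each
#weak learner's predictions, lambda0 is the ridge penalty passed to glmnet, and uni controls whether
#cutoffs are sampled from a column's unique values (TRUE) or from its raw values (FALSE, which
#weights candidate cutoffs by how often each value occurs).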
50 | dBoost = function(trainX,trainY,testX,COLN=24,ROWN=22,ntrees=3000,step0=0.0038,lambda0=0.9,uni=F) {
51 | #Saving dimensions of inputs to use later for resampling steps
52 | mmrows = nrow(testX)
53 | mrows = nrow(trainX)
54 | mcols = ncol(trainX)
55 | 
56 | #Set initial (all zero) prediction vectors for train and test
57 | preds = rep(0,mrows)
58 | testpreds = rep(0,mmrows)
59 | 
60 | #This is the main boosting loop that repeats to ensemble the weak learners (ridge regression models)
61 | for (j in 1:ntrees) {
62 | #This is stochastic boosting, so resampling with replacement from columns and rows.
63 | tmpR = sample(mrows,replace=T)[1:ROWN]
64 | tmpC = sample(mcols,replace=T)[1:COLN]
65 | 
66 | #You are training on (updated) residuals
67 | tmpY = trainY[tmpR] - preds[tmpR]
68 | 
69 | #Setup some matrices required for training and testing at step j.
70 | tmpX = trainX[tmpR,]
71 | tmpX = tmpX[,tmpC]
72 | tmpXX = trainX[,tmpC]
73 | testXX = testX[,tmpC]
74 | 
75 | #This is where you make your dummy variables via random splitting.
76 | for (k in 1:COLN) {
77 | #Make dummy variable vector filled with zeros for now.
78 | tmpA = rep(0, ROWN)
79 | 
80 | #Randomly select a cutoff from the original variable. This is fixed for this j iteration.
81 | cutoff = sample(x=unique1(tmpX[,k],uni),size=1)
82 | 
83 | #Fill the dummy variable with 1s on indices where the original variable is greater than the random cutoff.
84 | tmpA[tmpX[,k]>cutoff]=1
85 | 
86 | #Replace original variable with dummy variable (only for this j iteration / weak learner)
87 | tmpX[,k] = tmpA
88 | 
89 | #Repeat above steps using all of train - above only uses a resampled subset of training rows.
90 | #So your model j's training is done above (trained on subset),
91 | #but your model j's prediction on train is applied on all training data.
92 | tmpA = rep(0, mrows)
93 | tmpA[tmpXX[,k]>cutoff]=1
94 | tmpXX[,k] = tmpA
95 | 
96 | #Repeat above using all of test
97 | tmpA = rep(0, mmrows)
98 | tmpA[testXX[,k]>cutoff]=1
99 | testXX[,k] = tmpA
100 | } #End dummy variable making loop
101 | 
102 | #Train weak learner via glmnet using only a subset of rows and columns (each column is dummied so contains only 0 or 1)
103 | modelx = glmnet(x=tmpX,y=tmpY,family="gaussian",alpha=0,lambda=lambda0)
104 | 
105 | #Make predictions scaled via step0 (for both train and test)
106 | preds = preds + step0*predict(modelx,tmpXX)
107 | testpreds = testpreds + step0*predict(modelx,testXX)
108 | }#End boosting loop ensembling weak learners
109 | 
110 | return(testpreds)
111 | }#End dboost function, this is like 30 lines of code without comments.
112 | 
113 | #Predict on test set using training input data
114 | tmpP = dBoost(trainX,trainY,testX)
115 | 
116 | #See what the test score is (one holdout cross validation)
117 | mean(abs(testY-tmpP))
118 | # 0.4142502
119 | #comparable to rf.... sometimes. will require tuning.
--------------------------------------------------------------------------------
/dboostStump.R:
--------------------------------------------------------------------------------
1 | #See dboostFunCommented.R for readable, commented, simpler code.
2 | #This dboostStump.R is a work in progress with various experimental features.
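#The experimental difference from dboostFun.R: instead of a purely random cutoff, each dummy
#split fits a weighted decision stump (freestats::decisionStump) on +/-1 labels made by
#thresholding the residuals at a random quantile, then uses that stump's theta as the cutoff.
#(Note decisionStump is fit on the whole tmpX matrix, but its theta is applied per column.)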
3 | require(glmnet) 4 | require(freestats) 5 | #require(randomForest) 6 | require(data.table) 7 | set.seed(3) 8 | train = fread('/home/mikeskim/Desktop/tfiAlgo/train.csv',data.table=F) 9 | #download train data at https://www.kaggle.com/c/restaurant-revenue-prediction/data 10 | #License GPL-2 as it depends on glmnet 11 | 12 | hashfun = function(x,seedj,bool1,gseed) { 13 | if (bool1) { 14 | set.seed(seedj+(gseed*999999)) 15 | return((x*runif(1))%%1) 16 | } 17 | else { 18 | return(x) 19 | } 20 | } 21 | 22 | unique1 = function(x,bool1) { 23 | if (bool1) { 24 | return(unique(x)) 25 | } 26 | else { 27 | return(x) 28 | } 29 | } 30 | 31 | train$Id=NULL 32 | train$`Open Date`=NULL 33 | train$City = as.numeric(as.factor(train$City)) 34 | train$Type = as.numeric(as.factor(train$Type)) 35 | train$`City Group`=as.numeric(as.factor(train$`City Group`)) 36 | train$revenue = log(train$revenue) 37 | train = train[sample(nrow(train)),] 38 | test = train[1:37,] 39 | train = train[38:137,] 40 | 41 | testX = as.matrix(test[,-ncol(test)]) 42 | testY = as.matrix(test$revenue) 43 | 44 | trainX = as.matrix(train[,-ncol(train)]) 45 | trainY = as.matrix(train$revenue) 46 | 47 | 48 | 49 | 50 | dBoost = function(trainX,trainY,testX,COLN=24,ROWN=22,ntrees=9900,step0=0.0038,lambda0=0.3,crossNum=7,uni=T,hashB=T,gseed=6) { 51 | mmrows = nrow(testX) 52 | mrows = nrow(trainX) 53 | mcols = ncol(trainX) 54 | 55 | preds = rep(0,mrows) 56 | testpreds = rep(0,mmrows) 57 | #modelList = list() 58 | 59 | #COLN = 25#40 60 | #ROWN = 22#100 61 | for (j in 1:ntrees) { 62 | tmpR = sample(mrows,replace=T)[1:ROWN] 63 | tmpC = sample(mcols,replace=T)[1:COLN] 64 | tmpY = trainY[tmpR] - preds[tmpR] 65 | tmpX = trainX[tmpR,] 66 | tmpX = tmpX[,tmpC] 67 | tmpXX = trainX[,tmpC] 68 | testXX = testX[,tmpC] 69 | tmpX0 = tmpX 70 | tmpXX0 = tmpXX 71 | testXX0 = testXX 72 | 73 | for (k in 1:COLN) { 74 | tmpA = rep(0, ROWN) 75 | tmpX[,k] = hashfun(tmpX[,k],j,hashB,gseed) 76 | tmpS = rep(-1,length(tmpY)); tmpM = quantile(tmpY,runif(1,0.2,0.8)); tmpS[tmpY>tmpM]=1 77 | cutoff = decisionStump(X=tmpX,w=1/(tmpY^2),y=tmpS)$theta 78 | #cutoff = sample(x=unique1(tmpX[,k],uni),size=1) 79 | tmpA[tmpX[,k]>cutoff]=1 80 | tmpX[,k] = tmpA 81 | 82 | tmpA = rep(0, mrows) 83 | tmpXX[,k] = hashfun(tmpXX[,k],j,hashB,gseed) 84 | tmpA[tmpXX[,k]>cutoff]=1 85 | tmpXX[,k] = tmpA 86 | 87 | tmpA = rep(0, mmrows) 88 | testXX[,k] = hashfun(testXX[,k],j,hashB,gseed) 89 | tmpA[testXX[,k]>cutoff]=1 90 | testXX[,k] = tmpA 91 | } 92 | 93 | for (k in 1:crossNum) { 94 | k1 = sample(1:COLN,size=1) 95 | k2 = sample(1:COLN,size=1) 96 | tmpA = rep(0, ROWN) 97 | combined = tmpX0[,k1]*tmpX0[,k2] 98 | cutoff = sample(x=unique1(combined,uni),size=1) 99 | tmpA[combined>cutoff]=1 100 | tmpX = cbind(tmpX, tmpA) 101 | 102 | tmpA = rep(0, mrows) 103 | combined = tmpXX0[,k1]*tmpXX0[,k2] 104 | tmpA[combined>cutoff]=1 105 | tmpXX = cbind(tmpXX, tmpA) 106 | 107 | tmpA = rep(0, mmrows) 108 | combined = testXX0[,k1]*testXX0[,k2] 109 | tmpA[combined>cutoff]=1 110 | testXX = cbind(testXX, tmpA) 111 | } 112 | 113 | 114 | if (j ==1 ) { 115 | modelx = glmnet(x=trainX,y=trainY,family="gaussian",alpha=0,lambda=lambda0) 116 | preds = preds + step0*predict(modelx,trainX) 117 | testpreds = testpreds + step0*predict(modelx,testX) 118 | } 119 | else { 120 | modelx = glmnet(x=tmpX,y=tmpY,family="gaussian",alpha=0,lambda=lambda0) 121 | preds = preds + step0*predict(modelx,tmpXX) 122 | testpreds = testpreds + step0*predict(modelx,testXX) 123 | if (j %%100==0) { 124 | print(mean(abs(testY-testpreds))) 125 | } 126 | } 127 
| } 128 | 129 | return(testpreds) 130 | } 131 | 132 | 133 | tmpP = dBoost(trainX,trainY,testX) 134 | #mean(abs(trainY-preds)) 135 | mean(abs(testY-tmpP)) 136 | #[1] 0.3394888 137 | # 0.3338347 138 | # 0.3350612 (better than some rf runs... tuned dboost>>>untuned rf???) 139 | # 0.3380544 hash uni=T 140 | #0.338174 with hash 141 | #0.3408058 without hash 142 | #comparable to rf.... sometimes. --------------------------------------------------------------------------------