├── README.md
├── data
│   └── train.csv
├── dboostForest.R
├── dboostFun.R
├── dboostFunCommented.R
└── dboostStump.R
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # dboost: powered by OVER 9000 reg monkeys
2 | Stochastic Dummy Boosting
3 | 
4 | Michael S Kim (03/03/2016)
5 | 
6 | The base algorithm is about 30 lines of code once you take out the comments. I highly suggest reading those 30 lines to get the details; it should take you much less than 30 minutes. Given this is GitHub, where I assume most people can code, it's probably easiest to just read the code.
7 | 
8 | I randomly create dummy variables by randomly splitting feature columns at a threshold, which is basically a decision stump. There is an option to hash a column's values before thresholding, so that splits are not based only on numerical face-value rank. Ridge regression on these random dummies is my weak learner for boosting. From there it's just GBM. (A minimal sketch of one round is at the bottom of this README.)
9 | 
10 | This is not random tree embedding and then xgblinear. It is not feature engineering and then GBM. The space of all possible dummy encodings across all possible columns is not something one can enumerate in practice.
11 | 
12 | # Each iteration of dboost will have a very different dummy encoding, and the resulting learner will be very weak because it is random.
13 | 
14 | Think of all the possible ways you can dummy just one feature column with the three distinct values 3, 4, 5. You can dummy as:
15 | 
16 | - 3 is one, else zero
17 | 
18 | - 4 is one, else zero
19 | 
20 | - 5 is one, else zero
21 | 
22 | - 3 or 4 is one, else zero
23 | 
24 | - 4 or 5 is one, else zero
25 | 
26 | - 3 or 5 is one, else zero
27 | 
28 | That is 2^3 - 2 = 6 nontrivial encodings for one sample column with only 3 distinct values, and raw thresholds reach just 2 of them ({5} and {4,5}), which is why the hashing option, by reshuffling the value order every iteration, matters. A column could easily have thousands of distinct values, and you could have many columns. In practice, the possible ways to dummy are endless.
29 | 
30 | The intuition behind this method is that I have not been able to boost strong learners very well, so I looked towards a diverse set of uncorrelated weak learners. That is why you see random dummy variables and ridge regression. This method also happens to be very fast in theory (you can use online ridge, non-cryptographic hashing is cheap, etc.).
31 | 
32 | Read dboostFunCommented.R for the easy version of the code without some features. The other file, dboostFun.R, adds a few features that improve my cross-validated scores in my local experiments.
33 | 
34 | In my very limited local testing, tuned dboost should get scores around those of an untuned random forest. This may not hold in general, but this (hopefully new) meta algorithm is promising.
35 | 
36 | In my Kaggle experience it's basically XG >= NN > RF as the best level-1 algorithm. I usually try every algorithm in caret and the results are subpar. If dboost ends up at the level of RF (on even a subset of Kaggle problems), the double-bagged / meta-bagged version of XG+DB/NN should produce top-10 results. Again, more testing is required.
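For the impatient, here is a minimal sketch of one boosting round. This is an illustration, not code from this repo: `one_round`, its arguments, and its defaults are made up for this README, and it leaves out the row resampling, hashing, feature crosses, and the `step0` shrinkage update that the real loop in dboostFunCommented.R / dboostFun.R performs.

```r
# Hypothetical sketch of one dboost round (illustration only; see dboostFunCommented.R).
# Assumes X is a numeric matrix whose columns each have more than one distinct value
# (R's sample() misbehaves on length-1 inputs), and r is the current residual vector.
library(glmnet)

one_round = function(X, r, ncols = 5, lambda = 0.3) {
  cols = sample(ncol(X), ncols, replace = TRUE)   # random column subsample
  D = sapply(cols, function(k) {
    cutoff = sample(unique(X[, k]), 1)            # random threshold = one random dummy
    as.numeric(X[, k] > cutoff)                   # 0/1 dummy column
  })
  glmnet(x = D, y = r, family = "gaussian", alpha = 0, lambda = lambda)  # ridge weak learner
}
```

Boosting is then just a loop: fit `one_round` on the current residuals and add a small multiple (`step0`) of its predictions to the running fit.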
37 | -------------------------------------------------------------------------------- /data/train.csv: -------------------------------------------------------------------------------- 1 | feat1,feat2,feat3,feat4,feat5,feat6,feat7,feat8,feat9,feat10,feat11,feat12,feat13,feat14,feat15,feat16,feat17,feat18,feat19,feat20,feat21,feat22,feat23,feat24,feat25,feat26,feat27,feat28,feat29,feat30,feat31,feat32,feat33,feat34,feat35,feat36,feat37,feat38,feat39,feat40,target 2 | 14,2,2,4,4,5,5,2,2,5,5,5,5,2,5,5,0,0,0,0,0,3,1,2,1,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,3778622 3 | 21,2,2,4,5,4,4,2,2,5,4,4,5,4,4,5,0,0,0,0,0,5,5,3,3,5,0,0,0,0,4,3,0,0,0,0,0,0,0,0,7201784 4 | 4,1,3,2,2,4,4,2,1,5,4,5,5,3,5,5,0,0,0,0,0,3,4,1,2,2,0,0,0,0,1,3,0,0,0,0,0,0,0,0,1734635 5 | 23,2,2,9,6,6,6,4,4,10,8,10,10,8,10,7.5,0,0,0,0,0,25,15,15,3,20,0,0,0,0,10,2.5,0,0,0,0,0,0,0,0,3745136 6 | 17,1,2,4,5,3,5,2,2,5,4,4,5,4,4,5,0,0,0,0,0,2,1,1,1,1,0,0,0,0,3,3,0,0,0,0,0,0,0,0,6363241 7 | 17,1,2,4,5,4,4,1,3,5,4,4,5,4,4,5,0,0,0,0,0,1,4,1,2,1,0,0,0,0,4,3,0,0,0,0,0,0,0,0,4100886 8 | 17,1,3,12,7.5,6,6,2,10,10,10,10,10,4,10,7.5,3,10,15,3,12,10,9,3,2,5,8,8,10,2.5,7.5,7.5,5,15,20,2,12,3,16,4,3218919 9 | 4,1,2,4,5,4,4,1,2,5,5,5,5,1,5,5,0,0,0,0,0,3,2,1,3,2,0,0,0,0,3,3,0,0,0,0,0,0,0,0,6923132 10 | 32,2,3,6,4.5,6,6,4,4,10,8,10,10,8,10,7.5,6,4,9,3,12,20,12,6,1,10,2,2,2.5,2.5,2.5,7.5,25,12,10,6,18,12,12,6,2675511 11 | 4,1,3,3,5,4,5,3,4,5,4,4,4,5,4,4,1,1,2,1,1,5,5,1,3,2,2,2,2,1,3,3,5,2,3,2,2,2,3,3,2156098 12 | 13,2,3,3,4,4,4,2,2,5,4,5,5,3,5,5,3,5,5,4,4,5,5,1,1,5,2,4,1,5,1,3,5,1,2,2,4,5,5,4,5525735 13 | 18,1,3,3,5,5,3,1,2,5,5,5,5,2,5,5,2,2,4,2,4,2,5,2,4,2,1,1,1,4,3,3,5,5,4,2,2,4,3,3,4250758 14 | 26,2,3,2,4,5,4,1,1,3,4,4,4,4,2,3,0,0,0,0,0,1,2,2,1,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,4651867 15 | 8,2,2,1,1,4,3,1,1,1,4,5,5,3,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,4263629 16 | 17,1,2,3,5,4,4,2,5,5,5,5,5,2,5,5,0,0,0,0,0,4,5,2,2,1,0,0,0,0,4,3,0,0,0,0,0,0,0,0,9652351 17 | 17,1,3,2,4,4,4,2,5,5,4,5,4,3,5,4,4,4,5,5,5,2,4,1,2,1,5,5,5,1,2,3,5,5,5,3,4,4,5,3,5966193 18 | 1,2,3,4,5,4,3,1,3,5,5,5,5,1,5,5,2,2,4,2,4,3,3,1,4,4,3,3,3,2,2,3,4,5,5,3,3,4,4,3,6313222 19 | 18,1,3,3,5,4,4,2,2,5,3,4,5,5,4,5,3,2,3,1,4,4,5,5,3,5,5,2,5,3,5,1,4,3,3,2,3,4,3,1,2525376 20 | 21,2,2,3,4,4,4,2,3,5,5,5,5,1,5,5,0,0,0,0,0,2,3,1,2,2,0,0,0,0,2,3,0,0,0,0,0,0,0,0,2544858 21 | 17,1,2,5,5,4,5,1,4,5,3,4,4,5,3,4,0,0,0,0,0,5,5,5,5,4,0,0,0,0,5,0,0,0,0,0,0,0,0,0,16549065 22 | 28,2,3,1,1,3,3,2,1,1,5,5,5,2,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,3,0,0,0,0,0,0,0,0,3008199 23 | 17,1,2,4,5,4,5,2,4,5,4,4,4,3,3,4,0,0,0,0,0,5,5,3,5,3,0,0,0,0,5,1,0,0,0,0,0,0,0,0,4052734 24 | 25,2,2,2,3,4,4,2,1,5,5,5,5,1,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,2993069 25 | 29,2,2,2,4,4,4,2,2,5,5,5,5,2,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,3,0,0,0,0,0,0,0,0,1270499 26 | 17,1,3,12,7.5,6,6,2,8,10,8,8,8,8,8,6,12,10,12,9,9,20,15,6,4,25,10,8,10,2.5,12.5,2.5,20,9,15,4,12,12,12,4,5500819 27 | 5,2,2,6,6,6,4.5,2,8,10,10,10,8,4,10,6,0,0,0,0,0,5,15,3,2,5,0,0,0,0,5,7.5,0,0,0,0,0,0,0,0,3752885 28 | 18,1,2,4,5,4,4,2,3,5,4,4,5,5,4,5,0,0,0,0,0,3,2,2,1,1,0,0,0,0,3,3,0,0,0,0,0,0,0,0,9262754 29 | 1,2,3,4,5,4,3,1,2,5,4,4,5,5,4,5,2,2,4,2,4,5,5,4,3,5,3,3,3,4,3,2,4,1,3,3,3,4,3,3,3903884 30 | 26,2,2,1,2,5,4,1,2,1,5,4,4,2,4,3,0,0,0,0,0,1,1,1,1,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,3570392 31 | 4,1,3,9,7.5,6,7.5,8,6,10,10,8,8,4,6,6,3,2,9,3,12,25,12,6,2,10,4,4,5,2.5,7.5,2.5,20,9,20,4,18,12,12,2,3871345 32 | 4,1,2,4,5,4,5,2,2,5,3,4,5,5,4,5,0,0,0,0,0,5,5,5,5,5,0,0,0,0,5,0,0,0,0,0,0,0,0,0,3347768 33 | 
24,2,2,2,2,4,3,1,1,1,4,5,5,3,5,5,0,0,0,0,0,4,3,1,1,2,0,0,0,0,2,3,0,0,0,0,0,0,0,0,1882132 34 | 18,1,3,4,4,5,3,1,3,5,5,5,5,2,5,5,3,2,4,1,4,1,3,2,2,3,3,2,3,5,2,3,4,3,3,2,4,4,3,1,4250554 35 | 17,1,3,4,5,4,4,2,2,5,4,4,4,5,3,4,0,0,0,0,0,4,5,2,2,2,0,0,0,0,3,3,0,0,0,0,0,0,0,0,2390534 36 | 5,2,2,1,2,5,4,1,2,1,4,5,5,3,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,2,0,0,0,0,0,0,0,0,2025298 37 | 17,1,3,12,7.5,6,7.5,2,8,10,8,8,8,10,6,6,15,8,12,6,9,25,15,9,5,15,6,10,10,5,12.5,2.5,25,15,25,6,18,3,16,6,4286646 38 | 17,1,2,4,5,4,4,1,3,5,5,5,5,2,5,5,0,0,0,0,0,3,3,1,3,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,3600467 39 | 4,1,2,6,6,4.5,7.5,8,10,10,8,8,8,10,8,6,0,0,0,0,0,5,6,3,1,5,0,0,0,0,7.5,5,0,0,0,0,0,0,0,0,5017320 40 | 34,2,2,2,3,4,4,2,1,1,4,4,5,3,4,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,3,0,0,0,0,0,0,0,0,1763232 41 | 4,1,3,12,7.5,6,6,2,6,10,8,10,10,6,10,7.5,6,4,9,3,12,10,15,6,3,10,4,4,5,2.5,5,7.5,10,15,20,4,18,12,12,2,3164973 42 | 27,2,2,3,4,3,4,2,1,5,5,5,5,2,5,5,0,0,0,0,0,3,1,1,1,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,3376145 43 | 17,1,2,12,7.5,6,6,2,8,10,10,8,8,4,8,6,0,0,0,0,0,15,9,3,3,5,0,0,0,0,5,7.5,0,0,0,0,0,0,0,0,3426169 44 | 12,2,3,6,4.5,6,7.5,6,4,10,10,10,10,2,10,7.5,0,0,0,0,0,25,3,3,1,10,0,0,0,0,5,2.5,0,0,0,0,0,0,0,0,5444228 45 | 17,1,2,2,2,4,4,2,2,5,5,5,5,2,5,5,0,0,0,0,0,3,4,1,2,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,4590423 46 | 17,1,2,12,7.5,7.5,6,2,10,10,8,8,10,8,8,7.5,0,0,0,0,0,25,15,6,3,25,0,0,0,0,10,7.5,0,0,0,0,0,0,0,0,6782425 47 | 7,2,3,3,4,5,4,2,2,5,4,5,5,4,5,5,5,4,5,5,4,5,5,4,2,5,1,1,1,0,2,3,5,5,5,5,4,5,5,4,4758477 48 | 28,2,2,2,4,4,4,2,2,5,3,4,4,5,4,4,0,0,0,0,0,2,2,1,1,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,2999068 49 | 18,1,3,4,5,4,3,1,2,5,5,5,5,2,5,5,3,2,3,2,3,4,5,1,4,2,5,3,3,3,2,3,3,5,5,4,4,4,3,2,5337527 50 | 17,1,3,4,5,4,4,2,3,5,5,5,5,1,5,5,0,0,0,0,0,2,2,2,1,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,3780019 51 | 31,2,2,3,4,4,4,1,2,5,5,5,5,2,5,5,0,0,0,0,0,2,3,1,2,1,0,0,0,0,2,2,0,0,0,0,0,0,0,0,2097022 52 | 11,2,2,4,5,2,5,2,2,5,4,4,4,3,4,4,0,0,0,0,0,4,4,1,2,3,0,0,0,0,1,3,0,0,0,0,0,0,0,0,2738052 53 | 33,2,3,9,6,6,6,4,6,10,6,8,10,10,8,7.5,9,10,15,3,12,15,15,9,2,25,6,4,7.5,2.5,5,7.5,20,9,15,4,24,12,16,2,4780607 54 | 17,1,2,4,5,4,5,1,3,5,5,5,5,2,5,5,0,0,0,0,0,3,5,2,5,3,0,0,0,0,5,1,0,0,0,0,0,0,0,0,3727365 55 | 24,2,3,2,2,4,5,4,2,5,5,5,5,1,5,5,0,0,0,0,0,3,1,1,2,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,3452382 56 | 1,2,3,4,5,4,3,1,3,5,5,5,5,2,5,5,1,2,2,1,3,2,5,1,3,4,1,1,4,3,2,3,5,5,3,3,5,4,3,2,4467729 57 | 4,1,3,3,5,4,5,3,3,5,4,4,5,4,2,5,1,3,3,1,4,5,5,4,2,2,1,1,1,1,3,1,3,3,3,2,3,4,3,1,2018786 58 | 17,1,2,12,7.5,6,6,4,6,10,8,8,8,6,8,6,0,0,0,0,0,15,12,3,2,5,0,0,0,0,7.5,5,0,0,0,0,0,0,0,0,4350573 59 | 17,1,3,3,5,4,4,2,2,5,1,4,4,5,2,4,4,4,4,3,3,5,5,5,1,1,5,5,5,1,3,2,5,5,3,3,3,4,3,2,6836483 60 | 17,1,2,4,5,4,4,1,4,5,5,5,5,2,5,5,0,0,0,0,0,3,5,1,4,2,0,0,0,0,3,3,0,0,0,0,0,0,0,0,3982768 61 | 3,2,3,6,3,6,6,4,4,10,8,10,10,6,10,7.5,9,8,12,3,12,10,12,6,1,10,2,2,2.5,2.5,5,7.5,15,3,15,6,18,12,16,6,2954086 62 | 4,1,2,3,5,3,5,3,4,5,4,4,4,5,4,4,0,0,0,0,0,5,5,1,3,1,0,0,0,0,3,2,0,0,0,0,0,0,0,0,1904842 63 | 29,2,3,4,4,4,4,1,2,5,4,5,5,3,5,5,3,5,5,3,4,5,5,3,3,5,2,2,2,3,3,1,5,5,5,3,4,4,4,3,3248660 64 | 31,2,2,2,4,4,4,1,2,5,5,5,5,2,5,5,0,0,0,0,0,2,3,1,2,1,0,0,0,0,2,2,0,0,0,0,0,0,0,0,4429513 65 | 4,1,2,2,4,4,4,1,2,5,5,5,5,1,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,3,2,0,0,0,0,0,0,0,0,2383841 66 | 17,1,3,12,7.5,6,6,2,8,10,10,10,10,4,10,7.5,15,10,15,15,12,10,9,6,3,5,10,10,12.5,5,5,7.5,20,15,25,4,24,15,20,6,2551253 67 | 17,1,3,4,5,4,5,2,3,5,4,4,4,4,3,4,0,0,0,0,0,3,5,2,4,2,0,0,0,0,3,2,0,0,0,0,0,0,0,0,4491607 68 | 
17,1,3,12,7.5,6,6,4,4,10,8,10,10,6,10,7.5,0,0,0,0,0,25,15,6,4,20,0,0,0,0,7.5,5,0,0,0,0,0,0,0,0,4952255 69 | 28,2,3,2,4,4,4,2,2,5,5,4,4,1,3,4,0,0,0,0,0,2,1,1,1,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,2021934 70 | 17,1,2,5,5,4,4,1,3,5,5,4,4,2,4,4,0,0,0,0,0,5,5,3,4,2,0,0,0,0,5,2,0,0,0,0,0,0,0,0,5906596 71 | 29,2,2,2,3,4,4,2,2,5,5,5,5,1,5,5,0,0,0,0,0,3,1,1,2,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,2371202 72 | 17,1,3,4,5,4,4,2,3,5,4,4,4,4,4,4,4,4,3,3,3,3,5,1,2,2,3,3,3,1,2,2,4,1,2,2,3,4,3,2,3818055 73 | 17,1,3,4,5,4,4,1,3,5,5,4,4,2,4,4,3,4,4,2,4,5,5,2,4,3,5,3,4,2,4,2,3,5,5,2,3,5,4,4,4705946 74 | 17,1,3,1,1,4,3,1,2,1,5,5,5,1,5,5,1,1,1,1,1,2,1,1,1,1,1,1,1,5,1,3,5,4,4,3,3,4,2,3,2364479 75 | 5,2,2,3,3,6,6,4,2,2,8,10,10,6,10,7.5,0,0,0,0,0,5,3,3,1,5,0,0,0,0,2.5,7.5,0,0,0,0,0,0,0,0,4888774 76 | 18,1,2,2,4,4,4,2,3,5,5,5,4,2,5,4,0,0,0,0,0,2,2,1,1,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,3445077 77 | 4,1,1,1,3,0,5,5,5,1,5,5,5,2,5,5,0,0,0,0,0,1,2,2,1,2,0,0,0,0,2,3,0,0,0,0,0,0,0,0,3810008 78 | 29,2,2,3,4,4,3,1,2,5,5,5,5,2,5,5,0,0,0,0,0,2,4,2,1,1,0,0,0,0,3,3,0,0,0,0,0,0,0,0,5595268 79 | 9,2,2,2,2,4,4,1,2,5,5,5,5,2,5,5,0,0,0,0,0,3,3,1,3,2,0,0,0,0,2,3,0,0,0,0,0,0,0,0,1999098 80 | 17,1,2,3,4,4,5,2,2,5,5,5,5,2,5,5,0,0,0,0,0,1,3,1,3,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,3004429 81 | 20,2,3,2,2,4,4,2,2,4,4,5,5,4,5,5,0,0,0,0,0,5,4,2,1,2,0,0,0,0,1,3,0,0,0,0,0,0,0,0,3273041 82 | 11,2,3,2,4,2,5,2,3,5,5,5,5,2,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,3,0,0,0,0,0,0,0,0,2055379 83 | 16,2,3,3,3,4,4,2,2,5,4,5,5,3,5,5,1,5,4,2,4,5,5,2,1,4,3,3,3,2,2,3,5,5,5,4,2,5,3,2,4015750 84 | 22,2,2,2,2,4,4,3,1,4,5,5,5,2,5,5,0,0,0,0,0,2,1,1,1,1,0,0,0,0,1,3,0,0,0,0,0,0,0,0,1619683 85 | 17,1,3,4,5,5,4,1,5,5,3,4,5,5,3,5,3,4,5,5,4,5,5,5,5,5,5,3,5,1,5,1,5,5,4,3,3,4,4,1,4554238 86 | 17,1,3,12,7.5,6,6,2,10,10,10,10,10,4,10,7.5,9,8,12,3,12,10,9,6,3,5,8,8,10,10,10,7.5,5,6,10,6,18,12,12,6,4136426 87 | 9,2,2,3,4,4,4,2,2,5,5,5,5,1,5,5,0,0,0,0,0,3,2,1,2,1,0,0,0,0,3,3,0,0,0,0,0,0,0,0,7592272 88 | 17,1,3,4,5,4,4,1,3,5,5,5,5,2,5,5,2,3,3,1,3,4,2,1,2,1,5,2,5,2,2,3,4,3,3,2,3,4,3,3,2058644 89 | 17,1,3,4,5,4,4,2,4,5,4,4,4,5,3,4,5,5,5,5,5,5,5,5,3,1,5,5,5,1,5,1,5,5,4,2,5,5,5,3,13575225 90 | 29,2,2,2,3,4,4,2,2,5,5,5,5,1,5,5,3,2,4,1,3,3,2,1,2,1,5,1,2,5,1,3,4,4,3,4,4,4,3,5,3753721 91 | 6,2,2,1,2,5,3,1,2,1,5,5,5,1,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,2,0,0,0,0,0,0,0,0,2792031 92 | 18,1,2,2,4,4,4,2,2,5,4,4,5,3,4,5,0,0,0,0,0,2,2,1,1,1,0,0,0,0,3,3,0,0,0,0,0,0,0,0,8894599 93 | 17,1,2,12,7.5,6,6,2,8,10,10,10,10,4,10,7.5,0,0,0,0,0,15,3,3,3,5,0,0,0,0,7.5,7.5,0,0,0,0,0,0,0,0,8630683 94 | 4,1,3,3,5,4,5,2,3,5,3,5,4,5,5,4,3,4,4,1,4,5,5,2,3,2,2,2,2,1,3,3,5,5,4,3,3,4,3,1,2267426 95 | 17,1,2,5,5,4,4,2,2,5,4,4,4,4,3,4,0,0,0,0,0,4,5,1,3,1,0,0,0,0,3,1,0,0,0,0,0,0,0,0,1149870 96 | 9,2,3,3,5,4,4,1,2,5,5,5,5,1,5,5,0,0,0,0,0,3,2,1,2,2,0,0,0,0,1,3,0,0,0,0,0,0,0,0,2083447 97 | 17,1,3,12,7.5,7.5,4.5,2,10,10,10,10,10,4,10,7.5,3,4,9,3,3,10,15,3,4,5,6,4,7.5,5,10,5,25,6,10,4,18,12,12,2,1847826 98 | 17,1,3,2,4,4,4,2,5,5,5,5,5,2,5,5,2,2,5,2,4,2,5,1,1,3,5,2,3,5,3,3,5,5,4,2,3,4,4,2,5161371 99 | 15,2,3,3,4,3,4,2,2,5,5,5,5,2,5,5,2,1,2,1,4,2,2,1,2,1,2,3,3,5,1,3,5,1,3,2,3,4,3,3,4316716 100 | 19,2,2,2,2,4,3,2,1,5,4,5,5,5,5,5,0,0,0,0,0,5,4,2,2,2,0,0,0,0,2,3,0,0,0,0,0,0,0,0,3807496 101 | 31,2,3,3,4,4,4,1,2,5,4,4,5,4,4,5,5,5,3,2,4,3,5,1,3,1,2,2,2,1,2,2,3,1,4,3,4,5,5,3,3410879 102 | 9,2,2,4,5,4,3,1,2,5,4,4,5,4,4,5,0,0,0,0,0,5,5,5,5,5,0,0,0,0,4,3,0,0,0,0,0,0,0,0,5435277 103 | 17,1,3,6,4.5,6,6,4,8,10,10,10,10,4,10,7.5,6,4,12,6,12,5,9,3,2,5,8,8,10,12.5,5,7.5,25,15,25,4,18,12,16,8,4882986 104 | 
4,1,2,3,4,4,5,3,4,5,4,4,4,5,4,4,0,0,0,0,0,4,5,1,2,1,0,0,0,0,3,2,0,0,0,0,0,0,0,0,3199619 105 | 17,1,2,4,5,4,5,2,2,5,5,5,5,2,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,3,3,0,0,0,0,0,0,0,0,7217635 106 | 6,2,2,2,3,4,4,2,2,5,4,5,5,3,5,5,0,0,0,0,0,5,4,2,2,3,0,0,0,0,2,3,0,0,0,0,0,0,0,0,4067567 107 | 14,2,2,4,5,5,4,2,2,5,4,5,5,3,5,5,0,0,0,0,0,5,5,4,5,5,0,0,0,0,4,1,0,0,0,0,0,0,0,0,3939804 108 | 17,1,3,4,5,5,4,1,5,5,5,4,4,2,4,4,3,4,2,1,4,2,3,2,2,1,3,3,4,5,3,3,3,2,1,2,2,1,3,3,3784230 109 | 4,1,3,2,3,4,3,1,5,5,5,5,5,1,5,5,1,1,2,1,4,1,3,1,2,5,1,1,1,1,1,3,5,5,5,3,4,4,3,1,2740687 110 | 10,2,2,3,4,4,4,1,2,5,2,4,5,5,4,5,0,0,0,0,0,3,5,1,3,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,2344689 111 | 4,1,2,1,1,4,4,2,1,1,4,4,4,5,2,4,0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,3,0,0,0,0,0,0,0,0,3447891 112 | 17,1,2,2,4,4,4,1,4,5,5,5,5,1,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,3,3,0,0,0,0,0,0,0,0,6941174 113 | 9,2,2,3,3,4,4,2,2,5,4,5,5,3,5,5,0,0,0,0,0,3,5,1,3,2,0,0,0,0,2,3,0,0,0,0,0,0,0,0,3351383 114 | 28,2,2,2,4,4,4,2,3,5,5,5,4,2,5,4,4,1,2,2,4,3,1,1,1,1,3,3,3,1,2,3,1,1,3,2,3,1,3,3,5286213 115 | 4,1,2,2,4,3,5,4,5,5,4,4,4,5,4,4,0,0,0,0,0,1,2,1,1,1,0,0,0,0,3,2,0,0,0,0,0,0,0,0,4219263 116 | 21,2,2,9,6,6,6,4,6,10,10,10,10,2,10,7.5,0,0,0,0,0,15,12,3,3,10,0,0,0,0,7.5,7.5,0,0,0,0,0,0,0,0,3956086 117 | 18,1,2,3,5,4,3,1,2,5,5,5,5,1,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,3,0,0,0,0,0,0,0,0,1756069 118 | 5,2,2,2,3,4,3,1,4,5,5,5,5,2,5,5,0,0,0,0,0,1,2,1,1,1,0,0,0,0,3,2,0,0,0,0,0,0,0,0,3258837 119 | 17,1,2,2,4,4,5,1,3,5,4,5,5,3,5,5,0,0,0,0,0,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,8213525 120 | 17,1,3,2,4,3,4,2,3,5,5,5,4,1,5,4,2,2,2,1,4,2,1,1,2,1,5,3,5,5,2,3,5,5,5,4,4,4,3,4,3836721 121 | 17,1,3,5,5,3,5,2,2,5,5,5,4,2,5,4,4,4,4,5,4,5,5,2,2,3,2,2,2,5,3,2,5,5,4,4,4,4,5,2,8904084 122 | 4,1,2,2,4,5,4,1,2,5,5,5,5,2,5,5,0,0,0,0,0,1,1,1,2,1,0,0,0,0,1,3,0,0,0,0,0,0,0,0,2732646 123 | 17,1,2,3,5,4,4,2,5,5,4,5,4,3,5,4,0,0,0,0,0,1,4,1,1,1,0,0,0,0,2,2,0,0,0,0,0,0,0,0,5461701 124 | 4,1,3,2,3,5,3,1,5,5,5,5,5,1,5,5,1,1,3,1,4,1,3,1,2,1,1,1,1,1,2,3,3,5,5,2,3,4,3,4,4264176 125 | 17,1,2,3,5,4,4,1,4,5,4,5,4,4,5,4,0,0,0,0,0,4,5,2,4,5,0,0,0,0,5,1,0,0,0,0,0,0,0,0,6694797 126 | 11,2,2,4,5,2,4,2,2,5,4,4,4,3,4,4,0,0,0,0,0,3,2,1,2,3,0,0,0,0,2,2,0,0,0,0,0,0,0,0,6412623 127 | 18,1,3,4,5,5,3,1,2,5,4,5,5,3,5,5,4,4,3,5,4,3,5,3,5,5,4,4,4,4,5,3,4,1,2,2,3,5,5,1,7865428 128 | 17,1,2,2,4,4,4,1,3,5,4,5,5,3,5,5,0,0,0,0,0,2,4,2,2,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,4066619 129 | 2,2,3,1,1,4,4,1,2,1,5,5,5,1,5,5,1,1,2,1,4,1,1,1,1,1,4,4,4,2,2,3,4,5,5,3,4,5,4,5,4952498 130 | 14,2,3,4,5,5,4,2,2,5,4,5,5,3,5,5,3,4,3,1,3,5,5,4,5,5,3,3,3,2,4,1,5,3,3,2,3,4,3,1,4155436 131 | 30,2,2,3,4,3,4,3,2,5,4,4,4,3,4,4,0,0,0,0,0,5,1,1,1,1,0,0,0,0,1,3,0,0,0,0,0,0,0,0,3261924 132 | 17,1,3,2,3,4,4,1,5,5,5,5,5,2,5,5,3,4,4,3,4,2,4,1,2,1,5,4,4,5,1,3,4,5,2,2,3,5,4,4,5166635 133 | 4,1,2,3,5,5,3,1,5,5,5,5,5,2,5,5,0,0,0,0,0,2,5,2,4,3,0,0,0,0,3,3,0,0,0,0,0,0,0,0,3028268 134 | 17,1,3,4,5,4,4,2,2,5,4,5,5,3,5,5,1,2,2,2,4,5,4,1,3,3,1,1,1,4,2,3,5,3,4,5,5,4,3,4,5653753 135 | 33,2,2,2,3,3,5,4,2,4,4,4,4,4,4,4,0,0,0,0,0,4,3,2,1,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,5787594 136 | 17,1,2,3,5,4,4,2,5,5,4,5,4,3,5,4,0,0,0,0,0,1,4,1,1,1,0,0,0,0,2,2,0,0,0,0,0,0,0,0,19696939 137 | 17,1,2,4,5,4,4,1,3,5,5,5,5,2,5,5,0,0,0,0,0,3,4,1,2,1,0,0,0,0,3,3,0,0,0,0,0,0,0,0,4807747 138 | 17,1,2,3,5,4,4,2,5,5,4,5,4,3,5,4,0,0,0,0,0,1,4,1,1,1,0,0,0,0,2,3,0,0,0,0,0,0,0,0,7495093 139 | -------------------------------------------------------------------------------- /dboostForest.R: -------------------------------------------------------------------------------- 1 | #See dboostFunCommented.R for readable, 
commented, simpler code.
2 | #This dboostForest.R is a work in progress with various experimental features.
3 | require(glmnet)
4 | #require(freestats)
5 | #require(randomForest)
6 | require(data.table)
7 | set.seed(1)
8 | train = fread('/home/mikeskim/Desktop/tfiAlgo/train.csv',data.table=F)
9 | #download train data at https://www.kaggle.com/c/restaurant-revenue-prediction/data
10 | #License GPL-2 as it depends on glmnet
11 | 
12 | hashfun = function(x,seedj,bool1,gseed) {
13 | if (bool1) {
14 | set.seed(seedj+(gseed*999999))
15 | return((x*runif(1))%%1)
16 | }
17 | else {
18 | return(x)
19 | }
20 | }
21 | 
22 | unique1 = function(x,bool1) {
23 | if (bool1) {
24 | return(unique(x))
25 | }
26 | else {
27 | return(x)
28 | }
29 | }
30 | 
31 | train$Id=NULL
32 | train$`Open Date`=NULL
33 | train$City = as.numeric(as.factor(train$City))
34 | train$Type = as.numeric(as.factor(train$Type))
35 | train$`City Group`=as.numeric(as.factor(train$`City Group`))
36 | train$revenue = log(train$revenue)
37 | train = train[sample(nrow(train)),]
38 | test = train[1:37,]
39 | train = train[38:137,]
40 | 
41 | testX = as.matrix(test[,-ncol(test)])
42 | testY = as.matrix(test$revenue)
43 | 
44 | trainX = as.matrix(train[,-ncol(train)])
45 | trainY = as.matrix(train$revenue)
46 | 
47 | 
48 | 
49 | 
#Note: unlike dboostFun.R, this forest-style variant does not make an additive step0 update (step0 is currently unused);
#it collects every weak learner's predictions as columns of preds/testpreds and periodically ridge-stacks them (lambda1).
50 | dBoost = function(trainX,trainY,testX,COLN=40,ROWN=100,ntrees=50,step0=0.0038,lambda0=0.4,lambda1=0.2,crossNum=7,uni=T,hashB=T,gseed=6) {
51 | mmrows = nrow(testX)
52 | mrows = nrow(trainX)
53 | mcols = ncol(trainX)
54 | 
55 | preds = rep(mean(trainY),mrows)
56 | testpreds = rep(mean(trainY),mmrows)
57 | #modelList = list()
58 | 
59 | #COLN = 25#40
60 | #ROWN = 22#100
61 | for (j in 1:ntrees) {
62 | tmpR = sample(mrows,replace=T)[1:ROWN]
63 | tmpC = sample(mcols,replace=T)[1:COLN]
64 | tmpY = trainY[tmpR]
65 | tmpX = trainX[tmpR,]
66 | tmpX = tmpX[,tmpC]
67 | tmpXX = trainX[,tmpC]
68 | testXX = testX[,tmpC]
69 | tmpX0 = tmpX
70 | tmpXX0 = tmpXX
71 | testXX0 = testXX
72 | 
73 | for (k in 1:COLN) {
74 | tmpA = rep(0, ROWN)
75 | tmpX[,k] = hashfun(tmpX[,k],j,hashB,gseed)
76 | #tmpS = rep(-1,length(tmpY)); tmpM = quantile(tmpY,runif(1,0.15,0.85)); tmpS[tmpY>tmpM]=1
77 | #cutoff = decisionStump(X=tmpX,w=1/(tmpY^2),y=tmpS)$theta
78 | #cutoff = decisionStump(X=tmpX,w=rep(1,length(tmpS)),y=tmpS)$theta
79 | cutoff = sample(x=unique1(tmpX[,k],uni),size=1)
80 | tmpA[tmpX[,k]>cutoff]=1
81 | tmpX[,k] = tmpA
82 | 
83 | tmpA = rep(0, mrows)
84 | tmpXX[,k] = hashfun(tmpXX[,k],j,hashB,gseed)
85 | tmpA[tmpXX[,k]>cutoff]=1
86 | tmpXX[,k] = tmpA
87 | 
88 | tmpA = rep(0, mmrows)
89 | testXX[,k] = hashfun(testXX[,k],j,hashB,gseed)
90 | tmpA[testXX[,k]>cutoff]=1
91 | testXX[,k] = tmpA
92 | }
93 | 
94 | for (k in 1:crossNum) {
95 | k1 = sample(1:COLN,size=1)
96 | k2 = sample(1:COLN,size=1)
97 | tmpA = rep(0, ROWN)
98 | combined = tmpX0[,k1]*tmpX0[,k2]
99 | cutoff = sample(x=unique1(combined,uni),size=1)
100 | tmpA[combined>cutoff]=1
101 | tmpX = cbind(tmpX, tmpA)
102 | 
103 | tmpA = rep(0, mrows)
104 | combined = tmpXX0[,k1]*tmpXX0[,k2]
105 | tmpA[combined>cutoff]=1
106 | tmpXX = cbind(tmpXX, tmpA)
107 | 
108 | tmpA = rep(0, mmrows)
109 | combined = testXX0[,k1]*testXX0[,k2]
110 | tmpA[combined>cutoff]=1
111 | testXX = cbind(testXX, tmpA)
112 | }
113 | 
114 | 
115 | 
116 | modelx = glmnet(x=tmpX,y=tmpY,family="gaussian",alpha=0,lambda=lambda0)
117 | preds = cbind(preds,predict(modelx,tmpXX))
118 | testpreds = cbind(testpreds, predict(modelx,testXX))
#Every 3rd round: refit the ridge stack on the accumulated weak-learner predictions and print held-out MAE.
119 | if (j %%3==0) {
120 | modelx = glmnet(x=preds,y=trainY,family="gaussian",alpha=0,lambda=lambda1)
121 | 
tmpP = predict(modelx,testpreds) 122 | print(mean(abs(testY-tmpP))) 123 | #print(mean(abs(testY-apply(testpreds,1,mean)))) 124 | } 125 | 126 | } 127 | 128 | return(testpreds) 129 | } 130 | 131 | 132 | tmpP = dBoost(trainX,trainY,testX) 133 | #mean(abs(trainY-preds)) 134 | #mean(abs(testY-tmpP)) 135 | 136 | " 137 | COLN=40,ROWN=100,ntrees=50,step0=0.0038,lambda0=0.2,lambda1=0.35,crossNum=8,uni=T,hashB=T,gseed=6 138 | [1] 0.3505081 139 | [1] 0.300838 140 | [1] 0.3160683 141 | [1] 0.3303783 142 | [1] 0.3194279 143 | [1] 0.286952 144 | [1] 0.2843058 145 | [1] 0.2837281 146 | [1] 0.27747 147 | [1] 0.2778305 148 | [1] 0.2782543 149 | [1] 0.2785709 150 | [1] 0.2751876 151 | [1] 0.27808 152 | [1] 0.2752973 153 | [1] 0.2764192 154 | really easy to overfit even local cv... need large train. 155 | " 156 | 157 | #[1] 0.3394888 158 | # 0.3338347 159 | # 0.3350612 (better than some rf runs... tuned dboost>>>untuned rf???) 160 | # 0.3380544 hash uni=T 161 | #0.338174 with hash 162 | #0.3408058 without hash 163 | #comparable to rf.... sometimes. -------------------------------------------------------------------------------- /dboostFun.R: -------------------------------------------------------------------------------- 1 | #See dboostFunCommented.R for readable, commented, simpler code. 2 | #This dboostFun.R is a work in progress with various experimental features. 3 | require(glmnet) 4 | #require(randomForest) 5 | require(data.table) 6 | set.seed(3) 7 | train = fread('/home/mikeskim/Desktop/tfiAlgo/train.csv',data.table=F) 8 | #download train data at https://www.kaggle.com/c/restaurant-revenue-prediction/data 9 | #License GPL-2 as it depends on glmnet 10 | 11 | hashfun = function(x,seedj,bool1,gseed) { 12 | if (bool1) { 13 | set.seed(seedj+(gseed*999999)) 14 | return((x*runif(1))%%1) 15 | } 16 | else { 17 | return(x) 18 | } 19 | } 20 | 21 | unique1 = function(x,bool1) { 22 | if (bool1) { 23 | return(unique(x)) 24 | } 25 | else { 26 | return(x) 27 | } 28 | } 29 | 30 | train$Id=NULL 31 | train$`Open Date`=NULL 32 | train$City = as.numeric(as.factor(train$City)) 33 | train$Type = as.numeric(as.factor(train$Type)) 34 | train$`City Group`=as.numeric(as.factor(train$`City Group`)) 35 | train$revenue = log(train$revenue) 36 | train = train[sample(nrow(train)),] 37 | test = train[1:37,] 38 | train = train[38:137,] 39 | 40 | testX = as.matrix(test[,-ncol(test)]) 41 | testY = as.matrix(test$revenue) 42 | 43 | trainX = as.matrix(train[,-ncol(train)]) 44 | trainY = as.matrix(train$revenue) 45 | 46 | 47 | 48 | 49 | dBoost = function(trainX,trainY,testX,COLN=24,ROWN=22,ntrees=4300,step0=0.0038,lambda0=0.3,crossNum=7,uni=T,hashB=T,gseed=6) { 50 | mmrows = nrow(testX) 51 | mrows = nrow(trainX) 52 | mcols = ncol(trainX) 53 | 54 | preds = rep(0,mrows) 55 | testpreds = rep(0,mmrows) 56 | #modelList = list() 57 | 58 | #COLN = 25#40 59 | #ROWN = 22#100 60 | for (j in 1:ntrees) { 61 | tmpR = sample(mrows,replace=T)[1:ROWN] 62 | tmpC = sample(mcols,replace=T)[1:COLN] 63 | tmpY = trainY[tmpR] - preds[tmpR] 64 | tmpX = trainX[tmpR,] 65 | tmpX = tmpX[,tmpC] 66 | tmpXX = trainX[,tmpC] 67 | testXX = testX[,tmpC] 68 | tmpX0 = tmpX 69 | tmpXX0 = tmpXX 70 | testXX0 = testXX 71 | 72 | for (k in 1:COLN) { 73 | tmpA = rep(0, ROWN) 74 | tmpX[,k] = hashfun(tmpX[,k],j,hashB,gseed) 75 | cutoff = sample(x=unique1(tmpX[,k],uni),size=1) 76 | tmpA[tmpX[,k]>cutoff]=1 77 | tmpX[,k] = tmpA 78 | 79 | tmpA = rep(0, mrows) 80 | tmpXX[,k] = hashfun(tmpXX[,k],j,hashB,gseed) 81 | tmpA[tmpXX[,k]>cutoff]=1 82 | tmpXX[,k] = tmpA 83 | 84 | tmpA = rep(0, 
mmrows) 85 | testXX[,k] = hashfun(testXX[,k],j,hashB,gseed) 86 | tmpA[testXX[,k]>cutoff]=1 87 | testXX[,k] = tmpA 88 | } 89 | 90 | for (k in 1:crossNum) { 91 | k1 = sample(1:COLN,size=1) 92 | k2 = sample(1:COLN,size=1) 93 | tmpA = rep(0, ROWN) 94 | combined = tmpX0[,k1]*tmpX0[,k2] 95 | cutoff = sample(x=unique1(combined,uni),size=1) 96 | tmpA[combined>cutoff]=1 97 | tmpX = cbind(tmpX, tmpA) 98 | 99 | tmpA = rep(0, mrows) 100 | combined = tmpXX0[,k1]*tmpXX0[,k2] 101 | tmpA[combined>cutoff]=1 102 | tmpXX = cbind(tmpXX, tmpA) 103 | 104 | tmpA = rep(0, mmrows) 105 | combined = testXX0[,k1]*testXX0[,k2] 106 | tmpA[combined>cutoff]=1 107 | testXX = cbind(testXX, tmpA) 108 | } 109 | 110 | 111 | if (j ==1 ) { 112 | modelx = glmnet(x=trainX,y=trainY,family="gaussian",alpha=0,lambda=lambda0) 113 | preds = preds + step0*predict(modelx,trainX) 114 | testpreds = testpreds + step0*predict(modelx,testX) 115 | } 116 | else { 117 | modelx = glmnet(x=tmpX,y=tmpY,family="gaussian",alpha=0,lambda=lambda0) 118 | preds = preds + step0*predict(modelx,tmpXX) 119 | testpreds = testpreds + step0*predict(modelx,testXX) 120 | } 121 | } 122 | 123 | return(testpreds) 124 | } 125 | 126 | 127 | tmpP = dBoost(trainX,trainY,testX) 128 | #mean(abs(trainY-preds)) 129 | mean(abs(testY-tmpP)) 130 | # 0.3338347 131 | # 0.3350612 (better than some rf runs... tuned dboost>>>untuned rf???) 132 | # 0.3380544 hash uni=T 133 | #0.338174 with hash 134 | #0.3408058 without hash 135 | #comparable to rf.... sometimes. -------------------------------------------------------------------------------- /dboostFunCommented.R: -------------------------------------------------------------------------------- 1 | #Michael S Kim - mikeskim at gmail -dot- com 2 | #03/2016 3 | #License GPL-2 as it depends on glmnet 4 | #Simpler and easier to understand version of dboost with comments. 5 | #Some code is taken out to make it easier to read. 6 | #See dboostFun.R for a more complete version that gives better CV scores. 7 | 8 | #Write some helper functions to use later 9 | unique1 = function(x,bool1) { 10 | if (bool1) { 11 | return(unique(x)) 12 | } 13 | else { 14 | return(x) 15 | } 16 | } 17 | 18 | #Load glmnet package since we will do Ridge regression as our weak learner 19 | require(glmnet) 20 | 21 | #Set seed to reproduce results - this is stochastic boosting. 22 | set.seed(1) 23 | 24 | #Load small sample dataset 25 | train = read.csv('/home/mikeskim/Desktop/dboost/data/train.csv') 26 | 27 | #Mangle the data to get into train,test 28 | #Transform target variable with log to make it easier to fit a linear model to. 29 | train$target = log(train$target) 30 | 31 | #Shuffle all data for train and test split 32 | train = train[sample(nrow(train)),] 33 | 34 | #Test on first 37 rows 35 | test = train[1:37,] 36 | 37 | #Train on next 100 or so rows 38 | train = train[38:137,] 39 | 40 | #Split out target variable (last column) 41 | testX = as.matrix(test[,-ncol(test)]) 42 | testY = as.matrix(test[,ncol(test)]) 43 | 44 | trainX = as.matrix(train[,-ncol(train)]) 45 | trainY = as.matrix(train[,ncol(train)]) 46 | 47 | 48 | #This is the dboost function that takes input your training independent variables trainX, your training target trainY 49 | #And your test independent variables testX, testing target is left out for you to CV with later outside the function. 
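#Arguments: COLN and ROWN set the sizes of the random column and row subsamples drawn each round,
#ntrees is the number of boosting rounds, step0 is the shrinkage (learning rate) applied to each
#weak learner's predictions, lambda0 is the ridge penalty passed to glmnet, and uni controls whether
#cutoffs are sampled from a column's unique values (TRUE) or from its raw values (FALSE, which
#weights candidate cutoffs by how often each value occurs).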
50 | dBoost = function(trainX,trainY,testX,COLN=24,ROWN=22,ntrees=3000,step0=0.0038,lambda0=0.9,uni=F) {
51 | #Saving dimensions of inputs to use later for resampling steps
52 | mmrows = nrow(testX)
53 | mrows = nrow(trainX)
54 | mcols = ncol(trainX)
55 | 
56 | #Set initial (all zero) prediction vectors for train and test
57 | preds = rep(0,mrows)
58 | testpreds = rep(0,mmrows)
59 | 
60 | #This is the main boosting loop that repeats to ensemble the weak learners (ridge regression models)
61 | for (j in 1:ntrees) {
62 | #This is stochastic boosting, so resampling with replacement from columns and rows.
63 | tmpR = sample(mrows,replace=T)[1:ROWN]
64 | tmpC = sample(mcols,replace=T)[1:COLN]
65 | 
66 | #You are training on (updated) residuals
67 | tmpY = trainY[tmpR] - preds[tmpR]
68 | 
69 | #Setup some matrices required for training and testing at step j.
70 | tmpX = trainX[tmpR,]
71 | tmpX = tmpX[,tmpC]
72 | tmpXX = trainX[,tmpC]
73 | testXX = testX[,tmpC]
74 | 
75 | #This is where you make your dummy variables via random splitting.
76 | for (k in 1:COLN) {
77 | #Make dummy variable vector filled with zeros for now.
78 | tmpA = rep(0, ROWN)
79 | 
80 | #Randomly select a cutoff from the original variable. This is fixed for this j iteration.
81 | cutoff = sample(x=unique1(tmpX[,k],uni),size=1)
82 | 
83 | #Fill the dummy variable with 1s on indices where the original variable is greater than the random cutoff.
84 | tmpA[tmpX[,k]>cutoff]=1
85 | 
86 | #Replace original variable with dummy variable (only for this j iteration / weak learner)
87 | tmpX[,k] = tmpA
88 | 
89 | #Repeat above steps using all of train - above only uses a resampled subset of training rows.
90 | #So your model j's training is done above (trained on subset),
91 | #but your model j's prediction on train is applied on all training data.
92 | tmpA = rep(0, mrows)
93 | tmpA[tmpXX[,k]>cutoff]=1
94 | tmpXX[,k] = tmpA
95 | 
96 | #Repeat above using all of test
97 | tmpA = rep(0, mmrows)
98 | tmpA[testXX[,k]>cutoff]=1
99 | testXX[,k] = tmpA
100 | } #End dummy variable making loop
101 | 
102 | #Train weak learner via glmnet using only a subset of rows and columns (each column is dummied so contains only 0 or 1)
103 | modelx = glmnet(x=tmpX,y=tmpY,family="gaussian",alpha=0,lambda=lambda0)
104 | 
105 | #Make predictions scaled via step0 (for both train and test)
106 | preds = preds + step0*predict(modelx,tmpXX)
107 | testpreds = testpreds + step0*predict(modelx,testXX)
108 | }#End boosting loop ensembling weak learners
109 | 
110 | return(testpreds)
111 | }#End dboost function, this is like 30 lines of code without comments.
112 | 
113 | #Predict on test set using training input data
114 | tmpP = dBoost(trainX,trainY,testX)
115 | 
116 | #See what the test score is (one holdout cross validation)
117 | mean(abs(testY-tmpP))
118 | # 0.4142502
119 | #comparable to rf.... sometimes. will require tuning.
--------------------------------------------------------------------------------
/dboostStump.R:
--------------------------------------------------------------------------------
1 | #See dboostFunCommented.R for readable, commented, simpler code.
2 | #This dboostStump.R is a work in progress with various experimental features.
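#The experimental difference from dboostFun.R: instead of a purely random cutoff, each dummy
#split fits a weighted decision stump (freestats::decisionStump) on +/-1 labels made by
#thresholding the residuals at a random quantile, then uses that stump's theta as the cutoff.
#(Note decisionStump is fit on the whole tmpX matrix, but its theta is applied per column.)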
3 | require(glmnet) 4 | require(freestats) 5 | #require(randomForest) 6 | require(data.table) 7 | set.seed(3) 8 | train = fread('/home/mikeskim/Desktop/tfiAlgo/train.csv',data.table=F) 9 | #download train data at https://www.kaggle.com/c/restaurant-revenue-prediction/data 10 | #License GPL-2 as it depends on glmnet 11 | 12 | hashfun = function(x,seedj,bool1,gseed) { 13 | if (bool1) { 14 | set.seed(seedj+(gseed*999999)) 15 | return((x*runif(1))%%1) 16 | } 17 | else { 18 | return(x) 19 | } 20 | } 21 | 22 | unique1 = function(x,bool1) { 23 | if (bool1) { 24 | return(unique(x)) 25 | } 26 | else { 27 | return(x) 28 | } 29 | } 30 | 31 | train$Id=NULL 32 | train$`Open Date`=NULL 33 | train$City = as.numeric(as.factor(train$City)) 34 | train$Type = as.numeric(as.factor(train$Type)) 35 | train$`City Group`=as.numeric(as.factor(train$`City Group`)) 36 | train$revenue = log(train$revenue) 37 | train = train[sample(nrow(train)),] 38 | test = train[1:37,] 39 | train = train[38:137,] 40 | 41 | testX = as.matrix(test[,-ncol(test)]) 42 | testY = as.matrix(test$revenue) 43 | 44 | trainX = as.matrix(train[,-ncol(train)]) 45 | trainY = as.matrix(train$revenue) 46 | 47 | 48 | 49 | 50 | dBoost = function(trainX,trainY,testX,COLN=24,ROWN=22,ntrees=9900,step0=0.0038,lambda0=0.3,crossNum=7,uni=T,hashB=T,gseed=6) { 51 | mmrows = nrow(testX) 52 | mrows = nrow(trainX) 53 | mcols = ncol(trainX) 54 | 55 | preds = rep(0,mrows) 56 | testpreds = rep(0,mmrows) 57 | #modelList = list() 58 | 59 | #COLN = 25#40 60 | #ROWN = 22#100 61 | for (j in 1:ntrees) { 62 | tmpR = sample(mrows,replace=T)[1:ROWN] 63 | tmpC = sample(mcols,replace=T)[1:COLN] 64 | tmpY = trainY[tmpR] - preds[tmpR] 65 | tmpX = trainX[tmpR,] 66 | tmpX = tmpX[,tmpC] 67 | tmpXX = trainX[,tmpC] 68 | testXX = testX[,tmpC] 69 | tmpX0 = tmpX 70 | tmpXX0 = tmpXX 71 | testXX0 = testXX 72 | 73 | for (k in 1:COLN) { 74 | tmpA = rep(0, ROWN) 75 | tmpX[,k] = hashfun(tmpX[,k],j,hashB,gseed) 76 | tmpS = rep(-1,length(tmpY)); tmpM = quantile(tmpY,runif(1,0.2,0.8)); tmpS[tmpY>tmpM]=1 77 | cutoff = decisionStump(X=tmpX,w=1/(tmpY^2),y=tmpS)$theta 78 | #cutoff = sample(x=unique1(tmpX[,k],uni),size=1) 79 | tmpA[tmpX[,k]>cutoff]=1 80 | tmpX[,k] = tmpA 81 | 82 | tmpA = rep(0, mrows) 83 | tmpXX[,k] = hashfun(tmpXX[,k],j,hashB,gseed) 84 | tmpA[tmpXX[,k]>cutoff]=1 85 | tmpXX[,k] = tmpA 86 | 87 | tmpA = rep(0, mmrows) 88 | testXX[,k] = hashfun(testXX[,k],j,hashB,gseed) 89 | tmpA[testXX[,k]>cutoff]=1 90 | testXX[,k] = tmpA 91 | } 92 | 93 | for (k in 1:crossNum) { 94 | k1 = sample(1:COLN,size=1) 95 | k2 = sample(1:COLN,size=1) 96 | tmpA = rep(0, ROWN) 97 | combined = tmpX0[,k1]*tmpX0[,k2] 98 | cutoff = sample(x=unique1(combined,uni),size=1) 99 | tmpA[combined>cutoff]=1 100 | tmpX = cbind(tmpX, tmpA) 101 | 102 | tmpA = rep(0, mrows) 103 | combined = tmpXX0[,k1]*tmpXX0[,k2] 104 | tmpA[combined>cutoff]=1 105 | tmpXX = cbind(tmpXX, tmpA) 106 | 107 | tmpA = rep(0, mmrows) 108 | combined = testXX0[,k1]*testXX0[,k2] 109 | tmpA[combined>cutoff]=1 110 | testXX = cbind(testXX, tmpA) 111 | } 112 | 113 | 114 | if (j ==1 ) { 115 | modelx = glmnet(x=trainX,y=trainY,family="gaussian",alpha=0,lambda=lambda0) 116 | preds = preds + step0*predict(modelx,trainX) 117 | testpreds = testpreds + step0*predict(modelx,testX) 118 | } 119 | else { 120 | modelx = glmnet(x=tmpX,y=tmpY,family="gaussian",alpha=0,lambda=lambda0) 121 | preds = preds + step0*predict(modelx,tmpXX) 122 | testpreds = testpreds + step0*predict(modelx,testXX) 123 | if (j %%100==0) { 124 | print(mean(abs(testY-testpreds))) 125 | } 126 | } 127 
| } 128 | 129 | return(testpreds) 130 | } 131 | 132 | 133 | tmpP = dBoost(trainX,trainY,testX) 134 | #mean(abs(trainY-preds)) 135 | mean(abs(testY-tmpP)) 136 | #[1] 0.3394888 137 | # 0.3338347 138 | # 0.3350612 (better than some rf runs... tuned dboost>>>untuned rf???) 139 | # 0.3380544 hash uni=T 140 | #0.338174 with hash 141 | #0.3408058 without hash 142 | #comparable to rf.... sometimes. --------------------------------------------------------------------------------