├── .gitignore
├── 1-train_test-same_yr
│   ├── analyze.Rmd
│   ├── analyze.html
│   ├── res.csv
│   └── run-tuning.R
├── 2-train_test_1each
│   ├── analyze-1000.Rmd
│   ├── analyze-100K-100.html
│   ├── analyze-100K-1000.html
│   ├── analyze-10K-100.html
│   ├── analyze-1M-100.html
│   ├── analyze.Rmd
│   ├── fig-100K-1000-AUCcorr.png
│   ├── res-100K-100.csv
│   ├── res-100K-1000.csv
│   ├── res-10K-100.csv
│   ├── res-1M-100.csv
│   └── run-tuning.R
├── 3-test_rs
│   ├── analyze-100K.html
│   ├── analyze-10K.html
│   ├── analyze.Rmd
│   ├── fig-100K-AUCcorr.png
│   ├── fig-100K-AUCrs_rank.png
│   ├── fig-100K-AUCtest_rank.png
│   ├── fig-10K-AUCcorr.png
│   ├── res-100K.csv
│   ├── res-10K.csv
│   └── run-tuning.R
├── 4-train_rs
│   ├── analyze.Rmd
│   ├── analyze.html
│   ├── fig-AUCcorr.png
│   ├── res.csv
│   └── run-tuning.R
├── 5-ensemble
│   ├── analyze.Rmd
│   ├── analyze.html
│   ├── fig-AUCens.png
│   ├── res.csv
│   └── run-tuning.R
├── GBM-tune.Rproj
└── README.md
/.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | .DS_Store -------------------------------------------------------------------------------- /1-train_test-same_yr/analyze.Rmd: -------------------------------------------------------------------------------- 1 | ```{r} 2 | 3 | library(data.table) 4 | library(dplyr) 5 | library(ggplot2) 6 | 7 | d_pm_res <- fread("res.csv") 8 | 9 | 10 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = auc_rs_avg)) + 11 | geom_errorbar(aes(x = rank, ymin = auc_rs_avg-auc_rs_std, ymax = auc_rs_avg+auc_rs_std), width = 1) 12 | 13 | d_pm_res %>% ggplot() + geom_point(aes(x = auc_test, y = auc_rs_avg)) + 14 | geom_errorbar(aes(x = auc_test, ymin = auc_rs_avg-auc_rs_std, ymax = auc_rs_avg+auc_rs_std), width = 0.0003) + 15 | geom_abline(slope = 1, color = "grey70") 16 | 17 | auc_test_top <- sort(d_pm_res$auc_test, decreasing = TRUE)[10] 18 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% mutate(is_top = auc_test>=auc_test_top) %>% 19 | ggplot(aes(color = is_top)) + geom_point(aes(x = rank, y = auc_rs_avg)) + 20 | geom_errorbar(aes(x = rank, ymin = auc_rs_avg-auc_rs_std, ymax = auc_rs_avg+auc_rs_std), width = 0.3) 21 | 22 | 23 | d_pm_res %>% arrange(desc(auc_rs_avg)) %>% head(10) 24 | 25 | d_pm_res %>% arrange(desc(auc_test)) %>% head(10) 26 | 27 | 28 | d_pm_res %>% arrange(desc(auc_rs_avg)) %>% tail(10) 29 | 30 | 31 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = num_leaves)) + scale_y_log10() 32 | 33 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = learning_rate)) + scale_y_log10() 34 | 35 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = min_data_in_leaf)) + scale_y_log10() 36 | 37 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = feature_fraction)) 38 | 39 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = bagging_fraction)) 40 | 41 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = lambda_l1)) 42 | 43 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = lambda_l2)) 44 | 45 | 46 | 47 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = ntrees)) 48 | 49 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = 
ntrees)) + scale_y_log10(breaks=c(30,100,300,1000,3000)) 50 | 51 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = runtm)) 52 | 53 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = runtm)) + scale_y_log10(breaks=c(1,3,10,30,100)) 54 | 55 | ``` -------------------------------------------------------------------------------- /1-train_test-same_yr/res.csv: -------------------------------------------------------------------------------- 1 | num_leaves,learning_rate,min_data_in_leaf,feature_fraction,bagging_fraction,lambda_l1,lambda_l2,krpm,ntrees,runtm,auc_rs_avg,auc_rs_std,auc_test 2 | 100,0.03,5,0.8,0.4,0,0,1,650.7,4.0216,0.793133331684209,0.00212887297927467,0.798324975559597 3 | 100,0.01,20,0.8,0.8,0,0.1,2,1533.35,10.0221,0.792578036202827,0.00132179685125197,0.794173722247516 4 | 1000,0.1,10,0.6,0.4,0.3,0,3,73.7,3.1904,0.798616993349702,0.00128329958567587,0.810451945521633 5 | 100,0.1,5,0.6,0.6,0,0.3,4,212.2,1.35345,0.783952047314779,0.00147874687997539,0.788910982752182 6 | 200,0.1,50,0.6,0.4,0.01,0.3,5,189.75,2.03574999999999,0.790614254652929,0.00108070633254583,0.798719621417582 7 | 500,0.01,50,1,0.4,0,0,6,988.45,21.8507,0.798207298647128,0.00174835927425001,0.805647917641174 8 | 2000,0.1,10,0.8,1,0.01,0,7,45.1,4.63100000000002,0.804723179024603,0.00208305402659788,0.821109672798589 9 | 5000,0.03,20,1,0.8,0,0.3,8,165.45,16.0516,0.791286924935662,0.00141259431486052,0.801627084090923 10 | 200,0.1,5,0.6,0.4,0.3,0,9,171.35,1.83250000000002,0.791959818057037,0.00155559317484103,0.801687547239966 11 | 200,0.1,10,0.8,0.6,0,0,10,137.4,1.57290000000002,0.799740180297374,0.00167727444846346,0.808316836517587 12 | 2000,0.1,10,0.8,1,0.01,0.3,11,45.8,4.66340000000001,0.804943769614222,0.00102151590737398,0.82105004527474 13 | 5000,0.1,10,1,0.4,0,0,12,33.25,8.20140000000002,0.772542294382051,0.00139766227086254,0.794084337995178 14 | 1000,0.01,10,1,0.4,0.1,0.01,13,558.95,23.6237999999999,0.800954860203947,0.00180023825607282,0.80765774766394 15 | 200,0.03,20,0.6,0.4,0,0.01,14,407.35,4.14664999999995,0.791253627118857,0.00157063526158686,0.794694259763454 16 | 5000,0.1,10,0.6,0.4,0.1,0,15,58.8,7.82049999999995,0.794808395540459,0.00113205991103151,0.811774411101435 17 | 5000,0.03,50,0.6,0.4,0,0.3,16,353.3,13.3264,0.794190891977503,0.00137168227407346,0.802194822769227 18 | 2000,0.03,20,0.6,0.4,0.1,0,17,230.45,15.1756000000001,0.800791244309088,0.00124830660629967,0.807660067884991 19 | 1000,0.1,20,0.6,0.4,0,0,18,67.8,2.96845000000003,0.796672469513812,0.00123393518435348,0.806082602652894 20 | 2000,0.1,20,0.6,0.4,0,0,19,67.55,5.23190000000002,0.792181862876565,0.00146464129632093,0.80746214221707 21 | 5000,0.1,50,1,0.8,0.01,0.3,20,73.35,4.19949999999992,0.789743216922761,0.00151707479083062,0.80233434108188 22 | 1000,0.01,50,0.6,0.8,0,0.3,21,801.6,24.1895,0.790836868856426,0.00181894042703466,0.797713654745412 23 | 5000,0.03,5,1,1,0.1,0.01,22,100.6,22.0226,0.77136582273674,0.00142404409555978,0.792081898114532 24 | 100,0.1,10,0.8,0.6,0,0.01,23,188.3,1.44554999999991,0.789953407214075,0.00190910184576514,0.799189570334669 25 | 2000,0.01,10,1,0.8,0.3,0.3,24,458.8,38.018,0.795183091511507,0.00108933279777626,0.804163703696687 26 | 100,0.03,10,0.8,0.4,0.01,0.01,25,621.2,3.96425000000018,0.791721385194187,0.00177120031404751,0.798259914311143 27 | 500,0.03,10,1,0.8,0.3,0.01,26,306.6,7.14699999999989,0.803274929888346,0.00134809579926469,0.811264590674137 28 | 
1000,0.1,20,0.8,0.8,0.1,0,27,70.7,3.30155000000004,0.803788535479854,0.00173379535844403,0.817897517125199 29 | 1000,0.03,5,0.8,0.4,0,0.01,28,217.3,9.10365000000011,0.816141557402516,0.00137665718304232,0.824131731930388 30 | 5000,0.03,50,0.8,0.4,0,0,29,270.95,13.67325,0.804824430415439,0.00159402668500612,0.811862684931413 31 | 200,0.1,10,1,0.4,0,0,30,146.2,1.78804999999988,0.794731475822345,0.00164231213630595,0.807469585071821 32 | 200,0.03,50,0.8,0.4,0.1,0.3,31,587.6,6.30069999999987,0.802711699224349,0.00133586274557213,0.808131825451264 33 | 5000,0.03,5,1,0.4,0,0.3,32,106.3,22.22165,0.773436356593152,0.00191279419056369,0.793076858210495 34 | 5000,0.01,50,0.6,1,0.1,0.01,33,762.5,27.4715999999999,0.792692214460731,0.00139436910734219,0.796700571000135 35 | 500,0.03,10,1,1,0,0.1,34,266.25,6.2808500000001,0.804994401700058,0.00157419995564153,0.807713649871748 36 | 1000,0.1,10,1,0.4,0,0,35,57.5,3.04995000000008,0.795278285104018,0.00126662554735763,0.808996288012511 37 | 1000,0.1,20,0.8,0.6,0,0,36,70,3.30264999999972,0.804507780716041,0.00118831335109138,0.817807695616941 38 | 100,0.03,20,0.8,0.4,0,0.1,37,635.8,4.18384999999998,0.793735813957307,0.00208965872517592,0.79804598559811 39 | 1000,0.03,5,0.8,0.8,0,0,38,209.2,8.75834999999997,0.817206962049992,0.00152441083097017,0.823058974861555 40 | 100,0.1,50,0.6,0.8,0,0,39,223.1,1.54625000000005,0.784391705891791,0.00168243423911674,0.790224736916789 41 | 5000,0.01,5,0.8,0.6,0.01,0,40,321.7,63.208,0.813258027009519,0.00131704777983475,0.822575055361432 42 | 1000,0.03,50,1,1,0.3,0,41,275.95,10.8982,0.793635239942747,0.00155089903993613,0.804276485456916 43 | 200,0.01,20,0.6,0.6,0.3,0,42,1201.45,11.9443000000001,0.79373131460204,0.00139400150422791,0.796082854671224 44 | 500,0.01,5,0.6,0.6,0.3,0,43,860.05,17.5143499999998,0.807184638435139,0.00160322015625995,0.807230609203949 45 | 5000,0.03,20,0.6,0.4,0.01,0,44,209.1,17.5648500000002,0.802585070952643,0.00143333924805045,0.805751345064545 46 | 1000,0.01,20,1,0.6,0,0,45,555.15,22.6934999999999,0.796117355067068,0.00142572094052303,0.80280325130023 47 | 500,0.01,20,0.6,1,0.3,0,46,822.75,16.9427500000001,0.800602037659788,0.00150082530746404,0.803727114980181 48 | 1000,0.03,50,0.8,0.6,0.01,0,47,270.65,10.41155,0.802430416676399,0.00113204160246269,0.811734799697879 49 | 200,0.1,20,0.6,1,0,0,48,152.35,1.79664999999995,0.788187987464658,0.00112995575940939,0.796899669308749 50 | 1000,0.1,5,0.6,0.4,0.3,0,49,72.65,3.29420000000018,0.79851269105032,0.00125461823296939,0.81177151881826 51 | 2000,0.01,50,1,1,0.3,0.1,50,777.6,33.3392499999999,0.796194453059917,0.00139300821039638,0.802033688809507 52 | 100,0.1,50,1,0.4,0,0,51,248.75,1.84425000000001,0.792334898329837,0.00149829289840324,0.802619805195038 53 | 500,0.1,20,0.8,0.8,0,0,52,89.9,2.28985000000002,0.799480192617534,0.00155209523163078,0.813808142426749 54 | 200,0.1,5,0.8,0.4,0.01,0.1,53,153.95,1.78055000000004,0.798539385894309,0.00140907256117862,0.809890084736972 55 | 100,0.03,20,1,1,0.1,0,54,652.2,4.41529999999984,0.795483178710807,0.00196911604647464,0.800175939131186 56 | 100,0.1,10,1,0.6,0.3,0,55,238.55,1.7072000000001,0.7950804176617,0.00168436183297411,0.8058755706569 57 | 500,0.01,10,0.8,0.4,0,0,56,840.55,17.9680999999999,0.813158312219066,0.00144959546200667,0.815692810991432 58 | 2000,0.03,50,1,0.4,0.3,0,57,262.85,11.8437499999999,0.792171501119114,0.0014807717546798,0.802888507546511 59 | 200,0.03,10,0.8,1,0,0.1,58,451.1,4.76259999999966,0.802631443844852,0.001304751947319,0.807363456340498 60 | 
2000,0.01,20,0.6,1,0,0.3,59,630.45,40.1248,0.800349836754519,0.00151077614310938,0.806564994350096 61 | 1000,0.03,50,0.6,0.6,0.01,0,60,338.3,11.2043000000003,0.796795385538892,0.00153925887581545,0.802485786281843 62 | 500,0.01,20,0.6,0.8,0.01,0.01,61,794.6,16.5014000000001,0.799094934710253,0.00165175629479638,0.802457893507567 63 | 200,0.1,20,0.8,0.8,0.01,0,62,148.1,1.79014999999999,0.799121606710138,0.00119431886357907,0.806608989545784 64 | 200,0.1,10,0.6,1,0.01,0,63,139.7,1.62384999999995,0.78698249444501,0.00219284167257509,0.795226582447516 65 | 200,0.03,5,1,0.6,0.3,0,64,600.65,6.19680000000008,0.803827815697537,0.00202547875549495,0.813985654469108 66 | 500,0.01,10,0.8,1,0.01,0.1,65,848.4,18.5755500000001,0.811973579613939,0.00110472852818227,0.815172082500149 67 | 5000,0.01,10,1,1,0,0,66,330.95,62.5902500000002,0.781258282773491,0.00152634542749945,0.793157859421619 68 | 5000,0.01,20,0.8,0.8,0.01,0.1,67,516.6,48.7011500000001,0.807508162134909,0.00113215434965275,0.816263799707706 69 | 100,0.1,50,1,0.8,0.01,0.1,68,233.3,1.75204999999987,0.792885200699409,0.00138459612244783,0.801455366120416 70 | 5000,0.01,10,0.8,1,0,0.1,69,379.8,59.0607,0.807882239487438,0.00152449709445559,0.817123885389956 71 | 500,0.1,5,0.6,0.8,0,0,70,86.9,2.18439999999991,0.791079020648579,0.00138578078336123,0.803788710155265 72 | 2000,0.03,50,0.8,1,0,0.1,71,266.05,12.1331999999999,0.803086627772241,0.00139272199049814,0.811105751875537 73 | 500,0.03,10,0.6,1,0,0.01,72,341.3,7.19410000000007,0.800595143926844,0.00166397078628163,0.807298834940354 74 | 1000,0.03,50,1,0.8,0.3,0.01,73,275.2,10.6879499999995,0.797101217467447,0.00141775116978677,0.803967109671454 75 | 500,0.03,10,0.6,0.4,0,0,74,343.25,7.00439999999981,0.802960622197304,0.00113726444939383,0.807333600235683 76 | 5000,0.1,50,1,0.4,0,0,75,73.45,4.6484999999997,0.788759115342117,0.00197917568123626,0.802611565601364 77 | 500,0.1,5,0.6,1,0,0,76,84.5,2.14564999999948,0.792004181810046,0.00124243431485302,0.803031048627674 78 | 200,0.1,50,0.6,0.8,0.01,0,77,201.75,2.26224999999995,0.793122878406468,0.00147114790128617,0.800336332521249 79 | 5000,0.03,20,1,1,0,0.01,78,148.8,16.1312500000004,0.786657192594729,0.00131645578523834,0.799334852474144 80 | 2000,0.1,10,1,0.4,0,0,79,41.45,4.5394999999995,0.784159270527344,0.0013925483693018,0.802484477106754 81 | 2000,0.1,5,1,0.6,0.1,0,80,41.55,4.64255000000067,0.790220033747943,0.00165668253716844,0.807253709446921 82 | 2000,0.1,50,0.8,1,0.01,0,81,84.1,4.39639999999999,0.798666387389073,0.00207644826297946,0.812202702326724 83 | 2000,0.01,5,1,0.8,0.01,0.01,82,404.7,35.2871499999999,0.795766550257401,0.0016223392885126,0.80590769340613 84 | 200,0.03,5,1,0.6,0.3,0,83,540.65,5.42954999999984,0.804178452266139,0.0013850593287532,0.811688672663985 85 | 1000,0.01,20,0.6,0.8,0,0.1,84,653.8,23.0737500000008,0.801986091503471,0.00182077665129979,0.806081248542418 86 | 500,0.03,10,0.6,0.4,0.1,0,85,341.75,7.08204999999944,0.803932402127717,0.00153446150749478,0.808278052962057 87 | 500,0.1,5,0.6,0.6,0,0,86,90.7,2.10105000000021,0.793932893722768,0.00148410963488235,0.80319774766167 88 | 2000,0.1,50,0.8,0.4,0.3,0.3,87,93.75,4.22204999999958,0.800395879186408,0.00150236755460191,0.812751239770574 89 | 500,0.01,5,1,0.8,0,0.3,88,881.15,19.8014499999999,0.804169783322628,0.00138556406125226,0.811445478919193 90 | 5000,0.03,5,0.8,0.6,0,0.3,89,111.5,22.8261000000002,0.808032202876178,0.00158078927194371,0.818011229566293 91 | 
1000,0.1,20,0.6,0.8,0,0,90,70.85,3.12175000000007,0.796367869458404,0.00148724923883016,0.807133740617903 92 | 200,0.1,20,0.8,0.6,0.3,0,91,168.75,1.91055000000033,0.803499326417865,0.00112257850710717,0.811489808537795 93 | 500,0.1,5,1,1,0,0.01,92,87.3,2.27300000000014,0.800800607761154,0.00134596625985896,0.81343423770421 94 | 1000,0.01,20,0.6,1,0,0,93,655.6,23.5426500000009,0.803785481681152,0.00133626321074606,0.805946414743445 95 | 5000,0.01,10,0.8,0.4,0.01,0.01,94,361.8,58.2747000000003,0.808105627527694,0.00174725464299831,0.81816843513674 96 | 1000,0.1,50,0.8,0.8,0,0,95,85.45,3.71480000000029,0.800634189151031,0.00137517425180894,0.812499261158631 97 | 500,0.03,5,1,0.4,0,0,96,277.6,6.4525999999998,0.801035418327514,0.00121568911211669,0.811161190903276 98 | 500,0.1,50,1,0.4,0,0.1,97,109.7,2.70584999999992,0.794213793165755,0.00147049070001713,0.807756723027987 99 | 5000,0.03,50,0.6,0.8,0.01,0,98,348.25,13.9102500000003,0.794057650714021,0.0013864043217383,0.803203691230075 100 | 5000,0.01,50,0.6,1,0,0,99,747.3,29.8374,0.795038060425771,0.00108113291518004,0.796957128700436 101 | 2000,0.01,5,0.6,0.4,0.01,0,100,512.5,38.7797500000002,0.804657813221696,0.00113790412573063,0.811559588865575 102 | -------------------------------------------------------------------------------- /1-train_test-same_yr/run-tuning.R: -------------------------------------------------------------------------------- 1 | library(data.table) 2 | library(ROCR) 3 | library(lightgbm) 4 | library(dplyr) 5 | 6 | set.seed(123) 7 | 8 | 9 | # for yr in 1990 1991; do 10 | # wget http://stat-computing.org/dataexpo/2009/$yr.csv.bz2 11 | # bunzip2 $yr.csv.bz2 12 | # done 13 | 14 | 15 | ## TODO: loop 1991..2000 16 | yr <- 1991 17 | 18 | d0_train <- fread(paste0("/var/data/airline/",yr-1,".csv")) 19 | d0_train <- d0_train[!is.na(DepDelay)] 20 | 21 | d0_test <- fread(paste0("/var/data/airline/",yr,".csv")) 22 | d0_test <- d0_test[!is.na(DepDelay)] 23 | 24 | d0 <- rbind(d0_train, d0_test) 25 | 26 | 27 | for (k in c("Month","DayofMonth","DayOfWeek")) { 28 | d0[[k]] <- paste0("c-",as.character(d0[[k]])) # force them to be categoricals 29 | } 30 | 31 | d0[["dep_delayed_15min"]] <- ifelse(d0[["DepDelay"]]>=15,1,0) 32 | 33 | cols_keep <- c("Month", "DayofMonth", "DayOfWeek", "DepTime", "UniqueCarrier", 34 | "Origin", "Dest", "Distance", "dep_delayed_15min") 35 | d0 <- d0[, cols_keep, with = FALSE] 36 | 37 | 38 | d0_wrules <- lgb.prepare_rules(d0) # lightgbm special treat of categoricals 39 | d0 <- d0_wrules$data 40 | cols_cats <- names(d0_wrules$rules) 41 | 42 | 43 | n1 <- nrow(d0_train) 44 | n2 <- nrow(d0_test) 45 | 46 | p <- ncol(d0)-1 47 | d1_train <- as.matrix(d0[1:n1,]) 48 | d1_test <- as.matrix(d0[(n1+1):(n1+n2),]) 49 | 50 | 51 | 52 | ## TODO: loop 10K,100K,1M 53 | size <- 100e3 54 | 55 | 56 | 57 | n_random <- 100 ## TODO: 1000? 
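## Illustrative note (added sketch, not part of the original script): the random search draws
## n_random rows from the full expand.grid() space defined below. That grid has
## 6*3*4*3*4*7*7 = 42336 rows (the repeated 0s in lambda_l1/lambda_l2 simply up-weight the
## default value 0), so a draw of 100 covers only a small fraction of all combinations:
n_random / (6*3*4*3*4*7*7)   # ~0.0024, i.e. roughly 0.2% of the grid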
58 | 59 | params_grid <- expand.grid( # default 60 | num_leaves = c(100,200,500,1000,2000,5000), # 31 (was 127) 61 | learning_rate = c(0.01,0.03,0.1), # 0.1 62 | min_data_in_leaf = c(5,10,20,50), # 20 (was 100) 63 | feature_fraction = c(0.6,0.8,1), # 1 64 | bagging_fraction = c(0.4,0.6,0.8,1), # 1 65 | lambda_l1 = c(0,0,0,0, 0.01, 0.1, 0.3), # 0 66 | lambda_l2 = c(0,0,0,0, 0.01, 0.1, 0.3) # 0 67 | ## TODO: 68 | ## min_sum_hessian_in_leaf 69 | ## min_gain_to_split 70 | ## max_bin 71 | ## min_data_in_bin 72 | ) 73 | params_random <- params_grid[sample(1:nrow(params_grid),n_random),] 74 | 75 | 76 | 77 | 78 | ## TODO: resample 79 | d <- d1_train[sample(1:nrow(d1_train), size),] 80 | 81 | 82 | 83 | ## TODO: resample 84 | size_test <- 100e3 85 | d_test <- d1_test[sample(1:nrow(d1_test), size_test),] 86 | 87 | 88 | 89 | ## first run: get train/test with same distribution to check overfitting of best models 90 | idx <- sample(1:nrow(d1_train), size) 91 | d <- d1_train[idx,] 92 | d_test <- d1_train[sample(setdiff(1:nrow(d1_train),idx), size_test),] 93 | #user system elapsed 94 | #240025.539 4094.378 31870.404 95 | ## Results: AUC_test>AUC_rs seems it does not overfit, either underfit or sample variation 96 | ## TODO: resample train/test to see if underfit or sample variation 97 | 98 | 99 | 100 | # warm up 101 | dlgb_train <- lgb.Dataset(data = d[,1:p], label = d[,p+1]) 102 | system.time({ 103 | md <- lgb.train(data = dlgb_train, objective = "binary", 104 | num_threads = parallel::detectCores()/2, 105 | nrounds = 100, 106 | categorical_feature = cols_cats, 107 | verbose = 0) 108 | }) 109 | system.time({ 110 | md <- lgb.train(data = dlgb_train, objective = "binary", 111 | num_threads = parallel::detectCores()/2, 112 | nrounds = 100, 113 | categorical_feature = cols_cats, 114 | verbose = 0) 115 | }) 116 | 117 | 118 | 119 | system.time({ 120 | 121 | d_res <- data.frame() 122 | for (krpm in 1:nrow(params_random)) { 123 | params <- as.list(params_random[krpm,]) 124 | 125 | ## resample 126 | n_resample <- 20 ## TODO: 10? 127 | 128 | p_train <- 0.8 ## TODO: change? 
(80-10-10 split now) 129 | p_earlystop <- 0.1 130 | p_modelselec <- 1 - p_train - p_earlystop 131 | 132 | mds <- list() 133 | d_res_rs <- data.frame() 134 | for (k in 1:n_resample) { 135 | cat(" krpm:",krpm,"k:",k,"\n") 136 | 137 | idx_train <- sample(1:size, size*p_train) 138 | idx_earlystop <- sample(setdiff(1:size,idx_train), size*p_earlystop) 139 | idx_modelselec <- setdiff(setdiff(1:size,idx_train),idx_earlystop) 140 | 141 | d_train <- d[idx_train,] 142 | d_earlystop <- d[idx_earlystop,] 143 | d_modelselec <- d[idx_modelselec,] 144 | 145 | dlgb_train <- lgb.Dataset(data = d_train[,1:p], label = d_train[,p+1]) 146 | dlgb_earlystop <- lgb.Dataset(data = d_earlystop[,1:p], label = d_earlystop[,p+1]) 147 | 148 | 149 | runtm <- system.time({ 150 | md <- lgb.train(data = dlgb_train, objective = "binary", 151 | num_threads = parallel::detectCores()/2, # = number of "real" cores 152 | params = params, 153 | nrounds = 10000, early_stopping_rounds = 10, valid = list(valid = dlgb_earlystop), 154 | categorical_feature = cols_cats, 155 | verbose = 0) 156 | })[[3]] 157 | 158 | phat <- predict(md, data = d_modelselec[,1:p]) 159 | rocr_pred <- prediction(phat, d_modelselec[,p+1]) 160 | auc_rs <- performance(rocr_pred, "auc")@y.values[[1]] 161 | 162 | d_res_rs <- rbind(d_res_rs, data.frame(ntrees = md$best_iter, runtm = runtm, auc_rs = auc_rs)) 163 | mds[[k]] <- md 164 | } 165 | d_res_rs_avg <- d_res_rs %>% summarize(ntrees = mean(ntrees), runtm = mean(runtm), auc_rs_avg = mean(auc_rs), 166 | auc_rs_std = sd(auc_rs)/sqrt(n_resample)) # std of the mean! 167 | 168 | # consider the model as the average of the models from resamples 169 | # TODO?: alternatively could retrain the "final" model on all of data (early stoping or avg number of trees?) 170 | phat <- matrix(0, nrow = n_resample, ncol = nrow(d_test)) 171 | for (k in 1:n_resample) { 172 | phat[k,] <- predict(mds[[k]], data = d_test[,1:p]) 173 | } 174 | phat_avg <- apply(phat, 2, mean) 175 | rocr_pred <- prediction(phat_avg, d_test[,p+1]) 176 | auc_test <- performance(rocr_pred, "auc")@y.values[[1]] 177 | 178 | d_res <- rbind(d_res, cbind(krpm, d_res_rs_avg, auc_test)) 179 | } 180 | 181 | }) 182 | 183 | d_pm_res <- cbind(params_random, d_res) 184 | 185 | d_pm_res 186 | 187 | 188 | ## TODO: ensemble/avg the top 2,3,...10 models (would need to save all the models) 189 | 190 | ## TODO: easier: see the goodness as a function of how many random iterations 10,100,1000 (can be done in 191 | ## the other analysis file) 192 | 193 | 194 | fwrite(d_pm_res, file = "res.csv") 195 | 196 | -------------------------------------------------------------------------------- /2-train_test_1each/analyze-1000.Rmd: -------------------------------------------------------------------------------- 1 | ```{r} 2 | 3 | library(data.table) 4 | library(dplyr) 5 | library(ggplot2) 6 | 7 | d_pm_res <- fread("res-100K-1000.csv") 8 | 9 | 10 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = auc_rs_avg), size = 0.2) + 11 | geom_errorbar(aes(x = rank, ymin = auc_rs_avg-auc_rs_std, ymax = auc_rs_avg+auc_rs_std), width = 1, alpha = 0.2) 12 | 13 | d_pm_res %>% ggplot() + geom_point(aes(x = auc_test, y = auc_rs_avg), alpha = 0.3, size = 0.5) + 14 | geom_errorbar(aes(x = auc_test, ymin = auc_rs_avg-auc_rs_std, ymax = auc_rs_avg+auc_rs_std), width = 0.0003, alpha = 0.3) + 15 | geom_abline(slope = 1, color = "grey70") 16 | 17 | auc_test_top <- sort(d_pm_res$auc_test, decreasing = TRUE)[10] 18 | d_pm_res %>% mutate(rank = 
dense_rank(desc(auc_rs_avg))) %>% mutate(is_top = auc_test>=auc_test_top) %>% 19 | ggplot(aes(color = is_top)) + geom_point(aes(x = rank, y = auc_rs_avg)) + 20 | geom_errorbar(aes(x = rank, ymin = auc_rs_avg-auc_rs_std, ymax = auc_rs_avg+auc_rs_std), width = 0.3) 21 | 22 | 23 | cor(d_pm_res$auc_rs_avg, d_pm_res$auc_test) 24 | cor(d_pm_res$auc_rs_avg, d_pm_res$auc_test, method = "spearman") 25 | 26 | 27 | d_pm_res %>% arrange(desc(auc_rs_avg)) %>% head(10) 28 | d_pm_res %>% arrange(desc(auc_test)) %>% head(10) 29 | 30 | 31 | d_pm_res %>% arrange(desc(auc_rs_avg)) %>% tail(10) 32 | d_pm_res %>% arrange(desc(auc_test)) %>% tail(10) 33 | 34 | 35 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = num_leaves), alpha = 0.3) + scale_y_log10() 36 | 37 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = learning_rate), alpha = 0.3) + scale_y_log10() 38 | 39 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = min_data_in_leaf), alpha = 0.3) + scale_y_log10() 40 | 41 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = feature_fraction), alpha = 0.3) 42 | 43 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = bagging_fraction), alpha = 0.3) 44 | 45 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = lambda_l1), alpha = 0.3) 46 | 47 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = lambda_l2), alpha = 0.3) 48 | 49 | 50 | 51 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = ntrees), alpha = 0.3) 52 | 53 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = ntrees), alpha = 0.3) + scale_y_log10(breaks=c(30,100,300,1000,3000)) 54 | 55 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = runtm), alpha = 0.3) 56 | 57 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = runtm), alpha = 0.3) + scale_y_log10(breaks=c(1,3,10,30,100)) 58 | 59 | 60 | 61 | summary(d_pm_res$runtm) 62 | 63 | mean(d_pm_res$runtm)*20*100/3600   # = avg per-fit runtime (sec) x 20 resamples x 100 configs, in hours 64 | 65 | 66 | summary(d_pm_res$ntrees) 67 | 68 | 69 | ``` -------------------------------------------------------------------------------- /2-train_test_1each/analyze.Rmd: -------------------------------------------------------------------------------- 1 | ```{r} 2 | 3 | library(data.table) 4 | library(dplyr) 5 | library(ggplot2) 6 | 7 | d_pm_res <- fread("res-10K-100.csv") 8 | 9 | 10 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = auc_rs_avg)) + 11 | geom_errorbar(aes(x = rank, ymin = auc_rs_avg-auc_rs_std, ymax = auc_rs_avg+auc_rs_std), width = 1) 12 | 13 | d_pm_res %>% ggplot() + geom_point(aes(x = auc_test, y = auc_rs_avg)) + 14 | geom_errorbar(aes(x = auc_test, ymin = auc_rs_avg-auc_rs_std, ymax = auc_rs_avg+auc_rs_std), width = 0.0003) + 15 | geom_abline(slope = 1, color = "grey70") 16 | 17 | auc_test_top <- sort(d_pm_res$auc_test, decreasing = TRUE)[10] 18 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% mutate(is_top = auc_test>=auc_test_top) %>% 19 | ggplot(aes(color = is_top)) + geom_point(aes(x = rank, y = auc_rs_avg)) + 20 | geom_errorbar(aes(x = rank, ymin = 
auc_rs_avg-auc_rs_std, ymax = auc_rs_avg+auc_rs_std), width = 0.3) 21 | 22 | 23 | cor(d_pm_res$auc_rs_avg, d_pm_res$auc_test) 24 | cor(d_pm_res$auc_rs_avg, d_pm_res$auc_test, method = "spearman") 25 | 26 | 27 | d_pm_res %>% arrange(desc(auc_rs_avg)) %>% head(10) 28 | d_pm_res %>% arrange(desc(auc_test)) %>% head(10) 29 | 30 | 31 | d_pm_res %>% arrange(desc(auc_rs_avg)) %>% tail(10) 32 | d_pm_res %>% arrange(desc(auc_test)) %>% tail(10) 33 | 34 | 35 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = num_leaves)) + scale_y_log10() 36 | 37 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = learning_rate)) + scale_y_log10() 38 | 39 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = min_data_in_leaf)) + scale_y_log10() 40 | 41 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = feature_fraction)) 42 | 43 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = bagging_fraction)) 44 | 45 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = lambda_l1)) 46 | 47 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = lambda_l2)) 48 | 49 | 50 | 51 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = ntrees)) 52 | 53 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = ntrees)) + scale_y_log10(breaks=c(30,100,300,1000,3000)) 54 | 55 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = runtm)) 56 | 57 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = runtm)) + scale_y_log10(breaks=c(1,3,10,30,100)) 58 | 59 | 60 | 61 | summary(d_pm_res$runtm) 62 | 63 | mean(d_pm_res$runtm)*20*100/3600 64 | 65 | 66 | summary(d_pm_res$ntree) 67 | 68 | 69 | ``` -------------------------------------------------------------------------------- /2-train_test_1each/fig-100K-1000-AUCcorr.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/szilard/GBM-tune/7047ae1d8a133aefcc338665ffbee3864cc529ff/2-train_test_1each/fig-100K-1000-AUCcorr.png -------------------------------------------------------------------------------- /2-train_test_1each/res-100K-100.csv: -------------------------------------------------------------------------------- 1 | num_leaves,learning_rate,min_data_in_leaf,feature_fraction,bagging_fraction,lambda_l1,lambda_l2,krpm,ntrees,runtm,auc_rs_avg,auc_rs_std,auc_test 2 | 100,0.03,5,0.8,0.4,0,0,1,677.9,4.05035,0.791578504715822,0.00166053744814572,0.73174971895443 3 | 100,0.01,20,0.8,0.8,0,0.1,2,1751.85,11.0103,0.790808047504172,0.00166937557151477,0.730549558342319 4 | 1000,0.1,10,0.6,0.4,0.3,0,3,72.55,3.05865,0.795600334078155,0.00156384158309613,0.729710363019238 5 | 100,0.1,5,0.6,0.6,0,0.3,4,238,1.52414999999999,0.780516928224509,0.00155494776808068,0.72306487285285 6 | 200,0.1,50,0.6,0.4,0.01,0.3,5,189.7,2.02194999999999,0.788165684479056,0.00125503075395822,0.722299737935232 7 | 500,0.01,50,1,0.4,0,0,6,999.05,21.7314,0.798159643962157,0.00171421640508396,0.731330244093394 8 | 2000,0.1,10,0.8,1,0.01,0,7,45.05,4.55685000000001,0.800727257275389,0.00130993391715852,0.74351721076888 9 | 
5000,0.03,20,1,0.8,0,0.3,8,162.2,15.79325,0.785189919351823,0.00143561197302206,0.728332668924037 10 | 200,0.1,5,0.6,0.4,0.3,0,9,161.75,1.78695,0.788932149848789,0.00173304562255912,0.726885720211955 11 | 200,0.1,10,0.8,0.6,0,0,10,130.7,1.47574999999997,0.795275976192908,0.000953584242910259,0.736155613302774 12 | 2000,0.1,10,0.8,1,0.01,0.3,11,46.1,4.60360000000003,0.802555925019003,0.00123136147289023,0.742563072215322 13 | 5000,0.1,10,1,0.4,0,0,12,31.3,7.71004999999992,0.767911107287579,0.00115416627864389,0.722191310928811 14 | 1000,0.01,10,1,0.4,0.1,0.01,13,556.2,23.23635,0.797626536956635,0.00121330241617777,0.737334708753687 15 | 200,0.03,20,0.6,0.4,0,0.01,14,435.25,4.41134999999997,0.791573223179858,0.00125572237975219,0.726153477314418 16 | 5000,0.1,10,0.6,0.4,0.1,0,15,58.4,7.74555,0.790166619998654,0.00135012797212164,0.730523025634413 17 | 5000,0.03,50,0.6,0.4,0,0.3,16,353.3,13.4215,0.793891380046658,0.00138831147774158,0.725665817955326 18 | 2000,0.03,20,0.6,0.4,0.1,0,17,231.95,15.00245,0.796869782649063,0.00160399198458777,0.733148832967826 19 | 1000,0.1,20,0.6,0.4,0,0,18,71.1,3.06130000000003,0.792354157055418,0.00137072876196483,0.725816731015731 20 | 2000,0.1,20,0.6,0.4,0,0,19,66.85,5.14865000000009,0.791940676527726,0.00128549146884453,0.725611399784615 21 | 5000,0.1,50,1,0.8,0.01,0.3,20,77.15,4.25585000000003,0.789229266838517,0.00170043854822695,0.729915313178875 22 | 1000,0.01,50,0.6,0.8,0,0.3,21,755.75,22.2363500000001,0.789818525964729,0.00120073625985256,0.726077827324743 23 | 5000,0.03,5,1,1,0.1,0.01,22,99.05,21.3121500000001,0.769172436734264,0.00137789208859516,0.72234840251191 24 | 100,0.1,10,0.8,0.6,0,0.01,23,208,1.46940000000004,0.788882030762249,0.00163007866541741,0.73209349558275 25 | 2000,0.01,10,1,0.8,0.3,0.3,24,465.1,37.6858,0.796039931073271,0.00151174812708258,0.733973155784208 26 | 100,0.03,10,0.8,0.4,0.01,0.01,25,661.1,4.17405000000003,0.791523902291453,0.00153912777468259,0.731090314304785 27 | 500,0.03,10,1,0.8,0.3,0.01,26,281.75,6.57590000000009,0.799157435074281,0.00112947387432213,0.737715277293083 28 | 1000,0.1,20,0.8,0.8,0.1,0,27,76.05,3.52655,0.805203915722793,0.00130652028552496,0.738965640899561 29 | 1000,0.03,5,0.8,0.4,0,0.01,28,212.7,8.89724999999999,0.814551740848826,0.00112838104640368,0.745900055335843 30 | 5000,0.03,50,0.8,0.4,0,0,29,257.35,12.91655,0.802021892007834,0.00147980536362156,0.733878350863451 31 | 200,0.1,10,1,0.4,0,0,30,136.95,1.6733999999999,0.790112738748778,0.00142500030118507,0.735242700636094 32 | 200,0.03,50,0.8,0.4,0.1,0.3,31,519.65,5.46785000000004,0.801120037459329,0.00112750867052792,0.731912635277867 33 | 5000,0.03,5,1,0.4,0,0.3,32,105,22.2052,0.774410585327097,0.00149326840592376,0.723005737095749 34 | 5000,0.01,50,0.6,1,0.1,0.01,33,818.05,29.2794000000001,0.791186080860844,0.00149265489459356,0.725619623520595 35 | 500,0.03,10,1,1,0,0.1,34,273.35,6.34194999999991,0.796110871241863,0.00208782710647878,0.736184082555255 36 | 1000,0.1,10,1,0.4,0,0,35,58.5,2.97955000000011,0.793801600774447,0.00129942829517422,0.739548282558039 37 | 1000,0.1,20,0.8,0.6,0,0,36,72.45,3.39364999999989,0.801902943912107,0.00143736395339191,0.736973177217827 38 | 100,0.03,20,0.8,0.4,0,0.1,37,653.4,4.05880000000043,0.792519548079425,0.00163210444280686,0.73061843188783 39 | 1000,0.03,5,0.8,0.8,0,0,38,212,8.86319999999987,0.813078357278425,0.0019822540078922,0.745362619860386 40 | 100,0.1,50,0.6,0.8,0,0,39,245.45,1.77209999999977,0.785554258239288,0.00139611544230318,0.720645782642126 41 | 
5000,0.01,5,0.8,0.6,0.01,0,40,320.15,62.4935500000001,0.808138704144911,0.00144813312012613,0.742551642996043 42 | 1000,0.03,50,1,1,0.3,0,41,275.1,10.5845000000001,0.794507231824078,0.00163267599771256,0.727649516019378 43 | 200,0.01,20,0.6,0.6,0.3,0,42,1153.25,11.6420499999999,0.79078603593473,0.00137387015362163,0.727136073425199 44 | 500,0.01,5,0.6,0.6,0.3,0,43,844.95,17.0589000000002,0.803774113943707,0.00130495713765246,0.735493675648833 45 | 5000,0.03,20,0.6,0.4,0.01,0,44,206.05,16.9838,0.796233451110566,0.00148566490430155,0.731105540593921 46 | 1000,0.01,20,1,0.6,0,0,45,561.1,22.8034999999997,0.796581470844654,0.00103576417247685,0.732193345557879 47 | 500,0.01,20,0.6,1,0.3,0,46,868.45,17.3620499999998,0.799392704814777,0.00114572697094952,0.732185582811043 48 | 1000,0.03,50,0.8,0.6,0.01,0,47,269.75,10.2305000000002,0.802540626367858,0.00132088650461302,0.73440113019512 49 | 200,0.1,20,0.6,1,0,0,48,142.05,1.7233000000002,0.785229221516566,0.00145473271163123,0.723902790465834 50 | 1000,0.1,5,0.6,0.4,0.3,0,49,74.55,3.29085000000023,0.799113743259215,0.00158918412247988,0.731570861954848 51 | 2000,0.01,50,1,1,0.3,0.1,50,791.4,33.0179999999999,0.79503221103346,0.00117688886969621,0.727969807049265 52 | 100,0.1,50,1,0.4,0,0,51,234.85,1.78145000000004,0.786462222736903,0.00167714404002373,0.730523738072269 53 | 500,0.1,20,0.8,0.8,0,0,52,87.8,2.27194999999983,0.803965234448568,0.00162284469212085,0.738014026578789 54 | 200,0.1,5,0.8,0.4,0.01,0.1,53,159.75,1.84465,0.797916892512777,0.00181097179810758,0.738298122644503 55 | 100,0.03,20,1,1,0.1,0,54,758.45,4.84820000000018,0.792309275130644,0.00192676951172148,0.733417287561655 56 | 100,0.1,10,1,0.6,0.3,0,55,271,1.94060000000009,0.795279632661003,0.00143979016403205,0.738652215490416 57 | 500,0.01,10,0.8,0.4,0,0,56,832,17.5548999999997,0.807908038000343,0.00155887332060103,0.742559157218016 58 | 2000,0.03,50,1,0.4,0.3,0,57,270.7,12.0292,0.793635672797191,0.00159963321887733,0.728231288328144 59 | 200,0.03,10,0.8,1,0,0.1,58,468.55,4.86695,0.798936490335741,0.00121048836451345,0.736701216961648 60 | 2000,0.01,20,0.6,1,0,0.3,59,623.1,37.9847999999997,0.798620998386887,0.00151690641375795,0.731527402268123 61 | 1000,0.03,50,0.6,0.6,0.01,0,60,350.2,11.3863499999999,0.794433205047433,0.00156341809056467,0.726130204667882 62 | 500,0.01,20,0.6,0.8,0.01,0.01,61,814.9,16.5912999999995,0.798634160876369,0.00185713651405445,0.730843514509512 63 | 200,0.1,20,0.8,0.8,0.01,0,62,148.95,1.76560000000045,0.797396213431429,0.00133855182372315,0.735159003314231 64 | 200,0.1,10,0.6,1,0.01,0,63,151.9,1.70714999999982,0.783568156369059,0.00178208309195338,0.725490961718356 65 | 200,0.03,5,1,0.6,0.3,0,64,557.65,5.79285000000036,0.800015812793708,0.0015673848849122,0.738856577474697 66 | 500,0.01,10,0.8,1,0.01,0.1,65,868.15,18.5210000000003,0.810150051704961,0.00122175100087932,0.742058519013962 67 | 5000,0.01,10,1,1,0,0,66,335.2,61.8605999999996,0.773130584612202,0.00188104487604846,0.723243149961184 68 | 5000,0.01,20,0.8,0.8,0.01,0.1,67,509.4,47.7409499999996,0.802534912936386,0.00160100061821232,0.736979893236384 69 | 100,0.1,50,1,0.8,0.01,0.1,68,234.45,1.82894999999971,0.787527656518229,0.00178016677292967,0.731156943329792 70 | 5000,0.01,10,0.8,1,0,0.1,69,366.8,55.6688999999999,0.806140901011197,0.0017116945525855,0.738869094355656 71 | 500,0.1,5,0.6,0.8,0,0,70,93.2,2.29609999999975,0.787604006507014,0.00151540115814862,0.728073227498161 72 | 2000,0.03,50,0.8,1,0,0.1,71,269.45,12.1540999999996,0.801583531050854,0.0013594232792254,0.734263427936007 73 | 
500,0.03,10,0.6,1,0,0.01,72,347.25,7.19714999999997,0.803019516229912,0.00110351898879225,0.732737291409847 74 | 1000,0.03,50,1,0.8,0.3,0.01,73,270.65,10.5120500000008,0.791898525154095,0.0012578170181605,0.728447148241451 75 | 500,0.03,10,0.6,0.4,0,0,74,337.6,6.9090000000002,0.8012230058587,0.00151630261348123,0.733556076507841 76 | 5000,0.1,50,1,0.4,0,0,75,74.65,4.63869999999988,0.791293477075846,0.00169266636990327,0.729046264660964 77 | 500,0.1,5,0.6,1,0,0,76,83.6,2.18505000000041,0.789334062073551,0.001197515140357,0.727901880821998 78 | 200,0.1,50,0.6,0.8,0.01,0,77,179.8,2.12645000000048,0.789837179818467,0.00170375219441587,0.722162736418652 79 | 5000,0.03,20,1,1,0,0.01,78,144.95,15.9059499999999,0.784109062517104,0.00157665250349067,0.726905336132206 80 | 2000,0.1,10,1,0.4,0,0,79,41.1,4.53879999999972,0.78333693557307,0.00114869405962524,0.733173379426173 81 | 2000,0.1,5,1,0.6,0.1,0,80,41.25,4.63054999999968,0.788571079607799,0.00179418775593646,0.737522827446029 82 | 2000,0.1,50,0.8,1,0.01,0,81,83.15,4.44734999999982,0.801612655609249,0.00141760714802689,0.733143676320833 83 | 2000,0.01,5,1,0.8,0.01,0.01,82,407.75,35.1780999999999,0.791169659816624,0.00163939111510892,0.736970884941462 84 | 200,0.03,5,1,0.6,0.3,0,83,577.05,6.01165000000019,0.802101439263675,0.00159179691720595,0.738858455542869 85 | 1000,0.01,20,0.6,0.8,0,0.1,84,644.2,23.2806000000001,0.799140710837452,0.00162202513377764,0.732143862798302 86 | 500,0.03,10,0.6,0.4,0.1,0,85,346.1,7.09640000000054,0.801488451980671,0.00144491967476402,0.733630464381145 87 | 500,0.1,5,0.6,0.6,0,0,86,87.9,2.15499999999993,0.790807203717708,0.00149594188474146,0.72826603502606 88 | 2000,0.1,50,0.8,0.4,0.3,0.3,87,92.15,4.22265000000025,0.796506024305534,0.00130370155496838,0.733305792491668 89 | 500,0.01,5,1,0.8,0,0.3,88,741.55,17.1615,0.800461599600089,0.00145910180032937,0.736743084140092 90 | 5000,0.03,5,0.8,0.6,0,0.3,89,110.6,22.3653499999999,0.803423837624033,0.00159802838239765,0.738001560377466 91 | 1000,0.1,20,0.6,0.8,0,0,90,68.65,3.12675000000036,0.792181260170418,0.00127980241655864,0.725823962601559 92 | 200,0.1,20,0.8,0.6,0.3,0,91,159.35,1.91554999999971,0.798106539410406,0.00124061832724658,0.738023506583583 93 | 500,0.1,5,1,1,0,0.01,92,84.75,2.26230000000014,0.7950160896334,0.00179607975107052,0.740670339119317 94 | 1000,0.01,20,0.6,1,0,0,93,663.2,23.9772499999999,0.799415098221612,0.00126809346919548,0.732093729001106 95 | 5000,0.01,10,0.8,0.4,0.01,0.01,94,355.55,58.0876000000004,0.80398825637922,0.00112874118602854,0.738255841257873 96 | 1000,0.1,50,0.8,0.8,0,0,95,84.85,3.78615000000063,0.798900707232075,0.00115044840079164,0.732931331815914 97 | 500,0.03,5,1,0.4,0,0,96,267,6.17324999999946,0.798058298291698,0.00112306194361304,0.739808907157095 98 | 500,0.1,50,1,0.4,0,0.1,97,100.9,2.60930000000026,0.792112268270605,0.00123673391944962,0.730918939365015 99 | 5000,0.03,50,0.6,0.8,0.01,0,98,338,13.7522499999999,0.793373810924904,0.00147256597172739,0.726374790155387 100 | 5000,0.01,50,0.6,1,0,0,99,697.05,28.0506499999998,0.789519331698273,0.00132462475509124,0.724765077158717 101 | 2000,0.01,5,0.6,0.4,0.01,0,100,508.8,38.3366500000004,0.802667142715459,0.00153194746612377,0.735103363960625 102 | -------------------------------------------------------------------------------- /2-train_test_1each/res-10K-100.csv: -------------------------------------------------------------------------------- 1 | 
num_leaves,learning_rate,min_data_in_leaf,feature_fraction,bagging_fraction,lambda_l1,lambda_l2,krpm,ntrees,runtm,auc_rs_avg,auc_rs_std,auc_test 2 | 200,0.03,50,0.6,0.8,0.1,0,1,143.15,0.46405,0.679441388905402,0.00429046995832862,0.663341145222633 3 | 5000,0.03,50,1,0.6,0,0.01,2,121.35,0.800649999999998,0.680981030621418,0.00455513072500328,0.663980871336177 4 | 5000,0.1,5,0.8,1,0,0.01,3,23.75,0.964250000000001,0.668441380505428,0.00419562787594795,0.656375037109314 5 | 200,0.01,20,0.6,0.8,0,0.01,4,348.4,2.00885,0.687926250473345,0.00620670118202601,0.658039356472891 6 | 100,0.1,5,1,0.4,0,0.3,5,39.95,0.281649999999999,0.684431222839064,0.00541417449525558,0.667480139644667 7 | 500,0.1,10,0.8,0.6,0,0.01,6,26.8,0.551500000000002,0.674094227008649,0.0060734403820326,0.65759866552147 8 | 5000,0.01,20,1,0.6,0,0,7,295.25,3.63134999999999,0.666681982245132,0.00353313717402141,0.659253376148977 9 | 1000,0.1,20,0.8,0.6,0.01,0,8,31.85,0.483149999999998,0.669013185777821,0.00576243552198307,0.659251835809757 10 | 5000,0.01,20,0.8,0.8,0.01,0.01,9,318.5,3.4135,0.683430948726647,0.00557157219230214,0.658538492071028 11 | 2000,0.01,10,1,0.4,0.01,0,10,256.3,4.16670000000001,0.670199614838598,0.00598339944229106,0.654853100584595 12 | 2000,0.1,20,1,1,0.1,0.01,11,29.05,0.503299999999996,0.667464853877453,0.00413772426803288,0.657527153484008 13 | 500,0.03,10,1,0.8,0.1,0,12,88.9,1.26675000000001,0.669382839690034,0.00573680459652995,0.655905724380567 14 | 500,0.1,5,0.8,1,0.3,0,13,27.75,0.617950000000002,0.661250698439774,0.00503758593119008,0.659355876850447 15 | 2000,0.01,50,1,0.4,0,0.3,14,360.2,1.76900000000001,0.666831937567592,0.00732342865893615,0.663242460562134 16 | 5000,0.01,50,0.6,0.6,0,0,15,410.2,1.97899999999999,0.681308730507993,0.00415547094197651,0.663730589255059 17 | 5000,0.03,5,0.6,0.4,0.3,0.1,16,111.75,1.8685,0.69092433906521,0.00436520210773656,0.654432480125664 18 | 2000,0.1,5,0.6,0.4,0,0,17,27.2,1.2609,0.677625541884231,0.00503768937709164,0.65828021905125 19 | 5000,0.01,50,0.6,0.4,0.3,0,18,388.6,1.64455,0.681519273100336,0.00592914264379605,0.665222300422056 20 | 5000,0.1,20,0.8,0.4,0,0,19,28.75,0.630000000000024,0.675365254900933,0.00514508251993156,0.659712865906138 21 | 5000,0.1,10,0.8,0.6,0.01,0,20,26.9,0.801100000000008,0.673606401538423,0.00449631751521683,0.657981331472995 22 | 5000,0.01,5,0.6,0.8,0,0,21,268.15,9.40009999999999,0.683057421164464,0.00416502024441304,0.652109417195097 23 | 2000,0.03,50,0.6,1,0,0,22,143.85,0.76479999999998,0.685946538417652,0.00474722126603591,0.662226962274882 24 | 1000,0.1,10,0.6,1,0,0,23,28.35,0.661950000000013,0.677309493706781,0.00515128862096278,0.658045751184769 25 | 100,0.01,20,1,1,0,0,24,337.9,1.3579,0.685999517567415,0.00586783446468377,0.664914727723061 26 | 5000,0.01,20,1,0.8,0,0,25,298.4,3.63444999999998,0.665952630363413,0.00474062132096139,0.659123875879868 27 | 200,0.03,10,1,0.8,0.01,0.1,26,105.8,0.954550000000017,0.680498044811926,0.00636336189448961,0.661894575995569 28 | 100,0.1,50,1,0.8,0.01,0,27,34.95,0.318050000000005,0.674870408682819,0.00434467918572242,0.662557552799084 29 | 2000,0.1,10,0.6,1,0,0.3,28,35.4,0.670249999999987,0.677814438356103,0.00539305234412885,0.657415947561929 30 | 100,0.01,20,1,0.8,0.1,0.1,29,342.25,1.44759999999998,0.675317720560586,0.00625036673298941,0.663907213511093 31 | 2000,0.03,50,1,0.4,0,0,30,116.75,0.752199999999994,0.6782936732842,0.00493064855384014,0.662202667860448 32 | 5000,0.1,50,0.6,0.6,0,0,31,39.25,0.418500000000017,0.680046449261683,0.00503382714226434,0.661027092933261 33 | 
2000,0.01,50,1,1,0.1,0,32,340.8,1.88829999999999,0.665294466829695,0.00549082131127244,0.663622960626155 34 | 100,0.01,5,1,1,0,0,33,337.1,1.40815,0.681329599413325,0.00573487160374683,0.669142513729455 35 | 2000,0.01,5,0.8,1,0,0,34,246.85,9.7803,0.681986048807137,0.00673857074179763,0.65639549881465 36 | 2000,0.03,10,0.8,1,0,0,35,92.65,2.07135,0.687278313289529,0.00396663595119139,0.658307670621763 37 | 2000,0.01,10,1,0.4,0,0.1,36,258.65,4.06885000000001,0.657700380779621,0.00450972822088197,0.656590974822507 38 | 2000,0.1,5,0.8,1,0,0,37,23.1,1.36765000000001,0.669452033113243,0.00426618070120534,0.657276687562305 39 | 1000,0.01,5,1,0.8,0.1,0,38,236.15,4.71795000000001,0.666433907238878,0.0055707870751365,0.65355511673351 40 | 200,0.03,50,0.6,0.8,0.3,0.3,39,156.15,0.517850000000044,0.679472093285522,0.00620187613234061,0.663630509366867 41 | 500,0.01,10,0.6,0.8,0.01,0.1,40,316.2,3.37770000000005,0.676884204488134,0.00541348752138069,0.655975959828811 42 | 1000,0.01,5,0.8,0.4,0.3,0,41,274.75,4.19630000000004,0.686074188742778,0.00562139667119835,0.658738696949638 43 | 1000,0.1,20,0.8,0.8,0,0.01,42,30.35,0.456299999999987,0.677340071336784,0.00663301242267078,0.659541960809034 44 | 100,0.1,5,0.6,0.6,0,0,43,41.85,0.279049999999893,0.689679043602083,0.00484479496427635,0.662642544030271 45 | 100,0.01,10,1,0.6,0,0.01,44,359.7,1.4568,0.687528381632039,0.00525024056956061,0.666887607132293 46 | 500,0.1,20,0.8,0.4,0,0,45,30.75,0.432550000000015,0.662148443710279,0.00696297789287842,0.658696433218319 47 | 200,0.03,50,0.6,0.8,0,0,46,144.25,0.572949999999946,0.68354458470505,0.00565383629179905,0.662367264528243 48 | 500,0.1,20,0.8,0.4,0.1,0.01,47,31.75,0.411549999999943,0.670871567178253,0.00506417488021279,0.65931067166931 49 | 200,0.03,50,1,0.8,0,0,48,115.05,0.575950000000012,0.684797682765292,0.00450750269038693,0.662067081731621 50 | 200,0.01,10,1,1,0.01,0,49,301.15,2.13425,0.673831720665513,0.00571611151764473,0.661421588905769 51 | 2000,0.01,20,1,0.6,0,0.1,50,291,2.87359999999994,0.676850693503827,0.00407950526006601,0.658581546071844 52 | 5000,0.01,10,0.8,0.8,0,0,51,273.4,5.89849999999994,0.688629030893702,0.00367336612675096,0.658596293520639 53 | 1000,0.03,50,0.8,0.4,0,0,52,117.6,0.687299999999982,0.676502828249564,0.00449710256312323,0.664911522523266 54 | 5000,0.1,5,0.8,0.4,0,0.1,53,24.25,1.03744999999999,0.677574836142103,0.00486639884698516,0.65710101241407 55 | 1000,0.01,10,1,0.8,0,0,54,253.75,4.6377,0.665936871287218,0.00464871325111141,0.656178456587063 56 | 200,0.03,50,1,0.6,0,0,55,111.5,0.598949999999991,0.66994835475169,0.00603502980038448,0.6627480347154 57 | 100,0.1,50,0.8,0.8,0,0,56,37.3,0.293849999999998,0.665402739020812,0.00473066181284866,0.665636977191411 58 | 2000,0.01,5,1,0.4,0,0,57,225.55,9.12760000000003,0.666148808380858,0.00449781404673565,0.651131360622416 59 | 200,0.03,5,0.6,1,0,0.1,58,114.15,0.833650000000034,0.685828146370905,0.00459582425806197,0.656455529883283 60 | 1000,0.01,20,0.6,0.8,0,0,59,342.9,2.66740000000007,0.67628275557318,0.00647504900612066,0.656248812634698 61 | 5000,0.03,5,0.6,0.8,0.3,0.1,60,109.4,1.89379999999996,0.690621018395744,0.00507628188858128,0.653884571367828 62 | 1000,0.01,50,0.6,0.6,0,0.3,61,397.2,1.41000000000006,0.689385266963072,0.00537077311973636,0.664143584495604 63 | 5000,0.01,20,0.6,0.4,0,0,62,342.5,3.35654999999999,0.686272563355528,0.0054346997004455,0.65764933689679 64 | 2000,0.1,5,0.6,0.8,0,0,63,26.35,1.29574999999998,0.669271727097683,0.0073119709027798,0.657920995474121 65 | 
5000,0.01,5,1,0.8,0,0,64,222.8,9.95714999999998,0.662737712101852,0.00704761015974227,0.65211866952549 66 | 100,0.03,10,1,0.8,0.01,0,65,123.15,0.668499999999995,0.681895034447863,0.00460942323098496,0.666611916807764 67 | 500,0.01,50,0.6,0.8,0.3,0.01,66,413.8,1.3056,0.68512398712189,0.00525366727090325,0.665739974017316 68 | 1000,0.03,50,0.6,0.4,0,0,67,145.75,0.687549999999942,0.679901759703033,0.00594675645181537,0.662886321585415 69 | 1000,0.01,20,0.8,1,0,0,68,316.35,2.88259999999998,0.689965587940745,0.00547852464895218,0.660175776569415 70 | 2000,0.01,10,0.6,0.8,0,0,69,312.85,5.23320000000008,0.684771875956524,0.00537278338402666,0.654808368976965 71 | 100,0.1,10,0.8,0.8,0.3,0,70,40.95,0.330999999999995,0.68188620493952,0.00572190586339513,0.669635594843313 72 | 2000,0.01,10,1,1,0.1,0,71,258.75,3.89299999999998,0.665973289150595,0.00453196895844625,0.656662337826465 73 | 200,0.03,5,1,0.8,0,0.3,72,106.05,0.930199999999968,0.676414314619103,0.00550517686092345,0.66329057336638 74 | 100,0.03,20,1,0.8,0,0,73,118.7,0.591849999999931,0.682351247461034,0.0061983547376618,0.663412144957581 75 | 200,0.1,20,0.6,0.6,0,0.1,74,40.05,0.369300000000067,0.679067844003284,0.00620291181599023,0.656261163782447 76 | 1000,0.01,50,0.8,0.6,0.01,0,75,354.15,1.54925000000003,0.684605100178614,0.00549631291242087,0.664596811906023 77 | 5000,0.1,20,0.8,0.6,0.01,0,76,30.05,0.62175000000002,0.67664885082481,0.00564455633881941,0.657775227026588 78 | 1000,0.01,20,0.6,1,0.01,0,77,349.75,2.63310000000001,0.680099092917275,0.00453609041687502,0.657868673945449 79 | 500,0.03,5,1,0.6,0,0,78,78.6,1.48224999999993,0.672726675874766,0.0048892284240284,0.653580394572663 80 | 200,0.1,5,1,0.8,0,0,79,29.45,0.397250000000031,0.673556426295628,0.00442901752480345,0.661741416174876 81 | 100,0.03,10,1,0.8,0.01,0.01,80,126.05,0.641300000000092,0.678670005601334,0.00468752990496548,0.66643364043739 82 | 200,0.1,20,0.6,0.6,0,0.3,81,37.9,0.348050000000012,0.678691396972462,0.00466259806831318,0.655184644138496 83 | 500,0.1,50,0.6,0.4,0,0,82,43,0.286000000000058,0.672940577265883,0.00664018275069387,0.659598677844187 84 | 500,0.1,20,1,1,0.3,0,83,32.1,0.39675000000002,0.675498756936283,0.00453092905003851,0.659936092531972 85 | 5000,0.03,10,0.8,0.8,0.1,0,84,97.85,1.83470000000002,0.690991665601491,0.00401665693625977,0.659416066757328 86 | 500,0.03,5,0.8,0.8,0,0,85,84.05,1.49995000000013,0.676597570032957,0.00469414711274923,0.657889920037394 87 | 1000,0.01,10,1,1,0,0.3,86,273.65,3.39319999999998,0.677242092564924,0.00386145221575705,0.657889812184282 88 | 2000,0.01,20,0.6,0.4,0.1,0,87,351.5,2.6545000000001,0.688639221139577,0.00564971551428606,0.658587826067145 89 | 5000,0.03,10,1,0.4,0,0,88,85.3,2.29984999999979,0.671352438911219,0.0046252782508582,0.65562808681622 90 | 2000,0.03,50,0.6,1,0,0,89,139.05,0.7721,0.676273471166258,0.00511643616885163,0.66238230416111 91 | 200,0.03,50,0.6,1,0,0.3,90,146.15,0.593750000000091,0.680637204506542,0.00584862290765764,0.663207544572464 92 | 1000,0.03,20,0.8,0.4,0,0,91,107.55,1.12109999999998,0.675600756018752,0.00592994135700377,0.660622202367463 93 | 2000,0.01,10,0.6,0.4,0,0.3,92,334.2,3.82909999999993,0.701018337831426,0.00326656517524741,0.659857025380668 94 | 5000,0.03,20,0.6,0.8,0.3,0,93,131.6,1.10539999999987,0.677055655714402,0.00541870139049504,0.656651850087209 95 | 1000,0.1,5,1,0.6,0.3,0,94,25.6,0.752000000000044,0.668196531950731,0.00615343229273909,0.655792102057436 96 | 200,0.01,50,0.8,0.4,0.1,0,95,370.35,1.45764999999992,0.671031386886442,0.00469654853579933,0.665078555691998 97 | 
1000,0.03,5,0.6,0.4,0.01,0,96,92.8,2.05505000000003,0.675259448296213,0.00535712185250626,0.646380668181599 98 | 500,0.03,5,1,0.8,0,0,97,79.1,1.48249999999994,0.665117791062052,0.00557025138911545,0.653403096724611 99 | 100,0.1,20,0.6,0.6,0,0,98,45.3,0.291249999999991,0.689783820868522,0.0053569658123363,0.655990161148459 100 | 2000,0.03,20,0.8,0.4,0,0,99,102.7,1.2337500000001,0.680412815529284,0.00589519304505992,0.659118044945863 101 | 100,0.03,10,0.6,0.6,0,0.1,100,142.75,0.611699999999973,0.685379706927051,0.00463918447418359,0.661811976653324 102 | -------------------------------------------------------------------------------- /2-train_test_1each/res-1M-100.csv: -------------------------------------------------------------------------------- 1 | num_leaves,learning_rate,min_data_in_leaf,feature_fraction,bagging_fraction,lambda_l1,lambda_l2,krpm,ntrees,runtm,auc_rs_avg,auc_rs_std,auc_test 2 | 100,0.03,5,0.8,0.4,0,0,1,3802.55,120.783,0.928552977436617,0.000629989905078403,0.83152149771496 3 | 100,0.01,20,0.8,0.8,0,0.1,2,-1,144.64055,0.901822042668505,0.000282709777047993,0.809145417413935 4 | 1000,0.1,10,0.6,0.4,0.3,0,3,393.6,18.35645,0.912157932489603,0.00031733904603671,0.80935037794643 5 | 100,0.1,5,0.6,0.6,0,0.3,4,2013.4,29.2095000000001,0.887929804512845,0.00154432079455466,0.792912861524808 6 | 200,0.1,50,0.6,0.4,0.01,0.3,5,1291.3,25.2191000000001,0.891127169233971,0.000778692600684032,0.791562178697442 7 | 500,0.01,50,1,0.4,0,0,6,3039.15,247.27205,0.933232021798135,0.000335454767021557,0.833189392434765 8 | 2000,0.1,10,0.8,1,0.01,0,7,334.45,23.5260000000001,0.942153515701862,0.000422542377867309,0.841510090381076 9 | 5000,0.03,20,1,0.8,0,0.3,8,554.2,88.2715500000004,0.942260783180992,0.000467097381936381,0.838180468826975 10 | 200,0.1,5,0.6,0.4,0.3,0,9,1344.1,24.5158999999996,0.898090660149655,0.00112582675165146,0.80083152728585 11 | 200,0.1,10,0.8,0.6,0,0,10,2017.3,31.8293,0.931423679614167,0.000505223596612852,0.834971391055929 12 | 2000,0.1,10,0.8,1,0.01,0.3,11,395.2,26.9914499999995,0.94320411848142,0.000326039652070183,0.841624380229041 13 | 5000,0.1,10,1,0.4,0,0,12,138.45,22.6196499999995,0.941872429392576,0.000309827700166319,0.840026154128775 14 | 1000,0.01,10,1,0.4,0.1,0.01,13,6135.8,231.3514,0.943336212869011,0.000327791258010041,0.842820148840849 15 | 200,0.03,20,0.6,0.4,0,0.01,14,3147.3,60.2298999999997,0.889037410924096,0.000588039481558914,0.790049274352161 16 | 5000,0.1,10,0.6,0.4,0.1,0,15,228.1,36.0525500000002,0.927411210608524,0.000308191295396548,0.82257388703613 17 | 5000,0.03,50,0.6,0.4,0,0.3,16,640,101.1159,0.919744929033864,0.000613667745272612,0.812605533856583 18 | 2000,0.03,20,0.6,0.4,0.1,0,17,827.9,64.5343000000003,0.920046958911862,0.000414235591458134,0.814117343293444 19 | 1000,0.1,20,0.6,0.4,0,0,18,384.4,17.7786000000007,0.907227310715876,0.000456830348357686,0.803997176655387 20 | 2000,0.1,20,0.6,0.4,0,0,19,340.4,26.3998500000005,0.917795529724232,0.000684310152727947,0.813559414738884 21 | 5000,0.1,50,1,0.8,0.01,0.3,20,198.05,34.2700499999997,0.936635915995783,0.000394275380675705,0.83331414177397 22 | 1000,0.01,50,0.6,0.8,0,0.3,21,2997.35,150.983500000001,0.90187111010475,0.00042629706259887,0.795538241324112 23 | 5000,0.03,5,1,1,0.1,0.01,22,558.55,88.3417500000007,0.948530250742272,0.000266785705940305,0.842177924204273 24 | 100,0.1,10,0.8,0.6,0,0.01,23,3031.7,39.5927500000005,0.925823253864089,0.000735578455131657,0.831589626341213 25 | 2000,0.01,10,1,0.8,0.3,0.3,24,3232.35,228.52795,0.946415301625824,0.000445017623488118,0.844916809513448 26 
| 100,0.03,10,0.8,0.4,0.01,0.01,25,2191.4,127.720249999999,0.928221710580796,0.000376823428436669,0.829732570608068 27 | 500,0.03,10,1,0.8,0.3,0.01,26,4242.65,100.406799999999,0.94546180837524,0.000362460863116169,0.842929723757844 28 | 1000,0.1,20,0.8,0.8,0.1,0,27,581.9,23.1185499999989,0.93739484461948,0.00043694949770499,0.836204260942335 29 | 1000,0.03,5,0.8,0.4,0,0.01,28,1907.7,70.0186500000007,0.942289917015927,0.000389992729430344,0.84071153517048 30 | 5000,0.03,50,0.8,0.4,0,0,29,585.1,97.1748999999993,0.941700879221408,0.000298277584338711,0.83709629482777 31 | 200,0.1,10,1,0.4,0,0,30,2149.1,33.6985999999997,0.935008473344113,0.000512576708816282,0.838943670462456 32 | 200,0.03,50,0.8,0.4,0.1,0.3,31,6946.25,117.77345,0.926302493257766,0.000545623183067408,0.825670151651492 33 | 5000,0.03,5,1,0.4,0,0.3,32,531.45,84.1488999999998,0.948670708875893,0.000251278194514635,0.844472146122928 34 | 5000,0.01,50,0.6,1,0.1,0.01,33,1432.5,222.343250000001,0.914053136440273,0.000263108964467581,0.806273021519677 35 | 500,0.03,10,1,1,0,0.1,34,3492.9,82.981800000001,0.940695944514298,0.000293114092026294,0.841092064817579 36 | 1000,0.1,10,1,0.4,0,0,35,485.75,18.8190499999997,0.936472054392723,0.000549374265181327,0.840795987526587 37 | 1000,0.1,20,0.8,0.6,0,0,36,552.55,21.7816999999995,0.935151706606396,0.000540557182468832,0.835654726686021 38 | 100,0.03,20,0.8,0.4,0,0.1,37,4898.3,126.04205,0.925486093691748,0.000684799819748542,0.826892114983669 39 | 1000,0.03,5,0.8,0.8,0,0,38,1864.65,68.9323000000004,0.941649203613191,0.000383385023219247,0.840532691221241 40 | 100,0.1,50,0.6,0.8,0,0,39,1868.2,28.5067999999992,0.883881932480232,0.00106223478467757,0.786196207430829 41 | 5000,0.01,5,0.8,0.6,0.01,0,40,1430.85,224.440750000001,0.952166667239316,0.000305228394036835,0.846994803332115 42 | 1000,0.03,50,1,1,0.3,0,41,2192.45,90.6455499999996,0.937338614925524,0.000349789173505317,0.835727137027353 43 | 200,0.01,20,0.6,0.6,0.3,0,42,5118.6,166.8929,0.88827979850206,0.000836296168503014,0.788421045584212 44 | 500,0.01,5,0.6,0.6,0.3,0,43,5075.6,148.240650000003,0.903203776180809,0.00156206107274625,0.800727039192846 45 | 5000,0.03,20,0.6,0.4,0.01,0,44,406.4,65.7197499999973,0.923224142135467,0.000287487735596188,0.816547499036084 46 | 1000,0.01,20,1,0.6,0,0,45,5712.45,218.448500000002,0.940086953703248,0.000340107411895074,0.840038781677267 47 | 500,0.01,20,0.6,1,0.3,0,46,4970.55,149.303750000001,0.900706111966387,0.000882957697942373,0.796612566092716 48 | 1000,0.03,50,0.8,0.6,0.01,0,47,2026.75,83.1782000000014,0.933099526092154,0.000351837303150423,0.830420883274621 49 | 200,0.1,20,0.6,1,0,0,48,1142.15,21.3650499999996,0.890972954231671,0.00120943544735578,0.793684858034545 50 | 1000,0.1,5,0.6,0.4,0.3,0,49,400.7,18.2869499999986,0.912702566959078,0.000342597444150723,0.81187889840511 51 | 2000,0.01,50,1,1,0.3,0.1,50,3519.1,269.914600000003,0.939063049108492,0.000367167968557361,0.835769882965568 52 | 100,0.1,50,1,0.4,0,0,51,3875.35,52.9981000000014,0.927540786534275,0.000417817891343919,0.830910628803697 53 | 500,0.1,20,0.8,0.8,0,0,52,1019.6,24.5771500000003,0.932770446246031,0.000534619385289461,0.834132140264162 54 | 200,0.1,5,0.8,0.4,0.01,0.1,53,2260.25,35.1738999999994,0.9357201890633,0.000521386530967759,0.839967078526091 55 | 100,0.03,20,1,1,0.1,0,54,2297.1,132.8357,0.929765157584002,0.000384721658260022,0.831471636355458 56 | 100,0.1,10,1,0.6,0.3,0,55,3880.5,51.8982000000004,0.937653913053393,0.000512410569209586,0.838392727760125 57 | 
500,0.01,10,0.8,0.4,0,0,56,7654.95,206.638300000003,0.937469543881626,0.000327691841148982,0.836526254033885 58 | 2000,0.03,50,1,0.4,0.3,0,57,1476.25,111.485100000001,0.940401014885683,0.00046258044146184,0.837591387308121 59 | 200,0.03,10,0.8,1,0,0.1,58,6509.9,104.378,0.934014973250614,0.000318562245678198,0.834555950940177 60 | 2000,0.01,20,0.6,1,0,0.3,59,1593.55,129.462900000002,0.911929394816015,0.000800548457815684,0.80624929553596 61 | 1000,0.03,50,0.6,0.6,0.01,0,60,1187.2,58.77785,0.905836896821453,0.000332556363219848,0.799776335226553 62 | 500,0.01,20,0.6,0.8,0.01,0.01,61,4038.7,123.709900000001,0.89499635979875,0.000762268622783796,0.792622238005586 63 | 200,0.1,20,0.8,0.8,0.01,0,62,2118.8,33.6311000000031,0.929844513728793,0.000582410036831613,0.833383431576811 64 | 200,0.1,10,0.6,1,0.01,0,63,1104.55,20.1748999999967,0.890859268486979,0.00124935017487443,0.794945832549493 65 | 200,0.03,5,1,0.6,0.3,0,64,7454.95,130.89555,0.944468984615389,0.000359225449171314,0.843626134988361 66 | 500,0.01,10,0.8,1,0.01,0.1,65,6223.3,219.05435,0.938403549817382,0.000405602742736205,0.837761615405894 67 | 5000,0.01,10,1,1,0,0,66,1391.3,217.9003,0.943725182968744,0.000311490243548399,0.837593731421017 68 | 5000,0.01,20,0.8,0.8,0.01,0.1,67,1555.8,243.523250000001,0.94825746346874,0.00026101020031388,0.841834413738463 69 | 100,0.1,50,1,0.8,0.01,0.1,68,3798.9,51.1725500000015,0.927671465720834,0.000499108676133973,0.829936832852093 70 | 5000,0.01,10,0.8,1,0,0.1,69,1540.95,235.534349999999,0.950220335919011,0.00028028192909375,0.84426510655008 71 | 500,0.1,5,0.6,0.8,0,0,70,503.05,14.6277500000011,0.895845535302764,0.0012296564520378,0.798914732279154 72 | 2000,0.03,50,0.8,1,0,0.1,71,1117.85,82.3734499999991,0.934610104985804,0.000436047492780075,0.830614256767789 73 | 500,0.03,10,0.6,1,0,0.01,72,1412.75,41.4781000000017,0.897248113266541,0.000852033301570897,0.796658662035342 74 | 1000,0.03,50,1,0.8,0.3,0.01,73,2129.75,87.2907999999952,0.936386303430685,0.000425325124926869,0.835053980979773 75 | 500,0.03,10,0.6,0.4,0,0,74,1415.35,41.3008999999962,0.896959809390824,0.00079298521143683,0.796844221818604 76 | 5000,0.1,50,1,0.4,0,0,75,194.5,33.0926999999996,0.936033440964579,0.000379339811501328,0.834127164216775 77 | 500,0.1,5,0.6,1,0,0,76,519.1,14.9333500000037,0.895738329208953,0.00152300243765828,0.799233165681881 78 | 200,0.1,50,0.6,0.8,0.01,0,77,1317.3,25.465049999996,0.890977056155588,0.000724407687810897,0.791501843010333 79 | 5000,0.03,20,1,1,0,0.01,78,560.1,86.7487000000037,0.940583599176119,0.00031335149078546,0.836382165268205 80 | 2000,0.1,10,1,0.4,0,0,79,306.2,21.374900000004,0.939542104741436,0.000574508977453389,0.842202623440075 81 | 2000,0.1,5,1,0.6,0.1,0,80,363.25,25.0497499999939,0.944035213717331,0.000355613697855726,0.844561865315179 82 | 2000,0.1,50,0.8,1,0.01,0,81,378.75,28.1263500000045,0.933944843388541,0.000377887763715203,0.832142853811854 83 | 2000,0.01,5,1,0.8,0.01,0.01,82,2993.55,203.046150000002,0.945001140127659,0.000422742699405718,0.843760510026995 84 | 200,0.03,5,1,0.6,0.3,0,83,8004.7,132.307149999998,0.94447607267448,0.000449038803331522,0.843755700144636 85 | 1000,0.01,20,0.6,0.8,0,0.1,84,2935.1,137.930850000006,0.906469868613599,0.000332297792646868,0.801827747595493 86 | 500,0.03,10,0.6,0.4,0.1,0,85,1527.2,45.7167000000001,0.89932137695948,0.000657372230430505,0.797569072974834 87 | 500,0.1,5,0.6,0.6,0,0,86,528.65,15.2622000000003,0.896499288749758,0.00127729164671696,0.799861907218166 88 | 
2000,0.1,50,0.8,0.4,0.3,0.3,87,461.35,35.2676499999987,0.937704037963392,0.000383591276267761,0.834358688395433 89 | 500,0.01,5,1,0.8,0,0.3,88,5876.8,146.618649999995,0.935258968209527,0.000962802385559809,0.838724130932016 90 | 5000,0.03,5,0.8,0.6,0,0.3,89,553.85,86.877499999998,0.951254077617881,0.000295090995580958,0.846000196207264 91 | 1000,0.1,20,0.6,0.8,0,0,90,381.85,17.6822999999975,0.907022382523726,0.000487569296601529,0.804221571565644 92 | 200,0.1,20,0.8,0.6,0.3,0,91,2410.2,38.9081000000035,0.935292735136089,0.00056222921111984,0.834810148724461 93 | 500,0.1,5,1,1,0,0.01,92,1176.7,27.3078999999983,0.94080439903968,0.000439500688229819,0.845682336469295 94 | 1000,0.01,20,0.6,1,0,0,93,2876.9,134.253799999999,0.906442900954995,0.000466685696947013,0.801286823256206 95 | 5000,0.01,10,0.8,0.4,0.01,0.01,94,1507.05,234.41545,0.95122685521823,0.000277961949003674,0.845088223514703 96 | 1000,0.1,50,0.8,0.8,0,0,95,633.45,26.3362000000037,0.929705760192553,0.000586097215312633,0.829901515872647 97 | 500,0.03,5,1,0.4,0,0,96,3665.45,83.2490000000005,0.942197897912775,0.000461704109500219,0.84427138224489 98 | 500,0.1,50,1,0.4,0,0.1,97,1284.85,32.1952000000005,0.932645879970742,0.000346507811446759,0.834972663998221 99 | 5000,0.03,50,0.6,0.8,0.01,0,98,642.2,102.886750000001,0.919854347582079,0.000671826475628139,0.812714650830909 100 | 5000,0.01,50,0.6,1,0,0,99,1438,225.392999999999,0.914160602030787,0.000286553376914195,0.805707685766044 101 | 2000,0.01,5,0.6,0.4,0.01,0,100,1457.1,114.667099999999,0.915086611770002,0.000451786594847502,0.809943971635364 102 | -------------------------------------------------------------------------------- /2-train_test_1each/run-tuning.R: -------------------------------------------------------------------------------- 1 | library(data.table) 2 | library(ROCR) 3 | library(lightgbm) 4 | library(dplyr) 5 | 6 | set.seed(123) 7 | 8 | 9 | # for yr in 1990 1991; do 10 | # wget http://stat-computing.org/dataexpo/2009/$yr.csv.bz2 11 | # bunzip2 $yr.csv.bz2 12 | # done 13 | 14 | 15 | ## TODO: loop 1991..2000 16 | yr <- 1991 17 | 18 | d0_train <- fread(paste0("/var/data/airline/",yr-1,".csv")) 19 | d0_train <- d0_train[!is.na(DepDelay)] 20 | 21 | d0_test <- fread(paste0("/var/data/airline/",yr,".csv")) 22 | d0_test <- d0_test[!is.na(DepDelay)] 23 | 24 | d0 <- rbind(d0_train, d0_test) 25 | 26 | 27 | for (k in c("Month","DayofMonth","DayOfWeek")) { 28 | d0[[k]] <- paste0("c-",as.character(d0[[k]])) # force them to be categoricals 29 | } 30 | 31 | d0[["dep_delayed_15min"]] <- ifelse(d0[["DepDelay"]]>=15,1,0) 32 | 33 | cols_keep <- c("Month", "DayofMonth", "DayOfWeek", "DepTime", "UniqueCarrier", 34 | "Origin", "Dest", "Distance", "dep_delayed_15min") 35 | d0 <- d0[, cols_keep, with = FALSE] 36 | 37 | 38 | d0_wrules <- lgb.prepare_rules(d0) # lightgbm special treat of categoricals 39 | d0 <- d0_wrules$data 40 | cols_cats <- names(d0_wrules$rules) 41 | 42 | 43 | n1 <- nrow(d0_train) 44 | n2 <- nrow(d0_test) 45 | 46 | p <- ncol(d0)-1 47 | d1_train <- as.matrix(d0[1:n1,]) 48 | d1_test <- as.matrix(d0[(n1+1):(n1+n2),]) 49 | 50 | 51 | 52 | ## TODO: loop 10K,100K,1M 53 | size <- 100e3 54 | 55 | 56 | 57 | n_random <- 100 ## TODO: 1000? 
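## Note on the grid below: lambda_l1 and lambda_l2 list the value 0 four times, presumably so that
## uniform sampling of expand.grid() rows for the random search draws the default of no
## regularization roughly four times as often as each non-zero value. A quick check of the
## implied sampling weights (sketch, not used by the script itself):
##   prop.table(table(c(0,0,0,0, 0.01, 0.1, 0.3)))   # 0 -> 4/7 ~ 0.57, each non-zero -> 1/7 ~ 0.14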
58 | 59 | params_grid <- expand.grid( # default 60 | num_leaves = c(100,200,500,1000,2000,5000), # 31 (was 127) 61 | learning_rate = c(0.01,0.03,0.1), # 0.1 62 | min_data_in_leaf = c(5,10,20,50), # 20 (was 100) 63 | feature_fraction = c(0.6,0.8,1), # 1 64 | bagging_fraction = c(0.4,0.6,0.8,1), # 1 65 | lambda_l1 = c(0,0,0,0, 0.01, 0.1, 0.3), # 0 66 | lambda_l2 = c(0,0,0,0, 0.01, 0.1, 0.3) # 0 67 | ## TODO: 68 | ## min_sum_hessian_in_leaf 69 | ## min_gain_to_split 70 | ## max_bin 71 | ## min_data_in_bin 72 | ) 73 | params_random <- params_grid[sample(1:nrow(params_grid),n_random),] 74 | 75 | 76 | 77 | 78 | ## TODO: resample 79 | d <- d1_train[sample(1:nrow(d1_train), size),] 80 | 81 | 82 | 83 | ## TODO: resample 84 | size_test <- 100e3 85 | d_test <- d1_test[sample(1:nrow(d1_test), size_test),] 86 | 87 | 88 | # ## first run: get train/test with same distribution to check overfitting of best models 89 | # idx <- sample(1:nrow(d1_train), size) 90 | # d <- d1_train[idx,] 91 | # d_test <- d1_train[sample(setdiff(1:nrow(d1_train),idx), size_test),] 92 | # #user system elapsed 93 | # #240025.539 4094.378 31870.404 94 | # ## Results: AUC_test>AUC_rs seems it does not overfit, either underfit or sample variation 95 | # ## TODO: resample train/test to see if underfit or sample variation 96 | 97 | 98 | 99 | # warm up 100 | dlgb_train <- lgb.Dataset(data = d[,1:p], label = d[,p+1]) 101 | system.time({ 102 | md <- lgb.train(data = dlgb_train, objective = "binary", 103 | num_threads = parallel::detectCores()/2, 104 | nrounds = 100, 105 | categorical_feature = cols_cats, 106 | verbose = 0) 107 | }) 108 | system.time({ 109 | md <- lgb.train(data = dlgb_train, objective = "binary", 110 | num_threads = parallel::detectCores()/2, 111 | nrounds = 100, 112 | categorical_feature = cols_cats, 113 | verbose = 0) 114 | }) 115 | 116 | 117 | 118 | system.time({ 119 | 120 | d_res <- data.frame() 121 | for (krpm in 1:nrow(params_random)) { 122 | params <- as.list(params_random[krpm,]) 123 | 124 | ## resample 125 | n_resample <- 20 ## TODO: 100? 126 | 127 | p_train <- 0.8 ## TODO: change? 
(80-10-10 split now) 128 | p_earlystop <- 0.1 129 | p_modelselec <- 1 - p_train - p_earlystop 130 | 131 | mds <- list() 132 | d_res_rs <- data.frame() 133 | for (k in 1:n_resample) { 134 | cat(" krpm:",krpm,"k:",k,"\n") 135 | 136 | idx_train <- sample(1:size, size*p_train) 137 | idx_earlystop <- sample(setdiff(1:size,idx_train), size*p_earlystop) 138 | idx_modelselec <- setdiff(setdiff(1:size,idx_train),idx_earlystop) 139 | 140 | d_train <- d[idx_train,] 141 | d_earlystop <- d[idx_earlystop,] 142 | d_modelselec <- d[idx_modelselec,] 143 | 144 | dlgb_train <- lgb.Dataset(data = d_train[,1:p], label = d_train[,p+1]) 145 | dlgb_earlystop <- lgb.Dataset(data = d_earlystop[,1:p], label = d_earlystop[,p+1]) 146 | 147 | 148 | runtm <- system.time({ 149 | md <- lgb.train(data = dlgb_train, objective = "binary", 150 | num_threads = parallel::detectCores()/2, # = number of "real" cores 151 | params = params, 152 | nrounds = 10000, early_stopping_rounds = 10, valid = list(valid = dlgb_earlystop), 153 | categorical_feature = cols_cats, 154 | verbose = 0) 155 | })[[3]] 156 | 157 | phat <- predict(md, data = d_modelselec[,1:p]) 158 | rocr_pred <- prediction(phat, d_modelselec[,p+1]) 159 | auc_rs <- performance(rocr_pred, "auc")@y.values[[1]] 160 | 161 | d_res_rs <- rbind(d_res_rs, data.frame(ntrees = md$best_iter, runtm = runtm, auc_rs = auc_rs)) 162 | mds[[k]] <- md 163 | } 164 | d_res_rs_avg <- d_res_rs %>% summarize(ntrees = mean(ntrees), runtm = mean(runtm), auc_rs_avg = mean(auc_rs), 165 | auc_rs_std = sd(auc_rs)/sqrt(n_resample)) # std of the mean! 166 | 167 | # consider the model as the average of the models from resamples 168 | # TODO?: alternatively could retrain the "final" model on all of data (early stoping or avg number of trees?) 169 | phat <- matrix(0, nrow = n_resample, ncol = nrow(d_test)) 170 | for (k in 1:n_resample) { 171 | phat[k,] <- predict(mds[[k]], data = d_test[,1:p]) 172 | } 173 | phat_avg <- apply(phat, 2, mean) 174 | rocr_pred <- prediction(phat_avg, d_test[,p+1]) 175 | auc_test <- performance(rocr_pred, "auc")@y.values[[1]] 176 | 177 | d_res <- rbind(d_res, cbind(krpm, d_res_rs_avg, auc_test)) 178 | } 179 | 180 | }) 181 | 182 | d_pm_res <- cbind(params_random, d_res) 183 | 184 | d_pm_res 185 | 186 | 187 | ## TODO: ensemble/avg the top 2,3,...10 models (would need to save all the models) 188 | 189 | ## TODO: easier: see the goodness as a function of how many random iterations 10,100,1000 (can be done in 190 | ## the other analysis file) 191 | 192 | 193 | fwrite(d_pm_res, file = "res.csv") 194 | 195 | -------------------------------------------------------------------------------- /3-test_rs/analyze.Rmd: -------------------------------------------------------------------------------- 1 | ```{r} 2 | 3 | library(data.table) 4 | library(dplyr) 5 | library(ggplot2) 6 | 7 | d_pm_res <- fread("res-100K.csv") 8 | 9 | 10 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = auc_rs_avg)) + 11 | geom_errorbar(aes(x = rank, ymin = auc_rs_avg-auc_rs_std, ymax = auc_rs_avg+auc_rs_std), width = 1) 12 | 13 | d_pm_res %>% ggplot(aes(x = auc_test_avg, y = auc_rs_avg)) + geom_point() + 14 | geom_errorbar(aes(ymin = auc_rs_avg-auc_rs_std, ymax = auc_rs_avg+auc_rs_std), width = 0.0003, alpha = 0.2) + 15 | geom_abline(slope = 1, color = "grey70") + 16 | geom_errorbarh(aes(xmin = auc_test_avg-auc_test_sd, xmax = auc_test_avg+auc_test_sd), height = 0.0003, alpha = 0.2) 17 | 18 | auc_test_top <- sort(d_pm_res$auc_test_avg, decreasing = TRUE)[10] 19 | 
d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% mutate(is_top = auc_test_avg>=auc_test_top) %>% 20 | ggplot(aes(color = is_top)) + geom_point(aes(x = rank, y = auc_rs_avg)) + 21 | geom_errorbar(aes(x = rank, ymin = auc_rs_avg-auc_rs_std, ymax = auc_rs_avg+auc_rs_std), width = 0.3) 22 | 23 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = auc_test_avg)) + 24 | geom_errorbar(aes(x = rank, ymin = auc_test_avg-auc_test_sd, ymax = auc_test_avg+auc_test_sd), width = 1) 25 | 26 | 27 | cor(d_pm_res$auc_rs_avg, d_pm_res$auc_test_avg) 28 | cor(d_pm_res$auc_rs_avg, d_pm_res$auc_test_avg, method = "spearman") 29 | 30 | 31 | d_pm_res %>% arrange(desc(auc_rs_avg)) %>% head(10) 32 | d_pm_res %>% arrange(desc(auc_test_avg)) %>% head(10) 33 | 34 | d_pm_res %>% arrange(desc(auc_rs_avg)) %>% tail(10) 35 | d_pm_res %>% arrange(desc(auc_test_avg)) %>% tail(10) 36 | 37 | 38 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = num_leaves)) + scale_y_log10() 39 | 40 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = learning_rate)) + scale_y_log10() 41 | 42 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = min_data_in_leaf)) + scale_y_log10() 43 | 44 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = feature_fraction)) 45 | 46 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = bagging_fraction)) 47 | 48 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = lambda_l1)) 49 | 50 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = lambda_l2)) 51 | 52 | 53 | 54 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = ntrees)) 55 | 56 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = ntrees)) + scale_y_log10(breaks=c(30,100,300,1000,3000)) 57 | 58 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = runtm)) 59 | 60 | d_pm_res %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% ggplot() + geom_point(aes(x = rank, y = runtm)) + scale_y_log10(breaks=c(1,3,10,30,100)) 61 | 62 | 63 | summary(d_pm_res$runtm) 64 | 65 | mean(d_pm_res$runtm)*20*100/3600 66 | 67 | 68 | summary(d_pm_res$ntree) 69 | 70 | 71 | ``` -------------------------------------------------------------------------------- /3-test_rs/fig-100K-AUCcorr.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/szilard/GBM-tune/7047ae1d8a133aefcc338665ffbee3864cc529ff/3-test_rs/fig-100K-AUCcorr.png -------------------------------------------------------------------------------- /3-test_rs/fig-100K-AUCrs_rank.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/szilard/GBM-tune/7047ae1d8a133aefcc338665ffbee3864cc529ff/3-test_rs/fig-100K-AUCrs_rank.png -------------------------------------------------------------------------------- /3-test_rs/fig-100K-AUCtest_rank.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/szilard/GBM-tune/7047ae1d8a133aefcc338665ffbee3864cc529ff/3-test_rs/fig-100K-AUCtest_rank.png 
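The expression mean(d_pm_res$runtm)*20*100/3600 in 3-test_rs/analyze.Rmd above is presumably a back-of-the-envelope estimate of the total training wall-clock time of the tuning run in hours: runtm is the average elapsed time in seconds of one lgb.train() call per hyperparameter combination, 20 is n_resample and 100 is n_random in 3-test_rs/run-tuning.R, and 3600 converts seconds to hours. A minimal sketch with those assumptions spelled out (prediction/scoring time not counted):

```{r}
# Sketch under the assumptions above: n_resample = 20 and n_random = 100 as in run-tuning.R,
# and runtm = mean elapsed seconds of a single lgb.train() call for one hyperparameter combination.
library(data.table)

d_pm_res <- fread("res-100K.csv")

n_resample <- 20
n_random   <- 100

total_train_hours <- mean(d_pm_res$runtm) * n_resample * n_random / 3600
total_train_hours
```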
-------------------------------------------------------------------------------- /3-test_rs/fig-10K-AUCcorr.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/szilard/GBM-tune/7047ae1d8a133aefcc338665ffbee3864cc529ff/3-test_rs/fig-10K-AUCcorr.png -------------------------------------------------------------------------------- /3-test_rs/res-100K.csv: -------------------------------------------------------------------------------- 1 | num_leaves,learning_rate,min_data_in_leaf,feature_fraction,bagging_fraction,lambda_l1,lambda_l2,krpm,ntrees,runtm,auc_rs_avg,auc_rs_std,auc_test_avg,auc_test_sd 2 | 200,0.03,50,0.6,0.8,0.1,0,1,495.4,3.0923,0.788293542150508,0.00158925841709515,0.723144981513305,0.00231210649064027 3 | 5000,0.03,50,1,0.6,0,0.01,2,250.2,7.65315,0.791957372334741,0.0016983440798005,0.732644813533755,0.00184936771557532 4 | 5000,0.1,5,0.8,1,0,0.01,3,33.95,5.07194999999997,0.800584990894827,0.00125561250169598,0.749156047358491,0.00205425405313052 5 | 200,0.01,20,0.6,0.8,0,0.01,4,1095,6.61705000000002,0.787733303807144,0.00139923381827546,0.726305966732273,0.00219388438993 6 | 100,0.1,5,1,0.4,0,0.3,5,245.6,0.955050000000051,0.787748037585345,0.00175069839702623,0.734584447940127,0.00174277287312448 7 | 500,0.1,10,0.8,0.6,0,0.01,6,94.85,1.29679999999996,0.80226824092923,0.00139347722179284,0.74388159141264,0.00195665992109027 8 | 5000,0.01,20,1,0.6,0,0,7,425.75,29.3979500000001,0.783812254109439,0.00141642619180188,0.729878313905452,0.00201571905467454 9 | 1000,0.1,20,0.8,0.6,0.01,0,8,73.4,2.01105000000016,0.803575477662336,0.00137451984079618,0.741784069868219,0.00200961594346767 10 | 5000,0.01,20,0.8,0.8,0.01,0.01,9,495.25,29.19205,0.803605251039444,0.0011858826523806,0.740235984061904,0.00205524056219106 11 | 2000,0.01,10,1,0.4,0.01,0,10,416.25,20.9205000000002,0.789582079093407,0.00108290678009551,0.734522475627661,0.0021752978633002 12 | 2000,0.1,20,1,1,0.1,0.01,11,48.3,2.8054500000002,0.782793859342886,0.00176687197892049,0.732377940766706,0.00200076670603934 13 | 500,0.03,10,1,0.8,0.1,0,12,293.2,3.79399999999996,0.798848700176517,0.00131237327456868,0.740064991742361,0.00180926968337642 14 | 500,0.1,5,0.8,1,0.3,0,13,98.3,1.37475000000022,0.808687509171401,0.00109755393574709,0.74750471046781,0.00182114048879434 15 | 2000,0.01,50,1,0.4,0,0.3,14,751.45,19.6466500000001,0.794866339838377,0.00136862614851209,0.732252690371904,0.00186170712241115 16 | 5000,0.01,50,0.6,0.6,0,0,15,740.35,17.5397,0.788415289132339,0.00133923545076051,0.727240901772118,0.00234334883930632 17 | 5000,0.03,5,0.6,0.4,0.3,0.1,16,192.95,15.5766500000003,0.799515837718476,0.00111123553848021,0.736697713270535,0.00218853112726501 18 | 2000,0.1,5,0.6,0.4,0,0,17,53.75,2.98655000000017,0.787648275108135,0.00138958308137535,0.733687152886856,0.00236932775774525 19 | 5000,0.01,50,0.6,0.4,0.3,0,18,867.6,17.1940500000002,0.79100370065621,0.00193123256853869,0.728030045701859,0.00244012940176136 20 | 5000,0.1,20,0.8,0.4,0,0,19,45.5,3.69475000000002,0.796739445500034,0.00150650302498103,0.742324617720152,0.00196696195171257 21 | 5000,0.1,10,0.8,0.6,0.01,0,20,37.05,4.71805000000004,0.795873550952099,0.00204851323147363,0.747506533898696,0.00209707699434227 22 | 5000,0.01,5,0.6,0.8,0,0,21,362.2,40.0346,0.798256252445463,0.00111622142563911,0.735377201486105,0.00228922554472575 23 | 2000,0.03,50,0.6,1,0,0,22,342.95,8.01785,0.792788758255409,0.00149785713135705,0.728803581199091,0.00234581964943645 24 | 
1000,0.1,10,0.6,1,0,0,23,68.45,1.84619999999959,0.792471424638641,0.00124256400546119,0.728397405525895,0.00225491639952752 25 | 100,0.01,20,1,1,0,0,24,1645.45,6.14664999999986,0.785826811613146,0.00158112660662412,0.728229162699025,0.00194203192523001 26 | 5000,0.01,20,1,0.8,0,0,25,428.5,29.3934500000001,0.786487357607407,0.00132727388538261,0.730919245190628,0.00189512396616456 27 | 200,0.03,10,1,0.8,0.01,0.1,26,485.1,2.99345000000012,0.795641467335251,0.00131291227883532,0.736311244503527,0.00181110155176873 28 | 100,0.1,50,1,0.8,0.01,0,27,246.05,1.11040000000012,0.790261502666769,0.0016645496440058,0.733571151652941,0.00207124454098566 29 | 2000,0.1,10,0.6,1,0,0.3,28,65.8,3.37679999999982,0.794651416557506,0.00183914228458343,0.730302779720246,0.00220699265055946 30 | 100,0.01,20,1,0.8,0.1,0.1,29,1694.05,6.41944999999996,0.787058740836131,0.00156426436343612,0.729339764323513,0.00202882988515877 31 | 2000,0.03,50,1,0.4,0,0,30,249.1,7.39330000000064,0.795107911134776,0.00175967140416962,0.732815561705604,0.00177608810759066 32 | 5000,0.1,50,0.6,0.6,0,0,31,85.15,2.32795000000042,0.781608830117892,0.00136730476394089,0.722706214904799,0.0023913919757029 33 | 2000,0.01,50,1,1,0.1,0,32,732.25,20.0138000000001,0.791373398715735,0.00125192596717394,0.73150973050598,0.00189656481822839 34 | 100,0.01,5,1,1,0,0,33,1665.3,5.98769999999968,0.783047428929322,0.00133934464078655,0.728894284045484,0.00167598347404097 35 | 2000,0.01,5,0.8,1,0,0,34,444,21.5956500000002,0.814246520891619,0.00146550244632235,0.747448516251052,0.00215970620571533 36 | 2000,0.03,10,0.8,1,0,0,35,160.4,8.08520000000026,0.808969982904969,0.00169131589691264,0.7434116112629,0.00210789768432663 37 | 2000,0.01,10,1,0.4,0,0.1,36,419.45,20.9787499999984,0.789777530453203,0.00128271102112973,0.735077273336161,0.00200906014525673 38 | 2000,0.1,5,0.8,1,0,0,37,44.85,2.72075000000004,0.805986931445475,0.00136046330915719,0.751420407398978,0.00183069193021998 39 | 1000,0.01,5,1,0.8,0.1,0,38,584.95,14.3204000000002,0.804238661849547,0.00202627426717689,0.743043571202373,0.00173785951871213 40 | 200,0.03,50,0.6,0.8,0.3,0.3,39,510.5,3.21964999999909,0.788824253057131,0.00136455972079891,0.723455852245052,0.00228988891310819 41 | 500,0.01,10,0.6,0.8,0.01,0.1,40,830.2,9.86209999999955,0.799308227148977,0.00153607497781551,0.732630616995861,0.00213115630921976 42 | 1000,0.01,5,0.8,0.4,0.3,0,41,670.15,16.1159,0.817481840844254,0.00165116559720666,0.748884126772546,0.00208144240868024 43 | 1000,0.1,20,0.8,0.8,0,0.01,42,74.25,2.00380000000005,0.802576121300649,0.00128850673626037,0.741066769913599,0.00203453676097778 44 | 100,0.1,5,0.6,0.6,0,0,43,196.3,0.814700000000084,0.774298062345061,0.00207836022345232,0.720095553115763,0.00189210168750444 45 | 100,0.01,10,1,0.6,0,0.01,44,1662.4,6.0492000000002,0.785380049192551,0.00148629778321049,0.728962629076868,0.00186936791930906 46 | 500,0.1,20,0.8,0.4,0,0,45,91.25,1.2895500000006,0.797558217347458,0.00152201322051121,0.7402857265489,0.00199773612413769 47 | 200,0.03,50,0.6,0.8,0,0,46,488.1,3.03234999999877,0.784899296106119,0.00132369484799808,0.723096256418486,0.00225917624214396 48 | 500,0.1,20,0.8,0.4,0.1,0.01,47,96.4,1.36700000000019,0.797946262801347,0.00170218393119264,0.740601479650344,0.00203354775057054 49 | 200,0.03,50,1,0.8,0,0,48,521.5,3.16879999999946,0.797420448007672,0.00178201903946785,0.736297825209765,0.00208313395427846 50 | 200,0.01,10,1,1,0.01,0,49,1190.05,7.00334999999832,0.794070124306061,0.00159378718165542,0.73460409608153,0.00175250917521503 51 | 
2000,0.01,20,1,0.6,0,0.1,50,471.7,22.5039999999997,0.787584766809749,0.00177232444592121,0.732204854752618,0.00197468796122166 52 | 5000,0.01,10,0.8,0.8,0,0,51,359,39.0842999999997,0.808426481128471,0.00144265608014597,0.741359330376635,0.00205965699989979 53 | 1000,0.03,50,0.8,0.4,0,0,52,266.55,5.8887999999999,0.802373763572969,0.00150748953737239,0.73738010637954,0.00204549457317267 54 | 5000,0.1,5,0.8,0.4,0,0.1,53,34.35,5.19029999999984,0.797877690903286,0.00144380596268803,0.749175829588418,0.00192904703265573 55 | 1000,0.01,10,1,0.8,0,0,54,559.55,13.1446499999995,0.796610195254037,0.00139605134288494,0.738979540470026,0.00203805407047546 56 | 200,0.03,50,1,0.6,0,0,55,517.1,3.12135000000126,0.792870558332542,0.00187299298957691,0.735304104574053,0.00202504628939085 57 | 100,0.1,50,0.8,0.8,0,0,56,256.25,1.00590000000047,0.790337370250814,0.00219399982790179,0.733124931420132,0.00219917440838352 58 | 2000,0.01,5,1,0.4,0,0,57,411.5,20.5716000000004,0.791995983000322,0.00127322172531698,0.738962611809297,0.00203410106789552 59 | 200,0.03,5,0.6,1,0,0.1,58,402.1,2.40084999999999,0.784098940043924,0.00109832227671711,0.727290693021978,0.00183015307280379 60 | 1000,0.01,20,0.6,0.8,0,0,59,653.35,14.0843000000004,0.799111486696331,0.0013422793112804,0.731988233104752,0.00239111567985236 61 | 5000,0.03,5,0.6,0.8,0.3,0.1,60,194.1,15.7453500000003,0.799264195258167,0.00159479858666593,0.736732735521519,0.0022530268872956 62 | 1000,0.01,50,0.6,0.6,0,0.3,61,811.05,14.6119999999995,0.790283296098649,0.00142699598213304,0.72778894288424,0.0022725601284837 63 | 5000,0.01,20,0.6,0.4,0,0,62,606.65,32.2308999999994,0.796486436348218,0.00125857838124688,0.731623270704461,0.00230573275752607 64 | 2000,0.1,5,0.6,0.8,0,0,63,54.4,3.00190000000111,0.787023687649596,0.00172089075863786,0.733162207092356,0.00223386645383782 65 | 5000,0.01,5,1,0.8,0,0,64,280.9,33.4296499999993,0.768032007357778,0.00157616724277141,0.725218905988437,0.00212367641779034 66 | 100,0.03,10,1,0.8,0.01,0,65,655.9,2.44515000000029,0.787595685864723,0.00160859627023955,0.729991375616491,0.0017896625727561 67 | 500,0.01,50,0.6,0.8,0.3,0.01,66,985.1,11.5338999999993,0.792490086555918,0.00139795262070104,0.72749253444922,0.00238061286335489 68 | 1000,0.03,50,0.6,0.4,0,0,67,341,6.64430000000066,0.791711000084207,0.0012312198707244,0.727575957896919,0.0022995445758235 69 | 1000,0.01,20,0.8,1,0,0,68,642,14.5761000000013,0.808348077426237,0.00131795526200489,0.741730778536304,0.00211213007428613 70 | 2000,0.01,10,0.6,0.8,0,0,69,538.2,24.5768999999978,0.800207121329932,0.00122617337251054,0.734076606451219,0.00227992933496627 71 | 100,0.1,10,0.8,0.8,0.3,0,70,276.2,1.05564999999842,0.794218745787054,0.00145862752188895,0.735763492629865,0.00200787581780642 72 | 2000,0.01,10,1,1,0.1,0,71,419.2,21.1466999999982,0.789422881940988,0.00160831929802662,0.734835867950074,0.00211827812933281 73 | 200,0.03,5,1,0.8,0,0.3,72,528.35,3.11740000000063,0.797940600852207,0.00146418422780454,0.738816491836559,0.00162389383662084 74 | 100,0.03,20,1,0.8,0,0,73,594.85,2.22765000000218,0.787248193039144,0.0015095926067435,0.729203081167186,0.00199485532342905 75 | 200,0.1,20,0.6,0.6,0,0.1,74,153.6,1.01064999999799,0.785011862092736,0.00179062996239697,0.724642602069858,0.00222991050659879 76 | 1000,0.01,50,0.8,0.6,0.01,0,75,773.05,16.6847499999996,0.80302322760738,0.00130443431220004,0.737695971157471,0.00191918802124848 77 | 5000,0.1,20,0.8,0.6,0.01,0,76,45.9,3.45714999999909,0.797987851704324,0.00160202005784862,0.741811052971049,0.00196855705524214 78 | 
1000,0.01,20,0.6,1,0.01,0,77,646.1,13.8108000000015,0.797815263183489,0.00122797913300311,0.731649817554127,0.00232169789786185 79 | 500,0.03,5,1,0.6,0,0,78,267.7,3.42220000000016,0.79869507732157,0.00159429881378461,0.741150785002158,0.00159313826876036 80 | 200,0.1,5,1,0.8,0,0,79,135.2,0.881050000000687,0.790763532789102,0.00207318742425423,0.737824834767947,0.00154456134422182 81 | 100,0.03,10,1,0.8,0.01,0.01,80,658.8,2.40199999999968,0.78697334898199,0.00195011338881419,0.73104877207955,0.00180487124072758 82 | 200,0.1,20,0.6,0.6,0,0.3,81,176.4,1.13615000000063,0.785322871859092,0.00195387999890304,0.726041660344395,0.0022798540882487 83 | 500,0.1,50,0.6,0.4,0,0,82,107.95,1.47599999999875,0.785446782081459,0.00208425690128443,0.722858502011796,0.00257226890179431 84 | 500,0.1,20,1,1,0.3,0,83,102.5,1.47105000000156,0.795178874639564,0.00151627269458569,0.740928605191836,0.00194437038582323 85 | 5000,0.03,10,0.8,0.8,0.1,0,84,131.55,12.2760499999997,0.80488407077742,0.00172530990281775,0.740626871808947,0.00212868924896759 86 | 500,0.03,5,0.8,0.8,0,0,85,304.15,3.85670000000246,0.812303723050231,0.00140132501513907,0.746379581498471,0.00187572781206878 87 | 1000,0.01,10,1,1,0,0.3,86,615.2,15.43845,0.795929605159923,0.00123997624342262,0.739202175570606,0.00198937599094205 88 | 2000,0.01,20,0.6,0.4,0.1,0,87,619.5,24.6246499999972,0.797846630896894,0.00170025357130534,0.732888836215258,0.00231594969809365 89 | 5000,0.03,10,1,0.4,0,0,88,110.65,13.1151500000015,0.774677223435328,0.00165776619906156,0.727149987264661,0.00228475330489443 90 | 2000,0.03,50,0.6,1,0,0,89,338.85,7.78184999999867,0.792048394020365,0.0011583906922407,0.728503077675136,0.00244606187346474 91 | 200,0.03,50,0.6,1,0,0.3,90,508.1,3.15140000000029,0.787553367101601,0.00140422646059083,0.722680216642422,0.00228618049520376 92 | 1000,0.03,20,0.8,0.4,0,0,91,224.9,5.24059999999881,0.805055829497583,0.00175639649313409,0.741663764773563,0.00212240734729544 93 | 2000,0.01,10,0.6,0.4,0,0.3,92,599.8,26.122699999997,0.802653411373368,0.00100788888358833,0.735266881000808,0.00226822798817389 94 | 5000,0.03,20,0.6,0.8,0.3,0,93,264.9,9.75144999999975,0.795563309195819,0.00180120267667718,0.73492908212705,0.0023583713741441 95 | 1000,0.1,5,1,0.6,0.3,0,94,62.5,1.79654999999912,0.798470400235447,0.00117431891096264,0.745209705077008,0.00188995567928804 96 | 200,0.01,50,0.8,0.4,0.1,0,95,1480,8.90909999999858,0.798375782795846,0.0011231844011915,0.734993243028463,0.00207382755742672 97 | 1000,0.03,5,0.6,0.4,0.01,0,96,251.1,5.89575000000332,0.797275655936365,0.00136004630381936,0.735383890141345,0.00224481656330881 98 | 500,0.03,5,1,0.8,0,0,97,258.05,3.30085000000108,0.7987741476891,0.00171947420420652,0.74088849289184,0.00164564057247875 99 | 100,0.1,20,0.6,0.6,0,0,98,176.6,0.734000000000378,0.773051885819882,0.00130847522100934,0.71866603961427,0.00221475239438002 100 | 2000,0.03,20,0.8,0.4,0,0,99,173.15,8.4362000000001,0.807887265763082,0.00120023895564672,0.739892105989912,0.00207923763164267 101 | 100,0.03,10,0.6,0.6,0,0.1,100,505.15,1.93894999999829,0.77397432032672,0.00164066104084039,0.720098084346301,0.00190502172765001 102 | -------------------------------------------------------------------------------- /3-test_rs/res-10K.csv: -------------------------------------------------------------------------------- 1 | num_leaves,learning_rate,min_data_in_leaf,feature_fraction,bagging_fraction,lambda_l1,lambda_l2,krpm,ntrees,runtm,auc_rs_avg,auc_rs_std,auc_test_avg,auc_test_sd 2 | 
50,0.03,50,0.6,0.8,0.1,0,1,165.45,0.231650000000001,0.691381871696735,0.00592989794939431,0.668667489046144,0.00296388596898639 3 | 500,0.03,50,1,0.6,0,0.01,2,115,0.333850000000002,0.675208351156258,0.0062732437659569,0.666310016011439,0.00301878004759422 4 | 200,0.1,5,0.8,1,0,0.01,3,30.2,0.178900000000002,0.679701699522979,0.00731262990767077,0.668974045732552,0.00291863492203561 5 | 20,0.01,20,0.6,0.8,0,0.01,4,495.8,0.339300000000003,0.683002145201335,0.0057958609181941,0.674049222726918,0.00256615957223973 6 | 500,0.03,5,1,0.4,0,0.3,5,87.35,0.866850000000005,0.673193212367569,0.00386175778179112,0.656889942107222,0.00298104160847579 7 | 50,0.1,10,0.8,0.6,0,0.01,6,45.2,0.091149999999999,0.690161071267728,0.00434651408521804,0.67525144973868,0.00260329873410281 8 | 500,0.01,20,1,0.6,0,0,7,289.35,1.54405,0.661062329525489,0.00470789092097717,0.663214200180321,0.0031860443160786 9 | 100,0.1,20,0.8,0.6,0.01,0,8,37.4,0.125350000000026,0.68294326568027,0.00498093749405794,0.66985438409494,0.00294325293878437 10 | 500,0.01,20,0.8,0.8,0.01,0.01,9,293,1.42155000000005,0.685707484507107,0.00565285159192133,0.665947530335269,0.00302960700402288 11 | 200,0.01,10,1,0.4,0.01,0,10,283.6,1.19804999999998,0.667872671374466,0.00388591567595419,0.667653680020935,0.00305807182837967 12 | 100,0.1,20,1,1,0.1,0.01,11,37.6,0.131299999999965,0.674806897895864,0.00685728063547886,0.668011264059089,0.00310477190354332 13 | 20,0.03,10,1,0.8,0.1,0,12,179.05,0.150850000000037,0.692967802817069,0.00524395022664196,0.677491890692327,0.00237793556032428 14 | 50,0.1,5,0.8,1,0.3,0,13,46.6,0.0953999999999951,0.692546870610467,0.00598554939605691,0.675363240138751,0.00252388999208886 15 | 50,0.01,50,1,0.4,0,0.3,14,361.75,0.503550000000018,0.687126938969341,0.00583345407383235,0.671537292108311,0.00283593637062559 16 | 200,0.01,50,0.6,0.6,0,0,15,423.2,0.846400000000017,0.675014631703613,0.00503402210432165,0.667746675065468,0.0030638054615172 17 | 100,0.03,5,0.6,0.4,0.3,0.1,16,154.95,0.349999999999955,0.688050152600797,0.0059480061139357,0.66808439247133,0.00257777111833681 18 | 200,0.1,5,0.6,0.4,0,0,17,30.1,0.166849999999931,0.687369033525596,0.00454672270666743,0.666529629086551,0.00299637362936524 19 | 200,0.01,50,0.6,0.4,0.3,0,18,396.45,0.65630000000001,0.675873290112029,0.00565933212139969,0.669151038440617,0.00294892815086746 20 | 500,0.1,20,0.8,0.4,0,0,19,33.1,0.234850000000006,0.66367674660156,0.00558852092631481,0.66387992376956,0.00314976586083577 21 | 200,0.1,10,0.8,0.6,0.01,0,20,30.5,0.179200000000105,0.678437409725847,0.00518359083656836,0.667857505865943,0.00305898510382532 22 | 200,0.01,5,0.6,0.8,0,0,21,342.85,1.25514999999998,0.688966079493424,0.00534504212992018,0.66418896566063,0.00280784072547256 23 | 100,0.03,50,0.6,1,0,0,22,152.9,0.302550000000019,0.683698160998158,0.00411053717250267,0.666226503049011,0.00304651190118851 24 | 100,0.1,10,0.6,1,0,0,23,39.85,0.119149999999945,0.684396014318017,0.00339104037348062,0.664625121662538,0.00295170643898301 25 | 20,0.01,20,1,1,0,0,24,452.3,0.328900000000021,0.683796330730553,0.00504852961849169,0.675861293581894,0.00257478740668902 26 | 200,0.01,20,1,0.8,0,0,25,314.25,1.21609999999996,0.672003958159431,0.00482279805800751,0.66240847482294,0.00324728124177031 27 | 100,0.01,10,1,0.8,0.01,0.1,26,353.3,0.816450000000032,0.688004855021233,0.00502152722119961,0.671170081173332,0.00288024782321888 28 | 100,0.03,50,1,0.8,0.01,0,27,119.2,0.285899999999947,0.670166038661384,0.00545924739121195,0.667838109908662,0.00300850978450397 29 | 
500,0.03,10,0.6,1,0,0.3,28,116.45,0.708299999999872,0.69012289341178,0.00474725163592617,0.659138166213049,0.00308365881203449 30 | 50,0.1,10,1,0.8,0.1,0.1,29,43.45,0.0915000000001783,0.683060358744108,0.00557898881950107,0.674487959460661,0.00269448803600049 31 | 200,0.03,50,1,0.4,0,0,30,111.9,0.297899999999891,0.677461454961429,0.0054456717058803,0.666866297984967,0.00306840531951846 32 | 100,0.1,50,0.6,0.6,0,0,31,43.75,0.116250000000082,0.682064573163051,0.00498964562726551,0.664615070185116,0.00319767550545964 33 | 100,0.01,50,1,1,0.1,0,32,347.35,0.751549999999952,0.680162097568672,0.00505898378147799,0.666934084434769,0.00301374500708753 34 | 500,0.1,50,0.8,1,0,0,33,38.05,0.133200000000079,0.673671955762741,0.00459691782716702,0.668117134238107,0.00300697327620682 35 | 50,0.01,5,0.8,1,0,0,34,392.15,0.49940000000006,0.689547806486912,0.00525680579012251,0.676607154316574,0.00249801706622886 36 | 100,0.03,10,0.8,1,0,0,35,128.85,0.306250000000045,0.685091491052114,0.00475567709237604,0.673178038947212,0.00279645752948984 37 | 500,0.1,5,1,0.4,0,0.1,36,23.85,0.329099999999835,0.664534122286564,0.00450454965123151,0.656983802620796,0.00298386491976179 38 | 100,0.1,5,0.8,1,0,0,37,36.4,0.118799999999965,0.686537465319078,0.00646800784485711,0.673773516635534,0.00236129834495276 39 | 50,0.01,5,1,0.8,0.1,0,38,397.5,0.541650000000163,0.68232619730759,0.004971943333981,0.675866150178718,0.00251977513472225 40 | 20,0.01,50,0.6,0.8,0.3,0.3,39,506.3,0.352749999999969,0.684540136846683,0.00529028883204893,0.673124416158258,0.00249521739220503 41 | 50,0.1,5,0.6,0.8,0.01,0.1,40,53.2,0.094550000000072,0.688382445837379,0.00354225255400311,0.669110066735103,0.00271581746318464 42 | 500,0.1,50,0.6,0.4,0.3,0,41,46.25,0.116049999999996,0.682358240119042,0.00501853764705828,0.665441805683221,0.00318963259601758 43 | 200,0.03,20,0.8,0.8,0,0.01,42,107.25,0.442500000000109,0.681760162636882,0.00478081104761001,0.663963335136351,0.00315018340444106 44 | 200,0.03,5,0.6,0.6,0,0,43,118,0.472850000000108,0.692083783715059,0.00469087473810969,0.65883805270645,0.00255972438851333 45 | 50,0.1,5,1,0.6,0,0.01,44,45.85,0.0922500000000582,0.694861117543693,0.00446005953173602,0.674217509467411,0.00246939247714409 46 | 500,0.03,20,0.8,0.4,0,0,45,107.3,0.59074999999998,0.668013211253567,0.00679469737919543,0.665433566159206,0.00320604635596223 47 | 100,0.01,50,0.6,0.8,0,0,46,397.8,0.732250000000295,0.679451498665879,0.00445947079837377,0.668163594039684,0.00302443311841327 48 | 50,0.03,20,0.8,0.4,0.1,0.01,47,133.1,0.198850000000039,0.688159779189768,0.00492802508398845,0.673878422609314,0.00286746338952859 49 | 100,0.01,50,1,0.8,0,0,48,342.75,0.727849999999944,0.674906965048774,0.00352549890387175,0.667116911625451,0.0030023518269247 50 | 500,0.1,5,1,1,0.01,0,49,22.95,0.324900000000071,0.657279202272083,0.00522069327435221,0.656177329810489,0.00302022232478796 51 | 100,0.1,10,1,0.6,0,0.1,50,36.45,0.123999999999887,0.677132128998723,0.0049327243514671,0.672857911472902,0.00291470501797602 52 | 500,0.01,10,0.8,0.8,0,0,51,276.45,2.28680000000013,0.683144947473664,0.00620378006306257,0.663370300763828,0.00327145355449744 53 | 500,0.01,50,0.8,0.4,0,0,52,359.55,0.85630000000001,0.678769858945006,0.00479070654602952,0.667596221325456,0.00291070277010579 54 | 200,0.03,5,0.8,0.4,0,0.1,53,105.35,0.458999999999833,0.691418868016869,0.00586674006749031,0.671174101806582,0.00268820812545646 55 | 200,0.1,5,1,0.8,0,0,54,29.25,0.176049999999759,0.676846751911262,0.00589314458196734,0.667292295138811,0.00273108300902907 56 | 
500,0.01,50,1,0.6,0,0,55,327.25,0.821150000000216,0.670873848292416,0.00378185006916472,0.668190112381466,0.00304580799814487 57 | 50,0.03,50,0.8,0.8,0,0,56,129.6,0.191899999999896,0.684027763893081,0.00764938365531025,0.671747225520913,0.00287222413592866 58 | 500,0.1,50,0.8,0.4,0,0,57,40.45,0.138300000000072,0.672695719486686,0.00549763471822401,0.666514933562107,0.00307511435209772 59 | 500,0.1,50,1,0.8,0,0.1,58,33.5,0.124800000000232,0.668866942524983,0.00420077881432703,0.667299869568317,0.00318681058017116 60 | 50,0.01,20,0.6,0.8,0,0,59,445.4,0.53660000000018,0.689856929653143,0.00367195232311505,0.670829277875856,0.002845455832965 61 | 50,0.01,5,0.6,0.8,0.3,0.1,60,474.4,0.575300000000061,0.69648180570605,0.00564406747464752,0.674406172843076,0.00242452849578485 62 | 500,0.03,20,0.6,0.6,0,0.3,61,125.5,0.478550000000087,0.693593212584201,0.00568375346331295,0.662689938204427,0.0032645654104175 63 | 500,0.01,20,0.6,0.4,0,0,62,361.25,1.56024999999991,0.692078063695475,0.00455866457466345,0.661779315324238,0.00332455214525284 64 | 20,0.1,5,0.6,0.8,0,0,63,65.25,0.0672000000004118,0.683182631640814,0.00551262692930609,0.673868548505741,0.0024194267039062 65 | 500,0.01,5,1,0.8,0,0,64,232.2,2.15925000000025,0.663046792028932,0.00379247673579236,0.656185687255817,0.00296559525684093 66 | 100,0.01,10,1,0.8,0.01,0,65,342.2,0.789499999999862,0.681514508283559,0.00482378745202785,0.671393897912726,0.0028954264700897 67 | 500,0.03,20,0.6,0.8,0.3,0.01,66,140.35,0.458049999999912,0.683140229195547,0.00497949034249797,0.663052596381908,0.0031364283504644 68 | 500,0.01,50,0.6,0.4,0,0,67,411,0.845100000000093,0.681181604127992,0.00490197537309798,0.668047199154905,0.00299705263250051 69 | 100,0.1,10,0.8,1,0,0,68,37.6,0.12044999999971,0.68481096567633,0.006865750144157,0.673572725555154,0.00283289405863382 70 | 200,0.01,10,0.6,0.8,0,0,69,354.65,1.29215000000004,0.697203920097822,0.00590570509065455,0.662544382823031,0.00308676401425039 71 | 500,0.01,10,0.8,0.8,0.3,0,70,294,1.65525000000007,0.692027621503717,0.00572792296800501,0.666699420538895,0.00307690572664252 72 | 100,0.01,10,1,1,0.1,0,71,354.25,0.823849999999857,0.686944128108288,0.00537673699767535,0.672116236815533,0.00293699497573473 73 | 20,0.1,50,0.8,0.8,0,0.3,72,63.2,0.0702999999997701,0.681621219078824,0.00681246106837453,0.67424228160857,0.00275400845403047 74 | 500,0.01,20,1,0.8,0,0,73,302.85,1.62250000000031,0.671563882382346,0.00482090086593957,0.663591030773442,0.00328506324579783 75 | 50,0.01,20,0.6,0.6,0,0.1,74,439.1,0.534649999999783,0.688839753262367,0.00490470352968761,0.670552429664283,0.00284717422617784 76 | 100,0.01,50,0.8,0.6,0.01,0,75,352,0.730099999999857,0.68850703808686,0.00441052702708363,0.668423441855795,0.00298872721373063 77 | 200,0.03,20,0.8,0.6,0.01,0,76,106.2,0.439449999999579,0.685086625344178,0.00561891803428811,0.665225461939473,0.00309726838331184 78 | 100,0.1,10,0.6,1,0.01,0,77,38.45,0.11630000000041,0.686491504704842,0.00467111981625413,0.666402138407105,0.00299530707707694 79 | 50,0.03,5,1,0.6,0,0,78,142.95,0.225899999999638,0.690580124000992,0.00439666949312811,0.675481821861031,0.00251543639696928 80 | 100,0.03,5,1,0.8,0,0,79,115.85,0.291449999999895,0.680370684653702,0.00403011344265648,0.673145977263005,0.00257990752573797 81 | 50,0.1,5,1,0.8,0.01,0.01,80,40.7,0.0851500000004307,0.680493857600223,0.00541498946721055,0.675163757594798,0.00249806554664219 82 | 500,0.1,10,0.6,0.6,0,0.3,81,31.7,0.250749999999971,0.680701269319929,0.00518132975978483,0.665368543988071,0.00298240491116882 83 | 
20,0.03,50,0.6,0.4,0,0,82,191.65,0.148100000000159,0.681102285763324,0.00486619612025537,0.672007238397396,0.00253903147844653 84 | 500,0.03,20,1,1,0.3,0,83,105.55,0.460799999999654,0.681254407766112,0.00539410528370488,0.665454612558366,0.00329688424421434 85 | 50,0.01,10,0.8,0.8,0.1,0,84,403.9,0.526550000000134,0.684602304406272,0.00553873354815086,0.677511195535922,0.00271797373543544 86 | 500,0.01,5,0.8,0.8,0,0,85,250.95,2.26700000000001,0.683068863665584,0.00695631345139802,0.661439393211639,0.00299492901785196 87 | 20,0.03,5,1,1,0,0.3,86,177.3,0.146400000000176,0.681727233055966,0.00457537393428769,0.677977316904331,0.00220437907988266 88 | 200,0.1,10,0.6,0.4,0.1,0,87,33.4,0.173550000000614,0.691171045110957,0.00595841476506799,0.664668958025187,0.00302159741096505 89 | 20,0.03,10,1,0.4,0,0,88,165.5,0.136650000000009,0.695444931309763,0.00445787952191472,0.677369201132799,0.00249079326357547 90 | 50,0.03,50,0.6,1,0,0,89,150.75,0.20515000000014,0.684167685462124,0.00475806804423743,0.66903449417703,0.00287945117138197 91 | 100,0.03,20,0.6,1,0,0.3,90,137.55,0.30210000000061,0.682451269263526,0.00585735838440862,0.663600610835717,0.00310435286261044 92 | 20,0.03,20,0.8,0.4,0,0,91,166.6,0.134050000000207,0.6881439940874,0.00484839753591119,0.676658656313493,0.00263576783404208 93 | 20,0.03,5,0.6,0.4,0,0.3,92,202.5,0.150150000000031,0.694271972564648,0.00618543403201642,0.675187206724521,0.00234728869870653 94 | 100,0.03,20,0.6,0.8,0.3,0,93,134.95,0.289899999999398,0.695923818093417,0.00529059134751508,0.664854869595159,0.00296830743095428 95 | 50,0.1,5,1,0.6,0.3,0,94,47.3,0.094300000000112,0.675662672241766,0.00433022837514174,0.674410088407884,0.00241813906829363 96 | 500,0.1,20,0.8,0.4,0.1,0,95,30.9,0.19994999999999,0.676334062459852,0.00503027637237493,0.664243138050541,0.00305264086335631 97 | 20,0.01,5,0.6,0.4,0.01,0,96,516.4,0.34230000000025,0.684775672599605,0.00541396325707667,0.675909748592901,0.00222295705900595 98 | 100,0.01,5,1,0.8,0,0,97,343.1,0.773450000000048,0.691996247697333,0.00446901079866937,0.672028011651912,0.00263524431992524 99 | 500,0.03,20,0.6,0.6,0,0,98,123.9,0.572999999999593,0.682211018873215,0.0042433497730561,0.658870456669596,0.00326430021762419 100 | 500,0.01,20,0.8,0.4,0,0,99,289.15,1.46200000000008,0.675089983164456,0.00523244316060759,0.664723243599223,0.0031504214671758 101 | 200,0.03,5,0.6,0.6,0,0.1,100,113,0.45564999999915,0.691009707170415,0.00440625156404678,0.659341075519736,0.00267614445621424 102 | -------------------------------------------------------------------------------- /3-test_rs/run-tuning.R: -------------------------------------------------------------------------------- 1 | library(data.table) 2 | library(ROCR) 3 | library(lightgbm) 4 | library(dplyr) 5 | 6 | set.seed(1234) 7 | 8 | 9 | # for yr in 1990 1991; do 10 | # wget http://stat-computing.org/dataexpo/2009/$yr.csv.bz2 11 | # bunzip2 $yr.csv.bz2 12 | # done 13 | 14 | 15 | ## TODO: loop 1991..2000 16 | yr <- 1991 17 | 18 | d0_train <- fread(paste0("/var/data/airline/",yr-1,".csv")) 19 | d0_train <- d0_train[!is.na(DepDelay)] 20 | 21 | d0_test <- fread(paste0("/var/data/airline/",yr,".csv")) 22 | d0_test <- d0_test[!is.na(DepDelay)] 23 | 24 | d0 <- rbind(d0_train, d0_test) 25 | 26 | 27 | for (k in c("Month","DayofMonth","DayOfWeek")) { 28 | d0[[k]] <- paste0("c-",as.character(d0[[k]])) # force them to be categoricals 29 | } 30 | 31 | d0[["dep_delayed_15min"]] <- ifelse(d0[["DepDelay"]]>=15,1,0) 32 | 33 | cols_keep <- c("Month", "DayofMonth", "DayOfWeek", "DepTime", 
"UniqueCarrier", 34 | "Origin", "Dest", "Distance", "dep_delayed_15min") 35 | d0 <- d0[, cols_keep, with = FALSE] 36 | 37 | 38 | d0_wrules <- lgb.prepare_rules(d0) # lightgbm special treat of categoricals 39 | d0 <- d0_wrules$data 40 | cols_cats <- names(d0_wrules$rules) 41 | 42 | 43 | n1 <- nrow(d0_train) 44 | n2 <- nrow(d0_test) 45 | 46 | p <- ncol(d0)-1 47 | d1_train <- as.matrix(d0[1:n1,]) 48 | d1_test <- as.matrix(d0[(n1+1):(n1+n2),]) 49 | 50 | 51 | 52 | ## TODO: loop 10K,100K,1M 53 | size <- 100e3 54 | 55 | 56 | 57 | n_random <- 100 ## TODO: 1000? 58 | 59 | params_grid <- expand.grid( # default 60 | num_leaves = c(100,200,500,1000,2000,5000), # 31 (was 127) 61 | ## TODO: num_leaves should be size dependent 62 | ## 100K: num_leaves = c(100,200,500,1000,2000,5000) DONE for 3-test_rs 63 | ## 10K: num_leaves = c(20,50,100,200,500) DONE for 3-test_rs 64 | ## 10M: should be higher than 100K (used same as 100K in 2-train_test_1each) 65 | learning_rate = c(0.01,0.03,0.1), # 0.1 66 | min_data_in_leaf = c(5,10,20,50), # 20 (was 100) 67 | feature_fraction = c(0.6,0.8,1), # 1 68 | bagging_fraction = c(0.4,0.6,0.8,1), # 1 69 | lambda_l1 = c(0,0,0,0, 0.01, 0.1, 0.3), # 0 70 | lambda_l2 = c(0,0,0,0, 0.01, 0.1, 0.3) # 0 71 | ## TODO: 72 | ## min_sum_hessian_in_leaf 73 | ## min_gain_to_split 74 | ## max_bin 75 | ## min_data_in_bin 76 | ) 77 | params_random <- params_grid[sample(1:nrow(params_grid),n_random),] 78 | 79 | 80 | 81 | 82 | ## TODO: resample train 83 | d <- d1_train[sample(1:nrow(d1_train), size),] 84 | 85 | 86 | 87 | ## resample test 88 | size_test <- 100e3 89 | n_test_rs <- 20 90 | d_test_list <- list() 91 | for (i in 1:n_test_rs) { 92 | d_test_list[[i]] <- d1_test[sample(1:nrow(d1_test), size_test),] 93 | } 94 | 95 | 96 | 97 | # ## first run: get train/test with same distribution to check overfitting of best models 98 | # idx <- sample(1:nrow(d1_train), size) 99 | # d <- d1_train[idx,] 100 | # d_test <- d1_train[sample(setdiff(1:nrow(d1_train),idx), size_test),] 101 | # #user system elapsed 102 | # #240025.539 4094.378 31870.404 103 | # ## Results: AUC_test>AUC_rs seems it does not overfit, either underfit or sample variation 104 | # ## TODO: resample train/test to see if underfit or sample variation 105 | 106 | 107 | 108 | # warm up 109 | dlgb_train <- lgb.Dataset(data = d[,1:p], label = d[,p+1]) 110 | system.time({ 111 | md <- lgb.train(data = dlgb_train, objective = "binary", 112 | num_threads = parallel::detectCores()/2, 113 | nrounds = 100, 114 | categorical_feature = cols_cats, 115 | verbose = 0) 116 | }) 117 | system.time({ 118 | md <- lgb.train(data = dlgb_train, objective = "binary", 119 | num_threads = parallel::detectCores()/2, 120 | nrounds = 100, 121 | categorical_feature = cols_cats, 122 | verbose = 0) 123 | }) 124 | 125 | 126 | 127 | system.time({ 128 | 129 | d_res <- data.frame() 130 | for (krpm in 1:nrow(params_random)) { 131 | params <- as.list(params_random[krpm,]) 132 | 133 | ## resample 134 | n_resample <- 20 ## TODO: 100? 135 | 136 | p_train <- 0.8 ## TODO: change? 
(80-10-10 split now) 137 | p_earlystop <- 0.1 138 | p_modelselec <- 1 - p_train - p_earlystop 139 | 140 | mds <- list() 141 | d_res_rs <- data.frame() 142 | for (k in 1:n_resample) { 143 | cat(" krpm:",krpm,"train - k:",k,"\n") 144 | 145 | idx_train <- sample(1:size, size*p_train) 146 | idx_earlystop <- sample(setdiff(1:size,idx_train), size*p_earlystop) 147 | idx_modelselec <- setdiff(setdiff(1:size,idx_train),idx_earlystop) 148 | 149 | d_train <- d[idx_train,] 150 | d_earlystop <- d[idx_earlystop,] 151 | d_modelselec <- d[idx_modelselec,] 152 | 153 | dlgb_train <- lgb.Dataset(data = d_train[,1:p], label = d_train[,p+1]) 154 | dlgb_earlystop <- lgb.Dataset(data = d_earlystop[,1:p], label = d_earlystop[,p+1]) 155 | 156 | 157 | runtm <- system.time({ 158 | md <- lgb.train(data = dlgb_train, objective = "binary", 159 | num_threads = parallel::detectCores()/2, # = number of "real" cores 160 | params = params, 161 | nrounds = 10000, early_stopping_rounds = 10, valid = list(valid = dlgb_earlystop), 162 | categorical_feature = cols_cats, 163 | verbose = 0) 164 | })[[3]] 165 | 166 | phat <- predict(md, data = d_modelselec[,1:p]) 167 | rocr_pred <- prediction(phat, d_modelselec[,p+1]) 168 | auc_rs <- performance(rocr_pred, "auc")@y.values[[1]] 169 | 170 | d_res_rs <- rbind(d_res_rs, data.frame(ntrees = md$best_iter, runtm = runtm, auc_rs = auc_rs)) 171 | mds[[k]] <- md 172 | } 173 | d_res_rs_avg <- d_res_rs %>% summarize(ntrees = mean(ntrees), runtm = mean(runtm), auc_rs_avg = mean(auc_rs), 174 | auc_rs_std = sd(auc_rs)/sqrt(n_resample)) # std of the mean! 175 | 176 | # consider the model as the average of the models from resamples 177 | # TODO?: alternatively could retrain the "final" model on all of data (early stoping or avg number of trees?) 178 | auc_test <- numeric(n_test_rs) 179 | for (i in 1:n_test_rs) { 180 | cat(" krpm:",krpm,"pred test - i:",i,"\n") 181 | d_test <- d_test_list[[i]] 182 | phat <- matrix(0, nrow = n_resample, ncol = nrow(d_test)) 183 | for (k in 1:n_resample) { 184 | phat[k,] <- predict(mds[[k]], data = d_test[,1:p]) 185 | } 186 | phat_avg <- apply(phat, 2, mean) 187 | rocr_pred <- prediction(phat_avg, d_test[,p+1]) 188 | auc_test[[i]] <- performance(rocr_pred, "auc")@y.values[[1]] 189 | } 190 | 191 | d_res <- rbind(d_res, cbind(krpm, d_res_rs_avg, auc_test_avg=mean(auc_test), auc_test_sd=sd(auc_test))) 192 | } 193 | 194 | }) 195 | 196 | d_pm_res <- cbind(params_random, d_res) 197 | 198 | d_pm_res 199 | 200 | 201 | ## TODO: ensemble/avg the top 2,3,...10 models (would need to save all the models) 202 | 203 | ## TODO: easier: see the goodness as a function of how many random iterations 10,100,1000 (can be done in 204 | ## the other analysis file) 205 | 206 | 207 | fwrite(d_pm_res, file = "res.csv") 208 | 209 | -------------------------------------------------------------------------------- /4-train_rs/analyze.Rmd: -------------------------------------------------------------------------------- 1 | ```{r} 2 | 3 | library(data.table) 4 | library(dplyr) 5 | library(ggplot2) 6 | 7 | d <- fread("res.csv") 8 | 9 | d %>% select(krpm, ktr, auc_rs_avg, auc_rs_std) %>% melt(id.vars = c("krpm","ktr")) %>% 10 | dcast(krpm ~ variable + ktr, value.var = "value") %>% 11 | ggplot(aes(x = auc_rs_avg_1, y = auc_rs_avg_2)) + geom_point() + geom_abline(color = "grey70") + 12 | geom_errorbar(aes(x = auc_rs_avg_1, ymin = auc_rs_avg_2 - auc_rs_std_2, ymax = auc_rs_avg_2 + auc_rs_std_2), width = 0.0003, alpha = 0.3) + 13 | geom_errorbarh(aes(xmin = auc_rs_avg_1 - auc_rs_std_1, xmax = auc_rs_avg_1 
+ auc_rs_std_1, y = auc_rs_avg_2), height = 0.0003, alpha = 0.3) + 14 | theme(aspect.ratio=1) 15 | 16 | d %>% select(krpm, ktr, auc_test_avg, auc_test_sd) %>% melt(id.vars = c("krpm","ktr")) %>% 17 | dcast(krpm ~ variable + ktr, value.var = "value") %>% 18 | ggplot(aes(x = auc_test_avg_1, y = auc_test_avg_2)) + geom_point() + geom_abline(color = "grey70") + 19 | geom_errorbar(aes(ymin = auc_test_avg_2 - auc_test_sd_2, ymax = auc_test_avg_2 + auc_test_sd_2), width = 0.0003, alpha = 0.3) + 20 | geom_errorbarh(aes(xmin = auc_test_avg_1 - auc_test_sd_1, xmax = auc_test_avg_1 + auc_test_sd_1), height = 0.0003, alpha = 0.3) + 21 | theme(aspect.ratio=1) 22 | 23 | d %>% arrange(desc(auc_rs_avg)) %>% head 24 | 25 | cor(filter(d, ktr==1)$auc_rs_avg, filter(d, ktr==2)$auc_rs_avg) 26 | cor(filter(d, ktr==1)$auc_test_avg, filter(d, ktr==2)$auc_test_avg) 27 | 28 | cor(filter(d, ktr==1)$auc_rs_avg, filter(d, ktr==2)$auc_rs_avg, method = "spearman") 29 | cor(filter(d, ktr==1)$auc_test_avg, filter(d, ktr==2)$auc_test_avg, method = "spearman") 30 | 31 | 32 | ``` -------------------------------------------------------------------------------- /4-train_rs/fig-AUCcorr.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/szilard/GBM-tune/7047ae1d8a133aefcc338665ffbee3864cc529ff/4-train_rs/fig-AUCcorr.png -------------------------------------------------------------------------------- /4-train_rs/res.csv: -------------------------------------------------------------------------------- 1 | num_leaves,learning_rate,min_data_in_leaf,feature_fraction,bagging_fraction,lambda_l1,lambda_l2,ktr,krpm,ntrees,runtm,auc_rs_avg,auc_rs_std,auc_test_avg,auc_test_sd 2 | 200,0.03,50,0.6,0.8,0.1,0,1,1,504.1,3.53190000000001,0.793905801025043,0.00106888662880647,0.720104145761758,0.00262421526836331 3 | 5000,0.03,50,1,0.6,0,0.01,1,2,228.95,7.01814999999998,0.795255357015507,0.00111619329369637,0.72603360618307,0.00229470344772884 4 | 5000,0.1,5,0.8,1,0,0.01,1,3,33.75,4.89000000000003,0.79420221832431,0.001359150647637,0.739512956165004,0.00219167855421504 5 | 200,0.01,20,0.6,0.8,0,0.01,1,4,1113.6,7.49764999999995,0.790400224976676,0.00118138549194084,0.721952655273603,0.00246560321570195 6 | 100,0.1,5,1,0.4,0,0.3,1,5,274.55,1.15080000000003,0.793942889748699,0.00160525447533963,0.73105395792979,0.0021781798648245 7 | 500,0.1,10,0.8,0.6,0,0.01,1,6,90.25,1.27959999999994,0.807188940364849,0.00130714954023486,0.736817246869031,0.00223059630535718 8 | 5000,0.01,20,1,0.6,0,0,1,7,420.45,28.18875,0.789555570903933,0.00149786991599192,0.722173143682352,0.00205279214245765 9 | 1000,0.1,20,0.8,0.6,0.01,0,1,8,69.25,1.93214999999982,0.804291069411897,0.00152124510659648,0.735476095341969,0.00227027638795461 10 | 5000,0.01,20,0.8,0.8,0.01,0.01,1,9,498.6,28.5813999999999,0.807588148085269,0.00128596538376813,0.733799494272582,0.00226063045482493 11 | 2000,0.01,10,1,0.4,0.01,0,1,10,414.75,19.283,0.791190577217193,0.00172326403574193,0.725768879527109,0.00206050429268505 12 | 2000,0.1,20,1,1,0.1,0.01,1,11,46.2,2.58030000000017,0.783104757480599,0.00150656606999049,0.724723942132371,0.00199038884379344 13 | 500,0.03,10,1,0.8,0.1,0,1,12,272.35,3.65495000000028,0.803618364204889,0.00182892049364106,0.734320942385842,0.00203408160187565 14 | 500,0.1,5,0.8,1,0.3,0,1,13,99.4,1.39904999999981,0.812731715401344,0.00146599446958762,0.742175034110207,0.00220072114540164 15 | 
2000,0.01,50,1,0.4,0,0.3,1,14,739.3,18.9164000000002,0.800294125890865,0.00113617288837682,0.726742916008345,0.00218467742674742 16 | 5000,0.01,50,0.6,0.6,0,0,1,15,741.7,17.97955,0.796295946397814,0.00146445645667981,0.724816312877629,0.00261191571516278 17 | 5000,0.03,5,0.6,0.4,0.3,0.1,1,16,193.5,14.7054999999998,0.802891138483193,0.00114100880036976,0.735333592914232,0.00276835879478427 18 | 2000,0.1,5,0.6,0.4,0,0,1,17,55.25,2.86294999999936,0.793690038374923,0.00179313655330786,0.730733891206031,0.00239496315903648 19 | 5000,0.01,50,0.6,0.4,0.3,0,1,18,824.15,16.4132499999996,0.793816910078084,0.00116823144712513,0.725749181224415,0.00262283802547571 20 | 5000,0.1,20,0.8,0.4,0,0,1,19,46.1,3.58650000000016,0.800169861201316,0.00106482944177794,0.73584980414696,0.00213711133633759 21 | 5000,0.1,10,0.8,0.6,0.01,0,1,20,37.15,4.42585000000017,0.795526106603878,0.0013752549262254,0.740932993819892,0.00235561928101491 22 | 5000,0.01,5,0.6,0.8,0,0,1,21,365.05,37.8797500000001,0.800643568276404,0.00103767883513895,0.733783099778605,0.00265293511439358 23 | 2000,0.03,50,0.6,1,0,0,1,22,342.55,7.59269999999961,0.799032405907668,0.0011955077814005,0.726456901322868,0.002484336157961 24 | 1000,0.1,10,0.6,1,0,0,1,23,67.6,1.75005000000056,0.797488481106552,0.00129132758505242,0.725580127880839,0.00231580438944509 25 | 100,0.01,20,1,1,0,0,1,24,1600.1,6.85114999999969,0.791487710994319,0.00128880898947135,0.724248472216486,0.00216234414634961 26 | 5000,0.01,20,1,0.8,0,0,1,25,435.7,29.1157000000003,0.78613175600386,0.00147974055812543,0.721915516385107,0.00203499412940279 27 | 200,0.03,10,1,0.8,0.01,0.1,1,26,446.95,2.97055,0.798848515535208,0.00166081971728235,0.730836158392021,0.00206970040573764 28 | 100,0.1,50,1,0.8,0.01,0,1,27,238.2,1.14959999999919,0.791499917730912,0.00163559973012393,0.728148278551643,0.0023457645216369 29 | 2000,0.1,10,0.6,1,0,0.3,1,28,66.15,3.10290000000096,0.801338478647986,0.00183689783816229,0.727346061714427,0.00257389224616518 30 | 100,0.01,20,1,0.8,0.1,0.1,1,29,1922.45,8.11680000000051,0.793707275863755,0.00178313129745767,0.727170750822353,0.00218151649516174 31 | 2000,0.03,50,1,0.4,0,0,1,30,248.85,7.04645000000055,0.798471800357539,0.00174163431093758,0.727635180544621,0.00221860051577636 32 | 5000,0.1,50,0.6,0.6,0,0,1,31,80.8,2.25945000000029,0.789132840154337,0.00145661181636077,0.72026250634151,0.00258272428161756 33 | 2000,0.01,50,1,1,0.1,0,1,32,746.65,19.6640000000003,0.797017127053799,0.00138705606190533,0.727412167358476,0.00221518545595577 34 | 100,0.01,5,1,1,0,0,1,33,1671.5,6.74414999999899,0.789640767199149,0.00165510380691628,0.724640663947056,0.00211832218588762 35 | 2000,0.01,5,0.8,1,0,0,1,34,448.3,20.1550000000003,0.814425963365831,0.00158732355472366,0.740697013301097,0.00206802583025785 36 | 2000,0.03,10,0.8,1,0,0,1,35,154,7.28020000000033,0.809938012971029,0.00146544094138551,0.735545080836477,0.00200463487557121 37 | 2000,0.01,10,1,0.4,0,0.1,1,36,411.9,19.1537999999997,0.789880999840495,0.0018535160338422,0.725478516428844,0.00203927435422693 38 | 2000,0.1,5,0.8,1,0,0,1,37,44.3,2.48930000000037,0.805446920893754,0.00147965834963657,0.744709289593932,0.00185250980477981 39 | 1000,0.01,5,1,0.8,0.1,0,1,38,577.3,13.5154500000004,0.803086309735378,0.00113695839615118,0.736259649996772,0.00211764829455085 40 | 200,0.03,50,0.6,0.8,0.3,0.3,1,39,520.6,3.68299999999945,0.792419713989116,0.00134136273499682,0.720529464574446,0.0025306555444744 41 | 
500,0.01,10,0.6,0.8,0.01,0.1,1,40,807.6,9.81514999999927,0.802947023447654,0.00196169831392044,0.730749534109147,0.00244464806524729 42 | 1000,0.01,5,0.8,0.4,0.3,0,1,41,643.6,14.5579499999993,0.817063211471807,0.00139674415385282,0.742203077848788,0.00217559038246652 43 | 1000,0.1,20,0.8,0.8,0,0.01,1,42,68.05,1.78869999999879,0.80315987090797,0.00159477497927378,0.734825849805336,0.00213925990637409 44 | 100,0.1,5,0.6,0.6,0,0,1,43,188.3,0.869449999998324,0.781740241493597,0.00157122514908131,0.715984408001844,0.00239002087022445 45 | 100,0.01,10,1,0.6,0,0.01,1,44,1629.25,6.75590000000084,0.789512922574876,0.00179958805335766,0.725488295284682,0.00207150585988432 46 | 500,0.1,20,0.8,0.4,0,0,1,45,90.65,1.35794999999926,0.805455241249067,0.00115298035920194,0.735150809004502,0.00210519116614145 47 | 200,0.03,50,0.6,0.8,0,0,1,46,511.35,3.57229999999909,0.792449745369873,0.00174843104092957,0.719982022993806,0.00261258482736249 48 | 500,0.1,20,0.8,0.4,0.1,0.01,1,47,91.8,1.33845000000219,0.805027762175657,0.00147123887974479,0.735673871588295,0.00214632092369826 49 | 200,0.03,50,1,0.8,0,0,1,48,564.1,3.82754999999961,0.801038472195254,0.00151342781691579,0.731511841190933,0.00227111910913615 50 | 200,0.01,10,1,1,0.01,0,1,49,1222.9,7.82815000000046,0.79893478698257,0.0014971970768063,0.729271338288688,0.00204987273986805 51 | 2000,0.01,20,1,0.6,0,0.1,1,50,468.5,20.8639999999999,0.791258901133065,0.00142989401559474,0.72321815003807,0.00206618864908291 52 | 5000,0.01,10,0.8,0.8,0,0,1,51,356.6,36.7361000000012,0.809394228552169,0.00162079493898032,0.734454912075929,0.0021886929729417 53 | 1000,0.03,50,0.8,0.4,0,0,1,52,283.05,6.29649999999965,0.806866485929959,0.001670280169734,0.734353792573771,0.00230618646624106 54 | 5000,0.1,5,0.8,0.4,0,0.1,1,53,35.3,5.04424999999828,0.796250458440206,0.00124110360152061,0.741548744509239,0.0023507653042562 55 | 1000,0.01,10,1,0.8,0,0,1,54,547.9,12.4923000000003,0.800822191113726,0.00166606975898452,0.732879601312407,0.00208594425428183 56 | 200,0.03,50,1,0.6,0,0,1,55,548.35,3.6932000000008,0.801432220974519,0.00127242693392209,0.730213078432996,0.00231314369831071 57 | 100,0.1,50,0.8,0.8,0,0,1,56,258.1,1.18924999999726,0.797671475907398,0.00204085891120507,0.728525137540649,0.00240036264953448 58 | 2000,0.01,5,1,0.4,0,0,1,57,408.2,18.9316999999981,0.79242727338728,0.00151918004528413,0.729462796901685,0.0020879330113742 59 | 200,0.03,5,0.6,1,0,0.1,1,58,439.9,2.79960000000065,0.792171418711296,0.00181541785758186,0.724023811086297,0.00245479800756146 60 | 1000,0.01,20,0.6,0.8,0,0,1,59,646.1,13.5250999999989,0.80448030576963,0.00141573212914468,0.730137798061434,0.0025978330714033 61 | 5000,0.03,5,0.6,0.8,0.3,0.1,1,60,192.45,14.7803,0.802958520848044,0.000928758151080977,0.735576673762922,0.00267209650167068 62 | 1000,0.01,50,0.6,0.6,0,0.3,1,61,755.9,13.2256499999996,0.795266276671961,0.00157578753308772,0.725259527891383,0.00255668178753854 63 | 5000,0.01,20,0.6,0.4,0,0,1,62,602.35,30.9717999999979,0.802783295272016,0.00124537962568191,0.729869580941692,0.00254086374732206 64 | 2000,0.1,5,0.6,0.8,0,0,1,63,56.95,2.90719999999783,0.795722977962547,0.0015914844478567,0.730000974945623,0.00249404134737002 65 | 5000,0.01,5,1,0.8,0,0,1,64,284.25,32.2957999999984,0.764076373697371,0.000763198754237687,0.711042851628733,0.00201818934539481 66 | 100,0.03,10,1,0.8,0.01,0,1,65,613.45,2.60614999999889,0.793180429516792,0.00172164260113474,0.725995424256534,0.00206238387501942 67 | 
500,0.01,50,0.6,0.8,0.3,0.01,1,66,918.55,11.2574999999968,0.796390609434806,0.00135369455838411,0.724798022081823,0.00254925979395592 68 | 1000,0.03,50,0.6,0.4,0,0,1,67,335.4,6.42245000000185,0.798637213929461,0.0019247645458311,0.726194477400681,0.00259949261650511 69 | 1000,0.01,20,0.8,1,0,0,1,68,631.4,14.0124500000013,0.80973143775291,0.00147498494860103,0.736046656639667,0.00215088879491356 70 | 2000,0.01,10,0.6,0.8,0,0,1,69,583.1,24.5231000000007,0.803982047536657,0.00141778713287913,0.732427230707998,0.00252018489560955 71 | 100,0.1,10,0.8,0.8,0.3,0,1,70,257.2,1.11225000000122,0.79774448393598,0.00187695397689562,0.731677679506446,0.00238364904241514 72 | 2000,0.01,10,1,1,0.1,0,1,71,422.8,19.771449999998,0.78992512461342,0.00169557689089815,0.726148639991001,0.00197509698598974 73 | 200,0.03,5,1,0.8,0,0.3,1,72,502,3.21804999999949,0.800721230104083,0.00194719550629154,0.732564374948409,0.002138357853034 74 | 100,0.03,20,1,0.8,0,0,1,73,630,2.69080000000104,0.7936727903737,0.00133155514481926,0.726172217030416,0.00212510516590683 75 | 200,0.1,20,0.6,0.6,0,0.1,1,74,156.15,1.14384999999966,0.790723438468786,0.00172634423720046,0.720301401239119,0.00254574867015647 76 | 1000,0.01,50,0.8,0.6,0.01,0,1,75,768.2,16.6460000000006,0.805200903480153,0.00187651255007236,0.732310995175689,0.00225451944216642 77 | 5000,0.1,20,0.8,0.6,0.01,0,1,76,45.65,3.32170000000187,0.797352638006782,0.00169879833393911,0.736405802939465,0.00209297516485633 78 | 1000,0.01,20,0.6,1,0.01,0,1,77,643.75,13.4942500000005,0.803159181984495,0.00174071861275551,0.730356605009945,0.00249242533688458 79 | 500,0.03,5,1,0.6,0,0,1,78,254.3,3.34794999999722,0.803193467694077,0.00130650661008677,0.734734060099204,0.00202344112501416 80 | 200,0.1,5,1,0.8,0,0,1,79,152.1,1.04454999999871,0.79221711298618,0.00154473188644518,0.732566308847452,0.00205237706856113 81 | 100,0.03,10,1,0.8,0.01,0.01,1,80,650.75,2.66905000000115,0.792011935480015,0.0016570447644805,0.726159636849049,0.00210839949646842 82 | 200,0.1,20,0.6,0.6,0,0.3,1,81,158.2,1.13779999999679,0.794724321916074,0.00162537003842448,0.720887360158766,0.002576286002573 83 | 500,0.1,50,0.6,0.4,0,0,1,82,110.6,1.58434999999736,0.791531940738197,0.00144263089892718,0.719841674942062,0.00253673911646034 84 | 500,0.1,20,1,1,0.3,0,1,83,95.6,1.41030000000173,0.799839584709049,0.00194038408687047,0.733365553223478,0.00207740428115262 85 | 5000,0.03,10,0.8,0.8,0.1,0,1,84,137.3,12.0814000000028,0.807551088089016,0.00160571702167219,0.734640236786444,0.00217314203255126 86 | 500,0.03,5,0.8,0.8,0,0,1,85,303.3,3.72704999999987,0.810426471098616,0.00106985958454304,0.737779342071767,0.00204631579008364 87 | 1000,0.01,10,1,1,0,0.3,1,86,583.75,13.7119999999981,0.802271483911748,0.00135195370787542,0.732964796698112,0.00200991876489421 88 | 2000,0.01,20,0.6,0.4,0.1,0,1,87,615.1,21.9512000000017,0.802851760906882,0.00161948181348644,0.730852262032564,0.00258670569443369 89 | 5000,0.03,10,1,0.4,0,0,1,88,112.65,12.7780999999974,0.773902454357874,0.00134830621231674,0.715637428009879,0.00216292416155162 90 | 2000,0.03,50,0.6,1,0,0,1,89,339.55,7.67400000000052,0.79772622880322,0.00140917739258954,0.726139641934392,0.00253573132959354 91 | 200,0.03,50,0.6,1,0,0.3,1,90,457.65,3.25805000000255,0.79212644776585,0.00148779510802636,0.719299788344214,0.00255194396812954 92 | 1000,0.03,20,0.8,0.4,0,0,1,91,222.55,5.08160000000207,0.811834163682165,0.0018755089969409,0.736120341053846,0.00218502802130753 93 | 
2000,0.01,10,0.6,0.4,0,0.3,1,92,609.3,24.7743999999992,0.80249165575174,0.00147960348180678,0.733663586965519,0.00262259941558552 94 | 5000,0.03,20,0.6,0.8,0.3,0,1,93,283.6,10.364800000003,0.805107889009231,0.0014195807541009,0.734067755373421,0.00262506570624142 95 | 1000,0.1,5,1,0.6,0.3,0,1,94,64.75,1.83549999999668,0.803005814501122,0.00156232003383349,0.738448116626551,0.00203140812805499 96 | 200,0.01,50,0.8,0.4,0.1,0,1,95,1561.5,10.4491500000018,0.802608012469911,0.00163668196859569,0.729664513873702,0.00235689851263105 97 | 1000,0.03,5,0.6,0.4,0.01,0,1,96,262.2,5.58195000000123,0.806784624900963,0.00198672030655701,0.733824428009013,0.00253813096199881 98 | 500,0.03,5,1,0.8,0,0,1,97,267.8,3.46060000000143,0.804444031489751,0.00145679130189519,0.735158708052718,0.00202520788096225 99 | 100,0.1,20,0.6,0.6,0,0,1,98,186.8,0.885349999998289,0.782284817859267,0.001788158137131,0.716645015810987,0.00253234463316703 100 | 2000,0.03,20,0.8,0.4,0,0,1,99,172.45,7.67790000000241,0.807572796520594,0.000916076983385521,0.73434165978659,0.00210590813402388 101 | 100,0.03,10,0.6,0.6,0,0.1,1,100,535.05,2.36975000000239,0.781921480671981,0.00161720683072429,0.717418518742145,0.00249300203026497 102 | 200,0.03,50,0.6,0.8,0.1,0,2,1,507.8,3.55710000000108,0.796651872940974,0.00155784807251844,0.723806269319257,0.002464451392231 103 | 5000,0.03,50,1,0.6,0,0.01,2,2,252.3,7.77644999999902,0.804149370686455,0.00149713619132941,0.731905729166247,0.00213460188017622 104 | 5000,0.1,5,0.8,1,0,0.01,2,3,34.25,4.89875000000029,0.801078378879306,0.00127647856842087,0.7448646118166,0.00215842450214794 105 | 200,0.01,20,0.6,0.8,0,0.01,2,4,1162.85,7.81430000000109,0.79645927475835,0.00127834953393609,0.724914948921653,0.0026230566723172 106 | 100,0.1,5,1,0.4,0,0.3,2,5,302.75,1.24915000000037,0.80079378796733,0.0014842457228913,0.737465277694748,0.00222302878865863 107 | 500,0.1,10,0.8,0.6,0,0.01,2,6,95.25,1.35995000000257,0.810297735631866,0.00120888814314256,0.743106832807054,0.00249068682123391 108 | 5000,0.01,20,1,0.6,0,0,2,7,440.25,29.3507000000042,0.792022958863099,0.00193149867687988,0.728991754379029,0.00203718223198981 109 | 1000,0.1,20,0.8,0.6,0.01,0,2,8,77.3,2.03349999999627,0.809601419633188,0.0017847929342283,0.740580594124623,0.00230799146566778 110 | 5000,0.01,20,0.8,0.8,0.01,0.01,2,9,495,28.5890999999989,0.809645570948168,0.00117114692384066,0.736736759001573,0.00224873118805945 111 | 2000,0.01,10,1,0.4,0.01,0,2,10,427.45,19.931650000003,0.794751089889443,0.00189719528502638,0.733167892004734,0.00215922533745532 112 | 2000,0.1,20,1,1,0.1,0.01,2,11,49.15,2.66015000000189,0.792049143317766,0.00148501981689572,0.730559757461451,0.00209310778642823 113 | 500,0.03,10,1,0.8,0.1,0,2,12,279.05,3.72869999999675,0.808567517924433,0.00139478155892105,0.738779940424911,0.00234083829025599 114 | 500,0.1,5,0.8,1,0.3,0,2,13,105.3,1.43474999999598,0.815143774366359,0.00139827374520169,0.745447137841828,0.00231162436788071 115 | 2000,0.01,50,1,0.4,0,0.3,2,14,752.4,19.1770999999979,0.80067611949487,0.00146391292687027,0.73132321095765,0.00209335663190181 116 | 5000,0.01,50,0.6,0.6,0,0,2,15,743.35,17.772950000003,0.794083785579861,0.00121308248156169,0.726723489051691,0.00254092489277719 117 | 5000,0.03,5,0.6,0.4,0.3,0.1,2,16,193.85,14.846100000001,0.80682344441032,0.00139035602055143,0.736794059137755,0.00243285857256828 118 | 2000,0.1,5,0.6,0.4,0,0,2,17,54.7,2.80669999999664,0.799013341639539,0.00102161112363232,0.732697363516798,0.00230664031596371 119 | 
5000,0.01,50,0.6,0.4,0.3,0,2,18,805.9,16.0020000000004,0.797574213494165,0.00175822107652717,0.727233744068022,0.00255928858505525 120 | 5000,0.1,20,0.8,0.4,0,0,2,19,46.4,3.62724999999627,0.803506596217067,0.00137365741939645,0.739109745305122,0.00219228661540642 121 | 5000,0.1,10,0.8,0.6,0.01,0,2,20,37.2,4.43780000000115,0.800826483538408,0.00139988130366162,0.7442499518059,0.00226908541101441 122 | 5000,0.01,5,0.6,0.8,0,0,2,21,365.7,37.996149999999,0.806841303664462,0.00141013251223749,0.734791421044041,0.00231253120033373 123 | 2000,0.03,50,0.6,1,0,0,2,22,346.25,7.70034999999916,0.80020461091839,0.00156905119804281,0.728081806330127,0.00254061360864922 124 | 1000,0.1,10,0.6,1,0,0,2,23,68.35,1.74604999999865,0.803553164197127,0.00145865681629724,0.729051447389239,0.00230041451643216 125 | 100,0.01,20,1,1,0,0,2,24,1661.55,7.11590000000433,0.794500689518312,0.00136989723435819,0.728790921903975,0.00218048137776457 126 | 5000,0.01,20,1,0.8,0,0,2,25,453.65,30.4132500000022,0.791281210039613,0.00153832359112372,0.729148196772137,0.00218415787930537 127 | 200,0.03,10,1,0.8,0.01,0.1,2,26,480.95,3.1900999999998,0.802301012749428,0.00142241982073234,0.735716274924048,0.0022384301689428 128 | 100,0.1,50,1,0.8,0.01,0,2,27,239.65,1.09759999999951,0.796146303032846,0.00166344396713081,0.732128408283232,0.00211769476521909 129 | 2000,0.1,10,0.6,1,0,0.3,2,28,66.1,3.06794999999693,0.801160474834757,0.00160606582751412,0.729606897225387,0.00232035674103594 130 | 100,0.01,20,1,0.8,0.1,0.1,2,29,1870.55,7.9409500000038,0.797662670683433,0.00128720261040851,0.730007227996349,0.00223525171086526 131 | 2000,0.03,50,1,0.4,0,0,2,30,240.2,6.72434999999969,0.802749529849692,0.00163088574030902,0.731389440636933,0.00213436367020721 132 | 5000,0.1,50,0.6,0.6,0,0,2,31,89.45,2.36600000000035,0.792852059889372,0.00152988325377686,0.72192650321926,0.00242101400929022 133 | 2000,0.01,50,1,1,0.1,0,2,32,738.5,19.2625500000009,0.802452601674024,0.001256357527888,0.731097571377929,0.00205472576646149 134 | 100,0.01,5,1,1,0,0,2,33,1856,7.29824999999837,0.796925996780658,0.00161009683593372,0.72999944221107,0.00226809645453189 135 | 2000,0.01,5,0.8,1,0,0,2,34,452.8,20.2342500000028,0.819936453489451,0.00122226042415329,0.745739651119728,0.00233419459930788 136 | 2000,0.03,10,0.8,1,0,0,2,35,156.8,7.32205000000249,0.814574725460621,0.00138056575827667,0.740345815333325,0.00221116471600471 137 | 2000,0.01,10,1,0.4,0,0.1,2,36,435.85,20.3272999999987,0.797779107911364,0.00184474190552823,0.733051396723064,0.00213087490416201 138 | 2000,0.1,5,0.8,1,0,0,2,37,44.7,2.48619999999646,0.81187105132393,0.000872784369953204,0.748381445334838,0.0023433912017249 139 | 1000,0.01,5,1,0.8,0.1,0,2,38,585.75,13.7892499999987,0.811938430271137,0.00142544829146679,0.741789157559362,0.00232527402546597 140 | 200,0.03,50,0.6,0.8,0.3,0.3,2,39,534.95,3.91554999999498,0.794625929071286,0.00162518914974419,0.724294081238971,0.00252354388794757 141 | 500,0.01,10,0.6,0.8,0.01,0.1,2,40,877.9,10.6945500000045,0.806168884661405,0.00173568854156837,0.733258202054076,0.00249368473966721 142 | 1000,0.01,5,0.8,0.4,0.3,0,2,41,682.6,15.3618499999982,0.823272440938158,0.00116973236409404,0.747171528658011,0.00234542631435175 143 | 1000,0.1,20,0.8,0.8,0,0.01,2,42,77.1,1.97684999999765,0.811400205876065,0.00175064694367046,0.7404881581458,0.00243596430459658 144 | 100,0.1,5,0.6,0.6,0,0,2,43,195.55,0.873749999998836,0.782202915934724,0.00142560093820152,0.721824324548352,0.0025455333764921 145 | 
100,0.01,10,1,0.6,0,0.01,2,44,1775.8,7.24535000000033,0.79488518658197,0.00249102540345342,0.730124326082988,0.00227526864044374 146 | 500,0.1,20,0.8,0.4,0,0,2,45,94.2,1.32805000000226,0.809806726620525,0.00122019382370122,0.739586803922123,0.0024078304614511 147 | 200,0.03,50,0.6,0.8,0,0,2,46,502.35,3.50409999999974,0.795561806925349,0.00165153551427979,0.723144573080934,0.00252411305465961 148 | 500,0.1,20,0.8,0.4,0.1,0.01,2,47,96.55,1.38034999999945,0.809394111289872,0.00155540498441278,0.739357130542322,0.00239583071442902 149 | 200,0.03,50,1,0.8,0,0,2,48,562.3,3.75495000000083,0.802168379064912,0.00146254962583536,0.735295105679722,0.00223541144083555 150 | 200,0.01,10,1,1,0.01,0,2,49,1170.9,7.52795000000188,0.802762205316612,0.00148855275010644,0.733710868162545,0.002300457003218 151 | 2000,0.01,20,1,0.6,0,0.1,2,50,484.1,21.4430000000022,0.794620769570691,0.00172619702330079,0.729139845504428,0.00212213981910749 152 | 5000,0.01,10,0.8,0.8,0,0,2,51,362.75,37.4852999999959,0.81190496981767,0.00168340359063729,0.738328901090869,0.00222776925716911 153 | 1000,0.03,50,0.8,0.4,0,0,2,52,278.45,6.19975000000268,0.81166158144268,0.0015136754782275,0.736339309164483,0.00231225115194496 154 | 5000,0.1,5,0.8,0.4,0,0.1,2,53,35.3,5.0632500000007,0.801856320669281,0.00174572434433644,0.747796852509413,0.0021526066596599 155 | 1000,0.01,10,1,0.8,0,0,2,54,547.75,12.5406499999997,0.805638639142985,0.00117535625525876,0.738014605244079,0.00230881556926307 156 | 200,0.03,50,1,0.6,0,0,2,55,558.35,3.75920000000333,0.804195073348239,0.0017066002323658,0.73506879252267,0.00218731851016899 157 | 100,0.1,50,0.8,0.8,0,0,2,56,264.55,1.17669999999926,0.800284458952591,0.00106443144118611,0.732868220898818,0.00236010561325203 158 | 2000,0.01,5,1,0.4,0,0,2,57,414.8,19.2172500000044,0.802405898157213,0.00112722349256947,0.73764031154505,0.00219433568483179 159 | 200,0.03,5,0.6,1,0,0.1,2,58,437.8,2.81775000000052,0.796584885398873,0.00136771942393086,0.728193066798084,0.00257260114870976 160 | 1000,0.01,20,0.6,0.8,0,0,2,59,662.75,13.9209500000055,0.803468411987375,0.0012598372303866,0.73273824598531,0.00247972260183335 161 | 5000,0.03,5,0.6,0.8,0.3,0.1,2,60,194.6,14.8489500000025,0.807729187752187,0.00147716560777223,0.736554594559007,0.00245994292792971 162 | 1000,0.01,50,0.6,0.6,0,0.3,2,61,763.9,13.4030999999988,0.794385609809018,0.00173755704962652,0.726578274945311,0.00244879807542226 163 | 5000,0.01,20,0.6,0.4,0,0,2,62,608.45,31.2417000000016,0.803431028820349,0.00133316080264915,0.731854032903435,0.0025509676420841 164 | 2000,0.1,5,0.6,0.8,0,0,2,63,56.15,2.90120000000461,0.801275289112911,0.00149825422244744,0.731862643139424,0.00237580533719092 165 | 5000,0.01,5,1,0.8,0,0,2,64,281.3,32.2028000000079,0.774731056352573,0.00169146954054313,0.719820754134346,0.00213692009139224 166 | 100,0.03,10,1,0.8,0.01,0,2,65,632.45,2.76400000000431,0.798607790691524,0.00163344417903281,0.731087018552676,0.00230865690215469 167 | 500,0.01,50,0.6,0.8,0.3,0.01,2,66,1038.65,12.7694500000071,0.80073140853397,0.00132445715834122,0.726056682574606,0.00243222706022596 168 | 1000,0.03,50,0.6,0.4,0,0,2,67,346.65,6.68520000000426,0.801186025874124,0.00129597064320547,0.728117348245811,0.00251649960762606 169 | 1000,0.01,20,0.8,1,0,0,2,68,665.15,14.7334499999997,0.815313163256013,0.00134630940148532,0.740189674208029,0.00232682172424398 170 | 2000,0.01,10,0.6,0.8,0,0,2,69,575.2,24.2790999999997,0.809573127940098,0.00141862132401634,0.734099582817844,0.00247607404844462 171 | 
100,0.1,10,0.8,0.8,0.3,0,2,70,265.05,1.14319999999716,0.803521305334898,0.00130652508223555,0.736169161466691,0.00235719480107851 172 | 2000,0.01,10,1,1,0.1,0,2,71,434.1,20.3410499999969,0.797585211124335,0.00179715492848779,0.733200620062865,0.00219136038851572 173 | 200,0.03,5,1,0.8,0,0.3,2,72,548.95,3.43894999999902,0.806922971178566,0.00164950370995726,0.737628249712726,0.0022789751990044 174 | 100,0.03,20,1,0.8,0,0,2,73,699.35,2.94704999999085,0.798541323330541,0.00138457493973324,0.731405839741641,0.00224775125546478 175 | 200,0.1,20,0.6,0.6,0,0.1,2,74,172.45,1.25479999999516,0.794958996140526,0.000928898165620188,0.724577055180956,0.00261965100864866 176 | 1000,0.01,50,0.8,0.6,0.01,0,2,75,833.15,17.992750000002,0.809403260225483,0.00161813207314964,0.736347127478115,0.00224938088027606 177 | 5000,0.1,20,0.8,0.6,0.01,0,2,76,46.2,3.31890000001004,0.803042595969958,0.00162948554139918,0.738487878741082,0.00222235326219155 178 | 1000,0.01,20,0.6,1,0.01,0,2,77,666.65,14.090200000006,0.807223257757524,0.00133055884074376,0.732806664746113,0.00253495430126326 179 | 500,0.03,5,1,0.6,0,0,2,78,269.05,3.52035000000615,0.81142302824477,0.00174727636122202,0.740104390205438,0.00226221967381416 180 | 200,0.1,5,1,0.8,0,0,2,79,146.55,0.989899999997579,0.796910710391103,0.00162798496539423,0.736809406471142,0.00219783177716781 181 | 100,0.03,10,1,0.8,0.01,0.01,2,80,627.1,2.65395000000426,0.794063117626496,0.0015939457161282,0.730767354981441,0.00231991347888117 182 | 200,0.1,20,0.6,0.6,0,0.3,2,81,185.75,1.31755000000121,0.793965921524558,0.00178611463044059,0.725152887265669,0.00258949879119995 183 | 500,0.1,50,0.6,0.4,0,0,2,82,97.35,1.45400000000664,0.78976401100018,0.00140091198462759,0.720946775859678,0.0024101240000016 184 | 500,0.1,20,1,1,0.3,0,2,83,100.1,1.51735000000044,0.804633293262471,0.0015847644348341,0.738882815221631,0.00231629700815876 185 | 5000,0.03,10,0.8,0.8,0.1,0,2,84,139.4,12.3201000000001,0.808703700192124,0.00111099937356941,0.739145251046477,0.00224223937952005 186 | 500,0.03,5,0.8,0.8,0,0,2,85,310.7,3.87964999999967,0.819431083301472,0.00118042661722577,0.743605353557596,0.00238404649151062 187 | 1000,0.01,10,1,1,0,0.3,2,86,604.2,14.2168000000005,0.808003986409487,0.00113921249719731,0.737517176081704,0.00222323891391618 188 | 2000,0.01,20,0.6,0.4,0.1,0,2,87,629.7,22.540300000002,0.803631999820312,0.000989466686791003,0.732439104932466,0.00250721941092098 189 | 5000,0.03,10,1,0.4,0,0,2,88,113.95,13.2454499999993,0.781522229904992,0.00152763961276802,0.72353161865186,0.00216848455548257 190 | 2000,0.03,50,0.6,1,0,0,2,89,349.5,7.77540000000445,0.800467256645226,0.00120330332016256,0.727484275916496,0.00246339638737405 191 | 200,0.03,50,0.6,1,0,0.3,2,90,508.85,3.58810000000231,0.797329782672206,0.0012069347377825,0.723218818034168,0.00249951753353999 192 | 1000,0.03,20,0.8,0.4,0,0,2,91,235.2,5.37410000000091,0.813145432546416,0.00140819892111941,0.739336634215442,0.00246277766020585 193 | 2000,0.01,10,0.6,0.4,0,0.3,2,92,603.55,24.7003500000021,0.80825835738083,0.00131419207393419,0.735122924479664,0.00245691292882541 194 | 5000,0.03,20,0.6,0.8,0.3,0,2,93,276.3,10.0945999999996,0.810100483059994,0.00161181176228853,0.73499570942628,0.00257387512071734 195 | 1000,0.1,5,1,0.6,0.3,0,2,94,64.3,1.83490000000165,0.808795169737479,0.00133456790060563,0.742987787300846,0.00222886501429502 196 | 200,0.01,50,0.8,0.4,0.1,0,2,95,1528.95,10.3123500000016,0.805774888086704,0.00163992415962307,0.734354113150634,0.0022148250946036 197 | 
1000,0.03,5,0.6,0.4,0.01,0,2,96,283.1,6.04970000000321,0.810607965945351,0.00173152041356796,0.736695499826128,0.0024000846804161 198 | 500,0.03,5,1,0.8,0,0,2,97,277,3.56434999999474,0.809050081631466,0.00119687977054429,0.739272140886748,0.00228748684537395 199 | 100,0.1,20,0.6,0.6,0,0,2,98,215.9,1.01034999999683,0.787336123300701,0.0017614469076344,0.720612322029443,0.00265663589360298 200 | 2000,0.03,20,0.8,0.4,0,0,2,99,173.85,7.7542000000074,0.809097680533991,0.00168307699373113,0.736958508997588,0.0022540189948254 201 | 100,0.03,10,0.6,0.6,0,0.1,2,100,557.55,2.48014999999432,0.788244793143524,0.00175268348002644,0.722313023226255,0.00257850559220422 202 | -------------------------------------------------------------------------------- /4-train_rs/run-tuning.R: -------------------------------------------------------------------------------- 1 | library(data.table) 2 | library(ROCR) 3 | library(lightgbm) 4 | library(dplyr) 5 | 6 | set.seed(1234) 7 | 8 | 9 | # for yr in 1990 1991; do 10 | # wget http://stat-computing.org/dataexpo/2009/$yr.csv.bz2 11 | # bunzip2 $yr.csv.bz2 12 | # done 13 | 14 | 15 | ## TODO: loop 1991..2000 16 | yr <- 1991 17 | 18 | d0_train <- fread(paste0("/var/data/airline/",yr-1,".csv")) 19 | d0_train <- d0_train[!is.na(DepDelay)] 20 | 21 | d0_test <- fread(paste0("/var/data/airline/",yr,".csv")) 22 | d0_test <- d0_test[!is.na(DepDelay)] 23 | 24 | d0 <- rbind(d0_train, d0_test) 25 | 26 | 27 | for (k in c("Month","DayofMonth","DayOfWeek")) { 28 | d0[[k]] <- paste0("c-",as.character(d0[[k]])) # force them to be categoricals 29 | } 30 | 31 | d0[["dep_delayed_15min"]] <- ifelse(d0[["DepDelay"]]>=15,1,0) 32 | 33 | cols_keep <- c("Month", "DayofMonth", "DayOfWeek", "DepTime", "UniqueCarrier", 34 | "Origin", "Dest", "Distance", "dep_delayed_15min") 35 | d0 <- d0[, cols_keep, with = FALSE] 36 | 37 | 38 | d0_wrules <- lgb.prepare_rules(d0) # lightgbm special treat of categoricals 39 | d0 <- d0_wrules$data 40 | cols_cats <- names(d0_wrules$rules) 41 | 42 | 43 | n1 <- nrow(d0_train) 44 | n2 <- nrow(d0_test) 45 | 46 | p <- ncol(d0)-1 47 | d1_train <- as.matrix(d0[1:n1,]) 48 | d1_test <- as.matrix(d0[(n1+1):(n1+n2),]) 49 | 50 | 51 | 52 | ## TODO: loop 10K,100K,1M 53 | size <- 100e3 54 | 55 | 56 | 57 | n_random <- 100 ## TODO: 1000? 
@@@ 58 | 59 | params_grid <- expand.grid( # default 60 | num_leaves = c(100,200,500,1000,2000,5000), # 31 (was 127) 61 | ## TODO: num_leaves should be size dependent 62 | ## 100K: num_leaves = c(100,200,500,1000,2000,5000) DONE for 3-test_rs 63 | ## 10K: num_leaves = c(20,50,100,200,500) DONE for 3-test_rs 64 | ## 10M: should be higher than 100K (used same as 100K in 2-train_test_1each) 65 | learning_rate = c(0.01,0.03,0.1), # 0.1 66 | min_data_in_leaf = c(5,10,20,50), # 20 (was 100) 67 | feature_fraction = c(0.6,0.8,1), # 1 68 | bagging_fraction = c(0.4,0.6,0.8,1), # 1 69 | lambda_l1 = c(0,0,0,0, 0.01, 0.1, 0.3), # 0 70 | lambda_l2 = c(0,0,0,0, 0.01, 0.1, 0.3) # 0 71 | ## TODO: 72 | ## min_sum_hessian_in_leaf 73 | ## min_gain_to_split 74 | ## max_bin 75 | ## min_data_in_bin 76 | ) 77 | params_random <- params_grid[sample(1:nrow(params_grid),n_random),] 78 | 79 | 80 | 81 | ## resample test 82 | size_test <- 100e3 83 | n_test_rs <- 20 ## @@@ 84 | d_test_list <- list() 85 | for (i in 1:n_test_rs) { 86 | d_test_list[[i]] <- d1_test[sample(1:nrow(d1_test), size_test),] 87 | } 88 | 89 | 90 | 91 | 92 | d_pm_res <- data.frame() 93 | n_train_rs <- 2 ## @@@ 94 | for (ktr in 1:n_train_rs) { 95 | ## resample train 96 | d <- d1_train[sample(1:nrow(d1_train), size),] 97 | 98 | 99 | 100 | # warm up 101 | dlgb_train <- lgb.Dataset(data = d[,1:p], label = d[,p+1]) 102 | md <- lgb.train(data = dlgb_train, objective = "binary", 103 | num_threads = parallel::detectCores()/2, 104 | nrounds = 100, 105 | categorical_feature = cols_cats, 106 | verbose = 0) 107 | md <- lgb.train(data = dlgb_train, objective = "binary", 108 | num_threads = parallel::detectCores()/2, 109 | nrounds = 100, 110 | categorical_feature = cols_cats, 111 | verbose = 0) 112 | 113 | 114 | 115 | d_res <- data.frame() 116 | for (krpm in 1:nrow(params_random)) { 117 | params <- as.list(params_random[krpm,]) 118 | 119 | ## resample 120 | n_resample <- 20 ## TODO: 100? @@@ 121 | 122 | p_train <- 0.8 ## TODO: change? 
(80-10-10 split now) 123 | p_earlystop <- 0.1 124 | p_modelselec <- 1 - p_train - p_earlystop 125 | 126 | mds <- list() 127 | d_res_rs <- data.frame() 128 | for (k in 1:n_resample) { 129 | cat("ktr: ",ktr," ~ krpm:",krpm,"train - k:",k,"\n") 130 | 131 | idx_train <- sample(1:size, size*p_train) 132 | idx_earlystop <- sample(setdiff(1:size,idx_train), size*p_earlystop) 133 | idx_modelselec <- setdiff(setdiff(1:size,idx_train),idx_earlystop) 134 | 135 | d_train <- d[idx_train,] 136 | d_earlystop <- d[idx_earlystop,] 137 | d_modelselec <- d[idx_modelselec,] 138 | 139 | dlgb_train <- lgb.Dataset(data = d_train[,1:p], label = d_train[,p+1]) 140 | dlgb_earlystop <- lgb.Dataset(data = d_earlystop[,1:p], label = d_earlystop[,p+1]) 141 | 142 | 143 | runtm <- system.time({ 144 | md <- lgb.train(data = dlgb_train, objective = "binary", 145 | num_threads = parallel::detectCores()/2, # = number of "real" cores 146 | params = params, 147 | nrounds = 10000, early_stopping_rounds = 10, valid = list(valid = dlgb_earlystop), 148 | categorical_feature = cols_cats, 149 | verbose = 0) 150 | })[[3]] 151 | 152 | phat <- predict(md, data = d_modelselec[,1:p]) 153 | rocr_pred <- prediction(phat, d_modelselec[,p+1]) 154 | auc_rs <- performance(rocr_pred, "auc")@y.values[[1]] 155 | 156 | d_res_rs <- rbind(d_res_rs, data.frame(ntrees = md$best_iter, runtm = runtm, auc_rs = auc_rs)) 157 | mds[[k]] <- md 158 | } 159 | d_res_rs_avg <- d_res_rs %>% summarize(ntrees = mean(ntrees), runtm = mean(runtm), auc_rs_avg = mean(auc_rs), 160 | auc_rs_std = sd(auc_rs)/sqrt(n_resample)) # std of the mean! 161 | 162 | # consider the model as the average of the models from resamples 163 | # TODO?: alternatively could retrain the "final" model on all of data (early stoping or avg number of trees?) 
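## Editor's sketch (commented out, not part of the original run): one way to do the retraining
## alternative mentioned in the TODO above is to fit a single "final" model on all of `d` using
## the average number of trees found via early stopping in the resamples. The names `ntrees_avg`,
## `dlgb_all` and `md_final` are hypothetical; everything else reuses objects defined in this script.
# ntrees_avg <- round(mean(d_res_rs$ntrees))
# dlgb_all <- lgb.Dataset(data = d[,1:p], label = d[,p+1])
# md_final <- lgb.train(data = dlgb_all, objective = "binary",
#                       num_threads = parallel::detectCores()/2,
#                       params = params, nrounds = ntrees_avg,
#                       categorical_feature = cols_cats, verbose = 0)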
164 | auc_test <- numeric(n_test_rs) 165 | for (i in 1:n_test_rs) { 166 | cat("ktr: ",ktr," ~ krpm:",krpm,"pred test - i:",i,"\n") 167 | d_test <- d_test_list[[i]] 168 | phat <- matrix(0, nrow = n_resample, ncol = nrow(d_test)) 169 | for (k in 1:n_resample) { 170 | phat[k,] <- predict(mds[[k]], data = d_test[,1:p]) 171 | } 172 | phat_avg <- apply(phat, 2, mean) 173 | rocr_pred <- prediction(phat_avg, d_test[,p+1]) 174 | auc_test[[i]] <- performance(rocr_pred, "auc")@y.values[[1]] 175 | } 176 | 177 | d_res <- rbind(d_res, cbind(krpm, d_res_rs_avg, auc_test_avg=mean(auc_test), auc_test_sd=sd(auc_test))) 178 | } 179 | 180 | d_pm_res <- rbind(d_pm_res, cbind(params_random, ktr, d_res)) 181 | 182 | } 183 | 184 | d_pm_res 185 | 186 | 187 | ## TODO: ensemble/avg the top 2,3,...10 models (would need to save all the models) 188 | 189 | ## TODO: easier: see the goodness as a function of how many random iterations 10,100,1000 (can be done in 190 | ## the other analysis file) 191 | 192 | 193 | fwrite(d_pm_res, file = "res.csv") 194 | 195 | -------------------------------------------------------------------------------- /5-ensemble/analyze.Rmd: -------------------------------------------------------------------------------- 1 | ```{r} 2 | 3 | library(data.table) 4 | library(dplyr) 5 | library(ggplot2) 6 | 7 | d <- fread("res.csv") 8 | 9 | d %>% arrange(desc(auc_test_avg)) %>% head 10 | 11 | d %>% mutate(is_ens = krpm==0) %>% group_by(is_ens) %>% mutate(rank = dense_rank(desc(auc_rs_avg))) %>% 12 | mutate(rank = ifelse(is_ens,-4,rank)) %>% 13 | ggplot() + geom_point(aes(x = rank, y = auc_test_avg, color = is_ens)) + 14 | geom_errorbar(aes(x = rank, ymin = auc_test_avg-auc_test_sd, ymax = auc_test_avg+auc_test_sd, color = is_ens), width = 1) + scale_color_manual(values = c("black","red")) 15 | 16 | ``` 17 | -------------------------------------------------------------------------------- /5-ensemble/fig-AUCens.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/szilard/GBM-tune/7047ae1d8a133aefcc338665ffbee3864cc529ff/5-ensemble/fig-AUCens.png -------------------------------------------------------------------------------- /5-ensemble/res.csv: -------------------------------------------------------------------------------- 1 | num_leaves,learning_rate,min_data_in_leaf,feature_fraction,bagging_fraction,lambda_l1,lambda_l2,krpm,ntrees,runtm,auc_rs_avg,auc_rs_std,auc_test_avg,auc_test_sd 2 | 200,0.03,50,0.6,0.8,0.1,0,1,495.4,4.2116,0.788293542150508,0.00158925841709515,0.723144981513305,0.00231210649064027 3 | 5000,0.03,50,1,0.6,0,0.01,2,250.2,11.95425,0.791957372334741,0.0016983440798005,0.732644813533755,0.00184936771557532 4 | 5000,0.1,5,0.8,1,0,0.01,3,33.95,8.40489999999999,0.800584990894827,0.00125561250169598,0.749156047358491,0.00205425405313052 5 | 200,0.01,20,0.6,0.8,0,0.01,4,1095,9.28054999999996,0.787733303807144,0.00139923381827546,0.726305966732273,0.00219388438993 6 | 100,0.1,5,1,0.4,0,0.3,5,245.6,1.27539999999999,0.787748037585345,0.00175069839702623,0.734584447940127,0.00174277287312448 7 | 500,0.1,10,0.8,0.6,0,0.01,6,94.85,1.96545000000003,0.80226824092923,0.00139347722179284,0.74388159141264,0.00195665992109027 8 | 5000,0.01,20,1,0.6,0,0,7,425.75,51.0857,0.783812254109439,0.00141642619180188,0.729878313905452,0.00201571905467454 9 | 1000,0.1,20,0.8,0.6,0.01,0,8,73.4,3.11755000000012,0.803575477662336,0.00137451984079618,0.741784069868219,0.00200961594346767 10 | 
5000,0.01,20,0.8,0.8,0.01,0.01,9,495.25,50.66085,0.803605251039444,0.0011858826523806,0.740235984061904,0.00205524056219106 11 | 2000,0.01,10,1,0.4,0.01,0,10,416.25,35.5298,0.789582079093407,0.00108290678009551,0.734522475627661,0.0021752978633002 12 | 2000,0.1,20,1,1,0.1,0.01,11,48.3,4.66410000000005,0.782793859342886,0.00176687197892049,0.732377940766706,0.00200076670603934 13 | 500,0.03,10,1,0.8,0.1,0,12,293.2,5.96374999999998,0.798848700176517,0.00131237327456868,0.740064991742361,0.00180926968337642 14 | 500,0.1,5,0.8,1,0.3,0,13,98.3,2.1503999999999,0.808687509171401,0.00109755393574709,0.74750471046781,0.00182114048879434 15 | 2000,0.01,50,1,0.4,0,0.3,14,751.45,32.5882000000001,0.794866339838377,0.00136862614851209,0.732252690371904,0.00186170712241115 16 | 5000,0.01,50,0.6,0.6,0,0,15,740.35,28.2793000000001,0.788415289132339,0.00133923545076051,0.727240901772118,0.00234334883930632 17 | 5000,0.03,5,0.6,0.4,0.3,0.1,16,192.95,27.135,0.799515837718476,0.00111123553848021,0.736697713270535,0.00218853112726501 18 | 2000,0.1,5,0.6,0.4,0,0,17,53.75,4.87974999999969,0.787648275108135,0.00138958308137535,0.733687152886856,0.00236932775774525 19 | 5000,0.01,50,0.6,0.4,0.3,0,18,867.6,26.9306999999999,0.79100370065621,0.00193123256853869,0.728030045701859,0.00244012940176136 20 | 5000,0.1,20,0.8,0.4,0,0,19,45.5,5.96479999999992,0.796739445500034,0.00150650302498103,0.742324617720152,0.00196696195171257 21 | 5000,0.1,10,0.8,0.6,0.01,0,20,37.05,7.67849999999971,0.795873550952099,0.00204851323147363,0.747506533898696,0.00209707699434227 22 | 5000,0.01,5,0.6,0.8,0,0,21,362.2,68.7490499999999,0.798256252445463,0.00111622142563911,0.735377201486105,0.00228922554472575 23 | 2000,0.03,50,0.6,1,0,0,22,342.95,13.4049500000001,0.792788758255409,0.00149785713135705,0.728803581199091,0.00234581964943645 24 | 1000,0.1,10,0.6,1,0,0,23,68.45,2.88149999999969,0.792471424638641,0.00124256400546119,0.728397405525895,0.00225491639952752 25 | 100,0.01,20,1,1,0,0,24,1645.45,8.26669999999995,0.785826811613146,0.00158112660662412,0.728229162699025,0.00194203192523001 26 | 5000,0.01,20,1,0.8,0,0,25,428.5,50.23105,0.786487357607407,0.00132727388538261,0.730919245190628,0.00189512396616456 27 | 200,0.03,10,1,0.8,0.01,0.1,26,485.1,4.38060000000005,0.795641467335251,0.00131291227883532,0.736311244503527,0.00181110155176873 28 | 100,0.1,50,1,0.8,0.01,0,27,246.05,1.45604999999941,0.790261502666769,0.0016645496440058,0.733571151652941,0.00207124454098566 29 | 2000,0.1,10,0.6,1,0,0.3,28,65.8,5.74185000000034,0.794651416557506,0.00183914228458343,0.730302779720246,0.00220699265055946 30 | 100,0.01,20,1,0.8,0.1,0.1,29,1694.05,8.7122000000003,0.787058740836131,0.00156426436343612,0.729339764323513,0.00202882988515877 31 | 2000,0.03,50,1,0.4,0,0,30,249.1,12.3219500000005,0.795107911134776,0.00175967140416962,0.732815561705604,0.00177608810759066 32 | 5000,0.1,50,0.6,0.6,0,0,31,85.15,3.67974999999969,0.781608830117892,0.00136730476394089,0.722706214904799,0.0023913919757029 33 | 2000,0.01,50,1,1,0.1,0,32,732.25,33.6602000000003,0.791373398715735,0.00125192596717394,0.73150973050598,0.00189656481822839 34 | 100,0.01,5,1,1,0,0,33,1665.3,8.62790000000005,0.783047428929322,0.00133934464078655,0.728894284045484,0.00167598347404097 35 | 2000,0.01,5,0.8,1,0,0,34,444,36.9664499999999,0.814246520891619,0.00146550244632235,0.747448516251052,0.00215970620571533 36 | 2000,0.03,10,0.8,1,0,0,35,160.4,13.9895,0.808969982904969,0.00169131589691264,0.7434116112629,0.00210789768432663 37 | 
2000,0.01,10,1,0.4,0,0.1,36,419.45,35.4347000000002,0.789777530453203,0.00128271102112973,0.735077273336161,0.00200906014525673 38 | 2000,0.1,5,0.8,1,0,0,37,44.85,4.59219999999987,0.805986931445475,0.00136046330915719,0.751420407398978,0.00183069193021998 39 | 1000,0.01,5,1,0.8,0.1,0,38,584.95,23.0081999999999,0.804238661849547,0.00202627426717689,0.743043571202373,0.00173785951871213 40 | 200,0.03,50,0.6,0.8,0.3,0.3,39,510.5,4.63990000000013,0.788824253057131,0.00136455972079891,0.723455852245052,0.00228988891310819 41 | 500,0.01,10,0.6,0.8,0.01,0.1,40,830.2,15.9611000000001,0.799308227148977,0.00153607497781551,0.732630616995861,0.00213115630921976 42 | 1000,0.01,5,0.8,0.4,0.3,0,41,670.15,25.5952500000001,0.817481840844254,0.00165116559720666,0.748884126772546,0.00208144240868024 43 | 1000,0.1,20,0.8,0.8,0,0.01,42,74.25,3.22185000000027,0.802576121300649,0.00128850673626037,0.741066769913599,0.00203453676097778 44 | 100,0.1,5,0.6,0.6,0,0,43,196.3,1.18004999999939,0.774298062345061,0.00207836022345232,0.720095553115763,0.00189210168750444 45 | 100,0.01,10,1,0.6,0,0.01,44,1662.4,8.32379999999939,0.785380049192551,0.00148629778321049,0.728962629076868,0.00186936791930906 46 | 500,0.1,20,0.8,0.4,0,0,45,91.25,2.04319999999898,0.797558217347458,0.00152201322051121,0.7402857265489,0.00199773612413769 47 | 200,0.03,50,0.6,0.8,0,0,46,488.1,4.35525000000052,0.784899296106119,0.00132369484799808,0.723096256418486,0.00225917624214396 48 | 500,0.1,20,0.8,0.4,0.1,0.01,47,96.4,2.1390999999996,0.797946262801347,0.00170218393119264,0.740601479650344,0.00203354775057054 49 | 200,0.03,50,1,0.8,0,0,48,521.5,4.57219999999943,0.797420448007672,0.00178201903946785,0.736297825209765,0.00208313395427846 50 | 200,0.01,10,1,1,0.01,0,49,1190.05,10.2939499999997,0.794070124306061,0.00159378718165542,0.73460409608153,0.00175250917521503 51 | 2000,0.01,20,1,0.6,0,0.1,50,471.7,38.2425000000003,0.787584766809749,0.00177232444592121,0.732204854752618,0.00197468796122166 52 | 5000,0.01,10,0.8,0.8,0,0,51,359,69.4015999999989,0.808426481128471,0.00144265608014597,0.741359330376635,0.00205965699989979 53 | 1000,0.03,50,0.8,0.4,0,0,52,266.55,9.93134999999893,0.802373763572969,0.00150748953737239,0.73738010637954,0.00204549457317267 54 | 5000,0.1,5,0.8,0.4,0,0.1,53,34.35,8.646899999999,0.797877690903286,0.00144380596268803,0.749175829588418,0.00192904703265573 55 | 1000,0.01,10,1,0.8,0,0,54,559.55,21.6616499999982,0.796610195254037,0.00139605134288494,0.738979540470026,0.00203805407047546 56 | 200,0.03,50,1,0.6,0,0,55,517.1,4.71495000000104,0.792870558332542,0.00187299298957691,0.735304104574053,0.00202504628939085 57 | 100,0.1,50,0.8,0.8,0,0,56,256.25,1.5369999999999,0.790337370250814,0.00219399982790179,0.733124931420132,0.00219917440838352 58 | 2000,0.01,5,1,0.4,0,0,57,411.5,34.6811500000003,0.791995983000322,0.00127322172531698,0.738962611809297,0.00203410106789552 59 | 200,0.03,5,0.6,1,0,0.1,58,402.1,3.66659999999974,0.784098940043924,0.00109832227671711,0.727290693021978,0.00183015307280379 60 | 1000,0.01,20,0.6,0.8,0,0,59,653.35,23.4239500000007,0.799111486696331,0.0013422793112804,0.731988233104752,0.00239111567985236 61 | 5000,0.03,5,0.6,0.8,0.3,0.1,60,194.1,27.050250000001,0.799264195258167,0.00159479858666593,0.736732735521519,0.0022530268872956 62 | 1000,0.01,50,0.6,0.6,0,0.3,61,811.05,23.9070000000014,0.790283296098649,0.00142699598213304,0.72778894288424,0.0022725601284837 63 | 5000,0.01,20,0.6,0.4,0,0,62,606.65,54.6755499999992,0.796486436348218,0.00125857838124688,0.731623270704461,0.00230573275752607 64 | 
2000,0.1,5,0.6,0.8,0,0,63,54.4,5.22209999999941,0.787023687649596,0.00172089075863786,0.733162207092356,0.00223386645383782 65 | 5000,0.01,5,1,0.8,0,0,64,280.9,58.5305000000011,0.768032007357778,0.00157616724277141,0.725218905988437,0.00212367641779034 66 | 100,0.03,10,1,0.8,0.01,0,65,655.9,3.6067999999992,0.787595685864723,0.00160859627023955,0.729991375616491,0.0017896625727561 67 | 500,0.01,50,0.6,0.8,0.3,0.01,66,985.1,18.3434999999987,0.792490086555918,0.00139795262070104,0.72749253444922,0.00238061286335489 68 | 1000,0.03,50,0.6,0.4,0,0,67,341,11.1403000000009,0.791711000084207,0.0012312198707244,0.727575957896919,0.0022995445758235 69 | 1000,0.01,20,0.8,1,0,0,68,642,23.98815,0.808348077426237,0.00131795526200489,0.741730778536304,0.00211213007428613 70 | 2000,0.01,10,0.6,0.8,0,0,69,538.2,43.0634500000007,0.800207121329932,0.00122617337251054,0.734076606451219,0.00227992933496627 71 | 100,0.1,10,0.8,0.8,0.3,0,70,276.2,1.7760499999993,0.794218745787054,0.00145862752188895,0.735763492629865,0.00200787581780642 72 | 2000,0.01,10,1,1,0.1,0,71,419.2,35.4963499999994,0.789422881940988,0.00160831929802662,0.734835867950074,0.00211827812933281 73 | 200,0.03,5,1,0.8,0,0.3,72,528.35,4.95095000000001,0.797940600852207,0.00146418422780454,0.738816491836559,0.00162389383662084 74 | 100,0.03,20,1,0.8,0,0,73,594.85,3.33074999999917,0.787248193039144,0.0015095926067435,0.729203081167186,0.00199485532342905 75 | 200,0.1,20,0.6,0.6,0,0.1,74,153.6,1.7637999999999,0.785011862092736,0.00179062996239697,0.724642602069858,0.00222991050659879 76 | 1000,0.01,50,0.8,0.6,0.01,0,75,773.05,27.8829999999991,0.80302322760738,0.00130443431220004,0.737695971157471,0.00191918802124848 77 | 5000,0.1,20,0.8,0.6,0.01,0,76,45.9,5.80624999999927,0.797987851704324,0.00160202005784862,0.741811052971049,0.00196855705524214 78 | 1000,0.01,20,0.6,1,0.01,0,77,646.1,23.4822499999995,0.797815263183489,0.00122797913300311,0.731649817554127,0.00232169789786185 79 | 500,0.03,5,1,0.6,0,0,78,267.7,5.72485000000015,0.79869507732157,0.00159429881378461,0.741150785002158,0.00159313826876036 80 | 200,0.1,5,1,0.8,0,0,79,135.2,1.64800000000032,0.790763532789102,0.00207318742425423,0.737824834767947,0.00154456134422182 81 | 100,0.03,10,1,0.8,0.01,0.01,80,658.8,3.71229999999923,0.78697334898199,0.00195011338881419,0.73104877207955,0.00180487124072758 82 | 200,0.1,20,0.6,0.6,0,0.3,81,176.4,1.99185000000034,0.785322871859092,0.00195387999890304,0.726041660344395,0.0022798540882487 83 | 500,0.1,50,0.6,0.4,0,0,82,107.95,2.59484999999877,0.785446782081459,0.00208425690128443,0.722858502011796,0.00257226890179431 84 | 500,0.1,20,1,1,0.3,0,83,102.5,2.57604999999967,0.795178874639564,0.00151627269458569,0.740928605191836,0.00194437038582323 85 | 5000,0.03,10,0.8,0.8,0.1,0,84,131.55,21.3587000000007,0.80488407077742,0.00172530990281775,0.740626871808947,0.00212868924896759 86 | 500,0.03,5,0.8,0.8,0,0,85,304.15,6.35740000000114,0.812303723050231,0.00140132501513907,0.746379581498471,0.00187572781206878 87 | 1000,0.01,10,1,1,0,0.3,86,615.2,24.3977500000023,0.795929605159923,0.00123997624342262,0.739202175570606,0.00198937599094205 88 | 2000,0.01,20,0.6,0.4,0.1,0,87,619.5,42.7379999999997,0.797846630896894,0.00170025357130534,0.732888836215258,0.00231594969809365 89 | 5000,0.03,10,1,0.4,0,0,88,110.65,23.584150000001,0.774677223435328,0.00165776619906156,0.727149987264661,0.00228475330489443 90 | 2000,0.03,50,0.6,1,0,0,89,338.85,13.5963499999969,0.792048394020365,0.0011583906922407,0.728503077675136,0.00244606187346474 91 | 
200,0.03,50,0.6,1,0,0.3,90,508.1,4.93350000000137,0.787553367101601,0.00140422646059083,0.722680216642422,0.00228618049520376 92 | 1000,0.03,20,0.8,0.4,0,0,91,224.9,9.08435000000027,0.805055829497583,0.00175639649313409,0.741663764773563,0.00212240734729544 93 | 2000,0.01,10,0.6,0.4,0,0.3,92,599.8,47.3322500000002,0.802653411373368,0.00100788888358833,0.735266881000808,0.00226822798817389 94 | 5000,0.03,20,0.6,0.8,0.3,0,93,264.9,17.0623000000007,0.795563309195819,0.00180120267667718,0.73492908212705,0.0023583713741441 95 | 1000,0.1,5,1,0.6,0.3,0,94,62.5,3.2159499999987,0.798470400235447,0.00117431891096264,0.745209705077008,0.00188995567928804 96 | 200,0.01,50,0.8,0.4,0.1,0,95,1480,13.2478000000017,0.798375782795846,0.0011231844011915,0.734993243028463,0.00207382755742672 97 | 1000,0.03,5,0.6,0.4,0.01,0,96,251.1,9.88720000000103,0.797275655936365,0.00136004630381936,0.735383890141345,0.00224481656330881 98 | 500,0.03,5,1,0.8,0,0,97,258.05,5.60144999999975,0.7987741476891,0.00171947420420652,0.74088849289184,0.00164564057247875 99 | 100,0.1,20,0.6,0.6,0,0,98,176.6,1.33499999999985,0.773051885819882,0.00130847522100934,0.71866603961427,0.00221475239438002 100 | 2000,0.03,20,0.8,0.4,0,0,99,173.15,14.9591000000008,0.807887265763082,0.00120023895564672,0.739892105989912,0.00207923763164267 101 | 100,0.03,10,0.6,0.6,0,0.1,100,505.15,2.97780000000057,0.77397432032672,0.00164066104084039,0.720098084346301,0.00190502172765001 102 | ,,,,,,,0,,,,,0.748092655481907,0.00199707595601828 103 | -------------------------------------------------------------------------------- /5-ensemble/run-tuning.R: -------------------------------------------------------------------------------- 1 | library(data.table) 2 | library(ROCR) 3 | library(lightgbm) 4 | library(dplyr) 5 | 6 | set.seed(1234) 7 | 8 | 9 | # for yr in 1990 1991; do 10 | # wget http://stat-computing.org/dataexpo/2009/$yr.csv.bz2 11 | # bunzip2 $yr.csv.bz2 12 | # done 13 | 14 | 15 | ## TODO: loop 1991..2000 16 | yr <- 1991 17 | 18 | d0_train <- fread(paste0("/var/data/airline/",yr-1,".csv")) 19 | d0_train <- d0_train[!is.na(DepDelay)] 20 | 21 | d0_test <- fread(paste0("/var/data/airline/",yr,".csv")) 22 | d0_test <- d0_test[!is.na(DepDelay)] 23 | 24 | d0 <- rbind(d0_train, d0_test) 25 | 26 | 27 | for (k in c("Month","DayofMonth","DayOfWeek")) { 28 | d0[[k]] <- paste0("c-",as.character(d0[[k]])) # force them to be categoricals 29 | } 30 | 31 | d0[["dep_delayed_15min"]] <- ifelse(d0[["DepDelay"]]>=15,1,0) 32 | 33 | cols_keep <- c("Month", "DayofMonth", "DayOfWeek", "DepTime", "UniqueCarrier", 34 | "Origin", "Dest", "Distance", "dep_delayed_15min") 35 | d0 <- d0[, cols_keep, with = FALSE] 36 | 37 | 38 | d0_wrules <- lgb.prepare_rules(d0) # lightgbm special treat of categoricals 39 | d0 <- d0_wrules$data 40 | cols_cats <- names(d0_wrules$rules) 41 | 42 | 43 | n1 <- nrow(d0_train) 44 | n2 <- nrow(d0_test) 45 | 46 | p <- ncol(d0)-1 47 | d1_train <- as.matrix(d0[1:n1,]) 48 | d1_test <- as.matrix(d0[(n1+1):(n1+n2),]) 49 | 50 | 51 | 52 | ## TODO: loop 10K,100K,1M 53 | size <- 100e3 ### @@@ 100K 54 | 55 | 56 | 57 | n_random <- 100 ## TODO: 1000? @@@ 100 58 | 59 | params_grid <- expand.grid( # default 60 | num_leaves = c(100,200,500,1000,2000,5000), # 31 (was 127) 61 | ## TODO: num_leaves should be size dependent. this is good for 100K, 62 | ## but 10K requires less (20..500) and 1M more (1,2,5,10,20K?) 
63 | learning_rate = c(0.01,0.03,0.1), # 0.1 64 | min_data_in_leaf = c(5,10,20,50), # 20 (was 100) 65 | feature_fraction = c(0.6,0.8,1), # 1 66 | bagging_fraction = c(0.4,0.6,0.8,1), # 1 67 | lambda_l1 = c(0,0,0,0, 0.01, 0.1, 0.3), # 0 68 | lambda_l2 = c(0,0,0,0, 0.01, 0.1, 0.3) # 0 69 | ## TODO: 70 | ## min_sum_hessian_in_leaf 71 | ## min_gain_to_split 72 | ## max_bin 73 | ## min_data_in_bin 74 | ) 75 | params_random <- params_grid[sample(1:nrow(params_grid),n_random),] 76 | 77 | 78 | 79 | 80 | ## TODO: resample train 81 | d <- d1_train[sample(1:nrow(d1_train), size),] 82 | 83 | 84 | 85 | ## resample test 86 | size_test <- 100e3 87 | n_test_rs <- 20 ### @@@ 20 88 | d_test_list <- list() 89 | for (i in 1:n_test_rs) { 90 | d_test_list[[i]] <- d1_test[sample(1:nrow(d1_test), size_test),] 91 | } 92 | 93 | 94 | 95 | d_res <- data.frame() 96 | mds <- list() 97 | for (krpm in 1:nrow(params_random)) { 98 | params <- as.list(params_random[krpm,]) 99 | 100 | ## resample 101 | n_resample <- 20 ## TODO: 100? @@@ 20 102 | 103 | p_train <- 0.8 ## TODO: change? (80-10-10 split now) 104 | p_earlystop <- 0.1 105 | p_modelselec <- 1 - p_train - p_earlystop 106 | 107 | mds[[krpm]] <- list() 108 | d_res_rs <- data.frame() 109 | for (k in 1:n_resample) { 110 | cat(" krpm:",krpm,"train - k:",k,"\n") 111 | 112 | idx_train <- sample(1:size, size*p_train) 113 | idx_earlystop <- sample(setdiff(1:size,idx_train), size*p_earlystop) 114 | idx_modelselec <- setdiff(setdiff(1:size,idx_train),idx_earlystop) 115 | 116 | d_train <- d[idx_train,] 117 | d_earlystop <- d[idx_earlystop,] 118 | d_modelselec <- d[idx_modelselec,] 119 | 120 | dlgb_train <- lgb.Dataset(data = d_train[,1:p], label = d_train[,p+1]) 121 | dlgb_earlystop <- lgb.Dataset(data = d_earlystop[,1:p], label = d_earlystop[,p+1]) 122 | 123 | 124 | runtm <- system.time({ 125 | md <- lgb.train(data = dlgb_train, objective = "binary", 126 | num_threads = parallel::detectCores()/2, # = number of "real" cores 127 | params = params, 128 | nrounds = 10000, early_stopping_rounds = 10, valid = list(valid = dlgb_earlystop), 129 | categorical_feature = cols_cats, 130 | reset_data = TRUE, 131 | verbose = 0) 132 | })[[3]] 133 | gc(verbose = FALSE) 134 | 135 | phat <- predict(md, data = d_modelselec[,1:p]) 136 | rocr_pred <- prediction(phat, d_modelselec[,p+1]) 137 | auc_rs <- performance(rocr_pred, "auc")@y.values[[1]] 138 | 139 | d_res_rs <- rbind(d_res_rs, data.frame(ntrees = md$best_iter, runtm = runtm, auc_rs = auc_rs)) 140 | mds[[krpm]][[k]] <- md 141 | } 142 | d_res_rs_avg <- d_res_rs %>% summarize(ntrees = mean(ntrees), runtm = mean(runtm), auc_rs_avg = mean(auc_rs), 143 | auc_rs_std = sd(auc_rs)/sqrt(n_resample)) # std of the mean! 144 | 145 | # consider the model as the average of the models from resamples 146 | # TODO?: alternatively could retrain the "final" model on all of data (early stoping or avg number of trees?) 
147 | auc_test <- numeric(n_test_rs) 148 | for (i in 1:n_test_rs) { 149 | cat(" krpm:",krpm,"pred test - i:",i,"\n") 150 | d_test <- d_test_list[[i]] 151 | phat <- matrix(0, nrow = n_resample, ncol = nrow(d_test)) 152 | for (k in 1:n_resample) { 153 | phat[k,] <- predict(mds[[krpm]][[k]], data = d_test[,1:p]) 154 | } 155 | phat_avg <- apply(phat, 2, mean) 156 | rocr_pred <- prediction(phat_avg, d_test[,p+1]) 157 | auc_test[[i]] <- performance(rocr_pred, "auc")@y.values[[1]] 158 | } 159 | 160 | d_res <- rbind(d_res, cbind(krpm, d_res_rs_avg, auc_test_avg=mean(auc_test), auc_test_sd=sd(auc_test))) 161 | } 162 | 163 | d_pm_res <- cbind(params_random, d_res) 164 | 165 | 166 | n_rs_best <- 10 167 | krpm_top <- d_pm_res %>% arrange(desc(auc_rs_avg)) %>% head(n_rs_best) %>% .$krpm 168 | auc_test <- numeric(n_test_rs) 169 | for (i in 1:n_test_rs) { 170 | cat("ensemble pred test - i:",i,"\n") 171 | d_test <- d_test_list[[i]] 172 | phat_avg1 <- matrix(0, nrow = n_rs_best, ncol = nrow(d_test)) 173 | for (j in 1:n_rs_best) { 174 | phat <- matrix(0, nrow = n_resample, ncol = nrow(d_test)) 175 | for (k in 1:n_resample) { 176 | phat[k,] <- predict(mds[[krpm_top[j]]][[k]], data = d_test[,1:p]) 177 | } 178 | phat_avg1[j,] <- apply(phat, 2, mean) 179 | } 180 | phat_avg2 <- apply(phat_avg1, 2, mean) 181 | rocr_pred <- prediction(phat_avg2, d_test[,p+1]) 182 | auc_test[[i]] <- performance(rocr_pred, "auc")@y.values[[1]] 183 | } 184 | 185 | fwrite(d_pm_res, file = "res.csv") 186 | fwrite(data.frame(NA,NA,NA,NA,NA,NA,NA,0,NA,NA,NA,NA,mean(auc_test),sd(auc_test)), append = TRUE, col.names = FALSE, file = "res.csv") 187 | 188 | 189 | -------------------------------------------------------------------------------- /GBM-tune.Rproj: -------------------------------------------------------------------------------- 1 | Version: 1.0 2 | 3 | RestoreWorkspace: Default 4 | SaveWorkspace: Default 5 | AlwaysSaveHistory: Default 6 | 7 | EnableCodeIndexing: Yes 8 | UseSpacesForTab: Yes 9 | NumSpacesForTab: 2 10 | Encoding: UTF-8 11 | 12 | RnwWeave: Sweave 13 | LaTeX: pdfLaTeX 14 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | ## Tuning GBMs (hyperparameter tuning) and impact on out-of-sample predictions 3 | 4 | The goal of this repo is to study the impact of having one dataset/sample ("the dataset") 5 | when training and tuning machine learning models in practice (or in competitions) 6 | on the prediction accuracy on new data (that usually comes from a slightly different 7 | distribution due to non-stationarity). 8 | 9 | To keep things simple, we focus on binary classification, use only one source dataset 10 | with a mix of numeric and categorical features and no missing values, don't perform feature engineering, 11 | tune only GBMs with `lightgbm` and random hyperparameter search (we might also ensemble the best models later), and 12 | use only AUC as a measure of accuracy. 13 | 14 | Unlike in most practical applications or in competitions such as Kaggle, we create the following 15 | laboratory/controlled environment that allows us to study the effects of sample variations in repeated 16 | experiments. We pick a public dataset that spans over several years (the well-known airline dataset). 17 | From this source data we pick 1 year of "master" data for training/tuning and the following 1 year for testing (hold-out). 18 | We take samples of given sizes (e.g.
10K, 100K, 1M records) from the "master" training set and 19 | samples of 100K from the "master" test set. 20 | 21 | We choose a grid of hyperparameter values and take 100 random combinations from the grid. 22 | For each hyperparameter combination we repeat the following resampling procedure 20 times: 23 | Split the training set 80-10-10 into data used for (1) training, (2) validation for early stopping 24 | and (3) evaluation for model selection. 25 | We train the GBM models with early stopping and record the AUCs on the last split of data (3). We record 26 | the average AUC and its standard deviation. 27 | Finally, we compute the AUC on the test set for the ensemble of the 20 models obtained 28 | with resampling (simple average of their predictions). 29 | 30 | We study the test AUC of the top performing hyperparameter combinations (selected based only on 31 | the information from the resampling procedure without access to the test set). In fact, we resample 32 | the test set itself as well, therefore we obtain averages and standard errors for the test AUC. 33 | 34 | 35 | 36 | ### Train set size 100K records 37 | 38 | The evaluation AUC of the 100 random hyperparameter trials vs their ranking 39 | (errorbars based on train 80-10-10 resampling): 40 | 41 | ![](3-test_rs/fig-100K-AUCrs_rank.png) 42 | 43 | The test AUC vs evaluation ranking (errorbars based on test set resampling): 44 | 45 | ![](3-test_rs/fig-100K-AUCtest_rank.png) 46 | 47 | Test vs evaluation AUC (with errorbars based on train 80-10-10 and test resampling, respectively): 48 | 49 | ![](3-test_rs/fig-100K-AUCcorr.png) 50 | 51 | The top models selected by evaluation AUC are also top in test AUC; the correlation between 52 | evaluation/model selection AUC and test AUC is high (Pearson and rank correlation `~0.75`). 53 | 54 | A top model is: 55 | ``` 56 | num_leaves = 1000 57 | learning_rate = 0.03 58 | min_data_in_leaf = 5 59 | feature_fraction = 0.8 60 | bagging_fraction = 0.8 61 | ``` 62 | 63 | For this combination, early stopping happens at `~200` trees in `~10 sec` for each resample (on a server with 16 cores/8 physical cores), 64 | leading to evaluation AUC `0.815` and test AUC `0.745` (the training data comes from one given year, while the test 65 | data comes from the next year, hence the decrease in prediction accuracy). 66 | 67 | The runtime and number of trees vary across the different hyperparameter combinations; the total training time 68 | for the 100 random hyperparameter trials with 20 train resamples each is `~6 hrs`, adding prediction time 69 | for 1 test set (as done initially) brings it to `~7 hrs` total runtime, and with 20 resamples of the test set the total run time is `~26 hrs` 70 | (the experiment can be easily parallelized to multiple servers as the trials in the random 71 | search are completely independent). 72 | 73 | More details [here](https://htmlpreview.github.io/?https://github.com/szilard/GBM-tune/blob/master/2-train_test_1each/analyze-100K-100.html) and 74 | [here](https://htmlpreview.github.io/?https://github.com/szilard/GBM-tune/blob/master/3-test_rs/analyze-100K.html). 75 | 76 | 77 | 78 | ### Train set size 1M records 79 | 80 | The correlation between evaluation/model selection AUC and test AUC is even higher (Pearson/rank correlation `~0.97`), 81 | and naturally the top models selected by evaluation AUC are also top in test AUC, even more so.
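As a side note, these Pearson/rank correlations between evaluation and test AUC can be computed from the saved results with a couple of lines of R (a minimal sketch; the file name and the `auc_test` column name are assumptions, adjust them to the results file of the given experiment):

```
library(data.table)
d <- fread("res.csv")   ## per-trial results (assumed columns: auc_rs_avg, auc_test)
cor(d$auc_rs_avg, d$auc_test)                        ## Pearson correlation
cor(d$auc_rs_avg, d$auc_test, method = "spearman")   ## rank correlation
```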
82 | 83 | The best models now have a larger `num_leaves` (as one would expect since there is more data and one can build deeper 84 | trees without overfitting) and the early stopping stops later (more trees). 85 | Run time is approximately `10x`; the best evaluation and test AUC are in the table below. 86 | 87 | 88 | Size | eval AUC | test AUC | 89 | --------|----------------|---------------| 90 | 10K | 0.701 / 0.682 | 0.660 / 0.670 | 91 | 100K | 0.815 | 0.745 | 92 | 1M | 0.952 | 0.847 | 93 | 94 | More details [here](https://htmlpreview.github.io/?https://github.com/szilard/GBM-tune/blob/master/2-train_test_1each/analyze-1M-100.html). 95 | 96 | 97 | 98 | ### Train set size 10K records 99 | 100 | The best models selected based on evaluation AUC are no longer the best models on test; the correlation is now low (`~0.25`). 101 | 102 | ![](3-test_rs/fig-10K-AUCcorr.png) 103 | 104 | It seems 10K is just *not enough data* for obtaining a good model out-of-sample 105 | (some of the variables have 100s of categories and some appear with low frequency, 106 | so this result is not completely unexpected), 107 | and even with cross validation there is some kind of *overfitting* here. 108 | 109 | The best models based on evaluation AUC have deeper trees (evaluation AUC `0.701`, but low test AUC `0.660`), while 110 | the best models on test have shallower trees (evaluation AUC `0.682`, test AUC `0.670`). 111 | Therefore one could reduce overfitting by restricting the space to shallower trees (effectively regularizing). 112 | 113 | Also, the above evaluation AUC is a biased estimate for the AUC of the best model even on the training set (thanks @preko for 114 | pointing this out). 115 | 116 | More details [here](https://htmlpreview.github.io/?https://github.com/szilard/GBM-tune/blob/master/2-train_test_1each/analyze-10K-100.html) and 117 | [here](https://htmlpreview.github.io/?https://github.com/szilard/GBM-tune/blob/master/3-test_rs/analyze-10K.html). 118 | 119 | 120 | 121 | ### Note on train set sizes 122 | 123 | ... 124 | 125 | 126 | 127 | ### Larger hyperparameter search (1000 random trials) 128 | 129 | We ran 1000 random trials on 100K data (~60 hrs runtime on an 8 physical core server; one could parallelize the trials on different servers/more cores). 130 | 131 | ![](2-train_test_1each/fig-100K-1000-AUCcorr.png) 132 | 133 | With this many trials the best model (selected by cross-validation) is a bit overfit to the training set and is not the best model on the test set. 134 | 135 | Trials | eval AUC | test AUC | 136 | --------|----------------|---------------| 137 | 100 | 0.815 | 0.745 | 138 | 1000 | 0.821 / 0.807 | 0.744 / 0.746 | 139 | 140 | The best model as measured on the test set (you cannot use that for model selection!!!) is also a bit overfit to the test set. If one does multiple 141 | resamples of the test set (which we did not do here because of the extra computational costs), the best model on the "average" test set 142 | would still be (slightly) overfitted to the test distribution (since that's likely to be a bit different from the train distribution 143 | because of temporal separation). 144 | 145 | More details [here](https://htmlpreview.github.io/?https://github.com/szilard/GBM-tune/blob/master/2-train_test_1each/analyze-100K-1000.html). 146 | 147 | 148 | 149 | ### Variation of best model selected vs training sample 150 | 151 | We repeat the above experiments with resampling the train set from the source data. This is to study 152 | the sensitivity of the results w.r.t. a given sample.
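The comparison shown next boils down to a few lines on the saved results, following `4-train_rs/analyze.Rmd` (a minimal sketch; paths here are relative to the repo root, `ktr` indexes the train resample and `krpm` the hyperparameter combination):

```
library(data.table)
library(dplyr)
d <- fread("4-train_rs/res.csv")
## agreement of the tuning metric across the 2 train resamples (same krpm, ktr = 1 vs 2):
cor(filter(d, ktr==1)$auc_rs_avg, filter(d, ktr==2)$auc_rs_avg)                        ## Pearson
cor(filter(d, ktr==1)$auc_rs_avg, filter(d, ktr==2)$auc_rs_avg, method = "spearman")   ## rank
```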
153 | 154 | For 100K records, the resample AUC for 2 train resamples with the same hyperparameter values: 155 | 156 | ![](4-train_rs/fig-AUCcorr.png) 157 | 158 | The correlation (both Pearson and rank) of the 2 above is `~0.95`. 159 | 160 | A similar graph and correlation are found if one looks at the AUC on the test data. 161 | 162 | More details [here](https://htmlpreview.github.io/?https://github.com/szilard/GBM-tune/blob/master/4-train_rs/analyze.html). 163 | 164 | Therefore, it seems that 100K records is enough to get similar best hyperparameter values 165 | that do not depend too much on the given training sample. However, the test AUC shows some 166 | variation, therefore the best models must be somewhat different. TODO: More research to clarify this. 167 | 168 | 169 | 170 | ### Ensembles 171 | 172 | The test AUC of the average of the top 10 models (of the 100 with random hyperparameter search, selected based on 173 | resample AUC) is shown in red, along with the AUC of all the models (horizontal axis `rank` based on resample AUC): 174 | 175 | ![](5-ensemble/fig-AUCens.png) 176 | 177 | More details [here](https://htmlpreview.github.io/?https://github.com/szilard/GBM-tune/blob/master/5-ensemble/analyze.html). 178 | 179 | It is interesting that this simple ensemble does not beat the best model. 180 | TODO: Stacking or some other more sophisticated way of ensembling models (vs simple average of top 10 above). 181 | --------------------------------------------------------------------------------