├── LightGBM_demo ├── .gitignore ├── README.md ├── binary_classification │ ├── README.md │ ├── binary.test │ ├── binary.test.weight │ ├── binary.train │ ├── binary.train.weight │ ├── predict.conf │ └── train.conf ├── lambdarank │ ├── README.md │ ├── predict.conf │ ├── rank.test │ ├── rank.test.query │ ├── rank.train │ ├── rank.train.query │ └── train.conf ├── multiclass_classification │ ├── README.md │ ├── multiclass.test │ ├── multiclass.train │ ├── predict.conf │ └── train.conf ├── parallel_learning │ ├── README.md │ ├── binary.test │ ├── binary.train │ ├── predict.conf │ └── train.conf ├── python-guide │ ├── README.md │ ├── advanced_example.py │ ├── model.json │ ├── model.pkl │ ├── plot_example.py │ ├── simple_example.py │ └── sklearn_example.py └── regression │ ├── README.md │ ├── predict.conf │ ├── regression.test │ ├── regression.test.init │ ├── regression.train │ ├── regression.train.init │ └── train.conf ├── Linux下使用MySQL.md ├── README.md ├── TensorFlow_Started.ipynb ├── XGBoost_demo ├── .gitignore ├── README.md ├── binary_classification │ ├── README.md │ ├── agaricus-lepiota.data │ ├── agaricus-lepiota.fmap │ ├── agaricus-lepiota.names │ ├── mapfeat.py │ ├── mknfold.py │ ├── mushroom.conf │ └── runexp.sh ├── data │ ├── README.md │ ├── agaricus.txt.test │ ├── agaricus.txt.train │ ├── featmap.txt │ └── gen_autoclaims.R ├── distributed-training │ ├── README.md │ ├── mushroom.aws.conf │ ├── plot_model.ipynb │ └── run_aws.sh ├── gpu_acceleration │ ├── README.md │ └── cover_type.py ├── guide-python │ ├── README.md │ ├── basic_walkthrough.py │ ├── boost_from_prediction.py │ ├── cross_validation.py │ ├── custom_objective.py │ ├── evals_result.py │ ├── external_memory.py │ ├── gamma_regression.py │ ├── generalized_linear_model.py │ ├── predict_first_ntree.py │ ├── predict_leaf_indices.py │ ├── runall.sh │ ├── sklearn_evals_result.py │ ├── sklearn_examples.py │ └── sklearn_parallel.py ├── kaggle-higgs │ ├── README.md │ ├── higgs-cv.py │ ├── higgs-numpy.py │ ├── higgs-pred.R │ ├── higgs-pred.py │ ├── higgs-train.R │ ├── run.sh │ ├── speedtest.R │ └── speedtest.py ├── kaggle-otto │ ├── README.MD │ ├── otto_train_pred.R │ └── understandingXGBoostModel.Rmd ├── multiclass_classification │ ├── README.md │ ├── runexp.sh │ └── train.py ├── rank │ ├── README.md │ ├── mq2008.conf │ ├── runexp.sh │ ├── trans_data.py │ └── wgetdata.sh ├── regression │ ├── README.md │ ├── machine.conf │ ├── machine.data │ ├── machine.names │ ├── mapfeat.py │ ├── mknfold.py │ └── runexp.sh └── yearpredMSD │ ├── README.md │ ├── csv2libsvm.py │ ├── runexp.sh │ └── yearpredMSD.conf ├── git学习.md ├── sklearn学习.md └── xgboost学习.md /LightGBM_demo/.gitignore: -------------------------------------------------------------------------------- 1 | *.txt 2 | -------------------------------------------------------------------------------- /LightGBM_demo/README.md: -------------------------------------------------------------------------------- 1 | Examples 2 | ======== 3 | 4 | You can learn how to use LightGBM by these examples. 5 | -------------------------------------------------------------------------------- /LightGBM_demo/binary_classification/README.md: -------------------------------------------------------------------------------- 1 | Binary Classification Example 2 | ============================= 3 | 4 | Here is an example for LightGBM to run binary classification task. 
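The CLI steps below drive everything through `train.conf` and `predict.conf`. For readers who prefer the Python package (see the `python-guide` folder), the key settings in this folder's `train.conf` map roughly onto the parameter dictionary in the sketch below — a hedged illustration that assumes `lightgbm` and `pandas` are installed and that the script is run from this folder; it is not required for the CLI example itself.

```
# Rough Python equivalent of train.conf in this folder (illustrative sketch only).
import lightgbm as lgb
import pandas as pd

# binary.train / binary.test are tab-separated: label in column 0, features after it
# (python-guide/advanced_example.py reads these same files this way).
df_train = pd.read_csv('binary.train', header=None, sep='\t')
df_test = pd.read_csv('binary.test', header=None, sep='\t')
y_train, X_train = df_train[0].values, df_train.drop(0, axis=1).values
y_test, X_test = df_test[0].values, df_test.drop(0, axis=1).values

# Mirrors the key options in train.conf.
params = {
    'objective': 'binary',
    'metric': ['binary_logloss', 'auc'],
    'num_leaves': 63,
    'learning_rate': 0.1,
    'max_bin': 255,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'min_data_in_leaf': 50,
    'min_sum_hessian_in_leaf': 5.0,
}

lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
gbm = lgb.train(params, lgb_train, num_boost_round=100, valid_sets=[lgb_eval])
gbm.save_model('LightGBM_model.txt')
```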
5 | 6 | ***You should copy executable file to this folder first.*** 7 | 8 | Trainin 9 | ------- 10 | 11 | For Windows, by running following command in this folder: 12 | 13 | ``` 14 | lightgbm.exe config=train.conf 15 | ``` 16 | 17 | For Linux, by running following command in this folder: 18 | 19 | ``` 20 | ./lightgbm config=train.conf 21 | ``` 22 | 23 | Prediction 24 | ---------- 25 | 26 | You should finish training first. 27 | 28 | For Windows, by running following command in this folder: 29 | 30 | ``` 31 | lightgbm.exe config=predict.conf 32 | ``` 33 | 34 | For Linux, by running following command in this folder: 35 | 36 | ``` 37 | ./lightgbm config=predict.conf 38 | ``` 39 | -------------------------------------------------------------------------------- /LightGBM_demo/binary_classification/binary.test.weight: -------------------------------------------------------------------------------- 1 | 1.0 2 | 1.0 3 | 1.0 4 | 1.0 5 | 1.0 6 | 1.0 7 | 1.0 8 | 1.0 9 | 1.0 10 | 1.0 11 | 1.0 12 | 1.0 13 | 1.0 14 | 1.0 15 | 1.0 16 | 1.0 17 | 1.0 18 | 1.0 19 | 1.0 20 | 1.0 21 | 1.0 22 | 1.0 23 | 1.0 24 | 1.0 25 | 1.0 26 | 1.0 27 | 1.0 28 | 1.0 29 | 1.0 30 | 1.0 31 | 1.0 32 | 1.0 33 | 1.0 34 | 1.0 35 | 1.0 36 | 1.0 37 | 1.0 38 | 1.0 39 | 1.0 40 | 1.0 41 | 1.0 42 | 1.0 43 | 1.0 44 | 1.0 45 | 1.0 46 | 1.0 47 | 1.0 48 | 1.0 49 | 1.0 50 | 1.0 51 | 1.0 52 | 1.0 53 | 1.0 54 | 1.0 55 | 1.0 56 | 1.0 57 | 1.0 58 | 1.0 59 | 1.0 60 | 1.0 61 | 1.0 62 | 1.0 63 | 1.0 64 | 1.0 65 | 1.0 66 | 1.0 67 | 1.0 68 | 1.0 69 | 1.0 70 | 1.0 71 | 1.0 72 | 1.0 73 | 1.0 74 | 1.0 75 | 1.0 76 | 1.0 77 | 1.0 78 | 1.0 79 | 1.0 80 | 1.0 81 | 1.0 82 | 1.0 83 | 1.0 84 | 1.0 85 | 1.0 86 | 1.0 87 | 1.0 88 | 1.0 89 | 1.0 90 | 1.0 91 | 1.0 92 | 1.0 93 | 1.0 94 | 1.0 95 | 1.0 96 | 1.0 97 | 1.0 98 | 1.0 99 | 1.0 100 | 1.0 101 | 1.0 102 | 1.0 103 | 1.0 104 | 1.0 105 | 1.0 106 | 1.0 107 | 1.0 108 | 1.0 109 | 1.0 110 | 1.0 111 | 1.0 112 | 1.0 113 | 1.0 114 | 1.0 115 | 1.0 116 | 1.0 117 | 1.0 118 | 1.0 119 | 1.0 120 | 1.0 121 | 1.0 122 | 1.0 123 | 1.0 124 | 1.0 125 | 1.0 126 | 1.0 127 | 1.0 128 | 1.0 129 | 1.0 130 | 1.0 131 | 1.0 132 | 1.0 133 | 1.0 134 | 1.0 135 | 1.0 136 | 1.0 137 | 1.0 138 | 1.0 139 | 1.0 140 | 1.0 141 | 1.0 142 | 1.0 143 | 1.0 144 | 1.0 145 | 1.0 146 | 1.0 147 | 1.0 148 | 1.0 149 | 1.0 150 | 1.0 151 | 1.0 152 | 1.0 153 | 1.0 154 | 1.0 155 | 1.0 156 | 1.0 157 | 1.0 158 | 1.0 159 | 1.0 160 | 1.0 161 | 1.0 162 | 1.0 163 | 1.0 164 | 1.0 165 | 1.0 166 | 1.0 167 | 1.0 168 | 1.0 169 | 1.0 170 | 1.0 171 | 1.0 172 | 1.0 173 | 1.0 174 | 1.0 175 | 1.0 176 | 1.0 177 | 1.0 178 | 1.0 179 | 1.0 180 | 1.0 181 | 1.0 182 | 1.0 183 | 1.0 184 | 1.0 185 | 1.0 186 | 1.0 187 | 1.0 188 | 1.0 189 | 1.0 190 | 1.0 191 | 1.0 192 | 1.0 193 | 1.0 194 | 1.0 195 | 1.0 196 | 1.0 197 | 1.0 198 | 1.0 199 | 1.0 200 | 1.0 201 | 1.0 202 | 1.0 203 | 1.0 204 | 1.0 205 | 1.0 206 | 1.0 207 | 1.0 208 | 1.0 209 | 1.0 210 | 1.0 211 | 1.0 212 | 1.0 213 | 1.0 214 | 1.0 215 | 1.0 216 | 1.0 217 | 1.0 218 | 1.0 219 | 1.0 220 | 1.0 221 | 1.0 222 | 1.0 223 | 1.0 224 | 1.0 225 | 1.0 226 | 1.0 227 | 1.0 228 | 1.0 229 | 1.0 230 | 1.0 231 | 1.0 232 | 1.0 233 | 1.0 234 | 1.0 235 | 1.0 236 | 1.0 237 | 1.0 238 | 1.0 239 | 1.0 240 | 1.0 241 | 1.0 242 | 1.0 243 | 1.0 244 | 1.0 245 | 1.0 246 | 1.0 247 | 1.0 248 | 1.0 249 | 1.0 250 | 1.0 251 | 1.0 252 | 1.0 253 | 1.0 254 | 1.0 255 | 1.0 256 | 1.0 257 | 1.0 258 | 1.0 259 | 1.0 260 | 1.0 261 | 1.0 262 | 1.0 263 | 1.0 264 | 1.0 265 | 1.0 266 | 1.0 267 | 1.0 268 | 1.0 269 | 1.0 270 | 1.0 271 | 1.0 272 | 1.0 273 | 1.0 274 | 1.0 275 | 1.0 276 | 
1.0 277 | 1.0 278 | 1.0 279 | 1.0 280 | 1.0 281 | 1.0 282 | 1.0 283 | 1.0 284 | 1.0 285 | 1.0 286 | 1.0 287 | 1.0 288 | 1.0 289 | 1.0 290 | 1.0 291 | 1.0 292 | 1.0 293 | 1.0 294 | 1.0 295 | 1.0 296 | 1.0 297 | 1.0 298 | 1.0 299 | 1.0 300 | 1.0 301 | 1.0 302 | 1.0 303 | 1.0 304 | 1.0 305 | 1.0 306 | 1.0 307 | 1.0 308 | 1.0 309 | 1.0 310 | 1.0 311 | 1.0 312 | 1.0 313 | 1.0 314 | 1.0 315 | 1.0 316 | 1.0 317 | 1.0 318 | 1.0 319 | 1.0 320 | 1.0 321 | 1.0 322 | 1.0 323 | 1.0 324 | 1.0 325 | 1.0 326 | 1.0 327 | 1.0 328 | 1.0 329 | 1.0 330 | 1.0 331 | 1.0 332 | 1.0 333 | 1.0 334 | 1.0 335 | 1.0 336 | 1.0 337 | 1.0 338 | 1.0 339 | 1.0 340 | 1.0 341 | 1.0 342 | 1.0 343 | 1.0 344 | 1.0 345 | 1.0 346 | 1.0 347 | 1.0 348 | 1.0 349 | 1.0 350 | 1.0 351 | 1.0 352 | 1.0 353 | 1.0 354 | 1.0 355 | 1.0 356 | 1.0 357 | 1.0 358 | 1.0 359 | 1.0 360 | 1.0 361 | 1.0 362 | 1.0 363 | 1.0 364 | 1.0 365 | 1.0 366 | 1.0 367 | 1.0 368 | 1.0 369 | 1.0 370 | 1.0 371 | 1.0 372 | 1.0 373 | 1.0 374 | 1.0 375 | 1.0 376 | 1.0 377 | 1.0 378 | 1.0 379 | 1.0 380 | 1.0 381 | 1.0 382 | 1.0 383 | 1.0 384 | 1.0 385 | 1.0 386 | 1.0 387 | 1.0 388 | 1.0 389 | 1.0 390 | 1.0 391 | 1.0 392 | 1.0 393 | 1.0 394 | 1.0 395 | 1.0 396 | 1.0 397 | 1.0 398 | 1.0 399 | 1.0 400 | 1.0 401 | 1.0 402 | 1.0 403 | 1.0 404 | 1.0 405 | 1.0 406 | 1.0 407 | 1.0 408 | 1.0 409 | 1.0 410 | 1.0 411 | 1.0 412 | 1.0 413 | 1.0 414 | 1.0 415 | 1.0 416 | 1.0 417 | 1.0 418 | 1.0 419 | 1.0 420 | 1.0 421 | 1.0 422 | 1.0 423 | 1.0 424 | 1.0 425 | 1.0 426 | 1.0 427 | 1.0 428 | 1.0 429 | 1.0 430 | 1.0 431 | 1.0 432 | 1.0 433 | 1.0 434 | 1.0 435 | 1.0 436 | 1.0 437 | 1.0 438 | 1.0 439 | 1.0 440 | 1.0 441 | 1.0 442 | 1.0 443 | 1.0 444 | 1.0 445 | 1.0 446 | 1.0 447 | 1.0 448 | 1.0 449 | 1.0 450 | 1.0 451 | 1.0 452 | 1.0 453 | 1.0 454 | 1.0 455 | 1.0 456 | 1.0 457 | 1.0 458 | 1.0 459 | 1.0 460 | 1.0 461 | 1.0 462 | 1.0 463 | 1.0 464 | 1.0 465 | 1.0 466 | 1.0 467 | 1.0 468 | 1.0 469 | 1.0 470 | 1.0 471 | 1.0 472 | 1.0 473 | 1.0 474 | 1.0 475 | 1.0 476 | 1.0 477 | 1.0 478 | 1.0 479 | 1.0 480 | 1.0 481 | 1.0 482 | 1.0 483 | 1.0 484 | 1.0 485 | 1.0 486 | 1.0 487 | 1.0 488 | 1.0 489 | 1.0 490 | 1.0 491 | 1.0 492 | 1.0 493 | 1.0 494 | 1.0 495 | 1.0 496 | 1.0 497 | 1.0 498 | 1.0 499 | 1.0 500 | 1.0 501 | -------------------------------------------------------------------------------- /LightGBM_demo/binary_classification/predict.conf: -------------------------------------------------------------------------------- 1 | 2 | task = predict 3 | 4 | data = binary.test 5 | 6 | input_model= LightGBM_model.txt 7 | -------------------------------------------------------------------------------- /LightGBM_demo/binary_classification/train.conf: -------------------------------------------------------------------------------- 1 | # task type, support train and predict 2 | task = train 3 | 4 | # boosting type, support gbdt for now, alias: boosting, boost 5 | boosting_type = gbdt 6 | 7 | # application type, support following application 8 | # regression , regression task 9 | # binary , binary classification task 10 | # lambdarank , lambdarank task 11 | # alias: application, app 12 | objective = binary 13 | 14 | # eval metrics, support multi metric, delimite by ',' , support following metrics 15 | # l1 16 | # l2 , default metric for regression 17 | # ndcg , default metric for lambdarank 18 | # auc 19 | # binary_logloss , default metric for binary 20 | # binary_error 21 | metric = binary_logloss,auc 22 | 23 | # frequence for metric output 24 | metric_freq = 1 25 | 26 | # true if need output metric for 
training data, alias: tranining_metric, train_metric 27 | is_training_metric = true 28 | 29 | # number of bins for feature bucket, 255 is a recommend setting, it can save memories, and also has good accuracy. 30 | max_bin = 255 31 | 32 | # training data 33 | # if exsting weight file, should name to "binary.train.weight" 34 | # alias: train_data, train 35 | data = binary.train 36 | 37 | # validation data, support multi validation data, separated by ',' 38 | # if exsting weight file, should name to "binary.test.weight" 39 | # alias: valid, test, test_data, 40 | valid_data = binary.test 41 | 42 | # number of trees(iterations), alias: num_tree, num_iteration, num_iterations, num_round, num_rounds 43 | num_trees = 100 44 | 45 | # shrinkage rate , alias: shrinkage_rate 46 | learning_rate = 0.1 47 | 48 | # number of leaves for one tree, alias: num_leaf 49 | num_leaves = 63 50 | 51 | # type of tree learner, support following types: 52 | # serial , single machine version 53 | # feature , use feature parallel to train 54 | # data , use data parallel to train 55 | # voting , use voting based parallel to train 56 | # alias: tree 57 | tree_learner = serial 58 | 59 | # number of threads for multi-threading. One thread will use one CPU, defalut is setted to #cpu. 60 | # num_threads = 8 61 | 62 | # feature sub-sample, will random select 80% feature to train on each iteration 63 | # alias: sub_feature 64 | feature_fraction = 0.8 65 | 66 | # Support bagging (data sub-sample), will perform bagging every 5 iterations 67 | bagging_freq = 5 68 | 69 | # Bagging farction, will random select 80% data on bagging 70 | # alias: sub_row 71 | bagging_fraction = 0.8 72 | 73 | # minimal number data for one leaf, use this to deal with over-fit 74 | # alias : min_data_per_leaf, min_data 75 | min_data_in_leaf = 50 76 | 77 | # minimal sum hessians for one leaf, use this to deal with over-fit 78 | min_sum_hessian_in_leaf = 5.0 79 | 80 | # save memory and faster speed for sparse feature, alias: is_sparse 81 | is_enable_sparse = true 82 | 83 | # when data is bigger than memory size, set this to true. otherwise set false will have faster speed 84 | # alias: two_round_loading, two_round 85 | use_two_round_loading = false 86 | 87 | # true if need to save data to binary file and application will auto load data from binary file next time 88 | # alias: is_save_binary, save_binary 89 | is_save_binary_file = false 90 | 91 | # output model file 92 | output_model = LightGBM_model.txt 93 | 94 | # support continuous train from trained gbdt model 95 | # input_model= trained_model.txt 96 | 97 | # output prediction file for predict task 98 | # output_result= prediction.txt 99 | 100 | # support continuous train from initial score file 101 | # input_init_score= init_score.txt 102 | 103 | 104 | # number of machines in parallel training, alias: num_machine 105 | num_machines = 1 106 | 107 | # local listening port in parallel training, alias: local_port 108 | local_listen_port = 12400 109 | 110 | # machines list file for parallel training, alias: mlist 111 | machine_list_file = mlist.txt 112 | -------------------------------------------------------------------------------- /LightGBM_demo/lambdarank/README.md: -------------------------------------------------------------------------------- 1 | LambdaRank Example 2 | ================== 3 | 4 | Here is an example for LightGBM to run lambdarank task. 
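The query boundaries for this task come from `rank.train.query` and `rank.test.query`, which list one group (query) size per line. For reference only, here is a hedged Python sketch of the same lambdarank setup — it assumes the Python package and scikit-learn are installed and that `rank.train` / `rank.test` are LibSVM-format files; the CLI steps below do not need it.

```
# Illustrative sketch of the lambdarank setup described by train.conf in this folder.
import lightgbm as lgb
import numpy as np
from sklearn.datasets import load_svmlight_files

# Assumes rank.train / rank.test are LibSVM-format files sharing one feature space.
X_train, y_train, X_test, y_test = load_svmlight_files(['rank.train', 'rank.test'])

# One query (group) size per line, matching rank.train.query / rank.test.query.
group_train = np.loadtxt('rank.train.query', dtype=int)
group_test = np.loadtxt('rank.test.query', dtype=int)

params = {
    'objective': 'lambdarank',
    'metric': 'ndcg',
    'ndcg_eval_at': [1, 3, 5],
    'num_leaves': 31,
    'learning_rate': 0.1,
}

lgb_train = lgb.Dataset(X_train, y_train, group=group_train)
lgb_eval = lgb.Dataset(X_test, y_test, group=group_test, reference=lgb_train)
gbm = lgb.train(params, lgb_train, num_boost_round=100, valid_sets=[lgb_eval])
```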
5 | 6 | ***You should copy executable file to this folder first.*** 7 | 8 | Training 9 | -------- 10 | 11 | For Windows, by running following command in this folder: 12 | 13 | ``` 14 | lightgbm.exe config=train.conf 15 | ``` 16 | 17 | For Linux, by running following command in this folder: 18 | 19 | ``` 20 | ./lightgbm config=train.conf 21 | ``` 22 | 23 | Prediction 24 | ---------- 25 | 26 | You should finish training first. 27 | 28 | For Windows, by running following command in this folder: 29 | 30 | ``` 31 | lightgbm.exe config=predict.conf 32 | ``` 33 | 34 | For Linux, by running following command in this folder: 35 | 36 | ``` 37 | ./lightgbm config=predict.conf 38 | ``` 39 | -------------------------------------------------------------------------------- /LightGBM_demo/lambdarank/predict.conf: -------------------------------------------------------------------------------- 1 | 2 | task = predict 3 | 4 | data = rank.test 5 | 6 | input_model= LightGBM_model.txt 7 | -------------------------------------------------------------------------------- /LightGBM_demo/lambdarank/rank.test.query: -------------------------------------------------------------------------------- 1 | 12 2 | 19 3 | 18 4 | 10 5 | 15 6 | 15 7 | 22 8 | 23 9 | 18 10 | 16 11 | 16 12 | 11 13 | 6 14 | 13 15 | 17 16 | 21 17 | 20 18 | 16 19 | 13 20 | 16 21 | 21 22 | 15 23 | 10 24 | 19 25 | 10 26 | 13 27 | 18 28 | 17 29 | 23 30 | 24 31 | 16 32 | 13 33 | 17 34 | 24 35 | 17 36 | 10 37 | 17 38 | 15 39 | 18 40 | 16 41 | 9 42 | 9 43 | 21 44 | 14 45 | 13 46 | 13 47 | 13 48 | 10 49 | 10 50 | 6 51 | -------------------------------------------------------------------------------- /LightGBM_demo/lambdarank/rank.train.query: -------------------------------------------------------------------------------- 1 | 1 2 | 13 3 | 5 4 | 8 5 | 19 6 | 12 7 | 18 8 | 5 9 | 14 10 | 13 11 | 8 12 | 9 13 | 16 14 | 11 15 | 21 16 | 14 17 | 21 18 | 9 19 | 14 20 | 11 21 | 20 22 | 18 23 | 13 24 | 20 25 | 22 26 | 22 27 | 13 28 | 17 29 | 10 30 | 13 31 | 12 32 | 13 33 | 13 34 | 23 35 | 18 36 | 13 37 | 20 38 | 12 39 | 22 40 | 14 41 | 13 42 | 23 43 | 13 44 | 14 45 | 14 46 | 5 47 | 13 48 | 15 49 | 14 50 | 14 51 | 16 52 | 16 53 | 15 54 | 21 55 | 22 56 | 10 57 | 22 58 | 18 59 | 25 60 | 16 61 | 12 62 | 12 63 | 15 64 | 15 65 | 25 66 | 13 67 | 9 68 | 12 69 | 8 70 | 16 71 | 25 72 | 19 73 | 24 74 | 12 75 | 16 76 | 10 77 | 16 78 | 9 79 | 17 80 | 15 81 | 7 82 | 9 83 | 15 84 | 14 85 | 16 86 | 17 87 | 8 88 | 17 89 | 12 90 | 18 91 | 23 92 | 10 93 | 12 94 | 12 95 | 4 96 | 14 97 | 12 98 | 15 99 | 27 100 | 16 101 | 20 102 | 13 103 | 19 104 | 13 105 | 17 106 | 17 107 | 16 108 | 12 109 | 15 110 | 14 111 | 14 112 | 19 113 | 12 114 | 23 115 | 18 116 | 16 117 | 9 118 | 23 119 | 11 120 | 15 121 | 8 122 | 10 123 | 10 124 | 16 125 | 11 126 | 15 127 | 22 128 | 16 129 | 17 130 | 23 131 | 16 132 | 22 133 | 17 134 | 14 135 | 12 136 | 14 137 | 20 138 | 15 139 | 17 140 | 15 141 | 15 142 | 22 143 | 9 144 | 21 145 | 9 146 | 17 147 | 16 148 | 15 149 | 13 150 | 13 151 | 15 152 | 14 153 | 18 154 | 21 155 | 14 156 | 17 157 | 15 158 | 14 159 | 16 160 | 12 161 | 17 162 | 19 163 | 16 164 | 11 165 | 18 166 | 11 167 | 13 168 | 14 169 | 9 170 | 16 171 | 15 172 | 16 173 | 25 174 | 9 175 | 13 176 | 22 177 | 16 178 | 18 179 | 20 180 | 14 181 | 11 182 | 9 183 | 16 184 | 19 185 | 19 186 | 11 187 | 11 188 | 13 189 | 14 190 | 14 191 | 13 192 | 16 193 | 6 194 | 21 195 | 16 196 | 12 197 | 16 198 | 11 199 | 24 200 | 12 201 | 10 202 | -------------------------------------------------------------------------------- 
/LightGBM_demo/lambdarank/train.conf: -------------------------------------------------------------------------------- 1 | # task type, support train and predict 2 | task = train 3 | 4 | # boosting type, support gbdt for now, alias: boosting, boost 5 | boosting_type = gbdt 6 | 7 | # application type, support following application 8 | # regression , regression task 9 | # binary , binary classification task 10 | # lambdarank , lambdarank task 11 | # alias: application, app 12 | objective = lambdarank 13 | 14 | # eval metrics, support multi metric, delimite by ',' , support following metrics 15 | # l1 16 | # l2 , default metric for regression 17 | # ndcg , default metric for lambdarank 18 | # auc 19 | # binary_logloss , default metric for binary 20 | # binary_error 21 | metric = ndcg 22 | 23 | # evaluation position for ndcg metric, alias : ndcg_at 24 | ndcg_eval_at = 1,3,5 25 | 26 | # frequence for metric output 27 | metric_freq = 1 28 | 29 | # true if need output metric for training data, alias: tranining_metric, train_metric 30 | is_training_metric = true 31 | 32 | # number of bins for feature bucket, 255 is a recommend setting, it can save memories, and also has good accuracy. 33 | max_bin = 255 34 | 35 | # training data 36 | # if exsting weight file, should name to "rank.train.weight" 37 | # if exsting query file, should name to "rank.train.query" 38 | # alias: train_data, train 39 | data = rank.train 40 | 41 | # validation data, support multi validation data, separated by ',' 42 | # if exsting weight file, should name to "rank.test.weight" 43 | # if exsting query file, should name to "rank.test.query" 44 | # alias: valid, test, test_data, 45 | valid_data = rank.test 46 | 47 | # number of trees(iterations), alias: num_tree, num_iteration, num_iterations, num_round, num_rounds 48 | num_trees = 100 49 | 50 | # shrinkage rate , alias: shrinkage_rate 51 | learning_rate = 0.1 52 | 53 | # number of leaves for one tree, alias: num_leaf 54 | num_leaves = 31 55 | 56 | # type of tree learner, support following types: 57 | # serial , single machine version 58 | # feature , use feature parallel to train 59 | # data , use data parallel to train 60 | # voting , use voting based parallel to train 61 | # alias: tree 62 | tree_learner = serial 63 | 64 | # number of threads for multi-threading. One thread will use one CPU, defalut is setted to #cpu. 65 | # num_threads = 8 66 | 67 | # feature sub-sample, will random select 80% feature to train on each iteration 68 | # alias: sub_feature 69 | feature_fraction = 1.0 70 | 71 | # Support bagging (data sub-sample), will perform bagging every 5 iterations 72 | bagging_freq = 1 73 | 74 | # Bagging farction, will random select 80% data on bagging 75 | # alias: sub_row 76 | bagging_fraction = 0.9 77 | 78 | # minimal number data for one leaf, use this to deal with over-fit 79 | # alias : min_data_per_leaf, min_data 80 | min_data_in_leaf = 50 81 | 82 | # minimal sum hessians for one leaf, use this to deal with over-fit 83 | min_sum_hessian_in_leaf = 5.0 84 | 85 | # save memory and faster speed for sparse feature, alias: is_sparse 86 | is_enable_sparse = true 87 | 88 | # when data is bigger than memory size, set this to true. 
otherwise set false will have faster speed 89 | # alias: two_round_loading, two_round 90 | use_two_round_loading = false 91 | 92 | # true if need to save data to binary file and application will auto load data from binary file next time 93 | # alias: is_save_binary, save_binary 94 | is_save_binary_file = false 95 | 96 | # output model file 97 | output_model = LightGBM_model.txt 98 | 99 | # support continuous train from trained gbdt model 100 | # input_model= trained_model.txt 101 | 102 | # output prediction file for predict task 103 | # output_result= prediction.txt 104 | 105 | # support continuous train from initial score file 106 | # input_init_score= init_score.txt 107 | 108 | 109 | # number of machines in parallel training, alias: num_machine 110 | num_machines = 1 111 | 112 | # local listening port in parallel training, alias: local_port 113 | local_listen_port = 12400 114 | 115 | # machines list file for parallel training, alias: mlist 116 | machine_list_file = mlist.txt 117 | -------------------------------------------------------------------------------- /LightGBM_demo/multiclass_classification/README.md: -------------------------------------------------------------------------------- 1 | Multiclass Classification Example 2 | ================================= 3 | 4 | Here is an example for LightGBM to run multiclass classification task. 5 | 6 | ***You should copy executable file to this folder first.*** 7 | 8 | Training 9 | -------- 10 | 11 | For Windows, by running following command in this folder: 12 | 13 | ``` 14 | lightgbm.exe config=train.conf 15 | ``` 16 | 17 | For Linux, by running following command in this folder: 18 | 19 | ``` 20 | ./lightgbm config=train.conf 21 | ``` 22 | 23 | Prediction 24 | ---------- 25 | 26 | You should finish training first. 
27 | 28 | For Windows, by running following command in this folder: 29 | 30 | ``` 31 | lightgbm.exe config=predict.conf 32 | ``` 33 | 34 | For Linux, by running following command in this folder: 35 | 36 | ``` 37 | ./lightgbm config=predict.conf 38 | ``` 39 | -------------------------------------------------------------------------------- /LightGBM_demo/multiclass_classification/predict.conf: -------------------------------------------------------------------------------- 1 | task = predict 2 | 3 | data = multiclass.test 4 | 5 | input_model= LightGBM_model.txt 6 | -------------------------------------------------------------------------------- /LightGBM_demo/multiclass_classification/train.conf: -------------------------------------------------------------------------------- 1 | # task type, support train and predict 2 | task = train 3 | 4 | # boosting type, support gbdt for now, alias: boosting, boost 5 | boosting_type = gbdt 6 | 7 | # application type, support following application 8 | # regression , regression task 9 | # binary , binary classification task 10 | # lambdarank , lambdarank task 11 | # multiclass 12 | # alias: application, app 13 | objective = multiclass 14 | 15 | # eval metrics, support multi metric, delimite by ',' , support following metrics 16 | # l1 17 | # l2 , default metric for regression 18 | # ndcg , default metric for lambdarank 19 | # auc 20 | # binary_logloss , default metric for binary 21 | # binary_error 22 | # multi_logloss 23 | # multi_error 24 | metric = multi_logloss 25 | 26 | # number of class, for multiclass classification 27 | num_class = 5 28 | 29 | # frequence for metric output 30 | metric_freq = 1 31 | 32 | # true if need output metric for training data, alias: tranining_metric, train_metric 33 | is_training_metric = true 34 | 35 | # number of bins for feature bucket, 255 is a recommend setting, it can save memories, and also has good accuracy. 36 | max_bin = 255 37 | 38 | # training data 39 | # if exsting weight file, should name to "regression.train.weight" 40 | # alias: train_data, train 41 | data = multiclass.train 42 | 43 | # valid data 44 | valid_data = multiclass.test 45 | 46 | # round for early stopping 47 | early_stopping = 10 48 | 49 | # number of trees(iterations), alias: num_tree, num_iteration, num_iterations, num_round, num_rounds 50 | num_trees = 100 51 | 52 | # shrinkage rate , alias: shrinkage_rate 53 | learning_rate = 0.05 54 | 55 | # number of leaves for one tree, alias: num_leaf 56 | num_leaves = 31 57 | -------------------------------------------------------------------------------- /LightGBM_demo/parallel_learning/README.md: -------------------------------------------------------------------------------- 1 | Parallel Learning Example 2 | ========================= 3 | 4 | Here is an example for LightGBM to perform parallel learning for 2 machines. 5 | 6 | 1. Edit mlist.txt, write the ip of these 2 machines that you want to run application on. 7 | 8 | ``` 9 | machine1_ip 12400 10 | machine2_ip 12400 11 | ``` 12 | 13 | 2. Copy this folder and executable file to these 2 machines that you want to run application on. 14 | 15 | 3. Run command in this folder on both 2 machines: 16 | 17 | For Windows: ```lightgbm.exe config=train.conf``` 18 | 19 | For Linux: ```./lightgbm config=train.conf``` 20 | 21 | This parallel learning example is based on socket. LightGBM also support parallel learning based on mpi. 
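Compared with the single-machine `binary_classification` example, the `train.conf` in this folder mainly differs in its parallel-related settings, roughly the lines below (shown here for quick reference; see the full file for the remaining options):

```
# parallel-related settings (other options match the binary_classification example)
tree_learner = feature
num_machines = 2
local_listen_port = 12400
machine_list_file = mlist.txt
```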
22 | 23 | For more details about the usage of parallel learning, please refer to [this](https://github.com/Microsoft/LightGBM/blob/master/docs/Parallel-Learning-Guide.rst). 24 | -------------------------------------------------------------------------------- /LightGBM_demo/parallel_learning/predict.conf: -------------------------------------------------------------------------------- 1 | 2 | task = predict 3 | 4 | data = binary.test 5 | 6 | input_model= LightGBM_model.txt 7 | 8 | -------------------------------------------------------------------------------- /LightGBM_demo/parallel_learning/train.conf: -------------------------------------------------------------------------------- 1 | # task type, support train and predict 2 | task = train 3 | 4 | # boosting type, support gbdt for now, alias: boosting, boost 5 | boosting_type = gbdt 6 | 7 | # application type, support following application 8 | # regression , regression task 9 | # binary , binary classification task 10 | # lambdarank , lambdarank task 11 | # alias: application, app 12 | objective = binary 13 | 14 | # eval metrics, support multi metric, delimite by ',' , support following metrics 15 | # l1 16 | # l2 , default metric for regression 17 | # ndcg , default metric for lambdarank 18 | # auc 19 | # binary_logloss , default metric for binary 20 | # binary_error 21 | metric = binary_logloss,auc 22 | 23 | # frequence for metric output 24 | metric_freq = 1 25 | 26 | # true if need output metric for training data, alias: tranining_metric, train_metric 27 | is_training_metric = true 28 | 29 | # number of bins for feature bucket, 255 is a recommend setting, it can save memories, and also has good accuracy. 30 | max_bin = 255 31 | 32 | # training data 33 | # if exsting weight file, should name to "binary.train.weight" 34 | # alias: train_data, train 35 | data = binary.train 36 | 37 | # validation data, support multi validation data, separated by ',' 38 | # if exsting weight file, should name to "binary.test.weight" 39 | # alias: valid, test, test_data, 40 | valid_data = binary.test 41 | 42 | # number of trees(iterations), alias: num_tree, num_iteration, num_iterations, num_round, num_rounds 43 | num_trees = 100 44 | 45 | # shrinkage rate , alias: shrinkage_rate 46 | learning_rate = 0.1 47 | 48 | # number of leaves for one tree, alias: num_leaf 49 | num_leaves = 63 50 | 51 | # type of tree learner, support following types: 52 | # serial , single machine version 53 | # feature , use feature parallel to train 54 | # data , use data parallel to train 55 | # voting , use voting based parallel to train 56 | # alias: tree 57 | tree_learner = feature 58 | 59 | # number of threads for multi-threading. One thread will use one CPU, defalut is setted to #cpu. 
60 | # num_threads = 8 61 | 62 | # feature sub-sample, will random select 80% feature to train on each iteration 63 | # alias: sub_feature 64 | feature_fraction = 0.8 65 | 66 | # Support bagging (data sub-sample), will perform bagging every 5 iterations 67 | bagging_freq = 5 68 | 69 | # Bagging farction, will random select 80% data on bagging 70 | # alias: sub_row 71 | bagging_fraction = 0.8 72 | 73 | # minimal number data for one leaf, use this to deal with over-fit 74 | # alias : min_data_per_leaf, min_data 75 | min_data_in_leaf = 50 76 | 77 | # minimal sum hessians for one leaf, use this to deal with over-fit 78 | min_sum_hessian_in_leaf = 5.0 79 | 80 | # save memory and faster speed for sparse feature, alias: is_sparse 81 | is_enable_sparse = true 82 | 83 | # when data is bigger than memory size, set this to true. otherwise set false will have faster speed 84 | # alias: two_round_loading, two_round 85 | use_two_round_loading = false 86 | 87 | # true if need to save data to binary file and application will auto load data from binary file next time 88 | # alias: is_save_binary, save_binary 89 | is_save_binary_file = false 90 | 91 | # output model file 92 | output_model = LightGBM_model.txt 93 | 94 | # support continuous train from trained gbdt model 95 | # input_model= trained_model.txt 96 | 97 | # output prediction file for predict task 98 | # output_result= prediction.txt 99 | 100 | # support continuous train from initial score file 101 | # input_init_score= init_score.txt 102 | 103 | 104 | # number of machines in parallel training, alias: num_machine 105 | num_machines = 2 106 | 107 | # local listening port in parallel training, alias: local_port 108 | local_listen_port = 12400 109 | 110 | # machines list file for parallel training, alias: mlist 111 | machine_list_file = mlist.txt 112 | -------------------------------------------------------------------------------- /LightGBM_demo/python-guide/README.md: -------------------------------------------------------------------------------- 1 | Python Package Examples 2 | ======================= 3 | 4 | Here is an example for LightGBM to use Python-package. 5 | 6 | You should install LightGBM [Python-package](https://github.com/Microsoft/LightGBM/tree/master/python-package) first. 7 | 8 | You also need scikit-learn, pandas and matplotlib (only for plot example) to run the examples, but they are not required for the package itself. 
You can install them with pip: 9 | 10 | ``` 11 | pip install scikit-learn pandas matplotlib -U 12 | ``` 13 | 14 | Now you can run examples in this folder, for example: 15 | 16 | ``` 17 | python simple_example.py 18 | ``` 19 | 20 | Examples include: 21 | 22 | - [simple_example.py](https://github.com/Microsoft/LightGBM/blob/master/examples/python-guide/simple_example.py) 23 | - Construct Dataset 24 | - Basic train and predict 25 | - Eval during training 26 | - Early stopping 27 | - Save model to file 28 | - [sklearn_example.py](https://github.com/Microsoft/LightGBM/blob/master/examples/python-guide/sklearn_example.py) 29 | - Basic train and predict with sklearn interface 30 | - Feature importances with sklearn interface 31 | - [advanced_example.py](https://github.com/Microsoft/LightGBM/blob/master/examples/python-guide/advanced_example.py) 32 | - Set feature names 33 | - Directly use categorical features without one-hot encoding 34 | - Dump model to json format 35 | - Get feature importances 36 | - Get feature names 37 | - Load model to predict 38 | - Dump and load model with pickle 39 | - Load model file to continue training 40 | - Change learning rates during training 41 | - Self-defined objective function 42 | - Self-defined eval metric 43 | - Callback function 44 | -------------------------------------------------------------------------------- /LightGBM_demo/python-guide/advanced_example.py: -------------------------------------------------------------------------------- 1 | # coding: utf-8 2 | # pylint: disable = invalid-name, C0111 3 | import json 4 | import lightgbm as lgb 5 | import pandas as pd 6 | import numpy as np 7 | from sklearn.metrics import mean_squared_error 8 | 9 | try: 10 | import cPickle as pickle 11 | except: 12 | import pickle 13 | 14 | # load or create your dataset 15 | print('Load data...') 16 | df_train = pd.read_csv('../binary_classification/binary.train', header=None, sep='\t') 17 | df_test = pd.read_csv('../binary_classification/binary.test', header=None, sep='\t') 18 | W_train = pd.read_csv('../binary_classification/binary.train.weight', header=None)[0] 19 | W_test = pd.read_csv('../binary_classification/binary.test.weight', header=None)[0] 20 | 21 | y_train = df_train[0].values 22 | y_test = df_test[0].values 23 | X_train = df_train.drop(0, axis=1).values 24 | X_test = df_test.drop(0, axis=1).values 25 | 26 | num_train, num_feature = X_train.shape 27 | 28 | # create dataset for lightgbm 29 | # if you want to re-use data, remember to set free_raw_data=False 30 | lgb_train = lgb.Dataset(X_train, y_train, 31 | weight=W_train, free_raw_data=False) 32 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train, 33 | weight=W_test, free_raw_data=False) 34 | 35 | # specify your configurations as a dict 36 | params = { 37 | 'boosting_type': 'gbdt', 38 | 'objective': 'binary', 39 | 'metric': 'binary_logloss', 40 | 'num_leaves': 31, 41 | 'learning_rate': 0.05, 42 | 'feature_fraction': 0.9, 43 | 'bagging_fraction': 0.8, 44 | 'bagging_freq': 5, 45 | 'verbose': 0 46 | } 47 | 48 | # generate a feature name 49 | feature_name = ['feature_' + str(col) for col in range(num_feature)] 50 | 51 | print('Start training...') 52 | # feature_name and categorical_feature 53 | gbm = lgb.train(params, 54 | lgb_train, 55 | num_boost_round=10, 56 | valid_sets=lgb_train, # eval training data 57 | feature_name=feature_name, 58 | categorical_feature=[21]) 59 | 60 | # check feature name 61 | print('Finish first 10 rounds...') 62 | print('7th feature name is:', repr(lgb_train.feature_name[6])) 
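# Note: in the lgb.train() call above, feature_name attaches readable names to the
# columns, and categorical_feature=[21] tells LightGBM to treat the feature at column
# index 21 as categorical directly, without one-hot encoding (see this folder's README).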
63 | 64 | # save model to file 65 | gbm.save_model('model.txt') 66 | 67 | # dump model to json (and save to file) 68 | print('Dump model to JSON...') 69 | model_json = gbm.dump_model() 70 | 71 | with open('model.json', 'w+') as f: 72 | json.dump(model_json, f, indent=4) 73 | 74 | # feature names 75 | print('Feature names:', gbm.feature_name()) 76 | 77 | # feature importances 78 | print('Feature importances:', list(gbm.feature_importance())) 79 | 80 | # load model to predict 81 | print('Load model to predict') 82 | bst = lgb.Booster(model_file='model.txt') 83 | # can only predict with the best iteration (or the saving iteration) 84 | y_pred = bst.predict(X_test) 85 | # eval with loaded model 86 | print('The rmse of loaded model\'s prediction is:', mean_squared_error(y_test, y_pred) ** 0.5) 87 | 88 | # dump model with pickle 89 | with open('model.pkl', 'wb') as fout: 90 | pickle.dump(gbm, fout) 91 | # load model with pickle to predict 92 | with open('model.pkl', 'rb') as fin: 93 | pkl_bst = pickle.load(fin) 94 | # can predict with any iteration when loaded in pickle way 95 | y_pred = pkl_bst.predict(X_test, num_iteration=7) 96 | # eval with loaded model 97 | print('The rmse of pickled model\'s prediction is:', mean_squared_error(y_test, y_pred) ** 0.5) 98 | 99 | # continue training 100 | # init_model accepts: 101 | # 1. model file name 102 | # 2. Booster() 103 | gbm = lgb.train(params, 104 | lgb_train, 105 | num_boost_round=10, 106 | init_model='model.txt', 107 | valid_sets=lgb_eval) 108 | 109 | print('Finish 10 - 20 rounds with model file...') 110 | 111 | # decay learning rates 112 | # learning_rates accepts: 113 | # 1. list/tuple with length = num_boost_round 114 | # 2. function(curr_iter) 115 | gbm = lgb.train(params, 116 | lgb_train, 117 | num_boost_round=10, 118 | init_model=gbm, 119 | learning_rates=lambda iter: 0.05 * (0.99 ** iter), 120 | valid_sets=lgb_eval) 121 | 122 | print('Finish 20 - 30 rounds with decay learning rates...') 123 | 124 | # change other parameters during training 125 | gbm = lgb.train(params, 126 | lgb_train, 127 | num_boost_round=10, 128 | init_model=gbm, 129 | valid_sets=lgb_eval, 130 | callbacks=[lgb.reset_parameter(bagging_fraction=[0.7] * 5 + [0.6] * 5)]) 131 | 132 | print('Finish 30 - 40 rounds with changing bagging_fraction...') 133 | 134 | 135 | # self-defined objective function 136 | # f(preds: array, train_data: Dataset) -> grad: array, hess: array 137 | # log likelihood loss 138 | def loglikelood(preds, train_data): 139 | labels = train_data.get_label() 140 | preds = 1. / (1. + np.exp(-preds)) 141 | grad = preds - labels 142 | hess = preds * (1. 
- preds) 143 | return grad, hess 144 | 145 | 146 | # self-defined eval metric 147 | # f(preds: array, train_data: Dataset) -> name: string, value: array, is_higher_better: bool 148 | # binary error 149 | def binary_error(preds, train_data): 150 | labels = train_data.get_label() 151 | return 'error', np.mean(labels != (preds > 0.5)), False 152 | 153 | 154 | gbm = lgb.train(params, 155 | lgb_train, 156 | num_boost_round=10, 157 | init_model=gbm, 158 | fobj=loglikelood, 159 | feval=binary_error, 160 | valid_sets=lgb_eval) 161 | 162 | print('Finish 40 - 50 rounds with self-defined objective function and eval metric...') 163 | 164 | print('Start a new training job...') 165 | 166 | 167 | # callback 168 | def reset_metrics(): 169 | def callback(env): 170 | lgb_eval_new = lgb.Dataset(X_test, y_test, reference=lgb_train) 171 | if env.iteration - env.begin_iteration == 5: 172 | print('Add a new valid dataset at iteration 5...') 173 | env.model.add_valid(lgb_eval_new, 'new valid') 174 | callback.before_iteration = True 175 | callback.order = 0 176 | return callback 177 | 178 | 179 | gbm = lgb.train(params, 180 | lgb_train, 181 | num_boost_round=10, 182 | valid_sets=lgb_train, 183 | callbacks=[reset_metrics()]) 184 | 185 | print('Finish first 10 rounds with callback function...') 186 | -------------------------------------------------------------------------------- /LightGBM_demo/python-guide/model.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fuqiuai/learning_notes/eeecad3a5beff499add3e267111c3e38bcf2b638/LightGBM_demo/python-guide/model.pkl -------------------------------------------------------------------------------- /LightGBM_demo/python-guide/plot_example.py: -------------------------------------------------------------------------------- 1 | # coding: utf-8 2 | # pylint: disable = invalid-name, C0111 3 | import lightgbm as lgb 4 | import pandas as pd 5 | 6 | try: 7 | import matplotlib.pyplot as plt 8 | except ImportError: 9 | raise ImportError('You need to install matplotlib for plot_example.py.') 10 | 11 | # load or create your dataset 12 | print('Load data...') 13 | df_train = pd.read_csv('../regression/regression.train', header=None, sep='\t') 14 | df_test = pd.read_csv('../regression/regression.test', header=None, sep='\t') 15 | 16 | y_train = df_train[0].values 17 | y_test = df_test[0].values 18 | X_train = df_train.drop(0, axis=1).values 19 | X_test = df_test.drop(0, axis=1).values 20 | 21 | # create dataset for lightgbm 22 | lgb_train = lgb.Dataset(X_train, y_train) 23 | lgb_test = lgb.Dataset(X_test, y_test, reference=lgb_train) 24 | 25 | # specify your configurations as a dict 26 | params = { 27 | 'num_leaves': 5, 28 | 'metric': ('l1', 'l2'), 29 | 'verbose': 0 30 | } 31 | 32 | evals_result = {} # to record eval results for plotting 33 | 34 | print('Start training...') 35 | # train 36 | gbm = lgb.train(params, 37 | lgb_train, 38 | num_boost_round=100, 39 | valid_sets=[lgb_train, lgb_test], 40 | feature_name=['f' + str(i + 1) for i in range(28)], 41 | categorical_feature=[21], 42 | evals_result=evals_result, 43 | verbose_eval=10) 44 | 45 | print('Plot metrics during training...') 46 | ax = lgb.plot_metric(evals_result, metric='l1') 47 | plt.show() 48 | 49 | print('Plot feature importances...') 50 | ax = lgb.plot_importance(gbm, max_num_features=10) 51 | plt.show() 52 | 53 | print('Plot 84th tree...') # one tree use categorical feature to split 54 | ax = lgb.plot_tree(gbm, tree_index=83, figsize=(20, 8), 
show_info=['split_gain']) 55 | plt.show() 56 | 57 | print('Plot 84th tree with graphviz...') 58 | graph = lgb.create_tree_digraph(gbm, tree_index=83, name='Tree84') 59 | graph.render(view=True) 60 | -------------------------------------------------------------------------------- /LightGBM_demo/python-guide/simple_example.py: -------------------------------------------------------------------------------- 1 | # coding: utf-8 2 | # pylint: disable = invalid-name, C0111 3 | import json 4 | import lightgbm as lgb 5 | import pandas as pd 6 | from sklearn.metrics import mean_squared_error 7 | 8 | 9 | # load or create your dataset 10 | print('Load data...') 11 | df_train = pd.read_csv('../regression/regression.train', header=None, sep='\t') 12 | df_test = pd.read_csv('../regression/regression.test', header=None, sep='\t') 13 | 14 | y_train = df_train[0].values 15 | y_test = df_test[0].values 16 | X_train = df_train.drop(0, axis=1).values 17 | X_test = df_test.drop(0, axis=1).values 18 | 19 | # create dataset for lightgbm 20 | lgb_train = lgb.Dataset(X_train, y_train) 21 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train) 22 | 23 | # specify your configurations as a dict 24 | params = { 25 | 'task': 'train', 26 | 'boosting_type': 'gbdt', 27 | 'objective': 'regression', 28 | 'metric': {'l2', 'auc'}, 29 | 'num_leaves': 31, 30 | 'learning_rate': 0.05, 31 | 'feature_fraction': 0.9, 32 | 'bagging_fraction': 0.8, 33 | 'bagging_freq': 5, 34 | 'verbose': 0 35 | } 36 | 37 | print('Start training...') 38 | # train 39 | gbm = lgb.train(params, 40 | lgb_train, 41 | num_boost_round=20, 42 | valid_sets=lgb_eval, 43 | early_stopping_rounds=5) 44 | 45 | print('Save model...') 46 | # save model to file 47 | gbm.save_model('model.txt') 48 | 49 | print('Start predicting...') 50 | # predict 51 | y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration) 52 | # eval 53 | print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5) 54 | -------------------------------------------------------------------------------- /LightGBM_demo/python-guide/sklearn_example.py: -------------------------------------------------------------------------------- 1 | # coding: utf-8 2 | # pylint: disable = invalid-name, C0111 3 | import lightgbm as lgb 4 | import pandas as pd 5 | from sklearn.metrics import mean_squared_error 6 | from sklearn.model_selection import GridSearchCV 7 | 8 | # load or create your dataset 9 | print('Load data...') 10 | df_train = pd.read_csv('../regression/regression.train', header=None, sep='\t') 11 | df_test = pd.read_csv('../regression/regression.test', header=None, sep='\t') 12 | 13 | y_train = df_train[0].values 14 | y_test = df_test[0].values 15 | X_train = df_train.drop(0, axis=1).values 16 | X_test = df_test.drop(0, axis=1).values 17 | 18 | print('Start training...') 19 | # train 20 | gbm = lgb.LGBMRegressor(objective='regression', 21 | num_leaves=31, 22 | learning_rate=0.05, 23 | n_estimators=20) 24 | gbm.fit(X_train, y_train, 25 | eval_set=[(X_test, y_test)], 26 | eval_metric='l1', 27 | early_stopping_rounds=5) 28 | 29 | print('Start predicting...') 30 | # predict 31 | y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_) 32 | # eval 33 | print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5) 34 | 35 | # feature importances 36 | print('Feature importances:', list(gbm.feature_importances_)) 37 | 38 | # other scikit-learn modules 39 | estimator = lgb.LGBMRegressor(num_leaves=31) 40 | 41 | param_grid = { 42 | 'learning_rate': [0.01, 0.1, 1], 
43 | 'n_estimators': [20, 40] 44 | } 45 | 46 | gbm = GridSearchCV(estimator, param_grid) 47 | 48 | gbm.fit(X_train, y_train) 49 | 50 | print('Best parameters found by grid search are:', gbm.best_params_) 51 | -------------------------------------------------------------------------------- /LightGBM_demo/regression/README.md: -------------------------------------------------------------------------------- 1 | Regression Example 2 | ================== 3 | 4 | Here is an example for LightGBM to run regression task. 5 | 6 | ***You should copy executable file to this folder first.*** 7 | 8 | Training 9 | -------- 10 | 11 | For Windows, by running following command in this folder: 12 | 13 | ``` 14 | lightgbm.exe config=train.conf 15 | ``` 16 | 17 | For Linux, by running following command in this folder: 18 | 19 | ``` 20 | ./lightgbm config=train.conf 21 | ``` 22 | 23 | Prediction 24 | ---------- 25 | 26 | You should finish training first. 27 | 28 | For Windows, by running following command in this folder: 29 | 30 | ``` 31 | lightgbm.exe config=predict.conf 32 | ``` 33 | 34 | For Linux, by running following command in this folder: 35 | 36 | ``` 37 | ./lightgbm config=predict.conf 38 | ``` 39 | -------------------------------------------------------------------------------- /LightGBM_demo/regression/predict.conf: -------------------------------------------------------------------------------- 1 | 2 | task = predict 3 | 4 | data = regression.test 5 | 6 | input_model= LightGBM_model.txt 7 | -------------------------------------------------------------------------------- /LightGBM_demo/regression/regression.test.init: -------------------------------------------------------------------------------- 1 | 0.039 2 | 0.187 3 | 0.831 4 | 0.767 5 | 0.351 6 | 0.377 7 | 0.534 8 | 0.000 9 | 0.241 10 | 0.208 11 | 0.250 12 | 0.806 13 | 0.280 14 | 0.192 15 | 0.504 16 | 0.866 17 | 0.241 18 | 0.079 19 | 0.356 20 | 0.748 21 | 0.551 22 | 0.817 23 | 0.960 24 | 0.793 25 | 0.604 26 | 0.493 27 | 0.040 28 | 0.984 29 | 0.383 30 | 0.152 31 | 0.667 32 | 0.284 33 | 0.586 34 | 0.587 35 | 0.446 36 | 0.836 37 | 0.265 38 | 0.449 39 | 0.538 40 | 0.664 41 | 0.784 42 | 0.395 43 | 0.646 44 | 0.151 45 | 0.933 46 | 0.383 47 | 0.730 48 | 0.020 49 | 0.205 50 | 0.487 51 | 0.878 52 | 0.527 53 | 0.930 54 | 0.484 55 | 0.490 56 | 0.120 57 | 0.803 58 | 0.247 59 | 0.900 60 | 0.911 61 | 0.943 62 | 0.520 63 | 0.677 64 | 0.779 65 | 0.131 66 | 0.601 67 | 0.034 68 | 0.498 69 | 0.155 70 | 0.183 71 | 0.365 72 | 0.432 73 | 0.623 74 | 0.074 75 | 0.504 76 | 0.183 77 | 0.574 78 | 0.637 79 | 0.557 80 | 0.738 81 | 0.336 82 | 0.765 83 | 0.433 84 | 0.484 85 | 0.648 86 | 0.018 87 | 0.654 88 | 0.619 89 | 0.310 90 | 0.086 91 | 0.091 92 | 0.923 93 | 0.689 94 | 0.127 95 | 0.357 96 | 0.592 97 | 0.836 98 | 0.044 99 | 0.237 100 | 0.890 101 | 0.009 102 | 0.201 103 | 0.959 104 | 0.613 105 | 0.262 106 | 0.067 107 | 0.028 108 | 0.245 109 | 0.881 110 | 0.416 111 | 0.720 112 | 0.918 113 | 0.408 114 | 0.191 115 | 0.517 116 | 0.908 117 | 0.804 118 | 0.066 119 | 0.693 120 | 0.572 121 | 0.907 122 | 0.122 123 | 0.534 124 | 0.879 125 | 0.410 126 | 0.482 127 | 0.070 128 | 0.278 129 | 0.325 130 | 0.945 131 | 0.283 132 | 0.461 133 | 0.671 134 | 0.162 135 | 0.486 136 | 0.739 137 | 0.867 138 | 0.626 139 | 0.669 140 | 0.126 141 | 0.946 142 | 0.133 143 | 0.775 144 | 0.265 145 | 0.934 146 | 0.720 147 | 0.754 148 | 0.219 149 | 0.443 150 | 0.618 151 | 0.770 152 | 0.104 153 | 0.962 154 | 0.890 155 | 0.270 156 | 0.823 157 | 0.518 158 | 0.462 159 | 0.314 160 | 0.581 161 | 0.730 162 | 0.411 
163 | 0.629 164 | 0.699 165 | 0.711 166 | 0.052 167 | 0.860 168 | 0.458 169 | 0.262 170 | 0.242 171 | 0.483 172 | 0.887 173 | 0.378 174 | 0.750 175 | 0.097 176 | 0.476 177 | 0.992 178 | 0.770 179 | 0.211 180 | 0.501 181 | 0.234 182 | 0.410 183 | 0.780 184 | 0.771 185 | 0.228 186 | 0.922 187 | 0.593 188 | 0.380 189 | 0.502 190 | 0.605 191 | 0.560 192 | 0.486 193 | 0.505 194 | 0.176 195 | 0.813 196 | 0.542 197 | 0.131 198 | 0.766 199 | 0.932 200 | 0.947 201 | 0.369 202 | 0.136 203 | 0.518 204 | 0.113 205 | 0.934 206 | 0.184 207 | 0.253 208 | 0.407 209 | 0.383 210 | 0.795 211 | 0.456 212 | 0.171 213 | 0.267 214 | 0.509 215 | 0.147 216 | 0.612 217 | 0.566 218 | 0.715 219 | 0.938 220 | 0.912 221 | 0.946 222 | 0.245 223 | 0.132 224 | 0.302 225 | 0.895 226 | 0.972 227 | 0.859 228 | 0.110 229 | 0.947 230 | 0.423 231 | 0.009 232 | 0.442 233 | 0.046 234 | 0.544 235 | 0.339 236 | 0.473 237 | 0.613 238 | 0.869 239 | 0.662 240 | 0.434 241 | 0.819 242 | 0.906 243 | 0.120 244 | 0.532 245 | 0.285 246 | 0.047 247 | 0.669 248 | 0.863 249 | 0.163 250 | 0.812 251 | 0.853 252 | 0.914 253 | 0.265 254 | 0.904 255 | 0.321 256 | 0.552 257 | 0.051 258 | 0.044 259 | 0.720 260 | 0.444 261 | 0.256 262 | 0.190 263 | 0.670 264 | 0.000 265 | 0.806 266 | 0.079 267 | 0.191 268 | 0.386 269 | 0.485 270 | 0.355 271 | 0.321 272 | 0.964 273 | 0.642 274 | 0.023 275 | 0.430 276 | 0.875 277 | 0.301 278 | 0.095 279 | 0.758 280 | 0.606 281 | 0.570 282 | 0.054 283 | 0.140 284 | 0.623 285 | 0.208 286 | 0.504 287 | 0.545 288 | 0.284 289 | 0.948 290 | 0.842 291 | 0.722 292 | 0.078 293 | 0.106 294 | 0.493 295 | 0.161 296 | 0.978 297 | 0.159 298 | 0.487 299 | 0.364 300 | 0.639 301 | 0.129 302 | 0.430 303 | 0.275 304 | 0.888 305 | 0.041 306 | 0.914 307 | 0.833 308 | 0.298 309 | 0.789 310 | 0.031 311 | 0.967 312 | 0.527 313 | 0.303 314 | 0.363 315 | 0.066 316 | 0.989 317 | 0.039 318 | 0.655 319 | 0.443 320 | 0.949 321 | 0.246 322 | 0.532 323 | 0.482 324 | 0.703 325 | 0.068 326 | 0.194 327 | 0.215 328 | 0.738 329 | 0.189 330 | 0.573 331 | 0.215 332 | 0.862 333 | 0.942 334 | 0.518 335 | 0.352 336 | 0.234 337 | 0.050 338 | 0.269 339 | 0.654 340 | 0.534 341 | 0.944 342 | 0.396 343 | 0.694 344 | 0.489 345 | 0.513 346 | 0.268 347 | 0.455 348 | 0.471 349 | 0.707 350 | 0.941 351 | 0.329 352 | 0.042 353 | 0.496 354 | 0.544 355 | 0.168 356 | 0.760 357 | 0.985 358 | 0.946 359 | 0.197 360 | 0.875 361 | 0.704 362 | 0.454 363 | 0.541 364 | 0.850 365 | 0.480 366 | 0.373 367 | 0.493 368 | 0.579 369 | 0.189 370 | 0.901 371 | 0.674 372 | 0.633 373 | 0.099 374 | 0.604 375 | 0.121 376 | 0.079 377 | 0.527 378 | 0.403 379 | 0.589 380 | 0.089 381 | 0.431 382 | 0.175 383 | 0.987 384 | 0.561 385 | 0.687 386 | 0.325 387 | 0.095 388 | 0.976 389 | 0.286 390 | 0.424 391 | 0.650 392 | 0.025 393 | 0.810 394 | 0.537 395 | 0.278 396 | 0.062 397 | 0.162 398 | 0.895 399 | 0.686 400 | 0.250 401 | 0.066 402 | 0.691 403 | 0.572 404 | 0.405 405 | 0.364 406 | 0.217 407 | 0.670 408 | 0.971 409 | 0.176 410 | 0.597 411 | 0.424 412 | 0.447 413 | 0.254 414 | 0.825 415 | 0.485 416 | 0.543 417 | 0.305 418 | 0.182 419 | 0.086 420 | 0.714 421 | 0.196 422 | 0.690 423 | 0.390 424 | 0.416 425 | 0.469 426 | 0.368 427 | 0.101 428 | 0.310 429 | 0.664 430 | 0.666 431 | 0.286 432 | 0.460 433 | 0.193 434 | 0.210 435 | 0.023 436 | 0.897 437 | 0.211 438 | 0.228 439 | 0.280 440 | 0.127 441 | 0.639 442 | 0.075 443 | 0.134 444 | 0.645 445 | 0.340 446 | 0.708 447 | 0.557 448 | 0.256 449 | 0.651 450 | 0.116 451 | 0.536 452 | 0.437 453 | 0.268 454 | 0.604 455 | 0.871 456 | 0.999 457 | 0.608 458 | 0.405 
459 | 0.225 460 | 0.257 461 | 0.479 462 | 0.367 463 | 0.914 464 | 0.368 465 | 0.373 466 | 0.384 467 | 0.837 468 | 0.651 469 | 0.614 470 | 0.334 471 | 0.818 472 | 0.038 473 | 0.871 474 | 0.513 475 | 0.398 476 | 0.497 477 | 0.667 478 | 0.013 479 | 0.872 480 | 0.447 481 | 0.343 482 | 0.138 483 | 0.439 484 | 0.496 485 | 0.404 486 | 0.679 487 | 0.421 488 | 0.961 489 | 0.599 490 | 0.807 491 | 0.109 492 | 0.397 493 | 0.337 494 | 0.569 495 | 0.861 496 | 0.078 497 | 0.073 498 | 0.850 499 | 0.213 500 | 0.669 501 | -------------------------------------------------------------------------------- /LightGBM_demo/regression/train.conf: -------------------------------------------------------------------------------- 1 | # task type, support train and predict 2 | task = train 3 | 4 | # boosting type, support gbdt for now, alias: boosting, boost 5 | boosting_type = gbdt 6 | 7 | # application type, support following application 8 | # regression , regression task 9 | # binary , binary classification task 10 | # lambdarank , lambdarank task 11 | # alias: application, app 12 | objective = regression 13 | 14 | # eval metrics, support multi metric, delimite by ',' , support following metrics 15 | # l1 16 | # l2 , default metric for regression 17 | # ndcg , default metric for lambdarank 18 | # auc 19 | # binary_logloss , default metric for binary 20 | # binary_error 21 | metric = l2 22 | 23 | # frequence for metric output 24 | metric_freq = 1 25 | 26 | # true if need output metric for training data, alias: tranining_metric, train_metric 27 | is_training_metric = true 28 | 29 | # number of bins for feature bucket, 255 is a recommend setting, it can save memories, and also has good accuracy. 30 | max_bin = 255 31 | 32 | # training data 33 | # if exsting weight file, should name to "regression.train.weight" 34 | # alias: train_data, train 35 | data = regression.train 36 | 37 | # validation data, support multi validation data, separated by ',' 38 | # if exsting weight file, should name to "regression.test.weight" 39 | # alias: valid, test, test_data, 40 | valid_data = regression.test 41 | 42 | # number of trees(iterations), alias: num_tree, num_iteration, num_iterations, num_round, num_rounds 43 | num_trees = 100 44 | 45 | # shrinkage rate , alias: shrinkage_rate 46 | learning_rate = 0.05 47 | 48 | # number of leaves for one tree, alias: num_leaf 49 | num_leaves = 31 50 | 51 | # type of tree learner, support following types: 52 | # serial , single machine version 53 | # feature , use feature parallel to train 54 | # data , use data parallel to train 55 | # voting , use voting based parallel to train 56 | # alias: tree 57 | tree_learner = serial 58 | 59 | # number of threads for multi-threading. One thread will use one CPU, default is setted to #cpu. 
60 | # num_threads = 8 61 | 62 | # feature sub-sample, will random select 80% feature to train on each iteration 63 | # alias: sub_feature 64 | feature_fraction = 0.9 65 | 66 | # Support bagging (data sub-sample), will perform bagging every 5 iterations 67 | bagging_freq = 5 68 | 69 | # Bagging farction, will random select 80% data on bagging 70 | # alias: sub_row 71 | bagging_fraction = 0.8 72 | 73 | # minimal number data for one leaf, use this to deal with over-fit 74 | # alias : min_data_per_leaf, min_data 75 | min_data_in_leaf = 100 76 | 77 | # minimal sum hessians for one leaf, use this to deal with over-fit 78 | min_sum_hessian_in_leaf = 5.0 79 | 80 | # save memory and faster speed for sparse feature, alias: is_sparse 81 | is_enable_sparse = true 82 | 83 | # when data is bigger than memory size, set this to true. otherwise set false will have faster speed 84 | # alias: two_round_loading, two_round 85 | use_two_round_loading = false 86 | 87 | # true if need to save data to binary file and application will auto load data from binary file next time 88 | # alias: is_save_binary, save_binary 89 | is_save_binary_file = false 90 | 91 | # output model file 92 | output_model = LightGBM_model.txt 93 | 94 | # support continuous train from trained gbdt model 95 | # input_model= trained_model.txt 96 | 97 | # output prediction file for predict task 98 | # output_result= prediction.txt 99 | 100 | # support continuous train from initial score file 101 | # input_init_score= init_score.txt 102 | 103 | 104 | # number of machines in parallel training, alias: num_machine 105 | num_machines = 1 106 | 107 | # local listening port in parallel training, alias: local_port 108 | local_listen_port = 12400 109 | 110 | # machines list file for parallel training, alias: mlist 111 | machine_list_file = mlist.txt 112 | -------------------------------------------------------------------------------- /Linux下使用MySQL.md: -------------------------------------------------------------------------------- 1 | *MySQL 是最流行的关系型数据库管理系统之一,属于 Oracle 旗下产品* 2 | 3 | ## 安装启动操作 4 | 1.安装mysql命令 :$ `sudo apt-get install -y mysql-server` 5 | 2.查看mysql的版本命令(注意-V是大写,不然会出现如下错误):$ `mysql -V` 6 | 3.启动mysql命令(关闭,重启等只需将start换成stop,restart等即可):$`sudo service mysql start` 7 | 4.登录mysql命令为:$ `mysql -u用户名 -p密码` 8 | 5.连接远程数据库:$ `mysql -h -P -u -p` 9 | 10 | ## 数据库操作 11 | 1.查看数据库:> `show databases;` (注意分号“;”不要落下) 12 | 2.新建一个数据库命令:> `create database 数据库名称;` 13 |    删除一个数据库命令:> `drop database 数据库名称;` 14 | 3.使用某个数据库:> `use 数据库名称;` 15 | 16 | ## 表操作 17 | 1.查看表命令:> `show tables;` 18 | 2.建立一个新表:> `create table 表名 (字段参数);` 或 >`create table if not exists 表名(字段参数);` 19 |    删除一个旧表:> `drop table 表名;` 或 >`drop table if exists 表名;` 20 | 3.查看表结构:> `desc 表名称;` 或 >`show columns from 表名称;` 21 | 4.对表数据的操作: 22 |    增:>`insert into 表名称 (字段名1,字段名2,字段名3......) values(字段名1的值,字段名2的值,字段名3的值......);` 23 |    删:>`delete from 表名称 where 表达式;` 24 |    改:>`update 表名称 set 字段名=“新值” where 表达式;` 25 |    查:>`select 字段名1,字段名2,字段名3..... 
from 表名称;` 26 | 5.增加字段:>`alter table 表名称 add 字段名 数据类型 其它;` (其它包括默认初始值的设定等等) 27 | 6.删除字段:>`alter table 表名称 drop 字段名;` 28 | 29 | 30 | ## 用户相关操作 31 | 注:以下命令均需先以root身份登录mysql:`mysql -uroot -p` 32 | 1.添加新用户 33 | (1)创建新用户:> `insert into mysql.user(Host,User,Password) values("localhost","user1",password("password1"));` 34 | (2)为用户分配权限: 35 |             设置用户可以在本地访问mysql:`grant all privileges on *.* to username@localhost identified by "password" ;` 36 |             设置用户只能访问指定数据库:`grant all privileges on 数据库名.* to username@localhost identified by "password" ;` 37 | (3)刷新系统权限表:>`flush privileges;` 38 | 2.查看MySql当前所有的用户:>`SELECT DISTINCT User FROM mysql.user;` 39 | 3.删除用户及其数据字典中包含的数据:>`drop user 'xbb'@'localhost';` 40 | 41 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # learning_notes 2 | 3 | 一些个人的学习笔记 4 | 5 | - git学习参考[Git教程-廖雪峰的官方网站](https://www.liaoxuefeng.com/wiki/0013739516305929606dd18361248578c67b8067c8c017b000/) 6 | 7 | - Linux下使用MySQL参考[Linux系统下Mysql使用简单教程](http://www.jb51.net/article/84399.htm) 8 | 9 | - sklearn学习参考[sklearn官网](http://scikit-learn.org/stable/index.html) 10 | 11 | - XGBoost学习参考[XGBoost官网](http://xgboost.readthedocs.io/en/latest/////python/python_api.html#module-xgboost.sklearn) 12 | 13 | - LightGBM学习参考[LightGBM官网](https://lightgbm.readthedocs.io/en/latest/index.html#) 14 | 15 | - tensorflow入门学习[TensorFlow_Started.ipynb](https://github.com/fuqiuai/learning_notes/blob/master/TensorFlow_Started.ipynb) 16 | -------------------------------------------------------------------------------- /TensorFlow_Started.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# TensorFlow入门" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 119, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import tensorflow as tf\n", 19 | "import numpy as np" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "Hello,TensorFlow!" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 23, 32 | "metadata": {}, 33 | "outputs": [ 34 | { 35 | "name": "stdout", 36 | "output_type": "stream", 37 | "text": [ 38 | "b'Hello,TensorFlow!'\n" 39 | ] 40 | } 41 | ], 42 | "source": [ 43 | "hello=tf.constant(\"Hello,TensorFlow!\")\n", 44 | "sess=tf.Session()\n", 45 | "print(sess.run(hello))" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "## 1.1 张量和图\n", 53 | "\n", 54 | "TensorFlow 是一种采用数据流图(data flow graphs),用于数值计算的开源软件库。其中 **Tensor代表传递的数据为张量(多维数组)**,**Flow代表使用计算图进行运算**。数据流图用「结点」(nodes)和「边」(edges)组成的有向图来描述数学运算。**结点**一般用来表示施加的数学操作,但也可以表示数据输入的起点和输出的终点,或者是读取/写入持久变量(persistent variable)的终点。**边**表示结点之间的输入/输出关系。这些数据边可以传送维度可动态调整的多维数据数组,即张量(tensor)。\n", 55 | "\n", 56 | "下面代码是使用计算图的案例:" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 6, 62 | "metadata": {}, 63 | "outputs": [ 64 | { 65 | "name": "stdout", 66 | "output_type": "stream", 67 | "text": [ 68 | "8\n", 69 | "[[0. 0.]\n", 70 | " [0. 
0.]]\n" 71 | ] 72 | } 73 | ], 74 | "source": [ 75 | "a = tf.constant(2, tf.int16)\n", 76 | "b = tf.constant(4, tf.float32)\n", 77 | "\n", 78 | "graph = tf.Graph()\n", 79 | "with graph.as_default():\n", 80 | " a = tf.Variable(8, tf.float32)\n", 81 | " b = tf.Variable(tf.zeros([2,2], tf.float32))\n", 82 | " \n", 83 | "with tf.Session(graph=graph) as session:\n", 84 | " tf.global_variables_initializer().run()\n", 85 | " print(session.run(a))\n", 86 | " print(session.run(b))" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "在 Tensorflow 中,**所有不同的变量和运算都是储存在计算图**。所以在我们构建完模型所需要的图之后,还**需要打开一个会话(Session)来运行整个计算图**。在会话中,我们可以将所有计算分配到可用的 CPU 和 GPU 资源中。\n", 94 | "\n", 95 | "如下所示代码,我们声明两个常量 a 和 b,并且定义一个加法运算。但它并不会输出计算结果,因为我们只是定义了一张图,而没有运行它:" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 7, 101 | "metadata": {}, 102 | "outputs": [ 103 | { 104 | "name": "stdout", 105 | "output_type": "stream", 106 | "text": [ 107 | "Tensor(\"add:0\", shape=(2,), dtype=int32)\n" 108 | ] 109 | } 110 | ], 111 | "source": [ 112 | "a=tf.constant([1,2],name=\"a\")\n", 113 | "\n", 114 | "b=tf.constant([2,4],name=\"b\")\n", 115 | "\n", 116 | "result = a+b\n", 117 | "\n", 118 | "print(result)" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "下面的代码才会输出计算结果,因为我们需要创建一个会话才能管理 TensorFlow 运行时的所有资源。但计算完毕后需要关闭会话来帮助系统回收资源,不然就会出现资源泄漏的问题。下面提供了使用会话的两种方式:" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 21, 131 | "metadata": {}, 132 | "outputs": [ 133 | { 134 | "name": "stdout", 135 | "output_type": "stream", 136 | "text": [ 137 | "[3 6]\n" 138 | ] 139 | } 140 | ], 141 | "source": [ 142 | "a=tf.constant([1,2])\n", 143 | "\n", 144 | "b=tf.constant([2,4])\n", 145 | "\n", 146 | "result = a+b\n", 147 | "\n", 148 | "'''\n", 149 | "sess=tf.Session()\n", 150 | "print(sess.run(result))\n", 151 | "sess.close\n", 152 | "'''\n", 153 | "\n", 154 | "with tf.Session() as sess:\n", 155 | " print(sess.run(result))" 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "metadata": {}, 161 | "source": [ 162 | "## 1.2 常量、变量和占位符\n", 163 | "\n", 164 | "**TensorFlow 中最基本的单位是常量(constant)、变量(Variable)和占位符(placeholder)。**\n", 165 | "\n", 166 | "下面我们分别定义了常量与变量:" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 53, 172 | "metadata": {}, 173 | "outputs": [ 174 | { 175 | "name": "stdout", 176 | "output_type": "stream", 177 | "text": [ 178 | "2 \n", 179 | " [[0. 0.]\n", 180 | " [0. 
0.]] \n", 181 | " [0 0 0 0 0 0 0 0 0 0 0]\n" 182 | ] 183 | } 184 | ], 185 | "source": [ 186 | "import numpy as np\n", 187 | "\n", 188 | "a=tf.constant(2,tf.int16)\n", 189 | "b=tf.constant(8.9,tf.float32)\n", 190 | "\n", 191 | "d=tf.Variable(4,tf.int16)\n", 192 | "\n", 193 | "g = tf.constant(np.zeros(shape=(2,2), dtype=np.float32))\n", 194 | "# 等价于 g=tf.zeros([2,2],tf.float32)\n", 195 | "\n", 196 | "h = tf.zeros([11], tf.int16)\n", 197 | "i = tf.ones([2,2], tf.float32)\n", 198 | "l = tf.Variable(tf.zeros([5,6,5], tf.float32))\n", 199 | "\n", 200 | "# print(a,'\\n',d,'\\n',g,'\\n',i,'\\n',h,'\\n',l)\n", 201 | "with tf.Session() as sess:\n", 202 | " print(sess.run(a),'\\n',sess.run(g),'\\n',sess.run(h))" 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": {}, 208 | "source": [ 209 | "TensorFlow数据类型可参考:http://wiki.jikexueyuan.com/project/tensorflow-zh/resources/dims_types.html\n", 210 | "\n", 211 | "**常量(constant)**同其他语言的常量, 在赋值后不可修改。tf.constant(\n", 212 | " value,\n", 213 | " dtype=None,\n", 214 | " shape=None,\n", 215 | " name='Const',\n", 216 | " verify_shape=False\n", 217 | ")\n", 218 | "
**占位符(placeholder)**同其他语言的方法的参数, 在执行方法时设置。tf.placeholder(\n", 219 | " dtype,\n", 220 | " shape=None,\n", 221 | " name=None\n", 222 | ")\n", 223 | "
**变量(Variable)**比较特殊, 机器学习特有属性. 对于常规的算法而言, 在输入值已知的情况下, 优化变量(参数)使损失函数的值达到最小. 因而在含有优化器(Optimizer)的算法中, 变量是动态计算(变化)的. 如果在未使用优化器时, 变量仍作为普通变量. 注意: 在使用变量之前, 需要执行初始化方法, 系统才会为其赋值。tf.Variable('initial-value', name='optional-name')" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": 81, 229 | "metadata": {}, 230 | "outputs": [ 231 | { 232 | "name": "stdout", 233 | "output_type": "stream", 234 | "text": [ 235 | "Tensor(\"Const_116:0\", shape=(), dtype=float32) Tensor(\"Const_117:0\", shape=(), dtype=float32)\n", 236 | "7.5\n", 237 | "[3. 7.]\n", 238 | "linear_model: [0. 0.3 0.6 0.90000004]\n" 239 | ] 240 | } 241 | ], 242 | "source": [ 243 | "with tf.Session() as sess:\n", 244 | " # 常量\n", 245 | " node1 = tf.constant(3.0, tf.float32)\n", 246 | " node2 = tf.constant(4.0)\n", 247 | " print (node1, node2) # 只打印结点信息\n", 248 | "\n", 249 | " # 占位符\n", 250 | " a = tf.placeholder(tf.float32)\n", 251 | " b = tf.placeholder(tf.float32)\n", 252 | " adder_node = a + b # 与调用add方法类似\n", 253 | " print (sess.run(adder_node, {a: 3, b: 4.5}))\n", 254 | " print (sess.run(adder_node, {a: [1, 3], b: [2, 4]}))\n", 255 | "\n", 256 | " # 变量\n", 257 | " W = tf.Variable([.3], tf.float32)\n", 258 | " b = tf.Variable([-.3], tf.float32)\n", 259 | " x = tf.placeholder(tf.float32)\n", 260 | " # linear_model = node1 * x + node2\n", 261 | " linear_model = W * x + b\n", 262 | " sess.run(tf.global_variables_initializer()) #初始化模型参数\n", 263 | " print (\"linear_model: \", sess.run(linear_model, {x: [1, 2, 3, 4]}))" 264 | ] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "metadata": {}, 269 | "source": [ 270 | "在会话中,占位符可以使用 feed_dict 馈送数据。\n", 271 | "\n", 272 | "feed_dict 是一个字典,在字典中需要给出每一个用到的占位符的取值。在训练神经网络时需要每次提供一个批量的训练样本,如果每次迭代选取的数据要通过常量表示,那么 TensorFlow 的计算图会非常大。因为每增加一个常量,TensorFlow 都会在计算图中增加一个结点。所以说拥有几百万次迭代的神经网络会拥有极其庞大的计算图,而**占位符却可以解决这一点,它只会拥有占位符这一个结点**。\n", 273 | "\n", 274 | "下面一段代码分别展示了使用常量和占位符进行计算:" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": 92, 280 | "metadata": {}, 281 | "outputs": [ 282 | { 283 | "name": "stdout", 284 | "output_type": "stream", 285 | "text": [ 286 | "[[-0.11131823 2.3845987 ]]\n", 287 | "[[-0.11131823 2.3845987 ]]\n", 288 | "[[2. 
2.]]\n" 289 | ] 290 | }, 291 | { 292 | "data": { 293 | "text/plain": [ 294 | ">" 295 | ] 296 | }, 297 | "execution_count": 92, 298 | "metadata": {}, 299 | "output_type": "execute_result" 300 | } 301 | ], 302 | "source": [ 303 | "w1=tf.Variable(tf.random_normal([1,2],stddev=1,seed=1))\n", 304 | "\n", 305 | "#因为需要重复输入x,而每建一个x就会生成一个结点,计算图的效率会低。所以使用占位符\n", 306 | "x=tf.placeholder(tf.float32,shape=(1,2))\n", 307 | "x1=tf.constant([[0.7,0.9]])\n", 308 | "\n", 309 | "a=x+w1\n", 310 | "b=x1+w1\n", 311 | "\n", 312 | "sess=tf.Session()\n", 313 | "sess.run(tf.global_variables_initializer())\n", 314 | "\n", 315 | "#运行y时将占位符填上,feed_dict为字典,变量名不可变\n", 316 | "y_1=sess.run(a,feed_dict={x:[[0.7,0.9]]})\n", 317 | "y_2=sess.run(b)\n", 318 | "\n", 319 | "print(y_1)\n", 320 | "print(y_2)\n", 321 | "sess.close" 322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "metadata": {}, 327 | "source": [ 328 | "## 1.3 实例\n", 329 | "例1:计算多维张量欧里几得距离" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": 110, 335 | "metadata": {}, 336 | "outputs": [ 337 | { 338 | "name": "stdout", 339 | "output_type": "stream", 340 | "text": [ 341 | "the distance between [[1 2]] and [[15 16]] -> 19.79899024963379\n", 342 | "the distance between [[3 4]] and [[13 14]] -> 14.142135620117188\n", 343 | "the distance between [[5 6]] and [[11 12]] -> 8.485280990600586\n", 344 | "the distance between [[7 8]] and [[ 9 10]] -> 2.8284270763397217\n" 345 | ] 346 | } 347 | ], 348 | "source": [ 349 | "list_of_points1_ = [[1,2], [3,4], [5,6], [7,8]]\n", 350 | "list_of_points2_ = [[15,16], [13,14], [11,12], [9,10]]\n", 351 | "\n", 352 | "# list_of_points1 = np.array([np.array(elem).reshape(1,2) for elem in list_of_points1_])\n", 353 | "# list_of_points2 = np.array([np.array(elem).reshape(1,2) for elem in list_of_points2_])\n", 354 | "\n", 355 | "\n", 356 | "#graph = tf.Graph()\n", 357 | "\n", 358 | "#with graph.as_default(): \n", 359 | "\n", 360 | "def calculate_eucledian_distance(point1, point2):\n", 361 | " difference = tf.subtract(point1, point2) # 相减\n", 362 | " power2 = tf.pow(difference, tf.constant(2.0, shape=(1,2))) # 幂乘\n", 363 | " add = tf.reduce_sum(power2) # 计算一个张量的各个维度上元素的总和\n", 364 | " eucledian_distance = tf.sqrt(add) # 开根\n", 365 | " return eucledian_distance\n", 366 | "\n", 367 | "#我们使用 tf.placeholder() 创建占位符 ,在 session.run() 过程中再投递数据 \n", 368 | "point1 = tf.placeholder(tf.float32, shape=(1, 2))\n", 369 | "point2 = tf.placeholder(tf.float32, shape=(1, 2))\n", 370 | "\n", 371 | "dist = calculate_eucledian_distance(point1, point2)\n", 372 | "\n", 373 | "# 开启会话\n", 374 | "with tf.Session() as session:\n", 375 | " #tf.global_variables_initializer().run() \n", 376 | " for ii in range(len(list_of_points1)):\n", 377 | " point1_ = list_of_points1[ii]\n", 378 | " point2_ = list_of_points2[ii]\n", 379 | "\n", 380 | " #使用feed_dict将数据投入到dist中\n", 381 | " distance = session.run(dist, feed_dict={point1 : point1_, point2 : point2_})\n", 382 | " print(\"the distance between {} and {} -> {}\".format(point1_, point2_, distance))" 383 | ] 384 | }, 385 | { 386 | "cell_type": "markdown", 387 | "metadata": {}, 388 | "source": [ 389 | "例2:构建三层全连接神经网络" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": 134, 395 | "metadata": {}, 396 | "outputs": [ 397 | { 398 | "name": "stdout", 399 | "output_type": "stream", 400 | "text": [ 401 | "初始化权重为:\n", 402 | " [[-0.8113182 1.4845988 0.06532937]\n", 403 | " [-2.4427042 0.0992484 0.5912243 ]] \n", 404 | " [[-0.8113182 ]\n", 405 | " [ 1.4845988 ]\n", 406 | " [ 
0.06532937]]\n", 407 | "在迭代0次后,训练损失为0.308504\n", 408 | "在迭代1000次后,训练损失为0.0393406\n", 409 | "在迭代2000次后,训练损失为0.0182158\n", 410 | "在迭代3000次后,训练损失为0.0104779\n", 411 | "在迭代4000次后,训练损失为0.00680374\n", 412 | "在迭代5000次后,训练损失为0.00446512\n", 413 | "在迭代6000次后,训练损失为0.00296797\n", 414 | "在迭代7000次后,训练损失为0.00218553\n", 415 | "在迭代8000次后,训练损失为0.00179452\n", 416 | "在迭代9000次后,训练损失为0.0013211\n", 417 | "在迭代10000次后,训练损失为0.000957699\n" 418 | ] 419 | } 420 | ], 421 | "source": [ 422 | "# 定义变量w1,w2(权重)\n", 423 | "w1=tf.Variable(tf.random_normal([2,3],stddev=1,seed=1))\n", 424 | "w2=tf.Variable(tf.random_normal([3,1],stddev=1,seed=1))\n", 425 | "\n", 426 | "# 定义占位符x,y(样本集)\n", 427 | "x=tf.placeholder(tf.float32,shape=(None,2)) # None可以根据batch大小确定维度,在shape的一个维度上使用None\n", 428 | "y=tf.placeholder(tf.float32,shape=(None,1))\n", 429 | "\n", 430 | "# 定义ReLU激活函数\n", 431 | "a=tf.nn.relu(tf.matmul(x,w1)) #tf.nn.relu(features, name = None),这个函数的作用是计算激活函数relu,即max(features, 0)。即将矩阵中每个元素的负值置0。\n", 432 | "yhat=tf.nn.relu(tf.matmul(a,w2)) # tf.matmul为矩阵相乘,yhat为预测的值\n", 433 | "\n", 434 | "# 定义交叉熵损失函数和训练算法AdamOptimizer\n", 435 | "cross_entropy=-tf.reduce_mean(y*tf.log(tf.clip_by_value(yhat,1e-10,1.0))) #tf.clip_by_value(A, min, max):输入一个张量A,把A中的每一个元素的值都压缩在min和max之间。小于min的让它等于min,大于max的元素的值等于max。\n", 436 | "train_op=tf.train.AdamOptimizer(0.001).minimize(cross_entropy) # 学习率为0.001\n", 437 | "\n", 438 | "# 随机生成512个样本,样本特征维数为2\n", 439 | "data_size=512\n", 440 | "X = np.random.RandomState(1).rand(data_size,2) # 样本范围为[0, 1)\n", 441 | "# 生成标签,1为正样本,0为负样本\n", 442 | "Y = [[int(x1+x2<1)] for (x1,x2) in X]\n", 443 | "\n", 444 | "batch_size=10 # 每次训练读取样本个数\n", 445 | "\n", 446 | "with tf.Session() as sess:\n", 447 | " sess.run(tf.global_variables_initializer()) # 初始化\n", 448 | " print('初始化权重为:\\n',sess.run(w1),'\\n',sess.run(w2))\n", 449 | " steps=10001\n", 450 | " for i in range(steps):\n", 451 | " #选定每一个批量读取的首尾位置,确保在1个epoch(全部样本训练一次为1个epoch)内采样训练\n", 452 | " start = i*batch_size % data_size\n", 453 | " end = min(start+batch_size,data_size)\n", 454 | " sess.run(train_op,feed_dict={x:X[start:end],y:Y[start:end]}) # 开始训练\n", 455 | " if i % 1000 == 0:\n", 456 | " training_loss=sess.run(cross_entropy,feed_dict={x:X,y:Y})\n", 457 | " print(\"在迭代%d次后,训练损失为%g\"%(i,training_loss))\n", 458 | " " 459 | ] 460 | }, 461 | { 462 | "cell_type": "markdown", 463 | "metadata": {}, 464 | "source": [ 465 | "说明:上面的代码定义了一个简单的三层全连接网络(输入层、隐藏层和输出层分别为 2、3 和 1 个神经元),隐藏层和输出层的激活函数使用的是 ReLU 函数。该模型训练的样本总数为 512,每次迭代读取的批量为 10。这个简单的全连接网络以交叉熵为损失函数,并使用Adam优化算法进行权重更新。" 466 | ] 467 | }, 468 | { 469 | "cell_type": "code", 470 | "execution_count": null, 471 | "metadata": { 472 | "collapsed": true 473 | }, 474 | "outputs": [], 475 | "source": [] 476 | } 477 | ], 478 | "metadata": { 479 | "kernelspec": { 480 | "display_name": "Python 3", 481 | "language": "python", 482 | "name": "python3" 483 | }, 484 | "language_info": { 485 | "codemirror_mode": { 486 | "name": "ipython", 487 | "version": 3 488 | }, 489 | "file_extension": ".py", 490 | "mimetype": "text/x-python", 491 | "name": "python", 492 | "nbconvert_exporter": "python", 493 | "pygments_lexer": "ipython3", 494 | "version": "3.5.5" 495 | } 496 | }, 497 | "nbformat": 4, 498 | "nbformat_minor": 2 499 | } 500 | -------------------------------------------------------------------------------- /XGBoost_demo/.gitignore: -------------------------------------------------------------------------------- 1 | *.libsvm 2 | *.pkl 3 | -------------------------------------------------------------------------------- /XGBoost_demo/README.md: 
-------------------------------------------------------------------------------- 1 | Awesome XGBoost 2 | =============== 3 | This page contains a curated list of examples, tutorials, blogs about XGBoost usecases. 4 | It is inspired by [awesome-MXNet](https://github.com/dmlc/mxnet/blob/master/example/README.md), 5 | [awesome-php](https://github.com/ziadoz/awesome-php) and [awesome-machine-learning](https://github.com/josephmisiti/awesome-machine-learning). 6 | 7 | Please send a pull request if you find things that belongs to here. 8 | 9 | Contents 10 | -------- 11 | - [Code Examples](#code-examples) 12 | - [Features Walkthrough](#features-walkthrough) 13 | - [Basic Examples by Tasks](#basic-examples-by-tasks) 14 | - [Benchmarks](#benchmarks) 15 | - [Machine Learning Challenge Winning Solutions](#machine-learning-challenge-winning-solutions) 16 | - [Tutorials](#tutorials) 17 | - [Usecases](#usecases) 18 | - [Tools using XGBoost](#tools-using-xgboost) 19 | - [Awards](#awards) 20 | - [Windows Binaries](#windows-binaries) 21 | 22 | Code Examples 23 | ------------- 24 | ### Features Walkthrough 25 | 26 | This is a list of short codes introducing different functionalities of xgboost packages. 27 | 28 | * Basic walkthrough of packages 29 | [python](guide-python/basic_walkthrough.py) 30 | [R](../R-package/demo/basic_walkthrough.R) 31 | [Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/basic_walkthrough.jl) 32 | [PHP](https://github.com/bpachev/xgboost-php/blob/master/demo/titanic_demo.php) 33 | * Customize loss function, and evaluation metric 34 | [python](guide-python/custom_objective.py) 35 | [R](../R-package/demo/custom_objective.R) 36 | [Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/custom_objective.jl) 37 | * Boosting from existing prediction 38 | [python](guide-python/boost_from_prediction.py) 39 | [R](../R-package/demo/boost_from_prediction.R) 40 | [Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/boost_from_prediction.jl) 41 | * Predicting using first n trees 42 | [python](guide-python/predict_first_ntree.py) 43 | [R](../R-package/demo/predict_first_ntree.R) 44 | [Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/predict_first_ntree.jl) 45 | * Generalized Linear Model 46 | [python](guide-python/generalized_linear_model.py) 47 | [R](../R-package/demo/generalized_linear_model.R) 48 | [Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/generalized_linear_model.jl) 49 | * Cross validation 50 | [python](guide-python/cross_validation.py) 51 | [R](../R-package/demo/cross_validation.R) 52 | [Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/cross_validation.jl) 53 | * Predicting leaf indices 54 | [python](guide-python/predict_leaf_indices.py) 55 | [R](../R-package/demo/predict_leaf_indices.R) 56 | 57 | ### Basic Examples by Tasks 58 | 59 | Most of examples in this section are based on CLI or python version. 
60 | However, the parameter settings can be applied to all versions 61 | 62 | - [Binary classification](binary_classification) 63 | - [Multiclass classification](multiclass_classification) 64 | - [Regression](regression) 65 | - [Learning to Rank](rank) 66 | 67 | ### Benchmarks 68 | 69 | - [Starter script for Kaggle Higgs Boson](kaggle-higgs) 70 | - [Kaggle Tradeshift winning solution by daxiongshu](https://github.com/daxiongshu/kaggle-tradeshift-winning-solution) 71 | - [Benchmarking the most commonly used open source tools for binary classification](https://github.com/szilard/benchm-ml#boosting-gradient-boosted-treesgradient-boosting-machines) 72 | 73 | 74 | ## Machine Learning Challenge Winning Solutions 75 | 76 | XGBoost is extensively used by machine learning practitioners to create state of art data science solutions, 77 | this is a list of machine learning winning solutions with XGBoost. 78 | Please send pull requests if you find ones that are missing here. 79 | 80 | - Maksims Volkovs, Guangwei Yu and Tomi Poutanen, 1st place of the [2017 ACM RecSys challenge](http://2017.recsyschallenge.com/). Link to [paper](http://www.cs.toronto.edu/~mvolkovs/recsys2017_challenge.pdf). 81 | - Vlad Sandulescu, Mihai Chiru, 1st place of the [KDD Cup 2016 competition](https://kddcup2016.azurewebsites.net). Link to [the arxiv paper](http://arxiv.org/abs/1609.02728). 82 | - Marios Michailidis, Mathias Müller and HJ van Veen, 1st place of the [Dato Truely Native? competition](https://www.kaggle.com/c/dato-native). Link to [the Kaggle interview](http://blog.kaggle.com/2015/12/03/dato-winners-interview-1st-place-mad-professors/). 83 | - Vlad Mironov, Alexander Guschin, 1st place of the [CERN LHCb experiment Flavour of Physics competition](https://www.kaggle.com/c/flavours-of-physics). Link to [the Kaggle interview](http://blog.kaggle.com/2015/11/30/flavour-of-physics-technical-write-up-1st-place-go-polar-bears/). 84 | - Josef Slavicek, 3rd place of the [CERN LHCb experiment Flavour of Physics competition](https://www.kaggle.com/c/flavours-of-physics). Link to [the Kaggle interview](http://blog.kaggle.com/2015/11/23/flavour-of-physics-winners-interview-3rd-place-josef-slavicek/). 85 | - Mario Filho, Josef Feigl, Lucas, Gilberto, 1st place of the [Caterpillar Tube Pricing competition](https://www.kaggle.com/c/caterpillar-tube-pricing). Link to [the Kaggle interview](http://blog.kaggle.com/2015/09/22/caterpillar-winners-interview-1st-place-gilberto-josef-leustagos-mario/). 86 | - Qingchen Wang, 1st place of the [Liberty Mutual Property Inspection](https://www.kaggle.com/c/liberty-mutual-group-property-inspection-prediction). Link to [the Kaggle interview](http://blog.kaggle.com/2015/09/28/liberty-mutual-property-inspection-winners-interview-qingchen-wang/). 87 | - Chenglong Chen, 1st place of the [Crowdflower Search Results Relevance](https://www.kaggle.com/c/crowdflower-search-relevance). Link to [the winning solution](https://www.kaggle.com/c/crowdflower-search-relevance/forums/t/15186/1st-place-winner-solution-chenglong-chen/). 88 | - Alexandre Barachant (“Cat”) and Rafał Cycoń (“Dog”), 1st place of the [Grasp-and-Lift EEG Detection](https://www.kaggle.com/c/grasp-and-lift-eeg-detection). Link to [the Kaggle interview](http://blog.kaggle.com/2015/10/12/grasp-and-lift-eeg-winners-interview-1st-place-cat-dog/). 89 | - Halla Yang, 2nd place of the [Recruit Coupon Purchase Prediction Challenge](https://www.kaggle.com/c/coupon-purchase-prediction). 
Link to [the Kaggle interview](http://blog.kaggle.com/2015/10/21/recruit-coupon-purchase-winners-interview-2nd-place-halla-yang/). 90 | - Owen Zhang, 1st place of the [Avito Context Ad Clicks competition](https://www.kaggle.com/c/avito-context-ad-clicks). Link to [the Kaggle interview](http://blog.kaggle.com/2015/08/26/avito-winners-interview-1st-place-owen-zhang/). 91 | - Keiichi Kuroyanagi, 2nd place of the [Airbnb New User Bookings](https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings). Link to [the Kaggle interview](http://blog.kaggle.com/2016/03/17/airbnb-new-user-bookings-winners-interview-2nd-place-keiichi-kuroyanagi-keiku/). 92 | - Marios Michailidis, Mathias Müller and Ning Situ, 1st place [Homesite Quote Conversion](https://www.kaggle.com/c/homesite-quote-conversion). Link to [the Kaggle interview](http://blog.kaggle.com/2016/04/08/homesite-quote-conversion-winners-write-up-1st-place-kazanova-faron-clobber/). 93 | 94 | ## Talks 95 | - [XGBoost: A Scalable Tree Boosting System](http://datascience.la/xgboost-workshop-and-meetup-talk-with-tianqi-chen/) (video+slides) by Tianqi Chen at the Los Angeles Data Science meetup 96 | 97 | ## Tutorials 98 | 99 | - [Machine Learning with XGBoost on Qubole Spark Cluster](https://www.qubole.com/blog/machine-learning-xgboost-qubole-spark-cluster/) 100 | - [XGBoost Official RMarkdown Tutorials](https://xgboost.readthedocs.org/en/latest/R-package/index.html#tutorials) 101 | - [An Introduction to XGBoost R Package](http://dmlc.ml/rstats/2016/03/10/xgboost.html) by Tong He 102 | - [Open Source Tools & Data Science Competitions](http://www.slideshare.net/odsc/owen-zhangopen-sourcetoolsanddscompetitions1) by Owen Zhang - XGBoost parameter tuning tips 103 | * [Feature Importance Analysis with XGBoost in Tax audit](http://fr.slideshare.net/MichaelBENESTY/feature-importance-analysis-with-xgboost-in-tax-audit) 104 | * [Winning solution of Kaggle Higgs competition: what a single model can do](http://no2147483647.wordpress.com/2014/09/17/winning-solution-of-kaggle-higgs-competition-what-a-single-model-can-do/) 105 | - [XGBoost - eXtreme Gradient Boosting](http://www.slideshare.net/ShangxuanZhang/xgboost) by Tong He 106 | - [How to use XGBoost algorithm in R in easy steps](http://www.analyticsvidhya.com/blog/2016/01/xgboost-algorithm-easy-steps/) by TAVISH SRIVASTAVA ([Chinese Translation 中文翻译](https://segmentfault.com/a/1190000004421821) by [HarryZhu](https://segmentfault.com/u/harryprince)) 107 | - [Kaggle Solution: What’s Cooking ? 
(Text Mining Competition)](http://www.analyticsvidhya.com/blog/2015/12/kaggle-solution-cooking-text-mining-competition/) by MANISH SARASWAT 108 | - Better Optimization with Repeated Cross Validation and the XGBoost model - Machine Learning with R) by Manuel Amunategui ([Youtube Link](https://www.youtube.com/watch?v=Og7CGAfSr_Y)) ([Github Link](https://github.com/amunategui/BetterCrossValidation)) 109 | - [XGBoost Rossman Parameter Tuning](https://www.kaggle.com/khozzy/rossmann-store-sales/xgboost-parameter-tuning-template/run/90168/notebook) by [Norbert Kozlowski](https://www.kaggle.com/khozzy) 110 | - [Featurizing log data before XGBoost](http://www.slideshare.net/DataRobot/featurizing-log-data-before-xgboost) by Xavier Conort, Owen Zhang etc 111 | - [West Nile Virus Competition Benchmarks & Tutorials](http://blog.kaggle.com/2015/07/21/west-nile-virus-competition-benchmarks-tutorials/) by [Anna Montoya](http://blog.kaggle.com/author/annamontoya/) 112 | - [Ensemble Decision Tree with XGBoost](https://www.kaggle.com/binghsu/predict-west-nile-virus/xgboost-starter-code-python-0-69) by [Bing Xu](https://www.kaggle.com/binghsu) 113 | - [Notes on eXtreme Gradient Boosting](http://startup.ml/blog/xgboost) by ARSHAK NAVRUZYAN ([iPython Notebook](https://github.com/startupml/koan/blob/master/eXtreme%20Gradient%20Boosting.ipynb)) 114 | - [Complete Guide to Parameter Tuning in XGBoost](http://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/) by Aarshay Jain 115 | - [Practical XGBoost in Python online course](http://education.parrotprediction.teachable.com/courses/practical-xgboost-in-python) by Parrot Prediction 116 | - [Spark and XGBoost using Scala](http://www.elenacuoco.com/2016/10/10/scala-spark-xgboost-classification/) by Elena Cuoco 117 | ## Usecases 118 | If you have particular usecase of xgboost that you would like to highlight. 119 | Send a PR to add a one sentence description:) 120 | 121 | - XGBoost is used in [Kaggle Script](https://www.kaggle.com/scripts) to solve data science challenges. 122 | - [Seldon predictive service powered by XGBoost](http://docs.seldon.io/iris-demo.html) 123 | - XGBoost Distributed is used in [ODPS Cloud Service by Alibaba](https://yq.aliyun.com/articles/6355) (in Chinese) 124 | - XGBoost is incoporated as part of [Graphlab Create](https://dato.com/products/create/) for scalable machine learning. 125 | - [Hanjing Su](https://www.52cs.org) from Tencent data platform team: "We use distributed XGBoost for click through prediction in wechat shopping and lookalikes. The problems involve hundreds millions of users and thousands of features. XGBoost is cleanly designed and can be easily integrated into our production environment, reducing our cost in developments." 
126 | - [CNevd](https://github.com/CNevd) from autohome.com ad platform team: "Distributed XGBoost is used for click through rate prediction in our display advertising, XGBoost is highly efficient and flexible and can be easily used on our distributed platform, our ctr made a great improvement with hundred millions samples and millions features due to this awesome XGBoost" 127 | 128 | 129 | 130 | ## Tools using XGBoost 131 | 132 | - [BayesBoost](https://github.com/mpearmain/BayesBoost) - Bayesian Optimization using xgboost and sklearn API 133 | - [gp_xgboost_gridsearch](https://github.com/vatsan/gp_xgboost_gridsearch) - In-database parallel grid-search for XGBoost on [Greenplum](https://github.com/greenplum-db/gpdb) using PL/Python 134 | - [tpot](https://github.com/rhiever/tpot) - A Python tool that automatically creates and optimizes machine learning pipelines using genetic programming. 135 | 136 | ## Awards 137 | - [John Chambers Award](http://stat-computing.org/awards/jmc/winners.html) - 2016 Winner: XGBoost R Package, by Tong He (Simon Fraser University) and Tianqi Chen (University of Washington) 138 | 139 | ## Windows Binaries 140 | Unofficial windows binaries and instructions on how to use them are hosted on [Guido Tapia's blog](http://www.picnet.com.au/blogs/guido/post/2016/09/22/xgboost-windows-x64-binaries-for-download/) 141 | -------------------------------------------------------------------------------- /XGBoost_demo/binary_classification/README.md: -------------------------------------------------------------------------------- 1 | Binary Classification 2 | ===================== 3 | This is the quick start tutorial for xgboost CLI version. 4 | Here we demonstrate how to use XGBoost for a binary classification task. Before getting started, make sure you compile xgboost in the root directory of the project by typing ```make```. 5 | The script 'runexp.sh' can be used to run the demo. Here we use [mushroom dataset](https://archive.ics.uci.edu/ml/datasets/Mushroom) from UCI machine learning repository. 6 | 7 | ### Tutorial 8 | #### Generate Input Data 9 | XGBoost takes LibSVM format. An example of faked input data is below: 10 | ``` 11 | 1 101:1.2 102:0.03 12 | 0 1:2.1 10001:300 10002:400 13 | ... 14 | ``` 15 | Each line represent a single instance, and in the first line '1' is the instance label,'101' and '102' are feature indices, '1.2' and '0.03' are feature values. In the binary classification case, '1' is used to indicate positive samples, and '0' is used to indicate negative samples. We also support probability values in [0,1] as label, to indicate the probability of the instance being positive. 16 | 17 | 18 | First we will transform the dataset into classic LibSVM format and split the data into training set and test set by running: 19 | ``` 20 | python mapfeat.py 21 | python mknfold.py agaricus.txt 1 22 | ``` 23 | The two files, 'agaricus.txt.train' and 'agaricus.txt.test' will be used as training set and test set. 24 | 25 | #### Training 26 | Then we can run the training process: 27 | ``` 28 | ../../xgboost mushroom.conf 29 | ``` 30 | 31 | mushroom.conf is the configuration for both training and testing. 
Each line contains the [attribute]=[value] configuration: 32 | 33 | ```conf 34 | # General Parameters, see comment for each definition 35 | # can be gbtree or gblinear 36 | booster = gbtree 37 | # choose logistic regression loss function for binary classification 38 | objective = binary:logistic 39 | 40 | # Tree Booster Parameters 41 | # step size shrinkage 42 | eta = 1.0 43 | # minimum loss reduction required to make a further partition 44 | gamma = 1.0 45 | # minimum sum of instance weight(hessian) needed in a child 46 | min_child_weight = 1 47 | # maximum depth of a tree 48 | max_depth = 3 49 | 50 | # Task Parameters 51 | # the number of round to do boosting 52 | num_round = 2 53 | # 0 means do not save any model except the final round model 54 | save_period = 0 55 | # The path of training data 56 | data = "agaricus.txt.train" 57 | # The path of validation data, used to monitor training process, here [test] sets name of the validation set 58 | eval[test] = "agaricus.txt.test" 59 | # The path of test data 60 | test:data = "agaricus.txt.test" 61 | ``` 62 | We use the tree booster and logistic regression objective in our setting. This indicates that we accomplish our task using classic gradient boosted regression trees (GBRT), which is a promising method for binary classification. 63 | 64 | The parameters shown in the example give the most common ones that are needed to use xgboost. 65 | If you are interested in more parameter settings, the complete parameter settings and detailed descriptions are [here](../../doc/parameter.md). Besides putting the parameters in the configuration file, we can set them by passing them as arguments as below: 66 | 67 | ``` 68 | ../../xgboost mushroom.conf max_depth=6 69 | ``` 70 | This means that the parameter max_depth will be set to 6 rather than the 3 given in the conf file. When you use the command line, make sure max_depth=6 is passed in as a single argument, i.e. it must not contain spaces. When a parameter setting is provided in both the command line input and the config file, the command line setting will override the setting in the config file. 71 | 72 | In this example, we use the tree booster for gradient boosting. If you would like to use the linear booster for regression, you can keep all the parameters except booster and the tree booster parameters as below: 73 | ```conf 74 | # General Parameters 75 | # choose the linear booster 76 | booster = gblinear 77 | ... 78 | 79 | # Change Tree Booster Parameters into Linear Booster Parameters 80 | # L2 regularization term on weights, default 0 81 | lambda = 0.01 82 | # L1 regularization term on weights, default 0 83 | alpha = 0.01 84 | # L2 regularization term on bias, default 0 85 | lambda_bias = 0.01 86 | 87 | # Regression Parameters 88 | ... 89 | ``` 90 | Notes on input data handling: if ```agaricus.txt.test.buffer``` exists, xgboost automatically loads from the binary buffer when possible; this can speed up the training process when you train many times. You can disable this by setting ```use_buffer=0```. 91 | - The buffer file can also be used as standalone input, i.e. if the buffer file exists but the original agaricus.txt.test was removed, xgboost will still run. 92 | * Deviation from LibSVM input format: xgboost is compatible with LibSVM format, with the following minor differences: 93 |   - xgboost allows the feature index to start from 0 94 |   - for binary classification, the label is 1 for positive, 0 for negative, instead of +1,-1 95 |   - the feature indices in each line *do not* need to be sorted 96 | 97 | #### Get Predictions 98 | After training, we can use the output model to get predictions for the test data: 99 | ``` 100 | ../../xgboost mushroom.conf task=pred model_in=0002.model 101 | ``` 102 | For binary classification, the output predictions are probability confidence scores in [0,1], corresponding to the probability that the label is positive. 103 | 104 | #### Dump Model 105 | This is a preliminary feature; so far only tree models support text dump. XGBoost can display the tree models in text files and we can scan the model in an easy way: 106 | ``` 107 | ../../xgboost mushroom.conf task=dump model_in=0002.model name_dump=dump.raw.txt 108 | ../../xgboost mushroom.conf task=dump model_in=0002.model fmap=featmap.txt name_dump=dump.nice.txt 109 | ``` 110 | 111 | In this demo, the tree boosters obtained will be printed in dump.raw.txt and dump.nice.txt, and the latter is easier to understand because it uses the feature map featmap.txt 112 | 113 | Format of ```featmap.txt```: ```<featureid> <featurename> <q or i or int>\n```: 114 | - Feature id must be from 0 to number of features, in sorted order. 115 | - i means this feature is a binary indicator feature 116 | - q means this feature is a quantitative value, such as age or time, and can be missing 117 | - int means this feature is an integer value (when int is hinted, the decision boundary will be integer) 118 | 119 | #### Monitoring Progress 120 | When you run training, messages like the following are displayed on screen 121 | ``` 122 | tree train end, 1 roots, 12 extra nodes, 0 pruned nodes ,max_depth=3 123 | [0] test-error:0.016139 124 | boosting round 1, 0 sec elapsed 125 | 126 | tree train end, 1 roots, 10 extra nodes, 0 pruned nodes ,max_depth=3 127 | [1] test-error:0.000000 128 | ``` 129 | The messages for evaluation are printed to stderr, so if you only want to log the evaluation progress, simply type 130 | ``` 131 | ../../xgboost mushroom.conf 2>log.txt 132 | ``` 133 | Then you can find the following content in log.txt 134 | ``` 135 | [0] test-error:0.016139 136 | [1] test-error:0.000000 137 | ``` 138 | We can also monitor both training and test statistics, by adding the following lines to the configuration 139 | ```conf 140 | eval[test] = "agaricus.txt.test" 141 | eval[trainname] = "agaricus.txt.train" 142 | ``` 143 | Run the command again, and we can find the log file becomes 144 | ``` 145 | [0] test-error:0.016139 trainname-error:0.014433 146 | [1] test-error:0.000000 trainname-error:0.001228 147 | ``` 148 | The rule is eval[name-printed-in-log] = filename; the file will then be added to the monitoring process and evaluated each round. 149 | 150 | xgboost also supports monitoring multiple metrics. Suppose we also want to monitor the average log-likelihood of each prediction during training; simply add ```eval_metric=logloss``` to the configuration. Run again, and we can find the log file becomes 151 | ``` 152 | [0] test-error:0.016139 test-negllik:0.029795 trainname-error:0.014433 trainname-negllik:0.027023 153 | [1] test-error:0.000000 test-negllik:0.000000 trainname-error:0.001228 trainname-negllik:0.002457 154 | ``` 155 | #### Saving Progress Models 156 | If you want to save a model every two rounds, simply set save_period=2. You will find 0002.model in the current folder. If you want to change the output folder for models, add model_dir=foldername. By default xgboost saves the model of the last round.
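Because the model format is shared between the CLI and the language bindings, a checkpoint written this way can also be inspected from Python. Below is a minimal sketch, assuming the xgboost Python package is installed and that a ```0002.model``` checkpoint and ```agaricus.txt.test``` are present in the current folder:

```python
import xgboost as xgb

# Load the checkpoint written by the CLI (e.g. 0002.model produced with save_period=2)
bst = xgb.Booster(model_file='0002.model')

# Score the LibSVM-format test file used above; each prediction is P(label == 1)
dtest = xgb.DMatrix('agaricus.txt.test')
preds = bst.predict(dtest)
print(preds[:5])
```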
157 | 158 | #### Continue from Existing Model 159 | If you want to continue boosting from existing model, say 0002.model, use 160 | ``` 161 | ../../xgboost mushroom.conf model_in=0002.model num_round=2 model_out=continue.model 162 | ``` 163 | xgboost will load from 0002.model continue boosting for 2 rounds, and save output to continue.model. However, beware that the training and evaluation data specified in mushroom.conf should not change when you use this function. 164 | #### Use Multi-Threading 165 | When you are working with a large dataset, you may want to take advantage of parallelism. If your compiler supports OpenMP, xgboost is naturally multi-threaded, to set number of parallel running add ```nthread``` parameter to you configuration. 166 | Eg. ```nthread=10``` 167 | 168 | Set nthread to be the number of your real cpu (On Unix, this can be found using ```lscpu```) 169 | Some systems will have ```Thread(s) per core = 2```, for example, a 4 core cpu with 8 threads, in such case set ```nthread=4``` and not 8. 170 | 171 | -------------------------------------------------------------------------------- /XGBoost_demo/binary_classification/agaricus-lepiota.fmap: -------------------------------------------------------------------------------- 1 | 1. cap-shape: bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s 2 | 2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s 3 | 3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y 4 | 4. bruises?: bruises=t,no=f 5 | 5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, 6 | musty=m,none=n,pungent=p,spicy=s 7 | 6. gill-attachment: attached=a,descending=d,free=f,notched=n 8 | 7. gill-spacing: close=c,crowded=w,distant=d 9 | 8. gill-size: broad=b,narrow=n 10 | 9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, 11 | green=r,orange=o,pink=p,purple=u,red=e, 12 | white=w,yellow=y 13 | 10. stalk-shape: enlarging=e,tapering=t 14 | 11. stalk-root: bulbous=b,club=c,cup=u,equal=e, 15 | rhizomorphs=z,rooted=r,missing=? 16 | 12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s 17 | 13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s 18 | 14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, 19 | pink=p,red=e,white=w,yellow=y 20 | 15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, 21 | pink=p,red=e,white=w,yellow=y 22 | 16. veil-type: partial=p,universal=u 23 | 17. veil-color: brown=n,orange=o,white=w,yellow=y 24 | 18. ring-number: none=n,one=o,two=t 25 | 19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, 26 | none=n,pendant=p,sheathing=s,zone=z 27 | 20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, 28 | orange=o,purple=u,white=w,yellow=y 29 | 21. population: abundant=a,clustered=c,numerous=n, 30 | scattered=s,several=v,solitary=y 31 | 22. habitat: grasses=g,leaves=l,meadows=m,paths=p, 32 | urban=u,waste=w,woods=d 33 | -------------------------------------------------------------------------------- /XGBoost_demo/binary_classification/agaricus-lepiota.names: -------------------------------------------------------------------------------- 1 | 1. Title: Mushroom Database 2 | 3 | 2. Sources: 4 | (a) Mushroom records drawn from The Audubon Society Field Guide to North 5 | American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred 6 | A. Knopf 7 | (b) Donor: Jeff Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu) 8 | (c) Date: 27 April 1987 9 | 10 | 3. Past Usage: 11 | 1. Schlimmer,J.S. (1987). 
Concept Acquisition Through Representational 12 | Adjustment (Technical Report 87-19). Doctoral disseration, Department 13 | of Information and Computer Science, University of California, Irvine. 14 | --- STAGGER: asymptoted to 95% classification accuracy after reviewing 15 | 1000 instances. 16 | 2. Iba,W., Wogulis,J., & Langley,P. (1988). Trading off Simplicity 17 | and Coverage in Incremental Concept Learning. In Proceedings of 18 | the 5th International Conference on Machine Learning, 73-79. 19 | Ann Arbor, Michigan: Morgan Kaufmann. 20 | -- approximately the same results with their HILLARY algorithm 21 | 3. In the following references a set of rules (given below) were 22 | learned for this data set which may serve as a point of 23 | comparison for other researchers. 24 | 25 | Duch W, Adamczak R, Grabczewski K (1996) Extraction of logical rules 26 | from training data using backpropagation networks, in: Proc. of the 27 | The 1st Online Workshop on Soft Computing, 19-30.Aug.1996, pp. 25-30, 28 | available on-line at: http://www.bioele.nuee.nagoya-u.ac.jp/wsc1/ 29 | 30 | Duch W, Adamczak R, Grabczewski K, Ishikawa M, Ueda H, Extraction of 31 | crisp logical rules using constrained backpropagation networks - 32 | comparison of two new approaches, in: Proc. of the European Symposium 33 | on Artificial Neural Networks (ESANN'97), Bruge, Belgium 16-18.4.1997, 34 | pp. xx-xx 35 | 36 | Wlodzislaw Duch, Department of Computer Methods, Nicholas Copernicus 37 | University, 87-100 Torun, Grudziadzka 5, Poland 38 | e-mail: duch@phys.uni.torun.pl 39 | WWW http://www.phys.uni.torun.pl/kmk/ 40 | 41 | Date: Mon, 17 Feb 1997 13:47:40 +0100 42 | From: Wlodzislaw Duch 43 | Organization: Dept. of Computer Methods, UMK 44 | 45 | I have attached a file containing logical rules for mushrooms. 46 | It should be helpful for other people since only in the last year I 47 | have seen about 10 papers analyzing this dataset and obtaining quite 48 | complex rules. We will try to contribute other results later. 49 | 50 | With best regards, Wlodek Duch 51 | ________________________________________________________________ 52 | 53 | Logical rules for the mushroom data sets. 54 | 55 | Logical rules given below seem to be the simplest possible for the 56 | mushroom dataset and therefore should be treated as benchmark results. 57 | 58 | Disjunctive rules for poisonous mushrooms, from most general 59 | to most specific: 60 | 61 | P_1) odor=NOT(almond.OR.anise.OR.none) 62 | 120 poisonous cases missed, 98.52% accuracy 63 | 64 | P_2) spore-print-color=green 65 | 48 cases missed, 99.41% accuracy 66 | 67 | P_3) odor=none.AND.stalk-surface-below-ring=scaly.AND. 68 | (stalk-color-above-ring=NOT.brown) 69 | 8 cases missed, 99.90% accuracy 70 | 71 | P_4) habitat=leaves.AND.cap-color=white 72 | 100% accuracy 73 | 74 | Rule P_4) may also be 75 | 76 | P_4') population=clustered.AND.cap_color=white 77 | 78 | These rule involve 6 attributes (out of 22). Rules for edible 79 | mushrooms are obtained as negation of the rules given above, for 80 | example the rule: 81 | 82 | odor=(almond.OR.anise.OR.none).AND.spore-print-color=NOT.green 83 | 84 | gives 48 errors, or 99.41% accuracy on the whole dataset. 85 | 86 | Several slightly more complex variations on these rules exist, 87 | involving other attributes, such as gill_size, gill_spacing, 88 | stalk_surface_above_ring, but the rules given above are the simplest 89 | we have found. 90 | 91 | 92 | 4. 
Relevant Information: 93 | This data set includes descriptions of hypothetical samples 94 | corresponding to 23 species of gilled mushrooms in the Agaricus and 95 | Lepiota Family (pp. 500-525). Each species is identified as 96 | definitely edible, definitely poisonous, or of unknown edibility and 97 | not recommended. This latter class was combined with the poisonous 98 | one. The Guide clearly states that there is no simple rule for 99 | determining the edibility of a mushroom; no rule like ``leaflets 100 | three, let it be'' for Poisonous Oak and Ivy. 101 | 102 | 5. Number of Instances: 8124 103 | 104 | 6. Number of Attributes: 22 (all nominally valued) 105 | 106 | 7. Attribute Information: (classes: edible=e, poisonous=p) 107 | 1. cap-shape: bell=b,conical=c,convex=x,flat=f, 108 | knobbed=k,sunken=s 109 | 2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s 110 | 3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, 111 | pink=p,purple=u,red=e,white=w,yellow=y 112 | 4. bruises?: bruises=t,no=f 113 | 5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, 114 | musty=m,none=n,pungent=p,spicy=s 115 | 6. gill-attachment: attached=a,descending=d,free=f,notched=n 116 | 7. gill-spacing: close=c,crowded=w,distant=d 117 | 8. gill-size: broad=b,narrow=n 118 | 9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, 119 | green=r,orange=o,pink=p,purple=u,red=e, 120 | white=w,yellow=y 121 | 10. stalk-shape: enlarging=e,tapering=t 122 | 11. stalk-root: bulbous=b,club=c,cup=u,equal=e, 123 | rhizomorphs=z,rooted=r,missing=? 124 | 12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s 125 | 13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s 126 | 14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, 127 | pink=p,red=e,white=w,yellow=y 128 | 15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, 129 | pink=p,red=e,white=w,yellow=y 130 | 16. veil-type: partial=p,universal=u 131 | 17. veil-color: brown=n,orange=o,white=w,yellow=y 132 | 18. ring-number: none=n,one=o,two=t 133 | 19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, 134 | none=n,pendant=p,sheathing=s,zone=z 135 | 20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, 136 | orange=o,purple=u,white=w,yellow=y 137 | 21. population: abundant=a,clustered=c,numerous=n, 138 | scattered=s,several=v,solitary=y 139 | 22. habitat: grasses=g,leaves=l,meadows=m,paths=p, 140 | urban=u,waste=w,woods=d 141 | 142 | 8. Missing Attribute Values: 2480 of them (denoted by "?"), all for 143 | attribute #11. 144 | 145 | 9. 
Class Distribution: 146 | -- edible: 4208 (51.8%) 147 | -- poisonous: 3916 (48.2%) 148 | -- total: 8124 instances 149 | -------------------------------------------------------------------------------- /XGBoost_demo/binary_classification/mapfeat.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | def loadfmap( fname ): 4 | fmap = {} 5 | nmap = {} 6 | 7 | for l in open( fname ): 8 | arr = l.split() 9 | if arr[0].find('.') != -1: 10 | idx = int( arr[0].strip('.') ) 11 | assert idx not in fmap 12 | fmap[ idx ] = {} 13 | ftype = arr[1].strip(':') 14 | content = arr[2] 15 | else: 16 | content = arr[0] 17 | for it in content.split(','): 18 | if it.strip() == '': 19 | continue 20 | k , v = it.split('=') 21 | fmap[ idx ][ v ] = len(nmap) + 1 22 | nmap[ len(nmap) ] = ftype+'='+k 23 | return fmap, nmap 24 | 25 | def write_nmap( fo, nmap ): 26 | for i in range( len(nmap) ): 27 | fo.write('%d\t%s\ti\n' % (i, nmap[i]) ) 28 | 29 | # start here 30 | fmap, nmap = loadfmap( 'agaricus-lepiota.fmap' ) 31 | fo = open( 'featmap.txt', 'w' ) 32 | write_nmap( fo, nmap ) 33 | fo.close() 34 | 35 | fo = open( 'agaricus.txt', 'w' ) 36 | for l in open( 'agaricus-lepiota.data' ): 37 | arr = l.split(',') 38 | if arr[0] == 'p': 39 | fo.write('1') 40 | else: 41 | assert arr[0] == 'e' 42 | fo.write('0') 43 | for i in range( 1,len(arr) ): 44 | fo.write( ' %d:1' % fmap[i][arr[i].strip()] ) 45 | fo.write('\n') 46 | 47 | fo.close() 48 | -------------------------------------------------------------------------------- /XGBoost_demo/binary_classification/mknfold.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import sys 3 | import random 4 | 5 | if len(sys.argv) < 2: 6 | print ('Usage: [nfold = 5]') 7 | exit(0) 8 | 9 | random.seed( 10 ) 10 | 11 | k = int( sys.argv[2] ) 12 | if len(sys.argv) > 3: 13 | nfold = int( sys.argv[3] ) 14 | else: 15 | nfold = 5 16 | 17 | fi = open( sys.argv[1], 'r' ) 18 | ftr = open( sys.argv[1]+'.train', 'w' ) 19 | fte = open( sys.argv[1]+'.test', 'w' ) 20 | for l in fi: 21 | if random.randint( 1 , nfold ) == k: 22 | fte.write( l ) 23 | else: 24 | ftr.write( l ) 25 | 26 | fi.close() 27 | ftr.close() 28 | fte.close() 29 | 30 | -------------------------------------------------------------------------------- /XGBoost_demo/binary_classification/mushroom.conf: -------------------------------------------------------------------------------- 1 | # General Parameters, see comment for each definition 2 | # choose the booster, can be gbtree or gblinear 3 | booster = gbtree 4 | # choose logistic regression loss function for binary classification 5 | objective = binary:logistic 6 | 7 | # Tree Booster Parameters 8 | # step size shrinkage 9 | eta = 1.0 10 | # minimum loss reduction required to make a further partition 11 | gamma = 1.0 12 | # minimum sum of instance weight(hessian) needed in a child 13 | min_child_weight = 1 14 | # maximum depth of a tree 15 | max_depth = 3 16 | 17 | # Task Parameters 18 | # the number of round to do boosting 19 | num_round = 2 20 | # 0 means do not save any model except the final round model 21 | save_period = 0 22 | # The path of training data 23 | data = "agaricus.txt.train" 24 | # The path of validation data, used to monitor training process, here [test] sets name of the validation set 25 | eval[test] = "agaricus.txt.test" 26 | # evaluate on training data as well each round 27 | eval_train = 1 28 | # The path of test data 29 | test:data = "agaricus.txt.test" 
30 | -------------------------------------------------------------------------------- /XGBoost_demo/binary_classification/runexp.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # map feature using indicator encoding, also produce featmap.txt 3 | python mapfeat.py 4 | # split train and test 5 | python mknfold.py agaricus.txt 1 6 | # training and output the models 7 | ../../xgboost mushroom.conf 8 | # output prediction task=pred 9 | ../../xgboost mushroom.conf task=pred model_in=0002.model 10 | # print the boosters of 00002.model in dump.raw.txt 11 | ../../xgboost mushroom.conf task=dump model_in=0002.model name_dump=dump.raw.txt 12 | # use the feature map in printing for better visualization 13 | ../../xgboost mushroom.conf task=dump model_in=0002.model fmap=featmap.txt name_dump=dump.nice.txt 14 | cat dump.nice.txt 15 | 16 | -------------------------------------------------------------------------------- /XGBoost_demo/data/README.md: -------------------------------------------------------------------------------- 1 | This folder contains processed example dataset used by the demos. 2 | Copyright of the dataset belongs to the original copyright holder 3 | -------------------------------------------------------------------------------- /XGBoost_demo/data/featmap.txt: -------------------------------------------------------------------------------- 1 | 0 cap-shape=bell i 2 | 1 cap-shape=conical i 3 | 2 cap-shape=convex i 4 | 3 cap-shape=flat i 5 | 4 cap-shape=knobbed i 6 | 5 cap-shape=sunken i 7 | 6 cap-surface=fibrous i 8 | 7 cap-surface=grooves i 9 | 8 cap-surface=scaly i 10 | 9 cap-surface=smooth i 11 | 10 cap-color=brown i 12 | 11 cap-color=buff i 13 | 12 cap-color=cinnamon i 14 | 13 cap-color=gray i 15 | 14 cap-color=green i 16 | 15 cap-color=pink i 17 | 16 cap-color=purple i 18 | 17 cap-color=red i 19 | 18 cap-color=white i 20 | 19 cap-color=yellow i 21 | 20 bruises?=bruises i 22 | 21 bruises?=no i 23 | 22 odor=almond i 24 | 23 odor=anise i 25 | 24 odor=creosote i 26 | 25 odor=fishy i 27 | 26 odor=foul i 28 | 27 odor=musty i 29 | 28 odor=none i 30 | 29 odor=pungent i 31 | 30 odor=spicy i 32 | 31 gill-attachment=attached i 33 | 32 gill-attachment=descending i 34 | 33 gill-attachment=free i 35 | 34 gill-attachment=notched i 36 | 35 gill-spacing=close i 37 | 36 gill-spacing=crowded i 38 | 37 gill-spacing=distant i 39 | 38 gill-size=broad i 40 | 39 gill-size=narrow i 41 | 40 gill-color=black i 42 | 41 gill-color=brown i 43 | 42 gill-color=buff i 44 | 43 gill-color=chocolate i 45 | 44 gill-color=gray i 46 | 45 gill-color=green i 47 | 46 gill-color=orange i 48 | 47 gill-color=pink i 49 | 48 gill-color=purple i 50 | 49 gill-color=red i 51 | 50 gill-color=white i 52 | 51 gill-color=yellow i 53 | 52 stalk-shape=enlarging i 54 | 53 stalk-shape=tapering i 55 | 54 stalk-root=bulbous i 56 | 55 stalk-root=club i 57 | 56 stalk-root=cup i 58 | 57 stalk-root=equal i 59 | 58 stalk-root=rhizomorphs i 60 | 59 stalk-root=rooted i 61 | 60 stalk-root=missing i 62 | 61 stalk-surface-above-ring=fibrous i 63 | 62 stalk-surface-above-ring=scaly i 64 | 63 stalk-surface-above-ring=silky i 65 | 64 stalk-surface-above-ring=smooth i 66 | 65 stalk-surface-below-ring=fibrous i 67 | 66 stalk-surface-below-ring=scaly i 68 | 67 stalk-surface-below-ring=silky i 69 | 68 stalk-surface-below-ring=smooth i 70 | 69 stalk-color-above-ring=brown i 71 | 70 stalk-color-above-ring=buff i 72 | 71 stalk-color-above-ring=cinnamon i 73 | 72 stalk-color-above-ring=gray i 74 | 73 
stalk-color-above-ring=orange i 75 | 74 stalk-color-above-ring=pink i 76 | 75 stalk-color-above-ring=red i 77 | 76 stalk-color-above-ring=white i 78 | 77 stalk-color-above-ring=yellow i 79 | 78 stalk-color-below-ring=brown i 80 | 79 stalk-color-below-ring=buff i 81 | 80 stalk-color-below-ring=cinnamon i 82 | 81 stalk-color-below-ring=gray i 83 | 82 stalk-color-below-ring=orange i 84 | 83 stalk-color-below-ring=pink i 85 | 84 stalk-color-below-ring=red i 86 | 85 stalk-color-below-ring=white i 87 | 86 stalk-color-below-ring=yellow i 88 | 87 veil-type=partial i 89 | 88 veil-type=universal i 90 | 89 veil-color=brown i 91 | 90 veil-color=orange i 92 | 91 veil-color=white i 93 | 92 veil-color=yellow i 94 | 93 ring-number=none i 95 | 94 ring-number=one i 96 | 95 ring-number=two i 97 | 96 ring-type=cobwebby i 98 | 97 ring-type=evanescent i 99 | 98 ring-type=flaring i 100 | 99 ring-type=large i 101 | 100 ring-type=none i 102 | 101 ring-type=pendant i 103 | 102 ring-type=sheathing i 104 | 103 ring-type=zone i 105 | 104 spore-print-color=black i 106 | 105 spore-print-color=brown i 107 | 106 spore-print-color=buff i 108 | 107 spore-print-color=chocolate i 109 | 108 spore-print-color=green i 110 | 109 spore-print-color=orange i 111 | 110 spore-print-color=purple i 112 | 111 spore-print-color=white i 113 | 112 spore-print-color=yellow i 114 | 113 population=abundant i 115 | 114 population=clustered i 116 | 115 population=numerous i 117 | 116 population=scattered i 118 | 117 population=several i 119 | 118 population=solitary i 120 | 119 habitat=grasses i 121 | 120 habitat=leaves i 122 | 121 habitat=meadows i 123 | 122 habitat=paths i 124 | 123 habitat=urban i 125 | 124 habitat=waste i 126 | 125 habitat=woods i 127 | -------------------------------------------------------------------------------- /XGBoost_demo/data/gen_autoclaims.R: -------------------------------------------------------------------------------- 1 | site <- 'http://cran.r-project.org' 2 | if (!require('dummies')) 3 | install.packages('dummies', repos=site) 4 | if (!require('insuranceData')) 5 | install.packages('insuranceData', repos=site) 6 | 7 | library(dummies) 8 | library(insuranceData) 9 | 10 | data(AutoClaims) 11 | data = AutoClaims 12 | 13 | data$STATE = as.factor(data$STATE) 14 | data$CLASS = as.factor(data$CLASS) 15 | data$GENDER = as.factor(data$GENDER) 16 | 17 | data.dummy <- dummy.data.frame(data, dummy.class='factor', omit.constants=T); 18 | write.table(data.dummy, 'autoclaims.csv', sep=',', row.names=F, col.names=F, quote=F) 19 | -------------------------------------------------------------------------------- /XGBoost_demo/distributed-training/README.md: -------------------------------------------------------------------------------- 1 | Distributed XGBoost Training 2 | ============================ 3 | This is an tutorial of Distributed XGBoost Training. 4 | Currently xgboost supports distributed training via CLI program with the configuration file. 5 | There is also plan push distributed python and other language bindings, please open an issue 6 | if you are interested in contributing. 7 | 8 | Build XGBoost with Distributed Filesystem Support 9 | ------------------------------------------------- 10 | To use distributed xgboost, you only need to turn the options on to build 11 | with distributed filesystems(HDFS or S3) in ```xgboost/make/config.mk```. 
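As a rough sketch of what that involves (the option names below are assumptions and may differ between releases, so verify them against the comments in your own copy of ```make/config.mk```), the relevant switches look roughly like:

```
# xgboost/make/config.mk -- illustrative only, check the comments in your checkout
# whether to build with HDFS support
USE_HDFS = 1
# whether to build with AWS S3 support
USE_S3 = 1
```

After changing these options, rebuild xgboost so the CLI program is linked against the corresponding filesystem libraries.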
12 | 13 | 14 | Step by Step Tutorial on AWS 15 | ---------------------------- 16 | Checkout [this tutorial](https://xgboost.readthedocs.org/en/latest/tutorials/aws_yarn.html) for running distributed xgboost. 17 | 18 | 19 | Model Analysis 20 | -------------- 21 | XGBoost is exchangeable across all bindings and platforms. 22 | This means you can use python or R to analyze the learnt model and do prediction. 23 | For example, you can use the [plot_model.ipynb](plot_model.ipynb) to visualize the learnt model. 24 | -------------------------------------------------------------------------------- /XGBoost_demo/distributed-training/mushroom.aws.conf: -------------------------------------------------------------------------------- 1 | # General Parameters, see comment for each definition 2 | # choose the booster, can be gbtree or gblinear 3 | booster = gbtree 4 | # choose logistic regression loss function for binary classification 5 | objective = binary:logistic 6 | 7 | # Tree Booster Parameters 8 | # step size shrinkage 9 | eta = 1.0 10 | # minimum loss reduction required to make a further partition 11 | gamma = 1.0 12 | # minimum sum of instance weight(hessian) needed in a child 13 | min_child_weight = 1 14 | # maximum depth of a tree 15 | max_depth = 3 16 | 17 | # Task Parameters 18 | # the number of round to do boosting 19 | num_round = 2 20 | # 0 means do not save any model except the final round model 21 | save_period = 0 22 | # The path of training data 23 | data = "s3://mybucket/xgb-demo/train" 24 | # The path of validation data, used to monitor training process, here [test] sets name of the validation set 25 | # evaluate on training data as well each round 26 | eval_train = 1 27 | 28 | -------------------------------------------------------------------------------- /XGBoost_demo/distributed-training/plot_model.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# XGBoost Model Analysis\n", 8 | "\n", 9 | "This notebook can be used to load and analysis model learnt from all xgboost bindings, including distributed training. 
" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": { 16 | "collapsed": false 17 | }, 18 | "outputs": [], 19 | "source": [ 20 | "import sys\n", 21 | "import os\n", 22 | "%matplotlib inline " 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "## Please change the ```pkg_path``` and ```model_file``` to be correct path" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": { 36 | "collapsed": false 37 | }, 38 | "outputs": [], 39 | "source": [ 40 | "pkg_path = '../../python-package/'\n", 41 | "model_file = 's3://my-bucket/xgb-demo/model/0002.model'\n", 42 | "sys.path.insert(0, pkg_path)\n", 43 | "import xgboost as xgb" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "# Plot the Feature Importance" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": { 57 | "collapsed": false 58 | }, 59 | "outputs": [], 60 | "source": [ 61 | "# plot the first two trees.\n", 62 | "bst = xgb.Booster(model_file=model_file)\n", 63 | "xgb.plot_importance(bst)" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "# Plot the First Tree" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": null, 76 | "metadata": { 77 | "collapsed": false 78 | }, 79 | "outputs": [], 80 | "source": [ 81 | "tree_id = 0\n", 82 | "xgb.to_graphviz(bst, tree_id)" 83 | ] 84 | } 85 | ], 86 | "metadata": { 87 | "kernelspec": { 88 | "display_name": "Python 2", 89 | "language": "python", 90 | "name": "python2" 91 | }, 92 | "language_info": { 93 | "codemirror_mode": { 94 | "name": "ipython", 95 | "version": 2 96 | }, 97 | "file_extension": ".py", 98 | "mimetype": "text/x-python", 99 | "name": "python", 100 | "nbconvert_exporter": "python", 101 | "pygments_lexer": "ipython2", 102 | "version": "2.7.3" 103 | } 104 | }, 105 | "nbformat": 4, 106 | "nbformat_minor": 0 107 | } 108 | -------------------------------------------------------------------------------- /XGBoost_demo/distributed-training/run_aws.sh: -------------------------------------------------------------------------------- 1 | # This is the example script to run distributed xgboost on AWS. 2 | # Change the following two lines for configuration 3 | 4 | export BUCKET=mybucket 5 | 6 | # submit the job to YARN 7 | ../../dmlc-core/tracker/dmlc-submit --cluster=yarn --num-workers=2 --worker-cores=2\ 8 | ../../xgboost mushroom.aws.conf nthread=2\ 9 | data=s3://${BUCKET}/xgb-demo/train\ 10 | eval[test]=s3://${BUCKET}/xgb-demo/test\ 11 | model_dir=s3://${BUCKET}/xgb-demo/model 12 | -------------------------------------------------------------------------------- /XGBoost_demo/gpu_acceleration/README.md: -------------------------------------------------------------------------------- 1 | # GPU Acceleration Demo 2 | 3 | This demo shows how to train a model on the [forest cover type](https://archive.ics.uci.edu/ml/datasets/covertype) dataset using GPU acceleration. The forest cover type dataset has 581,012 rows and 54 features, making it time consuming to process. We compare the run-time and accuracy of the GPU and CPU histogram algorithms. 4 | 5 | This demo requires the [GPU plug-in](https://github.com/dmlc/xgboost/tree/master/plugin/updater_gpu) to be built and installed. 6 | 7 | The dataset is automatically loaded via the sklearn script. 
8 | 9 | 10 | -------------------------------------------------------------------------------- /XGBoost_demo/gpu_acceleration/cover_type.py: -------------------------------------------------------------------------------- 1 | import xgboost as xgb 2 | import numpy as np 3 | from sklearn.datasets import fetch_covtype 4 | from sklearn.model_selection import train_test_split 5 | import time 6 | 7 | # Fetch dataset using sklearn 8 | cov = fetch_covtype() 9 | X = cov.data 10 | y = cov.target 11 | 12 | # Create 0.75/0.25 train/test split 13 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, train_size=0.75, 14 | random_state=42) 15 | 16 | # Specify sufficient boosting iterations to reach a minimum 17 | num_round = 3000 18 | 19 | # Leave most parameters as default 20 | param = {'objective': 'multi:softmax', # Specify multiclass classification 21 | 'num_class': 8, # Number of possible output classes 22 | 'tree_method': 'gpu_hist' # Use GPU accelerated algorithm 23 | } 24 | 25 | # Convert input data from numpy to XGBoost format 26 | dtrain = xgb.DMatrix(X_train, label=y_train) 27 | dtest = xgb.DMatrix(X_test, label=y_test) 28 | 29 | gpu_res = {} # Store accuracy result 30 | tmp = time.time() 31 | # Train model 32 | xgb.train(param, dtrain, num_round, evals=[(dtest, 'test')], evals_result=gpu_res) 33 | print("GPU Training Time: %s seconds" % (str(time.time() - tmp))) 34 | 35 | # Repeat for CPU algorithm 36 | tmp = time.time() 37 | param['tree_method'] = 'hist' 38 | cpu_res = {} 39 | xgb.train(param, dtrain, num_round, evals=[(dtest, 'test')], evals_result=cpu_res) 40 | print("CPU Training Time: %s seconds" % (str(time.time() - tmp))) 41 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/README.md: -------------------------------------------------------------------------------- 1 | XGBoost Python Feature Walkthrough 2 | ================================== 3 | * [Basic walkthrough of wrappers](basic_walkthrough.py) 4 | * [Cutomize loss function, and evaluation metric](custom_objective.py) 5 | * [Boosting from existing prediction](boost_from_prediction.py) 6 | * [Predicting using first n trees](predict_first_ntree.py) 7 | * [Generalized Linear Model](generalized_linear_model.py) 8 | * [Cross validation](cross_validation.py) 9 | * [Predicting leaf indices](predict_leaf_indices.py) 10 | * [Sklearn Wrapper](sklearn_examples.py) 11 | * [Sklearn Parallel](sklearn_parallel.py) 12 | * [Sklearn access evals result](sklearn_evals_result.py) 13 | * [Access evals result](evals_result.py) 14 | * [External Memory](external_memory.py) 15 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/basic_walkthrough.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import numpy as np 3 | import scipy.sparse 4 | import pickle 5 | import xgboost as xgb 6 | 7 | ### simple example 8 | # load file from text file, also binary buffer generated by xgboost 9 | dtrain = xgb.DMatrix('../data/agaricus.txt.train') 10 | dtest = xgb.DMatrix('../data/agaricus.txt.test') 11 | 12 | # specify parameters via map, definition are same as c++ version 13 | param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic'} 14 | 15 | # specify validations set to watch performance 16 | watchlist = [(dtest, 'eval'), (dtrain, 'train')] 17 | num_round = 2 18 | bst = xgb.train(param, dtrain, num_round, watchlist) 19 | 20 | # this is prediction 21 | 
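# note: with objective 'binary:logistic' the predictions below are probabilities in [0, 1],
# so the error rate is computed by thresholding them at 0.5 and comparing against the labels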
preds = bst.predict(dtest) 22 | labels = dtest.get_label() 23 | print('error=%f' % (sum(1 for i in range(len(preds)) if int(preds[i] > 0.5) != labels[i]) / float(len(preds)))) 24 | bst.save_model('0001.model') 25 | # dump model 26 | bst.dump_model('dump.raw.txt') 27 | # dump model with feature map 28 | bst.dump_model('dump.nice.txt', '../data/featmap.txt') 29 | 30 | # save dmatrix into binary buffer 31 | dtest.save_binary('dtest.buffer') 32 | # save model 33 | bst.save_model('xgb.model') 34 | # load model and data in 35 | bst2 = xgb.Booster(model_file='xgb.model') 36 | dtest2 = xgb.DMatrix('dtest.buffer') 37 | preds2 = bst2.predict(dtest2) 38 | # assert they are the same 39 | assert np.sum(np.abs(preds2 - preds)) == 0 40 | 41 | # alternatively, you can pickle the booster 42 | pks = pickle.dumps(bst2) 43 | # load model and data in 44 | bst3 = pickle.loads(pks) 45 | preds3 = bst3.predict(dtest2) 46 | # assert they are the same 47 | assert np.sum(np.abs(preds3 - preds)) == 0 48 | 49 | ### 50 | # build dmatrix from scipy.sparse 51 | print('start running example of build DMatrix from scipy.sparse CSR Matrix') 52 | labels = [] 53 | row = []; col = []; dat = [] 54 | i = 0 55 | for l in open('../data/agaricus.txt.train'): 56 | arr = l.split() 57 | labels.append(int(arr[0])) 58 | for it in arr[1:]: 59 | k,v = it.split(':') 60 | row.append(i); col.append(int(k)); dat.append(float(v)) 61 | i += 1 62 | csr = scipy.sparse.csr_matrix((dat, (row, col))) 63 | dtrain = xgb.DMatrix(csr, label=labels) 64 | watchlist = [(dtest, 'eval'), (dtrain, 'train')] 65 | bst = xgb.train(param, dtrain, num_round, watchlist) 66 | 67 | print('start running example of build DMatrix from scipy.sparse CSC Matrix') 68 | # we can also construct from csc matrix 69 | csc = scipy.sparse.csc_matrix((dat, (row, col))) 70 | dtrain = xgb.DMatrix(csc, label=labels) 71 | watchlist = [(dtest, 'eval'), (dtrain, 'train')] 72 | bst = xgb.train(param, dtrain, num_round, watchlist) 73 | 74 | print('start running example of build DMatrix from numpy array') 75 | # NOTE: npymat is numpy array, we will convert it into scipy.sparse.csr_matrix in internal implementation 76 | # then convert to DMatrix 77 | npymat = csr.todense() 78 | dtrain = xgb.DMatrix(npymat, label=labels) 79 | watchlist = [(dtest, 'eval'), (dtrain, 'train')] 80 | bst = xgb.train(param, dtrain, num_round, watchlist) 81 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/boost_from_prediction.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import numpy as np 3 | import xgboost as xgb 4 | 5 | dtrain = xgb.DMatrix('../data/agaricus.txt.train') 6 | dtest = xgb.DMatrix('../data/agaricus.txt.test') 7 | watchlist = [(dtest, 'eval'), (dtrain, 'train')] 8 | ### 9 | # advanced: start from a initial base prediction 10 | # 11 | print ('start running example to start from a initial prediction') 12 | # specify parameters via map, definition are same as c++ version 13 | param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic'} 14 | # train xgboost for 1 round 15 | bst = xgb.train(param, dtrain, 1, watchlist) 16 | # Note: we need the margin value instead of transformed prediction in set_base_margin 17 | # do predict with output_margin=True, will always give you margin values before logistic transformation 18 | ptrain = bst.predict(dtrain, output_margin=True) 19 | ptest = bst.predict(dtest, output_margin=True) 20 | dtrain.set_base_margin(ptrain) 21 | 
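# the evaluation matrix gets its own base margin too, so that the metrics reported by the
# watchlist are also computed on top of the round-1 scores rather than starting from zero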
dtest.set_base_margin(ptest) 22 | 23 | 24 | print('this is result of running from initial prediction') 25 | bst = xgb.train(param, dtrain, 1, watchlist) -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/cross_validation.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import numpy as np 3 | import xgboost as xgb 4 | 5 | ### load data in do training 6 | dtrain = xgb.DMatrix('../data/agaricus.txt.train') 7 | param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic'} 8 | num_round = 2 9 | 10 | print('running cross validation') 11 | # do cross validation, this will print result out as 12 | # [iteration] metric_name:mean_value+std_value 13 | # std_value is standard deviation of the metric 14 | xgb.cv(param, dtrain, num_round, nfold=5, 15 | metrics={'error'}, seed=0, 16 | callbacks=[xgb.callback.print_evaluation(show_stdv=True)]) 17 | 18 | print('running cross validation, disable standard deviation display') 19 | # do cross validation, this will print result out as 20 | # [iteration] metric_name:mean_value 21 | res = xgb.cv(param, dtrain, num_boost_round=10, nfold=5, 22 | metrics={'error'}, seed=0, 23 | callbacks=[xgb.callback.print_evaluation(show_stdv=False), 24 | xgb.callback.early_stop(3)]) 25 | print(res) 26 | print('running cross validation, with preprocessing function') 27 | # define the preprocessing function 28 | # used to return the preprocessed training, test data, and parameter 29 | # we can use this to do weight rescale, etc. 30 | # as a example, we try to set scale_pos_weight 31 | def fpreproc(dtrain, dtest, param): 32 | label = dtrain.get_label() 33 | ratio = float(np.sum(label == 0)) / np.sum(label == 1) 34 | param['scale_pos_weight'] = ratio 35 | return (dtrain, dtest, param) 36 | 37 | # do cross validation, for each fold 38 | # the dtrain, dtest, param will be passed into fpreproc 39 | # then the return value of fpreproc will be used to generate 40 | # results of that fold 41 | xgb.cv(param, dtrain, num_round, nfold=5, 42 | metrics={'auc'}, seed=0, fpreproc=fpreproc) 43 | 44 | ### 45 | # you can also do cross validation with cutomized loss function 46 | # See custom_objective.py 47 | ## 48 | print('running cross validation, with cutomsized loss function') 49 | def logregobj(preds, dtrain): 50 | labels = dtrain.get_label() 51 | preds = 1.0 / (1.0 + np.exp(-preds)) 52 | grad = preds - labels 53 | hess = preds * (1.0 - preds) 54 | return grad, hess 55 | def evalerror(preds, dtrain): 56 | labels = dtrain.get_label() 57 | return 'error', float(sum(labels != (preds > 0.0))) / len(labels) 58 | 59 | param = {'max_depth':2, 'eta':1, 'silent':1} 60 | # train with customized objective 61 | xgb.cv(param, dtrain, num_round, nfold=5, seed=0, 62 | obj=logregobj, feval=evalerror) 63 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/custom_objective.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import numpy as np 3 | import xgboost as xgb 4 | ### 5 | # advanced: customized loss function 6 | # 7 | print('start running example to used customized objective function') 8 | 9 | dtrain = xgb.DMatrix('../data/agaricus.txt.train') 10 | dtest = xgb.DMatrix('../data/agaricus.txt.test') 11 | 12 | # note: for customized objective function, we leave objective as default 13 | # note: what we are getting is margin value in prediction 14 | # you must 
know what you are doing 15 | param = {'max_depth': 2, 'eta': 1, 'silent': 1} 16 | watchlist = [(dtest, 'eval'), (dtrain, 'train')] 17 | num_round = 2 18 | 19 | # user define objective function, given prediction, return gradient and second order gradient 20 | # this is log likelihood loss 21 | def logregobj(preds, dtrain): 22 | labels = dtrain.get_label() 23 | preds = 1.0 / (1.0 + np.exp(-preds)) 24 | grad = preds - labels 25 | hess = preds * (1.0 - preds) 26 | return grad, hess 27 | 28 | # user defined evaluation function, return a pair metric_name, result 29 | # NOTE: when you do customized loss function, the default prediction value is margin 30 | # this may make builtin evaluation metric not function properly 31 | # for example, we are doing logistic loss, the prediction is score before logistic transformation 32 | # the builtin evaluation error assumes input is after logistic transformation 33 | # Take this in mind when you use the customization, and maybe you need write customized evaluation function 34 | def evalerror(preds, dtrain): 35 | labels = dtrain.get_label() 36 | # return a pair metric_name, result 37 | # since preds are margin(before logistic transformation, cutoff at 0) 38 | return 'error', float(sum(labels != (preds > 0.0))) / len(labels) 39 | 40 | # training with customized objective, we can also do step by step training 41 | # simply look at xgboost.py's implementation of train 42 | bst = xgb.train(param, dtrain, num_round, watchlist, logregobj, evalerror) 43 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/evals_result.py: -------------------------------------------------------------------------------- 1 | ## 2 | # This script demonstrate how to access the eval metrics in xgboost 3 | ## 4 | 5 | import xgboost as xgb 6 | dtrain = xgb.DMatrix('../data/agaricus.txt.train', silent=True) 7 | dtest = xgb.DMatrix('../data/agaricus.txt.test', silent=True) 8 | 9 | param = [('max_depth', 2), ('objective', 'binary:logistic'), ('eval_metric', 'logloss'), ('eval_metric', 'error')] 10 | 11 | num_round = 2 12 | watchlist = [(dtest,'eval'), (dtrain,'train')] 13 | 14 | evals_result = {} 15 | bst = xgb.train(param, dtrain, num_round, watchlist, evals_result=evals_result) 16 | 17 | print('Access logloss metric directly from evals_result:') 18 | print(evals_result['eval']['logloss']) 19 | 20 | print('') 21 | print('Access metrics through a loop:') 22 | for e_name, e_mtrs in evals_result.items(): 23 | print('- {}'.format(e_name)) 24 | for e_mtr_name, e_mtr_vals in e_mtrs.items(): 25 | print(' - {}'.format(e_mtr_name)) 26 | print(' - {}'.format(e_mtr_vals)) 27 | 28 | print('') 29 | print('Access complete dictionary:') 30 | print(evals_result) 31 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/external_memory.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import numpy as np 3 | import scipy.sparse 4 | import xgboost as xgb 5 | 6 | ### simple example for using external memory version 7 | 8 | # this is the only difference, add a # followed by a cache prefix name 9 | # several cache file with the prefix will be generated 10 | # currently only support convert from libsvm file 11 | dtrain = xgb.DMatrix('../data/agaricus.txt.train#dtrain.cache') 12 | dtest = xgb.DMatrix('../data/agaricus.txt.test#dtest.cache') 13 | 14 | # specify validations set to watch performance 15 | param = {'max_depth':2, 'eta':1, 'silent':1, 
'objective':'binary:logistic'} 16 | 17 | # performance notice: set nthread to be the number of your real cpu 18 | # some cpu offer two threads per core, for example, a 4 core cpu with 8 threads, in such case set nthread=4 19 | #param['nthread']=num_real_cpu 20 | 21 | watchlist = [(dtest, 'eval'), (dtrain, 'train')] 22 | num_round = 2 23 | bst = xgb.train(param, dtrain, num_round, watchlist) 24 | 25 | 26 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/gamma_regression.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import xgboost as xgb 3 | import numpy as np 4 | 5 | # this script demonstrates how to fit gamma regression model (with log link function) 6 | # in xgboost, before running the demo you need to generate the autoclaims dataset 7 | # by running gen_autoclaims.R located in xgboost/demo/data. 8 | 9 | data = np.genfromtxt('../data/autoclaims.csv', delimiter=',') 10 | dtrain = xgb.DMatrix(data[0:4741, 0:34], data[0:4741, 34]) 11 | dtest = xgb.DMatrix(data[4741:6773, 0:34], data[4741:6773, 34]) 12 | 13 | # for gamma regression, we need to set the objective to 'reg:gamma', it also suggests 14 | # to set the base_score to a value between 1 to 5 if the number of iteration is small 15 | param = {'silent':1, 'objective':'reg:gamma', 'booster':'gbtree', 'base_score':3} 16 | 17 | # the rest of settings are the same 18 | watchlist = [(dtest, 'eval'), (dtrain, 'train')] 19 | num_round = 30 20 | 21 | # training and evaluation 22 | bst = xgb.train(param, dtrain, num_round, watchlist) 23 | preds = bst.predict(dtest) 24 | labels = dtest.get_label() 25 | print('test deviance=%f' % (2 * np.sum((labels - preds) / preds - np.log(labels) + np.log(preds)))) 26 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/generalized_linear_model.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import xgboost as xgb 3 | ## 4 | # this script demonstrate how to fit generalized linear model in xgboost 5 | # basically, we are using linear model, instead of tree for our boosters 6 | ## 7 | dtrain = xgb.DMatrix('../data/agaricus.txt.train') 8 | dtest = xgb.DMatrix('../data/agaricus.txt.test') 9 | # change booster to gblinear, so that we are fitting a linear model 10 | # alpha is the L1 regularizer 11 | # lambda is the L2 regularizer 12 | # you can also set lambda_bias which is L2 regularizer on the bias term 13 | param = {'silent':1, 'objective':'binary:logistic', 'booster':'gblinear', 14 | 'alpha': 0.0001, 'lambda': 1} 15 | 16 | # normally, you do not need to set eta (step_size) 17 | # XGBoost uses a parallel coordinate descent algorithm (shotgun), 18 | # there could be affection on convergence with parallelization on certain cases 19 | # setting eta to be smaller value, e.g 0.5 can make the optimization more stable 20 | # param['eta'] = 1 21 | 22 | ## 23 | # the rest of settings are the same 24 | ## 25 | watchlist = [(dtest, 'eval'), (dtrain, 'train')] 26 | num_round = 4 27 | bst = xgb.train(param, dtrain, num_round, watchlist) 28 | preds = bst.predict(dtest) 29 | labels = dtest.get_label() 30 | print('error=%f' % (sum(1 for i in range(len(preds)) if int(preds[i] > 0.5) != labels[i]) / float(len(preds)))) 31 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/predict_first_ntree.py: 
-------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import numpy as np 3 | import xgboost as xgb 4 | 5 | ### load data in do training 6 | dtrain = xgb.DMatrix('../data/agaricus.txt.train') 7 | dtest = xgb.DMatrix('../data/agaricus.txt.test') 8 | param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic'} 9 | watchlist = [(dtest, 'eval'), (dtrain, 'train')] 10 | num_round = 3 11 | bst = xgb.train(param, dtrain, num_round, watchlist) 12 | 13 | print('start testing prediction from first n trees') 14 | ### predict using first 1 tree 15 | label = dtest.get_label() 16 | ypred1 = bst.predict(dtest, ntree_limit=1) 17 | # by default, we predict using all the trees 18 | ypred2 = bst.predict(dtest) 19 | print('error of ypred1=%f' % (np.sum((ypred1 > 0.5) != label) / float(len(label)))) 20 | print('error of ypred2=%f' % (np.sum((ypred2 > 0.5) != label) / float(len(label)))) 21 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/predict_leaf_indices.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import xgboost as xgb 3 | 4 | ### load data in do training 5 | dtrain = xgb.DMatrix('../data/agaricus.txt.train') 6 | dtest = xgb.DMatrix('../data/agaricus.txt.test') 7 | param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic'} 8 | watchlist = [(dtest, 'eval'), (dtrain, 'train')] 9 | num_round = 3 10 | bst = xgb.train(param, dtrain, num_round, watchlist) 11 | 12 | print ('start testing predict the leaf indices') 13 | ### predict using first 2 tree 14 | leafindex = bst.predict(dtest, ntree_limit=2, pred_leaf=True) 15 | print(leafindex.shape) 16 | print(leafindex) 17 | ### predict all trees 18 | leafindex = bst.predict(dtest, pred_leaf=True) 19 | print(leafindex.shape) 20 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/runall.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | export PYTHONPATH=PYTHONPATH:../../python-package 3 | python basic_walkthrough.py 4 | python custom_objective.py 5 | python boost_from_prediction.py 6 | python predict_first_ntree.py 7 | python generalized_linear_model.py 8 | python cross_validation.py 9 | python predict_leaf_indices.py 10 | python sklearn_examples.py 11 | python sklearn_parallel.py 12 | python external_memory.py 13 | rm -rf *~ *.model *.buffer 14 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/sklearn_evals_result.py: -------------------------------------------------------------------------------- 1 | ## 2 | # This script demonstrate how to access the xgboost eval metrics by using sklearn 3 | ## 4 | 5 | import xgboost as xgb 6 | import numpy as np 7 | from sklearn.datasets import make_hastie_10_2 8 | 9 | X, y = make_hastie_10_2(n_samples=2000, random_state=42) 10 | 11 | # Map labels from {-1, 1} to {0, 1} 12 | labels, y = np.unique(y, return_inverse=True) 13 | 14 | X_train, X_test = X[:1600], X[1600:] 15 | y_train, y_test = y[:1600], y[1600:] 16 | 17 | param_dist = {'objective':'binary:logistic', 'n_estimators':2} 18 | 19 | clf = xgb.XGBModel(**param_dist) 20 | # Or you can use: clf = xgb.XGBClassifier(**param_dist) 21 | 22 | clf.fit(X_train, y_train, 23 | eval_set=[(X_train, y_train), (X_test, y_test)], 24 | eval_metric='logloss', 25 | verbose=True) 26 | 27 | # Load evals result by calling 
the evals_result() function 28 | evals_result = clf.evals_result() 29 | 30 | print('Access logloss metric directly from validation_0:') 31 | print(evals_result['validation_0']['logloss']) 32 | 33 | print('') 34 | print('Access metrics through a loop:') 35 | for e_name, e_mtrs in evals_result.items(): 36 | print('- {}'.format(e_name)) 37 | for e_mtr_name, e_mtr_vals in e_mtrs.items(): 38 | print(' - {}'.format(e_mtr_name)) 39 | print(' - {}'.format(e_mtr_vals)) 40 | 41 | print('') 42 | print('Access complete dict:') 43 | print(evals_result) 44 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/sklearn_examples.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | ''' 3 | Created on 1 Apr 2015 4 | 5 | @author: Jamie Hall 6 | ''' 7 | import pickle 8 | import xgboost as xgb 9 | 10 | import numpy as np 11 | from sklearn.model_selection import KFold, train_test_split, GridSearchCV 12 | from sklearn.metrics import confusion_matrix, mean_squared_error 13 | from sklearn.datasets import load_iris, load_digits, load_boston 14 | 15 | rng = np.random.RandomState(31337) 16 | 17 | print("Zeros and Ones from the Digits dataset: binary classification") 18 | digits = load_digits(2) 19 | y = digits['target'] 20 | X = digits['data'] 21 | kf = KFold(n_splits=2, shuffle=True, random_state=rng) 22 | for train_index, test_index in kf.split(X): 23 | xgb_model = xgb.XGBClassifier().fit(X[train_index], y[train_index]) 24 | predictions = xgb_model.predict(X[test_index]) 25 | actuals = y[test_index] 26 | print(confusion_matrix(actuals, predictions)) 27 | 28 | print("Iris: multiclass classification") 29 | iris = load_iris() 30 | y = iris['target'] 31 | X = iris['data'] 32 | kf = KFold(n_splits=2, shuffle=True, random_state=rng) 33 | for train_index, test_index in kf.split(X): 34 | xgb_model = xgb.XGBClassifier().fit(X[train_index], y[train_index]) 35 | predictions = xgb_model.predict(X[test_index]) 36 | actuals = y[test_index] 37 | print(confusion_matrix(actuals, predictions)) 38 | 39 | print("Boston Housing: regression") 40 | boston = load_boston() 41 | y = boston['target'] 42 | X = boston['data'] 43 | kf = KFold(n_splits=2, shuffle=True, random_state=rng) 44 | for train_index, test_index in kf.split(X): 45 | xgb_model = xgb.XGBRegressor().fit(X[train_index], y[train_index]) 46 | predictions = xgb_model.predict(X[test_index]) 47 | actuals = y[test_index] 48 | print(mean_squared_error(actuals, predictions)) 49 | 50 | print("Parameter optimization") 51 | y = boston['target'] 52 | X = boston['data'] 53 | xgb_model = xgb.XGBRegressor() 54 | clf = GridSearchCV(xgb_model, 55 | {'max_depth': [2,4,6], 56 | 'n_estimators': [50,100,200]}, verbose=1) 57 | clf.fit(X,y) 58 | print(clf.best_score_) 59 | print(clf.best_params_) 60 | 61 | # The sklearn API models are picklable 62 | print("Pickling sklearn API models") 63 | # must open in binary format to pickle 64 | pickle.dump(clf, open("best_boston.pkl", "wb")) 65 | clf2 = pickle.load(open("best_boston.pkl", "rb")) 66 | print(np.allclose(clf.predict(X), clf2.predict(X))) 67 | 68 | # Early-stopping 69 | 70 | X = digits['data'] 71 | y = digits['target'] 72 | X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0) 73 | clf = xgb.XGBClassifier() 74 | clf.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="auc", 75 | eval_set=[(X_test, y_test)]) 76 | 77 | -------------------------------------------------------------------------------- 
/XGBoost_demo/guide-python/sklearn_parallel.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | if __name__ == "__main__": 4 | # NOTE: on posix systems, this *has* to be here and in the 5 | # `__name__ == "__main__"` clause to run XGBoost in parallel processes 6 | # using fork, if XGBoost was built with OpenMP support. Otherwise, if you 7 | # build XGBoost without OpenMP support, you can use fork, which is the 8 | # default backend for joblib, and omit this. 9 | try: 10 | from multiprocessing import set_start_method 11 | except ImportError: 12 | raise ImportError("Unable to import multiprocessing.set_start_method." 13 | " This example only runs on Python 3.4") 14 | set_start_method("forkserver") 15 | 16 | import numpy as np 17 | from sklearn.model_selection import GridSearchCV 18 | from sklearn.datasets import load_boston 19 | import xgboost as xgb 20 | 21 | rng = np.random.RandomState(31337) 22 | 23 | print("Parallel Parameter optimization") 24 | boston = load_boston() 25 | 26 | os.environ["OMP_NUM_THREADS"] = "2" # or to whatever you want 27 | y = boston['target'] 28 | X = boston['data'] 29 | xgb_model = xgb.XGBRegressor() 30 | clf = GridSearchCV(xgb_model, {'max_depth': [2, 4, 6], 31 | 'n_estimators': [50, 100, 200]}, verbose=1, 32 | n_jobs=2) 33 | clf.fit(X, y) 34 | print(clf.best_score_) 35 | print(clf.best_params_) 36 | -------------------------------------------------------------------------------- /XGBoost_demo/kaggle-higgs/README.md: -------------------------------------------------------------------------------- 1 | Highlights 2 | ===== 3 | Higgs challenge ends recently, xgboost is being used by many users. This list highlights the xgboost solutions of players 4 | * Blogpost by phunther: [Winning solution of Kaggle Higgs competition: what a single model can do](http://no2147483647.wordpress.com/2014/09/17/winning-solution-of-kaggle-higgs-competition-what-a-single-model-can-do/) 5 | * The solution by Tianqi Chen and Tong He [Link](https://github.com/hetong007/higgsml) 6 | 7 | Guide for Kaggle Higgs Challenge 8 | ===== 9 | 10 | This is the folder giving example of how to use XGBoost Python Module to run Kaggle Higgs competition 11 | 12 | This script will achieve about 3.600 AMS score in public leaderboard. To get start, you need do following step: 13 | 14 | 1. Compile the XGBoost python lib 15 | ```bash 16 | cd ../.. 17 | make 18 | ``` 19 | 20 | 2. Put training.csv test.csv on folder './data' (you can create a symbolic link) 21 | 22 | 3. Run ./run.sh 23 | 24 | Speed 25 | ===== 26 | speedtest.py compares xgboost's speed on this dataset with sklearn.GBM 27 | 28 | 29 | Using R module 30 | ===== 31 | * Alternatively, you can run using R, higgs-train.R and higgs-pred.R. 
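AMS metric
=====
For reference on the AMS score quoted above: the challenge ranks submissions by the approximate median significance, which the scripts in this folder optimise indirectly via the `ams@0.15` eval metric. A minimal sketch of the formula is given below, using the conventional regularisation term `b_r = 10`; this helper is not part of the scripts in this folder.

```python
import math

def ams(s, b, b_r=10.0):
    # s: sum of weights of selected true signal events (true positives)
    # b: sum of weights of selected background events (false positives)
    return math.sqrt(2.0 * ((s + b + b_r) * math.log(1.0 + s / (b + b_r)) - s))
```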
32 | -------------------------------------------------------------------------------- /XGBoost_demo/kaggle-higgs/higgs-cv.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import numpy as np 3 | import xgboost as xgb 4 | 5 | ### load data in do training 6 | train = np.loadtxt('./data/training.csv', delimiter=',', skiprows=1, converters={32: lambda x:int(x=='s'.encode('utf-8')) } ) 7 | label = train[:,32] 8 | data = train[:,1:31] 9 | weight = train[:,31] 10 | dtrain = xgb.DMatrix( data, label=label, missing = -999.0, weight=weight ) 11 | param = {'max_depth':6, 'eta':0.1, 'silent':1, 'objective':'binary:logitraw', 'nthread':4} 12 | num_round = 120 13 | 14 | print ('running cross validation, with preprocessing function') 15 | # define the preprocessing function 16 | # used to return the preprocessed training, test data, and parameter 17 | # we can use this to do weight rescale, etc. 18 | # as a example, we try to set scale_pos_weight 19 | def fpreproc(dtrain, dtest, param): 20 | label = dtrain.get_label() 21 | ratio = float(np.sum(label == 0)) / np.sum(label==1) 22 | param['scale_pos_weight'] = ratio 23 | wtrain = dtrain.get_weight() 24 | wtest = dtest.get_weight() 25 | sum_weight = sum(wtrain) + sum(wtest) 26 | wtrain *= sum_weight / sum(wtrain) 27 | wtest *= sum_weight / sum(wtest) 28 | dtrain.set_weight(wtrain) 29 | dtest.set_weight(wtest) 30 | return (dtrain, dtest, param) 31 | 32 | # do cross validation, for each fold 33 | # the dtrain, dtest, param will be passed into fpreproc 34 | # then the return value of fpreproc will be used to generate 35 | # results of that fold 36 | xgb.cv(param, dtrain, num_round, nfold=5, 37 | metrics={'ams@0.15', 'auc'}, seed = 0, fpreproc = fpreproc) 38 | -------------------------------------------------------------------------------- /XGBoost_demo/kaggle-higgs/higgs-numpy.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # this is the example script to use xgboost to train 3 | import numpy as np 4 | 5 | import xgboost as xgb 6 | 7 | test_size = 550000 8 | 9 | # path to where the data lies 10 | dpath = 'data' 11 | 12 | # load in training data, directly use numpy 13 | dtrain = np.loadtxt( dpath+'/training.csv', delimiter=',', skiprows=1, converters={32: lambda x:int(x=='s'.encode('utf-8')) } ) 14 | print ('finish loading from csv ') 15 | 16 | label = dtrain[:,32] 17 | data = dtrain[:,1:31] 18 | # rescale weight to make it same as test set 19 | weight = dtrain[:,31] * float(test_size) / len(label) 20 | 21 | sum_wpos = sum( weight[i] for i in range(len(label)) if label[i] == 1.0 ) 22 | sum_wneg = sum( weight[i] for i in range(len(label)) if label[i] == 0.0 ) 23 | 24 | # print weight statistics 25 | print ('weight statistics: wpos=%g, wneg=%g, ratio=%g' % ( sum_wpos, sum_wneg, sum_wneg/sum_wpos )) 26 | 27 | # construct xgboost.DMatrix from numpy array, treat -999.0 as missing value 28 | xgmat = xgb.DMatrix( data, label=label, missing = -999.0, weight=weight ) 29 | 30 | # setup parameters for xgboost 31 | param = {} 32 | # use logistic regression loss, use raw prediction before logistic transformation 33 | # since we only need the rank 34 | param['objective'] = 'binary:logitraw' 35 | # scale weight of positive examples 36 | param['scale_pos_weight'] = sum_wneg/sum_wpos 37 | param['eta'] = 0.1 38 | param['max_depth'] = 6 39 | param['eval_metric'] = 'auc' 40 | param['silent'] = 1 41 | param['nthread'] = 16 42 | 43 | # you can directly throw 
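# column layout of training.csv: column 0 is the event id, columns 1-30 are the 30
# physics features, column 31 is the event weight and column 32 is the signal/background label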
param in, though we want to watch multiple metrics here 44 | plst = list(param.items())+[('eval_metric', 'ams@0.15')] 45 | 46 | watchlist = [ (xgmat,'train') ] 47 | # boost 120 trees 48 | num_round = 120 49 | print ('loading data end, start to boost trees') 50 | bst = xgb.train( plst, xgmat, num_round, watchlist ); 51 | # save out model 52 | bst.save_model('higgs.model') 53 | 54 | print ('finish training') 55 | -------------------------------------------------------------------------------- /XGBoost_demo/kaggle-higgs/higgs-pred.R: -------------------------------------------------------------------------------- 1 | # install xgboost package, see R-package in root folder 2 | require(xgboost) 3 | require(methods) 4 | 5 | modelfile <- "higgs.model" 6 | outfile <- "higgs.pred.csv" 7 | dtest <- read.csv("data/test.csv", header=TRUE) 8 | data <- as.matrix(dtest[2:31]) 9 | idx <- dtest[[1]] 10 | 11 | xgmat <- xgb.DMatrix(data, missing = -999.0) 12 | bst <- xgb.load(modelfile=modelfile) 13 | ypred <- predict(bst, xgmat) 14 | 15 | rorder <- rank(ypred, ties.method="first") 16 | 17 | threshold <- 0.15 18 | # to be completed 19 | ntop <- length(rorder) - as.integer(threshold*length(rorder)) 20 | plabel <- ifelse(rorder > ntop, "s", "b") 21 | outdata <- list("EventId" = idx, 22 | "RankOrder" = rorder, 23 | "Class" = plabel) 24 | write.csv(outdata, file = outfile, quote=FALSE, row.names=FALSE) 25 | -------------------------------------------------------------------------------- /XGBoost_demo/kaggle-higgs/higgs-pred.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # make prediction 3 | import numpy as np 4 | import xgboost as xgb 5 | 6 | # path to where the data lies 7 | dpath = 'data' 8 | 9 | modelfile = 'higgs.model' 10 | outfile = 'higgs.pred.csv' 11 | # make top 15% as positive 12 | threshold_ratio = 0.15 13 | 14 | # load in training data, directly use numpy 15 | dtest = np.loadtxt( dpath+'/test.csv', delimiter=',', skiprows=1 ) 16 | data = dtest[:,1:31] 17 | idx = dtest[:,0] 18 | 19 | print ('finish loading from csv ') 20 | xgmat = xgb.DMatrix( data, missing = -999.0 ) 21 | bst = xgb.Booster({'nthread':16}, model_file = modelfile) 22 | ypred = bst.predict( xgmat ) 23 | 24 | res = [ ( int(idx[i]), ypred[i] ) for i in range(len(ypred)) ] 25 | 26 | rorder = {} 27 | for k, v in sorted( res, key = lambda x:-x[1] ): 28 | rorder[ k ] = len(rorder) + 1 29 | 30 | # write out predictions 31 | ntop = int( threshold_ratio * len(rorder ) ) 32 | fo = open(outfile, 'w') 33 | nhit = 0 34 | ntot = 0 35 | fo.write('EventId,RankOrder,Class\n') 36 | for k, v in res: 37 | if rorder[k] <= ntop: 38 | lb = 's' 39 | nhit += 1 40 | else: 41 | lb = 'b' 42 | # change output rank order to follow Kaggle convention 43 | fo.write('%s,%d,%s\n' % ( k, len(rorder)+1-rorder[k], lb ) ) 44 | ntot += 1 45 | fo.close() 46 | 47 | print ('finished writing into prediction file') 48 | 49 | 50 | 51 | -------------------------------------------------------------------------------- /XGBoost_demo/kaggle-higgs/higgs-train.R: -------------------------------------------------------------------------------- 1 | # install xgboost package, see R-package in root folder 2 | require(xgboost) 3 | require(methods) 4 | 5 | testsize <- 550000 6 | 7 | dtrain <- read.csv("data/training.csv", header=TRUE) 8 | dtrain[33] <- dtrain[33] == "s" 9 | label <- as.numeric(dtrain[[33]]) 10 | data <- as.matrix(dtrain[2:31]) 11 | weight <- as.numeric(dtrain[[32]]) * testsize / length(label) 12 | 13 | sumwpos <- 
sum(weight * (label==1.0)) 14 | sumwneg <- sum(weight * (label==0.0)) 15 | print(paste("weight statistics: wpos=", sumwpos, "wneg=", sumwneg, "ratio=", sumwneg / sumwpos)) 16 | 17 | xgmat <- xgb.DMatrix(data, label = label, weight = weight, missing = -999.0) 18 | param <- list("objective" = "binary:logitraw", 19 | "scale_pos_weight" = sumwneg / sumwpos, 20 | "bst:eta" = 0.1, 21 | "bst:max_depth" = 6, 22 | "eval_metric" = "auc", 23 | "eval_metric" = "ams@0.15", 24 | "silent" = 1, 25 | "nthread" = 16) 26 | watchlist <- list("train" = xgmat) 27 | nround = 120 28 | print ("loading data end, start to boost trees") 29 | bst = xgb.train(param, xgmat, nround, watchlist ); 30 | # save out model 31 | xgb.save(bst, "higgs.model") 32 | print ('finish training') 33 | 34 | -------------------------------------------------------------------------------- /XGBoost_demo/kaggle-higgs/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | python -u higgs-numpy.py 4 | ret=$? 5 | if [[ $ret != 0 ]]; then 6 | echo "ERROR in higgs-numpy.py" 7 | exit $ret 8 | fi 9 | python -u higgs-pred.py 10 | ret=$? 11 | if [[ $ret != 0 ]]; then 12 | echo "ERROR in higgs-pred.py" 13 | exit $ret 14 | fi 15 | -------------------------------------------------------------------------------- /XGBoost_demo/kaggle-higgs/speedtest.R: -------------------------------------------------------------------------------- 1 | # install xgboost package, see R-package in root folder 2 | require(xgboost) 3 | require(gbm) 4 | require(methods) 5 | 6 | testsize <- 550000 7 | 8 | dtrain <- read.csv("data/training.csv", header=TRUE, nrows=350001) 9 | dtrain$Label = as.numeric(dtrain$Label=='s') 10 | # gbm.time = system.time({ 11 | # gbm.model <- gbm(Label ~ ., data = dtrain[, -c(1,32)], n.trees = 120, 12 | # interaction.depth = 6, shrinkage = 0.1, bag.fraction = 1, 13 | # verbose = TRUE) 14 | # }) 15 | # print(gbm.time) 16 | # Test result: 761.48 secs 17 | 18 | # dtrain[33] <- dtrain[33] == "s" 19 | # label <- as.numeric(dtrain[[33]]) 20 | data <- as.matrix(dtrain[2:31]) 21 | weight <- as.numeric(dtrain[[32]]) * testsize / length(label) 22 | 23 | sumwpos <- sum(weight * (label==1.0)) 24 | sumwneg <- sum(weight * (label==0.0)) 25 | print(paste("weight statistics: wpos=", sumwpos, "wneg=", sumwneg, "ratio=", sumwneg / sumwpos)) 26 | 27 | xgboost.time = list() 28 | threads = c(1,2,4,8,16) 29 | for (i in 1:length(threads)){ 30 | thread = threads[i] 31 | xgboost.time[[i]] = system.time({ 32 | xgmat <- xgb.DMatrix(data, label = label, weight = weight, missing = -999.0) 33 | param <- list("objective" = "binary:logitraw", 34 | "scale_pos_weight" = sumwneg / sumwpos, 35 | "bst:eta" = 0.1, 36 | "bst:max_depth" = 6, 37 | "eval_metric" = "auc", 38 | "eval_metric" = "ams@0.15", 39 | "silent" = 1, 40 | "nthread" = thread) 41 | watchlist <- list("train" = xgmat) 42 | nround = 120 43 | print ("loading data end, start to boost trees") 44 | bst = xgb.train(param, xgmat, nround, watchlist ); 45 | # save out model 46 | xgb.save(bst, "higgs.model") 47 | print ('finish training') 48 | }) 49 | } 50 | 51 | xgboost.time 52 | # [[1]] 53 | # user system elapsed 54 | # 99.015 0.051 98.982 55 | # 56 | # [[2]] 57 | # user system elapsed 58 | # 100.268 0.317 55.473 59 | # 60 | # [[3]] 61 | # user system elapsed 62 | # 111.682 0.777 35.963 63 | # 64 | # [[4]] 65 | # user system elapsed 66 | # 149.396 1.851 32.661 67 | # 68 | # [[5]] 69 | # user system elapsed 70 | # 157.390 5.988 40.949 71 | 72 | 
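# note on the timings above: on this run elapsed time improves up to 8 threads
# (98.98s -> 32.66s) but regresses at 16 threads while user time keeps growing,
# i.e. adding threads beyond 8 oversubscribes this machine and adds overhead rather than speedup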
-------------------------------------------------------------------------------- /XGBoost_demo/kaggle-higgs/speedtest.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # this is the example script to use xgboost to train 3 | import numpy as np 4 | import xgboost as xgb 5 | from sklearn.ensemble import GradientBoostingClassifier 6 | import time 7 | test_size = 550000 8 | 9 | # path to where the data lies 10 | dpath = 'data' 11 | 12 | # load in training data, directly use numpy 13 | dtrain = np.loadtxt( dpath+'/training.csv', delimiter=',', skiprows=1, converters={32: lambda x:int(x=='s') } ) 14 | print ('finish loading from csv ') 15 | 16 | label = dtrain[:,32] 17 | data = dtrain[:,1:31] 18 | # rescale weight to make it same as test set 19 | weight = dtrain[:,31] * float(test_size) / len(label) 20 | 21 | sum_wpos = sum( weight[i] for i in range(len(label)) if label[i] == 1.0 ) 22 | sum_wneg = sum( weight[i] for i in range(len(label)) if label[i] == 0.0 ) 23 | 24 | # print weight statistics 25 | print ('weight statistics: wpos=%g, wneg=%g, ratio=%g' % ( sum_wpos, sum_wneg, sum_wneg/sum_wpos )) 26 | 27 | # construct xgboost.DMatrix from numpy array, treat -999.0 as missing value 28 | xgmat = xgb.DMatrix( data, label=label, missing = -999.0, weight=weight ) 29 | 30 | # setup parameters for xgboost 31 | param = {} 32 | # use logistic regression loss 33 | param['objective'] = 'binary:logitraw' 34 | # scale weight of positive examples 35 | param['scale_pos_weight'] = sum_wneg/sum_wpos 36 | param['bst:eta'] = 0.1 37 | param['bst:max_depth'] = 6 38 | param['eval_metric'] = 'auc' 39 | param['silent'] = 1 40 | param['nthread'] = 4 41 | 42 | plst = param.items()+[('eval_metric', 'ams@0.15')] 43 | 44 | watchlist = [ (xgmat,'train') ] 45 | # boost 10 trees 46 | num_round = 10 47 | print ('loading data end, start to boost trees') 48 | print ("training GBM from sklearn") 49 | tmp = time.time() 50 | gbm = GradientBoostingClassifier(n_estimators=num_round, max_depth=6, verbose=2) 51 | gbm.fit(data, label) 52 | print ("sklearn.GBM costs: %s seconds" % str(time.time() - tmp)) 53 | #raw_input() 54 | print ("training xgboost") 55 | threads = [1, 2, 4, 16] 56 | for i in threads: 57 | param['nthread'] = i 58 | tmp = time.time() 59 | plst = param.items()+[('eval_metric', 'ams@0.15')] 60 | bst = xgb.train( plst, xgmat, num_round, watchlist ); 61 | print ("XGBoost with %d thread costs: %s seconds" % (i, str(time.time() - tmp))) 62 | 63 | print ('finish training') 64 | -------------------------------------------------------------------------------- /XGBoost_demo/kaggle-otto/README.MD: -------------------------------------------------------------------------------- 1 | Benckmark for Otto Group Competition 2 | ========= 3 | 4 | This is a folder containing the benchmark for the [Otto Group Competition on Kaggle](http://www.kaggle.com/c/otto-group-product-classification-challenge). 5 | 6 | ## Getting started 7 | 8 | 1. Put `train.csv` and `test.csv` under the `data` folder 9 | 2. Run the script 10 | 3. Submit the `submission.csv` 11 | 12 | The parameter `nthread` controls the number of cores to run on, please set it to suit your machine. 13 | 14 | ## R-package 15 | 16 | To install the R-package of xgboost, please run 17 | 18 | ```r 19 | devtools::install_github('tqchen/xgboost',subdir='R-package') 20 | ``` 21 | 22 | Windows users may need to install [RTools](http://cran.r-project.org/bin/windows/Rtools/) first. 
23 | 24 | 25 | -------------------------------------------------------------------------------- /XGBoost_demo/kaggle-otto/otto_train_pred.R: -------------------------------------------------------------------------------- 1 | require(xgboost) 2 | require(methods) 3 | 4 | train = read.csv('data/train.csv',header=TRUE,stringsAsFactors = F) 5 | test = read.csv('data/test.csv',header=TRUE,stringsAsFactors = F) 6 | train = train[,-1] 7 | test = test[,-1] 8 | 9 | y = train[,ncol(train)] 10 | y = gsub('Class_','',y) 11 | y = as.integer(y)-1 # xgboost take features in [0,numOfClass) 12 | 13 | x = rbind(train[,-ncol(train)],test) 14 | x = as.matrix(x) 15 | x = matrix(as.numeric(x),nrow(x),ncol(x)) 16 | trind = 1:length(y) 17 | teind = (nrow(train)+1):nrow(x) 18 | 19 | # Set necessary parameter 20 | param <- list("objective" = "multi:softprob", 21 | "eval_metric" = "mlogloss", 22 | "num_class" = 9, 23 | "nthread" = 8) 24 | 25 | # Run Cross Validation 26 | cv.nround = 50 27 | bst.cv = xgb.cv(param=param, data = x[trind,], label = y, 28 | nfold = 3, nrounds=cv.nround) 29 | 30 | # Train the model 31 | nround = 50 32 | bst = xgboost(param=param, data = x[trind,], label = y, nrounds=nround) 33 | 34 | # Make prediction 35 | pred = predict(bst,x[teind,]) 36 | pred = matrix(pred,9,length(pred)/9) 37 | pred = t(pred) 38 | 39 | # Output submission 40 | pred = format(pred, digits=2,scientific=F) # shrink the size of submission 41 | pred = data.frame(1:nrow(pred),pred) 42 | names(pred) = c('id', paste0('Class_',1:9)) 43 | write.csv(pred,file='submission.csv', quote=FALSE,row.names=FALSE) 44 | -------------------------------------------------------------------------------- /XGBoost_demo/kaggle-otto/understandingXGBoostModel.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Understanding XGBoost Model on Otto Dataset" 3 | author: "Michaël Benesty" 4 | output: 5 | rmarkdown::html_vignette: 6 | css: ../../R-package/vignettes/vignette.css 7 | number_sections: yes 8 | toc: yes 9 | --- 10 | 11 | Introduction 12 | ============ 13 | 14 | **XGBoost** is an implementation of the famous gradient boosting algorithm. This model is often described as a *blackbox*, meaning it works well but it is not trivial to understand how. Indeed, the model is made of hundreds (thousands?) of decision trees. You may wonder how possible a human would be able to have a general view of the model? 15 | 16 | While XGBoost is known for its fast speed and accurate predictive power, it also comes with various functions to help you understand the model. 17 | The purpose of this RMarkdown document is to demonstrate how easily we can leverage the functions already implemented in **XGBoost R** package. Of course, everything showed below can be applied to the dataset you may have to manipulate at work or wherever! 18 | 19 | First we will prepare the **Otto** dataset and train a model, then we will generate two visualisations to get a clue of what is important to the model, finally, we will see how we can leverage these information. 20 | 21 | Preparation of the data 22 | ======================= 23 | 24 | This part is based on the **R** tutorial example by [Tong He](https://github.com/dmlc/xgboost/blob/master/demo/kaggle-otto/otto_train_pred.R) 25 | 26 | First, let's load the packages and the dataset. 
27 | 28 | ```{r loading} 29 | require(xgboost) 30 | require(methods) 31 | require(data.table) 32 | require(magrittr) 33 | train <- fread('data/train.csv', header = T, stringsAsFactors = F) 34 | test <- fread('data/test.csv', header=TRUE, stringsAsFactors = F) 35 | ``` 36 | > `magrittr` and `data.table` are here to make the code cleaner and much more rapid. 37 | 38 | Let's explore the dataset. 39 | 40 | ```{r explore} 41 | # Train dataset dimensions 42 | dim(train) 43 | 44 | # Training content 45 | train[1:6,1:5, with =F] 46 | 47 | # Test dataset dimensions 48 | dim(test) 49 | 50 | # Test content 51 | test[1:6,1:5, with =F] 52 | ``` 53 | > We only display the 6 first rows and 5 first columns for convenience 54 | 55 | Each *column* represents a feature measured by an `integer`. Each *row* is an **Otto** product. 56 | 57 | Obviously the first column (`ID`) doesn't contain any useful information. 58 | 59 | To let the algorithm focus on real stuff, we will delete it. 60 | 61 | ```{r clean, results='hide'} 62 | # Delete ID column in training dataset 63 | train[, id := NULL] 64 | 65 | # Delete ID column in testing dataset 66 | test[, id := NULL] 67 | ``` 68 | 69 | According to its description, the **Otto** challenge is a multi class classification challenge. We need to extract the labels (here the name of the different classes) from the dataset. We only have two files (test and training), it seems logical that the training file contains the class we are looking for. Usually the labels is in the first or the last column. We already know what is in the first column, let's check the content of the last one. 70 | 71 | ```{r searchLabel} 72 | # Check the content of the last column 73 | train[1:6, ncol(train), with = F] 74 | # Save the name of the last column 75 | nameLastCol <- names(train)[ncol(train)] 76 | ``` 77 | 78 | The classes are provided as character string in the `r ncol(train)`th column called `r nameLastCol`. As you may know, **XGBoost** doesn't support anything else than numbers. So we will convert classes to `integer`. Moreover, according to the documentation, it should start at `0`. 79 | 80 | For that purpose, we will: 81 | 82 | * extract the target column 83 | * remove `Class_` from each class name 84 | * convert to `integer` 85 | * remove `1` to the new value 86 | 87 | ```{r classToIntegers} 88 | # Convert from classes to numbers 89 | y <- train[, nameLastCol, with = F][[1]] %>% gsub('Class_','',.) %>% {as.integer(.) -1} 90 | 91 | # Display the first 5 levels 92 | y[1:5] 93 | ``` 94 | 95 | We remove label column from training dataset, otherwise **XGBoost** would use it to guess the labels! 96 | 97 | ```{r deleteCols, results='hide'} 98 | train[, nameLastCol:=NULL, with = F] 99 | ``` 100 | 101 | `data.table` is an awesome implementation of data.frame, unfortunately it is not a format supported natively by **XGBoost**. We need to convert both datasets (training and test) in `numeric` Matrix format. 102 | 103 | ```{r convertToNumericMatrix} 104 | trainMatrix <- train[,lapply(.SD,as.numeric)] %>% as.matrix 105 | testMatrix <- test[,lapply(.SD,as.numeric)] %>% as.matrix 106 | ``` 107 | 108 | Model training 109 | ============== 110 | 111 | Before the learning we will use the cross validation to evaluate the our error rate. 112 | 113 | Basically **XGBoost** will divide the training data in `nfold` parts, then **XGBoost** will retain the first part to use it as the test data and perform a training. Then it will reintegrate the first part and retain the second part, do a training and so on... 
114 | 115 | You can look at the function documentation for more information. 116 | 117 | ```{r crossValidation} 118 | numberOfClasses <- max(y) + 1 119 | 120 | param <- list("objective" = "multi:softprob", 121 | "eval_metric" = "mlogloss", 122 | "num_class" = numberOfClasses) 123 | 124 | cv.nround <- 5 125 | cv.nfold <- 3 126 | 127 | bst.cv = xgb.cv(param=param, data = trainMatrix, label = y, 128 | nfold = cv.nfold, nrounds = cv.nround) 129 | ``` 130 | > As we can see the error rate is low on the test dataset (for a 5mn trained model). 131 | 132 | Finally, we are ready to train the real model!!! 133 | 134 | ```{r modelTraining} 135 | nround = 50 136 | bst = xgboost(param=param, data = trainMatrix, label = y, nrounds=nround) 137 | ``` 138 | 139 | Model understanding 140 | =================== 141 | 142 | Feature importance 143 | ------------------ 144 | 145 | So far, we have built a model made of **`r nround`** trees. 146 | 147 | To build a tree, the dataset is divided recursively several times. At the end of the process, you get groups of observations (here, these observations are properties regarding **Otto** products). 148 | 149 | Each division operation is called a *split*. 150 | 151 | Each group at each division level is called a branch and the deepest level is called a *leaf*. 152 | 153 | In the final model, these *leafs* are supposed to be as pure as possible for each tree, meaning in our case that each *leaf* should be made of one class of **Otto** product only (of course it is not true, but that's what we try to achieve in a minimum of splits). 154 | 155 | **Not all *splits* are equally important**. Basically the first *split* of a tree will have more impact on the purity that, for instance, the deepest *split*. Intuitively, we understand that the first *split* makes most of the work, and the following *splits* focus on smaller parts of the dataset which have been misclassified by the first *tree*. 156 | 157 | In the same way, in Boosting we try to optimize the misclassification at each round (it is called the *loss*). So the first *tree* will do the big work and the following trees will focus on the remaining, on the parts not correctly learned by the previous *trees*. 158 | 159 | The improvement brought by each *split* can be measured, it is the *gain*. 160 | 161 | Each *split* is done on one feature only at one value. 162 | 163 | Let's see what the model looks like. 164 | 165 | ```{r modelDump} 166 | model <- xgb.dump(bst, with.stats = T) 167 | model[1:10] 168 | ``` 169 | > For convenience, we are displaying the first 10 lines of the model only. 170 | 171 | Clearly, it is not easy to understand what it means. 172 | 173 | Basically each line represents a *branch*, there is the *tree* ID, the feature ID, the point where it *splits*, and information regarding the next *branches* (left, right, when the row for this feature is N/A). 174 | 175 | Hopefully, **XGBoost** offers a better representation: **feature importance**. 176 | 177 | Feature importance is about averaging the *gain* of each feature for all *split* and all *trees*. 178 | 179 | Then we can use the function `xgb.plot.importance`. 
180 | 181 | ```{r importanceFeature, fig.align='center', fig.height=5, fig.width=10} 182 | # Get the feature real names 183 | names <- dimnames(trainMatrix)[[2]] 184 | 185 | # Compute feature importance matrix 186 | importance_matrix <- xgb.importance(names, model = bst) 187 | 188 | # Nice graph 189 | xgb.plot.importance(importance_matrix[1:10,]) 190 | ``` 191 | 192 | > To make it understandable, we first extract the column names from the `Matrix`. 193 | 194 | Interpretation 195 | -------------- 196 | 197 | In the feature importance plot above, we can see the 10 most important features. 198 | 199 | This function gives a color to each bar. These colors represent groups of features. Basically, K-means clustering is applied to group the features by importance. 200 | 201 | From here you can take several actions. For instance you can remove the least important features (feature selection), or look deeper into the interaction between the most important features and the labels. 202 | 203 | Or you can just reason about why these features are so important (in the **Otto** challenge we can't go this way because there is not enough information). 204 | 205 | Tree graph 206 | ---------- 207 | 208 | Feature importance gives you feature weight information but not the interactions between features. 209 | 210 | The **XGBoost R** package has another useful function for that. 211 | 212 | Please scroll to the right to see the tree. 213 | 214 | ```{r treeGraph, dpi=1500, fig.align='left'} 215 | xgb.plot.tree(feature_names = names, model = bst, n_first_tree = 2) 216 | ``` 217 | 218 | We are just displaying the first two trees here. 219 | 220 | On simple models the first two trees may be enough. Here, it might not be the case. We can see from the size of the trees that the interaction between features is complicated. 221 | Besides, **XGBoost** generates `k` trees at each round for a `k`-class classification problem. Therefore the two trees illustrated here are trying to classify data into different classes. 222 | 223 | Going deeper 224 | ============ 225 | 226 | There are 4 documents you may also be interested in: 227 | 228 | * [xgboostPresentation.Rmd](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd): general presentation 229 | * [discoverYourData.Rmd](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/discoverYourData.Rmd): explaining feature analysis 230 | * [Feature Importance Analysis with XGBoost in Tax audit](http://fr.slideshare.net/MichaelBENESTY/feature-importance-analysis-with-xgboost-in-tax-audit): use case 231 | * [The Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/): very good book to have a good understanding of the model 232 | -------------------------------------------------------------------------------- /XGBoost_demo/multiclass_classification/README.md: -------------------------------------------------------------------------------- 1 | Demonstrating how to use XGBoost to accomplish a multi-class classification task on the [UCI Dermatology dataset](https://archive.ics.uci.edu/ml/datasets/Dermatology) 2 | 3 | Make sure you have built the xgboost Python module in ../../python 4 | 5 | 1.
Run runexp.sh 6 | ```bash 7 | ./runexp.sh 8 | ``` 9 | 10 | 11 | -------------------------------------------------------------------------------- /XGBoost_demo/multiclass_classification/runexp.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | if [ -f dermatology.data ] 3 | then 4 | echo "use existing data to run multi class classification" 5 | else 6 | echo "getting data from uci, make sure you are connected to internet" 7 | wget https://archive.ics.uci.edu/ml/machine-learning-databases/dermatology/dermatology.data 8 | fi 9 | python train.py 10 | -------------------------------------------------------------------------------- /XGBoost_demo/multiclass_classification/train.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | from __future__ import division 4 | 5 | import numpy as np 6 | import xgboost as xgb 7 | 8 | # labels need to be 0 to num_class - 1 9 | data = np.loadtxt('./dermatology.data', delimiter=',', 10 | converters={33: lambda x:int(x == '?'), 34: lambda x:int(x) - 1}) 11 | sz = data.shape 12 | 13 | train = data[:int(sz[0] * 0.7), :] 14 | test = data[int(sz[0] * 0.7):, :] 15 | 16 | train_X = train[:, :33] 17 | train_Y = train[:, 34] 18 | 19 | test_X = test[:, :33] 20 | test_Y = test[:, 34] 21 | 22 | xg_train = xgb.DMatrix(train_X, label=train_Y) 23 | xg_test = xgb.DMatrix(test_X, label=test_Y) 24 | # setup parameters for xgboost 25 | param = {} 26 | # use softmax multi-class classification 27 | param['objective'] = 'multi:softmax' 28 | # step size shrinkage (learning rate) 29 | param['eta'] = 0.1 30 | param['max_depth'] = 6 31 | param['silent'] = 1 32 | param['nthread'] = 4 33 | param['num_class'] = 6 34 | 35 | watchlist = [(xg_train, 'train'), (xg_test, 'test')] 36 | num_round = 5 37 | bst = xgb.train(param, xg_train, num_round, watchlist) 38 | # get prediction 39 | pred = bst.predict(xg_test) 40 | error_rate = np.sum(pred != test_Y) / test_Y.shape[0] 41 | print('Test error using softmax = {}'.format(error_rate)) 42 | 43 | # do the same thing again, but output probabilities 44 | param['objective'] = 'multi:softprob' 45 | bst = xgb.train(param, xg_train, num_round, watchlist) 46 | # Note: this convention has been changed since xgboost-unity 47 | # get prediction, this is in 1D array, need reshape to (ndata, nclass) 48 | pred_prob = bst.predict(xg_test).reshape(test_Y.shape[0], 6) 49 | pred_label = np.argmax(pred_prob, axis=1) 50 | error_rate = np.sum(pred_label != test_Y) / test_Y.shape[0] 51 | print('Test error using softprob = {}'.format(error_rate)) 52 | -------------------------------------------------------------------------------- /XGBoost_demo/rank/README.md: -------------------------------------------------------------------------------- 1 | Learning to rank 2 | ==== 3 | XGBoost supports ranking tasks. In ranking scenarios, data are often grouped, and we need the [group information file](../../doc/input_format.md#group-input-format) to specify the ranking task. The model used in XGBoost for ranking is LambdaRank; this functionality is not yet complete. Currently, we provide pairwise ranking. 4 | 5 | ### Parameters 6 | The configuration settings are similar to the regression and binary classification settings, except that the user needs to specify the objective: 7 | 8 | ``` 9 | ... 10 | objective="rank:pairwise" 11 | ...
12 | ``` 13 | For more usage details please refer to the [binary classification demo](../binary_classification), 14 | 15 | Instructions 16 | ==== 17 | The dataset for ranking demo is from LETOR04 MQ2008 fold1, 18 | You can use the following command to run the example 19 | 20 | Get the data: ./wgetdata.sh 21 | Run the example: ./runexp.sh 22 | -------------------------------------------------------------------------------- /XGBoost_demo/rank/mq2008.conf: -------------------------------------------------------------------------------- 1 | # General Parameters, see comment for each definition 2 | 3 | # specify objective 4 | objective="rank:pairwise" 5 | 6 | # Tree Booster Parameters 7 | # step size shrinkage 8 | eta = 0.1 9 | # minimum loss reduction required to make a further partition 10 | gamma = 1.0 11 | # minimum sum of instance weight(hessian) needed in a child 12 | min_child_weight = 0.1 13 | # maximum depth of a tree 14 | max_depth = 6 15 | 16 | # Task parameters 17 | # the number of round to do boosting 18 | num_round = 4 19 | # 0 means do not save any model except the final round model 20 | save_period = 0 21 | # The path of training data 22 | data = "mq2008.train" 23 | # The path of validation data, used to monitor training process, here [test] sets name of the validation set 24 | eval[test] = "mq2008.vali" 25 | # The path of test data 26 | test:data = "mq2008.test" 27 | 28 | 29 | -------------------------------------------------------------------------------- /XGBoost_demo/rank/runexp.sh: -------------------------------------------------------------------------------- 1 | python trans_data.py train.txt mq2008.train mq2008.train.group 2 | 3 | python trans_data.py test.txt mq2008.test mq2008.test.group 4 | 5 | python trans_data.py vali.txt mq2008.vali mq2008.vali.group 6 | 7 | ../../xgboost mq2008.conf 8 | 9 | ../../xgboost mq2008.conf task=pred model_in=0004.model 10 | 11 | 12 | -------------------------------------------------------------------------------- /XGBoost_demo/rank/trans_data.py: -------------------------------------------------------------------------------- 1 | import sys 2 | 3 | def save_data(group_data,output_feature,output_group): 4 | if len(group_data) == 0: 5 | return 6 | 7 | output_group.write(str(len(group_data))+"\n") 8 | for data in group_data: 9 | # only include nonzero features 10 | feats = [ p for p in data[2:] if float(p.split(':')[1]) != 0.0 ] 11 | output_feature.write(data[0] + " " + " ".join(feats) + "\n") 12 | 13 | if __name__ == "__main__": 14 | if len(sys.argv) != 4: 15 | print ("Usage: python trans_data.py [Ranksvm Format Input] [Output Feature File] [Output Group File]") 16 | sys.exit(0) 17 | 18 | fi = open(sys.argv[1]) 19 | output_feature = open(sys.argv[2],"w") 20 | output_group = open(sys.argv[3],"w") 21 | 22 | group_data = [] 23 | group = "" 24 | for line in fi: 25 | if not line: 26 | break 27 | if "#" in line: 28 | line = line[:line.index("#")] 29 | splits = line.strip().split(" ") 30 | if splits[1] != group: 31 | save_data(group_data,output_feature,output_group) 32 | group_data = [] 33 | group = splits[1] 34 | group_data.append(splits) 35 | 36 | save_data(group_data,output_feature,output_group) 37 | 38 | fi.close() 39 | output_feature.close() 40 | output_group.close() 41 | 42 | -------------------------------------------------------------------------------- /XGBoost_demo/rank/wgetdata.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | wget 
http://research.microsoft.com/en-us/um/beijing/projects/letor/LETOR4.0/Data/MQ2008.rar 3 | unrar x MQ2008.rar 4 | mv -f MQ2008/Fold1/*.txt . 5 | -------------------------------------------------------------------------------- /XGBoost_demo/regression/README.md: -------------------------------------------------------------------------------- 1 | Regression 2 | ==== 3 | Using XGBoost for regression is very similar to using it for binary classification. We suggest that you refer to the [binary classification demo](../binary_classification) first. In XGBoost, if we use negative log likelihood as the loss function for regression, the training procedure is the same as training a binary classifier with XGBoost. 4 | 5 | ### Tutorial 6 | The dataset we use is the [computer hardware dataset from the UCI repository](https://archive.ics.uci.edu/ml/datasets/Computer+Hardware). The demo for regression is almost the same as the [binary classification demo](../binary_classification), except for a small difference in the general parameters: 7 | ``` 8 | # General parameter 9 | # this is the only difference with classification, use reg:linear to do linear regression 10 | # when labels are in [0,1] we can also use reg:logistic 11 | objective = reg:linear 12 | ... 13 | 14 | ``` 15 | 16 | The input format is the same as for binary classification, except that the label is now the target regression value. We use linear regression here; if we want to use logistic regression (objective = reg:logistic), the labels need to be pre-scaled into [0,1]. 17 | 18 | -------------------------------------------------------------------------------- /XGBoost_demo/regression/machine.conf: -------------------------------------------------------------------------------- 1 | # General Parameters, see comment for each definition 2 | # choose the tree booster, can also change to gblinear 3 | booster = gbtree 4 | # this is the only difference with classification, use reg:linear to do linear regression 5 | # when labels are in [0,1] we can also use reg:logistic 6 | objective = reg:linear 7 | 8 | # Tree Booster Parameters 9 | # step size shrinkage 10 | eta = 1.0 11 | # minimum loss reduction required to make a further partition 12 | gamma = 1.0 13 | # minimum sum of instance weight(hessian) needed in a child 14 | min_child_weight = 1 15 | # maximum depth of a tree 16 | max_depth = 3 17 | 18 | # Task parameters 19 | # the number of round to do boosting 20 | num_round = 2 21 | # 0 means do not save any model except the final round model 22 | save_period = 0 23 | # The path of training data 24 | data = "machine.txt.train" 25 | # The path of validation data, used to monitor training process, here [test] sets name of the validation set 26 | eval[test] = "machine.txt.test" 27 | # The path of test data 28 | test:data = "machine.txt.test" 29 | 30 | 31 | -------------------------------------------------------------------------------- /XGBoost_demo/regression/machine.data: -------------------------------------------------------------------------------- 1 | adviser,32/60,125,256,6000,256,16,128,198,199 2 | amdahl,470v/7,29,8000,32000,32,8,32,269,253 3 | amdahl,470v/7a,29,8000,32000,32,8,32,220,253 4 | amdahl,470v/7b,29,8000,32000,32,8,32,172,253 5 | amdahl,470v/7c,29,8000,16000,32,8,16,132,132 6 | amdahl,470v/b,26,8000,32000,64,8,32,318,290 7 | amdahl,580-5840,23,16000,32000,64,16,32,367,381 8 | amdahl,580-5850,23,16000,32000,64,16,32,489,381 9 | amdahl,580-5860,23,16000,64000,64,16,32,636,749 10 | amdahl,580-5880,23,32000,64000,128,32,64,1144,1238 11 |
apollo,dn320,400,1000,3000,0,1,2,38,23 12 | apollo,dn420,400,512,3500,4,1,6,40,24 13 | basf,7/65,60,2000,8000,65,1,8,92,70 14 | basf,7/68,50,4000,16000,65,1,8,138,117 15 | bti,5000,350,64,64,0,1,4,10,15 16 | bti,8000,200,512,16000,0,4,32,35,64 17 | burroughs,b1955,167,524,2000,8,4,15,19,23 18 | burroughs,b2900,143,512,5000,0,7,32,28,29 19 | burroughs,b2925,143,1000,2000,0,5,16,31,22 20 | burroughs,b4955,110,5000,5000,142,8,64,120,124 21 | burroughs,b5900,143,1500,6300,0,5,32,30,35 22 | burroughs,b5920,143,3100,6200,0,5,20,33,39 23 | burroughs,b6900,143,2300,6200,0,6,64,61,40 24 | burroughs,b6925,110,3100,6200,0,6,64,76,45 25 | c.r.d,68/10-80,320,128,6000,0,1,12,23,28 26 | c.r.d,universe:2203t,320,512,2000,4,1,3,69,21 27 | c.r.d,universe:68,320,256,6000,0,1,6,33,28 28 | c.r.d,universe:68/05,320,256,3000,4,1,3,27,22 29 | c.r.d,universe:68/137,320,512,5000,4,1,5,77,28 30 | c.r.d,universe:68/37,320,256,5000,4,1,6,27,27 31 | cdc,cyber:170/750,25,1310,2620,131,12,24,274,102 32 | cdc,cyber:170/760,25,1310,2620,131,12,24,368,102 33 | cdc,cyber:170/815,50,2620,10480,30,12,24,32,74 34 | cdc,cyber:170/825,50,2620,10480,30,12,24,63,74 35 | cdc,cyber:170/835,56,5240,20970,30,12,24,106,138 36 | cdc,cyber:170/845,64,5240,20970,30,12,24,208,136 37 | cdc,omega:480-i,50,500,2000,8,1,4,20,23 38 | cdc,omega:480-ii,50,1000,4000,8,1,5,29,29 39 | cdc,omega:480-iii,50,2000,8000,8,1,5,71,44 40 | cambex,1636-1,50,1000,4000,8,3,5,26,30 41 | cambex,1636-10,50,1000,8000,8,3,5,36,41 42 | cambex,1641-1,50,2000,16000,8,3,5,40,74 43 | cambex,1641-11,50,2000,16000,8,3,6,52,74 44 | cambex,1651-1,50,2000,16000,8,3,6,60,74 45 | dec,decsys:10:1091,133,1000,12000,9,3,12,72,54 46 | dec,decsys:20:2060,133,1000,8000,9,3,12,72,41 47 | dec,microvax-1,810,512,512,8,1,1,18,18 48 | dec,vax:11/730,810,1000,5000,0,1,1,20,28 49 | dec,vax:11/750,320,512,8000,4,1,5,40,36 50 | dec,vax:11/780,200,512,8000,8,1,8,62,38 51 | dg,eclipse:c/350,700,384,8000,0,1,1,24,34 52 | dg,eclipse:m/600,700,256,2000,0,1,1,24,19 53 | dg,eclipse:mv/10000,140,1000,16000,16,1,3,138,72 54 | dg,eclipse:mv/4000,200,1000,8000,0,1,2,36,36 55 | dg,eclipse:mv/6000,110,1000,4000,16,1,2,26,30 56 | dg,eclipse:mv/8000,110,1000,12000,16,1,2,60,56 57 | dg,eclipse:mv/8000-ii,220,1000,8000,16,1,2,71,42 58 | formation,f4000/100,800,256,8000,0,1,4,12,34 59 | formation,f4000/200,800,256,8000,0,1,4,14,34 60 | formation,f4000/200ap,800,256,8000,0,1,4,20,34 61 | formation,f4000/300,800,256,8000,0,1,4,16,34 62 | formation,f4000/300ap,800,256,8000,0,1,4,22,34 63 | four-phase,2000/260,125,512,1000,0,8,20,36,19 64 | gould,concept:32/8705,75,2000,8000,64,1,38,144,75 65 | gould,concept:32/8750,75,2000,16000,64,1,38,144,113 66 | gould,concept:32/8780,75,2000,16000,128,1,38,259,157 67 | hp,3000/30,90,256,1000,0,3,10,17,18 68 | hp,3000/40,105,256,2000,0,3,10,26,20 69 | hp,3000/44,105,1000,4000,0,3,24,32,28 70 | hp,3000/48,105,2000,4000,8,3,19,32,33 71 | hp,3000/64,75,2000,8000,8,3,24,62,47 72 | hp,3000/88,75,3000,8000,8,3,48,64,54 73 | hp,3000/iii,175,256,2000,0,3,24,22,20 74 | harris,100,300,768,3000,0,6,24,36,23 75 | harris,300,300,768,3000,6,6,24,44,25 76 | harris,500,300,768,12000,6,6,24,50,52 77 | harris,600,300,768,4500,0,1,24,45,27 78 | harris,700,300,384,12000,6,1,24,53,50 79 | harris,80,300,192,768,6,6,24,36,18 80 | harris,800,180,768,12000,6,1,31,84,53 81 | honeywell,dps:6/35,330,1000,3000,0,2,4,16,23 82 | honeywell,dps:6/92,300,1000,4000,8,3,64,38,30 83 | honeywell,dps:6/96,300,1000,16000,8,2,112,38,73 84 | honeywell,dps:7/35,330,1000,2000,0,1,2,16,20 85 | 
honeywell,dps:7/45,330,1000,4000,0,3,6,22,25 86 | honeywell,dps:7/55,140,2000,4000,0,3,6,29,28 87 | honeywell,dps:7/65,140,2000,4000,0,4,8,40,29 88 | honeywell,dps:8/44,140,2000,4000,8,1,20,35,32 89 | honeywell,dps:8/49,140,2000,32000,32,1,20,134,175 90 | honeywell,dps:8/50,140,2000,8000,32,1,54,66,57 91 | honeywell,dps:8/52,140,2000,32000,32,1,54,141,181 92 | honeywell,dps:8/62,140,2000,32000,32,1,54,189,181 93 | honeywell,dps:8/20,140,2000,4000,8,1,20,22,32 94 | ibm,3033:s,57,4000,16000,1,6,12,132,82 95 | ibm,3033:u,57,4000,24000,64,12,16,237,171 96 | ibm,3081,26,16000,32000,64,16,24,465,361 97 | ibm,3081:d,26,16000,32000,64,8,24,465,350 98 | ibm,3083:b,26,8000,32000,0,8,24,277,220 99 | ibm,3083:e,26,8000,16000,0,8,16,185,113 100 | ibm,370/125-2,480,96,512,0,1,1,6,15 101 | ibm,370/148,203,1000,2000,0,1,5,24,21 102 | ibm,370/158-3,115,512,6000,16,1,6,45,35 103 | ibm,38/3,1100,512,1500,0,1,1,7,18 104 | ibm,38/4,1100,768,2000,0,1,1,13,20 105 | ibm,38/5,600,768,2000,0,1,1,16,20 106 | ibm,38/7,400,2000,4000,0,1,1,32,28 107 | ibm,38/8,400,4000,8000,0,1,1,32,45 108 | ibm,4321,900,1000,1000,0,1,2,11,18 109 | ibm,4331-1,900,512,1000,0,1,2,11,17 110 | ibm,4331-11,900,1000,4000,4,1,2,18,26 111 | ibm,4331-2,900,1000,4000,8,1,2,22,28 112 | ibm,4341,900,2000,4000,0,3,6,37,28 113 | ibm,4341-1,225,2000,4000,8,3,6,40,31 114 | ibm,4341-10,225,2000,4000,8,3,6,34,31 115 | ibm,4341-11,180,2000,8000,8,1,6,50,42 116 | ibm,4341-12,185,2000,16000,16,1,6,76,76 117 | ibm,4341-2,180,2000,16000,16,1,6,66,76 118 | ibm,4341-9,225,1000,4000,2,3,6,24,26 119 | ibm,4361-4,25,2000,12000,8,1,4,49,59 120 | ibm,4361-5,25,2000,12000,16,3,5,66,65 121 | ibm,4381-1,17,4000,16000,8,6,12,100,101 122 | ibm,4381-2,17,4000,16000,32,6,12,133,116 123 | ibm,8130-a,1500,768,1000,0,0,0,12,18 124 | ibm,8130-b,1500,768,2000,0,0,0,18,20 125 | ibm,8140,800,768,2000,0,0,0,20,20 126 | ipl,4436,50,2000,4000,0,3,6,27,30 127 | ipl,4443,50,2000,8000,8,3,6,45,44 128 | ipl,4445,50,2000,8000,8,1,6,56,44 129 | ipl,4446,50,2000,16000,24,1,6,70,82 130 | ipl,4460,50,2000,16000,24,1,6,80,82 131 | ipl,4480,50,8000,16000,48,1,10,136,128 132 | magnuson,m80/30,100,1000,8000,0,2,6,16,37 133 | magnuson,m80/31,100,1000,8000,24,2,6,26,46 134 | magnuson,m80/32,100,1000,8000,24,3,6,32,46 135 | magnuson,m80/42,50,2000,16000,12,3,16,45,80 136 | magnuson,m80/43,50,2000,16000,24,6,16,54,88 137 | magnuson,m80/44,50,2000,16000,24,6,16,65,88 138 | microdata,seq.ms/3200,150,512,4000,0,8,128,30,33 139 | nas,as/3000,115,2000,8000,16,1,3,50,46 140 | nas,as/3000-n,115,2000,4000,2,1,5,40,29 141 | nas,as/5000,92,2000,8000,32,1,6,62,53 142 | nas,as/5000-e,92,2000,8000,32,1,6,60,53 143 | nas,as/5000-n,92,2000,8000,4,1,6,50,41 144 | nas,as/6130,75,4000,16000,16,1,6,66,86 145 | nas,as/6150,60,4000,16000,32,1,6,86,95 146 | nas,as/6620,60,2000,16000,64,5,8,74,107 147 | nas,as/6630,60,4000,16000,64,5,8,93,117 148 | nas,as/6650,50,4000,16000,64,5,10,111,119 149 | nas,as/7000,72,4000,16000,64,8,16,143,120 150 | nas,as/7000-n,72,2000,8000,16,6,8,105,48 151 | nas,as/8040,40,8000,16000,32,8,16,214,126 152 | nas,as/8050,40,8000,32000,64,8,24,277,266 153 | nas,as/8060,35,8000,32000,64,8,24,370,270 154 | nas,as/9000-dpc,38,16000,32000,128,16,32,510,426 155 | nas,as/9000-n,48,4000,24000,32,8,24,214,151 156 | nas,as/9040,38,8000,32000,64,8,24,326,267 157 | nas,as/9060,30,16000,32000,256,16,24,510,603 158 | ncr,v8535:ii,112,1000,1000,0,1,4,8,19 159 | ncr,v8545:ii,84,1000,2000,0,1,6,12,21 160 | ncr,v8555:ii,56,1000,4000,0,1,6,17,26 161 | ncr,v8565:ii,56,2000,6000,0,1,8,21,35 162 | 
ncr,v8565:ii-e,56,2000,8000,0,1,8,24,41 163 | ncr,v8575:ii,56,4000,8000,0,1,8,34,47 164 | ncr,v8585:ii,56,4000,12000,0,1,8,42,62 165 | ncr,v8595:ii,56,4000,16000,0,1,8,46,78 166 | ncr,v8635,38,4000,8000,32,16,32,51,80 167 | ncr,v8650,38,4000,8000,32,16,32,116,80 168 | ncr,v8655,38,8000,16000,64,4,8,100,142 169 | ncr,v8665,38,8000,24000,160,4,8,140,281 170 | ncr,v8670,38,4000,16000,128,16,32,212,190 171 | nixdorf,8890/30,200,1000,2000,0,1,2,25,21 172 | nixdorf,8890/50,200,1000,4000,0,1,4,30,25 173 | nixdorf,8890/70,200,2000,8000,64,1,5,41,67 174 | perkin-elmer,3205,250,512,4000,0,1,7,25,24 175 | perkin-elmer,3210,250,512,4000,0,4,7,50,24 176 | perkin-elmer,3230,250,1000,16000,1,1,8,50,64 177 | prime,50-2250,160,512,4000,2,1,5,30,25 178 | prime,50-250-ii,160,512,2000,2,3,8,32,20 179 | prime,50-550-ii,160,1000,4000,8,1,14,38,29 180 | prime,50-750-ii,160,1000,8000,16,1,14,60,43 181 | prime,50-850-ii,160,2000,8000,32,1,13,109,53 182 | siemens,7.521,240,512,1000,8,1,3,6,19 183 | siemens,7.531,240,512,2000,8,1,5,11,22 184 | siemens,7.536,105,2000,4000,8,3,8,22,31 185 | siemens,7.541,105,2000,6000,16,6,16,33,41 186 | siemens,7.551,105,2000,8000,16,4,14,58,47 187 | siemens,7.561,52,4000,16000,32,4,12,130,99 188 | siemens,7.865-2,70,4000,12000,8,6,8,75,67 189 | siemens,7.870-2,59,4000,12000,32,6,12,113,81 190 | siemens,7.872-2,59,8000,16000,64,12,24,188,149 191 | siemens,7.875-2,26,8000,24000,32,8,16,173,183 192 | siemens,7.880-2,26,8000,32000,64,12,16,248,275 193 | siemens,7.881-2,26,8000,32000,128,24,32,405,382 194 | sperry,1100/61-h1,116,2000,8000,32,5,28,70,56 195 | sperry,1100/81,50,2000,32000,24,6,26,114,182 196 | sperry,1100/82,50,2000,32000,48,26,52,208,227 197 | sperry,1100/83,50,2000,32000,112,52,104,307,341 198 | sperry,1100/84,50,4000,32000,112,52,104,397,360 199 | sperry,1100/93,30,8000,64000,96,12,176,915,919 200 | sperry,1100/94,30,8000,64000,128,12,176,1150,978 201 | sperry,80/3,180,262,4000,0,1,3,12,24 202 | sperry,80/4,180,512,4000,0,1,3,14,24 203 | sperry,80/5,180,262,4000,0,1,3,18,24 204 | sperry,80/6,180,512,4000,0,1,3,21,24 205 | sperry,80/8,124,1000,8000,0,1,8,42,37 206 | sperry,90/80-model-3,98,1000,8000,32,2,8,46,50 207 | sratus,32,125,2000,8000,0,2,14,52,41 208 | wang,vs-100,480,512,8000,32,0,0,67,47 209 | wang,vs-90,480,1000,4000,0,0,0,45,25 210 | -------------------------------------------------------------------------------- /XGBoost_demo/regression/machine.names: -------------------------------------------------------------------------------- 1 | 1. Title: Relative CPU Performance Data 2 | 3 | 2. Source Information 4 | -- Creators: Phillip Ein-Dor and Jacob Feldmesser 5 | -- Ein-Dor: Faculty of Management; Tel Aviv University; Ramat-Aviv; 6 | Tel Aviv, 69978; Israel 7 | -- Donor: David W. Aha (aha@ics.uci.edu) (714) 856-8779 8 | -- Date: October, 1987 9 | 10 | 3. Past Usage: 11 | 1. Ein-Dor and Feldmesser (CACM 4/87, pp 308-317) 12 | -- Results: 13 | -- linear regression prediction of relative cpu performance 14 | -- Recorded 34% average deviation from actual values 15 | 2. Kibler,D. & Aha,D. (1988). Instance-Based Prediction of 16 | Real-Valued Attributes. In Proceedings of the CSCSI (Canadian 17 | AI) Conference. 18 | -- Results: 19 | -- instance-based prediction of relative cpu performance 20 | -- similar results; no transformations required 21 | - Predicted attribute: cpu relative performance (numeric) 22 | 23 | 4. Relevant Information: 24 | -- The estimated relative performance values were estimated by the authors 25 | using a linear regression method. 
See their article (pp 308-313) for 26 | more details on how the relative performance values were set. 27 | 28 | 5. Number of Instances: 209 29 | 30 | 6. Number of Attributes: 10 (6 predictive attributes, 2 non-predictive, 31 | 1 goal field, and the linear regression's guess) 32 | 33 | 7. Attribute Information: 34 | 1. vendor name: 30 35 | (adviser, amdahl,apollo, basf, bti, burroughs, c.r.d, cambex, cdc, dec, 36 | dg, formation, four-phase, gould, honeywell, hp, ibm, ipl, magnuson, 37 | microdata, nas, ncr, nixdorf, perkin-elmer, prime, siemens, sperry, 38 | sratus, wang) 39 | 2. Model Name: many unique symbols 40 | 3. MYCT: machine cycle time in nanoseconds (integer) 41 | 4. MMIN: minimum main memory in kilobytes (integer) 42 | 5. MMAX: maximum main memory in kilobytes (integer) 43 | 6. CACH: cache memory in kilobytes (integer) 44 | 7. CHMIN: minimum channels in units (integer) 45 | 8. CHMAX: maximum channels in units (integer) 46 | 9. PRP: published relative performance (integer) 47 | 10. ERP: estimated relative performance from the original article (integer) 48 | 49 | 8. Missing Attribute Values: None 50 | 51 | 9. Class Distribution: the class value (PRP) is continuously valued. 52 | PRP Value Range: Number of Instances in Range: 53 | 0-20 31 54 | 21-100 121 55 | 101-200 27 56 | 201-300 13 57 | 301-400 7 58 | 401-500 4 59 | 501-600 2 60 | above 600 4 61 | 62 | Summary Statistics: 63 | Min Max Mean SD PRP Correlation 64 | MCYT: 17 1500 203.8 260.3 -0.3071 65 | MMIN: 64 32000 2868.0 3878.7 0.7949 66 | MMAX: 64 64000 11796.1 11726.6 0.8630 67 | CACH: 0 256 25.2 40.6 0.6626 68 | CHMIN: 0 52 4.7 6.8 0.6089 69 | CHMAX: 0 176 18.2 26.0 0.6052 70 | PRP: 6 1150 105.6 160.8 1.0000 71 | ERP: 15 1238 99.3 154.8 0.9665 72 | 73 | -------------------------------------------------------------------------------- /XGBoost_demo/regression/mapfeat.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | fo = open( 'machine.txt', 'w' ) 4 | cnt = 6 5 | fmap = {} 6 | for l in open( 'machine.data' ): 7 | arr = l.split(',') 8 | fo.write(arr[8]) 9 | for i in range( 0,6 ): 10 | fo.write( ' %d:%s' %(i,arr[i+2]) ) 11 | 12 | if arr[0] not in fmap: 13 | fmap[arr[0]] = cnt 14 | cnt += 1 15 | 16 | fo.write( ' %d:1' % fmap[arr[0]] ) 17 | fo.write('\n') 18 | 19 | fo.close() 20 | 21 | # create feature map for machine data 22 | fo = open('featmap.txt', 'w') 23 | # list from machine.names 24 | names = ['vendor','MYCT', 'MMIN', 'MMAX', 'CACH', 'CHMIN', 'CHMAX', 'PRP', 'ERP' ]; 25 | 26 | for i in range(0,6): 27 | fo.write( '%d\t%s\tint\n' % (i, names[i+1])) 28 | 29 | for v, k in sorted( fmap.items(), key = lambda x:x[1] ): 30 | fo.write( '%d\tvendor=%s\ti\n' % (k, v)) 31 | fo.close() 32 | -------------------------------------------------------------------------------- /XGBoost_demo/regression/mknfold.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import sys 3 | import random 4 | 5 | if len(sys.argv) < 2: 6 | print ('Usage: [nfold = 5]') 7 | exit(0) 8 | 9 | random.seed( 10 ) 10 | 11 | k = int( sys.argv[2] ) 12 | if len(sys.argv) > 3: 13 | nfold = int( sys.argv[3] ) 14 | else: 15 | nfold = 5 16 | 17 | fi = open( sys.argv[1], 'r' ) 18 | ftr = open( sys.argv[1]+'.train', 'w' ) 19 | fte = open( sys.argv[1]+'.test', 'w' ) 20 | for l in fi: 21 | if random.randint( 1 , nfold ) == k: 22 | fte.write( l ) 23 | else: 24 | ftr.write( l ) 25 | 26 | fi.close() 27 | ftr.close() 28 | fte.close() 29 | 30 | 
-------------------------------------------------------------------------------- /XGBoost_demo/regression/runexp.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # map the data to features. For convenience we only use 7 original attributes and encode them as features in a trivial way 3 | python mapfeat.py 4 | # split train and test 5 | python mknfold.py machine.txt 1 6 | # training and output the models 7 | ../../xgboost machine.conf 8 | # output predictions of test data 9 | ../../xgboost machine.conf task=pred model_in=0002.model 10 | # print the boosters of 0002.model in dump.raw.txt 11 | ../../xgboost machine.conf task=dump model_in=0002.model name_dump=dump.raw.txt 12 | # print the boosters of 0002.model in dump.nice.txt with feature map 13 | ../../xgboost machine.conf task=dump model_in=0002.model fmap=featmap.txt name_dump=dump.nice.txt 14 | 15 | # cat the result 16 | cat dump.nice.txt 17 | -------------------------------------------------------------------------------- /XGBoost_demo/yearpredMSD/README.md: -------------------------------------------------------------------------------- 1 | Demonstrating how to use XGBoost on the [Year Prediction task of the Million Song Dataset](https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD) 2 | 3 | 1. Run runexp.sh 4 | ```bash 5 | ./runexp.sh 6 | ``` 7 | 8 | You can also use the script to prepare the data in LIBSVM format and run the [Distributed Version](../../multi-node). 9 | Note that normally you only need a single machine for a dataset at this scale; use the distributed version for larger datasets. 10 | -------------------------------------------------------------------------------- /XGBoost_demo/yearpredMSD/csv2libsvm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import sys 3 | 4 | if len(sys.argv) < 3: 5 | print('Usage: python csv2libsvm.py INPUT_CSV OUTPUT_LIBSVM') 6 | print('convert an all-numerical csv to libsvm') 7 | sys.exit(0) 8 | 9 | fo = open(sys.argv[2], 'w') 10 | for l in open(sys.argv[1]): 11 | arr = l.split(',') 12 | fo.write('%s' % arr[0]) 13 | for i in range(len(arr) - 1): 14 | fo.write(' %d:%s' % (i, arr[i+1])) 15 | fo.close() 16 | -------------------------------------------------------------------------------- /XGBoost_demo/yearpredMSD/runexp.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if [ -f YearPredictionMSD.txt ] 4 | then 5 | echo "use existing data to run experiment" 6 | else 7 | echo "getting data from uci, make sure you are connected to internet" 8 | wget https://archive.ics.uci.edu/ml/machine-learning-databases/00203/YearPredictionMSD.txt.zip 9 | unzip YearPredictionMSD.txt.zip 10 | fi 11 | echo "start making data.."
12 | # map feature using indicator encoding, also produce featmap.txt 13 | python csv2libsvm.py YearPredictionMSD.txt yearpredMSD.libsvm 14 | head -n 463715 yearpredMSD.libsvm > yearpredMSD.libsvm.train 15 | tail -n 51630 yearpredMSD.libsvm > yearpredMSD.libsvm.test 16 | echo "finish making the data" 17 | ../../xgboost yearpredMSD.conf 18 | -------------------------------------------------------------------------------- /XGBoost_demo/yearpredMSD/yearpredMSD.conf: -------------------------------------------------------------------------------- 1 | # General Parameters, see comment for each definition 2 | # choose the tree booster, can also change to gblinear 3 | booster = gbtree 4 | # this is the only difference with classification, use reg:linear to do linear classification 5 | # when labels are in [0,1] we can also use reg:logistic 6 | objective = reg:linear 7 | 8 | # Tree Booster Parameters 9 | # step size shrinkage 10 | eta = 1.0 11 | # minimum loss reduction required to make a further partition 12 | gamma = 1.0 13 | # minimum sum of instance weight(hessian) needed in a child 14 | min_child_weight = 1 15 | # maximum depth of a tree 16 | max_depth = 5 17 | 18 | base_score = 2001 19 | # Task parameters 20 | # the number of round to do boosting 21 | num_round = 100 22 | # 0 means do not save any model except the final round model 23 | save_period = 0 24 | # The path of training data 25 | data = "yearpredMSD.libsvm.train" 26 | # The path of validation data, used to monitor training process, here [test] sets name of the validation set 27 | eval[test] = "yearpredMSD.libsvm.test" 28 | # The path of test data 29 | #test:data = "yearpredMSD.libsvm.test" 30 | 31 | -------------------------------------------------------------------------------- /git学习.md: -------------------------------------------------------------------------------- 1 | *git是分布式版本控制系统,跟踪管理的是修改,而非文件* 2 | 3 | ## 基本操作 4 | 1.安装git: `sudo apt-get install git` 5 | 安装后设置: `git config --global user.name "Your Name"` 6 |                        `git config --global user.email "email@example.com"` 7 | 2.初始化目录为git仓库: **`git init`** 8 |    把文件添加到仓库:**`git add `** 9 |    把文件提交到仓库:**`git commit -m <说明>`** 10 | 3.查看仓库当前的状态:**`git status`** 11 |    查看修改了哪些内容:`git diff` 12 | 4.查看提交历史:**`git log (--pretty=oneline)`** 13 |    查看命令历史:`git reflog` 14 |    回到上一版本:`git reset --hard HEAD^ ` 15 | 5.工作区和版本库间的关系 16 |     ![工作区和版本库](https://static.liaoxuefeng.com/files/attachments/919020037470528/0) 17 | 6.撤销修改 18 |    修改了但是未git add: `git checkout -- ` 19 |    git add了但未git commit: `git reset HEAD ` 20 | 7.删除文件:在目录中用rm命令删除文件后 21 |     若确实要删除文件:`git rm ` 22 |     若删错了,可恢复:`git checkout -- ` 23 | 24 | ## 远程仓库 25 | 1.关联远程库:**`git remote add origin <远程库地址>`** (关联后远程库的名字就是origin) 26 | 2.推送:**`git push (-u) origin BRANCH`** (-u第一次推送时加,将本地当前的分支推送到远程的BRANCH分支) 27 | `git push origin 本地分支名:远程分支名` (将本地某分支推送到远程某分支) 28 | 29 | 3.将远程BRANCH分支与本地整合:**`git pull origin BRANCH`** (等同于git fetch和git merge) 30 | `git pull origin 远程分支名:本地分支名` 31 | 32 | 4.克隆:**`git clone <远程库地址>`** (无需先进行关联) 33 | 5.查看远程仓库地址:`git remote -v` 34 | 35 | ## 分支管理 36 | 1.查看当前所有分支:`git branch` (`git branch -a`查看本地和远程的分支) 37 | 2.创建分支:`git branch ` 38 | 3.切换分支:`git checkout ` 39 | 4.创建+切换分支:`git checkout -b ` 40 | 5.合并某分支到当前分支:`git merge ` 41 | 6.删除分支:`git branch -d ` 42 | 7.分支冲突 (1)手动解决 (2)解决冲突后查看分支合并图:`git log --graph (--pretty=oneline)` 43 | 8.分支管理策略、Bug分支、Feature分支、多人协作(实际开发中用)参照[Git教程-廖雪峰的官方网站](https://www.liaoxuefeng.com/wiki/0013739516305929606dd18361248578c67b8067c8c017b000/) 44 | 45 | 9.比较两个分支的差异: 
46 | 47 | `git diff branch1 branch2 --stat` //显示出所有有差异的文件列表 48 | 49 | `git diff branch1 branch2 文件名(带路径)` //显示指定文件的详细差异 50 | 51 | `git diff branch1 branch2` //显示出所有有差异的文件的详细差异 52 | 53 | 10. 查看本地分支与远程分支的关联关系:`git branch -vv` 54 | 55 | 11. 更新远程分支列表: `git remote update origin --prune` 56 | 57 | 12. 拉取远程分支并创建本地分支 `git checkout -b 本地分支名x origin/远程分支名x` (采用此种方法建立的本地分支会和远程分支建立映射关系) 58 | 59 | ## 其他 60 | 1. git切换分支时会把未add或未commit的内容带过去;因为工作区和暂存区是公共的,未add和未commit的内容不属于任何一个分支。 61 | -------------------------------------------------------------------------------- /sklearn学习.md: -------------------------------------------------------------------------------- 1 | ### 目录 2 | ### 1. 分类、回归 3 | ### 2. 降维 4 | ### 3. 模型评估与选择 5 | ### 4. 数据预处理 6 | 7 | 8 | |大类 | 小类 | 适用问题 | 实现 | 说明 | 9 | |-------- | --------| -------- | -------- | -------- | 10 | | **分类、回归** | | | | | 11 | |1.1 [广义线性模型](http://scikit-learn.org/stable/modules/linear_model.html)| 1.1.1 普通最小二乘法 | 回归 | [sklearn.linear_model.LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) | | 12 | | 注:本节中所有的回归模型皆为线性回归模型 | 1.1.2 Ridge/岭回归 | 回归 | [sklearn.linear_model.Ridge](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge) | 解决两类回归问题:
一是样本少于变量个数
二是变量间存在共线性 | 13 | | | 1.1.3 Lasso | 回归 | [sklearn.linear_model.Lasso](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso) | 适合特征较少的数据 | 14 | | | 1.1.4 Multi-task Lasso | 回归 | [sklearn.linear_model.MultiTaskLasso](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.MultiTaskLasso.html#sklearn.linear_model.MultiTaskLasso) | y值不是一元的回归问题 15 | | | 1.1.5 Elastic Net | 回归 | [sklearn.linear_model.ElasticNet](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html#sklearn.linear_model.ElasticNet) | 结合了Ridge和Lasso | 16 | | | 1.1.6 Multi-task Elastic Net | 回归 | [sklearn.linear_model.MultiTaskElasticNet](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.MultiTaskElasticNet.html#sklearn.linear_model.MultiTaskElasticNet) | y值不是一元的回归问题 | 17 | | | 1.1.7 Least Angle Regression(LARS) | 回归 | [sklearn.linear_model.Lars](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lars.html#sklearn.linear_model.Lars) | 适合高维数据 | 18 | | | 1.1.8 LARS Lasso | 回归 | [sklearn.linear_model.LassoLars](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoLars.html#sklearn.linear_model.LassoLars) | (1)适合高维数据使用
(2)LARS算法实现的lasso模型 | 19 | | | 1.1.9 Orthogonal Matching Pursuit (OMP) | 回归 | [sklearn.linear_model.OrthogonalMatchingPursuit](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.OrthogonalMatchingPursuit.html#sklearn.linear_model.OrthogonalMatchingPursuit) | 基于贪心算法实现 | 20 | | | 1.1.10 贝叶斯回归 | 回归 | [sklearn.linear_model.BayesianRidge](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html#sklearn.linear_model.BayesianRidge)
[sklearn.linear_model.ARDRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ARDRegression.html#sklearn.linear_model.ARDRegression)| 优点: (1)适用于手边数据(2)可用于在估计过程中包含正规化参数 
缺点:耗时 | 21 | | | 1.1.11 Logistic regression | 分类 | [sklearn.linear_model.LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression) | 22 | | | 1.1.12 SGD(随机梯度下降法) | 分类
/回归 | [sklearn.linear_model.SGDClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier)
[sklearn.linear_model.SGDRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor) | 适用于大规模数据 | 23 | | | 1.1.13 Perceptron | 分类 | [sklearn.linear_model.Perceptron](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html#sklearn.linear_model.Perceptron) | 适用于大规模数据 | 24 | | | 1.1.14 Passive Aggressive Algorithms | 分类
/回归 | [sklearn.linear_model.
PassiveAggressiveClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PassiveAggressiveClassifier.html#sklearn.linear_model.PassiveAggressiveClassifier)

[sklearn.linear_model.
PassiveAggressiveRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PassiveAggressiveRegressor.html#sklearn.linear_model.PassiveAggressiveRegressor) | 适用于大规模数据 | 25 | | | 1.1.15 Huber Regression | 回归 | [sklearn.linear_model.HuberRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.HuberRegressor.html#sklearn.linear_model.HuberRegressor) | 能够处理数据中有异常值的情况 26 | | | 1.1.16 多项式回归 | 回归 | [sklearn.preprocessing.PolynomialFeatures](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html#sklearn.preprocessing.PolynomialFeatures) | 通过PolynomialFeatures将非线性特征转化成多项式形式,再用线性模型进行处理 | 27 | | 1.2 [线性和二次判别分析](http://scikit-learn.org/stable/modules/lda_qda.html) | 1.2.1 LDA | 分类/降维 | [sklearn.discriminant_analysis.
LinearDiscriminantAnalysis](http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis) | | 28 | | | 1.2.2 QDA | 分类 | [sklearn.discriminant_analysis.
QuadraticDiscriminantAnalysis](http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis.html#sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis) | | 29 | | 1.3 [核岭回归](http://scikit-learn.org/stable/modules/kernel_ridge.html) | 简称KRR | 回归 | [sklearn.kernel_ridge.KernelRidge](http://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html#sklearn.kernel_ridge.KernelRidge) | 将核技巧应用到岭回归(1.1.2)中,以实现非线性回归 | 30 | | 1.4 [支持向量机](http://scikit-learn.org/stable/modules/svm.html) | 1.4.1 SVC,NuSVC,LinearSVC | 分类 | [sklearn.svm.SVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)
[sklearn.svm.NuSVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html#sklearn.svm.NuSVC)
[sklearn.svm.LinearSVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC)| SVC可用于非线性分类,可指定核函数;
NuSVC与SVC唯一的不同是可控制支持向量的个数;
LinearSVC用于线性分类| 31 | | | 1.4.2 SVR,NuSVR,LinearSVR | 回归 | [sklearn.svm.SVR](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR)
[sklearn.svm.NuSVR](http://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html#sklearn.svm.NuSVR)
[sklearn.svm.LinearSVR](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVR)| 同上,将"分类"变成"回归"即可 | 32 | | | 1.4.3 OneClassSVM | 异常检测 | [sklearn.svm.OneClassSVM](http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html#sklearn.svm.OneClassSVM)| 无监督实现异常值检测 | 33 | | 1.5 [随机梯度下降](http://scikit-learn.org/stable/modules/sgd.html) | 同1.1.12 | | | | 34 | | 1.6 [最近邻](http://scikit-learn.org/stable/modules/neighbors.html) | 1.6.1 Unsupervised Nearest Neighbors | -- | [sklearn.neighbors.NearestNeighbors](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html#sklearn.neighbors.NearestNeighbors) | 无监督实现K近邻的寻找 | 35 | | | 1.6.2 Nearest Neighbors Classification | 分类 | [sklearn.neighbors.KNeighborsClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)
[sklearn.neighbors.RadiusNeighborsClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.RadiusNeighborsClassifier.html#sklearn.neighbors.RadiusNeighborsClassifier) | (1)不太适用于高维数据
(2)两种实现只是距离度量不一样,后者更适合非均匀的采样 | 36 | | | 1.6.3 Nearest Neighbors Regression| 回归 | [sklearn.neighbors.KNeighborsRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor)
[sklearn.neighbors.RadiusNeighborsRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.RadiusNeighborsRegressor.html#sklearn.neighbors.RadiusNeighborsRegressor) | 同上 | 37 | | | 1.6.5 Nearest Centroid Classifier | 分类 | [sklearn.neighbors.NearestCentroid](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestCentroid.html#sklearn.neighbors.NearestCentroid) | 每个类对应一个质心,测试样本被分类到距离最近的质心所在的类别 | 38 | | 1.7 [高斯过程(GP/GPML)](http://scikit-learn.org/stable/modules/gaussian_process.html) | 1.7.1 GPR | 回归 | [sklearn.gaussian_process.
GaussianProcessRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessRegressor.html#sklearn.gaussian_process.GaussianProcessRegressor) | 与KRR一样使用了核技巧 | 39 | | | 1.7.3 GPC | 分类 | [sklearn.gaussian_process.
GaussianProcessClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessClassifier.html#sklearn.gaussian_process.GaussianProcessClassifier) | | 40 | | 1.8 [交叉分解](http://scikit-learn.org/stable/modules/cross_decomposition.html) | 实现算法:CCA和PLS | -- | -- | 用来计算两个多元数据集的线性关系,当预测数据比观测数据有更多的变量时,用PLS更好 | 41 | | 1.9 [朴素贝叶斯](http://scikit-learn.org/stable/modules/naive_bayes.html) | 1.9.1 高斯朴素贝叶斯 | 分类 | [sklearn.naive_bayes.GaussianNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB) | 处理特征是连续型变量的情况 | 42 | | | 1.9.2 多项式朴素贝叶斯 | 分类 | [sklearn.naive_bayes.MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB) | 最常见,要求特征是离散数据 | 43 | | | 1.9.3 伯努利朴素贝叶斯 | 分类 | [sklearn.naive_bayes.BernoulliNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB) | 要求特征是离散的,且为布尔类型,即true和false,或者1和0 | 44 | | 1.10 [决策树](http://scikit-learn.org/stable/modules/tree.html) | 1.10.1 Classification | 分类 | [sklearn.tree.DecisionTreeClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) | | 45 | | | 1.10.2 Regression | 回归 | [sklearn.tree.DecisionTreeRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor) | | 46 | | 1.11 [集成方法](http://scikit-learn.org/stable/modules/ensemble.html) | 1.11.1 Bagging | 分类/回归 | [sklearn.ensemble.BaggingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier)
[sklearn.ensemble.BaggingRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html#sklearn.ensemble.BaggingRegressor) | 可以指定基学习器,默认为决策树 | 47 | | 注:1和2属于集成方法中的并行化方法,3和4属于序列化方法 | 1.11.2 Forests of randomized trees | 分类/回归 | RandomForest(RF,随机森林):
[sklearn.ensemble.RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)
[sklearn.ensemble.RandomForestRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor)
ExtraTrees(RF改进):
[sklearn.ensemble.ExtraTreesClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn.ensemble.ExtraTreesClassifier)
[sklearn.ensemble.ExtraTreesRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html#sklearn.ensemble.ExtraTreesRegressor) | 基学习器为决策树 | 48 | | | 1.11.3 AdaBoost | 分类/回归 | [sklearn.ensemble.AdaBoostClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier)
[sklearn.ensemble.AdaBoostRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html#sklearn.ensemble.AdaBoostRegressor) | 可以指定基学习器,默认为决策树 | 49 | | 号外:最近特别火的两个梯度提升算法,LightGBM和XGBoost
(XGBoost提供了sklearn接口) | 1.11.4 Gradient Tree Boosting| 分类/回归 | GBDT:
[sklearn.ensemble.GradientBoostingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier)
GBRT:
[sklearn.ensemble.GradientBoostingRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor) | 基学习器为决策树 | 50 | | | 1.11.5 Voting Classifier | 分类 | [sklearn.ensemble.VotingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier) | 须指定基学习器 | 51 | | 1.12 [多类与多标签算法](http://scikit-learn.org/stable/modules/multiclass.html) | -- | -- | -- | **sklearn中的分类算法都默认支持多类分类**,其中LinearSVC、 LogisticRegression和GaussianProcessClassifier在进行多类分类时需指定参数multi_class | 52 | | 1.13 [特征选择](http://scikit-learn.org/stable/modules/feature_selection.html) | 1.13.1 过滤法之方差选择法 | 特征选择 | [sklearn.feature_selection.VarianceThreshold](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold) | 特征选择方法分为3种:过滤法、包裹法和嵌入法。过滤法不用考虑后续学习器 | 53 | | | 1.13.2 过滤法之卡方检验 | 特征选择 | [sklearn.feature_selection.chi2](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2),结合[sklearn.feature_selection.SelectKBest](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest) | | 54 | | | 1.13.3 包裹法之递归特征消除法 | 特征选择 | [sklearn.feature_selection.RFE](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE) | 包裹法需考虑后续学习器,参数中需输入基学习器 | 55 | | | 1.13.4 嵌入法 | 特征选择 | [sklearn.feature_selection.SelectFromModel](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel) | 嵌入法是过滤法和嵌入法的结合,参数中也需输入基学习器 | 56 | | 1.14 [半监督](http://scikit-learn.org/stable/modules/label_propagation.html) | 1.14.1 Label Propagation | 分类/回归 | [sklearn.semi_supervised.LabelPropagation](http://scikit-learn.org/stable/modules/generated/sklearn.semi_supervised.LabelPropagation.html#sklearn.semi_supervised.LabelPropagation)
[sklearn.semi_supervised.LabelSpreading](http://scikit-learn.org/stable/modules/generated/sklearn.semi_supervised.LabelSpreading.html#sklearn.semi_supervised.LabelSpreading) | | 57 | | 1.15 [保序回归](http://scikit-learn.org/stable/modules/isotonic.html) | -- | 回归 | [sklearn.isotonic.IsotonicRegression](http://scikit-learn.org/stable/modules/generated/sklearn.isotonic.IsotonicRegression.html#sklearn.isotonic.IsotonicRegression) | | 58 | | 1.16 [概率校准](http://scikit-learn.org/stable/modules/calibration.html) | -- | -- | -- | 在执行分类时,获得预测的标签的概率 | 59 | | 1.17 [神经网络模型](http://scikit-learn.org/stable/modules/neural_networks_supervised.html) | (待写) | | | | 60 | | **降维** | | | | | 61 | | 2.5 [降维](http://scikit-learn.org/stable/modules/decomposition.html#decompositions) | 2.5.1 主成分分析 | 降维 | PCA:
[sklearn.decomposition.PCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA)
IPCA:
[sklearn.decomposition.IncrementalPCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.IncrementalPCA.html#sklearn.decomposition.IncrementalPCA)
KPCA:
[sklearn.decomposition.KernelPCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html#sklearn.decomposition.KernelPCA)
SPCA:
[sklearn.decomposition.SparsePCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.SparsePCA.html#sklearn.decomposition.SparsePCA) | (1)IPCA比PCA有更好的内存效率,适合超大规模降维。
(2)KPCA可以进行非线性降维
(3)SPCA是PCA的变体,降维后返回最佳的稀疏矩阵 | 62 | | | 2.5.2 截断奇异值分解 | 降维 | [sklearn.decomposition.TruncatedSVD](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD)| 可以直接对scipy.sparse矩阵处理 | 63 | | | 2.5.3 字典学习 | -- | [sklearn.decomposition.SparseCoder](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.SparseCoder.html#sklearn.decomposition.SparseCoder)
[sklearn.decomposition.DictionaryLearning](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.DictionaryLearning.html#sklearn.decomposition.DictionaryLearning) | SparseCoder实现稀疏编码,DictionaryLearning实现字典学习 | 64 | | **模型评估与选择** | | | | | 65 | | 3.1 [交叉验证/CV](http://scikit-learn.org/stable/modules/cross_validation.html) | 3.1.1 分割训练集和测试集 | -- | [sklearn.model_selection.train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split) | | 66 | | | 3.1.2 通过交叉验证评估score |--| [sklearn.model_selection.cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score) | score对应性能度量,分类问题默认为accuracy_score,回归问题默认为r2_score | 67 | | | 3.1.3 留一法LOO | -- | [sklearn.model_selection.LeaveOneOut](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html#sklearn.model_selection.LeaveOneOut) | CV的特例 | 68 | | | 3.1.4 留P法LPO | -- | [sklearn.model_selection.LeavePOut](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeavePOut.html#sklearn.model_selection.LeavePOut) | CV的特例 | 69 | | 3.2 [调参](http://scikit-learn.org/stable/modules/grid_search.html) | 3.2.1 网格搜索 | -- | [sklearn.model_selection.GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) | 最常用的调参方法。可传入学习器、学习器参数范围、性能度量score(默认为accuracy_score或r2_score )等 | 70 | | | 3.2.2 随机搜索 | -- | [sklearn.model_selection.RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV) | 参数传入同上 | 71 | | 3.3 [性能度量](http://scikit-learn.org/stable/modules/model_evaluation.html) | 3.3.1 分类度量 | -- | -- | 对应交叉验证和调参中的score | 72 | | | 3.3.2 回归度量 | -- | -- | | 73 | | | 3.3.3 聚类度量 | -- | -- | | 74 | | 3.4 [模型持久性](http://scikit-learn.org/stable/modules/model_persistence.html) | -- | -- | -- | 使用pickle存放模型,可以使模型不用重复训练 | 75 | | 3.5 [验证曲线](http://scikit-learn.org/stable/modules/learning_curve.html#learning-curve)| 3.5.1 验证曲线 | -- | [sklearn.model_selection.validation_curve](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.validation_curve.html#sklearn.model_selection.validation_curve) | 横轴为某个参数的值,纵轴为模型得分 | 76 | | | 3.5.2 学习曲线 | -- | [sklearn.model_selection.learning_curve](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html#sklearn.model_selection.learning_curve) | 横轴为训练数据大小,纵轴为模型得分 | 77 | | **数据预处理** | | | | | 78 | | 4.3 [数据预处理](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing) | 4.3.1 标准化 | 数据预处理 | 标准化:
[sklearn.preprocessing.scale](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html#sklearn.preprocessing.scale)
[sklearn.preprocessing.StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) | scale与StandardScaler都是将将特征转化成标准正态分布(即均值为0,方差为1),且都可以处理scipy.sparse矩阵,但一般选择后者 | 79 | | | | 数据预处理 |
区间缩放:
[sklearn.preprocessing.MinMaxScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler)
[sklearn.preprocessing.MaxAbsScale](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler) | MinMaxScaler默认为0-1缩放,MaxAbsScaler可以处理scipy.sparse矩阵 | 80 | | | 4.3.2 非线性转换 | 数据预处理 | [sklearn.preprocessing.QuantileTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html#sklearn.preprocessing.QuantileTransformer) | 可以更少的受异常值的影响 | 81 | | | 4.3.3 归一化 | 数据预处理 | [sklearn.preprocessing.Normalizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer) | 将行向量转换为单位向量,目的在于样本向量在点乘运算或其他核函数计算相似性时,拥有统一的标准 | 82 | | | 4.3.4 二值化 | 数据预处理 | [sklearn.preprocessing.Binarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html#sklearn.preprocessing.Binarizer) | 通过设置阈值对定量特征处理,获取布尔值 | 83 | | | 4.3.5 哑编码 | 数据预处理 | [sklearn.preprocessing.OneHotEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) | 对定性特征编码。也可用pandas.get_dummies实现 | 84 | | | 4.3.6 缺失值计算 | 数据预处理 | [sklearn.preprocessing.Imputer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html#sklearn.preprocessing.Imputer) | 可用三种方式填充缺失值,均值(默认)、中位数和众数。也可用pandas.fillna实现 | 85 | | | 4.3.7 多项式转换 | 数据预处理 | [sklearn.preprocessing.PolynomialFeatures](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html#sklearn.preprocessing.PolynomialFeatures) | | 86 | | | 4.3.8 自定义转换 | 数据预处理 | [sklearn.preprocessing.FunctionTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html#sklearn.preprocessing.FunctionTransformer) | | 87 | | | | | | | 88 | | | | | | | 89 | 90 | 91 | -------------------------------------------------------------------------------- /xgboost学习.md: -------------------------------------------------------------------------------- 1 | ## XGBoost调参指南 2 | [参考-官网](http://xgboost.readthedocs.io/en/latest/////////how_to/param_tuning.html) 3 | ### 方法1 4 | 可按照**max_depth, min_child_weight colsamplt_bytree,eta**的顺序一个一个调,每次调的时候其他参数保持不变 5 | ### 方法2:防止过拟合 6 | When you observe high training accuracy, but low tests accuracy, it is likely that you encounter overfitting problem. 7 | 8 | There are in general two ways that you can control overfitting in xgboost 9 | 10 | - The first way is to directly control model complexity 11 | This include **max_depth, min_child_weight and gamma** 12 | - The second way is to add randomness to make training robust to noise 13 | This include **subsample, colsample_bytree** 14 | You can also reduce stepsize **eta**, but needs to remember to increase **num_round** when you do so. 15 | 16 | ## XGBoost参数 17 | [参考1-官网](http://xgboost.readthedocs.io/en/latest/////parameter.html) [参考2-CSDN](http://m.blog.csdn.net/qimiejia5584/article/details/78622442) 18 |
XGBoost参数类型分为三种: 19 | - general parameters/一般参数:决定使用哪种booster,可选gbtree、dart、gblinear 20 | - booster parameters/提升器参数:不同的booster选取不同的参数 21 | - task parameters/任务参数:设置学习场景,比如回归、二类分类、多类分类等 22 | 23 | | 参数类型 | 参数名 | 参数取值 | 默认值 | 说明 | 24 | |-------- | --------| -------- | -------- | -------- | 25 | | **general** | | | | | 26 | || booster | gbtree
dart
gblinear | gbtree | 指定使用的booster。前两种为树模型,后两种为线性模型。一般用默认的就好| 27 | | | silent | 0、1 | 0 | 0表示输出运行信息,1表示采取静默模式 | 28 | | | nthread | |系统允许的最大线程数 | 并发线程数 | 29 | | | num_pbuffer | 不可指定 | | 缓冲池大小 | 30 | | | num_feature | 不可指定 | | 特征维度 | 31 | | **booster** | | | | | 32 | | **Tree Booster** | **eta** | [0,1] |0.3| 学习率 | 33 | | | **gamma** | [0,∞] |0 | 最小损失分裂。越大越保守,一般0.1、0.2这样子 | 34 | | | **max_depth** | [0,∞] | 6 | 树的最大深度。越大模型越复杂,越容易过拟合 | 35 | | | **min_child_weight** | [0,∞] | 1 | 子节点最小的权重。非常容易影响结果,参数数值越大,就越保守,越不会过拟合 | 36 | | | max_delta_step | [0,∞] |0 | 最大delta的步数| 37 | | | **subsample** | (0,1] | 1 | 子样本数目。是否只使用部分的样本进行训练,这可以避免过拟合化。默认为1,即全部用作训练。 | 38 | | | **colsample_bytree** | (0,1] | 1 | 每棵树的列数(特征数) | 39 | | | colsample_bylevel | (0,1] | 1 | 每一层的列数(特征数)| 40 | | | lambda | | 1 | L2正则化的权重 | 41 | | | alpha | | 0 | L1正则化的权重 | 42 | | | tree_method | {‘auto’, ‘exact’, ‘approx’, ‘hist’, ‘gpu_exact’, ‘gpu_hist’} | 'auto' |树构造的算法 | 43 | | | sketch_eps | (0,1) |0.03 | 当tree_mothed='approx'才有用| 44 | | | scale_pos_weight | | 1 | 用在不均衡的分类中 | 45 | | | updater | | ’grow_colmaker,prune’ | 更新器 | 46 | | | refresh_leaf | |1 |当updater= refresh才有用 | 47 | | | process_type | {‘default’, ‘update’} |’default’ |程序运行方式。update表示更新已有的树| 48 | | | grow_policy | {‘depthwise’, ‘lossguide’} |’depthwise’ |控制新结点加入树的方式。当tree_mothod='hist'才有用| 49 | | | max_leaves | |0 | 要添加的最大结点数 | 50 | | | max_bin | | 256|当tree_mothod='hist'才有用| 51 | | | predictor | {‘cpu_predictor’, ‘gpu_predictor’} |’cpu_predictor’ | 算法预测时采用的方式| 52 | | **Dart Booster**
(比Tree Booster增加了5个参数) | sample_type| {“uniform”,“weighted”} | "uniform"|选样方式 | 53 | | |normalize_type |{"tree", "forest"} | "tree"|正则化方式 | 54 | | | rate_drop | [0.0, 1.0] |0.0| 舍弃上一轮树的比例| 55 | || one_drop | | 0| 当值不为0的时候,至少有一棵树被舍弃| 56 | | |skip _drop | [0.0, 1.0] |0.0 | 跳过舍弃树的程序的概率| 57 | | **Linear Booster** | lambda | |0 |L2正则化的权重 | 58 | | | alpha | |0 | L1正则化的权重| 59 | | | lambda_bias | |0 |L2正则化的偏爱 | 60 | | **Tweedie Regression**
(三种booster共有的参数) | tweedie_variance_power | (1,2) |1.5 | 接近2代表接近gamma分布,接近1代表接近泊松分布| 61 | | **task** | | | | | 62 | | | **objective** | “reg:linear” – 线性回归
“reg:logistic” – 逻辑回归
"binary:logistic" – 逻辑回归处理二分类(输出概率)
"binary:logitraw" – 逻辑回归处理二分类(输出分数)
"count:poisson" – 泊松回归处理计算数据(输出均值、max_delta_step参数默认为0.7)
"multi:softmax" – 采用softmax目标函数处理多分类问题(需要设定类别数"num_class")
"multi:softprob" – 多分类(与上一搁一样,只是它的输出是ndata*nclass
"rank:pairwise" – 处理排位问题
"reg:gamma" – 用γ回归(返回均值)
"reg:tweedie" – 用特威迪回归| 'reg:linear' |定义学习任务及相应的学习目标 | 63 | | | base_score | | 0.5 | 所有实例的初始预测得分 | 64 | | | eval_metric | | 回归问题:rmse
分类问题:error
排位问题:map| 评价指标/性能度量。Python可通过list传递多个评价指标 | 65 | | | seed | | 0 | 随机数种子。每次取一样的seed可得到相同的随机划分 | 66 | | **Command Line** | | | | | 67 | ||**num_round**||| boosting的轮数/迭代次数| 68 | 69 | xgb使用sklearn接口会改变的函数名是:[官网](http://xgboost.readthedocs.io/en/latest/////python/python_api.html#module-xgboost.sklearn) 70 | - eta -> learning_rate 71 | - lambda -> reg_lambda 72 | - alpha -> reg_alpha --------------------------------------------------------------------------------
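> A minimal sketch illustrating the renamed parameters listed above when using XGBoost's sklearn interface (synthetic data; the parameter values are arbitrary examples, not tuning advice):

```python
import numpy as np
from xgboost import XGBRegressor

# Synthetic regression data, just to exercise the sklearn-style parameter names
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)

# eta -> learning_rate, lambda -> reg_lambda, alpha -> reg_alpha;
# num_round corresponds to n_estimators in the sklearn wrapper
model = XGBRegressor(
    n_estimators=100,
    learning_rate=0.3,
    max_depth=6,
    reg_lambda=1.0,
    reg_alpha=0.0,
)
model.fit(X, y)
print(model.predict(X[:5]))
```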