├── LightGBM_demo ├── .gitignore ├── README.md ├── binary_classification │ ├── README.md │ ├── binary.test │ ├── binary.test.weight │ ├── binary.train │ ├── binary.train.weight │ ├── predict.conf │ └── train.conf ├── lambdarank │ ├── README.md │ ├── predict.conf │ ├── rank.test │ ├── rank.test.query │ ├── rank.train │ ├── rank.train.query │ └── train.conf ├── multiclass_classification │ ├── README.md │ ├── multiclass.test │ ├── multiclass.train │ ├── predict.conf │ └── train.conf ├── parallel_learning │ ├── README.md │ ├── binary.test │ ├── binary.train │ ├── predict.conf │ └── train.conf ├── python-guide │ ├── README.md │ ├── advanced_example.py │ ├── model.json │ ├── model.pkl │ ├── plot_example.py │ ├── simple_example.py │ └── sklearn_example.py └── regression │ ├── README.md │ ├── predict.conf │ ├── regression.test │ ├── regression.test.init │ ├── regression.train │ ├── regression.train.init │ └── train.conf ├── Linux下使用MySQL.md ├── README.md ├── TensorFlow_Started.ipynb ├── XGBoost_demo ├── .gitignore ├── README.md ├── binary_classification │ ├── README.md │ ├── agaricus-lepiota.data │ ├── agaricus-lepiota.fmap │ ├── agaricus-lepiota.names │ ├── mapfeat.py │ ├── mknfold.py │ ├── mushroom.conf │ └── runexp.sh ├── data │ ├── README.md │ ├── agaricus.txt.test │ ├── agaricus.txt.train │ ├── featmap.txt │ └── gen_autoclaims.R ├── distributed-training │ ├── README.md │ ├── mushroom.aws.conf │ ├── plot_model.ipynb │ └── run_aws.sh ├── gpu_acceleration │ ├── README.md │ └── cover_type.py ├── guide-python │ ├── README.md │ ├── basic_walkthrough.py │ ├── boost_from_prediction.py │ ├── cross_validation.py │ ├── custom_objective.py │ ├── evals_result.py │ ├── external_memory.py │ ├── gamma_regression.py │ ├── generalized_linear_model.py │ ├── predict_first_ntree.py │ ├── predict_leaf_indices.py │ ├── runall.sh │ ├── sklearn_evals_result.py │ ├── sklearn_examples.py │ └── sklearn_parallel.py ├── kaggle-higgs │ ├── README.md │ ├── higgs-cv.py │ ├── higgs-numpy.py │ ├── higgs-pred.R │ ├── higgs-pred.py │ ├── higgs-train.R │ ├── run.sh │ ├── speedtest.R │ └── speedtest.py ├── kaggle-otto │ ├── README.MD │ ├── otto_train_pred.R │ └── understandingXGBoostModel.Rmd ├── multiclass_classification │ ├── README.md │ ├── runexp.sh │ └── train.py ├── rank │ ├── README.md │ ├── mq2008.conf │ ├── runexp.sh │ ├── trans_data.py │ └── wgetdata.sh ├── regression │ ├── README.md │ ├── machine.conf │ ├── machine.data │ ├── machine.names │ ├── mapfeat.py │ ├── mknfold.py │ └── runexp.sh └── yearpredMSD │ ├── README.md │ ├── csv2libsvm.py │ ├── runexp.sh │ └── yearpredMSD.conf ├── git学习.md ├── sklearn学习.md └── xgboost学习.md /LightGBM_demo/.gitignore: -------------------------------------------------------------------------------- 1 | *.txt 2 | -------------------------------------------------------------------------------- /LightGBM_demo/README.md: -------------------------------------------------------------------------------- 1 | Examples 2 | ======== 3 | 4 | You can learn how to use LightGBM by these examples. 5 | -------------------------------------------------------------------------------- /LightGBM_demo/binary_classification/README.md: -------------------------------------------------------------------------------- 1 | Binary Classification Example 2 | ============================= 3 | 4 | Here is an example for LightGBM to run binary classification task. 
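The CLI steps below drive everything through `train.conf` and `predict.conf`. For readers who prefer the Python package (see the `python-guide` folder), the key settings in this folder's `train.conf` map roughly onto the parameter dictionary in the sketch below — a hedged illustration that assumes `lightgbm` and `pandas` are installed and that the script is run from this folder; it is not required for the CLI example itself.

```
# Rough Python equivalent of train.conf in this folder (illustrative sketch only).
import lightgbm as lgb
import pandas as pd

# binary.train / binary.test are tab-separated: label in column 0, features after it
# (python-guide/advanced_example.py reads these same files this way).
df_train = pd.read_csv('binary.train', header=None, sep='\t')
df_test = pd.read_csv('binary.test', header=None, sep='\t')
y_train, X_train = df_train[0].values, df_train.drop(0, axis=1).values
y_test, X_test = df_test[0].values, df_test.drop(0, axis=1).values

# Mirrors the key options in train.conf.
params = {
    'objective': 'binary',
    'metric': ['binary_logloss', 'auc'],
    'num_leaves': 63,
    'learning_rate': 0.1,
    'max_bin': 255,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'min_data_in_leaf': 50,
    'min_sum_hessian_in_leaf': 5.0,
}

lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)
gbm = lgb.train(params, lgb_train, num_boost_round=100, valid_sets=[lgb_eval])
gbm.save_model('LightGBM_model.txt')
```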
5 | 6 | ***You should copy executable file to this folder first.*** 7 | 8 | Trainin 9 | ------- 10 | 11 | For Windows, by running following command in this folder: 12 | 13 | ``` 14 | lightgbm.exe config=train.conf 15 | ``` 16 | 17 | For Linux, by running following command in this folder: 18 | 19 | ``` 20 | ./lightgbm config=train.conf 21 | ``` 22 | 23 | Prediction 24 | ---------- 25 | 26 | You should finish training first. 27 | 28 | For Windows, by running following command in this folder: 29 | 30 | ``` 31 | lightgbm.exe config=predict.conf 32 | ``` 33 | 34 | For Linux, by running following command in this folder: 35 | 36 | ``` 37 | ./lightgbm config=predict.conf 38 | ``` 39 | -------------------------------------------------------------------------------- /LightGBM_demo/binary_classification/binary.test.weight: -------------------------------------------------------------------------------- 1 | 1.0 2 | 1.0 3 | 1.0 4 | 1.0 5 | 1.0 6 | 1.0 7 | 1.0 8 | 1.0 9 | 1.0 10 | 1.0 11 | 1.0 12 | 1.0 13 | 1.0 14 | 1.0 15 | 1.0 16 | 1.0 17 | 1.0 18 | 1.0 19 | 1.0 20 | 1.0 21 | 1.0 22 | 1.0 23 | 1.0 24 | 1.0 25 | 1.0 26 | 1.0 27 | 1.0 28 | 1.0 29 | 1.0 30 | 1.0 31 | 1.0 32 | 1.0 33 | 1.0 34 | 1.0 35 | 1.0 36 | 1.0 37 | 1.0 38 | 1.0 39 | 1.0 40 | 1.0 41 | 1.0 42 | 1.0 43 | 1.0 44 | 1.0 45 | 1.0 46 | 1.0 47 | 1.0 48 | 1.0 49 | 1.0 50 | 1.0 51 | 1.0 52 | 1.0 53 | 1.0 54 | 1.0 55 | 1.0 56 | 1.0 57 | 1.0 58 | 1.0 59 | 1.0 60 | 1.0 61 | 1.0 62 | 1.0 63 | 1.0 64 | 1.0 65 | 1.0 66 | 1.0 67 | 1.0 68 | 1.0 69 | 1.0 70 | 1.0 71 | 1.0 72 | 1.0 73 | 1.0 74 | 1.0 75 | 1.0 76 | 1.0 77 | 1.0 78 | 1.0 79 | 1.0 80 | 1.0 81 | 1.0 82 | 1.0 83 | 1.0 84 | 1.0 85 | 1.0 86 | 1.0 87 | 1.0 88 | 1.0 89 | 1.0 90 | 1.0 91 | 1.0 92 | 1.0 93 | 1.0 94 | 1.0 95 | 1.0 96 | 1.0 97 | 1.0 98 | 1.0 99 | 1.0 100 | 1.0 101 | 1.0 102 | 1.0 103 | 1.0 104 | 1.0 105 | 1.0 106 | 1.0 107 | 1.0 108 | 1.0 109 | 1.0 110 | 1.0 111 | 1.0 112 | 1.0 113 | 1.0 114 | 1.0 115 | 1.0 116 | 1.0 117 | 1.0 118 | 1.0 119 | 1.0 120 | 1.0 121 | 1.0 122 | 1.0 123 | 1.0 124 | 1.0 125 | 1.0 126 | 1.0 127 | 1.0 128 | 1.0 129 | 1.0 130 | 1.0 131 | 1.0 132 | 1.0 133 | 1.0 134 | 1.0 135 | 1.0 136 | 1.0 137 | 1.0 138 | 1.0 139 | 1.0 140 | 1.0 141 | 1.0 142 | 1.0 143 | 1.0 144 | 1.0 145 | 1.0 146 | 1.0 147 | 1.0 148 | 1.0 149 | 1.0 150 | 1.0 151 | 1.0 152 | 1.0 153 | 1.0 154 | 1.0 155 | 1.0 156 | 1.0 157 | 1.0 158 | 1.0 159 | 1.0 160 | 1.0 161 | 1.0 162 | 1.0 163 | 1.0 164 | 1.0 165 | 1.0 166 | 1.0 167 | 1.0 168 | 1.0 169 | 1.0 170 | 1.0 171 | 1.0 172 | 1.0 173 | 1.0 174 | 1.0 175 | 1.0 176 | 1.0 177 | 1.0 178 | 1.0 179 | 1.0 180 | 1.0 181 | 1.0 182 | 1.0 183 | 1.0 184 | 1.0 185 | 1.0 186 | 1.0 187 | 1.0 188 | 1.0 189 | 1.0 190 | 1.0 191 | 1.0 192 | 1.0 193 | 1.0 194 | 1.0 195 | 1.0 196 | 1.0 197 | 1.0 198 | 1.0 199 | 1.0 200 | 1.0 201 | 1.0 202 | 1.0 203 | 1.0 204 | 1.0 205 | 1.0 206 | 1.0 207 | 1.0 208 | 1.0 209 | 1.0 210 | 1.0 211 | 1.0 212 | 1.0 213 | 1.0 214 | 1.0 215 | 1.0 216 | 1.0 217 | 1.0 218 | 1.0 219 | 1.0 220 | 1.0 221 | 1.0 222 | 1.0 223 | 1.0 224 | 1.0 225 | 1.0 226 | 1.0 227 | 1.0 228 | 1.0 229 | 1.0 230 | 1.0 231 | 1.0 232 | 1.0 233 | 1.0 234 | 1.0 235 | 1.0 236 | 1.0 237 | 1.0 238 | 1.0 239 | 1.0 240 | 1.0 241 | 1.0 242 | 1.0 243 | 1.0 244 | 1.0 245 | 1.0 246 | 1.0 247 | 1.0 248 | 1.0 249 | 1.0 250 | 1.0 251 | 1.0 252 | 1.0 253 | 1.0 254 | 1.0 255 | 1.0 256 | 1.0 257 | 1.0 258 | 1.0 259 | 1.0 260 | 1.0 261 | 1.0 262 | 1.0 263 | 1.0 264 | 1.0 265 | 1.0 266 | 1.0 267 | 1.0 268 | 1.0 269 | 1.0 270 | 1.0 271 | 1.0 272 | 1.0 273 | 1.0 274 | 1.0 275 | 1.0 276 | 
1.0 277 | 1.0 278 | 1.0 279 | 1.0 280 | 1.0 281 | 1.0 282 | 1.0 283 | 1.0 284 | 1.0 285 | 1.0 286 | 1.0 287 | 1.0 288 | 1.0 289 | 1.0 290 | 1.0 291 | 1.0 292 | 1.0 293 | 1.0 294 | 1.0 295 | 1.0 296 | 1.0 297 | 1.0 298 | 1.0 299 | 1.0 300 | 1.0 301 | 1.0 302 | 1.0 303 | 1.0 304 | 1.0 305 | 1.0 306 | 1.0 307 | 1.0 308 | 1.0 309 | 1.0 310 | 1.0 311 | 1.0 312 | 1.0 313 | 1.0 314 | 1.0 315 | 1.0 316 | 1.0 317 | 1.0 318 | 1.0 319 | 1.0 320 | 1.0 321 | 1.0 322 | 1.0 323 | 1.0 324 | 1.0 325 | 1.0 326 | 1.0 327 | 1.0 328 | 1.0 329 | 1.0 330 | 1.0 331 | 1.0 332 | 1.0 333 | 1.0 334 | 1.0 335 | 1.0 336 | 1.0 337 | 1.0 338 | 1.0 339 | 1.0 340 | 1.0 341 | 1.0 342 | 1.0 343 | 1.0 344 | 1.0 345 | 1.0 346 | 1.0 347 | 1.0 348 | 1.0 349 | 1.0 350 | 1.0 351 | 1.0 352 | 1.0 353 | 1.0 354 | 1.0 355 | 1.0 356 | 1.0 357 | 1.0 358 | 1.0 359 | 1.0 360 | 1.0 361 | 1.0 362 | 1.0 363 | 1.0 364 | 1.0 365 | 1.0 366 | 1.0 367 | 1.0 368 | 1.0 369 | 1.0 370 | 1.0 371 | 1.0 372 | 1.0 373 | 1.0 374 | 1.0 375 | 1.0 376 | 1.0 377 | 1.0 378 | 1.0 379 | 1.0 380 | 1.0 381 | 1.0 382 | 1.0 383 | 1.0 384 | 1.0 385 | 1.0 386 | 1.0 387 | 1.0 388 | 1.0 389 | 1.0 390 | 1.0 391 | 1.0 392 | 1.0 393 | 1.0 394 | 1.0 395 | 1.0 396 | 1.0 397 | 1.0 398 | 1.0 399 | 1.0 400 | 1.0 401 | 1.0 402 | 1.0 403 | 1.0 404 | 1.0 405 | 1.0 406 | 1.0 407 | 1.0 408 | 1.0 409 | 1.0 410 | 1.0 411 | 1.0 412 | 1.0 413 | 1.0 414 | 1.0 415 | 1.0 416 | 1.0 417 | 1.0 418 | 1.0 419 | 1.0 420 | 1.0 421 | 1.0 422 | 1.0 423 | 1.0 424 | 1.0 425 | 1.0 426 | 1.0 427 | 1.0 428 | 1.0 429 | 1.0 430 | 1.0 431 | 1.0 432 | 1.0 433 | 1.0 434 | 1.0 435 | 1.0 436 | 1.0 437 | 1.0 438 | 1.0 439 | 1.0 440 | 1.0 441 | 1.0 442 | 1.0 443 | 1.0 444 | 1.0 445 | 1.0 446 | 1.0 447 | 1.0 448 | 1.0 449 | 1.0 450 | 1.0 451 | 1.0 452 | 1.0 453 | 1.0 454 | 1.0 455 | 1.0 456 | 1.0 457 | 1.0 458 | 1.0 459 | 1.0 460 | 1.0 461 | 1.0 462 | 1.0 463 | 1.0 464 | 1.0 465 | 1.0 466 | 1.0 467 | 1.0 468 | 1.0 469 | 1.0 470 | 1.0 471 | 1.0 472 | 1.0 473 | 1.0 474 | 1.0 475 | 1.0 476 | 1.0 477 | 1.0 478 | 1.0 479 | 1.0 480 | 1.0 481 | 1.0 482 | 1.0 483 | 1.0 484 | 1.0 485 | 1.0 486 | 1.0 487 | 1.0 488 | 1.0 489 | 1.0 490 | 1.0 491 | 1.0 492 | 1.0 493 | 1.0 494 | 1.0 495 | 1.0 496 | 1.0 497 | 1.0 498 | 1.0 499 | 1.0 500 | 1.0 501 | -------------------------------------------------------------------------------- /LightGBM_demo/binary_classification/predict.conf: -------------------------------------------------------------------------------- 1 | 2 | task = predict 3 | 4 | data = binary.test 5 | 6 | input_model= LightGBM_model.txt 7 | -------------------------------------------------------------------------------- /LightGBM_demo/binary_classification/train.conf: -------------------------------------------------------------------------------- 1 | # task type, support train and predict 2 | task = train 3 | 4 | # boosting type, support gbdt for now, alias: boosting, boost 5 | boosting_type = gbdt 6 | 7 | # application type, support following application 8 | # regression , regression task 9 | # binary , binary classification task 10 | # lambdarank , lambdarank task 11 | # alias: application, app 12 | objective = binary 13 | 14 | # eval metrics, support multi metric, delimite by ',' , support following metrics 15 | # l1 16 | # l2 , default metric for regression 17 | # ndcg , default metric for lambdarank 18 | # auc 19 | # binary_logloss , default metric for binary 20 | # binary_error 21 | metric = binary_logloss,auc 22 | 23 | # frequence for metric output 24 | metric_freq = 1 25 | 26 | # true if need output metric for 
training data, alias: tranining_metric, train_metric 27 | is_training_metric = true 28 | 29 | # number of bins for feature bucket, 255 is a recommend setting, it can save memories, and also has good accuracy. 30 | max_bin = 255 31 | 32 | # training data 33 | # if exsting weight file, should name to "binary.train.weight" 34 | # alias: train_data, train 35 | data = binary.train 36 | 37 | # validation data, support multi validation data, separated by ',' 38 | # if exsting weight file, should name to "binary.test.weight" 39 | # alias: valid, test, test_data, 40 | valid_data = binary.test 41 | 42 | # number of trees(iterations), alias: num_tree, num_iteration, num_iterations, num_round, num_rounds 43 | num_trees = 100 44 | 45 | # shrinkage rate , alias: shrinkage_rate 46 | learning_rate = 0.1 47 | 48 | # number of leaves for one tree, alias: num_leaf 49 | num_leaves = 63 50 | 51 | # type of tree learner, support following types: 52 | # serial , single machine version 53 | # feature , use feature parallel to train 54 | # data , use data parallel to train 55 | # voting , use voting based parallel to train 56 | # alias: tree 57 | tree_learner = serial 58 | 59 | # number of threads for multi-threading. One thread will use one CPU, defalut is setted to #cpu. 60 | # num_threads = 8 61 | 62 | # feature sub-sample, will random select 80% feature to train on each iteration 63 | # alias: sub_feature 64 | feature_fraction = 0.8 65 | 66 | # Support bagging (data sub-sample), will perform bagging every 5 iterations 67 | bagging_freq = 5 68 | 69 | # Bagging farction, will random select 80% data on bagging 70 | # alias: sub_row 71 | bagging_fraction = 0.8 72 | 73 | # minimal number data for one leaf, use this to deal with over-fit 74 | # alias : min_data_per_leaf, min_data 75 | min_data_in_leaf = 50 76 | 77 | # minimal sum hessians for one leaf, use this to deal with over-fit 78 | min_sum_hessian_in_leaf = 5.0 79 | 80 | # save memory and faster speed for sparse feature, alias: is_sparse 81 | is_enable_sparse = true 82 | 83 | # when data is bigger than memory size, set this to true. otherwise set false will have faster speed 84 | # alias: two_round_loading, two_round 85 | use_two_round_loading = false 86 | 87 | # true if need to save data to binary file and application will auto load data from binary file next time 88 | # alias: is_save_binary, save_binary 89 | is_save_binary_file = false 90 | 91 | # output model file 92 | output_model = LightGBM_model.txt 93 | 94 | # support continuous train from trained gbdt model 95 | # input_model= trained_model.txt 96 | 97 | # output prediction file for predict task 98 | # output_result= prediction.txt 99 | 100 | # support continuous train from initial score file 101 | # input_init_score= init_score.txt 102 | 103 | 104 | # number of machines in parallel training, alias: num_machine 105 | num_machines = 1 106 | 107 | # local listening port in parallel training, alias: local_port 108 | local_listen_port = 12400 109 | 110 | # machines list file for parallel training, alias: mlist 111 | machine_list_file = mlist.txt 112 | -------------------------------------------------------------------------------- /LightGBM_demo/lambdarank/README.md: -------------------------------------------------------------------------------- 1 | LambdaRank Example 2 | ================== 3 | 4 | Here is an example for LightGBM to run lambdarank task. 
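The query boundaries for this task come from `rank.train.query` and `rank.test.query`, which list one group (query) size per line. For reference only, here is a hedged Python sketch of the same lambdarank setup — it assumes the Python package and scikit-learn are installed and that `rank.train` / `rank.test` are LibSVM-format files; the CLI steps below do not need it.

```
# Illustrative sketch of the lambdarank setup described by train.conf in this folder.
import lightgbm as lgb
import numpy as np
from sklearn.datasets import load_svmlight_files

# Assumes rank.train / rank.test are LibSVM-format files sharing one feature space.
X_train, y_train, X_test, y_test = load_svmlight_files(['rank.train', 'rank.test'])

# One query (group) size per line, matching rank.train.query / rank.test.query.
group_train = np.loadtxt('rank.train.query', dtype=int)
group_test = np.loadtxt('rank.test.query', dtype=int)

params = {
    'objective': 'lambdarank',
    'metric': 'ndcg',
    'ndcg_eval_at': [1, 3, 5],
    'num_leaves': 31,
    'learning_rate': 0.1,
}

lgb_train = lgb.Dataset(X_train, y_train, group=group_train)
lgb_eval = lgb.Dataset(X_test, y_test, group=group_test, reference=lgb_train)
gbm = lgb.train(params, lgb_train, num_boost_round=100, valid_sets=[lgb_eval])
```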
5 | 6 | ***You should copy executable file to this folder first.*** 7 | 8 | Training 9 | -------- 10 | 11 | For Windows, by running following command in this folder: 12 | 13 | ``` 14 | lightgbm.exe config=train.conf 15 | ``` 16 | 17 | For Linux, by running following command in this folder: 18 | 19 | ``` 20 | ./lightgbm config=train.conf 21 | ``` 22 | 23 | Prediction 24 | ---------- 25 | 26 | You should finish training first. 27 | 28 | For Windows, by running following command in this folder: 29 | 30 | ``` 31 | lightgbm.exe config=predict.conf 32 | ``` 33 | 34 | For Linux, by running following command in this folder: 35 | 36 | ``` 37 | ./lightgbm config=predict.conf 38 | ``` 39 | -------------------------------------------------------------------------------- /LightGBM_demo/lambdarank/predict.conf: -------------------------------------------------------------------------------- 1 | 2 | task = predict 3 | 4 | data = rank.test 5 | 6 | input_model= LightGBM_model.txt 7 | -------------------------------------------------------------------------------- /LightGBM_demo/lambdarank/rank.test.query: -------------------------------------------------------------------------------- 1 | 12 2 | 19 3 | 18 4 | 10 5 | 15 6 | 15 7 | 22 8 | 23 9 | 18 10 | 16 11 | 16 12 | 11 13 | 6 14 | 13 15 | 17 16 | 21 17 | 20 18 | 16 19 | 13 20 | 16 21 | 21 22 | 15 23 | 10 24 | 19 25 | 10 26 | 13 27 | 18 28 | 17 29 | 23 30 | 24 31 | 16 32 | 13 33 | 17 34 | 24 35 | 17 36 | 10 37 | 17 38 | 15 39 | 18 40 | 16 41 | 9 42 | 9 43 | 21 44 | 14 45 | 13 46 | 13 47 | 13 48 | 10 49 | 10 50 | 6 51 | -------------------------------------------------------------------------------- /LightGBM_demo/lambdarank/rank.train.query: -------------------------------------------------------------------------------- 1 | 1 2 | 13 3 | 5 4 | 8 5 | 19 6 | 12 7 | 18 8 | 5 9 | 14 10 | 13 11 | 8 12 | 9 13 | 16 14 | 11 15 | 21 16 | 14 17 | 21 18 | 9 19 | 14 20 | 11 21 | 20 22 | 18 23 | 13 24 | 20 25 | 22 26 | 22 27 | 13 28 | 17 29 | 10 30 | 13 31 | 12 32 | 13 33 | 13 34 | 23 35 | 18 36 | 13 37 | 20 38 | 12 39 | 22 40 | 14 41 | 13 42 | 23 43 | 13 44 | 14 45 | 14 46 | 5 47 | 13 48 | 15 49 | 14 50 | 14 51 | 16 52 | 16 53 | 15 54 | 21 55 | 22 56 | 10 57 | 22 58 | 18 59 | 25 60 | 16 61 | 12 62 | 12 63 | 15 64 | 15 65 | 25 66 | 13 67 | 9 68 | 12 69 | 8 70 | 16 71 | 25 72 | 19 73 | 24 74 | 12 75 | 16 76 | 10 77 | 16 78 | 9 79 | 17 80 | 15 81 | 7 82 | 9 83 | 15 84 | 14 85 | 16 86 | 17 87 | 8 88 | 17 89 | 12 90 | 18 91 | 23 92 | 10 93 | 12 94 | 12 95 | 4 96 | 14 97 | 12 98 | 15 99 | 27 100 | 16 101 | 20 102 | 13 103 | 19 104 | 13 105 | 17 106 | 17 107 | 16 108 | 12 109 | 15 110 | 14 111 | 14 112 | 19 113 | 12 114 | 23 115 | 18 116 | 16 117 | 9 118 | 23 119 | 11 120 | 15 121 | 8 122 | 10 123 | 10 124 | 16 125 | 11 126 | 15 127 | 22 128 | 16 129 | 17 130 | 23 131 | 16 132 | 22 133 | 17 134 | 14 135 | 12 136 | 14 137 | 20 138 | 15 139 | 17 140 | 15 141 | 15 142 | 22 143 | 9 144 | 21 145 | 9 146 | 17 147 | 16 148 | 15 149 | 13 150 | 13 151 | 15 152 | 14 153 | 18 154 | 21 155 | 14 156 | 17 157 | 15 158 | 14 159 | 16 160 | 12 161 | 17 162 | 19 163 | 16 164 | 11 165 | 18 166 | 11 167 | 13 168 | 14 169 | 9 170 | 16 171 | 15 172 | 16 173 | 25 174 | 9 175 | 13 176 | 22 177 | 16 178 | 18 179 | 20 180 | 14 181 | 11 182 | 9 183 | 16 184 | 19 185 | 19 186 | 11 187 | 11 188 | 13 189 | 14 190 | 14 191 | 13 192 | 16 193 | 6 194 | 21 195 | 16 196 | 12 197 | 16 198 | 11 199 | 24 200 | 12 201 | 10 202 | -------------------------------------------------------------------------------- 
/LightGBM_demo/lambdarank/train.conf: -------------------------------------------------------------------------------- 1 | # task type, support train and predict 2 | task = train 3 | 4 | # boosting type, support gbdt for now, alias: boosting, boost 5 | boosting_type = gbdt 6 | 7 | # application type, support following application 8 | # regression , regression task 9 | # binary , binary classification task 10 | # lambdarank , lambdarank task 11 | # alias: application, app 12 | objective = lambdarank 13 | 14 | # eval metrics, support multi metric, delimite by ',' , support following metrics 15 | # l1 16 | # l2 , default metric for regression 17 | # ndcg , default metric for lambdarank 18 | # auc 19 | # binary_logloss , default metric for binary 20 | # binary_error 21 | metric = ndcg 22 | 23 | # evaluation position for ndcg metric, alias : ndcg_at 24 | ndcg_eval_at = 1,3,5 25 | 26 | # frequence for metric output 27 | metric_freq = 1 28 | 29 | # true if need output metric for training data, alias: tranining_metric, train_metric 30 | is_training_metric = true 31 | 32 | # number of bins for feature bucket, 255 is a recommend setting, it can save memories, and also has good accuracy. 33 | max_bin = 255 34 | 35 | # training data 36 | # if exsting weight file, should name to "rank.train.weight" 37 | # if exsting query file, should name to "rank.train.query" 38 | # alias: train_data, train 39 | data = rank.train 40 | 41 | # validation data, support multi validation data, separated by ',' 42 | # if exsting weight file, should name to "rank.test.weight" 43 | # if exsting query file, should name to "rank.test.query" 44 | # alias: valid, test, test_data, 45 | valid_data = rank.test 46 | 47 | # number of trees(iterations), alias: num_tree, num_iteration, num_iterations, num_round, num_rounds 48 | num_trees = 100 49 | 50 | # shrinkage rate , alias: shrinkage_rate 51 | learning_rate = 0.1 52 | 53 | # number of leaves for one tree, alias: num_leaf 54 | num_leaves = 31 55 | 56 | # type of tree learner, support following types: 57 | # serial , single machine version 58 | # feature , use feature parallel to train 59 | # data , use data parallel to train 60 | # voting , use voting based parallel to train 61 | # alias: tree 62 | tree_learner = serial 63 | 64 | # number of threads for multi-threading. One thread will use one CPU, defalut is setted to #cpu. 65 | # num_threads = 8 66 | 67 | # feature sub-sample, will random select 80% feature to train on each iteration 68 | # alias: sub_feature 69 | feature_fraction = 1.0 70 | 71 | # Support bagging (data sub-sample), will perform bagging every 5 iterations 72 | bagging_freq = 1 73 | 74 | # Bagging farction, will random select 80% data on bagging 75 | # alias: sub_row 76 | bagging_fraction = 0.9 77 | 78 | # minimal number data for one leaf, use this to deal with over-fit 79 | # alias : min_data_per_leaf, min_data 80 | min_data_in_leaf = 50 81 | 82 | # minimal sum hessians for one leaf, use this to deal with over-fit 83 | min_sum_hessian_in_leaf = 5.0 84 | 85 | # save memory and faster speed for sparse feature, alias: is_sparse 86 | is_enable_sparse = true 87 | 88 | # when data is bigger than memory size, set this to true. 
otherwise set false will have faster speed 89 | # alias: two_round_loading, two_round 90 | use_two_round_loading = false 91 | 92 | # true if need to save data to binary file and application will auto load data from binary file next time 93 | # alias: is_save_binary, save_binary 94 | is_save_binary_file = false 95 | 96 | # output model file 97 | output_model = LightGBM_model.txt 98 | 99 | # support continuous train from trained gbdt model 100 | # input_model= trained_model.txt 101 | 102 | # output prediction file for predict task 103 | # output_result= prediction.txt 104 | 105 | # support continuous train from initial score file 106 | # input_init_score= init_score.txt 107 | 108 | 109 | # number of machines in parallel training, alias: num_machine 110 | num_machines = 1 111 | 112 | # local listening port in parallel training, alias: local_port 113 | local_listen_port = 12400 114 | 115 | # machines list file for parallel training, alias: mlist 116 | machine_list_file = mlist.txt 117 | -------------------------------------------------------------------------------- /LightGBM_demo/multiclass_classification/README.md: -------------------------------------------------------------------------------- 1 | Multiclass Classification Example 2 | ================================= 3 | 4 | Here is an example for LightGBM to run multiclass classification task. 5 | 6 | ***You should copy executable file to this folder first.*** 7 | 8 | Training 9 | -------- 10 | 11 | For Windows, by running following command in this folder: 12 | 13 | ``` 14 | lightgbm.exe config=train.conf 15 | ``` 16 | 17 | For Linux, by running following command in this folder: 18 | 19 | ``` 20 | ./lightgbm config=train.conf 21 | ``` 22 | 23 | Prediction 24 | ---------- 25 | 26 | You should finish training first. 
27 | 28 | For Windows, by running following command in this folder: 29 | 30 | ``` 31 | lightgbm.exe config=predict.conf 32 | ``` 33 | 34 | For Linux, by running following command in this folder: 35 | 36 | ``` 37 | ./lightgbm config=predict.conf 38 | ``` 39 | -------------------------------------------------------------------------------- /LightGBM_demo/multiclass_classification/predict.conf: -------------------------------------------------------------------------------- 1 | task = predict 2 | 3 | data = multiclass.test 4 | 5 | input_model= LightGBM_model.txt 6 | -------------------------------------------------------------------------------- /LightGBM_demo/multiclass_classification/train.conf: -------------------------------------------------------------------------------- 1 | # task type, support train and predict 2 | task = train 3 | 4 | # boosting type, support gbdt for now, alias: boosting, boost 5 | boosting_type = gbdt 6 | 7 | # application type, support following application 8 | # regression , regression task 9 | # binary , binary classification task 10 | # lambdarank , lambdarank task 11 | # multiclass 12 | # alias: application, app 13 | objective = multiclass 14 | 15 | # eval metrics, support multi metric, delimite by ',' , support following metrics 16 | # l1 17 | # l2 , default metric for regression 18 | # ndcg , default metric for lambdarank 19 | # auc 20 | # binary_logloss , default metric for binary 21 | # binary_error 22 | # multi_logloss 23 | # multi_error 24 | metric = multi_logloss 25 | 26 | # number of class, for multiclass classification 27 | num_class = 5 28 | 29 | # frequence for metric output 30 | metric_freq = 1 31 | 32 | # true if need output metric for training data, alias: tranining_metric, train_metric 33 | is_training_metric = true 34 | 35 | # number of bins for feature bucket, 255 is a recommend setting, it can save memories, and also has good accuracy. 36 | max_bin = 255 37 | 38 | # training data 39 | # if exsting weight file, should name to "regression.train.weight" 40 | # alias: train_data, train 41 | data = multiclass.train 42 | 43 | # valid data 44 | valid_data = multiclass.test 45 | 46 | # round for early stopping 47 | early_stopping = 10 48 | 49 | # number of trees(iterations), alias: num_tree, num_iteration, num_iterations, num_round, num_rounds 50 | num_trees = 100 51 | 52 | # shrinkage rate , alias: shrinkage_rate 53 | learning_rate = 0.05 54 | 55 | # number of leaves for one tree, alias: num_leaf 56 | num_leaves = 31 57 | -------------------------------------------------------------------------------- /LightGBM_demo/parallel_learning/README.md: -------------------------------------------------------------------------------- 1 | Parallel Learning Example 2 | ========================= 3 | 4 | Here is an example for LightGBM to perform parallel learning for 2 machines. 5 | 6 | 1. Edit mlist.txt, write the ip of these 2 machines that you want to run application on. 7 | 8 | ``` 9 | machine1_ip 12400 10 | machine2_ip 12400 11 | ``` 12 | 13 | 2. Copy this folder and executable file to these 2 machines that you want to run application on. 14 | 15 | 3. Run command in this folder on both 2 machines: 16 | 17 | For Windows: ```lightgbm.exe config=train.conf``` 18 | 19 | For Linux: ```./lightgbm config=train.conf``` 20 | 21 | This parallel learning example is based on socket. LightGBM also support parallel learning based on mpi. 
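Compared with the single-machine `binary_classification` example, the `train.conf` in this folder mainly differs in its parallel-related settings, roughly the lines below (shown here for quick reference; see the full file for the remaining options):

```
# parallel-related settings (other options match the binary_classification example)
tree_learner = feature
num_machines = 2
local_listen_port = 12400
machine_list_file = mlist.txt
```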
22 | 23 | For more details about the usage of parallel learning, please refer to [this](https://github.com/Microsoft/LightGBM/blob/master/docs/Parallel-Learning-Guide.rst). 24 | -------------------------------------------------------------------------------- /LightGBM_demo/parallel_learning/predict.conf: -------------------------------------------------------------------------------- 1 | 2 | task = predict 3 | 4 | data = binary.test 5 | 6 | input_model= LightGBM_model.txt 7 | 8 | -------------------------------------------------------------------------------- /LightGBM_demo/parallel_learning/train.conf: -------------------------------------------------------------------------------- 1 | # task type, support train and predict 2 | task = train 3 | 4 | # boosting type, support gbdt for now, alias: boosting, boost 5 | boosting_type = gbdt 6 | 7 | # application type, support following application 8 | # regression , regression task 9 | # binary , binary classification task 10 | # lambdarank , lambdarank task 11 | # alias: application, app 12 | objective = binary 13 | 14 | # eval metrics, support multi metric, delimite by ',' , support following metrics 15 | # l1 16 | # l2 , default metric for regression 17 | # ndcg , default metric for lambdarank 18 | # auc 19 | # binary_logloss , default metric for binary 20 | # binary_error 21 | metric = binary_logloss,auc 22 | 23 | # frequence for metric output 24 | metric_freq = 1 25 | 26 | # true if need output metric for training data, alias: tranining_metric, train_metric 27 | is_training_metric = true 28 | 29 | # number of bins for feature bucket, 255 is a recommend setting, it can save memories, and also has good accuracy. 30 | max_bin = 255 31 | 32 | # training data 33 | # if exsting weight file, should name to "binary.train.weight" 34 | # alias: train_data, train 35 | data = binary.train 36 | 37 | # validation data, support multi validation data, separated by ',' 38 | # if exsting weight file, should name to "binary.test.weight" 39 | # alias: valid, test, test_data, 40 | valid_data = binary.test 41 | 42 | # number of trees(iterations), alias: num_tree, num_iteration, num_iterations, num_round, num_rounds 43 | num_trees = 100 44 | 45 | # shrinkage rate , alias: shrinkage_rate 46 | learning_rate = 0.1 47 | 48 | # number of leaves for one tree, alias: num_leaf 49 | num_leaves = 63 50 | 51 | # type of tree learner, support following types: 52 | # serial , single machine version 53 | # feature , use feature parallel to train 54 | # data , use data parallel to train 55 | # voting , use voting based parallel to train 56 | # alias: tree 57 | tree_learner = feature 58 | 59 | # number of threads for multi-threading. One thread will use one CPU, defalut is setted to #cpu. 
60 | # num_threads = 8 61 | 62 | # feature sub-sample, will random select 80% feature to train on each iteration 63 | # alias: sub_feature 64 | feature_fraction = 0.8 65 | 66 | # Support bagging (data sub-sample), will perform bagging every 5 iterations 67 | bagging_freq = 5 68 | 69 | # Bagging farction, will random select 80% data on bagging 70 | # alias: sub_row 71 | bagging_fraction = 0.8 72 | 73 | # minimal number data for one leaf, use this to deal with over-fit 74 | # alias : min_data_per_leaf, min_data 75 | min_data_in_leaf = 50 76 | 77 | # minimal sum hessians for one leaf, use this to deal with over-fit 78 | min_sum_hessian_in_leaf = 5.0 79 | 80 | # save memory and faster speed for sparse feature, alias: is_sparse 81 | is_enable_sparse = true 82 | 83 | # when data is bigger than memory size, set this to true. otherwise set false will have faster speed 84 | # alias: two_round_loading, two_round 85 | use_two_round_loading = false 86 | 87 | # true if need to save data to binary file and application will auto load data from binary file next time 88 | # alias: is_save_binary, save_binary 89 | is_save_binary_file = false 90 | 91 | # output model file 92 | output_model = LightGBM_model.txt 93 | 94 | # support continuous train from trained gbdt model 95 | # input_model= trained_model.txt 96 | 97 | # output prediction file for predict task 98 | # output_result= prediction.txt 99 | 100 | # support continuous train from initial score file 101 | # input_init_score= init_score.txt 102 | 103 | 104 | # number of machines in parallel training, alias: num_machine 105 | num_machines = 2 106 | 107 | # local listening port in parallel training, alias: local_port 108 | local_listen_port = 12400 109 | 110 | # machines list file for parallel training, alias: mlist 111 | machine_list_file = mlist.txt 112 | -------------------------------------------------------------------------------- /LightGBM_demo/python-guide/README.md: -------------------------------------------------------------------------------- 1 | Python Package Examples 2 | ======================= 3 | 4 | Here is an example for LightGBM to use Python-package. 5 | 6 | You should install LightGBM [Python-package](https://github.com/Microsoft/LightGBM/tree/master/python-package) first. 7 | 8 | You also need scikit-learn, pandas and matplotlib (only for plot example) to run the examples, but they are not required for the package itself. 
You can install them with pip: 9 | 10 | ``` 11 | pip install scikit-learn pandas matplotlib -U 12 | ``` 13 | 14 | Now you can run examples in this folder, for example: 15 | 16 | ``` 17 | python simple_example.py 18 | ``` 19 | 20 | Examples include: 21 | 22 | - [simple_example.py](https://github.com/Microsoft/LightGBM/blob/master/examples/python-guide/simple_example.py) 23 | - Construct Dataset 24 | - Basic train and predict 25 | - Eval during training 26 | - Early stopping 27 | - Save model to file 28 | - [sklearn_example.py](https://github.com/Microsoft/LightGBM/blob/master/examples/python-guide/sklearn_example.py) 29 | - Basic train and predict with sklearn interface 30 | - Feature importances with sklearn interface 31 | - [advanced_example.py](https://github.com/Microsoft/LightGBM/blob/master/examples/python-guide/advanced_example.py) 32 | - Set feature names 33 | - Directly use categorical features without one-hot encoding 34 | - Dump model to json format 35 | - Get feature importances 36 | - Get feature names 37 | - Load model to predict 38 | - Dump and load model with pickle 39 | - Load model file to continue training 40 | - Change learning rates during training 41 | - Self-defined objective function 42 | - Self-defined eval metric 43 | - Callback function 44 | -------------------------------------------------------------------------------- /LightGBM_demo/python-guide/advanced_example.py: -------------------------------------------------------------------------------- 1 | # coding: utf-8 2 | # pylint: disable = invalid-name, C0111 3 | import json 4 | import lightgbm as lgb 5 | import pandas as pd 6 | import numpy as np 7 | from sklearn.metrics import mean_squared_error 8 | 9 | try: 10 | import cPickle as pickle 11 | except: 12 | import pickle 13 | 14 | # load or create your dataset 15 | print('Load data...') 16 | df_train = pd.read_csv('../binary_classification/binary.train', header=None, sep='\t') 17 | df_test = pd.read_csv('../binary_classification/binary.test', header=None, sep='\t') 18 | W_train = pd.read_csv('../binary_classification/binary.train.weight', header=None)[0] 19 | W_test = pd.read_csv('../binary_classification/binary.test.weight', header=None)[0] 20 | 21 | y_train = df_train[0].values 22 | y_test = df_test[0].values 23 | X_train = df_train.drop(0, axis=1).values 24 | X_test = df_test.drop(0, axis=1).values 25 | 26 | num_train, num_feature = X_train.shape 27 | 28 | # create dataset for lightgbm 29 | # if you want to re-use data, remember to set free_raw_data=False 30 | lgb_train = lgb.Dataset(X_train, y_train, 31 | weight=W_train, free_raw_data=False) 32 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train, 33 | weight=W_test, free_raw_data=False) 34 | 35 | # specify your configurations as a dict 36 | params = { 37 | 'boosting_type': 'gbdt', 38 | 'objective': 'binary', 39 | 'metric': 'binary_logloss', 40 | 'num_leaves': 31, 41 | 'learning_rate': 0.05, 42 | 'feature_fraction': 0.9, 43 | 'bagging_fraction': 0.8, 44 | 'bagging_freq': 5, 45 | 'verbose': 0 46 | } 47 | 48 | # generate a feature name 49 | feature_name = ['feature_' + str(col) for col in range(num_feature)] 50 | 51 | print('Start training...') 52 | # feature_name and categorical_feature 53 | gbm = lgb.train(params, 54 | lgb_train, 55 | num_boost_round=10, 56 | valid_sets=lgb_train, # eval training data 57 | feature_name=feature_name, 58 | categorical_feature=[21]) 59 | 60 | # check feature name 61 | print('Finish first 10 rounds...') 62 | print('7th feature name is:', repr(lgb_train.feature_name[6])) 
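# Note: in the lgb.train() call above, feature_name attaches readable names to the
# columns, and categorical_feature=[21] tells LightGBM to treat the feature at column
# index 21 as categorical directly, without one-hot encoding (see this folder's README).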
63 | 64 | # save model to file 65 | gbm.save_model('model.txt') 66 | 67 | # dump model to json (and save to file) 68 | print('Dump model to JSON...') 69 | model_json = gbm.dump_model() 70 | 71 | with open('model.json', 'w+') as f: 72 | json.dump(model_json, f, indent=4) 73 | 74 | # feature names 75 | print('Feature names:', gbm.feature_name()) 76 | 77 | # feature importances 78 | print('Feature importances:', list(gbm.feature_importance())) 79 | 80 | # load model to predict 81 | print('Load model to predict') 82 | bst = lgb.Booster(model_file='model.txt') 83 | # can only predict with the best iteration (or the saving iteration) 84 | y_pred = bst.predict(X_test) 85 | # eval with loaded model 86 | print('The rmse of loaded model\'s prediction is:', mean_squared_error(y_test, y_pred) ** 0.5) 87 | 88 | # dump model with pickle 89 | with open('model.pkl', 'wb') as fout: 90 | pickle.dump(gbm, fout) 91 | # load model with pickle to predict 92 | with open('model.pkl', 'rb') as fin: 93 | pkl_bst = pickle.load(fin) 94 | # can predict with any iteration when loaded in pickle way 95 | y_pred = pkl_bst.predict(X_test, num_iteration=7) 96 | # eval with loaded model 97 | print('The rmse of pickled model\'s prediction is:', mean_squared_error(y_test, y_pred) ** 0.5) 98 | 99 | # continue training 100 | # init_model accepts: 101 | # 1. model file name 102 | # 2. Booster() 103 | gbm = lgb.train(params, 104 | lgb_train, 105 | num_boost_round=10, 106 | init_model='model.txt', 107 | valid_sets=lgb_eval) 108 | 109 | print('Finish 10 - 20 rounds with model file...') 110 | 111 | # decay learning rates 112 | # learning_rates accepts: 113 | # 1. list/tuple with length = num_boost_round 114 | # 2. function(curr_iter) 115 | gbm = lgb.train(params, 116 | lgb_train, 117 | num_boost_round=10, 118 | init_model=gbm, 119 | learning_rates=lambda iter: 0.05 * (0.99 ** iter), 120 | valid_sets=lgb_eval) 121 | 122 | print('Finish 20 - 30 rounds with decay learning rates...') 123 | 124 | # change other parameters during training 125 | gbm = lgb.train(params, 126 | lgb_train, 127 | num_boost_round=10, 128 | init_model=gbm, 129 | valid_sets=lgb_eval, 130 | callbacks=[lgb.reset_parameter(bagging_fraction=[0.7] * 5 + [0.6] * 5)]) 131 | 132 | print('Finish 30 - 40 rounds with changing bagging_fraction...') 133 | 134 | 135 | # self-defined objective function 136 | # f(preds: array, train_data: Dataset) -> grad: array, hess: array 137 | # log likelihood loss 138 | def loglikelood(preds, train_data): 139 | labels = train_data.get_label() 140 | preds = 1. / (1. + np.exp(-preds)) 141 | grad = preds - labels 142 | hess = preds * (1. 
- preds) 143 | return grad, hess 144 | 145 | 146 | # self-defined eval metric 147 | # f(preds: array, train_data: Dataset) -> name: string, value: array, is_higher_better: bool 148 | # binary error 149 | def binary_error(preds, train_data): 150 | labels = train_data.get_label() 151 | return 'error', np.mean(labels != (preds > 0.5)), False 152 | 153 | 154 | gbm = lgb.train(params, 155 | lgb_train, 156 | num_boost_round=10, 157 | init_model=gbm, 158 | fobj=loglikelood, 159 | feval=binary_error, 160 | valid_sets=lgb_eval) 161 | 162 | print('Finish 40 - 50 rounds with self-defined objective function and eval metric...') 163 | 164 | print('Start a new training job...') 165 | 166 | 167 | # callback 168 | def reset_metrics(): 169 | def callback(env): 170 | lgb_eval_new = lgb.Dataset(X_test, y_test, reference=lgb_train) 171 | if env.iteration - env.begin_iteration == 5: 172 | print('Add a new valid dataset at iteration 5...') 173 | env.model.add_valid(lgb_eval_new, 'new valid') 174 | callback.before_iteration = True 175 | callback.order = 0 176 | return callback 177 | 178 | 179 | gbm = lgb.train(params, 180 | lgb_train, 181 | num_boost_round=10, 182 | valid_sets=lgb_train, 183 | callbacks=[reset_metrics()]) 184 | 185 | print('Finish first 10 rounds with callback function...') 186 | -------------------------------------------------------------------------------- /LightGBM_demo/python-guide/model.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/fuqiuai/learning_notes/eeecad3a5beff499add3e267111c3e38bcf2b638/LightGBM_demo/python-guide/model.pkl -------------------------------------------------------------------------------- /LightGBM_demo/python-guide/plot_example.py: -------------------------------------------------------------------------------- 1 | # coding: utf-8 2 | # pylint: disable = invalid-name, C0111 3 | import lightgbm as lgb 4 | import pandas as pd 5 | 6 | try: 7 | import matplotlib.pyplot as plt 8 | except ImportError: 9 | raise ImportError('You need to install matplotlib for plot_example.py.') 10 | 11 | # load or create your dataset 12 | print('Load data...') 13 | df_train = pd.read_csv('../regression/regression.train', header=None, sep='\t') 14 | df_test = pd.read_csv('../regression/regression.test', header=None, sep='\t') 15 | 16 | y_train = df_train[0].values 17 | y_test = df_test[0].values 18 | X_train = df_train.drop(0, axis=1).values 19 | X_test = df_test.drop(0, axis=1).values 20 | 21 | # create dataset for lightgbm 22 | lgb_train = lgb.Dataset(X_train, y_train) 23 | lgb_test = lgb.Dataset(X_test, y_test, reference=lgb_train) 24 | 25 | # specify your configurations as a dict 26 | params = { 27 | 'num_leaves': 5, 28 | 'metric': ('l1', 'l2'), 29 | 'verbose': 0 30 | } 31 | 32 | evals_result = {} # to record eval results for plotting 33 | 34 | print('Start training...') 35 | # train 36 | gbm = lgb.train(params, 37 | lgb_train, 38 | num_boost_round=100, 39 | valid_sets=[lgb_train, lgb_test], 40 | feature_name=['f' + str(i + 1) for i in range(28)], 41 | categorical_feature=[21], 42 | evals_result=evals_result, 43 | verbose_eval=10) 44 | 45 | print('Plot metrics during training...') 46 | ax = lgb.plot_metric(evals_result, metric='l1') 47 | plt.show() 48 | 49 | print('Plot feature importances...') 50 | ax = lgb.plot_importance(gbm, max_num_features=10) 51 | plt.show() 52 | 53 | print('Plot 84th tree...') # one tree use categorical feature to split 54 | ax = lgb.plot_tree(gbm, tree_index=83, figsize=(20, 8), 
show_info=['split_gain']) 55 | plt.show() 56 | 57 | print('Plot 84th tree with graphviz...') 58 | graph = lgb.create_tree_digraph(gbm, tree_index=83, name='Tree84') 59 | graph.render(view=True) 60 | -------------------------------------------------------------------------------- /LightGBM_demo/python-guide/simple_example.py: -------------------------------------------------------------------------------- 1 | # coding: utf-8 2 | # pylint: disable = invalid-name, C0111 3 | import json 4 | import lightgbm as lgb 5 | import pandas as pd 6 | from sklearn.metrics import mean_squared_error 7 | 8 | 9 | # load or create your dataset 10 | print('Load data...') 11 | df_train = pd.read_csv('../regression/regression.train', header=None, sep='\t') 12 | df_test = pd.read_csv('../regression/regression.test', header=None, sep='\t') 13 | 14 | y_train = df_train[0].values 15 | y_test = df_test[0].values 16 | X_train = df_train.drop(0, axis=1).values 17 | X_test = df_test.drop(0, axis=1).values 18 | 19 | # create dataset for lightgbm 20 | lgb_train = lgb.Dataset(X_train, y_train) 21 | lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train) 22 | 23 | # specify your configurations as a dict 24 | params = { 25 | 'task': 'train', 26 | 'boosting_type': 'gbdt', 27 | 'objective': 'regression', 28 | 'metric': {'l2', 'auc'}, 29 | 'num_leaves': 31, 30 | 'learning_rate': 0.05, 31 | 'feature_fraction': 0.9, 32 | 'bagging_fraction': 0.8, 33 | 'bagging_freq': 5, 34 | 'verbose': 0 35 | } 36 | 37 | print('Start training...') 38 | # train 39 | gbm = lgb.train(params, 40 | lgb_train, 41 | num_boost_round=20, 42 | valid_sets=lgb_eval, 43 | early_stopping_rounds=5) 44 | 45 | print('Save model...') 46 | # save model to file 47 | gbm.save_model('model.txt') 48 | 49 | print('Start predicting...') 50 | # predict 51 | y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration) 52 | # eval 53 | print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5) 54 | -------------------------------------------------------------------------------- /LightGBM_demo/python-guide/sklearn_example.py: -------------------------------------------------------------------------------- 1 | # coding: utf-8 2 | # pylint: disable = invalid-name, C0111 3 | import lightgbm as lgb 4 | import pandas as pd 5 | from sklearn.metrics import mean_squared_error 6 | from sklearn.model_selection import GridSearchCV 7 | 8 | # load or create your dataset 9 | print('Load data...') 10 | df_train = pd.read_csv('../regression/regression.train', header=None, sep='\t') 11 | df_test = pd.read_csv('../regression/regression.test', header=None, sep='\t') 12 | 13 | y_train = df_train[0].values 14 | y_test = df_test[0].values 15 | X_train = df_train.drop(0, axis=1).values 16 | X_test = df_test.drop(0, axis=1).values 17 | 18 | print('Start training...') 19 | # train 20 | gbm = lgb.LGBMRegressor(objective='regression', 21 | num_leaves=31, 22 | learning_rate=0.05, 23 | n_estimators=20) 24 | gbm.fit(X_train, y_train, 25 | eval_set=[(X_test, y_test)], 26 | eval_metric='l1', 27 | early_stopping_rounds=5) 28 | 29 | print('Start predicting...') 30 | # predict 31 | y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_) 32 | # eval 33 | print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5) 34 | 35 | # feature importances 36 | print('Feature importances:', list(gbm.feature_importances_)) 37 | 38 | # other scikit-learn modules 39 | estimator = lgb.LGBMRegressor(num_leaves=31) 40 | 41 | param_grid = { 42 | 'learning_rate': [0.01, 0.1, 1], 
43 | 'n_estimators': [20, 40] 44 | } 45 | 46 | gbm = GridSearchCV(estimator, param_grid) 47 | 48 | gbm.fit(X_train, y_train) 49 | 50 | print('Best parameters found by grid search are:', gbm.best_params_) 51 | -------------------------------------------------------------------------------- /LightGBM_demo/regression/README.md: -------------------------------------------------------------------------------- 1 | Regression Example 2 | ================== 3 | 4 | Here is an example for LightGBM to run regression task. 5 | 6 | ***You should copy executable file to this folder first.*** 7 | 8 | Training 9 | -------- 10 | 11 | For Windows, by running following command in this folder: 12 | 13 | ``` 14 | lightgbm.exe config=train.conf 15 | ``` 16 | 17 | For Linux, by running following command in this folder: 18 | 19 | ``` 20 | ./lightgbm config=train.conf 21 | ``` 22 | 23 | Prediction 24 | ---------- 25 | 26 | You should finish training first. 27 | 28 | For Windows, by running following command in this folder: 29 | 30 | ``` 31 | lightgbm.exe config=predict.conf 32 | ``` 33 | 34 | For Linux, by running following command in this folder: 35 | 36 | ``` 37 | ./lightgbm config=predict.conf 38 | ``` 39 | -------------------------------------------------------------------------------- /LightGBM_demo/regression/predict.conf: -------------------------------------------------------------------------------- 1 | 2 | task = predict 3 | 4 | data = regression.test 5 | 6 | input_model= LightGBM_model.txt 7 | -------------------------------------------------------------------------------- /LightGBM_demo/regression/regression.test.init: -------------------------------------------------------------------------------- 1 | 0.039 2 | 0.187 3 | 0.831 4 | 0.767 5 | 0.351 6 | 0.377 7 | 0.534 8 | 0.000 9 | 0.241 10 | 0.208 11 | 0.250 12 | 0.806 13 | 0.280 14 | 0.192 15 | 0.504 16 | 0.866 17 | 0.241 18 | 0.079 19 | 0.356 20 | 0.748 21 | 0.551 22 | 0.817 23 | 0.960 24 | 0.793 25 | 0.604 26 | 0.493 27 | 0.040 28 | 0.984 29 | 0.383 30 | 0.152 31 | 0.667 32 | 0.284 33 | 0.586 34 | 0.587 35 | 0.446 36 | 0.836 37 | 0.265 38 | 0.449 39 | 0.538 40 | 0.664 41 | 0.784 42 | 0.395 43 | 0.646 44 | 0.151 45 | 0.933 46 | 0.383 47 | 0.730 48 | 0.020 49 | 0.205 50 | 0.487 51 | 0.878 52 | 0.527 53 | 0.930 54 | 0.484 55 | 0.490 56 | 0.120 57 | 0.803 58 | 0.247 59 | 0.900 60 | 0.911 61 | 0.943 62 | 0.520 63 | 0.677 64 | 0.779 65 | 0.131 66 | 0.601 67 | 0.034 68 | 0.498 69 | 0.155 70 | 0.183 71 | 0.365 72 | 0.432 73 | 0.623 74 | 0.074 75 | 0.504 76 | 0.183 77 | 0.574 78 | 0.637 79 | 0.557 80 | 0.738 81 | 0.336 82 | 0.765 83 | 0.433 84 | 0.484 85 | 0.648 86 | 0.018 87 | 0.654 88 | 0.619 89 | 0.310 90 | 0.086 91 | 0.091 92 | 0.923 93 | 0.689 94 | 0.127 95 | 0.357 96 | 0.592 97 | 0.836 98 | 0.044 99 | 0.237 100 | 0.890 101 | 0.009 102 | 0.201 103 | 0.959 104 | 0.613 105 | 0.262 106 | 0.067 107 | 0.028 108 | 0.245 109 | 0.881 110 | 0.416 111 | 0.720 112 | 0.918 113 | 0.408 114 | 0.191 115 | 0.517 116 | 0.908 117 | 0.804 118 | 0.066 119 | 0.693 120 | 0.572 121 | 0.907 122 | 0.122 123 | 0.534 124 | 0.879 125 | 0.410 126 | 0.482 127 | 0.070 128 | 0.278 129 | 0.325 130 | 0.945 131 | 0.283 132 | 0.461 133 | 0.671 134 | 0.162 135 | 0.486 136 | 0.739 137 | 0.867 138 | 0.626 139 | 0.669 140 | 0.126 141 | 0.946 142 | 0.133 143 | 0.775 144 | 0.265 145 | 0.934 146 | 0.720 147 | 0.754 148 | 0.219 149 | 0.443 150 | 0.618 151 | 0.770 152 | 0.104 153 | 0.962 154 | 0.890 155 | 0.270 156 | 0.823 157 | 0.518 158 | 0.462 159 | 0.314 160 | 0.581 161 | 0.730 162 | 0.411 
163 | 0.629 164 | 0.699 165 | 0.711 166 | 0.052 167 | 0.860 168 | 0.458 169 | 0.262 170 | 0.242 171 | 0.483 172 | 0.887 173 | 0.378 174 | 0.750 175 | 0.097 176 | 0.476 177 | 0.992 178 | 0.770 179 | 0.211 180 | 0.501 181 | 0.234 182 | 0.410 183 | 0.780 184 | 0.771 185 | 0.228 186 | 0.922 187 | 0.593 188 | 0.380 189 | 0.502 190 | 0.605 191 | 0.560 192 | 0.486 193 | 0.505 194 | 0.176 195 | 0.813 196 | 0.542 197 | 0.131 198 | 0.766 199 | 0.932 200 | 0.947 201 | 0.369 202 | 0.136 203 | 0.518 204 | 0.113 205 | 0.934 206 | 0.184 207 | 0.253 208 | 0.407 209 | 0.383 210 | 0.795 211 | 0.456 212 | 0.171 213 | 0.267 214 | 0.509 215 | 0.147 216 | 0.612 217 | 0.566 218 | 0.715 219 | 0.938 220 | 0.912 221 | 0.946 222 | 0.245 223 | 0.132 224 | 0.302 225 | 0.895 226 | 0.972 227 | 0.859 228 | 0.110 229 | 0.947 230 | 0.423 231 | 0.009 232 | 0.442 233 | 0.046 234 | 0.544 235 | 0.339 236 | 0.473 237 | 0.613 238 | 0.869 239 | 0.662 240 | 0.434 241 | 0.819 242 | 0.906 243 | 0.120 244 | 0.532 245 | 0.285 246 | 0.047 247 | 0.669 248 | 0.863 249 | 0.163 250 | 0.812 251 | 0.853 252 | 0.914 253 | 0.265 254 | 0.904 255 | 0.321 256 | 0.552 257 | 0.051 258 | 0.044 259 | 0.720 260 | 0.444 261 | 0.256 262 | 0.190 263 | 0.670 264 | 0.000 265 | 0.806 266 | 0.079 267 | 0.191 268 | 0.386 269 | 0.485 270 | 0.355 271 | 0.321 272 | 0.964 273 | 0.642 274 | 0.023 275 | 0.430 276 | 0.875 277 | 0.301 278 | 0.095 279 | 0.758 280 | 0.606 281 | 0.570 282 | 0.054 283 | 0.140 284 | 0.623 285 | 0.208 286 | 0.504 287 | 0.545 288 | 0.284 289 | 0.948 290 | 0.842 291 | 0.722 292 | 0.078 293 | 0.106 294 | 0.493 295 | 0.161 296 | 0.978 297 | 0.159 298 | 0.487 299 | 0.364 300 | 0.639 301 | 0.129 302 | 0.430 303 | 0.275 304 | 0.888 305 | 0.041 306 | 0.914 307 | 0.833 308 | 0.298 309 | 0.789 310 | 0.031 311 | 0.967 312 | 0.527 313 | 0.303 314 | 0.363 315 | 0.066 316 | 0.989 317 | 0.039 318 | 0.655 319 | 0.443 320 | 0.949 321 | 0.246 322 | 0.532 323 | 0.482 324 | 0.703 325 | 0.068 326 | 0.194 327 | 0.215 328 | 0.738 329 | 0.189 330 | 0.573 331 | 0.215 332 | 0.862 333 | 0.942 334 | 0.518 335 | 0.352 336 | 0.234 337 | 0.050 338 | 0.269 339 | 0.654 340 | 0.534 341 | 0.944 342 | 0.396 343 | 0.694 344 | 0.489 345 | 0.513 346 | 0.268 347 | 0.455 348 | 0.471 349 | 0.707 350 | 0.941 351 | 0.329 352 | 0.042 353 | 0.496 354 | 0.544 355 | 0.168 356 | 0.760 357 | 0.985 358 | 0.946 359 | 0.197 360 | 0.875 361 | 0.704 362 | 0.454 363 | 0.541 364 | 0.850 365 | 0.480 366 | 0.373 367 | 0.493 368 | 0.579 369 | 0.189 370 | 0.901 371 | 0.674 372 | 0.633 373 | 0.099 374 | 0.604 375 | 0.121 376 | 0.079 377 | 0.527 378 | 0.403 379 | 0.589 380 | 0.089 381 | 0.431 382 | 0.175 383 | 0.987 384 | 0.561 385 | 0.687 386 | 0.325 387 | 0.095 388 | 0.976 389 | 0.286 390 | 0.424 391 | 0.650 392 | 0.025 393 | 0.810 394 | 0.537 395 | 0.278 396 | 0.062 397 | 0.162 398 | 0.895 399 | 0.686 400 | 0.250 401 | 0.066 402 | 0.691 403 | 0.572 404 | 0.405 405 | 0.364 406 | 0.217 407 | 0.670 408 | 0.971 409 | 0.176 410 | 0.597 411 | 0.424 412 | 0.447 413 | 0.254 414 | 0.825 415 | 0.485 416 | 0.543 417 | 0.305 418 | 0.182 419 | 0.086 420 | 0.714 421 | 0.196 422 | 0.690 423 | 0.390 424 | 0.416 425 | 0.469 426 | 0.368 427 | 0.101 428 | 0.310 429 | 0.664 430 | 0.666 431 | 0.286 432 | 0.460 433 | 0.193 434 | 0.210 435 | 0.023 436 | 0.897 437 | 0.211 438 | 0.228 439 | 0.280 440 | 0.127 441 | 0.639 442 | 0.075 443 | 0.134 444 | 0.645 445 | 0.340 446 | 0.708 447 | 0.557 448 | 0.256 449 | 0.651 450 | 0.116 451 | 0.536 452 | 0.437 453 | 0.268 454 | 0.604 455 | 0.871 456 | 0.999 457 | 0.608 458 | 0.405 
459 | 0.225 460 | 0.257 461 | 0.479 462 | 0.367 463 | 0.914 464 | 0.368 465 | 0.373 466 | 0.384 467 | 0.837 468 | 0.651 469 | 0.614 470 | 0.334 471 | 0.818 472 | 0.038 473 | 0.871 474 | 0.513 475 | 0.398 476 | 0.497 477 | 0.667 478 | 0.013 479 | 0.872 480 | 0.447 481 | 0.343 482 | 0.138 483 | 0.439 484 | 0.496 485 | 0.404 486 | 0.679 487 | 0.421 488 | 0.961 489 | 0.599 490 | 0.807 491 | 0.109 492 | 0.397 493 | 0.337 494 | 0.569 495 | 0.861 496 | 0.078 497 | 0.073 498 | 0.850 499 | 0.213 500 | 0.669 501 | -------------------------------------------------------------------------------- /LightGBM_demo/regression/train.conf: -------------------------------------------------------------------------------- 1 | # task type, support train and predict 2 | task = train 3 | 4 | # boosting type, support gbdt for now, alias: boosting, boost 5 | boosting_type = gbdt 6 | 7 | # application type, support following application 8 | # regression , regression task 9 | # binary , binary classification task 10 | # lambdarank , lambdarank task 11 | # alias: application, app 12 | objective = regression 13 | 14 | # eval metrics, support multi metric, delimite by ',' , support following metrics 15 | # l1 16 | # l2 , default metric for regression 17 | # ndcg , default metric for lambdarank 18 | # auc 19 | # binary_logloss , default metric for binary 20 | # binary_error 21 | metric = l2 22 | 23 | # frequence for metric output 24 | metric_freq = 1 25 | 26 | # true if need output metric for training data, alias: tranining_metric, train_metric 27 | is_training_metric = true 28 | 29 | # number of bins for feature bucket, 255 is a recommend setting, it can save memories, and also has good accuracy. 30 | max_bin = 255 31 | 32 | # training data 33 | # if exsting weight file, should name to "regression.train.weight" 34 | # alias: train_data, train 35 | data = regression.train 36 | 37 | # validation data, support multi validation data, separated by ',' 38 | # if exsting weight file, should name to "regression.test.weight" 39 | # alias: valid, test, test_data, 40 | valid_data = regression.test 41 | 42 | # number of trees(iterations), alias: num_tree, num_iteration, num_iterations, num_round, num_rounds 43 | num_trees = 100 44 | 45 | # shrinkage rate , alias: shrinkage_rate 46 | learning_rate = 0.05 47 | 48 | # number of leaves for one tree, alias: num_leaf 49 | num_leaves = 31 50 | 51 | # type of tree learner, support following types: 52 | # serial , single machine version 53 | # feature , use feature parallel to train 54 | # data , use data parallel to train 55 | # voting , use voting based parallel to train 56 | # alias: tree 57 | tree_learner = serial 58 | 59 | # number of threads for multi-threading. One thread will use one CPU, default is setted to #cpu. 
60 | # num_threads = 8 61 | 62 | # feature sub-sample, will random select 80% feature to train on each iteration 63 | # alias: sub_feature 64 | feature_fraction = 0.9 65 | 66 | # Support bagging (data sub-sample), will perform bagging every 5 iterations 67 | bagging_freq = 5 68 | 69 | # Bagging farction, will random select 80% data on bagging 70 | # alias: sub_row 71 | bagging_fraction = 0.8 72 | 73 | # minimal number data for one leaf, use this to deal with over-fit 74 | # alias : min_data_per_leaf, min_data 75 | min_data_in_leaf = 100 76 | 77 | # minimal sum hessians for one leaf, use this to deal with over-fit 78 | min_sum_hessian_in_leaf = 5.0 79 | 80 | # save memory and faster speed for sparse feature, alias: is_sparse 81 | is_enable_sparse = true 82 | 83 | # when data is bigger than memory size, set this to true. otherwise set false will have faster speed 84 | # alias: two_round_loading, two_round 85 | use_two_round_loading = false 86 | 87 | # true if need to save data to binary file and application will auto load data from binary file next time 88 | # alias: is_save_binary, save_binary 89 | is_save_binary_file = false 90 | 91 | # output model file 92 | output_model = LightGBM_model.txt 93 | 94 | # support continuous train from trained gbdt model 95 | # input_model= trained_model.txt 96 | 97 | # output prediction file for predict task 98 | # output_result= prediction.txt 99 | 100 | # support continuous train from initial score file 101 | # input_init_score= init_score.txt 102 | 103 | 104 | # number of machines in parallel training, alias: num_machine 105 | num_machines = 1 106 | 107 | # local listening port in parallel training, alias: local_port 108 | local_listen_port = 12400 109 | 110 | # machines list file for parallel training, alias: mlist 111 | machine_list_file = mlist.txt 112 | -------------------------------------------------------------------------------- /Linux下使用MySQL.md: -------------------------------------------------------------------------------- 1 | *MySQL 是最流行的关系型数据库管理系统之一,属于 Oracle 旗下产品* 2 | 3 | ## 安装启动操作 4 | 1.安装mysql命令 :$ `sudo apt-get install -y mysql-server` 5 | 2.查看mysql的版本命令(注意-V是大写,不然会出现如下错误):$ `mysql -V` 6 | 3.启动mysql命令(关闭,重启等只需将start换成stop,restart等即可):$`sudo service mysql start` 7 | 4.登录mysql命令为:$ `mysql -u用户名 -p密码` 8 | 5.连接远程数据库:$ `mysql -h -P -u -p` 9 | 10 | ## 数据库操作 11 | 1.查看数据库:> `show databases;` (注意分号“;”不要落下) 12 | 2.新建一个数据库命令:> `create database 数据库名称;` 13 |    删除一个数据库命令:> `drop database 数据库名称;` 14 | 3.使用某个数据库:> `use 数据库名称;` 15 | 16 | ## 表操作 17 | 1.查看表命令:> `show tables;` 18 | 2.建立一个新表:> `create table 表名 (字段参数);` 或 >`create table if not exists 表名(字段参数);` 19 |    删除一个旧表:> `drop table 表名;` 或 >`drop table if exists 表名;` 20 | 3.查看表结构:> `desc 表名称;` 或 >`show columns from 表名称;` 21 | 4.对表数据的操作: 22 |    增:>`insert into 表名称 (字段名1,字段名2,字段名3......) values(字段名1的值,字段名2的值,字段名3的值......);` 23 |    删:>`delete from 表名称 where 表达式;` 24 |    改:>`update 表名称 set 字段名=“新值” where 表达式;` 25 |    查:>`select 字段名1,字段名2,字段名3..... 
from 表名称;` 26 | 5.增加字段:>`alter table 表名称 add 字段名 数据类型 其它;` (其它包括默认初始值的设定等等) 27 | 6.删除字段:>`alter table 表名称 drop 字段名;` 28 | 29 | 30 | ## 用户相关操作 31 | 注:以下命令均需先以root身份登录mysql:`mysql -uroot -p` 32 | 1.添加新用户 33 | (1)创建新用户:> `insert into mysql.user(Host,User,Password) values("localhost","user1",password("password1"));` 34 | (2)为用户分配权限: 35 |             设置用户可以在本地访问mysql:`grant all privileges on *.* to username@localhost identified by "password" ;` 36 |             设置用户只能访问指定数据库:`grant all privileges on 数据库名.* to username@localhost identified by "password" ;` 37 | (3)刷新系统权限表:>`flush privileges;` 38 | 2.查看MySql当前所有的用户:>`SELECT DISTINCT User FROM mysql.user;` 39 | 3.删除用户及其数据字典中包含的数据:>`drop user 'xbb'@'localhost';` 40 | 41 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # learning_notes 2 | 3 | 一些个人的学习笔记 4 | 5 | - git学习参考[Git教程-廖雪峰的官方网站](https://www.liaoxuefeng.com/wiki/0013739516305929606dd18361248578c67b8067c8c017b000/) 6 | 7 | - Linux下使用MySQL参考[Linux系统下Mysql使用简单教程](http://www.jb51.net/article/84399.htm) 8 | 9 | - sklearn学习参考[sklearn官网](http://scikit-learn.org/stable/index.html) 10 | 11 | - XGBoost学习参考[XGBoost官网](http://xgboost.readthedocs.io/en/latest/////python/python_api.html#module-xgboost.sklearn) 12 | 13 | - LightGBM学习参考[LightGBM官网](https://lightgbm.readthedocs.io/en/latest/index.html#) 14 | 15 | - tensorflow入门学习[TensorFlow_Started.ipynb](https://github.com/fuqiuai/learning_notes/blob/master/TensorFlow_Started.ipynb) 16 | -------------------------------------------------------------------------------- /TensorFlow_Started.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# TensorFlow入门" 8 | ] 9 | }, 10 | { 11 | "cell_type": "code", 12 | "execution_count": 119, 13 | "metadata": { 14 | "collapsed": true 15 | }, 16 | "outputs": [], 17 | "source": [ 18 | "import tensorflow as tf\n", 19 | "import numpy as np" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "Hello,TensorFlow!" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 23, 32 | "metadata": {}, 33 | "outputs": [ 34 | { 35 | "name": "stdout", 36 | "output_type": "stream", 37 | "text": [ 38 | "b'Hello,TensorFlow!'\n" 39 | ] 40 | } 41 | ], 42 | "source": [ 43 | "hello=tf.constant(\"Hello,TensorFlow!\")\n", 44 | "sess=tf.Session()\n", 45 | "print(sess.run(hello))" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "## 1.1 张量和图\n", 53 | "\n", 54 | "TensorFlow 是一种采用数据流图(data flow graphs),用于数值计算的开源软件库。其中 **Tensor代表传递的数据为张量(多维数组)**,**Flow代表使用计算图进行运算**。数据流图用「结点」(nodes)和「边」(edges)组成的有向图来描述数学运算。**结点**一般用来表示施加的数学操作,但也可以表示数据输入的起点和输出的终点,或者是读取/写入持久变量(persistent variable)的终点。**边**表示结点之间的输入/输出关系。这些数据边可以传送维度可动态调整的多维数据数组,即张量(tensor)。\n", 55 | "\n", 56 | "下面代码是使用计算图的案例:" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 6, 62 | "metadata": {}, 63 | "outputs": [ 64 | { 65 | "name": "stdout", 66 | "output_type": "stream", 67 | "text": [ 68 | "8\n", 69 | "[[0. 0.]\n", 70 | " [0. 
0.]]\n" 71 | ] 72 | } 73 | ], 74 | "source": [ 75 | "a = tf.constant(2, tf.int16)\n", 76 | "b = tf.constant(4, tf.float32)\n", 77 | "\n", 78 | "graph = tf.Graph()\n", 79 | "with graph.as_default():\n", 80 | " a = tf.Variable(8, tf.float32)\n", 81 | " b = tf.Variable(tf.zeros([2,2], tf.float32))\n", 82 | " \n", 83 | "with tf.Session(graph=graph) as session:\n", 84 | " tf.global_variables_initializer().run()\n", 85 | " print(session.run(a))\n", 86 | " print(session.run(b))" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "在 Tensorflow 中,**所有不同的变量和运算都是储存在计算图**。所以在我们构建完模型所需要的图之后,还**需要打开一个会话(Session)来运行整个计算图**。在会话中,我们可以将所有计算分配到可用的 CPU 和 GPU 资源中。\n", 94 | "\n", 95 | "如下所示代码,我们声明两个常量 a 和 b,并且定义一个加法运算。但它并不会输出计算结果,因为我们只是定义了一张图,而没有运行它:" 96 | ] 97 | }, 98 | { 99 | "cell_type": "code", 100 | "execution_count": 7, 101 | "metadata": {}, 102 | "outputs": [ 103 | { 104 | "name": "stdout", 105 | "output_type": "stream", 106 | "text": [ 107 | "Tensor(\"add:0\", shape=(2,), dtype=int32)\n" 108 | ] 109 | } 110 | ], 111 | "source": [ 112 | "a=tf.constant([1,2],name=\"a\")\n", 113 | "\n", 114 | "b=tf.constant([2,4],name=\"b\")\n", 115 | "\n", 116 | "result = a+b\n", 117 | "\n", 118 | "print(result)" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "下面的代码才会输出计算结果,因为我们需要创建一个会话才能管理 TensorFlow 运行时的所有资源。但计算完毕后需要关闭会话来帮助系统回收资源,不然就会出现资源泄漏的问题。下面提供了使用会话的两种方式:" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": 21, 131 | "metadata": {}, 132 | "outputs": [ 133 | { 134 | "name": "stdout", 135 | "output_type": "stream", 136 | "text": [ 137 | "[3 6]\n" 138 | ] 139 | } 140 | ], 141 | "source": [ 142 | "a=tf.constant([1,2])\n", 143 | "\n", 144 | "b=tf.constant([2,4])\n", 145 | "\n", 146 | "result = a+b\n", 147 | "\n", 148 | "'''\n", 149 | "sess=tf.Session()\n", 150 | "print(sess.run(result))\n", 151 | "sess.close\n", 152 | "'''\n", 153 | "\n", 154 | "with tf.Session() as sess:\n", 155 | " print(sess.run(result))" 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "metadata": {}, 161 | "source": [ 162 | "## 1.2 常量、变量和占位符\n", 163 | "\n", 164 | "**TensorFlow 中最基本的单位是常量(constant)、变量(Variable)和占位符(placeholder)。**\n", 165 | "\n", 166 | "下面我们分别定义了常量与变量:" 167 | ] 168 | }, 169 | { 170 | "cell_type": "code", 171 | "execution_count": 53, 172 | "metadata": {}, 173 | "outputs": [ 174 | { 175 | "name": "stdout", 176 | "output_type": "stream", 177 | "text": [ 178 | "2 \n", 179 | " [[0. 0.]\n", 180 | " [0. 
0.]] \n", 181 | " [0 0 0 0 0 0 0 0 0 0 0]\n" 182 | ] 183 | } 184 | ], 185 | "source": [ 186 | "import numpy as np\n", 187 | "\n", 188 | "a=tf.constant(2,tf.int16)\n", 189 | "b=tf.constant(8.9,tf.float32)\n", 190 | "\n", 191 | "d=tf.Variable(4,tf.int16)\n", 192 | "\n", 193 | "g = tf.constant(np.zeros(shape=(2,2), dtype=np.float32))\n", 194 | "# 等价于 g=tf.zeros([2,2],tf.float32)\n", 195 | "\n", 196 | "h = tf.zeros([11], tf.int16)\n", 197 | "i = tf.ones([2,2], tf.float32)\n", 198 | "l = tf.Variable(tf.zeros([5,6,5], tf.float32))\n", 199 | "\n", 200 | "# print(a,'\\n',d,'\\n',g,'\\n',i,'\\n',h,'\\n',l)\n", 201 | "with tf.Session() as sess:\n", 202 | " print(sess.run(a),'\\n',sess.run(g),'\\n',sess.run(h))" 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": {}, 208 | "source": [ 209 | "TensorFlow数据类型可参考:http://wiki.jikexueyuan.com/project/tensorflow-zh/resources/dims_types.html\n", 210 | "\n", 211 | "**常量(constant)**同其他语言的常量, 在赋值后不可修改。tf.constant(\n", 212 | " value,\n", 213 | " dtype=None,\n", 214 | " shape=None,\n", 215 | " name='Const',\n", 216 | " verify_shape=False\n", 217 | ")\n", 218 | "
**占位符(placeholder)**同其他语言的方法的参数, 在执行方法时设置。tf.placeholder(\n", 219 | " dtype,\n", 220 | " shape=None,\n", 221 | " name=None\n", 222 | ")\n", 223 | "
**变量(Variable)**比较特殊, 机器学习特有属性. 对于常规的算法而言, 在输入值已知的情况下, 优化变量(参数)使损失函数的值达到最小. 因而在含有优化器(Optimizer)的算法中, 变量是动态计算(变化)的. 如果在未使用优化器时, 变量仍作为普通变量. 注意: 在使用变量之前, 需要执行初始化方法, 系统才会为其赋值。tf.Variable('initial-value', name='optional-name')" 224 | ] 225 | }, 226 | { 227 | "cell_type": "code", 228 | "execution_count": 81, 229 | "metadata": {}, 230 | "outputs": [ 231 | { 232 | "name": "stdout", 233 | "output_type": "stream", 234 | "text": [ 235 | "Tensor(\"Const_116:0\", shape=(), dtype=float32) Tensor(\"Const_117:0\", shape=(), dtype=float32)\n", 236 | "7.5\n", 237 | "[3. 7.]\n", 238 | "linear_model: [0. 0.3 0.6 0.90000004]\n" 239 | ] 240 | } 241 | ], 242 | "source": [ 243 | "with tf.Session() as sess:\n", 244 | " # 常量\n", 245 | " node1 = tf.constant(3.0, tf.float32)\n", 246 | " node2 = tf.constant(4.0)\n", 247 | " print (node1, node2) # 只打印结点信息\n", 248 | "\n", 249 | " # 占位符\n", 250 | " a = tf.placeholder(tf.float32)\n", 251 | " b = tf.placeholder(tf.float32)\n", 252 | " adder_node = a + b # 与调用add方法类似\n", 253 | " print (sess.run(adder_node, {a: 3, b: 4.5}))\n", 254 | " print (sess.run(adder_node, {a: [1, 3], b: [2, 4]}))\n", 255 | "\n", 256 | " # 变量\n", 257 | " W = tf.Variable([.3], tf.float32)\n", 258 | " b = tf.Variable([-.3], tf.float32)\n", 259 | " x = tf.placeholder(tf.float32)\n", 260 | " # linear_model = node1 * x + node2\n", 261 | " linear_model = W * x + b\n", 262 | " sess.run(tf.global_variables_initializer()) #初始化模型参数\n", 263 | " print (\"linear_model: \", sess.run(linear_model, {x: [1, 2, 3, 4]}))" 264 | ] 265 | }, 266 | { 267 | "cell_type": "markdown", 268 | "metadata": {}, 269 | "source": [ 270 | "在会话中,占位符可以使用 feed_dict 馈送数据。\n", 271 | "\n", 272 | "feed_dict 是一个字典,在字典中需要给出每一个用到的占位符的取值。在训练神经网络时需要每次提供一个批量的训练样本,如果每次迭代选取的数据要通过常量表示,那么 TensorFlow 的计算图会非常大。因为每增加一个常量,TensorFlow 都会在计算图中增加一个结点。所以说拥有几百万次迭代的神经网络会拥有极其庞大的计算图,而**占位符却可以解决这一点,它只会拥有占位符这一个结点**。\n", 273 | "\n", 274 | "下面一段代码分别展示了使用常量和占位符进行计算:" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": 92, 280 | "metadata": {}, 281 | "outputs": [ 282 | { 283 | "name": "stdout", 284 | "output_type": "stream", 285 | "text": [ 286 | "[[-0.11131823 2.3845987 ]]\n", 287 | "[[-0.11131823 2.3845987 ]]\n", 288 | "[[2. 
2.]]\n" 289 | ] 290 | }, 291 | { 292 | "data": { 293 | "text/plain": [ 294 | ">" 295 | ] 296 | }, 297 | "execution_count": 92, 298 | "metadata": {}, 299 | "output_type": "execute_result" 300 | } 301 | ], 302 | "source": [ 303 | "w1=tf.Variable(tf.random_normal([1,2],stddev=1,seed=1))\n", 304 | "\n", 305 | "#因为需要重复输入x,而每建一个x就会生成一个结点,计算图的效率会低。所以使用占位符\n", 306 | "x=tf.placeholder(tf.float32,shape=(1,2))\n", 307 | "x1=tf.constant([[0.7,0.9]])\n", 308 | "\n", 309 | "a=x+w1\n", 310 | "b=x1+w1\n", 311 | "\n", 312 | "sess=tf.Session()\n", 313 | "sess.run(tf.global_variables_initializer())\n", 314 | "\n", 315 | "#运行y时将占位符填上,feed_dict为字典,变量名不可变\n", 316 | "y_1=sess.run(a,feed_dict={x:[[0.7,0.9]]})\n", 317 | "y_2=sess.run(b)\n", 318 | "\n", 319 | "print(y_1)\n", 320 | "print(y_2)\n", 321 | "sess.close" 322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "metadata": {}, 327 | "source": [ 328 | "## 1.3 实例\n", 329 | "例1:计算多维张量欧里几得距离" 330 | ] 331 | }, 332 | { 333 | "cell_type": "code", 334 | "execution_count": 110, 335 | "metadata": {}, 336 | "outputs": [ 337 | { 338 | "name": "stdout", 339 | "output_type": "stream", 340 | "text": [ 341 | "the distance between [[1 2]] and [[15 16]] -> 19.79899024963379\n", 342 | "the distance between [[3 4]] and [[13 14]] -> 14.142135620117188\n", 343 | "the distance between [[5 6]] and [[11 12]] -> 8.485280990600586\n", 344 | "the distance between [[7 8]] and [[ 9 10]] -> 2.8284270763397217\n" 345 | ] 346 | } 347 | ], 348 | "source": [ 349 | "list_of_points1_ = [[1,2], [3,4], [5,6], [7,8]]\n", 350 | "list_of_points2_ = [[15,16], [13,14], [11,12], [9,10]]\n", 351 | "\n", 352 | "# list_of_points1 = np.array([np.array(elem).reshape(1,2) for elem in list_of_points1_])\n", 353 | "# list_of_points2 = np.array([np.array(elem).reshape(1,2) for elem in list_of_points2_])\n", 354 | "\n", 355 | "\n", 356 | "#graph = tf.Graph()\n", 357 | "\n", 358 | "#with graph.as_default(): \n", 359 | "\n", 360 | "def calculate_eucledian_distance(point1, point2):\n", 361 | " difference = tf.subtract(point1, point2) # 相减\n", 362 | " power2 = tf.pow(difference, tf.constant(2.0, shape=(1,2))) # 幂乘\n", 363 | " add = tf.reduce_sum(power2) # 计算一个张量的各个维度上元素的总和\n", 364 | " eucledian_distance = tf.sqrt(add) # 开根\n", 365 | " return eucledian_distance\n", 366 | "\n", 367 | "#我们使用 tf.placeholder() 创建占位符 ,在 session.run() 过程中再投递数据 \n", 368 | "point1 = tf.placeholder(tf.float32, shape=(1, 2))\n", 369 | "point2 = tf.placeholder(tf.float32, shape=(1, 2))\n", 370 | "\n", 371 | "dist = calculate_eucledian_distance(point1, point2)\n", 372 | "\n", 373 | "# 开启会话\n", 374 | "with tf.Session() as session:\n", 375 | " #tf.global_variables_initializer().run() \n", 376 | " for ii in range(len(list_of_points1)):\n", 377 | " point1_ = list_of_points1[ii]\n", 378 | " point2_ = list_of_points2[ii]\n", 379 | "\n", 380 | " #使用feed_dict将数据投入到dist中\n", 381 | " distance = session.run(dist, feed_dict={point1 : point1_, point2 : point2_})\n", 382 | " print(\"the distance between {} and {} -> {}\".format(point1_, point2_, distance))" 383 | ] 384 | }, 385 | { 386 | "cell_type": "markdown", 387 | "metadata": {}, 388 | "source": [ 389 | "例2:构建三层全连接神经网络" 390 | ] 391 | }, 392 | { 393 | "cell_type": "code", 394 | "execution_count": 134, 395 | "metadata": {}, 396 | "outputs": [ 397 | { 398 | "name": "stdout", 399 | "output_type": "stream", 400 | "text": [ 401 | "初始化权重为:\n", 402 | " [[-0.8113182 1.4845988 0.06532937]\n", 403 | " [-2.4427042 0.0992484 0.5912243 ]] \n", 404 | " [[-0.8113182 ]\n", 405 | " [ 1.4845988 ]\n", 406 | " [ 
0.06532937]]\n", 407 | "在迭代0次后,训练损失为0.308504\n", 408 | "在迭代1000次后,训练损失为0.0393406\n", 409 | "在迭代2000次后,训练损失为0.0182158\n", 410 | "在迭代3000次后,训练损失为0.0104779\n", 411 | "在迭代4000次后,训练损失为0.00680374\n", 412 | "在迭代5000次后,训练损失为0.00446512\n", 413 | "在迭代6000次后,训练损失为0.00296797\n", 414 | "在迭代7000次后,训练损失为0.00218553\n", 415 | "在迭代8000次后,训练损失为0.00179452\n", 416 | "在迭代9000次后,训练损失为0.0013211\n", 417 | "在迭代10000次后,训练损失为0.000957699\n" 418 | ] 419 | } 420 | ], 421 | "source": [ 422 | "# 定义变量w1,w2(权重)\n", 423 | "w1=tf.Variable(tf.random_normal([2,3],stddev=1,seed=1))\n", 424 | "w2=tf.Variable(tf.random_normal([3,1],stddev=1,seed=1))\n", 425 | "\n", 426 | "# 定义占位符x,y(样本集)\n", 427 | "x=tf.placeholder(tf.float32,shape=(None,2)) # None可以根据batch大小确定维度,在shape的一个维度上使用None\n", 428 | "y=tf.placeholder(tf.float32,shape=(None,1))\n", 429 | "\n", 430 | "# 定义ReLU激活函数\n", 431 | "a=tf.nn.relu(tf.matmul(x,w1)) #tf.nn.relu(features, name = None),这个函数的作用是计算激活函数relu,即max(features, 0)。即将矩阵中每个元素的负值置0。\n", 432 | "yhat=tf.nn.relu(tf.matmul(a,w2)) # tf.matmul为矩阵相乘,yhat为预测的值\n", 433 | "\n", 434 | "# 定义交叉熵损失函数和训练算法AdamOptimizer\n", 435 | "cross_entropy=-tf.reduce_mean(y*tf.log(tf.clip_by_value(yhat,1e-10,1.0))) #tf.clip_by_value(A, min, max):输入一个张量A,把A中的每一个元素的值都压缩在min和max之间。小于min的让它等于min,大于max的元素的值等于max。\n", 436 | "train_op=tf.train.AdamOptimizer(0.001).minimize(cross_entropy) # 学习率为0.001\n", 437 | "\n", 438 | "# 随机生成512个样本,样本特征维数为2\n", 439 | "data_size=512\n", 440 | "X = np.random.RandomState(1).rand(data_size,2) # 样本范围为[0, 1)\n", 441 | "# 生成标签,1为正样本,0为负样本\n", 442 | "Y = [[int(x1+x2<1)] for (x1,x2) in X]\n", 443 | "\n", 444 | "batch_size=10 # 每次训练读取样本个数\n", 445 | "\n", 446 | "with tf.Session() as sess:\n", 447 | " sess.run(tf.global_variables_initializer()) # 初始化\n", 448 | " print('初始化权重为:\\n',sess.run(w1),'\\n',sess.run(w2))\n", 449 | " steps=10001\n", 450 | " for i in range(steps):\n", 451 | " #选定每一个批量读取的首尾位置,确保在1个epoch(全部样本训练一次为1个epoch)内采样训练\n", 452 | " start = i*batch_size % data_size\n", 453 | " end = min(start+batch_size,data_size)\n", 454 | " sess.run(train_op,feed_dict={x:X[start:end],y:Y[start:end]}) # 开始训练\n", 455 | " if i % 1000 == 0:\n", 456 | " training_loss=sess.run(cross_entropy,feed_dict={x:X,y:Y})\n", 457 | " print(\"在迭代%d次后,训练损失为%g\"%(i,training_loss))\n", 458 | " " 459 | ] 460 | }, 461 | { 462 | "cell_type": "markdown", 463 | "metadata": {}, 464 | "source": [ 465 | "说明:上面的代码定义了一个简单的三层全连接网络(输入层、隐藏层和输出层分别为 2、3 和 1 个神经元),隐藏层和输出层的激活函数使用的是 ReLU 函数。该模型训练的样本总数为 512,每次迭代读取的批量为 10。这个简单的全连接网络以交叉熵为损失函数,并使用Adam优化算法进行权重更新。" 466 | ] 467 | }, 468 | { 469 | "cell_type": "code", 470 | "execution_count": null, 471 | "metadata": { 472 | "collapsed": true 473 | }, 474 | "outputs": [], 475 | "source": [] 476 | } 477 | ], 478 | "metadata": { 479 | "kernelspec": { 480 | "display_name": "Python 3", 481 | "language": "python", 482 | "name": "python3" 483 | }, 484 | "language_info": { 485 | "codemirror_mode": { 486 | "name": "ipython", 487 | "version": 3 488 | }, 489 | "file_extension": ".py", 490 | "mimetype": "text/x-python", 491 | "name": "python", 492 | "nbconvert_exporter": "python", 493 | "pygments_lexer": "ipython3", 494 | "version": "3.5.5" 495 | } 496 | }, 497 | "nbformat": 4, 498 | "nbformat_minor": 2 499 | } 500 | -------------------------------------------------------------------------------- /XGBoost_demo/.gitignore: -------------------------------------------------------------------------------- 1 | *.libsvm 2 | *.pkl 3 | -------------------------------------------------------------------------------- /XGBoost_demo/README.md: 
-------------------------------------------------------------------------------- 1 | Awesome XGBoost 2 | =============== 3 | This page contains a curated list of examples, tutorials, blogs about XGBoost usecases. 4 | It is inspired by [awesome-MXNet](https://github.com/dmlc/mxnet/blob/master/example/README.md), 5 | [awesome-php](https://github.com/ziadoz/awesome-php) and [awesome-machine-learning](https://github.com/josephmisiti/awesome-machine-learning). 6 | 7 | Please send a pull request if you find things that belongs to here. 8 | 9 | Contents 10 | -------- 11 | - [Code Examples](#code-examples) 12 | - [Features Walkthrough](#features-walkthrough) 13 | - [Basic Examples by Tasks](#basic-examples-by-tasks) 14 | - [Benchmarks](#benchmarks) 15 | - [Machine Learning Challenge Winning Solutions](#machine-learning-challenge-winning-solutions) 16 | - [Tutorials](#tutorials) 17 | - [Usecases](#usecases) 18 | - [Tools using XGBoost](#tools-using-xgboost) 19 | - [Awards](#awards) 20 | - [Windows Binaries](#windows-binaries) 21 | 22 | Code Examples 23 | ------------- 24 | ### Features Walkthrough 25 | 26 | This is a list of short codes introducing different functionalities of xgboost packages. 27 | 28 | * Basic walkthrough of packages 29 | [python](guide-python/basic_walkthrough.py) 30 | [R](../R-package/demo/basic_walkthrough.R) 31 | [Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/basic_walkthrough.jl) 32 | [PHP](https://github.com/bpachev/xgboost-php/blob/master/demo/titanic_demo.php) 33 | * Customize loss function, and evaluation metric 34 | [python](guide-python/custom_objective.py) 35 | [R](../R-package/demo/custom_objective.R) 36 | [Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/custom_objective.jl) 37 | * Boosting from existing prediction 38 | [python](guide-python/boost_from_prediction.py) 39 | [R](../R-package/demo/boost_from_prediction.R) 40 | [Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/boost_from_prediction.jl) 41 | * Predicting using first n trees 42 | [python](guide-python/predict_first_ntree.py) 43 | [R](../R-package/demo/predict_first_ntree.R) 44 | [Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/predict_first_ntree.jl) 45 | * Generalized Linear Model 46 | [python](guide-python/generalized_linear_model.py) 47 | [R](../R-package/demo/generalized_linear_model.R) 48 | [Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/generalized_linear_model.jl) 49 | * Cross validation 50 | [python](guide-python/cross_validation.py) 51 | [R](../R-package/demo/cross_validation.R) 52 | [Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/cross_validation.jl) 53 | * Predicting leaf indices 54 | [python](guide-python/predict_leaf_indices.py) 55 | [R](../R-package/demo/predict_leaf_indices.R) 56 | 57 | ### Basic Examples by Tasks 58 | 59 | Most of examples in this section are based on CLI or python version. 
60 | However, the parameter settings can be applied to all versions 61 | 62 | - [Binary classification](binary_classification) 63 | - [Multiclass classification](multiclass_classification) 64 | - [Regression](regression) 65 | - [Learning to Rank](rank) 66 | 67 | ### Benchmarks 68 | 69 | - [Starter script for Kaggle Higgs Boson](kaggle-higgs) 70 | - [Kaggle Tradeshift winning solution by daxiongshu](https://github.com/daxiongshu/kaggle-tradeshift-winning-solution) 71 | - [Benchmarking the most commonly used open source tools for binary classification](https://github.com/szilard/benchm-ml#boosting-gradient-boosted-treesgradient-boosting-machines) 72 | 73 | 74 | ## Machine Learning Challenge Winning Solutions 75 | 76 | XGBoost is extensively used by machine learning practitioners to create state of art data science solutions, 77 | this is a list of machine learning winning solutions with XGBoost. 78 | Please send pull requests if you find ones that are missing here. 79 | 80 | - Maksims Volkovs, Guangwei Yu and Tomi Poutanen, 1st place of the [2017 ACM RecSys challenge](http://2017.recsyschallenge.com/). Link to [paper](http://www.cs.toronto.edu/~mvolkovs/recsys2017_challenge.pdf). 81 | - Vlad Sandulescu, Mihai Chiru, 1st place of the [KDD Cup 2016 competition](https://kddcup2016.azurewebsites.net). Link to [the arxiv paper](http://arxiv.org/abs/1609.02728). 82 | - Marios Michailidis, Mathias Müller and HJ van Veen, 1st place of the [Dato Truely Native? competition](https://www.kaggle.com/c/dato-native). Link to [the Kaggle interview](http://blog.kaggle.com/2015/12/03/dato-winners-interview-1st-place-mad-professors/). 83 | - Vlad Mironov, Alexander Guschin, 1st place of the [CERN LHCb experiment Flavour of Physics competition](https://www.kaggle.com/c/flavours-of-physics). Link to [the Kaggle interview](http://blog.kaggle.com/2015/11/30/flavour-of-physics-technical-write-up-1st-place-go-polar-bears/). 84 | - Josef Slavicek, 3rd place of the [CERN LHCb experiment Flavour of Physics competition](https://www.kaggle.com/c/flavours-of-physics). Link to [the Kaggle interview](http://blog.kaggle.com/2015/11/23/flavour-of-physics-winners-interview-3rd-place-josef-slavicek/). 85 | - Mario Filho, Josef Feigl, Lucas, Gilberto, 1st place of the [Caterpillar Tube Pricing competition](https://www.kaggle.com/c/caterpillar-tube-pricing). Link to [the Kaggle interview](http://blog.kaggle.com/2015/09/22/caterpillar-winners-interview-1st-place-gilberto-josef-leustagos-mario/). 86 | - Qingchen Wang, 1st place of the [Liberty Mutual Property Inspection](https://www.kaggle.com/c/liberty-mutual-group-property-inspection-prediction). Link to [the Kaggle interview](http://blog.kaggle.com/2015/09/28/liberty-mutual-property-inspection-winners-interview-qingchen-wang/). 87 | - Chenglong Chen, 1st place of the [Crowdflower Search Results Relevance](https://www.kaggle.com/c/crowdflower-search-relevance). Link to [the winning solution](https://www.kaggle.com/c/crowdflower-search-relevance/forums/t/15186/1st-place-winner-solution-chenglong-chen/). 88 | - Alexandre Barachant (“Cat”) and Rafał Cycoń (“Dog”), 1st place of the [Grasp-and-Lift EEG Detection](https://www.kaggle.com/c/grasp-and-lift-eeg-detection). Link to [the Kaggle interview](http://blog.kaggle.com/2015/10/12/grasp-and-lift-eeg-winners-interview-1st-place-cat-dog/). 89 | - Halla Yang, 2nd place of the [Recruit Coupon Purchase Prediction Challenge](https://www.kaggle.com/c/coupon-purchase-prediction). 
Link to [the Kaggle interview](http://blog.kaggle.com/2015/10/21/recruit-coupon-purchase-winners-interview-2nd-place-halla-yang/). 90 | - Owen Zhang, 1st place of the [Avito Context Ad Clicks competition](https://www.kaggle.com/c/avito-context-ad-clicks). Link to [the Kaggle interview](http://blog.kaggle.com/2015/08/26/avito-winners-interview-1st-place-owen-zhang/). 91 | - Keiichi Kuroyanagi, 2nd place of the [Airbnb New User Bookings](https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings). Link to [the Kaggle interview](http://blog.kaggle.com/2016/03/17/airbnb-new-user-bookings-winners-interview-2nd-place-keiichi-kuroyanagi-keiku/). 92 | - Marios Michailidis, Mathias Müller and Ning Situ, 1st place [Homesite Quote Conversion](https://www.kaggle.com/c/homesite-quote-conversion). Link to [the Kaggle interview](http://blog.kaggle.com/2016/04/08/homesite-quote-conversion-winners-write-up-1st-place-kazanova-faron-clobber/). 93 | 94 | ## Talks 95 | - [XGBoost: A Scalable Tree Boosting System](http://datascience.la/xgboost-workshop-and-meetup-talk-with-tianqi-chen/) (video+slides) by Tianqi Chen at the Los Angeles Data Science meetup 96 | 97 | ## Tutorials 98 | 99 | - [Machine Learning with XGBoost on Qubole Spark Cluster](https://www.qubole.com/blog/machine-learning-xgboost-qubole-spark-cluster/) 100 | - [XGBoost Official RMarkdown Tutorials](https://xgboost.readthedocs.org/en/latest/R-package/index.html#tutorials) 101 | - [An Introduction to XGBoost R Package](http://dmlc.ml/rstats/2016/03/10/xgboost.html) by Tong He 102 | - [Open Source Tools & Data Science Competitions](http://www.slideshare.net/odsc/owen-zhangopen-sourcetoolsanddscompetitions1) by Owen Zhang - XGBoost parameter tuning tips 103 | * [Feature Importance Analysis with XGBoost in Tax audit](http://fr.slideshare.net/MichaelBENESTY/feature-importance-analysis-with-xgboost-in-tax-audit) 104 | * [Winning solution of Kaggle Higgs competition: what a single model can do](http://no2147483647.wordpress.com/2014/09/17/winning-solution-of-kaggle-higgs-competition-what-a-single-model-can-do/) 105 | - [XGBoost - eXtreme Gradient Boosting](http://www.slideshare.net/ShangxuanZhang/xgboost) by Tong He 106 | - [How to use XGBoost algorithm in R in easy steps](http://www.analyticsvidhya.com/blog/2016/01/xgboost-algorithm-easy-steps/) by TAVISH SRIVASTAVA ([Chinese Translation 中文翻译](https://segmentfault.com/a/1190000004421821) by [HarryZhu](https://segmentfault.com/u/harryprince)) 107 | - [Kaggle Solution: What’s Cooking ? 
(Text Mining Competition)](http://www.analyticsvidhya.com/blog/2015/12/kaggle-solution-cooking-text-mining-competition/) by MANISH SARASWAT 108 | - Better Optimization with Repeated Cross Validation and the XGBoost model - Machine Learning with R) by Manuel Amunategui ([Youtube Link](https://www.youtube.com/watch?v=Og7CGAfSr_Y)) ([Github Link](https://github.com/amunategui/BetterCrossValidation)) 109 | - [XGBoost Rossman Parameter Tuning](https://www.kaggle.com/khozzy/rossmann-store-sales/xgboost-parameter-tuning-template/run/90168/notebook) by [Norbert Kozlowski](https://www.kaggle.com/khozzy) 110 | - [Featurizing log data before XGBoost](http://www.slideshare.net/DataRobot/featurizing-log-data-before-xgboost) by Xavier Conort, Owen Zhang etc 111 | - [West Nile Virus Competition Benchmarks & Tutorials](http://blog.kaggle.com/2015/07/21/west-nile-virus-competition-benchmarks-tutorials/) by [Anna Montoya](http://blog.kaggle.com/author/annamontoya/) 112 | - [Ensemble Decision Tree with XGBoost](https://www.kaggle.com/binghsu/predict-west-nile-virus/xgboost-starter-code-python-0-69) by [Bing Xu](https://www.kaggle.com/binghsu) 113 | - [Notes on eXtreme Gradient Boosting](http://startup.ml/blog/xgboost) by ARSHAK NAVRUZYAN ([iPython Notebook](https://github.com/startupml/koan/blob/master/eXtreme%20Gradient%20Boosting.ipynb)) 114 | - [Complete Guide to Parameter Tuning in XGBoost](http://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/) by Aarshay Jain 115 | - [Practical XGBoost in Python online course](http://education.parrotprediction.teachable.com/courses/practical-xgboost-in-python) by Parrot Prediction 116 | - [Spark and XGBoost using Scala](http://www.elenacuoco.com/2016/10/10/scala-spark-xgboost-classification/) by Elena Cuoco 117 | ## Usecases 118 | If you have particular usecase of xgboost that you would like to highlight. 119 | Send a PR to add a one sentence description:) 120 | 121 | - XGBoost is used in [Kaggle Script](https://www.kaggle.com/scripts) to solve data science challenges. 122 | - [Seldon predictive service powered by XGBoost](http://docs.seldon.io/iris-demo.html) 123 | - XGBoost Distributed is used in [ODPS Cloud Service by Alibaba](https://yq.aliyun.com/articles/6355) (in Chinese) 124 | - XGBoost is incoporated as part of [Graphlab Create](https://dato.com/products/create/) for scalable machine learning. 125 | - [Hanjing Su](https://www.52cs.org) from Tencent data platform team: "We use distributed XGBoost for click through prediction in wechat shopping and lookalikes. The problems involve hundreds millions of users and thousands of features. XGBoost is cleanly designed and can be easily integrated into our production environment, reducing our cost in developments." 
126 | - [CNevd](https://github.com/CNevd) from autohome.com ad platform team: "Distributed XGBoost is used for click through rate prediction in our display advertising, XGBoost is highly efficient and flexible and can be easily used on our distributed platform, our ctr made a great improvement with hundred millions samples and millions features due to this awesome XGBoost" 127 | 128 | 129 | 130 | ## Tools using XGBoost 131 | 132 | - [BayesBoost](https://github.com/mpearmain/BayesBoost) - Bayesian Optimization using xgboost and sklearn API 133 | - [gp_xgboost_gridsearch](https://github.com/vatsan/gp_xgboost_gridsearch) - In-database parallel grid-search for XGBoost on [Greenplum](https://github.com/greenplum-db/gpdb) using PL/Python 134 | - [tpot](https://github.com/rhiever/tpot) - A Python tool that automatically creates and optimizes machine learning pipelines using genetic programming. 135 | 136 | ## Awards 137 | - [John Chambers Award](http://stat-computing.org/awards/jmc/winners.html) - 2016 Winner: XGBoost R Package, by Tong He (Simon Fraser University) and Tianqi Chen (University of Washington) 138 | 139 | ## Windows Binaries 140 | Unofficial windows binaries and instructions on how to use them are hosted on [Guido Tapia's blog](http://www.picnet.com.au/blogs/guido/post/2016/09/22/xgboost-windows-x64-binaries-for-download/) 141 | -------------------------------------------------------------------------------- /XGBoost_demo/binary_classification/README.md: -------------------------------------------------------------------------------- 1 | Binary Classification 2 | ===================== 3 | This is the quick start tutorial for xgboost CLI version. 4 | Here we demonstrate how to use XGBoost for a binary classification task. Before getting started, make sure you compile xgboost in the root directory of the project by typing ```make```. 5 | The script 'runexp.sh' can be used to run the demo. Here we use [mushroom dataset](https://archive.ics.uci.edu/ml/datasets/Mushroom) from UCI machine learning repository. 6 | 7 | ### Tutorial 8 | #### Generate Input Data 9 | XGBoost takes LibSVM format. An example of faked input data is below: 10 | ``` 11 | 1 101:1.2 102:0.03 12 | 0 1:2.1 10001:300 10002:400 13 | ... 14 | ``` 15 | Each line represent a single instance, and in the first line '1' is the instance label,'101' and '102' are feature indices, '1.2' and '0.03' are feature values. In the binary classification case, '1' is used to indicate positive samples, and '0' is used to indicate negative samples. We also support probability values in [0,1] as label, to indicate the probability of the instance being positive. 16 | 17 | 18 | First we will transform the dataset into classic LibSVM format and split the data into training set and test set by running: 19 | ``` 20 | python mapfeat.py 21 | python mknfold.py agaricus.txt 1 22 | ``` 23 | The two files, 'agaricus.txt.train' and 'agaricus.txt.test' will be used as training set and test set. 24 | 25 | #### Training 26 | Then we can run the training process: 27 | ``` 28 | ../../xgboost mushroom.conf 29 | ``` 30 | 31 | mushroom.conf is the configuration for both training and testing. 
Each line contains the [attribute]=[value] configuration: 32 | 33 | ```conf 34 | # General Parameters, see comment for each definition 35 | # can be gbtree or gblinear 36 | booster = gbtree 37 | # choose logistic regression loss function for binary classification 38 | objective = binary:logistic 39 | 40 | # Tree Booster Parameters 41 | # step size shrinkage 42 | eta = 1.0 43 | # minimum loss reduction required to make a further partition 44 | gamma = 1.0 45 | # minimum sum of instance weight(hessian) needed in a child 46 | min_child_weight = 1 47 | # maximum depth of a tree 48 | max_depth = 3 49 | 50 | # Task Parameters 51 | # the number of round to do boosting 52 | num_round = 2 53 | # 0 means do not save any model except the final round model 54 | save_period = 0 55 | # The path of training data 56 | data = "agaricus.txt.train" 57 | # The path of validation data, used to monitor training process, here [test] sets name of the validation set 58 | eval[test] = "agaricus.txt.test" 59 | # The path of test data 60 | test:data = "agaricus.txt.test" 61 | ``` 62 | We use the tree booster and logistic regression objective in our setting. This indicates that we accomplish our task using classic gradient boosted regression trees (GBRT), which is a promising method for binary classification. 63 | 64 | The parameters shown in the example give the most common ones that are needed to use xgboost. 65 | If you are interested in more parameter settings, the complete parameter settings and detailed descriptions are [here](../../doc/parameter.md). Besides putting the parameters in the configuration file, we can set them by passing them as arguments as below: 66 | 67 | ``` 68 | ../../xgboost mushroom.conf max_depth=6 69 | ``` 70 | This means that the parameter max_depth will be set to 6 rather than the 3 given in the conf file. When you use the command line, make sure max_depth=6 is passed in as a single argument, i.e. it must not contain spaces. When a parameter setting is provided in both the command line input and the config file, the command line setting will override the setting in the config file. 71 | 72 | In this example, we use the tree booster for gradient boosting. If you would like to use the linear booster for regression, you can keep all the parameters except booster and the tree booster parameters as below: 73 | ```conf 74 | # General Parameters 75 | # choose the linear booster 76 | booster = gblinear 77 | ... 78 | 79 | # Change Tree Booster Parameters into Linear Booster Parameters 80 | # L2 regularization term on weights, default 0 81 | lambda = 0.01 82 | # L1 regularization term on weights, default 0 83 | alpha = 0.01 84 | # L2 regularization term on bias, default 0 85 | lambda_bias = 0.01 86 | 87 | # Regression Parameters 88 | ... 89 | ``` 90 | Notes on input data handling: if ```agaricus.txt.test.buffer``` exists, xgboost automatically loads from the binary buffer when possible; this can speed up the training process when you train many times. You can disable this by setting ```use_buffer=0```. 91 | - The buffer file can also be used as standalone input, i.e. if the buffer file exists but the original agaricus.txt.test was removed, xgboost will still run. 92 | * Deviation from LibSVM input format: xgboost is compatible with LibSVM format, with the following minor differences: 93 |   - xgboost allows the feature index to start from 0 94 |   - for binary classification, the label is 1 for positive, 0 for negative, instead of +1,-1 95 |   - the feature indices in each line *do not* need to be sorted 96 | 97 | #### Get Predictions 98 | After training, we can use the output model to get predictions for the test data: 99 | ``` 100 | ../../xgboost mushroom.conf task=pred model_in=0002.model 101 | ``` 102 | For binary classification, the output predictions are probability confidence scores in [0,1], corresponding to the probability that the label is positive. 103 | 104 | #### Dump Model 105 | This is a preliminary feature; so far only tree models support text dump. XGBoost can display the tree models in text files and we can scan the model in an easy way: 106 | ``` 107 | ../../xgboost mushroom.conf task=dump model_in=0002.model name_dump=dump.raw.txt 108 | ../../xgboost mushroom.conf task=dump model_in=0002.model fmap=featmap.txt name_dump=dump.nice.txt 109 | ``` 110 | 111 | In this demo, the tree boosters obtained will be printed in dump.raw.txt and dump.nice.txt, and the latter is easier to understand because it uses the feature map featmap.txt 112 | 113 | Format of ```featmap.txt```: ```<featureid> <featurename> <q or i or int>\n```: 114 | - Feature id must be from 0 to number of features, in sorted order. 115 | - i means this feature is a binary indicator feature 116 | - q means this feature is a quantitative value, such as age or time, and can be missing 117 | - int means this feature is an integer value (when int is hinted, the decision boundary will be integer) 118 | 119 | #### Monitoring Progress 120 | When you run training, messages like the following are displayed on screen 121 | ``` 122 | tree train end, 1 roots, 12 extra nodes, 0 pruned nodes ,max_depth=3 123 | [0] test-error:0.016139 124 | boosting round 1, 0 sec elapsed 125 | 126 | tree train end, 1 roots, 10 extra nodes, 0 pruned nodes ,max_depth=3 127 | [1] test-error:0.000000 128 | ``` 129 | The messages for evaluation are printed to stderr, so if you only want to log the evaluation progress, simply type 130 | ``` 131 | ../../xgboost mushroom.conf 2>log.txt 132 | ``` 133 | Then you can find the following content in log.txt 134 | ``` 135 | [0] test-error:0.016139 136 | [1] test-error:0.000000 137 | ``` 138 | We can also monitor both training and test statistics, by adding the following lines to the configuration 139 | ```conf 140 | eval[test] = "agaricus.txt.test" 141 | eval[trainname] = "agaricus.txt.train" 142 | ``` 143 | Run the command again, and we can find the log file becomes 144 | ``` 145 | [0] test-error:0.016139 trainname-error:0.014433 146 | [1] test-error:0.000000 trainname-error:0.001228 147 | ``` 148 | The rule is eval[name-printed-in-log] = filename; the file will then be added to the monitoring process and evaluated each round. 149 | 150 | xgboost also supports monitoring multiple metrics. Suppose we also want to monitor the average log-likelihood of each prediction during training; simply add ```eval_metric=logloss``` to the configuration. Run again, and we can find the log file becomes 151 | ``` 152 | [0] test-error:0.016139 test-negllik:0.029795 trainname-error:0.014433 trainname-negllik:0.027023 153 | [1] test-error:0.000000 test-negllik:0.000000 trainname-error:0.001228 trainname-negllik:0.002457 154 | ``` 155 | #### Saving Progress Models 156 | If you want to save a model every two rounds, simply set save_period=2. You will find 0002.model in the current folder. If you want to change the output folder for models, add model_dir=foldername. By default xgboost saves the model of the last round.
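Because the model format is shared between the CLI and the language bindings, a checkpoint written this way can also be inspected from Python. Below is a minimal sketch, assuming the xgboost Python package is installed and that a ```0002.model``` checkpoint and ```agaricus.txt.test``` are present in the current folder:

```python
import xgboost as xgb

# Load the checkpoint written by the CLI (e.g. 0002.model produced with save_period=2)
bst = xgb.Booster(model_file='0002.model')

# Score the LibSVM-format test file used above; each prediction is P(label == 1)
dtest = xgb.DMatrix('agaricus.txt.test')
preds = bst.predict(dtest)
print(preds[:5])
```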
157 | 158 | #### Continue from Existing Model 159 | If you want to continue boosting from existing model, say 0002.model, use 160 | ``` 161 | ../../xgboost mushroom.conf model_in=0002.model num_round=2 model_out=continue.model 162 | ``` 163 | xgboost will load from 0002.model continue boosting for 2 rounds, and save output to continue.model. However, beware that the training and evaluation data specified in mushroom.conf should not change when you use this function. 164 | #### Use Multi-Threading 165 | When you are working with a large dataset, you may want to take advantage of parallelism. If your compiler supports OpenMP, xgboost is naturally multi-threaded, to set number of parallel running add ```nthread``` parameter to you configuration. 166 | Eg. ```nthread=10``` 167 | 168 | Set nthread to be the number of your real cpu (On Unix, this can be found using ```lscpu```) 169 | Some systems will have ```Thread(s) per core = 2```, for example, a 4 core cpu with 8 threads, in such case set ```nthread=4``` and not 8. 170 | 171 | -------------------------------------------------------------------------------- /XGBoost_demo/binary_classification/agaricus-lepiota.fmap: -------------------------------------------------------------------------------- 1 | 1. cap-shape: bell=b,conical=c,convex=x,flat=f,knobbed=k,sunken=s 2 | 2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s 3 | 3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y 4 | 4. bruises?: bruises=t,no=f 5 | 5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, 6 | musty=m,none=n,pungent=p,spicy=s 7 | 6. gill-attachment: attached=a,descending=d,free=f,notched=n 8 | 7. gill-spacing: close=c,crowded=w,distant=d 9 | 8. gill-size: broad=b,narrow=n 10 | 9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, 11 | green=r,orange=o,pink=p,purple=u,red=e, 12 | white=w,yellow=y 13 | 10. stalk-shape: enlarging=e,tapering=t 14 | 11. stalk-root: bulbous=b,club=c,cup=u,equal=e, 15 | rhizomorphs=z,rooted=r,missing=? 16 | 12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s 17 | 13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s 18 | 14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, 19 | pink=p,red=e,white=w,yellow=y 20 | 15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, 21 | pink=p,red=e,white=w,yellow=y 22 | 16. veil-type: partial=p,universal=u 23 | 17. veil-color: brown=n,orange=o,white=w,yellow=y 24 | 18. ring-number: none=n,one=o,two=t 25 | 19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, 26 | none=n,pendant=p,sheathing=s,zone=z 27 | 20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, 28 | orange=o,purple=u,white=w,yellow=y 29 | 21. population: abundant=a,clustered=c,numerous=n, 30 | scattered=s,several=v,solitary=y 31 | 22. habitat: grasses=g,leaves=l,meadows=m,paths=p, 32 | urban=u,waste=w,woods=d 33 | -------------------------------------------------------------------------------- /XGBoost_demo/binary_classification/agaricus-lepiota.names: -------------------------------------------------------------------------------- 1 | 1. Title: Mushroom Database 2 | 3 | 2. Sources: 4 | (a) Mushroom records drawn from The Audubon Society Field Guide to North 5 | American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred 6 | A. Knopf 7 | (b) Donor: Jeff Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu) 8 | (c) Date: 27 April 1987 9 | 10 | 3. Past Usage: 11 | 1. Schlimmer,J.S. (1987). 
Concept Acquisition Through Representational 12 | Adjustment (Technical Report 87-19). Doctoral disseration, Department 13 | of Information and Computer Science, University of California, Irvine. 14 | --- STAGGER: asymptoted to 95% classification accuracy after reviewing 15 | 1000 instances. 16 | 2. Iba,W., Wogulis,J., & Langley,P. (1988). Trading off Simplicity 17 | and Coverage in Incremental Concept Learning. In Proceedings of 18 | the 5th International Conference on Machine Learning, 73-79. 19 | Ann Arbor, Michigan: Morgan Kaufmann. 20 | -- approximately the same results with their HILLARY algorithm 21 | 3. In the following references a set of rules (given below) were 22 | learned for this data set which may serve as a point of 23 | comparison for other researchers. 24 | 25 | Duch W, Adamczak R, Grabczewski K (1996) Extraction of logical rules 26 | from training data using backpropagation networks, in: Proc. of the 27 | The 1st Online Workshop on Soft Computing, 19-30.Aug.1996, pp. 25-30, 28 | available on-line at: http://www.bioele.nuee.nagoya-u.ac.jp/wsc1/ 29 | 30 | Duch W, Adamczak R, Grabczewski K, Ishikawa M, Ueda H, Extraction of 31 | crisp logical rules using constrained backpropagation networks - 32 | comparison of two new approaches, in: Proc. of the European Symposium 33 | on Artificial Neural Networks (ESANN'97), Bruge, Belgium 16-18.4.1997, 34 | pp. xx-xx 35 | 36 | Wlodzislaw Duch, Department of Computer Methods, Nicholas Copernicus 37 | University, 87-100 Torun, Grudziadzka 5, Poland 38 | e-mail: duch@phys.uni.torun.pl 39 | WWW http://www.phys.uni.torun.pl/kmk/ 40 | 41 | Date: Mon, 17 Feb 1997 13:47:40 +0100 42 | From: Wlodzislaw Duch 43 | Organization: Dept. of Computer Methods, UMK 44 | 45 | I have attached a file containing logical rules for mushrooms. 46 | It should be helpful for other people since only in the last year I 47 | have seen about 10 papers analyzing this dataset and obtaining quite 48 | complex rules. We will try to contribute other results later. 49 | 50 | With best regards, Wlodek Duch 51 | ________________________________________________________________ 52 | 53 | Logical rules for the mushroom data sets. 54 | 55 | Logical rules given below seem to be the simplest possible for the 56 | mushroom dataset and therefore should be treated as benchmark results. 57 | 58 | Disjunctive rules for poisonous mushrooms, from most general 59 | to most specific: 60 | 61 | P_1) odor=NOT(almond.OR.anise.OR.none) 62 | 120 poisonous cases missed, 98.52% accuracy 63 | 64 | P_2) spore-print-color=green 65 | 48 cases missed, 99.41% accuracy 66 | 67 | P_3) odor=none.AND.stalk-surface-below-ring=scaly.AND. 68 | (stalk-color-above-ring=NOT.brown) 69 | 8 cases missed, 99.90% accuracy 70 | 71 | P_4) habitat=leaves.AND.cap-color=white 72 | 100% accuracy 73 | 74 | Rule P_4) may also be 75 | 76 | P_4') population=clustered.AND.cap_color=white 77 | 78 | These rule involve 6 attributes (out of 22). Rules for edible 79 | mushrooms are obtained as negation of the rules given above, for 80 | example the rule: 81 | 82 | odor=(almond.OR.anise.OR.none).AND.spore-print-color=NOT.green 83 | 84 | gives 48 errors, or 99.41% accuracy on the whole dataset. 85 | 86 | Several slightly more complex variations on these rules exist, 87 | involving other attributes, such as gill_size, gill_spacing, 88 | stalk_surface_above_ring, but the rules given above are the simplest 89 | we have found. 90 | 91 | 92 | 4. 
Relevant Information: 93 | This data set includes descriptions of hypothetical samples 94 | corresponding to 23 species of gilled mushrooms in the Agaricus and 95 | Lepiota Family (pp. 500-525). Each species is identified as 96 | definitely edible, definitely poisonous, or of unknown edibility and 97 | not recommended. This latter class was combined with the poisonous 98 | one. The Guide clearly states that there is no simple rule for 99 | determining the edibility of a mushroom; no rule like ``leaflets 100 | three, let it be'' for Poisonous Oak and Ivy. 101 | 102 | 5. Number of Instances: 8124 103 | 104 | 6. Number of Attributes: 22 (all nominally valued) 105 | 106 | 7. Attribute Information: (classes: edible=e, poisonous=p) 107 | 1. cap-shape: bell=b,conical=c,convex=x,flat=f, 108 | knobbed=k,sunken=s 109 | 2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s 110 | 3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, 111 | pink=p,purple=u,red=e,white=w,yellow=y 112 | 4. bruises?: bruises=t,no=f 113 | 5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, 114 | musty=m,none=n,pungent=p,spicy=s 115 | 6. gill-attachment: attached=a,descending=d,free=f,notched=n 116 | 7. gill-spacing: close=c,crowded=w,distant=d 117 | 8. gill-size: broad=b,narrow=n 118 | 9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, 119 | green=r,orange=o,pink=p,purple=u,red=e, 120 | white=w,yellow=y 121 | 10. stalk-shape: enlarging=e,tapering=t 122 | 11. stalk-root: bulbous=b,club=c,cup=u,equal=e, 123 | rhizomorphs=z,rooted=r,missing=? 124 | 12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s 125 | 13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s 126 | 14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, 127 | pink=p,red=e,white=w,yellow=y 128 | 15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, 129 | pink=p,red=e,white=w,yellow=y 130 | 16. veil-type: partial=p,universal=u 131 | 17. veil-color: brown=n,orange=o,white=w,yellow=y 132 | 18. ring-number: none=n,one=o,two=t 133 | 19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, 134 | none=n,pendant=p,sheathing=s,zone=z 135 | 20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, 136 | orange=o,purple=u,white=w,yellow=y 137 | 21. population: abundant=a,clustered=c,numerous=n, 138 | scattered=s,several=v,solitary=y 139 | 22. habitat: grasses=g,leaves=l,meadows=m,paths=p, 140 | urban=u,waste=w,woods=d 141 | 142 | 8. Missing Attribute Values: 2480 of them (denoted by "?"), all for 143 | attribute #11. 144 | 145 | 9. 
Class Distribution: 146 | -- edible: 4208 (51.8%) 147 | -- poisonous: 3916 (48.2%) 148 | -- total: 8124 instances 149 | -------------------------------------------------------------------------------- /XGBoost_demo/binary_classification/mapfeat.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | def loadfmap( fname ): 4 | fmap = {} 5 | nmap = {} 6 | 7 | for l in open( fname ): 8 | arr = l.split() 9 | if arr[0].find('.') != -1: 10 | idx = int( arr[0].strip('.') ) 11 | assert idx not in fmap 12 | fmap[ idx ] = {} 13 | ftype = arr[1].strip(':') 14 | content = arr[2] 15 | else: 16 | content = arr[0] 17 | for it in content.split(','): 18 | if it.strip() == '': 19 | continue 20 | k , v = it.split('=') 21 | fmap[ idx ][ v ] = len(nmap) + 1 22 | nmap[ len(nmap) ] = ftype+'='+k 23 | return fmap, nmap 24 | 25 | def write_nmap( fo, nmap ): 26 | for i in range( len(nmap) ): 27 | fo.write('%d\t%s\ti\n' % (i, nmap[i]) ) 28 | 29 | # start here 30 | fmap, nmap = loadfmap( 'agaricus-lepiota.fmap' ) 31 | fo = open( 'featmap.txt', 'w' ) 32 | write_nmap( fo, nmap ) 33 | fo.close() 34 | 35 | fo = open( 'agaricus.txt', 'w' ) 36 | for l in open( 'agaricus-lepiota.data' ): 37 | arr = l.split(',') 38 | if arr[0] == 'p': 39 | fo.write('1') 40 | else: 41 | assert arr[0] == 'e' 42 | fo.write('0') 43 | for i in range( 1,len(arr) ): 44 | fo.write( ' %d:1' % fmap[i][arr[i].strip()] ) 45 | fo.write('\n') 46 | 47 | fo.close() 48 | -------------------------------------------------------------------------------- /XGBoost_demo/binary_classification/mknfold.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import sys 3 | import random 4 | 5 | if len(sys.argv) < 2: 6 | print ('Usage: [nfold = 5]') 7 | exit(0) 8 | 9 | random.seed( 10 ) 10 | 11 | k = int( sys.argv[2] ) 12 | if len(sys.argv) > 3: 13 | nfold = int( sys.argv[3] ) 14 | else: 15 | nfold = 5 16 | 17 | fi = open( sys.argv[1], 'r' ) 18 | ftr = open( sys.argv[1]+'.train', 'w' ) 19 | fte = open( sys.argv[1]+'.test', 'w' ) 20 | for l in fi: 21 | if random.randint( 1 , nfold ) == k: 22 | fte.write( l ) 23 | else: 24 | ftr.write( l ) 25 | 26 | fi.close() 27 | ftr.close() 28 | fte.close() 29 | 30 | -------------------------------------------------------------------------------- /XGBoost_demo/binary_classification/mushroom.conf: -------------------------------------------------------------------------------- 1 | # General Parameters, see comment for each definition 2 | # choose the booster, can be gbtree or gblinear 3 | booster = gbtree 4 | # choose logistic regression loss function for binary classification 5 | objective = binary:logistic 6 | 7 | # Tree Booster Parameters 8 | # step size shrinkage 9 | eta = 1.0 10 | # minimum loss reduction required to make a further partition 11 | gamma = 1.0 12 | # minimum sum of instance weight(hessian) needed in a child 13 | min_child_weight = 1 14 | # maximum depth of a tree 15 | max_depth = 3 16 | 17 | # Task Parameters 18 | # the number of round to do boosting 19 | num_round = 2 20 | # 0 means do not save any model except the final round model 21 | save_period = 0 22 | # The path of training data 23 | data = "agaricus.txt.train" 24 | # The path of validation data, used to monitor training process, here [test] sets name of the validation set 25 | eval[test] = "agaricus.txt.test" 26 | # evaluate on training data as well each round 27 | eval_train = 1 28 | # The path of test data 29 | test:data = "agaricus.txt.test" 
30 | -------------------------------------------------------------------------------- /XGBoost_demo/binary_classification/runexp.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # map feature using indicator encoding, also produce featmap.txt 3 | python mapfeat.py 4 | # split train and test 5 | python mknfold.py agaricus.txt 1 6 | # training and output the models 7 | ../../xgboost mushroom.conf 8 | # output prediction task=pred 9 | ../../xgboost mushroom.conf task=pred model_in=0002.model 10 | # print the boosters of 00002.model in dump.raw.txt 11 | ../../xgboost mushroom.conf task=dump model_in=0002.model name_dump=dump.raw.txt 12 | # use the feature map in printing for better visualization 13 | ../../xgboost mushroom.conf task=dump model_in=0002.model fmap=featmap.txt name_dump=dump.nice.txt 14 | cat dump.nice.txt 15 | 16 | -------------------------------------------------------------------------------- /XGBoost_demo/data/README.md: -------------------------------------------------------------------------------- 1 | This folder contains processed example dataset used by the demos. 2 | Copyright of the dataset belongs to the original copyright holder 3 | -------------------------------------------------------------------------------- /XGBoost_demo/data/featmap.txt: -------------------------------------------------------------------------------- 1 | 0 cap-shape=bell i 2 | 1 cap-shape=conical i 3 | 2 cap-shape=convex i 4 | 3 cap-shape=flat i 5 | 4 cap-shape=knobbed i 6 | 5 cap-shape=sunken i 7 | 6 cap-surface=fibrous i 8 | 7 cap-surface=grooves i 9 | 8 cap-surface=scaly i 10 | 9 cap-surface=smooth i 11 | 10 cap-color=brown i 12 | 11 cap-color=buff i 13 | 12 cap-color=cinnamon i 14 | 13 cap-color=gray i 15 | 14 cap-color=green i 16 | 15 cap-color=pink i 17 | 16 cap-color=purple i 18 | 17 cap-color=red i 19 | 18 cap-color=white i 20 | 19 cap-color=yellow i 21 | 20 bruises?=bruises i 22 | 21 bruises?=no i 23 | 22 odor=almond i 24 | 23 odor=anise i 25 | 24 odor=creosote i 26 | 25 odor=fishy i 27 | 26 odor=foul i 28 | 27 odor=musty i 29 | 28 odor=none i 30 | 29 odor=pungent i 31 | 30 odor=spicy i 32 | 31 gill-attachment=attached i 33 | 32 gill-attachment=descending i 34 | 33 gill-attachment=free i 35 | 34 gill-attachment=notched i 36 | 35 gill-spacing=close i 37 | 36 gill-spacing=crowded i 38 | 37 gill-spacing=distant i 39 | 38 gill-size=broad i 40 | 39 gill-size=narrow i 41 | 40 gill-color=black i 42 | 41 gill-color=brown i 43 | 42 gill-color=buff i 44 | 43 gill-color=chocolate i 45 | 44 gill-color=gray i 46 | 45 gill-color=green i 47 | 46 gill-color=orange i 48 | 47 gill-color=pink i 49 | 48 gill-color=purple i 50 | 49 gill-color=red i 51 | 50 gill-color=white i 52 | 51 gill-color=yellow i 53 | 52 stalk-shape=enlarging i 54 | 53 stalk-shape=tapering i 55 | 54 stalk-root=bulbous i 56 | 55 stalk-root=club i 57 | 56 stalk-root=cup i 58 | 57 stalk-root=equal i 59 | 58 stalk-root=rhizomorphs i 60 | 59 stalk-root=rooted i 61 | 60 stalk-root=missing i 62 | 61 stalk-surface-above-ring=fibrous i 63 | 62 stalk-surface-above-ring=scaly i 64 | 63 stalk-surface-above-ring=silky i 65 | 64 stalk-surface-above-ring=smooth i 66 | 65 stalk-surface-below-ring=fibrous i 67 | 66 stalk-surface-below-ring=scaly i 68 | 67 stalk-surface-below-ring=silky i 69 | 68 stalk-surface-below-ring=smooth i 70 | 69 stalk-color-above-ring=brown i 71 | 70 stalk-color-above-ring=buff i 72 | 71 stalk-color-above-ring=cinnamon i 73 | 72 stalk-color-above-ring=gray i 74 | 73 
stalk-color-above-ring=orange i 75 | 74 stalk-color-above-ring=pink i 76 | 75 stalk-color-above-ring=red i 77 | 76 stalk-color-above-ring=white i 78 | 77 stalk-color-above-ring=yellow i 79 | 78 stalk-color-below-ring=brown i 80 | 79 stalk-color-below-ring=buff i 81 | 80 stalk-color-below-ring=cinnamon i 82 | 81 stalk-color-below-ring=gray i 83 | 82 stalk-color-below-ring=orange i 84 | 83 stalk-color-below-ring=pink i 85 | 84 stalk-color-below-ring=red i 86 | 85 stalk-color-below-ring=white i 87 | 86 stalk-color-below-ring=yellow i 88 | 87 veil-type=partial i 89 | 88 veil-type=universal i 90 | 89 veil-color=brown i 91 | 90 veil-color=orange i 92 | 91 veil-color=white i 93 | 92 veil-color=yellow i 94 | 93 ring-number=none i 95 | 94 ring-number=one i 96 | 95 ring-number=two i 97 | 96 ring-type=cobwebby i 98 | 97 ring-type=evanescent i 99 | 98 ring-type=flaring i 100 | 99 ring-type=large i 101 | 100 ring-type=none i 102 | 101 ring-type=pendant i 103 | 102 ring-type=sheathing i 104 | 103 ring-type=zone i 105 | 104 spore-print-color=black i 106 | 105 spore-print-color=brown i 107 | 106 spore-print-color=buff i 108 | 107 spore-print-color=chocolate i 109 | 108 spore-print-color=green i 110 | 109 spore-print-color=orange i 111 | 110 spore-print-color=purple i 112 | 111 spore-print-color=white i 113 | 112 spore-print-color=yellow i 114 | 113 population=abundant i 115 | 114 population=clustered i 116 | 115 population=numerous i 117 | 116 population=scattered i 118 | 117 population=several i 119 | 118 population=solitary i 120 | 119 habitat=grasses i 121 | 120 habitat=leaves i 122 | 121 habitat=meadows i 123 | 122 habitat=paths i 124 | 123 habitat=urban i 125 | 124 habitat=waste i 126 | 125 habitat=woods i 127 | -------------------------------------------------------------------------------- /XGBoost_demo/data/gen_autoclaims.R: -------------------------------------------------------------------------------- 1 | site <- 'http://cran.r-project.org' 2 | if (!require('dummies')) 3 | install.packages('dummies', repos=site) 4 | if (!require('insuranceData')) 5 | install.packages('insuranceData', repos=site) 6 | 7 | library(dummies) 8 | library(insuranceData) 9 | 10 | data(AutoClaims) 11 | data = AutoClaims 12 | 13 | data$STATE = as.factor(data$STATE) 14 | data$CLASS = as.factor(data$CLASS) 15 | data$GENDER = as.factor(data$GENDER) 16 | 17 | data.dummy <- dummy.data.frame(data, dummy.class='factor', omit.constants=T); 18 | write.table(data.dummy, 'autoclaims.csv', sep=',', row.names=F, col.names=F, quote=F) 19 | -------------------------------------------------------------------------------- /XGBoost_demo/distributed-training/README.md: -------------------------------------------------------------------------------- 1 | Distributed XGBoost Training 2 | ============================ 3 | This is an tutorial of Distributed XGBoost Training. 4 | Currently xgboost supports distributed training via CLI program with the configuration file. 5 | There is also plan push distributed python and other language bindings, please open an issue 6 | if you are interested in contributing. 7 | 8 | Build XGBoost with Distributed Filesystem Support 9 | ------------------------------------------------- 10 | To use distributed xgboost, you only need to turn the options on to build 11 | with distributed filesystems(HDFS or S3) in ```xgboost/make/config.mk```. 
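As a rough sketch of what that involves (the option names below are assumptions and may differ between releases, so verify them against the comments in your own copy of ```make/config.mk```), the relevant switches look roughly like:

```
# xgboost/make/config.mk -- illustrative only, check the comments in your checkout
# whether to build with HDFS support
USE_HDFS = 1
# whether to build with AWS S3 support
USE_S3 = 1
```

After changing these options, rebuild xgboost so the CLI program is linked against the corresponding filesystem libraries.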
12 | 13 | 14 | Step by Step Tutorial on AWS 15 | ---------------------------- 16 | Checkout [this tutorial](https://xgboost.readthedocs.org/en/latest/tutorials/aws_yarn.html) for running distributed xgboost. 17 | 18 | 19 | Model Analysis 20 | -------------- 21 | XGBoost is exchangeable across all bindings and platforms. 22 | This means you can use python or R to analyze the learnt model and do prediction. 23 | For example, you can use the [plot_model.ipynb](plot_model.ipynb) to visualize the learnt model. 24 | -------------------------------------------------------------------------------- /XGBoost_demo/distributed-training/mushroom.aws.conf: -------------------------------------------------------------------------------- 1 | # General Parameters, see comment for each definition 2 | # choose the booster, can be gbtree or gblinear 3 | booster = gbtree 4 | # choose logistic regression loss function for binary classification 5 | objective = binary:logistic 6 | 7 | # Tree Booster Parameters 8 | # step size shrinkage 9 | eta = 1.0 10 | # minimum loss reduction required to make a further partition 11 | gamma = 1.0 12 | # minimum sum of instance weight(hessian) needed in a child 13 | min_child_weight = 1 14 | # maximum depth of a tree 15 | max_depth = 3 16 | 17 | # Task Parameters 18 | # the number of round to do boosting 19 | num_round = 2 20 | # 0 means do not save any model except the final round model 21 | save_period = 0 22 | # The path of training data 23 | data = "s3://mybucket/xgb-demo/train" 24 | # The path of validation data, used to monitor training process, here [test] sets name of the validation set 25 | # evaluate on training data as well each round 26 | eval_train = 1 27 | 28 | -------------------------------------------------------------------------------- /XGBoost_demo/distributed-training/plot_model.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# XGBoost Model Analysis\n", 8 | "\n", 9 | "This notebook can be used to load and analysis model learnt from all xgboost bindings, including distributed training. 
" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": { 16 | "collapsed": false 17 | }, 18 | "outputs": [], 19 | "source": [ 20 | "import sys\n", 21 | "import os\n", 22 | "%matplotlib inline " 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "## Please change the ```pkg_path``` and ```model_file``` to be correct path" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": { 36 | "collapsed": false 37 | }, 38 | "outputs": [], 39 | "source": [ 40 | "pkg_path = '../../python-package/'\n", 41 | "model_file = 's3://my-bucket/xgb-demo/model/0002.model'\n", 42 | "sys.path.insert(0, pkg_path)\n", 43 | "import xgboost as xgb" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "# Plot the Feature Importance" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": { 57 | "collapsed": false 58 | }, 59 | "outputs": [], 60 | "source": [ 61 | "# plot the first two trees.\n", 62 | "bst = xgb.Booster(model_file=model_file)\n", 63 | "xgb.plot_importance(bst)" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "# Plot the First Tree" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": null, 76 | "metadata": { 77 | "collapsed": false 78 | }, 79 | "outputs": [], 80 | "source": [ 81 | "tree_id = 0\n", 82 | "xgb.to_graphviz(bst, tree_id)" 83 | ] 84 | } 85 | ], 86 | "metadata": { 87 | "kernelspec": { 88 | "display_name": "Python 2", 89 | "language": "python", 90 | "name": "python2" 91 | }, 92 | "language_info": { 93 | "codemirror_mode": { 94 | "name": "ipython", 95 | "version": 2 96 | }, 97 | "file_extension": ".py", 98 | "mimetype": "text/x-python", 99 | "name": "python", 100 | "nbconvert_exporter": "python", 101 | "pygments_lexer": "ipython2", 102 | "version": "2.7.3" 103 | } 104 | }, 105 | "nbformat": 4, 106 | "nbformat_minor": 0 107 | } 108 | -------------------------------------------------------------------------------- /XGBoost_demo/distributed-training/run_aws.sh: -------------------------------------------------------------------------------- 1 | # This is the example script to run distributed xgboost on AWS. 2 | # Change the following two lines for configuration 3 | 4 | export BUCKET=mybucket 5 | 6 | # submit the job to YARN 7 | ../../dmlc-core/tracker/dmlc-submit --cluster=yarn --num-workers=2 --worker-cores=2\ 8 | ../../xgboost mushroom.aws.conf nthread=2\ 9 | data=s3://${BUCKET}/xgb-demo/train\ 10 | eval[test]=s3://${BUCKET}/xgb-demo/test\ 11 | model_dir=s3://${BUCKET}/xgb-demo/model 12 | -------------------------------------------------------------------------------- /XGBoost_demo/gpu_acceleration/README.md: -------------------------------------------------------------------------------- 1 | # GPU Acceleration Demo 2 | 3 | This demo shows how to train a model on the [forest cover type](https://archive.ics.uci.edu/ml/datasets/covertype) dataset using GPU acceleration. The forest cover type dataset has 581,012 rows and 54 features, making it time consuming to process. We compare the run-time and accuracy of the GPU and CPU histogram algorithms. 4 | 5 | This demo requires the [GPU plug-in](https://github.com/dmlc/xgboost/tree/master/plugin/updater_gpu) to be built and installed. 6 | 7 | The dataset is automatically loaded via the sklearn script. 
8 | 9 | 10 | -------------------------------------------------------------------------------- /XGBoost_demo/gpu_acceleration/cover_type.py: -------------------------------------------------------------------------------- 1 | import xgboost as xgb 2 | import numpy as np 3 | from sklearn.datasets import fetch_covtype 4 | from sklearn.model_selection import train_test_split 5 | import time 6 | 7 | # Fetch dataset using sklearn 8 | cov = fetch_covtype() 9 | X = cov.data 10 | y = cov.target 11 | 12 | # Create 0.75/0.25 train/test split 13 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, train_size=0.75, 14 | random_state=42) 15 | 16 | # Specify sufficient boosting iterations to reach a minimum 17 | num_round = 3000 18 | 19 | # Leave most parameters as default 20 | param = {'objective': 'multi:softmax', # Specify multiclass classification 21 | 'num_class': 8, # Number of possible output classes 22 | 'tree_method': 'gpu_hist' # Use GPU accelerated algorithm 23 | } 24 | 25 | # Convert input data from numpy to XGBoost format 26 | dtrain = xgb.DMatrix(X_train, label=y_train) 27 | dtest = xgb.DMatrix(X_test, label=y_test) 28 | 29 | gpu_res = {} # Store accuracy result 30 | tmp = time.time() 31 | # Train model 32 | xgb.train(param, dtrain, num_round, evals=[(dtest, 'test')], evals_result=gpu_res) 33 | print("GPU Training Time: %s seconds" % (str(time.time() - tmp))) 34 | 35 | # Repeat for CPU algorithm 36 | tmp = time.time() 37 | param['tree_method'] = 'hist' 38 | cpu_res = {} 39 | xgb.train(param, dtrain, num_round, evals=[(dtest, 'test')], evals_result=cpu_res) 40 | print("CPU Training Time: %s seconds" % (str(time.time() - tmp))) 41 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/README.md: -------------------------------------------------------------------------------- 1 | XGBoost Python Feature Walkthrough 2 | ================================== 3 | * [Basic walkthrough of wrappers](basic_walkthrough.py) 4 | * [Cutomize loss function, and evaluation metric](custom_objective.py) 5 | * [Boosting from existing prediction](boost_from_prediction.py) 6 | * [Predicting using first n trees](predict_first_ntree.py) 7 | * [Generalized Linear Model](generalized_linear_model.py) 8 | * [Cross validation](cross_validation.py) 9 | * [Predicting leaf indices](predict_leaf_indices.py) 10 | * [Sklearn Wrapper](sklearn_examples.py) 11 | * [Sklearn Parallel](sklearn_parallel.py) 12 | * [Sklearn access evals result](sklearn_evals_result.py) 13 | * [Access evals result](evals_result.py) 14 | * [External Memory](external_memory.py) 15 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/basic_walkthrough.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import numpy as np 3 | import scipy.sparse 4 | import pickle 5 | import xgboost as xgb 6 | 7 | ### simple example 8 | # load file from text file, also binary buffer generated by xgboost 9 | dtrain = xgb.DMatrix('../data/agaricus.txt.train') 10 | dtest = xgb.DMatrix('../data/agaricus.txt.test') 11 | 12 | # specify parameters via map, definition are same as c++ version 13 | param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic'} 14 | 15 | # specify validations set to watch performance 16 | watchlist = [(dtest, 'eval'), (dtrain, 'train')] 17 | num_round = 2 18 | bst = xgb.train(param, dtrain, num_round, watchlist) 19 | 20 | # this is prediction 21 | 
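# note: with objective 'binary:logistic' the predictions below are probabilities in [0, 1],
# so the error rate is computed by thresholding them at 0.5 and comparing against the labels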
preds = bst.predict(dtest) 22 | labels = dtest.get_label() 23 | print('error=%f' % (sum(1 for i in range(len(preds)) if int(preds[i] > 0.5) != labels[i]) / float(len(preds)))) 24 | bst.save_model('0001.model') 25 | # dump model 26 | bst.dump_model('dump.raw.txt') 27 | # dump model with feature map 28 | bst.dump_model('dump.nice.txt', '../data/featmap.txt') 29 | 30 | # save dmatrix into binary buffer 31 | dtest.save_binary('dtest.buffer') 32 | # save model 33 | bst.save_model('xgb.model') 34 | # load model and data in 35 | bst2 = xgb.Booster(model_file='xgb.model') 36 | dtest2 = xgb.DMatrix('dtest.buffer') 37 | preds2 = bst2.predict(dtest2) 38 | # assert they are the same 39 | assert np.sum(np.abs(preds2 - preds)) == 0 40 | 41 | # alternatively, you can pickle the booster 42 | pks = pickle.dumps(bst2) 43 | # load model and data in 44 | bst3 = pickle.loads(pks) 45 | preds3 = bst3.predict(dtest2) 46 | # assert they are the same 47 | assert np.sum(np.abs(preds3 - preds)) == 0 48 | 49 | ### 50 | # build dmatrix from scipy.sparse 51 | print('start running example of build DMatrix from scipy.sparse CSR Matrix') 52 | labels = [] 53 | row = []; col = []; dat = [] 54 | i = 0 55 | for l in open('../data/agaricus.txt.train'): 56 | arr = l.split() 57 | labels.append(int(arr[0])) 58 | for it in arr[1:]: 59 | k,v = it.split(':') 60 | row.append(i); col.append(int(k)); dat.append(float(v)) 61 | i += 1 62 | csr = scipy.sparse.csr_matrix((dat, (row, col))) 63 | dtrain = xgb.DMatrix(csr, label=labels) 64 | watchlist = [(dtest, 'eval'), (dtrain, 'train')] 65 | bst = xgb.train(param, dtrain, num_round, watchlist) 66 | 67 | print('start running example of build DMatrix from scipy.sparse CSC Matrix') 68 | # we can also construct from csc matrix 69 | csc = scipy.sparse.csc_matrix((dat, (row, col))) 70 | dtrain = xgb.DMatrix(csc, label=labels) 71 | watchlist = [(dtest, 'eval'), (dtrain, 'train')] 72 | bst = xgb.train(param, dtrain, num_round, watchlist) 73 | 74 | print('start running example of build DMatrix from numpy array') 75 | # NOTE: npymat is numpy array, we will convert it into scipy.sparse.csr_matrix in internal implementation 76 | # then convert to DMatrix 77 | npymat = csr.todense() 78 | dtrain = xgb.DMatrix(npymat, label=labels) 79 | watchlist = [(dtest, 'eval'), (dtrain, 'train')] 80 | bst = xgb.train(param, dtrain, num_round, watchlist) 81 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/boost_from_prediction.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import numpy as np 3 | import xgboost as xgb 4 | 5 | dtrain = xgb.DMatrix('../data/agaricus.txt.train') 6 | dtest = xgb.DMatrix('../data/agaricus.txt.test') 7 | watchlist = [(dtest, 'eval'), (dtrain, 'train')] 8 | ### 9 | # advanced: start from a initial base prediction 10 | # 11 | print ('start running example to start from a initial prediction') 12 | # specify parameters via map, definition are same as c++ version 13 | param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic'} 14 | # train xgboost for 1 round 15 | bst = xgb.train(param, dtrain, 1, watchlist) 16 | # Note: we need the margin value instead of transformed prediction in set_base_margin 17 | # do predict with output_margin=True, will always give you margin values before logistic transformation 18 | ptrain = bst.predict(dtrain, output_margin=True) 19 | ptest = bst.predict(dtest, output_margin=True) 20 | dtrain.set_base_margin(ptrain) 21 | 
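# the evaluation matrix gets its own base margin too, so that the metrics reported by the
# watchlist are also computed on top of the round-1 scores rather than starting from zero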
dtest.set_base_margin(ptest) 22 | 23 | 24 | print('this is result of running from initial prediction') 25 | bst = xgb.train(param, dtrain, 1, watchlist) -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/cross_validation.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import numpy as np 3 | import xgboost as xgb 4 | 5 | ### load data in do training 6 | dtrain = xgb.DMatrix('../data/agaricus.txt.train') 7 | param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic'} 8 | num_round = 2 9 | 10 | print('running cross validation') 11 | # do cross validation, this will print result out as 12 | # [iteration] metric_name:mean_value+std_value 13 | # std_value is standard deviation of the metric 14 | xgb.cv(param, dtrain, num_round, nfold=5, 15 | metrics={'error'}, seed=0, 16 | callbacks=[xgb.callback.print_evaluation(show_stdv=True)]) 17 | 18 | print('running cross validation, disable standard deviation display') 19 | # do cross validation, this will print result out as 20 | # [iteration] metric_name:mean_value 21 | res = xgb.cv(param, dtrain, num_boost_round=10, nfold=5, 22 | metrics={'error'}, seed=0, 23 | callbacks=[xgb.callback.print_evaluation(show_stdv=False), 24 | xgb.callback.early_stop(3)]) 25 | print(res) 26 | print('running cross validation, with preprocessing function') 27 | # define the preprocessing function 28 | # used to return the preprocessed training, test data, and parameter 29 | # we can use this to do weight rescale, etc. 30 | # as a example, we try to set scale_pos_weight 31 | def fpreproc(dtrain, dtest, param): 32 | label = dtrain.get_label() 33 | ratio = float(np.sum(label == 0)) / np.sum(label == 1) 34 | param['scale_pos_weight'] = ratio 35 | return (dtrain, dtest, param) 36 | 37 | # do cross validation, for each fold 38 | # the dtrain, dtest, param will be passed into fpreproc 39 | # then the return value of fpreproc will be used to generate 40 | # results of that fold 41 | xgb.cv(param, dtrain, num_round, nfold=5, 42 | metrics={'auc'}, seed=0, fpreproc=fpreproc) 43 | 44 | ### 45 | # you can also do cross validation with cutomized loss function 46 | # See custom_objective.py 47 | ## 48 | print('running cross validation, with cutomsized loss function') 49 | def logregobj(preds, dtrain): 50 | labels = dtrain.get_label() 51 | preds = 1.0 / (1.0 + np.exp(-preds)) 52 | grad = preds - labels 53 | hess = preds * (1.0 - preds) 54 | return grad, hess 55 | def evalerror(preds, dtrain): 56 | labels = dtrain.get_label() 57 | return 'error', float(sum(labels != (preds > 0.0))) / len(labels) 58 | 59 | param = {'max_depth':2, 'eta':1, 'silent':1} 60 | # train with customized objective 61 | xgb.cv(param, dtrain, num_round, nfold=5, seed=0, 62 | obj=logregobj, feval=evalerror) 63 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/custom_objective.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import numpy as np 3 | import xgboost as xgb 4 | ### 5 | # advanced: customized loss function 6 | # 7 | print('start running example to used customized objective function') 8 | 9 | dtrain = xgb.DMatrix('../data/agaricus.txt.train') 10 | dtest = xgb.DMatrix('../data/agaricus.txt.test') 11 | 12 | # note: for customized objective function, we leave objective as default 13 | # note: what we are getting is margin value in prediction 14 | # you must 
know what you are doing 15 | param = {'max_depth': 2, 'eta': 1, 'silent': 1} 16 | watchlist = [(dtest, 'eval'), (dtrain, 'train')] 17 | num_round = 2 18 | 19 | # user define objective function, given prediction, return gradient and second order gradient 20 | # this is log likelihood loss 21 | def logregobj(preds, dtrain): 22 | labels = dtrain.get_label() 23 | preds = 1.0 / (1.0 + np.exp(-preds)) 24 | grad = preds - labels 25 | hess = preds * (1.0 - preds) 26 | return grad, hess 27 | 28 | # user defined evaluation function, return a pair metric_name, result 29 | # NOTE: when you do customized loss function, the default prediction value is margin 30 | # this may make builtin evaluation metric not function properly 31 | # for example, we are doing logistic loss, the prediction is score before logistic transformation 32 | # the builtin evaluation error assumes input is after logistic transformation 33 | # Take this in mind when you use the customization, and maybe you need write customized evaluation function 34 | def evalerror(preds, dtrain): 35 | labels = dtrain.get_label() 36 | # return a pair metric_name, result 37 | # since preds are margin(before logistic transformation, cutoff at 0) 38 | return 'error', float(sum(labels != (preds > 0.0))) / len(labels) 39 | 40 | # training with customized objective, we can also do step by step training 41 | # simply look at xgboost.py's implementation of train 42 | bst = xgb.train(param, dtrain, num_round, watchlist, logregobj, evalerror) 43 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/evals_result.py: -------------------------------------------------------------------------------- 1 | ## 2 | # This script demonstrate how to access the eval metrics in xgboost 3 | ## 4 | 5 | import xgboost as xgb 6 | dtrain = xgb.DMatrix('../data/agaricus.txt.train', silent=True) 7 | dtest = xgb.DMatrix('../data/agaricus.txt.test', silent=True) 8 | 9 | param = [('max_depth', 2), ('objective', 'binary:logistic'), ('eval_metric', 'logloss'), ('eval_metric', 'error')] 10 | 11 | num_round = 2 12 | watchlist = [(dtest,'eval'), (dtrain,'train')] 13 | 14 | evals_result = {} 15 | bst = xgb.train(param, dtrain, num_round, watchlist, evals_result=evals_result) 16 | 17 | print('Access logloss metric directly from evals_result:') 18 | print(evals_result['eval']['logloss']) 19 | 20 | print('') 21 | print('Access metrics through a loop:') 22 | for e_name, e_mtrs in evals_result.items(): 23 | print('- {}'.format(e_name)) 24 | for e_mtr_name, e_mtr_vals in e_mtrs.items(): 25 | print(' - {}'.format(e_mtr_name)) 26 | print(' - {}'.format(e_mtr_vals)) 27 | 28 | print('') 29 | print('Access complete dictionary:') 30 | print(evals_result) 31 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/external_memory.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import numpy as np 3 | import scipy.sparse 4 | import xgboost as xgb 5 | 6 | ### simple example for using external memory version 7 | 8 | # this is the only difference, add a # followed by a cache prefix name 9 | # several cache file with the prefix will be generated 10 | # currently only support convert from libsvm file 11 | dtrain = xgb.DMatrix('../data/agaricus.txt.train#dtrain.cache') 12 | dtest = xgb.DMatrix('../data/agaricus.txt.test#dtest.cache') 13 | 14 | # specify validations set to watch performance 15 | param = {'max_depth':2, 'eta':1, 'silent':1, 
'objective':'binary:logistic'} 16 | 17 | # performance notice: set nthread to be the number of your real cpu 18 | # some cpu offer two threads per core, for example, a 4 core cpu with 8 threads, in such case set nthread=4 19 | #param['nthread']=num_real_cpu 20 | 21 | watchlist = [(dtest, 'eval'), (dtrain, 'train')] 22 | num_round = 2 23 | bst = xgb.train(param, dtrain, num_round, watchlist) 24 | 25 | 26 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/gamma_regression.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import xgboost as xgb 3 | import numpy as np 4 | 5 | # this script demonstrates how to fit gamma regression model (with log link function) 6 | # in xgboost, before running the demo you need to generate the autoclaims dataset 7 | # by running gen_autoclaims.R located in xgboost/demo/data. 8 | 9 | data = np.genfromtxt('../data/autoclaims.csv', delimiter=',') 10 | dtrain = xgb.DMatrix(data[0:4741, 0:34], data[0:4741, 34]) 11 | dtest = xgb.DMatrix(data[4741:6773, 0:34], data[4741:6773, 34]) 12 | 13 | # for gamma regression, we need to set the objective to 'reg:gamma', it also suggests 14 | # to set the base_score to a value between 1 to 5 if the number of iteration is small 15 | param = {'silent':1, 'objective':'reg:gamma', 'booster':'gbtree', 'base_score':3} 16 | 17 | # the rest of settings are the same 18 | watchlist = [(dtest, 'eval'), (dtrain, 'train')] 19 | num_round = 30 20 | 21 | # training and evaluation 22 | bst = xgb.train(param, dtrain, num_round, watchlist) 23 | preds = bst.predict(dtest) 24 | labels = dtest.get_label() 25 | print('test deviance=%f' % (2 * np.sum((labels - preds) / preds - np.log(labels) + np.log(preds)))) 26 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/generalized_linear_model.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import xgboost as xgb 3 | ## 4 | # this script demonstrate how to fit generalized linear model in xgboost 5 | # basically, we are using linear model, instead of tree for our boosters 6 | ## 7 | dtrain = xgb.DMatrix('../data/agaricus.txt.train') 8 | dtest = xgb.DMatrix('../data/agaricus.txt.test') 9 | # change booster to gblinear, so that we are fitting a linear model 10 | # alpha is the L1 regularizer 11 | # lambda is the L2 regularizer 12 | # you can also set lambda_bias which is L2 regularizer on the bias term 13 | param = {'silent':1, 'objective':'binary:logistic', 'booster':'gblinear', 14 | 'alpha': 0.0001, 'lambda': 1} 15 | 16 | # normally, you do not need to set eta (step_size) 17 | # XGBoost uses a parallel coordinate descent algorithm (shotgun), 18 | # there could be affection on convergence with parallelization on certain cases 19 | # setting eta to be smaller value, e.g 0.5 can make the optimization more stable 20 | # param['eta'] = 1 21 | 22 | ## 23 | # the rest of settings are the same 24 | ## 25 | watchlist = [(dtest, 'eval'), (dtrain, 'train')] 26 | num_round = 4 27 | bst = xgb.train(param, dtrain, num_round, watchlist) 28 | preds = bst.predict(dtest) 29 | labels = dtest.get_label() 30 | print('error=%f' % (sum(1 for i in range(len(preds)) if int(preds[i] > 0.5) != labels[i]) / float(len(preds)))) 31 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/predict_first_ntree.py: 
-------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import numpy as np 3 | import xgboost as xgb 4 | 5 | ### load data in do training 6 | dtrain = xgb.DMatrix('../data/agaricus.txt.train') 7 | dtest = xgb.DMatrix('../data/agaricus.txt.test') 8 | param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic'} 9 | watchlist = [(dtest, 'eval'), (dtrain, 'train')] 10 | num_round = 3 11 | bst = xgb.train(param, dtrain, num_round, watchlist) 12 | 13 | print('start testing prediction from first n trees') 14 | ### predict using first 1 tree 15 | label = dtest.get_label() 16 | ypred1 = bst.predict(dtest, ntree_limit=1) 17 | # by default, we predict using all the trees 18 | ypred2 = bst.predict(dtest) 19 | print('error of ypred1=%f' % (np.sum((ypred1 > 0.5) != label) / float(len(label)))) 20 | print('error of ypred2=%f' % (np.sum((ypred2 > 0.5) != label) / float(len(label)))) 21 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/predict_leaf_indices.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import xgboost as xgb 3 | 4 | ### load data in do training 5 | dtrain = xgb.DMatrix('../data/agaricus.txt.train') 6 | dtest = xgb.DMatrix('../data/agaricus.txt.test') 7 | param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic'} 8 | watchlist = [(dtest, 'eval'), (dtrain, 'train')] 9 | num_round = 3 10 | bst = xgb.train(param, dtrain, num_round, watchlist) 11 | 12 | print ('start testing predict the leaf indices') 13 | ### predict using first 2 tree 14 | leafindex = bst.predict(dtest, ntree_limit=2, pred_leaf=True) 15 | print(leafindex.shape) 16 | print(leafindex) 17 | ### predict all trees 18 | leafindex = bst.predict(dtest, pred_leaf=True) 19 | print(leafindex.shape) 20 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/runall.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | export PYTHONPATH=PYTHONPATH:../../python-package 3 | python basic_walkthrough.py 4 | python custom_objective.py 5 | python boost_from_prediction.py 6 | python predict_first_ntree.py 7 | python generalized_linear_model.py 8 | python cross_validation.py 9 | python predict_leaf_indices.py 10 | python sklearn_examples.py 11 | python sklearn_parallel.py 12 | python external_memory.py 13 | rm -rf *~ *.model *.buffer 14 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/sklearn_evals_result.py: -------------------------------------------------------------------------------- 1 | ## 2 | # This script demonstrate how to access the xgboost eval metrics by using sklearn 3 | ## 4 | 5 | import xgboost as xgb 6 | import numpy as np 7 | from sklearn.datasets import make_hastie_10_2 8 | 9 | X, y = make_hastie_10_2(n_samples=2000, random_state=42) 10 | 11 | # Map labels from {-1, 1} to {0, 1} 12 | labels, y = np.unique(y, return_inverse=True) 13 | 14 | X_train, X_test = X[:1600], X[1600:] 15 | y_train, y_test = y[:1600], y[1600:] 16 | 17 | param_dist = {'objective':'binary:logistic', 'n_estimators':2} 18 | 19 | clf = xgb.XGBModel(**param_dist) 20 | # Or you can use: clf = xgb.XGBClassifier(**param_dist) 21 | 22 | clf.fit(X_train, y_train, 23 | eval_set=[(X_train, y_train), (X_test, y_test)], 24 | eval_metric='logloss', 25 | verbose=True) 26 | 27 | # Load evals result by calling 
the evals_result() function 28 | evals_result = clf.evals_result() 29 | 30 | print('Access logloss metric directly from validation_0:') 31 | print(evals_result['validation_0']['logloss']) 32 | 33 | print('') 34 | print('Access metrics through a loop:') 35 | for e_name, e_mtrs in evals_result.items(): 36 | print('- {}'.format(e_name)) 37 | for e_mtr_name, e_mtr_vals in e_mtrs.items(): 38 | print(' - {}'.format(e_mtr_name)) 39 | print(' - {}'.format(e_mtr_vals)) 40 | 41 | print('') 42 | print('Access complete dict:') 43 | print(evals_result) 44 | -------------------------------------------------------------------------------- /XGBoost_demo/guide-python/sklearn_examples.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | ''' 3 | Created on 1 Apr 2015 4 | 5 | @author: Jamie Hall 6 | ''' 7 | import pickle 8 | import xgboost as xgb 9 | 10 | import numpy as np 11 | from sklearn.model_selection import KFold, train_test_split, GridSearchCV 12 | from sklearn.metrics import confusion_matrix, mean_squared_error 13 | from sklearn.datasets import load_iris, load_digits, load_boston 14 | 15 | rng = np.random.RandomState(31337) 16 | 17 | print("Zeros and Ones from the Digits dataset: binary classification") 18 | digits = load_digits(2) 19 | y = digits['target'] 20 | X = digits['data'] 21 | kf = KFold(n_splits=2, shuffle=True, random_state=rng) 22 | for train_index, test_index in kf.split(X): 23 | xgb_model = xgb.XGBClassifier().fit(X[train_index], y[train_index]) 24 | predictions = xgb_model.predict(X[test_index]) 25 | actuals = y[test_index] 26 | print(confusion_matrix(actuals, predictions)) 27 | 28 | print("Iris: multiclass classification") 29 | iris = load_iris() 30 | y = iris['target'] 31 | X = iris['data'] 32 | kf = KFold(n_splits=2, shuffle=True, random_state=rng) 33 | for train_index, test_index in kf.split(X): 34 | xgb_model = xgb.XGBClassifier().fit(X[train_index], y[train_index]) 35 | predictions = xgb_model.predict(X[test_index]) 36 | actuals = y[test_index] 37 | print(confusion_matrix(actuals, predictions)) 38 | 39 | print("Boston Housing: regression") 40 | boston = load_boston() 41 | y = boston['target'] 42 | X = boston['data'] 43 | kf = KFold(n_splits=2, shuffle=True, random_state=rng) 44 | for train_index, test_index in kf.split(X): 45 | xgb_model = xgb.XGBRegressor().fit(X[train_index], y[train_index]) 46 | predictions = xgb_model.predict(X[test_index]) 47 | actuals = y[test_index] 48 | print(mean_squared_error(actuals, predictions)) 49 | 50 | print("Parameter optimization") 51 | y = boston['target'] 52 | X = boston['data'] 53 | xgb_model = xgb.XGBRegressor() 54 | clf = GridSearchCV(xgb_model, 55 | {'max_depth': [2,4,6], 56 | 'n_estimators': [50,100,200]}, verbose=1) 57 | clf.fit(X,y) 58 | print(clf.best_score_) 59 | print(clf.best_params_) 60 | 61 | # The sklearn API models are picklable 62 | print("Pickling sklearn API models") 63 | # must open in binary format to pickle 64 | pickle.dump(clf, open("best_boston.pkl", "wb")) 65 | clf2 = pickle.load(open("best_boston.pkl", "rb")) 66 | print(np.allclose(clf.predict(X), clf2.predict(X))) 67 | 68 | # Early-stopping 69 | 70 | X = digits['data'] 71 | y = digits['target'] 72 | X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0) 73 | clf = xgb.XGBClassifier() 74 | clf.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="auc", 75 | eval_set=[(X_test, y_test)]) 76 | 77 | -------------------------------------------------------------------------------- 
/XGBoost_demo/guide-python/sklearn_parallel.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | if __name__ == "__main__": 4 | # NOTE: on posix systems, this *has* to be here and in the 5 | # `__name__ == "__main__"` clause to run XGBoost in parallel processes 6 | # using fork, if XGBoost was built with OpenMP support. Otherwise, if you 7 | # build XGBoost without OpenMP support, you can use fork, which is the 8 | # default backend for joblib, and omit this. 9 | try: 10 | from multiprocessing import set_start_method 11 | except ImportError: 12 | raise ImportError("Unable to import multiprocessing.set_start_method." 13 | " This example only runs on Python 3.4") 14 | set_start_method("forkserver") 15 | 16 | import numpy as np 17 | from sklearn.model_selection import GridSearchCV 18 | from sklearn.datasets import load_boston 19 | import xgboost as xgb 20 | 21 | rng = np.random.RandomState(31337) 22 | 23 | print("Parallel Parameter optimization") 24 | boston = load_boston() 25 | 26 | os.environ["OMP_NUM_THREADS"] = "2" # or to whatever you want 27 | y = boston['target'] 28 | X = boston['data'] 29 | xgb_model = xgb.XGBRegressor() 30 | clf = GridSearchCV(xgb_model, {'max_depth': [2, 4, 6], 31 | 'n_estimators': [50, 100, 200]}, verbose=1, 32 | n_jobs=2) 33 | clf.fit(X, y) 34 | print(clf.best_score_) 35 | print(clf.best_params_) 36 | -------------------------------------------------------------------------------- /XGBoost_demo/kaggle-higgs/README.md: -------------------------------------------------------------------------------- 1 | Highlights 2 | ===== 3 | Higgs challenge ends recently, xgboost is being used by many users. This list highlights the xgboost solutions of players 4 | * Blogpost by phunther: [Winning solution of Kaggle Higgs competition: what a single model can do](http://no2147483647.wordpress.com/2014/09/17/winning-solution-of-kaggle-higgs-competition-what-a-single-model-can-do/) 5 | * The solution by Tianqi Chen and Tong He [Link](https://github.com/hetong007/higgsml) 6 | 7 | Guide for Kaggle Higgs Challenge 8 | ===== 9 | 10 | This is the folder giving example of how to use XGBoost Python Module to run Kaggle Higgs competition 11 | 12 | This script will achieve about 3.600 AMS score in public leaderboard. To get start, you need do following step: 13 | 14 | 1. Compile the XGBoost python lib 15 | ```bash 16 | cd ../.. 17 | make 18 | ``` 19 | 20 | 2. Put training.csv test.csv on folder './data' (you can create a symbolic link) 21 | 22 | 3. Run ./run.sh 23 | 24 | Speed 25 | ===== 26 | speedtest.py compares xgboost's speed on this dataset with sklearn.GBM 27 | 28 | 29 | Using R module 30 | ===== 31 | * Alternatively, you can run using R, higgs-train.R and higgs-pred.R. 
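AMS metric
=====
For reference on the AMS score quoted above: the challenge ranks submissions by the approximate median significance, which the scripts in this folder optimise indirectly via the `ams@0.15` eval metric. A minimal sketch of the formula is given below, using the conventional regularisation term `b_r = 10`; this helper is not part of the scripts in this folder.

```python
import math

def ams(s, b, b_r=10.0):
    # s: sum of weights of selected true signal events (true positives)
    # b: sum of weights of selected background events (false positives)
    return math.sqrt(2.0 * ((s + b + b_r) * math.log(1.0 + s / (b + b_r)) - s))
```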
32 | -------------------------------------------------------------------------------- /XGBoost_demo/kaggle-higgs/higgs-cv.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import numpy as np 3 | import xgboost as xgb 4 | 5 | ### load data in do training 6 | train = np.loadtxt('./data/training.csv', delimiter=',', skiprows=1, converters={32: lambda x:int(x=='s'.encode('utf-8')) } ) 7 | label = train[:,32] 8 | data = train[:,1:31] 9 | weight = train[:,31] 10 | dtrain = xgb.DMatrix( data, label=label, missing = -999.0, weight=weight ) 11 | param = {'max_depth':6, 'eta':0.1, 'silent':1, 'objective':'binary:logitraw', 'nthread':4} 12 | num_round = 120 13 | 14 | print ('running cross validation, with preprocessing function') 15 | # define the preprocessing function 16 | # used to return the preprocessed training, test data, and parameter 17 | # we can use this to do weight rescale, etc. 18 | # as a example, we try to set scale_pos_weight 19 | def fpreproc(dtrain, dtest, param): 20 | label = dtrain.get_label() 21 | ratio = float(np.sum(label == 0)) / np.sum(label==1) 22 | param['scale_pos_weight'] = ratio 23 | wtrain = dtrain.get_weight() 24 | wtest = dtest.get_weight() 25 | sum_weight = sum(wtrain) + sum(wtest) 26 | wtrain *= sum_weight / sum(wtrain) 27 | wtest *= sum_weight / sum(wtest) 28 | dtrain.set_weight(wtrain) 29 | dtest.set_weight(wtest) 30 | return (dtrain, dtest, param) 31 | 32 | # do cross validation, for each fold 33 | # the dtrain, dtest, param will be passed into fpreproc 34 | # then the return value of fpreproc will be used to generate 35 | # results of that fold 36 | xgb.cv(param, dtrain, num_round, nfold=5, 37 | metrics={'ams@0.15', 'auc'}, seed = 0, fpreproc = fpreproc) 38 | -------------------------------------------------------------------------------- /XGBoost_demo/kaggle-higgs/higgs-numpy.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # this is the example script to use xgboost to train 3 | import numpy as np 4 | 5 | import xgboost as xgb 6 | 7 | test_size = 550000 8 | 9 | # path to where the data lies 10 | dpath = 'data' 11 | 12 | # load in training data, directly use numpy 13 | dtrain = np.loadtxt( dpath+'/training.csv', delimiter=',', skiprows=1, converters={32: lambda x:int(x=='s'.encode('utf-8')) } ) 14 | print ('finish loading from csv ') 15 | 16 | label = dtrain[:,32] 17 | data = dtrain[:,1:31] 18 | # rescale weight to make it same as test set 19 | weight = dtrain[:,31] * float(test_size) / len(label) 20 | 21 | sum_wpos = sum( weight[i] for i in range(len(label)) if label[i] == 1.0 ) 22 | sum_wneg = sum( weight[i] for i in range(len(label)) if label[i] == 0.0 ) 23 | 24 | # print weight statistics 25 | print ('weight statistics: wpos=%g, wneg=%g, ratio=%g' % ( sum_wpos, sum_wneg, sum_wneg/sum_wpos )) 26 | 27 | # construct xgboost.DMatrix from numpy array, treat -999.0 as missing value 28 | xgmat = xgb.DMatrix( data, label=label, missing = -999.0, weight=weight ) 29 | 30 | # setup parameters for xgboost 31 | param = {} 32 | # use logistic regression loss, use raw prediction before logistic transformation 33 | # since we only need the rank 34 | param['objective'] = 'binary:logitraw' 35 | # scale weight of positive examples 36 | param['scale_pos_weight'] = sum_wneg/sum_wpos 37 | param['eta'] = 0.1 38 | param['max_depth'] = 6 39 | param['eval_metric'] = 'auc' 40 | param['silent'] = 1 41 | param['nthread'] = 16 42 | 43 | # you can directly throw 
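# column layout of training.csv: column 0 is the event id, columns 1-30 are the 30
# physics features, column 31 is the event weight and column 32 is the signal/background label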
param in, though we want to watch multiple metrics here 44 | plst = list(param.items())+[('eval_metric', 'ams@0.15')] 45 | 46 | watchlist = [ (xgmat,'train') ] 47 | # boost 120 trees 48 | num_round = 120 49 | print ('loading data end, start to boost trees') 50 | bst = xgb.train( plst, xgmat, num_round, watchlist ); 51 | # save out model 52 | bst.save_model('higgs.model') 53 | 54 | print ('finish training') 55 | -------------------------------------------------------------------------------- /XGBoost_demo/kaggle-higgs/higgs-pred.R: -------------------------------------------------------------------------------- 1 | # install xgboost package, see R-package in root folder 2 | require(xgboost) 3 | require(methods) 4 | 5 | modelfile <- "higgs.model" 6 | outfile <- "higgs.pred.csv" 7 | dtest <- read.csv("data/test.csv", header=TRUE) 8 | data <- as.matrix(dtest[2:31]) 9 | idx <- dtest[[1]] 10 | 11 | xgmat <- xgb.DMatrix(data, missing = -999.0) 12 | bst <- xgb.load(modelfile=modelfile) 13 | ypred <- predict(bst, xgmat) 14 | 15 | rorder <- rank(ypred, ties.method="first") 16 | 17 | threshold <- 0.15 18 | # to be completed 19 | ntop <- length(rorder) - as.integer(threshold*length(rorder)) 20 | plabel <- ifelse(rorder > ntop, "s", "b") 21 | outdata <- list("EventId" = idx, 22 | "RankOrder" = rorder, 23 | "Class" = plabel) 24 | write.csv(outdata, file = outfile, quote=FALSE, row.names=FALSE) 25 | -------------------------------------------------------------------------------- /XGBoost_demo/kaggle-higgs/higgs-pred.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # make prediction 3 | import numpy as np 4 | import xgboost as xgb 5 | 6 | # path to where the data lies 7 | dpath = 'data' 8 | 9 | modelfile = 'higgs.model' 10 | outfile = 'higgs.pred.csv' 11 | # make top 15% as positive 12 | threshold_ratio = 0.15 13 | 14 | # load in training data, directly use numpy 15 | dtest = np.loadtxt( dpath+'/test.csv', delimiter=',', skiprows=1 ) 16 | data = dtest[:,1:31] 17 | idx = dtest[:,0] 18 | 19 | print ('finish loading from csv ') 20 | xgmat = xgb.DMatrix( data, missing = -999.0 ) 21 | bst = xgb.Booster({'nthread':16}, model_file = modelfile) 22 | ypred = bst.predict( xgmat ) 23 | 24 | res = [ ( int(idx[i]), ypred[i] ) for i in range(len(ypred)) ] 25 | 26 | rorder = {} 27 | for k, v in sorted( res, key = lambda x:-x[1] ): 28 | rorder[ k ] = len(rorder) + 1 29 | 30 | # write out predictions 31 | ntop = int( threshold_ratio * len(rorder ) ) 32 | fo = open(outfile, 'w') 33 | nhit = 0 34 | ntot = 0 35 | fo.write('EventId,RankOrder,Class\n') 36 | for k, v in res: 37 | if rorder[k] <= ntop: 38 | lb = 's' 39 | nhit += 1 40 | else: 41 | lb = 'b' 42 | # change output rank order to follow Kaggle convention 43 | fo.write('%s,%d,%s\n' % ( k, len(rorder)+1-rorder[k], lb ) ) 44 | ntot += 1 45 | fo.close() 46 | 47 | print ('finished writing into prediction file') 48 | 49 | 50 | 51 | -------------------------------------------------------------------------------- /XGBoost_demo/kaggle-higgs/higgs-train.R: -------------------------------------------------------------------------------- 1 | # install xgboost package, see R-package in root folder 2 | require(xgboost) 3 | require(methods) 4 | 5 | testsize <- 550000 6 | 7 | dtrain <- read.csv("data/training.csv", header=TRUE) 8 | dtrain[33] <- dtrain[33] == "s" 9 | label <- as.numeric(dtrain[[33]]) 10 | data <- as.matrix(dtrain[2:31]) 11 | weight <- as.numeric(dtrain[[32]]) * testsize / length(label) 12 | 13 | sumwpos <- 
sum(weight * (label==1.0)) 14 | sumwneg <- sum(weight * (label==0.0)) 15 | print(paste("weight statistics: wpos=", sumwpos, "wneg=", sumwneg, "ratio=", sumwneg / sumwpos)) 16 | 17 | xgmat <- xgb.DMatrix(data, label = label, weight = weight, missing = -999.0) 18 | param <- list("objective" = "binary:logitraw", 19 | "scale_pos_weight" = sumwneg / sumwpos, 20 | "bst:eta" = 0.1, 21 | "bst:max_depth" = 6, 22 | "eval_metric" = "auc", 23 | "eval_metric" = "ams@0.15", 24 | "silent" = 1, 25 | "nthread" = 16) 26 | watchlist <- list("train" = xgmat) 27 | nround = 120 28 | print ("loading data end, start to boost trees") 29 | bst = xgb.train(param, xgmat, nround, watchlist ); 30 | # save out model 31 | xgb.save(bst, "higgs.model") 32 | print ('finish training') 33 | 34 | -------------------------------------------------------------------------------- /XGBoost_demo/kaggle-higgs/run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | python -u higgs-numpy.py 4 | ret=$? 5 | if [[ $ret != 0 ]]; then 6 | echo "ERROR in higgs-numpy.py" 7 | exit $ret 8 | fi 9 | python -u higgs-pred.py 10 | ret=$? 11 | if [[ $ret != 0 ]]; then 12 | echo "ERROR in higgs-pred.py" 13 | exit $ret 14 | fi 15 | -------------------------------------------------------------------------------- /XGBoost_demo/kaggle-higgs/speedtest.R: -------------------------------------------------------------------------------- 1 | # install xgboost package, see R-package in root folder 2 | require(xgboost) 3 | require(gbm) 4 | require(methods) 5 | 6 | testsize <- 550000 7 | 8 | dtrain <- read.csv("data/training.csv", header=TRUE, nrows=350001) 9 | dtrain$Label = as.numeric(dtrain$Label=='s') 10 | # gbm.time = system.time({ 11 | # gbm.model <- gbm(Label ~ ., data = dtrain[, -c(1,32)], n.trees = 120, 12 | # interaction.depth = 6, shrinkage = 0.1, bag.fraction = 1, 13 | # verbose = TRUE) 14 | # }) 15 | # print(gbm.time) 16 | # Test result: 761.48 secs 17 | 18 | # dtrain[33] <- dtrain[33] == "s" 19 | # label <- as.numeric(dtrain[[33]]) 20 | data <- as.matrix(dtrain[2:31]) 21 | weight <- as.numeric(dtrain[[32]]) * testsize / length(label) 22 | 23 | sumwpos <- sum(weight * (label==1.0)) 24 | sumwneg <- sum(weight * (label==0.0)) 25 | print(paste("weight statistics: wpos=", sumwpos, "wneg=", sumwneg, "ratio=", sumwneg / sumwpos)) 26 | 27 | xgboost.time = list() 28 | threads = c(1,2,4,8,16) 29 | for (i in 1:length(threads)){ 30 | thread = threads[i] 31 | xgboost.time[[i]] = system.time({ 32 | xgmat <- xgb.DMatrix(data, label = label, weight = weight, missing = -999.0) 33 | param <- list("objective" = "binary:logitraw", 34 | "scale_pos_weight" = sumwneg / sumwpos, 35 | "bst:eta" = 0.1, 36 | "bst:max_depth" = 6, 37 | "eval_metric" = "auc", 38 | "eval_metric" = "ams@0.15", 39 | "silent" = 1, 40 | "nthread" = thread) 41 | watchlist <- list("train" = xgmat) 42 | nround = 120 43 | print ("loading data end, start to boost trees") 44 | bst = xgb.train(param, xgmat, nround, watchlist ); 45 | # save out model 46 | xgb.save(bst, "higgs.model") 47 | print ('finish training') 48 | }) 49 | } 50 | 51 | xgboost.time 52 | # [[1]] 53 | # user system elapsed 54 | # 99.015 0.051 98.982 55 | # 56 | # [[2]] 57 | # user system elapsed 58 | # 100.268 0.317 55.473 59 | # 60 | # [[3]] 61 | # user system elapsed 62 | # 111.682 0.777 35.963 63 | # 64 | # [[4]] 65 | # user system elapsed 66 | # 149.396 1.851 32.661 67 | # 68 | # [[5]] 69 | # user system elapsed 70 | # 157.390 5.988 40.949 71 | 72 | 
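# note on the timings above: on this run elapsed time improves up to 8 threads
# (98.98s -> 32.66s) but regresses at 16 threads while user time keeps growing,
# i.e. adding threads beyond 8 oversubscribes this machine and adds overhead rather than speedup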
-------------------------------------------------------------------------------- /XGBoost_demo/kaggle-higgs/speedtest.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # this is the example script to use xgboost to train 3 | import numpy as np 4 | import xgboost as xgb 5 | from sklearn.ensemble import GradientBoostingClassifier 6 | import time 7 | test_size = 550000 8 | 9 | # path to where the data lies 10 | dpath = 'data' 11 | 12 | # load in training data, directly use numpy 13 | dtrain = np.loadtxt( dpath+'/training.csv', delimiter=',', skiprows=1, converters={32: lambda x:int(x=='s') } ) 14 | print ('finish loading from csv ') 15 | 16 | label = dtrain[:,32] 17 | data = dtrain[:,1:31] 18 | # rescale weight to make it same as test set 19 | weight = dtrain[:,31] * float(test_size) / len(label) 20 | 21 | sum_wpos = sum( weight[i] for i in range(len(label)) if label[i] == 1.0 ) 22 | sum_wneg = sum( weight[i] for i in range(len(label)) if label[i] == 0.0 ) 23 | 24 | # print weight statistics 25 | print ('weight statistics: wpos=%g, wneg=%g, ratio=%g' % ( sum_wpos, sum_wneg, sum_wneg/sum_wpos )) 26 | 27 | # construct xgboost.DMatrix from numpy array, treat -999.0 as missing value 28 | xgmat = xgb.DMatrix( data, label=label, missing = -999.0, weight=weight ) 29 | 30 | # setup parameters for xgboost 31 | param = {} 32 | # use logistic regression loss 33 | param['objective'] = 'binary:logitraw' 34 | # scale weight of positive examples 35 | param['scale_pos_weight'] = sum_wneg/sum_wpos 36 | param['bst:eta'] = 0.1 37 | param['bst:max_depth'] = 6 38 | param['eval_metric'] = 'auc' 39 | param['silent'] = 1 40 | param['nthread'] = 4 41 | 42 | plst = param.items()+[('eval_metric', 'ams@0.15')] 43 | 44 | watchlist = [ (xgmat,'train') ] 45 | # boost 10 trees 46 | num_round = 10 47 | print ('loading data end, start to boost trees') 48 | print ("training GBM from sklearn") 49 | tmp = time.time() 50 | gbm = GradientBoostingClassifier(n_estimators=num_round, max_depth=6, verbose=2) 51 | gbm.fit(data, label) 52 | print ("sklearn.GBM costs: %s seconds" % str(time.time() - tmp)) 53 | #raw_input() 54 | print ("training xgboost") 55 | threads = [1, 2, 4, 16] 56 | for i in threads: 57 | param['nthread'] = i 58 | tmp = time.time() 59 | plst = param.items()+[('eval_metric', 'ams@0.15')] 60 | bst = xgb.train( plst, xgmat, num_round, watchlist ); 61 | print ("XGBoost with %d thread costs: %s seconds" % (i, str(time.time() - tmp))) 62 | 63 | print ('finish training') 64 | -------------------------------------------------------------------------------- /XGBoost_demo/kaggle-otto/README.MD: -------------------------------------------------------------------------------- 1 | Benckmark for Otto Group Competition 2 | ========= 3 | 4 | This is a folder containing the benchmark for the [Otto Group Competition on Kaggle](http://www.kaggle.com/c/otto-group-product-classification-challenge). 5 | 6 | ## Getting started 7 | 8 | 1. Put `train.csv` and `test.csv` under the `data` folder 9 | 2. Run the script 10 | 3. Submit the `submission.csv` 11 | 12 | The parameter `nthread` controls the number of cores to run on, please set it to suit your machine. 13 | 14 | ## R-package 15 | 16 | To install the R-package of xgboost, please run 17 | 18 | ```r 19 | devtools::install_github('tqchen/xgboost',subdir='R-package') 20 | ``` 21 | 22 | Windows users may need to install [RTools](http://cran.r-project.org/bin/windows/Rtools/) first. 
23 | 24 | 25 | -------------------------------------------------------------------------------- /XGBoost_demo/kaggle-otto/otto_train_pred.R: -------------------------------------------------------------------------------- 1 | require(xgboost) 2 | require(methods) 3 | 4 | train = read.csv('data/train.csv',header=TRUE,stringsAsFactors = F) 5 | test = read.csv('data/test.csv',header=TRUE,stringsAsFactors = F) 6 | train = train[,-1] 7 | test = test[,-1] 8 | 9 | y = train[,ncol(train)] 10 | y = gsub('Class_','',y) 11 | y = as.integer(y)-1 # xgboost take features in [0,numOfClass) 12 | 13 | x = rbind(train[,-ncol(train)],test) 14 | x = as.matrix(x) 15 | x = matrix(as.numeric(x),nrow(x),ncol(x)) 16 | trind = 1:length(y) 17 | teind = (nrow(train)+1):nrow(x) 18 | 19 | # Set necessary parameter 20 | param <- list("objective" = "multi:softprob", 21 | "eval_metric" = "mlogloss", 22 | "num_class" = 9, 23 | "nthread" = 8) 24 | 25 | # Run Cross Validation 26 | cv.nround = 50 27 | bst.cv = xgb.cv(param=param, data = x[trind,], label = y, 28 | nfold = 3, nrounds=cv.nround) 29 | 30 | # Train the model 31 | nround = 50 32 | bst = xgboost(param=param, data = x[trind,], label = y, nrounds=nround) 33 | 34 | # Make prediction 35 | pred = predict(bst,x[teind,]) 36 | pred = matrix(pred,9,length(pred)/9) 37 | pred = t(pred) 38 | 39 | # Output submission 40 | pred = format(pred, digits=2,scientific=F) # shrink the size of submission 41 | pred = data.frame(1:nrow(pred),pred) 42 | names(pred) = c('id', paste0('Class_',1:9)) 43 | write.csv(pred,file='submission.csv', quote=FALSE,row.names=FALSE) 44 | -------------------------------------------------------------------------------- /XGBoost_demo/kaggle-otto/understandingXGBoostModel.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Understanding XGBoost Model on Otto Dataset" 3 | author: "Michaël Benesty" 4 | output: 5 | rmarkdown::html_vignette: 6 | css: ../../R-package/vignettes/vignette.css 7 | number_sections: yes 8 | toc: yes 9 | --- 10 | 11 | Introduction 12 | ============ 13 | 14 | **XGBoost** is an implementation of the famous gradient boosting algorithm. This model is often described as a *blackbox*, meaning it works well but it is not trivial to understand how. Indeed, the model is made of hundreds (thousands?) of decision trees. You may wonder how possible a human would be able to have a general view of the model? 15 | 16 | While XGBoost is known for its fast speed and accurate predictive power, it also comes with various functions to help you understand the model. 17 | The purpose of this RMarkdown document is to demonstrate how easily we can leverage the functions already implemented in **XGBoost R** package. Of course, everything showed below can be applied to the dataset you may have to manipulate at work or wherever! 18 | 19 | First we will prepare the **Otto** dataset and train a model, then we will generate two visualisations to get a clue of what is important to the model, finally, we will see how we can leverage these information. 20 | 21 | Preparation of the data 22 | ======================= 23 | 24 | This part is based on the **R** tutorial example by [Tong He](https://github.com/dmlc/xgboost/blob/master/demo/kaggle-otto/otto_train_pred.R) 25 | 26 | First, let's load the packages and the dataset. 
27 | 28 | ```{r loading} 29 | require(xgboost) 30 | require(methods) 31 | require(data.table) 32 | require(magrittr) 33 | train <- fread('data/train.csv', header = T, stringsAsFactors = F) 34 | test <- fread('data/test.csv', header=TRUE, stringsAsFactors = F) 35 | ``` 36 | > `magrittr` and `data.table` are here to make the code cleaner and much more rapid. 37 | 38 | Let's explore the dataset. 39 | 40 | ```{r explore} 41 | # Train dataset dimensions 42 | dim(train) 43 | 44 | # Training content 45 | train[1:6,1:5, with =F] 46 | 47 | # Test dataset dimensions 48 | dim(test) 49 | 50 | # Test content 51 | test[1:6,1:5, with =F] 52 | ``` 53 | > We only display the 6 first rows and 5 first columns for convenience 54 | 55 | Each *column* represents a feature measured by an `integer`. Each *row* is an **Otto** product. 56 | 57 | Obviously the first column (`ID`) doesn't contain any useful information. 58 | 59 | To let the algorithm focus on real stuff, we will delete it. 60 | 61 | ```{r clean, results='hide'} 62 | # Delete ID column in training dataset 63 | train[, id := NULL] 64 | 65 | # Delete ID column in testing dataset 66 | test[, id := NULL] 67 | ``` 68 | 69 | According to its description, the **Otto** challenge is a multi class classification challenge. We need to extract the labels (here the name of the different classes) from the dataset. We only have two files (test and training), it seems logical that the training file contains the class we are looking for. Usually the labels is in the first or the last column. We already know what is in the first column, let's check the content of the last one. 70 | 71 | ```{r searchLabel} 72 | # Check the content of the last column 73 | train[1:6, ncol(train), with = F] 74 | # Save the name of the last column 75 | nameLastCol <- names(train)[ncol(train)] 76 | ``` 77 | 78 | The classes are provided as character string in the `r ncol(train)`th column called `r nameLastCol`. As you may know, **XGBoost** doesn't support anything else than numbers. So we will convert classes to `integer`. Moreover, according to the documentation, it should start at `0`. 79 | 80 | For that purpose, we will: 81 | 82 | * extract the target column 83 | * remove `Class_` from each class name 84 | * convert to `integer` 85 | * remove `1` to the new value 86 | 87 | ```{r classToIntegers} 88 | # Convert from classes to numbers 89 | y <- train[, nameLastCol, with = F][[1]] %>% gsub('Class_','',.) %>% {as.integer(.) -1} 90 | 91 | # Display the first 5 levels 92 | y[1:5] 93 | ``` 94 | 95 | We remove label column from training dataset, otherwise **XGBoost** would use it to guess the labels! 96 | 97 | ```{r deleteCols, results='hide'} 98 | train[, nameLastCol:=NULL, with = F] 99 | ``` 100 | 101 | `data.table` is an awesome implementation of data.frame, unfortunately it is not a format supported natively by **XGBoost**. We need to convert both datasets (training and test) in `numeric` Matrix format. 102 | 103 | ```{r convertToNumericMatrix} 104 | trainMatrix <- train[,lapply(.SD,as.numeric)] %>% as.matrix 105 | testMatrix <- test[,lapply(.SD,as.numeric)] %>% as.matrix 106 | ``` 107 | 108 | Model training 109 | ============== 110 | 111 | Before the learning we will use the cross validation to evaluate the our error rate. 112 | 113 | Basically **XGBoost** will divide the training data in `nfold` parts, then **XGBoost** will retain the first part to use it as the test data and perform a training. Then it will reintegrate the first part and retain the second part, do a training and so on... 
114 | 115 | You can look at the function documentation for more information. 116 | 117 | ```{r crossValidation} 118 | numberOfClasses <- max(y) + 1 119 | 120 | param <- list("objective" = "multi:softprob", 121 | "eval_metric" = "mlogloss", 122 | "num_class" = numberOfClasses) 123 | 124 | cv.nround <- 5 125 | cv.nfold <- 3 126 | 127 | bst.cv = xgb.cv(param=param, data = trainMatrix, label = y, 128 | nfold = cv.nfold, nrounds = cv.nround) 129 | ``` 130 | > As we can see the error rate is low on the test dataset (for a 5mn trained model). 131 | 132 | Finally, we are ready to train the real model!!! 133 | 134 | ```{r modelTraining} 135 | nround = 50 136 | bst = xgboost(param=param, data = trainMatrix, label = y, nrounds=nround) 137 | ``` 138 | 139 | Model understanding 140 | =================== 141 | 142 | Feature importance 143 | ------------------ 144 | 145 | So far, we have built a model made of **`r nround`** trees. 146 | 147 | To build a tree, the dataset is divided recursively several times. At the end of the process, you get groups of observations (here, these observations are properties regarding **Otto** products). 148 | 149 | Each division operation is called a *split*. 150 | 151 | Each group at each division level is called a branch and the deepest level is called a *leaf*. 152 | 153 | In the final model, these *leafs* are supposed to be as pure as possible for each tree, meaning in our case that each *leaf* should be made of one class of **Otto** product only (of course it is not true, but that's what we try to achieve in a minimum of splits). 154 | 155 | **Not all *splits* are equally important**. Basically the first *split* of a tree will have more impact on the purity that, for instance, the deepest *split*. Intuitively, we understand that the first *split* makes most of the work, and the following *splits* focus on smaller parts of the dataset which have been misclassified by the first *tree*. 156 | 157 | In the same way, in Boosting we try to optimize the misclassification at each round (it is called the *loss*). So the first *tree* will do the big work and the following trees will focus on the remaining, on the parts not correctly learned by the previous *trees*. 158 | 159 | The improvement brought by each *split* can be measured, it is the *gain*. 160 | 161 | Each *split* is done on one feature only at one value. 162 | 163 | Let's see what the model looks like. 164 | 165 | ```{r modelDump} 166 | model <- xgb.dump(bst, with.stats = T) 167 | model[1:10] 168 | ``` 169 | > For convenience, we are displaying the first 10 lines of the model only. 170 | 171 | Clearly, it is not easy to understand what it means. 172 | 173 | Basically each line represents a *branch*, there is the *tree* ID, the feature ID, the point where it *splits*, and information regarding the next *branches* (left, right, when the row for this feature is N/A). 174 | 175 | Hopefully, **XGBoost** offers a better representation: **feature importance**. 176 | 177 | Feature importance is about averaging the *gain* of each feature for all *split* and all *trees*. 178 | 179 | Then we can use the function `xgb.plot.importance`. 
180 | 181 | ```{r importanceFeature, fig.align='center', fig.height=5, fig.width=10} 182 | # Get the feature real names 183 | names <- dimnames(trainMatrix)[[2]] 184 | 185 | # Compute feature importance matrix 186 | importance_matrix <- xgb.importance(names, model = bst) 187 | 188 | # Nice graph 189 | xgb.plot.importance(importance_matrix[1:10,]) 190 | ``` 191 | 192 | > To make it understandable, we first extract the column names from the `Matrix`. 193 | 194 | Interpretation 195 | -------------- 196 | 197 | In the feature importance plot above, we can see the 10 most important features. 198 | 199 | This function gives a color to each bar. These colors represent groups of features. Basically, K-means clustering is applied to group the features by importance. 200 | 201 | From here you can take several actions. For instance you can remove the least important features (feature selection), or look deeper into the interaction between the most important features and the labels. 202 | 203 | Or you can just reason about why these features are so important (in the **Otto** challenge we can't go this way because there is not enough information). 204 | 205 | Tree graph 206 | ---------- 207 | 208 | Feature importance gives you feature weight information but not the interactions between features. 209 | 210 | The **XGBoost R** package has another useful function for that. 211 | 212 | Please scroll to the right to see the tree. 213 | 214 | ```{r treeGraph, dpi=1500, fig.align='left'} 215 | xgb.plot.tree(feature_names = names, model = bst, n_first_tree = 2) 216 | ``` 217 | 218 | We are just displaying the first two trees here. 219 | 220 | On simple models the first two trees may be enough. Here, it might not be the case. We can see from the size of the trees that the interaction between features is complicated. 221 | Besides, **XGBoost** generates `k` trees at each round for a `k`-class classification problem. Therefore the two trees illustrated here are trying to classify data into different classes. 222 | 223 | Going deeper 224 | ============ 225 | 226 | There are 4 documents you may also be interested in: 227 | 228 | * [xgboostPresentation.Rmd](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd): general presentation 229 | * [discoverYourData.Rmd](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/discoverYourData.Rmd): explaining feature analysis 230 | * [Feature Importance Analysis with XGBoost in Tax audit](http://fr.slideshare.net/MichaelBENESTY/feature-importance-analysis-with-xgboost-in-tax-audit): use case 231 | * [The Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/): very good book to have a good understanding of the model 232 | -------------------------------------------------------------------------------- /XGBoost_demo/multiclass_classification/README.md: -------------------------------------------------------------------------------- 1 | Demonstrating how to use XGBoost to accomplish a multi-class classification task on the [UCI Dermatology dataset](https://archive.ics.uci.edu/ml/datasets/Dermatology) 2 | 3 | Make sure you have built the xgboost Python module in ../../python 4 | 5 | 1.
Run runexp.sh 6 | ```bash 7 | ./runexp.sh 8 | ``` 9 | 10 | 11 | -------------------------------------------------------------------------------- /XGBoost_demo/multiclass_classification/runexp.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | if [ -f dermatology.data ] 3 | then 4 | echo "use existing data to run multi class classification" 5 | else 6 | echo "getting data from uci, make sure you are connected to internet" 7 | wget https://archive.ics.uci.edu/ml/machine-learning-databases/dermatology/dermatology.data 8 | fi 9 | python train.py 10 | -------------------------------------------------------------------------------- /XGBoost_demo/multiclass_classification/train.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | from __future__ import division 4 | 5 | import numpy as np 6 | import xgboost as xgb 7 | 8 | # labels need to be 0 to num_class - 1 9 | data = np.loadtxt('./dermatology.data', delimiter=',', 10 | converters={33: lambda x:int(x == '?'), 34: lambda x:int(x) - 1}) 11 | sz = data.shape 12 | 13 | train = data[:int(sz[0] * 0.7), :] 14 | test = data[int(sz[0] * 0.7):, :] 15 | 16 | train_X = train[:, :33] 17 | train_Y = train[:, 34] 18 | 19 | test_X = test[:, :33] 20 | test_Y = test[:, 34] 21 | 22 | xg_train = xgb.DMatrix(train_X, label=train_Y) 23 | xg_test = xgb.DMatrix(test_X, label=test_Y) 24 | # setup parameters for xgboost 25 | param = {} 26 | # use softmax multi-class classification 27 | param['objective'] = 'multi:softmax' 28 | # step size shrinkage (learning rate) 29 | param['eta'] = 0.1 30 | param['max_depth'] = 6 31 | param['silent'] = 1 32 | param['nthread'] = 4 33 | param['num_class'] = 6 34 | 35 | watchlist = [(xg_train, 'train'), (xg_test, 'test')] 36 | num_round = 5 37 | bst = xgb.train(param, xg_train, num_round, watchlist) 38 | # get prediction 39 | pred = bst.predict(xg_test) 40 | error_rate = np.sum(pred != test_Y) / test_Y.shape[0] 41 | print('Test error using softmax = {}'.format(error_rate)) 42 | 43 | # do the same thing again, but output probabilities 44 | param['objective'] = 'multi:softprob' 45 | bst = xgb.train(param, xg_train, num_round, watchlist) 46 | # Note: this convention has been changed since xgboost-unity 47 | # get prediction, this is in 1D array, need reshape to (ndata, nclass) 48 | pred_prob = bst.predict(xg_test).reshape(test_Y.shape[0], 6) 49 | pred_label = np.argmax(pred_prob, axis=1) 50 | error_rate = np.sum(pred_label != test_Y) / test_Y.shape[0] 51 | print('Test error using softprob = {}'.format(error_rate)) 52 | -------------------------------------------------------------------------------- /XGBoost_demo/rank/README.md: -------------------------------------------------------------------------------- 1 | Learning to rank 2 | ==== 3 | XGBoost supports ranking tasks. In ranking scenarios, data are often grouped, and we need the [group information file](../../doc/input_format.md#group-input-format) to specify the ranking task. The model used in XGBoost for ranking is LambdaRank; this functionality is not yet complete. Currently, we provide pairwise ranking. 4 | 5 | ### Parameters 6 | The configuration settings are similar to the regression and binary classification settings, except that the user needs to specify the objective: 7 | 8 | ``` 9 | ... 10 | objective="rank:pairwise" 11 | ...
12 | ``` 13 | For more usage details please refer to the [binary classification demo](../binary_classification), 14 | 15 | Instructions 16 | ==== 17 | The dataset for ranking demo is from LETOR04 MQ2008 fold1, 18 | You can use the following command to run the example 19 | 20 | Get the data: ./wgetdata.sh 21 | Run the example: ./runexp.sh 22 | -------------------------------------------------------------------------------- /XGBoost_demo/rank/mq2008.conf: -------------------------------------------------------------------------------- 1 | # General Parameters, see comment for each definition 2 | 3 | # specify objective 4 | objective="rank:pairwise" 5 | 6 | # Tree Booster Parameters 7 | # step size shrinkage 8 | eta = 0.1 9 | # minimum loss reduction required to make a further partition 10 | gamma = 1.0 11 | # minimum sum of instance weight(hessian) needed in a child 12 | min_child_weight = 0.1 13 | # maximum depth of a tree 14 | max_depth = 6 15 | 16 | # Task parameters 17 | # the number of round to do boosting 18 | num_round = 4 19 | # 0 means do not save any model except the final round model 20 | save_period = 0 21 | # The path of training data 22 | data = "mq2008.train" 23 | # The path of validation data, used to monitor training process, here [test] sets name of the validation set 24 | eval[test] = "mq2008.vali" 25 | # The path of test data 26 | test:data = "mq2008.test" 27 | 28 | 29 | -------------------------------------------------------------------------------- /XGBoost_demo/rank/runexp.sh: -------------------------------------------------------------------------------- 1 | python trans_data.py train.txt mq2008.train mq2008.train.group 2 | 3 | python trans_data.py test.txt mq2008.test mq2008.test.group 4 | 5 | python trans_data.py vali.txt mq2008.vali mq2008.vali.group 6 | 7 | ../../xgboost mq2008.conf 8 | 9 | ../../xgboost mq2008.conf task=pred model_in=0004.model 10 | 11 | 12 | -------------------------------------------------------------------------------- /XGBoost_demo/rank/trans_data.py: -------------------------------------------------------------------------------- 1 | import sys 2 | 3 | def save_data(group_data,output_feature,output_group): 4 | if len(group_data) == 0: 5 | return 6 | 7 | output_group.write(str(len(group_data))+"\n") 8 | for data in group_data: 9 | # only include nonzero features 10 | feats = [ p for p in data[2:] if float(p.split(':')[1]) != 0.0 ] 11 | output_feature.write(data[0] + " " + " ".join(feats) + "\n") 12 | 13 | if __name__ == "__main__": 14 | if len(sys.argv) != 4: 15 | print ("Usage: python trans_data.py [Ranksvm Format Input] [Output Feature File] [Output Group File]") 16 | sys.exit(0) 17 | 18 | fi = open(sys.argv[1]) 19 | output_feature = open(sys.argv[2],"w") 20 | output_group = open(sys.argv[3],"w") 21 | 22 | group_data = [] 23 | group = "" 24 | for line in fi: 25 | if not line: 26 | break 27 | if "#" in line: 28 | line = line[:line.index("#")] 29 | splits = line.strip().split(" ") 30 | if splits[1] != group: 31 | save_data(group_data,output_feature,output_group) 32 | group_data = [] 33 | group = splits[1] 34 | group_data.append(splits) 35 | 36 | save_data(group_data,output_feature,output_group) 37 | 38 | fi.close() 39 | output_feature.close() 40 | output_group.close() 41 | 42 | -------------------------------------------------------------------------------- /XGBoost_demo/rank/wgetdata.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | wget 
http://research.microsoft.com/en-us/um/beijing/projects/letor/LETOR4.0/Data/MQ2008.rar 3 | unrar x MQ2008.rar 4 | mv -f MQ2008/Fold1/*.txt . 5 | -------------------------------------------------------------------------------- /XGBoost_demo/regression/README.md: -------------------------------------------------------------------------------- 1 | Regression 2 | ==== 3 | Using XGBoost for regression is very similar to using it for binary classification. We suggest that you refer to the [binary classification demo](../binary_classification) first. In XGBoost, if we use negative log likelihood as the loss function for regression, the training procedure is the same as training a binary classifier with XGBoost. 4 | 5 | ### Tutorial 6 | The dataset we use is the [computer hardware dataset from the UCI repository](https://archive.ics.uci.edu/ml/datasets/Computer+Hardware). The demo for regression is almost the same as the [binary classification demo](../binary_classification), except for a small difference in the general parameters: 7 | ``` 8 | # General parameter 9 | # this is the only difference with classification, use reg:linear to do linear regression 10 | # when labels are in [0,1] we can also use reg:logistic 11 | objective = reg:linear 12 | ... 13 | 14 | ``` 15 | 16 | The input format is the same as for binary classification, except that the label is now the target regression value. We use linear regression here; if we want to use logistic regression (objective = reg:logistic), the labels need to be pre-scaled into [0,1]. 17 | 18 | -------------------------------------------------------------------------------- /XGBoost_demo/regression/machine.conf: -------------------------------------------------------------------------------- 1 | # General Parameters, see comment for each definition 2 | # choose the tree booster, can also change to gblinear 3 | booster = gbtree 4 | # this is the only difference with classification, use reg:linear to do linear regression 5 | # when labels are in [0,1] we can also use reg:logistic 6 | objective = reg:linear 7 | 8 | # Tree Booster Parameters 9 | # step size shrinkage 10 | eta = 1.0 11 | # minimum loss reduction required to make a further partition 12 | gamma = 1.0 13 | # minimum sum of instance weight(hessian) needed in a child 14 | min_child_weight = 1 15 | # maximum depth of a tree 16 | max_depth = 3 17 | 18 | # Task parameters 19 | # the number of round to do boosting 20 | num_round = 2 21 | # 0 means do not save any model except the final round model 22 | save_period = 0 23 | # The path of training data 24 | data = "machine.txt.train" 25 | # The path of validation data, used to monitor training process, here [test] sets name of the validation set 26 | eval[test] = "machine.txt.test" 27 | # The path of test data 28 | test:data = "machine.txt.test" 29 | 30 | 31 | -------------------------------------------------------------------------------- /XGBoost_demo/regression/machine.data: -------------------------------------------------------------------------------- 1 | adviser,32/60,125,256,6000,256,16,128,198,199 2 | amdahl,470v/7,29,8000,32000,32,8,32,269,253 3 | amdahl,470v/7a,29,8000,32000,32,8,32,220,253 4 | amdahl,470v/7b,29,8000,32000,32,8,32,172,253 5 | amdahl,470v/7c,29,8000,16000,32,8,16,132,132 6 | amdahl,470v/b,26,8000,32000,64,8,32,318,290 7 | amdahl,580-5840,23,16000,32000,64,16,32,367,381 8 | amdahl,580-5850,23,16000,32000,64,16,32,489,381 9 | amdahl,580-5860,23,16000,64000,64,16,32,636,749 10 | amdahl,580-5880,23,32000,64000,128,32,64,1144,1238 11 |
apollo,dn320,400,1000,3000,0,1,2,38,23 12 | apollo,dn420,400,512,3500,4,1,6,40,24 13 | basf,7/65,60,2000,8000,65,1,8,92,70 14 | basf,7/68,50,4000,16000,65,1,8,138,117 15 | bti,5000,350,64,64,0,1,4,10,15 16 | bti,8000,200,512,16000,0,4,32,35,64 17 | burroughs,b1955,167,524,2000,8,4,15,19,23 18 | burroughs,b2900,143,512,5000,0,7,32,28,29 19 | burroughs,b2925,143,1000,2000,0,5,16,31,22 20 | burroughs,b4955,110,5000,5000,142,8,64,120,124 21 | burroughs,b5900,143,1500,6300,0,5,32,30,35 22 | burroughs,b5920,143,3100,6200,0,5,20,33,39 23 | burroughs,b6900,143,2300,6200,0,6,64,61,40 24 | burroughs,b6925,110,3100,6200,0,6,64,76,45 25 | c.r.d,68/10-80,320,128,6000,0,1,12,23,28 26 | c.r.d,universe:2203t,320,512,2000,4,1,3,69,21 27 | c.r.d,universe:68,320,256,6000,0,1,6,33,28 28 | c.r.d,universe:68/05,320,256,3000,4,1,3,27,22 29 | c.r.d,universe:68/137,320,512,5000,4,1,5,77,28 30 | c.r.d,universe:68/37,320,256,5000,4,1,6,27,27 31 | cdc,cyber:170/750,25,1310,2620,131,12,24,274,102 32 | cdc,cyber:170/760,25,1310,2620,131,12,24,368,102 33 | cdc,cyber:170/815,50,2620,10480,30,12,24,32,74 34 | cdc,cyber:170/825,50,2620,10480,30,12,24,63,74 35 | cdc,cyber:170/835,56,5240,20970,30,12,24,106,138 36 | cdc,cyber:170/845,64,5240,20970,30,12,24,208,136 37 | cdc,omega:480-i,50,500,2000,8,1,4,20,23 38 | cdc,omega:480-ii,50,1000,4000,8,1,5,29,29 39 | cdc,omega:480-iii,50,2000,8000,8,1,5,71,44 40 | cambex,1636-1,50,1000,4000,8,3,5,26,30 41 | cambex,1636-10,50,1000,8000,8,3,5,36,41 42 | cambex,1641-1,50,2000,16000,8,3,5,40,74 43 | cambex,1641-11,50,2000,16000,8,3,6,52,74 44 | cambex,1651-1,50,2000,16000,8,3,6,60,74 45 | dec,decsys:10:1091,133,1000,12000,9,3,12,72,54 46 | dec,decsys:20:2060,133,1000,8000,9,3,12,72,41 47 | dec,microvax-1,810,512,512,8,1,1,18,18 48 | dec,vax:11/730,810,1000,5000,0,1,1,20,28 49 | dec,vax:11/750,320,512,8000,4,1,5,40,36 50 | dec,vax:11/780,200,512,8000,8,1,8,62,38 51 | dg,eclipse:c/350,700,384,8000,0,1,1,24,34 52 | dg,eclipse:m/600,700,256,2000,0,1,1,24,19 53 | dg,eclipse:mv/10000,140,1000,16000,16,1,3,138,72 54 | dg,eclipse:mv/4000,200,1000,8000,0,1,2,36,36 55 | dg,eclipse:mv/6000,110,1000,4000,16,1,2,26,30 56 | dg,eclipse:mv/8000,110,1000,12000,16,1,2,60,56 57 | dg,eclipse:mv/8000-ii,220,1000,8000,16,1,2,71,42 58 | formation,f4000/100,800,256,8000,0,1,4,12,34 59 | formation,f4000/200,800,256,8000,0,1,4,14,34 60 | formation,f4000/200ap,800,256,8000,0,1,4,20,34 61 | formation,f4000/300,800,256,8000,0,1,4,16,34 62 | formation,f4000/300ap,800,256,8000,0,1,4,22,34 63 | four-phase,2000/260,125,512,1000,0,8,20,36,19 64 | gould,concept:32/8705,75,2000,8000,64,1,38,144,75 65 | gould,concept:32/8750,75,2000,16000,64,1,38,144,113 66 | gould,concept:32/8780,75,2000,16000,128,1,38,259,157 67 | hp,3000/30,90,256,1000,0,3,10,17,18 68 | hp,3000/40,105,256,2000,0,3,10,26,20 69 | hp,3000/44,105,1000,4000,0,3,24,32,28 70 | hp,3000/48,105,2000,4000,8,3,19,32,33 71 | hp,3000/64,75,2000,8000,8,3,24,62,47 72 | hp,3000/88,75,3000,8000,8,3,48,64,54 73 | hp,3000/iii,175,256,2000,0,3,24,22,20 74 | harris,100,300,768,3000,0,6,24,36,23 75 | harris,300,300,768,3000,6,6,24,44,25 76 | harris,500,300,768,12000,6,6,24,50,52 77 | harris,600,300,768,4500,0,1,24,45,27 78 | harris,700,300,384,12000,6,1,24,53,50 79 | harris,80,300,192,768,6,6,24,36,18 80 | harris,800,180,768,12000,6,1,31,84,53 81 | honeywell,dps:6/35,330,1000,3000,0,2,4,16,23 82 | honeywell,dps:6/92,300,1000,4000,8,3,64,38,30 83 | honeywell,dps:6/96,300,1000,16000,8,2,112,38,73 84 | honeywell,dps:7/35,330,1000,2000,0,1,2,16,20 85 | 
honeywell,dps:7/45,330,1000,4000,0,3,6,22,25 86 | honeywell,dps:7/55,140,2000,4000,0,3,6,29,28 87 | honeywell,dps:7/65,140,2000,4000,0,4,8,40,29 88 | honeywell,dps:8/44,140,2000,4000,8,1,20,35,32 89 | honeywell,dps:8/49,140,2000,32000,32,1,20,134,175 90 | honeywell,dps:8/50,140,2000,8000,32,1,54,66,57 91 | honeywell,dps:8/52,140,2000,32000,32,1,54,141,181 92 | honeywell,dps:8/62,140,2000,32000,32,1,54,189,181 93 | honeywell,dps:8/20,140,2000,4000,8,1,20,22,32 94 | ibm,3033:s,57,4000,16000,1,6,12,132,82 95 | ibm,3033:u,57,4000,24000,64,12,16,237,171 96 | ibm,3081,26,16000,32000,64,16,24,465,361 97 | ibm,3081:d,26,16000,32000,64,8,24,465,350 98 | ibm,3083:b,26,8000,32000,0,8,24,277,220 99 | ibm,3083:e,26,8000,16000,0,8,16,185,113 100 | ibm,370/125-2,480,96,512,0,1,1,6,15 101 | ibm,370/148,203,1000,2000,0,1,5,24,21 102 | ibm,370/158-3,115,512,6000,16,1,6,45,35 103 | ibm,38/3,1100,512,1500,0,1,1,7,18 104 | ibm,38/4,1100,768,2000,0,1,1,13,20 105 | ibm,38/5,600,768,2000,0,1,1,16,20 106 | ibm,38/7,400,2000,4000,0,1,1,32,28 107 | ibm,38/8,400,4000,8000,0,1,1,32,45 108 | ibm,4321,900,1000,1000,0,1,2,11,18 109 | ibm,4331-1,900,512,1000,0,1,2,11,17 110 | ibm,4331-11,900,1000,4000,4,1,2,18,26 111 | ibm,4331-2,900,1000,4000,8,1,2,22,28 112 | ibm,4341,900,2000,4000,0,3,6,37,28 113 | ibm,4341-1,225,2000,4000,8,3,6,40,31 114 | ibm,4341-10,225,2000,4000,8,3,6,34,31 115 | ibm,4341-11,180,2000,8000,8,1,6,50,42 116 | ibm,4341-12,185,2000,16000,16,1,6,76,76 117 | ibm,4341-2,180,2000,16000,16,1,6,66,76 118 | ibm,4341-9,225,1000,4000,2,3,6,24,26 119 | ibm,4361-4,25,2000,12000,8,1,4,49,59 120 | ibm,4361-5,25,2000,12000,16,3,5,66,65 121 | ibm,4381-1,17,4000,16000,8,6,12,100,101 122 | ibm,4381-2,17,4000,16000,32,6,12,133,116 123 | ibm,8130-a,1500,768,1000,0,0,0,12,18 124 | ibm,8130-b,1500,768,2000,0,0,0,18,20 125 | ibm,8140,800,768,2000,0,0,0,20,20 126 | ipl,4436,50,2000,4000,0,3,6,27,30 127 | ipl,4443,50,2000,8000,8,3,6,45,44 128 | ipl,4445,50,2000,8000,8,1,6,56,44 129 | ipl,4446,50,2000,16000,24,1,6,70,82 130 | ipl,4460,50,2000,16000,24,1,6,80,82 131 | ipl,4480,50,8000,16000,48,1,10,136,128 132 | magnuson,m80/30,100,1000,8000,0,2,6,16,37 133 | magnuson,m80/31,100,1000,8000,24,2,6,26,46 134 | magnuson,m80/32,100,1000,8000,24,3,6,32,46 135 | magnuson,m80/42,50,2000,16000,12,3,16,45,80 136 | magnuson,m80/43,50,2000,16000,24,6,16,54,88 137 | magnuson,m80/44,50,2000,16000,24,6,16,65,88 138 | microdata,seq.ms/3200,150,512,4000,0,8,128,30,33 139 | nas,as/3000,115,2000,8000,16,1,3,50,46 140 | nas,as/3000-n,115,2000,4000,2,1,5,40,29 141 | nas,as/5000,92,2000,8000,32,1,6,62,53 142 | nas,as/5000-e,92,2000,8000,32,1,6,60,53 143 | nas,as/5000-n,92,2000,8000,4,1,6,50,41 144 | nas,as/6130,75,4000,16000,16,1,6,66,86 145 | nas,as/6150,60,4000,16000,32,1,6,86,95 146 | nas,as/6620,60,2000,16000,64,5,8,74,107 147 | nas,as/6630,60,4000,16000,64,5,8,93,117 148 | nas,as/6650,50,4000,16000,64,5,10,111,119 149 | nas,as/7000,72,4000,16000,64,8,16,143,120 150 | nas,as/7000-n,72,2000,8000,16,6,8,105,48 151 | nas,as/8040,40,8000,16000,32,8,16,214,126 152 | nas,as/8050,40,8000,32000,64,8,24,277,266 153 | nas,as/8060,35,8000,32000,64,8,24,370,270 154 | nas,as/9000-dpc,38,16000,32000,128,16,32,510,426 155 | nas,as/9000-n,48,4000,24000,32,8,24,214,151 156 | nas,as/9040,38,8000,32000,64,8,24,326,267 157 | nas,as/9060,30,16000,32000,256,16,24,510,603 158 | ncr,v8535:ii,112,1000,1000,0,1,4,8,19 159 | ncr,v8545:ii,84,1000,2000,0,1,6,12,21 160 | ncr,v8555:ii,56,1000,4000,0,1,6,17,26 161 | ncr,v8565:ii,56,2000,6000,0,1,8,21,35 162 | 
ncr,v8565:ii-e,56,2000,8000,0,1,8,24,41 163 | ncr,v8575:ii,56,4000,8000,0,1,8,34,47 164 | ncr,v8585:ii,56,4000,12000,0,1,8,42,62 165 | ncr,v8595:ii,56,4000,16000,0,1,8,46,78 166 | ncr,v8635,38,4000,8000,32,16,32,51,80 167 | ncr,v8650,38,4000,8000,32,16,32,116,80 168 | ncr,v8655,38,8000,16000,64,4,8,100,142 169 | ncr,v8665,38,8000,24000,160,4,8,140,281 170 | ncr,v8670,38,4000,16000,128,16,32,212,190 171 | nixdorf,8890/30,200,1000,2000,0,1,2,25,21 172 | nixdorf,8890/50,200,1000,4000,0,1,4,30,25 173 | nixdorf,8890/70,200,2000,8000,64,1,5,41,67 174 | perkin-elmer,3205,250,512,4000,0,1,7,25,24 175 | perkin-elmer,3210,250,512,4000,0,4,7,50,24 176 | perkin-elmer,3230,250,1000,16000,1,1,8,50,64 177 | prime,50-2250,160,512,4000,2,1,5,30,25 178 | prime,50-250-ii,160,512,2000,2,3,8,32,20 179 | prime,50-550-ii,160,1000,4000,8,1,14,38,29 180 | prime,50-750-ii,160,1000,8000,16,1,14,60,43 181 | prime,50-850-ii,160,2000,8000,32,1,13,109,53 182 | siemens,7.521,240,512,1000,8,1,3,6,19 183 | siemens,7.531,240,512,2000,8,1,5,11,22 184 | siemens,7.536,105,2000,4000,8,3,8,22,31 185 | siemens,7.541,105,2000,6000,16,6,16,33,41 186 | siemens,7.551,105,2000,8000,16,4,14,58,47 187 | siemens,7.561,52,4000,16000,32,4,12,130,99 188 | siemens,7.865-2,70,4000,12000,8,6,8,75,67 189 | siemens,7.870-2,59,4000,12000,32,6,12,113,81 190 | siemens,7.872-2,59,8000,16000,64,12,24,188,149 191 | siemens,7.875-2,26,8000,24000,32,8,16,173,183 192 | siemens,7.880-2,26,8000,32000,64,12,16,248,275 193 | siemens,7.881-2,26,8000,32000,128,24,32,405,382 194 | sperry,1100/61-h1,116,2000,8000,32,5,28,70,56 195 | sperry,1100/81,50,2000,32000,24,6,26,114,182 196 | sperry,1100/82,50,2000,32000,48,26,52,208,227 197 | sperry,1100/83,50,2000,32000,112,52,104,307,341 198 | sperry,1100/84,50,4000,32000,112,52,104,397,360 199 | sperry,1100/93,30,8000,64000,96,12,176,915,919 200 | sperry,1100/94,30,8000,64000,128,12,176,1150,978 201 | sperry,80/3,180,262,4000,0,1,3,12,24 202 | sperry,80/4,180,512,4000,0,1,3,14,24 203 | sperry,80/5,180,262,4000,0,1,3,18,24 204 | sperry,80/6,180,512,4000,0,1,3,21,24 205 | sperry,80/8,124,1000,8000,0,1,8,42,37 206 | sperry,90/80-model-3,98,1000,8000,32,2,8,46,50 207 | sratus,32,125,2000,8000,0,2,14,52,41 208 | wang,vs-100,480,512,8000,32,0,0,67,47 209 | wang,vs-90,480,1000,4000,0,0,0,45,25 210 | -------------------------------------------------------------------------------- /XGBoost_demo/regression/machine.names: -------------------------------------------------------------------------------- 1 | 1. Title: Relative CPU Performance Data 2 | 3 | 2. Source Information 4 | -- Creators: Phillip Ein-Dor and Jacob Feldmesser 5 | -- Ein-Dor: Faculty of Management; Tel Aviv University; Ramat-Aviv; 6 | Tel Aviv, 69978; Israel 7 | -- Donor: David W. Aha (aha@ics.uci.edu) (714) 856-8779 8 | -- Date: October, 1987 9 | 10 | 3. Past Usage: 11 | 1. Ein-Dor and Feldmesser (CACM 4/87, pp 308-317) 12 | -- Results: 13 | -- linear regression prediction of relative cpu performance 14 | -- Recorded 34% average deviation from actual values 15 | 2. Kibler,D. & Aha,D. (1988). Instance-Based Prediction of 16 | Real-Valued Attributes. In Proceedings of the CSCSI (Canadian 17 | AI) Conference. 18 | -- Results: 19 | -- instance-based prediction of relative cpu performance 20 | -- similar results; no transformations required 21 | - Predicted attribute: cpu relative performance (numeric) 22 | 23 | 4. Relevant Information: 24 | -- The estimated relative performance values were estimated by the authors 25 | using a linear regression method. 
See their article (pp 308-313) for 26 | more details on how the relative performance values were set. 27 | 28 | 5. Number of Instances: 209 29 | 30 | 6. Number of Attributes: 10 (6 predictive attributes, 2 non-predictive, 31 | 1 goal field, and the linear regression's guess) 32 | 33 | 7. Attribute Information: 34 | 1. vendor name: 30 35 | (adviser, amdahl,apollo, basf, bti, burroughs, c.r.d, cambex, cdc, dec, 36 | dg, formation, four-phase, gould, honeywell, hp, ibm, ipl, magnuson, 37 | microdata, nas, ncr, nixdorf, perkin-elmer, prime, siemens, sperry, 38 | sratus, wang) 39 | 2. Model Name: many unique symbols 40 | 3. MYCT: machine cycle time in nanoseconds (integer) 41 | 4. MMIN: minimum main memory in kilobytes (integer) 42 | 5. MMAX: maximum main memory in kilobytes (integer) 43 | 6. CACH: cache memory in kilobytes (integer) 44 | 7. CHMIN: minimum channels in units (integer) 45 | 8. CHMAX: maximum channels in units (integer) 46 | 9. PRP: published relative performance (integer) 47 | 10. ERP: estimated relative performance from the original article (integer) 48 | 49 | 8. Missing Attribute Values: None 50 | 51 | 9. Class Distribution: the class value (PRP) is continuously valued. 52 | PRP Value Range: Number of Instances in Range: 53 | 0-20 31 54 | 21-100 121 55 | 101-200 27 56 | 201-300 13 57 | 301-400 7 58 | 401-500 4 59 | 501-600 2 60 | above 600 4 61 | 62 | Summary Statistics: 63 | Min Max Mean SD PRP Correlation 64 | MCYT: 17 1500 203.8 260.3 -0.3071 65 | MMIN: 64 32000 2868.0 3878.7 0.7949 66 | MMAX: 64 64000 11796.1 11726.6 0.8630 67 | CACH: 0 256 25.2 40.6 0.6626 68 | CHMIN: 0 52 4.7 6.8 0.6089 69 | CHMAX: 0 176 18.2 26.0 0.6052 70 | PRP: 6 1150 105.6 160.8 1.0000 71 | ERP: 15 1238 99.3 154.8 0.9665 72 | 73 | -------------------------------------------------------------------------------- /XGBoost_demo/regression/mapfeat.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | fo = open( 'machine.txt', 'w' ) 4 | cnt = 6 5 | fmap = {} 6 | for l in open( 'machine.data' ): 7 | arr = l.split(',') 8 | fo.write(arr[8]) 9 | for i in range( 0,6 ): 10 | fo.write( ' %d:%s' %(i,arr[i+2]) ) 11 | 12 | if arr[0] not in fmap: 13 | fmap[arr[0]] = cnt 14 | cnt += 1 15 | 16 | fo.write( ' %d:1' % fmap[arr[0]] ) 17 | fo.write('\n') 18 | 19 | fo.close() 20 | 21 | # create feature map for machine data 22 | fo = open('featmap.txt', 'w') 23 | # list from machine.names 24 | names = ['vendor','MYCT', 'MMIN', 'MMAX', 'CACH', 'CHMIN', 'CHMAX', 'PRP', 'ERP' ]; 25 | 26 | for i in range(0,6): 27 | fo.write( '%d\t%s\tint\n' % (i, names[i+1])) 28 | 29 | for v, k in sorted( fmap.items(), key = lambda x:x[1] ): 30 | fo.write( '%d\tvendor=%s\ti\n' % (k, v)) 31 | fo.close() 32 | -------------------------------------------------------------------------------- /XGBoost_demo/regression/mknfold.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import sys 3 | import random 4 | 5 | if len(sys.argv) < 2: 6 | print ('Usage: [nfold = 5]') 7 | exit(0) 8 | 9 | random.seed( 10 ) 10 | 11 | k = int( sys.argv[2] ) 12 | if len(sys.argv) > 3: 13 | nfold = int( sys.argv[3] ) 14 | else: 15 | nfold = 5 16 | 17 | fi = open( sys.argv[1], 'r' ) 18 | ftr = open( sys.argv[1]+'.train', 'w' ) 19 | fte = open( sys.argv[1]+'.test', 'w' ) 20 | for l in fi: 21 | if random.randint( 1 , nfold ) == k: 22 | fte.write( l ) 23 | else: 24 | ftr.write( l ) 25 | 26 | fi.close() 27 | ftr.close() 28 | fte.close() 29 | 30 | 
-------------------------------------------------------------------------------- /XGBoost_demo/regression/runexp.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # map the data to features. For convenience we only use 7 original attributes and encode them as features in a trivial way 3 | python mapfeat.py 4 | # split train and test 5 | python mknfold.py machine.txt 1 6 | # training and output the models 7 | ../../xgboost machine.conf 8 | # output predictions of test data 9 | ../../xgboost machine.conf task=pred model_in=0002.model 10 | # print the boosters of 0002.model in dump.raw.txt 11 | ../../xgboost machine.conf task=dump model_in=0002.model name_dump=dump.raw.txt 12 | # print the boosters of 0002.model in dump.nice.txt with feature map 13 | ../../xgboost machine.conf task=dump model_in=0002.model fmap=featmap.txt name_dump=dump.nice.txt 14 | 15 | # cat the result 16 | cat dump.nice.txt 17 | -------------------------------------------------------------------------------- /XGBoost_demo/yearpredMSD/README.md: -------------------------------------------------------------------------------- 1 | Demonstrating how to use XGBoost on the [Year Prediction task of the Million Song Dataset](https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD) 2 | 3 | 1. Run runexp.sh 4 | ```bash 5 | ./runexp.sh 6 | ``` 7 | 8 | You can also use the script to prepare the data in LIBSVM format and run the [Distributed Version](../../multi-node). 9 | Note that normally you only need a single machine for a dataset at this scale; use the distributed version for larger datasets. 10 | -------------------------------------------------------------------------------- /XGBoost_demo/yearpredMSD/csv2libsvm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | import sys 3 | 4 | if len(sys.argv) < 3: 5 | print('Usage: python csv2libsvm.py INPUT_CSV OUTPUT_LIBSVM') 6 | print('convert an all-numerical csv to libsvm') 7 | sys.exit(0) 8 | 9 | fo = open(sys.argv[2], 'w') 10 | for l in open(sys.argv[1]): 11 | arr = l.split(',') 12 | fo.write('%s' % arr[0]) 13 | for i in range(len(arr) - 1): 14 | fo.write(' %d:%s' % (i, arr[i+1])) 15 | fo.close() 16 | -------------------------------------------------------------------------------- /XGBoost_demo/yearpredMSD/runexp.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | if [ -f YearPredictionMSD.txt ] 4 | then 5 | echo "use existing data to run experiment" 6 | else 7 | echo "getting data from uci, make sure you are connected to internet" 8 | wget https://archive.ics.uci.edu/ml/machine-learning-databases/00203/YearPredictionMSD.txt.zip 9 | unzip YearPredictionMSD.txt.zip 10 | fi 11 | echo "start making data.."
12 | # map feature using indicator encoding, also produce featmap.txt 13 | python csv2libsvm.py YearPredictionMSD.txt yearpredMSD.libsvm 14 | head -n 463715 yearpredMSD.libsvm > yearpredMSD.libsvm.train 15 | tail -n 51630 yearpredMSD.libsvm > yearpredMSD.libsvm.test 16 | echo "finish making the data" 17 | ../../xgboost yearpredMSD.conf 18 | -------------------------------------------------------------------------------- /XGBoost_demo/yearpredMSD/yearpredMSD.conf: -------------------------------------------------------------------------------- 1 | # General Parameters, see comment for each definition 2 | # choose the tree booster, can also change to gblinear 3 | booster = gbtree 4 | # this is the only difference with classification, use reg:linear to do linear classification 5 | # when labels are in [0,1] we can also use reg:logistic 6 | objective = reg:linear 7 | 8 | # Tree Booster Parameters 9 | # step size shrinkage 10 | eta = 1.0 11 | # minimum loss reduction required to make a further partition 12 | gamma = 1.0 13 | # minimum sum of instance weight(hessian) needed in a child 14 | min_child_weight = 1 15 | # maximum depth of a tree 16 | max_depth = 5 17 | 18 | base_score = 2001 19 | # Task parameters 20 | # the number of round to do boosting 21 | num_round = 100 22 | # 0 means do not save any model except the final round model 23 | save_period = 0 24 | # The path of training data 25 | data = "yearpredMSD.libsvm.train" 26 | # The path of validation data, used to monitor training process, here [test] sets name of the validation set 27 | eval[test] = "yearpredMSD.libsvm.test" 28 | # The path of test data 29 | #test:data = "yearpredMSD.libsvm.test" 30 | 31 | -------------------------------------------------------------------------------- /git学习.md: -------------------------------------------------------------------------------- 1 | *git是分布式版本控制系统,跟踪管理的是修改,而非文件* 2 | 3 | ## 基本操作 4 | 1.安装git: `sudo apt-get install git` 5 | 安装后设置: `git config --global user.name "Your Name"` 6 |                        `git config --global user.email "email@example.com"` 7 | 2.初始化目录为git仓库: **`git init`** 8 |    把文件添加到仓库:**`git add `** 9 |    把文件提交到仓库:**`git commit -m <说明>`** 10 | 3.查看仓库当前的状态:**`git status`** 11 |    查看修改了哪些内容:`git diff` 12 | 4.查看提交历史:**`git log (--pretty=oneline)`** 13 |    查看命令历史:`git reflog` 14 |    回到上一版本:`git reset --hard HEAD^ ` 15 | 5.工作区和版本库间的关系 16 |     ![工作区和版本库](https://static.liaoxuefeng.com/files/attachments/919020037470528/0) 17 | 6.撤销修改 18 |    修改了但是未git add: `git checkout -- ` 19 |    git add了但未git commit: `git reset HEAD ` 20 | 7.删除文件:在目录中用rm命令删除文件后 21 |     若确实要删除文件:`git rm ` 22 |     若删错了,可恢复:`git checkout -- ` 23 | 24 | ## 远程仓库 25 | 1.关联远程库:**`git remote add origin <远程库地址>`** (关联后远程库的名字就是origin) 26 | 2.推送:**`git push (-u) origin BRANCH`** (-u第一次推送时加,将本地当前的分支推送到远程的BRANCH分支) 27 | `git push origin 本地分支名:远程分支名` (将本地某分支推送到远程某分支) 28 | 29 | 3.将远程BRANCH分支与本地整合:**`git pull origin BRANCH`** (等同于git fetch和git merge) 30 | `git pull origin 远程分支名:本地分支名` 31 | 32 | 4.克隆:**`git clone <远程库地址>`** (无需先进行关联) 33 | 5.查看远程仓库地址:`git remote -v` 34 | 35 | ## 分支管理 36 | 1.查看当前所有分支:`git branch` (`git branch -a`查看本地和远程的分支) 37 | 2.创建分支:`git branch ` 38 | 3.切换分支:`git checkout ` 39 | 4.创建+切换分支:`git checkout -b ` 40 | 5.合并某分支到当前分支:`git merge ` 41 | 6.删除分支:`git branch -d ` 42 | 7.分支冲突 (1)手动解决 (2)解决冲突后查看分支合并图:`git log --graph (--pretty=oneline)` 43 | 8.分支管理策略、Bug分支、Feature分支、多人协作(实际开发中用)参照[Git教程-廖雪峰的官方网站](https://www.liaoxuefeng.com/wiki/0013739516305929606dd18361248578c67b8067c8c017b000/) 44 | 45 | 9.比较两个分支的差异: 
46 | 47 | `git diff branch1 branch2 --stat` //显示出所有有差异的文件列表 48 | 49 | `git diff branch1 branch2 文件名(带路径)` //显示指定文件的详细差异 50 | 51 | `git diff branch1 branch2` //显示出所有有差异的文件的详细差异 52 | 53 | 10. 查看本地分支与远程分支的关联关系:`git branch -vv` 54 | 55 | 11. 更新远程分支列表: `git remote update origin --prune` 56 | 57 | 12. 拉取远程分支并创建本地分支 `git checkout -b 本地分支名x origin/远程分支名x` (采用此种方法建立的本地分支会和远程分支建立映射关系) 58 | 59 | ## 其他 60 | 1. git切换分支时会把未add或未commit的内容带过去;因为工作区和暂存区是公共的,未add和未commit的内容不属于任何一个分支。 61 | -------------------------------------------------------------------------------- /sklearn学习.md: -------------------------------------------------------------------------------- 1 | ### 目录 2 | ### 1. 分类、回归 3 | ### 2. 降维 4 | ### 3. 模型评估与选择 5 | ### 4. 数据预处理 6 | 7 | 8 | |大类 | 小类 | 适用问题 | 实现 | 说明 | 9 | |-------- | --------| -------- | -------- | -------- | 10 | | **分类、回归** | | | | | 11 | |1.1 [广义线性模型](http://scikit-learn.org/stable/modules/linear_model.html)| 1.1.1 普通最小二乘法 | 回归 | [sklearn.linear_model.LinearRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) | | 12 | | 注:本节中所有的回归模型皆为线性回归模型 | 1.1.2 Ridge/岭回归 | 回归 | [sklearn.linear_model.Ridge](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge) | 解决两类回归问题:
一是样本少于变量个数
二是变量间存在共线性 | 13 | | | 1.1.3 Lasso | 回归 | [sklearn.linear_model.Lasso](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso) | 适合特征较少的数据 | 14 | | | 1.1.4 Multi-task Lasso | 回归 | [sklearn.linear_model.MultiTaskLasso](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.MultiTaskLasso.html#sklearn.linear_model.MultiTaskLasso) | y值不是一元的回归问题 15 | | | 1.1.5 Elastic Net | 回归 | [sklearn.linear_model.ElasticNet](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html#sklearn.linear_model.ElasticNet) | 结合了Ridge和Lasso | 16 | | | 1.1.6 Multi-task Elastic Net | 回归 | [sklearn.linear_model.MultiTaskElasticNet](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.MultiTaskElasticNet.html#sklearn.linear_model.MultiTaskElasticNet) | y值不是一元的回归问题 | 17 | | | 1.1.7 Least Angle Regression(LARS) | 回归 | [sklearn.linear_model.Lars](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lars.html#sklearn.linear_model.Lars) | 适合高维数据 | 18 | | | 1.1.8 LARS Lasso | 回归 | [sklearn.linear_model.LassoLars](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoLars.html#sklearn.linear_model.LassoLars) | (1)适合高维数据使用
(2)LARS算法实现的lasso模型 | 19 | | | 1.1.9 Orthogonal Matching Pursuit (OMP) | 回归 | [sklearn.linear_model.OrthogonalMatchingPursuit](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.OrthogonalMatchingPursuit.html#sklearn.linear_model.OrthogonalMatchingPursuit) | 基于贪心算法实现 | 20 | | | 1.1.10 贝叶斯回归 | 回归 | [sklearn.linear_model.BayesianRidge](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html#sklearn.linear_model.BayesianRidge)
[sklearn.linear_model.ARDRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ARDRegression.html#sklearn.linear_model.ARDRegression)| 优点: (1)适用于手边数据(2)可用于在估计过程中包含正规化参数 
缺点:耗时 | 21 | | | 1.1.11 Logistic regression | 分类 | [sklearn.linear_model.LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression) | 22 | | | 1.1.12 SGD(随机梯度下降法) | 分类
/回归 | [sklearn.linear_model.SGDClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier)
[sklearn.linear_model.SGDRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor) | 适用于大规模数据 | 23 | | | 1.1.13 Perceptron | 分类 | [sklearn.linear_model.Perceptron](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html#sklearn.linear_model.Perceptron) | 适用于大规模数据 | 24 | | | 1.1.14 Passive Aggressive Algorithms | 分类
/回归 | [sklearn.linear_model.
PassiveAggressiveClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PassiveAggressiveClassifier.html#sklearn.linear_model.PassiveAggressiveClassifier)

[sklearn.linear_model.
PassiveAggressiveRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PassiveAggressiveRegressor.html#sklearn.linear_model.PassiveAggressiveRegressor) | 适用于大规模数据 | 25 | | | 1.1.15 Huber Regression | 回归 | [sklearn.linear_model.HuberRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.HuberRegressor.html#sklearn.linear_model.HuberRegressor) | 能够处理数据中有异常值的情况 26 | | | 1.1.16 多项式回归 | 回归 | [sklearn.preprocessing.PolynomialFeatures](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html#sklearn.preprocessing.PolynomialFeatures) | 通过PolynomialFeatures将非线性特征转化成多项式形式,再用线性模型进行处理 | 27 | | 1.2 [线性和二次判别分析](http://scikit-learn.org/stable/modules/lda_qda.html) | 1.2.1 LDA | 分类/降维 | [sklearn.discriminant_analysis.
LinearDiscriminantAnalysis](http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis) | | 28 | | | 1.2.2 QDA | 分类 | [sklearn.discriminant_analysis.
QuadraticDiscriminantAnalysis](http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis.html#sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis) | | 29 | | 1.3 [核岭回归](http://scikit-learn.org/stable/modules/kernel_ridge.html) | 简称KRR | 回归 | [sklearn.kernel_ridge.KernelRidge](http://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html#sklearn.kernel_ridge.KernelRidge) | 将核技巧应用到岭回归(1.1.2)中,以实现非线性回归 | 30 | | 1.4 [支持向量机](http://scikit-learn.org/stable/modules/svm.html) | 1.4.1 SVC,NuSVC,LinearSVC | 分类 | [sklearn.svm.SVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)
[sklearn.svm.NuSVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html#sklearn.svm.NuSVC)
[sklearn.svm.LinearSVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC)| SVC可用于非线性分类,可指定核函数;
NuSVC与SVC唯一的不同是可控制支持向量的个数;
LinearSVC用于线性分类| 31 | | | 1.4.2 SVR,NuSVR,LinearSVR | 回归 | [sklearn.svm.SVR](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR)
[sklearn.svm.NuSVR](http://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVC.html#sklearn.svm.NuSVR)
[sklearn.svm.LinearSVR](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVR)| 同上,将"分类"变成"回归"即可 | 32 | | | 1.4.3 OneClassSVM | 异常检测 | [sklearn.svm.OneClassSVM](http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html#sklearn.svm.OneClassSVM)| 无监督实现异常值检测 | 33 | | 1.5 [随机梯度下降](http://scikit-learn.org/stable/modules/sgd.html) | 同1.1.12 | | | | 34 | | 1.6 [最近邻](http://scikit-learn.org/stable/modules/neighbors.html) | 1.6.1 Unsupervised Nearest Neighbors | -- | [sklearn.neighbors.NearestNeighbors](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html#sklearn.neighbors.NearestNeighbors) | 无监督实现K近邻的寻找 | 35 | | | 1.6.2 Nearest Neighbors Classification | 分类 | [sklearn.neighbors.KNeighborsClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)
[sklearn.neighbors.RadiusNeighborsClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.RadiusNeighborsClassifier.html#sklearn.neighbors.RadiusNeighborsClassifier) | (1)不太适用于高维数据
(2)两种实现只是距离度量不一样,后者更适合非均匀的采样 | 36 | | | 1.6.3 Nearest Neighbors Regression| 回归 | [sklearn.neighbors.KNeighborsRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor)
[sklearn.neighbors.RadiusNeighborsRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.RadiusNeighborsRegressor.html#sklearn.neighbors.RadiusNeighborsRegressor) | 同上 | 37 | | | 1.6.5 Nearest Centroid Classifier | 分类 | [sklearn.neighbors.NearestCentroid](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestCentroid.html#sklearn.neighbors.NearestCentroid) | 每个类对应一个质心,测试样本被分类到距离最近的质心所在的类别 | 38 | | 1.7 [高斯过程(GP/GPML)](http://scikit-learn.org/stable/modules/gaussian_process.html) | 1.7.1 GPR | 回归 | [sklearn.gaussian_process.
GaussianProcessRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessRegressor.html#sklearn.gaussian_process.GaussianProcessRegressor) | 与KRR一样使用了核技巧 | 39 | | | 1.7.3 GPC | 分类 | [sklearn.gaussian_process.
GaussianProcessClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessClassifier.html#sklearn.gaussian_process.GaussianProcessClassifier) | | 40 | | 1.8 [交叉分解](http://scikit-learn.org/stable/modules/cross_decomposition.html) | 实现算法:CCA和PLS | -- | -- | 用来计算两个多元数据集的线性关系,当预测数据比观测数据有更多的变量时,用PLS更好 | 41 | | 1.9 [朴素贝叶斯](http://scikit-learn.org/stable/modules/naive_bayes.html) | 1.9.1 高斯朴素贝叶斯 | 分类 | [sklearn.naive_bayes.GaussianNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB) | 处理特征是连续型变量的情况 | 42 | | | 1.9.2 多项式朴素贝叶斯 | 分类 | [sklearn.naive_bayes.MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB) | 最常见,要求特征是离散数据 | 43 | | | 1.9.3 伯努利朴素贝叶斯 | 分类 | [sklearn.naive_bayes.BernoulliNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB) | 要求特征是离散的,且为布尔类型,即true和false,或者1和0 | 44 | | 1.10 [决策树](http://scikit-learn.org/stable/modules/tree.html) | 1.10.1 Classification | 分类 | [sklearn.tree.DecisionTreeClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) | | 45 | | | 1.10.2 Regression | 回归 | [sklearn.tree.DecisionTreeRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor) | | 46 | | 1.11 [集成方法](http://scikit-learn.org/stable/modules/ensemble.html) | 1.11.1 Bagging | 分类/回归 | [sklearn.ensemble.BaggingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier)
[sklearn.ensemble.BaggingRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html#sklearn.ensemble.BaggingRegressor) | 可以指定基学习器,默认为决策树 | 47 | | 注:1和2属于集成方法中的并行化方法,3和4属于序列化方法 | 1.11.2 Forests of randomized trees | 分类/回归 | RandomForest(RF,随机森林):
[sklearn.ensemble.RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)
[sklearn.ensemble.RandomForestRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor)
ExtraTrees(RF改进):
[sklearn.ensemble.ExtraTreesClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn.ensemble.ExtraTreesClassifier)
[sklearn.ensemble.ExtraTreesRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html#sklearn.ensemble.ExtraTreesRegressor) | 基学习器为决策树 | 48 | | | 1.11.3 AdaBoost | 分类/回归 | [sklearn.ensemble.AdaBoostClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier)
[sklearn.ensemble.AdaBoostRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html#sklearn.ensemble.AdaBoostRegressor) | 可以指定基学习器,默认为决策树 | 49 | | 号外:最近特别火的两个梯度提升算法,LightGBM和XGBoost
(XGBoost提供了sklearn接口) | 1.11.4 Gradient Tree Boosting| 分类/回归 | GBDT:
[sklearn.ensemble.GradientBoostingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier)
GBRT:
[sklearn.ensemble.GradientBoostingRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor) | 基学习器为决策树 | 50 | | | 1.11.5 Voting Classifier | 分类 | [sklearn.ensemble.VotingClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html#sklearn.ensemble.VotingClassifier) | 须指定基学习器 | 51 | | 1.12 [多类与多标签算法](http://scikit-learn.org/stable/modules/multiclass.html) | -- | -- | -- | **sklearn中的分类算法都默认支持多类分类**,其中LinearSVC、 LogisticRegression和GaussianProcessClassifier在进行多类分类时需指定参数multi_class | 52 | | 1.13 [特征选择](http://scikit-learn.org/stable/modules/feature_selection.html) | 1.13.1 过滤法之方差选择法 | 特征选择 | [sklearn.feature_selection.VarianceThreshold](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold) | 特征选择方法分为3种:过滤法、包裹法和嵌入法。过滤法不用考虑后续学习器 | 53 | | | 1.13.2 过滤法之卡方检验 | 特征选择 | [sklearn.feature_selection.chi2](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html#sklearn.feature_selection.chi2),结合[sklearn.feature_selection.SelectKBest](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest) | | 54 | | | 1.13.3 包裹法之递归特征消除法 | 特征选择 | [sklearn.feature_selection.RFE](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE) | 包裹法需考虑后续学习器,参数中需输入基学习器 | 55 | | | 1.13.4 嵌入法 | 特征选择 | [sklearn.feature_selection.SelectFromModel](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel) | 嵌入法是过滤法和嵌入法的结合,参数中也需输入基学习器 | 56 | | 1.14 [半监督](http://scikit-learn.org/stable/modules/label_propagation.html) | 1.14.1 Label Propagation | 分类/回归 | [sklearn.semi_supervised.LabelPropagation](http://scikit-learn.org/stable/modules/generated/sklearn.semi_supervised.LabelPropagation.html#sklearn.semi_supervised.LabelPropagation)
[sklearn.semi_supervised.LabelSpreading](http://scikit-learn.org/stable/modules/generated/sklearn.semi_supervised.LabelSpreading.html#sklearn.semi_supervised.LabelSpreading) | | 57 | | 1.15 [保序回归](http://scikit-learn.org/stable/modules/isotonic.html) | -- | 回归 | [sklearn.isotonic.IsotonicRegression](http://scikit-learn.org/stable/modules/generated/sklearn.isotonic.IsotonicRegression.html#sklearn.isotonic.IsotonicRegression) | | 58 | | 1.16 [概率校准](http://scikit-learn.org/stable/modules/calibration.html) | -- | -- | -- | 在执行分类时,获得预测的标签的概率 | 59 | | 1.17 [神经网络模型](http://scikit-learn.org/stable/modules/neural_networks_supervised.html) | (待写) | | | | 60 | | **降维** | | | | | 61 | | 2.5 [降维](http://scikit-learn.org/stable/modules/decomposition.html#decompositions) | 2.5.1 主成分分析 | 降维 | PCA:
[sklearn.decomposition.PCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA)
IPCA:
[sklearn.decomposition.IncrementalPCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.IncrementalPCA.html#sklearn.decomposition.IncrementalPCA)
KPCA:
[sklearn.decomposition.KernelPCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html#sklearn.decomposition.KernelPCA)
SPCA:
[sklearn.decomposition.SparsePCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.SparsePCA.html#sklearn.decomposition.SparsePCA) | (1)IPCA比PCA有更好的内存效率,适合超大规模降维。
(2)KPCA可以进行非线性降维
(3)SPCA是PCA的变体,降维后返回最佳的稀疏矩阵 | 62 | | | 2.5.2 截断奇异值分解 | 降维 | [sklearn.decomposition.TruncatedSVD](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD)| 可以直接对scipy.sparse矩阵处理 | 63 | | | 2.5.3 字典学习 | -- | [sklearn.decomposition.SparseCoder](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.SparseCoder.html#sklearn.decomposition.SparseCoder)
[sklearn.decomposition.DictionaryLearning](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.DictionaryLearning.html#sklearn.decomposition.DictionaryLearning) | SparseCoder实现稀疏编码,DictionaryLearning实现字典学习 | 64 | | **模型评估与选择** | | | | | 65 | | 3.1 [交叉验证/CV](http://scikit-learn.org/stable/modules/cross_validation.html) | 3.1.1 分割训练集和测试集 | -- | [sklearn.model_selection.train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split) | | 66 | | | 3.1.2 通过交叉验证评估score |--| [sklearn.model_selection.cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score) | score对应性能度量,分类问题默认为accuracy_score,回归问题默认为r2_score | 67 | | | 3.1.3 留一法LOO | -- | [sklearn.model_selection.LeaveOneOut](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html#sklearn.model_selection.LeaveOneOut) | CV的特例 | 68 | | | 3.1.4 留P法LPO | -- | [sklearn.model_selection.LeavePOut](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeavePOut.html#sklearn.model_selection.LeavePOut) | CV的特例 | 69 | | 3.2 [调参](http://scikit-learn.org/stable/modules/grid_search.html) | 3.2.1 网格搜索 | -- | [sklearn.model_selection.GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) | 最常用的调参方法。可传入学习器、学习器参数范围、性能度量score(默认为accuracy_score或r2_score )等 | 70 | | | 3.2.2 随机搜索 | -- | [sklearn.model_selection.RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV) | 参数传入同上 | 71 | | 3.3 [性能度量](http://scikit-learn.org/stable/modules/model_evaluation.html) | 3.3.1 分类度量 | -- | -- | 对应交叉验证和调参中的score | 72 | | | 3.3.2 回归度量 | -- | -- | | 73 | | | 3.3.3 聚类度量 | -- | -- | | 74 | | 3.4 [模型持久性](http://scikit-learn.org/stable/modules/model_persistence.html) | -- | -- | -- | 使用pickle存放模型,可以使模型不用重复训练 | 75 | | 3.5 [验证曲线](http://scikit-learn.org/stable/modules/learning_curve.html#learning-curve)| 3.5.1 验证曲线 | -- | [sklearn.model_selection.validation_curve](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.validation_curve.html#sklearn.model_selection.validation_curve) | 横轴为某个参数的值,纵轴为模型得分 | 76 | | | 3.5.2 学习曲线 | -- | [sklearn.model_selection.learning_curve](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html#sklearn.model_selection.learning_curve) | 横轴为训练数据大小,纵轴为模型得分 | 77 | | **数据预处理** | | | | | 78 | | 4.3 [数据预处理](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing) | 4.3.1 标准化 | 数据预处理 | 标准化:
[sklearn.preprocessing.scale](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html#sklearn.preprocessing.scale)
[sklearn.preprocessing.StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) | scale与StandardScaler都是将将特征转化成标准正态分布(即均值为0,方差为1),且都可以处理scipy.sparse矩阵,但一般选择后者 | 79 | | | | 数据预处理 |
区间缩放:
[sklearn.preprocessing.MinMaxScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler)
[sklearn.preprocessing.MaxAbsScale](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler) | MinMaxScaler默认为0-1缩放,MaxAbsScaler可以处理scipy.sparse矩阵 | 80 | | | 4.3.2 非线性转换 | 数据预处理 | [sklearn.preprocessing.QuantileTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html#sklearn.preprocessing.QuantileTransformer) | 可以更少的受异常值的影响 | 81 | | | 4.3.3 归一化 | 数据预处理 | [sklearn.preprocessing.Normalizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer) | 将行向量转换为单位向量,目的在于样本向量在点乘运算或其他核函数计算相似性时,拥有统一的标准 | 82 | | | 4.3.4 二值化 | 数据预处理 | [sklearn.preprocessing.Binarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html#sklearn.preprocessing.Binarizer) | 通过设置阈值对定量特征处理,获取布尔值 | 83 | | | 4.3.5 哑编码 | 数据预处理 | [sklearn.preprocessing.OneHotEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) | 对定性特征编码。也可用pandas.get_dummies实现 | 84 | | | 4.3.6 缺失值计算 | 数据预处理 | [sklearn.preprocessing.Imputer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html#sklearn.preprocessing.Imputer) | 可用三种方式填充缺失值,均值(默认)、中位数和众数。也可用pandas.fillna实现 | 85 | | | 4.3.7 多项式转换 | 数据预处理 | [sklearn.preprocessing.PolynomialFeatures](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html#sklearn.preprocessing.PolynomialFeatures) | | 86 | | | 4.3.8 自定义转换 | 数据预处理 | [sklearn.preprocessing.FunctionTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html#sklearn.preprocessing.FunctionTransformer) | | 87 | | | | | | | 88 | | | | | | | 89 | 90 | 91 | -------------------------------------------------------------------------------- /xgboost学习.md: -------------------------------------------------------------------------------- 1 | ## XGBoost调参指南 2 | [参考-官网](http://xgboost.readthedocs.io/en/latest/////////how_to/param_tuning.html) 3 | ### 方法1 4 | 可按照**max_depth, min_child_weight colsamplt_bytree,eta**的顺序一个一个调,每次调的时候其他参数保持不变 5 | ### 方法2:防止过拟合 6 | When you observe high training accuracy, but low tests accuracy, it is likely that you encounter overfitting problem. 7 | 8 | There are in general two ways that you can control overfitting in xgboost 9 | 10 | - The first way is to directly control model complexity 11 | This include **max_depth, min_child_weight and gamma** 12 | - The second way is to add randomness to make training robust to noise 13 | This include **subsample, colsample_bytree** 14 | You can also reduce stepsize **eta**, but needs to remember to increase **num_round** when you do so. 15 | 16 | ## XGBoost参数 17 | [参考1-官网](http://xgboost.readthedocs.io/en/latest/////parameter.html) [参考2-CSDN](http://m.blog.csdn.net/qimiejia5584/article/details/78622442) 18 |
XGBoost参数类型分为三种: 19 | - general parameters/一般参数:决定使用哪种booster,可选gbtree、dart、gblinear 20 | - booster parameters/提升器参数:不同的booster选取不同的参数 21 | - task parameters/任务参数:设置学习场景,比如回归、二类分类、多类分类等 22 | 23 | | 参数类型 | 参数名 | 参数取值 | 默认值 | 说明 | 24 | |-------- | --------| -------- | -------- | -------- | 25 | | **general** | | | | | 26 | || booster | gbtree
dart
gblinear | gbtree | 指定使用的booster。前两种为树模型,后两种为线性模型。一般用默认的就好| 27 | | | silent | 0、1 | 0 | 0表示输出运行信息,1表示采取静默模式 | 28 | | | nthread | |系统允许的最大线程数 | 并发线程数 | 29 | | | num_pbuffer | 不可指定 | | 缓冲池大小 | 30 | | | num_feature | 不可指定 | | 特征维度 | 31 | | **booster** | | | | | 32 | | **Tree Booster** | **eta** | [0,1] |0.3| 学习率 | 33 | | | **gamma** | [0,∞] |0 | 最小损失分裂。越大越保守,一般0.1、0.2这样子 | 34 | | | **max_depth** | [0,∞] | 6 | 树的最大深度。越大模型越复杂,越容易过拟合 | 35 | | | **min_child_weight** | [0,∞] | 1 | 子节点最小的权重。非常容易影响结果,参数数值越大,就越保守,越不会过拟合 | 36 | | | max_delta_step | [0,∞] |0 | 最大delta的步数| 37 | | | **subsample** | (0,1] | 1 | 子样本数目。是否只使用部分的样本进行训练,这可以避免过拟合化。默认为1,即全部用作训练。 | 38 | | | **colsample_bytree** | (0,1] | 1 | 每棵树的列数(特征数) | 39 | | | colsample_bylevel | (0,1] | 1 | 每一层的列数(特征数)| 40 | | | lambda | | 1 | L2正则化的权重 | 41 | | | alpha | | 0 | L1正则化的权重 | 42 | | | tree_method | {‘auto’, ‘exact’, ‘approx’, ‘hist’, ‘gpu_exact’, ‘gpu_hist’} | 'auto' |树构造的算法 | 43 | | | sketch_eps | (0,1) |0.03 | 当tree_mothed='approx'才有用| 44 | | | scale_pos_weight | | 1 | 用在不均衡的分类中 | 45 | | | updater | | ’grow_colmaker,prune’ | 更新器 | 46 | | | refresh_leaf | |1 |当updater= refresh才有用 | 47 | | | process_type | {‘default’, ‘update’} |’default’ |程序运行方式。update表示更新已有的树| 48 | | | grow_policy | {‘depthwise’, ‘lossguide’} |’depthwise’ |控制新结点加入树的方式。当tree_mothod='hist'才有用| 49 | | | max_leaves | |0 | 要添加的最大结点数 | 50 | | | max_bin | | 256|当tree_mothod='hist'才有用| 51 | | | predictor | {‘cpu_predictor’, ‘gpu_predictor’} |’cpu_predictor’ | 算法预测时采用的方式| 52 | | **Dart Booster**
(比Tree Booster增加了5个参数) | sample_type| {“uniform”,“weighted”} | "uniform"|选样方式 | 53 | | |normalize_type |{"tree", "forest"} | "tree"|正则化方式 | 54 | | | rate_drop | [0.0, 1.0] |0.0| 舍弃上一轮树的比例| 55 | || one_drop | | 0| 当值不为0的时候,至少有一棵树被舍弃| 56 | | |skip _drop | [0.0, 1.0] |0.0 | 跳过舍弃树的程序的概率| 57 | | **Linear Booster** | lambda | |0 |L2正则化的权重 | 58 | | | alpha | |0 | L1正则化的权重| 59 | | | lambda_bias | |0 |L2正则化的偏爱 | 60 | | **Tweedie Regression**
(三种booster共有的参数) | tweedie_variance_power | (1,2) |1.5 | 接近2代表接近gamma分布,接近1代表接近泊松分布| 61 | | **task** | | | | | 62 | | | **objective** | “reg:linear” – 线性回归
“reg:logistic” – 逻辑回归
"binary:logistic" – 逻辑回归处理二分类(输出概率)
"binary:logitraw" – 逻辑回归处理二分类(输出分数)
"count:poisson" – 泊松回归处理计算数据(输出均值、max_delta_step参数默认为0.7)
"multi:softmax" – 采用softmax目标函数处理多分类问题(需要设定类别数"num_class")
"multi:softprob" – 多分类(与上一搁一样,只是它的输出是ndata*nclass
"rank:pairwise" – 处理排位问题
"reg:gamma" – 用γ回归(返回均值)
"reg:tweedie" – 用特威迪回归| 'reg:linear' |定义学习任务及相应的学习目标 | 63 | | | base_score | | 0.5 | 所有实例的初始预测得分 | 64 | | | eval_metric | | 回归问题:rmse
分类问题:error
排位问题:map| 评价指标/性能度量。Python可通过list传递多个评价指标 | 65 | | | seed | | 0 | 随机数种子。每次取一样的seed可得到相同的随机划分 | 66 | | **Command Line** | | | | | 67 | ||**num_round**||| boosting的轮数/迭代次数| 68 | 69 | xgb使用sklearn接口会改变的函数名是:[官网](http://xgboost.readthedocs.io/en/latest/////python/python_api.html#module-xgboost.sklearn) 70 | - eta -> learning_rate 71 | - lambda -> reg_lambda 72 | - alpha -> reg_alpha --------------------------------------------------------------------------------
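> A minimal sketch illustrating the renamed parameters listed above when using XGBoost's sklearn interface (synthetic data; the parameter values are arbitrary examples, not tuning advice):

```python
import numpy as np
from xgboost import XGBRegressor

# Synthetic regression data, just to exercise the sklearn-style parameter names
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)

# eta -> learning_rate, lambda -> reg_lambda, alpha -> reg_alpha;
# num_round corresponds to n_estimators in the sklearn wrapper
model = XGBRegressor(
    n_estimators=100,
    learning_rate=0.3,
    max_depth=6,
    reg_lambda=1.0,
    reg_alpha=0.0,
)
model.fit(X, y)
print(model.predict(X[:5]))
```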