└── README.md


/README.md:
--------------------------------------------------------------------------------
 1 | ##2016 阅读记录
 2 | 
 3 | - 2016-03-05
 4 | 
 5 | Xgboost中两个我之前没有用过的特性，一个是[用户自定义代价函数](https://github.com/dmlc/xgboost/blob/master/demo/guide-python/custom_objective.py)，另一个是[pred_leaf](https://github.com/dmlc/xgboost/blob/master/demo/guide-python/predict_leaf_indices.py),预测的时候设置pred_leaf为True，将对每个样本返回其在每棵树上的leaf index（一共有num_round棵树），可以当成新的特征来用。
 6 | 
 7 | 
 8 | - 2016-03-07
 9 | 
10 | 阅读了[《A Programmer's Guide to Data Mining》](http://guidetodatamining.com/)这本电子书，内容过于简单，两三个小时读完，干货不多。
11 | 
12 | - 2016-03-08
13 | 
14 | [Converting categorical data into numbers with Pandas and Scikit-learn](http://fastml.com/converting-categorical-data-into-numbers-with-pandas-and-scikit-learn/)  这篇文章讲了类别特征的处理，文中提到一点需要引起思考：If you have missing values in a binary feature, there’s an alternative representation:-1 for negatives，0 for missing values，1 for positives。It worked better in case of the Analytics Edge competition: an SVM trained on one-hot encoded data with d indicators scored 0.768 in terms of AUC, while the alternative representation yielded 0.778.
15 | 
16 | - 2016-03-11
17 | 
18 | [Imbalanced data – Finding Waldo](http://www.financealleycat.com/?p=69)  这篇文章讲了不平衡数据的处理，都是常见的方法（简单采样，合成采样），但是文章最后讲了一个很有趣的处理方式：如果不平衡数据中某个类别的数据非常少，那么也可以把分类问题当成异常值检测的问题（ [anomaly detection](https://en.wikipedia.org/wiki/Anomaly_detection)），只需要检测出异常值就行了。
19 | 
20 | - 2016-04-01
21 | 
22 | 看了large-scale svm相关的内容，发现一个不错的工具[EnsembleSVM](https://github.com/claesenm/EnsembleSVM)，在准确率不下降的同时减小计算复杂度，对应论文[EnsembleSVM: A Library for Ensemble Learning Using Support Vector Machines](http://jmlr.org/papers/volume15/claesen14a/claesen14a.pdf)，另外一篇cite比较多的论文[Making large-scale support vector machine learning practical](http://dl.acm.org/citation.cfm?id=299104)
23 | 


--------------------------------------------------------------------------------