├── .gitignore ├── AV-loan-prediction ├── loan_prediction.md ├── result_sklearn_rf.csv ├── sklearn-rf.py ├── test.csv ├── train.csv └── xgb.py ├── DC-loan-rp ├── 0.717.csv ├── add_data │ ├── 0.717.csv │ └── add_data.py ├── feature-selection │ ├── 0.70_feature_score.csv │ ├── anylze.py │ ├── drop_feature.txt │ ├── drop_list.txt │ ├── feature_score_category.csv │ └── feature_score_numeric.csv ├── sklearn-rf.py ├── small_data │ ├── test_x.csv │ ├── train_x.csv │ └── train_y.csv ├── source.R ├── xgb.py └── xgb_dummy.py ├── Kaggle-bag-of-words ├── BagOfWords_LR.py ├── BagOfWords_RF.py ├── Kaggle-Word2Vec.R ├── KaggleWord2VecUtility.py ├── README.md ├── Word2Vec_AverageVectors.py ├── Word2Vec_BagOfCentroids.py ├── generate_d2v.py ├── generate_w2v.py ├── nbsvm.py ├── out │ └── Bag_of_Words_model_RF.csv └── predict.py ├── Kaggle-digit-recognizer ├── .gitignore ├── Digit Recognizer.md ├── data │ └── readme.txt ├── experiment1-rf-1000.py ├── knn_by_myself.py ├── naive_bayes_by_myself.py ├── nn │ ├── README.md │ ├── gen │ │ ├── nn_benchmark.csv │ │ ├── nn_benchmark1.csv │ │ └── nn_benchmark2.csv │ └── src │ │ ├── DigitRecognizer.py │ │ ├── PyNeural │ │ ├── PyNeural.py │ │ ├── PyNeural.py.bak │ │ └── __init__.py │ │ └── ensemble.py ├── py-knn │ ├── experiment1-custom-knn-brute-force.py │ ├── experiment2-sklearn-knn-kdtree.py │ ├── experiment2-sklearn-knn-kdtree.py.bak │ ├── experiment3-sklean-pca-knn.py │ ├── experiment3-sklean-pca-knn.py.bak │ └── load_data.py ├── svm_by_myself.py ├── svm_pca.py ├── using_sklearn.py └── using_theano.py ├── README.md ├── feature_engineering_example.ipynb ├── kaggle-titanic ├── README.md ├── SOUPTONUTS.md.txt ├── Theano Tutorial.R ├── code.py ├── input │ ├── test.csv │ └── train.csv ├── ipynb-notebook │ ├── Kaggle_Titanic_Example.ipynb │ ├── test.csv │ └── train.csv ├── lr.py ├── randomforest_gridsearchCV.py ├── result_rf.csv ├── result_xgb.csv ├── sklearn-random-forest.py ├── xgb.py └── 笔记1.md └── kaggle_bike_competition_train.csv /.gitignore: -------------------------------------------------------------------------------- 1 | */DataCastle-Solution 2 | ### Python template 3 | # Byte-compiled / optimized / DLL files 4 | __pycache__/ 5 | *.py[cod] 6 | *$py.class 7 | 8 | # C extensions 9 | *.so 10 | 11 | # Distribution / packaging 12 | .Python 13 | env/ 14 | build/ 15 | develop-eggs/ 16 | dist/ 17 | downloads/ 18 | eggs/ 19 | .eggs/ 20 | lib/ 21 | lib64/ 22 | parts/ 23 | sdist/ 24 | var/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | 29 | # PyInstaller 30 | # Usually these files are written by a python script from a template 31 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 32 | *.manifest 33 | *.spec 34 | 35 | # Installer logs 36 | pip-log.txt 37 | pip-delete-this-directory.txt 38 | 39 | # Unit test / coverage reports 40 | htmlcov/ 41 | .tox/ 42 | .coverage 43 | .coverage.* 44 | .cache 45 | nosetests.xml 46 | coverage.xml 47 | *,cover 48 | 49 | # Translations 50 | *.mo 51 | *.pot 52 | 53 | # Django stuff: 54 | *.log 55 | 56 | # Sphinx documentation 57 | docs/_build/ 58 | 59 | # PyBuilder 60 | target/ 61 | 62 | # PyCharm 63 | .idea 64 | 65 | # Created by .ignore support plugin (hsz.mobi) 66 | -------------------------------------------------------------------------------- /AV-loan-prediction/loan_prediction.md: -------------------------------------------------------------------------------- 1 | ## Problem Statement 2 | About Company 3 | Dream Housing Finance company deals in all home loans. 
They have presence across all urban, 4 | semi urban and rural areas. Customer first apply for home loan after that company validates 5 | the customer eligibility for loan. 6 | 7 | Problem 8 | Company wants to automate the loan eligibility process (real time) based on customer detail provided 9 | while filling online application form. These details are Gender, Marital Status, Education, Number of Dependents, 10 | Income, Loan Amount, Credit History and others. To automate this process, they have given a problem to identify 11 | the customers segments, those are eligible for loan amount so that they can specifically target these customers. 12 | Here they have provided a partial data set. 13 | 14 | ## Data Set 15 | Variable Description 16 | Loan_ID Unique Loan ID 17 | Gender Male/ Female 18 | Married Applicant married (Y/N) 19 | Dependents Number of dependents 20 | Education Applicant Education (Graduate/ Under Graduate) 21 | Self_Employed Self employed (Y/N) 22 | ApplicantIncome Applicant income 23 | CoapplicantIncome Coapplicant income 24 | LoanAmount Loan amount in thousands 25 | Loan_Amount_Term Term of loan in months 26 | Credit_History credit history meets guidelines 27 | Property_Area Urban/ Semi Urban/ Rural 28 | Loan_Status Loan approved (Y/N) -------------------------------------------------------------------------------- /AV-loan-prediction/result_sklearn_rf.csv: -------------------------------------------------------------------------------- 1 | Loan_ID,Loan_Status 2 | LP001015,Y 3 | LP001022,Y 4 | LP001031,Y 5 | LP001035,Y 6 | LP001051,Y 7 | LP001054,Y 8 | LP001055,Y 9 | LP001056,N 10 | LP001059,Y 11 | LP001067,Y 12 | LP001078,Y 13 | LP001082,Y 14 | LP001083,Y 15 | LP001094,N 16 | LP001096,Y 17 | LP001099,Y 18 | LP001105,Y 19 | LP001107,Y 20 | LP001108,Y 21 | LP001115,Y 22 | LP001121,Y 23 | LP001124,Y 24 | LP001128,Y 25 | LP001135,Y 26 | LP001149,Y 27 | LP001153,N 28 | LP001163,Y 29 | LP001169,Y 30 | LP001174,Y 31 | LP001176,Y 32 | LP001177,Y 33 | LP001183,Y 34 | LP001185,Y 35 | LP001187,Y 36 | LP001190,Y 37 | LP001203,N 38 | LP001208,Y 39 | LP001210,Y 40 | LP001211,Y 41 | LP001219,Y 42 | LP001220,Y 43 | LP001221,Y 44 | LP001226,Y 45 | LP001230,Y 46 | LP001231,Y 47 | LP001232,Y 48 | LP001237,Y 49 | LP001242,Y 50 | LP001268,Y 51 | LP001270,Y 52 | LP001284,Y 53 | LP001287,Y 54 | LP001291,Y 55 | LP001298,Y 56 | LP001312,Y 57 | LP001313,N 58 | LP001317,Y 59 | LP001321,Y 60 | LP001323,N 61 | LP001324,Y 62 | LP001332,Y 63 | LP001335,Y 64 | LP001338,Y 65 | LP001347,N 66 | LP001348,Y 67 | LP001351,Y 68 | LP001352,N 69 | LP001358,N 70 | LP001359,Y 71 | LP001361,N 72 | LP001366,Y 73 | LP001368,Y 74 | LP001375,Y 75 | LP001380,Y 76 | LP001386,Y 77 | LP001400,Y 78 | LP001407,Y 79 | LP001413,Y 80 | LP001415,Y 81 | LP001419,Y 82 | LP001420,N 83 | LP001428,Y 84 | LP001445,N 85 | LP001446,Y 86 | LP001450,N 87 | LP001452,Y 88 | LP001455,Y 89 | LP001466,Y 90 | LP001471,Y 91 | LP001472,Y 92 | LP001475,Y 93 | LP001483,Y 94 | LP001486,Y 95 | LP001490,Y 96 | LP001496,N 97 | LP001499,Y 98 | LP001500,Y 99 | LP001501,Y 100 | LP001517,Y 101 | LP001527,Y 102 | LP001534,Y 103 | LP001542,N 104 | LP001547,Y 105 | LP001548,Y 106 | LP001558,Y 107 | LP001561,Y 108 | LP001563,N 109 | LP001567,Y 110 | LP001568,Y 111 | LP001573,Y 112 | LP001584,Y 113 | LP001587,Y 114 | LP001589,Y 115 | LP001591,Y 116 | LP001599,Y 117 | LP001601,Y 118 | LP001607,N 119 | LP001611,N 120 | LP001613,N 121 | LP001622,N 122 | LP001627,Y 123 | LP001650,Y 124 | LP001651,Y 125 | LP001652,N 126 | LP001655,N 127 | LP001660,Y 128 | LP001662,N 
129 | LP001663,Y 130 | LP001667,Y 131 | LP001695,Y 132 | LP001703,Y 133 | LP001718,Y 134 | LP001728,Y 135 | LP001735,Y 136 | LP001737,Y 137 | LP001739,Y 138 | LP001742,Y 139 | LP001757,Y 140 | LP001769,Y 141 | LP001771,Y 142 | LP001785,N 143 | LP001787,Y 144 | LP001789,N 145 | LP001791,Y 146 | LP001794,Y 147 | LP001797,Y 148 | LP001815,Y 149 | LP001817,N 150 | LP001818,Y 151 | LP001822,Y 152 | LP001827,Y 153 | LP001831,Y 154 | LP001842,Y 155 | LP001853,N 156 | LP001855,Y 157 | LP001857,Y 158 | LP001862,Y 159 | LP001867,Y 160 | LP001878,Y 161 | LP001881,Y 162 | LP001886,Y 163 | LP001906,N 164 | LP001909,Y 165 | LP001911,Y 166 | LP001921,Y 167 | LP001923,N 168 | LP001933,N 169 | LP001943,Y 170 | LP001950,N 171 | LP001959,N 172 | LP001961,Y 173 | LP001973,Y 174 | LP001975,Y 175 | LP001979,N 176 | LP001995,N 177 | LP001999,Y 178 | LP002007,Y 179 | LP002009,Y 180 | LP002016,Y 181 | LP002017,Y 182 | LP002018,Y 183 | LP002027,Y 184 | LP002028,Y 185 | LP002042,Y 186 | LP002045,Y 187 | LP002046,Y 188 | LP002047,Y 189 | LP002056,Y 190 | LP002057,Y 191 | LP002059,Y 192 | LP002062,Y 193 | LP002064,Y 194 | LP002069,N 195 | LP002070,N 196 | LP002077,Y 197 | LP002083,Y 198 | LP002090,N 199 | LP002096,Y 200 | LP002099,N 201 | LP002102,Y 202 | LP002105,Y 203 | LP002107,Y 204 | LP002111,Y 205 | LP002117,Y 206 | LP002118,Y 207 | LP002123,Y 208 | LP002125,Y 209 | LP002148,Y 210 | LP002152,Y 211 | LP002165,Y 212 | LP002167,Y 213 | LP002168,N 214 | LP002172,Y 215 | LP002176,Y 216 | LP002183,Y 217 | LP002184,Y 218 | LP002186,Y 219 | LP002192,Y 220 | LP002195,Y 221 | LP002208,Y 222 | LP002212,Y 223 | LP002240,Y 224 | LP002245,Y 225 | LP002253,Y 226 | LP002256,N 227 | LP002257,Y 228 | LP002264,Y 229 | LP002270,Y 230 | LP002279,Y 231 | LP002286,N 232 | LP002294,Y 233 | LP002298,Y 234 | LP002306,Y 235 | LP002310,Y 236 | LP002311,Y 237 | LP002316,N 238 | LP002321,N 239 | LP002325,Y 240 | LP002326,Y 241 | LP002329,N 242 | LP002333,Y 243 | LP002339,N 244 | LP002344,Y 245 | LP002346,N 246 | LP002354,Y 247 | LP002355,N 248 | LP002358,Y 249 | LP002360,Y 250 | LP002375,Y 251 | LP002376,Y 252 | LP002383,N 253 | LP002385,Y 254 | LP002389,Y 255 | LP002394,Y 256 | LP002397,Y 257 | LP002399,N 258 | LP002400,Y 259 | LP002402,Y 260 | LP002412,Y 261 | LP002415,Y 262 | LP002417,Y 263 | LP002420,Y 264 | LP002425,Y 265 | LP002433,Y 266 | LP002440,Y 267 | LP002441,Y 268 | LP002442,N 269 | LP002445,Y 270 | LP002450,N 271 | LP002471,Y 272 | LP002476,Y 273 | LP002482,Y 274 | LP002485,Y 275 | LP002495,N 276 | LP002496,N 277 | LP002523,Y 278 | LP002542,Y 279 | LP002550,Y 280 | LP002551,N 281 | LP002553,Y 282 | LP002554,Y 283 | LP002561,Y 284 | LP002566,Y 285 | LP002568,Y 286 | LP002570,Y 287 | LP002572,Y 288 | LP002581,Y 289 | LP002584,Y 290 | LP002592,Y 291 | LP002593,Y 292 | LP002599,Y 293 | LP002604,Y 294 | LP002605,Y 295 | LP002609,N 296 | LP002610,Y 297 | LP002612,Y 298 | LP002614,Y 299 | LP002630,N 300 | LP002635,Y 301 | LP002639,Y 302 | LP002644,Y 303 | LP002651,N 304 | LP002654,Y 305 | LP002657,Y 306 | LP002711,Y 307 | LP002712,N 308 | LP002721,Y 309 | LP002735,Y 310 | LP002744,Y 311 | LP002745,Y 312 | LP002746,Y 313 | LP002747,N 314 | LP002754,Y 315 | LP002759,Y 316 | LP002760,Y 317 | LP002766,Y 318 | LP002769,Y 319 | LP002774,N 320 | LP002775,Y 321 | LP002781,Y 322 | LP002782,Y 323 | LP002786,N 324 | LP002790,Y 325 | LP002791,Y 326 | LP002793,Y 327 | LP002802,N 328 | LP002803,Y 329 | LP002805,Y 330 | LP002806,Y 331 | LP002816,Y 332 | LP002823,Y 333 | LP002825,Y 334 | LP002826,Y 335 | LP002843,Y 336 | LP002849,Y 337 | LP002850,Y 
338 | LP002853,Y 339 | LP002856,Y 340 | LP002857,Y 341 | LP002858,N 342 | LP002860,Y 343 | LP002867,Y 344 | LP002869,N 345 | LP002870,N 346 | LP002876,Y 347 | LP002878,Y 348 | LP002879,N 349 | LP002885,Y 350 | LP002890,Y 351 | LP002891,Y 352 | LP002899,Y 353 | LP002901,Y 354 | LP002907,Y 355 | LP002920,Y 356 | LP002921,N 357 | LP002932,Y 358 | LP002935,Y 359 | LP002952,Y 360 | LP002954,Y 361 | LP002962,Y 362 | LP002965,Y 363 | LP002969,Y 364 | LP002971,Y 365 | LP002975,Y 366 | LP002980,Y 367 | LP002986,Y 368 | LP002989,Y 369 | -------------------------------------------------------------------------------- /AV-loan-prediction/sklearn-rf.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from sklearn.metrics import roc_auc_score 3 | from sklearn.ensemble import RandomForestClassifier 4 | from sklearn.svm import SVC 5 | from sklearn import preprocessing 6 | 7 | 8 | train = pd.read_csv('train.csv') 9 | test = pd.read_csv('test.csv') 10 | print ("Starting...") 11 | print(('Number of training examples {0} '.format(train.shape[0]))) 12 | print((train.Loan_Status.value_counts())) 13 | print(('Number of test examples {0} '.format(test.shape[0]))) 14 | 15 | 16 | cat_vbl = {'Gender','Married','Dependents','Self_Employed','Property_Area'} 17 | num_vbl = {'LoanAmount','Loan_Amount_Term','Credit_History'} 18 | 19 | for var in num_vbl: 20 | train[var] = train[var].fillna(value = train[var].mean()) 21 | test[var] = test[var].fillna(value = test[var].mean()) 22 | train['Credibility'] = train['ApplicantIncome'] / train['LoanAmount'] 23 | test['Credibility'] = test['ApplicantIncome'] / test['LoanAmount'] 24 | 25 | print ("Starting Label Encode") 26 | for var in cat_vbl: 27 | lb = preprocessing.LabelEncoder() 28 | full_data = pd.concat((train[var],test[var]),axis=0).astype('str') 29 | lb.fit( full_data ) 30 | train[var] = lb.transform(train[var].astype('str')) 31 | test[var] = lb.transform(test[var].astype('str')) 32 | 33 | train = train.fillna(value = -999) 34 | test = test.fillna(value = -999) 35 | print ("Filled Missing Values") 36 | 37 | features = ['Credibility', 38 | 'Gender', 39 | 'Married', 40 | 'Dependents', 41 | 'Self_Employed', 42 | 'Property_Area', 43 | 'ApplicantIncome', 44 | 'CoapplicantIncome', 45 | 'LoanAmount', 46 | 'Loan_Amount_Term', 47 | 'Credit_History' 48 | ] 49 | 50 | x_train = train[features].values 51 | y_train = train['Loan_Status'].values 52 | x_test = test[features].values 53 | 54 | # Random Forest 55 | rf = RandomForestClassifier(n_estimators=1000, n_jobs=-1, oob_score = True, max_features = "auto",random_state=10, min_samples_split=2, min_samples_leaf=2) 56 | rf.fit(x_train, y_train) 57 | print(('Training accuracy:', rf.oob_score_)) 58 | 59 | 60 | print ("Starting to predict on the dataset") 61 | rec= rf.predict(x_test) 62 | 63 | print ("Prediction Completed") 64 | test['Loan_Status'] = rec 65 | test.to_csv('result_sklearn_rf.csv',columns=['Loan_ID','Loan_Status'],index=False) 66 | -------------------------------------------------------------------------------- /AV-loan-prediction/test.csv: -------------------------------------------------------------------------------- 1 | Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area 2 | LP001015,Male,Yes,0,Graduate,No,5720,0,110,360,1,Urban 3 | LP001022,Male,Yes,1,Graduate,No,3076,1500,126,360,1,Urban 4 | LP001031,Male,Yes,2,Graduate,No,5000,1800,208,360,1,Urban 5 | 
LP001035,Male,Yes,2,Graduate,No,2340,2546,100,360,,Urban 6 | LP001051,Male,No,0,Not Graduate,No,3276,0,78,360,1,Urban 7 | LP001054,Male,Yes,0,Not Graduate,Yes,2165,3422,152,360,1,Urban 8 | LP001055,Female,No,1,Not Graduate,No,2226,0,59,360,1,Semiurban 9 | LP001056,Male,Yes,2,Not Graduate,No,3881,0,147,360,0,Rural 10 | LP001059,Male,Yes,2,Graduate,,13633,0,280,240,1,Urban 11 | LP001067,Male,No,0,Not Graduate,No,2400,2400,123,360,1,Semiurban 12 | LP001078,Male,No,0,Not Graduate,No,3091,0,90,360,1,Urban 13 | LP001082,Male,Yes,1,Graduate,,2185,1516,162,360,1,Semiurban 14 | LP001083,Male,No,3+,Graduate,No,4166,0,40,180,,Urban 15 | LP001094,Male,Yes,2,Graduate,,12173,0,166,360,0,Semiurban 16 | LP001096,Female,No,0,Graduate,No,4666,0,124,360,1,Semiurban 17 | LP001099,Male,No,1,Graduate,No,5667,0,131,360,1,Urban 18 | LP001105,Male,Yes,2,Graduate,No,4583,2916,200,360,1,Urban 19 | LP001107,Male,Yes,3+,Graduate,No,3786,333,126,360,1,Semiurban 20 | LP001108,Male,Yes,0,Graduate,No,9226,7916,300,360,1,Urban 21 | LP001115,Male,No,0,Graduate,No,1300,3470,100,180,1,Semiurban 22 | LP001121,Male,Yes,1,Not Graduate,No,1888,1620,48,360,1,Urban 23 | LP001124,Female,No,3+,Not Graduate,No,2083,0,28,180,1,Urban 24 | LP001128,,No,0,Graduate,No,3909,0,101,360,1,Urban 25 | LP001135,Female,No,0,Not Graduate,No,3765,0,125,360,1,Urban 26 | LP001149,Male,Yes,0,Graduate,No,5400,4380,290,360,1,Urban 27 | LP001153,Male,No,0,Graduate,No,0,24000,148,360,0,Rural 28 | LP001163,Male,Yes,2,Graduate,No,4363,1250,140,360,,Urban 29 | LP001169,Male,Yes,0,Graduate,No,7500,3750,275,360,1,Urban 30 | LP001174,Male,Yes,0,Graduate,No,3772,833,57,360,,Semiurban 31 | LP001176,Male,No,0,Graduate,No,2942,2382,125,180,1,Urban 32 | LP001177,Female,No,0,Not Graduate,No,2478,0,75,360,1,Semiurban 33 | LP001183,Male,Yes,2,Graduate,No,6250,820,192,360,1,Urban 34 | LP001185,Male,No,0,Graduate,No,3268,1683,152,360,1,Semiurban 35 | LP001187,Male,Yes,0,Graduate,No,2783,2708,158,360,1,Urban 36 | LP001190,Male,Yes,0,Graduate,No,2740,1541,101,360,1,Urban 37 | LP001203,Male,No,0,Graduate,No,3150,0,176,360,0,Semiurban 38 | LP001208,Male,Yes,2,Graduate,,7350,4029,185,180,1,Urban 39 | LP001210,Male,Yes,0,Graduate,Yes,2267,2792,90,360,1,Urban 40 | LP001211,Male,No,0,Graduate,Yes,5833,0,116,360,1,Urban 41 | LP001219,Male,No,0,Graduate,No,3643,1963,138,360,1,Urban 42 | LP001220,Male,Yes,0,Graduate,No,5629,818,100,360,1,Urban 43 | LP001221,Female,No,0,Graduate,No,3644,0,110,360,1,Urban 44 | LP001226,Male,Yes,0,Not Graduate,No,1750,2024,90,360,1,Semiurban 45 | LP001230,Male,No,0,Graduate,No,6500,2600,200,360,1,Semiurban 46 | LP001231,Female,No,0,Graduate,No,3666,0,84,360,1,Urban 47 | LP001232,Male,Yes,0,Graduate,No,4260,3900,185,,,Urban 48 | LP001237,Male,Yes,,Not Graduate,No,4163,1475,162,360,1,Urban 49 | LP001242,Male,No,0,Not Graduate,No,2356,1902,108,360,1,Semiurban 50 | LP001268,Male,No,0,Graduate,No,6792,3338,187,,1,Urban 51 | LP001270,Male,Yes,3+,Not Graduate,Yes,8000,250,187,360,1,Semiurban 52 | LP001284,Male,Yes,1,Graduate,No,2419,1707,124,360,1,Urban 53 | LP001287,,Yes,3+,Not Graduate,No,3500,833,120,360,1,Semiurban 54 | LP001291,Male,Yes,1,Graduate,No,3500,3077,160,360,1,Semiurban 55 | LP001298,Male,Yes,2,Graduate,No,4116,1000,30,180,1,Urban 56 | LP001312,Male,Yes,0,Not Graduate,Yes,5293,0,92,360,1,Urban 57 | LP001313,Male,No,0,Graduate,No,2750,0,130,360,0,Urban 58 | LP001317,Female,No,0,Not Graduate,No,4402,0,130,360,1,Rural 59 | LP001321,Male,Yes,2,Graduate,No,3613,3539,134,180,1,Semiurban 60 | 
LP001323,Female,Yes,2,Graduate,No,2779,3664,176,360,0,Semiurban 61 | LP001324,Male,Yes,3+,Graduate,No,4720,0,90,180,1,Semiurban 62 | LP001332,Male,Yes,0,Not Graduate,No,2415,1721,110,360,1,Semiurban 63 | LP001335,Male,Yes,0,Graduate,Yes,7016,292,125,360,1,Urban 64 | LP001338,Female,No,2,Graduate,No,4968,0,189,360,1,Semiurban 65 | LP001347,Female,No,0,Graduate,No,2101,1500,108,360,0,Rural 66 | LP001348,Male,Yes,3+,Not Graduate,No,4490,0,125,360,1,Urban 67 | LP001351,Male,Yes,0,Graduate,No,2917,3583,138,360,1,Semiurban 68 | LP001352,Male,Yes,0,Not Graduate,No,4700,0,135,360,0,Semiurban 69 | LP001358,Male,Yes,0,Graduate,No,3445,0,130,360,0,Semiurban 70 | LP001359,Male,Yes,0,Graduate,No,7666,0,187,360,1,Semiurban 71 | LP001361,Male,Yes,0,Graduate,No,2458,5105,188,360,0,Rural 72 | LP001366,Female,No,,Graduate,No,3250,0,95,360,1,Semiurban 73 | LP001368,Male,No,0,Graduate,No,4463,0,65,360,1,Semiurban 74 | LP001375,Male,Yes,1,Graduate,,4083,1775,139,60,1,Urban 75 | LP001380,Male,Yes,0,Graduate,Yes,3900,2094,232,360,1,Rural 76 | LP001386,Male,Yes,0,Not Graduate,No,4750,3583,144,360,1,Semiurban 77 | LP001400,Male,No,0,Graduate,No,3583,3435,155,360,1,Urban 78 | LP001407,Male,Yes,0,Graduate,No,3189,2367,186,360,1,Urban 79 | LP001413,Male,No,0,Graduate,Yes,6356,0,50,360,1,Rural 80 | LP001415,Male,Yes,1,Graduate,No,3413,4053,,360,1,Semiurban 81 | LP001419,Female,Yes,0,Graduate,No,7950,0,185,360,1,Urban 82 | LP001420,Male,Yes,3+,Graduate,No,3829,1103,163,360,0,Urban 83 | LP001428,Male,Yes,3+,Graduate,No,72529,0,360,360,1,Urban 84 | LP001445,Male,Yes,2,Not Graduate,No,4136,0,149,480,0,Rural 85 | LP001446,Male,Yes,0,Graduate,No,8449,0,257,360,1,Rural 86 | LP001450,Male,Yes,0,Graduate,No,4456,0,131,180,0,Semiurban 87 | LP001452,Male,Yes,2,Graduate,No,4635,8000,102,180,1,Rural 88 | LP001455,Male,Yes,0,Graduate,No,3571,1917,135,360,1,Urban 89 | LP001466,Male,No,0,Graduate,No,3066,0,95,360,1,Semiurban 90 | LP001471,Male,No,2,Not Graduate,No,3235,2015,77,360,1,Semiurban 91 | LP001472,Female,No,0,Graduate,,5058,0,200,360,1,Rural 92 | LP001475,Male,Yes,0,Graduate,Yes,3188,2286,130,360,,Rural 93 | LP001483,Male,Yes,3+,Graduate,No,13518,0,390,360,1,Rural 94 | LP001486,Male,Yes,1,Graduate,No,4364,2500,185,360,1,Semiurban 95 | LP001490,Male,Yes,2,Not Graduate,No,4766,1646,100,360,1,Semiurban 96 | LP001496,Male,Yes,1,Graduate,No,4609,2333,123,360,0,Semiurban 97 | LP001499,Female,Yes,3+,Graduate,No,6260,0,110,360,1,Semiurban 98 | LP001500,Male,Yes,1,Graduate,No,3333,4200,256,360,1,Urban 99 | LP001501,Male,Yes,0,Graduate,No,3500,3250,140,360,1,Semiurban 100 | LP001517,Male,Yes,3+,Graduate,No,9719,0,61,360,1,Urban 101 | LP001527,Male,Yes,3+,Graduate,No,6835,0,188,360,,Semiurban 102 | LP001534,Male,No,0,Graduate,No,4452,0,131,360,1,Rural 103 | LP001542,Female,Yes,0,Graduate,No,2262,0,,480,0,Semiurban 104 | LP001547,Male,Yes,1,Graduate,No,3901,0,116,360,1,Urban 105 | LP001548,Male,Yes,2,Not Graduate,No,2687,0,50,180,1,Rural 106 | LP001558,Male,No,0,Graduate,No,2243,2233,107,360,,Semiurban 107 | LP001561,Female,Yes,0,Graduate,No,3417,1287,200,360,1,Semiurban 108 | LP001563,,No,0,Graduate,No,1596,1760,119,360,0,Urban 109 | LP001567,Male,Yes,3+,Graduate,No,4513,0,120,360,1,Rural 110 | LP001568,Male,Yes,0,Graduate,No,4500,0,140,360,1,Semiurban 111 | LP001573,Male,Yes,0,Not Graduate,No,4523,1350,165,360,1,Urban 112 | LP001584,Female,No,0,Graduate,Yes,4742,0,108,360,1,Semiurban 113 | LP001587,Male,Yes,,Graduate,No,4082,0,93,360,1,Semiurban 114 | LP001589,Female,No,0,Graduate,No,3417,0,102,360,1,Urban 115 | 
LP001591,Female,Yes,2,Graduate,No,2922,3396,122,360,1,Semiurban 116 | LP001599,Male,Yes,0,Graduate,No,4167,4754,160,360,1,Rural 117 | LP001601,Male,No,3+,Graduate,No,4243,4123,157,360,,Semiurban 118 | LP001607,Female,No,0,Not Graduate,No,0,1760,180,360,1,Semiurban 119 | LP001611,Male,Yes,1,Graduate,No,1516,2900,80,,0,Rural 120 | LP001613,Female,No,0,Graduate,No,1762,2666,104,360,0,Urban 121 | LP001622,Male,Yes,2,Graduate,No,724,3510,213,360,0,Rural 122 | LP001627,Male,No,0,Graduate,No,3125,0,65,360,1,Urban 123 | LP001650,Male,Yes,0,Graduate,No,2333,3803,146,360,1,Rural 124 | LP001651,Male,Yes,3+,Graduate,No,3350,1560,135,360,1,Urban 125 | LP001652,Male,No,0,Graduate,No,2500,6414,187,360,0,Rural 126 | LP001655,Female,No,0,Graduate,No,12500,0,300,360,0,Urban 127 | LP001660,Male,No,0,Graduate,No,4667,0,120,360,1,Semiurban 128 | LP001662,Male,No,0,Graduate,No,6500,0,71,360,0,Urban 129 | LP001663,Male,Yes,2,Graduate,No,7500,0,225,360,1,Urban 130 | LP001667,Male,No,0,Graduate,No,3073,0,70,180,1,Urban 131 | LP001695,Male,Yes,1,Not Graduate,No,3321,2088,70,,1,Semiurban 132 | LP001703,Male,Yes,0,Graduate,No,3333,1270,124,360,1,Urban 133 | LP001718,Male,No,0,Graduate,No,3391,0,132,360,1,Rural 134 | LP001728,Male,Yes,1,Graduate,Yes,3343,1517,105,360,1,Rural 135 | LP001735,Female,No,1,Graduate,No,3620,0,90,360,1,Urban 136 | LP001737,Male,No,0,Graduate,No,4000,0,83,84,1,Urban 137 | LP001739,Male,Yes,0,Graduate,No,4258,0,125,360,1,Urban 138 | LP001742,Male,Yes,2,Graduate,No,4500,0,147,360,1,Rural 139 | LP001757,Male,Yes,1,Graduate,No,2014,2925,120,360,1,Rural 140 | LP001769,,No,,Graduate,No,3333,1250,110,360,1,Semiurban 141 | LP001771,Female,No,3+,Graduate,No,4083,0,103,360,,Semiurban 142 | LP001785,Male,No,0,Graduate,No,4727,0,150,360,0,Rural 143 | LP001787,Male,Yes,3+,Graduate,No,3089,2999,100,240,1,Rural 144 | LP001789,Male,Yes,3+,Not Graduate,,6794,528,139,360,0,Urban 145 | LP001791,Male,Yes,0,Graduate,Yes,32000,0,550,360,,Semiurban 146 | LP001794,Male,Yes,2,Graduate,Yes,10890,0,260,12,1,Rural 147 | LP001797,Female,No,0,Graduate,No,12941,0,150,300,1,Urban 148 | LP001815,Male,No,0,Not Graduate,No,3276,0,90,360,1,Semiurban 149 | LP001817,Male,No,0,Not Graduate,Yes,8703,0,199,360,0,Rural 150 | LP001818,Male,Yes,1,Graduate,No,4742,717,139,360,1,Semiurban 151 | LP001822,Male,No,0,Graduate,No,5900,0,150,360,1,Urban 152 | LP001827,Male,No,0,Graduate,No,3071,4309,180,360,1,Urban 153 | LP001831,Male,Yes,0,Graduate,No,2783,1456,113,360,1,Urban 154 | LP001842,Male,No,0,Graduate,No,5000,0,148,360,1,Rural 155 | LP001853,Male,Yes,1,Not Graduate,No,2463,2360,117,360,0,Urban 156 | LP001855,Male,Yes,2,Graduate,No,4855,0,72,360,1,Rural 157 | LP001857,Male,No,0,Not Graduate,Yes,1599,2474,125,300,1,Semiurban 158 | LP001862,Male,Yes,2,Graduate,Yes,4246,4246,214,360,1,Urban 159 | LP001867,Male,Yes,0,Graduate,No,4333,2291,133,350,1,Rural 160 | LP001878,Male,No,1,Graduate,No,5823,2529,187,360,1,Semiurban 161 | LP001881,Male,Yes,0,Not Graduate,No,7895,0,143,360,1,Rural 162 | LP001886,Male,No,0,Graduate,No,4150,4256,209,360,1,Rural 163 | LP001906,Male,No,0,Graduate,,2964,0,84,360,0,Semiurban 164 | LP001909,Male,No,0,Graduate,No,5583,0,116,360,1,Urban 165 | LP001911,Female,No,0,Graduate,No,2708,0,65,360,1,Rural 166 | LP001921,Male,No,1,Graduate,No,3180,2370,80,240,,Rural 167 | LP001923,Male,No,0,Not Graduate,No,2268,0,170,360,0,Semiurban 168 | LP001933,Male,No,2,Not Graduate,No,1141,2017,120,360,0,Urban 169 | LP001943,Male,Yes,0,Graduate,No,3042,3167,135,360,1,Urban 170 | 
LP001950,Female,Yes,3+,Graduate,,1750,2935,94,360,0,Semiurban 171 | LP001959,Female,Yes,1,Graduate,No,3564,0,79,360,1,Rural 172 | LP001961,Female,No,0,Graduate,No,3958,0,110,360,1,Rural 173 | LP001973,Male,Yes,2,Not Graduate,No,4483,0,130,360,1,Rural 174 | LP001975,Male,Yes,0,Graduate,No,5225,0,143,360,1,Rural 175 | LP001979,Male,No,0,Graduate,No,3017,2845,159,180,0,Urban 176 | LP001995,Male,Yes,0,Not Graduate,No,2431,1820,110,360,0,Rural 177 | LP001999,Male,Yes,2,Graduate,,4912,4614,160,360,1,Rural 178 | LP002007,Male,Yes,2,Not Graduate,No,2500,3333,131,360,1,Urban 179 | LP002009,Female,No,0,Graduate,No,2918,0,65,360,,Rural 180 | LP002016,Male,Yes,2,Graduate,No,5128,0,143,360,1,Rural 181 | LP002017,Male,Yes,3+,Graduate,No,15312,0,187,360,,Urban 182 | LP002018,Male,Yes,2,Graduate,No,3958,2632,160,360,1,Semiurban 183 | LP002027,Male,Yes,0,Graduate,No,4334,2945,165,360,1,Semiurban 184 | LP002028,Male,Yes,2,Graduate,No,4358,0,110,360,1,Urban 185 | LP002042,Female,Yes,1,Graduate,No,4000,3917,173,360,1,Rural 186 | LP002045,Male,Yes,3+,Graduate,No,10166,750,150,,1,Urban 187 | LP002046,Male,Yes,0,Not Graduate,No,4483,0,135,360,,Semiurban 188 | LP002047,Male,Yes,2,Not Graduate,No,4521,1184,150,360,1,Semiurban 189 | LP002056,Male,Yes,2,Graduate,No,9167,0,235,360,1,Semiurban 190 | LP002057,Male,Yes,0,Not Graduate,No,13083,0,,360,1,Rural 191 | LP002059,Male,Yes,2,Graduate,No,7874,3967,336,360,1,Rural 192 | LP002062,Female,Yes,1,Graduate,No,4333,0,132,84,1,Rural 193 | LP002064,Male,No,0,Graduate,No,4083,0,96,360,1,Urban 194 | LP002069,Male,Yes,2,Not Graduate,,3785,2912,180,360,0,Rural 195 | LP002070,Male,Yes,3+,Not Graduate,No,2654,1998,128,360,0,Rural 196 | LP002077,Male,Yes,1,Graduate,No,10000,2690,412,360,1,Semiurban 197 | LP002083,Male,No,0,Graduate,Yes,5833,0,116,360,1,Urban 198 | LP002090,Male,Yes,1,Graduate,No,4796,0,114,360,0,Semiurban 199 | LP002096,Male,Yes,0,Not Graduate,No,2000,1600,115,360,1,Rural 200 | LP002099,Male,Yes,2,Graduate,No,2540,700,104,360,0,Urban 201 | LP002102,Male,Yes,0,Graduate,Yes,1900,1442,88,360,1,Rural 202 | LP002105,Male,Yes,0,Graduate,Yes,8706,0,108,480,1,Rural 203 | LP002107,Male,Yes,3+,Not Graduate,No,2855,542,90,360,1,Urban 204 | LP002111,Male,Yes,,Graduate,No,3016,1300,100,360,,Urban 205 | LP002117,Female,Yes,0,Graduate,No,3159,2374,108,360,1,Semiurban 206 | LP002118,Female,No,0,Graduate,No,1937,1152,78,360,1,Semiurban 207 | LP002123,Male,Yes,0,Graduate,No,2613,2417,123,360,1,Semiurban 208 | LP002125,Male,Yes,1,Graduate,No,4960,2600,187,360,1,Semiurban 209 | LP002148,Male,Yes,1,Graduate,No,3074,1083,146,360,1,Semiurban 210 | LP002152,Female,No,0,Graduate,No,4213,0,80,360,1,Urban 211 | LP002165,,No,1,Not Graduate,No,2038,4027,100,360,1,Rural 212 | LP002167,Female,No,0,Graduate,No,2362,0,55,360,1,Urban 213 | LP002168,Male,No,0,Graduate,No,5333,2400,200,360,0,Rural 214 | LP002172,Male,Yes,3+,Graduate,Yes,5384,0,150,360,1,Semiurban 215 | LP002176,Male,No,0,Graduate,No,5708,0,150,360,1,Rural 216 | LP002183,Male,Yes,0,Not Graduate,No,3754,3719,118,,1,Rural 217 | LP002184,Male,Yes,0,Not Graduate,No,2914,2130,150,300,1,Urban 218 | LP002186,Male,Yes,0,Not Graduate,No,2747,2458,118,36,1,Semiurban 219 | LP002192,Male,Yes,0,Graduate,No,7830,2183,212,360,1,Rural 220 | LP002195,Male,Yes,1,Graduate,Yes,3507,3148,212,360,1,Rural 221 | LP002208,Male,Yes,1,Graduate,No,3747,2139,125,360,1,Urban 222 | LP002212,Male,Yes,0,Graduate,No,2166,2166,108,360,,Urban 223 | LP002240,Male,Yes,0,Not Graduate,No,3500,2168,149,360,1,Rural 224 | LP002245,Male,Yes,2,Not 
Graduate,No,2896,0,80,480,1,Urban 225 | LP002253,Female,No,1,Graduate,No,5062,0,152,300,1,Rural 226 | LP002256,Female,No,2,Graduate,Yes,5184,0,187,360,0,Semiurban 227 | LP002257,Female,No,0,Graduate,No,2545,0,74,360,1,Urban 228 | LP002264,Male,Yes,0,Graduate,No,2553,1768,102,360,1,Urban 229 | LP002270,Male,Yes,1,Graduate,No,3436,3809,100,360,1,Rural 230 | LP002279,Male,No,0,Graduate,No,2412,2755,130,360,1,Rural 231 | LP002286,Male,Yes,3+,Not Graduate,No,5180,0,125,360,0,Urban 232 | LP002294,Male,No,0,Graduate,No,14911,14507,130,360,1,Semiurban 233 | LP002298,,No,0,Graduate,Yes,2860,2988,138,360,1,Urban 234 | LP002306,Male,Yes,0,Graduate,No,1173,1594,28,180,1,Rural 235 | LP002310,Female,No,1,Graduate,No,7600,0,92,360,1,Semiurban 236 | LP002311,Female,Yes,0,Graduate,No,2157,1788,104,360,1,Urban 237 | LP002316,Male,No,0,Graduate,No,2231,2774,176,360,0,Urban 238 | LP002321,Female,No,0,Graduate,No,2274,5211,117,360,0,Semiurban 239 | LP002325,Male,Yes,2,Not Graduate,No,6166,13983,102,360,1,Rural 240 | LP002326,Male,Yes,2,Not Graduate,No,2513,1110,107,360,1,Semiurban 241 | LP002329,Male,No,0,Graduate,No,4333,0,66,480,1,Urban 242 | LP002333,Male,No,0,Not Graduate,No,3844,0,105,360,1,Urban 243 | LP002339,Male,Yes,0,Graduate,No,3887,1517,105,360,0,Semiurban 244 | LP002344,Male,Yes,0,Graduate,No,3510,828,105,360,1,Semiurban 245 | LP002346,Male,Yes,0,Graduate,,2539,1704,125,360,0,Rural 246 | LP002354,Female,No,0,Not Graduate,No,2107,0,64,360,1,Semiurban 247 | LP002355,,Yes,0,Graduate,No,3186,3145,150,180,0,Semiurban 248 | LP002358,Male,Yes,2,Graduate,Yes,5000,2166,150,360,1,Urban 249 | LP002360,Male,Yes,,Graduate,No,10000,0,,360,1,Urban 250 | LP002375,Male,Yes,0,Not Graduate,Yes,3943,0,64,360,1,Semiurban 251 | LP002376,Male,No,0,Graduate,No,2925,0,40,180,1,Rural 252 | LP002383,Male,Yes,3+,Graduate,No,3242,437,142,480,0,Urban 253 | LP002385,Male,Yes,,Graduate,No,3863,0,70,300,1,Semiurban 254 | LP002389,Female,No,1,Graduate,No,4028,0,131,360,1,Semiurban 255 | LP002394,Male,Yes,2,Graduate,No,4010,1025,120,360,1,Urban 256 | LP002397,Female,Yes,1,Graduate,No,3719,1585,114,360,1,Urban 257 | LP002399,Male,No,0,Graduate,,2858,0,123,360,0,Rural 258 | LP002400,Female,Yes,0,Graduate,No,3833,0,92,360,1,Rural 259 | LP002402,Male,Yes,0,Graduate,No,3333,4288,160,360,1,Urban 260 | LP002412,Male,Yes,0,Graduate,No,3007,3725,151,360,1,Rural 261 | LP002415,Female,No,1,Graduate,,1850,4583,81,360,,Rural 262 | LP002417,Male,Yes,3+,Not Graduate,No,2792,2619,171,360,1,Semiurban 263 | LP002420,Male,Yes,0,Graduate,No,2982,1550,110,360,1,Semiurban 264 | LP002425,Male,No,0,Graduate,No,3417,738,100,360,,Rural 265 | LP002433,Male,Yes,1,Graduate,No,18840,0,234,360,1,Rural 266 | LP002440,Male,Yes,2,Graduate,No,2995,1120,184,360,1,Rural 267 | LP002441,Male,No,,Graduate,No,3579,3308,138,360,,Semiurban 268 | LP002442,Female,Yes,1,Not Graduate,No,3835,1400,112,480,0,Urban 269 | LP002445,Female,No,1,Not Graduate,No,3854,3575,117,360,1,Rural 270 | LP002450,Male,Yes,2,Graduate,No,5833,750,49,360,0,Rural 271 | LP002471,Male,No,0,Graduate,No,3508,0,99,360,1,Rural 272 | LP002476,Female,Yes,3+,Not Graduate,No,1635,2444,99,360,1,Urban 273 | LP002482,Female,No,0,Graduate,Yes,3333,3916,212,360,1,Rural 274 | LP002485,Male,No,1,Graduate,No,24797,0,240,360,1,Semiurban 275 | LP002495,Male,Yes,2,Graduate,No,5667,440,130,360,0,Semiurban 276 | LP002496,Female,No,0,Graduate,No,3500,0,94,360,0,Semiurban 277 | LP002523,Male,Yes,3+,Graduate,No,2773,1497,108,360,1,Semiurban 278 | LP002542,Male,Yes,0,Graduate,,6500,0,144,360,1,Urban 279 | 
LP002550,Female,No,0,Graduate,No,5769,0,110,180,1,Semiurban 280 | LP002551,Male,Yes,3+,Not Graduate,,3634,910,176,360,0,Semiurban 281 | LP002553,,No,0,Graduate,No,29167,0,185,360,1,Semiurban 282 | LP002554,Male,No,0,Graduate,No,2166,2057,122,360,1,Semiurban 283 | LP002561,Male,Yes,0,Graduate,No,5000,0,126,360,1,Rural 284 | LP002566,Female,No,0,Graduate,No,5530,0,135,360,,Urban 285 | LP002568,Male,No,0,Not Graduate,No,9000,0,122,360,1,Rural 286 | LP002570,Female,Yes,2,Graduate,No,10000,11666,460,360,1,Urban 287 | LP002572,Male,Yes,1,Graduate,,8750,0,297,360,1,Urban 288 | LP002581,Male,Yes,0,Not Graduate,No,2157,2730,140,360,,Rural 289 | LP002584,Male,No,0,Graduate,,1972,4347,106,360,1,Rural 290 | LP002592,Male,No,0,Graduate,No,4983,0,141,360,1,Urban 291 | LP002593,Male,Yes,1,Graduate,No,8333,4000,,360,1,Urban 292 | LP002599,Male,Yes,0,Graduate,No,3667,2000,170,360,1,Semiurban 293 | LP002604,Male,Yes,2,Graduate,No,3166,2833,145,360,1,Urban 294 | LP002605,Male,No,0,Not Graduate,No,3271,0,90,360,1,Rural 295 | LP002609,Female,Yes,0,Graduate,No,2241,2000,88,360,0,Urban 296 | LP002610,Male,Yes,1,Not Graduate,,1792,2565,128,360,1,Urban 297 | LP002612,Female,Yes,0,Graduate,No,2666,0,84,480,1,Semiurban 298 | LP002614,,No,0,Graduate,No,6478,0,108,360,1,Semiurban 299 | LP002630,Male,No,0,Not Graduate,,3808,0,83,360,1,Rural 300 | LP002635,Female,Yes,2,Not Graduate,No,3729,0,117,360,1,Semiurban 301 | LP002639,Male,Yes,2,Graduate,No,4120,0,128,360,1,Rural 302 | LP002644,Male,Yes,1,Graduate,Yes,7500,0,75,360,1,Urban 303 | LP002651,Male,Yes,1,Graduate,,6300,0,125,360,0,Urban 304 | LP002654,Female,No,,Graduate,Yes,14987,0,177,360,1,Rural 305 | LP002657,,Yes,1,Not Graduate,Yes,570,2125,68,360,1,Rural 306 | LP002711,Male,Yes,0,Graduate,No,2600,700,96,360,1,Semiurban 307 | LP002712,Male,No,2,Not Graduate,No,2733,1083,180,360,,Semiurban 308 | LP002721,Male,Yes,2,Graduate,Yes,7500,0,183,360,1,Rural 309 | LP002735,Male,Yes,2,Not Graduate,No,3859,0,121,360,1,Rural 310 | LP002744,Male,Yes,1,Graduate,No,6825,0,162,360,1,Rural 311 | LP002745,Male,Yes,0,Graduate,No,3708,4700,132,360,1,Semiurban 312 | LP002746,Male,No,0,Graduate,No,5314,0,147,360,1,Urban 313 | LP002747,Female,No,3+,Graduate,No,2366,5272,153,360,0,Rural 314 | LP002754,Male,No,,Graduate,No,2066,2108,104,84,1,Urban 315 | LP002759,Male,Yes,2,Graduate,No,5000,0,149,360,1,Rural 316 | LP002760,Female,No,0,Graduate,No,3767,0,134,300,1,Urban 317 | LP002766,Female,Yes,0,Graduate,No,7859,879,165,180,1,Semiurban 318 | LP002769,Female,Yes,0,Graduate,No,4283,0,120,360,1,Rural 319 | LP002774,Male,Yes,0,Not Graduate,No,1700,2900,67,360,0,Urban 320 | LP002775,,No,0,Not Graduate,No,4768,0,125,360,1,Rural 321 | LP002781,Male,No,0,Graduate,No,3083,2738,120,360,1,Urban 322 | LP002782,Male,Yes,1,Graduate,No,2667,1542,148,360,1,Rural 323 | LP002786,Female,Yes,0,Not Graduate,No,1647,1762,181,360,1,Urban 324 | LP002790,Male,Yes,3+,Graduate,No,3400,0,80,120,1,Urban 325 | LP002791,Male,No,1,Graduate,,16000,5000,40,360,1,Semiurban 326 | LP002793,Male,Yes,0,Graduate,No,5333,0,90,360,1,Rural 327 | LP002802,Male,No,0,Graduate,No,2875,2416,95,6,0,Semiurban 328 | LP002803,Male,Yes,1,Not Graduate,,2600,618,122,360,1,Semiurban 329 | LP002805,Male,Yes,2,Graduate,No,5041,700,150,360,1,Urban 330 | LP002806,Male,Yes,3+,Graduate,Yes,6958,1411,150,360,1,Rural 331 | LP002816,Male,Yes,1,Graduate,No,3500,1658,104,360,,Semiurban 332 | LP002823,Male,Yes,0,Graduate,No,5509,0,143,360,1,Rural 333 | LP002825,Male,Yes,3+,Graduate,No,9699,0,300,360,1,Urban 334 | LP002826,Female,Yes,1,Not 
Graduate,No,3621,2717,171,360,1,Urban 335 | LP002843,Female,Yes,0,Graduate,No,4709,0,113,360,1,Semiurban 336 | LP002849,Male,Yes,0,Graduate,No,1516,1951,35,360,1,Semiurban 337 | LP002850,Male,No,2,Graduate,No,2400,0,46,360,1,Urban 338 | LP002853,Female,No,0,Not Graduate,No,3015,2000,145,360,,Urban 339 | LP002856,Male,Yes,0,Graduate,No,2292,1558,119,360,1,Urban 340 | LP002857,Male,Yes,1,Graduate,Yes,2360,3355,87,240,1,Rural 341 | LP002858,Female,No,0,Graduate,No,4333,2333,162,360,0,Rural 342 | LP002860,Male,Yes,0,Graduate,Yes,2623,4831,122,180,1,Semiurban 343 | LP002867,Male,No,0,Graduate,Yes,3972,4275,187,360,1,Rural 344 | LP002869,Male,Yes,3+,Not Graduate,No,3522,0,81,180,1,Rural 345 | LP002870,Male,Yes,1,Graduate,No,4700,0,80,360,1,Urban 346 | LP002876,Male,No,0,Graduate,No,6858,0,176,360,1,Rural 347 | LP002878,Male,Yes,3+,Graduate,No,8334,0,260,360,1,Urban 348 | LP002879,Male,Yes,0,Graduate,No,3391,1966,133,360,0,Rural 349 | LP002885,Male,No,0,Not Graduate,No,2868,0,70,360,1,Urban 350 | LP002890,Male,Yes,2,Not Graduate,No,3418,1380,135,360,1,Urban 351 | LP002891,Male,Yes,0,Graduate,Yes,2500,296,137,300,1,Rural 352 | LP002899,Male,Yes,2,Graduate,No,8667,0,254,360,1,Rural 353 | LP002901,Male,No,0,Graduate,No,2283,15000,106,360,,Rural 354 | LP002907,Male,Yes,0,Graduate,No,5817,910,109,360,1,Urban 355 | LP002920,Male,Yes,0,Graduate,No,5119,3769,120,360,1,Rural 356 | LP002921,Male,Yes,3+,Not Graduate,No,5316,187,158,180,0,Semiurban 357 | LP002932,Male,Yes,3+,Graduate,No,7603,1213,197,360,1,Urban 358 | LP002935,Male,Yes,1,Graduate,No,3791,1936,85,360,1,Urban 359 | LP002952,Male,No,0,Graduate,No,2500,0,60,360,1,Urban 360 | LP002954,Male,Yes,2,Not Graduate,No,3132,0,76,360,,Rural 361 | LP002962,Male,No,0,Graduate,No,4000,2667,152,360,1,Semiurban 362 | LP002965,Female,Yes,0,Graduate,No,8550,4255,96,360,,Urban 363 | LP002969,Male,Yes,1,Graduate,No,2269,2167,99,360,1,Semiurban 364 | LP002971,Male,Yes,3+,Not Graduate,Yes,4009,1777,113,360,1,Urban 365 | LP002975,Male,Yes,0,Graduate,No,4158,709,115,360,1,Urban 366 | LP002980,Male,No,0,Graduate,No,3250,1993,126,360,,Semiurban 367 | LP002986,Male,Yes,0,Graduate,No,5000,2393,158,360,1,Rural 368 | LP002989,Male,No,0,Graduate,Yes,9200,0,98,180,1,Rural 369 | -------------------------------------------------------------------------------- /AV-loan-prediction/xgb.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | import pandas as pd 5 | import numpy as np 6 | import xgboost as xgb 7 | import time 8 | import math 9 | from sklearn.cross_validation import train_test_split 10 | from sklearn import preprocessing 11 | 12 | def load_data(): 13 | train = pd.read_csv('train.csv') 14 | test = pd.read_csv('test.csv') 15 | train_y = (train.Loan_Status == 'Y').astype('int') 16 | train_x = train.drop(['Loan_ID','Loan_Status'], axis=1) 17 | test_uid = test.Loan_ID 18 | test_x = test.drop(['Loan_ID'], axis=1) 19 | 20 | cat_var = ['Gender','Married','Dependents','Education', 'Self_Employed', 'Property_Area'] 21 | num_var = ['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term','Credit_History'] 22 | for var in num_var: 23 | train_x[var] = train_x[var].fillna(value = train_x[var].mean()) 24 | test_x[var] = test_x[var].fillna(value = test_x[var].mean()) 25 | train_x['Credibility'] = train_x['ApplicantIncome'] / train_x['LoanAmount'] 26 | test_x['Credibility'] = test_x['ApplicantIncome'] / test_x['LoanAmount'] 27 | train_x = train_x.fillna(value = -999) 28 | test_x = 
test_x.fillna(value = -999) 29 | 30 | for var in cat_var: 31 | lb = preprocessing.LabelEncoder() 32 | full_data = pd.concat((train_x[var],test_x[var]),axis=0).astype('str') 33 | lb.fit( full_data ) 34 | train_x[var] = lb.transform(train_x[var].astype('str')) 35 | test_x[var] = lb.transform(test_x[var].astype('str')) 36 | 37 | return train_x, train_y, test_x, test_uid 38 | 39 | 40 | def using_xgb(train_x, train_y, test_x, test_uid): 41 | scale_val = (train_y.sum() / train_y.shape[0]) 42 | X_train, X_val, y_train, y_val = train_test_split(train_x, train_y, train_size=0.75, random_state=0) 43 | xgb_train = xgb.DMatrix(X_train, label=y_train) 44 | xgb_val = xgb.DMatrix(X_val, label=y_val) 45 | xgb_test = xgb.DMatrix(test_x) 46 | 47 | # set xgboost classifier parameters 48 | params = { 49 | 'booster': 'gbtree', 50 | 'objective': 'binary:logistic', 51 | 'eval_metric': 'auc', 52 | 'early_stopping_rounds': 200, 53 | 'gamma':0, 54 | 'lambda': 1000, 55 | 'min_child_weight': 5, 56 | 'scale_pos_weight': scale_val, 57 | 'subsample': 0.7, 58 | 'max_depth':6, 59 | 'eta': 0.01, 60 | #'colsample_bytree': 0.7, 61 | 'nthread': 2 62 | } 63 | watchlist = [(xgb_val, 'val'), (xgb_train, 'train')] 64 | num_round = 10000 65 | bst = xgb.train(params, xgb_train, num_boost_round=num_round, evals=watchlist, early_stopping_rounds=200) 66 | scores = bst.predict(xgb_test, ntree_limit=bst.best_ntree_limit) 67 | pred = np.where(scores > 0.5, 'Y','N') 68 | 69 | 70 | print((pd.value_counts(pred))) 71 | 72 | 73 | result = pd.DataFrame({"Loan_ID":test_uid, "Loan_Status":pred}, columns=['Loan_ID','Loan_Status']) 74 | result.to_csv('result/xgb_'+str(time.time())[-4:]+'.csv', index=False) 75 | return 0 76 | 77 | def main(): 78 | train_x, train_y, test_x, test_uid = load_data() 79 | print("load_data() end!") 80 | using_xgb(train_x, train_y, test_x, test_uid) 81 | 82 | 83 | if __name__ == '__main__': 84 | main() -------------------------------------------------------------------------------- /DC-loan-rp/add_data/add_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | ''' 5 | @filename: add_data.py 6 | @author: yew1eb 7 | @site: http://blog.yew1eb.net 8 | @contact: yew1eb@gmail.com 9 | @time: 2016/01/01 12:00 PM 10 | ''' 11 | 12 | import pandas as pd 13 | import numpy as np 14 | 15 | path = 'd:/dataset/rp/' 16 | test_x_csv = path + 'test_x.csv' 17 | test_y_csv = './0.717.csv' 18 | dtest_x = pd.read_csv(test_x_csv) 19 | dtest_y = pd.read_csv(test_y_csv) 20 | 21 | test_xy = pd.merge(dtest_x, dtest_y, on='uid') 22 | add_low = test_xy[test_xy.score < 0.1] 23 | add_high = test_xy[test_xy.score > 0.97] 24 | add_test_xy = pd.concat([add_low,add_high], axis=0) 25 | 26 | add_test_xy = add_test_xy.drop_duplicates(cols='uid') 27 | 28 | print(add_test_xy) 29 | add_y = add_test_xy[['uid','score']].copy() 30 | add_y['score'] = np.where(add_y['score']<0.5, 0, 1) 31 | add_y.columns = ['uid', 'y'] 32 | add_X = add_test_xy.drop(['score'], axis=1) 33 | 34 | add_y.to_csv(path+'add_y.csv', index=False) 35 | add_X.to_csv(path+'add_X.csv', index=False) -------------------------------------------------------------------------------- /DC-loan-rp/feature-selection/0.70_feature_score.csv: -------------------------------------------------------------------------------- 1 | feature,fscore 2 | x894,1 3 | x799,1 4 | x817,1 5 | x261,1 6 | x723,1 7 | x290,1 8 | x929,1 9 | x440,1 10 | x262,1 11 | x1052,1 12 | x291,1 13 | x495,1 14 | x900,1 15 | x923,1 16 | x500,2 17 | x621,2 18 | x872,2 19 | x250,2 20 | x1039,2
21 | x279,2 22 | x275,2 23 | x278,3 24 | x886,3 25 | x740,3 26 | x911,3 27 | x812,3 28 | x505,3 29 | x450,3 30 | x598,4 31 | x507,4 32 | x453,4 33 | x540,4 34 | x544,4 35 | x1042,4 36 | x808,4 37 | x536,4 38 | x965,4 39 | x1044,4 40 | x931,5 41 | x1045,5 42 | x425,5 43 | x445,5 44 | x927,5 45 | x438,5 46 | x501,5 47 | x934,5 48 | x249,5 49 | x834,5 50 | x320,5 51 | x883,5 52 | x1034,5 53 | x321,5 54 | x724,6 55 | x938,6 56 | x498,6 57 | x776,6 58 | x1095,6 59 | x810,6 60 | x1031,6 61 | x668,6 62 | x702,6 63 | x448,7 64 | x922,7 65 | x437,7 66 | x635,7 67 | x658,7 68 | x701,7 69 | x936,7 70 | x912,7 71 | x662,7 72 | x691,8 73 | x1037,8 74 | x439,8 75 | x1079,8 76 | x312,8 77 | x854,8 78 | x893,9 79 | x537,9 80 | x443,9 81 | x901,9 82 | x508,9 83 | x318,9 84 | x545,9 85 | x916,9 86 | x871,10 87 | x302,10 88 | x755,10 89 | x1050,10 90 | x1014,10 91 | x1032,10 92 | x1033,10 93 | x907,10 94 | x1098,10 95 | x447,10 96 | x310,11 97 | x596,11 98 | x442,11 99 | x671,11 100 | x903,11 101 | x891,12 102 | x1093,12 103 | x906,12 104 | x715,12 105 | x754,12 106 | x811,12 107 | x917,12 108 | x712,13 109 | x1035,13 110 | x492,13 111 | x719,13 112 | x426,14 113 | x567,14 114 | x742,14 115 | x921,14 116 | x884,14 117 | x801,14 118 | x317,14 119 | x316,14 120 | x625,14 121 | x452,15 122 | x896,15 123 | x597,15 124 | x714,15 125 | x939,15 126 | x618,15 127 | x1081,15 128 | x281,15 129 | x1027,15 130 | x898,16 131 | x758,16 132 | x423,16 133 | x670,16 134 | x457,16 135 | x599,16 136 | x970,17 137 | x451,17 138 | x975,17 139 | x454,18 140 | x324,18 141 | x309,18 142 | x570,18 143 | x421,19 144 | x303,19 145 | x449,19 146 | x300,19 147 | x280,19 148 | x669,19 149 | x881,20 150 | x1047,20 151 | x264,20 152 | x694,20 153 | x569,21 154 | x594,21 155 | x664,21 156 | x832,21 157 | x1069,21 158 | x919,21 159 | x759,21 160 | x456,22 161 | x1013,22 162 | x739,22 163 | x1040,22 164 | x816,23 165 | x904,23 166 | x600,23 167 | x519,23 168 | x784,23 169 | x1043,23 170 | x319,23 171 | x840,23 172 | x882,24 173 | x1097,24 174 | x1002,24 175 | x311,24 176 | x753,24 177 | x885,24 178 | x989,25 179 | x622,25 180 | x313,25 181 | x888,25 182 | x639,25 183 | x1124,26 184 | x335,26 185 | x969,26 186 | x1127,26 187 | x839,26 188 | x925,26 189 | x926,27 190 | x1030,27 191 | x718,27 192 | x932,27 193 | x756,28 194 | x1038,28 195 | x880,28 196 | x441,28 197 | x632,29 198 | x841,29 199 | x848,29 200 | x243,29 201 | x930,29 202 | x315,29 203 | x693,30 204 | x539,30 205 | x1070,30 206 | x571,31 207 | x973,31 208 | x502,31 209 | x331,31 210 | x706,31 211 | x948,32 212 | x1091,32 213 | x301,32 214 | x244,32 215 | x260,33 216 | x608,33 217 | x703,34 218 | x659,35 219 | x1063,35 220 | x974,35 221 | x328,36 222 | x1080,36 223 | x415,36 224 | x845,36 225 | x1061,36 226 | x887,36 227 | x802,37 228 | x414,37 229 | x1036,37 230 | x745,38 231 | x572,38 232 | x972,39 233 | x242,39 234 | x308,39 235 | x646,39 236 | x708,39 237 | x1054,40 238 | x1086,40 239 | x793,41 240 | x961,41 241 | x1007,41 242 | x779,41 243 | x649,41 244 | x910,41 245 | x612,42 246 | x561,42 247 | x1065,43 248 | x314,43 249 | x523,43 250 | x1100,44 251 | x924,44 252 | x258,44 253 | x326,44 254 | x267,45 255 | x375,45 256 | x783,45 257 | x1092,45 258 | x680,45 259 | x411,45 260 | x334,46 261 | x1088,46 262 | x627,46 263 | x1119,46 264 | x1125,47 265 | x416,47 266 | x870,47 267 | x1075,47 268 | x1085,47 269 | x1099,47 270 | x1096,48 271 | x1129,48 272 | x673,48 273 | x1006,48 274 | x1133,49 275 | x589,49 276 | x908,49 277 | x419,49 278 | x547,49 279 | x628,51 280 | x1076,51 281 | 
x928,51 282 | x892,51 283 | x1017,52 284 | x1126,52 285 | x1012,52 286 | x1000,52 287 | x465,53 288 | x1123,53 289 | x624,53 290 | x532,53 291 | x1121,54 292 | x807,54 293 | x466,54 294 | x327,55 295 | x863,55 296 | x1058,55 297 | x971,57 298 | x531,57 299 | x915,58 300 | x720,59 301 | x1053,59 302 | x672,59 303 | x390,59 304 | x835,60 305 | x1136,60 306 | x467,60 307 | x577,61 308 | x667,61 309 | x751,61 310 | x1122,61 311 | x675,61 312 | x446,62 313 | x967,62 314 | x819,62 315 | x1071,63 316 | x844,63 317 | x265,63 318 | x1128,64 319 | x444,64 320 | x1089,65 321 | x1066,65 322 | x643,65 323 | x1062,66 324 | x1112,66 325 | x631,67 326 | x322,67 327 | x595,67 328 | x623,67 329 | x521,68 330 | x794,68 331 | x376,69 332 | x263,69 333 | x709,69 334 | x679,69 335 | x402,69 336 | x1130,69 337 | x828,69 338 | x526,69 339 | x1094,70 340 | x251,70 341 | x937,70 342 | x889,71 343 | x378,71 344 | x717,71 345 | x1067,71 346 | x580,73 347 | x517,73 348 | x1082,73 349 | x1104,73 350 | x1106,74 351 | x420,74 352 | x1008,74 353 | x1135,74 354 | x586,74 355 | x274,74 356 | x761,76 357 | x684,77 358 | x535,77 359 | x1056,77 360 | x1120,77 361 | x991,77 362 | x869,78 363 | x682,78 364 | x874,78 365 | x998,79 366 | x493,79 367 | x850,79 368 | x862,80 369 | x253,81 370 | x678,81 371 | x968,82 372 | x579,82 373 | x259,83 374 | x780,83 375 | x325,83 376 | x966,83 377 | x757,83 378 | x789,83 379 | x954,83 380 | x1105,83 381 | x1118,83 382 | x1055,84 383 | x381,84 384 | x307,84 385 | x1101,85 386 | x875,85 387 | x1116,85 388 | x573,85 389 | x273,86 390 | x1059,87 391 | x277,87 392 | x955,87 393 | x821,88 394 | x391,88 395 | x707,88 396 | x380,89 397 | x1131,89 398 | x527,89 399 | x837,89 400 | x827,90 401 | x1117,91 402 | x782,91 403 | x857,91 404 | x1046,91 405 | x520,93 406 | x697,93 407 | x1025,93 408 | x1049,93 409 | x1114,93 410 | x1111,94 411 | x605,94 412 | x935,94 413 | x805,95 414 | x962,95 415 | x750,96 416 | x1107,97 417 | x736,97 418 | x770,98 419 | x602,98 420 | x332,100 421 | x574,100 422 | x1115,101 423 | x283,101 424 | x1057,101 425 | x69,102 426 | x818,103 427 | x1102,104 428 | x1064,104 429 | x803,104 430 | x1041,104 431 | x503,104 432 | x351,105 433 | x345,105 434 | x914,105 435 | x705,105 436 | x1087,106 437 | x933,106 438 | x384,107 439 | x610,107 440 | x1028,109 441 | x1009,109 442 | x777,109 443 | x269,110 444 | x1103,111 445 | x256,111 446 | x417,111 447 | x1109,111 448 | x1068,111 449 | x1016,112 450 | x1090,112 451 | x282,113 452 | x593,113 453 | x370,114 454 | x905,114 455 | x980,114 456 | x920,114 457 | x510,115 458 | x847,115 459 | x245,115 460 | x1138,115 461 | x652,117 462 | x355,118 463 | x1015,118 464 | x1073,119 465 | x990,119 466 | x866,119 467 | x1113,119 468 | x534,119 469 | x1060,119 470 | x765,119 471 | x800,120 472 | x1137,120 473 | x630,120 474 | x1110,120 475 | x1048,120 476 | x851,120 477 | x806,120 478 | x1074,121 479 | x32,121 480 | x285,121 481 | x732,122 482 | x1083,122 483 | x747,122 484 | x677,122 485 | x575,123 486 | x626,123 487 | x522,125 488 | x890,125 489 | x386,126 490 | x343,126 491 | x655,127 492 | x330,128 493 | x984,128 494 | x349,128 495 | x1021,129 496 | x377,129 497 | x340,129 498 | x674,130 499 | x873,130 500 | x611,130 501 | x918,130 502 | x413,130 503 | x480,132 504 | x1072,132 505 | x909,133 506 | x565,134 507 | x964,135 508 | x749,135 509 | x458,136 510 | x704,136 511 | x795,136 512 | x266,137 513 | x737,137 514 | x676,137 515 | x976,138 516 | x344,138 517 | x1108,139 518 | x353,139 519 | x564,139 520 | x470,139 521 | x369,139 522 | x790,139 523 | 
x963,139 524 | x371,139 525 | x838,140 526 | x410,140 527 | x254,140 528 | x464,140 529 | x436,141 530 | x650,142 531 | x1020,144 532 | x1023,144 533 | x979,144 534 | x987,144 535 | x342,145 536 | x902,145 537 | x461,146 538 | x424,146 539 | x997,147 540 | x484,148 541 | x609,149 542 | x746,150 543 | x585,150 544 | x394,150 545 | x513,151 546 | x372,153 547 | x843,153 548 | x700,153 549 | x329,153 550 | x983,153 551 | x525,154 552 | x338,156 553 | x418,156 554 | x985,157 555 | x422,157 556 | x174,157 557 | x203,157 558 | x431,158 559 | x339,158 560 | x509,158 561 | x945,159 562 | x657,160 563 | x497,160 564 | x1003,160 565 | x752,161 566 | x647,162 567 | x1084,162 568 | x982,163 569 | x993,165 570 | x360,166 571 | x809,166 572 | x614,166 573 | x1078,166 574 | x389,167 575 | x512,167 576 | x170,168 577 | x562,168 578 | x387,169 579 | x388,169 580 | x396,170 581 | x77,170 582 | x382,172 583 | x576,172 584 | x581,172 585 | x1029,173 586 | x434,174 587 | x796,174 588 | x341,174 589 | x347,175 590 | x374,177 591 | x429,177 592 | x45,179 593 | x306,179 594 | x1077,181 595 | x760,182 596 | x100,182 597 | x529,183 598 | x294,184 599 | x986,185 600 | x5,185 601 | x506,186 602 | x460,186 603 | x476,186 604 | x957,187 605 | x1010,188 606 | x977,188 607 | x681,188 608 | x494,189 609 | x864,192 610 | x960,192 611 | x1005,193 612 | x762,193 613 | x1022,194 614 | x590,194 615 | x528,196 616 | x97,196 617 | x304,196 618 | x408,196 619 | x1011,197 620 | x514,197 621 | x996,197 622 | x485,198 623 | x385,198 624 | x459,198 625 | x958,200 626 | x583,200 627 | x196,200 628 | x36,200 629 | x713,204 630 | x393,204 631 | x689,204 632 | x481,204 633 | x361,204 634 | x122,204 635 | x73,206 636 | x80,207 637 | x729,207 638 | x336,207 639 | x75,207 640 | x556,207 641 | x478,208 642 | x1018,208 643 | x147,209 644 | x988,209 645 | x295,209 646 | x430,210 647 | x64,210 648 | x365,211 649 | x499,211 650 | x804,212 651 | x1024,212 652 | x195,212 653 | x781,212 654 | x401,212 655 | x354,212 656 | x70,213 657 | x103,213 658 | x733,213 659 | x748,214 660 | x220,214 661 | x856,215 662 | x764,215 663 | x824,216 664 | x490,216 665 | x582,216 666 | x198,216 667 | x12,217 668 | x17,218 669 | x395,219 670 | x188,219 671 | x683,220 672 | x155,220 673 | x186,220 674 | x533,220 675 | x197,221 676 | x33,222 677 | x999,224 678 | x563,225 679 | x956,226 680 | x109,227 681 | x651,227 682 | x483,227 683 | x39,228 684 | x978,229 685 | x1019,229 686 | x255,230 687 | x530,230 688 | x288,231 689 | x475,231 690 | x836,232 691 | x118,232 692 | x202,234 693 | x104,234 694 | x995,234 695 | x146,236 696 | x110,236 697 | x992,236 698 | x1004,237 699 | x403,239 700 | x205,239 701 | x221,239 702 | x766,239 703 | x169,239 704 | x268,240 705 | x116,240 706 | x548,240 707 | x348,241 708 | x176,243 709 | x48,243 710 | x553,243 711 | x350,245 712 | x271,245 713 | x648,246 714 | x356,246 715 | x287,247 716 | x3,247 717 | x379,248 718 | x878,248 719 | x398,248 720 | x412,249 721 | x323,249 722 | x106,251 723 | x469,251 724 | x373,252 725 | x81,252 726 | x731,253 727 | x72,253 728 | x108,253 729 | x568,254 730 | x686,254 731 | x210,255 732 | x225,255 733 | x482,256 734 | x362,259 735 | x699,259 736 | x63,259 737 | x207,259 738 | x284,260 739 | x383,260 740 | x653,260 741 | x23,260 742 | x222,261 743 | x299,262 744 | x213,263 745 | x474,263 746 | x578,264 747 | x145,265 748 | x233,265 749 | x772,265 750 | x86,265 751 | x201,265 752 | x230,265 753 | x190,266 754 | x124,266 755 | x946,266 756 | x557,266 757 | x232,267 758 | x194,267 759 | x136,267 760 | 
x405,267 761 | x117,268 762 | x555,268 763 | x868,268 764 | x738,268 765 | x107,269 766 | x846,269 767 | x206,269 768 | x409,269 769 | x257,269 770 | x656,269 771 | x15,269 772 | x167,270 773 | x551,270 774 | x359,270 775 | x949,271 776 | x119,271 777 | x981,272 778 | x550,274 779 | x953,274 780 | x1026,274 781 | x473,274 782 | x358,274 783 | x191,276 784 | x16,278 785 | x994,278 786 | x392,278 787 | x89,278 788 | x179,278 789 | x865,278 790 | x524,279 791 | x173,279 792 | x37,279 793 | x150,280 794 | x113,280 795 | x472,280 796 | x14,280 797 | x4,280 798 | x182,281 799 | x730,281 800 | x337,281 801 | x552,281 802 | x292,282 803 | x479,282 804 | x60,283 805 | x241,284 806 | x88,285 807 | x226,285 808 | x157,286 809 | x125,286 810 | x634,286 811 | x792,286 812 | x346,287 813 | x192,287 814 | x47,288 815 | x721,288 816 | x168,288 817 | x178,289 818 | x187,290 819 | x71,292 820 | x193,292 821 | x49,292 822 | x216,293 823 | x228,293 824 | x858,294 825 | x74,294 826 | x121,294 827 | x154,295 828 | x111,295 829 | x214,296 830 | x546,296 831 | x151,296 832 | x357,296 833 | x486,297 834 | x85,297 835 | x399,298 836 | x305,299 837 | x687,299 838 | x826,300 839 | x477,301 840 | x352,301 841 | x947,301 842 | x44,301 843 | x698,301 844 | x1001,303 845 | x52,303 846 | x397,303 847 | x46,304 848 | x204,304 849 | x400,304 850 | x227,304 851 | x849,305 852 | x61,305 853 | x132,305 854 | x404,306 855 | x181,307 856 | x130,307 857 | x156,307 858 | x38,308 859 | x11,308 860 | x91,309 861 | x959,310 862 | x217,311 863 | x129,311 864 | x200,311 865 | x120,313 866 | x471,313 867 | x9,313 868 | x10,315 869 | x19,316 870 | x829,318 871 | x144,319 872 | x950,321 873 | x20,321 874 | x62,322 875 | x166,323 876 | x18,323 877 | x246,324 878 | x41,324 879 | x185,325 880 | x139,326 881 | x43,326 882 | x128,326 883 | x160,326 884 | x84,326 885 | x24,327 886 | x942,328 887 | x208,329 888 | x867,331 889 | x722,331 890 | x171,331 891 | x22,332 892 | x102,332 893 | x21,332 894 | x42,332 895 | x184,333 896 | x613,333 897 | x105,335 898 | x601,336 899 | x78,336 900 | x407,336 901 | x51,337 902 | x8,337 903 | x7,338 904 | x149,338 905 | x112,338 906 | x690,339 907 | x560,340 908 | x823,340 909 | x165,341 910 | x126,341 911 | x629,341 912 | x234,344 913 | x172,344 914 | x771,346 915 | x87,346 916 | x685,346 917 | x83,346 918 | x645,347 919 | x163,348 920 | x462,349 921 | x435,349 922 | x511,350 923 | x778,350 924 | x427,351 925 | x57,352 926 | x159,352 927 | x56,353 928 | x763,353 929 | x272,355 930 | x183,357 931 | x189,358 932 | x238,359 933 | x842,360 934 | x240,361 935 | x161,361 936 | x952,361 937 | x237,361 938 | x223,362 939 | x35,362 940 | x852,362 941 | x215,363 942 | x199,363 943 | x289,365 944 | x13,366 945 | x428,366 946 | x114,367 947 | x716,368 948 | x406,371 949 | x180,372 950 | x131,375 951 | x367,375 952 | x296,375 953 | x235,376 954 | x1,376 955 | x54,377 956 | x633,378 957 | x162,378 958 | x123,379 959 | x252,380 960 | x432,381 961 | x140,381 962 | x90,381 963 | x34,382 964 | x82,383 965 | x6,384 966 | x825,386 967 | x40,387 968 | x218,388 969 | x31,389 970 | x941,389 971 | x944,390 972 | x229,390 973 | x115,391 974 | x133,392 975 | x211,392 976 | x735,393 977 | x177,393 978 | x363,394 979 | x127,394 980 | x175,395 981 | x66,395 982 | x219,395 983 | x209,395 984 | x820,398 985 | x293,399 986 | x164,401 987 | x768,402 988 | x50,402 989 | x298,404 990 | x141,405 991 | x30,405 992 | x286,406 993 | x68,410 994 | x654,414 995 | x148,414 996 | x591,415 997 | x153,415 998 | x943,416 999 | x2,419 1000 | x822,420 1001 
| x143,420 1002 | x101,420 1003 | x734,421 1004 | x158,422 1005 | x134,422 1006 | x607,422 1007 | x55,423 1008 | x93,423 1009 | x53,424 1010 | x297,428 1011 | x65,429 1012 | x79,431 1013 | x212,431 1014 | x239,433 1015 | x92,437 1016 | x224,439 1017 | x433,439 1018 | x366,440 1019 | x558,441 1020 | x549,444 1021 | x99,447 1022 | x231,454 1023 | x95,455 1024 | x554,458 1025 | x879,458 1026 | x28,458 1027 | x58,459 1028 | x98,459 1029 | x791,460 1030 | x606,461 1031 | x769,463 1032 | x940,464 1033 | x67,464 1034 | x27,468 1035 | x135,469 1036 | x25,469 1037 | x236,471 1038 | x152,474 1039 | x951,476 1040 | x94,479 1041 | x142,480 1042 | x270,482 1043 | x767,485 1044 | x59,488 1045 | x76,493 1046 | x137,501 1047 | x364,503 1048 | x96,507 1049 | x29,517 1050 | x592,532 1051 | x368,545 1052 | x138,563 1053 | x26,566 1054 | x584,580 1055 | x559,590 1056 | -------------------------------------------------------------------------------- /DC-loan-rp/feature-selection/anylze.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import matplotlib.pyplot as plt 3 | 4 | path = 'd:/dataset/rp/' 5 | features_type_csv = path + 'features_type.csv' 6 | features = pd.read_csv(features_type_csv) 7 | numeric = features.feature[features.type == 'numeric'] 8 | category = features.feature[features.type == 'category'] 9 | 10 | print('feature\nnumeric: %d ; category: %d' % (numeric.shape[0], category.shape[0]) ) 11 | 12 | feature_score_csv = './0.70_feature_score.csv' 13 | feature_score = pd.read_csv(feature_score_csv) 14 | 15 | feature_score.index = feature_score.feature 16 | feature_score = feature_score.drop(['feature'], axis=1) 17 | 18 | print('feature__category') 19 | feature_score_category = feature_score.ix[category] 20 | feature_score_category = feature_score_category.sort_values(by='fscore', ascending=False) 21 | feature_score_category.to_csv('./feature_score_category.csv') 22 | 23 | category_is_null = feature_score_category[feature_score_category.fscore.isnull()] 24 | list_null = list(category_is_null.index) 25 | print(list_null) 26 | 27 | print('feature__numeric') 28 | feature_score_numeric = feature_score.ix[numeric] 29 | feature_score_numeric = feature_score_numeric.sort_values(by='fscore', ascending=False) 30 | feature_score_numeric.to_csv('./feature_score_numeric.csv') 31 | 32 | 33 | f1 = open('./drop_list.txt', 'r') 34 | f2 = open('./drop_feature.txt', 'w+') 35 | for line in f1: 36 | row = line.strip().split(',') 37 | f2.write(row[0]+'\n') 38 | 39 | f1.close() 40 | f2.close() -------------------------------------------------------------------------------- /DC-loan-rp/feature-selection/drop_feature.txt: -------------------------------------------------------------------------------- 1 | x455 2 | x487 3 | x488 4 | x1051 5 | x1132 6 | x1134 7 | x247 8 | x248 9 | x276 10 | x333 11 | x463 12 | x468 13 | x489 14 | x491 15 | x496 16 | x504 17 | x515 18 | x516 19 | x518 20 | x538 21 | x541 22 | x542 23 | x543 24 | x566 25 | x587 26 | x588 27 | x603 28 | x604 29 | x615 30 | x616 31 | x617 32 | x619 33 | x620 34 | x636 35 | x637 36 | x638 37 | x640 38 | x641 39 | x642 40 | x644 41 | x660 42 | x661 43 | x663 44 | x665 45 | x666 46 | x688 47 | x692 48 | x695 49 | x696 50 | x710 51 | x711 52 | x725 53 | x726 54 | x727 55 | x728 56 | x741 57 | x743 58 | x744 59 | x773 60 | x774 61 | x775 62 | x785 63 | x786 64 | x787 65 | x788 66 | x797 67 | x798 68 | x813 69 | x814 70 | x815 71 | x830 72 | x831 73 | x833 74 | x853 75 | x855 76 | x859 77 | x860 78 | x861 
79 | x876 80 | x877 81 | x895 82 | x897 83 | x899 84 | x913 85 | -------------------------------------------------------------------------------- /DC-loan-rp/feature-selection/drop_list.txt: -------------------------------------------------------------------------------- 1 | x455, 2 | x487, 3 | x488, 4 | x1051, 5 | x1132, 6 | x1134, 7 | x247, 8 | x248, 9 | x276, 10 | x333, 11 | x463, 12 | x468, 13 | x489, 14 | x491, 15 | x496, 16 | x504, 17 | x515, 18 | x516, 19 | x518, 20 | x538, 21 | x541, 22 | x542, 23 | x543, 24 | x566, 25 | x587, 26 | x588, 27 | x603, 28 | x604, 29 | x615, 30 | x616, 31 | x617, 32 | x619, 33 | x620, 34 | x636, 35 | x637, 36 | x638, 37 | x640, 38 | x641, 39 | x642, 40 | x644, 41 | x660, 42 | x661, 43 | x663, 44 | x665, 45 | x666, 46 | x688, 47 | x692, 48 | x695, 49 | x696, 50 | x710, 51 | x711, 52 | x725, 53 | x726, 54 | x727, 55 | x728, 56 | x741, 57 | x743, 58 | x744, 59 | x773, 60 | x774, 61 | x775, 62 | x785, 63 | x786, 64 | x787, 65 | x788, 66 | x797, 67 | x798, 68 | x813, 69 | x814, 70 | x815, 71 | x830, 72 | x831, 73 | x833, 74 | x853, 75 | x855, 76 | x859, 77 | x860, 78 | x861, 79 | x876, 80 | x877, 81 | x895, 82 | x897, 83 | x899, 84 | x913, 85 | -------------------------------------------------------------------------------- /DC-loan-rp/feature-selection/feature_score_category.csv: -------------------------------------------------------------------------------- 1 | feature,fscore 2 | x1108,139.0 3 | x1137,120.0 4 | x1110,120.0 5 | x1113,119.0 6 | x1138,115.0 7 | x1109,111.0 8 | x417,111.0 9 | x1041,104.0 10 | x1115,101.0 11 | x1107,97.0 12 | x1111,94.0 13 | x1114,93.0 14 | x1117,91.0 15 | x1046,91.0 16 | x1131,89.0 17 | x1116,85.0 18 | x1118,83.0 19 | x1120,77.0 20 | x1135,74.0 21 | x1130,69.0 22 | x1112,66.0 23 | x444,64.0 24 | x1128,64.0 25 | x1122,61.0 26 | x1136,60.0 27 | x1121,54.0 28 | x1123,53.0 29 | x1126,52.0 30 | x1133,49.0 31 | x1129,48.0 32 | x416,47.0 33 | x1125,47.0 34 | x1119,46.0 35 | x411,45.0 36 | x314,43.0 37 | x308,39.0 38 | x1036,37.0 39 | x415,36.0 40 | x301,32.0 41 | x315,29.0 42 | x1038,28.0 43 | x1127,26.0 44 | x1124,26.0 45 | x313,25.0 46 | x311,24.0 47 | x885,24.0 48 | x319,23.0 49 | x1043,23.0 50 | x1040,22.0 51 | x1047,20.0 52 | x449,19.0 53 | x303,19.0 54 | x300,19.0 55 | x454,18.0 56 | x309,18.0 57 | x452,15.0 58 | x884,14.0 59 | x426,14.0 60 | x316,14.0 61 | x317,14.0 62 | x1035,13.0 63 | x310,11.0 64 | x1050,10.0 65 | x302,10.0 66 | x1032,10.0 67 | x447,10.0 68 | x1033,10.0 69 | x318,9.0 70 | x312,8.0 71 | x439,8.0 72 | x1037,8.0 73 | x448,7.0 74 | x1031,6.0 75 | x1034,5.0 76 | x321,5.0 77 | x320,5.0 78 | x1045,5.0 79 | x438,5.0 80 | x883,5.0 81 | x1042,4.0 82 | x1044,4.0 83 | x453,4.0 84 | x450,3.0 85 | x886,3.0 86 | x1039,2.0 87 | x440,1.0 88 | x1052,1.0 89 | x455, 90 | x487, 91 | x488, 92 | x1051, 93 | x1132, 94 | x1134, 95 | -------------------------------------------------------------------------------- /DC-loan-rp/feature-selection/feature_score_numeric.csv: -------------------------------------------------------------------------------- 1 | feature,fscore 2 | x559,590.0 3 | x584,580.0 4 | x26,566.0 5 | x138,563.0 6 | x368,545.0 7 | x592,532.0 8 | x29,517.0 9 | x96,507.0 10 | x364,503.0 11 | x137,501.0 12 | x76,493.0 13 | x59,488.0 14 | x767,485.0 15 | x270,482.0 16 | x142,480.0 17 | x94,479.0 18 | x951,476.0 19 | x152,474.0 20 | x236,471.0 21 | x135,469.0 22 | x25,469.0 23 | x27,468.0 24 | x940,464.0 25 | x67,464.0 26 | x769,463.0 27 | x606,461.0 28 | x791,460.0 29 | x98,459.0 30 | x58,459.0 31 | x28,458.0 32 | 
x554,458.0 33 | x879,458.0 34 | x95,455.0 35 | x231,454.0 36 | x99,447.0 37 | x549,444.0 38 | x558,441.0 39 | x366,440.0 40 | x224,439.0 41 | x433,439.0 42 | x92,437.0 43 | x239,433.0 44 | x79,431.0 45 | x212,431.0 46 | x65,429.0 47 | x297,428.0 48 | x53,424.0 49 | x55,423.0 50 | x93,423.0 51 | x134,422.0 52 | x607,422.0 53 | x158,422.0 54 | x734,421.0 55 | x143,420.0 56 | x822,420.0 57 | x101,420.0 58 | x2,419.0 59 | x943,416.0 60 | x153,415.0 61 | x591,415.0 62 | x148,414.0 63 | x654,414.0 64 | x68,410.0 65 | x286,406.0 66 | x141,405.0 67 | x30,405.0 68 | x298,404.0 69 | x50,402.0 70 | x768,402.0 71 | x164,401.0 72 | x293,399.0 73 | x820,398.0 74 | x209,395.0 75 | x175,395.0 76 | x66,395.0 77 | x219,395.0 78 | x127,394.0 79 | x363,394.0 80 | x735,393.0 81 | x177,393.0 82 | x133,392.0 83 | x211,392.0 84 | x115,391.0 85 | x229,390.0 86 | x944,390.0 87 | x31,389.0 88 | x941,389.0 89 | x218,388.0 90 | x40,387.0 91 | x825,386.0 92 | x6,384.0 93 | x82,383.0 94 | x34,382.0 95 | x90,381.0 96 | x140,381.0 97 | x432,381.0 98 | x252,380.0 99 | x123,379.0 100 | x633,378.0 101 | x162,378.0 102 | x54,377.0 103 | x235,376.0 104 | x1,376.0 105 | x131,375.0 106 | x367,375.0 107 | x296,375.0 108 | x180,372.0 109 | x406,371.0 110 | x716,368.0 111 | x114,367.0 112 | x13,366.0 113 | x428,366.0 114 | x289,365.0 115 | x199,363.0 116 | x215,363.0 117 | x35,362.0 118 | x852,362.0 119 | x223,362.0 120 | x952,361.0 121 | x237,361.0 122 | x161,361.0 123 | x240,361.0 124 | x842,360.0 125 | x238,359.0 126 | x189,358.0 127 | x183,357.0 128 | x272,355.0 129 | x763,353.0 130 | x56,353.0 131 | x57,352.0 132 | x159,352.0 133 | x427,351.0 134 | x778,350.0 135 | x511,350.0 136 | x435,349.0 137 | x462,349.0 138 | x163,348.0 139 | x645,347.0 140 | x771,346.0 141 | x83,346.0 142 | x685,346.0 143 | x87,346.0 144 | x172,344.0 145 | x234,344.0 146 | x629,341.0 147 | x165,341.0 148 | x126,341.0 149 | x560,340.0 150 | x823,340.0 151 | x690,339.0 152 | x149,338.0 153 | x112,338.0 154 | x7,338.0 155 | x51,337.0 156 | x8,337.0 157 | x78,336.0 158 | x407,336.0 159 | x601,336.0 160 | x105,335.0 161 | x184,333.0 162 | x613,333.0 163 | x42,332.0 164 | x102,332.0 165 | x22,332.0 166 | x21,332.0 167 | x722,331.0 168 | x171,331.0 169 | x867,331.0 170 | x208,329.0 171 | x942,328.0 172 | x24,327.0 173 | x160,326.0 174 | x128,326.0 175 | x43,326.0 176 | x84,326.0 177 | x139,326.0 178 | x185,325.0 179 | x246,324.0 180 | x41,324.0 181 | x18,323.0 182 | x166,323.0 183 | x62,322.0 184 | x20,321.0 185 | x950,321.0 186 | x144,319.0 187 | x829,318.0 188 | x19,316.0 189 | x10,315.0 190 | x9,313.0 191 | x471,313.0 192 | x120,313.0 193 | x129,311.0 194 | x217,311.0 195 | x200,311.0 196 | x959,310.0 197 | x91,309.0 198 | x38,308.0 199 | x11,308.0 200 | x181,307.0 201 | x130,307.0 202 | x156,307.0 203 | x404,306.0 204 | x132,305.0 205 | x61,305.0 206 | x849,305.0 207 | x46,304.0 208 | x227,304.0 209 | x204,304.0 210 | x400,304.0 211 | x52,303.0 212 | x1001,303.0 213 | x397,303.0 214 | x352,301.0 215 | x698,301.0 216 | x947,301.0 217 | x477,301.0 218 | x44,301.0 219 | x826,300.0 220 | x305,299.0 221 | x687,299.0 222 | x399,298.0 223 | x85,297.0 224 | x486,297.0 225 | x546,296.0 226 | x151,296.0 227 | x357,296.0 228 | x214,296.0 229 | x111,295.0 230 | x154,295.0 231 | x121,294.0 232 | x74,294.0 233 | x858,294.0 234 | x216,293.0 235 | x228,293.0 236 | x49,292.0 237 | x71,292.0 238 | x193,292.0 239 | x187,290.0 240 | x178,289.0 241 | x47,288.0 242 | x168,288.0 243 | x721,288.0 244 | x346,287.0 245 | x192,287.0 246 | x157,286.0 247 | x125,286.0 248 | x792,286.0 
249 | x634,286.0 250 | x226,285.0 251 | x88,285.0 252 | x241,284.0 253 | x60,283.0 254 | x292,282.0 255 | x479,282.0 256 | x552,281.0 257 | x337,281.0 258 | x730,281.0 259 | x182,281.0 260 | x113,280.0 261 | x150,280.0 262 | x14,280.0 263 | x4,280.0 264 | x472,280.0 265 | x524,279.0 266 | x37,279.0 267 | x173,279.0 268 | x89,278.0 269 | x994,278.0 270 | x179,278.0 271 | x392,278.0 272 | x16,278.0 273 | x865,278.0 274 | x191,276.0 275 | x953,274.0 276 | x358,274.0 277 | x473,274.0 278 | x1026,274.0 279 | x550,274.0 280 | x981,272.0 281 | x119,271.0 282 | x949,271.0 283 | x167,270.0 284 | x551,270.0 285 | x359,270.0 286 | x15,269.0 287 | x107,269.0 288 | x257,269.0 289 | x846,269.0 290 | x409,269.0 291 | x206,269.0 292 | x656,269.0 293 | x555,268.0 294 | x738,268.0 295 | x117,268.0 296 | x868,268.0 297 | x194,267.0 298 | x405,267.0 299 | x136,267.0 300 | x232,267.0 301 | x557,266.0 302 | x190,266.0 303 | x946,266.0 304 | x124,266.0 305 | x233,265.0 306 | x772,265.0 307 | x201,265.0 308 | x86,265.0 309 | x145,265.0 310 | x230,265.0 311 | x578,264.0 312 | x213,263.0 313 | x474,263.0 314 | x299,262.0 315 | x222,261.0 316 | x383,260.0 317 | x653,260.0 318 | x284,260.0 319 | x23,260.0 320 | x63,259.0 321 | x362,259.0 322 | x699,259.0 323 | x207,259.0 324 | x482,256.0 325 | x210,255.0 326 | x225,255.0 327 | x568,254.0 328 | x686,254.0 329 | x72,253.0 330 | x731,253.0 331 | x108,253.0 332 | x81,252.0 333 | x373,252.0 334 | x469,251.0 335 | x106,251.0 336 | x323,249.0 337 | x412,249.0 338 | x878,248.0 339 | x379,248.0 340 | x398,248.0 341 | x287,247.0 342 | x3,247.0 343 | x648,246.0 344 | x356,246.0 345 | x271,245.0 346 | x350,245.0 347 | x176,243.0 348 | x553,243.0 349 | x48,243.0 350 | x348,241.0 351 | x548,240.0 352 | x116,240.0 353 | x268,240.0 354 | x205,239.0 355 | x403,239.0 356 | x766,239.0 357 | x221,239.0 358 | x169,239.0 359 | x1004,237.0 360 | x146,236.0 361 | x110,236.0 362 | x992,236.0 363 | x104,234.0 364 | x202,234.0 365 | x995,234.0 366 | x118,232.0 367 | x836,232.0 368 | x475,231.0 369 | x288,231.0 370 | x255,230.0 371 | x530,230.0 372 | x1019,229.0 373 | x978,229.0 374 | x39,228.0 375 | x483,227.0 376 | x651,227.0 377 | x109,227.0 378 | x956,226.0 379 | x563,225.0 380 | x999,224.0 381 | x33,222.0 382 | x197,221.0 383 | x683,220.0 384 | x533,220.0 385 | x186,220.0 386 | x155,220.0 387 | x395,219.0 388 | x188,219.0 389 | x17,218.0 390 | x12,217.0 391 | x198,216.0 392 | x824,216.0 393 | x490,216.0 394 | x582,216.0 395 | x856,215.0 396 | x764,215.0 397 | x748,214.0 398 | x220,214.0 399 | x103,213.0 400 | x733,213.0 401 | x70,213.0 402 | x781,212.0 403 | x804,212.0 404 | x354,212.0 405 | x195,212.0 406 | x1024,212.0 407 | x401,212.0 408 | x365,211.0 409 | x499,211.0 410 | x64,210.0 411 | x430,210.0 412 | x988,209.0 413 | x147,209.0 414 | x295,209.0 415 | x478,208.0 416 | x1018,208.0 417 | x729,207.0 418 | x80,207.0 419 | x556,207.0 420 | x336,207.0 421 | x75,207.0 422 | x73,206.0 423 | x481,204.0 424 | x122,204.0 425 | x713,204.0 426 | x689,204.0 427 | x393,204.0 428 | x361,204.0 429 | x196,200.0 430 | x958,200.0 431 | x36,200.0 432 | x583,200.0 433 | x485,198.0 434 | x385,198.0 435 | x459,198.0 436 | x1011,197.0 437 | x514,197.0 438 | x996,197.0 439 | x304,196.0 440 | x528,196.0 441 | x408,196.0 442 | x97,196.0 443 | x1022,194.0 444 | x590,194.0 445 | x1005,193.0 446 | x762,193.0 447 | x960,192.0 448 | x864,192.0 449 | x494,189.0 450 | x681,188.0 451 | x1010,188.0 452 | x977,188.0 453 | x957,187.0 454 | x506,186.0 455 | x476,186.0 456 | x460,186.0 457 | x986,185.0 458 | x5,185.0 459 | 
x294,184.0 460 | x529,183.0 461 | x100,182.0 462 | x760,182.0 463 | x1077,181.0 464 | x306,179.0 465 | x45,179.0 466 | x374,177.0 467 | x429,177.0 468 | x347,175.0 469 | x341,174.0 470 | x434,174.0 471 | x796,174.0 472 | x1029,173.0 473 | x581,172.0 474 | x382,172.0 475 | x576,172.0 476 | x396,170.0 477 | x77,170.0 478 | x388,169.0 479 | x387,169.0 480 | x170,168.0 481 | x562,168.0 482 | x512,167.0 483 | x389,167.0 484 | x360,166.0 485 | x1078,166.0 486 | x614,166.0 487 | x809,166.0 488 | x993,165.0 489 | x982,163.0 490 | x647,162.0 491 | x1084,162.0 492 | x752,161.0 493 | x657,160.0 494 | x1003,160.0 495 | x497,160.0 496 | x945,159.0 497 | x339,158.0 498 | x509,158.0 499 | x431,158.0 500 | x422,157.0 501 | x174,157.0 502 | x203,157.0 503 | x985,157.0 504 | x418,156.0 505 | x338,156.0 506 | x525,154.0 507 | x983,153.0 508 | x329,153.0 509 | x700,153.0 510 | x372,153.0 511 | x843,153.0 512 | x513,151.0 513 | x394,150.0 514 | x585,150.0 515 | x746,150.0 516 | x609,149.0 517 | x484,148.0 518 | x997,147.0 519 | x461,146.0 520 | x424,146.0 521 | x342,145.0 522 | x902,145.0 523 | x987,144.0 524 | x979,144.0 525 | x1023,144.0 526 | x1020,144.0 527 | x650,142.0 528 | x436,141.0 529 | x410,140.0 530 | x464,140.0 531 | x838,140.0 532 | x254,140.0 533 | x353,139.0 534 | x963,139.0 535 | x369,139.0 536 | x371,139.0 537 | x470,139.0 538 | x790,139.0 539 | x564,139.0 540 | x976,138.0 541 | x344,138.0 542 | x676,137.0 543 | x737,137.0 544 | x266,137.0 545 | x458,136.0 546 | x795,136.0 547 | x704,136.0 548 | x749,135.0 549 | x964,135.0 550 | x565,134.0 551 | x909,133.0 552 | x480,132.0 553 | x1072,132.0 554 | x413,130.0 555 | x873,130.0 556 | x918,130.0 557 | x611,130.0 558 | x674,130.0 559 | x377,129.0 560 | x1021,129.0 561 | x340,129.0 562 | x349,128.0 563 | x984,128.0 564 | x330,128.0 565 | x655,127.0 566 | x386,126.0 567 | x343,126.0 568 | x890,125.0 569 | x522,125.0 570 | x575,123.0 571 | x626,123.0 572 | x747,122.0 573 | x732,122.0 574 | x1083,122.0 575 | x677,122.0 576 | x285,121.0 577 | x32,121.0 578 | x1074,121.0 579 | x1048,120.0 580 | x851,120.0 581 | x800,120.0 582 | x806,120.0 583 | x630,120.0 584 | x1060,119.0 585 | x1073,119.0 586 | x765,119.0 587 | x866,119.0 588 | x990,119.0 589 | x534,119.0 590 | x355,118.0 591 | x1015,118.0 592 | x652,117.0 593 | x847,115.0 594 | x510,115.0 595 | x245,115.0 596 | x370,114.0 597 | x920,114.0 598 | x905,114.0 599 | x980,114.0 600 | x593,113.0 601 | x282,113.0 602 | x1016,112.0 603 | x1090,112.0 604 | x1103,111.0 605 | x1068,111.0 606 | x256,111.0 607 | x269,110.0 608 | x777,109.0 609 | x1028,109.0 610 | x1009,109.0 611 | x384,107.0 612 | x610,107.0 613 | x933,106.0 614 | x1087,106.0 615 | x351,105.0 616 | x705,105.0 617 | x914,105.0 618 | x345,105.0 619 | x1102,104.0 620 | x503,104.0 621 | x803,104.0 622 | x1064,104.0 623 | x818,103.0 624 | x69,102.0 625 | x283,101.0 626 | x1057,101.0 627 | x332,100.0 628 | x574,100.0 629 | x770,98.0 630 | x602,98.0 631 | x736,97.0 632 | x750,96.0 633 | x805,95.0 634 | x962,95.0 635 | x935,94.0 636 | x605,94.0 637 | x520,93.0 638 | x1025,93.0 639 | x1049,93.0 640 | x697,93.0 641 | x857,91.0 642 | x782,91.0 643 | x827,90.0 644 | x380,89.0 645 | x837,89.0 646 | x527,89.0 647 | x391,88.0 648 | x707,88.0 649 | x821,88.0 650 | x1059,87.0 651 | x277,87.0 652 | x955,87.0 653 | x273,86.0 654 | x573,85.0 655 | x875,85.0 656 | x1101,85.0 657 | x1055,84.0 658 | x381,84.0 659 | x307,84.0 660 | x325,83.0 661 | x789,83.0 662 | x1105,83.0 663 | x780,83.0 664 | x757,83.0 665 | x259,83.0 666 | x966,83.0 667 | x954,83.0 668 | x968,82.0 669 
| x579,82.0 670 | x253,81.0 671 | x678,81.0 672 | x862,80.0 673 | x493,79.0 674 | x998,79.0 675 | x850,79.0 676 | x869,78.0 677 | x874,78.0 678 | x682,78.0 679 | x991,77.0 680 | x535,77.0 681 | x1056,77.0 682 | x684,77.0 683 | x761,76.0 684 | x1106,74.0 685 | x420,74.0 686 | x274,74.0 687 | x1008,74.0 688 | x586,74.0 689 | x1104,73.0 690 | x517,73.0 691 | x1082,73.0 692 | x580,73.0 693 | x378,71.0 694 | x889,71.0 695 | x1067,71.0 696 | x717,71.0 697 | x1094,70.0 698 | x937,70.0 699 | x251,70.0 700 | x709,69.0 701 | x402,69.0 702 | x828,69.0 703 | x526,69.0 704 | x263,69.0 705 | x376,69.0 706 | x679,69.0 707 | x521,68.0 708 | x794,68.0 709 | x595,67.0 710 | x623,67.0 711 | x322,67.0 712 | x631,67.0 713 | x1062,66.0 714 | x1066,65.0 715 | x1089,65.0 716 | x643,65.0 717 | x1071,63.0 718 | x844,63.0 719 | x265,63.0 720 | x819,62.0 721 | x446,62.0 722 | x967,62.0 723 | x667,61.0 724 | x675,61.0 725 | x577,61.0 726 | x751,61.0 727 | x467,60.0 728 | x835,60.0 729 | x720,59.0 730 | x672,59.0 731 | x390,59.0 732 | x1053,59.0 733 | x915,58.0 734 | x531,57.0 735 | x971,57.0 736 | x1058,55.0 737 | x863,55.0 738 | x327,55.0 739 | x466,54.0 740 | x807,54.0 741 | x465,53.0 742 | x624,53.0 743 | x532,53.0 744 | x1012,52.0 745 | x1000,52.0 746 | x1017,52.0 747 | x1076,51.0 748 | x892,51.0 749 | x628,51.0 750 | x928,51.0 751 | x908,49.0 752 | x589,49.0 753 | x419,49.0 754 | x547,49.0 755 | x1006,48.0 756 | x1096,48.0 757 | x673,48.0 758 | x1099,47.0 759 | x1085,47.0 760 | x870,47.0 761 | x1075,47.0 762 | x334,46.0 763 | x627,46.0 764 | x1088,46.0 765 | x680,45.0 766 | x375,45.0 767 | x1092,45.0 768 | x783,45.0 769 | x267,45.0 770 | x924,44.0 771 | x258,44.0 772 | x1100,44.0 773 | x326,44.0 774 | x1065,43.0 775 | x523,43.0 776 | x561,42.0 777 | x612,42.0 778 | x649,41.0 779 | x1007,41.0 780 | x961,41.0 781 | x793,41.0 782 | x910,41.0 783 | x779,41.0 784 | x1054,40.0 785 | x1086,40.0 786 | x972,39.0 787 | x708,39.0 788 | x242,39.0 789 | x646,39.0 790 | x572,38.0 791 | x745,38.0 792 | x414,37.0 793 | x802,37.0 794 | x1080,36.0 795 | x1061,36.0 796 | x887,36.0 797 | x328,36.0 798 | x845,36.0 799 | x974,35.0 800 | x659,35.0 801 | x1063,35.0 802 | x703,34.0 803 | x260,33.0 804 | x608,33.0 805 | x1091,32.0 806 | x948,32.0 807 | x244,32.0 808 | x502,31.0 809 | x973,31.0 810 | x571,31.0 811 | x331,31.0 812 | x706,31.0 813 | x1070,30.0 814 | x539,30.0 815 | x693,30.0 816 | x632,29.0 817 | x930,29.0 818 | x841,29.0 819 | x243,29.0 820 | x848,29.0 821 | x756,28.0 822 | x441,28.0 823 | x880,28.0 824 | x932,27.0 825 | x926,27.0 826 | x718,27.0 827 | x1030,27.0 828 | x335,26.0 829 | x969,26.0 830 | x925,26.0 831 | x839,26.0 832 | x888,25.0 833 | x639,25.0 834 | x622,25.0 835 | x989,25.0 836 | x882,24.0 837 | x753,24.0 838 | x1002,24.0 839 | x1097,24.0 840 | x784,23.0 841 | x904,23.0 842 | x816,23.0 843 | x519,23.0 844 | x840,23.0 845 | x600,23.0 846 | x1013,22.0 847 | x739,22.0 848 | x456,22.0 849 | x919,21.0 850 | x759,21.0 851 | x569,21.0 852 | x594,21.0 853 | x832,21.0 854 | x1069,21.0 855 | x664,21.0 856 | x264,20.0 857 | x881,20.0 858 | x694,20.0 859 | x421,19.0 860 | x280,19.0 861 | x669,19.0 862 | x324,18.0 863 | x570,18.0 864 | x975,17.0 865 | x970,17.0 866 | x451,17.0 867 | x457,16.0 868 | x423,16.0 869 | x898,16.0 870 | x758,16.0 871 | x670,16.0 872 | x599,16.0 873 | x939,15.0 874 | x714,15.0 875 | x597,15.0 876 | x1081,15.0 877 | x1027,15.0 878 | x281,15.0 879 | x618,15.0 880 | x896,15.0 881 | x742,14.0 882 | x801,14.0 883 | x921,14.0 884 | x625,14.0 885 | x567,14.0 886 | x492,13.0 887 | x712,13.0 888 | 
x719,13.0 889 | x715,12.0 890 | x1093,12.0 891 | x754,12.0 892 | x917,12.0 893 | x906,12.0 894 | x891,12.0 895 | x811,12.0 896 | x671,11.0 897 | x442,11.0 898 | x903,11.0 899 | x596,11.0 900 | x755,10.0 901 | x1014,10.0 902 | x1098,10.0 903 | x871,10.0 904 | x907,10.0 905 | x893,9.0 906 | x545,9.0 907 | x537,9.0 908 | x508,9.0 909 | x901,9.0 910 | x443,9.0 911 | x916,9.0 912 | x854,8.0 913 | x691,8.0 914 | x1079,8.0 915 | x922,7.0 916 | x658,7.0 917 | x635,7.0 918 | x701,7.0 919 | x912,7.0 920 | x662,7.0 921 | x437,7.0 922 | x936,7.0 923 | x668,6.0 924 | x702,6.0 925 | x1095,6.0 926 | x724,6.0 927 | x938,6.0 928 | x776,6.0 929 | x810,6.0 930 | x498,6.0 931 | x927,5.0 932 | x249,5.0 933 | x501,5.0 934 | x425,5.0 935 | x834,5.0 936 | x445,5.0 937 | x934,5.0 938 | x931,5.0 939 | x536,4.0 940 | x965,4.0 941 | x808,4.0 942 | x544,4.0 943 | x540,4.0 944 | x598,4.0 945 | x507,4.0 946 | x812,3.0 947 | x278,3.0 948 | x740,3.0 949 | x911,3.0 950 | x505,3.0 951 | x500,2.0 952 | x250,2.0 953 | x621,2.0 954 | x279,2.0 955 | x872,2.0 956 | x275,2.0 957 | x290,1.0 958 | x723,1.0 959 | x817,1.0 960 | x291,1.0 961 | x799,1.0 962 | x923,1.0 963 | x262,1.0 964 | x495,1.0 965 | x894,1.0 966 | x900,1.0 967 | x929,1.0 968 | x261,1.0 969 | x247, 970 | x248, 971 | x276, 972 | x333, 973 | x463, 974 | x468, 975 | x489, 976 | x491, 977 | x496, 978 | x504, 979 | x515, 980 | x516, 981 | x518, 982 | x538, 983 | x541, 984 | x542, 985 | x543, 986 | x566, 987 | x587, 988 | x588, 989 | x603, 990 | x604, 991 | x615, 992 | x616, 993 | x617, 994 | x619, 995 | x620, 996 | x636, 997 | x637, 998 | x638, 999 | x640, 1000 | x641, 1001 | x642, 1002 | x644, 1003 | x660, 1004 | x661, 1005 | x663, 1006 | x665, 1007 | x666, 1008 | x688, 1009 | x692, 1010 | x695, 1011 | x696, 1012 | x710, 1013 | x711, 1014 | x725, 1015 | x726, 1016 | x727, 1017 | x728, 1018 | x741, 1019 | x743, 1020 | x744, 1021 | x773, 1022 | x774, 1023 | x775, 1024 | x785, 1025 | x786, 1026 | x787, 1027 | x788, 1028 | x797, 1029 | x798, 1030 | x813, 1031 | x814, 1032 | x815, 1033 | x830, 1034 | x831, 1035 | x833, 1036 | x853, 1037 | x855, 1038 | x859, 1039 | x860, 1040 | x861, 1041 | x876, 1042 | x877, 1043 | x895, 1044 | x897, 1045 | x899, 1046 | x913, 1047 | -------------------------------------------------------------------------------- /DC-loan-rp/sklearn-rf.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | ''' 5 | @filename: sklearn-rf.py.py 6 | @author: yew1eb 7 | @site: http://blog.yew1eb.net 8 | @contact: yew1eb@gmail.com 9 | @time: 2016/01/16 下午 10:47 10 | ''' 11 | 12 | import pandas as pd 13 | import numpy as np 14 | from sklearn import cross_validation 15 | from sklearn import metrics 16 | from sklearn.ensemble import RandomForestClassifier 17 | import time 18 | 19 | def load_data(dummy=False): 20 | path = 'D:/dataset/rp/' 21 | train_x = pd.read_csv(path + 'train_x.csv') 22 | train_x = train_x.drop(['uid'], axis=1) 23 | 24 | train_y = pd.read_csv(path + 'train_y.csv') 25 | train_y = train_y.drop(['uid'], axis=1) 26 | 27 | test_x = pd.read_csv(path + 'test_x.csv') 28 | test_uid = test_x.uid 29 | test_x = test_x.drop(['uid'], axis=1) 30 | 31 | if dummy: # 将分类类型的变量转为哑变量 32 | features = pd.read_csv(path + 'features_type.csv') 33 | features_category = features.feature[features.type == 'category'] 34 | encoded = pd.get_dummies(pd.concat([train_x, test_x], axis=0), columns=features_category) 35 | train_rows = train_x.shape[0] 36 | train_x = encoded.iloc[:train_rows, :] 
37 | test_x = encoded.iloc[train_rows:, :] 38 | 39 | return train_x, train_y, test_x, test_uid 40 | 41 | def sklearn_random_forest(train_x, train_y, test_x, test_uid): 42 | # 设置参数 43 | clf = RandomForestClassifier(n_estimators=5, 44 | bootstrap=True, #是否有放回的采样 45 | oob_score=False, 46 | n_jobs=4, #并行job个数 47 | min_samples_split=5) 48 | # 训练模型 49 | n_samples = train_x.shape[0] 50 | cv = cross_validation.ShuffleSplit(n_samples, n_iter=3, test_size=0.3, random_state=0) 51 | predicted = cross_validation.cross_val_predict(clf, train_x, train_y, cv=cv) 52 | print(metrics.accuracy_score(train_y, predicted)) 53 | 54 | test_y = clf.predict(test_x) 55 | result = pd.DataFrame({"uid":test_uid, "score":test_y}, columns=['uid','score']) 56 | result.to_csv('rf_'+str(time.time())+'.csv', index=False) 57 | 58 | def main(): 59 | train_x, train_y, test_x, test_uid = load_data(dummy=True) 60 | sklearn_random_forest(train_x, train_y, test_x, test_uid) 61 | 62 | if __name__ == '__main__': 63 | main() -------------------------------------------------------------------------------- /DC-loan-rp/small_data/train_y.csv: -------------------------------------------------------------------------------- 1 | uid,y 2 | 1792,1 3 | 4211,1 4 | 14658,1 5 | 17041,1 6 | 15765,1 7 | 6638,1 8 | 1649,1 9 | 5709,1 10 | 4749,0 11 | 7702,1 12 | 7456,1 13 | 4356,1 14 | 15094,1 15 | 16423,1 16 | 11970,1 17 | 13011,1 18 | 16849,1 19 | 9058,1 20 | 14323,1 21 | 5819,1 22 | 3595,1 23 | 14420,1 24 | 18082,1 25 | 8906,1 26 | 16752,1 27 | 14053,1 28 | 18994,1 29 | 12846,1 30 | 1495,1 31 | 5063,1 32 | 19456,1 33 | 4340,1 34 | 19146,0 35 | 10021,1 36 | 12994,1 37 | 18933,1 38 | 19739,0 39 | 18329,1 40 | 18724,1 41 | 10547,1 42 | 6338,1 43 | 8457,1 44 | 858,1 45 | 2580,0 46 | 1533,1 47 | 1565,0 48 | 11623,1 49 | 10857,1 50 | 9691,1 51 | 18520,1 52 | 17390,1 53 | 15930,1 54 | 6418,1 55 | 8485,0 56 | 7407,1 57 | 17837,1 58 | 18572,1 59 | 12310,1 60 | 2752,1 61 | 5068,1 62 | 5952,1 63 | 8147,1 64 | 5510,1 65 | 12481,1 66 | 455,1 67 | 6550,1 68 | 12289,1 69 | 8356,1 70 | 11317,1 71 | 10586,1 72 | 19239,1 73 | 16891,0 74 | 5071,1 75 | 1513,1 76 | 8852,1 77 | 31,1 78 | 6691,1 79 | 7452,1 80 | 9654,0 81 | 12010,1 82 | 9731,1 83 | 10938,1 84 | 13004,1 85 | 18726,1 86 | 7996,0 87 | 16360,1 88 | 10620,1 89 | 8776,1 90 | 19538,1 91 | 16652,1 92 | 10913,1 93 | 15901,1 94 | 10891,1 95 | 19770,1 96 | 15515,1 97 | 15757,1 98 | 2387,1 99 | 7761,0 100 | 13111,1 101 | 14062,1 102 | 5036,1 103 | 2663,0 104 | 7935,1 105 | 5630,1 106 | 14950,1 107 | 16032,1 108 | 16216,1 109 | 13931,1 110 | 14558,1 111 | 6397,1 112 | 15271,0 113 | 5488,1 114 | 6757,1 115 | 13607,0 116 | 18701,0 117 | 18733,1 118 | 14038,0 119 | 16711,1 120 | 14277,1 121 | 15000,1 122 | 241,1 123 | 1051,1 124 | 3241,1 125 | 13721,1 126 | 8359,1 127 | 15836,1 128 | 16195,1 129 | 10145,1 130 | 9591,1 131 | 1530,1 132 | 18882,1 133 | 19924,1 134 | 18696,0 135 | 14232,0 136 | 1514,1 137 | 13743,1 138 | 5821,0 139 | 18835,1 140 | 4376,0 141 | 19088,0 142 | 8590,1 143 | 4673,1 144 | 9607,1 145 | 8572,0 146 | 848,0 147 | 11470,1 148 | 18378,1 149 | 19668,1 150 | 9768,1 151 | 17572,1 152 | 12040,1 153 | 18588,1 154 | 9032,1 155 | 2561,1 156 | 3896,1 157 | 19572,0 158 | 2990,1 159 | 13708,1 160 | 11665,1 161 | 11582,1 162 | 2782,1 163 | 13169,1 164 | 6537,1 165 | 10741,1 166 | 19407,1 167 | 1605,1 168 | 10493,0 169 | 7885,1 170 | 15088,1 171 | 19477,1 172 | 15914,1 173 | 11259,1 174 | 1361,1 175 | 3722,1 176 | 16285,1 177 | 9831,1 178 | 11081,1 179 | 14580,1 180 | 8056,1 181 | 18402,1 182 | 6304,1 183 | 
17065,1 184 | 4988,1 185 | 2028,1 186 | 18090,1 187 | 15769,1 188 | 18252,1 189 | 13358,1 190 | 4594,1 191 | 3505,1 192 | 9270,1 193 | 5914,1 194 | 15670,1 195 | 7941,1 196 | 13714,1 197 | 1211,1 198 | 17220,1 199 | 17273,0 200 | 8555,1 201 | 11256,1 202 | 10832,1 203 | 13467,1 204 | 15994,1 205 | 1280,1 206 | 9173,1 207 | 3559,0 208 | 10441,0 209 | 11885,1 210 | 916,1 211 | 5122,1 212 | 17178,1 213 | 3069,1 214 | 1748,1 215 | 15322,1 216 | 5849,1 217 | 4422,1 218 | 7037,0 219 | 2035,1 220 | 12628,1 221 | 1135,1 222 | 4135,1 223 | 12967,1 224 | 8479,1 225 | 6337,1 226 | 2483,1 227 | 3592,1 228 | 10646,1 229 | 2313,1 230 | 15667,1 231 | 5087,1 232 | 16552,1 233 | 3695,1 234 | 15866,0 235 | 19544,1 236 | 17520,1 237 | 18127,1 238 | 1499,1 239 | 16952,1 240 | 2677,1 241 | 9997,1 242 | 7724,1 243 | 6854,0 244 | 5150,1 245 | 12596,1 246 | 15889,1 247 | 19507,1 248 | 12311,0 249 | 16405,1 250 | 17565,1 251 | 6609,1 252 | 2568,1 253 | 7705,1 254 | 11719,1 255 | 11998,1 256 | 2007,1 257 | 7595,1 258 | 12318,0 259 | 685,1 260 | 12614,1 261 | 4698,1 262 | 9902,1 263 | 13009,1 264 | 17267,0 265 | 81,1 266 | 6113,1 267 | 17000,1 268 | 4492,1 269 | 19079,1 270 | 1969,1 271 | 97,1 272 | 2981,1 273 | 12210,1 274 | 378,1 275 | 159,1 276 | 19751,1 277 | 2463,1 278 | 4312,0 279 | 5155,1 280 | 8439,1 281 | 19887,1 282 | 3831,1 283 | 11249,1 284 | 8431,1 285 | 10596,1 286 | 18036,1 287 | 5586,1 288 | 17947,1 289 | 4245,1 290 | 2459,1 291 | 9847,1 292 | 15236,1 293 | 10610,1 294 | 18447,1 295 | 12739,1 296 | 1441,1 297 | 14130,1 298 | 17478,1 299 | 5292,1 300 | 3578,1 301 | 16649,1 302 | 17435,1 303 | 1510,0 304 | 8556,0 305 | 13148,1 306 | 6118,1 307 | 12297,1 308 | 1159,1 309 | 1981,1 310 | 7120,1 311 | 17774,1 312 | 3021,1 313 | 17743,1 314 | 17392,1 315 | 13611,1 316 | 11629,0 317 | 6422,1 318 | 2000,1 319 | 8663,1 320 | 14870,1 321 | 17154,1 322 | 9615,1 323 | 1475,1 324 | 2654,1 325 | 14415,1 326 | 2957,1 327 | 1279,1 328 | 10932,1 329 | 10829,1 330 | 14806,1 331 | 1526,1 332 | 2520,1 333 | 2570,1 334 | 18918,1 335 | 19423,1 336 | 9098,1 337 | 6599,1 338 | 200,1 339 | 14016,1 340 | 11012,1 341 | 3701,1 342 | 4812,1 343 | 3063,1 344 | 15665,1 345 | 13814,1 346 | 17366,1 347 | 3059,0 348 | 12219,1 349 | 7823,1 350 | 1192,1 351 | 12423,1 352 | 7287,1 353 | 17369,1 354 | 1551,0 355 | 13211,1 356 | 3119,1 357 | 16838,1 358 | 205,1 359 | 13458,1 360 | 16226,1 361 | 6127,1 362 | 1622,1 363 | 3092,1 364 | 5310,1 365 | 11617,1 366 | 12272,1 367 | 16210,1 368 | 7990,0 369 | 1918,1 370 | 16861,1 371 | 8695,1 372 | 9027,1 373 | 7376,1 374 | 16836,1 375 | 8386,0 376 | 16680,1 377 | 14917,1 378 | 7484,1 379 | 10522,1 380 | 16493,1 381 | 19628,1 382 | 4765,1 383 | 17562,1 384 | 16075,1 385 | 9907,1 386 | 17480,1 387 | 13976,1 388 | 15058,1 389 | 10703,1 390 | 6303,1 391 | 454,1 392 | 17517,1 393 | 197,1 394 | 8280,1 395 | 19798,0 396 | 2822,1 397 | 11523,1 398 | 11889,1 399 | 820,1 400 | 6194,1 401 | 5768,1 402 | 12066,1 403 | 7259,1 404 | 16731,1 405 | 9330,1 406 | 9748,0 407 | 6262,1 408 | 6720,1 409 | 18619,1 410 | 9165,1 411 | 1080,1 412 | 2778,1 413 | 14872,1 414 | 5585,0 415 | 9865,0 416 | 10802,1 417 | 15705,1 418 | 10529,1 419 | 5144,1 420 | 14586,1 421 | 8516,1 422 | 17286,1 423 | 6109,1 424 | 14605,0 425 | 18176,1 426 | 14832,1 427 | 2516,1 428 | 5694,1 429 | 703,1 430 | 9824,1 431 | 12865,1 432 | 17927,1 433 | 5455,1 434 | 16202,1 435 | 6967,1 436 | 13279,1 437 | 14845,1 438 | 10739,1 439 | 4468,1 440 | 15601,1 441 | 269,1 442 | 814,1 443 | 14632,0 444 | 17967,1 445 | 6423,1 446 | 6549,1 447 | 3461,1 448 | 
3445,1 449 | 14015,1 450 | 5921,1 451 | 18431,0 452 | 14191,1 453 | 8564,1 454 | 13732,1 455 | 10329,1 456 | 1333,1 457 | 18674,1 458 | 17835,1 459 | 17068,1 460 | 14629,1 461 | 19949,1 462 | 18589,1 463 | 16370,1 464 | 4851,1 465 | 537,1 466 | 15882,1 467 | 4146,0 468 | 10405,1 469 | 8031,1 470 | 8403,1 471 | 6,0 472 | 14543,0 473 | 5278,1 474 | 4379,1 475 | 9166,1 476 | 14297,1 477 | 19105,1 478 | 19140,1 479 | 16387,1 480 | 2453,1 481 | 15776,1 482 | 18893,0 483 | 14280,1 484 | 3833,0 485 | 13240,1 486 | 2831,1 487 | 7623,0 488 | 15233,1 489 | 16127,1 490 | 3840,1 491 | 382,1 492 | 647,1 493 | 8017,1 494 | 16443,1 495 | 12005,1 496 | 12929,1 497 | 7767,1 498 | 2983,0 499 | 2821,0 500 | 14713,1 501 | 847,1 502 | 7826,1 503 | 3928,1 504 | 10304,1 505 | 13789,0 506 | 11673,1 507 | 17813,1 508 | 2136,1 509 | 3126,1 510 | 9291,1 511 | 1327,0 512 | 6978,1 513 | 4846,1 514 | 1935,1 515 | 8661,1 516 | 8080,1 517 | 19574,1 518 | 6420,1 519 | 6403,1 520 | 12436,1 521 | 2141,1 522 | 16770,1 523 | 7441,1 524 | 5597,1 525 | 12875,0 526 | 12986,1 527 | 18928,1 528 | 18577,1 529 | 257,1 530 | 15870,1 531 | 14869,1 532 | 8960,0 533 | 11774,1 534 | 3141,1 535 | 6006,1 536 | 18599,1 537 | 2855,1 538 | 11686,1 539 | 6365,1 540 | 15455,1 541 | 14677,1 542 | 168,1 543 | 17487,0 544 | 10903,0 545 | 12155,1 546 | 554,1 547 | 17332,1 548 | 18501,1 549 | 14971,1 550 | 13442,1 551 | 19355,1 552 | 8870,0 553 | 18264,1 554 | 15591,1 555 | 4437,1 556 | 9469,1 557 | 12798,1 558 | 2478,1 559 | 18651,1 560 | 9869,1 561 | 10172,1 562 | 15614,1 563 | 14127,1 564 | 9781,1 565 | 8501,1 566 | 18664,1 567 | 5567,1 568 | 19931,1 569 | 4702,0 570 | 19365,1 571 | 6957,0 572 | 5476,1 573 | 8262,1 574 | 4565,1 575 | 20000,1 576 | 9154,1 577 | 13192,1 578 | 3033,1 579 | 18526,1 580 | 4803,1 581 | 15319,1 582 | 3292,1 583 | 15877,1 584 | 14497,1 585 | 16374,1 586 | 15437,1 587 | 16356,1 588 | 11031,1 589 | 17099,1 590 | 4177,1 591 | 11950,1 592 | 12295,1 593 | 19658,1 594 | 9168,1 595 | 2024,0 596 | 3900,0 597 | 2566,1 598 | 19431,1 599 | 18492,1 600 | 17315,1 601 | 3255,1 602 | 2508,1 603 | 17779,1 604 | 12696,1 605 | 18847,1 606 | 4780,1 607 | 16014,1 608 | 19069,1 609 | 8199,1 610 | 7513,1 611 | 16301,1 612 | 11560,1 613 | 18593,1 614 | 2702,1 615 | 17876,1 616 | 1766,1 617 | 6038,1 618 | 17509,1 619 | 16299,1 620 | 8324,1 621 | 7505,1 622 | 7783,1 623 | 8985,1 624 | 15633,1 625 | 18469,1 626 | 10722,1 627 | 16981,1 628 | 13050,1 629 | 15464,1 630 | 13237,1 631 | 17888,1 632 | 12514,1 633 | 12663,1 634 | 16079,1 635 | 4150,1 636 | 10728,1 637 | 15427,1 638 | 14944,1 639 | 7399,1 640 | 18669,1 641 | 10104,1 642 | 9299,1 643 | 14974,1 644 | 7136,1 645 | 4152,0 646 | 14184,1 647 | 18080,1 648 | 7746,1 649 | 19601,1 650 | 17470,1 651 | 11561,1 652 | 10862,1 653 | 11109,1 654 | 4469,1 655 | 17459,1 656 | 10336,1 657 | 17632,1 658 | 13748,1 659 | 666,1 660 | 12056,1 661 | 3009,1 662 | 16774,0 663 | 15106,1 664 | 7548,1 665 | 7800,1 666 | 17027,0 667 | 4308,1 668 | 4480,1 669 | 17060,1 670 | 19015,1 671 | 13827,1 672 | 3494,1 673 | 11585,0 674 | 18903,1 675 | 1753,1 676 | 12227,1 677 | 2408,1 678 | 2300,1 679 | 19187,1 680 | 7228,1 681 | 11094,1 682 | 8867,1 683 | 6380,1 684 | 6772,1 685 | 9204,1 686 | 5076,1 687 | 19120,1 688 | 17857,1 689 | 14304,1 690 | 1445,1 691 | 12092,1 692 | 8335,1 693 | 2798,1 694 | 10672,0 695 | 611,1 696 | 4103,1 697 | 11794,1 698 | 11887,1 699 | 7600,1 700 | 10837,1 701 | 14194,1 702 | 18259,1 703 | 391,1 704 | 10448,1 705 | 12552,1 706 | 19641,0 707 | 15940,1 708 | 17609,1 709 | 16049,1 710 | 17903,1 711 
| 1887,1 712 | 12669,1 713 | 11164,1 714 | 14626,1 715 | 17715,1 716 | 15727,1 717 | 6334,1 718 | 15386,1 719 | 7027,1 720 | 12296,1 721 | 1740,0 722 | 15272,1 723 | 12095,1 724 | 15821,1 725 | 12243,1 726 | 4855,1 727 | 16611,1 728 | 15315,1 729 | 2678,1 730 | 14855,1 731 | 16865,0 732 | 3880,1 733 | 2100,1 734 | 9762,1 735 | 17540,1 736 | 19638,1 737 | 3198,1 738 | 4959,1 739 | 18719,0 740 | 2284,0 741 | 12172,1 742 | 11655,1 743 | 1585,1 744 | 776,1 745 | 6426,0 746 | 15064,1 747 | 9288,1 748 | 18811,1 749 | 7477,1 750 | 8350,0 751 | 9227,1 752 | 8163,0 753 | 2657,1 754 | 16399,1 755 | 5471,1 756 | 6014,1 757 | 16655,1 758 | 11500,1 759 | 17229,1 760 | 13284,0 761 | 1313,1 762 | 1977,1 763 | 17529,1 764 | 18200,1 765 | 10193,1 766 | 7185,1 767 | 1028,0 768 | 1756,1 769 | 13245,0 770 | 11955,1 771 | 10774,1 772 | 7510,1 773 | 13418,1 774 | 19756,1 775 | 6090,1 776 | 14187,1 777 | 19774,1 778 | 8535,1 779 | 10689,1 780 | 2871,1 781 | 14255,1 782 | 19789,1 783 | 11747,1 784 | 4005,1 785 | 9608,1 786 | 12756,1 787 | 3637,0 788 | 2207,0 789 | 192,1 790 | 17959,0 791 | 16994,1 792 | 9417,1 793 | 6041,1 794 | 3684,1 795 | 13341,1 796 | 13307,1 797 | 11530,1 798 | 9939,1 799 | 1873,1 800 | 1130,1 801 | 13293,1 802 | 19998,1 803 | 16378,1 804 | 6494,1 805 | 7759,1 806 | 9357,1 807 | 1442,1 808 | 3509,1 809 | 3518,1 810 | 9593,1 811 | 15458,1 812 | 17635,1 813 | 3950,1 814 | 15209,1 815 | 19828,1 816 | 16305,1 817 | 10959,1 818 | 18222,1 819 | 10679,1 820 | 15874,1 821 | 2791,0 822 | 1444,1 823 | 2107,0 824 | 18309,1 825 | 4999,1 826 | 18282,1 827 | 11295,1 828 | 7711,1 829 | 11956,1 830 | 19056,1 831 | 7356,1 832 | 11314,1 833 | 921,1 834 | 12673,1 835 | 18494,1 836 | 19880,1 837 | 15913,1 838 | 3236,1 839 | 3546,1 840 | 4726,1 841 | 2155,1 842 | 6471,1 843 | 13268,0 844 | 5021,1 845 | 15586,1 846 | 10194,1 847 | 6222,1 848 | 7487,1 849 | 11746,1 850 | 15653,1 851 | 7614,1 852 | 10631,1 853 | 13078,1 854 | 4490,1 855 | 7933,1 856 | 12350,1 857 | 10397,1 858 | 15006,1 859 | 2432,1 860 | 2929,1 861 | 11761,1 862 | 12888,1 863 | 10528,0 864 | 16977,1 865 | 2016,1 866 | 16181,1 867 | 4180,1 868 | 17792,1 869 | 659,1 870 | 15502,1 871 | 1320,1 872 | 13411,1 873 | 3465,1 874 | 8297,1 875 | 19956,1 876 | 10088,1 877 | 5799,1 878 | 9639,1 879 | 14431,0 880 | 9492,1 881 | 8827,1 882 | 18528,1 883 | 10778,1 884 | 13713,1 885 | 7072,1 886 | 12527,1 887 | 10937,1 888 | 6112,1 889 | 18460,1 890 | 10504,1 891 | 4484,1 892 | 17416,0 893 | 15399,1 894 | 17708,1 895 | 18021,1 896 | 10317,1 897 | 17453,1 898 | 6954,1 899 | 6239,1 900 | 16103,1 901 | 16229,1 902 | 5245,1 903 | 3810,1 904 | 16289,1 905 | 11496,1 906 | 11278,1 907 | 15906,1 908 | 3968,1 909 | 499,1 910 | 8410,0 911 | 11974,1 912 | 12880,1 913 | 11927,1 914 | 13129,1 915 | 16024,1 916 | 8570,1 917 | 619,1 918 | 13488,0 919 | 641,1 920 | 10177,1 921 | 10609,1 922 | 15411,1 923 | 8953,1 924 | 19054,0 925 | 7145,1 926 | 17308,1 927 | 6776,1 928 | 4710,1 929 | 13294,1 930 | 4920,1 931 | 7165,1 932 | 8277,0 933 | 14775,1 934 | 19373,1 935 | 1203,1 936 | 18959,1 937 | 18587,1 938 | 18868,1 939 | 15370,1 940 | 1560,1 941 | 4853,1 942 | 2304,1 943 | 15918,1 944 | 11811,1 945 | 15693,1 946 | 2712,1 947 | 13921,1 948 | 8882,1 949 | 6234,1 950 | 13349,1 951 | 11004,1 952 | 17928,1 953 | 8991,1 954 | 12975,0 955 | 19283,1 956 | 3064,1 957 | 14790,1 958 | 2714,1 959 | 6409,1 960 | 18748,1 961 | 12557,1 962 | 16914,1 963 | 6932,1 964 | 14135,0 965 | 5901,1 966 | 165,1 967 | 10063,1 968 | 12248,1 969 | 3046,1 970 | 3507,1 971 | 18676,1 972 | 19244,1 973 | 13105,1 
974 | 14981,1 975 | 9524,1 976 | 10565,1 977 | 2704,0 978 | 8419,1 979 | 11361,1 980 | 7275,0 981 | 4501,1 982 | 3931,1 983 | 8756,1 984 | 2572,1 985 | 9459,0 986 | 5356,0 987 | 7840,1 988 | 13740,1 989 | 12534,0 990 | 8279,1 991 | 15249,1 992 | 8683,1 993 | 13777,0 994 | 18275,1 995 | 12446,0 996 | 10462,1 997 | 9220,1 998 | 5590,1 999 | 7581,1 1000 | 7272,1 1001 | 2358,1 1002 | -------------------------------------------------------------------------------- /DC-loan-rp/source.R: -------------------------------------------------------------------------------- 1 | library(xgboost) 2 | library(Matrix) 3 | 4 | # read data 5 | train=read.csv('strain_x.csv') 6 | test=read.csv('stest_x.csv') 7 | train.y=read.csv('strain_y.csv') 8 | ft=read.csv('sfeatures_type.csv') 9 | fn.cat=as.character(ft[ft[,2]=='category',1]) 10 | 11 | fn.num=as.character(ft[ft[,2]=='numeric',1]) 12 | 13 | 14 | # create dummy variables 15 | temp.train=data.frame(rep(0,nrow(train))) 16 | temp.test=data.frame(rep(0,nrow(test))) 17 | for(f in fn.cat){ 18 | levels=unique(train[,f]) 19 | col.train=data.frame(factor(train[,f],levels=levels)) 20 | col.test=data.frame(factor(test[,f],levels=levels)) 21 | colnames(col.train)=f 22 | colnames(col.test)=f 23 | temp.train=cbind(temp.train,model.matrix(as.formula(paste0('~',f,'-1')),data=col.train)) 24 | temp.train[,paste0(f,'-1')]=NULL 25 | temp.test=cbind(temp.test,model.matrix(as.formula(paste0('~',f,'-1')),data=col.test)) 26 | temp.test[,paste0(f,'-1')]=NULL 27 | } 28 | temp.train[,1]=NULL 29 | temp.test[,1]=NULL 30 | train.new=Matrix(data.matrix(cbind(train[,c('uid',fn.num)],temp.train)),sparse=T) 31 | test.new=Matrix(data.matrix(cbind(test[,c('uid',fn.num)],temp.test)),sparse=T) 32 | 33 | 34 | # fit xgboost model 35 | 36 | dtrain=xgb.DMatrix(data=train.new[,-1],label=1-train.y$y) 37 | dtest= xgb.DMatrix(data=test.new[,-1]) 38 | 39 | model=xgb.train(booster='gbtree', 40 | objective='binary:logistic', 41 | scale_pos_weight=8.7, 42 | gamma=0, 43 | lambda=1000, 44 | alpha=800, 45 | subsample=0.75, 46 | colsample_bytree=0.30, 47 | min_child_weight=5, 48 | max_depth=8, 49 | eta=0.01, 50 | data=dtrain, 51 | nrounds=1520, 52 | metrics='auc', 53 | nthread=2) 54 | 55 | # predict probabilities 56 | pred=1-predict(model,dtest) 57 | 58 | write.csv(data.frame('uid'=test.new[,1],'score'=pred),file='2015-12-22.csv',row.names=F) -------------------------------------------------------------------------------- /DC-loan-rp/xgb.py: -------------------------------------------------------------------------------- 1 | #/usr/bin/python3 2 | 3 | 4 | import pandas as pd 5 | import xgboost as xgb 6 | import time 7 | from sklearn.cross_validation import train_test_split 8 | import matplotlib.pyplot as plt 9 | import numpy as np 10 | 11 | # set data path 12 | path = 'D:/dataset/rp/' 13 | train_x_csv = path+'train_x.csv' 14 | train_y_csv = path+'train_y.csv' 15 | test_x_csv = path+'test_x.csv' 16 | features_type_csv = path+'features_type.csv' 17 | 18 | # load data 19 | train_x = pd.read_csv(train_x_csv) 20 | train_y = pd.read_csv(train_y_csv) 21 | train_xy = pd.merge(train_x, train_y, on='uid') 22 | test = pd.read_csv(test_x_csv) 23 | test_uid = test.uid 24 | test_x = test.drop(['uid'], axis=1) 25 | 26 | # split train set,generate train,val,test set 27 | train_xy = train_xy.drop(['uid'], axis=1) 28 | train, val = train_test_split(train_xy, test_size=0.35) 29 | y = train.y 30 | X = train.drop(['y'], axis=1) 31 | 32 | def add_data(X,y): 33 | add_X = pd.read_csv(path+'add_X.csv') 34 | add_X = add_X.drop(['uid'], 
axis=1) 35 | add_y = pd.read_csv(path+'add_y.csv') 36 | add_y = add_y.drop(['uid'], axis=1) 37 | add_y = add_y.y 38 | X = pd.concat([X,add_X], axis=0) 39 | y = pd.concat([y,add_y], axis=0) 40 | return X, y 41 | 42 | X, y = add_data(X, y) 43 | 44 | 45 | val_y = val.y 46 | val_X = val.drop(['y'], axis=1) 47 | 48 | # DC-loan-rp start here 49 | dtest = xgb.DMatrix(test_x) 50 | dval = xgb.DMatrix(val_X, label=val_y) 51 | dtrain = xgb.DMatrix(X, label=y) 52 | 53 | params = { 54 | 'booster': 'gbtree', 55 | 'objective': 'binary:logistic', 56 | 'early_stopping_rounds': 100, 57 | 'scale_pos_weight': 0.77, 58 | 'eval_metric': 'auc', 59 | 'gamma': 0.1, 60 | 'min_child_weight': 5, 61 | 'lambda': 700, 62 | 'subsample': 0.7, 63 | 'colsample_bytree': 0.3, 64 | 'max_depth': 8, 65 | 'eta': 0.03, 66 | } 67 | 68 | watchlist = [(dval, 'val'), (dtrain, 'train')] 69 | model = xgb.train(params, dtrain, num_boost_round=5, evals=watchlist) 70 | model.save_model('./xgb.model') 71 | 72 | # predict test set (from the best iteration) 73 | scores = model.predict(dtest, ntree_limit=model.best_ntree_limit) 74 | result = pd.DataFrame({"uid":test_uid, "score":scores}, columns=['uid','score']) 75 | result.to_csv(str(time.time())+'.csv', index=False) 76 | 77 | features = model.get_fscore() 78 | features = sorted(features.items(), key=lambda d:d[1]) 79 | f_df = pd.DataFrame(features, columns=['feature','fscore']) 80 | f_df.to_csv('./feature_score.csv',index=False) 81 | 82 | 83 | 84 | ''' 85 | plt.figure() 86 | import_f = f_df[:10] 87 | import_f.plot(kind='barh', x='feature', y='fscore', legend=False) 88 | plt.title('XGBoost Feature Importance') 89 | plt.xlabel('relative importance') 90 | plt.show() 91 | ''' -------------------------------------------------------------------------------- /DC-loan-rp/xgb_dummy.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | import xgboost as xgb 5 | import pandas as pd 6 | import numpy as np 7 | from sklearn.cross_validation import train_test_split 8 | import time 9 | 10 | 11 | def split_data(): 12 | path = 'D:/dataset/rp/' 13 | small_size = 1000 14 | dtrain = pd.read_csv(path + 'train_x.csv') 15 | labels = pd.read_csv(path + 'train_y.csv') 16 | dtest = pd.read_csv(path + 'test_x.csv') 17 | 18 | dtrain[:small_size].to_csv(path+'small_data/train_x.csv', index=False) 19 | labels[:small_size].to_csv(path+'small_data/train_y.csv', index=False) 20 | dtest[:small_size].to_csv(path+'small_data/test_x.csv', index=False) 21 | return dtrain, labels, dtest 22 | 23 | def load_data(dummy=False): 24 | path = 'D:/dataset/rp/small_data/' 25 | #path = 'D:/dataset/rp/' 26 | train_x = pd.read_csv(path + 'train_x.csv') 27 | train_x = train_x.drop(['uid'], axis=1) 28 | 29 | train_y = pd.read_csv(path + 'train_y.csv') 30 | train_y = train_y.drop(['uid'], axis=1) 31 | 32 | test_x = pd.read_csv(path + 'test_x.csv') 33 | test_uid = test_x.uid 34 | test_x = test_x.drop(['uid'], axis=1) 35 | 36 | if dummy: # 将分类类型的变量转为哑变量 37 | features = pd.read_csv(path + 'features_type.csv') 38 | features_category = features.feature[features.type == 'category'] 39 | encoded = pd.get_dummies(pd.concat([train_x, test_x], axis=0), columns=features_category) 40 | train_rows = train_x.shape[0] 41 | train_x = encoded.iloc[:train_rows, :] 42 | test_x = encoded.iloc[train_rows:, :] 43 | 44 | return train_x, train_y, test_x, test_uid 45 | 46 | def main(): 47 | train_x, train_y, test_x, test_uid = load_data(dummy=True) 48 | 49 | # 交叉验证,分割训练数据集 50 | 
random_seed = 10 51 | X_train, X_val, y_train, y_val= train_test_split(train_x, train_y, test_size=0.33, random_state=2016) 52 | xgb_train = xgb.DMatrix(X_train, label=y_train) 53 | xgb_val = xgb.DMatrix(X_val, label=y_val) 54 | xgb_test = xgb.DMatrix(test_x) 55 | 56 | # 设置xgboost分类器参数 57 | params = { 58 | 'booster': 'gbtree', 59 | 'objective': 'binary:logistic', 60 | 'eval_metric': 'auc', 61 | 'early_stopping_rounds': 100, 62 | 'scale_pos_weight': 0.77, 63 | 'gamma': 0.1, 64 | 'min_child_weight': 5, 65 | 'lambda': 700, 66 | 'subsample': 0.7, 67 | 'colsample_bytree': 0.3, 68 | 'max_depth': 8, 69 | 'eta': 0.03, 70 | 'nthread': 4 71 | } 72 | watchlist = [(xgb_val, 'test'), (xgb_train, 'train')] 73 | num_round = 10 74 | bst = xgb.train(params, xgb_train, num_boost_round=num_round, evals=watchlist) 75 | bst.save_model('./xgb.model') 76 | 77 | scores = bst.predict(xgb_test, ntree_limit=bst.best_ntree_limit) 78 | result = pd.DataFrame({"uid":test_uid, "score":scores}, columns=['uid','score']) 79 | result.to_csv('dummy_'+str(time.time())+'.csv', index=False) 80 | 81 | 82 | features = bst.get_fscore() 83 | features = sorted(features.items(), key=lambda d:d[1]) 84 | f_df = pd.DataFrame(features, columns=['feature','fscore']) 85 | f_df.to_csv('./feature_score.csv',index=False) 86 | 87 | ''' 88 | plt.figure() 89 | import_f = f_df[:10] 90 | import_f.plot(kind='barh', x='feature', y='fscore', legend=False) 91 | plt.title('XGBoost Feature Importance') 92 | plt.xlabel('relative importance') 93 | plt.show() 94 | ''' 95 | 96 | 97 | if __name__ == '__main__': 98 | main() 99 | -------------------------------------------------------------------------------- /Kaggle-bag-of-words/BagOfWords_LR.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from Kaggle_bag_of_words.KaggleWord2VecUtility import KaggleWord2VecUtility 4 | from sklearn.feature_extraction.text import TfidfVectorizer 5 | from sklearn.linear_model import LogisticRegression 6 | from sklearn import cross_validation 7 | import pandas as pd 8 | import numpy as np 9 | 10 | path = 'D:/dataset/word2vec/' 11 | train = pd.read_csv(path+'labeledTrainData.tsv', header=0, delimiter="\t", quoting=3) 12 | test = pd.read_csv(path+'testData.tsv', header=0, delimiter="\t", quoting=3 ) 13 | y = train["sentiment"] 14 | 15 | print("Cleaning and parsing movie reviews...\n") 16 | traindata = [] 17 | for i in range( 0, len(train["review"])): 18 | traindata.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(train["review"][i], False))) 19 | testdata = [] 20 | for i in range(0,len(test["review"])): 21 | testdata.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(test["review"][i], False))) 22 | 23 | print('vectorizing... ') 24 | tfv = TfidfVectorizer(min_df=3, max_features=None, 25 | strip_accents='unicode', analyzer='word',token_pattern=r'\w{1,}', 26 | ngram_range=(1, 2), use_idf=1,smooth_idf=1,sublinear_tf=1, 27 | stop_words = 'english') 28 | X_all = traindata + testdata 29 | lentrain = len(traindata) 30 | 31 | print("fitting pipeline... 
") 32 | tfv.fit(X_all) 33 | X_all = tfv.transform(X_all) 34 | 35 | X = X_all[:lentrain] 36 | X_test = X_all[lentrain:] 37 | 38 | model = LogisticRegression(penalty='l2', dual=True, tol=0.0001, 39 | C=1, fit_intercept=True, intercept_scaling=1.0, 40 | class_weight=None, random_state=None) 41 | print(("20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(model, X, y, cv=20, scoring='roc_auc')) )) 42 | 43 | print("Retrain on all training data, predicting test labels...\n") 44 | model.fit(X,y) 45 | result = model.predict_proba(X_test)[:,1] 46 | output = pd.DataFrame( data={"id":test["id"], "sentiment":result} ) 47 | 48 | # Use pandas to write the comma-separated output file 49 | output.to_csv('out/Bag_of_Words_model_LR.csv', index=False, quoting=3) 50 | print("Wrote results to Bag_of_Words_model_LR.csv") -------------------------------------------------------------------------------- /Kaggle-bag-of-words/BagOfWords_RF.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Author: Angela Chapman 4 | # Date: 8/6/2014 5 | # 6 | # This file contains code to accompany the Kaggle tutorial 7 | # "Deep learning goes to the movies". The code in this file 8 | # is for Part 1 of the tutorial on Natural Language Processing. 9 | # 10 | # *************************************** # 11 | 12 | 13 | from sklearn.feature_extraction.text import CountVectorizer 14 | from sklearn.ensemble import RandomForestClassifier 15 | from Kaggle_bag_of_words.KaggleWord2VecUtility import KaggleWord2VecUtility 16 | import pandas as pd 17 | 18 | 19 | if __name__ == '__main__': 20 | path = 'D:/dataset/word2vec/' 21 | train = pd.read_csv(path+'labeledTrainData.tsv', header=0, delimiter="\t", quoting=3) 22 | test = pd.read_csv(path+'testData.tsv', header=0, delimiter="\t", quoting=3 ) 23 | 24 | print('The first review is:') 25 | print((train["review"][0])) 26 | 27 | input("Press Enter to continue...") 28 | 29 | 30 | print('Download text data sets. If you already have NLTK datasets downloaded, just close the Python download window...') 31 | #nltk.download() # Download text data sets, including stop words 32 | 33 | # Initialize an empty list to hold the clean reviews 34 | clean_train_reviews = [] 35 | 36 | # Loop over each review; create an index i that goes from 0 to the length 37 | # of the movie review list 38 | 39 | print("Cleaning and parsing the training set movie reviews...\n") 40 | for i in range( 0, len(train["review"])): 41 | clean_train_reviews.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(train["review"][i], False))) 42 | 43 | # ****** Create a bag of words from the training set 44 | # 45 | print("Creating the bag of words...\n") 46 | 47 | 48 | # Initialize the "CountVectorizer" object, which is scikit-learn's 49 | # bag of words tool. 50 | vectorizer = CountVectorizer(analyzer = "word", 51 | tokenizer = None, 52 | preprocessor = None, 53 | stop_words = None, 54 | max_features = 5000) 55 | 56 | # fit_transform() does two functions: First, it fits the model 57 | # and learns the vocabulary; second, it transforms our training data 58 | # into feature vectors. The data to fit_transform should be a list of 59 | # strings. 
60 | train_data_features = vectorizer.fit_transform(clean_train_reviews) 61 | 62 | # Numpy arrays are easy to work with, so convert the result to an 63 | # array 64 | train_data_features = train_data_features.toarray() 65 | 66 | # ******* Train a random forest using the bag of words 67 | # 68 | print("Training the random forest (this may take a while)...") 69 | 70 | 71 | # Initialize a Random Forest classifier with 100 trees 72 | forest = RandomForestClassifier(n_estimators = 100) 73 | 74 | # Fit the forest to the training set, using the bag of words as 75 | # features and the sentiment labels as the response variable 76 | # 77 | # This may take a few minutes to run 78 | forest = forest.fit( train_data_features, train["sentiment"] ) 79 | 80 | 81 | 82 | # Create an empty list and append the clean reviews one by one 83 | clean_test_reviews = [] 84 | 85 | print("Cleaning and parsing the test set movie reviews...\n") 86 | for i in range(0,len(test["review"])): 87 | clean_test_reviews.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(test["review"][i], False))) 88 | 89 | # Get a bag of words for the test set, and convert to a numpy array 90 | test_data_features = vectorizer.transform(clean_test_reviews) 91 | test_data_features = test_data_features.toarray() 92 | 93 | # Use the random forest to make sentiment label predictions 94 | print("Predicting test labels...\n") 95 | result = forest.predict(test_data_features) 96 | 97 | # Copy the results to a pandas dataframe with an "id" column and 98 | # a "sentiment" column 99 | output = pd.DataFrame( data={"id":test["id"], "sentiment":result} ) 100 | 101 | # Use pandas to write the comma-separated output file 102 | output.to_csv('out/Bag_of_Words_model_RF.csv', index=False, quoting=3) 103 | print("Wrote results to Bag_of_Words_model_RF.csv") 104 | 105 | 106 | -------------------------------------------------------------------------------- /Kaggle-bag-of-words/KaggleWord2VecUtility.py: -------------------------------------------------------------------------------- 1 | import re 2 | import nltk 3 | 4 | import pandas as pd 5 | import numpy as np 6 | 7 | from bs4 import BeautifulSoup 8 | from nltk.corpus import stopwords 9 | 10 | 11 | class KaggleWord2VecUtility(object): 12 | """KaggleWord2VecUtility is a utility class for processing raw HTML text into segments for further learning""" 13 | 14 | @staticmethod 15 | def review_to_wordlist( review, remove_stopwords=False ): 16 | # Function to convert a document to a sequence of words, 17 | # optionally removing stop words. Returns a list of words. 18 | # 19 | # 1. Remove HTML 20 | review_text = BeautifulSoup(review, "lxml").get_text() 21 | # 22 | # 2. Remove non-letters 23 | review_text = re.sub("[^a-zA-Z]"," ", review_text) 24 | # 25 | # 3. Convert words to lower case and split them 26 | words = review_text.lower().split() 27 | # 28 | # 4. Optionally remove stop words (false by default) 29 | if remove_stopwords: 30 | stops = set(stopwords.words("english")) 31 | words = [w for w in words if not w in stops] 32 | # 33 | # 5. Return a list of words 34 | return(words) 35 | 36 | # Define a function to split a review into parsed sentences 37 | @staticmethod 38 | def review_to_sentences( review, tokenizer, remove_stopwords=False ): 39 | # Function to split a review into parsed sentences. Returns a 40 | # list of sentences, where each sentence is a list of words 41 | # 42 | # 1. 
Use the NLTK tokenizer to split the paragraph into sentences 43 | raw_sentences = tokenizer.tokenize(review.strip()) 44 | # 45 | # 2. Loop over each sentence 46 | sentences = [] 47 | for raw_sentence in raw_sentences: 48 | # If a sentence is empty, skip it 49 | if len(raw_sentence) > 0: 50 | # Otherwise, call review_to_wordlist to get a list of words 51 | sentences.append( KaggleWord2VecUtility.review_to_wordlist( raw_sentence, \ 52 | remove_stopwords )) 53 | # 54 | # Return the list of sentences (each sentence is a list of words, 55 | # so this returns a list of lists) 56 | return sentences -------------------------------------------------------------------------------- /Kaggle-bag-of-words/README.md: -------------------------------------------------------------------------------- 1 | ## Use Google's Word2Vec for movie reviews 2 | 3 | In this tutorial competition, we dig a little "deeper" into sentiment analysis. Google's Word2Vec is a deep-learning inspired method that focuses on the meaning of words. Word2Vec attempts to understand meaning and semantic relationships among words. It works in a way that is similar to deep approaches, such as recurrent neural nets or deep neural nets, but is computationally more efficient. This tutorial focuses on Word2Vec for sentiment analysis. 4 | 5 | Sentiment analysis is a challenging subject in machine learning. People express their emotions in language that is often obscured by sarcasm, ambiguity, and plays on words, all of which could be very misleading for both humans and computers. There's another Kaggle competition for movie review sentiment analysis. In this tutorial we explore how Word2Vec can be applied to a similar problem. 6 | 7 | Deep learning has been in the news a lot over the past few years, even making it to the front page of the New York Times. These machine learning techniques, inspired by the architecture of the human brain and made possible by recent advances in computing power, have been making waves via breakthrough results in image recognition, speech processing, and natural language tasks. Recently, deep learning approaches won several Kaggle competitions, including a drug discovery task, and cat and dog image recognition.
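
The competition notes further down this README describe the bag-of-words baseline that worked best here: TF-IDF on 1-3 grams, chi-square feature selection down to 200,000 features, and a linear SVC, evaluated by AUC. The sketch below illustrates that pipeline only; the data path, the 5-fold split, and the use of the current `sklearn.model_selection` API (the scripts in this folder use the older `sklearn.cross_validation` module) are assumptions for illustration, not the exact code behind the reported scores.

```python
# Minimal sketch of the TF-IDF + chi-square + LinearSVC baseline described in
# the notes below. Paths and parameter values are illustrative assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

train = pd.read_csv('labeledTrainData.tsv', header=0, delimiter="\t", quoting=3)

pipeline = Pipeline([
    # 1-3 gram TF-IDF features, as in the notes below
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 3),
                              sublinear_tf=True, min_df=2)),
    # keep the 200,000 n-grams with the highest chi-square score
    ('chi2', SelectKBest(chi2, k=200000)),
    ('svc', LinearSVC()),
])

# The 'roc_auc' scorer computes AUC from LinearSVC's decision_function
scores = cross_val_score(pipeline, train['review'], train['sentiment'],
                         cv=5, scoring='roc_auc')
print('5-fold AUC: %.4f' % scores.mean())
```

The same notes report that the final score came from averaging the probability outputs of several diverse models (simple average stacking), so a linear pipeline like this one is only the single-model starting point.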
8 | 9 | 10 | ## Data Set Description 11 | 12 | 13 | 14 | # Introduction to Word2Vec 15 | 16 | Word2vec is an efficient tool open-sourced by Google in mid-2013 for representing words as real-valued vectors. Drawing on ideas from deep learning, it reduces the processing of text content to vector operations in a K-dimensional vector space, and similarity in that vector space can be used to represent semantic similarity between texts. The word vectors produced by Word2vec can be used for many NLP tasks, such as clustering, finding synonyms, and part-of-speech analysis. Looked at another way, if words are treated as features, Word2vec maps those features into a K-dimensional vector space and can provide a deeper feature representation for text data. 17 | 18 | Word2vec uses the distributed representation of word vectors, first proposed by Hinton in 1986 [4]. The basic idea is to train a mapping from every word to a K-dimensional real-valued vector (K is usually a hyperparameter of the model) and to judge the semantic similarity of words by the distance between their vectors (for example, cosine similarity or Euclidean distance). The model is a three-layer neural network: input layer, hidden layer, and output layer. A core technique is frequency-based Huffman coding, which makes the hidden-layer activations of words with similar frequencies nearly identical; the more frequent a word is, the fewer hidden units it activates, which effectively lowers the computational complexity. A major reason for Word2vec's popularity is precisely this efficiency: Mikolov notes in [2] that an optimized single-machine version can train on over a hundred billion words in a day. 19 | 20 | The three-layer network itself models a language model, but as a by-product it also yields a representation of words in a vector space, and that by-product is the real goal of Word2vec. 21 | 22 | Compared with the classical pipelines of Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA), Word2vec exploits each word's context, so the semantic information it captures is richer. 23 | 24 | 25 | 26 | 27 | * [Word2Vec: deep representation models for text](http://wei-li.cnblogs.com/p/word2vec.html) 28 | * [Deep learning word2vec notes: fundamentals](http://blog.csdn.net/mytestmy/article/details/26961315) 29 | * [Deep learning word2vec notes: the algorithm](http://blog.csdn.net/mytestmy/article/details/26969149) 30 | * [A bag-of-words text classification tutorial on Kaggle data](http://www.csdn.net/article/1970-01-01/2825782) 31 | * [New approaches to sentiment analysis with Word2Vec/Doc2Vec/Python](http://datartisan.com/article/detail/48.html) 32 | http://nbviewer.jupyter.org/github/MatthieuBizien/Bag-popcorn/blob/master/Kaggle-Word2Vec.ipynb 33 | *************** 34 | 35 | This is a binary text sentiment classification problem: 25,000 labeled training samples with a single raw-text feature, "review". 36 | The evaluation metric is AUC, so the submission needs to contain probabilities; I fell into that trap at first and my score would not improve. 37 | The competition provides a tutorial on using word2vec for binary classification, which is good introductory material. 38 | I did not use word embeddings; I trained directly on BOW and n-gram features, which worked reasonably well. Fusing embedding features would be worth trying later. 39 | For the raw text I used TfidfVectorizer(stop_words='english', ngram_range=(1,3), sublinear_tf=True, min_df=2), 40 | followed by chi-square feature selection; after cross-validation I settled on 200,000 features. 41 | For single models I tried GBRT/NB/LR/linear SVC. 42 | GBRT usually struggles on high-dimensional sparse data, but it did not perform badly here. 43 | NB (MultinomialNB) was not as impressive as expected either. 44 | Ranked by score: linear SVC (0.95601) > LR (0.94823) > GBRT (0.94173) > NB (0.93693), so linear SVMs are still very strong on text. 45 | I then generated topic features with LDA; I had high hopes, but the best single-model AUC with them was only 0.93024. 46 | Topic features alone did not help, but fused with BOW they did work. 47 | Further experiments confirmed that linear SVC remained the best model after feature fusion, with 500 LDA topics and, notably, better results without removing stop words: AUC 0.95998. 48 | With no time left for more single-model work, the last resort was model ensembling; the guiding principle is to keep the models as diverse as possible rather than individually best. 49 | In the end I took the outputs of 5 reasonably good models and did average stacking, reaching AUC 0.96115 and 63rd place. 50 | On the private LB I dropped to 71st; fusing word embeddings would have been worth trying, but there was no time to dig further. 51 | 52 | 53 | 54 | 55 | http://cs.stanford.edu/~quocle/paragraph_vector.pdf 56 | * https://cs224d.stanford.edu/reports/SadeghianAmir.pdf 57 | * Uses simple TF-IDF features and a simple multinomial Bayes method for classification. 58 | http://nbviewer.ipython.org/github/jmsteinw/Notebooks/blob/master/NLP_Movies.ipynb -------------------------------------------------------------------------------- /Kaggle-bag-of-words/Word2Vec_AverageVectors.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Author: Angela Chapman 4 | # Date: 8/6/2014 5 | # 6 | # This file contains code to accompany the Kaggle tutorial 7 | # "Deep learning goes to the movies". The code in this file 8 | # is for Parts 2 and 3 of the tutorial, which cover how to 9 | # train a model using Word2Vec. 
10 | # 11 | # *************************************** # 12 | 13 | 14 | # ****** Read the two training sets and the test set 15 | # 16 | import pandas as pd 17 | import os 18 | from nltk.corpus import stopwords 19 | import nltk.data 20 | import logging 21 | import numpy as np # Make sure that numpy is imported 22 | from gensim.models import Word2Vec 23 | from sklearn.ensemble import RandomForestClassifier 24 | 25 | from Kaggle_bag_of_words.KaggleWord2VecUtility import KaggleWord2VecUtility 26 | 27 | 28 | # ****** Define functions to create average word vectors 29 | # 30 | 31 | def makeFeatureVec(words, model, num_features): 32 | # Function to average all of the word vectors in a given 33 | # paragraph 34 | # 35 | # Pre-initialize an empty numpy array (for speed) 36 | featureVec = np.zeros((num_features,),dtype="float32") 37 | # 38 | nwords = 0. 39 | # 40 | # Index2word is a list that contains the names of the words in 41 | # the model's vocabulary. Convert it to a set, for speed 42 | index2word_set = set(model.index2word) 43 | # 44 | # Loop over each word in the review and, if it is in the model's 45 | # vocaublary, add its feature vector to the total 46 | for word in words: 47 | if word in index2word_set: 48 | nwords = nwords + 1. 49 | featureVec = np.add(featureVec,model[word]) 50 | # 51 | # Divide the result by the number of words to get the average 52 | featureVec = np.divide(featureVec,nwords) 53 | return featureVec 54 | 55 | 56 | def getAvgFeatureVecs(reviews, model, num_features): 57 | # Given a set of reviews (each one a list of words), calculate 58 | # the average feature vector for each one and return a 2D numpy array 59 | # 60 | # Initialize a counter 61 | counter = 0. 62 | # 63 | # Preallocate a 2D numpy array, for speed 64 | reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32") 65 | # 66 | # Loop through the reviews 67 | for review in reviews: 68 | # 69 | # Print a status message every 1000th review 70 | if counter%1000. == 0.: 71 | print("Review %d of %d" % (counter, len(reviews))) 72 | # 73 | # Call the function (defined above) that makes average feature vectors 74 | reviewFeatureVecs[counter] = makeFeatureVec(review, model, \ 75 | num_features) 76 | # 77 | # Increment the counter 78 | counter = counter + 1. 
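# NOTE (added): "counter" is a float above; NumPy versions newer than the one this
# tutorial targeted reject float indices, so reviewFeatureVecs[counter] may need
# an integer counter (e.g. int(counter)) to run on current installations.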
79 | return reviewFeatureVecs 80 | 81 | 82 | def getCleanReviews(reviews): 83 | clean_reviews = [] 84 | for review in reviews["review"]: 85 | clean_reviews.append( KaggleWord2VecUtility.review_to_wordlist( review, remove_stopwords=True )) 86 | return clean_reviews 87 | 88 | 89 | 90 | if __name__ == '__main__': 91 | 92 | # Read data from files 93 | train = pd.read_csv( os.path.join(os.path.dirname(__file__), 'data', 'labeledTrainData.tsv'), header=0, delimiter="\t", quoting=3 ) 94 | test = pd.read_csv(os.path.join(os.path.dirname(__file__), 'data', 'testData.tsv'), header=0, delimiter="\t", quoting=3 ) 95 | unlabeled_train = pd.read_csv( os.path.join(os.path.dirname(__file__), 'data', "unlabeledTrainData.tsv"), header=0, delimiter="\t", quoting=3 ) 96 | 97 | # Verify the number of reviews that were read (100,000 in total) 98 | print("Read %d labeled train reviews, %d labeled test reviews, " \ 99 | "and %d unlabeled reviews\n" % (train["review"].size, 100 | test["review"].size, unlabeled_train["review"].size )) 101 | 102 | 103 | 104 | # Load the punkt tokenizer 105 | tokenizer = nltk.data.load('tokenizers/punkt/english.pickle') 106 | 107 | 108 | 109 | # ****** Split the labeled and unlabeled training sets into clean sentences 110 | # 111 | sentences = [] # Initialize an empty list of sentences 112 | 113 | print("Parsing sentences from training set") 114 | for review in train["review"]: 115 | sentences += KaggleWord2VecUtility.review_to_sentences(review, tokenizer) 116 | 117 | print("Parsing sentences from unlabeled set") 118 | for review in unlabeled_train["review"]: 119 | sentences += KaggleWord2VecUtility.review_to_sentences(review, tokenizer) 120 | 121 | # ****** Set parameters and train the word2vec model 122 | # 123 | # Import the built-in logging module and configure it so that Word2Vec 124 | # creates nice output messages 125 | logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\ 126 | level=logging.INFO) 127 | 128 | # Set values for various parameters 129 | num_features = 300 # Word vector dimensionality 130 | min_word_count = 40 # Minimum word count 131 | num_workers = 4 # Number of threads to run in parallel 132 | context = 10 # Context window size 133 | downsampling = 1e-3 # Downsample setting for frequent words 134 | 135 | # Initialize and train the model (this will take some time) 136 | print("Training Word2Vec model...") 137 | model = Word2Vec(sentences, workers=num_workers, \ 138 | size=num_features, min_count = min_word_count, \ 139 | window = context, sample = downsampling, seed=1) 140 | 141 | # If you don't plan to train the model any further, calling 142 | # init_sims will make the model much more memory-efficient. 143 | model.init_sims(replace=True) 144 | 145 | # It can be helpful to create a meaningful model name and 146 | # save the model for later use. 
You can load it later using Word2Vec.load() 147 | model_name = "300features_40minwords_10context" 148 | model.save(model_name) 149 | 150 | model.doesnt_match("man woman child kitchen".split()) 151 | model.doesnt_match("france england germany berlin".split()) 152 | model.doesnt_match("paris berlin london austria".split()) 153 | model.most_similar("man") 154 | model.most_similar("queen") 155 | model.most_similar("awful") 156 | 157 | 158 | 159 | # ****** Create average vectors for the training and test sets 160 | # 161 | print("Creating average feature vecs for training reviews") 162 | 163 | trainDataVecs = getAvgFeatureVecs( getCleanReviews(train), model, num_features ) 164 | 165 | print("Creating average feature vecs for test reviews") 166 | 167 | testDataVecs = getAvgFeatureVecs( getCleanReviews(test), model, num_features ) 168 | 169 | 170 | # ****** Fit a random forest to the training set, then make predictions 171 | # 172 | # Fit a random forest to the training data, using 100 trees 173 | forest = RandomForestClassifier( n_estimators = 100 ) 174 | 175 | print("Fitting a random forest to labeled training data...") 176 | forest = forest.fit( trainDataVecs, train["sentiment"] ) 177 | 178 | # Test & extract results 179 | result = forest.predict( testDataVecs ) 180 | 181 | # Write the test results 182 | output = pd.DataFrame( data={"id":test["id"], "sentiment":result} ) 183 | output.to_csv( "Word2Vec_AverageVectors.csv", index=False, quoting=3 ) 184 | print("Wrote Word2Vec_AverageVectors.csv") 185 | -------------------------------------------------------------------------------- /Kaggle-bag-of-words/Word2Vec_BagOfCentroids.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Author: Angela Chapman 4 | # Date: 8/6/2014 5 | # 6 | # This file contains code to accompany the Kaggle tutorial 7 | # "Deep learning goes to the movies". The code in this file 8 | # is for Part 2 of the tutorial and covers Bag of Centroids 9 | # for a Word2Vec model. This code assumes that you have already 10 | # run Word2Vec and saved a model called "300features_40minwords_10context" 11 | # 12 | # *************************************** # 13 | 14 | 15 | # Load a pre-trained model 16 | from gensim.models import Word2Vec 17 | from sklearn.cluster import KMeans 18 | import time 19 | import pandas as pd 20 | from sklearn.ensemble import RandomForestClassifier 21 | from bs4 import BeautifulSoup 22 | import re 23 | from nltk.corpus import stopwords 24 | import numpy as np 25 | import os 26 | from Kaggle_bag_of_words.KaggleWord2VecUtility import KaggleWord2VecUtility 27 | 28 | 29 | # Define a function to create bags of centroids 30 | # 31 | def create_bag_of_centroids( wordlist, word_centroid_map ): 32 | # 33 | # The number of clusters is equal to the highest cluster index 34 | # in the word / centroid map 35 | num_centroids = max( word_centroid_map.values() ) + 1 36 | # 37 | # Pre-allocate the bag of centroids vector (for speed) 38 | bag_of_centroids = np.zeros( num_centroids, dtype="float32" ) 39 | # 40 | # Loop over the words in the review. 
If the word is in the vocabulary, 41 | # find which cluster it belongs to, and increment that cluster count 42 | # by one 43 | for word in wordlist: 44 | if word in word_centroid_map: 45 | index = word_centroid_map[word] 46 | bag_of_centroids[index] += 1 47 | # 48 | # Return the "bag of centroids" 49 | return bag_of_centroids 50 | 51 | 52 | if __name__ == '__main__': 53 | 54 | model = Word2Vec.load("300features_40minwords_10context") 55 | 56 | 57 | # ****** Run k-means on the word vectors and print a few clusters 58 | # 59 | 60 | start = time.time() # Start time 61 | 62 | # Set "k" (num_clusters) to be 1/5th of the vocabulary size, or an 63 | # average of 5 words per cluster 64 | word_vectors = model.syn0 65 | num_clusters = word_vectors.shape[0] / 5 66 | 67 | # Initalize a k-means object and use it to extract centroids 68 | print("Running K means") 69 | kmeans_clustering = KMeans( n_clusters = num_clusters ) 70 | idx = kmeans_clustering.fit_predict( word_vectors ) 71 | 72 | # Get the end time and print how long the process took 73 | end = time.time() 74 | elapsed = end - start 75 | print(("Time taken for K Means clustering: ", elapsed, "seconds.")) 76 | 77 | 78 | # Create a Word / Index dictionary, mapping each vocabulary word to 79 | # a cluster number 80 | word_centroid_map = dict(list(zip( model.index2word, idx ))) 81 | 82 | # Print the first ten clusters 83 | for cluster in range(0,10): 84 | # 85 | # Print the cluster number 86 | print(("\nCluster %d" % cluster)) 87 | # 88 | # Find all of the words for that cluster number, and print them out 89 | words = [] 90 | for i in range(0,len(list(word_centroid_map.values()))): 91 | if( list(word_centroid_map.values())[i] == cluster ): 92 | words.append(list(word_centroid_map.keys())[i]) 93 | print(words) 94 | 95 | 96 | 97 | 98 | # Create clean_train_reviews and clean_test_reviews as we did before 99 | # 100 | 101 | # Read data from files 102 | train = pd.read_csv( os.path.join(os.path.dirname(__file__), 'data', 'labeledTrainData.tsv'), header=0, delimiter="\t", quoting=3 ) 103 | test = pd.read_csv(os.path.join(os.path.dirname(__file__), 'data', 'testData.tsv'), header=0, delimiter="\t", quoting=3 ) 104 | 105 | 106 | print("Cleaning training reviews") 107 | clean_train_reviews = [] 108 | for review in train["review"]: 109 | clean_train_reviews.append( KaggleWord2VecUtility.review_to_wordlist( review, \ 110 | remove_stopwords=True )) 111 | 112 | print("Cleaning test reviews") 113 | clean_test_reviews = [] 114 | for review in test["review"]: 115 | clean_test_reviews.append( KaggleWord2VecUtility.review_to_wordlist( review, \ 116 | remove_stopwords=True )) 117 | 118 | 119 | # ****** Create bags of centroids 120 | # 121 | # Pre-allocate an array for the training set bags of centroids (for speed) 122 | train_centroids = np.zeros( (train["review"].size, num_clusters), \ 123 | dtype="float32" ) 124 | 125 | # Transform the training set reviews into bags of centroids 126 | counter = 0 127 | for review in clean_train_reviews: 128 | train_centroids[counter] = create_bag_of_centroids( review, \ 129 | word_centroid_map ) 130 | counter += 1 131 | 132 | # Repeat for test reviews 133 | test_centroids = np.zeros(( test["review"].size, num_clusters), \ 134 | dtype="float32" ) 135 | 136 | counter = 0 137 | for review in clean_test_reviews: 138 | test_centroids[counter] = create_bag_of_centroids( review, \ 139 | word_centroid_map ) 140 | counter += 1 141 | 142 | 143 | # ****** Fit a random forest and extract predictions 144 | # 145 | forest = 
RandomForestClassifier(n_estimators = 100) 146 | 147 | # Fitting the forest may take a few minutes 148 | print("Fitting a random forest to labeled training data...") 149 | forest = forest.fit(train_centroids,train["sentiment"]) 150 | result = forest.predict(test_centroids) 151 | 152 | # Write the test results 153 | output = pd.DataFrame(data={"id":test["id"], "sentiment":result}) 154 | output.to_csv("BagOfCentroids.csv", index=False, quoting=3) 155 | print("Wrote BagOfCentroids.csv") 156 | -------------------------------------------------------------------------------- /Kaggle-bag-of-words/generate_d2v.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import logging 3 | import os.path 4 | import pandas as pd 5 | import numpy as np 6 | 7 | from KaggleWord2VecUtility import KaggleWord2VecUtility 8 | 9 | from gensim.models import Doc2Vec 10 | from gensim.models.doc2vec import LabeledSentence 11 | 12 | 13 | def getFeatureVecs(reviews, model, num_features): 14 | reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32") 15 | counter = -1 16 | 17 | for review in reviews: 18 | counter += 1 19 | try: 20 | reviewFeatureVecs[counter] = np.array(model[review.labels[0]]).reshape((1, num_features)) 21 | except: 22 | continue 23 | return reviewFeatureVecs 24 | 25 | 26 | def getCleanLabeledReviews(reviews): 27 | clean_reviews = [] 28 | for review in reviews["review"]: 29 | clean_reviews.append(KaggleWord2VecUtility.review_to_wordlist(review)) 30 | 31 | labelized = [] 32 | for i, id_label in enumerate(reviews["id"]): 33 | labelized.append(LabeledSentence(clean_reviews[i], [id_label])) 34 | return labelized 35 | 36 | 37 | 38 | if __name__ == '__main__': 39 | train = pd.read_csv('../data/labeledTrainData.tsv', header=0, delimiter="\t", quoting=3) 40 | test = pd.read_csv('../data/testData.tsv', header=0, delimiter="\t", quoting=3) 41 | unsup = pd.read_csv('../data/unlabeledTrainData.tsv', header=0, delimiter="\t", quoting=3 ) 42 | 43 | print "Cleaning and labeling all data sets...\n" 44 | 45 | train_reviews = getCleanLabeledReviews(train) 46 | test_reviews = getCleanLabeledReviews(test) 47 | unsup_reviews = getCleanLabeledReviews(unsup) 48 | 49 | n_dim =5000 50 | 51 | model_dm_name = "%dfeatures_1minwords_10context_dm" % n_dim 52 | model_dbow_name = "%dfeatures_1minwords_10context_dbow" % n_dim 53 | 54 | 55 | 56 | if not os.path.exists(model_dm_name) or not os.path.exists(model_dbow_name): 57 | logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\ 58 | level=logging.INFO) 59 | 60 | num_features = n_dim # Word vector dimensionality 61 | min_word_count = 1 # Minimum word count, if bigger, some sentences may be missing 62 | num_workers = 4 # Number of threads to run in parallel 63 | context = 10 # Context window size 64 | downsampling = 1e-3 # Downsample setting for frequent words 65 | 66 | print "Training Doc2Vec model..." 
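# Two Paragraph Vector variants are trained: model_dm keeps the default
# distributed-memory architecture (dm=1), while model_dbow sets dm=0 for the
# distributed bag-of-words architecture; predict.py later concatenates the
# vectors produced by both models.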
67 | model_dm = Doc2Vec(min_count=min_word_count, window=context, size=num_features, \ 68 | sample=downsampling, workers=num_workers) 69 | model_dbow = Doc2Vec(min_count=min_word_count, window=context, size=num_features, 70 | sample=downsampling, workers=num_workers, dm=0) 71 | 72 | all_reviews = np.concatenate((train_reviews, test_reviews, unsup_reviews)) 73 | model_dm.build_vocab(all_reviews) 74 | model_dbow.build_vocab(all_reviews) 75 | 76 | for epoch in range(10): 77 | perm = np.random.permutation(all_reviews.shape[0]) 78 | model_dm.train(all_reviews[perm]) 79 | model_dbow.train(all_reviews[perm]) 80 | 81 | model_dm.save(model_dm_name) 82 | model_dbow.save(model_dbow_name) 83 | -------------------------------------------------------------------------------- /Kaggle-bag-of-words/generate_w2v.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import pandas as pd 3 | import nltk.data 4 | import logging 5 | import os.path 6 | import numpy as np 7 | 8 | from KaggleWord2VecUtility import KaggleWord2VecUtility 9 | 10 | from gensim.models import Word2Vec 11 | 12 | 13 | def makeFeatureVec(words, model, num_features): 14 | featureVec = np.zeros((num_features,),dtype="float32") 15 | nwords = 0 16 | 17 | index2word_set = set(model.index2word) 18 | for word in words: 19 | if word in index2word_set: 20 | nwords = nwords + 1 21 | featureVec = np.add(featureVec,model[word]) 22 | 23 | if nwords != 0: 24 | featureVec /= nwords 25 | return featureVec 26 | 27 | 28 | def getAvgFeatureVecs(reviews, model, num_features): 29 | reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32") 30 | counter = 0 31 | 32 | for review in reviews: 33 | reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features) 34 | counter = counter + 1 35 | return reviewFeatureVecs 36 | 37 | 38 | def getCleanReviews(reviews): 39 | clean_reviews = [] 40 | for review in reviews["review"]: 41 | clean_reviews.append( KaggleWord2VecUtility.review_to_wordlist( review, remove_stopwords=True )) 42 | return clean_reviews 43 | 44 | 45 | if __name__ == '__main__': 46 | train = pd.read_csv('../data/labeledTrainData.tsv', header=0, delimiter="\t", quoting=3) 47 | #test = pd.read_csv('../data/testData.tsv', header=0, delimiter="\t", quoting=3) 48 | unsup = pd.read_csv('../data/unlabeledTrainData.tsv', header=0, delimiter="\t", quoting=3 ) 49 | 50 | n_dim = 5000 51 | 52 | model_name = "%dfeatures_40minwords_10context" % n_dim 53 | 54 | 55 | if not os.path.exists(model_name): 56 | tokenizer = nltk.data.load('tokenizers/punkt/english.pickle') 57 | 58 | sentences = [] 59 | 60 | print "Parsing sentences from training set" 61 | for review in train["review"]: 62 | sentences += KaggleWord2VecUtility.review_to_sentences(review, tokenizer) 63 | 64 | print "Parsing sentences from unlabeled set" 65 | for review in unsup["review"]: 66 | sentences += KaggleWord2VecUtility.review_to_sentences(review, tokenizer) 67 | 68 | 69 | logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\ 70 | level=logging.INFO) 71 | 72 | 73 | num_features = n_dim # Word vector dimensionality 74 | min_word_count = 5 # Minimum word count 75 | num_workers = 4 # Number of threads to run in parallel 76 | context = 10 # Context window size 77 | downsampling = 1e-3 # Downsample setting for frequent words 78 | 79 | print "Training Word2Vec model..." 
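# Note (added): n_dim is 5000 here, versus the 300 features used in
# Word2Vec_AverageVectors.py, so training is far slower and the saved model is
# much larger; since sg is not set, gensim's default CBOW architecture is used.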
80 | model = Word2Vec(sentences, workers=num_workers, \ 81 | size=num_features, min_count = min_word_count, \ 82 | window = context, sample = downsampling, seed=1) 83 | 84 | model.init_sims(replace=True) 85 | model.save(model_name) -------------------------------------------------------------------------------- /Kaggle-bag-of-words/nbsvm.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | from collections import Counter 4 | 5 | from KaggleWord2VecUtility import KaggleWord2VecUtility 6 | 7 | def tokenize(sentence, grams): 8 | words = KaggleWord2VecUtility.review_to_wordlist(sentence) 9 | tokens = [] 10 | for gram in grams: 11 | for i in range(len(words) - gram + 1): 12 | tokens += ["_*_".join(words[i:i+gram])] 13 | return tokens 14 | 15 | 16 | def build_dict(data, grams): 17 | dic = Counter() 18 | for token_list in data: 19 | dic.update(token_list) 20 | return dic 21 | 22 | 23 | def compute_ratio(poscounts, negcounts, alpha=1): 24 | alltokens = list(set(poscounts.keys() + negcounts.keys())) 25 | dic = dict((t, i) for i, t in enumerate(alltokens)) 26 | d = len(dic) 27 | 28 | print "Computing r...\n" 29 | 30 | p, q = np.ones(d) * alpha , np.ones(d) * alpha 31 | for t in alltokens: 32 | p[dic[t]] += poscounts[t] 33 | q[dic[t]] += negcounts[t] 34 | p /= abs(p).sum() 35 | q /= abs(q).sum() 36 | r = np.log(p/q) 37 | return dic, r 38 | 39 | 40 | def generate_svmlight_content(data, dic, r, grams): 41 | output = [] 42 | for _, row in data.iterrows(): 43 | tokens = tokenize(row['review'], grams) 44 | indexes = [] 45 | for t in tokens: 46 | try: 47 | indexes += [dic[t]] 48 | except KeyError: 49 | pass 50 | indexes = list(set(indexes)) 51 | indexes.sort() 52 | if 'sentiment' in row: 53 | line = [str(row['sentiment'])] 54 | else: 55 | line = ['0'] 56 | for i in indexes: 57 | line += ["%i:%f" % (i + 1, r[i])] 58 | output += [" ".join(line)] 59 | 60 | return "\n".join(output) 61 | 62 | 63 | def generate_svmlight_files(train, test, grams, outfn): 64 | ngram = [int(i) for i in grams] 65 | ptrain = [] 66 | ntrain = [] 67 | 68 | print "Parsing training data...\n" 69 | 70 | for _, row in train.iterrows(): 71 | if row['sentiment'] == 1: 72 | ptrain.append(tokenize(row['review'], ngram)) 73 | elif row['sentiment'] == 0: 74 | ntrain.append(tokenize(row['review'], ngram)) 75 | 76 | pos_counts = build_dict(ptrain, ngram) 77 | neg_counts = build_dict(ntrain, ngram) 78 | 79 | dic, r = compute_ratio(pos_counts, neg_counts) 80 | 81 | f = open(outfn + '-train.txt', "w") 82 | f.writelines(generate_svmlight_content(train, dic, r, ngram)) 83 | f.close() 84 | 85 | print "Parsing test data...\n" 86 | 87 | f = open(outfn + '-test.txt', "w") 88 | f.writelines(generate_svmlight_content(test, dic, r, ngram)) 89 | f.close() 90 | 91 | print "SVMlight files have been generated!" 
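The script above implements the feature side of an NBSVM-style model (in the spirit of Wang & Manning's "Baselines and Bigrams", 2012): compute_ratio builds a smoothed log-count ratio r = log(p/q) from the L1-normalised positive and negative n-gram counts, and each review is written to an SVMlight-format file with its n-grams weighted by the corresponding entries of r. Below is a minimal sketch of how those files can then be consumed; predict.py further down does essentially this with LogisticRegression, and the paths are the ones hard-coded there:

```python
from sklearn.datasets import load_svmlight_files
from sklearn.linear_model import LogisticRegression

# Load the train/test files written by generate_svmlight_files(...).
X_train, y_train, X_test, _ = load_svmlight_files(
    ("../data/nbsvm-train.txt", "../data/nbsvm-test.txt"))

# Any linear classifier can sit on top of the NB-weighted features;
# predict.py fits logistic regression and keeps predict_proba for the AUC metric.
clf = LogisticRegression()
clf.fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]
```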
92 | 93 | 94 | -------------------------------------------------------------------------------- /Kaggle-bag-of-words/predict.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import pandas as pd 3 | import numpy as np 4 | 5 | from sklearn.linear_model import LogisticRegression 6 | from sklearn.metrics import roc_auc_score 7 | from sklearn.preprocessing import scale 8 | from sklearn.feature_extraction.text import TfidfVectorizer 9 | from sklearn.datasets import load_svmlight_files 10 | from scipy.sparse import hstack 11 | 12 | from gensim.models import Doc2Vec, Word2Vec 13 | from gensim.models.doc2vec import LabeledSentence 14 | 15 | from nbsvm import generate_svmlight_files 16 | 17 | from KaggleWord2VecUtility import KaggleWord2VecUtility 18 | 19 | 20 | def makeFeatureVec(words, model, num_features): 21 | featureVec = np.zeros((num_features,),dtype="float32") 22 | nwords = 0 23 | 24 | index2word_set = set(model.index2word) 25 | for word in words: 26 | if word in index2word_set: 27 | nwords = nwords + 1 28 | featureVec = np.add(featureVec,model[word]) 29 | 30 | if nwords != 0: 31 | featureVec /= nwords 32 | return featureVec 33 | 34 | 35 | def getAvgFeatureVecs(reviews, model, num_features): 36 | counter = 0 37 | 38 | reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32") 39 | 40 | for review in reviews: 41 | reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features) 42 | counter = counter + 1 43 | return reviewFeatureVecs 44 | 45 | 46 | def getCleanReviews(reviews): 47 | clean_reviews = [] 48 | for review in reviews["review"]: 49 | clean_reviews.append(KaggleWord2VecUtility.review_to_wordlist(review, True)) 50 | return clean_reviews 51 | 52 | 53 | def getFeatureVecs(reviews, model, num_features): 54 | reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32") 55 | counter = -1 56 | 57 | for review in reviews: 58 | counter = counter + 1 59 | try: 60 | reviewFeatureVecs[counter] = np.array(model[review.labels[0]]).reshape((1, num_features)) 61 | except: 62 | continue 63 | return reviewFeatureVecs 64 | 65 | 66 | def getCleanLabeledReviews(reviews): 67 | clean_reviews = [] 68 | for review in reviews["review"]: 69 | clean_reviews.append(KaggleWord2VecUtility.review_to_wordlist(review, True)) 70 | 71 | labelized = [] 72 | for i, id_label in enumerate(reviews["id"]): 73 | labelized.append(LabeledSentence(clean_reviews[i], [id_label])) 74 | return labelized 75 | 76 | 77 | if __name__ == '__main__': 78 | train = pd.read_csv('../data/labeledTrainData.tsv', header=0, delimiter="\t", quoting=3) 79 | test = pd.read_csv('../data/testData.tsv', header=0, delimiter="\t", quoting=3 ) 80 | 81 | print "Cleaning and parsing the data sets...\n" 82 | 83 | clean_train_reviews = [] 84 | for review in train['review']: 85 | clean_train_reviews.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(review))) 86 | 87 | clean_test_reviews = [] 88 | for review in test['review']: 89 | clean_test_reviews.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(review))) 90 | 91 | print "Creating the bag of words...\n" 92 | 93 | vectorizer = TfidfVectorizer(max_features=50000, ngram_range=(1,3), sublinear_tf=True) 94 | 95 | X_train_bow = vectorizer.fit_transform(clean_train_reviews) 96 | X_test_bow = vectorizer.transform(clean_test_reviews) 97 | 98 | 99 | print "Cleaning and labeling the data sets...\n" 100 | 101 | train_reviews = getCleanLabeledReviews(train) 102 | test_reviews = getCleanLabeledReviews(test) 
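# The reviews are wrapped in LabeledSentence objects keyed by the "id" column so
# that the pre-trained Doc2Vec models loaded below can be queried per review
# (getFeatureVecs looks up model[review.labels[0]]).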
103 | 104 | n_dim = 5000 105 | 106 | print 'Loading doc2vec model..\n' 107 | 108 | model_dm_name = "../data/%dfeatures_1minwords_10context_dm" % n_dim 109 | model_dbow_name = "../data/%dfeatures_1minwords_10context_dbow" % n_dim 110 | 111 | model_dm = Doc2Vec.load(model_dm_name) 112 | model_dbow = Doc2Vec.load(model_dbow_name) 113 | 114 | print "Creating the d2v vectors...\n" 115 | 116 | X_train_d2v_dm = getFeatureVecs(train_reviews, model_dm, n_dim) 117 | X_train_d2v_dbow = getFeatureVecs(train_reviews, model_dbow, n_dim) 118 | X_train_d2v = np.hstack((X_train_d2v_dm, X_train_d2v_dbow)) 119 | 120 | X_test_d2v_dm = getFeatureVecs(test_reviews, model_dm, n_dim) 121 | X_test_d2v_dbow = getFeatureVecs(test_reviews, model_dbow, n_dim) 122 | X_test_d2v = np.hstack((X_test_d2v_dm, X_test_d2v_dbow)) 123 | 124 | 125 | print 'Loading word2vec model..\n' 126 | 127 | model_name = "../data/%dfeatures_40minwords_10context" % n_dim 128 | 129 | model = Word2Vec.load(model_name) 130 | 131 | print "Creating the w2v vectors...\n" 132 | 133 | X_train_w2v = scale(getAvgFeatureVecs(getCleanReviews(train), model, n_dim)) 134 | X_test_w2v = scale(getAvgFeatureVecs(getCleanReviews(test), model, n_dim)) 135 | 136 | print "Generating the svmlight-format files...\n" 137 | 138 | generate_svmlight_files(train, test, '123', '../data/nbsvm') 139 | 140 | print "Creating the nbsvm...\n" 141 | 142 | files = ("../data/nbsvm-train.txt", "../data/nbsvm-test.txt") 143 | 144 | X_train_nbsvm, _, X_test_nbsvm, _ = load_svmlight_files(files) 145 | 146 | print "Combing the bag of words and the w2v vectors...\n" 147 | 148 | X_train_bwv = hstack([X_train_bow, X_train_w2v]) 149 | X_test_bwv = hstack([X_test_bow, X_test_w2v]) 150 | 151 | 152 | print "Combing the bag of words and the d2v vectors...\n" 153 | 154 | X_train_bdv = hstack([X_train_bow, X_train_d2v]) 155 | X_test_bdv = hstack([X_test_bow, X_test_d2v]) 156 | 157 | 158 | print "Checking the dimension of training vectors" 159 | 160 | print 'BoW', X_train_bow.shape 161 | print 'W2V', X_train_w2v.shape 162 | print 'D2V', X_train_d2v.shape 163 | print 'NBSVM', X_train_nbsvm.shape 164 | print 'BoW-W2V', X_train_bwv.shape 165 | print 'BoW-D2V', X_train_bdv.shape 166 | print '' 167 | 168 | y_train = train['sentiment'] 169 | 170 | 171 | print "Predicting with Bag-of-words model...\n" 172 | 173 | clf = LogisticRegression(class_weight="auto") 174 | 175 | clf.fit(X_train_bow, y_train) 176 | y_prob_bow = clf.predict_proba(X_test_bow) 177 | 178 | print "Predicting with NBSVM...\n" 179 | 180 | clf.fit(X_train_nbsvm, y_train) 181 | y_prob_nbsvm = clf.predict_proba(X_test_nbsvm) 182 | 183 | 184 | print "Predicting with Bag-of-words model and Word2Vec model...\n" 185 | 186 | clf.fit(X_train_bwv, y_train) 187 | y_prob_bwv = clf.predict_proba(X_test_bwv) 188 | 189 | 190 | print "Predicting with Bag-of-words model and Doc2Vec model...\n" 191 | 192 | clf.fit(X_train_bdv, y_train) 193 | y_prob_bdv = clf.predict_proba(X_test_bdv) 194 | 195 | 196 | print "\nWeighted Average: BOW/BOW-W2V/BOW-D2V/NBSVM\n" 197 | 198 | alpha = 0.081633 199 | beta = 0.265306 200 | theta = 0.551020 201 | 202 | y_pred = alpha*y_prob_bow + (1-alpha-beta-theta)*y_prob_bwv + beta*y_prob_bdv + theta*y_prob_nbsvm 203 | 204 | output = pd.DataFrame(data={"id":test["id"], "sentiment":y_pred[:,1]}) 205 | output.to_csv('BoW008_W2V5000_D2V10000_NBSVM055_model.csv', index=False, quoting=3) 206 | 207 | print "Wrote results to BoW008_W2V5000_D2V10000_NBSVM055_model.csv" 208 | 209 | 210 | print "\nMax-Min (Average)\n" 211 | y_mean = 
(y_prob_bow + y_prob_bwv + y_prob_bdv + y_prob_nbsvm)/4 212 | y_score_mean = [] 213 | 214 | i = 0 215 | for row in y_mean: 216 | if row[1] > 0.5: 217 | val = max(y_prob_bow[i,1],y_prob_bwv[i,1],y_prob_bdv[i,1],y_prob_nbsvm[i,1]) 218 | y_score_mean.append(val) 219 | elif row[1] < 0.5: 220 | val = min(y_prob_bow[i,1],y_prob_bwv[i,1],y_prob_bdv[i,1],y_prob_nbsvm[i,1]) 221 | y_score_mean.append(val) 222 | else: 223 | y_score_mean.append(y_pred[i,1]) 224 | i += 1 225 | 226 | 227 | print "\nMax-Min (Weighted Average)\n" 228 | y_score_best = [] 229 | 230 | i = 0 231 | for row in y_pred: 232 | if row[1] > 0.5: 233 | val = max(y_prob_bow[i,1],y_prob_bwv[i,1],y_prob_bdv[i,1],y_prob_nbsvm[i,1]) 234 | y_score_best.append(val) 235 | elif row[1] < 0.5: 236 | val = min(y_prob_bow[i,1],y_prob_bwv[i,1],y_prob_bdv[i,1],y_prob_nbsvm[i,1]) 237 | y_score_best.append(val) 238 | else: 239 | y_score_best.append(y_pred[i,1]) 240 | i += 1 241 | 242 | 243 | print "\nFinal Ensemble\n" 244 | y_wa = np.array([row[1] for row in y_pred]) 245 | y_am = np.array(y_score_mean) 246 | y_wam = np.array(y_score_best) 247 | 248 | alpha1 = 0.591837 249 | alpha2 = 0.387755 250 | y_final = alpha1*y_wa + (1-alpha1-alpha2)*y_am + alpha2*y_wam 251 | 252 | output = pd.DataFrame(data={"id":test["id"], "sentiment":y_final}) 253 | output.to_csv('WeightedAverage059_MaxMinAverage_MaxMinWeightedAverage039_model.csv', index=False, quoting=3) 254 | 255 | print "Wrote results to WeightedAverage059_MaxMinAverage_MaxMinWeightedAverage039_model.csv" 256 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/.gitignore: -------------------------------------------------------------------------------- 1 | *.py[co] 2 | 3 | # Packages 4 | *.egg 5 | *.egg-info 6 | dist 7 | build 8 | eggs 9 | parts 10 | bin 11 | var 12 | sdist 13 | develop-eggs 14 | .installed.cfg 15 | 16 | # Installer logs 17 | pip-log.txt 18 | 19 | # Unit test / coverage reports 20 | .coverage 21 | .tox 22 | 23 | #Translations 24 | *.mo 25 | 26 | #Mr Developer 27 | .mr.developer.cfg 28 | 29 | #All big data files 30 | ../data/train.csv 31 | ../data/test.csv 32 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/Digit Recognizer.md: -------------------------------------------------------------------------------- 1 | # 1. Kaggle Digit Recognizer 2 | 此任务是在MNIST(一个带Label的数字像素集合)上训练一个数字分类器,训练集的大小为42000个training example, 3 | 每个example是28*28=784个灰度像素值和一个0~9的label。最后的排名以在测试集上的分类正确率为依据排名。 4 | ### 数据集格式 5 | 一张手写数字图片由28*28=784个像素组成,每一个像素的取值范围[0,255]。 6 | **训练集train.csv** 7 | 每一行由[label,pixel0~pixel783],label代表这张图是什么数字。 8 | **测试集test.csv** 9 | 中没有label这一列,label是需要预测的。 10 | **提交结果文件name.csv** 11 | 列名 [ImageId,Label],ImageId对应测试集中的每一行。 12 | 13 | 14 | [TensorFlow softmax regression & deep NN](https://www.kaggle.com/kakauandme/digit-recognizer/tensorflow-softmax-regression-deep-nn/notebook) -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/data/readme.txt: -------------------------------------------------------------------------------- 1 | Training dataset (73.22Mb): 2 | http://www.kaggle.com/c/digit-recognizer/download/train.csv 3 | 4 | Testing dataset (48.75Mb): 5 | http://www.kaggle.com/c/digit-recognizer/download/test.csv 6 | 7 | 28x28 unrolled to 784 features. 
In training first column is label/target class -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/experiment1-rf-1000.py: -------------------------------------------------------------------------------- 1 | from sklearn.ensemble import RandomForestClassifier 2 | from numpy import genfromtxt, savetxt 3 | 4 | CPU = 1 5 | 6 | 7 | def main(): 8 | print("Reading training set") 9 | dataset = genfromtxt(open('../data/train.csv', 'r'), delimiter=',', dtype='int64')[1:] 10 | target = [x[0] for x in dataset] 11 | train = [x[1:] for x in dataset] 12 | print("Reading test set") 13 | test = genfromtxt(open('../data/test.csv', 'r'), delimiter=',', dtype='int64')[1:] 14 | 15 | #create and train the random forest 16 | rf = RandomForestClassifier(n_estimators=1000, n_jobs=CPU) 17 | print("Fitting RF classifier") 18 | rf.fit(train, target) 19 | 20 | print("Predicting test set") 21 | savetxt('submission-version-1.csv', rf.predict(test), delimiter=',', fmt='%d') 22 | 23 | if __name__ == "__main__": 24 | main() 25 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/knn_by_myself.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | ''' 5 | @filename: knn_by_myself.py 6 | @author: yew1eb 7 | @site: http://blog.yew1eb.net 8 | @contact: yew1eb@gmail.com 9 | @time: 2015/12/24 下午 9:59 10 | 11 | https://www.kaggle.com/c/digit-recognizer/ 12 | score: 0.96300 13 | pred test 14 | time cost: 12311.181267s 15 | os: windows 10 16 | CPU: AMD A6-4400M APU with Radeon(tm) Graphics 2.70GHz 17 | RAM: 6GB 18 | ''' 19 | import numpy as np 20 | import time 21 | 22 | 23 | def load_data(): 24 | train_data = np.loadtxt('d:/dataset/digits/train.csv', dtype=np.uint8, delimiter=',', skiprows=1) 25 | test_data = np.loadtxt('d:/dataset/digits/test.csv', dtype=np.uint8, delimiter=',', skiprows=1) 26 | label = np.ravel(train_data[:, :1]) # 多维转一维 扁平化 27 | data = np.where(train_data[:, 1:] != 0, 1, 0) # 数据归一化 28 | test = np.where(test_data != 0, 1, 0) 29 | return data, label, test 30 | 31 | 32 | def test_knn(train_data, train_label, test_data, test_label): 33 | start = time.clock() 34 | error = 0 35 | m = len(test_data) 36 | labels = [] 37 | for i in range(m): 38 | calc_label = classify(test_data[i], train_data, train_label, 3) 39 | labels.append(calc_label) 40 | error = error + (calc_label != test_label[i]) 41 | 42 | print(('error: ', error)) 43 | print(('error percent: %f' % (float(error) / m))) 44 | print(('time cost: %f s' % (time.clock() - start))) 45 | 46 | 47 | def save2csv(labels, csv_name): 48 | f = open('d:/dataset/digits/' + csv_name, 'w') 49 | f.write('ImageId,Label\n') 50 | for i in range(1, len(labels)+1): 51 | f.write(str(i)+','+str(labels[i])) 52 | f.write("\n") 53 | f.close() 54 | 55 | 56 | def knn_pred(train_data, train_label, test_data): 57 | start = time.clock() 58 | m = len(test_data) 59 | labels = [] 60 | for i in range(m): 61 | calc_label = classify(test_data[i], train_data, train_label, 3) 62 | labels.append(calc_label) 63 | save2csv(labels, 'knn_result.csv') 64 | print(('time cost: %f s' % (time.clock() - start))) 65 | 66 | 67 | def classify(inx, train_data, train_label, k): 68 | sz = train_data.shape[0] 69 | inx_temp = np.tile(inx, (sz, 1)) - train_data 70 | sq_inx_temp = inx_temp ** 2 71 | sq_distance = sq_inx_temp.sum(axis=1) 72 | distance = sq_distance ** 0.5 73 | sort_dist = distance.argsort() 74 | class_set = {} 
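# Majority vote (added note): count the labels of the k nearest training samples
# and return the label that occurs most often.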
75 | for i in range(k): 76 | label = train_label[sort_dist[i]] 77 | class_set[label] = class_set.get(label, 0) + 1 78 | sorted_class_set = sorted(list(class_set.items()), key=lambda d: d[1], reverse=True) # 按字典中的从大到小排序 79 | # python2.7 -> python3.5 : itertimes() -> items() 80 | return sorted_class_set[0][0] 81 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/naive_bayes_by_myself.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | ''' 5 | @filename: naive_bayes_by_myself.py 6 | @author: yew1eb 7 | @site: http://blog.yew1eb.net 8 | @contact: yew1eb@gmail.com 9 | @time: 2015/12/25 1:51 10 | ''' 11 | 12 | import numpy as np 13 | import time 14 | 15 | def csv2vector(file_name): 16 | pass 17 | 18 | def savefile(labels, file_name): 19 | pass 20 | 21 | 22 | def trainNB0(trainMatrix,trainclass): 23 | numpics = len(trainMatrix) #record numbers 24 | numpix = len(trainMatrix[0])#pix numbers 25 | pDic={} 26 | for v in trainclass: 27 | pDic[v] = pDic.get(v,0)+1 28 | for k,v in list(pDic.items()): 29 | pDic[k]=v/float(numpics)#p of every class 30 | pnumdic={} 31 | psumdic={} 32 | for k in list(pDic.keys()): 33 | pnumdic[k]=np.ones(numpix) 34 | for i in range(numpics): 35 | pnumdic[trainclass[i]] += trainMatrix[i] 36 | psumdic[trainclass[i]] = psumdic.get(trainclass[i],2) + sum(trainMatrix[i]) 37 | pvecdic={} 38 | for k in list(pnumdic.keys()): 39 | pvecdic[k]=np.log(pnumdic[k]/float(psumdic[k])) 40 | return pvecdic,pDic 41 | 42 | def classifyNB(vec2class,pvecdic,pDic): 43 | presult={} 44 | for k in list(pDic.keys()): 45 | presult[k]=sum(vec2class*pvecdic[k])+np.log(pDic[k]) 46 | tmp=float("-inf") 47 | result="" 48 | for k in list(presult.keys()): 49 | if presult[k]>tmp: 50 | tmp= presult[k] 51 | result=k 52 | return result 53 | 54 | def testNB(): 55 | print("load train data...") 56 | trainSet, trainlabel=csv2vector("train.csv",1) 57 | print("load test data...") 58 | testSet,testlabel = csv2vector("test.csv") 59 | print("start train...") 60 | pvecdic,pDic=trainNB0(trainSet, trainlabel) 61 | start = time.clock() 62 | print("start test...") 63 | result="ImageId,Label\n" 64 | for i in range(len(testSet)): 65 | tmp = classifyNB(testSet[i],pvecdic,pDic) 66 | result += str(i+1)+","+tmp+"\n" 67 | #print tmp 68 | savefile(result,"result_NB.csv") 69 | end = time.clock() 70 | print(("time cost: %f s" % (end - start))) 71 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/nn/README.md: -------------------------------------------------------------------------------- 1 | DigitRecognizer 2 | =============== 3 | 4 | A hand-written-digit recognition program using a feed forward neural network. This program is purposed for a predictive algorithm competition on Kaggle. 5 | 6 | This project also contains a module, PyNeural, which is meant to be used as a neural network system for machine learning. 
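For orientation, here is a minimal usage sketch of the PyNeural API defined in src/PyNeural/PyNeural.py; the layer sizes, learning rate and toy data are illustrative placeholders (DigitRecognizer.py uses layers of [number of features, 121, 10] on the real MNIST data), and the import assumes the script is run from nn/src:

```python
from PyNeural.PyNeural import NeuralNetwork

# Toy data: each input is a list of features, each output a class index.
train_x = [[0.0, 0.1, 0.9], [0.8, 0.7, 0.1], [0.1, 0.0, 0.8], [0.9, 0.9, 0.2]]
train_y = [0, 1, 0, 1]

# 3 inputs -> 5 hidden units -> 2 output classes, learning rate 0.04.
network = NeuralNetwork([3, 5, 2], 0.04)

# train() expects a held-out set and prints its accuracy after every epoch.
network.train(train_x[:2], train_y[:2],
              test_inputs=train_x[2:], test_outputs=train_y[2:],
              epoch_cap=15, error_goal=0.0, dropconnect_chance=0.05)

print(network.predict(train_x[0]))  # predicted class index for one sample
```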
-------------------------------------------------------------------------------- /Kaggle-digit-recognizer/nn/src/DigitRecognizer.py: -------------------------------------------------------------------------------- 1 | __author__ = 'Marco Giancarli, m.a.giancarli@gmail.com' 2 | 3 | 4 | import numpy as np 5 | from math import pi as PI 6 | from csv import reader 7 | from csv import writer 8 | from csv import QUOTE_NONE 9 | from skimage import filters 10 | from skimage import measure 11 | from skimage import transform 12 | from skimage import io 13 | from skimage import viewer 14 | from PyNeural.PyNeural import NeuralNetwork 15 | 16 | 17 | """ 18 | Return a list of all possible translations of the image that don't cut off any 19 | part of the number. 20 | """ 21 | def all_translations(x): 22 | translations = [x] 23 | # image = np.array([x]) 24 | # image.resize((28, 28)) 25 | # if sum(image[0, :]) == 0: 26 | # image1 = np.vstack([image[1:, :], image[0:1, :]]) 27 | # translations.append(image1.reshape((1, -1)).tolist()[0]) 28 | # if sum(image[:, 0]) == 0: 29 | # image2 = np.hstack([image[:, 1:], image[:, 0:1]]) 30 | # translations.append(image2.reshape((1, -1)).tolist()[0]) 31 | # if sum(image[-1, :]) == 0: 32 | # image3 = np.vstack([image[-1:, :], image[:-1, :]]) 33 | # translations.append(image3.reshape((1, -1)).tolist()[0]) 34 | # if sum(image[:, -1]) == 0: 35 | # image4 = np.hstack([image[:, -1:], image[:, :-1]]) 36 | # translations.append(image4.reshape((1, -1)).tolist()[0]) 37 | 38 | return translations 39 | 40 | """ 41 | Return new features, given an initial list x of raw features. 42 | """ 43 | def get_features(x): 44 | features = [] 45 | image = np.array([x]) 46 | image.resize((28, 28)) 47 | binary_image = filters.threshold_adaptive(image, 9) 48 | 49 | angles = np.linspace(0, 1, 8) * PI 50 | h, _, _ = transform.hough_line(filters.sobel(binary_image), theta=angles) 51 | h_sum = [ 52 | [sum(row[start:start+5]) for start in range(0, 75, 5)] 53 | for row in zip(*h) 54 | ] 55 | features.extend(np.array(h_sum).reshape(1, -1).tolist()[0]) 56 | 57 | # moments = measure.moments(binary_image) 58 | # hu_moments = measure.moments_hu(moments) 59 | # # reshape: -1 as a dimension size makes the dimension implicit 60 | # features.extend(moments.reshape((1, -1)).tolist()[0]) 61 | # features.extend(hu_moments.reshape((1, -1)).tolist()[0]) 62 | 63 | # h_line, _, _ = transform.hough_line(binary_image) 64 | # features.extend(np.array(h_line).reshape((1, -1)).tolist()[0]) 65 | 66 | return features 67 | 68 | def main(): 69 | print('Loading training set...') 70 | 71 | training_x_raw = [] 72 | training_y_raw = [] 73 | training_x = [] 74 | training_y = [] 75 | samples = 0 76 | m = 0 77 | 78 | with open('res/datasets/train.csv', ) as training_file: 79 | training_data = reader(training_file, delimiter=',') 80 | skipped_titles = False 81 | for line in training_data: 82 | if not skipped_titles: 83 | skipped_titles = True 84 | continue 85 | fields = list(line) 86 | training_y_raw = fields[0] 87 | training_x_raw = fields[1:] 88 | # remove the labels 89 | training_y.append(int(training_y_raw)) 90 | 91 | for features in all_translations([int(v) for v in training_x_raw]): 92 | training_x.append(features + get_features(features)) 93 | m += 1 94 | 95 | samples += 1 96 | if any([samples % 1000 == 0, 97 | samples % 100 == 0 and samples < 2000, 98 | samples % 10 == 0 and samples < 200]): 99 | print(samples, 'samples loaded.', m, 'generated samples.') 100 | print('Done.', m, 'total samples.') 101 | 102 | x_array = 
np.array(training_x) 103 | # normalize the training set 104 | training_x = ((x_array - np.average(x_array)) / np.std(x_array)).tolist() 105 | 106 | layer_sizes = [x_array.shape[1], 121, 10] 107 | alpha = 0.04 108 | test_size = m / 4 # 4 fold testing 109 | 110 | print('Training set loaded. Samples:', len(training_x)) 111 | print('Training network (layers: ' + \ 112 | ' -> '.join(map(str, layer_sizes)) + ')...') 113 | 114 | network = NeuralNetwork(layer_sizes, alpha) 115 | 116 | network.train( 117 | training_x[:-test_size], 118 | training_y[:-test_size], 119 | test_inputs=training_x[-test_size:], 120 | test_outputs=training_y[-test_size:], 121 | epoch_cap=15, 122 | error_goal=0.00, 123 | dropconnect_chance=0.05 124 | ) 125 | 126 | print('Network trained.') 127 | 128 | num_correct = 0 129 | num_tests = 0 130 | for x, y in zip(training_x[-2000:], training_y[-2000:]): 131 | prediction = network.predict(x) 132 | num_tests += 1 133 | if int(prediction) == y: 134 | num_correct += 1 135 | print(str(num_correct), '/', str(num_tests)) 136 | 137 | # clear junk 138 | network.momentum = None 139 | network.dropconnect_matrices = None 140 | training_x = None 141 | training_y = None 142 | training_data = None 143 | training_x_raw = None 144 | training_y_raw = None 145 | 146 | print('Loading test data...') 147 | 148 | test_x_raw = [] 149 | test_x = [] 150 | test_y = [] 151 | 152 | output_file_name = 'gen/nn_benchmark5.csv' 153 | 154 | with open(output_file_name, 'wb') as output_file: 155 | w = writer(output_file, delimiter=',', quoting=QUOTE_NONE) 156 | w.writerow(['ImageId','Label']) 157 | 158 | with open('res/datasets/test.csv', ) as test_file: 159 | test_data = reader(test_file, delimiter=',') 160 | skipped_titles = False 161 | num_predictions = 0 162 | for line in test_data: 163 | if not skipped_titles: 164 | skipped_titles = True 165 | continue 166 | fields = list(line) 167 | test_x_raw = fields 168 | # remove the damn labels 169 | features = [int(val) for val in test_x_raw] 170 | features.extend(get_features(features)) 171 | test_x.append(features) 172 | num_predictions += 1 173 | if num_predictions % 100 == 0: 174 | x_array = np.array(test_x) 175 | # normalize the test set 176 | test_x = ( 177 | (x_array - np.average(x_array)) / np.std(x_array) 178 | ).tolist() 179 | for i in range(100): 180 | w.writerow([num_predictions-99+i, 181 | network.predict(test_x[i])]) 182 | test_x = [] 183 | x_array = [] 184 | 185 | 186 | print('Predicted labels and stored as "' + output_file_name + '".') 187 | 188 | if __name__ == '__main__': 189 | main() -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/nn/src/PyNeural/PyNeural.py: -------------------------------------------------------------------------------- 1 | __author__ = 'MarcoGiancarli, m.a.giancarli@gmail.com' 2 | 3 | 4 | import math 5 | import numpy as np 6 | 7 | 8 | # Use tanh instead of normal sigmoid because it's faster. Emulates sigmoid. 
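# Mathematically (tanh(x) + 1) / 2 equals 1 / (1 + exp(-2x)), i.e. the logistic
# sigmoid evaluated at 2x, so it has the same range with a steeper slope.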
9 | def sigmoid(x): 10 | return (np.tanh(x) + 1) / 2 11 | 12 | 13 | # derivative of our sigmoid function 14 | def d_sigmoid(x): 15 | return (np.tanh(x)+1) * (1-np.tanh(x)) / 4 16 | 17 | 18 | def output_vector_to_scalar(vector): 19 | # get the index of the max in the vector 20 | m,i = max((v,i) for i,v in enumerate(vector.tolist())) 21 | return i 22 | 23 | 24 | def output_scalar_to_vector(scalar, num_outputs): 25 | # same size as outputs, all 0s 26 | vector = [0] * num_outputs 27 | # add 1 to the correct index 28 | vector[scalar] += 1 29 | return vector 30 | 31 | 32 | #TODO: add methods to save state 33 | #TODO: learning curves? 34 | class NeuralNetwork: 35 | def __init__(self, layer_sizes, alpha, labels=None, reg_constant=0): 36 | self.alpha = alpha 37 | self.regularization_constant = reg_constant 38 | self.dropconnect_matrices = None 39 | 40 | if labels is None: 41 | self.labels = list(range(layer_sizes[-1])) 42 | elif len(labels) != layer_sizes[-1]: 43 | #TODO: throw exception here 44 | print('Fucked up because the size of layer does not match the ' \ 45 | 'size of the outputs. (' + \ 46 | str(len(labels)) + ' != ' + str(layer_sizes[-1]) + ')') 47 | exit(1) 48 | else: 49 | self.labels = labels 50 | 51 | # theta is the weights matrix for each node. we skip the first layer 52 | # because it has no weights. 53 | self.theta = [None] * len(layer_sizes) 54 | for l in range(1, len(layer_sizes)): 55 | # append a matrix which represents the initial weights for layer l 56 | # for node in layer l, add a weight to each node in layer l-1 + bias 57 | beta = 0.7 * math.pow(layer_sizes[l], 1/layer_sizes[l-1]) 58 | self.theta[l] = np.random.random( 59 | (layer_sizes[l], layer_sizes[l-1]+1) 60 | ) * 2 - 1 61 | norm = [ 62 | math.sqrt(x) 63 | for x in np.multiply( 64 | self.theta[l], 65 | self.theta[l]).dot(np.ones([layer_sizes[l-1]+1])) 66 | ] 67 | for row_num in range(len(norm)): 68 | self.theta[l][row_num,:] = self.theta[l][row_num,:] * \ 69 | beta / norm[row_num] 70 | 71 | self.momentum = [np.zeros(t.shape) for t in self.theta[1:]] 72 | 73 | """ 74 | Feed forward and return lists of matrices A and Z for one set of inputs. 75 | """ 76 | def feed_forward(self, input_vector, dropconnect_matrices=None): 77 | A = [None]*len(self.theta) 78 | Z = [None]*len(self.theta) 79 | A[0] = input_vector.T # 1 x n 80 | Z[0] = None # z_1 doesn't exist 81 | for l in range(1, len(self.theta)): 82 | # add constant (1) to the weights that correspond with each node 83 | A_with_ones = np.concatenate((np.array([1]), A[l-1])) 84 | if dropconnect_matrices is not None: 85 | Z[l] = np.dot(np.multiply(self.theta[l], 86 | dropconnect_matrices[l-1]), 87 | A_with_ones) 88 | else: 89 | Z[l] = np.dot(self.theta[l], A_with_ones) 90 | A[l] = sigmoid(Z[l]) 91 | 92 | return A, Z 93 | 94 | """ 95 | Back propagate for one training sample. 
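Returns the per-layer weight gradients D and the per-layer error terms delta.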
96 | """ 97 | def back_prop(self, input_vector, output_vector, dropconnect_matrices): 98 | A, Z = self.feed_forward(input_vector, dropconnect_matrices) 99 | 100 | # let delta be a list of matrices where delta[l][i][j] is delta 101 | # at layer l, training sample i, and node j 102 | # the delta is None for the data layer, others we assign later 103 | delta = [None] * len(self.theta) 104 | delta[-1] = np.multiply(A[-1] - output_vector.T, d_sigmoid(Z[-1])) 105 | 106 | # note: no error on data layer, we have the output layer 107 | for l in reversed(list(range(1, len(self.theta)-1))): 108 | theta_t_delta = np.dot(np.multiply(self.theta[l+1], 109 | dropconnect_matrices[l]).T, 110 | delta[l+1]) 111 | delta[l] = np.multiply(theta_t_delta[1:], d_sigmoid(Z[l])) 112 | 113 | # Calculate the partial derivatives for all theta values using delta 114 | D = [None]*len(self.theta) # make list of size L, where L is num layers 115 | for l in range(1, len(self.theta)): 116 | D[l] = np.dot(np.atleast_2d(A[l-1]).T, np.atleast_2d(delta[l])) 117 | 118 | return D, delta 119 | 120 | """ 121 | This method is used for supervised training on a data set. 122 | """ 123 | def train(self, inputs, outputs, test_inputs=None, test_outputs=None, 124 | epoch_cap=100, error_goal=0, dropconnect_chance=0.15): 125 | # create these first so that we don't have to do it every epoch 126 | input_vectors = [np.array(x) for x in inputs] 127 | output_vectors = [ 128 | np.array(output_scalar_to_vector(y, self.theta[-1].shape[0])) 129 | for y in outputs 130 | ] 131 | test_input_vectors = [np.array(x) for x in test_inputs] 132 | test_output_vectors = [ 133 | np.array(output_scalar_to_vector(y, self.theta[-1].shape[0])) 134 | for y in test_outputs 135 | ] 136 | 137 | m = len(outputs) 138 | for iteration in range(epoch_cap): 139 | if dropconnect_chance > 0: 140 | dropconnect_matrices = \ 141 | self.make_dropconnect_matrices(dropconnect_chance) 142 | for input_vector, output_vector in zip(input_vectors, 143 | output_vectors): 144 | gradient, bias = self.back_prop(input_vector, 145 | output_vector, 146 | dropconnect_matrices) 147 | gradient_with_bias = [None]*len(self.theta) 148 | 149 | for l in range(1, len(self.theta)): 150 | gradient_with_bias[l] = np.vstack((bias[l], gradient[l])) 151 | gradient_with_bias[l] = gradient_with_bias[l].T 152 | 153 | gradient_with_bias = [g for g in gradient_with_bias[1:]] 154 | self.gradient_descent(gradient_with_bias) 155 | 156 | # test the updated system against the validation set 157 | if test_inputs is not None and test_outputs is not None: 158 | num_tests = len(test_output_vectors) 159 | num_correct = 0 160 | for test_input, test_output in zip(test_input_vectors, 161 | test_outputs): 162 | prediction = self.predict(test_input) 163 | if prediction == test_output: 164 | num_correct += 1 165 | test_accuracy = float(num_correct) / float(num_tests) 166 | print('Test at epoch %s: %s / %s -- Accuracy: %s' % ( 167 | str(iteration+1), str(num_correct), 168 | str(num_tests), str(test_accuracy) 169 | )) 170 | 171 | if test_accuracy >= 1.0 - error_goal: 172 | return 173 | 174 | """ 175 | This method calls feed_forward and returns just the prediction labels for 176 | all samples. 
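(In this implementation it is called with a single sample and returns the index
of the largest output activation.)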
177 | """ 178 | def predict(self, input): 179 | A, _ = self.feed_forward(np.array(input)) 180 | return np.argmax(A[-1]) 181 | 182 | def gradient_descent(self, gradient): 183 | for l in range(1, len(self.theta)): 184 | # gradient doesnt have a None value at index 0, but theta does 185 | self.theta[l] = np.add( 186 | self.theta[l], 187 | (-1.0 * self.alpha) * (gradient[l-1] + self.momentum[l-1]) 188 | ) 189 | self.momentum = [m/2 + g/2 for m, g in zip(self.momentum, gradient)] 190 | 191 | def make_dropconnect_matrices(self, dropconnect_chance): 192 | assert(0 <= dropconnect_chance < 1) 193 | dropconnect_matrices = [ 194 | np.fix(np.random.random(t.shape) + (1-dropconnect_chance)) 195 | for t in self.theta[1:] 196 | ] 197 | return dropconnect_matrices -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/nn/src/PyNeural/PyNeural.py.bak: -------------------------------------------------------------------------------- 1 | __author__ = 'MarcoGiancarli, m.a.giancarli@gmail.com' 2 | 3 | 4 | import math 5 | import numpy as np 6 | 7 | 8 | # Use tanh instead of normal sigmoid because it's faster. Emulates sigmoid. 9 | def sigmoid(x): 10 | return (np.tanh(x) + 1) / 2 11 | 12 | 13 | # derivative of our sigmoid function 14 | def d_sigmoid(x): 15 | return (np.tanh(x)+1) * (1-np.tanh(x)) / 4 16 | 17 | 18 | def output_vector_to_scalar(vector): 19 | # get the index of the max in the vector 20 | m,i = max((v,i) for i,v in enumerate(vector.tolist())) 21 | return i 22 | 23 | 24 | def output_scalar_to_vector(scalar, num_outputs): 25 | # same size as outputs, all 0s 26 | vector = [0] * num_outputs 27 | # add 1 to the correct index 28 | vector[scalar] += 1 29 | return vector 30 | 31 | 32 | #TODO: add methods to save state 33 | #TODO: learning curves? 34 | class NeuralNetwork: 35 | def __init__(self, layer_sizes, alpha, labels=None, reg_constant=0): 36 | self.alpha = alpha 37 | self.regularization_constant = reg_constant 38 | self.dropconnect_matrices = None 39 | 40 | if labels is None: 41 | self.labels = range(layer_sizes[-1]) 42 | elif len(labels) != layer_sizes[-1]: 43 | #TODO: throw exception here 44 | print 'Fucked up because the size of layer does not match the ' \ 45 | 'size of the outputs. (' + \ 46 | str(len(labels)) + ' != ' + str(layer_sizes[-1]) + ')' 47 | exit(1) 48 | else: 49 | self.labels = labels 50 | 51 | # theta is the weights matrix for each node. we skip the first layer 52 | # because it has no weights. 53 | self.theta = [None] * len(layer_sizes) 54 | for l in range(1, len(layer_sizes)): 55 | # append a matrix which represents the initial weights for layer l 56 | # for node in layer l, add a weight to each node in layer l-1 + bias 57 | beta = 0.7 * math.pow(layer_sizes[l], 1/layer_sizes[l-1]) 58 | self.theta[l] = np.random.random( 59 | (layer_sizes[l], layer_sizes[l-1]+1) 60 | ) * 2 - 1 61 | norm = [ 62 | math.sqrt(x) 63 | for x in np.multiply( 64 | self.theta[l], 65 | self.theta[l]).dot(np.ones([layer_sizes[l-1]+1])) 66 | ] 67 | for row_num in range(len(norm)): 68 | self.theta[l][row_num,:] = self.theta[l][row_num,:] * \ 69 | beta / norm[row_num] 70 | 71 | self.momentum = [np.zeros(t.shape) for t in self.theta[1:]] 72 | 73 | """ 74 | Feed forward and return lists of matrices A and Z for one set of inputs. 
75 | """ 76 | def feed_forward(self, input_vector, dropconnect_matrices=None): 77 | A = [None]*len(self.theta) 78 | Z = [None]*len(self.theta) 79 | A[0] = input_vector.T # 1 x n 80 | Z[0] = None # z_1 doesn't exist 81 | for l in range(1, len(self.theta)): 82 | # add constant (1) to the weights that correspond with each node 83 | A_with_ones = np.concatenate((np.array([1]), A[l-1])) 84 | if dropconnect_matrices is not None: 85 | Z[l] = np.dot(np.multiply(self.theta[l], 86 | dropconnect_matrices[l-1]), 87 | A_with_ones) 88 | else: 89 | Z[l] = np.dot(self.theta[l], A_with_ones) 90 | A[l] = sigmoid(Z[l]) 91 | 92 | return A, Z 93 | 94 | """ 95 | Back propagate for one training sample. 96 | """ 97 | def back_prop(self, input_vector, output_vector, dropconnect_matrices): 98 | A, Z = self.feed_forward(input_vector, dropconnect_matrices) 99 | 100 | # let delta be a list of matrices where delta[l][i][j] is delta 101 | # at layer l, training sample i, and node j 102 | # the delta is None for the input layer, others we assign later 103 | delta = [None] * len(self.theta) 104 | delta[-1] = np.multiply(A[-1] - output_vector.T, d_sigmoid(Z[-1])) 105 | 106 | # note: no error on input layer, we have the output layer 107 | for l in reversed(range(1, len(self.theta)-1)): 108 | theta_t_delta = np.dot(np.multiply(self.theta[l+1], 109 | dropconnect_matrices[l]).T, 110 | delta[l+1]) 111 | delta[l] = np.multiply(theta_t_delta[1:], d_sigmoid(Z[l])) 112 | 113 | # Calculate the partial derivatives for all theta values using delta 114 | D = [None]*len(self.theta) # make list of size L, where L is num layers 115 | for l in range(1, len(self.theta)): 116 | D[l] = np.dot(np.atleast_2d(A[l-1]).T, np.atleast_2d(delta[l])) 117 | 118 | return D, delta 119 | 120 | """ 121 | This method is used for supervised training on a data set. 
122 | """ 123 | def train(self, inputs, outputs, test_inputs=None, test_outputs=None, 124 | epoch_cap=100, error_goal=0, dropconnect_chance=0.15): 125 | # create these first so that we don't have to do it every epoch 126 | input_vectors = [np.array(x) for x in inputs] 127 | output_vectors = [ 128 | np.array(output_scalar_to_vector(y, self.theta[-1].shape[0])) 129 | for y in outputs 130 | ] 131 | test_input_vectors = [np.array(x) for x in test_inputs] 132 | test_output_vectors = [ 133 | np.array(output_scalar_to_vector(y, self.theta[-1].shape[0])) 134 | for y in test_outputs 135 | ] 136 | 137 | m = len(outputs) 138 | for iteration in range(epoch_cap): 139 | if dropconnect_chance > 0: 140 | dropconnect_matrices = \ 141 | self.make_dropconnect_matrices(dropconnect_chance) 142 | for input_vector, output_vector in zip(input_vectors, 143 | output_vectors): 144 | gradient, bias = self.back_prop(input_vector, 145 | output_vector, 146 | dropconnect_matrices) 147 | gradient_with_bias = [None]*len(self.theta) 148 | 149 | for l in range(1, len(self.theta)): 150 | gradient_with_bias[l] = np.vstack((bias[l], gradient[l])) 151 | gradient_with_bias[l] = gradient_with_bias[l].T 152 | 153 | gradient_with_bias = [g for g in gradient_with_bias[1:]] 154 | self.gradient_descent(gradient_with_bias) 155 | 156 | # test the updated system against the validation set 157 | if test_inputs is not None and test_outputs is not None: 158 | num_tests = len(test_output_vectors) 159 | num_correct = 0 160 | for test_input, test_output in zip(test_input_vectors, 161 | test_outputs): 162 | prediction = self.predict(test_input) 163 | if prediction == test_output: 164 | num_correct += 1 165 | test_accuracy = float(num_correct) / float(num_tests) 166 | print 'Test at epoch %s: %s / %s -- Accuracy: %s' % ( 167 | str(iteration+1), str(num_correct), 168 | str(num_tests), str(test_accuracy) 169 | ) 170 | 171 | if test_accuracy >= 1.0 - error_goal: 172 | return 173 | 174 | """ 175 | This method calls feed_forward and returns just the prediction labels for 176 | all samples. 177 | """ 178 | def predict(self, input): 179 | A, _ = self.feed_forward(np.array(input)) 180 | return np.argmax(A[-1]) 181 | 182 | def gradient_descent(self, gradient): 183 | for l in range(1, len(self.theta)): 184 | # gradient doesnt have a None value at index 0, but theta does 185 | self.theta[l] = np.add( 186 | self.theta[l], 187 | (-1.0 * self.alpha) * (gradient[l-1] + self.momentum[l-1]) 188 | ) 189 | self.momentum = [m/2 + g/2 for m, g in zip(self.momentum, gradient)] 190 | 191 | def make_dropconnect_matrices(self, dropconnect_chance): 192 | assert(0 <= dropconnect_chance < 1) 193 | dropconnect_matrices = [ 194 | np.fix(np.random.random(t.shape) + (1-dropconnect_chance)) 195 | for t in self.theta[1:] 196 | ] 197 | return dropconnect_matrices -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/nn/src/PyNeural/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'Marco' 2 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/nn/src/ensemble.py: -------------------------------------------------------------------------------- 1 | __author__ = 'MarcoGiancarli, m.a.giancarli@gmail.com' 2 | 3 | 4 | # This program is to be used to combine the results of an ensemble of trained 5 | # networks. 
It reads the results from the benchmark csv files and write the mode 6 | # into a file named 'ensemble_benchmark.csv'. 7 | 8 | from csv import writer 9 | from csv import QUOTE_NONE 10 | 11 | BASE_PATH = '../gen/' 12 | 13 | files_in_ensemble = [BASE_PATH + name for name in [ 14 | # 'nn_benchmark.csv', 15 | 'nn_benchmark1.csv', 16 | 'nn_benchmark2.csv', 17 | 'nn_benchmark3.csv', 18 | 'nn_benchmark4.csv', 19 | 'nn_benchmark5.csv', 20 | ]] 21 | 22 | input_files = [open(file_name,'r') for file_name in files_in_ensemble] 23 | for f in input_files: 24 | f.readline() 25 | 26 | with open(BASE_PATH+'ensemble_benchmark.csv', 'wb') as output_file: 27 | w = writer(output_file, delimiter=',', quoting=QUOTE_NONE) 28 | w.writerow(['ImageId','Label']) 29 | 30 | for ex_count in range(28000): 31 | current_predictions = [] # list of digits given by several data files for the same image 32 | prediction_counts = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] # number of times each label appears 33 | 34 | # get the label column for the next line for each data file 35 | for input_file in input_files: 36 | current_predictions.append(input_file.readline().split(',')[1]) 37 | 38 | # add one to the counter at the index of each label from the data files 39 | for prediction in current_predictions: 40 | prediction_counts[int(prediction)] += 1 41 | 42 | # the max index is the label that was predicted most frequently 43 | averaged_prediction = max(range(len(prediction_counts)),key=prediction_counts.__getitem__) 44 | 45 | print(str(ex_count+1) + ' -- Predictions: ' + ', '.join(current_predictions) + \ 46 | ' -- Average: ' + str(averaged_prediction)) 47 | w.writerow([ex_count+1, averaged_prediction]) -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/py-knn/experiment1-custom-knn-brute-force.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import operator 3 | import time 4 | 5 | 6 | # euclidean distance without square root to save 7 | # some computational time 8 | def euclid(x1, x2): 9 | return np.sum(np.power(np.subtract(x1, x2), 2)) 10 | 11 | 12 | # calc kNN from test_row vs training set 13 | # default k = 5 14 | # brute force! :( 15 | def knn(test_row, train, k=5): 16 | diffs = {} 17 | idx = 0 18 | start = time.time() 19 | for t in train: 20 | diffs[idx] = euclid(test_row, t) 21 | idx = idx + 1 22 | print("for loop: %f idx(%d)" % (time.time() - start, idx)) 23 | return sorted(iter(diffs.items()), key=operator.itemgetter(1))[:k] 24 | 25 | 26 | # majority vote 27 | def majority(knn, labels): 28 | a = {} 29 | for idx, distance in knn: 30 | if labels[idx] in list(a.keys()): 31 | a[labels[idx]] = a[labels[idx]] + 1 32 | else: 33 | a[labels[idx]] = 1 34 | return sorted(iter(a.items()), key=operator.itemgetter(1), reverse=True)[0][0] 35 | 36 | 37 | # worker. 
crawl through test set and predicts number 38 | def doWork(train, test, labels): 39 | output_file = open("output.csv", "w", 0) 40 | idx = 0 41 | size = len(test) 42 | for test_sample in test: 43 | idx += 1 44 | start = time.time() 45 | prediction = majority(knn(test_sample, train, k=100), labels) 46 | print("Knn: %f" % (time.time() - start)) 47 | output_file.write(prediction) 48 | output_file.write("\n") 49 | print((float(idx) / size) * 100) 50 | output_file.close() 51 | 52 | 53 | # majority vote for a little bit optimized worker 54 | def majority_vote(knn, labels): 55 | knn = [k[0, 0] for k in knn] 56 | a = {} 57 | for idx in knn: 58 | if labels[idx] in list(a.keys()): 59 | a[labels[idx]] = a[labels[idx]] + 1 60 | else: 61 | a[labels[idx]] = 1 62 | return sorted(iter(a.items()), key=operator.itemgetter(1), reverse=True)[0][0] 63 | 64 | 65 | def doWorkNumpy(train, test, labels): 66 | k = 20 67 | train_mat = np.mat(train) 68 | output_file = open("output-numpy2.csv", "w", 0) 69 | idx = 0 70 | size = len(test) 71 | for test_sample in test: 72 | idx += 1 73 | start = time.time() 74 | knn = np.argsort(np.sum(np.power(np.subtract(train_mat, test_sample), 2), axis=1), axis=0)[:k] 75 | s = time.time() 76 | prediction = majority_vote(knn, labels) 77 | output_file.write(prediction) 78 | output_file.write("\n") 79 | print("Knn: %f, majority %f" % (time.time() - start, time.time() - s)) 80 | print("Done: %f" % (float(idx) / size)) 81 | output_file.close() 82 | output_file = open("done.txt", "w") 83 | output_file.write("DONE") 84 | output_file.close() 85 | 86 | 87 | if __name__ == '__main__': 88 | from load_data import read_data 89 | train, labels = read_data("../data/train.csv") 90 | test, tmpl = read_data("../data/test.csv", test=True) 91 | doWorkNumpy(train, test, labels) 92 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/py-knn/experiment2-sklearn-knn-kdtree.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from sklearn.neighbors import KNeighborsClassifier 3 | 4 | 5 | def doWork(train, test, labels): 6 | print("Converting training to matrix") 7 | train_mat = np.mat(train) 8 | print("Fitting knn") 9 | knn = KNeighborsClassifier(n_neighbors=10, algorithm="kd_tree") 10 | print(knn.fit(train_mat, labels)) 11 | print("Preddicting") 12 | predictions = knn.predict(test) 13 | print("Writing to file") 14 | write_to_file(predictions) 15 | return predictions 16 | 17 | 18 | def write_to_file(predictions): 19 | f = open("output-knn-skilearn.csv", "w") 20 | for p in predictions: 21 | f.write(str(p)) 22 | f.write("\n") 23 | f.close() 24 | 25 | 26 | if __name__ == '__main__': 27 | from load_data import read_data 28 | train, labels = read_data("../data/train.csv") 29 | test, tmpl = read_data("../data/test.csv", test=True) 30 | predictions = doWork(train, test, labels) 31 | print(predictions) 32 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/py-knn/experiment2-sklearn-knn-kdtree.py.bak: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from sklearn.neighbors import KNeighborsClassifier 3 | 4 | 5 | def doWork(train, test, labels): 6 | print "Converting training to matrix" 7 | train_mat = np.mat(train) 8 | print "Fitting knn" 9 | knn = KNeighborsClassifier(n_neighbors=10, algorithm="kd_tree") 10 | print knn.fit(train_mat, labels) 11 | print "Preddicting" 12 | predictions = 
knn.predict(test) 13 | print "Writing to file" 14 | write_to_file(predictions) 15 | return predictions 16 | 17 | 18 | def write_to_file(predictions): 19 | f = open("output-knn-skilearn.csv", "w") 20 | for p in predictions: 21 | f.write(str(p)) 22 | f.write("\n") 23 | f.close() 24 | 25 | 26 | if __name__ == '__main__': 27 | from load_data import read_data 28 | train, labels = read_data("../data/train.csv") 29 | test, tmpl = read_data("../data/test.csv", test=True) 30 | predictions = doWork(train, test, labels) 31 | print predictions 32 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/py-knn/experiment3-sklean-pca-knn.py: -------------------------------------------------------------------------------- 1 | from sklearn.neighbors import KNeighborsClassifier 2 | from sklearn import decomposition 3 | import numpy as np 4 | 5 | PCA_COMPONENTS = 100 6 | 7 | 8 | def doWork(train, labels, test): 9 | print("Converting training set to matrix") 10 | X_train = np.mat(train) 11 | 12 | print("Fitting PCA. Components: %d" % PCA_COMPONENTS) 13 | pca = decomposition.PCA(n_components=PCA_COMPONENTS).fit(X_train) 14 | 15 | print("Reducing training to %d components" % PCA_COMPONENTS) 16 | X_train_reduced = pca.transform(X_train) 17 | 18 | print("Fitting kNN with k=10, kd_tree") 19 | knn = KNeighborsClassifier(n_neighbors=10, algorithm="kd_tree") 20 | print(knn.fit(X_train_reduced, labels)) 21 | 22 | print("Reducing test to %d components" % PCA_COMPONENTS) 23 | X_test_reduced = pca.transform(test) 24 | 25 | print("Preddicting numbers") 26 | predictions = knn.predict(X_test_reduced) 27 | 28 | print("Writing to file") 29 | write_to_file(predictions) 30 | 31 | return predictions 32 | 33 | 34 | def write_to_file(predictions): 35 | f = open("output-pca-knn-skilearn-v3.csv", "w") 36 | for p in predictions: 37 | f.write(str(p)) 38 | f.write("\n") 39 | f.close() 40 | 41 | 42 | if __name__ == '__main__': 43 | from load_data import read_data 44 | train, labels = read_data("../data/train.csv") 45 | test, tmpl = read_data("../data/test.csv", test=True) 46 | print(doWork(train, labels, test)) 47 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/py-knn/experiment3-sklean-pca-knn.py.bak: -------------------------------------------------------------------------------- 1 | from sklearn.neighbors import KNeighborsClassifier 2 | from sklearn import decomposition 3 | import numpy as np 4 | 5 | PCA_COMPONENTS = 100 6 | 7 | 8 | def doWork(train, labels, test): 9 | print "Converting training set to matrix" 10 | X_train = np.mat(train) 11 | 12 | print "Fitting PCA. 
Components: %d" % PCA_COMPONENTS 13 | pca = decomposition.PCA(n_components=PCA_COMPONENTS).fit(X_train) 14 | 15 | print "Reducing training to %d components" % PCA_COMPONENTS 16 | X_train_reduced = pca.transform(X_train) 17 | 18 | print "Fitting kNN with k=10, kd_tree" 19 | knn = KNeighborsClassifier(n_neighbors=10, algorithm="kd_tree") 20 | print knn.fit(X_train_reduced, labels) 21 | 22 | print "Reducing test to %d components" % PCA_COMPONENTS 23 | X_test_reduced = pca.transform(test) 24 | 25 | print "Preddicting numbers" 26 | predictions = knn.predict(X_test_reduced) 27 | 28 | print "Writing to file" 29 | write_to_file(predictions) 30 | 31 | return predictions 32 | 33 | 34 | def write_to_file(predictions): 35 | f = open("output-pca-knn-skilearn-v3.csv", "w") 36 | for p in predictions: 37 | f.write(str(p)) 38 | f.write("\n") 39 | f.close() 40 | 41 | 42 | if __name__ == '__main__': 43 | from load_data import read_data 44 | train, labels = read_data("../data/train.csv") 45 | test, tmpl = read_data("../data/test.csv", test=True) 46 | print doWork(train, labels, test) 47 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/py-knn/load_data.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import numpy as np 3 | 4 | 5 | # loading csv data into numpy array 6 | def read_data(f, header=True, test=False): 7 | data = [] 8 | labels = [] 9 | 10 | csv_reader = csv.reader(open(f, "r"), delimiter=",") 11 | index = 0 12 | for row in csv_reader: 13 | index = index + 1 14 | if header and index == 1: 15 | continue 16 | 17 | if not test: 18 | labels.append(int(row[0])) 19 | row = row[1:] 20 | 21 | data.append(np.array(np.int64(row))) 22 | return (data, labels) 23 | 24 | 25 | if __name__ == "__main__": 26 | train, labels = read_data("../data/train.csv") 27 | test, tmpl = read_data("../data/test.csv", test=True) 28 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/svm_by_myself.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | ''' 5 | @filename: svm_by_myself.py.py 6 | @author: yew1eb 7 | @site: http://blog.yew1eb.net 8 | @contact: yew1eb@gmail.com 9 | @time: 2016/01/15 下午 10:33 10 | ''' 11 | 12 | ''' 13 | 基于SVM的手写数字识别 14 | http://liuhongjiang.github.io/tech/blog/2012/12/29/svm-ocr/ 15 | 16 | Kaggle Digit OCR: SVM (Top 5) 17 | http://www.sotoseattle.com/blog/2013/10/13/Kaggle-Digit-SVM/ 18 | ''' 19 | 20 | def main(): 21 | pass 22 | 23 | 24 | if __name__ == '__main__': 25 | main() -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/svm_pca.py: -------------------------------------------------------------------------------- 1 | import numpy 2 | from sklearn.decomposition import PCA 3 | from sklearn.svm import SVC 4 | 5 | ''' 6 | score: 0.98243 7 | ''' 8 | COMPONENT_NUM = 35 9 | 10 | path = 'd:/dataset/digits/' 11 | print('Read training data...') 12 | with open(path+'train.csv', 'r') as reader: 13 | reader.readline() 14 | train_label = [] 15 | train_data = [] 16 | for line in reader.readlines(): 17 | data = list(map(int, line.rstrip().split(','))) 18 | train_label.append(data[0]) 19 | train_data.append(data[1:]) 20 | 21 | print(('Loaded ' + str(len(train_label)))) 22 | 23 | print('Reduction...') 24 | train_label = numpy.array(train_label) 25 | train_data = numpy.array(train_data) 26 | pca = 
PCA(n_components=COMPONENT_NUM, whiten=True) 27 | pca.fit(train_data) 28 | train_data = pca.transform(train_data) 29 | 30 | print('Train SVM...') 31 | svc = SVC() 32 | svc.fit(train_data, train_label) 33 | 34 | print('Read testing data...') 35 | with open(path+'test.csv', 'r') as reader: 36 | reader.readline() 37 | test_data = [] 38 | for line in reader.readlines(): 39 | pixels = list(map(int, line.rstrip().split(','))) 40 | test_data.append(pixels) 41 | print(('Loaded ' + str(len(test_data)))) 42 | 43 | print('Predicting...') 44 | test_data = numpy.array(test_data) 45 | test_data = pca.transform(test_data) 46 | predict = svc.predict(test_data) 47 | 48 | print('Saving...') 49 | with open(path+'predict.csv', 'w') as writer: 50 | writer.write('"ImageId","Label"\n') 51 | count = 0 52 | for p in predict: 53 | count += 1 54 | writer.write(str(count) + ',"' + str(p) + '"\n') 55 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/using_sklearn.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | """ 4 | @Filename: handwriting.py 5 | @Author: yew1eb 6 | @Date: 2015/12/23 0023 7 | """ 8 | 9 | ''' 10 | 使用sickit-learn中的分类算法预测 11 | 用户使用文档:http://scikit-learn.org/dev/user_guide.html 12 | ''' 13 | 14 | import numpy as np 15 | from sklearn.neighbors import KNeighborsClassifier 16 | from sklearn.ensemble import RandomForestClassifier 17 | from sklearn import svm 18 | from sklearn.naive_bayes import GaussianNB #naive bayes 高斯分布的数据 19 | from sklearn.naive_bayes import MultinomialNB #naive bayes 多项式分布的数据 20 | from sklearn.linear_model import LinearRegression 21 | 22 | def load_data(): 23 | # strain.csv 3000条数据; train.csv 完整训练数据集 24 | train_data = np.loadtxt('d:/dataset/digits/train.csv', dtype=np.uint8,delimiter=',', skiprows=1) 25 | test_data = np.loadtxt('d:/dataset/digits/test.csv', dtype=np.uint8,delimiter=',', skiprows=1) 26 | label = train_data[:,:1] 27 | data = np.where(train_data[:, 1:]!=0, 1, 0)# 数据归一化 28 | test = np.where(test_data !=0, 1, 0) 29 | return data, label, test 30 | 31 | def save2csv(labels, csv_name): 32 | np.savetxt('d:/dataset/digits/'+csv_name, np.c_[list(range(1,len(labels)+1)),labels], 33 | delimiter=',', header = 'ImageId,Label', comments = '', fmt='%d') 34 | 35 | def sklearn_logistic(train_data, train_label, test_data): 36 | model = LinearRegression() 37 | model.fit(train_data, train_label.ravel()) 38 | test_label = model.predict(test_data) 39 | save2csv(test_label, 'sklearn_logistic_result.csv') 40 | 41 | def sklearn_knn(train_data, train_label, test_data): 42 | model = KNeighborsClassifier(n_neighbors=6) 43 | model.fit(train_data, train_label.ravel()) 44 | test_label = model.predict(test_data) 45 | save2csv(test_label, 'sklearn_knn_result.csv') 46 | 47 | def sklearn_random_forest(train_data, train_label, test_data): 48 | model = RandomForestClassifier(n_estimators=1000, min_samples_split=5) 49 | model = model.fit(train_data, train_label.ravel() ) 50 | test_label = model.predict(test_data) 51 | save2csv(test_label, 'sklearn_random_forest.csv') 52 | 53 | def sklearn_svm(train_data, train_label, test_data): 54 | model = svm.SVC(C=14, kernel='rbf', gamma=0.001, cache_size=200) 55 | # svm.SVC(C=6.2, kernel='poly', degree=4, coef0=0.48, cache_size=200) 56 | model.fit(train_data, train_label.ravel() ) 57 | test_label = model.predict(test_data) 58 | 59 | save2csv(test_label, 'sklearn_svm_rbf_result.csv') 60 | 61 | def 
sklearn_GaussianNB(train_data, train_label, test_data): 62 | model = GaussianNB() 63 | model.fit(train_data, train_label.ravel()) 64 | test_label = model.predict(test_data) 65 | save2csv(test_label, 'sklearn_GaussianNB_Result.csv') 66 | 67 | def sklearn_MultinomialNB(train_data, train_label, test_data): 68 | model = MultinomialNB(alpha=0.1) 69 | model.fit(train_data, train_label.ravel()) 70 | test_label = model.predict(test_data) 71 | save2csv(test_label, 'sklearn_MultinomialNB_Result.csv') 72 | 73 | 74 | def main(): 75 | train_data, train_label, test_data = load_data() 76 | 77 | #sklearn_logistic(train_data, train_label, test_data) 78 | 79 | #sklearn_knn(train_data, train_label, test_data) 80 | 81 | #sklearn_random_forest(train_data, train_label, test_data) 82 | 83 | sklearn_svm(train_data, train_label, test_data) 84 | 85 | # naive bayes 0.5~ 86 | #sklearn_GaussianNB(train_data, train_label, test_data) 87 | #sklearn_MultinomialNB(train_data, train_label, test_data) 88 | 89 | if __name__ == '__main__': 90 | main() -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/using_theano.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | ''' 5 | @filename: using_theano.py 6 | @author: yew1eb 7 | @site: http://blog.yew1eb.net 8 | @contact: yew1eb@gmail.com 9 | @time: 2016/01/15 10:13 PM 10 | ''' 11 | 12 | ''' 13 | theano is a Python library that implements a number of machine learning methods; its biggest selling point is that it can use the GPU as transparently as ordinary Python code. 14 | theano homepage: http://deeplearning.net/software/theano/index.html 15 | theano also supports symbolic computation and is compatible with numpy, the Python matrix computation library, 16 | which gives Python Matlab-like numerical power, although it is less convenient than Matlab. 17 | deeplearning.net tutorials: http://deeplearning.net/tutorial/ 18 | 19 | Climbing the Kaggle MNIST leaderboard with Python's theano library: 20 | http://wiki.swarma.net/index.php?title=%E5%88%A9%E7%94%A8python%E7%9A%84theano%E5%BA%93%E5%88%B7kaggle_mnist%E6%8E%92%E8%A1%8C%E6%A6%9C&variant=zh-cn 21 | ''' 22 | 23 | def main(): 24 | pass 25 | 26 | 27 | if __name__ == '__main__': 28 | main() -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data Mining Competition Getting Started 2 | *************** 3 | ## Analytics Vidhya 4 | ### AV Loan Prediction [url](http://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction#) 5 | A small practice problem: predict from applicant features whether a housing loan should be granted; a binary classification task. 6 | There are 11 features in total (Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area); 7 | Loan_ID is the applicant ID and Loan_Status is the target to predict. The features include both numeric and categorical types. 8 | 9 | ## Data Castle 10 | ### Micro-loan borrower creditworthiness prediction competition [url](http://pkbigdata.com/common/competition/148.html) 11 | Same as above, except that this one has far more features. 12 | 13 | ## Kaggle 14 | ### Digit Recognizer [url](https://www.kaggle.com/c/digit-recognizer) 15 | A multi-class classification practice problem. 16 | 17 | ### Titanic: Machine Learning from Disaster [url](https://www.kaggle.com/c/titanic) 18 | A binary classification problem; submit 0/1 labels, evaluated by accuracy. 19 | 20 | ### Bag of Words Meets Bags of Popcorn [url](https://www.kaggle.com/c/word2vec-nlp-tutorial) 21 | A binary text sentiment classification problem. The evaluation metric is AUC. 22 | http://www.cnblogs.com/lijingpeng/p/5787549.html 23 | 24 | ### Display Advertising Challenge [url](https://www.kaggle.com/c/criteo-display-ad-challenge) 25 | An ad CTR prediction competition sponsored by the well-known advertising company Criteo. The data contain 40 million training samples and 5 million test samples, with 13 numeric features and 26 categorical features; the evaluation metric is logloss. 26 | The standard industry approach to CTR is LR, with extensive feature combination/transformation that can reach hundreds of millions of dimensions. LR was my first choice here as well; missing feature values were filled with the mode, and the 26 categorical features were one-hot encoded. 27 | Plotting the numeric features with pandas showed they are far from normally distributed and heavily skewed, so instead of scaling them to [0,1] 28 | I split each one into 6 intervals at the quantile points (min, 25%, median, 75%, max), with negative/extreme values assigned to intervals 1 and 6 as outliers, and then one-hot encoded everything together, ending up with roughly 1 million features and a 20+ GB training file. 29 | A few pitfalls worth highlighting: 1. it is best to implement one-hot encoding yourself, unless your machine has plenty of memory (everything must be loaded into numpy, and non-sparse); 2. LR is best trained with SGD or mini-batches, 30 | in out-of-core mode (http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html#example-applications-plot-out-of-core-classification-py), 31 | again unless you have plenty of memory; 3. think twice before coding: with data this large, re-running after a mid-way failure is very costly. 32 | I found that sklearn's LR and liblinear's LR behave quite differently: sklearn's L2 regularization beats its L1, while liblinear's L1 beats its L2; my understanding is that this comes from their different optimization methods. 33 | The best final result was liblinear's L1-regularized LR, logloss=0.46601, 227th/718 on the LB, which also matches the intuition that lasso produces sparse solutions. 34 | I also tried xgboost on its own, logloss=0.46946, probably because GBRT does not handle high-dimensional sparse features well. Facebook has a paper that feeds GBRT outputs as transformed features into a downstream linear classifier 35 | with good results; worth a look. (Practical Lessons from Predicting Clicks on Ads at Facebook) 36 | I only tried LR as a simple baseline; there are many more things to try, see the winners' solutions in the forum, 37 | for example: 1. the Vowpal Wabbit tool does not require distinguishing categorical from numeric features; 2. use libFFM for feature crosses; 3. the feature hashing trick; 4. add each feature's average click-through rate as a new feature; 5. multi-model ensembles, etc. 38 |
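A minimal sketch of the bucketing + one-hot + out-of-core recipe above, assuming Criteo's tab-separated format with columns Label, I1-I13, C1-C26; the file path, chunk size, hashing dimension, and bucket edges below are illustrative choices, not the code behind the 0.46601 submission, and `loss='log'` is plain logistic regression trained by SGD (newer scikit-learn versions spell it `'log_loss'`):

```python
# Sketch: bucket the 13 numeric features into 6 quantile intervals, treat the
# 26 categorical features as strings, one-hot everything via the hashing trick,
# and train logistic regression with SGD out-of-core, one chunk at a time.
import numpy as np
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

NUM_COLS = ['I%d' % i for i in range(1, 14)]   # 13 numeric features
CAT_COLS = ['C%d' % i for i in range(1, 27)]   # 26 categorical features

hasher = FeatureHasher(n_features=2 ** 20, input_type='string')
clf = SGDClassifier(loss='log', penalty='l1')  # logistic regression via SGD

# bin_edges[c] would normally be quantiles estimated from a sample of the data;
# fixed placeholder edges are used here only to keep the sketch short.
bin_edges = {c: np.array([0.0, 1.0, 5.0, 20.0, 100.0]) for c in NUM_COLS}

def to_hashed(df):
    """Turn one chunk into hashed one-hot features built from 'col=value' tokens."""
    samples = []
    for _, row in df.iterrows():
        feats = ['%s=%s' % (c, row[c]) for c in CAT_COLS]
        for c in NUM_COLS:
            if pd.isnull(row[c]):
                feats.append('%s=missing' % c)              # missing numeric value
            else:
                b = int(np.digitize(row[c], bin_edges[c]))  # 0..5 -> 6 buckets
                feats.append('%s=bin%d' % (c, b))
        samples.append(feats)
    return hasher.transform(samples)            # sparse matrix, memory stays bounded

for chunk in pd.read_csv('train.txt', sep='\t', chunksize=100000,
                         names=['Label'] + NUM_COLS + CAT_COLS):
    X = to_hashed(chunk)
    y = chunk['Label'].values
    clf.partial_fit(X, y, classes=[0, 1])       # out-of-core: one pass per chunk
```

Hashing keeps memory bounded without ever materializing the million-column one-hot matrix, which is the point of the out-of-core approach described above.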
-------------------------------------------------------------------------------- /kaggle-titanic/README.md: -------------------------------------------------------------------------------- 1 | As an aside, there seems to be a victim list floating around online, and there are quite a few score 1.0 entries on the LB. Rumor has it that any score above 90% is suspected of cheating; true or not, most of the top 300 cluster between 0.808 and 0.818. I am out of ideas for further improvement on this problem; advice welcome! 2 | The data contain numeric and categorical features, with missing values. I one-hot encoded the categorical features; whether missing values should be filled with the mean/median/mode depends on the data, and I decide by printing each column's distribution with pandas. 3 | For models I tried DT/RF/GBDT/SVC. Since xgboost outputs probabilities, a threshold has to be chosen to produce 0/1; mine was probably poorly chosen, and it only scored 0.78847. 4 | RF worked best, at 0.81340. After feature selection, the features I used were 'Pclass','Gender', 'Cabin','Ticket','Embarked','Title' one-hot encoded, and 'Age','SibSp','Parch','Fare','class_age','Family' normalized. 5 | I also tried building new features and feature combinations, such as splitting Title into the four classes Mr/Mrs/Miss/Master or taking the first token of the split, adding fare_per_person, etc., and adding feature selection to the pipeline, but none of it improved the score; advice welcome! 6 | 7 | 8 | 9 | [Kaggle data mining competition primer -- Titanic](http://www.cnblogs.com/north-north/tag/kaggle/) 10 | 11 | [Kaggle Titanic Competition Part I – Intro] 12 | (http://www.ultravioletanalytics.com/2014/10/30/kaggle-titanic-competition-part-i-intro/) 13 | 14 | 15 | [Kaggle Competition | Titanic Machine Learning from Disaster] 16 | (http://nbviewer.ipython.org/github/agconti/kaggle-titanic/blob/master/Titanic.ipynb) 17 | 18 | https://github.com/agconti/kaggle-titanic 19 | 20 | http://www.sotoseattle.com/blog/categories/kaggle/ 21 | 22 | [Titanic: Machine Learning from Disaster - Getting Started With R] 23 | https://github.com/trevorstephens/titanic 24 | https://github.com/wehrley/wehrley.github.io/blob/master/SOUPTONUTS.md 25 | 26 | 27 | http://mlwave.com/tutorial-titanic-machine-learning-from-distaster/ 28 | Full Titanic Example with Random Forest 29 | https://www.youtube.com/watch?v=0GrciaGYzV0 30 | 31 | [Tutorial: Titanic dataset machine learning for Kaggle] 32 | (http://corpocrat.com/2014/08/29/tutorial-titanic-dataset-machine-learning-for-kaggle/) 33 | 34 | [Getting Started with R: Titanic Competition in Kaggle] 35 | (http://armandruiz.com/kaggle/Titanic_Kaggle_Analysis.html) 36 | 37 | [A complete guide to getting 0.79903 in Kaggle's Titanic Competition with Python](https://triangleinequality.wordpress.com/2013/09/05/a-complete-guide-to-getting-0-79903-in-kaggles-titanic-competition-with-python/) 38 | [Machine learning series (3): Logistic regression applied to the Kaggle Titanic disaster](http://blog.csdn.net/han_xiaoyang/article/details/49797143) 39 | https://www.kaggle.com/malais/titanic/kaggle-first-ipythonnotebook/notebook -------------------------------------------------------------------------------- /kaggle-titanic/code.py:
-------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | 4 | from sklearn import cross_validation 5 | from sklearn.cross_validation import KFold 6 | 7 | from sklearn.linear_model import LinearRegression 8 | from sklearn.linear_model import LogisticRegression 9 | from sklearn.ensemble import RandomForestClassifier 10 | 11 | titanic = pd.read_csv("./data/train.csv", dtype={"Age": np.float64}, ) 12 | 13 | # Preprocessing Data 14 | # ================== 15 | 16 | # Fill in missing value in "Age". 17 | titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median()) 18 | 19 | # Replace all the occurences of male with the number 0. 20 | titanic.loc[titanic["Sex"] == "male", "Sex"] = 0 21 | titanic.loc[titanic["Sex"] == "female", "Sex"] = 1 22 | 23 | # Convert the Embarked Column. 24 | titanic["Embarked"] = titanic["Embarked"].fillna("S") 25 | titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0 26 | titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1 27 | titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2 28 | 29 | 30 | # The columns we'll use to predict the target 31 | predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"] 32 | 33 | # Linear Regression 34 | # ================= 35 | alg = LinearRegression() 36 | # Generate cross validation folds for the titanic dataset. It return the row indices corresponding to train and test. 37 | # We set random_state to ensure we get the same splits every time we run this. 38 | kf = KFold(titanic.shape[0], n_folds=3) 39 | 40 | predictions = [] 41 | for train, test in kf: 42 | # The predictors we're using the train the algorithm. Note how we only take the rows in the train folds. 43 | train_predictors = (titanic[predictors].iloc[train,:]) 44 | # The target we're using to train the algorithm. 45 | train_target = titanic["Survived"].iloc[train] 46 | # Training the algorithm using the predictors and target. 47 | alg.fit(train_predictors, train_target) 48 | # We can now make predictions on the test fold 49 | test_predictions = alg.predict(titanic[predictors].iloc[test,:]) 50 | predictions.append(test_predictions) 51 | 52 | # Evaluating error and accuracy 53 | predictions = np.concatenate(predictions,axis = 0) 54 | predictions[predictions > .5] = 1 55 | predictions[predictions <= .5] = 0 56 | 57 | accuracy = sum(predictions[predictions == titanic["Survived"]]) / len(predictions) 58 | 59 | print(('Accuracy of Linear Regression on the training set is ' + str(accuracy))) 60 | 61 | # Logistic Regression 62 | # =================== 63 | alg = LogisticRegression() 64 | # Compute the accuracy score for all the cross validation folds. (much simpler than what we did before!) 65 | scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3) 66 | # Take the mean of the scores (because we have one for each fold) 67 | print(('Accuracy of Logistic Regression on the training set is ' + str(scores.mean()))) 68 | 69 | # Random Forest 70 | # =================== 71 | from sklearn.ensemble import RandomForestClassifier 72 | 73 | alg = RandomForestClassifier(n_estimators=1000,min_samples_leaf=5, max_features="auto", n_jobs=2, random_state=10) 74 | # Compute the accuracy score for all the cross validation folds. (much simpler than what we did before!) 
75 | scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3) 76 | # Take the mean of the scores (because we have one for each fold) 77 | print(('Accuracy of Random Forest on the training set is ' + str(scores.mean()))) 78 | 79 | # Test Set 80 | # ======== 81 | titanic_test = pd.read_csv("./data/test.csv", dtype={"Age": np.float64}, ) 82 | 83 | titanic_test["Age"] = titanic_test["Age"].fillna(titanic["Age"].median()) 84 | 85 | titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0 86 | titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1 87 | 88 | titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S") 89 | 90 | titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0 91 | titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1 92 | titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2 93 | 94 | titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic_test["Fare"].median()) 95 | 96 | # Train the algorithm using all the training data 97 | alg.fit(titanic[predictors], titanic["Survived"]) 98 | 99 | # Make predictions using the test set. 100 | predictions = alg.predict(titanic_test[predictors]) 101 | 102 | # Create a new dataframe with only the columns Kaggle wants from the dataset. 103 | submission = pd.DataFrame({ 104 | "PassengerId": titanic_test["PassengerId"], 105 | "Survived": predictions 106 | }) 107 | 108 | submission.to_csv('result_rf.csv', index=False) -------------------------------------------------------------------------------- /kaggle-titanic/lr.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | import pandas as pd 5 | import numpy as np 6 | import matplotlib.pyplot as plt 7 | 8 | base_path = './data/' 9 | train = pd.read_csv(base_path+'train.csv') 10 | 11 | # 初步观察数据 12 | #print(train.info()) 13 | ''' 14 | 特征信息: 15 | PassengerId => 乘客ID 16 | Pclass => 乘客等级(1/2/3等舱位) 17 | Name => 乘客姓名 18 | Sex => 性别 19 | Age => 年龄 20 | SibSp => 堂兄弟/妹个数 21 | Parch => 父母与小孩个数 22 | Ticket => 船票信息 23 | Fare => 票价 24 | Cabin => 客舱 25 | Embarked => 登船港口 26 | 27 | Age,Cabin列有缺失 28 | Name,Sex,Ticket,Cabin,Embarked列为分类类型 29 | ''' 30 | 31 | #print(train.describe()) 32 | ''' 33 | 查看数值类型特征的统计信息 34 | ''' 35 | 36 | # 数据初步分析 37 | ''' 38 | 看看每个/多个 属性和最后的Survived之间有着什么样的关系 39 | ''' 40 | 41 | def analyze_features(train): 42 | fig = plt.figure() 43 | 44 | fig.set(alpha=0.2) # 设定图表颜色alpha参数 45 | plt.subplot2grid((2,3), (0,0)) # 在一张大图里分列几个小图 46 | plt.title('显示中文') 47 | train.Survived.value_counts().plot(kind='bar') # 柱状图 48 | plt.title('获救情况 (1为获救)') # 标题 49 | plt.ylabel('人数') 50 | 51 | plt.subplot2grid((2,3),(0,1)) 52 | train.Pclass.value_counts().plot(kind='bar') 53 | plt.ylabel('人数') 54 | plt.title('乘客等级分布') 55 | 56 | plt.subplot2grid((2,3),(0,2)) 57 | plt.scatter(train.Survived, train.Age) 58 | plt.ylabel('年龄') 59 | plt.grid(b=True, which='major', axis='y') 60 | plt.title('按年龄看获救分布(1为获救)') 61 | 62 | plt.subplot2grid((2,3),(1,0), colspan=2) 63 | train.Age[train.Pclass == 1].plot(kind='kde') 64 | train.Age[train.Pclass == 2].plot(kind='kde') 65 | train.Age[train.Pclass == 3].plot(kind='kde') 66 | plt.xlabel("年龄")# plots an axis lable 67 | plt.ylabel("密度") 68 | plt.title("各等级的乘客年龄分布") 69 | plt.legend(('头等舱', '2等舱','3等舱'),loc='best') # sets our legend for our graph. 
70 | 71 | 72 | plt.subplot2grid((2,3),(1,2)) 73 | train.Embarked.value_counts().plot(kind='bar') 74 | plt.title("各登船口岸上船人数") 75 | plt.ylabel("人数") 76 | plt.show() 77 | 78 | analyze_features(train) 79 | ''' 80 | 不同舱位/乘客等级可能和财富/地位有关系,最后获救概率可能会不一样 81 | 年龄对获救概率也一定是有影响的,毕竟前面说了,副船长还说『小孩和女士先走』呢 82 | 和登船港口是不是有关系呢?也许登船港口不同,人的出身地位不同? 83 | ''' 84 | # http://blog.csdn.net/han_xiaoyang/article/details/49797143 -------------------------------------------------------------------------------- /kaggle-titanic/randomforest_gridsearchCV.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yew1eb/DM-Competition-Getting-Started/c7ac0d3226883a4e387cd24058e7618acd231fa5/kaggle-titanic/randomforest_gridsearchCV.py -------------------------------------------------------------------------------- /kaggle-titanic/result_rf.csv: -------------------------------------------------------------------------------- 1 | PassengerId,Survived 2 | 892,0 3 | 893,0 4 | 894,0 5 | 895,0 6 | 896,0 7 | 897,0 8 | 898,1 9 | 899,0 10 | 900,1 11 | 901,0 12 | 902,0 13 | 903,0 14 | 904,1 15 | 905,0 16 | 906,1 17 | 907,1 18 | 908,0 19 | 909,0 20 | 910,0 21 | 911,0 22 | 912,0 23 | 913,0 24 | 914,1 25 | 915,0 26 | 916,1 27 | 917,0 28 | 918,1 29 | 919,0 30 | 920,1 31 | 921,0 32 | 922,0 33 | 923,0 34 | 924,0 35 | 925,0 36 | 926,1 37 | 927,0 38 | 928,0 39 | 929,0 40 | 930,0 41 | 931,0 42 | 932,0 43 | 933,1 44 | 934,0 45 | 935,1 46 | 936,1 47 | 937,0 48 | 938,0 49 | 939,0 50 | 940,1 51 | 941,1 52 | 942,0 53 | 943,0 54 | 944,1 55 | 945,1 56 | 946,0 57 | 947,0 58 | 948,0 59 | 949,0 60 | 950,0 61 | 951,1 62 | 952,0 63 | 953,0 64 | 954,0 65 | 955,1 66 | 956,1 67 | 957,1 68 | 958,1 69 | 959,0 70 | 960,0 71 | 961,1 72 | 962,1 73 | 963,0 74 | 964,0 75 | 965,0 76 | 966,1 77 | 967,0 78 | 968,0 79 | 969,1 80 | 970,0 81 | 971,1 82 | 972,1 83 | 973,0 84 | 974,0 85 | 975,0 86 | 976,0 87 | 977,0 88 | 978,1 89 | 979,0 90 | 980,1 91 | 981,1 92 | 982,0 93 | 983,0 94 | 984,1 95 | 985,0 96 | 986,0 97 | 987,0 98 | 988,1 99 | 989,0 100 | 990,0 101 | 991,0 102 | 992,1 103 | 993,0 104 | 994,0 105 | 995,0 106 | 996,1 107 | 997,0 108 | 998,0 109 | 999,0 110 | 1000,0 111 | 1001,0 112 | 1002,0 113 | 1003,1 114 | 1004,1 115 | 1005,1 116 | 1006,1 117 | 1007,0 118 | 1008,0 119 | 1009,1 120 | 1010,0 121 | 1011,1 122 | 1012,1 123 | 1013,0 124 | 1014,1 125 | 1015,0 126 | 1016,0 127 | 1017,1 128 | 1018,0 129 | 1019,1 130 | 1020,0 131 | 1021,0 132 | 1022,0 133 | 1023,0 134 | 1024,0 135 | 1025,0 136 | 1026,0 137 | 1027,0 138 | 1028,0 139 | 1029,0 140 | 1030,0 141 | 1031,0 142 | 1032,0 143 | 1033,1 144 | 1034,0 145 | 1035,0 146 | 1036,1 147 | 1037,0 148 | 1038,0 149 | 1039,0 150 | 1040,1 151 | 1041,0 152 | 1042,1 153 | 1043,0 154 | 1044,0 155 | 1045,1 156 | 1046,0 157 | 1047,0 158 | 1048,1 159 | 1049,1 160 | 1050,1 161 | 1051,1 162 | 1052,1 163 | 1053,1 164 | 1054,1 165 | 1055,0 166 | 1056,0 167 | 1057,0 168 | 1058,0 169 | 1059,0 170 | 1060,1 171 | 1061,0 172 | 1062,0 173 | 1063,0 174 | 1064,0 175 | 1065,0 176 | 1066,0 177 | 1067,1 178 | 1068,1 179 | 1069,0 180 | 1070,1 181 | 1071,1 182 | 1072,0 183 | 1073,0 184 | 1074,1 185 | 1075,0 186 | 1076,1 187 | 1077,0 188 | 1078,1 189 | 1079,0 190 | 1080,0 191 | 1081,0 192 | 1082,0 193 | 1083,0 194 | 1084,0 195 | 1085,0 196 | 1086,1 197 | 1087,0 198 | 1088,1 199 | 1089,1 200 | 1090,0 201 | 1091,0 202 | 1092,1 203 | 1093,1 204 | 1094,0 205 | 1095,1 206 | 1096,0 207 | 1097,0 208 | 1098,1 209 | 1099,0 210 | 1100,1 211 | 1101,0 212 | 1102,0 213 | 1103,0 214 | 1104,0 215 | 1105,1 
216 | 1106,0 217 | 1107,0 218 | 1108,1 219 | 1109,0 220 | 1110,1 221 | 1111,0 222 | 1112,1 223 | 1113,0 224 | 1114,1 225 | 1115,0 226 | 1116,1 227 | 1117,1 228 | 1118,0 229 | 1119,1 230 | 1120,0 231 | 1121,0 232 | 1122,0 233 | 1123,1 234 | 1124,0 235 | 1125,0 236 | 1126,0 237 | 1127,0 238 | 1128,0 239 | 1129,0 240 | 1130,1 241 | 1131,1 242 | 1132,1 243 | 1133,1 244 | 1134,0 245 | 1135,0 246 | 1136,0 247 | 1137,0 248 | 1138,1 249 | 1139,0 250 | 1140,1 251 | 1141,0 252 | 1142,1 253 | 1143,0 254 | 1144,0 255 | 1145,0 256 | 1146,0 257 | 1147,0 258 | 1148,0 259 | 1149,0 260 | 1150,1 261 | 1151,0 262 | 1152,0 263 | 1153,0 264 | 1154,1 265 | 1155,1 266 | 1156,0 267 | 1157,0 268 | 1158,0 269 | 1159,0 270 | 1160,0 271 | 1161,0 272 | 1162,0 273 | 1163,0 274 | 1164,1 275 | 1165,1 276 | 1166,0 277 | 1167,1 278 | 1168,0 279 | 1169,0 280 | 1170,0 281 | 1171,0 282 | 1172,0 283 | 1173,1 284 | 1174,1 285 | 1175,1 286 | 1176,1 287 | 1177,0 288 | 1178,0 289 | 1179,0 290 | 1180,0 291 | 1181,0 292 | 1182,0 293 | 1183,1 294 | 1184,0 295 | 1185,0 296 | 1186,0 297 | 1187,0 298 | 1188,1 299 | 1189,0 300 | 1190,0 301 | 1191,0 302 | 1192,0 303 | 1193,0 304 | 1194,0 305 | 1195,0 306 | 1196,1 307 | 1197,1 308 | 1198,0 309 | 1199,1 310 | 1200,0 311 | 1201,0 312 | 1202,0 313 | 1203,0 314 | 1204,0 315 | 1205,1 316 | 1206,1 317 | 1207,1 318 | 1208,0 319 | 1209,0 320 | 1210,0 321 | 1211,0 322 | 1212,0 323 | 1213,0 324 | 1214,0 325 | 1215,1 326 | 1216,1 327 | 1217,0 328 | 1218,1 329 | 1219,0 330 | 1220,0 331 | 1221,0 332 | 1222,1 333 | 1223,0 334 | 1224,0 335 | 1225,1 336 | 1226,0 337 | 1227,0 338 | 1228,0 339 | 1229,0 340 | 1230,0 341 | 1231,0 342 | 1232,0 343 | 1233,0 344 | 1234,0 345 | 1235,1 346 | 1236,0 347 | 1237,1 348 | 1238,0 349 | 1239,1 350 | 1240,0 351 | 1241,1 352 | 1242,1 353 | 1243,0 354 | 1244,0 355 | 1245,0 356 | 1246,1 357 | 1247,0 358 | 1248,1 359 | 1249,0 360 | 1250,0 361 | 1251,1 362 | 1252,0 363 | 1253,1 364 | 1254,1 365 | 1255,0 366 | 1256,1 367 | 1257,0 368 | 1258,0 369 | 1259,0 370 | 1260,1 371 | 1261,0 372 | 1262,0 373 | 1263,1 374 | 1264,0 375 | 1265,0 376 | 1266,1 377 | 1267,1 378 | 1268,0 379 | 1269,0 380 | 1270,0 381 | 1271,0 382 | 1272,0 383 | 1273,0 384 | 1274,1 385 | 1275,0 386 | 1276,0 387 | 1277,1 388 | 1278,0 389 | 1279,0 390 | 1280,0 391 | 1281,0 392 | 1282,0 393 | 1283,1 394 | 1284,0 395 | 1285,0 396 | 1286,0 397 | 1287,1 398 | 1288,0 399 | 1289,1 400 | 1290,0 401 | 1291,0 402 | 1292,1 403 | 1293,0 404 | 1294,1 405 | 1295,0 406 | 1296,0 407 | 1297,0 408 | 1298,0 409 | 1299,0 410 | 1300,1 411 | 1301,1 412 | 1302,1 413 | 1303,1 414 | 1304,0 415 | 1305,0 416 | 1306,1 417 | 1307,0 418 | 1308,0 419 | 1309,0 420 | -------------------------------------------------------------------------------- /kaggle-titanic/result_xgb.csv: -------------------------------------------------------------------------------- 1 | PassengerId,Survived 2 | 892,0 3 | 893,0 4 | 894,0 5 | 895,0 6 | 896,1 7 | 897,0 8 | 898,0 9 | 899,0 10 | 900,1 11 | 901,0 12 | 902,0 13 | 903,0 14 | 904,1 15 | 905,0 16 | 906,1 17 | 907,1 18 | 908,0 19 | 909,0 20 | 910,1 21 | 911,0 22 | 912,0 23 | 913,0 24 | 914,1 25 | 915,1 26 | 916,1 27 | 917,0 28 | 918,1 29 | 919,1 30 | 920,1 31 | 921,0 32 | 922,0 33 | 923,0 34 | 924,1 35 | 925,0 36 | 926,1 37 | 927,0 38 | 928,0 39 | 929,0 40 | 930,0 41 | 931,1 42 | 932,0 43 | 933,1 44 | 934,0 45 | 935,1 46 | 936,1 47 | 937,0 48 | 938,0 49 | 939,0 50 | 940,1 51 | 941,1 52 | 942,0 53 | 943,0 54 | 944,1 55 | 945,1 56 | 946,0 57 | 947,0 58 | 948,0 59 | 949,0 60 | 950,0 61 | 951,1 62 | 952,0 63 | 953,0 64 
| 954,0 65 | 955,1 66 | 956,0 67 | 957,1 68 | 958,1 69 | 959,0 70 | 960,0 71 | 961,1 72 | 962,1 73 | 963,0 74 | 964,1 75 | 965,0 76 | 966,1 77 | 967,0 78 | 968,0 79 | 969,1 80 | 970,0 81 | 971,1 82 | 972,1 83 | 973,0 84 | 974,0 85 | 975,0 86 | 976,0 87 | 977,0 88 | 978,1 89 | 979,1 90 | 980,1 91 | 981,1 92 | 982,0 93 | 983,0 94 | 984,1 95 | 985,0 96 | 986,0 97 | 987,0 98 | 988,1 99 | 989,0 100 | 990,1 101 | 991,0 102 | 992,1 103 | 993,0 104 | 994,0 105 | 995,0 106 | 996,1 107 | 997,0 108 | 998,0 109 | 999,0 110 | 1000,0 111 | 1001,0 112 | 1002,0 113 | 1003,1 114 | 1004,1 115 | 1005,1 116 | 1006,1 117 | 1007,0 118 | 1008,0 119 | 1009,1 120 | 1010,1 121 | 1011,1 122 | 1012,1 123 | 1013,0 124 | 1014,1 125 | 1015,0 126 | 1016,0 127 | 1017,1 128 | 1018,0 129 | 1019,1 130 | 1020,0 131 | 1021,0 132 | 1022,0 133 | 1023,0 134 | 1024,0 135 | 1025,0 136 | 1026,0 137 | 1027,0 138 | 1028,1 139 | 1029,0 140 | 1030,0 141 | 1031,0 142 | 1032,0 143 | 1033,1 144 | 1034,0 145 | 1035,0 146 | 1036,1 147 | 1037,0 148 | 1038,0 149 | 1039,0 150 | 1040,1 151 | 1041,0 152 | 1042,1 153 | 1043,0 154 | 1044,0 155 | 1045,0 156 | 1046,0 157 | 1047,0 158 | 1048,1 159 | 1049,1 160 | 1050,1 161 | 1051,1 162 | 1052,1 163 | 1053,1 164 | 1054,1 165 | 1055,0 166 | 1056,0 167 | 1057,1 168 | 1058,0 169 | 1059,0 170 | 1060,1 171 | 1061,0 172 | 1062,0 173 | 1063,1 174 | 1064,0 175 | 1065,0 176 | 1066,0 177 | 1067,1 178 | 1068,1 179 | 1069,0 180 | 1070,1 181 | 1071,1 182 | 1072,0 183 | 1073,0 184 | 1074,1 185 | 1075,0 186 | 1076,1 187 | 1077,0 188 | 1078,1 189 | 1079,0 190 | 1080,0 191 | 1081,0 192 | 1082,0 193 | 1083,0 194 | 1084,1 195 | 1085,0 196 | 1086,1 197 | 1087,0 198 | 1088,1 199 | 1089,0 200 | 1090,0 201 | 1091,0 202 | 1092,1 203 | 1093,1 204 | 1094,0 205 | 1095,1 206 | 1096,0 207 | 1097,0 208 | 1098,0 209 | 1099,0 210 | 1100,1 211 | 1101,0 212 | 1102,0 213 | 1103,0 214 | 1104,0 215 | 1105,1 216 | 1106,0 217 | 1107,0 218 | 1108,1 219 | 1109,0 220 | 1110,1 221 | 1111,0 222 | 1112,1 223 | 1113,0 224 | 1114,1 225 | 1115,0 226 | 1116,1 227 | 1117,0 228 | 1118,0 229 | 1119,1 230 | 1120,0 231 | 1121,0 232 | 1122,0 233 | 1123,1 234 | 1124,0 235 | 1125,0 236 | 1126,0 237 | 1127,0 238 | 1128,0 239 | 1129,1 240 | 1130,1 241 | 1131,1 242 | 1132,1 243 | 1133,1 244 | 1134,0 245 | 1135,0 246 | 1136,0 247 | 1137,0 248 | 1138,1 249 | 1139,0 250 | 1140,1 251 | 1141,0 252 | 1142,1 253 | 1143,0 254 | 1144,0 255 | 1145,0 256 | 1146,0 257 | 1147,0 258 | 1148,0 259 | 1149,0 260 | 1150,1 261 | 1151,0 262 | 1152,0 263 | 1153,0 264 | 1154,1 265 | 1155,1 266 | 1156,0 267 | 1157,0 268 | 1158,0 269 | 1159,0 270 | 1160,0 271 | 1161,0 272 | 1162,0 273 | 1163,0 274 | 1164,1 275 | 1165,1 276 | 1166,0 277 | 1167,1 278 | 1168,0 279 | 1169,0 280 | 1170,0 281 | 1171,0 282 | 1172,0 283 | 1173,1 284 | 1174,1 285 | 1175,0 286 | 1176,1 287 | 1177,0 288 | 1178,0 289 | 1179,0 290 | 1180,0 291 | 1181,0 292 | 1182,0 293 | 1183,0 294 | 1184,0 295 | 1185,0 296 | 1186,0 297 | 1187,0 298 | 1188,1 299 | 1189,0 300 | 1190,0 301 | 1191,0 302 | 1192,0 303 | 1193,0 304 | 1194,0 305 | 1195,0 306 | 1196,1 307 | 1197,1 308 | 1198,0 309 | 1199,1 310 | 1200,0 311 | 1201,0 312 | 1202,0 313 | 1203,1 314 | 1204,0 315 | 1205,0 316 | 1206,1 317 | 1207,1 318 | 1208,0 319 | 1209,0 320 | 1210,0 321 | 1211,0 322 | 1212,0 323 | 1213,0 324 | 1214,0 325 | 1215,1 326 | 1216,1 327 | 1217,0 328 | 1218,1 329 | 1219,0 330 | 1220,0 331 | 1221,0 332 | 1222,1 333 | 1223,0 334 | 1224,0 335 | 1225,1 336 | 1226,0 337 | 1227,0 338 | 1228,0 339 | 1229,0 340 | 1230,0 341 | 1231,0 342 | 1232,0 343 | 1233,0 
344 | 1234,0 345 | 1235,1 346 | 1236,0 347 | 1237,1 348 | 1238,0 349 | 1239,0 350 | 1240,0 351 | 1241,1 352 | 1242,1 353 | 1243,0 354 | 1244,0 355 | 1245,0 356 | 1246,1 357 | 1247,0 358 | 1248,1 359 | 1249,0 360 | 1250,0 361 | 1251,1 362 | 1252,0 363 | 1253,1 364 | 1254,1 365 | 1255,0 366 | 1256,1 367 | 1257,0 368 | 1258,0 369 | 1259,0 370 | 1260,1 371 | 1261,0 372 | 1262,0 373 | 1263,1 374 | 1264,0 375 | 1265,0 376 | 1266,1 377 | 1267,1 378 | 1268,0 379 | 1269,0 380 | 1270,0 381 | 1271,0 382 | 1272,0 383 | 1273,0 384 | 1274,0 385 | 1275,0 386 | 1276,0 387 | 1277,1 388 | 1278,0 389 | 1279,0 390 | 1280,0 391 | 1281,0 392 | 1282,0 393 | 1283,1 394 | 1284,0 395 | 1285,0 396 | 1286,0 397 | 1287,1 398 | 1288,0 399 | 1289,1 400 | 1290,0 401 | 1291,0 402 | 1292,1 403 | 1293,0 404 | 1294,1 405 | 1295,0 406 | 1296,0 407 | 1297,0 408 | 1298,0 409 | 1299,0 410 | 1300,1 411 | 1301,1 412 | 1302,1 413 | 1303,1 414 | 1304,1 415 | 1305,0 416 | 1306,1 417 | 1307,0 418 | 1308,0 419 | 1309,0 420 | -------------------------------------------------------------------------------- /kaggle-titanic/sklearn-random-forest.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | ''' 5 | @filename: sklearn-random-forest.py 6 | @author: yew1eb 7 | @site: http://blog.yew1eb.net 8 | @contact: yew1eb@gmail.com 9 | @time: 2015/12/27 下午 10:18 10 | ''' 11 | from sklearn import cross_validation 12 | from sklearn.cross_validation import train_test_split 13 | from sklearn.tree import DecisionTreeClassifier 14 | import numpy as np 15 | import pandas as pd 16 | import matplotlib.pyplot as plt 17 | from sklearn import metrics 18 | 19 | 20 | def load_data(): 21 | df = pd.read_csv('D:/dataset/titanic/train.csv', header=0) 22 | #特征选择 23 | # 只取出三个自变量 24 | # 将Age(年龄)缺失的数据补全 25 | # 将Pclass变量转变为三个哑(Summy)变量 26 | # 将sex转为0-1变量 27 | subdf = df[['Pclass', 'Sex', 'Age']] 28 | y = df.Survived 29 | age = subdf['Age'].fillna(value=subdf.Age.mean()) 30 | pclass = pd.get_dummies(subdf['Pclass'], prefix='Pclass') 31 | sex = (subdf['Sex']=='male').astype('int') 32 | X = pd.concat([pclass, age, sex], axis=1) 33 | #print(X.head()) 34 | return X, y 35 | 36 | # 分析各特征的重要性 37 | # feature_importance = clf.feature_importances_ 38 | # 对于随机森林如何得到变量的重要性,可以看scikit-learn官方文档 http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#example-ensemble-plot-forest-importances-py 39 | # important_features = X_train.columns.values[0::] 40 | # analyze_feature(feature_importance, important_features) 41 | def analyze_feature(feature_importance, important_features): 42 | feature_importance = 100.0 * (feature_importance / feature_importance.max()) 43 | sorted_idx = np.argsort(feature_importance)[::-1] 44 | pos = np.arange(sorted_idx.shape[0]) + 0.5 45 | plt.title('Feature Importance') 46 | plt.barh(pos, feature_importance[sorted_idx[::-1]], color='r', align='center') 47 | plt.yticks(pos, important_features) 48 | plt.xlabel('Relativ Importance') 49 | plt.draw() 50 | plt.show() 51 | 52 | def sklearn_decisoin_tree(X, y): 53 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=6) 54 | clf = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5) 55 | bst = clf.fit(X_train, y_train) 56 | # 准确率 print("accuracy rate: {:.6f}".format(bst.score(X_test, y_test))) 57 | # 交叉验证 58 | scores = cross_validation.cross_val_score(clf, X, y, cv=10) 59 | print(scores) 60 | 61 | 62 | def main(): 63 | X, y = load_data() 64 | 65 
| 66 | 67 | 68 | if __name__ == '__main__': 69 | main() 70 | 71 | 72 | 73 | 74 | 75 | -------------------------------------------------------------------------------- /kaggle-titanic/xgb.py: -------------------------------------------------------------------------------- 1 | # This script shows you how to make a submission using a few 2 | # useful Python libraries. 3 | # It gets a public leaderboard score of 0.76077. 4 | # Maybe you can tweak it and do better...? 5 | 6 | import pandas as pd 7 | import xgboost as xgb 8 | from sklearn.preprocessing import LabelEncoder 9 | import numpy as np 10 | 11 | # Load the data 12 | train_df = pd.read_csv('./data/train.csv', header=0) 13 | test_df = pd.read_csv('./data/test.csv', header=0) 14 | 15 | # We'll impute missing values using the median for numeric columns and the most 16 | # common value for string columns. 17 | # This is based on some nice code by 'sveitser' at http://stackoverflow.com/a/25562948 18 | from sklearn.base import TransformerMixin 19 | class DataFrameImputer(TransformerMixin): 20 | def fit(self, X, y=None): 21 | self.fill = pd.Series([X[c].value_counts().index[0] 22 | if X[c].dtype == np.dtype('O') else X[c].median() for c in X], 23 | index=X.columns) 24 | return self 25 | def transform(self, X, y=None): 26 | return X.fillna(self.fill) 27 | 28 | feature_columns_to_use = ['Pclass','Sex','Age','Fare','Parch'] 29 | nonnumeric_columns = ['Sex'] 30 | 31 | # Join the features from train and test together before imputing missing values, 32 | # in case their distribution is slightly different 33 | big_X = train_df[feature_columns_to_use].append(test_df[feature_columns_to_use]) 34 | big_X_imputed = DataFrameImputer().fit_transform(big_X) 35 | 36 | # XGBoost doesn't (yet) handle categorical features automatically, so we need to change 37 | # them to columns of integer values. 38 | # See http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing for more 39 | # details and options 40 | le = LabelEncoder() 41 | for feature in nonnumeric_columns: 42 | big_X_imputed[feature] = le.fit_transform(big_X_imputed[feature]) 43 | 44 | # Prepare the inputs for the model 45 | train_X = big_X_imputed[0:train_df.shape[0]].as_matrix() 46 | test_X = big_X_imputed[train_df.shape[0]::].as_matrix() 47 | train_y = train_df['Survived'] 48 | 49 | # You can experiment with many other options here, using the same .fit() and .predict() 50 | # methods; see http://scikit-learn.org 51 | # This example uses the current build of XGBoost, from https://github.com/dmlc/xgboost 52 | gbm = xgb.XGBClassifier(max_depth=6, n_estimators=1000, learning_rate=0.02).fit(train_X, train_y) 53 | predictions = gbm.predict(test_X) 54 | 55 | # Kaggle needs the submission to have a certain format; 56 | # see https://www.kaggle.com/c/titanic-gettingStarted/download/gendermodel.csv 57 | # for an example of what it's supposed to look like. 
58 | submission = pd.DataFrame({ 'PassengerId': test_df['PassengerId'], 59 | 'Survived': predictions }) 60 | submission.to_csv("result_xgb.csv", index=False) -------------------------------------------------------------------------------- /kaggle-titanic/笔记1.md: -------------------------------------------------------------------------------- 1 | ## Kaggle-Titanic 2 | An introductory Kaggle problem; a binary classification task. 3 | ### Problem background 4 | A classic scene in Titanic: the luxury liner goes down, everyone scrambles to escape, but the lifeboats are limited and not everyone can be saved at once. 5 | At that point the first officer calls out: ladies and kids first! This is not an arbitrary evacuation order; certain people had priority, for example nobles, women, and children. 6 | So the problem is: given personal information and survival status for some passengers, train a suitable model from this information and predict the survival status of the others. 7 | ### Data set 8 | 9 | Fields are comma-separated, and each row contains the following fields: 10 | ``` 11 | PassengerID 12 | Survived (survived or not) 13 | Pclass (cabin class) 14 | Name 15 | Sex 16 | Age 17 | SibSp (number of siblings and spouses aboard) 18 | Parch (number of parents and children aboard) 19 | Ticket (ticket number) 20 | Fare 21 | Cabin (cabin location) 22 | Embarked (port of embarkation) 23 | ``` 24 | ### Evaluation 25 | The competition evaluates models by accuracy 26 | $$\mathrm{accuracy}=\frac{\sum_{i=1}^{N} I(\hat{y}_i = y_i)}{N}$$ 27 | where $\hat{y}_i$ is the predicted value and $y_i$ is the true value. 28 | 29 | ### Model selection 30 | Common classification models include SVM, LR, Naive Bayes, CART, and the tree models derived from CART: Random Forest and GBDT. I recently studied GBDT in detail 31 | and found its fitting power close to perfect; with tuned parameters the effect of overfitting can be reduced, and it is said that even Gaussian processes cannot match it, so this time I decided to use GBDT as the main model. 32 | 33 | ### Feature selection 34 | My first reaction was that fields like Name, Ticket, and Cabin are too fragmented -- nearly every passenger has a different value -- and seem useless. 35 | The Cabin feature also has severe missingness, so for now I dropped the Ticket and Cabin features. 36 | Name, on the other hand, looks useless since everyone's name differs, but watching GBDT during training shows this feature is actually used very frequently. 37 | Intuitively, Name gives the model some generalization power: in a crisis a family will first look for each other and escape together, so survival within a family is highly correlated. 38 | So the way I introduce the name feature is not to feed in the name directly, but to use the survival rate of passengers who share the current passenger's surname. 39 | There is also a background detail: during the evacuation women went first, which separates families, so the name feature must also take sex into account; in short, the survival rate for (sex + surname) is used as one feature. 40 | In addition I added per-category survival rates as features, such as the survival rate of each sex and of each class. I once asked others: 41 | after throwing the category ID into the model, is it still necessary to throw in that category's survival percentage? Yes, it is: the former captures "which category it is", the latter captures "how much of that category survived", and the two should not be conflated. 42 | 43 | ### Missing features 44 | Missing values in some feature columns are perfectly normal when training a model. The solutions I know of so far: 45 | drop the row entirely (when you have data to spare) 46 | give all missing values a new label of their own and let the model learn how much weight to give it (this probably only works with a lot of data) 47 | if the feature's missing rate is very high: 48 | drop the column entirely 49 | or fit a model to predict this feature 50 | fill in a default value, which can be the mean or the mode (this is really quite similar to the previous model-fitting approach; imputing with the mean or mode can be seen as a hand-made maximum likelihood estimate). 51 | 52 | ### Random forest practice with scikit-learn 53 | sklearn-random-forest.py 54 | --------------------------------------------------------------------------------
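A minimal pandas sketch of the (sex + surname) survival-rate feature described in 笔记1.md above, assuming Kaggle's train.csv column names; the +1 smoothing and the variable names are illustrative choices rather than the notes' actual code:

```python
# Sketch: survival rate of passengers sharing the same surname and sex,
# plus the per-category survival rates (by Sex, by Pclass) mentioned above.
import pandas as pd

train = pd.read_csv('train.csv')

# Surname is the part of Name before the comma, e.g. "Braund, Mr. Owen Harris".
train['Surname'] = train['Name'].str.split(',').str[0].str.strip()

# Survival rate per (Surname, Sex) group, lightly smoothed toward the global
# rate so that one-member groups do not get a hard 0 or 1.
global_rate = train['Survived'].mean()
grp = train.groupby(['Surname', 'Sex'])['Survived'].agg(['sum', 'count'])
grp['surname_sex_rate'] = (grp['sum'] + global_rate) / (grp['count'] + 1.0)

train = train.merge(grp[['surname_sex_rate']].reset_index(),
                    on=['Surname', 'Sex'], how='left')

# Per-category survival rates as additional features.
train['sex_rate'] = train.groupby('Sex')['Survived'].transform('mean')
train['pclass_rate'] = train.groupby('Pclass')['Survived'].transform('mean')

print(train[['Surname', 'Sex', 'surname_sex_rate', 'sex_rate', 'pclass_rate']].head())
```

For honest cross-validation these rates should be computed on the training folds only (or leave-one-out), otherwise each row's own label leaks into its feature; test passengers then take the rate of their (Surname, Sex) group from the training data, falling back to the global rate when the group is unseen.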