├── .gitignore ├── AV-loan-prediction ├── loan_prediction.md ├── result_sklearn_rf.csv ├── sklearn-rf.py ├── test.csv ├── train.csv └── xgb.py ├── DC-loan-rp ├── 0.717.csv ├── add_data │ ├── 0.717.csv │ └── add_data.py ├── feature-selection │ ├── 0.70_feature_score.csv │ ├── anylze.py │ ├── drop_feature.txt │ ├── drop_list.txt │ ├── feature_score_category.csv │ └── feature_score_numeric.csv ├── sklearn-rf.py ├── small_data │ ├── test_x.csv │ ├── train_x.csv │ └── train_y.csv ├── source.R ├── xgb.py └── xgb_dummy.py ├── Kaggle-bag-of-words ├── BagOfWords_LR.py ├── BagOfWords_RF.py ├── Kaggle-Word2Vec.R ├── KaggleWord2VecUtility.py ├── README.md ├── Word2Vec_AverageVectors.py ├── Word2Vec_BagOfCentroids.py ├── generate_d2v.py ├── generate_w2v.py ├── nbsvm.py ├── out │ └── Bag_of_Words_model_RF.csv └── predict.py ├── Kaggle-digit-recognizer ├── .gitignore ├── Digit Recognizer.md ├── data │ └── readme.txt ├── experiment1-rf-1000.py ├── knn_by_myself.py ├── naive_bayes_by_myself.py ├── nn │ ├── README.md │ ├── gen │ │ ├── nn_benchmark.csv │ │ ├── nn_benchmark1.csv │ │ └── nn_benchmark2.csv │ └── src │ │ ├── DigitRecognizer.py │ │ ├── PyNeural │ │ ├── PyNeural.py │ │ ├── PyNeural.py.bak │ │ └── __init__.py │ │ └── ensemble.py ├── py-knn │ ├── experiment1-custom-knn-brute-force.py │ ├── experiment2-sklearn-knn-kdtree.py │ ├── experiment2-sklearn-knn-kdtree.py.bak │ ├── experiment3-sklean-pca-knn.py │ ├── experiment3-sklean-pca-knn.py.bak │ └── load_data.py ├── svm_by_myself.py ├── svm_pca.py ├── using_sklearn.py └── using_theano.py ├── README.md ├── feature_engineering_example.ipynb ├── kaggle-titanic ├── README.md ├── SOUPTONUTS.md.txt ├── Theano Tutorial.R ├── code.py ├── input │ ├── test.csv │ └── train.csv ├── ipynb-notebook │ ├── Kaggle_Titanic_Example.ipynb │ ├── test.csv │ └── train.csv ├── lr.py ├── randomforest_gridsearchCV.py ├── result_rf.csv ├── result_xgb.csv ├── sklearn-random-forest.py ├── xgb.py └── 笔记1.md └── kaggle_bike_competition_train.csv /.gitignore: -------------------------------------------------------------------------------- 1 | */DataCastle-Solution 2 | ### Python template 3 | # Byte-compiled / optimized / DLL files 4 | __pycache__/ 5 | *.py[cod] 6 | *$py.class 7 | 8 | # C extensions 9 | *.so 10 | 11 | # Distribution / packaging 12 | .Python 13 | env/ 14 | build/ 15 | develop-eggs/ 16 | dist/ 17 | downloads/ 18 | eggs/ 19 | .eggs/ 20 | lib/ 21 | lib64/ 22 | parts/ 23 | sdist/ 24 | var/ 25 | *.egg-info/ 26 | .installed.cfg 27 | *.egg 28 | 29 | # PyInstaller 30 | # Usually these files are written by a python script from a template 31 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 32 | *.manifest 33 | *.spec 34 | 35 | # Installer logs 36 | pip-log.txt 37 | pip-delete-this-directory.txt 38 | 39 | # Unit test / coverage reports 40 | htmlcov/ 41 | .tox/ 42 | .coverage 43 | .coverage.* 44 | .cache 45 | nosetests.xml 46 | coverage.xml 47 | *,cover 48 | 49 | # Translations 50 | *.mo 51 | *.pot 52 | 53 | # Django stuff: 54 | *.log 55 | 56 | # Sphinx documentation 57 | docs/_build/ 58 | 59 | # PyBuilder 60 | target/ 61 | 62 | # PyCharm 63 | .idea 64 | 65 | # Created by .ignore support plugin (hsz.mobi) 66 | -------------------------------------------------------------------------------- /AV-loan-prediction/loan_prediction.md: -------------------------------------------------------------------------------- 1 | ## Problem Statement 2 | About Company 3 | Dream Housing Finance company deals in all home loans. 
They have presence across all urban, 4 | semi urban and rural areas. Customer first apply for home loan after that company validates 5 | the customer eligibility for loan. 6 | 7 | Problem 8 | Company wants to automate the loan eligibility process (real time) based on customer detail provided 9 | while filling online application form. These details are Gender, Marital Status, Education, Number of Dependents, 10 | Income, Loan Amount, Credit History and others. To automate this process, they have given a problem to identify 11 | the customers segments, those are eligible for loan amount so that they can specifically target these customers. 12 | Here they have provided a partial data set. 13 | 14 | ## Data Set 15 | Variable Description 16 | Loan_ID Unique Loan ID 17 | Gender Male/ Female 18 | Married Applicant married (Y/N) 19 | Dependents Number of dependents 20 | Education Applicant Education (Graduate/ Under Graduate) 21 | Self_Employed Self employed (Y/N) 22 | ApplicantIncome Applicant income 23 | CoapplicantIncome Coapplicant income 24 | LoanAmount Loan amount in thousands 25 | Loan_Amount_Term Term of loan in months 26 | Credit_History credit history meets guidelines 27 | Property_Area Urban/ Semi Urban/ Rural 28 | Loan_Status Loan approved (Y/N) -------------------------------------------------------------------------------- /AV-loan-prediction/result_sklearn_rf.csv: -------------------------------------------------------------------------------- 1 | Loan_ID,Loan_Status 2 | LP001015,Y 3 | LP001022,Y 4 | LP001031,Y 5 | LP001035,Y 6 | LP001051,Y 7 | LP001054,Y 8 | LP001055,Y 9 | LP001056,N 10 | LP001059,Y 11 | LP001067,Y 12 | LP001078,Y 13 | LP001082,Y 14 | LP001083,Y 15 | LP001094,N 16 | LP001096,Y 17 | LP001099,Y 18 | LP001105,Y 19 | LP001107,Y 20 | LP001108,Y 21 | LP001115,Y 22 | LP001121,Y 23 | LP001124,Y 24 | LP001128,Y 25 | LP001135,Y 26 | LP001149,Y 27 | LP001153,N 28 | LP001163,Y 29 | LP001169,Y 30 | LP001174,Y 31 | LP001176,Y 32 | LP001177,Y 33 | LP001183,Y 34 | LP001185,Y 35 | LP001187,Y 36 | LP001190,Y 37 | LP001203,N 38 | LP001208,Y 39 | LP001210,Y 40 | LP001211,Y 41 | LP001219,Y 42 | LP001220,Y 43 | LP001221,Y 44 | LP001226,Y 45 | LP001230,Y 46 | LP001231,Y 47 | LP001232,Y 48 | LP001237,Y 49 | LP001242,Y 50 | LP001268,Y 51 | LP001270,Y 52 | LP001284,Y 53 | LP001287,Y 54 | LP001291,Y 55 | LP001298,Y 56 | LP001312,Y 57 | LP001313,N 58 | LP001317,Y 59 | LP001321,Y 60 | LP001323,N 61 | LP001324,Y 62 | LP001332,Y 63 | LP001335,Y 64 | LP001338,Y 65 | LP001347,N 66 | LP001348,Y 67 | LP001351,Y 68 | LP001352,N 69 | LP001358,N 70 | LP001359,Y 71 | LP001361,N 72 | LP001366,Y 73 | LP001368,Y 74 | LP001375,Y 75 | LP001380,Y 76 | LP001386,Y 77 | LP001400,Y 78 | LP001407,Y 79 | LP001413,Y 80 | LP001415,Y 81 | LP001419,Y 82 | LP001420,N 83 | LP001428,Y 84 | LP001445,N 85 | LP001446,Y 86 | LP001450,N 87 | LP001452,Y 88 | LP001455,Y 89 | LP001466,Y 90 | LP001471,Y 91 | LP001472,Y 92 | LP001475,Y 93 | LP001483,Y 94 | LP001486,Y 95 | LP001490,Y 96 | LP001496,N 97 | LP001499,Y 98 | LP001500,Y 99 | LP001501,Y 100 | LP001517,Y 101 | LP001527,Y 102 | LP001534,Y 103 | LP001542,N 104 | LP001547,Y 105 | LP001548,Y 106 | LP001558,Y 107 | LP001561,Y 108 | LP001563,N 109 | LP001567,Y 110 | LP001568,Y 111 | LP001573,Y 112 | LP001584,Y 113 | LP001587,Y 114 | LP001589,Y 115 | LP001591,Y 116 | LP001599,Y 117 | LP001601,Y 118 | LP001607,N 119 | LP001611,N 120 | LP001613,N 121 | LP001622,N 122 | LP001627,Y 123 | LP001650,Y 124 | LP001651,Y 125 | LP001652,N 126 | LP001655,N 127 | LP001660,Y 128 | LP001662,N 
129 | LP001663,Y 130 | LP001667,Y 131 | LP001695,Y 132 | LP001703,Y 133 | LP001718,Y 134 | LP001728,Y 135 | LP001735,Y 136 | LP001737,Y 137 | LP001739,Y 138 | LP001742,Y 139 | LP001757,Y 140 | LP001769,Y 141 | LP001771,Y 142 | LP001785,N 143 | LP001787,Y 144 | LP001789,N 145 | LP001791,Y 146 | LP001794,Y 147 | LP001797,Y 148 | LP001815,Y 149 | LP001817,N 150 | LP001818,Y 151 | LP001822,Y 152 | LP001827,Y 153 | LP001831,Y 154 | LP001842,Y 155 | LP001853,N 156 | LP001855,Y 157 | LP001857,Y 158 | LP001862,Y 159 | LP001867,Y 160 | LP001878,Y 161 | LP001881,Y 162 | LP001886,Y 163 | LP001906,N 164 | LP001909,Y 165 | LP001911,Y 166 | LP001921,Y 167 | LP001923,N 168 | LP001933,N 169 | LP001943,Y 170 | LP001950,N 171 | LP001959,N 172 | LP001961,Y 173 | LP001973,Y 174 | LP001975,Y 175 | LP001979,N 176 | LP001995,N 177 | LP001999,Y 178 | LP002007,Y 179 | LP002009,Y 180 | LP002016,Y 181 | LP002017,Y 182 | LP002018,Y 183 | LP002027,Y 184 | LP002028,Y 185 | LP002042,Y 186 | LP002045,Y 187 | LP002046,Y 188 | LP002047,Y 189 | LP002056,Y 190 | LP002057,Y 191 | LP002059,Y 192 | LP002062,Y 193 | LP002064,Y 194 | LP002069,N 195 | LP002070,N 196 | LP002077,Y 197 | LP002083,Y 198 | LP002090,N 199 | LP002096,Y 200 | LP002099,N 201 | LP002102,Y 202 | LP002105,Y 203 | LP002107,Y 204 | LP002111,Y 205 | LP002117,Y 206 | LP002118,Y 207 | LP002123,Y 208 | LP002125,Y 209 | LP002148,Y 210 | LP002152,Y 211 | LP002165,Y 212 | LP002167,Y 213 | LP002168,N 214 | LP002172,Y 215 | LP002176,Y 216 | LP002183,Y 217 | LP002184,Y 218 | LP002186,Y 219 | LP002192,Y 220 | LP002195,Y 221 | LP002208,Y 222 | LP002212,Y 223 | LP002240,Y 224 | LP002245,Y 225 | LP002253,Y 226 | LP002256,N 227 | LP002257,Y 228 | LP002264,Y 229 | LP002270,Y 230 | LP002279,Y 231 | LP002286,N 232 | LP002294,Y 233 | LP002298,Y 234 | LP002306,Y 235 | LP002310,Y 236 | LP002311,Y 237 | LP002316,N 238 | LP002321,N 239 | LP002325,Y 240 | LP002326,Y 241 | LP002329,N 242 | LP002333,Y 243 | LP002339,N 244 | LP002344,Y 245 | LP002346,N 246 | LP002354,Y 247 | LP002355,N 248 | LP002358,Y 249 | LP002360,Y 250 | LP002375,Y 251 | LP002376,Y 252 | LP002383,N 253 | LP002385,Y 254 | LP002389,Y 255 | LP002394,Y 256 | LP002397,Y 257 | LP002399,N 258 | LP002400,Y 259 | LP002402,Y 260 | LP002412,Y 261 | LP002415,Y 262 | LP002417,Y 263 | LP002420,Y 264 | LP002425,Y 265 | LP002433,Y 266 | LP002440,Y 267 | LP002441,Y 268 | LP002442,N 269 | LP002445,Y 270 | LP002450,N 271 | LP002471,Y 272 | LP002476,Y 273 | LP002482,Y 274 | LP002485,Y 275 | LP002495,N 276 | LP002496,N 277 | LP002523,Y 278 | LP002542,Y 279 | LP002550,Y 280 | LP002551,N 281 | LP002553,Y 282 | LP002554,Y 283 | LP002561,Y 284 | LP002566,Y 285 | LP002568,Y 286 | LP002570,Y 287 | LP002572,Y 288 | LP002581,Y 289 | LP002584,Y 290 | LP002592,Y 291 | LP002593,Y 292 | LP002599,Y 293 | LP002604,Y 294 | LP002605,Y 295 | LP002609,N 296 | LP002610,Y 297 | LP002612,Y 298 | LP002614,Y 299 | LP002630,N 300 | LP002635,Y 301 | LP002639,Y 302 | LP002644,Y 303 | LP002651,N 304 | LP002654,Y 305 | LP002657,Y 306 | LP002711,Y 307 | LP002712,N 308 | LP002721,Y 309 | LP002735,Y 310 | LP002744,Y 311 | LP002745,Y 312 | LP002746,Y 313 | LP002747,N 314 | LP002754,Y 315 | LP002759,Y 316 | LP002760,Y 317 | LP002766,Y 318 | LP002769,Y 319 | LP002774,N 320 | LP002775,Y 321 | LP002781,Y 322 | LP002782,Y 323 | LP002786,N 324 | LP002790,Y 325 | LP002791,Y 326 | LP002793,Y 327 | LP002802,N 328 | LP002803,Y 329 | LP002805,Y 330 | LP002806,Y 331 | LP002816,Y 332 | LP002823,Y 333 | LP002825,Y 334 | LP002826,Y 335 | LP002843,Y 336 | LP002849,Y 337 | LP002850,Y 
338 | LP002853,Y 339 | LP002856,Y 340 | LP002857,Y 341 | LP002858,N 342 | LP002860,Y 343 | LP002867,Y 344 | LP002869,N 345 | LP002870,N 346 | LP002876,Y 347 | LP002878,Y 348 | LP002879,N 349 | LP002885,Y 350 | LP002890,Y 351 | LP002891,Y 352 | LP002899,Y 353 | LP002901,Y 354 | LP002907,Y 355 | LP002920,Y 356 | LP002921,N 357 | LP002932,Y 358 | LP002935,Y 359 | LP002952,Y 360 | LP002954,Y 361 | LP002962,Y 362 | LP002965,Y 363 | LP002969,Y 364 | LP002971,Y 365 | LP002975,Y 366 | LP002980,Y 367 | LP002986,Y 368 | LP002989,Y 369 | -------------------------------------------------------------------------------- /AV-loan-prediction/sklearn-rf.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from sklearn.metrics import roc_auc_score 3 | from sklearn.ensemble import RandomForestClassifier 4 | from sklearn.svm import SVC 5 | from sklearn import preprocessing 6 | 7 | 8 | train = pd.read_csv('train.csv') 9 | test = pd.read_csv('test.csv') 10 | print ("Starting...") 11 | print(('Number of training examples {0} '.format(train.shape[0]))) 12 | print((train.Loan_Status.value_counts())) 13 | print(('Number of test examples {0} '.format(test.shape[0]))) 14 | 15 | 16 | cat_vbl = {'Gender','Married','Dependents','Self_Employed','Property_Area'} 17 | num_vbl = {'LoanAmount','Loan_Amount_Term','Credit_History'} 18 | 19 | for var in num_vbl: 20 | train[var] = train[var].fillna(value = train[var].mean()) 21 | test[var] = test[var].fillna(value = test[var].mean()) 22 | train['Credibility'] = train['ApplicantIncome'] / train['LoanAmount'] 23 | test['Credibility'] = test['ApplicantIncome'] / test['LoanAmount'] 24 | 25 | print ("Starting Label Encode") 26 | for var in cat_vbl: 27 | lb = preprocessing.LabelEncoder() 28 | full_data = pd.concat((train[var],test[var]),axis=0).astype('str') 29 | lb.fit( full_data ) 30 | train[var] = lb.transform(train[var].astype('str')) 31 | test[var] = lb.transform(test[var].astype('str')) 32 | 33 | train = train.fillna(value = -999) 34 | test = test.fillna(value = -999) 35 | print ("Filled Missing Values") 36 | 37 | features = ['Credibility', 38 | 'Gender', 39 | 'Married', 40 | 'Dependents', 41 | 'Self_Employed', 42 | 'Property_Area', 43 | 'ApplicantIncome', 44 | 'CoapplicantIncome', 45 | 'LoanAmount', 46 | 'Loan_Amount_Term', 47 | 'Credit_History' 48 | ] 49 | 50 | x_train = train[features].values 51 | y_train = train['Loan_Status'].values 52 | x_test = test[features].values 53 | 54 | # Random Forest 55 | rf = RandomForestClassifier(n_estimators=1000, n_jobs=-1, oob_score = True, max_features = "auto",random_state=10, min_samples_split=2, min_samples_leaf=2) 56 | rf.fit(x_train, y_train) 57 | print(('Training accuracy:', rf.oob_score_)) 58 | 59 | 60 | print ("Starting to predict on the dataset") 61 | rec= rf.predict(x_test) 62 | 63 | print ("Prediction Completed") 64 | test['Loan_Status'] = rec 65 | test.to_csv('result_sklearn_rf.csv',columns=['Loan_ID','Loan_Status'],index=False) 66 | -------------------------------------------------------------------------------- /AV-loan-prediction/test.csv: -------------------------------------------------------------------------------- 1 | Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area 2 | LP001015,Male,Yes,0,Graduate,No,5720,0,110,360,1,Urban 3 | LP001022,Male,Yes,1,Graduate,No,3076,1500,126,360,1,Urban 4 | LP001031,Male,Yes,2,Graduate,No,5000,1800,208,360,1,Urban 5 | 
LP001035,Male,Yes,2,Graduate,No,2340,2546,100,360,,Urban 6 | LP001051,Male,No,0,Not Graduate,No,3276,0,78,360,1,Urban 7 | LP001054,Male,Yes,0,Not Graduate,Yes,2165,3422,152,360,1,Urban 8 | LP001055,Female,No,1,Not Graduate,No,2226,0,59,360,1,Semiurban 9 | LP001056,Male,Yes,2,Not Graduate,No,3881,0,147,360,0,Rural 10 | LP001059,Male,Yes,2,Graduate,,13633,0,280,240,1,Urban 11 | LP001067,Male,No,0,Not Graduate,No,2400,2400,123,360,1,Semiurban 12 | LP001078,Male,No,0,Not Graduate,No,3091,0,90,360,1,Urban 13 | LP001082,Male,Yes,1,Graduate,,2185,1516,162,360,1,Semiurban 14 | LP001083,Male,No,3+,Graduate,No,4166,0,40,180,,Urban 15 | LP001094,Male,Yes,2,Graduate,,12173,0,166,360,0,Semiurban 16 | LP001096,Female,No,0,Graduate,No,4666,0,124,360,1,Semiurban 17 | LP001099,Male,No,1,Graduate,No,5667,0,131,360,1,Urban 18 | LP001105,Male,Yes,2,Graduate,No,4583,2916,200,360,1,Urban 19 | LP001107,Male,Yes,3+,Graduate,No,3786,333,126,360,1,Semiurban 20 | LP001108,Male,Yes,0,Graduate,No,9226,7916,300,360,1,Urban 21 | LP001115,Male,No,0,Graduate,No,1300,3470,100,180,1,Semiurban 22 | LP001121,Male,Yes,1,Not Graduate,No,1888,1620,48,360,1,Urban 23 | LP001124,Female,No,3+,Not Graduate,No,2083,0,28,180,1,Urban 24 | LP001128,,No,0,Graduate,No,3909,0,101,360,1,Urban 25 | LP001135,Female,No,0,Not Graduate,No,3765,0,125,360,1,Urban 26 | LP001149,Male,Yes,0,Graduate,No,5400,4380,290,360,1,Urban 27 | LP001153,Male,No,0,Graduate,No,0,24000,148,360,0,Rural 28 | LP001163,Male,Yes,2,Graduate,No,4363,1250,140,360,,Urban 29 | LP001169,Male,Yes,0,Graduate,No,7500,3750,275,360,1,Urban 30 | LP001174,Male,Yes,0,Graduate,No,3772,833,57,360,,Semiurban 31 | LP001176,Male,No,0,Graduate,No,2942,2382,125,180,1,Urban 32 | LP001177,Female,No,0,Not Graduate,No,2478,0,75,360,1,Semiurban 33 | LP001183,Male,Yes,2,Graduate,No,6250,820,192,360,1,Urban 34 | LP001185,Male,No,0,Graduate,No,3268,1683,152,360,1,Semiurban 35 | LP001187,Male,Yes,0,Graduate,No,2783,2708,158,360,1,Urban 36 | LP001190,Male,Yes,0,Graduate,No,2740,1541,101,360,1,Urban 37 | LP001203,Male,No,0,Graduate,No,3150,0,176,360,0,Semiurban 38 | LP001208,Male,Yes,2,Graduate,,7350,4029,185,180,1,Urban 39 | LP001210,Male,Yes,0,Graduate,Yes,2267,2792,90,360,1,Urban 40 | LP001211,Male,No,0,Graduate,Yes,5833,0,116,360,1,Urban 41 | LP001219,Male,No,0,Graduate,No,3643,1963,138,360,1,Urban 42 | LP001220,Male,Yes,0,Graduate,No,5629,818,100,360,1,Urban 43 | LP001221,Female,No,0,Graduate,No,3644,0,110,360,1,Urban 44 | LP001226,Male,Yes,0,Not Graduate,No,1750,2024,90,360,1,Semiurban 45 | LP001230,Male,No,0,Graduate,No,6500,2600,200,360,1,Semiurban 46 | LP001231,Female,No,0,Graduate,No,3666,0,84,360,1,Urban 47 | LP001232,Male,Yes,0,Graduate,No,4260,3900,185,,,Urban 48 | LP001237,Male,Yes,,Not Graduate,No,4163,1475,162,360,1,Urban 49 | LP001242,Male,No,0,Not Graduate,No,2356,1902,108,360,1,Semiurban 50 | LP001268,Male,No,0,Graduate,No,6792,3338,187,,1,Urban 51 | LP001270,Male,Yes,3+,Not Graduate,Yes,8000,250,187,360,1,Semiurban 52 | LP001284,Male,Yes,1,Graduate,No,2419,1707,124,360,1,Urban 53 | LP001287,,Yes,3+,Not Graduate,No,3500,833,120,360,1,Semiurban 54 | LP001291,Male,Yes,1,Graduate,No,3500,3077,160,360,1,Semiurban 55 | LP001298,Male,Yes,2,Graduate,No,4116,1000,30,180,1,Urban 56 | LP001312,Male,Yes,0,Not Graduate,Yes,5293,0,92,360,1,Urban 57 | LP001313,Male,No,0,Graduate,No,2750,0,130,360,0,Urban 58 | LP001317,Female,No,0,Not Graduate,No,4402,0,130,360,1,Rural 59 | LP001321,Male,Yes,2,Graduate,No,3613,3539,134,180,1,Semiurban 60 | 
LP001323,Female,Yes,2,Graduate,No,2779,3664,176,360,0,Semiurban 61 | LP001324,Male,Yes,3+,Graduate,No,4720,0,90,180,1,Semiurban 62 | LP001332,Male,Yes,0,Not Graduate,No,2415,1721,110,360,1,Semiurban 63 | LP001335,Male,Yes,0,Graduate,Yes,7016,292,125,360,1,Urban 64 | LP001338,Female,No,2,Graduate,No,4968,0,189,360,1,Semiurban 65 | LP001347,Female,No,0,Graduate,No,2101,1500,108,360,0,Rural 66 | LP001348,Male,Yes,3+,Not Graduate,No,4490,0,125,360,1,Urban 67 | LP001351,Male,Yes,0,Graduate,No,2917,3583,138,360,1,Semiurban 68 | LP001352,Male,Yes,0,Not Graduate,No,4700,0,135,360,0,Semiurban 69 | LP001358,Male,Yes,0,Graduate,No,3445,0,130,360,0,Semiurban 70 | LP001359,Male,Yes,0,Graduate,No,7666,0,187,360,1,Semiurban 71 | LP001361,Male,Yes,0,Graduate,No,2458,5105,188,360,0,Rural 72 | LP001366,Female,No,,Graduate,No,3250,0,95,360,1,Semiurban 73 | LP001368,Male,No,0,Graduate,No,4463,0,65,360,1,Semiurban 74 | LP001375,Male,Yes,1,Graduate,,4083,1775,139,60,1,Urban 75 | LP001380,Male,Yes,0,Graduate,Yes,3900,2094,232,360,1,Rural 76 | LP001386,Male,Yes,0,Not Graduate,No,4750,3583,144,360,1,Semiurban 77 | LP001400,Male,No,0,Graduate,No,3583,3435,155,360,1,Urban 78 | LP001407,Male,Yes,0,Graduate,No,3189,2367,186,360,1,Urban 79 | LP001413,Male,No,0,Graduate,Yes,6356,0,50,360,1,Rural 80 | LP001415,Male,Yes,1,Graduate,No,3413,4053,,360,1,Semiurban 81 | LP001419,Female,Yes,0,Graduate,No,7950,0,185,360,1,Urban 82 | LP001420,Male,Yes,3+,Graduate,No,3829,1103,163,360,0,Urban 83 | LP001428,Male,Yes,3+,Graduate,No,72529,0,360,360,1,Urban 84 | LP001445,Male,Yes,2,Not Graduate,No,4136,0,149,480,0,Rural 85 | LP001446,Male,Yes,0,Graduate,No,8449,0,257,360,1,Rural 86 | LP001450,Male,Yes,0,Graduate,No,4456,0,131,180,0,Semiurban 87 | LP001452,Male,Yes,2,Graduate,No,4635,8000,102,180,1,Rural 88 | LP001455,Male,Yes,0,Graduate,No,3571,1917,135,360,1,Urban 89 | LP001466,Male,No,0,Graduate,No,3066,0,95,360,1,Semiurban 90 | LP001471,Male,No,2,Not Graduate,No,3235,2015,77,360,1,Semiurban 91 | LP001472,Female,No,0,Graduate,,5058,0,200,360,1,Rural 92 | LP001475,Male,Yes,0,Graduate,Yes,3188,2286,130,360,,Rural 93 | LP001483,Male,Yes,3+,Graduate,No,13518,0,390,360,1,Rural 94 | LP001486,Male,Yes,1,Graduate,No,4364,2500,185,360,1,Semiurban 95 | LP001490,Male,Yes,2,Not Graduate,No,4766,1646,100,360,1,Semiurban 96 | LP001496,Male,Yes,1,Graduate,No,4609,2333,123,360,0,Semiurban 97 | LP001499,Female,Yes,3+,Graduate,No,6260,0,110,360,1,Semiurban 98 | LP001500,Male,Yes,1,Graduate,No,3333,4200,256,360,1,Urban 99 | LP001501,Male,Yes,0,Graduate,No,3500,3250,140,360,1,Semiurban 100 | LP001517,Male,Yes,3+,Graduate,No,9719,0,61,360,1,Urban 101 | LP001527,Male,Yes,3+,Graduate,No,6835,0,188,360,,Semiurban 102 | LP001534,Male,No,0,Graduate,No,4452,0,131,360,1,Rural 103 | LP001542,Female,Yes,0,Graduate,No,2262,0,,480,0,Semiurban 104 | LP001547,Male,Yes,1,Graduate,No,3901,0,116,360,1,Urban 105 | LP001548,Male,Yes,2,Not Graduate,No,2687,0,50,180,1,Rural 106 | LP001558,Male,No,0,Graduate,No,2243,2233,107,360,,Semiurban 107 | LP001561,Female,Yes,0,Graduate,No,3417,1287,200,360,1,Semiurban 108 | LP001563,,No,0,Graduate,No,1596,1760,119,360,0,Urban 109 | LP001567,Male,Yes,3+,Graduate,No,4513,0,120,360,1,Rural 110 | LP001568,Male,Yes,0,Graduate,No,4500,0,140,360,1,Semiurban 111 | LP001573,Male,Yes,0,Not Graduate,No,4523,1350,165,360,1,Urban 112 | LP001584,Female,No,0,Graduate,Yes,4742,0,108,360,1,Semiurban 113 | LP001587,Male,Yes,,Graduate,No,4082,0,93,360,1,Semiurban 114 | LP001589,Female,No,0,Graduate,No,3417,0,102,360,1,Urban 115 | 
LP001591,Female,Yes,2,Graduate,No,2922,3396,122,360,1,Semiurban 116 | LP001599,Male,Yes,0,Graduate,No,4167,4754,160,360,1,Rural 117 | LP001601,Male,No,3+,Graduate,No,4243,4123,157,360,,Semiurban 118 | LP001607,Female,No,0,Not Graduate,No,0,1760,180,360,1,Semiurban 119 | LP001611,Male,Yes,1,Graduate,No,1516,2900,80,,0,Rural 120 | LP001613,Female,No,0,Graduate,No,1762,2666,104,360,0,Urban 121 | LP001622,Male,Yes,2,Graduate,No,724,3510,213,360,0,Rural 122 | LP001627,Male,No,0,Graduate,No,3125,0,65,360,1,Urban 123 | LP001650,Male,Yes,0,Graduate,No,2333,3803,146,360,1,Rural 124 | LP001651,Male,Yes,3+,Graduate,No,3350,1560,135,360,1,Urban 125 | LP001652,Male,No,0,Graduate,No,2500,6414,187,360,0,Rural 126 | LP001655,Female,No,0,Graduate,No,12500,0,300,360,0,Urban 127 | LP001660,Male,No,0,Graduate,No,4667,0,120,360,1,Semiurban 128 | LP001662,Male,No,0,Graduate,No,6500,0,71,360,0,Urban 129 | LP001663,Male,Yes,2,Graduate,No,7500,0,225,360,1,Urban 130 | LP001667,Male,No,0,Graduate,No,3073,0,70,180,1,Urban 131 | LP001695,Male,Yes,1,Not Graduate,No,3321,2088,70,,1,Semiurban 132 | LP001703,Male,Yes,0,Graduate,No,3333,1270,124,360,1,Urban 133 | LP001718,Male,No,0,Graduate,No,3391,0,132,360,1,Rural 134 | LP001728,Male,Yes,1,Graduate,Yes,3343,1517,105,360,1,Rural 135 | LP001735,Female,No,1,Graduate,No,3620,0,90,360,1,Urban 136 | LP001737,Male,No,0,Graduate,No,4000,0,83,84,1,Urban 137 | LP001739,Male,Yes,0,Graduate,No,4258,0,125,360,1,Urban 138 | LP001742,Male,Yes,2,Graduate,No,4500,0,147,360,1,Rural 139 | LP001757,Male,Yes,1,Graduate,No,2014,2925,120,360,1,Rural 140 | LP001769,,No,,Graduate,No,3333,1250,110,360,1,Semiurban 141 | LP001771,Female,No,3+,Graduate,No,4083,0,103,360,,Semiurban 142 | LP001785,Male,No,0,Graduate,No,4727,0,150,360,0,Rural 143 | LP001787,Male,Yes,3+,Graduate,No,3089,2999,100,240,1,Rural 144 | LP001789,Male,Yes,3+,Not Graduate,,6794,528,139,360,0,Urban 145 | LP001791,Male,Yes,0,Graduate,Yes,32000,0,550,360,,Semiurban 146 | LP001794,Male,Yes,2,Graduate,Yes,10890,0,260,12,1,Rural 147 | LP001797,Female,No,0,Graduate,No,12941,0,150,300,1,Urban 148 | LP001815,Male,No,0,Not Graduate,No,3276,0,90,360,1,Semiurban 149 | LP001817,Male,No,0,Not Graduate,Yes,8703,0,199,360,0,Rural 150 | LP001818,Male,Yes,1,Graduate,No,4742,717,139,360,1,Semiurban 151 | LP001822,Male,No,0,Graduate,No,5900,0,150,360,1,Urban 152 | LP001827,Male,No,0,Graduate,No,3071,4309,180,360,1,Urban 153 | LP001831,Male,Yes,0,Graduate,No,2783,1456,113,360,1,Urban 154 | LP001842,Male,No,0,Graduate,No,5000,0,148,360,1,Rural 155 | LP001853,Male,Yes,1,Not Graduate,No,2463,2360,117,360,0,Urban 156 | LP001855,Male,Yes,2,Graduate,No,4855,0,72,360,1,Rural 157 | LP001857,Male,No,0,Not Graduate,Yes,1599,2474,125,300,1,Semiurban 158 | LP001862,Male,Yes,2,Graduate,Yes,4246,4246,214,360,1,Urban 159 | LP001867,Male,Yes,0,Graduate,No,4333,2291,133,350,1,Rural 160 | LP001878,Male,No,1,Graduate,No,5823,2529,187,360,1,Semiurban 161 | LP001881,Male,Yes,0,Not Graduate,No,7895,0,143,360,1,Rural 162 | LP001886,Male,No,0,Graduate,No,4150,4256,209,360,1,Rural 163 | LP001906,Male,No,0,Graduate,,2964,0,84,360,0,Semiurban 164 | LP001909,Male,No,0,Graduate,No,5583,0,116,360,1,Urban 165 | LP001911,Female,No,0,Graduate,No,2708,0,65,360,1,Rural 166 | LP001921,Male,No,1,Graduate,No,3180,2370,80,240,,Rural 167 | LP001923,Male,No,0,Not Graduate,No,2268,0,170,360,0,Semiurban 168 | LP001933,Male,No,2,Not Graduate,No,1141,2017,120,360,0,Urban 169 | LP001943,Male,Yes,0,Graduate,No,3042,3167,135,360,1,Urban 170 | 
LP001950,Female,Yes,3+,Graduate,,1750,2935,94,360,0,Semiurban 171 | LP001959,Female,Yes,1,Graduate,No,3564,0,79,360,1,Rural 172 | LP001961,Female,No,0,Graduate,No,3958,0,110,360,1,Rural 173 | LP001973,Male,Yes,2,Not Graduate,No,4483,0,130,360,1,Rural 174 | LP001975,Male,Yes,0,Graduate,No,5225,0,143,360,1,Rural 175 | LP001979,Male,No,0,Graduate,No,3017,2845,159,180,0,Urban 176 | LP001995,Male,Yes,0,Not Graduate,No,2431,1820,110,360,0,Rural 177 | LP001999,Male,Yes,2,Graduate,,4912,4614,160,360,1,Rural 178 | LP002007,Male,Yes,2,Not Graduate,No,2500,3333,131,360,1,Urban 179 | LP002009,Female,No,0,Graduate,No,2918,0,65,360,,Rural 180 | LP002016,Male,Yes,2,Graduate,No,5128,0,143,360,1,Rural 181 | LP002017,Male,Yes,3+,Graduate,No,15312,0,187,360,,Urban 182 | LP002018,Male,Yes,2,Graduate,No,3958,2632,160,360,1,Semiurban 183 | LP002027,Male,Yes,0,Graduate,No,4334,2945,165,360,1,Semiurban 184 | LP002028,Male,Yes,2,Graduate,No,4358,0,110,360,1,Urban 185 | LP002042,Female,Yes,1,Graduate,No,4000,3917,173,360,1,Rural 186 | LP002045,Male,Yes,3+,Graduate,No,10166,750,150,,1,Urban 187 | LP002046,Male,Yes,0,Not Graduate,No,4483,0,135,360,,Semiurban 188 | LP002047,Male,Yes,2,Not Graduate,No,4521,1184,150,360,1,Semiurban 189 | LP002056,Male,Yes,2,Graduate,No,9167,0,235,360,1,Semiurban 190 | LP002057,Male,Yes,0,Not Graduate,No,13083,0,,360,1,Rural 191 | LP002059,Male,Yes,2,Graduate,No,7874,3967,336,360,1,Rural 192 | LP002062,Female,Yes,1,Graduate,No,4333,0,132,84,1,Rural 193 | LP002064,Male,No,0,Graduate,No,4083,0,96,360,1,Urban 194 | LP002069,Male,Yes,2,Not Graduate,,3785,2912,180,360,0,Rural 195 | LP002070,Male,Yes,3+,Not Graduate,No,2654,1998,128,360,0,Rural 196 | LP002077,Male,Yes,1,Graduate,No,10000,2690,412,360,1,Semiurban 197 | LP002083,Male,No,0,Graduate,Yes,5833,0,116,360,1,Urban 198 | LP002090,Male,Yes,1,Graduate,No,4796,0,114,360,0,Semiurban 199 | LP002096,Male,Yes,0,Not Graduate,No,2000,1600,115,360,1,Rural 200 | LP002099,Male,Yes,2,Graduate,No,2540,700,104,360,0,Urban 201 | LP002102,Male,Yes,0,Graduate,Yes,1900,1442,88,360,1,Rural 202 | LP002105,Male,Yes,0,Graduate,Yes,8706,0,108,480,1,Rural 203 | LP002107,Male,Yes,3+,Not Graduate,No,2855,542,90,360,1,Urban 204 | LP002111,Male,Yes,,Graduate,No,3016,1300,100,360,,Urban 205 | LP002117,Female,Yes,0,Graduate,No,3159,2374,108,360,1,Semiurban 206 | LP002118,Female,No,0,Graduate,No,1937,1152,78,360,1,Semiurban 207 | LP002123,Male,Yes,0,Graduate,No,2613,2417,123,360,1,Semiurban 208 | LP002125,Male,Yes,1,Graduate,No,4960,2600,187,360,1,Semiurban 209 | LP002148,Male,Yes,1,Graduate,No,3074,1083,146,360,1,Semiurban 210 | LP002152,Female,No,0,Graduate,No,4213,0,80,360,1,Urban 211 | LP002165,,No,1,Not Graduate,No,2038,4027,100,360,1,Rural 212 | LP002167,Female,No,0,Graduate,No,2362,0,55,360,1,Urban 213 | LP002168,Male,No,0,Graduate,No,5333,2400,200,360,0,Rural 214 | LP002172,Male,Yes,3+,Graduate,Yes,5384,0,150,360,1,Semiurban 215 | LP002176,Male,No,0,Graduate,No,5708,0,150,360,1,Rural 216 | LP002183,Male,Yes,0,Not Graduate,No,3754,3719,118,,1,Rural 217 | LP002184,Male,Yes,0,Not Graduate,No,2914,2130,150,300,1,Urban 218 | LP002186,Male,Yes,0,Not Graduate,No,2747,2458,118,36,1,Semiurban 219 | LP002192,Male,Yes,0,Graduate,No,7830,2183,212,360,1,Rural 220 | LP002195,Male,Yes,1,Graduate,Yes,3507,3148,212,360,1,Rural 221 | LP002208,Male,Yes,1,Graduate,No,3747,2139,125,360,1,Urban 222 | LP002212,Male,Yes,0,Graduate,No,2166,2166,108,360,,Urban 223 | LP002240,Male,Yes,0,Not Graduate,No,3500,2168,149,360,1,Rural 224 | LP002245,Male,Yes,2,Not 
Graduate,No,2896,0,80,480,1,Urban 225 | LP002253,Female,No,1,Graduate,No,5062,0,152,300,1,Rural 226 | LP002256,Female,No,2,Graduate,Yes,5184,0,187,360,0,Semiurban 227 | LP002257,Female,No,0,Graduate,No,2545,0,74,360,1,Urban 228 | LP002264,Male,Yes,0,Graduate,No,2553,1768,102,360,1,Urban 229 | LP002270,Male,Yes,1,Graduate,No,3436,3809,100,360,1,Rural 230 | LP002279,Male,No,0,Graduate,No,2412,2755,130,360,1,Rural 231 | LP002286,Male,Yes,3+,Not Graduate,No,5180,0,125,360,0,Urban 232 | LP002294,Male,No,0,Graduate,No,14911,14507,130,360,1,Semiurban 233 | LP002298,,No,0,Graduate,Yes,2860,2988,138,360,1,Urban 234 | LP002306,Male,Yes,0,Graduate,No,1173,1594,28,180,1,Rural 235 | LP002310,Female,No,1,Graduate,No,7600,0,92,360,1,Semiurban 236 | LP002311,Female,Yes,0,Graduate,No,2157,1788,104,360,1,Urban 237 | LP002316,Male,No,0,Graduate,No,2231,2774,176,360,0,Urban 238 | LP002321,Female,No,0,Graduate,No,2274,5211,117,360,0,Semiurban 239 | LP002325,Male,Yes,2,Not Graduate,No,6166,13983,102,360,1,Rural 240 | LP002326,Male,Yes,2,Not Graduate,No,2513,1110,107,360,1,Semiurban 241 | LP002329,Male,No,0,Graduate,No,4333,0,66,480,1,Urban 242 | LP002333,Male,No,0,Not Graduate,No,3844,0,105,360,1,Urban 243 | LP002339,Male,Yes,0,Graduate,No,3887,1517,105,360,0,Semiurban 244 | LP002344,Male,Yes,0,Graduate,No,3510,828,105,360,1,Semiurban 245 | LP002346,Male,Yes,0,Graduate,,2539,1704,125,360,0,Rural 246 | LP002354,Female,No,0,Not Graduate,No,2107,0,64,360,1,Semiurban 247 | LP002355,,Yes,0,Graduate,No,3186,3145,150,180,0,Semiurban 248 | LP002358,Male,Yes,2,Graduate,Yes,5000,2166,150,360,1,Urban 249 | LP002360,Male,Yes,,Graduate,No,10000,0,,360,1,Urban 250 | LP002375,Male,Yes,0,Not Graduate,Yes,3943,0,64,360,1,Semiurban 251 | LP002376,Male,No,0,Graduate,No,2925,0,40,180,1,Rural 252 | LP002383,Male,Yes,3+,Graduate,No,3242,437,142,480,0,Urban 253 | LP002385,Male,Yes,,Graduate,No,3863,0,70,300,1,Semiurban 254 | LP002389,Female,No,1,Graduate,No,4028,0,131,360,1,Semiurban 255 | LP002394,Male,Yes,2,Graduate,No,4010,1025,120,360,1,Urban 256 | LP002397,Female,Yes,1,Graduate,No,3719,1585,114,360,1,Urban 257 | LP002399,Male,No,0,Graduate,,2858,0,123,360,0,Rural 258 | LP002400,Female,Yes,0,Graduate,No,3833,0,92,360,1,Rural 259 | LP002402,Male,Yes,0,Graduate,No,3333,4288,160,360,1,Urban 260 | LP002412,Male,Yes,0,Graduate,No,3007,3725,151,360,1,Rural 261 | LP002415,Female,No,1,Graduate,,1850,4583,81,360,,Rural 262 | LP002417,Male,Yes,3+,Not Graduate,No,2792,2619,171,360,1,Semiurban 263 | LP002420,Male,Yes,0,Graduate,No,2982,1550,110,360,1,Semiurban 264 | LP002425,Male,No,0,Graduate,No,3417,738,100,360,,Rural 265 | LP002433,Male,Yes,1,Graduate,No,18840,0,234,360,1,Rural 266 | LP002440,Male,Yes,2,Graduate,No,2995,1120,184,360,1,Rural 267 | LP002441,Male,No,,Graduate,No,3579,3308,138,360,,Semiurban 268 | LP002442,Female,Yes,1,Not Graduate,No,3835,1400,112,480,0,Urban 269 | LP002445,Female,No,1,Not Graduate,No,3854,3575,117,360,1,Rural 270 | LP002450,Male,Yes,2,Graduate,No,5833,750,49,360,0,Rural 271 | LP002471,Male,No,0,Graduate,No,3508,0,99,360,1,Rural 272 | LP002476,Female,Yes,3+,Not Graduate,No,1635,2444,99,360,1,Urban 273 | LP002482,Female,No,0,Graduate,Yes,3333,3916,212,360,1,Rural 274 | LP002485,Male,No,1,Graduate,No,24797,0,240,360,1,Semiurban 275 | LP002495,Male,Yes,2,Graduate,No,5667,440,130,360,0,Semiurban 276 | LP002496,Female,No,0,Graduate,No,3500,0,94,360,0,Semiurban 277 | LP002523,Male,Yes,3+,Graduate,No,2773,1497,108,360,1,Semiurban 278 | LP002542,Male,Yes,0,Graduate,,6500,0,144,360,1,Urban 279 | 
LP002550,Female,No,0,Graduate,No,5769,0,110,180,1,Semiurban 280 | LP002551,Male,Yes,3+,Not Graduate,,3634,910,176,360,0,Semiurban 281 | LP002553,,No,0,Graduate,No,29167,0,185,360,1,Semiurban 282 | LP002554,Male,No,0,Graduate,No,2166,2057,122,360,1,Semiurban 283 | LP002561,Male,Yes,0,Graduate,No,5000,0,126,360,1,Rural 284 | LP002566,Female,No,0,Graduate,No,5530,0,135,360,,Urban 285 | LP002568,Male,No,0,Not Graduate,No,9000,0,122,360,1,Rural 286 | LP002570,Female,Yes,2,Graduate,No,10000,11666,460,360,1,Urban 287 | LP002572,Male,Yes,1,Graduate,,8750,0,297,360,1,Urban 288 | LP002581,Male,Yes,0,Not Graduate,No,2157,2730,140,360,,Rural 289 | LP002584,Male,No,0,Graduate,,1972,4347,106,360,1,Rural 290 | LP002592,Male,No,0,Graduate,No,4983,0,141,360,1,Urban 291 | LP002593,Male,Yes,1,Graduate,No,8333,4000,,360,1,Urban 292 | LP002599,Male,Yes,0,Graduate,No,3667,2000,170,360,1,Semiurban 293 | LP002604,Male,Yes,2,Graduate,No,3166,2833,145,360,1,Urban 294 | LP002605,Male,No,0,Not Graduate,No,3271,0,90,360,1,Rural 295 | LP002609,Female,Yes,0,Graduate,No,2241,2000,88,360,0,Urban 296 | LP002610,Male,Yes,1,Not Graduate,,1792,2565,128,360,1,Urban 297 | LP002612,Female,Yes,0,Graduate,No,2666,0,84,480,1,Semiurban 298 | LP002614,,No,0,Graduate,No,6478,0,108,360,1,Semiurban 299 | LP002630,Male,No,0,Not Graduate,,3808,0,83,360,1,Rural 300 | LP002635,Female,Yes,2,Not Graduate,No,3729,0,117,360,1,Semiurban 301 | LP002639,Male,Yes,2,Graduate,No,4120,0,128,360,1,Rural 302 | LP002644,Male,Yes,1,Graduate,Yes,7500,0,75,360,1,Urban 303 | LP002651,Male,Yes,1,Graduate,,6300,0,125,360,0,Urban 304 | LP002654,Female,No,,Graduate,Yes,14987,0,177,360,1,Rural 305 | LP002657,,Yes,1,Not Graduate,Yes,570,2125,68,360,1,Rural 306 | LP002711,Male,Yes,0,Graduate,No,2600,700,96,360,1,Semiurban 307 | LP002712,Male,No,2,Not Graduate,No,2733,1083,180,360,,Semiurban 308 | LP002721,Male,Yes,2,Graduate,Yes,7500,0,183,360,1,Rural 309 | LP002735,Male,Yes,2,Not Graduate,No,3859,0,121,360,1,Rural 310 | LP002744,Male,Yes,1,Graduate,No,6825,0,162,360,1,Rural 311 | LP002745,Male,Yes,0,Graduate,No,3708,4700,132,360,1,Semiurban 312 | LP002746,Male,No,0,Graduate,No,5314,0,147,360,1,Urban 313 | LP002747,Female,No,3+,Graduate,No,2366,5272,153,360,0,Rural 314 | LP002754,Male,No,,Graduate,No,2066,2108,104,84,1,Urban 315 | LP002759,Male,Yes,2,Graduate,No,5000,0,149,360,1,Rural 316 | LP002760,Female,No,0,Graduate,No,3767,0,134,300,1,Urban 317 | LP002766,Female,Yes,0,Graduate,No,7859,879,165,180,1,Semiurban 318 | LP002769,Female,Yes,0,Graduate,No,4283,0,120,360,1,Rural 319 | LP002774,Male,Yes,0,Not Graduate,No,1700,2900,67,360,0,Urban 320 | LP002775,,No,0,Not Graduate,No,4768,0,125,360,1,Rural 321 | LP002781,Male,No,0,Graduate,No,3083,2738,120,360,1,Urban 322 | LP002782,Male,Yes,1,Graduate,No,2667,1542,148,360,1,Rural 323 | LP002786,Female,Yes,0,Not Graduate,No,1647,1762,181,360,1,Urban 324 | LP002790,Male,Yes,3+,Graduate,No,3400,0,80,120,1,Urban 325 | LP002791,Male,No,1,Graduate,,16000,5000,40,360,1,Semiurban 326 | LP002793,Male,Yes,0,Graduate,No,5333,0,90,360,1,Rural 327 | LP002802,Male,No,0,Graduate,No,2875,2416,95,6,0,Semiurban 328 | LP002803,Male,Yes,1,Not Graduate,,2600,618,122,360,1,Semiurban 329 | LP002805,Male,Yes,2,Graduate,No,5041,700,150,360,1,Urban 330 | LP002806,Male,Yes,3+,Graduate,Yes,6958,1411,150,360,1,Rural 331 | LP002816,Male,Yes,1,Graduate,No,3500,1658,104,360,,Semiurban 332 | LP002823,Male,Yes,0,Graduate,No,5509,0,143,360,1,Rural 333 | LP002825,Male,Yes,3+,Graduate,No,9699,0,300,360,1,Urban 334 | LP002826,Female,Yes,1,Not 
Graduate,No,3621,2717,171,360,1,Urban 335 | LP002843,Female,Yes,0,Graduate,No,4709,0,113,360,1,Semiurban 336 | LP002849,Male,Yes,0,Graduate,No,1516,1951,35,360,1,Semiurban 337 | LP002850,Male,No,2,Graduate,No,2400,0,46,360,1,Urban 338 | LP002853,Female,No,0,Not Graduate,No,3015,2000,145,360,,Urban 339 | LP002856,Male,Yes,0,Graduate,No,2292,1558,119,360,1,Urban 340 | LP002857,Male,Yes,1,Graduate,Yes,2360,3355,87,240,1,Rural 341 | LP002858,Female,No,0,Graduate,No,4333,2333,162,360,0,Rural 342 | LP002860,Male,Yes,0,Graduate,Yes,2623,4831,122,180,1,Semiurban 343 | LP002867,Male,No,0,Graduate,Yes,3972,4275,187,360,1,Rural 344 | LP002869,Male,Yes,3+,Not Graduate,No,3522,0,81,180,1,Rural 345 | LP002870,Male,Yes,1,Graduate,No,4700,0,80,360,1,Urban 346 | LP002876,Male,No,0,Graduate,No,6858,0,176,360,1,Rural 347 | LP002878,Male,Yes,3+,Graduate,No,8334,0,260,360,1,Urban 348 | LP002879,Male,Yes,0,Graduate,No,3391,1966,133,360,0,Rural 349 | LP002885,Male,No,0,Not Graduate,No,2868,0,70,360,1,Urban 350 | LP002890,Male,Yes,2,Not Graduate,No,3418,1380,135,360,1,Urban 351 | LP002891,Male,Yes,0,Graduate,Yes,2500,296,137,300,1,Rural 352 | LP002899,Male,Yes,2,Graduate,No,8667,0,254,360,1,Rural 353 | LP002901,Male,No,0,Graduate,No,2283,15000,106,360,,Rural 354 | LP002907,Male,Yes,0,Graduate,No,5817,910,109,360,1,Urban 355 | LP002920,Male,Yes,0,Graduate,No,5119,3769,120,360,1,Rural 356 | LP002921,Male,Yes,3+,Not Graduate,No,5316,187,158,180,0,Semiurban 357 | LP002932,Male,Yes,3+,Graduate,No,7603,1213,197,360,1,Urban 358 | LP002935,Male,Yes,1,Graduate,No,3791,1936,85,360,1,Urban 359 | LP002952,Male,No,0,Graduate,No,2500,0,60,360,1,Urban 360 | LP002954,Male,Yes,2,Not Graduate,No,3132,0,76,360,,Rural 361 | LP002962,Male,No,0,Graduate,No,4000,2667,152,360,1,Semiurban 362 | LP002965,Female,Yes,0,Graduate,No,8550,4255,96,360,,Urban 363 | LP002969,Male,Yes,1,Graduate,No,2269,2167,99,360,1,Semiurban 364 | LP002971,Male,Yes,3+,Not Graduate,Yes,4009,1777,113,360,1,Urban 365 | LP002975,Male,Yes,0,Graduate,No,4158,709,115,360,1,Urban 366 | LP002980,Male,No,0,Graduate,No,3250,1993,126,360,,Semiurban 367 | LP002986,Male,Yes,0,Graduate,No,5000,2393,158,360,1,Rural 368 | LP002989,Male,No,0,Graduate,Yes,9200,0,98,180,1,Rural 369 | -------------------------------------------------------------------------------- /AV-loan-prediction/xgb.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | import pandas as pd 5 | import numpy as np 6 | import xgboost as xgb 7 | import time 8 | import math 9 | from sklearn.cross_validation import train_test_split 10 | from sklearn import preprocessing 11 | 12 | def load_data(): 13 | train = pd.read_csv('train.csv') 14 | test = pd.read_csv('test.csv') 15 | train_y = (train.Loan_Status == 'Y').astype('int') 16 | train_x = train.drop(['Loan_ID','Loan_Status'], axis=1) 17 | test_uid = test.Loan_ID 18 | test_x = test.drop(['Loan_ID'], axis=1) 19 | 20 | cat_var = ['Gender','Married','Dependents','Education', 'Self_Employed', 'Property_Area'] 21 | num_var = ['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term','Credit_History'] 22 | for var in num_var: 23 | train_x[var] = train_x[var].fillna(value = train_x[var].mean()) 24 | test_x[var] = test_x[var].fillna(value = test_x[var].mean()) 25 | train_x['Credibility'] = train_x['ApplicantIncome'] / train_x['LoanAmount'] 26 | test_x['Credibility'] = test_x['ApplicantIncome'] / test_x['LoanAmount'] 27 | train_x = train_x.fillna(value = -999) 28 | test_x = 
test_x.fillna(value = -999) 29 | 30 | for var in cat_var: 31 | lb = preprocessing.LabelEncoder() 32 | full_data = pd.concat((train_x[var],test_x[var]),axis=0).astype('str') 33 | lb.fit( full_data ) 34 | train_x[var] = lb.transform(train_x[var].astype('str')) 35 | test_x[var] = lb.transform(test_x[var].astype('str')) 36 | 37 | return train_x, train_y, test_x, test_uid 38 | 39 | 40 | def using_xgb(train_x, train_y, test_x, test_uid): 41 | scale_val = (train_y.sum() / train_y.shape[0]) 42 | X_train, X_val, y_train, y_val = train_test_split(train_x, train_y, train_size=0.75, random_state=0) 43 | xgb_train = xgb.DMatrix(X_train, label=y_train) 44 | xgb_val = xgb.DMatrix(X_val, label=y_val) 45 | xgb_test = xgb.DMatrix(test_x) 46 | 47 | # set xgboost classifier parameters 48 | params = { 49 | 'booster': 'gbtree', 50 | 'objective': 'binary:logistic', 51 | 'eval_metric': 'auc', 52 | 'early_stopping_rounds': 200, 53 | 'gamma':0, 54 | 'lambda': 1000, 55 | 'min_child_weight': 5, 56 | 'scale_pos_weight': scale_val, 57 | 'subsample': 0.7, 58 | 'max_depth':6, 59 | 'eta': 0.01, 60 | #'colsample_bytree': 0.7, 61 | 'nthread': 2 62 | } 63 | watchlist = [(xgb_val, 'val'), (xgb_train, 'train')] 64 | num_round = 10000 65 | bst = xgb.train(params, xgb_train, num_boost_round=num_round, evals=watchlist, early_stopping_rounds=200) 66 | scores = bst.predict(xgb_test, ntree_limit=bst.best_ntree_limit) 67 | pred = np.where(scores > 0.5, 'Y','N') 68 | 69 | 70 | print((pd.value_counts(pred))) 71 | 72 | 73 | result = pd.DataFrame({"Loan_ID":test_uid, "Loan_Status":pred}, columns=['Loan_ID','Loan_Status']) 74 | result.to_csv('result/xgb_'+str(time.time())[-4:]+'.csv', index=False) 75 | return 0 76 | 77 | def main(): 78 | train_x, train_y, test_x, test_uid = load_data() 79 | print("load_data() end!") 80 | using_xgb(train_x, train_y, test_x, test_uid) 81 | 82 | 83 | if __name__ == '__main__': 84 | main() -------------------------------------------------------------------------------- /DC-loan-rp/add_data/add_data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | ''' 5 | @filename: add_data.py 6 | @author: yew1eb 7 | @site: http://blog.yew1eb.net 8 | @contact: yew1eb@gmail.com 9 | @time: 2016/01/01 12:00 PM 10 | ''' 11 | 12 | import pandas as pd 13 | import numpy as np 14 | 15 | path = 'd:/dataset/rp/' 16 | test_x_csv = path + 'test_x.csv' 17 | test_y_csv = './0.717.csv' 18 | dtest_x = pd.read_csv(test_x_csv) 19 | dtest_y = pd.read_csv(test_y_csv) 20 | 21 | test_xy = pd.merge(dtest_x, dtest_y, on='uid') 22 | add_low = test_xy[test_xy.score < 0.1] 23 | add_high = test_xy[test_xy.score > 0.97] 24 | add_test_xy = pd.concat([add_low,add_high], axis=0) 25 | 26 | add_test_xy = add_test_xy.drop_duplicates(cols='uid') 27 | 28 | print(add_test_xy) 29 | add_y = add_test_xy[['uid','score']].copy() 30 | add_y['score'] = np.where(add_y['score']<0.5, 0, 1) 31 | add_y.columns = ['uid', 'y'] 32 | add_X = add_test_xy.drop(['score'], axis=1) 33 | 34 | add_y.to_csv(path+'add_y.csv', index=False) 35 | add_X.to_csv(path+'add_X.csv', index=False) -------------------------------------------------------------------------------- /DC-loan-rp/feature-selection/0.70_feature_score.csv: -------------------------------------------------------------------------------- 1 | feature,fscore 2 | x894,1 3 | x799,1 4 | x817,1 5 | x261,1 6 | x723,1 7 | x290,1 8 | x929,1 9 | x440,1 10 | x262,1 11 | x1052,1 12 | x291,1 13 | x495,1 14 | x900,1 15 | x923,1 16 | x500,2 17 | x621,2 18 | x872,2 19 | x250,2 20 | x1039,2
21 | x279,2 22 | x275,2 23 | x278,3 24 | x886,3 25 | x740,3 26 | x911,3 27 | x812,3 28 | x505,3 29 | x450,3 30 | x598,4 31 | x507,4 32 | x453,4 33 | x540,4 34 | x544,4 35 | x1042,4 36 | x808,4 37 | x536,4 38 | x965,4 39 | x1044,4 40 | x931,5 41 | x1045,5 42 | x425,5 43 | x445,5 44 | x927,5 45 | x438,5 46 | x501,5 47 | x934,5 48 | x249,5 49 | x834,5 50 | x320,5 51 | x883,5 52 | x1034,5 53 | x321,5 54 | x724,6 55 | x938,6 56 | x498,6 57 | x776,6 58 | x1095,6 59 | x810,6 60 | x1031,6 61 | x668,6 62 | x702,6 63 | x448,7 64 | x922,7 65 | x437,7 66 | x635,7 67 | x658,7 68 | x701,7 69 | x936,7 70 | x912,7 71 | x662,7 72 | x691,8 73 | x1037,8 74 | x439,8 75 | x1079,8 76 | x312,8 77 | x854,8 78 | x893,9 79 | x537,9 80 | x443,9 81 | x901,9 82 | x508,9 83 | x318,9 84 | x545,9 85 | x916,9 86 | x871,10 87 | x302,10 88 | x755,10 89 | x1050,10 90 | x1014,10 91 | x1032,10 92 | x1033,10 93 | x907,10 94 | x1098,10 95 | x447,10 96 | x310,11 97 | x596,11 98 | x442,11 99 | x671,11 100 | x903,11 101 | x891,12 102 | x1093,12 103 | x906,12 104 | x715,12 105 | x754,12 106 | x811,12 107 | x917,12 108 | x712,13 109 | x1035,13 110 | x492,13 111 | x719,13 112 | x426,14 113 | x567,14 114 | x742,14 115 | x921,14 116 | x884,14 117 | x801,14 118 | x317,14 119 | x316,14 120 | x625,14 121 | x452,15 122 | x896,15 123 | x597,15 124 | x714,15 125 | x939,15 126 | x618,15 127 | x1081,15 128 | x281,15 129 | x1027,15 130 | x898,16 131 | x758,16 132 | x423,16 133 | x670,16 134 | x457,16 135 | x599,16 136 | x970,17 137 | x451,17 138 | x975,17 139 | x454,18 140 | x324,18 141 | x309,18 142 | x570,18 143 | x421,19 144 | x303,19 145 | x449,19 146 | x300,19 147 | x280,19 148 | x669,19 149 | x881,20 150 | x1047,20 151 | x264,20 152 | x694,20 153 | x569,21 154 | x594,21 155 | x664,21 156 | x832,21 157 | x1069,21 158 | x919,21 159 | x759,21 160 | x456,22 161 | x1013,22 162 | x739,22 163 | x1040,22 164 | x816,23 165 | x904,23 166 | x600,23 167 | x519,23 168 | x784,23 169 | x1043,23 170 | x319,23 171 | x840,23 172 | x882,24 173 | x1097,24 174 | x1002,24 175 | x311,24 176 | x753,24 177 | x885,24 178 | x989,25 179 | x622,25 180 | x313,25 181 | x888,25 182 | x639,25 183 | x1124,26 184 | x335,26 185 | x969,26 186 | x1127,26 187 | x839,26 188 | x925,26 189 | x926,27 190 | x1030,27 191 | x718,27 192 | x932,27 193 | x756,28 194 | x1038,28 195 | x880,28 196 | x441,28 197 | x632,29 198 | x841,29 199 | x848,29 200 | x243,29 201 | x930,29 202 | x315,29 203 | x693,30 204 | x539,30 205 | x1070,30 206 | x571,31 207 | x973,31 208 | x502,31 209 | x331,31 210 | x706,31 211 | x948,32 212 | x1091,32 213 | x301,32 214 | x244,32 215 | x260,33 216 | x608,33 217 | x703,34 218 | x659,35 219 | x1063,35 220 | x974,35 221 | x328,36 222 | x1080,36 223 | x415,36 224 | x845,36 225 | x1061,36 226 | x887,36 227 | x802,37 228 | x414,37 229 | x1036,37 230 | x745,38 231 | x572,38 232 | x972,39 233 | x242,39 234 | x308,39 235 | x646,39 236 | x708,39 237 | x1054,40 238 | x1086,40 239 | x793,41 240 | x961,41 241 | x1007,41 242 | x779,41 243 | x649,41 244 | x910,41 245 | x612,42 246 | x561,42 247 | x1065,43 248 | x314,43 249 | x523,43 250 | x1100,44 251 | x924,44 252 | x258,44 253 | x326,44 254 | x267,45 255 | x375,45 256 | x783,45 257 | x1092,45 258 | x680,45 259 | x411,45 260 | x334,46 261 | x1088,46 262 | x627,46 263 | x1119,46 264 | x1125,47 265 | x416,47 266 | x870,47 267 | x1075,47 268 | x1085,47 269 | x1099,47 270 | x1096,48 271 | x1129,48 272 | x673,48 273 | x1006,48 274 | x1133,49 275 | x589,49 276 | x908,49 277 | x419,49 278 | x547,49 279 | x628,51 280 | x1076,51 281 | 
x928,51 282 | x892,51 283 | x1017,52 284 | x1126,52 285 | x1012,52 286 | x1000,52 287 | x465,53 288 | x1123,53 289 | x624,53 290 | x532,53 291 | x1121,54 292 | x807,54 293 | x466,54 294 | x327,55 295 | x863,55 296 | x1058,55 297 | x971,57 298 | x531,57 299 | x915,58 300 | x720,59 301 | x1053,59 302 | x672,59 303 | x390,59 304 | x835,60 305 | x1136,60 306 | x467,60 307 | x577,61 308 | x667,61 309 | x751,61 310 | x1122,61 311 | x675,61 312 | x446,62 313 | x967,62 314 | x819,62 315 | x1071,63 316 | x844,63 317 | x265,63 318 | x1128,64 319 | x444,64 320 | x1089,65 321 | x1066,65 322 | x643,65 323 | x1062,66 324 | x1112,66 325 | x631,67 326 | x322,67 327 | x595,67 328 | x623,67 329 | x521,68 330 | x794,68 331 | x376,69 332 | x263,69 333 | x709,69 334 | x679,69 335 | x402,69 336 | x1130,69 337 | x828,69 338 | x526,69 339 | x1094,70 340 | x251,70 341 | x937,70 342 | x889,71 343 | x378,71 344 | x717,71 345 | x1067,71 346 | x580,73 347 | x517,73 348 | x1082,73 349 | x1104,73 350 | x1106,74 351 | x420,74 352 | x1008,74 353 | x1135,74 354 | x586,74 355 | x274,74 356 | x761,76 357 | x684,77 358 | x535,77 359 | x1056,77 360 | x1120,77 361 | x991,77 362 | x869,78 363 | x682,78 364 | x874,78 365 | x998,79 366 | x493,79 367 | x850,79 368 | x862,80 369 | x253,81 370 | x678,81 371 | x968,82 372 | x579,82 373 | x259,83 374 | x780,83 375 | x325,83 376 | x966,83 377 | x757,83 378 | x789,83 379 | x954,83 380 | x1105,83 381 | x1118,83 382 | x1055,84 383 | x381,84 384 | x307,84 385 | x1101,85 386 | x875,85 387 | x1116,85 388 | x573,85 389 | x273,86 390 | x1059,87 391 | x277,87 392 | x955,87 393 | x821,88 394 | x391,88 395 | x707,88 396 | x380,89 397 | x1131,89 398 | x527,89 399 | x837,89 400 | x827,90 401 | x1117,91 402 | x782,91 403 | x857,91 404 | x1046,91 405 | x520,93 406 | x697,93 407 | x1025,93 408 | x1049,93 409 | x1114,93 410 | x1111,94 411 | x605,94 412 | x935,94 413 | x805,95 414 | x962,95 415 | x750,96 416 | x1107,97 417 | x736,97 418 | x770,98 419 | x602,98 420 | x332,100 421 | x574,100 422 | x1115,101 423 | x283,101 424 | x1057,101 425 | x69,102 426 | x818,103 427 | x1102,104 428 | x1064,104 429 | x803,104 430 | x1041,104 431 | x503,104 432 | x351,105 433 | x345,105 434 | x914,105 435 | x705,105 436 | x1087,106 437 | x933,106 438 | x384,107 439 | x610,107 440 | x1028,109 441 | x1009,109 442 | x777,109 443 | x269,110 444 | x1103,111 445 | x256,111 446 | x417,111 447 | x1109,111 448 | x1068,111 449 | x1016,112 450 | x1090,112 451 | x282,113 452 | x593,113 453 | x370,114 454 | x905,114 455 | x980,114 456 | x920,114 457 | x510,115 458 | x847,115 459 | x245,115 460 | x1138,115 461 | x652,117 462 | x355,118 463 | x1015,118 464 | x1073,119 465 | x990,119 466 | x866,119 467 | x1113,119 468 | x534,119 469 | x1060,119 470 | x765,119 471 | x800,120 472 | x1137,120 473 | x630,120 474 | x1110,120 475 | x1048,120 476 | x851,120 477 | x806,120 478 | x1074,121 479 | x32,121 480 | x285,121 481 | x732,122 482 | x1083,122 483 | x747,122 484 | x677,122 485 | x575,123 486 | x626,123 487 | x522,125 488 | x890,125 489 | x386,126 490 | x343,126 491 | x655,127 492 | x330,128 493 | x984,128 494 | x349,128 495 | x1021,129 496 | x377,129 497 | x340,129 498 | x674,130 499 | x873,130 500 | x611,130 501 | x918,130 502 | x413,130 503 | x480,132 504 | x1072,132 505 | x909,133 506 | x565,134 507 | x964,135 508 | x749,135 509 | x458,136 510 | x704,136 511 | x795,136 512 | x266,137 513 | x737,137 514 | x676,137 515 | x976,138 516 | x344,138 517 | x1108,139 518 | x353,139 519 | x564,139 520 | x470,139 521 | x369,139 522 | x790,139 523 | 
x963,139 524 | x371,139 525 | x838,140 526 | x410,140 527 | x254,140 528 | x464,140 529 | x436,141 530 | x650,142 531 | x1020,144 532 | x1023,144 533 | x979,144 534 | x987,144 535 | x342,145 536 | x902,145 537 | x461,146 538 | x424,146 539 | x997,147 540 | x484,148 541 | x609,149 542 | x746,150 543 | x585,150 544 | x394,150 545 | x513,151 546 | x372,153 547 | x843,153 548 | x700,153 549 | x329,153 550 | x983,153 551 | x525,154 552 | x338,156 553 | x418,156 554 | x985,157 555 | x422,157 556 | x174,157 557 | x203,157 558 | x431,158 559 | x339,158 560 | x509,158 561 | x945,159 562 | x657,160 563 | x497,160 564 | x1003,160 565 | x752,161 566 | x647,162 567 | x1084,162 568 | x982,163 569 | x993,165 570 | x360,166 571 | x809,166 572 | x614,166 573 | x1078,166 574 | x389,167 575 | x512,167 576 | x170,168 577 | x562,168 578 | x387,169 579 | x388,169 580 | x396,170 581 | x77,170 582 | x382,172 583 | x576,172 584 | x581,172 585 | x1029,173 586 | x434,174 587 | x796,174 588 | x341,174 589 | x347,175 590 | x374,177 591 | x429,177 592 | x45,179 593 | x306,179 594 | x1077,181 595 | x760,182 596 | x100,182 597 | x529,183 598 | x294,184 599 | x986,185 600 | x5,185 601 | x506,186 602 | x460,186 603 | x476,186 604 | x957,187 605 | x1010,188 606 | x977,188 607 | x681,188 608 | x494,189 609 | x864,192 610 | x960,192 611 | x1005,193 612 | x762,193 613 | x1022,194 614 | x590,194 615 | x528,196 616 | x97,196 617 | x304,196 618 | x408,196 619 | x1011,197 620 | x514,197 621 | x996,197 622 | x485,198 623 | x385,198 624 | x459,198 625 | x958,200 626 | x583,200 627 | x196,200 628 | x36,200 629 | x713,204 630 | x393,204 631 | x689,204 632 | x481,204 633 | x361,204 634 | x122,204 635 | x73,206 636 | x80,207 637 | x729,207 638 | x336,207 639 | x75,207 640 | x556,207 641 | x478,208 642 | x1018,208 643 | x147,209 644 | x988,209 645 | x295,209 646 | x430,210 647 | x64,210 648 | x365,211 649 | x499,211 650 | x804,212 651 | x1024,212 652 | x195,212 653 | x781,212 654 | x401,212 655 | x354,212 656 | x70,213 657 | x103,213 658 | x733,213 659 | x748,214 660 | x220,214 661 | x856,215 662 | x764,215 663 | x824,216 664 | x490,216 665 | x582,216 666 | x198,216 667 | x12,217 668 | x17,218 669 | x395,219 670 | x188,219 671 | x683,220 672 | x155,220 673 | x186,220 674 | x533,220 675 | x197,221 676 | x33,222 677 | x999,224 678 | x563,225 679 | x956,226 680 | x109,227 681 | x651,227 682 | x483,227 683 | x39,228 684 | x978,229 685 | x1019,229 686 | x255,230 687 | x530,230 688 | x288,231 689 | x475,231 690 | x836,232 691 | x118,232 692 | x202,234 693 | x104,234 694 | x995,234 695 | x146,236 696 | x110,236 697 | x992,236 698 | x1004,237 699 | x403,239 700 | x205,239 701 | x221,239 702 | x766,239 703 | x169,239 704 | x268,240 705 | x116,240 706 | x548,240 707 | x348,241 708 | x176,243 709 | x48,243 710 | x553,243 711 | x350,245 712 | x271,245 713 | x648,246 714 | x356,246 715 | x287,247 716 | x3,247 717 | x379,248 718 | x878,248 719 | x398,248 720 | x412,249 721 | x323,249 722 | x106,251 723 | x469,251 724 | x373,252 725 | x81,252 726 | x731,253 727 | x72,253 728 | x108,253 729 | x568,254 730 | x686,254 731 | x210,255 732 | x225,255 733 | x482,256 734 | x362,259 735 | x699,259 736 | x63,259 737 | x207,259 738 | x284,260 739 | x383,260 740 | x653,260 741 | x23,260 742 | x222,261 743 | x299,262 744 | x213,263 745 | x474,263 746 | x578,264 747 | x145,265 748 | x233,265 749 | x772,265 750 | x86,265 751 | x201,265 752 | x230,265 753 | x190,266 754 | x124,266 755 | x946,266 756 | x557,266 757 | x232,267 758 | x194,267 759 | x136,267 760 | 
x405,267 761 | x117,268 762 | x555,268 763 | x868,268 764 | x738,268 765 | x107,269 766 | x846,269 767 | x206,269 768 | x409,269 769 | x257,269 770 | x656,269 771 | x15,269 772 | x167,270 773 | x551,270 774 | x359,270 775 | x949,271 776 | x119,271 777 | x981,272 778 | x550,274 779 | x953,274 780 | x1026,274 781 | x473,274 782 | x358,274 783 | x191,276 784 | x16,278 785 | x994,278 786 | x392,278 787 | x89,278 788 | x179,278 789 | x865,278 790 | x524,279 791 | x173,279 792 | x37,279 793 | x150,280 794 | x113,280 795 | x472,280 796 | x14,280 797 | x4,280 798 | x182,281 799 | x730,281 800 | x337,281 801 | x552,281 802 | x292,282 803 | x479,282 804 | x60,283 805 | x241,284 806 | x88,285 807 | x226,285 808 | x157,286 809 | x125,286 810 | x634,286 811 | x792,286 812 | x346,287 813 | x192,287 814 | x47,288 815 | x721,288 816 | x168,288 817 | x178,289 818 | x187,290 819 | x71,292 820 | x193,292 821 | x49,292 822 | x216,293 823 | x228,293 824 | x858,294 825 | x74,294 826 | x121,294 827 | x154,295 828 | x111,295 829 | x214,296 830 | x546,296 831 | x151,296 832 | x357,296 833 | x486,297 834 | x85,297 835 | x399,298 836 | x305,299 837 | x687,299 838 | x826,300 839 | x477,301 840 | x352,301 841 | x947,301 842 | x44,301 843 | x698,301 844 | x1001,303 845 | x52,303 846 | x397,303 847 | x46,304 848 | x204,304 849 | x400,304 850 | x227,304 851 | x849,305 852 | x61,305 853 | x132,305 854 | x404,306 855 | x181,307 856 | x130,307 857 | x156,307 858 | x38,308 859 | x11,308 860 | x91,309 861 | x959,310 862 | x217,311 863 | x129,311 864 | x200,311 865 | x120,313 866 | x471,313 867 | x9,313 868 | x10,315 869 | x19,316 870 | x829,318 871 | x144,319 872 | x950,321 873 | x20,321 874 | x62,322 875 | x166,323 876 | x18,323 877 | x246,324 878 | x41,324 879 | x185,325 880 | x139,326 881 | x43,326 882 | x128,326 883 | x160,326 884 | x84,326 885 | x24,327 886 | x942,328 887 | x208,329 888 | x867,331 889 | x722,331 890 | x171,331 891 | x22,332 892 | x102,332 893 | x21,332 894 | x42,332 895 | x184,333 896 | x613,333 897 | x105,335 898 | x601,336 899 | x78,336 900 | x407,336 901 | x51,337 902 | x8,337 903 | x7,338 904 | x149,338 905 | x112,338 906 | x690,339 907 | x560,340 908 | x823,340 909 | x165,341 910 | x126,341 911 | x629,341 912 | x234,344 913 | x172,344 914 | x771,346 915 | x87,346 916 | x685,346 917 | x83,346 918 | x645,347 919 | x163,348 920 | x462,349 921 | x435,349 922 | x511,350 923 | x778,350 924 | x427,351 925 | x57,352 926 | x159,352 927 | x56,353 928 | x763,353 929 | x272,355 930 | x183,357 931 | x189,358 932 | x238,359 933 | x842,360 934 | x240,361 935 | x161,361 936 | x952,361 937 | x237,361 938 | x223,362 939 | x35,362 940 | x852,362 941 | x215,363 942 | x199,363 943 | x289,365 944 | x13,366 945 | x428,366 946 | x114,367 947 | x716,368 948 | x406,371 949 | x180,372 950 | x131,375 951 | x367,375 952 | x296,375 953 | x235,376 954 | x1,376 955 | x54,377 956 | x633,378 957 | x162,378 958 | x123,379 959 | x252,380 960 | x432,381 961 | x140,381 962 | x90,381 963 | x34,382 964 | x82,383 965 | x6,384 966 | x825,386 967 | x40,387 968 | x218,388 969 | x31,389 970 | x941,389 971 | x944,390 972 | x229,390 973 | x115,391 974 | x133,392 975 | x211,392 976 | x735,393 977 | x177,393 978 | x363,394 979 | x127,394 980 | x175,395 981 | x66,395 982 | x219,395 983 | x209,395 984 | x820,398 985 | x293,399 986 | x164,401 987 | x768,402 988 | x50,402 989 | x298,404 990 | x141,405 991 | x30,405 992 | x286,406 993 | x68,410 994 | x654,414 995 | x148,414 996 | x591,415 997 | x153,415 998 | x943,416 999 | x2,419 1000 | x822,420 1001 
| x143,420 1002 | x101,420 1003 | x734,421 1004 | x158,422 1005 | x134,422 1006 | x607,422 1007 | x55,423 1008 | x93,423 1009 | x53,424 1010 | x297,428 1011 | x65,429 1012 | x79,431 1013 | x212,431 1014 | x239,433 1015 | x92,437 1016 | x224,439 1017 | x433,439 1018 | x366,440 1019 | x558,441 1020 | x549,444 1021 | x99,447 1022 | x231,454 1023 | x95,455 1024 | x554,458 1025 | x879,458 1026 | x28,458 1027 | x58,459 1028 | x98,459 1029 | x791,460 1030 | x606,461 1031 | x769,463 1032 | x940,464 1033 | x67,464 1034 | x27,468 1035 | x135,469 1036 | x25,469 1037 | x236,471 1038 | x152,474 1039 | x951,476 1040 | x94,479 1041 | x142,480 1042 | x270,482 1043 | x767,485 1044 | x59,488 1045 | x76,493 1046 | x137,501 1047 | x364,503 1048 | x96,507 1049 | x29,517 1050 | x592,532 1051 | x368,545 1052 | x138,563 1053 | x26,566 1054 | x584,580 1055 | x559,590 1056 | -------------------------------------------------------------------------------- /DC-loan-rp/feature-selection/anylze.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import matplotlib.pyplot as plt 3 | 4 | path = 'd:/dataset/rp/' 5 | features_type_csv = path + 'features_type.csv' 6 | features = pd.read_csv(features_type_csv) 7 | numeric = features.feature[features.type == 'numeric'] 8 | category = features.feature[features.type == 'category'] 9 | 10 | print('feature\nnumeric: %d ; category: %d' % (numeric.shape[0], category.shape[0]) ) 11 | 12 | feature_score_csv = './0.70_feature_score.csv' 13 | feature_score = pd.read_csv(feature_score_csv) 14 | 15 | feature_score.index = feature_score.feature 16 | feature_score = feature_score.drop(['feature'], axis=1) 17 | 18 | print('feature__category') 19 | feature_score_category = feature_score.ix[category] 20 | feature_score_category = feature_score_category.sort_values(by='fscore', ascending=False) 21 | feature_score_category.to_csv('./feature_score_category.csv') 22 | 23 | category_is_null = feature_score_category[feature_score_category.fscore.isnull()] 24 | list_null = list(category_is_null.index) 25 | print(list_null) 26 | 27 | print('feature__numeric') 28 | feature_score_numeric = feature_score.ix[numeric] 29 | feature_score_numeric = feature_score_numeric.sort_values(by='fscore', ascending=False) 30 | feature_score_numeric.to_csv('./feature_score_numeric.csv') 31 | 32 | 33 | f1 = open('./drop_list.txt', 'r') 34 | f2 = open('./drop_feature.txt', 'w+') 35 | for line in f1: 36 | row = line.strip().split(',') 37 | f2.write(row[0]+'\n') 38 | 39 | f1.close() 40 | f2.close() -------------------------------------------------------------------------------- /DC-loan-rp/feature-selection/drop_feature.txt: -------------------------------------------------------------------------------- 1 | x455 2 | x487 3 | x488 4 | x1051 5 | x1132 6 | x1134 7 | x247 8 | x248 9 | x276 10 | x333 11 | x463 12 | x468 13 | x489 14 | x491 15 | x496 16 | x504 17 | x515 18 | x516 19 | x518 20 | x538 21 | x541 22 | x542 23 | x543 24 | x566 25 | x587 26 | x588 27 | x603 28 | x604 29 | x615 30 | x616 31 | x617 32 | x619 33 | x620 34 | x636 35 | x637 36 | x638 37 | x640 38 | x641 39 | x642 40 | x644 41 | x660 42 | x661 43 | x663 44 | x665 45 | x666 46 | x688 47 | x692 48 | x695 49 | x696 50 | x710 51 | x711 52 | x725 53 | x726 54 | x727 55 | x728 56 | x741 57 | x743 58 | x744 59 | x773 60 | x774 61 | x775 62 | x785 63 | x786 64 | x787 65 | x788 66 | x797 67 | x798 68 | x813 69 | x814 70 | x815 71 | x830 72 | x831 73 | x833 74 | x853 75 | x855 76 | x859 77 | x860 78 | x861 
79 | x876 80 | x877 81 | x895 82 | x897 83 | x899 84 | x913 85 | -------------------------------------------------------------------------------- /DC-loan-rp/feature-selection/drop_list.txt: -------------------------------------------------------------------------------- 1 | x455, 2 | x487, 3 | x488, 4 | x1051, 5 | x1132, 6 | x1134, 7 | x247, 8 | x248, 9 | x276, 10 | x333, 11 | x463, 12 | x468, 13 | x489, 14 | x491, 15 | x496, 16 | x504, 17 | x515, 18 | x516, 19 | x518, 20 | x538, 21 | x541, 22 | x542, 23 | x543, 24 | x566, 25 | x587, 26 | x588, 27 | x603, 28 | x604, 29 | x615, 30 | x616, 31 | x617, 32 | x619, 33 | x620, 34 | x636, 35 | x637, 36 | x638, 37 | x640, 38 | x641, 39 | x642, 40 | x644, 41 | x660, 42 | x661, 43 | x663, 44 | x665, 45 | x666, 46 | x688, 47 | x692, 48 | x695, 49 | x696, 50 | x710, 51 | x711, 52 | x725, 53 | x726, 54 | x727, 55 | x728, 56 | x741, 57 | x743, 58 | x744, 59 | x773, 60 | x774, 61 | x775, 62 | x785, 63 | x786, 64 | x787, 65 | x788, 66 | x797, 67 | x798, 68 | x813, 69 | x814, 70 | x815, 71 | x830, 72 | x831, 73 | x833, 74 | x853, 75 | x855, 76 | x859, 77 | x860, 78 | x861, 79 | x876, 80 | x877, 81 | x895, 82 | x897, 83 | x899, 84 | x913, 85 | -------------------------------------------------------------------------------- /DC-loan-rp/feature-selection/feature_score_category.csv: -------------------------------------------------------------------------------- 1 | feature,fscore 2 | x1108,139.0 3 | x1137,120.0 4 | x1110,120.0 5 | x1113,119.0 6 | x1138,115.0 7 | x1109,111.0 8 | x417,111.0 9 | x1041,104.0 10 | x1115,101.0 11 | x1107,97.0 12 | x1111,94.0 13 | x1114,93.0 14 | x1117,91.0 15 | x1046,91.0 16 | x1131,89.0 17 | x1116,85.0 18 | x1118,83.0 19 | x1120,77.0 20 | x1135,74.0 21 | x1130,69.0 22 | x1112,66.0 23 | x444,64.0 24 | x1128,64.0 25 | x1122,61.0 26 | x1136,60.0 27 | x1121,54.0 28 | x1123,53.0 29 | x1126,52.0 30 | x1133,49.0 31 | x1129,48.0 32 | x416,47.0 33 | x1125,47.0 34 | x1119,46.0 35 | x411,45.0 36 | x314,43.0 37 | x308,39.0 38 | x1036,37.0 39 | x415,36.0 40 | x301,32.0 41 | x315,29.0 42 | x1038,28.0 43 | x1127,26.0 44 | x1124,26.0 45 | x313,25.0 46 | x311,24.0 47 | x885,24.0 48 | x319,23.0 49 | x1043,23.0 50 | x1040,22.0 51 | x1047,20.0 52 | x449,19.0 53 | x303,19.0 54 | x300,19.0 55 | x454,18.0 56 | x309,18.0 57 | x452,15.0 58 | x884,14.0 59 | x426,14.0 60 | x316,14.0 61 | x317,14.0 62 | x1035,13.0 63 | x310,11.0 64 | x1050,10.0 65 | x302,10.0 66 | x1032,10.0 67 | x447,10.0 68 | x1033,10.0 69 | x318,9.0 70 | x312,8.0 71 | x439,8.0 72 | x1037,8.0 73 | x448,7.0 74 | x1031,6.0 75 | x1034,5.0 76 | x321,5.0 77 | x320,5.0 78 | x1045,5.0 79 | x438,5.0 80 | x883,5.0 81 | x1042,4.0 82 | x1044,4.0 83 | x453,4.0 84 | x450,3.0 85 | x886,3.0 86 | x1039,2.0 87 | x440,1.0 88 | x1052,1.0 89 | x455, 90 | x487, 91 | x488, 92 | x1051, 93 | x1132, 94 | x1134, 95 | -------------------------------------------------------------------------------- /DC-loan-rp/feature-selection/feature_score_numeric.csv: -------------------------------------------------------------------------------- 1 | feature,fscore 2 | x559,590.0 3 | x584,580.0 4 | x26,566.0 5 | x138,563.0 6 | x368,545.0 7 | x592,532.0 8 | x29,517.0 9 | x96,507.0 10 | x364,503.0 11 | x137,501.0 12 | x76,493.0 13 | x59,488.0 14 | x767,485.0 15 | x270,482.0 16 | x142,480.0 17 | x94,479.0 18 | x951,476.0 19 | x152,474.0 20 | x236,471.0 21 | x135,469.0 22 | x25,469.0 23 | x27,468.0 24 | x940,464.0 25 | x67,464.0 26 | x769,463.0 27 | x606,461.0 28 | x791,460.0 29 | x98,459.0 30 | x58,459.0 31 | x28,458.0 32 | 
x554,458.0 33 | x879,458.0 34 | x95,455.0 35 | x231,454.0 36 | x99,447.0 37 | x549,444.0 38 | x558,441.0 39 | x366,440.0 40 | x224,439.0 41 | x433,439.0 42 | x92,437.0 43 | x239,433.0 44 | x79,431.0 45 | x212,431.0 46 | x65,429.0 47 | x297,428.0 48 | x53,424.0 49 | x55,423.0 50 | x93,423.0 51 | x134,422.0 52 | x607,422.0 53 | x158,422.0 54 | x734,421.0 55 | x143,420.0 56 | x822,420.0 57 | x101,420.0 58 | x2,419.0 59 | x943,416.0 60 | x153,415.0 61 | x591,415.0 62 | x148,414.0 63 | x654,414.0 64 | x68,410.0 65 | x286,406.0 66 | x141,405.0 67 | x30,405.0 68 | x298,404.0 69 | x50,402.0 70 | x768,402.0 71 | x164,401.0 72 | x293,399.0 73 | x820,398.0 74 | x209,395.0 75 | x175,395.0 76 | x66,395.0 77 | x219,395.0 78 | x127,394.0 79 | x363,394.0 80 | x735,393.0 81 | x177,393.0 82 | x133,392.0 83 | x211,392.0 84 | x115,391.0 85 | x229,390.0 86 | x944,390.0 87 | x31,389.0 88 | x941,389.0 89 | x218,388.0 90 | x40,387.0 91 | x825,386.0 92 | x6,384.0 93 | x82,383.0 94 | x34,382.0 95 | x90,381.0 96 | x140,381.0 97 | x432,381.0 98 | x252,380.0 99 | x123,379.0 100 | x633,378.0 101 | x162,378.0 102 | x54,377.0 103 | x235,376.0 104 | x1,376.0 105 | x131,375.0 106 | x367,375.0 107 | x296,375.0 108 | x180,372.0 109 | x406,371.0 110 | x716,368.0 111 | x114,367.0 112 | x13,366.0 113 | x428,366.0 114 | x289,365.0 115 | x199,363.0 116 | x215,363.0 117 | x35,362.0 118 | x852,362.0 119 | x223,362.0 120 | x952,361.0 121 | x237,361.0 122 | x161,361.0 123 | x240,361.0 124 | x842,360.0 125 | x238,359.0 126 | x189,358.0 127 | x183,357.0 128 | x272,355.0 129 | x763,353.0 130 | x56,353.0 131 | x57,352.0 132 | x159,352.0 133 | x427,351.0 134 | x778,350.0 135 | x511,350.0 136 | x435,349.0 137 | x462,349.0 138 | x163,348.0 139 | x645,347.0 140 | x771,346.0 141 | x83,346.0 142 | x685,346.0 143 | x87,346.0 144 | x172,344.0 145 | x234,344.0 146 | x629,341.0 147 | x165,341.0 148 | x126,341.0 149 | x560,340.0 150 | x823,340.0 151 | x690,339.0 152 | x149,338.0 153 | x112,338.0 154 | x7,338.0 155 | x51,337.0 156 | x8,337.0 157 | x78,336.0 158 | x407,336.0 159 | x601,336.0 160 | x105,335.0 161 | x184,333.0 162 | x613,333.0 163 | x42,332.0 164 | x102,332.0 165 | x22,332.0 166 | x21,332.0 167 | x722,331.0 168 | x171,331.0 169 | x867,331.0 170 | x208,329.0 171 | x942,328.0 172 | x24,327.0 173 | x160,326.0 174 | x128,326.0 175 | x43,326.0 176 | x84,326.0 177 | x139,326.0 178 | x185,325.0 179 | x246,324.0 180 | x41,324.0 181 | x18,323.0 182 | x166,323.0 183 | x62,322.0 184 | x20,321.0 185 | x950,321.0 186 | x144,319.0 187 | x829,318.0 188 | x19,316.0 189 | x10,315.0 190 | x9,313.0 191 | x471,313.0 192 | x120,313.0 193 | x129,311.0 194 | x217,311.0 195 | x200,311.0 196 | x959,310.0 197 | x91,309.0 198 | x38,308.0 199 | x11,308.0 200 | x181,307.0 201 | x130,307.0 202 | x156,307.0 203 | x404,306.0 204 | x132,305.0 205 | x61,305.0 206 | x849,305.0 207 | x46,304.0 208 | x227,304.0 209 | x204,304.0 210 | x400,304.0 211 | x52,303.0 212 | x1001,303.0 213 | x397,303.0 214 | x352,301.0 215 | x698,301.0 216 | x947,301.0 217 | x477,301.0 218 | x44,301.0 219 | x826,300.0 220 | x305,299.0 221 | x687,299.0 222 | x399,298.0 223 | x85,297.0 224 | x486,297.0 225 | x546,296.0 226 | x151,296.0 227 | x357,296.0 228 | x214,296.0 229 | x111,295.0 230 | x154,295.0 231 | x121,294.0 232 | x74,294.0 233 | x858,294.0 234 | x216,293.0 235 | x228,293.0 236 | x49,292.0 237 | x71,292.0 238 | x193,292.0 239 | x187,290.0 240 | x178,289.0 241 | x47,288.0 242 | x168,288.0 243 | x721,288.0 244 | x346,287.0 245 | x192,287.0 246 | x157,286.0 247 | x125,286.0 248 | x792,286.0 
249 | x634,286.0 250 | x226,285.0 251 | x88,285.0 252 | x241,284.0 253 | x60,283.0 254 | x292,282.0 255 | x479,282.0 256 | x552,281.0 257 | x337,281.0 258 | x730,281.0 259 | x182,281.0 260 | x113,280.0 261 | x150,280.0 262 | x14,280.0 263 | x4,280.0 264 | x472,280.0 265 | x524,279.0 266 | x37,279.0 267 | x173,279.0 268 | x89,278.0 269 | x994,278.0 270 | x179,278.0 271 | x392,278.0 272 | x16,278.0 273 | x865,278.0 274 | x191,276.0 275 | x953,274.0 276 | x358,274.0 277 | x473,274.0 278 | x1026,274.0 279 | x550,274.0 280 | x981,272.0 281 | x119,271.0 282 | x949,271.0 283 | x167,270.0 284 | x551,270.0 285 | x359,270.0 286 | x15,269.0 287 | x107,269.0 288 | x257,269.0 289 | x846,269.0 290 | x409,269.0 291 | x206,269.0 292 | x656,269.0 293 | x555,268.0 294 | x738,268.0 295 | x117,268.0 296 | x868,268.0 297 | x194,267.0 298 | x405,267.0 299 | x136,267.0 300 | x232,267.0 301 | x557,266.0 302 | x190,266.0 303 | x946,266.0 304 | x124,266.0 305 | x233,265.0 306 | x772,265.0 307 | x201,265.0 308 | x86,265.0 309 | x145,265.0 310 | x230,265.0 311 | x578,264.0 312 | x213,263.0 313 | x474,263.0 314 | x299,262.0 315 | x222,261.0 316 | x383,260.0 317 | x653,260.0 318 | x284,260.0 319 | x23,260.0 320 | x63,259.0 321 | x362,259.0 322 | x699,259.0 323 | x207,259.0 324 | x482,256.0 325 | x210,255.0 326 | x225,255.0 327 | x568,254.0 328 | x686,254.0 329 | x72,253.0 330 | x731,253.0 331 | x108,253.0 332 | x81,252.0 333 | x373,252.0 334 | x469,251.0 335 | x106,251.0 336 | x323,249.0 337 | x412,249.0 338 | x878,248.0 339 | x379,248.0 340 | x398,248.0 341 | x287,247.0 342 | x3,247.0 343 | x648,246.0 344 | x356,246.0 345 | x271,245.0 346 | x350,245.0 347 | x176,243.0 348 | x553,243.0 349 | x48,243.0 350 | x348,241.0 351 | x548,240.0 352 | x116,240.0 353 | x268,240.0 354 | x205,239.0 355 | x403,239.0 356 | x766,239.0 357 | x221,239.0 358 | x169,239.0 359 | x1004,237.0 360 | x146,236.0 361 | x110,236.0 362 | x992,236.0 363 | x104,234.0 364 | x202,234.0 365 | x995,234.0 366 | x118,232.0 367 | x836,232.0 368 | x475,231.0 369 | x288,231.0 370 | x255,230.0 371 | x530,230.0 372 | x1019,229.0 373 | x978,229.0 374 | x39,228.0 375 | x483,227.0 376 | x651,227.0 377 | x109,227.0 378 | x956,226.0 379 | x563,225.0 380 | x999,224.0 381 | x33,222.0 382 | x197,221.0 383 | x683,220.0 384 | x533,220.0 385 | x186,220.0 386 | x155,220.0 387 | x395,219.0 388 | x188,219.0 389 | x17,218.0 390 | x12,217.0 391 | x198,216.0 392 | x824,216.0 393 | x490,216.0 394 | x582,216.0 395 | x856,215.0 396 | x764,215.0 397 | x748,214.0 398 | x220,214.0 399 | x103,213.0 400 | x733,213.0 401 | x70,213.0 402 | x781,212.0 403 | x804,212.0 404 | x354,212.0 405 | x195,212.0 406 | x1024,212.0 407 | x401,212.0 408 | x365,211.0 409 | x499,211.0 410 | x64,210.0 411 | x430,210.0 412 | x988,209.0 413 | x147,209.0 414 | x295,209.0 415 | x478,208.0 416 | x1018,208.0 417 | x729,207.0 418 | x80,207.0 419 | x556,207.0 420 | x336,207.0 421 | x75,207.0 422 | x73,206.0 423 | x481,204.0 424 | x122,204.0 425 | x713,204.0 426 | x689,204.0 427 | x393,204.0 428 | x361,204.0 429 | x196,200.0 430 | x958,200.0 431 | x36,200.0 432 | x583,200.0 433 | x485,198.0 434 | x385,198.0 435 | x459,198.0 436 | x1011,197.0 437 | x514,197.0 438 | x996,197.0 439 | x304,196.0 440 | x528,196.0 441 | x408,196.0 442 | x97,196.0 443 | x1022,194.0 444 | x590,194.0 445 | x1005,193.0 446 | x762,193.0 447 | x960,192.0 448 | x864,192.0 449 | x494,189.0 450 | x681,188.0 451 | x1010,188.0 452 | x977,188.0 453 | x957,187.0 454 | x506,186.0 455 | x476,186.0 456 | x460,186.0 457 | x986,185.0 458 | x5,185.0 459 | 
x294,184.0 460 | x529,183.0 461 | x100,182.0 462 | x760,182.0 463 | x1077,181.0 464 | x306,179.0 465 | x45,179.0 466 | x374,177.0 467 | x429,177.0 468 | x347,175.0 469 | x341,174.0 470 | x434,174.0 471 | x796,174.0 472 | x1029,173.0 473 | x581,172.0 474 | x382,172.0 475 | x576,172.0 476 | x396,170.0 477 | x77,170.0 478 | x388,169.0 479 | x387,169.0 480 | x170,168.0 481 | x562,168.0 482 | x512,167.0 483 | x389,167.0 484 | x360,166.0 485 | x1078,166.0 486 | x614,166.0 487 | x809,166.0 488 | x993,165.0 489 | x982,163.0 490 | x647,162.0 491 | x1084,162.0 492 | x752,161.0 493 | x657,160.0 494 | x1003,160.0 495 | x497,160.0 496 | x945,159.0 497 | x339,158.0 498 | x509,158.0 499 | x431,158.0 500 | x422,157.0 501 | x174,157.0 502 | x203,157.0 503 | x985,157.0 504 | x418,156.0 505 | x338,156.0 506 | x525,154.0 507 | x983,153.0 508 | x329,153.0 509 | x700,153.0 510 | x372,153.0 511 | x843,153.0 512 | x513,151.0 513 | x394,150.0 514 | x585,150.0 515 | x746,150.0 516 | x609,149.0 517 | x484,148.0 518 | x997,147.0 519 | x461,146.0 520 | x424,146.0 521 | x342,145.0 522 | x902,145.0 523 | x987,144.0 524 | x979,144.0 525 | x1023,144.0 526 | x1020,144.0 527 | x650,142.0 528 | x436,141.0 529 | x410,140.0 530 | x464,140.0 531 | x838,140.0 532 | x254,140.0 533 | x353,139.0 534 | x963,139.0 535 | x369,139.0 536 | x371,139.0 537 | x470,139.0 538 | x790,139.0 539 | x564,139.0 540 | x976,138.0 541 | x344,138.0 542 | x676,137.0 543 | x737,137.0 544 | x266,137.0 545 | x458,136.0 546 | x795,136.0 547 | x704,136.0 548 | x749,135.0 549 | x964,135.0 550 | x565,134.0 551 | x909,133.0 552 | x480,132.0 553 | x1072,132.0 554 | x413,130.0 555 | x873,130.0 556 | x918,130.0 557 | x611,130.0 558 | x674,130.0 559 | x377,129.0 560 | x1021,129.0 561 | x340,129.0 562 | x349,128.0 563 | x984,128.0 564 | x330,128.0 565 | x655,127.0 566 | x386,126.0 567 | x343,126.0 568 | x890,125.0 569 | x522,125.0 570 | x575,123.0 571 | x626,123.0 572 | x747,122.0 573 | x732,122.0 574 | x1083,122.0 575 | x677,122.0 576 | x285,121.0 577 | x32,121.0 578 | x1074,121.0 579 | x1048,120.0 580 | x851,120.0 581 | x800,120.0 582 | x806,120.0 583 | x630,120.0 584 | x1060,119.0 585 | x1073,119.0 586 | x765,119.0 587 | x866,119.0 588 | x990,119.0 589 | x534,119.0 590 | x355,118.0 591 | x1015,118.0 592 | x652,117.0 593 | x847,115.0 594 | x510,115.0 595 | x245,115.0 596 | x370,114.0 597 | x920,114.0 598 | x905,114.0 599 | x980,114.0 600 | x593,113.0 601 | x282,113.0 602 | x1016,112.0 603 | x1090,112.0 604 | x1103,111.0 605 | x1068,111.0 606 | x256,111.0 607 | x269,110.0 608 | x777,109.0 609 | x1028,109.0 610 | x1009,109.0 611 | x384,107.0 612 | x610,107.0 613 | x933,106.0 614 | x1087,106.0 615 | x351,105.0 616 | x705,105.0 617 | x914,105.0 618 | x345,105.0 619 | x1102,104.0 620 | x503,104.0 621 | x803,104.0 622 | x1064,104.0 623 | x818,103.0 624 | x69,102.0 625 | x283,101.0 626 | x1057,101.0 627 | x332,100.0 628 | x574,100.0 629 | x770,98.0 630 | x602,98.0 631 | x736,97.0 632 | x750,96.0 633 | x805,95.0 634 | x962,95.0 635 | x935,94.0 636 | x605,94.0 637 | x520,93.0 638 | x1025,93.0 639 | x1049,93.0 640 | x697,93.0 641 | x857,91.0 642 | x782,91.0 643 | x827,90.0 644 | x380,89.0 645 | x837,89.0 646 | x527,89.0 647 | x391,88.0 648 | x707,88.0 649 | x821,88.0 650 | x1059,87.0 651 | x277,87.0 652 | x955,87.0 653 | x273,86.0 654 | x573,85.0 655 | x875,85.0 656 | x1101,85.0 657 | x1055,84.0 658 | x381,84.0 659 | x307,84.0 660 | x325,83.0 661 | x789,83.0 662 | x1105,83.0 663 | x780,83.0 664 | x757,83.0 665 | x259,83.0 666 | x966,83.0 667 | x954,83.0 668 | x968,82.0 669 
| x579,82.0 670 | x253,81.0 671 | x678,81.0 672 | x862,80.0 673 | x493,79.0 674 | x998,79.0 675 | x850,79.0 676 | x869,78.0 677 | x874,78.0 678 | x682,78.0 679 | x991,77.0 680 | x535,77.0 681 | x1056,77.0 682 | x684,77.0 683 | x761,76.0 684 | x1106,74.0 685 | x420,74.0 686 | x274,74.0 687 | x1008,74.0 688 | x586,74.0 689 | x1104,73.0 690 | x517,73.0 691 | x1082,73.0 692 | x580,73.0 693 | x378,71.0 694 | x889,71.0 695 | x1067,71.0 696 | x717,71.0 697 | x1094,70.0 698 | x937,70.0 699 | x251,70.0 700 | x709,69.0 701 | x402,69.0 702 | x828,69.0 703 | x526,69.0 704 | x263,69.0 705 | x376,69.0 706 | x679,69.0 707 | x521,68.0 708 | x794,68.0 709 | x595,67.0 710 | x623,67.0 711 | x322,67.0 712 | x631,67.0 713 | x1062,66.0 714 | x1066,65.0 715 | x1089,65.0 716 | x643,65.0 717 | x1071,63.0 718 | x844,63.0 719 | x265,63.0 720 | x819,62.0 721 | x446,62.0 722 | x967,62.0 723 | x667,61.0 724 | x675,61.0 725 | x577,61.0 726 | x751,61.0 727 | x467,60.0 728 | x835,60.0 729 | x720,59.0 730 | x672,59.0 731 | x390,59.0 732 | x1053,59.0 733 | x915,58.0 734 | x531,57.0 735 | x971,57.0 736 | x1058,55.0 737 | x863,55.0 738 | x327,55.0 739 | x466,54.0 740 | x807,54.0 741 | x465,53.0 742 | x624,53.0 743 | x532,53.0 744 | x1012,52.0 745 | x1000,52.0 746 | x1017,52.0 747 | x1076,51.0 748 | x892,51.0 749 | x628,51.0 750 | x928,51.0 751 | x908,49.0 752 | x589,49.0 753 | x419,49.0 754 | x547,49.0 755 | x1006,48.0 756 | x1096,48.0 757 | x673,48.0 758 | x1099,47.0 759 | x1085,47.0 760 | x870,47.0 761 | x1075,47.0 762 | x334,46.0 763 | x627,46.0 764 | x1088,46.0 765 | x680,45.0 766 | x375,45.0 767 | x1092,45.0 768 | x783,45.0 769 | x267,45.0 770 | x924,44.0 771 | x258,44.0 772 | x1100,44.0 773 | x326,44.0 774 | x1065,43.0 775 | x523,43.0 776 | x561,42.0 777 | x612,42.0 778 | x649,41.0 779 | x1007,41.0 780 | x961,41.0 781 | x793,41.0 782 | x910,41.0 783 | x779,41.0 784 | x1054,40.0 785 | x1086,40.0 786 | x972,39.0 787 | x708,39.0 788 | x242,39.0 789 | x646,39.0 790 | x572,38.0 791 | x745,38.0 792 | x414,37.0 793 | x802,37.0 794 | x1080,36.0 795 | x1061,36.0 796 | x887,36.0 797 | x328,36.0 798 | x845,36.0 799 | x974,35.0 800 | x659,35.0 801 | x1063,35.0 802 | x703,34.0 803 | x260,33.0 804 | x608,33.0 805 | x1091,32.0 806 | x948,32.0 807 | x244,32.0 808 | x502,31.0 809 | x973,31.0 810 | x571,31.0 811 | x331,31.0 812 | x706,31.0 813 | x1070,30.0 814 | x539,30.0 815 | x693,30.0 816 | x632,29.0 817 | x930,29.0 818 | x841,29.0 819 | x243,29.0 820 | x848,29.0 821 | x756,28.0 822 | x441,28.0 823 | x880,28.0 824 | x932,27.0 825 | x926,27.0 826 | x718,27.0 827 | x1030,27.0 828 | x335,26.0 829 | x969,26.0 830 | x925,26.0 831 | x839,26.0 832 | x888,25.0 833 | x639,25.0 834 | x622,25.0 835 | x989,25.0 836 | x882,24.0 837 | x753,24.0 838 | x1002,24.0 839 | x1097,24.0 840 | x784,23.0 841 | x904,23.0 842 | x816,23.0 843 | x519,23.0 844 | x840,23.0 845 | x600,23.0 846 | x1013,22.0 847 | x739,22.0 848 | x456,22.0 849 | x919,21.0 850 | x759,21.0 851 | x569,21.0 852 | x594,21.0 853 | x832,21.0 854 | x1069,21.0 855 | x664,21.0 856 | x264,20.0 857 | x881,20.0 858 | x694,20.0 859 | x421,19.0 860 | x280,19.0 861 | x669,19.0 862 | x324,18.0 863 | x570,18.0 864 | x975,17.0 865 | x970,17.0 866 | x451,17.0 867 | x457,16.0 868 | x423,16.0 869 | x898,16.0 870 | x758,16.0 871 | x670,16.0 872 | x599,16.0 873 | x939,15.0 874 | x714,15.0 875 | x597,15.0 876 | x1081,15.0 877 | x1027,15.0 878 | x281,15.0 879 | x618,15.0 880 | x896,15.0 881 | x742,14.0 882 | x801,14.0 883 | x921,14.0 884 | x625,14.0 885 | x567,14.0 886 | x492,13.0 887 | x712,13.0 888 | 
x719,13.0 889 | x715,12.0 890 | x1093,12.0 891 | x754,12.0 892 | x917,12.0 893 | x906,12.0 894 | x891,12.0 895 | x811,12.0 896 | x671,11.0 897 | x442,11.0 898 | x903,11.0 899 | x596,11.0 900 | x755,10.0 901 | x1014,10.0 902 | x1098,10.0 903 | x871,10.0 904 | x907,10.0 905 | x893,9.0 906 | x545,9.0 907 | x537,9.0 908 | x508,9.0 909 | x901,9.0 910 | x443,9.0 911 | x916,9.0 912 | x854,8.0 913 | x691,8.0 914 | x1079,8.0 915 | x922,7.0 916 | x658,7.0 917 | x635,7.0 918 | x701,7.0 919 | x912,7.0 920 | x662,7.0 921 | x437,7.0 922 | x936,7.0 923 | x668,6.0 924 | x702,6.0 925 | x1095,6.0 926 | x724,6.0 927 | x938,6.0 928 | x776,6.0 929 | x810,6.0 930 | x498,6.0 931 | x927,5.0 932 | x249,5.0 933 | x501,5.0 934 | x425,5.0 935 | x834,5.0 936 | x445,5.0 937 | x934,5.0 938 | x931,5.0 939 | x536,4.0 940 | x965,4.0 941 | x808,4.0 942 | x544,4.0 943 | x540,4.0 944 | x598,4.0 945 | x507,4.0 946 | x812,3.0 947 | x278,3.0 948 | x740,3.0 949 | x911,3.0 950 | x505,3.0 951 | x500,2.0 952 | x250,2.0 953 | x621,2.0 954 | x279,2.0 955 | x872,2.0 956 | x275,2.0 957 | x290,1.0 958 | x723,1.0 959 | x817,1.0 960 | x291,1.0 961 | x799,1.0 962 | x923,1.0 963 | x262,1.0 964 | x495,1.0 965 | x894,1.0 966 | x900,1.0 967 | x929,1.0 968 | x261,1.0 969 | x247, 970 | x248, 971 | x276, 972 | x333, 973 | x463, 974 | x468, 975 | x489, 976 | x491, 977 | x496, 978 | x504, 979 | x515, 980 | x516, 981 | x518, 982 | x538, 983 | x541, 984 | x542, 985 | x543, 986 | x566, 987 | x587, 988 | x588, 989 | x603, 990 | x604, 991 | x615, 992 | x616, 993 | x617, 994 | x619, 995 | x620, 996 | x636, 997 | x637, 998 | x638, 999 | x640, 1000 | x641, 1001 | x642, 1002 | x644, 1003 | x660, 1004 | x661, 1005 | x663, 1006 | x665, 1007 | x666, 1008 | x688, 1009 | x692, 1010 | x695, 1011 | x696, 1012 | x710, 1013 | x711, 1014 | x725, 1015 | x726, 1016 | x727, 1017 | x728, 1018 | x741, 1019 | x743, 1020 | x744, 1021 | x773, 1022 | x774, 1023 | x775, 1024 | x785, 1025 | x786, 1026 | x787, 1027 | x788, 1028 | x797, 1029 | x798, 1030 | x813, 1031 | x814, 1032 | x815, 1033 | x830, 1034 | x831, 1035 | x833, 1036 | x853, 1037 | x855, 1038 | x859, 1039 | x860, 1040 | x861, 1041 | x876, 1042 | x877, 1043 | x895, 1044 | x897, 1045 | x899, 1046 | x913, 1047 | -------------------------------------------------------------------------------- /DC-loan-rp/sklearn-rf.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | ''' 5 | @filename: sklearn-rf.py.py 6 | @author: yew1eb 7 | @site: http://blog.yew1eb.net 8 | @contact: yew1eb@gmail.com 9 | @time: 2016/01/16 下午 10:47 10 | ''' 11 | 12 | import pandas as pd 13 | import numpy as np 14 | from sklearn import cross_validation 15 | from sklearn import metrics 16 | from sklearn.ensemble import RandomForestClassifier 17 | import time 18 | 19 | def load_data(dummy=False): 20 | path = 'D:/dataset/rp/' 21 | train_x = pd.read_csv(path + 'train_x.csv') 22 | train_x = train_x.drop(['uid'], axis=1) 23 | 24 | train_y = pd.read_csv(path + 'train_y.csv') 25 | train_y = train_y.drop(['uid'], axis=1) 26 | 27 | test_x = pd.read_csv(path + 'test_x.csv') 28 | test_uid = test_x.uid 29 | test_x = test_x.drop(['uid'], axis=1) 30 | 31 | if dummy: # 将分类类型的变量转为哑变量 32 | features = pd.read_csv(path + 'features_type.csv') 33 | features_category = features.feature[features.type == 'category'] 34 | encoded = pd.get_dummies(pd.concat([train_x, test_x], axis=0), columns=features_category) 35 | train_rows = train_x.shape[0] 36 | train_x = encoded.iloc[:train_rows, :] 
37 | test_x = encoded.iloc[train_rows:, :] 38 | 39 | return train_x, train_y, test_x, test_uid 40 | 41 | def sklearn_random_forest(train_x, train_y, test_x, test_uid): 42 | # 设置参数 43 | clf = RandomForestClassifier(n_estimators=5, 44 | bootstrap=True, #是否有放回的采样 45 | oob_score=False, 46 | n_jobs=4, #并行job个数 47 | min_samples_split=5) 48 | # 训练模型 49 | n_samples = train_x.shape[0] 50 | cv = cross_validation.ShuffleSplit(n_samples, n_iter=3, test_size=0.3, random_state=0) 51 | predicted = cross_validation.cross_val_predict(clf, train_x, train_y, cv=cv) 52 | print(metrics.accuracy_score(train_y, predicted)) 53 | 54 | test_y = clf.predict(test_x) 55 | result = pd.DataFrame({"uid":test_uid, "score":test_y}, columns=['uid','score']) 56 | result.to_csv('rf_'+str(time.time())+'.csv', index=False) 57 | 58 | def main(): 59 | train_x, train_y, test_x, test_uid = load_data(dummy=True) 60 | sklearn_random_forest(train_x, train_y, test_x, test_uid) 61 | 62 | if __name__ == '__main__': 63 | main() -------------------------------------------------------------------------------- /DC-loan-rp/small_data/train_y.csv: -------------------------------------------------------------------------------- 1 | uid,y 2 | 1792,1 3 | 4211,1 4 | 14658,1 5 | 17041,1 6 | 15765,1 7 | 6638,1 8 | 1649,1 9 | 5709,1 10 | 4749,0 11 | 7702,1 12 | 7456,1 13 | 4356,1 14 | 15094,1 15 | 16423,1 16 | 11970,1 17 | 13011,1 18 | 16849,1 19 | 9058,1 20 | 14323,1 21 | 5819,1 22 | 3595,1 23 | 14420,1 24 | 18082,1 25 | 8906,1 26 | 16752,1 27 | 14053,1 28 | 18994,1 29 | 12846,1 30 | 1495,1 31 | 5063,1 32 | 19456,1 33 | 4340,1 34 | 19146,0 35 | 10021,1 36 | 12994,1 37 | 18933,1 38 | 19739,0 39 | 18329,1 40 | 18724,1 41 | 10547,1 42 | 6338,1 43 | 8457,1 44 | 858,1 45 | 2580,0 46 | 1533,1 47 | 1565,0 48 | 11623,1 49 | 10857,1 50 | 9691,1 51 | 18520,1 52 | 17390,1 53 | 15930,1 54 | 6418,1 55 | 8485,0 56 | 7407,1 57 | 17837,1 58 | 18572,1 59 | 12310,1 60 | 2752,1 61 | 5068,1 62 | 5952,1 63 | 8147,1 64 | 5510,1 65 | 12481,1 66 | 455,1 67 | 6550,1 68 | 12289,1 69 | 8356,1 70 | 11317,1 71 | 10586,1 72 | 19239,1 73 | 16891,0 74 | 5071,1 75 | 1513,1 76 | 8852,1 77 | 31,1 78 | 6691,1 79 | 7452,1 80 | 9654,0 81 | 12010,1 82 | 9731,1 83 | 10938,1 84 | 13004,1 85 | 18726,1 86 | 7996,0 87 | 16360,1 88 | 10620,1 89 | 8776,1 90 | 19538,1 91 | 16652,1 92 | 10913,1 93 | 15901,1 94 | 10891,1 95 | 19770,1 96 | 15515,1 97 | 15757,1 98 | 2387,1 99 | 7761,0 100 | 13111,1 101 | 14062,1 102 | 5036,1 103 | 2663,0 104 | 7935,1 105 | 5630,1 106 | 14950,1 107 | 16032,1 108 | 16216,1 109 | 13931,1 110 | 14558,1 111 | 6397,1 112 | 15271,0 113 | 5488,1 114 | 6757,1 115 | 13607,0 116 | 18701,0 117 | 18733,1 118 | 14038,0 119 | 16711,1 120 | 14277,1 121 | 15000,1 122 | 241,1 123 | 1051,1 124 | 3241,1 125 | 13721,1 126 | 8359,1 127 | 15836,1 128 | 16195,1 129 | 10145,1 130 | 9591,1 131 | 1530,1 132 | 18882,1 133 | 19924,1 134 | 18696,0 135 | 14232,0 136 | 1514,1 137 | 13743,1 138 | 5821,0 139 | 18835,1 140 | 4376,0 141 | 19088,0 142 | 8590,1 143 | 4673,1 144 | 9607,1 145 | 8572,0 146 | 848,0 147 | 11470,1 148 | 18378,1 149 | 19668,1 150 | 9768,1 151 | 17572,1 152 | 12040,1 153 | 18588,1 154 | 9032,1 155 | 2561,1 156 | 3896,1 157 | 19572,0 158 | 2990,1 159 | 13708,1 160 | 11665,1 161 | 11582,1 162 | 2782,1 163 | 13169,1 164 | 6537,1 165 | 10741,1 166 | 19407,1 167 | 1605,1 168 | 10493,0 169 | 7885,1 170 | 15088,1 171 | 19477,1 172 | 15914,1 173 | 11259,1 174 | 1361,1 175 | 3722,1 176 | 16285,1 177 | 9831,1 178 | 11081,1 179 | 14580,1 180 | 8056,1 181 | 18402,1 182 | 6304,1 183 | 
17065,1 184 | 4988,1 185 | 2028,1 186 | 18090,1 187 | 15769,1 188 | 18252,1 189 | 13358,1 190 | 4594,1 191 | 3505,1 192 | 9270,1 193 | 5914,1 194 | 15670,1 195 | 7941,1 196 | 13714,1 197 | 1211,1 198 | 17220,1 199 | 17273,0 200 | 8555,1 201 | 11256,1 202 | 10832,1 203 | 13467,1 204 | 15994,1 205 | 1280,1 206 | 9173,1 207 | 3559,0 208 | 10441,0 209 | 11885,1 210 | 916,1 211 | 5122,1 212 | 17178,1 213 | 3069,1 214 | 1748,1 215 | 15322,1 216 | 5849,1 217 | 4422,1 218 | 7037,0 219 | 2035,1 220 | 12628,1 221 | 1135,1 222 | 4135,1 223 | 12967,1 224 | 8479,1 225 | 6337,1 226 | 2483,1 227 | 3592,1 228 | 10646,1 229 | 2313,1 230 | 15667,1 231 | 5087,1 232 | 16552,1 233 | 3695,1 234 | 15866,0 235 | 19544,1 236 | 17520,1 237 | 18127,1 238 | 1499,1 239 | 16952,1 240 | 2677,1 241 | 9997,1 242 | 7724,1 243 | 6854,0 244 | 5150,1 245 | 12596,1 246 | 15889,1 247 | 19507,1 248 | 12311,0 249 | 16405,1 250 | 17565,1 251 | 6609,1 252 | 2568,1 253 | 7705,1 254 | 11719,1 255 | 11998,1 256 | 2007,1 257 | 7595,1 258 | 12318,0 259 | 685,1 260 | 12614,1 261 | 4698,1 262 | 9902,1 263 | 13009,1 264 | 17267,0 265 | 81,1 266 | 6113,1 267 | 17000,1 268 | 4492,1 269 | 19079,1 270 | 1969,1 271 | 97,1 272 | 2981,1 273 | 12210,1 274 | 378,1 275 | 159,1 276 | 19751,1 277 | 2463,1 278 | 4312,0 279 | 5155,1 280 | 8439,1 281 | 19887,1 282 | 3831,1 283 | 11249,1 284 | 8431,1 285 | 10596,1 286 | 18036,1 287 | 5586,1 288 | 17947,1 289 | 4245,1 290 | 2459,1 291 | 9847,1 292 | 15236,1 293 | 10610,1 294 | 18447,1 295 | 12739,1 296 | 1441,1 297 | 14130,1 298 | 17478,1 299 | 5292,1 300 | 3578,1 301 | 16649,1 302 | 17435,1 303 | 1510,0 304 | 8556,0 305 | 13148,1 306 | 6118,1 307 | 12297,1 308 | 1159,1 309 | 1981,1 310 | 7120,1 311 | 17774,1 312 | 3021,1 313 | 17743,1 314 | 17392,1 315 | 13611,1 316 | 11629,0 317 | 6422,1 318 | 2000,1 319 | 8663,1 320 | 14870,1 321 | 17154,1 322 | 9615,1 323 | 1475,1 324 | 2654,1 325 | 14415,1 326 | 2957,1 327 | 1279,1 328 | 10932,1 329 | 10829,1 330 | 14806,1 331 | 1526,1 332 | 2520,1 333 | 2570,1 334 | 18918,1 335 | 19423,1 336 | 9098,1 337 | 6599,1 338 | 200,1 339 | 14016,1 340 | 11012,1 341 | 3701,1 342 | 4812,1 343 | 3063,1 344 | 15665,1 345 | 13814,1 346 | 17366,1 347 | 3059,0 348 | 12219,1 349 | 7823,1 350 | 1192,1 351 | 12423,1 352 | 7287,1 353 | 17369,1 354 | 1551,0 355 | 13211,1 356 | 3119,1 357 | 16838,1 358 | 205,1 359 | 13458,1 360 | 16226,1 361 | 6127,1 362 | 1622,1 363 | 3092,1 364 | 5310,1 365 | 11617,1 366 | 12272,1 367 | 16210,1 368 | 7990,0 369 | 1918,1 370 | 16861,1 371 | 8695,1 372 | 9027,1 373 | 7376,1 374 | 16836,1 375 | 8386,0 376 | 16680,1 377 | 14917,1 378 | 7484,1 379 | 10522,1 380 | 16493,1 381 | 19628,1 382 | 4765,1 383 | 17562,1 384 | 16075,1 385 | 9907,1 386 | 17480,1 387 | 13976,1 388 | 15058,1 389 | 10703,1 390 | 6303,1 391 | 454,1 392 | 17517,1 393 | 197,1 394 | 8280,1 395 | 19798,0 396 | 2822,1 397 | 11523,1 398 | 11889,1 399 | 820,1 400 | 6194,1 401 | 5768,1 402 | 12066,1 403 | 7259,1 404 | 16731,1 405 | 9330,1 406 | 9748,0 407 | 6262,1 408 | 6720,1 409 | 18619,1 410 | 9165,1 411 | 1080,1 412 | 2778,1 413 | 14872,1 414 | 5585,0 415 | 9865,0 416 | 10802,1 417 | 15705,1 418 | 10529,1 419 | 5144,1 420 | 14586,1 421 | 8516,1 422 | 17286,1 423 | 6109,1 424 | 14605,0 425 | 18176,1 426 | 14832,1 427 | 2516,1 428 | 5694,1 429 | 703,1 430 | 9824,1 431 | 12865,1 432 | 17927,1 433 | 5455,1 434 | 16202,1 435 | 6967,1 436 | 13279,1 437 | 14845,1 438 | 10739,1 439 | 4468,1 440 | 15601,1 441 | 269,1 442 | 814,1 443 | 14632,0 444 | 17967,1 445 | 6423,1 446 | 6549,1 447 | 3461,1 448 | 
3445,1 449 | 14015,1 450 | 5921,1 451 | 18431,0 452 | 14191,1 453 | 8564,1 454 | 13732,1 455 | 10329,1 456 | 1333,1 457 | 18674,1 458 | 17835,1 459 | 17068,1 460 | 14629,1 461 | 19949,1 462 | 18589,1 463 | 16370,1 464 | 4851,1 465 | 537,1 466 | 15882,1 467 | 4146,0 468 | 10405,1 469 | 8031,1 470 | 8403,1 471 | 6,0 472 | 14543,0 473 | 5278,1 474 | 4379,1 475 | 9166,1 476 | 14297,1 477 | 19105,1 478 | 19140,1 479 | 16387,1 480 | 2453,1 481 | 15776,1 482 | 18893,0 483 | 14280,1 484 | 3833,0 485 | 13240,1 486 | 2831,1 487 | 7623,0 488 | 15233,1 489 | 16127,1 490 | 3840,1 491 | 382,1 492 | 647,1 493 | 8017,1 494 | 16443,1 495 | 12005,1 496 | 12929,1 497 | 7767,1 498 | 2983,0 499 | 2821,0 500 | 14713,1 501 | 847,1 502 | 7826,1 503 | 3928,1 504 | 10304,1 505 | 13789,0 506 | 11673,1 507 | 17813,1 508 | 2136,1 509 | 3126,1 510 | 9291,1 511 | 1327,0 512 | 6978,1 513 | 4846,1 514 | 1935,1 515 | 8661,1 516 | 8080,1 517 | 19574,1 518 | 6420,1 519 | 6403,1 520 | 12436,1 521 | 2141,1 522 | 16770,1 523 | 7441,1 524 | 5597,1 525 | 12875,0 526 | 12986,1 527 | 18928,1 528 | 18577,1 529 | 257,1 530 | 15870,1 531 | 14869,1 532 | 8960,0 533 | 11774,1 534 | 3141,1 535 | 6006,1 536 | 18599,1 537 | 2855,1 538 | 11686,1 539 | 6365,1 540 | 15455,1 541 | 14677,1 542 | 168,1 543 | 17487,0 544 | 10903,0 545 | 12155,1 546 | 554,1 547 | 17332,1 548 | 18501,1 549 | 14971,1 550 | 13442,1 551 | 19355,1 552 | 8870,0 553 | 18264,1 554 | 15591,1 555 | 4437,1 556 | 9469,1 557 | 12798,1 558 | 2478,1 559 | 18651,1 560 | 9869,1 561 | 10172,1 562 | 15614,1 563 | 14127,1 564 | 9781,1 565 | 8501,1 566 | 18664,1 567 | 5567,1 568 | 19931,1 569 | 4702,0 570 | 19365,1 571 | 6957,0 572 | 5476,1 573 | 8262,1 574 | 4565,1 575 | 20000,1 576 | 9154,1 577 | 13192,1 578 | 3033,1 579 | 18526,1 580 | 4803,1 581 | 15319,1 582 | 3292,1 583 | 15877,1 584 | 14497,1 585 | 16374,1 586 | 15437,1 587 | 16356,1 588 | 11031,1 589 | 17099,1 590 | 4177,1 591 | 11950,1 592 | 12295,1 593 | 19658,1 594 | 9168,1 595 | 2024,0 596 | 3900,0 597 | 2566,1 598 | 19431,1 599 | 18492,1 600 | 17315,1 601 | 3255,1 602 | 2508,1 603 | 17779,1 604 | 12696,1 605 | 18847,1 606 | 4780,1 607 | 16014,1 608 | 19069,1 609 | 8199,1 610 | 7513,1 611 | 16301,1 612 | 11560,1 613 | 18593,1 614 | 2702,1 615 | 17876,1 616 | 1766,1 617 | 6038,1 618 | 17509,1 619 | 16299,1 620 | 8324,1 621 | 7505,1 622 | 7783,1 623 | 8985,1 624 | 15633,1 625 | 18469,1 626 | 10722,1 627 | 16981,1 628 | 13050,1 629 | 15464,1 630 | 13237,1 631 | 17888,1 632 | 12514,1 633 | 12663,1 634 | 16079,1 635 | 4150,1 636 | 10728,1 637 | 15427,1 638 | 14944,1 639 | 7399,1 640 | 18669,1 641 | 10104,1 642 | 9299,1 643 | 14974,1 644 | 7136,1 645 | 4152,0 646 | 14184,1 647 | 18080,1 648 | 7746,1 649 | 19601,1 650 | 17470,1 651 | 11561,1 652 | 10862,1 653 | 11109,1 654 | 4469,1 655 | 17459,1 656 | 10336,1 657 | 17632,1 658 | 13748,1 659 | 666,1 660 | 12056,1 661 | 3009,1 662 | 16774,0 663 | 15106,1 664 | 7548,1 665 | 7800,1 666 | 17027,0 667 | 4308,1 668 | 4480,1 669 | 17060,1 670 | 19015,1 671 | 13827,1 672 | 3494,1 673 | 11585,0 674 | 18903,1 675 | 1753,1 676 | 12227,1 677 | 2408,1 678 | 2300,1 679 | 19187,1 680 | 7228,1 681 | 11094,1 682 | 8867,1 683 | 6380,1 684 | 6772,1 685 | 9204,1 686 | 5076,1 687 | 19120,1 688 | 17857,1 689 | 14304,1 690 | 1445,1 691 | 12092,1 692 | 8335,1 693 | 2798,1 694 | 10672,0 695 | 611,1 696 | 4103,1 697 | 11794,1 698 | 11887,1 699 | 7600,1 700 | 10837,1 701 | 14194,1 702 | 18259,1 703 | 391,1 704 | 10448,1 705 | 12552,1 706 | 19641,0 707 | 15940,1 708 | 17609,1 709 | 16049,1 710 | 17903,1 711 
| 1887,1 712 | 12669,1 713 | 11164,1 714 | 14626,1 715 | 17715,1 716 | 15727,1 717 | 6334,1 718 | 15386,1 719 | 7027,1 720 | 12296,1 721 | 1740,0 722 | 15272,1 723 | 12095,1 724 | 15821,1 725 | 12243,1 726 | 4855,1 727 | 16611,1 728 | 15315,1 729 | 2678,1 730 | 14855,1 731 | 16865,0 732 | 3880,1 733 | 2100,1 734 | 9762,1 735 | 17540,1 736 | 19638,1 737 | 3198,1 738 | 4959,1 739 | 18719,0 740 | 2284,0 741 | 12172,1 742 | 11655,1 743 | 1585,1 744 | 776,1 745 | 6426,0 746 | 15064,1 747 | 9288,1 748 | 18811,1 749 | 7477,1 750 | 8350,0 751 | 9227,1 752 | 8163,0 753 | 2657,1 754 | 16399,1 755 | 5471,1 756 | 6014,1 757 | 16655,1 758 | 11500,1 759 | 17229,1 760 | 13284,0 761 | 1313,1 762 | 1977,1 763 | 17529,1 764 | 18200,1 765 | 10193,1 766 | 7185,1 767 | 1028,0 768 | 1756,1 769 | 13245,0 770 | 11955,1 771 | 10774,1 772 | 7510,1 773 | 13418,1 774 | 19756,1 775 | 6090,1 776 | 14187,1 777 | 19774,1 778 | 8535,1 779 | 10689,1 780 | 2871,1 781 | 14255,1 782 | 19789,1 783 | 11747,1 784 | 4005,1 785 | 9608,1 786 | 12756,1 787 | 3637,0 788 | 2207,0 789 | 192,1 790 | 17959,0 791 | 16994,1 792 | 9417,1 793 | 6041,1 794 | 3684,1 795 | 13341,1 796 | 13307,1 797 | 11530,1 798 | 9939,1 799 | 1873,1 800 | 1130,1 801 | 13293,1 802 | 19998,1 803 | 16378,1 804 | 6494,1 805 | 7759,1 806 | 9357,1 807 | 1442,1 808 | 3509,1 809 | 3518,1 810 | 9593,1 811 | 15458,1 812 | 17635,1 813 | 3950,1 814 | 15209,1 815 | 19828,1 816 | 16305,1 817 | 10959,1 818 | 18222,1 819 | 10679,1 820 | 15874,1 821 | 2791,0 822 | 1444,1 823 | 2107,0 824 | 18309,1 825 | 4999,1 826 | 18282,1 827 | 11295,1 828 | 7711,1 829 | 11956,1 830 | 19056,1 831 | 7356,1 832 | 11314,1 833 | 921,1 834 | 12673,1 835 | 18494,1 836 | 19880,1 837 | 15913,1 838 | 3236,1 839 | 3546,1 840 | 4726,1 841 | 2155,1 842 | 6471,1 843 | 13268,0 844 | 5021,1 845 | 15586,1 846 | 10194,1 847 | 6222,1 848 | 7487,1 849 | 11746,1 850 | 15653,1 851 | 7614,1 852 | 10631,1 853 | 13078,1 854 | 4490,1 855 | 7933,1 856 | 12350,1 857 | 10397,1 858 | 15006,1 859 | 2432,1 860 | 2929,1 861 | 11761,1 862 | 12888,1 863 | 10528,0 864 | 16977,1 865 | 2016,1 866 | 16181,1 867 | 4180,1 868 | 17792,1 869 | 659,1 870 | 15502,1 871 | 1320,1 872 | 13411,1 873 | 3465,1 874 | 8297,1 875 | 19956,1 876 | 10088,1 877 | 5799,1 878 | 9639,1 879 | 14431,0 880 | 9492,1 881 | 8827,1 882 | 18528,1 883 | 10778,1 884 | 13713,1 885 | 7072,1 886 | 12527,1 887 | 10937,1 888 | 6112,1 889 | 18460,1 890 | 10504,1 891 | 4484,1 892 | 17416,0 893 | 15399,1 894 | 17708,1 895 | 18021,1 896 | 10317,1 897 | 17453,1 898 | 6954,1 899 | 6239,1 900 | 16103,1 901 | 16229,1 902 | 5245,1 903 | 3810,1 904 | 16289,1 905 | 11496,1 906 | 11278,1 907 | 15906,1 908 | 3968,1 909 | 499,1 910 | 8410,0 911 | 11974,1 912 | 12880,1 913 | 11927,1 914 | 13129,1 915 | 16024,1 916 | 8570,1 917 | 619,1 918 | 13488,0 919 | 641,1 920 | 10177,1 921 | 10609,1 922 | 15411,1 923 | 8953,1 924 | 19054,0 925 | 7145,1 926 | 17308,1 927 | 6776,1 928 | 4710,1 929 | 13294,1 930 | 4920,1 931 | 7165,1 932 | 8277,0 933 | 14775,1 934 | 19373,1 935 | 1203,1 936 | 18959,1 937 | 18587,1 938 | 18868,1 939 | 15370,1 940 | 1560,1 941 | 4853,1 942 | 2304,1 943 | 15918,1 944 | 11811,1 945 | 15693,1 946 | 2712,1 947 | 13921,1 948 | 8882,1 949 | 6234,1 950 | 13349,1 951 | 11004,1 952 | 17928,1 953 | 8991,1 954 | 12975,0 955 | 19283,1 956 | 3064,1 957 | 14790,1 958 | 2714,1 959 | 6409,1 960 | 18748,1 961 | 12557,1 962 | 16914,1 963 | 6932,1 964 | 14135,0 965 | 5901,1 966 | 165,1 967 | 10063,1 968 | 12248,1 969 | 3046,1 970 | 3507,1 971 | 18676,1 972 | 19244,1 973 | 13105,1 
974 | 14981,1 975 | 9524,1 976 | 10565,1 977 | 2704,0 978 | 8419,1 979 | 11361,1 980 | 7275,0 981 | 4501,1 982 | 3931,1 983 | 8756,1 984 | 2572,1 985 | 9459,0 986 | 5356,0 987 | 7840,1 988 | 13740,1 989 | 12534,0 990 | 8279,1 991 | 15249,1 992 | 8683,1 993 | 13777,0 994 | 18275,1 995 | 12446,0 996 | 10462,1 997 | 9220,1 998 | 5590,1 999 | 7581,1 1000 | 7272,1 1001 | 2358,1 1002 | -------------------------------------------------------------------------------- /DC-loan-rp/source.R: -------------------------------------------------------------------------------- 1 | library(xgboost) 2 | library(Matrix) 3 | 4 | # read data 5 | train=read.csv('strain_x.csv') 6 | test=read.csv('stest_x.csv') 7 | train.y=read.csv('strain_y.csv') 8 | ft=read.csv('sfeatures_type.csv') 9 | fn.cat=as.character(ft[ft[,2]=='category',1]) 10 | 11 | fn.num=as.character(ft[ft[,2]=='numeric',1]) 12 | 13 | 14 | # create dummy variables 15 | temp.train=data.frame(rep(0,nrow(train))) 16 | temp.test=data.frame(rep(0,nrow(test))) 17 | for(f in fn.cat){ 18 | levels=unique(train[,f]) 19 | col.train=data.frame(factor(train[,f],levels=levels)) 20 | col.test=data.frame(factor(test[,f],levels=levels)) 21 | colnames(col.train)=f 22 | colnames(col.test)=f 23 | temp.train=cbind(temp.train,model.matrix(as.formula(paste0('~',f,'-1')),data=col.train)) 24 | temp.train[,paste0(f,'-1')]=NULL 25 | temp.test=cbind(temp.test,model.matrix(as.formula(paste0('~',f,'-1')),data=col.test)) 26 | temp.test[,paste0(f,'-1')]=NULL 27 | } 28 | temp.train[,1]=NULL 29 | temp.test[,1]=NULL 30 | train.new=Matrix(data.matrix(cbind(train[,c('uid',fn.num)],temp.train)),sparse=T) 31 | test.new=Matrix(data.matrix(cbind(test[,c('uid',fn.num)],temp.test)),sparse=T) 32 | 33 | 34 | # fit xgboost model 35 | 36 | dtrain=xgb.DMatrix(data=train.new[,-1],label=1-train.y$y) 37 | dtest= xgb.DMatrix(data=test.new[,-1]) 38 | 39 | model=xgb.train(booster='gbtree', 40 | objective='binary:logistic', 41 | scale_pos_weight=8.7, 42 | gamma=0, 43 | lambda=1000, 44 | alpha=800, 45 | subsample=0.75, 46 | colsample_bytree=0.30, 47 | min_child_weight=5, 48 | max_depth=8, 49 | eta=0.01, 50 | data=dtrain, 51 | nrounds=1520, 52 | metrics='auc', 53 | nthread=2) 54 | 55 | # predict probabilities 56 | pred=1-predict(model,dtest) 57 | 58 | write.csv(data.frame('uid'=test.new[,1],'score'=pred),file='2015-12-22.csv',row.names=F) -------------------------------------------------------------------------------- /DC-loan-rp/xgb.py: -------------------------------------------------------------------------------- 1 | #/usr/bin/python3 2 | 3 | 4 | import pandas as pd 5 | import xgboost as xgb 6 | import time 7 | from sklearn.cross_validation import train_test_split 8 | import matplotlib.pyplot as plt 9 | import numpy as np 10 | 11 | # set data path 12 | path = 'D:/dataset/rp/' 13 | train_x_csv = path+'train_x.csv' 14 | train_y_csv = path+'train_y.csv' 15 | test_x_csv = path+'test_x.csv' 16 | features_type_csv = path+'features_type.csv' 17 | 18 | # load data 19 | train_x = pd.read_csv(train_x_csv) 20 | train_y = pd.read_csv(train_y_csv) 21 | train_xy = pd.merge(train_x, train_y, on='uid') 22 | test = pd.read_csv(test_x_csv) 23 | test_uid = test.uid 24 | test_x = test.drop(['uid'], axis=1) 25 | 26 | # split train set,generate train,val,test set 27 | train_xy = train_xy.drop(['uid'], axis=1) 28 | train, val = train_test_split(train_xy, test_size=0.35) 29 | y = train.y 30 | X = train.drop(['y'], axis=1) 31 | 32 | def add_data(X,y): 33 | add_X = pd.read_csv(path+'add_X.csv') 34 | add_X = add_X.drop(['uid'], 
axis=1) 35 | add_y = pd.read_csv(path+'add_y.csv') 36 | add_y = add_y.drop(['uid'], axis=1) 37 | add_y = add_y.y 38 | X = pd.concat([X,add_X], axis=0) 39 | y = pd.concat([y,add_y], axis=0) 40 | return X, y 41 | 42 | X, y = add_data(X, y) 43 | 44 | 45 | val_y = val.y 46 | val_X = val.drop(['y'], axis=1) 47 | 48 | # DC-loan-rp start here 49 | dtest = xgb.DMatrix(test_x) 50 | dval = xgb.DMatrix(val_X, label=val_y) 51 | dtrain = xgb.DMatrix(X, label=y) 52 | 53 | params = { 54 | 'booster': 'gbtree', 55 | 'objective': 'binary:logistic', 56 | 'early_stopping_rounds': 100, 57 | 'scale_pos_weight': 0.77, 58 | 'eval_metric': 'auc', 59 | 'gamma': 0.1, 60 | 'min_child_weight': 5, 61 | 'lambda': 700, 62 | 'subsample': 0.7, 63 | 'colsample_bytree': 0.3, 64 | 'max_depth': 8, 65 | 'eta': 0.03, 66 | } 67 | 68 | watchlist = [(dval, 'val'), (dtrain, 'train')] 69 | model = xgb.train(params, dtrain, num_boost_round=5, evals=watchlist) 70 | model.save_model('./xgb.model') 71 | 72 | # predict test set (from the best iteration) 73 | scores = model.predict(dtest, ntree_limit=model.best_ntree_limit) 74 | result = pd.DataFrame({"uid":test_uid, "score":scores}, columns=['uid','score']) 75 | result.to_csv(str(time.time())+'.csv', index=False) 76 | 77 | features = model.get_fscore() 78 | features = sorted(features.items(), key=lambda d:d[1]) 79 | f_df = pd.DataFrame(features, columns=['feature','fscore']) 80 | f_df.to_csv('./feature_score.csv',index=False) 81 | 82 | 83 | 84 | ''' 85 | plt.figure() 86 | import_f = f_df[:10] 87 | import_f.plot(kind='barh', x='feature', y='fscore', legend=False) 88 | plt.title('XGBoost Feature Importance') 89 | plt.xlabel('relative importance') 90 | plt.show() 91 | ''' -------------------------------------------------------------------------------- /DC-loan-rp/xgb_dummy.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | import xgboost as xgb 5 | import pandas as pd 6 | import numpy as np 7 | from sklearn.cross_validation import train_test_split 8 | import time 9 | 10 | 11 | def split_data(): 12 | path = 'D:/dataset/rp/' 13 | small_size = 1000 14 | dtrain = pd.read_csv(path + 'train_x.csv') 15 | labels = pd.read_csv(path + 'train_y.csv') 16 | dtest = pd.read_csv(path + 'test_x.csv') 17 | 18 | dtrain[:small_size].to_csv(path+'small_data/train_x.csv', index=False) 19 | labels[:small_size].to_csv(path+'small_data/train_y.csv', index=False) 20 | dtest[:small_size].to_csv(path+'small_data/test_x.csv', index=False) 21 | return dtrain, labels, dtest 22 | 23 | def load_data(dummy=False): 24 | path = 'D:/dataset/rp/small_data/' 25 | #path = 'D:/dataset/rp/' 26 | train_x = pd.read_csv(path + 'train_x.csv') 27 | train_x = train_x.drop(['uid'], axis=1) 28 | 29 | train_y = pd.read_csv(path + 'train_y.csv') 30 | train_y = train_y.drop(['uid'], axis=1) 31 | 32 | test_x = pd.read_csv(path + 'test_x.csv') 33 | test_uid = test_x.uid 34 | test_x = test_x.drop(['uid'], axis=1) 35 | 36 | if dummy: # 将分类类型的变量转为哑变量 37 | features = pd.read_csv(path + 'features_type.csv') 38 | features_category = features.feature[features.type == 'category'] 39 | encoded = pd.get_dummies(pd.concat([train_x, test_x], axis=0), columns=features_category) 40 | train_rows = train_x.shape[0] 41 | train_x = encoded.iloc[:train_rows, :] 42 | test_x = encoded.iloc[train_rows:, :] 43 | 44 | return train_x, train_y, test_x, test_uid 45 | 46 | def main(): 47 | train_x, train_y, test_x, test_uid = load_data(dummy=True) 48 | 49 | # 交叉验证,分割训练数据集 50 | 
random_seed = 10 51 | X_train, X_val, y_train, y_val= train_test_split(train_x, train_y, test_size=0.33, random_state=2016) 52 | xgb_train = xgb.DMatrix(X_train, label=y_train) 53 | xgb_val = xgb.DMatrix(X_val, label=y_val) 54 | xgb_test = xgb.DMatrix(test_x) 55 | 56 | # 设置xgboost分类器参数 57 | params = { 58 | 'booster': 'gbtree', 59 | 'objective': 'binary:logistic', 60 | 'eval_metric': 'auc', 61 | 'early_stopping_rounds': 100, 62 | 'scale_pos_weight': 0.77, 63 | 'gamma': 0.1, 64 | 'min_child_weight': 5, 65 | 'lambda': 700, 66 | 'subsample': 0.7, 67 | 'colsample_bytree': 0.3, 68 | 'max_depth': 8, 69 | 'eta': 0.03, 70 | 'nthread': 4 71 | } 72 | watchlist = [(xgb_val, 'test'), (xgb_train, 'train')] 73 | num_round = 10 74 | bst = xgb.train(params, xgb_train, num_boost_round=num_round, evals=watchlist) 75 | bst.save_model('./xgb.model') 76 | 77 | scores = bst.predict(xgb_test, ntree_limit=bst.best_ntree_limit) 78 | result = pd.DataFrame({"uid":test_uid, "score":scores}, columns=['uid','score']) 79 | result.to_csv('dummy_'+str(time.time())+'.csv', index=False) 80 | 81 | 82 | features = bst.get_fscore() 83 | features = sorted(features.items(), key=lambda d:d[1]) 84 | f_df = pd.DataFrame(features, columns=['feature','fscore']) 85 | f_df.to_csv('./feature_score.csv',index=False) 86 | 87 | ''' 88 | plt.figure() 89 | import_f = f_df[:10] 90 | import_f.plot(kind='barh', x='feature', y='fscore', legend=False) 91 | plt.title('XGBoost Feature Importance') 92 | plt.xlabel('relative importance') 93 | plt.show() 94 | ''' 95 | 96 | 97 | if __name__ == '__main__': 98 | main() 99 | -------------------------------------------------------------------------------- /Kaggle-bag-of-words/BagOfWords_LR.py: -------------------------------------------------------------------------------- 1 | import os 2 | 3 | from Kaggle_bag_of_words.KaggleWord2VecUtility import KaggleWord2VecUtility 4 | from sklearn.feature_extraction.text import TfidfVectorizer 5 | from sklearn.linear_model import LogisticRegression 6 | from sklearn import cross_validation 7 | import pandas as pd 8 | import numpy as np 9 | 10 | path = 'D:/dataset/word2vec/' 11 | train = pd.read_csv(path+'labeledTrainData.tsv', header=0, delimiter="\t", quoting=3) 12 | test = pd.read_csv(path+'testData.tsv', header=0, delimiter="\t", quoting=3 ) 13 | y = train["sentiment"] 14 | 15 | print("Cleaning and parsing movie reviews...\n") 16 | traindata = [] 17 | for i in range( 0, len(train["review"])): 18 | traindata.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(train["review"][i], False))) 19 | testdata = [] 20 | for i in range(0,len(test["review"])): 21 | testdata.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(test["review"][i], False))) 22 | 23 | print('vectorizing... ') 24 | tfv = TfidfVectorizer(min_df=3, max_features=None, 25 | strip_accents='unicode', analyzer='word',token_pattern=r'\w{1,}', 26 | ngram_range=(1, 2), use_idf=1,smooth_idf=1,sublinear_tf=1, 27 | stop_words = 'english') 28 | X_all = traindata + testdata 29 | lentrain = len(traindata) 30 | 31 | print("fitting pipeline... 
") 32 | tfv.fit(X_all) 33 | X_all = tfv.transform(X_all) 34 | 35 | X = X_all[:lentrain] 36 | X_test = X_all[lentrain:] 37 | 38 | model = LogisticRegression(penalty='l2', dual=True, tol=0.0001, 39 | C=1, fit_intercept=True, intercept_scaling=1.0, 40 | class_weight=None, random_state=None) 41 | print(("20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(model, X, y, cv=20, scoring='roc_auc')) )) 42 | 43 | print("Retrain on all training data, predicting test labels...\n") 44 | model.fit(X,y) 45 | result = model.predict_proba(X_test)[:,1] 46 | output = pd.DataFrame( data={"id":test["id"], "sentiment":result} ) 47 | 48 | # Use pandas to write the comma-separated output file 49 | output.to_csv('out/Bag_of_Words_model_LR.csv', index=False, quoting=3) 50 | print("Wrote results to Bag_of_Words_model_LR.csv") -------------------------------------------------------------------------------- /Kaggle-bag-of-words/BagOfWords_RF.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Author: Angela Chapman 4 | # Date: 8/6/2014 5 | # 6 | # This file contains code to accompany the Kaggle tutorial 7 | # "Deep learning goes to the movies". The code in this file 8 | # is for Part 1 of the tutorial on Natural Language Processing. 9 | # 10 | # *************************************** # 11 | 12 | 13 | from sklearn.feature_extraction.text import CountVectorizer 14 | from sklearn.ensemble import RandomForestClassifier 15 | from Kaggle_bag_of_words.KaggleWord2VecUtility import KaggleWord2VecUtility 16 | import pandas as pd 17 | 18 | 19 | if __name__ == '__main__': 20 | path = 'D:/dataset/word2vec/' 21 | train = pd.read_csv(path+'labeledTrainData.tsv', header=0, delimiter="\t", quoting=3) 22 | test = pd.read_csv(path+'testData.tsv', header=0, delimiter="\t", quoting=3 ) 23 | 24 | print('The first review is:') 25 | print((train["review"][0])) 26 | 27 | input("Press Enter to continue...") 28 | 29 | 30 | print('Download text data sets. If you already have NLTK datasets downloaded, just close the Python download window...') 31 | #nltk.download() # Download text data sets, including stop words 32 | 33 | # Initialize an empty list to hold the clean reviews 34 | clean_train_reviews = [] 35 | 36 | # Loop over each review; create an index i that goes from 0 to the length 37 | # of the movie review list 38 | 39 | print("Cleaning and parsing the training set movie reviews...\n") 40 | for i in range( 0, len(train["review"])): 41 | clean_train_reviews.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(train["review"][i], False))) 42 | 43 | # ****** Create a bag of words from the training set 44 | # 45 | print("Creating the bag of words...\n") 46 | 47 | 48 | # Initialize the "CountVectorizer" object, which is scikit-learn's 49 | # bag of words tool. 50 | vectorizer = CountVectorizer(analyzer = "word", 51 | tokenizer = None, 52 | preprocessor = None, 53 | stop_words = None, 54 | max_features = 5000) 55 | 56 | # fit_transform() does two functions: First, it fits the model 57 | # and learns the vocabulary; second, it transforms our training data 58 | # into feature vectors. The data to fit_transform should be a list of 59 | # strings. 
60 | train_data_features = vectorizer.fit_transform(clean_train_reviews) 61 | 62 | # Numpy arrays are easy to work with, so convert the result to an 63 | # array 64 | train_data_features = train_data_features.toarray() 65 | 66 | # ******* Train a random forest using the bag of words 67 | # 68 | print("Training the random forest (this may take a while)...") 69 | 70 | 71 | # Initialize a Random Forest classifier with 100 trees 72 | forest = RandomForestClassifier(n_estimators = 100) 73 | 74 | # Fit the forest to the training set, using the bag of words as 75 | # features and the sentiment labels as the response variable 76 | # 77 | # This may take a few minutes to run 78 | forest = forest.fit( train_data_features, train["sentiment"] ) 79 | 80 | 81 | 82 | # Create an empty list and append the clean reviews one by one 83 | clean_test_reviews = [] 84 | 85 | print("Cleaning and parsing the test set movie reviews...\n") 86 | for i in range(0,len(test["review"])): 87 | clean_test_reviews.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(test["review"][i], False))) 88 | 89 | # Get a bag of words for the test set, and convert to a numpy array 90 | test_data_features = vectorizer.transform(clean_test_reviews) 91 | test_data_features = test_data_features.toarray() 92 | 93 | # Use the random forest to make sentiment label predictions 94 | print("Predicting test labels...\n") 95 | result = forest.predict(test_data_features) 96 | 97 | # Copy the results to a pandas dataframe with an "id" column and 98 | # a "sentiment" column 99 | output = pd.DataFrame( data={"id":test["id"], "sentiment":result} ) 100 | 101 | # Use pandas to write the comma-separated output file 102 | output.to_csv('out/Bag_of_Words_model_RF.csv', index=False, quoting=3) 103 | print("Wrote results to Bag_of_Words_model_RF.csv") 104 | 105 | 106 | -------------------------------------------------------------------------------- /Kaggle-bag-of-words/KaggleWord2VecUtility.py: -------------------------------------------------------------------------------- 1 | import re 2 | import nltk 3 | 4 | import pandas as pd 5 | import numpy as np 6 | 7 | from bs4 import BeautifulSoup 8 | from nltk.corpus import stopwords 9 | 10 | 11 | class KaggleWord2VecUtility(object): 12 | """KaggleWord2VecUtility is a utility class for processing raw HTML text into segments for further learning""" 13 | 14 | @staticmethod 15 | def review_to_wordlist( review, remove_stopwords=False ): 16 | # Function to convert a document to a sequence of words, 17 | # optionally removing stop words. Returns a list of words. 18 | # 19 | # 1. Remove HTML 20 | review_text = BeautifulSoup(review, "lxml").get_text() 21 | # 22 | # 2. Remove non-letters 23 | review_text = re.sub("[^a-zA-Z]"," ", review_text) 24 | # 25 | # 3. Convert words to lower case and split them 26 | words = review_text.lower().split() 27 | # 28 | # 4. Optionally remove stop words (false by default) 29 | if remove_stopwords: 30 | stops = set(stopwords.words("english")) 31 | words = [w for w in words if not w in stops] 32 | # 33 | # 5. Return a list of words 34 | return(words) 35 | 36 | # Define a function to split a review into parsed sentences 37 | @staticmethod 38 | def review_to_sentences( review, tokenizer, remove_stopwords=False ): 39 | # Function to split a review into parsed sentences. Returns a 40 | # list of sentences, where each sentence is a list of words 41 | # 42 | # 1. 
Use the NLTK tokenizer to split the paragraph into sentences 43 | raw_sentences = tokenizer.tokenize(review.strip()) 44 | # 45 | # 2. Loop over each sentence 46 | sentences = [] 47 | for raw_sentence in raw_sentences: 48 | # If a sentence is empty, skip it 49 | if len(raw_sentence) > 0: 50 | # Otherwise, call review_to_wordlist to get a list of words 51 | sentences.append( KaggleWord2VecUtility.review_to_wordlist( raw_sentence, \ 52 | remove_stopwords )) 53 | # 54 | # Return the list of sentences (each sentence is a list of words, 55 | # so this returns a list of lists) 56 | return sentences -------------------------------------------------------------------------------- /Kaggle-bag-of-words/README.md: -------------------------------------------------------------------------------- 1 | ## Use Google's Word2Vec for movie reviews 2 | 3 | In this tutorial competition, we dig a little "deeper" into sentiment analysis. Google's Word2Vec is a deep-learning inspired method that focuses on the meaning of words. Word2Vec attempts to understand meaning and semantic relationships among words. It works in a way that is similar to deep approaches, such as recurrent neural nets or deep neural nets, but is computationally more efficient. This tutorial focuses on Word2Vec for sentiment analysis. 4 | 5 | Sentiment analysis is a challenging subject in machine learning. People express their emotions in language that is often obscured by sarcasm, ambiguity, and plays on words, all of which could be very misleading for both humans and computers. There's another Kaggle competition for movie review sentiment analysis. In this tutorial we explore how Word2Vec can be applied to a similar problem. 6 | 7 | Deep learning has been in the news a lot over the past few years, even making it to the front page of the New York Times. These machine learning techniques, inspired by the architecture of the human brain and made possible by recent advances in computing power, have been making waves via breakthrough results in image recognition, speech processing, and natural language tasks. Recently, deep learning approaches won several Kaggle competitions, including a drug discovery task, and cat and dog image recognition.
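
The competition notes further down this README describe the bag-of-words baseline that worked best here: TF-IDF on 1-3 grams, chi-square feature selection down to 200,000 features, and a linear SVC, evaluated by AUC. The sketch below illustrates that pipeline only; the data path, the 5-fold split, and the use of the current `sklearn.model_selection` API (the scripts in this folder use the older `sklearn.cross_validation` module) are assumptions for illustration, not the exact code behind the reported scores.

```python
# Minimal sketch of the TF-IDF + chi-square + LinearSVC baseline described in
# the notes below. Paths and parameter values are illustrative assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

train = pd.read_csv('labeledTrainData.tsv', header=0, delimiter="\t", quoting=3)

pipeline = Pipeline([
    # 1-3 gram TF-IDF features, as in the notes below
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 3),
                              sublinear_tf=True, min_df=2)),
    # keep the 200,000 n-grams with the highest chi-square score
    ('chi2', SelectKBest(chi2, k=200000)),
    ('svc', LinearSVC()),
])

# The 'roc_auc' scorer computes AUC from LinearSVC's decision_function
scores = cross_val_score(pipeline, train['review'], train['sentiment'],
                         cv=5, scoring='roc_auc')
print('5-fold AUC: %.4f' % scores.mean())
```

The same notes report that the final score came from averaging the probability outputs of several diverse models (simple average stacking), so a linear pipeline like this one is only the single-model starting point.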
8 | 9 | 10 | ## Data Set Description 11 | 12 | 13 | 14 | # Introduction to Word2Vec 15 | 16 | Word2vec is an efficient tool open-sourced by Google in mid-2013 for representing words as real-valued vectors. Drawing on ideas from deep learning, it reduces the processing of text content to vector operations in a K-dimensional vector space, and similarity in that vector space can be used to represent semantic similarity between texts. The word vectors produced by Word2vec can be used for many NLP tasks, such as clustering, finding synonyms, and part-of-speech analysis. Looked at another way, if words are treated as features, Word2vec maps those features into a K-dimensional vector space and can provide a deeper feature representation for text data. 17 | 18 | Word2vec uses the distributed representation of word vectors, first proposed by Hinton in 1986 [4]. The basic idea is to train a mapping from every word to a K-dimensional real-valued vector (K is usually a hyperparameter of the model) and to judge the semantic similarity of words by the distance between their vectors (for example, cosine similarity or Euclidean distance). The model is a three-layer neural network: input layer, hidden layer, and output layer. A core technique is frequency-based Huffman coding, which makes the hidden-layer activations of words with similar frequencies nearly identical; the more frequent a word is, the fewer hidden units it activates, which effectively lowers the computational complexity. A major reason for Word2vec's popularity is precisely this efficiency: Mikolov notes in [2] that an optimized single-machine version can train on over a hundred billion words in a day. 19 | 20 | The three-layer network itself models a language model, but as a by-product it also yields a representation of words in a vector space, and that by-product is the real goal of Word2vec. 21 | 22 | Compared with the classical pipelines of Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA), Word2vec exploits each word's context, so the semantic information it captures is richer. 23 | 24 | 25 | 26 | 27 | * [Word2Vec: deep representation models for text](http://wei-li.cnblogs.com/p/word2vec.html) 28 | * [Deep learning word2vec notes: fundamentals](http://blog.csdn.net/mytestmy/article/details/26961315) 29 | * [Deep learning word2vec notes: the algorithm](http://blog.csdn.net/mytestmy/article/details/26969149) 30 | * [A bag-of-words text classification tutorial on Kaggle data](http://www.csdn.net/article/1970-01-01/2825782) 31 | * [New approaches to sentiment analysis with Word2Vec/Doc2Vec/Python](http://datartisan.com/article/detail/48.html) 32 | http://nbviewer.jupyter.org/github/MatthieuBizien/Bag-popcorn/blob/master/Kaggle-Word2Vec.ipynb 33 | *************** 34 | 35 | This is a binary text sentiment classification problem: 25,000 labeled training samples with a single raw-text feature, "review". 36 | The evaluation metric is AUC, so the submission needs to contain probabilities; I fell into that trap at first and my score would not improve. 37 | The competition provides a tutorial on using word2vec for binary classification, which is good introductory material. 38 | I did not use word embeddings; I trained directly on BOW and n-gram features, which worked reasonably well. Fusing embedding features would be worth trying later. 39 | For the raw text I used TfidfVectorizer(stop_words='english', ngram_range=(1,3), sublinear_tf=True, min_df=2), 40 | followed by chi-square feature selection; after cross-validation I settled on 200,000 features. 41 | For single models I tried GBRT/NB/LR/linear SVC. 42 | GBRT usually struggles on high-dimensional sparse data, but it did not perform badly here. 43 | NB (MultinomialNB) was not as impressive as expected either. 44 | Ranked by score: linear SVC (0.95601) > LR (0.94823) > GBRT (0.94173) > NB (0.93693), so linear SVMs are still very strong on text. 45 | I then generated topic features with LDA; I had high hopes, but the best single-model AUC with them was only 0.93024. 46 | Topic features alone did not help, but fused with BOW they did work. 47 | Further experiments confirmed that linear SVC remained the best model after feature fusion, with 500 LDA topics and, notably, better results without removing stop words: AUC 0.95998. 48 | With no time left for more single-model work, the last resort was model ensembling; the guiding principle is to keep the models as diverse as possible rather than individually best. 49 | In the end I took the outputs of 5 reasonably good models and did average stacking, reaching AUC 0.96115 and 63rd place. 50 | On the private LB I dropped to 71st; fusing word embeddings would have been worth trying, but there was no time to dig further. 51 | 52 | 53 | 54 | 55 | http://cs.stanford.edu/~quocle/paragraph_vector.pdf 56 | * https://cs224d.stanford.edu/reports/SadeghianAmir.pdf 57 | * Uses simple TF-IDF features and a simple multinomial Bayes method for classification. 58 | http://nbviewer.ipython.org/github/jmsteinw/Notebooks/blob/master/NLP_Movies.ipynb -------------------------------------------------------------------------------- /Kaggle-bag-of-words/Word2Vec_AverageVectors.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Author: Angela Chapman 4 | # Date: 8/6/2014 5 | # 6 | # This file contains code to accompany the Kaggle tutorial 7 | # "Deep learning goes to the movies". The code in this file 8 | # is for Parts 2 and 3 of the tutorial, which cover how to 9 | # train a model using Word2Vec. 
10 | # 11 | # *************************************** # 12 | 13 | 14 | # ****** Read the two training sets and the test set 15 | # 16 | import pandas as pd 17 | import os 18 | from nltk.corpus import stopwords 19 | import nltk.data 20 | import logging 21 | import numpy as np # Make sure that numpy is imported 22 | from gensim.models import Word2Vec 23 | from sklearn.ensemble import RandomForestClassifier 24 | 25 | from Kaggle_bag_of_words.KaggleWord2VecUtility import KaggleWord2VecUtility 26 | 27 | 28 | # ****** Define functions to create average word vectors 29 | # 30 | 31 | def makeFeatureVec(words, model, num_features): 32 | # Function to average all of the word vectors in a given 33 | # paragraph 34 | # 35 | # Pre-initialize an empty numpy array (for speed) 36 | featureVec = np.zeros((num_features,),dtype="float32") 37 | # 38 | nwords = 0. 39 | # 40 | # Index2word is a list that contains the names of the words in 41 | # the model's vocabulary. Convert it to a set, for speed 42 | index2word_set = set(model.index2word) 43 | # 44 | # Loop over each word in the review and, if it is in the model's 45 | # vocaublary, add its feature vector to the total 46 | for word in words: 47 | if word in index2word_set: 48 | nwords = nwords + 1. 49 | featureVec = np.add(featureVec,model[word]) 50 | # 51 | # Divide the result by the number of words to get the average 52 | featureVec = np.divide(featureVec,nwords) 53 | return featureVec 54 | 55 | 56 | def getAvgFeatureVecs(reviews, model, num_features): 57 | # Given a set of reviews (each one a list of words), calculate 58 | # the average feature vector for each one and return a 2D numpy array 59 | # 60 | # Initialize a counter 61 | counter = 0. 62 | # 63 | # Preallocate a 2D numpy array, for speed 64 | reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32") 65 | # 66 | # Loop through the reviews 67 | for review in reviews: 68 | # 69 | # Print a status message every 1000th review 70 | if counter%1000. == 0.: 71 | print("Review %d of %d" % (counter, len(reviews))) 72 | # 73 | # Call the function (defined above) that makes average feature vectors 74 | reviewFeatureVecs[counter] = makeFeatureVec(review, model, \ 75 | num_features) 76 | # 77 | # Increment the counter 78 | counter = counter + 1. 
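# NOTE (added): "counter" is a float above; NumPy versions newer than the one this
# tutorial targeted reject float indices, so reviewFeatureVecs[counter] may need
# an integer counter (e.g. int(counter)) to run on current installations.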
79 | return reviewFeatureVecs 80 | 81 | 82 | def getCleanReviews(reviews): 83 | clean_reviews = [] 84 | for review in reviews["review"]: 85 | clean_reviews.append( KaggleWord2VecUtility.review_to_wordlist( review, remove_stopwords=True )) 86 | return clean_reviews 87 | 88 | 89 | 90 | if __name__ == '__main__': 91 | 92 | # Read data from files 93 | train = pd.read_csv( os.path.join(os.path.dirname(__file__), 'data', 'labeledTrainData.tsv'), header=0, delimiter="\t", quoting=3 ) 94 | test = pd.read_csv(os.path.join(os.path.dirname(__file__), 'data', 'testData.tsv'), header=0, delimiter="\t", quoting=3 ) 95 | unlabeled_train = pd.read_csv( os.path.join(os.path.dirname(__file__), 'data', "unlabeledTrainData.tsv"), header=0, delimiter="\t", quoting=3 ) 96 | 97 | # Verify the number of reviews that were read (100,000 in total) 98 | print("Read %d labeled train reviews, %d labeled test reviews, " \ 99 | "and %d unlabeled reviews\n" % (train["review"].size, 100 | test["review"].size, unlabeled_train["review"].size )) 101 | 102 | 103 | 104 | # Load the punkt tokenizer 105 | tokenizer = nltk.data.load('tokenizers/punkt/english.pickle') 106 | 107 | 108 | 109 | # ****** Split the labeled and unlabeled training sets into clean sentences 110 | # 111 | sentences = [] # Initialize an empty list of sentences 112 | 113 | print("Parsing sentences from training set") 114 | for review in train["review"]: 115 | sentences += KaggleWord2VecUtility.review_to_sentences(review, tokenizer) 116 | 117 | print("Parsing sentences from unlabeled set") 118 | for review in unlabeled_train["review"]: 119 | sentences += KaggleWord2VecUtility.review_to_sentences(review, tokenizer) 120 | 121 | # ****** Set parameters and train the word2vec model 122 | # 123 | # Import the built-in logging module and configure it so that Word2Vec 124 | # creates nice output messages 125 | logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\ 126 | level=logging.INFO) 127 | 128 | # Set values for various parameters 129 | num_features = 300 # Word vector dimensionality 130 | min_word_count = 40 # Minimum word count 131 | num_workers = 4 # Number of threads to run in parallel 132 | context = 10 # Context window size 133 | downsampling = 1e-3 # Downsample setting for frequent words 134 | 135 | # Initialize and train the model (this will take some time) 136 | print("Training Word2Vec model...") 137 | model = Word2Vec(sentences, workers=num_workers, \ 138 | size=num_features, min_count = min_word_count, \ 139 | window = context, sample = downsampling, seed=1) 140 | 141 | # If you don't plan to train the model any further, calling 142 | # init_sims will make the model much more memory-efficient. 143 | model.init_sims(replace=True) 144 | 145 | # It can be helpful to create a meaningful model name and 146 | # save the model for later use. 
You can load it later using Word2Vec.load() 147 | model_name = "300features_40minwords_10context" 148 | model.save(model_name) 149 | 150 | model.doesnt_match("man woman child kitchen".split()) 151 | model.doesnt_match("france england germany berlin".split()) 152 | model.doesnt_match("paris berlin london austria".split()) 153 | model.most_similar("man") 154 | model.most_similar("queen") 155 | model.most_similar("awful") 156 | 157 | 158 | 159 | # ****** Create average vectors for the training and test sets 160 | # 161 | print("Creating average feature vecs for training reviews") 162 | 163 | trainDataVecs = getAvgFeatureVecs( getCleanReviews(train), model, num_features ) 164 | 165 | print("Creating average feature vecs for test reviews") 166 | 167 | testDataVecs = getAvgFeatureVecs( getCleanReviews(test), model, num_features ) 168 | 169 | 170 | # ****** Fit a random forest to the training set, then make predictions 171 | # 172 | # Fit a random forest to the training data, using 100 trees 173 | forest = RandomForestClassifier( n_estimators = 100 ) 174 | 175 | print("Fitting a random forest to labeled training data...") 176 | forest = forest.fit( trainDataVecs, train["sentiment"] ) 177 | 178 | # Test & extract results 179 | result = forest.predict( testDataVecs ) 180 | 181 | # Write the test results 182 | output = pd.DataFrame( data={"id":test["id"], "sentiment":result} ) 183 | output.to_csv( "Word2Vec_AverageVectors.csv", index=False, quoting=3 ) 184 | print("Wrote Word2Vec_AverageVectors.csv") 185 | -------------------------------------------------------------------------------- /Kaggle-bag-of-words/Word2Vec_BagOfCentroids.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Author: Angela Chapman 4 | # Date: 8/6/2014 5 | # 6 | # This file contains code to accompany the Kaggle tutorial 7 | # "Deep learning goes to the movies". The code in this file 8 | # is for Part 2 of the tutorial and covers Bag of Centroids 9 | # for a Word2Vec model. This code assumes that you have already 10 | # run Word2Vec and saved a model called "300features_40minwords_10context" 11 | # 12 | # *************************************** # 13 | 14 | 15 | # Load a pre-trained model 16 | from gensim.models import Word2Vec 17 | from sklearn.cluster import KMeans 18 | import time 19 | import pandas as pd 20 | from sklearn.ensemble import RandomForestClassifier 21 | from bs4 import BeautifulSoup 22 | import re 23 | from nltk.corpus import stopwords 24 | import numpy as np 25 | import os 26 | from Kaggle_bag_of_words.KaggleWord2VecUtility import KaggleWord2VecUtility 27 | 28 | 29 | # Define a function to create bags of centroids 30 | # 31 | def create_bag_of_centroids( wordlist, word_centroid_map ): 32 | # 33 | # The number of clusters is equal to the highest cluster index 34 | # in the word / centroid map 35 | num_centroids = max( word_centroid_map.values() ) + 1 36 | # 37 | # Pre-allocate the bag of centroids vector (for speed) 38 | bag_of_centroids = np.zeros( num_centroids, dtype="float32" ) 39 | # 40 | # Loop over the words in the review. 
If the word is in the vocabulary, 41 | # find which cluster it belongs to, and increment that cluster count 42 | # by one 43 | for word in wordlist: 44 | if word in word_centroid_map: 45 | index = word_centroid_map[word] 46 | bag_of_centroids[index] += 1 47 | # 48 | # Return the "bag of centroids" 49 | return bag_of_centroids 50 | 51 | 52 | if __name__ == '__main__': 53 | 54 | model = Word2Vec.load("300features_40minwords_10context") 55 | 56 | 57 | # ****** Run k-means on the word vectors and print a few clusters 58 | # 59 | 60 | start = time.time() # Start time 61 | 62 | # Set "k" (num_clusters) to be 1/5th of the vocabulary size, or an 63 | # average of 5 words per cluster 64 | word_vectors = model.syn0 65 | num_clusters = word_vectors.shape[0] / 5 66 | 67 | # Initalize a k-means object and use it to extract centroids 68 | print("Running K means") 69 | kmeans_clustering = KMeans( n_clusters = num_clusters ) 70 | idx = kmeans_clustering.fit_predict( word_vectors ) 71 | 72 | # Get the end time and print how long the process took 73 | end = time.time() 74 | elapsed = end - start 75 | print(("Time taken for K Means clustering: ", elapsed, "seconds.")) 76 | 77 | 78 | # Create a Word / Index dictionary, mapping each vocabulary word to 79 | # a cluster number 80 | word_centroid_map = dict(list(zip( model.index2word, idx ))) 81 | 82 | # Print the first ten clusters 83 | for cluster in range(0,10): 84 | # 85 | # Print the cluster number 86 | print(("\nCluster %d" % cluster)) 87 | # 88 | # Find all of the words for that cluster number, and print them out 89 | words = [] 90 | for i in range(0,len(list(word_centroid_map.values()))): 91 | if( list(word_centroid_map.values())[i] == cluster ): 92 | words.append(list(word_centroid_map.keys())[i]) 93 | print(words) 94 | 95 | 96 | 97 | 98 | # Create clean_train_reviews and clean_test_reviews as we did before 99 | # 100 | 101 | # Read data from files 102 | train = pd.read_csv( os.path.join(os.path.dirname(__file__), 'data', 'labeledTrainData.tsv'), header=0, delimiter="\t", quoting=3 ) 103 | test = pd.read_csv(os.path.join(os.path.dirname(__file__), 'data', 'testData.tsv'), header=0, delimiter="\t", quoting=3 ) 104 | 105 | 106 | print("Cleaning training reviews") 107 | clean_train_reviews = [] 108 | for review in train["review"]: 109 | clean_train_reviews.append( KaggleWord2VecUtility.review_to_wordlist( review, \ 110 | remove_stopwords=True )) 111 | 112 | print("Cleaning test reviews") 113 | clean_test_reviews = [] 114 | for review in test["review"]: 115 | clean_test_reviews.append( KaggleWord2VecUtility.review_to_wordlist( review, \ 116 | remove_stopwords=True )) 117 | 118 | 119 | # ****** Create bags of centroids 120 | # 121 | # Pre-allocate an array for the training set bags of centroids (for speed) 122 | train_centroids = np.zeros( (train["review"].size, num_clusters), \ 123 | dtype="float32" ) 124 | 125 | # Transform the training set reviews into bags of centroids 126 | counter = 0 127 | for review in clean_train_reviews: 128 | train_centroids[counter] = create_bag_of_centroids( review, \ 129 | word_centroid_map ) 130 | counter += 1 131 | 132 | # Repeat for test reviews 133 | test_centroids = np.zeros(( test["review"].size, num_clusters), \ 134 | dtype="float32" ) 135 | 136 | counter = 0 137 | for review in clean_test_reviews: 138 | test_centroids[counter] = create_bag_of_centroids( review, \ 139 | word_centroid_map ) 140 | counter += 1 141 | 142 | 143 | # ****** Fit a random forest and extract predictions 144 | # 145 | forest = 
RandomForestClassifier(n_estimators = 100) 146 | 147 | # Fitting the forest may take a few minutes 148 | print("Fitting a random forest to labeled training data...") 149 | forest = forest.fit(train_centroids,train["sentiment"]) 150 | result = forest.predict(test_centroids) 151 | 152 | # Write the test results 153 | output = pd.DataFrame(data={"id":test["id"], "sentiment":result}) 154 | output.to_csv("BagOfCentroids.csv", index=False, quoting=3) 155 | print("Wrote BagOfCentroids.csv") 156 | -------------------------------------------------------------------------------- /Kaggle-bag-of-words/generate_d2v.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import logging 3 | import os.path 4 | import pandas as pd 5 | import numpy as np 6 | 7 | from KaggleWord2VecUtility import KaggleWord2VecUtility 8 | 9 | from gensim.models import Doc2Vec 10 | from gensim.models.doc2vec import LabeledSentence 11 | 12 | 13 | def getFeatureVecs(reviews, model, num_features): 14 | reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32") 15 | counter = -1 16 | 17 | for review in reviews: 18 | counter += 1 19 | try: 20 | reviewFeatureVecs[counter] = np.array(model[review.labels[0]]).reshape((1, num_features)) 21 | except: 22 | continue 23 | return reviewFeatureVecs 24 | 25 | 26 | def getCleanLabeledReviews(reviews): 27 | clean_reviews = [] 28 | for review in reviews["review"]: 29 | clean_reviews.append(KaggleWord2VecUtility.review_to_wordlist(review)) 30 | 31 | labelized = [] 32 | for i, id_label in enumerate(reviews["id"]): 33 | labelized.append(LabeledSentence(clean_reviews[i], [id_label])) 34 | return labelized 35 | 36 | 37 | 38 | if __name__ == '__main__': 39 | train = pd.read_csv('../data/labeledTrainData.tsv', header=0, delimiter="\t", quoting=3) 40 | test = pd.read_csv('../data/testData.tsv', header=0, delimiter="\t", quoting=3) 41 | unsup = pd.read_csv('../data/unlabeledTrainData.tsv', header=0, delimiter="\t", quoting=3 ) 42 | 43 | print "Cleaning and labeling all data sets...\n" 44 | 45 | train_reviews = getCleanLabeledReviews(train) 46 | test_reviews = getCleanLabeledReviews(test) 47 | unsup_reviews = getCleanLabeledReviews(unsup) 48 | 49 | n_dim =5000 50 | 51 | model_dm_name = "%dfeatures_1minwords_10context_dm" % n_dim 52 | model_dbow_name = "%dfeatures_1minwords_10context_dbow" % n_dim 53 | 54 | 55 | 56 | if not os.path.exists(model_dm_name) or not os.path.exists(model_dbow_name): 57 | logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\ 58 | level=logging.INFO) 59 | 60 | num_features = n_dim # Word vector dimensionality 61 | min_word_count = 1 # Minimum word count, if bigger, some sentences may be missing 62 | num_workers = 4 # Number of threads to run in parallel 63 | context = 10 # Context window size 64 | downsampling = 1e-3 # Downsample setting for frequent words 65 | 66 | print "Training Doc2Vec model..." 
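# Two Paragraph Vector variants are trained: model_dm keeps the default
# distributed-memory architecture (dm=1), while model_dbow sets dm=0 for the
# distributed bag-of-words architecture; predict.py later concatenates the
# vectors produced by both models.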
67 | model_dm = Doc2Vec(min_count=min_word_count, window=context, size=num_features, \ 68 | sample=downsampling, workers=num_workers) 69 | model_dbow = Doc2Vec(min_count=min_word_count, window=context, size=num_features, 70 | sample=downsampling, workers=num_workers, dm=0) 71 | 72 | all_reviews = np.concatenate((train_reviews, test_reviews, unsup_reviews)) 73 | model_dm.build_vocab(all_reviews) 74 | model_dbow.build_vocab(all_reviews) 75 | 76 | for epoch in range(10): 77 | perm = np.random.permutation(all_reviews.shape[0]) 78 | model_dm.train(all_reviews[perm]) 79 | model_dbow.train(all_reviews[perm]) 80 | 81 | model_dm.save(model_dm_name) 82 | model_dbow.save(model_dbow_name) 83 | -------------------------------------------------------------------------------- /Kaggle-bag-of-words/generate_w2v.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import pandas as pd 3 | import nltk.data 4 | import logging 5 | import os.path 6 | import numpy as np 7 | 8 | from KaggleWord2VecUtility import KaggleWord2VecUtility 9 | 10 | from gensim.models import Word2Vec 11 | 12 | 13 | def makeFeatureVec(words, model, num_features): 14 | featureVec = np.zeros((num_features,),dtype="float32") 15 | nwords = 0 16 | 17 | index2word_set = set(model.index2word) 18 | for word in words: 19 | if word in index2word_set: 20 | nwords = nwords + 1 21 | featureVec = np.add(featureVec,model[word]) 22 | 23 | if nwords != 0: 24 | featureVec /= nwords 25 | return featureVec 26 | 27 | 28 | def getAvgFeatureVecs(reviews, model, num_features): 29 | reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32") 30 | counter = 0 31 | 32 | for review in reviews: 33 | reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features) 34 | counter = counter + 1 35 | return reviewFeatureVecs 36 | 37 | 38 | def getCleanReviews(reviews): 39 | clean_reviews = [] 40 | for review in reviews["review"]: 41 | clean_reviews.append( KaggleWord2VecUtility.review_to_wordlist( review, remove_stopwords=True )) 42 | return clean_reviews 43 | 44 | 45 | if __name__ == '__main__': 46 | train = pd.read_csv('../data/labeledTrainData.tsv', header=0, delimiter="\t", quoting=3) 47 | #test = pd.read_csv('../data/testData.tsv', header=0, delimiter="\t", quoting=3) 48 | unsup = pd.read_csv('../data/unlabeledTrainData.tsv', header=0, delimiter="\t", quoting=3 ) 49 | 50 | n_dim = 5000 51 | 52 | model_name = "%dfeatures_40minwords_10context" % n_dim 53 | 54 | 55 | if not os.path.exists(model_name): 56 | tokenizer = nltk.data.load('tokenizers/punkt/english.pickle') 57 | 58 | sentences = [] 59 | 60 | print "Parsing sentences from training set" 61 | for review in train["review"]: 62 | sentences += KaggleWord2VecUtility.review_to_sentences(review, tokenizer) 63 | 64 | print "Parsing sentences from unlabeled set" 65 | for review in unsup["review"]: 66 | sentences += KaggleWord2VecUtility.review_to_sentences(review, tokenizer) 67 | 68 | 69 | logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\ 70 | level=logging.INFO) 71 | 72 | 73 | num_features = n_dim # Word vector dimensionality 74 | min_word_count = 5 # Minimum word count 75 | num_workers = 4 # Number of threads to run in parallel 76 | context = 10 # Context window size 77 | downsampling = 1e-3 # Downsample setting for frequent words 78 | 79 | print "Training Word2Vec model..." 
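# Note (added): n_dim is 5000 here, versus the 300 features used in
# Word2Vec_AverageVectors.py, so training is far slower and the saved model is
# much larger; since sg is not set, gensim's default CBOW architecture is used.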
80 | model = Word2Vec(sentences, workers=num_workers, \ 81 | size=num_features, min_count = min_word_count, \ 82 | window = context, sample = downsampling, seed=1) 83 | 84 | model.init_sims(replace=True) 85 | model.save(model_name) -------------------------------------------------------------------------------- /Kaggle-bag-of-words/nbsvm.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | from collections import Counter 4 | 5 | from KaggleWord2VecUtility import KaggleWord2VecUtility 6 | 7 | def tokenize(sentence, grams): 8 | words = KaggleWord2VecUtility.review_to_wordlist(sentence) 9 | tokens = [] 10 | for gram in grams: 11 | for i in range(len(words) - gram + 1): 12 | tokens += ["_*_".join(words[i:i+gram])] 13 | return tokens 14 | 15 | 16 | def build_dict(data, grams): 17 | dic = Counter() 18 | for token_list in data: 19 | dic.update(token_list) 20 | return dic 21 | 22 | 23 | def compute_ratio(poscounts, negcounts, alpha=1): 24 | alltokens = list(set(poscounts.keys() + negcounts.keys())) 25 | dic = dict((t, i) for i, t in enumerate(alltokens)) 26 | d = len(dic) 27 | 28 | print "Computing r...\n" 29 | 30 | p, q = np.ones(d) * alpha , np.ones(d) * alpha 31 | for t in alltokens: 32 | p[dic[t]] += poscounts[t] 33 | q[dic[t]] += negcounts[t] 34 | p /= abs(p).sum() 35 | q /= abs(q).sum() 36 | r = np.log(p/q) 37 | return dic, r 38 | 39 | 40 | def generate_svmlight_content(data, dic, r, grams): 41 | output = [] 42 | for _, row in data.iterrows(): 43 | tokens = tokenize(row['review'], grams) 44 | indexes = [] 45 | for t in tokens: 46 | try: 47 | indexes += [dic[t]] 48 | except KeyError: 49 | pass 50 | indexes = list(set(indexes)) 51 | indexes.sort() 52 | if 'sentiment' in row: 53 | line = [str(row['sentiment'])] 54 | else: 55 | line = ['0'] 56 | for i in indexes: 57 | line += ["%i:%f" % (i + 1, r[i])] 58 | output += [" ".join(line)] 59 | 60 | return "\n".join(output) 61 | 62 | 63 | def generate_svmlight_files(train, test, grams, outfn): 64 | ngram = [int(i) for i in grams] 65 | ptrain = [] 66 | ntrain = [] 67 | 68 | print "Parsing training data...\n" 69 | 70 | for _, row in train.iterrows(): 71 | if row['sentiment'] == 1: 72 | ptrain.append(tokenize(row['review'], ngram)) 73 | elif row['sentiment'] == 0: 74 | ntrain.append(tokenize(row['review'], ngram)) 75 | 76 | pos_counts = build_dict(ptrain, ngram) 77 | neg_counts = build_dict(ntrain, ngram) 78 | 79 | dic, r = compute_ratio(pos_counts, neg_counts) 80 | 81 | f = open(outfn + '-train.txt', "w") 82 | f.writelines(generate_svmlight_content(train, dic, r, ngram)) 83 | f.close() 84 | 85 | print "Parsing test data...\n" 86 | 87 | f = open(outfn + '-test.txt', "w") 88 | f.writelines(generate_svmlight_content(test, dic, r, ngram)) 89 | f.close() 90 | 91 | print "SVMlight files have been generated!" 
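The script above implements the feature side of an NBSVM-style model (in the spirit of Wang & Manning's "Baselines and Bigrams", 2012): compute_ratio builds a smoothed log-count ratio r = log(p/q) from the L1-normalised positive and negative n-gram counts, and each review is written to an SVMlight-format file with its n-grams weighted by the corresponding entries of r. Below is a minimal sketch of how those files can then be consumed; predict.py further down does essentially this with LogisticRegression, and the paths are the ones hard-coded there:

```python
from sklearn.datasets import load_svmlight_files
from sklearn.linear_model import LogisticRegression

# Load the train/test files written by generate_svmlight_files(...).
X_train, y_train, X_test, _ = load_svmlight_files(
    ("../data/nbsvm-train.txt", "../data/nbsvm-test.txt"))

# Any linear classifier can sit on top of the NB-weighted features;
# predict.py fits logistic regression and keeps predict_proba for the AUC metric.
clf = LogisticRegression()
clf.fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]
```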
92 | 93 | 94 | -------------------------------------------------------------------------------- /Kaggle-bag-of-words/predict.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | import pandas as pd 3 | import numpy as np 4 | 5 | from sklearn.linear_model import LogisticRegression 6 | from sklearn.metrics import roc_auc_score 7 | from sklearn.preprocessing import scale 8 | from sklearn.feature_extraction.text import TfidfVectorizer 9 | from sklearn.datasets import load_svmlight_files 10 | from scipy.sparse import hstack 11 | 12 | from gensim.models import Doc2Vec, Word2Vec 13 | from gensim.models.doc2vec import LabeledSentence 14 | 15 | from nbsvm import generate_svmlight_files 16 | 17 | from KaggleWord2VecUtility import KaggleWord2VecUtility 18 | 19 | 20 | def makeFeatureVec(words, model, num_features): 21 | featureVec = np.zeros((num_features,),dtype="float32") 22 | nwords = 0 23 | 24 | index2word_set = set(model.index2word) 25 | for word in words: 26 | if word in index2word_set: 27 | nwords = nwords + 1 28 | featureVec = np.add(featureVec,model[word]) 29 | 30 | if nwords != 0: 31 | featureVec /= nwords 32 | return featureVec 33 | 34 | 35 | def getAvgFeatureVecs(reviews, model, num_features): 36 | counter = 0 37 | 38 | reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32") 39 | 40 | for review in reviews: 41 | reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features) 42 | counter = counter + 1 43 | return reviewFeatureVecs 44 | 45 | 46 | def getCleanReviews(reviews): 47 | clean_reviews = [] 48 | for review in reviews["review"]: 49 | clean_reviews.append(KaggleWord2VecUtility.review_to_wordlist(review, True)) 50 | return clean_reviews 51 | 52 | 53 | def getFeatureVecs(reviews, model, num_features): 54 | reviewFeatureVecs = np.zeros((len(reviews),num_features),dtype="float32") 55 | counter = -1 56 | 57 | for review in reviews: 58 | counter = counter + 1 59 | try: 60 | reviewFeatureVecs[counter] = np.array(model[review.labels[0]]).reshape((1, num_features)) 61 | except: 62 | continue 63 | return reviewFeatureVecs 64 | 65 | 66 | def getCleanLabeledReviews(reviews): 67 | clean_reviews = [] 68 | for review in reviews["review"]: 69 | clean_reviews.append(KaggleWord2VecUtility.review_to_wordlist(review, True)) 70 | 71 | labelized = [] 72 | for i, id_label in enumerate(reviews["id"]): 73 | labelized.append(LabeledSentence(clean_reviews[i], [id_label])) 74 | return labelized 75 | 76 | 77 | if __name__ == '__main__': 78 | train = pd.read_csv('../data/labeledTrainData.tsv', header=0, delimiter="\t", quoting=3) 79 | test = pd.read_csv('../data/testData.tsv', header=0, delimiter="\t", quoting=3 ) 80 | 81 | print "Cleaning and parsing the data sets...\n" 82 | 83 | clean_train_reviews = [] 84 | for review in train['review']: 85 | clean_train_reviews.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(review))) 86 | 87 | clean_test_reviews = [] 88 | for review in test['review']: 89 | clean_test_reviews.append(" ".join(KaggleWord2VecUtility.review_to_wordlist(review))) 90 | 91 | print "Creating the bag of words...\n" 92 | 93 | vectorizer = TfidfVectorizer(max_features=50000, ngram_range=(1,3), sublinear_tf=True) 94 | 95 | X_train_bow = vectorizer.fit_transform(clean_train_reviews) 96 | X_test_bow = vectorizer.transform(clean_test_reviews) 97 | 98 | 99 | print "Cleaning and labeling the data sets...\n" 100 | 101 | train_reviews = getCleanLabeledReviews(train) 102 | test_reviews = getCleanLabeledReviews(test) 
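# The reviews are wrapped in LabeledSentence objects keyed by the "id" column so
# that the pre-trained Doc2Vec models loaded below can be queried per review
# (getFeatureVecs looks up model[review.labels[0]]).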
103 | 104 | n_dim = 5000 105 | 106 | print 'Loading doc2vec model..\n' 107 | 108 | model_dm_name = "../data/%dfeatures_1minwords_10context_dm" % n_dim 109 | model_dbow_name = "../data/%dfeatures_1minwords_10context_dbow" % n_dim 110 | 111 | model_dm = Doc2Vec.load(model_dm_name) 112 | model_dbow = Doc2Vec.load(model_dbow_name) 113 | 114 | print "Creating the d2v vectors...\n" 115 | 116 | X_train_d2v_dm = getFeatureVecs(train_reviews, model_dm, n_dim) 117 | X_train_d2v_dbow = getFeatureVecs(train_reviews, model_dbow, n_dim) 118 | X_train_d2v = np.hstack((X_train_d2v_dm, X_train_d2v_dbow)) 119 | 120 | X_test_d2v_dm = getFeatureVecs(test_reviews, model_dm, n_dim) 121 | X_test_d2v_dbow = getFeatureVecs(test_reviews, model_dbow, n_dim) 122 | X_test_d2v = np.hstack((X_test_d2v_dm, X_test_d2v_dbow)) 123 | 124 | 125 | print 'Loading word2vec model..\n' 126 | 127 | model_name = "../data/%dfeatures_40minwords_10context" % n_dim 128 | 129 | model = Word2Vec.load(model_name) 130 | 131 | print "Creating the w2v vectors...\n" 132 | 133 | X_train_w2v = scale(getAvgFeatureVecs(getCleanReviews(train), model, n_dim)) 134 | X_test_w2v = scale(getAvgFeatureVecs(getCleanReviews(test), model, n_dim)) 135 | 136 | print "Generating the svmlight-format files...\n" 137 | 138 | generate_svmlight_files(train, test, '123', '../data/nbsvm') 139 | 140 | print "Creating the nbsvm...\n" 141 | 142 | files = ("../data/nbsvm-train.txt", "../data/nbsvm-test.txt") 143 | 144 | X_train_nbsvm, _, X_test_nbsvm, _ = load_svmlight_files(files) 145 | 146 | print "Combing the bag of words and the w2v vectors...\n" 147 | 148 | X_train_bwv = hstack([X_train_bow, X_train_w2v]) 149 | X_test_bwv = hstack([X_test_bow, X_test_w2v]) 150 | 151 | 152 | print "Combing the bag of words and the d2v vectors...\n" 153 | 154 | X_train_bdv = hstack([X_train_bow, X_train_d2v]) 155 | X_test_bdv = hstack([X_test_bow, X_test_d2v]) 156 | 157 | 158 | print "Checking the dimension of training vectors" 159 | 160 | print 'BoW', X_train_bow.shape 161 | print 'W2V', X_train_w2v.shape 162 | print 'D2V', X_train_d2v.shape 163 | print 'NBSVM', X_train_nbsvm.shape 164 | print 'BoW-W2V', X_train_bwv.shape 165 | print 'BoW-D2V', X_train_bdv.shape 166 | print '' 167 | 168 | y_train = train['sentiment'] 169 | 170 | 171 | print "Predicting with Bag-of-words model...\n" 172 | 173 | clf = LogisticRegression(class_weight="auto") 174 | 175 | clf.fit(X_train_bow, y_train) 176 | y_prob_bow = clf.predict_proba(X_test_bow) 177 | 178 | print "Predicting with NBSVM...\n" 179 | 180 | clf.fit(X_train_nbsvm, y_train) 181 | y_prob_nbsvm = clf.predict_proba(X_test_nbsvm) 182 | 183 | 184 | print "Predicting with Bag-of-words model and Word2Vec model...\n" 185 | 186 | clf.fit(X_train_bwv, y_train) 187 | y_prob_bwv = clf.predict_proba(X_test_bwv) 188 | 189 | 190 | print "Predicting with Bag-of-words model and Doc2Vec model...\n" 191 | 192 | clf.fit(X_train_bdv, y_train) 193 | y_prob_bdv = clf.predict_proba(X_test_bdv) 194 | 195 | 196 | print "\nWeighted Average: BOW/BOW-W2V/BOW-D2V/NBSVM\n" 197 | 198 | alpha = 0.081633 199 | beta = 0.265306 200 | theta = 0.551020 201 | 202 | y_pred = alpha*y_prob_bow + (1-alpha-beta-theta)*y_prob_bwv + beta*y_prob_bdv + theta*y_prob_nbsvm 203 | 204 | output = pd.DataFrame(data={"id":test["id"], "sentiment":y_pred[:,1]}) 205 | output.to_csv('BoW008_W2V5000_D2V10000_NBSVM055_model.csv', index=False, quoting=3) 206 | 207 | print "Wrote results to BoW008_W2V5000_D2V10000_NBSVM055_model.csv" 208 | 209 | 210 | print "\nMax-Min (Average)\n" 211 | y_mean = 
(y_prob_bow + y_prob_bwv + y_prob_bdv + y_prob_nbsvm)/4 212 | y_score_mean = [] 213 | 214 | i = 0 215 | for row in y_mean: 216 | if row[1] > 0.5: 217 | val = max(y_prob_bow[i,1],y_prob_bwv[i,1],y_prob_bdv[i,1],y_prob_nbsvm[i,1]) 218 | y_score_mean.append(val) 219 | elif row[1] < 0.5: 220 | val = min(y_prob_bow[i,1],y_prob_bwv[i,1],y_prob_bdv[i,1],y_prob_nbsvm[i,1]) 221 | y_score_mean.append(val) 222 | else: 223 | y_score_mean.append(y_pred[i,1]) 224 | i += 1 225 | 226 | 227 | print "\nMax-Min (Weighted Average)\n" 228 | y_score_best = [] 229 | 230 | i = 0 231 | for row in y_pred: 232 | if row[1] > 0.5: 233 | val = max(y_prob_bow[i,1],y_prob_bwv[i,1],y_prob_bdv[i,1],y_prob_nbsvm[i,1]) 234 | y_score_best.append(val) 235 | elif row[1] < 0.5: 236 | val = min(y_prob_bow[i,1],y_prob_bwv[i,1],y_prob_bdv[i,1],y_prob_nbsvm[i,1]) 237 | y_score_best.append(val) 238 | else: 239 | y_score_best.append(y_pred[i,1]) 240 | i += 1 241 | 242 | 243 | print "\nFinal Ensemble\n" 244 | y_wa = np.array([row[1] for row in y_pred]) 245 | y_am = np.array(y_score_mean) 246 | y_wam = np.array(y_score_best) 247 | 248 | alpha1 = 0.591837 249 | alpha2 = 0.387755 250 | y_final = alpha1*y_wa + (1-alpha1-alpha2)*y_am + alpha2*y_wam 251 | 252 | output = pd.DataFrame(data={"id":test["id"], "sentiment":y_final}) 253 | output.to_csv('WeightedAverage059_MaxMinAverage_MaxMinWeightedAverage039_model.csv', index=False, quoting=3) 254 | 255 | print "Wrote results to WeightedAverage059_MaxMinAverage_MaxMinWeightedAverage039_model.csv" 256 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/.gitignore: -------------------------------------------------------------------------------- 1 | *.py[co] 2 | 3 | # Packages 4 | *.egg 5 | *.egg-info 6 | dist 7 | build 8 | eggs 9 | parts 10 | bin 11 | var 12 | sdist 13 | develop-eggs 14 | .installed.cfg 15 | 16 | # Installer logs 17 | pip-log.txt 18 | 19 | # Unit test / coverage reports 20 | .coverage 21 | .tox 22 | 23 | #Translations 24 | *.mo 25 | 26 | #Mr Developer 27 | .mr.developer.cfg 28 | 29 | #All big data files 30 | ../data/train.csv 31 | ../data/test.csv 32 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/Digit Recognizer.md: -------------------------------------------------------------------------------- 1 | # 1. Kaggle Digit Recognizer 2 | 此任务是在MNIST(一个带Label的数字像素集合)上训练一个数字分类器,训练集的大小为42000个training example, 3 | 每个example是28*28=784个灰度像素值和一个0~9的label。最后的排名以在测试集上的分类正确率为依据排名。 4 | ### 数据集格式 5 | 一张手写数字图片由28*28=784个像素组成,每一个像素的取值范围[0,255]。 6 | **训练集train.csv** 7 | 每一行由[label,pixel0~pixel783],label代表这张图是什么数字。 8 | **测试集test.csv** 9 | 中没有label这一列,label是需要预测的。 10 | **提交结果文件name.csv** 11 | 列名 [ImageId,Label],ImageId对应测试集中的每一行。 12 | 13 | 14 | [TensorFlow softmax regression & deep NN](https://www.kaggle.com/kakauandme/digit-recognizer/tensorflow-softmax-regression-deep-nn/notebook) -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/data/readme.txt: -------------------------------------------------------------------------------- 1 | Training dataset (73.22Mb): 2 | http://www.kaggle.com/c/digit-recognizer/download/train.csv 3 | 4 | Testing dataset (48.75Mb): 5 | http://www.kaggle.com/c/digit-recognizer/download/test.csv 6 | 7 | 28x28 unrolled to 784 features. 
In training first column is label/target class -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/experiment1-rf-1000.py: -------------------------------------------------------------------------------- 1 | from sklearn.ensemble import RandomForestClassifier 2 | from numpy import genfromtxt, savetxt 3 | 4 | CPU = 1 5 | 6 | 7 | def main(): 8 | print("Reading training set") 9 | dataset = genfromtxt(open('../data/train.csv', 'r'), delimiter=',', dtype='int64')[1:] 10 | target = [x[0] for x in dataset] 11 | train = [x[1:] for x in dataset] 12 | print("Reading test set") 13 | test = genfromtxt(open('../data/test.csv', 'r'), delimiter=',', dtype='int64')[1:] 14 | 15 | #create and train the random forest 16 | rf = RandomForestClassifier(n_estimators=1000, n_jobs=CPU) 17 | print("Fitting RF classifier") 18 | rf.fit(train, target) 19 | 20 | print("Predicting test set") 21 | savetxt('submission-version-1.csv', rf.predict(test), delimiter=',', fmt='%d') 22 | 23 | if __name__ == "__main__": 24 | main() 25 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/knn_by_myself.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | ''' 5 | @filename: knn_by_myself.py 6 | @author: yew1eb 7 | @site: http://blog.yew1eb.net 8 | @contact: yew1eb@gmail.com 9 | @time: 2015/12/24 下午 9:59 10 | 11 | https://www.kaggle.com/c/digit-recognizer/ 12 | score: 0.96300 13 | pred test 14 | time cost: 12311.181267s 15 | os: windows 10 16 | CPU: AMD A6-4400M APU with Radeon(tm) Graphics 2.70GHz 17 | RAM: 6GB 18 | ''' 19 | import numpy as np 20 | import time 21 | 22 | 23 | def load_data(): 24 | train_data = np.loadtxt('d:/dataset/digits/train.csv', dtype=np.uint8, delimiter=',', skiprows=1) 25 | test_data = np.loadtxt('d:/dataset/digits/test.csv', dtype=np.uint8, delimiter=',', skiprows=1) 26 | label = np.ravel(train_data[:, :1]) # 多维转一维 扁平化 27 | data = np.where(train_data[:, 1:] != 0, 1, 0) # 数据归一化 28 | test = np.where(test_data != 0, 1, 0) 29 | return data, label, test 30 | 31 | 32 | def test_knn(train_data, train_label, test_data, test_label): 33 | start = time.clock() 34 | error = 0 35 | m = len(test_data) 36 | labels = [] 37 | for i in range(m): 38 | calc_label = classify(test_data[i], train_data, train_label, 3) 39 | labels.append(calc_label) 40 | error = error + (calc_label != test_label[i]) 41 | 42 | print(('error: ', error)) 43 | print(('error percent: %f' % (float(error) / m))) 44 | print(('time cost: %f s' % (time.clock() - start))) 45 | 46 | 47 | def save2csv(labels, csv_name): 48 | f = open('d:/dataset/digits/' + csv_name, 'w') 49 | f.write('ImageId,Label\n') 50 | for i in range(1, len(labels)+1): 51 | f.write(str(i)+','+str(labels[i])) 52 | f.write("\n") 53 | f.close() 54 | 55 | 56 | def knn_pred(train_data, train_label, test_data): 57 | start = time.clock() 58 | m = len(test_data) 59 | labels = [] 60 | for i in range(m): 61 | calc_label = classify(test_data[i], train_data, train_label, 3) 62 | labels.append(calc_label) 63 | save2csv(labels, 'knn_result.csv') 64 | print(('time cost: %f s' % (time.clock() - start))) 65 | 66 | 67 | def classify(inx, train_data, train_label, k): 68 | sz = train_data.shape[0] 69 | inx_temp = np.tile(inx, (sz, 1)) - train_data 70 | sq_inx_temp = inx_temp ** 2 71 | sq_distance = sq_inx_temp.sum(axis=1) 72 | distance = sq_distance ** 0.5 73 | sort_dist = distance.argsort() 74 | class_set = {} 
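# Majority vote (added note): count the labels of the k nearest training samples
# and return the label that occurs most often.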
75 | for i in range(k): 76 | label = train_label[sort_dist[i]] 77 | class_set[label] = class_set.get(label, 0) + 1 78 | sorted_class_set = sorted(list(class_set.items()), key=lambda d: d[1], reverse=True) # 按字典中的从大到小排序 79 | # python2.7 -> python3.5 : itertimes() -> items() 80 | return sorted_class_set[0][0] 81 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/naive_bayes_by_myself.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | ''' 5 | @filename: naive_bayes_by_myself.py 6 | @author: yew1eb 7 | @site: http://blog.yew1eb.net 8 | @contact: yew1eb@gmail.com 9 | @time: 2015/12/25 1:51 10 | ''' 11 | 12 | import numpy as np 13 | import time 14 | 15 | def csv2vector(file_name): 16 | pass 17 | 18 | def savefile(labels, file_name): 19 | pass 20 | 21 | 22 | def trainNB0(trainMatrix,trainclass): 23 | numpics = len(trainMatrix) #record numbers 24 | numpix = len(trainMatrix[0])#pix numbers 25 | pDic={} 26 | for v in trainclass: 27 | pDic[v] = pDic.get(v,0)+1 28 | for k,v in list(pDic.items()): 29 | pDic[k]=v/float(numpics)#p of every class 30 | pnumdic={} 31 | psumdic={} 32 | for k in list(pDic.keys()): 33 | pnumdic[k]=np.ones(numpix) 34 | for i in range(numpics): 35 | pnumdic[trainclass[i]] += trainMatrix[i] 36 | psumdic[trainclass[i]] = psumdic.get(trainclass[i],2) + sum(trainMatrix[i]) 37 | pvecdic={} 38 | for k in list(pnumdic.keys()): 39 | pvecdic[k]=np.log(pnumdic[k]/float(psumdic[k])) 40 | return pvecdic,pDic 41 | 42 | def classifyNB(vec2class,pvecdic,pDic): 43 | presult={} 44 | for k in list(pDic.keys()): 45 | presult[k]=sum(vec2class*pvecdic[k])+np.log(pDic[k]) 46 | tmp=float("-inf") 47 | result="" 48 | for k in list(presult.keys()): 49 | if presult[k]>tmp: 50 | tmp= presult[k] 51 | result=k 52 | return result 53 | 54 | def testNB(): 55 | print("load train data...") 56 | trainSet, trainlabel=csv2vector("train.csv",1) 57 | print("load test data...") 58 | testSet,testlabel = csv2vector("test.csv") 59 | print("start train...") 60 | pvecdic,pDic=trainNB0(trainSet, trainlabel) 61 | start = time.clock() 62 | print("start test...") 63 | result="ImageId,Label\n" 64 | for i in range(len(testSet)): 65 | tmp = classifyNB(testSet[i],pvecdic,pDic) 66 | result += str(i+1)+","+tmp+"\n" 67 | #print tmp 68 | savefile(result,"result_NB.csv") 69 | end = time.clock() 70 | print(("time cost: %f s" % (end - start))) 71 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/nn/README.md: -------------------------------------------------------------------------------- 1 | DigitRecognizer 2 | =============== 3 | 4 | A hand-written-digit recognition program using a feed forward neural network. This program is purposed for a predictive algorithm competition on Kaggle. 5 | 6 | This project also contains a module, PyNeural, which is meant to be used as a neural network system for machine learning. 
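For orientation, here is a minimal usage sketch of the PyNeural API defined in src/PyNeural/PyNeural.py; the layer sizes, learning rate and toy data are illustrative placeholders (DigitRecognizer.py uses layers of [number of features, 121, 10] on the real MNIST data), and the import assumes the script is run from nn/src:

```python
from PyNeural.PyNeural import NeuralNetwork

# Toy data: each input is a list of features, each output a class index.
train_x = [[0.0, 0.1, 0.9], [0.8, 0.7, 0.1], [0.1, 0.0, 0.8], [0.9, 0.9, 0.2]]
train_y = [0, 1, 0, 1]

# 3 inputs -> 5 hidden units -> 2 output classes, learning rate 0.04.
network = NeuralNetwork([3, 5, 2], 0.04)

# train() expects a held-out set and prints its accuracy after every epoch.
network.train(train_x[:2], train_y[:2],
              test_inputs=train_x[2:], test_outputs=train_y[2:],
              epoch_cap=15, error_goal=0.0, dropconnect_chance=0.05)

print(network.predict(train_x[0]))  # predicted class index for one sample
```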
-------------------------------------------------------------------------------- /Kaggle-digit-recognizer/nn/src/DigitRecognizer.py: -------------------------------------------------------------------------------- 1 | __author__ = 'Marco Giancarli, m.a.giancarli@gmail.com' 2 | 3 | 4 | import numpy as np 5 | from math import pi as PI 6 | from csv import reader 7 | from csv import writer 8 | from csv import QUOTE_NONE 9 | from skimage import filters 10 | from skimage import measure 11 | from skimage import transform 12 | from skimage import io 13 | from skimage import viewer 14 | from PyNeural.PyNeural import NeuralNetwork 15 | 16 | 17 | """ 18 | Return a list of all possible translations of the image that don't cut off any 19 | part of the number. 20 | """ 21 | def all_translations(x): 22 | translations = [x] 23 | # image = np.array([x]) 24 | # image.resize((28, 28)) 25 | # if sum(image[0, :]) == 0: 26 | # image1 = np.vstack([image[1:, :], image[0:1, :]]) 27 | # translations.append(image1.reshape((1, -1)).tolist()[0]) 28 | # if sum(image[:, 0]) == 0: 29 | # image2 = np.hstack([image[:, 1:], image[:, 0:1]]) 30 | # translations.append(image2.reshape((1, -1)).tolist()[0]) 31 | # if sum(image[-1, :]) == 0: 32 | # image3 = np.vstack([image[-1:, :], image[:-1, :]]) 33 | # translations.append(image3.reshape((1, -1)).tolist()[0]) 34 | # if sum(image[:, -1]) == 0: 35 | # image4 = np.hstack([image[:, -1:], image[:, :-1]]) 36 | # translations.append(image4.reshape((1, -1)).tolist()[0]) 37 | 38 | return translations 39 | 40 | """ 41 | Return new features, given an initial list x of raw features. 42 | """ 43 | def get_features(x): 44 | features = [] 45 | image = np.array([x]) 46 | image.resize((28, 28)) 47 | binary_image = filters.threshold_adaptive(image, 9) 48 | 49 | angles = np.linspace(0, 1, 8) * PI 50 | h, _, _ = transform.hough_line(filters.sobel(binary_image), theta=angles) 51 | h_sum = [ 52 | [sum(row[start:start+5]) for start in range(0, 75, 5)] 53 | for row in zip(*h) 54 | ] 55 | features.extend(np.array(h_sum).reshape(1, -1).tolist()[0]) 56 | 57 | # moments = measure.moments(binary_image) 58 | # hu_moments = measure.moments_hu(moments) 59 | # # reshape: -1 as a dimension size makes the dimension implicit 60 | # features.extend(moments.reshape((1, -1)).tolist()[0]) 61 | # features.extend(hu_moments.reshape((1, -1)).tolist()[0]) 62 | 63 | # h_line, _, _ = transform.hough_line(binary_image) 64 | # features.extend(np.array(h_line).reshape((1, -1)).tolist()[0]) 65 | 66 | return features 67 | 68 | def main(): 69 | print('Loading training set...') 70 | 71 | training_x_raw = [] 72 | training_y_raw = [] 73 | training_x = [] 74 | training_y = [] 75 | samples = 0 76 | m = 0 77 | 78 | with open('res/datasets/train.csv', ) as training_file: 79 | training_data = reader(training_file, delimiter=',') 80 | skipped_titles = False 81 | for line in training_data: 82 | if not skipped_titles: 83 | skipped_titles = True 84 | continue 85 | fields = list(line) 86 | training_y_raw = fields[0] 87 | training_x_raw = fields[1:] 88 | # remove the labels 89 | training_y.append(int(training_y_raw)) 90 | 91 | for features in all_translations([int(v) for v in training_x_raw]): 92 | training_x.append(features + get_features(features)) 93 | m += 1 94 | 95 | samples += 1 96 | if any([samples % 1000 == 0, 97 | samples % 100 == 0 and samples < 2000, 98 | samples % 10 == 0 and samples < 200]): 99 | print(samples, 'samples loaded.', m, 'generated samples.') 100 | print('Done.', m, 'total samples.') 101 | 102 | x_array = 
np.array(training_x) 103 | # normalize the training set 104 | training_x = ((x_array - np.average(x_array)) / np.std(x_array)).tolist() 105 | 106 | layer_sizes = [x_array.shape[1], 121, 10] 107 | alpha = 0.04 108 | test_size = m / 4 # 4 fold testing 109 | 110 | print('Training set loaded. Samples:', len(training_x)) 111 | print('Training network (layers: ' + \ 112 | ' -> '.join(map(str, layer_sizes)) + ')...') 113 | 114 | network = NeuralNetwork(layer_sizes, alpha) 115 | 116 | network.train( 117 | training_x[:-test_size], 118 | training_y[:-test_size], 119 | test_inputs=training_x[-test_size:], 120 | test_outputs=training_y[-test_size:], 121 | epoch_cap=15, 122 | error_goal=0.00, 123 | dropconnect_chance=0.05 124 | ) 125 | 126 | print('Network trained.') 127 | 128 | num_correct = 0 129 | num_tests = 0 130 | for x, y in zip(training_x[-2000:], training_y[-2000:]): 131 | prediction = network.predict(x) 132 | num_tests += 1 133 | if int(prediction) == y: 134 | num_correct += 1 135 | print(str(num_correct), '/', str(num_tests)) 136 | 137 | # clear junk 138 | network.momentum = None 139 | network.dropconnect_matrices = None 140 | training_x = None 141 | training_y = None 142 | training_data = None 143 | training_x_raw = None 144 | training_y_raw = None 145 | 146 | print('Loading test data...') 147 | 148 | test_x_raw = [] 149 | test_x = [] 150 | test_y = [] 151 | 152 | output_file_name = 'gen/nn_benchmark5.csv' 153 | 154 | with open(output_file_name, 'wb') as output_file: 155 | w = writer(output_file, delimiter=',', quoting=QUOTE_NONE) 156 | w.writerow(['ImageId','Label']) 157 | 158 | with open('res/datasets/test.csv', ) as test_file: 159 | test_data = reader(test_file, delimiter=',') 160 | skipped_titles = False 161 | num_predictions = 0 162 | for line in test_data: 163 | if not skipped_titles: 164 | skipped_titles = True 165 | continue 166 | fields = list(line) 167 | test_x_raw = fields 168 | # remove the damn labels 169 | features = [int(val) for val in test_x_raw] 170 | features.extend(get_features(features)) 171 | test_x.append(features) 172 | num_predictions += 1 173 | if num_predictions % 100 == 0: 174 | x_array = np.array(test_x) 175 | # normalize the test set 176 | test_x = ( 177 | (x_array - np.average(x_array)) / np.std(x_array) 178 | ).tolist() 179 | for i in range(100): 180 | w.writerow([num_predictions-99+i, 181 | network.predict(test_x[i])]) 182 | test_x = [] 183 | x_array = [] 184 | 185 | 186 | print('Predicted labels and stored as "' + output_file_name + '".') 187 | 188 | if __name__ == '__main__': 189 | main() -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/nn/src/PyNeural/PyNeural.py: -------------------------------------------------------------------------------- 1 | __author__ = 'MarcoGiancarli, m.a.giancarli@gmail.com' 2 | 3 | 4 | import math 5 | import numpy as np 6 | 7 | 8 | # Use tanh instead of normal sigmoid because it's faster. Emulates sigmoid. 
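# Mathematically (tanh(x) + 1) / 2 equals 1 / (1 + exp(-2x)), i.e. the logistic
# sigmoid evaluated at 2x, so it has the same range with a steeper slope.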
9 | def sigmoid(x): 10 | return (np.tanh(x) + 1) / 2 11 | 12 | 13 | # derivative of our sigmoid function 14 | def d_sigmoid(x): 15 | return (np.tanh(x)+1) * (1-np.tanh(x)) / 4 16 | 17 | 18 | def output_vector_to_scalar(vector): 19 | # get the index of the max in the vector 20 | m,i = max((v,i) for i,v in enumerate(vector.tolist())) 21 | return i 22 | 23 | 24 | def output_scalar_to_vector(scalar, num_outputs): 25 | # same size as outputs, all 0s 26 | vector = [0] * num_outputs 27 | # add 1 to the correct index 28 | vector[scalar] += 1 29 | return vector 30 | 31 | 32 | #TODO: add methods to save state 33 | #TODO: learning curves? 34 | class NeuralNetwork: 35 | def __init__(self, layer_sizes, alpha, labels=None, reg_constant=0): 36 | self.alpha = alpha 37 | self.regularization_constant = reg_constant 38 | self.dropconnect_matrices = None 39 | 40 | if labels is None: 41 | self.labels = list(range(layer_sizes[-1])) 42 | elif len(labels) != layer_sizes[-1]: 43 | #TODO: throw exception here 44 | print('Fucked up because the size of layer does not match the ' \ 45 | 'size of the outputs. (' + \ 46 | str(len(labels)) + ' != ' + str(layer_sizes[-1]) + ')') 47 | exit(1) 48 | else: 49 | self.labels = labels 50 | 51 | # theta is the weights matrix for each node. we skip the first layer 52 | # because it has no weights. 53 | self.theta = [None] * len(layer_sizes) 54 | for l in range(1, len(layer_sizes)): 55 | # append a matrix which represents the initial weights for layer l 56 | # for node in layer l, add a weight to each node in layer l-1 + bias 57 | beta = 0.7 * math.pow(layer_sizes[l], 1/layer_sizes[l-1]) 58 | self.theta[l] = np.random.random( 59 | (layer_sizes[l], layer_sizes[l-1]+1) 60 | ) * 2 - 1 61 | norm = [ 62 | math.sqrt(x) 63 | for x in np.multiply( 64 | self.theta[l], 65 | self.theta[l]).dot(np.ones([layer_sizes[l-1]+1])) 66 | ] 67 | for row_num in range(len(norm)): 68 | self.theta[l][row_num,:] = self.theta[l][row_num,:] * \ 69 | beta / norm[row_num] 70 | 71 | self.momentum = [np.zeros(t.shape) for t in self.theta[1:]] 72 | 73 | """ 74 | Feed forward and return lists of matrices A and Z for one set of inputs. 75 | """ 76 | def feed_forward(self, input_vector, dropconnect_matrices=None): 77 | A = [None]*len(self.theta) 78 | Z = [None]*len(self.theta) 79 | A[0] = input_vector.T # 1 x n 80 | Z[0] = None # z_1 doesn't exist 81 | for l in range(1, len(self.theta)): 82 | # add constant (1) to the weights that correspond with each node 83 | A_with_ones = np.concatenate((np.array([1]), A[l-1])) 84 | if dropconnect_matrices is not None: 85 | Z[l] = np.dot(np.multiply(self.theta[l], 86 | dropconnect_matrices[l-1]), 87 | A_with_ones) 88 | else: 89 | Z[l] = np.dot(self.theta[l], A_with_ones) 90 | A[l] = sigmoid(Z[l]) 91 | 92 | return A, Z 93 | 94 | """ 95 | Back propagate for one training sample. 
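Returns the per-layer weight gradients D and the per-layer error terms delta.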
96 | """ 97 | def back_prop(self, input_vector, output_vector, dropconnect_matrices): 98 | A, Z = self.feed_forward(input_vector, dropconnect_matrices) 99 | 100 | # let delta be a list of matrices where delta[l][i][j] is delta 101 | # at layer l, training sample i, and node j 102 | # the delta is None for the data layer, others we assign later 103 | delta = [None] * len(self.theta) 104 | delta[-1] = np.multiply(A[-1] - output_vector.T, d_sigmoid(Z[-1])) 105 | 106 | # note: no error on data layer, we have the output layer 107 | for l in reversed(list(range(1, len(self.theta)-1))): 108 | theta_t_delta = np.dot(np.multiply(self.theta[l+1], 109 | dropconnect_matrices[l]).T, 110 | delta[l+1]) 111 | delta[l] = np.multiply(theta_t_delta[1:], d_sigmoid(Z[l])) 112 | 113 | # Calculate the partial derivatives for all theta values using delta 114 | D = [None]*len(self.theta) # make list of size L, where L is num layers 115 | for l in range(1, len(self.theta)): 116 | D[l] = np.dot(np.atleast_2d(A[l-1]).T, np.atleast_2d(delta[l])) 117 | 118 | return D, delta 119 | 120 | """ 121 | This method is used for supervised training on a data set. 122 | """ 123 | def train(self, inputs, outputs, test_inputs=None, test_outputs=None, 124 | epoch_cap=100, error_goal=0, dropconnect_chance=0.15): 125 | # create these first so that we don't have to do it every epoch 126 | input_vectors = [np.array(x) for x in inputs] 127 | output_vectors = [ 128 | np.array(output_scalar_to_vector(y, self.theta[-1].shape[0])) 129 | for y in outputs 130 | ] 131 | test_input_vectors = [np.array(x) for x in test_inputs] 132 | test_output_vectors = [ 133 | np.array(output_scalar_to_vector(y, self.theta[-1].shape[0])) 134 | for y in test_outputs 135 | ] 136 | 137 | m = len(outputs) 138 | for iteration in range(epoch_cap): 139 | if dropconnect_chance > 0: 140 | dropconnect_matrices = \ 141 | self.make_dropconnect_matrices(dropconnect_chance) 142 | for input_vector, output_vector in zip(input_vectors, 143 | output_vectors): 144 | gradient, bias = self.back_prop(input_vector, 145 | output_vector, 146 | dropconnect_matrices) 147 | gradient_with_bias = [None]*len(self.theta) 148 | 149 | for l in range(1, len(self.theta)): 150 | gradient_with_bias[l] = np.vstack((bias[l], gradient[l])) 151 | gradient_with_bias[l] = gradient_with_bias[l].T 152 | 153 | gradient_with_bias = [g for g in gradient_with_bias[1:]] 154 | self.gradient_descent(gradient_with_bias) 155 | 156 | # test the updated system against the validation set 157 | if test_inputs is not None and test_outputs is not None: 158 | num_tests = len(test_output_vectors) 159 | num_correct = 0 160 | for test_input, test_output in zip(test_input_vectors, 161 | test_outputs): 162 | prediction = self.predict(test_input) 163 | if prediction == test_output: 164 | num_correct += 1 165 | test_accuracy = float(num_correct) / float(num_tests) 166 | print('Test at epoch %s: %s / %s -- Accuracy: %s' % ( 167 | str(iteration+1), str(num_correct), 168 | str(num_tests), str(test_accuracy) 169 | )) 170 | 171 | if test_accuracy >= 1.0 - error_goal: 172 | return 173 | 174 | """ 175 | This method calls feed_forward and returns just the prediction labels for 176 | all samples. 
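(In this implementation it is called with a single sample and returns the index
of the largest output activation.)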
177 | """ 178 | def predict(self, input): 179 | A, _ = self.feed_forward(np.array(input)) 180 | return np.argmax(A[-1]) 181 | 182 | def gradient_descent(self, gradient): 183 | for l in range(1, len(self.theta)): 184 | # gradient doesnt have a None value at index 0, but theta does 185 | self.theta[l] = np.add( 186 | self.theta[l], 187 | (-1.0 * self.alpha) * (gradient[l-1] + self.momentum[l-1]) 188 | ) 189 | self.momentum = [m/2 + g/2 for m, g in zip(self.momentum, gradient)] 190 | 191 | def make_dropconnect_matrices(self, dropconnect_chance): 192 | assert(0 <= dropconnect_chance < 1) 193 | dropconnect_matrices = [ 194 | np.fix(np.random.random(t.shape) + (1-dropconnect_chance)) 195 | for t in self.theta[1:] 196 | ] 197 | return dropconnect_matrices -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/nn/src/PyNeural/PyNeural.py.bak: -------------------------------------------------------------------------------- 1 | __author__ = 'MarcoGiancarli, m.a.giancarli@gmail.com' 2 | 3 | 4 | import math 5 | import numpy as np 6 | 7 | 8 | # Use tanh instead of normal sigmoid because it's faster. Emulates sigmoid. 9 | def sigmoid(x): 10 | return (np.tanh(x) + 1) / 2 11 | 12 | 13 | # derivative of our sigmoid function 14 | def d_sigmoid(x): 15 | return (np.tanh(x)+1) * (1-np.tanh(x)) / 4 16 | 17 | 18 | def output_vector_to_scalar(vector): 19 | # get the index of the max in the vector 20 | m,i = max((v,i) for i,v in enumerate(vector.tolist())) 21 | return i 22 | 23 | 24 | def output_scalar_to_vector(scalar, num_outputs): 25 | # same size as outputs, all 0s 26 | vector = [0] * num_outputs 27 | # add 1 to the correct index 28 | vector[scalar] += 1 29 | return vector 30 | 31 | 32 | #TODO: add methods to save state 33 | #TODO: learning curves? 34 | class NeuralNetwork: 35 | def __init__(self, layer_sizes, alpha, labels=None, reg_constant=0): 36 | self.alpha = alpha 37 | self.regularization_constant = reg_constant 38 | self.dropconnect_matrices = None 39 | 40 | if labels is None: 41 | self.labels = range(layer_sizes[-1]) 42 | elif len(labels) != layer_sizes[-1]: 43 | #TODO: throw exception here 44 | print 'Fucked up because the size of layer does not match the ' \ 45 | 'size of the outputs. (' + \ 46 | str(len(labels)) + ' != ' + str(layer_sizes[-1]) + ')' 47 | exit(1) 48 | else: 49 | self.labels = labels 50 | 51 | # theta is the weights matrix for each node. we skip the first layer 52 | # because it has no weights. 53 | self.theta = [None] * len(layer_sizes) 54 | for l in range(1, len(layer_sizes)): 55 | # append a matrix which represents the initial weights for layer l 56 | # for node in layer l, add a weight to each node in layer l-1 + bias 57 | beta = 0.7 * math.pow(layer_sizes[l], 1/layer_sizes[l-1]) 58 | self.theta[l] = np.random.random( 59 | (layer_sizes[l], layer_sizes[l-1]+1) 60 | ) * 2 - 1 61 | norm = [ 62 | math.sqrt(x) 63 | for x in np.multiply( 64 | self.theta[l], 65 | self.theta[l]).dot(np.ones([layer_sizes[l-1]+1])) 66 | ] 67 | for row_num in range(len(norm)): 68 | self.theta[l][row_num,:] = self.theta[l][row_num,:] * \ 69 | beta / norm[row_num] 70 | 71 | self.momentum = [np.zeros(t.shape) for t in self.theta[1:]] 72 | 73 | """ 74 | Feed forward and return lists of matrices A and Z for one set of inputs. 
75 | """ 76 | def feed_forward(self, input_vector, dropconnect_matrices=None): 77 | A = [None]*len(self.theta) 78 | Z = [None]*len(self.theta) 79 | A[0] = input_vector.T # 1 x n 80 | Z[0] = None # z_1 doesn't exist 81 | for l in range(1, len(self.theta)): 82 | # add constant (1) to the weights that correspond with each node 83 | A_with_ones = np.concatenate((np.array([1]), A[l-1])) 84 | if dropconnect_matrices is not None: 85 | Z[l] = np.dot(np.multiply(self.theta[l], 86 | dropconnect_matrices[l-1]), 87 | A_with_ones) 88 | else: 89 | Z[l] = np.dot(self.theta[l], A_with_ones) 90 | A[l] = sigmoid(Z[l]) 91 | 92 | return A, Z 93 | 94 | """ 95 | Back propagate for one training sample. 96 | """ 97 | def back_prop(self, input_vector, output_vector, dropconnect_matrices): 98 | A, Z = self.feed_forward(input_vector, dropconnect_matrices) 99 | 100 | # let delta be a list of matrices where delta[l][i][j] is delta 101 | # at layer l, training sample i, and node j 102 | # the delta is None for the input layer, others we assign later 103 | delta = [None] * len(self.theta) 104 | delta[-1] = np.multiply(A[-1] - output_vector.T, d_sigmoid(Z[-1])) 105 | 106 | # note: no error on input layer, we have the output layer 107 | for l in reversed(range(1, len(self.theta)-1)): 108 | theta_t_delta = np.dot(np.multiply(self.theta[l+1], 109 | dropconnect_matrices[l]).T, 110 | delta[l+1]) 111 | delta[l] = np.multiply(theta_t_delta[1:], d_sigmoid(Z[l])) 112 | 113 | # Calculate the partial derivatives for all theta values using delta 114 | D = [None]*len(self.theta) # make list of size L, where L is num layers 115 | for l in range(1, len(self.theta)): 116 | D[l] = np.dot(np.atleast_2d(A[l-1]).T, np.atleast_2d(delta[l])) 117 | 118 | return D, delta 119 | 120 | """ 121 | This method is used for supervised training on a data set. 
122 | """ 123 | def train(self, inputs, outputs, test_inputs=None, test_outputs=None, 124 | epoch_cap=100, error_goal=0, dropconnect_chance=0.15): 125 | # create these first so that we don't have to do it every epoch 126 | input_vectors = [np.array(x) for x in inputs] 127 | output_vectors = [ 128 | np.array(output_scalar_to_vector(y, self.theta[-1].shape[0])) 129 | for y in outputs 130 | ] 131 | test_input_vectors = [np.array(x) for x in test_inputs] 132 | test_output_vectors = [ 133 | np.array(output_scalar_to_vector(y, self.theta[-1].shape[0])) 134 | for y in test_outputs 135 | ] 136 | 137 | m = len(outputs) 138 | for iteration in range(epoch_cap): 139 | if dropconnect_chance > 0: 140 | dropconnect_matrices = \ 141 | self.make_dropconnect_matrices(dropconnect_chance) 142 | for input_vector, output_vector in zip(input_vectors, 143 | output_vectors): 144 | gradient, bias = self.back_prop(input_vector, 145 | output_vector, 146 | dropconnect_matrices) 147 | gradient_with_bias = [None]*len(self.theta) 148 | 149 | for l in range(1, len(self.theta)): 150 | gradient_with_bias[l] = np.vstack((bias[l], gradient[l])) 151 | gradient_with_bias[l] = gradient_with_bias[l].T 152 | 153 | gradient_with_bias = [g for g in gradient_with_bias[1:]] 154 | self.gradient_descent(gradient_with_bias) 155 | 156 | # test the updated system against the validation set 157 | if test_inputs is not None and test_outputs is not None: 158 | num_tests = len(test_output_vectors) 159 | num_correct = 0 160 | for test_input, test_output in zip(test_input_vectors, 161 | test_outputs): 162 | prediction = self.predict(test_input) 163 | if prediction == test_output: 164 | num_correct += 1 165 | test_accuracy = float(num_correct) / float(num_tests) 166 | print 'Test at epoch %s: %s / %s -- Accuracy: %s' % ( 167 | str(iteration+1), str(num_correct), 168 | str(num_tests), str(test_accuracy) 169 | ) 170 | 171 | if test_accuracy >= 1.0 - error_goal: 172 | return 173 | 174 | """ 175 | This method calls feed_forward and returns just the prediction labels for 176 | all samples. 177 | """ 178 | def predict(self, input): 179 | A, _ = self.feed_forward(np.array(input)) 180 | return np.argmax(A[-1]) 181 | 182 | def gradient_descent(self, gradient): 183 | for l in range(1, len(self.theta)): 184 | # gradient doesnt have a None value at index 0, but theta does 185 | self.theta[l] = np.add( 186 | self.theta[l], 187 | (-1.0 * self.alpha) * (gradient[l-1] + self.momentum[l-1]) 188 | ) 189 | self.momentum = [m/2 + g/2 for m, g in zip(self.momentum, gradient)] 190 | 191 | def make_dropconnect_matrices(self, dropconnect_chance): 192 | assert(0 <= dropconnect_chance < 1) 193 | dropconnect_matrices = [ 194 | np.fix(np.random.random(t.shape) + (1-dropconnect_chance)) 195 | for t in self.theta[1:] 196 | ] 197 | return dropconnect_matrices -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/nn/src/PyNeural/__init__.py: -------------------------------------------------------------------------------- 1 | __author__ = 'Marco' 2 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/nn/src/ensemble.py: -------------------------------------------------------------------------------- 1 | __author__ = 'MarcoGiancarli, m.a.giancarli@gmail.com' 2 | 3 | 4 | # This program is to be used to combine the results of an ensemble of trained 5 | # networks. 
It reads the results from the benchmark csv files and write the mode 6 | # into a file named 'ensemble_benchmark.csv'. 7 | 8 | from csv import writer 9 | from csv import QUOTE_NONE 10 | 11 | BASE_PATH = '../gen/' 12 | 13 | files_in_ensemble = [BASE_PATH + name for name in [ 14 | # 'nn_benchmark.csv', 15 | 'nn_benchmark1.csv', 16 | 'nn_benchmark2.csv', 17 | 'nn_benchmark3.csv', 18 | 'nn_benchmark4.csv', 19 | 'nn_benchmark5.csv', 20 | ]] 21 | 22 | input_files = [open(file_name,'r') for file_name in files_in_ensemble] 23 | for f in input_files: 24 | f.readline() 25 | 26 | with open(BASE_PATH+'ensemble_benchmark.csv', 'wb') as output_file: 27 | w = writer(output_file, delimiter=',', quoting=QUOTE_NONE) 28 | w.writerow(['ImageId','Label']) 29 | 30 | for ex_count in range(28000): 31 | current_predictions = [] # list of digits given by several data files for the same image 32 | prediction_counts = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0] # number of times each label appears 33 | 34 | # get the label column for the next line for each data file 35 | for input_file in input_files: 36 | current_predictions.append(input_file.readline().split(',')[1]) 37 | 38 | # add one to the counter at the index of each label from the data files 39 | for prediction in current_predictions: 40 | prediction_counts[int(prediction)] += 1 41 | 42 | # the max index is the label that was predicted most frequently 43 | averaged_prediction = max(range(len(prediction_counts)),key=prediction_counts.__getitem__) 44 | 45 | print(str(ex_count+1) + ' -- Predictions: ' + ', '.join(current_predictions) + \ 46 | ' -- Average: ' + str(averaged_prediction)) 47 | w.writerow([ex_count+1, averaged_prediction]) -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/py-knn/experiment1-custom-knn-brute-force.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import operator 3 | import time 4 | 5 | 6 | # euclidean distance without square root to save 7 | # some computational time 8 | def euclid(x1, x2): 9 | return np.sum(np.power(np.subtract(x1, x2), 2)) 10 | 11 | 12 | # calc kNN from test_row vs training set 13 | # default k = 5 14 | # brute force! :( 15 | def knn(test_row, train, k=5): 16 | diffs = {} 17 | idx = 0 18 | start = time.time() 19 | for t in train: 20 | diffs[idx] = euclid(test_row, t) 21 | idx = idx + 1 22 | print("for loop: %f idx(%d)" % (time.time() - start, idx)) 23 | return sorted(iter(diffs.items()), key=operator.itemgetter(1))[:k] 24 | 25 | 26 | # majority vote 27 | def majority(knn, labels): 28 | a = {} 29 | for idx, distance in knn: 30 | if labels[idx] in list(a.keys()): 31 | a[labels[idx]] = a[labels[idx]] + 1 32 | else: 33 | a[labels[idx]] = 1 34 | return sorted(iter(a.items()), key=operator.itemgetter(1), reverse=True)[0][0] 35 | 36 | 37 | # worker. 
crawl through test set and predicts number 38 | def doWork(train, test, labels): 39 | output_file = open("output.csv", "w", 0) 40 | idx = 0 41 | size = len(test) 42 | for test_sample in test: 43 | idx += 1 44 | start = time.time() 45 | prediction = majority(knn(test_sample, train, k=100), labels) 46 | print("Knn: %f" % (time.time() - start)) 47 | output_file.write(prediction) 48 | output_file.write("\n") 49 | print((float(idx) / size) * 100) 50 | output_file.close() 51 | 52 | 53 | # majority vote for a little bit optimized worker 54 | def majority_vote(knn, labels): 55 | knn = [k[0, 0] for k in knn] 56 | a = {} 57 | for idx in knn: 58 | if labels[idx] in list(a.keys()): 59 | a[labels[idx]] = a[labels[idx]] + 1 60 | else: 61 | a[labels[idx]] = 1 62 | return sorted(iter(a.items()), key=operator.itemgetter(1), reverse=True)[0][0] 63 | 64 | 65 | def doWorkNumpy(train, test, labels): 66 | k = 20 67 | train_mat = np.mat(train) 68 | output_file = open("output-numpy2.csv", "w", 0) 69 | idx = 0 70 | size = len(test) 71 | for test_sample in test: 72 | idx += 1 73 | start = time.time() 74 | knn = np.argsort(np.sum(np.power(np.subtract(train_mat, test_sample), 2), axis=1), axis=0)[:k] 75 | s = time.time() 76 | prediction = majority_vote(knn, labels) 77 | output_file.write(prediction) 78 | output_file.write("\n") 79 | print("Knn: %f, majority %f" % (time.time() - start, time.time() - s)) 80 | print("Done: %f" % (float(idx) / size)) 81 | output_file.close() 82 | output_file = open("done.txt", "w") 83 | output_file.write("DONE") 84 | output_file.close() 85 | 86 | 87 | if __name__ == '__main__': 88 | from load_data import read_data 89 | train, labels = read_data("../data/train.csv") 90 | test, tmpl = read_data("../data/test.csv", test=True) 91 | doWorkNumpy(train, test, labels) 92 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/py-knn/experiment2-sklearn-knn-kdtree.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from sklearn.neighbors import KNeighborsClassifier 3 | 4 | 5 | def doWork(train, test, labels): 6 | print("Converting training to matrix") 7 | train_mat = np.mat(train) 8 | print("Fitting knn") 9 | knn = KNeighborsClassifier(n_neighbors=10, algorithm="kd_tree") 10 | print(knn.fit(train_mat, labels)) 11 | print("Preddicting") 12 | predictions = knn.predict(test) 13 | print("Writing to file") 14 | write_to_file(predictions) 15 | return predictions 16 | 17 | 18 | def write_to_file(predictions): 19 | f = open("output-knn-skilearn.csv", "w") 20 | for p in predictions: 21 | f.write(str(p)) 22 | f.write("\n") 23 | f.close() 24 | 25 | 26 | if __name__ == '__main__': 27 | from load_data import read_data 28 | train, labels = read_data("../data/train.csv") 29 | test, tmpl = read_data("../data/test.csv", test=True) 30 | predictions = doWork(train, test, labels) 31 | print(predictions) 32 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/py-knn/experiment2-sklearn-knn-kdtree.py.bak: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | from sklearn.neighbors import KNeighborsClassifier 3 | 4 | 5 | def doWork(train, test, labels): 6 | print "Converting training to matrix" 7 | train_mat = np.mat(train) 8 | print "Fitting knn" 9 | knn = KNeighborsClassifier(n_neighbors=10, algorithm="kd_tree") 10 | print knn.fit(train_mat, labels) 11 | print "Preddicting" 12 | predictions = 
knn.predict(test) 13 | print "Writing to file" 14 | write_to_file(predictions) 15 | return predictions 16 | 17 | 18 | def write_to_file(predictions): 19 | f = open("output-knn-skilearn.csv", "w") 20 | for p in predictions: 21 | f.write(str(p)) 22 | f.write("\n") 23 | f.close() 24 | 25 | 26 | if __name__ == '__main__': 27 | from load_data import read_data 28 | train, labels = read_data("../data/train.csv") 29 | test, tmpl = read_data("../data/test.csv", test=True) 30 | predictions = doWork(train, test, labels) 31 | print predictions 32 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/py-knn/experiment3-sklean-pca-knn.py: -------------------------------------------------------------------------------- 1 | from sklearn.neighbors import KNeighborsClassifier 2 | from sklearn import decomposition 3 | import numpy as np 4 | 5 | PCA_COMPONENTS = 100 6 | 7 | 8 | def doWork(train, labels, test): 9 | print("Converting training set to matrix") 10 | X_train = np.mat(train) 11 | 12 | print("Fitting PCA. Components: %d" % PCA_COMPONENTS) 13 | pca = decomposition.PCA(n_components=PCA_COMPONENTS).fit(X_train) 14 | 15 | print("Reducing training to %d components" % PCA_COMPONENTS) 16 | X_train_reduced = pca.transform(X_train) 17 | 18 | print("Fitting kNN with k=10, kd_tree") 19 | knn = KNeighborsClassifier(n_neighbors=10, algorithm="kd_tree") 20 | print(knn.fit(X_train_reduced, labels)) 21 | 22 | print("Reducing test to %d components" % PCA_COMPONENTS) 23 | X_test_reduced = pca.transform(test) 24 | 25 | print("Preddicting numbers") 26 | predictions = knn.predict(X_test_reduced) 27 | 28 | print("Writing to file") 29 | write_to_file(predictions) 30 | 31 | return predictions 32 | 33 | 34 | def write_to_file(predictions): 35 | f = open("output-pca-knn-skilearn-v3.csv", "w") 36 | for p in predictions: 37 | f.write(str(p)) 38 | f.write("\n") 39 | f.close() 40 | 41 | 42 | if __name__ == '__main__': 43 | from load_data import read_data 44 | train, labels = read_data("../data/train.csv") 45 | test, tmpl = read_data("../data/test.csv", test=True) 46 | print(doWork(train, labels, test)) 47 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/py-knn/experiment3-sklean-pca-knn.py.bak: -------------------------------------------------------------------------------- 1 | from sklearn.neighbors import KNeighborsClassifier 2 | from sklearn import decomposition 3 | import numpy as np 4 | 5 | PCA_COMPONENTS = 100 6 | 7 | 8 | def doWork(train, labels, test): 9 | print "Converting training set to matrix" 10 | X_train = np.mat(train) 11 | 12 | print "Fitting PCA. 
Components: %d" % PCA_COMPONENTS 13 | pca = decomposition.PCA(n_components=PCA_COMPONENTS).fit(X_train) 14 | 15 | print "Reducing training to %d components" % PCA_COMPONENTS 16 | X_train_reduced = pca.transform(X_train) 17 | 18 | print "Fitting kNN with k=10, kd_tree" 19 | knn = KNeighborsClassifier(n_neighbors=10, algorithm="kd_tree") 20 | print knn.fit(X_train_reduced, labels) 21 | 22 | print "Reducing test to %d components" % PCA_COMPONENTS 23 | X_test_reduced = pca.transform(test) 24 | 25 | print "Preddicting numbers" 26 | predictions = knn.predict(X_test_reduced) 27 | 28 | print "Writing to file" 29 | write_to_file(predictions) 30 | 31 | return predictions 32 | 33 | 34 | def write_to_file(predictions): 35 | f = open("output-pca-knn-skilearn-v3.csv", "w") 36 | for p in predictions: 37 | f.write(str(p)) 38 | f.write("\n") 39 | f.close() 40 | 41 | 42 | if __name__ == '__main__': 43 | from load_data import read_data 44 | train, labels = read_data("../data/train.csv") 45 | test, tmpl = read_data("../data/test.csv", test=True) 46 | print doWork(train, labels, test) 47 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/py-knn/load_data.py: -------------------------------------------------------------------------------- 1 | import csv 2 | import numpy as np 3 | 4 | 5 | # loading csv data into numpy array 6 | def read_data(f, header=True, test=False): 7 | data = [] 8 | labels = [] 9 | 10 | csv_reader = csv.reader(open(f, "r"), delimiter=",") 11 | index = 0 12 | for row in csv_reader: 13 | index = index + 1 14 | if header and index == 1: 15 | continue 16 | 17 | if not test: 18 | labels.append(int(row[0])) 19 | row = row[1:] 20 | 21 | data.append(np.array(np.int64(row))) 22 | return (data, labels) 23 | 24 | 25 | if __name__ == "__main__": 26 | train, labels = read_data("../data/train.csv") 27 | test, tmpl = read_data("../data/test.csv", test=True) 28 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/svm_by_myself.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | ''' 5 | @filename: svm_by_myself.py.py 6 | @author: yew1eb 7 | @site: http://blog.yew1eb.net 8 | @contact: yew1eb@gmail.com 9 | @time: 2016/01/15 下午 10:33 10 | ''' 11 | 12 | ''' 13 | 基于SVM的手写数字识别 14 | http://liuhongjiang.github.io/tech/blog/2012/12/29/svm-ocr/ 15 | 16 | Kaggle Digit OCR: SVM (Top 5) 17 | http://www.sotoseattle.com/blog/2013/10/13/Kaggle-Digit-SVM/ 18 | ''' 19 | 20 | def main(): 21 | pass 22 | 23 | 24 | if __name__ == '__main__': 25 | main() -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/svm_pca.py: -------------------------------------------------------------------------------- 1 | import numpy 2 | from sklearn.decomposition import PCA 3 | from sklearn.svm import SVC 4 | 5 | ''' 6 | score: 0.98243 7 | ''' 8 | COMPONENT_NUM = 35 9 | 10 | path = 'd:/dataset/digits/' 11 | print('Read training data...') 12 | with open(path+'train.csv', 'r') as reader: 13 | reader.readline() 14 | train_label = [] 15 | train_data = [] 16 | for line in reader.readlines(): 17 | data = list(map(int, line.rstrip().split(','))) 18 | train_label.append(data[0]) 19 | train_data.append(data[1:]) 20 | 21 | print(('Loaded ' + str(len(train_label)))) 22 | 23 | print('Reduction...') 24 | train_label = numpy.array(train_label) 25 | train_data = numpy.array(train_data) 26 | pca = 
PCA(n_components=COMPONENT_NUM, whiten=True) 27 | pca.fit(train_data) 28 | train_data = pca.transform(train_data) 29 | 30 | print('Train SVM...') 31 | svc = SVC() 32 | svc.fit(train_data, train_label) 33 | 34 | print('Read testing data...') 35 | with open(path+'test.csv', 'r') as reader: 36 | reader.readline() 37 | test_data = [] 38 | for line in reader.readlines(): 39 | pixels = list(map(int, line.rstrip().split(','))) 40 | test_data.append(pixels) 41 | print(('Loaded ' + str(len(test_data)))) 42 | 43 | print('Predicting...') 44 | test_data = numpy.array(test_data) 45 | test_data = pca.transform(test_data) 46 | predict = svc.predict(test_data) 47 | 48 | print('Saving...') 49 | with open(path+'predict.csv', 'w') as writer: 50 | writer.write('"ImageId","Label"\n') 51 | count = 0 52 | for p in predict: 53 | count += 1 54 | writer.write(str(count) + ',"' + str(p) + '"\n') 55 | -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/using_sklearn.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | """ 4 | @Filename: handwriting.py 5 | @Author: yew1eb 6 | @Date: 2015/12/23 0023 7 | """ 8 | 9 | ''' 10 | 使用sickit-learn中的分类算法预测 11 | 用户使用文档:http://scikit-learn.org/dev/user_guide.html 12 | ''' 13 | 14 | import numpy as np 15 | from sklearn.neighbors import KNeighborsClassifier 16 | from sklearn.ensemble import RandomForestClassifier 17 | from sklearn import svm 18 | from sklearn.naive_bayes import GaussianNB #naive bayes 高斯分布的数据 19 | from sklearn.naive_bayes import MultinomialNB #naive bayes 多项式分布的数据 20 | from sklearn.linear_model import LinearRegression 21 | 22 | def load_data(): 23 | # strain.csv 3000条数据; train.csv 完整训练数据集 24 | train_data = np.loadtxt('d:/dataset/digits/train.csv', dtype=np.uint8,delimiter=',', skiprows=1) 25 | test_data = np.loadtxt('d:/dataset/digits/test.csv', dtype=np.uint8,delimiter=',', skiprows=1) 26 | label = train_data[:,:1] 27 | data = np.where(train_data[:, 1:]!=0, 1, 0)# 数据归一化 28 | test = np.where(test_data !=0, 1, 0) 29 | return data, label, test 30 | 31 | def save2csv(labels, csv_name): 32 | np.savetxt('d:/dataset/digits/'+csv_name, np.c_[list(range(1,len(labels)+1)),labels], 33 | delimiter=',', header = 'ImageId,Label', comments = '', fmt='%d') 34 | 35 | def sklearn_logistic(train_data, train_label, test_data): 36 | model = LinearRegression() 37 | model.fit(train_data, train_label.ravel()) 38 | test_label = model.predict(test_data) 39 | save2csv(test_label, 'sklearn_logistic_result.csv') 40 | 41 | def sklearn_knn(train_data, train_label, test_data): 42 | model = KNeighborsClassifier(n_neighbors=6) 43 | model.fit(train_data, train_label.ravel()) 44 | test_label = model.predict(test_data) 45 | save2csv(test_label, 'sklearn_knn_result.csv') 46 | 47 | def sklearn_random_forest(train_data, train_label, test_data): 48 | model = RandomForestClassifier(n_estimators=1000, min_samples_split=5) 49 | model = model.fit(train_data, train_label.ravel() ) 50 | test_label = model.predict(test_data) 51 | save2csv(test_label, 'sklearn_random_forest.csv') 52 | 53 | def sklearn_svm(train_data, train_label, test_data): 54 | model = svm.SVC(C=14, kernel='rbf', gamma=0.001, cache_size=200) 55 | # svm.SVC(C=6.2, kernel='poly', degree=4, coef0=0.48, cache_size=200) 56 | model.fit(train_data, train_label.ravel() ) 57 | test_label = model.predict(test_data) 58 | 59 | save2csv(test_label, 'sklearn_svm_rbf_result.csv') 60 | 61 | def 
sklearn_GaussianNB(train_data, train_label, test_data): 62 | model = GaussianNB() 63 | model.fit(train_data, train_label.ravel()) 64 | test_label = model.predict(test_data) 65 | save2csv(test_label, 'sklearn_GaussianNB_Result.csv') 66 | 67 | def sklearn_MultinomialNB(train_data, train_label, test_data): 68 | model = MultinomialNB(alpha=0.1) 69 | model.fit(train_data, train_label.ravel()) 70 | test_label = model.predict(test_data) 71 | save2csv(test_label, 'sklearn_MultinomialNB_Result.csv') 72 | 73 | 74 | def main(): 75 | train_data, train_label, test_data = load_data() 76 | 77 | #sklearn_logistic(train_data, train_label, test_data) 78 | 79 | #sklearn_knn(train_data, train_label, test_data) 80 | 81 | #sklearn_random_forest(train_data, train_label, test_data) 82 | 83 | sklearn_svm(train_data, train_label, test_data) 84 | 85 | # naive bayes 0.5~ 86 | #sklearn_GaussianNB(train_data, train_label, test_data) 87 | #sklearn_MultinomialNB(train_data, train_label, test_data) 88 | 89 | if __name__ == '__main__': 90 | main() -------------------------------------------------------------------------------- /Kaggle-digit-recognizer/using_theano.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | ''' 5 | @filename: using_theano.py 6 | @author: yew1eb 7 | @site: http://blog.yew1eb.net 8 | @contact: yew1eb@gmail.com 9 | @time: 2016/01/15 10:13 PM 10 | ''' 11 | 12 | ''' 13 | theano is a Python library that implements a number of machine learning methods; its biggest selling point is that it can use the GPU as transparently as ordinary Python code. 14 | theano homepage: http://deeplearning.net/software/theano/index.html 15 | theano also supports symbolic computation and is compatible with numpy, the Python matrix computation library, 16 | which gives Python Matlab-like numerical power, although it is less convenient than Matlab. 17 | deeplearning.net tutorials: http://deeplearning.net/tutorial/ 18 | 19 | Climbing the Kaggle MNIST leaderboard with Python's theano library: 20 | http://wiki.swarma.net/index.php?title=%E5%88%A9%E7%94%A8python%E7%9A%84theano%E5%BA%93%E5%88%B7kaggle_mnist%E6%8E%92%E8%A1%8C%E6%A6%9C&variant=zh-cn 21 | ''' 22 | 23 | def main(): 24 | pass 25 | 26 | 27 | if __name__ == '__main__': 28 | main() -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data Mining Competition Getting Started 2 | *************** 3 | ## Analytics Vidhya 4 | ### AV Loan Prediction [url](http://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction#) 5 | A small practice problem: predict from applicant features whether a housing loan should be granted; a binary classification task. 6 | There are 11 features in total (Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area); 7 | Loan_ID is the applicant ID and Loan_Status is the target to predict. The features include both numeric and categorical types. 8 | 9 | ## Data Castle 10 | ### Micro-loan borrower creditworthiness prediction competition [url](http://pkbigdata.com/common/competition/148.html) 11 | Same as above, except that this one has far more features. 12 | 13 | ## Kaggle 14 | ### Digit Recognizer [url](https://www.kaggle.com/c/digit-recognizer) 15 | A multi-class classification practice problem. 16 | 17 | ### Titanic: Machine Learning from Disaster [url](https://www.kaggle.com/c/titanic) 18 | A binary classification problem; submit 0/1 labels, evaluated by accuracy. 19 | 20 | ### Bag of Words Meets Bags of Popcorn [url](https://www.kaggle.com/c/word2vec-nlp-tutorial) 21 | A binary text sentiment classification problem. The evaluation metric is AUC. 22 | http://www.cnblogs.com/lijingpeng/p/5787549.html 23 | 24 | ### Display Advertising Challenge [url](https://www.kaggle.com/c/criteo-display-ad-challenge) 25 | An ad CTR prediction competition sponsored by the well-known advertising company Criteo. The data contain 40 million training samples and 5 million test samples, with 13 numeric features and 26 categorical features; the evaluation metric is logloss. 26 | The standard industry approach to CTR is LR, with extensive feature combination/transformation that can reach hundreds of millions of dimensions. LR was my first choice here as well; missing feature values were filled with the mode, and the 26 categorical features were one-hot encoded. 27 | Plotting the numeric features with pandas showed they are far from normally distributed and heavily skewed, so instead of scaling them to [0,1] 28 | I split each one into 6 intervals at the quantile points (min, 25%, median, 75%, max), with negative/extreme values assigned to intervals 1 and 6 as outliers, and then one-hot encoded everything together, ending up with roughly 1 million features and a 20+ GB training file. 29 | A few pitfalls worth highlighting: 1. it is best to implement one-hot encoding yourself, unless your machine has plenty of memory (everything must be loaded into numpy, and non-sparse); 2. LR is best trained with SGD or mini-batches, 30 | in out-of-core mode (http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html#example-applications-plot-out-of-core-classification-py), 31 | again unless you have plenty of memory; 3. think twice before coding: with data this large, re-running after a mid-way failure is very costly. 32 | I found that sklearn's LR and liblinear's LR behave quite differently: sklearn's L2 regularization beats its L1, while liblinear's L1 beats its L2; my understanding is that this comes from their different optimization methods. 33 | The best final result was liblinear's L1-regularized LR, logloss=0.46601, 227th/718 on the LB, which also matches the intuition that lasso produces sparse solutions. 34 | I also tried xgboost on its own, logloss=0.46946, probably because GBRT does not handle high-dimensional sparse features well. Facebook has a paper that feeds GBRT outputs as transformed features into a downstream linear classifier 35 | with good results; worth a look. (Practical Lessons from Predicting Clicks on Ads at Facebook) 36 | I only tried LR as a simple baseline; there are many more things to try, see the winners' solutions in the forum, 37 | for example: 1. the Vowpal Wabbit tool does not require distinguishing categorical from numeric features; 2. use libFFM for feature crosses; 3. the feature hashing trick; 4. add each feature's average click-through rate as a new feature; 5. multi-model ensembles, etc. 38 |
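A minimal sketch of the bucketing + one-hot + out-of-core recipe above, assuming Criteo's tab-separated format with columns Label, I1-I13, C1-C26; the file path, chunk size, hashing dimension, and bucket edges below are illustrative choices, not the code behind the 0.46601 submission, and `loss='log'` is plain logistic regression trained by SGD (newer scikit-learn versions spell it `'log_loss'`):

```python
# Sketch: bucket the 13 numeric features into 6 quantile intervals, treat the
# 26 categorical features as strings, one-hot everything via the hashing trick,
# and train logistic regression with SGD out-of-core, one chunk at a time.
import numpy as np
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier

NUM_COLS = ['I%d' % i for i in range(1, 14)]   # 13 numeric features
CAT_COLS = ['C%d' % i for i in range(1, 27)]   # 26 categorical features

hasher = FeatureHasher(n_features=2 ** 20, input_type='string')
clf = SGDClassifier(loss='log', penalty='l1')  # logistic regression via SGD

# bin_edges[c] would normally be quantiles estimated from a sample of the data;
# fixed placeholder edges are used here only to keep the sketch short.
bin_edges = {c: np.array([0.0, 1.0, 5.0, 20.0, 100.0]) for c in NUM_COLS}

def to_hashed(df):
    """Turn one chunk into hashed one-hot features built from 'col=value' tokens."""
    samples = []
    for _, row in df.iterrows():
        feats = ['%s=%s' % (c, row[c]) for c in CAT_COLS]
        for c in NUM_COLS:
            if pd.isnull(row[c]):
                feats.append('%s=missing' % c)              # missing numeric value
            else:
                b = int(np.digitize(row[c], bin_edges[c]))  # 0..5 -> 6 buckets
                feats.append('%s=bin%d' % (c, b))
        samples.append(feats)
    return hasher.transform(samples)            # sparse matrix, memory stays bounded

for chunk in pd.read_csv('train.txt', sep='\t', chunksize=100000,
                         names=['Label'] + NUM_COLS + CAT_COLS):
    X = to_hashed(chunk)
    y = chunk['Label'].values
    clf.partial_fit(X, y, classes=[0, 1])       # out-of-core: one pass per chunk
```

Hashing keeps memory bounded without ever materializing the million-column one-hot matrix, which is the point of the out-of-core approach described above.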
-------------------------------------------------------------------------------- /kaggle-titanic/README.md: -------------------------------------------------------------------------------- 1 | As an aside, there seems to be a victim list floating around online, and there are quite a few score 1.0 entries on the LB. Rumor has it that any score above 90% is suspected of cheating; true or not, most of the top 300 cluster between 0.808 and 0.818. I am out of ideas for further improvement on this problem; advice welcome! 2 | The data contain numeric and categorical features, with missing values. I one-hot encoded the categorical features; whether missing values should be filled with the mean/median/mode depends on the data, and I decide by printing each column's distribution with pandas. 3 | For models I tried DT/RF/GBDT/SVC. Since xgboost outputs probabilities, a threshold has to be chosen to produce 0/1; mine was probably poorly chosen, and it only scored 0.78847. 4 | RF worked best, at 0.81340. After feature selection, the features I used were 'Pclass','Gender', 'Cabin','Ticket','Embarked','Title' one-hot encoded, and 'Age','SibSp','Parch','Fare','class_age','Family' normalized. 5 | I also tried building new features and feature combinations, such as splitting Title into the four classes Mr/Mrs/Miss/Master or taking the first token of the split, adding fare_per_person, etc., and adding feature selection to the pipeline, but none of it improved the score; advice welcome! 6 | 7 | 8 | 9 | [Kaggle data mining competition primer -- Titanic](http://www.cnblogs.com/north-north/tag/kaggle/) 10 | 11 | [Kaggle Titanic Competition Part I – Intro] 12 | (http://www.ultravioletanalytics.com/2014/10/30/kaggle-titanic-competition-part-i-intro/) 13 | 14 | 15 | [Kaggle Competition | Titanic Machine Learning from Disaster] 16 | (http://nbviewer.ipython.org/github/agconti/kaggle-titanic/blob/master/Titanic.ipynb) 17 | 18 | https://github.com/agconti/kaggle-titanic 19 | 20 | http://www.sotoseattle.com/blog/categories/kaggle/ 21 | 22 | [Titanic: Machine Learning from Disaster - Getting Started With R] 23 | https://github.com/trevorstephens/titanic 24 | https://github.com/wehrley/wehrley.github.io/blob/master/SOUPTONUTS.md 25 | 26 | 27 | http://mlwave.com/tutorial-titanic-machine-learning-from-distaster/ 28 | Full Titanic Example with Random Forest 29 | https://www.youtube.com/watch?v=0GrciaGYzV0 30 | 31 | [Tutorial: Titanic dataset machine learning for Kaggle] 32 | (http://corpocrat.com/2014/08/29/tutorial-titanic-dataset-machine-learning-for-kaggle/) 33 | 34 | [Getting Started with R: Titanic Competition in Kaggle] 35 | (http://armandruiz.com/kaggle/Titanic_Kaggle_Analysis.html) 36 | 37 | [A complete guide to getting 0.79903 in Kaggle's Titanic Competition with Python](https://triangleinequality.wordpress.com/2013/09/05/a-complete-guide-to-getting-0-79903-in-kaggles-titanic-competition-with-python/) 38 | [Machine learning series (3): Logistic regression applied to the Kaggle Titanic disaster](http://blog.csdn.net/han_xiaoyang/article/details/49797143) 39 | https://www.kaggle.com/malais/titanic/kaggle-first-ipythonnotebook/notebook -------------------------------------------------------------------------------- /kaggle-titanic/code.py:
-------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | 4 | from sklearn import cross_validation 5 | from sklearn.cross_validation import KFold 6 | 7 | from sklearn.linear_model import LinearRegression 8 | from sklearn.linear_model import LogisticRegression 9 | from sklearn.ensemble import RandomForestClassifier 10 | 11 | titanic = pd.read_csv("./data/train.csv", dtype={"Age": np.float64}, ) 12 | 13 | # Preprocessing Data 14 | # ================== 15 | 16 | # Fill in missing value in "Age". 17 | titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median()) 18 | 19 | # Replace all the occurences of male with the number 0. 20 | titanic.loc[titanic["Sex"] == "male", "Sex"] = 0 21 | titanic.loc[titanic["Sex"] == "female", "Sex"] = 1 22 | 23 | # Convert the Embarked Column. 24 | titanic["Embarked"] = titanic["Embarked"].fillna("S") 25 | titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0 26 | titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1 27 | titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2 28 | 29 | 30 | # The columns we'll use to predict the target 31 | predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"] 32 | 33 | # Linear Regression 34 | # ================= 35 | alg = LinearRegression() 36 | # Generate cross validation folds for the titanic dataset. It return the row indices corresponding to train and test. 37 | # We set random_state to ensure we get the same splits every time we run this. 38 | kf = KFold(titanic.shape[0], n_folds=3) 39 | 40 | predictions = [] 41 | for train, test in kf: 42 | # The predictors we're using the train the algorithm. Note how we only take the rows in the train folds. 43 | train_predictors = (titanic[predictors].iloc[train,:]) 44 | # The target we're using to train the algorithm. 45 | train_target = titanic["Survived"].iloc[train] 46 | # Training the algorithm using the predictors and target. 47 | alg.fit(train_predictors, train_target) 48 | # We can now make predictions on the test fold 49 | test_predictions = alg.predict(titanic[predictors].iloc[test,:]) 50 | predictions.append(test_predictions) 51 | 52 | # Evaluating error and accuracy 53 | predictions = np.concatenate(predictions,axis = 0) 54 | predictions[predictions > .5] = 1 55 | predictions[predictions <= .5] = 0 56 | 57 | accuracy = sum(predictions[predictions == titanic["Survived"]]) / len(predictions) 58 | 59 | print(('Accuracy of Linear Regression on the training set is ' + str(accuracy))) 60 | 61 | # Logistic Regression 62 | # =================== 63 | alg = LogisticRegression() 64 | # Compute the accuracy score for all the cross validation folds. (much simpler than what we did before!) 65 | scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3) 66 | # Take the mean of the scores (because we have one for each fold) 67 | print(('Accuracy of Logistic Regression on the training set is ' + str(scores.mean()))) 68 | 69 | # Random Forest 70 | # =================== 71 | from sklearn.ensemble import RandomForestClassifier 72 | 73 | alg = RandomForestClassifier(n_estimators=1000,min_samples_leaf=5, max_features="auto", n_jobs=2, random_state=10) 74 | # Compute the accuracy score for all the cross validation folds. (much simpler than what we did before!) 
75 | scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3) 76 | # Take the mean of the scores (because we have one for each fold) 77 | print(('Accuracy of Random Forest on the training set is ' + str(scores.mean()))) 78 | 79 | # Test Set 80 | # ======== 81 | titanic_test = pd.read_csv("./data/test.csv", dtype={"Age": np.float64}, ) 82 | 83 | titanic_test["Age"] = titanic_test["Age"].fillna(titanic["Age"].median()) 84 | 85 | titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0 86 | titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1 87 | 88 | titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S") 89 | 90 | titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0 91 | titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1 92 | titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2 93 | 94 | titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic_test["Fare"].median()) 95 | 96 | # Train the algorithm using all the training data 97 | alg.fit(titanic[predictors], titanic["Survived"]) 98 | 99 | # Make predictions using the test set. 100 | predictions = alg.predict(titanic_test[predictors]) 101 | 102 | # Create a new dataframe with only the columns Kaggle wants from the dataset. 103 | submission = pd.DataFrame({ 104 | "PassengerId": titanic_test["PassengerId"], 105 | "Survived": predictions 106 | }) 107 | 108 | submission.to_csv('result_rf.csv', index=False) -------------------------------------------------------------------------------- /kaggle-titanic/lr.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | 4 | import pandas as pd 5 | import numpy as np 6 | import matplotlib.pyplot as plt 7 | 8 | base_path = './data/' 9 | train = pd.read_csv(base_path+'train.csv') 10 | 11 | # 初步观察数据 12 | #print(train.info()) 13 | ''' 14 | 特征信息: 15 | PassengerId => 乘客ID 16 | Pclass => 乘客等级(1/2/3等舱位) 17 | Name => 乘客姓名 18 | Sex => 性别 19 | Age => 年龄 20 | SibSp => 堂兄弟/妹个数 21 | Parch => 父母与小孩个数 22 | Ticket => 船票信息 23 | Fare => 票价 24 | Cabin => 客舱 25 | Embarked => 登船港口 26 | 27 | Age,Cabin列有缺失 28 | Name,Sex,Ticket,Cabin,Embarked列为分类类型 29 | ''' 30 | 31 | #print(train.describe()) 32 | ''' 33 | 查看数值类型特征的统计信息 34 | ''' 35 | 36 | # 数据初步分析 37 | ''' 38 | 看看每个/多个 属性和最后的Survived之间有着什么样的关系 39 | ''' 40 | 41 | def analyze_features(train): 42 | fig = plt.figure() 43 | 44 | fig.set(alpha=0.2) # 设定图表颜色alpha参数 45 | plt.subplot2grid((2,3), (0,0)) # 在一张大图里分列几个小图 46 | plt.title('显示中文') 47 | train.Survived.value_counts().plot(kind='bar') # 柱状图 48 | plt.title('获救情况 (1为获救)') # 标题 49 | plt.ylabel('人数') 50 | 51 | plt.subplot2grid((2,3),(0,1)) 52 | train.Pclass.value_counts().plot(kind='bar') 53 | plt.ylabel('人数') 54 | plt.title('乘客等级分布') 55 | 56 | plt.subplot2grid((2,3),(0,2)) 57 | plt.scatter(train.Survived, train.Age) 58 | plt.ylabel('年龄') 59 | plt.grid(b=True, which='major', axis='y') 60 | plt.title('按年龄看获救分布(1为获救)') 61 | 62 | plt.subplot2grid((2,3),(1,0), colspan=2) 63 | train.Age[train.Pclass == 1].plot(kind='kde') 64 | train.Age[train.Pclass == 2].plot(kind='kde') 65 | train.Age[train.Pclass == 3].plot(kind='kde') 66 | plt.xlabel("年龄")# plots an axis lable 67 | plt.ylabel("密度") 68 | plt.title("各等级的乘客年龄分布") 69 | plt.legend(('头等舱', '2等舱','3等舱'),loc='best') # sets our legend for our graph. 
70 | 71 | 72 | plt.subplot2grid((2,3),(1,2)) 73 | train.Embarked.value_counts().plot(kind='bar') 74 | plt.title("各登船口岸上船人数") 75 | plt.ylabel("人数") 76 | plt.show() 77 | 78 | analyze_features(train) 79 | ''' 80 | 不同舱位/乘客等级可能和财富/地位有关系,最后获救概率可能会不一样 81 | 年龄对获救概率也一定是有影响的,毕竟前面说了,副船长还说『小孩和女士先走』呢 82 | 和登船港口是不是有关系呢?也许登船港口不同,人的出身地位不同? 83 | ''' 84 | # http://blog.csdn.net/han_xiaoyang/article/details/49797143 -------------------------------------------------------------------------------- /kaggle-titanic/randomforest_gridsearchCV.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yew1eb/DM-Competition-Getting-Started/c7ac0d3226883a4e387cd24058e7618acd231fa5/kaggle-titanic/randomforest_gridsearchCV.py -------------------------------------------------------------------------------- /kaggle-titanic/result_rf.csv: -------------------------------------------------------------------------------- 1 | PassengerId,Survived 2 | 892,0 3 | 893,0 4 | 894,0 5 | 895,0 6 | 896,0 7 | 897,0 8 | 898,1 9 | 899,0 10 | 900,1 11 | 901,0 12 | 902,0 13 | 903,0 14 | 904,1 15 | 905,0 16 | 906,1 17 | 907,1 18 | 908,0 19 | 909,0 20 | 910,0 21 | 911,0 22 | 912,0 23 | 913,0 24 | 914,1 25 | 915,0 26 | 916,1 27 | 917,0 28 | 918,1 29 | 919,0 30 | 920,1 31 | 921,0 32 | 922,0 33 | 923,0 34 | 924,0 35 | 925,0 36 | 926,1 37 | 927,0 38 | 928,0 39 | 929,0 40 | 930,0 41 | 931,0 42 | 932,0 43 | 933,1 44 | 934,0 45 | 935,1 46 | 936,1 47 | 937,0 48 | 938,0 49 | 939,0 50 | 940,1 51 | 941,1 52 | 942,0 53 | 943,0 54 | 944,1 55 | 945,1 56 | 946,0 57 | 947,0 58 | 948,0 59 | 949,0 60 | 950,0 61 | 951,1 62 | 952,0 63 | 953,0 64 | 954,0 65 | 955,1 66 | 956,1 67 | 957,1 68 | 958,1 69 | 959,0 70 | 960,0 71 | 961,1 72 | 962,1 73 | 963,0 74 | 964,0 75 | 965,0 76 | 966,1 77 | 967,0 78 | 968,0 79 | 969,1 80 | 970,0 81 | 971,1 82 | 972,1 83 | 973,0 84 | 974,0 85 | 975,0 86 | 976,0 87 | 977,0 88 | 978,1 89 | 979,0 90 | 980,1 91 | 981,1 92 | 982,0 93 | 983,0 94 | 984,1 95 | 985,0 96 | 986,0 97 | 987,0 98 | 988,1 99 | 989,0 100 | 990,0 101 | 991,0 102 | 992,1 103 | 993,0 104 | 994,0 105 | 995,0 106 | 996,1 107 | 997,0 108 | 998,0 109 | 999,0 110 | 1000,0 111 | 1001,0 112 | 1002,0 113 | 1003,1 114 | 1004,1 115 | 1005,1 116 | 1006,1 117 | 1007,0 118 | 1008,0 119 | 1009,1 120 | 1010,0 121 | 1011,1 122 | 1012,1 123 | 1013,0 124 | 1014,1 125 | 1015,0 126 | 1016,0 127 | 1017,1 128 | 1018,0 129 | 1019,1 130 | 1020,0 131 | 1021,0 132 | 1022,0 133 | 1023,0 134 | 1024,0 135 | 1025,0 136 | 1026,0 137 | 1027,0 138 | 1028,0 139 | 1029,0 140 | 1030,0 141 | 1031,0 142 | 1032,0 143 | 1033,1 144 | 1034,0 145 | 1035,0 146 | 1036,1 147 | 1037,0 148 | 1038,0 149 | 1039,0 150 | 1040,1 151 | 1041,0 152 | 1042,1 153 | 1043,0 154 | 1044,0 155 | 1045,1 156 | 1046,0 157 | 1047,0 158 | 1048,1 159 | 1049,1 160 | 1050,1 161 | 1051,1 162 | 1052,1 163 | 1053,1 164 | 1054,1 165 | 1055,0 166 | 1056,0 167 | 1057,0 168 | 1058,0 169 | 1059,0 170 | 1060,1 171 | 1061,0 172 | 1062,0 173 | 1063,0 174 | 1064,0 175 | 1065,0 176 | 1066,0 177 | 1067,1 178 | 1068,1 179 | 1069,0 180 | 1070,1 181 | 1071,1 182 | 1072,0 183 | 1073,0 184 | 1074,1 185 | 1075,0 186 | 1076,1 187 | 1077,0 188 | 1078,1 189 | 1079,0 190 | 1080,0 191 | 1081,0 192 | 1082,0 193 | 1083,0 194 | 1084,0 195 | 1085,0 196 | 1086,1 197 | 1087,0 198 | 1088,1 199 | 1089,1 200 | 1090,0 201 | 1091,0 202 | 1092,1 203 | 1093,1 204 | 1094,0 205 | 1095,1 206 | 1096,0 207 | 1097,0 208 | 1098,1 209 | 1099,0 210 | 1100,1 211 | 1101,0 212 | 1102,0 213 | 1103,0 214 | 1104,0 215 | 1105,1 
216 | 1106,0 217 | 1107,0 218 | 1108,1 219 | 1109,0 220 | 1110,1 221 | 1111,0 222 | 1112,1 223 | 1113,0 224 | 1114,1 225 | 1115,0 226 | 1116,1 227 | 1117,1 228 | 1118,0 229 | 1119,1 230 | 1120,0 231 | 1121,0 232 | 1122,0 233 | 1123,1 234 | 1124,0 235 | 1125,0 236 | 1126,0 237 | 1127,0 238 | 1128,0 239 | 1129,0 240 | 1130,1 241 | 1131,1 242 | 1132,1 243 | 1133,1 244 | 1134,0 245 | 1135,0 246 | 1136,0 247 | 1137,0 248 | 1138,1 249 | 1139,0 250 | 1140,1 251 | 1141,0 252 | 1142,1 253 | 1143,0 254 | 1144,0 255 | 1145,0 256 | 1146,0 257 | 1147,0 258 | 1148,0 259 | 1149,0 260 | 1150,1 261 | 1151,0 262 | 1152,0 263 | 1153,0 264 | 1154,1 265 | 1155,1 266 | 1156,0 267 | 1157,0 268 | 1158,0 269 | 1159,0 270 | 1160,0 271 | 1161,0 272 | 1162,0 273 | 1163,0 274 | 1164,1 275 | 1165,1 276 | 1166,0 277 | 1167,1 278 | 1168,0 279 | 1169,0 280 | 1170,0 281 | 1171,0 282 | 1172,0 283 | 1173,1 284 | 1174,1 285 | 1175,1 286 | 1176,1 287 | 1177,0 288 | 1178,0 289 | 1179,0 290 | 1180,0 291 | 1181,0 292 | 1182,0 293 | 1183,1 294 | 1184,0 295 | 1185,0 296 | 1186,0 297 | 1187,0 298 | 1188,1 299 | 1189,0 300 | 1190,0 301 | 1191,0 302 | 1192,0 303 | 1193,0 304 | 1194,0 305 | 1195,0 306 | 1196,1 307 | 1197,1 308 | 1198,0 309 | 1199,1 310 | 1200,0 311 | 1201,0 312 | 1202,0 313 | 1203,0 314 | 1204,0 315 | 1205,1 316 | 1206,1 317 | 1207,1 318 | 1208,0 319 | 1209,0 320 | 1210,0 321 | 1211,0 322 | 1212,0 323 | 1213,0 324 | 1214,0 325 | 1215,1 326 | 1216,1 327 | 1217,0 328 | 1218,1 329 | 1219,0 330 | 1220,0 331 | 1221,0 332 | 1222,1 333 | 1223,0 334 | 1224,0 335 | 1225,1 336 | 1226,0 337 | 1227,0 338 | 1228,0 339 | 1229,0 340 | 1230,0 341 | 1231,0 342 | 1232,0 343 | 1233,0 344 | 1234,0 345 | 1235,1 346 | 1236,0 347 | 1237,1 348 | 1238,0 349 | 1239,1 350 | 1240,0 351 | 1241,1 352 | 1242,1 353 | 1243,0 354 | 1244,0 355 | 1245,0 356 | 1246,1 357 | 1247,0 358 | 1248,1 359 | 1249,0 360 | 1250,0 361 | 1251,1 362 | 1252,0 363 | 1253,1 364 | 1254,1 365 | 1255,0 366 | 1256,1 367 | 1257,0 368 | 1258,0 369 | 1259,0 370 | 1260,1 371 | 1261,0 372 | 1262,0 373 | 1263,1 374 | 1264,0 375 | 1265,0 376 | 1266,1 377 | 1267,1 378 | 1268,0 379 | 1269,0 380 | 1270,0 381 | 1271,0 382 | 1272,0 383 | 1273,0 384 | 1274,1 385 | 1275,0 386 | 1276,0 387 | 1277,1 388 | 1278,0 389 | 1279,0 390 | 1280,0 391 | 1281,0 392 | 1282,0 393 | 1283,1 394 | 1284,0 395 | 1285,0 396 | 1286,0 397 | 1287,1 398 | 1288,0 399 | 1289,1 400 | 1290,0 401 | 1291,0 402 | 1292,1 403 | 1293,0 404 | 1294,1 405 | 1295,0 406 | 1296,0 407 | 1297,0 408 | 1298,0 409 | 1299,0 410 | 1300,1 411 | 1301,1 412 | 1302,1 413 | 1303,1 414 | 1304,0 415 | 1305,0 416 | 1306,1 417 | 1307,0 418 | 1308,0 419 | 1309,0 420 | -------------------------------------------------------------------------------- /kaggle-titanic/result_xgb.csv: -------------------------------------------------------------------------------- 1 | PassengerId,Survived 2 | 892,0 3 | 893,0 4 | 894,0 5 | 895,0 6 | 896,1 7 | 897,0 8 | 898,0 9 | 899,0 10 | 900,1 11 | 901,0 12 | 902,0 13 | 903,0 14 | 904,1 15 | 905,0 16 | 906,1 17 | 907,1 18 | 908,0 19 | 909,0 20 | 910,1 21 | 911,0 22 | 912,0 23 | 913,0 24 | 914,1 25 | 915,1 26 | 916,1 27 | 917,0 28 | 918,1 29 | 919,1 30 | 920,1 31 | 921,0 32 | 922,0 33 | 923,0 34 | 924,1 35 | 925,0 36 | 926,1 37 | 927,0 38 | 928,0 39 | 929,0 40 | 930,0 41 | 931,1 42 | 932,0 43 | 933,1 44 | 934,0 45 | 935,1 46 | 936,1 47 | 937,0 48 | 938,0 49 | 939,0 50 | 940,1 51 | 941,1 52 | 942,0 53 | 943,0 54 | 944,1 55 | 945,1 56 | 946,0 57 | 947,0 58 | 948,0 59 | 949,0 60 | 950,0 61 | 951,1 62 | 952,0 63 | 953,0 64 
| 954,0 65 | 955,1 66 | 956,0 67 | 957,1 68 | 958,1 69 | 959,0 70 | 960,0 71 | 961,1 72 | 962,1 73 | 963,0 74 | 964,1 75 | 965,0 76 | 966,1 77 | 967,0 78 | 968,0 79 | 969,1 80 | 970,0 81 | 971,1 82 | 972,1 83 | 973,0 84 | 974,0 85 | 975,0 86 | 976,0 87 | 977,0 88 | 978,1 89 | 979,1 90 | 980,1 91 | 981,1 92 | 982,0 93 | 983,0 94 | 984,1 95 | 985,0 96 | 986,0 97 | 987,0 98 | 988,1 99 | 989,0 100 | 990,1 101 | 991,0 102 | 992,1 103 | 993,0 104 | 994,0 105 | 995,0 106 | 996,1 107 | 997,0 108 | 998,0 109 | 999,0 110 | 1000,0 111 | 1001,0 112 | 1002,0 113 | 1003,1 114 | 1004,1 115 | 1005,1 116 | 1006,1 117 | 1007,0 118 | 1008,0 119 | 1009,1 120 | 1010,1 121 | 1011,1 122 | 1012,1 123 | 1013,0 124 | 1014,1 125 | 1015,0 126 | 1016,0 127 | 1017,1 128 | 1018,0 129 | 1019,1 130 | 1020,0 131 | 1021,0 132 | 1022,0 133 | 1023,0 134 | 1024,0 135 | 1025,0 136 | 1026,0 137 | 1027,0 138 | 1028,1 139 | 1029,0 140 | 1030,0 141 | 1031,0 142 | 1032,0 143 | 1033,1 144 | 1034,0 145 | 1035,0 146 | 1036,1 147 | 1037,0 148 | 1038,0 149 | 1039,0 150 | 1040,1 151 | 1041,0 152 | 1042,1 153 | 1043,0 154 | 1044,0 155 | 1045,0 156 | 1046,0 157 | 1047,0 158 | 1048,1 159 | 1049,1 160 | 1050,1 161 | 1051,1 162 | 1052,1 163 | 1053,1 164 | 1054,1 165 | 1055,0 166 | 1056,0 167 | 1057,1 168 | 1058,0 169 | 1059,0 170 | 1060,1 171 | 1061,0 172 | 1062,0 173 | 1063,1 174 | 1064,0 175 | 1065,0 176 | 1066,0 177 | 1067,1 178 | 1068,1 179 | 1069,0 180 | 1070,1 181 | 1071,1 182 | 1072,0 183 | 1073,0 184 | 1074,1 185 | 1075,0 186 | 1076,1 187 | 1077,0 188 | 1078,1 189 | 1079,0 190 | 1080,0 191 | 1081,0 192 | 1082,0 193 | 1083,0 194 | 1084,1 195 | 1085,0 196 | 1086,1 197 | 1087,0 198 | 1088,1 199 | 1089,0 200 | 1090,0 201 | 1091,0 202 | 1092,1 203 | 1093,1 204 | 1094,0 205 | 1095,1 206 | 1096,0 207 | 1097,0 208 | 1098,0 209 | 1099,0 210 | 1100,1 211 | 1101,0 212 | 1102,0 213 | 1103,0 214 | 1104,0 215 | 1105,1 216 | 1106,0 217 | 1107,0 218 | 1108,1 219 | 1109,0 220 | 1110,1 221 | 1111,0 222 | 1112,1 223 | 1113,0 224 | 1114,1 225 | 1115,0 226 | 1116,1 227 | 1117,0 228 | 1118,0 229 | 1119,1 230 | 1120,0 231 | 1121,0 232 | 1122,0 233 | 1123,1 234 | 1124,0 235 | 1125,0 236 | 1126,0 237 | 1127,0 238 | 1128,0 239 | 1129,1 240 | 1130,1 241 | 1131,1 242 | 1132,1 243 | 1133,1 244 | 1134,0 245 | 1135,0 246 | 1136,0 247 | 1137,0 248 | 1138,1 249 | 1139,0 250 | 1140,1 251 | 1141,0 252 | 1142,1 253 | 1143,0 254 | 1144,0 255 | 1145,0 256 | 1146,0 257 | 1147,0 258 | 1148,0 259 | 1149,0 260 | 1150,1 261 | 1151,0 262 | 1152,0 263 | 1153,0 264 | 1154,1 265 | 1155,1 266 | 1156,0 267 | 1157,0 268 | 1158,0 269 | 1159,0 270 | 1160,0 271 | 1161,0 272 | 1162,0 273 | 1163,0 274 | 1164,1 275 | 1165,1 276 | 1166,0 277 | 1167,1 278 | 1168,0 279 | 1169,0 280 | 1170,0 281 | 1171,0 282 | 1172,0 283 | 1173,1 284 | 1174,1 285 | 1175,0 286 | 1176,1 287 | 1177,0 288 | 1178,0 289 | 1179,0 290 | 1180,0 291 | 1181,0 292 | 1182,0 293 | 1183,0 294 | 1184,0 295 | 1185,0 296 | 1186,0 297 | 1187,0 298 | 1188,1 299 | 1189,0 300 | 1190,0 301 | 1191,0 302 | 1192,0 303 | 1193,0 304 | 1194,0 305 | 1195,0 306 | 1196,1 307 | 1197,1 308 | 1198,0 309 | 1199,1 310 | 1200,0 311 | 1201,0 312 | 1202,0 313 | 1203,1 314 | 1204,0 315 | 1205,0 316 | 1206,1 317 | 1207,1 318 | 1208,0 319 | 1209,0 320 | 1210,0 321 | 1211,0 322 | 1212,0 323 | 1213,0 324 | 1214,0 325 | 1215,1 326 | 1216,1 327 | 1217,0 328 | 1218,1 329 | 1219,0 330 | 1220,0 331 | 1221,0 332 | 1222,1 333 | 1223,0 334 | 1224,0 335 | 1225,1 336 | 1226,0 337 | 1227,0 338 | 1228,0 339 | 1229,0 340 | 1230,0 341 | 1231,0 342 | 1232,0 343 | 1233,0 
344 | 1234,0 345 | 1235,1 346 | 1236,0 347 | 1237,1 348 | 1238,0 349 | 1239,0 350 | 1240,0 351 | 1241,1 352 | 1242,1 353 | 1243,0 354 | 1244,0 355 | 1245,0 356 | 1246,1 357 | 1247,0 358 | 1248,1 359 | 1249,0 360 | 1250,0 361 | 1251,1 362 | 1252,0 363 | 1253,1 364 | 1254,1 365 | 1255,0 366 | 1256,1 367 | 1257,0 368 | 1258,0 369 | 1259,0 370 | 1260,1 371 | 1261,0 372 | 1262,0 373 | 1263,1 374 | 1264,0 375 | 1265,0 376 | 1266,1 377 | 1267,1 378 | 1268,0 379 | 1269,0 380 | 1270,0 381 | 1271,0 382 | 1272,0 383 | 1273,0 384 | 1274,0 385 | 1275,0 386 | 1276,0 387 | 1277,1 388 | 1278,0 389 | 1279,0 390 | 1280,0 391 | 1281,0 392 | 1282,0 393 | 1283,1 394 | 1284,0 395 | 1285,0 396 | 1286,0 397 | 1287,1 398 | 1288,0 399 | 1289,1 400 | 1290,0 401 | 1291,0 402 | 1292,1 403 | 1293,0 404 | 1294,1 405 | 1295,0 406 | 1296,0 407 | 1297,0 408 | 1298,0 409 | 1299,0 410 | 1300,1 411 | 1301,1 412 | 1302,1 413 | 1303,1 414 | 1304,1 415 | 1305,0 416 | 1306,1 417 | 1307,0 418 | 1308,0 419 | 1309,0 420 | -------------------------------------------------------------------------------- /kaggle-titanic/sklearn-random-forest.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding:utf-8 -*- 3 | 4 | ''' 5 | @filename: sklearn-random-forest.py 6 | @author: yew1eb 7 | @site: http://blog.yew1eb.net 8 | @contact: yew1eb@gmail.com 9 | @time: 2015/12/27 下午 10:18 10 | ''' 11 | from sklearn import cross_validation 12 | from sklearn.cross_validation import train_test_split 13 | from sklearn.tree import DecisionTreeClassifier 14 | import numpy as np 15 | import pandas as pd 16 | import matplotlib.pyplot as plt 17 | from sklearn import metrics 18 | 19 | 20 | def load_data(): 21 | df = pd.read_csv('D:/dataset/titanic/train.csv', header=0) 22 | #特征选择 23 | # 只取出三个自变量 24 | # 将Age(年龄)缺失的数据补全 25 | # 将Pclass变量转变为三个哑(Summy)变量 26 | # 将sex转为0-1变量 27 | subdf = df[['Pclass', 'Sex', 'Age']] 28 | y = df.Survived 29 | age = subdf['Age'].fillna(value=subdf.Age.mean()) 30 | pclass = pd.get_dummies(subdf['Pclass'], prefix='Pclass') 31 | sex = (subdf['Sex']=='male').astype('int') 32 | X = pd.concat([pclass, age, sex], axis=1) 33 | #print(X.head()) 34 | return X, y 35 | 36 | # 分析各特征的重要性 37 | # feature_importance = clf.feature_importances_ 38 | # 对于随机森林如何得到变量的重要性,可以看scikit-learn官方文档 http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#example-ensemble-plot-forest-importances-py 39 | # important_features = X_train.columns.values[0::] 40 | # analyze_feature(feature_importance, important_features) 41 | def analyze_feature(feature_importance, important_features): 42 | feature_importance = 100.0 * (feature_importance / feature_importance.max()) 43 | sorted_idx = np.argsort(feature_importance)[::-1] 44 | pos = np.arange(sorted_idx.shape[0]) + 0.5 45 | plt.title('Feature Importance') 46 | plt.barh(pos, feature_importance[sorted_idx[::-1]], color='r', align='center') 47 | plt.yticks(pos, important_features) 48 | plt.xlabel('Relativ Importance') 49 | plt.draw() 50 | plt.show() 51 | 52 | def sklearn_decisoin_tree(X, y): 53 | X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=6) 54 | clf = DecisionTreeClassifier(criterion='entropy', max_depth=3, min_samples_leaf=5) 55 | bst = clf.fit(X_train, y_train) 56 | # 准确率 print("accuracy rate: {:.6f}".format(bst.score(X_test, y_test))) 57 | # 交叉验证 58 | scores = cross_validation.cross_val_score(clf, X, y, cv=10) 59 | print(scores) 60 | 61 | 62 | def main(): 63 | X, y = load_data() 64 | 65 
| 66 | 67 | 68 | if __name__ == '__main__': 69 | main() 70 | 71 | 72 | 73 | 74 | 75 | -------------------------------------------------------------------------------- /kaggle-titanic/xgb.py: -------------------------------------------------------------------------------- 1 | # This script shows you how to make a submission using a few 2 | # useful Python libraries. 3 | # It gets a public leaderboard score of 0.76077. 4 | # Maybe you can tweak it and do better...? 5 | 6 | import pandas as pd 7 | import xgboost as xgb 8 | from sklearn.preprocessing import LabelEncoder 9 | import numpy as np 10 | 11 | # Load the data 12 | train_df = pd.read_csv('./data/train.csv', header=0) 13 | test_df = pd.read_csv('./data/test.csv', header=0) 14 | 15 | # We'll impute missing values using the median for numeric columns and the most 16 | # common value for string columns. 17 | # This is based on some nice code by 'sveitser' at http://stackoverflow.com/a/25562948 18 | from sklearn.base import TransformerMixin 19 | class DataFrameImputer(TransformerMixin): 20 | def fit(self, X, y=None): 21 | self.fill = pd.Series([X[c].value_counts().index[0] 22 | if X[c].dtype == np.dtype('O') else X[c].median() for c in X], 23 | index=X.columns) 24 | return self 25 | def transform(self, X, y=None): 26 | return X.fillna(self.fill) 27 | 28 | feature_columns_to_use = ['Pclass','Sex','Age','Fare','Parch'] 29 | nonnumeric_columns = ['Sex'] 30 | 31 | # Join the features from train and test together before imputing missing values, 32 | # in case their distribution is slightly different 33 | big_X = train_df[feature_columns_to_use].append(test_df[feature_columns_to_use]) 34 | big_X_imputed = DataFrameImputer().fit_transform(big_X) 35 | 36 | # XGBoost doesn't (yet) handle categorical features automatically, so we need to change 37 | # them to columns of integer values. 38 | # See http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing for more 39 | # details and options 40 | le = LabelEncoder() 41 | for feature in nonnumeric_columns: 42 | big_X_imputed[feature] = le.fit_transform(big_X_imputed[feature]) 43 | 44 | # Prepare the inputs for the model 45 | train_X = big_X_imputed[0:train_df.shape[0]].as_matrix() 46 | test_X = big_X_imputed[train_df.shape[0]::].as_matrix() 47 | train_y = train_df['Survived'] 48 | 49 | # You can experiment with many other options here, using the same .fit() and .predict() 50 | # methods; see http://scikit-learn.org 51 | # This example uses the current build of XGBoost, from https://github.com/dmlc/xgboost 52 | gbm = xgb.XGBClassifier(max_depth=6, n_estimators=1000, learning_rate=0.02).fit(train_X, train_y) 53 | predictions = gbm.predict(test_X) 54 | 55 | # Kaggle needs the submission to have a certain format; 56 | # see https://www.kaggle.com/c/titanic-gettingStarted/download/gendermodel.csv 57 | # for an example of what it's supposed to look like. 
58 | submission = pd.DataFrame({ 'PassengerId': test_df['PassengerId'], 59 | 'Survived': predictions }) 60 | submission.to_csv("result_xgb.csv", index=False) -------------------------------------------------------------------------------- /kaggle-titanic/笔记1.md: -------------------------------------------------------------------------------- 1 | ## Kaggle-Titanic 2 | An introductory Kaggle problem; a binary classification task. 3 | ### Problem background 4 | A classic scene in Titanic: the luxury liner goes down, everyone scrambles to escape, but the lifeboats are limited and not everyone can be saved at once. 5 | At that point the first officer calls out: ladies and kids first! This is not an arbitrary evacuation order; certain people had priority, for example nobles, women, and children. 6 | So the problem is: given personal information and survival status for some passengers, train a suitable model from this information and predict the survival status of the others. 7 | ### Data set 8 | 9 | Fields are comma-separated, and each row contains the following fields: 10 | ``` 11 | PassengerID 12 | Survived (survived or not) 13 | Pclass (cabin class) 14 | Name 15 | Sex 16 | Age 17 | SibSp (number of siblings and spouses aboard) 18 | Parch (number of parents and children aboard) 19 | Ticket (ticket number) 20 | Fare 21 | Cabin (cabin location) 22 | Embarked (port of embarkation) 23 | ``` 24 | ### Evaluation 25 | The competition evaluates models by accuracy 26 | $$\mathrm{accuracy}=\frac{\sum_{i=1}^{N} I(\hat{y}_i = y_i)}{N}$$ 27 | where $\hat{y}_i$ is the predicted value and $y_i$ is the true value. 28 | 29 | ### Model selection 30 | Common classification models include SVM, LR, Naive Bayes, CART, and the tree models derived from CART: Random Forest and GBDT. I recently studied GBDT in detail 31 | and found its fitting power close to perfect; with tuned parameters the effect of overfitting can be reduced, and it is said that even Gaussian processes cannot match it, so this time I decided to use GBDT as the main model. 32 | 33 | ### Feature selection 34 | My first reaction was that fields like Name, Ticket, and Cabin are too fragmented -- nearly every passenger has a different value -- and seem useless. 35 | The Cabin feature also has severe missingness, so for now I dropped the Ticket and Cabin features. 36 | Name, on the other hand, looks useless since everyone's name differs, but watching GBDT during training shows this feature is actually used very frequently. 37 | Intuitively, Name gives the model some generalization power: in a crisis a family will first look for each other and escape together, so survival within a family is highly correlated. 38 | So the way I introduce the name feature is not to feed in the name directly, but to use the survival rate of passengers who share the current passenger's surname. 39 | There is also a background detail: during the evacuation women went first, which separates families, so the name feature must also take sex into account; in short, the survival rate for (sex + surname) is used as one feature. 40 | In addition I added per-category survival rates as features, such as the survival rate of each sex and of each class. I once asked others: 41 | after throwing the category ID into the model, is it still necessary to throw in that category's survival percentage? Yes, it is: the former captures "which category it is", the latter captures "how much of that category survived", and the two should not be conflated. 42 | 43 | ### Missing features 44 | Missing values in some feature columns are perfectly normal when training a model. The solutions I know of so far: 45 | drop the row entirely (when you have data to spare) 46 | give all missing values a new label of their own and let the model learn how much weight to give it (this probably only works with a lot of data) 47 | if the feature's missing rate is very high: 48 | drop the column entirely 49 | or fit a model to predict this feature 50 | fill in a default value, which can be the mean or the mode (this is really quite similar to the previous model-fitting approach; imputing with the mean or mode can be seen as a hand-made maximum likelihood estimate). 51 | 52 | ### Random forest practice with scikit-learn 53 | sklearn-random-forest.py 54 | --------------------------------------------------------------------------------
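A minimal pandas sketch of the (sex + surname) survival-rate feature described in 笔记1.md above, assuming Kaggle's train.csv column names; the +1 smoothing and the variable names are illustrative choices rather than the notes' actual code:

```python
# Sketch: survival rate of passengers sharing the same surname and sex,
# plus the per-category survival rates (by Sex, by Pclass) mentioned above.
import pandas as pd

train = pd.read_csv('train.csv')

# Surname is the part of Name before the comma, e.g. "Braund, Mr. Owen Harris".
train['Surname'] = train['Name'].str.split(',').str[0].str.strip()

# Survival rate per (Surname, Sex) group, lightly smoothed toward the global
# rate so that one-member groups do not get a hard 0 or 1.
global_rate = train['Survived'].mean()
grp = train.groupby(['Surname', 'Sex'])['Survived'].agg(['sum', 'count'])
grp['surname_sex_rate'] = (grp['sum'] + global_rate) / (grp['count'] + 1.0)

train = train.merge(grp[['surname_sex_rate']].reset_index(),
                    on=['Surname', 'Sex'], how='left')

# Per-category survival rates as additional features.
train['sex_rate'] = train.groupby('Sex')['Survived'].transform('mean')
train['pclass_rate'] = train.groupby('Pclass')['Survived'].transform('mean')

print(train[['Surname', 'Sex', 'surname_sex_rate', 'sex_rate', 'pclass_rate']].head())
```

For honest cross-validation these rates should be computed on the training folds only (or leave-one-out), otherwise each row's own label leaks into its feature; test passengers then take the rate of their (Surname, Sex) group from the training data, falling back to the global rate when the group is unseen.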