├── README.md
├── First
│   ├── 线上提交.ipynb
│   ├── 线下验证.ipynb
│   ├── 遗传算法.ipynb
│   └── 初赛处理.ipynb
└── Second
    └── 复赛处理.ipynb
/README.md:
--------------------------------------------------------------------------------
1 | # Tianchi Industrial AI Competition - Intelligent Manufacturing Quality Prediction: approach and code
2 |
3 | The First folder holds the PCA-based code used at the start of the preliminary round. The code has been trimmed: the parts that screened features by Pearson correlation and mutual information were removed, and using GBDT for feature selection requires replacing and modifying the code in the 初赛处理 (preliminary-round processing) notebook.
4 |
5 | 初赛处理 (preliminary-round processing): splits the raw data into several datasets by TOOL, applies PCA to each dataset separately, and merges the results into one training set; this scheme was abandoned later on.
6 | 线下验证 (offline validation): after dropping all-null columns, columns with excessively large values, and single-valued columns, builds the training set with GBDT + SelectFromModel, then validates with single xgb and lgb models as well as a Stacking ensemble.
7 | 线上提交 (online submission): submits the results of the single xgb model and the Stacking model.
8 | 遗传算法 (genetic algorithm): the original plan was to use a genetic algorithm to select features further on top of the GBDT candidate feature set, but once the overall framework was written there were too many details left to tune, so in the end only the fitness function was used to assist offline validation.
9 | Genetic-algorithm framework:
10 | 1. The population consists of 10 individuals; each individual's gene length equals the size of the candidate feature set, genes use binary 0/1 encoding, and the population is initialized with randomly generated gene sequences.
11 | 2. The fitness function is the 10-fold cross-validated linear-regression score.
12 | 3. The selection operator is the roulette wheel: individuals of the current generation are sampled with probability proportional to their fitness.
13 | 4. The crossover operator is two-point crossover: a randomly chosen segment of the gene sequence is swapped between the father and the mother.
14 | 5. The mutation operator is random mutation: a mutation-probability threshold is fixed, each individual draws a random number, and if it falls below the threshold a random gene position of that individual is flipped.
15 | tips:
16 | In practice, after several generations the roulette wheel drove all individuals to the same fitness value, and that value was not the optimum. Most likely the best individual of each generation was never preserved, so newly produced optima were shuffled away by the roulette wheel. The selection operator could be swapped for tournament selection or similar (a sketch with elitism follows this section).
17 | The fitness function uses linear regression mainly to save time; a tree model should be used as the fitness evaluator instead, to stay consistent with the model used for feature selection.
18 | The two-point crossover depends directly on the random choice of the two endpoints: endpoints that are too far apart may swap out important features, while endpoints that are too close make the crossover ineffective. More flexible schemes such as fixed-length single-segment or fixed-length multi-segment crossover could be considered.
19 | For the mutation operator the main question is how to choose the mutation threshold; it can be tuned by enlarging the population and running more generations.
20 | There are many more methods and tricks for these key genetic-algorithm operators... something to refine in later competitions.
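The repository keeps the roulette wheel; the snippet below is only a hedged sketch of the alternative suggested in the tips (tournament selection plus elitism). It assumes the fitness table layout used in 遗传算法.ipynb: a score column named 'fintness' (higher is better, as returned by the notebook's fitness function) and a 'Gene' column holding the 0/1 feature mask; the function names are illustrative.

```python
import pandas as pd

def tournament_select(fitness_df, k=3):
    """Pick one parent: sample k individuals at random and keep the fittest."""
    contestants = fitness_df.sample(n=k)
    return contestants.loc[contestants['fintness'].idxmax(), 'Gene']

def next_generation(fitness_df, pop_size=10):
    """Elitism: carry the current best gene over unchanged, then fill the rest
    of the new population with tournament winners."""
    elite = fitness_df.loc[fitness_df['fintness'].idxmax(), 'Gene']
    return [elite] + [tournament_select(fitness_df) for _ in range(pop_size - 1)]
```

Keeping the elite individual addresses the first tip directly: the best gene can no longer be lost between generations.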
21 |
22 | The Second folder holds the code for the semi-final (复赛).
23 |
24 | 复赛处理 (semi-final processing): generates the training data in the following steps (a condensed sketch follows the list):
25 | 1. Drop all-null columns.
26 | 2. Convert object columns, encoding the TOOL columns numerically.
27 | 3. Drop date-like columns with very large values.
28 | 4. Drop columns whose values are all identical.
29 | 5. Handle outliers with the quartile (IQR) rule.
30 | 6. Drop all-identical columns once more.
31 | 7. Drop low-variance columns.
32 | 8. Select features with GBDT + SelectFromModel.
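A condensed sketch of steps 1-8 (the full version, with the exact column names and thresholds, is in 复赛处理.ipynb; the helper names here are illustrative only):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import SelectFromModel, VarianceThreshold
from sklearn.preprocessing import LabelEncoder

def preprocess(df):
    df = df.dropna(axis=1, how='all')                      # 1. drop all-null columns
    for col in df.columns[df.dtypes == object]:            # 2. encode object (TOOL) columns
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    df = df.drop(columns=[c for c in df.columns if df[c].min() > 1e13])  # 3. date-like values
    df = df.fillna(df.mean())
    df = df.loc[:, df.nunique() > 1]                       # 4. drop constant columns
    for c in df.columns:                                   # 5. IQR rule: replace outliers by the mean
        q1, q3 = df[c].quantile(0.25), df[c].quantile(0.75)
        lo, hi = q1 - 3 * (q3 - q1), q3 + 3 * (q3 - q1)
        df.loc[(df[c] < lo) | (df[c] > hi), c] = df[c].mean()
    df = df.loc[:, df.nunique() > 1]                       # 6. constants may reappear; drop again
    return pd.DataFrame(VarianceThreshold(threshold=0.5).fit_transform(df))  # 7. low variance

def select_features(train, y, test):
    gbdt = GradientBoostingRegressor(learning_rate=0.01, n_estimators=200, random_state=0)
    gbdt.fit(train, y)                                     # 8. GBDT importances + SelectFromModel
    selector = SelectFromModel(gbdt, prefit=True)
    return selector.transform(train), selector.transform(test)
```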
33 |
34 | The final result is a fusion of the single xgb model and the Stacking model (a plain average of the two predictions; see the snippet below).
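"Fusion" here is simply the element-wise mean of the two prediction vectors, as in the last cells of 复赛处理.ipynb; a minimal sketch:

```python
import numpy as np

def fuse(stacking_pred, xgb_pred):
    """Average the stacking and single-xgb predictions element-wise."""
    return (np.ravel(stacking_pred) + np.ravel(xgb_pred)) / 2.0
```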
35 |
--------------------------------------------------------------------------------
/First/线上提交.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": false,
8 | "deletable": true,
9 | "editable": true
10 | },
11 | "outputs": [
12 | {
13 | "name": "stderr",
14 | "output_type": "stream",
15 | "text": [
16 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\sklearn\\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n",
17 | " \"This module will be removed in 0.20.\", DeprecationWarning)\n"
18 | ]
19 | }
20 | ],
21 | "source": [
22 | "import pandas as pd\n",
23 | "import numpy as np\n",
24 | "import gc\n",
25 | "import xgboost as xgb\n",
26 | "import lightgbm as lgb\n",
27 | "from sklearn.model_selection import train_test_split\n",
28 | "from sklearn.cross_validation import KFold, StratifiedKFold, cross_val_score\n",
29 | "from datetime import datetime\n",
30 | "from catboost import CatBoostRegressor\n",
31 | "from sklearn.preprocessing import StandardScaler,LabelEncoder, OneHotEncoder, minmax_scale, PolynomialFeatures\n",
32 | "from sklearn import tree\n",
33 | "from sklearn import linear_model\n",
34 | "from sklearn import svm\n",
35 | "from sklearn import neighbors\n",
36 | "from sklearn import ensemble\n",
37 | "from sklearn.tree import ExtraTreeRegressor\n",
38 | "from sklearn.decomposition import PCA\n",
39 | "from sklearn.manifold import TSNE\n",
40 | "from sklearn.feature_selection import SelectFromModel, VarianceThreshold,RFE"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": 2,
46 | "metadata": {
47 | "collapsed": true,
48 | "deletable": true,
49 | "editable": true
50 | },
51 | "outputs": [],
52 | "source": [
53 | "def evalerror(y, y_pred):\n",
54 | " loss = np.sum(np.square(y - y_pred))\n",
55 | " n = len(y)\n",
56 | " return loss / n"
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": 3,
62 | "metadata": {
63 | "collapsed": true,
64 | "deletable": true,
65 | "editable": true
66 | },
67 | "outputs": [],
68 | "source": [
69 | "train = pd.read_csv('train/train.csv')\n",
70 | "test = pd.read_csv('train/test.csv')\n",
71 | "y = pd.read_csv('train/y.csv')"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": 4,
77 | "metadata": {
78 | "collapsed": false,
79 | "deletable": true,
80 | "editable": true
81 | },
82 | "outputs": [
83 | {
84 | "name": "stderr",
85 | "output_type": "stream",
86 | "text": [
87 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\sklearn\\utils\\validation.py:526: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
88 | " y = column_or_1d(y, warn=True)\n"
89 | ]
90 | }
91 | ],
92 | "source": [
93 | "#GBDT筛选特征\n",
94 | "clf_gt2 = ensemble.GradientBoostingRegressor(max_depth=1, n_estimators=320, random_state=1)\n",
95 | "clf_gt2.fit(train, y)\n",
96 | "\n",
97 | "model1 = SelectFromModel(clf_gt2, prefit=True) \n",
98 | "train = pd.DataFrame(model1.transform(train))\n",
99 | "test = pd.DataFrame(model1.transform(test))"
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "execution_count": 5,
105 | "metadata": {
106 | "collapsed": true,
107 | "deletable": true,
108 | "editable": true
109 | },
110 | "outputs": [],
111 | "source": [
112 | "#线上xgb单模型\n",
113 | "param = {}\n",
114 | "param['eta'] = 0.01\n",
115 | "param['max_depth'] = 6\n",
116 | "#param['mmin_child_weight'] = 5\n",
117 | "param['subsample'] = 0.8\n",
118 | "param['colsample_bytree'] = 0.3\n",
119 | "num_round = 750\n",
120 | "\n",
121 | "xgbTrain = xgb.DMatrix(train, label=y)\n",
122 | "xgbTest = xgb.DMatrix(test)\n",
123 | "modle = xgb.train(param, xgbTrain, num_round, )\n",
124 | "result_xgb = modle.predict(xgbTest)"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": 8,
130 | "metadata": {
131 | "collapsed": false,
132 | "deletable": true,
133 | "editable": true,
134 | "scrolled": true
135 | },
136 | "outputs": [
137 | {
138 | "name": "stdout",
139 | "output_type": "stream",
140 | "text": [
141 | "the model 0\n",
142 | "the model 1\n",
143 | "the model 2\n",
144 | "the model 3\n"
145 | ]
146 | }
147 | ],
148 | "source": [
149 | "#线上Stacking模型\n",
150 | "model_gb = ensemble.GradientBoostingRegressor(n_estimators=450, \n",
151 | " max_depth=2, \n",
152 | " subsample=0.8, \n",
153 | " learning_rate=0.01, \n",
154 | " random_state=0, \n",
155 | " max_features=0.2)\n",
156 | "modle0 = xgb.XGBRegressor(learning_rate=0.01, \n",
157 | " max_depth=3, \n",
158 | " colsample_bytree=0.2, \n",
159 | " subsample=0.8, \n",
160 | " seed=0, \n",
161 | " n_estimators=2100)\n",
162 | "modle1 = xgb.XGBRegressor(learning_rate=0.01, \n",
163 | " max_depth=3, \n",
164 | " colsample_bytree=0.3, \n",
165 | " subsample=0.8, \n",
166 | " seed=0, \n",
167 | " n_estimators=1600,\n",
168 | " min_child_weight=6)\n",
169 | "\n",
170 | "clf1 = lgb.LGBMRegressor(colsample_bytree=0.3,\n",
171 | " learning_rate=0.01, \n",
172 | " subsample=0.8, \n",
173 | " num_leaves=4, \n",
174 | " objective='regression', \n",
175 | " n_estimators=350, \n",
176 | " seed=0)\n",
177 | "base_model = [['xgb0', modle0],\n",
178 | " ['xgb1', modle1], \n",
179 | " ['gb', model_gb],\n",
180 | " ['lgb', clf1],]\n",
181 | "\n",
182 | "folds = list(KFold(len(train), n_folds=5, random_state=0))\n",
183 | "S_train = np.zeros((train.shape[0], len(base_model)))\n",
184 | "S_test = np.zeros((test.shape[0], len(base_model))) \n",
185 | "for index, item in enumerate(base_model):\n",
186 | " print(\"the model\", index)\n",
187 | " clf = item[1]\n",
188 | " S_test_i = np.zeros((test.shape[0], len(folds)))\n",
189 | " for j, (train_idx, test_idx) in enumerate(folds):\n",
190 | " X_train = train.ix[train_idx, :]\n",
191 | " X_valid = train.ix[test_idx, :]\n",
192 | " Y = y.ix[train_idx, :]\n",
193 | " clf.fit(X_train, Y['Y'])\n",
194 | " S_train[test_idx, index] = clf.predict(X_valid)\n",
195 | " S_test_i[:, j] = clf.predict(test) \n",
196 | " S_test[:, index] = S_test_i.mean(1)\n",
197 | " \n",
198 | "linreg = linear_model.LinearRegression()\n",
199 | "linreg.fit(S_train, y)\n",
200 | "\n",
201 | "result = linreg.predict(S_test)"
202 | ]
203 | },
204 | {
205 | "cell_type": "code",
206 | "execution_count": 10,
207 | "metadata": {
208 | "collapsed": false,
209 | "deletable": true,
210 | "editable": true
211 | },
212 | "outputs": [],
213 | "source": [
214 | "sub = pd.read_csv('data/测试A-答案模板.csv', names=['ID'])\n",
215 | "sub['res'] = pd.DataFrame(result)[0]"
216 | ]
217 | },
218 | {
219 | "cell_type": "code",
220 | "execution_count": 128,
221 | "metadata": {
222 | "collapsed": true,
223 | "deletable": true,
224 | "editable": true
225 | },
226 | "outputs": [],
227 | "source": [
228 | "sub.to_csv('submission{}.csv'.format(datetime.now().strftime('%Y%m%d_%H%M%S')), index=False, header=None)"
229 | ]
230 | }
231 | ],
232 | "metadata": {
233 | "kernelspec": {
234 | "display_name": "Python 3",
235 | "language": "python",
236 | "name": "python3"
237 | },
238 | "language_info": {
239 | "codemirror_mode": {
240 | "name": "ipython",
241 | "version": 3
242 | },
243 | "file_extension": ".py",
244 | "mimetype": "text/x-python",
245 | "name": "python",
246 | "nbconvert_exporter": "python",
247 | "pygments_lexer": "ipython3",
248 | "version": "3.5.0"
249 | }
250 | },
251 | "nbformat": 4,
252 | "nbformat_minor": 2
253 | }
254 |
--------------------------------------------------------------------------------
/First/线下验证.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": false,
8 | "deletable": true,
9 | "editable": true
10 | },
11 | "outputs": [
12 | {
13 | "name": "stderr",
14 | "output_type": "stream",
15 | "text": [
16 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\sklearn\\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n",
17 | " \"This module will be removed in 0.20.\", DeprecationWarning)\n"
18 | ]
19 | }
20 | ],
21 | "source": [
22 | "import pandas as pd\n",
23 | "import numpy as np\n",
24 | "import gc\n",
25 | "import xgboost as xgb\n",
26 | "import lightgbm as lgb\n",
27 | "from sklearn.model_selection import train_test_split\n",
28 | "from sklearn.cross_validation import KFold, StratifiedKFold, cross_val_score\n",
29 | "from datetime import datetime\n",
30 | "from catboost import CatBoostRegressor\n",
31 | "from sklearn.preprocessing import StandardScaler,LabelEncoder, OneHotEncoder, minmax_scale, scale,PolynomialFeatures\n",
32 | "from sklearn import tree\n",
33 | "from sklearn import linear_model\n",
34 | "from sklearn import svm\n",
35 | "from sklearn import neighbors\n",
36 | "from sklearn import ensemble\n",
37 | "from sklearn.feature_selection import SelectFromModel, VarianceThreshold,RFE\n",
38 | "from minepy import MINE\n",
39 | "from mlxtend.regressor import StackingRegressor"
40 | ]
41 | },
42 | {
43 | "cell_type": "code",
44 | "execution_count": 27,
45 | "metadata": {
46 | "collapsed": true,
47 | "deletable": true,
48 | "editable": true
49 | },
50 | "outputs": [],
51 | "source": [
52 | "def evalerror(y, y_pred):\n",
53 | " loss = np.sum(np.square(y - y_pred))\n",
54 | " n = len(y)\n",
55 | " return loss / n"
56 | ]
57 | },
58 | {
59 | "cell_type": "code",
60 | "execution_count": 123,
61 | "metadata": {
62 | "collapsed": false,
63 | "deletable": true,
64 | "editable": true
65 | },
66 | "outputs": [],
67 | "source": [
68 | "train = pd.read_csv('train/train.csv')\n",
69 | "test = pd.read_csv('train/test.csv')\n",
70 | "y = pd.read_csv('train/y.csv')"
71 | ]
72 | },
73 | {
74 | "cell_type": "code",
75 | "execution_count": 125,
76 | "metadata": {
77 | "collapsed": false,
78 | "deletable": true,
79 | "editable": true
80 | },
81 | "outputs": [
82 | {
83 | "name": "stderr",
84 | "output_type": "stream",
85 | "text": [
86 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\sklearn\\utils\\validation.py:526: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
87 | " y = column_or_1d(y, warn=True)\n"
88 | ]
89 | }
90 | ],
91 | "source": [
92 | "#GBDT筛选特征\n",
93 | "clf_gt2 = ensemble.GradientBoostingRegressor(max_depth=1, n_estimators=320, random_state=1)\n",
94 | "clf_gt2.fit(train, y)\n",
95 | "\n",
96 | "\n",
97 | "model1 = SelectFromModel(clf_gt2, prefit=True) \n",
98 | "train = pd.DataFrame(model1.transform(train))\n",
99 | "test = pd.DataFrame(model1.transform(test))"
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "execution_count": 141,
105 | "metadata": {
106 | "collapsed": false,
107 | "deletable": true,
108 | "editable": true,
109 | "scrolled": false
110 | },
111 | "outputs": [
112 | {
113 | "name": "stdout",
114 | "output_type": "stream",
115 | "text": [
116 | "0.025204\n"
117 | ]
118 | }
119 | ],
120 | "source": [
121 | "#单xgb模型线下5折 交叉验证\n",
122 | "def wmaeEval(preds, dtrain):\n",
123 | " label = dtrain.get_label()\n",
124 | " return 'error', np.sum(np.square(preds - label)) / len(label)\n",
125 | "\n",
126 | "param = {}\n",
127 | "param['eta'] = 0.01\n",
128 | "param['max_depth'] = 3\n",
129 | "param['subsample'] = 0.8\n",
130 | "param['colsample_bytree'] = 0.3\n",
131 | "\n",
132 | "param['seed'] = 1\n",
133 | "num_round = 10000\n",
134 | "\n",
135 | "xgbTrain = xgb.DMatrix(train, label=y)\n",
136 | "modle = xgb.cv(param, xgbTrain, num_boost_round=4200, feval=wmaeEval, nfold=5)\n",
137 | "print(modle.iloc[-1, 0]) "
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": null,
143 | "metadata": {
144 | "collapsed": true
145 | },
146 | "outputs": [],
147 | "source": [
148 | "#单lgb模型线下5折 交叉验证\n",
149 | "params = {}\n",
150 | "params['learning_rate'] = 0.01\n",
151 | "params['boosting_type'] = 'gbdt'\n",
152 | "params['objective'] = 'regression' \n",
153 | "params['feature_fraction'] = 0.3 \n",
154 | "params['bagging_fraction'] = 0.8 \n",
155 | "params['num_leaves'] = 64 \n",
156 | "result = [] \n",
157 | "folds = list(KFold(len(train), n_folds=5, random_state=0))\n",
158 | "for j, (train_idx, test_idx) in enumerate(folds):\n",
159 | " print(\"the folds\", j)\n",
160 | " X_train = train.ix[train_idx, :]\n",
161 | " X_valid = train.ix[test_idx, :]\n",
162 | " \n",
163 | " Y_train = y.ix[train_idx, :]\n",
164 | " Y_valid = y.ix[test_idx, :]\n",
165 | " d_train = lgb.Dataset(X_train, label=Y_train['Y'])\n",
166 | " clf = lgb.train(params, d_train, 620)\n",
167 | " preds = clf.predict(X_valid)\n",
168 | " result.append(evalerror(preds, Y_valid['Y']))\n",
169 | " \n",
170 | "np.mean(result)"
171 | ]
172 | },
173 | {
174 | "cell_type": "code",
175 | "execution_count": 37,
176 | "metadata": {
177 | "collapsed": false,
178 | "deletable": true,
179 | "editable": true
180 | },
181 | "outputs": [
182 | {
183 | "name": "stderr",
184 | "output_type": "stream",
185 | "text": [
186 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:4: SettingWithCopyWarning: \n",
187 | "A value is trying to be set on a copy of a slice from a DataFrame\n",
188 | "\n",
189 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
190 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:7: SettingWithCopyWarning: \n",
191 | "A value is trying to be set on a copy of a slice from a DataFrame\n",
192 | "\n",
193 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
194 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n",
195 | "A value is trying to be set on a copy of a slice from a DataFrame\n",
196 | "\n",
197 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n"
198 | ]
199 | }
200 | ],
201 | "source": [
202 | "#线下Stacking 取20%数据集进行验证\n",
203 | "x_train, x_valid, y_train, y_valid = train_test_split(train, y, test_size = 0.2, random_state=1)\n",
204 | "x_train.reset_index(inplace=True)\n",
205 | "x_train.drop(['index'], axis=1, inplace=True)\n",
206 | "\n",
207 | "y_train.reset_index(inplace=True)\n",
208 | "y_train.drop(['index'], axis=1, inplace=True)\n",
209 | "\n",
210 | "y_valid.reset_index(inplace=True)\n",
211 | "y_valid.drop(['index'], axis=1, inplace=True)"
212 | ]
213 | },
214 | {
215 | "cell_type": "code",
216 | "execution_count": 69,
217 | "metadata": {
218 | "collapsed": false,
219 | "deletable": true,
220 | "editable": true
221 | },
222 | "outputs": [
223 | {
224 | "name": "stdout",
225 | "output_type": "stream",
226 | "text": [
227 | "the model 0\n",
228 | "the model 1\n",
229 | "the model 2\n",
230 | "the model 3\n"
231 | ]
232 | },
233 | {
234 | "data": {
235 | "text/plain": [
236 | "0.027016934367817883"
237 | ]
238 | },
239 | "execution_count": 69,
240 | "metadata": {},
241 | "output_type": "execute_result"
242 | }
243 | ],
244 | "source": [
245 | "model_gb = ensemble.GradientBoostingRegressor(n_estimators=450, \n",
246 | " max_depth=2, \n",
247 | " subsample=0.8, \n",
248 | " learning_rate=0.01, \n",
249 | " random_state=0, \n",
250 | " max_features=0.2)\n",
251 | "modle0 = xgb.XGBRegressor(learning_rate=0.01, \n",
252 | " max_depth=3, \n",
253 | " colsample_bytree=0.2, \n",
254 | " subsample=0.8, \n",
255 | " seed=0, \n",
256 | " n_estimators=2100)\n",
257 | "modle1 = xgb.XGBRegressor(learning_rate=0.01, \n",
258 | " max_depth=3, \n",
259 | " colsample_bytree=0.3, \n",
260 | " subsample=0.8, \n",
261 | " seed=0, \n",
262 | " n_estimators=1600,\n",
263 | " min_child_weight=6)\n",
264 | "\n",
265 | "clf1 = lgb.LGBMRegressor(colsample_bytree=0.3,\n",
266 | " learning_rate=0.01, \n",
267 | " subsample=0.8, \n",
268 | " num_leaves=4, \n",
269 | " objective='regression', \n",
270 | " n_estimators=350, \n",
271 | " seed=0)\n",
272 | "base_model = [['xgb0', modle0],\n",
273 | " ['xgb1', modle1], \n",
274 | " ['gb', model_gb],\n",
275 | " ['lgb', clf1],]\n",
276 | "\n",
277 | "folds = list(KFold(len(x_train), n_folds=5, random_state=0))\n",
278 | "S_train = np.zeros((x_train.shape[0], len(base_model)))\n",
279 | "S_test = np.zeros((x_valid.shape[0], len(base_model))) \n",
280 | "for index, item in enumerate(base_model):\n",
281 | " print(\"the model\", index)\n",
282 | " clf = item[1]\n",
283 | " S_test_i = np.zeros((x_valid.shape[0], len(folds)))\n",
284 | " for j, (train_idx, test_idx) in enumerate(folds):\n",
285 | " X_train = x_train.ix[train_idx, :]\n",
286 | " X_valid = x_train.ix[test_idx, :]\n",
287 | " Y = y_train.ix[train_idx, :]\n",
288 | " clf.fit(X_train, Y['Y'])\n",
289 | " S_train[test_idx, index] = clf.predict(X_valid)\n",
290 | " S_test_i[:, j] = clf.predict(x_valid) \n",
291 | " S_test[:, index] = S_test_i.mean(1)\n",
292 | " \n",
293 | "linreg = linear_model.LinearRegression()\n",
294 | "linreg.fit(S_train, y_train)\n",
295 | "\n",
296 | "result = linreg.predict(S_test)\n",
297 | "evalerror(pd.DataFrame(result)[0], y_valid['Y'])"
298 | ]
299 | }
300 | ],
301 | "metadata": {
302 | "kernelspec": {
303 | "display_name": "Python 3",
304 | "language": "python",
305 | "name": "python3"
306 | },
307 | "language_info": {
308 | "codemirror_mode": {
309 | "name": "ipython",
310 | "version": 3
311 | },
312 | "file_extension": ".py",
313 | "mimetype": "text/x-python",
314 | "name": "python",
315 | "nbconvert_exporter": "python",
316 | "pygments_lexer": "ipython3",
317 | "version": "3.5.0"
318 | }
319 | },
320 | "nbformat": 4,
321 | "nbformat_minor": 2
322 | }
323 |
--------------------------------------------------------------------------------
/First/遗传算法.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 6,
6 | "metadata": {
7 | "collapsed": false,
8 | "deletable": true,
9 | "editable": true
10 | },
11 | "outputs": [],
12 | "source": [
13 | "import pandas as pd\n",
14 | "import numpy as np\n",
15 | "import xgboost as xgb\n",
16 | "import lightgbm as lgb\n",
17 | "from sklearn.model_selection import train_test_split\n",
18 | "from sklearn.cross_validation import KFold, StratifiedKFold, cross_val_score\n",
19 | "from sklearn.preprocessing import StandardScaler,LabelEncoder, OneHotEncoder, minmax_scale, scale\n",
20 | "from sklearn import tree\n",
21 | "from sklearn import linear_model\n",
22 | "from sklearn import svm\n",
23 | "from sklearn import neighbors\n",
24 | "from sklearn import ensemble\n",
25 | "from sklearn.feature_selection import SelectFromModel, VarianceThreshold"
26 | ]
27 | },
28 | {
29 | "cell_type": "code",
30 | "execution_count": 2,
31 | "metadata": {
32 | "collapsed": true,
33 | "deletable": true,
34 | "editable": true
35 | },
36 | "outputs": [],
37 | "source": [
38 | "def evalerror(y, y_pred):\n",
39 | " loss = np.sum(np.square(y - y_pred))\n",
40 | " n = len(y)\n",
41 | " return loss / n"
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": 3,
47 | "metadata": {
48 | "collapsed": false,
49 | "deletable": true,
50 | "editable": true
51 | },
52 | "outputs": [],
53 | "source": [
54 | "train = pd.read_csv('train/train.csv')\n",
55 | "test = pd.read_csv('train/test.csv')\n",
56 | "y = pd.read_csv('train/y.csv')"
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": 7,
62 | "metadata": {
63 | "collapsed": false,
64 | "deletable": true,
65 | "editable": true
66 | },
67 | "outputs": [
68 | {
69 | "name": "stderr",
70 | "output_type": "stream",
71 | "text": [
72 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\sklearn\\utils\\validation.py:526: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
73 | " y = column_or_1d(y, warn=True)\n"
74 | ]
75 | }
76 | ],
77 | "source": [
78 | "#GBDT特征候选集\n",
79 | "clf_gt = ensemble.GradientBoostingRegressor(max_depth=6, n_estimators=400, random_state=1)\n",
80 | "clf_gt.fit(train, y)\n",
81 | "model = SelectFromModel(clf_gt, prefit=True) \n",
82 | "train = pd.DataFrame(model.transform(train))\n",
83 | "test = pd.DataFrame(model.transform(test))"
84 | ]
85 | },
86 | {
87 | "cell_type": "code",
88 | "execution_count": 8,
89 | "metadata": {
90 | "collapsed": false,
91 | "deletable": true,
92 | "editable": true
93 | },
94 | "outputs": [],
95 | "source": [
96 | "#初始化种群\n",
97 | "def Init_Individual(feature):\n",
98 | " Individual = []\n",
99 | " for i in range(10):\n",
100 | " Gene = []\n",
101 | " for g in range(len(feature)):\n",
102 | " Gene.append(np.random.randint(0, 2))\n",
103 | " Individual.append(Gene)\n",
104 | " return np.array(Individual)\n",
105 | "\n",
106 | "\n",
107 | "#适应性函数\n",
108 | "def fitness(Individual, y, dataSet):\n",
109 | " lr = linear_model.LinearRegression()\n",
110 | " fit = []\n",
111 | " index = []\n",
112 | " gene = []\n",
113 | " for i in range(len(Individual)):\n",
114 | " Gene_sequence = pd.DataFrame(dataSet.columns, columns=['feature'])\n",
115 | " Gene_sequence['gene'] = Individual[i]\n",
116 | " Gene_sequence = list(Gene_sequence[Gene_sequence['gene'] == 1]['feature'])\n",
117 | " \n",
118 | " cv_model = cross_val_score(lr, dataSet[Gene_sequence], y, cv=10, scoring='neg_mean_squared_error')\n",
119 | " fit.append(0.1 - np.mean(np.abs(cv_model)))\n",
120 | " index.append(i)\n",
121 | " gene.append(Individual[i]) \n",
122 | " \n",
123 | " Ind_fitness = pd.DataFrame(fit, columns=['fintness'])\n",
124 | "# Ind_fitness['Indi_index'] = index\n",
125 | " Ind_fitness['Gene'] = gene \n",
126 | " return Ind_fitness\n",
127 | "\n",
128 | "\n",
129 | "#轮盘赌选择最优个体\n",
130 | "def Roulette_wheel(Fitness): \n",
131 | " sumFits = np.sum(Fitness['fintness'])\n",
132 | "\n",
133 | " rndPoint = np.random.uniform(0, sumFits)\n",
134 | " accumulator = 0.0\n",
135 | " for ind, val in enumerate(Fitness['fintness']):\n",
136 | " accumulator += val\n",
137 | " if accumulator >= rndPoint:\n",
138 | " return np.array(Fitness[Fitness['fintness'] == val].values)[0]\n",
139 | " \n",
140 | "#交叉算子\n",
141 | "def Crossover_operator(Individual):\n",
142 | " idx1 = np.random.randint(0, len(Individual))\n",
143 | " idx2 = np.random.randint(0, len(Individual))\n",
144 | " while idx2 == idx1: \n",
145 | " idx2 = np.random.randint(0, len(Individual)) \n",
146 | " \n",
147 | " Father_gene = Individual[Individual['Indi_index'] == idx1]['Indi_Gene'].values\n",
148 | " Mother_gene = Individual[Individual['Indi_index'] == idx2]['Indi_Gene'].values\n",
149 | " \n",
150 | " crossPos_A = np.random.randint(0, len(Father_gene[0]))\n",
151 | " crossPos_B = np.random.randint(0, len(Father_gene[0])) \n",
152 | "\n",
153 | " while crossPos_A == crossPos_B: \n",
154 | " crossPos_B = np.random.randint(0, len(Father_gene[0])) \n",
155 | "\n",
156 | " if crossPos_A > crossPos_B:\n",
157 | " crossPos_A, crossPos_B = crossPos_B, crossPos_A\n",
158 | " \n",
159 | " if crossPos_A < crossPos_B:\n",
160 | " temp = Father_gene[0][crossPos_A]\n",
161 | " Father_gene[0][crossPos_A] = Mother_gene[0][crossPos_A]\n",
162 | " Mother_gene[0][crossPos_A] = temp\n",
163 | " crossPos_A = crossPos_A + 1\n",
164 | " \n",
165 | " return Father_gene, Mother_gene \n",
166 | "\n",
167 | "#变异算子\n",
168 | "def Mutation_operator(Individual):\n",
169 | " MUTATION_RATE = 0.165\n",
170 | " for i in range(len(Individual)):\n",
171 | " mutatePos = np.random.randint(0, len(Individual['Indi_Gene'][i]))\n",
172 | " theta = np.random.random()\n",
173 | " if theta < MUTATION_RATE:\n",
174 | " if Individual['Indi_Gene'][i][mutatePos] == 0:\n",
175 | " Individual['Indi_Gene'][i][mutatePos] = 1\n",
176 | " else:\n",
177 | " Individual['Indi_Gene'][i][mutatePos] = 0\n",
178 | " return Individual"
179 | ]
180 | },
181 | {
182 | "cell_type": "code",
183 | "execution_count": 9,
184 | "metadata": {
185 | "collapsed": true,
186 | "deletable": true,
187 | "editable": true
188 | },
189 | "outputs": [],
190 | "source": [
191 | "#遗传算法\n",
192 | "def Genetic_algorithm(Individual, train, y, iterm):\n",
193 | " for i in range(iterm):\n",
194 | " print('第 %d 代' % i)\n",
195 | " fit = fitness(Individual, y, train)\n",
196 | " \n",
197 | " Roulette_gene = []\n",
198 | " index = []\n",
199 | " for i in range(len(Individual)):\n",
200 | " Roulette_gene.append(Roulette_wheel(fit))\n",
201 | " index.append(i)\n",
202 | " \n",
203 | " Choice_gene = pd.DataFrame(Roulette_gene, columns=['fintness', 'Indi_Gene'])\n",
204 | " Choice_gene['Indi_index'] = index\n",
205 | " Choice_gene['fintness'] = 0.1 - Choice_gene['fintness']\n",
206 | " Choice_gene = Choice_gene.sort_values(['fintness'])\n",
207 | " \n",
208 | " Cro_gene = []\n",
209 | " for i in range(5):\n",
210 | " gene1, gene2 = Crossover_operator(Choice_gene)\n",
211 | " Cro_gene.append(gene1)\n",
212 | " Cro_gene.append(gene2) \n",
213 | " \n",
214 | " Crossover_gene = pd.DataFrame(Cro_gene, columns=['Indi_Gene'])\n",
215 | " Crossover_gene['Indi_index'] = index\n",
216 | " \n",
217 | " New_gene = Mutation_operator(Crossover_gene)\n",
218 | " Individual = New_gene['Indi_Gene']\n",
219 | " fit['fintness'] = 0.1 - fit['fintness']\n",
220 | " return fit\n",
221 | " "
222 | ]
223 | },
224 | {
225 | "cell_type": "code",
226 | "execution_count": 18,
227 | "metadata": {
228 | "collapsed": false,
229 | "deletable": true,
230 | "editable": true
231 | },
232 | "outputs": [
233 | {
234 | "data": {
235 | "text/html": [
236 | "
\n",
237 | "
\n",
238 | " \n",
239 | " \n",
240 | " | \n",
241 | " fintness | \n",
242 | " Gene | \n",
243 | "
\n",
244 | " \n",
245 | " \n",
246 | " \n",
247 | " | 0 | \n",
248 | " 0.040204 | \n",
249 | " [0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, ... | \n",
250 | "
\n",
251 | " \n",
252 | " | 1 | \n",
253 | " 0.040662 | \n",
254 | " [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, ... | \n",
255 | "
\n",
256 | " \n",
257 | " | 2 | \n",
258 | " 0.040862 | \n",
259 | " [0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, ... | \n",
260 | "
\n",
261 | " \n",
262 | " | 3 | \n",
263 | " 0.041487 | \n",
264 | " [1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, ... | \n",
265 | "
\n",
266 | " \n",
267 | " | 4 | \n",
268 | " 0.042401 | \n",
269 | " [0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, ... | \n",
270 | "
\n",
271 | " \n",
272 | " | 5 | \n",
273 | " 0.042978 | \n",
274 | " [1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, ... | \n",
275 | "
\n",
276 | " \n",
277 | " | 6 | \n",
278 | " 0.043455 | \n",
279 | " [1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, ... | \n",
280 | "
\n",
281 | " \n",
282 | " | 7 | \n",
283 | " 0.043758 | \n",
284 | " [0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, ... | \n",
285 | "
\n",
286 | " \n",
287 | " | 8 | \n",
288 | " 0.044766 | \n",
289 | " [1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, ... | \n",
290 | "
\n",
291 | " \n",
292 | " | 9 | \n",
293 | " 0.045582 | \n",
294 | " [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, ... | \n",
295 | "
\n",
296 | " \n",
297 | "
\n",
298 | "
"
299 | ],
300 | "text/plain": [
301 | " fintness Gene\n",
302 | "0 0.040204 [0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, ...\n",
303 | "1 0.040662 [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, ...\n",
304 | "2 0.040862 [0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, ...\n",
305 | "3 0.041487 [1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, ...\n",
306 | "4 0.042401 [0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, ...\n",
307 | "5 0.042978 [1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, ...\n",
308 | "6 0.043455 [1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, ...\n",
309 | "7 0.043758 [0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, ...\n",
310 | "8 0.044766 [1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, ...\n",
311 | "9 0.045582 [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, ..."
312 | ]
313 | },
314 | "execution_count": 18,
315 | "metadata": {},
316 | "output_type": "execute_result"
317 | }
318 | ],
319 | "source": [
320 | "#初始化种群\n",
321 | "Individual = Init_Individual(train.columns)\n",
322 | "\n",
323 | "#计算每个个体的适应性\n",
324 | "fit = fitness(Individual, y, train)\n",
325 | "fit['fintness'] = 0.1 - fit['fintness']\n",
326 | "fit.sort_values(['fintness'], inplace=True)\n",
327 | "fit.reset_index(inplace=True, drop=['index'])\n",
328 | "fit"
329 | ]
330 | },
331 | {
332 | "cell_type": "code",
333 | "execution_count": 15,
334 | "metadata": {
335 | "collapsed": true
336 | },
337 | "outputs": [],
338 | "source": [
339 | "Gene_sequence = pd.DataFrame(train.columns, columns=['feature'])\n",
340 | "Gene_sequence['gene'] = fit['Gene'][0]\n",
341 | "Gene_sequence = list(Gene_sequence[Gene_sequence['gene'] == 1]['feature'])"
342 | ]
343 | },
344 | {
345 | "cell_type": "code",
346 | "execution_count": 14,
347 | "metadata": {
348 | "collapsed": false,
349 | "deletable": true,
350 | "editable": true
351 | },
352 | "outputs": [
353 | {
354 | "name": "stdout",
355 | "output_type": "stream",
356 | "text": [
357 | "0.0300686\n"
358 | ]
359 | }
360 | ],
361 | "source": [
362 | "def wmaeEval(preds, dtrain):\n",
363 | " label = dtrain.get_label()\n",
364 | " return 'error', np.sum(np.square(preds - label)) / len(label)\n",
365 | "param = {}\n",
366 | "param['eta'] = 0.01\n",
367 | "param['max_depth'] = 3\n",
368 | "\n",
369 | "param['subsample'] = 0.8\n",
370 | "param['colsample_bytree'] = 0.3\n",
371 | "num_round = 3300\n",
372 | "\n",
373 | "xgbTrain = xgb.DMatrix(train[Gene_sequence], label=y)\n",
374 | "modle = xgb.cv(param, xgbTrain, num_round, feval=wmaeEval, nfold=5)\n",
375 | "print(modle.iloc[-1, 0])"
376 | ]
377 | }
378 | ],
379 | "metadata": {
380 | "kernelspec": {
381 | "display_name": "Python 3",
382 | "language": "python",
383 | "name": "python3"
384 | },
385 | "language_info": {
386 | "codemirror_mode": {
387 | "name": "ipython",
388 | "version": 3
389 | },
390 | "file_extension": ".py",
391 | "mimetype": "text/x-python",
392 | "name": "python",
393 | "nbconvert_exporter": "python",
394 | "pygments_lexer": "ipython3",
395 | "version": "3.5.0"
396 | }
397 | },
398 | "nbformat": 4,
399 | "nbformat_minor": 2
400 | }
401 |
--------------------------------------------------------------------------------
/Second/复赛处理.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": false,
8 | "deletable": true,
9 | "editable": true
10 | },
11 | "outputs": [
12 | {
13 | "name": "stderr",
14 | "output_type": "stream",
15 | "text": [
16 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\sklearn\\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n",
17 | " \"This module will be removed in 0.20.\", DeprecationWarning)\n"
18 | ]
19 | }
20 | ],
21 | "source": [
22 | "import pandas as pd\n",
23 | "import numpy as np\n",
24 | "import xgboost as xgb\n",
25 | "import lightgbm as lgb\n",
26 | "from sklearn.model_selection import train_test_split\n",
27 | "from sklearn.cross_validation import KFold, StratifiedKFold, cross_val_score\n",
28 | "from datetime import datetime\n",
29 | "from sklearn.preprocessing import StandardScaler,LabelEncoder, OneHotEncoder, minmax_scale, scale,PolynomialFeatures\n",
30 | "from sklearn import tree\n",
31 | "from sklearn import linear_model\n",
32 | "from sklearn import svm\n",
33 | "from sklearn import neighbors\n",
34 | "from sklearn import ensemble\n",
35 | "from sklearn.feature_selection import SelectFromModel, VarianceThreshold,RFE\n",
36 | "from mlxtend.regressor import StackingRegressor\n",
37 | "from collections import defaultdict"
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": 4,
43 | "metadata": {
44 | "collapsed": false,
45 | "deletable": true,
46 | "editable": true
47 | },
48 | "outputs": [],
49 | "source": [
50 | "data_set1 = pd.read_excel('../复赛数据/训练_20180117.xlsx')\n",
51 | "data_set2 = pd.read_excel('../复赛数据/测试A_20180117.xlsx')\n",
52 | "y2 = pd.read_csv('../复赛数据/[new] fusai_answer_a_20180127.csv', names=['ID', 'Value'])\n",
53 | "\n",
54 | "\n",
55 | "submit = pd.read_csv('../复赛数据/answer_sample_b_20180117.csv', names=['id', 'Y'])\n",
56 | "test_set = pd.read_excel('../复赛数据/测试B_20180117.xlsx')\n",
57 | "\n",
58 | "data_set2 = pd.concat([data_set2, y2[['Value']]], axis=1)\n",
59 | "data_set = pd.concat([data_set1, data_set2], axis=0)\n",
60 | "\n",
61 | "\n",
62 | "y = data_set[['Value']]\n",
63 | "y.reset_index(inplace=True, drop=['index'])\n",
64 | "data_set = data_set.drop(['Value'], axis=1)\n",
65 | "Le = LabelEncoder()"
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": 4,
71 | "metadata": {
72 | "collapsed": false,
73 | "deletable": true,
74 | "editable": true
75 | },
76 | "outputs": [],
77 | "source": [
78 | "input_data = pd.concat([data_set, test_set], ignore_index=True)\n",
79 | "input_data = input_data.drop(['ID'], axis=1)"
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": 5,
85 | "metadata": {
86 | "collapsed": false,
87 | "deletable": true,
88 | "editable": true,
89 | "scrolled": true
90 | },
91 | "outputs": [
92 | {
93 | "data": {
94 | "text/plain": [
95 | "defaultdict(>,\n",
96 | " {'210': 228,\n",
97 | " '220': 179,\n",
98 | " '300': 21,\n",
99 | " '310': 170,\n",
100 | " '311': 184,\n",
101 | " '312': 679,\n",
102 | " '330': 889,\n",
103 | " '340': 139,\n",
104 | " '344': 398,\n",
105 | " '360': 1030,\n",
106 | " '400': 183,\n",
107 | " '420': 173,\n",
108 | " '440A': 213,\n",
109 | " '520': 382,\n",
110 | " '750': 1030,\n",
111 | " 'Chamber': 1,\n",
112 | " 'Chamber ID': 1,\n",
113 | " 'ERROR:#N/A': 1,\n",
114 | " 'ERROR:#N/A (#1)': 1,\n",
115 | " 'ERROR:#N/A (#2)': 1,\n",
116 | " 'ERROR:#N/A (#3)': 1,\n",
117 | " 'ERROR:#N/A_1': 1,\n",
118 | " 'ERROR:#N/A_1 (#1)': 1,\n",
119 | " 'ERROR:#N/A_1 (#2)': 1,\n",
120 | " 'ERROR:#N/A_10': 1,\n",
121 | " 'ERROR:#N/A_11': 1,\n",
122 | " 'ERROR:#N/A_12': 1,\n",
123 | " 'ERROR:#N/A_13': 1,\n",
124 | " 'ERROR:#N/A_14': 1,\n",
125 | " 'ERROR:#N/A_15': 1,\n",
126 | " 'ERROR:#N/A_16': 1,\n",
127 | " 'ERROR:#N/A_17': 1,\n",
128 | " 'ERROR:#N/A_18': 1,\n",
129 | " 'ERROR:#N/A_19': 1,\n",
130 | " 'ERROR:#N/A_2': 1,\n",
131 | " 'ERROR:#N/A_2 (#1)': 1,\n",
132 | " 'ERROR:#N/A_20': 1,\n",
133 | " 'ERROR:#N/A_21': 1,\n",
134 | " 'ERROR:#N/A_22': 1,\n",
135 | " 'ERROR:#N/A_23': 1,\n",
136 | " 'ERROR:#N/A_24': 1,\n",
137 | " 'ERROR:#N/A_25': 1,\n",
138 | " 'ERROR:#N/A_26': 1,\n",
139 | " 'ERROR:#N/A_27': 1,\n",
140 | " 'ERROR:#N/A_28': 1,\n",
141 | " 'ERROR:#N/A_29': 1,\n",
142 | " 'ERROR:#N/A_3': 1,\n",
143 | " 'ERROR:#N/A_3 (#1)': 1,\n",
144 | " 'ERROR:#N/A_30': 1,\n",
145 | " 'ERROR:#N/A_4': 1,\n",
146 | " 'ERROR:#N/A_4 (#1)': 1,\n",
147 | " 'ERROR:#N/A_5': 1,\n",
148 | " 'ERROR:#N/A_6': 1,\n",
149 | " 'ERROR:#N/A_7': 1,\n",
150 | " 'ERROR:#N/A_8': 1,\n",
151 | " 'ERROR:#N/A_9': 1,\n",
152 | " 'OPERATION_ID': 1,\n",
153 | " 'TOOL (#1)': 1,\n",
154 | " 'TOOL (#2)': 1,\n",
155 | " 'TOOL (#3)': 1,\n",
156 | " 'TOOL_ID': 1,\n",
157 | " 'Tool': 1,\n",
158 | " 'Tool (#1)': 1,\n",
159 | " 'Tool (#2)': 1,\n",
160 | " 'Tool (#3)': 1,\n",
161 | " 'Tool (#4)': 1,\n",
162 | " 'Tool (#5)': 1})"
163 | ]
164 | },
165 | "execution_count": 5,
166 | "metadata": {},
167 | "output_type": "execute_result"
168 | }
169 | ],
170 | "source": [
171 | "col = pd.DataFrame(input_data.ix[:, 2:].columns, columns=['col'])\n",
172 | "TOOL_ID_col_dict = defaultdict(lambda : 0)\n",
173 | "for index,row in col.iterrows():\n",
174 | " info=row.col.split('X')[0]\n",
175 | " TOOL_ID_col_dict[info] += 1\n",
176 | "\n",
177 | "TOOL_ID_col_dict"
178 | ]
179 | },
180 | {
181 | "cell_type": "code",
182 | "execution_count": 6,
183 | "metadata": {
184 | "collapsed": true,
185 | "deletable": true,
186 | "editable": true
187 | },
188 | "outputs": [],
189 | "source": [
190 | "#删除全空字段\n",
191 | "null_col = []\n",
192 | "for col in data_set.columns:\n",
193 | " if data_set[col].isnull().all() == True:\n",
194 | " null_col.append(col)\n",
195 | " \n",
196 | "input_data.drop(null_col, axis=1, inplace=True) \n",
197 | "input_data.drop(['220X150', '220X151'], axis=1, inplace=True) "
198 | ]
199 | },
200 | {
201 | "cell_type": "code",
202 | "execution_count": 7,
203 | "metadata": {
204 | "collapsed": false,
205 | "deletable": true,
206 | "editable": true
207 | },
208 | "outputs": [],
209 | "source": [
210 | "#转换obj字段\n",
211 | "obj_col = []\n",
212 | "for col in input_data.columns:\n",
213 | " if input_data[col].dtypes == object:\n",
214 | " obj_col.append(col)\n",
215 | " \n",
216 | "for i in range(len(obj_col)):\n",
217 | " Le.fit(input_data[obj_col[i]].unique())\n",
218 | " input_data[obj_col[i]] = Le.transform(input_data[obj_col[i]])"
219 | ]
220 | },
221 | {
222 | "cell_type": "code",
223 | "execution_count": 8,
224 | "metadata": {
225 | "collapsed": true,
226 | "deletable": true,
227 | "editable": true
228 | },
229 | "outputs": [],
230 | "source": [
231 | "#删除较大值\n",
232 | "def date_cols(data):\n",
233 | " for col in data:\n",
234 | " if data[col].min() > 1e13:\n",
235 | " data = data.drop([col], axis=1) \n",
236 | " return data\n",
237 | "\n",
238 | "input_data = date_cols(input_data)"
239 | ]
240 | },
241 | {
242 | "cell_type": "code",
243 | "execution_count": 9,
244 | "metadata": {
245 | "collapsed": true,
246 | "deletable": true,
247 | "editable": true
248 | },
249 | "outputs": [],
250 | "source": [
251 | "#drop相同列\n",
252 | "def drop_col(data):\n",
253 | " data = data.fillna(data.mean()) \n",
254 | " for line in data.columns:\n",
255 | " if len(data[line].unique()) == 1:\n",
256 | " data = data.drop([line], axis=1) \n",
257 | " return data\n",
258 | "\n",
259 | "input_data = drop_col(input_data)"
260 | ]
261 | },
262 | {
263 | "cell_type": "code",
264 | "execution_count": 10,
265 | "metadata": {
266 | "collapsed": false,
267 | "deletable": true,
268 | "editable": true
269 | },
270 | "outputs": [
271 | {
272 | "name": "stderr",
273 | "output_type": "stream",
274 | "text": [
275 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n",
276 | "A value is trying to be set on a copy of a slice from a DataFrame\n",
277 | "\n",
278 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n"
279 | ]
280 | }
281 | ],
282 | "source": [
283 | "#四分位数处理异常值\n",
284 | "def quantile_dropout(data): \n",
285 | " for c in data.columns:\n",
286 | " Q1 = data[c].quantile(q=0.25, interpolation='linear')\n",
287 | " Q3 = data[c].quantile(q=0.75, interpolation='linear') \n",
288 | " \n",
289 | " min_v = Q1 - 3 * (Q3 - Q1)\n",
290 | " max_v = Q3 + 3 * (Q3 - Q1)\n",
291 | " \n",
292 | " data[c][(data[c] >= max_v) | (data[c] <= min_v)] = data[c].mean()\n",
293 | " \n",
294 | " return data\n",
295 | "\n",
296 | "#异常值处理完再清晰一遍数据\n",
297 | "input_data = quantile_dropout(input_data)\n",
298 | "input_data.fillna(input_data.mean(), inplace=True)\n",
299 | "input_data = drop_col(input_data)"
300 | ]
301 | },
302 | {
303 | "cell_type": "code",
304 | "execution_count": 11,
305 | "metadata": {
306 | "collapsed": false,
307 | "deletable": true,
308 | "editable": true
309 | },
310 | "outputs": [],
311 | "source": [
312 | "#剔除小方差字段\n",
313 | "VT=VarianceThreshold(threshold=0.5)\n",
314 | "input_data = pd.DataFrame(VT.fit_transform(input_data))"
315 | ]
316 | },
317 | {
318 | "cell_type": "code",
319 | "execution_count": 12,
320 | "metadata": {
321 | "collapsed": false,
322 | "deletable": true,
323 | "editable": true
324 | },
325 | "outputs": [
326 | {
327 | "name": "stderr",
328 | "output_type": "stream",
329 | "text": [
330 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\sklearn\\utils\\validation.py:526: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
331 | " y = column_or_1d(y, warn=True)\n"
332 | ]
333 | }
334 | ],
335 | "source": [
336 | "train = input_data.ix[:1099, :]\n",
337 | "test = input_data.ix[1100:, :]\n",
338 | "clf_gt = ensemble.GradientBoostingRegressor(learning_rate=0.01, n_estimators=200, random_state=0)\n",
339 | "clf_gt.fit(train, y)\n",
340 | "\n",
341 | "model = SelectFromModel(clf_gt, prefit=True) \n",
342 | "train = pd.DataFrame(model.transform(train))\n",
343 | "test = pd.DataFrame(model.transform(test))"
344 | ]
345 | },
346 | {
347 | "cell_type": "code",
348 | "execution_count": 13,
349 | "metadata": {
350 | "collapsed": false,
351 | "deletable": true,
352 | "editable": true
353 | },
354 | "outputs": [
355 | {
356 | "name": "stdout",
357 | "output_type": "stream",
358 | "text": [
359 | "0.0105656\n"
360 | ]
361 | }
362 | ],
363 | "source": [
364 | "#单xgb模型线下\n",
365 | "def wmaeEval(preds, dtrain):\n",
366 | " label = dtrain.get_label()\n",
367 | " return 'error', np.sum(np.square(preds - label)) / len(label)\n",
368 | "param = {}\n",
369 | "param['eta'] = 0.01\n",
370 | "param['max_depth'] = 3\n",
371 | "param['subsample'] = 0.8\n",
372 | "param['colsample_bytree'] = 0.6\n",
373 | "param['seed'] = 0\n",
374 | "xgbTrain = xgb.DMatrix(train, label=y)\n",
375 | "modle = xgb.cv(param, xgbTrain, num_boost_round=9800, feval=wmaeEval, nfold=5)\n",
376 | "print(modle.iloc[-1, 0]) "
377 | ]
378 | },
379 | {
380 | "cell_type": "code",
381 | "execution_count": 14,
382 | "metadata": {
383 | "collapsed": false,
384 | "deletable": true,
385 | "editable": true
386 | },
387 | "outputs": [],
388 | "source": [
389 | "#单xgb模型线上\n",
390 | "param = {}\n",
391 | "param['eta'] = 0.01\n",
392 | "param['max_depth'] = 3\n",
393 | "param['subsample'] = 0.8\n",
394 | "param['colsample_bytree'] = 0.6\n",
395 | "param['seed'] = 0\n",
396 | "xgbTrain = xgb.DMatrix(train, label=y)\n",
397 | "xgbTest = xgb.DMatrix(test)\n",
398 | "modle = xgb.train(param, xgbTrain, num_boost_round=9200)"
399 | ]
400 | },
401 | {
402 | "cell_type": "code",
403 | "execution_count": null,
404 | "metadata": {
405 | "collapsed": false,
406 | "deletable": true,
407 | "editable": true
408 | },
409 | "outputs": [
410 | {
411 | "name": "stdout",
412 | "output_type": "stream",
413 | "text": [
414 | "the model 0\n"
415 | ]
416 | }
417 | ],
418 | "source": [
419 | "model_gb = ensemble.GradientBoostingRegressor(n_estimators=450, \n",
420 | " max_depth=2, \n",
421 | " subsample=0.8, \n",
422 | " learning_rate=0.01, \n",
423 | " random_state=2, \n",
424 | " max_features=0.2)\n",
425 | "modle0 = xgb.XGBRegressor(learning_rate=0.01, \n",
426 | " max_depth=3, \n",
427 | " colsample_bytree=0.6, \n",
428 | " subsample=0.8, \n",
429 | " seed=2, \n",
430 | " n_estimators=8800)\n",
431 | "modle1 = xgb.XGBRegressor(learning_rate=0.01, \n",
432 | " max_depth=3, \n",
433 | " colsample_bytree=0.6, \n",
434 | " subsample=0.8, \n",
435 | " seed=0, \n",
436 | " n_estimators=8800)\n",
437 | "\n",
438 | "clf1 = lgb.LGBMRegressor(colsample_bytree=0.3,\n",
439 | " learning_rate=0.01, \n",
440 | " subsample=0.8, \n",
441 | " num_leaves=4, \n",
442 | " objective='regression', \n",
443 | " n_estimators=1200, \n",
444 | " seed=2)\n",
445 | "base_model = [['xgb0', modle0],\n",
446 | " ['xgb1', modle1], \n",
447 | " ['gb', model_gb],\n",
448 | " ['lgb', clf1],]\n",
449 | "\n",
450 | "\n",
451 | "folds = list(KFold(len(train), n_folds=5, random_state=0))\n",
452 | "S_train = np.zeros((train.shape[0], len(base_model)))\n",
453 | "S_test = np.zeros((test.shape[0], len(base_model))) \n",
454 | "for index, item in enumerate(base_model):\n",
455 | " print(\"the model\", index)\n",
456 | " clf = item[1]\n",
457 | " S_test_i = np.zeros((test.shape[0], len(folds)))\n",
458 | " for j, (train_idx, test_idx) in enumerate(folds):\n",
459 | " X_train = train.ix[train_idx, :]\n",
460 | " X_valid = train.ix[test_idx, :]\n",
461 | " Y = y.ix[train_idx, :]\n",
462 | " clf.fit(X_train, Y['Value'])\n",
463 | " S_train[test_idx, index] = clf.predict(X_valid)\n",
464 | " S_test_i[:, j] = clf.predict(test) \n",
465 | " S_test[:, index] = S_test_i.mean(1)\n",
466 | " \n",
467 | "linreg = linear_model.LinearRegression()\n",
468 | "linreg.fit(S_train, y)\n",
469 | "result = linreg.predict(S_test)\n",
470 | "print('Done')"
471 | ]
472 | },
473 | {
474 | "cell_type": "code",
475 | "execution_count": 46,
476 | "metadata": {
477 | "collapsed": false,
478 | "deletable": true,
479 | "editable": true,
480 | "scrolled": true
481 | },
482 | "outputs": [
483 | {
484 | "data": {
485 | "text/html": [
486 | "\n",
487 | "
\n",
488 | " \n",
489 | " \n",
490 | " | \n",
491 | " stacking | \n",
492 | " singal | \n",
493 | " mean | \n",
494 | "
\n",
495 | " \n",
496 | " \n",
497 | " \n",
498 | " | 0 | \n",
499 | " 2.912807 | \n",
500 | " 2.889987 | \n",
501 | " 2.901397 | \n",
502 | "
\n",
503 | " \n",
504 | " | 1 | \n",
505 | " 3.007286 | \n",
506 | " 2.964874 | \n",
507 | " 2.986080 | \n",
508 | "
\n",
509 | " \n",
510 | " | 2 | \n",
511 | " 2.685882 | \n",
512 | " 2.662324 | \n",
513 | " 2.674103 | \n",
514 | "
\n",
515 | " \n",
516 | " | 3 | \n",
517 | " 2.806926 | \n",
518 | " 2.759750 | \n",
519 | " 2.783338 | \n",
520 | "
\n",
521 | " \n",
522 | " | 4 | \n",
523 | " 2.652038 | \n",
524 | " 2.619831 | \n",
525 | " 2.635935 | \n",
526 | "
\n",
527 | " \n",
528 | " | 5 | \n",
529 | " 2.787010 | \n",
530 | " 2.722117 | \n",
531 | " 2.754563 | \n",
532 | "
\n",
533 | " \n",
534 | " | 6 | \n",
535 | " 2.652562 | \n",
536 | " 2.627702 | \n",
537 | " 2.640132 | \n",
538 | "
\n",
539 | " \n",
540 | " | 7 | \n",
541 | " 2.588135 | \n",
542 | " 2.463275 | \n",
543 | " 2.525705 | \n",
544 | "
\n",
545 | " \n",
546 | " | 8 | \n",
547 | " 3.031880 | \n",
548 | " 2.980239 | \n",
549 | " 3.006060 | \n",
550 | "
\n",
551 | " \n",
552 | " | 9 | \n",
553 | " 2.666783 | \n",
554 | " 2.645448 | \n",
555 | " 2.656115 | \n",
556 | "
\n",
557 | " \n",
558 | " | 10 | \n",
559 | " 2.797045 | \n",
560 | " 2.746427 | \n",
561 | " 2.771736 | \n",
562 | "
\n",
563 | " \n",
564 | " | 11 | \n",
565 | " 2.821308 | \n",
566 | " 2.840287 | \n",
567 | " 2.830798 | \n",
568 | "
\n",
569 | " \n",
570 | " | 12 | \n",
571 | " 2.527795 | \n",
572 | " 2.521277 | \n",
573 | " 2.524536 | \n",
574 | "
\n",
575 | " \n",
576 | " | 13 | \n",
577 | " 2.738712 | \n",
578 | " 2.695673 | \n",
579 | " 2.717193 | \n",
580 | "
\n",
581 | " \n",
582 | " | 14 | \n",
583 | " 2.758436 | \n",
584 | " 2.706808 | \n",
585 | " 2.732622 | \n",
586 | "
\n",
587 | " \n",
588 | " | 15 | \n",
589 | " 2.944817 | \n",
590 | " 2.914038 | \n",
591 | " 2.929428 | \n",
592 | "
\n",
593 | " \n",
594 | " | 16 | \n",
595 | " 2.673143 | \n",
596 | " 2.621247 | \n",
597 | " 2.647195 | \n",
598 | "
\n",
599 | " \n",
600 | " | 17 | \n",
601 | " 2.925770 | \n",
602 | " 2.877609 | \n",
603 | " 2.901689 | \n",
604 | "
\n",
605 | " \n",
606 | " | 18 | \n",
607 | " 2.870120 | \n",
608 | " 2.816347 | \n",
609 | " 2.843233 | \n",
610 | "
\n",
611 | " \n",
612 | " | 19 | \n",
613 | " 3.268268 | \n",
614 | " 3.233553 | \n",
615 | " 3.250911 | \n",
616 | "
\n",
617 | " \n",
618 | " | 20 | \n",
619 | " 2.849246 | \n",
620 | " 2.805639 | \n",
621 | " 2.827443 | \n",
622 | "
\n",
623 | " \n",
624 | " | 21 | \n",
625 | " 3.080887 | \n",
626 | " 3.082602 | \n",
627 | " 3.081745 | \n",
628 | "
\n",
629 | " \n",
630 | " | 22 | \n",
631 | " 2.618728 | \n",
632 | " 2.594557 | \n",
633 | " 2.606642 | \n",
634 | "
\n",
635 | " \n",
636 | " | 23 | \n",
637 | " 2.772102 | \n",
638 | " 2.753884 | \n",
639 | " 2.762993 | \n",
640 | "
\n",
641 | " \n",
642 | " | 24 | \n",
643 | " 2.771248 | \n",
644 | " 2.759403 | \n",
645 | " 2.765326 | \n",
646 | "
\n",
647 | " \n",
648 | " | 25 | \n",
649 | " 2.991380 | \n",
650 | " 2.962934 | \n",
651 | " 2.977157 | \n",
652 | "
\n",
653 | " \n",
654 | " | 26 | \n",
655 | " 3.036091 | \n",
656 | " 2.966660 | \n",
657 | " 3.001376 | \n",
658 | "
\n",
659 | " \n",
660 | " | 27 | \n",
661 | " 2.953341 | \n",
662 | " 2.914098 | \n",
663 | " 2.933720 | \n",
664 | "
\n",
665 | " \n",
666 | " | 28 | \n",
667 | " 2.848419 | \n",
668 | " 2.797695 | \n",
669 | " 2.823057 | \n",
670 | "
\n",
671 | " \n",
672 | " | 29 | \n",
673 | " 2.725398 | \n",
674 | " 2.680045 | \n",
675 | " 2.702722 | \n",
676 | "
\n",
677 | " \n",
678 | " | ... | \n",
679 | " ... | \n",
680 | " ... | \n",
681 | " ... | \n",
682 | "
\n",
683 | " \n",
684 | " | 382 | \n",
685 | " 2.845396 | \n",
686 | " 2.803513 | \n",
687 | " 2.824454 | \n",
688 | "
\n",
689 | " \n",
690 | " | 383 | \n",
691 | " 2.960835 | \n",
692 | " 2.951946 | \n",
693 | " 2.956391 | \n",
694 | "
\n",
695 | " \n",
696 | " | 384 | \n",
697 | " 2.624246 | \n",
698 | " 2.585489 | \n",
699 | " 2.604867 | \n",
700 | "
\n",
701 | " \n",
702 | " | 385 | \n",
703 | " 2.880219 | \n",
704 | " 2.849117 | \n",
705 | " 2.864668 | \n",
706 | "
\n",
707 | " \n",
708 | " | 386 | \n",
709 | " 2.989834 | \n",
710 | " 2.992639 | \n",
711 | " 2.991237 | \n",
712 | "
\n",
713 | " \n",
714 | " | 387 | \n",
715 | " 2.696804 | \n",
716 | " 2.658472 | \n",
717 | " 2.677638 | \n",
718 | "
\n",
719 | " \n",
720 | " | 388 | \n",
721 | " 2.869157 | \n",
722 | " 2.806507 | \n",
723 | " 2.837832 | \n",
724 | "
\n",
725 | " \n",
726 | " | 389 | \n",
727 | " 3.033155 | \n",
728 | " 3.009951 | \n",
729 | " 3.021553 | \n",
730 | "
\n",
731 | " \n",
732 | " | 390 | \n",
733 | " 2.637270 | \n",
734 | " 2.612204 | \n",
735 | " 2.624737 | \n",
736 | "
\n",
737 | " \n",
738 | " | 391 | \n",
739 | " 2.920211 | \n",
740 | " 2.882556 | \n",
741 | " 2.901384 | \n",
742 | "
\n",
743 | " \n",
744 | " | 392 | \n",
745 | " 2.694041 | \n",
746 | " 2.669534 | \n",
747 | " 2.681787 | \n",
748 | "
\n",
749 | " \n",
750 | " | 393 | \n",
751 | " 2.718536 | \n",
752 | " 2.655436 | \n",
753 | " 2.686986 | \n",
754 | "
\n",
755 | " \n",
756 | " | 394 | \n",
757 | " 2.901037 | \n",
758 | " 2.893404 | \n",
759 | " 2.897221 | \n",
760 | "
\n",
761 | " \n",
762 | " | 395 | \n",
763 | " 2.544480 | \n",
764 | " 2.519438 | \n",
765 | " 2.531959 | \n",
766 | "
\n",
767 | " \n",
768 | " | 396 | \n",
769 | " 2.790338 | \n",
770 | " 2.772969 | \n",
771 | " 2.781654 | \n",
772 | "
\n",
773 | " \n",
774 | " | 397 | \n",
775 | " 2.940682 | \n",
776 | " 2.918911 | \n",
777 | " 2.929796 | \n",
778 | "
\n",
779 | " \n",
780 | " | 398 | \n",
781 | " 2.609586 | \n",
782 | " 2.586172 | \n",
783 | " 2.597879 | \n",
784 | "
\n",
785 | " \n",
786 | " | 399 | \n",
787 | " 2.833003 | \n",
788 | " 2.816421 | \n",
789 | " 2.824712 | \n",
790 | "
\n",
791 | " \n",
792 | " | 400 | \n",
793 | " 3.006545 | \n",
794 | " 2.980515 | \n",
795 | " 2.993530 | \n",
796 | "
\n",
797 | " \n",
798 | " | 401 | \n",
799 | " 2.646917 | \n",
800 | " 2.612457 | \n",
801 | " 2.629687 | \n",
802 | "
\n",
803 | " \n",
804 | " | 402 | \n",
805 | " 2.856515 | \n",
806 | " 2.821989 | \n",
807 | " 2.839252 | \n",
808 | "
\n",
809 | " \n",
810 | " | 403 | \n",
811 | " 2.722544 | \n",
812 | " 2.671485 | \n",
813 | " 2.697014 | \n",
814 | "
\n",
815 | " \n",
816 | " | 404 | \n",
817 | " 2.650129 | \n",
818 | " 2.598258 | \n",
819 | " 2.624193 | \n",
820 | "
\n",
821 | " \n",
822 | " | 405 | \n",
823 | " 2.789476 | \n",
824 | " 2.757772 | \n",
825 | " 2.773624 | \n",
826 | "
\n",
827 | " \n",
828 | " | 406 | \n",
829 | " 2.835194 | \n",
830 | " 2.779574 | \n",
831 | " 2.807384 | \n",
832 | "
\n",
833 | " \n",
834 | " | 407 | \n",
835 | " 2.592567 | \n",
836 | " 2.601666 | \n",
837 | " 2.597117 | \n",
838 | "
\n",
839 | " \n",
840 | " | 408 | \n",
841 | " 2.684609 | \n",
842 | " 2.662275 | \n",
843 | " 2.673442 | \n",
844 | "
\n",
845 | " \n",
846 | " | 409 | \n",
847 | " 2.790114 | \n",
848 | " 2.787076 | \n",
849 | " 2.788595 | \n",
850 | "
\n",
851 | " \n",
852 | " | 410 | \n",
853 | " 2.699389 | \n",
854 | " 2.692793 | \n",
855 | " 2.696091 | \n",
856 | "
\n",
857 | " \n",
858 | " | 411 | \n",
859 | " 2.680889 | \n",
860 | " 2.672264 | \n",
861 | " 2.676577 | \n",
862 | "
\n",
863 | " \n",
864 | "
\n",
865 | "
412 rows × 3 columns
\n",
866 | "
"
867 | ],
868 | "text/plain": [
869 | " stacking singal mean\n",
870 | "0 2.912807 2.889987 2.901397\n",
871 | "1 3.007286 2.964874 2.986080\n",
872 | "2 2.685882 2.662324 2.674103\n",
873 | "3 2.806926 2.759750 2.783338\n",
874 | "4 2.652038 2.619831 2.635935\n",
875 | "5 2.787010 2.722117 2.754563\n",
876 | "6 2.652562 2.627702 2.640132\n",
877 | "7 2.588135 2.463275 2.525705\n",
878 | "8 3.031880 2.980239 3.006060\n",
879 | "9 2.666783 2.645448 2.656115\n",
880 | "10 2.797045 2.746427 2.771736\n",
881 | "11 2.821308 2.840287 2.830798\n",
882 | "12 2.527795 2.521277 2.524536\n",
883 | "13 2.738712 2.695673 2.717193\n",
884 | "14 2.758436 2.706808 2.732622\n",
885 | "15 2.944817 2.914038 2.929428\n",
886 | "16 2.673143 2.621247 2.647195\n",
887 | "17 2.925770 2.877609 2.901689\n",
888 | "18 2.870120 2.816347 2.843233\n",
889 | "19 3.268268 3.233553 3.250911\n",
890 | "20 2.849246 2.805639 2.827443\n",
891 | "21 3.080887 3.082602 3.081745\n",
892 | "22 2.618728 2.594557 2.606642\n",
893 | "23 2.772102 2.753884 2.762993\n",
894 | "24 2.771248 2.759403 2.765326\n",
895 | "25 2.991380 2.962934 2.977157\n",
896 | "26 3.036091 2.966660 3.001376\n",
897 | "27 2.953341 2.914098 2.933720\n",
898 | "28 2.848419 2.797695 2.823057\n",
899 | "29 2.725398 2.680045 2.702722\n",
900 | ".. ... ... ...\n",
901 | "382 2.845396 2.803513 2.824454\n",
902 | "383 2.960835 2.951946 2.956391\n",
903 | "384 2.624246 2.585489 2.604867\n",
904 | "385 2.880219 2.849117 2.864668\n",
905 | "386 2.989834 2.992639 2.991237\n",
906 | "387 2.696804 2.658472 2.677638\n",
907 | "388 2.869157 2.806507 2.837832\n",
908 | "389 3.033155 3.009951 3.021553\n",
909 | "390 2.637270 2.612204 2.624737\n",
910 | "391 2.920211 2.882556 2.901384\n",
911 | "392 2.694041 2.669534 2.681787\n",
912 | "393 2.718536 2.655436 2.686986\n",
913 | "394 2.901037 2.893404 2.897221\n",
914 | "395 2.544480 2.519438 2.531959\n",
915 | "396 2.790338 2.772969 2.781654\n",
916 | "397 2.940682 2.918911 2.929796\n",
917 | "398 2.609586 2.586172 2.597879\n",
918 | "399 2.833003 2.816421 2.824712\n",
919 | "400 3.006545 2.980515 2.993530\n",
920 | "401 2.646917 2.612457 2.629687\n",
921 | "402 2.856515 2.821989 2.839252\n",
922 | "403 2.722544 2.671485 2.697014\n",
923 | "404 2.650129 2.598258 2.624193\n",
924 | "405 2.789476 2.757772 2.773624\n",
925 | "406 2.835194 2.779574 2.807384\n",
926 | "407 2.592567 2.601666 2.597117\n",
927 | "408 2.684609 2.662275 2.673442\n",
928 | "409 2.790114 2.787076 2.788595\n",
929 | "410 2.699389 2.692793 2.696091\n",
930 | "411 2.680889 2.672264 2.676577\n",
931 | "\n",
932 | "[412 rows x 3 columns]"
933 | ]
934 | },
935 | "execution_count": 46,
936 | "metadata": {},
937 | "output_type": "execute_result"
938 | }
939 | ],
940 | "source": [
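941 |     "# blend: stacking prediction, single-model xgb prediction, and their row-wise mean\n",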
941 | "fin = pd.DataFrame(result, columns=['stacking'])\n",
942 | "fin['singal'] = modle.predict(xgbTest)\n",
943 | "fin['mean'] = fin.mean(axis=1)"
944 | ]
945 | },
946 | {
947 | "cell_type": "code",
948 | "execution_count": null,
949 | "metadata": {
950 | "collapsed": true
951 | },
952 | "outputs": [],
953 | "source": [
954 | "submit['Y'] = fin['mean'].values "
955 | ]
956 | },
957 | {
958 | "cell_type": "code",
959 | "execution_count": null,
960 | "metadata": {
961 | "collapsed": true
962 | },
963 | "outputs": [],
964 | "source": [
965 | "submit.to_csv('sub{}.csv'.format(datetime.now().strftime('%Y%m%d_%H%M%S')), index=False, header=None)"
966 | ]
967 | }
968 | ],
969 | "metadata": {
970 | "kernelspec": {
971 | "display_name": "Python 3",
972 | "language": "python",
973 | "name": "python3"
974 | },
975 | "language_info": {
976 | "codemirror_mode": {
977 | "name": "ipython",
978 | "version": 3
979 | },
980 | "file_extension": ".py",
981 | "mimetype": "text/x-python",
982 | "name": "python",
983 | "nbconvert_exporter": "python",
984 | "pygments_lexer": "ipython3",
985 | "version": "3.5.0"
986 | }
987 | },
988 | "nbformat": 4,
989 | "nbformat_minor": 2
990 | }
991 |
--------------------------------------------------------------------------------
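Note on the closing cells of 线上提交.ipynb above: the notebook ends by averaging the Stacking prediction with the single-xgboost prediction and writing the blended mean out as the submitted Y. A minimal standalone sketch of that step, assuming `result` (the stacking predictions), `modle` (the fitted single xgb model), `xgbTest` (the test feature matrix) and `submit` (the submission ID frame) were built in the notebook's earlier cells:

```python
import pandas as pd
from datetime import datetime

def blend_and_submit(result, modle, xgbTest, submit):
    """Average the stacking and single-model predictions and write the submission file."""
    fin = pd.DataFrame(result, columns=['stacking'])
    fin['singal'] = modle.predict(xgbTest)   # single-model xgb prediction
    fin['mean'] = fin.mean(axis=1)           # row-wise mean of the two predictions

    submit['Y'] = fin['mean'].values         # the blended mean is what gets submitted
    submit.to_csv('sub{}.csv'.format(datetime.now().strftime('%Y%m%d_%H%M%S')),
                  index=False, header=None)
    return fin
```
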
/First/初赛处理.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 23,
6 | "metadata": {
7 | "collapsed": false,
8 | "deletable": true,
9 | "editable": true
10 | },
11 | "outputs": [],
12 | "source": [
13 | "import pandas as pd\n",
14 | "import numpy as np\n",
15 | "import gc\n",
16 | "import xgboost as xgb\n",
17 | "import lightgbm as lgb\n",
18 | "from sklearn.model_selection import train_test_split\n",
19 | "from sklearn.cross_validation import KFold, StratifiedKFold, cross_val_score\n",
20 | "from datetime import datetime\n",
21 | "from catboost import CatBoostRegressor\n",
22 | "from sklearn.preprocessing import StandardScaler,LabelEncoder, OneHotEncoder, minmax_scale, scale\n",
23 | "from sklearn import tree\n",
24 | "from sklearn.decomposition import PCA\n",
25 | "from sklearn import linear_model\n",
26 | "from sklearn import svm\n",
27 | "from sklearn import neighbors\n",
28 | "from sklearn import ensemble\n",
29 |     "from sklearn.feature_selection import SelectFromModel, VarianceThreshold,RFE\n",
30 |     "from sklearn.manifold import TSNE            # used by the Tsne helper below\n",
30 |     "from sklearn.cluster import KMeans           # used by the clu helper below\n",
30 |     "from sklearn.ensemble import IsolationForest # used by get_outFeature below\n",
30 |     "from minepy import MINE\n",
31 |     "from mlxtend.regressor import StackingRegressor\n",
32 |     "from collections import defaultdict"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 7,
38 | "metadata": {
39 | "collapsed": false,
40 | "deletable": true,
41 | "editable": true
42 | },
43 | "outputs": [],
44 | "source": [
45 | "data_set = pd.read_csv('../data/train.csv')\n",
46 | "test_set = pd.read_csv('../data/test.csv')\n",
47 | "Le = LabelEncoder()\n",
48 | "\n",
49 | "y = data_set[['Y']]\n",
50 | "data_set = data_set.drop(['Y'], axis=1)\n",
51 | "input_data = pd.concat([data_set, test_set], ignore_index=True)"
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": 8,
57 | "metadata": {
58 | "collapsed": false,
59 | "deletable": true,
60 | "editable": true
61 | },
62 | "outputs": [],
63 | "source": [
64 | "col = pd.DataFrame(input_data.ix[:, 2:].columns, columns=['col'])\n",
65 | "TOOL_ID_col_dict = defaultdict(lambda : 0)\n",
66 | "for index,row in col.iterrows():\n",
67 | " info=row.col.split('X')[0]\n",
68 | " TOOL_ID_col_dict[info] += 1\n",
69 | " \n",
70 | "TOOL_ID_col_dict"
71 | ]
72 | },
73 | {
74 | "cell_type": "code",
75 | "execution_count": 54,
76 | "metadata": {
77 | "collapsed": true,
78 | "deletable": true,
79 | "editable": true
80 | },
81 | "outputs": [],
82 | "source": [
83 |     "# drop columns with a single unique value (NaNs are filled with the column mean first)\n",
84 | "def drop_col(data):\n",
85 | " data = data.fillna(data.mean()) \n",
86 | " for line in data.columns:\n",
87 | " if len(data[line].unique()) == 1:\n",
88 | " data = data.drop([line], axis=1) \n",
89 | " return data"
90 | ]
91 | },
92 | {
93 | "cell_type": "code",
94 | "execution_count": 10,
95 | "metadata": {
96 | "collapsed": false,
97 | "deletable": true,
98 | "editable": true
99 | },
100 | "outputs": [],
101 | "source": [
102 |     "# handle outliers with the interquartile range: values outside Q1/Q3 ± 3*IQR are replaced by the column mean\n",
103 | "def quantile_dropout(data): \n",
104 | " for c in data.columns:\n",
105 | " Q1 = data[c].quantile(q=0.25, interpolation='linear')\n",
106 | " Q3 = data[c].quantile(q=0.75, interpolation='linear') \n",
107 | " \n",
108 | " min_v = Q1 - 3 * (Q3 - Q1)\n",
109 | " max_v = Q3 + 3 * (Q3 - Q1)\n",
110 | " \n",
111 | " data[c][(data[c] >= max_v) | (data[c] <= min_v)] = data[c].mean()\n",
112 | " \n",
113 | " return data"
114 | ]
115 | },
116 | {
117 | "cell_type": "code",
118 | "execution_count": 20,
119 | "metadata": {
120 | "collapsed": true,
121 | "deletable": true,
122 | "editable": true
123 | },
124 | "outputs": [],
125 | "source": [
126 |     "# standardization (zero mean, unit variance)\n",
127 | "def scala(data):\n",
128 | " # for col in data.columns: \n",
129 | " # data[col] = (data[col] - data[col].mean()) / data[col].std(ddof=0) \n",
130 | " # data = data.fillna(0) \n",
131 | " data = scale(data)\n",
132 | " return data"
133 | ]
134 | },
135 | {
136 | "cell_type": "code",
137 | "execution_count": 27,
138 | "metadata": {
139 | "collapsed": true,
140 | "deletable": true,
141 | "editable": true
142 | },
143 | "outputs": [],
144 | "source": [
145 | "#PCA\n",
146 | "def Pca(data):\n",
147 | " pca = PCA(n_components=0.9)\n",
148 | " pca.fit(data)\n",
149 | " X_new = pca.transform(data) \n",
150 | " return X_new"
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": 13,
156 | "metadata": {
157 | "collapsed": true,
158 | "deletable": true,
159 | "editable": true
160 | },
161 | "outputs": [],
162 | "source": [
163 | "#Tsne\n",
164 | "def Tsne(data):\n",
165 | " Tsne = TSNE(n_components=2)\n",
166 | " X_new = Tsne.fit_transform(data)\n",
167 | " return X_new"
168 | ]
169 | },
170 | {
171 | "cell_type": "code",
172 | "execution_count": 14,
173 | "metadata": {
174 | "collapsed": true,
175 | "deletable": true,
176 | "editable": true
177 | },
178 | "outputs": [],
179 | "source": [
180 |     "# clustering (KMeans, exploratory)\n",
181 | "def clu(data):\n",
182 | " km = KMeans(n_clusters=3).fit(data)\n",
183 | " clu = pd.DataFrame(km.predict(data), columns=['clu'])\n",
184 |     "    return clu.groupby(['clu']).clu.count()"
185 | ]
186 | },
187 | {
188 | "cell_type": "code",
189 | "execution_count": 15,
190 | "metadata": {
191 | "collapsed": true,
192 | "deletable": true,
193 | "editable": true
194 | },
195 | "outputs": [],
196 | "source": [
197 | "def date_cols(data):\n",
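198 |     "    # columns whose minimum value exceeds 1e13 look like date/timestamp fields and are dropped\n",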
198 | " for col in data:\n",
199 | " if data[col].min() > 1e13:\n",
200 | " data = data.drop([col], axis=1) \n",
201 | " return data"
202 | ]
203 | },
204 | {
205 | "cell_type": "code",
206 | "execution_count": 16,
207 | "metadata": {
208 | "collapsed": true,
209 | "deletable": true,
210 | "editable": true
211 | },
212 | "outputs": [],
213 | "source": [
214 |     "# use an isolation forest to pick out features with too many outliers; currently unused\n",
215 | "def get_outFeature(data):\n",
216 | " clf = IsolationForest(max_samples=20)\n",
217 | " outrate = []\n",
218 | " for col in data.columns: \n",
219 | " clf.fit(data[[col]])\n",
220 | " y_pred_train = clf.predict(data[[col]])\n",
221 | " \n",
222 | " values = data[[col]].values\n",
223 | " out = pd.DataFrame(values, columns=['columns'])\n",
224 | " out['y'] = y_pred_train\n",
225 | " \n",
226 | " outLine = len(out[out['y'] == -1])\n",
227 | " outRate = outLine / 600\n",
228 | " if outRate > 0.2:\n",
229 | " outrate.append(col)\n",
230 | " \n",
231 | " feature = [x for x in data.columns if x in outrate]\n",
232 | " train = pd.DataFrame(data[feature]).ix[:499, :]\n",
233 | " test = pd.DataFrame(data[feature]).ix[500:, :]\n",
234 | " return train, test"
235 | ]
236 | },
237 | {
238 | "cell_type": "markdown",
239 | "metadata": {
240 | "deletable": true,
241 | "editable": true
242 | },
243 | "source": [
244 |     "Process X210_"
245 | ]
246 | },
247 | {
248 | "cell_type": "code",
249 | "execution_count": 37,
250 | "metadata": {
251 | "collapsed": false,
252 | "deletable": true,
253 | "editable": true
254 | },
255 | "outputs": [
256 | {
257 | "name": "stderr",
258 | "output_type": "stream",
259 | "text": [
260 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n",
261 | "A value is trying to be set on a copy of a slice from a DataFrame\n",
262 | "\n",
263 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
264 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\sklearn\\preprocessing\\data.py:160: UserWarning: Numerical issues were encountered when centering the data and might not be solved. Dataset may contain too large values. You may need to prescale your features.\n",
265 | " warnings.warn(\"Numerical issues were encountered \"\n"
266 | ]
267 | }
268 | ],
269 | "source": [
270 | "X210_COL = input_data.loc[:, 'TOOL_ID':'210X231']\n",
271 | "Le.fit(X210_COL['TOOL_ID'].unique())\n",
272 | "X210_COL['TOOL_ID'] = Le.transform(X210_COL['TOOL_ID'])\n",
273 | "X210_COL = date_cols(X210_COL)\n",
274 | "X210_COL = drop_col(X210_COL)\n",
275 | "X210_COL = quantile_dropout(X210_COL)\n",
276 | "X210_COL = drop_col(X210_COL)\n",
277 | "X210_COL = X210_COL.transpose().drop_duplicates().transpose()\n",
278 | "X210_COL = scala(X210_COL)"
279 | ]
280 | },
281 | {
282 | "cell_type": "code",
283 | "execution_count": 38,
284 | "metadata": {
285 | "collapsed": false,
286 | "deletable": true,
287 | "editable": true
288 | },
289 | "outputs": [],
290 | "source": [
291 | "pca_X210 = Pca(X210_COL)\n",
292 | "pca_X210 = pd.DataFrame(pca_X210)\n",
293 | "pca_X210.reset_index(inplace=True)\n",
294 | "pca_X210.rename({'index' : 'ID'}, inplace=True)\n",
295 | "pca_X210.drop(['index'], axis=1, inplace=True)"
296 | ]
297 | },
298 | {
299 | "cell_type": "markdown",
300 | "metadata": {
301 | "deletable": true,
302 | "editable": true
303 | },
304 | "source": [
305 |     "Process X220_"
306 | ]
307 | },
308 | {
309 | "cell_type": "code",
310 | "execution_count": 39,
311 | "metadata": {
312 | "collapsed": false,
313 | "deletable": true,
314 | "editable": true
315 | },
316 | "outputs": [
317 | {
318 | "name": "stderr",
319 | "output_type": "stream",
320 | "text": [
321 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n",
322 | "A value is trying to be set on a copy of a slice from a DataFrame\n",
323 | "\n",
324 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n"
325 | ]
326 | }
327 | ],
328 | "source": [
329 | "X220_COL = input_data.loc[:, 'Tool':'220X571']\n",
330 | "Le.fit(X220_COL['Tool'].unique())\n",
331 | "X220_COL['Tool'] = Le.transform(X220_COL['Tool'])\n",
332 | "X220_COL = drop_col(X220_COL)\n",
333 | "X220_COL = date_cols(X220_COL)\n",
334 | "X220_COL = quantile_dropout(X220_COL)\n",
335 | "X220_COL = drop_col(X220_COL)\n",
336 | "X220_COL = X220_COL.transpose().drop_duplicates().transpose()\n",
337 | "X220_COL = scala(X220_COL)"
338 | ]
339 | },
340 | {
341 | "cell_type": "code",
342 | "execution_count": 40,
343 | "metadata": {
344 | "collapsed": false,
345 | "deletable": true,
346 | "editable": true
347 | },
348 | "outputs": [],
349 | "source": [
350 | "pca_X220 = Pca(X220_COL)\n",
351 | "pca_X220 = pd.DataFrame(pca_X220)\n",
352 | "pca_X220.reset_index(inplace=True)\n",
353 | "pca_X220.rename({'index' : 'ID'}, inplace=True)\n",
354 | "pca_X220.drop(['index'], axis=1, inplace=True)"
355 | ]
356 | },
357 | {
358 | "cell_type": "markdown",
359 | "metadata": {
360 | "deletable": true,
361 | "editable": true
362 | },
363 | "source": [
364 |     "Process X261_"
365 | ]
366 | },
367 | {
368 | "cell_type": "code",
369 | "execution_count": 41,
370 | "metadata": {
371 | "collapsed": false,
372 | "deletable": true,
373 | "editable": true
374 | },
375 | "outputs": [],
376 | "source": [
377 | "X261_COL = input_data.loc[:, '261X226':'261X763']\n",
378 | "X261_COL = drop_col(X261_COL)\n",
379 | "X261_COL = date_cols(X261_COL)\n",
380 | "X261_COL = quantile_dropout(X261_COL)\n",
381 | "X261_COL = drop_col(X261_COL)\n",
382 | "X261_COL = X261_COL.transpose().drop_duplicates().transpose()\n",
383 | "X261_COL = scala(X261_COL)"
384 | ]
385 | },
386 | {
387 | "cell_type": "code",
388 | "execution_count": 42,
389 | "metadata": {
390 | "collapsed": false,
391 | "deletable": true,
392 | "editable": true
393 | },
394 | "outputs": [],
395 | "source": [
396 | "pca_X261 = Pca(X261_COL)\n",
397 | "pca_X261 = pd.DataFrame(pca_X261)\n",
398 | "pca_X261.reset_index(inplace=True)\n",
399 | "pca_X261.rename({'index' : 'ID'}, inplace=True)\n",
400 | "pca_X261.drop(['index'], axis=1, inplace=True)"
401 | ]
402 | },
403 | {
404 | "cell_type": "markdown",
405 | "metadata": {
406 | "deletable": true,
407 | "editable": true
408 | },
409 | "source": [
410 |     "Process X300_"
411 | ]
412 | },
413 | {
414 | "cell_type": "code",
415 | "execution_count": 44,
416 | "metadata": {
417 | "collapsed": false,
418 | "deletable": true,
419 | "editable": true
420 | },
421 | "outputs": [
422 | {
423 | "name": "stderr",
424 | "output_type": "stream",
425 | "text": [
426 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\sklearn\\preprocessing\\data.py:160: UserWarning: Numerical issues were encountered when centering the data and might not be solved. Dataset may contain too large values. You may need to prescale your features.\n",
427 | " warnings.warn(\"Numerical issues were encountered \"\n"
428 | ]
429 | }
430 | ],
431 | "source": [
432 | "X300_COL = input_data.loc[:, 'TOOL_ID (#1)':'300X21']\n",
433 | "Le.fit(X300_COL['TOOL_ID (#1)'].unique())\n",
434 | "X300_COL['TOOL_ID (#1)'] = Le.transform(X300_COL['TOOL_ID (#1)'])\n",
435 | "X300_COL = drop_col(X300_COL)\n",
436 | "X300_COL = quantile_dropout(X300_COL.ix[:, 1:])\n",
437 | "X300_COL = drop_col(X300_COL)\n",
438 | "X300_COL = X300_COL.transpose().drop_duplicates().transpose()\n",
439 | "X300_COL = scala(X300_COL)"
440 | ]
441 | },
442 | {
443 | "cell_type": "code",
444 | "execution_count": 45,
445 | "metadata": {
446 | "collapsed": true,
447 | "deletable": true,
448 | "editable": true
449 | },
450 | "outputs": [],
451 | "source": [
452 | "pca_X300 = Pca(X300_COL)\n",
453 | "pca_X300 = pd.DataFrame(pca_X300)\n",
454 | "pca_X300.reset_index(inplace=True)\n",
455 | "pca_X300.rename({'index' : 'ID'}, inplace=True)\n",
456 | "pca_X300.drop(['index'], axis=1, inplace=True)"
457 | ]
458 | },
459 | {
460 | "cell_type": "markdown",
461 | "metadata": {
462 | "deletable": true,
463 | "editable": true
464 | },
465 | "source": [
466 |     "Process X310_"
467 | ]
468 | },
469 | {
470 | "cell_type": "code",
471 | "execution_count": 46,
472 | "metadata": {
473 | "collapsed": false,
474 | "deletable": true,
475 | "editable": true
476 | },
477 | "outputs": [
478 | {
479 | "name": "stderr",
480 | "output_type": "stream",
481 | "text": [
482 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n",
483 | "A value is trying to be set on a copy of a slice from a DataFrame\n",
484 | "\n",
485 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n"
486 | ]
487 | }
488 | ],
489 | "source": [
490 | "X310_COL = input_data.loc[:, 'TOOL_ID (#2)':'310X207']\n",
491 | "Le.fit(X310_COL['TOOL_ID (#2)'].unique())\n",
492 | "X310_COL['TOOL_ID (#2)'] = Le.transform(X310_COL['TOOL_ID (#2)'])\n",
493 | "X310_COL = drop_col(X310_COL)\n",
494 | "X310_COL = date_cols(X310_COL)\n",
495 | "X310_COL = quantile_dropout(X310_COL)\n",
496 | "X310_COL = drop_col(X310_COL)\n",
497 | "X310_COL = X310_COL.transpose().drop_duplicates().transpose()\n",
498 | "X310_COL = scala(X310_COL)"
499 | ]
500 | },
501 | {
502 | "cell_type": "code",
503 | "execution_count": 47,
504 | "metadata": {
505 | "collapsed": false,
506 | "deletable": true,
507 | "editable": true
508 | },
509 | "outputs": [],
510 | "source": [
511 | "pca_X310 = Pca(X310_COL)\n",
512 | "pca_X310 = pd.DataFrame(pca_X310)\n",
513 | "pca_X310.reset_index(inplace=True)\n",
514 | "pca_X310.rename({'index' : 'ID'}, inplace=True)\n",
515 | "pca_X310.drop(['index'], axis=1, inplace=True)"
516 | ]
517 | },
518 | {
519 | "cell_type": "markdown",
520 | "metadata": {
521 | "deletable": true,
522 | "editable": true
523 | },
524 | "source": [
525 |     "Process X311_"
526 | ]
527 | },
528 | {
529 | "cell_type": "code",
530 | "execution_count": 56,
531 | "metadata": {
532 | "collapsed": false,
533 | "deletable": true,
534 | "editable": true
535 | },
536 | "outputs": [
537 | {
538 | "name": "stderr",
539 | "output_type": "stream",
540 | "text": [
541 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n",
542 | "A value is trying to be set on a copy of a slice from a DataFrame\n",
543 | "\n",
544 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n"
545 | ]
546 | }
547 | ],
548 | "source": [
549 | "X311_COL = input_data.loc[:, 'TOOL_ID (#3)':'311X225']\n",
550 | "Le.fit(X311_COL['TOOL_ID (#3)'].unique())\n",
551 | "X311_COL['TOOL_ID (#3)'] = Le.transform(X311_COL['TOOL_ID (#3)'])\n",
552 | "X311_COL = drop_col(X311_COL)\n",
553 | "X311_COL = date_cols(X311_COL)\n",
554 | "X311_COL = quantile_dropout(X311_COL)\n",
555 | "X311_COL = drop_col(X311_COL)\n",
556 | "X311_COL = X311_COL.transpose().drop_duplicates().transpose()\n",
557 | "X311_COL = scala(X311_COL)"
558 | ]
559 | },
560 | {
561 | "cell_type": "code",
562 | "execution_count": 57,
563 | "metadata": {
564 | "collapsed": false,
565 | "deletable": true,
566 | "editable": true
567 | },
568 | "outputs": [],
569 | "source": [
570 | "pca_X311 = Pca(X311_COL)\n",
571 | "pca_X311 = pd.DataFrame(pca_X311)\n",
572 | "pca_X311.reset_index(inplace=True)\n",
573 | "pca_X311.rename({'index' : 'ID'}, inplace=True)\n",
574 | "pca_X311.drop(['index'], axis=1, inplace=True)"
575 | ]
576 | },
577 | {
578 | "cell_type": "markdown",
579 | "metadata": {
580 | "deletable": true,
581 | "editable": true
582 | },
583 | "source": [
584 |     "Process X312_"
585 | ]
586 | },
587 | {
588 | "cell_type": "code",
589 | "execution_count": 58,
590 | "metadata": {
591 | "collapsed": false,
592 | "deletable": true,
593 | "editable": true
594 | },
595 | "outputs": [
596 | {
597 | "name": "stderr",
598 | "output_type": "stream",
599 | "text": [
600 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n",
601 | "A value is trying to be set on a copy of a slice from a DataFrame\n",
602 | "\n",
603 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
604 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\sklearn\\preprocessing\\data.py:177: UserWarning: Numerical issues were encountered when scaling the data and might not be solved. The standard deviation of the data is probably very close to 0. \n",
605 | " warnings.warn(\"Numerical issues were encountered \"\n"
606 | ]
607 | }
608 | ],
609 | "source": [
610 | "X312_COL = input_data.loc[:, 'Tool (#1)':'312X798']\n",
611 | "Le.fit(X312_COL['Tool (#1)'].unique())\n",
612 | "X312_COL['Tool (#1)'] = Le.transform(X312_COL['Tool (#1)'])\n",
613 | "X312_COL = drop_col(X312_COL)\n",
614 | "X312_COL = date_cols(X312_COL)\n",
615 | "X312_COL = quantile_dropout(X312_COL)\n",
616 | "X312_COL = drop_col(X312_COL)\n",
617 | "X312_COL = X312_COL.transpose().drop_duplicates().transpose()\n",
618 | "X312_COL = scala(X312_COL)"
619 | ]
620 | },
621 | {
622 | "cell_type": "code",
623 | "execution_count": 59,
624 | "metadata": {
625 | "collapsed": false,
626 | "deletable": true,
627 | "editable": true
628 | },
629 | "outputs": [],
630 | "source": [
631 | "pca_X312 = Pca(X312_COL)\n",
632 | "pca_X312 = pd.DataFrame(pca_X312)\n",
633 | "pca_X312.reset_index(inplace=True)\n",
634 | "pca_X312.rename({'index' : 'ID'}, inplace=True)\n",
635 | "pca_X312.drop(['index'], axis=1, inplace=True)"
636 | ]
637 | },
638 | {
639 | "cell_type": "markdown",
640 | "metadata": {
641 | "deletable": true,
642 | "editable": true
643 | },
644 | "source": [
645 |     "Process X330_"
646 | ]
647 | },
648 | {
649 | "cell_type": "code",
650 | "execution_count": 60,
651 | "metadata": {
652 | "collapsed": false,
653 | "deletable": true,
654 | "editable": true
655 | },
656 | "outputs": [
657 | {
658 | "name": "stderr",
659 | "output_type": "stream",
660 | "text": [
661 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n",
662 | "A value is trying to be set on a copy of a slice from a DataFrame\n",
663 | "\n",
664 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n"
665 | ]
666 | }
667 | ],
668 | "source": [
669 | "X330_COL = input_data.loc[:, 'Tool (#2)':'330X1311']\n",
670 | "Le.fit(X330_COL['Tool (#2)'].unique())\n",
671 | "X330_COL['Tool (#2)'] = Le.transform(X330_COL['Tool (#2)'])\n",
672 | "X330_COL = drop_col(X330_COL)\n",
673 | "X330_COL = date_cols(X330_COL)\n",
674 | "X330_COL = quantile_dropout(X330_COL)\n",
675 | "X330_COL = drop_col(X330_COL)\n",
676 | "X330_COL = X330_COL.transpose().drop_duplicates().transpose()\n",
677 | "X330_COL = scala(X330_COL)"
678 | ]
679 | },
680 | {
681 | "cell_type": "code",
682 | "execution_count": 61,
683 | "metadata": {
684 | "collapsed": false,
685 | "deletable": true,
686 | "editable": true
687 | },
688 | "outputs": [],
689 | "source": [
690 | "pca_X330 = Pca(X330_COL)\n",
691 | "pca_X330 = pd.DataFrame(pca_X330)\n",
692 | "pca_X330.reset_index(inplace=True)\n",
693 | "pca_X330.rename({'index' : 'ID'}, inplace=True)\n",
694 | "pca_X330.drop(['index'], axis=1, inplace=True)"
695 | ]
696 | },
697 | {
698 | "cell_type": "markdown",
699 | "metadata": {
700 | "deletable": true,
701 | "editable": true
702 | },
703 | "source": [
704 |     "Process X340_"
705 | ]
706 | },
707 | {
708 | "cell_type": "code",
709 | "execution_count": 62,
710 | "metadata": {
711 | "collapsed": false,
712 | "deletable": true,
713 | "editable": true
714 | },
715 | "outputs": [
716 | {
717 | "name": "stderr",
718 | "output_type": "stream",
719 | "text": [
720 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n",
721 | "A value is trying to be set on a copy of a slice from a DataFrame\n",
722 | "\n",
723 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n"
724 | ]
725 | }
726 | ],
727 | "source": [
728 | "X340_COL = input_data.loc[:, 'tool':'340X199']\n",
729 | "Le.fit(X340_COL['tool'].unique())\n",
730 | "X340_COL['tool'] = Le.transform(X340_COL['tool'])\n",
731 | "X340_COL = drop_col(X340_COL)\n",
732 | "X340_COL = date_cols(X340_COL)\n",
733 | "X340_COL = quantile_dropout(X340_COL)\n",
734 | "X340_COL = drop_col(X340_COL)\n",
735 | "X340_COL = X340_COL.transpose().drop_duplicates().transpose()\n",
736 | "X340_COL = scala(X340_COL)"
737 | ]
738 | },
739 | {
740 | "cell_type": "code",
741 | "execution_count": 63,
742 | "metadata": {
743 | "collapsed": false,
744 | "deletable": true,
745 | "editable": true
746 | },
747 | "outputs": [],
748 | "source": [
749 | "pca_X340 = Pca(X340_COL)\n",
750 | "pca_X340 = pd.DataFrame(pca_X340)\n",
751 | "pca_X340.reset_index(inplace=True)\n",
752 | "pca_X340.rename({'index' : 'ID'}, inplace=True)\n",
753 | "pca_X340.drop(['index'], axis=1, inplace=True)"
754 | ]
755 | },
756 | {
757 | "cell_type": "markdown",
758 | "metadata": {
759 | "deletable": true,
760 | "editable": true
761 | },
762 | "source": [
763 |     "Process X344_"
764 | ]
765 | },
766 | {
767 | "cell_type": "code",
768 | "execution_count": 64,
769 | "metadata": {
770 | "collapsed": false,
771 | "deletable": true,
772 | "editable": true
773 | },
774 | "outputs": [
775 | {
776 | "name": "stderr",
777 | "output_type": "stream",
778 | "text": [
779 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n",
780 | "A value is trying to be set on a copy of a slice from a DataFrame\n",
781 | "\n",
782 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n"
783 | ]
784 | }
785 | ],
786 | "source": [
787 | "X344_COL = input_data.loc[:, 'tool (#1)':'344X398']\n",
788 | "Le.fit(X344_COL['tool (#1)'].unique())\n",
789 | "X344_COL['tool (#1)'] = Le.transform(X344_COL['tool (#1)'])\n",
790 | "X344_COL = drop_col(X344_COL)\n",
791 | "X344_COL = date_cols(X344_COL)\n",
792 | "X344_COL = quantile_dropout(X344_COL)\n",
793 | "X344_COL = drop_col(X344_COL)\n",
794 | "X344_COL = X344_COL.transpose().drop_duplicates().transpose()\n",
795 | "X344_COL = scala(X344_COL)"
796 | ]
797 | },
798 | {
799 | "cell_type": "code",
800 | "execution_count": 65,
801 | "metadata": {
802 | "collapsed": false,
803 | "deletable": true,
804 | "editable": true
805 | },
806 | "outputs": [],
807 | "source": [
808 | "pca_X344 = Pca(X344_COL)\n",
809 | "pca_X344 = pd.DataFrame(pca_X344)\n",
810 | "pca_X344.reset_index(inplace=True)\n",
811 | "pca_X344.rename({'index' : 'ID'}, inplace=True)\n",
812 | "pca_X344.drop(['index'], axis=1, inplace=True)"
813 | ]
814 | },
815 | {
816 | "cell_type": "markdown",
817 | "metadata": {
818 | "deletable": true,
819 | "editable": true
820 | },
821 | "source": [
822 |     "Process X360_"
823 | ]
824 | },
825 | {
826 | "cell_type": "code",
827 | "execution_count": 66,
828 | "metadata": {
829 | "collapsed": false,
830 | "deletable": true,
831 | "editable": true
832 | },
833 | "outputs": [
834 | {
835 | "name": "stderr",
836 | "output_type": "stream",
837 | "text": [
838 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n",
839 | "A value is trying to be set on a copy of a slice from a DataFrame\n",
840 | "\n",
841 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n"
842 | ]
843 | }
844 | ],
845 | "source": [
846 | "X360_COL = input_data.loc[:, 'TOOL':'360X1452']\n",
847 | "Le.fit(X360_COL['TOOL'].unique())\n",
848 | "X360_COL['TOOL'] = Le.transform(X360_COL['TOOL'])\n",
849 | "X360_COL = drop_col(X360_COL)\n",
850 | "X360_COL = date_cols(X360_COL)\n",
851 | "X360_COL = quantile_dropout(X360_COL)\n",
852 | "X360_COL = drop_col(X360_COL)\n",
853 | "X360_COL = X360_COL.transpose().drop_duplicates().transpose()\n",
854 | "X360_COL = scala(X360_COL)"
855 | ]
856 | },
857 | {
858 | "cell_type": "code",
859 | "execution_count": 67,
860 | "metadata": {
861 | "collapsed": false,
862 | "deletable": true,
863 | "editable": true
864 | },
865 | "outputs": [],
866 | "source": [
867 | "pca_X360 = Pca(X360_COL)\n",
868 | "pca_X360 = pd.DataFrame(pca_X360)\n",
869 | "pca_X360.reset_index(inplace=True)\n",
870 | "pca_X360.rename({'index' : 'ID'}, inplace=True)\n",
871 | "pca_X360.drop(['index'], axis=1, inplace=True)"
872 | ]
873 | },
874 | {
875 | "cell_type": "markdown",
876 | "metadata": {
877 | "deletable": true,
878 | "editable": true
879 | },
880 | "source": [
881 |     "Process X400"
882 | ]
883 | },
884 | {
885 | "cell_type": "code",
886 | "execution_count": 68,
887 | "metadata": {
888 | "collapsed": false,
889 | "deletable": true,
890 | "editable": true
891 | },
892 | "outputs": [],
893 | "source": [
894 | "X400_COL = input_data.loc[:, '400X1':'400X230']\n",
895 | "X400_COL = drop_col(X400_COL)\n",
896 | "X400_COL = date_cols(X400_COL)\n",
897 | "X400_COL = quantile_dropout(X400_COL)\n",
898 | "X400_COL = drop_col(X400_COL)\n",
899 | "X400_COL = X400_COL.transpose().drop_duplicates().transpose()\n",
900 | "X400_COL = scala(X400_COL)"
901 | ]
902 | },
903 | {
904 | "cell_type": "code",
905 | "execution_count": 69,
906 | "metadata": {
907 | "collapsed": false,
908 | "deletable": true,
909 | "editable": true
910 | },
911 | "outputs": [],
912 | "source": [
913 | "pca_X400 = Pca(X400_COL)\n",
914 | "pca_X400 = pd.DataFrame(pca_X400)\n",
915 | "pca_X400.reset_index(inplace=True)\n",
916 | "pca_X400.rename({'index' : 'ID'}, inplace=True)\n",
917 | "pca_X400.drop(['index'], axis=1, inplace=True)"
918 | ]
919 | },
920 | {
921 | "cell_type": "markdown",
922 | "metadata": {
923 | "deletable": true,
924 | "editable": true
925 | },
926 | "source": [
927 |     "Process X420"
928 | ]
929 | },
930 | {
931 | "cell_type": "code",
932 | "execution_count": 70,
933 | "metadata": {
934 | "collapsed": false,
935 | "deletable": true,
936 | "editable": true
937 | },
938 | "outputs": [],
939 | "source": [
940 | "X420_COL = input_data.loc[:, '420X1':'420X230']\n",
941 | "X420_COL = drop_col(X420_COL)\n",
942 | "X420_COL = date_cols(X420_COL)\n",
943 | "X420_COL = quantile_dropout(X420_COL)\n",
944 | "\n",
945 | "X420_COL = drop_col(X420_COL)\n",
946 | "X420_COL = X420_COL.transpose().drop_duplicates().transpose()\n",
947 | "X420_COL = scala(X420_COL)"
948 | ]
949 | },
950 | {
951 | "cell_type": "code",
952 | "execution_count": 71,
953 | "metadata": {
954 | "collapsed": false,
955 | "deletable": true,
956 | "editable": true
957 | },
958 | "outputs": [],
959 | "source": [
960 | "pca_X420 = Pca(X420_COL)\n",
961 | "pca_X420 = pd.DataFrame(pca_X420)\n",
962 | "pca_X420.reset_index(inplace=True)\n",
963 | "pca_X420.rename({'index' : 'ID'}, inplace=True)\n",
964 | "pca_X420.drop(['index'], axis=1, inplace=True)"
965 | ]
966 | },
967 | {
968 | "cell_type": "markdown",
969 | "metadata": {
970 | "deletable": true,
971 | "editable": true
972 | },
973 | "source": [
974 |     "Process X440A"
975 | ]
976 | },
977 | {
978 | "cell_type": "code",
979 | "execution_count": 72,
980 | "metadata": {
981 | "collapsed": false,
982 | "deletable": true,
983 | "editable": true
984 | },
985 | "outputs": [
986 | {
987 | "name": "stderr",
988 | "output_type": "stream",
989 | "text": [
990 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n",
991 | "A value is trying to be set on a copy of a slice from a DataFrame\n",
992 | "\n",
993 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n"
994 | ]
995 | }
996 | ],
997 | "source": [
998 | "X440A_COL = input_data.loc[:, 'TOOL (#1)':'440AX213']\n",
999 | "Le.fit(X440A_COL['TOOL (#1)'].unique())\n",
1000 | "X440A_COL['TOOL (#1)'] = Le.transform(X440A_COL['TOOL (#1)'])\n",
1001 | "X440A_COL = drop_col(X440A_COL)\n",
1002 | "X440A_COL = date_cols(X440A_COL)\n",
1003 | "X440A_COL = quantile_dropout(X440A_COL)\n",
1004 | "X440A_COL = drop_col(X440A_COL)\n",
1005 | "X440A_COL = X440A_COL.transpose().drop_duplicates().transpose()\n",
1006 | "X440A_COL = scala(X440A_COL)"
1007 | ]
1008 | },
1009 | {
1010 | "cell_type": "code",
1011 | "execution_count": 73,
1012 | "metadata": {
1013 | "collapsed": false,
1014 | "deletable": true,
1015 | "editable": true
1016 | },
1017 | "outputs": [],
1018 | "source": [
1019 | "pca_X440A = Pca(X440A_COL)\n",
1020 | "pca_X440A = pd.DataFrame(pca_X440A)\n",
1021 | "pca_X440A.reset_index(inplace=True)\n",
1022 | "pca_X440A.rename({'index' : 'ID'}, inplace=True)\n",
1023 | "pca_X440A.drop(['index'], axis=1, inplace=True)"
1024 | ]
1025 | },
1026 | {
1027 | "cell_type": "markdown",
1028 | "metadata": {
1029 | "deletable": true,
1030 | "editable": true
1031 | },
1032 | "source": [
1033 |     "Process X520"
1034 | ]
1035 | },
1036 | {
1037 | "cell_type": "code",
1038 | "execution_count": 74,
1039 | "metadata": {
1040 | "collapsed": false,
1041 | "deletable": true,
1042 | "editable": true
1043 | },
1044 | "outputs": [
1045 | {
1046 | "name": "stderr",
1047 | "output_type": "stream",
1048 | "text": [
1049 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n",
1050 | "A value is trying to be set on a copy of a slice from a DataFrame\n",
1051 | "\n",
1052 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n"
1053 | ]
1054 | }
1055 | ],
1056 | "source": [
1057 | "X520_COL = input_data.loc[:, 'Tool (#3)':'520X434']\n",
1058 | "Le.fit(X520_COL['Tool (#3)'].unique())\n",
1059 | "X520_COL['Tool (#3)'] = Le.transform(X520_COL['Tool (#3)'])\n",
1060 | "X520_COL = drop_col(X520_COL)\n",
1061 | "X520_COL = date_cols(X520_COL)\n",
1062 | "X520_COL = quantile_dropout(X520_COL)\n",
1063 | "X520_COL = drop_col(X520_COL)\n",
1064 | "X520_COL = X520_COL.transpose().drop_duplicates().transpose()\n",
1065 | "X520_COL = scala(X520_COL)"
1066 | ]
1067 | },
1068 | {
1069 | "cell_type": "code",
1070 | "execution_count": 75,
1071 | "metadata": {
1072 | "collapsed": false,
1073 | "deletable": true,
1074 | "editable": true
1075 | },
1076 | "outputs": [],
1077 | "source": [
1078 | "pca_X520 = Pca(X520_COL)\n",
1079 | "pca_X520 = pd.DataFrame(pca_X520)\n",
1080 | "pca_X520.reset_index(inplace=True)\n",
1081 | "pca_X520.rename({'index' : 'ID'}, inplace=True)\n",
1082 | "pca_X520.drop(['index'], axis=1, inplace=True)"
1083 | ]
1084 | },
1085 | {
1086 | "cell_type": "markdown",
1087 | "metadata": {
1088 | "deletable": true,
1089 | "editable": true
1090 | },
1091 | "source": [
1092 |     "Process X750"
1093 | ]
1094 | },
1095 | {
1096 | "cell_type": "code",
1097 | "execution_count": 76,
1098 | "metadata": {
1099 | "collapsed": false,
1100 | "deletable": true,
1101 | "editable": true
1102 | },
1103 | "outputs": [
1104 | {
1105 | "name": "stderr",
1106 | "output_type": "stream",
1107 | "text": [
1108 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n",
1109 | "A value is trying to be set on a copy of a slice from a DataFrame\n",
1110 | "\n",
1111 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n"
1112 | ]
1113 | }
1114 | ],
1115 | "source": [
1116 | "X750_COL = input_data.loc[:, 'TOOL (#2)':'750X1452']\n",
1117 | "Le.fit(X750_COL['TOOL (#2)'].unique())\n",
1118 | "X750_COL['TOOL (#2)'] = Le.transform(X750_COL['TOOL (#2)'])\n",
1119 | "X750_COL = drop_col(X750_COL)\n",
1120 | "X750_COL = date_cols(X750_COL)\n",
1121 | "X750_COL = quantile_dropout(X750_COL)\n",
1122 | "X750_COL = drop_col(X750_COL)\n",
1123 | "X750_COL = X750_COL.transpose().drop_duplicates().transpose()\n",
1124 | "X750_COL = scala(X750_COL)"
1125 | ]
1126 | },
1127 | {
1128 | "cell_type": "code",
1129 | "execution_count": 77,
1130 | "metadata": {
1131 | "collapsed": false,
1132 | "deletable": true,
1133 | "editable": true
1134 | },
1135 | "outputs": [],
1136 | "source": [
1137 | "pca_X750 = Pca(X750_COL)\n",
1138 | "pca_X750 = pd.DataFrame(pca_X750)\n",
1139 | "pca_X750.reset_index(inplace=True)\n",
1140 | "pca_X750.rename({'index' : 'ID'}, inplace=True)\n",
1141 | "pca_X750.drop(['index'], axis=1, inplace=True)"
1142 | ]
1143 | },
1144 | {
1145 | "cell_type": "code",
1146 | "execution_count": 79,
1147 | "metadata": {
1148 | "collapsed": false,
1149 | "deletable": true,
1150 | "editable": true
1151 | },
1152 | "outputs": [],
1153 | "source": [
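1154 |     "# merge the per-TOOL PCA feature frames into a single feature matrix\n",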
1154 | "data_Set = pd.concat([pca_X210, pca_X220, \n",
1155 | " pca_X261, pca_X300, \n",
1156 | " pca_X310, pca_X311,\n",
1157 | " pca_X312, pca_X330, \n",
1158 | " pca_X340, pca_X344, \n",
1159 | " pca_X360, pca_X400,\n",
1160 | " pca_X420, pca_X440A, \n",
1161 | " pca_X520, pca_X750], axis=1)"
1162 | ]
1163 | },
1164 | {
1165 | "cell_type": "code",
1166 | "execution_count": 80,
1167 | "metadata": {
1168 | "collapsed": true,
1169 | "deletable": true,
1170 | "editable": true
1171 | },
1172 | "outputs": [],
1173 | "source": [
1174 |     "train = data_Set.ix[:499, :]\n",
1175 |     "test = data_Set.ix[500:, :]"
1176 | ]
1177 | },
1178 | {
1179 | "cell_type": "code",
1180 | "execution_count": 79,
1181 | "metadata": {
1182 | "collapsed": false,
1183 | "deletable": true,
1184 | "editable": true
1185 | },
1186 | "outputs": [],
1187 | "source": [
1188 | "train.to_csv('train/train.csv', index=False)\n",
1189 | "test.to_csv('train/test.csv', index=False)\n",
1190 | "y.to_csv('train/y.csv', index=False)"
1191 | ]
1192 | }
1193 | ],
1194 | "metadata": {
1195 | "kernelspec": {
1196 | "display_name": "Python 3",
1197 | "language": "python",
1198 | "name": "python3"
1199 | },
1200 | "language_info": {
1201 | "codemirror_mode": {
1202 | "name": "ipython",
1203 | "version": 3
1204 | },
1205 | "file_extension": ".py",
1206 | "mimetype": "text/x-python",
1207 | "name": "python",
1208 | "nbconvert_exporter": "python",
1209 | "pygments_lexer": "ipython3",
1210 | "version": "3.5.0"
1211 | }
1212 | },
1213 | "nbformat": 4,
1214 | "nbformat_minor": 2
1215 | }
1216 |
--------------------------------------------------------------------------------
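Note on 初赛处理.ipynb above: the sixteen per-TOOL blocks all repeat the same pipeline (optionally label-encode the tool column, drop constant columns, drop date-like columns, replace IQR outliers, drop constants and duplicated columns again, standardize, then PCA). A hedged sketch of how those blocks could be collapsed into one helper; `drop_col`, `date_cols`, `quantile_dropout`, `scala`, `Pca` and `input_data` are the names defined in the notebook, and the block list shown is only a partial illustration rather than the full sixteen slices:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def process_tool_block(df, start, end, tool_col=None):
    """Apply the per-TOOL pipeline from 初赛处理.ipynb to one column slice."""
    block = df.loc[:, start:end].copy()
    if tool_col is not None:
        # numeric encoding of the tool identifier, as in the notebook cells
        block[tool_col] = LabelEncoder().fit_transform(block[tool_col])
    block = drop_col(block)                                   # drop single-valued columns
    block = date_cols(block)                                  # drop date-like large-valued columns
    block = quantile_dropout(block)                           # replace IQR outliers with the column mean
    block = drop_col(block)
    block = block.transpose().drop_duplicates().transpose()   # drop duplicated columns
    return pd.DataFrame(Pca(scala(block)))                    # standardize, then PCA (90% variance)

# Partial illustration of the (start, end, tool) slices handled above.
blocks = [
    ('TOOL_ID', '210X231', 'TOOL_ID'),
    ('Tool',    '220X571', 'Tool'),
    ('261X226', '261X763', None),
]
data_Set = pd.concat(
    [process_tool_block(input_data, start, end, tool) for start, end, tool in blocks],
    axis=1)
```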