├── README.md
├── First
│   ├── 线上提交.ipynb
│   ├── 线下验证.ipynb
│   ├── 遗传算法.ipynb
│   └── 初赛处理.ipynb
└── Second
    └── 复赛处理.ipynb
/README.md:
--------------------------------------------------------------------------------
1 | # Tianchi Industrial AI Competition, Intelligent Manufacturing Quality Prediction: Approach and Code
2 | 
3 | The First folder holds the PCA-based code used at the start of the preliminary round. The code has been trimmed: the Pearson-correlation and mutual-information feature-screening parts were removed, so using GBDT feature selection requires replacing and modifying the code in the 初赛处理 notebook.
4 | 
5 | 初赛处理 (preliminary preprocessing): splits the raw data into several datasets by TOOL value, applies PCA dimensionality reduction to each one separately, and merges the results into the training set; this scheme was later abandoned.
6 | 线下验证 (offline validation): after dropping all-null columns, columns with overly large values, and single-valued columns, builds the training set with GBDT + SelectFromModel, then validates single xgb and lgb models as well as a Stacking ensemble.
7 | 线上提交 (online submission): submits the results of the single xgb model and the Stacking model.
8 | 遗传算法 (genetic algorithm): the plan was to use a genetic algorithm to screen features further on top of the GBDT candidate set, but once the framework was written there turned out to be too many details to tune, so in the end only the fitness function was used to assist offline validation.
9 | Genetic-algorithm framework:
10 | 1. The population holds 10 individuals; each individual's gene length equals the size of the candidate feature set; genes use 0/1 binary encoding, and the population is initialized with randomly generated gene sequences.
11 | 2. The fitness function is the 10-fold cross-validated linear-regression score.
12 | 3. The selection operator is roulette-wheel selection, which favors the fittest individuals of the current generation.
13 | 4. The crossover operator is two-point crossover: a randomly chosen gene segment is exchanged between the father and mother individuals.
14 | 5. The mutation operator is random mutation: a mutation-probability threshold is fixed in advance; each individual draws a random number, and if it falls below the threshold a random gene position of that individual is flipped.
15 | tips:
16 | In practice, after several generations of roulette-wheel selection all individuals converge to the same fitness value, and that value is not the optimum. The likely cause is that the best individual of each generation is not preserved, so newly produced optima get shuffled away by the roulette wheel. Replacing the selection operator with something like tournament selection is worth considering.
17 | The fitness function uses linear regression mainly to save time; a tree model should really serve as the fitness evaluator, to stay consistent with the model used for feature selection.
18 | Two-point crossover depends directly on the random choice of the two endpoints: endpoints too far apart may swap out important features, while endpoints too close together defeat the purpose of crossover. More flexible schemes are worth trying, such as fixed-length single-segment or fixed-length multi-segment crossover.
19 | For the mutation operator the main question is the choice of the mutation threshold, which can be tuned by enlarging the population and running more generations.
20 | There are many more methods and tricks for these key genetic-algorithm operators... something to refine in future competitions.
21 | 
22 | The Second folder holds the final-round code.
23 | 
24 | 复赛处理 (final-round preprocessing) generates the training data in the following steps:
25 | 1. Drop all-null columns
26 | 2. Convert object columns and encode the TOOL column numerically
27 | 3. Drop date-like large values
28 | 4. Drop columns whose values are all identical
29 | 5. Handle outliers with the interquartile-range rule
30 | 6. Drop identical-valued columns once more
31 | 7. Drop low-variance columns
32 | 8. Select features with GBDT + SelectFromModel
33 | 
34 | The final submission blends the single xgb model with the Stacking model.
--------------------------------------------------------------------------------
/First/线上提交.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {
7 | "collapsed": false,
8 | "deletable": true,
9 | "editable": true
10 | },
11 | "outputs": [
12 | {
13 | "name": "stderr",
14 | "output_type": "stream",
15 | "text": [
16 | 
"c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\sklearn\\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n", 17 | " \"This module will be removed in 0.20.\", DeprecationWarning)\n" 18 | ] 19 | } 20 | ], 21 | "source": [ 22 | "import pandas as pd\n", 23 | "import numpy as np\n", 24 | "import gc\n", 25 | "import xgboost as xgb\n", 26 | "import lightgbm as lgb\n", 27 | "from sklearn.model_selection import train_test_split\n", 28 | "from sklearn.cross_validation import KFold, StratifiedKFold, cross_val_score\n", 29 | "from datetime import datetime\n", 30 | "from catboost import CatBoostRegressor\n", 31 | "from sklearn.preprocessing import StandardScaler,LabelEncoder, OneHotEncoder, minmax_scale, PolynomialFeatures\n", 32 | "from sklearn import tree\n", 33 | "from sklearn import linear_model\n", 34 | "from sklearn import svm\n", 35 | "from sklearn import neighbors\n", 36 | "from sklearn import ensemble\n", 37 | "from sklearn.tree import ExtraTreeRegressor\n", 38 | "from sklearn.decomposition import PCA\n", 39 | "from sklearn.manifold import TSNE\n", 40 | "from sklearn.feature_selection import SelectFromModel, VarianceThreshold,RFE" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 2, 46 | "metadata": { 47 | "collapsed": true, 48 | "deletable": true, 49 | "editable": true 50 | }, 51 | "outputs": [], 52 | "source": [ 53 | "def evalerror(y, y_pred):\n", 54 | " loss = np.sum(np.square(y - y_pred))\n", 55 | " n = len(y)\n", 56 | " return loss / n" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 3, 62 | "metadata": { 63 | "collapsed": true, 64 | "deletable": true, 65 | "editable": true 66 | }, 67 | "outputs": [], 68 | "source": 
[ 69 | "train = pd.read_csv('train/train.csv')\n",
70 | "test = pd.read_csv('train/test.csv')\n",
71 | "y = pd.read_csv('train/y.csv')"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": 4,
77 | "metadata": {
78 | "collapsed": false,
79 | "deletable": true,
80 | "editable": true
81 | },
82 | "outputs": [
83 | {
84 | "name": "stderr",
85 | "output_type": "stream",
86 | "text": [
87 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\sklearn\\utils\\validation.py:526: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n",
88 | "  y = column_or_1d(y, warn=True)\n"
89 | ]
90 | }
91 | ],
92 | "source": [
93 | "#Select features with GBDT\n",
94 | "clf_gt2 = ensemble.GradientBoostingRegressor(max_depth=1, n_estimators=320, random_state=1)\n",
95 | "clf_gt2.fit(train, y)\n",
96 | "\n",
97 | "model1 = SelectFromModel(clf_gt2, prefit=True) \n",
98 | "train = pd.DataFrame(model1.transform(train))\n",
99 | "test = pd.DataFrame(model1.transform(test))"
100 | ]
101 | },
102 | {
103 | "cell_type": "code",
104 | "execution_count": 5,
105 | "metadata": {
106 | "collapsed": true,
107 | "deletable": true,
108 | "editable": true
109 | },
110 | "outputs": [],
111 | "source": [
112 | "#Online single-xgb model\n",
113 | "param = {}\n",
114 | "param['eta'] = 0.01\n",
115 | "param['max_depth'] = 6\n",
116 | "#param['min_child_weight'] = 5\n",
117 | "param['subsample'] = 0.8\n",
118 | "param['colsample_bytree'] = 0.3\n",
119 | "num_round = 750\n",
120 | "\n",
121 | "xgbTrain = xgb.DMatrix(train, label=y)\n",
122 | "xgbTest = xgb.DMatrix(test)\n",
123 | "model = xgb.train(param, xgbTrain, num_round)\n",
124 | "result_xgb = model.predict(xgbTest)"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": 8,
130 | "metadata": {
131 | "collapsed": false,
132 | "deletable": true,
133 | "editable": true,
134 | "scrolled": true
135 | },
136 | 
"outputs": [ 137 | { 138 | "name": "stdout", 139 | "output_type": "stream", 140 | "text": [ 141 | "the model 0\n", 142 | "the model 1\n", 143 | "the model 2\n", 144 | "the model 3\n" 145 | ] 146 | } 147 | ], 148 | "source": [ 149 | "#线上Stacking模型\n", 150 | "model_gb = ensemble.GradientBoostingRegressor(n_estimators=450, \n", 151 | " max_depth=2, \n", 152 | " subsample=0.8, \n", 153 | " learning_rate=0.01, \n", 154 | " random_state=0, \n", 155 | " max_features=0.2)\n", 156 | "modle0 = xgb.XGBRegressor(learning_rate=0.01, \n", 157 | " max_depth=3, \n", 158 | " colsample_bytree=0.2, \n", 159 | " subsample=0.8, \n", 160 | " seed=0, \n", 161 | " n_estimators=2100)\n", 162 | "modle1 = xgb.XGBRegressor(learning_rate=0.01, \n", 163 | " max_depth=3, \n", 164 | " colsample_bytree=0.3, \n", 165 | " subsample=0.8, \n", 166 | " seed=0, \n", 167 | " n_estimators=1600,\n", 168 | " min_child_weight=6)\n", 169 | "\n", 170 | "clf1 = lgb.LGBMRegressor(colsample_bytree=0.3,\n", 171 | " learning_rate=0.01, \n", 172 | " subsample=0.8, \n", 173 | " num_leaves=4, \n", 174 | " objective='regression', \n", 175 | " n_estimators=350, \n", 176 | " seed=0)\n", 177 | "base_model = [['xgb0', modle0],\n", 178 | " ['xgb1', modle1], \n", 179 | " ['gb', model_gb],\n", 180 | " ['lgb', clf1],]\n", 181 | "\n", 182 | "folds = list(KFold(len(train), n_folds=5, random_state=0))\n", 183 | "S_train = np.zeros((train.shape[0], len(base_model)))\n", 184 | "S_test = np.zeros((test.shape[0], len(base_model))) \n", 185 | "for index, item in enumerate(base_model):\n", 186 | " print(\"the model\", index)\n", 187 | " clf = item[1]\n", 188 | " S_test_i = np.zeros((test.shape[0], len(folds)))\n", 189 | " for j, (train_idx, test_idx) in enumerate(folds):\n", 190 | " X_train = train.ix[train_idx, :]\n", 191 | " X_valid = train.ix[test_idx, :]\n", 192 | " Y = y.ix[train_idx, :]\n", 193 | " clf.fit(X_train, Y['Y'])\n", 194 | " S_train[test_idx, index] = clf.predict(X_valid)\n", 195 | " S_test_i[:, j] = clf.predict(test) 
\n", 196 | " S_test[:, index] = S_test_i.mean(1)\n", 197 | " \n", 198 | "linreg = linear_model.LinearRegression()\n", 199 | "linreg.fit(S_train, y)\n", 200 | "\n", 201 | "result = linreg.predict(S_test)" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 10, 207 | "metadata": { 208 | "collapsed": false, 209 | "deletable": true, 210 | "editable": true 211 | }, 212 | "outputs": [], 213 | "source": [ 214 | "sub = pd.read_csv('data/测试A-答案模板.csv', names=['ID'])\n", 215 | "sub['res'] = pd.DataFrame(result)[0]" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": 128, 221 | "metadata": { 222 | "collapsed": true, 223 | "deletable": true, 224 | "editable": true 225 | }, 226 | "outputs": [], 227 | "source": [ 228 | "sub.to_csv('submission{}.csv'.format(datetime.now().strftime('%Y%m%d_%H%M%S')), index=False, header=None)" 229 | ] 230 | } 231 | ], 232 | "metadata": { 233 | "kernelspec": { 234 | "display_name": "Python 3", 235 | "language": "python", 236 | "name": "python3" 237 | }, 238 | "language_info": { 239 | "codemirror_mode": { 240 | "name": "ipython", 241 | "version": 3 242 | }, 243 | "file_extension": ".py", 244 | "mimetype": "text/x-python", 245 | "name": "python", 246 | "nbconvert_exporter": "python", 247 | "pygments_lexer": "ipython3", 248 | "version": "3.5.0" 249 | } 250 | }, 251 | "nbformat": 4, 252 | "nbformat_minor": 2 253 | } 254 | -------------------------------------------------------------------------------- /First/线下验证.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": false, 8 | "deletable": true, 9 | "editable": true 10 | }, 11 | "outputs": [ 12 | { 13 | "name": "stderr", 14 | "output_type": "stream", 15 | "text": [ 16 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\sklearn\\cross_validation.py:44: DeprecationWarning: 
This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.\n", 17 | " \"This module will be removed in 0.20.\", DeprecationWarning)\n" 18 | ] 19 | } 20 | ], 21 | "source": [ 22 | "import pandas as pd\n", 23 | "import numpy as np\n", 24 | "import gc\n", 25 | "import xgboost as xgb\n", 26 | "import lightgbm as lgb\n", 27 | "from sklearn.model_selection import train_test_split\n", 28 | "from sklearn.cross_validation import KFold, StratifiedKFold, cross_val_score\n", 29 | "from datetime import datetime\n", 30 | "from catboost import CatBoostRegressor\n", 31 | "from sklearn.preprocessing import StandardScaler,LabelEncoder, OneHotEncoder, minmax_scale, scale,PolynomialFeatures\n", 32 | "from sklearn import tree\n", 33 | "from sklearn import linear_model\n", 34 | "from sklearn import svm\n", 35 | "from sklearn import neighbors\n", 36 | "from sklearn import ensemble\n", 37 | "from sklearn.feature_selection import SelectFromModel, VarianceThreshold,RFE\n", 38 | "from minepy import MINE\n", 39 | "from mlxtend.regressor import StackingRegressor" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 27, 45 | "metadata": { 46 | "collapsed": true, 47 | "deletable": true, 48 | "editable": true 49 | }, 50 | "outputs": [], 51 | "source": [ 52 | "def evalerror(y, y_pred):\n", 53 | " loss = np.sum(np.square(y - y_pred))\n", 54 | " n = len(y)\n", 55 | " return loss / n" 56 | ] 57 | }, 58 | { 59 | "cell_type": "code", 60 | "execution_count": 123, 61 | "metadata": { 62 | "collapsed": false, 63 | "deletable": true, 64 | "editable": true 65 | }, 66 | "outputs": [], 67 | "source": [ 68 | "train = pd.read_csv('train/train.csv')\n", 69 | "test = pd.read_csv('train/test.csv')\n", 70 | "y = pd.read_csv('train/y.csv')" 71 | ] 72 | }, 73 | { 74 | "cell_type": 
"code", 75 | "execution_count": 125, 76 | "metadata": { 77 | "collapsed": false, 78 | "deletable": true, 79 | "editable": true 80 | }, 81 | "outputs": [ 82 | { 83 | "name": "stderr", 84 | "output_type": "stream", 85 | "text": [ 86 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\sklearn\\utils\\validation.py:526: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", 87 | " y = column_or_1d(y, warn=True)\n" 88 | ] 89 | } 90 | ], 91 | "source": [ 92 | "#GBDT筛选特征\n", 93 | "clf_gt2 = ensemble.GradientBoostingRegressor(max_depth=1, n_estimators=320, random_state=1)\n", 94 | "clf_gt2.fit(train, y)\n", 95 | "\n", 96 | "\n", 97 | "model1 = SelectFromModel(clf_gt2, prefit=True) \n", 98 | "train = pd.DataFrame(model1.transform(train))\n", 99 | "test = pd.DataFrame(model1.transform(test))" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": 141, 105 | "metadata": { 106 | "collapsed": false, 107 | "deletable": true, 108 | "editable": true, 109 | "scrolled": false 110 | }, 111 | "outputs": [ 112 | { 113 | "name": "stdout", 114 | "output_type": "stream", 115 | "text": [ 116 | "0.025204\n" 117 | ] 118 | } 119 | ], 120 | "source": [ 121 | "#单xgb模型线下5折 交叉验证\n", 122 | "def wmaeEval(preds, dtrain):\n", 123 | " label = dtrain.get_label()\n", 124 | " return 'error', np.sum(np.square(preds - label)) / len(label)\n", 125 | "\n", 126 | "param = {}\n", 127 | "param['eta'] = 0.01\n", 128 | "param['max_depth'] = 3\n", 129 | "param['subsample'] = 0.8\n", 130 | "param['colsample_bytree'] = 0.3\n", 131 | "\n", 132 | "param['seed'] = 1\n", 133 | "num_round = 10000\n", 134 | "\n", 135 | "xgbTrain = xgb.DMatrix(train, label=y)\n", 136 | "modle = xgb.cv(param, xgbTrain, num_boost_round=4200, feval=wmaeEval, nfold=5)\n", 137 | "print(modle.iloc[-1, 0]) " 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 
null, 143 | "metadata": { 144 | "collapsed": true 145 | }, 146 | "outputs": [], 147 | "source": [ 148 | "#单lgb模型线下5折 交叉验证\n", 149 | "params = {}\n", 150 | "params['learning_rate'] = 0.01\n", 151 | "params['boosting_type'] = 'gbdt'\n", 152 | "params['objective'] = 'regression' \n", 153 | "params['feature_fraction'] = 0.3 \n", 154 | "params['bagging_fraction'] = 0.8 \n", 155 | "params['num_leaves'] = 64 \n", 156 | "result = [] \n", 157 | "folds = list(KFold(len(train), n_folds=5, random_state=0))\n", 158 | "for j, (train_idx, test_idx) in enumerate(folds):\n", 159 | " print(\"the folds\", j)\n", 160 | " X_train = train.ix[train_idx, :]\n", 161 | " X_valid = train.ix[test_idx, :]\n", 162 | " \n", 163 | " Y_train = y.ix[train_idx, :]\n", 164 | " Y_valid = y.ix[test_idx, :]\n", 165 | " d_train = lgb.Dataset(X_train, label=Y_train['Y'])\n", 166 | " clf = lgb.train(params, d_train, 620)\n", 167 | " preds = clf.predict(X_valid)\n", 168 | " result.append(evalerror(preds, Y_valid['Y']))\n", 169 | " \n", 170 | "np.mean(result)" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 37, 176 | "metadata": { 177 | "collapsed": false, 178 | "deletable": true, 179 | "editable": true 180 | }, 181 | "outputs": [ 182 | { 183 | "name": "stderr", 184 | "output_type": "stream", 185 | "text": [ 186 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:4: SettingWithCopyWarning: \n", 187 | "A value is trying to be set on a copy of a slice from a DataFrame\n", 188 | "\n", 189 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n", 190 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:7: SettingWithCopyWarning: \n", 191 | "A value is trying to be set on a copy of a slice from a DataFrame\n", 192 | "\n", 193 | "See the caveats in the documentation: 
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n", 194 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n", 195 | "A value is trying to be set on a copy of a slice from a DataFrame\n", 196 | "\n", 197 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n" 198 | ] 199 | } 200 | ], 201 | "source": [ 202 | "#线下Stacking 取20%数据集进行验证\n", 203 | "x_train, x_valid, y_train, y_valid = train_test_split(train, y, test_size = 0.2, random_state=1)\n", 204 | "x_train.reset_index(inplace=True)\n", 205 | "x_train.drop(['index'], axis=1, inplace=True)\n", 206 | "\n", 207 | "y_train.reset_index(inplace=True)\n", 208 | "y_train.drop(['index'], axis=1, inplace=True)\n", 209 | "\n", 210 | "y_valid.reset_index(inplace=True)\n", 211 | "y_valid.drop(['index'], axis=1, inplace=True)" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 69, 217 | "metadata": { 218 | "collapsed": false, 219 | "deletable": true, 220 | "editable": true 221 | }, 222 | "outputs": [ 223 | { 224 | "name": "stdout", 225 | "output_type": "stream", 226 | "text": [ 227 | "the model 0\n", 228 | "the model 1\n", 229 | "the model 2\n", 230 | "the model 3\n" 231 | ] 232 | }, 233 | { 234 | "data": { 235 | "text/plain": [ 236 | "0.027016934367817883" 237 | ] 238 | }, 239 | "execution_count": 69, 240 | "metadata": {}, 241 | "output_type": "execute_result" 242 | } 243 | ], 244 | "source": [ 245 | "model_gb = ensemble.GradientBoostingRegressor(n_estimators=450, \n", 246 | " max_depth=2, \n", 247 | " subsample=0.8, \n", 248 | " learning_rate=0.01, \n", 249 | " random_state=0, \n", 250 | " max_features=0.2)\n", 251 | "modle0 = xgb.XGBRegressor(learning_rate=0.01, \n", 252 | " max_depth=3, \n", 253 | " colsample_bytree=0.2, \n", 254 | " subsample=0.8, \n", 255 | " seed=0, \n", 256 | " n_estimators=2100)\n", 257 
| "modle1 = xgb.XGBRegressor(learning_rate=0.01, \n", 258 | " max_depth=3, \n", 259 | " colsample_bytree=0.3, \n", 260 | " subsample=0.8, \n", 261 | " seed=0, \n", 262 | " n_estimators=1600,\n", 263 | " min_child_weight=6)\n", 264 | "\n", 265 | "clf1 = lgb.LGBMRegressor(colsample_bytree=0.3,\n", 266 | " learning_rate=0.01, \n", 267 | " subsample=0.8, \n", 268 | " num_leaves=4, \n", 269 | " objective='regression', \n", 270 | " n_estimators=350, \n", 271 | " seed=0)\n", 272 | "base_model = [['xgb0', modle0],\n", 273 | " ['xgb1', modle1], \n", 274 | " ['gb', model_gb],\n", 275 | " ['lgb', clf1],]\n", 276 | "\n", 277 | "folds = list(KFold(len(x_train), n_folds=5, random_state=0))\n", 278 | "S_train = np.zeros((x_train.shape[0], len(base_model)))\n", 279 | "S_test = np.zeros((x_valid.shape[0], len(base_model))) \n", 280 | "for index, item in enumerate(base_model):\n", 281 | " print(\"the model\", index)\n", 282 | " clf = item[1]\n", 283 | " S_test_i = np.zeros((x_valid.shape[0], len(folds)))\n", 284 | " for j, (train_idx, test_idx) in enumerate(folds):\n", 285 | " X_train = x_train.ix[train_idx, :]\n", 286 | " X_valid = x_train.ix[test_idx, :]\n", 287 | " Y = y_train.ix[train_idx, :]\n", 288 | " clf.fit(X_train, Y['Y'])\n", 289 | " S_train[test_idx, index] = clf.predict(X_valid)\n", 290 | " S_test_i[:, j] = clf.predict(x_valid) \n", 291 | " S_test[:, index] = S_test_i.mean(1)\n", 292 | " \n", 293 | "linreg = linear_model.LinearRegression()\n", 294 | "linreg.fit(S_train, y_train)\n", 295 | "\n", 296 | "result = linreg.predict(S_test)\n", 297 | "evalerror(pd.DataFrame(result)[0], y_valid['Y'])" 298 | ] 299 | } 300 | ], 301 | "metadata": { 302 | "kernelspec": { 303 | "display_name": "Python 3", 304 | "language": "python", 305 | "name": "python3" 306 | }, 307 | "language_info": { 308 | "codemirror_mode": { 309 | "name": "ipython", 310 | "version": 3 311 | }, 312 | "file_extension": ".py", 313 | "mimetype": "text/x-python", 314 | "name": "python", 315 | "nbconvert_exporter": 
"python", 316 | "pygments_lexer": "ipython3", 317 | "version": "3.5.0" 318 | } 319 | }, 320 | "nbformat": 4, 321 | "nbformat_minor": 2 322 | } 323 | -------------------------------------------------------------------------------- /First/遗传算法.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 6, 6 | "metadata": { 7 | "collapsed": false, 8 | "deletable": true, 9 | "editable": true 10 | }, 11 | "outputs": [], 12 | "source": [ 13 | "import pandas as pd\n", 14 | "import numpy as np\n", 15 | "import xgboost as xgb\n", 16 | "import lightgbm as lgb\n", 17 | "from sklearn.model_selection import train_test_split\n", 18 | "from sklearn.cross_validation import KFold, StratifiedKFold, cross_val_score\n", 19 | "from sklearn.preprocessing import StandardScaler,LabelEncoder, OneHotEncoder, minmax_scale, scale\n", 20 | "from sklearn import tree\n", 21 | "from sklearn import linear_model\n", 22 | "from sklearn import svm\n", 23 | "from sklearn import neighbors\n", 24 | "from sklearn import ensemble\n", 25 | "from sklearn.feature_selection import SelectFromModel, VarianceThreshold" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 2, 31 | "metadata": { 32 | "collapsed": true, 33 | "deletable": true, 34 | "editable": true 35 | }, 36 | "outputs": [], 37 | "source": [ 38 | "def evalerror(y, y_pred):\n", 39 | " loss = np.sum(np.square(y - y_pred))\n", 40 | " n = len(y)\n", 41 | " return loss / n" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 3, 47 | "metadata": { 48 | "collapsed": false, 49 | "deletable": true, 50 | "editable": true 51 | }, 52 | "outputs": [], 53 | "source": [ 54 | "train = pd.read_csv('train/train.csv')\n", 55 | "test = pd.read_csv('train/test.csv')\n", 56 | "y = pd.read_csv('train/y.csv')" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 7, 62 | "metadata": { 63 | "collapsed": false, 64 | 
"deletable": true, 65 | "editable": true 66 | }, 67 | "outputs": [ 68 | { 69 | "name": "stderr", 70 | "output_type": "stream", 71 | "text": [ 72 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\sklearn\\utils\\validation.py:526: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", 73 | " y = column_or_1d(y, warn=True)\n" 74 | ] 75 | } 76 | ], 77 | "source": [ 78 | "#GBDT特征候选集\n", 79 | "clf_gt = ensemble.GradientBoostingRegressor(max_depth=6, n_estimators=400, random_state=1)\n", 80 | "clf_gt.fit(train, y)\n", 81 | "model = SelectFromModel(clf_gt, prefit=True) \n", 82 | "train = pd.DataFrame(model.transform(train))\n", 83 | "test = pd.DataFrame(model.transform(test))" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": 8, 89 | "metadata": { 90 | "collapsed": false, 91 | "deletable": true, 92 | "editable": true 93 | }, 94 | "outputs": [], 95 | "source": [ 96 | "#初始化种群\n", 97 | "def Init_Individual(feature):\n", 98 | " Individual = []\n", 99 | " for i in range(10):\n", 100 | " Gene = []\n", 101 | " for g in range(len(feature)):\n", 102 | " Gene.append(np.random.randint(0, 2))\n", 103 | " Individual.append(Gene)\n", 104 | " return np.array(Individual)\n", 105 | "\n", 106 | "\n", 107 | "#适应性函数\n", 108 | "def fitness(Individual, y, dataSet):\n", 109 | " lr = linear_model.LinearRegression()\n", 110 | " fit = []\n", 111 | " index = []\n", 112 | " gene = []\n", 113 | " for i in range(len(Individual)):\n", 114 | " Gene_sequence = pd.DataFrame(dataSet.columns, columns=['feature'])\n", 115 | " Gene_sequence['gene'] = Individual[i]\n", 116 | " Gene_sequence = list(Gene_sequence[Gene_sequence['gene'] == 1]['feature'])\n", 117 | " \n", 118 | " cv_model = cross_val_score(lr, dataSet[Gene_sequence], y, cv=10, scoring='neg_mean_squared_error')\n", 119 | " fit.append(0.1 - np.mean(np.abs(cv_model)))\n", 120 | " 
index.append(i)\n",
121 | "        gene.append(Individual[i])  \n",
122 | "        \n",
123 | "    Ind_fitness = pd.DataFrame(fit, columns=['fintness'])\n",
124 | "#     Ind_fitness['Indi_index'] = index\n",
125 | "    Ind_fitness['Gene'] = gene    \n",
126 | "    return Ind_fitness\n",
127 | "\n",
128 | "\n",
129 | "#Roulette-wheel selection of the fittest individual\n",
130 | "def Roulette_wheel(Fitness): \n",
131 | "    sumFits = np.sum(Fitness['fintness'])\n",
132 | "\n",
133 | "    rndPoint = np.random.uniform(0, sumFits)\n",
134 | "    accumulator = 0.0\n",
135 | "    for ind, val in enumerate(Fitness['fintness']):\n",
136 | "        accumulator += val\n",
137 | "        if accumulator >= rndPoint:\n",
138 | "            return np.array(Fitness[Fitness['fintness'] == val].values)[0]\n",
139 | "    \n",
140 | "#Crossover operator\n",
141 | "def Crossover_operator(Individual):\n",
142 | "    idx1 = np.random.randint(0, len(Individual))\n",
143 | "    idx2 = np.random.randint(0, len(Individual))\n",
144 | "    while idx2 == idx1: \n",
145 | "        idx2 = np.random.randint(0, len(Individual)) \n",
146 | "    \n",
147 | "    Father_gene = Individual[Individual['Indi_index'] == idx1]['Indi_Gene'].values\n",
148 | "    Mother_gene = Individual[Individual['Indi_index'] == idx2]['Indi_Gene'].values\n",
149 | "    \n",
150 | "    crossPos_A = np.random.randint(0, len(Father_gene[0]))\n",
151 | "    crossPos_B = np.random.randint(0, len(Father_gene[0]))  \n",
152 | "\n",
153 | "    while crossPos_A == crossPos_B: \n",
154 | "        crossPos_B = np.random.randint(0, len(Father_gene[0]))   \n",
155 | "\n",
156 | "    if crossPos_A > crossPos_B:\n",
157 | "        crossPos_A, crossPos_B = crossPos_B, crossPos_A\n",
158 | "    \n",
159 | "    while crossPos_A < crossPos_B:  #swap the whole [A, B) segment, not just one position\n",
160 | "        temp = Father_gene[0][crossPos_A]\n",
161 | "        Father_gene[0][crossPos_A] = Mother_gene[0][crossPos_A]\n",
162 | "        Mother_gene[0][crossPos_A] = temp\n",
163 | "        crossPos_A = crossPos_A + 1\n",
164 | "    \n",
165 | "    return Father_gene, Mother_gene  \n",
166 | "\n",
167 | "#Mutation operator\n",
168 | "def Mutation_operator(Individual):\n",
169 | "    MUTATION_RATE = 0.165\n",
170 | "    for i in 
range(len(Individual)):\n", 171 | " mutatePos = np.random.randint(0, len(Individual['Indi_Gene'][i]))\n", 172 | " theta = np.random.random()\n", 173 | " if theta < MUTATION_RATE:\n", 174 | " if Individual['Indi_Gene'][i][mutatePos] == 0:\n", 175 | " Individual['Indi_Gene'][i][mutatePos] = 1\n", 176 | " else:\n", 177 | " Individual['Indi_Gene'][i][mutatePos] = 0\n", 178 | " return Individual" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": 9, 184 | "metadata": { 185 | "collapsed": true, 186 | "deletable": true, 187 | "editable": true 188 | }, 189 | "outputs": [], 190 | "source": [ 191 | "#遗传算法\n", 192 | "def Genetic_algorithm(Individual, train, y, iterm):\n", 193 | " for i in range(iterm):\n", 194 | " print('第 %d 代' % i)\n", 195 | " fit = fitness(Individual, y, train)\n", 196 | " \n", 197 | " Roulette_gene = []\n", 198 | " index = []\n", 199 | " for i in range(len(Individual)):\n", 200 | " Roulette_gene.append(Roulette_wheel(fit))\n", 201 | " index.append(i)\n", 202 | " \n", 203 | " Choice_gene = pd.DataFrame(Roulette_gene, columns=['fintness', 'Indi_Gene'])\n", 204 | " Choice_gene['Indi_index'] = index\n", 205 | " Choice_gene['fintness'] = 0.1 - Choice_gene['fintness']\n", 206 | " Choice_gene = Choice_gene.sort_values(['fintness'])\n", 207 | " \n", 208 | " Cro_gene = []\n", 209 | " for i in range(5):\n", 210 | " gene1, gene2 = Crossover_operator(Choice_gene)\n", 211 | " Cro_gene.append(gene1)\n", 212 | " Cro_gene.append(gene2) \n", 213 | " \n", 214 | " Crossover_gene = pd.DataFrame(Cro_gene, columns=['Indi_Gene'])\n", 215 | " Crossover_gene['Indi_index'] = index\n", 216 | " \n", 217 | " New_gene = Mutation_operator(Crossover_gene)\n", 218 | " Individual = New_gene['Indi_Gene']\n", 219 | " fit['fintness'] = 0.1 - fit['fintness']\n", 220 | " return fit\n", 221 | " " 222 | ] 223 | }, 224 | { 225 | "cell_type": "code", 226 | "execution_count": 18, 227 | "metadata": { 228 | "collapsed": false, 229 | "deletable": true, 230 | "editable": true 
231 | }, 232 | "outputs": [ 233 | { 234 | "data": { 235 | "text/html": [ 236 | "
<div>HTML table markup was stripped during extraction; the same DataFrame is reproduced in the text/plain output below.</div>
" 299 | ], 300 | "text/plain": [ 301 | " fintness Gene\n", 302 | "0 0.040204 [0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, ...\n", 303 | "1 0.040662 [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, ...\n", 304 | "2 0.040862 [0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, ...\n", 305 | "3 0.041487 [1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, ...\n", 306 | "4 0.042401 [0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, ...\n", 307 | "5 0.042978 [1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, ...\n", 308 | "6 0.043455 [1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, ...\n", 309 | "7 0.043758 [0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, ...\n", 310 | "8 0.044766 [1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, ...\n", 311 | "9 0.045582 [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, ..." 312 | ] 313 | }, 314 | "execution_count": 18, 315 | "metadata": {}, 316 | "output_type": "execute_result" 317 | } 318 | ], 319 | "source": [ 320 | "#初始化种群\n", 321 | "Individual = Init_Individual(train.columns)\n", 322 | "\n", 323 | "#计算每个个体的适应性\n", 324 | "fit = fitness(Individual, y, train)\n", 325 | "fit['fintness'] = 0.1 - fit['fintness']\n", 326 | "fit.sort_values(['fintness'], inplace=True)\n", 327 | "fit.reset_index(inplace=True, drop=['index'])\n", 328 | "fit" 329 | ] 330 | }, 331 | { 332 | "cell_type": "code", 333 | "execution_count": 15, 334 | "metadata": { 335 | "collapsed": true 336 | }, 337 | "outputs": [], 338 | "source": [ 339 | "Gene_sequence = pd.DataFrame(train.columns, columns=['feature'])\n", 340 | "Gene_sequence['gene'] = fit['Gene'][0]\n", 341 | "Gene_sequence = list(Gene_sequence[Gene_sequence['gene'] == 1]['feature'])" 342 | ] 343 | }, 344 | { 345 | "cell_type": "code", 346 | "execution_count": 14, 347 | "metadata": { 348 | "collapsed": false, 349 | "deletable": true, 350 | "editable": true 351 | }, 352 | "outputs": [ 353 | { 354 | "name": "stdout", 355 | "output_type": "stream", 356 | "text": [ 357 | "0.0300686\n" 358 | ] 359 | } 360 | ], 361 | "source": [ 362 | "def 
wmaeEval(preds, dtrain):\n", 363 | " label = dtrain.get_label()\n", 364 | " return 'error', np.sum(np.square(preds - label)) / len(label)\n", 365 | "param = {}\n", 366 | "param['eta'] = 0.01\n", 367 | "param['max_depth'] = 3\n", 368 | "\n", 369 | "param['subsample'] = 0.8\n", 370 | "param['colsample_bytree'] = 0.3\n", 371 | "num_round = 3300\n", 372 | "\n", 373 | "xgbTrain = xgb.DMatrix(train[Gene_sequence], label=y)\n", 374 | "modle = xgb.cv(param, xgbTrain, num_round, feval=wmaeEval, nfold=5)\n", 375 | "print(modle.iloc[-1, 0])" 376 | ] 377 | } 378 | ], 379 | "metadata": { 380 | "kernelspec": { 381 | "display_name": "Python 3", 382 | "language": "python", 383 | "name": "python3" 384 | }, 385 | "language_info": { 386 | "codemirror_mode": { 387 | "name": "ipython", 388 | "version": 3 389 | }, 390 | "file_extension": ".py", 391 | "mimetype": "text/x-python", 392 | "name": "python", 393 | "nbconvert_exporter": "python", 394 | "pygments_lexer": "ipython3", 395 | "version": "3.5.0" 396 | } 397 | }, 398 | "nbformat": 4, 399 | "nbformat_minor": 2 400 | } 401 | -------------------------------------------------------------------------------- /Second/复赛处理.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "metadata": { 7 | "collapsed": false, 8 | "deletable": true, 9 | "editable": true 10 | }, 11 | "outputs": [ 12 | { 13 | "name": "stderr", 14 | "output_type": "stream", 15 | "text": [ 16 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\sklearn\\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. 
This module will be removed in 0.20.\n", 17 | " \"This module will be removed in 0.20.\", DeprecationWarning)\n" 18 | ] 19 | } 20 | ], 21 | "source": [ 22 | "import pandas as pd\n", 23 | "import numpy as np\n", 24 | "import xgboost as xgb\n", 25 | "import lightgbm as lgb\n", 26 | "from sklearn.model_selection import train_test_split\n", 27 | "from sklearn.cross_validation import KFold, StratifiedKFold, cross_val_score\n", 28 | "from datetime import datetime\n", 29 | "from sklearn.preprocessing import StandardScaler,LabelEncoder, OneHotEncoder, minmax_scale, scale,PolynomialFeatures\n", 30 | "from sklearn import tree\n", 31 | "from sklearn import linear_model\n", 32 | "from sklearn import svm\n", 33 | "from sklearn import neighbors\n", 34 | "from sklearn import ensemble\n", 35 | "from sklearn.feature_selection import SelectFromModel, VarianceThreshold,RFE\n", 36 | "from mlxtend.regressor import StackingRegressor\n", 37 | "from collections import defaultdict" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 4, 43 | "metadata": { 44 | "collapsed": false, 45 | "deletable": true, 46 | "editable": true 47 | }, 48 | "outputs": [], 49 | "source": [ 50 | "data_set1 = pd.read_excel('../复赛数据/训练_20180117.xlsx')\n", 51 | "data_set2 = pd.read_excel('../复赛数据/测试A_20180117.xlsx')\n", 52 | "y2 = pd.read_csv('../复赛数据/[new] fusai_answer_a_20180127.csv', names=['ID', 'Value'])\n", 53 | "\n", 54 | "\n", 55 | "submit = pd.read_csv('../复赛数据/answer_sample_b_20180117.csv', names=['id', 'Y'])\n", 56 | "test_set = pd.read_excel('../复赛数据/测试B_20180117.xlsx')\n", 57 | "\n", 58 | "data_set2 = pd.concat([data_set2, y2[['Value']]], axis=1)\n", 59 | "data_set = pd.concat([data_set1, data_set2], axis=0)\n", 60 | "\n", 61 | "\n", 62 | "y = data_set[['Value']]\n", 63 | "y.reset_index(inplace=True, drop=True)\n", 64 | "data_set = data_set.drop(['Value'], axis=1)\n", 65 | "Le = LabelEncoder()" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 4, 71 | 
"metadata": { 72 | "collapsed": false, 73 | "deletable": true, 74 | "editable": true 75 | }, 76 | "outputs": [], 77 | "source": [ 78 | "input_data = pd.concat([data_set, test_set], ignore_index=True)\n", 79 | "input_data = input_data.drop(['ID'], axis=1)" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 5, 85 | "metadata": { 86 | "collapsed": false, 87 | "deletable": true, 88 | "editable": true, 89 | "scrolled": true 90 | }, 91 | "outputs": [ 92 | { 93 | "data": { 94 | "text/plain": [ 95 | "defaultdict(>,\n", 96 | " {'210': 228,\n", 97 | " '220': 179,\n", 98 | " '300': 21,\n", 99 | " '310': 170,\n", 100 | " '311': 184,\n", 101 | " '312': 679,\n", 102 | " '330': 889,\n", 103 | " '340': 139,\n", 104 | " '344': 398,\n", 105 | " '360': 1030,\n", 106 | " '400': 183,\n", 107 | " '420': 173,\n", 108 | " '440A': 213,\n", 109 | " '520': 382,\n", 110 | " '750': 1030,\n", 111 | " 'Chamber': 1,\n", 112 | " 'Chamber ID': 1,\n", 113 | " 'ERROR:#N/A': 1,\n", 114 | " 'ERROR:#N/A (#1)': 1,\n", 115 | " 'ERROR:#N/A (#2)': 1,\n", 116 | " 'ERROR:#N/A (#3)': 1,\n", 117 | " 'ERROR:#N/A_1': 1,\n", 118 | " 'ERROR:#N/A_1 (#1)': 1,\n", 119 | " 'ERROR:#N/A_1 (#2)': 1,\n", 120 | " 'ERROR:#N/A_10': 1,\n", 121 | " 'ERROR:#N/A_11': 1,\n", 122 | " 'ERROR:#N/A_12': 1,\n", 123 | " 'ERROR:#N/A_13': 1,\n", 124 | " 'ERROR:#N/A_14': 1,\n", 125 | " 'ERROR:#N/A_15': 1,\n", 126 | " 'ERROR:#N/A_16': 1,\n", 127 | " 'ERROR:#N/A_17': 1,\n", 128 | " 'ERROR:#N/A_18': 1,\n", 129 | " 'ERROR:#N/A_19': 1,\n", 130 | " 'ERROR:#N/A_2': 1,\n", 131 | " 'ERROR:#N/A_2 (#1)': 1,\n", 132 | " 'ERROR:#N/A_20': 1,\n", 133 | " 'ERROR:#N/A_21': 1,\n", 134 | " 'ERROR:#N/A_22': 1,\n", 135 | " 'ERROR:#N/A_23': 1,\n", 136 | " 'ERROR:#N/A_24': 1,\n", 137 | " 'ERROR:#N/A_25': 1,\n", 138 | " 'ERROR:#N/A_26': 1,\n", 139 | " 'ERROR:#N/A_27': 1,\n", 140 | " 'ERROR:#N/A_28': 1,\n", 141 | " 'ERROR:#N/A_29': 1,\n", 142 | " 'ERROR:#N/A_3': 1,\n", 143 | " 'ERROR:#N/A_3 (#1)': 1,\n", 144 | " 'ERROR:#N/A_30': 1,\n", 145 | " 
'ERROR:#N/A_4': 1,\n", 146 | " 'ERROR:#N/A_4 (#1)': 1,\n", 147 | " 'ERROR:#N/A_5': 1,\n", 148 | " 'ERROR:#N/A_6': 1,\n", 149 | " 'ERROR:#N/A_7': 1,\n", 150 | " 'ERROR:#N/A_8': 1,\n", 151 | " 'ERROR:#N/A_9': 1,\n", 152 | " 'OPERATION_ID': 1,\n", 153 | " 'TOOL (#1)': 1,\n", 154 | " 'TOOL (#2)': 1,\n", 155 | " 'TOOL (#3)': 1,\n", 156 | " 'TOOL_ID': 1,\n", 157 | " 'Tool': 1,\n", 158 | " 'Tool (#1)': 1,\n", 159 | " 'Tool (#2)': 1,\n", 160 | " 'Tool (#3)': 1,\n", 161 | " 'Tool (#4)': 1,\n", 162 | " 'Tool (#5)': 1})" 163 | ] 164 | }, 165 | "execution_count": 5, 166 | "metadata": {}, 167 | "output_type": "execute_result" 168 | } 169 | ], 170 | "source": [ 171 | "col = pd.DataFrame(input_data.ix[:, 2:].columns, columns=['col'])\n", 172 | "TOOL_ID_col_dict = defaultdict(lambda : 0)\n", 173 | "for index,row in col.iterrows():\n", 174 | " info=row.col.split('X')[0]\n", 175 | " TOOL_ID_col_dict[info] += 1\n", 176 | "\n", 177 | "TOOL_ID_col_dict" 178 | ] 179 | }, 180 | { 181 | "cell_type": "code", 182 | "execution_count": 6, 183 | "metadata": { 184 | "collapsed": true, 185 | "deletable": true, 186 | "editable": true 187 | }, 188 | "outputs": [], 189 | "source": [ 190 | "# Drop columns that are entirely null\n", 191 | "null_col = []\n", 192 | "for col in data_set.columns:\n", 193 | " if data_set[col].isnull().all():\n", 194 | " null_col.append(col)\n", 195 | " \n", 196 | "input_data.drop(null_col, axis=1, inplace=True) \n", 197 | "input_data.drop(['220X150', '220X151'], axis=1, inplace=True) " 198 | ] 199 | }, 200 | { 201 | "cell_type": "code", 202 | "execution_count": 7, 203 | "metadata": { 204 | "collapsed": false, 205 | "deletable": true, 206 | "editable": true 207 | }, 208 | "outputs": [], 209 | "source": [ 210 | "# Encode object-dtype columns\n", 211 | "obj_col = []\n", 212 | "for col in input_data.columns:\n", 213 | " if input_data[col].dtypes == object:\n", 214 | " obj_col.append(col)\n", 215 | " \n", 216 | "for i in range(len(obj_col)):\n", 217 | " Le.fit(input_data[obj_col[i]].unique())\n", 218 | " 
input_data[obj_col[i]] = Le.transform(input_data[obj_col[i]])" 219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": 8, 224 | "metadata": { 225 | "collapsed": true, 226 | "deletable": true, 227 | "editable": true 228 | }, 229 | "outputs": [], 230 | "source": [ 231 | "# Drop date-like columns with overly large values\n", 232 | "def date_cols(data):\n", 233 | " for col in data:\n", 234 | " if data[col].min() > 1e13:\n", 235 | " data = data.drop([col], axis=1) \n", 236 | " return data\n", 237 | "\n", 238 | "input_data = date_cols(input_data)" 239 | ] 240 | }, 241 | { 242 | "cell_type": "code", 243 | "execution_count": 9, 244 | "metadata": { 245 | "collapsed": true, 246 | "deletable": true, 247 | "editable": true 248 | }, 249 | "outputs": [], 250 | "source": [ 251 | "# Drop columns with a single unique value\n", 252 | "def drop_col(data):\n", 253 | " data = data.fillna(data.mean()) \n", 254 | " for line in data.columns:\n", 255 | " if len(data[line].unique()) == 1:\n", 256 | " data = data.drop([line], axis=1) \n", 257 | " return data\n", 258 | "\n", 259 | "input_data = drop_col(input_data)" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": 10, 265 | "metadata": { 266 | "collapsed": false, 267 | "deletable": true, 268 | "editable": true 269 | }, 270 | "outputs": [ 271 | { 272 | "name": "stderr", 273 | "output_type": "stream", 274 | "text": [ 275 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n", 276 | "A value is trying to be set on a copy of a slice from a DataFrame\n", 277 | "\n", 278 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n" 279 | ] 280 | } 281 | ], 282 | "source": [ 283 | "# Handle outliers via quartiles\n", 284 | "def quantile_dropout(data): \n", 285 | " for c in data.columns:\n", 286 | " Q1 = data[c].quantile(q=0.25, interpolation='linear')\n", 287 | " Q3 = data[c].quantile(q=0.75, interpolation='linear') \n", 288 | " \n", 289 | " 
min_v = Q1 - 3 * (Q3 - Q1)\n", 290 | " max_v = Q3 + 3 * (Q3 - Q1)\n", 291 | " \n", 292 | " data[c][(data[c] >= max_v) | (data[c] <= min_v)] = data[c].mean()\n", 293 | " \n", 294 | " return data\n", 295 | "\n", 296 | "# Clean the data once more after outlier handling\n", 297 | "input_data = quantile_dropout(input_data)\n", 298 | "input_data.fillna(input_data.mean(), inplace=True)\n", 299 | "input_data = drop_col(input_data)" 300 | ] 301 | }, 302 | { 303 | "cell_type": "code", 304 | "execution_count": 11, 305 | "metadata": { 306 | "collapsed": false, 307 | "deletable": true, 308 | "editable": true 309 | }, 310 | "outputs": [], 311 | "source": [ 312 | "# Drop low-variance columns\n", 313 | "VT = VarianceThreshold(threshold=0.5)\n", 314 | "input_data = pd.DataFrame(VT.fit_transform(input_data))" 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": 12, 320 | "metadata": { 321 | "collapsed": false, 322 | "deletable": true, 323 | "editable": true 324 | }, 325 | "outputs": [ 326 | { 327 | "name": "stderr", 328 | "output_type": "stream", 329 | "text": [ 330 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\sklearn\\utils\\validation.py:526: DataConversionWarning: A column-vector y was passed when a 1d array was expected. 
Please change the shape of y to (n_samples, ), for example using ravel().\n", 331 | " y = column_or_1d(y, warn=True)\n" 332 | ] 333 | } 334 | ], 335 | "source": [ 336 | "train = input_data.ix[:1099, :]\n", 337 | "test = input_data.ix[1100:, :]\n", 338 | "clf_gt = ensemble.GradientBoostingRegressor(learning_rate=0.01, n_estimators=200, random_state=0)\n", 339 | "clf_gt.fit(train, y)\n", 340 | "\n", 341 | "model = SelectFromModel(clf_gt, prefit=True) \n", 342 | "train = pd.DataFrame(model.transform(train))\n", 343 | "test = pd.DataFrame(model.transform(test))" 344 | ] 345 | }, 346 | { 347 | "cell_type": "code", 348 | "execution_count": 13, 349 | "metadata": { 350 | "collapsed": false, 351 | "deletable": true, 352 | "editable": true 353 | }, 354 | "outputs": [ 355 | { 356 | "name": "stdout", 357 | "output_type": "stream", 358 | "text": [ 359 | "0.0105656\n" 360 | ] 361 | } 362 | ], 363 | "source": [ 364 | "# Single xgb model, offline CV\n", 365 | "def wmaeEval(preds, dtrain):\n", 366 | " label = dtrain.get_label()\n", 367 | " return 'error', np.sum(np.square(preds - label)) / len(label)\n", 368 | "param = {}\n", 369 | "param['eta'] = 0.01\n", 370 | "param['max_depth'] = 3\n", 371 | "param['subsample'] = 0.8\n", 372 | "param['colsample_bytree'] = 0.6\n", 373 | "param['seed'] = 0\n", 374 | "xgbTrain = xgb.DMatrix(train, label=y)\n", 375 | "modle = xgb.cv(param, xgbTrain, num_boost_round=9800, feval=wmaeEval, nfold=5)\n", 376 | "print(modle.iloc[-1, 0]) " 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": 14, 382 | "metadata": { 383 | "collapsed": false, 384 | "deletable": true, 385 | "editable": true 386 | }, 387 | "outputs": [], 388 | "source": [ 389 | "# Single xgb model, online submission\n", 390 | "param = {}\n", 391 | "param['eta'] = 0.01\n", 392 | "param['max_depth'] = 3\n", 393 | "param['subsample'] = 0.8\n", 394 | "param['colsample_bytree'] = 0.6\n", 395 | "param['seed'] = 0\n", 396 | "xgbTrain = xgb.DMatrix(train, label=y)\n", 397 | "xgbTest = xgb.DMatrix(test)\n", 398 | "modle = 
xgb.train(param, xgbTrain, num_boost_round=9200)" 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": null, 404 | "metadata": { 405 | "collapsed": false, 406 | "deletable": true, 407 | "editable": true 408 | }, 409 | "outputs": [ 410 | { 411 | "name": "stdout", 412 | "output_type": "stream", 413 | "text": [ 414 | "the model 0\n" 415 | ] 416 | } 417 | ], 418 | "source": [ 419 | "model_gb = ensemble.GradientBoostingRegressor(n_estimators=450, \n", 420 | " max_depth=2, \n", 421 | " subsample=0.8, \n", 422 | " learning_rate=0.01, \n", 423 | " random_state=2, \n", 424 | " max_features=0.2)\n", 425 | "modle0 = xgb.XGBRegressor(learning_rate=0.01, \n", 426 | " max_depth=3, \n", 427 | " colsample_bytree=0.6, \n", 428 | " subsample=0.8, \n", 429 | " seed=2, \n", 430 | " n_estimators=8800)\n", 431 | "modle1 = xgb.XGBRegressor(learning_rate=0.01, \n", 432 | " max_depth=3, \n", 433 | " colsample_bytree=0.6, \n", 434 | " subsample=0.8, \n", 435 | " seed=0, \n", 436 | " n_estimators=8800)\n", 437 | "\n", 438 | "clf1 = lgb.LGBMRegressor(colsample_bytree=0.3,\n", 439 | " learning_rate=0.01, \n", 440 | " subsample=0.8, \n", 441 | " num_leaves=4, \n", 442 | " objective='regression', \n", 443 | " n_estimators=1200, \n", 444 | " seed=2)\n", 445 | "base_model = [['xgb0', modle0],\n", 446 | " ['xgb1', modle1], \n", 447 | " ['gb', model_gb],\n", 448 | " ['lgb', clf1],]\n", 449 | "\n", 450 | "\n", 451 | "folds = list(KFold(len(train), n_folds=5, random_state=0))\n", 452 | "S_train = np.zeros((train.shape[0], len(base_model)))\n", 453 | "S_test = np.zeros((test.shape[0], len(base_model))) \n", 454 | "for index, item in enumerate(base_model):\n", 455 | " print(\"the model\", index)\n", 456 | " clf = item[1]\n", 457 | " S_test_i = np.zeros((test.shape[0], len(folds)))\n", 458 | " for j, (train_idx, test_idx) in enumerate(folds):\n", 459 | " X_train = train.ix[train_idx, :]\n", 460 | " X_valid = train.ix[test_idx, :]\n", 461 | " Y = y.ix[train_idx, :]\n", 462 | " 
clf.fit(X_train, Y['Value'])\n", 463 | " S_train[test_idx, index] = clf.predict(X_valid)\n", 464 | " S_test_i[:, j] = clf.predict(test) \n", 465 | " S_test[:, index] = S_test_i.mean(1)\n", 466 | " \n", 467 | "linreg = linear_model.LinearRegression()\n", 468 | "linreg.fit(S_train, y)\n", 469 | "result = linreg.predict(S_test)\n", 470 | "print('Done')" 471 | ] 472 | }, 473 | { 474 | "cell_type": "code", 475 | "execution_count": 46, 476 | "metadata": { 477 | "collapsed": false, 478 | "deletable": true, 479 | "editable": true, 480 | "scrolled": true 481 | }, 482 | "outputs": [ 483 | { 484 | "data": { 485 | "text/html": [ 486 | "
\n", 487 | "\n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | 
" \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | " \n", 650 | " \n", 651 | " \n", 652 | " \n", 653 | " \n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | " \n", 741 | " \n", 742 | " \n", 743 | " \n", 744 | " \n", 745 | " \n", 746 | " \n", 747 | " \n", 748 | " \n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | " \n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 
| " \n", 795 | " \n", 796 | " \n", 797 | " \n", 798 | " \n", 799 | " \n", 800 | " \n", 801 | " \n", 802 | " \n", 803 | " \n", 804 | " \n", 805 | " \n", 806 | " \n", 807 | " \n", 808 | " \n", 809 | " \n", 810 | " \n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " \n", 827 | " \n", 828 | " \n", 829 | " \n", 830 | " \n", 831 | " \n", 832 | " \n", 833 | " \n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | " \n", 839 | " \n", 840 | " \n", 841 | " \n", 842 | " \n", 843 | " \n", 844 | " \n", 845 | " \n", 846 | " \n", 847 | " \n", 848 | " \n", 849 | " \n", 850 | " \n", 851 | " \n", 852 | " \n", 853 | " \n", 854 | " \n", 855 | " \n", 856 | " \n", 857 | " \n", 858 | " \n", 859 | " \n", 860 | " \n", 861 | " \n", 862 | " \n", 863 | " \n", 864 | "
stackingsingalmean
02.9128072.8899872.901397
13.0072862.9648742.986080
22.6858822.6623242.674103
32.8069262.7597502.783338
42.6520382.6198312.635935
52.7870102.7221172.754563
62.6525622.6277022.640132
72.5881352.4632752.525705
83.0318802.9802393.006060
92.6667832.6454482.656115
102.7970452.7464272.771736
112.8213082.8402872.830798
122.5277952.5212772.524536
132.7387122.6956732.717193
142.7584362.7068082.732622
152.9448172.9140382.929428
162.6731432.6212472.647195
172.9257702.8776092.901689
182.8701202.8163472.843233
193.2682683.2335533.250911
202.8492462.8056392.827443
213.0808873.0826023.081745
222.6187282.5945572.606642
232.7721022.7538842.762993
242.7712482.7594032.765326
252.9913802.9629342.977157
263.0360912.9666603.001376
272.9533412.9140982.933720
282.8484192.7976952.823057
292.7253982.6800452.702722
............
3822.8453962.8035132.824454
3832.9608352.9519462.956391
3842.6242462.5854892.604867
3852.8802192.8491172.864668
3862.9898342.9926392.991237
3872.6968042.6584722.677638
3882.8691572.8065072.837832
3893.0331553.0099513.021553
3902.6372702.6122042.624737
3912.9202112.8825562.901384
3922.6940412.6695342.681787
3932.7185362.6554362.686986
3942.9010372.8934042.897221
3952.5444802.5194382.531959
3962.7903382.7729692.781654
3972.9406822.9189112.929796
3982.6095862.5861722.597879
3992.8330032.8164212.824712
4003.0065452.9805152.993530
4012.6469172.6124572.629687
4022.8565152.8219892.839252
4032.7225442.6714852.697014
4042.6501292.5982582.624193
4052.7894762.7577722.773624
4062.8351942.7795742.807384
4072.5925672.6016662.597117
4082.6846092.6622752.673442
4092.7901142.7870762.788595
4102.6993892.6927932.696091
4112.6808892.6722642.676577
\n", 865 | "

412 rows × 3 columns

\n", 866 | "
" 867 | ], 868 | "text/plain": [ 869 | " stacking singal mean\n", 870 | "0 2.912807 2.889987 2.901397\n", 871 | "1 3.007286 2.964874 2.986080\n", 872 | "2 2.685882 2.662324 2.674103\n", 873 | "3 2.806926 2.759750 2.783338\n", 874 | "4 2.652038 2.619831 2.635935\n", 875 | "5 2.787010 2.722117 2.754563\n", 876 | "6 2.652562 2.627702 2.640132\n", 877 | "7 2.588135 2.463275 2.525705\n", 878 | "8 3.031880 2.980239 3.006060\n", 879 | "9 2.666783 2.645448 2.656115\n", 880 | "10 2.797045 2.746427 2.771736\n", 881 | "11 2.821308 2.840287 2.830798\n", 882 | "12 2.527795 2.521277 2.524536\n", 883 | "13 2.738712 2.695673 2.717193\n", 884 | "14 2.758436 2.706808 2.732622\n", 885 | "15 2.944817 2.914038 2.929428\n", 886 | "16 2.673143 2.621247 2.647195\n", 887 | "17 2.925770 2.877609 2.901689\n", 888 | "18 2.870120 2.816347 2.843233\n", 889 | "19 3.268268 3.233553 3.250911\n", 890 | "20 2.849246 2.805639 2.827443\n", 891 | "21 3.080887 3.082602 3.081745\n", 892 | "22 2.618728 2.594557 2.606642\n", 893 | "23 2.772102 2.753884 2.762993\n", 894 | "24 2.771248 2.759403 2.765326\n", 895 | "25 2.991380 2.962934 2.977157\n", 896 | "26 3.036091 2.966660 3.001376\n", 897 | "27 2.953341 2.914098 2.933720\n", 898 | "28 2.848419 2.797695 2.823057\n", 899 | "29 2.725398 2.680045 2.702722\n", 900 | ".. ... ... 
...\n", 901 | "382 2.845396 2.803513 2.824454\n", 902 | "383 2.960835 2.951946 2.956391\n", 903 | "384 2.624246 2.585489 2.604867\n", 904 | "385 2.880219 2.849117 2.864668\n", 905 | "386 2.989834 2.992639 2.991237\n", 906 | "387 2.696804 2.658472 2.677638\n", 907 | "388 2.869157 2.806507 2.837832\n", 908 | "389 3.033155 3.009951 3.021553\n", 909 | "390 2.637270 2.612204 2.624737\n", 910 | "391 2.920211 2.882556 2.901384\n", 911 | "392 2.694041 2.669534 2.681787\n", 912 | "393 2.718536 2.655436 2.686986\n", 913 | "394 2.901037 2.893404 2.897221\n", 914 | "395 2.544480 2.519438 2.531959\n", 915 | "396 2.790338 2.772969 2.781654\n", 916 | "397 2.940682 2.918911 2.929796\n", 917 | "398 2.609586 2.586172 2.597879\n", 918 | "399 2.833003 2.816421 2.824712\n", 919 | "400 3.006545 2.980515 2.993530\n", 920 | "401 2.646917 2.612457 2.629687\n", 921 | "402 2.856515 2.821989 2.839252\n", 922 | "403 2.722544 2.671485 2.697014\n", 923 | "404 2.650129 2.598258 2.624193\n", 924 | "405 2.789476 2.757772 2.773624\n", 925 | "406 2.835194 2.779574 2.807384\n", 926 | "407 2.592567 2.601666 2.597117\n", 927 | "408 2.684609 2.662275 2.673442\n", 928 | "409 2.790114 2.787076 2.788595\n", 929 | "410 2.699389 2.692793 2.696091\n", 930 | "411 2.680889 2.672264 2.676577\n", 931 | "\n", 932 | "[412 rows x 3 columns]" 933 | ] 934 | }, 935 | "execution_count": 46, 936 | "metadata": {}, 937 | "output_type": "execute_result" 938 | } 939 | ], 940 | "source": [ 941 | "fin = pd.DataFrame(result, columns=['stacking'])\n", 942 | "fin['singal'] = modle.predict(xgbTest)\n", 943 | "fin['mean'] = fin.mean(axis=1)" 944 | ] 945 | }, 946 | { 947 | "cell_type": "code", 948 | "execution_count": null, 949 | "metadata": { 950 | "collapsed": true 951 | }, 952 | "outputs": [], 953 | "source": [ 954 | "submit['Y'] = fin['mean'].values " 955 | ] 956 | }, 957 | { 958 | "cell_type": "code", 959 | "execution_count": null, 960 | "metadata": { 961 | "collapsed": true 962 | }, 963 | "outputs": [], 964 | "source": [ 965 | 
"submit.to_csv('sub{}.csv'.format(datetime.now().strftime('%Y%m%d_%H%M%S')), index=False, header=None)" 966 | ] 967 | } 968 | ], 969 | "metadata": { 970 | "kernelspec": { 971 | "display_name": "Python 3", 972 | "language": "python", 973 | "name": "python3" 974 | }, 975 | "language_info": { 976 | "codemirror_mode": { 977 | "name": "ipython", 978 | "version": 3 979 | }, 980 | "file_extension": ".py", 981 | "mimetype": "text/x-python", 982 | "name": "python", 983 | "nbconvert_exporter": "python", 984 | "pygments_lexer": "ipython3", 985 | "version": "3.5.0" 986 | } 987 | }, 988 | "nbformat": 4, 989 | "nbformat_minor": 2 990 | } 991 | -------------------------------------------------------------------------------- /First/初赛处理.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 23, 6 | "metadata": { 7 | "collapsed": false, 8 | "deletable": true, 9 | "editable": true 10 | }, 11 | "outputs": [], 12 | "source": [ 13 | "import pandas as pd\n", 14 | "import numpy as np\n", 15 | "import gc\n", 16 | "import xgboost as xgb\n", 17 | "import lightgbm as lgb\n", 18 | "from sklearn.model_selection import train_test_split\n", 19 | "from sklearn.cross_validation import KFold, StratifiedKFold, cross_val_score\n", 20 | "from datetime import datetime\n", 21 | "from catboost import CatBoostRegressor\n", 22 | "from sklearn.preprocessing import StandardScaler,LabelEncoder, OneHotEncoder, minmax_scale, scale\n", 23 | "from sklearn import tree\n", 24 | "from sklearn.decomposition import PCA\n", 25 | "from sklearn import linear_model\n", 26 | "from sklearn import svm\n", 27 | "from sklearn import neighbors\n", 28 | "from sklearn import ensemble\n", 29 | "from sklearn.feature_selection import SelectFromModel, VarianceThreshold,RFE\n", 30 | "from minepy import MINE\n", 31 | "from mlxtend.regressor import StackingRegressor\n", 32 | "from collections import defaultdict" 33 | ] 34 | }, 35 
| { 36 | "cell_type": "code", 37 | "execution_count": 7, 38 | "metadata": { 39 | "collapsed": false, 40 | "deletable": true, 41 | "editable": true 42 | }, 43 | "outputs": [], 44 | "source": [ 45 | "data_set = pd.read_csv('../data/train.csv')\n", 46 | "test_set = pd.read_csv('../data/test.csv')\n", 47 | "Le = LabelEncoder()\n", 48 | "\n", 49 | "y = data_set[['Y']]\n", 50 | "data_set = data_set.drop(['Y'], axis=1)\n", 51 | "input_data = pd.concat([data_set, test_set], ignore_index=True)" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": 8, 57 | "metadata": { 58 | "collapsed": false, 59 | "deletable": true, 60 | "editable": true 61 | }, 62 | "outputs": [], 63 | "source": [ 64 | "col = pd.DataFrame(input_data.ix[:, 2:].columns, columns=['col'])\n", 65 | "TOOL_ID_col_dict = defaultdict(lambda : 0)\n", 66 | "for index,row in col.iterrows():\n", 67 | " info=row.col.split('X')[0]\n", 68 | " TOOL_ID_col_dict[info] += 1\n", 69 | " \n", 70 | "TOOL_ID_col_dict" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 54, 76 | "metadata": { 77 | "collapsed": true, 78 | "deletable": true, 79 | "editable": true 80 | }, 81 | "outputs": [], 82 | "source": [ 83 | "# Drop columns with a single unique value\n", 84 | "def drop_col(data):\n", 85 | " data = data.fillna(data.mean()) \n", 86 | " for line in data.columns:\n", 87 | " if len(data[line].unique()) == 1:\n", 88 | " data = data.drop([line], axis=1) \n", 89 | " return data" 90 | ] 91 | }, 92 | { 93 | "cell_type": "code", 94 | "execution_count": 10, 95 | "metadata": { 96 | "collapsed": false, 97 | "deletable": true, 98 | "editable": true 99 | }, 100 | "outputs": [], 101 | "source": [ 102 | "# Handle outliers via quartiles\n", 103 | "def quantile_dropout(data): \n", 104 | " for c in data.columns:\n", 105 | " Q1 = data[c].quantile(q=0.25, interpolation='linear')\n", 106 | " Q3 = data[c].quantile(q=0.75, interpolation='linear') \n", 107 | " \n", 108 | " min_v = Q1 - 3 * (Q3 - Q1)\n", 109 | " max_v = Q3 + 3 * (Q3 - Q1)\n", 110 | " \n", 111 | " 
data[c][(data[c] >= max_v) | (data[c] <= min_v)] = data[c].mean()\n", 112 | " \n", 113 | " return data" 114 | ] 115 | }, 116 | { 117 | "cell_type": "code", 118 | "execution_count": 20, 119 | "metadata": { 120 | "collapsed": true, 121 | "deletable": true, 122 | "editable": true 123 | }, 124 | "outputs": [], 125 | "source": [ 126 | "# Standardization\n", 127 | "def scala(data):\n", 128 | " # for col in data.columns: \n", 129 | " # data[col] = (data[col] - data[col].mean()) / data[col].std(ddof=0) \n", 130 | " # data = data.fillna(0) \n", 131 | " data = scale(data)\n", 132 | " return data" 133 | ] 134 | }, 135 | { 136 | "cell_type": "code", 137 | "execution_count": 27, 138 | "metadata": { 139 | "collapsed": true, 140 | "deletable": true, 141 | "editable": true 142 | }, 143 | "outputs": [], 144 | "source": [ 145 | "#PCA\n", 146 | "def Pca(data):\n", 147 | " pca = PCA(n_components=0.9)\n", 148 | " pca.fit(data)\n", 149 | " X_new = pca.transform(data) \n", 150 | " return X_new" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": 13, 156 | "metadata": { 157 | "collapsed": true, 158 | "deletable": true, 159 | "editable": true 160 | }, 161 | "outputs": [], 162 | "source": [ 163 | "#Tsne\n", 164 | "def Tsne(data):\n", 165 | " tsne = TSNE(n_components=2)\n", 166 | " X_new = tsne.fit_transform(data)\n", 167 | " return X_new" 168 | ] 169 | }, 170 | { 171 | "cell_type": "code", 172 | "execution_count": 14, 173 | "metadata": { 174 | "collapsed": true, 175 | "deletable": true, 176 | "editable": true 177 | }, 178 | "outputs": [], 179 | "source": [ 180 | "# Clustering\n", 181 | "def clu(data):\n", 182 | " km = KMeans(n_clusters=3).fit(data)\n", 183 | " clu = pd.DataFrame(km.predict(data), columns=['clu'])\n", 184 | " clu.groupby(['clu']).clu.count()" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": 15, 190 | "metadata": { 191 | "collapsed": true, 192 | "deletable": true, 193 | "editable": true 194 | }, 195 | "outputs": [], 196 | "source": [ 197 | "def 
date_cols(data):\n", 198 | " for col in data:\n", 199 | " if data[col].min() > 1e13:\n", 200 | " data = data.drop([col], axis=1) \n", 201 | " return data" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 16, 207 | "metadata": { 208 | "collapsed": true, 209 | "deletable": true, 210 | "editable": true 211 | }, 212 | "outputs": [], 213 | "source": [ 214 | "# Use Isolation Forest to pick features with too many outliers; currently unused\n", 215 | "def get_outFeature(data):\n", 216 | " clf = IsolationForest(max_samples=20)\n", 217 | " outrate = []\n", 218 | " for col in data.columns: \n", 219 | " clf.fit(data[[col]])\n", 220 | " y_pred_train = clf.predict(data[[col]])\n", 221 | " \n", 222 | " values = data[[col]].values\n", 223 | " out = pd.DataFrame(values, columns=['columns'])\n", 224 | " out['y'] = y_pred_train\n", 225 | " \n", 226 | " outLine = len(out[out['y'] == -1])\n", 227 | " outRate = outLine / 600\n", 228 | " if outRate > 0.2:\n", 229 | " outrate.append(col)\n", 230 | " \n", 231 | " feature = [x for x in data.columns if x in outrate]\n", 232 | " train = pd.DataFrame(data[feature]).ix[:499, :]\n", 233 | " test = pd.DataFrame(data[feature]).ix[500:, :]\n", 234 | " return train, test" 235 | ] 236 | }, 237 | { 238 | "cell_type": "markdown", 239 | "metadata": { 240 | "deletable": true, 241 | "editable": true 242 | }, 243 | "source": [ 244 | "Processing X210_" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": 37, 250 | "metadata": { 251 | "collapsed": false, 252 | "deletable": true, 253 | "editable": true 254 | }, 255 | "outputs": [ 256 | { 257 | "name": "stderr", 258 | "output_type": "stream", 259 | "text": [ 260 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n", 261 | "A value is trying to be set on a copy of a slice from a DataFrame\n", 262 | "\n", 263 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n", 264 | 
"c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\sklearn\\preprocessing\\data.py:160: UserWarning: Numerical issues were encountered when centering the data and might not be solved. Dataset may contain too large values. You may need to prescale your features.\n", 265 | " warnings.warn(\"Numerical issues were encountered \"\n" 266 | ] 267 | } 268 | ], 269 | "source": [ 270 | "X210_COL = input_data.loc[:, 'TOOL_ID':'210X231']\n", 271 | "Le.fit(X210_COL['TOOL_ID'].unique())\n", 272 | "X210_COL['TOOL_ID'] = Le.transform(X210_COL['TOOL_ID'])\n", 273 | "X210_COL = date_cols(X210_COL)\n", 274 | "X210_COL = drop_col(X210_COL)\n", 275 | "X210_COL = quantile_dropout(X210_COL)\n", 276 | "X210_COL = drop_col(X210_COL)\n", 277 | "X210_COL = X210_COL.transpose().drop_duplicates().transpose()\n", 278 | "X210_COL = scala(X210_COL)" 279 | ] 280 | }, 281 | { 282 | "cell_type": "code", 283 | "execution_count": 38, 284 | "metadata": { 285 | "collapsed": false, 286 | "deletable": true, 287 | "editable": true 288 | }, 289 | "outputs": [], 290 | "source": [ 291 | "pca_X210 = Pca(X210_COL)\n", 292 | "pca_X210 = pd.DataFrame(pca_X210)\n", 293 | "pca_X210.reset_index(inplace=True)\n", 294 | "pca_X210.rename({'index' : 'ID'}, inplace=True)\n", 295 | "pca_X210.drop(['index'], axis=1, inplace=True)" 296 | ] 297 | }, 298 | { 299 | "cell_type": "markdown", 300 | "metadata": { 301 | "deletable": true, 302 | "editable": true 303 | }, 304 | "source": [ 305 | "处理X220_" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": 39, 311 | "metadata": { 312 | "collapsed": false, 313 | "deletable": true, 314 | "editable": true 315 | }, 316 | "outputs": [ 317 | { 318 | "name": "stderr", 319 | "output_type": "stream", 320 | "text": [ 321 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n", 322 | "A value is trying to be set on a copy of a slice from a 
DataFrame\n", 323 | "\n", 324 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n" 325 | ] 326 | } 327 | ], 328 | "source": [ 329 | "X220_COL = input_data.loc[:, 'Tool':'220X571']\n", 330 | "Le.fit(X220_COL['Tool'].unique())\n", 331 | "X220_COL['Tool'] = Le.transform(X220_COL['Tool'])\n", 332 | "X220_COL = drop_col(X220_COL)\n", 333 | "X220_COL = date_cols(X220_COL)\n", 334 | "X220_COL = quantile_dropout(X220_COL)\n", 335 | "X220_COL = drop_col(X220_COL)\n", 336 | "X220_COL = X220_COL.transpose().drop_duplicates().transpose()\n", 337 | "X220_COL = scala(X220_COL)" 338 | ] 339 | }, 340 | { 341 | "cell_type": "code", 342 | "execution_count": 40, 343 | "metadata": { 344 | "collapsed": false, 345 | "deletable": true, 346 | "editable": true 347 | }, 348 | "outputs": [], 349 | "source": [ 350 | "pca_X220 = Pca(X220_COL)\n", 351 | "pca_X220 = pd.DataFrame(pca_X220)\n", 352 | "pca_X220.reset_index(inplace=True)\n", 353 | "pca_X220.rename({'index' : 'ID'}, inplace=True)\n", 354 | "pca_X220.drop(['index'], axis=1, inplace=True)" 355 | ] 356 | }, 357 | { 358 | "cell_type": "markdown", 359 | "metadata": { 360 | "deletable": true, 361 | "editable": true 362 | }, 363 | "source": [ 364 | "处理X261_" 365 | ] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": 41, 370 | "metadata": { 371 | "collapsed": false, 372 | "deletable": true, 373 | "editable": true 374 | }, 375 | "outputs": [], 376 | "source": [ 377 | "X261_COL = input_data.loc[:, '261X226':'261X763']\n", 378 | "X261_COL = drop_col(X261_COL)\n", 379 | "X261_COL = date_cols(X261_COL)\n", 380 | "X261_COL = quantile_dropout(X261_COL)\n", 381 | "X261_COL = drop_col(X261_COL)\n", 382 | "X261_COL = X261_COL.transpose().drop_duplicates().transpose()\n", 383 | "X261_COL = scala(X261_COL)" 384 | ] 385 | }, 386 | { 387 | "cell_type": "code", 388 | "execution_count": 42, 389 | "metadata": { 390 | "collapsed": false, 391 | "deletable": true, 392 | 
"editable": true 393 | }, 394 | "outputs": [], 395 | "source": [ 396 | "pca_X261 = Pca(X261_COL)\n", 397 | "pca_X261 = pd.DataFrame(pca_X261)\n", 398 | "pca_X261.reset_index(inplace=True)\n", 399 | "pca_X261.rename({'index' : 'ID'}, inplace=True)\n", 400 | "pca_X261.drop(['index'], axis=1, inplace=True)" 401 | ] 402 | }, 403 | { 404 | "cell_type": "markdown", 405 | "metadata": { 406 | "deletable": true, 407 | "editable": true 408 | }, 409 | "source": [ 410 | "处理X300_ " 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": 44, 416 | "metadata": { 417 | "collapsed": false, 418 | "deletable": true, 419 | "editable": true 420 | }, 421 | "outputs": [ 422 | { 423 | "name": "stderr", 424 | "output_type": "stream", 425 | "text": [ 426 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\sklearn\\preprocessing\\data.py:160: UserWarning: Numerical issues were encountered when centering the data and might not be solved. Dataset may contain too large values. 
You may need to prescale your features.\n", 427 | " warnings.warn(\"Numerical issues were encountered \"\n" 428 | ] 429 | } 430 | ], 431 | "source": [ 432 | "X300_COL = input_data.loc[:, 'TOOL_ID (#1)':'300X21']\n", 433 | "Le.fit(X300_COL['TOOL_ID (#1)'].unique())\n", 434 | "X300_COL['TOOL_ID (#1)'] = Le.transform(X300_COL['TOOL_ID (#1)'])\n", 435 | "X300_COL = drop_col(X300_COL)\n", 436 | "X300_COL = quantile_dropout(X300_COL.ix[:, 1:])\n", 437 | "X300_COL = drop_col(X300_COL)\n", 438 | "X300_COL = X300_COL.transpose().drop_duplicates().transpose()\n", 439 | "X300_COL = scala(X300_COL)" 440 | ] 441 | }, 442 | { 443 | "cell_type": "code", 444 | "execution_count": 45, 445 | "metadata": { 446 | "collapsed": true, 447 | "deletable": true, 448 | "editable": true 449 | }, 450 | "outputs": [], 451 | "source": [ 452 | "pca_X300 = Pca(X300_COL)\n", 453 | "pca_X300 = pd.DataFrame(pca_X300)\n", 454 | "pca_X300.reset_index(inplace=True)\n", 455 | "pca_X300.rename({'index' : 'ID'}, inplace=True)\n", 456 | "pca_X300.drop(['index'], axis=1, inplace=True)" 457 | ] 458 | }, 459 | { 460 | "cell_type": "markdown", 461 | "metadata": { 462 | "deletable": true, 463 | "editable": true 464 | }, 465 | "source": [ 466 | "处理X310_ " 467 | ] 468 | }, 469 | { 470 | "cell_type": "code", 471 | "execution_count": 46, 472 | "metadata": { 473 | "collapsed": false, 474 | "deletable": true, 475 | "editable": true 476 | }, 477 | "outputs": [ 478 | { 479 | "name": "stderr", 480 | "output_type": "stream", 481 | "text": [ 482 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n", 483 | "A value is trying to be set on a copy of a slice from a DataFrame\n", 484 | "\n", 485 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n" 486 | ] 487 | } 488 | ], 489 | "source": [ 490 | "X310_COL = input_data.loc[:, 'TOOL_ID (#2)':'310X207']\n", 491 | 
"Le.fit(X310_COL['TOOL_ID (#2)'].unique())\n", 492 | "X310_COL['TOOL_ID (#2)'] = Le.transform(X310_COL['TOOL_ID (#2)'])\n", 493 | "X310_COL = drop_col(X310_COL)\n", 494 | "X310_COL = date_cols(X310_COL)\n", 495 | "X310_COL = quantile_dropout(X310_COL)\n", 496 | "X310_COL = drop_col(X310_COL)\n", 497 | "X310_COL = X310_COL.transpose().drop_duplicates().transpose()\n", 498 | "X310_COL = scala(X310_COL)" 499 | ] 500 | }, 501 | { 502 | "cell_type": "code", 503 | "execution_count": 47, 504 | "metadata": { 505 | "collapsed": false, 506 | "deletable": true, 507 | "editable": true 508 | }, 509 | "outputs": [], 510 | "source": [ 511 | "pca_X310 = Pca(X310_COL)\n", 512 | "pca_X310 = pd.DataFrame(pca_X310)\n", 513 | "pca_X310.reset_index(inplace=True)\n", 514 | "pca_X310.rename({'index' : 'ID'}, inplace=True)\n", 515 | "pca_X310.drop(['index'], axis=1, inplace=True)" 516 | ] 517 | }, 518 | { 519 | "cell_type": "markdown", 520 | "metadata": { 521 | "deletable": true, 522 | "editable": true 523 | }, 524 | "source": [ 525 | "处理X311_" 526 | ] 527 | }, 528 | { 529 | "cell_type": "code", 530 | "execution_count": 56, 531 | "metadata": { 532 | "collapsed": false, 533 | "deletable": true, 534 | "editable": true 535 | }, 536 | "outputs": [ 537 | { 538 | "name": "stderr", 539 | "output_type": "stream", 540 | "text": [ 541 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n", 542 | "A value is trying to be set on a copy of a slice from a DataFrame\n", 543 | "\n", 544 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n" 545 | ] 546 | } 547 | ], 548 | "source": [ 549 | "X311_COL = input_data.loc[:, 'TOOL_ID (#3)':'311X225']\n", 550 | "Le.fit(X311_COL['TOOL_ID (#3)'].unique())\n", 551 | "X311_COL['TOOL_ID (#3)'] = Le.transform(X311_COL['TOOL_ID (#3)'])\n", 552 | "X311_COL = drop_col(X311_COL)\n", 553 | "X311_COL = 
date_cols(X311_COL)\n", 554 | "X311_COL = quantile_dropout(X311_COL)\n", 555 | "X311_COL = drop_col(X311_COL)\n", 556 | "X311_COL = X311_COL.transpose().drop_duplicates().transpose()\n", 557 | "X311_COL = scala(X311_COL)" 558 | ] 559 | }, 560 | { 561 | "cell_type": "code", 562 | "execution_count": 57, 563 | "metadata": { 564 | "collapsed": false, 565 | "deletable": true, 566 | "editable": true 567 | }, 568 | "outputs": [], 569 | "source": [ 570 | "pca_X311 = Pca(X311_COL)\n", 571 | "pca_X311 = pd.DataFrame(pca_X311)\n", 572 | "pca_X311.reset_index(inplace=True)\n", 573 | "pca_X311.rename({'index' : 'ID'}, inplace=True)\n", 574 | "pca_X311.drop(['index'], axis=1, inplace=True)" 575 | ] 576 | }, 577 | { 578 | "cell_type": "markdown", 579 | "metadata": { 580 | "deletable": true, 581 | "editable": true 582 | }, 583 | "source": [ 584 | "处理X312_" 585 | ] 586 | }, 587 | { 588 | "cell_type": "code", 589 | "execution_count": 58, 590 | "metadata": { 591 | "collapsed": false, 592 | "deletable": true, 593 | "editable": true 594 | }, 595 | "outputs": [ 596 | { 597 | "name": "stderr", 598 | "output_type": "stream", 599 | "text": [ 600 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n", 601 | "A value is trying to be set on a copy of a slice from a DataFrame\n", 602 | "\n", 603 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n", 604 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\sklearn\\preprocessing\\data.py:177: UserWarning: Numerical issues were encountered when scaling the data and might not be solved. The standard deviation of the data is probably very close to 0. 
\n", 605 | " warnings.warn(\"Numerical issues were encountered \"\n" 606 | ] 607 | } 608 | ], 609 | "source": [ 610 | "X312_COL = input_data.loc[:, 'Tool (#1)':'312X798']\n", 611 | "Le.fit(X312_COL['Tool (#1)'].unique())\n", 612 | "X312_COL['Tool (#1)'] = Le.transform(X312_COL['Tool (#1)'])\n", 613 | "X312_COL = drop_col(X312_COL)\n", 614 | "X312_COL = date_cols(X312_COL)\n", 615 | "X312_COL = quantile_dropout(X312_COL)\n", 616 | "X312_COL = drop_col(X312_COL)\n", 617 | "X312_COL = X312_COL.transpose().drop_duplicates().transpose()\n", 618 | "X312_COL = scala(X312_COL)" 619 | ] 620 | }, 621 | { 622 | "cell_type": "code", 623 | "execution_count": 59, 624 | "metadata": { 625 | "collapsed": false, 626 | "deletable": true, 627 | "editable": true 628 | }, 629 | "outputs": [], 630 | "source": [ 631 | "pca_X312 = Pca(X312_COL)\n", 632 | "pca_X312 = pd.DataFrame(pca_X312)\n", 633 | "pca_X312.reset_index(inplace=True)\n", 634 | "pca_X312.rename({'index' : 'ID'}, inplace=True)\n", 635 | "pca_X312.drop(['index'], axis=1, inplace=True)" 636 | ] 637 | }, 638 | { 639 | "cell_type": "markdown", 640 | "metadata": { 641 | "deletable": true, 642 | "editable": true 643 | }, 644 | "source": [ 645 | "处理X330_" 646 | ] 647 | }, 648 | { 649 | "cell_type": "code", 650 | "execution_count": 60, 651 | "metadata": { 652 | "collapsed": false, 653 | "deletable": true, 654 | "editable": true 655 | }, 656 | "outputs": [ 657 | { 658 | "name": "stderr", 659 | "output_type": "stream", 660 | "text": [ 661 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n", 662 | "A value is trying to be set on a copy of a slice from a DataFrame\n", 663 | "\n", 664 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n" 665 | ] 666 | } 667 | ], 668 | "source": [ 669 | "X330_COL = input_data.loc[:, 'Tool (#2)':'330X1311']\n", 670 | "Le.fit(X330_COL['Tool 
(#2)'].unique())\n", 671 | "X330_COL['Tool (#2)'] = Le.transform(X330_COL['Tool (#2)'])\n", 672 | "X330_COL = drop_col(X330_COL)\n", 673 | "X330_COL = date_cols(X330_COL)\n", 674 | "X330_COL = quantile_dropout(X330_COL)\n", 675 | "X330_COL = drop_col(X330_COL)\n", 676 | "X330_COL = X330_COL.transpose().drop_duplicates().transpose()\n", 677 | "X330_COL = scala(X330_COL)" 678 | ] 679 | }, 680 | { 681 | "cell_type": "code", 682 | "execution_count": 61, 683 | "metadata": { 684 | "collapsed": false, 685 | "deletable": true, 686 | "editable": true 687 | }, 688 | "outputs": [], 689 | "source": [ 690 | "pca_X330 = Pca(X330_COL)\n", 691 | "pca_X330 = pd.DataFrame(pca_X330)\n", 692 | "pca_X330.reset_index(inplace=True)\n", 693 | "pca_X330.rename({'index' : 'ID'}, inplace=True)\n", 694 | "pca_X330.drop(['index'], axis=1, inplace=True)" 695 | ] 696 | }, 697 | { 698 | "cell_type": "markdown", 699 | "metadata": { 700 | "deletable": true, 701 | "editable": true 702 | }, 703 | "source": [ 704 | "处理X340_" 705 | ] 706 | }, 707 | { 708 | "cell_type": "code", 709 | "execution_count": 62, 710 | "metadata": { 711 | "collapsed": false, 712 | "deletable": true, 713 | "editable": true 714 | }, 715 | "outputs": [ 716 | { 717 | "name": "stderr", 718 | "output_type": "stream", 719 | "text": [ 720 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n", 721 | "A value is trying to be set on a copy of a slice from a DataFrame\n", 722 | "\n", 723 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n" 724 | ] 725 | } 726 | ], 727 | "source": [ 728 | "X340_COL = input_data.loc[:, 'tool':'340X199']\n", 729 | "Le.fit(X340_COL['tool'].unique())\n", 730 | "X340_COL['tool'] = Le.transform(X340_COL['tool'])\n", 731 | "X340_COL = drop_col(X340_COL)\n", 732 | "X340_COL = date_cols(X340_COL)\n", 733 | "X340_COL = quantile_dropout(X340_COL)\n", 734 | 
"X340_COL = drop_col(X340_COL)\n", 735 | "X340_COL = X340_COL.transpose().drop_duplicates().transpose()\n", 736 | "X340_COL = scala(X340_COL)" 737 | ] 738 | }, 739 | { 740 | "cell_type": "code", 741 | "execution_count": 63, 742 | "metadata": { 743 | "collapsed": false, 744 | "deletable": true, 745 | "editable": true 746 | }, 747 | "outputs": [], 748 | "source": [ 749 | "pca_X340 = Pca(X340_COL)\n", 750 | "pca_X340 = pd.DataFrame(pca_X340)\n", 751 | "pca_X340.reset_index(inplace=True)\n", 752 | "pca_X340.rename({'index' : 'ID'}, inplace=True)\n", 753 | "pca_X340.drop(['index'], axis=1, inplace=True)" 754 | ] 755 | }, 756 | { 757 | "cell_type": "markdown", 758 | "metadata": { 759 | "deletable": true, 760 | "editable": true 761 | }, 762 | "source": [ 763 | "处理X344_" 764 | ] 765 | }, 766 | { 767 | "cell_type": "code", 768 | "execution_count": 64, 769 | "metadata": { 770 | "collapsed": false, 771 | "deletable": true, 772 | "editable": true 773 | }, 774 | "outputs": [ 775 | { 776 | "name": "stderr", 777 | "output_type": "stream", 778 | "text": [ 779 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n", 780 | "A value is trying to be set on a copy of a slice from a DataFrame\n", 781 | "\n", 782 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n" 783 | ] 784 | } 785 | ], 786 | "source": [ 787 | "X344_COL = input_data.loc[:, 'tool (#1)':'344X398']\n", 788 | "Le.fit(X344_COL['tool (#1)'].unique())\n", 789 | "X344_COL['tool (#1)'] = Le.transform(X344_COL['tool (#1)'])\n", 790 | "X344_COL = drop_col(X344_COL)\n", 791 | "X344_COL = date_cols(X344_COL)\n", 792 | "X344_COL = quantile_dropout(X344_COL)\n", 793 | "X344_COL = drop_col(X344_COL)\n", 794 | "X344_COL = X344_COL.transpose().drop_duplicates().transpose()\n", 795 | "X344_COL = scala(X344_COL)" 796 | ] 797 | }, 798 | { 799 | "cell_type": "code", 800 | 
"execution_count": 65, 801 | "metadata": { 802 | "collapsed": false, 803 | "deletable": true, 804 | "editable": true 805 | }, 806 | "outputs": [], 807 | "source": [ 808 | "pca_X344 = Pca(X344_COL)\n", 809 | "pca_X344 = pd.DataFrame(pca_X344)\n", 810 | "pca_X344.reset_index(inplace=True)\n", 811 | "pca_X344.rename({'index' : 'ID'}, inplace=True)\n", 812 | "pca_X344.drop(['index'], axis=1, inplace=True)" 813 | ] 814 | }, 815 | { 816 | "cell_type": "markdown", 817 | "metadata": { 818 | "deletable": true, 819 | "editable": true 820 | }, 821 | "source": [ 822 | "处理X360_ " 823 | ] 824 | }, 825 | { 826 | "cell_type": "code", 827 | "execution_count": 66, 828 | "metadata": { 829 | "collapsed": false, 830 | "deletable": true, 831 | "editable": true 832 | }, 833 | "outputs": [ 834 | { 835 | "name": "stderr", 836 | "output_type": "stream", 837 | "text": [ 838 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n", 839 | "A value is trying to be set on a copy of a slice from a DataFrame\n", 840 | "\n", 841 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n" 842 | ] 843 | } 844 | ], 845 | "source": [ 846 | "X360_COL = input_data.loc[:, 'TOOL':'360X1452']\n", 847 | "Le.fit(X360_COL['TOOL'].unique())\n", 848 | "X360_COL['TOOL'] = Le.transform(X360_COL['TOOL'])\n", 849 | "X360_COL = drop_col(X360_COL)\n", 850 | "X360_COL = date_cols(X360_COL)\n", 851 | "X360_COL = quantile_dropout(X360_COL)\n", 852 | "X360_COL = drop_col(X360_COL)\n", 853 | "X360_COL = X360_COL.transpose().drop_duplicates().transpose()\n", 854 | "X360_COL = scala(X360_COL)" 855 | ] 856 | }, 857 | { 858 | "cell_type": "code", 859 | "execution_count": 67, 860 | "metadata": { 861 | "collapsed": false, 862 | "deletable": true, 863 | "editable": true 864 | }, 865 | "outputs": [], 866 | "source": [ 867 | "pca_X360 = Pca(X360_COL)\n", 868 | "pca_X360 = 
pd.DataFrame(pca_X360)\n", 869 | "pca_X360.reset_index(inplace=True)\n", 870 | "pca_X360.rename({'index' : 'ID'}, inplace=True)\n", 871 | "pca_X360.drop(['index'], axis=1, inplace=True)" 872 | ] 873 | }, 874 | { 875 | "cell_type": "markdown", 876 | "metadata": { 877 | "deletable": true, 878 | "editable": true 879 | }, 880 | "source": [ 881 | "处理X400" 882 | ] 883 | }, 884 | { 885 | "cell_type": "code", 886 | "execution_count": 68, 887 | "metadata": { 888 | "collapsed": false, 889 | "deletable": true, 890 | "editable": true 891 | }, 892 | "outputs": [], 893 | "source": [ 894 | "X400_COL = input_data.loc[:, '400X1':'400X230']\n", 895 | "X400_COL = drop_col(X400_COL)\n", 896 | "X400_COL = date_cols(X400_COL)\n", 897 | "X400_COL = quantile_dropout(X400_COL)\n", 898 | "X400_COL = drop_col(X400_COL)\n", 899 | "X400_COL = X400_COL.transpose().drop_duplicates().transpose()\n", 900 | "X400_COL = scala(X400_COL)" 901 | ] 902 | }, 903 | { 904 | "cell_type": "code", 905 | "execution_count": 69, 906 | "metadata": { 907 | "collapsed": false, 908 | "deletable": true, 909 | "editable": true 910 | }, 911 | "outputs": [], 912 | "source": [ 913 | "pca_X400 = Pca(X400_COL)\n", 914 | "pca_X400 = pd.DataFrame(pca_X400)\n", 915 | "pca_X400.reset_index(inplace=True)\n", 916 | "pca_X400.rename({'index' : 'ID'}, inplace=True)\n", 917 | "pca_X400.drop(['index'], axis=1, inplace=True)" 918 | ] 919 | }, 920 | { 921 | "cell_type": "markdown", 922 | "metadata": { 923 | "deletable": true, 924 | "editable": true 925 | }, 926 | "source": [ 927 | "处理X420" 928 | ] 929 | }, 930 | { 931 | "cell_type": "code", 932 | "execution_count": 70, 933 | "metadata": { 934 | "collapsed": false, 935 | "deletable": true, 936 | "editable": true 937 | }, 938 | "outputs": [], 939 | "source": [ 940 | "X420_COL = input_data.loc[:, '420X1':'420X230']\n", 941 | "X420_COL = drop_col(X420_COL)\n", 942 | "X420_COL = date_cols(X420_COL)\n", 943 | "X420_COL = quantile_dropout(X420_COL)\n", 944 | "\n", 945 | "X420_COL = 
drop_col(X420_COL)\n", 946 | "X420_COL = X420_COL.transpose().drop_duplicates().transpose()\n", 947 | "X420_COL = scala(X420_COL)" 948 | ] 949 | }, 950 | { 951 | "cell_type": "code", 952 | "execution_count": 71, 953 | "metadata": { 954 | "collapsed": false, 955 | "deletable": true, 956 | "editable": true 957 | }, 958 | "outputs": [], 959 | "source": [ 960 | "pca_X420 = Pca(X420_COL)\n", 961 | "pca_X420 = pd.DataFrame(pca_X420)\n", 962 | "pca_X420.reset_index(inplace=True)\n", 963 | "pca_X420.rename({'index' : 'ID'}, inplace=True)\n", 964 | "pca_X420.drop(['index'], axis=1, inplace=True)" 965 | ] 966 | }, 967 | { 968 | "cell_type": "markdown", 969 | "metadata": { 970 | "deletable": true, 971 | "editable": true 972 | }, 973 | "source": [ 974 | "处理X440A" 975 | ] 976 | }, 977 | { 978 | "cell_type": "code", 979 | "execution_count": 72, 980 | "metadata": { 981 | "collapsed": false, 982 | "deletable": true, 983 | "editable": true 984 | }, 985 | "outputs": [ 986 | { 987 | "name": "stderr", 988 | "output_type": "stream", 989 | "text": [ 990 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n", 991 | "A value is trying to be set on a copy of a slice from a DataFrame\n", 992 | "\n", 993 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n" 994 | ] 995 | } 996 | ], 997 | "source": [ 998 | "X440A_COL = input_data.loc[:, 'TOOL (#1)':'440AX213']\n", 999 | "Le.fit(X440A_COL['TOOL (#1)'].unique())\n", 1000 | "X440A_COL['TOOL (#1)'] = Le.transform(X440A_COL['TOOL (#1)'])\n", 1001 | "X440A_COL = drop_col(X440A_COL)\n", 1002 | "X440A_COL = date_cols(X440A_COL)\n", 1003 | "X440A_COL = quantile_dropout(X440A_COL)\n", 1004 | "X440A_COL = drop_col(X440A_COL)\n", 1005 | "X440A_COL = X440A_COL.transpose().drop_duplicates().transpose()\n", 1006 | "X440A_COL = scala(X440A_COL)" 1007 | ] 1008 | }, 1009 | { 1010 | "cell_type": "code", 
1011 | "execution_count": 73, 1012 | "metadata": { 1013 | "collapsed": false, 1014 | "deletable": true, 1015 | "editable": true 1016 | }, 1017 | "outputs": [], 1018 | "source": [ 1019 | "pca_X440A = Pca(X440A_COL)\n", 1020 | "pca_X440A = pd.DataFrame(pca_X440A)\n", 1021 | "pca_X440A.reset_index(inplace=True)\n", 1022 | "pca_X440A.rename({'index' : 'ID'}, inplace=True)\n", 1023 | "pca_X440A.drop(['index'], axis=1, inplace=True)" 1024 | ] 1025 | }, 1026 | { 1027 | "cell_type": "markdown", 1028 | "metadata": { 1029 | "deletable": true, 1030 | "editable": true 1031 | }, 1032 | "source": [ 1033 | "处理X520" 1034 | ] 1035 | }, 1036 | { 1037 | "cell_type": "code", 1038 | "execution_count": 74, 1039 | "metadata": { 1040 | "collapsed": false, 1041 | "deletable": true, 1042 | "editable": true 1043 | }, 1044 | "outputs": [ 1045 | { 1046 | "name": "stderr", 1047 | "output_type": "stream", 1048 | "text": [ 1049 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n", 1050 | "A value is trying to be set on a copy of a slice from a DataFrame\n", 1051 | "\n", 1052 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n" 1053 | ] 1054 | } 1055 | ], 1056 | "source": [ 1057 | "X520_COL = input_data.loc[:, 'Tool (#3)':'520X434']\n", 1058 | "Le.fit(X520_COL['Tool (#3)'].unique())\n", 1059 | "X520_COL['Tool (#3)'] = Le.transform(X520_COL['Tool (#3)'])\n", 1060 | "X520_COL = drop_col(X520_COL)\n", 1061 | "X520_COL = date_cols(X520_COL)\n", 1062 | "X520_COL = quantile_dropout(X520_COL)\n", 1063 | "X520_COL = drop_col(X520_COL)\n", 1064 | "X520_COL = X520_COL.transpose().drop_duplicates().transpose()\n", 1065 | "X520_COL = scala(X520_COL)" 1066 | ] 1067 | }, 1068 | { 1069 | "cell_type": "code", 1070 | "execution_count": 75, 1071 | "metadata": { 1072 | "collapsed": false, 1073 | "deletable": true, 1074 | "editable": true 1075 | }, 1076 | 
"outputs": [], 1077 | "source": [ 1078 | "pca_X520 = Pca(X520_COL)\n", 1079 | "pca_X520 = pd.DataFrame(pca_X520)\n", 1080 | "pca_X520.reset_index(inplace=True)\n", 1081 | "pca_X520.rename({'index' : 'ID'}, inplace=True)\n", 1082 | "pca_X520.drop(['index'], axis=1, inplace=True)" 1083 | ] 1084 | }, 1085 | { 1086 | "cell_type": "markdown", 1087 | "metadata": { 1088 | "deletable": true, 1089 | "editable": true 1090 | }, 1091 | "source": [ 1092 | "处理X750" 1093 | ] 1094 | }, 1095 | { 1096 | "cell_type": "code", 1097 | "execution_count": 76, 1098 | "metadata": { 1099 | "collapsed": false, 1100 | "deletable": true, 1101 | "editable": true 1102 | }, 1103 | "outputs": [ 1104 | { 1105 | "name": "stderr", 1106 | "output_type": "stream", 1107 | "text": [ 1108 | "c:\\users\\scarlet\\appdata\\local\\programs\\python\\python35\\lib\\site-packages\\ipykernel\\__main__.py:10: SettingWithCopyWarning: \n", 1109 | "A value is trying to be set on a copy of a slice from a DataFrame\n", 1110 | "\n", 1111 | "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n" 1112 | ] 1113 | } 1114 | ], 1115 | "source": [ 1116 | "X750_COL = input_data.loc[:, 'TOOL (#2)':'750X1452']\n", 1117 | "Le.fit(X750_COL['TOOL (#2)'].unique())\n", 1118 | "X750_COL['TOOL (#2)'] = Le.transform(X750_COL['TOOL (#2)'])\n", 1119 | "X750_COL = drop_col(X750_COL)\n", 1120 | "X750_COL = date_cols(X750_COL)\n", 1121 | "X750_COL = quantile_dropout(X750_COL)\n", 1122 | "X750_COL = drop_col(X750_COL)\n", 1123 | "X750_COL = X750_COL.transpose().drop_duplicates().transpose()\n", 1124 | "X750_COL = scala(X750_COL)" 1125 | ] 1126 | }, 1127 | { 1128 | "cell_type": "code", 1129 | "execution_count": 77, 1130 | "metadata": { 1131 | "collapsed": false, 1132 | "deletable": true, 1133 | "editable": true 1134 | }, 1135 | "outputs": [], 1136 | "source": [ 1137 | "pca_X750 = Pca(X750_COL)\n", 1138 | "pca_X750 = pd.DataFrame(pca_X750)\n", 1139 | 
"pca_X750.reset_index(inplace=True)\n", 1140 | "pca_X750.rename({'index' : 'ID'}, inplace=True)\n", 1141 | "pca_X750.drop(['index'], axis=1, inplace=True)" 1142 | ] 1143 | }, 1144 | { 1145 | "cell_type": "code", 1146 | "execution_count": 79, 1147 | "metadata": { 1148 | "collapsed": false, 1149 | "deletable": true, 1150 | "editable": true 1151 | }, 1152 | "outputs": [], 1153 | "source": [ 1154 | "data_set = pd.concat([pca_X210, pca_X220, \n", 1155 | "                      pca_X261, pca_X300, \n", 1156 | "                      pca_X310, pca_X311,\n", 1157 | "                      pca_X312, pca_X330, \n", 1158 | "                      pca_X340, pca_X344, \n", 1159 | "                      pca_X360, pca_X400,\n", 1160 | "                      pca_X420, pca_X440A, \n", 1161 | "                      pca_X520, pca_X750], axis=1)" 1162 | ] 1163 | }, 1164 | { 1165 | "cell_type": "code", 1166 | "execution_count": 80, 1167 | "metadata": { 1168 | "collapsed": true, 1169 | "deletable": true, 1170 | "editable": true 1171 | }, 1172 | "outputs": [], 1173 | "source": [ 1174 | "train = data_set.ix[:499, :]\n", 1175 | "test = data_set.ix[500:, :]" 1176 | ] 1177 | }, 1178 | { 1179 | "cell_type": "code", 1180 | "execution_count": 79, 1181 | "metadata": { 1182 | "collapsed": false, 1183 | "deletable": true, 1184 | "editable": true 1185 | }, 1186 | "outputs": [], 1187 | "source": [ 1188 | "train.to_csv('train/train.csv', index=False)\n", 1189 | "test.to_csv('train/test.csv', index=False)\n", 1190 | "y.to_csv('train/y.csv', index=False)" 1191 | ] 1192 | } 1193 | ], 1194 | "metadata": { 1195 | "kernelspec": { 1196 | "display_name": "Python 3", 1197 | "language": "python", 1198 | "name": "python3" 1199 | }, 1200 | "language_info": { 1201 | "codemirror_mode": { 1202 | "name": "ipython", 1203 | "version": 3 1204 | }, 1205 | "file_extension": ".py", 1206 | "mimetype": "text/x-python", 1207 | "name": "python", 1208 | "nbconvert_exporter": "python", 1209 | "pygments_lexer": "ipython3", 1210 | "version": "3.5.0" 1211 | } 1212 | }, 1213 | "nbformat": 4, 1214 | "nbformat_minor": 2 1215 | } 1216 | 
--------------------------------------------------------------------------------