├── README.md
├── 初赛
│   ├── history
│   │   ├── V0_学习.ipynb
│   │   ├── V1_增加特征相关性.ipynb
│   │   ├── V2_增加特征可视化和特征选择.ipynb
│   │   ├── V3_结构重构更规范化.ipynb
│   │   └── python-data-visualizations.ipynb
│   ├── 最终程序.ipynb
│   └── 津南数字制造算法挑战赛+20+Drop
│       ├── readme.md
│       └── run.py
└── 复赛
    ├── 复赛.ipynb
    └── 津南数字制造算法挑战赛+17+Drop
        ├── data
        │   └── optimize.csv
        ├── main.py
        └── readme.md
/README.md:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | # Tianchi competition: code from team Drop for the [Jinnan Digital Manufacturing Algorithm Challenge, Arena 1](https://tianchi.aliyun.com/competition/entrance/231695/introduction) (17th in the preliminary round; final-round result unknown), covering both the preliminary and the final round
4 | 
5 | ## Update notes, March 9, 2019:
6 | 
7 | Added the final-round programs, covering yield prediction and optimal-parameter generation.
8 | 
9 | + __Yield prediction:__
10 | 
11 | Similar to the preliminary round, with the following changes:
12 | 
13 | (1) Anomalous values are no longer predicted from other features but simply replaced with missing values; this turned out to work even better than predicting them from the other features.
14 | 
15 | (2) The id feature is not used.
16 | 
17 | (3) After several rounds of feature selection we settled on 9 features, so the feature-selection step itself is omitted.
18 | 
19 | + __Optimal parameter generation:__
20 | 
21 | Two methods were tested, and particle swarm optimization was adopted in the end (minimal sketches of both ideas are included further down this README):
22 | 
23 | (1) Gradient descent: estimate each feature's gradient numerically, then update the parameters by gradient descent. This turned out to be essentially useless for the tree model, presumably because a slight perturbation of a feature rarely changes which side of a split it falls on, so the gradient is 0 most of the time.
24 | 
25 | (2) Particle swarm optimization: each particle represents one set of parameters, and we search for the particle whose predicted yield is best. Different swarm hyperparameters were also tested to make sure the optimum is actually found.
26 | 
27 | ## Updated files:
28 | 
29 | 复赛.ipynb is the local program, covering both yield prediction and optimal-parameter generation.
30 | 
31 | The folder "津南数字制造算法挑战赛+17+Drop" is the final submitted program; it contains only yield prediction, not optimal-parameter generation.
32 | 
33 | # Original content below:
34 | 
35 | -------------------------
36 | 
37 | Tianchi competition: code from team Drop, 17th in the preliminary round of the [Jinnan Digital Manufacturing Algorithm Challenge, Arena 1](https://tianchi.aliyun.com/competition/entrance/231695/introduction)
38 | 
39 | + Author: Tao Yafan, Shaanxi University of Science and Technology
40 | 
41 | My github and blog:
42 | 
43 | github: https://github.com/taoyafan
44 | 
45 | blog: https://me.csdn.net/taoyafan
46 | 
47 | + Teammate's (Blue, University of Electronic Science and Technology of China) github and blog:
48 | 
49 | github: https://github.com/BluesChang
50 | 
51 | blog: https://blueschang.github.io
52 | 
53 | Every part of this program draws heavily on 鱼佬's baseline.
54 | 
55 | Many thanks to my teammate for his many contributions, to 鱼佬 and his baseline, and to 林有夕, from whom I kept learning new things in the group chat.
56 | 
57 | This was my first ML competition; I started learning pandas, sklearn and the related topics from 鱼佬's baseline, so my skill is really limited. I would appreciate any comments or suggestions; if you spot problems or anything that could be improved, please let me know. Many thanks.
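To make point (1) of the update notes concrete, here is a minimal sketch of the numerical-gradient idea; the toy data, the GradientBoostingRegressor stand-in for the trained yield model, and the step size h are placeholder assumptions, not the competition setup. Because a nudge of ±h almost never moves a feature across a split threshold, every tree returns the same leaf for both evaluations and the central difference comes out exactly 0:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy stand-in for the trained yield model (placeholder data, 9 features).
rng = np.random.RandomState(0)
X = rng.rand(200, 9)
y = 0.9 + 0.05 * X[:, 0] + rng.normal(0, 0.005, size=200)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

def numerical_gradient(model, x, h=1e-3):
    """Central-difference gradient of the model's prediction at point x."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_hi, x_lo = x.copy(), x.copy()
        x_hi[i] += h
        x_lo[i] -= h
        grad[i] = (model.predict(x_hi[None, :])[0]
                   - model.predict(x_lo[None, :])[0]) / (2 * h)
    return grad

# Most entries come out exactly 0: a +-h nudge rarely crosses a split threshold.
print(numerical_gradient(model, X[0]))
```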
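And a minimal sketch of the particle swarm search in point (2), assuming a fitted model whose batch prediction is the yield to be maximized; the box bounds lo/hi and the inertia/attraction weights w, c1, c2 below are common textbook defaults standing in for whatever was actually tuned in the competition:

```python
import numpy as np

def pso_maximize(predict, lo, hi, n_particles=50, n_iter=200,
                 w=0.7, c1=1.5, c2=1.5, seed=2019):
    """Plain PSO. predict maps an (n_particles, dim) array of candidate
    parameter sets to an array of predicted yields."""
    rng = np.random.RandomState(seed)
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    pos = rng.uniform(lo, hi, size=(n_particles, len(lo)))
    vel = np.zeros_like(pos)
    pbest, pbest_val = pos.copy(), predict(pos)      # per-particle best so far
    gbest = pbest[np.argmax(pbest_val)].copy()       # global best so far
    for _ in range(n_iter):
        r1 = rng.rand(*pos.shape)
        r2 = rng.rand(*pos.shape)
        # inertia + pull toward each particle's own best + pull toward swarm best
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)             # keep parameters in range
        val = predict(pos)
        better = val > pbest_val
        pbest[better], pbest_val[better] = pos[better], val[better]
        gbest = pbest[np.argmax(pbest_val)].copy()
    return gbest, pbest_val.max()

# Hypothetical usage with a fitted sklearn-style regressor `model` over 9 features:
# best_params, best_yield = pso_maximize(model.predict, lo=np.zeros(9), hi=np.ones(9))
```

Because the swarm only ever evaluates the model at candidate points, it sidesteps the zero-gradient problem of tree models entirely.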
58 | 
59 | I had been meaning to open-source this for a while, but the score was too poor; while it is still on the front page, here it is. I am in this to learn anyway, not to win a prize, so nothing in this program is held back~
60 | 
61 | ## Requirements
62 | 
63 | + __System__
64 | 
65 | ubuntu 16.04
66 | 
67 | + __Python version__
68 | 
69 | python 3.5 or above
70 | 
71 | + __Required libraries__
72 | 
73 | numpy, pandas, lightgbm, xgboost, sklearn
74 | 
75 | ## Files
76 | 
77 | Includes the [preliminary round](https://github.com/taoyafan/jinnan/tree/master/%E5%88%9D%E8%B5%9B) and the final round (to be open-sourced once the competition ends). The preliminary round contains the [final local program](https://github.com/taoyafan/jinnan/blob/master/%E5%88%9D%E8%B5%9B/%E6%9C%80%E7%BB%88%E7%A8%8B%E5%BA%8F.ipynb) (.ipynb), the [submitted program](https://github.com/taoyafan/jinnan/tree/master/%E5%88%9D%E8%B5%9B/%E6%B4%A5%E5%8D%97%E6%95%B0%E5%AD%97%E5%88%B6%E9%80%A0%E7%AE%97%E6%B3%95%E6%8C%91%E6%88%98%E8%B5%9B%2B20%2BDrop) (.py) and the [history programs](https://github.com/taoyafan/jinnan/tree/master/%E5%88%9D%E8%B5%9B/history).
78 | 
79 | The [final local program](https://github.com/taoyafan/jinnan/blob/master/%E5%88%9D%E8%B5%9B/%E6%9C%80%E7%BB%88%E7%A8%8B%E5%BA%8F.ipynb) is the recommended read: its headings, comments and outputs are fairly complete, which makes it easy to follow.
80 | 
81 | The [submitted program](https://github.com/taoyafan/jinnan/tree/master/%E5%88%9D%E8%B5%9B/%E6%B4%A5%E5%8D%97%E6%95%B0%E5%AD%97%E5%88%B6%E9%80%A0%E7%AE%97%E6%B3%95%E6%8C%91%E6%88%98%E8%B5%9B%2B20%2BDrop) is the local program converted to .py and slightly polished (main function, structure, paths, input validation, and so on); see the readme inside for an introduction.
82 | 
83 | The [history programs](https://github.com/taoyafan/jinnan/tree/master/%E5%88%9D%E8%B5%9B/history) record my learning process, starting from studying 鱼佬's program; some turned out useful, some did not.
84 | 
85 | ## Program overview
86 | 
87 | + __Overall flow:__
88 | 
89 | Read data -> manually fix obvious anomalies in the training set -> data cleaning -> feature engineering -> training
90 | 
91 | __Data cleaning:__
92 | 
93 | Drop features with too high a missing rate -> parse time strings (and time ranges) -> compute time differences -> handle outliers -> drop features dominated by a single category
94 | 
95 | __Feature engineering:__
96 | 
97 | Build new features -> predict nan values from correlated features -> backward feature selection
98 | 
99 | __Training:__
100 | 
101 | Automatically tune lgb and xgb, then blend the two
102 | 
103 | + __Data path:__
104 | 
105 | Writing on top of 鱼佬's baseline I ended up with too many variable names and kept hitting small errors when re-running, so I restructured the whole thing several times and finally settled on a pipeline that covers all of the data cleaning and feature engineering. There are fewer variables now and the stages are decoupled, so the data path is worth spelling out:
106 | 
107 | reading the data gives train and test ----> concatenating them gives full ---> the pipeline turns that into pipe_data ---> the train/test split gives X_train and X_test ---> training produces the results oof and predictions
108 | 
109 | ## Running
110 | 
111 | Under the top-level 复赛 directory you also need data and result folders, holding the training/test data and the generated results respectively. A generated result file is named: testname\_modelname\_result\_featurecount\_time.csv (the submitted program follows the official naming requirements).
112 | --------------------------------------------------------------------------------
/初赛/history/python-data-visualizations.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "_cell_guid": "e748dd89-de20-44f2-a122-b2bb69fbab24",
7 | "_uuid": "a42ede279bffeecdddd64047e06fee4b9aed50c5"
8 | },
9 | "source": [
10 | "## This notebook demos Python data visualizations on the Iris dataset\n",
11 | "\n",
12 | "This Python 3 environment comes with many helpful analytics libraries installed. It is defined by the [kaggle/python docker image](https://github.com/kaggle/docker-python)\n",
13 | "\n",
14 | "We'll use three libraries for this tutorial: [pandas](http://pandas.pydata.org/), [matplotlib](http://matplotlib.org/), and [seaborn](http://stanford.edu/~mwaskom/software/seaborn/).\n",
15 | "\n",
16 | "Press \"Fork\" at the top-right of this screen to run this notebook yourself and build each of the examples."
17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 1, 22 | "metadata": { 23 | "_cell_guid": "136008bf-b756-49c1-bc5e-81c1247b969d", 24 | "_uuid": "4a72555be32be45a318141821b58ceac28ffb0d7" 25 | }, 26 | "outputs": [], 27 | "source": [ 28 | "# First, we'll import pandas, a data processing and CSV file I/O library\n", 29 | "import pandas as pd\n", 30 | "\n", 31 | "# We'll also import seaborn, a Python graphing library\n", 32 | "import warnings # current version of seaborn generates a bunch of warnings that we'll ignore\n", 33 | "warnings.filterwarnings(\"ignore\")\n", 34 | "import seaborn as sns\n", 35 | "import matplotlib.pyplot as plt\n", 36 | "sns.set(style=\"white\", color_codes=True)\n", 37 | "\n", 38 | "# Next, we'll load the Iris flower dataset, which is in the \"../input/\" directory\n", 39 | "iris = pd.read_csv(\"../input/Iris.csv\") # the iris dataset is now a Pandas DataFrame\n", 40 | "\n", 41 | "# Let's see what's in the iris data - Jupyter notebooks print the result of the last thing you do\n", 42 | "iris.head()\n", 43 | "\n", 44 | "# Press shift+enter to execute this cell" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 2, 50 | "metadata": { 51 | "_cell_guid": "5dba36af-1bb8-49e5-9b49-1451f4136246", 52 | "_uuid": "ef33a54d1e704924d1eb29632728011d31bfb543" 53 | }, 54 | "outputs": [], 55 | "source": [ 56 | "# Let's see how many examples we have of each species\n", 57 | "iris[\"Species\"].value_counts()" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 3, 63 | "metadata": { 64 | "_cell_guid": "b8588972-deb5-4094-99a6-5feb722e3301", 65 | "_uuid": "b61dbe844a638b1b26e0c3f16a104570d4b60010" 66 | }, 67 | "outputs": [], 68 | "source": [ 69 | "# The first way we can plot things is using the .plot extension from Pandas dataframes\n", 70 | "# We'll use this to make a scatterplot of the Iris features.\n", 71 | "iris.plot(kind=\"scatter\", x=\"SepalLengthCm\", y=\"SepalWidthCm\")" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 4, 77 | "metadata": { 78 | "_cell_guid": "dc213965-5341-4ce7-ad13-42eb5e2fa1e7", 79 | "_uuid": "81da4a44d4ec41f5c7acd172c75df2f47884a13e" 80 | }, 81 | "outputs": [], 82 | "source": [ 83 | "# We can also use the seaborn library to make a similar plot\n", 84 | "# A seaborn jointplot shows bivariate scatterplots and univariate histograms in the same figure\n", 85 | "sns.jointplot(x=\"SepalLengthCm\", y=\"SepalWidthCm\", data=iris, size=5)" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 5, 91 | "metadata": { 92 | "_cell_guid": "0a5c46f6-be6e-4ef6-94a4-9bea13c9a0aa", 93 | "_uuid": "d07401f715fa8f39951a6212bce668657d457fe1" 94 | }, 95 | "outputs": [], 96 | "source": [ 97 | "# One piece of information missing in the plots above is what species each plant is\n", 98 | "# We'll use seaborn's FacetGrid to color the scatterplot by species\n", 99 | "sns.FacetGrid(iris, hue=\"Species\", size=5) \\\n", 100 | " .map(plt.scatter, \"SepalLengthCm\", \"SepalWidthCm\") \\\n", 101 | " .add_legend()" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 6, 107 | "metadata": { 108 | "_cell_guid": "128245d5-6f01-44cd-8b2f-8a49735ac552", 109 | "_uuid": "01cb1b0849f6c7e800c8798164741a8fdae53617" 110 | }, 111 | "outputs": [], 112 | "source": [ 113 | "# We can look at an individual feature in Seaborn through a boxplot\n", 114 | "sns.boxplot(x=\"Species\", y=\"PetalLengthCm\", data=iris)" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | 
"execution_count": 7, 120 | "metadata": { 121 | "_cell_guid": "b86a675c-f604-496a-931a-df76d7d6aaa1", 122 | "_uuid": "a481595c1e46d625e887b61f5eb0e3c48269bde9" 123 | }, 124 | "outputs": [], 125 | "source": [ 126 | "# One way we can extend this plot is adding a layer of individual points on top of\n", 127 | "# it through Seaborn's striplot\n", 128 | "# \n", 129 | "# We'll use jitter=True so that all the points don't fall in single vertical lines\n", 130 | "# above the species\n", 131 | "#\n", 132 | "# Saving the resulting axes as ax each time causes the resulting plot to be shown\n", 133 | "# on top of the previous axes\n", 134 | "ax = sns.boxplot(x=\"Species\", y=\"PetalLengthCm\", data=iris)\n", 135 | "ax = sns.stripplot(x=\"Species\", y=\"PetalLengthCm\", data=iris, jitter=True, edgecolor=\"gray\")" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 8, 141 | "metadata": { 142 | "_cell_guid": "c49f199b-2798-4fdc-87a7-bd2f7f8ff447", 143 | "_uuid": "0d422fc672f3cfb30ec02d1345942cc583c51b05" 144 | }, 145 | "outputs": [], 146 | "source": [ 147 | "# A violin plot combines the benefits of the previous two plots and simplifies them\n", 148 | "# Denser regions of the data are fatter, and sparser thiner in a violin plot\n", 149 | "sns.violinplot(x=\"Species\", y=\"PetalLengthCm\", data=iris, size=6)" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": 9, 155 | "metadata": { 156 | "_cell_guid": "78c32fc8-3c36-482a-81f4-14d4b6ee1430", 157 | "_uuid": "b10aa16c47bdad1964d1746281564f68a5ab741e" 158 | }, 159 | "outputs": [], 160 | "source": [ 161 | "# A final seaborn plot useful for looking at univariate relations is the kdeplot,\n", 162 | "# which creates and visualizes a kernel density estimate of the underlying feature\n", 163 | "sns.FacetGrid(iris, hue=\"Species\", size=6) \\\n", 164 | " .map(sns.kdeplot, \"PetalLengthCm\") \\\n", 165 | " .add_legend()" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": 10, 171 | "metadata": { 172 | "_cell_guid": "7351999e-4522-451f-b3f1-0031c3a88eaa", 173 | "_uuid": "fb9e2f61bf81478f21489f1219358e2b6fa164dd" 174 | }, 175 | "outputs": [], 176 | "source": [ 177 | "# Another useful seaborn plot is the pairplot, which shows the bivariate relation\n", 178 | "# between each pair of features\n", 179 | "# \n", 180 | "# From the pairplot, we'll see that the Iris-setosa species is separataed from the other\n", 181 | "# two across all feature combinations\n", 182 | "sns.pairplot(iris.drop(\"Id\", axis=1), hue=\"Species\", size=3)" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 11, 188 | "metadata": { 189 | "_cell_guid": "3f1fb3ba-e0fd-45b4-8a64-fe2a689bb83b", 190 | "_uuid": "417d197016286a1af02eb522b3a0e0476e76b39b" 191 | }, 192 | "outputs": [], 193 | "source": [ 194 | "# The diagonal elements in a pairplot show the histogram by default\n", 195 | "# We can update these elements to show other things, such as a kde\n", 196 | "sns.pairplot(iris.drop(\"Id\", axis=1), hue=\"Species\", size=3, diag_kind=\"kde\")" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 12, 202 | "metadata": { 203 | "_cell_guid": "46cceec5-3525-4b02-8ab7-5ed1420cd198", 204 | "_uuid": "d7fb122f77031cc79ab0e922608d9e6c5de774ca" 205 | }, 206 | "outputs": [], 207 | "source": [ 208 | "# Now that we've covered seaborn, let's go back to some of the ones we can make with Pandas\n", 209 | "# We can quickly make a boxplot with Pandas on each feature split out by species\n", 210 | 
"iris.drop(\"Id\", axis=1).boxplot(by=\"Species\", figsize=(12, 6))" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": 13, 216 | "metadata": { 217 | "_cell_guid": "5bbed28c-d813-41c4-824d-7038fbfee6ea", 218 | "_uuid": "61c76e99340b06c8020151ae4b8942e1daa8b1ef" 219 | }, 220 | "outputs": [], 221 | "source": [ 222 | "# One cool more sophisticated technique pandas has available is called Andrews Curves\n", 223 | "# Andrews Curves involve using attributes of samples as coefficients for Fourier series\n", 224 | "# and then plotting these\n", 225 | "from pandas.tools.plotting import andrews_curves\n", 226 | "andrews_curves(iris.drop(\"Id\", axis=1), \"Species\")" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": 14, 232 | "metadata": { 233 | "_cell_guid": "77c1b6f0-7632-4d61-bf03-7b5d6856b987", 234 | "_uuid": "b9ac80fdd71c270c9991d34ca87f70d6b00b2192" 235 | }, 236 | "outputs": [], 237 | "source": [ 238 | "# Another multivariate visualization technique pandas has is parallel_coordinates\n", 239 | "# Parallel coordinates plots each feature on a separate column & then draws lines\n", 240 | "# connecting the features for each data sample\n", 241 | "from pandas.tools.plotting import parallel_coordinates\n", 242 | "parallel_coordinates(iris.drop(\"Id\", axis=1), \"Species\")" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": 15, 248 | "metadata": { 249 | "_cell_guid": "d5c6314f-7b36-41ce-b0bd-e2ef17941f97", 250 | "_uuid": "38b7de27f1f882347de21193d93bf474f96c2288" 251 | }, 252 | "outputs": [], 253 | "source": [ 254 | "# A final multivariate visualization technique pandas has is radviz\n", 255 | "# Which puts each feature as a point on a 2D plane, and then simulates\n", 256 | "# having each sample attached to those points through a spring weighted\n", 257 | "# by the relative value for that feature\n", 258 | "from pandas.tools.plotting import radviz\n", 259 | "radviz(iris.drop(\"Id\", axis=1), \"Species\")" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": { 265 | "_cell_guid": "0263903e-4c3f-41c5-adf6-a1a12c122ddb", 266 | "_uuid": "a47be9b234eb942e71425b3e00b741a41488ea33" 267 | }, 268 | "source": [ 269 | "# Wrapping Up\n", 270 | "\n", 271 | "I hope you enjoyed this quick introduction to some of the quick, simple data visualizations you can create with pandas, seaborn, and matplotlib in Python!\n", 272 | "\n", 273 | "I encourage you to run through these examples yourself, tweaking them and seeing what happens. From there, you can try applying these methods to a new dataset and incorprating them into your own workflow!\n", 274 | "\n", 275 | "See [Kaggle Datasets](https://www.kaggle.com/datasets) for other datasets to try visualizing. The [World Food Facts data](https://www.kaggle.com/openfoodfacts/world-food-facts) is an especially rich one for visualization." 
276 | ]
277 | }
278 | ],
279 | "metadata": {
280 | "kernelspec": {
281 | "display_name": "Python 3",
282 | "language": "python",
283 | "name": "python3"
284 | },
285 | "language_info": {
286 | "codemirror_mode": {
287 | "name": "ipython",
288 | "version": 3
289 | },
290 | "file_extension": ".py",
291 | "mimetype": "text/x-python",
292 | "name": "python",
293 | "nbconvert_exporter": "python",
294 | "pygments_lexer": "ipython3",
295 | "version": "3.6.7"
296 | },
297 | "toc": {
298 | "base_numbering": 1,
299 | "nav_menu": {},
300 | "number_sections": true,
301 | "sideBar": true,
302 | "skip_h1_title": false,
303 | "title_cell": "Table of Contents",
304 | "title_sidebar": "Contents",
305 | "toc_cell": false,
306 | "toc_position": {},
307 | "toc_section_display": true,
308 | "toc_window_display": false
309 | }
310 | },
311 | "nbformat": 4,
312 | "nbformat_minor": 1
313 | }
314 | --------------------------------------------------------------------------------
/初赛/最终程序.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "+ __About:__\n",
8 | "\n",
9 | "This is the final program behind team Drop's 20th-place leaderboard-B submission to the Alibaba Cloud Tianchi competition Jinnan Digital Manufacturing Algorithm Challenge (Arena 1)\n",
10 | "\n",
11 | "Team name: Drop. Author: Tao Yafan, Shaanxi University of Science and Technology\n",
12 | "\n",
13 | "My github and blog: (not much there yet, but follows and stars are appreciated /facepalm)\n",
14 | "\n",
15 | "github: https://github.com/taoyafan\n",
16 | "\n",
17 | "blog: https://me.csdn.net/taoyafan\n",
18 | "\n",
19 | "Teammate's (Blue, University of Electronic Science and Technology of China) github and blog: (follows and stars appreciated as well /facepalm)\n",
20 | "\n",
21 | "github:https://github.com/BluesChang\n",
22 | "\n",
23 | "blog:https://blueschang.github.io\n",
24 | "\n",
25 | "Every part of this program draws heavily on 鱼佬's baseline\n",
26 | "\n",
27 | "Many thanks to my teammate for his many contributions, to 鱼佬 and his baseline, and to 林有夕, from whom I kept learning new things in the group chat.\n",
28 | "\n",
29 | "This was my first ML competition; I started learning pandas, sklearn and the related topics from 鱼佬's baseline, so my skill is really limited. I would appreciate any comments or suggestions; if you find problems or anything that could be improved, please let me know. Many thanks. \n",
30 | "\n",
31 | "I had been meaning to open-source this for a while, but the score was too poor; while it is still on the front page, here it is. I am in this to learn, not to win a prize, so nothing in this program is held back~\n",
32 | "\n",
33 | "+ __Overall flow:__ \n",
34 | "\n",
35 | "Read data -> manually fix obvious anomalies in the training set -> data cleaning -> feature engineering -> training\n",
36 | "\n",
37 | "__Data cleaning:__\n",
38 | "\n",
39 | "Drop features with too high a missing rate -> parse time strings (and time ranges) -> compute time differences -> handle outliers -> drop features dominated by a single category\n",
40 | "\n",
41 | "__Feature engineering:__\n",
42 | "\n",
43 | "Build new features -> predict nan values from correlated features -> backward feature selection\n",
44 | "\n",
45 | "__Training:__\n",
46 | "\n",
47 | "Automatically tune lgb and xgb, then blend the two\n",
48 | "\n",
49 | "+ __Data path:__\n",
50 | "\n",
51 | "Writing on top of 鱼佬's baseline I ended up with too many variable names and kept hitting small errors when re-running, so I restructured the whole thing several times and finally settled on a pipeline covering all of the data cleaning and feature engineering; with fewer variables and decoupled stages, the data path is worth spelling out:\n",
52 | "\n",
53 | "reading the data gives train and test ----> concatenating gives full ---> the pipeline produces pipe_data ---> the train/test split gives X_train and X_test ---> training produces oof and predictions\n",
54 | "\n"
55 | ]
56 | },
57 | {
58 | "cell_type": "markdown",
59 | "metadata": {},
60 | "source": [
61 | "# Imports and data loading"
62 | ]
63 | },
64 | {
65 | "cell_type": "code",
66 | "execution_count": 1,
67 | "metadata": {},
68 | "outputs": [
69 | {
70 | "name": "stderr",
71 | "output_type": "stream",
72 | "text": [
73 | "/usr/local/lib/python3.6/dist-packages/deap/tools/_hypervolume/pyhv.py:33: ImportWarning:\n",
74 | "\n",
75 | "Falling back to the python version of hypervolume module.
Expect this to be very slow.\n", 76 | "\n", 77 | "/usr/lib/python3.6/importlib/_bootstrap.py:219: ImportWarning:\n", 78 | "\n", 79 | "can't resolve package from __spec__ or __package__, falling back on __name__ and __path__\n", 80 | "\n", 81 | "/usr/lib/python3.6/importlib/_bootstrap_external.py:426: ImportWarning:\n", 82 | "\n", 83 | "Not importing directory /usr/local/lib/python3.6/dist-packages/mpl_toolkits: missing __init__\n", 84 | "\n", 85 | "/usr/lib/python3.6/importlib/_bootstrap_external.py:426: ImportWarning:\n", 86 | "\n", 87 | "Not importing directory /usr/local/lib/python3.6/dist-packages/google: missing __init__\n", 88 | "\n" 89 | ] 90 | } 91 | ], 92 | "source": [ 93 | "import numpy as np \n", 94 | "import pandas as pd \n", 95 | "import lightgbm as lgb\n", 96 | "import xgboost as xgb\n", 97 | "from scipy import sparse\n", 98 | "import warnings\n", 99 | "import time\n", 100 | "import sys\n", 101 | "import os\n", 102 | "import re\n", 103 | "import datetime\n", 104 | "import matplotlib.pyplot as plt\n", 105 | "import seaborn as sns\n", 106 | "import plotly.offline as py\n", 107 | "import plotly.graph_objs as go \n", 108 | "import plotly.tools as tls\n", 109 | "from xgboost import XGBRegressor\n", 110 | "from tpot import TPOTRegressor" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": 2, 116 | "metadata": {}, 117 | "outputs": [ 118 | { 119 | "name": "stderr", 120 | "output_type": "stream", 121 | "text": [ 122 | "/usr/lib/python3.6/importlib/_bootstrap.py:219: ImportWarning:\n", 123 | "\n", 124 | "can't resolve package from __spec__ or __package__, falling back on __name__ and __path__\n", 125 | "\n" 126 | ] 127 | } 128 | ], 129 | "source": [ 130 | "from sklearn.model_selection import cross_val_score, GridSearchCV, KFold, RepeatedKFold, ShuffleSplit\n", 131 | "from sklearn.pipeline import Pipeline, make_pipeline\n", 132 | "from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone\n", 133 | "from sklearn.linear_model import LinearRegression\n", 134 | "from sklearn.linear_model import Ridge\n", 135 | "from sklearn.linear_model import Lasso\n", 136 | "from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor\n", 137 | "from sklearn.svm import SVR, LinearSVR\n", 138 | "from sklearn.linear_model import ElasticNet, SGDRegressor, BayesianRidge\n", 139 | "from sklearn.kernel_ridge import KernelRidge\n", 140 | "from sklearn.preprocessing import OneHotEncoder, LabelEncoder\n", 141 | "from sklearn.metrics import mean_squared_error, mean_absolute_error\n", 142 | "from sklearn.metrics import log_loss\n", 143 | "from sklearn.preprocessing import Imputer\n", 144 | "from scipy.stats import skew\n", 145 | "from sklearn.preprocessing import RobustScaler, StandardScaler, MinMaxScaler, Normalizer\n", 146 | "from sklearn.decomposition import PCA, KernelPCA\n", 147 | "from sklearn.model_selection import train_test_split\n", 148 | "from sklearn.model_selection import ParameterGrid" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": 3, 154 | "metadata": {}, 155 | "outputs": [ 156 | { 157 | "data": { 158 | "text/html": [ 159 | "" 160 | ], 161 | "text/vnd.plotly.v1+html": [ 162 | "" 163 | ] 164 | }, 165 | "metadata": {}, 166 | "output_type": "display_data" 167 | } 168 | ], 169 | "source": [ 170 | "py.init_notebook_mode(connected=True)\n", 171 | "warnings.simplefilter(action='ignore', category=FutureWarning)\n", 172 | "warnings.filterwarnings(\"ignore\")\n", 173 | 
"pd.set_option('display.max_columns',None)\n", 174 | "pd.set_option('max_colwidth',100)" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "## 设定文件名, 读取文件" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 4, 187 | "metadata": {}, 188 | "outputs": [], 189 | "source": [ 190 | "train_file_name = 'data/jinnan_round1_train_20181227.csv'\n", 191 | "test_file_name = 'data/jinnan_round1_testB_20190121.csv'\n", 192 | "test_name = 'testB'" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 5, 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [ 201 | "# 读取数据, 改名\n", 202 | "train = pd.read_csv(train_file_name, encoding = 'gb18030')\n", 203 | "test = pd.read_csv(test_file_name, encoding = 'gb18030')\n", 204 | "train.rename(columns={'样本id':'id', '收率':'target'}, inplace = True)\n", 205 | "test.rename(columns={'样本id':'id', '收率':'target'}, inplace = True)\n", 206 | "target_name = 'target'\n", 207 | "\n", 208 | "# 存在异常数据,改为 nan\n", 209 | "train.loc[1304, 'A25'] = np.nan\n", 210 | "train['A25'] = train['A25'].astype(float)\n", 211 | "\n", 212 | "# 去掉 id 前缀\n", 213 | "train['id'] = train['id'].apply(lambda x: int(x.split('_')[1]))\n", 214 | "test['id'] = test['id'].apply(lambda x: int(x.split('_')[1]))\n", 215 | "\n", 216 | "train.drop(train[train[target_name] < 0.87].index, inplace=True)\n", 217 | "full=pd.concat([train, test], ignore_index=True)" 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": {}, 223 | "source": [ 224 | "# 数据清洗" 225 | ] 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "metadata": {}, 230 | "source": [ 231 | "## 删除缺失率高的特征\n", 232 | "\n", 233 | "+ __删除缺失值大于 th_high 的值__\n", 234 | "+ __缺失值在 th_low 和 th_high 之间的特征根据是否缺失增加新特征__\n", 235 | " \n", 236 | " 如 B10 缺失较高,增加新特征 B10_null,如果缺失为1,否则为0" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": 6, 242 | "metadata": {}, 243 | "outputs": [], 244 | "source": [ 245 | "class del_nan_feature(BaseEstimator, TransformerMixin):\n", 246 | " \n", 247 | " def __init__(self, th_high=0.85, th_low=0.02):\n", 248 | " self.th_high = th_high\n", 249 | " self.th_low = th_low\n", 250 | " \n", 251 | " def fit(self, X, y=None):\n", 252 | " return self\n", 253 | " \n", 254 | " def transform(self, X):\n", 255 | " print('-'*30, ' '*5, 'del_nan_feature', ' '*5, '-'*30, '\\n')\n", 256 | " print(\"shape before process = {}\".format(X.shape))\n", 257 | "\n", 258 | " # 删除高缺失率特征\n", 259 | " X.dropna(axis=1, thresh=(1-self.th_high)*X.shape[0], inplace=True)\n", 260 | " \n", 261 | " \n", 262 | " # 缺失率较高,增加新特征\n", 263 | " for col in X.columns:\n", 264 | " if col == 'target':\n", 265 | " continue\n", 266 | " \n", 267 | " miss_rate = X[col].isnull().sum()/ X.shape[0]\n", 268 | " if miss_rate > self.th_low:\n", 269 | " print(\"Missing rate of {} is {:.3f} exceed {}, adding new feature {}\".\n", 270 | " format(col, miss_rate, self.th_low, col+'_null'))\n", 271 | " X[col+'_null'] = 0\n", 272 | " X.loc[X[pd.isnull(X[col])].index, [col+'_null']] = 1\n", 273 | " print(\"shape = {}\".format(X.shape))\n", 274 | "\n", 275 | " return X" 276 | ] 277 | }, 278 | { 279 | "cell_type": "markdown", 280 | "metadata": {}, 281 | "source": [ 282 | "## 处理字符时间(段)" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": 7, 288 | "metadata": {}, 289 | "outputs": [], 290 | "source": [ 291 | "# 处理时间\n", 292 | "def timeTranSecond(t):\n", 293 | " try:\n", 294 | " h,m,s=t.split(\":\")\n", 295 | " except:\n", 296 | "\n", 297 
| " if t=='1900/1/9 7:00':\n", 298 | " return 7*3600/3600\n", 299 | " elif t=='1900/1/1 2:30':\n", 300 | " return (2*3600+30*60)/3600\n", 301 | " elif pd.isnull(t):\n", 302 | " return np.nan\n", 303 | " else:\n", 304 | " return 0\n", 305 | "\n", 306 | " try:\n", 307 | " tm = (int(h)*3600+int(m)*60+int(s))/3600\n", 308 | " except:\n", 309 | " return (30*60)/3600\n", 310 | "\n", 311 | " return tm" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": 8, 317 | "metadata": {}, 318 | "outputs": [], 319 | "source": [ 320 | "# 处理时间差\n", 321 | "def getDuration(se):\n", 322 | " try:\n", 323 | " sh,sm,eh,em=re.findall(r\"\\d+\",se)\n", 324 | "# print(\"sh, sm, eh, em = {}, {}, {}, {}\".format(sh, em, eh, em))\n", 325 | " except:\n", 326 | " if pd.isnull(se):\n", 327 | " return np.nan, np.nan, np.nan\n", 328 | "\n", 329 | " try:\n", 330 | " t_start = (int(sh)*3600 + int(sm)*60)/3600\n", 331 | " t_end = (int(eh)*3600 + int(em)*60)/3600\n", 332 | " \n", 333 | " if t_start > t_end:\n", 334 | " tm = t_end - t_start + 24\n", 335 | " else:\n", 336 | " tm = t_end - t_start\n", 337 | " except:\n", 338 | " if se=='19:-20:05':\n", 339 | " return 19, 20, 1\n", 340 | " elif se=='15:00-1600':\n", 341 | " return 15, 16, 1\n", 342 | " else:\n", 343 | " print(\"se = {}\".format(se))\n", 344 | "\n", 345 | "\n", 346 | " return t_start, t_end, tm" 347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "execution_count": 9, 352 | "metadata": {}, 353 | "outputs": [], 354 | "source": [ 355 | "class handle_time_str(BaseEstimator, TransformerMixin):\n", 356 | " \n", 357 | " def __init__(self):\n", 358 | " pass\n", 359 | " \n", 360 | " def fit(self, X, y=None):\n", 361 | " return self\n", 362 | " \n", 363 | " def transform(self, X):\n", 364 | " print('-'*30, ' '*5, 'handle_time_str', ' '*5, '-'*30, '\\n')\n", 365 | "\n", 366 | " for f in ['A5','A7','A9','A11','A14','A16','A24','A26','B5','B7']:\n", 367 | " try:\n", 368 | " X[f] = X[f].apply(timeTranSecond)\n", 369 | " except:\n", 370 | " print(f,'应该在前面被删除了!')\n", 371 | "\n", 372 | "\n", 373 | " for f in ['A20','A28','B4','B9','B10','B11']:\n", 374 | " try:\n", 375 | " start_end_diff = X[f].apply(getDuration)\n", 376 | " \n", 377 | " X[f+'_start'] = start_end_diff.apply(lambda x: x[0])\n", 378 | " X[f+'_end'] = start_end_diff.apply(lambda x: x[1])\n", 379 | " X[f] = start_end_diff.apply(lambda x: x[2])\n", 380 | "\n", 381 | " except:\n", 382 | " print(f,'应该在前面被删除了!')\n", 383 | " return X" 384 | ] 385 | }, 386 | { 387 | "cell_type": "markdown", 388 | "metadata": {}, 389 | "source": [ 390 | "## 计算时间差" 391 | ] 392 | }, 393 | { 394 | "cell_type": "code", 395 | "execution_count": null, 396 | "metadata": {}, 397 | "outputs": [], 398 | "source": [] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "execution_count": 10, 403 | "metadata": {}, 404 | "outputs": [], 405 | "source": [ 406 | "def t_start_t_end(t):\n", 407 | " if pd.isnull(t[0]) or pd.isnull(t[1]):\n", 408 | "# print(\"t_start = {}, t_end = {}, id = {}\".format(t[0], t[1], t[2]))\n", 409 | " return np.nan\n", 410 | " \n", 411 | " if t[1] < t[0]:\n", 412 | " t[1] += 24\n", 413 | " \n", 414 | " dt = t[1] - t[0]\n", 415 | "\n", 416 | " if(dt > 24 or dt < 0):\n", 417 | "# print(\"dt error, t_start = {}, t_end = {}, id = {}\".format(t[0], t[1], t[2]))\n", 418 | " return np.nan\n", 419 | " \n", 420 | " return dt" 421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": 11, 426 | "metadata": {}, 427 | "outputs": [], 428 | "source": [ 429 | "class calc_time_diff(BaseEstimator, 
TransformerMixin):\n", 430 | " def __init__(self):\n", 431 | " pass\n", 432 | " \n", 433 | " def fit(self, X, y=None):\n", 434 | " return self\n", 435 | " \n", 436 | " def transform(self, X):\n", 437 | " print('-'*30, ' '*5, 'calc_time_diff', ' '*5, '-'*30, '\\n')\n", 438 | "\n", 439 | " # t_start 为时间的开始, tn 为中间的时间,减去 t_start 得到时间差\n", 440 | " t_start = ['A9', 'A24', 'B5']\n", 441 | " tn = {'A9':['A11', 'A14', 'A16'], 'A24':['A26'], 'B5':['B7']}\n", 442 | " \n", 443 | " # 计算时间差\n", 444 | " for t_s in t_start:\n", 445 | " for t_e in tn[t_s]:\n", 446 | " X[t_e+'-'+t_s] = X[[t_s,t_e, target_name]].apply(t_start_t_end, axis=1)\n", 447 | " \n", 448 | " # 所有结果保留 3 位小数\n", 449 | " X = X.apply(lambda x:round(x, 3))\n", 450 | " \n", 451 | " print(\"shape = {}\".format(X.shape))\n", 452 | " \n", 453 | " return X" 454 | ] 455 | }, 456 | { 457 | "cell_type": "markdown", 458 | "metadata": {}, 459 | "source": [ 460 | "## 处理异常值" 461 | ] 462 | }, 463 | { 464 | "cell_type": "markdown", 465 | "metadata": {}, 466 | "source": [ 467 | "+ __单一类别个数小于 threshold 的值视为异常值, 改为 nan__" 468 | ] 469 | }, 470 | { 471 | "cell_type": "code", 472 | "execution_count": 12, 473 | "metadata": {}, 474 | "outputs": [], 475 | "source": [ 476 | "class handle_outliers(BaseEstimator, TransformerMixin):\n", 477 | "\n", 478 | " def __init__(self, threshold=2):\n", 479 | " self.th = threshold\n", 480 | " \n", 481 | " def fit(self, X, y=None):\n", 482 | " return self\n", 483 | " \n", 484 | " def transform(self, X):\n", 485 | " print('-'*30, ' '*5, 'handle_outliers', ' '*5, '-'*30, '\\n')\n", 486 | " category_col = [col for col in X if col not in ['id', 'target']]\n", 487 | " for col in category_col:\n", 488 | " label = X[col].value_counts(dropna=False).index.tolist()\n", 489 | " for i, num in enumerate(X[col].value_counts(dropna=False).values):\n", 490 | " if num <= self.th:\n", 491 | "# print(\"Number of label {} in feature {} is {}\".format(label[i], col, num))\n", 492 | " X.loc[X[col]==label[i], [col]] = np.nan\n", 493 | " \n", 494 | " print(\"shape = {}\".format(X.shape))\n", 495 | " return X" 496 | ] 497 | }, 498 | { 499 | "cell_type": "markdown", 500 | "metadata": {}, 501 | "source": [ 502 | "## 删除单一类别占比过大特征" 503 | ] 504 | }, 505 | { 506 | "cell_type": "code", 507 | "execution_count": 13, 508 | "metadata": {}, 509 | "outputs": [], 510 | "source": [ 511 | "class del_single_feature(BaseEstimator, TransformerMixin):\n", 512 | "\n", 513 | " def __init__(self, threshold=0.98):\n", 514 | " # 删除单一类别占比大于 threshold 的特征\n", 515 | " self.th = threshold\n", 516 | " \n", 517 | " def fit(self, X, y=None):\n", 518 | " return self\n", 519 | " \n", 520 | " def transform(self, X):\n", 521 | " print('-'*30, ' '*5, 'del_single_feature', ' '*5, '-'*30, '\\n')\n", 522 | " category_col = [col for col in X if col not in ['target']]\n", 523 | " \n", 524 | " for col in category_col:\n", 525 | " rate = X[col].value_counts(normalize=True, dropna=False).values[0]\n", 526 | " \n", 527 | " if rate > self.th:\n", 528 | " print(\"{} 的最大类别占比是 {}, drop it\".format(col, rate))\n", 529 | " X.drop(col, axis=1, inplace=True)\n", 530 | "\n", 531 | " print(\"shape = {}\".format(X.shape))\n", 532 | " return X" 533 | ] 534 | }, 535 | { 536 | "cell_type": "markdown", 537 | "metadata": {}, 538 | "source": [ 539 | "# 特征工程" 540 | ] 541 | }, 542 | { 543 | "cell_type": "markdown", 544 | "metadata": {}, 545 | "source": [ 546 | "## 获得训练集与测试集" 547 | ] 548 | }, 549 | { 550 | "cell_type": "code", 551 | "execution_count": 14, 552 | "metadata": {}, 553 | "outputs": [], 554 | "source": [ 
555 | "def split_data(pipe_data, target_name='target'):\n", 556 | " \n", 557 | " # 特征列名\n", 558 | " category_col = [col for col in pipe_data if col not in ['target',target_name]]\n", 559 | " \n", 560 | " # 训练、测试行索引\n", 561 | " train_idx = pipe_data[np.logical_not(pd.isnull(pipe_data[target_name]))].index\n", 562 | " test_idx = pipe_data[pd.isnull(pipe_data[target_name])].index\n", 563 | " \n", 564 | " # 获得 train、test 数据\n", 565 | " X_train = pipe_data.loc[train_idx, category_col].values.astype(np.float64)\n", 566 | " y_train = np.squeeze(pipe_data.loc[train_idx, [target_name]].values.astype(np.float64))\n", 567 | " X_test = pipe_data.loc[test_idx, category_col].values.astype(np.float64)\n", 568 | " \n", 569 | " return X_train, y_train, X_test, test_idx" 570 | ] 571 | }, 572 | { 573 | "cell_type": "markdown", 574 | "metadata": {}, 575 | "source": [ 576 | "## xgb(用于特征 nan 值预测)" 577 | ] 578 | }, 579 | { 580 | "cell_type": "code", 581 | "execution_count": 15, 582 | "metadata": {}, 583 | "outputs": [], 584 | "source": [ 585 | "##### xgb\n", 586 | "def xgb_predict(X_train, y_train, X_test, params=None, verbose_eval=100):\n", 587 | " \n", 588 | " if params == None:\n", 589 | " xgb_params = {'eta': 0.05, 'max_depth': 10, 'subsample': 0.8, 'colsample_bytree': 0.8, \n", 590 | " 'objective': 'reg:linear', 'eval_metric': 'rmse', 'silent': True, 'nthread': 4}\n", 591 | " else:\n", 592 | " xgb_params = params\n", 593 | "\n", 594 | " folds = KFold(n_splits=10, shuffle=True, random_state=2018)\n", 595 | " oof_xgb = np.zeros(len(X_train))\n", 596 | " predictions_xgb = np.zeros(len(X_test))\n", 597 | "\n", 598 | " for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):\n", 599 | " if(verbose_eval):\n", 600 | " print(\"fold n°{}\".format(fold_+1))\n", 601 | " print(\"len trn_idx {}\".format(len(trn_idx)))\n", 602 | " \n", 603 | " trn_data = xgb.DMatrix(X_train[trn_idx], y_train[trn_idx])\n", 604 | " val_data = xgb.DMatrix(X_train[val_idx], y_train[val_idx])\n", 605 | "\n", 606 | " watchlist = [(trn_data, 'train'), (val_data, 'valid_data')]\n", 607 | " clf = xgb.train(dtrain=trn_data,\n", 608 | " num_boost_round=20000,\n", 609 | " evals=watchlist,\n", 610 | " early_stopping_rounds=200,\n", 611 | " verbose_eval=verbose_eval,\n", 612 | " params=xgb_params)\n", 613 | " \n", 614 | " \n", 615 | " oof_xgb[val_idx] = clf.predict(xgb.DMatrix(X_train[val_idx]), ntree_limit=clf.best_ntree_limit)\n", 616 | " predictions_xgb += clf.predict(xgb.DMatrix(X_test), ntree_limit=clf.best_ntree_limit) / folds.n_splits\n", 617 | "\n", 618 | " if(verbose_eval):\n", 619 | " print(\"CV score: {:<8.8f}\".format(mean_squared_error(oof_xgb, y_train)))\n", 620 | " return oof_xgb, predictions_xgb" 621 | ] 622 | }, 623 | { 624 | "cell_type": "markdown", 625 | "metadata": {}, 626 | "source": [ 627 | "## 根据 B14 构建新特征" 628 | ] 629 | }, 630 | { 631 | "cell_type": "code", 632 | "execution_count": 16, 633 | "metadata": {}, 634 | "outputs": [], 635 | "source": [ 636 | "class add_new_features(BaseEstimator, TransformerMixin):\n", 637 | "\n", 638 | " def __init__(self):\n", 639 | " pass\n", 640 | " \n", 641 | " def fit(self, X, y=None):\n", 642 | " return self\n", 643 | "\n", 644 | " def transform(self, X):\n", 645 | " print('-'*30, ' '*5, 'add_new_features', ' '*5, '-'*30, '\\n')\n", 646 | "\n", 647 | " # 经过测试,只有 B14 / B12 有用\n", 648 | " \n", 649 | "# X['B14/A1'] = X['B14'] / X['A1']\n", 650 | "# X['B14/A3'] = X['B14'] / X['A3']\n", 651 | "# X['B14/A4'] = X['B14'] / X['A4']\n", 652 | "# X['B14/A19'] = X['B14'] / X['A19']\n", 
653 | "# X['B14/B1'] = X['B14'] / X['B1']\n", 654 | "# X['B14/B9'] = X['B14'] / X['B9']\n", 655 | "\n", 656 | " X['B14/B12'] = X['B14'] / X['B12']\n", 657 | " \n", 658 | " print(\"shape = {}\".format(X.shape))\n", 659 | " return X" 660 | ] 661 | }, 662 | { 663 | "cell_type": "markdown", 664 | "metadata": {}, 665 | "source": [ 666 | "## 选择特征, nan 值填充\n", 667 | "\n", 668 | "+ __选择可能有效的特征__ (只是为了加快选择时间)\n", 669 | "\n", 670 | "+ __利用其他特征预测 nan,取最近值填充__" 671 | ] 672 | }, 673 | { 674 | "cell_type": "code", 675 | "execution_count": 17, 676 | "metadata": {}, 677 | "outputs": [], 678 | "source": [ 679 | "def get_closest(indexes, predicts):\n", 680 | " print(\"From {}\".format(predicts))\n", 681 | "\n", 682 | " for i, one in enumerate(predicts):\n", 683 | " predicts[i] = indexes[np.argsort(abs(indexes - one))[0]]\n", 684 | "\n", 685 | " print(\"To {}\".format(predicts))\n", 686 | " return predicts\n", 687 | " \n", 688 | "\n", 689 | "def value_select_eval(pipe_data, selected_features):\n", 690 | " \n", 691 | " # 经过多次测试, 只选择可能是有用的特征\n", 692 | " cols_with_nan = [col for col in pipe_data.columns \n", 693 | " if pipe_data[col].isnull().sum()>0 and col in selected_features]\n", 694 | "\n", 695 | " for col in cols_with_nan:\n", 696 | " X_train, y_train, X_test, test_idx = split_data(pipe_data, target_name=col)\n", 697 | " oof_xgb, predictions_xgb = xgb_predict(X_train, y_train, X_test, verbose_eval=False)\n", 698 | " \n", 699 | " print(\"-\"*100, end=\"\\n\\n\")\n", 700 | " print(\"CV normal MAE scores of predicting {} is {}\".\n", 701 | " format(col, mean_absolute_error(oof_xgb, y_train)/np.mean(y_train)))\n", 702 | " \n", 703 | " pipe_data.loc[test_idx, [col]] = get_closest(pipe_data[col].value_counts().index,\n", 704 | " predictions_xgb)\n", 705 | "\n", 706 | " pipe_data = pipe_data[selected_features+['target']]\n", 707 | "\n", 708 | " return pipe_data\n", 709 | "\n", 710 | "# pipe_data = value_eval(pipe_data)" 711 | ] 712 | }, 713 | { 714 | "cell_type": "code", 715 | "execution_count": 18, 716 | "metadata": {}, 717 | "outputs": [], 718 | "source": [ 719 | "class selsected_fill_nans(BaseEstimator, TransformerMixin):\n", 720 | "\n", 721 | " def __init__(self, selected_features = ['A3_null', 'A6', 'A16', 'A25', 'A28', 'A28_end',\n", 722 | " 'B5', 'B10_null', 'B11_null', 'B14', 'B14/B12', 'id']):\n", 723 | " self.selected_fearutes = selected_features\n", 724 | " pass\n", 725 | " \n", 726 | " def fit(self, X, y=None):\n", 727 | " return self\n", 728 | " \n", 729 | " def transform(self, X):\n", 730 | " print('-'*30, ' '*5, 'selsected_fill_nans', ' '*5, '-'*30, '\\n')\n", 731 | "\n", 732 | " X = value_select_eval(X, self.selected_fearutes)\n", 733 | "\n", 734 | " print(\"shape = {}\".format(X.shape))\n", 735 | " return X" 736 | ] 737 | }, 738 | { 739 | "cell_type": "code", 740 | "execution_count": 19, 741 | "metadata": {}, 742 | "outputs": [], 743 | "source": [ 744 | "def modeling_cross_validation(data):\n", 745 | " X_train, y_train, X_test, test_idx = split_data(data,\n", 746 | " target_name='target')\n", 747 | " oof_xgb, _ = xgb_predict(X_train, y_train, X_test, verbose_eval=False)\n", 748 | " print('-'*100, end='\\n\\n')\n", 749 | " return mean_squared_error(oof_xgb, y_train)\n", 750 | "\n", 751 | "\n", 752 | "def featureSelect(data):\n", 753 | "\n", 754 | " init_cols = [f for f in data.columns if f not in ['target']]\n", 755 | " best_cols = init_cols.copy()\n", 756 | " best_score = modeling_cross_validation(data[best_cols+['target']])\n", 757 | " print(\"初始 CV score: {:<8.8f}\".format(best_score))\n", 758 
| "\n", 759 | " for col in init_cols:\n", 760 | " best_cols.remove(col)\n", 761 | " score = modeling_cross_validation(data[best_cols+['target']])\n", 762 | " print(\"当前选择特征: {}, CV score: {:<8.8f}, 最佳cv score: {:<8.8f}\".\n", 763 | " format(col, score, best_score), end=\" \")\n", 764 | " \n", 765 | " if best_score - score > 0.0000004:\n", 766 | " best_score = score\n", 767 | " print(\"有效果,删除!!!!\")\n", 768 | " else:\n", 769 | " best_cols.append(col)\n", 770 | " print(\"保留\")\n", 771 | "\n", 772 | " print('-'*100)\n", 773 | " print(\"优化后 CV score: {:<8.8f}\".format(best_score))\n", 774 | " return best_cols, best_score" 775 | ] 776 | }, 777 | { 778 | "cell_type": "markdown", 779 | "metadata": {}, 780 | "source": [ 781 | "## 后向选择特征" 782 | ] 783 | }, 784 | { 785 | "cell_type": "code", 786 | "execution_count": 20, 787 | "metadata": {}, 788 | "outputs": [], 789 | "source": [ 790 | "class select_feature(BaseEstimator, TransformerMixin):\n", 791 | "\n", 792 | " def __init__(self, init_features = None):\n", 793 | " self.init_features = init_features\n", 794 | " pass\n", 795 | " \n", 796 | " def fit(self, X, y=None):\n", 797 | " return self\n", 798 | " \n", 799 | " def transform(self, X):\n", 800 | " print('-'*30, ' '*5, 'select_feature', ' '*5, '-'*30, '\\n')\n", 801 | " \n", 802 | " if self.init_features:\n", 803 | " X = X[self.init_features + ['target']]\n", 804 | " best_features = self.init_features\n", 805 | " else:\n", 806 | " best_features = [col for col in X.columns]\n", 807 | " \n", 808 | " last_feartues = []\n", 809 | " iteration = 0\n", 810 | " equal_time = 0\n", 811 | " \n", 812 | " best_CV = 1\n", 813 | " best_CV_feature = []\n", 814 | " \n", 815 | " # 打乱顺序,但是使用相同种子,保证每次运行结果相同\n", 816 | " np.random.seed(2018)\n", 817 | " while True:\n", 818 | " print(\"Iteration = {}\\n\".format(iteration))\n", 819 | " best_features, score = featureSelect(X[best_features + ['target']])\n", 820 | " \n", 821 | " # 保存最优 CV 的参数\n", 822 | " if score < best_CV:\n", 823 | " best_CV = score\n", 824 | " best_CV_feature = best_features\n", 825 | " print(\"Found best score :{}, with features :{}\".format(best_CV, best_features))\n", 826 | " \n", 827 | " np.random.shuffle(best_features)\n", 828 | " print(\"\\nCurrent fearure length = {}\".format(len(best_features)))\n", 829 | " \n", 830 | " # 最终 3 次迭代相同,则终止迭代\n", 831 | " if len(best_features) == len(last_feartues):\n", 832 | " equal_time += 1\n", 833 | " if equal_time == 3:\n", 834 | " break\n", 835 | " else:\n", 836 | " equal_time = 0\n", 837 | " \n", 838 | " last_feartues = best_features\n", 839 | " iteration = iteration + 1\n", 840 | "\n", 841 | " print(\"\\n\\n\\n\")\n", 842 | " \n", 843 | " return X[best_features + ['target']]\n" 844 | ] 845 | }, 846 | { 847 | "cell_type": "markdown", 848 | "metadata": {}, 849 | "source": [ 850 | "# 训练" 851 | ] 852 | }, 853 | { 854 | "cell_type": "markdown", 855 | "metadata": {}, 856 | "source": [ 857 | "## 构建 pipeline, 处理数据" 858 | ] 859 | }, 860 | { 861 | "cell_type": "code", 862 | "execution_count": 21, 863 | "metadata": {}, 864 | "outputs": [ 865 | { 866 | "name": "stdout", 867 | "output_type": "stream", 868 | "text": [ 869 | "------------------------------ del_nan_feature ------------------------------ \n", 870 | "\n", 871 | "shape before process = (1532, 44)\n", 872 | "Missing rate of A3 is 0.029 exceed 0.02, adding new feature A3_null\n", 873 | "Missing rate of B10 is 0.172 exceed 0.02, adding new feature B10_null\n", 874 | "Missing rate of B11 is 0.597 exceed 0.02, adding new feature B11_null\n", 875 | "shape = (1532, 
44)\n", 876 | "------------------------------ handle_time_str ------------------------------ \n", 877 | "\n", 878 | "A7 应该在前面被删除了!\n", 879 | "------------------------------ calc_time_diff ------------------------------ \n", 880 | "\n", 881 | "shape = (1532, 61)\n", 882 | "------------------------------ handle_outliers ------------------------------ \n", 883 | "\n", 884 | "shape = (1532, 61)\n", 885 | "------------------------------ del_single_feature ------------------------------ \n", 886 | "\n", 887 | "shape = (1532, 61)\n", 888 | "------------------------------ add_new_features ------------------------------ \n", 889 | "\n", 890 | "shape = (1532, 62)\n", 891 | "------------------------------ selsected_fill_nans ------------------------------ \n", 892 | "\n", 893 | "----------------------------------------------------------------------------------------------------\n", 894 | "\n", 895 | "CV normal MAE scores of predicting A16 is 0.006573812182966658\n", 896 | "From [2.82036011 2.0116301 2.78110905 2.24891324 2.64147919 2.61220436\n", 897 | " 2.5752665 3.00300634 2.9279013 2.84588192 2.99439096 3.15101939\n", 898 | " 1.30144072 3.0146347 2.44444267 2.60455203 2.70574424 2.75994805\n", 899 | " 2.58867866 3.00614175 2.78697994 2.03778946 2.69046123 2.72509097\n", 900 | " 2.03607538 2.52129808 2.99479207 2.92738628 2.41858149 2.70892806\n", 901 | " 2.80188948 2.75916436 2.00558983 2.99666125 3.02267092 2.11280097\n", 902 | " 2.88487023 2.52905945 3.2504842 2.92606165 2.52358037 2.57779263\n", 903 | " 2.58069354 2.91890304 2.9953025 2.49374625 2.68844172 2.45054981\n", 904 | " 3.02282879 2.01016228]\n", 905 | "To [3. 2. 3. 2. 2.5 2.5 2.5 3. 3. 3. 3. 3. 1.5 3. 2.5 2.5 2.5 3.\n", 906 | " 2.5 3. 3. 2. 2.5 2.5 2. 2.5 3. 3. 2.5 2.5 3. 3. 2. 3. 3. 2.\n", 907 | " 3. 2.5 3.5 3. 2.5 2.5 2.5 3. 3. 2.5 2.5 2.5 3. 2. ]\n", 908 | "----------------------------------------------------------------------------------------------------\n", 909 | "\n", 910 | "CV normal MAE scores of predicting A25 is 0.006985261873501027\n", 911 | "From [74.69595861 76.41018534 79.13030624 80.63013458 83.78137398 69.97030067\n", 912 | " 79.99184084 81.28120136]\n", 913 | "To [75. 76. 79. 80. 80. 70. 80. 80.]\n", 914 | "----------------------------------------------------------------------------------------------------\n", 915 | "\n", 916 | "CV normal MAE scores of predicting A28 is 0.037333278007722834\n", 917 | "From [1.16017157 0.60625793 1.06248447 0.79223084 0.98663072 0.9867671\n", 918 | " 0.93546665 0.80992338 0.5366797 1.07250487 0.89877572]\n", 919 | "To [1.167 0.667 1. 0.667 1. 1. 1. 0.667 0.5 1. 1. ]\n", 920 | "----------------------------------------------------------------------------------------------------\n", 921 | "\n", 922 | "CV normal MAE scores of predicting A6 is 0.07248546276204286\n", 923 | "From [38.14284706 37.4052155 28.88562822 32.06546164 32.16541529 36.24718952\n", 924 | " 36.2260437 22.94354749 39.25126076 24.79291415 32.7412734 35.56525469\n", 925 | " 33.02388406 25.72406816]\n", 926 | "To [38. 37. 29. 32. 32. 36. 36. 23. 39. 25. 33. 36. 33. 26.]\n", 927 | "----------------------------------------------------------------------------------------------------\n", 928 | "\n", 929 | "CV normal MAE scores of predicting B14 is 0.001297790416243159\n", 930 | "From [400.24114609 402.01158142 400.45770264 401.50468063 402.03158188\n", 931 | " 337.98728943 402.02846909 401.99261093 402.0224762 341.76803589\n", 932 | " 400.6063652 ]\n", 933 | "To [400. 400. 400. 400. 400. 340. 400. 400. 400. 340. 
400.]\n", 934 | "----------------------------------------------------------------------------------------------------\n", 935 | "\n", 936 | "CV normal MAE scores of predicting B5 is 0.016972058774115534\n", 937 | "From [15.12159047 15.72788435 14.3109533 19.78048182 14.65058997 14.53606975\n", 938 | " 15.33118927 14.2788609 13.99365219 14.93855688 14.00136399 14.85706538\n", 939 | " 16.71887732 21.97685862 13.37160268 15.54346603 14.7121506 ]\n", 940 | "To [15. 15.5 14.5 20. 14.5 14.5 15.5 14.5 14. 15. 14. 15. 16.5 22.\n", 941 | " 13.5 15.5 14.5]\n", 942 | "----------------------------------------------------------------------------------------------------\n", 943 | "\n", 944 | "CV normal MAE scores of predicting A28_end is 0.012402601510432685\n", 945 | "From [13.98934126 9.29064429 0.80129172 13.6963979 10.73074675 15.34772384\n", 946 | " 8.8364796 16.79893827 15.48608887 15.33091271 19.34231484 12.45444262\n", 947 | " 15.49709904 14.68686521 15.09787071 15.34401202 17.98878217 10.97875953\n", 948 | " 10.45144868 10.7770443 10.52596968 12.5426327 14.00931859 16.79731178\n", 949 | " 14.87435162 20.49663401 14.73405266 17.8244971 5.21450764 14.84811437]\n", 950 | "To [14. 9. 1. 13.5 10.5 15.5 9. 17. 15.5 15.5 19.5 12.5 15.5 14.5\n", 951 | " 15. 15.5 18. 11. 10.5 11. 10.5 12.5 14. 17. 15. 20.5 14.5 18.\n", 952 | " 5. 15. ]\n", 953 | "----------------------------------------------------------------------------------------------------\n", 954 | "\n", 955 | "CV normal MAE scores of predicting B14/B12 is 0.002980983626315855\n", 956 | "From [0.44779329 0.49992409 0.50056992 0.3333493 0.49980065 0.81825209\n", 957 | " 0.49989815 0.49929611 0.76591279 0.50158779 0.80235612 0.4996887 ]\n", 958 | "To [0.44444444 0.5 0.5 0.33333333 0.5 0.85\n", 959 | " 0.5 0.5 0.7 0.5 0.85 0.5 ]\n", 960 | "shape = (1532, 13)\n", 961 | "------------------------------ select_feature ------------------------------ \n", 962 | "\n", 963 | "Iteration = 0\n", 964 | "\n", 965 | "----------------------------------------------------------------------------------------------------\n", 966 | "\n", 967 | "初始 CV score: 0.00011896\n", 968 | "----------------------------------------------------------------------------------------------------\n", 969 | "\n", 970 | "当前选择特征: A3_null, CV score: 0.00011909, 最佳cv score: 0.00011896 保留\n", 971 | "----------------------------------------------------------------------------------------------------\n", 972 | "\n", 973 | "当前选择特征: A6, CV score: 0.00012150, 最佳cv score: 0.00011896 保留\n", 974 | "----------------------------------------------------------------------------------------------------\n", 975 | "\n", 976 | "当前选择特征: A16, CV score: 0.00011866, 最佳cv score: 0.00011896 保留\n", 977 | "----------------------------------------------------------------------------------------------------\n", 978 | "\n", 979 | "当前选择特征: A25, CV score: 0.00011865, 最佳cv score: 0.00011896 保留\n", 980 | "----------------------------------------------------------------------------------------------------\n", 981 | "\n", 982 | "当前选择特征: A28, CV score: 0.00011749, 最佳cv score: 0.00011896 有效果,删除!!!!\n", 983 | "----------------------------------------------------------------------------------------------------\n", 984 | "\n", 985 | "当前选择特征: A28_end, CV score: 0.00011743, 最佳cv score: 0.00011749 保留\n", 986 | "----------------------------------------------------------------------------------------------------\n", 987 | "\n", 988 | "当前选择特征: B5, CV score: 0.00011818, 最佳cv score: 0.00011749 保留\n", 989 | 
"----------------------------------------------------------------------------------------------------\n", 990 | "\n", 991 | "当前选择特征: B10_null, CV score: 0.00011870, 最佳cv score: 0.00011749 保留\n", 992 | "----------------------------------------------------------------------------------------------------\n", 993 | "\n", 994 | "当前选择特征: B11_null, CV score: 0.00011940, 最佳cv score: 0.00011749 保留\n", 995 | "----------------------------------------------------------------------------------------------------\n", 996 | "\n", 997 | "当前选择特征: B14, CV score: 0.00012126, 最佳cv score: 0.00011749 保留\n", 998 | "----------------------------------------------------------------------------------------------------\n", 999 | "\n", 1000 | "当前选择特征: B14/B12, CV score: 0.00012029, 最佳cv score: 0.00011749 保留\n", 1001 | "----------------------------------------------------------------------------------------------------\n", 1002 | "\n", 1003 | "当前选择特征: id, CV score: 0.00018189, 最佳cv score: 0.00011749 保留\n", 1004 | "----------------------------------------------------------------------------------------------------\n", 1005 | "优化后 CV score: 0.00011749\n", 1006 | "Found best score :0.00011748851254246741, with features :['A3_null', 'A6', 'A16', 'A25', 'A28_end', 'B5', 'B10_null', 'B11_null', 'B14', 'B14/B12', 'id']\n", 1007 | "\n", 1008 | "Current fearure length = 11\n", 1009 | "\n", 1010 | "\n", 1011 | "\n", 1012 | "\n", 1013 | "Iteration = 1\n", 1014 | "\n", 1015 | "----------------------------------------------------------------------------------------------------\n", 1016 | "\n", 1017 | "初始 CV score: 0.00011888\n", 1018 | "----------------------------------------------------------------------------------------------------\n", 1019 | "\n", 1020 | "当前选择特征: A3_null, CV score: 0.00011882, 最佳cv score: 0.00011888 保留\n", 1021 | "----------------------------------------------------------------------------------------------------\n", 1022 | "\n", 1023 | "当前选择特征: A25, CV score: 0.00011935, 最佳cv score: 0.00011888 保留\n", 1024 | "----------------------------------------------------------------------------------------------------\n", 1025 | "\n", 1026 | "当前选择特征: B11_null, CV score: 0.00011808, 最佳cv score: 0.00011888 有效果,删除!!!!\n", 1027 | "----------------------------------------------------------------------------------------------------\n", 1028 | "\n", 1029 | "当前选择特征: B14, CV score: 0.00012059, 最佳cv score: 0.00011808 保留\n" 1030 | ] 1031 | }, 1032 | { 1033 | "name": "stdout", 1034 | "output_type": "stream", 1035 | "text": [ 1036 | "----------------------------------------------------------------------------------------------------\n", 1037 | "\n", 1038 | "当前选择特征: B14/B12, CV score: 0.00012291, 最佳cv score: 0.00011808 保留\n", 1039 | "----------------------------------------------------------------------------------------------------\n", 1040 | "\n", 1041 | "当前选择特征: A28_end, CV score: 0.00011717, 最佳cv score: 0.00011808 有效果,删除!!!!\n", 1042 | "----------------------------------------------------------------------------------------------------\n", 1043 | "\n", 1044 | "当前选择特征: B5, CV score: 0.00011785, 最佳cv score: 0.00011717 保留\n", 1045 | "----------------------------------------------------------------------------------------------------\n", 1046 | "\n", 1047 | "当前选择特征: A6, CV score: 0.00012231, 最佳cv score: 0.00011717 保留\n", 1048 | "----------------------------------------------------------------------------------------------------\n", 1049 | "\n", 1050 | "当前选择特征: A16, CV score: 0.00011826, 最佳cv score: 0.00011717 保留\n", 1051 | 
"----------------------------------------------------------------------------------------------------\n", 1052 | "\n", 1053 | "当前选择特征: B10_null, CV score: 0.00011966, 最佳cv score: 0.00011717 保留\n", 1054 | "----------------------------------------------------------------------------------------------------\n", 1055 | "\n", 1056 | "当前选择特征: id, CV score: 0.00018469, 最佳cv score: 0.00011717 保留\n", 1057 | "----------------------------------------------------------------------------------------------------\n", 1058 | "优化后 CV score: 0.00011717\n", 1059 | "Found best score :0.00011716922906431079, with features :['A3_null', 'A25', 'B14', 'B14/B12', 'B5', 'A6', 'A16', 'B10_null', 'id']\n", 1060 | "\n", 1061 | "Current fearure length = 9\n", 1062 | "\n", 1063 | "\n", 1064 | "\n", 1065 | "\n", 1066 | "Iteration = 2\n", 1067 | "\n", 1068 | "----------------------------------------------------------------------------------------------------\n", 1069 | "\n", 1070 | "初始 CV score: 0.00011730\n", 1071 | "----------------------------------------------------------------------------------------------------\n", 1072 | "\n", 1073 | "当前选择特征: A6, CV score: 0.00012321, 最佳cv score: 0.00011730 保留\n", 1074 | "----------------------------------------------------------------------------------------------------\n", 1075 | "\n", 1076 | "当前选择特征: A3_null, CV score: 0.00011892, 最佳cv score: 0.00011730 保留\n", 1077 | "----------------------------------------------------------------------------------------------------\n", 1078 | "\n", 1079 | "当前选择特征: B14, CV score: 0.00012037, 最佳cv score: 0.00011730 保留\n", 1080 | "----------------------------------------------------------------------------------------------------\n", 1081 | "\n", 1082 | "当前选择特征: A16, CV score: 0.00011723, 最佳cv score: 0.00011730 保留\n", 1083 | "----------------------------------------------------------------------------------------------------\n", 1084 | "\n", 1085 | "当前选择特征: B5, CV score: 0.00011902, 最佳cv score: 0.00011730 保留\n", 1086 | "----------------------------------------------------------------------------------------------------\n", 1087 | "\n", 1088 | "当前选择特征: A25, CV score: 0.00011825, 最佳cv score: 0.00011730 保留\n", 1089 | "----------------------------------------------------------------------------------------------------\n", 1090 | "\n", 1091 | "当前选择特征: id, CV score: 0.00018102, 最佳cv score: 0.00011730 保留\n", 1092 | "----------------------------------------------------------------------------------------------------\n", 1093 | "\n", 1094 | "当前选择特征: B14/B12, CV score: 0.00012256, 最佳cv score: 0.00011730 保留\n", 1095 | "----------------------------------------------------------------------------------------------------\n", 1096 | "\n", 1097 | "当前选择特征: B10_null, CV score: 0.00011870, 最佳cv score: 0.00011730 保留\n", 1098 | "----------------------------------------------------------------------------------------------------\n", 1099 | "优化后 CV score: 0.00011730\n", 1100 | "\n", 1101 | "Current fearure length = 9\n", 1102 | "\n", 1103 | "\n", 1104 | "\n", 1105 | "\n", 1106 | "Iteration = 3\n", 1107 | "\n", 1108 | "----------------------------------------------------------------------------------------------------\n", 1109 | "\n", 1110 | "初始 CV score: 0.00011678\n", 1111 | "----------------------------------------------------------------------------------------------------\n", 1112 | "\n", 1113 | "当前选择特征: A25, CV score: 0.00011825, 最佳cv score: 0.00011678 保留\n", 1114 | 
"----------------------------------------------------------------------------------------------------\n", 1115 | "\n", 1116 | "当前选择特征: B10_null, CV score: 0.00011825, 最佳cv score: 0.00011678 保留\n", 1117 | "----------------------------------------------------------------------------------------------------\n", 1118 | "\n", 1119 | "当前选择特征: B5, CV score: 0.00011854, 最佳cv score: 0.00011678 保留\n", 1120 | "----------------------------------------------------------------------------------------------------\n", 1121 | "\n", 1122 | "当前选择特征: A16, CV score: 0.00011864, 最佳cv score: 0.00011678 保留\n", 1123 | "----------------------------------------------------------------------------------------------------\n", 1124 | "\n", 1125 | "当前选择特征: id, CV score: 0.00018284, 最佳cv score: 0.00011678 保留\n", 1126 | "----------------------------------------------------------------------------------------------------\n", 1127 | "\n", 1128 | "当前选择特征: B14, CV score: 0.00012127, 最佳cv score: 0.00011678 保留\n", 1129 | "----------------------------------------------------------------------------------------------------\n", 1130 | "\n", 1131 | "当前选择特征: A6, CV score: 0.00012289, 最佳cv score: 0.00011678 保留\n", 1132 | "----------------------------------------------------------------------------------------------------\n", 1133 | "\n", 1134 | "当前选择特征: B14/B12, CV score: 0.00012132, 最佳cv score: 0.00011678 保留\n", 1135 | "----------------------------------------------------------------------------------------------------\n", 1136 | "\n", 1137 | "当前选择特征: A3_null, CV score: 0.00011959, 最佳cv score: 0.00011678 保留\n", 1138 | "----------------------------------------------------------------------------------------------------\n", 1139 | "优化后 CV score: 0.00011678\n", 1140 | "Found best score :0.00011677842465073222, with features :['A25', 'B10_null', 'B5', 'A16', 'id', 'B14', 'A6', 'B14/B12', 'A3_null']\n", 1141 | "\n", 1142 | "Current fearure length = 9\n", 1143 | "\n", 1144 | "\n", 1145 | "\n", 1146 | "\n", 1147 | "Iteration = 4\n", 1148 | "\n", 1149 | "----------------------------------------------------------------------------------------------------\n", 1150 | "\n", 1151 | "初始 CV score: 0.00011833\n", 1152 | "----------------------------------------------------------------------------------------------------\n", 1153 | "\n", 1154 | "当前选择特征: B14, CV score: 0.00012009, 最佳cv score: 0.00011833 保留\n", 1155 | "----------------------------------------------------------------------------------------------------\n", 1156 | "\n", 1157 | "当前选择特征: B14/B12, CV score: 0.00012129, 最佳cv score: 0.00011833 保留\n", 1158 | "----------------------------------------------------------------------------------------------------\n", 1159 | "\n", 1160 | "当前选择特征: A3_null, CV score: 0.00011953, 最佳cv score: 0.00011833 保留\n", 1161 | "----------------------------------------------------------------------------------------------------\n", 1162 | "\n", 1163 | "当前选择特征: A6, CV score: 0.00012132, 最佳cv score: 0.00011833 保留\n", 1164 | "----------------------------------------------------------------------------------------------------\n", 1165 | "\n", 1166 | "当前选择特征: B5, CV score: 0.00011791, 最佳cv score: 0.00011833 有效果,删除!!!!\n", 1167 | "----------------------------------------------------------------------------------------------------\n", 1168 | "\n", 1169 | "当前选择特征: A16, CV score: 0.00012081, 最佳cv score: 0.00011791 保留\n", 1170 | "----------------------------------------------------------------------------------------------------\n", 1171 | "\n", 1172 | "当前选择特征: B10_null, 
CV score: 0.00011889, 最佳cv score: 0.00011791 保留\n", 1173 | "----------------------------------------------------------------------------------------------------\n", 1174 | "\n", 1175 | "当前选择特征: A25, CV score: 0.00012040, 最佳cv score: 0.00011791 保留\n", 1176 | "----------------------------------------------------------------------------------------------------\n", 1177 | "\n", 1178 | "当前选择特征: id, CV score: 0.00018179, 最佳cv score: 0.00011791 保留\n", 1179 | "----------------------------------------------------------------------------------------------------\n", 1180 | "优化后 CV score: 0.00011791\n", 1181 | "\n", 1182 | "Current fearure length = 8\n", 1183 | "\n", 1184 | "\n", 1185 | "\n", 1186 | "\n", 1187 | "Iteration = 5\n", 1188 | "\n", 1189 | "----------------------------------------------------------------------------------------------------\n", 1190 | "\n", 1191 | "初始 CV score: 0.00011747\n", 1192 | "----------------------------------------------------------------------------------------------------\n", 1193 | "\n", 1194 | "当前选择特征: id, CV score: 0.00018279, 最佳cv score: 0.00011747 保留\n", 1195 | "----------------------------------------------------------------------------------------------------\n", 1196 | "\n", 1197 | "当前选择特征: A25, CV score: 0.00012058, 最佳cv score: 0.00011747 保留\n", 1198 | "----------------------------------------------------------------------------------------------------\n", 1199 | "\n", 1200 | "当前选择特征: A6, CV score: 0.00012409, 最佳cv score: 0.00011747 保留\n", 1201 | "----------------------------------------------------------------------------------------------------\n", 1202 | "\n", 1203 | "当前选择特征: B14/B12, CV score: 0.00012101, 最佳cv score: 0.00011747 保留\n", 1204 | "----------------------------------------------------------------------------------------------------\n", 1205 | "\n", 1206 | "当前选择特征: A3_null, CV score: 0.00011944, 最佳cv score: 0.00011747 保留\n", 1207 | "----------------------------------------------------------------------------------------------------\n", 1208 | "\n", 1209 | "当前选择特征: A16, CV score: 0.00012004, 最佳cv score: 0.00011747 保留\n", 1210 | "----------------------------------------------------------------------------------------------------\n", 1211 | "\n", 1212 | "当前选择特征: B10_null, CV score: 0.00011840, 最佳cv score: 0.00011747 保留\n" 1213 | ] 1214 | }, 1215 | { 1216 | "name": "stdout", 1217 | "output_type": "stream", 1218 | "text": [ 1219 | "----------------------------------------------------------------------------------------------------\n", 1220 | "\n", 1221 | "当前选择特征: B14, CV score: 0.00012226, 最佳cv score: 0.00011747 保留\n", 1222 | "----------------------------------------------------------------------------------------------------\n", 1223 | "优化后 CV score: 0.00011747\n", 1224 | "\n", 1225 | "Current fearure length = 8\n", 1226 | "\n", 1227 | "\n", 1228 | "\n", 1229 | "\n", 1230 | "Iteration = 6\n", 1231 | "\n", 1232 | "----------------------------------------------------------------------------------------------------\n", 1233 | "\n", 1234 | "初始 CV score: 0.00011715\n", 1235 | "----------------------------------------------------------------------------------------------------\n", 1236 | "\n", 1237 | "当前选择特征: id, CV score: 0.00018398, 最佳cv score: 0.00011715 保留\n", 1238 | "----------------------------------------------------------------------------------------------------\n", 1239 | "\n", 1240 | "当前选择特征: A3_null, CV score: 0.00011984, 最佳cv score: 0.00011715 保留\n", 1241 | 
"----------------------------------------------------------------------------------------------------\n", 1242 | "\n", 1243 | "当前选择特征: A25, CV score: 0.00012132, 最佳cv score: 0.00011715 保留\n", 1244 | "----------------------------------------------------------------------------------------------------\n", 1245 | "\n", 1246 | "当前选择特征: B14/B12, CV score: 0.00012198, 最佳cv score: 0.00011715 保留\n", 1247 | "----------------------------------------------------------------------------------------------------\n", 1248 | "\n", 1249 | "当前选择特征: B10_null, CV score: 0.00011808, 最佳cv score: 0.00011715 保留\n", 1250 | "----------------------------------------------------------------------------------------------------\n", 1251 | "\n", 1252 | "当前选择特征: A16, CV score: 0.00011993, 最佳cv score: 0.00011715 保留\n", 1253 | "----------------------------------------------------------------------------------------------------\n", 1254 | "\n", 1255 | "当前选择特征: A6, CV score: 0.00012184, 最佳cv score: 0.00011715 保留\n", 1256 | "----------------------------------------------------------------------------------------------------\n", 1257 | "\n", 1258 | "当前选择特征: B14, CV score: 0.00012037, 最佳cv score: 0.00011715 保留\n", 1259 | "----------------------------------------------------------------------------------------------------\n", 1260 | "优化后 CV score: 0.00011715\n", 1261 | "\n", 1262 | "Current fearure length = 8\n", 1263 | "\n", 1264 | "\n", 1265 | "\n", 1266 | "\n", 1267 | "Iteration = 7\n", 1268 | "\n", 1269 | "----------------------------------------------------------------------------------------------------\n", 1270 | "\n", 1271 | "初始 CV score: 0.00011709\n", 1272 | "----------------------------------------------------------------------------------------------------\n", 1273 | "\n", 1274 | "当前选择特征: A25, CV score: 0.00012090, 最佳cv score: 0.00011709 保留\n", 1275 | "----------------------------------------------------------------------------------------------------\n", 1276 | "\n", 1277 | "当前选择特征: id, CV score: 0.00018248, 最佳cv score: 0.00011709 保留\n", 1278 | "----------------------------------------------------------------------------------------------------\n", 1279 | "\n", 1280 | "当前选择特征: B14/B12, CV score: 0.00012002, 最佳cv score: 0.00011709 保留\n", 1281 | "----------------------------------------------------------------------------------------------------\n", 1282 | "\n", 1283 | "当前选择特征: B14, CV score: 0.00012325, 最佳cv score: 0.00011709 保留\n", 1284 | "----------------------------------------------------------------------------------------------------\n", 1285 | "\n", 1286 | "当前选择特征: A3_null, CV score: 0.00012002, 最佳cv score: 0.00011709 保留\n", 1287 | "----------------------------------------------------------------------------------------------------\n", 1288 | "\n", 1289 | "当前选择特征: A16, CV score: 0.00011987, 最佳cv score: 0.00011709 保留\n", 1290 | "----------------------------------------------------------------------------------------------------\n", 1291 | "\n", 1292 | "当前选择特征: A6, CV score: 0.00012154, 最佳cv score: 0.00011709 保留\n", 1293 | "----------------------------------------------------------------------------------------------------\n", 1294 | "\n", 1295 | "当前选择特征: B10_null, CV score: 0.00011942, 最佳cv score: 0.00011709 保留\n", 1296 | "----------------------------------------------------------------------------------------------------\n", 1297 | "优化后 CV score: 0.00011709\n", 1298 | "\n", 1299 | "Current fearure length = 8\n", 1300 | "(1532, 9)\n" 1301 | ] 1302 | } 1303 | ], 1304 | "source": [ 1305 | "selected_features = 
['A3_null', 'A6', 'A16', 'A25', 'A28', 'A28_end', \n",
1306 | "                     'B5', 'B10_null', 'B11_null', 'B14', 'B14/B12', 'id']\n",
1307 | "\n",
1308 | "pipe = Pipeline([\n",
1309 | "    ('del_nan_feature', del_nan_feature()),\n",
1310 | "    ('handle_time_str', handle_time_str()),\n",
1311 | "    ('calc_time_diff', calc_time_diff()),\n",
1312 | "    ('Handle_outliers', handle_outliers(2)),\n",
1313 | "    ('del_single_feature', del_single_feature(1)),\n",
1314 | "    ('add_new_features', add_new_features()),\n",
1315 | "    ('selsected_fill_nans', selsected_fill_nans(selected_features)),\n",
1316 | "    ('select_feature', select_feature(selected_features)),\n",
1317 | "    ])\n",
1318 | "\n",
1319 | "pipe_data = pipe.fit_transform(full.copy())\n",
1320 | "print(pipe_data.shape)"
1321 | ]
1322 | },
1323 | {
1324 | "cell_type": "markdown",
1325 | "metadata": {},
1326 | "source": [
1327 | "## Automatic parameter tuning"
1328 | ]
1329 | },
1330 | {
1331 | "cell_type": "code",
1332 | "execution_count": 22,
1333 | "metadata": {},
1334 | "outputs": [],
1335 | "source": [
1336 | "def find_best_params(pipe_data, predict_fun, param_grid):\n",
1337 | "    \n",
1338 | "    # Split into train / test and apply min-max normalisation\n",
1339 | "    X_train, y_train, X_test, test_idx = split_data(pipe_data, target_name='target')\n",
1340 | "    min_max_scaler = MinMaxScaler()\n",
1341 | "    X_train = min_max_scaler.fit_transform(X_train)\n",
1342 | "    X_test = min_max_scaler.transform(X_test)\n",
1343 | "    best_score, best_params = 1, None    # any real CV MSE here (~1e-4) beats 1\n",
1344 | "\n",
1345 | "    # Try every parameter combination and keep the best one\n",
1346 | "    for params in ParameterGrid(param_grid):\n",
1347 | "        print('-'*100, \"\\nparams = \\n{}\\n\".format(params))\n",
1348 | "\n",
1349 | "        oof, predictions = predict_fun(X_train, y_train, X_test, params=params, verbose_eval=False)\n",
1350 | "        score = mean_squared_error(oof, y_train)\n",
1351 | "        print(\"CV score: {}, current best score: {}\".format(score, best_score))\n",
1352 | "\n",
1353 | "        if best_score > score:\n",
1354 | "            print(\"Found new best score: {}\".format(score))\n",
1355 | "            best_score = score\n",
1356 | "            best_params = params\n",
1357 | "\n",
1358 | "\n",
1359 | "    print('\\n\\nbest params: {}'.format(best_params))\n",
1360 | "    print('best score: {}'.format(best_score))\n",
1361 | "    \n",
1362 | "    return best_params"
1363 | ]
1364 | },
1365 | {
1366 | "cell_type": "markdown",
1367 | "metadata": {},
1368 | "source": [
1369 | "## lgb"
1370 | ]
1371 | },
1372 | {
1373 | "cell_type": "code",
1374 | "execution_count": 23,
1375 | "metadata": {},
1376 | "outputs": [],
1377 | "source": [
1378 | "def lgb_predict(X_train, y_train, X_test, params=None, verbose_eval=100):\n",
1379 | "    \n",
1380 | "    if params is None:\n",
1381 | "        lgb_param = {'num_leaves': 20, 'min_data_in_leaf': 2, 'objective':'regression', 'max_depth': 4,\n",
1382 | "                     'learning_rate': 0.06, \"min_child_samples\": 3, \"boosting\": \"gbdt\", \"feature_fraction\": 0.7,\n",
1383 | "                     \"bagging_freq\": 1, \"bagging_fraction\": 1, \"bagging_seed\": 11, \"metric\": 'mse', \"lambda_l2\": 0.003,\n",
1384 | "                     \"verbosity\": -1}\n",
1385 | "    else:\n",
1386 | "        lgb_param = params\n",
1387 | "    \n",
1388 | "    folds = KFold(n_splits=10, shuffle=True, random_state=2018)\n",
1389 | "    oof_lgb = np.zeros(len(X_train))\n",
1390 | "    predictions_lgb = np.zeros(len(X_test))\n",
1391 | "\n",
1392 | "    for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):\n",
1393 | "        if verbose_eval:\n",
1394 | "            print(\"fold n°{}\".format(fold_+1))\n",
1395 | "        \n",
1396 | "        trn_data = lgb.Dataset(X_train[trn_idx], y_train[trn_idx])\n",
1397 | "        val_data = lgb.Dataset(X_train[val_idx], 
y_train[val_idx])\n", 1398 | "\n", 1399 | " num_round = 10000\n", 1400 | " clf = lgb.train(lgb_param, trn_data, num_round, valid_sets = [trn_data, val_data],\n", 1401 | " verbose_eval=verbose_eval, early_stopping_rounds = 100)\n", 1402 | " \n", 1403 | " oof_lgb[val_idx] = clf.predict(X_train[val_idx], num_iteration=clf.best_iteration)\n", 1404 | "\n", 1405 | " predictions_lgb += clf.predict(X_test, num_iteration=clf.best_iteration) / folds.n_splits\n", 1406 | "\n", 1407 | " if verbose_eval:\n", 1408 | " print(\"CV score: {:<8.8f}\".format(mean_squared_error(oof_lgb, y_train)))\n", 1409 | " \n", 1410 | " return oof_lgb, predictions_lgb" 1411 | ] 1412 | }, 1413 | { 1414 | "cell_type": "markdown", 1415 | "metadata": {}, 1416 | "source": [ 1417 | "+ __选择最优参数__" 1418 | ] 1419 | }, 1420 | { 1421 | "cell_type": "code", 1422 | "execution_count": 24, 1423 | "metadata": {}, 1424 | "outputs": [ 1425 | { 1426 | "name": "stdout", 1427 | "output_type": "stream", 1428 | "text": [ 1429 | "---------------------------------------------------------------------------------------------------- \n", 1430 | "params = \n", 1431 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.06, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1432 | "\n", 1433 | "CV score: 0.00011489364207207567, current best score: 1\n", 1434 | "Found new best score: 0.00011489364207207567\n", 1435 | "---------------------------------------------------------------------------------------------------- \n", 1436 | "params = \n", 1437 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.06, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1438 | "\n", 1439 | "CV score: 0.00012002848815037457, current best score: 0.00011489364207207567\n", 1440 | "---------------------------------------------------------------------------------------------------- \n", 1441 | "params = \n", 1442 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.06, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1443 | "\n", 1444 | "CV score: 0.00011402092342458887, current best score: 0.00011489364207207567\n", 1445 | "Found new best score: 0.00011402092342458887\n", 1446 | "---------------------------------------------------------------------------------------------------- \n", 1447 | "params = \n", 1448 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.06, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1449 | "\n", 1450 | "CV score: 0.00011882313706702633, current best score: 0.00011402092342458887\n", 1451 | "---------------------------------------------------------------------------------------------------- \n", 1452 | "params = \n", 1453 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.06, 
'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1454 | "\n", 1455 | "CV score: 0.00011427665390150208, current best score: 0.00011402092342458887\n", 1456 | "---------------------------------------------------------------------------------------------------- \n", 1457 | "params = \n", 1458 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.06, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1459 | "\n", 1460 | "CV score: 0.000118469789122594, current best score: 0.00011402092342458887\n", 1461 | "---------------------------------------------------------------------------------------------------- \n", 1462 | "params = \n", 1463 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.12, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1464 | "\n", 1465 | "CV score: 0.00011547690407588457, current best score: 0.00011402092342458887\n", 1466 | "---------------------------------------------------------------------------------------------------- \n", 1467 | "params = \n", 1468 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.12, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1469 | "\n", 1470 | "CV score: 0.00011985943852356268, current best score: 0.00011402092342458887\n", 1471 | "---------------------------------------------------------------------------------------------------- \n", 1472 | "params = \n", 1473 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.12, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1474 | "\n", 1475 | "CV score: 0.00011231611613708764, current best score: 0.00011402092342458887\n", 1476 | "Found new best score: 0.00011231611613708764\n", 1477 | "---------------------------------------------------------------------------------------------------- \n", 1478 | "params = \n", 1479 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.12, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1480 | "\n", 1481 | "CV score: 0.00011748828797017007, current best score: 0.00011231611613708764\n", 1482 | "---------------------------------------------------------------------------------------------------- \n", 1483 | "params = \n", 1484 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.12, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1485 | "\n", 1486 | "CV score: 0.00011554903372234801, current best score: 
0.00011231611613708764\n", 1487 | "---------------------------------------------------------------------------------------------------- \n", 1488 | "params = \n", 1489 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.12, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1490 | "\n", 1491 | "CV score: 0.00012001078341271754, current best score: 0.00011231611613708764\n", 1492 | "---------------------------------------------------------------------------------------------------- \n", 1493 | "params = \n", 1494 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.24, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1495 | "\n", 1496 | "CV score: 0.0001135845354614368, current best score: 0.00011231611613708764\n", 1497 | "---------------------------------------------------------------------------------------------------- \n", 1498 | "params = \n", 1499 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.24, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1500 | "\n", 1501 | "CV score: 0.00011950413736010373, current best score: 0.00011231611613708764\n", 1502 | "---------------------------------------------------------------------------------------------------- \n", 1503 | "params = \n", 1504 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.24, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1505 | "\n", 1506 | "CV score: 0.00011449170101534524, current best score: 0.00011231611613708764\n", 1507 | "---------------------------------------------------------------------------------------------------- \n", 1508 | "params = \n", 1509 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.24, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1510 | "\n", 1511 | "CV score: 0.00012067409892140623, current best score: 0.00011231611613708764\n", 1512 | "---------------------------------------------------------------------------------------------------- \n", 1513 | "params = \n", 1514 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.24, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1515 | "\n" 1516 | ] 1517 | }, 1518 | { 1519 | "name": "stdout", 1520 | "output_type": "stream", 1521 | "text": [ 1522 | "CV score: 0.00011435307236140136, current best score: 0.00011231611613708764\n", 1523 | "---------------------------------------------------------------------------------------------------- \n", 1524 | "params = \n", 1525 | 
"{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.24, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1526 | "\n", 1527 | "CV score: 0.00012009711500307733, current best score: 0.00011231611613708764\n", 1528 | "---------------------------------------------------------------------------------------------------- \n", 1529 | "params = \n", 1530 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.06, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1531 | "\n", 1532 | "CV score: 0.00011480005750940053, current best score: 0.00011231611613708764\n", 1533 | "---------------------------------------------------------------------------------------------------- \n", 1534 | "params = \n", 1535 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.06, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1536 | "\n", 1537 | "CV score: 0.00012025520678151364, current best score: 0.00011231611613708764\n", 1538 | "---------------------------------------------------------------------------------------------------- \n", 1539 | "params = \n", 1540 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.06, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1541 | "\n", 1542 | "CV score: 0.00011420826273292763, current best score: 0.00011231611613708764\n", 1543 | "---------------------------------------------------------------------------------------------------- \n", 1544 | "params = \n", 1545 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.06, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1546 | "\n", 1547 | "CV score: 0.00011871676666472398, current best score: 0.00011231611613708764\n", 1548 | "---------------------------------------------------------------------------------------------------- \n", 1549 | "params = \n", 1550 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.06, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1551 | "\n", 1552 | "CV score: 0.00011447430938895559, current best score: 0.00011231611613708764\n", 1553 | "---------------------------------------------------------------------------------------------------- \n", 1554 | "params = \n", 1555 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.06, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': 
-1}\n", 1556 | "\n", 1557 | "CV score: 0.00011845637169175561, current best score: 0.00011231611613708764\n", 1558 | "---------------------------------------------------------------------------------------------------- \n", 1559 | "params = \n", 1560 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.12, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1561 | "\n", 1562 | "CV score: 0.00011541082150277204, current best score: 0.00011231611613708764\n", 1563 | "---------------------------------------------------------------------------------------------------- \n", 1564 | "params = \n", 1565 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.12, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1566 | "\n", 1567 | "CV score: 0.00011943383299618423, current best score: 0.00011231611613708764\n", 1568 | "---------------------------------------------------------------------------------------------------- \n", 1569 | "params = \n", 1570 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.12, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1571 | "\n", 1572 | "CV score: 0.00011212722352757958, current best score: 0.00011231611613708764\n", 1573 | "Found new best score: 0.00011212722352757958\n", 1574 | "---------------------------------------------------------------------------------------------------- \n", 1575 | "params = \n", 1576 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.12, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1577 | "\n", 1578 | "CV score: 0.0001181836666297253, current best score: 0.00011212722352757958\n", 1579 | "---------------------------------------------------------------------------------------------------- \n", 1580 | "params = \n", 1581 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.12, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1582 | "\n", 1583 | "CV score: 0.0001158514519987164, current best score: 0.00011212722352757958\n", 1584 | "---------------------------------------------------------------------------------------------------- \n", 1585 | "params = \n", 1586 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.12, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1587 | "\n", 1588 | "CV score: 0.00012000426127843114, current best score: 0.00011212722352757958\n", 1589 | "---------------------------------------------------------------------------------------------------- \n", 1590 | "params = 
\n", 1591 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.24, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1592 | "\n", 1593 | "CV score: 0.00011360641302273318, current best score: 0.00011212722352757958\n", 1594 | "---------------------------------------------------------------------------------------------------- \n", 1595 | "params = \n", 1596 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.24, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1597 | "\n", 1598 | "CV score: 0.00011965654294675753, current best score: 0.00011212722352757958\n", 1599 | "---------------------------------------------------------------------------------------------------- \n", 1600 | "params = \n", 1601 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.24, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1602 | "\n", 1603 | "CV score: 0.00011443935391500729, current best score: 0.00011212722352757958\n", 1604 | "---------------------------------------------------------------------------------------------------- \n", 1605 | "params = \n", 1606 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.24, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1607 | "\n" 1608 | ] 1609 | }, 1610 | { 1611 | "name": "stdout", 1612 | "output_type": "stream", 1613 | "text": [ 1614 | "CV score: 0.00012091648099518892, current best score: 0.00011212722352757958\n", 1615 | "---------------------------------------------------------------------------------------------------- \n", 1616 | "params = \n", 1617 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.24, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1618 | "\n", 1619 | "CV score: 0.00011431927180539175, current best score: 0.00011212722352757958\n", 1620 | "---------------------------------------------------------------------------------------------------- \n", 1621 | "params = \n", 1622 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.24, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1623 | "\n", 1624 | "CV score: 0.00012002843916496843, current best score: 0.00011212722352757958\n", 1625 | "---------------------------------------------------------------------------------------------------- \n", 1626 | "params = \n", 1627 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.06, 'max_depth': 3, 'metric': 
'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1628 | "\n", 1629 | "CV score: 0.00011451755742912798, current best score: 0.00011212722352757958\n", 1630 | "---------------------------------------------------------------------------------------------------- \n", 1631 | "params = \n", 1632 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.06, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1633 | "\n", 1634 | "CV score: 0.00011991034882396216, current best score: 0.00011212722352757958\n", 1635 | "---------------------------------------------------------------------------------------------------- \n", 1636 | "params = \n", 1637 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.06, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1638 | "\n", 1639 | "CV score: 0.00011421574014520517, current best score: 0.00011212722352757958\n", 1640 | "---------------------------------------------------------------------------------------------------- \n", 1641 | "params = \n", 1642 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.06, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1643 | "\n", 1644 | "CV score: 0.00011862220062373018, current best score: 0.00011212722352757958\n", 1645 | "---------------------------------------------------------------------------------------------------- \n", 1646 | "params = \n", 1647 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.06, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1648 | "\n", 1649 | "CV score: 0.00011490378784720141, current best score: 0.00011212722352757958\n", 1650 | "---------------------------------------------------------------------------------------------------- \n", 1651 | "params = \n", 1652 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.06, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1653 | "\n", 1654 | "CV score: 0.00011862098330828783, current best score: 0.00011212722352757958\n", 1655 | "---------------------------------------------------------------------------------------------------- \n", 1656 | "params = \n", 1657 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.12, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1658 | "\n", 1659 | "CV score: 0.00011474358647366374, current best score: 0.00011212722352757958\n", 1660 | 
"---------------------------------------------------------------------------------------------------- \n", 1661 | "params = \n", 1662 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.12, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1663 | "\n", 1664 | "CV score: 0.00011966504510458172, current best score: 0.00011212722352757958\n", 1665 | "---------------------------------------------------------------------------------------------------- \n", 1666 | "params = \n", 1667 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.12, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1668 | "\n", 1669 | "CV score: 0.00011221151541733946, current best score: 0.00011212722352757958\n", 1670 | "---------------------------------------------------------------------------------------------------- \n", 1671 | "params = \n", 1672 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.12, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1673 | "\n", 1674 | "CV score: 0.00011777070012250079, current best score: 0.00011212722352757958\n", 1675 | "---------------------------------------------------------------------------------------------------- \n", 1676 | "params = \n", 1677 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.12, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1678 | "\n", 1679 | "CV score: 0.0001153000539253855, current best score: 0.00011212722352757958\n", 1680 | "---------------------------------------------------------------------------------------------------- \n", 1681 | "params = \n", 1682 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.12, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1683 | "\n", 1684 | "CV score: 0.00012101952260099292, current best score: 0.00011212722352757958\n", 1685 | "---------------------------------------------------------------------------------------------------- \n", 1686 | "params = \n", 1687 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.24, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1688 | "\n", 1689 | "CV score: 0.00011434513539248205, current best score: 0.00011212722352757958\n", 1690 | "---------------------------------------------------------------------------------------------------- \n", 1691 | "params = \n", 1692 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.24, 
'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1693 | "\n", 1694 | "CV score: 0.00011965789595988069, current best score: 0.00011212722352757958\n", 1695 | "---------------------------------------------------------------------------------------------------- \n", 1696 | "params = \n", 1697 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.24, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1698 | "\n" 1699 | ] 1700 | }, 1701 | { 1702 | "name": "stdout", 1703 | "output_type": "stream", 1704 | "text": [ 1705 | "CV score: 0.000114353078699699, current best score: 0.00011212722352757958\n", 1706 | "---------------------------------------------------------------------------------------------------- \n", 1707 | "params = \n", 1708 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.24, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1709 | "\n", 1710 | "CV score: 0.0001212472410367223, current best score: 0.00011212722352757958\n", 1711 | "---------------------------------------------------------------------------------------------------- \n", 1712 | "params = \n", 1713 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.24, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1714 | "\n", 1715 | "CV score: 0.00011433021385951927, current best score: 0.00011212722352757958\n", 1716 | "---------------------------------------------------------------------------------------------------- \n", 1717 | "params = \n", 1718 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.24, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1719 | "\n", 1720 | "CV score: 0.0001202623231156629, current best score: 0.00011212722352757958\n", 1721 | "\n", 1722 | "\n", 1723 | "best params: {'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.12, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1724 | "best score: 0.00011212722352757958\n" 1725 | ] 1726 | } 1727 | ], 1728 | "source": [ 1729 | "param_grid = [\n", 1730 | " {'num_leaves': [20], 'min_data_in_leaf': [2, 3], 'objective':['regression'],\n", 1731 | " 'max_depth': [3, 4, 5], 'learning_rate': [0.06, 0.12, 0.24], \"min_child_samples\": [3],\n", 1732 | " \"boosting\": [\"gbdt\"], \"feature_fraction\": [0.7], \"bagging_freq\": [1],\n", 1733 | " \"bagging_fraction\": [1], \"bagging_seed\": [11], \"metric\": ['mse'],\n", 1734 | " \"lambda_l2\": [0.0003, 0.001, 0.003], \"verbosity\": [-1]}\n", 1735 | " ]\n", 1736 | "\n", 1737 | "lgb_best_params = find_best_params(pipe_data, lgb_predict, param_grid)" 1738 | ] 1739 | }, 1740 
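| {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "+ __Note:__ The \"Iteration = k\" feature-selection logs earlier in this notebook come from a backward-selection loop: each pass tentatively drops one feature, re-runs CV, and deletes the feature for good when the score improves. A minimal sketch of that loop is below; `eval_cv` is a hypothetical stand-in for the notebook's own CV-scoring helper, not its exact API."
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "def backward_select(features, eval_cv):\n",
  "    # eval_cv(feature_list) -> CV MSE, lower is better (assumed helper).\n",
  "    best = eval_cv(features)\n",
  "    improved = True\n",
  "    while improved:                           # one pass per logged \"Iteration = k\"\n",
  "        improved = False\n",
  "        for f in list(features):\n",
  "            trial = [x for x in features if x != f]\n",
  "            score = eval_cv(trial)\n",
  "            if score < best:                  # dropping f improved CV -> delete it\n",
  "                best, features = score, trial\n",
  "                improved = True\n",
  "    return features, best"
 ]
},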
{
1741 | "cell_type": "markdown",
1742 | "metadata": {},
1743 | "source": [
1744 | "+ __lgb training__"
1745 | ]
1746 | },
1747 | {
1748 | "cell_type": "code",
1749 | "execution_count": 25,
1750 | "metadata": {},
1751 | "outputs": [
1752 | {
1753 | "name": "stdout",
1754 | "output_type": "stream",
1755 | "text": [
1756 | "fold n°1\n",
1757 | "Training until validation scores don't improve for 100 rounds.\n",
1758 | "Early stopping, best iteration is:\n",
1759 | "[84]\ttraining's l2: 7.64635e-05\tvalid_1's l2: 0.00013957\n",
1760 | "CV score: 0.76865147\n",
1761 | "fold n°2\n",
1762 | "Training until validation scores don't improve for 100 rounds.\n",
1763 | "Early stopping, best iteration is:\n",
1764 | "[82]\ttraining's l2: 8.03721e-05\tvalid_1's l2: 0.000110882\n",
1765 | "CV score: 0.68291002\n",
1766 | "fold n°3\n",
1767 | "Training until validation scores don't improve for 100 rounds.\n",
1768 | "[200]\ttraining's l2: 5.30193e-05\tvalid_1's l2: 0.000139552\n",
1769 | "Early stopping, best iteration is:\n",
1770 | "[173]\ttraining's l2: 5.68838e-05\tvalid_1's l2: 0.000138392\n",
1771 | "CV score: 0.59707720\n",
1772 | "fold n°4\n",
1773 | "Training until validation scores don't improve for 100 rounds.\n",
1774 | "[200]\ttraining's l2: 5.73108e-05\tvalid_1's l2: 0.000104267\n",
1775 | "Early stopping, best iteration is:\n",
1776 | "[104]\ttraining's l2: 7.56046e-05\tvalid_1's l2: 9.52206e-05\n",
1777 | "CV score: 0.51161256\n",
1778 | "fold n°5\n",
1779 | "Training until validation scores don't improve for 100 rounds.\n",
1780 | "[200]\ttraining's l2: 5.36518e-05\tvalid_1's l2: 0.000110701\n",
1781 | "Early stopping, best iteration is:\n",
1782 | "[111]\ttraining's l2: 6.91289e-05\tvalid_1's l2: 0.0001088\n",
1783 | "CV score: 0.42667732\n",
1784 | "fold n°6\n",
1785 | "Training until validation scores don't improve for 100 rounds.\n",
1786 | "[200]\ttraining's l2: 5.63032e-05\tvalid_1's l2: 0.000123434\n",
1787 | "Early stopping, best iteration is:\n",
1788 | "[110]\ttraining's l2: 7.12232e-05\tvalid_1's l2: 0.000120677\n",
1789 | "CV score: 0.34177334\n",
1790 | "fold n°7\n",
1791 | "Training until validation scores don't improve for 100 rounds.\n",
1792 | "Early stopping, best iteration is:\n",
1793 | "[88]\ttraining's l2: 7.56938e-05\tvalid_1's l2: 0.000111444\n",
1794 | "CV score: 0.25611485\n",
1795 | "fold n°8\n",
1796 | "Training until validation scores don't improve for 100 rounds.\n",
1797 | "[200]\ttraining's l2: 5.83909e-05\tvalid_1's l2: 8.13893e-05\n",
1798 | "Early stopping, best iteration is:\n",
1799 | "[177]\ttraining's l2: 6.21352e-05\tvalid_1's l2: 7.96722e-05\n",
1800 | "CV score: 0.17049800\n",
1801 | "fold n°9\n",
1802 | "Training until validation scores don't improve for 100 rounds.\n",
1803 | "[200]\ttraining's l2: 5.74223e-05\tvalid_1's l2: 0.000108159\n",
1804 | "Early stopping, best iteration is:\n",
1805 | "[183]\ttraining's l2: 5.99078e-05\tvalid_1's l2: 0.000107585\n",
1806 | "CV score: 0.08544523\n",
1807 | "fold n°10\n",
1808 | "Training until validation scores don't improve for 100 rounds.\n",
1809 | "[200]\ttraining's l2: 5.66965e-05\tvalid_1's l2: 0.000109477\n",
1810 | "Early stopping, best iteration is:\n",
1811 | "[176]\ttraining's l2: 5.98262e-05\tvalid_1's l2: 0.000108649\n",
1812 | "CV score: 0.00011213\n"
1813 | ]
1814 | }
1815 | ],
1816 | "source": [
1817 | "X_train, y_train, X_test, test_idx = split_data(pipe_data, target_name='target')\n",
1818 | "min_max_scaler = MinMaxScaler()\n",
1819 | "X_train = min_max_scaler.fit_transform(X_train)\n",
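"# Note: the MinMaxScaler above is fit on the training rows only and reused\n",
"# unchanged for the test rows below, so no test-set statistics leak in.\n",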
1820 | "X_test = min_max_scaler.transform(X_test)\n", 1821 | "oof_lgb, predictions_lgb = lgb_predict(X_train, y_train, X_test, params=lgb_best_params, verbose_eval=200) #" 1822 | ] 1823 | }, 1824 | { 1825 | "cell_type": "markdown", 1826 | "metadata": {}, 1827 | "source": [ 1828 | "## xgb" 1829 | ] 1830 | }, 1831 | { 1832 | "cell_type": "markdown", 1833 | "metadata": {}, 1834 | "source": [ 1835 | "+ __选择最优参数__" 1836 | ] 1837 | }, 1838 | { 1839 | "cell_type": "code", 1840 | "execution_count": 26, 1841 | "metadata": {}, 1842 | "outputs": [ 1843 | { 1844 | "name": "stdout", 1845 | "output_type": "stream", 1846 | "text": [ 1847 | "---------------------------------------------------------------------------------------------------- \n", 1848 | "params = \n", 1849 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n", 1850 | "\n", 1851 | "CV score: 0.00011466648461334117, current best score: 1\n", 1852 | "Found new best score: 0.00011466648461334117\n", 1853 | "---------------------------------------------------------------------------------------------------- \n", 1854 | "params = \n", 1855 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n", 1856 | "\n", 1857 | "CV score: 0.00011456634369353389, current best score: 0.00011466648461334117\n", 1858 | "Found new best score: 0.00011456634369353389\n", 1859 | "---------------------------------------------------------------------------------------------------- \n", 1860 | "params = \n", 1861 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.8}\n", 1862 | "\n", 1863 | "CV score: 0.00011522095556659337, current best score: 0.00011456634369353389\n", 1864 | "---------------------------------------------------------------------------------------------------- \n", 1865 | "params = \n", 1866 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n", 1867 | "\n", 1868 | "CV score: 0.00011575362802403785, current best score: 0.00011456634369353389\n", 1869 | "---------------------------------------------------------------------------------------------------- \n", 1870 | "params = \n", 1871 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n", 1872 | "\n", 1873 | "CV score: 0.0001153688729909817, current best score: 0.00011456634369353389\n", 1874 | "---------------------------------------------------------------------------------------------------- \n", 1875 | "params = \n", 1876 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n", 1877 | "\n", 1878 | "CV score: 0.00011393958598424273, current best score: 0.00011456634369353389\n", 1879 | "Found new best score: 0.00011393958598424273\n", 1880 | "---------------------------------------------------------------------------------------------------- \n", 1881 | "params = \n", 1882 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 
'reg:linear', 'silent': 1, 'subsample': 0.8}\n", 1883 | "\n", 1884 | "CV score: 0.00011519961810983215, current best score: 0.00011393958598424273\n", 1885 | "---------------------------------------------------------------------------------------------------- \n", 1886 | "params = \n", 1887 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n", 1888 | "\n", 1889 | "CV score: 0.0001159343681149051, current best score: 0.00011393958598424273\n", 1890 | "---------------------------------------------------------------------------------------------------- \n", 1891 | "params = \n", 1892 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n", 1893 | "\n", 1894 | "CV score: 0.00011533954423435673, current best score: 0.00011393958598424273\n", 1895 | "---------------------------------------------------------------------------------------------------- \n", 1896 | "params = \n", 1897 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n", 1898 | "\n", 1899 | "CV score: 0.00011435484501228615, current best score: 0.00011393958598424273\n", 1900 | "---------------------------------------------------------------------------------------------------- \n", 1901 | "params = \n", 1902 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.8}\n", 1903 | "\n", 1904 | "CV score: 0.00011550346442947287, current best score: 0.00011393958598424273\n", 1905 | "---------------------------------------------------------------------------------------------------- \n", 1906 | "params = \n", 1907 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n", 1908 | "\n", 1909 | "CV score: 0.00011732503571073892, current best score: 0.00011393958598424273\n", 1910 | "---------------------------------------------------------------------------------------------------- \n", 1911 | "params = \n", 1912 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n", 1913 | "\n", 1914 | "CV score: 0.00011507860801671209, current best score: 0.00011393958598424273\n", 1915 | "---------------------------------------------------------------------------------------------------- \n", 1916 | "params = \n", 1917 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n", 1918 | "\n", 1919 | "CV score: 0.00011373673653760243, current best score: 0.00011393958598424273\n", 1920 | "Found new best score: 0.00011373673653760243\n", 1921 | "---------------------------------------------------------------------------------------------------- \n", 1922 | "params = \n", 1923 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.8}\n", 1924 | "\n", 1925 | "CV score: 0.00011455317620732011, current best score: 0.00011373673653760243\n", 
1926 | "---------------------------------------------------------------------------------------------------- \n", 1927 | "params = \n", 1928 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n", 1929 | "\n", 1930 | "CV score: 0.0001157106534400235, current best score: 0.00011373673653760243\n", 1931 | "---------------------------------------------------------------------------------------------------- \n", 1932 | "params = \n", 1933 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n", 1934 | "\n", 1935 | "CV score: 0.00011484817588831511, current best score: 0.00011373673653760243\n", 1936 | "---------------------------------------------------------------------------------------------------- \n", 1937 | "params = \n", 1938 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n", 1939 | "\n", 1940 | "CV score: 0.00011350220852811235, current best score: 0.00011373673653760243\n", 1941 | "Found new best score: 0.00011350220852811235\n", 1942 | "---------------------------------------------------------------------------------------------------- \n", 1943 | "params = \n", 1944 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.8}\n", 1945 | "\n", 1946 | "CV score: 0.00011417123679618333, current best score: 0.00011350220852811235\n", 1947 | "---------------------------------------------------------------------------------------------------- \n", 1948 | "params = \n", 1949 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n", 1950 | "\n", 1951 | "CV score: 0.0001139991054633333, current best score: 0.00011350220852811235\n", 1952 | "---------------------------------------------------------------------------------------------------- \n", 1953 | "params = \n", 1954 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n", 1955 | "\n", 1956 | "CV score: 0.00011455587677046536, current best score: 0.00011350220852811235\n", 1957 | "---------------------------------------------------------------------------------------------------- \n", 1958 | "params = \n", 1959 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n", 1960 | "\n", 1961 | "CV score: 0.00011509936251941176, current best score: 0.00011350220852811235\n", 1962 | "---------------------------------------------------------------------------------------------------- \n", 1963 | "params = \n", 1964 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.8}\n", 1965 | "\n" 1966 | ] 1967 | }, 1968 | { 1969 | "name": "stdout", 1970 | "output_type": "stream", 1971 | "text": [ 1972 | "CV score: 0.00011510995861245286, current best score: 0.00011350220852811235\n", 1973 | 
"---------------------------------------------------------------------------------------------------- \n", 1974 | "params = \n", 1975 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n", 1976 | "\n", 1977 | "CV score: 0.00011452153482808672, current best score: 0.00011350220852811235\n", 1978 | "---------------------------------------------------------------------------------------------------- \n", 1979 | "params = \n", 1980 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n", 1981 | "\n", 1982 | "CV score: 0.00011658284174959551, current best score: 0.00011350220852811235\n", 1983 | "---------------------------------------------------------------------------------------------------- \n", 1984 | "params = \n", 1985 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n", 1986 | "\n", 1987 | "CV score: 0.00011423913370431543, current best score: 0.00011350220852811235\n", 1988 | "---------------------------------------------------------------------------------------------------- \n", 1989 | "params = \n", 1990 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.8}\n", 1991 | "\n", 1992 | "CV score: 0.00011457178138601278, current best score: 0.00011350220852811235\n", 1993 | "---------------------------------------------------------------------------------------------------- \n", 1994 | "params = \n", 1995 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n", 1996 | "\n", 1997 | "CV score: 0.00011543189691404133, current best score: 0.00011350220852811235\n", 1998 | "---------------------------------------------------------------------------------------------------- \n", 1999 | "params = \n", 2000 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n", 2001 | "\n", 2002 | "CV score: 0.00011610932754520534, current best score: 0.00011350220852811235\n", 2003 | "---------------------------------------------------------------------------------------------------- \n", 2004 | "params = \n", 2005 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n", 2006 | "\n", 2007 | "CV score: 0.00011469866498564288, current best score: 0.00011350220852811235\n", 2008 | "---------------------------------------------------------------------------------------------------- \n", 2009 | "params = \n", 2010 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.8}\n", 2011 | "\n", 2012 | "CV score: 0.00011495607118486482, current best score: 0.00011350220852811235\n", 2013 | "---------------------------------------------------------------------------------------------------- \n", 2014 | "params = \n", 2015 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 
'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n", 2016 | "\n", 2017 | "CV score: 0.00011659799844084755, current best score: 0.00011350220852811235\n", 2018 | "---------------------------------------------------------------------------------------------------- \n", 2019 | "params = \n", 2020 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n", 2021 | "\n", 2022 | "CV score: 0.00011684188710827838, current best score: 0.00011350220852811235\n", 2023 | "---------------------------------------------------------------------------------------------------- \n", 2024 | "params = \n", 2025 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n", 2026 | "\n", 2027 | "CV score: 0.0001155643748809089, current best score: 0.00011350220852811235\n", 2028 | "---------------------------------------------------------------------------------------------------- \n", 2029 | "params = \n", 2030 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.8}\n", 2031 | "\n", 2032 | "CV score: 0.0001150891914023831, current best score: 0.00011350220852811235\n", 2033 | "---------------------------------------------------------------------------------------------------- \n", 2034 | "params = \n", 2035 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n", 2036 | "\n", 2037 | "CV score: 0.0001174593981966976, current best score: 0.00011350220852811235\n", 2038 | "\n", 2039 | "\n", 2040 | "best params: {'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n", 2041 | "best score: 0.00011350220852811235\n" 2042 | ] 2043 | } 2044 | ], 2045 | "source": [ 2046 | "param_grid = [\n", 2047 | " {'silent': [1],\n", 2048 | " 'nthread': [4],\n", 2049 | " 'eval_metric': ['rmse'],\n", 2050 | " 'eta': [0.03],\n", 2051 | " 'objective': ['reg:linear'],\n", 2052 | " 'max_depth': [4, 5, 6],\n", 2053 | " 'num_round': [1000],\n", 2054 | " 'subsample': [0.4, 0.6, 0.8, 1],\n", 2055 | " 'colsample_bytree': [0.7, 0.9, 1],\n", 2056 | " }\n", 2057 | " ]\n", 2058 | "\n", 2059 | "xgb_best_params = find_best_params(pipe_data, xgb_predict, param_grid)" 2060 | ] 2061 | }, 2062 | { 2063 | "cell_type": "markdown", 2064 | "metadata": {}, 2065 | "source": [ 2066 | "+ __xgb 训练__" 2067 | ] 2068 | }, 2069 | { 2070 | "cell_type": "code", 2071 | "execution_count": 27, 2072 | "metadata": {}, 2073 | "outputs": [ 2074 | { 2075 | "name": "stdout", 2076 | "output_type": "stream", 2077 | "text": [ 2078 | "fold n°1\n", 2079 | "len trn_idx 1244\n", 2080 | "[0]\ttrain-rmse:0.41223\tvalid_data-rmse:0.414495\n", 2081 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", 2082 | "\n", 2083 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n", 2084 | "[200]\ttrain-rmse:0.00937\tvalid_data-rmse:0.012154\n", 2085 | "[400]\ttrain-rmse:0.00738\tvalid_data-rmse:0.012149\n", 2086 | "Stopping. 
Best iteration:\n", 2087 | "[249]\ttrain-rmse:0.008639\tvalid_data-rmse:0.011963\n", 2088 | "\n", 2089 | "fold n°2\n", 2090 | "len trn_idx 1244\n", 2091 | "[0]\ttrain-rmse:0.412554\tvalid_data-rmse:0.411489\n", 2092 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", 2093 | "\n", 2094 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n", 2095 | "[200]\ttrain-rmse:0.009572\tvalid_data-rmse:0.010408\n", 2096 | "[400]\ttrain-rmse:0.007492\tvalid_data-rmse:0.010257\n", 2097 | "Stopping. Best iteration:\n", 2098 | "[321]\ttrain-rmse:0.008054\tvalid_data-rmse:0.010154\n", 2099 | "\n", 2100 | "fold n°3\n", 2101 | "len trn_idx 1244\n", 2102 | "[0]\ttrain-rmse:0.412511\tvalid_data-rmse:0.412077\n", 2103 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", 2104 | "\n", 2105 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n", 2106 | "[200]\ttrain-rmse:0.009498\tvalid_data-rmse:0.012291\n", 2107 | "[400]\ttrain-rmse:0.007413\tvalid_data-rmse:0.011734\n", 2108 | "[600]\ttrain-rmse:0.00631\tvalid_data-rmse:0.011799\n", 2109 | "Stopping. Best iteration:\n", 2110 | "[481]\ttrain-rmse:0.006915\tvalid_data-rmse:0.011714\n", 2111 | "\n", 2112 | "fold n°4\n", 2113 | "len trn_idx 1245\n", 2114 | "[0]\ttrain-rmse:0.412355\tvalid_data-rmse:0.413351\n", 2115 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", 2116 | "\n", 2117 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n", 2118 | "[200]\ttrain-rmse:0.009632\tvalid_data-rmse:0.010239\n", 2119 | "[400]\ttrain-rmse:0.007535\tvalid_data-rmse:0.009976\n", 2120 | "Stopping. Best iteration:\n", 2121 | "[361]\ttrain-rmse:0.00781\tvalid_data-rmse:0.009915\n", 2122 | "\n", 2123 | "fold n°5\n", 2124 | "len trn_idx 1245\n", 2125 | "[0]\ttrain-rmse:0.412679\tvalid_data-rmse:0.410576\n", 2126 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", 2127 | "\n", 2128 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n", 2129 | "[200]\ttrain-rmse:0.009628\tvalid_data-rmse:0.010779\n", 2130 | "[400]\ttrain-rmse:0.007522\tvalid_data-rmse:0.010314\n", 2131 | "[600]\ttrain-rmse:0.006444\tvalid_data-rmse:0.010495\n", 2132 | "Stopping. Best iteration:\n", 2133 | "[406]\ttrain-rmse:0.007485\tvalid_data-rmse:0.010309\n", 2134 | "\n", 2135 | "fold n°6\n", 2136 | "len trn_idx 1245\n", 2137 | "[0]\ttrain-rmse:0.412697\tvalid_data-rmse:0.410264\n", 2138 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", 2139 | "\n", 2140 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n", 2141 | "[200]\ttrain-rmse:0.009618\tvalid_data-rmse:0.01148\n", 2142 | "[400]\ttrain-rmse:0.007507\tvalid_data-rmse:0.011228\n", 2143 | "Stopping. Best iteration:\n", 2144 | "[298]\ttrain-rmse:0.008307\tvalid_data-rmse:0.011154\n", 2145 | "\n", 2146 | "fold n°7\n", 2147 | "len trn_idx 1245\n", 2148 | "[0]\ttrain-rmse:0.412244\tvalid_data-rmse:0.414249\n", 2149 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", 2150 | "\n", 2151 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n", 2152 | "[200]\ttrain-rmse:0.009547\tvalid_data-rmse:0.010669\n", 2153 | "[400]\ttrain-rmse:0.007386\tvalid_data-rmse:0.010475\n", 2154 | "Stopping. 
Best iteration:\n", 2155 | "[276]\ttrain-rmse:0.008424\tvalid_data-rmse:0.010363\n", 2156 | "\n", 2157 | "fold n°8\n", 2158 | "len trn_idx 1245\n", 2159 | "[0]\ttrain-rmse:0.41229\tvalid_data-rmse:0.414101\n", 2160 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", 2161 | "\n", 2162 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n", 2163 | "[200]\ttrain-rmse:0.009677\tvalid_data-rmse:0.009724\n", 2164 | "[400]\ttrain-rmse:0.007648\tvalid_data-rmse:0.00927\n", 2165 | "[600]\ttrain-rmse:0.006596\tvalid_data-rmse:0.009145\n", 2166 | "[800]\ttrain-rmse:0.005798\tvalid_data-rmse:0.009118\n", 2167 | "[1000]\ttrain-rmse:0.005144\tvalid_data-rmse:0.009119\n", 2168 | "Stopping. Best iteration:\n", 2169 | "[934]\ttrain-rmse:0.00535\tvalid_data-rmse:0.009081\n", 2170 | "\n", 2171 | "fold n°9\n", 2172 | "len trn_idx 1245\n", 2173 | "[0]\ttrain-rmse:0.412592\tvalid_data-rmse:0.411176\n", 2174 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", 2175 | "\n", 2176 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n", 2177 | "[200]\ttrain-rmse:0.009536\tvalid_data-rmse:0.01135\n", 2178 | "[400]\ttrain-rmse:0.007511\tvalid_data-rmse:0.010768\n", 2179 | "[600]\ttrain-rmse:0.006424\tvalid_data-rmse:0.010755\n", 2180 | "Stopping. Best iteration:\n", 2181 | "[534]\ttrain-rmse:0.006738\tvalid_data-rmse:0.010726\n", 2182 | "\n", 2183 | "fold n°10\n", 2184 | "len trn_idx 1245\n", 2185 | "[0]\ttrain-rmse:0.412429\tvalid_data-rmse:0.412775\n", 2186 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", 2187 | "\n", 2188 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n", 2189 | "[200]\ttrain-rmse:0.009541\tvalid_data-rmse:0.011636\n", 2190 | "[400]\ttrain-rmse:0.007462\tvalid_data-rmse:0.010853\n", 2191 | "[600]\ttrain-rmse:0.006416\tvalid_data-rmse:0.010884\n", 2192 | "Stopping. 
Best iteration:\n", 2193 | "[409]\ttrain-rmse:0.007404\tvalid_data-rmse:0.010834\n", 2194 | "\n", 2195 | "CV score: 0.00011350\n" 2196 | ] 2197 | } 2198 | ], 2199 | "source": [ 2200 | "X_train, y_train, X_test, test_idx = split_data(pipe_data, target_name='target')\n", 2201 | "min_max_scaler = MinMaxScaler()\n", 2202 | "X_train = min_max_scaler.fit_transform(X_train)\n", 2203 | "X_test = min_max_scaler.transform(X_test)\n", 2204 | "oof_xgb, predictions_xgb = xgb_predict(X_train, y_train, X_test, params=xgb_best_params, verbose_eval=200) #" 2205 | ] 2206 | }, 2207 | { 2208 | "cell_type": "markdown", 2209 | "metadata": {}, 2210 | "source": [ 2211 | "## 模型融合" 2212 | ] 2213 | }, 2214 | { 2215 | "cell_type": "code", 2216 | "execution_count": 28, 2217 | "metadata": {}, 2218 | "outputs": [ 2219 | { 2220 | "name": "stdout", 2221 | "output_type": "stream", 2222 | "text": [ 2223 | "fold 0\n", 2224 | "fold 1\n", 2225 | "fold 2\n", 2226 | "fold 3\n", 2227 | "fold 4\n", 2228 | "fold 5\n", 2229 | "fold 6\n", 2230 | "fold 7\n", 2231 | "fold 8\n", 2232 | "fold 9\n", 2233 | "0.00011094962448042872\n" 2234 | ] 2235 | } 2236 | ], 2237 | "source": [ 2238 | "# stacking\n", 2239 | "train_stack = np.vstack([oof_lgb, oof_xgb]).transpose()\n", 2240 | "test_stack = np.vstack([predictions_lgb, predictions_xgb]).transpose()\n", 2241 | "\n", 2242 | "folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=4590)\n", 2243 | "oof_stack = np.zeros(train_stack.shape[0])\n", 2244 | "predictions = np.zeros(test_stack.shape[0])\n", 2245 | "\n", 2246 | "for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack, y_train)):\n", 2247 | " print(\"fold {}\".format(fold_))\n", 2248 | " trn_data, trn_y = train_stack[trn_idx], y_train[trn_idx]\n", 2249 | " val_data, val_y = train_stack[val_idx], y_train[val_idx]\n", 2250 | " \n", 2251 | " clf_3 = BayesianRidge()\n", 2252 | " clf_3.fit(trn_data, trn_y)\n", 2253 | " \n", 2254 | " oof_stack[val_idx] = clf_3.predict(val_data)\n", 2255 | " predictions += clf_3.predict(test_stack) / 10\n", 2256 | " \n", 2257 | "final_score = mean_squared_error(y_train, oof_stack)\n", 2258 | "print(final_score)" 2259 | ] 2260 | }, 2261 | { 2262 | "cell_type": "markdown", 2263 | "metadata": {}, 2264 | "source": [ 2265 | "# 生成提交结果" 2266 | ] 2267 | }, 2268 | { 2269 | "cell_type": "markdown", 2270 | "metadata": {}, 2271 | "source": [ 2272 | "+ __生成文件名__" 2273 | ] 2274 | }, 2275 | { 2276 | "cell_type": "code", 2277 | "execution_count": 29, 2278 | "metadata": {}, 2279 | "outputs": [ 2280 | { 2281 | "name": "stdout", 2282 | "output_type": "stream", 2283 | "text": [ 2284 | "result/testB_lgb_xgb_11095_8features_20190122_10:36:07.csv\n" 2285 | ] 2286 | } 2287 | ], 2288 | "source": [ 2289 | "import time\n", 2290 | "model_name = \"lgb_xgb\"\n", 2291 | "file_name = 'result/{}_{}_{:5.0f}_{}features_{}.csv'.format(\n", 2292 | " test_name, \n", 2293 | " model_name,\n", 2294 | " final_score*1e8, X_train.shape[1],\n", 2295 | " time.strftime('%Y%m%d_%H:%M:%S',time.localtime(time.time())))\n", 2296 | " \n", 2297 | "print(file_name)" 2298 | ] 2299 | }, 2300 | { 2301 | "cell_type": "markdown", 2302 | "metadata": {}, 2303 | "source": [ 2304 | "+ __写入文件__" 2305 | ] 2306 | }, 2307 | { 2308 | "cell_type": "code", 2309 | "execution_count": 30, 2310 | "metadata": {}, 2311 | "outputs": [], 2312 | "source": [ 2313 | "sub_df = pd.read_csv(test_file_name, encoding = 'gb18030')\n", 2314 | "sub_df = sub_df[['样本id', 'A1']]\n", 2315 | "sub_df['A1'] = predictions\n", 2316 | "sub_df['A1'] = sub_df['A1'].apply(lambda 
x:round(x, 3))\n", 2317 | "sub_df.to_csv(file_name, header=0, index=0) " 2318 | ] 2319 | } 2320 | ], 2321 | "metadata": { 2322 | "kernelspec": { 2323 | "display_name": "Python 3", 2324 | "language": "python", 2325 | "name": "python3" 2326 | }, 2327 | "language_info": { 2328 | "codemirror_mode": { 2329 | "name": "ipython", 2330 | "version": 3 2331 | }, 2332 | "file_extension": ".py", 2333 | "mimetype": "text/x-python", 2334 | "name": "python", 2335 | "nbconvert_exporter": "python", 2336 | "pygments_lexer": "ipython3", 2337 | "version": "3.6.7" 2338 | }, 2339 | "toc": { 2340 | "base_numbering": 1, 2341 | "nav_menu": {}, 2342 | "number_sections": true, 2343 | "sideBar": true, 2344 | "skip_h1_title": false, 2345 | "title_cell": "Table of Contents", 2346 | "title_sidebar": "Contents", 2347 | "toc_cell": false, 2348 | "toc_position": {}, 2349 | "toc_section_display": true, 2350 | "toc_window_display": true 2351 | } 2352 | }, 2353 | "nbformat": 4, 2354 | "nbformat_minor": 2 2355 | } 2356 | -------------------------------------------------------------------------------- /初赛/津南数字制造算法挑战赛+20+Drop/readme.md: -------------------------------------------------------------------------------- 1 | # 环境要求 2 | 3 | ## 系统 4 | 5 | ubuntu 16.04 6 | 7 | ## python 版本 8 | python 3.5 或 python 3.6 9 | 10 | ## 需要的库 11 | 12 | numpy, pandas, lightgbm, xgboost, sklearn 13 | 14 | 15 | 16 | # 运行说明 17 | 18 | 生成 B 榜预测结果文件: 19 | python3 run.py --test_type=B 20 | 21 | 生成 C 榜预测结果文件: 22 | python3 run.py --test_type=C 23 | 24 | # 其他说明 25 | 26 | 运行时间大概 6 分钟左右(不定) 27 | 28 | 要求一级目录下存放 data 文件夹, 内部有 A 榜数据: jinnan_round1_train_20181227.csv, B 榜数据: jinnan_round1_testB_20190121.csv, C 榜数据: jinnan_round1_test_20190121.csv. 29 | 30 | 生成结果在一级目录, B 榜名为 submit_B.csv, C 榜名为 submit_C.csv. 31 | 32 | # 最后 33 | 34 | 提前祝您春节快乐~ 35 | 如果有问题希望您联系我们 36 | 37 | 陶亚凡: 765370813@qq.com 38 | Blue: cy_1995@qq.com -------------------------------------------------------------------------------- /初赛/津南数字制造算法挑战赛+20+Drop/run.py: -------------------------------------------------------------------------------- 1 | 2 | # In[1]: 3 | 4 | 5 | import numpy as np 6 | import pandas as pd 7 | import lightgbm as lgb 8 | import xgboost as xgb 9 | import warnings 10 | import re 11 | import argparse, sys 12 | 13 | # In[2]: 14 | 15 | 16 | from sklearn.model_selection import KFold, RepeatedKFold 17 | from sklearn.pipeline import Pipeline 18 | from sklearn.base import BaseEstimator, TransformerMixin 19 | from sklearn.linear_model import BayesianRidge 20 | from sklearn.metrics import mean_squared_error, mean_absolute_error 21 | from sklearn.preprocessing import MinMaxScaler 22 | from sklearn.model_selection import ParameterGrid 23 | 24 | 25 | # In[3]: 26 | 27 | 28 | warnings.simplefilter(action='ignore', category=FutureWarning) 29 | warnings.filterwarnings("ignore") 30 | pd.set_option('display.max_columns', None) 31 | pd.set_option('max_colwidth', 100) 32 | 33 | 34 | 35 | 36 | # # 数据清洗 37 | 38 | # ## 删除缺失率高的特征 39 | # 40 | # + __删除缺失值大于 th_high 的值__ 41 | # + __缺失值在 th_low 和 th_high 之间的特征根据是否缺失增加新特征__ 42 | # 43 | # 如 B10 缺失较高,增加新特征 B10_null,如果缺失为1,否则为0 44 | 45 | # In[6]: 46 | 47 | 48 | class del_nan_feature(BaseEstimator, TransformerMixin): 49 | 50 | def __init__(self, th_high=0.85, th_low=0.02): 51 | self.th_high = th_high 52 | self.th_low = th_low 53 | 54 | def fit(self, X, y=None): 55 | return self 56 | 57 | def transform(self, X): 58 | print('-'*30, ' '*5, 'del_nan_feature', ' '*5, '-'*30, '\n') 59 | print("shape before process = {}".format(X.shape)) 60 | 61 | # 删除高缺失率特征 62 | 
X.dropna(axis=1, thresh=(1-self.th_high)*X.shape[0], inplace=True) 63 | 64 | 65 | # 缺失率较高,增加新特征 66 | for col in X.columns: 67 | if col == 'target': 68 | continue 69 | 70 | miss_rate = X[col].isnull().sum()/ X.shape[0] 71 | if miss_rate > self.th_low: 72 | print("Missing rate of {} is {:.3f} exceed {}, adding new feature {}". 73 | format(col, miss_rate, self.th_low, col+'_null')) 74 | X[col+'_null'] = 0 75 | X.loc[X[pd.isnull(X[col])].index, [col+'_null']] = 1 76 | print("shape = {}".format(X.shape)) 77 | 78 | return X 79 | 80 | 81 | # ## 处理字符时间(段) 82 | 83 | # In[7]: 84 | 85 | 86 | # 处理时间 87 | def timeTranSecond(t): 88 | try: 89 | h,m,s=t.split(":") 90 | except: 91 | 92 | if t=='1900/1/9 7:00': 93 | return 7*3600/3600 94 | elif t=='1900/1/1 2:30': 95 | return (2*3600+30*60)/3600 96 | elif pd.isnull(t): 97 | return np.nan 98 | else: 99 | return 0 100 | 101 | try: 102 | tm = (int(h)*3600+int(m)*60+int(s))/3600 103 | except: 104 | return (30*60)/3600 105 | 106 | return tm 107 | 108 | 109 | # In[8]: 110 | 111 | 112 | # 处理时间差 113 | def getDuration(se): 114 | try: 115 | sh,sm,eh,em=re.findall(r"\d+",se) 116 | # print("sh, sm, eh, em = {}, {}, {}, {}".format(sh, em, eh, em)) 117 | except: 118 | if pd.isnull(se): 119 | return np.nan, np.nan, np.nan 120 | 121 | try: 122 | t_start = (int(sh)*3600 + int(sm)*60)/3600 123 | t_end = (int(eh)*3600 + int(em)*60)/3600 124 | 125 | if t_start > t_end: 126 | tm = t_end - t_start + 24 127 | else: 128 | tm = t_end - t_start 129 | except: 130 | if se=='19:-20:05': 131 | return 19, 20, 1 132 | elif se=='15:00-1600': 133 | return 15, 16, 1 134 | else: 135 | print("se = {}".format(se)) 136 | 137 | 138 | return t_start, t_end, tm 139 | 140 | 141 | # In[9]: 142 | 143 | 144 | class handle_time_str(BaseEstimator, TransformerMixin): 145 | 146 | def __init__(self): 147 | pass 148 | 149 | def fit(self, X, y=None): 150 | return self 151 | 152 | def transform(self, X): 153 | print('-'*30, ' '*5, 'handle_time_str', ' '*5, '-'*30, '\n') 154 | 155 | for f in ['A5','A7','A9','A11','A14','A16','A24','A26','B5','B7']: 156 | try: 157 | X[f] = X[f].apply(timeTranSecond) 158 | except: 159 | print(f,'应该在前面被删除了!') 160 | 161 | 162 | for f in ['A20','A28','B4','B9','B10','B11']: 163 | try: 164 | start_end_diff = X[f].apply(getDuration) 165 | 166 | X[f+'_start'] = start_end_diff.apply(lambda x: x[0]) 167 | X[f+'_end'] = start_end_diff.apply(lambda x: x[1]) 168 | X[f] = start_end_diff.apply(lambda x: x[2]) 169 | 170 | except: 171 | print(f,'应该在前面被删除了!') 172 | return X 173 | 174 | 175 | # ## 计算时间差 176 | 177 | # In[ ]: 178 | 179 | 180 | 181 | 182 | 183 | # In[10]: 184 | 185 | 186 | def t_start_t_end(t): 187 | if pd.isnull(t[0]) or pd.isnull(t[1]): 188 | # print("t_start = {}, t_end = {}, id = {}".format(t[0], t[1], t[2])) 189 | return np.nan 190 | 191 | if t[1] < t[0]: 192 | t[1] += 24 193 | 194 | dt = t[1] - t[0] 195 | 196 | if(dt > 24 or dt < 0): 197 | # print("dt error, t_start = {}, t_end = {}, id = {}".format(t[0], t[1], t[2])) 198 | return np.nan 199 | 200 | return dt 201 | 202 | 203 | # In[11]: 204 | 205 | 206 | class calc_time_diff(BaseEstimator, TransformerMixin): 207 | def __init__(self): 208 | pass 209 | 210 | def fit(self, X, y=None): 211 | return self 212 | 213 | def transform(self, X): 214 | print('-'*30, ' '*5, 'calc_time_diff', ' '*5, '-'*30, '\n') 215 | 216 | # t_start 为时间的开始, tn 为中间的时间,减去 t_start 得到时间差 217 | t_start = ['A9', 'A24', 'B5'] 218 | tn = {'A9':['A11', 'A14', 'A16'], 'A24': ['A26'], 'B5':['B7']} 219 | 220 | # 计算时间差 221 | for t_s in t_start: 222 | for t_e in 
tn[t_s]: 223 | X[t_e+'-'+t_s] = X[[t_s, t_e, 'target']].apply(t_start_t_end, axis=1) 224 | 225 | # 所有结果保留 3 位小数 226 | X = X.apply(lambda x: round(x, 3)) 227 | 228 | print("shape = {}".format(X.shape)) 229 | 230 | return X 231 | 232 | 233 | # ## 处理异常值 234 | 235 | # + __单一类别个数小于 threshold 的值视为异常值, 改为 nan__ 236 | 237 | # In[12]: 238 | 239 | 240 | class handle_outliers(BaseEstimator, TransformerMixin): 241 | 242 | def __init__(self, threshold=2): 243 | self.th = threshold 244 | 245 | def fit(self, X, y=None): 246 | return self 247 | 248 | def transform(self, X): 249 | print('-'*30, ' '*5, 'handle_outliers', ' '*5, '-'*30, '\n') 250 | category_col = [col for col in X if col not in ['id', 'target']] 251 | for col in category_col: 252 | label = X[col].value_counts(dropna=False).index.tolist() 253 | for i, num in enumerate(X[col].value_counts(dropna=False).values): 254 | if num <= self.th: 255 | # print("Number of label {} in feature {} is {}".format(label[i], col, num)) 256 | X.loc[X[col]==label[i], [col]] = np.nan 257 | 258 | print("shape = {}".format(X.shape)) 259 | return X 260 | 261 | 262 | # ## 删除单一类别占比过大特征 263 | 264 | # In[13]: 265 | 266 | 267 | class del_single_feature(BaseEstimator, TransformerMixin): 268 | 269 | def __init__(self, threshold=0.98): 270 | # 删除单一类别占比大于 threshold 的特征 271 | self.th = threshold 272 | 273 | def fit(self, X, y=None): 274 | return self 275 | 276 | def transform(self, X): 277 | print('-'*30, ' '*5, 'del_single_feature', ' '*5, '-'*30, '\n') 278 | category_col = [col for col in X if col not in ['target']] 279 | 280 | for col in category_col: 281 | rate = X[col].value_counts(normalize=True, dropna=False).values[0] 282 | 283 | if rate > self.th: 284 | print("{} 的最大类别占比是 {}, drop it".format(col, rate)) 285 | X.drop(col, axis=1, inplace=True) 286 | 287 | print("shape = {}".format(X.shape)) 288 | return X 289 | 290 | 291 | # # 特征工程 292 | 293 | # ## 获得训练集与测试集 294 | 295 | # In[14]: 296 | 297 | 298 | def split_data(pipe_data, target_name='target'): 299 | 300 | # 特征列名 301 | category_col = [col for col in pipe_data if col not in ['target',target_name]] 302 | 303 | # 训练、测试行索引 304 | train_idx = pipe_data[np.logical_not(pd.isnull(pipe_data[target_name]))].index 305 | test_idx = pipe_data[pd.isnull(pipe_data[target_name])].index 306 | 307 | # 获得 train、test 数据 308 | X_train = pipe_data.loc[train_idx, category_col].values.astype(np.float64) 309 | y_train = np.squeeze(pipe_data.loc[train_idx, [target_name]].values.astype(np.float64)) 310 | X_test = pipe_data.loc[test_idx, category_col].values.astype(np.float64) 311 | 312 | return X_train, y_train, X_test, test_idx 313 | 314 | 315 | # ## xgb(用于特征 nan 值预测) 316 | 317 | # In[15]: 318 | 319 | 320 | ##### xgb 321 | def xgb_predict(X_train, y_train, X_test, params=None, verbose_eval=100): 322 | 323 | if params == None: 324 | xgb_params = {'eta': 0.05, 'max_depth': 10, 'subsample': 0.8, 'colsample_bytree': 0.8, 325 | 'objective': 'reg:linear', 'eval_metric': 'rmse', 'silent': True, 'nthread': 4} 326 | else: 327 | xgb_params = params 328 | 329 | folds = KFold(n_splits=10, shuffle=True, random_state=2018) 330 | oof_xgb = np.zeros(len(X_train)) 331 | predictions_xgb = np.zeros(len(X_test)) 332 | 333 | for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)): 334 | if(verbose_eval): 335 | print("fold n°{}".format(fold_+1)) 336 | print("len trn_idx {}".format(len(trn_idx))) 337 | 338 | trn_data = xgb.DMatrix(X_train[trn_idx], y_train[trn_idx]) 339 | val_data = xgb.DMatrix(X_train[val_idx], y_train[val_idx]) 340 | 341 | 
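        # 补充说明: xgboost 以 watchlist 中最后一项 (val_data, 'valid_data') 作为早停依据,
        # 验证集 rmse 连续 200 轮 (early_stopping_rounds) 无提升即停止训练,
        # 预测时再通过 best_ntree_limit 回退到验证误差最低的那一轮。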
watchlist = [(trn_data, 'train'), (val_data, 'valid_data')] 342 | clf = xgb.train(dtrain=trn_data, 343 | num_boost_round=20000, 344 | evals=watchlist, 345 | early_stopping_rounds=200, 346 | verbose_eval=verbose_eval, 347 | params=xgb_params) 348 | 349 | 350 | oof_xgb[val_idx] = clf.predict(xgb.DMatrix(X_train[val_idx]), ntree_limit=clf.best_ntree_limit) 351 | predictions_xgb += clf.predict(xgb.DMatrix(X_test), ntree_limit=clf.best_ntree_limit) / folds.n_splits 352 | 353 | if(verbose_eval): 354 | print("CV score: {:<8.8f}".format(mean_squared_error(oof_xgb, y_train))) 355 | return oof_xgb, predictions_xgb 356 | 357 | 358 | # ## 根据 B14 构建新特征 359 | 360 | # In[16]: 361 | 362 | 363 | class add_new_features(BaseEstimator, TransformerMixin): 364 | 365 | def __init__(self): 366 | pass 367 | 368 | def fit(self, X, y=None): 369 | return self 370 | 371 | def transform(self, X): 372 | print('-'*30, ' '*5, 'add_new_features', ' '*5, '-'*30, '\n') 373 | 374 | # 经过测试,只有 B14 / B12 有用 375 | 376 | # X['B14/A1'] = X['B14'] / X['A1'] 377 | # X['B14/A3'] = X['B14'] / X['A3'] 378 | # X['B14/A4'] = X['B14'] / X['A4'] 379 | # X['B14/A19'] = X['B14'] / X['A19'] 380 | # X['B14/B1'] = X['B14'] / X['B1'] 381 | # X['B14/B9'] = X['B14'] / X['B9'] 382 | 383 | X['B14/B12'] = X['B14'] / X['B12'] 384 | 385 | print("shape = {}".format(X.shape)) 386 | return X 387 | 388 | 389 | # ## 选择特征, nan 值填充 390 | # 391 | # + __选择可能有效的特征__ (只是为了加快选择时间) 392 | # 393 | # + __利用其他特征预测 nan,取最近值填充__ 394 | 395 | # In[17]: 396 | 397 | 398 | def get_closest(indexes, predicts): 399 | print("From {}".format(predicts)) 400 | 401 | for i, one in enumerate(predicts): 402 | predicts[i] = indexes[np.argsort(abs(indexes - one))[0]] 403 | 404 | print("To {}".format(predicts)) 405 | return predicts 406 | 407 | 408 | def value_select_eval(pipe_data, selected_features): 409 | 410 | # 经过多次测试, 只选择可能是有用的特征 411 | cols_with_nan = [col for col in pipe_data.columns 412 | if pipe_data[col].isnull().sum()>0 and col in selected_features] 413 | 414 | for col in cols_with_nan: 415 | X_train, y_train, X_test, test_idx = split_data(pipe_data, target_name=col) 416 | oof_xgb, predictions_xgb = xgb_predict(X_train, y_train, X_test, verbose_eval=False) 417 | 418 | print("-"*100, end="\n\n") 419 | print("CV normal MAE scores of predicting {} is {}". 
420 | format(col, mean_absolute_error(oof_xgb, y_train)/np.mean(y_train))) 421 | 422 | pipe_data.loc[test_idx, [col]] = get_closest(pipe_data[col].value_counts().index, 423 | predictions_xgb) 424 | 425 | pipe_data = pipe_data[selected_features+['target']] 426 | 427 | return pipe_data 428 | 429 | # pipe_data = value_eval(pipe_data) 430 | 431 | 432 | # In[18]: 433 | 434 | 435 | class selsected_fill_nans(BaseEstimator, TransformerMixin): 436 | 437 | def __init__(self, selected_features = ['A3_null', 'A6', 'A16', 'A25', 'A28', 'A28_end', 438 | 'B5', 'B10_null', 'B11_null', 'B14', 'B14/B12', 'id']): 439 | self.selected_fearutes = selected_features 440 | pass 441 | 442 | def fit(self, X, y=None): 443 | return self 444 | 445 | def transform(self, X): 446 | print('-'*30, ' '*5, 'selsected_fill_nans', ' '*5, '-'*30, '\n') 447 | 448 | X = value_select_eval(X, self.selected_fearutes) 449 | 450 | print("shape = {}".format(X.shape)) 451 | return X 452 | 453 | 454 | # In[19]: 455 | 456 | 457 | def modeling_cross_validation(data): 458 | X_train, y_train, X_test, test_idx = split_data(data, 459 | target_name='target') 460 | oof_xgb, _ = xgb_predict(X_train, y_train, X_test, verbose_eval=False) 461 | print('-'*100, end='\n\n') 462 | return mean_squared_error(oof_xgb, y_train) 463 | 464 | 465 | def featureSelect(data): 466 | 467 | init_cols = [f for f in data.columns if f not in ['target']] 468 | best_cols = init_cols.copy() 469 | best_score = modeling_cross_validation(data[best_cols+['target']]) 470 | print("初始 CV score: {:<8.8f}".format(best_score)) 471 | 472 | for col in init_cols: 473 | best_cols.remove(col) 474 | score = modeling_cross_validation(data[best_cols+['target']]) 475 | print("当前选择特征: {}, CV score: {:<8.8f}, 最佳cv score: {:<8.8f}". 476 | format(col, score, best_score), end=" ") 477 | 478 | if best_score - score > 0.0000004: 479 | best_score = score 480 | print("有效果,删除!!!!") 481 | else: 482 | best_cols.append(col) 483 | print("保留") 484 | 485 | print('-'*100) 486 | print("优化后 CV score: {:<8.8f}".format(best_score)) 487 | return best_cols, best_score 488 | 489 | 490 | # ## 后向选择特征 491 | 492 | # In[20]: 493 | 494 | 495 | class select_feature(BaseEstimator, TransformerMixin): 496 | 497 | def __init__(self, init_features = None): 498 | self.init_features = init_features 499 | pass 500 | 501 | def fit(self, X, y=None): 502 | return self 503 | 504 | def transform(self, X): 505 | print('-'*30, ' '*5, 'select_feature', ' '*5, '-'*30, '\n') 506 | 507 | if self.init_features: 508 | X = X[self.init_features + ['target']] 509 | best_features = self.init_features 510 | else: 511 | best_features = [col for col in X.columns] 512 | 513 | last_feartues = [] 514 | iteration = 0 515 | equal_time = 0 516 | 517 | best_CV = 1 518 | best_CV_feature = [] 519 | 520 | # 打乱顺序,但是使用相同种子,保证每次运行结果相同 521 | np.random.seed(2018) 522 | while True: 523 | print("Iteration = {}\n".format(iteration)) 524 | best_features, score = featureSelect(X[best_features + ['target']]) 525 | 526 | # 保存最优 CV 的参数 527 | if score < best_CV: 528 | best_CV = score 529 | best_CV_feature = best_features 530 | print("Found best score :{}, with features :{}".format(best_CV, best_features)) 531 | 532 | np.random.shuffle(best_features) 533 | print("\nCurrent fearure length = {}".format(len(best_features))) 534 | 535 | # 最终 3 次迭代相同,则终止迭代 536 | if len(best_features) == len(last_feartues): 537 | equal_time += 1 538 | if equal_time == 3: 539 | break 540 | else: 541 | equal_time = 0 542 | 543 | last_feartues = best_features 544 | iteration = iteration + 1 545 | 
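            # 补充说明: 每轮后向选择结束后用固定种子打乱剩余特征的顺序,
            # 使下一轮尝试删除的顺序不同, 以减小删除顺序带来的局部最优影响;
            # 特征数量连续 3 轮不变 (equal_time == 3) 才终止迭代。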
546 | print("\n\n\n") 547 | 548 | return X[best_features + ['target']] 549 | 550 | 551 | # # 训练 552 | 553 | # ## 构建 pipeline, 处理数据 554 | 555 | # In[21]: 556 | 557 | 558 | 559 | 560 | 561 | # ## 自动调参 562 | 563 | # In[22]: 564 | 565 | 566 | def find_best_params(pipe_data, predict_fun, param_grid): 567 | 568 | # 获得 train 和 test, 归一化 569 | X_train, y_train, X_test, test_idx = split_data(pipe_data, target_name='target') 570 | min_max_scaler = MinMaxScaler() 571 | X_train = min_max_scaler.fit_transform(X_train) 572 | X_test = min_max_scaler.transform(X_test) 573 | best_score = 1 574 | 575 | # 遍历所有参数,寻找最优 576 | for params in ParameterGrid(param_grid): 577 | print('-'*100, "\nparams = \n{}\n".format(params)) 578 | 579 | oof, predictions = predict_fun(X_train, y_train, X_test, params=params, verbose_eval=False) 580 | score = mean_squared_error(oof, y_train) 581 | print("CV score: {}, current best score: {}".format(score, best_score)) 582 | 583 | if best_score > score: 584 | print("Found new best score: {}".format(score)) 585 | best_score = score 586 | best_params = params 587 | 588 | 589 | print('\n\nbest params: {}'.format(best_params)) 590 | print('best score: {}'.format(best_score)) 591 | 592 | return best_params 593 | 594 | 595 | # ## lgb 596 | 597 | # In[23]: 598 | 599 | 600 | def lgb_predict(X_train, y_train, X_test, params=None, verbose_eval=100): 601 | 602 | if params == None: 603 | lgb_param = {'num_leaves': 20, 'min_data_in_leaf': 2, 'objective':'regression', 'max_depth': 4, 604 | 'learning_rate': 0.06, "min_child_samples": 3, "boosting": "gbdt", "feature_fraction": 1, 605 | "bagging_freq": 0.7, "bagging_fraction": 1, "bagging_seed": 11, "metric": 'mse', "lambda_l2": 0.003, 606 | "verbosity": -1} 607 | else : 608 | lgb_param = params 609 | 610 | folds = KFold(n_splits=10, shuffle=True, random_state=2018) 611 | oof_lgb = np.zeros(len(X_train)) 612 | predictions_lgb = np.zeros(len(X_test)) 613 | 614 | for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)): 615 | if verbose_eval: 616 | print("fold n°{}".format(fold_+1)) 617 | 618 | trn_data = lgb.Dataset(X_train[trn_idx], y_train[trn_idx]) 619 | val_data = lgb.Dataset(X_train[val_idx], y_train[val_idx]) 620 | 621 | num_round = 10000 622 | clf = lgb.train(lgb_param, trn_data, num_round, valid_sets = [trn_data, val_data], 623 | verbose_eval=verbose_eval, early_stopping_rounds = 100) 624 | 625 | oof_lgb[val_idx] = clf.predict(X_train[val_idx], num_iteration=clf.best_iteration) 626 | 627 | predictions_lgb += clf.predict(X_test, num_iteration=clf.best_iteration) / folds.n_splits 628 | 629 | if verbose_eval: 630 | print("CV score: {:<8.8f}".format(mean_squared_error(oof_lgb, y_train))) 631 | 632 | return oof_lgb, predictions_lgb 633 | 634 | 635 | def init_config(): 636 | parser = argparse.ArgumentParser() 637 | parser.add_argument('--test_type', type=str, 638 | help='Can be B or C, meaning running code with either test B or test C') 639 | args = parser.parse_args() 640 | return args 641 | 642 | 643 | def read_data(train_file_name, test_file_name): 644 | # 读取数据, 改名 645 | train = pd.read_csv(train_file_name, encoding='gb18030') 646 | test = pd.read_csv(test_file_name, encoding='gb18030') 647 | train.rename(columns={'样本id': 'id', '收率': 'target'}, inplace=True) 648 | test.rename(columns={'样本id': 'id', '收率': 'target'}, inplace=True) 649 | target_name = 'target' 650 | 651 | # 存在异常数据,改为 nan 652 | train.loc[1304, 'A25'] = np.nan 653 | train['A25'] = train['A25'].astype(float) 654 | 655 | # 去掉 id 前缀 656 | train['id'] = 
train['id'].apply(lambda x: int(x.split('_')[1])) 657 | test['id'] = test['id'].apply(lambda x: int(x.split('_')[1])) 658 | 659 | train.drop(train[train[target_name] < 0.87].index, inplace=True) 660 | full = pd.concat([train, test], ignore_index=True) 661 | return full 662 | 663 | 664 | def feature_processing(full): 665 | 666 | selected_features = ['A3_null', 'A6', 'A16', 'A25', 'A28', 'A28_end', 667 | 'B5', 'B10_null', 'B11_null', 'B14', 'B14/B12', 'id'] 668 | 669 | pipe = Pipeline([ 670 | ('del_nan_feature', del_nan_feature()), 671 | ('handle_time_str', handle_time_str()), 672 | ('calc_time_diff', calc_time_diff()), 673 | ('Handle_outliers', handle_outliers(2)), 674 | ('del_single_feature', del_single_feature(1)), 675 | ('add_new_features', add_new_features()), 676 | ('selsected_fill_nans', selsected_fill_nans(selected_features)), 677 | ('select_feature', select_feature(selected_features)), 678 | ]) 679 | 680 | pipe_data = pipe.fit_transform(full.copy()) 681 | print(pipe_data.shape) 682 | return pipe_data 683 | 684 | 685 | def train_predict(pipe_data): 686 | 687 | # lgb 688 | param_grid = [ 689 | {'num_leaves': [20], 'min_data_in_leaf': [2, 3], 'objective': ['regression'], 690 | 'max_depth': [3, 4, 5], 'learning_rate': [0.06, 0.12, 0.24], "min_child_samples": [3], 691 | "boosting": ["gbdt"], "feature_fraction": [0.7], "bagging_freq": [1], 692 | "bagging_fraction": [1], "bagging_seed": [11], "metric": ['mse'], 693 | "lambda_l2": [0.0003, 0.001, 0.003], "verbosity": [-1]} 694 | ] 695 | 696 | lgb_best_params = find_best_params(pipe_data, lgb_predict, param_grid) 697 | 698 | X_train, y_train, X_test, test_idx = split_data(pipe_data, target_name='target') 699 | min_max_scaler = MinMaxScaler() 700 | X_train = min_max_scaler.fit_transform(X_train) 701 | X_test = min_max_scaler.transform(X_test) 702 | oof_lgb, predictions_lgb = lgb_predict(X_train, y_train, X_test, params=lgb_best_params, verbose_eval=200) # 703 | 704 | # xgb 705 | param_grid = [ 706 | {'silent': [1], 707 | 'nthread': [4], 708 | 'eval_metric': ['rmse'], 709 | 'eta': [0.03], 710 | 'objective': ['reg:linear'], 711 | 'max_depth': [4, 5, 6], 712 | 'num_round': [1000], 713 | 'subsample': [0.4, 0.6, 0.8, 1], 714 | 'colsample_bytree': [0.7, 0.9, 1], 715 | } 716 | ] 717 | 718 | xgb_best_params = find_best_params(pipe_data, xgb_predict, param_grid) 719 | 720 | X_train, y_train, X_test, test_idx = split_data(pipe_data, target_name='target') 721 | min_max_scaler = MinMaxScaler() 722 | X_train = min_max_scaler.fit_transform(X_train) 723 | X_test = min_max_scaler.transform(X_test) 724 | oof_xgb, predictions_xgb = xgb_predict(X_train, y_train, X_test, params=xgb_best_params, verbose_eval=200) # 725 | 726 | # 模型融合 stacking 727 | train_stack = np.vstack([oof_lgb, oof_xgb]).transpose() 728 | test_stack = np.vstack([predictions_lgb, predictions_xgb]).transpose() 729 | 730 | folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=4590) 731 | oof_stack = np.zeros(train_stack.shape[0]) 732 | predictions = np.zeros(test_stack.shape[0]) 733 | 734 | for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack, y_train)): 735 | print("fold {}".format(fold_)) 736 | trn_data, trn_y = train_stack[trn_idx], y_train[trn_idx] 737 | val_data, val_y = train_stack[val_idx], y_train[val_idx] 738 | 739 | clf_3 = BayesianRidge() 740 | clf_3.fit(trn_data, trn_y) 741 | 742 | oof_stack[val_idx] = clf_3.predict(val_data) 743 | predictions += clf_3.predict(test_stack) / 10 744 | 745 | final_score = mean_squared_error(y_train, oof_stack) 746 | 
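    # 补充说明: final_score 为 stacking 之后整体 oof 预测的均方误差,
    # 与本地 notebook 中用于生成结果文件名的分数是同一指标 (乘 1e8 后取整, 如 11095)。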
print(final_score) 747 | return predictions 748 | 749 | 750 | def gen_submit(test_file_name, result_name, predictions): 751 | # 生成提交结果 752 | sub_df = pd.read_csv(test_file_name, encoding='gb18030') 753 | sub_df = sub_df[['样本id', 'A1']] 754 | sub_df['A1'] = predictions 755 | sub_df['A1'] = sub_df['A1'].apply(lambda x: round(x, 3)) 756 | print("Generating a submit file : {}".format(result_name)) 757 | sub_df.to_csv(result_name, header=0, index=0) 758 | 759 | if __name__ == '__main__': 760 | args = init_config() 761 | print(args, file=sys.stderr) 762 | 763 | if args.test_type in ['B', 'b']: 764 | test_file_name = 'data/jinnan_round1_testB_20190121.csv' 765 | result_name = 'submit_B.csv' 766 | elif args.test_type in ['C', 'c']: 767 | test_file_name = 'data/jinnan_round1_test_20190121.csv' 768 | result_name = 'submit_C.csv' 769 | else: 770 | raise RuntimeError('Need config of test_type, can be only B or C for example: --test_type=B') 771 | 772 | # 设定文件名, 读取文件 773 | train_file_name = 'data/jinnan_round1_train_20181227.csv' 774 | 775 | print("Training file named {} and testing file named {}".format(train_file_name, test_file_name)) 776 | 777 | # read file 778 | full = read_data(train_file_name, test_file_name) 779 | 780 | # feature processing 781 | pipe_data = feature_processing(full) 782 | 783 | # train and predict 784 | predictions = train_predict(pipe_data) 785 | 786 | # generate a submit file 787 | gen_submit(test_file_name, result_name, predictions) 788 | 789 | print("Finished") 790 | -------------------------------------------------------------------------------- /复赛/津南数字制造算法挑战赛+17+Drop/data/optimize.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/taoyafan/jinnan/52035305b2219f6700f2e922825de2d5db1d7abc/复赛/津南数字制造算法挑战赛+17+Drop/data/optimize.csv -------------------------------------------------------------------------------- /复赛/津南数字制造算法挑战赛+17+Drop/main.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | import numpy as np 4 | import pandas as pd 5 | import lightgbm as lgb 6 | import xgboost as xgb 7 | import warnings 8 | import re 9 | import argparse, sys 10 | import pickle 11 | import os 12 | 13 | from sklearn.model_selection import KFold, RepeatedKFold 14 | from sklearn.pipeline import Pipeline 15 | from sklearn.base import BaseEstimator, TransformerMixin 16 | from sklearn.linear_model import BayesianRidge 17 | from sklearn.metrics import mean_squared_error, mean_absolute_error 18 | from sklearn.preprocessing import MinMaxScaler 19 | from sklearn.model_selection import ParameterGrid 20 | 21 | warnings.simplefilter(action='ignore', category=FutureWarning) 22 | warnings.filterwarnings("ignore") 23 | 24 | def init_config(): 25 | parser = argparse.ArgumentParser() 26 | parser.add_argument('--test_type', type=str, 27 | help='Can be B or C, meaning running code with either test B or test C') 28 | args = parser.parse_args() 29 | return args 30 | 31 | 32 | def pkl_load(file_name): 33 | with open(file_name, "rb") as f: 34 | return pickle.load(f) 35 | 36 | 37 | def pkl_save(fname, data, protocol=3): 38 | with open(fname, "wb") as f: 39 | pickle.dump(data, f, protocol) 40 | 41 | 42 | def load_models(): 43 | lgb_models = pkl_load("models/lgb_models.pkl") 44 | xgb_models = pkl_load("models/xgb_models.pkl") 45 | stack_models = pkl_load("models/stack_models.pkl") 46 | min_max_scaler = pkl_load("models/min_max_scaler.pkl") 47 | return lgb_models, xgb_models, 
stack_models, min_max_scaler 48 | 49 | 50 | def read_data(train_file_name, test_file_name): 51 | # 读取数据, 改名 52 | train = pd.read_csv(train_file_name, encoding='gb18030') 53 | test = pd.read_csv(test_file_name, encoding='gb18030') 54 | train.rename(columns={'样本id': 'id', '收率': 'target'}, inplace=True) 55 | test.rename(columns={'样本id': 'id', '收率': 'target'}, inplace=True) 56 | target_name = 'target' 57 | 58 | # 存在异常数据,改为 nan 59 | train.loc[1304, 'A25'] = np.nan 60 | train['A25'] = train['A25'].astype(float) 61 | 62 | # 去掉 id 前缀 63 | train['id'] = train['id'].apply(lambda x: int(x.split('_')[1])) 64 | test['id'] = test['id'].apply(lambda x: int(x.split('_')[1])) 65 | 66 | train.drop(train[train[target_name] < 0.87].index, inplace=True) 67 | _full = pd.concat([train, test], ignore_index=True) 68 | return _full 69 | 70 | 71 | class del_nan_feature(BaseEstimator, TransformerMixin): 72 | 73 | def __init__(self, th_high=0.85, th_low=0.02): 74 | self.th_high = th_high 75 | self.th_low = th_low 76 | 77 | def fit(self, X, y=None): 78 | return self 79 | 80 | def transform(self, X): 81 | print('-' * 30, ' ' * 5, 'del_nan_feature', ' ' * 5, '-' * 30, '\n') 82 | print("shape before process = {}".format(X.shape)) 83 | 84 | # 删除高缺失率特征 85 | X.dropna(axis=1, thresh=(1 - self.th_high) * X.shape[0], inplace=True) 86 | 87 | # 缺失率较高,增加新特征 88 | for col in X.columns: 89 | if col == 'target': 90 | continue 91 | 92 | miss_rate = X[col].isnull().sum() / X.shape[0] 93 | if miss_rate > self.th_low: 94 | print("Missing rate of {} is {:.3f} exceed {}, adding new feature {}". 95 | format(col, miss_rate, self.th_low, col + '_null')) 96 | X[col + '_null'] = 0 97 | X.loc[X[pd.isnull(X[col])].index, [col + '_null']] = 1 98 | print("shape = {}".format(X.shape)) 99 | 100 | return X 101 | 102 | # 处理时间 103 | def timeTranSecond(t): 104 | try: 105 | h,m,s=t.split(":") 106 | except: 107 | 108 | if t=='1900/1/9 7:00': 109 | return 7*3600/3600 110 | elif t=='1900/1/1 2:30': 111 | return (2*3600+30*60)/3600 112 | elif pd.isnull(t): 113 | return np.nan 114 | else: 115 | return 0 116 | 117 | try: 118 | tm = (int(h)*3600+int(m)*60+int(s))/3600 119 | except: 120 | return (30*60)/3600 121 | 122 | return tm 123 | 124 | 125 | # 处理时间差 126 | def getDuration(se): 127 | try: 128 | sh, sm, eh, em = re.findall(r"\d+", se) 129 | # print("sh, sm, eh, em = {}, {}, {}, {}".format(sh, em, eh, em)) 130 | except: 131 | if pd.isnull(se): 132 | return np.nan, np.nan, np.nan 133 | 134 | try: 135 | t_start = (int(sh) * 3600 + int(sm) * 60) / 3600 136 | t_end = (int(eh) * 3600 + int(em) * 60) / 3600 137 | 138 | if t_start > t_end: 139 | tm = t_end - t_start + 24 140 | else: 141 | tm = t_end - t_start 142 | except: 143 | if se == '19:-20:05': 144 | return 19, 20, 1 145 | elif se == '15:00-1600': 146 | return 15, 16, 1 147 | else: 148 | print("se = {}".format(se)) 149 | 150 | return t_start, t_end, tm 151 | 152 | 153 | class handle_time_str(BaseEstimator, TransformerMixin): 154 | 155 | def __init__(self): 156 | pass 157 | 158 | def fit(self, X, y=None): 159 | return self 160 | 161 | def transform(self, X): 162 | print('-' * 30, ' ' * 5, 'handle_time_str', ' ' * 5, '-' * 30, '\n') 163 | 164 | for f in ['A5', 'A7', 'A9', 'A11', 'A14', 'A16', 'A24', 'A26', 'B5', 'B7']: 165 | try: 166 | X[f] = X[f].apply(timeTranSecond) 167 | except: 168 | print(f, '应该在前面被删除了!') 169 | 170 | for f in ['A20', 'A28', 'B4', 'B9', 'B10', 'B11']: 171 | try: 172 | start_end_diff = X[f].apply(getDuration) 173 | 174 | X[f + '_start'] = start_end_diff.apply(lambda x: x[0]) 175 | X[f + '_end'] 
= start_end_diff.apply(lambda x: x[1]) 176 | X[f] = start_end_diff.apply(lambda x: x[2]) 177 | 178 | except: 179 | print(f, '应该在前面被删除了!') 180 | return X 181 | 182 | 183 | def t_start_t_end(t): 184 | if pd.isnull(t[0]) or pd.isnull(t[1]): 185 | # print("t_start = {}, t_end = {}, id = {}".format(t[0], t[1], t[2])) 186 | return np.nan 187 | 188 | if t[1] < t[0]: 189 | t[1] += 24 190 | 191 | dt = t[1] - t[0] 192 | 193 | if (dt > 24 or dt < 0): 194 | # print("dt error, t_start = {}, t_end = {}, id = {}".format(t[0], t[1], t[2])) 195 | return np.nan 196 | 197 | return dt 198 | 199 | 200 | class calc_time_diff(BaseEstimator, TransformerMixin): 201 | def __init__(self): 202 | pass 203 | 204 | def fit(self, X, y=None): 205 | return self 206 | 207 | def transform(self, X): 208 | print('-' * 30, ' ' * 5, 'calc_time_diff', ' ' * 5, '-' * 30, '\n') 209 | 210 | # t_start 为时间的开始, tn 为中间的时间,减去 t_start 得到时间差 211 | t_start = ['A9', 'A24', 'B5'] 212 | tn = {'A9': ['A11', 'A14', 'A16'], 'A24': ['A26'], 'B5': ['B7']} 213 | 214 | # 计算时间差 215 | for t_s in t_start: 216 | for t_e in tn[t_s]: 217 | X[t_e + '-' + t_s] = X[[t_s, t_e, 'target']].apply(t_start_t_end, axis=1) 218 | 219 | # 所有结果保留 3 位小数 220 | X = X.apply(lambda x: round(x, 3)) 221 | 222 | print("shape = {}".format(X.shape)) 223 | 224 | return X 225 | 226 | 227 | class handle_outliers(BaseEstimator, TransformerMixin): 228 | 229 | def __init__(self, threshold=2): 230 | self.th = threshold 231 | 232 | def fit(self, X, y=None): 233 | return self 234 | 235 | def transform(self, X): 236 | print('-' * 30, ' ' * 5, 'handle_outliers', ' ' * 5, '-' * 30, '\n') 237 | category_col = [col for col in X if col not in ['id', 'target']] 238 | for col in category_col: 239 | label = X[col].value_counts(dropna=False).index.tolist() 240 | for i, num in enumerate(X[col].value_counts(dropna=False).values): 241 | if num <= self.th: 242 | # print("Number of label {} in feature {} is {}".format(label[i], col, num)) 243 | X.loc[X[col] == label[i], [col]] = np.nan 244 | 245 | print("shape = {}".format(X.shape)) 246 | return X 247 | 248 | 249 | def split_data(pipe_data, target_name='target', unused_feature=[], min_max_scaler=None): 250 | 251 | # 特征列名 252 | category_col = [col for col in pipe_data if col not in ['target'] + [target_name] + unused_feature] 253 | 254 | # 训练、测试行索引 255 | # 训练集只包括存在 target 和 target_name 的数据 256 | train_idx = pipe_data[np.logical_not( 257 | np.logical_or(pd.isnull(pipe_data[target_name]), pd.isnull(pipe_data['target'])) 258 | )].index 259 | 260 | test_idx = pipe_data[pd.isnull(pipe_data[target_name])].index 261 | 262 | # 获得 train、test 数据 263 | X_train = pipe_data.loc[train_idx, category_col].values.astype(np.float64) 264 | y_train = np.squeeze(pipe_data.loc[train_idx, [target_name]].values.astype(np.float64)) 265 | X_test = pipe_data.loc[test_idx, category_col].values.astype(np.float64) 266 | 267 | # 归一化 268 | if (min_max_scaler == None): 269 | min_max_scaler = MinMaxScaler() 270 | X_train = min_max_scaler.fit_transform(X_train) 271 | else: 272 | X_train = min_max_scaler.transform(X_train) 273 | X_test = min_max_scaler.transform(X_test) 274 | 275 | return X_train, y_train, X_test, test_idx, min_max_scaler 276 | 277 | 278 | ##### xgb 279 | def xgb_predict(X_train, y_train, X_test, params=None, verbose_eval=100): 280 | if params == None: 281 | xgb_params = {'eta': 0.05, 'max_depth': 10, 'subsample': 0.8, 'colsample_bytree': 0.8, 282 | 'objective': 'reg:linear', 'eval_metric': 'rmse', 'silent': True, 'nthread': 4} 283 | else: 284 | xgb_params = params 285 | 
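    # 补充说明: 复赛版与初赛 run.py 的主要区别是此处会把每一折训练出的模型
    # 存入 xgb_models 并一并返回, 之后由 train_models() 统一 pickle 到 models/ 目录,
    # 预测阶段直接加载复用, 无需重新训练。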
286 | folds = KFold(n_splits=10, shuffle=True, random_state=2018) 287 | oof_xgb = np.zeros(len(X_train)) 288 | predictions_xgb = np.zeros(len(X_test)) 289 | xgb_models = [] 290 | 291 | for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)): 292 | if (verbose_eval): 293 | print("fold n°{}".format(fold_ + 1)) 294 | print("len trn_idx {}".format(len(trn_idx))) 295 | 296 | trn_data = xgb.DMatrix(X_train[trn_idx], y_train[trn_idx]) 297 | val_data = xgb.DMatrix(X_train[val_idx], y_train[val_idx]) 298 | 299 | watchlist = [(trn_data, 'train'), (val_data, 'valid_data')] 300 | clf = xgb.train(dtrain=trn_data, 301 | num_boost_round=20000, 302 | evals=watchlist, 303 | early_stopping_rounds=200, 304 | verbose_eval=verbose_eval, 305 | params=xgb_params) 306 | 307 | xgb_models.append(clf) 308 | oof_xgb[val_idx] = clf.predict(xgb.DMatrix(X_train[val_idx]), ntree_limit=clf.best_ntree_limit) 309 | predictions_xgb += clf.predict(xgb.DMatrix(X_test), ntree_limit=clf.best_ntree_limit) / folds.n_splits 310 | 311 | if (verbose_eval): 312 | print("CV score: {:<8.8f}".format(mean_squared_error(oof_xgb, y_train))) 313 | return oof_xgb, predictions_xgb, xgb_models 314 | 315 | 316 | def lgb_predict(X_train, y_train, X_test, params=None, verbose_eval=100): 317 | 318 | if params == None: 319 | lgb_param = {'num_leaves': 20, 'min_data_in_leaf': 2, 'objective': 'regression', 'max_depth': 5, 320 | 'learning_rate': 0.24, "min_child_samples": 3, "boosting": "gbdt", "feature_fraction": 0.7, 321 | "bagging_freq": 1, "bagging_fraction": 1, "bagging_seed": 11, "metric": 'mse', "lambda_l2": 0.003, 322 | "verbosity": -1} 323 | else: 324 | lgb_param = params 325 | 326 | folds = KFold(n_splits=10, shuffle=True, random_state=2018) 327 | oof_lgb = np.zeros(len(X_train)) 328 | predictions_lgb = np.zeros(len(X_test)) 329 | lgb_models = [] 330 | 331 | for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)): 332 | if verbose_eval: 333 | print("fold n°{}".format(fold_ + 1)) 334 | 335 | trn_data = lgb.Dataset(X_train[trn_idx], y_train[trn_idx]) 336 | val_data = lgb.Dataset(X_train[val_idx], y_train[val_idx]) 337 | 338 | num_round = 10000 339 | clf = lgb.train(lgb_param, trn_data, num_round, valid_sets=[trn_data, val_data], 340 | verbose_eval=verbose_eval, early_stopping_rounds=100) 341 | 342 | lgb_models.append(clf) 343 | oof_lgb[val_idx] = clf.predict(X_train[val_idx], num_iteration=clf.best_iteration) 344 | predictions_lgb += clf.predict(X_test, num_iteration=clf.best_iteration) / folds.n_splits 345 | 346 | if verbose_eval: 347 | print("CV score: {:<8.8f}".format(mean_squared_error(oof_lgb, y_train))) 348 | 349 | return oof_lgb, predictions_lgb, lgb_models 350 | 351 | 352 | class add_new_features(BaseEstimator, TransformerMixin): 353 | 354 | def __init__(self): 355 | pass 356 | 357 | def fit(self, X, y=None): 358 | return self 359 | 360 | def transform(self, X): 361 | print('-' * 30, ' ' * 5, 'add_new_features', ' ' * 5, '-' * 30, '\n') 362 | 363 | # 经过测试,只有 B14 / B12 有用 364 | 365 | # X['B14/A1'] = X['B14'] / X['A1'] 366 | # X['B14/A3'] = X['B14'] / X['A3'] 367 | # X['B14/A4'] = X['B14'] / X['A4'] 368 | # X['B14/A19'] = X['B14'] / X['A19'] 369 | # X['B14/B1'] = X['B14'] / X['B1'] 370 | # X['B14/B9'] = X['B14'] / X['B9'] 371 | 372 | X['B14/B12'] = X['B14'] / X['B12'] 373 | 374 | print("shape = {}".format(X.shape)) 375 | return X 376 | 377 | 378 | def predict(data, lgb_models, xgb_models, stack_models, min_max_scaler): 379 | 380 | def _predict(X): 381 | lgb_result = 0 382 | xgb_result = 0 383 | 
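        # 补充说明: 这里是两级融合: 先分别对 lgb / xgb 各 10 折模型的预测取平均,
        # 再把二者堆叠成两列, 交给 10 个 BayesianRidge stacking 模型预测并取平均。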
stack_result = 0 384 | 385 | for clf in lgb_models: 386 | lgb_result += clf.predict(X, num_iteration=clf.best_iteration) / len(lgb_models) 387 | 388 | for clf in xgb_models: 389 | xgb_result += clf.predict(xgb.DMatrix(X), ntree_limit=clf.best_ntree_limit) / len(xgb_models) 390 | 391 | test_stack = np.vstack([lgb_result, xgb_result]).transpose() 392 | for clf in stack_models: 393 | stack_result += clf.predict(test_stack) / len(stack_models) 394 | 395 | return stack_result 396 | 397 | _, _, X_test, test_idx, _ = split_data(data, min_max_scaler=min_max_scaler) 398 | result = _predict(X_test) 399 | return result 400 | 401 | 402 | def feature_processing(full, outlier_th = 3): 403 | selected_features = ['A22', 'A28', 'A20_end', 'B10', 'B11_start', 'A5', 'A10', 'B14/B12', 'B14'] 404 | pipe = Pipeline([ 405 | ('del_nan_feature', del_nan_feature()), 406 | ('handle_time_str', handle_time_str()), 407 | ('calc_time_diff', calc_time_diff()), 408 | ('Handle_outliers', handle_outliers(outlier_th)), 409 | ('add_new_features', add_new_features()), 410 | ]) 411 | 412 | pipe_data = pipe.fit_transform(full.copy())[selected_features+['target']] 413 | print(pipe_data.shape) 414 | return pipe_data 415 | 416 | 417 | def gen_submit(test_file_name, result_name, predictions): 418 | # 生成提交结果 419 | sub_df = pd.read_csv(test_file_name, encoding='gb18030') 420 | sub_df = sub_df[['样本id', 'A1']] 421 | sub_df['A1'] = predictions 422 | sub_df['A1'] = sub_df['A1'].apply(lambda x: round(x, 3)) 423 | print("Generating a submit file : {}".format(result_name)) 424 | sub_df.to_csv(result_name, header=0, index=0) 425 | 426 | 427 | def find_best_params(pipe_data, predict_fun, param_grid): 428 | 429 | # 获得 train 和 test, 归一化 430 | X_train, y_train, X_test, test_idx, _ = split_data(pipe_data, target_name='target') 431 | best_score = 1 432 | 433 | # 遍历所有参数,寻找最优 434 | for params in ParameterGrid(param_grid): 435 | print('-' * 100, "\nparams = \n{}\n".format(params)) 436 | 437 | oof, predictions, _ = predict_fun(X_train, y_train, X_test, params=params, verbose_eval=False) 438 | score = mean_squared_error(oof, y_train) 439 | print("CV score: {}, current best score: {}".format(score, best_score)) 440 | 441 | if best_score > score: 442 | print("Found new best score: {}".format(score)) 443 | best_score = score 444 | best_params = params 445 | 446 | print('\n\nbest params: {}'.format(best_params)) 447 | print('best score: {}'.format(best_score)) 448 | 449 | return best_params 450 | 451 | 452 | def stacking_predict(oof_lgb, oof_xgb, predictions_lgb, predictions_xgb, y_train, verbose_eval=1): 453 | # stacking 454 | train_stack = np.vstack([oof_lgb, oof_xgb]).transpose() 455 | test_stack = np.vstack([predictions_lgb, predictions_xgb]).transpose() 456 | 457 | folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=4590) 458 | oof_stack = np.zeros(train_stack.shape[0]) 459 | predictions = np.zeros(test_stack.shape[0]) 460 | stack_models = [] 461 | 462 | for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack, y_train)): 463 | if verbose_eval: 464 | print("fold {}".format(fold_)) 465 | trn_data, trn_y = train_stack[trn_idx], y_train[trn_idx] 466 | val_data, val_y = train_stack[val_idx], y_train[val_idx] 467 | 468 | clf_3 = BayesianRidge() 469 | clf_3.fit(trn_data, trn_y) 470 | stack_models.append(clf_3) 471 | 472 | oof_stack[val_idx] = clf_3.predict(val_data) 473 | predictions += clf_3.predict(test_stack) / 10 474 | 475 | final_score = mean_squared_error(y_train, oof_stack) 476 | if verbose_eval: 477 | print(final_score) 
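    # 补充说明: 上面 predictions 累加时除以 10, 对应 RepeatedKFold(5 折 × 2 次重复)
    # 共 10 个 stacking 模型的平均; final_score 是 stacking 后 oof 预测的 MSE。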
    return oof_stack, predictions, final_score, stack_models


def train_predict(pipe_data, lgb_best_params, xgb_best_params, verbose_eval=200):
    X_train, y_train, X_test, test_idx, min_max_scaler = split_data(pipe_data, target_name='target')

    oof_lgb, predictions_lgb, lgb_models = lgb_predict(X_train, y_train, X_test,
                                                       params=lgb_best_params, verbose_eval=verbose_eval)
    if verbose_eval:
        print('-' * 100)
    oof_xgb, predictions_xgb, xgb_models = xgb_predict(X_train, y_train, X_test,
                                                       params=xgb_best_params, verbose_eval=verbose_eval)
    if verbose_eval:
        print('-' * 100)
    oof_stack, predictions, final_score, stack_models = stacking_predict(oof_lgb, oof_xgb,
                                                                         predictions_lgb, predictions_xgb, y_train,
                                                                         verbose_eval=verbose_eval)

    return oof_stack, predictions, final_score, lgb_models, xgb_models, stack_models, min_max_scaler


def train_models():
    full = read_data('data/jinnan_round1_train_20181227.csv', 'data/jinnan_round1_testB_20190121.csv')
    pipe_data = feature_processing(full)

    # Search grid for the LightGBM hyper-parameters.
    param_grid = [
        {'num_leaves': [20], 'min_data_in_leaf': [2, 3], 'objective': ['regression'],
         'max_depth': [3, 4, 5], 'learning_rate': [0.06, 0.12, 0.24], "min_child_samples": [3],
         "boosting": ["gbdt"], "feature_fraction": [0.7], "bagging_freq": [1],
         "bagging_fraction": [1], "bagging_seed": [11], "metric": ['mse'],
         "lambda_l2": [0.0003, 0.001, 0.003], "verbosity": [-1]}
    ]

    lgb_best_params = find_best_params(pipe_data, lgb_predict, param_grid)

    # Search grid for the XGBoost hyper-parameters.
    param_grid = [
        {'silent': [1],
         'nthread': [4],
         'eval_metric': ['rmse'],
         'eta': [0.03],
         'objective': ['reg:linear'],
         'max_depth': [4, 5, 6],
         'num_round': [1000],
         'subsample': [0.4, 0.6, 0.8, 1],
         'colsample_bytree': [0.7, 0.9, 1],
         }
    ]

    xgb_best_params = find_best_params(pipe_data, xgb_predict, param_grid)

    oof_stack, predictions, final_score, lgb_models, xgb_models, stack_models, min_max_scaler = train_predict(
        pipe_data, lgb_best_params, xgb_best_params)

    if not os.path.exists('models'):
        os.makedirs('models')

    pkl_save("models/lgb_models.pkl", lgb_models)
    pkl_save("models/xgb_models.pkl", xgb_models)
    pkl_save("models/stack_models.pkl", stack_models)
    pkl_save("models/min_max_scaler.pkl", min_max_scaler)


if __name__ == '__main__':
    args = init_config()
    print(args, file=sys.stderr)
    outlier_th = 3

    if args.test_type in ['B', 'b']:
        test_file_name = 'data/jinnan_round1_testB_20190121.csv'
        result_name = 'submit_B.csv'
    elif args.test_type in ['C', 'c']:
        test_file_name = 'data/jinnan_round1_test_20190201.csv'
        result_name = 'submit_C.csv'
    elif args.test_type in ['A', 'a']:
        test_file_name = 'data/jinnan_round1_testA_20181227.csv'
        result_name = 'submit_A.csv'
    elif args.test_type in ['fusai', 'FuSai', 'Fusai', 'fuSai']:
        test_file_name = 'data/FuSai.csv'
        result_name = 'submit_FuSai.csv'
    elif args.test_type in ['gen', 'Gen', 'GEN']:
        test_file_name = 'data/optimize.csv'
        result_name = 'submit_optimize.csv'
        # Use a different outlier threshold for the generated parameter file.
        outlier_th = 0
    else:
        raise RuntimeError('test_type must be one of A, B, C, fusai or gen, for example: --test_type=B')

    # Set the file names and read the data.
    train_file_name = 'data/jinnan_round1_train_20181227.csv'
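    # Note: train_models() (called below) always builds its features from the
    # round-1 train file plus the testB file, regardless of --test_type; the
    # test file selected above only affects the final prediction and the
    # generated submission.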
    print("Training file: {}, testing file: {}".format(train_file_name, test_file_name))

    print("Generating training models")
    train_models()
    print("Saved training models to directory: 'models'")

    full = read_data(train_file_name, test_file_name)
    lgb_models, xgb_models, stack_models, min_max_scaler = load_models()

    # feature processing
    pipe_data = feature_processing(full, outlier_th=outlier_th)

    # predict with the loaded models
    predictions = predict(pipe_data, lgb_models, xgb_models, stack_models, min_max_scaler)

    # generate a submit file
    gen_submit(test_file_name, result_name, predictions)

    print("Finished")
--------------------------------------------------------------------------------
/复赛/津南数字制造算法挑战赛+17+Drop/readme.md:
--------------------------------------------------------------------------------
# Environment requirements

## System

ubuntu 16.04

## Python version

python 3.5 or above

## Required libraries

numpy, pandas, lightgbm, xgboost, sklearn, pickle

# How to run

Generate the prediction file for the final round (writes submit_FuSai.csv):

    python3 main.py --test_type=fusai

Generate the prediction file for optimize.csv (writes submit_optimize.csv):

    python3 main.py --test_type=gen

# Contact

If you have any questions, please feel free to contact us:

陶亚凡: 765370813@qq.com

Blue: cy_1995@qq.com
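# Appendix: reloading the saved models

main.py saves the per-fold models under `models/` with `pkl_save`. Assuming `pkl_save` writes ordinary pickle files (which is how `load_models` in main.py reads them back), the following minimal sketch shows how they could be inspected on their own; the `pkl_load` helper below is hypothetical and only for illustration:

    import pickle

    def pkl_load(path):
        # Plain pickle deserialization, mirroring what pkl_save is assumed to write.
        with open(path, 'rb') as f:
            return pickle.load(f)

    lgb_models = pkl_load('models/lgb_models.pkl')      # LightGBM boosters, one per fold
    xgb_models = pkl_load('models/xgb_models.pkl')      # XGBoost boosters, one per fold
    stack_models = pkl_load('models/stack_models.pkl')  # BayesianRidge stacking models
    print(len(lgb_models), len(xgb_models), len(stack_models))

--------------------------------------------------------------------------------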