├── README.md
├── 初赛
│   ├── history
│   │   ├── V0_学习.ipynb
│   │   ├── V1_增加特征相关性.ipynb
│   │   ├── V2_增加特征可视化和特征选择.ipynb
│   │   ├── V3_结构重构更规范化.ipynb
│   │   └── python-data-visualizations.ipynb
│   ├── 最终程序.ipynb
│   └── 津南数字制造算法挑战赛+20+Drop
│       ├── readme.md
│       └── run.py
└── 复赛
    ├── 复赛.ipynb
    └── 津南数字制造算法挑战赛+17+Drop
        ├── data
        │   └── optimize.csv
        ├── main.py
        └── readme.md
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | # Tianchi competition: [津南数字制造算法挑战赛【赛场一】](https://tianchi.aliyun.com/competition/entrance/231695/introduction), code of team Drop (17th in the preliminary round; semi-final result unknown at the time of writing), covering both the preliminary and the semi-final rounds
4 |
5 | ## Update (March 9, 2019):
6 |
7 | Added the semi-final programs, covering yield prediction and optimal-parameter generation.
8 |
9 | + __Yield prediction:__
10 |
11 | Similar to the preliminary round, with the following changes:
12 |
13 | (1) Abnormal values are no longer predicted from the other features; they are simply set to missing (a minimal sketch follows this list), which in the end worked even better than feature-based imputation.
14 |
15 | (2) The id feature is not used.
16 |
17 | (3) After repeated rounds of feature selection we settled on 9 features, so the feature-selection step itself is omitted.
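A minimal sketch of change (1) on toy data; the column name and the plausible range here are illustrative assumptions, not values from the actual program:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real training data; A25 is used only as an example.
train = pd.DataFrame({'A25': [75.0, 80.0, 999.0]})
lo, hi = 70, 90  # hypothetical plausible range for the feature

# Out-of-range values simply become missing instead of being re-predicted.
train.loc[~train['A25'].between(lo, hi), 'A25'] = np.nan
print(train)
```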
18 |
19 | + __Optimal-parameter generation:__
20 |
21 | Two methods were tested, and particle swarm optimization was adopted in the end (a sketch follows this list):
22 |
23 | (1) Gradient descent: the gradient with respect to each feature is estimated numerically and the parameters are updated by gradient descent. This turned out to be essentially useless for tree models, presumably because a small perturbation of a feature rarely changes which side of a split it falls on, so the numerical gradient is 0 most of the time.
24 |
25 | (2) Particle swarm optimization: each particle represents one set of parameters, the predicted yield serves as the fitness, and the swarm searches for the best particle. Several PSO settings were tried to make sure a good optimum is found.
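A minimal particle-swarm sketch of the idea, not the competition code itself: `predict` stands for any trained model's batch prediction function, and the box bounds, swarm size and coefficients are illustrative assumptions.

```python
import numpy as np

def pso_maximize(predict, low, high, n_particles=50, n_iters=200,
                 w=0.7, c1=1.5, c2=1.5, seed=2019):
    """Search the box [low, high] for the input with the highest prediction.
    `predict` maps an (n, d) array to an (n,) array of predicted yields."""
    rng = np.random.RandomState(seed)
    low, high = np.asarray(low, float), np.asarray(high, float)
    x = rng.uniform(low, high, size=(n_particles, len(low)))  # positions
    v = np.zeros_like(x)                                      # velocities
    p_best, p_score = x.copy(), predict(x)                    # personal bests
    g_best = p_best[np.argmax(p_score)]                       # global best
    for _ in range(n_iters):
        r1, r2 = rng.rand(*x.shape), rng.rand(*x.shape)
        v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
        x = np.clip(x + v, low, high)
        score = predict(x)
        improved = score > p_score
        p_best[improved], p_score[improved] = x[improved], score[improved]
        g_best = p_best[np.argmax(p_score)]
    return g_best, p_score.max()

# Toy usage: the optimum of this concave "yield" is at (0.3, 0.3).
best_x, best_yield = pso_maximize(lambda x: -((x - 0.3) ** 2).sum(axis=1),
                                  low=[0, 0], high=[1, 1])
```

With the real model one would pass its batch `predict` and, for each selected feature, bounds taken from the observed data range.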
26 |
27 | ## Updated files:
28 |
29 | 复赛.ipynb is the local program, covering both yield prediction and optimal-parameter generation.
30 |
31 | The folder “津南数字制造算法挑战赛+17+Drop” is the final submitted program: yield prediction only, without optimal-parameter generation.
32 |
33 | # Original content below:
34 |
35 | -------------------------
36 |
37 | Tianchi competition: [津南数字制造算法挑战赛【赛场一】](https://tianchi.aliyun.com/competition/entrance/231695/introduction), code of team Drop, 17th in the preliminary round
38 |
39 | + Author: 陶亚凡, Shaanxi University of Science and Technology
40 |
41 | My GitHub and blog:
42 |
43 | github: https://github.com/taoyafan
44 |
45 | Blog: https://me.csdn.net/taoyafan
46 |
47 | + Teammate Blue (University of Electronic Science and Technology of China), GitHub and blog:
48 |
49 | github: https://github.com/BluesChang
50 |
51 | Blog: https://blueschang.github.io
52 |
53 | Every part of this program draws heavily on 鱼佬's baseline.
54 |
55 | Many thanks to my teammate for his contributions, to 鱼佬 and his baseline, and to 林有夕, from whom I kept learning new things in the group chat.
56 |
57 | This is the first ML competition I have entered; I learned pandas, sklearn and the related topics by working through 鱼佬's baseline, so my skill is really limited. Comments and suggestions are very welcome: if you see a problem or something that could be improved, please tell me, many thanks.
58 |
59 | I had long wanted to open-source this. The score is not great, but while we are still on the front page I am releasing it. I entered to learn, not to chase a prize, so nothing in this program is held back.
60 |
61 | ## Requirements
62 |
63 | + __OS__
64 |
65 | Ubuntu 16.04
66 |
67 | + __Python version__
68 |
69 | Python 3.5 or later
70 |
71 | + __Required libraries__
72 |
73 | numpy, pandas, lightgbm, xgboost, sklearn
74 |
75 | ## Files
76 |
77 | The repo contains the [preliminary round](https://github.com/taoyafan/jinnan/tree/master/%E5%88%9D%E8%B5%9B) and the semi-final (to be open-sourced after the competition ends). The preliminary round includes the [local final program](https://github.com/taoyafan/jinnan/blob/master/%E5%88%9D%E8%B5%9B/%E6%9C%80%E7%BB%88%E7%A8%8B%E5%BA%8F.ipynb) (.ipynb), the [submitted program](https://github.com/taoyafan/jinnan/tree/master/%E5%88%9D%E8%B5%9B/%E6%B4%A5%E5%8D%97%E6%95%B0%E5%AD%97%E5%88%B6%E9%80%A0%E7%AE%97%E6%B3%95%E6%8C%91%E6%88%98%E8%B5%9B%2B20%2BDrop) (.py) and the [history programs](https://github.com/taoyafan/jinnan/tree/master/%E5%88%9D%E8%B5%9B/history).
78 |
79 | Start with the [local final program](https://github.com/taoyafan/jinnan/blob/master/%E5%88%9D%E8%B5%9B/%E6%9C%80%E7%BB%88%E7%A8%8B%E5%BA%8F.ipynb): its headings, comments and outputs are fairly complete, which makes it the easiest to read.
80 |
81 | The [submitted program](https://github.com/taoyafan/jinnan/tree/master/%E5%88%9D%E8%B5%9B/%E6%B4%A5%E5%8D%97%E6%95%B0%E5%AD%97%E5%88%B6%E9%80%A0%E7%AE%97%E6%B3%95%E6%8C%91%E6%88%98%E8%B5%9B%2B20%2BDrop) is the local program converted to .py and slightly polished (a main function, structure, paths, input validation, etc.); see the readme inside for details.
82 |
83 | The [history programs](https://github.com/taoyafan/jinnan/tree/master/%E5%88%9D%E8%B5%9B/history) record my learning process, starting from studying 鱼佬's program; some of it is useful, some is not.
84 |
85 | ## Program overview
86 |
87 | + __Program outline:__
88 |
89 | Read data -> manually fix obvious outliers in the training set -> data cleaning -> feature engineering -> training
90 |
91 | __Data cleaning:__
92 |
93 | Drop features with too many missing values -> parse time (and time-range) strings -> compute time differences -> handle outliers -> drop features dominated by a single category
94 |
95 | __Feature engineering:__
96 |
97 | Build new features -> predict nan values from correlated features -> backward feature selection
98 |
99 | __Training:__
100 |
101 | Tune lgb and xgb automatically, then blend them (a minimal sketch follows)
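A sketch of the blending step on toy arrays, assuming `oof_*` and `predictions_*` come out of the two CV loops; the 50/50 weights are only an illustration:

```python
import numpy as np

# Toy stand-ins for the out-of-fold and test predictions of the two models.
oof_lgb, oof_xgb = np.array([0.90, 0.92]), np.array([0.91, 0.93])
predictions_lgb, predictions_xgb = np.array([0.94]), np.array([0.95])

# Simple average blend; the weights can be tuned against the oof score.
blend_oof = 0.5 * oof_lgb + 0.5 * oof_xgb
blend_predictions = 0.5 * predictions_lgb + 0.5 * predictions_xgb
```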
102 |
103 | + __Data path:__
104 |
105 | Starting from 鱼佬's baseline my code accumulated too many variable names, and small errors kept creeping in between runs, so I restructured the whole thing several times and ended up with a pipeline that covers all of the data cleaning and feature engineering. With fewer variables and decoupled stages, the data path deserves a short description (a sketch follows):
106 |
107 | Read data into train and test ----> concatenate into full ---> run the pipeline to get pipe_data ---> split into X_train and X_test ---> train to get oof and predictions
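The same path as a sketch, assuming the transformer classes and `split_data` defined in 最终程序.ipynb are in scope and `train`/`test` are the loaded DataFrames:

```python
import pandas as pd
from sklearn.pipeline import Pipeline

full = pd.concat([train, test], ignore_index=True)  # test rows have target = nan

pipe = Pipeline([
    ('del_nan', del_nan_feature()),       # drop high-missing features, add *_null flags
    ('time_str', handle_time_str()),      # parse time strings into hours
    ('time_diff', calc_time_diff()),      # durations between process steps
    ('outliers', handle_outliers()),      # rare category values -> nan
    ('single', del_single_feature()),     # drop near-constant features
    ('new_feat', add_new_features()),     # e.g. B14/B12
    ('fill_nan', selsected_fill_nans()),  # xgb-based nan imputation
    ('select', select_feature()),         # backward feature selection
])

pipe_data = pipe.fit_transform(full)
X_train, y_train, X_test, test_idx = split_data(pipe_data)
```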
108 |
109 | ## How to run
110 |
111 | The top-level directory of 复赛 also needs data and result folders, holding the training/test data and the generated results respectively. Generated result files are named 测试名\_模型名\_结果\_特征数量\_时间.csv, i.e. test name_model name_result_number of features_time (the submitted program follows the official naming instead).
112 |
--------------------------------------------------------------------------------
/初赛/history/python-data-visualizations.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "_cell_guid": "e748dd89-de20-44f2-a122-b2bb69fbab24",
7 | "_uuid": "a42ede279bffeecdddd64047e06fee4b9aed50c5"
8 | },
9 | "source": [
10 | "## This notebook demos Python data visualizations on the Iris dataset\n",
11 | "\n",
12 | "This Python 3 environment comes with many helpful analytics libraries installed. It is defined by the [kaggle/python docker image](https://github.com/kaggle/docker-python)\n",
13 | "\n",
14 | "We'll use three libraries for this tutorial: [pandas](http://pandas.pydata.org/), [matplotlib](http://matplotlib.org/), and [seaborn](http://stanford.edu/~mwaskom/software/seaborn/).\n",
15 | "\n",
16 | "Press \"Fork\" at the top-right of this screen to run this notebook yourself and build each of the examples."
17 | ]
18 | },
19 | {
20 | "cell_type": "code",
21 | "execution_count": 1,
22 | "metadata": {
23 | "_cell_guid": "136008bf-b756-49c1-bc5e-81c1247b969d",
24 | "_uuid": "4a72555be32be45a318141821b58ceac28ffb0d7"
25 | },
26 | "outputs": [],
27 | "source": [
28 | "# First, we'll import pandas, a data processing and CSV file I/O library\n",
29 | "import pandas as pd\n",
30 | "\n",
31 | "# We'll also import seaborn, a Python graphing library\n",
32 | "import warnings # current version of seaborn generates a bunch of warnings that we'll ignore\n",
33 | "warnings.filterwarnings(\"ignore\")\n",
34 | "import seaborn as sns\n",
35 | "import matplotlib.pyplot as plt\n",
36 | "sns.set(style=\"white\", color_codes=True)\n",
37 | "\n",
38 | "# Next, we'll load the Iris flower dataset, which is in the \"../input/\" directory\n",
39 | "iris = pd.read_csv(\"../input/Iris.csv\") # the iris dataset is now a Pandas DataFrame\n",
40 | "\n",
41 | "# Let's see what's in the iris data - Jupyter notebooks print the result of the last thing you do\n",
42 | "iris.head()\n",
43 | "\n",
44 | "# Press shift+enter to execute this cell"
45 | ]
46 | },
47 | {
48 | "cell_type": "code",
49 | "execution_count": 2,
50 | "metadata": {
51 | "_cell_guid": "5dba36af-1bb8-49e5-9b49-1451f4136246",
52 | "_uuid": "ef33a54d1e704924d1eb29632728011d31bfb543"
53 | },
54 | "outputs": [],
55 | "source": [
56 | "# Let's see how many examples we have of each species\n",
57 | "iris[\"Species\"].value_counts()"
58 | ]
59 | },
60 | {
61 | "cell_type": "code",
62 | "execution_count": 3,
63 | "metadata": {
64 | "_cell_guid": "b8588972-deb5-4094-99a6-5feb722e3301",
65 | "_uuid": "b61dbe844a638b1b26e0c3f16a104570d4b60010"
66 | },
67 | "outputs": [],
68 | "source": [
69 | "# The first way we can plot things is using the .plot extension from Pandas dataframes\n",
70 | "# We'll use this to make a scatterplot of the Iris features.\n",
71 | "iris.plot(kind=\"scatter\", x=\"SepalLengthCm\", y=\"SepalWidthCm\")"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": 4,
77 | "metadata": {
78 | "_cell_guid": "dc213965-5341-4ce7-ad13-42eb5e2fa1e7",
79 | "_uuid": "81da4a44d4ec41f5c7acd172c75df2f47884a13e"
80 | },
81 | "outputs": [],
82 | "source": [
83 | "# We can also use the seaborn library to make a similar plot\n",
84 | "# A seaborn jointplot shows bivariate scatterplots and univariate histograms in the same figure\n",
85 | "sns.jointplot(x=\"SepalLengthCm\", y=\"SepalWidthCm\", data=iris, size=5)"
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "execution_count": 5,
91 | "metadata": {
92 | "_cell_guid": "0a5c46f6-be6e-4ef6-94a4-9bea13c9a0aa",
93 | "_uuid": "d07401f715fa8f39951a6212bce668657d457fe1"
94 | },
95 | "outputs": [],
96 | "source": [
97 | "# One piece of information missing in the plots above is what species each plant is\n",
98 | "# We'll use seaborn's FacetGrid to color the scatterplot by species\n",
99 | "sns.FacetGrid(iris, hue=\"Species\", size=5) \\\n",
100 | " .map(plt.scatter, \"SepalLengthCm\", \"SepalWidthCm\") \\\n",
101 | " .add_legend()"
102 | ]
103 | },
104 | {
105 | "cell_type": "code",
106 | "execution_count": 6,
107 | "metadata": {
108 | "_cell_guid": "128245d5-6f01-44cd-8b2f-8a49735ac552",
109 | "_uuid": "01cb1b0849f6c7e800c8798164741a8fdae53617"
110 | },
111 | "outputs": [],
112 | "source": [
113 | "# We can look at an individual feature in Seaborn through a boxplot\n",
114 | "sns.boxplot(x=\"Species\", y=\"PetalLengthCm\", data=iris)"
115 | ]
116 | },
117 | {
118 | "cell_type": "code",
119 | "execution_count": 7,
120 | "metadata": {
121 | "_cell_guid": "b86a675c-f604-496a-931a-df76d7d6aaa1",
122 | "_uuid": "a481595c1e46d625e887b61f5eb0e3c48269bde9"
123 | },
124 | "outputs": [],
125 | "source": [
126 | "# One way we can extend this plot is adding a layer of individual points on top of\n",
127 | "# it through Seaborn's striplot\n",
128 | "# \n",
129 | "# We'll use jitter=True so that all the points don't fall in single vertical lines\n",
130 | "# above the species\n",
131 | "#\n",
132 | "# Saving the resulting axes as ax each time causes the resulting plot to be shown\n",
133 | "# on top of the previous axes\n",
134 | "ax = sns.boxplot(x=\"Species\", y=\"PetalLengthCm\", data=iris)\n",
135 | "ax = sns.stripplot(x=\"Species\", y=\"PetalLengthCm\", data=iris, jitter=True, edgecolor=\"gray\")"
136 | ]
137 | },
138 | {
139 | "cell_type": "code",
140 | "execution_count": 8,
141 | "metadata": {
142 | "_cell_guid": "c49f199b-2798-4fdc-87a7-bd2f7f8ff447",
143 | "_uuid": "0d422fc672f3cfb30ec02d1345942cc583c51b05"
144 | },
145 | "outputs": [],
146 | "source": [
147 | "# A violin plot combines the benefits of the previous two plots and simplifies them\n",
148 | "# Denser regions of the data are fatter, and sparser thiner in a violin plot\n",
149 | "sns.violinplot(x=\"Species\", y=\"PetalLengthCm\", data=iris, size=6)"
150 | ]
151 | },
152 | {
153 | "cell_type": "code",
154 | "execution_count": 9,
155 | "metadata": {
156 | "_cell_guid": "78c32fc8-3c36-482a-81f4-14d4b6ee1430",
157 | "_uuid": "b10aa16c47bdad1964d1746281564f68a5ab741e"
158 | },
159 | "outputs": [],
160 | "source": [
161 | "# A final seaborn plot useful for looking at univariate relations is the kdeplot,\n",
162 | "# which creates and visualizes a kernel density estimate of the underlying feature\n",
163 | "sns.FacetGrid(iris, hue=\"Species\", size=6) \\\n",
164 | " .map(sns.kdeplot, \"PetalLengthCm\") \\\n",
165 | " .add_legend()"
166 | ]
167 | },
168 | {
169 | "cell_type": "code",
170 | "execution_count": 10,
171 | "metadata": {
172 | "_cell_guid": "7351999e-4522-451f-b3f1-0031c3a88eaa",
173 | "_uuid": "fb9e2f61bf81478f21489f1219358e2b6fa164dd"
174 | },
175 | "outputs": [],
176 | "source": [
177 | "# Another useful seaborn plot is the pairplot, which shows the bivariate relation\n",
178 | "# between each pair of features\n",
179 | "# \n",
180 | "# From the pairplot, we'll see that the Iris-setosa species is separataed from the other\n",
181 | "# two across all feature combinations\n",
182 | "sns.pairplot(iris.drop(\"Id\", axis=1), hue=\"Species\", size=3)"
183 | ]
184 | },
185 | {
186 | "cell_type": "code",
187 | "execution_count": 11,
188 | "metadata": {
189 | "_cell_guid": "3f1fb3ba-e0fd-45b4-8a64-fe2a689bb83b",
190 | "_uuid": "417d197016286a1af02eb522b3a0e0476e76b39b"
191 | },
192 | "outputs": [],
193 | "source": [
194 | "# The diagonal elements in a pairplot show the histogram by default\n",
195 | "# We can update these elements to show other things, such as a kde\n",
196 | "sns.pairplot(iris.drop(\"Id\", axis=1), hue=\"Species\", size=3, diag_kind=\"kde\")"
197 | ]
198 | },
199 | {
200 | "cell_type": "code",
201 | "execution_count": 12,
202 | "metadata": {
203 | "_cell_guid": "46cceec5-3525-4b02-8ab7-5ed1420cd198",
204 | "_uuid": "d7fb122f77031cc79ab0e922608d9e6c5de774ca"
205 | },
206 | "outputs": [],
207 | "source": [
208 | "# Now that we've covered seaborn, let's go back to some of the ones we can make with Pandas\n",
209 | "# We can quickly make a boxplot with Pandas on each feature split out by species\n",
210 | "iris.drop(\"Id\", axis=1).boxplot(by=\"Species\", figsize=(12, 6))"
211 | ]
212 | },
213 | {
214 | "cell_type": "code",
215 | "execution_count": 13,
216 | "metadata": {
217 | "_cell_guid": "5bbed28c-d813-41c4-824d-7038fbfee6ea",
218 | "_uuid": "61c76e99340b06c8020151ae4b8942e1daa8b1ef"
219 | },
220 | "outputs": [],
221 | "source": [
222 | "# One cool more sophisticated technique pandas has available is called Andrews Curves\n",
223 | "# Andrews Curves involve using attributes of samples as coefficients for Fourier series\n",
224 | "# and then plotting these\n",
225 | "from pandas.tools.plotting import andrews_curves\n",
226 | "andrews_curves(iris.drop(\"Id\", axis=1), \"Species\")"
227 | ]
228 | },
229 | {
230 | "cell_type": "code",
231 | "execution_count": 14,
232 | "metadata": {
233 | "_cell_guid": "77c1b6f0-7632-4d61-bf03-7b5d6856b987",
234 | "_uuid": "b9ac80fdd71c270c9991d34ca87f70d6b00b2192"
235 | },
236 | "outputs": [],
237 | "source": [
238 | "# Another multivariate visualization technique pandas has is parallel_coordinates\n",
239 | "# Parallel coordinates plots each feature on a separate column & then draws lines\n",
240 | "# connecting the features for each data sample\n",
241 | "from pandas.tools.plotting import parallel_coordinates\n",
242 | "parallel_coordinates(iris.drop(\"Id\", axis=1), \"Species\")"
243 | ]
244 | },
245 | {
246 | "cell_type": "code",
247 | "execution_count": 15,
248 | "metadata": {
249 | "_cell_guid": "d5c6314f-7b36-41ce-b0bd-e2ef17941f97",
250 | "_uuid": "38b7de27f1f882347de21193d93bf474f96c2288"
251 | },
252 | "outputs": [],
253 | "source": [
254 | "# A final multivariate visualization technique pandas has is radviz\n",
255 | "# Which puts each feature as a point on a 2D plane, and then simulates\n",
256 | "# having each sample attached to those points through a spring weighted\n",
257 | "# by the relative value for that feature\n",
258 | "from pandas.tools.plotting import radviz\n",
259 | "radviz(iris.drop(\"Id\", axis=1), \"Species\")"
260 | ]
261 | },
262 | {
263 | "cell_type": "markdown",
264 | "metadata": {
265 | "_cell_guid": "0263903e-4c3f-41c5-adf6-a1a12c122ddb",
266 | "_uuid": "a47be9b234eb942e71425b3e00b741a41488ea33"
267 | },
268 | "source": [
269 | "# Wrapping Up\n",
270 | "\n",
271 | "I hope you enjoyed this quick introduction to some of the quick, simple data visualizations you can create with pandas, seaborn, and matplotlib in Python!\n",
272 | "\n",
273 | "I encourage you to run through these examples yourself, tweaking them and seeing what happens. From there, you can try applying these methods to a new dataset and incorprating them into your own workflow!\n",
274 | "\n",
275 | "See [Kaggle Datasets](https://www.kaggle.com/datasets) for other datasets to try visualizing. The [World Food Facts data](https://www.kaggle.com/openfoodfacts/world-food-facts) is an especially rich one for visualization."
276 | ]
277 | }
278 | ],
279 | "metadata": {
280 | "kernelspec": {
281 | "display_name": "Python 3",
282 | "language": "python",
283 | "name": "python3"
284 | },
285 | "language_info": {
286 | "codemirror_mode": {
287 | "name": "ipython",
288 | "version": 3
289 | },
290 | "file_extension": ".py",
291 | "mimetype": "text/x-python",
292 | "name": "python",
293 | "nbconvert_exporter": "python",
294 | "pygments_lexer": "ipython3",
295 | "version": "3.6.7"
296 | },
297 | "toc": {
298 | "base_numbering": 1,
299 | "nav_menu": {},
300 | "number_sections": true,
301 | "sideBar": true,
302 | "skip_h1_title": false,
303 | "title_cell": "Table of Contents",
304 | "title_sidebar": "Contents",
305 | "toc_cell": false,
306 | "toc_position": {},
307 | "toc_section_display": true,
308 | "toc_window_display": false
309 | }
310 | },
311 | "nbformat": 4,
312 | "nbformat_minor": 1
313 | }
314 |
--------------------------------------------------------------------------------
/初赛/最终程序.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "+ __说明:__\n",
8 | "\n",
9 | "本程序是 阿里云天池 的比赛 津南数字制造算法挑战赛(赛场一) B 榜20名提交结果所使用的最终程序\n",
10 | "\n",
11 | "队伍名称 Drop, 作者: 陶亚凡 陕西科技大学\n",
12 | "\n",
13 | "我的 github 和 博客:(虽然没东西把, 但是还是求波关注, follow, stars /捂脸)\n",
14 | "\n",
15 | "github: https://github.com/taoyafan\n",
16 | "\n",
17 | "博客: https://me.csdn.net/taoyafan\n",
18 | "\n",
19 | "队友(Blue 电子科技大学的) github 和 博客:(也求波关注, follow, stars /捂脸)\n",
20 | "\n",
21 | "github:https://github.com/BluesChang\n",
22 | "\n",
23 | "博客:https://blueschang.github.io\n",
24 | "\n",
25 | "程序各个部分很大程度的参考了鱼佬的 baseline\n",
26 | "\n",
27 | "感谢队友的很多贡献, 感谢鱼佬和他的的 baseline, 感谢林有夕大佬让我在群里不停的学到新知识.\n",
28 | "\n",
29 | "因为我这是第一了 ML 的比赛, 看着鱼佬 baseline 开始学习 pandas, sklearn 还有相关知识, 所以水平实在有限. 希望各位大佬给点意见或建议, 有什么问题或者可以改进的地方请告知我, 灰常感谢. \n",
30 | "\n",
31 | "一直想开源, 但是成绩太差, 趁着还在首页, 赶快开源了, 我也是抱着学习的心态, 也没想着拿奖, 所以这个程序也没啥保留~\n",
32 | "\n",
33 | "+ __程序思路:__ \n",
34 | "\n",
35 | "读取数 -> 手动处理训练集明显异常数据 -> 数据清洗 -> 特征工程 -> 训练\n",
36 | "\n",
37 | "__数据清洗:__\n",
38 | "\n",
39 | "删除缺失率过高的数据 -> 处理字符时间(段) -> 计算时间差 -> 处理异常值 -> 删除单一类别占比过大的特征\n",
40 | "\n",
41 | "__特征工程:__\n",
42 | "\n",
43 | "构建新特征 -> 利用特征之间的相关性预测 nan 值 -> 后向特征选择\n",
44 | "\n",
45 | "__训练:__\n",
46 | "\n",
47 | "使用 lgb 和 xgb 自动选择最优参数, 然后融合\n",
48 | "\n",
49 | "+ __数据通路:__\n",
50 | "\n",
51 | "开始学鱼佬的 baseline 自己写着写着变量名太多了,前后运行总是出现各种小错误, 所以对整体结构改了好多次, 最终使用了 pipe line, 包括了整个数据清洗和特征工程部分, 所以变量名少了, 各个部分也不存在耦合了, 所以我觉得有必要介绍下数据通路:\n",
52 | "\n",
53 | "数据读取得到 train, test ----> 合并得到 full ---> 经过 pipe line 得到 pipe_data ---> 训练集测试集分割得到 X_train 和 X_test ---> 训练得到结果 oof 和 predictions\n",
54 | "\n"
55 | ]
56 | },
57 | {
58 | "cell_type": "markdown",
59 | "metadata": {},
60 | "source": [
61 | "# 导入包,读取数据"
62 | ]
63 | },
64 | {
65 | "cell_type": "code",
66 | "execution_count": 1,
67 | "metadata": {},
68 | "outputs": [
69 | {
70 | "name": "stderr",
71 | "output_type": "stream",
72 | "text": [
73 | "/usr/local/lib/python3.6/dist-packages/deap/tools/_hypervolume/pyhv.py:33: ImportWarning:\n",
74 | "\n",
75 | "Falling back to the python version of hypervolume module. Expect this to be very slow.\n",
76 | "\n",
77 | "/usr/lib/python3.6/importlib/_bootstrap.py:219: ImportWarning:\n",
78 | "\n",
79 | "can't resolve package from __spec__ or __package__, falling back on __name__ and __path__\n",
80 | "\n",
81 | "/usr/lib/python3.6/importlib/_bootstrap_external.py:426: ImportWarning:\n",
82 | "\n",
83 | "Not importing directory /usr/local/lib/python3.6/dist-packages/mpl_toolkits: missing __init__\n",
84 | "\n",
85 | "/usr/lib/python3.6/importlib/_bootstrap_external.py:426: ImportWarning:\n",
86 | "\n",
87 | "Not importing directory /usr/local/lib/python3.6/dist-packages/google: missing __init__\n",
88 | "\n"
89 | ]
90 | }
91 | ],
92 | "source": [
93 | "import numpy as np \n",
94 | "import pandas as pd \n",
95 | "import lightgbm as lgb\n",
96 | "import xgboost as xgb\n",
97 | "from scipy import sparse\n",
98 | "import warnings\n",
99 | "import time\n",
100 | "import sys\n",
101 | "import os\n",
102 | "import re\n",
103 | "import datetime\n",
104 | "import matplotlib.pyplot as plt\n",
105 | "import seaborn as sns\n",
106 | "import plotly.offline as py\n",
107 | "import plotly.graph_objs as go \n",
108 | "import plotly.tools as tls\n",
109 | "from xgboost import XGBRegressor\n",
110 | "from tpot import TPOTRegressor"
111 | ]
112 | },
113 | {
114 | "cell_type": "code",
115 | "execution_count": 2,
116 | "metadata": {},
117 | "outputs": [
118 | {
119 | "name": "stderr",
120 | "output_type": "stream",
121 | "text": [
122 | "/usr/lib/python3.6/importlib/_bootstrap.py:219: ImportWarning:\n",
123 | "\n",
124 | "can't resolve package from __spec__ or __package__, falling back on __name__ and __path__\n",
125 | "\n"
126 | ]
127 | }
128 | ],
129 | "source": [
130 | "from sklearn.model_selection import cross_val_score, GridSearchCV, KFold, RepeatedKFold, ShuffleSplit\n",
131 | "from sklearn.pipeline import Pipeline, make_pipeline\n",
132 | "from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone\n",
133 | "from sklearn.linear_model import LinearRegression\n",
134 | "from sklearn.linear_model import Ridge\n",
135 | "from sklearn.linear_model import Lasso\n",
136 | "from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor\n",
137 | "from sklearn.svm import SVR, LinearSVR\n",
138 | "from sklearn.linear_model import ElasticNet, SGDRegressor, BayesianRidge\n",
139 | "from sklearn.kernel_ridge import KernelRidge\n",
140 | "from sklearn.preprocessing import OneHotEncoder, LabelEncoder\n",
141 | "from sklearn.metrics import mean_squared_error, mean_absolute_error\n",
142 | "from sklearn.metrics import log_loss\n",
143 | "from sklearn.preprocessing import Imputer\n",
144 | "from scipy.stats import skew\n",
145 | "from sklearn.preprocessing import RobustScaler, StandardScaler, MinMaxScaler, Normalizer\n",
146 | "from sklearn.decomposition import PCA, KernelPCA\n",
147 | "from sklearn.model_selection import train_test_split\n",
148 | "from sklearn.model_selection import ParameterGrid"
149 | ]
150 | },
151 | {
152 | "cell_type": "code",
153 | "execution_count": 3,
154 | "metadata": {},
155 | "outputs": [
156 | {
157 | "data": {
158 | "text/html": [
159 | ""
160 | ],
161 | "text/vnd.plotly.v1+html": [
162 | ""
163 | ]
164 | },
165 | "metadata": {},
166 | "output_type": "display_data"
167 | }
168 | ],
169 | "source": [
170 | "py.init_notebook_mode(connected=True)\n",
171 | "warnings.simplefilter(action='ignore', category=FutureWarning)\n",
172 | "warnings.filterwarnings(\"ignore\")\n",
173 | "pd.set_option('display.max_columns',None)\n",
174 | "pd.set_option('max_colwidth',100)"
175 | ]
176 | },
177 | {
178 | "cell_type": "markdown",
179 | "metadata": {},
180 | "source": [
181 | "## 设定文件名, 读取文件"
182 | ]
183 | },
184 | {
185 | "cell_type": "code",
186 | "execution_count": 4,
187 | "metadata": {},
188 | "outputs": [],
189 | "source": [
190 | "train_file_name = 'data/jinnan_round1_train_20181227.csv'\n",
191 | "test_file_name = 'data/jinnan_round1_testB_20190121.csv'\n",
192 | "test_name = 'testB'"
193 | ]
194 | },
195 | {
196 | "cell_type": "code",
197 | "execution_count": 5,
198 | "metadata": {},
199 | "outputs": [],
200 | "source": [
201 | "# 读取数据, 改名\n",
202 | "train = pd.read_csv(train_file_name, encoding = 'gb18030')\n",
203 | "test = pd.read_csv(test_file_name, encoding = 'gb18030')\n",
204 | "train.rename(columns={'样本id':'id', '收率':'target'}, inplace = True)\n",
205 | "test.rename(columns={'样本id':'id', '收率':'target'}, inplace = True)\n",
206 | "target_name = 'target'\n",
207 | "\n",
208 | "# 存在异常数据,改为 nan\n",
209 | "train.loc[1304, 'A25'] = np.nan\n",
210 | "train['A25'] = train['A25'].astype(float)\n",
211 | "\n",
212 | "# 去掉 id 前缀\n",
213 | "train['id'] = train['id'].apply(lambda x: int(x.split('_')[1]))\n",
214 | "test['id'] = test['id'].apply(lambda x: int(x.split('_')[1]))\n",
215 | "\n",
216 | "train.drop(train[train[target_name] < 0.87].index, inplace=True)\n",
217 | "full=pd.concat([train, test], ignore_index=True)"
218 | ]
219 | },
220 | {
221 | "cell_type": "markdown",
222 | "metadata": {},
223 | "source": [
224 | "# 数据清洗"
225 | ]
226 | },
227 | {
228 | "cell_type": "markdown",
229 | "metadata": {},
230 | "source": [
231 | "## 删除缺失率高的特征\n",
232 | "\n",
233 | "+ __删除缺失值大于 th_high 的值__\n",
234 | "+ __缺失值在 th_low 和 th_high 之间的特征根据是否缺失增加新特征__\n",
235 | " \n",
236 | " 如 B10 缺失较高,增加新特征 B10_null,如果缺失为1,否则为0"
237 | ]
238 | },
239 | {
240 | "cell_type": "code",
241 | "execution_count": 6,
242 | "metadata": {},
243 | "outputs": [],
244 | "source": [
245 | "class del_nan_feature(BaseEstimator, TransformerMixin):\n",
246 | " \n",
247 | " def __init__(self, th_high=0.85, th_low=0.02):\n",
248 | " self.th_high = th_high\n",
249 | " self.th_low = th_low\n",
250 | " \n",
251 | " def fit(self, X, y=None):\n",
252 | " return self\n",
253 | " \n",
254 | " def transform(self, X):\n",
255 | " print('-'*30, ' '*5, 'del_nan_feature', ' '*5, '-'*30, '\\n')\n",
256 | " print(\"shape before process = {}\".format(X.shape))\n",
257 | "\n",
258 | " # 删除高缺失率特征\n",
259 | " X.dropna(axis=1, thresh=(1-self.th_high)*X.shape[0], inplace=True)\n",
260 | " \n",
261 | " \n",
262 | " # 缺失率较高,增加新特征\n",
263 | " for col in X.columns:\n",
264 | " if col == 'target':\n",
265 | " continue\n",
266 | " \n",
267 | " miss_rate = X[col].isnull().sum()/ X.shape[0]\n",
268 | " if miss_rate > self.th_low:\n",
269 | " print(\"Missing rate of {} is {:.3f} exceed {}, adding new feature {}\".\n",
270 | " format(col, miss_rate, self.th_low, col+'_null'))\n",
271 | " X[col+'_null'] = 0\n",
272 | " X.loc[X[pd.isnull(X[col])].index, [col+'_null']] = 1\n",
273 | " print(\"shape = {}\".format(X.shape))\n",
274 | "\n",
275 | " return X"
276 | ]
277 | },
278 | {
279 | "cell_type": "markdown",
280 | "metadata": {},
281 | "source": [
282 | "## 处理字符时间(段)"
283 | ]
284 | },
285 | {
286 | "cell_type": "code",
287 | "execution_count": 7,
288 | "metadata": {},
289 | "outputs": [],
290 | "source": [
291 | "# 处理时间\n",
292 | "def timeTranSecond(t):\n",
293 | " try:\n",
294 | " h,m,s=t.split(\":\")\n",
295 | " except:\n",
296 | "\n",
297 | " if t=='1900/1/9 7:00':\n",
298 | " return 7*3600/3600\n",
299 | " elif t=='1900/1/1 2:30':\n",
300 | " return (2*3600+30*60)/3600\n",
301 | " elif pd.isnull(t):\n",
302 | " return np.nan\n",
303 | " else:\n",
304 | " return 0\n",
305 | "\n",
306 | " try:\n",
307 | " tm = (int(h)*3600+int(m)*60+int(s))/3600\n",
308 | " except:\n",
309 | " return (30*60)/3600\n",
310 | "\n",
311 | " return tm"
312 | ]
313 | },
314 | {
315 | "cell_type": "code",
316 | "execution_count": 8,
317 | "metadata": {},
318 | "outputs": [],
319 | "source": [
320 | "# 处理时间差\n",
321 | "def getDuration(se):\n",
322 | " try:\n",
323 | " sh,sm,eh,em=re.findall(r\"\\d+\",se)\n",
324 | "# print(\"sh, sm, eh, em = {}, {}, {}, {}\".format(sh, em, eh, em))\n",
325 | " except:\n",
326 | " if pd.isnull(se):\n",
327 | " return np.nan, np.nan, np.nan\n",
328 | "\n",
329 | " try:\n",
330 | " t_start = (int(sh)*3600 + int(sm)*60)/3600\n",
331 | " t_end = (int(eh)*3600 + int(em)*60)/3600\n",
332 | " \n",
333 | " if t_start > t_end:\n",
334 | " tm = t_end - t_start + 24\n",
335 | " else:\n",
336 | " tm = t_end - t_start\n",
337 | " except:\n",
338 | " if se=='19:-20:05':\n",
339 | " return 19, 20, 1\n",
340 | " elif se=='15:00-1600':\n",
341 | " return 15, 16, 1\n",
342 | " else:\n",
343 | " print(\"se = {}\".format(se))\n",
344 | "\n",
345 | "\n",
346 | " return t_start, t_end, tm"
347 | ]
348 | },
349 | {
350 | "cell_type": "code",
351 | "execution_count": 9,
352 | "metadata": {},
353 | "outputs": [],
354 | "source": [
355 | "class handle_time_str(BaseEstimator, TransformerMixin):\n",
356 | " \n",
357 | " def __init__(self):\n",
358 | " pass\n",
359 | " \n",
360 | " def fit(self, X, y=None):\n",
361 | " return self\n",
362 | " \n",
363 | " def transform(self, X):\n",
364 | " print('-'*30, ' '*5, 'handle_time_str', ' '*5, '-'*30, '\\n')\n",
365 | "\n",
366 | " for f in ['A5','A7','A9','A11','A14','A16','A24','A26','B5','B7']:\n",
367 | " try:\n",
368 | " X[f] = X[f].apply(timeTranSecond)\n",
369 | " except:\n",
370 | " print(f,'应该在前面被删除了!')\n",
371 | "\n",
372 | "\n",
373 | " for f in ['A20','A28','B4','B9','B10','B11']:\n",
374 | " try:\n",
375 | " start_end_diff = X[f].apply(getDuration)\n",
376 | " \n",
377 | " X[f+'_start'] = start_end_diff.apply(lambda x: x[0])\n",
378 | " X[f+'_end'] = start_end_diff.apply(lambda x: x[1])\n",
379 | " X[f] = start_end_diff.apply(lambda x: x[2])\n",
380 | "\n",
381 | " except:\n",
382 | " print(f,'应该在前面被删除了!')\n",
383 | " return X"
384 | ]
385 | },
386 | {
387 | "cell_type": "markdown",
388 | "metadata": {},
389 | "source": [
390 | "## 计算时间差"
391 | ]
392 | },
400 | {
401 | "cell_type": "code",
402 | "execution_count": 10,
403 | "metadata": {},
404 | "outputs": [],
405 | "source": [
406 | "def t_start_t_end(t):\n",
407 | " if pd.isnull(t[0]) or pd.isnull(t[1]):\n",
408 | "# print(\"t_start = {}, t_end = {}, id = {}\".format(t[0], t[1], t[2]))\n",
409 | " return np.nan\n",
410 | " \n",
411 | " if t[1] < t[0]:\n",
412 | " t[1] += 24\n",
413 | " \n",
414 | " dt = t[1] - t[0]\n",
415 | "\n",
416 | " if(dt > 24 or dt < 0):\n",
417 | "# print(\"dt error, t_start = {}, t_end = {}, id = {}\".format(t[0], t[1], t[2]))\n",
418 | " return np.nan\n",
419 | " \n",
420 | " return dt"
421 | ]
422 | },
423 | {
424 | "cell_type": "code",
425 | "execution_count": 11,
426 | "metadata": {},
427 | "outputs": [],
428 | "source": [
429 | "class calc_time_diff(BaseEstimator, TransformerMixin):\n",
430 | " def __init__(self):\n",
431 | " pass\n",
432 | " \n",
433 | " def fit(self, X, y=None):\n",
434 | " return self\n",
435 | " \n",
436 | " def transform(self, X):\n",
437 | " print('-'*30, ' '*5, 'calc_time_diff', ' '*5, '-'*30, '\\n')\n",
438 | "\n",
439 | " # t_start 为时间的开始, tn 为中间的时间,减去 t_start 得到时间差\n",
440 | " t_start = ['A9', 'A24', 'B5']\n",
441 | " tn = {'A9':['A11', 'A14', 'A16'], 'A24':['A26'], 'B5':['B7']}\n",
442 | " \n",
443 | " # 计算时间差\n",
444 | " for t_s in t_start:\n",
445 | " for t_e in tn[t_s]:\n",
446 | " X[t_e+'-'+t_s] = X[[t_s,t_e, target_name]].apply(t_start_t_end, axis=1)\n",
447 | " \n",
448 | " # 所有结果保留 3 位小数\n",
449 | " X = X.apply(lambda x:round(x, 3))\n",
450 | " \n",
451 | " print(\"shape = {}\".format(X.shape))\n",
452 | " \n",
453 | " return X"
454 | ]
455 | },
456 | {
457 | "cell_type": "markdown",
458 | "metadata": {},
459 | "source": [
460 | "## 处理异常值"
461 | ]
462 | },
463 | {
464 | "cell_type": "markdown",
465 | "metadata": {},
466 | "source": [
467 | "+ __单一类别个数小于 threshold 的值视为异常值, 改为 nan__"
468 | ]
469 | },
470 | {
471 | "cell_type": "code",
472 | "execution_count": 12,
473 | "metadata": {},
474 | "outputs": [],
475 | "source": [
476 | "class handle_outliers(BaseEstimator, TransformerMixin):\n",
477 | "\n",
478 | " def __init__(self, threshold=2):\n",
479 | " self.th = threshold\n",
480 | " \n",
481 | " def fit(self, X, y=None):\n",
482 | " return self\n",
483 | " \n",
484 | " def transform(self, X):\n",
485 | " print('-'*30, ' '*5, 'handle_outliers', ' '*5, '-'*30, '\\n')\n",
486 | " category_col = [col for col in X if col not in ['id', 'target']]\n",
487 | " for col in category_col:\n",
488 | " label = X[col].value_counts(dropna=False).index.tolist()\n",
489 | " for i, num in enumerate(X[col].value_counts(dropna=False).values):\n",
490 | " if num <= self.th:\n",
491 | "# print(\"Number of label {} in feature {} is {}\".format(label[i], col, num))\n",
492 | " X.loc[X[col]==label[i], [col]] = np.nan\n",
493 | " \n",
494 | " print(\"shape = {}\".format(X.shape))\n",
495 | " return X"
496 | ]
497 | },
498 | {
499 | "cell_type": "markdown",
500 | "metadata": {},
501 | "source": [
502 | "## 删除单一类别占比过大特征"
503 | ]
504 | },
505 | {
506 | "cell_type": "code",
507 | "execution_count": 13,
508 | "metadata": {},
509 | "outputs": [],
510 | "source": [
511 | "class del_single_feature(BaseEstimator, TransformerMixin):\n",
512 | "\n",
513 | " def __init__(self, threshold=0.98):\n",
514 | " # 删除单一类别占比大于 threshold 的特征\n",
515 | " self.th = threshold\n",
516 | " \n",
517 | " def fit(self, X, y=None):\n",
518 | " return self\n",
519 | " \n",
520 | " def transform(self, X):\n",
521 | " print('-'*30, ' '*5, 'del_single_feature', ' '*5, '-'*30, '\\n')\n",
522 | " category_col = [col for col in X if col not in ['target']]\n",
523 | " \n",
524 | " for col in category_col:\n",
525 | " rate = X[col].value_counts(normalize=True, dropna=False).values[0]\n",
526 | " \n",
527 | " if rate > self.th:\n",
528 | " print(\"{} 的最大类别占比是 {}, drop it\".format(col, rate))\n",
529 | " X.drop(col, axis=1, inplace=True)\n",
530 | "\n",
531 | " print(\"shape = {}\".format(X.shape))\n",
532 | " return X"
533 | ]
534 | },
535 | {
536 | "cell_type": "markdown",
537 | "metadata": {},
538 | "source": [
539 | "# 特征工程"
540 | ]
541 | },
542 | {
543 | "cell_type": "markdown",
544 | "metadata": {},
545 | "source": [
546 | "## 获得训练集与测试集"
547 | ]
548 | },
549 | {
550 | "cell_type": "code",
551 | "execution_count": 14,
552 | "metadata": {},
553 | "outputs": [],
554 | "source": [
555 | "def split_data(pipe_data, target_name='target'):\n",
556 | " \n",
557 | " # 特征列名\n",
558 | " category_col = [col for col in pipe_data if col not in ['target',target_name]]\n",
559 | " \n",
560 | " # 训练、测试行索引\n",
561 | " train_idx = pipe_data[np.logical_not(pd.isnull(pipe_data[target_name]))].index\n",
562 | " test_idx = pipe_data[pd.isnull(pipe_data[target_name])].index\n",
563 | " \n",
564 | " # 获得 train、test 数据\n",
565 | " X_train = pipe_data.loc[train_idx, category_col].values.astype(np.float64)\n",
566 | " y_train = np.squeeze(pipe_data.loc[train_idx, [target_name]].values.astype(np.float64))\n",
567 | " X_test = pipe_data.loc[test_idx, category_col].values.astype(np.float64)\n",
568 | " \n",
569 | " return X_train, y_train, X_test, test_idx"
570 | ]
571 | },
572 | {
573 | "cell_type": "markdown",
574 | "metadata": {},
575 | "source": [
576 | "## xgb(用于特征 nan 值预测)"
577 | ]
578 | },
579 | {
580 | "cell_type": "code",
581 | "execution_count": 15,
582 | "metadata": {},
583 | "outputs": [],
584 | "source": [
585 | "##### xgb\n",
586 | "def xgb_predict(X_train, y_train, X_test, params=None, verbose_eval=100):\n",
587 | " \n",
588 | " if params == None:\n",
589 | " xgb_params = {'eta': 0.05, 'max_depth': 10, 'subsample': 0.8, 'colsample_bytree': 0.8, \n",
590 | " 'objective': 'reg:linear', 'eval_metric': 'rmse', 'silent': True, 'nthread': 4}\n",
591 | " else:\n",
592 | " xgb_params = params\n",
593 | "\n",
594 | " folds = KFold(n_splits=10, shuffle=True, random_state=2018)\n",
595 | " oof_xgb = np.zeros(len(X_train))\n",
596 | " predictions_xgb = np.zeros(len(X_test))\n",
597 | "\n",
598 | " for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):\n",
599 | " if(verbose_eval):\n",
600 | " print(\"fold n°{}\".format(fold_+1))\n",
601 | " print(\"len trn_idx {}\".format(len(trn_idx)))\n",
602 | " \n",
603 | " trn_data = xgb.DMatrix(X_train[trn_idx], y_train[trn_idx])\n",
604 | " val_data = xgb.DMatrix(X_train[val_idx], y_train[val_idx])\n",
605 | "\n",
606 | " watchlist = [(trn_data, 'train'), (val_data, 'valid_data')]\n",
607 | " clf = xgb.train(dtrain=trn_data,\n",
608 | " num_boost_round=20000,\n",
609 | " evals=watchlist,\n",
610 | " early_stopping_rounds=200,\n",
611 | " verbose_eval=verbose_eval,\n",
612 | " params=xgb_params)\n",
613 | " \n",
614 | " \n",
615 | " oof_xgb[val_idx] = clf.predict(xgb.DMatrix(X_train[val_idx]), ntree_limit=clf.best_ntree_limit)\n",
616 | " predictions_xgb += clf.predict(xgb.DMatrix(X_test), ntree_limit=clf.best_ntree_limit) / folds.n_splits\n",
617 | "\n",
618 | " if(verbose_eval):\n",
619 | " print(\"CV score: {:<8.8f}\".format(mean_squared_error(oof_xgb, y_train)))\n",
620 | " return oof_xgb, predictions_xgb"
621 | ]
622 | },
623 | {
624 | "cell_type": "markdown",
625 | "metadata": {},
626 | "source": [
627 | "## 根据 B14 构建新特征"
628 | ]
629 | },
630 | {
631 | "cell_type": "code",
632 | "execution_count": 16,
633 | "metadata": {},
634 | "outputs": [],
635 | "source": [
636 | "class add_new_features(BaseEstimator, TransformerMixin):\n",
637 | "\n",
638 | " def __init__(self):\n",
639 | " pass\n",
640 | " \n",
641 | " def fit(self, X, y=None):\n",
642 | " return self\n",
643 | "\n",
644 | " def transform(self, X):\n",
645 | " print('-'*30, ' '*5, 'add_new_features', ' '*5, '-'*30, '\\n')\n",
646 | "\n",
647 | " # 经过测试,只有 B14 / B12 有用\n",
648 | " \n",
649 | "# X['B14/A1'] = X['B14'] / X['A1']\n",
650 | "# X['B14/A3'] = X['B14'] / X['A3']\n",
651 | "# X['B14/A4'] = X['B14'] / X['A4']\n",
652 | "# X['B14/A19'] = X['B14'] / X['A19']\n",
653 | "# X['B14/B1'] = X['B14'] / X['B1']\n",
654 | "# X['B14/B9'] = X['B14'] / X['B9']\n",
655 | "\n",
656 | " X['B14/B12'] = X['B14'] / X['B12']\n",
657 | " \n",
658 | " print(\"shape = {}\".format(X.shape))\n",
659 | " return X"
660 | ]
661 | },
662 | {
663 | "cell_type": "markdown",
664 | "metadata": {},
665 | "source": [
666 | "## 选择特征, nan 值填充\n",
667 | "\n",
668 | "+ __选择可能有效的特征__ (只是为了加快选择时间)\n",
669 | "\n",
670 | "+ __利用其他特征预测 nan,取最近值填充__"
671 | ]
672 | },
673 | {
674 | "cell_type": "code",
675 | "execution_count": 17,
676 | "metadata": {},
677 | "outputs": [],
678 | "source": [
679 | "def get_closest(indexes, predicts):\n",
680 | " print(\"From {}\".format(predicts))\n",
681 | "\n",
682 | " for i, one in enumerate(predicts):\n",
683 | " predicts[i] = indexes[np.argsort(abs(indexes - one))[0]]\n",
684 | "\n",
685 | " print(\"To {}\".format(predicts))\n",
686 | " return predicts\n",
687 | " \n",
688 | "\n",
689 | "def value_select_eval(pipe_data, selected_features):\n",
690 | " \n",
691 | " # 经过多次测试, 只选择可能是有用的特征\n",
692 | " cols_with_nan = [col for col in pipe_data.columns \n",
693 | " if pipe_data[col].isnull().sum()>0 and col in selected_features]\n",
694 | "\n",
695 | " for col in cols_with_nan:\n",
696 | " X_train, y_train, X_test, test_idx = split_data(pipe_data, target_name=col)\n",
697 | " oof_xgb, predictions_xgb = xgb_predict(X_train, y_train, X_test, verbose_eval=False)\n",
698 | " \n",
699 | " print(\"-\"*100, end=\"\\n\\n\")\n",
700 | " print(\"CV normal MAE scores of predicting {} is {}\".\n",
701 | " format(col, mean_absolute_error(oof_xgb, y_train)/np.mean(y_train)))\n",
702 | " \n",
703 | " pipe_data.loc[test_idx, [col]] = get_closest(pipe_data[col].value_counts().index,\n",
704 | " predictions_xgb)\n",
705 | "\n",
706 | " pipe_data = pipe_data[selected_features+['target']]\n",
707 | "\n",
708 | " return pipe_data\n",
709 | "\n",
710 | "# pipe_data = value_eval(pipe_data)"
711 | ]
712 | },
713 | {
714 | "cell_type": "code",
715 | "execution_count": 18,
716 | "metadata": {},
717 | "outputs": [],
718 | "source": [
719 | "class selsected_fill_nans(BaseEstimator, TransformerMixin):\n",
720 | "\n",
721 | " def __init__(self, selected_features = ['A3_null', 'A6', 'A16', 'A25', 'A28', 'A28_end',\n",
722 | " 'B5', 'B10_null', 'B11_null', 'B14', 'B14/B12', 'id']):\n",
723 | " self.selected_fearutes = selected_features\n",
724 | " pass\n",
725 | " \n",
726 | " def fit(self, X, y=None):\n",
727 | " return self\n",
728 | " \n",
729 | " def transform(self, X):\n",
730 | " print('-'*30, ' '*5, 'selsected_fill_nans', ' '*5, '-'*30, '\\n')\n",
731 | "\n",
732 | " X = value_select_eval(X, self.selected_fearutes)\n",
733 | "\n",
734 | " print(\"shape = {}\".format(X.shape))\n",
735 | " return X"
736 | ]
737 | },
738 | {
739 | "cell_type": "code",
740 | "execution_count": 19,
741 | "metadata": {},
742 | "outputs": [],
743 | "source": [
744 | "def modeling_cross_validation(data):\n",
745 | " X_train, y_train, X_test, test_idx = split_data(data,\n",
746 | " target_name='target')\n",
747 | " oof_xgb, _ = xgb_predict(X_train, y_train, X_test, verbose_eval=False)\n",
748 | " print('-'*100, end='\\n\\n')\n",
749 | " return mean_squared_error(oof_xgb, y_train)\n",
750 | "\n",
751 | "\n",
752 | "def featureSelect(data):\n",
753 | "\n",
754 | " init_cols = [f for f in data.columns if f not in ['target']]\n",
755 | " best_cols = init_cols.copy()\n",
756 | " best_score = modeling_cross_validation(data[best_cols+['target']])\n",
757 | " print(\"初始 CV score: {:<8.8f}\".format(best_score))\n",
758 | "\n",
759 | " for col in init_cols:\n",
760 | " best_cols.remove(col)\n",
761 | " score = modeling_cross_validation(data[best_cols+['target']])\n",
762 | " print(\"当前选择特征: {}, CV score: {:<8.8f}, 最佳cv score: {:<8.8f}\".\n",
763 | " format(col, score, best_score), end=\" \")\n",
764 | " \n",
765 | " if best_score - score > 0.0000004:\n",
766 | " best_score = score\n",
767 | " print(\"有效果,删除!!!!\")\n",
768 | " else:\n",
769 | " best_cols.append(col)\n",
770 | " print(\"保留\")\n",
771 | "\n",
772 | " print('-'*100)\n",
773 | " print(\"优化后 CV score: {:<8.8f}\".format(best_score))\n",
774 | " return best_cols, best_score"
775 | ]
776 | },
777 | {
778 | "cell_type": "markdown",
779 | "metadata": {},
780 | "source": [
781 | "## 后向选择特征"
782 | ]
783 | },
784 | {
785 | "cell_type": "code",
786 | "execution_count": 20,
787 | "metadata": {},
788 | "outputs": [],
789 | "source": [
790 | "class select_feature(BaseEstimator, TransformerMixin):\n",
791 | "\n",
792 | " def __init__(self, init_features = None):\n",
793 | " self.init_features = init_features\n",
794 | " pass\n",
795 | " \n",
796 | " def fit(self, X, y=None):\n",
797 | " return self\n",
798 | " \n",
799 | " def transform(self, X):\n",
800 | " print('-'*30, ' '*5, 'select_feature', ' '*5, '-'*30, '\\n')\n",
801 | " \n",
802 | " if self.init_features:\n",
803 | " X = X[self.init_features + ['target']]\n",
804 | " best_features = self.init_features\n",
805 | " else:\n",
806 | " best_features = [col for col in X.columns]\n",
807 | " \n",
808 | " last_feartues = []\n",
809 | " iteration = 0\n",
810 | " equal_time = 0\n",
811 | " \n",
812 | " best_CV = 1\n",
813 | " best_CV_feature = []\n",
814 | " \n",
815 | " # 打乱顺序,但是使用相同种子,保证每次运行结果相同\n",
816 | " np.random.seed(2018)\n",
817 | " while True:\n",
818 | " print(\"Iteration = {}\\n\".format(iteration))\n",
819 | " best_features, score = featureSelect(X[best_features + ['target']])\n",
820 | " \n",
821 | " # 保存最优 CV 的参数\n",
822 | " if score < best_CV:\n",
823 | " best_CV = score\n",
824 | " best_CV_feature = best_features\n",
825 | " print(\"Found best score :{}, with features :{}\".format(best_CV, best_features))\n",
826 | " \n",
827 | " np.random.shuffle(best_features)\n",
828 | " print(\"\\nCurrent fearure length = {}\".format(len(best_features)))\n",
829 | " \n",
830 | " # 最终 3 次迭代相同,则终止迭代\n",
831 | " if len(best_features) == len(last_feartues):\n",
832 | " equal_time += 1\n",
833 | " if equal_time == 3:\n",
834 | " break\n",
835 | " else:\n",
836 | " equal_time = 0\n",
837 | " \n",
838 | " last_feartues = best_features\n",
839 | " iteration = iteration + 1\n",
840 | "\n",
841 | " print(\"\\n\\n\\n\")\n",
842 | " \n",
843 | " return X[best_features + ['target']]\n"
844 | ]
845 | },
846 | {
847 | "cell_type": "markdown",
848 | "metadata": {},
849 | "source": [
850 | "# 训练"
851 | ]
852 | },
853 | {
854 | "cell_type": "markdown",
855 | "metadata": {},
856 | "source": [
857 | "## 构建 pipeline, 处理数据"
858 | ]
859 | },
860 | {
861 | "cell_type": "code",
862 | "execution_count": 21,
863 | "metadata": {},
864 | "outputs": [
865 | {
866 | "name": "stdout",
867 | "output_type": "stream",
868 | "text": [
869 | "------------------------------ del_nan_feature ------------------------------ \n",
870 | "\n",
871 | "shape before process = (1532, 44)\n",
872 | "Missing rate of A3 is 0.029 exceed 0.02, adding new feature A3_null\n",
873 | "Missing rate of B10 is 0.172 exceed 0.02, adding new feature B10_null\n",
874 | "Missing rate of B11 is 0.597 exceed 0.02, adding new feature B11_null\n",
875 | "shape = (1532, 44)\n",
876 | "------------------------------ handle_time_str ------------------------------ \n",
877 | "\n",
878 | "A7 应该在前面被删除了!\n",
879 | "------------------------------ calc_time_diff ------------------------------ \n",
880 | "\n",
881 | "shape = (1532, 61)\n",
882 | "------------------------------ handle_outliers ------------------------------ \n",
883 | "\n",
884 | "shape = (1532, 61)\n",
885 | "------------------------------ del_single_feature ------------------------------ \n",
886 | "\n",
887 | "shape = (1532, 61)\n",
888 | "------------------------------ add_new_features ------------------------------ \n",
889 | "\n",
890 | "shape = (1532, 62)\n",
891 | "------------------------------ selsected_fill_nans ------------------------------ \n",
892 | "\n",
893 | "----------------------------------------------------------------------------------------------------\n",
894 | "\n",
895 | "CV normal MAE scores of predicting A16 is 0.006573812182966658\n",
896 | "From [2.82036011 2.0116301 2.78110905 2.24891324 2.64147919 2.61220436\n",
897 | " 2.5752665 3.00300634 2.9279013 2.84588192 2.99439096 3.15101939\n",
898 | " 1.30144072 3.0146347 2.44444267 2.60455203 2.70574424 2.75994805\n",
899 | " 2.58867866 3.00614175 2.78697994 2.03778946 2.69046123 2.72509097\n",
900 | " 2.03607538 2.52129808 2.99479207 2.92738628 2.41858149 2.70892806\n",
901 | " 2.80188948 2.75916436 2.00558983 2.99666125 3.02267092 2.11280097\n",
902 | " 2.88487023 2.52905945 3.2504842 2.92606165 2.52358037 2.57779263\n",
903 | " 2.58069354 2.91890304 2.9953025 2.49374625 2.68844172 2.45054981\n",
904 | " 3.02282879 2.01016228]\n",
905 | "To [3. 2. 3. 2. 2.5 2.5 2.5 3. 3. 3. 3. 3. 1.5 3. 2.5 2.5 2.5 3.\n",
906 | " 2.5 3. 3. 2. 2.5 2.5 2. 2.5 3. 3. 2.5 2.5 3. 3. 2. 3. 3. 2.\n",
907 | " 3. 2.5 3.5 3. 2.5 2.5 2.5 3. 3. 2.5 2.5 2.5 3. 2. ]\n",
908 | "----------------------------------------------------------------------------------------------------\n",
909 | "\n",
910 | "CV normal MAE scores of predicting A25 is 0.006985261873501027\n",
911 | "From [74.69595861 76.41018534 79.13030624 80.63013458 83.78137398 69.97030067\n",
912 | " 79.99184084 81.28120136]\n",
913 | "To [75. 76. 79. 80. 80. 70. 80. 80.]\n",
914 | "----------------------------------------------------------------------------------------------------\n",
915 | "\n",
916 | "CV normal MAE scores of predicting A28 is 0.037333278007722834\n",
917 | "From [1.16017157 0.60625793 1.06248447 0.79223084 0.98663072 0.9867671\n",
918 | " 0.93546665 0.80992338 0.5366797 1.07250487 0.89877572]\n",
919 | "To [1.167 0.667 1. 0.667 1. 1. 1. 0.667 0.5 1. 1. ]\n",
920 | "----------------------------------------------------------------------------------------------------\n",
921 | "\n",
922 | "CV normal MAE scores of predicting A6 is 0.07248546276204286\n",
923 | "From [38.14284706 37.4052155 28.88562822 32.06546164 32.16541529 36.24718952\n",
924 | " 36.2260437 22.94354749 39.25126076 24.79291415 32.7412734 35.56525469\n",
925 | " 33.02388406 25.72406816]\n",
926 | "To [38. 37. 29. 32. 32. 36. 36. 23. 39. 25. 33. 36. 33. 26.]\n",
927 | "----------------------------------------------------------------------------------------------------\n",
928 | "\n",
929 | "CV normal MAE scores of predicting B14 is 0.001297790416243159\n",
930 | "From [400.24114609 402.01158142 400.45770264 401.50468063 402.03158188\n",
931 | " 337.98728943 402.02846909 401.99261093 402.0224762 341.76803589\n",
932 | " 400.6063652 ]\n",
933 | "To [400. 400. 400. 400. 400. 340. 400. 400. 400. 340. 400.]\n",
934 | "----------------------------------------------------------------------------------------------------\n",
935 | "\n",
936 | "CV normal MAE scores of predicting B5 is 0.016972058774115534\n",
937 | "From [15.12159047 15.72788435 14.3109533 19.78048182 14.65058997 14.53606975\n",
938 | " 15.33118927 14.2788609 13.99365219 14.93855688 14.00136399 14.85706538\n",
939 | " 16.71887732 21.97685862 13.37160268 15.54346603 14.7121506 ]\n",
940 | "To [15. 15.5 14.5 20. 14.5 14.5 15.5 14.5 14. 15. 14. 15. 16.5 22.\n",
941 | " 13.5 15.5 14.5]\n",
942 | "----------------------------------------------------------------------------------------------------\n",
943 | "\n",
944 | "CV normal MAE scores of predicting A28_end is 0.012402601510432685\n",
945 | "From [13.98934126 9.29064429 0.80129172 13.6963979 10.73074675 15.34772384\n",
946 | " 8.8364796 16.79893827 15.48608887 15.33091271 19.34231484 12.45444262\n",
947 | " 15.49709904 14.68686521 15.09787071 15.34401202 17.98878217 10.97875953\n",
948 | " 10.45144868 10.7770443 10.52596968 12.5426327 14.00931859 16.79731178\n",
949 | " 14.87435162 20.49663401 14.73405266 17.8244971 5.21450764 14.84811437]\n",
950 | "To [14. 9. 1. 13.5 10.5 15.5 9. 17. 15.5 15.5 19.5 12.5 15.5 14.5\n",
951 | " 15. 15.5 18. 11. 10.5 11. 10.5 12.5 14. 17. 15. 20.5 14.5 18.\n",
952 | " 5. 15. ]\n",
953 | "----------------------------------------------------------------------------------------------------\n",
954 | "\n",
955 | "CV normal MAE scores of predicting B14/B12 is 0.002980983626315855\n",
956 | "From [0.44779329 0.49992409 0.50056992 0.3333493 0.49980065 0.81825209\n",
957 | " 0.49989815 0.49929611 0.76591279 0.50158779 0.80235612 0.4996887 ]\n",
958 | "To [0.44444444 0.5 0.5 0.33333333 0.5 0.85\n",
959 | " 0.5 0.5 0.7 0.5 0.85 0.5 ]\n",
960 | "shape = (1532, 13)\n",
961 | "------------------------------ select_feature ------------------------------ \n",
962 | "\n",
963 | "Iteration = 0\n",
964 | "\n",
965 | "----------------------------------------------------------------------------------------------------\n",
966 | "\n",
967 | "初始 CV score: 0.00011896\n",
968 | "----------------------------------------------------------------------------------------------------\n",
969 | "\n",
970 | "当前选择特征: A3_null, CV score: 0.00011909, 最佳cv score: 0.00011896 保留\n",
971 | "----------------------------------------------------------------------------------------------------\n",
972 | "\n",
973 | "当前选择特征: A6, CV score: 0.00012150, 最佳cv score: 0.00011896 保留\n",
974 | "----------------------------------------------------------------------------------------------------\n",
975 | "\n",
976 | "当前选择特征: A16, CV score: 0.00011866, 最佳cv score: 0.00011896 保留\n",
977 | "----------------------------------------------------------------------------------------------------\n",
978 | "\n",
979 | "当前选择特征: A25, CV score: 0.00011865, 最佳cv score: 0.00011896 保留\n",
980 | "----------------------------------------------------------------------------------------------------\n",
981 | "\n",
982 | "当前选择特征: A28, CV score: 0.00011749, 最佳cv score: 0.00011896 有效果,删除!!!!\n",
983 | "----------------------------------------------------------------------------------------------------\n",
984 | "\n",
985 | "当前选择特征: A28_end, CV score: 0.00011743, 最佳cv score: 0.00011749 保留\n",
986 | "----------------------------------------------------------------------------------------------------\n",
987 | "\n",
988 | "当前选择特征: B5, CV score: 0.00011818, 最佳cv score: 0.00011749 保留\n",
989 | "----------------------------------------------------------------------------------------------------\n",
990 | "\n",
991 | "当前选择特征: B10_null, CV score: 0.00011870, 最佳cv score: 0.00011749 保留\n",
992 | "----------------------------------------------------------------------------------------------------\n",
993 | "\n",
994 | "当前选择特征: B11_null, CV score: 0.00011940, 最佳cv score: 0.00011749 保留\n",
995 | "----------------------------------------------------------------------------------------------------\n",
996 | "\n",
997 | "当前选择特征: B14, CV score: 0.00012126, 最佳cv score: 0.00011749 保留\n",
998 | "----------------------------------------------------------------------------------------------------\n",
999 | "\n",
1000 | "当前选择特征: B14/B12, CV score: 0.00012029, 最佳cv score: 0.00011749 保留\n",
1001 | "----------------------------------------------------------------------------------------------------\n",
1002 | "\n",
1003 | "当前选择特征: id, CV score: 0.00018189, 最佳cv score: 0.00011749 保留\n",
1004 | "----------------------------------------------------------------------------------------------------\n",
1005 | "优化后 CV score: 0.00011749\n",
1006 | "Found best score :0.00011748851254246741, with features :['A3_null', 'A6', 'A16', 'A25', 'A28_end', 'B5', 'B10_null', 'B11_null', 'B14', 'B14/B12', 'id']\n",
1007 | "\n",
1008 | "Current fearure length = 11\n",
1009 | "\n",
1010 | "\n",
1011 | "\n",
1012 | "\n",
1013 | "Iteration = 1\n",
1014 | "\n",
1015 | "----------------------------------------------------------------------------------------------------\n",
1016 | "\n",
1017 | "初始 CV score: 0.00011888\n",
1018 | "----------------------------------------------------------------------------------------------------\n",
1019 | "\n",
1020 | "当前选择特征: A3_null, CV score: 0.00011882, 最佳cv score: 0.00011888 保留\n",
1021 | "----------------------------------------------------------------------------------------------------\n",
1022 | "\n",
1023 | "当前选择特征: A25, CV score: 0.00011935, 最佳cv score: 0.00011888 保留\n",
1024 | "----------------------------------------------------------------------------------------------------\n",
1025 | "\n",
1026 | "当前选择特征: B11_null, CV score: 0.00011808, 最佳cv score: 0.00011888 有效果,删除!!!!\n",
1027 | "----------------------------------------------------------------------------------------------------\n",
1028 | "\n",
1029 | "当前选择特征: B14, CV score: 0.00012059, 最佳cv score: 0.00011808 保留\n"
1030 | ]
1031 | },
1032 | {
1033 | "name": "stdout",
1034 | "output_type": "stream",
1035 | "text": [
1036 | "----------------------------------------------------------------------------------------------------\n",
1037 | "\n",
1038 | "当前选择特征: B14/B12, CV score: 0.00012291, 最佳cv score: 0.00011808 保留\n",
1039 | "----------------------------------------------------------------------------------------------------\n",
1040 | "\n",
1041 | "当前选择特征: A28_end, CV score: 0.00011717, 最佳cv score: 0.00011808 有效果,删除!!!!\n",
1042 | "----------------------------------------------------------------------------------------------------\n",
1043 | "\n",
1044 | "当前选择特征: B5, CV score: 0.00011785, 最佳cv score: 0.00011717 保留\n",
1045 | "----------------------------------------------------------------------------------------------------\n",
1046 | "\n",
1047 | "当前选择特征: A6, CV score: 0.00012231, 最佳cv score: 0.00011717 保留\n",
1048 | "----------------------------------------------------------------------------------------------------\n",
1049 | "\n",
1050 | "当前选择特征: A16, CV score: 0.00011826, 最佳cv score: 0.00011717 保留\n",
1051 | "----------------------------------------------------------------------------------------------------\n",
1052 | "\n",
1053 | "当前选择特征: B10_null, CV score: 0.00011966, 最佳cv score: 0.00011717 保留\n",
1054 | "----------------------------------------------------------------------------------------------------\n",
1055 | "\n",
1056 | "当前选择特征: id, CV score: 0.00018469, 最佳cv score: 0.00011717 保留\n",
1057 | "----------------------------------------------------------------------------------------------------\n",
1058 | "优化后 CV score: 0.00011717\n",
1059 | "Found best score :0.00011716922906431079, with features :['A3_null', 'A25', 'B14', 'B14/B12', 'B5', 'A6', 'A16', 'B10_null', 'id']\n",
1060 | "\n",
1061 | "Current fearure length = 9\n",
1062 | "\n",
1063 | "\n",
1064 | "\n",
1065 | "\n",
1066 | "Iteration = 2\n",
1067 | "\n",
1068 | "----------------------------------------------------------------------------------------------------\n",
1069 | "\n",
1070 | "初始 CV score: 0.00011730\n",
1071 | "----------------------------------------------------------------------------------------------------\n",
1072 | "\n",
1073 | "当前选择特征: A6, CV score: 0.00012321, 最佳cv score: 0.00011730 保留\n",
1074 | "----------------------------------------------------------------------------------------------------\n",
1075 | "\n",
1076 | "当前选择特征: A3_null, CV score: 0.00011892, 最佳cv score: 0.00011730 保留\n",
1077 | "----------------------------------------------------------------------------------------------------\n",
1078 | "\n",
1079 | "当前选择特征: B14, CV score: 0.00012037, 最佳cv score: 0.00011730 保留\n",
1080 | "----------------------------------------------------------------------------------------------------\n",
1081 | "\n",
1082 | "当前选择特征: A16, CV score: 0.00011723, 最佳cv score: 0.00011730 保留\n",
1083 | "----------------------------------------------------------------------------------------------------\n",
1084 | "\n",
1085 | "当前选择特征: B5, CV score: 0.00011902, 最佳cv score: 0.00011730 保留\n",
1086 | "----------------------------------------------------------------------------------------------------\n",
1087 | "\n",
1088 | "当前选择特征: A25, CV score: 0.00011825, 最佳cv score: 0.00011730 保留\n",
1089 | "----------------------------------------------------------------------------------------------------\n",
1090 | "\n",
1091 | "当前选择特征: id, CV score: 0.00018102, 最佳cv score: 0.00011730 保留\n",
1092 | "----------------------------------------------------------------------------------------------------\n",
1093 | "\n",
1094 | "当前选择特征: B14/B12, CV score: 0.00012256, 最佳cv score: 0.00011730 保留\n",
1095 | "----------------------------------------------------------------------------------------------------\n",
1096 | "\n",
1097 | "当前选择特征: B10_null, CV score: 0.00011870, 最佳cv score: 0.00011730 保留\n",
1098 | "----------------------------------------------------------------------------------------------------\n",
1099 | "优化后 CV score: 0.00011730\n",
1100 | "\n",
1101 | "Current fearure length = 9\n",
1102 | "\n",
1103 | "\n",
1104 | "\n",
1105 | "\n",
1106 | "Iteration = 3\n",
1107 | "\n",
1108 | "----------------------------------------------------------------------------------------------------\n",
1109 | "\n",
1110 | "初始 CV score: 0.00011678\n",
1111 | "----------------------------------------------------------------------------------------------------\n",
1112 | "\n",
1113 | "当前选择特征: A25, CV score: 0.00011825, 最佳cv score: 0.00011678 保留\n",
1114 | "----------------------------------------------------------------------------------------------------\n",
1115 | "\n",
1116 | "当前选择特征: B10_null, CV score: 0.00011825, 最佳cv score: 0.00011678 保留\n",
1117 | "----------------------------------------------------------------------------------------------------\n",
1118 | "\n",
1119 | "当前选择特征: B5, CV score: 0.00011854, 最佳cv score: 0.00011678 保留\n",
1120 | "----------------------------------------------------------------------------------------------------\n",
1121 | "\n",
1122 | "当前选择特征: A16, CV score: 0.00011864, 最佳cv score: 0.00011678 保留\n",
1123 | "----------------------------------------------------------------------------------------------------\n",
1124 | "\n",
1125 | "当前选择特征: id, CV score: 0.00018284, 最佳cv score: 0.00011678 保留\n",
1126 | "----------------------------------------------------------------------------------------------------\n",
1127 | "\n",
1128 | "当前选择特征: B14, CV score: 0.00012127, 最佳cv score: 0.00011678 保留\n",
1129 | "----------------------------------------------------------------------------------------------------\n",
1130 | "\n",
1131 | "当前选择特征: A6, CV score: 0.00012289, 最佳cv score: 0.00011678 保留\n",
1132 | "----------------------------------------------------------------------------------------------------\n",
1133 | "\n",
1134 | "当前选择特征: B14/B12, CV score: 0.00012132, 最佳cv score: 0.00011678 保留\n",
1135 | "----------------------------------------------------------------------------------------------------\n",
1136 | "\n",
1137 | "当前选择特征: A3_null, CV score: 0.00011959, 最佳cv score: 0.00011678 保留\n",
1138 | "----------------------------------------------------------------------------------------------------\n",
1139 | "优化后 CV score: 0.00011678\n",
1140 | "Found best score :0.00011677842465073222, with features :['A25', 'B10_null', 'B5', 'A16', 'id', 'B14', 'A6', 'B14/B12', 'A3_null']\n",
1141 | "\n",
1142 | "Current fearure length = 9\n",
1143 | "\n",
1144 | "\n",
1145 | "\n",
1146 | "\n",
1147 | "Iteration = 4\n",
1148 | "\n",
1149 | "----------------------------------------------------------------------------------------------------\n",
1150 | "\n",
1151 | "初始 CV score: 0.00011833\n",
1152 | "----------------------------------------------------------------------------------------------------\n",
1153 | "\n",
1154 | "当前选择特征: B14, CV score: 0.00012009, 最佳cv score: 0.00011833 保留\n",
1155 | "----------------------------------------------------------------------------------------------------\n",
1156 | "\n",
1157 | "当前选择特征: B14/B12, CV score: 0.00012129, 最佳cv score: 0.00011833 保留\n",
1158 | "----------------------------------------------------------------------------------------------------\n",
1159 | "\n",
1160 | "当前选择特征: A3_null, CV score: 0.00011953, 最佳cv score: 0.00011833 保留\n",
1161 | "----------------------------------------------------------------------------------------------------\n",
1162 | "\n",
1163 | "当前选择特征: A6, CV score: 0.00012132, 最佳cv score: 0.00011833 保留\n",
1164 | "----------------------------------------------------------------------------------------------------\n",
1165 | "\n",
1166 | "当前选择特征: B5, CV score: 0.00011791, 最佳cv score: 0.00011833 有效果,删除!!!!\n",
1167 | "----------------------------------------------------------------------------------------------------\n",
1168 | "\n",
1169 | "当前选择特征: A16, CV score: 0.00012081, 最佳cv score: 0.00011791 保留\n",
1170 | "----------------------------------------------------------------------------------------------------\n",
1171 | "\n",
1172 | "当前选择特征: B10_null, CV score: 0.00011889, 最佳cv score: 0.00011791 保留\n",
1173 | "----------------------------------------------------------------------------------------------------\n",
1174 | "\n",
1175 | "当前选择特征: A25, CV score: 0.00012040, 最佳cv score: 0.00011791 保留\n",
1176 | "----------------------------------------------------------------------------------------------------\n",
1177 | "\n",
1178 | "当前选择特征: id, CV score: 0.00018179, 最佳cv score: 0.00011791 保留\n",
1179 | "----------------------------------------------------------------------------------------------------\n",
1180 | "优化后 CV score: 0.00011791\n",
1181 | "\n",
1182 | "Current fearure length = 8\n",
1183 | "\n",
1184 | "\n",
1185 | "\n",
1186 | "\n",
1187 | "Iteration = 5\n",
1188 | "\n",
1189 | "----------------------------------------------------------------------------------------------------\n",
1190 | "\n",
1191 | "初始 CV score: 0.00011747\n",
1192 | "----------------------------------------------------------------------------------------------------\n",
1193 | "\n",
1194 | "当前选择特征: id, CV score: 0.00018279, 最佳cv score: 0.00011747 保留\n",
1195 | "----------------------------------------------------------------------------------------------------\n",
1196 | "\n",
1197 | "当前选择特征: A25, CV score: 0.00012058, 最佳cv score: 0.00011747 保留\n",
1198 | "----------------------------------------------------------------------------------------------------\n",
1199 | "\n",
1200 | "当前选择特征: A6, CV score: 0.00012409, 最佳cv score: 0.00011747 保留\n",
1201 | "----------------------------------------------------------------------------------------------------\n",
1202 | "\n",
1203 | "当前选择特征: B14/B12, CV score: 0.00012101, 最佳cv score: 0.00011747 保留\n",
1204 | "----------------------------------------------------------------------------------------------------\n",
1205 | "\n",
1206 | "当前选择特征: A3_null, CV score: 0.00011944, 最佳cv score: 0.00011747 保留\n",
1207 | "----------------------------------------------------------------------------------------------------\n",
1208 | "\n",
1209 | "当前选择特征: A16, CV score: 0.00012004, 最佳cv score: 0.00011747 保留\n",
1210 | "----------------------------------------------------------------------------------------------------\n",
1211 | "\n",
1212 | "当前选择特征: B10_null, CV score: 0.00011840, 最佳cv score: 0.00011747 保留\n"
1213 | ]
1214 | },
1215 | {
1216 | "name": "stdout",
1217 | "output_type": "stream",
1218 | "text": [
1219 | "----------------------------------------------------------------------------------------------------\n",
1220 | "\n",
1221 | "当前选择特征: B14, CV score: 0.00012226, 最佳cv score: 0.00011747 保留\n",
1222 | "----------------------------------------------------------------------------------------------------\n",
1223 | "优化后 CV score: 0.00011747\n",
1224 | "\n",
1225 | "Current fearure length = 8\n",
1226 | "\n",
1227 | "\n",
1228 | "\n",
1229 | "\n",
1230 | "Iteration = 6\n",
1231 | "\n",
1232 | "----------------------------------------------------------------------------------------------------\n",
1233 | "\n",
1234 | "初始 CV score: 0.00011715\n",
1235 | "----------------------------------------------------------------------------------------------------\n",
1236 | "\n",
1237 | "当前选择特征: id, CV score: 0.00018398, 最佳cv score: 0.00011715 保留\n",
1238 | "----------------------------------------------------------------------------------------------------\n",
1239 | "\n",
1240 | "当前选择特征: A3_null, CV score: 0.00011984, 最佳cv score: 0.00011715 保留\n",
1241 | "----------------------------------------------------------------------------------------------------\n",
1242 | "\n",
1243 | "当前选择特征: A25, CV score: 0.00012132, 最佳cv score: 0.00011715 保留\n",
1244 | "----------------------------------------------------------------------------------------------------\n",
1245 | "\n",
1246 | "当前选择特征: B14/B12, CV score: 0.00012198, 最佳cv score: 0.00011715 保留\n",
1247 | "----------------------------------------------------------------------------------------------------\n",
1248 | "\n",
1249 | "当前选择特征: B10_null, CV score: 0.00011808, 最佳cv score: 0.00011715 保留\n",
1250 | "----------------------------------------------------------------------------------------------------\n",
1251 | "\n",
1252 | "当前选择特征: A16, CV score: 0.00011993, 最佳cv score: 0.00011715 保留\n",
1253 | "----------------------------------------------------------------------------------------------------\n",
1254 | "\n",
1255 | "当前选择特征: A6, CV score: 0.00012184, 最佳cv score: 0.00011715 保留\n",
1256 | "----------------------------------------------------------------------------------------------------\n",
1257 | "\n",
1258 | "当前选择特征: B14, CV score: 0.00012037, 最佳cv score: 0.00011715 保留\n",
1259 | "----------------------------------------------------------------------------------------------------\n",
1260 | "优化后 CV score: 0.00011715\n",
1261 | "\n",
1262 | "Current fearure length = 8\n",
1263 | "\n",
1264 | "\n",
1265 | "\n",
1266 | "\n",
1267 | "Iteration = 7\n",
1268 | "\n",
1269 | "----------------------------------------------------------------------------------------------------\n",
1270 | "\n",
1271 | "初始 CV score: 0.00011709\n",
1272 | "----------------------------------------------------------------------------------------------------\n",
1273 | "\n",
1274 | "当前选择特征: A25, CV score: 0.00012090, 最佳cv score: 0.00011709 保留\n",
1275 | "----------------------------------------------------------------------------------------------------\n",
1276 | "\n",
1277 | "当前选择特征: id, CV score: 0.00018248, 最佳cv score: 0.00011709 保留\n",
1278 | "----------------------------------------------------------------------------------------------------\n",
1279 | "\n",
1280 | "当前选择特征: B14/B12, CV score: 0.00012002, 最佳cv score: 0.00011709 保留\n",
1281 | "----------------------------------------------------------------------------------------------------\n",
1282 | "\n",
1283 | "当前选择特征: B14, CV score: 0.00012325, 最佳cv score: 0.00011709 保留\n",
1284 | "----------------------------------------------------------------------------------------------------\n",
1285 | "\n",
1286 | "当前选择特征: A3_null, CV score: 0.00012002, 最佳cv score: 0.00011709 保留\n",
1287 | "----------------------------------------------------------------------------------------------------\n",
1288 | "\n",
1289 | "当前选择特征: A16, CV score: 0.00011987, 最佳cv score: 0.00011709 保留\n",
1290 | "----------------------------------------------------------------------------------------------------\n",
1291 | "\n",
1292 | "当前选择特征: A6, CV score: 0.00012154, 最佳cv score: 0.00011709 保留\n",
1293 | "----------------------------------------------------------------------------------------------------\n",
1294 | "\n",
1295 | "当前选择特征: B10_null, CV score: 0.00011942, 最佳cv score: 0.00011709 保留\n",
1296 | "----------------------------------------------------------------------------------------------------\n",
1297 | "优化后 CV score: 0.00011709\n",
1298 | "\n",
1299 | "Current fearure length = 8\n",
1300 | "(1532, 9)\n"
1301 | ]
1302 | }
1303 | ],
1304 | "source": [
1305 | "selected_features = ['A3_null', 'A6', 'A16', 'A25', 'A28', 'A28_end', \n",
1306 | " 'B5', 'B10_null', 'B11_null', 'B14', 'B14/B12', 'id']\n",
1307 | "\n",
1308 | "pipe = Pipeline([\n",
1309 | " ('del_nan_feature', del_nan_feature()),\n",
1310 | " ('handle_time_str', handle_time_str()),\n",
1311 | " ('calc_time_diff', calc_time_diff()),\n",
1312 | " ('Handle_outliers', handle_outliers(2)),\n",
1313 | " ('del_single_feature', del_single_feature(1)),\n",
1314 | " ('add_new_features', add_new_features()),\n",
1315 | " ('selsected_fill_nans', selsected_fill_nans(selected_features)),\n",
1316 | " ('select_feature', select_feature(selected_features)),\n",
1317 | " ])\n",
1318 | "\n",
1319 | "pipe_data = pipe.fit_transform(full.copy())\n",
1320 | "print(pipe_data.shape)"
1321 | ]
1322 | },
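{
"cell_type": "markdown",
"metadata": {},
"source": [
"The log above comes from backward feature selection: each pass tries dropping one feature at a time, re-runs CV, deletes the feature when the score improves, and keeps it otherwise. Below is a minimal sketch of that loop; `cv_score` is a hypothetical stand-in for the CV evaluation defined earlier in this notebook, not the actual implementation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch of one backward-selection pass (assumes a hypothetical\n",
"# cv_score(features) -> float helper; lower is better, as with MSE here).\n",
"def backward_select_once(features, cv_score):\n",
"    best = cv_score(features)\n",
"    for f in list(features):\n",
"        trial = [x for x in features if x != f]  # try dropping f\n",
"        score = cv_score(trial)\n",
"        if score < best:                         # removal helps: delete f\n",
"            best, features = score, trial\n",
"    return features, best"
]
},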
1323 | {
1324 | "cell_type": "markdown",
1325 | "metadata": {},
1326 | "source": [
1327 | "## 自动调参"
1328 | ]
1329 | },
1330 | {
1331 | "cell_type": "code",
1332 | "execution_count": 22,
1333 | "metadata": {},
1334 | "outputs": [],
1335 | "source": [
1336 | "def find_best_params(pipe_data, predict_fun, param_grid):\n",
1337 | " \n",
1338 | " # 获得 train 和 test, 归一化\n",
1339 | " X_train, y_train, X_test, test_idx = split_data(pipe_data, target_name='target')\n",
1340 | " min_max_scaler = MinMaxScaler()\n",
1341 | " X_train = min_max_scaler.fit_transform(X_train)\n",
1342 | " X_test = min_max_scaler.transform(X_test)\n",
1343 | " best_score = 1\n",
1344 | "\n",
1345 | " # 遍历所有参数,寻找最优\n",
1346 | " for params in ParameterGrid(param_grid):\n",
1347 | " print('-'*100, \"\\nparams = \\n{}\\n\".format(params))\n",
1348 | "\n",
1349 | " oof, predictions = predict_fun(X_train, y_train, X_test, params=params, verbose_eval=False)\n",
1350 | " score = mean_squared_error(oof, y_train)\n",
1351 | " print(\"CV score: {}, current best score: {}\".format(score, best_score))\n",
1352 | "\n",
1353 | " if best_score > score:\n",
1354 | " print(\"Found new best score: {}\".format(score))\n",
1355 | " best_score = score\n",
1356 | " best_params = params\n",
1357 | "\n",
1358 | "\n",
1359 | " print('\\n\\nbest params: {}'.format(best_params))\n",
1360 | " print('best score: {}'.format(best_score))\n",
1361 | " \n",
1362 | " return best_params"
1363 | ]
1364 | },
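{
"cell_type": "markdown",
"metadata": {},
"source": [
"`find_best_params` relies on sklearn's `ParameterGrid`, which expands a dict of value lists into the cross product of all settings. A tiny standalone illustration (independent of the competition data):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import ParameterGrid\n",
"\n",
"grid = {'max_depth': [3, 4], 'learning_rate': [0.06, 0.12]}\n",
"for p in ParameterGrid(grid):\n",
"    print(p)  # 4 dicts: every max_depth paired with every learning_rate"
]
},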
1365 | {
1366 | "cell_type": "markdown",
1367 | "metadata": {},
1368 | "source": [
1369 | "## lgb"
1370 | ]
1371 | },
1372 | {
1373 | "cell_type": "code",
1374 | "execution_count": 23,
1375 | "metadata": {},
1376 | "outputs": [],
1377 | "source": [
1378 | "def lgb_predict(X_train, y_train, X_test, params=None, verbose_eval=100):\n",
1379 | " \n",
1380 | " if params == None:\n",
1381 | " lgb_param = {'num_leaves': 20, 'min_data_in_leaf': 2, 'objective':'regression', 'max_depth': 4,\n",
1382 | " 'learning_rate': 0.06, \"min_child_samples\": 3, \"boosting\": \"gbdt\", \"feature_fraction\": 1,\n",
1383 | " \"bagging_freq\": 0.7, \"bagging_fraction\": 1, \"bagging_seed\": 11, \"metric\": 'mse', \"lambda_l2\": 0.003,\n",
1384 | " \"verbosity\": -1}\n",
1385 | " else :\n",
1386 | " lgb_param = params\n",
1387 | " \n",
1388 | " folds = KFold(n_splits=10, shuffle=True, random_state=2018)\n",
1389 | " oof_lgb = np.zeros(len(X_train))\n",
1390 | " predictions_lgb = np.zeros(len(X_test))\n",
1391 | "\n",
1392 | " for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):\n",
1393 | " if verbose_eval:\n",
1394 | " print(\"fold n°{}\".format(fold_+1))\n",
1395 | " \n",
1396 | " trn_data = lgb.Dataset(X_train[trn_idx], y_train[trn_idx])\n",
1397 | " val_data = lgb.Dataset(X_train[val_idx], y_train[val_idx])\n",
1398 | "\n",
1399 | " num_round = 10000\n",
1400 | " clf = lgb.train(lgb_param, trn_data, num_round, valid_sets = [trn_data, val_data],\n",
1401 | " verbose_eval=verbose_eval, early_stopping_rounds = 100)\n",
1402 | " \n",
1403 | " oof_lgb[val_idx] = clf.predict(X_train[val_idx], num_iteration=clf.best_iteration)\n",
1404 | "\n",
1405 | " predictions_lgb += clf.predict(X_test, num_iteration=clf.best_iteration) / folds.n_splits\n",
1406 | "\n",
1407 | " if verbose_eval:\n",
1408 | " print(\"CV score: {:<8.8f}\".format(mean_squared_error(oof_lgb, y_train)))\n",
1409 | " \n",
1410 | " return oof_lgb, predictions_lgb"
1411 | ]
1412 | },
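{
"cell_type": "markdown",
"metadata": {},
"source": [
"`lgb_predict` follows the usual out-of-fold (OOF) scheme: every training sample is predicted exactly once, by the model of the fold that held it out, and test predictions are averaged over the 10 fold models. A toy sketch of the bookkeeping (the constant-mean 'model' is a hypothetical stand-in, not the actual LightGBM call):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.model_selection import KFold\n",
"\n",
"X, y = np.random.rand(20, 3), np.random.rand(20)\n",
"X_test = np.random.rand(5, 3)\n",
"folds = KFold(n_splits=5, shuffle=True, random_state=2018)\n",
"oof, pred = np.zeros(len(X)), np.zeros(len(X_test))\n",
"for trn_idx, val_idx in folds.split(X):\n",
"    fold_model = y[trn_idx].mean()          # stand-in for a trained model\n",
"    oof[val_idx] = fold_model               # each sample predicted exactly once\n",
"    pred += fold_model / folds.n_splits     # average over the fold models\n",
"print(((oof - y) ** 2).mean())              # the CV (MSE) score"
]
},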
1413 | {
1414 | "cell_type": "markdown",
1415 | "metadata": {},
1416 | "source": [
1417 | "+ __选择最优参数__"
1418 | ]
1419 | },
1420 | {
1421 | "cell_type": "code",
1422 | "execution_count": 24,
1423 | "metadata": {},
1424 | "outputs": [
1425 | {
1426 | "name": "stdout",
1427 | "output_type": "stream",
1428 | "text": [
1429 | "---------------------------------------------------------------------------------------------------- \n",
1430 | "params = \n",
1431 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.06, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1432 | "\n",
1433 | "CV score: 0.00011489364207207567, current best score: 1\n",
1434 | "Found new best score: 0.00011489364207207567\n",
1435 | "---------------------------------------------------------------------------------------------------- \n",
1436 | "params = \n",
1437 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.06, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1438 | "\n",
1439 | "CV score: 0.00012002848815037457, current best score: 0.00011489364207207567\n",
1440 | "---------------------------------------------------------------------------------------------------- \n",
1441 | "params = \n",
1442 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.06, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1443 | "\n",
1444 | "CV score: 0.00011402092342458887, current best score: 0.00011489364207207567\n",
1445 | "Found new best score: 0.00011402092342458887\n",
1446 | "---------------------------------------------------------------------------------------------------- \n",
1447 | "params = \n",
1448 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.06, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1449 | "\n",
1450 | "CV score: 0.00011882313706702633, current best score: 0.00011402092342458887\n",
1451 | "---------------------------------------------------------------------------------------------------- \n",
1452 | "params = \n",
1453 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.06, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1454 | "\n",
1455 | "CV score: 0.00011427665390150208, current best score: 0.00011402092342458887\n",
1456 | "---------------------------------------------------------------------------------------------------- \n",
1457 | "params = \n",
1458 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.06, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1459 | "\n",
1460 | "CV score: 0.000118469789122594, current best score: 0.00011402092342458887\n",
1461 | "---------------------------------------------------------------------------------------------------- \n",
1462 | "params = \n",
1463 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.12, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1464 | "\n",
1465 | "CV score: 0.00011547690407588457, current best score: 0.00011402092342458887\n",
1466 | "---------------------------------------------------------------------------------------------------- \n",
1467 | "params = \n",
1468 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.12, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1469 | "\n",
1470 | "CV score: 0.00011985943852356268, current best score: 0.00011402092342458887\n",
1471 | "---------------------------------------------------------------------------------------------------- \n",
1472 | "params = \n",
1473 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.12, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1474 | "\n",
1475 | "CV score: 0.00011231611613708764, current best score: 0.00011402092342458887\n",
1476 | "Found new best score: 0.00011231611613708764\n",
1477 | "---------------------------------------------------------------------------------------------------- \n",
1478 | "params = \n",
1479 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.12, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1480 | "\n",
1481 | "CV score: 0.00011748828797017007, current best score: 0.00011231611613708764\n",
1482 | "---------------------------------------------------------------------------------------------------- \n",
1483 | "params = \n",
1484 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.12, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1485 | "\n",
1486 | "CV score: 0.00011554903372234801, current best score: 0.00011231611613708764\n",
1487 | "---------------------------------------------------------------------------------------------------- \n",
1488 | "params = \n",
1489 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.12, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1490 | "\n",
1491 | "CV score: 0.00012001078341271754, current best score: 0.00011231611613708764\n",
1492 | "---------------------------------------------------------------------------------------------------- \n",
1493 | "params = \n",
1494 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.24, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1495 | "\n",
1496 | "CV score: 0.0001135845354614368, current best score: 0.00011231611613708764\n",
1497 | "---------------------------------------------------------------------------------------------------- \n",
1498 | "params = \n",
1499 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.24, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1500 | "\n",
1501 | "CV score: 0.00011950413736010373, current best score: 0.00011231611613708764\n",
1502 | "---------------------------------------------------------------------------------------------------- \n",
1503 | "params = \n",
1504 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.24, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1505 | "\n",
1506 | "CV score: 0.00011449170101534524, current best score: 0.00011231611613708764\n",
1507 | "---------------------------------------------------------------------------------------------------- \n",
1508 | "params = \n",
1509 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.24, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1510 | "\n",
1511 | "CV score: 0.00012067409892140623, current best score: 0.00011231611613708764\n",
1512 | "---------------------------------------------------------------------------------------------------- \n",
1513 | "params = \n",
1514 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.24, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1515 | "\n"
1516 | ]
1517 | },
1518 | {
1519 | "name": "stdout",
1520 | "output_type": "stream",
1521 | "text": [
1522 | "CV score: 0.00011435307236140136, current best score: 0.00011231611613708764\n",
1523 | "---------------------------------------------------------------------------------------------------- \n",
1524 | "params = \n",
1525 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.24, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1526 | "\n",
1527 | "CV score: 0.00012009711500307733, current best score: 0.00011231611613708764\n",
1528 | "---------------------------------------------------------------------------------------------------- \n",
1529 | "params = \n",
1530 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.06, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1531 | "\n",
1532 | "CV score: 0.00011480005750940053, current best score: 0.00011231611613708764\n",
1533 | "---------------------------------------------------------------------------------------------------- \n",
1534 | "params = \n",
1535 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.06, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1536 | "\n",
1537 | "CV score: 0.00012025520678151364, current best score: 0.00011231611613708764\n",
1538 | "---------------------------------------------------------------------------------------------------- \n",
1539 | "params = \n",
1540 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.06, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1541 | "\n",
1542 | "CV score: 0.00011420826273292763, current best score: 0.00011231611613708764\n",
1543 | "---------------------------------------------------------------------------------------------------- \n",
1544 | "params = \n",
1545 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.06, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1546 | "\n",
1547 | "CV score: 0.00011871676666472398, current best score: 0.00011231611613708764\n",
1548 | "---------------------------------------------------------------------------------------------------- \n",
1549 | "params = \n",
1550 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.06, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1551 | "\n",
1552 | "CV score: 0.00011447430938895559, current best score: 0.00011231611613708764\n",
1553 | "---------------------------------------------------------------------------------------------------- \n",
1554 | "params = \n",
1555 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.06, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1556 | "\n",
1557 | "CV score: 0.00011845637169175561, current best score: 0.00011231611613708764\n",
1558 | "---------------------------------------------------------------------------------------------------- \n",
1559 | "params = \n",
1560 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.12, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1561 | "\n",
1562 | "CV score: 0.00011541082150277204, current best score: 0.00011231611613708764\n",
1563 | "---------------------------------------------------------------------------------------------------- \n",
1564 | "params = \n",
1565 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.12, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1566 | "\n",
1567 | "CV score: 0.00011943383299618423, current best score: 0.00011231611613708764\n",
1568 | "---------------------------------------------------------------------------------------------------- \n",
1569 | "params = \n",
1570 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.12, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1571 | "\n",
1572 | "CV score: 0.00011212722352757958, current best score: 0.00011231611613708764\n",
1573 | "Found new best score: 0.00011212722352757958\n",
1574 | "---------------------------------------------------------------------------------------------------- \n",
1575 | "params = \n",
1576 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.12, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1577 | "\n",
1578 | "CV score: 0.0001181836666297253, current best score: 0.00011212722352757958\n",
1579 | "---------------------------------------------------------------------------------------------------- \n",
1580 | "params = \n",
1581 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.12, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1582 | "\n",
1583 | "CV score: 0.0001158514519987164, current best score: 0.00011212722352757958\n",
1584 | "---------------------------------------------------------------------------------------------------- \n",
1585 | "params = \n",
1586 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.12, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1587 | "\n",
1588 | "CV score: 0.00012000426127843114, current best score: 0.00011212722352757958\n",
1589 | "---------------------------------------------------------------------------------------------------- \n",
1590 | "params = \n",
1591 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.24, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1592 | "\n",
1593 | "CV score: 0.00011360641302273318, current best score: 0.00011212722352757958\n",
1594 | "---------------------------------------------------------------------------------------------------- \n",
1595 | "params = \n",
1596 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.24, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1597 | "\n",
1598 | "CV score: 0.00011965654294675753, current best score: 0.00011212722352757958\n",
1599 | "---------------------------------------------------------------------------------------------------- \n",
1600 | "params = \n",
1601 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.24, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1602 | "\n",
1603 | "CV score: 0.00011443935391500729, current best score: 0.00011212722352757958\n",
1604 | "---------------------------------------------------------------------------------------------------- \n",
1605 | "params = \n",
1606 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.24, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1607 | "\n"
1608 | ]
1609 | },
1610 | {
1611 | "name": "stdout",
1612 | "output_type": "stream",
1613 | "text": [
1614 | "CV score: 0.00012091648099518892, current best score: 0.00011212722352757958\n",
1615 | "---------------------------------------------------------------------------------------------------- \n",
1616 | "params = \n",
1617 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.24, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1618 | "\n",
1619 | "CV score: 0.00011431927180539175, current best score: 0.00011212722352757958\n",
1620 | "---------------------------------------------------------------------------------------------------- \n",
1621 | "params = \n",
1622 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.24, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1623 | "\n",
1624 | "CV score: 0.00012002843916496843, current best score: 0.00011212722352757958\n",
1625 | "---------------------------------------------------------------------------------------------------- \n",
1626 | "params = \n",
1627 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.06, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1628 | "\n",
1629 | "CV score: 0.00011451755742912798, current best score: 0.00011212722352757958\n",
1630 | "---------------------------------------------------------------------------------------------------- \n",
1631 | "params = \n",
1632 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.06, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1633 | "\n",
1634 | "CV score: 0.00011991034882396216, current best score: 0.00011212722352757958\n",
1635 | "---------------------------------------------------------------------------------------------------- \n",
1636 | "params = \n",
1637 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.06, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1638 | "\n",
1639 | "CV score: 0.00011421574014520517, current best score: 0.00011212722352757958\n",
1640 | "---------------------------------------------------------------------------------------------------- \n",
1641 | "params = \n",
1642 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.06, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1643 | "\n",
1644 | "CV score: 0.00011862220062373018, current best score: 0.00011212722352757958\n",
1645 | "---------------------------------------------------------------------------------------------------- \n",
1646 | "params = \n",
1647 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.06, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1648 | "\n",
1649 | "CV score: 0.00011490378784720141, current best score: 0.00011212722352757958\n",
1650 | "---------------------------------------------------------------------------------------------------- \n",
1651 | "params = \n",
1652 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.06, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1653 | "\n",
1654 | "CV score: 0.00011862098330828783, current best score: 0.00011212722352757958\n",
1655 | "---------------------------------------------------------------------------------------------------- \n",
1656 | "params = \n",
1657 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.12, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1658 | "\n",
1659 | "CV score: 0.00011474358647366374, current best score: 0.00011212722352757958\n",
1660 | "---------------------------------------------------------------------------------------------------- \n",
1661 | "params = \n",
1662 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.12, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1663 | "\n",
1664 | "CV score: 0.00011966504510458172, current best score: 0.00011212722352757958\n",
1665 | "---------------------------------------------------------------------------------------------------- \n",
1666 | "params = \n",
1667 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.12, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1668 | "\n",
1669 | "CV score: 0.00011221151541733946, current best score: 0.00011212722352757958\n",
1670 | "---------------------------------------------------------------------------------------------------- \n",
1671 | "params = \n",
1672 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.12, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1673 | "\n",
1674 | "CV score: 0.00011777070012250079, current best score: 0.00011212722352757958\n",
1675 | "---------------------------------------------------------------------------------------------------- \n",
1676 | "params = \n",
1677 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.12, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1678 | "\n",
1679 | "CV score: 0.0001153000539253855, current best score: 0.00011212722352757958\n",
1680 | "---------------------------------------------------------------------------------------------------- \n",
1681 | "params = \n",
1682 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.12, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1683 | "\n",
1684 | "CV score: 0.00012101952260099292, current best score: 0.00011212722352757958\n",
1685 | "---------------------------------------------------------------------------------------------------- \n",
1686 | "params = \n",
1687 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.24, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1688 | "\n",
1689 | "CV score: 0.00011434513539248205, current best score: 0.00011212722352757958\n",
1690 | "---------------------------------------------------------------------------------------------------- \n",
1691 | "params = \n",
1692 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.24, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1693 | "\n",
1694 | "CV score: 0.00011965789595988069, current best score: 0.00011212722352757958\n",
1695 | "---------------------------------------------------------------------------------------------------- \n",
1696 | "params = \n",
1697 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.24, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1698 | "\n"
1699 | ]
1700 | },
1701 | {
1702 | "name": "stdout",
1703 | "output_type": "stream",
1704 | "text": [
1705 | "CV score: 0.000114353078699699, current best score: 0.00011212722352757958\n",
1706 | "---------------------------------------------------------------------------------------------------- \n",
1707 | "params = \n",
1708 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.24, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1709 | "\n",
1710 | "CV score: 0.0001212472410367223, current best score: 0.00011212722352757958\n",
1711 | "---------------------------------------------------------------------------------------------------- \n",
1712 | "params = \n",
1713 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.24, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1714 | "\n",
1715 | "CV score: 0.00011433021385951927, current best score: 0.00011212722352757958\n",
1716 | "---------------------------------------------------------------------------------------------------- \n",
1717 | "params = \n",
1718 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.24, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1719 | "\n",
1720 | "CV score: 0.0001202623231156629, current best score: 0.00011212722352757958\n",
1721 | "\n",
1722 | "\n",
1723 | "best params: {'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.12, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n",
1724 | "best score: 0.00011212722352757958\n"
1725 | ]
1726 | }
1727 | ],
1728 | "source": [
1729 | "param_grid = [\n",
1730 | " {'num_leaves': [20], 'min_data_in_leaf': [2, 3], 'objective':['regression'],\n",
1731 | " 'max_depth': [3, 4, 5], 'learning_rate': [0.06, 0.12, 0.24], \"min_child_samples\": [3],\n",
1732 | " \"boosting\": [\"gbdt\"], \"feature_fraction\": [0.7], \"bagging_freq\": [1],\n",
1733 | " \"bagging_fraction\": [1], \"bagging_seed\": [11], \"metric\": ['mse'],\n",
1734 | " \"lambda_l2\": [0.0003, 0.001, 0.003], \"verbosity\": [-1]}\n",
1735 | " ]\n",
1736 | "\n",
1737 | "lgb_best_params = find_best_params(pipe_data, lgb_predict, param_grid)"
1738 | ]
1739 | },
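{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the grid above expands to 2 × 3 × 3 × 3 = 54 combinations (min_data_in_leaf × max_depth × learning_rate × lambda_l2), and each one is scored with full 10-fold CV, i.e. 540 LightGBM trainings in total. A quick check:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import ParameterGrid\n",
"\n",
"print(len(ParameterGrid(param_grid)))  # 54 combinations, each 10-fold CV'd"
]
},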
1740 | {
1741 | "cell_type": "markdown",
1742 | "metadata": {},
1743 | "source": [
1744 | "+ __lgb 训练__"
1745 | ]
1746 | },
1747 | {
1748 | "cell_type": "code",
1749 | "execution_count": 25,
1750 | "metadata": {},
1751 | "outputs": [
1752 | {
1753 | "name": "stdout",
1754 | "output_type": "stream",
1755 | "text": [
1756 | "fold n°1\n",
1757 | "Training until validation scores don't improve for 100 rounds.\n",
1758 | "Early stopping, best iteration is:\n",
1759 | "[84]\ttraining's l2: 7.64635e-05\tvalid_1's l2: 0.00013957\n",
1760 | "CV score: 0.76865147\n",
1761 | "fold n°2\n",
1762 | "Training until validation scores don't improve for 100 rounds.\n",
1763 | "Early stopping, best iteration is:\n",
1764 | "[82]\ttraining's l2: 8.03721e-05\tvalid_1's l2: 0.000110882\n",
1765 | "CV score: 0.68291002\n",
1766 | "fold n°3\n",
1767 | "Training until validation scores don't improve for 100 rounds.\n",
1768 | "[200]\ttraining's l2: 5.30193e-05\tvalid_1's l2: 0.000139552\n",
1769 | "Early stopping, best iteration is:\n",
1770 | "[173]\ttraining's l2: 5.68838e-05\tvalid_1's l2: 0.000138392\n",
1771 | "CV score: 0.59707720\n",
1772 | "fold n°4\n",
1773 | "Training until validation scores don't improve for 100 rounds.\n",
1774 | "[200]\ttraining's l2: 5.73108e-05\tvalid_1's l2: 0.000104267\n",
1775 | "Early stopping, best iteration is:\n",
1776 | "[104]\ttraining's l2: 7.56046e-05\tvalid_1's l2: 9.52206e-05\n",
1777 | "CV score: 0.51161256\n",
1778 | "fold n°5\n",
1779 | "Training until validation scores don't improve for 100 rounds.\n",
1780 | "[200]\ttraining's l2: 5.36518e-05\tvalid_1's l2: 0.000110701\n",
1781 | "Early stopping, best iteration is:\n",
1782 | "[111]\ttraining's l2: 6.91289e-05\tvalid_1's l2: 0.0001088\n",
1783 | "CV score: 0.42667732\n",
1784 | "fold n°6\n",
1785 | "Training until validation scores don't improve for 100 rounds.\n",
1786 | "[200]\ttraining's l2: 5.63032e-05\tvalid_1's l2: 0.000123434\n",
1787 | "Early stopping, best iteration is:\n",
1788 | "[110]\ttraining's l2: 7.12232e-05\tvalid_1's l2: 0.000120677\n",
1789 | "CV score: 0.34177334\n",
1790 | "fold n°7\n",
1791 | "Training until validation scores don't improve for 100 rounds.\n",
1792 | "Early stopping, best iteration is:\n",
1793 | "[88]\ttraining's l2: 7.56938e-05\tvalid_1's l2: 0.000111444\n",
1794 | "CV score: 0.25611485\n",
1795 | "fold n°8\n",
1796 | "Training until validation scores don't improve for 100 rounds.\n",
1797 | "[200]\ttraining's l2: 5.83909e-05\tvalid_1's l2: 8.13893e-05\n",
1798 | "Early stopping, best iteration is:\n",
1799 | "[177]\ttraining's l2: 6.21352e-05\tvalid_1's l2: 7.96722e-05\n",
1800 | "CV score: 0.17049800\n",
1801 | "fold n°9\n",
1802 | "Training until validation scores don't improve for 100 rounds.\n",
1803 | "[200]\ttraining's l2: 5.74223e-05\tvalid_1's l2: 0.000108159\n",
1804 | "Early stopping, best iteration is:\n",
1805 | "[183]\ttraining's l2: 5.99078e-05\tvalid_1's l2: 0.000107585\n",
1806 | "CV score: 0.08544523\n",
1807 | "fold n°10\n",
1808 | "Training until validation scores don't improve for 100 rounds.\n",
1809 | "[200]\ttraining's l2: 5.66965e-05\tvalid_1's l2: 0.000109477\n",
1810 | "Early stopping, best iteration is:\n",
1811 | "[176]\ttraining's l2: 5.98262e-05\tvalid_1's l2: 0.000108649\n",
1812 | "CV score: 0.00011213\n"
1813 | ]
1814 | }
1815 | ],
1816 | "source": [
1817 | "X_train, y_train, X_test, test_idx = split_data(pipe_data, target_name='target')\n",
1818 | "min_max_scaler = MinMaxScaler()\n",
1819 | "X_train = min_max_scaler.fit_transform(X_train)\n",
1820 | "X_test = min_max_scaler.transform(X_test)\n",
1821 | "oof_lgb, predictions_lgb = lgb_predict(X_train, y_train, X_test, params=lgb_best_params, verbose_eval=200) #"
1822 | ]
1823 | },
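{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note on the log above: the per-fold 'CV score' is `mean_squared_error` over the partially filled OOF vector, in which entries for folds not yet predicted are still zero, so the number shrinks fold by fold; only the final line (0.00011213) is the true 10-fold CV score. A toy illustration of the effect (0.9 is just a rough stand-in for the typical yield target):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"y = np.full(10, 0.9)              # roughly the scale of the yield targets\n",
"oof = np.zeros_like(y)            # before any fold has been predicted\n",
"print(((oof - y) ** 2).mean())    # 0.81 -- why the first 'CV score' is huge"
]
},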
1824 | {
1825 | "cell_type": "markdown",
1826 | "metadata": {},
1827 | "source": [
1828 | "## xgb"
1829 | ]
1830 | },
1831 | {
1832 | "cell_type": "markdown",
1833 | "metadata": {},
1834 | "source": [
1835 | "+ __选择最优参数__"
1836 | ]
1837 | },
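{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: `'objective': 'reg:linear'` in the params below is the legacy XGBoost alias; newer releases renamed it to `reg:squarederror` (same loss) and only emit a deprecation warning, so the results are unaffected."
]
},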
1838 | {
1839 | "cell_type": "code",
1840 | "execution_count": 26,
1841 | "metadata": {},
1842 | "outputs": [
1843 | {
1844 | "name": "stdout",
1845 | "output_type": "stream",
1846 | "text": [
1847 | "---------------------------------------------------------------------------------------------------- \n",
1848 | "params = \n",
1849 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n",
1850 | "\n",
1851 | "CV score: 0.00011466648461334117, current best score: 1\n",
1852 | "Found new best score: 0.00011466648461334117\n",
1853 | "---------------------------------------------------------------------------------------------------- \n",
1854 | "params = \n",
1855 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n",
1856 | "\n",
1857 | "CV score: 0.00011456634369353389, current best score: 0.00011466648461334117\n",
1858 | "Found new best score: 0.00011456634369353389\n",
1859 | "---------------------------------------------------------------------------------------------------- \n",
1860 | "params = \n",
1861 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.8}\n",
1862 | "\n",
1863 | "CV score: 0.00011522095556659337, current best score: 0.00011456634369353389\n",
1864 | "---------------------------------------------------------------------------------------------------- \n",
1865 | "params = \n",
1866 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n",
1867 | "\n",
1868 | "CV score: 0.00011575362802403785, current best score: 0.00011456634369353389\n",
1869 | "---------------------------------------------------------------------------------------------------- \n",
1870 | "params = \n",
1871 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n",
1872 | "\n",
1873 | "CV score: 0.0001153688729909817, current best score: 0.00011456634369353389\n",
1874 | "---------------------------------------------------------------------------------------------------- \n",
1875 | "params = \n",
1876 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n",
1877 | "\n",
1878 | "CV score: 0.00011393958598424273, current best score: 0.00011456634369353389\n",
1879 | "Found new best score: 0.00011393958598424273\n",
1880 | "---------------------------------------------------------------------------------------------------- \n",
1881 | "params = \n",
1882 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.8}\n",
1883 | "\n",
1884 | "CV score: 0.00011519961810983215, current best score: 0.00011393958598424273\n",
1885 | "---------------------------------------------------------------------------------------------------- \n",
1886 | "params = \n",
1887 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n",
1888 | "\n",
1889 | "CV score: 0.0001159343681149051, current best score: 0.00011393958598424273\n",
1890 | "---------------------------------------------------------------------------------------------------- \n",
1891 | "params = \n",
1892 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n",
1893 | "\n",
1894 | "CV score: 0.00011533954423435673, current best score: 0.00011393958598424273\n",
1895 | "---------------------------------------------------------------------------------------------------- \n",
1896 | "params = \n",
1897 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n",
1898 | "\n",
1899 | "CV score: 0.00011435484501228615, current best score: 0.00011393958598424273\n",
1900 | "---------------------------------------------------------------------------------------------------- \n",
1901 | "params = \n",
1902 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.8}\n",
1903 | "\n",
1904 | "CV score: 0.00011550346442947287, current best score: 0.00011393958598424273\n",
1905 | "---------------------------------------------------------------------------------------------------- \n",
1906 | "params = \n",
1907 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n",
1908 | "\n",
1909 | "CV score: 0.00011732503571073892, current best score: 0.00011393958598424273\n",
1910 | "---------------------------------------------------------------------------------------------------- \n",
1911 | "params = \n",
1912 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n",
1913 | "\n",
1914 | "CV score: 0.00011507860801671209, current best score: 0.00011393958598424273\n",
1915 | "---------------------------------------------------------------------------------------------------- \n",
1916 | "params = \n",
1917 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n",
1918 | "\n",
1919 | "CV score: 0.00011373673653760243, current best score: 0.00011393958598424273\n",
1920 | "Found new best score: 0.00011373673653760243\n",
1921 | "---------------------------------------------------------------------------------------------------- \n",
1922 | "params = \n",
1923 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.8}\n",
1924 | "\n",
1925 | "CV score: 0.00011455317620732011, current best score: 0.00011373673653760243\n",
1926 | "---------------------------------------------------------------------------------------------------- \n",
1927 | "params = \n",
1928 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n",
1929 | "\n",
1930 | "CV score: 0.0001157106534400235, current best score: 0.00011373673653760243\n",
1931 | "---------------------------------------------------------------------------------------------------- \n",
1932 | "params = \n",
1933 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n",
1934 | "\n",
1935 | "CV score: 0.00011484817588831511, current best score: 0.00011373673653760243\n",
1936 | "---------------------------------------------------------------------------------------------------- \n",
1937 | "params = \n",
1938 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n",
1939 | "\n",
1940 | "CV score: 0.00011350220852811235, current best score: 0.00011373673653760243\n",
1941 | "Found new best score: 0.00011350220852811235\n",
1942 | "---------------------------------------------------------------------------------------------------- \n",
1943 | "params = \n",
1944 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.8}\n",
1945 | "\n",
1946 | "CV score: 0.00011417123679618333, current best score: 0.00011350220852811235\n",
1947 | "---------------------------------------------------------------------------------------------------- \n",
1948 | "params = \n",
1949 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n",
1950 | "\n",
1951 | "CV score: 0.0001139991054633333, current best score: 0.00011350220852811235\n",
1952 | "---------------------------------------------------------------------------------------------------- \n",
1953 | "params = \n",
1954 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n",
1955 | "\n",
1956 | "CV score: 0.00011455587677046536, current best score: 0.00011350220852811235\n",
1957 | "---------------------------------------------------------------------------------------------------- \n",
1958 | "params = \n",
1959 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n",
1960 | "\n",
1961 | "CV score: 0.00011509936251941176, current best score: 0.00011350220852811235\n",
1962 | "---------------------------------------------------------------------------------------------------- \n",
1963 | "params = \n",
1964 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.8}\n",
1965 | "\n"
1966 | ]
1967 | },
1968 | {
1969 | "name": "stdout",
1970 | "output_type": "stream",
1971 | "text": [
1972 | "CV score: 0.00011510995861245286, current best score: 0.00011350220852811235\n",
1973 | "---------------------------------------------------------------------------------------------------- \n",
1974 | "params = \n",
1975 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n",
1976 | "\n",
1977 | "CV score: 0.00011452153482808672, current best score: 0.00011350220852811235\n",
1978 | "---------------------------------------------------------------------------------------------------- \n",
1979 | "params = \n",
1980 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n",
1981 | "\n",
1982 | "CV score: 0.00011658284174959551, current best score: 0.00011350220852811235\n",
1983 | "---------------------------------------------------------------------------------------------------- \n",
1984 | "params = \n",
1985 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n",
1986 | "\n",
1987 | "CV score: 0.00011423913370431543, current best score: 0.00011350220852811235\n",
1988 | "---------------------------------------------------------------------------------------------------- \n",
1989 | "params = \n",
1990 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.8}\n",
1991 | "\n",
1992 | "CV score: 0.00011457178138601278, current best score: 0.00011350220852811235\n",
1993 | "---------------------------------------------------------------------------------------------------- \n",
1994 | "params = \n",
1995 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n",
1996 | "\n",
1997 | "CV score: 0.00011543189691404133, current best score: 0.00011350220852811235\n",
1998 | "---------------------------------------------------------------------------------------------------- \n",
1999 | "params = \n",
2000 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n",
2001 | "\n",
2002 | "CV score: 0.00011610932754520534, current best score: 0.00011350220852811235\n",
2003 | "---------------------------------------------------------------------------------------------------- \n",
2004 | "params = \n",
2005 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n",
2006 | "\n",
2007 | "CV score: 0.00011469866498564288, current best score: 0.00011350220852811235\n",
2008 | "---------------------------------------------------------------------------------------------------- \n",
2009 | "params = \n",
2010 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.8}\n",
2011 | "\n",
2012 | "CV score: 0.00011495607118486482, current best score: 0.00011350220852811235\n",
2013 | "---------------------------------------------------------------------------------------------------- \n",
2014 | "params = \n",
2015 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n",
2016 | "\n",
2017 | "CV score: 0.00011659799844084755, current best score: 0.00011350220852811235\n",
2018 | "---------------------------------------------------------------------------------------------------- \n",
2019 | "params = \n",
2020 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n",
2021 | "\n",
2022 | "CV score: 0.00011684188710827838, current best score: 0.00011350220852811235\n",
2023 | "---------------------------------------------------------------------------------------------------- \n",
2024 | "params = \n",
2025 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n",
2026 | "\n",
2027 | "CV score: 0.0001155643748809089, current best score: 0.00011350220852811235\n",
2028 | "---------------------------------------------------------------------------------------------------- \n",
2029 | "params = \n",
2030 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.8}\n",
2031 | "\n",
2032 | "CV score: 0.0001150891914023831, current best score: 0.00011350220852811235\n",
2033 | "---------------------------------------------------------------------------------------------------- \n",
2034 | "params = \n",
2035 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n",
2036 | "\n",
2037 | "CV score: 0.0001174593981966976, current best score: 0.00011350220852811235\n",
2038 | "\n",
2039 | "\n",
2040 | "best params: {'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n",
2041 | "best score: 0.00011350220852811235\n"
2042 | ]
2043 | }
2044 | ],
2045 | "source": [
2046 | "param_grid = [\n",
2047 | " {'silent': [1],\n",
2048 | " 'nthread': [4],\n",
2049 | " 'eval_metric': ['rmse'],\n",
2050 | " 'eta': [0.03],\n",
2051 | " 'objective': ['reg:linear'],\n",
2052 | " 'max_depth': [4, 5, 6],\n",
2053 | " 'num_round': [1000],\n",
2054 | " 'subsample': [0.4, 0.6, 0.8, 1],\n",
2055 | " 'colsample_bytree': [0.7, 0.9, 1],\n",
2056 | " }\n",
2057 | " ]\n",
2058 | "\n",
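2058a | "# 3 max_depth x 4 subsample x 3 colsample_bytree values = 36 combinations, each scored by 10-fold CV\n",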
2059 | "xgb_best_params = find_best_params(pipe_data, xgb_predict, param_grid)"
2060 | ]
2061 | },
2062 | {
2063 | "cell_type": "markdown",
2064 | "metadata": {},
2065 | "source": [
2066 | "+ __xgb training__"
2067 | ]
2068 | },
2069 | {
2070 | "cell_type": "code",
2071 | "execution_count": 27,
2072 | "metadata": {},
2073 | "outputs": [
2074 | {
2075 | "name": "stdout",
2076 | "output_type": "stream",
2077 | "text": [
2078 | "fold n°1\n",
2079 | "len trn_idx 1244\n",
2080 | "[0]\ttrain-rmse:0.41223\tvalid_data-rmse:0.414495\n",
2081 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n",
2082 | "\n",
2083 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n",
2084 | "[200]\ttrain-rmse:0.00937\tvalid_data-rmse:0.012154\n",
2085 | "[400]\ttrain-rmse:0.00738\tvalid_data-rmse:0.012149\n",
2086 | "Stopping. Best iteration:\n",
2087 | "[249]\ttrain-rmse:0.008639\tvalid_data-rmse:0.011963\n",
2088 | "\n",
2089 | "fold n°2\n",
2090 | "len trn_idx 1244\n",
2091 | "[0]\ttrain-rmse:0.412554\tvalid_data-rmse:0.411489\n",
2092 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n",
2093 | "\n",
2094 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n",
2095 | "[200]\ttrain-rmse:0.009572\tvalid_data-rmse:0.010408\n",
2096 | "[400]\ttrain-rmse:0.007492\tvalid_data-rmse:0.010257\n",
2097 | "Stopping. Best iteration:\n",
2098 | "[321]\ttrain-rmse:0.008054\tvalid_data-rmse:0.010154\n",
2099 | "\n",
2100 | "fold n°3\n",
2101 | "len trn_idx 1244\n",
2102 | "[0]\ttrain-rmse:0.412511\tvalid_data-rmse:0.412077\n",
2103 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n",
2104 | "\n",
2105 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n",
2106 | "[200]\ttrain-rmse:0.009498\tvalid_data-rmse:0.012291\n",
2107 | "[400]\ttrain-rmse:0.007413\tvalid_data-rmse:0.011734\n",
2108 | "[600]\ttrain-rmse:0.00631\tvalid_data-rmse:0.011799\n",
2109 | "Stopping. Best iteration:\n",
2110 | "[481]\ttrain-rmse:0.006915\tvalid_data-rmse:0.011714\n",
2111 | "\n",
2112 | "fold n°4\n",
2113 | "len trn_idx 1245\n",
2114 | "[0]\ttrain-rmse:0.412355\tvalid_data-rmse:0.413351\n",
2115 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n",
2116 | "\n",
2117 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n",
2118 | "[200]\ttrain-rmse:0.009632\tvalid_data-rmse:0.010239\n",
2119 | "[400]\ttrain-rmse:0.007535\tvalid_data-rmse:0.009976\n",
2120 | "Stopping. Best iteration:\n",
2121 | "[361]\ttrain-rmse:0.00781\tvalid_data-rmse:0.009915\n",
2122 | "\n",
2123 | "fold n°5\n",
2124 | "len trn_idx 1245\n",
2125 | "[0]\ttrain-rmse:0.412679\tvalid_data-rmse:0.410576\n",
2126 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n",
2127 | "\n",
2128 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n",
2129 | "[200]\ttrain-rmse:0.009628\tvalid_data-rmse:0.010779\n",
2130 | "[400]\ttrain-rmse:0.007522\tvalid_data-rmse:0.010314\n",
2131 | "[600]\ttrain-rmse:0.006444\tvalid_data-rmse:0.010495\n",
2132 | "Stopping. Best iteration:\n",
2133 | "[406]\ttrain-rmse:0.007485\tvalid_data-rmse:0.010309\n",
2134 | "\n",
2135 | "fold n°6\n",
2136 | "len trn_idx 1245\n",
2137 | "[0]\ttrain-rmse:0.412697\tvalid_data-rmse:0.410264\n",
2138 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n",
2139 | "\n",
2140 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n",
2141 | "[200]\ttrain-rmse:0.009618\tvalid_data-rmse:0.01148\n",
2142 | "[400]\ttrain-rmse:0.007507\tvalid_data-rmse:0.011228\n",
2143 | "Stopping. Best iteration:\n",
2144 | "[298]\ttrain-rmse:0.008307\tvalid_data-rmse:0.011154\n",
2145 | "\n",
2146 | "fold n°7\n",
2147 | "len trn_idx 1245\n",
2148 | "[0]\ttrain-rmse:0.412244\tvalid_data-rmse:0.414249\n",
2149 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n",
2150 | "\n",
2151 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n",
2152 | "[200]\ttrain-rmse:0.009547\tvalid_data-rmse:0.010669\n",
2153 | "[400]\ttrain-rmse:0.007386\tvalid_data-rmse:0.010475\n",
2154 | "Stopping. Best iteration:\n",
2155 | "[276]\ttrain-rmse:0.008424\tvalid_data-rmse:0.010363\n",
2156 | "\n",
2157 | "fold n°8\n",
2158 | "len trn_idx 1245\n",
2159 | "[0]\ttrain-rmse:0.41229\tvalid_data-rmse:0.414101\n",
2160 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n",
2161 | "\n",
2162 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n",
2163 | "[200]\ttrain-rmse:0.009677\tvalid_data-rmse:0.009724\n",
2164 | "[400]\ttrain-rmse:0.007648\tvalid_data-rmse:0.00927\n",
2165 | "[600]\ttrain-rmse:0.006596\tvalid_data-rmse:0.009145\n",
2166 | "[800]\ttrain-rmse:0.005798\tvalid_data-rmse:0.009118\n",
2167 | "[1000]\ttrain-rmse:0.005144\tvalid_data-rmse:0.009119\n",
2168 | "Stopping. Best iteration:\n",
2169 | "[934]\ttrain-rmse:0.00535\tvalid_data-rmse:0.009081\n",
2170 | "\n",
2171 | "fold n°9\n",
2172 | "len trn_idx 1245\n",
2173 | "[0]\ttrain-rmse:0.412592\tvalid_data-rmse:0.411176\n",
2174 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n",
2175 | "\n",
2176 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n",
2177 | "[200]\ttrain-rmse:0.009536\tvalid_data-rmse:0.01135\n",
2178 | "[400]\ttrain-rmse:0.007511\tvalid_data-rmse:0.010768\n",
2179 | "[600]\ttrain-rmse:0.006424\tvalid_data-rmse:0.010755\n",
2180 | "Stopping. Best iteration:\n",
2181 | "[534]\ttrain-rmse:0.006738\tvalid_data-rmse:0.010726\n",
2182 | "\n",
2183 | "fold n°10\n",
2184 | "len trn_idx 1245\n",
2185 | "[0]\ttrain-rmse:0.412429\tvalid_data-rmse:0.412775\n",
2186 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n",
2187 | "\n",
2188 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n",
2189 | "[200]\ttrain-rmse:0.009541\tvalid_data-rmse:0.011636\n",
2190 | "[400]\ttrain-rmse:0.007462\tvalid_data-rmse:0.010853\n",
2191 | "[600]\ttrain-rmse:0.006416\tvalid_data-rmse:0.010884\n",
2192 | "Stopping. Best iteration:\n",
2193 | "[409]\ttrain-rmse:0.007404\tvalid_data-rmse:0.010834\n",
2194 | "\n",
2195 | "CV score: 0.00011350\n"
2196 | ]
2197 | }
2198 | ],
2199 | "source": [
2200 | "X_train, y_train, X_test, test_idx = split_data(pipe_data, target_name='target')\n",
2201 | "min_max_scaler = MinMaxScaler()\n",
2202 | "X_train = min_max_scaler.fit_transform(X_train)\n",
2203 | "X_test = min_max_scaler.transform(X_test)\n",
2204 | "oof_xgb, predictions_xgb = xgb_predict(X_train, y_train, X_test, params=xgb_best_params, verbose_eval=200) #"
2205 | ]
2206 | },
2207 | {
2208 | "cell_type": "markdown",
2209 | "metadata": {},
2210 | "source": [
2211 | "## Model ensembling"
2212 | ]
2213 | },
2214 | {
2215 | "cell_type": "code",
2216 | "execution_count": 28,
2217 | "metadata": {},
2218 | "outputs": [
2219 | {
2220 | "name": "stdout",
2221 | "output_type": "stream",
2222 | "text": [
2223 | "fold 0\n",
2224 | "fold 1\n",
2225 | "fold 2\n",
2226 | "fold 3\n",
2227 | "fold 4\n",
2228 | "fold 5\n",
2229 | "fold 6\n",
2230 | "fold 7\n",
2231 | "fold 8\n",
2232 | "fold 9\n",
2233 | "0.00011094962448042872\n"
2234 | ]
2235 | }
2236 | ],
2237 | "source": [
2238 | "# stacking\n",
2239 | "train_stack = np.vstack([oof_lgb, oof_xgb]).transpose()\n",
2240 | "test_stack = np.vstack([predictions_lgb, predictions_xgb]).transpose()\n",
2241 | "\n",
2242 | "folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=4590)\n",
2243 | "oof_stack = np.zeros(train_stack.shape[0])\n",
2244 | "predictions = np.zeros(test_stack.shape[0])\n",
2245 | "\n",
2246 | "for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack, y_train)):\n",
2247 | " print(\"fold {}\".format(fold_))\n",
2248 | " trn_data, trn_y = train_stack[trn_idx], y_train[trn_idx]\n",
2249 | " val_data, val_y = train_stack[val_idx], y_train[val_idx]\n",
2250 | " \n",
2251 | " clf_3 = BayesianRidge()\n",
2252 | " clf_3.fit(trn_data, trn_y)\n",
2253 | " \n",
2254 | " oof_stack[val_idx] = clf_3.predict(val_data)\n",
2255 | " predictions += clf_3.predict(test_stack) / 10\n",
2256 | " \n",
2257 | "final_score = mean_squared_error(y_train, oof_stack)\n",
2258 | "print(final_score)"
2259 | ]
2260 | },
2261 | {
2262 | "cell_type": "markdown",
2263 | "metadata": {},
2264 | "source": [
2265 | "# Generate the submission"
2266 | ]
2267 | },
2268 | {
2269 | "cell_type": "markdown",
2270 | "metadata": {},
2271 | "source": [
2272 | "+ __Build the output file name__"
2273 | ]
2274 | },
2275 | {
2276 | "cell_type": "code",
2277 | "execution_count": 29,
2278 | "metadata": {},
2279 | "outputs": [
2280 | {
2281 | "name": "stdout",
2282 | "output_type": "stream",
2283 | "text": [
2284 | "result/testB_lgb_xgb_11095_8features_20190122_10:36:07.csv\n"
2285 | ]
2286 | }
2287 | ],
2288 | "source": [
2289 | "import time\n",
2290 | "model_name = \"lgb_xgb\"\n",
2291 | "file_name = 'result/{}_{}_{:5.0f}_{}features_{}.csv'.format(\n",
2292 | " test_name, \n",
2293 | " model_name,\n",
2294 | " final_score*1e8, X_train.shape[1],\n",
2295 | " time.strftime('%Y%m%d_%H:%M:%S',time.localtime(time.time())))\n",
2296 | " \n",
2297 | "print(file_name)"
2298 | ]
2299 | },
2300 | {
2301 | "cell_type": "markdown",
2302 | "metadata": {},
2303 | "source": [
2304 | "+ __Write the file__"
2305 | ]
2306 | },
2307 | {
2308 | "cell_type": "code",
2309 | "execution_count": 30,
2310 | "metadata": {},
2311 | "outputs": [],
2312 | "source": [
2313 | "sub_df = pd.read_csv(test_file_name, encoding = 'gb18030')\n",
2314 | "sub_df = sub_df[['样本id', 'A1']]\n",
2315 | "sub_df['A1'] = predictions\n",
2316 | "sub_df['A1'] = sub_df['A1'].apply(lambda x:round(x, 3))\n",
2317 | "sub_df.to_csv(file_name, header=0, index=0) "
2318 | ]
2319 | }
2320 | ],
2321 | "metadata": {
2322 | "kernelspec": {
2323 | "display_name": "Python 3",
2324 | "language": "python",
2325 | "name": "python3"
2326 | },
2327 | "language_info": {
2328 | "codemirror_mode": {
2329 | "name": "ipython",
2330 | "version": 3
2331 | },
2332 | "file_extension": ".py",
2333 | "mimetype": "text/x-python",
2334 | "name": "python",
2335 | "nbconvert_exporter": "python",
2336 | "pygments_lexer": "ipython3",
2337 | "version": "3.6.7"
2338 | },
2339 | "toc": {
2340 | "base_numbering": 1,
2341 | "nav_menu": {},
2342 | "number_sections": true,
2343 | "sideBar": true,
2344 | "skip_h1_title": false,
2345 | "title_cell": "Table of Contents",
2346 | "title_sidebar": "Contents",
2347 | "toc_cell": false,
2348 | "toc_position": {},
2349 | "toc_section_display": true,
2350 | "toc_window_display": true
2351 | }
2352 | },
2353 | "nbformat": 4,
2354 | "nbformat_minor": 2
2355 | }
2356 |
--------------------------------------------------------------------------------
/初赛/津南数字制造算法挑战赛+20+Drop/readme.md:
--------------------------------------------------------------------------------
1 | # Environment requirements
2 | 
3 | ## System
4 | 
5 | ubuntu 16.04
6 | 
7 | ## Python version
8 | python 3.5 or python 3.6
9 | 
10 | ## Required libraries
11 | 
12 | numpy, pandas, lightgbm, xgboost, sklearn
13 | 
14 | 
15 | 
16 | # How to run
17 | 
18 | Generate the prediction file for leaderboard B:
19 | python3 run.py --test_type=B
20 | 
21 | Generate the prediction file for leaderboard C:
22 | python3 run.py --test_type=C
23 | 
24 | # Other notes
25 | 
26 | The run takes roughly 6 minutes (it varies).
27 | 
28 | A data folder must sit in the top-level directory, containing the round-A data jinnan_round1_train_20181227.csv, the round-B data jinnan_round1_testB_20190121.csv, and the round-C data jinnan_round1_test_20190121.csv.
29 |
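29a | For reference, the expected layout looks like this (file names exactly as above):
29b | 
29c |     .
29d |     ├── run.py
29e |     └── data
29f |         ├── jinnan_round1_train_20181227.csv
29g |         ├── jinnan_round1_testB_20190121.csv
29h |         └── jinnan_round1_test_20190121.csv
29i | 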
30 | Results are written to the top-level directory: submit_B.csv for leaderboard B and submit_C.csv for leaderboard C.
31 | 
32 | # Finally
33 | 
34 | Happy Spring Festival in advance~
35 | If you run into any problems, please contact us:
36 |
37 | 陶亚凡: 765370813@qq.com
38 | Blue: cy_1995@qq.com
--------------------------------------------------------------------------------
/初赛/津南数字制造算法挑战赛+20+Drop/run.py:
--------------------------------------------------------------------------------
1 |
2 | # In[1]:
3 |
4 |
5 | import numpy as np
6 | import pandas as pd
7 | import lightgbm as lgb
8 | import xgboost as xgb
9 | import warnings
10 | import re
11 | import argparse, sys
12 |
13 | # In[2]:
14 |
15 |
16 | from sklearn.model_selection import KFold, RepeatedKFold
17 | from sklearn.pipeline import Pipeline
18 | from sklearn.base import BaseEstimator, TransformerMixin
19 | from sklearn.linear_model import BayesianRidge
20 | from sklearn.metrics import mean_squared_error, mean_absolute_error
21 | from sklearn.preprocessing import MinMaxScaler
22 | from sklearn.model_selection import ParameterGrid
23 |
24 |
25 | # In[3]:
26 |
27 |
28 | warnings.simplefilter(action='ignore', category=FutureWarning)
29 | warnings.filterwarnings("ignore")
30 | pd.set_option('display.max_columns', None)
31 | pd.set_option('max_colwidth', 100)
32 |
33 |
34 |
35 |
36 | # # Data cleaning
37 | 
38 | # ## Drop features with a high missing rate
39 | #
40 | # + __Drop features whose missing rate is above th_high__
41 | # + __For features whose missing rate lies between th_low and th_high, add a new feature indicating missingness__
42 | #
43 | # e.g. B10 has a fairly high missing rate, so a new feature B10_null is added: 1 if the value is missing, 0 otherwise
44 |
45 | # In[6]:
46 |
47 |
48 | class del_nan_feature(BaseEstimator, TransformerMixin):
49 |
50 | def __init__(self, th_high=0.85, th_low=0.02):
51 | self.th_high = th_high
52 | self.th_low = th_low
53 |
54 | def fit(self, X, y=None):
55 | return self
56 |
57 | def transform(self, X):
58 | print('-'*30, ' '*5, 'del_nan_feature', ' '*5, '-'*30, '\n')
59 | print("shape before process = {}".format(X.shape))
60 |
61 |         # Drop features whose missing rate exceeds th_high
62 | X.dropna(axis=1, thresh=(1-self.th_high)*X.shape[0], inplace=True)
63 |
64 |
65 |         # Missing rate is noticeable: add an indicator feature
66 | for col in X.columns:
67 | if col == 'target':
68 | continue
69 |
70 | miss_rate = X[col].isnull().sum()/ X.shape[0]
71 | if miss_rate > self.th_low:
72 | print("Missing rate of {} is {:.3f} exceed {}, adding new feature {}".
73 | format(col, miss_rate, self.th_low, col+'_null'))
74 | X[col+'_null'] = 0
75 | X.loc[X[pd.isnull(X[col])].index, [col+'_null']] = 1
76 | print("shape = {}".format(X.shape))
77 |
78 | return X
79 |
80 |
81 | # ## Parse time (range) strings
82 |
83 | # In[7]:
84 |
85 |
86 | # Parse a time of day
87 | def timeTranSecond(t):
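87a |     # Convert an "H:M:S" time string to a float number of hours. Two known
87b |     # malformed entries are special-cased, NaN stays NaN, any other string that
87c |     # does not split into three fields returns 0, and fields that fail int()
87d |     # fall back to 0.5 h.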
88 | try:
89 | h,m,s=t.split(":")
90 | except:
91 |
92 | if t=='1900/1/9 7:00':
93 | return 7*3600/3600
94 | elif t=='1900/1/1 2:30':
95 | return (2*3600+30*60)/3600
96 | elif pd.isnull(t):
97 | return np.nan
98 | else:
99 | return 0
100 |
101 | try:
102 | tm = (int(h)*3600+int(m)*60+int(s))/3600
103 | except:
104 | return (30*60)/3600
105 |
106 | return tm
107 |
108 |
109 | # In[8]:
110 |
111 |
112 | # Parse a time range
113 | def getDuration(se):
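113a |     # Parse a "HH:MM-HH:MM" range string into (start_hour, end_hour, duration_hours).
113b |     # Ranges crossing midnight get +24 h; NaN input yields three NaNs, and two known
113c |     # malformed entries are handled in the except branches below.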
114 | try:
115 | sh,sm,eh,em=re.findall(r"\d+",se)
116 | # print("sh, sm, eh, em = {}, {}, {}, {}".format(sh, em, eh, em))
117 | except:
118 | if pd.isnull(se):
119 | return np.nan, np.nan, np.nan
120 |
121 | try:
122 | t_start = (int(sh)*3600 + int(sm)*60)/3600
123 | t_end = (int(eh)*3600 + int(em)*60)/3600
124 |
125 | if t_start > t_end:
126 | tm = t_end - t_start + 24
127 | else:
128 | tm = t_end - t_start
129 | except:
130 | if se=='19:-20:05':
131 | return 19, 20, 1
132 | elif se=='15:00-1600':
133 | return 15, 16, 1
134 | else:
135 | print("se = {}".format(se))
136 |
137 |
138 | return t_start, t_end, tm
139 |
140 |
141 | # In[9]:
142 |
143 |
144 | class handle_time_str(BaseEstimator, TransformerMixin):
145 |
146 | def __init__(self):
147 | pass
148 |
149 | def fit(self, X, y=None):
150 | return self
151 |
152 | def transform(self, X):
153 | print('-'*30, ' '*5, 'handle_time_str', ' '*5, '-'*30, '\n')
154 |
155 | for f in ['A5','A7','A9','A11','A14','A16','A24','A26','B5','B7']:
156 | try:
157 | X[f] = X[f].apply(timeTranSecond)
158 | except:
159 |                 print(f, 'should have been dropped earlier!')
160 |
161 |
162 | for f in ['A20','A28','B4','B9','B10','B11']:
163 | try:
164 | start_end_diff = X[f].apply(getDuration)
165 |
166 | X[f+'_start'] = start_end_diff.apply(lambda x: x[0])
167 | X[f+'_end'] = start_end_diff.apply(lambda x: x[1])
168 | X[f] = start_end_diff.apply(lambda x: x[2])
169 |
170 | except:
171 |                 print(f, 'should have been dropped earlier!')
172 | return X
173 |
174 |
175 | # ## Compute time differences
176 |
177 | # In[ ]:
178 |
179 |
180 |
181 |
182 |
183 | # In[10]:
184 |
185 |
186 | def t_start_t_end(t):
187 | if pd.isnull(t[0]) or pd.isnull(t[1]):
188 | # print("t_start = {}, t_end = {}, id = {}".format(t[0], t[1], t[2]))
189 | return np.nan
190 |
191 | if t[1] < t[0]:
192 | t[1] += 24
193 |
194 | dt = t[1] - t[0]
195 |
196 | if(dt > 24 or dt < 0):
197 | # print("dt error, t_start = {}, t_end = {}, id = {}".format(t[0], t[1], t[2]))
198 | return np.nan
199 |
200 | return dt
201 |
202 |
203 | # In[11]:
204 |
205 |
206 | class calc_time_diff(BaseEstimator, TransformerMixin):
207 | def __init__(self):
208 | pass
209 |
210 | def fit(self, X, y=None):
211 | return self
212 |
213 | def transform(self, X):
214 | print('-'*30, ' '*5, 'calc_time_diff', ' '*5, '-'*30, '\n')
215 |
216 |         # t_start lists the starting-time columns; tn maps each start to its intermediate-time columns, from which t_start is subtracted to get the time differences
217 | t_start = ['A9', 'A24', 'B5']
218 | tn = {'A9':['A11', 'A14', 'A16'], 'A24': ['A26'], 'B5':['B7']}
219 |
220 |         # Compute the time differences
221 | for t_s in t_start:
222 | for t_e in tn[t_s]:
223 | X[t_e+'-'+t_s] = X[[t_s, t_e, 'target']].apply(t_start_t_end, axis=1)
224 |
225 |         # Round all results to 3 decimal places
226 | X = X.apply(lambda x: round(x, 3))
227 |
228 | print("shape = {}".format(X.shape))
229 |
230 | return X
231 |
232 |
233 | # ## Handle outliers
234 | 
235 | # + __Values whose per-category count is no more than threshold are treated as outliers and set to nan__
236 |
237 | # In[12]:
238 |
239 |
240 | class handle_outliers(BaseEstimator, TransformerMixin):
241 |
242 | def __init__(self, threshold=2):
243 | self.th = threshold
244 |
245 | def fit(self, X, y=None):
246 | return self
247 |
248 | def transform(self, X):
249 | print('-'*30, ' '*5, 'handle_outliers', ' '*5, '-'*30, '\n')
250 | category_col = [col for col in X if col not in ['id', 'target']]
251 | for col in category_col:
252 | label = X[col].value_counts(dropna=False).index.tolist()
253 | for i, num in enumerate(X[col].value_counts(dropna=False).values):
254 | if num <= self.th:
255 | # print("Number of label {} in feature {} is {}".format(label[i], col, num))
256 | X.loc[X[col]==label[i], [col]] = np.nan
257 |
258 | print("shape = {}".format(X.shape))
259 | return X
260 |
261 |
262 | # ## Drop features dominated by a single value
263 |
264 | # In[13]:
265 |
266 |
267 | class del_single_feature(BaseEstimator, TransformerMixin):
268 |
269 | def __init__(self, threshold=0.98):
270 |         # Drop features whose most frequent value has a ratio above threshold
271 | self.th = threshold
272 |
273 | def fit(self, X, y=None):
274 | return self
275 |
276 | def transform(self, X):
277 | print('-'*30, ' '*5, 'del_single_feature', ' '*5, '-'*30, '\n')
278 | category_col = [col for col in X if col not in ['target']]
279 |
280 | for col in category_col:
281 | rate = X[col].value_counts(normalize=True, dropna=False).values[0]
282 |
283 | if rate > self.th:
284 |                 print("Top value ratio of {} is {}, drop it".format(col, rate))
285 | X.drop(col, axis=1, inplace=True)
286 |
287 | print("shape = {}".format(X.shape))
288 | return X
289 |
290 |
291 | # # Feature engineering
292 | 
293 | # ## Build the train and test sets
294 |
295 | # In[14]:
296 |
297 |
298 | def split_data(pipe_data, target_name='target'):
299 |
300 |     # Feature column names
301 | category_col = [col for col in pipe_data if col not in ['target',target_name]]
302 |
303 |     # Row indices of train and test sets
304 | train_idx = pipe_data[np.logical_not(pd.isnull(pipe_data[target_name]))].index
305 | test_idx = pipe_data[pd.isnull(pipe_data[target_name])].index
306 |
307 |     # Extract the train and test arrays
308 | X_train = pipe_data.loc[train_idx, category_col].values.astype(np.float64)
309 | y_train = np.squeeze(pipe_data.loc[train_idx, [target_name]].values.astype(np.float64))
310 | X_test = pipe_data.loc[test_idx, category_col].values.astype(np.float64)
311 |
312 | return X_train, y_train, X_test, test_idx
313 |
314 |
315 | # ## xgb (also used to predict feature nan values)
316 |
317 | # In[15]:
318 |
319 |
320 | ##### xgb
321 | def xgb_predict(X_train, y_train, X_test, params=None, verbose_eval=100):
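321a |     # 10-fold CV: the out-of-fold predictions (oof_xgb) give an unbiased CV score,
321b |     # and the test-set predictions are averaged over the 10 fold models.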
322 |
323 |     if params is None:
324 | xgb_params = {'eta': 0.05, 'max_depth': 10, 'subsample': 0.8, 'colsample_bytree': 0.8,
325 | 'objective': 'reg:linear', 'eval_metric': 'rmse', 'silent': True, 'nthread': 4}
326 | else:
327 | xgb_params = params
328 |
329 | folds = KFold(n_splits=10, shuffle=True, random_state=2018)
330 | oof_xgb = np.zeros(len(X_train))
331 | predictions_xgb = np.zeros(len(X_test))
332 |
333 | for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):
334 | if(verbose_eval):
335 | print("fold n°{}".format(fold_+1))
336 | print("len trn_idx {}".format(len(trn_idx)))
337 |
338 | trn_data = xgb.DMatrix(X_train[trn_idx], y_train[trn_idx])
339 | val_data = xgb.DMatrix(X_train[val_idx], y_train[val_idx])
340 |
341 | watchlist = [(trn_data, 'train'), (val_data, 'valid_data')]
342 | clf = xgb.train(dtrain=trn_data,
343 | num_boost_round=20000,
344 | evals=watchlist,
345 | early_stopping_rounds=200,
346 | verbose_eval=verbose_eval,
347 | params=xgb_params)
348 |
349 |
350 | oof_xgb[val_idx] = clf.predict(xgb.DMatrix(X_train[val_idx]), ntree_limit=clf.best_ntree_limit)
351 | predictions_xgb += clf.predict(xgb.DMatrix(X_test), ntree_limit=clf.best_ntree_limit) / folds.n_splits
352 |
353 | if(verbose_eval):
354 | print("CV score: {:<8.8f}".format(mean_squared_error(oof_xgb, y_train)))
355 | return oof_xgb, predictions_xgb
356 |
357 |
358 | # ## Build new features from B14
359 |
360 | # In[16]:
361 |
362 |
363 | class add_new_features(BaseEstimator, TransformerMixin):
364 |
365 | def __init__(self):
366 | pass
367 |
368 | def fit(self, X, y=None):
369 | return self
370 |
371 | def transform(self, X):
372 | print('-'*30, ' '*5, 'add_new_features', ' '*5, '-'*30, '\n')
373 |
374 |         # After testing, only B14 / B12 proved useful
375 |
376 | # X['B14/A1'] = X['B14'] / X['A1']
377 | # X['B14/A3'] = X['B14'] / X['A3']
378 | # X['B14/A4'] = X['B14'] / X['A4']
379 | # X['B14/A19'] = X['B14'] / X['A19']
380 | # X['B14/B1'] = X['B14'] / X['B1']
381 | # X['B14/B9'] = X['B14'] / X['B9']
382 |
383 | X['B14/B12'] = X['B14'] / X['B12']
384 |
385 | print("shape = {}".format(X.shape))
386 | return X
387 |
388 |
389 | # ## Select features and fill nan values
390 | #
391 | # + __Pre-select features that may be useful__ (only to speed up selection)
392 | #
393 | # + __Predict nan values from the other features and fill with the closest observed value__
394 |
395 | # In[17]:
396 |
397 |
398 | def get_closest(indexes, predicts):
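398a |     # Snap each predicted value to the closest value actually observed in the
398b |     # column (indexes comes from value_counts), so imputed entries stay on the
398c |     # grid of existing values.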
399 | print("From {}".format(predicts))
400 |
401 | for i, one in enumerate(predicts):
402 | predicts[i] = indexes[np.argsort(abs(indexes - one))[0]]
403 |
404 | print("To {}".format(predicts))
405 | return predicts
406 |
407 |
408 | def value_select_eval(pipe_data, selected_features):
409 |
410 |     # After repeated tests, only features likely to be useful are kept
411 | cols_with_nan = [col for col in pipe_data.columns
412 | if pipe_data[col].isnull().sum()>0 and col in selected_features]
413 |
414 | for col in cols_with_nan:
415 | X_train, y_train, X_test, test_idx = split_data(pipe_data, target_name=col)
416 | oof_xgb, predictions_xgb = xgb_predict(X_train, y_train, X_test, verbose_eval=False)
417 |
418 | print("-"*100, end="\n\n")
419 |         print("CV normalized MAE score for predicting {} is {}".
420 | format(col, mean_absolute_error(oof_xgb, y_train)/np.mean(y_train)))
421 |
422 | pipe_data.loc[test_idx, [col]] = get_closest(pipe_data[col].value_counts().index,
423 | predictions_xgb)
424 |
425 | pipe_data = pipe_data[selected_features+['target']]
426 |
427 | return pipe_data
428 |
429 | # pipe_data = value_select_eval(pipe_data, selected_features)
430 |
431 |
432 | # In[18]:
433 |
434 |
435 | class selected_fill_nans(BaseEstimator, TransformerMixin):
436 |
437 | def __init__(self, selected_features = ['A3_null', 'A6', 'A16', 'A25', 'A28', 'A28_end',
438 | 'B5', 'B10_null', 'B11_null', 'B14', 'B14/B12', 'id']):
439 |         self.selected_features = selected_features
440 | pass
441 |
442 | def fit(self, X, y=None):
443 | return self
444 |
445 | def transform(self, X):
446 |         print('-'*30, ' '*5, 'selected_fill_nans', ' '*5, '-'*30, '\n')
447 |
448 |         X = value_select_eval(X, self.selected_features)
449 |
450 | print("shape = {}".format(X.shape))
451 | return X
452 |
453 |
454 | # In[19]:
455 |
456 |
457 | def modeling_cross_validation(data):
458 | X_train, y_train, X_test, test_idx = split_data(data,
459 | target_name='target')
460 | oof_xgb, _ = xgb_predict(X_train, y_train, X_test, verbose_eval=False)
461 | print('-'*100, end='\n\n')
462 | return mean_squared_error(oof_xgb, y_train)
463 |
464 |
465 | def featureSelect(data):
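465a |     # Greedy backward elimination: tentatively remove each feature and keep the
465b |     # removal only if the CV MSE improves by more than 4e-7; otherwise restore it.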
466 |
467 | init_cols = [f for f in data.columns if f not in ['target']]
468 | best_cols = init_cols.copy()
469 | best_score = modeling_cross_validation(data[best_cols+['target']])
470 |     print("Initial CV score: {:<8.8f}".format(best_score))
471 |
472 | for col in init_cols:
473 | best_cols.remove(col)
474 | score = modeling_cross_validation(data[best_cols+['target']])
475 |         print("Current candidate feature: {}, CV score: {:<8.8f}, best CV score: {:<8.8f}".
476 | format(col, score, best_score), end=" ")
477 |
478 | if best_score - score > 0.0000004:
479 | best_score = score
480 |             print("Improved, drop it!")
481 | else:
482 | best_cols.append(col)
483 |             print("Keep it")
484 |
485 | print('-'*100)
486 |     print("CV score after optimization: {:<8.8f}".format(best_score))
487 | return best_cols, best_score
488 |
489 |
490 | # ## Backward feature selection
491 |
492 | # In[20]:
493 |
494 |
495 | class select_feature(BaseEstimator, TransformerMixin):
496 |
497 | def __init__(self, init_features = None):
498 | self.init_features = init_features
499 | pass
500 |
501 | def fit(self, X, y=None):
502 | return self
503 |
504 | def transform(self, X):
505 | print('-'*30, ' '*5, 'select_feature', ' '*5, '-'*30, '\n')
506 |
507 | if self.init_features:
508 | X = X[self.init_features + ['target']]
509 | best_features = self.init_features
510 | else:
511 | best_features = [col for col in X.columns]
512 |
513 |         last_features = []
514 | iteration = 0
515 | equal_time = 0
516 |
517 | best_CV = 1
518 | best_CV_feature = []
519 |
520 |         # Shuffle the order, but with a fixed seed so every run gives the same result
521 | np.random.seed(2018)
522 | while True:
523 | print("Iteration = {}\n".format(iteration))
524 | best_features, score = featureSelect(X[best_features + ['target']])
525 |
526 |             # Keep the feature set with the best CV score
527 | if score < best_CV:
528 | best_CV = score
529 | best_CV_feature = best_features
530 | print("Found best score :{}, with features :{}".format(best_CV, best_features))
531 |
532 | np.random.shuffle(best_features)
533 |             print("\nCurrent feature length = {}".format(len(best_features)))
534 |
535 |             # Stop once the feature count is unchanged for 3 consecutive iterations
536 |             if len(best_features) == len(last_features):
537 | equal_time += 1
538 | if equal_time == 3:
539 | break
540 | else:
541 | equal_time = 0
542 |
543 |             last_features = best_features
544 | iteration = iteration + 1
545 |
546 | print("\n\n\n")
547 |
548 | return X[best_features + ['target']]
549 |
550 |
551 | # # Training
552 | 
553 | # ## Build the pipeline and process the data
554 |
555 | # In[21]:
556 |
557 |
558 |
559 |
560 |
561 | # ## Automatic parameter tuning
562 |
563 | # In[22]:
564 |
565 |
566 | def find_best_params(pipe_data, predict_fun, param_grid):
567 |
568 |     # Get the train and test sets, then normalize
569 | X_train, y_train, X_test, test_idx = split_data(pipe_data, target_name='target')
570 | min_max_scaler = MinMaxScaler()
571 | X_train = min_max_scaler.fit_transform(X_train)
572 | X_test = min_max_scaler.transform(X_test)
573 | best_score = 1
574 |
575 |     # Enumerate all parameter combinations to find the best
576 | for params in ParameterGrid(param_grid):
577 | print('-'*100, "\nparams = \n{}\n".format(params))
578 |
579 | oof, predictions = predict_fun(X_train, y_train, X_test, params=params, verbose_eval=False)
580 | score = mean_squared_error(oof, y_train)
581 | print("CV score: {}, current best score: {}".format(score, best_score))
582 |
583 | if best_score > score:
584 | print("Found new best score: {}".format(score))
585 | best_score = score
586 | best_params = params
587 |
588 |
589 | print('\n\nbest params: {}'.format(best_params))
590 | print('best score: {}'.format(best_score))
591 |
592 | return best_params
593 |
594 |
595 | # ## lgb
596 |
597 | # In[23]:
598 |
599 |
600 | def lgb_predict(X_train, y_train, X_test, params=None, verbose_eval=100):
601 |
602 |     if params is None:
603 | lgb_param = {'num_leaves': 20, 'min_data_in_leaf': 2, 'objective':'regression', 'max_depth': 4,
604 | 'learning_rate': 0.06, "min_child_samples": 3, "boosting": "gbdt", "feature_fraction": 1,
605 | "bagging_freq": 0.7, "bagging_fraction": 1, "bagging_seed": 11, "metric": 'mse', "lambda_l2": 0.003,
606 | "verbosity": -1}
607 |     else:
608 | lgb_param = params
609 |
610 | folds = KFold(n_splits=10, shuffle=True, random_state=2018)
611 | oof_lgb = np.zeros(len(X_train))
612 | predictions_lgb = np.zeros(len(X_test))
613 |
614 | for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):
615 | if verbose_eval:
616 | print("fold n°{}".format(fold_+1))
617 |
618 | trn_data = lgb.Dataset(X_train[trn_idx], y_train[trn_idx])
619 | val_data = lgb.Dataset(X_train[val_idx], y_train[val_idx])
620 |
621 | num_round = 10000
622 | clf = lgb.train(lgb_param, trn_data, num_round, valid_sets = [trn_data, val_data],
623 | verbose_eval=verbose_eval, early_stopping_rounds = 100)
624 |
625 | oof_lgb[val_idx] = clf.predict(X_train[val_idx], num_iteration=clf.best_iteration)
626 |
627 | predictions_lgb += clf.predict(X_test, num_iteration=clf.best_iteration) / folds.n_splits
628 |
629 | if verbose_eval:
630 | print("CV score: {:<8.8f}".format(mean_squared_error(oof_lgb, y_train)))
631 |
632 | return oof_lgb, predictions_lgb
633 |
634 |
635 | def init_config():
636 | parser = argparse.ArgumentParser()
637 | parser.add_argument('--test_type', type=str,
638 | help='Can be B or C, meaning running code with either test B or test C')
639 | args = parser.parse_args()
640 | return args
641 |
642 |
643 | def read_data(train_file_name, test_file_name):
644 |     # Read the data and rename columns
645 | train = pd.read_csv(train_file_name, encoding='gb18030')
646 | test = pd.read_csv(test_file_name, encoding='gb18030')
647 | train.rename(columns={'样本id': 'id', '收率': 'target'}, inplace=True)
648 | test.rename(columns={'样本id': 'id', '收率': 'target'}, inplace=True)
649 | target_name = 'target'
650 |
651 |     # Known bad value: set it to nan
652 | train.loc[1304, 'A25'] = np.nan
653 | train['A25'] = train['A25'].astype(float)
654 |
655 |     # Strip the id prefix
656 | train['id'] = train['id'].apply(lambda x: int(x.split('_')[1]))
657 | test['id'] = test['id'].apply(lambda x: int(x.split('_')[1]))
658 |
659 | train.drop(train[train[target_name] < 0.87].index, inplace=True)
660 | full = pd.concat([train, test], ignore_index=True)
661 | return full
662 |
663 |
664 | def feature_processing(full):
665 |
666 | selected_features = ['A3_null', 'A6', 'A16', 'A25', 'A28', 'A28_end',
667 | 'B5', 'B10_null', 'B11_null', 'B14', 'B14/B12', 'id']
668 |
669 | pipe = Pipeline([
670 | ('del_nan_feature', del_nan_feature()),
671 | ('handle_time_str', handle_time_str()),
672 | ('calc_time_diff', calc_time_diff()),
673 | ('Handle_outliers', handle_outliers(2)),
674 | ('del_single_feature', del_single_feature(1)),
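674a |         # NOTE: with threshold=1 the step above drops nothing, since a value
674b |         # ratio can never exceed 1 -- it is effectively disabled here.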
675 | ('add_new_features', add_new_features()),
676 |         ('selected_fill_nans', selected_fill_nans(selected_features)),
677 | ('select_feature', select_feature(selected_features)),
678 | ])
679 |
680 | pipe_data = pipe.fit_transform(full.copy())
681 | print(pipe_data.shape)
682 | return pipe_data
683 |
684 |
685 | def train_predict(pipe_data):
686 |
687 | # lgb
688 | param_grid = [
689 | {'num_leaves': [20], 'min_data_in_leaf': [2, 3], 'objective': ['regression'],
690 | 'max_depth': [3, 4, 5], 'learning_rate': [0.06, 0.12, 0.24], "min_child_samples": [3],
691 | "boosting": ["gbdt"], "feature_fraction": [0.7], "bagging_freq": [1],
692 | "bagging_fraction": [1], "bagging_seed": [11], "metric": ['mse'],
693 | "lambda_l2": [0.0003, 0.001, 0.003], "verbosity": [-1]}
694 | ]
695 |
696 | lgb_best_params = find_best_params(pipe_data, lgb_predict, param_grid)
697 |
698 | X_train, y_train, X_test, test_idx = split_data(pipe_data, target_name='target')
699 | min_max_scaler = MinMaxScaler()
700 | X_train = min_max_scaler.fit_transform(X_train)
701 | X_test = min_max_scaler.transform(X_test)
702 |     oof_lgb, predictions_lgb = lgb_predict(X_train, y_train, X_test, params=lgb_best_params, verbose_eval=200)
703 |
704 | # xgb
705 | param_grid = [
706 | {'silent': [1],
707 | 'nthread': [4],
708 | 'eval_metric': ['rmse'],
709 | 'eta': [0.03],
710 | 'objective': ['reg:linear'],
711 | 'max_depth': [4, 5, 6],
712 | 'num_round': [1000],
713 | 'subsample': [0.4, 0.6, 0.8, 1],
714 | 'colsample_bytree': [0.7, 0.9, 1],
715 | }
716 | ]
717 |
718 | xgb_best_params = find_best_params(pipe_data, xgb_predict, param_grid)
719 |
720 | X_train, y_train, X_test, test_idx = split_data(pipe_data, target_name='target')
721 | min_max_scaler = MinMaxScaler()
722 | X_train = min_max_scaler.fit_transform(X_train)
723 | X_test = min_max_scaler.transform(X_test)
724 |     oof_xgb, predictions_xgb = xgb_predict(X_train, y_train, X_test, params=xgb_best_params, verbose_eval=200)
725 |
726 |     # Model ensembling: stacking
727 | train_stack = np.vstack([oof_lgb, oof_xgb]).transpose()
728 | test_stack = np.vstack([predictions_lgb, predictions_xgb]).transpose()
729 |
730 | folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=4590)
731 | oof_stack = np.zeros(train_stack.shape[0])
732 | predictions = np.zeros(test_stack.shape[0])
733 |
734 | for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack, y_train)):
735 | print("fold {}".format(fold_))
736 | trn_data, trn_y = train_stack[trn_idx], y_train[trn_idx]
737 | val_data, val_y = train_stack[val_idx], y_train[val_idx]
738 |
739 | clf_3 = BayesianRidge()
740 | clf_3.fit(trn_data, trn_y)
741 |
742 | oof_stack[val_idx] = clf_3.predict(val_data)
743 | predictions += clf_3.predict(test_stack) / 10
744 |
745 | final_score = mean_squared_error(y_train, oof_stack)
746 | print(final_score)
747 | return predictions
748 |
749 |
750 | def gen_submit(test_file_name, result_name, predictions):
751 |     # Generate the submission
752 | sub_df = pd.read_csv(test_file_name, encoding='gb18030')
753 | sub_df = sub_df[['样本id', 'A1']]
754 | sub_df['A1'] = predictions
755 | sub_df['A1'] = sub_df['A1'].apply(lambda x: round(x, 3))
756 | print("Generating a submit file : {}".format(result_name))
757 | sub_df.to_csv(result_name, header=0, index=0)
758 |
759 | if __name__ == '__main__':
760 | args = init_config()
761 | print(args, file=sys.stderr)
762 |
763 | if args.test_type in ['B', 'b']:
764 | test_file_name = 'data/jinnan_round1_testB_20190121.csv'
765 | result_name = 'submit_B.csv'
766 | elif args.test_type in ['C', 'c']:
767 | test_file_name = 'data/jinnan_round1_test_20190121.csv'
768 | result_name = 'submit_C.csv'
769 | else:
770 | raise RuntimeError('Need config of test_type, can be only B or C for example: --test_type=B')
771 |
772 |     # Set the file names and read the files
773 | train_file_name = 'data/jinnan_round1_train_20181227.csv'
774 |
775 | print("Training file named {} and testing file named {}".format(train_file_name, test_file_name))
776 |
777 | # read file
778 | full = read_data(train_file_name, test_file_name)
779 |
780 | # feature processing
781 | pipe_data = feature_processing(full)
782 |
783 | # train and predict
784 | predictions = train_predict(pipe_data)
785 |
786 | # generate a submit file
787 | gen_submit(test_file_name, result_name, predictions)
788 |
789 | print("Finished")
790 |
--------------------------------------------------------------------------------
/复赛/津南数字制造算法挑战赛+17+Drop/data/optimize.csv:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/taoyafan/jinnan/52035305b2219f6700f2e922825de2d5db1d7abc/复赛/津南数字制造算法挑战赛+17+Drop/data/optimize.csv
--------------------------------------------------------------------------------
/复赛/津南数字制造算法挑战赛+17+Drop/main.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | import numpy as np
4 | import pandas as pd
5 | import lightgbm as lgb
6 | import xgboost as xgb
7 | import warnings
8 | import re
9 | import argparse, sys
10 | import pickle
11 | import os
12 |
13 | from sklearn.model_selection import KFold, RepeatedKFold
14 | from sklearn.pipeline import Pipeline
15 | from sklearn.base import BaseEstimator, TransformerMixin
16 | from sklearn.linear_model import BayesianRidge
17 | from sklearn.metrics import mean_squared_error, mean_absolute_error
18 | from sklearn.preprocessing import MinMaxScaler
19 | from sklearn.model_selection import ParameterGrid
20 |
21 | warnings.simplefilter(action='ignore', category=FutureWarning)
22 | warnings.filterwarnings("ignore")
23 |
24 | def init_config():
25 | parser = argparse.ArgumentParser()
26 | parser.add_argument('--test_type', type=str,
27 | help='Can be B or C, meaning running code with either test B or test C')
28 | args = parser.parse_args()
29 | return args
30 |
31 |
32 | def pkl_load(file_name):
33 | with open(file_name, "rb") as f:
34 | return pickle.load(f)
35 |
36 |
37 | def pkl_save(fname, data, protocol=3):
38 | with open(fname, "wb") as f:
39 | pickle.dump(data, f, protocol)
40 |
41 |
42 | def load_models():
43 | lgb_models = pkl_load("models/lgb_models.pkl")
44 | xgb_models = pkl_load("models/xgb_models.pkl")
45 | stack_models = pkl_load("models/stack_models.pkl")
46 | min_max_scaler = pkl_load("models/min_max_scaler.pkl")
47 | return lgb_models, xgb_models, stack_models, min_max_scaler
48 |
49 |
50 | def read_data(train_file_name, test_file_name):
51 |     # Read the data and rename columns
52 | train = pd.read_csv(train_file_name, encoding='gb18030')
53 | test = pd.read_csv(test_file_name, encoding='gb18030')
54 | train.rename(columns={'样本id': 'id', '收率': 'target'}, inplace=True)
55 | test.rename(columns={'样本id': 'id', '收率': 'target'}, inplace=True)
56 | target_name = 'target'
57 |
58 |     # Known bad value: set it to nan
59 | train.loc[1304, 'A25'] = np.nan
60 | train['A25'] = train['A25'].astype(float)
61 |
62 |     # Strip the id prefix
63 | train['id'] = train['id'].apply(lambda x: int(x.split('_')[1]))
64 | test['id'] = test['id'].apply(lambda x: int(x.split('_')[1]))
65 |
66 | train.drop(train[train[target_name] < 0.87].index, inplace=True)
67 | _full = pd.concat([train, test], ignore_index=True)
68 | return _full
69 |
70 |
71 | class del_nan_feature(BaseEstimator, TransformerMixin):
72 |
73 | def __init__(self, th_high=0.85, th_low=0.02):
74 | self.th_high = th_high
75 | self.th_low = th_low
76 |
77 | def fit(self, X, y=None):
78 | return self
79 |
80 | def transform(self, X):
81 | print('-' * 30, ' ' * 5, 'del_nan_feature', ' ' * 5, '-' * 30, '\n')
82 | print("shape before process = {}".format(X.shape))
83 |
84 |         # Drop features whose missing rate exceeds th_high
85 | X.dropna(axis=1, thresh=(1 - self.th_high) * X.shape[0], inplace=True)
86 |
87 |         # Missing rate is noticeable: add an indicator feature
88 | for col in X.columns:
89 | if col == 'target':
90 | continue
91 |
92 | miss_rate = X[col].isnull().sum() / X.shape[0]
93 | if miss_rate > self.th_low:
94 | print("Missing rate of {} is {:.3f} exceed {}, adding new feature {}".
95 | format(col, miss_rate, self.th_low, col + '_null'))
96 | X[col + '_null'] = 0
97 | X.loc[X[pd.isnull(X[col])].index, [col + '_null']] = 1
98 | print("shape = {}".format(X.shape))
99 |
100 | return X
101 |
102 | # Parse a time of day
103 | def timeTranSecond(t):
104 | try:
105 | h,m,s=t.split(":")
106 | except:
107 |
108 | if t=='1900/1/9 7:00':
109 | return 7*3600/3600
110 | elif t=='1900/1/1 2:30':
111 | return (2*3600+30*60)/3600
112 | elif pd.isnull(t):
113 | return np.nan
114 | else:
115 | return 0
116 |
117 | try:
118 | tm = (int(h)*3600+int(m)*60+int(s))/3600
119 | except:
120 | return (30*60)/3600
121 |
122 | return tm
123 |
124 |
125 | # Parse a time range
126 | def getDuration(se):
127 | try:
128 | sh, sm, eh, em = re.findall(r"\d+", se)
129 | # print("sh, sm, eh, em = {}, {}, {}, {}".format(sh, em, eh, em))
130 | except:
131 | if pd.isnull(se):
132 | return np.nan, np.nan, np.nan
133 |
134 | try:
135 | t_start = (int(sh) * 3600 + int(sm) * 60) / 3600
136 | t_end = (int(eh) * 3600 + int(em) * 60) / 3600
137 |
138 | if t_start > t_end:
139 | tm = t_end - t_start + 24
140 | else:
141 | tm = t_end - t_start
142 | except:
143 | if se == '19:-20:05':
144 | return 19, 20, 1
145 | elif se == '15:00-1600':
146 | return 15, 16, 1
147 | else:
148 | print("se = {}".format(se))
149 |
150 | return t_start, t_end, tm
151 |
152 |
153 | class handle_time_str(BaseEstimator, TransformerMixin):
154 |
155 | def __init__(self):
156 | pass
157 |
158 | def fit(self, X, y=None):
159 | return self
160 |
161 | def transform(self, X):
162 | print('-' * 30, ' ' * 5, 'handle_time_str', ' ' * 5, '-' * 30, '\n')
163 |
164 | for f in ['A5', 'A7', 'A9', 'A11', 'A14', 'A16', 'A24', 'A26', 'B5', 'B7']:
165 | try:
166 | X[f] = X[f].apply(timeTranSecond)
167 | except:
168 |                 print(f, 'should have been dropped earlier!')
169 |
170 | for f in ['A20', 'A28', 'B4', 'B9', 'B10', 'B11']:
171 | try:
172 | start_end_diff = X[f].apply(getDuration)
173 |
174 | X[f + '_start'] = start_end_diff.apply(lambda x: x[0])
175 | X[f + '_end'] = start_end_diff.apply(lambda x: x[1])
176 | X[f] = start_end_diff.apply(lambda x: x[2])
177 |
178 | except:
179 |                 print(f, 'should have been dropped earlier!')
180 | return X
181 |
182 |
183 | def t_start_t_end(t):
184 | if pd.isnull(t[0]) or pd.isnull(t[1]):
185 | # print("t_start = {}, t_end = {}, id = {}".format(t[0], t[1], t[2]))
186 | return np.nan
187 |
188 | if t[1] < t[0]:
189 | t[1] += 24
190 |
191 | dt = t[1] - t[0]
192 |
193 | if (dt > 24 or dt < 0):
194 | # print("dt error, t_start = {}, t_end = {}, id = {}".format(t[0], t[1], t[2]))
195 | return np.nan
196 |
197 | return dt
198 |
199 |
200 | class calc_time_diff(BaseEstimator, TransformerMixin):
201 | def __init__(self):
202 | pass
203 |
204 | def fit(self, X, y=None):
205 | return self
206 |
207 | def transform(self, X):
208 | print('-' * 30, ' ' * 5, 'calc_time_diff', ' ' * 5, '-' * 30, '\n')
209 |
210 |         # t_start lists the starting-time columns; tn maps each start to its intermediate-time columns, from which t_start is subtracted to get the time differences
211 | t_start = ['A9', 'A24', 'B5']
212 | tn = {'A9': ['A11', 'A14', 'A16'], 'A24': ['A26'], 'B5': ['B7']}
213 |
214 |         # Compute the time differences
215 | for t_s in t_start:
216 | for t_e in tn[t_s]:
217 | X[t_e + '-' + t_s] = X[[t_s, t_e, 'target']].apply(t_start_t_end, axis=1)
218 |
219 |         # Round all results to 3 decimal places
220 | X = X.apply(lambda x: round(x, 3))
221 |
222 | print("shape = {}".format(X.shape))
223 |
224 | return X
225 |
226 |
227 | class handle_outliers(BaseEstimator, TransformerMixin):
228 |
229 | def __init__(self, threshold=2):
230 | self.th = threshold
231 |
232 | def fit(self, X, y=None):
233 | return self
234 |
235 | def transform(self, X):
236 | print('-' * 30, ' ' * 5, 'handle_outliers', ' ' * 5, '-' * 30, '\n')
237 | category_col = [col for col in X if col not in ['id', 'target']]
238 | for col in category_col:
239 | label = X[col].value_counts(dropna=False).index.tolist()
240 | for i, num in enumerate(X[col].value_counts(dropna=False).values):
241 | if num <= self.th:
242 | # print("Number of label {} in feature {} is {}".format(label[i], col, num))
243 | X.loc[X[col] == label[i], [col]] = np.nan
244 |
245 | print("shape = {}".format(X.shape))
246 | return X
247 |
248 |
249 | def split_data(pipe_data, target_name='target', unused_feature=[], min_max_scaler=None):
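249a |     # Unlike the preliminary-round version, a fitted scaler may be passed in:
249b |     # a new MinMaxScaler is fit on the train rows when it is None, otherwise the
249c |     # given one is reused (e.g. the scaler loaded from models/min_max_scaler.pkl).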
250 |
251 |     # Feature column names
252 | category_col = [col for col in pipe_data if col not in ['target'] + [target_name] + unused_feature]
253 |
254 |     # Row indices of train and test sets
255 |     # The train set keeps only rows where both target and target_name are present
256 | train_idx = pipe_data[np.logical_not(
257 | np.logical_or(pd.isnull(pipe_data[target_name]), pd.isnull(pipe_data['target']))
258 | )].index
259 |
260 | test_idx = pipe_data[pd.isnull(pipe_data[target_name])].index
261 |
262 |     # Extract the train and test arrays
263 | X_train = pipe_data.loc[train_idx, category_col].values.astype(np.float64)
264 | y_train = np.squeeze(pipe_data.loc[train_idx, [target_name]].values.astype(np.float64))
265 | X_test = pipe_data.loc[test_idx, category_col].values.astype(np.float64)
266 |
267 |     # Normalize
268 |     if min_max_scaler is None:
269 | min_max_scaler = MinMaxScaler()
270 | X_train = min_max_scaler.fit_transform(X_train)
271 | else:
272 | X_train = min_max_scaler.transform(X_train)
273 | X_test = min_max_scaler.transform(X_test)
274 |
275 | return X_train, y_train, X_test, test_idx, min_max_scaler
276 |
277 |
278 | ##### xgb
279 | def xgb_predict(X_train, y_train, X_test, params=None, verbose_eval=100):
280 |     if params is None:
281 | xgb_params = {'eta': 0.05, 'max_depth': 10, 'subsample': 0.8, 'colsample_bytree': 0.8,
282 | 'objective': 'reg:linear', 'eval_metric': 'rmse', 'silent': True, 'nthread': 4}
283 | else:
284 | xgb_params = params
285 |
286 | folds = KFold(n_splits=10, shuffle=True, random_state=2018)
287 | oof_xgb = np.zeros(len(X_train))
288 | predictions_xgb = np.zeros(len(X_test))
289 | xgb_models = []
290 |
291 | for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):
292 | if (verbose_eval):
293 | print("fold n°{}".format(fold_ + 1))
294 | print("len trn_idx {}".format(len(trn_idx)))
295 |
296 | trn_data = xgb.DMatrix(X_train[trn_idx], y_train[trn_idx])
297 | val_data = xgb.DMatrix(X_train[val_idx], y_train[val_idx])
298 |
299 | watchlist = [(trn_data, 'train'), (val_data, 'valid_data')]
300 | clf = xgb.train(dtrain=trn_data,
301 | num_boost_round=20000,
302 | evals=watchlist,
303 | early_stopping_rounds=200,
304 | verbose_eval=verbose_eval,
305 | params=xgb_params)
306 |
307 | xgb_models.append(clf)
308 | oof_xgb[val_idx] = clf.predict(xgb.DMatrix(X_train[val_idx]), ntree_limit=clf.best_ntree_limit)
309 | predictions_xgb += clf.predict(xgb.DMatrix(X_test), ntree_limit=clf.best_ntree_limit) / folds.n_splits
310 |
311 | if (verbose_eval):
312 | print("CV score: {:<8.8f}".format(mean_squared_error(oof_xgb, y_train)))
313 | return oof_xgb, predictions_xgb, xgb_models
314 |
315 |
316 | def lgb_predict(X_train, y_train, X_test, params=None, verbose_eval=100):
317 |
318 |     if params is None:
319 | lgb_param = {'num_leaves': 20, 'min_data_in_leaf': 2, 'objective': 'regression', 'max_depth': 5,
320 | 'learning_rate': 0.24, "min_child_samples": 3, "boosting": "gbdt", "feature_fraction": 0.7,
321 | "bagging_freq": 1, "bagging_fraction": 1, "bagging_seed": 11, "metric": 'mse', "lambda_l2": 0.003,
322 | "verbosity": -1}
323 | else:
324 | lgb_param = params
325 |
326 | folds = KFold(n_splits=10, shuffle=True, random_state=2018)
327 | oof_lgb = np.zeros(len(X_train))
328 | predictions_lgb = np.zeros(len(X_test))
329 | lgb_models = []
330 |
331 | for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):
332 | if verbose_eval:
333 | print("fold n°{}".format(fold_ + 1))
334 |
335 | trn_data = lgb.Dataset(X_train[trn_idx], y_train[trn_idx])
336 | val_data = lgb.Dataset(X_train[val_idx], y_train[val_idx])
337 |
338 | num_round = 10000
339 | clf = lgb.train(lgb_param, trn_data, num_round, valid_sets=[trn_data, val_data],
340 | verbose_eval=verbose_eval, early_stopping_rounds=100)
341 |
342 | lgb_models.append(clf)
343 | oof_lgb[val_idx] = clf.predict(X_train[val_idx], num_iteration=clf.best_iteration)
344 | predictions_lgb += clf.predict(X_test, num_iteration=clf.best_iteration) / folds.n_splits
345 |
346 | if verbose_eval:
347 | print("CV score: {:<8.8f}".format(mean_squared_error(oof_lgb, y_train)))
348 |
349 | return oof_lgb, predictions_lgb, lgb_models
350 |
351 |
352 | class add_new_features(BaseEstimator, TransformerMixin):
353 |
354 | def __init__(self):
355 | pass
356 |
357 | def fit(self, X, y=None):
358 | return self
359 |
360 | def transform(self, X):
361 | print('-' * 30, ' ' * 5, 'add_new_features', ' ' * 5, '-' * 30, '\n')
362 |
363 |         # After testing, only B14 / B12 proved useful
364 |
365 | # X['B14/A1'] = X['B14'] / X['A1']
366 | # X['B14/A3'] = X['B14'] / X['A3']
367 | # X['B14/A4'] = X['B14'] / X['A4']
368 | # X['B14/A19'] = X['B14'] / X['A19']
369 | # X['B14/B1'] = X['B14'] / X['B1']
370 | # X['B14/B9'] = X['B14'] / X['B9']
371 |
372 | X['B14/B12'] = X['B14'] / X['B12']
373 |
374 | print("shape = {}".format(X.shape))
375 | return X
376 |
377 |
378 | def predict(data, lgb_models, xgb_models, stack_models, min_max_scaler):
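378a |     # Three-stage inference: average the lgb fold models, average the xgb fold
378b |     # models, stack the two averages column-wise, then average the predictions
378c |     # of the BayesianRidge stacking models.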
379 |
380 | def _predict(X):
381 | lgb_result = 0
382 | xgb_result = 0
383 | stack_result = 0
384 |
385 | for clf in lgb_models:
386 | lgb_result += clf.predict(X, num_iteration=clf.best_iteration) / len(lgb_models)
387 |
388 | for clf in xgb_models:
389 | xgb_result += clf.predict(xgb.DMatrix(X), ntree_limit=clf.best_ntree_limit) / len(xgb_models)
390 |
391 | test_stack = np.vstack([lgb_result, xgb_result]).transpose()
392 | for clf in stack_models:
393 | stack_result += clf.predict(test_stack) / len(stack_models)
394 |
395 | return stack_result
396 |
397 | _, _, X_test, test_idx, _ = split_data(data, min_max_scaler=min_max_scaler)
398 | result = _predict(X_test)
399 | return result
400 |
401 |
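# Data cleaning and feature engineering pipeline; only the 9 selected
# features (plus the target) are kept afterwards.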
def feature_processing(full, outlier_th=3):
    selected_features = ['A22', 'A28', 'A20_end', 'B10', 'B11_start', 'A5', 'A10', 'B14/B12', 'B14']
    pipe = Pipeline([
        ('del_nan_feature', del_nan_feature()),
        ('handle_time_str', handle_time_str()),
        ('calc_time_diff', calc_time_diff()),
        ('Handle_outliers', handle_outliers(outlier_th)),
        ('add_new_features', add_new_features()),
    ])

    pipe_data = pipe.fit_transform(full.copy())[selected_features + ['target']]
    print(pipe_data.shape)
    return pipe_data

def gen_submit(test_file_name, result_name, predictions):
    # Generate the submission file
    sub_df = pd.read_csv(test_file_name, encoding='gb18030')
    sub_df = sub_df[['样本id', 'A1']]
    sub_df['A1'] = predictions
    sub_df['A1'] = sub_df['A1'].apply(lambda x: round(x, 3))
    print("Generating a submit file : {}".format(result_name))
    sub_df.to_csv(result_name, header=False, index=False)

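# Exhaustive grid search: every combination in param_grid is scored with the
# given CV routine (lgb_predict or xgb_predict), and the combination with the
# lowest out-of-fold MSE wins.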
def find_best_params(pipe_data, predict_fun, param_grid):

    # Get the normalized train and test sets
    X_train, y_train, X_test, test_idx, _ = split_data(pipe_data, target_name='target')
    best_params = None
    best_score = 1  # MSE on this task is far below 1, so any run will beat it

    # Try every parameter combination and keep the best one
    for params in ParameterGrid(param_grid):
        print('-' * 100, "\nparams = \n{}\n".format(params))

        oof, predictions, _ = predict_fun(X_train, y_train, X_test, params=params, verbose_eval=False)
        score = mean_squared_error(oof, y_train)
        print("CV score: {}, current best score: {}".format(score, best_score))

        if best_score > score:
            print("Found new best score: {}".format(score))
            best_score = score
            best_params = params

    print('\n\nbest params: {}'.format(best_params))
    print('best score: {}'.format(best_score))

    return best_params

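# Second-level model: a BayesianRidge regressor trained on the out-of-fold
# predictions of the two first-level models, using 5-fold CV repeated twice.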
def stacking_predict(oof_lgb, oof_xgb, predictions_lgb, predictions_xgb, y_train, verbose_eval=1):
    # Stacking
    train_stack = np.vstack([oof_lgb, oof_xgb]).transpose()
    test_stack = np.vstack([predictions_lgb, predictions_xgb]).transpose()

    folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=4590)
    oof_stack = np.zeros(train_stack.shape[0])
    predictions = np.zeros(test_stack.shape[0])
    stack_models = []

    for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack, y_train)):
        if verbose_eval:
            print("fold {}".format(fold_))
        trn_data, trn_y = train_stack[trn_idx], y_train[trn_idx]
        val_data, val_y = train_stack[val_idx], y_train[val_idx]

        clf_3 = BayesianRidge()
        clf_3.fit(trn_data, trn_y)
        stack_models.append(clf_3)

        oof_stack[val_idx] = clf_3.predict(val_data)
        predictions += clf_3.predict(test_stack) / 10  # 10 = 5 splits * 2 repeats

    final_score = mean_squared_error(y_train, oof_stack)
    if verbose_eval:
        print(final_score)
    return oof_stack, predictions, final_score, stack_models

def train_predict(pipe_data, lgb_best_params, xgb_best_params, verbose_eval=200):
    X_train, y_train, X_test, test_idx, min_max_scaler = split_data(pipe_data, target_name='target')

    oof_lgb, predictions_lgb, lgb_models = lgb_predict(X_train, y_train, X_test,
                                                       params=lgb_best_params, verbose_eval=verbose_eval)
    if verbose_eval:
        print('-' * 100)
    oof_xgb, predictions_xgb, xgb_models = xgb_predict(X_train, y_train, X_test,
                                                       params=xgb_best_params, verbose_eval=verbose_eval)
    if verbose_eval:
        print('-' * 100)
    oof_stack, predictions, final_score, stack_models = stacking_predict(oof_lgb, oof_xgb,
                                                                         predictions_lgb, predictions_xgb, y_train,
                                                                         verbose_eval=verbose_eval)

    return oof_stack, predictions, final_score, lgb_models, xgb_models, stack_models, min_max_scaler

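# End-to-end training: read the data, run the feature pipeline, grid-search
# both models, train the final ensemble and pickle everything under models/.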
def train_models():
    full = read_data('data/jinnan_round1_train_20181227.csv', 'data/jinnan_round1_testB_20190121.csv')
    pipe_data = feature_processing(full)

    param_grid = [
        {'num_leaves': [20], 'min_data_in_leaf': [2, 3], 'objective': ['regression'],
         'max_depth': [3, 4, 5], 'learning_rate': [0.06, 0.12, 0.24], "min_child_samples": [3],
         "boosting": ["gbdt"], "feature_fraction": [0.7], "bagging_freq": [1],
         "bagging_fraction": [1], "bagging_seed": [11], "metric": ['mse'],
         "lambda_l2": [0.0003, 0.001, 0.003], "verbosity": [-1]}
    ]

    lgb_best_params = find_best_params(pipe_data, lgb_predict, param_grid)

    param_grid = [
        {'silent': [1],
         'nthread': [4],
         'eval_metric': ['rmse'],
         'eta': [0.03],
         'objective': ['reg:linear'],
         'max_depth': [4, 5, 6],
         'num_round': [1000],
         'subsample': [0.4, 0.6, 0.8, 1],
         'colsample_bytree': [0.7, 0.9, 1],
         }
    ]

    xgb_best_params = find_best_params(pipe_data, xgb_predict, param_grid)

    oof_stack, predictions, final_score, lgb_models, xgb_models, stack_models, min_max_scaler = train_predict(
        pipe_data, lgb_best_params, xgb_best_params)

    if not os.path.exists('models'):
        os.makedirs('models')

    pkl_save("models/lgb_models.pkl", lgb_models)
    pkl_save("models/xgb_models.pkl", xgb_models)
    pkl_save("models/stack_models.pkl", stack_models)
    pkl_save("models/min_max_scaler.pkl", min_max_scaler)

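# Command-line entry point: --test_type selects the test file (A/B/C for the
# preliminary round, fusai for the final round, gen for parameter generation,
# which also sets outlier_th to 0).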
if __name__ == '__main__':
    args = init_config()
    print(args, file=sys.stderr)
    outlier_th = 3

    if args.test_type in ['B', 'b']:
        test_file_name = 'data/jinnan_round1_testB_20190121.csv'
        result_name = 'submit_B.csv'
    elif args.test_type in ['C', 'c']:
        test_file_name = 'data/jinnan_round1_test_20190201.csv'
        result_name = 'submit_C.csv'
    elif args.test_type in ['A', 'a']:
        test_file_name = 'data/jinnan_round1_testA_20181227.csv'
        result_name = 'submit_A.csv'
    elif args.test_type in ['fusai', 'FuSai', 'Fusai', 'fuSai']:
        test_file_name = 'data/FuSai.csv'
        result_name = 'submit_FuSai.csv'
    elif args.test_type in ['gen', 'Gen', 'GEN']:
        test_file_name = 'data/optimize.csv'
        result_name = 'submit_optimize.csv'
        outlier_th = 0
    else:
        raise RuntimeError('test_type must be one of A, B, C, fusai or gen, for example: --test_type=B')

    # Set the file names and read the data
    train_file_name = 'data/jinnan_round1_train_20181227.csv'

    print("Training file named {} and testing file named {}".format(train_file_name, test_file_name))

    print("Generating training models")
    train_models()
    print("Saving training models to file: 'models'")

    full = read_data(train_file_name, test_file_name)
    lgb_models, xgb_models, stack_models, min_max_scaler = load_models()
    # feature processing
    pipe_data = feature_processing(full, outlier_th=outlier_th)

    # predict with the saved models
    predictions = predict(pipe_data, lgb_models, xgb_models, stack_models, min_max_scaler)

    # generate a submit file
    gen_submit(test_file_name, result_name, predictions)

    print("Finished")

--------------------------------------------------------------------------------
/复赛/津南数字制造算法挑战赛+17+Drop/readme.md:
--------------------------------------------------------------------------------
# Environment requirements

## System

ubuntu 16.04

## Python version

python 3

## Required libraries

numpy, pandas, lightgbm, xgboost, sklearn, pickle



# Usage

Generate the prediction file for the final round:
python3 main.py --test_type=fusai

Generate the prediction file for optimize.csv:
python3 main.py --test_type=gen
# Finally

If you run into any problems, please contact us:

陶亚凡: 765370813@qq.com
Blue: cy_1995@qq.com
--------------------------------------------------------------------------------