├── README.md
├── 初赛
│   ├── history
│   │   ├── V0_学习.ipynb
│   │   ├── V1_增加特征相关性.ipynb
│   │   ├── V2_增加特征可视化和特征选择.ipynb
│   │   ├── V3_结构重构更规范化.ipynb
│   │   └── python-data-visualizations.ipynb
│   ├── 最终程序.ipynb
│   └── 津南数字制造算法挑战赛+20+Drop
│       ├── readme.md
│       └── run.py
└── 复赛
    ├── 复赛.ipynb
    └── 津南数字制造算法挑战赛+17+Drop
        ├── data
        │   └── optimize.csv
        ├── main.py
        └── readme.md
/README.md:
--------------------------------------------------------------------------------
1 | 
2 | 
3 | # Tianchi competition: code from team Drop for the [Jinnan Digital Manufacturing Algorithm Challenge, Arena 1](https://tianchi.aliyun.com/competition/entrance/231695/introduction) (17th in the preliminary round; final-round result unknown), covering both the preliminary and the final round
4 | 
5 | ## Update notes, March 9, 2019:
6 | 
7 | Added the final-round programs, covering yield prediction and optimal-parameter generation.
8 | 
9 | + __Yield prediction:__
10 | 
11 | Similar to the preliminary round, with the following changes:
12 | 
13 | (1) Anomalous values are no longer predicted from other features but simply replaced with missing values; this turned out to work even better than predicting them from the other features.
14 | 
15 | (2) The id feature is not used.
16 | 
17 | (3) After several rounds of feature selection we settled on 9 features, so the feature-selection step itself is omitted.
18 | 
19 | + __Optimal parameter generation:__
20 | 
21 | Two methods were tested, and particle swarm optimization was adopted in the end (minimal sketches of both ideas are included further down this README):
22 | 
23 | (1) Gradient descent: estimate each feature's gradient numerically, then update the parameters by gradient descent. This turned out to be essentially useless for the tree model, presumably because a slight perturbation of a feature rarely changes which side of a split it falls on, so the gradient is 0 most of the time.
24 | 
25 | (2) Particle swarm optimization: each particle represents one set of parameters, and we search for the particle whose predicted yield is best. Different swarm hyperparameters were also tested to make sure the optimum is actually found.
26 | 
27 | ## Updated files:
28 | 
29 | 复赛.ipynb is the local program, covering both yield prediction and optimal-parameter generation.
30 | 
31 | The folder "津南数字制造算法挑战赛+17+Drop" is the final submitted program; it contains only yield prediction, not optimal-parameter generation.
32 | 
33 | # Original content below:
34 | 
35 | -------------------------
36 | 
37 | Tianchi competition: code from team Drop, 17th in the preliminary round of the [Jinnan Digital Manufacturing Algorithm Challenge, Arena 1](https://tianchi.aliyun.com/competition/entrance/231695/introduction)
38 | 
39 | + Author: Tao Yafan, Shaanxi University of Science and Technology
40 | 
41 | My github and blog:
42 | 
43 | github: https://github.com/taoyafan
44 | 
45 | blog: https://me.csdn.net/taoyafan
46 | 
47 | + Teammate's (Blue, University of Electronic Science and Technology of China) github and blog:
48 | 
49 | github: https://github.com/BluesChang
50 | 
51 | blog: https://blueschang.github.io
52 | 
53 | Every part of this program draws heavily on 鱼佬's baseline.
54 | 
55 | Many thanks to my teammate for his many contributions, to 鱼佬 and his baseline, and to 林有夕, from whom I kept learning new things in the group chat.
56 | 
57 | This was my first ML competition; I started learning pandas, sklearn and the related topics from 鱼佬's baseline, so my skill is really limited. I would appreciate any comments or suggestions; if you spot problems or anything that could be improved, please let me know. Many thanks.
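To make point (1) of the update notes concrete, here is a minimal sketch of the numerical-gradient idea; the toy data, the GradientBoostingRegressor stand-in for the trained yield model, and the step size h are placeholder assumptions, not the competition setup. Because a nudge of ±h almost never moves a feature across a split threshold, every tree returns the same leaf for both evaluations and the central difference comes out exactly 0:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy stand-in for the trained yield model (placeholder data, 9 features).
rng = np.random.RandomState(0)
X = rng.rand(200, 9)
y = 0.9 + 0.05 * X[:, 0] + rng.normal(0, 0.005, size=200)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

def numerical_gradient(model, x, h=1e-3):
    """Central-difference gradient of the model's prediction at point x."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        x_hi, x_lo = x.copy(), x.copy()
        x_hi[i] += h
        x_lo[i] -= h
        grad[i] = (model.predict(x_hi[None, :])[0]
                   - model.predict(x_lo[None, :])[0]) / (2 * h)
    return grad

# Most entries come out exactly 0: a +-h nudge rarely crosses a split threshold.
print(numerical_gradient(model, X[0]))
```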
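And a minimal sketch of the particle swarm search in point (2), assuming a fitted model whose batch prediction is the yield to be maximized; the box bounds lo/hi and the inertia/attraction weights w, c1, c2 below are common textbook defaults standing in for whatever was actually tuned in the competition:

```python
import numpy as np

def pso_maximize(predict, lo, hi, n_particles=50, n_iter=200,
                 w=0.7, c1=1.5, c2=1.5, seed=2019):
    """Plain PSO. predict maps an (n_particles, dim) array of candidate
    parameter sets to an array of predicted yields."""
    rng = np.random.RandomState(seed)
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    pos = rng.uniform(lo, hi, size=(n_particles, len(lo)))
    vel = np.zeros_like(pos)
    pbest, pbest_val = pos.copy(), predict(pos)      # per-particle best so far
    gbest = pbest[np.argmax(pbest_val)].copy()       # global best so far
    for _ in range(n_iter):
        r1 = rng.rand(*pos.shape)
        r2 = rng.rand(*pos.shape)
        # inertia + pull toward each particle's own best + pull toward swarm best
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)             # keep parameters in range
        val = predict(pos)
        better = val > pbest_val
        pbest[better], pbest_val[better] = pos[better], val[better]
        gbest = pbest[np.argmax(pbest_val)].copy()
    return gbest, pbest_val.max()

# Hypothetical usage with a fitted sklearn-style regressor `model` over 9 features:
# best_params, best_yield = pso_maximize(model.predict, lo=np.zeros(9), hi=np.ones(9))
```

Because the swarm only ever evaluates the model at candidate points, it sidesteps the zero-gradient problem of tree models entirely.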
58 | 
59 | I had been meaning to open-source this for a while, but the score was too poor; while it is still on the front page, here it is. I am in this to learn anyway, not to win a prize, so nothing in this program is held back~
60 | 
61 | ## Requirements
62 | 
63 | + __System__
64 | 
65 | ubuntu 16.04
66 | 
67 | + __Python version__
68 | 
69 | python 3.5 or above
70 | 
71 | + __Required libraries__
72 | 
73 | numpy, pandas, lightgbm, xgboost, sklearn
74 | 
75 | ## Files
76 | 
77 | Includes the [preliminary round](https://github.com/taoyafan/jinnan/tree/master/%E5%88%9D%E8%B5%9B) and the final round (to be open-sourced once the competition ends). The preliminary round contains the [final local program](https://github.com/taoyafan/jinnan/blob/master/%E5%88%9D%E8%B5%9B/%E6%9C%80%E7%BB%88%E7%A8%8B%E5%BA%8F.ipynb) (.ipynb), the [submitted program](https://github.com/taoyafan/jinnan/tree/master/%E5%88%9D%E8%B5%9B/%E6%B4%A5%E5%8D%97%E6%95%B0%E5%AD%97%E5%88%B6%E9%80%A0%E7%AE%97%E6%B3%95%E6%8C%91%E6%88%98%E8%B5%9B%2B20%2BDrop) (.py) and the [history programs](https://github.com/taoyafan/jinnan/tree/master/%E5%88%9D%E8%B5%9B/history).
78 | 
79 | The [final local program](https://github.com/taoyafan/jinnan/blob/master/%E5%88%9D%E8%B5%9B/%E6%9C%80%E7%BB%88%E7%A8%8B%E5%BA%8F.ipynb) is the recommended read: its headings, comments and outputs are fairly complete, which makes it easy to follow.
80 | 
81 | The [submitted program](https://github.com/taoyafan/jinnan/tree/master/%E5%88%9D%E8%B5%9B/%E6%B4%A5%E5%8D%97%E6%95%B0%E5%AD%97%E5%88%B6%E9%80%A0%E7%AE%97%E6%B3%95%E6%8C%91%E6%88%98%E8%B5%9B%2B20%2BDrop) is the local program converted to .py and slightly polished (main function, structure, paths, input validation, and so on); see the readme inside for an introduction.
82 | 
83 | The [history programs](https://github.com/taoyafan/jinnan/tree/master/%E5%88%9D%E8%B5%9B/history) record my learning process, starting from studying 鱼佬's program; some turned out useful, some did not.
84 | 
85 | ## Program overview
86 | 
87 | + __Overall flow:__
88 | 
89 | Read data -> manually fix obvious anomalies in the training set -> data cleaning -> feature engineering -> training
90 | 
91 | __Data cleaning:__
92 | 
93 | Drop features with too high a missing rate -> parse time strings (and time ranges) -> compute time differences -> handle outliers -> drop features dominated by a single category
94 | 
95 | __Feature engineering:__
96 | 
97 | Build new features -> predict nan values from correlated features -> backward feature selection
98 | 
99 | __Training:__
100 | 
101 | Automatically tune lgb and xgb, then blend the two
102 | 
103 | + __Data path:__
104 | 
105 | Writing on top of 鱼佬's baseline I ended up with too many variable names and kept hitting small errors when re-running, so I restructured the whole thing several times and finally settled on a pipeline that covers all of the data cleaning and feature engineering. There are fewer variables now and the stages are decoupled, so the data path is worth spelling out:
106 | 
107 | reading the data gives train and test ----> concatenating them gives full ---> the pipeline turns that into pipe_data ---> the train/test split gives X_train and X_test ---> training produces the results oof and predictions
108 | 
109 | ## Running
110 | 
111 | Under the top-level 复赛 directory you also need data and result folders, holding the training/test data and the generated results respectively. A generated result file is named: testname\_modelname\_result\_featurecount\_time.csv (the submitted program follows the official naming requirements).
112 | --------------------------------------------------------------------------------
/初赛/history/python-data-visualizations.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {
6 | "_cell_guid": "e748dd89-de20-44f2-a122-b2bb69fbab24",
7 | "_uuid": "a42ede279bffeecdddd64047e06fee4b9aed50c5"
8 | },
9 | "source": [
10 | "## This notebook demos Python data visualizations on the Iris dataset\n",
11 | "\n",
12 | "This Python 3 environment comes with many helpful analytics libraries installed. It is defined by the [kaggle/python docker image](https://github.com/kaggle/docker-python)\n",
13 | "\n",
14 | "We'll use three libraries for this tutorial: [pandas](http://pandas.pydata.org/), [matplotlib](http://matplotlib.org/), and [seaborn](http://stanford.edu/~mwaskom/software/seaborn/).\n",
15 | "\n",
16 | "Press \"Fork\" at the top-right of this screen to run this notebook yourself and build each of the examples."
17 | ] 18 | }, 19 | { 20 | "cell_type": "code", 21 | "execution_count": 1, 22 | "metadata": { 23 | "_cell_guid": "136008bf-b756-49c1-bc5e-81c1247b969d", 24 | "_uuid": "4a72555be32be45a318141821b58ceac28ffb0d7" 25 | }, 26 | "outputs": [], 27 | "source": [ 28 | "# First, we'll import pandas, a data processing and CSV file I/O library\n", 29 | "import pandas as pd\n", 30 | "\n", 31 | "# We'll also import seaborn, a Python graphing library\n", 32 | "import warnings # current version of seaborn generates a bunch of warnings that we'll ignore\n", 33 | "warnings.filterwarnings(\"ignore\")\n", 34 | "import seaborn as sns\n", 35 | "import matplotlib.pyplot as plt\n", 36 | "sns.set(style=\"white\", color_codes=True)\n", 37 | "\n", 38 | "# Next, we'll load the Iris flower dataset, which is in the \"../input/\" directory\n", 39 | "iris = pd.read_csv(\"../input/Iris.csv\") # the iris dataset is now a Pandas DataFrame\n", 40 | "\n", 41 | "# Let's see what's in the iris data - Jupyter notebooks print the result of the last thing you do\n", 42 | "iris.head()\n", 43 | "\n", 44 | "# Press shift+enter to execute this cell" 45 | ] 46 | }, 47 | { 48 | "cell_type": "code", 49 | "execution_count": 2, 50 | "metadata": { 51 | "_cell_guid": "5dba36af-1bb8-49e5-9b49-1451f4136246", 52 | "_uuid": "ef33a54d1e704924d1eb29632728011d31bfb543" 53 | }, 54 | "outputs": [], 55 | "source": [ 56 | "# Let's see how many examples we have of each species\n", 57 | "iris[\"Species\"].value_counts()" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 3, 63 | "metadata": { 64 | "_cell_guid": "b8588972-deb5-4094-99a6-5feb722e3301", 65 | "_uuid": "b61dbe844a638b1b26e0c3f16a104570d4b60010" 66 | }, 67 | "outputs": [], 68 | "source": [ 69 | "# The first way we can plot things is using the .plot extension from Pandas dataframes\n", 70 | "# We'll use this to make a scatterplot of the Iris features.\n", 71 | "iris.plot(kind=\"scatter\", x=\"SepalLengthCm\", y=\"SepalWidthCm\")" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 4, 77 | "metadata": { 78 | "_cell_guid": "dc213965-5341-4ce7-ad13-42eb5e2fa1e7", 79 | "_uuid": "81da4a44d4ec41f5c7acd172c75df2f47884a13e" 80 | }, 81 | "outputs": [], 82 | "source": [ 83 | "# We can also use the seaborn library to make a similar plot\n", 84 | "# A seaborn jointplot shows bivariate scatterplots and univariate histograms in the same figure\n", 85 | "sns.jointplot(x=\"SepalLengthCm\", y=\"SepalWidthCm\", data=iris, size=5)" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 5, 91 | "metadata": { 92 | "_cell_guid": "0a5c46f6-be6e-4ef6-94a4-9bea13c9a0aa", 93 | "_uuid": "d07401f715fa8f39951a6212bce668657d457fe1" 94 | }, 95 | "outputs": [], 96 | "source": [ 97 | "# One piece of information missing in the plots above is what species each plant is\n", 98 | "# We'll use seaborn's FacetGrid to color the scatterplot by species\n", 99 | "sns.FacetGrid(iris, hue=\"Species\", size=5) \\\n", 100 | " .map(plt.scatter, \"SepalLengthCm\", \"SepalWidthCm\") \\\n", 101 | " .add_legend()" 102 | ] 103 | }, 104 | { 105 | "cell_type": "code", 106 | "execution_count": 6, 107 | "metadata": { 108 | "_cell_guid": "128245d5-6f01-44cd-8b2f-8a49735ac552", 109 | "_uuid": "01cb1b0849f6c7e800c8798164741a8fdae53617" 110 | }, 111 | "outputs": [], 112 | "source": [ 113 | "# We can look at an individual feature in Seaborn through a boxplot\n", 114 | "sns.boxplot(x=\"Species\", y=\"PetalLengthCm\", data=iris)" 115 | ] 116 | }, 117 | { 118 | "cell_type": "code", 119 | 
"execution_count": 7, 120 | "metadata": { 121 | "_cell_guid": "b86a675c-f604-496a-931a-df76d7d6aaa1", 122 | "_uuid": "a481595c1e46d625e887b61f5eb0e3c48269bde9" 123 | }, 124 | "outputs": [], 125 | "source": [ 126 | "# One way we can extend this plot is adding a layer of individual points on top of\n", 127 | "# it through Seaborn's striplot\n", 128 | "# \n", 129 | "# We'll use jitter=True so that all the points don't fall in single vertical lines\n", 130 | "# above the species\n", 131 | "#\n", 132 | "# Saving the resulting axes as ax each time causes the resulting plot to be shown\n", 133 | "# on top of the previous axes\n", 134 | "ax = sns.boxplot(x=\"Species\", y=\"PetalLengthCm\", data=iris)\n", 135 | "ax = sns.stripplot(x=\"Species\", y=\"PetalLengthCm\", data=iris, jitter=True, edgecolor=\"gray\")" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 8, 141 | "metadata": { 142 | "_cell_guid": "c49f199b-2798-4fdc-87a7-bd2f7f8ff447", 143 | "_uuid": "0d422fc672f3cfb30ec02d1345942cc583c51b05" 144 | }, 145 | "outputs": [], 146 | "source": [ 147 | "# A violin plot combines the benefits of the previous two plots and simplifies them\n", 148 | "# Denser regions of the data are fatter, and sparser thiner in a violin plot\n", 149 | "sns.violinplot(x=\"Species\", y=\"PetalLengthCm\", data=iris, size=6)" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": 9, 155 | "metadata": { 156 | "_cell_guid": "78c32fc8-3c36-482a-81f4-14d4b6ee1430", 157 | "_uuid": "b10aa16c47bdad1964d1746281564f68a5ab741e" 158 | }, 159 | "outputs": [], 160 | "source": [ 161 | "# A final seaborn plot useful for looking at univariate relations is the kdeplot,\n", 162 | "# which creates and visualizes a kernel density estimate of the underlying feature\n", 163 | "sns.FacetGrid(iris, hue=\"Species\", size=6) \\\n", 164 | " .map(sns.kdeplot, \"PetalLengthCm\") \\\n", 165 | " .add_legend()" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": 10, 171 | "metadata": { 172 | "_cell_guid": "7351999e-4522-451f-b3f1-0031c3a88eaa", 173 | "_uuid": "fb9e2f61bf81478f21489f1219358e2b6fa164dd" 174 | }, 175 | "outputs": [], 176 | "source": [ 177 | "# Another useful seaborn plot is the pairplot, which shows the bivariate relation\n", 178 | "# between each pair of features\n", 179 | "# \n", 180 | "# From the pairplot, we'll see that the Iris-setosa species is separataed from the other\n", 181 | "# two across all feature combinations\n", 182 | "sns.pairplot(iris.drop(\"Id\", axis=1), hue=\"Species\", size=3)" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 11, 188 | "metadata": { 189 | "_cell_guid": "3f1fb3ba-e0fd-45b4-8a64-fe2a689bb83b", 190 | "_uuid": "417d197016286a1af02eb522b3a0e0476e76b39b" 191 | }, 192 | "outputs": [], 193 | "source": [ 194 | "# The diagonal elements in a pairplot show the histogram by default\n", 195 | "# We can update these elements to show other things, such as a kde\n", 196 | "sns.pairplot(iris.drop(\"Id\", axis=1), hue=\"Species\", size=3, diag_kind=\"kde\")" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 12, 202 | "metadata": { 203 | "_cell_guid": "46cceec5-3525-4b02-8ab7-5ed1420cd198", 204 | "_uuid": "d7fb122f77031cc79ab0e922608d9e6c5de774ca" 205 | }, 206 | "outputs": [], 207 | "source": [ 208 | "# Now that we've covered seaborn, let's go back to some of the ones we can make with Pandas\n", 209 | "# We can quickly make a boxplot with Pandas on each feature split out by species\n", 210 | 
"iris.drop(\"Id\", axis=1).boxplot(by=\"Species\", figsize=(12, 6))" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": 13, 216 | "metadata": { 217 | "_cell_guid": "5bbed28c-d813-41c4-824d-7038fbfee6ea", 218 | "_uuid": "61c76e99340b06c8020151ae4b8942e1daa8b1ef" 219 | }, 220 | "outputs": [], 221 | "source": [ 222 | "# One cool more sophisticated technique pandas has available is called Andrews Curves\n", 223 | "# Andrews Curves involve using attributes of samples as coefficients for Fourier series\n", 224 | "# and then plotting these\n", 225 | "from pandas.tools.plotting import andrews_curves\n", 226 | "andrews_curves(iris.drop(\"Id\", axis=1), \"Species\")" 227 | ] 228 | }, 229 | { 230 | "cell_type": "code", 231 | "execution_count": 14, 232 | "metadata": { 233 | "_cell_guid": "77c1b6f0-7632-4d61-bf03-7b5d6856b987", 234 | "_uuid": "b9ac80fdd71c270c9991d34ca87f70d6b00b2192" 235 | }, 236 | "outputs": [], 237 | "source": [ 238 | "# Another multivariate visualization technique pandas has is parallel_coordinates\n", 239 | "# Parallel coordinates plots each feature on a separate column & then draws lines\n", 240 | "# connecting the features for each data sample\n", 241 | "from pandas.tools.plotting import parallel_coordinates\n", 242 | "parallel_coordinates(iris.drop(\"Id\", axis=1), \"Species\")" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": 15, 248 | "metadata": { 249 | "_cell_guid": "d5c6314f-7b36-41ce-b0bd-e2ef17941f97", 250 | "_uuid": "38b7de27f1f882347de21193d93bf474f96c2288" 251 | }, 252 | "outputs": [], 253 | "source": [ 254 | "# A final multivariate visualization technique pandas has is radviz\n", 255 | "# Which puts each feature as a point on a 2D plane, and then simulates\n", 256 | "# having each sample attached to those points through a spring weighted\n", 257 | "# by the relative value for that feature\n", 258 | "from pandas.tools.plotting import radviz\n", 259 | "radviz(iris.drop(\"Id\", axis=1), \"Species\")" 260 | ] 261 | }, 262 | { 263 | "cell_type": "markdown", 264 | "metadata": { 265 | "_cell_guid": "0263903e-4c3f-41c5-adf6-a1a12c122ddb", 266 | "_uuid": "a47be9b234eb942e71425b3e00b741a41488ea33" 267 | }, 268 | "source": [ 269 | "# Wrapping Up\n", 270 | "\n", 271 | "I hope you enjoyed this quick introduction to some of the quick, simple data visualizations you can create with pandas, seaborn, and matplotlib in Python!\n", 272 | "\n", 273 | "I encourage you to run through these examples yourself, tweaking them and seeing what happens. From there, you can try applying these methods to a new dataset and incorprating them into your own workflow!\n", 274 | "\n", 275 | "See [Kaggle Datasets](https://www.kaggle.com/datasets) for other datasets to try visualizing. The [World Food Facts data](https://www.kaggle.com/openfoodfacts/world-food-facts) is an especially rich one for visualization." 
276 | ]
277 | }
278 | ],
279 | "metadata": {
280 | "kernelspec": {
281 | "display_name": "Python 3",
282 | "language": "python",
283 | "name": "python3"
284 | },
285 | "language_info": {
286 | "codemirror_mode": {
287 | "name": "ipython",
288 | "version": 3
289 | },
290 | "file_extension": ".py",
291 | "mimetype": "text/x-python",
292 | "name": "python",
293 | "nbconvert_exporter": "python",
294 | "pygments_lexer": "ipython3",
295 | "version": "3.6.7"
296 | },
297 | "toc": {
298 | "base_numbering": 1,
299 | "nav_menu": {},
300 | "number_sections": true,
301 | "sideBar": true,
302 | "skip_h1_title": false,
303 | "title_cell": "Table of Contents",
304 | "title_sidebar": "Contents",
305 | "toc_cell": false,
306 | "toc_position": {},
307 | "toc_section_display": true,
308 | "toc_window_display": false
309 | }
310 | },
311 | "nbformat": 4,
312 | "nbformat_minor": 1
313 | }
314 | --------------------------------------------------------------------------------
/初赛/最终程序.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "+ __About:__\n",
8 | "\n",
9 | "This is the final program behind team Drop's 20th-place leaderboard-B submission to the Alibaba Cloud Tianchi competition Jinnan Digital Manufacturing Algorithm Challenge (Arena 1)\n",
10 | "\n",
11 | "Team name: Drop. Author: Tao Yafan, Shaanxi University of Science and Technology\n",
12 | "\n",
13 | "My github and blog: (not much there yet, but follows and stars are appreciated /facepalm)\n",
14 | "\n",
15 | "github: https://github.com/taoyafan\n",
16 | "\n",
17 | "blog: https://me.csdn.net/taoyafan\n",
18 | "\n",
19 | "Teammate's (Blue, University of Electronic Science and Technology of China) github and blog: (follows and stars appreciated as well /facepalm)\n",
20 | "\n",
21 | "github:https://github.com/BluesChang\n",
22 | "\n",
23 | "blog:https://blueschang.github.io\n",
24 | "\n",
25 | "Every part of this program draws heavily on 鱼佬's baseline\n",
26 | "\n",
27 | "Many thanks to my teammate for his many contributions, to 鱼佬 and his baseline, and to 林有夕, from whom I kept learning new things in the group chat.\n",
28 | "\n",
29 | "This was my first ML competition; I started learning pandas, sklearn and the related topics from 鱼佬's baseline, so my skill is really limited. I would appreciate any comments or suggestions; if you find problems or anything that could be improved, please let me know. Many thanks. \n",
30 | "\n",
31 | "I had been meaning to open-source this for a while, but the score was too poor; while it is still on the front page, here it is. I am in this to learn, not to win a prize, so nothing in this program is held back~\n",
32 | "\n",
33 | "+ __Overall flow:__ \n",
34 | "\n",
35 | "Read data -> manually fix obvious anomalies in the training set -> data cleaning -> feature engineering -> training\n",
36 | "\n",
37 | "__Data cleaning:__\n",
38 | "\n",
39 | "Drop features with too high a missing rate -> parse time strings (and time ranges) -> compute time differences -> handle outliers -> drop features dominated by a single category\n",
40 | "\n",
41 | "__Feature engineering:__\n",
42 | "\n",
43 | "Build new features -> predict nan values from correlated features -> backward feature selection\n",
44 | "\n",
45 | "__Training:__\n",
46 | "\n",
47 | "Automatically tune lgb and xgb, then blend the two\n",
48 | "\n",
49 | "+ __Data path:__\n",
50 | "\n",
51 | "Writing on top of 鱼佬's baseline I ended up with too many variable names and kept hitting small errors when re-running, so I restructured the whole thing several times and finally settled on a pipeline covering all of the data cleaning and feature engineering; with fewer variables and decoupled stages, the data path is worth spelling out:\n",
52 | "\n",
53 | "reading the data gives train and test ----> concatenating gives full ---> the pipeline produces pipe_data ---> the train/test split gives X_train and X_test ---> training produces oof and predictions\n",
54 | "\n"
55 | ]
56 | },
57 | {
58 | "cell_type": "markdown",
59 | "metadata": {},
60 | "source": [
61 | "# Imports and data loading"
62 | ]
63 | },
64 | {
65 | "cell_type": "code",
66 | "execution_count": 1,
67 | "metadata": {},
68 | "outputs": [
69 | {
70 | "name": "stderr",
71 | "output_type": "stream",
72 | "text": [
73 | "/usr/local/lib/python3.6/dist-packages/deap/tools/_hypervolume/pyhv.py:33: ImportWarning:\n",
74 | "\n",
75 | "Falling back to the python version of hypervolume module.
Expect this to be very slow.\n", 76 | "\n", 77 | "/usr/lib/python3.6/importlib/_bootstrap.py:219: ImportWarning:\n", 78 | "\n", 79 | "can't resolve package from __spec__ or __package__, falling back on __name__ and __path__\n", 80 | "\n", 81 | "/usr/lib/python3.6/importlib/_bootstrap_external.py:426: ImportWarning:\n", 82 | "\n", 83 | "Not importing directory /usr/local/lib/python3.6/dist-packages/mpl_toolkits: missing __init__\n", 84 | "\n", 85 | "/usr/lib/python3.6/importlib/_bootstrap_external.py:426: ImportWarning:\n", 86 | "\n", 87 | "Not importing directory /usr/local/lib/python3.6/dist-packages/google: missing __init__\n", 88 | "\n" 89 | ] 90 | } 91 | ], 92 | "source": [ 93 | "import numpy as np \n", 94 | "import pandas as pd \n", 95 | "import lightgbm as lgb\n", 96 | "import xgboost as xgb\n", 97 | "from scipy import sparse\n", 98 | "import warnings\n", 99 | "import time\n", 100 | "import sys\n", 101 | "import os\n", 102 | "import re\n", 103 | "import datetime\n", 104 | "import matplotlib.pyplot as plt\n", 105 | "import seaborn as sns\n", 106 | "import plotly.offline as py\n", 107 | "import plotly.graph_objs as go \n", 108 | "import plotly.tools as tls\n", 109 | "from xgboost import XGBRegressor\n", 110 | "from tpot import TPOTRegressor" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": 2, 116 | "metadata": {}, 117 | "outputs": [ 118 | { 119 | "name": "stderr", 120 | "output_type": "stream", 121 | "text": [ 122 | "/usr/lib/python3.6/importlib/_bootstrap.py:219: ImportWarning:\n", 123 | "\n", 124 | "can't resolve package from __spec__ or __package__, falling back on __name__ and __path__\n", 125 | "\n" 126 | ] 127 | } 128 | ], 129 | "source": [ 130 | "from sklearn.model_selection import cross_val_score, GridSearchCV, KFold, RepeatedKFold, ShuffleSplit\n", 131 | "from sklearn.pipeline import Pipeline, make_pipeline\n", 132 | "from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone\n", 133 | "from sklearn.linear_model import LinearRegression\n", 134 | "from sklearn.linear_model import Ridge\n", 135 | "from sklearn.linear_model import Lasso\n", 136 | "from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor\n", 137 | "from sklearn.svm import SVR, LinearSVR\n", 138 | "from sklearn.linear_model import ElasticNet, SGDRegressor, BayesianRidge\n", 139 | "from sklearn.kernel_ridge import KernelRidge\n", 140 | "from sklearn.preprocessing import OneHotEncoder, LabelEncoder\n", 141 | "from sklearn.metrics import mean_squared_error, mean_absolute_error\n", 142 | "from sklearn.metrics import log_loss\n", 143 | "from sklearn.preprocessing import Imputer\n", 144 | "from scipy.stats import skew\n", 145 | "from sklearn.preprocessing import RobustScaler, StandardScaler, MinMaxScaler, Normalizer\n", 146 | "from sklearn.decomposition import PCA, KernelPCA\n", 147 | "from sklearn.model_selection import train_test_split\n", 148 | "from sklearn.model_selection import ParameterGrid" 149 | ] 150 | }, 151 | { 152 | "cell_type": "code", 153 | "execution_count": 3, 154 | "metadata": {}, 155 | "outputs": [ 156 | { 157 | "data": { 158 | "text/html": [ 159 | "" 160 | ], 161 | "text/vnd.plotly.v1+html": [ 162 | "" 163 | ] 164 | }, 165 | "metadata": {}, 166 | "output_type": "display_data" 167 | } 168 | ], 169 | "source": [ 170 | "py.init_notebook_mode(connected=True)\n", 171 | "warnings.simplefilter(action='ignore', category=FutureWarning)\n", 172 | "warnings.filterwarnings(\"ignore\")\n", 173 | 
"pd.set_option('display.max_columns',None)\n", 174 | "pd.set_option('max_colwidth',100)" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "## 设定文件名, 读取文件" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 4, 187 | "metadata": {}, 188 | "outputs": [], 189 | "source": [ 190 | "train_file_name = 'data/jinnan_round1_train_20181227.csv'\n", 191 | "test_file_name = 'data/jinnan_round1_testB_20190121.csv'\n", 192 | "test_name = 'testB'" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": 5, 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [ 201 | "# 读取数据, 改名\n", 202 | "train = pd.read_csv(train_file_name, encoding = 'gb18030')\n", 203 | "test = pd.read_csv(test_file_name, encoding = 'gb18030')\n", 204 | "train.rename(columns={'样本id':'id', '收率':'target'}, inplace = True)\n", 205 | "test.rename(columns={'样本id':'id', '收率':'target'}, inplace = True)\n", 206 | "target_name = 'target'\n", 207 | "\n", 208 | "# 存在异常数据,改为 nan\n", 209 | "train.loc[1304, 'A25'] = np.nan\n", 210 | "train['A25'] = train['A25'].astype(float)\n", 211 | "\n", 212 | "# 去掉 id 前缀\n", 213 | "train['id'] = train['id'].apply(lambda x: int(x.split('_')[1]))\n", 214 | "test['id'] = test['id'].apply(lambda x: int(x.split('_')[1]))\n", 215 | "\n", 216 | "train.drop(train[train[target_name] < 0.87].index, inplace=True)\n", 217 | "full=pd.concat([train, test], ignore_index=True)" 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": {}, 223 | "source": [ 224 | "# 数据清洗" 225 | ] 226 | }, 227 | { 228 | "cell_type": "markdown", 229 | "metadata": {}, 230 | "source": [ 231 | "## 删除缺失率高的特征\n", 232 | "\n", 233 | "+ __删除缺失值大于 th_high 的值__\n", 234 | "+ __缺失值在 th_low 和 th_high 之间的特征根据是否缺失增加新特征__\n", 235 | " \n", 236 | " 如 B10 缺失较高,增加新特征 B10_null,如果缺失为1,否则为0" 237 | ] 238 | }, 239 | { 240 | "cell_type": "code", 241 | "execution_count": 6, 242 | "metadata": {}, 243 | "outputs": [], 244 | "source": [ 245 | "class del_nan_feature(BaseEstimator, TransformerMixin):\n", 246 | " \n", 247 | " def __init__(self, th_high=0.85, th_low=0.02):\n", 248 | " self.th_high = th_high\n", 249 | " self.th_low = th_low\n", 250 | " \n", 251 | " def fit(self, X, y=None):\n", 252 | " return self\n", 253 | " \n", 254 | " def transform(self, X):\n", 255 | " print('-'*30, ' '*5, 'del_nan_feature', ' '*5, '-'*30, '\\n')\n", 256 | " print(\"shape before process = {}\".format(X.shape))\n", 257 | "\n", 258 | " # 删除高缺失率特征\n", 259 | " X.dropna(axis=1, thresh=(1-self.th_high)*X.shape[0], inplace=True)\n", 260 | " \n", 261 | " \n", 262 | " # 缺失率较高,增加新特征\n", 263 | " for col in X.columns:\n", 264 | " if col == 'target':\n", 265 | " continue\n", 266 | " \n", 267 | " miss_rate = X[col].isnull().sum()/ X.shape[0]\n", 268 | " if miss_rate > self.th_low:\n", 269 | " print(\"Missing rate of {} is {:.3f} exceed {}, adding new feature {}\".\n", 270 | " format(col, miss_rate, self.th_low, col+'_null'))\n", 271 | " X[col+'_null'] = 0\n", 272 | " X.loc[X[pd.isnull(X[col])].index, [col+'_null']] = 1\n", 273 | " print(\"shape = {}\".format(X.shape))\n", 274 | "\n", 275 | " return X" 276 | ] 277 | }, 278 | { 279 | "cell_type": "markdown", 280 | "metadata": {}, 281 | "source": [ 282 | "## 处理字符时间(段)" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": 7, 288 | "metadata": {}, 289 | "outputs": [], 290 | "source": [ 291 | "# 处理时间\n", 292 | "def timeTranSecond(t):\n", 293 | " try:\n", 294 | " h,m,s=t.split(\":\")\n", 295 | " except:\n", 296 | "\n", 297 
| " if t=='1900/1/9 7:00':\n", 298 | " return 7*3600/3600\n", 299 | " elif t=='1900/1/1 2:30':\n", 300 | " return (2*3600+30*60)/3600\n", 301 | " elif pd.isnull(t):\n", 302 | " return np.nan\n", 303 | " else:\n", 304 | " return 0\n", 305 | "\n", 306 | " try:\n", 307 | " tm = (int(h)*3600+int(m)*60+int(s))/3600\n", 308 | " except:\n", 309 | " return (30*60)/3600\n", 310 | "\n", 311 | " return tm" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": 8, 317 | "metadata": {}, 318 | "outputs": [], 319 | "source": [ 320 | "# 处理时间差\n", 321 | "def getDuration(se):\n", 322 | " try:\n", 323 | " sh,sm,eh,em=re.findall(r\"\\d+\",se)\n", 324 | "# print(\"sh, sm, eh, em = {}, {}, {}, {}\".format(sh, em, eh, em))\n", 325 | " except:\n", 326 | " if pd.isnull(se):\n", 327 | " return np.nan, np.nan, np.nan\n", 328 | "\n", 329 | " try:\n", 330 | " t_start = (int(sh)*3600 + int(sm)*60)/3600\n", 331 | " t_end = (int(eh)*3600 + int(em)*60)/3600\n", 332 | " \n", 333 | " if t_start > t_end:\n", 334 | " tm = t_end - t_start + 24\n", 335 | " else:\n", 336 | " tm = t_end - t_start\n", 337 | " except:\n", 338 | " if se=='19:-20:05':\n", 339 | " return 19, 20, 1\n", 340 | " elif se=='15:00-1600':\n", 341 | " return 15, 16, 1\n", 342 | " else:\n", 343 | " print(\"se = {}\".format(se))\n", 344 | "\n", 345 | "\n", 346 | " return t_start, t_end, tm" 347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "execution_count": 9, 352 | "metadata": {}, 353 | "outputs": [], 354 | "source": [ 355 | "class handle_time_str(BaseEstimator, TransformerMixin):\n", 356 | " \n", 357 | " def __init__(self):\n", 358 | " pass\n", 359 | " \n", 360 | " def fit(self, X, y=None):\n", 361 | " return self\n", 362 | " \n", 363 | " def transform(self, X):\n", 364 | " print('-'*30, ' '*5, 'handle_time_str', ' '*5, '-'*30, '\\n')\n", 365 | "\n", 366 | " for f in ['A5','A7','A9','A11','A14','A16','A24','A26','B5','B7']:\n", 367 | " try:\n", 368 | " X[f] = X[f].apply(timeTranSecond)\n", 369 | " except:\n", 370 | " print(f,'应该在前面被删除了!')\n", 371 | "\n", 372 | "\n", 373 | " for f in ['A20','A28','B4','B9','B10','B11']:\n", 374 | " try:\n", 375 | " start_end_diff = X[f].apply(getDuration)\n", 376 | " \n", 377 | " X[f+'_start'] = start_end_diff.apply(lambda x: x[0])\n", 378 | " X[f+'_end'] = start_end_diff.apply(lambda x: x[1])\n", 379 | " X[f] = start_end_diff.apply(lambda x: x[2])\n", 380 | "\n", 381 | " except:\n", 382 | " print(f,'应该在前面被删除了!')\n", 383 | " return X" 384 | ] 385 | }, 386 | { 387 | "cell_type": "markdown", 388 | "metadata": {}, 389 | "source": [ 390 | "## 计算时间差" 391 | ] 392 | }, 393 | { 394 | "cell_type": "code", 395 | "execution_count": null, 396 | "metadata": {}, 397 | "outputs": [], 398 | "source": [] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "execution_count": 10, 403 | "metadata": {}, 404 | "outputs": [], 405 | "source": [ 406 | "def t_start_t_end(t):\n", 407 | " if pd.isnull(t[0]) or pd.isnull(t[1]):\n", 408 | "# print(\"t_start = {}, t_end = {}, id = {}\".format(t[0], t[1], t[2]))\n", 409 | " return np.nan\n", 410 | " \n", 411 | " if t[1] < t[0]:\n", 412 | " t[1] += 24\n", 413 | " \n", 414 | " dt = t[1] - t[0]\n", 415 | "\n", 416 | " if(dt > 24 or dt < 0):\n", 417 | "# print(\"dt error, t_start = {}, t_end = {}, id = {}\".format(t[0], t[1], t[2]))\n", 418 | " return np.nan\n", 419 | " \n", 420 | " return dt" 421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": 11, 426 | "metadata": {}, 427 | "outputs": [], 428 | "source": [ 429 | "class calc_time_diff(BaseEstimator, 
TransformerMixin):\n", 430 | " def __init__(self):\n", 431 | " pass\n", 432 | " \n", 433 | " def fit(self, X, y=None):\n", 434 | " return self\n", 435 | " \n", 436 | " def transform(self, X):\n", 437 | " print('-'*30, ' '*5, 'calc_time_diff', ' '*5, '-'*30, '\\n')\n", 438 | "\n", 439 | " # t_start 为时间的开始, tn 为中间的时间,减去 t_start 得到时间差\n", 440 | " t_start = ['A9', 'A24', 'B5']\n", 441 | " tn = {'A9':['A11', 'A14', 'A16'], 'A24':['A26'], 'B5':['B7']}\n", 442 | " \n", 443 | " # 计算时间差\n", 444 | " for t_s in t_start:\n", 445 | " for t_e in tn[t_s]:\n", 446 | " X[t_e+'-'+t_s] = X[[t_s,t_e, target_name]].apply(t_start_t_end, axis=1)\n", 447 | " \n", 448 | " # 所有结果保留 3 位小数\n", 449 | " X = X.apply(lambda x:round(x, 3))\n", 450 | " \n", 451 | " print(\"shape = {}\".format(X.shape))\n", 452 | " \n", 453 | " return X" 454 | ] 455 | }, 456 | { 457 | "cell_type": "markdown", 458 | "metadata": {}, 459 | "source": [ 460 | "## 处理异常值" 461 | ] 462 | }, 463 | { 464 | "cell_type": "markdown", 465 | "metadata": {}, 466 | "source": [ 467 | "+ __单一类别个数小于 threshold 的值视为异常值, 改为 nan__" 468 | ] 469 | }, 470 | { 471 | "cell_type": "code", 472 | "execution_count": 12, 473 | "metadata": {}, 474 | "outputs": [], 475 | "source": [ 476 | "class handle_outliers(BaseEstimator, TransformerMixin):\n", 477 | "\n", 478 | " def __init__(self, threshold=2):\n", 479 | " self.th = threshold\n", 480 | " \n", 481 | " def fit(self, X, y=None):\n", 482 | " return self\n", 483 | " \n", 484 | " def transform(self, X):\n", 485 | " print('-'*30, ' '*5, 'handle_outliers', ' '*5, '-'*30, '\\n')\n", 486 | " category_col = [col for col in X if col not in ['id', 'target']]\n", 487 | " for col in category_col:\n", 488 | " label = X[col].value_counts(dropna=False).index.tolist()\n", 489 | " for i, num in enumerate(X[col].value_counts(dropna=False).values):\n", 490 | " if num <= self.th:\n", 491 | "# print(\"Number of label {} in feature {} is {}\".format(label[i], col, num))\n", 492 | " X.loc[X[col]==label[i], [col]] = np.nan\n", 493 | " \n", 494 | " print(\"shape = {}\".format(X.shape))\n", 495 | " return X" 496 | ] 497 | }, 498 | { 499 | "cell_type": "markdown", 500 | "metadata": {}, 501 | "source": [ 502 | "## 删除单一类别占比过大特征" 503 | ] 504 | }, 505 | { 506 | "cell_type": "code", 507 | "execution_count": 13, 508 | "metadata": {}, 509 | "outputs": [], 510 | "source": [ 511 | "class del_single_feature(BaseEstimator, TransformerMixin):\n", 512 | "\n", 513 | " def __init__(self, threshold=0.98):\n", 514 | " # 删除单一类别占比大于 threshold 的特征\n", 515 | " self.th = threshold\n", 516 | " \n", 517 | " def fit(self, X, y=None):\n", 518 | " return self\n", 519 | " \n", 520 | " def transform(self, X):\n", 521 | " print('-'*30, ' '*5, 'del_single_feature', ' '*5, '-'*30, '\\n')\n", 522 | " category_col = [col for col in X if col not in ['target']]\n", 523 | " \n", 524 | " for col in category_col:\n", 525 | " rate = X[col].value_counts(normalize=True, dropna=False).values[0]\n", 526 | " \n", 527 | " if rate > self.th:\n", 528 | " print(\"{} 的最大类别占比是 {}, drop it\".format(col, rate))\n", 529 | " X.drop(col, axis=1, inplace=True)\n", 530 | "\n", 531 | " print(\"shape = {}\".format(X.shape))\n", 532 | " return X" 533 | ] 534 | }, 535 | { 536 | "cell_type": "markdown", 537 | "metadata": {}, 538 | "source": [ 539 | "# 特征工程" 540 | ] 541 | }, 542 | { 543 | "cell_type": "markdown", 544 | "metadata": {}, 545 | "source": [ 546 | "## 获得训练集与测试集" 547 | ] 548 | }, 549 | { 550 | "cell_type": "code", 551 | "execution_count": 14, 552 | "metadata": {}, 553 | "outputs": [], 554 | "source": [ 
555 | "def split_data(pipe_data, target_name='target'):\n", 556 | " \n", 557 | " # 特征列名\n", 558 | " category_col = [col for col in pipe_data if col not in ['target',target_name]]\n", 559 | " \n", 560 | " # 训练、测试行索引\n", 561 | " train_idx = pipe_data[np.logical_not(pd.isnull(pipe_data[target_name]))].index\n", 562 | " test_idx = pipe_data[pd.isnull(pipe_data[target_name])].index\n", 563 | " \n", 564 | " # 获得 train、test 数据\n", 565 | " X_train = pipe_data.loc[train_idx, category_col].values.astype(np.float64)\n", 566 | " y_train = np.squeeze(pipe_data.loc[train_idx, [target_name]].values.astype(np.float64))\n", 567 | " X_test = pipe_data.loc[test_idx, category_col].values.astype(np.float64)\n", 568 | " \n", 569 | " return X_train, y_train, X_test, test_idx" 570 | ] 571 | }, 572 | { 573 | "cell_type": "markdown", 574 | "metadata": {}, 575 | "source": [ 576 | "## xgb(用于特征 nan 值预测)" 577 | ] 578 | }, 579 | { 580 | "cell_type": "code", 581 | "execution_count": 15, 582 | "metadata": {}, 583 | "outputs": [], 584 | "source": [ 585 | "##### xgb\n", 586 | "def xgb_predict(X_train, y_train, X_test, params=None, verbose_eval=100):\n", 587 | " \n", 588 | " if params == None:\n", 589 | " xgb_params = {'eta': 0.05, 'max_depth': 10, 'subsample': 0.8, 'colsample_bytree': 0.8, \n", 590 | " 'objective': 'reg:linear', 'eval_metric': 'rmse', 'silent': True, 'nthread': 4}\n", 591 | " else:\n", 592 | " xgb_params = params\n", 593 | "\n", 594 | " folds = KFold(n_splits=10, shuffle=True, random_state=2018)\n", 595 | " oof_xgb = np.zeros(len(X_train))\n", 596 | " predictions_xgb = np.zeros(len(X_test))\n", 597 | "\n", 598 | " for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):\n", 599 | " if(verbose_eval):\n", 600 | " print(\"fold n°{}\".format(fold_+1))\n", 601 | " print(\"len trn_idx {}\".format(len(trn_idx)))\n", 602 | " \n", 603 | " trn_data = xgb.DMatrix(X_train[trn_idx], y_train[trn_idx])\n", 604 | " val_data = xgb.DMatrix(X_train[val_idx], y_train[val_idx])\n", 605 | "\n", 606 | " watchlist = [(trn_data, 'train'), (val_data, 'valid_data')]\n", 607 | " clf = xgb.train(dtrain=trn_data,\n", 608 | " num_boost_round=20000,\n", 609 | " evals=watchlist,\n", 610 | " early_stopping_rounds=200,\n", 611 | " verbose_eval=verbose_eval,\n", 612 | " params=xgb_params)\n", 613 | " \n", 614 | " \n", 615 | " oof_xgb[val_idx] = clf.predict(xgb.DMatrix(X_train[val_idx]), ntree_limit=clf.best_ntree_limit)\n", 616 | " predictions_xgb += clf.predict(xgb.DMatrix(X_test), ntree_limit=clf.best_ntree_limit) / folds.n_splits\n", 617 | "\n", 618 | " if(verbose_eval):\n", 619 | " print(\"CV score: {:<8.8f}\".format(mean_squared_error(oof_xgb, y_train)))\n", 620 | " return oof_xgb, predictions_xgb" 621 | ] 622 | }, 623 | { 624 | "cell_type": "markdown", 625 | "metadata": {}, 626 | "source": [ 627 | "## 根据 B14 构建新特征" 628 | ] 629 | }, 630 | { 631 | "cell_type": "code", 632 | "execution_count": 16, 633 | "metadata": {}, 634 | "outputs": [], 635 | "source": [ 636 | "class add_new_features(BaseEstimator, TransformerMixin):\n", 637 | "\n", 638 | " def __init__(self):\n", 639 | " pass\n", 640 | " \n", 641 | " def fit(self, X, y=None):\n", 642 | " return self\n", 643 | "\n", 644 | " def transform(self, X):\n", 645 | " print('-'*30, ' '*5, 'add_new_features', ' '*5, '-'*30, '\\n')\n", 646 | "\n", 647 | " # 经过测试,只有 B14 / B12 有用\n", 648 | " \n", 649 | "# X['B14/A1'] = X['B14'] / X['A1']\n", 650 | "# X['B14/A3'] = X['B14'] / X['A3']\n", 651 | "# X['B14/A4'] = X['B14'] / X['A4']\n", 652 | "# X['B14/A19'] = X['B14'] / X['A19']\n", 
653 | "# X['B14/B1'] = X['B14'] / X['B1']\n", 654 | "# X['B14/B9'] = X['B14'] / X['B9']\n", 655 | "\n", 656 | " X['B14/B12'] = X['B14'] / X['B12']\n", 657 | " \n", 658 | " print(\"shape = {}\".format(X.shape))\n", 659 | " return X" 660 | ] 661 | }, 662 | { 663 | "cell_type": "markdown", 664 | "metadata": {}, 665 | "source": [ 666 | "## 选择特征, nan 值填充\n", 667 | "\n", 668 | "+ __选择可能有效的特征__ (只是为了加快选择时间)\n", 669 | "\n", 670 | "+ __利用其他特征预测 nan,取最近值填充__" 671 | ] 672 | }, 673 | { 674 | "cell_type": "code", 675 | "execution_count": 17, 676 | "metadata": {}, 677 | "outputs": [], 678 | "source": [ 679 | "def get_closest(indexes, predicts):\n", 680 | " print(\"From {}\".format(predicts))\n", 681 | "\n", 682 | " for i, one in enumerate(predicts):\n", 683 | " predicts[i] = indexes[np.argsort(abs(indexes - one))[0]]\n", 684 | "\n", 685 | " print(\"To {}\".format(predicts))\n", 686 | " return predicts\n", 687 | " \n", 688 | "\n", 689 | "def value_select_eval(pipe_data, selected_features):\n", 690 | " \n", 691 | " # 经过多次测试, 只选择可能是有用的特征\n", 692 | " cols_with_nan = [col for col in pipe_data.columns \n", 693 | " if pipe_data[col].isnull().sum()>0 and col in selected_features]\n", 694 | "\n", 695 | " for col in cols_with_nan:\n", 696 | " X_train, y_train, X_test, test_idx = split_data(pipe_data, target_name=col)\n", 697 | " oof_xgb, predictions_xgb = xgb_predict(X_train, y_train, X_test, verbose_eval=False)\n", 698 | " \n", 699 | " print(\"-\"*100, end=\"\\n\\n\")\n", 700 | " print(\"CV normal MAE scores of predicting {} is {}\".\n", 701 | " format(col, mean_absolute_error(oof_xgb, y_train)/np.mean(y_train)))\n", 702 | " \n", 703 | " pipe_data.loc[test_idx, [col]] = get_closest(pipe_data[col].value_counts().index,\n", 704 | " predictions_xgb)\n", 705 | "\n", 706 | " pipe_data = pipe_data[selected_features+['target']]\n", 707 | "\n", 708 | " return pipe_data\n", 709 | "\n", 710 | "# pipe_data = value_eval(pipe_data)" 711 | ] 712 | }, 713 | { 714 | "cell_type": "code", 715 | "execution_count": 18, 716 | "metadata": {}, 717 | "outputs": [], 718 | "source": [ 719 | "class selsected_fill_nans(BaseEstimator, TransformerMixin):\n", 720 | "\n", 721 | " def __init__(self, selected_features = ['A3_null', 'A6', 'A16', 'A25', 'A28', 'A28_end',\n", 722 | " 'B5', 'B10_null', 'B11_null', 'B14', 'B14/B12', 'id']):\n", 723 | " self.selected_fearutes = selected_features\n", 724 | " pass\n", 725 | " \n", 726 | " def fit(self, X, y=None):\n", 727 | " return self\n", 728 | " \n", 729 | " def transform(self, X):\n", 730 | " print('-'*30, ' '*5, 'selsected_fill_nans', ' '*5, '-'*30, '\\n')\n", 731 | "\n", 732 | " X = value_select_eval(X, self.selected_fearutes)\n", 733 | "\n", 734 | " print(\"shape = {}\".format(X.shape))\n", 735 | " return X" 736 | ] 737 | }, 738 | { 739 | "cell_type": "code", 740 | "execution_count": 19, 741 | "metadata": {}, 742 | "outputs": [], 743 | "source": [ 744 | "def modeling_cross_validation(data):\n", 745 | " X_train, y_train, X_test, test_idx = split_data(data,\n", 746 | " target_name='target')\n", 747 | " oof_xgb, _ = xgb_predict(X_train, y_train, X_test, verbose_eval=False)\n", 748 | " print('-'*100, end='\\n\\n')\n", 749 | " return mean_squared_error(oof_xgb, y_train)\n", 750 | "\n", 751 | "\n", 752 | "def featureSelect(data):\n", 753 | "\n", 754 | " init_cols = [f for f in data.columns if f not in ['target']]\n", 755 | " best_cols = init_cols.copy()\n", 756 | " best_score = modeling_cross_validation(data[best_cols+['target']])\n", 757 | " print(\"初始 CV score: {:<8.8f}\".format(best_score))\n", 758 
| "\n", 759 | " for col in init_cols:\n", 760 | " best_cols.remove(col)\n", 761 | " score = modeling_cross_validation(data[best_cols+['target']])\n", 762 | " print(\"当前选择特征: {}, CV score: {:<8.8f}, 最佳cv score: {:<8.8f}\".\n", 763 | " format(col, score, best_score), end=\" \")\n", 764 | " \n", 765 | " if best_score - score > 0.0000004:\n", 766 | " best_score = score\n", 767 | " print(\"有效果,删除!!!!\")\n", 768 | " else:\n", 769 | " best_cols.append(col)\n", 770 | " print(\"保留\")\n", 771 | "\n", 772 | " print('-'*100)\n", 773 | " print(\"优化后 CV score: {:<8.8f}\".format(best_score))\n", 774 | " return best_cols, best_score" 775 | ] 776 | }, 777 | { 778 | "cell_type": "markdown", 779 | "metadata": {}, 780 | "source": [ 781 | "## 后向选择特征" 782 | ] 783 | }, 784 | { 785 | "cell_type": "code", 786 | "execution_count": 20, 787 | "metadata": {}, 788 | "outputs": [], 789 | "source": [ 790 | "class select_feature(BaseEstimator, TransformerMixin):\n", 791 | "\n", 792 | " def __init__(self, init_features = None):\n", 793 | " self.init_features = init_features\n", 794 | " pass\n", 795 | " \n", 796 | " def fit(self, X, y=None):\n", 797 | " return self\n", 798 | " \n", 799 | " def transform(self, X):\n", 800 | " print('-'*30, ' '*5, 'select_feature', ' '*5, '-'*30, '\\n')\n", 801 | " \n", 802 | " if self.init_features:\n", 803 | " X = X[self.init_features + ['target']]\n", 804 | " best_features = self.init_features\n", 805 | " else:\n", 806 | " best_features = [col for col in X.columns]\n", 807 | " \n", 808 | " last_feartues = []\n", 809 | " iteration = 0\n", 810 | " equal_time = 0\n", 811 | " \n", 812 | " best_CV = 1\n", 813 | " best_CV_feature = []\n", 814 | " \n", 815 | " # 打乱顺序,但是使用相同种子,保证每次运行结果相同\n", 816 | " np.random.seed(2018)\n", 817 | " while True:\n", 818 | " print(\"Iteration = {}\\n\".format(iteration))\n", 819 | " best_features, score = featureSelect(X[best_features + ['target']])\n", 820 | " \n", 821 | " # 保存最优 CV 的参数\n", 822 | " if score < best_CV:\n", 823 | " best_CV = score\n", 824 | " best_CV_feature = best_features\n", 825 | " print(\"Found best score :{}, with features :{}\".format(best_CV, best_features))\n", 826 | " \n", 827 | " np.random.shuffle(best_features)\n", 828 | " print(\"\\nCurrent fearure length = {}\".format(len(best_features)))\n", 829 | " \n", 830 | " # 最终 3 次迭代相同,则终止迭代\n", 831 | " if len(best_features) == len(last_feartues):\n", 832 | " equal_time += 1\n", 833 | " if equal_time == 3:\n", 834 | " break\n", 835 | " else:\n", 836 | " equal_time = 0\n", 837 | " \n", 838 | " last_feartues = best_features\n", 839 | " iteration = iteration + 1\n", 840 | "\n", 841 | " print(\"\\n\\n\\n\")\n", 842 | " \n", 843 | " return X[best_features + ['target']]\n" 844 | ] 845 | }, 846 | { 847 | "cell_type": "markdown", 848 | "metadata": {}, 849 | "source": [ 850 | "# 训练" 851 | ] 852 | }, 853 | { 854 | "cell_type": "markdown", 855 | "metadata": {}, 856 | "source": [ 857 | "## 构建 pipeline, 处理数据" 858 | ] 859 | }, 860 | { 861 | "cell_type": "code", 862 | "execution_count": 21, 863 | "metadata": {}, 864 | "outputs": [ 865 | { 866 | "name": "stdout", 867 | "output_type": "stream", 868 | "text": [ 869 | "------------------------------ del_nan_feature ------------------------------ \n", 870 | "\n", 871 | "shape before process = (1532, 44)\n", 872 | "Missing rate of A3 is 0.029 exceed 0.02, adding new feature A3_null\n", 873 | "Missing rate of B10 is 0.172 exceed 0.02, adding new feature B10_null\n", 874 | "Missing rate of B11 is 0.597 exceed 0.02, adding new feature B11_null\n", 875 | "shape = (1532, 
44)\n", 876 | "------------------------------ handle_time_str ------------------------------ \n", 877 | "\n", 878 | "A7 应该在前面被删除了!\n", 879 | "------------------------------ calc_time_diff ------------------------------ \n", 880 | "\n", 881 | "shape = (1532, 61)\n", 882 | "------------------------------ handle_outliers ------------------------------ \n", 883 | "\n", 884 | "shape = (1532, 61)\n", 885 | "------------------------------ del_single_feature ------------------------------ \n", 886 | "\n", 887 | "shape = (1532, 61)\n", 888 | "------------------------------ add_new_features ------------------------------ \n", 889 | "\n", 890 | "shape = (1532, 62)\n", 891 | "------------------------------ selsected_fill_nans ------------------------------ \n", 892 | "\n", 893 | "----------------------------------------------------------------------------------------------------\n", 894 | "\n", 895 | "CV normal MAE scores of predicting A16 is 0.006573812182966658\n", 896 | "From [2.82036011 2.0116301 2.78110905 2.24891324 2.64147919 2.61220436\n", 897 | " 2.5752665 3.00300634 2.9279013 2.84588192 2.99439096 3.15101939\n", 898 | " 1.30144072 3.0146347 2.44444267 2.60455203 2.70574424 2.75994805\n", 899 | " 2.58867866 3.00614175 2.78697994 2.03778946 2.69046123 2.72509097\n", 900 | " 2.03607538 2.52129808 2.99479207 2.92738628 2.41858149 2.70892806\n", 901 | " 2.80188948 2.75916436 2.00558983 2.99666125 3.02267092 2.11280097\n", 902 | " 2.88487023 2.52905945 3.2504842 2.92606165 2.52358037 2.57779263\n", 903 | " 2.58069354 2.91890304 2.9953025 2.49374625 2.68844172 2.45054981\n", 904 | " 3.02282879 2.01016228]\n", 905 | "To [3. 2. 3. 2. 2.5 2.5 2.5 3. 3. 3. 3. 3. 1.5 3. 2.5 2.5 2.5 3.\n", 906 | " 2.5 3. 3. 2. 2.5 2.5 2. 2.5 3. 3. 2.5 2.5 3. 3. 2. 3. 3. 2.\n", 907 | " 3. 2.5 3.5 3. 2.5 2.5 2.5 3. 3. 2.5 2.5 2.5 3. 2. ]\n", 908 | "----------------------------------------------------------------------------------------------------\n", 909 | "\n", 910 | "CV normal MAE scores of predicting A25 is 0.006985261873501027\n", 911 | "From [74.69595861 76.41018534 79.13030624 80.63013458 83.78137398 69.97030067\n", 912 | " 79.99184084 81.28120136]\n", 913 | "To [75. 76. 79. 80. 80. 70. 80. 80.]\n", 914 | "----------------------------------------------------------------------------------------------------\n", 915 | "\n", 916 | "CV normal MAE scores of predicting A28 is 0.037333278007722834\n", 917 | "From [1.16017157 0.60625793 1.06248447 0.79223084 0.98663072 0.9867671\n", 918 | " 0.93546665 0.80992338 0.5366797 1.07250487 0.89877572]\n", 919 | "To [1.167 0.667 1. 0.667 1. 1. 1. 0.667 0.5 1. 1. ]\n", 920 | "----------------------------------------------------------------------------------------------------\n", 921 | "\n", 922 | "CV normal MAE scores of predicting A6 is 0.07248546276204286\n", 923 | "From [38.14284706 37.4052155 28.88562822 32.06546164 32.16541529 36.24718952\n", 924 | " 36.2260437 22.94354749 39.25126076 24.79291415 32.7412734 35.56525469\n", 925 | " 33.02388406 25.72406816]\n", 926 | "To [38. 37. 29. 32. 32. 36. 36. 23. 39. 25. 33. 36. 33. 26.]\n", 927 | "----------------------------------------------------------------------------------------------------\n", 928 | "\n", 929 | "CV normal MAE scores of predicting B14 is 0.001297790416243159\n", 930 | "From [400.24114609 402.01158142 400.45770264 401.50468063 402.03158188\n", 931 | " 337.98728943 402.02846909 401.99261093 402.0224762 341.76803589\n", 932 | " 400.6063652 ]\n", 933 | "To [400. 400. 400. 400. 400. 340. 400. 400. 400. 340. 
400.]\n", 934 | "----------------------------------------------------------------------------------------------------\n", 935 | "\n", 936 | "CV normal MAE scores of predicting B5 is 0.016972058774115534\n", 937 | "From [15.12159047 15.72788435 14.3109533 19.78048182 14.65058997 14.53606975\n", 938 | " 15.33118927 14.2788609 13.99365219 14.93855688 14.00136399 14.85706538\n", 939 | " 16.71887732 21.97685862 13.37160268 15.54346603 14.7121506 ]\n", 940 | "To [15. 15.5 14.5 20. 14.5 14.5 15.5 14.5 14. 15. 14. 15. 16.5 22.\n", 941 | " 13.5 15.5 14.5]\n", 942 | "----------------------------------------------------------------------------------------------------\n", 943 | "\n", 944 | "CV normal MAE scores of predicting A28_end is 0.012402601510432685\n", 945 | "From [13.98934126 9.29064429 0.80129172 13.6963979 10.73074675 15.34772384\n", 946 | " 8.8364796 16.79893827 15.48608887 15.33091271 19.34231484 12.45444262\n", 947 | " 15.49709904 14.68686521 15.09787071 15.34401202 17.98878217 10.97875953\n", 948 | " 10.45144868 10.7770443 10.52596968 12.5426327 14.00931859 16.79731178\n", 949 | " 14.87435162 20.49663401 14.73405266 17.8244971 5.21450764 14.84811437]\n", 950 | "To [14. 9. 1. 13.5 10.5 15.5 9. 17. 15.5 15.5 19.5 12.5 15.5 14.5\n", 951 | " 15. 15.5 18. 11. 10.5 11. 10.5 12.5 14. 17. 15. 20.5 14.5 18.\n", 952 | " 5. 15. ]\n", 953 | "----------------------------------------------------------------------------------------------------\n", 954 | "\n", 955 | "CV normal MAE scores of predicting B14/B12 is 0.002980983626315855\n", 956 | "From [0.44779329 0.49992409 0.50056992 0.3333493 0.49980065 0.81825209\n", 957 | " 0.49989815 0.49929611 0.76591279 0.50158779 0.80235612 0.4996887 ]\n", 958 | "To [0.44444444 0.5 0.5 0.33333333 0.5 0.85\n", 959 | " 0.5 0.5 0.7 0.5 0.85 0.5 ]\n", 960 | "shape = (1532, 13)\n", 961 | "------------------------------ select_feature ------------------------------ \n", 962 | "\n", 963 | "Iteration = 0\n", 964 | "\n", 965 | "----------------------------------------------------------------------------------------------------\n", 966 | "\n", 967 | "初始 CV score: 0.00011896\n", 968 | "----------------------------------------------------------------------------------------------------\n", 969 | "\n", 970 | "当前选择特征: A3_null, CV score: 0.00011909, 最佳cv score: 0.00011896 保留\n", 971 | "----------------------------------------------------------------------------------------------------\n", 972 | "\n", 973 | "当前选择特征: A6, CV score: 0.00012150, 最佳cv score: 0.00011896 保留\n", 974 | "----------------------------------------------------------------------------------------------------\n", 975 | "\n", 976 | "当前选择特征: A16, CV score: 0.00011866, 最佳cv score: 0.00011896 保留\n", 977 | "----------------------------------------------------------------------------------------------------\n", 978 | "\n", 979 | "当前选择特征: A25, CV score: 0.00011865, 最佳cv score: 0.00011896 保留\n", 980 | "----------------------------------------------------------------------------------------------------\n", 981 | "\n", 982 | "当前选择特征: A28, CV score: 0.00011749, 最佳cv score: 0.00011896 有效果,删除!!!!\n", 983 | "----------------------------------------------------------------------------------------------------\n", 984 | "\n", 985 | "当前选择特征: A28_end, CV score: 0.00011743, 最佳cv score: 0.00011749 保留\n", 986 | "----------------------------------------------------------------------------------------------------\n", 987 | "\n", 988 | "当前选择特征: B5, CV score: 0.00011818, 最佳cv score: 0.00011749 保留\n", 989 | 
"----------------------------------------------------------------------------------------------------\n", 990 | "\n", 991 | "当前选择特征: B10_null, CV score: 0.00011870, 最佳cv score: 0.00011749 保留\n", 992 | "----------------------------------------------------------------------------------------------------\n", 993 | "\n", 994 | "当前选择特征: B11_null, CV score: 0.00011940, 最佳cv score: 0.00011749 保留\n", 995 | "----------------------------------------------------------------------------------------------------\n", 996 | "\n", 997 | "当前选择特征: B14, CV score: 0.00012126, 最佳cv score: 0.00011749 保留\n", 998 | "----------------------------------------------------------------------------------------------------\n", 999 | "\n", 1000 | "当前选择特征: B14/B12, CV score: 0.00012029, 最佳cv score: 0.00011749 保留\n", 1001 | "----------------------------------------------------------------------------------------------------\n", 1002 | "\n", 1003 | "当前选择特征: id, CV score: 0.00018189, 最佳cv score: 0.00011749 保留\n", 1004 | "----------------------------------------------------------------------------------------------------\n", 1005 | "优化后 CV score: 0.00011749\n", 1006 | "Found best score :0.00011748851254246741, with features :['A3_null', 'A6', 'A16', 'A25', 'A28_end', 'B5', 'B10_null', 'B11_null', 'B14', 'B14/B12', 'id']\n", 1007 | "\n", 1008 | "Current fearure length = 11\n", 1009 | "\n", 1010 | "\n", 1011 | "\n", 1012 | "\n", 1013 | "Iteration = 1\n", 1014 | "\n", 1015 | "----------------------------------------------------------------------------------------------------\n", 1016 | "\n", 1017 | "初始 CV score: 0.00011888\n", 1018 | "----------------------------------------------------------------------------------------------------\n", 1019 | "\n", 1020 | "当前选择特征: A3_null, CV score: 0.00011882, 最佳cv score: 0.00011888 保留\n", 1021 | "----------------------------------------------------------------------------------------------------\n", 1022 | "\n", 1023 | "当前选择特征: A25, CV score: 0.00011935, 最佳cv score: 0.00011888 保留\n", 1024 | "----------------------------------------------------------------------------------------------------\n", 1025 | "\n", 1026 | "当前选择特征: B11_null, CV score: 0.00011808, 最佳cv score: 0.00011888 有效果,删除!!!!\n", 1027 | "----------------------------------------------------------------------------------------------------\n", 1028 | "\n", 1029 | "当前选择特征: B14, CV score: 0.00012059, 最佳cv score: 0.00011808 保留\n" 1030 | ] 1031 | }, 1032 | { 1033 | "name": "stdout", 1034 | "output_type": "stream", 1035 | "text": [ 1036 | "----------------------------------------------------------------------------------------------------\n", 1037 | "\n", 1038 | "当前选择特征: B14/B12, CV score: 0.00012291, 最佳cv score: 0.00011808 保留\n", 1039 | "----------------------------------------------------------------------------------------------------\n", 1040 | "\n", 1041 | "当前选择特征: A28_end, CV score: 0.00011717, 最佳cv score: 0.00011808 有效果,删除!!!!\n", 1042 | "----------------------------------------------------------------------------------------------------\n", 1043 | "\n", 1044 | "当前选择特征: B5, CV score: 0.00011785, 最佳cv score: 0.00011717 保留\n", 1045 | "----------------------------------------------------------------------------------------------------\n", 1046 | "\n", 1047 | "当前选择特征: A6, CV score: 0.00012231, 最佳cv score: 0.00011717 保留\n", 1048 | "----------------------------------------------------------------------------------------------------\n", 1049 | "\n", 1050 | "当前选择特征: A16, CV score: 0.00011826, 最佳cv score: 0.00011717 保留\n", 1051 | 
"----------------------------------------------------------------------------------------------------\n", 1052 | "\n", 1053 | "当前选择特征: B10_null, CV score: 0.00011966, 最佳cv score: 0.00011717 保留\n", 1054 | "----------------------------------------------------------------------------------------------------\n", 1055 | "\n", 1056 | "当前选择特征: id, CV score: 0.00018469, 最佳cv score: 0.00011717 保留\n", 1057 | "----------------------------------------------------------------------------------------------------\n", 1058 | "优化后 CV score: 0.00011717\n", 1059 | "Found best score :0.00011716922906431079, with features :['A3_null', 'A25', 'B14', 'B14/B12', 'B5', 'A6', 'A16', 'B10_null', 'id']\n", 1060 | "\n", 1061 | "Current fearure length = 9\n", 1062 | "\n", 1063 | "\n", 1064 | "\n", 1065 | "\n", 1066 | "Iteration = 2\n", 1067 | "\n", 1068 | "----------------------------------------------------------------------------------------------------\n", 1069 | "\n", 1070 | "初始 CV score: 0.00011730\n", 1071 | "----------------------------------------------------------------------------------------------------\n", 1072 | "\n", 1073 | "当前选择特征: A6, CV score: 0.00012321, 最佳cv score: 0.00011730 保留\n", 1074 | "----------------------------------------------------------------------------------------------------\n", 1075 | "\n", 1076 | "当前选择特征: A3_null, CV score: 0.00011892, 最佳cv score: 0.00011730 保留\n", 1077 | "----------------------------------------------------------------------------------------------------\n", 1078 | "\n", 1079 | "当前选择特征: B14, CV score: 0.00012037, 最佳cv score: 0.00011730 保留\n", 1080 | "----------------------------------------------------------------------------------------------------\n", 1081 | "\n", 1082 | "当前选择特征: A16, CV score: 0.00011723, 最佳cv score: 0.00011730 保留\n", 1083 | "----------------------------------------------------------------------------------------------------\n", 1084 | "\n", 1085 | "当前选择特征: B5, CV score: 0.00011902, 最佳cv score: 0.00011730 保留\n", 1086 | "----------------------------------------------------------------------------------------------------\n", 1087 | "\n", 1088 | "当前选择特征: A25, CV score: 0.00011825, 最佳cv score: 0.00011730 保留\n", 1089 | "----------------------------------------------------------------------------------------------------\n", 1090 | "\n", 1091 | "当前选择特征: id, CV score: 0.00018102, 最佳cv score: 0.00011730 保留\n", 1092 | "----------------------------------------------------------------------------------------------------\n", 1093 | "\n", 1094 | "当前选择特征: B14/B12, CV score: 0.00012256, 最佳cv score: 0.00011730 保留\n", 1095 | "----------------------------------------------------------------------------------------------------\n", 1096 | "\n", 1097 | "当前选择特征: B10_null, CV score: 0.00011870, 最佳cv score: 0.00011730 保留\n", 1098 | "----------------------------------------------------------------------------------------------------\n", 1099 | "优化后 CV score: 0.00011730\n", 1100 | "\n", 1101 | "Current fearure length = 9\n", 1102 | "\n", 1103 | "\n", 1104 | "\n", 1105 | "\n", 1106 | "Iteration = 3\n", 1107 | "\n", 1108 | "----------------------------------------------------------------------------------------------------\n", 1109 | "\n", 1110 | "初始 CV score: 0.00011678\n", 1111 | "----------------------------------------------------------------------------------------------------\n", 1112 | "\n", 1113 | "当前选择特征: A25, CV score: 0.00011825, 最佳cv score: 0.00011678 保留\n", 1114 | 
"----------------------------------------------------------------------------------------------------\n", 1115 | "\n", 1116 | "当前选择特征: B10_null, CV score: 0.00011825, 最佳cv score: 0.00011678 保留\n", 1117 | "----------------------------------------------------------------------------------------------------\n", 1118 | "\n", 1119 | "当前选择特征: B5, CV score: 0.00011854, 最佳cv score: 0.00011678 保留\n", 1120 | "----------------------------------------------------------------------------------------------------\n", 1121 | "\n", 1122 | "当前选择特征: A16, CV score: 0.00011864, 最佳cv score: 0.00011678 保留\n", 1123 | "----------------------------------------------------------------------------------------------------\n", 1124 | "\n", 1125 | "当前选择特征: id, CV score: 0.00018284, 最佳cv score: 0.00011678 保留\n", 1126 | "----------------------------------------------------------------------------------------------------\n", 1127 | "\n", 1128 | "当前选择特征: B14, CV score: 0.00012127, 最佳cv score: 0.00011678 保留\n", 1129 | "----------------------------------------------------------------------------------------------------\n", 1130 | "\n", 1131 | "当前选择特征: A6, CV score: 0.00012289, 最佳cv score: 0.00011678 保留\n", 1132 | "----------------------------------------------------------------------------------------------------\n", 1133 | "\n", 1134 | "当前选择特征: B14/B12, CV score: 0.00012132, 最佳cv score: 0.00011678 保留\n", 1135 | "----------------------------------------------------------------------------------------------------\n", 1136 | "\n", 1137 | "当前选择特征: A3_null, CV score: 0.00011959, 最佳cv score: 0.00011678 保留\n", 1138 | "----------------------------------------------------------------------------------------------------\n", 1139 | "优化后 CV score: 0.00011678\n", 1140 | "Found best score :0.00011677842465073222, with features :['A25', 'B10_null', 'B5', 'A16', 'id', 'B14', 'A6', 'B14/B12', 'A3_null']\n", 1141 | "\n", 1142 | "Current fearure length = 9\n", 1143 | "\n", 1144 | "\n", 1145 | "\n", 1146 | "\n", 1147 | "Iteration = 4\n", 1148 | "\n", 1149 | "----------------------------------------------------------------------------------------------------\n", 1150 | "\n", 1151 | "初始 CV score: 0.00011833\n", 1152 | "----------------------------------------------------------------------------------------------------\n", 1153 | "\n", 1154 | "当前选择特征: B14, CV score: 0.00012009, 最佳cv score: 0.00011833 保留\n", 1155 | "----------------------------------------------------------------------------------------------------\n", 1156 | "\n", 1157 | "当前选择特征: B14/B12, CV score: 0.00012129, 最佳cv score: 0.00011833 保留\n", 1158 | "----------------------------------------------------------------------------------------------------\n", 1159 | "\n", 1160 | "当前选择特征: A3_null, CV score: 0.00011953, 最佳cv score: 0.00011833 保留\n", 1161 | "----------------------------------------------------------------------------------------------------\n", 1162 | "\n", 1163 | "当前选择特征: A6, CV score: 0.00012132, 最佳cv score: 0.00011833 保留\n", 1164 | "----------------------------------------------------------------------------------------------------\n", 1165 | "\n", 1166 | "当前选择特征: B5, CV score: 0.00011791, 最佳cv score: 0.00011833 有效果,删除!!!!\n", 1167 | "----------------------------------------------------------------------------------------------------\n", 1168 | "\n", 1169 | "当前选择特征: A16, CV score: 0.00012081, 最佳cv score: 0.00011791 保留\n", 1170 | "----------------------------------------------------------------------------------------------------\n", 1171 | "\n", 1172 | "当前选择特征: B10_null, 
CV score: 0.00011889, 最佳cv score: 0.00011791 保留\n", 1173 | "----------------------------------------------------------------------------------------------------\n", 1174 | "\n", 1175 | "当前选择特征: A25, CV score: 0.00012040, 最佳cv score: 0.00011791 保留\n", 1176 | "----------------------------------------------------------------------------------------------------\n", 1177 | "\n", 1178 | "当前选择特征: id, CV score: 0.00018179, 最佳cv score: 0.00011791 保留\n", 1179 | "----------------------------------------------------------------------------------------------------\n", 1180 | "优化后 CV score: 0.00011791\n", 1181 | "\n", 1182 | "Current fearure length = 8\n", 1183 | "\n", 1184 | "\n", 1185 | "\n", 1186 | "\n", 1187 | "Iteration = 5\n", 1188 | "\n", 1189 | "----------------------------------------------------------------------------------------------------\n", 1190 | "\n", 1191 | "初始 CV score: 0.00011747\n", 1192 | "----------------------------------------------------------------------------------------------------\n", 1193 | "\n", 1194 | "当前选择特征: id, CV score: 0.00018279, 最佳cv score: 0.00011747 保留\n", 1195 | "----------------------------------------------------------------------------------------------------\n", 1196 | "\n", 1197 | "当前选择特征: A25, CV score: 0.00012058, 最佳cv score: 0.00011747 保留\n", 1198 | "----------------------------------------------------------------------------------------------------\n", 1199 | "\n", 1200 | "当前选择特征: A6, CV score: 0.00012409, 最佳cv score: 0.00011747 保留\n", 1201 | "----------------------------------------------------------------------------------------------------\n", 1202 | "\n", 1203 | "当前选择特征: B14/B12, CV score: 0.00012101, 最佳cv score: 0.00011747 保留\n", 1204 | "----------------------------------------------------------------------------------------------------\n", 1205 | "\n", 1206 | "当前选择特征: A3_null, CV score: 0.00011944, 最佳cv score: 0.00011747 保留\n", 1207 | "----------------------------------------------------------------------------------------------------\n", 1208 | "\n", 1209 | "当前选择特征: A16, CV score: 0.00012004, 最佳cv score: 0.00011747 保留\n", 1210 | "----------------------------------------------------------------------------------------------------\n", 1211 | "\n", 1212 | "当前选择特征: B10_null, CV score: 0.00011840, 最佳cv score: 0.00011747 保留\n" 1213 | ] 1214 | }, 1215 | { 1216 | "name": "stdout", 1217 | "output_type": "stream", 1218 | "text": [ 1219 | "----------------------------------------------------------------------------------------------------\n", 1220 | "\n", 1221 | "当前选择特征: B14, CV score: 0.00012226, 最佳cv score: 0.00011747 保留\n", 1222 | "----------------------------------------------------------------------------------------------------\n", 1223 | "优化后 CV score: 0.00011747\n", 1224 | "\n", 1225 | "Current fearure length = 8\n", 1226 | "\n", 1227 | "\n", 1228 | "\n", 1229 | "\n", 1230 | "Iteration = 6\n", 1231 | "\n", 1232 | "----------------------------------------------------------------------------------------------------\n", 1233 | "\n", 1234 | "初始 CV score: 0.00011715\n", 1235 | "----------------------------------------------------------------------------------------------------\n", 1236 | "\n", 1237 | "当前选择特征: id, CV score: 0.00018398, 最佳cv score: 0.00011715 保留\n", 1238 | "----------------------------------------------------------------------------------------------------\n", 1239 | "\n", 1240 | "当前选择特征: A3_null, CV score: 0.00011984, 最佳cv score: 0.00011715 保留\n", 1241 | 
"----------------------------------------------------------------------------------------------------\n", 1242 | "\n", 1243 | "当前选择特征: A25, CV score: 0.00012132, 最佳cv score: 0.00011715 保留\n", 1244 | "----------------------------------------------------------------------------------------------------\n", 1245 | "\n", 1246 | "当前选择特征: B14/B12, CV score: 0.00012198, 最佳cv score: 0.00011715 保留\n", 1247 | "----------------------------------------------------------------------------------------------------\n", 1248 | "\n", 1249 | "当前选择特征: B10_null, CV score: 0.00011808, 最佳cv score: 0.00011715 保留\n", 1250 | "----------------------------------------------------------------------------------------------------\n", 1251 | "\n", 1252 | "当前选择特征: A16, CV score: 0.00011993, 最佳cv score: 0.00011715 保留\n", 1253 | "----------------------------------------------------------------------------------------------------\n", 1254 | "\n", 1255 | "当前选择特征: A6, CV score: 0.00012184, 最佳cv score: 0.00011715 保留\n", 1256 | "----------------------------------------------------------------------------------------------------\n", 1257 | "\n", 1258 | "当前选择特征: B14, CV score: 0.00012037, 最佳cv score: 0.00011715 保留\n", 1259 | "----------------------------------------------------------------------------------------------------\n", 1260 | "优化后 CV score: 0.00011715\n", 1261 | "\n", 1262 | "Current fearure length = 8\n", 1263 | "\n", 1264 | "\n", 1265 | "\n", 1266 | "\n", 1267 | "Iteration = 7\n", 1268 | "\n", 1269 | "----------------------------------------------------------------------------------------------------\n", 1270 | "\n", 1271 | "初始 CV score: 0.00011709\n", 1272 | "----------------------------------------------------------------------------------------------------\n", 1273 | "\n", 1274 | "当前选择特征: A25, CV score: 0.00012090, 最佳cv score: 0.00011709 保留\n", 1275 | "----------------------------------------------------------------------------------------------------\n", 1276 | "\n", 1277 | "当前选择特征: id, CV score: 0.00018248, 最佳cv score: 0.00011709 保留\n", 1278 | "----------------------------------------------------------------------------------------------------\n", 1279 | "\n", 1280 | "当前选择特征: B14/B12, CV score: 0.00012002, 最佳cv score: 0.00011709 保留\n", 1281 | "----------------------------------------------------------------------------------------------------\n", 1282 | "\n", 1283 | "当前选择特征: B14, CV score: 0.00012325, 最佳cv score: 0.00011709 保留\n", 1284 | "----------------------------------------------------------------------------------------------------\n", 1285 | "\n", 1286 | "当前选择特征: A3_null, CV score: 0.00012002, 最佳cv score: 0.00011709 保留\n", 1287 | "----------------------------------------------------------------------------------------------------\n", 1288 | "\n", 1289 | "当前选择特征: A16, CV score: 0.00011987, 最佳cv score: 0.00011709 保留\n", 1290 | "----------------------------------------------------------------------------------------------------\n", 1291 | "\n", 1292 | "当前选择特征: A6, CV score: 0.00012154, 最佳cv score: 0.00011709 保留\n", 1293 | "----------------------------------------------------------------------------------------------------\n", 1294 | "\n", 1295 | "当前选择特征: B10_null, CV score: 0.00011942, 最佳cv score: 0.00011709 保留\n", 1296 | "----------------------------------------------------------------------------------------------------\n", 1297 | "优化后 CV score: 0.00011709\n", 1298 | "\n", 1299 | "Current fearure length = 8\n", 1300 | "(1532, 9)\n" 1301 | ] 1302 | } 1303 | ], 1304 | "source": [ 1305 | "selected_features = 
['A3_null', 'A6', 'A16', 'A25', 'A28', 'A28_end', \n",
1306 | "                     'B5', 'B10_null', 'B11_null', 'B14', 'B14/B12', 'id']\n",
1307 | "\n",
1308 | "pipe = Pipeline([\n",
1309 | "    ('del_nan_feature', del_nan_feature()),\n",
1310 | "    ('handle_time_str', handle_time_str()),\n",
1311 | "    ('calc_time_diff', calc_time_diff()),\n",
1312 | "    ('Handle_outliers', handle_outliers(2)),\n",
1313 | "    ('del_single_feature', del_single_feature(1)),\n",
1314 | "    ('add_new_features', add_new_features()),\n",
1315 | "    ('selsected_fill_nans', selsected_fill_nans(selected_features)),\n",
1316 | "    ('select_feature', select_feature(selected_features)),\n",
1317 | "    ])\n",
1318 | "\n",
1319 | "pipe_data = pipe.fit_transform(full.copy())\n",
1320 | "print(pipe_data.shape)"
1321 | ]
1322 | },
1323 | {
1324 | "cell_type": "markdown",
1325 | "metadata": {},
1326 | "source": [
1327 | "## Automatic parameter tuning"
1328 | ]
1329 | },
1330 | {
1331 | "cell_type": "code",
1332 | "execution_count": 22,
1333 | "metadata": {},
1334 | "outputs": [],
1335 | "source": [
1336 | "def find_best_params(pipe_data, predict_fun, param_grid):\n",
1337 | "    \n",
1338 | "    # Split into train / test and apply min-max normalisation\n",
1339 | "    X_train, y_train, X_test, test_idx = split_data(pipe_data, target_name='target')\n",
1340 | "    min_max_scaler = MinMaxScaler()\n",
1341 | "    X_train = min_max_scaler.fit_transform(X_train)\n",
1342 | "    X_test = min_max_scaler.transform(X_test)\n",
1343 | "    best_score, best_params = 1, None    # any real CV MSE here (~1e-4) beats 1\n",
1344 | "\n",
1345 | "    # Try every parameter combination and keep the best one\n",
1346 | "    for params in ParameterGrid(param_grid):\n",
1347 | "        print('-'*100, \"\\nparams = \\n{}\\n\".format(params))\n",
1348 | "\n",
1349 | "        oof, predictions = predict_fun(X_train, y_train, X_test, params=params, verbose_eval=False)\n",
1350 | "        score = mean_squared_error(oof, y_train)\n",
1351 | "        print(\"CV score: {}, current best score: {}\".format(score, best_score))\n",
1352 | "\n",
1353 | "        if best_score > score:\n",
1354 | "            print(\"Found new best score: {}\".format(score))\n",
1355 | "            best_score = score\n",
1356 | "            best_params = params\n",
1357 | "\n",
1358 | "\n",
1359 | "    print('\\n\\nbest params: {}'.format(best_params))\n",
1360 | "    print('best score: {}'.format(best_score))\n",
1361 | "    \n",
1362 | "    return best_params"
1363 | ]
1364 | },
1365 | {
1366 | "cell_type": "markdown",
1367 | "metadata": {},
1368 | "source": [
1369 | "## lgb"
1370 | ]
1371 | },
1372 | {
1373 | "cell_type": "code",
1374 | "execution_count": 23,
1375 | "metadata": {},
1376 | "outputs": [],
1377 | "source": [
1378 | "def lgb_predict(X_train, y_train, X_test, params=None, verbose_eval=100):\n",
1379 | "    \n",
1380 | "    if params is None:\n",
1381 | "        lgb_param = {'num_leaves': 20, 'min_data_in_leaf': 2, 'objective':'regression', 'max_depth': 4,\n",
1382 | "                     'learning_rate': 0.06, \"min_child_samples\": 3, \"boosting\": \"gbdt\", \"feature_fraction\": 0.7,\n",
1383 | "                     \"bagging_freq\": 1, \"bagging_fraction\": 1, \"bagging_seed\": 11, \"metric\": 'mse', \"lambda_l2\": 0.003,\n",
1384 | "                     \"verbosity\": -1}\n",
1385 | "    else:\n",
1386 | "        lgb_param = params\n",
1387 | "    \n",
1388 | "    folds = KFold(n_splits=10, shuffle=True, random_state=2018)\n",
1389 | "    oof_lgb = np.zeros(len(X_train))\n",
1390 | "    predictions_lgb = np.zeros(len(X_test))\n",
1391 | "\n",
1392 | "    for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):\n",
1393 | "        if verbose_eval:\n",
1394 | "            print(\"fold n°{}\".format(fold_+1))\n",
1395 | "        \n",
1396 | "        trn_data = lgb.Dataset(X_train[trn_idx], y_train[trn_idx])\n",
1397 | "        val_data = lgb.Dataset(X_train[val_idx], 
y_train[val_idx])\n", 1398 | "\n", 1399 | " num_round = 10000\n", 1400 | " clf = lgb.train(lgb_param, trn_data, num_round, valid_sets = [trn_data, val_data],\n", 1401 | " verbose_eval=verbose_eval, early_stopping_rounds = 100)\n", 1402 | " \n", 1403 | " oof_lgb[val_idx] = clf.predict(X_train[val_idx], num_iteration=clf.best_iteration)\n", 1404 | "\n", 1405 | " predictions_lgb += clf.predict(X_test, num_iteration=clf.best_iteration) / folds.n_splits\n", 1406 | "\n", 1407 | " if verbose_eval:\n", 1408 | " print(\"CV score: {:<8.8f}\".format(mean_squared_error(oof_lgb, y_train)))\n", 1409 | " \n", 1410 | " return oof_lgb, predictions_lgb" 1411 | ] 1412 | }, 1413 | { 1414 | "cell_type": "markdown", 1415 | "metadata": {}, 1416 | "source": [ 1417 | "+ __选择最优参数__" 1418 | ] 1419 | }, 1420 | { 1421 | "cell_type": "code", 1422 | "execution_count": 24, 1423 | "metadata": {}, 1424 | "outputs": [ 1425 | { 1426 | "name": "stdout", 1427 | "output_type": "stream", 1428 | "text": [ 1429 | "---------------------------------------------------------------------------------------------------- \n", 1430 | "params = \n", 1431 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.06, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1432 | "\n", 1433 | "CV score: 0.00011489364207207567, current best score: 1\n", 1434 | "Found new best score: 0.00011489364207207567\n", 1435 | "---------------------------------------------------------------------------------------------------- \n", 1436 | "params = \n", 1437 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.06, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1438 | "\n", 1439 | "CV score: 0.00012002848815037457, current best score: 0.00011489364207207567\n", 1440 | "---------------------------------------------------------------------------------------------------- \n", 1441 | "params = \n", 1442 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.06, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1443 | "\n", 1444 | "CV score: 0.00011402092342458887, current best score: 0.00011489364207207567\n", 1445 | "Found new best score: 0.00011402092342458887\n", 1446 | "---------------------------------------------------------------------------------------------------- \n", 1447 | "params = \n", 1448 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.06, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1449 | "\n", 1450 | "CV score: 0.00011882313706702633, current best score: 0.00011402092342458887\n", 1451 | "---------------------------------------------------------------------------------------------------- \n", 1452 | "params = \n", 1453 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.06, 
'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1454 | "\n", 1455 | "CV score: 0.00011427665390150208, current best score: 0.00011402092342458887\n", 1456 | "---------------------------------------------------------------------------------------------------- \n", 1457 | "params = \n", 1458 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.06, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1459 | "\n", 1460 | "CV score: 0.000118469789122594, current best score: 0.00011402092342458887\n", 1461 | "---------------------------------------------------------------------------------------------------- \n", 1462 | "params = \n", 1463 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.12, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1464 | "\n", 1465 | "CV score: 0.00011547690407588457, current best score: 0.00011402092342458887\n", 1466 | "---------------------------------------------------------------------------------------------------- \n", 1467 | "params = \n", 1468 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.12, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1469 | "\n", 1470 | "CV score: 0.00011985943852356268, current best score: 0.00011402092342458887\n", 1471 | "---------------------------------------------------------------------------------------------------- \n", 1472 | "params = \n", 1473 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.12, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1474 | "\n", 1475 | "CV score: 0.00011231611613708764, current best score: 0.00011402092342458887\n", 1476 | "Found new best score: 0.00011231611613708764\n", 1477 | "---------------------------------------------------------------------------------------------------- \n", 1478 | "params = \n", 1479 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.12, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1480 | "\n", 1481 | "CV score: 0.00011748828797017007, current best score: 0.00011231611613708764\n", 1482 | "---------------------------------------------------------------------------------------------------- \n", 1483 | "params = \n", 1484 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.12, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1485 | "\n", 1486 | "CV score: 0.00011554903372234801, current best score: 
0.00011231611613708764\n", 1487 | "---------------------------------------------------------------------------------------------------- \n", 1488 | "params = \n", 1489 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.12, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1490 | "\n", 1491 | "CV score: 0.00012001078341271754, current best score: 0.00011231611613708764\n", 1492 | "---------------------------------------------------------------------------------------------------- \n", 1493 | "params = \n", 1494 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.24, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1495 | "\n", 1496 | "CV score: 0.0001135845354614368, current best score: 0.00011231611613708764\n", 1497 | "---------------------------------------------------------------------------------------------------- \n", 1498 | "params = \n", 1499 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.24, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1500 | "\n", 1501 | "CV score: 0.00011950413736010373, current best score: 0.00011231611613708764\n", 1502 | "---------------------------------------------------------------------------------------------------- \n", 1503 | "params = \n", 1504 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.24, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1505 | "\n", 1506 | "CV score: 0.00011449170101534524, current best score: 0.00011231611613708764\n", 1507 | "---------------------------------------------------------------------------------------------------- \n", 1508 | "params = \n", 1509 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.24, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1510 | "\n", 1511 | "CV score: 0.00012067409892140623, current best score: 0.00011231611613708764\n", 1512 | "---------------------------------------------------------------------------------------------------- \n", 1513 | "params = \n", 1514 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.24, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1515 | "\n" 1516 | ] 1517 | }, 1518 | { 1519 | "name": "stdout", 1520 | "output_type": "stream", 1521 | "text": [ 1522 | "CV score: 0.00011435307236140136, current best score: 0.00011231611613708764\n", 1523 | "---------------------------------------------------------------------------------------------------- \n", 1524 | "params = \n", 1525 | 
"{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.0003, 'learning_rate': 0.24, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1526 | "\n", 1527 | "CV score: 0.00012009711500307733, current best score: 0.00011231611613708764\n", 1528 | "---------------------------------------------------------------------------------------------------- \n", 1529 | "params = \n", 1530 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.06, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1531 | "\n", 1532 | "CV score: 0.00011480005750940053, current best score: 0.00011231611613708764\n", 1533 | "---------------------------------------------------------------------------------------------------- \n", 1534 | "params = \n", 1535 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.06, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1536 | "\n", 1537 | "CV score: 0.00012025520678151364, current best score: 0.00011231611613708764\n", 1538 | "---------------------------------------------------------------------------------------------------- \n", 1539 | "params = \n", 1540 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.06, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1541 | "\n", 1542 | "CV score: 0.00011420826273292763, current best score: 0.00011231611613708764\n", 1543 | "---------------------------------------------------------------------------------------------------- \n", 1544 | "params = \n", 1545 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.06, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1546 | "\n", 1547 | "CV score: 0.00011871676666472398, current best score: 0.00011231611613708764\n", 1548 | "---------------------------------------------------------------------------------------------------- \n", 1549 | "params = \n", 1550 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.06, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1551 | "\n", 1552 | "CV score: 0.00011447430938895559, current best score: 0.00011231611613708764\n", 1553 | "---------------------------------------------------------------------------------------------------- \n", 1554 | "params = \n", 1555 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.06, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': 
-1}\n", 1556 | "\n", 1557 | "CV score: 0.00011845637169175561, current best score: 0.00011231611613708764\n", 1558 | "---------------------------------------------------------------------------------------------------- \n", 1559 | "params = \n", 1560 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.12, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1561 | "\n", 1562 | "CV score: 0.00011541082150277204, current best score: 0.00011231611613708764\n", 1563 | "---------------------------------------------------------------------------------------------------- \n", 1564 | "params = \n", 1565 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.12, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1566 | "\n", 1567 | "CV score: 0.00011943383299618423, current best score: 0.00011231611613708764\n", 1568 | "---------------------------------------------------------------------------------------------------- \n", 1569 | "params = \n", 1570 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.12, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1571 | "\n", 1572 | "CV score: 0.00011212722352757958, current best score: 0.00011231611613708764\n", 1573 | "Found new best score: 0.00011212722352757958\n", 1574 | "---------------------------------------------------------------------------------------------------- \n", 1575 | "params = \n", 1576 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.12, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1577 | "\n", 1578 | "CV score: 0.0001181836666297253, current best score: 0.00011212722352757958\n", 1579 | "---------------------------------------------------------------------------------------------------- \n", 1580 | "params = \n", 1581 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.12, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1582 | "\n", 1583 | "CV score: 0.0001158514519987164, current best score: 0.00011212722352757958\n", 1584 | "---------------------------------------------------------------------------------------------------- \n", 1585 | "params = \n", 1586 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.12, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1587 | "\n", 1588 | "CV score: 0.00012000426127843114, current best score: 0.00011212722352757958\n", 1589 | "---------------------------------------------------------------------------------------------------- \n", 1590 | "params = 
\n", 1591 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.24, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1592 | "\n", 1593 | "CV score: 0.00011360641302273318, current best score: 0.00011212722352757958\n", 1594 | "---------------------------------------------------------------------------------------------------- \n", 1595 | "params = \n", 1596 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.24, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1597 | "\n", 1598 | "CV score: 0.00011965654294675753, current best score: 0.00011212722352757958\n", 1599 | "---------------------------------------------------------------------------------------------------- \n", 1600 | "params = \n", 1601 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.24, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1602 | "\n", 1603 | "CV score: 0.00011443935391500729, current best score: 0.00011212722352757958\n", 1604 | "---------------------------------------------------------------------------------------------------- \n", 1605 | "params = \n", 1606 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.24, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1607 | "\n" 1608 | ] 1609 | }, 1610 | { 1611 | "name": "stdout", 1612 | "output_type": "stream", 1613 | "text": [ 1614 | "CV score: 0.00012091648099518892, current best score: 0.00011212722352757958\n", 1615 | "---------------------------------------------------------------------------------------------------- \n", 1616 | "params = \n", 1617 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.24, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1618 | "\n", 1619 | "CV score: 0.00011431927180539175, current best score: 0.00011212722352757958\n", 1620 | "---------------------------------------------------------------------------------------------------- \n", 1621 | "params = \n", 1622 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.24, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1623 | "\n", 1624 | "CV score: 0.00012002843916496843, current best score: 0.00011212722352757958\n", 1625 | "---------------------------------------------------------------------------------------------------- \n", 1626 | "params = \n", 1627 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.06, 'max_depth': 3, 'metric': 
'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1628 | "\n", 1629 | "CV score: 0.00011451755742912798, current best score: 0.00011212722352757958\n", 1630 | "---------------------------------------------------------------------------------------------------- \n", 1631 | "params = \n", 1632 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.06, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1633 | "\n", 1634 | "CV score: 0.00011991034882396216, current best score: 0.00011212722352757958\n", 1635 | "---------------------------------------------------------------------------------------------------- \n", 1636 | "params = \n", 1637 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.06, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1638 | "\n", 1639 | "CV score: 0.00011421574014520517, current best score: 0.00011212722352757958\n", 1640 | "---------------------------------------------------------------------------------------------------- \n", 1641 | "params = \n", 1642 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.06, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1643 | "\n", 1644 | "CV score: 0.00011862220062373018, current best score: 0.00011212722352757958\n", 1645 | "---------------------------------------------------------------------------------------------------- \n", 1646 | "params = \n", 1647 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.06, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1648 | "\n", 1649 | "CV score: 0.00011490378784720141, current best score: 0.00011212722352757958\n", 1650 | "---------------------------------------------------------------------------------------------------- \n", 1651 | "params = \n", 1652 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.06, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1653 | "\n", 1654 | "CV score: 0.00011862098330828783, current best score: 0.00011212722352757958\n", 1655 | "---------------------------------------------------------------------------------------------------- \n", 1656 | "params = \n", 1657 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.12, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1658 | "\n", 1659 | "CV score: 0.00011474358647366374, current best score: 0.00011212722352757958\n", 1660 | 
"---------------------------------------------------------------------------------------------------- \n", 1661 | "params = \n", 1662 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.12, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1663 | "\n", 1664 | "CV score: 0.00011966504510458172, current best score: 0.00011212722352757958\n", 1665 | "---------------------------------------------------------------------------------------------------- \n", 1666 | "params = \n", 1667 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.12, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1668 | "\n", 1669 | "CV score: 0.00011221151541733946, current best score: 0.00011212722352757958\n", 1670 | "---------------------------------------------------------------------------------------------------- \n", 1671 | "params = \n", 1672 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.12, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1673 | "\n", 1674 | "CV score: 0.00011777070012250079, current best score: 0.00011212722352757958\n", 1675 | "---------------------------------------------------------------------------------------------------- \n", 1676 | "params = \n", 1677 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.12, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1678 | "\n", 1679 | "CV score: 0.0001153000539253855, current best score: 0.00011212722352757958\n", 1680 | "---------------------------------------------------------------------------------------------------- \n", 1681 | "params = \n", 1682 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.12, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1683 | "\n", 1684 | "CV score: 0.00012101952260099292, current best score: 0.00011212722352757958\n", 1685 | "---------------------------------------------------------------------------------------------------- \n", 1686 | "params = \n", 1687 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.24, 'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1688 | "\n", 1689 | "CV score: 0.00011434513539248205, current best score: 0.00011212722352757958\n", 1690 | "---------------------------------------------------------------------------------------------------- \n", 1691 | "params = \n", 1692 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.24, 
'max_depth': 3, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1693 | "\n", 1694 | "CV score: 0.00011965789595988069, current best score: 0.00011212722352757958\n", 1695 | "---------------------------------------------------------------------------------------------------- \n", 1696 | "params = \n", 1697 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.24, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1698 | "\n" 1699 | ] 1700 | }, 1701 | { 1702 | "name": "stdout", 1703 | "output_type": "stream", 1704 | "text": [ 1705 | "CV score: 0.000114353078699699, current best score: 0.00011212722352757958\n", 1706 | "---------------------------------------------------------------------------------------------------- \n", 1707 | "params = \n", 1708 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.24, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1709 | "\n", 1710 | "CV score: 0.0001212472410367223, current best score: 0.00011212722352757958\n", 1711 | "---------------------------------------------------------------------------------------------------- \n", 1712 | "params = \n", 1713 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.24, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1714 | "\n", 1715 | "CV score: 0.00011433021385951927, current best score: 0.00011212722352757958\n", 1716 | "---------------------------------------------------------------------------------------------------- \n", 1717 | "params = \n", 1718 | "{'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.003, 'learning_rate': 0.24, 'max_depth': 5, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 3, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1719 | "\n", 1720 | "CV score: 0.0001202623231156629, current best score: 0.00011212722352757958\n", 1721 | "\n", 1722 | "\n", 1723 | "best params: {'bagging_fraction': 1, 'bagging_freq': 1, 'bagging_seed': 11, 'boosting': 'gbdt', 'feature_fraction': 0.7, 'lambda_l2': 0.001, 'learning_rate': 0.12, 'max_depth': 4, 'metric': 'mse', 'min_child_samples': 3, 'min_data_in_leaf': 2, 'num_leaves': 20, 'objective': 'regression', 'verbosity': -1}\n", 1724 | "best score: 0.00011212722352757958\n" 1725 | ] 1726 | } 1727 | ], 1728 | "source": [ 1729 | "param_grid = [\n", 1730 | " {'num_leaves': [20], 'min_data_in_leaf': [2, 3], 'objective':['regression'],\n", 1731 | " 'max_depth': [3, 4, 5], 'learning_rate': [0.06, 0.12, 0.24], \"min_child_samples\": [3],\n", 1732 | " \"boosting\": [\"gbdt\"], \"feature_fraction\": [0.7], \"bagging_freq\": [1],\n", 1733 | " \"bagging_fraction\": [1], \"bagging_seed\": [11], \"metric\": ['mse'],\n", 1734 | " \"lambda_l2\": [0.0003, 0.001, 0.003], \"verbosity\": [-1]}\n", 1735 | " ]\n", 1736 | "\n", 1737 | "lgb_best_params = find_best_params(pipe_data, lgb_predict, param_grid)" 1738 | ] 1739 | }, 1740 
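| {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "+ __Note:__ The \"Iteration = k\" feature-selection logs earlier in this notebook come from a backward-selection loop: each pass tentatively drops one feature, re-runs CV, and deletes the feature for good when the score improves. A minimal sketch of that loop is below; `eval_cv` is a hypothetical stand-in for the notebook's own CV-scoring helper, not its exact API."
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "def backward_select(features, eval_cv):\n",
  "    # eval_cv(feature_list) -> CV MSE, lower is better (assumed helper).\n",
  "    best = eval_cv(features)\n",
  "    improved = True\n",
  "    while improved:                           # one pass per logged \"Iteration = k\"\n",
  "        improved = False\n",
  "        for f in list(features):\n",
  "            trial = [x for x in features if x != f]\n",
  "            score = eval_cv(trial)\n",
  "            if score < best:                  # dropping f improved CV -> delete it\n",
  "                best, features = score, trial\n",
  "                improved = True\n",
  "    return features, best"
 ]
},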
{
1741 | "cell_type": "markdown",
1742 | "metadata": {},
1743 | "source": [
1744 | "+ __lgb training__"
1745 | ]
1746 | },
1747 | {
1748 | "cell_type": "code",
1749 | "execution_count": 25,
1750 | "metadata": {},
1751 | "outputs": [
1752 | {
1753 | "name": "stdout",
1754 | "output_type": "stream",
1755 | "text": [
1756 | "fold n°1\n",
1757 | "Training until validation scores don't improve for 100 rounds.\n",
1758 | "Early stopping, best iteration is:\n",
1759 | "[84]\ttraining's l2: 7.64635e-05\tvalid_1's l2: 0.00013957\n",
1760 | "CV score: 0.76865147\n",
1761 | "fold n°2\n",
1762 | "Training until validation scores don't improve for 100 rounds.\n",
1763 | "Early stopping, best iteration is:\n",
1764 | "[82]\ttraining's l2: 8.03721e-05\tvalid_1's l2: 0.000110882\n",
1765 | "CV score: 0.68291002\n",
1766 | "fold n°3\n",
1767 | "Training until validation scores don't improve for 100 rounds.\n",
1768 | "[200]\ttraining's l2: 5.30193e-05\tvalid_1's l2: 0.000139552\n",
1769 | "Early stopping, best iteration is:\n",
1770 | "[173]\ttraining's l2: 5.68838e-05\tvalid_1's l2: 0.000138392\n",
1771 | "CV score: 0.59707720\n",
1772 | "fold n°4\n",
1773 | "Training until validation scores don't improve for 100 rounds.\n",
1774 | "[200]\ttraining's l2: 5.73108e-05\tvalid_1's l2: 0.000104267\n",
1775 | "Early stopping, best iteration is:\n",
1776 | "[104]\ttraining's l2: 7.56046e-05\tvalid_1's l2: 9.52206e-05\n",
1777 | "CV score: 0.51161256\n",
1778 | "fold n°5\n",
1779 | "Training until validation scores don't improve for 100 rounds.\n",
1780 | "[200]\ttraining's l2: 5.36518e-05\tvalid_1's l2: 0.000110701\n",
1781 | "Early stopping, best iteration is:\n",
1782 | "[111]\ttraining's l2: 6.91289e-05\tvalid_1's l2: 0.0001088\n",
1783 | "CV score: 0.42667732\n",
1784 | "fold n°6\n",
1785 | "Training until validation scores don't improve for 100 rounds.\n",
1786 | "[200]\ttraining's l2: 5.63032e-05\tvalid_1's l2: 0.000123434\n",
1787 | "Early stopping, best iteration is:\n",
1788 | "[110]\ttraining's l2: 7.12232e-05\tvalid_1's l2: 0.000120677\n",
1789 | "CV score: 0.34177334\n",
1790 | "fold n°7\n",
1791 | "Training until validation scores don't improve for 100 rounds.\n",
1792 | "Early stopping, best iteration is:\n",
1793 | "[88]\ttraining's l2: 7.56938e-05\tvalid_1's l2: 0.000111444\n",
1794 | "CV score: 0.25611485\n",
1795 | "fold n°8\n",
1796 | "Training until validation scores don't improve for 100 rounds.\n",
1797 | "[200]\ttraining's l2: 5.83909e-05\tvalid_1's l2: 8.13893e-05\n",
1798 | "Early stopping, best iteration is:\n",
1799 | "[177]\ttraining's l2: 6.21352e-05\tvalid_1's l2: 7.96722e-05\n",
1800 | "CV score: 0.17049800\n",
1801 | "fold n°9\n",
1802 | "Training until validation scores don't improve for 100 rounds.\n",
1803 | "[200]\ttraining's l2: 5.74223e-05\tvalid_1's l2: 0.000108159\n",
1804 | "Early stopping, best iteration is:\n",
1805 | "[183]\ttraining's l2: 5.99078e-05\tvalid_1's l2: 0.000107585\n",
1806 | "CV score: 0.08544523\n",
1807 | "fold n°10\n",
1808 | "Training until validation scores don't improve for 100 rounds.\n",
1809 | "[200]\ttraining's l2: 5.66965e-05\tvalid_1's l2: 0.000109477\n",
1810 | "Early stopping, best iteration is:\n",
1811 | "[176]\ttraining's l2: 5.98262e-05\tvalid_1's l2: 0.000108649\n",
1812 | "CV score: 0.00011213\n"
1813 | ]
1814 | }
1815 | ],
1816 | "source": [
1817 | "X_train, y_train, X_test, test_idx = split_data(pipe_data, target_name='target')\n",
1818 | "min_max_scaler = MinMaxScaler()\n",
1819 | "X_train = min_max_scaler.fit_transform(X_train)\n",
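"# Note: the MinMaxScaler above is fit on the training rows only and reused\n",
"# unchanged for the test rows below, so no test-set statistics leak in.\n",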
1820 | "X_test = min_max_scaler.transform(X_test)\n", 1821 | "oof_lgb, predictions_lgb = lgb_predict(X_train, y_train, X_test, params=lgb_best_params, verbose_eval=200) #" 1822 | ] 1823 | }, 1824 | { 1825 | "cell_type": "markdown", 1826 | "metadata": {}, 1827 | "source": [ 1828 | "## xgb" 1829 | ] 1830 | }, 1831 | { 1832 | "cell_type": "markdown", 1833 | "metadata": {}, 1834 | "source": [ 1835 | "+ __选择最优参数__" 1836 | ] 1837 | }, 1838 | { 1839 | "cell_type": "code", 1840 | "execution_count": 26, 1841 | "metadata": {}, 1842 | "outputs": [ 1843 | { 1844 | "name": "stdout", 1845 | "output_type": "stream", 1846 | "text": [ 1847 | "---------------------------------------------------------------------------------------------------- \n", 1848 | "params = \n", 1849 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n", 1850 | "\n", 1851 | "CV score: 0.00011466648461334117, current best score: 1\n", 1852 | "Found new best score: 0.00011466648461334117\n", 1853 | "---------------------------------------------------------------------------------------------------- \n", 1854 | "params = \n", 1855 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n", 1856 | "\n", 1857 | "CV score: 0.00011456634369353389, current best score: 0.00011466648461334117\n", 1858 | "Found new best score: 0.00011456634369353389\n", 1859 | "---------------------------------------------------------------------------------------------------- \n", 1860 | "params = \n", 1861 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.8}\n", 1862 | "\n", 1863 | "CV score: 0.00011522095556659337, current best score: 0.00011456634369353389\n", 1864 | "---------------------------------------------------------------------------------------------------- \n", 1865 | "params = \n", 1866 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n", 1867 | "\n", 1868 | "CV score: 0.00011575362802403785, current best score: 0.00011456634369353389\n", 1869 | "---------------------------------------------------------------------------------------------------- \n", 1870 | "params = \n", 1871 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n", 1872 | "\n", 1873 | "CV score: 0.0001153688729909817, current best score: 0.00011456634369353389\n", 1874 | "---------------------------------------------------------------------------------------------------- \n", 1875 | "params = \n", 1876 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n", 1877 | "\n", 1878 | "CV score: 0.00011393958598424273, current best score: 0.00011456634369353389\n", 1879 | "Found new best score: 0.00011393958598424273\n", 1880 | "---------------------------------------------------------------------------------------------------- \n", 1881 | "params = \n", 1882 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 
'reg:linear', 'silent': 1, 'subsample': 0.8}\n", 1883 | "\n", 1884 | "CV score: 0.00011519961810983215, current best score: 0.00011393958598424273\n", 1885 | "---------------------------------------------------------------------------------------------------- \n", 1886 | "params = \n", 1887 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n", 1888 | "\n", 1889 | "CV score: 0.0001159343681149051, current best score: 0.00011393958598424273\n", 1890 | "---------------------------------------------------------------------------------------------------- \n", 1891 | "params = \n", 1892 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n", 1893 | "\n", 1894 | "CV score: 0.00011533954423435673, current best score: 0.00011393958598424273\n", 1895 | "---------------------------------------------------------------------------------------------------- \n", 1896 | "params = \n", 1897 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n", 1898 | "\n", 1899 | "CV score: 0.00011435484501228615, current best score: 0.00011393958598424273\n", 1900 | "---------------------------------------------------------------------------------------------------- \n", 1901 | "params = \n", 1902 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.8}\n", 1903 | "\n", 1904 | "CV score: 0.00011550346442947287, current best score: 0.00011393958598424273\n", 1905 | "---------------------------------------------------------------------------------------------------- \n", 1906 | "params = \n", 1907 | "{'colsample_bytree': 0.7, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n", 1908 | "\n", 1909 | "CV score: 0.00011732503571073892, current best score: 0.00011393958598424273\n", 1910 | "---------------------------------------------------------------------------------------------------- \n", 1911 | "params = \n", 1912 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n", 1913 | "\n", 1914 | "CV score: 0.00011507860801671209, current best score: 0.00011393958598424273\n", 1915 | "---------------------------------------------------------------------------------------------------- \n", 1916 | "params = \n", 1917 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n", 1918 | "\n", 1919 | "CV score: 0.00011373673653760243, current best score: 0.00011393958598424273\n", 1920 | "Found new best score: 0.00011373673653760243\n", 1921 | "---------------------------------------------------------------------------------------------------- \n", 1922 | "params = \n", 1923 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.8}\n", 1924 | "\n", 1925 | "CV score: 0.00011455317620732011, current best score: 0.00011373673653760243\n", 
1926 | "---------------------------------------------------------------------------------------------------- \n", 1927 | "params = \n", 1928 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n", 1929 | "\n", 1930 | "CV score: 0.0001157106534400235, current best score: 0.00011373673653760243\n", 1931 | "---------------------------------------------------------------------------------------------------- \n", 1932 | "params = \n", 1933 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n", 1934 | "\n", 1935 | "CV score: 0.00011484817588831511, current best score: 0.00011373673653760243\n", 1936 | "---------------------------------------------------------------------------------------------------- \n", 1937 | "params = \n", 1938 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n", 1939 | "\n", 1940 | "CV score: 0.00011350220852811235, current best score: 0.00011373673653760243\n", 1941 | "Found new best score: 0.00011350220852811235\n", 1942 | "---------------------------------------------------------------------------------------------------- \n", 1943 | "params = \n", 1944 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.8}\n", 1945 | "\n", 1946 | "CV score: 0.00011417123679618333, current best score: 0.00011350220852811235\n", 1947 | "---------------------------------------------------------------------------------------------------- \n", 1948 | "params = \n", 1949 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n", 1950 | "\n", 1951 | "CV score: 0.0001139991054633333, current best score: 0.00011350220852811235\n", 1952 | "---------------------------------------------------------------------------------------------------- \n", 1953 | "params = \n", 1954 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n", 1955 | "\n", 1956 | "CV score: 0.00011455587677046536, current best score: 0.00011350220852811235\n", 1957 | "---------------------------------------------------------------------------------------------------- \n", 1958 | "params = \n", 1959 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n", 1960 | "\n", 1961 | "CV score: 0.00011509936251941176, current best score: 0.00011350220852811235\n", 1962 | "---------------------------------------------------------------------------------------------------- \n", 1963 | "params = \n", 1964 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.8}\n", 1965 | "\n" 1966 | ] 1967 | }, 1968 | { 1969 | "name": "stdout", 1970 | "output_type": "stream", 1971 | "text": [ 1972 | "CV score: 0.00011510995861245286, current best score: 0.00011350220852811235\n", 1973 | 
"---------------------------------------------------------------------------------------------------- \n", 1974 | "params = \n", 1975 | "{'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n", 1976 | "\n", 1977 | "CV score: 0.00011452153482808672, current best score: 0.00011350220852811235\n", 1978 | "---------------------------------------------------------------------------------------------------- \n", 1979 | "params = \n", 1980 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n", 1981 | "\n", 1982 | "CV score: 0.00011658284174959551, current best score: 0.00011350220852811235\n", 1983 | "---------------------------------------------------------------------------------------------------- \n", 1984 | "params = \n", 1985 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n", 1986 | "\n", 1987 | "CV score: 0.00011423913370431543, current best score: 0.00011350220852811235\n", 1988 | "---------------------------------------------------------------------------------------------------- \n", 1989 | "params = \n", 1990 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.8}\n", 1991 | "\n", 1992 | "CV score: 0.00011457178138601278, current best score: 0.00011350220852811235\n", 1993 | "---------------------------------------------------------------------------------------------------- \n", 1994 | "params = \n", 1995 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 4, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n", 1996 | "\n", 1997 | "CV score: 0.00011543189691404133, current best score: 0.00011350220852811235\n", 1998 | "---------------------------------------------------------------------------------------------------- \n", 1999 | "params = \n", 2000 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n", 2001 | "\n", 2002 | "CV score: 0.00011610932754520534, current best score: 0.00011350220852811235\n", 2003 | "---------------------------------------------------------------------------------------------------- \n", 2004 | "params = \n", 2005 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n", 2006 | "\n", 2007 | "CV score: 0.00011469866498564288, current best score: 0.00011350220852811235\n", 2008 | "---------------------------------------------------------------------------------------------------- \n", 2009 | "params = \n", 2010 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.8}\n", 2011 | "\n", 2012 | "CV score: 0.00011495607118486482, current best score: 0.00011350220852811235\n", 2013 | "---------------------------------------------------------------------------------------------------- \n", 2014 | "params = \n", 2015 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 
'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n", 2016 | "\n", 2017 | "CV score: 0.00011659799844084755, current best score: 0.00011350220852811235\n", 2018 | "---------------------------------------------------------------------------------------------------- \n", 2019 | "params = \n", 2020 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.4}\n", 2021 | "\n", 2022 | "CV score: 0.00011684188710827838, current best score: 0.00011350220852811235\n", 2023 | "---------------------------------------------------------------------------------------------------- \n", 2024 | "params = \n", 2025 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n", 2026 | "\n", 2027 | "CV score: 0.0001155643748809089, current best score: 0.00011350220852811235\n", 2028 | "---------------------------------------------------------------------------------------------------- \n", 2029 | "params = \n", 2030 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.8}\n", 2031 | "\n", 2032 | "CV score: 0.0001150891914023831, current best score: 0.00011350220852811235\n", 2033 | "---------------------------------------------------------------------------------------------------- \n", 2034 | "params = \n", 2035 | "{'colsample_bytree': 1, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 6, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 1}\n", 2036 | "\n", 2037 | "CV score: 0.0001174593981966976, current best score: 0.00011350220852811235\n", 2038 | "\n", 2039 | "\n", 2040 | "best params: {'colsample_bytree': 0.9, 'eta': 0.03, 'eval_metric': 'rmse', 'max_depth': 5, 'nthread': 4, 'num_round': 1000, 'objective': 'reg:linear', 'silent': 1, 'subsample': 0.6}\n", 2041 | "best score: 0.00011350220852811235\n" 2042 | ] 2043 | } 2044 | ], 2045 | "source": [ 2046 | "param_grid = [\n", 2047 | " {'silent': [1],\n", 2048 | " 'nthread': [4],\n", 2049 | " 'eval_metric': ['rmse'],\n", 2050 | " 'eta': [0.03],\n", 2051 | " 'objective': ['reg:linear'],\n", 2052 | " 'max_depth': [4, 5, 6],\n", 2053 | " 'num_round': [1000],\n", 2054 | " 'subsample': [0.4, 0.6, 0.8, 1],\n", 2055 | " 'colsample_bytree': [0.7, 0.9, 1],\n", 2056 | " }\n", 2057 | " ]\n", 2058 | "\n", 2059 | "xgb_best_params = find_best_params(pipe_data, xgb_predict, param_grid)" 2060 | ] 2061 | }, 2062 | { 2063 | "cell_type": "markdown", 2064 | "metadata": {}, 2065 | "source": [ 2066 | "+ __xgb 训练__" 2067 | ] 2068 | }, 2069 | { 2070 | "cell_type": "code", 2071 | "execution_count": 27, 2072 | "metadata": {}, 2073 | "outputs": [ 2074 | { 2075 | "name": "stdout", 2076 | "output_type": "stream", 2077 | "text": [ 2078 | "fold n°1\n", 2079 | "len trn_idx 1244\n", 2080 | "[0]\ttrain-rmse:0.41223\tvalid_data-rmse:0.414495\n", 2081 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", 2082 | "\n", 2083 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n", 2084 | "[200]\ttrain-rmse:0.00937\tvalid_data-rmse:0.012154\n", 2085 | "[400]\ttrain-rmse:0.00738\tvalid_data-rmse:0.012149\n", 2086 | "Stopping. 
Best iteration:\n", 2087 | "[249]\ttrain-rmse:0.008639\tvalid_data-rmse:0.011963\n", 2088 | "\n", 2089 | "fold n°2\n", 2090 | "len trn_idx 1244\n", 2091 | "[0]\ttrain-rmse:0.412554\tvalid_data-rmse:0.411489\n", 2092 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", 2093 | "\n", 2094 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n", 2095 | "[200]\ttrain-rmse:0.009572\tvalid_data-rmse:0.010408\n", 2096 | "[400]\ttrain-rmse:0.007492\tvalid_data-rmse:0.010257\n", 2097 | "Stopping. Best iteration:\n", 2098 | "[321]\ttrain-rmse:0.008054\tvalid_data-rmse:0.010154\n", 2099 | "\n", 2100 | "fold n°3\n", 2101 | "len trn_idx 1244\n", 2102 | "[0]\ttrain-rmse:0.412511\tvalid_data-rmse:0.412077\n", 2103 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", 2104 | "\n", 2105 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n", 2106 | "[200]\ttrain-rmse:0.009498\tvalid_data-rmse:0.012291\n", 2107 | "[400]\ttrain-rmse:0.007413\tvalid_data-rmse:0.011734\n", 2108 | "[600]\ttrain-rmse:0.00631\tvalid_data-rmse:0.011799\n", 2109 | "Stopping. Best iteration:\n", 2110 | "[481]\ttrain-rmse:0.006915\tvalid_data-rmse:0.011714\n", 2111 | "\n", 2112 | "fold n°4\n", 2113 | "len trn_idx 1245\n", 2114 | "[0]\ttrain-rmse:0.412355\tvalid_data-rmse:0.413351\n", 2115 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", 2116 | "\n", 2117 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n", 2118 | "[200]\ttrain-rmse:0.009632\tvalid_data-rmse:0.010239\n", 2119 | "[400]\ttrain-rmse:0.007535\tvalid_data-rmse:0.009976\n", 2120 | "Stopping. Best iteration:\n", 2121 | "[361]\ttrain-rmse:0.00781\tvalid_data-rmse:0.009915\n", 2122 | "\n", 2123 | "fold n°5\n", 2124 | "len trn_idx 1245\n", 2125 | "[0]\ttrain-rmse:0.412679\tvalid_data-rmse:0.410576\n", 2126 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", 2127 | "\n", 2128 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n", 2129 | "[200]\ttrain-rmse:0.009628\tvalid_data-rmse:0.010779\n", 2130 | "[400]\ttrain-rmse:0.007522\tvalid_data-rmse:0.010314\n", 2131 | "[600]\ttrain-rmse:0.006444\tvalid_data-rmse:0.010495\n", 2132 | "Stopping. Best iteration:\n", 2133 | "[406]\ttrain-rmse:0.007485\tvalid_data-rmse:0.010309\n", 2134 | "\n", 2135 | "fold n°6\n", 2136 | "len trn_idx 1245\n", 2137 | "[0]\ttrain-rmse:0.412697\tvalid_data-rmse:0.410264\n", 2138 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", 2139 | "\n", 2140 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n", 2141 | "[200]\ttrain-rmse:0.009618\tvalid_data-rmse:0.01148\n", 2142 | "[400]\ttrain-rmse:0.007507\tvalid_data-rmse:0.011228\n", 2143 | "Stopping. Best iteration:\n", 2144 | "[298]\ttrain-rmse:0.008307\tvalid_data-rmse:0.011154\n", 2145 | "\n", 2146 | "fold n°7\n", 2147 | "len trn_idx 1245\n", 2148 | "[0]\ttrain-rmse:0.412244\tvalid_data-rmse:0.414249\n", 2149 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", 2150 | "\n", 2151 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n", 2152 | "[200]\ttrain-rmse:0.009547\tvalid_data-rmse:0.010669\n", 2153 | "[400]\ttrain-rmse:0.007386\tvalid_data-rmse:0.010475\n", 2154 | "Stopping. 
Best iteration:\n", 2155 | "[276]\ttrain-rmse:0.008424\tvalid_data-rmse:0.010363\n", 2156 | "\n", 2157 | "fold n°8\n", 2158 | "len trn_idx 1245\n", 2159 | "[0]\ttrain-rmse:0.41229\tvalid_data-rmse:0.414101\n", 2160 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", 2161 | "\n", 2162 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n", 2163 | "[200]\ttrain-rmse:0.009677\tvalid_data-rmse:0.009724\n", 2164 | "[400]\ttrain-rmse:0.007648\tvalid_data-rmse:0.00927\n", 2165 | "[600]\ttrain-rmse:0.006596\tvalid_data-rmse:0.009145\n", 2166 | "[800]\ttrain-rmse:0.005798\tvalid_data-rmse:0.009118\n", 2167 | "[1000]\ttrain-rmse:0.005144\tvalid_data-rmse:0.009119\n", 2168 | "Stopping. Best iteration:\n", 2169 | "[934]\ttrain-rmse:0.00535\tvalid_data-rmse:0.009081\n", 2170 | "\n", 2171 | "fold n°9\n", 2172 | "len trn_idx 1245\n", 2173 | "[0]\ttrain-rmse:0.412592\tvalid_data-rmse:0.411176\n", 2174 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", 2175 | "\n", 2176 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n", 2177 | "[200]\ttrain-rmse:0.009536\tvalid_data-rmse:0.01135\n", 2178 | "[400]\ttrain-rmse:0.007511\tvalid_data-rmse:0.010768\n", 2179 | "[600]\ttrain-rmse:0.006424\tvalid_data-rmse:0.010755\n", 2180 | "Stopping. Best iteration:\n", 2181 | "[534]\ttrain-rmse:0.006738\tvalid_data-rmse:0.010726\n", 2182 | "\n", 2183 | "fold n°10\n", 2184 | "len trn_idx 1245\n", 2185 | "[0]\ttrain-rmse:0.412429\tvalid_data-rmse:0.412775\n", 2186 | "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", 2187 | "\n", 2188 | "Will train until valid_data-rmse hasn't improved in 200 rounds.\n", 2189 | "[200]\ttrain-rmse:0.009541\tvalid_data-rmse:0.011636\n", 2190 | "[400]\ttrain-rmse:0.007462\tvalid_data-rmse:0.010853\n", 2191 | "[600]\ttrain-rmse:0.006416\tvalid_data-rmse:0.010884\n", 2192 | "Stopping. 
Best iteration:\n", 2193 | "[409]\ttrain-rmse:0.007404\tvalid_data-rmse:0.010834\n", 2194 | "\n", 2195 | "CV score: 0.00011350\n" 2196 | ] 2197 | } 2198 | ], 2199 | "source": [ 2200 | "X_train, y_train, X_test, test_idx = split_data(pipe_data, target_name='target')\n", 2201 | "min_max_scaler = MinMaxScaler()\n", 2202 | "X_train = min_max_scaler.fit_transform(X_train)\n", 2203 | "X_test = min_max_scaler.transform(X_test)\n", 2204 | "oof_xgb, predictions_xgb = xgb_predict(X_train, y_train, X_test, params=xgb_best_params, verbose_eval=200) #" 2205 | ] 2206 | }, 2207 | { 2208 | "cell_type": "markdown", 2209 | "metadata": {}, 2210 | "source": [ 2211 | "## 模型融合" 2212 | ] 2213 | }, 2214 | { 2215 | "cell_type": "code", 2216 | "execution_count": 28, 2217 | "metadata": {}, 2218 | "outputs": [ 2219 | { 2220 | "name": "stdout", 2221 | "output_type": "stream", 2222 | "text": [ 2223 | "fold 0\n", 2224 | "fold 1\n", 2225 | "fold 2\n", 2226 | "fold 3\n", 2227 | "fold 4\n", 2228 | "fold 5\n", 2229 | "fold 6\n", 2230 | "fold 7\n", 2231 | "fold 8\n", 2232 | "fold 9\n", 2233 | "0.00011094962448042872\n" 2234 | ] 2235 | } 2236 | ], 2237 | "source": [ 2238 | "# stacking\n", 2239 | "train_stack = np.vstack([oof_lgb, oof_xgb]).transpose()\n", 2240 | "test_stack = np.vstack([predictions_lgb, predictions_xgb]).transpose()\n", 2241 | "\n", 2242 | "folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=4590)\n", 2243 | "oof_stack = np.zeros(train_stack.shape[0])\n", 2244 | "predictions = np.zeros(test_stack.shape[0])\n", 2245 | "\n", 2246 | "for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack, y_train)):\n", 2247 | " print(\"fold {}\".format(fold_))\n", 2248 | " trn_data, trn_y = train_stack[trn_idx], y_train[trn_idx]\n", 2249 | " val_data, val_y = train_stack[val_idx], y_train[val_idx]\n", 2250 | " \n", 2251 | " clf_3 = BayesianRidge()\n", 2252 | " clf_3.fit(trn_data, trn_y)\n", 2253 | " \n", 2254 | " oof_stack[val_idx] = clf_3.predict(val_data)\n", 2255 | " predictions += clf_3.predict(test_stack) / 10\n", 2256 | " \n", 2257 | "final_score = mean_squared_error(y_train, oof_stack)\n", 2258 | "print(final_score)" 2259 | ] 2260 | }, 2261 | { 2262 | "cell_type": "markdown", 2263 | "metadata": {}, 2264 | "source": [ 2265 | "# 生成提交结果" 2266 | ] 2267 | }, 2268 | { 2269 | "cell_type": "markdown", 2270 | "metadata": {}, 2271 | "source": [ 2272 | "+ __生成文件名__" 2273 | ] 2274 | }, 2275 | { 2276 | "cell_type": "code", 2277 | "execution_count": 29, 2278 | "metadata": {}, 2279 | "outputs": [ 2280 | { 2281 | "name": "stdout", 2282 | "output_type": "stream", 2283 | "text": [ 2284 | "result/testB_lgb_xgb_11095_8features_20190122_10:36:07.csv\n" 2285 | ] 2286 | } 2287 | ], 2288 | "source": [ 2289 | "import time\n", 2290 | "model_name = \"lgb_xgb\"\n", 2291 | "file_name = 'result/{}_{}_{:5.0f}_{}features_{}.csv'.format(\n", 2292 | " test_name, \n", 2293 | " model_name,\n", 2294 | " final_score*1e8, X_train.shape[1],\n", 2295 | " time.strftime('%Y%m%d_%H:%M:%S',time.localtime(time.time())))\n", 2296 | " \n", 2297 | "print(file_name)" 2298 | ] 2299 | }, 2300 | { 2301 | "cell_type": "markdown", 2302 | "metadata": {}, 2303 | "source": [ 2304 | "+ __写入文件__" 2305 | ] 2306 | }, 2307 | { 2308 | "cell_type": "code", 2309 | "execution_count": 30, 2310 | "metadata": {}, 2311 | "outputs": [], 2312 | "source": [ 2313 | "sub_df = pd.read_csv(test_file_name, encoding = 'gb18030')\n", 2314 | "sub_df = sub_df[['样本id', 'A1']]\n", 2315 | "sub_df['A1'] = predictions\n", 2316 | "sub_df['A1'] = sub_df['A1'].apply(lambda 
x:round(x, 3))\n", 2317 | "sub_df.to_csv(file_name, header=0, index=0) " 2318 | ] 2319 | } 2320 | ], 2321 | "metadata": { 2322 | "kernelspec": { 2323 | "display_name": "Python 3", 2324 | "language": "python", 2325 | "name": "python3" 2326 | }, 2327 | "language_info": { 2328 | "codemirror_mode": { 2329 | "name": "ipython", 2330 | "version": 3 2331 | }, 2332 | "file_extension": ".py", 2333 | "mimetype": "text/x-python", 2334 | "name": "python", 2335 | "nbconvert_exporter": "python", 2336 | "pygments_lexer": "ipython3", 2337 | "version": "3.6.7" 2338 | }, 2339 | "toc": { 2340 | "base_numbering": 1, 2341 | "nav_menu": {}, 2342 | "number_sections": true, 2343 | "sideBar": true, 2344 | "skip_h1_title": false, 2345 | "title_cell": "Table of Contents", 2346 | "title_sidebar": "Contents", 2347 | "toc_cell": false, 2348 | "toc_position": {}, 2349 | "toc_section_display": true, 2350 | "toc_window_display": true 2351 | } 2352 | }, 2353 | "nbformat": 4, 2354 | "nbformat_minor": 2 2355 | } 2356 | -------------------------------------------------------------------------------- /初赛/津南数字制造算法挑战赛+20+Drop/readme.md: -------------------------------------------------------------------------------- 1 | # 环境要求 2 | 3 | ## 系统 4 | 5 | ubuntu 16.04 6 | 7 | ## python 版本 8 | python 3.5 或 python 3.6 9 | 10 | ## 需要的库 11 | 12 | numpy, pandas, lightgbm, xgboost, sklearn 13 | 14 | 15 | 16 | # 运行说明 17 | 18 | 生成 B 榜预测结果文件: 19 | python3 run.py --test_type=B 20 | 21 | 生成 C 榜预测结果文件: 22 | python3 run.py --test_type=C 23 | 24 | # 其他说明 25 | 26 | 运行时间大概 6 分钟左右(不定) 27 | 28 | 要求一级目录下存放 data 文件夹, 内部有 A 榜数据: jinnan_round1_train_20181227.csv, B 榜数据: jinnan_round1_testB_20190121.csv, C 榜数据: jinnan_round1_test_20190121.csv. 29 | 30 | 生成结果在一级目录, B 榜名为 submit_B.csv, C 榜名为 submit_C.csv. 31 | 32 | # 最后 33 | 34 | 提前祝您春节快乐~ 35 | 如果有问题希望您联系我们 36 | 37 | 陶亚凡: 765370813@qq.com 38 | Blue: cy_1995@qq.com -------------------------------------------------------------------------------- /初赛/津南数字制造算法挑战赛+20+Drop/run.py: -------------------------------------------------------------------------------- 1 | 2 | # In[1]: 3 | 4 | 5 | import numpy as np 6 | import pandas as pd 7 | import lightgbm as lgb 8 | import xgboost as xgb 9 | import warnings 10 | import re 11 | import argparse, sys 12 | 13 | # In[2]: 14 | 15 | 16 | from sklearn.model_selection import KFold, RepeatedKFold 17 | from sklearn.pipeline import Pipeline 18 | from sklearn.base import BaseEstimator, TransformerMixin 19 | from sklearn.linear_model import BayesianRidge 20 | from sklearn.metrics import mean_squared_error, mean_absolute_error 21 | from sklearn.preprocessing import MinMaxScaler 22 | from sklearn.model_selection import ParameterGrid 23 | 24 | 25 | # In[3]: 26 | 27 | 28 | warnings.simplefilter(action='ignore', category=FutureWarning) 29 | warnings.filterwarnings("ignore") 30 | pd.set_option('display.max_columns', None) 31 | pd.set_option('max_colwidth', 100) 32 | 33 | 34 | 35 | 36 | # # 数据清洗 37 | 38 | # ## 删除缺失率高的特征 39 | # 40 | # + __删除缺失值大于 th_high 的值__ 41 | # + __缺失值在 th_low 和 th_high 之间的特征根据是否缺失增加新特征__ 42 | # 43 | # 如 B10 缺失较高,增加新特征 B10_null,如果缺失为1,否则为0 44 | 45 | # In[6]: 46 | 47 | 48 | class del_nan_feature(BaseEstimator, TransformerMixin): 49 | 50 | def __init__(self, th_high=0.85, th_low=0.02): 51 | self.th_high = th_high 52 | self.th_low = th_low 53 | 54 | def fit(self, X, y=None): 55 | return self 56 | 57 | def transform(self, X): 58 | print('-'*30, ' '*5, 'del_nan_feature', ' '*5, '-'*30, '\n') 59 | print("shape before process = {}".format(X.shape)) 60 | 61 | # 删除高缺失率特征 62 | 
X.dropna(axis=1, thresh=(1-self.th_high)*X.shape[0], inplace=True) 63 | 64 | 65 | # 缺失率较高,增加新特征 66 | for col in X.columns: 67 | if col == 'target': 68 | continue 69 | 70 | miss_rate = X[col].isnull().sum()/ X.shape[0] 71 | if miss_rate > self.th_low: 72 | print("Missing rate of {} is {:.3f} exceed {}, adding new feature {}". 73 | format(col, miss_rate, self.th_low, col+'_null')) 74 | X[col+'_null'] = 0 75 | X.loc[X[pd.isnull(X[col])].index, [col+'_null']] = 1 76 | print("shape = {}".format(X.shape)) 77 | 78 | return X 79 | 80 | 81 | # ## 处理字符时间(段) 82 | 83 | # In[7]: 84 | 85 | 86 | # 处理时间 87 | def timeTranSecond(t): 88 | try: 89 | h,m,s=t.split(":") 90 | except: 91 | 92 | if t=='1900/1/9 7:00': 93 | return 7*3600/3600 94 | elif t=='1900/1/1 2:30': 95 | return (2*3600+30*60)/3600 96 | elif pd.isnull(t): 97 | return np.nan 98 | else: 99 | return 0 100 | 101 | try: 102 | tm = (int(h)*3600+int(m)*60+int(s))/3600 103 | except: 104 | return (30*60)/3600 105 | 106 | return tm 107 | 108 | 109 | # In[8]: 110 | 111 | 112 | # 处理时间差 113 | def getDuration(se): 114 | try: 115 | sh,sm,eh,em=re.findall(r"\d+",se) 116 | # print("sh, sm, eh, em = {}, {}, {}, {}".format(sh, em, eh, em)) 117 | except: 118 | if pd.isnull(se): 119 | return np.nan, np.nan, np.nan 120 | 121 | try: 122 | t_start = (int(sh)*3600 + int(sm)*60)/3600 123 | t_end = (int(eh)*3600 + int(em)*60)/3600 124 | 125 | if t_start > t_end: 126 | tm = t_end - t_start + 24 127 | else: 128 | tm = t_end - t_start 129 | except: 130 | if se=='19:-20:05': 131 | return 19, 20, 1 132 | elif se=='15:00-1600': 133 | return 15, 16, 1 134 | else: 135 | print("se = {}".format(se)) 136 | 137 | 138 | return t_start, t_end, tm 139 | 140 | 141 | # In[9]: 142 | 143 | 144 | class handle_time_str(BaseEstimator, TransformerMixin): 145 | 146 | def __init__(self): 147 | pass 148 | 149 | def fit(self, X, y=None): 150 | return self 151 | 152 | def transform(self, X): 153 | print('-'*30, ' '*5, 'handle_time_str', ' '*5, '-'*30, '\n') 154 | 155 | for f in ['A5','A7','A9','A11','A14','A16','A24','A26','B5','B7']: 156 | try: 157 | X[f] = X[f].apply(timeTranSecond) 158 | except: 159 | print(f,'应该在前面被删除了!') 160 | 161 | 162 | for f in ['A20','A28','B4','B9','B10','B11']: 163 | try: 164 | start_end_diff = X[f].apply(getDuration) 165 | 166 | X[f+'_start'] = start_end_diff.apply(lambda x: x[0]) 167 | X[f+'_end'] = start_end_diff.apply(lambda x: x[1]) 168 | X[f] = start_end_diff.apply(lambda x: x[2]) 169 | 170 | except: 171 | print(f,'应该在前面被删除了!') 172 | return X 173 | 174 | 175 | # ## 计算时间差 176 | 177 | # In[ ]: 178 | 179 | 180 | 181 | 182 | 183 | # In[10]: 184 | 185 | 186 | def t_start_t_end(t): 187 | if pd.isnull(t[0]) or pd.isnull(t[1]): 188 | # print("t_start = {}, t_end = {}, id = {}".format(t[0], t[1], t[2])) 189 | return np.nan 190 | 191 | if t[1] < t[0]: 192 | t[1] += 24 193 | 194 | dt = t[1] - t[0] 195 | 196 | if(dt > 24 or dt < 0): 197 | # print("dt error, t_start = {}, t_end = {}, id = {}".format(t[0], t[1], t[2])) 198 | return np.nan 199 | 200 | return dt 201 | 202 | 203 | # In[11]: 204 | 205 | 206 | class calc_time_diff(BaseEstimator, TransformerMixin): 207 | def __init__(self): 208 | pass 209 | 210 | def fit(self, X, y=None): 211 | return self 212 | 213 | def transform(self, X): 214 | print('-'*30, ' '*5, 'calc_time_diff', ' '*5, '-'*30, '\n') 215 | 216 | # t_start 为时间的开始, tn 为中间的时间,减去 t_start 得到时间差 217 | t_start = ['A9', 'A24', 'B5'] 218 | tn = {'A9':['A11', 'A14', 'A16'], 'A24': ['A26'], 'B5':['B7']} 219 | 220 | # 计算时间差 221 | for t_s in t_start: 222 | for t_e in 
tn[t_s]: 223 | X[t_e+'-'+t_s] = X[[t_s, t_e, 'target']].apply(t_start_t_end, axis=1) 224 | 225 | # 所有结果保留 3 位小数 226 | X = X.apply(lambda x: round(x, 3)) 227 | 228 | print("shape = {}".format(X.shape)) 229 | 230 | return X 231 | 232 | 233 | # ## 处理异常值 234 | 235 | # + __单一类别个数小于 threshold 的值视为异常值, 改为 nan__ 236 | 237 | # In[12]: 238 | 239 | 240 | class handle_outliers(BaseEstimator, TransformerMixin): 241 | 242 | def __init__(self, threshold=2): 243 | self.th = threshold 244 | 245 | def fit(self, X, y=None): 246 | return self 247 | 248 | def transform(self, X): 249 | print('-'*30, ' '*5, 'handle_outliers', ' '*5, '-'*30, '\n') 250 | category_col = [col for col in X if col not in ['id', 'target']] 251 | for col in category_col: 252 | label = X[col].value_counts(dropna=False).index.tolist() 253 | for i, num in enumerate(X[col].value_counts(dropna=False).values): 254 | if num <= self.th: 255 | # print("Number of label {} in feature {} is {}".format(label[i], col, num)) 256 | X.loc[X[col]==label[i], [col]] = np.nan 257 | 258 | print("shape = {}".format(X.shape)) 259 | return X 260 | 261 | 262 | # ## 删除单一类别占比过大特征 263 | 264 | # In[13]: 265 | 266 | 267 | class del_single_feature(BaseEstimator, TransformerMixin): 268 | 269 | def __init__(self, threshold=0.98): 270 | # 删除单一类别占比大于 threshold 的特征 271 | self.th = threshold 272 | 273 | def fit(self, X, y=None): 274 | return self 275 | 276 | def transform(self, X): 277 | print('-'*30, ' '*5, 'del_single_feature', ' '*5, '-'*30, '\n') 278 | category_col = [col for col in X if col not in ['target']] 279 | 280 | for col in category_col: 281 | rate = X[col].value_counts(normalize=True, dropna=False).values[0] 282 | 283 | if rate > self.th: 284 | print("{} 的最大类别占比是 {}, drop it".format(col, rate)) 285 | X.drop(col, axis=1, inplace=True) 286 | 287 | print("shape = {}".format(X.shape)) 288 | return X 289 | 290 | 291 | # # 特征工程 292 | 293 | # ## 获得训练集与测试集 294 | 295 | # In[14]: 296 | 297 | 298 | def split_data(pipe_data, target_name='target'): 299 | 300 | # 特征列名 301 | category_col = [col for col in pipe_data if col not in ['target',target_name]] 302 | 303 | # 训练、测试行索引 304 | train_idx = pipe_data[np.logical_not(pd.isnull(pipe_data[target_name]))].index 305 | test_idx = pipe_data[pd.isnull(pipe_data[target_name])].index 306 | 307 | # 获得 train、test 数据 308 | X_train = pipe_data.loc[train_idx, category_col].values.astype(np.float64) 309 | y_train = np.squeeze(pipe_data.loc[train_idx, [target_name]].values.astype(np.float64)) 310 | X_test = pipe_data.loc[test_idx, category_col].values.astype(np.float64) 311 | 312 | return X_train, y_train, X_test, test_idx 313 | 314 | 315 | # ## xgb(用于特征 nan 值预测) 316 | 317 | # In[15]: 318 | 319 | 320 | ##### xgb 321 | def xgb_predict(X_train, y_train, X_test, params=None, verbose_eval=100): 322 | 323 | if params == None: 324 | xgb_params = {'eta': 0.05, 'max_depth': 10, 'subsample': 0.8, 'colsample_bytree': 0.8, 325 | 'objective': 'reg:linear', 'eval_metric': 'rmse', 'silent': True, 'nthread': 4} 326 | else: 327 | xgb_params = params 328 | 329 | folds = KFold(n_splits=10, shuffle=True, random_state=2018) 330 | oof_xgb = np.zeros(len(X_train)) 331 | predictions_xgb = np.zeros(len(X_test)) 332 | 333 | for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)): 334 | if(verbose_eval): 335 | print("fold n°{}".format(fold_+1)) 336 | print("len trn_idx {}".format(len(trn_idx))) 337 | 338 | trn_data = xgb.DMatrix(X_train[trn_idx], y_train[trn_idx]) 339 | val_data = xgb.DMatrix(X_train[val_idx], y_train[val_idx]) 340 | 341 | 
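        # 补充说明: xgboost 以 watchlist 中最后一项 (val_data, 'valid_data') 作为早停依据,
        # 验证集 rmse 连续 200 轮 (early_stopping_rounds) 无提升即停止训练,
        # 预测时再通过 best_ntree_limit 回退到验证误差最低的那一轮。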
watchlist = [(trn_data, 'train'), (val_data, 'valid_data')] 342 | clf = xgb.train(dtrain=trn_data, 343 | num_boost_round=20000, 344 | evals=watchlist, 345 | early_stopping_rounds=200, 346 | verbose_eval=verbose_eval, 347 | params=xgb_params) 348 | 349 | 350 | oof_xgb[val_idx] = clf.predict(xgb.DMatrix(X_train[val_idx]), ntree_limit=clf.best_ntree_limit) 351 | predictions_xgb += clf.predict(xgb.DMatrix(X_test), ntree_limit=clf.best_ntree_limit) / folds.n_splits 352 | 353 | if(verbose_eval): 354 | print("CV score: {:<8.8f}".format(mean_squared_error(oof_xgb, y_train))) 355 | return oof_xgb, predictions_xgb 356 | 357 | 358 | # ## 根据 B14 构建新特征 359 | 360 | # In[16]: 361 | 362 | 363 | class add_new_features(BaseEstimator, TransformerMixin): 364 | 365 | def __init__(self): 366 | pass 367 | 368 | def fit(self, X, y=None): 369 | return self 370 | 371 | def transform(self, X): 372 | print('-'*30, ' '*5, 'add_new_features', ' '*5, '-'*30, '\n') 373 | 374 | # 经过测试,只有 B14 / B12 有用 375 | 376 | # X['B14/A1'] = X['B14'] / X['A1'] 377 | # X['B14/A3'] = X['B14'] / X['A3'] 378 | # X['B14/A4'] = X['B14'] / X['A4'] 379 | # X['B14/A19'] = X['B14'] / X['A19'] 380 | # X['B14/B1'] = X['B14'] / X['B1'] 381 | # X['B14/B9'] = X['B14'] / X['B9'] 382 | 383 | X['B14/B12'] = X['B14'] / X['B12'] 384 | 385 | print("shape = {}".format(X.shape)) 386 | return X 387 | 388 | 389 | # ## 选择特征, nan 值填充 390 | # 391 | # + __选择可能有效的特征__ (只是为了加快选择时间) 392 | # 393 | # + __利用其他特征预测 nan,取最近值填充__ 394 | 395 | # In[17]: 396 | 397 | 398 | def get_closest(indexes, predicts): 399 | print("From {}".format(predicts)) 400 | 401 | for i, one in enumerate(predicts): 402 | predicts[i] = indexes[np.argsort(abs(indexes - one))[0]] 403 | 404 | print("To {}".format(predicts)) 405 | return predicts 406 | 407 | 408 | def value_select_eval(pipe_data, selected_features): 409 | 410 | # 经过多次测试, 只选择可能是有用的特征 411 | cols_with_nan = [col for col in pipe_data.columns 412 | if pipe_data[col].isnull().sum()>0 and col in selected_features] 413 | 414 | for col in cols_with_nan: 415 | X_train, y_train, X_test, test_idx = split_data(pipe_data, target_name=col) 416 | oof_xgb, predictions_xgb = xgb_predict(X_train, y_train, X_test, verbose_eval=False) 417 | 418 | print("-"*100, end="\n\n") 419 | print("CV normal MAE scores of predicting {} is {}". 
420 | format(col, mean_absolute_error(oof_xgb, y_train)/np.mean(y_train))) 421 | 422 | pipe_data.loc[test_idx, [col]] = get_closest(pipe_data[col].value_counts().index, 423 | predictions_xgb) 424 | 425 | pipe_data = pipe_data[selected_features+['target']] 426 | 427 | return pipe_data 428 | 429 | # pipe_data = value_eval(pipe_data) 430 | 431 | 432 | # In[18]: 433 | 434 | 435 | class selsected_fill_nans(BaseEstimator, TransformerMixin): 436 | 437 | def __init__(self, selected_features = ['A3_null', 'A6', 'A16', 'A25', 'A28', 'A28_end', 438 | 'B5', 'B10_null', 'B11_null', 'B14', 'B14/B12', 'id']): 439 | self.selected_fearutes = selected_features 440 | pass 441 | 442 | def fit(self, X, y=None): 443 | return self 444 | 445 | def transform(self, X): 446 | print('-'*30, ' '*5, 'selsected_fill_nans', ' '*5, '-'*30, '\n') 447 | 448 | X = value_select_eval(X, self.selected_fearutes) 449 | 450 | print("shape = {}".format(X.shape)) 451 | return X 452 | 453 | 454 | # In[19]: 455 | 456 | 457 | def modeling_cross_validation(data): 458 | X_train, y_train, X_test, test_idx = split_data(data, 459 | target_name='target') 460 | oof_xgb, _ = xgb_predict(X_train, y_train, X_test, verbose_eval=False) 461 | print('-'*100, end='\n\n') 462 | return mean_squared_error(oof_xgb, y_train) 463 | 464 | 465 | def featureSelect(data): 466 | 467 | init_cols = [f for f in data.columns if f not in ['target']] 468 | best_cols = init_cols.copy() 469 | best_score = modeling_cross_validation(data[best_cols+['target']]) 470 | print("初始 CV score: {:<8.8f}".format(best_score)) 471 | 472 | for col in init_cols: 473 | best_cols.remove(col) 474 | score = modeling_cross_validation(data[best_cols+['target']]) 475 | print("当前选择特征: {}, CV score: {:<8.8f}, 最佳cv score: {:<8.8f}". 476 | format(col, score, best_score), end=" ") 477 | 478 | if best_score - score > 0.0000004: 479 | best_score = score 480 | print("有效果,删除!!!!") 481 | else: 482 | best_cols.append(col) 483 | print("保留") 484 | 485 | print('-'*100) 486 | print("优化后 CV score: {:<8.8f}".format(best_score)) 487 | return best_cols, best_score 488 | 489 | 490 | # ## 后向选择特征 491 | 492 | # In[20]: 493 | 494 | 495 | class select_feature(BaseEstimator, TransformerMixin): 496 | 497 | def __init__(self, init_features = None): 498 | self.init_features = init_features 499 | pass 500 | 501 | def fit(self, X, y=None): 502 | return self 503 | 504 | def transform(self, X): 505 | print('-'*30, ' '*5, 'select_feature', ' '*5, '-'*30, '\n') 506 | 507 | if self.init_features: 508 | X = X[self.init_features + ['target']] 509 | best_features = self.init_features 510 | else: 511 | best_features = [col for col in X.columns] 512 | 513 | last_feartues = [] 514 | iteration = 0 515 | equal_time = 0 516 | 517 | best_CV = 1 518 | best_CV_feature = [] 519 | 520 | # 打乱顺序,但是使用相同种子,保证每次运行结果相同 521 | np.random.seed(2018) 522 | while True: 523 | print("Iteration = {}\n".format(iteration)) 524 | best_features, score = featureSelect(X[best_features + ['target']]) 525 | 526 | # 保存最优 CV 的参数 527 | if score < best_CV: 528 | best_CV = score 529 | best_CV_feature = best_features 530 | print("Found best score :{}, with features :{}".format(best_CV, best_features)) 531 | 532 | np.random.shuffle(best_features) 533 | print("\nCurrent fearure length = {}".format(len(best_features))) 534 | 535 | # 最终 3 次迭代相同,则终止迭代 536 | if len(best_features) == len(last_feartues): 537 | equal_time += 1 538 | if equal_time == 3: 539 | break 540 | else: 541 | equal_time = 0 542 | 543 | last_feartues = best_features 544 | iteration = iteration + 1 545 | 
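            # 补充说明: 每轮后向选择结束后用固定种子打乱剩余特征的顺序,
            # 使下一轮尝试删除的顺序不同, 以减小删除顺序带来的局部最优影响;
            # 特征数量连续 3 轮不变 (equal_time == 3) 才终止迭代。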
546 | print("\n\n\n") 547 | 548 | return X[best_features + ['target']] 549 | 550 | 551 | # # 训练 552 | 553 | # ## 构建 pipeline, 处理数据 554 | 555 | # In[21]: 556 | 557 | 558 | 559 | 560 | 561 | # ## 自动调参 562 | 563 | # In[22]: 564 | 565 | 566 | def find_best_params(pipe_data, predict_fun, param_grid): 567 | 568 | # 获得 train 和 test, 归一化 569 | X_train, y_train, X_test, test_idx = split_data(pipe_data, target_name='target') 570 | min_max_scaler = MinMaxScaler() 571 | X_train = min_max_scaler.fit_transform(X_train) 572 | X_test = min_max_scaler.transform(X_test) 573 | best_score = 1 574 | 575 | # 遍历所有参数,寻找最优 576 | for params in ParameterGrid(param_grid): 577 | print('-'*100, "\nparams = \n{}\n".format(params)) 578 | 579 | oof, predictions = predict_fun(X_train, y_train, X_test, params=params, verbose_eval=False) 580 | score = mean_squared_error(oof, y_train) 581 | print("CV score: {}, current best score: {}".format(score, best_score)) 582 | 583 | if best_score > score: 584 | print("Found new best score: {}".format(score)) 585 | best_score = score 586 | best_params = params 587 | 588 | 589 | print('\n\nbest params: {}'.format(best_params)) 590 | print('best score: {}'.format(best_score)) 591 | 592 | return best_params 593 | 594 | 595 | # ## lgb 596 | 597 | # In[23]: 598 | 599 | 600 | def lgb_predict(X_train, y_train, X_test, params=None, verbose_eval=100): 601 | 602 | if params == None: 603 | lgb_param = {'num_leaves': 20, 'min_data_in_leaf': 2, 'objective':'regression', 'max_depth': 4, 604 | 'learning_rate': 0.06, "min_child_samples": 3, "boosting": "gbdt", "feature_fraction": 1, 605 | "bagging_freq": 0.7, "bagging_fraction": 1, "bagging_seed": 11, "metric": 'mse', "lambda_l2": 0.003, 606 | "verbosity": -1} 607 | else : 608 | lgb_param = params 609 | 610 | folds = KFold(n_splits=10, shuffle=True, random_state=2018) 611 | oof_lgb = np.zeros(len(X_train)) 612 | predictions_lgb = np.zeros(len(X_test)) 613 | 614 | for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)): 615 | if verbose_eval: 616 | print("fold n°{}".format(fold_+1)) 617 | 618 | trn_data = lgb.Dataset(X_train[trn_idx], y_train[trn_idx]) 619 | val_data = lgb.Dataset(X_train[val_idx], y_train[val_idx]) 620 | 621 | num_round = 10000 622 | clf = lgb.train(lgb_param, trn_data, num_round, valid_sets = [trn_data, val_data], 623 | verbose_eval=verbose_eval, early_stopping_rounds = 100) 624 | 625 | oof_lgb[val_idx] = clf.predict(X_train[val_idx], num_iteration=clf.best_iteration) 626 | 627 | predictions_lgb += clf.predict(X_test, num_iteration=clf.best_iteration) / folds.n_splits 628 | 629 | if verbose_eval: 630 | print("CV score: {:<8.8f}".format(mean_squared_error(oof_lgb, y_train))) 631 | 632 | return oof_lgb, predictions_lgb 633 | 634 | 635 | def init_config(): 636 | parser = argparse.ArgumentParser() 637 | parser.add_argument('--test_type', type=str, 638 | help='Can be B or C, meaning running code with either test B or test C') 639 | args = parser.parse_args() 640 | return args 641 | 642 | 643 | def read_data(train_file_name, test_file_name): 644 | # 读取数据, 改名 645 | train = pd.read_csv(train_file_name, encoding='gb18030') 646 | test = pd.read_csv(test_file_name, encoding='gb18030') 647 | train.rename(columns={'样本id': 'id', '收率': 'target'}, inplace=True) 648 | test.rename(columns={'样本id': 'id', '收率': 'target'}, inplace=True) 649 | target_name = 'target' 650 | 651 | # 存在异常数据,改为 nan 652 | train.loc[1304, 'A25'] = np.nan 653 | train['A25'] = train['A25'].astype(float) 654 | 655 | # 去掉 id 前缀 656 | train['id'] = 
train['id'].apply(lambda x: int(x.split('_')[1])) 657 | test['id'] = test['id'].apply(lambda x: int(x.split('_')[1])) 658 | 659 | train.drop(train[train[target_name] < 0.87].index, inplace=True) 660 | full = pd.concat([train, test], ignore_index=True) 661 | return full 662 | 663 | 664 | def feature_processing(full): 665 | 666 | selected_features = ['A3_null', 'A6', 'A16', 'A25', 'A28', 'A28_end', 667 | 'B5', 'B10_null', 'B11_null', 'B14', 'B14/B12', 'id'] 668 | 669 | pipe = Pipeline([ 670 | ('del_nan_feature', del_nan_feature()), 671 | ('handle_time_str', handle_time_str()), 672 | ('calc_time_diff', calc_time_diff()), 673 | ('Handle_outliers', handle_outliers(2)), 674 | ('del_single_feature', del_single_feature(1)), 675 | ('add_new_features', add_new_features()), 676 | ('selsected_fill_nans', selsected_fill_nans(selected_features)), 677 | ('select_feature', select_feature(selected_features)), 678 | ]) 679 | 680 | pipe_data = pipe.fit_transform(full.copy()) 681 | print(pipe_data.shape) 682 | return pipe_data 683 | 684 | 685 | def train_predict(pipe_data): 686 | 687 | # lgb 688 | param_grid = [ 689 | {'num_leaves': [20], 'min_data_in_leaf': [2, 3], 'objective': ['regression'], 690 | 'max_depth': [3, 4, 5], 'learning_rate': [0.06, 0.12, 0.24], "min_child_samples": [3], 691 | "boosting": ["gbdt"], "feature_fraction": [0.7], "bagging_freq": [1], 692 | "bagging_fraction": [1], "bagging_seed": [11], "metric": ['mse'], 693 | "lambda_l2": [0.0003, 0.001, 0.003], "verbosity": [-1]} 694 | ] 695 | 696 | lgb_best_params = find_best_params(pipe_data, lgb_predict, param_grid) 697 | 698 | X_train, y_train, X_test, test_idx = split_data(pipe_data, target_name='target') 699 | min_max_scaler = MinMaxScaler() 700 | X_train = min_max_scaler.fit_transform(X_train) 701 | X_test = min_max_scaler.transform(X_test) 702 | oof_lgb, predictions_lgb = lgb_predict(X_train, y_train, X_test, params=lgb_best_params, verbose_eval=200) # 703 | 704 | # xgb 705 | param_grid = [ 706 | {'silent': [1], 707 | 'nthread': [4], 708 | 'eval_metric': ['rmse'], 709 | 'eta': [0.03], 710 | 'objective': ['reg:linear'], 711 | 'max_depth': [4, 5, 6], 712 | 'num_round': [1000], 713 | 'subsample': [0.4, 0.6, 0.8, 1], 714 | 'colsample_bytree': [0.7, 0.9, 1], 715 | } 716 | ] 717 | 718 | xgb_best_params = find_best_params(pipe_data, xgb_predict, param_grid) 719 | 720 | X_train, y_train, X_test, test_idx = split_data(pipe_data, target_name='target') 721 | min_max_scaler = MinMaxScaler() 722 | X_train = min_max_scaler.fit_transform(X_train) 723 | X_test = min_max_scaler.transform(X_test) 724 | oof_xgb, predictions_xgb = xgb_predict(X_train, y_train, X_test, params=xgb_best_params, verbose_eval=200) # 725 | 726 | # 模型融合 stacking 727 | train_stack = np.vstack([oof_lgb, oof_xgb]).transpose() 728 | test_stack = np.vstack([predictions_lgb, predictions_xgb]).transpose() 729 | 730 | folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=4590) 731 | oof_stack = np.zeros(train_stack.shape[0]) 732 | predictions = np.zeros(test_stack.shape[0]) 733 | 734 | for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack, y_train)): 735 | print("fold {}".format(fold_)) 736 | trn_data, trn_y = train_stack[trn_idx], y_train[trn_idx] 737 | val_data, val_y = train_stack[val_idx], y_train[val_idx] 738 | 739 | clf_3 = BayesianRidge() 740 | clf_3.fit(trn_data, trn_y) 741 | 742 | oof_stack[val_idx] = clf_3.predict(val_data) 743 | predictions += clf_3.predict(test_stack) / 10 744 | 745 | final_score = mean_squared_error(y_train, oof_stack) 746 | 
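    # 补充说明: final_score 为 stacking 之后整体 oof 预测的均方误差,
    # 与本地 notebook 中用于生成结果文件名的分数是同一指标 (乘 1e8 后取整, 如 11095)。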
print(final_score) 747 | return predictions 748 | 749 | 750 | def gen_submit(test_file_name, result_name, predictions): 751 | # 生成提交结果 752 | sub_df = pd.read_csv(test_file_name, encoding='gb18030') 753 | sub_df = sub_df[['样本id', 'A1']] 754 | sub_df['A1'] = predictions 755 | sub_df['A1'] = sub_df['A1'].apply(lambda x: round(x, 3)) 756 | print("Generating a submit file : {}".format(result_name)) 757 | sub_df.to_csv(result_name, header=0, index=0) 758 | 759 | if __name__ == '__main__': 760 | args = init_config() 761 | print(args, file=sys.stderr) 762 | 763 | if args.test_type in ['B', 'b']: 764 | test_file_name = 'data/jinnan_round1_testB_20190121.csv' 765 | result_name = 'submit_B.csv' 766 | elif args.test_type in ['C', 'c']: 767 | test_file_name = 'data/jinnan_round1_test_20190121.csv' 768 | result_name = 'submit_C.csv' 769 | else: 770 | raise RuntimeError('Need config of test_type, can be only B or C for example: --test_type=B') 771 | 772 | # 设定文件名, 读取文件 773 | train_file_name = 'data/jinnan_round1_train_20181227.csv' 774 | 775 | print("Training file named {} and testing file named {}".format(train_file_name, test_file_name)) 776 | 777 | # read file 778 | full = read_data(train_file_name, test_file_name) 779 | 780 | # feature processing 781 | pipe_data = feature_processing(full) 782 | 783 | # train and predict 784 | predictions = train_predict(pipe_data) 785 | 786 | # generate a submit file 787 | gen_submit(test_file_name, result_name, predictions) 788 | 789 | print("Finished") 790 | -------------------------------------------------------------------------------- /复赛/津南数字制造算法挑战赛+17+Drop/data/optimize.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/taoyafan/jinnan/52035305b2219f6700f2e922825de2d5db1d7abc/复赛/津南数字制造算法挑战赛+17+Drop/data/optimize.csv -------------------------------------------------------------------------------- /复赛/津南数字制造算法挑战赛+17+Drop/main.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | import numpy as np 4 | import pandas as pd 5 | import lightgbm as lgb 6 | import xgboost as xgb 7 | import warnings 8 | import re 9 | import argparse, sys 10 | import pickle 11 | import os 12 | 13 | from sklearn.model_selection import KFold, RepeatedKFold 14 | from sklearn.pipeline import Pipeline 15 | from sklearn.base import BaseEstimator, TransformerMixin 16 | from sklearn.linear_model import BayesianRidge 17 | from sklearn.metrics import mean_squared_error, mean_absolute_error 18 | from sklearn.preprocessing import MinMaxScaler 19 | from sklearn.model_selection import ParameterGrid 20 | 21 | warnings.simplefilter(action='ignore', category=FutureWarning) 22 | warnings.filterwarnings("ignore") 23 | 24 | def init_config(): 25 | parser = argparse.ArgumentParser() 26 | parser.add_argument('--test_type', type=str, 27 | help='Can be B or C, meaning running code with either test B or test C') 28 | args = parser.parse_args() 29 | return args 30 | 31 | 32 | def pkl_load(file_name): 33 | with open(file_name, "rb") as f: 34 | return pickle.load(f) 35 | 36 | 37 | def pkl_save(fname, data, protocol=3): 38 | with open(fname, "wb") as f: 39 | pickle.dump(data, f, protocol) 40 | 41 | 42 | def load_models(): 43 | lgb_models = pkl_load("models/lgb_models.pkl") 44 | xgb_models = pkl_load("models/xgb_models.pkl") 45 | stack_models = pkl_load("models/stack_models.pkl") 46 | min_max_scaler = pkl_load("models/min_max_scaler.pkl") 47 | return lgb_models, xgb_models, 
stack_models, min_max_scaler 48 | 49 | 50 | def read_data(train_file_name, test_file_name): 51 | # 读取数据, 改名 52 | train = pd.read_csv(train_file_name, encoding='gb18030') 53 | test = pd.read_csv(test_file_name, encoding='gb18030') 54 | train.rename(columns={'样本id': 'id', '收率': 'target'}, inplace=True) 55 | test.rename(columns={'样本id': 'id', '收率': 'target'}, inplace=True) 56 | target_name = 'target' 57 | 58 | # 存在异常数据,改为 nan 59 | train.loc[1304, 'A25'] = np.nan 60 | train['A25'] = train['A25'].astype(float) 61 | 62 | # 去掉 id 前缀 63 | train['id'] = train['id'].apply(lambda x: int(x.split('_')[1])) 64 | test['id'] = test['id'].apply(lambda x: int(x.split('_')[1])) 65 | 66 | train.drop(train[train[target_name] < 0.87].index, inplace=True) 67 | _full = pd.concat([train, test], ignore_index=True) 68 | return _full 69 | 70 | 71 | class del_nan_feature(BaseEstimator, TransformerMixin): 72 | 73 | def __init__(self, th_high=0.85, th_low=0.02): 74 | self.th_high = th_high 75 | self.th_low = th_low 76 | 77 | def fit(self, X, y=None): 78 | return self 79 | 80 | def transform(self, X): 81 | print('-' * 30, ' ' * 5, 'del_nan_feature', ' ' * 5, '-' * 30, '\n') 82 | print("shape before process = {}".format(X.shape)) 83 | 84 | # 删除高缺失率特征 85 | X.dropna(axis=1, thresh=(1 - self.th_high) * X.shape[0], inplace=True) 86 | 87 | # 缺失率较高,增加新特征 88 | for col in X.columns: 89 | if col == 'target': 90 | continue 91 | 92 | miss_rate = X[col].isnull().sum() / X.shape[0] 93 | if miss_rate > self.th_low: 94 | print("Missing rate of {} is {:.3f} exceed {}, adding new feature {}". 95 | format(col, miss_rate, self.th_low, col + '_null')) 96 | X[col + '_null'] = 0 97 | X.loc[X[pd.isnull(X[col])].index, [col + '_null']] = 1 98 | print("shape = {}".format(X.shape)) 99 | 100 | return X 101 | 102 | # 处理时间 103 | def timeTranSecond(t): 104 | try: 105 | h,m,s=t.split(":") 106 | except: 107 | 108 | if t=='1900/1/9 7:00': 109 | return 7*3600/3600 110 | elif t=='1900/1/1 2:30': 111 | return (2*3600+30*60)/3600 112 | elif pd.isnull(t): 113 | return np.nan 114 | else: 115 | return 0 116 | 117 | try: 118 | tm = (int(h)*3600+int(m)*60+int(s))/3600 119 | except: 120 | return (30*60)/3600 121 | 122 | return tm 123 | 124 | 125 | # 处理时间差 126 | def getDuration(se): 127 | try: 128 | sh, sm, eh, em = re.findall(r"\d+", se) 129 | # print("sh, sm, eh, em = {}, {}, {}, {}".format(sh, em, eh, em)) 130 | except: 131 | if pd.isnull(se): 132 | return np.nan, np.nan, np.nan 133 | 134 | try: 135 | t_start = (int(sh) * 3600 + int(sm) * 60) / 3600 136 | t_end = (int(eh) * 3600 + int(em) * 60) / 3600 137 | 138 | if t_start > t_end: 139 | tm = t_end - t_start + 24 140 | else: 141 | tm = t_end - t_start 142 | except: 143 | if se == '19:-20:05': 144 | return 19, 20, 1 145 | elif se == '15:00-1600': 146 | return 15, 16, 1 147 | else: 148 | print("se = {}".format(se)) 149 | 150 | return t_start, t_end, tm 151 | 152 | 153 | class handle_time_str(BaseEstimator, TransformerMixin): 154 | 155 | def __init__(self): 156 | pass 157 | 158 | def fit(self, X, y=None): 159 | return self 160 | 161 | def transform(self, X): 162 | print('-' * 30, ' ' * 5, 'handle_time_str', ' ' * 5, '-' * 30, '\n') 163 | 164 | for f in ['A5', 'A7', 'A9', 'A11', 'A14', 'A16', 'A24', 'A26', 'B5', 'B7']: 165 | try: 166 | X[f] = X[f].apply(timeTranSecond) 167 | except: 168 | print(f, '应该在前面被删除了!') 169 | 170 | for f in ['A20', 'A28', 'B4', 'B9', 'B10', 'B11']: 171 | try: 172 | start_end_diff = X[f].apply(getDuration) 173 | 174 | X[f + '_start'] = start_end_diff.apply(lambda x: x[0]) 175 | X[f + '_end'] 
= start_end_diff.apply(lambda x: x[1]) 176 | X[f] = start_end_diff.apply(lambda x: x[2]) 177 | 178 | except: 179 | print(f, '应该在前面被删除了!') 180 | return X 181 | 182 | 183 | def t_start_t_end(t): 184 | if pd.isnull(t[0]) or pd.isnull(t[1]): 185 | # print("t_start = {}, t_end = {}, id = {}".format(t[0], t[1], t[2])) 186 | return np.nan 187 | 188 | if t[1] < t[0]: 189 | t[1] += 24 190 | 191 | dt = t[1] - t[0] 192 | 193 | if (dt > 24 or dt < 0): 194 | # print("dt error, t_start = {}, t_end = {}, id = {}".format(t[0], t[1], t[2])) 195 | return np.nan 196 | 197 | return dt 198 | 199 | 200 | class calc_time_diff(BaseEstimator, TransformerMixin): 201 | def __init__(self): 202 | pass 203 | 204 | def fit(self, X, y=None): 205 | return self 206 | 207 | def transform(self, X): 208 | print('-' * 30, ' ' * 5, 'calc_time_diff', ' ' * 5, '-' * 30, '\n') 209 | 210 | # t_start 为时间的开始, tn 为中间的时间,减去 t_start 得到时间差 211 | t_start = ['A9', 'A24', 'B5'] 212 | tn = {'A9': ['A11', 'A14', 'A16'], 'A24': ['A26'], 'B5': ['B7']} 213 | 214 | # 计算时间差 215 | for t_s in t_start: 216 | for t_e in tn[t_s]: 217 | X[t_e + '-' + t_s] = X[[t_s, t_e, 'target']].apply(t_start_t_end, axis=1) 218 | 219 | # 所有结果保留 3 位小数 220 | X = X.apply(lambda x: round(x, 3)) 221 | 222 | print("shape = {}".format(X.shape)) 223 | 224 | return X 225 | 226 | 227 | class handle_outliers(BaseEstimator, TransformerMixin): 228 | 229 | def __init__(self, threshold=2): 230 | self.th = threshold 231 | 232 | def fit(self, X, y=None): 233 | return self 234 | 235 | def transform(self, X): 236 | print('-' * 30, ' ' * 5, 'handle_outliers', ' ' * 5, '-' * 30, '\n') 237 | category_col = [col for col in X if col not in ['id', 'target']] 238 | for col in category_col: 239 | label = X[col].value_counts(dropna=False).index.tolist() 240 | for i, num in enumerate(X[col].value_counts(dropna=False).values): 241 | if num <= self.th: 242 | # print("Number of label {} in feature {} is {}".format(label[i], col, num)) 243 | X.loc[X[col] == label[i], [col]] = np.nan 244 | 245 | print("shape = {}".format(X.shape)) 246 | return X 247 | 248 | 249 | def split_data(pipe_data, target_name='target', unused_feature=[], min_max_scaler=None): 250 | 251 | # 特征列名 252 | category_col = [col for col in pipe_data if col not in ['target'] + [target_name] + unused_feature] 253 | 254 | # 训练、测试行索引 255 | # 训练集只包括存在 target 和 target_name 的数据 256 | train_idx = pipe_data[np.logical_not( 257 | np.logical_or(pd.isnull(pipe_data[target_name]), pd.isnull(pipe_data['target'])) 258 | )].index 259 | 260 | test_idx = pipe_data[pd.isnull(pipe_data[target_name])].index 261 | 262 | # 获得 train、test 数据 263 | X_train = pipe_data.loc[train_idx, category_col].values.astype(np.float64) 264 | y_train = np.squeeze(pipe_data.loc[train_idx, [target_name]].values.astype(np.float64)) 265 | X_test = pipe_data.loc[test_idx, category_col].values.astype(np.float64) 266 | 267 | # 归一化 268 | if (min_max_scaler == None): 269 | min_max_scaler = MinMaxScaler() 270 | X_train = min_max_scaler.fit_transform(X_train) 271 | else: 272 | X_train = min_max_scaler.transform(X_train) 273 | X_test = min_max_scaler.transform(X_test) 274 | 275 | return X_train, y_train, X_test, test_idx, min_max_scaler 276 | 277 | 278 | ##### xgb 279 | def xgb_predict(X_train, y_train, X_test, params=None, verbose_eval=100): 280 | if params == None: 281 | xgb_params = {'eta': 0.05, 'max_depth': 10, 'subsample': 0.8, 'colsample_bytree': 0.8, 282 | 'objective': 'reg:linear', 'eval_metric': 'rmse', 'silent': True, 'nthread': 4} 283 | else: 284 | xgb_params = params 285 | 
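    # 补充说明: 复赛版与初赛 run.py 的主要区别是此处会把每一折训练出的模型
    # 存入 xgb_models 并一并返回, 之后由 train_models() 统一 pickle 到 models/ 目录,
    # 预测阶段直接加载复用, 无需重新训练。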
286 | folds = KFold(n_splits=10, shuffle=True, random_state=2018) 287 | oof_xgb = np.zeros(len(X_train)) 288 | predictions_xgb = np.zeros(len(X_test)) 289 | xgb_models = [] 290 | 291 | for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)): 292 | if (verbose_eval): 293 | print("fold n°{}".format(fold_ + 1)) 294 | print("len trn_idx {}".format(len(trn_idx))) 295 | 296 | trn_data = xgb.DMatrix(X_train[trn_idx], y_train[trn_idx]) 297 | val_data = xgb.DMatrix(X_train[val_idx], y_train[val_idx]) 298 | 299 | watchlist = [(trn_data, 'train'), (val_data, 'valid_data')] 300 | clf = xgb.train(dtrain=trn_data, 301 | num_boost_round=20000, 302 | evals=watchlist, 303 | early_stopping_rounds=200, 304 | verbose_eval=verbose_eval, 305 | params=xgb_params) 306 | 307 | xgb_models.append(clf) 308 | oof_xgb[val_idx] = clf.predict(xgb.DMatrix(X_train[val_idx]), ntree_limit=clf.best_ntree_limit) 309 | predictions_xgb += clf.predict(xgb.DMatrix(X_test), ntree_limit=clf.best_ntree_limit) / folds.n_splits 310 | 311 | if (verbose_eval): 312 | print("CV score: {:<8.8f}".format(mean_squared_error(oof_xgb, y_train))) 313 | return oof_xgb, predictions_xgb, xgb_models 314 | 315 | 316 | def lgb_predict(X_train, y_train, X_test, params=None, verbose_eval=100): 317 | 318 | if params == None: 319 | lgb_param = {'num_leaves': 20, 'min_data_in_leaf': 2, 'objective': 'regression', 'max_depth': 5, 320 | 'learning_rate': 0.24, "min_child_samples": 3, "boosting": "gbdt", "feature_fraction": 0.7, 321 | "bagging_freq": 1, "bagging_fraction": 1, "bagging_seed": 11, "metric": 'mse', "lambda_l2": 0.003, 322 | "verbosity": -1} 323 | else: 324 | lgb_param = params 325 | 326 | folds = KFold(n_splits=10, shuffle=True, random_state=2018) 327 | oof_lgb = np.zeros(len(X_train)) 328 | predictions_lgb = np.zeros(len(X_test)) 329 | lgb_models = [] 330 | 331 | for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)): 332 | if verbose_eval: 333 | print("fold n°{}".format(fold_ + 1)) 334 | 335 | trn_data = lgb.Dataset(X_train[trn_idx], y_train[trn_idx]) 336 | val_data = lgb.Dataset(X_train[val_idx], y_train[val_idx]) 337 | 338 | num_round = 10000 339 | clf = lgb.train(lgb_param, trn_data, num_round, valid_sets=[trn_data, val_data], 340 | verbose_eval=verbose_eval, early_stopping_rounds=100) 341 | 342 | lgb_models.append(clf) 343 | oof_lgb[val_idx] = clf.predict(X_train[val_idx], num_iteration=clf.best_iteration) 344 | predictions_lgb += clf.predict(X_test, num_iteration=clf.best_iteration) / folds.n_splits 345 | 346 | if verbose_eval: 347 | print("CV score: {:<8.8f}".format(mean_squared_error(oof_lgb, y_train))) 348 | 349 | return oof_lgb, predictions_lgb, lgb_models 350 | 351 | 352 | class add_new_features(BaseEstimator, TransformerMixin): 353 | 354 | def __init__(self): 355 | pass 356 | 357 | def fit(self, X, y=None): 358 | return self 359 | 360 | def transform(self, X): 361 | print('-' * 30, ' ' * 5, 'add_new_features', ' ' * 5, '-' * 30, '\n') 362 | 363 | # 经过测试,只有 B14 / B12 有用 364 | 365 | # X['B14/A1'] = X['B14'] / X['A1'] 366 | # X['B14/A3'] = X['B14'] / X['A3'] 367 | # X['B14/A4'] = X['B14'] / X['A4'] 368 | # X['B14/A19'] = X['B14'] / X['A19'] 369 | # X['B14/B1'] = X['B14'] / X['B1'] 370 | # X['B14/B9'] = X['B14'] / X['B9'] 371 | 372 | X['B14/B12'] = X['B14'] / X['B12'] 373 | 374 | print("shape = {}".format(X.shape)) 375 | return X 376 | 377 | 378 | def predict(data, lgb_models, xgb_models, stack_models, min_max_scaler): 379 | 380 | def _predict(X): 381 | lgb_result = 0 382 | xgb_result = 0 383 | 
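        # 补充说明: 这里是两级融合: 先分别对 lgb / xgb 各 10 折模型的预测取平均,
        # 再把二者堆叠成两列, 交给 10 个 BayesianRidge stacking 模型预测并取平均。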
stack_result = 0 384 | 385 | for clf in lgb_models: 386 | lgb_result += clf.predict(X, num_iteration=clf.best_iteration) / len(lgb_models) 387 | 388 | for clf in xgb_models: 389 | xgb_result += clf.predict(xgb.DMatrix(X), ntree_limit=clf.best_ntree_limit) / len(xgb_models) 390 | 391 | test_stack = np.vstack([lgb_result, xgb_result]).transpose() 392 | for clf in stack_models: 393 | stack_result += clf.predict(test_stack) / len(stack_models) 394 | 395 | return stack_result 396 | 397 | _, _, X_test, test_idx, _ = split_data(data, min_max_scaler=min_max_scaler) 398 | result = _predict(X_test) 399 | return result 400 | 401 | 402 | def feature_processing(full, outlier_th = 3): 403 | selected_features = ['A22', 'A28', 'A20_end', 'B10', 'B11_start', 'A5', 'A10', 'B14/B12', 'B14'] 404 | pipe = Pipeline([ 405 | ('del_nan_feature', del_nan_feature()), 406 | ('handle_time_str', handle_time_str()), 407 | ('calc_time_diff', calc_time_diff()), 408 | ('Handle_outliers', handle_outliers(outlier_th)), 409 | ('add_new_features', add_new_features()), 410 | ]) 411 | 412 | pipe_data = pipe.fit_transform(full.copy())[selected_features+['target']] 413 | print(pipe_data.shape) 414 | return pipe_data 415 | 416 | 417 | def gen_submit(test_file_name, result_name, predictions): 418 | # 生成提交结果 419 | sub_df = pd.read_csv(test_file_name, encoding='gb18030') 420 | sub_df = sub_df[['样本id', 'A1']] 421 | sub_df['A1'] = predictions 422 | sub_df['A1'] = sub_df['A1'].apply(lambda x: round(x, 3)) 423 | print("Generating a submit file : {}".format(result_name)) 424 | sub_df.to_csv(result_name, header=0, index=0) 425 | 426 | 427 | def find_best_params(pipe_data, predict_fun, param_grid): 428 | 429 | # 获得 train 和 test, 归一化 430 | X_train, y_train, X_test, test_idx, _ = split_data(pipe_data, target_name='target') 431 | best_score = 1 432 | 433 | # 遍历所有参数,寻找最优 434 | for params in ParameterGrid(param_grid): 435 | print('-' * 100, "\nparams = \n{}\n".format(params)) 436 | 437 | oof, predictions, _ = predict_fun(X_train, y_train, X_test, params=params, verbose_eval=False) 438 | score = mean_squared_error(oof, y_train) 439 | print("CV score: {}, current best score: {}".format(score, best_score)) 440 | 441 | if best_score > score: 442 | print("Found new best score: {}".format(score)) 443 | best_score = score 444 | best_params = params 445 | 446 | print('\n\nbest params: {}'.format(best_params)) 447 | print('best score: {}'.format(best_score)) 448 | 449 | return best_params 450 | 451 | 452 | def stacking_predict(oof_lgb, oof_xgb, predictions_lgb, predictions_xgb, y_train, verbose_eval=1): 453 | # stacking 454 | train_stack = np.vstack([oof_lgb, oof_xgb]).transpose() 455 | test_stack = np.vstack([predictions_lgb, predictions_xgb]).transpose() 456 | 457 | folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=4590) 458 | oof_stack = np.zeros(train_stack.shape[0]) 459 | predictions = np.zeros(test_stack.shape[0]) 460 | stack_models = [] 461 | 462 | for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack, y_train)): 463 | if verbose_eval: 464 | print("fold {}".format(fold_)) 465 | trn_data, trn_y = train_stack[trn_idx], y_train[trn_idx] 466 | val_data, val_y = train_stack[val_idx], y_train[val_idx] 467 | 468 | clf_3 = BayesianRidge() 469 | clf_3.fit(trn_data, trn_y) 470 | stack_models.append(clf_3) 471 | 472 | oof_stack[val_idx] = clf_3.predict(val_data) 473 | predictions += clf_3.predict(test_stack) / 10 474 | 475 | final_score = mean_squared_error(y_train, oof_stack) 476 | if verbose_eval: 477 | print(final_score) 
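    # 补充说明: 上面 predictions 累加时除以 10, 对应 RepeatedKFold(5 折 × 2 次重复)
    # 共 10 个 stacking 模型的平均; final_score 是 stacking 后 oof 预测的 MSE。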
    return oof_stack, predictions, final_score, stack_models


def train_predict(pipe_data, lgb_best_params, xgb_best_params, verbose_eval=200):
    X_train, y_train, X_test, test_idx, min_max_scaler = split_data(pipe_data, target_name='target')

    oof_lgb, predictions_lgb, lgb_models = lgb_predict(X_train, y_train, X_test,
                                                       params=lgb_best_params, verbose_eval=verbose_eval)
    if verbose_eval:
        print('-' * 100)
    oof_xgb, predictions_xgb, xgb_models = xgb_predict(X_train, y_train, X_test,
                                                       params=xgb_best_params, verbose_eval=verbose_eval)
    if verbose_eval:
        print('-' * 100)
    oof_stack, predictions, final_score, stack_models = stacking_predict(oof_lgb, oof_xgb,
                                                                         predictions_lgb, predictions_xgb, y_train,
                                                                         verbose_eval=verbose_eval)

    return oof_stack, predictions, final_score, lgb_models, xgb_models, stack_models, min_max_scaler


def train_models():
    full = read_data('data/jinnan_round1_train_20181227.csv', 'data/jinnan_round1_testB_20190121.csv')
    pipe_data = feature_processing(full)

    # Search grid for the LightGBM hyper-parameters.
    param_grid = [
        {'num_leaves': [20], 'min_data_in_leaf': [2, 3], 'objective': ['regression'],
         'max_depth': [3, 4, 5], 'learning_rate': [0.06, 0.12, 0.24], "min_child_samples": [3],
         "boosting": ["gbdt"], "feature_fraction": [0.7], "bagging_freq": [1],
         "bagging_fraction": [1], "bagging_seed": [11], "metric": ['mse'],
         "lambda_l2": [0.0003, 0.001, 0.003], "verbosity": [-1]}
    ]

    lgb_best_params = find_best_params(pipe_data, lgb_predict, param_grid)

    # Search grid for the XGBoost hyper-parameters.
    param_grid = [
        {'silent': [1],
         'nthread': [4],
         'eval_metric': ['rmse'],
         'eta': [0.03],
         'objective': ['reg:linear'],
         'max_depth': [4, 5, 6],
         'num_round': [1000],
         'subsample': [0.4, 0.6, 0.8, 1],
         'colsample_bytree': [0.7, 0.9, 1],
         }
    ]

    xgb_best_params = find_best_params(pipe_data, xgb_predict, param_grid)

    oof_stack, predictions, final_score, lgb_models, xgb_models, stack_models, min_max_scaler = train_predict(
        pipe_data, lgb_best_params, xgb_best_params)

    if not os.path.exists('models'):
        os.makedirs('models')

    pkl_save("models/lgb_models.pkl", lgb_models)
    pkl_save("models/xgb_models.pkl", xgb_models)
    pkl_save("models/stack_models.pkl", stack_models)
    pkl_save("models/min_max_scaler.pkl", min_max_scaler)


if __name__ == '__main__':
    args = init_config()
    print(args, file=sys.stderr)
    outlier_th = 3

    if args.test_type in ['B', 'b']:
        test_file_name = 'data/jinnan_round1_testB_20190121.csv'
        result_name = 'submit_B.csv'
    elif args.test_type in ['C', 'c']:
        test_file_name = 'data/jinnan_round1_test_20190201.csv'
        result_name = 'submit_C.csv'
    elif args.test_type in ['A', 'a']:
        test_file_name = 'data/jinnan_round1_testA_20181227.csv'
        result_name = 'submit_A.csv'
    elif args.test_type in ['fusai', 'FuSai', 'Fusai', 'fuSai']:
        test_file_name = 'data/FuSai.csv'
        result_name = 'submit_FuSai.csv'
    elif args.test_type in ['gen', 'Gen', 'GEN']:
        test_file_name = 'data/optimize.csv'
        result_name = 'submit_optimize.csv'
        # Use a different outlier threshold for the generated parameter file.
        outlier_th = 0
    else:
        raise RuntimeError('test_type must be one of A, B, C, fusai or gen, for example: --test_type=B')

    # Set the file names and read the data.
    train_file_name = 'data/jinnan_round1_train_20181227.csv'
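    # Note: train_models() (called below) always builds its features from the
    # round-1 train file plus the testB file, regardless of --test_type; the
    # test file selected above only affects the final prediction and the
    # generated submission.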
    print("Training file: {}, testing file: {}".format(train_file_name, test_file_name))

    print("Generating training models")
    train_models()
    print("Saved training models to directory: 'models'")

    full = read_data(train_file_name, test_file_name)
    lgb_models, xgb_models, stack_models, min_max_scaler = load_models()

    # feature processing
    pipe_data = feature_processing(full, outlier_th=outlier_th)

    # predict with the loaded models
    predictions = predict(pipe_data, lgb_models, xgb_models, stack_models, min_max_scaler)

    # generate a submit file
    gen_submit(test_file_name, result_name, predictions)

    print("Finished")
--------------------------------------------------------------------------------
/复赛/津南数字制造算法挑战赛+17+Drop/readme.md:
--------------------------------------------------------------------------------
# Environment requirements

## System

ubuntu 16.04

## Python version

python 3.5 or above

## Required libraries

numpy, pandas, lightgbm, xgboost, sklearn, pickle

# How to run

Generate the prediction file for the final round (writes submit_FuSai.csv):

    python3 main.py --test_type=fusai

Generate the prediction file for optimize.csv (writes submit_optimize.csv):

    python3 main.py --test_type=gen

# Contact

If you have any questions, please feel free to contact us:

陶亚凡: 765370813@qq.com

Blue: cy_1995@qq.com
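# Appendix: reloading the saved models

main.py saves the per-fold models under `models/` with `pkl_save`. Assuming `pkl_save` writes ordinary pickle files (which is how `load_models` in main.py reads them back), the following minimal sketch shows how they could be inspected on their own; the `pkl_load` helper below is hypothetical and only for illustration:

    import pickle

    def pkl_load(path):
        # Plain pickle deserialization, mirroring what pkl_save is assumed to write.
        with open(path, 'rb') as f:
            return pickle.load(f)

    lgb_models = pkl_load('models/lgb_models.pkl')      # LightGBM boosters, one per fold
    xgb_models = pkl_load('models/xgb_models.pkl')      # XGBoost boosters, one per fold
    stack_models = pkl_load('models/stack_models.pkl')  # BayesianRidge stacking models
    print(len(lgb_models), len(xgb_models), len(stack_models))

--------------------------------------------------------------------------------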