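├── README.md
├── bank_telemarketing_data_analysis.ipynb
├── images
│   ├── Random Forest.png
│   ├── jaccard系数.png
│   ├── 余弦相似性.png
│   ├── 曼哈顿距离.png
│   └── 欧式距离.png
├── 关联分析
│   └── README.md
├── 分类算法
│   ├── README.md
│   └── 用户流失预测分析与应用.ipynb
├── 回归分析
│   ├── README.md
│   └── 大型促销活动前的销售预测.ipynb
└── 聚类分析
    ├── README.md
    └── 客户特征的聚类与探索性分析.ipynb

/README.md:
--------------------------------------------------------------------------------
# Machine Learning
- Supervised learning: each training sample carries a label (the **y value**). The model learns the patterns in the labeled samples and uses them to predict the labels of new, unseen samples.
  - Regression
  - Classification
- Unsupervised learning: the labels of the training samples are unknown. The goal is to uncover the intrinsic properties, structure, and information in the data, providing a basis for further data mining.
  - Clustering
## 1 Regression
Regression is a data-analysis method that studies how an independent variable x influences a dependent variable y. Its main applications are **prediction and control**, for example planning, setting KPIs, and setting targets. Predicted values can also be compared with actual values to gauge how far a situation has developed and to guide future action.

Regression analysis can be used to quantify the relationship between the independent and dependent variables (given x, estimate y), and to determine the direction of an independent variable's influence on the dependent variable (positive or negative).

**Common regression algorithms include:**
- Linear regression
- Binomial regression
- Logarithmic regression
- Exponential regression
- Kernel SVM
- Ridge regression
- Lasso

Pros:
- The data pattern and the result are easy to interpret; linear regression, for example, is expressed in the form y = ax + b.
- In business applications built on a formula, values can be substituted directly into the equation, so the model is easy to apply.

Cons:
- It can only analyze relationships among a small number of variables. It cannot handle interactions among a very large number of variables, in particular the joint effect of several variables on the dependent variable.

### 1.1 Watch for collinearity among regression variables
Indicators for detecting collinearity:
- Tolerance: lies in [0, 1]. Each independent variable (x) is in turn treated as the dependent variable (y) and regressed on the remaining independent variables; the tolerance is the residual (unexplained) share of its variance, i.e. 1 − R². The smaller the value, the more likely collinearity is a problem.
- Variance inflation factor (VIF): the reciprocal of the tolerance. The larger the value, the more pronounced the collinearity; 10 is the usual cut-off. VIF < 10: no multicollinearity; 10 <= VIF < 100: strong multicollinearity; VIF >= 100: severe multicollinearity.
- Eigenvalues: run a principal component analysis on the independent variables. If the eigenvalues of several dimensions are equal (or very close) to 0, serious collinearity may be present.
- Correlation coefficient: R > 0.8 suggests that a strong correlation may exist.

**Five common ways to deal with collinearity:**
- Increase the sample size
- Ridge regression
- Stepwise regression
- Principal component regression
- Remove variables manually

### 1.2 The relationship among the correlation coefficient, the coefficient of determination, and the regression coefficient
Suppose a regression equation y = 42.738x + 169.94 with R² = 0.5252. Running a correlation analysis on the same two variables also yields a correlation coefficient R = 0.72468551874050.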
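
The link between these three numbers can be checked numerically. The following is a minimal sketch on synthetic data; the data, the noise level, and all variable names are assumptions made for illustration and are not the data behind the equation above. It fits a one-variable least-squares line and then compares the coefficient of determination with the squared correlation coefficient.

```python
import math
import random

# Synthetic data (assumed for illustration): a noisy line with the same
# slope and intercept as the example equation.
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [42.738 * x + 169.94 + random.gauss(0, 120) for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)  # total sum of squares
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Regression coefficient (slope) and intercept of the least-squares line
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Coefficient of determination: regression sum of squares / total sum of squares
ss_reg = sum((slope * x + intercept - mean_y) ** 2 for x in xs)
r_squared = ss_reg / syy

# Pearson correlation coefficient between x and y
r = sxy / math.sqrt(sxx * syy)

print(f"slope={slope:.3f}, R^2={r_squared:.4f}, r={r:.4f}, r^2={r ** 2:.4f}")
```

For a single independent variable, R² equals the square of the correlation coefficient, and the slope has the same sign as the correlation coefficient. This is consistent with the example above: the square root of 0.5252 is about 0.7247.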
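
Here 42.738 is the **regression coefficient** of the independent variable x, 0.5252 is the **coefficient of determination** of the equation, and 0.724... is the **correlation coefficient** of the two variables.
- Coefficient of determination: measures how much of the variance of the dependent variable is explained by the independent variables. It is computed as the ratio of the regression sum of squares to the total sum of squares.
- Correlation coefficient: also referred to here as the explanatory coefficient. It measures the degree or closeness of association between variables and is, in essence, a judgment of linear correlation.

How the three relate:
- The coefficient of determination reflects the **joint influence of all independent variables in the model on the dependent variable**, not the influence of any single independent variable.
- Regression coefficient and correlation coefficient: if the regression coefficient is greater than 0, the correlation coefficient lies in (0, 1] and the two variables are positively correlated; if the regression coefficient is less than 0, the correlation coefficient lies in [-1, 0) and the two variables are negatively correlated.

## 2 Classification algorithms
- Supervised learning algorithms that model or predict a **discrete random variable**.
- Use cases include email filtering, financial fraud detection, predicting employee turnover, and other tasks whose output is a category.
- Classification algorithms are typically suited to predicting a category (or the probability of a category) rather than a continuous value.

### 2.1 Applications of classification algorithms
- Prediction
- Extracting application rules
- Extracting variable features
- Handling missing values

### 2.2 (Basic) Decision Tree
A decision tree is a tree structure (binary or non-binary).

Each non-leaf node represents a test on a **feature attribute**, each branch represents the output of that attribute over a range of values, and each leaf node stores a class.

To make a decision, start at the **root node**, test the corresponding feature attribute of the item being classified, and follow the branch that matches its value until a leaf node is reached. The class stored in that **leaf node** is the decision.

**Pros:**
- Works with any type of data (categorical variables are the more common case)
- Intuitive; the tree can be visualized, which makes it easy to understand
- Predictions are simple and highly interpretable
- Suitable for small datasets

**Cons:**
- Does not perform very well when the data contains continuous attributes
- Unstable: a small perturbation or change in the data can alter the whole tree
- Errors grow quickly as special attributes are added
- Easily grows an overly complex tree on the training data, which leads to overfitting.

### 2.3 Random Forest
![image](https://github.com/teamowu/Machine-Learning/blob/master/images/Random%20Forest.png)

**Pros:**
- Not prone to overfitting
- Robust to noise
- Handles very high-dimensional data (many features) without requiring feature selection
- Fast to train

## 3 Clustering
- A form of unsupervised machine learning (i.e. the **data is unlabeled**)
- The algorithm looks for natural groups (clusters) of observations based on the internal structure of the data
- Use cases include customer segmentation, news clustering, article recommendation, and so on.

**Several measures of similarity:**
- Euclidean distance
  - Definition: the true (straight-line) distance between two points in m-dimensional space, or the natural length of a vector (the distance from the point to the origin)
  - Use:

![image](https://github.com/teamowu/Machine-Learning/blob/master/images/%E6%AC%A7%E5%BC%8F%E8%B7%9D%E7%A6%BB.png)

- Manhattan distance
  - Definition: the sum of the absolute axis-wise distances between two points in a standard coordinate system.
  - Use:

![image](https://github.com/teamowu/Machine-Learning/blob/master/images/%E6%9B%BC%E5%93%88%E9%A1%BF%E8%B7%9D%E7%A6%BB.png)

- Cosine similarity
  - Definition: evaluates the similarity of two vectors by computing the cosine of the angle between them.
  - Use: news classification
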
![image](https://github.com/teamowu/Machine-Learning/blob/master/images/%E4%BD%99%E5%BC%A6%E7%9B%B8%E4%BC%BC%E6%80%A7.png)

- Jaccard coefficient
  - Definition: for two sets A and B, the Jaccard coefficient is the size of the intersection of A and B divided by the size of their union.
  - Use: compares the similarity and difference of finite sample sets. The larger the Jaccard coefficient, the more similar the samples.

![image](https://github.com/teamowu/Machine-Learning/blob/master/images/jaccard%E7%B3%BB%E6%95%B0.png)

### 3.1 Hierarchical Cluster Analysis (HCA)
Hierarchical clustering is a family of clustering algorithms built on the following idea:
- Start with every data point as its own cluster
- Merge clusters according to one shared criterion
- Repeat until a single cluster remains, which yields a hierarchy of clusters.

### 3.2 K-means Clustering Algorithm
- The clustering metric is the geometric distance between sample points (i.e. distance in the coordinate space)
- Clusters are groups gathered around cluster centers; they tend to be roughly spherical and of similar size
- For a given k, the algorithm starts from an initial grouping and then iteratively changes the grouping so that each new grouping is better than the previous one

### 3.3 DBSCAN
- A density-based algorithm that groups dense regions of sample points into clusters
- DBSCAN does not assume spherical clusters, and its performance is scalable
- It does not require every point to be assigned to a cluster, which reduces the amount of anomalous data inside clusters.


--------------------------------------------------------------------------------
/bank_telemarketing_data_analysis.ipynb:
--------------------------------------------------------------------------------
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Python Data Analysis: Bank Telemarketing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Introduction\n",
    "### Project overview:\n",
    "Banks safeguard our money and make daily life more convenient, and the records we leave with a bank let it make predictions about our behavior. This project uses customers' past records to predict whether a customer will subscribe to a deposit product.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dataset\n",
    "The dataset corresponds to a classification task. It contains 25,317 rows and 18 columns (counting the ID column, which is used as the index when the data is loaded). The column y is the label column and takes the values 0 and 1: 0 means the customer did not subscribe and 1 means the customer subscribed. The other columns are features\n",
    "### Field descriptions\n",
    "ID: unique customer identifier<br>\n",
    "age: customer age<br>\n",
    "job: customer occupation<br>\n",
    "marital: marital status<br>\n",
    "education: education level<br>\n",
    "default: whether the customer has a default record<br>\n",
    "balance: average yearly account balance<br>\n",
    "housing: whether the customer has a housing loan<br>\n",
    "loan: whether the customer has a personal loan<br>\n",
    "contact: communication channel used to contact the customer<br>\n",
    "day: date of the last contact (day of the month)<br>\n",
    "month: date of the last contact (month)<br>\n",
    "duration: duration of the last contact<br>\n",
    "campaign: number of contacts with this customer during this campaign<br>\n",
    "pdays: time elapsed since the customer was last contacted in the previous campaign<br>\n",
    "previous: number of contacts with this customer before this campaign<br>\n",
    "poutcome: outcome of the previous campaign<br>\n",
    "y: whether the customer subscribes to a term deposit; 0 means no, 1 means yes<br>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Question\n",
    "* Which classification model is better suited to predicting whether a customer subscribes to a term deposit?\n",
    "\n",
    "### Analysis workflow\n",
    "* Inspect the data\n",
    "* Feature processing\n",
    "* Model selection\n",
    "* Data normalization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Exploratory data analysis\n",
    "### Import the required libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Basic libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "# Machine learning utilities\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.model_selection import cross_val_score\n",
    "from sklearn.metrics import roc_auc_score\n",
    "from sklearn.preprocessing import LabelEncoder, MinMaxScaler\n",
    "\n",
    "# Models\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "# Ignore warnings\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Inspect the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "    age         job   marital  education  default  balance  housing  loan  \\\n",
       "ID                                                                          \n",
       "1    43  management   married   tertiary       no      291      yes    no   \n",
       "2    42  technician  divorced    primary       no     5076      yes    no   \n",
       "3    47      admin.   married  secondary       no      104      yes   yes   \n",
       "4    28  management    single  secondary       no     -994      yes   yes   \n",
       "5    42  technician  divorced  secondary       no     2974      yes    no   \n",
       "\n",
       "     contact  day  month  duration  campaign  pdays  previous  poutcome  y  \n",
       "ID                                                                          \n",
       "1    unknown    9    may       150         2     -1         0   unknown  0  \n",
       "2   cellular    7    apr        99         1    251         2     other  0  \n",
       "3   cellular   14    jul        77         2     -1         0   unknown  0  \n",
       "4   cellular   18    jul       174         2     -1         0   unknown  0  \n",
       "5    unknown   21    may       187         5     -1         0   unknown  0  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_all = pd.read_csv('./dataFile/train_set.csv',index_col='ID')\n",
    "data_all.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 25317 entries, 1 to 25317\n",
      "Data columns (total 17 columns):\n",
      "age          25317 non-null int64\n",
      "job          25317 non-null object\n",
      "marital      25317 non-null object\n",
      "education    25317 non-null object\n",
      "default      25317 non-null object\n",
      "balance      25317 non-null int64\n",
      "housing      25317 non-null object\n",
      "loan         25317 non-null object\n",
      "contact      25317 non-null object\n",
      "day          25317 non-null int64\n",
      "month        25317 non-null object\n",
      "duration     25317 non-null int64\n",
      "campaign     25317 non-null int64\n",
      "pdays        25317 non-null int64\n",
      "previous     25317 non-null int64\n",
      "poutcome     25317 non-null object\n",
      "y            25317 non-null int64\n",
      "dtypes: int64(8), object(9)\n",
      "memory usage: 3.5+ MB\n"
     ]
    }
   ],
   "source": [
    "# Show basic information about the data\n",
    "data_all.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(25317, 17)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check the dimensions of data_all\n",
    "data_all.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "age          False\n",
       "job          False\n",
       "marital      False\n",
       "education    False\n",
       "default      False\n",
       "balance      False\n",
       "housing      False\n",
       "loan         False\n",
       "contact      False\n",
       "day          False\n",
       "month        False\n",
       "duration     False\n",
       "campaign     False\n",
       "pdays        False\n",
       "previous     False\n",
       "poutcome     False\n",
       "y            False\n",
       "dtype: bool"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check whether each column contains missing values.\n",
    "data_all.isnull().any()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Clearly, the dataset contains no missing values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Feature processing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['job',\n",
       " 'marital',\n",
       " 'education',\n",
       " 'default',\n",
       " 'housing',\n",
       " 'loan',\n",
       " 'contact',\n",
       " 'month',\n",
       " 'poutcome']"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Get the names of the columns in data_all whose dtype is object.\n",
    "data_obj_col = data_all.select_dtypes('object').columns.to_list()\n",
    "data_obj_col"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(25317, 9)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Select all object-typed columns of the dataset and show the resulting shape\n",
    "data_obj=data_all[data_obj_col]\n",
    "data_obj.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Using data_obj_col, get the names of the numeric columns of the dataset.\n",
    "data_num_col = data_all.columns.difference(data_obj_col)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(25317, 8)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Select all numeric columns of the dataset and show the resulting shape\n",
    "data_num=data_all[data_num_col]\n",
    "data_num.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['age', 'balance', 'campaign', 'day', 'duration', 'pdays', 'previous', 'y']"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print the column names of data_num\n",
    "data_num.columns.to_list()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the output above:\n",
    "* There are 9 object-typed columns\n",
    "* There are 8 numeric columns, named 'age', 'balance', 'campaign', 'day', 'duration', 'pdays', 'previous', 'y'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Label encoding\n",
    "Object-typed columns that have exactly two distinct values are label-encoded, and the encoded columns are added to the data_num dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['default', 'housing', 'loan']"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Count the unique values in each column of data_obj, then keep the names of the columns with exactly two\n",
    "two_unique_cols = data_obj.nunique()[data_obj.nunique()==2].index.tolist()\n",
    "two_unique_cols"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Label-encode the columns that have exactly two unique values and store the encoded data in data_num\n",
    "y = data_all[two_unique_cols].apply(LabelEncoder().fit_transform)\n",
    "data_num = pd.concat([y, data_num],ignore_index=False, sort=True, axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "data_num shape: (25317, 11)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "Index(['default', 'housing', 'loan', 'age', 'balance', 'campaign', 'day',\n",
       "       'duration', 'pdays', 'previous', 'y'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print the shape and column names of data_num\n",
    "print('data_num shape: {}'.format(data_num.shape))\n",
    "data_num.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data sampling\n",
612 | "由于建模的时候,样本不平衡会对模型的训练产生很大的影响,这里将采取简单的方法对数据进行抽样,以使y中每类的样本相对平衡" 613 | ] 614 | }, 615 | { 616 | "cell_type": "code", 617 | "execution_count": 16, 618 | "metadata": { 619 | "scrolled": false 620 | }, 621 | "outputs": [ 622 | { 623 | "data": { 624 | "text/plain": [ 625 | "" 626 | ] 627 | }, 628 | "execution_count": 16, 629 | "metadata": {}, 630 | "output_type": "execute_result" 631 | }, 632 | { 633 | "data": { 634 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZUAAAEKCAYAAADaa8itAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAADwtJREFUeJzt3X+s3fVdx/HnCwo6dYRiC8MW7bI0xjqVsYY17h8cSVdItDiHGcmkTpIuC9MtMWboH3YBSaZumrEgSXUddJlD4japSWdtms3F7BcXR/jp0htEuBbphbINJRkpvv3jfu84a0/b09vPud97uM9HcnLO930+3+95f5Ob+8r3+/me70lVIUlSC2f13YAk6dXDUJEkNWOoSJKaMVQkSc0YKpKkZgwVSVIzhookqRlDRZLUjKEiSWpmRd8NLLZVq1bVunXr+m5DkibK/fff/2xVrT7VuGUXKuvWrWNqaqrvNiRpoiT5z1HGefpLktSMoSJJasZQkSQ1Y6hIkpoxVCRJzRgqkqRmDBVJUjOGiiSpGUNFktTMsvtG/Zl68x/s7rsFLUH3//n1fbcgLQkeqUiSmjFUJEnNGCqSpGYMFUlSM4aKJKkZQ0WS1IyhIklqxlCRJDVjqEiSmjFUJEnNGCqSpGYMFUlSM4aKJKkZQ0WS1IyhIklqxlCRJDVjqEiSmhlbqCS5JMmXkjyW5JEkH+jqFyTZn+Rg97yyqyfJbUmmkzyY5LKBbW3rxh9Msm2g/uYkD3Xr3JYk49ofSdKpjfNI5Sjw+1X1c8Am4MYkG4CbgANVtR440C0DXAWs7x7bgTtgLoSAHcBbgMuBHfNB1I3ZPrDeljHujyTpFMYWKlX1dFX9W/f6BeAxYA2wFbirG3YXcE33eiuwu+Z8HTg/ycXA24H9VXWkqp4H9gNbuvfOq6qvVVUBuwe2JUnqwaLMqSRZB7wJ+AZwUVU9DXPBA1zYDVsDPDWw2kxXO1l9ZkhdktSTsYdKkp8APgd8sKq+d7KhQ2q1gPqwHrYnmUoyNTs7e6qWJUkLNNZQSXIOc4Hymar6fFd+pjt1Rfd8uKvPAJcMrL4WOHSK+toh9eNU1c6q2lhVG1evXn1mOyVJOqFxXv0V4JPAY1X1FwNv7QHmr+DaBtw7UL++uwpsE/Dd7vTYPmBzkpXdBP1mYF/33gtJNnWfdf3AtiRJPVgxxm2/Ffgt4KEkD3S1PwI+AtyT5AbgSeDa7r29wNXANPAi8B6AqjqS5Bbgvm7czVV1pHv9PuBO4DXAF7uHJKknYwuVqvpXhs97AFw5ZHwBN55gW7uAXUPqU8Abz6BNSVJDfqNektSMoSJJasZQkSQ1Y6hIkpoxVCRJzRgqkqRmDBVJUjOGiiSpGUNFktSMoSJJasZQkSQ1Y6hIkpoxVCRJzRgqkqRmDBVJUjOGiiSpGUNFktSMoSJJasZQkSQ1Y6hIkpoxVCRJzRgqkqRmDBVJUjOGiiSpGUNFktSMoSJJasZQkSQ1Y6hIkpoxVCRJzRgqkqRmDBVJUjOGiiSpGUNFktSMoSJJasZQk
SQ1Y6hIkpoxVCRJzYwtVJLsSnI4ycMDtQ8n+a8kD3SPqwfe+8Mk00m+neTtA/UtXW06yU0D9dcn+UaSg0n+Lsm549oXSdJoxnmkciewZUj9L6vq0u6xFyDJBuBdwM936/xVkrOTnA3cDlwFbACu68YC/Gm3rfXA88ANY9wXSdIIxhYqVfUV4MiIw7cCd1fV96vqP4Bp4PLuMV1Vj1fVS8DdwNYkAd4G/H23/l3ANU13QJJ02vqYU3l/kge702Mru9oa4KmBMTNd7UT1nwS+U1VHj6lLknq02KFyB/AG4FLgaeBjXT1DxtYC6kMl2Z5kKsnU7Ozs6XUsSRrZooZKVT1TVS9X1f8Bf83c6S2YO9K4ZGDoWuDQSerPAucnWXFM/USfu7OqNlbVxtWrV7fZGUnScRY1VJJcPLD468D8lWF7gHcl+ZEkrwfWA98E7gPWd1d6ncvcZP6eqirgS8A7u/W3Afcuxj5Ikk5sxamHLEySzwJXAKuSzAA7gCuSXMrcqaongPcCVNUjSe4BHgWOAjdW1cvddt4P7APOBnZV1SPdR3wIuDvJnwDfAj45rn2RJI1mbKFSVdcNKZ/wH39V3QrcOqS+F9g7pP44r5w+kyQtAX6jXpLUjKEiSWrGUJEkNWOoSJKaMVQkSc0YKpKkZgwVSVIzhookqRlDRZLUjKEiSWrGUJEkNWOoSJKaMVQkSc2MFCpJDoxSkyQtbye99X2SHwV+jLnfRFnJKz/jex7wU2PuTZI0YU71eyrvBT7IXIDczyuh8j3g9jH2JUmaQCcNlar6OPDxJL9bVZ9YpJ4kSRNqpF9+rKpPJPllYN3gOlW1e0x9SZIm0EihkuTTwBuAB4CXu3IBhook6QdG/Y36jcCGqqpxNiNJmmyjfk/lYeB142xEkjT5Rj1SWQU8muSbwPfni1X1a2PpSpI0kUYNlQ+PswlJ0qvDqFd//cu4G5EkTb5Rr/56gbmrvQDOBc4B/reqzhtXY5KkyTPqkcprB5eTXANcPpaOJEkTa0F3Ka6qfwDe1rgXSdKEG/X01zsGFs9i7nsrfmdFkvRDRr3661cHXh8FngC2Nu9GkjTRRp1Tec+4G5EkTb5Rf6RrbZIvJDmc5Jkkn0uydtzNSZImy6gT9Z8C9jD3uyprgH/sapIk/cCoobK6qj5VVUe7x53A6jH2JUmaQKOGyrNJ3p3k7O7xbuC5cTYmSZo8o4bK7wC/Cfw38DTwTsDJe0nSDxn1kuJbgG1V9TxAkguAjzIXNpIkAaMfqfzifKAAVNUR4E3jaUmSNKlGDZWzkqycX+iOVEY9ypEkLROjhsrHgK8muSXJzcBXgT872QpJdnXfa3l4oHZBkv1JDnbPK7t6ktyWZDrJg0kuG1hnWzf+YJJtA/U3J3moW+e2JDmdHZcktTdSqFTVbuA3gGeAWeAdVfXpU6x2J7DlmNpNwIGqWg8c6JYBrgLWd4/twB3wgyOiHcBbmLsr8o6BI6Y7urHz6x37WZKkRTbyKayqehR49DTGfyXJumPKW4Erutd3AV8GPtTVd1dVAV9Pcn6Si7ux+7s5HJLsB7Yk+TJwXlV9ravvBq4Bvjhqf5Kk9hZ06/szcFFVPQ3QPV/Y1dcATw2Mm+lqJ6vPDKkPlWR7kqkkU7Ozs2e8E5Kk4RY7VE5k2HxILaA+VFXtrKqNVbVx9WpvBCBJ47LYofJMd1qL7vlwV58BLhkYtxY4dIr62iF1SVKPFjtU9gDzV3BtA+4dqF/fXQW2Cfhud3psH7A5ycpugn4zsK9774Ukm7qrvq4f2JYkqSdj+65Jks8yN9G+KskMc1dxfQS4J8kNwJPAtd3wvcDVwDTwIt0tYKrqSJJbgPu6cTfPT9oD72PuCrPXMDdB7yS9JPVsbKFSVded4K0rh4wt4MYTbGcXsGtIfQp445n0KElqa6lM1EuSXgUMFUlSM4aKJKkZQ0WS1IyhIklqxlCRJDVjqEiSmjFUJEnNGCqSpGYMFUlSM4aKJKkZQ
0WS1IyhIklqxlCRJDVjqEiSmjFUJEnNGCqSpGYMFUlSM4aKJKkZQ0WS1IyhIklqxlCRJDVjqEiSmjFUJEnNGCqSpGYMFUlSM4aKJKkZQ0WS1IyhIklqxlCRJDVjqEiSmjFUJEnNGCqSpGYMFUlSM4aKJKkZQ0WS1IyhIklqppdQSfJEkoeSPJBkqqtdkGR/koPd88quniS3JZlO8mCSywa2s60bfzDJtj72RZL0ij6PVH6lqi6tqo3d8k3AgapaDxzolgGuAtZ3j+3AHTAXQsAO4C3A5cCO+SCSJPVjKZ3+2grc1b2+C7hmoL675nwdOD/JxcDbgf1VdaSqngf2A1sWu2lJ0iv6CpUC/jnJ/Um2d7WLquppgO75wq6+BnhqYN2Zrnai+nGSbE8ylWRqdna24W5Ikgat6Olz31pVh5JcCOxP8u8nGZshtTpJ/fhi1U5gJ8DGjRuHjpEknblejlSq6lD3fBj4AnNzIs90p7Xong93w2eASwZWXwscOkldktSTRQ+VJD+e5LXzr4HNwMPAHmD+Cq5twL3d6z3A9d1VYJuA73anx/YBm5Os7CboN3c1SVJP+jj9dRHwhSTzn/+3VfVPSe4D7klyA/AkcG03fi9wNTANvAi8B6CqjiS5BbivG3dzVR1ZvN2QJB1r0UOlqh4HfmlI/TngyiH1Am48wbZ2Abta9yhJWpildEmxJGnCGSqSpGYMFUlSM4aKJKkZQ0WS1IyhIklqxlCRJDVjqEiSmjFUJEnNGCqSpGYMFUlSM4aKJKkZQ0WS1Exfv/woaQyevPkX+m5BS9BP//FDi/ZZHqlIkpoxVCRJzRgqkqRmDBVJUjOGiiSpGUNFktSMoSJJasZQkSQ1Y6hIkpoxVCRJzRgqkqRmDBVJUjOGiiSpGUNFktSMoSJJasZQkSQ1Y6hIkpoxVCRJzRgqkqRmDBVJUjOGiiSpGUNFktSMoSJJambiQyXJliTfTjKd5Ka++5Gk5WyiQyXJ2cDtwFXABuC6JBv67UqSlq+JDhXgcmC6qh6vqpeAu4GtPfckScvWpIfKGuCpgeWZriZJ6sGKvhs4QxlSq+MGJduB7d3i/yT59li7Wj5WAc/23cRSkI9u67sFHc+/z3k7hv2rPG0/M8qgSQ+VGeCSgeW1wKFjB1XVTmDnYjW1XCSZqqqNffchDePfZz8m/fTXfcD6JK9Pci7wLmBPzz1J0rI10UcqVXU0yfuBfcDZwK6qeqTntiRp2ZroUAGoqr3A3r77WKY8pailzL/PHqTquHltSZIWZNLnVCRJS4ihogXx9jhaqpLsSnI4ycN997IcGSo6bd4eR0vcncCWvptYrgwVLYS3x9GSVVVfAY703cdyZahoIbw9jqShDBUtxEi3x5G0/BgqWoiRbo8jafkxVLQQ3h5H0lCGik5bVR0F5m+P8xhwj7fH0VKR5LPA14CfTTKT5Ia+e1pO/Ea9JKkZj1QkSc0YKpKkZgwVSVIzhookqRlDRZLUjKEiSWrGUJEkNWOoSD1KckuSDwws35rk9/rsSToTfvlR6lGSdcDnq+qyJGcBB4HLq+q5XhuTFmhF3w1Iy1lVPZHkuSRvAi4CvmWgaJIZKlL//gb4beB1wK5+W5HOjKe/pJ51d3p+CDgHWF9VL/fckrRgHqlIPauql5J8CfiOgaJJZ6hIPesm6DcB1/bdi3SmvKRY6lGSDcA0cKCqDvbdj3SmnFORJDXjkYokqRlDRZLUjKEiSWrGUJEkNWOoSJKaMVQkSc38P2E7ocOY2wiuAAAAAElFTkSuQmCC\n", 635 | "text/plain": [ 636 | "
" 637 | ] 638 | }, 639 | "metadata": { 640 | "needs_background": "light" 641 | }, 642 | "output_type": "display_data" 643 | } 644 | ], 645 | "source": [ 646 | "# 使用countplot对y列的唯一值进行画图显示\n", 647 | "sns.countplot(data_num['y'])" 648 | ] 649 | }, 650 | { 651 | "cell_type": "code", 652 | "execution_count": 17, 653 | "metadata": { 654 | "scrolled": true 655 | }, 656 | "outputs": [ 657 | { 658 | "name": "stdout", 659 | "output_type": "stream", 660 | "text": [ 661 | "0有:22356,1有:2961\n" 662 | ] 663 | } 664 | ], 665 | "source": [ 666 | "# 标签列y中存在两个唯一值,现在计算每个值的个数.\n", 667 | "data_num['y'].value_counts()\n", 668 | "value_0 = (data_num['y']==0).sum()\n", 669 | "value_1 = (data_num['y']==1).sum()\n", 670 | "print(\"0有:{},1有:{}\".format(value_0,value_1))" 671 | ] 672 | }, 673 | { 674 | "cell_type": "markdown", 675 | "metadata": {}, 676 | "source": [ 677 | "从以上的输出和图的显示来看,y标签列的每类的样本数不一致,因此为保抽取样本的平衡,对y==0的样本进行随机抽样。" 678 | ] 679 | }, 680 | { 681 | "cell_type": "code", 682 | "execution_count": 18, 683 | "metadata": { 684 | "scrolled": true 685 | }, 686 | "outputs": [], 687 | "source": [ 688 | "# 以上可知y中列存在0和1两个值,由于0包含的元素个数远远大于1的个数,现在使用sample从y为0的样本中随机抽取一些样本,\n", 689 | "# 要求0包含的样本数和1的样本数相同,且随机种子设定为22,\n", 690 | "# sample随机抽样方法:sample(n=样本数,random_state=随机种子数)\n", 691 | "data_num_0 = data_num[data_num['y']==0].sample(n=value_1, random_state=22)" 692 | ] 693 | }, 694 | { 695 | "cell_type": "code", 696 | "execution_count": 19, 697 | "metadata": {}, 698 | "outputs": [], 699 | "source": [ 700 | "# 得到数据集中y列中为1的样本\n", 701 | "data_num_1=data_num[data_num['y']!=0]" 702 | ] 703 | }, 704 | { 705 | "cell_type": "code", 706 | "execution_count": 20, 707 | "metadata": {}, 708 | "outputs": [ 709 | { 710 | "data": { 711 | "text/plain": [ 712 | "(5922, 11)" 713 | ] 714 | }, 715 | "execution_count": 20, 716 | "metadata": {}, 717 | "output_type": "execute_result" 718 | } 719 | ], 720 | "source": [ 721 | "# 将y为1的样本和随机抽取的y为0的样本进行合并,并且打印合并后数据集的维度\n", 722 | "data_num_sample = 
pd.concat([data_num_0,data_num_1], axis=0, ignore_index=False)\n", 723 | "data_num_sample.shape" 724 | ] 725 | }, 726 | { 727 | "cell_type": "code", 728 | "execution_count": 21, 729 | "metadata": {}, 730 | "outputs": [ 731 | { 732 | "data": { 733 | "text/html": [ 734 | "
\n", 735 | "\n", 748 | "\n", 749 | " \n", 750 | " \n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | " \n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 | " \n", 795 | " \n", 796 | " \n", 797 | " \n", 798 | " \n", 799 | " \n", 800 | " \n", 801 | " \n", 802 | " \n", 803 | " \n", 804 | " \n", 805 | " \n", 806 | " \n", 807 | " \n", 808 | " \n", 809 | " \n", 810 | " \n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | " \n", 817 | " \n", 818 | " \n", 819 | " \n", 820 | " \n", 821 | " \n", 822 | " \n", 823 | " \n", 824 | " \n", 825 | " \n", 826 | " \n", 827 | " \n", 828 | " \n", 829 | " \n", 830 | " \n", 831 | " \n", 832 | " \n", 833 | " \n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | " \n", 839 | " \n", 840 | " \n", 841 | " \n", 842 | " \n", 843 | " \n", 844 | " \n", 845 | " \n", 846 | " \n", 847 | " \n", 848 | " \n", 849 | " \n", 850 | " \n", 851 | " \n", 852 | " \n", 853 | " \n", 854 | " \n", 855 | " \n", 856 | " \n", 857 | " \n", 858 | " \n", 859 | " \n", 860 | " \n", 861 | " \n", 862 | " \n", 863 | " \n", 864 | " \n", 865 | " \n", 866 | " \n", 867 | " \n", 868 | " \n", 869 | " \n", 870 | " \n", 871 | " \n", 872 | " \n", 873 | " \n", 874 | " \n", 875 | " \n", 876 | " \n", 877 | " \n", 878 | " \n", 879 | "
defaulthousingloanagebalancecampaigndaydurationpdayspreviousy
count5922.0000005922.0000005922.0000005922.0000005922.0000005922.0000005922.0000005922.0000005922.0000005922.0000005922.000000
mean0.0138470.4689290.12664641.1703821616.0460992.47332015.489868378.14032452.9181020.8635600.500000
std0.1168640.4990760.33260512.0124573371.6011682.7459048.419602353.409237109.9765872.2841460.500042
min0.0000000.0000000.00000018.000000-1965.0000001.0000001.0000004.000000-1.0000000.0000000.000000
25%0.0000000.0000000.00000032.000000130.0000001.0000008.000000145.000000-1.0000000.0000000.000000
50%0.0000000.0000000.00000039.000000574.0000002.00000015.000000259.000000-1.0000000.0000000.500000
75%0.0000001.0000000.00000049.0000001854.7500003.00000021.000000492.00000077.7500001.0000001.000000
max1.0000001.0000001.00000095.000000102127.00000044.00000031.0000003881.000000854.00000058.0000001.000000
\n", 880 | "
" 881 | ], 882 | "text/plain": [ 883 | " default housing loan age balance \\\n", 884 | "count 5922.000000 5922.000000 5922.000000 5922.000000 5922.000000 \n", 885 | "mean 0.013847 0.468929 0.126646 41.170382 1616.046099 \n", 886 | "std 0.116864 0.499076 0.332605 12.012457 3371.601168 \n", 887 | "min 0.000000 0.000000 0.000000 18.000000 -1965.000000 \n", 888 | "25% 0.000000 0.000000 0.000000 32.000000 130.000000 \n", 889 | "50% 0.000000 0.000000 0.000000 39.000000 574.000000 \n", 890 | "75% 0.000000 1.000000 0.000000 49.000000 1854.750000 \n", 891 | "max 1.000000 1.000000 1.000000 95.000000 102127.000000 \n", 892 | "\n", 893 | " campaign day duration pdays previous \\\n", 894 | "count 5922.000000 5922.000000 5922.000000 5922.000000 5922.000000 \n", 895 | "mean 2.473320 15.489868 378.140324 52.918102 0.863560 \n", 896 | "std 2.745904 8.419602 353.409237 109.976587 2.284146 \n", 897 | "min 1.000000 1.000000 4.000000 -1.000000 0.000000 \n", 898 | "25% 1.000000 8.000000 145.000000 -1.000000 0.000000 \n", 899 | "50% 2.000000 15.000000 259.000000 -1.000000 0.000000 \n", 900 | "75% 3.000000 21.000000 492.000000 77.750000 1.000000 \n", 901 | "max 44.000000 31.000000 3881.000000 854.000000 58.000000 \n", 902 | "\n", 903 | " y \n", 904 | "count 5922.000000 \n", 905 | "mean 0.500000 \n", 906 | "std 0.500042 \n", 907 | "min 0.000000 \n", 908 | "25% 0.000000 \n", 909 | "50% 0.500000 \n", 910 | "75% 1.000000 \n", 911 | "max 1.000000 " 912 | ] 913 | }, 914 | "execution_count": 21, 915 | "metadata": {}, 916 | "output_type": "execute_result" 917 | } 918 | ], 919 | "source": [ 920 | "# 对数据集拆分为训练集和测试集之前,先检查下数据是否存在问题,调用describe,打印下数据的统计信息\n", 921 | "data_num_sample.describe()" 922 | ] 923 | }, 924 | { 925 | "cell_type": "code", 926 | "execution_count": 22, 927 | "metadata": {}, 928 | "outputs": [ 929 | { 930 | "data": { 931 | "text/plain": [ 932 | "Index(['default', 'housing', 'loan', 'age', 'balance', 'campaign', 'day',\n", 933 | " 'duration', 'pdays', 'previous'],\n", 934 | " 
dtype='object')" 935 | ] 936 | }, 937 | "execution_count": 22, 938 | "metadata": {}, 939 | "output_type": "execute_result" 940 | } 941 | ], 942 | "source": [ 943 | "# 将数据集中的y标签列数据存到y变量中,将其他的(特征)列存到X变量中,并且打印X的所有列的列名\n", 944 | "y = data_num_sample['y']\n", 945 | "X = data_num_sample.drop(columns='y',axis=1)\n", 946 | "X.columns" 947 | ] 948 | }, 949 | { 950 | "cell_type": "code", 951 | "execution_count": 23, 952 | "metadata": {}, 953 | "outputs": [], 954 | "source": [ 955 | "# 将数据集X和y进行随机切分,得到训练集和测试集数据,随机种子设定为22,测试集占比1/4。\n", 956 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22, test_size=1/4)" 957 | ] 958 | }, 959 | { 960 | "cell_type": "code", 961 | "execution_count": 24, 962 | "metadata": {}, 963 | "outputs": [ 964 | { 965 | "name": "stdout", 966 | "output_type": "stream", 967 | "text": [ 968 | "X_train的维度为(4441, 10),X_test的维度为(1481, 10)\n" 969 | ] 970 | } 971 | ], 972 | "source": [ 973 | "print(\"X_train的维度为{},X_test的维度为{}\".format(X_train.shape,X_test.shape))" 974 | ] 975 | }, 976 | { 977 | "cell_type": "markdown", 978 | "metadata": {}, 979 | "source": [ 980 | "## 四、选择模型" 981 | ] 982 | }, 983 | { 984 | "cell_type": "markdown", 985 | "metadata": {}, 986 | "source": [ 987 | "通过以上的特征处理和数据切分,我们将抽样后的数据集随机切分为训练集和测试集,这里,将依据训练集和测试集通过交叉验证的方法来选择模型,以下将对kNN模型、逻辑回归模型和决策树模型进行训练,通过评价指标选择一个较有的模型" 988 | ] 989 | }, 990 | { 991 | "cell_type": "markdown", 992 | "metadata": {}, 993 | "source": [ 994 | "train_test_model和predict_auc方法均是被调用方法,在下面的模型选择的方法中会调用这两个方法" 995 | ] 996 | }, 997 | { 998 | "cell_type": "code", 999 | "execution_count": 25, 1000 | "metadata": {}, 1001 | "outputs": [], 1002 | "source": [ 1003 | "def train_test_model(clf, X_train, y_train, cv_scores, param):\n", 1004 | " \"\"\"\n", 1005 | " 功能:\n", 1006 | " 依据训练集,对模型进行训练,得到交叉验证后的评分值\n", 1007 | " 参数:\n", 1008 | " clf:模型\n", 1009 | " param:模型参数\n", 1010 | " X_train:训练集\n", 1011 | " y_train:训练样本对应的标签\n", 1012 | " cv_scores:字`典类型,将得到的最终评分值存到字典中。\n", 1013 | "\n", 1014 | " \"\"\"\n", 1015 | " # 
Use 10-fold cross-validation with roc_auc as the metric to score clf, take the mean of the scores, and print the parameter and the mean score,\n", 1016 | " # e.g. for parameter 5 and mean score 0.76342 the printed line is: 参数=5,验证集上的AUC=0.76342\n", 1017 | " val_scores = cross_val_score(clf, X_train, y_train, cv=10, scoring='roc_auc')\n", 1018 | " score_mean = val_scores.mean()\n", 1019 | " print(score_mean)\n", 1020 | " print(\"参数={},验证集上的AUC={}\".format(param, score_mean))\n", 1021 | " \n", 1022 | " \n", 1023 | " \n", 1024 | " # Finally store the mean score in the dict cv_scores, with the model parameter as key and the mean score as value\n", 1025 | " # e.g. {5:0.76342}\n", 1026 | " cv_scores[param] = score_mean" 1027 | ] 1028 | }, 1029 | { 1030 | "cell_type": "code", 1031 | "execution_count": 26, 1032 | "metadata": {}, 1033 | "outputs": [], 1034 | "source": [ 1035 | "def predict_auc(model,X_train,y_train,X_test,y_test):\n", 1036 | " \"\"\"\n", 1037 | " Purpose:\n", 1038 | " Train the model on the training data, predict on the test data with the trained model, and compute the model's AUC score\n", 1039 | " Parameters:\n", 1040 | " model: model configured with the best parameter\n", 1041 | " X_train: training set\n", 1042 | " y_train: labels of the training data\n", 1043 | " X_test: test set\n", 1044 | " y_test: labels of the test data\n", 1045 | " Returns:\n", 1046 | " the model's AUC value\n", 1047 | " \"\"\"\n", 1048 | " # With the best parameter set, fit on the whole training set, predict the test set with predict, store the result in y_pred, and compute the AUC with roc_auc_score()\n", 1049 | " # Print the AUC in the form: 模型AUC值:0.75177973\n", 1050 | " model.fit(X_train, y_train)\n", 1051 | " y_pred = model.predict(X_test)\n", 1052 | " model_auc = roc_auc_score(y_pred, y_test) # note: roc_auc_score is documented as (y_true, y_score); the recorded outputs below were produced with this argument order and with hard labels from predict\n", 1053 | " print('模型AUC值:{}'.format(model_auc))\n", 1054 | "\n", 1055 | " return model_auc\n", 1056 | " " 1057 | ] 1058 | }, 1059 | { 1060 | "cell_type": "markdown", 1061 | "metadata": {}, 1062 | "source": [ 1063 | "### 1. kNN model\n", 1064 | "The k-nearest neighbor method (k-NN) is a basic classification method. Given a dataset and a new input instance, it finds the k training instances nearest to that instance and assigns the instance to the class that the majority of those k instances belong to." 1065 | ] 1066 | }, 1067 | { 1068 | "cell_type": "code", 1069 | "execution_count": 27, 1070 | "metadata": { 1071 | "scrolled": true 1072 | }, 1073 | "outputs": [ 1074 | { 1075 | "name": "stdout", 1076 | "output_type": "stream", 1077 | "text": [ 1078 | "0.8077509534923599\n", 
1079 | "参数=5,验证集上的AUC=0.8077509534923599\n", 1080 | "0.8175383570005726\n", 1081 | "参数=7,验证集上的AUC=0.8175383570005726\n", 1082 | "0.7528462119421223\n", 1083 | "参数=2,验证集上的AUC=0.7528462119421223\n", 1084 | "0.7822092488933985\n", 1085 | "参数=3,验证集上的AUC=0.7822092488933985\n", 1086 | "0.8142750349747395\n", 1087 | "参数=6,验证集上的AUC=0.8142750349747395\n", 1088 | "最优的参数值:7\n", 1089 | "模型AUC值:0.7202998772646505\n" 1090 | ] 1091 | } 1092 | ], 1093 | "source": [ 1094 | "# Only the k parameter of kNN is tuned here: each value of k gives a different kNN model, each model is scored by cross-validated AUC, and the best one is then scored on the test set\n", 1095 | "knn_parameters = [5,7,2,3,6]\n", 1096 | "knn_cv_scores = {}\n", 1097 | "for param in knn_parameters:\n", 1098 | " knn_clf = KNeighborsClassifier(n_neighbors=param)\n", 1099 | " train_test_model(knn_clf, X_train, y_train,knn_cv_scores,param)\n", 1100 | " \n", 1101 | "knn_best_para=max(knn_cv_scores,key=knn_cv_scores.get)\n", 1102 | "print('最优的参数值:{}'.format(knn_best_para))\n", 1103 | "\n", 1104 | "# Set the best parameter, train the model, predict on the test set, and compute the model's AUC with roc_auc_score\n", 1105 | "knn_model= KNeighborsClassifier(n_neighbors=knn_best_para)\n", 1106 | "knn_model_auc = predict_auc(knn_model,X_train,y_train,X_test,y_test)" 1107 | ] 1108 | }, 1109 | { 1110 | "cell_type": "markdown", 1111 | "metadata": {}, 1112 | "source": [ 1113 | "### 2. Logistic regression model\n", 1114 | "Logistic regression is a classification model whose output lies in (0, 1). An output above a chosen threshold is assigned to class A and an output below it to class B. The underlying theory is more involved, so only this brief description is given here." 1115 | ] 1116 | }, 1117 | { 1118 | "cell_type": "code", 1119 | "execution_count": 28, 1120 | "metadata": {}, 1121 | "outputs": [ 1122 | { 1123 | "name": "stdout", 1124 | "output_type": "stream", 1125 | "text": [ 1126 | "0.8623990756280264\n", 1127 | "参数=1,验证集上的AUC=0.8623990756280264\n", 1128 | "0.8623787155370144\n", 1129 | "参数=3,验证集上的AUC=0.8623787155370144\n", 1130 | "0.8624392095715059\n", 1131 | "参数=5,验证集上的AUC=0.8624392095715059\n", 1132 | "0.8617857855030111\n", 1133 | "参数=10,验证集上的AUC=0.8617857855030111\n", 1134 | "0.8622761466118669\n", 1135 | "参数=15,验证集上的AUC=0.8622761466118669\n", 1136 | "最优的参数值:5\n", 1137 | 
"模型AUC值:0.7704903918371834\n" 1138 | ] 1139 | } 1140 | ], 1141 | "source": [ 1142 | "# Only the parameter C of logistic regression is varied here: each value gives a different model, the best value is chosen from the models' scores, and the final model is then scored\n", 1143 | "lr_parameters = [1,3,5,10,15]\n", 1144 | "lr_cv_scores = {}\n", 1145 | "for param in lr_parameters:\n", 1146 | " lr_clf = LogisticRegression(C=param)\n", 1147 | " train_test_model(lr_clf, X_train, y_train,lr_cv_scores,param)\n", 1148 | " \n", 1149 | "lr_best_para=max(lr_cv_scores,key=lr_cv_scores.get)\n", 1150 | "print('最优的参数值:{}'.format(lr_best_para))\n", 1151 | "\n", 1152 | "# Set the best parameter, train the model, predict on the test set, and compute the model's AUC with roc_auc_score\n", 1153 | "lr_model= LogisticRegression(C=lr_best_para)\n", 1154 | "lr_model_auc = predict_auc(lr_model,X_train,y_train,X_test,y_test)" 1155 | ] 1156 | }, 1157 | { 1158 | "cell_type": "markdown", 1159 | "metadata": {}, 1160 | "source": [ 1161 | "### 3. Decision tree model\n", 1162 | "A decision tree is also a classification model. It can be understood as organizing the data into a tree-shaped structure according to a set of rules in order to classify the data." 1163 | ] 1164 | }, 1165 | { 1166 | "cell_type": "code", 1167 | "execution_count": 29, 1168 | "metadata": { 1169 | "scrolled": true 1170 | }, 1171 | "outputs": [ 1172 | { 1173 | "name": "stdout", 1174 | "output_type": "stream", 1175 | "text": [ 1176 | "0.7183915142959287\n", 1177 | "参数=1,验证集上的AUC=0.7183915142959287\n", 1178 | "0.8312521096369881\n", 1179 | "参数=3,验证集上的AUC=0.8312521096369881\n", 1180 | "0.8561861778610342\n", 1181 | "参数=5,验证集上的AUC=0.8561861778610342\n", 1182 | "0.8021781689000533\n", 1183 | "参数=10,验证集上的AUC=0.8021781689000533\n", 1184 | "0.7493300003167633\n", 1185 | "参数=15,验证集上的AUC=0.7493300003167633\n", 1186 | "最优的参数值:5\n", 1187 | "模型AUC值:0.7760305631026779\n" 1188 | ] 1189 | } 1190 | ], 1191 | "source": [ 1192 | "# The tree depth of the decision tree is tuned below: each depth gives a different model, the best value is chosen from the models' scores, and the final model's AUC is then computed\n", 1193 | "dt_parameters = [1,3,5,10,15]\n", 1194 | "dt_cv_scores = {}\n", 1195 | "for param in dt_parameters:\n", 1196 | " dt_clf = DecisionTreeClassifier(max_depth=param)\n", 1197 | " train_test_model(dt_clf, X_train, y_train,dt_cv_scores,param)\n", 1198 | " \n", 
1199 | "dt_best_para=max(dt_cv_scores,key=dt_cv_scores.get)\n", 1200 | "print('最优的参数值:{}'.format(dt_best_para))\n", 1201 | "\n", 1202 | "# Set the best parameter, train the model, predict on the test set, and compute the model's AUC with roc_auc_score\n", 1203 | "dt_model= DecisionTreeClassifier(max_depth=dt_best_para)\n", 1204 | "dt_model_auc = predict_auc(dt_model,X_train,y_train,X_test,y_test)" 1205 | ] 1206 | }, 1207 | { 1208 | "cell_type": "markdown", 1209 | "metadata": {}, 1210 | "source": [ 1211 | "Comparing the three models above, the decision tree performs best because its test-set AUC is the highest (about 0.776, against 0.770 for logistic regression and 0.720 for kNN)." 1212 | ] 1213 | }, 1214 | { 1215 | "cell_type": "markdown", 1216 | "metadata": {}, 1217 | "source": [ 1218 | "### Data normalization" 1219 | ] 1220 | }, 1221 | { 1222 | "cell_type": "code", 1223 | "execution_count": 30, 1224 | "metadata": { 1225 | "scrolled": false 1226 | }, 1227 | "outputs": [ 1228 | { 1229 | "data": { 1230 | "text/html": [ 1231 | "
" 1369 | ], 1370 | "text/plain": [ 1371 | " default housing loan age balance \\\n", 1372 | "count 4441.000000 4441.000000 4441.000000 4441.000000 4441.000000 \n", 1373 | "mean 0.014411 0.471741 0.128575 41.186895 1587.306463 \n", 1374 | "std 0.119192 0.499257 0.334766 12.027061 3136.478955 \n", 1375 | "min 0.000000 0.000000 0.000000 18.000000 -1965.000000 \n", 1376 | "25% 0.000000 0.000000 0.000000 32.000000 119.000000 \n", 1377 | "50% 0.000000 0.000000 0.000000 39.000000 577.000000 \n", 1378 | "75% 0.000000 1.000000 0.000000 49.000000 1853.000000 \n", 1379 | "max 1.000000 1.000000 1.000000 93.000000 81204.000000 \n", 1380 | "\n", 1381 | " campaign day duration pdays previous \n", 1382 | "count 4441.000000 4441.000000 4441.000000 4441.00000 4441.000000 \n", 1383 | "mean 2.474218 15.548750 377.235307 53.21324 0.833146 \n", 1384 | "std 2.751711 8.417989 350.516914 111.21589 2.134833 \n", 1385 | "min 1.000000 1.000000 4.000000 -1.00000 0.000000 \n", 1386 | "25% 1.000000 8.000000 144.000000 -1.00000 0.000000 \n", 1387 | "50% 2.000000 15.000000 260.000000 -1.00000 0.000000 \n", 1388 | "75% 3.000000 21.000000 489.000000 64.00000 1.000000 \n", 1389 | "max 44.000000 31.000000 3881.000000 854.00000 37.000000 " 1390 | ] 1391 | }, 1392 | "execution_count": 30, 1393 | "metadata": {}, 1394 | "output_type": "execute_result" 1395 | } 1396 | ], 1397 | "source": [ 1398 | "# First inspect the training-set features to see how their scales compare\n", 1399 | "X_train.describe()" 1400 | ] 1401 | }, 1402 | { 1403 | "cell_type": "markdown", 1404 | "metadata": {}, 1405 | "source": [ 1406 | "Answer: the statistics above show that the features are on very different scales (for example, balance ranges from -1965 to 81204 while default is either 0 or 1), so the feature values need to be normalized." 1407 | ] 1408 | }, 1409 | { 1410 | "cell_type": "markdown", 1411 | "metadata": {}, 1412 | "source": [ 1413 | "#### Normalization\n", 1414 | "Normalize the training data and the test data" 1415 | ] 1416 | }, 1417 | { 1418 | "cell_type": "code", 1419 | "execution_count": 31, 1420 | "metadata": {}, 1421 | "outputs": [], 1422 | "source": [ 1423 | "# Normalize the training set X_train and the test set X_test with MinMaxScaler (fit on the training set only)\n", 1424 | "scaler = MinMaxScaler()\n", 1425 | 
"X_train_scaled = scaler.fit_transform(X_train.astype('float64'))\n", 1426 | "X_test_scaled = scaler.transform(X_test.astype('float64'))" 1427 | ] 1428 | }, 1429 | { 1430 | "cell_type": "markdown", 1431 | "metadata": {}, 1432 | "source": [ 1433 | "The model-training code below is similar to the code above. Readers who are interested can delete the kNN, logistic regression, and decision tree code given below and try writing it themselves to train the models." 1434 | ] 1435 | }, 1436 | { 1437 | "cell_type": "markdown", 1438 | "metadata": {}, 1439 | "source": [ 1440 | "#### kNN model" 1441 | ] 1442 | }, 1443 | { 1444 | "cell_type": "code", 1445 | "execution_count": 32, 1446 | "metadata": { 1447 | "scrolled": false 1448 | }, 1449 | "outputs": [ 1450 | { 1451 | "name": "stdout", 1452 | "output_type": "stream", 1453 | "text": [ 1454 | "0.8362348007709579\n", 1455 | "参数=5,验证集上的AUC=0.8362348007709579\n", 1456 | "0.844031883023167\n", 1457 | "参数=7,验证集上的AUC=0.844031883023167\n", 1458 | "0.795107142467294\n", 1459 | "参数=2,验证集上的AUC=0.795107142467294\n", 1460 | "0.8216005781567567\n", 1461 | "参数=3,验证集上的AUC=0.8216005781567567\n", 1462 | "0.8418551395164939\n", 1463 | "参数=6,验证集上的AUC=0.8418551395164939\n", 1464 | "最优的参数值:7\n", 1465 | "模型AUC值:0.7633043955039752\n" 1466 | ] 1467 | } 1468 | ], 1469 | "source": [ 1470 | "knn_scaler_parameters = [5,7,2,3,6]\n", 1471 | "knn_scaler_cv_scores = {}\n", 1472 | "for param in knn_scaler_parameters:\n", 1473 | " knn_scaler_clf = KNeighborsClassifier(n_neighbors=param)\n", 1474 | " train_test_model(knn_scaler_clf, X_train_scaled, y_train,knn_scaler_cv_scores,param)\n", 1475 | " \n", 1476 | "knn_scaler_best_para=max(knn_scaler_cv_scores,key=knn_scaler_cv_scores.get)\n", 1477 | "print('最优的参数值:{}'.format(knn_scaler_best_para))\n", 1478 | "\n", 1479 | "# Set the best parameter, train the model, predict on the test set, and compute the model's AUC with roc_auc_score\n", 1480 | "knn_scaler_model= KNeighborsClassifier(n_neighbors=knn_scaler_best_para)\n", 1481 | "knn_scaler_model_auc = predict_auc(knn_scaler_model,X_train_scaled,y_train,X_test_scaled,y_test)" 1482 | ] 1483 | }, 1484 | { 1485 | "cell_type": "markdown", 1486 | "metadata": {}, 1487 | 
"source": [ 1488 | "#### 逻辑回归模型" 1489 | ] 1490 | }, 1491 | { 1492 | "cell_type": "code", 1493 | "execution_count": 33, 1494 | "metadata": { 1495 | "scrolled": true 1496 | }, 1497 | "outputs": [ 1498 | { 1499 | "name": "stdout", 1500 | "output_type": "stream", 1501 | "text": [ 1502 | "0.8576315184282677\n", 1503 | "参数=1,验证集上的AUC=0.8576315184282677\n", 1504 | "0.8610214502261355\n", 1505 | "参数=3,验证集上的AUC=0.8610214502261355\n", 1506 | "0.8617095314058393\n", 1507 | "参数=5,验证集上的AUC=0.8617095314058393\n", 1508 | "0.8620709066888409\n", 1509 | "参数=10,验证集上的AUC=0.8620709066888409\n", 1510 | "0.8622355155684189\n", 1511 | "参数=15,验证集上的AUC=0.8622355155684189\n", 1512 | "最优的参数值:15\n", 1513 | "模型AUC值:0.7691338914433311\n" 1514 | ] 1515 | } 1516 | ], 1517 | "source": [ 1518 | "lr_scaler_parameters = [1,3,5,10,15]\n", 1519 | "lr_scaler_cv_scores = {}\n", 1520 | "for param in lr_scaler_parameters:\n", 1521 | " lr_scaler_clf = LogisticRegression(C=param)\n", 1522 | " train_test_model(lr_scaler_clf, X_train_scaled, y_train,lr_scaler_cv_scores,param)\n", 1523 | " \n", 1524 | "lr_scaler_best_para=max(lr_scaler_cv_scores,key=lr_scaler_cv_scores.get)\n", 1525 | "print('最优的参数值:{}'.format(lr_scaler_best_para))\n", 1526 | "\n", 1527 | "# 为模型设置最优参数,训练模型,对测试集进行预测,使用roc_auc_score计算模型的AUC值\n", 1528 | "lr_scaler_model= LogisticRegression(C=lr_scaler_best_para)\n", 1529 | "lr_scaler_model_auc = predict_auc(lr_scaler_model,X_train_scaled,y_train,X_test_scaled,y_test)" 1530 | ] 1531 | }, 1532 | { 1533 | "cell_type": "markdown", 1534 | "metadata": {}, 1535 | "source": [ 1536 | "#### 决策树模型" 1537 | ] 1538 | }, 1539 | { 1540 | "cell_type": "code", 1541 | "execution_count": 34, 1542 | "metadata": { 1543 | "scrolled": true 1544 | }, 1545 | "outputs": [ 1546 | { 1547 | "name": "stdout", 1548 | "output_type": "stream", 1549 | "text": [ 1550 | "0.7183915142959287\n", 1551 | "参数=1,验证集上的AUC=0.7183915142959287\n", 1552 | "0.8312521096369881\n", 1553 | "参数=3,验证集上的AUC=0.8312521096369881\n", 1554 | 
"0.8566254157311546\n", 1555 | "参数=5,验证集上的AUC=0.8566254157311546\n", 1556 | "0.8051230940708141\n", 1557 | "参数=10,验证集上的AUC=0.8051230940708141\n", 1558 | "0.748734443042695\n", 1559 | "参数=15,验证集上的AUC=0.748734443042695\n", 1560 | "最优的参数值:5\n", 1561 | "模型AUC值:0.7767016204516204\n" 1562 | ] 1563 | } 1564 | ], 1565 | "source": [ 1566 | "dt_scaler_parameters = [1,3,5,10,15]\n", 1567 | "dt_scaler_cv_scores = {}\n", 1568 | "for param in dt_scaler_parameters:\n", 1569 | " dt_scaler_clf = DecisionTreeClassifier(max_depth=param)\n", 1570 | " train_test_model(dt_scaler_clf, X_train_scaled, y_train,dt_scaler_cv_scores,param)\n", 1571 | " \n", 1572 | "dt_scaler_best_para=max(dt_scaler_cv_scores,key=dt_scaler_cv_scores.get)\n", 1573 | "print('最优的参数值:{}'.format(dt_scaler_best_para))\n", 1574 | "\n", 1575 | "# 为模型设置最优参数,训练模型,对测试集进行预测,使用roc_auc_score计算模型的AUC值\n", 1576 | "dt_scaler_model= DecisionTreeClassifier(max_depth=dt_scaler_best_para)\n", 1577 | "dt_scaler_model_auc = predict_auc(dt_scaler_model,X_train_scaled,y_train,X_test_scaled,y_test)" 1578 | ] 1579 | }, 1580 | { 1581 | "cell_type": "markdown", 1582 | "metadata": {}, 1583 | "source": [ 1584 | "#### 将未归一化和归一化后的数据得到的模型AUC值进行合并" 1585 | ] 1586 | }, 1587 | { 1588 | "cell_type": "code", 1589 | "execution_count": 35, 1590 | "metadata": {}, 1591 | "outputs": [ 1592 | { 1593 | "data": { 1594 | "text/html": [ 1595 | "
" 1636 | ], 1637 | "text/plain": [ 1638 | " Not Scaled (%) Scaled (%)\n", 1639 | "kNN 0.720300 0.763304\n", 1640 | "LR 0.770490 0.769134\n", 1641 | "DT 0.776031 0.776702" 1642 | ] 1643 | }, 1644 | "execution_count": 35, 1645 | "metadata": {}, 1646 | "output_type": "execute_result" 1647 | } 1648 | ], 1649 | "source": [ 1650 | "col_name = ['Not Scaled (%)', 'Scaled (%)']\n", 1651 | "row_name = ['kNN','LR','DT']\n", 1652 | "# 创建dataframe结构的变量models_auc_df,列索引设置为col_name,行索引设置为row_name,将未归一化和归一化的AUC值按照索引存放到对应的位置,\n", 1653 | "# 其中未归一化模型的AUC值分别为knn_model_auc,lr_model_auc,dt_model_auc,归一化后的模型AUC值分别为knn_scaler_model_auc,lr_scaler_model_auc,dt_scaler_model_auc\n", 1654 | "# 然后将数据models_auc_df的数据进行打印\n", 1655 | "\n", 1656 | "models_auc_df = pd.DataFrame([[knn_model_auc,knn_scaler_model_auc],[lr_model_auc,lr_scaler_model_auc],[dt_model_auc,dt_scaler_model_auc]],\n", 1657 | " columns=col_name,index=row_name)\n", 1658 | "models_auc_df" 1659 | ] 1660 | }, 1661 | { 1662 | "cell_type": "code", 1663 | "execution_count": 36, 1664 | "metadata": { 1665 | "scrolled": true 1666 | }, 1667 | "outputs": [ 1668 | { 1669 | "data": { 1670 | "text/plain": [ 1671 | "(array([0, 1, 2]), )" 1672 | ] 1673 | }, 1674 | "execution_count": 36, 1675 | "metadata": {}, 1676 | "output_type": "execute_result" 1677 | }, 1678 | { 1679 | "data": { 1680 | "text/plain": [ 1681 | "
" 1682 | ] 1683 | }, 1684 | "metadata": {}, 1685 | "output_type": "display_data" 1686 | }, 1687 | { 1688 | "data": { 1689 | "image/png": "\n", 1690 | "text/plain": [ 1691 | "
" 1692 | ] 1693 | }, 1694 | "metadata": { 1695 | "needs_background": "light" 1696 | }, 1697 | "output_type": "display_data" 1698 | } 1699 | ], 1700 | "source": [ 1701 | "# Draw a grouped bar chart for the unnormalized and normalized data; it only needs to resemble the figure below\n", 1702 | "# Visualize models_auc_df, with the legend at the lower right and a title comparing model AUC on unnormalized and normalized data\n", 1703 | "\n", 1704 | "# Set the font so that Chinese text in the legend renders correctly\n", 1705 | "plt.rcParams['font.sans-serif']=['SimHei'] \n", 1706 | "plt.rcParams['axes.unicode_minus']=False \n", 1707 | "\n", 1708 | "# Create the canvas\n", 1709 | "plt.figure(figsize=(20,12), dpi=120)\n", 1710 | "\n", 1711 | "models_auc_df.plot(kind='bar')\n", 1712 | "\n", 1713 | "plt.xticks(rotation=360)" 1714 | ] 1715 | }, 1716 | { 1717 | "cell_type": "markdown", 1718 | "metadata": {}, 1719 | "source": [ 1720 | "Comparing the AUC values of the models trained on unnormalized and normalized data shows that normalization clearly raises the AUC of the kNN model (from about 0.720 to 0.763) but has little effect on logistic regression and the decision tree." 1721 | ] 1722 | } 1723 | ], 1724 | "metadata": { 1725 | "kernelspec": { 1726 | "display_name": "Python 3", 1727 | "language": "python", 1728 | "name": "python3" 1729 | }, 1730 | "language_info": { 1731 | "codemirror_mode": { 1732 | "name": "ipython", 1733 | "version": 3 1734 | }, 1735 | "file_extension": ".py", 1736 | "mimetype": "text/x-python", 1737 | "name": "python", 1738 | "nbconvert_exporter": "python", 1739 | "pygments_lexer": "ipython3", 1740 | "version": "3.6.8" 1741 | } 1742 | }, 1743 | "nbformat": 4, 1744 | "nbformat_minor": 2 1745 | } 1746 | -------------------------------------------------------------------------------- /images/Random Forest.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/teamowu/Machine-Learning/adc76bda05eb8a1265e3e733f8e4c4d89b43893e/images/Random Forest.png -------------------------------------------------------------------------------- /images/jaccard系数.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/teamowu/Machine-Learning/adc76bda05eb8a1265e3e733f8e4c4d89b43893e/images/jaccard系数.png 
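-------------------------------------------------------------------------------- /images/余弦相似性.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/teamowu/Machine-Learning/adc76bda05eb8a1265e3e733f8e4c4d89b43893e/images/余弦相似性.png -------------------------------------------------------------------------------- /images/曼哈顿距离.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/teamowu/Machine-Learning/adc76bda05eb8a1265e3e733f8e4c4d89b43893e/images/曼哈顿距离.png -------------------------------------------------------------------------------- /images/欧式距离.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/teamowu/Machine-Learning/adc76bda05eb8a1265e3e733f8e4c4d89b43893e/images/欧式距离.png -------------------------------------------------------------------------------- /关联分析/README.md: -------------------------------------------------------------------------------- 1 | # Association analysis 2 | Association analysis looks for **rules that explain the relationships between data variables**, surfacing useful association rules in large, multi-source datasets. It is a method for discovering relationships among many kinds of data in large volumes of data. 3 | It can also mine relationships among multiple kinds of data along a time series. 4 | 5 | ## Common association algorithms 6 | - Apriori 7 | - FP-Growth 8 | - PrefixSpan 9 | - SPADE 10 | - AprioriAll 11 | - AprioriSome 12 | 13 | ## Typical sales scenarios 14 | - Market basket analysis 15 | - Optimizing product layout, e.g. a supermarket can shelve strongly associated products together so customers can pick them up in one pass. 16 | - Designing promotions, e.g. two strongly associated products bought together qualify for a discount. 17 | - Fast product recommendation, common in e-commerce, e.g. when a customer views a product, the page recommends items under rules such as "frequently bought together" or "90% of customers also viewed the following products". 18 | 19 | ## Key metrics in association analysis 20 | - Support: the share of all transactions that contain both the antecedent and the consequent of the rule. 21 | - Confidence: the share of transactions containing the antecedent that also contain the consequent. 22 | - Lift: when Lift > 1, applying the association rule produces better results than not applying it; when Lift < 1, the rule indicates a negative association and is treated as an invalid rule. 23 | - When evaluating association rules, consider support, confidence, and lift together; for support and confidence, larger values are better. 24 | 25 | **Frequent rules & effective rules**: 26 | - Frequent rule: a rule whose support and confidence are both relatively high. 27 | - Effective rule: a rule that genuinely lifts the antecedent or consequent it contains. 28 | - Frequent rule != effective rule 29 | 30 | ## More application scenarios for association analysis 31 | **Association analysis within the same dimension**: 32 | - Website page-view association analysis 33 | - Advertising traffic association analysis 34 | - User search-keyword association analysis 35 | 36 | **Cross-dimension association analysis**: 37 | - Association analysis across different scenarios 38 | - Event analysis within the same scenario 39 | -------------------------------------------------------------------------------- 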
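The support, confidence, and lift metrics named in the association-analysis README above can be computed directly from transaction data. The sketch below is only an illustration under stated assumptions, not code from this repository: the function names `support`, `confidence`, and `lift` and the toy `baskets` are hypothetical, and each transaction is assumed to be a set of item names.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent).

    Assumes the antecedent occurs at least once; otherwise this divides by zero.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence of the rule divided by the baseline support of the consequent."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)


# Hypothetical toy data: five shopping baskets.
baskets = [frozenset(b) for b in (
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
)]

rule = ({"diapers"}, {"beer"})
print("support   :", support(baskets, rule[0] | rule[1]))
print("confidence:", confidence(baskets, *rule))
print("lift      :", lift(baskets, *rule))
```

On these toy baskets the rule diapers → beer has a lift slightly above 1, so by the README's criterion it would count as a (weakly) effective rule; a real analysis would also apply minimum support and confidence thresholds before looking at lift.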
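-------------------------------------------------------------------------------- /分类算法/README.md: -------------------------------------------------------------------------------- 1 | # 2.2 How to choose a classification algorithm 2 | - Text classification: **naive Bayes**, e.g. spam detection in email. 3 | - If the training set is small, a high-bias, low-variance classifier works better, such as **naive Bayes or support vector machines, because these algorithms are less prone to overfitting**. 4 | - If computation time and ease of use matter most, support vector machines and artificial neural networks are not good choices. 5 | - If accuracy matters most, choose higher-accuracy methods such as **support vector machines or Boosting-based ensembles like GBDT and XGBoost**. 6 | - If stable results or model robustness matter most, choose **Bagging-based ensembles such as random forests or voting ensembles**. 7 | - If you need probabilities for the predictions and will **build further applications on those probabilities, logistic regression is a good choice**. 8 | - If you are **concerned about outliers or non-separable data and need clear decision rules, choose a decision tree**. 9 | -------------------------------------------------------------------------------- /回归分析/README.md: -------------------------------------------------------------------------------- 1 | # Regression analysis 2 | How to choose a regression algorithm? 3 | - Simple linear regression. Suits datasets with a simple structure and a clearly linear relationship. 4 | - With few independent variables, or when dimensionality reduction yields two usable dimensions (including the predicted variable), a scatter plot reveals the relationship between the independent and dependent variables directly, and the best regression method can be chosen from it. 5 | - If a basic check shows strong collinearity among the independent variables, use an algorithm that handles multicollinearity (highly correlated independent variables) flexibly, such as ridge regression. 6 | - If **the dataset is noisy, principal component regression is recommended**, because choosing which principal components enter the regression can remove noise. In addition, the principal components are mutually orthogonal, which resolves the 7 | collinearity problem in multiple linear regression. Both effects improve the model's robustness to interference. 8 | - With high-dimensional variables, regularized regression methods such as Lasso, Ridge, and ElasticNet work best; alternatively, use stepwise regression to select the significant independent variables and build the regression model from them. 9 | - To validate several algorithms at once and pick the best fit, use cross-validation to compare the models, and evaluate them together using R-squared, adjusted R-squared, AIC, BIC, and the various residual and error 10 | metrics. 11 | - If interpretability matters, easy-to-understand linear, exponential, logarithmic, and quadratic or polynomial regression are better suited than kernel regression or support vector machines. 12 | - Ensemble or combined regression. If you have shortlisted several methods but cannot decide among them, combine the regression models, determining the final output from their results by weighting, averaging, or similar. 13 | -------------------------------------------------------------------------------- /聚类分析/README.md: -------------------------------------------------------------------------------- 1 | # How to choose a clustering algorithm 2 | There are dozens of clustering algorithms. The choice mainly depends on the following factors: 3 | - If the data is high-dimensional, choose **spectral clustering**, which is a form of subspace partitioning. 4 | - If the data is small to medium in size, e.g. **under 1 million records**, k-means is a good choice; **above 1 million records**, consider **MiniBatchKMeans**. 5 | - If the dataset contains **noise points** (outliers), density-based **DBSCAN** handles them effectively. 6 | - If higher **clustering accuracy** is the goal, **spectral clustering** is more accurate than k-means. 7 | -------------------------------------------------------------------------------- 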
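/聚类分析/客户特征的聚类与探索性分析.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Project background\n", 8 | "\n", 9 | "One day the business team brought some customer data to the data team. They had no starting point for analysis and hoped the data team could analyze the data to give them some insights, or suggestions for follow-up analysis and business thinking.\n", 10 | "\n", 11 | "Given this scenario, the deliverables for this analysis are:\n", 12 | "- This is an EDA task, and the business side has no prior experience to offer the data team.\n", 13 | "- The results will be used to inform the business or to support deeper follow-up analysis.\n", 14 | "- The work should include data mining beyond descriptive statistics and basic exploratory displays.\n", 15 | "\n", 16 | "#### Data source features:\n", 17 | "- USER_ID: user ID column, integer. It is the unique user identifier, so it cannot be used as a clustering feature; it only tags which cluster a user belongs to after clustering.\n", 18 | "- AVG_ORDERS: average number of orders per user, float.\n", 19 | "- AVG_MONEY: average order value, i.e. the price per order, float.\n", 20 | "- IS_ACTIVE: whether the user is active, produced by another model, string.\n", 21 | "- SEX: gender, with 0, 1, and 2 denoting unknown, male, and female.\n", 22 | "\n", 23 | "#### Analysis approach:\n", 24 | "- String features cannot be used for training directly, because sklearn estimators generally take numeric arrays or sparse matrices, not raw strings.\n", 25 | "- SEX is essentially a categorical variable and cannot take part in distance computation directly.\n", 26 | "- AVG_ORDERS and AVG_MONEY differ markedly in scale and need to be rescaled.\n", 27 | "- Split off the ID column." 28 | ] 29 | }, 30 | { 31 | "cell_type": "code", 32 | "execution_count": 57, 33 | "metadata": {}, 34 | "outputs": [], 35 | "source": [ 36 | "# Import packages\n", 37 | "import pandas as pd\n", 38 | "import numpy as np\n", 39 | "import matplotlib.pyplot as plt\n", 40 | "%matplotlib inline\n", 41 | "from sklearn.preprocessing import MinMaxScaler\n", 42 | "from sklearn.cluster import KMeans\n", 43 | "from sklearn.metrics import calinski_harabasz_score as calinski_harabaz_score, silhouette_score # newer scikit-learn releases spell this calinski_harabasz_score; aliased so later calls are unchanged" 44 | ] 45 | }, 46 | { 47 | "cell_type": "code", 48 | "execution_count": 15, 49 | "metadata": {}, 50 | "outputs": [], 51 | "source": [ 52 | "raw_data = pd.read_csv('cluster.txt')\n", 53 | "# Numeric features\n", 54 | "numeric_feature = raw_data.iloc[:,1:3]" 55 | ] 56 | }, 57 | { 58 | "cell_type": "code", 59 | "execution_count": 18, 60 | "metadata": {}, 61 | "outputs": [ 62 | { 63 | "name": "stdout", 64 | "output_type": "stream", 65 | "text": [ 66 | "[[0.64200477 0.62591687]\n", 67 | " [0.91169451 0.80440098]\n", 68 | " [0.69451074 0.39608802]\n", 69 | " ...\n", 70 | " [0.3221957 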
0.17359413]\n", 71 | " [0.42004773 0.31295844]\n", 72 | " [0.64916468 0.40831296]]\n" 73 | ] 74 | } 75 | ], 76 | "source": [ 77 | "# Min-max normalization\n", 78 | "scaler = MinMaxScaler()\n", 79 | "scaled_numeric_feature = scaler.fit_transform(numeric_feature)\n", 80 | "print(scaled_numeric_feature[:,:2])" 81 | ] 82 | }, 83 | { 84 | "cell_type": "code", 85 | "execution_count": 25, 86 | "metadata": {}, 87 | "outputs": [ 88 | { 89 | "data": { 90 | "text/plain": [ 91 | "KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,\n", 92 | " n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',\n", 93 | " random_state=0, tol=0.0001, verbose=0)" 94 | ] 95 | }, 96 | "execution_count": 25, 97 | "metadata": {}, 98 | "output_type": "execute_result" 99 | } 100 | ], 101 | "source": [ 102 | "# Train the model\n", 103 | "n_cluster = 3\n", 104 | "model_kmeans = KMeans(n_clusters = n_cluster, random_state=0)\n", 105 | "model_kmeans.fit(scaled_numeric_feature)" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": 27, 111 | "metadata": {}, 112 | "outputs": [ 113 | { 114 | "name": "stdout", 115 | "output_type": "stream", 116 | "text": [ 117 | "sample: 1000 \t features: 4\n" 118 | ] 119 | } 120 | ], 121 | "source": [ 122 | "# Model evaluation\n", 123 | "n_samples,n_features = raw_data.iloc[:,1:].shape # total number of samples, total number of features\n", 124 | "print('sample: %d \\t features: %d' % (n_samples,n_features))" 125 | ] 126 | }, 127 | { 128 | "cell_type": "code", 129 | "execution_count": 31, 130 | "metadata": {}, 131 | "outputs": [ 132 | { 133 | "name": "stdout", 134 | "output_type": "stream", 135 | "text": [ 136 | "\n", 137 | " unspuervised_score: \n", 138 | " ------------------------------------------------------------\n", 139 | " silh c&h\n", 140 | "0 0.634086 2860.821834\n" 141 | ] 142 | } 143 | ], 144 | "source": [ 145 | "# Unsupervised evaluation metrics\n", 146 | "silhouette_s = silhouette_score(scaled_numeric_feature, model_kmeans.labels_, metric='euclidean')\n", 147 | "calinski_harabaz_s = 
calinski_harabaz_score(scaled_numeric_feature, model_kmeans.labels_) # Calinski-Harabasz score\n", 148 | "unspuervised_data = {'silh':[silhouette_s], 'c&h':[calinski_harabaz_s]}\n", 149 | "unspuervised_score = pd.DataFrame.from_dict(unspuervised_data)\n", 150 | "print(\"\\n\",'unspuervised_score:', '\\n', '-'*60)\n", 151 | "print(unspuervised_score)" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "The results above indicate that the clustering is reasonably good. Taking silh (the silhouette coefficient) as an example, a value above 0.5 suggests good clustering quality, and here it is about 0.634. The basic criterion for quality is whether the clusters are clearly distinguishable from one another." 159 | ] 160 | }, 161 | { 162 | "cell_type": "code", 163 | "execution_count": 35, 164 | "metadata": {}, 165 | "outputs": [ 166 | { 167 | "name": "stdout", 168 | "output_type": "stream", 169 | "text": [ 170 | " USER_ID AVG_ORDERS AVG_MONEY IS_ACTIVE SEX labels\n", 171 | "0 1 3.58 40.43 活跃 1 2\n", 172 | "1 2 4.71 41.16 不活跃 1 2\n", 173 | "2 3 3.80 39.49 不活跃 2 1\n", 174 | "3 4 2.85 38.36 不活跃 1 0\n", 175 | "4 5 3.71 38.34 活跃 1 1\n" 176 | ] 177 | } 178 | ], 179 | "source": [ 180 | "# Merge the data with the cluster labels\n", 181 | "kmeans_labels = pd.DataFrame(model_kmeans.labels_, columns = ['labels'])\n", 182 | "# Combine the raw data and the labels\n", 183 | "kmeans_data = pd.concat([raw_data, kmeans_labels], axis=1)\n", 184 | "print(kmeans_data.head())" 185 | ] 186 | }, 187 | { 188 | "cell_type": "code", 189 | "execution_count": 41, 190 | "metadata": {}, 191 | "outputs": [ 192 | { 193 | "name": "stdout", 194 | "output_type": "stream", 195 | "text": [ 196 | " record_count record_rate\n", 197 | "labels \n", 198 | "0 332 0.332\n", 199 | "1 337 0.337\n", 200 | "2 331 0.331\n" 201 | ] 202 | } 203 | ], 204 | "source": [ 205 | "# Sample count and share for each cluster\n", 206 | "label_count = kmeans_data.groupby(['labels'])['SEX'].count()\n", 207 | "label_count_ratio = label_count / kmeans_data.shape[0]\n", 208 | "kmeans_record_count = pd.concat([label_count,label_count_ratio], axis=1)\n", 209 | "kmeans_record_count.columns = ['record_count', 'record_rate']\n", 210 | "print(kmeans_record_count.head())" 211 | ] 212 | }, 213 | { 214 | 
"cell_type": "code", 215 | "execution_count": 44, 216 | "metadata": {}, 217 | "outputs": [ 218 | { 219 | "name": "stdout", 220 | "output_type": "stream", 221 | "text": [ 222 | " AVG_ORDERS AVG_MONEY\n", 223 | "labels \n", 224 | "0 2.022349 38.980602\n", 225 | "1 3.987389 39.028754\n", 226 | "2 3.958610 40.996254\n" 227 | ] 228 | } 229 | ], 230 | "source": [ 231 | "# Mean of the numeric features for each cluster\n", 232 | "kmeans_numeric_features = kmeans_data.groupby(['labels'])[['AVG_ORDERS', 'AVG_MONEY']].mean()\n", 233 | "print(kmeans_numeric_features)" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": 52, 239 | "metadata": {}, 240 | "outputs": [], 241 | "source": [ 242 | "# Share of each categorical feature value within each cluster\n", 243 | "active_list = []\n", 244 | "sex_gb_list = []\n", 245 | "unique_labels = np.unique(model_kmeans.labels_)\n", 246 | "for each_label in unique_labels:\n", 247 | " each_data = kmeans_data[kmeans_data['labels']==each_label]\n", 248 | " active_list.append(each_data.groupby(['IS_ACTIVE'])['USER_ID'].count()/each_data.shape[0])\n", 249 | " sex_gb_list.append(each_data.groupby(['SEX'])['USER_ID'].count()/each_data.shape[0])\n", 250 | "\n", 251 | "kmeans_active_pd = pd.DataFrame(active_list)\n", 252 | "kmeans_sex_gb_pd = pd.DataFrame(sex_gb_list)\n", 253 | "kmeans_string_features = pd.concat((kmeans_active_pd,kmeans_sex_gb_pd), axis=1)\n", 254 | "kmeans_string_features.index = unique_labels" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": 53, 260 | "metadata": {}, 261 | "outputs": [], 262 | "source": [ 263 | "# Merge the analysis results for all clusters" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": 55, 269 | "metadata": {}, 270 | "outputs": [ 271 | { 272 | "name": "stdout", 273 | "output_type": "stream", 274 | "text": [ 275 | " record_count record_rate AVG_ORDERS AVG_MONEY 不活跃 活跃 \\\n", 276 | "0 332 0.332 2.022349 38.980602 0.487952 0.512048 \n", 277 | "1 337 0.337 3.987389 39.028754 0.495549 0.504451 \n", 278 | "2 331 0.331 3.958610 40.996254 0.504532 
0.495468 \n", 279 | "\n", 280 | " 0 1 2 \n", 281 | "0 0.003012 0.990964 0.006024 \n", 282 | "1 0.014837 0.014837 0.970326 \n", 283 | "2 0.984894 0.009063 0.006042 \n" 284 | ] 285 | } 286 | ], 287 | "source": [ 288 | "features_all = pd.concat((kmeans_record_count,kmeans_numeric_features,kmeans_string_features), axis=1)\n", 289 | "print(features_all.head())" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": 59, 295 | "metadata": {}, 296 | "outputs": [ 297 | { 298 | "data": { 299 | "image/png": "\n", 300 | "text/plain": [ 301 | "
" 302 | ] 303 | }, 304 | "metadata": {}, 305 | "output_type": "display_data" 306 | } 307 | ], 308 | "source": [ 309 | "# Visualize the results\n", 310 | "# part 1 global configuration\n", 311 | "fig = plt.figure(figsize=(10, 7))\n", 312 | "titles = ['RECORD_RATE','AVG_ORDERS','AVG_MONEY','IS_ACTIVE','SEX'] # shared column titles\n", 313 | "line_index,col_index = 3,5 # grid size (rows, columns)\n", 314 | "ax_ids = np.arange(1,16).reshape(line_index,col_index) # subplot index grid\n", 315 | "plt.rcParams['font.sans-serif']=['SimHei'] # font needed to render the Chinese labels\n", 316 | " \n", 317 | "# part 2 plot the share of each of the three clusters\n", 318 | "pie_fracs = features_all['record_rate'].tolist()\n", 319 | "for ind in range(len(pie_fracs)):\n", 320 | "    ax = fig.add_subplot(line_index, col_index, ax_ids[:,0][ind])\n", 321 | "    init_labels = ['','',''] # start with empty labels\n", 322 | "    init_labels[ind] = 'cluster_{0}'.format(ind) # label the current cluster\n", 323 | "    init_colors = ['lightgray', 'lightgray', 'lightgray']\n", 324 | "    init_colors[ind] = 'g' # highlight the current cluster's slice\n", 325 | "    ax.pie(x=pie_fracs, autopct='%3.0f %%',labels=init_labels,colors=init_colors)\n", 326 | "    ax.set_aspect('equal') # draw the pie as a circle\n", 327 | "    if ind == 0:\n", 328 | "        ax.set_title(titles[0])\n", 329 | " \n", 330 | "# part 3 plot the mean of AVG_ORDERS\n", 331 | "avg_orders_label = 'AVG_ORDERS'\n", 332 | "avg_orders_fraces = features_all[avg_orders_label]\n", 333 | "for ind, frace in enumerate(avg_orders_fraces):\n", 334 | "    ax = fig.add_subplot(line_index, col_index, ax_ids[:,1][ind])\n", 335 | "    ax.bar(x=unique_labels,height=[0,avg_orders_fraces[ind],0]) # draw the bar chart\n", 336 | "    ax.set_ylim((0, max(avg_orders_fraces)*1.2))\n", 337 | "    ax.set_xticks([])\n", 338 | "    ax.set_yticks([])\n", 339 | "    if ind == 0: # set the column title\n", 340 | "        ax.set_title(titles[1])\n", 341 | "    # add the value label and the x-axis label for each bar\n", 342 | "    ax.text(unique_labels[1],frace+0.4,s='{:.2f}'.format(frace),ha='center',va='top')\n", 343 | "    ax.text(unique_labels[1],-0.4,s=avg_orders_label,ha='center',va='bottom')\n", 344 | " \n", 345 | "# part 4 plot the mean of AVG_MONEY\n", 346 | "avg_money_label = 'AVG_MONEY'\n", 347 | 
"avg_money_fraces = features_all[avg_money_label]\n", 348 | "for ind, frace in enumerate(avg_money_fraces):\n", 349 | "    ax = fig.add_subplot(line_index, col_index, ax_ids[:,2][ind])\n", 350 | "    ax.bar(x=unique_labels,height=[0,avg_money_fraces[ind],0]) # draw the bar chart\n", 351 | "    ax.set_ylim((0, max(avg_money_fraces)*1.2))\n", 352 | "    ax.set_xticks([])\n", 353 | "    ax.set_yticks([])\n", 354 | "    if ind == 0: # set the column title\n", 355 | "        ax.set_title(titles[2])\n", 356 | "    # add the value label and the x-axis label for each bar\n", 357 | "    ax.text(unique_labels[1],frace+4,s='{:.0f}'.format(frace),ha='center',va='top')\n", 358 | "    ax.text(unique_labels[1],-4,s=avg_money_label,ha='center',va='bottom')\n", 359 | " \n", 360 | "# part 5 plot the activity distribution\n", 361 | "axtivity_labels = ['不活跃','活跃']\n", 362 | "x_ticket = [i for i in range(len(axtivity_labels))]\n", 363 | "activity_data = features_all[axtivity_labels]\n", 364 | "ylim_max = np.max(np.max(activity_data))\n", 365 | "for ind,each_data in enumerate(activity_data.values):\n", 366 | "    ax = fig.add_subplot(line_index, col_index, ax_ids[:,3][ind])\n", 367 | "    ax.bar(x=x_ticket,height=each_data) # draw the bar chart\n", 368 | "    ax.set_ylim((0, ylim_max*1.2))\n", 369 | "    ax.set_xticks([])\n", 370 | "    ax.set_yticks([])\n", 371 | "    if ind == 0: # set the column title\n", 372 | "        ax.set_title(titles[3])\n", 373 | "    # add the value label and the x-axis label for each bar\n", 374 | "    activity_values = ['{:.1%}'.format(i) for i in each_data]\n", 375 | "    for i in range(len(x_ticket)):\n", 376 | "        ax.text(x_ticket[i],each_data[i]+0.05,s=activity_values[i],ha='center',va='top')\n", 377 | "        ax.text(x_ticket[i],-0.05,s=axtivity_labels[i],ha='center',va='bottom')\n", 378 | " \n", 379 | "# part 6 plot the gender distribution\n", 380 | "sex_data = features_all.iloc[:,-3:]\n", 381 | "x_ticket = [i for i in range(sex_data.shape[1])] # one bar per SEX column\n", 382 | "sex_labels = ['SEX_{}'.format(i) for i in range(3)]\n", 383 | "ylim_max = np.max(np.max(sex_data))\n", 384 | "for ind,each_data in enumerate(sex_data.values):\n", 385 | "    ax = fig.add_subplot(line_index, col_index, ax_ids[:,4][ind])\n", 386 | "    
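ax.bar(x=x_ticket,height=each_data) # draw the bar chart\n", 387 | "    ax.set_ylim((0, ylim_max*1.2))\n", 388 | "    ax.set_xticks([])\n", 389 | "    ax.set_yticks([])\n", 390 | "    if ind == 0: # set the column title\n", 391 | "        ax.set_title(titles[4])\n", 392 | "    # add the value label and the x-axis label for each bar\n", 393 | "    sex_values = ['{:.1%}'.format(i) for i in each_data]\n", 394 | "    for i in range(len(x_ticket)):\n", 395 | "        ax.text(x_ticket[i],each_data[i]+0.1,s=sex_values[i],ha='center',va='top')\n", 396 | "        ax.text(x_ticket[i],-0.1,s=sex_labels[i],ha='center',va='bottom')\n", 397 | " \n", 398 | "plt.tight_layout(pad=0.8) # tighten the spacing between subplots" 399 | ] 400 | }, 401 | { 402 | "cell_type": "markdown", 403 | "metadata": {}, 404 | "source": [ 405 | "# Conclusion\n", 406 | "\n", 407 | "After clustering, the customers fall into 3 groups:\n", 408 | "- cluster_0: its distinguishing features are a low average order count (only 2.02) and a predominantly male customer base;\n", 409 | "- cluster_1: a high average order count (3.99) and predominantly female customers;\n", 410 | "- cluster_2: similar to cluster_1, but the customers' gender is unknown.\n", 411 | "\n", 412 | "Average order value and activity level are distributed fairly consistently and evenly across all clusters, so they neither separate the clusters nor characterize any one of them. They are therefore ignored.\n", 413 | "\n", 414 | "In the end we obtain 3 groups: **low-value male customers, high-value female customers, and high-value customers of unknown gender.**\n", 415 | "\n", 416 | "**Follow-up analysis directions**:\n", 417 | "- The unknown-gender group would not be expected to have such a high average order value, and more importantly its sample size is not small. This is therefore unlikely to be a random occurrence. There may well be a problem somewhere, for example in data collection, customer experience, or customer registration, or these customers may simply be unwilling to disclose their gender. This could be the starting point for another EDA topic.\n", 418 | "- For the second group, the high-value female customers, a preference and profile analysis is worthwhile: for example, check whether their purchase times, average order value, product categories, preferred discount levels, acquisition channels, and promotion types show any clear concentration." 419 | ] 420 | } 421 | ], 422 | "metadata": { 423 | "kernelspec": { 424 | "display_name": "Python 3", 425 | "language": "python", 426 | "name": "python3" 427 | }, 428 | "language_info": { 429 | "codemirror_mode": { 430 | "name": "ipython", 431 | "version": 3 432 | }, 433 | "file_extension": ".py", 434 | "mimetype": "text/x-python", 435 | "name": "python", 436 | "nbconvert_exporter": "python", 437 | "pygments_lexer": "ipython3", 438 | "version": "3.6.8" 439 | } 440 | }, 441 | "nbformat": 4, 442 | "nbformat_minor": 2 443 | } 444 | --------------------------------------------------------------------------------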