├── README.md
├── bank_telemarketing_data_analysis.ipynb
├── images
│   ├── Random Forest.png
│   ├── jaccard系数.png
│   ├── 余弦相似性.png
│   ├── 曼哈顿距离.png
│   └── 欧式距离.png
├── 关联分析
│   └── README.md
├── 分类算法
│   ├── README.md
│   └── 用户流失预测分析与应用.ipynb
├── 回归分析
│   ├── README.md
│   └── 大型促销活动前的销售预测.ipynb
└── 聚类分析
    ├── README.md
    └── 客户特征的聚类与探索性分析.ipynb
/README.md:
--------------------------------------------------------------------------------
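1 | # Machine Learning
2 | - Supervised learning: the training samples carry labels (**y values**); the model learns the patterns in the labeled samples and uses them to predict the labels of new, unseen samples.
3 |   - Regression
4 |   - Classification
5 | - Unsupervised learning: the labels of the training samples are unknown; the goal is to reveal the intrinsic properties, structure, and information of the training samples, providing a basis for further data mining.
6 |   - Clustering
7 | ## 1 Regression
8 | Regression is a data analysis method that studies how independent variables x affect a dependent variable y. Its main use is **prediction and control**, for example in planning, KPI setting, and target setting. Predicted values can also be compared with actual values to gauge how far an event has progressed and to guide future action.
9 |
10 | Regression analysis can be used to quantify the relationship between the independent and dependent variables (given x, estimate y), and to determine the direction of the influence (positive or negative).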
11 |
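12 | **Commonly used regression algorithms include**:
13 | - Linear regression
14 | - Polynomial (quadratic) regression
15 | - Logarithmic regression
16 | - Exponential regression
17 | - Kernel SVM
18 | - Ridge regression
19 | - Lasso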
20 |
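21 | Advantages:
22 | - The fitted pattern and the results are easy to interpret; linear regression, for example, is expressed as y = ax + b.
23 | - In business applications built on an explicit formula, predictions are obtained by direct substitution, so the model is easy to apply.
24 |
25 | Disadvantages:
26 | - It can only analyze relationships among a small number of variables and cannot handle interactions among a very large number of variables, in particular the joint effect of several variables on the dependent variable.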
27 |
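28 | ### 1.1 Watch for collinearity among regression variables
29 | Four indicators for detecting collinearity:
30 | - Tolerance: lies in [0, 1]. Regress each independent variable (x) on the remaining independent variables, treating it as the dependent variable; the tolerance is the residual proportion 1 - R² of that regression. The smaller the value, the more likely a collinearity problem.
31 | - Variance inflation factor (VIF): the reciprocal of the tolerance; the larger the value, the more pronounced the collinearity. 10 is the usual threshold: VIF < 10, no multicollinearity; 10 <= VIF < 100, strong multicollinearity; VIF >= 100, severe multicollinearity.
32 | - Eigenvalues: run a principal component analysis on the independent variables; if the eigenvalues of several dimensions are close to 0, serious collinearity may be present.
33 | - Correlation coefficient: R > 0.8 suggests that a strong correlation may exist.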
34 |
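The tolerance and VIF definitions above can be illustrated with a small sketch. It assumes the simplest case of one predictor checked against a single other predictor, where the auxiliary regression is a simple linear regression; with more predictors the auxiliary regression would be a multiple regression. The function names and the data are hypothetical.

```python
from statistics import mean


def r_squared(x, y):
    """R^2 of the simple linear regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def tolerance_and_vif(target, other):
    """Tolerance (1 - R^2) and VIF (1 / tolerance) of `target` given one other predictor."""
    tolerance = 1.0 - r_squared(other, target)
    vif = float("inf") if tolerance == 0 else 1.0 / tolerance
    return tolerance, vif


# Hypothetical predictors: x2 is almost exactly 2 * x1, x3 is nearly unrelated to x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
x3 = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0]

tol_12, vif_12 = tolerance_and_vif(x1, x2)
tol_13, vif_13 = tolerance_and_vif(x1, x3)
print(f"x1 vs x2: tolerance={tol_12:.4f}, VIF={vif_12:.1f}")
print(f"x1 vs x3: tolerance={tol_13:.4f}, VIF={vif_13:.2f}")
```

In this made-up data the x1/x2 pair lands far above the threshold of 10, while the x1/x3 pair stays close to 1.
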
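35 | **Five common ways to address collinearity**:
36 | - Increase the sample size
37 | - Ridge regression
38 | - Stepwise regression
39 | - Principal component regression
40 | - Remove variables manually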
41 |
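42 | ### 1.2 Relationship between the correlation coefficient, the coefficient of determination, and the regression coefficient
43 | Suppose a regression equation y = 42.738x + 169.94 with R² = 0.5252. A correlation analysis of the same two variables also gives the correlation coefficient R = 0.72468551874050.
44 |
45 | Here 42.738 is the **regression coefficient** of the independent variable x, 0.5252 is the **coefficient of determination** of the equation, and 0.7246... is the **correlation coefficient** of the two variables. For a simple linear regression the coefficient of determination is the square of the correlation coefficient (0.7247² ≈ 0.5252).
46 | - Coefficient of determination: the proportion of the variance of the dependent variable explained by the independent variables, computed as the ratio of the regression sum of squares to the total sum of squares.
47 | - Correlation coefficient: measures how closely the variables are related; in essence it is a test of linear correlation.
48 |
49 | How the three relate:
50 | - The coefficient of determination reflects the **joint influence of all independent variables in the model on the dependent variable**, not the influence of any single independent variable.
51 | - Regression coefficient and correlation coefficient: in a simple linear regression, a regression coefficient > 0 means the correlation coefficient lies in (0, 1] and the two variables are positively correlated; a regression coefficient < 0 means the correlation coefficient lies in [-1, 0) and they are negatively correlated.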
52 |
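The relationship among the three quantities can be checked numerically. The sketch below fits a simple linear regression by least squares and computes the regression coefficient, the correlation coefficient, and the coefficient of determination as the ratio of the regression sum of squares to the total sum of squares. The function name and the data are hypothetical and are not the data behind the equation quoted above.

```python
from math import sqrt
from statistics import mean


def fit_simple_regression(x, y):
    """Return (slope, intercept, correlation coefficient, coefficient of determination)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    sst = sum((b - my) ** 2 for b in y)           # total sum of squares
    slope = sxy / sxx                             # regression coefficient
    intercept = my - slope * mx
    # Regression sum of squares: squared deviations of the fitted values from the mean.
    ssr = sum((intercept + slope * a - my) ** 2 for a in x)
    r = sxy / sqrt(sxx * sst)                     # correlation coefficient
    return slope, intercept, r, ssr / sst


# Hypothetical data with a clear upward trend.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3.1, 4.8, 7.2, 8.9, 11.3, 12.7, 15.2, 16.8]

slope, intercept, r, r2 = fit_simple_regression(x, y)
# Flipping the sign of y flips the sign of the slope and of r, but leaves R^2 unchanged.
neg_slope, _, neg_r, neg_r2 = fit_simple_regression(x, [-v for v in y])

print(f"slope={slope:.3f}, r={r:.4f}, R^2={r2:.4f}, r^2={r * r:.4f}")
print(f"negated y: slope={neg_slope:.3f}, r={neg_r:.4f}, R^2={neg_r2:.4f}")
```
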
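53 | ## 2 Classification algorithms
54 | - Supervised learning algorithms that model or predict a **discrete random variable**.
55 | - Use cases include email filtering, financial fraud detection, employee turnover prediction, and other tasks whose output is a category.
56 | - Classification algorithms are suited to predicting a category (or the probability of a category) rather than a continuous value.
57 |
58 | ### 2.1 Applications of classification algorithms
59 | - Prediction
60 | - Extracting application rules
61 | - Extracting variable features
62 | - Handling missing values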
63 |
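64 | ### 2.2 (Basics) Decision Tree
65 | A decision tree is a tree structure (binary or non-binary).
66 |
67 | Each non-leaf node represents a test on a **feature attribute**, each branch represents the outcome of that test over a range of values, and each leaf node stores a class.
68 |
69 | Making a decision with a decision tree starts at the **root node**: test the corresponding feature attribute of the item being classified, follow the branch that matches its value, and repeat until a leaf node is reached; the class stored in that **leaf node** is the decision result.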
70 |
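The traversal described above can be sketched in a few lines. The tree below is hand-written for illustration, not learned from data; its features (`duration`, `housing`), thresholds, and class labels are hypothetical, as is the dictionary layout used to represent nodes.

```python
# Internal node: {"feature": name, "branches": [(test, subtree), ...]}
# Leaf node:     {"label": class}
tree = {
    "feature": "duration",
    "branches": [
        (lambda v: v < 200, {"label": 0}),
        (lambda v: v >= 200, {
            "feature": "housing",
            "branches": [
                (lambda v: v == "yes", {"label": 0}),
                (lambda v: v == "no", {"label": 1}),
            ],
        }),
    ],
}


def classify(node, item):
    """Walk from the root to a leaf, following the branch whose test matches the item."""
    while "label" not in node:
        value = item[node["feature"]]
        for test, child in node["branches"]:
            if test(value):
                node = child
                break
        else:
            raise ValueError(f"no branch matches {node['feature']}={value!r}")
    return node["label"]


print(classify(tree, {"duration": 150, "housing": "no"}))
print(classify(tree, {"duration": 300, "housing": "no"}))
```
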
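71 | **Advantages:**
72 | - Works with any type of data (categorical variables are the most common case)
73 | - Intuitive; a decision tree can be visualized, which makes it easy to understand
74 | - Predictions are simple and highly interpretable
75 | - Suitable for small datasets
76 |
77 | **Disadvantages:**
78 | - Tends to perform poorly when the data contain continuous attributes
79 | - Unstable: a small perturbation or change in the data can alter the whole tree
80 | - Errors grow quickly as the number of attributes increases
81 | - Easily grows an overly complex tree on the training data, which leads to overfitting.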
82 |
83 | ### 2.3 Random Forest
84 | 
85 |
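86 | **Advantages:**
87 | - Random forests are less prone to overfitting
88 | - Good robustness to noise
89 | - Can handle very high-dimensional data (many features) without feature selection
90 | - Fast to train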
91 |
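92 | ## 3 Clustering
93 | - A form of unsupervised machine learning (that is, the **data are unlabeled**)
94 | - The algorithm looks for natural groups of observations (clusters) based on the internal structure of the data
95 | - Use cases include customer segmentation, news clustering, article recommendation, and so on.
96 |
97 | **Common similarity measures**:
98 | - Euclidean distance
99 |   - Definition: the true (straight-line) distance between two points in m-dimensional space, or the natural length of a vector (the distance from the point to the origin)
100 |   - Use: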
101 |
102 | 
103 |
104 | - Manhattan distance
105 |   - Definition: the sum of the absolute differences between two points along each axis of a standard coordinate system.
106 |   - Use:
107 |
108 | 
109 |
110 | - Cosine similarity
111 |   - Definition: evaluates the similarity of two vectors by the cosine of the angle between them.
112 |   - Use: news classification
113 |
114 | 
115 |
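116 | - Jaccard coefficient
117 |   - Definition: for two sets A and B, the Jaccard coefficient is the size of the intersection of A and B divided by the size of their union.
118 |   - Use: compares the similarity and difference between finite sample sets. The larger the Jaccard coefficient, the more similar the samples.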
119 |
120 | 
121 |
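The four measures defined above translate directly into code. This is a minimal standard-library sketch with hypothetical function names; it assumes equal-length numeric vectors for the first three measures, non-zero vectors for cosine similarity, and Python sets for the Jaccard coefficient.

```python
from math import sqrt


def euclidean(a, b):
    """Straight-line distance between two points."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical (a convention chosen here)."""
    a, b = set(a), set(b)
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)


p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q))
print(manhattan(p, q))
print(cosine_similarity((1, 2), (2, 4)))
print(jaccard({1, 2, 3}, {2, 3, 4}))
```

Note that the two distances grow as points move apart, while cosine similarity and the Jaccard coefficient grow as items become more alike.
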
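122 | ### 3.1 Hierarchical Cluster Analysis (HCA)
123 | Hierarchical clustering is a family of clustering algorithms built on the following idea:
124 | - Start with every data point as its own cluster
125 | - At each step, merge clusters according to one fixed criterion (for example, the two closest clusters)
126 | - Repeat until only one cluster remains, which yields a hierarchy of clusters.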
127 |
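The three steps above can be sketched as a bottom-up (agglomerative) procedure. The merge criterion used here, single linkage with Euclidean distance, is one common choice and is an assumption of this sketch; the function names and the points are hypothetical.

```python
from math import dist  # Python 3.8+


def single_linkage(c1, c2):
    """Distance between two clusters: the distance of their closest pair of points."""
    return min(dist(p, q) for p in c1 for q in c2)


def agglomerative(points):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[p] for p in points]       # step 1: every point is its own cluster
    merges = []
    while len(clusters) > 1:               # step 3: repeat until one cluster is left
        best = None
        for i in range(len(clusters)):     # step 2: find the closest pair of clusters
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
merges = agglomerative(points)
for left, right, d in merges:
    print(f"merge {left} + {right} at distance {d:.2f}")
```

The merge history is the hierarchy: cutting it at a chosen distance gives a flat clustering.
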
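128 | ### 3.2 K-means Clustering Algorithm
129 | - The clustering measure is the geometric distance between sample points (that is, distance in the coordinate space)
130 | - Clusters are groups gathered around cluster centers; they tend to be roughly spherical and of similar size
131 | - For a given k, the algorithm starts from an initial grouping and then iteratively reassigns points so that each new grouping is better than the previous one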
132 |
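The iteration described above alternates between assigning each point to its nearest center and moving each center to the mean of its group. The sketch below assumes the initial centers are passed in by hand so the run is deterministic; practical implementations usually pick them randomly. The function name and the data are hypothetical.

```python
from math import dist  # Python 3.8+


def kmeans(points, centers, max_iter=100):
    """Alternate assignment and center updates until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    groups = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins the group of its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group (keep it if the group is empty).
        new_centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, groups = kmeans(points, centers=[(1, 1), (8, 8)])
print(centers)
print(groups)
```
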
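133 | ### 3.3 DBSCAN
134 | - A density-based algorithm that groups dense regions of sample points into clusters
135 | - DBSCAN does not assume spherical clusters, and its performance is scalable
136 | - Not every point has to be assigned to a cluster, which reduces the number of outliers inside clusters.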
137 |
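The density idea can be made concrete with a compact sketch. It uses brute-force neighbor search and two parameters, `eps` (neighborhood radius) and `min_pts` (minimum number of points, including the point itself, for a dense neighborhood); the function name, the parameter values, and the data are hypothetical. Points that belong to no dense region keep the label -1.

```python
from math import dist  # Python 3.8+

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)          # None means "not visited yet"

    def neighbors(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:           # not a core point: mark as noise for now
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:                       # grow the cluster through dense neighborhoods
            j = queue.pop()
            if labels[j] == NOISE:         # noise reachable from a core point becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:       # only core points extend the cluster further
                queue.extend(nbrs)
        cluster += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The isolated point at (50, 50) is left unassigned, which is the behavior the last bullet above describes.
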
138 |
139 |
--------------------------------------------------------------------------------
/bank_telemarketing_data_analysis.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Python Data Analysis: Bank Telemarketing Data Analysis"
8 | ]
9 | },
10 | {
11 | "cell_type": "markdown",
12 | "metadata": {},
13 | "source": [
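14 | "## 1. Introduction\n",
15 | "### Project overview:\n",
16 | "Banks keep our assets safe and make daily life more convenient, and the records we leave with a bank let it predict some of our behavior. This project uses customers' historical records to predict whether a customer will subscribe to a deposit product.\n"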
17 | ]
18 | },
19 | {
20 | "cell_type": "markdown",
21 | "metadata": {},
22 | "source": [
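23 | "### Dataset analysis\n",
24 | "The task for this dataset is **classification**. The dataset contains 25317 rows and 18 columns, counting the ID column. The column y is the label column and takes the values 0 and 1: 0 means the customer does not subscribe, 1 means the customer subscribes. The remaining columns are features.\n",
25 | "### Field descriptions\n",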
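26 | "* ID: unique customer identifier\n",
27 | "* age: customer age\n",
28 | "* job: customer occupation\n",
29 | "* marital: marital status\n",
30 | "* education: education level\n",
31 | "* default: whether the customer has a default record\n",
32 | "* balance: average yearly account balance\n",
33 | "* housing: whether the customer has a housing loan\n",
34 | "* loan: whether the customer has a personal loan\n",
35 | "* contact: communication channel used to contact the customer\n",
36 | "* day: day of the month of the last contact\n",
37 | "* month: month of the last contact\n",
38 | "* duration: duration of the last contact\n",
39 | "* campaign: number of contacts with this customer during the current campaign\n",
40 | "* pdays: time elapsed since the customer was last contacted in the previous campaign\n",
41 | "* previous: number of contacts with this customer before the current campaign\n",
42 | "* poutcome: outcome of the previous campaign\n",
43 | "* y: whether the customer subscribes to a term deposit; 0 means no, 1 means yes"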
44 | ]
45 | },
46 | {
47 | "cell_type": "markdown",
48 | "metadata": {},
49 | "source": [
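50 | "## 2. Questions\n",
51 | "* Which classification model is better suited to predicting whether a customer subscribes to a term deposit?\n",
52 | "\n",
53 | "### Analysis workflow\n",
54 | "* Inspect the data\n",
55 | "* Feature processing\n",
56 | "* Model selection\n",
57 | "* Data normalization"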
58 | ]
59 | },
60 | {
61 | "cell_type": "markdown",
62 | "metadata": {},
63 | "source": [
64 | "## 3. Exploratory data analysis\n",
65 | "### Import the required libraries"
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": 1,
71 | "metadata": {},
72 | "outputs": [],
73 | "source": [
74 | "#Basic library\n",
75 | "import pandas as pd\n",
76 | "import numpy as np\n",
77 | "import matplotlib.pyplot as plt\n",
78 | "import seaborn as sns\n",
79 | "\n",
80 | "#machine learning\n",
81 | "from sklearn.model_selection import train_test_split\n",
82 | "from sklearn.model_selection import cross_val_score\n",
83 | "from sklearn.metrics import roc_auc_score\n",
84 | "from sklearn.preprocessing import LabelEncoder, MinMaxScaler\n",
85 | "\n",
86 | "#Model\n",
87 | "from sklearn.neighbors import KNeighborsClassifier\n",
88 | "from sklearn.linear_model import LogisticRegression\n",
89 | "from sklearn.tree import DecisionTreeClassifier\n",
90 | "\n",
91 | "# ignore warnings\n",
92 | "import warnings\n",
93 | "warnings.filterwarnings('ignore')"
94 | ]
95 | },
96 | {
97 | "cell_type": "markdown",
98 | "metadata": {},
99 | "source": [
100 | "### Inspect the data"
101 | ]
102 | },
103 | {
104 | "cell_type": "code",
105 | "execution_count": 2,
106 | "metadata": {
107 | "scrolled": true
108 | },
109 | "outputs": [
110 | {
111 | "data": {
275 | "text/plain": [
276 | " age job marital education default balance housing loan \\\n",
277 | "ID \n",
278 | "1 43 management married tertiary no 291 yes no \n",
279 | "2 42 technician divorced primary no 5076 yes no \n",
280 | "3 47 admin. married secondary no 104 yes yes \n",
281 | "4 28 management single secondary no -994 yes yes \n",
282 | "5 42 technician divorced secondary no 2974 yes no \n",
283 | "\n",
284 | " contact day month duration campaign pdays previous poutcome y \n",
285 | "ID \n",
286 | "1 unknown 9 may 150 2 -1 0 unknown 0 \n",
287 | "2 cellular 7 apr 99 1 251 2 other 0 \n",
288 | "3 cellular 14 jul 77 2 -1 0 unknown 0 \n",
289 | "4 cellular 18 jul 174 2 -1 0 unknown 0 \n",
290 | "5 unknown 21 may 187 5 -1 0 unknown 0 "
291 | ]
292 | },
293 | "execution_count": 2,
294 | "metadata": {},
295 | "output_type": "execute_result"
296 | }
297 | ],
298 | "source": [
299 | "data_all = pd.read_csv('./dataFile/train_set.csv',index_col='ID')\n",
300 | "data_all.head()"
301 | ]
302 | },
303 | {
304 | "cell_type": "code",
305 | "execution_count": 3,
306 | "metadata": {},
307 | "outputs": [
308 | {
309 | "name": "stdout",
310 | "output_type": "stream",
311 | "text": [
312 | "<class 'pandas.core.frame.DataFrame'>\n",
313 | "Int64Index: 25317 entries, 1 to 25317\n",
314 | "Data columns (total 17 columns):\n",
315 | "age 25317 non-null int64\n",
316 | "job 25317 non-null object\n",
317 | "marital 25317 non-null object\n",
318 | "education 25317 non-null object\n",
319 | "default 25317 non-null object\n",
320 | "balance 25317 non-null int64\n",
321 | "housing 25317 non-null object\n",
322 | "loan 25317 non-null object\n",
323 | "contact 25317 non-null object\n",
324 | "day 25317 non-null int64\n",
325 | "month 25317 non-null object\n",
326 | "duration 25317 non-null int64\n",
327 | "campaign 25317 non-null int64\n",
328 | "pdays 25317 non-null int64\n",
329 | "previous 25317 non-null int64\n",
330 | "poutcome 25317 non-null object\n",
331 | "y 25317 non-null int64\n",
332 | "dtypes: int64(8), object(9)\n",
333 | "memory usage: 3.5+ MB\n"
334 | ]
335 | }
336 | ],
337 | "source": [
338 | "# Show basic information about the data\n",
339 | "data_all.info()"
340 | ]
341 | },
342 | {
343 | "cell_type": "code",
344 | "execution_count": 4,
345 | "metadata": {},
346 | "outputs": [
347 | {
348 | "data": {
349 | "text/plain": [
350 | "(25317, 17)"
351 | ]
352 | },
353 | "execution_count": 4,
354 | "metadata": {},
355 | "output_type": "execute_result"
356 | }
357 | ],
358 | "source": [
359 | "# Check the dimensions of data_all\n",
360 | "data_all.shape"
361 | ]
362 | },
363 | {
364 | "cell_type": "code",
365 | "execution_count": 6,
366 | "metadata": {
367 | "scrolled": true
368 | },
369 | "outputs": [
370 | {
371 | "data": {
372 | "text/plain": [
373 | "age False\n",
374 | "job False\n",
375 | "marital False\n",
376 | "education False\n",
377 | "default False\n",
378 | "balance False\n",
379 | "housing False\n",
380 | "loan False\n",
381 | "contact False\n",
382 | "day False\n",
383 | "month False\n",
384 | "duration False\n",
385 | "campaign False\n",
386 | "pdays False\n",
387 | "previous False\n",
388 | "poutcome False\n",
389 | "y False\n",
390 | "dtype: bool"
391 | ]
392 | },
393 | "execution_count": 6,
394 | "metadata": {},
395 | "output_type": "execute_result"
396 | }
397 | ],
398 | "source": [
399 | "# Check whether each column contains missing values.\n",
400 | "data_all.isnull().any()"
401 | ]
402 | },
403 | {
404 | "cell_type": "markdown",
405 | "metadata": {},
406 | "source": [
407 | "Clearly, the dataset contains no missing values."
408 | ]
409 | },
410 | {
411 | "cell_type": "markdown",
412 | "metadata": {},
413 | "source": [
414 | "### Feature processing"
415 | ]
416 | },
417 | {
418 | "cell_type": "code",
419 | "execution_count": 7,
420 | "metadata": {},
421 | "outputs": [
422 | {
423 | "data": {
424 | "text/plain": [
425 | "['job',\n",
426 | " 'marital',\n",
427 | " 'education',\n",
428 | " 'default',\n",
429 | " 'housing',\n",
430 | " 'loan',\n",
431 | " 'contact',\n",
432 | " 'month',\n",
433 | " 'poutcome']"
434 | ]
435 | },
436 | "execution_count": 7,
437 | "metadata": {},
438 | "output_type": "execute_result"
439 | }
440 | ],
441 | "source": [
442 | "# Get the names of the columns in data_all whose dtype is object.\n",
443 | "data_obj_col = data_all.select_dtypes('object').columns.to_list()\n",
444 | "data_obj_col"
445 | ]
446 | },
447 | {
448 | "cell_type": "code",
449 | "execution_count": 9,
450 | "metadata": {
451 | "scrolled": true
452 | },
453 | "outputs": [
454 | {
455 | "data": {
456 | "text/plain": [
457 | "(25317, 9)"
458 | ]
459 | },
460 | "execution_count": 9,
461 | "metadata": {},
462 | "output_type": "execute_result"
463 | }
464 | ],
465 | "source": [
466 | "# Select all object-dtype columns of the dataset and show the dimensions of the result\n",
467 | "data_obj=data_all[data_obj_col]\n",
468 | "data_obj.shape"
469 | ]
470 | },
471 | {
472 | "cell_type": "code",
473 | "execution_count": 10,
474 | "metadata": {},
475 | "outputs": [],
476 | "source": [
477 | "# Using data_obj_col, get the names of the numeric columns in the dataset.\n",
478 | "data_num_col = data_all.columns.difference(data_obj_col)"
479 | ]
480 | },
481 | {
482 | "cell_type": "code",
483 | "execution_count": 11,
484 | "metadata": {},
485 | "outputs": [
486 | {
487 | "data": {
488 | "text/plain": [
489 | "(25317, 8)"
490 | ]
491 | },
492 | "execution_count": 11,
493 | "metadata": {},
494 | "output_type": "execute_result"
495 | }
496 | ],
497 | "source": [
498 | "# Select all numeric columns of the dataset and show the dimensions of the result\n",
499 | "data_num=data_all[data_num_col]\n",
500 | "data_num.shape"
501 | ]
502 | },
503 | {
504 | "cell_type": "code",
505 | "execution_count": 12,
506 | "metadata": {},
507 | "outputs": [
508 | {
509 | "data": {
510 | "text/plain": [
511 | "['age', 'balance', 'campaign', 'day', 'duration', 'pdays', 'previous', 'y']"
512 | ]
513 | },
514 | "execution_count": 12,
515 | "metadata": {},
516 | "output_type": "execute_result"
517 | }
518 | ],
519 | "source": [
520 | "# Print the column names of data_num\n",
521 | "data_num.columns.to_list()"
522 | ]
523 | },
524 | {
525 | "cell_type": "markdown",
526 | "metadata": {},
527 | "source": [
528 | "The output above shows that:\n",
529 | "* there are 9 columns of type object\n",
530 | "* there are 8 numeric columns, named 'age', 'balance', 'campaign', 'day', 'duration', 'pdays', 'previous', 'y'"
531 | ]
532 | },
533 | {
534 | "cell_type": "markdown",
535 | "metadata": {},
536 | "source": [
537 | "### Label encoding\n",
538 | "Label-encode the object columns that have only two distinct values, and add the encoded columns to the data_num dataset"
539 | ]
540 | },
541 | {
542 | "cell_type": "code",
543 | "execution_count": 13,
544 | "metadata": {},
545 | "outputs": [
546 | {
547 | "data": {
548 | "text/plain": [
549 | "['default', 'housing', 'loan']"
550 | ]
551 | },
552 | "execution_count": 13,
553 | "metadata": {},
554 | "output_type": "execute_result"
555 | }
556 | ],
557 | "source": [
558 | "# Count the unique values in each column of data_obj, then get the names of the columns with exactly two unique values\n",
559 | "two_unique_cols = data_obj.nunique()[data_obj.nunique()==2].index.tolist()\n",
560 | "two_unique_cols"
561 | ]
562 | },
563 | {
564 | "cell_type": "code",
565 | "execution_count": 14,
566 | "metadata": {},
567 | "outputs": [],
568 | "source": [
569 | "# Label-encode the columns with exactly two unique values and store the encoded data in data_num\n",
570 | "y = data_all[two_unique_cols].apply(LabelEncoder().fit_transform)\n",
571 | "data_num = pd.concat([y, data_num],ignore_index=False, sort=True, axis=1)"
572 | ]
573 | },
574 | {
575 | "cell_type": "code",
576 | "execution_count": 15,
577 | "metadata": {
578 | "scrolled": true
579 | },
580 | "outputs": [
581 | {
582 | "name": "stdout",
583 | "output_type": "stream",
584 | "text": [
585 | "data_num的维度是:(25317, 11)\n"
586 | ]
587 | },
588 | {
589 | "data": {
590 | "text/plain": [
591 | "Index(['default', 'housing', 'loan', 'age', 'balance', 'campaign', 'day',\n",
592 | " 'duration', 'pdays', 'previous', 'y'],\n",
593 | " dtype='object')"
594 | ]
595 | },
596 | "execution_count": 15,
597 | "metadata": {},
598 | "output_type": "execute_result"
599 | }
600 | ],
601 | "source": [
602 | "# Print the dimensions and column names of data_num\n",
603 | "print('data_num的维度是:{}'.format(data_num.shape))\n",
604 | "data_num.columns"
605 | ]
606 | },
607 | {
608 | "cell_type": "markdown",
609 | "metadata": {},
610 | "source": [
611 | "### Data sampling\n",
612 | "Class imbalance strongly affects model training, so a simple sampling approach is used here to make the classes in y roughly balanced"
613 | ]
614 | },
615 | {
616 | "cell_type": "code",
617 | "execution_count": 16,
618 | "metadata": {
619 | "scrolled": false
620 | },
621 | "outputs": [
622 | {
623 | "data": {
624 | "text/plain": [
625 | ""
626 | ]
627 | },
628 | "execution_count": 16,
629 | "metadata": {},
630 | "output_type": "execute_result"
631 | },
632 | {
633 | "data": {
634 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZUAAAEKCAYAAADaa8itAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAADwtJREFUeJzt3X+s3fVdx/HnCwo6dYRiC8MW7bI0xjqVsYY17h8cSVdItDiHGcmkTpIuC9MtMWboH3YBSaZumrEgSXUddJlD4japSWdtms3F7BcXR/jp0htEuBbphbINJRkpvv3jfu84a0/b09vPud97uM9HcnLO930+3+95f5Ob+8r3+/me70lVIUlSC2f13YAk6dXDUJEkNWOoSJKaMVQkSc0YKpKkZgwVSVIzhookqRlDRZLUjKEiSWpmRd8NLLZVq1bVunXr+m5DkibK/fff/2xVrT7VuGUXKuvWrWNqaqrvNiRpoiT5z1HGefpLktSMoSJJasZQkSQ1Y6hIkpoxVCRJzRgqkqRmDBVJUjOGiiSpGUNFktTMsvtG/Zl68x/s7rsFLUH3//n1fbcgLQkeqUiSmjFUJEnNGCqSpGYMFUlSM4aKJKkZQ0WS1IyhIklqxlCRJDVjqEiSmjFUJEnNGCqSpGYMFUlSM4aKJKkZQ0WS1IyhIklqxlCRJDVjqEiSmhlbqCS5JMmXkjyW5JEkH+jqFyTZn+Rg97yyqyfJbUmmkzyY5LKBbW3rxh9Msm2g/uYkD3Xr3JYk49ofSdKpjfNI5Sjw+1X1c8Am4MYkG4CbgANVtR440C0DXAWs7x7bgTtgLoSAHcBbgMuBHfNB1I3ZPrDeljHujyTpFMYWKlX1dFX9W/f6BeAxYA2wFbirG3YXcE33eiuwu+Z8HTg/ycXA24H9VXWkqp4H9gNbuvfOq6qvVVUBuwe2JUnqwaLMqSRZB7wJ+AZwUVU9DXPBA1zYDVsDPDWw2kxXO1l9ZkhdktSTsYdKkp8APgd8sKq+d7KhQ2q1gPqwHrYnmUoyNTs7e6qWJUkLNNZQSXIOc4Hymar6fFd+pjt1Rfd8uKvPAJcMrL4WOHSK+toh9eNU1c6q2lhVG1evXn1mOyVJOqFxXv0V4JPAY1X1FwNv7QHmr+DaBtw7UL++uwpsE/Dd7vTYPmBzkpXdBP1mYF/33gtJNnWfdf3AtiRJPVgxxm2/Ffgt4KEkD3S1PwI+AtyT5AbgSeDa7r29wNXANPAi8B6AqjqS5Bbgvm7czVV1pHv9PuBO4DXAF7uHJKknYwuVqvpXhs97AFw5ZHwBN55gW7uAXUPqU8Abz6BNSVJDfqNektSMoSJJasZQkSQ1Y6hIkpoxVCRJzRgqkqRmDBVJUjOGiiSpGUNFktSMoSJJasZQkSQ1Y6hIkpoxVCRJzRgqkqRmDBVJUjOGiiSpGUNFktSMoSJJasZQkSQ1Y6hIkpoxVCRJzRgqkqRmDBVJUjOGiiSpGUNFktSMoSJJasZQkSQ1Y6hIkpoxVCRJzRgqkqRmDBVJUjOGiiSpGUNFktSMoSJJasZQkSQ1Y6hIkpoxVCRJzYwtVJLsSnI4ycMDtQ8n+a8kD3SPqwfe+8Mk00m+neTtA/UtXW06yU0D9dcn+UaSg0n+Lsm549oXSdJoxnmkciewZUj9L6vq0u6xFyDJBuBdwM936/xVkrOTnA3cDlwFbACu68YC/Gm3rfXA88ANY9wXSdIIxhYqVfUV4MiIw7cCd1fV96vqP4Bp4PLuMV1Vj1fVS8DdwNYkAd4G/H23/l3ANU13QJJ02vqYU3l/kge702Mru9oa4KmBMTNd7UT1nwS+U1VHj6lLknq02KFyB/AG4FLgaeBjXT1DxtYC6kMl2Z5kKsnU7Ozs6XUsSRrZooZKVT1TVS9X1f8Bf83c6S2YO9K4ZGDoWuDQSerPAucnWXFM/USfu7OqNlbVxtWrV7fZGUnScRY1VJJc
PLD468D8lWF7gHcl+ZEkrwfWA98E7gPWd1d6ncvcZP6eqirgS8A7u/W3Afcuxj5Ikk5sxamHLEySzwJXAKuSzAA7gCuSXMrcqaongPcCVNUjSe4BHgWOAjdW1cvddt4P7APOBnZV1SPdR3wIuDvJnwDfAj45rn2RJI1mbKFSVdcNKZ/wH39V3QrcOqS+F9g7pP44r5w+kyQtAX6jXpLUjKEiSWrGUJEkNWOoSJKaMVQkSc0YKpKkZgwVSVIzhookqRlDRZLUjKEiSWrGUJEkNWOoSJKaMVQkSc2MFCpJDoxSkyQtbye99X2SHwV+jLnfRFnJKz/jex7wU2PuTZI0YU71eyrvBT7IXIDczyuh8j3g9jH2JUmaQCcNlar6OPDxJL9bVZ9YpJ4kSRNqpF9+rKpPJPllYN3gOlW1e0x9SZIm0EihkuTTwBuAB4CXu3IBhook6QdG/Y36jcCGqqpxNiNJmmyjfk/lYeB142xEkjT5Rj1SWQU8muSbwPfni1X1a2PpSpI0kUYNlQ+PswlJ0qvDqFd//cu4G5EkTb5Rr/56gbmrvQDOBc4B/reqzhtXY5KkyTPqkcprB5eTXANcPpaOJEkTa0F3Ka6qfwDe1rgXSdKEG/X01zsGFs9i7nsrfmdFkvRDRr3661cHXh8FngC2Nu9GkjTRRp1Tec+4G5EkTb5Rf6RrbZIvJDmc5Jkkn0uydtzNSZImy6gT9Z8C9jD3uyprgH/sapIk/cCoobK6qj5VVUe7x53A6jH2JUmaQKOGyrNJ3p3k7O7xbuC5cTYmSZo8o4bK7wC/Cfw38DTwTsDJe0nSDxn1kuJbgG1V9TxAkguAjzIXNpIkAaMfqfzifKAAVNUR4E3jaUmSNKlGDZWzkqycX+iOVEY9ypEkLROjhsrHgK8muSXJzcBXgT872QpJdnXfa3l4oHZBkv1JDnbPK7t6ktyWZDrJg0kuG1hnWzf+YJJtA/U3J3moW+e2JDmdHZcktTdSqFTVbuA3gGeAWeAdVfXpU6x2J7DlmNpNwIGqWg8c6JYBrgLWd4/twB3wgyOiHcBbmLsr8o6BI6Y7urHz6x37WZKkRTbyKayqehR49DTGfyXJumPKW4Erutd3AV8GPtTVd1dVAV9Pcn6Si7ux+7s5HJLsB7Yk+TJwXlV9ravvBq4Bvjhqf5Kk9hZ06/szcFFVPQ3QPV/Y1dcATw2Mm+lqJ6vPDKkPlWR7kqkkU7Ozs2e8E5Kk4RY7VE5k2HxILaA+VFXtrKqNVbVx9WpvBCBJ47LYofJMd1qL7vlwV58BLhkYtxY4dIr62iF1SVKPFjtU9gDzV3BtA+4dqF/fXQW2Cfhud3psH7A5ycpugn4zsK9774Ukm7qrvq4f2JYkqSdj+65Jks8yN9G+KskMc1dxfQS4J8kNwJPAtd3wvcDVwDTwIt0tYKrqSJJbgPu6cTfPT9oD72PuCrPXMDdB7yS9JPVsbKFSVded4K0rh4wt4MYTbGcXsGtIfQp445n0KElqa6lM1EuSXgUMFUlSM4aKJKkZQ0WS1IyhIklqxlCRJDVjqEiSmjFUJEnNGCqSpGYMFUlSM4aKJKkZQ0WS1IyhIklqxlCRJDVjqEiSmjFUJEnNGCqSpGYMFUlSM4aKJKkZQ0WS1IyhIklqxlCRJDVjqEiSmjFUJEnNGCqSpGYMFUlSM4aKJKkZQ0WS1IyhIklqxlCRJDVjqEiSmjFUJEnNGCqSpGYMFUlSM4aKJKkZQ0WS1IyhIklqppdQSfJEkoeSPJBkqqtdkGR/koPd88quniS3JZlO8mCSywa2s60bfzDJtj72RZL0ij6PVH6lqi6tqo3d8k3AgapaDxzolgGuAtZ3j+3AHTAXQsAO4C3A5cCO+SCSJPVjKZ3+2grc1b2+C7hmoL675nwdOD/JxcDbgf1VdaSqngf2A1sWu2lJ0iv6CpUC/jnJ/Um2d7WLquppgO75wq6+BnhqYN2Zrnai+nGSbE8ylWRqdna24W5Ikgat
6Olz31pVh5JcCOxP8u8nGZshtTpJ/fhi1U5gJ8DGjRuHjpEknblejlSq6lD3fBj4AnNzIs90p7Xong93w2eASwZWXwscOkldktSTRQ+VJD+e5LXzr4HNwMPAHmD+Cq5twL3d6z3A9d1VYJuA73anx/YBm5Os7CboN3c1SVJP+jj9dRHwhSTzn/+3VfVPSe4D7klyA/AkcG03fi9wNTANvAi8B6CqjiS5BbivG3dzVR1ZvN2QJB1r0UOlqh4HfmlI/TngyiH1Am48wbZ2Abta9yhJWpildEmxJGnCGSqSpGYMFUlSM4aKJKkZQ0WS1IyhIklqxlCRJDVjqEiSmjFUJEnNGCqSpGYMFUlSM4aKJKkZQ0WS1Exfv/woaQyevPkX+m5BS9BP//FDi/ZZHqlIkpoxVCRJzRgqkqRmDBVJUjOGiiSpGUNFktSMoSJJasZQkSQ1Y6hIkpoxVCRJzRgqkqRmDBVJUjOGiiSpGUNFktSMoSJJasZQkSQ1Y6hIkpoxVCRJzRgqkqRmDBVJUjOGiiSpGUNFktSMoSJJambiQyXJliTfTjKd5Ka++5Gk5WyiQyXJ2cDtwFXABuC6JBv67UqSlq+JDhXgcmC6qh6vqpeAu4GtPfckScvWpIfKGuCpgeWZriZJ6sGKvhs4QxlSq+MGJduB7d3i/yT59li7Wj5WAc/23cRSkI9u67sFHc+/z3k7hv2rPG0/M8qgSQ+VGeCSgeW1wKFjB1XVTmDnYjW1XCSZqqqNffchDePfZz8m/fTXfcD6JK9Pci7wLmBPzz1J0rI10UcqVXU0yfuBfcDZwK6qeqTntiRp2ZroUAGoqr3A3r77WKY8pailzL/PHqTquHltSZIWZNLnVCRJS4ihogXx9jhaqpLsSnI4ycN997IcGSo6bd4eR0vcncCWvptYrgwVLYS3x9GSVVVfAY703cdyZahoIbw9jqShDBUtxEi3x5G0/BgqWoiRbo8jafkxVLQQ3h5H0lCGik5bVR0F5m+P8xhwj7fH0VKR5LPA14CfTTKT5Ia+e1pO/Ea9JKkZj1QkSc0YKpKkZgwVSVIzhookqRlDRZLUjKEiSWrGUJEkNWOoSD1KckuSDwws35rk9/rsSToTfvlR6lGSdcDnq+qyJGcBB4HLq+q5XhuTFmhF3w1Iy1lVPZHkuSRvAi4CvmWgaJIZKlL//gb4beB1wK5+W5HOjKe/pJ51d3p+CDgHWF9VL/fckrRgHqlIPauql5J8CfiOgaJJZ6hIPesm6DcB1/bdi3SmvKRY6lGSDcA0cKCqDvbdj3SmnFORJDXjkYokqRlDRZLUjKEiSWrGUJEkNWOoSJKaMVQkSc38P2E7ocOY2wiuAAAAAElFTkSuQmCC\n",
635 | "text/plain": [
636 | ""
637 | ]
638 | },
639 | "metadata": {
640 | "needs_background": "light"
641 | },
642 | "output_type": "display_data"
643 | }
644 | ],
645 | "source": [
646 | "# Plot the counts of the unique values in column y with countplot\n",
647 | "sns.countplot(x=data_num['y'])"
648 | ]
649 | },
650 | {
651 | "cell_type": "code",
652 | "execution_count": 17,
653 | "metadata": {
654 | "scrolled": true
655 | },
656 | "outputs": [
657 | {
658 | "name": "stdout",
659 | "output_type": "stream",
660 | "text": [
661 | "0有:22356,1有:2961\n"
662 | ]
663 | }
664 | ],
665 | "source": [
666 | "# The label column y has two unique values; count the rows for each value.\n",
667 | "data_num['y'].value_counts()\n",
668 | "value_0 = (data_num['y']==0).sum()\n",
669 | "value_1 = (data_num['y']==1).sum()\n",
670 | "print(\"0有:{},1有:{}\".format(value_0,value_1))"
671 | ]
672 | },
673 | {
674 | "cell_type": "markdown",
675 | "metadata": {},
676 | "source": [
677 | "The output and the plot above show that the classes in the label column y are imbalanced, so to balance the sample the y==0 rows are randomly undersampled."
678 | ]
679 | },
680 | {
681 | "cell_type": "code",
682 | "execution_count": 18,
683 | "metadata": {
684 | "scrolled": true
685 | },
686 | "outputs": [],
687 | "source": [
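688 | "# y takes the values 0 and 1, and class 0 has far more rows than class 1, so use sample() to randomly draw rows from the y == 0 samples.\n",
689 | "# The number of class 0 rows drawn must equal the number of class 1 rows, with the random seed set to 22.\n",
690 | "# Random sampling call: sample(n=number of rows, random_state=random seed)\n",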
691 | "data_num_0 = data_num[data_num['y']==0].sample(n=value_1, random_state=22)"
692 | ]
693 | },
694 | {
695 | "cell_type": "code",
696 | "execution_count": 19,
697 | "metadata": {},
698 | "outputs": [],
699 | "source": [
700 | "# Get the rows of the dataset where y is 1\n",
701 | "data_num_1=data_num[data_num['y']!=0]"
702 | ]
703 | },
704 | {
705 | "cell_type": "code",
706 | "execution_count": 20,
707 | "metadata": {},
708 | "outputs": [
709 | {
710 | "data": {
711 | "text/plain": [
712 | "(5922, 11)"
713 | ]
714 | },
715 | "execution_count": 20,
716 | "metadata": {},
717 | "output_type": "execute_result"
718 | }
719 | ],
720 | "source": [
721 | "# Concatenate the y == 1 rows with the randomly drawn y == 0 rows, and show the dimensions of the combined dataset\n",
722 | "data_num_sample = pd.concat([data_num_0,data_num_1], axis=0, ignore_index=False)\n",
723 | "data_num_sample.shape"
724 | ]
725 | },
726 | {
727 | "cell_type": "code",
728 | "execution_count": 21,
729 | "metadata": {},
730 | "outputs": [
731 | {
732 | "data": {
882 | "text/plain": [
883 | " default housing loan age balance \\\n",
884 | "count 5922.000000 5922.000000 5922.000000 5922.000000 5922.000000 \n",
885 | "mean 0.013847 0.468929 0.126646 41.170382 1616.046099 \n",
886 | "std 0.116864 0.499076 0.332605 12.012457 3371.601168 \n",
887 | "min 0.000000 0.000000 0.000000 18.000000 -1965.000000 \n",
888 | "25% 0.000000 0.000000 0.000000 32.000000 130.000000 \n",
889 | "50% 0.000000 0.000000 0.000000 39.000000 574.000000 \n",
890 | "75% 0.000000 1.000000 0.000000 49.000000 1854.750000 \n",
891 | "max 1.000000 1.000000 1.000000 95.000000 102127.000000 \n",
892 | "\n",
893 | " campaign day duration pdays previous \\\n",
894 | "count 5922.000000 5922.000000 5922.000000 5922.000000 5922.000000 \n",
895 | "mean 2.473320 15.489868 378.140324 52.918102 0.863560 \n",
896 | "std 2.745904 8.419602 353.409237 109.976587 2.284146 \n",
897 | "min 1.000000 1.000000 4.000000 -1.000000 0.000000 \n",
898 | "25% 1.000000 8.000000 145.000000 -1.000000 0.000000 \n",
899 | "50% 2.000000 15.000000 259.000000 -1.000000 0.000000 \n",
900 | "75% 3.000000 21.000000 492.000000 77.750000 1.000000 \n",
901 | "max 44.000000 31.000000 3881.000000 854.000000 58.000000 \n",
902 | "\n",
903 | " y \n",
904 | "count 5922.000000 \n",
905 | "mean 0.500000 \n",
906 | "std 0.500042 \n",
907 | "min 0.000000 \n",
908 | "25% 0.000000 \n",
909 | "50% 0.500000 \n",
910 | "75% 1.000000 \n",
911 | "max 1.000000 "
912 | ]
913 | },
914 | "execution_count": 21,
915 | "metadata": {},
916 | "output_type": "execute_result"
917 | }
918 | ],
919 | "source": [
920 | "# Before splitting into training and test sets, check the data for problems by calling describe to print summary statistics\n",
921 | "data_num_sample.describe()"
922 | ]
923 | },
924 | {
925 | "cell_type": "code",
926 | "execution_count": 22,
927 | "metadata": {},
928 | "outputs": [
929 | {
930 | "data": {
931 | "text/plain": [
932 | "Index(['default', 'housing', 'loan', 'age', 'balance', 'campaign', 'day',\n",
933 | " 'duration', 'pdays', 'previous'],\n",
934 | " dtype='object')"
935 | ]
936 | },
937 | "execution_count": 22,
938 | "metadata": {},
939 | "output_type": "execute_result"
940 | }
941 | ],
942 | "source": [
943 | "# Store the label column y in the variable y and the remaining (feature) columns in X, then show the column names of X\n",
944 | "y = data_num_sample['y']\n",
945 | "X = data_num_sample.drop(columns='y',axis=1)\n",
946 | "X.columns"
947 | ]
948 | },
949 | {
950 | "cell_type": "code",
951 | "execution_count": 23,
952 | "metadata": {},
953 | "outputs": [],
954 | "source": [
955 | "# Randomly split X and y into training and test sets, with the random seed set to 22 and the test set taking 1/4 of the data.\n",
956 | "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22, test_size=1/4)"
957 | ]
958 | },
959 | {
960 | "cell_type": "code",
961 | "execution_count": 24,
962 | "metadata": {},
963 | "outputs": [
964 | {
965 | "name": "stdout",
966 | "output_type": "stream",
967 | "text": [
968 | "X_train的维度为(4441, 10),X_test的维度为(1481, 10)\n"
969 | ]
970 | }
971 | ],
972 | "source": [
973 | "print(\"X_train的维度为{},X_test的维度为{}\".format(X_train.shape,X_test.shape))"
974 | ]
975 | },
976 | {
977 | "cell_type": "markdown",
978 | "metadata": {},
979 | "source": [
980 | "## 4. Model selection"
981 | ]
982 | },
983 | {
984 | "cell_type": "markdown",
985 | "metadata": {},
986 | "source": [
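987 | "After the feature processing and data split above, the sampled dataset has been randomly divided into a training set and a test set. Using these sets, models are selected with cross-validation: a kNN model, a logistic regression model, and a decision tree model are trained, and the better model is chosen by the evaluation metric"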
988 | ]
989 | },
990 | {
991 | "cell_type": "markdown",
992 | "metadata": {},
993 | "source": [
994 | "train_test_model and predict_auc are helper functions; the model selection code below calls both"
995 | ]
996 | },
997 | {
998 | "cell_type": "code",
999 | "execution_count": 25,
1000 | "metadata": {},
1001 | "outputs": [],
1002 | "source": [
1003 | "def train_test_model(clf, X_train, y_train, cv_scores, param):\n",
1004 | " \"\"\"\n",
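1005 | " Purpose:\n",
1006 | " Train the model on the training set and compute its cross-validated score.\n",
1007 | " Parameters:\n",
1008 | " clf: the model\n",
1009 | " param: the model parameter value\n",
1010 | " X_train: training features\n",
1011 | " y_train: labels for the training samples\n",
1012 | " cv_scores: dict in which the final mean score is stored\n",
1013 | "\n",
1014 | " \"\"\"\n",
1015 | " # Score clf with 10-fold cross-validation using roc_auc as the metric, take the mean of the fold scores, and print the parameter and the mean score.\n",
1016 | " # For example, with parameter 5 and a mean score of 0.76342, the printed line is: 参数=5,验证集上的AUC=0.76342\n",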
1017 | " val_scores = cross_val_score(clf, X_train, y_train, cv=10, scoring='roc_auc')\n",
1018 | " score_mean = val_scores.mean()\n",
1019 | " print(score_mean)\n",
1020 | " print(\"参数={},验证集上的AUC={}\".format(param, score_mean))\n",
1021 | " \n",
1022 | " \n",
1023 | " \n",
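1024 | " # Finally, store the mean score in the dict cv_scores, with the model parameter as the key and the mean score as the value.\n",
1025 | " # For example, parameter 5 with mean score 0.76342 is stored as {5: 0.76342}\n",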
1026 | " cv_scores[param] = score_mean"
1027 | ]
1028 | },
1029 | {
1030 | "cell_type": "code",
1031 | "execution_count": 26,
1032 | "metadata": {},
1033 | "outputs": [],
1034 | "source": [
1035 | "def predict_auc(model,X_train,y_train,X_test,y_test):\n",
1036 | " \"\"\"\n",
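1037 | " Purpose:\n",
1038 | " Fit the model on the training data, predict on the test data with the fitted model, and compute the model's AUC score.\n",
1039 | " Parameters:\n",
1040 | " model: the model configured with the best parameter\n",
1041 | " X_train: training features\n",
1042 | " y_train: labels for the training data\n",
1043 | " X_test: test features\n",
1044 | " y_test: labels for the test data\n",
1045 | " Returns:\n",
1046 | " The model's AUC value\n",
1047 | " \"\"\"\n",
1048 | " # With the best parameter set, fit on the whole training set, predict the test set with predict(), store the result in y_pred, and compute the AUC with roc_auc_score().\n",
1049 | " # Print the AUC in the format: 模型AUC值:0.75177973\n",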
1050 | " model.fit(X_train, y_train)\n",
1051 | " y_pred = model.predict(X_test)\n",
1052 | " model_auc = roc_auc_score(y_test, y_pred)  # roc_auc_score takes (y_true, y_score): true labels first\n",
1053 | " print('模型AUC值:{}'.format(model_auc))\n",
1054 | "\n",
1055 | " return model_auc\n",
1056 | " "
1057 | ]
1058 | },
1059 | {
1060 | "cell_type": "markdown",
1061 | "metadata": {},
1062 | "source": [
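1063 | "### 1. kNN Model\n",
1064 | "The k-nearest neighbor method (k-NN) is a basic classification method. Given a training dataset and a new input instance, it finds the k training instances nearest to that instance and assigns the input to the class that the majority of those k instances belong to."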
1065 | ]
1066 | },
1067 | {
1068 | "cell_type": "code",
1069 | "execution_count": 27,
1070 | "metadata": {
1071 | "scrolled": true
1072 | },
1073 | "outputs": [
1074 | {
1075 | "name": "stdout",
1076 | "output_type": "stream",
1077 | "text": [
1078 | "0.8077509534923599\n",
1079 | "参数=5,验证集上的AUC=0.8077509534923599\n",
1080 | "0.8175383570005726\n",
1081 | "参数=7,验证集上的AUC=0.8175383570005726\n",
1082 | "0.7528462119421223\n",
1083 | "参数=2,验证集上的AUC=0.7528462119421223\n",
1084 | "0.7822092488933985\n",
1085 | "参数=3,验证集上的AUC=0.7822092488933985\n",
1086 | "0.8142750349747395\n",
1087 | "参数=6,验证集上的AUC=0.8142750349747395\n",
1088 | "最优的参数值:7\n",
1089 | "模型AUC值:0.7202998772646505\n"
1090 | ]
1091 | }
1092 | ],
1093 | "source": [
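1094 | "# Only the k parameter of kNN is tuned here: each value of k gives a different kNN model, each model is scored by cross-validated AUC, and the best-scoring k is used for the final model\n",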
1095 | "knn_parameters = [5,7,2,3,6]\n",
1096 | "knn_cv_scores = {}\n",
1097 | "for param in knn_parameters:\n",
1098 | " knn_clf = KNeighborsClassifier(n_neighbors=param)\n",
1099 | " train_test_model(knn_clf, X_train, y_train,knn_cv_scores,param)\n",
1100 | " \n",
1101 | "knn_best_para=max(knn_cv_scores,key=knn_cv_scores.get)\n",
1102 | "print('最优的参数值:{}'.format(knn_best_para))\n",
1103 | "\n",
1104 | "# Configure the model with the best parameter, fit it, predict on the test set, and compute the AUC with roc_auc_score\n",
1105 | "knn_model= KNeighborsClassifier(n_neighbors=knn_best_para)\n",
1106 | "knn_model_auc = predict_auc(knn_model,X_train,y_train,X_test,y_test)"
1107 | ]
1108 | },
1109 | {
1110 | "cell_type": "markdown",
1111 | "metadata": {},
1112 | "source": [
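1113 | "### 2. Logistic Regression Model\n",
1114 | "Logistic regression is a classification model whose output lies in (0, 1). An output above a given threshold is assigned to class A, and an output below it to class B. The underlying theory is fairly involved, so only this brief description is given here."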
1115 | ]
1116 | },
1117 | {
1118 | "cell_type": "code",
1119 | "execution_count": 28,
1120 | "metadata": {},
1121 | "outputs": [
1122 | {
1123 | "name": "stdout",
1124 | "output_type": "stream",
1125 | "text": [
1126 | "0.8623990756280264\n",
1127 | "参数=1,验证集上的AUC=0.8623990756280264\n",
1128 | "0.8623787155370144\n",
1129 | "参数=3,验证集上的AUC=0.8623787155370144\n",
1130 | "0.8624392095715059\n",
1131 | "参数=5,验证集上的AUC=0.8624392095715059\n",
1132 | "0.8617857855030111\n",
1133 | "参数=10,验证集上的AUC=0.8617857855030111\n",
1134 | "0.8622761466118669\n",
1135 | "参数=15,验证集上的AUC=0.8622761466118669\n",
1136 | "最优的参数值:5\n",
1137 | "模型AUC值:0.7704903918371834\n"
1138 | ]
1139 | }
1140 | ],
1141 | "source": [
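1142 | "# Only the parameter C of logistic regression is varied here: each value gives a different model, the best value is chosen by cross-validated score, and the final model is scored with it\n",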
1143 | "lr_parameters = [1,3,5,10,15]\n",
1144 | "lr_cv_scores = {}\n",
1145 | "for param in lr_parameters:\n",
1146 | " lr_clf = LogisticRegression(C=param)\n",
1147 | " train_test_model(lr_clf, X_train, y_train,lr_cv_scores,param)\n",
1148 | " \n",
1149 | "lr_best_para=max(lr_cv_scores,key=lr_cv_scores.get)\n",
1150 | "print('最优的参数值:{}'.format(lr_best_para))\n",
1151 | "\n",
1152 | "# Configure the model with the best parameter, fit it, predict on the test set, and compute the AUC with roc_auc_score\n",
1153 | "lr_model= LogisticRegression(C=lr_best_para)\n",
1154 | "lr_model_auc = predict_auc(lr_model,X_train,y_train,X_test,y_test)"
1155 | ]
1156 | },
1157 | {
1158 | "cell_type": "markdown",
1159 | "metadata": {},
1160 | "source": [
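1161 | "### 3. Decision Tree Model\n",
1162 | "A decision tree is also a classification model. As the name suggests, it splits the data according to learned rules into a tree-like structure in order to classify the data."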
1163 | ]
1164 | },
1165 | {
1166 | "cell_type": "code",
1167 | "execution_count": 29,
1168 | "metadata": {
1169 | "scrolled": true
1170 | },
1171 | "outputs": [
1172 | {
1173 | "name": "stdout",
1174 | "output_type": "stream",
1175 | "text": [
1176 | "0.7183915142959287\n",
1177 | "参数=1,验证集上的AUC=0.7183915142959287\n",
1178 | "0.8312521096369881\n",
1179 | "参数=3,验证集上的AUC=0.8312521096369881\n",
1180 | "0.8561861778610342\n",
1181 | "参数=5,验证集上的AUC=0.8561861778610342\n",
1182 | "0.8021781689000533\n",
1183 | "参数=10,验证集上的AUC=0.8021781689000533\n",
1184 | "0.7493300003167633\n",
1185 | "参数=15,验证集上的AUC=0.7493300003167633\n",
1186 | "最优的参数值:5\n",
1187 | "模型AUC值:0.7760305631026779\n"
1188 | ]
1189 | }
1190 | ],
1191 | "source": [
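1192 | "# The tree depth (max_depth) is tuned below: each depth gives a different model, the best depth is chosen by cross-validated score, and the final model's AUC is computed with it\n",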
1193 | "dt_parameters = [1,3,5,10,15]\n",
1194 | "dt_cv_scores = {}\n",
1195 | "for param in dt_parameters:\n",
1196 | " dt_clf = DecisionTreeClassifier(max_depth=param)\n",
1197 | " train_test_model(dt_clf, X_train, y_train,dt_cv_scores,param)\n",
1198 | " \n",
1199 | "dt_best_para=max(dt_cv_scores,key=dt_cv_scores.get)\n",
1200 | "print('最优的参数值:{}'.format(dt_best_para))\n",
1201 | "\n",
1202 | "# Configure the model with the best parameter, fit it, predict on the test set, and compute the AUC with roc_auc_score\n",
1203 | "dt_model= DecisionTreeClassifier(max_depth=dt_best_para)\n",
1204 | "dt_model_auc = predict_auc(dt_model,X_train,y_train,X_test,y_test)"
1205 | ]
1206 | },
1207 | {
1208 | "cell_type": "markdown",
1209 | "metadata": {},
1210 | "source": [
1211 | "Comparing the three models above, the decision tree performs best because its test-set AUC is the highest."
1212 | ]
1213 | },
1214 | {
1215 | "cell_type": "markdown",
1216 | "metadata": {},
1217 | "source": [
1218 | "### Data Normalization"
1219 | ]
1220 | },
1221 | {
1222 | "cell_type": "code",
1223 | "execution_count": 30,
1224 | "metadata": {
1225 | "scrolled": false
1226 | },
1227 | "outputs": [
1228 | {
1229 | "data": {
1370 | "text/plain": [
1371 | " default housing loan age balance \\\n",
1372 | "count 4441.000000 4441.000000 4441.000000 4441.000000 4441.000000 \n",
1373 | "mean 0.014411 0.471741 0.128575 41.186895 1587.306463 \n",
1374 | "std 0.119192 0.499257 0.334766 12.027061 3136.478955 \n",
1375 | "min 0.000000 0.000000 0.000000 18.000000 -1965.000000 \n",
1376 | "25% 0.000000 0.000000 0.000000 32.000000 119.000000 \n",
1377 | "50% 0.000000 0.000000 0.000000 39.000000 577.000000 \n",
1378 | "75% 0.000000 1.000000 0.000000 49.000000 1853.000000 \n",
1379 | "max 1.000000 1.000000 1.000000 93.000000 81204.000000 \n",
1380 | "\n",
1381 | " campaign day duration pdays previous \n",
1382 | "count 4441.000000 4441.000000 4441.000000 4441.00000 4441.000000 \n",
1383 | "mean 2.474218 15.548750 377.235307 53.21324 0.833146 \n",
1384 | "std 2.751711 8.417989 350.516914 111.21589 2.134833 \n",
1385 | "min 1.000000 1.000000 4.000000 -1.00000 0.000000 \n",
1386 | "25% 1.000000 8.000000 144.000000 -1.00000 0.000000 \n",
1387 | "50% 2.000000 15.000000 260.000000 -1.00000 0.000000 \n",
1388 | "75% 3.000000 21.000000 489.000000 64.00000 1.000000 \n",
1389 | "max 44.000000 31.000000 3881.000000 854.00000 37.000000 "
1390 | ]
1391 | },
1392 | "execution_count": 30,
1393 | "metadata": {},
1394 | "output_type": "execute_result"
1395 | }
1396 | ],
1397 | "source": [
1398 | "# First inspect the training features to see how their scales differ\n",
1399 | "X_train.describe()"
1400 | ]
1401 | },
1402 | {
1403 | "cell_type": "markdown",
1404 | "metadata": {},
1405 | "source": [
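1406 | "Answer: the summary statistics above show that the features are on very different scales (for example, balance ranges from -1965 to 81204 while default is either 0 or 1), so the feature values need to be normalized."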
1407 | ]
1408 | },
1409 | {
1410 | "cell_type": "markdown",
1411 | "metadata": {},
1412 | "source": [
1413 | "#### Normalization\n",
1414 | "Normalize the training data and the test data, fitting the scaler on the training set only"
1415 | ]
1416 | },
1417 | {
1418 | "cell_type": "code",
1419 | "execution_count": 31,
1420 | "metadata": {},
1421 | "outputs": [],
1422 | "source": [
1423 | "# Normalize the training set X_train and the test set X_test with MinMaxScaler\n",
1424 | "scaler = MinMaxScaler()\n",
1425 | "X_train_scaled = scaler.fit_transform(X_train.astype('float64'))\n",
1426 | "X_test_scaled = scaler.transform(X_test.astype('float64'))"
1427 | ]
1428 | },
1429 | {
1430 | "cell_type": "markdown",
1431 | "metadata": {},
1432 | "source": [
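1433 | "The model-training code below closely mirrors the code above. If you are interested, delete the kNN, logistic regression, and decision tree code given below and try writing it yourself to train the models."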
1434 | ]
1435 | },
1436 | {
1437 | "cell_type": "markdown",
1438 | "metadata": {},
1439 | "source": [
1440 | "#### kNN Model"
1441 | ]
1442 | },
1443 | {
1444 | "cell_type": "code",
1445 | "execution_count": 32,
1446 | "metadata": {
1447 | "scrolled": false
1448 | },
1449 | "outputs": [
1450 | {
1451 | "name": "stdout",
1452 | "output_type": "stream",
1453 | "text": [
1454 | "0.8362348007709579\n",
1455 | "参数=5,验证集上的AUC=0.8362348007709579\n",
1456 | "0.844031883023167\n",
1457 | "参数=7,验证集上的AUC=0.844031883023167\n",
1458 | "0.795107142467294\n",
1459 | "参数=2,验证集上的AUC=0.795107142467294\n",
1460 | "0.8216005781567567\n",
1461 | "参数=3,验证集上的AUC=0.8216005781567567\n",
1462 | "0.8418551395164939\n",
1463 | "参数=6,验证集上的AUC=0.8418551395164939\n",
1464 | "最优的参数值:7\n",
1465 | "模型AUC值:0.7633043955039752\n"
1466 | ]
1467 | }
1468 | ],
1469 | "source": [
1470 | "knn_scaler_parameters = [5,7,2,3,6]\n",
1471 | "knn_scaler_cv_scores = {}\n",
1472 | "for param in knn_scaler_parameters:\n",
1473 | " knn_scaler_clf = KNeighborsClassifier(n_neighbors=param)\n",
1474 | " train_test_model(knn_scaler_clf, X_train_scaled, y_train,knn_scaler_cv_scores,param)\n",
1475 | " \n",
1476 | "knn_scaler_best_para=max(knn_scaler_cv_scores,key=knn_scaler_cv_scores.get)\n",
1477 | "print('最优的参数值:{}'.format(knn_scaler_best_para))\n",
1478 | "\n",
1479 | "# Configure the model with the best parameter, fit it, predict on the test set, and compute the AUC with roc_auc_score\n",
1480 | "knn_scaler_model= KNeighborsClassifier(n_neighbors=knn_scaler_best_para)\n",
1481 | "knn_scaler_model_auc = predict_auc(knn_scaler_model,X_train_scaled,y_train,X_test_scaled,y_test)"
1482 | ]
1483 | },
1484 | {
1485 | "cell_type": "markdown",
1486 | "metadata": {},
1487 | "source": [
1488 | "#### Logistic Regression Model"
1489 | ]
1490 | },
1491 | {
1492 | "cell_type": "code",
1493 | "execution_count": 33,
1494 | "metadata": {
1495 | "scrolled": true
1496 | },
1497 | "outputs": [
1498 | {
1499 | "name": "stdout",
1500 | "output_type": "stream",
1501 | "text": [
1502 | "0.8576315184282677\n",
1503 | "参数=1,验证集上的AUC=0.8576315184282677\n",
1504 | "0.8610214502261355\n",
1505 | "参数=3,验证集上的AUC=0.8610214502261355\n",
1506 | "0.8617095314058393\n",
1507 | "参数=5,验证集上的AUC=0.8617095314058393\n",
1508 | "0.8620709066888409\n",
1509 | "参数=10,验证集上的AUC=0.8620709066888409\n",
1510 | "0.8622355155684189\n",
1511 | "参数=15,验证集上的AUC=0.8622355155684189\n",
1512 | "最优的参数值:15\n",
1513 | "模型AUC值:0.7691338914433311\n"
1514 | ]
1515 | }
1516 | ],
1517 | "source": [
1518 | "lr_scaler_parameters = [1,3,5,10,15]\n",
1519 | "lr_scaler_cv_scores = {}\n",
1520 | "for param in lr_scaler_parameters:\n",
1521 | " lr_scaler_clf = LogisticRegression(C=param)\n",
1522 | " train_test_model(lr_scaler_clf, X_train_scaled, y_train,lr_scaler_cv_scores,param)\n",
1523 | " \n",
1524 | "lr_scaler_best_para=max(lr_scaler_cv_scores,key=lr_scaler_cv_scores.get)\n",
1525 | "print('最优的参数值:{}'.format(lr_scaler_best_para))\n",
1526 | "\n",
1527 | "# Configure the model with the best parameter, fit it, predict on the test set, and compute the AUC with roc_auc_score\n",
1528 | "lr_scaler_model= LogisticRegression(C=lr_scaler_best_para)\n",
1529 | "lr_scaler_model_auc = predict_auc(lr_scaler_model,X_train_scaled,y_train,X_test_scaled,y_test)"
1530 | ]
1531 | },
1532 | {
1533 | "cell_type": "markdown",
1534 | "metadata": {},
1535 | "source": [
1536 | "#### Decision Tree Model"
1537 | ]
1538 | },
1539 | {
1540 | "cell_type": "code",
1541 | "execution_count": 34,
1542 | "metadata": {
1543 | "scrolled": true
1544 | },
1545 | "outputs": [
1546 | {
1547 | "name": "stdout",
1548 | "output_type": "stream",
1549 | "text": [
1550 | "0.7183915142959287\n",
1551 | "参数=1,验证集上的AUC=0.7183915142959287\n",
1552 | "0.8312521096369881\n",
1553 | "参数=3,验证集上的AUC=0.8312521096369881\n",
1554 | "0.8566254157311546\n",
1555 | "参数=5,验证集上的AUC=0.8566254157311546\n",
1556 | "0.8051230940708141\n",
1557 | "参数=10,验证集上的AUC=0.8051230940708141\n",
1558 | "0.748734443042695\n",
1559 | "参数=15,验证集上的AUC=0.748734443042695\n",
1560 | "最优的参数值:5\n",
1561 | "模型AUC值:0.7767016204516204\n"
1562 | ]
1563 | }
1564 | ],
1565 | "source": [
1566 | "dt_scaler_parameters = [1,3,5,10,15]\n",
1567 | "dt_scaler_cv_scores = {}\n",
1568 | "for param in dt_scaler_parameters:\n",
1569 | " dt_scaler_clf = DecisionTreeClassifier(max_depth=param)\n",
1570 | " train_test_model(dt_scaler_clf, X_train_scaled, y_train,dt_scaler_cv_scores,param)\n",
1571 | " \n",
1572 | "dt_scaler_best_para=max(dt_scaler_cv_scores,key=dt_scaler_cv_scores.get)\n",
1573 | "print('最优的参数值:{}'.format(dt_scaler_best_para))\n",
1574 | "\n",
1575 | "# Configure the model with the best parameter, fit it, predict on the test set, and compute the AUC with roc_auc_score\n",
1576 | "dt_scaler_model= DecisionTreeClassifier(max_depth=dt_scaler_best_para)\n",
1577 | "dt_scaler_model_auc = predict_auc(dt_scaler_model,X_train_scaled,y_train,X_test_scaled,y_test)"
1578 | ]
1579 | },
1580 | {
1581 | "cell_type": "markdown",
1582 | "metadata": {},
1583 | "source": [
1584 | "#### Combine the model AUC values for the unnormalized and normalized data"
1585 | ]
1586 | },
1587 | {
1588 | "cell_type": "code",
1589 | "execution_count": 35,
1590 | "metadata": {},
1591 | "outputs": [
1592 | {
1593 | "data": {
1637 | "text/plain": [
1638 | " Not Scaled (%) Scaled (%)\n",
1639 | "kNN 0.720300 0.763304\n",
1640 | "LR 0.770490 0.769134\n",
1641 | "DT 0.776031 0.776702"
1642 | ]
1643 | },
1644 | "execution_count": 35,
1645 | "metadata": {},
1646 | "output_type": "execute_result"
1647 | }
1648 | ],
1649 | "source": [
1650 | "col_name = ['Not Scaled (%)', 'Scaled (%)']\n",
1651 | "row_name = ['kNN','LR','DT']\n",
1652 | "# Create the DataFrame models_auc_df with col_name as the column index and row_name as the row index, placing the unnormalized and normalized AUC values in the matching positions.\n",
1653 | "# The unnormalized AUC values are knn_model_auc, lr_model_auc, dt_model_auc; the normalized ones are knn_scaler_model_auc, lr_scaler_model_auc, dt_scaler_model_auc.\n",
1654 | "# Then display models_auc_df\n",
1655 | "\n",
1656 | "models_auc_df = pd.DataFrame([[knn_model_auc,knn_scaler_model_auc],[lr_model_auc,lr_scaler_model_auc],[dt_model_auc,dt_scaler_model_auc]],\n",
1657 | " columns=col_name,index=row_name)\n",
1658 | "models_auc_df"
1659 | ]
1660 | },
1661 | {
1662 | "cell_type": "code",
1663 | "execution_count": 36,
1664 | "metadata": {
1665 | "scrolled": true
1666 | },
1667 | "outputs": [
1668 | {
1669 | "data": {
1670 | "text/plain": [
1671 | "(array([0, 1, 2]), )"
1672 | ]
1673 | },
1674 | "execution_count": 36,
1675 | "metadata": {},
1676 | "output_type": "execute_result"
1677 | },
1678 | {
1679 | "data": {
1680 | "text/plain": [
1681 | ""
1682 | ]
1683 | },
1684 | "metadata": {},
1685 | "output_type": "display_data"
1686 | },
1687 | {
1688 | "data": {
1689 | "image/png": "\n",
1690 | "text/plain": [
1691 | ""
1692 | ]
1693 | },
1694 | "metadata": {
1695 | "needs_background": "light"
1696 | },
1697 | "output_type": "display_data"
1698 | }
1699 | ],
1700 | "source": [
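1701 | "# Draw a grouped bar chart for the unnormalized and normalized data; a chart similar to the one below is sufficient\n",
1702 | "# Visualize models_auc_df, with the legend in the lower-right corner and a title stating that it compares model AUC values for unnormalized and normalized data\n",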
1703 | "\n",
1704 | "# Set the font so that Chinese text in the legend renders correctly\n",
1705 | "plt.rcParams['font.sans-serif']=['SimHei'] \n",
1706 | "plt.rcParams['axes.unicode_minus']=False \n",
1707 | "\n",
1708 | "# Create the figure\n",
1709 | "plt.figure(figsize=(20,12), dpi=120)\n",
1710 | "\n",
1711 | "models_auc_df.plot(kind='bar')\n",
1712 | "\n",
1713 | "plt.xticks(rotation=360)"
1714 | ]
1715 | },
1716 | {
1717 | "cell_type": "markdown",
1718 | "metadata": {},
1719 | "source": [
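1720 | "Comparing the AUC values of the models trained on unnormalized and normalized data shows that normalization clearly improves the kNN model's AUC (from about 0.720 to about 0.763) but has little effect on logistic regression and the decision tree."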
1721 | ]
1722 | }
1723 | ],
1724 | "metadata": {
1725 | "kernelspec": {
1726 | "display_name": "Python 3",
1727 | "language": "python",
1728 | "name": "python3"
1729 | },
1730 | "language_info": {
1731 | "codemirror_mode": {
1732 | "name": "ipython",
1733 | "version": 3
1734 | },
1735 | "file_extension": ".py",
1736 | "mimetype": "text/x-python",
1737 | "name": "python",
1738 | "nbconvert_exporter": "python",
1739 | "pygments_lexer": "ipython3",
1740 | "version": "3.6.8"
1741 | }
1742 | },
1743 | "nbformat": 4,
1744 | "nbformat_minor": 2
1745 | }
1746 |
--------------------------------------------------------------------------------
/images/Random Forest.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/teamowu/Machine-Learning/adc76bda05eb8a1265e3e733f8e4c4d89b43893e/images/Random Forest.png
--------------------------------------------------------------------------------
/images/jaccard系数.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/teamowu/Machine-Learning/adc76bda05eb8a1265e3e733f8e4c4d89b43893e/images/jaccard系数.png
--------------------------------------------------------------------------------
/images/余弦相似性.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/teamowu/Machine-Learning/adc76bda05eb8a1265e3e733f8e4c4d89b43893e/images/余弦相似性.png
--------------------------------------------------------------------------------
/images/曼哈顿距离.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/teamowu/Machine-Learning/adc76bda05eb8a1265e3e733f8e4c4d89b43893e/images/曼哈顿距离.png
--------------------------------------------------------------------------------
/images/欧式距离.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/teamowu/Machine-Learning/adc76bda05eb8a1265e3e733f8e4c4d89b43893e/images/欧式距离.png
--------------------------------------------------------------------------------
/关联分析/README.md:
--------------------------------------------------------------------------------
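1 | # Association Analysis
2 | Association analysis looks for **rules that explain the relationships between data variables**, surfacing useful association rules in large, multi-source datasets. It is a method for discovering relationships among many kinds of data in large volumes of data.
3 | It can also mine relationships among multiple kinds of data along a time series.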
4 |
5 | ## Common Association Algorithms
6 | - Apriori
7 | - FP-Growth
8 | - PrefixSpan
9 | - SPADE
10 | - AprioriAll
11 | - AprioriSome
12 |
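13 | ## Typical Sales Use Cases
14 | - Market basket analysis
15 | - Optimizing product placement, e.g. a supermarket can shelve strongly associated products together so that customers can pick them up in one go.
16 | - Designing promotions, e.g. offering a discount when two strongly associated products are bought together.
17 | - Fast product recommendation, typically in e-commerce, e.g. when a customer views a product, the page recommends items under rules such as "frequently bought together" or "90% of customers also viewed the following products".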
18 |
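19 | ## Key Metrics in Association Analysis
20 | - Support: the share of all transactions that contain both the antecedent and the consequent of a rule.
21 | - Confidence: among the transactions that contain the antecedent, the share that also contain the consequent.
22 | - Lift: confidence divided by the support of the consequent. When Lift > 1, applying the rule gives better results than not applying it; when Lift < 1, the rule has a negative association and is an invalid rule (Lift = 1 means the two sides are independent).
23 | - When evaluating association rules, consider support, confidence, and lift together; the higher the support and confidence, the better.
24 |
25 | **Frequent rules & valid rules**:
26 | - Frequent rule: a rule whose support and confidence are both fairly high.
27 | - Valid rule: a rule that genuinely lifts the antecedent/consequent items in the rule.
28 | - Frequent rule != valid rule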
29 |
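The three metrics above can be computed directly from a list of transactions. The following is a minimal sketch in plain Python, not code from this repository: the function names `support` and `rule_metrics` and the toy baskets are assumptions made for illustration, and the antecedent and consequent are assumed to occur at least once.

```python
def support(transactions, itemset):
    """Share of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift of the rule antecedent -> consequent."""
    sup_rule = support(transactions, antecedent | consequent)
    sup_antecedent = support(transactions, antecedent)
    sup_consequent = support(transactions, consequent)
    confidence = sup_rule / sup_antecedent   # P(consequent | antecedent)
    lift = confidence / sup_consequent       # > 1 means the rule helps
    return {"support": sup_rule, "confidence": confidence, "lift": lift}


# Hypothetical toy baskets, for illustration only.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
    frozenset({"bread", "milk"}),
]

metrics = rule_metrics(baskets, frozenset({"bread"}), frozenset({"milk"}))
print(metrics)
```

In this toy data the rule bread -> milk has fairly high support and confidence, yet its lift is below 1, which is one way a frequent rule can fail to be a valid rule.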
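30 | ## More Use Cases for Association Analysis
31 | **Association analysis within a single dimension**:
32 | - Website page-view association analysis
33 | - Ad traffic association analysis
34 | - User search-keyword association analysis
35 |
36 | **Cross-dimension association analysis**:
37 | - Association analysis across different scenarios
38 | - Event analysis within the same scenario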
39 |
--------------------------------------------------------------------------------
/分类算法/README.md:
--------------------------------------------------------------------------------
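1 | # 2.2 How to Choose a Classification Algorithm
2 | - Text classification: **naive Bayes**, e.g. spam detection in email.
3 | - If the training set is small, a high-bias, low-variance classifier works better, such as **naive Bayes or a support vector machine, because these algorithms are less prone to overfitting**.
4 | - If computation time and ease of use matter most, support vector machines and artificial neural networks are not good choices.
5 | - If accuracy matters most, choose a high-precision method such as **a support vector machine or Boosting-based ensembles like GBDT and XGBoost**.
6 | - If stability or model robustness matters most, choose **Bagging-based ensembles such as random forests and voting ensembles**.
7 | - If you want probability information about the predictions and plan to **build further applications on the predicted probabilities, logistic regression is a good choice**.
8 | - If you are **worried about outliers or non-separable data and need clear decision rules, choose a decision tree**.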
9 |
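The point about building on predicted probabilities can be made concrete with a small sketch. It assumes a logistic model whose weights and bias are already known; the values of `weights` and `bias`, the customer features, and the helper names `predict_proba` and `classify` are all hypothetical and chosen only for illustration.

```python
import math


def predict_proba(features, weights, bias):
    """Logistic model: squash the linear score into a probability in (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability, threshold=0.5):
    """Turn a probability into a 0/1 label with an adjustable threshold."""
    return 1 if probability >= threshold else 0


# Hypothetical coefficients and customers, for illustration only.
weights = [0.8, -0.5]
bias = -0.2
customers = {"a": [2.0, 1.0], "b": [0.5, 2.0], "c": [1.0, 0.0]}

scores = {name: predict_proba(x, weights, bias) for name, x in customers.items()}

# Probabilities allow ranking, not just labelling.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)

# The same probabilities can be cut at different thresholds.
print({name: classify(p, threshold=0.7) for name, p in scores.items()})
```

Keeping the probability rather than only the label lets the threshold be tuned to the business cost of false positives and false negatives without retraining.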
--------------------------------------------------------------------------------
/回归分析/README.md:
--------------------------------------------------------------------------------
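1 | # Regression Analysis
2 | How to choose a regression algorithm?
3 | - Simple linear regression. Suits datasets with a simple structure and a clearly linear pattern.
4 | - When there are few independent variables, or dimensionality reduction yields two usable variables (including the target), a scatter plot can reveal the relationship between the independent and dependent variables directly, which then guides the choice of regression method.
5 | - If a basic check shows strong collinearity among the independent variables, use an algorithm that handles multicollinearity (highly correlated independent variables) flexibly, such as ridge regression.
6 | - If **the dataset is noisy, principal component regression is recommended**, because a sensible choice of which principal components enter the regression can remove noise. In addition, the principal components are mutually orthogonal, which resolves the collinearity problem in multiple linear
7 | regression. Both properties effectively improve the model's robustness to interference.
8 | - With high-dimensional variables, regularized regression methods such as Lasso, Ridge, and ElasticNet work best; alternatively, use stepwise regression to select the significant independent variables and build the regression model from them.
9 | - To validate several algorithms at once and pick the best-fitting one, compare the models with cross-validation and evaluate them jointly using R-squared, adjusted R-squared, AIC, BIC, and the various residual and error
10 | metrics.
11 | - If interpretability matters, easily understood models such as linear, exponential, logarithmic, and quadratic or polynomial regression are more suitable than kernel regression or support vector machines.
12 | - Ensemble or combined regression. Once several methods have been shortlisted but it is unclear which to keep, combine the regression models: take the output of each model and determine the final output value by weighting, averaging, or a similar scheme.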
13 |
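The combination step in the last bullet can be sketched as a weighted average of per-model predictions. This is an illustrative sketch only: `combine_predictions`, the two prediction lists, and the weights are hypothetical, and in practice the weights would usually be chosen from validation performance.

```python
def combine_predictions(predictions, weights=None):
    """Blend per-model predictions into one output per sample.

    predictions: list of equal-length lists, one list per model.
    weights: optional per-model weights; a plain mean is used when omitted.
    """
    n_models = len(predictions)
    if weights is None:
        weights = [1.0] * n_models
    if len(weights) != n_models:
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    n_samples = len(predictions[0])
    return [
        sum(w * model[i] for w, model in zip(weights, predictions)) / total
        for i in range(n_samples)
    ]


# Hypothetical predictions from two already-fitted regression models.
linear_pred = [10.0, 20.0, 30.0]
ridge_pred = [12.0, 18.0, 33.0]

mean_blend = combine_predictions([linear_pred, ridge_pred])
weighted_blend = combine_predictions([linear_pred, ridge_pred], weights=[3.0, 1.0])
print(mean_blend)
print(weighted_blend)
```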
--------------------------------------------------------------------------------
/聚类分析/README.md:
--------------------------------------------------------------------------------
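1 | # How to Choose a Clustering Algorithm
2 | There are dozens of clustering algorithms; the choice mainly depends on the following factors:
3 | - If the data is high-dimensional, choose **spectral clustering**, which is a form of subspace partitioning.
4 | - If the dataset is small to medium, e.g. **under 1 million records**, k-means is a good choice; if it has **more than 1 million records**, consider **MiniBatchKMeans**.
5 | - If the dataset contains **noise** (outliers), the density-based **DBSCAN** handles it effectively.
6 | - If higher **clustering accuracy** is the goal, **spectral clustering** will be more accurate than k-means.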
7 |
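For readers who want to see what k-means does before reaching for a library, here is a minimal pure-Python sketch of the standard assign-then-update loop (Lloyd's algorithm) on 2-D points. It is an illustration, not the implementation used in this repository: the function `kmeans`, the toy points, and the caller-supplied starting centroids are assumptions, and real work would normally use `KMeans` or `MiniBatchKMeans` from `sklearn.cluster`.

```python
def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm on 2-D points with caller-supplied initial centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(
                range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2
                + (p[1] - centroids[k][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    (
                        sum(m[0] for m in members) / len(members),
                        sum(m[1] for m in members) / len(members),
                    )
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[k])
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Hypothetical toy data: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centroids)
```

Every iteration touches every point, which is why a mini-batch variant becomes attractive once the dataset grows into the millions of records.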
--------------------------------------------------------------------------------
/聚类分析/客户特征的聚类与探索性分析.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
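7 | "# Project Background\n",
8 | "\n",
9 | "One day, the business team brought some customer data to the data team. Lacking a starting point for analysis, they hoped the data team could analyze the data and offer some insights, or suggestions for follow-up analysis and business thinking.\n",
10 | "\n",
11 | "Given this scenario and these needs, the deliverables for this analysis are:\n",
12 | "- This is an EDA task, and the business team has no prior knowledge to offer the data team.\n",
13 | "- The results will inform business insight or feed deeper follow-up analysis.\n",
14 | "- Data mining beyond descriptive statistics and basic exploratory displays.\n",
15 | "\n",
16 | "#### Data source features:\n",
17 | "- USER_ID: user ID column, integer. It is the unique user identifier, so it cannot be used as a clustering feature; it only tags which cluster each user belongs to after clustering.\n",
18 | "- AVG_ORDERS: average number of orders per user, float.\n",
19 | "- AVG_MONEY: average order value, i.e. the price per order, float.\n",
20 | "- IS_ACTIVE: whether the user is active, a result produced by another model, string.\n",
21 | "- SEX: gender, with 0, 1, 2 denoting unknown, male, and female.\n",
22 | "\n",
23 | "#### Analysis approach:\n",
24 | "- String features cannot be used for training directly, because sklearn generally expects numeric matrices or sparse matrices rather than raw strings.\n",
25 | "- SEX is a categorical variable and cannot take part in distance computations directly.\n",
26 | "- AVG_ORDERS and AVG_MONEY differ markedly in scale and need to be rescaled to be dimensionless.\n",
27 | "- Split off the ID column."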
28 | ]
29 | },
30 | {
31 | "cell_type": "code",
32 | "execution_count": 57,
33 | "metadata": {},
34 | "outputs": [],
35 | "source": [
36 | "# Import packages\n",
37 | "import pandas as pd\n",
38 | "import numpy as np\n",
39 | "import matplotlib.pyplot as plt\n",
40 | "%matplotlib inline\n",
41 | "from sklearn.preprocessing import MinMaxScaler\n",
42 | "from sklearn.cluster import KMeans\n",
43 | "from sklearn.metrics import calinski_harabasz_score as calinski_harabaz_score, silhouette_score  # newer scikit-learn releases only provide the 'harabasz' spelling"
44 | ]
45 | },
46 | {
47 | "cell_type": "code",
48 | "execution_count": 15,
49 | "metadata": {},
50 | "outputs": [],
51 | "source": [
52 | "raw_data = pd.read_csv('cluster.txt')\n",
53 | "# Numeric features\n",
54 | "numeric_feature = raw_data.iloc[:,1:3]"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": 18,
60 | "metadata": {},
61 | "outputs": [
62 | {
63 | "name": "stdout",
64 | "output_type": "stream",
65 | "text": [
66 | "[[0.64200477 0.62591687]\n",
67 | " [0.91169451 0.80440098]\n",
68 | " [0.69451074 0.39608802]\n",
69 | " ...\n",
70 | " [0.3221957 0.17359413]\n",
71 | " [0.42004773 0.31295844]\n",
72 | " [0.64916468 0.40831296]]\n"
73 | ]
74 | }
75 | ],
76 | "source": [
77 | "# Min-max normalization\n",
78 | "scaler = MinMaxScaler()\n",
79 | "scaled_numeric_feature = scaler.fit_transform(numeric_feature)\n",
80 | "print(scaled_numeric_feature[:,:2])"
81 | ]
82 | },
83 | {
84 | "cell_type": "code",
85 | "execution_count": 25,
86 | "metadata": {},
87 | "outputs": [
88 | {
89 | "data": {
90 | "text/plain": [
91 | "KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,\n",
92 | " n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',\n",
93 | " random_state=0, tol=0.0001, verbose=0)"
94 | ]
95 | },
96 | "execution_count": 25,
97 | "metadata": {},
98 | "output_type": "execute_result"
99 | }
100 | ],
101 | "source": [
102 | "# Train the model\n",
103 | "n_cluster = 3\n",
104 | "model_kmeans = KMeans(n_clusters = n_cluster, random_state=0)\n",
105 | "model_kmeans.fit(scaled_numeric_feature)"
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "execution_count": 27,
111 | "metadata": {},
112 | "outputs": [
113 | {
114 | "name": "stdout",
115 | "output_type": "stream",
116 | "text": [
117 | "sample: 1000 \t features: 4\n"
118 | ]
119 | }
120 | ],
121 | "source": [
122 | "# Model evaluation\n",
123 | "n_samples,n_features = raw_data.iloc[:,1:].shape # total number of samples, total number of features\n",
124 | "print('sample: %d \\t features: %d' % (n_samples,n_features))"
125 | ]
126 | },
127 | {
128 | "cell_type": "code",
129 | "execution_count": 31,
130 | "metadata": {},
131 | "outputs": [
132 | {
133 | "name": "stdout",
134 | "output_type": "stream",
135 | "text": [
136 | "\n",
137 | " unspuervised_score: \n",
138 | " ------------------------------------------------------------\n",
139 | " silh c&h\n",
140 | "0 0.634086 2860.821834\n"
141 | ]
142 | }
143 | ],
144 | "source": [
145 | "# Unsupervised evaluation metrics\n",
146 | "silhouette_s = silhouette_score(scaled_numeric_feature, model_kmeans.labels_, metric='euclidean')\n",
147 | "calinski_harabaz_s = calinski_harabaz_score(scaled_numeric_feature, model_kmeans.labels_) # Calinski-Harabasz score\n",
148 | "unspuervised_data = {'silh':[silhouette_s], 'c&h':[calinski_harabaz_s]}\n",
149 | "unspuervised_score = pd.DataFrame.from_dict(unspuervised_data)\n",
150 | "print(\"\\n\",'unspuervised_score:', '\\n', '-'*60)\n",
151 | "print(unspuervised_score)"
152 | ]
153 | },
154 | {
155 | "cell_type": "markdown",
156 | "metadata": {},
157 | "source": [
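158 | "The results above show that the clustering quality is quite good. Taking silh (the silhouette coefficient) as an example, a value > 0.5 indicates good clustering quality. The basic criterion for quality is whether the clusters are clearly separated from one another."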
159 | ]
160 | },
161 | {
162 | "cell_type": "code",
163 | "execution_count": 35,
164 | "metadata": {},
165 | "outputs": [
166 | {
167 | "name": "stdout",
168 | "output_type": "stream",
169 | "text": [
170 | " USER_ID AVG_ORDERS AVG_MONEY IS_ACTIVE SEX labels\n",
171 | "0 1 3.58 40.43 活跃 1 2\n",
172 | "1 2 4.71 41.16 不活跃 1 2\n",
173 | "2 3 3.80 39.49 不活跃 2 1\n",
174 | "3 4 2.85 38.36 不活跃 1 0\n",
175 | "4 5 3.71 38.34 活跃 1 1\n"
176 | ]
177 | }
178 | ],
179 | "source": [
180 | "# Merge the data and the cluster labels\n",
181 | "kmeans_labels = pd.DataFrame(model_kmeans.labels_, columns = ['labels'])\n",
182 | "# Combine the raw data with the labels\n",
183 | "kmeans_data = pd.concat([raw_data, kmeans_labels], axis=1)\n",
184 | "print(kmeans_data.head())"
185 | ]
186 | },
187 | {
188 | "cell_type": "code",
189 | "execution_count": 41,
190 | "metadata": {},
191 | "outputs": [
192 | {
193 | "name": "stdout",
194 | "output_type": "stream",
195 | "text": [
196 | " record_count record_rate\n",
197 | "labels \n",
198 | "0 332 0.332\n",
199 | "1 337 0.337\n",
200 | "2 331 0.331\n"
201 | ]
202 | }
203 | ],
204 | "source": [
205 | "# Count the samples in each cluster and their share of the total\n",
206 | "label_count = kmeans_data.groupby(['labels'])['SEX'].count()\n",
207 | "label_count_ratio = label_count / kmeans_data.shape[0]\n",
208 | "kmeans_record_count = pd.concat([label_count,label_count_ratio], axis=1)\n",
209 | "kmeans_record_count.columns = ['record_count', 'record_rate']\n",
210 | "print(kmeans_record_count.head())"
211 | ]
212 | },
213 | {
214 | "cell_type": "code",
215 | "execution_count": 44,
216 | "metadata": {},
217 | "outputs": [
218 | {
219 | "name": "stdout",
220 | "output_type": "stream",
221 | "text": [
222 | " AVG_ORDERS AVG_MONEY\n",
223 | "labels \n",
224 | "0 2.022349 38.980602\n",
225 | "1 3.987389 39.028754\n",
226 | "2 3.958610 40.996254\n"
227 | ]
228 | }
229 | ],
230 | "source": [
231 | "# Mean of the numeric features for each cluster\n",
232 | "kmeans_numeric_features = kmeans_data.groupby(['labels'])[['AVG_ORDERS', 'AVG_MONEY']].mean()\n",
233 | "print(kmeans_numeric_features)"
234 | ]
235 | },
236 | {
237 | "cell_type": "code",
238 | "execution_count": 52,
239 | "metadata": {},
240 | "outputs": [],
241 | "source": [
242 | "# Compute the distribution of the categorical features for each cluster\n",
243 | "active_list = []\n",
244 | "sex_gb_list = []\n",
245 | "unique_labels = np.unique(model_kmeans.labels_)\n",
246 | "for each_label in unique_labels:\n",
247 | "    each_data = kmeans_data[kmeans_data['labels']==each_label]\n",
248 | "    # share of each IS_ACTIVE / SEX value within the current cluster\n",
248 | "    active_list.append(each_data.groupby(['IS_ACTIVE'])['USER_ID'].count()/each_data.shape[0])\n",
249 | "    sex_gb_list.append(each_data.groupby(['SEX'])['USER_ID'].count()/each_data.shape[0])\n",
250 | "\n",
251 | "kmeans_active_pd = pd.DataFrame(active_list)\n",
252 | "kmeans_sex_gb_pd = pd.DataFrame(sex_gb_list)\n",
253 | "kmeans_string_features = pd.concat((kmeans_active_pd,kmeans_sex_gb_pd), axis=1)\n",
254 | "kmeans_string_features.index = unique_labels"
255 | ]
256 | },
257 | {
258 | "cell_type": "code",
259 | "execution_count": 53,
260 | "metadata": {},
261 | "outputs": [],
262 | "source": [
263 | "# Merge the analysis results for all clusters"
264 | ]
265 | },
266 | {
267 | "cell_type": "code",
268 | "execution_count": 55,
269 | "metadata": {},
270 | "outputs": [
271 | {
272 | "name": "stdout",
273 | "output_type": "stream",
274 | "text": [
275 | " record_count record_rate AVG_ORDERS AVG_MONEY 不活跃 活跃 \\\n",
276 | "0 332 0.332 2.022349 38.980602 0.487952 0.512048 \n",
277 | "1 337 0.337 3.987389 39.028754 0.495549 0.504451 \n",
278 | "2 331 0.331 3.958610 40.996254 0.504532 0.495468 \n",
279 | "\n",
280 | " 0 1 2 \n",
281 | "0 0.003012 0.990964 0.006024 \n",
282 | "1 0.014837 0.014837 0.970326 \n",
283 | "2 0.984894 0.009063 0.006042 \n"
284 | ]
285 | }
286 | ],
287 | "source": [
288 | "features_all = pd.concat((kmeans_record_count,kmeans_numeric_features,kmeans_string_features), axis=1)\n",
289 | "print(features_all.head())"
290 | ]
291 | },
292 | {
293 | "cell_type": "code",
294 | "execution_count": 59,
295 | "metadata": {},
296 | "outputs": [
297 | {
298 | "data": {
299 | "image/png": "\n",
300 | "text/plain": [
301 | ""
302 | ]
303 | },
304 | "metadata": {},
305 | "output_type": "display_data"
306 | }
307 | ],
308 | "source": [
309 | "# Visualize the cluster profiles\n",
310 | "# part 1: global configuration\n",
311 | "fig = plt.figure(figsize=(10, 7))\n",
312 | "titles = ['RECORD_RATE','AVG_ORDERS','AVG_MONEY','IS_ACTIVE','SEX'] # shared column titles\n",
313 | "line_index,col_index = 3,5 # grid size: 3 rows by 5 columns\n",
314 | "ax_ids = np.arange(1,16).reshape(line_index,col_index) # subplot index of each grid cell\n",
315 | "plt.rcParams['font.sans-serif']=['SimHei'] # font needed to render the Chinese labels\n",
316 | " \n",
317 | "# part 2: pie chart of each cluster's share of records\n",
318 | "pie_fracs = features_all['record_rate'].tolist()\n",
319 | "for ind in range(len(pie_fracs)):\n",
320 | "    ax = fig.add_subplot(line_index, col_index, ax_ids[:,0][ind])\n",
321 | "    init_labels = ['','',''] # start with empty labels\n",
322 | "    init_labels[ind] = 'cluster_{0}'.format(ind) # label only the current cluster\n",
323 | "    init_colors = ['lightgray', 'lightgray', 'lightgray']\n",
324 | "    init_colors[ind] = 'g' # highlight the current cluster's wedge\n",
325 | "    ax.pie(x=pie_fracs, autopct='%3.0f %%',labels=init_labels,colors=init_colors)\n",
326 | "    ax.set_aspect('equal') # draw the pie as a circle\n",
327 | "    if ind == 0:\n",
328 | "        ax.set_title(titles[0])\n",
329 | " \n",
330 | "# part 3: mean AVG_ORDERS of each cluster\n",
331 | "avg_orders_label = 'AVG_ORDERS'\n",
332 | "avg_orders_fraces = features_all[avg_orders_label]\n",
333 | "for ind, frace in enumerate(avg_orders_fraces):\n",
334 | "    ax = fig.add_subplot(line_index, col_index, ax_ids[:,1][ind])\n",
335 | "    ax.bar(x=unique_labels,height=[0,frace,0]) # single centered bar\n",
336 | "    ax.set_ylim((0, max(avg_orders_fraces)*1.2))\n",
337 | "    ax.set_xticks([])\n",
338 | "    ax.set_yticks([])\n",
339 | "    if ind == 0: # column title on the first row only\n",
340 | "        ax.set_title(titles[1])\n",
341 | "    # value label above the bar and feature name below the x axis\n",
342 | "    ax.text(unique_labels[1],frace+0.4,s='{:.2f}'.format(frace),ha='center',va='top')\n",
343 | "    ax.text(unique_labels[1],-0.4,s=avg_orders_label,ha='center',va='bottom')\n",
344 | " \n",
345 | "# part 4: mean AVG_MONEY of each cluster\n",
346 | "avg_money_label = 'AVG_MONEY'\n",
347 | "avg_money_fraces = features_all[avg_money_label]\n",
348 | "for ind, frace in enumerate(avg_money_fraces):\n",
349 | "    ax = fig.add_subplot(line_index, col_index, ax_ids[:,2][ind])\n",
350 | "    ax.bar(x=unique_labels,height=[0,frace,0]) # single centered bar\n",
351 | "    ax.set_ylim((0, max(avg_money_fraces)*1.2))\n",
352 | "    ax.set_xticks([])\n",
353 | "    ax.set_yticks([])\n",
354 | "    if ind == 0: # column title on the first row only\n",
355 | "        ax.set_title(titles[2])\n",
356 | "    # value label above the bar and feature name below the x axis\n",
357 | "    ax.text(unique_labels[1],frace+4,s='{:.0f}'.format(frace),ha='center',va='top')\n",
358 | "    ax.text(unique_labels[1],-4,s=avg_money_label,ha='center',va='bottom')\n",
359 | " \n",
360 | "# part 5: share of inactive / active users in each cluster\n",
361 | "activity_labels = ['不活跃','活跃'] # column names taken from the IS_ACTIVE values\n",
362 | "x_ticket = list(range(len(activity_labels)))\n",
363 | "activity_data = features_all[activity_labels]\n",
364 | "ylim_max = activity_data.values.max()\n",
365 | "for ind,each_data in enumerate(activity_data.values):\n",
366 | "    ax = fig.add_subplot(line_index, col_index, ax_ids[:,3][ind])\n",
367 | "    ax.bar(x=x_ticket,height=each_data) # one bar per activity status\n",
368 | "    ax.set_ylim((0, ylim_max*1.2))\n",
369 | "    ax.set_xticks([])\n",
370 | "    ax.set_yticks([])\n",
371 | "    if ind == 0: # column title on the first row only\n",
372 | "        ax.set_title(titles[3])\n",
373 | "    # value label above each bar and category name below the x axis\n",
374 | "    activity_values = ['{:.1%}'.format(i) for i in each_data]\n",
375 | "    for i in range(len(x_ticket)):\n",
376 | "        ax.text(x_ticket[i],each_data[i]+0.05,s=activity_values[i],ha='center',va='top')\n",
377 | "        ax.text(x_ticket[i],-0.05,s=activity_labels[i],ha='center',va='bottom')\n",
378 | " \n",
379 | "# part 6: SEX distribution of each cluster\n",
380 | "sex_data = features_all.iloc[:,-3:]\n",
381 | "x_ticket = list(range(sex_data.shape[1])) # one bar per SEX value (columns, not rows)\n",
382 | "sex_labels = ['SEX_{}'.format(i) for i in range(3)]\n",
383 | "ylim_max = sex_data.values.max()\n",
384 | "for ind,each_data in enumerate(sex_data.values):\n",
385 | "    ax = fig.add_subplot(line_index, col_index, ax_ids[:,4][ind])\n",
386 | "    ax.bar(x=x_ticket,height=each_data) # one bar per SEX value\n",
387 | "    ax.set_ylim((0, ylim_max*1.2))\n",
388 | "    ax.set_xticks([])\n",
389 | "    ax.set_yticks([])\n",
390 | "    if ind == 0: # column title on the first row only\n",
391 | "        ax.set_title(titles[4])\n",
392 | "    # value label above each bar and category name below the x axis\n",
393 | "    sex_values = ['{:.1%}'.format(i) for i in each_data]\n",
394 | "    for i in range(len(x_ticket)):\n",
395 | "        ax.text(x_ticket[i],each_data[i]+0.1,s=sex_values[i],ha='center',va='top')\n",
396 | "        ax.text(x_ticket[i],-0.1,s=sex_labels[i],ha='center',va='bottom')\n",
397 | "\n",
398 | "plt.tight_layout(pad=0.8) # set the padding around the subplots"
399 | ]
400 | },
401 | {
402 | "cell_type": "markdown",
403 | "metadata": {},
404 | "source": [
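405 | "# Conclusion\n",
406 | "\n",
407 | "After clustering, the customers fall into 3 groups:\n",
408 | "- cluster_0: the distinguishing feature is a low average order count (only 2.02); the group is predominantly male;\n",
409 | "- cluster_1: a high average order count (3.99); the group is predominantly female;\n",
410 | "- cluster_2: similar to cluster_1, but the sex of the group is unknown.\n",
411 | "\n",
412 | "Average order value and activity status are distributed fairly consistently and evenly across all clusters, so they neither separate the clusters nor characterize any one group. They are therefore ignored.\n",
413 | "\n",
414 | "In the end we obtain 3 groups: **low-value male customers, high-value female customers, and high-value customers of unknown sex.**\n",
415 | "\n",
416 | "**Follow-up analysis directions**:\n",
417 | "- The unknown-sex group should not have such a high average order value, and more importantly its sample size is not small. This is unlikely to be a random occurrence: there may be problems in areas such as data collection, customer experience, or customer registration, or these customers may simply be unwilling to disclose their sex. This could be the starting point for another EDA topic.\n",
418 | "- For the second group, the high-value female customers, analyze user preferences and characteristics: for example, when they buy, their average order value, and whether their product categories, preferred discount levels, acquisition channels, and promotion types show a clear concentration."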
419 | ]
420 | }
421 | ],
422 | "metadata": {
423 | "kernelspec": {
424 | "display_name": "Python 3",
425 | "language": "python",
426 | "name": "python3"
427 | },
428 | "language_info": {
429 | "codemirror_mode": {
430 | "name": "ipython",
431 | "version": 3
432 | },
433 | "file_extension": ".py",
434 | "mimetype": "text/x-python",
435 | "name": "python",
436 | "nbconvert_exporter": "python",
437 | "pygments_lexer": "ipython3",
438 | "version": "3.6.8"
439 | }
440 | },
441 | "nbformat": 4,
442 | "nbformat_minor": 2
443 | }
444 |
--------------------------------------------------------------------------------