├── .gitignore
├── 0. 配置环境.ipynb
├── 1. Series和DataFrame对象的创建.ipynb
├── 10. 数值统计运算.ipynb
├── 12. Category型与离散化.ipynb
├── 13. 时间型操作.ipynb
├── 14. Object型操作.ipynb
├── 2. Series和DataFrame对象的查、改、增、删.ipynb
├── 3. merge详解.ipynb
├── 4. Index对象的创建,查、改、增、删和使用.ipynb
├── 5. 普通列和行index的相互转化.ipynb
├── 6. 数据结构总览.ipynb
├── 7. 显示控制.ipynb
├── 8. 快速查看整体信息.ipynb
├── 9. 数值运算.ipynb
├── README.md
├── pandas_tutorial.egg-info
│   ├── PKG-INFO
│   ├── SOURCES.txt
│   ├── dependency_links.txt
│   ├── requires.txt
│   └── top_level.txt
├── resource
│   ├── 1.png
│   ├── 2.png
│   ├── 204078950.png
│   ├── 3.png
│   ├── pay.png
│   └── 我的.png
├── setup.py
└── tips.csv
/.gitignore:
--------------------------------------------------------------------------------
1 | .ipynb_checkpoints
2 |
3 | venv
4 | dist
--------------------------------------------------------------------------------
/0. 配置环境.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Environment setup\n",
8 | "\n",
9 | "Jupyter Notebook runs in the browser, so no IDE is needed; you only need to install Python and a few required libraries."
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {},
15 | "source": [
16 | "# 0. Installing Python and the required libraries\n",
17 | "- See my earlier article: https://zhuanlan.zhihu.com/p/27335822"
18 | ]
19 | },
20 | {
21 | "cell_type": "code",
22 | "execution_count": null,
23 | "metadata": {},
24 | "outputs": [],
25 | "source": []
26 | },
27 | {
28 | "cell_type": "markdown",
29 | "metadata": {
30 | "tags": []
31 | },
32 | "source": [
33 | "# 1. jupyter notebook"
34 | ]
35 | },
36 | {
37 | "cell_type": "markdown",
38 | "metadata": {
39 | "tags": []
40 | },
41 | "source": [
42 | "## 1.1 Installing Jupyter and its extensions"
43 | ]
44 | },
45 | {
46 | "cell_type": "markdown",
47 | "metadata": {},
48 | "source": [
49 | "- Run: `pip install jupyter notebook`\n",
50 | "- Run: `pip install jupyter_contrib_nbextensions`\n",
51 | "- Run: `jupyter contrib nbextension install --user`"
52 | ]
53 | },
54 | {
55 | "cell_type": "markdown",
56 | "metadata": {},
57 | "source": [
58 | "## 1.2 Enabling the auto-generated table of contents"
59 | ]
60 | },
61 | {
62 | "cell_type": "markdown",
63 | "metadata": {},
64 | "source": [
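65 | "1. Run `jupyter notebook`; the browser opens automatically.\n",
66 | "\n",
67 | "2. In the page that opens, go to the `Nbextensions` tab.\n",
68 | "\n",
69 | "\n",
70 | "3. Under `Table of Contents (2)`, click **Enable** to turn on the automatic table of contents.\n",
71 | "\n",
72 | "\n",
73 | "4. Change the default value of this option from 4 to 3.\n",
74 | ""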
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": null,
80 | "metadata": {},
81 | "outputs": [],
82 | "source": []
83 | },
84 | {
85 | "cell_type": "markdown",
86 | "metadata": {},
87 | "source": [
88 | "# 2. JupyterLab\n",
89 | "\n",
90 | "JupyterLab is the newer generation of the Jupyter interface; I personally prefer it."
91 | ]
92 | },
93 | {
94 | "cell_type": "markdown",
95 | "metadata": {},
96 | "source": [
97 | "- Run: `pip install jupyterlab`\n",
98 | "- Run: `jupyter lab`; the browser opens automatically"
99 | ]
100 | },
101 | {
102 | "cell_type": "code",
103 | "execution_count": null,
104 | "metadata": {},
105 | "outputs": [],
106 | "source": []
107 | },
108 | {
109 | "cell_type": "markdown",
110 | "metadata": {},
111 | "source": [
112 | "# 3. A few keyboard shortcuts"
113 | ]
114 | },
115 | {
116 | "cell_type": "markdown",
117 | "metadata": {},
118 | "source": [
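119 | "In command mode (click to the left of a cell; the text box turns gray):\n",
120 | "- `a` inserts a cell above the current cell\n",
121 | "- `b` inserts a cell below the current cell\n",
122 | "- `dd` deletes the current cell\n",
123 | "- `Shift+Enter` runs the current cell"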
124 | ]
125 | }
126 | ],
127 | "metadata": {
128 | "kernelspec": {
129 | "display_name": "Python 3 (ipykernel)",
130 | "language": "python",
131 | "name": "python3"
132 | },
133 | "language_info": {
134 | "codemirror_mode": {
135 | "name": "ipython",
136 | "version": 3
137 | },
138 | "file_extension": ".py",
139 | "mimetype": "text/x-python",
140 | "name": "python",
141 | "nbconvert_exporter": "python",
142 | "pygments_lexer": "ipython3",
143 | "version": "3.10.4"
144 | },
145 | "toc": {
146 | "colors": {
147 | "hover_highlight": "#DAA520",
148 | "navigate_num": "#000000",
149 | "navigate_text": "#333333",
150 | "running_highlight": "#FF0000",
151 | "selected_highlight": "#FFD700",
152 | "sidebar_border": "#EEEEEE",
153 | "wrapper_background": "#FFFFFF"
154 | },
155 | "moveMenuLeft": true,
156 | "nav_menu": {
157 | "height": "84px",
158 | "width": "252px"
159 | },
160 | "navigate_menu": true,
161 | "number_sections": false,
162 | "sideBar": true,
163 | "threshold": "3",
164 | "toc_cell": false,
165 | "toc_section_display": "block",
166 | "toc_window_display": true,
167 | "widenNotebook": false
168 | }
169 | },
170 | "nbformat": 4,
171 | "nbformat_minor": 4
172 | }
173 |
--------------------------------------------------------------------------------
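After the installation steps in `0. 配置环境.ipynb`, one quick way to confirm the environment is to check that the packages can be located by Python. The following is a minimal sketch using only the standard library; the helper name `check_packages` and the package list are assumptions made for illustration and are not part of the tutorial.

```python
import importlib.util


def check_packages(names):
    """Return {name: True/False} depending on whether each top-level package can be found."""
    # find_spec returns None when a top-level module is not installed.
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Hypothetical list: the libraries this tutorial imports or launches.
    for name, found in check_packages(["numpy", "pandas", "notebook", "jupyterlab"]).items():
        print(f"{name}: {'found' if found else 'missing'}")
```

Any package reported as missing can then be installed with `pip install` as described in the notebook.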
/1. Series和DataFrame对象的创建.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Creating Series and DataFrame objects\n",
8 | "Series and DataFrame are the core objects in pandas. This section shows how to create both."
9 | ]
10 | },
11 | {
12 | "cell_type": "code",
13 | "execution_count": 1,
14 | "metadata": {},
15 | "outputs": [],
16 | "source": [
17 | "__author__ = 'zhenhang.sun@gmail.com'"
18 | ]
19 | },
20 | {
21 | "cell_type": "code",
22 | "execution_count": 2,
23 | "metadata": {
24 | "ExecuteTime": {
25 | "end_time": "2017-09-25T14:15:51.389871Z",
26 | "start_time": "2017-09-25T14:15:51.202730Z"
27 | }
28 | },
29 | "outputs": [
30 | {
31 | "data": {
32 | "text/plain": [
33 | "'D:\\\\github\\\\pandas-tutorial'"
34 | ]
35 | },
36 | "execution_count": 2,
37 | "metadata": {},
38 | "output_type": "execute_result"
39 | }
40 | ],
41 | "source": [
42 | "pwd"
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": 3,
48 | "metadata": {
49 | "ExecuteTime": {
50 | "end_time": "2017-09-25T14:16:03.547342Z",
51 | "start_time": "2017-09-25T14:15:51.394374Z"
52 | }
53 | },
54 | "outputs": [],
55 | "source": [
56 | "import numpy as np\n",
57 | "import pandas as pd"
58 | ]
59 | },
60 | {
61 | "cell_type": "markdown",
62 | "metadata": {},
63 | "source": [
64 | "# 1. Series"
65 | ]
66 | },
67 | {
68 | "cell_type": "markdown",
69 | "metadata": {
70 | "collapsed": true
71 | },
72 | "source": [
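73 | "Series is the basic object that pandas exposes. It is a one-dimensional data structure whose elements share one type, and it behaves like both a list and a dict; the dict-like behavior comes from its index.\n",
74 | "\n",
75 | " Series: ordered, indexed\n",
76 | " list: ordered, no index\n",
77 | " dict: indexed by key, no positional access (insertion-ordered only since Python 3.7)"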
78 | ]
79 | },
80 | {
81 | "cell_type": "markdown",
82 | "metadata": {},
83 | "source": [
84 | "## 1.1 Preview"
85 | ]
86 | },
87 | {
88 | "cell_type": "code",
89 | "execution_count": 4,
90 | "metadata": {
91 | "ExecuteTime": {
92 | "end_time": "2017-09-25T14:16:04.076194Z",
93 | "start_time": "2017-09-25T14:16:03.551337Z"
94 | }
95 | },
96 | "outputs": [
97 | {
98 | "data": {
99 | "text/plain": [
100 | "a 1\n",
101 | "b 2\n",
102 | "c 3\n",
103 | "Name: sss, dtype: int64"
104 | ]
105 | },
106 | "execution_count": 4,
107 | "metadata": {},
108 | "output_type": "execute_result"
109 | }
110 | ],
111 | "source": [
112 | "data = [1,2,3]\n",
113 | "index = ['a','b','c']\n",
114 | "s = pd.Series(data=data, index=index, name='sss')\n",
115 | "s"
116 | ]
117 | },
118 | {
119 | "cell_type": "code",
120 | "execution_count": 5,
121 | "metadata": {
122 | "ExecuteTime": {
123 | "end_time": "2017-09-25T14:16:04.089203Z",
124 | "start_time": "2017-09-25T14:16:04.082197Z"
125 | }
126 | },
127 | "outputs": [
128 | {
129 | "data": {
130 | "text/plain": [
131 | "Index(['a', 'b', 'c'], dtype='object')"
132 | ]
133 | },
134 | "execution_count": 5,
135 | "metadata": {},
136 | "output_type": "execute_result"
137 | }
138 | ],
139 | "source": [
140 | "s.index # first of four attributes: the index"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": 6,
146 | "metadata": {
147 | "ExecuteTime": {
148 | "end_time": "2017-09-25T14:16:04.323077Z",
149 | "start_time": "2017-09-25T14:16:04.091205Z"
150 | }
151 | },
152 | "outputs": [
153 | {
154 | "data": {
155 | "text/plain": [
156 | "'sss'"
157 | ]
158 | },
159 | "execution_count": 6,
160 | "metadata": {},
161 | "output_type": "execute_result"
162 | }
163 | ],
164 | "source": [
165 | "s.name # second of four attributes: the name"
166 | ]
167 | },
168 | {
169 | "cell_type": "code",
170 | "execution_count": 7,
171 | "metadata": {
172 | "ExecuteTime": {
173 | "end_time": "2017-09-25T14:16:04.639450Z",
174 | "start_time": "2017-09-25T14:16:04.327080Z"
175 | }
176 | },
177 | "outputs": [
178 | {
179 | "data": {
180 | "text/plain": [
181 | "array([1, 2, 3], dtype=int64)"
182 | ]
183 | },
184 | "execution_count": 7,
185 | "metadata": {},
186 | "output_type": "execute_result"
187 | }
188 | ],
189 | "source": [
190 | "s.values # third of four attributes: the values"
191 | ]
192 | },
193 | {
194 | "cell_type": "code",
195 | "execution_count": 8,
196 | "metadata": {
197 | "ExecuteTime": {
198 | "end_time": "2017-09-25T14:16:04.822963Z",
199 | "start_time": "2017-09-25T14:16:04.641434Z"
200 | }
201 | },
202 | "outputs": [
203 | {
204 | "data": {
205 | "text/plain": [
206 | "dtype('int64')"
207 | ]
208 | },
209 | "execution_count": 8,
210 | "metadata": {},
211 | "output_type": "execute_result"
212 | }
213 | ],
214 | "source": [
215 | "s.dtype # fourth of four attributes: the element type"
216 | ]
217 | },
218 | {
219 | "cell_type": "markdown",
220 | "metadata": {},
221 | "source": [
222 | "## 1.2 Creation"
223 | ]
224 | },
225 | {
226 | "cell_type": "markdown",
227 | "metadata": {},
228 | "source": [
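229 | "##### `pd.Series(data=None, index=None, name=None)`\n",
230 | "- data: accepts several types; see the subsections below;\n",
231 | "- index: the index labels;\n",
232 | "- name: a label describing the data; rarely needed, mostly when converting to or from a DataFrame or an Index."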
233 | ]
234 | },
235 | {
236 | "cell_type": "markdown",
237 | "metadata": {},
238 | "source": [
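239 | "### 1.2.1 data without an index\n",
240 | "- If data is a **1-D ndarray or a 1-D list**, it lacks the index information a Series needs;\n",
241 | "- If index is provided, it must have the same length as data;\n",
242 | "- If index is not provided, a default integer index range(0, len(data)) is generated."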
243 | ]
244 | },
245 | {
246 | "cell_type": "code",
247 | "execution_count": 9,
248 | "metadata": {
249 | "ExecuteTime": {
250 | "end_time": "2017-09-25T14:16:05.175335Z",
251 | "start_time": "2017-09-25T14:16:04.825968Z"
252 | }
253 | },
254 | "outputs": [
255 | {
256 | "data": {
257 | "text/plain": [
258 | "a 1\n",
259 | "b 2\n",
260 | "c 3\n",
261 | "dtype: int32"
262 | ]
263 | },
264 | "execution_count": 9,
265 | "metadata": {},
266 | "output_type": "execute_result"
267 | }
268 | ],
269 | "source": [
270 | "# data = [1,2,3]\n",
271 | "data1 = np.array([1,2,3])\n",
272 | "index1 = ['a','b','c']\n",
273 | "s = pd.Series(data=data1, index=index1)\n",
274 | "s"
275 | ]
276 | },
277 | {
278 | "cell_type": "markdown",
279 | "metadata": {},
280 | "source": []
281 | },
282 | {
283 | "cell_type": "markdown",
284 | "metadata": {},
285 | "source": [
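286 | "### 1.2.2 data with an index\n",
287 | " - If data is a **Series or a dict**, it already carries the index information a Series needs, so index does not have to be provided;\n",
288 | " - If index is also provided, the resulting Series is reindexed to it (the new index wins; labels absent from the old index get NaN), which is equivalent to a reindex operation."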
289 | ]
290 | },
291 | {
292 | "cell_type": "code",
293 | "execution_count": 10,
294 | "metadata": {
295 | "ExecuteTime": {
296 | "end_time": "2017-09-25T14:16:05.540456Z",
297 | "start_time": "2017-09-25T14:16:05.181344Z"
298 | }
299 | },
300 | "outputs": [
301 | {
302 | "data": {
303 | "text/plain": [
304 | "a 1.0\n",
305 | "b 2.0\n",
306 | "d NaN\n",
307 | "dtype: float64"
308 | ]
309 | },
310 | "execution_count": 10,
311 | "metadata": {},
312 | "output_type": "execute_result"
313 | }
314 | ],
315 | "source": [
316 | "# data = pd.Series([1,2,3], index = ['a','b','c'] )\n",
317 | "data2 = { 'a':1, 'b':2,'c':3 }\n",
318 | "index2 = ['a','b','d']\n",
319 | "s = pd.Series(data=data2, index=index2)\n",
320 | "s # missing entries are filled with NaN (NaN, not a number, is the pandas missing-value marker)."
321 | ]
322 | },
323 | {
324 | "cell_type": "code",
325 | "execution_count": null,
326 | "metadata": {},
327 | "outputs": [],
328 | "source": []
329 | },
330 | {
331 | "cell_type": "markdown",
332 | "metadata": {
333 | "collapsed": true
334 | },
335 | "source": [
336 | "# 2. DataFrame"
337 | ]
338 | },
339 | {
340 | "cell_type": "markdown",
341 | "metadata": {
342 | "collapsed": true
343 | },
344 | "source": [
345 | "A DataFrame is made of Series that share a common index, arranged as columns (2-D). It is the most frequently used object."
346 | ]
347 | },
348 | {
349 | "cell_type": "markdown",
350 | "metadata": {},
351 | "source": [
352 | "## 2.1 Preview"
353 | ]
354 | },
355 | {
356 | "cell_type": "code",
357 | "execution_count": 11,
358 | "metadata": {
359 | "ExecuteTime": {
360 | "end_time": "2017-09-25T14:16:06.079948Z",
361 | "start_time": "2017-09-25T14:16:05.549462Z"
362 | }
363 | },
364 | "outputs": [
365 | {
366 | "data": {
408 | "text/plain": [
409 | " A B C\n",
410 | "a 1 2 3\n",
411 | "b 4 5 6"
412 | ]
413 | },
414 | "execution_count": 11,
415 | "metadata": {},
416 | "output_type": "execute_result"
417 | }
418 | ],
419 | "source": [
420 | "data = [[1,2,3],\n",
421 | " [4,5,6]]\n",
422 | "index = ['a','b']\n",
423 | "columns = ['A','B','C']\n",
424 | "df = pd.DataFrame(data=data, index = index, columns = columns)\n",
425 | "df"
426 | ]
427 | },
428 | {
429 | "cell_type": "code",
430 | "execution_count": 12,
431 | "metadata": {
432 | "ExecuteTime": {
433 | "end_time": "2017-09-25T14:16:06.088969Z",
434 | "start_time": "2017-09-25T14:16:06.082952Z"
435 | }
436 | },
437 | "outputs": [
438 | {
439 | "data": {
440 | "text/plain": [
441 | "Index(['a', 'b'], dtype='object')"
442 | ]
443 | },
444 | "execution_count": 12,
445 | "metadata": {},
446 | "output_type": "execute_result"
447 | }
448 | ],
449 | "source": [
450 | "df.index # row index"
451 | ]
452 | },
453 | {
454 | "cell_type": "code",
455 | "execution_count": 13,
456 | "metadata": {
457 | "ExecuteTime": {
458 | "end_time": "2017-09-25T14:16:06.299196Z",
459 | "start_time": "2017-09-25T14:16:06.091956Z"
460 | }
461 | },
462 | "outputs": [
463 | {
464 | "data": {
465 | "text/plain": [
466 | "Index(['A', 'B', 'C'], dtype='object')"
467 | ]
468 | },
469 | "execution_count": 13,
470 | "metadata": {},
471 | "output_type": "execute_result"
472 | }
473 | ],
474 | "source": [
475 | "df.columns # column index, made up of the names of the Series"
476 | ]
477 | },
478 | {
479 | "cell_type": "code",
480 | "execution_count": 14,
481 | "metadata": {
482 | "ExecuteTime": {
483 | "end_time": "2017-09-25T14:16:06.454532Z",
484 | "start_time": "2017-09-25T14:16:06.308708Z"
485 | }
486 | },
487 | "outputs": [
488 | {
489 | "data": {
490 | "text/plain": [
491 | "array([[1, 2, 3],\n",
492 | " [4, 5, 6]], dtype=int64)"
493 | ]
494 | },
495 | "execution_count": 14,
496 | "metadata": {},
497 | "output_type": "execute_result"
498 | }
499 | ],
500 | "source": [
501 | "df.values "
502 | ]
503 | },
504 | {
505 | "cell_type": "code",
506 | "execution_count": 15,
507 | "metadata": {
508 | "ExecuteTime": {
509 | "end_time": "2017-09-25T14:16:06.696251Z",
510 | "start_time": "2017-09-25T14:16:06.456534Z"
511 | }
512 | },
513 | "outputs": [
514 | {
515 | "data": {
516 | "text/plain": [
517 | "A int64\n",
518 | "B int64\n",
519 | "C int64\n",
520 | "dtype: object"
521 | ]
522 | },
523 | "execution_count": 15,
524 | "metadata": {},
525 | "output_type": "execute_result"
526 | }
527 | ],
528 | "source": [
529 | "df.dtypes # note the trailing s: shows the element type of each column"
530 | ]
531 | },
532 | {
533 | "cell_type": "markdown",
534 | "metadata": {},
535 | "source": [
536 | "## 2.2 Creation"
537 | ]
538 | },
539 | {
540 | "cell_type": "markdown",
541 | "metadata": {},
542 | "source": [
543 | "##### `pd.DataFrame(data=None, index=None, columns=None)`\n",
544 | "The constructor takes several parameters; the ones that matter most here are `data`, `index`, and `columns`."
545 | ]
546 | },
547 | {
548 | "cell_type": "markdown",
549 | "metadata": {},
550 | "source": [
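551 | "### 2.2.1 data without a row index or a column index\n",
552 | "- If data is a **2-D ndarray or a list of lists**, it lacks the row and column index information a DataFrame needs;\n",
553 | "- If index or columns is provided, its length must match the number of rows or columns of data;\n",
554 | "- If index or columns is not provided, a default integer index range(0, data.shape[0]) or range(0, data.shape[1]) is generated."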
555 | ]
556 | },
557 | {
558 | "cell_type": "code",
559 | "execution_count": 16,
560 | "metadata": {
561 | "ExecuteTime": {
562 | "end_time": "2017-09-25T14:16:07.141602Z",
563 | "start_time": "2017-09-25T14:16:06.702255Z"
564 | }
565 | },
566 | "outputs": [
567 | {
568 | "data": {
610 | "text/plain": [
611 | " A B C\n",
612 | "a 1 2 3\n",
613 | "b 4 5 6"
614 | ]
615 | },
616 | "execution_count": 16,
617 | "metadata": {},
618 | "output_type": "execute_result"
619 | }
620 | ],
621 | "source": [
622 | "# data = [[1,2,3],\n",
623 | "# [4,5,6]]\n",
624 | "data1 = np.array([[1,2,3],\n",
625 | " [4,5,6]] )\n",
626 | "index1 = ['a','b']\n",
627 | "columns1 = ['A','B','C']\n",
628 | "df = pd.DataFrame(data=data1, index=index1, columns=columns1)\n",
629 | "df"
630 | ]
631 | },
632 | {
633 | "cell_type": "markdown",
634 | "metadata": {},
635 | "source": [
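636 | "### 2.2.2 data without a row index, with a column index\n",
637 | " - If data is a **dict of 1-D ndarrays or 1-D lists**, all of them must have the same length. The dict keys supply the columns a DataFrame needs; the index is missing;\n",
638 | " - If index is provided, it must have the same length as the lists;\n",
639 | " - If index is not provided, a default integer index range(0, n) is generated, where n is the length of the lists;\n",
640 | " - If columns is also provided, the resulting DataFrame is **reindexed along its columns**."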
641 | ]
642 | },
643 | {
644 | "cell_type": "code",
645 | "execution_count": 17,
646 | "metadata": {
647 | "ExecuteTime": {
648 | "end_time": "2017-09-25T14:16:07.349774Z",
649 | "start_time": "2017-09-25T14:16:07.146606Z"
650 | }
651 | },
652 | "outputs": [
653 | {
654 | "data": {
696 | "text/plain": [
697 | " A B D\n",
698 | "a 1 2 NaN\n",
699 | "b 4 5 NaN"
700 | ]
701 | },
702 | "execution_count": 17,
703 | "metadata": {},
704 | "output_type": "execute_result"
705 | }
706 | ],
707 | "source": [
708 | "data2 = { 'A' : [1,4], 'B': [2,5], 'C':[3,6] }\n",
709 | "index2 = ['a','b']\n",
710 | "columns2 = ['A','B','D']\n",
711 | "df = pd.DataFrame(data=data2, index=index2, columns=columns2)\n",
712 | "df"
713 | ]
714 | },
715 | {
716 | "cell_type": "markdown",
717 | "metadata": {},
718 | "source": [
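719 | "### 2.2.3 data with a row index and a column index\n",
720 | " - If data is a **dict of Series or dicts**, it already supplies everything a DataFrame needs;\n",
721 | " - If the Series or dicts have different indexes, pandas takes their union (it does not try to discard information) and fills missing values with NaN;\n",
722 | " - If index or columns is provided, the resulting DataFrame is reindexed to it (equivalent to a reindex operation)."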
723 | ]
724 | },
725 | {
726 | "cell_type": "code",
727 | "execution_count": 18,
728 | "metadata": {
729 | "ExecuteTime": {
730 | "end_time": "2017-09-25T14:16:07.570012Z",
731 | "start_time": "2017-09-25T14:16:07.351777Z"
732 | }
733 | },
734 | "outputs": [
735 | {
736 | "data": {
784 | "text/plain": [
785 | " A B C\n",
786 | "a 1.0 2.0 3.0\n",
787 | "b 4.0 5.0 NaN\n",
788 | "c NaN NaN 6.0"
789 | ]
790 | },
791 | "execution_count": 18,
792 | "metadata": {},
793 | "output_type": "execute_result"
794 | }
795 | ],
796 | "source": [
797 | "# data3 = { 'A' : pd.Series([1,4] ,index = ['a','b']), 'B' : pd.Series([2,5] ,index = ['a','b']), 'C' : pd.Series([3,6] ,index = ['a','c']) }\n",
798 | "data3 = { 'A' : { 'a':1, 'b':4}, 'B': {'a':2,'b':5}, 'C':{'a':3, 'c':6} }\n",
799 | "df = pd.DataFrame(data=data3)\n",
800 | "df"
801 | ]
802 | },
803 | {
804 | "cell_type": "markdown",
805 | "metadata": {},
806 | "source": [
807 | "# 3. Creating from files"
808 | ]
809 | },
810 | {
811 | "cell_type": "markdown",
812 | "metadata": {},
813 | "source": [
814 | "## 3.1 Creating from a .csv file"
815 | ]
816 | },
817 | {
818 | "cell_type": "markdown",
819 | "metadata": {},
820 | "source": [
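821 | "##### `pd.read_csv(filepath_or_buffer, sep=',', header='infer', names=None, index_col=None, encoding=None)`\n",
822 | "read_csv has many parameters, but these few cover most needs:\n",
823 | "- filepath_or_buffer: avoid Chinese characters in the path and file name, since they can cause errors;\n",
824 | "- sep: the delimiter used in the csv file; the default is ',', change it to match your data;\n",
825 | "- header: if the file has a header row with column names, leave this unchanged;\n",
826 | "- names: if the file has no column names, set header=None and pass the list of column names as names; if names is not set, default integer column labels are generated;\n",
827 | "- index_col: list of (int or name); pass column names or column positions to use those columns as the index;\n",
828 | "- encoding: depends on the encoding of your file; if reading Chinese text raises an error, try encoding='gbk'."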
829 | ]
830 | },
831 | {
832 | "cell_type": "code",
833 | "execution_count": 19,
834 | "metadata": {
835 | "ExecuteTime": {
836 | "end_time": "2017-09-25T14:16:08.148424Z",
837 | "start_time": "2017-09-25T14:16:07.573011Z"
838 | },
839 | "scrolled": true
840 | },
841 | "outputs": [
842 | {
843 | "data": {
927 | "text/plain": [
928 | " total_bill tip sex smoker day time size\n",
929 | "0 16.99 1.01 Female No Sun Dinner 2\n",
930 | "1 10.34 1.66 Male No Sun Dinner 3\n",
931 | "2 21.01 3.50 Male No Sun Dinner 3\n",
932 | "3 23.68 3.31 Male No Sun Dinner 2\n",
933 | "4 24.59 3.61 Female No Sun Dinner 4"
934 | ]
935 | },
936 | "execution_count": 19,
937 | "metadata": {},
938 | "output_type": "execute_result"
939 | }
940 | ],
941 | "source": [
942 | "tips = pd.read_csv( 'tips.csv')\n",
943 | "tips.head()"
944 | ]
945 | },
946 | {
947 | "cell_type": "code",
948 | "execution_count": 20,
949 | "metadata": {
950 | "ExecuteTime": {
951 | "end_time": "2017-09-25T14:16:08.291633Z",
952 | "start_time": "2017-09-25T14:16:08.151413Z"
953 | }
954 | },
955 | "outputs": [
956 | {
957 | "data": {
958 | "text/plain": [
959 | "RangeIndex(start=0, stop=244, step=1)"
960 | ]
961 | },
962 | "execution_count": 20,
963 | "metadata": {},
964 | "output_type": "execute_result"
965 | }
966 | ],
967 | "source": [
968 | "tips.index"
969 | ]
970 | },
971 | {
972 | "cell_type": "code",
973 | "execution_count": 21,
974 | "metadata": {
975 | "ExecuteTime": {
976 | "end_time": "2017-09-25T14:16:08.517552Z",
977 | "start_time": "2017-09-25T14:16:08.302620Z"
978 | }
979 | },
980 | "outputs": [
981 | {
982 | "data": {
983 | "text/plain": [
984 | "Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')"
985 | ]
986 | },
987 | "execution_count": 21,
988 | "metadata": {},
989 | "output_type": "execute_result"
990 | }
991 | ],
992 | "source": [
993 | "tips.columns"
994 | ]
995 | },
996 | {
997 | "cell_type": "code",
998 | "execution_count": 22,
999 | "metadata": {
1000 | "ExecuteTime": {
1001 | "end_time": "2017-09-25T14:16:08.899490Z",
1002 | "start_time": "2017-09-25T14:16:08.520554Z"
1003 | }
1004 | },
1005 | "outputs": [
1006 | {
1007 | "data": {
1008 | "text/plain": [
1009 | "array([[16.99, 1.01, 'Female', ..., 'Sun', 'Dinner', 2],\n",
1010 | " [10.34, 1.66, 'Male', ..., 'Sun', 'Dinner', 3],\n",
1011 | " [21.01, 3.5, 'Male', ..., 'Sun', 'Dinner', 3],\n",
1012 | " ...,\n",
1013 | " [22.67, 2.0, 'Male', ..., 'Sat', 'Dinner', 2],\n",
1014 | " [17.82, 1.75, 'Male', ..., 'Sat', 'Dinner', 2],\n",
1015 | " [18.78, 3.0, 'Female', ..., 'Thur', 'Dinner', 2]], dtype=object)"
1016 | ]
1017 | },
1018 | "execution_count": 22,
1019 | "metadata": {},
1020 | "output_type": "execute_result"
1021 | }
1022 | ],
1023 | "source": [
1024 | "tips.values"
1025 | ]
1026 | },
1027 | {
1028 | "cell_type": "markdown",
1029 | "metadata": {
1030 | "collapsed": true
1031 | },
1032 | "source": [
1033 | "## 3.2 Creating from an Excel file"
1034 | ]
1035 | },
1036 | {
1037 | "cell_type": "markdown",
1038 | "metadata": {},
1039 | "source": [
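1040 | "##### `pd.read_excel(io, sheet_name=0, header=0, index_col=None, names=None)`\n",
1041 | "read_excel has many parameters, but these few cover most needs:\n",
1042 | "- header: if the sheet has column names, leave this unchanged;\n",
1043 | "- names: if the sheet has no column names, set header=None and pass the list of column names as names; if names is not set, default integer column labels are generated;\n",
1044 | "- index_col: same as for read_csv above."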
1045 | ]
1046 | },
1047 | {
1048 | "cell_type": "code",
1049 | "execution_count": null,
1050 | "metadata": {},
1051 | "outputs": [],
1052 | "source": []
1053 | }
1054 | ],
1055 | "metadata": {
1056 | "kernelspec": {
1057 | "display_name": "Python 3",
1058 | "language": "python",
1059 | "name": "python3"
1060 | },
1061 | "language_info": {
1062 | "codemirror_mode": {
1063 | "name": "ipython",
1064 | "version": 3
1065 | },
1066 | "file_extension": ".py",
1067 | "mimetype": "text/x-python",
1068 | "name": "python",
1069 | "nbconvert_exporter": "python",
1070 | "pygments_lexer": "ipython3",
1071 | "version": "3.7.0"
1072 | },
1073 | "toc": {
1074 | "base_numbering": 1,
1075 | "nav_menu": {
1076 | "height": "282px",
1077 | "width": "252px"
1078 | },
1079 | "number_sections": false,
1080 | "sideBar": true,
1081 | "skip_h1_title": false,
1082 | "title_cell": "Table of Contents",
1083 | "title_sidebar": "Contents",
1084 | "toc_cell": false,
1085 | "toc_position": {
1086 | "height": "485px",
1087 | "left": "0px",
1088 | "right": "1068px",
1089 | "top": "66px",
1090 | "width": "298px"
1091 | },
1092 | "toc_section_display": "block",
1093 | "toc_window_display": true
1094 | }
1095 | },
1096 | "nbformat": 4,
1097 | "nbformat_minor": 2
1098 | }
1099 |
--------------------------------------------------------------------------------
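The construction rules described in `1. Series和DataFrame对象的创建.ipynb` can be condensed into one short script. This is a sketch that reuses the notebook's own sample values; the variable names (`s1`, `s2`, `df2`, `df3`, `length_error`) are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# 1) Unlabeled data (1-D ndarray or list): index must match the data length.
s1 = pd.Series(data=np.array([1, 2, 3]), index=["a", "b", "c"])
try:
    pd.Series(data=[1, 2, 3], index=["a", "b"])  # 3 values, 2 labels
    length_error = None
except ValueError as exc:
    length_error = exc

# 2) Labeled data (dict): passing index reindexes; "d" has no value, so it becomes NaN.
s2 = pd.Series(data={"a": 1, "b": 2, "c": 3}, index=["a", "b", "d"])

# 3) Dict of lists: keys become columns; columns= reindexes them ("C" dropped, "D" all NaN).
df2 = pd.DataFrame(
    data={"A": [1, 4], "B": [2, 5], "C": [3, 6]},
    index=["a", "b"],
    columns=["A", "B", "D"],
)

# 4) Dict of dicts: row labels are the union of the inner keys; gaps are filled with NaN.
df3 = pd.DataFrame(data={"A": {"a": 1, "b": 4}, "B": {"a": 2, "b": 5}, "C": {"a": 3, "c": 6}})

print(s2)
print(df2)
print(df3)
```

Cases 2 to 4 all follow the same principle: when the data already carries labels, an explicit `index` or `columns` argument selects and reorders by label rather than relabeling by position.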
/10. 数值统计运算.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "# Statistical operations\n",
8 | "This chapter covers the functions used most often in data analysis."
9 | ]
10 | },
11 | {
12 | "cell_type": "code",
13 | "execution_count": 1,
14 | "metadata": {},
15 | "outputs": [],
16 | "source": [
17 | "__author__ = 'zhenhang.sun@gmail.com'"
18 | ]
19 | },
20 | {
21 | "cell_type": "code",
22 | "execution_count": 2,
23 | "metadata": {},
24 | "outputs": [
25 | {
26 | "data": {
27 | "text/plain": [
28 | "'D:\\\\github\\\\pandas-tutorial'"
29 | ]
30 | },
31 | "execution_count": 2,
32 | "metadata": {},
33 | "output_type": "execute_result"
34 | }
35 | ],
36 | "source": [
37 | "pwd"
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": 3,
43 | "metadata": {},
44 | "outputs": [],
45 | "source": [
46 | "import numpy as np\n",
47 | "import pandas as pd"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": [
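54 | "# 1. Numeric statistical operations\n",
55 | "These statistics apply only to columns with a numeric element type, and return a Series indexed by the column index or the row index."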
56 | ]
57 | },
58 | {
59 | "cell_type": "markdown",
60 | "metadata": {},
61 | "source": [
62 | "## 1.1 Univariate statistics\n",
63 | "As the name suggests, these statistics describe only the distribution of a single variable."
64 | ]
65 | },
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {},
69 | "source": [
70 | "### 1.1.1 sum()\n",
71 | "Sum"
72 | ]
73 | },
74 | {
75 | "cell_type": "markdown",
76 | "metadata": {},
77 | "source": [
78 | "##### `DataFrame.sum(axis='index')`\n",
79 | "- axis: 'index' sums down the rows (one result per column); 'columns' sums across the columns (one result per row)"
80 | ]
81 | },
82 | {
83 | "cell_type": "code",
84 | "execution_count": 4,
85 | "metadata": {},
86 | "outputs": [
87 | {
88 | "data": {
127 | "text/plain": [
128 | " A B\n",
129 | "a 1 2\n",
130 | "b 3 5"
131 | ]
132 | },
133 | "execution_count": 4,
134 | "metadata": {},
135 | "output_type": "execute_result"
136 | }
137 | ],
138 | "source": [
139 | "df = pd.DataFrame([[1,2],[3,5]], index=['a','b'],columns=['A','B'])\n",
140 | "df"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": 5,
146 | "metadata": {},
147 | "outputs": [
148 | {
149 | "data": {
150 | "text/plain": [
151 | "A 4\n",
152 | "B 7\n",
153 | "dtype: int64"
154 | ]
155 | },
156 | "execution_count": 5,
157 | "metadata": {},
158 | "output_type": "execute_result"
159 | }
160 | ],
161 | "source": [
162 | "df.sum() # by default, sums each column\n",
163 | "# df.sum(axis='index') "
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "execution_count": 6,
169 | "metadata": {},
170 | "outputs": [
171 | {
172 | "data": {
173 | "text/plain": [
174 | "a 3\n",
175 | "b 8\n",
176 | "dtype: int64"
177 | ]
178 | },
179 | "execution_count": 6,
180 | "metadata": {},
181 | "output_type": "execute_result"
182 | }
183 | ],
184 | "source": [
185 | "df.sum(axis='columns') # sums each row"
186 | ]
187 | },
188 | {
189 | "cell_type": "markdown",
190 | "metadata": {},
191 | "source": [
192 | "### 1.1.2 mean(), std(), var()\n",
193 | "Mean, standard deviation, variance (std and var are sample statistics, ddof=1 by default)"
194 | ]
195 | },
196 | {
197 | "cell_type": "code",
198 | "execution_count": 7,
199 | "metadata": {},
200 | "outputs": [
201 | {
202 | "data": {
203 | "text/plain": [
204 | "A 2.0\n",
205 | "B 3.5\n",
206 | "dtype: float64"
207 | ]
208 | },
209 | "execution_count": 7,
210 | "metadata": {},
211 | "output_type": "execute_result"
212 | }
213 | ],
214 | "source": [
215 | "df.mean()"
216 | ]
217 | },
218 | {
219 | "cell_type": "code",
220 | "execution_count": 8,
221 | "metadata": {},
222 | "outputs": [
223 | {
224 | "data": {
225 | "text/plain": [
226 | "A 1.414214\n",
227 | "B 2.121320\n",
228 | "dtype: float64"
229 | ]
230 | },
231 | "execution_count": 8,
232 | "metadata": {},
233 | "output_type": "execute_result"
234 | }
235 | ],
236 | "source": [
237 | "df.std()"
238 | ]
239 | },
240 | {
241 | "cell_type": "code",
242 | "execution_count": 9,
243 | "metadata": {},
244 | "outputs": [
245 | {
246 | "data": {
247 | "text/plain": [
248 | "A 2.0\n",
249 | "B 4.5\n",
250 | "dtype: float64"
251 | ]
252 | },
253 | "execution_count": 9,
254 | "metadata": {},
255 | "output_type": "execute_result"
256 | }
257 | ],
258 | "source": [
259 | "df.var()"
260 | ]
261 | },
262 | {
263 | "cell_type": "code",
264 | "execution_count": null,
265 | "metadata": {},
266 | "outputs": [],
267 | "source": []
268 | },
269 | {
270 | "cell_type": "markdown",
271 | "metadata": {},
272 | "source": [
273 | "### 1.1.3 max(), min(), median()\n",
274 | "Maximum, minimum, median"
275 | ]
276 | },
277 | {
278 | "cell_type": "code",
279 | "execution_count": 10,
280 | "metadata": {},
281 | "outputs": [
282 | {
283 | "data": {
284 | "text/plain": [
285 | "A 1\n",
286 | "B 2\n",
287 | "dtype: int64"
288 | ]
289 | },
290 | "execution_count": 10,
291 | "metadata": {},
292 | "output_type": "execute_result"
293 | }
294 | ],
295 | "source": [
296 | "df.min()"
297 | ]
298 | },
299 | {
300 | "cell_type": "code",
301 | "execution_count": 11,
302 | "metadata": {},
303 | "outputs": [
304 | {
305 | "data": {
306 | "text/plain": [
307 | "A 3\n",
308 | "B 5\n",
309 | "dtype: int64"
310 | ]
311 | },
312 | "execution_count": 11,
313 | "metadata": {},
314 | "output_type": "execute_result"
315 | }
316 | ],
317 | "source": [
318 | "df.max()"
319 | ]
320 | },
321 | {
322 | "cell_type": "code",
323 | "execution_count": 12,
324 | "metadata": {},
325 | "outputs": [
326 | {
327 | "data": {
328 | "text/plain": [
329 | "A 2.0\n",
330 | "B 3.5\n",
331 | "dtype: float64"
332 | ]
333 | },
334 | "execution_count": 12,
335 | "metadata": {},
336 | "output_type": "execute_result"
337 | }
338 | ],
339 | "source": [
340 | "df.median()"
341 | ]
342 | },
343 | {
344 | "cell_type": "code",
345 | "execution_count": null,
346 | "metadata": {},
347 | "outputs": [],
348 | "source": []
349 | },
350 | {
351 | "cell_type": "markdown",
352 | "metadata": {},
353 | "source": [
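354 | "## 1.2 Pairwise statistics\n",
355 | "Compute a statistic between every pair of columns. The result is a DataFrame whose row index and column index are both the original column labels."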
356 | ]
357 | },
358 | {
359 | "cell_type": "markdown",
360 | "metadata": {},
361 | "source": [
362 | "### 1.2.1 cov() \n",
363 | "Covariance"
364 | ]
365 | },
366 | {
367 | "cell_type": "markdown",
368 | "metadata": {},
369 | "source": [
370 | "##### `DataFrame.cov(min_periods=None)`\n",
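371 | "- min_periods: the minimum number of observations, after NaN values are excluded, that a pair of columns must have to produce a valid result."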
372 | ]
373 | },
374 | {
375 | "cell_type": "code",
376 | "execution_count": 13,
377 | "metadata": {},
378 | "outputs": [
379 | {
380 | "data": {
419 | "text/plain": [
420 | " B C\n",
421 | "0 1 2.0\n",
422 | "1 2 NaN"
423 | ]
424 | },
425 | "execution_count": 13,
426 | "metadata": {},
427 | "output_type": "execute_result"
428 | }
429 | ],
430 | "source": [
431 | "df1 = pd.DataFrame([[1,2],[2,np.nan]],columns=['B','C'])\n",
432 | "df1"
433 | ]
434 | },
435 | {
436 | "cell_type": "code",
437 | "execution_count": 14,
438 | "metadata": {},
439 | "outputs": [
440 | {
441 | "data": {
480 | "text/plain": [
481 | " B C\n",
482 | "B 0.5 NaN\n",
483 | "C NaN NaN"
484 | ]
485 | },
486 | "execution_count": 14,
487 | "metadata": {},
488 | "output_type": "execute_result"
489 | }
490 | ],
491 | "source": [
492 | "df1.cov()"
493 | ]
494 | },
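{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how `min_periods` changes the result, assuming a small made-up frame named `demo` that is not part of the original data. Pairs of columns with fewer valid (non-NaN) row pairs than `min_periods` are reported as NaN."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical data: column C is missing one value\n",
"demo = pd.DataFrame({'B': [1, 2, 3], 'C': [2.0, np.nan, 5.0]})\n",
"\n",
"# By default every pair uses whatever rows are valid for both columns\n",
"print(demo.cov())\n",
"\n",
"# Requiring 3 valid rows keeps only the B-B entry; entries involving C become NaN\n",
"print(demo.cov(min_periods=3))"
]
},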
495 | {
496 | "cell_type": "markdown",
497 | "metadata": {},
498 | "source": [
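499 | "### 1.2.2 corr() \n",
500 | "Correlation coefficient\n",
501 | "\n",
502 | "##### `DataFrame.corr(method='pearson', min_periods=1)`\n",
503 | "- method: the correlation method: 'pearson', 'kendall', or 'spearman'\n",
504 | "- min_periods: the minimum number of observations, after NaN values are excluded, that a pair of columns must have to produce a valid result."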
505 | ]
506 | },
507 | {
508 | "cell_type": "code",
509 | "execution_count": 15,
510 | "metadata": {},
511 | "outputs": [
512 | {
513 | "data": {
552 | "text/plain": [
553 | " B C\n",
554 | "B 1.0 NaN\n",
555 | "C NaN NaN"
556 | ]
557 | },
558 | "execution_count": 15,
559 | "metadata": {},
560 | "output_type": "execute_result"
561 | }
562 | ],
563 | "source": [
564 | "df1.corr()"
565 | ]
566 | },
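{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hedged illustration of the `method` argument, using a made-up frame `demo_corr` (an assumption, not data from this notebook). Pearson measures linear association, while Spearman works on ranks, so a monotonic but non-linear relationship reaches 1 only under Spearman."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical data: y grows with x, but not linearly\n",
"demo_corr = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [1, 4, 9, 16]})\n",
"\n",
"print(demo_corr.corr(method='pearson'))   # the x-y entry is slightly below 1\n",
"print(demo_corr.corr(method='spearman'))  # the x-y entry is 1 (up to rounding), since the ranks agree"
]
},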
567 | {
568 | "cell_type": "markdown",
569 | "metadata": {},
570 | "source": [
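571 | "### 1.2.3 corrwith()\n",
572 | "corr() measures the relationships among a DataFrame's own columns, whereas corrwith() correlates one DataFrame with another object. Before computing, the two objects are aligned on both axes, keeping only the labels they share (an intersection).\n",
573 | "##### `DataFrame.corrwith(other, axis=0, drop=False, method='pearson')`\n",
574 | "- other: another DataFrame or a Series\n",
575 | "- axis: 'index' (0) or 'columns' (1). With 'index', correlations are computed column by column; with 'columns', row by row. The default is 'index'.\n",
576 | "- drop: whether to drop from the result the labels that were filtered out when the two objects were intersected; with False they are kept and reported as NaN\n",
577 | "- method: the correlation method"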
578 | ]
579 | },
580 | {
581 | "cell_type": "code",
582 | "execution_count": 16,
583 | "metadata": {},
584 | "outputs": [
585 | {
586 | "data": {
630 | "text/plain": [
631 | " B C\n",
632 | "a 2 1\n",
633 | "b 5 0\n",
634 | "2 2 3"
635 | ]
636 | },
637 | "execution_count": 16,
638 | "metadata": {},
639 | "output_type": "execute_result"
640 | }
641 | ],
642 | "source": [
643 | "df1 = pd.DataFrame([[2,1],[5,0],[2,3]],index=['a','b',2],columns=['B','C'])\n",
644 | "df1"
645 | ]
646 | },
647 | {
648 | "cell_type": "code",
649 | "execution_count": 17,
650 | "metadata": {
651 | "scrolled": true
652 | },
653 | "outputs": [
654 | {
655 | "data": {
694 | "text/plain": [
695 | " A B\n",
696 | "a 1 2\n",
697 | "b 3 5"
698 | ]
699 | },
700 | "execution_count": 17,
701 | "metadata": {},
702 | "output_type": "execute_result"
703 | }
704 | ],
705 | "source": [
706 | "df"
707 | ]
708 | },
709 | {
710 | "cell_type": "code",
711 | "execution_count": 18,
712 | "metadata": {},
713 | "outputs": [
714 | {
715 | "data": {
716 | "text/plain": [
717 | "B 1.0\n",
718 | "dtype: float64"
719 | ]
720 | },
721 | "execution_count": 18,
722 | "metadata": {},
723 | "output_type": "execute_result"
724 | }
725 | ],
726 | "source": [
727 | "df.corrwith(df1, axis=0, drop=True) # df and df1 are aligned on their shared labels, leaving only column B and rows a, b"
728 | ]
729 | },
730 | {
731 | "cell_type": "code",
732 | "execution_count": 19,
733 | "metadata": {},
734 | "outputs": [
735 | {
736 | "data": {
737 | "text/plain": [
738 | "A NaN\n",
739 | "B 1.0\n",
740 | "C NaN\n",
741 | "dtype: float64"
742 | ]
743 | },
744 | "execution_count": 19,
745 | "metadata": {},
746 | "output_type": "execute_result"
747 | }
748 | ],
749 | "source": [
750 | "df.corrwith(df1, axis=0, drop=False) # columns A and C, filtered out by the intersection, are added back as NaN"
751 | ]
752 | },
753 | {
754 | "cell_type": "code",
755 | "execution_count": null,
756 | "metadata": {},
757 | "outputs": [],
758 | "source": []
759 | },
760 | {
761 | "cell_type": "code",
762 | "execution_count": 20,
763 | "metadata": {},
764 | "outputs": [
765 | {
766 | "data": {
767 | "text/plain": [
768 | "a 3\n",
769 | "b 1\n",
770 | "c 1\n",
771 | "Name: B, dtype: int64"
772 | ]
773 | },
774 | "execution_count": 20,
775 | "metadata": {},
776 | "output_type": "execute_result"
777 | }
778 | ],
779 | "source": [
780 | "# When other is a Series, each column of df is correlated with that Series in turn.\n",
781 | "s = pd.Series([3,1,1], index=['a','b','c'], name='B') \n",
782 | "s"
783 | ]
784 | },
785 | {
786 | "cell_type": "code",
787 | "execution_count": 21,
788 | "metadata": {
789 | "scrolled": false
790 | },
791 | "outputs": [
792 | {
793 | "data": {
832 | "text/plain": [
833 | " A B\n",
834 | "a 1 2\n",
835 | "b 3 5"
836 | ]
837 | },
838 | "execution_count": 21,
839 | "metadata": {},
840 | "output_type": "execute_result"
841 | }
842 | ],
843 | "source": [
844 | "df"
845 | ]
846 | },
847 | {
848 | "cell_type": "code",
849 | "execution_count": 22,
850 | "metadata": {},
851 | "outputs": [
852 | {
853 | "data": {
854 | "text/plain": [
855 | "A -1.0\n",
856 | "B -1.0\n",
857 | "dtype: float64"
858 | ]
859 | },
860 | "execution_count": 22,
861 | "metadata": {},
862 | "output_type": "execute_result"
863 | }
864 | ],
865 | "source": [
866 | "df.corrwith(s, axis=0)"
867 | ]
868 | },
869 | {
870 | "cell_type": "code",
871 | "execution_count": null,
872 | "metadata": {},
873 | "outputs": [],
874 | "source": []
875 | },
876 | {
877 | "cell_type": "markdown",
878 | "metadata": {},
879 | "source": [
880 | "# 2. Statistics for categorical data"
881 | ]
882 | },
883 | {
884 | "cell_type": "markdown",
885 | "metadata": {},
886 | "source": [
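887 | "## 2.1 value_counts()\n",
888 | "Shown here for Series and Index. (DataFrame only gained its own value_counts(), which counts unique rows, in pandas 1.1.)\n",
889 | "##### `Series/Index.value_counts(normalize=False, ascending=False, bins=None)`\n",
890 | "- normalize: True or False; return relative frequencies instead of counts;\n",
891 | "- ascending: True or False; the sort order, descending by default;\n",
892 | "- bins: int; a shortcut for pd.cut that bins the values before counting, which works well for continuous numeric data;"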
893 | ]
894 | },
895 | {
896 | "cell_type": "code",
897 | "execution_count": 23,
898 | "metadata": {
899 | "scrolled": true
900 | },
901 | "outputs": [
902 | {
903 | "data": {
904 | "text/plain": [
905 | "0 1\n",
906 | "1 2\n",
907 | "2 1\n",
908 | "3 2\n",
909 | "4 1\n",
910 | "5 3\n",
911 | "dtype: int64"
912 | ]
913 | },
914 | "execution_count": 23,
915 | "metadata": {},
916 | "output_type": "execute_result"
917 | }
918 | ],
919 | "source": [
920 | "s = pd.Series([1,2,1,2,1,3])\n",
921 | "s"
922 | ]
923 | },
924 | {
925 | "cell_type": "code",
926 | "execution_count": 24,
927 | "metadata": {},
928 | "outputs": [
929 | {
930 | "data": {
931 | "text/plain": [
932 | "1 3\n",
933 | "2 2\n",
934 | "3 1\n",
935 | "dtype: int64"
936 | ]
937 | },
938 | "execution_count": 24,
939 | "metadata": {},
940 | "output_type": "execute_result"
941 | }
942 | ],
943 | "source": [
944 | "s.value_counts()"
945 | ]
946 | },
947 | {
948 | "cell_type": "code",
949 | "execution_count": 25,
950 | "metadata": {},
951 | "outputs": [
952 | {
953 | "data": {
954 | "text/plain": [
955 | "3 1\n",
956 | "2 2\n",
957 | "1 3\n",
958 | "dtype: int64"
959 | ]
960 | },
961 | "execution_count": 25,
962 | "metadata": {},
963 | "output_type": "execute_result"
964 | }
965 | ],
966 | "source": [
967 | "s.value_counts(ascending=True)"
968 | ]
969 | },
970 | {
971 | "cell_type": "code",
972 | "execution_count": 26,
973 | "metadata": {},
974 | "outputs": [
975 | {
976 | "data": {
977 | "text/plain": [
978 | "(0.997, 2.0] 5\n",
979 | "(2.0, 3.0] 1\n",
980 | "dtype: int64"
981 | ]
982 | },
983 | "execution_count": 26,
984 | "metadata": {},
985 | "output_type": "execute_result"
986 | }
987 | ],
988 | "source": [
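989 | "s.value_counts(bins=2) # an int splits the range into equal-width bins, left-open and right-closed; the leftmost edge is pushed slightly outward so that the minimum is included"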
990 | ]
991 | },
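{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `normalize` option is described in the parameter list but has no example of its own. A small sketch, using a separate Series `s_demo` (a name introduced here) with the same values so that `s` is left untouched:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"s_demo = pd.Series([1, 2, 1, 2, 1, 3])\n",
"\n",
"# Relative frequencies instead of counts; the returned values sum to 1\n",
"s_demo.value_counts(normalize=True)"
]
},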
992 | {
993 | "cell_type": "markdown",
994 | "metadata": {},
995 | "source": [
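996 | "## 2.2 count()\n",
997 | "Counts the non-NaN elements in each column or row, which is a quick way to see which features or samples have a lot of missing data.\n",
998 | "##### `DataFrame.count(axis=0)`\n",
999 | "- axis: 0 counts per column, 1 counts per row;"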
1000 | ]
1001 | },
1002 | {
1003 | "cell_type": "code",
1004 | "execution_count": 27,
1005 | "metadata": {},
1006 | "outputs": [
1007 | {
1008 | "data": {
1047 | "text/plain": [
1048 | " A B\n",
1049 | "a 1 2\n",
1050 | "b 3 5"
1051 | ]
1052 | },
1053 | "execution_count": 27,
1054 | "metadata": {},
1055 | "output_type": "execute_result"
1056 | }
1057 | ],
1058 | "source": [
1059 | "df"
1060 | ]
1061 | },
1062 | {
1063 | "cell_type": "code",
1064 | "execution_count": 28,
1065 | "metadata": {
1066 | "scrolled": false
1067 | },
1068 | "outputs": [
1069 | {
1070 | "data": {
1071 | "text/plain": [
1072 | "A 2\n",
1073 | "B 2\n",
1074 | "dtype: int64"
1075 | ]
1076 | },
1077 | "execution_count": 28,
1078 | "metadata": {},
1079 | "output_type": "execute_result"
1080 | }
1081 | ],
1082 | "source": [
1083 | "df.count(axis=0)"
1084 | ]
1085 | },
1086 | {
1087 | "cell_type": "code",
1088 | "execution_count": 29,
1089 | "metadata": {},
1090 | "outputs": [
1091 | {
1092 | "data": {
1093 | "text/plain": [
1094 | "a 2\n",
1095 | "b 2\n",
1096 | "dtype: int64"
1097 | ]
1098 | },
1099 | "execution_count": 29,
1100 | "metadata": {},
1101 | "output_type": "execute_result"
1102 | }
1103 | ],
1104 | "source": [
1105 | "df.count(axis=1)"
1106 | ]
1107 | },
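{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since `df` has no missing values, every count above equals the full length. A sketch with a made-up frame `demo_na` (an assumption for illustration) shows how count() reacts to NaN:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical data with missing entries in both columns\n",
"demo_na = pd.DataFrame({'A': [1, np.nan, 3], 'B': [np.nan, np.nan, 6]})\n",
"\n",
"print(demo_na.count(axis=0))  # per column: A has 2 valid values, B has 1\n",
"print(demo_na.count(axis=1))  # per row: 1, 0, and 2 valid values"
]
},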
1108 | {
1109 | "cell_type": "code",
1110 | "execution_count": null,
1111 | "metadata": {},
1112 | "outputs": [],
1113 | "source": []
1114 | }
1115 | ],
1116 | "metadata": {
1117 | "kernelspec": {
1118 | "display_name": "Python 3",
1119 | "language": "python",
1120 | "name": "python3"
1121 | },
1122 | "language_info": {
1123 | "codemirror_mode": {
1124 | "name": "ipython",
1125 | "version": 3
1126 | },
1127 | "file_extension": ".py",
1128 | "mimetype": "text/x-python",
1129 | "name": "python",
1130 | "nbconvert_exporter": "python",
1131 | "pygments_lexer": "ipython3",
1132 | "version": "3.7.0"
1133 | },
1134 | "toc": {
1135 | "base_numbering": 1,
1136 | "nav_menu": {
1137 | "height": "67px",
1138 | "width": "253px"
1139 | },
1140 | "number_sections": false,
1141 | "sideBar": true,
1142 | "skip_h1_title": false,
1143 | "title_cell": "Table of Contents",
1144 | "title_sidebar": "Contents",
1145 | "toc_cell": false,
1146 | "toc_position": {
1147 | "height": "600px",
1148 | "left": "0px",
1149 | "right": "1190.23px",
1150 | "top": "67px",
1151 | "width": "232px"
1152 | },
1153 | "toc_section_display": "block",
1154 | "toc_window_display": true
1155 | }
1156 | },
1157 | "nbformat": 4,
1158 | "nbformat_minor": 2
1159 | }
1160 |
--------------------------------------------------------------------------------
/12. Category型与离散化.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
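7 | "# Categorical dtype and discretization\n",
8 | "The categorical type is very commonly used. It has the following characteristics:\n",
9 | "1. It takes only a fixed set of values;\n",
10 | "2. An order can be defined on it, and that order may differ from both numeric order and lexicographic order;\n",
11 | "3. Even when the categories are written as numbers, arithmetic on them may be meaningless, so it is not necessarily the same as a discrete numeric type."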
12 | ]
13 | },
14 | {
15 | "cell_type": "code",
16 | "execution_count": 1,
17 | "metadata": {},
18 | "outputs": [],
19 | "source": [
20 | "__author__ = 'zhenhang.sun@gmail.com'"
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": 2,
26 | "metadata": {},
27 | "outputs": [
28 | {
29 | "data": {
30 | "text/plain": [
31 | "'D:\\\\github\\\\pandas-tutorial'"
32 | ]
33 | },
34 | "execution_count": 2,
35 | "metadata": {},
36 | "output_type": "execute_result"
37 | }
38 | ],
39 | "source": [
40 | "pwd"
41 | ]
42 | },
43 | {
44 | "cell_type": "code",
45 | "execution_count": 3,
46 | "metadata": {},
47 | "outputs": [],
48 | "source": [
49 | "import numpy as np\n",
50 | "import pandas as pd"
51 | ]
52 | },
53 | {
54 | "cell_type": "code",
55 | "execution_count": null,
56 | "metadata": {},
57 | "outputs": [],
58 | "source": []
59 | },
60 | {
61 | "cell_type": "markdown",
62 | "metadata": {},
63 | "source": [
64 | "# 1. Creation"
65 | ]
66 | },
67 | {
68 | "cell_type": "markdown",
69 | "metadata": {},
70 | "source": [
71 | "## 1.1 Creating a Categorical object"
72 | ]
73 | },
74 | {
75 | "cell_type": "markdown",
76 | "metadata": {},
77 | "source": [
78 | "##### `pd.Categorical(values, categories=None, ordered=False)`\n",
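79 | "- values: the sequence of values;\n",
80 | "- categories: a user-defined sequence of categories;\n",
81 | "- ordered: whether the categories are ordered. If True, the order follows `categories`, or the sorted unique values when `categories` is not given."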
82 | ]
83 | },
84 | {
85 | "cell_type": "code",
86 | "execution_count": 4,
87 | "metadata": {},
88 | "outputs": [
89 | {
90 | "data": {
91 | "text/plain": [
92 | "[2, 1, 1, 3]\n",
93 | "Categories (3, int64): [1 < 2 < 3]"
94 | ]
95 | },
96 | "execution_count": 4,
97 | "metadata": {},
98 | "output_type": "execute_result"
99 | }
100 | ],
101 | "source": [
102 | "c = pd.Categorical([2,1,1,3], ordered=True ) \n",
103 | "c\n",
104 | "# Without categories, the unique values of values become the categories\n",
105 | "# With ordered=True, the order is the ascending sorted order of those values"
106 | ]
107 | },
108 | {
109 | "cell_type": "code",
110 | "execution_count": 5,
111 | "metadata": {},
112 | "outputs": [
113 | {
114 | "data": {
115 | "text/plain": [
116 | "[NaN, 2.0, 3.0]\n",
117 | "Categories (2, int64): [3 < 2]"
118 | ]
119 | },
120 | "execution_count": 5,
121 | "metadata": {},
122 | "output_type": "execute_result"
123 | }
124 | ],
125 | "source": [
126 | "c = pd.Categorical([1,2,3], categories=[3,2], ordered=True )\n",
127 | "c\n",
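128 | "# With categories given (duplicates raise an error), any value that is not in categories is replaced by NaN\n",
129 | "# With ordered=True, the order is the order in which categories is listed, from lowest to highest"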
130 | ]
131 | },
132 | {
133 | "cell_type": "markdown",
134 | "metadata": {},
135 | "source": [
136 | "**Two important attributes of a Categorical**"
137 | ]
138 | },
139 | {
140 | "cell_type": "code",
141 | "execution_count": 6,
142 | "metadata": {},
143 | "outputs": [
144 | {
145 | "data": {
146 | "text/plain": [
147 | "Int64Index([3, 2], dtype='int64')"
148 | ]
149 | },
150 | "execution_count": 6,
151 | "metadata": {},
152 | "output_type": "execute_result"
153 | }
154 | ],
155 | "source": [
156 | "c.categories # the categories"
157 | ]
158 | },
159 | {
160 | "cell_type": "code",
161 | "execution_count": 7,
162 | "metadata": {
163 | "scrolled": true
164 | },
165 | "outputs": [
166 | {
167 | "data": {
168 | "text/plain": [
169 | "True"
170 | ]
171 | },
172 | "execution_count": 7,
173 | "metadata": {},
174 | "output_type": "execute_result"
175 | }
176 | ],
177 | "source": [
178 | "c.ordered # whether the categories are ordered"
179 | ]
180 | },
181 | {
182 | "cell_type": "markdown",
183 | "metadata": {},
184 | "source": [
185 | "## 1.2 Converting to the category dtype"
186 | ]
187 | },
188 | {
189 | "cell_type": "code",
190 | "execution_count": 8,
191 | "metadata": {},
192 | "outputs": [
193 | {
194 | "data": {
195 | "text/plain": [
196 | "0 2\n",
197 | "1 1\n",
198 | "2 1\n",
199 | "3 3\n",
200 | "dtype: int64"
201 | ]
202 | },
203 | "execution_count": 8,
204 | "metadata": {},
205 | "output_type": "execute_result"
206 | }
207 | ],
208 | "source": [
209 | "s = pd.Series([2,1,1,3])\n",
210 | "s"
211 | ]
212 | },
213 | {
214 | "cell_type": "code",
215 | "execution_count": 9,
216 | "metadata": {},
217 | "outputs": [
218 | {
219 | "data": {
220 | "text/plain": [
221 | "0 2\n",
222 | "1 1\n",
223 | "2 1\n",
224 | "3 3\n",
225 | "dtype: category\n",
226 | "Categories (3, int64): [1, 2, 3]"
227 | ]
228 | },
229 | "execution_count": 9,
230 | "metadata": {},
231 | "output_type": "execute_result"
232 | }
233 | ],
234 | "source": [
235 | "s = s.astype('category') \n",
236 | "s # the dtype has changed to category"
237 | ]
238 | },
239 | {
240 | "cell_type": "markdown",
241 | "metadata": {},
242 | "source": [
243 | "**For a Series, the categorical attributes are accessed through .cat**"
244 | ]
245 | },
246 | {
247 | "cell_type": "code",
248 | "execution_count": 10,
249 | "metadata": {
250 | "scrolled": true
251 | },
252 | "outputs": [
253 | {
254 | "data": {
255 | "text/plain": [
256 | "Int64Index([1, 2, 3], dtype='int64')"
257 | ]
258 | },
259 | "execution_count": 10,
260 | "metadata": {},
261 | "output_type": "execute_result"
262 | }
263 | ],
264 | "source": [
265 | "s.cat.categories"
266 | ]
267 | },
268 | {
269 | "cell_type": "code",
270 | "execution_count": 11,
271 | "metadata": {},
272 | "outputs": [
273 | {
274 | "data": {
275 | "text/plain": [
276 | "False"
277 | ]
278 | },
279 | "execution_count": 11,
280 | "metadata": {},
281 | "output_type": "execute_result"
282 | }
283 | ],
284 | "source": [
285 | "s.cat.ordered"
286 | ]
287 | },
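{
"cell_type": "markdown",
"metadata": {},
"source": [
"astype('category') produces an unordered result with inferred categories, as the output above shows. As a sketch of one way to choose the categories and their order during conversion, a `CategoricalDtype` can be passed to astype(); the Series `sizes` and its labels are made up for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from pandas.api.types import CategoricalDtype\n",
"\n",
"# Hypothetical string labels\n",
"sizes = pd.Series(['mid', 'low', 'high', 'low'])\n",
"\n",
"# Explicit categories, listed from lowest to highest\n",
"size_dtype = CategoricalDtype(categories=['low', 'mid', 'high'], ordered=True)\n",
"sizes = sizes.astype(size_dtype)\n",
"\n",
"print(sizes.cat.ordered)  # True\n",
"print(sizes.min())        # 'low': the declared order is used, not alphabetical order"
]
},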
288 | {
289 | "cell_type": "code",
290 | "execution_count": null,
291 | "metadata": {},
292 | "outputs": [],
293 | "source": []
294 | },
295 | {
296 | "cell_type": "markdown",
297 | "metadata": {},
298 | "source": [
299 | "# 2. Querying, modifying, adding, and removing"
300 | ]
301 | },
302 | {
303 | "cell_type": "markdown",
304 | "metadata": {},
305 | "source": [
306 | "## 2.1 Querying"
307 | ]
308 | },
309 | {
310 | "cell_type": "markdown",
311 | "metadata": {},
312 | "source": [
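313 | "##### Four ways to index with []\n",
314 | "A Categorical is sequence-like, so it can be indexed with [] using a single position, a slice, a list of positions, or a boolean mask. Neither .loc[] nor .iloc[] is supported."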
315 | ]
316 | },
317 | {
318 | "cell_type": "code",
319 | "execution_count": 12,
320 | "metadata": {},
321 | "outputs": [
322 | {
323 | "data": {
324 | "text/plain": [
325 | "nan"
326 | ]
327 | },
328 | "execution_count": 12,
329 | "metadata": {},
330 | "output_type": "execute_result"
331 | }
332 | ],
333 | "source": [
334 | "c[0]"
335 | ]
336 | },
337 | {
338 | "cell_type": "code",
339 | "execution_count": 13,
340 | "metadata": {},
341 | "outputs": [
342 | {
343 | "data": {
344 | "text/plain": [
345 | "[NaN, 2.0]\n",
346 | "Categories (2, int64): [3 < 2]"
347 | ]
348 | },
349 | "execution_count": 13,
350 | "metadata": {},
351 | "output_type": "execute_result"
352 | }
353 | ],
354 | "source": [
355 | "c[0:2]"
356 | ]
357 | },
358 | {
359 | "cell_type": "code",
360 | "execution_count": 14,
361 | "metadata": {},
362 | "outputs": [
363 | {
364 | "data": {
365 | "text/plain": [
366 | "[NaN, 3.0]\n",
367 | "Categories (2, int64): [3 < 2]"
368 | ]
369 | },
370 | "execution_count": 14,
371 | "metadata": {},
372 | "output_type": "execute_result"
373 | }
374 | ],
375 | "source": [
376 | "c[[0,2]]"
377 | ]
378 | },
379 | {
380 | "cell_type": "code",
381 | "execution_count": 15,
382 | "metadata": {},
383 | "outputs": [
384 | {
385 | "data": {
386 | "text/plain": [
387 | "[NaN, 3.0]\n",
388 | "Categories (2, int64): [3 < 2]"
389 | ]
390 | },
391 | "execution_count": 15,
392 | "metadata": {},
393 | "output_type": "execute_result"
394 | }
395 | ],
396 | "source": [
397 | "mask = [True, False, True]\n",
398 | "c[mask]"
399 | ]
400 | },
401 | {
402 | "cell_type": "code",
403 | "execution_count": null,
404 | "metadata": {},
405 | "outputs": [],
406 | "source": []
407 | },
408 | {
409 | "cell_type": "markdown",
410 | "metadata": {},
411 | "source": [
412 | "## 2.2 Modifying"
413 | ]
414 | },
415 | {
416 | "cell_type": "markdown",
417 | "metadata": {},
418 | "source": [
419 | "### 2.2.1 Renaming categories\n",
420 | "This is used fairly often, for example to map string categories to numeric ones."
421 | ]
422 | },
423 | {
424 | "cell_type": "markdown",
425 | "metadata": {},
426 | "source": [
427 | "**Direct assignment**"
428 | ]
429 | },
430 | {
431 | "cell_type": "code",
432 | "execution_count": 16,
433 | "metadata": {},
434 | "outputs": [],
435 | "source": [
436 | "c1 = c.copy()"
437 | ]
438 | },
439 | {
440 | "cell_type": "code",
441 | "execution_count": 17,
442 | "metadata": {},
443 | "outputs": [
444 | {
445 | "data": {
446 | "text/plain": [
447 | "[NaN, 6, 5]\n",
448 | "Categories (2, object): [5 < 6]"
449 | ]
450 | },
451 | "execution_count": 17,
452 | "metadata": {},
453 | "output_type": "execute_result"
454 | }
455 | ],
456 | "source": [
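457 | "c1.categories = ['5','6'] # the new list must have the same length as the old one; each old category, and every value that uses it, is replaced position by position (newer pandas releases deprecate this setter in favor of rename_categories())\n",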
458 | "c1"
459 | ]
460 | },
461 | {
462 | "cell_type": "code",
463 | "execution_count": 18,
464 | "metadata": {},
465 | "outputs": [
466 | {
467 | "data": {
468 | "text/plain": [
469 | "0 2\n",
470 | "1 1\n",
471 | "2 1\n",
472 | "3 3\n",
473 | "dtype: category\n",
474 | "Categories (3, int64): [1, 2, 3]"
475 | ]
476 | },
477 | "execution_count": 18,
478 | "metadata": {},
479 | "output_type": "execute_result"
480 | }
481 | ],
482 | "source": [
483 | "s1= s.copy()\n",
484 | "s1"
485 | ]
486 | },
487 | {
488 | "cell_type": "code",
489 | "execution_count": 19,
490 | "metadata": {},
491 | "outputs": [
492 | {
493 | "data": {
494 | "text/plain": [
495 | "0 5\n",
496 | "1 6\n",
497 | "2 6\n",
498 | "3 7\n",
499 | "dtype: category\n",
500 | "Categories (3, int64): [6, 5, 7]"
501 | ]
502 | },
503 | "execution_count": 19,
504 | "metadata": {},
505 | "output_type": "execute_result"
506 | }
507 | ],
508 | "source": [
509 | "s1.cat.categories = [6,5,7]\n",
510 | "s1 # for a Series the change is made the same way, through .cat"
511 | ]
512 | },
513 | {
514 | "cell_type": "markdown",
515 | "metadata": {},
516 | "source": [
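517 | "##### Using a method\n",
518 | "##### `Categorical.rename_categories(new_categories, inplace=False)`\n",
519 | "- new_categories: the new categories; a list must have the same length as the old categories;\n",
520 | "- inplace: True or False, whether to modify the object in place. (pandas 1.3 deprecated this argument and 2.0 removed it; assign the returned object instead.)"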
521 | ]
522 | },
523 | {
524 | "cell_type": "code",
525 | "execution_count": 20,
526 | "metadata": {},
527 | "outputs": [
528 | {
529 | "data": {
530 | "text/plain": [
531 | "[NaN, 6, 5]\n",
532 | "Categories (2, object): [5 < 6]"
533 | ]
534 | },
535 | "execution_count": 20,
536 | "metadata": {},
537 | "output_type": "execute_result"
538 | }
539 | ],
540 | "source": [
541 | "c1 = c.copy()\n",
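542 | "c1 = c1.rename_categories(['5','6']) # same result as the direct assignment above; the result is reassigned because newer pandas versions removed the inplace argument\n",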
543 | "c1 "
544 | ]
545 | },
546 | {
547 | "cell_type": "code",
548 | "execution_count": 21,
549 | "metadata": {},
550 | "outputs": [
551 | {
552 | "data": {
553 | "text/plain": [
554 | "0 5\n",
555 | "1 6\n",
556 | "2 6\n",
557 | "3 7\n",
558 | "dtype: category\n",
559 | "Categories (3, object): [6, 5, 7]"
560 | ]
561 | },
562 | "execution_count": 21,
563 | "metadata": {},
564 | "output_type": "execute_result"
565 | }
566 | ],
567 | "source": [
568 | "s1 = s.copy()\n",
569 | "s1 = s1.cat.rename_categories(['6','5','7']) # same renaming as above, here with string labels\n",
570 | "s1"
571 | ]
572 | },
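{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the string-to-number mapping mentioned at the start of this section, rename_categories() also accepts a dict from old to new categories. The Categorical `levels` and its labels below are hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical string categories\n",
"levels = pd.Categorical(['low', 'high', 'low'])\n",
"\n",
"# Map each string category to a number; a new object is returned and levels is unchanged\n",
"codes = levels.rename_categories({'low': 0, 'high': 1})\n",
"codes"
]
},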
573 | {
574 | "cell_type": "markdown",
575 | "metadata": {},
576 | "source": [
577 | "### 2.2.2 Switching between ordered and unordered"
578 | ]
579 | },
580 | {
581 | "cell_type": "markdown",
582 | "metadata": {},
583 | "source": [
584 | "##### `Categorical.as_ordered(inplace=False)`\n",
585 | "##### `Categorical.as_unordered(inplace=False)`\n",
586 | "- inplace: True or False, whether to modify the object in place."
587 | ]
588 | },
589 | {
590 | "cell_type": "code",
591 | "execution_count": 22,
592 | "metadata": {},
593 | "outputs": [
594 | {
595 | "data": {
596 | "text/plain": [
597 | "[NaN, 2.0, 3.0]\n",
598 | "Categories (2, int64): [3 < 2]"
599 | ]
600 | },
601 | "execution_count": 22,
602 | "metadata": {},
603 | "output_type": "execute_result"
604 | }
605 | ],
606 | "source": [
607 | "c1 = c.copy()\n",
608 | "c1"
609 | ]
610 | },
611 | {
612 | "cell_type": "code",
613 | "execution_count": 23,
614 | "metadata": {},
615 | "outputs": [
616 | {
617 | "data": {
618 | "text/plain": [
619 | "[NaN, 2.0, 3.0]\n",
620 | "Categories (2, int64): [3, 2]"
621 | ]
622 | },
623 | "execution_count": 23,
624 | "metadata": {},
625 | "output_type": "execute_result"
626 | }
627 | ],
628 | "source": [
629 | "c1 = c1.as_unordered() # reassign the result; newer pandas versions removed the inplace argument\n",
630 | "c1"
631 | ]
632 | },
633 | {
634 | "cell_type": "code",
635 | "execution_count": 24,
636 | "metadata": {},
637 | "outputs": [
638 | {
639 | "data": {
640 | "text/plain": [
641 | "[NaN, 2.0, 3.0]\n",
642 | "Categories (2, int64): [3 < 2]"
643 | ]
644 | },
645 | "execution_count": 24,
646 | "metadata": {},
647 | "output_type": "execute_result"
648 | }
649 | ],
650 | "source": [
651 | "c1.as_ordered()"
652 | ]
653 | },
654 | {
655 | "cell_type": "markdown",
656 | "metadata": {},
657 | "source": [
658 | "### 2.2.3 Reordering categories"
659 | ]
660 | },
661 | {
662 | "cell_type": "markdown",
663 | "metadata": {},
664 | "source": [
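665 | "##### `Categorical.reorder_categories(new_categories, ordered=None, inplace=False)`\n",
666 | "- new_categories: must be the old categories in a new order; categories cannot be added or removed;\n",
667 | "- ordered: True or False, whether the categories are ordered; if omitted, the current setting is kept;\n",
668 | "- inplace: True or False, whether to modify the object in place."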
669 | ]
670 | },
671 | {
672 | "cell_type": "code",
673 | "execution_count": 25,
674 | "metadata": {},
675 | "outputs": [
676 | {
677 | "data": {
678 | "text/plain": [
679 | "[NaN, 2.0, 3.0]\n",
680 | "Categories (2, int64): [3 < 2]"
681 | ]
682 | },
683 | "execution_count": 25,
684 | "metadata": {},
685 | "output_type": "execute_result"
686 | }
687 | ],
688 | "source": [
689 | "c1 = c.copy()\n",
690 | "c1"
691 | ]
692 | },
693 | {
694 | "cell_type": "code",
695 | "execution_count": 26,
696 | "metadata": {},
697 | "outputs": [
698 | {
699 | "data": {
700 | "text/plain": [
701 | "[NaN, 2.0, 3.0]\n",
702 | "Categories (2, int64): [2 < 3]"
703 | ]
704 | },
705 | "execution_count": 26,
706 | "metadata": {},
707 | "output_type": "execute_result"
708 | }
709 | ],
710 | "source": [
711 | "c1 = c1.reorder_categories([2,3], ordered=True) # reassign the result; newer pandas versions removed the inplace argument\n",
712 | "c1"
713 | ]
714 | },
715 | {
716 | "cell_type": "markdown",
717 | "metadata": {},
718 | "source": [
719 | "## 2.3 Adding"
720 | ]
721 | },
722 | {
723 | "cell_type": "markdown",
724 | "metadata": {},
725 | "source": [
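726 | "##### `Categorical.add_categories(new_categories, inplace=False)`\n",
727 | "- new_categories: the categories to add; none of them may already be present;\n",
728 | "- inplace: True or False, whether to modify the object in place."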
729 | ]
730 | },
731 | {
732 | "cell_type": "code",
733 | "execution_count": 27,
734 | "metadata": {},
735 | "outputs": [
736 | {
737 | "data": {
738 | "text/plain": [
739 | "[NaN, 2.0, 3.0]\n",
740 | "Categories (2, int64): [3 < 2]"
741 | ]
742 | },
743 | "execution_count": 27,
744 | "metadata": {},
745 | "output_type": "execute_result"
746 | }
747 | ],
748 | "source": [
749 | "c1 = c.copy()\n",
750 | "c1"
751 | ]
752 | },
753 | {
754 | "cell_type": "code",
755 | "execution_count": 28,
756 | "metadata": {},
757 | "outputs": [
758 | {
759 | "data": {
760 | "text/plain": [
761 | "[NaN, 2.0, 3.0]\n",
762 | "Categories (4, int64): [3 < 2 < 4 < 5]"
763 | ]
764 | },
765 | "execution_count": 28,
766 | "metadata": {},
767 | "output_type": "execute_result"
768 | }
769 | ],
770 | "source": [
771 | "c1 = c1.add_categories([4,5]) # reassign the result; newer pandas versions removed the inplace argument\n",
772 | "c1"
773 | ]
774 | },
775 | {
776 | "cell_type": "markdown",
777 | "metadata": {},
778 | "source": [
779 | "## 2.4 Removing"
780 | ]
781 | },
782 | {
783 | "cell_type": "markdown",
784 | "metadata": {},
785 | "source": [
786 | "### 2.4.1 Removing specific categories"
787 | ]
788 | },
789 | {
790 | "cell_type": "markdown",
791 | "metadata": {},
792 | "source": [
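793 | "##### `Categorical.remove_categories(removals, inplace=False)`\n",
794 | "- removals: the categories to remove; each must be an existing category, and values that belonged to it become NaN;\n",
795 | "- inplace: True or False, whether to modify the object in place."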
796 | ]
797 | },
798 | {
799 | "cell_type": "code",
800 | "execution_count": 29,
801 | "metadata": {},
802 | "outputs": [
803 | {
804 | "data": {
805 | "text/plain": [
806 | "[NaN, 2.0, 3.0]\n",
807 | "Categories (3, int64): [3 < 2 < 5]"
808 | ]
809 | },
810 | "execution_count": 29,
811 | "metadata": {},
812 | "output_type": "execute_result"
813 | }
814 | ],
815 | "source": [
816 | "c1 = c1.remove_categories([4]) # reassign the result; newer pandas versions removed the inplace argument\n",
817 | "c1"
818 | ]
819 | },
820 | {
821 | "cell_type": "markdown",
822 | "metadata": {},
823 | "source": [
824 | "### 2.4.2 Removing unused categories"
825 | ]
826 | },
827 | {
828 | "cell_type": "markdown",
829 | "metadata": {},
830 | "source": [
831 | "##### `Categorical.remove_unused_categories(inplace=False)`\n",
832 | "- inplace: True or False, whether to modify the object in place."
833 | ]
834 | },
835 | {
836 | "cell_type": "code",
837 | "execution_count": 30,
838 | "metadata": {},
839 | "outputs": [
840 | {
841 | "data": {
842 | "text/plain": [
843 | "[NaN, 2.0, 3.0]\n",
844 | "Categories (2, int64): [3 < 2]"
845 | ]
846 | },
847 | "execution_count": 30,
848 | "metadata": {},
849 | "output_type": "execute_result"
850 | }
851 | ],
852 | "source": [
853 | "c1 = c1.remove_unused_categories() # category 5 is removed\n",
854 | "c1"
855 | ]
856 | },
857 | {
858 | "cell_type": "markdown",
859 | "metadata": {},
860 | "source": [
861 | "## 2.5 Modify, add, and remove in one call"
862 | ]
863 | },
864 | {
865 | "cell_type": "markdown",
866 | "metadata": {},
867 | "source": [
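868 | "##### `Categorical.set_categories(new_categories, ordered=None, rename=False, inplace=False)`\n",
869 | "- new_categories: the new categories; unlike reorder_categories, categories may be added, removed, or reordered, and values whose category is removed become NaN;\n",
870 | "- ordered: True or False, whether the result is ordered; if omitted, the current setting is kept, but it is best to state it explicitly;\n",
871 | "- rename: True or False, whether new_categories should be treated as a rename of the old categories; I have not found much use for this parameter (?);\n",
872 | "- inplace: True or False, whether to modify the object in place."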
873 | ]
874 | },
875 | {
876 | "cell_type": "code",
877 | "execution_count": 31,
878 | "metadata": {},
879 | "outputs": [
880 | {
881 | "data": {
882 | "text/plain": [
883 | "[NaN, 2.0, 3.0]\n",
884 | "Categories (2, int64): [3 < 2]"
885 | ]
886 | },
887 | "execution_count": 31,
888 | "metadata": {},
889 | "output_type": "execute_result"
890 | }
891 | ],
892 | "source": [
893 | "c1 = c.copy()\n",
894 | "c1"
895 | ]
896 | },
897 | {
898 | "cell_type": "code",
899 | "execution_count": 32,
900 | "metadata": {},
901 | "outputs": [
902 | {
903 | "data": {
904 | "text/plain": [
905 | "[NaN, 2.0, NaN]\n",
906 | "Categories (3, int64): [2 < 4 < 5]"
907 | ]
908 | },
909 | "execution_count": 32,
910 | "metadata": {},
911 | "output_type": "execute_result"
912 | }
913 | ],
914 | "source": [
915 | "c1.set_categories([2,4,5], ordered=True, inplace=True) # drops old category 3 (its value becomes NaN) and adds new categories 4 and 5\n",
916 | "c1"
917 | ]
918 | },
919 | {
920 | "cell_type": "code",
921 | "execution_count": null,
922 | "metadata": {},
923 | "outputs": [],
924 | "source": []
925 | },
926 | {
927 | "cell_type": "markdown",
928 | "metadata": {},
929 | "source": [
930 | "# 3. cut() and qcut()\n",
931 | "These two functions bin a continuous variable into a categorical variable."
932 | ]
933 | },
934 | {
935 | "cell_type": "markdown",
936 | "metadata": {},
937 | "source": [
938 | "## 3.1 cut()"
939 | ]
940 | },
941 | {
942 | "cell_type": "markdown",
943 | "metadata": {},
944 | "source": [
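945 | "##### `pd.cut(x, bins, right=True, include_lowest=False, labels=None, retbins=False)`\n",
946 | "- x: the Series or sequence to bin;\n",
947 | "- bins: if an int, the range of x is split into that many equal-width bins, and the range is extended by 0.1% so that the minimum and maximum are included; if a sequence, its values are used as the bin edges;\n",
948 | "- right: True or False, whether each interval is closed on the right; the default True gives left-open, right-closed intervals, and False gives left-closed, right-open ones;\n",
949 | "- include_lowest: True or False, whether the first interval also includes its left edge (displayed as a left edge nudged slightly below the lowest bin value);\n",
950 | "- labels: the bins are intervals by default; labels replaces them with the category names you want;\n",
951 | "- retbins: whether to also return the bin edges;"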
952 | ]
953 | },
954 | {
955 | "cell_type": "code",
956 | "execution_count": 33,
957 | "metadata": {},
958 | "outputs": [
959 | {
960 | "data": {
961 | "text/plain": [
962 | "0 0\n",
963 | "1 1\n",
964 | "2 2\n",
965 | "3 3\n",
966 | "4 4\n",
967 | "dtype: int64"
968 | ]
969 | },
970 | "execution_count": 33,
971 | "metadata": {},
972 | "output_type": "execute_result"
973 | }
974 | ],
975 | "source": [
976 | "s = pd.Series(range(0,5))\n",
977 | "s"
978 | ]
979 | },
980 | {
981 | "cell_type": "code",
982 | "execution_count": 34,
983 | "metadata": {},
984 | "outputs": [
985 | {
986 | "data": {
987 | "text/plain": [
988 | "0 (-0.004, 1.333]\n",
989 | "1 (-0.004, 1.333]\n",
990 | "2 (1.333, 2.667]\n",
991 | "3 (2.667, 4.0]\n",
992 | "4 (2.667, 4.0]\n",
993 | "dtype: category\n",
994 | "Categories (3, interval[float64]): [(-0.004, 1.333] < (1.333, 2.667] < (2.667, 4.0]]"
995 | ]
996 | },
997 | "execution_count": 34,
998 | "metadata": {},
999 | "output_type": "execute_result"
1000 | }
1001 | ],
1002 | "source": [
1003 | "pd.cut(s, 3) # three categories in total, each an interval of the form (a, b]"
1004 | ]
1005 | },
1006 | {
1007 | "cell_type": "code",
1008 | "execution_count": 35,
1009 | "metadata": {},
1010 | "outputs": [
1011 | {
1012 | "data": {
1013 | "text/plain": [
1014 | "0 a\n",
1015 | "1 a\n",
1016 | "2 b\n",
1017 | "3 c\n",
1018 | "4 c\n",
1019 | "dtype: category\n",
1020 | "Categories (3, object): [a < b < c]"
1021 | ]
1022 | },
1023 | "execution_count": 35,
1024 | "metadata": {},
1025 | "output_type": "execute_result"
1026 | }
1027 | ],
1028 | "source": [
1029 | "pd.cut(s, 3, labels=['a','b','c']) # much clearer with labels"
1030 | ]
1031 | },
1032 | {
1033 | "cell_type": "code",
1034 | "execution_count": 36,
1035 | "metadata": {},
1036 | "outputs": [
1037 | {
1038 | "data": {
1039 | "text/plain": [
1040 | "(0 a\n",
1041 | " 1 a\n",
1042 | " 2 b\n",
1043 | " 3 c\n",
1044 | " 4 c\n",
1045 | " dtype: category\n",
1046 | " Categories (3, object): [a < b < c],\n",
1047 | " array([-0.004 , 1.33333333, 2.66666667, 4. ]))"
1048 | ]
1049 | },
1050 | "execution_count": 36,
1051 | "metadata": {},
1052 | "output_type": "execute_result"
1053 | }
1054 | ],
1055 | "source": [
1056 | "pd.cut( s, 3, labels = ['a','b','c'], retbins=True) # also returns the bin edges"
1057 | ]
1058 | },
1059 | {
1060 | "cell_type": "code",
1061 | "execution_count": 37,
1062 | "metadata": {},
1063 | "outputs": [
1064 | {
1065 | "data": {
1066 | "text/plain": [
1067 | "0 [0.0, 2.5)\n",
1068 | "1 [0.0, 2.5)\n",
1069 | "2 [0.0, 2.5)\n",
1070 | "3 [2.5, 4.0)\n",
1071 | "4 NaN\n",
1072 | "dtype: category\n",
1073 | "Categories (2, interval[float64]): [[0.0, 2.5) < [2.5, 4.0)]"
1074 | ]
1075 | },
1076 | "execution_count": 37,
1077 | "metadata": {},
1078 | "output_type": "execute_result"
1079 | }
1080 | ],
1081 | "source": [
1082 | "pd.cut(s,[0,2.5,4], right=False) # left-closed, right-open: 4 is excluded, so it belongs to no category"
1083 | ]
1084 | },
1085 | {
1086 | "cell_type": "code",
1087 | "execution_count": 38,
1088 | "metadata": {},
1089 | "outputs": [
1090 | {
1091 | "data": {
1092 | "text/plain": [
1093 | "0 NaN\n",
1094 | "1 (0.0, 2.5]\n",
1095 | "2 (0.0, 2.5]\n",
1096 | "3 (2.5, 4.0]\n",
1097 | "4 (2.5, 4.0]\n",
1098 | "dtype: category\n",
1099 | "Categories (2, interval[float64]): [(0.0, 2.5] < (2.5, 4.0]]"
1100 | ]
1101 | },
1102 | "execution_count": 38,
1103 | "metadata": {},
1104 | "output_type": "execute_result"
1105 | }
1106 | ],
1107 | "source": [
1108 | "pd.cut(s, [0,2.5,4], right=True) # left-open, right-closed: 0 is excluded, so it belongs to no category"
1109 | ]
1110 | },
1111 | {
1112 | "cell_type": "code",
1113 | "execution_count": 39,
1114 | "metadata": {},
1115 | "outputs": [
1116 | {
1117 | "data": {
1118 | "text/plain": [
1119 | "0 (-0.001, 2.5]\n",
1120 | "1 (-0.001, 2.5]\n",
1121 | "2 (-0.001, 2.5]\n",
1122 | "3 (2.5, 4.0]\n",
1123 | "4 (2.5, 4.0]\n",
1124 | "dtype: category\n",
1125 | "Categories (2, interval[float64]): [(-0.001, 2.5] < (2.5, 4.0]]"
1126 | ]
1127 | },
1128 | "execution_count": 39,
1129 | "metadata": {},
1130 | "output_type": "execute_result"
1131 | }
1132 | ],
1133 | "source": [
1134 | "pd.cut(s, [0,2.5,4], right=True, include_lowest=True) # the lowest value is now included"
1135 | ]
1136 | },
1137 | {
1138 | "cell_type": "markdown",
1139 | "metadata": {},
1140 | "source": [
1141 | "## 3.2 qcut()"
1142 | ]
1143 | },
1144 | {
1145 | "cell_type": "markdown",
1146 | "metadata": {},
1147 | "source": [
1148 | "##### `pd.qcut(x, q, labels=None, retbins=False)`\n",
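1149 | "- x: the Series or sequence to bin;\n",
1150 | "- q: the bin edges are defined by quantiles rather than by given values; pass an int for the number of quantiles or a sequence of quantile points;\n",
1151 | "- labels: the bins are intervals by default; labels replaces them with the category names you want;\n",
1152 | "- retbins: whether to also return the bin edges;"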
1153 | ]
1154 | },
1155 | {
1156 | "cell_type": "code",
1157 | "execution_count": 40,
1158 | "metadata": {},
1159 | "outputs": [
1160 | {
1161 | "data": {
1162 | "text/plain": [
1163 | "0 a\n",
1164 | "1 a\n",
1165 | "2 b\n",
1166 | "3 c\n",
1167 | "4 d\n",
1168 | "dtype: category\n",
1169 | "Categories (4, object): [a < b < c < d]"
1170 | ]
1171 | },
1172 | "execution_count": 40,
1173 | "metadata": {},
1174 | "output_type": "execute_result"
1175 | }
1176 | ],
1177 | "source": [
1178 | "pd.qcut(s, q = [0.0, 0.25, 0.5, 0.75, 1.0], labels=['a','b','c','d']) \n",
1179 | "# 5 quantile points form 4 intervals; qcut appears to behave like right=True, include_lowest=True"
1180 | ]
1181 | },
1182 | {
1183 | "cell_type": "code",
1184 | "execution_count": null,
1185 | "metadata": {},
1186 | "outputs": [],
1187 | "source": []
1188 | }
1189 | ],
1190 | "metadata": {
1191 | "kernelspec": {
1192 | "display_name": "Python 3",
1193 | "language": "python",
1194 | "name": "python3"
1195 | },
1196 | "language_info": {
1197 | "codemirror_mode": {
1198 | "name": "ipython",
1199 | "version": 3
1200 | },
1201 | "file_extension": ".py",
1202 | "mimetype": "text/x-python",
1203 | "name": "python",
1204 | "nbconvert_exporter": "python",
1205 | "pygments_lexer": "ipython3",
1206 | "version": "3.7.0"
1207 | },
1208 | "toc": {
1209 | "base_numbering": 1,
1210 | "nav_menu": {
1211 | "height": "318px",
1212 | "width": "252px"
1213 | },
1214 | "number_sections": false,
1215 | "sideBar": true,
1216 | "skip_h1_title": false,
1217 | "title_cell": "Table of Contents",
1218 | "title_sidebar": "Contents",
1219 | "toc_cell": false,
1220 | "toc_position": {},
1221 | "toc_section_display": "block",
1222 | "toc_window_display": true
1223 | }
1224 | },
1225 | "nbformat": 4,
1226 | "nbformat_minor": 2
1227 | }
1228 |
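The edge-handling rules of `pd.cut` and `pd.qcut` demonstrated in this notebook can be checked side by side with a minimal sketch. It reuses the notebook's `s = pd.Series(range(0, 5))`; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(range(5))

# Explicit edges with the default right=True: intervals are (0, 2.5] and
# (2.5, 4], so the value 0 falls outside every bin and becomes NaN.
right_closed = pd.cut(s, [0, 2.5, 4])

# include_lowest=True makes the first interval include its left edge.
with_lowest = pd.cut(s, [0, 2.5, 4], include_lowest=True)

# right=False flips the closed side: [0, 2.5) and [2.5, 4), so 4 is left out.
left_closed = pd.cut(s, [0, 2.5, 4], right=False)

# qcut places the edges at quantiles of the data instead of at given values.
quartiles = pd.qcut(s, q=[0, 0.25, 0.5, 0.75, 1.0], labels=["a", "b", "c", "d"])

print(right_closed.isna().tolist())
print(with_lowest.isna().tolist())
print(left_closed.isna().tolist())
print(quartiles.tolist())
```

Checking `isna()` after binning is a quick way to spot values that silently fell outside the chosen edges.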
--------------------------------------------------------------------------------
/14. Object型操作.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "id": "0485413d-d8e5-4156-9eed-a8df3eca8caa",
6 | "metadata": {},
7 | "source": [
8 | "# Object dtype operations\n",
9 | "\n",
10 | "Operations on elements that are strings or compound types, such as sequences (tuple, list) and dicts."
11 | ]
12 | },
13 | {
14 | "cell_type": "code",
15 | "execution_count": 1,
16 | "id": "06dcbb60-10a4-4428-b351-827300085735",
17 | "metadata": {},
18 | "outputs": [],
19 | "source": [
20 | "__author__ = 'zhenhang.sun@gmail.com'"
21 | ]
22 | },
23 | {
24 | "cell_type": "code",
25 | "execution_count": 2,
26 | "id": "75514624-4fb5-4d6e-a798-0e29d74ab65d",
27 | "metadata": {},
28 | "outputs": [
29 | {
30 | "data": {
31 | "text/plain": [
32 | "'D:\\\\github\\\\pandas-tutorial'"
33 | ]
34 | },
35 | "execution_count": 2,
36 | "metadata": {},
37 | "output_type": "execute_result"
38 | }
39 | ],
40 | "source": [
41 | "pwd"
42 | ]
43 | },
44 | {
45 | "cell_type": "code",
46 | "execution_count": 3,
47 | "id": "87e9525b-0868-43e1-a7f2-d9ef1d701db1",
48 | "metadata": {},
49 | "outputs": [],
50 | "source": [
51 | "import pandas as pd"
52 | ]
53 | },
54 | {
55 | "cell_type": "code",
56 | "execution_count": null,
57 | "id": "f95a93c9-09ae-4cdb-bff3-eeecb67e63fd",
58 | "metadata": {},
59 | "outputs": [],
60 | "source": []
61 | },
62 | {
63 | "cell_type": "markdown",
64 | "id": "ddd37be4-2523-44e0-8a5d-a412a1eac1c8",
65 | "metadata": {
66 | "tags": []
67 | },
68 | "source": [
69 | "# 0.\n",
70 | "\n",
71 | "**A single accessor prefix for all of these operations: `.str`**"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": 4,
77 | "id": "30716e0b-4773-4619-b244-9f29469d4505",
78 | "metadata": {},
79 | "outputs": [
80 | {
81 | "data": {
123 | "text/plain": [
124 | " str list dict\n",
125 | "0 abcidef [1, 2, 3] {'a': 1, 'b': 2}\n",
126 | "1 hahabciidef [4, 5, 6] {'a': 10, 'b': 10}"
127 | ]
128 | },
129 | "execution_count": 4,
130 | "metadata": {},
131 | "output_type": "execute_result"
132 | }
133 | ],
134 | "source": [
135 | "df = pd.DataFrame({\"str\": [\"abcidef\", \"hahabciidef\"],\n",
136 | " \"list\":[[1,2,3],[4,5,6]],\n",
137 | " \"dict\":[{'a':1, 'b':2}, {'a':10, 'b':10}]\n",
138 | " \n",
139 | " })\n",
140 | "df"
141 | ]
142 | },
143 | {
144 | "cell_type": "markdown",
145 | "id": "477aedb8-bf1c-4339-82ba-01dd917c7e92",
146 | "metadata": {
147 | "tags": []
148 | },
149 | "source": [
150 | "# 1. String operations"
151 | ]
152 | },
153 | {
154 | "cell_type": "markdown",
155 | "id": "e5e7fa21-e544-4f5b-8796-ca69ffd6c49a",
156 | "metadata": {},
157 | "source": [
158 | "## 1.1 String transformations\n",
159 | "\n",
160 | "Most common string methods are available through `.str`; browse the accessor to see the full set."
161 | ]
162 | },
163 | {
164 | "cell_type": "code",
165 | "execution_count": 5,
166 | "id": "910b843e-b943-4d8b-aa5a-2e5868100fae",
167 | "metadata": {},
168 | "outputs": [
169 | {
170 | "data": {
171 | "text/plain": [
172 | "0 ABCIDEF\n",
173 | "1 HAHABCIIDEF\n",
174 | "Name: str, dtype: object"
175 | ]
176 | },
177 | "execution_count": 5,
178 | "metadata": {},
179 | "output_type": "execute_result"
180 | }
181 | ],
182 | "source": [
183 | "df['str'].str.upper()"
184 | ]
185 | },
186 | {
187 | "cell_type": "code",
188 | "execution_count": 6,
189 | "id": "113d9078-60d7-4fe2-bb1f-4cc288763192",
190 | "metadata": {},
191 | "outputs": [
192 | {
193 | "data": {
194 | "text/plain": [
195 | "0 Abcidef\n",
196 | "1 Hahabciidef\n",
197 | "Name: str, dtype: object"
198 | ]
199 | },
200 | "execution_count": 6,
201 | "metadata": {},
202 | "output_type": "execute_result"
203 | }
204 | ],
205 | "source": [
206 | "df['str'].str.capitalize()"
207 | ]
208 | },
209 | {
210 | "cell_type": "code",
211 | "execution_count": null,
212 | "id": "0659267a-d330-4b7f-af33-46d022796dbc",
213 | "metadata": {},
214 | "outputs": [],
215 | "source": []
216 | },
217 | {
218 | "cell_type": "markdown",
219 | "id": "a36f7a83-74e0-485d-9964-c03f300db770",
220 | "metadata": {},
221 | "source": [
222 | "## 1.2 Extraction with regular expressions\n"
223 | ]
224 | },
225 | {
226 | "cell_type": "code",
227 | "execution_count": 7,
228 | "id": "c635805e-489f-433c-afa3-bc0af1cf1b7c",
229 | "metadata": {},
230 | "outputs": [
231 | {
232 | "data": {
271 | "text/plain": [
272 | " 0 1\n",
273 | "0 abc def\n",
274 | "1 abc def"
275 | ]
276 | },
277 | "execution_count": 7,
278 | "metadata": {},
279 | "output_type": "execute_result"
280 | }
281 | ],
282 | "source": [
283 | "df['str'].str.extract(r\"(abc).*?(def)\")"
284 | ]
285 | },
286 | {
287 | "cell_type": "markdown",
288 | "id": "18a813fd-9ea7-458e-b87d-26e8adc131de",
289 | "metadata": {},
290 | "source": [
291 | "## 1.3 Find"
292 | ]
293 | },
294 | {
295 | "cell_type": "code",
296 | "execution_count": 8,
297 | "id": "fc7670a5-7dc1-438b-ae19-5cea28590625",
298 | "metadata": {},
299 | "outputs": [
300 | {
301 | "data": {
302 | "text/plain": [
303 | "0 0\n",
304 | "1 3\n",
305 | "Name: str, dtype: int64"
306 | ]
307 | },
308 | "execution_count": 8,
309 | "metadata": {},
310 | "output_type": "execute_result"
311 | }
312 | ],
313 | "source": [
314 | "df['str'].str.find('abc')"
315 | ]
316 | },
317 | {
318 | "cell_type": "markdown",
319 | "id": "6adc383f-9cc0-4832-b3c8-4ae523eed5ed",
320 | "metadata": {},
321 | "source": [
322 | "## 1.4 Split"
323 | ]
324 | },
325 | {
326 | "cell_type": "code",
327 | "execution_count": 9,
328 | "id": "ce57630a-d5cb-46e1-9b55-3b33c41fdb46",
329 | "metadata": {},
330 | "outputs": [
331 | {
332 | "data": {
333 | "text/plain": [
334 | "0 [abc, def]\n",
335 | "1 [hahabc, , def]\n",
336 | "Name: str, dtype: object"
337 | ]
338 | },
339 | "execution_count": 9,
340 | "metadata": {},
341 | "output_type": "execute_result"
342 | }
343 | ],
344 | "source": [
345 | "df['str'].str.split('i')"
346 | ]
347 | },
348 | {
349 | "cell_type": "code",
350 | "execution_count": 10,
351 | "id": "1f23ba45-10bb-4b15-a16e-e43c931b7da5",
352 | "metadata": {},
353 | "outputs": [
354 | {
355 | "data": {
397 | "text/plain": [
398 | " 0 1 2\n",
399 | "0 abc def None\n",
400 | "1 hahabc def"
401 | ]
402 | },
403 | "execution_count": 10,
404 | "metadata": {},
405 | "output_type": "execute_result"
406 | }
407 | ],
408 | "source": [
409 | "df['str'].str.split('i', expand=True)"
410 | ]
411 | },
412 | {
413 | "cell_type": "markdown",
414 | "id": "ca9988fb-2ea0-448a-9155-9309dc0cfbfa",
415 | "metadata": {
416 | "tags": []
417 | },
418 | "source": [
419 | "## 1.5 Containment test"
420 | ]
421 | },
422 | {
423 | "cell_type": "code",
424 | "execution_count": 11,
425 | "id": "52d5f38b-db26-4580-ba54-4022cac12375",
426 | "metadata": {},
427 | "outputs": [
428 | {
429 | "data": {
430 | "text/plain": [
431 | "0 True\n",
432 | "1 True\n",
433 | "Name: str, dtype: bool"
434 | ]
435 | },
436 | "execution_count": 11,
437 | "metadata": {},
438 | "output_type": "execute_result"
439 | }
440 | ],
441 | "source": [
442 | "# check whether each string contains 'abc'\n",
443 | "df['str'].str.contains('abc')"
444 | ]
445 | },
446 | {
447 | "cell_type": "code",
448 | "execution_count": null,
449 | "id": "3fa1d363-b1c9-47cb-a118-2acd2ca23644",
450 | "metadata": {},
451 | "outputs": [],
452 | "source": []
453 | },
454 | {
455 | "cell_type": "markdown",
456 | "id": "fc594985-f9e0-4e48-a14c-3bc1130b2638",
457 | "metadata": {
458 | "tags": []
459 | },
460 | "source": [
461 | "# 2. List type"
462 | ]
463 | },
464 | {
465 | "cell_type": "code",
466 | "execution_count": 12,
467 | "id": "a9bc121b-15c3-4318-acef-d7c1efdeff08",
468 | "metadata": {},
469 | "outputs": [
470 | {
471 | "data": {
513 | "text/plain": [
514 | " str list dict\n",
515 | "0 abcidef [1, 2, 3] {'a': 1, 'b': 2}\n",
516 | "1 hahabciidef [4, 5, 6] {'a': 10, 'b': 10}"
517 | ]
518 | },
519 | "execution_count": 12,
520 | "metadata": {},
521 | "output_type": "execute_result"
522 | }
523 | ],
524 | "source": [
525 | "df"
526 | ]
527 | },
528 | {
529 | "cell_type": "markdown",
530 | "id": "4210a82a-d4cb-42ac-897f-8dd399ac47f1",
531 | "metadata": {},
532 | "source": [
533 | "## 2.1 Get an element"
534 | ]
535 | },
536 | {
537 | "cell_type": "code",
538 | "execution_count": 13,
539 | "id": "be3cd399-4e33-4d34-9d76-41470afa302e",
540 | "metadata": {},
541 | "outputs": [
542 | {
543 | "data": {
544 | "text/plain": [
545 | "0 2\n",
546 | "1 5\n",
547 | "Name: list, dtype: int64"
548 | ]
549 | },
550 | "execution_count": 13,
551 | "metadata": {},
552 | "output_type": "execute_result"
553 | }
554 | ],
555 | "source": [
556 | "# take the element at position 1 of each list\n",
557 | "df['list'].str.get(1)"
558 | ]
559 | },
560 | {
561 | "cell_type": "markdown",
562 | "id": "b87d818c-7df0-4120-9386-9f851bc09a5d",
563 | "metadata": {},
564 | "source": [
565 | "## 2.2 Expand\n"
566 | ]
567 | },
568 | {
569 | "cell_type": "code",
570 | "execution_count": 14,
571 | "id": "1501fd67-5d88-4de5-8341-eebedc759cc4",
572 | "metadata": {},
573 | "outputs": [
574 | {
575 | "data": {
576 | "text/plain": [
577 | "0 1\n",
578 | "0 2\n",
579 | "0 3\n",
580 | "1 4\n",
581 | "1 5\n",
582 | "1 6\n",
583 | "Name: list, dtype: object"
584 | ]
585 | },
586 | "execution_count": 14,
587 | "metadata": {},
588 | "output_type": "execute_result"
589 | }
590 | ],
591 | "source": [
592 | "# explode a Series: one row per list element\n",
593 | "# note that .str is not needed here\n",
594 | "df['list'].explode()"
595 | ]
596 | },
597 | {
598 | "cell_type": "code",
599 | "execution_count": 15,
600 | "id": "f7bc208a-aef9-4753-87ec-f7722f373430",
601 | "metadata": {},
602 | "outputs": [
603 | {
604 | "data": {
670 | "text/plain": [
671 | " str list dict\n",
672 | "0 abcidef 1 {'a': 1, 'b': 2}\n",
673 | "0 abcidef 2 {'a': 1, 'b': 2}\n",
674 | "0 abcidef 3 {'a': 1, 'b': 2}\n",
675 | "1 hahabciidef 4 {'a': 10, 'b': 10}\n",
676 | "1 hahabciidef 5 {'a': 10, 'b': 10}\n",
677 | "1 hahabciidef 6 {'a': 10, 'b': 10}"
678 | ]
679 | },
680 | "execution_count": 15,
681 | "metadata": {},
682 | "output_type": "execute_result"
683 | }
684 | ],
685 | "source": [
686 | "# explode a DataFrame on a given column\n",
687 | "df.explode('list')"
688 | ]
689 | },
690 | {
691 | "cell_type": "code",
692 | "execution_count": 16,
693 | "id": "9e53f211-daba-41b8-ab62-9f114afc81e4",
694 | "metadata": {},
695 | "outputs": [
696 | {
697 | "data": {
739 | "text/plain": [
740 | " e1 e2 e3\n",
741 | "0 1 2 3\n",
742 | "1 4 5 6"
743 | ]
744 | },
745 | "execution_count": 16,
746 | "metadata": {},
747 | "output_type": "execute_result"
748 | }
749 | ],
750 | "source": [
751 | "# expand into multiple columns\n",
752 | "pd.DataFrame( df['list'].tolist(), columns=['e1', 'e2', 'e3'])"
753 | ]
754 | },
755 | {
756 | "cell_type": "code",
757 | "execution_count": null,
758 | "id": "e1999fa5-79d8-46f8-9531-6c5365081768",
759 | "metadata": {},
760 | "outputs": [],
761 | "source": []
762 | },
763 | {
764 | "cell_type": "markdown",
765 | "id": "03e24e48-5766-4f8f-8ab5-7e5bd7042a6c",
766 | "metadata": {
767 | "tags": []
768 | },
769 | "source": [
770 | "# 3. Dict type"
771 | ]
772 | },
773 | {
774 | "cell_type": "code",
775 | "execution_count": 17,
776 | "id": "78339a64-6d4d-43cb-8688-13213552e549",
777 | "metadata": {},
778 | "outputs": [
779 | {
780 | "data": {
822 | "text/plain": [
823 | " str list dict\n",
824 | "0 abcidef [1, 2, 3] {'a': 1, 'b': 2}\n",
825 | "1 hahabciidef [4, 5, 6] {'a': 10, 'b': 10}"
826 | ]
827 | },
828 | "execution_count": 17,
829 | "metadata": {},
830 | "output_type": "execute_result"
831 | }
832 | ],
833 | "source": [
834 | "df"
835 | ]
836 | },
837 | {
838 | "cell_type": "markdown",
839 | "id": "84ba8ff7-1a1f-4c12-b7b1-1bb4cf5759e9",
840 | "metadata": {},
841 | "source": [
842 | "## 3.1 Get an element"
843 | ]
844 | },
845 | {
846 | "cell_type": "code",
847 | "execution_count": 18,
848 | "id": "d7a76a3c-5a9f-4dc2-8475-491acdeaea1d",
849 | "metadata": {},
850 | "outputs": [
851 | {
852 | "data": {
853 | "text/plain": [
854 | "0 1\n",
855 | "1 10\n",
856 | "Name: dict, dtype: int64"
857 | ]
858 | },
859 | "execution_count": 18,
860 | "metadata": {},
861 | "output_type": "execute_result"
862 | }
863 | ],
864 | "source": [
865 | "# get the value stored under a key\n",
866 | "df['dict'].str.get('a')"
867 | ]
868 | },
869 | {
870 | "cell_type": "markdown",
871 | "id": "8331bb98-26c3-4a3b-90bb-dae05611124a",
872 | "metadata": {},
873 | "source": [
874 | "## 3.2 Expand"
875 | ]
876 | },
877 | {
878 | "cell_type": "code",
879 | "execution_count": 19,
880 | "id": "8b079c37-9c34-414f-981b-d74b46a4d5c1",
881 | "metadata": {},
882 | "outputs": [
883 | {
884 | "data": {
885 | "text/plain": [
886 | "0 a\n",
887 | "0 b\n",
888 | "1 a\n",
889 | "1 b\n",
890 | "Name: dict, dtype: object"
891 | ]
892 | },
893 | "execution_count": 19,
894 | "metadata": {},
895 | "output_type": "execute_result"
896 | }
897 | ],
898 | "source": [
899 | "# explode keeps only the keys, so this operation is not really suited to dicts\n",
900 | "df['dict'].explode()"
901 | ]
902 | },
903 | {
904 | "cell_type": "code",
905 | "execution_count": 20,
906 | "id": "8ae3ef52-f8b9-4c79-9a7c-17fc6e4e4899",
907 | "metadata": {},
908 | "outputs": [
909 | {
910 | "data": {
949 | "text/plain": [
950 | " a b\n",
951 | "0 1 2\n",
952 | "1 10 10"
953 | ]
954 | },
955 | "execution_count": 20,
956 | "metadata": {},
957 | "output_type": "execute_result"
958 | }
959 | ],
960 | "source": [
961 | "# expand into multiple columns\n",
962 | "pd.DataFrame(df['dict'].tolist())"
963 | ]
964 | }
965 | ],
966 | "metadata": {
967 | "kernelspec": {
968 | "display_name": "Python 3 (ipykernel)",
969 | "language": "python",
970 | "name": "python3"
971 | },
972 | "language_info": {
973 | "codemirror_mode": {
974 | "name": "ipython",
975 | "version": 3
976 | },
977 | "file_extension": ".py",
978 | "mimetype": "text/x-python",
979 | "name": "python",
980 | "nbconvert_exporter": "python",
981 | "pygments_lexer": "ipython3",
982 | "version": "3.10.0"
983 | },
984 | "toc-autonumbering": false,
985 | "toc-showcode": false,
986 | "toc-showmarkdowntxt": false,
987 | "toc-showtags": false
988 | },
989 | "nbformat": 4,
990 | "nbformat_minor": 5
991 | }
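The string cells in this notebook (extract, find, contains) only use inputs that match. The sketch below adds one non-matching row to show what each method returns in that case. The third string and the group names `head` and `tail` are assumptions added for illustration; they are not in the original notebook.

```python
import pandas as pd

# The notebook's two strings plus a hypothetical row that does not match.
s = pd.Series(["abcidef", "hahabciidef", "xyz"])

# Named groups become column names; rows without a match are filled with NaN.
parts = s.str.extract(r"(?P<head>abc).*?(?P<tail>def)")

# find returns the position of the first occurrence, or -1 when absent.
positions = s.str.find("abc")

# regex=False treats the pattern as a literal substring.
has_abc = s.str.contains("abc", regex=False)

print(parts)
print(positions.tolist())
print(has_abc.tolist())
```

Naming the groups is optional, but it avoids the positional `0`, `1` column labels seen in the notebook output.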
992 |
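The list and dict expansions shown in this notebook return standalone results. A common follow-up is attaching them back to the source rows, which the sketch below illustrates under the assumption that the goal is a wide table for the dict column and a long table for the list column. The names `wide`, `long_form`, and `row` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "list": [[1, 2, 3], [4, 5, 6]],
    "dict": [{"a": 1, "b": 2}, {"a": 10, "b": 10}],
})

# Dict column to wide form: build the new columns on the same index,
# then join them back in place of the original column.
dict_cols = pd.DataFrame(df["dict"].tolist(), index=df.index)
wide = df.drop(columns="dict").join(dict_cols)

# List column to long form: explode repeats the index label per element,
# so reset it and keep the old label as an explicit "row" column.
long_form = (
    df[["list"]]
    .explode("list")
    .reset_index()
    .rename(columns={"index": "row"})
)

print(wide)
print(long_form)
```

Passing `index=df.index` matters when the source frame does not use the default 0..n-1 index; without it the join would align on the wrong labels.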
--------------------------------------------------------------------------------
/3. merge详解.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
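7 | "# merge in detail\n",
8 | "merge was originally going to be covered in the Add section of the previous chapter, but it is a heavily used relational-database operation with many variations, so this function gets a chapter of its own."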
9 | ]
10 | },
11 | {
12 | "cell_type": "code",
13 | "execution_count": 1,
14 | "metadata": {},
15 | "outputs": [],
16 | "source": [
17 | "__author__ = 'zhenhang.sun@gmail.com'"
18 | ]
19 | },
20 | {
21 | "cell_type": "code",
22 | "execution_count": 2,
23 | "metadata": {},
24 | "outputs": [
25 | {
26 | "data": {
27 | "text/plain": [
28 | "'D:\\\\github\\\\pandas-tutorial'"
29 | ]
30 | },
31 | "execution_count": 2,
32 | "metadata": {},
33 | "output_type": "execute_result"
34 | }
35 | ],
36 | "source": [
37 | "pwd"
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": 3,
43 | "metadata": {},
44 | "outputs": [],
45 | "source": [
46 | "import numpy as np\n",
47 | "import pandas as pd"
48 | ]
49 | },
50 | {
51 | "cell_type": "markdown",
52 | "metadata": {},
53 | "source": [
54 | "# 1. Function reference"
55 | ]
56 | },
57 | {
58 | "cell_type": "markdown",
59 | "metadata": {},
60 | "source": [
61 | "##### `pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False)`\n",
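62 | "concat can only align on the index; to align and combine on arbitrary **columns** you need merge, the counterpart of a SQL join.\n",
63 | "- left, right: the two DataFrames to align and combine;\n",
64 | "- how: conceptually, form the Cartesian product first, then keep the rows the join type asks for; missing data is filled with NaN;\n",
65 | " - left: use the left DataFrame as the base, so all of its rows are kept (not necessarily one-to-one, since a row is duplicated when its key matches several right rows), in the original order;\n",
66 | " - right: use the right DataFrame as the base, in the original order;\n",
67 | " - inner: intersection; keep the rows where the left and right DataFrames agree exactly on the keys, in the order of the left DataFrame;\n",
68 | " - outer: union; the result is re-sorted lexicographically by key;\n",
69 | "- on: column names or row-index names to join on; use this parameter when the key has the same name in both DataFrames;\n",
70 | "- left_on, right_on, left_index, right_index:\n",
71 | " - left_on and right_on take column names or row-index names (so the row index is best treated like a column and given its own name); use these two when the key is named differently on each side;\n",
72 | " - left_index and right_index use the index itself as the key; not recommended, because they are easy to get confused by.\n",
73 | "- sort: True or False, whether to sort the result lexicographically by the join keys."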
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": 4,
79 | "metadata": {},
80 | "outputs": [
81 | {
82 | "data": {
121 | "text/plain": [
122 | " A B\n",
123 | "a 1 2\n",
124 | "b 3 4"
125 | ]
126 | },
127 | "execution_count": 4,
128 | "metadata": {},
129 | "output_type": "execute_result"
130 | }
131 | ],
132 | "source": [
133 | "df1 = pd.DataFrame([[1,2],[3,4]], index = ['a','b'],columns = ['A','B'])\n",
134 | "df2 = pd.DataFrame([[1,3],[4,8]], index = ['b','d'],columns = ['B','C'])\n",
135 | "df1"
136 | ]
137 | },
138 | {
139 | "cell_type": "code",
140 | "execution_count": 5,
141 | "metadata": {},
142 | "outputs": [
143 | {
144 | "data": {
145 | "text/html": [
146 | "\n",
147 | "\n",
160 | "
\n",
161 | " \n",
162 | " \n",
163 | " | \n",
164 | " B | \n",
165 | " C | \n",
166 | "
\n",
167 | " \n",
168 | " \n",
169 | " \n",
170 | " b | \n",
171 | " 1 | \n",
172 | " 3 | \n",
173 | "
\n",
174 | " \n",
175 | " d | \n",
176 | " 4 | \n",
177 | " 8 | \n",
178 | "
\n",
179 | " \n",
180 | "
\n",
181 | "
"
182 | ],
183 | "text/plain": [
184 | " B C\n",
185 | "b 1 3\n",
186 | "d 4 8"
187 | ]
188 | },
189 | "execution_count": 5,
190 | "metadata": {},
191 | "output_type": "execute_result"
192 | }
193 | ],
194 | "source": [
195 | "df2"
196 | ]
197 | },
198 | {
199 | "cell_type": "markdown",
200 | "metadata": {},
201 | "source": [
202 | "**If you only need to align on the index, concat is the simpler choice.**"
203 | ]
204 | },
205 | {
206 | "cell_type": "code",
207 | "execution_count": 6,
208 | "metadata": {},
209 | "outputs": [
210 | {
211 | "data": {
212 | "text/html": [
213 | "\n",
214 | "\n",
227 | "
\n",
228 | " \n",
229 | " \n",
230 | " | \n",
231 | " A | \n",
232 | " B_x | \n",
233 | " B_y | \n",
234 | " C | \n",
235 | "
\n",
236 | " \n",
237 | " \n",
238 | " \n",
239 | " b | \n",
240 | " 3 | \n",
241 | " 4 | \n",
242 | " 1 | \n",
243 | " 3 | \n",
244 | "
\n",
245 | " \n",
246 | "
\n",
247 | "
"
248 | ],
249 | "text/plain": [
250 | " A B_x B_y C\n",
251 | "b 3 4 1 3"
252 | ]
253 | },
254 | "execution_count": 6,
255 | "metadata": {},
256 | "output_type": "execute_result"
257 | }
258 | ],
259 | "source": [
260 | "pd.merge(left=df1, right= df2, how='inner' ,left_index=True, right_index=True)"
261 | ]
262 | },
263 | {
264 | "cell_type": "code",
265 | "execution_count": 7,
266 | "metadata": {},
267 | "outputs": [
268 | {
269 | "data": {
270 | "text/html": [
271 | "\n",
272 | "\n",
285 | "
\n",
286 | " \n",
287 | " \n",
288 | " | \n",
289 | " A | \n",
290 | " B | \n",
291 | " B | \n",
292 | " C | \n",
293 | "
\n",
294 | " \n",
295 | " \n",
296 | " \n",
297 | " b | \n",
298 | " 3 | \n",
299 | " 4 | \n",
300 | " 1 | \n",
301 | " 3 | \n",
302 | "
\n",
303 | " \n",
304 | "
\n",
305 | "
"
306 | ],
307 | "text/plain": [
308 | " A B B C\n",
309 | "b 3 4 1 3"
310 | ]
311 | },
312 | "execution_count": 7,
313 | "metadata": {},
314 | "output_type": "execute_result"
315 | }
316 | ],
317 | "source": [
318 | "# One small difference: concat does not rename duplicate columns, but name clashes are rare and usually point to a questionable design.\n",
319 | "pd.concat([df1, df2], join='inner', axis =1) "
320 | ]
321 | },
322 | {
323 | "cell_type": "code",
324 | "execution_count": null,
325 | "metadata": {},
326 | "outputs": [],
327 | "source": []
328 | },
329 | {
330 | "cell_type": "markdown",
331 | "metadata": {},
332 | "source": [
333 | "# 2. Using on\n",
334 | "With how='inner'."
335 | ]
336 | },
337 | {
338 | "cell_type": "code",
339 | "execution_count": 8,
340 | "metadata": {
341 | "scrolled": true
342 | },
343 | "outputs": [
344 | {
345 | "data": {
346 | "text/html": [
347 | "\n",
348 | "\n",
361 | "
\n",
362 | " \n",
363 | " \n",
364 | " | \n",
365 | " A | \n",
366 | " B | \n",
367 | " C | \n",
368 | "
\n",
369 | " \n",
370 | " \n",
371 | " \n",
372 | " 0 | \n",
373 | " 3 | \n",
374 | " 4 | \n",
375 | " 8 | \n",
376 | "
\n",
377 | " \n",
378 | "
\n",
379 | "
"
380 | ],
381 | "text/plain": [
382 | " A B C\n",
383 | "0 3 4 8"
384 | ]
385 | },
386 | "execution_count": 8,
387 | "metadata": {},
388 | "output_type": "execute_result"
389 | }
390 | ],
391 | "source": [
392 | "# In column 'B', only row 'b' of df1 and row 'd' of df2 hold the same value; no other rows match.\n",
393 | "pd.merge(left=df1, right=df2, how='inner', on=['B']) "
394 | ]
395 | },
396 | {
397 | "cell_type": "code",
398 | "execution_count": 9,
399 | "metadata": {},
400 | "outputs": [
401 | {
402 | "data": {
440 | "text/plain": [
441 | " A B_x B_y C\n",
442 | "0 3 4 1 3"
443 | ]
444 | },
445 | "execution_count": 9,
446 | "metadata": {},
447 | "output_type": "execute_result"
448 | }
449 | ],
450 | "source": [
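451 | "# The value in column 'A', row 'b' of df1 equals the value in column 'C', row 'b' of df2; no other rows match.\n",
452 | "# Any other columns that share a name are renamed with suffixes.\n",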
453 | "pd.merge(left=df1, right=df2, how='inner',left_on=['A'] ,right_on=['C'])"
454 | ]
455 | },
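The difference between `on` and `left_on`/`right_on` can be sketched as follows. The two frames are an assumption: they are reconstructed from the outputs shown in this chapter, and the `suffixes` values are chosen only for illustration.

```python
import pandas as pd

# Assumed reconstruction of the chapter's frames (not taken from the source cells).
df1 = pd.DataFrame({'A': [1, 3], 'B': [2, 4]}, index=['a', 'b'])
df2 = pd.DataFrame({'B': [1, 4], 'C': [3, 8]}, index=['b', 'd'])

# Same key name on both sides: the key column appears once in the result.
same_key = pd.merge(df1, df2, how='inner', on='B')

# Different key names: both key columns are kept, and the remaining shared
# name 'B' is disambiguated with the suffixes passed in.
diff_key = pd.merge(df1, df2, how='inner', left_on='A', right_on='C',
                    suffixes=('_left', '_right'))

print(same_key)
print(diff_key)
```

Passing explicit `suffixes` instead of relying on the default `_x`/`_y` tends to make the origin of each column easier to read later on.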
456 | {
457 | "cell_type": "code",
458 | "execution_count": null,
459 | "metadata": {},
460 | "outputs": [],
461 | "source": []
462 | },
463 | {
464 | "cell_type": "markdown",
465 | "metadata": {},
466 | "source": [
467 | "# 3. Using `how`"
468 | ]
469 | },
470 | {
471 | "cell_type": "code",
472 | "execution_count": 10,
473 | "metadata": {},
474 | "outputs": [
475 | {
476 | "data": {
518 | "text/plain": [
519 | " A B C\n",
520 | "0 1 2 NaN\n",
521 | "1 3 4 8.0"
522 | ]
523 | },
524 | "execution_count": 10,
525 | "metadata": {},
526 | "output_type": "execute_result"
527 | }
528 | ],
529 | "source": [
530 | "# Keep every row of the left DataFrame and align the right one to it; positions with no match are filled with NaN.\n",
531 | "pd.merge(left=df1, right=df2, how='left', on=['B'] )"
532 | ]
533 | },
534 | {
535 | "cell_type": "code",
536 | "execution_count": 11,
537 | "metadata": {},
538 | "outputs": [
539 | {
540 | "data": {
582 | "text/plain": [
583 | " A B C\n",
584 | "0 3.0 4 8\n",
585 | "1 NaN 1 3"
586 | ]
587 | },
588 | "execution_count": 11,
589 | "metadata": {},
590 | "output_type": "execute_result"
591 | }
592 | ],
593 | "source": [
594 | "# Keep every row of the right DataFrame and align the left one to it; positions with no match are filled with NaN.\n",
595 | "pd.merge(left=df1, right=df2, how='right', on=['B'] )"
596 | ]
597 | },
598 | {
599 | "cell_type": "markdown",
600 | "metadata": {},
601 | "source": [
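602 | "**The join column may contain duplicate values. That is fine: the matching logic is unchanged, so you can reason about each row as if there were no duplicates.**"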
603 | ]
604 | },
605 | {
606 | "cell_type": "code",
607 | "execution_count": 12,
608 | "metadata": {
609 | "scrolled": false
610 | },
611 | "outputs": [
612 | {
613 | "data": {
652 | "text/plain": [
653 | " A B\n",
654 | "a 1 4\n",
655 | "b 3 4"
656 | ]
657 | },
658 | "execution_count": 12,
659 | "metadata": {},
660 | "output_type": "execute_result"
661 | }
662 | ],
663 | "source": [
664 | "df1.loc['a','B'] = 4 # make the value in B a duplicate\n",
665 | "df1"
666 | ]
667 | },
668 | {
669 | "cell_type": "code",
670 | "execution_count": 13,
671 | "metadata": {
672 | "scrolled": true
673 | },
674 | "outputs": [
675 | {
676 | "data": {
724 | "text/plain": [
725 | " A B C\n",
726 | "0 1.0 4 8\n",
727 | "1 3.0 4 8\n",
728 | "2 NaN 1 3"
729 | ]
730 | },
731 | "execution_count": 13,
732 | "metadata": {},
733 | "output_type": "execute_result"
734 | }
735 | ],
736 | "source": [
737 | "# Every row of the right frame is kept; if the left join column contains duplicates, the matching right row appears once per duplicate.\n",
738 | "pd.merge(left=df1, right=df2, how='right', on=['B'] )"
739 | ]
740 | },
741 | {
742 | "cell_type": "code",
743 | "execution_count": null,
744 | "metadata": {},
745 | "outputs": [],
746 | "source": []
747 | }
748 | ],
749 | "metadata": {
750 | "kernelspec": {
751 | "display_name": "Python 3",
752 | "language": "python",
753 | "name": "python3"
754 | },
755 | "language_info": {
756 | "codemirror_mode": {
757 | "name": "ipython",
758 | "version": 3
759 | },
760 | "file_extension": ".py",
761 | "mimetype": "text/x-python",
762 | "name": "python",
763 | "nbconvert_exporter": "python",
764 | "pygments_lexer": "ipython3",
765 | "version": "3.7.0"
766 | },
767 | "toc": {
768 | "base_numbering": 1,
769 | "nav_menu": {
770 | "height": "84px",
771 | "width": "252px"
772 | },
773 | "number_sections": false,
774 | "sideBar": true,
775 | "skip_h1_title": false,
776 | "title_cell": "Table of Contents",
777 | "title_sidebar": "Contents",
778 | "toc_cell": false,
779 | "toc_position": {
780 | "height": "485px",
781 | "left": "0px",
782 | "right": "1146px",
783 | "top": "66px",
784 | "width": "134px"
785 | },
786 | "toc_section_display": "block",
787 | "toc_window_display": true
788 | }
789 | },
790 | "nbformat": 4,
791 | "nbformat_minor": 2
792 | }
793 |
--------------------------------------------------------------------------------
/6. 数据结构总览.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
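7 | "# Data Structures Overview\n",
8 | "The previous chapters introduced the three data structures we use most often. This chapter summarizes them and then builds on that to introduce the basic data types and some very useful intermediate types."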
9 | ]
10 | },
11 | {
12 | "cell_type": "code",
13 | "execution_count": 1,
14 | "metadata": {},
15 | "outputs": [],
16 | "source": [
17 | "__author__ = 'zhenhang.sun@gmail.com'"
18 | ]
19 | },
20 | {
21 | "cell_type": "code",
22 | "execution_count": 2,
23 | "metadata": {},
24 | "outputs": [
25 | {
26 | "data": {
27 | "text/plain": [
28 | "'D:\\\\github\\\\pandas-tutorial'"
29 | ]
30 | },
31 | "execution_count": 2,
32 | "metadata": {},
33 | "output_type": "execute_result"
34 | }
35 | ],
36 | "source": [
37 | "pwd"
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": 3,
43 | "metadata": {},
44 | "outputs": [],
45 | "source": [
46 | "import numpy as np\n",
47 | "import pandas as pd"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": null,
53 | "metadata": {},
54 | "outputs": [],
55 | "source": []
56 | },
57 | {
58 | "cell_type": "markdown",
59 | "metadata": {},
60 | "source": [
61 | "# 1. Three Commonly Used Data Structures"
62 | ]
63 | },
64 | {
65 | "cell_type": "markdown",
66 | "metadata": {},
67 | "source": [
68 | "## 1.1 Series\n",
69 | "A Series is a one-dimensional data structure whose elements all share one data type; it behaves like both a list and a dictionary."
70 | ]
71 | },
72 | {
73 | "cell_type": "markdown",
74 | "metadata": {},
75 | "source": [
76 | "**Four important attributes**\n",
77 | "- Series.index\n",
78 | "- Series.name\n",
79 | "- Series.values\n",
80 | "- Series.dtype"
81 | ]
82 | },
83 | {
84 | "cell_type": "markdown",
85 | "metadata": {},
86 | "source": [
87 | "## 1.2 DataFrame\n",
88 | "A DataFrame is a two-dimensional data structure made up of Series that share the same index."
89 | ]
90 | },
91 | {
92 | "cell_type": "markdown",
93 | "metadata": {},
94 | "source": [
95 | "**Four important attributes**\n",
96 | "- DataFrame.index\n",
97 | "- DataFrame.columns\n",
98 | "- DataFrame.values\n",
99 | "- DataFrame.dtypes"
100 | ]
101 | },
102 | {
103 | "cell_type": "markdown",
104 | "metadata": {},
105 | "source": [
106 | "## 1.3 Index\n",
107 | "An Index is the key to building and manipulating Series and DataFrame objects; like a tuple, it is ordered and immutable."
108 | ]
109 | },
110 | {
111 | "cell_type": "markdown",
112 | "metadata": {},
113 | "source": [
114 | "**Three important attributes**\n",
115 | "- Index.name\n",
116 | "- Index.values\n",
117 | "- Index.dtype"
118 | ]
119 | },
120 | {
121 | "cell_type": "code",
122 | "execution_count": null,
123 | "metadata": {},
124 | "outputs": [],
125 | "source": []
126 | },
127 | {
128 | "cell_type": "markdown",
129 | "metadata": {},
130 | "source": [
131 | "# 2. Basic Data Types\n",
132 | "- These data types all come from NumPy;\n",
133 | "- so when you use them, add the prefix `np.`;\n",
134 | "- the basic data types do not include a string type: strings are stored as object_."
135 | ]
136 | },
137 | {
138 | "cell_type": "markdown",
139 | "metadata": {},
140 | "source": [
141 | "## 2.1 Boolean"
142 | ]
143 | },
144 | {
145 | "cell_type": "code",
146 | "execution_count": 4,
147 | "metadata": {},
148 | "outputs": [],
149 | "source": [
150 | "columns = ['Type', 'Description', 'Code']"
151 | ]
152 | },
153 | {
154 | "cell_type": "code",
155 | "execution_count": 5,
156 | "metadata": {},
157 | "outputs": [
158 | {
159 | "data": {
201 | "text/plain": [
202 | " Type Description Code\n",
203 | "0 bool_ compatible: Python bool ?\n",
204 | "1 bool8 8 bits "
205 | ]
206 | },
207 | "execution_count": 5,
208 | "metadata": {},
209 | "output_type": "execute_result"
210 | }
211 | ],
212 | "source": [
213 | "data = [ ['bool_','compatible: Python bool','?'],\n",
214 | " ['bool8','8 bits',''] ] \n",
215 | "pd.DataFrame(data, columns = columns)"
216 | ]
217 | },
218 | {
219 | "cell_type": "markdown",
220 | "metadata": {},
221 | "source": [
222 | "## 2.2 Integers"
223 | ]
224 | },
225 | {
226 | "cell_type": "markdown",
227 | "metadata": {},
228 | "source": [
229 | "### 2.2.1 Signed integers"
230 | ]
231 | },
232 | {
233 | "cell_type": "code",
234 | "execution_count": 6,
235 | "metadata": {},
236 | "outputs": [
237 | {
238 | "data": {
328 | "text/plain": [
329 | " Type Description Code\n",
330 | "0 byte compatible: C char b\n",
331 | "1 short compatible: C short h\n",
332 | "2 intc compatible: C int i\n",
333 | "3 int_ compatible: Python int l\n",
334 | "4 longlong compatible: C long long q\n",
335 | "5 intp large enough to fit a pointer p\n",
336 | "6 int8 8 bits \n",
337 | "7 int16 16 bits \n",
338 | "8 int32 32 bits \n",
339 | "9 int64 64 bits "
340 | ]
341 | },
342 | "execution_count": 6,
343 | "metadata": {},
344 | "output_type": "execute_result"
345 | }
346 | ],
347 | "source": [
348 | "data = [['byte','compatible: C char','b'],\n",
349 | "['short','compatible: C short','h'],\n",
350 | "['intc','compatible: C int','i'],\n",
351 | "['int_','compatible: Python int','l'],\n",
352 | "['longlong','compatible: C long long','q'],\n",
353 | "['intp','large enough to fit a pointer','p'],\n",
354 | "['int8','8 bits','' ],\n",
355 | "['int16','16 bits','' ],\n",
356 | "['int32','32 bits',''],\n",
357 | "['int64','64 bits','']]\n",
358 | "pd.DataFrame(data = data, columns = columns)"
359 | ]
360 | },
361 | {
362 | "cell_type": "markdown",
363 | "metadata": {},
364 | "source": [
365 | "### 2.2.2 Unsigned integers"
366 | ]
367 | },
368 | {
369 | "cell_type": "code",
370 | "execution_count": 7,
371 | "metadata": {},
372 | "outputs": [
373 | {
374 | "data": {
464 | "text/plain": [
465 | " Type Description Code\n",
466 | "0 ubyte compatible: C unsigned char B\n",
467 | "1 ushort compatible: C unsigned short H\n",
468 | "2 uintc compatible: C unsigned int I\n",
469 | "3 uint compatible: Python int L\n",
470 | "4 ulonglong compatible: C long long Q\n",
471 | "5 uintp large enough to fit a pointer P\n",
472 | "6 uint8 8 bits \n",
473 | "7 uint16 16 bits \n",
474 | "8 uint32 32 bits \n",
475 | "9 uint64 64 bits "
476 | ]
477 | },
478 | "execution_count": 7,
479 | "metadata": {},
480 | "output_type": "execute_result"
481 | }
482 | ],
483 | "source": [
484 | "data = [['ubyte','compatible: C unsigned char','B'],\n",
485 | "['ushort','compatible: C unsigned short','H'],\n",
486 | "['uintc','compatible: C unsigned int','I'],\n",
487 | "['uint','compatible: Python int','L'],\n",
488 | "['ulonglong','compatible: C long long','Q'],\n",
489 | "['uintp','large enough to fit a pointer','P'],\n",
490 | "['uint8','8 bits',''], \n",
491 | "['uint16','16 bits',''],\n",
492 | "['uint32','32 bits',''],\n",
493 | "['uint64','64 bits','']]\n",
494 | "pd.DataFrame(data = data, columns = columns)"
495 | ]
496 | },
497 | {
498 | "cell_type": "markdown",
499 | "metadata": {},
500 | "source": [
501 | "## 2.3 Floating point"
502 | ]
503 | },
504 | {
505 | "cell_type": "code",
506 | "execution_count": 8,
507 | "metadata": {},
508 | "outputs": [
509 | {
510 | "data": {
600 | "text/plain": [
601 | " Type Description Code\n",
602 | "0 half e\n",
603 | "1 single compatible: C float f\n",
604 | "2 double compatible: C double \n",
605 | "3 float_ compatible: Python float d\n",
606 | "4 longfloat compatible: C long float g\n",
607 | "5 float16 16 bits \n",
608 | "6 float32 32 bits \n",
609 | "7 float64 64 bits \n",
610 | "8 float96 96 bits, platform? \n",
611 | "9 float128 128 bits, platform? "
612 | ]
613 | },
614 | "execution_count": 8,
615 | "metadata": {},
616 | "output_type": "execute_result"
617 | }
618 | ],
619 | "source": [
620 | "data = [['half',' ','e'],\n",
621 | "['single','compatible: C float','f'],\n",
622 | "['double','compatible: C double',''],\n",
623 | "['float_','compatible: Python float','d'],\n",
624 | "['longfloat','compatible: C long float','g'],\n",
625 | "['float16','16 bits',''],\n",
626 | "['float32','32 bits',''],\n",
627 | "['float64','64 bits',''], \n",
628 | "['float96','96 bits, platform?',''], \n",
629 | "['float128','128 bits, platform?','']]\n",
630 | "pd.DataFrame(data = data, columns = columns)"
631 | ]
632 | },
633 | {
634 | "cell_type": "markdown",
635 | "metadata": {},
636 | "source": [
637 | "## 2.4 Complex"
638 | ]
639 | },
640 | {
641 | "cell_type": "code",
642 | "execution_count": 9,
643 | "metadata": {},
644 | "outputs": [
645 | {
646 | "data": {
718 | "text/plain": [
719 | " Type Description Code\n",
720 | "0 csingle F\n",
721 | "1 complex_ compatible: Python complex D\n",
722 | "2 clongfloat G\n",
723 | "3 complex64 two 32-bit floats \n",
724 | "4 complex128 two 64-bit floats \n",
725 | "5 complex192 two 96-bit floats, platform? \n",
726 | "6 complex256 two 128-bit floats, platform? "
727 | ]
728 | },
729 | "execution_count": 9,
730 | "metadata": {},
731 | "output_type": "execute_result"
732 | }
733 | ],
734 | "source": [
735 | "data = [['csingle',' ','F'],\n",
736 | "['complex_','compatible: Python complex','D'],\n",
737 | "['clongfloat',' ','G'],\n",
738 | "['complex64','two 32-bit floats',''], \n",
739 | "['complex128','two 64-bit floats',''], \n",
740 | "['complex192','two 96-bit floats, platform?',''], \n",
741 | "['complex256','two 128-bit floats, platform?','']]\n",
742 | "pd.DataFrame(data = data, columns = columns)"
743 | ]
744 | },
745 | {
746 | "cell_type": "markdown",
747 | "metadata": {},
748 | "source": [
749 | "## 2.5 Arbitrary objects\n",
750 | "An object_ element is simply a reference to an arbitrary Python object (every Python class derives from `object`)."
751 | ]
752 | },
753 | {
754 | "cell_type": "code",
755 | "execution_count": 10,
756 | "metadata": {},
757 | "outputs": [
758 | {
759 | "data": {
795 | "text/plain": [
796 | " Type Description Code\n",
797 | "0 object_ any Python object O"
798 | ]
799 | },
800 | "execution_count": 10,
801 | "metadata": {},
802 | "output_type": "execute_result"
803 | }
804 | ],
805 | "source": [
806 | "data = [['object_','any Python object','O']]\n",
807 | "pd.DataFrame(data = data, columns = columns)"
808 | ]
809 | },
810 | {
811 | "cell_type": "code",
812 | "execution_count": null,
813 | "metadata": {},
814 | "outputs": [],
815 | "source": []
816 | },
817 | {
818 | "cell_type": "markdown",
819 | "metadata": {},
820 | "source": [
821 | "# 3. Useful Intermediate Types"
822 | ]
823 | },
824 | {
825 | "cell_type": "markdown",
826 | "metadata": {},
827 | "source": [
828 | "## 3.1 .str\n",
829 | "This intermediate type lets you treat an object_ Series as strings and offers many string-processing functions. A later chapter covers it in detail."
830 | ]
831 | },
832 | {
833 | "cell_type": "code",
834 | "execution_count": 11,
835 | "metadata": {},
836 | "outputs": [
837 | {
838 | "data": {
839 | "text/plain": [
840 | "0 a_b\n",
841 | "1 b_c\n",
842 | "2 c_d\n",
843 | "dtype: object"
844 | ]
845 | },
846 | "execution_count": 11,
847 | "metadata": {},
848 | "output_type": "execute_result"
849 | }
850 | ],
851 | "source": [
852 | "s = pd.Series(['a_b','b_c','c_d'],dtype = 'object')\n",
853 | "s"
854 | ]
855 | },
856 | {
857 | "cell_type": "code",
858 | "execution_count": 12,
859 | "metadata": {},
860 | "outputs": [
861 | {
862 | "data": {
906 | "text/plain": [
907 | " 0 1\n",
908 | "0 a b\n",
909 | "1 b c\n",
910 | "2 c d"
911 | ]
912 | },
913 | "execution_count": 12,
914 | "metadata": {},
915 | "output_type": "execute_result"
916 | }
917 | ],
918 | "source": [
919 | "s.str.split('_',expand = True)"
920 | ]
921 | },
922 | {
923 | "cell_type": "markdown",
924 | "metadata": {},
925 | "source": [
926 | "## 3.2 .cat\n",
927 | "This intermediate type is dedicated to categorical data, a kind of feature that comes up often in machine learning. A later chapter covers it."
928 | ]
929 | },
930 | {
931 | "cell_type": "code",
932 | "execution_count": 13,
933 | "metadata": {},
934 | "outputs": [
935 | {
936 | "data": {
937 | "text/plain": [
938 | "0 1\n",
939 | "1 2\n",
940 | "2 3\n",
941 | "dtype: category\n",
942 | "Categories (3, int64): [1, 2, 3]"
943 | ]
944 | },
945 | "execution_count": 13,
946 | "metadata": {},
947 | "output_type": "execute_result"
948 | }
949 | ],
950 | "source": [
951 | "s = pd.Series( [1,2,3], dtype='category')\n",
952 | "s"
953 | ]
954 | },
955 | {
956 | "cell_type": "code",
957 | "execution_count": 14,
958 | "metadata": {},
959 | "outputs": [
960 | {
961 | "data": {
962 | "text/plain": [
963 | "Int64Index([1, 2, 3], dtype='int64')"
964 | ]
965 | },
966 | "execution_count": 14,
967 | "metadata": {},
968 | "output_type": "execute_result"
969 | }
970 | ],
971 | "source": [
972 | "s.cat.categories"
973 | ]
974 | },
975 | {
976 | "cell_type": "markdown",
977 | "metadata": {},
978 | "source": [
979 | "## 3.3 .dt\n",
980 | "This intermediate type is dedicated to datetime Series and is used in time-series analysis."
981 | ]
982 | },
983 | {
984 | "cell_type": "code",
985 | "execution_count": 15,
986 | "metadata": {},
987 | "outputs": [
988 | {
989 | "data": {
990 | "text/plain": [
991 | "0 2017-08-01\n",
992 | "1 2017-08-03\n",
993 | "2 2017-08-03\n",
994 | "dtype: datetime64[ns]"
995 | ]
996 | },
997 | "execution_count": 15,
998 | "metadata": {},
999 | "output_type": "execute_result"
1000 | }
1001 | ],
1002 | "source": [
1003 | "s = pd.Series(['2017-08-01','2017-08-03','2017-08-03'], dtype = 'datetime64[ns]')\n",
1004 | "s"
1005 | ]
1006 | },
1007 | {
1008 | "cell_type": "code",
1009 | "execution_count": 16,
1010 | "metadata": {
1011 | "scrolled": false
1012 | },
1013 | "outputs": [
1014 | {
1015 | "data": {
1016 | "text/plain": [
1017 | "0 2017\n",
1018 | "1 2017\n",
1019 | "2 2017\n",
1020 | "dtype: int64"
1021 | ]
1022 | },
1023 | "execution_count": 16,
1024 | "metadata": {},
1025 | "output_type": "execute_result"
1026 | }
1027 | ],
1028 | "source": [
1029 | "s.dt.year"
1030 | ]
1031 | },
1032 | {
1033 | "cell_type": "code",
1034 | "execution_count": null,
1035 | "metadata": {},
1036 | "outputs": [],
1037 | "source": []
1038 | }
1039 | ],
1040 | "metadata": {
1041 | "kernelspec": {
1042 | "display_name": "Python 3",
1043 | "language": "python",
1044 | "name": "python3"
1045 | },
1046 | "language_info": {
1047 | "codemirror_mode": {
1048 | "name": "ipython",
1049 | "version": 3
1050 | },
1051 | "file_extension": ".py",
1052 | "mimetype": "text/x-python",
1053 | "name": "python",
1054 | "nbconvert_exporter": "python",
1055 | "pygments_lexer": "ipython3",
1056 | "version": "3.7.0"
1057 | },
1058 | "toc": {
1059 | "base_numbering": 1,
1060 | "nav_menu": {
1061 | "height": "318px",
1062 | "width": "252px"
1063 | },
1064 | "number_sections": false,
1065 | "sideBar": true,
1066 | "skip_h1_title": false,
1067 | "title_cell": "Table of Contents",
1068 | "title_sidebar": "Contents",
1069 | "toc_cell": false,
1070 | "toc_position": {
1071 | "height": "calc(100% - 180px)",
1072 | "left": "10px",
1073 | "top": "150px",
1074 | "width": "256px"
1075 | },
1076 | "toc_section_display": "block",
1077 | "toc_window_display": true
1078 | }
1079 | },
1080 | "nbformat": 4,
1081 | "nbformat_minor": 2
1082 | }
1083 |
--------------------------------------------------------------------------------
/7. 显示控制.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
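7 | "# Display Control\n",
8 | "Now that we know how to build the basic data structures, we will hold off on complex functions and operations. It is more important to first learn how to make the data display the way we want."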
9 | ]
10 | },
11 | {
12 | "cell_type": "code",
13 | "execution_count": 1,
14 | "metadata": {},
15 | "outputs": [],
16 | "source": [
17 | "__author__ = 'zhenhang.sun@gmail.com'"
18 | ]
19 | },
20 | {
21 | "cell_type": "code",
22 | "execution_count": 2,
23 | "metadata": {},
24 | "outputs": [
25 | {
26 | "data": {
27 | "text/plain": [
28 | "'D:\\\\github\\\\pandas-tutorial'"
29 | ]
30 | },
31 | "execution_count": 2,
32 | "metadata": {},
33 | "output_type": "execute_result"
34 | }
35 | ],
36 | "source": [
37 | "pwd"
38 | ]
39 | },
40 | {
41 | "cell_type": "code",
42 | "execution_count": 3,
43 | "metadata": {},
44 | "outputs": [],
45 | "source": [
46 | "import numpy as np\n",
47 | "import pandas as pd"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": null,
53 | "metadata": {},
54 | "outputs": [],
55 | "source": []
56 | },
57 | {
58 | "cell_type": "markdown",
59 | "metadata": {},
60 | "source": [
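61 | "# 1. Function Reference\n",
62 | "- get/set/reset_option look up and modify a key in pandas' in-memory option registry, so changes last only for the current session;\n",
63 | "- keys are given as strings and may be regular-expression patterns, but a pattern that matches more than one option raises an error, so it is best to spell out the full key."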
64 | ]
65 | },
66 | {
67 | "cell_type": "markdown",
68 | "metadata": {},
69 | "source": [
70 | "## 1.1 Querying an option\n",
71 | "##### `pd.get_option(key)`\n",
72 | "- key: the option key to look up"
73 | ]
74 | },
75 | {
76 | "cell_type": "markdown",
77 | "metadata": {},
78 | "source": [
79 | "## 1.2 Setting an option\n",
80 | "##### `pd.set_option(key, value)`\n",
81 | "- key: the option key to set\n",
82 | "- value: the new value (an int for the options covered in this chapter)"
83 | ]
84 | },
85 | {
86 | "cell_type": "markdown",
87 | "metadata": {},
88 | "source": [
89 | "## 1.3 Restoring an option to its default\n",
90 | "##### `pd.reset_option(key)`\n",
91 | "- key: the option key to reset"
92 | ]
93 | },
94 | {
95 | "cell_type": "code",
96 | "execution_count": null,
97 | "metadata": {},
98 | "outputs": [],
99 | "source": []
100 | },
101 | {
102 | "cell_type": "markdown",
103 | "metadata": {},
104 | "source": [
105 | "# 2. Important Options\n",
106 | "In practice only a few get_option and set_option keys are useful to us, mainly these three:\n",
107 | "- 'display.max_rows'\n",
108 | "- 'display.max_columns'\n",
109 | "- 'display.max_colwidth'"
110 | ]
111 | },
112 | {
113 | "cell_type": "markdown",
114 | "metadata": {},
115 | "source": [
116 | "## 2.1 max_rows\n",
117 | "Controls the maximum number of rows that are displayed."
118 | ]
119 | },
120 | {
121 | "cell_type": "code",
122 | "execution_count": 4,
123 | "metadata": {},
124 | "outputs": [
125 | {
126 | "data": {
127 | "text/plain": [
128 | "60"
129 | ]
130 | },
131 | "execution_count": 4,
132 | "metadata": {},
133 | "output_type": "execute_result"
134 | }
135 | ],
136 | "source": [
137 | "pd.get_option('display.max_rows') # 60 means at most 60 rows are shown; with more than 60 rows, part of the output is elided."
138 | ]
139 | },
140 | {
141 | "cell_type": "code",
142 | "execution_count": 5,
143 | "metadata": {
144 | "scrolled": true
145 | },
146 | "outputs": [
147 | {
148 | "data": {
149 | "text/plain": [
150 | "0 NaN\n",
151 | "1 NaN\n",
152 | "2 NaN\n",
153 | "3 NaN\n",
154 | "4 NaN\n",
155 | "5 NaN\n",
156 | "6 NaN\n",
157 | "7 NaN\n",
158 | "8 NaN\n",
159 | "9 NaN\n",
160 | "10 NaN\n",
161 | "11 NaN\n",
162 | "12 NaN\n",
163 | "13 NaN\n",
164 | "14 NaN\n",
165 | "15 NaN\n",
166 | "16 NaN\n",
167 | "17 NaN\n",
168 | "18 NaN\n",
169 | "19 NaN\n",
170 | "20 NaN\n",
171 | "21 NaN\n",
172 | "22 NaN\n",
173 | "23 NaN\n",
174 | "24 NaN\n",
175 | "25 NaN\n",
176 | "26 NaN\n",
177 | "27 NaN\n",
178 | "28 NaN\n",
179 | "29 NaN\n",
180 | "30 NaN\n",
181 | "31 NaN\n",
182 | "32 NaN\n",
183 | "33 NaN\n",
184 | "34 NaN\n",
185 | "35 NaN\n",
186 | "36 NaN\n",
187 | "37 NaN\n",
188 | "38 NaN\n",
189 | "39 NaN\n",
190 | "40 NaN\n",
191 | "41 NaN\n",
192 | "42 NaN\n",
193 | "43 NaN\n",
194 | "44 NaN\n",
195 | "45 NaN\n",
196 | "46 NaN\n",
197 | "47 NaN\n",
198 | "48 NaN\n",
199 | "49 NaN\n",
200 | "50 NaN\n",
201 | "51 NaN\n",
202 | "52 NaN\n",
203 | "53 NaN\n",
204 | "54 NaN\n",
205 | "55 NaN\n",
206 | "56 NaN\n",
207 | "57 NaN\n",
208 | "58 NaN\n",
209 | "59 NaN\n",
210 | "dtype: float64"
211 | ]
212 | },
213 | "execution_count": 5,
214 | "metadata": {},
215 | "output_type": "execute_result"
216 | }
217 | ],
218 | "source": [
219 | "# pd.set_option('display.max_rows', 61)\n",
220 | "pd.Series( index = range(0, 60), dtype = 'float64' ) # Try changing 60 to 61, then uncomment the line above and try again."
221 | ]
222 | },
223 | {
224 | "cell_type": "markdown",
225 | "metadata": {},
226 | "source": [
227 | "## 2.2 max_columns\n",
228 | "Controls the maximum number of columns that are displayed; this option matters mostly for DataFrames."
229 | ]
230 | },
231 | {
232 | "cell_type": "code",
233 | "execution_count": 6,
234 | "metadata": {},
235 | "outputs": [
236 | {
237 | "data": {
238 | "text/plain": [
239 | "20"
240 | ]
241 | },
242 | "execution_count": 6,
243 | "metadata": {},
244 | "output_type": "execute_result"
245 | }
246 | ],
247 | "source": [
248 | "pd.get_option('display.max_columns')"
249 | ]
250 | },
251 | {
252 | "cell_type": "code",
253 | "execution_count": 7,
254 | "metadata": {},
255 | "outputs": [
256 | {
257 | "data": {
304 | "text/plain": [
305 | "Empty DataFrame\n",
306 | "Columns: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]\n",
307 | "Index: []"
308 | ]
309 | },
310 | "execution_count": 7,
311 | "metadata": {},
312 | "output_type": "execute_result"
313 | }
314 | ],
315 | "source": [
316 | "# pd.set_option('display.max_columns', 21)\n",
317 | "pd.DataFrame(columns=range(0, 20))  # Try 20 and then 21 columns; then uncomment the line above and try again."
318 | ]
319 | },
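The column limit above can be sketched the same way. This example is an illustration under assumptions: it uses a one-row frame with six made-up columns (so that the plain-text repr has cells to elide) and a small `display.max_columns` value so the output stays narrow.

```python
import pandas as pd

# One row and six columns; the values are arbitrary placeholders.
df = pd.DataFrame([[0, 1, 2, 3, 4, 5]], columns=list("abcdef"))

with pd.option_context("display.max_columns", 4):
    truncated = repr(df)  # six columns > 4, so the middle columns are elided

with pd.option_context("display.max_columns", 6):
    full = repr(df)  # all six columns fit

print(truncated)
print(full)
```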
320 | {
321 | "cell_type": "markdown",
322 | "metadata": {},
323 | "source": [
324 | "## 1.3 max_colwidth\n",
325 | "Controls the maximum number of characters displayed in each cell."
326 | ]
327 | },
328 | {
329 | "cell_type": "code",
330 | "execution_count": 8,
331 | "metadata": {},
332 | "outputs": [
333 | {
334 | "data": {
335 | "text/plain": [
336 | "50"
337 | ]
338 | },
339 | "execution_count": 8,
340 | "metadata": {},
341 | "output_type": "execute_result"
342 | }
343 | ],
344 | "source": [
345 | "pd.get_option('display.max_colwidth')"
346 | ]
347 | },
348 | {
349 | "cell_type": "code",
350 | "execution_count": 9,
351 | "metadata": {},
352 | "outputs": [
353 | {
354 | "data": {
355 | "text/plain": [
356 | "1 1111111111111111111111111111111111111111111111...\n",
357 | "dtype: object"
358 | ]
359 | },
360 | "execution_count": 9,
361 | "metadata": {},
362 | "output_type": "execute_result"
363 | }
364 | ],
365 | "source": [
366 | "# pd.set_option('display.max_colwidth', 51)\n",
367 | "s = '1' * 50  # Change 50 to 49 and rerun; then uncomment the line above and try again.\n",
368 | "pd.Series(data=[s], index=['1'])  # In practice, simply set max_colwidth generously."
369 | ]
370 | },
371 | {
372 | "cell_type": "markdown",
373 | "metadata": {},
374 | "source": [
375 | "## 1.4 precision\n",
376 | "Controls the number of decimal places shown for floating-point values. It does not affect the stored precision."
377 | ]
378 | },
379 | {
380 | "cell_type": "code",
381 | "execution_count": 10,
382 | "metadata": {},
383 | "outputs": [
384 | {
385 | "data": {
386 | "text/plain": [
387 | "6"
388 | ]
389 | },
390 | "execution_count": 10,
391 | "metadata": {},
392 | "output_type": "execute_result"
393 | }
394 | ],
395 | "source": [
396 | "pd.get_option('display.precision')"
397 | ]
398 | },
399 | {
400 | "cell_type": "code",
401 | "execution_count": 11,
402 | "metadata": {},
403 | "outputs": [
404 | {
405 | "data": {
406 | "text/plain": [
407 | "0 1.010101\n",
408 | "dtype: float64"
409 | ]
410 | },
411 | "execution_count": 11,
412 | "metadata": {},
413 | "output_type": "execute_result"
414 | }
415 | ],
416 | "source": [
417 | "# pd.set_option('display.precision', 10)\n",
418 | "a = 1.01010101\n",
419 | "s = pd.Series(data=[a])\n",
420 | "s"
421 | ]
422 | },
423 | {
424 | "cell_type": "code",
425 | "execution_count": 12,
426 | "metadata": {},
427 | "outputs": [
428 | {
429 | "data": {
430 | "text/plain": [
431 | "1.01010101"
432 | ]
433 | },
434 | "execution_count": 12,
435 | "metadata": {},
436 | "output_type": "execute_result"
437 | }
438 | ],
439 | "source": [
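440 | "s[0]  # The Series display is truncated, but the stored value keeps its full precision. Uncomment the set_option line in the previous cell and try again."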
441 | ]
442 | },
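The point made above, that `display.precision` changes only what is printed, can be checked in one place. A minimal sketch, reusing the value from the cells above and an arbitrary precision of 3:

```python
import pandas as pd

a = 1.01010101
s = pd.Series([a])

with pd.option_context("display.precision", 3):
    shown = repr(s)  # formatted with at most 3 decimal places

print(shown)
print(s.iloc[0] == a)  # the stored value is untouched
```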
443 | {
444 | "cell_type": "markdown",
445 | "metadata": {},
446 | "source": [
447 | "## 1.5 colheader_justify\n",
448 | "Controls how DataFrame column headers are aligned: left or right."
449 | ]
450 | },
451 | {
452 | "cell_type": "code",
453 | "execution_count": 13,
454 | "metadata": {},
455 | "outputs": [
456 | {
457 | "data": {
458 | "text/plain": [
459 | "'right'"
460 | ]
461 | },
462 | "execution_count": 13,
463 | "metadata": {},
464 | "output_type": "execute_result"
465 | }
466 | ],
467 | "source": [
468 | "pd.get_option('display.colheader_justify')"
469 | ]
470 | },
471 | {
472 | "cell_type": "code",
473 | "execution_count": 14,
474 | "metadata": {},
475 | "outputs": [
476 | {
477 | "data": {
509 | "text/plain": [
510 | " a\n",
511 | "0 000000000000000"
512 | ]
513 | },
514 | "execution_count": 14,
515 | "metadata": {},
516 | "output_type": "execute_result"
517 | }
518 | ],
519 | "source": [
520 | "# pd.set_option('display.colheader_justify', 'left')  # This setting seems to have no effect in my notebook; I am not sure why.\n",
521 | "pd.DataFrame( data = ['000000000000000'], columns = ['a'])"
522 | ]
523 | },
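A hedged aside on the cell above: one plausible reason the option looks ineffective (an assumption, not verified against the notebook used here) is that the notebook shows the HTML rendering, which applies its own header styling. The plain-text rendering does read the option, as this sketch with `DataFrame.to_string()` suggests.

```python
import pandas as pd

df = pd.DataFrame(data=["000000000000000"], columns=["a"])

with pd.option_context("display.colheader_justify", "left"):
    left = df.to_string()  # header "a" sits at the left edge of its column

with pd.option_context("display.colheader_justify", "right"):
    right = df.to_string()  # header "a" sits at the right edge of its column

print(left)
print(right)
```

Printing the frame with `print(df)` goes through the same plain-text path, so it is a quick way to check whether the option took effect.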
524 | {
525 | "cell_type": "markdown",
526 | "metadata": {},
527 | "source": [
528 | "## 1.6. More options: see the official documentation\n",
529 | "https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.set_option.html"
530 | ]
531 | },
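Two helpers complement `get_option` and `set_option` when exploring the options listed in the official documentation: `pd.reset_option` and `pd.describe_option`. Neither appears in the original cells; this is a short sketch of how they might be used.

```python
import pandas as pd

pd.set_option("display.max_rows", 100)  # change an option globally
changed = pd.get_option("display.max_rows")

pd.reset_option("display.max_rows")  # restore the built-in default
restored = pd.get_option("display.max_rows")

print(changed, restored)

# Print the built-in documentation for one option.
pd.describe_option("display.precision")
```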
532 | {
533 | "cell_type": "code",
534 | "execution_count": null,
535 | "metadata": {},
536 | "outputs": [],
537 | "source": []
538 | }
539 | ],
540 | "metadata": {
541 | "kernelspec": {
542 | "display_name": "Python 3",
543 | "language": "python",
544 | "name": "python3"
545 | },
546 | "language_info": {
547 | "codemirror_mode": {
548 | "name": "ipython",
549 | "version": 3
550 | },
551 | "file_extension": ".py",
552 | "mimetype": "text/x-python",
553 | "name": "python",
554 | "nbconvert_exporter": "python",
555 | "pygments_lexer": "ipython3",
556 | "version": "3.7.0"
557 | },
558 | "toc": {
559 | "base_numbering": 1,
560 | "nav_menu": {
561 | "height": "228px",
562 | "width": "252px"
563 | },
564 | "number_sections": false,
565 | "sideBar": true,
566 | "skip_h1_title": false,
567 | "title_cell": "Table of Contents",
568 | "title_sidebar": "Contents",
569 | "toc_cell": false,
570 | "toc_position": {
571 | "height": "485px",
572 | "left": "0px",
573 | "right": "1068px",
574 | "top": "66px",
575 | "width": "212px"
576 | },
577 | "toc_section_display": "block",
578 | "toc_window_display": true
579 | }
580 | },
581 | "nbformat": 4,
582 | "nbformat_minor": 2
583 | }
584 |
--------------------------------------------------------------------------------
/8. 快速查看整体信息.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
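7 | "# Getting a Quick Overview\n",
8 | "The previous chapter covered options that control how a DataFrame is displayed. This chapter shows how to quickly get an overall picture of a DataFrame."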
9 | ]
10 | },
11 | {
12 | "cell_type": "code",
13 | "execution_count": 1,
14 | "metadata": {},
15 | "outputs": [],
16 | "source": [
17 | "__author__ = 'zhenhang.sun@gmail.com'"
18 | ]
19 | },
20 | {
21 | "cell_type": "code",
22 | "execution_count": 2,
23 | "metadata": {},
24 | "outputs": [],
25 | "source": [
26 | "pwd"
27 | ]
28 | },
29 | {
30 | "cell_type": "code",
31 | "execution_count": 3,
32 | "metadata": {},
33 | "outputs": [],
34 | "source": [
35 | "import numpy as np\n",
36 | "import pandas as pd"
37 | ]
38 | },
39 | {
40 | "cell_type": "code",
41 | "execution_count": null,
42 | "metadata": {},
43 | "outputs": [],
44 | "source": []
45 | },
46 | {
47 | "cell_type": "markdown",
48 | "metadata": {},
49 | "source": [
50 | "# 1. info()\n",
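51 | "A quick summary of several things at once: the number of rows and columns, each column's dtype and non-null count, and total memory usage. It was originally available on DataFrame only; `Series.info()` was added in pandas 1.4."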
52 | ]
53 | },
54 | {
55 | "cell_type": "markdown",
56 | "metadata": {},
57 | "source": [
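58 | "##### `DataFrame.info(verbose=None, memory_usage=None, show_counts=None)`\n",
59 | "- verbose: True or False. Literally 'wordy': when the DataFrame has many columns, whether to print information for every column; if False, the per-column details are left out.\n",
60 | "- memory_usage: True, False, or 'deep'. Whether to report the DataFrame's memory usage; by default it follows the `display.memory_usage` option, which is True.\n",
61 | "- show_counts: True or False. Whether to show the non-null count of each column. Older pandas releases called this parameter `null_counts`; it was renamed to `show_counts`, and the old name no longer works in pandas 2.x."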
62 | ]
63 | },
64 | {
65 | "cell_type": "code",
66 | "execution_count": 4,
67 | "metadata": {},
68 | "outputs": [
69 | {
70 | "data": {
107 | "text/plain": [
108 | "Empty DataFrame\n",
109 | "Columns: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]\n",
110 | "Index: []"
111 | ]
112 | },
113 | "execution_count": 4,
114 | "metadata": {},
115 | "output_type": "execute_result"
116 | }
117 | ],
118 | "source": [
119 | "df = pd.DataFrame(columns = range(0,10))\n",
120 | "df"
121 | ]
122 | },
123 | {
124 | "cell_type": "code",
125 | "execution_count": 5,
126 | "metadata": {
127 | "scrolled": false
128 | },
129 | "outputs": [
130 | {
131 | "name": "stdout",
132 | "output_type": "stream",
133 | "text": [
134 | "<class 'pandas.core.frame.DataFrame'>\n",
135 | "Index: 0 entries\n",
136 | "Data columns (total 10 columns):\n",
137 | "0 0 non-null object\n",
138 | "1 0 non-null object\n",
139 | "2 0 non-null object\n",
140 | "3 0 non-null object\n",
141 | "4 0 non-null object\n",
142 | "5 0 non-null object\n",
143 | "6 0 non-null object\n",
144 | "7 0 non-null object\n",
145 | "8 0 non-null object\n",
146 | "9 0 non-null object\n",
147 | "dtypes: object(10)\n",
148 | "memory usage: 0.0+ bytes\n"
149 | ]
150 | }
151 | ],
152 | "source": [
153 | "df.info()  # The default settings are usually enough."
154 | ]
155 | },
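The parameters above can be tried on a frame that actually contains data. This is a sketch under assumptions: the two-column frame is made up for illustration, and the keyword `show_counts` assumes pandas 1.2 or later. The `buf` argument captures the report as text instead of printing it.

```python
import io

import numpy as np
import pandas as pd

# Small illustrative frame: a float column with a NaN and an object column with a None.
df = pd.DataFrame({"A": [np.nan, 3.0], "B": pd.Series(["x", None], dtype=object)})

buffer = io.StringIO()
# show_counts is the parameter name in pandas 1.2 and later;
# memory_usage="deep" also measures the strings held by object columns.
df.info(buf=buffer, show_counts=True, memory_usage="deep")
summary = buffer.getvalue()

print(summary)
```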
156 | {
157 | "cell_type": "code",
158 | "execution_count": null,
159 | "metadata": {},
160 | "outputs": [],
161 | "source": []
162 | },
163 | {
164 | "cell_type": "markdown",
165 | "metadata": {},
166 | "source": [
167 | "# 2. ndim, shape, size\n",
168 | "Inspect the number of dimensions, the shape, and the number of elements."
169 | ]
170 | },
171 | {
172 | "cell_type": "code",
173 | "execution_count": 6,
174 | "metadata": {},
175 | "outputs": [
176 | {
177 | "data": {
216 | "text/plain": [
217 | " A B\n",
218 | "0 NaN 2.0\n",
219 | "1 3.0 NaN"
220 | ]
221 | },
222 | "execution_count": 6,
223 | "metadata": {},
224 | "output_type": "execute_result"
225 | }
226 | ],
227 | "source": [
228 | "df = pd.DataFrame([[np.nan, 2], [3, np.nan]], columns=['A','B'])\n",
229 | "df"
230 | ]
231 | },
232 | {
233 | "cell_type": "code",
234 | "execution_count": 7,
235 | "metadata": {},
236 | "outputs": [
237 | {
238 | "data": {
239 | "text/plain": [
240 | "2"
241 | ]
242 | },
243 | "execution_count": 7,
244 | "metadata": {},
245 | "output_type": "execute_result"
246 | }
247 | ],
248 | "source": [
249 | "df.ndim  # Number of dimensions: 1 for a Series, 2 for a DataFrame. Rarely needed, but occasionally useful inside a loop."
250 | ]
251 | },
252 | {
253 | "cell_type": "code",
254 | "execution_count": 8,
255 | "metadata": {},
256 | "outputs": [
257 | {
258 | "data": {
259 | "text/plain": [
260 | "(2, 2)"
261 | ]
262 | },
263 | "execution_count": 8,
264 | "metadata": {},
265 | "output_type": "execute_result"
266 | }
267 | ],
268 | "source": [
269 | "df.shape  # (number of rows, number of columns)"
270 | ]
271 | },
272 | {
273 | "cell_type": "code",
274 | "execution_count": 9,
275 | "metadata": {},
276 | "outputs": [
277 | {
278 | "data": {
279 | "text/plain": [
280 | "4"
281 | ]
282 | },
283 | "execution_count": 9,
284 | "metadata": {},
285 | "output_type": "execute_result"
286 | }
287 | ],
288 | "source": [
289 | "df.size  # Number of elements: rows × cols"
290 | ]
291 | },
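The three attributes above behave the same way on a Series and a DataFrame, which is what makes `ndim` handy in a loop over mixed objects. A brief sketch; the three-element Series is an illustrative addition.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
df = pd.DataFrame([[np.nan, 2], [3, np.nan]], columns=["A", "B"])

for obj in (s, df):
    # ndim distinguishes the two container types without isinstance checks.
    kind = "Series" if obj.ndim == 1 else "DataFrame"
    print(kind, obj.ndim, obj.shape, obj.size)
```

Note that `size` counts every element, including NaN.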
292 | {
293 | "cell_type": "code",
294 | "execution_count": null,
295 | "metadata": {},
296 | "outputs": [],
297 | "source": []
298 | },
299 | {
300 | "cell_type": "markdown",
301 | "metadata": {},
302 | "source": [
303 | "# 3. head(), tail()\n",
304 | "By default, show the first 5 rows and the last 5 rows, respectively."
305 | ]
306 | },
307 | {
308 | "cell_type": "markdown",
309 | "metadata": {},
310 | "source": [
311 | "##### `Series/DataFrame.head(n=5)`\n",
312 | "##### `Series/DataFrame.tail(n=5)`"
313 | ]
314 | },
315 | {
316 | "cell_type": "code",
317 | "execution_count": 10,
318 | "metadata": {
319 | "scrolled": true
320 | },
321 | "outputs": [
322 | {
323 | "data": {
324 | "text/plain": [
325 | "0 0\n",
326 | "1 1\n",
327 | "2 2\n",
328 | "3 3\n",
329 | "4 4\n",
330 | "dtype: int64"
331 | ]
332 | },
333 | "execution_count": 10,
334 | "metadata": {},
335 | "output_type": "execute_result"
336 | }
337 | ],
338 | "source": [
339 | "s = pd.Series(range(0,5))\n",
340 | "s"
341 | ]
342 | },
343 | {
344 | "cell_type": "code",
345 | "execution_count": 11,
346 | "metadata": {},
347 | "outputs": [
348 | {
349 | "data": {
350 | "text/plain": [
351 | "0 0\n",
352 | "1 1\n",
353 | "2 2\n",
354 | "dtype: int64"
355 | ]
356 | },
357 | "execution_count": 11,
358 | "metadata": {},
359 | "output_type": "execute_result"
360 | }
361 | ],
362 | "source": [
363 | "s.head(3)"
364 | ]
365 | },
366 | {
367 | "cell_type": "code",
368 | "execution_count": 12,
369 | "metadata": {},
370 | "outputs": [
371 | {
372 | "data": {
373 | "text/plain": [
374 | "2 2\n",
375 | "3 3\n",
376 | "4 4\n",
377 | "dtype: int64"
378 | ]
379 | },
380 | "execution_count": 12,
381 | "metadata": {},
382 | "output_type": "execute_result"
383 | }
384 | ],
385 | "source": [
386 | "s.tail(3)"
387 | ]
388 | },
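Beyond the positive values of `n` shown above, both methods also accept a negative `n`, which excludes rows rather than selecting them. A short sketch using the same Series:

```python
import pandas as pd

s = pd.Series(range(0, 5))

all_but_last_two = s.head(-2)   # values 0, 1, 2
all_but_first_two = s.tail(-2)  # values 2, 3, 4

print(all_but_last_two.tolist())
print(all_but_first_two.tolist())
```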
389 | {
390 | "cell_type": "code",
391 | "execution_count": null,
392 | "metadata": {},
393 | "outputs": [],
394 | "source": []
395 | },
396 | {
397 | "cell_type": "markdown",
398 | "metadata": {},
399 | "source": [
400 | "# 4. memory_usage()\n",
401 | "Gives finer control than the memory figure in info(). The unit is **bytes**."
402 | ]
403 | },
404 | {
405 | "cell_type": "markdown",
406 | "metadata": {},
407 | "source": [
408 | "##### `Series/DataFrame.memory_usage(index=True, deep=False)`\n",
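409 | "- index: whether to include the memory used by the index; the index takes up memory too.\n",
410 | "- deep: whether to measure the objects that object-dtype columns refer to. Each element of an object column is only a reference, so deep=False counts just the references, while deep=True also counts the memory of the referenced Python objects."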
411 | ]
412 | },
413 | {
414 | "cell_type": "code",
415 | "execution_count": 13,
416 | "metadata": {},
417 | "outputs": [
418 | {
419 | "data": {
420 | "text/plain": [
421 | "Index 80\n",
422 | "A 16\n",
423 | "B 16\n",
424 | "dtype: int64"
425 | ]
426 | },
427 | "execution_count": 13,
428 | "metadata": {},
429 | "output_type": "execute_result"
430 | }
431 | ],
432 | "source": [
433 | "df.memory_usage(deep=False)  # The Index row is the memory used by the index."
434 | ]
435 | },
436 | {
437 | "cell_type": "code",
438 | "execution_count": 14,
439 | "metadata": {},
440 | "outputs": [
441 | {
442 | "data": {
443 | "text/plain": [
444 | "Index 80\n",
445 | "A 16\n",
446 | "B 16\n",
447 | "dtype: int64"
448 | ]
449 | },
450 | "execution_count": 14,
451 | "metadata": {},
452 | "output_type": "execute_result"
453 | }
454 | ],
455 | "source": [
456 | "df.memory_usage(deep=True)  # object columns would report more memory here; both columns of this df are float64, so the result is unchanged."
457 | ]
458 | },
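The two calls above return identical numbers because that frame holds only float64 columns. To make the effect of `deep` visible, the sketch below adds an object column; the column names and string lengths are arbitrary, and `dtype=object` is set explicitly as an assumption so that the column stores references to Python strings.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "num": [1.0, 2.0],
        # dtype=object is set explicitly so each element is a reference to a Python str.
        "text": pd.Series(["a" * 100, "b" * 100], dtype=object),
    }
)

shallow = df.memory_usage(deep=False)  # counts only the references for "text"
deep = df.memory_usage(deep=True)      # also counts the strings themselves

print(shallow)
print(deep)
```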
459 | {
460 | "cell_type": "code",
461 | "execution_count": null,
462 | "metadata": {},
463 | "outputs": [],
464 | "source": []
465 | },
466 | {
467 | "cell_type": "markdown",
468 | "metadata": {},
469 | "source": [
470 | "# 5. describe()\n",
471 | "Quickly view summary statistics for each column. NaN elements are excluded by default."
472 | ]
473 | },
474 | {
475 | "cell_type": "markdown",
476 | "metadata": {},
477 | "source": [
478 | "##### `DataFrame.describe(include=[np.number])`\n",
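479 | "- include: 'all', or a list of dtypes such as [np.number] or [object]. np.number computes numeric statistics for the numeric columns only; object computes categorical, string-like statistics for the object columns only."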
480 | ]
481 | },
482 | {
483 | "cell_type": "code",
484 | "execution_count": 15,
485 | "metadata": {
486 | "scrolled": false
487 | },
488 | "outputs": [
489 | {
490 | "data": {
534 | "text/plain": [
535 | " numeric object\n",
536 | "0 1 a\n",
537 | "1 2 b\n",
538 | "2 1 b"
539 | ]
540 | },
541 | "execution_count": 15,
542 | "metadata": {},
543 | "output_type": "execute_result"
544 | }
545 | ],
546 | "source": [
547 | "df = pd.DataFrame( [[1,'a'],[2,'b'],[1,'b']], columns=['numeric','object'])\n",
548 | "df"
549 | ]
550 | },
551 | {
552 | "cell_type": "code",
553 | "execution_count": 16,
554 | "metadata": {},
555 | "outputs": [
556 | {
557 | "data": {
558 | "text/plain": [
559 | "numeric int64\n",
560 | "object object\n",
561 | "dtype: object"
562 | ]
563 | },
564 | "execution_count": 16,
565 | "metadata": {},
566 | "output_type": "execute_result"
567 | }
568 | ],
569 | "source": [
570 | "df.dtypes"
571 | ]
572 | },
573 | {
574 | "cell_type": "code",
575 | "execution_count": 17,
576 | "metadata": {},
577 | "outputs": [
578 | {
579 | "data": {
639 | "text/plain": [
640 | " numeric\n",
641 | "count 3.000000\n",
642 | "mean 1.333333\n",
643 | "std 0.577350\n",
644 | "min 1.000000\n",
645 | "25% 1.000000\n",
646 | "50% 1.000000\n",
647 | "75% 1.500000\n",
648 | "max 2.000000"
649 | ]
650 | },
651 | "execution_count": 17,
652 | "metadata": {},
653 | "output_type": "execute_result"
654 | }
655 | ],
656 | "source": [
657 | "df.describe()  # By default, only numeric columns are summarized."
658 | ]
659 | },
660 | {
661 | "cell_type": "code",
662 | "execution_count": 18,
663 | "metadata": {},
664 | "outputs": [
665 | {
666 | "data": {
710 | "text/plain": [
711 | " object\n",
712 | "count 3\n",
713 | "unique 2\n",
714 | "top b\n",
715 | "freq 2"
716 | ]
717 | },
718 | "execution_count": 18,
719 | "metadata": {},
720 | "output_type": "execute_result"
721 | }
722 | ],
723 | "source": [
724 | "df.describe(include=[object])  # Summarize object columns only, treating them as categories; just these four statistics are reported."
725 | ]
726 | },
727 | {
728 | "cell_type": "code",
729 | "execution_count": 19,
730 | "metadata": {
731 | "scrolled": true
732 | },
733 | "outputs": [
734 | {
735 | "data": {
819 | "text/plain": [
820 | " numeric object\n",
821 | "count 3.000000 3\n",
822 | "unique NaN 2\n",
823 | "top NaN b\n",
824 | "freq NaN 2\n",
825 | "mean 1.333333 NaN\n",
826 | "std 0.577350 NaN\n",
827 | "min 1.000000 NaN\n",
828 | "25% 1.000000 NaN\n",
829 | "50% 1.000000 NaN\n",
830 | "75% 1.500000 NaN\n",
831 | "max 2.000000 NaN"
832 | ]
833 | },
834 | "execution_count": 19,
835 | "metadata": {},
836 | "output_type": "execute_result"
837 | }
838 | ],
839 | "source": [
840 | "df.describe(include='all')  # The only statistic shared by numeric and object columns is count: the number of non-NaN elements."
841 | ]
842 | },
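The `include` choices above can be combined with two other `describe()` parameters, `percentiles` and `exclude`, which the original cells do not cover. The sketch below rebuilds the same three-row frame; setting `dtype=object` explicitly is an assumption made so that the example does not depend on how a given pandas version infers string columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "numeric": [1, 2, 1],
        "object": pd.Series(["a", "b", "b"], dtype=object),
    }
)

numeric_stats = df.describe(percentiles=[0.1, 0.9])  # custom percentiles
object_stats = df.describe(include=[object])         # count, unique, top, freq
without_numbers = df.describe(exclude=[np.number])   # drop numeric columns instead

print(numeric_stats)
print(object_stats)
print(without_numbers)
```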
843 | {
844 | "cell_type": "code",
845 | "execution_count": null,
846 | "metadata": {},
847 | "outputs": [],
848 | "source": []
849 | }
850 | ],
851 | "metadata": {
852 | "kernelspec": {
853 | "display_name": "Python 3",
854 | "language": "python",
855 | "name": "python3"
856 | },
857 | "language_info": {
858 | "codemirror_mode": {
859 | "name": "ipython",
860 | "version": 3
861 | },
862 | "file_extension": ".py",
863 | "mimetype": "text/x-python",
864 | "name": "python",
865 | "nbconvert_exporter": "python",
866 | "pygments_lexer": "ipython3",
867 | "version": "3.7.0"
868 | },
869 | "toc": {
870 | "base_numbering": 1,
871 | "nav_menu": {
872 | "height": "141px",
873 | "width": "253px"
874 | },
875 | "number_sections": false,
876 | "sideBar": true,
877 | "skip_h1_title": false,
878 | "title_cell": "Table of Contents",
879 | "title_sidebar": "Contents",
880 | "toc_cell": false,
881 | "toc_position": {},
882 | "toc_section_display": "block",
883 | "toc_window_display": true
884 | }
885 | },
886 | "nbformat": 4,
887 | "nbformat_minor": 2
888 | }
889 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## Tutorial Contents
2 |
3 |
4 |
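5 | 0. Setting up the environment
6 | 1. Creating Series and DataFrame objects
7 | 2. Querying, modifying, adding to, and deleting from Series and DataFrame objects
8 | 3. merge in detail
9 | 4. Index objects: creating, querying, modifying, adding, deleting, and using them
10 | 5. Converting between ordinary columns and the row index
11 | 6. Overview of data structures
12 | 7. Display control
13 | 8. Getting a quick overview
14 | 9. Numeric operations
15 | 10. Numeric statistics
16 | 11. mask and comparison operations (to be completed)
17 | 12. The Category dtype and discretization
18 | 13. Working with time data
19 | 14. Working with the Object dtype
20 | 15. groupby in detail (to be completed)
21 | 16. resample in detail (to be completed)
22 | 17. ……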
23 |
24 |
25 |
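26 | ## About This Tutorial
27 |
28 | Data science is one of today's hottest professions, Python is the most widely used programming language in data science, and a major reason for Python's popularity is its powerful data science library: pandas.
29 |
30 | ## Why I Wrote This Tutorial
31 | However, even as a data science practitioner who has worked with pandas for a long time, I still often have to look things up in the official documentation, which seriously hurts my productivity. Sometimes I even have to fall back on writing explicit loops, which is very un-pandas. I could not put up with that, so I decided to do something about it.
32 |
33 | After reading through the official documentation several times, I believe the root causes are:
34 | - The official documentation is disorganized, and its knowledge framework is neither concise nor consistent;
35 | - It tries to cover everything, so high-value information is diluted for the sake of completeness;
36 | - The documentation is not always kept up to date, and API behavior sometimes differs from what it describes.
37 |
38 | I have also read through many pandas tutorials, both Chinese and international, but on the whole most of them only scratch the surface and are not practical enough. So I decided to write a pandas tutorial of my own, to improve my own skills and to help others avoid detours.
39 |
40 | ## Core Principles
41 | The core principles behind this tutorial are:
42 | - Knowledge structure and logic come first: information that is unorganized and unsystematic is effectively useless, because it is hard to remember and hard to apply;
43 | - A moderate level of granularity: neither superficial nor buried in detail;
44 | - Examples are short and focused (just enough to show the effect of an operation), so they are easy to type out for practice;
45 | - Every example is annotated with an explanation to aid understanding.
46 |
47 | ## Who This Tutorial Is For
48 | This tutorial covers material from beginner to advanced level, and suits both beginners and data science practitioners who want to build a systematic body of knowledge. I spent a great deal of time preparing it to make it useful and accurate, but mistakes and omissions are inevitable, and feedback from readers is very welcome.
49 |
50 | ## Zhihu Profile
51 | 花半楼:https://www.zhihu.com/people/HANGZS
52 |
53 | ## Add Me on WeChat to Get in Touch
54 | #### WeChat ID: 204078950
55 | #### Official account: 花半楼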
56 |
57 | 
58 |
59 |
--------------------------------------------------------------------------------
/pandas_tutorial.egg-info/PKG-INFO:
--------------------------------------------------------------------------------
1 | Metadata-Version: 2.1
2 | Name: pandas-tutorial
3 | Version: 0.1.0
4 | Summary: pandas tutorial
5 | Home-page: https://github.com/hangsz/pandas-tutorial
6 | Author: hangsz
7 | Author-email: zhenhang.sun@outlook.com
8 | Description-Content-Type: text/markdown
9 |
--------------------------------------------------------------------------------
/pandas_tutorial.egg-info/SOURCES.txt:
--------------------------------------------------------------------------------
1 | README.md
2 | setup.py
3 | pandas_tutorial.egg-info/PKG-INFO
4 | pandas_tutorial.egg-info/SOURCES.txt
5 | pandas_tutorial.egg-info/dependency_links.txt
6 | pandas_tutorial.egg-info/requires.txt
7 | pandas_tutorial.egg-info/top_level.txt
--------------------------------------------------------------------------------
/pandas_tutorial.egg-info/dependency_links.txt:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/pandas_tutorial.egg-info/requires.txt:
--------------------------------------------------------------------------------
1 | raft-python==0.1.3
2 | pandas==2.2.1
3 |
--------------------------------------------------------------------------------
/pandas_tutorial.egg-info/top_level.txt:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/resource/1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hangsz/pandas-tutorial/71f5868816e976421d9020a6d4c15404fd6907b7/resource/1.png
--------------------------------------------------------------------------------
/resource/2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hangsz/pandas-tutorial/71f5868816e976421d9020a6d4c15404fd6907b7/resource/2.png
--------------------------------------------------------------------------------
/resource/204078950.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hangsz/pandas-tutorial/71f5868816e976421d9020a6d4c15404fd6907b7/resource/204078950.png
--------------------------------------------------------------------------------
/resource/3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hangsz/pandas-tutorial/71f5868816e976421d9020a6d4c15404fd6907b7/resource/3.png
--------------------------------------------------------------------------------
/resource/pay.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hangsz/pandas-tutorial/71f5868816e976421d9020a6d4c15404fd6907b7/resource/pay.png
--------------------------------------------------------------------------------
/resource/我的.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/hangsz/pandas-tutorial/71f5868816e976421d9020a6d4c15404fd6907b7/resource/我的.png
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # coding: utf-8
3 |
4 | import setuptools
5 | from setuptools import setup
6 |
7 | with open('README.md', 'r', encoding='utf-8') as fh:
8 | long_description = fh.read()
9 |
10 | setup(
11 | name='pandas-tutorial',
12 | version='0.1.0',
13 | author='hangsz',
14 | author_email='zhenhang.sun@outlook.com',
15 | url='https://github.com/hangsz/pandas-tutorial',
16 | description=u'pandas tutorial',
17 | long_description=long_description,
18 | long_description_content_type="text/markdown",
19 | packages=setuptools.find_packages(),
20 | install_requires=[
21 | 'raft-python==0.1.3',
22 | 'pandas==2.2.1'
23 | ]
24 | )
--------------------------------------------------------------------------------
/tips.csv:
--------------------------------------------------------------------------------
1 | total_bill,tip,sex,smoker,day,time,size
2 | 16.99,1.01,Female,No,Sun,Dinner,2
3 | 10.34,1.66,Male,No,Sun,Dinner,3
4 | 21.01,3.5,Male,No,Sun,Dinner,3
5 | 23.68,3.31,Male,No,Sun,Dinner,2
6 | 24.59,3.61,Female,No,Sun,Dinner,4
7 | 25.29,4.71,Male,No,Sun,Dinner,4
8 | 8.77,2.0,Male,No,Sun,Dinner,2
9 | 26.88,3.12,Male,No,Sun,Dinner,4
10 | 15.04,1.96,Male,No,Sun,Dinner,2
11 | 14.78,3.23,Male,No,Sun,Dinner,2
12 | 10.27,1.71,Male,No,Sun,Dinner,2
13 | 35.26,5.0,Female,No,Sun,Dinner,4
14 | 15.42,1.57,Male,No,Sun,Dinner,2
15 | 18.43,3.0,Male,No,Sun,Dinner,4
16 | 14.83,3.02,Female,No,Sun,Dinner,2
17 | 21.58,3.92,Male,No,Sun,Dinner,2
18 | 10.33,1.67,Female,No,Sun,Dinner,3
19 | 16.29,3.71,Male,No,Sun,Dinner,3
20 | 16.97,3.5,Female,No,Sun,Dinner,3
21 | 20.65,3.35,Male,No,Sat,Dinner,3
22 | 17.92,4.08,Male,No,Sat,Dinner,2
23 | 20.29,2.75,Female,No,Sat,Dinner,2
24 | 15.77,2.23,Female,No,Sat,Dinner,2
25 | 39.42,7.58,Male,No,Sat,Dinner,4
26 | 19.82,3.18,Male,No,Sat,Dinner,2
27 | 17.81,2.34,Male,No,Sat,Dinner,4
28 | 13.37,2.0,Male,No,Sat,Dinner,2
29 | 12.69,2.0,Male,No,Sat,Dinner,2
30 | 21.7,4.3,Male,No,Sat,Dinner,2
31 | 19.65,3.0,Female,No,Sat,Dinner,2
32 | 9.55,1.45,Male,No,Sat,Dinner,2
33 | 18.35,2.5,Male,No,Sat,Dinner,4
34 | 15.06,3.0,Female,No,Sat,Dinner,2
35 | 20.69,2.45,Female,No,Sat,Dinner,4
36 | 17.78,3.27,Male,No,Sat,Dinner,2
37 | 24.06,3.6,Male,No,Sat,Dinner,3
38 | 16.31,2.0,Male,No,Sat,Dinner,3
39 | 16.93,3.07,Female,No,Sat,Dinner,3
40 | 18.69,2.31,Male,No,Sat,Dinner,3
41 | 31.27,5.0,Male,No,Sat,Dinner,3
42 | 16.04,2.24,Male,No,Sat,Dinner,3
43 | 17.46,2.54,Male,No,Sun,Dinner,2
44 | 13.94,3.06,Male,No,Sun,Dinner,2
45 | 9.68,1.32,Male,No,Sun,Dinner,2
46 | 30.4,5.6,Male,No,Sun,Dinner,4
47 | 18.29,3.0,Male,No,Sun,Dinner,2
48 | 22.23,5.0,Male,No,Sun,Dinner,2
49 | 32.4,6.0,Male,No,Sun,Dinner,4
50 | 28.55,2.05,Male,No,Sun,Dinner,3
51 | 18.04,3.0,Male,No,Sun,Dinner,2
52 | 12.54,2.5,Male,No,Sun,Dinner,2
53 | 10.29,2.6,Female,No,Sun,Dinner,2
54 | 34.81,5.2,Female,No,Sun,Dinner,4
55 | 9.94,1.56,Male,No,Sun,Dinner,2
56 | 25.56,4.34,Male,No,Sun,Dinner,4
57 | 19.49,3.51,Male,No,Sun,Dinner,2
58 | 38.01,3.0,Male,Yes,Sat,Dinner,4
59 | 26.41,1.5,Female,No,Sat,Dinner,2
60 | 11.24,1.76,Male,Yes,Sat,Dinner,2
61 | 48.27,6.73,Male,No,Sat,Dinner,4
62 | 20.29,3.21,Male,Yes,Sat,Dinner,2
63 | 13.81,2.0,Male,Yes,Sat,Dinner,2
64 | 11.02,1.98,Male,Yes,Sat,Dinner,2
65 | 18.29,3.76,Male,Yes,Sat,Dinner,4
66 | 17.59,2.64,Male,No,Sat,Dinner,3
67 | 20.08,3.15,Male,No,Sat,Dinner,3
68 | 16.45,2.47,Female,No,Sat,Dinner,2
69 | 3.07,1.0,Female,Yes,Sat,Dinner,1
70 | 20.23,2.01,Male,No,Sat,Dinner,2
71 | 15.01,2.09,Male,Yes,Sat,Dinner,2
72 | 12.02,1.97,Male,No,Sat,Dinner,2
73 | 17.07,3.0,Female,No,Sat,Dinner,3
74 | 26.86,3.14,Female,Yes,Sat,Dinner,2
75 | 25.28,5.0,Female,Yes,Sat,Dinner,2
76 | 14.73,2.2,Female,No,Sat,Dinner,2
77 | 10.51,1.25,Male,No,Sat,Dinner,2
78 | 17.92,3.08,Male,Yes,Sat,Dinner,2
79 | 27.2,4.0,Male,No,Thur,Lunch,4
80 | 22.76,3.0,Male,No,Thur,Lunch,2
81 | 17.29,2.71,Male,No,Thur,Lunch,2
82 | 19.44,3.0,Male,Yes,Thur,Lunch,2
83 | 16.66,3.4,Male,No,Thur,Lunch,2
84 | 10.07,1.83,Female,No,Thur,Lunch,1
85 | 32.68,5.0,Male,Yes,Thur,Lunch,2
86 | 15.98,2.03,Male,No,Thur,Lunch,2
87 | 34.83,5.17,Female,No,Thur,Lunch,4
88 | 13.03,2.0,Male,No,Thur,Lunch,2
89 | 18.28,4.0,Male,No,Thur,Lunch,2
90 | 24.71,5.85,Male,No,Thur,Lunch,2
91 | 21.16,3.0,Male,No,Thur,Lunch,2
92 | 28.97,3.0,Male,Yes,Fri,Dinner,2
93 | 22.49,3.5,Male,No,Fri,Dinner,2
94 | 5.75,1.0,Female,Yes,Fri,Dinner,2
95 | 16.32,4.3,Female,Yes,Fri,Dinner,2
96 | 22.75,3.25,Female,No,Fri,Dinner,2
97 | 40.17,4.73,Male,Yes,Fri,Dinner,4
98 | 27.28,4.0,Male,Yes,Fri,Dinner,2
99 | 12.03,1.5,Male,Yes,Fri,Dinner,2
100 | 21.01,3.0,Male,Yes,Fri,Dinner,2
101 | 12.46,1.5,Male,No,Fri,Dinner,2
102 | 11.35,2.5,Female,Yes,Fri,Dinner,2
103 | 15.38,3.0,Female,Yes,Fri,Dinner,2
104 | 44.3,2.5,Female,Yes,Sat,Dinner,3
105 | 22.42,3.48,Female,Yes,Sat,Dinner,2
106 | 20.92,4.08,Female,No,Sat,Dinner,2
107 | 15.36,1.64,Male,Yes,Sat,Dinner,2
108 | 20.49,4.06,Male,Yes,Sat,Dinner,2
109 | 25.21,4.29,Male,Yes,Sat,Dinner,2
110 | 18.24,3.76,Male,No,Sat,Dinner,2
111 | 14.31,4.0,Female,Yes,Sat,Dinner,2
112 | 14.0,3.0,Male,No,Sat,Dinner,2
113 | 7.25,1.0,Female,No,Sat,Dinner,1
114 | 38.07,4.0,Male,No,Sun,Dinner,3
115 | 23.95,2.55,Male,No,Sun,Dinner,2
116 | 25.71,4.0,Female,No,Sun,Dinner,3
117 | 17.31,3.5,Female,No,Sun,Dinner,2
118 | 29.93,5.07,Male,No,Sun,Dinner,4
119 | 10.65,1.5,Female,No,Thur,Lunch,2
120 | 12.43,1.8,Female,No,Thur,Lunch,2
121 | 24.08,2.92,Female,No,Thur,Lunch,4
122 | 11.69,2.31,Male,No,Thur,Lunch,2
123 | 13.42,1.68,Female,No,Thur,Lunch,2
124 | 14.26,2.5,Male,No,Thur,Lunch,2
125 | 15.95,2.0,Male,No,Thur,Lunch,2
126 | 12.48,2.52,Female,No,Thur,Lunch,2
127 | 29.8,4.2,Female,No,Thur,Lunch,6
128 | 8.52,1.48,Male,No,Thur,Lunch,2
129 | 14.52,2.0,Female,No,Thur,Lunch,2
130 | 11.38,2.0,Female,No,Thur,Lunch,2
131 | 22.82,2.18,Male,No,Thur,Lunch,3
132 | 19.08,1.5,Male,No,Thur,Lunch,2
133 | 20.27,2.83,Female,No,Thur,Lunch,2
134 | 11.17,1.5,Female,No,Thur,Lunch,2
135 | 12.26,2.0,Female,No,Thur,Lunch,2
136 | 18.26,3.25,Female,No,Thur,Lunch,2
137 | 8.51,1.25,Female,No,Thur,Lunch,2
138 | 10.33,2.0,Female,No,Thur,Lunch,2
139 | 14.15,2.0,Female,No,Thur,Lunch,2
140 | 16.0,2.0,Male,Yes,Thur,Lunch,2
141 | 13.16,2.75,Female,No,Thur,Lunch,2
142 | 17.47,3.5,Female,No,Thur,Lunch,2
143 | 34.3,6.7,Male,No,Thur,Lunch,6
144 | 41.19,5.0,Male,No,Thur,Lunch,5
145 | 27.05,5.0,Female,No,Thur,Lunch,6
146 | 16.43,2.3,Female,No,Thur,Lunch,2
147 | 8.35,1.5,Female,No,Thur,Lunch,2
148 | 18.64,1.36,Female,No,Thur,Lunch,3
149 | 11.87,1.63,Female,No,Thur,Lunch,2
150 | 9.78,1.73,Male,No,Thur,Lunch,2
151 | 7.51,2.0,Male,No,Thur,Lunch,2
152 | 14.07,2.5,Male,No,Sun,Dinner,2
153 | 13.13,2.0,Male,No,Sun,Dinner,2
154 | 17.26,2.74,Male,No,Sun,Dinner,3
155 | 24.55,2.0,Male,No,Sun,Dinner,4
156 | 19.77,2.0,Male,No,Sun,Dinner,4
157 | 29.85,5.14,Female,No,Sun,Dinner,5
158 | 48.17,5.0,Male,No,Sun,Dinner,6
159 | 25.0,3.75,Female,No,Sun,Dinner,4
160 | 13.39,2.61,Female,No,Sun,Dinner,2
161 | 16.49,2.0,Male,No,Sun,Dinner,4
162 | 21.5,3.5,Male,No,Sun,Dinner,4
163 | 12.66,2.5,Male,No,Sun,Dinner,2
164 | 16.21,2.0,Female,No,Sun,Dinner,3
165 | 13.81,2.0,Male,No,Sun,Dinner,2
166 | 17.51,3.0,Female,Yes,Sun,Dinner,2
167 | 24.52,3.48,Male,No,Sun,Dinner,3
168 | 20.76,2.24,Male,No,Sun,Dinner,2
169 | 31.71,4.5,Male,No,Sun,Dinner,4
170 | 10.59,1.61,Female,Yes,Sat,Dinner,2
171 | 10.63,2.0,Female,Yes,Sat,Dinner,2
172 | 50.81,10.0,Male,Yes,Sat,Dinner,3
173 | 15.81,3.16,Male,Yes,Sat,Dinner,2
174 | 7.25,5.15,Male,Yes,Sun,Dinner,2
175 | 31.85,3.18,Male,Yes,Sun,Dinner,2
176 | 16.82,4.0,Male,Yes,Sun,Dinner,2
177 | 32.9,3.11,Male,Yes,Sun,Dinner,2
178 | 17.89,2.0,Male,Yes,Sun,Dinner,2
179 | 14.48,2.0,Male,Yes,Sun,Dinner,2
180 | 9.6,4.0,Female,Yes,Sun,Dinner,2
181 | 34.63,3.55,Male,Yes,Sun,Dinner,2
182 | 34.65,3.68,Male,Yes,Sun,Dinner,4
183 | 23.33,5.65,Male,Yes,Sun,Dinner,2
184 | 45.35,3.5,Male,Yes,Sun,Dinner,3
185 | 23.17,6.5,Male,Yes,Sun,Dinner,4
186 | 40.55,3.0,Male,Yes,Sun,Dinner,2
187 | 20.69,5.0,Male,No,Sun,Dinner,5
188 | 20.9,3.5,Female,Yes,Sun,Dinner,3
189 | 30.46,2.0,Male,Yes,Sun,Dinner,5
190 | 18.15,3.5,Female,Yes,Sun,Dinner,3
191 | 23.1,4.0,Male,Yes,Sun,Dinner,3
192 | 15.69,1.5,Male,Yes,Sun,Dinner,2
193 | 19.81,4.19,Female,Yes,Thur,Lunch,2
194 | 28.44,2.56,Male,Yes,Thur,Lunch,2
195 | 15.48,2.02,Male,Yes,Thur,Lunch,2
196 | 16.58,4.0,Male,Yes,Thur,Lunch,2
197 | 7.56,1.44,Male,No,Thur,Lunch,2
198 | 10.34,2.0,Male,Yes,Thur,Lunch,2
199 | 43.11,5.0,Female,Yes,Thur,Lunch,4
200 | 13.0,2.0,Female,Yes,Thur,Lunch,2
201 | 13.51,2.0,Male,Yes,Thur,Lunch,2
202 | 18.71,4.0,Male,Yes,Thur,Lunch,3
203 | 12.74,2.01,Female,Yes,Thur,Lunch,2
204 | 13.0,2.0,Female,Yes,Thur,Lunch,2
205 | 16.4,2.5,Female,Yes,Thur,Lunch,2
206 | 20.53,4.0,Male,Yes,Thur,Lunch,4
207 | 16.47,3.23,Female,Yes,Thur,Lunch,3
208 | 26.59,3.41,Male,Yes,Sat,Dinner,3
209 | 38.73,3.0,Male,Yes,Sat,Dinner,4
210 | 24.27,2.03,Male,Yes,Sat,Dinner,2
211 | 12.76,2.23,Female,Yes,Sat,Dinner,2
212 | 30.06,2.0,Male,Yes,Sat,Dinner,3
213 | 25.89,5.16,Male,Yes,Sat,Dinner,4
214 | 48.33,9.0,Male,No,Sat,Dinner,4
215 | 13.27,2.5,Female,Yes,Sat,Dinner,2
216 | 28.17,6.5,Female,Yes,Sat,Dinner,3
217 | 12.9,1.1,Female,Yes,Sat,Dinner,2
218 | 28.15,3.0,Male,Yes,Sat,Dinner,5
219 | 11.59,1.5,Male,Yes,Sat,Dinner,2
220 | 7.74,1.44,Male,Yes,Sat,Dinner,2
221 | 30.14,3.09,Female,Yes,Sat,Dinner,4
222 | 12.16,2.2,Male,Yes,Fri,Lunch,2
223 | 13.42,3.48,Female,Yes,Fri,Lunch,2
224 | 8.58,1.92,Male,Yes,Fri,Lunch,1
225 | 15.98,3.0,Female,No,Fri,Lunch,3
226 | 13.42,1.58,Male,Yes,Fri,Lunch,2
227 | 16.27,2.5,Female,Yes,Fri,Lunch,2
228 | 10.09,2.0,Female,Yes,Fri,Lunch,2
229 | 20.45,3.0,Male,No,Sat,Dinner,4
230 | 13.28,2.72,Male,No,Sat,Dinner,2
231 | 22.12,2.88,Female,Yes,Sat,Dinner,2
232 | 24.01,2.0,Male,Yes,Sat,Dinner,4
233 | 15.69,3.0,Male,Yes,Sat,Dinner,3
234 | 11.61,3.39,Male,No,Sat,Dinner,2
235 | 10.77,1.47,Male,No,Sat,Dinner,2
236 | 15.53,3.0,Male,Yes,Sat,Dinner,2
237 | 10.07,1.25,Male,No,Sat,Dinner,2
238 | 12.6,1.0,Male,Yes,Sat,Dinner,2
239 | 32.83,1.17,Male,Yes,Sat,Dinner,2
240 | 35.83,4.67,Female,No,Sat,Dinner,3
241 | 29.03,5.92,Male,No,Sat,Dinner,3
242 | 27.18,2.0,Female,Yes,Sat,Dinner,2
243 | 22.67,2.0,Male,Yes,Sat,Dinner,2
244 | 17.82,1.75,Male,No,Sat,Dinner,2
245 | 18.78,3.0,Female,No,Thur,Dinner,2
246 |
--------------------------------------------------------------------------------