├── .gitignore ├── 0. 配置环境.ipynb ├── 1. Series和DataFrame对象的创建.ipynb ├── 10. 数值统计运算.ipynb ├── 12. Category型与离散化.ipynb ├── 13. 时间型操作.ipynb ├── 14. Object型操作.ipynb ├── 2. Series和DataFrame对象的查、改、增、删.ipynb ├── 3. merge详解.ipynb ├── 4. Index对象的创建,查、改、增、删和使用.ipynb ├── 5. 普通列和行index的相互转化.ipynb ├── 6. 数据结构总览.ipynb ├── 7. 显示控制.ipynb ├── 8. 快速查看整体信息.ipynb ├── 9. 数值运算.ipynb ├── README.md ├── pandas_tutorial.egg-info ├── PKG-INFO ├── SOURCES.txt ├── dependency_links.txt ├── requires.txt └── top_level.txt ├── resource ├── 1.png ├── 2.png ├── 204078950.png ├── 3.png ├── pay.png └── 我的.png ├── setup.py └── tips.csv /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints 2 | 3 | venv 4 | dist -------------------------------------------------------------------------------- /0. 配置环境.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 配置环境\n", 8 | "\n", 9 | "因为jupyter notebook是运行在浏览器中,所以什么IDE都不需要,只需要安装好python和一些需要的库。" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "# 0. python和需要的库安装\n", 17 | "- 看我之前写的文章:https://zhuanlan.zhihu.com/p/27335822" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": null, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [] 26 | }, 27 | { 28 | "cell_type": "markdown", 29 | "metadata": { 30 | "tags": [] 31 | }, 32 | "source": [ 33 | "# 1. 
jupyter notebook" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": { 39 | "tags": [] 40 | }, 41 | "source": [ 42 | "## 1.1 jupyter及其拓展安装" 43 | ] 44 | }, 45 | { 46 | "cell_type": "markdown", 47 | "metadata": {}, 48 | "source": [ 49 | "- 输入:`pip install jupyter notebook`\n", 50 | "- 输入:`pip install jupyter_contrib_nbextensions `\n", 51 | "- 输入:`jupyter contrib nbextension install --user`" 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "## 1.2 配置自动生成目录" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "1. 输入:`jupyter notebook`, 浏览器会自动打开;\n", 66 | "\n", 67 | "2. 在打开的界面进入 `Nbextensions` 选项卡;\n", 68 | "![](resource/1.png)\n", 69 | "\n", 70 | "3. 在`Table of Contents (2)`下,点击 **Enable**,启动自动生成目录选项;\n", 71 | "![](resource/2.png)\n", 72 | "\n", 73 | "4. 改变这一项的默认 4 为 3。\n", 74 | "![](resource/3.png)" 75 | ] 76 | }, 77 | { 78 | "cell_type": "code", 79 | "execution_count": null, 80 | "metadata": {}, 81 | "outputs": [], 82 | "source": [] 83 | }, 84 | { 85 | "cell_type": "markdown", 86 | "metadata": {}, 87 | "source": [ 88 | "# 2. jupter lab\n", 89 | "\n", 90 | "jupterlab是更新一代的jupter,我个人比较喜欢用这个" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "metadata": {}, 96 | "source": [ 97 | "- 输入:`pip install jupyterlab`\n", 98 | "- 输入:`jupyter lab` , 浏览器会自动打开 " 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": null, 104 | "metadata": {}, 105 | "outputs": [], 106 | "source": [] 107 | }, 108 | { 109 | "cell_type": "markdown", 110 | "metadata": {}, 111 | "source": [ 112 | "# 3. 
几个快捷键" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "处于命令模式,也即鼠标点击cell左侧,文本框变灰时\n", 120 | "- `a` 在当前cell上面增加一个cell\n", 121 | "- `b` 在当前cell下面增加一个cell\n", 122 | "- `dd` 删除当前cell\n", 123 | "- `按shift enter` 运行当前cell" 124 | ] 125 | } 126 | ], 127 | "metadata": { 128 | "kernelspec": { 129 | "display_name": "Python 3 (ipykernel)", 130 | "language": "python", 131 | "name": "python3" 132 | }, 133 | "language_info": { 134 | "codemirror_mode": { 135 | "name": "ipython", 136 | "version": 3 137 | }, 138 | "file_extension": ".py", 139 | "mimetype": "text/x-python", 140 | "name": "python", 141 | "nbconvert_exporter": "python", 142 | "pygments_lexer": "ipython3", 143 | "version": "3.10.4" 144 | }, 145 | "toc": { 146 | "colors": { 147 | "hover_highlight": "#DAA520", 148 | "navigate_num": "#000000", 149 | "navigate_text": "#333333", 150 | "running_highlight": "#FF0000", 151 | "selected_highlight": "#FFD700", 152 | "sidebar_border": "#EEEEEE", 153 | "wrapper_background": "#FFFFFF" 154 | }, 155 | "moveMenuLeft": true, 156 | "nav_menu": { 157 | "height": "84px", 158 | "width": "252px" 159 | }, 160 | "navigate_menu": true, 161 | "number_sections": false, 162 | "sideBar": true, 163 | "threshold": "3", 164 | "toc_cell": false, 165 | "toc_section_display": "block", 166 | "toc_window_display": true, 167 | "widenNotebook": false 168 | } 169 | }, 170 | "nbformat": 4, 171 | "nbformat_minor": 4 172 | } 173 | -------------------------------------------------------------------------------- /1. 
Series和DataFrame对象的创建.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Series和DataFrame对象的创建\n", 8 | "pandas中的核心对象是Series和DataFrame,这一节主要介绍如何创建这两种对象。" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "metadata": {}, 15 | "outputs": [], 16 | "source": [ 17 | "__auther__ = 'zhenhang.sun@gmail.com'" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "metadata": { 24 | "ExecuteTime": { 25 | "end_time": "2017-09-25T14:15:51.389871Z", 26 | "start_time": "2017-09-25T14:15:51.202730Z" 27 | } 28 | }, 29 | "outputs": [ 30 | { 31 | "data": { 32 | "text/plain": [ 33 | "'D:\\\\github\\\\pandas-tutorial'" 34 | ] 35 | }, 36 | "execution_count": 2, 37 | "metadata": {}, 38 | "output_type": "execute_result" 39 | } 40 | ], 41 | "source": [ 42 | "pwd" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": 3, 48 | "metadata": { 49 | "ExecuteTime": { 50 | "end_time": "2017-09-25T14:16:03.547342Z", 51 | "start_time": "2017-09-25T14:15:51.394374Z" 52 | } 53 | }, 54 | "outputs": [], 55 | "source": [ 56 | "import numpy as np\n", 57 | "import pandas as pd" 58 | ] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "# 1. 
Series" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": { 70 | "collapsed": true 71 | }, 72 | "source": [ 73 | "Series是pandas中暴露给我们使用的基本对象,它是由相同元素类型构成的一维数据结构,同时具有列表和字典的属性,字典的属性由索引赋予。\n", 74 | "\n", 75 | " Series:有序,有索引\n", 76 | " list: 有序,无索引\n", 77 | " dict: 无序,有索引" 78 | ] 79 | }, 80 | { 81 | "cell_type": "markdown", 82 | "metadata": {}, 83 | "source": [ 84 | "## 1.1 预览" 85 | ] 86 | }, 87 | { 88 | "cell_type": "code", 89 | "execution_count": 4, 90 | "metadata": { 91 | "ExecuteTime": { 92 | "end_time": "2017-09-25T14:16:04.076194Z", 93 | "start_time": "2017-09-25T14:16:03.551337Z" 94 | } 95 | }, 96 | "outputs": [ 97 | { 98 | "data": { 99 | "text/plain": [ 100 | "a 1\n", 101 | "b 2\n", 102 | "c 3\n", 103 | "Name: sss, dtype: int64" 104 | ] 105 | }, 106 | "execution_count": 4, 107 | "metadata": {}, 108 | "output_type": "execute_result" 109 | } 110 | ], 111 | "source": [ 112 | "data = [1,2,3]\n", 113 | "index = ['a','b','c']\n", 114 | "s = pd.Series(data=data, index=index, name='sss')\n", 115 | "s" 116 | ] 117 | }, 118 | { 119 | "cell_type": "code", 120 | "execution_count": 5, 121 | "metadata": { 122 | "ExecuteTime": { 123 | "end_time": "2017-09-25T14:16:04.089203Z", 124 | "start_time": "2017-09-25T14:16:04.082197Z" 125 | } 126 | }, 127 | "outputs": [ 128 | { 129 | "data": { 130 | "text/plain": [ 131 | "Index(['a', 'b', 'c'], dtype='object')" 132 | ] 133 | }, 134 | "execution_count": 5, 135 | "metadata": {}, 136 | "output_type": "execute_result" 137 | } 138 | ], 139 | "source": [ 140 | "s.index # 四个属性之一:索引" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 6, 146 | "metadata": { 147 | "ExecuteTime": { 148 | "end_time": "2017-09-25T14:16:04.323077Z", 149 | "start_time": "2017-09-25T14:16:04.091205Z" 150 | } 151 | }, 152 | "outputs": [ 153 | { 154 | "data": { 155 | "text/plain": [ 156 | "'sss'" 157 | ] 158 | }, 159 | "execution_count": 6, 160 | "metadata": {}, 161 | "output_type": "execute_result" 162 | } 163 | ], 164 | 
"source": [ 165 | "s.name # 四个属性之二:名字," 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": 7, 171 | "metadata": { 172 | "ExecuteTime": { 173 | "end_time": "2017-09-25T14:16:04.639450Z", 174 | "start_time": "2017-09-25T14:16:04.327080Z" 175 | } 176 | }, 177 | "outputs": [ 178 | { 179 | "data": { 180 | "text/plain": [ 181 | "array([1, 2, 3], dtype=int64)" 182 | ] 183 | }, 184 | "execution_count": 7, 185 | "metadata": {}, 186 | "output_type": "execute_result" 187 | } 188 | ], 189 | "source": [ 190 | "s.values # 四个属性之三:值" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": 8, 196 | "metadata": { 197 | "ExecuteTime": { 198 | "end_time": "2017-09-25T14:16:04.822963Z", 199 | "start_time": "2017-09-25T14:16:04.641434Z" 200 | } 201 | }, 202 | "outputs": [ 203 | { 204 | "data": { 205 | "text/plain": [ 206 | "dtype('int64')" 207 | ] 208 | }, 209 | "execution_count": 8, 210 | "metadata": {}, 211 | "output_type": "execute_result" 212 | } 213 | ], 214 | "source": [ 215 | "s.dtype # 四个属性之四:元素类型" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "## 1.2 创建" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "##### `pd.Series(data=None, index=None, name=None)`\n", 230 | "- data:多种类型,见下面具体介绍;\n", 231 | "- index:索引信息;\n", 232 | "- name:对data的说明,用的不多,一般在和DataFrame、Index互相转换时才需要。" 233 | ] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "metadata": {}, 238 | "source": [ 239 | "### 1.2.1 data无索引\n", 240 | "- 如果 data 为 **ndarray(1D) 或 list(1D)**,那么其缺少 Series 需要的索引信息;\n", 241 | "- 如果提供 index,则必须和data长度相同;\n", 242 | "- 如果不提供 index,那么其将生成默认数值索引 range(0, data.shape[0])。" 243 | ] 244 | }, 245 | { 246 | "cell_type": "code", 247 | "execution_count": 9, 248 | "metadata": { 249 | "ExecuteTime": { 250 | "end_time": "2017-09-25T14:16:05.175335Z", 251 | "start_time": "2017-09-25T14:16:04.825968Z" 252 | } 253 | }, 254 | "outputs": [ 255 | { 256 
| "data": { 257 | "text/plain": [ 258 | "a 1\n", 259 | "b 2\n", 260 | "c 3\n", 261 | "dtype: int32" 262 | ] 263 | }, 264 | "execution_count": 9, 265 | "metadata": {}, 266 | "output_type": "execute_result" 267 | } 268 | ], 269 | "source": [ 270 | "# data = [1,2,3]\n", 271 | "data1 = np.array([1,2,3])\n", 272 | "index1 = ['a','b','c']\n", 273 | "s = pd.Series(data=data1, index=index1)\n", 274 | "s" 275 | ] 276 | }, 277 | { 278 | "cell_type": "markdown", 279 | "metadata": {}, 280 | "source": [] 281 | }, 282 | { 283 | "cell_type": "markdown", 284 | "metadata": {}, 285 | "source": [ 286 | "### 1.2.2 data有索引\n", 287 | " - 如果 data 为 **Series 或 dict** ,那么其已经提供了 Series 需要的索引信息,所以 index 项是不需要提供的;\n", 288 | " - 如果额外提供了 index 项,那么其将对当前构建的Series进行 重索引(以当前index为准,如果旧索引没有,则值为nan)(等同于reindex操作)。" 289 | ] 290 | }, 291 | { 292 | "cell_type": "code", 293 | "execution_count": 10, 294 | "metadata": { 295 | "ExecuteTime": { 296 | "end_time": "2017-09-25T14:16:05.540456Z", 297 | "start_time": "2017-09-25T14:16:05.181344Z" 298 | } 299 | }, 300 | "outputs": [ 301 | { 302 | "data": { 303 | "text/plain": [ 304 | "a 1.0\n", 305 | "b 2.0\n", 306 | "d NaN\n", 307 | "dtype: float64" 308 | ] 309 | }, 310 | "execution_count": 10, 311 | "metadata": {}, 312 | "output_type": "execute_result" 313 | } 314 | ], 315 | "source": [ 316 | "# data = pd.Series([a,b,c], index = ['a','b','c'] )\n", 317 | "data2 = { 'a':1, 'b':2,'c':3 }\n", 318 | "index2 = ['a','b','d']\n", 319 | "s = pd.Series(data=data2, index=index2)\n", 320 | "s # 缺失项填充NaN(NaN:not a number为pandas缺失值标记)。" 321 | ] 322 | }, 323 | { 324 | "cell_type": "code", 325 | "execution_count": null, 326 | "metadata": {}, 327 | "outputs": [], 328 | "source": [] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": { 333 | "collapsed": true 334 | }, 335 | "source": [ 336 | "# 2. 
DataFrame" 337 | ] 338 | }, 339 | { 340 | "cell_type": "markdown", 341 | "metadata": { 342 | "collapsed": true 343 | }, 344 | "source": [ 345 | "DataFrame由具有共同索引的Series按列排列构成(2D),是使用最多的对象。" 346 | ] 347 | }, 348 | { 349 | "cell_type": "markdown", 350 | "metadata": {}, 351 | "source": [ 352 | "## 2.1 预览" 353 | ] 354 | }, 355 | { 356 | "cell_type": "code", 357 | "execution_count": 11, 358 | "metadata": { 359 | "ExecuteTime": { 360 | "end_time": "2017-09-25T14:16:06.079948Z", 361 | "start_time": "2017-09-25T14:16:05.549462Z" 362 | } 363 | }, 364 | "outputs": [ 365 | { 366 | "data": { 367 | "text/html": [ 368 | "
\n", 369 | "\n", 382 | "\n", 383 | " \n", 384 | " \n", 385 | " \n", 386 | " \n", 387 | " \n", 388 | " \n", 389 | " \n", 390 | " \n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | "
ABC
a123
b456
\n", 406 | "
" 407 | ], 408 | "text/plain": [ 409 | " A B C\n", 410 | "a 1 2 3\n", 411 | "b 4 5 6" 412 | ] 413 | }, 414 | "execution_count": 11, 415 | "metadata": {}, 416 | "output_type": "execute_result" 417 | } 418 | ], 419 | "source": [ 420 | "data = [[1,2,3],\n", 421 | " [4,5,6]]\n", 422 | "index = ['a','b']\n", 423 | "columns = ['A','B','C']\n", 424 | "df = pd.DataFrame(data=data, index = index, columns = columns)\n", 425 | "df" 426 | ] 427 | }, 428 | { 429 | "cell_type": "code", 430 | "execution_count": 12, 431 | "metadata": { 432 | "ExecuteTime": { 433 | "end_time": "2017-09-25T14:16:06.088969Z", 434 | "start_time": "2017-09-25T14:16:06.082952Z" 435 | } 436 | }, 437 | "outputs": [ 438 | { 439 | "data": { 440 | "text/plain": [ 441 | "Index(['a', 'b'], dtype='object')" 442 | ] 443 | }, 444 | "execution_count": 12, 445 | "metadata": {}, 446 | "output_type": "execute_result" 447 | } 448 | ], 449 | "source": [ 450 | "df.index # 行索引" 451 | ] 452 | }, 453 | { 454 | "cell_type": "code", 455 | "execution_count": 13, 456 | "metadata": { 457 | "ExecuteTime": { 458 | "end_time": "2017-09-25T14:16:06.299196Z", 459 | "start_time": "2017-09-25T14:16:06.091956Z" 460 | } 461 | }, 462 | "outputs": [ 463 | { 464 | "data": { 465 | "text/plain": [ 466 | "Index(['A', 'B', 'C'], dtype='object')" 467 | ] 468 | }, 469 | "execution_count": 13, 470 | "metadata": {}, 471 | "output_type": "execute_result" 472 | } 473 | ], 474 | "source": [ 475 | "df.columns # 列索引,由Series的name构成" 476 | ] 477 | }, 478 | { 479 | "cell_type": "code", 480 | "execution_count": 14, 481 | "metadata": { 482 | "ExecuteTime": { 483 | "end_time": "2017-09-25T14:16:06.454532Z", 484 | "start_time": "2017-09-25T14:16:06.308708Z" 485 | } 486 | }, 487 | "outputs": [ 488 | { 489 | "data": { 490 | "text/plain": [ 491 | "array([[1, 2, 3],\n", 492 | " [4, 5, 6]], dtype=int64)" 493 | ] 494 | }, 495 | "execution_count": 14, 496 | "metadata": {}, 497 | "output_type": "execute_result" 498 | } 499 | ], 500 | "source": [ 501 | "df.values " 
502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": 15, 507 | "metadata": { 508 | "ExecuteTime": { 509 | "end_time": "2017-09-25T14:16:06.696251Z", 510 | "start_time": "2017-09-25T14:16:06.456534Z" 511 | } 512 | }, 513 | "outputs": [ 514 | { 515 | "data": { 516 | "text/plain": [ 517 | "A int64\n", 518 | "B int64\n", 519 | "C int64\n", 520 | "dtype: object" 521 | ] 522 | }, 523 | "execution_count": 15, 524 | "metadata": {}, 525 | "output_type": "execute_result" 526 | } 527 | ], 528 | "source": [ 529 | "df.dtypes # 这里的dtype带s,查看每列元素类型" 530 | ] 531 | }, 532 | { 533 | "cell_type": "markdown", 534 | "metadata": {}, 535 | "source": [ 536 | "## 2.2 创建" 537 | ] 538 | }, 539 | { 540 | "cell_type": "markdown", 541 | "metadata": {}, 542 | "source": [ 543 | "##### `pd.DataFrame(data=None, index=None, columns=None)`\n", 544 | "函数有多个参数,对我们有用的主要是:`data`,`index`和`columns`三项" 545 | ] 546 | }, 547 | { 548 | "cell_type": "markdown", 549 | "metadata": {}, 550 | "source": [ 551 | "### 2.2.1 data无 行索引,无 列索引\n", 552 | "- 如果 data 为 **ndarray(2D) or list of list(2D)**,那么其缺少 DataFrame 需要的行、列索引信息;\n", 553 | "- 如果提供 index 或 columns 项,其必须和data的行 或 列长度相同;\n", 554 | "- 如果不提供 index 或 columns 项,那么其将默认生成数值索引range(0, data.shape[0])) 或 range(0, data.shape[1])。" 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": 16, 560 | "metadata": { 561 | "ExecuteTime": { 562 | "end_time": "2017-09-25T14:16:07.141602Z", 563 | "start_time": "2017-09-25T14:16:06.702255Z" 564 | } 565 | }, 566 | "outputs": [ 567 | { 568 | "data": { 569 | "text/html": [ 570 | "
\n", 571 | "\n", 584 | "\n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | "
ABC
a123
b456
\n", 608 | "
" 609 | ], 610 | "text/plain": [ 611 | " A B C\n", 612 | "a 1 2 3\n", 613 | "b 4 5 6" 614 | ] 615 | }, 616 | "execution_count": 16, 617 | "metadata": {}, 618 | "output_type": "execute_result" 619 | } 620 | ], 621 | "source": [ 622 | "# data = [[1,2,3],\n", 623 | "# [4,5,6]]\n", 624 | "data1 = np.array([[1,2,3],\n", 625 | " [4,5,6]] )\n", 626 | "index1 = ['a','b']\n", 627 | "columns1 = ['A','B','C']\n", 628 | "df = pd.DataFrame(data=data1, index=index1, columns=columns1)\n", 629 | "df" 630 | ] 631 | }, 632 | { 633 | "cell_type": "markdown", 634 | "metadata": {}, 635 | "source": [ 636 | "### 2.2.2 data无 行索引,有 列索引\n", 637 | " - 如果data为 **dict of (ndarray(1D) or list(1D))**,所有ndarray或list的长度必须相同。dict的key为DataFrame提供了需要的columns信息,缺失index;\n", 638 | " - 如果提供 index 项,必须和list的长度相同;\n", 639 | " - 如果不提供 index,那么其将默认生成数值索引range(0, data.shape[0]));\n", 640 | " - 如果还额外提供了columns项,那么其将对当前构建的DataFrame进行 **列重索引**。" 641 | ] 642 | }, 643 | { 644 | "cell_type": "code", 645 | "execution_count": 17, 646 | "metadata": { 647 | "ExecuteTime": { 648 | "end_time": "2017-09-25T14:16:07.349774Z", 649 | "start_time": "2017-09-25T14:16:07.146606Z" 650 | } 651 | }, 652 | "outputs": [ 653 | { 654 | "data": { 655 | "text/html": [ 656 | "
\n", 657 | "\n", 670 | "\n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | "
ABD
a12NaN
b45NaN
\n", 694 | "
" 695 | ], 696 | "text/plain": [ 697 | " A B D\n", 698 | "a 1 2 NaN\n", 699 | "b 4 5 NaN" 700 | ] 701 | }, 702 | "execution_count": 17, 703 | "metadata": {}, 704 | "output_type": "execute_result" 705 | } 706 | ], 707 | "source": [ 708 | "data2 = { 'A' : [1,4], 'B': [2,5], 'C':[3,6] }\n", 709 | "index2 = ['a','b']\n", 710 | "columns2 = ['A','B','D']\n", 711 | "df = pd.DataFrame(data=data2, index=index2, columns=columns2)\n", 712 | "df" 713 | ] 714 | }, 715 | { 716 | "cell_type": "markdown", 717 | "metadata": {}, 718 | "source": [ 719 | "### 2.2.3 data有 行索引,有 列索引\n", 720 | " - 如果data为 **dict of (Series or dict)**,那么其已经提供了DataFrame需要的所有信息;\n", 721 | " - 如果多个Series或dict间的索引不一致,那么取并操作(pandas不会试图丢掉信息),缺失的数据填充NaN;\n", 722 | " - 如果提供了index项或columns项,那么其将对当前构建的DataFrame进行 重索引(reindex,pandas内部调用接口)。" 723 | ] 724 | }, 725 | { 726 | "cell_type": "code", 727 | "execution_count": 18, 728 | "metadata": { 729 | "ExecuteTime": { 730 | "end_time": "2017-09-25T14:16:07.570012Z", 731 | "start_time": "2017-09-25T14:16:07.351777Z" 732 | } 733 | }, 734 | "outputs": [ 735 | { 736 | "data": { 737 | "text/html": [ 738 | "
\n", 739 | "\n", 752 | "\n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | " \n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | "
ABC
a1.02.03.0
b4.05.0NaN
cNaNNaN6.0
\n", 782 | "
" 783 | ], 784 | "text/plain": [ 785 | " A B C\n", 786 | "a 1.0 2.0 3.0\n", 787 | "b 4.0 5.0 NaN\n", 788 | "c NaN NaN 6.0" 789 | ] 790 | }, 791 | "execution_count": 18, 792 | "metadata": {}, 793 | "output_type": "execute_result" 794 | } 795 | ], 796 | "source": [ 797 | "# data3 = { 'A' : pd.Series([1,4] ,index = ['a','b']), 'B' : pd.Series([2,5] ,index = ['a','b']), 'C' : pd.Series([3,6] ,index = ['a','c']) }\n", 798 | "data3 = { 'A' : { 'a':1, 'b':4}, 'B': {'a':2,'b':5}, 'C':{'a':3, 'c':6} }\n", 799 | "df = pd.DataFrame(data=data3)\n", 800 | "df" 801 | ] 802 | }, 803 | { 804 | "cell_type": "markdown", 805 | "metadata": {}, 806 | "source": [ 807 | "# 3. 由文件创建" 808 | ] 809 | }, 810 | { 811 | "cell_type": "markdown", 812 | "metadata": {}, 813 | "source": [ 814 | "## 3.1 由.csv文件创建" 815 | ] 816 | }, 817 | { 818 | "cell_type": "markdown", 819 | "metadata": {}, 820 | "source": [ 821 | "##### `pd.read_csv(filepath_or_buffer, sep=',', header='infer', names=None,index_col=None, encoding=None ) `\n", 822 | "read_csv的参数很多,但这几个参数就够我们使用了:\n", 823 | "- filepath_or_buffer:路径和文件名不要带中文,带中文容易报错;\n", 824 | "- sep: csv文件数据的分隔符,默认是',',根据实际情况修改;\n", 825 | "- header:如果有列名,那么这一项不用改;\n", 826 | "- names:如果没有列名,那么必须设置header = None, names为需要传入的列名列表,不设置默认生成数值索引;\n", 827 | "- index_col:list of (int or name),传入列名的列表或者列名的位置,选取这几列作为索引;\n", 828 | "- encoding:根据你的文档编码来确定,如果有中文读取报错,试试encoding = 'gbk'。" 829 | ] 830 | }, 831 | { 832 | "cell_type": "code", 833 | "execution_count": 19, 834 | "metadata": { 835 | "ExecuteTime": { 836 | "end_time": "2017-09-25T14:16:08.148424Z", 837 | "start_time": "2017-09-25T14:16:07.573011Z" 838 | }, 839 | "scrolled": true 840 | }, 841 | "outputs": [ 842 | { 843 | "data": { 844 | "text/html": [ 845 | "
\n", 846 | "\n", 859 | "\n", 860 | " \n", 861 | " \n", 862 | " \n", 863 | " \n", 864 | " \n", 865 | " \n", 866 | " \n", 867 | " \n", 868 | " \n", 869 | " \n", 870 | " \n", 871 | " \n", 872 | " \n", 873 | " \n", 874 | " \n", 875 | " \n", 876 | " \n", 877 | " \n", 878 | " \n", 879 | " \n", 880 | " \n", 881 | " \n", 882 | " \n", 883 | " \n", 884 | " \n", 885 | " \n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | " \n", 904 | " \n", 905 | " \n", 906 | " \n", 907 | " \n", 908 | " \n", 909 | " \n", 910 | " \n", 911 | " \n", 912 | " \n", 913 | " \n", 914 | " \n", 915 | " \n", 916 | " \n", 917 | " \n", 918 | " \n", 919 | " \n", 920 | " \n", 921 | " \n", 922 | " \n", 923 | " \n", 924 | "
total_billtipsexsmokerdaytimesize
016.991.01FemaleNoSunDinner2
110.341.66MaleNoSunDinner3
221.013.50MaleNoSunDinner3
323.683.31MaleNoSunDinner2
424.593.61FemaleNoSunDinner4
\n", 925 | "
" 926 | ], 927 | "text/plain": [ 928 | " total_bill tip sex smoker day time size\n", 929 | "0 16.99 1.01 Female No Sun Dinner 2\n", 930 | "1 10.34 1.66 Male No Sun Dinner 3\n", 931 | "2 21.01 3.50 Male No Sun Dinner 3\n", 932 | "3 23.68 3.31 Male No Sun Dinner 2\n", 933 | "4 24.59 3.61 Female No Sun Dinner 4" 934 | ] 935 | }, 936 | "execution_count": 19, 937 | "metadata": {}, 938 | "output_type": "execute_result" 939 | } 940 | ], 941 | "source": [ 942 | "tips = pd.read_csv( 'tips.csv')\n", 943 | "tips.head()" 944 | ] 945 | }, 946 | { 947 | "cell_type": "code", 948 | "execution_count": 20, 949 | "metadata": { 950 | "ExecuteTime": { 951 | "end_time": "2017-09-25T14:16:08.291633Z", 952 | "start_time": "2017-09-25T14:16:08.151413Z" 953 | } 954 | }, 955 | "outputs": [ 956 | { 957 | "data": { 958 | "text/plain": [ 959 | "RangeIndex(start=0, stop=244, step=1)" 960 | ] 961 | }, 962 | "execution_count": 20, 963 | "metadata": {}, 964 | "output_type": "execute_result" 965 | } 966 | ], 967 | "source": [ 968 | "tips.index" 969 | ] 970 | }, 971 | { 972 | "cell_type": "code", 973 | "execution_count": 21, 974 | "metadata": { 975 | "ExecuteTime": { 976 | "end_time": "2017-09-25T14:16:08.517552Z", 977 | "start_time": "2017-09-25T14:16:08.302620Z" 978 | } 979 | }, 980 | "outputs": [ 981 | { 982 | "data": { 983 | "text/plain": [ 984 | "Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')" 985 | ] 986 | }, 987 | "execution_count": 21, 988 | "metadata": {}, 989 | "output_type": "execute_result" 990 | } 991 | ], 992 | "source": [ 993 | "tips.columns" 994 | ] 995 | }, 996 | { 997 | "cell_type": "code", 998 | "execution_count": 22, 999 | "metadata": { 1000 | "ExecuteTime": { 1001 | "end_time": "2017-09-25T14:16:08.899490Z", 1002 | "start_time": "2017-09-25T14:16:08.520554Z" 1003 | } 1004 | }, 1005 | "outputs": [ 1006 | { 1007 | "data": { 1008 | "text/plain": [ 1009 | "array([[16.99, 1.01, 'Female', ..., 'Sun', 'Dinner', 2],\n", 1010 | " [10.34, 1.66, 'Male', 
..., 'Sun', 'Dinner', 3],\n", 1011 | " [21.01, 3.5, 'Male', ..., 'Sun', 'Dinner', 3],\n", 1012 | " ...,\n", 1013 | " [22.67, 2.0, 'Male', ..., 'Sat', 'Dinner', 2],\n", 1014 | " [17.82, 1.75, 'Male', ..., 'Sat', 'Dinner', 2],\n", 1015 | " [18.78, 3.0, 'Female', ..., 'Thur', 'Dinner', 2]], dtype=object)" 1016 | ] 1017 | }, 1018 | "execution_count": 22, 1019 | "metadata": {}, 1020 | "output_type": "execute_result" 1021 | } 1022 | ], 1023 | "source": [ 1024 | "tips.values" 1025 | ] 1026 | }, 1027 | { 1028 | "cell_type": "markdown", 1029 | "metadata": { 1030 | "collapsed": true 1031 | }, 1032 | "source": [ 1033 | "## 3.2 由.excel文件创建" 1034 | ] 1035 | }, 1036 | { 1037 | "cell_type": "markdown", 1038 | "metadata": {}, 1039 | "source": [ 1040 | "##### `pd.read_excel(io, sheetname=0, header=0, index_col=None, names=None) `\n", 1041 | "read_excel的参数很多,但这几个参数就够我们使用了:\n", 1042 | "- header:如果有列名,那么这一项不用改;\n", 1043 | "- names:如果没有列名,那么必须设置header = None, names为列名的列表,不设置默认生成数值索引;\n", 1044 | "- index_col:同上。" 1045 | ] 1046 | }, 1047 | { 1048 | "cell_type": "code", 1049 | "execution_count": null, 1050 | "metadata": {}, 1051 | "outputs": [], 1052 | "source": [] 1053 | } 1054 | ], 1055 | "metadata": { 1056 | "kernelspec": { 1057 | "display_name": "Python 3", 1058 | "language": "python", 1059 | "name": "python3" 1060 | }, 1061 | "language_info": { 1062 | "codemirror_mode": { 1063 | "name": "ipython", 1064 | "version": 3 1065 | }, 1066 | "file_extension": ".py", 1067 | "mimetype": "text/x-python", 1068 | "name": "python", 1069 | "nbconvert_exporter": "python", 1070 | "pygments_lexer": "ipython3", 1071 | "version": "3.7.0" 1072 | }, 1073 | "toc": { 1074 | "base_numbering": 1, 1075 | "nav_menu": { 1076 | "height": "282px", 1077 | "width": "252px" 1078 | }, 1079 | "number_sections": false, 1080 | "sideBar": true, 1081 | "skip_h1_title": false, 1082 | "title_cell": "Table of Contents", 1083 | "title_sidebar": "Contents", 1084 | "toc_cell": false, 1085 | "toc_position": { 1086 | "height": 
"485px", 1087 | "left": "0px", 1088 | "right": "1068px", 1089 | "top": "66px", 1090 | "width": "298px" 1091 | }, 1092 | "toc_section_display": "block", 1093 | "toc_window_display": true 1094 | } 1095 | }, 1096 | "nbformat": 4, 1097 | "nbformat_minor": 2 1098 | } 1099 | -------------------------------------------------------------------------------- /10. 数值统计运算.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 统计运算\n", 8 | "这一章包含数据分析用得最多的函数操作。" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "metadata": {}, 15 | "outputs": [], 16 | "source": [ 17 | "__auther__ = 'zhenhang.sun@gmail.com'" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "metadata": {}, 24 | "outputs": [ 25 | { 26 | "data": { 27 | "text/plain": [ 28 | "'D:\\\\github\\\\pandas-tutorial'" 29 | ] 30 | }, 31 | "execution_count": 2, 32 | "metadata": {}, 33 | "output_type": "execute_result" 34 | } 35 | ], 36 | "source": [ 37 | "pwd" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 3, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "import numpy as np\n", 47 | "import pandas as pd" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "# 1. 
数值型统计运算\n", 55 | "这些统计操作只对元素类型为数值型的列有效,返回以列索引或行索引为索引的Series。" 56 | ] 57 | }, 58 | { 59 | "cell_type": "markdown", 60 | "metadata": {}, 61 | "source": [ 62 | "## 1.1 一元统计\n", 63 | "顾名思义,这些统计只是自身分布情况的反映。" 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "### 1.1.1 sum()\n", 71 | "和" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "##### `DataFrame.sum(axis='index')`\n", 79 | "- axis:'index' 竖着加,'columns' 横着加" 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": 4, 85 | "metadata": {}, 86 | "outputs": [ 87 | { 88 | "data": { 89 | "text/html": [ 90 | "
\n", 91 | "\n", 104 | "\n", 105 | " \n", 106 | " \n", 107 | " \n", 108 | " \n", 109 | " \n", 110 | " \n", 111 | " \n", 112 | " \n", 113 | " \n", 114 | " \n", 115 | " \n", 116 | " \n", 117 | " \n", 118 | " \n", 119 | " \n", 120 | " \n", 121 | " \n", 122 | " \n", 123 | " \n", 124 | "
AB
a12
b35
\n", 125 | "
" 126 | ], 127 | "text/plain": [ 128 | " A B\n", 129 | "a 1 2\n", 130 | "b 3 5" 131 | ] 132 | }, 133 | "execution_count": 4, 134 | "metadata": {}, 135 | "output_type": "execute_result" 136 | } 137 | ], 138 | "source": [ 139 | "df = pd.DataFrame([[1,2],[3,5]], index=['a','b'],columns=['A','B'])\n", 140 | "df" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 5, 146 | "metadata": {}, 147 | "outputs": [ 148 | { 149 | "data": { 150 | "text/plain": [ 151 | "A 4\n", 152 | "B 7\n", 153 | "dtype: int64" 154 | ] 155 | }, 156 | "execution_count": 5, 157 | "metadata": {}, 158 | "output_type": "execute_result" 159 | } 160 | ], 161 | "source": [ 162 | "df.sum() # 默认按列加\n", 163 | "# df.sum(axis='index') " 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": 6, 169 | "metadata": {}, 170 | "outputs": [ 171 | { 172 | "data": { 173 | "text/plain": [ 174 | "a 3\n", 175 | "b 8\n", 176 | "dtype: int64" 177 | ] 178 | }, 179 | "execution_count": 6, 180 | "metadata": {}, 181 | "output_type": "execute_result" 182 | } 183 | ], 184 | "source": [ 185 | "df.sum(axis='columns') # 按行加" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "### 1.1.2 mean(), std(), var()\n", 193 | "均值、标准差、方差" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": 7, 199 | "metadata": {}, 200 | "outputs": [ 201 | { 202 | "data": { 203 | "text/plain": [ 204 | "A 2.0\n", 205 | "B 3.5\n", 206 | "dtype: float64" 207 | ] 208 | }, 209 | "execution_count": 7, 210 | "metadata": {}, 211 | "output_type": "execute_result" 212 | } 213 | ], 214 | "source": [ 215 | "df.mean()" 216 | ] 217 | }, 218 | { 219 | "cell_type": "code", 220 | "execution_count": 8, 221 | "metadata": {}, 222 | "outputs": [ 223 | { 224 | "data": { 225 | "text/plain": [ 226 | "A 1.414214\n", 227 | "B 2.121320\n", 228 | "dtype: float64" 229 | ] 230 | }, 231 | "execution_count": 8, 232 | "metadata": {}, 233 | "output_type": "execute_result" 
234 | } 235 | ], 236 | "source": [ 237 | "df.std()" 238 | ] 239 | }, 240 | { 241 | "cell_type": "code", 242 | "execution_count": 9, 243 | "metadata": {}, 244 | "outputs": [ 245 | { 246 | "data": { 247 | "text/plain": [ 248 | "A 2.0\n", 249 | "B 4.5\n", 250 | "dtype: float64" 251 | ] 252 | }, 253 | "execution_count": 9, 254 | "metadata": {}, 255 | "output_type": "execute_result" 256 | } 257 | ], 258 | "source": [ 259 | "df.var()" 260 | ] 261 | }, 262 | { 263 | "cell_type": "code", 264 | "execution_count": null, 265 | "metadata": {}, 266 | "outputs": [], 267 | "source": [] 268 | }, 269 | { 270 | "cell_type": "markdown", 271 | "metadata": {}, 272 | "source": [ 273 | "### 1.1.3 max(), min(), median()\n", 274 | "Maximum, minimum, and median" 275 | ] 276 | }, 277 | { 278 | "cell_type": "code", 279 | "execution_count": 10, 280 | "metadata": {}, 281 | "outputs": [ 282 | { 283 | "data": { 284 | "text/plain": [ 285 | "A 1\n", 286 | "B 2\n", 287 | "dtype: int64" 288 | ] 289 | }, 290 | "execution_count": 10, 291 | "metadata": {}, 292 | "output_type": "execute_result" 293 | } 294 | ], 295 | "source": [ 296 | "df.min()" 297 | ] 298 | }, 299 | { 300 | "cell_type": "code", 301 | "execution_count": 11, 302 | "metadata": {}, 303 | "outputs": [ 304 | { 305 | "data": { 306 | "text/plain": [ 307 | "A 3\n", 308 | "B 5\n", 309 | "dtype: int64" 310 | ] 311 | }, 312 | "execution_count": 11, 313 | "metadata": {}, 314 | "output_type": "execute_result" 315 | } 316 | ], 317 | "source": [ 318 | "df.max()" 319 | ] 320 | }, 321 | { 322 | "cell_type": "code", 323 | "execution_count": 12, 324 | "metadata": {}, 325 | "outputs": [ 326 | { 327 | "data": { 328 | "text/plain": [ 329 | "A 2.0\n", 330 | "B 3.5\n", 331 | "dtype: float64" 332 | ] 333 | }, 334 | "execution_count": 12, 335 | "metadata": {}, 336 | "output_type": "execute_result" 337 | } 338 | ], 339 | "source": [ 340 | "df.median()" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": null, 346 | "metadata": {}, 347 | "outputs": [], 348 | 
"source": [] 349 | }, 350 | { 351 | "cell_type": "markdown", 352 | "metadata": {}, 353 | "source": [ 354 | "## 1.2 Bivariate statistics\n", 355 | "Compute a statistic between every pair of columns. The result is a DataFrame whose row index and column index are both the original column index." 356 | ] 357 | }, 358 | { 359 | "cell_type": "markdown", 360 | "metadata": {}, 361 | "source": [ 362 | "### 1.2.1 cov() \n", 363 | "Covariance" 364 | ] 365 | }, 366 | { 367 | "cell_type": "markdown", 368 | "metadata": {}, 369 | "source": [ 370 | "##### `DataFrame.cov(min_periods=None)`\n", 371 | "- min_periods: the minimum number of elements that must remain, after NaN values are dropped, for the statistic to be computed; otherwise the result is NaN." 372 | ] 373 | }, 374 | { 375 | "cell_type": "code", 376 | "execution_count": 13, 377 | "metadata": {}, 378 | "outputs": [ 379 | { 380 | "data": { 381 | "text/html": [ 382 | "
" 418 | ], 419 | "text/plain": [ 420 | " B C\n", 421 | "0 1 2.0\n", 422 | "1 2 NaN" 423 | ] 424 | }, 425 | "execution_count": 13, 426 | "metadata": {}, 427 | "output_type": "execute_result" 428 | } 429 | ], 430 | "source": [ 431 | "df1 = pd.DataFrame([[1,2],[2,np.nan]],columns=['B','C'])\n", 432 | "df1" 433 | ] 434 | }, 435 | { 436 | "cell_type": "code", 437 | "execution_count": 14, 438 | "metadata": {}, 439 | "outputs": [ 440 | { 441 | "data": { 442 | "text/html": [ 443 | "
" 479 | ], 480 | "text/plain": [ 481 | " B C\n", 482 | "B 0.5 NaN\n", 483 | "C NaN NaN" 484 | ] 485 | }, 486 | "execution_count": 14, 487 | "metadata": {}, 488 | "output_type": "execute_result" 489 | } 490 | ], 491 | "source": [ 492 | "df1.cov()" 493 | ] 494 | }, 495 | { 496 | "cell_type": "markdown", 497 | "metadata": {}, 498 | "source": [ 499 | "### 1.2.2 corr() \n", 500 | "Correlation coefficient\n", 501 | "\n", 502 | "##### `DataFrame.corr(method='pearson', min_periods=1)`\n", 503 | "- method: the correlation method to use ('pearson', 'kendall', or 'spearman')\n", 504 | "- min_periods: the minimum number of elements that must remain, after NaN values are dropped, for the statistic to be computed; otherwise the result is NaN." 505 | ] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "execution_count": 15, 510 | "metadata": {}, 511 | "outputs": [ 512 | { 513 | "data": { 514 | "text/html": [ 515 | "
" 551 | ], 552 | "text/plain": [ 553 | " B C\n", 554 | "B 1.0 NaN\n", 555 | "C NaN NaN" 556 | ] 557 | }, 558 | "execution_count": 15, 559 | "metadata": {}, 560 | "output_type": "execute_result" 561 | } 562 | ], 563 | "source": [ 564 | "df1.corr()" 565 | ] 566 | }, 567 | { 568 | "cell_type": "markdown", 569 | "metadata": {}, 570 | "source": [ 571 | "### 1.2.3 corrwith()\n", 572 | "corr() relates the columns of one DataFrame to each other, whereas this function works across two different DataFrames. Before computing, the two DataFrames are fully aligned, keeping only the intersection of their labels.\n", 573 | "##### `DataFrame.corrwith(other, axis=0, drop=False, method='pearson')`\n", 574 | "- other: another DataFrame or Series\n", 575 | "- axis: 'index' or 0, or 'columns' or 1. With index, correlations are computed column by column; with columns, row by row. Defaults to index.\n", 576 | "- drop: whether to drop from the result the labels that were filtered out when the indexes were intersected\n", 577 | "- method: the method used to compute the correlation" 578 | ] 579 | }, 580 | { 581 | "cell_type": "code", 582 | "execution_count": 16, 583 | "metadata": {}, 584 | "outputs": [ 585 | { 586 | "data": { 587 | "text/html": [ 588 | "
" 629 | ], 630 | "text/plain": [ 631 | " B C\n", 632 | "a 2 1\n", 633 | "b 5 0\n", 634 | "2 2 3" 635 | ] 636 | }, 637 | "execution_count": 16, 638 | "metadata": {}, 639 | "output_type": "execute_result" 640 | } 641 | ], 642 | "source": [ 643 | "df1 = pd.DataFrame([[2,1],[5,0],[2,3]],index=['a','b',2],columns=['B','C'])\n", 644 | "df1" 645 | ] 646 | }, 647 | { 648 | "cell_type": "code", 649 | "execution_count": 17, 650 | "metadata": { 651 | "scrolled": true 652 | }, 653 | "outputs": [ 654 | { 655 | "data": { 656 | "text/html": [ 657 | "
" 693 | ], 694 | "text/plain": [ 695 | " A B\n", 696 | "a 1 2\n", 697 | "b 3 5" 698 | ] 699 | }, 700 | "execution_count": 17, 701 | "metadata": {}, 702 | "output_type": "execute_result" 703 | } 704 | ], 705 | "source": [ 706 | "df" 707 | ] 708 | }, 709 | { 710 | "cell_type": "code", 711 | "execution_count": 18, 712 | "metadata": {}, 713 | "outputs": [ 714 | { 715 | "data": { 716 | "text/plain": [ 717 | "B 1.0\n", 718 | "dtype: float64" 719 | ] 720 | }, 721 | "execution_count": 18, 722 | "metadata": {}, 723 | "output_type": "execute_result" 724 | } 725 | ], 726 | "source": [ 727 | "df.corrwith(df1, axis=0, drop=True) # df and df1 are aligned on their intersection, leaving only column B and rows a, b" 728 | ] 729 | }, 730 | { 731 | "cell_type": "code", 732 | "execution_count": 19, 733 | "metadata": {}, 734 | "outputs": [ 735 | { 736 | "data": { 737 | "text/plain": [ 738 | "A NaN\n", 739 | "B 1.0\n", 740 | "C NaN\n", 741 | "dtype: float64" 742 | ] 743 | }, 744 | "execution_count": 19, 745 | "metadata": {}, 746 | "output_type": "execute_result" 747 | } 748 | ], 749 | "source": [ 750 | "df.corrwith(df1, axis=0, drop=False) # columns A and C, filtered out by the intersection, are added back as NaN" 751 | ] 752 | }, 753 | { 754 | "cell_type": "code", 755 | "execution_count": null, 756 | "metadata": {}, 757 | "outputs": [], 758 | "source": [] 759 | }, 760 | { 761 | "cell_type": "code", 762 | "execution_count": 20, 763 | "metadata": {}, 764 | "outputs": [ 765 | { 766 | "data": { 767 | "text/plain": [ 768 | "a 3\n", 769 | "b 1\n", 770 | "c 1\n", 771 | "Name: B, dtype: int64" 772 | ] 773 | }, 774 | "execution_count": 20, 775 | "metadata": {}, 776 | "output_type": "execute_result" 777 | } 778 | ], 779 | "source": [ 780 | "# When other is a Series, each column of df is usually correlated with that Series in turn.\n", 781 | "s = pd.Series([3,1,1], index=['a','b','c'], name='B') \n", 782 | "s" 783 | ] 784 | }, 785 | { 786 | "cell_type": "code", 787 | "execution_count": 21, 788 | "metadata": { 789 | "scrolled": false 790 | }, 791 | "outputs": [ 792 | { 793 | "data": { 794 | "text/html": [ 795 | "
" 831 | ], 832 | "text/plain": [ 833 | " A B\n", 834 | "a 1 2\n", 835 | "b 3 5" 836 | ] 837 | }, 838 | "execution_count": 21, 839 | "metadata": {}, 840 | "output_type": "execute_result" 841 | } 842 | ], 843 | "source": [ 844 | "df" 845 | ] 846 | }, 847 | { 848 | "cell_type": "code", 849 | "execution_count": 22, 850 | "metadata": {}, 851 | "outputs": [ 852 | { 853 | "data": { 854 | "text/plain": [ 855 | "A -1.0\n", 856 | "B -1.0\n", 857 | "dtype: float64" 858 | ] 859 | }, 860 | "execution_count": 22, 861 | "metadata": {}, 862 | "output_type": "execute_result" 863 | } 864 | ], 865 | "source": [ 866 | "df.corrwith(s, axis=0)" 867 | ] 868 | }, 869 | { 870 | "cell_type": "code", 871 | "execution_count": null, 872 | "metadata": {}, 873 | "outputs": [], 874 | "source": [] 875 | }, 876 | { 877 | "cell_type": "markdown", 878 | "metadata": {}, 879 | "source": [ 880 | "# 2. Statistics for categorical data" 881 | ] 882 | }, 883 | { 884 | "cell_type": "markdown", 885 | "metadata": {}, 886 | "source": [ 887 | "## 2.1 value_counts()\n", 888 | "Intended for Series and Index, not DataFrame.\n", 889 | "##### `Series/Index.value_counts(normalize=False, ascending=False, bins=None)`\n", 890 | "- normalize: True or False, return relative frequencies (True) or raw counts (False);\n", 891 | "- ascending: True or False, the sort order; descending by default;\n", 892 | "- bins: int, a shortcut for pd.cut that bins the values before counting; works well for continuous numeric data;" 893 | ] 894 | }, 895 | { 896 | "cell_type": "code", 897 | "execution_count": 23, 898 | "metadata": { 899 | "scrolled": true 900 | }, 901 | "outputs": [ 902 | { 903 | "data": { 904 | "text/plain": [ 905 | "0 1\n", 906 | "1 2\n", 907 | "2 1\n", 908 | "3 2\n", 909 | "4 1\n", 910 | "5 3\n", 911 | "dtype: int64" 912 | ] 913 | }, 914 | "execution_count": 23, 915 | "metadata": {}, 916 | "output_type": "execute_result" 917 | } 918 | ], 919 | "source": [ 920 | "s = pd.Series([1,2,1,2,1,3])\n", 921 | "s" 922 | ] 923 | }, 924 | { 925 | "cell_type": "code", 926 | "execution_count": 24, 927 | "metadata": {}, 928 | "outputs": [ 929 | { 930 | "data": { 931 | "text/plain": [ 932 | "1 3\n", 933 | "2 2\n", 934 | "3 1\n", 935 | "dtype: 
int64" 936 | ] 937 | }, 938 | "execution_count": 24, 939 | "metadata": {}, 940 | "output_type": "execute_result" 941 | } 942 | ], 943 | "source": [ 944 | "s.value_counts()" 945 | ] 946 | }, 947 | { 948 | "cell_type": "code", 949 | "execution_count": 25, 950 | "metadata": {}, 951 | "outputs": [ 952 | { 953 | "data": { 954 | "text/plain": [ 955 | "3 1\n", 956 | "2 2\n", 957 | "1 3\n", 958 | "dtype: int64" 959 | ] 960 | }, 961 | "execution_count": 25, 962 | "metadata": {}, 963 | "output_type": "execute_result" 964 | } 965 | ], 966 | "source": [ 967 | "s.value_counts(ascending=True)" 968 | ] 969 | }, 970 | { 971 | "cell_type": "code", 972 | "execution_count": 26, 973 | "metadata": {}, 974 | "outputs": [ 975 | { 976 | "data": { 977 | "text/plain": [ 978 | "(0.997, 2.0] 5\n", 979 | "(2.0, 3.0] 1\n", 980 | "dtype: int64" 981 | ] 982 | }, 983 | "execution_count": 26, 984 | "metadata": {}, 985 | "output_type": "execute_result" 986 | } 987 | ], 988 | "source": [ 989 | "s.value_counts(bins=2) # an int bins splits the range into equal-width intervals, left-open and right-closed; the leftmost edge is extended by 0.1% of the range so the minimum is included" 990 | ] 991 | }, 992 | { 993 | "cell_type": "markdown", 994 | "metadata": {}, 995 | "source": [ 996 | "## 2.2 count()\n", 997 | "Counts the non-NaN elements in each column or row. It is a quick way to see which features or which samples have many missing values.\n", 998 | "##### `DataFrame.count(axis=0)`\n", 999 | "- axis: 0 counts per column, 1 counts per row;" 1000 | ] 1001 | }, 1002 | { 1003 | "cell_type": "code", 1004 | "execution_count": 27, 1005 | "metadata": {}, 1006 | "outputs": [ 1007 | { 1008 | "data": { 1009 | "text/html": [ 1010 | "
" 1046 | ], 1047 | "text/plain": [ 1048 | " A B\n", 1049 | "a 1 2\n", 1050 | "b 3 5" 1051 | ] 1052 | }, 1053 | "execution_count": 27, 1054 | "metadata": {}, 1055 | "output_type": "execute_result" 1056 | } 1057 | ], 1058 | "source": [ 1059 | "df" 1060 | ] 1061 | }, 1062 | { 1063 | "cell_type": "code", 1064 | "execution_count": 28, 1065 | "metadata": { 1066 | "scrolled": false 1067 | }, 1068 | "outputs": [ 1069 | { 1070 | "data": { 1071 | "text/plain": [ 1072 | "A 2\n", 1073 | "B 2\n", 1074 | "dtype: int64" 1075 | ] 1076 | }, 1077 | "execution_count": 28, 1078 | "metadata": {}, 1079 | "output_type": "execute_result" 1080 | } 1081 | ], 1082 | "source": [ 1083 | "df.count(axis=0)" 1084 | ] 1085 | }, 1086 | { 1087 | "cell_type": "code", 1088 | "execution_count": 29, 1089 | "metadata": {}, 1090 | "outputs": [ 1091 | { 1092 | "data": { 1093 | "text/plain": [ 1094 | "a 2\n", 1095 | "b 2\n", 1096 | "dtype: int64" 1097 | ] 1098 | }, 1099 | "execution_count": 29, 1100 | "metadata": {}, 1101 | "output_type": "execute_result" 1102 | } 1103 | ], 1104 | "source": [ 1105 | "df.count(axis=1)" 1106 | ] 1107 | }, 1108 | { 1109 | "cell_type": "code", 1110 | "execution_count": null, 1111 | "metadata": {}, 1112 | "outputs": [], 1113 | "source": [] 1114 | } 1115 | ], 1116 | "metadata": { 1117 | "kernelspec": { 1118 | "display_name": "Python 3", 1119 | "language": "python", 1120 | "name": "python3" 1121 | }, 1122 | "language_info": { 1123 | "codemirror_mode": { 1124 | "name": "ipython", 1125 | "version": 3 1126 | }, 1127 | "file_extension": ".py", 1128 | "mimetype": "text/x-python", 1129 | "name": "python", 1130 | "nbconvert_exporter": "python", 1131 | "pygments_lexer": "ipython3", 1132 | "version": "3.7.0" 1133 | }, 1134 | "toc": { 1135 | "base_numbering": 1, 1136 | "nav_menu": { 1137 | "height": "67px", 1138 | "width": "253px" 1139 | }, 1140 | "number_sections": false, 1141 | "sideBar": true, 1142 | "skip_h1_title": false, 1143 | "title_cell": "Table of Contents", 1144 | 
"title_sidebar": "Contents", 1145 | "toc_cell": false, 1146 | "toc_position": { 1147 | "height": "600px", 1148 | "left": "0px", 1149 | "right": "1190.23px", 1150 | "top": "67px", 1151 | "width": "232px" 1152 | }, 1153 | "toc_section_display": "block", 1154 | "toc_window_display": true 1155 | } 1156 | }, 1157 | "nbformat": 4, 1158 | "nbformat_minor": 2 1159 | } 1160 | -------------------------------------------------------------------------------- /12. Category型与离散化.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Category dtype and discretization\n", 8 | "The categorical type is very commonly used. It has the following characteristics:\n", 9 | "1. It takes only a fixed set of values;\n", 10 | "2. An order can be defined on it, and that order may differ from both numeric order and lexicographic order;\n", 11 | "3. Even when the categories are written as numbers, arithmetic on them may be meaningless, so it is not necessarily the same as a discrete numeric type." 12 | ] 13 | }, 14 | { 15 | "cell_type": "code", 16 | "execution_count": 1, 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "__auther__ = 'zhenhang.sun@gmail.com'" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 2, 26 | "metadata": {}, 27 | "outputs": [ 28 | { 29 | "data": { 30 | "text/plain": [ 31 | "'D:\\\\github\\\\pandas-tutorial'" 32 | ] 33 | }, 34 | "execution_count": 2, 35 | "metadata": {}, 36 | "output_type": "execute_result" 37 | } 38 | ], 39 | "source": [ 40 | "pwd" 41 | ] 42 | }, 43 | { 44 | "cell_type": "code", 45 | "execution_count": 3, 46 | "metadata": {}, 47 | "outputs": [], 48 | "source": [ 49 | "import numpy as np\n", 50 | "import pandas as pd" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [] 59 | }, 60 | { 61 | "cell_type": "markdown", 62 | "metadata": {}, 63 | "source": [ 64 | "# 1. 
Creation" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "## 1.1 Creating a Categorical" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "##### `pd.Categorical(values, categories=None, ordered=False)`\n", 79 | "- values: the sequence of values;\n", 80 | "- categories: a user-defined list of categories;\n", 81 | "- ordered: whether the categories are ordered; when they are and categories is not given, the order is ascending." 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": 4, 87 | "metadata": {}, 88 | "outputs": [ 89 | { 90 | "data": { 91 | "text/plain": [ 92 | "[2, 1, 1, 3]\n", 93 | "Categories (3, int64): [1 < 2 < 3]" 94 | ] 95 | }, 96 | "execution_count": 4, 97 | "metadata": {}, 98 | "output_type": "execute_result" 99 | } 100 | ], 101 | "source": [ 102 | "c = pd.Categorical([2,1,1,3], ordered=True ) \n", 103 | "c\n", 104 | "# Without categories, the distinct values of values are used as the categories\n", 105 | "# If ordered=True, the order is the ascending sorted order of those values" 106 | ] 107 | }, 108 | { 109 | "cell_type": "code", 110 | "execution_count": 5, 111 | "metadata": {}, 112 | "outputs": [ 113 | { 114 | "data": { 115 | "text/plain": [ 116 | "[NaN, 2.0, 3.0]\n", 117 | "Categories (2, int64): [3 < 2]" 118 | ] 119 | }, 120 | "execution_count": 5, 121 | "metadata": {}, 122 | "output_type": "execute_result" 123 | } 124 | ], 125 | "source": [ 126 | "c = pd.Categorical([1,2,3], categories=[3,2], ordered=True )\n", 127 | "c\n", 128 | "# With categories given (duplicates raise an error), any value not in categories is replaced by NaN\n", 129 | "# If ordered=True, the order follows the order in which categories is listed" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "**Two important attributes of a Categorical**" 137 | ] 138 | }, 139 | { 140 | "cell_type": "code", 141 | "execution_count": 6, 142 | "metadata": {}, 143 | "outputs": [ 144 | { 145 | "data": { 146 | "text/plain": [ 147 | "Int64Index([3, 2], dtype='int64')" 148 | ] 149 | }, 150 | "execution_count": 6, 151 | "metadata": {}, 152 | "output_type": "execute_result" 153 | } 154 | ], 155 | "source": [ 156 | "c.categories # the categories" 157 | ] 158 | }, 159 | { 160 | "cell_type": "code", 
161 | "execution_count": 7, 162 | "metadata": { 163 | "scrolled": true 164 | }, 165 | "outputs": [ 166 | { 167 | "data": { 168 | "text/plain": [ 169 | "True" 170 | ] 171 | }, 172 | "execution_count": 7, 173 | "metadata": {}, 174 | "output_type": "execute_result" 175 | } 176 | ], 177 | "source": [ 178 | "c.ordered # whether the categories are ordered" 179 | ] 180 | }, 181 | { 182 | "cell_type": "markdown", 183 | "metadata": {}, 184 | "source": [ 185 | "## 1.2 Converting to the category dtype" 186 | ] 187 | }, 188 | { 189 | "cell_type": "code", 190 | "execution_count": 8, 191 | "metadata": {}, 192 | "outputs": [ 193 | { 194 | "data": { 195 | "text/plain": [ 196 | "0 2\n", 197 | "1 1\n", 198 | "2 1\n", 199 | "3 3\n", 200 | "dtype: int64" 201 | ] 202 | }, 203 | "execution_count": 8, 204 | "metadata": {}, 205 | "output_type": "execute_result" 206 | } 207 | ], 208 | "source": [ 209 | "s = pd.Series([2,1,1,3])\n", 210 | "s" 211 | ] 212 | }, 213 | { 214 | "cell_type": "code", 215 | "execution_count": 9, 216 | "metadata": {}, 217 | "outputs": [ 218 | { 219 | "data": { 220 | "text/plain": [ 221 | "0 2\n", 222 | "1 1\n", 223 | "2 1\n", 224 | "3 3\n", 225 | "dtype: category\n", 226 | "Categories (3, int64): [1, 2, 3]" 227 | ] 228 | }, 229 | "execution_count": 9, 230 | "metadata": {}, 231 | "output_type": "execute_result" 232 | } 233 | ], 234 | "source": [ 235 | "s = s.astype('category') \n", 236 | "s # the dtype is now category" 237 | ] 238 | }, 239 | { 240 | "cell_type": "markdown", 241 | "metadata": {}, 242 | "source": [ 243 | "**On a Series, the category attributes are accessed through .cat**" 244 | ] 245 | }, 246 | { 247 | "cell_type": "code", 248 | "execution_count": 10, 249 | "metadata": { 250 | "scrolled": true 251 | }, 252 | "outputs": [ 253 | { 254 | "data": { 255 | "text/plain": [ 256 | "Int64Index([1, 2, 3], dtype='int64')" 257 | ] 258 | }, 259 | "execution_count": 10, 260 | "metadata": {}, 261 | "output_type": "execute_result" 262 | } 263 | ], 264 | "source": [ 265 | "s.cat.categories" 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": 
11, 271 | "metadata": {}, 272 | "outputs": [ 273 | { 274 | "data": { 275 | "text/plain": [ 276 | "False" 277 | ] 278 | }, 279 | "execution_count": 11, 280 | "metadata": {}, 281 | "output_type": "execute_result" 282 | } 283 | ], 284 | "source": [ 285 | "s.cat.ordered" 286 | ] 287 | }, 288 | { 289 | "cell_type": "code", 290 | "execution_count": null, 291 | "metadata": {}, 292 | "outputs": [], 293 | "source": [] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "metadata": {}, 298 | "source": [ 299 | "# 2. Query, modify, add, delete" 300 | ] 301 | }, 302 | { 303 | "cell_type": "markdown", 304 | "metadata": {}, 305 | "source": [ 306 | "## 2.1 Query" 307 | ] 308 | }, 309 | { 310 | "cell_type": "markdown", 311 | "metadata": {}, 312 | "source": [ 313 | "##### Four ways to index with []\n", 314 | "A Categorical is sequence-like, so it can be indexed with []; .loc[] and .iloc[] are not supported." 315 | ] 316 | }, 317 | { 318 | "cell_type": "code", 319 | "execution_count": 12, 320 | "metadata": {}, 321 | "outputs": [ 322 | { 323 | "data": { 324 | "text/plain": [ 325 | "nan" 326 | ] 327 | }, 328 | "execution_count": 12, 329 | "metadata": {}, 330 | "output_type": "execute_result" 331 | } 332 | ], 333 | "source": [ 334 | "c[0]" 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": 13, 340 | "metadata": {}, 341 | "outputs": [ 342 | { 343 | "data": { 344 | "text/plain": [ 345 | "[NaN, 2.0]\n", 346 | "Categories (2, int64): [3 < 2]" 347 | ] 348 | }, 349 | "execution_count": 13, 350 | "metadata": {}, 351 | "output_type": "execute_result" 352 | } 353 | ], 354 | "source": [ 355 | "c[0:2]" 356 | ] 357 | }, 358 | { 359 | "cell_type": "code", 360 | "execution_count": 14, 361 | "metadata": {}, 362 | "outputs": [ 363 | { 364 | "data": { 365 | "text/plain": [ 366 | "[NaN, 3.0]\n", 367 | "Categories (2, int64): [3 < 2]" 368 | ] 369 | }, 370 | "execution_count": 14, 371 | "metadata": {}, 372 | "output_type": "execute_result" 373 | } 374 | ], 375 | "source": [ 376 | "c[[0,2]]" 377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": 15, 
382 | "metadata": {}, 383 | "outputs": [ 384 | { 385 | "data": { 386 | "text/plain": [ 387 | "[NaN, 3.0]\n", 388 | "Categories (2, int64): [3 < 2]" 389 | ] 390 | }, 391 | "execution_count": 15, 392 | "metadata": {}, 393 | "output_type": "execute_result" 394 | } 395 | ], 396 | "source": [ 397 | "mask = [True, False, True]\n", 398 | "c[mask]" 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": null, 404 | "metadata": {}, 405 | "outputs": [], 406 | "source": [] 407 | }, 408 | { 409 | "cell_type": "markdown", 410 | "metadata": {}, 411 | "source": [ 412 | "## 2.2 Modify" 413 | ] 414 | }, 415 | { 416 | "cell_type": "markdown", 417 | "metadata": {}, 418 | "source": [ 419 | "### 2.2.1 Changing category values\n", 420 | "This is used fairly often, for example to map string categories to numeric ones." 421 | ] 422 | }, 423 | { 424 | "cell_type": "markdown", 425 | "metadata": {}, 426 | "source": [ 427 | "**Direct assignment**" 428 | ] 429 | }, 430 | { 431 | "cell_type": "code", 432 | "execution_count": 16, 433 | "metadata": {}, 434 | "outputs": [], 435 | "source": [ 436 | "c1 = c.copy()" 437 | ] 438 | }, 439 | { 440 | "cell_type": "code", 441 | "execution_count": 17, 442 | "metadata": {}, 443 | "outputs": [ 444 | { 445 | "data": { 446 | "text/plain": [ 447 | "[NaN, 6, 5]\n", 448 | "Categories (2, object): [5 < 6]" 449 | ] 450 | }, 451 | "execution_count": 17, 452 | "metadata": {}, 453 | "output_type": "execute_result" 454 | } 455 | ], 456 | "source": [ 457 | "c1.categories = ['5','6'] # the new category list must have the same length as the old one; in effect each old category is replaced, position by position, in both value and dtype\n", 458 | "c1" 459 | ] 460 | }, 461 | { 462 | "cell_type": "code", 463 | "execution_count": 18, 464 | "metadata": {}, 465 | "outputs": [ 466 | { 467 | "data": { 468 | "text/plain": [ 469 | "0 2\n", 470 | "1 1\n", 471 | "2 1\n", 472 | "3 3\n", 473 | "dtype: category\n", 474 | "Categories (3, int64): [1, 2, 3]" 475 | ] 476 | }, 477 | "execution_count": 18, 478 | "metadata": {}, 479 | "output_type": "execute_result" 480 | } 481 | ], 482 | "source": [ 483 | "s1= s.copy()\n", 484 | "s1" 485 | ] 486 | }, 487 | { 488 | 
"cell_type": "code", 489 | "execution_count": 19, 490 | "metadata": {}, 491 | "outputs": [ 492 | { 493 | "data": { 494 | "text/plain": [ 495 | "0 5\n", 496 | "1 6\n", 497 | "2 6\n", 498 | "3 7\n", 499 | "dtype: category\n", 500 | "Categories (3, int64): [6, 5, 7]" 501 | ] 502 | }, 503 | "execution_count": 19, 504 | "metadata": {}, 505 | "output_type": "execute_result" 506 | } 507 | ], 508 | "source": [ 509 | "s1.cat.categories = [6,5,7]\n", 510 | "s1 # for a Series the same change works through .cat" 511 | ] 512 | }, 513 | { 514 | "cell_type": "markdown", 515 | "metadata": {}, 516 | "source": [ 517 | "##### Changing via a method\n", 518 | "##### `categories.rename_categories(cat , inplace = False)`\n", 519 | "- cat: the new categories; must have the same length as the old ones;\n", 520 | "- inplace: True or False, whether to modify in place." 521 | ] 522 | }, 523 | { 524 | "cell_type": "code", 525 | "execution_count": 20, 526 | "metadata": {}, 527 | "outputs": [ 528 | { 529 | "data": { 530 | "text/plain": [ 531 | "[NaN, 6, 5]\n", 532 | "Categories (2, object): [5 < 6]" 533 | ] 534 | }, 535 | "execution_count": 20, 536 | "metadata": {}, 537 | "output_type": "execute_result" 538 | } 539 | ], 540 | "source": [ 541 | "c1 = c.copy()\n", 542 | "c1.rename_categories(['5','6'], inplace=True) # same result as the direct assignment above\n", 543 | "c1 " 544 | ] 545 | }, 546 | { 547 | "cell_type": "code", 548 | "execution_count": 21, 549 | "metadata": {}, 550 | "outputs": [ 551 | { 552 | "data": { 553 | "text/plain": [ 554 | "0 5\n", 555 | "1 6\n", 556 | "2 6\n", 557 | "3 7\n", 558 | "dtype: category\n", 559 | "Categories (3, object): [6, 5, 7]" 560 | ] 561 | }, 562 | "execution_count": 21, 563 | "metadata": {}, 564 | "output_type": "execute_result" 565 | } 566 | ], 567 | "source": [ 568 | "s1 = s.copy()\n", 569 | "s1.cat.rename_categories(['6','5','7'], inplace=True) # same result as the direct assignment above\n", 570 | "s1" 571 | ] 572 | }, 573 | { 574 | "cell_type": "markdown", 575 | "metadata": {}, 576 | "source": [ 577 | "### 2.2.2 Switching between ordered and unordered" 578 | ] 579 | }, 580 | { 581 | "cell_type": "markdown", 582 | "metadata": {}, 583 | "source": [ 584 | "##### 
`categories.as_ordered(inplace=False)`\n", 585 | "##### `categories.as_unordered(inplace=False)`\n", 586 | "- inplace: True or False, whether to modify in place." 587 | ] 588 | }, 589 | { 590 | "cell_type": "code", 591 | "execution_count": 22, 592 | "metadata": {}, 593 | "outputs": [ 594 | { 595 | "data": { 596 | "text/plain": [ 597 | "[NaN, 2.0, 3.0]\n", 598 | "Categories (2, int64): [3 < 2]" 599 | ] 600 | }, 601 | "execution_count": 22, 602 | "metadata": {}, 603 | "output_type": "execute_result" 604 | } 605 | ], 606 | "source": [ 607 | "c1 = c.copy()\n", 608 | "c1" 609 | ] 610 | }, 611 | { 612 | "cell_type": "code", 613 | "execution_count": 23, 614 | "metadata": {}, 615 | "outputs": [ 616 | { 617 | "data": { 618 | "text/plain": [ 619 | "[NaN, 2.0, 3.0]\n", 620 | "Categories (2, int64): [3, 2]" 621 | ] 622 | }, 623 | "execution_count": 23, 624 | "metadata": {}, 625 | "output_type": "execute_result" 626 | } 627 | ], 628 | "source": [ 629 | "c1.as_unordered(inplace = True)\n", 630 | "c1" 631 | ] 632 | }, 633 | { 634 | "cell_type": "code", 635 | "execution_count": 24, 636 | "metadata": {}, 637 | "outputs": [ 638 | { 639 | "data": { 640 | "text/plain": [ 641 | "[NaN, 2.0, 3.0]\n", 642 | "Categories (2, int64): [3 < 2]" 643 | ] 644 | }, 645 | "execution_count": 24, 646 | "metadata": {}, 647 | "output_type": "execute_result" 648 | } 649 | ], 650 | "source": [ 651 | "c1.as_ordered()" 652 | ] 653 | }, 654 | { 655 | "cell_type": "markdown", 656 | "metadata": {}, 657 | "source": [ 658 | "### 2.2.3 Changing the order of ordered categories" 659 | ] 660 | }, 661 | { 662 | "cell_type": "markdown", 663 | "metadata": {}, 664 | "source": [ 665 | "##### `categories.reorder_categories(cat , ordered=False,inplace=False)`\n", 666 | "- cat: must be the old categories in a new order; categories cannot be added or removed;\n", 667 | "- ordered: True or False, whether the categories are ordered\n", 668 | "- inplace: True or False, whether to modify in place." 669 | ] 670 | }, 671 | { 672 | "cell_type": "code", 673 | "execution_count": 25, 674 | "metadata": {}, 675 | "outputs": [ 676 | { 677 | "data": { 678 | "text/plain": [ 679 | "[NaN, 2.0, 3.0]\n", 
680 | "Categories (2, int64): [3 < 2]" 681 | ] 682 | }, 683 | "execution_count": 25, 684 | "metadata": {}, 685 | "output_type": "execute_result" 686 | } 687 | ], 688 | "source": [ 689 | "c1 = c.copy()\n", 690 | "c1" 691 | ] 692 | }, 693 | { 694 | "cell_type": "code", 695 | "execution_count": 26, 696 | "metadata": {}, 697 | "outputs": [ 698 | { 699 | "data": { 700 | "text/plain": [ 701 | "[NaN, 2.0, 3.0]\n", 702 | "Categories (2, int64): [2 < 3]" 703 | ] 704 | }, 705 | "execution_count": 26, 706 | "metadata": {}, 707 | "output_type": "execute_result" 708 | } 709 | ], 710 | "source": [ 711 | "c1.reorder_categories([2,3],ordered=True,inplace=True)\n", 712 | "c1" 713 | ] 714 | }, 715 | { 716 | "cell_type": "markdown", 717 | "metadata": {}, 718 | "source": [ 719 | "## 2.3 Add" 720 | ] 721 | }, 722 | { 723 | "cell_type": "markdown", 724 | "metadata": {}, 725 | "source": [ 726 | "##### `categories.add_categories(cat,inplace=False)`\n", 727 | "- cat: the categories to add; they must not already be among the old categories;\n", 728 | "- inplace: True or False, whether to modify in place." 729 | ] 730 | }, 731 | { 732 | "cell_type": "code", 733 | "execution_count": 27, 734 | "metadata": {}, 735 | "outputs": [ 736 | { 737 | "data": { 738 | "text/plain": [ 739 | "[NaN, 2.0, 3.0]\n", 740 | "Categories (2, int64): [3 < 2]" 741 | ] 742 | }, 743 | "execution_count": 27, 744 | "metadata": {}, 745 | "output_type": "execute_result" 746 | } 747 | ], 748 | "source": [ 749 | "c1 = c.copy()\n", 750 | "c1" 751 | ] 752 | }, 753 | { 754 | "cell_type": "code", 755 | "execution_count": 28, 756 | "metadata": {}, 757 | "outputs": [ 758 | { 759 | "data": { 760 | "text/plain": [ 761 | "[NaN, 2.0, 3.0]\n", 762 | "Categories (4, int64): [3 < 2 < 4 < 5]" 763 | ] 764 | }, 765 | "execution_count": 28, 766 | "metadata": {}, 767 | "output_type": "execute_result" 768 | } 769 | ], 770 | "source": [ 771 | "c1.add_categories([4,5], inplace = True)\n", 772 | "c1" 773 | ] 774 | }, 775 | { 776 | "cell_type": "markdown", 777 | "metadata": {}, 778 | "source": [ 779 | "## 2.4 Delete" 780 | ] 781 | 
}, 782 | { 783 | "cell_type": "markdown", 784 | "metadata": {}, 785 | "source": [ 786 | "### 2.4.1 Removing any unwanted category" 787 | ] 788 | }, 789 | { 790 | "cell_type": "markdown", 791 | "metadata": {}, 792 | "source": [ 793 | "##### `categories.remove_categories(cat,inplace=False)`\n", 794 | "- cat: the categories to remove; they must be among the old categories;\n", 795 | "- inplace: True or False, whether to modify in place." 796 | ] 797 | }, 798 | { 799 | "cell_type": "code", 800 | "execution_count": 29, 801 | "metadata": {}, 802 | "outputs": [ 803 | { 804 | "data": { 805 | "text/plain": [ 806 | "[NaN, 2.0, 3.0]\n", 807 | "Categories (3, int64): [3 < 2 < 5]" 808 | ] 809 | }, 810 | "execution_count": 29, 811 | "metadata": {}, 812 | "output_type": "execute_result" 813 | } 814 | ], 815 | "source": [ 816 | "c1.remove_categories([4],inplace=True)\n", 817 | "c1" 818 | ] 819 | }, 820 | { 821 | "cell_type": "markdown", 822 | "metadata": {}, 823 | "source": [ 824 | "### 2.4.2 Removing unused categories" 825 | ] 826 | }, 827 | { 828 | "cell_type": "markdown", 829 | "metadata": {}, 830 | "source": [ 831 | "##### `categories.remove_unused_categories(inplace=False)`\n", 832 | "- inplace: True or False, whether to modify in place." 833 | ] 834 | }, 835 | { 836 | "cell_type": "code", 837 | "execution_count": 30, 838 | "metadata": {}, 839 | "outputs": [ 840 | { 841 | "data": { 842 | "text/plain": [ 843 | "[NaN, 2.0, 3.0]\n", 844 | "Categories (2, int64): [3 < 2]" 845 | ] 846 | }, 847 | "execution_count": 30, 848 | "metadata": {}, 849 | "output_type": "execute_result" 850 | } 851 | ], 852 | "source": [ 853 | "c1.remove_unused_categories(inplace = True) # category 5 is removed\n", 854 | "c1" 855 | ] 856 | }, 857 | { 858 | "cell_type": "markdown", 859 | "metadata": {}, 860 | "source": [ 861 | "## 2.5 Modify, add, and delete in one call" 862 | ] 863 | }, 864 | { 865 | "cell_type": "markdown", 866 | "metadata": {}, 867 | "source": [ 868 | "##### `categories.set_categories(cat , ordered=None, rename=False, inplace=False)`\n", 869 | "- cat: the new list of categories; old categories missing from it are removed (their values become NaN) and categories not present before are added;\n", 870 | "- ordered: True or False, sets whether the result is ordered; if it is not given, the existing setting is kept, so it is best to state it explicitly;\n", 871 | "- 
rename: True or False, whether to treat the new list as a renaming of the old categories; I have not found this parameter useful (?);\n", 872 | "- inplace: True or False, whether to modify in place." 873 | ] 874 | }, 875 | { 876 | "cell_type": "code", 877 | "execution_count": 31, 878 | "metadata": {}, 879 | "outputs": [ 880 | { 881 | "data": { 882 | "text/plain": [ 883 | "[NaN, 2.0, 3.0]\n", 884 | "Categories (2, int64): [3 < 2]" 885 | ] 886 | }, 887 | "execution_count": 31, 888 | "metadata": {}, 889 | "output_type": "execute_result" 890 | } 891 | ], 892 | "source": [ 893 | "c1 = c.copy()\n", 894 | "c1" 895 | ] 896 | }, 897 | { 898 | "cell_type": "code", 899 | "execution_count": 32, 900 | "metadata": {}, 901 | "outputs": [ 902 | { 903 | "data": { 904 | "text/plain": [ 905 | "[NaN, 2.0, NaN]\n", 906 | "Categories (3, int64): [2 < 4 < 5]" 907 | ] 908 | }, 909 | "execution_count": 32, 910 | "metadata": {}, 911 | "output_type": "execute_result" 912 | } 913 | ], 914 | "source": [ 915 | "c1.set_categories([2,4,5], ordered=True, inplace=True) # removes the old category 3 and adds the new categories 4 and 5\n", 916 | "c1" 917 | ] 918 | }, 919 | { 920 | "cell_type": "code", 921 | "execution_count": null, 922 | "metadata": {}, 923 | "outputs": [], 924 | "source": [] 925 | }, 926 | { 927 | "cell_type": "markdown", 928 | "metadata": {}, 929 | "source": [ 930 | "# 3. 
cut() and qcut()\n", 931 | "These two functions split a continuous variable into a categorical one." 932 | ] 933 | }, 934 | { 935 | "cell_type": "markdown", 936 | "metadata": {}, 937 | "source": [ 938 | "## 3.1 cut()" 939 | ] 940 | }, 941 | { 942 | "cell_type": "markdown", 943 | "metadata": {}, 944 | "source": [ 945 | "##### `pd.cut(x, bins, right=True,include_lowest=False, labels=None, retbins=False)`\n", 946 | "- x: the Series or sequence to split;\n", 947 | "- bins: if an int, the range of x is divided into that many equal-width bins, with the range extended by 0.1% beyond the minimum and maximum to form the outer edges; if a sequence, its values are used as the bin edges;\n", 948 | "- right: True or False, whether the intervals are closed on the right; the default True gives left-open, right-closed intervals;\n", 949 | "- include_lowest: True or False, whether the first interval should include its left edge, so that the minimum value is not left out;\n", 950 | "- labels: the result consists of intervals; labels replaces them with the category names you want;\n", 951 | "- retbins: whether to also return the bin edges;" 952 | ] 953 | }, 954 | { 955 | "cell_type": "code", 956 | "execution_count": 33, 957 | "metadata": {}, 958 | "outputs": [ 959 | { 960 | "data": { 961 | "text/plain": [ 962 | "0 0\n", 963 | "1 1\n", 964 | "2 2\n", 965 | "3 3\n", 966 | "4 4\n", 967 | "dtype: int64" 968 | ] 969 | }, 970 | "execution_count": 33, 971 | "metadata": {}, 972 | "output_type": "execute_result" 973 | } 974 | ], 975 | "source": [ 976 | "s = pd.Series(range(0,5))\n", 977 | "s" 978 | ] 979 | }, 980 | { 981 | "cell_type": "code", 982 | "execution_count": 34, 983 | "metadata": {}, 984 | "outputs": [ 985 | { 986 | "data": { 987 | "text/plain": [ 988 | "0 (-0.004, 1.333]\n", 989 | "1 (-0.004, 1.333]\n", 990 | "2 (1.333, 2.667]\n", 991 | "3 (2.667, 4.0]\n", 992 | "4 (2.667, 4.0]\n", 993 | "dtype: category\n", 994 | "Categories (3, interval[float64]): [(-0.004, 1.333] < (1.333, 2.667] < (2.667, 4.0]]" 995 | ] 996 | }, 997 | "execution_count": 34, 998 | "metadata": {}, 999 | "output_type": "execute_result" 1000 | } 1001 | ], 1002 | "source": [ 1003 | "pd.cut(s, 3) # there are 3 categories, each a left-open, right-closed interval (]" 1004 | ] 1005 | }, 1006 | { 1007 | "cell_type": "code", 1008 | "execution_count": 35, 1009 | "metadata": {}, 1010 | "outputs": [ 1011 | { 1012 | "data": { 1013 | "text/plain": [ 1014 | "0 a\n", 1015 | "1 a\n", 1016 | "2 b\n", 1017 | "3 c\n", 1018 | "4 c\n", 1019 | "dtype: 
category\n", 1020 | "Categories (3, object): [a < b < c]" 1021 | ] 1022 | }, 1023 | "execution_count": 35, 1024 | "metadata": {}, 1025 | "output_type": "execute_result" 1026 | } 1027 | ], 1028 | "source": [ 1029 | "pd.cut(s, 3, labels=['a','b','c']) # much easier to read" 1030 | ] 1031 | }, 1032 | { 1033 | "cell_type": "code", 1034 | "execution_count": 36, 1035 | "metadata": {}, 1036 | "outputs": [ 1037 | { 1038 | "data": { 1039 | "text/plain": [ 1040 | "(0 a\n", 1041 | " 1 a\n", 1042 | " 2 b\n", 1043 | " 3 c\n", 1044 | " 4 c\n", 1045 | " dtype: category\n", 1046 | " Categories (3, object): [a < b < c],\n", 1047 | " array([-0.004 , 1.33333333, 2.66666667, 4. ]))" 1048 | ] 1049 | }, 1050 | "execution_count": 36, 1051 | "metadata": {}, 1052 | "output_type": "execute_result" 1053 | } 1054 | ], 1055 | "source": [ 1056 | "pd.cut( s, 3, labels = ['a','b','c'], retbins=True) # the bin edges are returned as well" 1057 | ] 1058 | }, 1059 | { 1060 | "cell_type": "code", 1061 | "execution_count": 37, 1062 | "metadata": {}, 1063 | "outputs": [ 1064 | { 1065 | "data": { 1066 | "text/plain": [ 1067 | "0 [0.0, 2.5)\n", 1068 | "1 [0.0, 2.5)\n", 1069 | "2 [0.0, 2.5)\n", 1070 | "3 [2.5, 4.0)\n", 1071 | "4 NaN\n", 1072 | "dtype: category\n", 1073 | "Categories (2, interval[float64]): [[0.0, 2.5) < [2.5, 4.0)]" 1074 | ] 1075 | }, 1076 | "execution_count": 37, 1077 | "metadata": {}, 1078 | "output_type": "execute_result" 1079 | } 1080 | ], 1081 | "source": [ 1082 | "pd.cut(s,[0,2.5,4], right=False) # left-closed, right-open: 4 is excluded, so 4 falls into no category" 1083 | ] 1084 | }, 1085 | { 1086 | "cell_type": "code", 1087 | "execution_count": 38, 1088 | "metadata": {}, 1089 | "outputs": [ 1090 | { 1091 | "data": { 1092 | "text/plain": [ 1093 | "0 NaN\n", 1094 | "1 (0.0, 2.5]\n", 1095 | "2 (0.0, 2.5]\n", 1096 | "3 (2.5, 4.0]\n", 1097 | "4 (2.5, 4.0]\n", 1098 | "dtype: category\n", 1099 | "Categories (2, interval[float64]): [(0.0, 2.5] < (2.5, 4.0]]" 1100 | ] 1101 | }, 1102 | "execution_count": 38, 1103 | "metadata": {}, 1104 | "output_type": "execute_result" 1105 | 
} 1106 | ], 1107 | "source": [ 1108 | "pd.cut(s, [0,2.5,4], right=True) # left-open, right-closed: 0 is excluded, so 0 falls into no category" 1109 | ] 1110 | }, 1111 | { 1112 | "cell_type": "code", 1113 | "execution_count": 39, 1114 | "metadata": {}, 1115 | "outputs": [ 1116 | { 1117 | "data": { 1118 | "text/plain": [ 1119 | "0 (-0.001, 2.5]\n", 1120 | "1 (-0.001, 2.5]\n", 1121 | "2 (-0.001, 2.5]\n", 1122 | "3 (2.5, 4.0]\n", 1123 | "4 (2.5, 4.0]\n", 1124 | "dtype: category\n", 1125 | "Categories (2, interval[float64]): [(-0.001, 2.5] < (2.5, 4.0]]" 1126 | ] 1127 | }, 1128 | "execution_count": 39, 1129 | "metadata": {}, 1130 | "output_type": "execute_result" 1131 | } 1132 | ], 1133 | "source": [ 1134 | "pd.cut(s, [0,2.5,4], right=True, include_lowest=True) # the lowest value is now included" 1135 | ] 1136 | }, 1137 | { 1138 | "cell_type": "markdown", 1139 | "metadata": {}, 1140 | "source": [ 1141 | "## 3.2 qcut()" 1142 | ] 1143 | }, 1144 | { 1145 | "cell_type": "markdown", 1146 | "metadata": {}, 1147 | "source": [ 1148 | "##### `pd.qcut(x, q, labels=None, retbins=False)`\n", 1149 | "- x: the Series or sequence to split;\n", 1150 | "- q: defines the bin edges by quantiles rather than by given values;\n", 1151 | "- labels: the result consists of intervals; labels replaces them with the category names you want;\n", 1152 | "- retbins: whether to also return the bin edges;" 1153 | ] 1154 | }, 1155 | { 1156 | "cell_type": "code", 1157 | "execution_count": 40, 1158 | "metadata": {}, 1159 | "outputs": [ 1160 | { 1161 | "data": { 1162 | "text/plain": [ 1163 | "0 a\n", 1164 | "1 a\n", 1165 | "2 b\n", 1166 | "3 c\n", 1167 | "4 d\n", 1168 | "dtype: category\n", 1169 | "Categories (4, object): [a < b < c < d]" 1170 | ] 1171 | }, 1172 | "execution_count": 40, 1173 | "metadata": {}, 1174 | "output_type": "execute_result" 1175 | } 1176 | ], 1177 | "source": [ 1178 | "pd.qcut(s, q = [0.0, 0.25, 0.5, 0.75, 1.0], labels=['a','b','c','d']) \n", 1179 | "# 5 quantile points form 4 intervals. The behavior appears to match right=True, include_lowest=True" 1180 | ] 1181 | }, 1182 | { 1183 | "cell_type": "code", 1184 | "execution_count": null, 1185 | "metadata": {}, 1186 | "outputs": [], 1187 | "source": [] 1188 | } 1189 | ], 1190 | "metadata": { 
1191 | "kernelspec": { 1192 | "display_name": "Python 3", 1193 | "language": "python", 1194 | "name": "python3" 1195 | }, 1196 | "language_info": { 1197 | "codemirror_mode": { 1198 | "name": "ipython", 1199 | "version": 3 1200 | }, 1201 | "file_extension": ".py", 1202 | "mimetype": "text/x-python", 1203 | "name": "python", 1204 | "nbconvert_exporter": "python", 1205 | "pygments_lexer": "ipython3", 1206 | "version": "3.7.0" 1207 | }, 1208 | "toc": { 1209 | "base_numbering": 1, 1210 | "nav_menu": { 1211 | "height": "318px", 1212 | "width": "252px" 1213 | }, 1214 | "number_sections": false, 1215 | "sideBar": true, 1216 | "skip_h1_title": false, 1217 | "title_cell": "Table of Contents", 1218 | "title_sidebar": "Contents", 1219 | "toc_cell": false, 1220 | "toc_position": {}, 1221 | "toc_section_display": "block", 1222 | "toc_window_display": true 1223 | } 1224 | }, 1225 | "nbformat": 4, 1226 | "nbformat_minor": 2 1227 | } 1228 | -------------------------------------------------------------------------------- /14. 
Object型操作.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "id": "0485413d-d8e5-4156-9eed-a8df3eca8caa", 6 | "metadata": {}, 7 | "source": [ 8 | "# Operations on object dtype\n", 9 | "\n", 10 | "Operations on elements that are strings or compound types, such as sequence types (tuple, list) and dicts." 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 1, 16 | "id": "06dcbb60-10a4-4428-b351-827300085735", 17 | "metadata": {}, 18 | "outputs": [], 19 | "source": [ 20 | "__auther__ = 'zhenhang.sun@gmail.com'" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 2, 26 | "id": "75514624-4fb5-4d6e-a798-0e29d74ab65d", 27 | "metadata": {}, 28 | "outputs": [ 29 | { 30 | "data": { 31 | "text/plain": [ 32 | "'D:\\\\github\\\\pandas-tutorial'" 33 | ] 34 | }, 35 | "execution_count": 2, 36 | "metadata": {}, 37 | "output_type": "execute_result" 38 | } 39 | ], 40 | "source": [ 41 | "pwd" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 3, 47 | "id": "87e9525b-0868-43e1-a7f2-d9ef1d701db1", 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "import pandas as pd" 52 | ] 53 | }, 54 | { 55 | "cell_type": "code", 56 | "execution_count": null, 57 | "id": "f95a93c9-09ae-4cdb-bff3-eeecb67e63fd", 58 | "metadata": {}, 59 | "outputs": [], 60 | "source": [] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "id": "ddd37be4-2523-44e0-8a5d-a412a1eac1c8", 65 | "metadata": { 66 | "tags": [] 67 | }, 68 | "source": [ 69 | "# 0.\n", 70 | "\n", 71 | "**A unified accessor prefix: .str**" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": 4, 77 | "id": "30716e0b-4773-4619-b244-9f29469d4505", 78 | "metadata": {}, 79 | "outputs": [ 80 | { 81 | "data": { 82 | "text/html": [ 83 | "
" 122 | ], 123 | "text/plain": [ 124 | " str list dict\n", 125 | "0 abcidef [1, 2, 3] {'a': 1, 'b': 2}\n", 126 | "1 hahabciidef [4, 5, 6] {'a': 10, 'b': 10}" 127 | ] 128 | }, 129 | "execution_count": 4, 130 | "metadata": {}, 131 | "output_type": "execute_result" 132 | } 133 | ], 134 | "source": [ 135 | "df = pd.DataFrame({\"str\": [\"abcidef\", \"hahabciidef\"],\n", 136 | " \"list\":[[1,2,3],[4,5,6]],\n", 137 | " \"dict\":[{'a':1, 'b':2}, {'a':10, 'b':10}]\n", 138 | " \n", 139 | " })\n", 140 | "df" 141 | ] 142 | }, 143 | { 144 | "cell_type": "markdown", 145 | "id": "477aedb8-bf1c-4339-82ba-01dd917c7e92", 146 | "metadata": { 147 | "tags": [] 148 | }, 149 | "source": [ 150 | "# 1. 字符串操作" 151 | ] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "id": "e5e7fa21-e544-4f5b-8796-ca69ffd6c49a", 156 | "metadata": {}, 157 | "source": [ 158 | "## 1.1 字符串变换\n", 159 | "\n", 160 | "常用的字符串操作都支持:可以自行查看。" 161 | ] 162 | }, 163 | { 164 | "cell_type": "code", 165 | "execution_count": 5, 166 | "id": "910b843e-b943-4d8b-aa5a-2e5868100fae", 167 | "metadata": {}, 168 | "outputs": [ 169 | { 170 | "data": { 171 | "text/plain": [ 172 | "0 ABCIDEF\n", 173 | "1 HAHABCIIDEF\n", 174 | "Name: str, dtype: object" 175 | ] 176 | }, 177 | "execution_count": 5, 178 | "metadata": {}, 179 | "output_type": "execute_result" 180 | } 181 | ], 182 | "source": [ 183 | "df['str'].str.upper()" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 6, 189 | "id": "113d9078-60d7-4fe2-bb1f-4cc288763192", 190 | "metadata": {}, 191 | "outputs": [ 192 | { 193 | "data": { 194 | "text/plain": [ 195 | "0 Abcidef\n", 196 | "1 Hahabciidef\n", 197 | "Name: str, dtype: object" 198 | ] 199 | }, 200 | "execution_count": 6, 201 | "metadata": {}, 202 | "output_type": "execute_result" 203 | } 204 | ], 205 | "source": [ 206 | "df['str'].str.capitalize()" 207 | ] 208 | }, 209 | { 210 | "cell_type": "code", 211 | "execution_count": null, 212 | "id": "0659267a-d330-4b7f-af33-46d022796dbc", 213 | 
"metadata": {}, 214 | "outputs": [], 215 | "source": [] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "id": "a36f7a83-74e0-485d-9964-c03f300db770", 220 | "metadata": {}, 221 | "source": [ 222 | "## 1.2 字符串提取-正则\n" 223 | ] 224 | }, 225 | { 226 | "cell_type": "code", 227 | "execution_count": 7, 228 | "id": "c635805e-489f-433c-afa3-bc0af1cf1b7c", 229 | "metadata": {}, 230 | "outputs": [ 231 | { 232 | "data": { 233 | "text/html": [ 234 | "
" 270 | ], 271 | "text/plain": [ 272 | " 0 1\n", 273 | "0 abc def\n", 274 | "1 abc def" 275 | ] 276 | }, 277 | "execution_count": 7, 278 | "metadata": {}, 279 | "output_type": "execute_result" 280 | } 281 | ], 282 | "source": [ 283 | "df['str'].str.extract(r\"(abc).*?(def)\")" 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "id": "18a813fd-9ea7-458e-b87d-26e8adc131de", 289 | "metadata": {}, 290 | "source": [ 291 | "## 1.3 查找" 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": 8, 297 | "id": "fc7670a5-7dc1-438b-ae19-5cea28590625", 298 | "metadata": {}, 299 | "outputs": [ 300 | { 301 | "data": { 302 | "text/plain": [ 303 | "0 0\n", 304 | "1 3\n", 305 | "Name: str, dtype: int64" 306 | ] 307 | }, 308 | "execution_count": 8, 309 | "metadata": {}, 310 | "output_type": "execute_result" 311 | } 312 | ], 313 | "source": [ 314 | "df['str'].str.find('abc')" 315 | ] 316 | }, 317 | { 318 | "cell_type": "markdown", 319 | "id": "6adc383f-9cc0-4832-b3c8-4ae523eed5ed", 320 | "metadata": {}, 321 | "source": [ 322 | "## 1.4 分割" 323 | ] 324 | }, 325 | { 326 | "cell_type": "code", 327 | "execution_count": 9, 328 | "id": "ce57630a-d5cb-46e1-9b55-3b33c41fdb46", 329 | "metadata": {}, 330 | "outputs": [ 331 | { 332 | "data": { 333 | "text/plain": [ 334 | "0 [abc, def]\n", 335 | "1 [hahabc, , def]\n", 336 | "Name: str, dtype: object" 337 | ] 338 | }, 339 | "execution_count": 9, 340 | "metadata": {}, 341 | "output_type": "execute_result" 342 | } 343 | ], 344 | "source": [ 345 | "df['str'].str.split('i')" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": 10, 351 | "id": "1f23ba45-10bb-4b15-a16e-e43c931b7da5", 352 | "metadata": {}, 353 | "outputs": [ 354 | { 355 | "data": { 356 | "text/html": [ 357 | "
" 396 | ], 397 | "text/plain": [ 398 | " 0 1 2\n", 399 | "0 abc def None\n", 400 | "1 hahabc def" 401 | ] 402 | }, 403 | "execution_count": 10, 404 | "metadata": {}, 405 | "output_type": "execute_result" 406 | } 407 | ], 408 | "source": [ 409 | "df['str'].str.split('i', expand=True)" 410 | ] 411 | }, 412 | { 413 | "cell_type": "markdown", 414 | "id": "ca9988fb-2ea0-448a-9155-9309dc0cfbfa", 415 | "metadata": { 416 | "tags": [] 417 | }, 418 | "source": [ 419 | "## 1.5 包含判断" 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": 11, 425 | "id": "52d5f38b-db26-4580-ba54-4022cac12375", 426 | "metadata": {}, 427 | "outputs": [ 428 | { 429 | "data": { 430 | "text/plain": [ 431 | "0 True\n", 432 | "1 True\n", 433 | "Name: str, dtype: bool" 434 | ] 435 | }, 436 | "execution_count": 11, 437 | "metadata": {}, 438 | "output_type": "execute_result" 439 | } 440 | ], 441 | "source": [ 442 | "#取第 个位置的量\n", 443 | "df['str'].str.contains('abc')" 444 | ] 445 | }, 446 | { 447 | "cell_type": "code", 448 | "execution_count": null, 449 | "id": "3fa1d363-b1c9-47cb-a118-2acd2ca23644", 450 | "metadata": {}, 451 | "outputs": [], 452 | "source": [] 453 | }, 454 | { 455 | "cell_type": "markdown", 456 | "id": "fc594985-f9e0-4e48-a14c-3bc1130b2638", 457 | "metadata": { 458 | "tags": [] 459 | }, 460 | "source": [ 461 | "# 2. 列表类型" 462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": 12, 467 | "id": "a9bc121b-15c3-4318-acef-d7c1efdeff08", 468 | "metadata": {}, 469 | "outputs": [ 470 | { 471 | "data": { 472 | "text/html": [ 473 | "
" 512 | ], 513 | "text/plain": [ 514 | " str list dict\n", 515 | "0 abcidef [1, 2, 3] {'a': 1, 'b': 2}\n", 516 | "1 hahabciidef [4, 5, 6] {'a': 10, 'b': 10}" 517 | ] 518 | }, 519 | "execution_count": 12, 520 | "metadata": {}, 521 | "output_type": "execute_result" 522 | } 523 | ], 524 | "source": [ 525 | "df" 526 | ] 527 | }, 528 | { 529 | "cell_type": "markdown", 530 | "id": "4210a82a-d4cb-42ac-897f-8dd399ac47f1", 531 | "metadata": {}, 532 | "source": [ 533 | "## 2.1 取元素" 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": 13, 539 | "id": "be3cd399-4e33-4d34-9d76-41470afa302e", 540 | "metadata": {}, 541 | "outputs": [ 542 | { 543 | "data": { 544 | "text/plain": [ 545 | "0 2\n", 546 | "1 5\n", 547 | "Name: list, dtype: int64" 548 | ] 549 | }, 550 | "execution_count": 13, 551 | "metadata": {}, 552 | "output_type": "execute_result" 553 | } 554 | ], 555 | "source": [ 556 | "#取第 个位置的量\n", 557 | "df['list'].str.get(1)" 558 | ] 559 | }, 560 | { 561 | "cell_type": "markdown", 562 | "id": "b87d818c-7df0-4120-9386-9f851bc09a5d", 563 | "metadata": {}, 564 | "source": [ 565 | "## 2.2 展开\n" 566 | ] 567 | }, 568 | { 569 | "cell_type": "code", 570 | "execution_count": 14, 571 | "id": "1501fd67-5d88-4de5-8341-eebedc759cc4", 572 | "metadata": {}, 573 | "outputs": [ 574 | { 575 | "data": { 576 | "text/plain": [ 577 | "0 1\n", 578 | "0 2\n", 579 | "0 3\n", 580 | "1 4\n", 581 | "1 5\n", 582 | "1 6\n", 583 | "Name: list, dtype: object" 584 | ] 585 | }, 586 | "execution_count": 14, 587 | "metadata": {}, 588 | "output_type": "execute_result" 589 | } 590 | ], 591 | "source": [ 592 | "# 炸开 series 形式\n", 593 | "# 注意此时不需要使用 .str\n", 594 | "df['list'].explode()" 595 | ] 596 | }, 597 | { 598 | "cell_type": "code", 599 | "execution_count": 15, 600 | "id": "f7bc208a-aef9-4753-87ec-f7722f373430", 601 | "metadata": {}, 602 | "outputs": [ 603 | { 604 | "data": { 605 | "text/html": [ 606 | "
" 669 | ], 670 | "text/plain": [ 671 | " str list dict\n", 672 | "0 abcidef 1 {'a': 1, 'b': 2}\n", 673 | "0 abcidef 2 {'a': 1, 'b': 2}\n", 674 | "0 abcidef 3 {'a': 1, 'b': 2}\n", 675 | "1 hahabciidef 4 {'a': 10, 'b': 10}\n", 676 | "1 hahabciidef 5 {'a': 10, 'b': 10}\n", 677 | "1 hahabciidef 6 {'a': 10, 'b': 10}" 678 | ] 679 | }, 680 | "execution_count": 15, 681 | "metadata": {}, 682 | "output_type": "execute_result" 683 | } 684 | ], 685 | "source": [ 686 | "#炸开 dataframe 指定某列\n", 687 | "df.explode('list')" 688 | ] 689 | }, 690 | { 691 | "cell_type": "code", 692 | "execution_count": 16, 693 | "id": "9e53f211-daba-41b8-ab62-9f114afc81e4", 694 | "metadata": {}, 695 | "outputs": [ 696 | { 697 | "data": { 698 | "text/html": [ 699 | "
" 738 | ], 739 | "text/plain": [ 740 | " e1 e2 e3\n", 741 | "0 1 2 3\n", 742 | "1 4 5 6" 743 | ] 744 | }, 745 | "execution_count": 16, 746 | "metadata": {}, 747 | "output_type": "execute_result" 748 | } 749 | ], 750 | "source": [ 751 | "# 变为多列\n", 752 | "pd.DataFrame( df['list'].tolist(), columns=['e1', 'e2', 'e3'])" 753 | ] 754 | }, 755 | { 756 | "cell_type": "code", 757 | "execution_count": null, 758 | "id": "e1999fa5-79d8-46f8-9531-6c5365081768", 759 | "metadata": {}, 760 | "outputs": [], 761 | "source": [] 762 | }, 763 | { 764 | "cell_type": "markdown", 765 | "id": "03e24e48-5766-4f8f-8ab5-7e5bd7042a6c", 766 | "metadata": { 767 | "tags": [] 768 | }, 769 | "source": [ 770 | "# 3. 字典类型" 771 | ] 772 | }, 773 | { 774 | "cell_type": "code", 775 | "execution_count": 17, 776 | "id": "78339a64-6d4d-43cb-8688-13213552e549", 777 | "metadata": {}, 778 | "outputs": [ 779 | { 780 | "data": { 781 | "text/html": [ 782 | "
" 821 | ], 822 | "text/plain": [ 823 | " str list dict\n", 824 | "0 abcidef [1, 2, 3] {'a': 1, 'b': 2}\n", 825 | "1 hahabciidef [4, 5, 6] {'a': 10, 'b': 10}" 826 | ] 827 | }, 828 | "execution_count": 17, 829 | "metadata": {}, 830 | "output_type": "execute_result" 831 | } 832 | ], 833 | "source": [ 834 | "df" 835 | ] 836 | }, 837 | { 838 | "cell_type": "markdown", 839 | "id": "84ba8ff7-1a1f-4c12-b7b1-1bb4cf5759e9", 840 | "metadata": {}, 841 | "source": [ 842 | "## 3.1 取元素" 843 | ] 844 | }, 845 | { 846 | "cell_type": "code", 847 | "execution_count": 18, 848 | "id": "d7a76a3c-5a9f-4dc2-8475-491acdeaea1d", 849 | "metadata": {}, 850 | "outputs": [ 851 | { 852 | "data": { 853 | "text/plain": [ 854 | "0 1\n", 855 | "1 10\n", 856 | "Name: dict, dtype: int64" 857 | ] 858 | }, 859 | "execution_count": 18, 860 | "metadata": {}, 861 | "output_type": "execute_result" 862 | } 863 | ], 864 | "source": [ 865 | "#取key对应的值\n", 866 | "df['dict'].str.get('a')" 867 | ] 868 | }, 869 | { 870 | "cell_type": "markdown", 871 | "id": "8331bb98-26c3-4a3b-90bb-dae05611124a", 872 | "metadata": {}, 873 | "source": [ 874 | "## 3.2 展开" 875 | ] 876 | }, 877 | { 878 | "cell_type": "code", 879 | "execution_count": 19, 880 | "id": "8b079c37-9c34-414f-981b-d74b46a4d5c1", 881 | "metadata": {}, 882 | "outputs": [ 883 | { 884 | "data": { 885 | "text/plain": [ 886 | "0 a\n", 887 | "0 b\n", 888 | "1 a\n", 889 | "1 b\n", 890 | "Name: dict, dtype: object" 891 | ] 892 | }, 893 | "execution_count": 19, 894 | "metadata": {}, 895 | "output_type": "execute_result" 896 | } 897 | ], 898 | "source": [ 899 | "#炸开,可以只剩下key,事实上,这个操作也不适合dict\n", 900 | "df['dict'].explode()" 901 | ] 902 | }, 903 | { 904 | "cell_type": "code", 905 | "execution_count": 20, 906 | "id": "8ae3ef52-f8b9-4c79-9a7c-17fc6e4e4899", 907 | "metadata": {}, 908 | "outputs": [ 909 | { 910 | "data": { 911 | "text/html": [ 912 | "
" 948 | ], 949 | "text/plain": [ 950 | " a b\n", 951 | "0 1 2\n", 952 | "1 10 10" 953 | ] 954 | }, 955 | "execution_count": 20, 956 | "metadata": {}, 957 | "output_type": "execute_result" 958 | } 959 | ], 960 | "source": [ 961 | "# 变为多列\n", 962 | "pd.DataFrame(df['dict'].tolist())" 963 | ] 964 | } 965 | ], 966 | "metadata": { 967 | "kernelspec": { 968 | "display_name": "Python 3 (ipykernel)", 969 | "language": "python", 970 | "name": "python3" 971 | }, 972 | "language_info": { 973 | "codemirror_mode": { 974 | "name": "ipython", 975 | "version": 3 976 | }, 977 | "file_extension": ".py", 978 | "mimetype": "text/x-python", 979 | "name": "python", 980 | "nbconvert_exporter": "python", 981 | "pygments_lexer": "ipython3", 982 | "version": "3.10.0" 983 | }, 984 | "toc-autonumbering": false, 985 | "toc-showcode": false, 986 | "toc-showmarkdowntxt": false, 987 | "toc-showtags": false 988 | }, 989 | "nbformat": 4, 990 | "nbformat_minor": 5 991 | } 992 | -------------------------------------------------------------------------------- /3. 
merge详解.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# merge in detail\n", 8 | "I originally planned to cover merge in the section on adding data in the previous chapter, but it is one of the most heavily used operations in relational databases and has many variations, so this function gets a chapter of its own. " 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "metadata": {}, 15 | "outputs": [], 16 | "source": [ 17 | "__auther__ = 'zhenhang.sun@gmail.com'" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "metadata": {}, 24 | "outputs": [ 25 | { 26 | "data": { 27 | "text/plain": [ 28 | "'D:\\\\github\\\\pandas-tutorial'" 29 | ] 30 | }, 31 | "execution_count": 2, 32 | "metadata": {}, 33 | "output_type": "execute_result" 34 | } 35 | ], 36 | "source": [ 37 | "pwd" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 3, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "import numpy as np\n", 47 | "import pandas as pd" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "# 1. 
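Function overview" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "##### `pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False)`\n", 62 | "concat can only align on the index; to align and merge on arbitrary **columns** you need merge, an operation that is used heavily in SQL.\n", 63 | "- left, right: the two DataFrames to align and merge;\n", 64 | "- how: conceptually, take the Cartesian product first, then keep the rows the chosen mode requires and fill missing data with NaN;\n", 65 | " - left: the left DataFrame is the reference, i.e. all of its rows are kept (not necessarily one-to-one, since rows may be duplicated), in their original order;\n", 66 | " - right: the right DataFrame is the reference, in its original order;\n", 67 | " - inner: intersection; keep the rows whose `on` values match in both DataFrames, in the order of the left DataFrame;\n", 68 | " - outer: union; the result is re-sorted lexicographically by key;\n", 69 | "- on: column names or row-index names; use this parameter when the key has the same name in both DataFrames;\n", 70 | "- left_on, right_on, left_index, right_index:\n", 71 | " - left_on and right_on take column names or row-index names (so a row index is best treated like a column, with a name of its own); use these two when the key is named differently on each side;\n", 72 | " - left_index and right_index use the index itself as the key; not recommended, as it gets confusing.\n", 73 | "- sort: True or False; whether to re-sort the result lexicographically by key." 74 | ] 75 | }, 76 | { 77 | "cell_type": "code", 78 | "execution_count": 4, 79 | "metadata": {}, 80 | "outputs": [ 81 | { 82 | "data": { 83 | "text/html": [ 84 | "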
" 120 | ], 121 | "text/plain": [ 122 | " A B\n", 123 | "a 1 2\n", 124 | "b 3 4" 125 | ] 126 | }, 127 | "execution_count": 4, 128 | "metadata": {}, 129 | "output_type": "execute_result" 130 | } 131 | ], 132 | "source": [ 133 | "df1 = pd.DataFrame([[1,2],[3,4]], index = ['a','b'],columns = ['A','B'])\n", 134 | "df2 = pd.DataFrame([[1,3],[4,8]], index = ['b','d'],columns = ['B','C'])\n", 135 | "df1" 136 | ] 137 | }, 138 | { 139 | "cell_type": "code", 140 | "execution_count": 5, 141 | "metadata": {}, 142 | "outputs": [ 143 | { 144 | "data": { 145 | "text/html": [ 146 | "
" 182 | ], 183 | "text/plain": [ 184 | " B C\n", 185 | "b 1 3\n", 186 | "d 4 8" 187 | ] 188 | }, 189 | "execution_count": 5, 190 | "metadata": {}, 191 | "output_type": "execute_result" 192 | } 193 | ], 194 | "source": [ 195 | "df2" 196 | ] 197 | }, 198 | { 199 | "cell_type": "markdown", 200 | "metadata": {}, 201 | "source": [ 202 | "**如果单纯的按照index对齐,不如用concat方法。**" 203 | ] 204 | }, 205 | { 206 | "cell_type": "code", 207 | "execution_count": 6, 208 | "metadata": {}, 209 | "outputs": [ 210 | { 211 | "data": { 212 | "text/html": [ 213 | "
" 248 | ], 249 | "text/plain": [ 250 | " A B_x B_y C\n", 251 | "b 3 4 1 3" 252 | ] 253 | }, 254 | "execution_count": 6, 255 | "metadata": {}, 256 | "output_type": "execute_result" 257 | } 258 | ], 259 | "source": [ 260 | "pd.merge(left=df1, right= df2, how='inner' ,left_index=True, right_index=True)" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": 7, 266 | "metadata": {}, 267 | "outputs": [ 268 | { 269 | "data": { 270 | "text/html": [ 271 | "
" 306 | ], 307 | "text/plain": [ 308 | " A B B C\n", 309 | "b 3 4 1 3" 310 | ] 311 | }, 312 | "execution_count": 7, 313 | "metadata": {}, 314 | "output_type": "execute_result" 315 | } 316 | ], 317 | "source": [ 318 | "# 小区别是concat对重复列没有重命名,但是重名的情况不多,而且重名了说明之前设计就不大合理。\n", 319 | "pd.concat([df1, df2], join='inner', axis =1) " 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": null, 325 | "metadata": {}, 326 | "outputs": [], 327 | "source": [] 328 | }, 329 | { 330 | "cell_type": "markdown", 331 | "metadata": {}, 332 | "source": [ 333 | "# 2. on 用法\n", 334 | "设置 how='inner'" 335 | ] 336 | }, 337 | { 338 | "cell_type": "code", 339 | "execution_count": 8, 340 | "metadata": { 341 | "scrolled": true 342 | }, 343 | "outputs": [ 344 | { 345 | "data": { 346 | "text/html": [ 347 | "
" 380 | ], 381 | "text/plain": [ 382 | " A B C\n", 383 | "0 3 4 8" 384 | ] 385 | }, 386 | "execution_count": 8, 387 | "metadata": {}, 388 | "output_type": "execute_result" 389 | } 390 | ], 391 | "source": [ 392 | "#对于'B'列:df1的'b'行、df2的'd'行,是相同的,其他都不同。 \n", 393 | "pd.merge(left=df1, right=df2, how='inner', on=['B']) " 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": 9, 399 | "metadata": {}, 400 | "outputs": [ 401 | { 402 | "data": { 403 | "text/html": [ 404 | "
" 439 | ], 440 | "text/plain": [ 441 | " A B_x B_y C\n", 442 | "0 3 4 1 3" 443 | ] 444 | }, 445 | "execution_count": 9, 446 | "metadata": {}, 447 | "output_type": "execute_result" 448 | } 449 | ], 450 | "source": [ 451 | "# df1的'A'列'b'行,df2的'C'列'd'行是相同的,其他都不同。\n", 452 | "# 其他列如果同名会进行重命名。\n", 453 | "pd.merge(left=df1, right=df2, how='inner',left_on=['A'] ,right_on=['C'])" 454 | ] 455 | }, 456 | { 457 | "cell_type": "code", 458 | "execution_count": null, 459 | "metadata": {}, 460 | "outputs": [], 461 | "source": [] 462 | }, 463 | { 464 | "cell_type": "markdown", 465 | "metadata": {}, 466 | "source": [ 467 | "# 3. how用法" 468 | ] 469 | }, 470 | { 471 | "cell_type": "code", 472 | "execution_count": 10, 473 | "metadata": {}, 474 | "outputs": [ 475 | { 476 | "data": { 477 | "text/html": [ 478 | "
" 517 | ], 518 | "text/plain": [ 519 | " A B C\n", 520 | "0 1 2 NaN\n", 521 | "1 3 4 8.0" 522 | ] 523 | }, 524 | "execution_count": 10, 525 | "metadata": {}, 526 | "output_type": "execute_result" 527 | } 528 | ], 529 | "source": [ 530 | "# 保持左侧DataFrame不变,用右侧来跟它对齐,对不上的填NaN。\n", 531 | "pd.merge(left=df1, right=df2, how='left', on=['B'] )" 532 | ] 533 | }, 534 | { 535 | "cell_type": "code", 536 | "execution_count": 11, 537 | "metadata": {}, 538 | "outputs": [ 539 | { 540 | "data": { 541 | "text/html": [ 542 | "
" 581 | ], 582 | "text/plain": [ 583 | " A B C\n", 584 | "0 3.0 4 8\n", 585 | "1 NaN 1 3" 586 | ] 587 | }, 588 | "execution_count": 11, 589 | "metadata": {}, 590 | "output_type": "execute_result" 591 | } 592 | ], 593 | "source": [ 594 | "# 保持右侧DataFrame不变,用右侧来跟它对齐,对不上的填NaN。\n", 595 | "pd.merge(left=df1, right=df2, how='right', on=['B'] )" 596 | ] 597 | }, 598 | { 599 | "cell_type": "markdown", 600 | "metadata": {}, 601 | "source": [ 602 | "**对齐的列存在重复值,重复的也没关系,操作逻辑是一致的,完全可以假想不存在重复。**" 603 | ] 604 | }, 605 | { 606 | "cell_type": "code", 607 | "execution_count": 12, 608 | "metadata": { 609 | "scrolled": false 610 | }, 611 | "outputs": [ 612 | { 613 | "data": { 614 | "text/html": [ 615 | "
\n", 616 | "\n", 629 | "\n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | " \n", 637 | " \n", 638 | " \n", 639 | " \n", 640 | " \n", 641 | " \n", 642 | " \n", 643 | " \n", 644 | " \n", 645 | " \n", 646 | " \n", 647 | " \n", 648 | " \n", 649 | "
AB
a14
b34
\n", 650 | "
" 651 | ], 652 | "text/plain": [ 653 | " A B\n", 654 | "a 1 4\n", 655 | "b 3 4" 656 | ] 657 | }, 658 | "execution_count": 12, 659 | "metadata": {}, 660 | "output_type": "execute_result" 661 | } 662 | ], 663 | "source": [ 664 | "df1.loc['a','B'] = 4 # make column B contain a duplicate value\n", 665 | "df1" 666 | ] 667 | }, 668 | { 669 | "cell_type": "code", 670 | "execution_count": 13, 671 | "metadata": { 672 | "scrolled": true 673 | }, 674 | "outputs": [ 675 | { 676 | "data": { 677 | "text/html": [ 678 | "
\n", 679 | "\n", 692 | "\n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | "
ABC
01.048
13.048
2NaN13
\n", 722 | "
" 723 | ], 724 | "text/plain": [ 725 | " A B C\n", 726 | "0 1.0 4 8\n", 727 | "1 3.0 4 8\n", 728 | "2 NaN 1 3" 729 | ] 730 | }, 731 | "execution_count": 13, 732 | "metadata": {}, 733 | "output_type": "execute_result" 734 | } 735 | ], 736 | "source": [ 737 | "# Every row of the right DataFrame is kept; if the left join column contains duplicate values, the matched rows are duplicated in the result as well.\n", 738 | "pd.merge(left=df1, right=df2, how='right', on=['B'] )" 739 | ] 740 | }, 741 | { 742 | "cell_type": "code", 743 | "execution_count": null, 744 | "metadata": {}, 745 | "outputs": [], 746 | "source": [] 747 | } 748 | ], 749 | "metadata": { 750 | "kernelspec": { 751 | "display_name": "Python 3", 752 | "language": "python", 753 | "name": "python3" 754 | }, 755 | "language_info": { 756 | "codemirror_mode": { 757 | "name": "ipython", 758 | "version": 3 759 | }, 760 | "file_extension": ".py", 761 | "mimetype": "text/x-python", 762 | "name": "python", 763 | "nbconvert_exporter": "python", 764 | "pygments_lexer": "ipython3", 765 | "version": "3.7.0" 766 | }, 767 | "toc": { 768 | "base_numbering": 1, 769 | "nav_menu": { 770 | "height": "84px", 771 | "width": "252px" 772 | }, 773 | "number_sections": false, 774 | "sideBar": true, 775 | "skip_h1_title": false, 776 | "title_cell": "Table of Contents", 777 | "title_sidebar": "Contents", 778 | "toc_cell": false, 779 | "toc_position": { 780 | "height": "485px", 781 | "left": "0px", 782 | "right": "1146px", 783 | "top": "66px", 784 | "width": "134px" 785 | }, 786 | "toc_section_display": "block", 787 | "toc_window_display": true 788 | } 789 | }, 790 | "nbformat": 4, 791 | "nbformat_minor": 2 792 | } 793 | -------------------------------------------------------------------------------- /6. 
数据结构总览.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Overview of Data Structures\n", 8 | "The previous chapters introduced the three data structures we use most often. This chapter summarizes them and then, building on that, introduces the basic data types and some very useful intermediate types." 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "metadata": {}, 15 | "outputs": [], 16 | "source": [ 17 | "__auther__ = 'zhenhang.sun@gmail.com'" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "metadata": {}, 24 | "outputs": [ 25 | { 26 | "data": { 27 | "text/plain": [ 28 | "'D:\\\\github\\\\pandas-tutorial'" 29 | ] 30 | }, 31 | "execution_count": 2, 32 | "metadata": {}, 33 | "output_type": "execute_result" 34 | } 35 | ], 36 | "source": [ 37 | "pwd" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 3, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "import numpy as np\n", 47 | "import pandas as pd" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "# 1. 
Three Commonly Used Data Structures" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "## 1.1 Series\n", 69 | "A Series is a one-dimensional data structure whose elements all share the same data type; it behaves partly like a list and partly like a dict." 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "**Four important attributes**\n", 77 | "- Series.index\n", 78 | "- Series.name\n", 79 | "- Series.values\n", 80 | "- Series.dtype" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "## 1.2 DataFrame\n", 88 | "A DataFrame is a two-dimensional data structure made up of Series that share the same index." 89 | ] 90 | }, 91 | { 92 | "cell_type": "markdown", 93 | "metadata": {}, 94 | "source": [ 95 | "**Four important attributes**\n", 96 | "- DataFrame.index\n", 97 | "- DataFrame.columns\n", 98 | "- DataFrame.values\n", 99 | "- DataFrame.dtypes" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "## 1.3 Index\n", 107 | "An Index is the key to building and manipulating Series and DataFrame objects; it behaves like a tuple." 108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "metadata": {}, 113 | "source": [ 114 | "**Three important attributes**\n", 115 | "- Index.name\n", 116 | "- Index.values\n", 117 | "- Index.dtype" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": null, 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "# 2. 
Basic Data Types\n", 132 | "- These data types all come from NumPy;\n", 133 | "- the basic data types do not include a string type, so strings are stored as `object_`;\n", 134 | "- because the types come from NumPy, add the prefix `np.` when using them." 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "metadata": {}, 140 | "source": [ 141 | "## 2.1 Boolean" 142 | ] 143 | }, 144 | { 145 | "cell_type": "code", 146 | "execution_count": 4, 147 | "metadata": {}, 148 | "outputs": [], 149 | "source": [ 150 | "columns = [u'类别',u'说明 ',u'简称']" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": 5, 156 | "metadata": {}, 157 | "outputs": [ 158 | { 159 | "data": { 160 | "text/html": [ 161 | "
\n", 162 | "\n", 175 | "\n", 176 | " \n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | "
类别说明简称
0bool_compatible: Python bool?
1bool88 bits
\n", 199 | "
" 200 | ], 201 | "text/plain": [ 202 | " 类别 说明 简称\n", 203 | "0 bool_ compatible: Python bool ?\n", 204 | "1 bool8 8 bits " 205 | ] 206 | }, 207 | "execution_count": 5, 208 | "metadata": {}, 209 | "output_type": "execute_result" 210 | } 211 | ], 212 | "source": [ 213 | "data = [ ['bool_','compatible: Python bool','?'],\n", 214 | " ['bool8','8 bits',''] ] \n", 215 | "pd.DataFrame(data, columns = columns)" 216 | ] 217 | }, 218 | { 219 | "cell_type": "markdown", 220 | "metadata": {}, 221 | "source": [ 222 | "## 2.2 整型" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "### 2.2.1 有符号整型" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 6, 235 | "metadata": {}, 236 | "outputs": [ 237 | { 238 | "data": { 239 | "text/html": [ 240 | "
\n", 241 | "\n", 254 | "\n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | " \n", 302 | " \n", 303 | " \n", 304 | " \n", 305 | " \n", 306 | " \n", 307 | " \n", 308 | " \n", 309 | " \n", 310 | " \n", 311 | " \n", 312 | " \n", 313 | " \n", 314 | " \n", 315 | " \n", 316 | " \n", 317 | " \n", 318 | " \n", 319 | " \n", 320 | " \n", 321 | " \n", 322 | " \n", 323 | " \n", 324 | " \n", 325 | "
类别说明简称
0bytecompatible: C charb
1shortcompatible: C shorth
2intccompatible: C inti
3int_compatible: Python intl
4longlongcompatible: C long longq
5intplarge enough to fit a pointerp
6int88 bits
7int1616 bits
8int3232 bits
9int6464 bits
\n", 326 | "
" 327 | ], 328 | "text/plain": [ 329 | " 类别 说明 简称\n", 330 | "0 byte compatible: C char b\n", 331 | "1 short compatible: C short h\n", 332 | "2 intc compatible: C int i\n", 333 | "3 int_ compatible: Python int l\n", 334 | "4 longlong compatible: C long long q\n", 335 | "5 intp large enough to fit a pointer p\n", 336 | "6 int8 8 bits \n", 337 | "7 int16 16 bits \n", 338 | "8 int32 32 bits \n", 339 | "9 int64 64 bits " 340 | ] 341 | }, 342 | "execution_count": 6, 343 | "metadata": {}, 344 | "output_type": "execute_result" 345 | } 346 | ], 347 | "source": [ 348 | "data = [['byte','compatible: C char','b'],\n", 349 | "['short','compatible: C short','h'],\n", 350 | "['intc','compatible: C int','i'],\n", 351 | "['int_','compatible: Python int','l'],\n", 352 | "['longlong','compatible: C long long','q'],\n", 353 | "['intp','large enough to fit a pointer','p'],\n", 354 | "['int8','8 bits','' ],\n", 355 | "['int16','16 bits','' ],\n", 356 | "['int32','32 bits',''],\n", 357 | "['int64','64 bits','']]\n", 358 | "pd.DataFrame(data = data, columns = columns)" 359 | ] 360 | }, 361 | { 362 | "cell_type": "markdown", 363 | "metadata": {}, 364 | "source": [ 365 | "### 2.2.2 无符号整型" 366 | ] 367 | }, 368 | { 369 | "cell_type": "code", 370 | "execution_count": 7, 371 | "metadata": {}, 372 | "outputs": [ 373 | { 374 | "data": { 375 | "text/html": [ 376 | "
\n", 377 | "\n", 390 | "\n", 391 | " \n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | "
类别说明简称
0ubytecompatible: C unsigned charB
1ushortcompatible: C unsigned shortH
2uintccompatible: C unsigned intI
3uintcompatible: Python intL
4ulonglongcompatible: C long longQ
5uintplarge enough to fit a pointerP
6uint88 bits
7uint1616 bits
8uint3232 bits
9uint6464 bits
\n", 462 | "
" 463 | ], 464 | "text/plain": [ 465 | " 类别 说明 简称\n", 466 | "0 ubyte compatible: C unsigned char B\n", 467 | "1 ushort compatible: C unsigned short H\n", 468 | "2 uintc compatible: C unsigned int I\n", 469 | "3 uint compatible: Python int L\n", 470 | "4 ulonglong compatible: C long long Q\n", 471 | "5 uintp large enough to fit a pointer P\n", 472 | "6 uint8 8 bits \n", 473 | "7 uint16 16 bits \n", 474 | "8 uint32 32 bits \n", 475 | "9 uint64 64 bits " 476 | ] 477 | }, 478 | "execution_count": 7, 479 | "metadata": {}, 480 | "output_type": "execute_result" 481 | } 482 | ], 483 | "source": [ 484 | "data = [['ubyte','compatible: C unsigned char','B'],\n", 485 | "['ushort','compatible: C unsigned short','H'],\n", 486 | "['uintc','compatible: C unsigned int','I'],\n", 487 | "['uint','compatible: Python int','L'],\n", 488 | "['ulonglong','compatible: C long long','Q'],\n", 489 | "['uintp','large enough to fit a pointer','P'],\n", 490 | "['uint8','8 bits',''], \n", 491 | "['uint16','16 bits',''],\n", 492 | "['uint32','32 bits',''],\n", 493 | "['uint64','64 bits','']]\n", 494 | "pd.DataFrame(data = data, columns = columns)" 495 | ] 496 | }, 497 | { 498 | "cell_type": "markdown", 499 | "metadata": {}, 500 | "source": [ 501 | "## 2.3 浮点型" 502 | ] 503 | }, 504 | { 505 | "cell_type": "code", 506 | "execution_count": 8, 507 | "metadata": {}, 508 | "outputs": [ 509 | { 510 | "data": { 511 | "text/html": [ 512 | "
\n", 513 | "\n", 526 | "\n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | " \n", 532 | " \n", 533 | " \n", 534 | " \n", 535 | " \n", 536 | " \n", 537 | " \n", 538 | " \n", 539 | " \n", 540 | " \n", 541 | " \n", 542 | " \n", 543 | " \n", 544 | " \n", 545 | " \n", 546 | " \n", 547 | " \n", 548 | " \n", 549 | " \n", 550 | " \n", 551 | " \n", 552 | " \n", 553 | " \n", 554 | " \n", 555 | " \n", 556 | " \n", 557 | " \n", 558 | " \n", 559 | " \n", 560 | " \n", 561 | " \n", 562 | " \n", 563 | " \n", 564 | " \n", 565 | " \n", 566 | " \n", 567 | " \n", 568 | " \n", 569 | " \n", 570 | " \n", 571 | " \n", 572 | " \n", 573 | " \n", 574 | " \n", 575 | " \n", 576 | " \n", 577 | " \n", 578 | " \n", 579 | " \n", 580 | " \n", 581 | " \n", 582 | " \n", 583 | " \n", 584 | " \n", 585 | " \n", 586 | " \n", 587 | " \n", 588 | " \n", 589 | " \n", 590 | " \n", 591 | " \n", 592 | " \n", 593 | " \n", 594 | " \n", 595 | " \n", 596 | " \n", 597 | "
类别说明简称
0halfe
1singlecompatible: C floatf
2doublecompatible: C double
3float_compatible: Python floatd
4longfloatcompatible: C long floatg
5float1616 bits
6float3232 bits
7float6464 bits
8float9696 bits, platform?
9float128128 bits, platform?
\n", 598 | "
" 599 | ], 600 | "text/plain": [ 601 | " 类别 说明 简称\n", 602 | "0 half e\n", 603 | "1 single compatible: C float f\n", 604 | "2 double compatible: C double \n", 605 | "3 float_ compatible: Python float d\n", 606 | "4 longfloat compatible: C long float g\n", 607 | "5 float16 16 bits \n", 608 | "6 float32 32 bits \n", 609 | "7 float64 64 bits \n", 610 | "8 float96 96 bits, platform? \n", 611 | "9 float128 128 bits, platform? " 612 | ] 613 | }, 614 | "execution_count": 8, 615 | "metadata": {}, 616 | "output_type": "execute_result" 617 | } 618 | ], 619 | "source": [ 620 | "data = [['half',' ','e'],\n", 621 | "['single','compatible: C float','f'],\n", 622 | "['double','compatible: C double',''],\n", 623 | "['float_','compatible: Python float','d'],\n", 624 | "['longfloat','compatible: C long float','g'],\n", 625 | "['float16','16 bits',''],\n", 626 | "['float32','32 bits',''],\n", 627 | "['float64','64 bits',''], \n", 628 | "['float96','96 bits, platform?',''], \n", 629 | "['float128','128 bits, platform?','']]\n", 630 | "pd.DataFrame(data = data, columns = columns)" 631 | ] 632 | }, 633 | { 634 | "cell_type": "markdown", 635 | "metadata": {}, 636 | "source": [ 637 | "## 2.4 复数型" 638 | ] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "execution_count": 9, 643 | "metadata": {}, 644 | "outputs": [ 645 | { 646 | "data": { 647 | "text/html": [ 648 | "
\n", 649 | "\n", 662 | "\n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | "
类别说明简称
0csingleF
1complex_compatible: Python complexD
2clongfloatG
3complex64two 32-bit floats
4complex128two 64-bit floats
5complex192two 96-bit floats, platform?
6complex256two 128-bit floats, platform?
\n", 716 | "
" 717 | ], 718 | "text/plain": [ 719 | " 类别 说明 简称\n", 720 | "0 csingle F\n", 721 | "1 complex_ compatible: Python complex D\n", 722 | "2 clongfloat G\n", 723 | "3 complex64 two 32-bit floats \n", 724 | "4 complex128 two 64-bit floats \n", 725 | "5 complex192 two 96-bit floats, platform? \n", 726 | "6 complex256 two 128-bit floats, platform? " 727 | ] 728 | }, 729 | "execution_count": 9, 730 | "metadata": {}, 731 | "output_type": "execute_result" 732 | } 733 | ], 734 | "source": [ 735 | "data = [['csingle',' ','F'],\n", 736 | "['complex_','compatible: Python complex','D'],\n", 737 | "['clongfloat',' ','G'],\n", 738 | "['complex64','two 32-bit floats',''], \n", 739 | "['complex128','two 64-bit floats',''], \n", 740 | "['complex192','two 96-bit floats, platform?',''], \n", 741 | "['complex256','two 128-bit floats, platform?','']]\n", 742 | "pd.DataFrame(data = data, columns = columns)" 743 | ] 744 | }, 745 | { 746 | "cell_type": "markdown", 747 | "metadata": {}, 748 | "source": [ 749 | "## 2.5 Arbitrary type\n", 750 | "`object_` essentially refers to Python's built-in `object` class, so a value of this type can be any Python object." 751 | ] 752 | }, 753 | { 754 | "cell_type": "code", 755 | "execution_count": 10, 756 | "metadata": {}, 757 | "outputs": [ 758 | { 759 | "data": { 760 | "text/html": [ 761 | "
\n", 762 | "\n", 775 | "\n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | "
类别说明简称
0object_any Python objectO
\n", 793 | "
" 794 | ], 795 | "text/plain": [ 796 | " 类别 说明 简称\n", 797 | "0 object_ any Python object O" 798 | ] 799 | }, 800 | "execution_count": 10, 801 | "metadata": {}, 802 | "output_type": "execute_result" 803 | } 804 | ], 805 | "source": [ 806 | "data = [['object_','any Python object','O']]\n", 807 | "pd.DataFrame(data = data, columns = columns)" 808 | ] 809 | }, 810 | { 811 | "cell_type": "code", 812 | "execution_count": null, 813 | "metadata": {}, 814 | "outputs": [], 815 | "source": [] 816 | }, 817 | { 818 | "cell_type": "markdown", 819 | "metadata": {}, 820 | "source": [ 821 | "# 3. Useful Intermediate Types" 822 | ] 823 | }, 824 | { 825 | "cell_type": "markdown", 826 | "metadata": {}, 827 | "source": [ 828 | "## 3.1 .str\n", 829 | "This intermediate type lets a Series of `object_` dtype be processed as strings and provides many string-handling functions. A later chapter is devoted to it." 830 | ] 831 | }, 832 | { 833 | "cell_type": "code", 834 | "execution_count": 11, 835 | "metadata": {}, 836 | "outputs": [ 837 | { 838 | "data": { 839 | "text/plain": [ 840 | "0 a_b\n", 841 | "1 b_c\n", 842 | "2 c_d\n", 843 | "dtype: object" 844 | ] 845 | }, 846 | "execution_count": 11, 847 | "metadata": {}, 848 | "output_type": "execute_result" 849 | } 850 | ], 851 | "source": [ 852 | "s = pd.Series(['a_b','b_c','c_d'],dtype = 'object')\n", 853 | "s" 854 | ] 855 | }, 856 | { 857 | "cell_type": "code", 858 | "execution_count": 12, 859 | "metadata": {}, 860 | "outputs": [ 861 | { 862 | "data": { 863 | "text/html": [ 864 | "
\n", 865 | "\n", 878 | "\n", 879 | " \n", 880 | " \n", 881 | " \n", 882 | " \n", 883 | " \n", 884 | " \n", 885 | " \n", 886 | " \n", 887 | " \n", 888 | " \n", 889 | " \n", 890 | " \n", 891 | " \n", 892 | " \n", 893 | " \n", 894 | " \n", 895 | " \n", 896 | " \n", 897 | " \n", 898 | " \n", 899 | " \n", 900 | " \n", 901 | " \n", 902 | " \n", 903 | "
01
0ab
1bc
2cd
\n", 904 | "
" 905 | ], 906 | "text/plain": [ 907 | " 0 1\n", 908 | "0 a b\n", 909 | "1 b c\n", 910 | "2 c d" 911 | ] 912 | }, 913 | "execution_count": 12, 914 | "metadata": {}, 915 | "output_type": "execute_result" 916 | } 917 | ], 918 | "source": [ 919 | "s.str.split('_',expand = True)" 920 | ] 921 | }, 922 | { 923 | "cell_type": "markdown", 924 | "metadata": {}, 925 | "source": [ 926 | "## 3.2 .cat\n", 927 | "This intermediate type is dedicated to categorical data, a kind of feature that comes up often in machine learning; a later chapter covers it." 928 | ] 929 | }, 930 | { 931 | "cell_type": "code", 932 | "execution_count": 13, 933 | "metadata": {}, 934 | "outputs": [ 935 | { 936 | "data": { 937 | "text/plain": [ 938 | "0 1\n", 939 | "1 2\n", 940 | "2 3\n", 941 | "dtype: category\n", 942 | "Categories (3, int64): [1, 2, 3]" 943 | ] 944 | }, 945 | "execution_count": 13, 946 | "metadata": {}, 947 | "output_type": "execute_result" 948 | } 949 | ], 950 | "source": [ 951 | "s = pd.Series( [1,2,3], dtype='category')\n", 952 | "s" 953 | ] 954 | }, 955 | { 956 | "cell_type": "code", 957 | "execution_count": 14, 958 | "metadata": {}, 959 | "outputs": [ 960 | { 961 | "data": { 962 | "text/plain": [ 963 | "Int64Index([1, 2, 3], dtype='int64')" 964 | ] 965 | }, 966 | "execution_count": 14, 967 | "metadata": {}, 968 | "output_type": "execute_result" 969 | } 970 | ], 971 | "source": [ 972 | "s.cat.categories" 973 | ] 974 | }, 975 | { 976 | "cell_type": "markdown", 977 | "metadata": {}, 978 | "source": [ 979 | "## 3.3 .dt\n", 980 | "This intermediate type is dedicated to Series of datetime values and is used in time-series analysis." 
981 | ] 982 | }, 983 | { 984 | "cell_type": "code", 985 | "execution_count": 15, 986 | "metadata": {}, 987 | "outputs": [ 988 | { 989 | "data": { 990 | "text/plain": [ 991 | "0 2017-08-01\n", 992 | "1 2017-08-03\n", 993 | "2 2017-08-03\n", 994 | "dtype: datetime64[ns]" 995 | ] 996 | }, 997 | "execution_count": 15, 998 | "metadata": {}, 999 | "output_type": "execute_result" 1000 | } 1001 | ], 1002 | "source": [ 1003 | "s = pd.Series(['2017-08-01','2017-08-03','2017-08-03'], dtype = 'datetime64[ns]')\n", 1004 | "s" 1005 
| ] 1006 | }, 1007 | { 1008 | "cell_type": "code", 1009 | "execution_count": 16, 1010 | "metadata": { 1011 | "scrolled": false 1012 | }, 1013 | "outputs": [ 1014 | { 1015 | "data": { 1016 | "text/plain": [ 1017 | "0 2017\n", 1018 | "1 2017\n", 1019 | "2 2017\n", 1020 | "dtype: int64" 1021 | ] 1022 | }, 1023 | "execution_count": 16, 1024 | "metadata": {}, 1025 | "output_type": "execute_result" 1026 | } 1027 | ], 1028 | "source": [ 1029 | "s.dt.year" 1030 | ] 1031 | }, 1032 | { 1033 | "cell_type": "code", 1034 | "execution_count": null, 1035 | "metadata": {}, 1036 | "outputs": [], 1037 | "source": [] 1038 | } 1039 | ], 1040 | "metadata": { 1041 | "kernelspec": { 1042 | "display_name": "Python 3", 1043 | "language": "python", 1044 | "name": "python3" 1045 | }, 1046 | "language_info": { 1047 | "codemirror_mode": { 1048 | "name": "ipython", 1049 | "version": 3 1050 | }, 1051 | "file_extension": ".py", 1052 | "mimetype": "text/x-python", 1053 | "name": "python", 1054 | "nbconvert_exporter": "python", 1055 | "pygments_lexer": "ipython3", 1056 | "version": "3.7.0" 1057 | }, 1058 | "toc": { 1059 | "base_numbering": 1, 1060 | "nav_menu": { 1061 | "height": "318px", 1062 | "width": "252px" 1063 | }, 1064 | "number_sections": false, 1065 | "sideBar": true, 1066 | "skip_h1_title": false, 1067 | "title_cell": "Table of Contents", 1068 | "title_sidebar": "Contents", 1069 | "toc_cell": false, 1070 | "toc_position": { 1071 | "height": "calc(100% - 180px)", 1072 | "left": "10px", 1073 | "top": "150px", 1074 | "width": "256px" 1075 | }, 1076 | "toc_section_display": "block", 1077 | "toc_window_display": true 1078 | } 1079 | }, 1080 | "nbformat": 4, 1081 | "nbformat_minor": 2 1082 | } 1083 | -------------------------------------------------------------------------------- /7. 
显示控制.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Display Control\n", 8 | "Now that we know how to construct the basic data structures, complex functions and operations can wait; it is more important to first learn how to make the data display the way we want." 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "metadata": {}, 15 | "outputs": [], 16 | "source": [ 17 | "__auther__ = 'zhenhang.sun@gmail.com'" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "metadata": {}, 24 | "outputs": [ 25 | { 26 | "data": { 27 | "text/plain": [ 28 | "'D:\\\\github\\\\pandas-tutorial'" 29 | ] 30 | }, 31 | "execution_count": 2, 32 | "metadata": {}, 33 | "output_type": "execute_result" 34 | } 35 | ], 36 | "source": [ 37 | "pwd" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": 3, 43 | "metadata": {}, 44 | "outputs": [], 45 | "source": [ 46 | "import numpy as np\n", 47 | "import pandas as pd" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "# 1. 
Function Overview\n", 62 | "- `get_option`, `set_option` and `reset_option` query and modify entries in pandas' in-memory option registry, so changes last only for the current session;\n", 63 | "- keys are given as strings and are matched as regular expressions, but `get_option` and `set_option` raise an error if the pattern matches more than one option, so it is best to write the full key." 64 | ] 65 | }, 66 | { 67 | "cell_type": "markdown", 68 | "metadata": {}, 69 | "source": [ 70 | "## 1.1 Querying the current value\n", 71 | "##### `pd.get_option(key)`\n", 72 | "- key: the name of the option to query" 73 | ] 74 | }, 75 | { 76 | "cell_type": "markdown", 77 | "metadata": {}, 78 | "source": [ 79 | "## 1.2 Setting a custom value\n", 80 | "##### `pd.set_option(key, value)`\n", 81 | "- key: the name of the option to set\n", 82 | "- value: the new value; its type depends on the option (for example, an int for `display.max_rows`)" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "## 1.3 Resetting a custom value to the default\n", 90 | "##### `pd.reset_option(key)`\n", 91 | "- key: the name of the option to reset" 92 | ] 93 | }, 94 | { 95 | "cell_type": "code", 96 | "execution_count": null, 97 | "metadata": {}, 98 | "outputs": [], 99 | "source": [] 100 | }, 101 | { 102 | "cell_type": "markdown", 103 | "metadata": {}, 104 | "source": [ 105 | "# 2. Important Options\n", 106 | "In practice only a few options are worth reading or changing with get_option and set_option, chiefly the following three keys:\n", 107 | "- 'display.max_rows'\n", 108 | "- 'display.max_columns'\n", 109 | "- 'display.max_colwidth'" 110 | ] 111 | }, 112 | { 113 | "cell_type": "markdown", 114 | "metadata": {}, 115 | "source": [ 116 | "## 2.1 max_rows\n", 117 | "Controls the maximum number of rows that can be displayed." 
118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 4, 123 | "metadata": {}, 124 | "outputs": [ 125 | { 126 | "data": { 127 | "text/plain": [ 128 | "60" 129 | ] 130 | }, 131 | "execution_count": 4, 132 | "metadata": {}, 133 | "output_type": "execute_result" 134 | } 135 | ], 136 | "source": [ 137 | "pd.get_option('display.max_rows')# 60 means at most 60 rows are displayed; if there are more than 60 rows, part of the output is elided." 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 5, 143 | "metadata": { 144 | "scrolled": true 145 | }, 146 | "outputs": [ 147 | { 148 | "data": { 149 | "text/plain": [ 150 | "0 NaN\n", 151 | "1 NaN\n", 152 | "2 NaN\n", 153 | "3 NaN\n", 154 | "4 NaN\n", 155 | "5 NaN\n", 156 | "6 NaN\n", 157 | "7 NaN\n", 158 | "8 NaN\n", 159 | "9 
NaN\n", 160 | "10 NaN\n", 161 | "11 NaN\n", 162 | "12 NaN\n", 163 | "13 NaN\n", 164 | "14 NaN\n", 165 | "15 NaN\n", 166 | "16 NaN\n", 167 | "17 NaN\n", 168 | "18 NaN\n", 169 | "19 NaN\n", 170 | "20 NaN\n", 171 | "21 NaN\n", 172 | "22 NaN\n", 173 | "23 NaN\n", 174 | "24 NaN\n", 175 | "25 NaN\n", 176 | "26 NaN\n", 177 | "27 NaN\n", 178 | "28 NaN\n", 179 | "29 NaN\n", 180 | "30 NaN\n", 181 | "31 NaN\n", 182 | "32 NaN\n", 183 | "33 NaN\n", 184 | "34 NaN\n", 185 | "35 NaN\n", 186 | "36 NaN\n", 187 | "37 NaN\n", 188 | "38 NaN\n", 189 | "39 NaN\n", 190 | "40 NaN\n", 191 | "41 NaN\n", 192 | "42 NaN\n", 193 | "43 NaN\n", 194 | "44 NaN\n", 195 | "45 NaN\n", 196 | "46 NaN\n", 197 | "47 NaN\n", 198 | "48 NaN\n", 199 | "49 NaN\n", 200 | "50 NaN\n", 201 | "51 NaN\n", 202 | "52 NaN\n", 203 | "53 NaN\n", 204 | "54 NaN\n", 205 | "55 NaN\n", 206 | "56 NaN\n", 207 | "57 NaN\n", 208 | "58 NaN\n", 209 | "59 NaN\n", 210 | "dtype: float64" 211 | ] 212 | }, 213 | "execution_count": 5, 214 | "metadata": {}, 215 | "output_type": "execute_result" 216 | } 217 | ], 218 | "source": [ 219 | "# pd.set_option('display.max_rows', 61)\n", 220 | "pd.Series( index = range(0, 60) ) # Try changing 60 to 61, then uncomment the line above and try again." 221 | ] 222 | }, 223 | { 224 | "cell_type": "markdown", 225 | "metadata": {}, 226 | "source": [ 227 | "## 2.2 max_columns\n", 228 | "Controls the maximum number of columns that can be displayed; this option matters mostly for DataFrames." 
229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": 6, 234 | "metadata": {}, 235 | "outputs": [ 236 | { 237 | "data": { 238 | "text/plain": [ 239 | "20" 240 | ] 241 | }, 242 | "execution_count": 6, 243 | "metadata": {}, 244 | "output_type": "execute_result" 245 | } 246 | ], 247 | "source": [ 248 | "pd.get_option('display.max_columns')" 249 | ] 250 | }, 251 | { 252 | "cell_type": "code", 253 | "execution_count": 7, 254 | "metadata": {}, 255 | "outputs": [ 256 | { 257 | "data": { 258 | "text/html": [ 259 | "
\n", 260 | "\n", 273 | "\n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | " \n", 294 | " \n", 295 | " \n", 296 | " \n", 297 | " \n", 298 | " \n", 299 | " \n", 300 | " \n", 301 | "
012345678910111213141516171819
\n", 302 | "
" 303 | ], 304 | "text/plain": [ 305 | "Empty DataFrame\n", 306 | "Columns: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]\n", 307 | "Index: []" 308 | ] 309 | }, 310 | "execution_count": 7, 311 | "metadata": {}, 312 | "output_type": "execute_result" 313 | } 314 | ], 315 | "source": [ 316 | "# pd.set_option('display.max_columns', 21)\n", 317 | "pd.DataFrame( columns = range(0,20)) # Try both 20 and 21 columns, then uncomment the line above and try again." 318 | ] 319 | }, 320 | { 321 | "cell_type": "markdown", 322 | "metadata": {}, 323 | "source": [ 324 | "## 2.3 max_colwidth\n", 325 | "Controls the maximum number of characters displayed in each cell." 326 | ] 327 | }, 328 | { 329 | "cell_type": "code", 330 | "execution_count": 8, 331 | "metadata": {}, 332 | "outputs": [ 333 | { 334 | "data": { 335 | "text/plain": [ 336 | "50" 337 | ] 338 | }, 339 | "execution_count": 8, 340 | "metadata": {}, 341 | "output_type": "execute_result" 342 | } 343 | ], 344 | "source": [ 345 | "pd.get_option('display.max_colwidth')" 346 | ] 347 | }, 348 | { 349 | "cell_type": "code", 350 | "execution_count": 9, 351 | "metadata": {}, 352 | "outputs": [ 353 | { 354 | "data": { 355 | "text/plain": [ 356 | "1 1111111111111111111111111111111111111111111111...\n", 357 | "dtype: object" 358 | ] 359 | }, 360 | "execution_count": 9, 361 | "metadata": {}, 362 | "output_type": "execute_result" 363 | } 364 | ], 365 | "source": [ 366 | "# pd.set_option('display.max_colwidth', 51)\n", 367 | "s = '1'*50 # Try changing 50 to 49, then uncomment the line above and try again\n", 368 | "pd.Series( data=[s], index = ['1']) # in practice, simply set max_colwidth to a reasonably large value" 369 | ] 370 | }, 371 | { 372 | "cell_type": "markdown", 373 | "metadata": {}, 374 | "source": [ 375 | "## 2.4 precision\n", 376 | "Controls the number of decimal places shown for floating-point values; it does not affect the stored precision." 
377 | ] 378 | }, 379 | { 380 | "cell_type": "code", 381 | "execution_count": 10, 382 | "metadata": {}, 383 | "outputs": [ 384 | { 385 | "data": { 386 | "text/plain": [ 387 | "6" 388 | ] 389 | }, 390 | "execution_count": 10, 391 | "metadata": {}, 392 | "output_type": "execute_result" 393 | } 394 | ], 395 | "source": [ 396 
| "pd.get_option('display.precision')" 397 | ] 398 | }, 399 | { 400 | "cell_type": "code", 401 | "execution_count": 11, 402 | "metadata": {}, 403 | "outputs": [ 404 | { 405 | "data": { 406 | "text/plain": [ 407 | "0 1.010101\n", 408 | "dtype: float64" 409 | ] 410 | }, 411 | "execution_count": 11, 412 | "metadata": {}, 413 | "output_type": "execute_result" 414 | } 415 | ], 416 | "source": [ 417 | "# pd.set_option('display.precision', 10)\n", 418 | "a = 1.01010101\n", 419 | "s = pd.Series( data= [a]) # \n", 420 | "s" 421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": 12, 426 | "metadata": {}, 427 | "outputs": [ 428 | { 429 | "data": { 430 | "text/plain": [ 431 | "1.01010101" 432 | ] 433 | }, 434 | "execution_count": 12, 435 | "metadata": {}, 436 | "output_type": "execute_result" 437 | } 438 | ], 439 | "source": [ 440 | "s[0] # The value is truncated when the Series is displayed, but the stored precision is unchanged; uncomment the set_option line above and try again" 441 | ] 442 | }, 443 | { 444 | "cell_type": "markdown", 445 | "metadata": {}, 446 | "source": [ 447 | "## 2.5 colheader_justify\n", 448 | "Controls whether DataFrame column headers are aligned to the left or to the right." 449 | ] 450 | }, 451 | { 452 | "cell_type": "code", 453 | "execution_count": 13, 454 | "metadata": {}, 455 | "outputs": [ 456 | { 457 | "data": { 458 | "text/plain": [ 459 | "'right'" 460 | ] 461 | }, 462 | "execution_count": 13, 463 | "metadata": {}, 464 | "output_type": "execute_result" 465 | } 466 | ], 467 | "source": [ 468 | "pd.get_option('display.colheader_justify')" 469 | ] 470 | }, 471 | { 472 | "cell_type": "code", 473 | "execution_count": 14, 474 | "metadata": {}, 475 | "outputs": [ 476 | { 477 | "data": { 478 | "text/html": [ 479 | "
\n", 480 | "\n", 493 | "\n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | "
a
0000000000000000
\n", 507 | "
" 508 | ], 509 | "text/plain": [ 510 | " a\n", 511 | "0 000000000000000" 512 | ] 513 | }, 514 | "execution_count": 14, 515 | "metadata": {}, 516 | "output_type": "execute_result" 517 | } 518 | ], 519 | "source": [ 520 | "# pd.set_option('display.colheader_justify','left') # 我的这个设置好像无效,不知道咋回事\n", 521 | "pd.DataFrame( data = ['000000000000000'], columns = ['a'])" 522 | ] 523 | }, 524 | { 525 | "cell_type": "markdown", 526 | "metadata": {}, 527 | "source": [ 528 | "## 1.6. 更多设置参数查看官方说明\n", 529 | "https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.set_option.html" 530 | ] 531 | }, 532 | { 533 | "cell_type": "code", 534 | "execution_count": null, 535 | "metadata": {}, 536 | "outputs": [], 537 | "source": [] 538 | } 539 | ], 540 | "metadata": { 541 | "kernelspec": { 542 | "display_name": "Python 3", 543 | "language": "python", 544 | "name": "python3" 545 | }, 546 | "language_info": { 547 | "codemirror_mode": { 548 | "name": "ipython", 549 | "version": 3 550 | }, 551 | "file_extension": ".py", 552 | "mimetype": "text/x-python", 553 | "name": "python", 554 | "nbconvert_exporter": "python", 555 | "pygments_lexer": "ipython3", 556 | "version": "3.7.0" 557 | }, 558 | "toc": { 559 | "base_numbering": 1, 560 | "nav_menu": { 561 | "height": "228px", 562 | "width": "252px" 563 | }, 564 | "number_sections": false, 565 | "sideBar": true, 566 | "skip_h1_title": false, 567 | "title_cell": "Table of Contents", 568 | "title_sidebar": "Contents", 569 | "toc_cell": false, 570 | "toc_position": { 571 | "height": "485px", 572 | "left": "0px", 573 | "right": "1068px", 574 | "top": "66px", 575 | "width": "212px" 576 | }, 577 | "toc_section_display": "block", 578 | "toc_window_display": true 579 | } 580 | }, 581 | "nbformat": 4, 582 | "nbformat_minor": 2 583 | } 584 | -------------------------------------------------------------------------------- /8. 
快速查看整体信息.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Getting a Quick Overview\n", 8 | "The previous chapter covered options that control how a DataFrame is displayed; this chapter shows how to quickly get an overall picture of a DataFrame." 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "metadata": {}, 15 | "outputs": [], 16 | "source": [ 17 | "__author__ = 'zhenhang.sun@gmail.com'" 18 | ] 19 | }, 20 | { 21 | "cell_type": "code", 22 | "execution_count": 2, 23 | "metadata": {}, 24 | "outputs": [], 25 | "source": [ 26 | "pwd" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 3, 32 | "metadata": {}, 33 | "outputs": [], 34 | "source": [ 35 | "import numpy as np\n", 36 | "import pandas as pd" 37 | ] 38 | }, 39 | { 40 | "cell_type": "code", 41 | "execution_count": null, 42 | "metadata": {}, 43 | "outputs": [], 44 | "source": [] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "# 1. info()\n", 51 | "Originally a DataFrame-only API (recent pandas versions also provide `Series.info()`). It is a quick way to see several things at once: the number of rows and columns, each column's dtype and non-NaN count, and the total memory usage." 52 | ] 53 | }, 54 | { 55 | "cell_type": "markdown", 56 | "metadata": {}, 57 | "source": [ 58 | "##### `DataFrame.info(verbose=None, memory_usage=None, show_counts=None)`\n", 59 | "- verbose: True or False. Literally `wordy`: when a DataFrame has many columns, whether to print the information for every column; if False, part of it is omitted;\n", 60 | "- memory_usage: True or False, shown by default. Whether to report the DataFrame's memory usage;\n", 61 | "- show_counts: True or False. Whether to show the non-null count of each column. This parameter was named `null_counts` in older pandas releases; that name was deprecated in 1.2 and removed in 2.0." 
62 | ] 63 | }, 64 | { 65 | "cell_type": "code", 66 | "execution_count": 4, 67 | "metadata": {}, 68 | "outputs": [ 69 | { 70 | "data": { 71 | "text/html": [ 72 | "
\n", 73 | "\n", 86 | "\n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | " \n", 95 | " \n", 96 | " \n", 97 | " \n", 98 | " \n", 99 | " \n", 100 | " \n", 101 | " \n", 102 | " \n", 103 | " \n", 104 | "
0123456789
\n", 105 | "
" 106 | ], 107 | "text/plain": [ 108 | "Empty DataFrame\n", 109 | "Columns: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]\n", 110 | "Index: []" 111 | ] 112 | }, 113 | "execution_count": 4, 114 | "metadata": {}, 115 | "output_type": "execute_result" 116 | } 117 | ], 118 | "source": [ 119 | "df = pd.DataFrame(columns = range(0,10))\n", 120 | "df" 121 | ] 122 | }, 123 | { 124 | "cell_type": "code", 125 | "execution_count": 5, 126 | "metadata": { 127 | "scrolled": false 128 | }, 129 | "outputs": [ 130 | { 131 | "name": "stdout", 132 | "output_type": "stream", 133 | "text": [ 134 | "\n", 135 | "Index: 0 entries\n", 136 | "Data columns (total 10 columns):\n", 137 | "0 0 non-null object\n", 138 | "1 0 non-null object\n", 139 | "2 0 non-null object\n", 140 | "3 0 non-null object\n", 141 | "4 0 non-null object\n", 142 | "5 0 non-null object\n", 143 | "6 0 non-null object\n", 144 | "7 0 non-null object\n", 145 | "8 0 non-null object\n", 146 | "9 0 non-null object\n", 147 | "dtypes: object(10)\n", 148 | "memory usage: 0.0+ bytes\n" 149 | ] 150 | } 151 | ], 152 | "source": [ 153 | "df.info() # 直接默认设置即可" 154 | ] 155 | }, 156 | { 157 | "cell_type": "code", 158 | "execution_count": null, 159 | "metadata": {}, 160 | "outputs": [], 161 | "source": [] 162 | }, 163 | { 164 | "cell_type": "markdown", 165 | "metadata": {}, 166 | "source": [ 167 | "# 2. ndim, shape, size\n", 168 | "查看维数,形状,元素个数。" 169 | ] 170 | }, 171 | { 172 | "cell_type": "code", 173 | "execution_count": 6, 174 | "metadata": {}, 175 | "outputs": [ 176 | { 177 | "data": { 178 | "text/html": [ 179 | "
\n", 180 | "\n", 193 | "\n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | "
AB
0NaN2.0
13.0NaN
\n", 214 | "
" 215 | ], 216 | "text/plain": [ 217 | " A B\n", 218 | "0 NaN 2.0\n", 219 | "1 3.0 NaN" 220 | ] 221 | }, 222 | "execution_count": 6, 223 | "metadata": {}, 224 | "output_type": "execute_result" 225 | } 226 | ], 227 | "source": [ 228 | "df = pd.DataFrame([[np.nan, 2], [3, np.nan]], columns=['A','B'])\n", 229 | "df" 230 | ] 231 | }, 232 | { 233 | "cell_type": "code", 234 | "execution_count": 7, 235 | "metadata": {}, 236 | "outputs": [ 237 | { 238 | "data": { 239 | "text/plain": [ 240 | "2" 241 | ] 242 | }, 243 | "execution_count": 7, 244 | "metadata": {}, 245 | "output_type": "execute_result" 246 | } 247 | ], 248 | "source": [ 249 | "df.ndim # 返回维度数,Series一维,DataFrame两维,平时很少用到,不过有时会在循环中用到" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": 8, 255 | "metadata": {}, 256 | "outputs": [ 257 | { 258 | "data": { 259 | "text/plain": [ 260 | "(2, 2)" 261 | ] 262 | }, 263 | "execution_count": 8, 264 | "metadata": {}, 265 | "output_type": "execute_result" 266 | } 267 | ], 268 | "source": [ 269 | "df.shape # (行数,列数)" 270 | ] 271 | }, 272 | { 273 | "cell_type": "code", 274 | "execution_count": 9, 275 | "metadata": {}, 276 | "outputs": [ 277 | { 278 | "data": { 279 | "text/plain": [ 280 | "4" 281 | ] 282 | }, 283 | "execution_count": 9, 284 | "metadata": {}, 285 | "output_type": "execute_result" 286 | } 287 | ], 288 | "source": [ 289 | "df.size # 元素个数,rows×cols" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": null, 295 | "metadata": {}, 296 | "outputs": [], 297 | "source": [] 298 | }, 299 | { 300 | "cell_type": "markdown", 301 | "metadata": {}, 302 | "source": [ 303 | "# 3. 
head(), tail()\n", 304 | "By default, these return the first 5 rows and the last 5 rows, respectively." 305 | ] 306 | }, 307 | { 308 | "cell_type": "markdown", 309 | "metadata": {}, 310 | "source": [ 311 | "##### `Series/DataFrame.head(n=5)`\n", 312 | "##### `Series/DataFrame.tail(n=5)`" 313 | ] 314 | }, 315 | { 316 | "cell_type": "code", 317 | "execution_count": 10, 318 | "metadata": { 319 | "scrolled": true 320 | }, 321 | "outputs": [ 322 | { 323 | "data": { 324 | "text/plain": [ 325 | "0 0\n", 326 | "1 1\n", 327 | "2 2\n", 328 | "3 3\n", 329 | "4 4\n", 330 | "dtype: int64" 331 | ] 332 | }, 333 | "execution_count": 10, 334 | "metadata": {}, 335 | "output_type": "execute_result" 336 | } 337 | ], 338 | "source": [ 339 | "s = pd.Series(range(0,5))\n", 340 | "s" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": 11, 346 | "metadata": {}, 347 | "outputs": [ 348 | { 349 | "data": { 350 | "text/plain": [ 351 | "0 0\n", 352 | "1 1\n", 353 | "2 2\n", 354 | "dtype: int64" 355 | ] 356 | }, 357 | "execution_count": 11, 358 | "metadata": {}, 359 | "output_type": "execute_result" 360 | } 361 | ], 362 | "source": [ 363 | "s.head(3)" 364 | ] 365 | }, 366 | { 367 | "cell_type": "code", 368 | "execution_count": 12, 369 | "metadata": {}, 370 | "outputs": [ 371 | { 372 | "data": { 373 | "text/plain": [ 374 | "2 2\n", 375 | "3 3\n", 376 | "4 4\n", 377 | "dtype: int64" 378 | ] 379 | }, 380 | "execution_count": 12, 381 | "metadata": {}, 382 | "output_type": "execute_result" 383 | } 384 | ], 385 | "source": [ 386 | "s.tail(3)" 387 | ] 388 | }, 389 | { 390 | "cell_type": "code", 391 | "execution_count": null, 392 | "metadata": {}, 393 | "outputs": [], 394 | "source": [] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "metadata": {}, 399 | "source": [ 400 | "# 4. 
memory_usage()\n", 401 | "Gives finer control than the memory figure printed by info(); the unit is **bytes**." 402 | ] 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "metadata": {}, 407 | "source": [ 408 | "##### `Series/DataFrame.memory_usage(index=True, deep=False)`\n", 409 | "- index: whether to include the memory used by the index, which of course occupies memory too;\n", 410 | "- deep: whether to measure the memory actually consumed by object-dtype columns. An object column stores only references to Python objects, so deep=True inspects the referenced objects to report their real footprint, while deep=False counts only the references." 411 | ] 412 | }, 413 | { 414 | "cell_type": "code", 415 | "execution_count": 13, 416 | "metadata": {}, 417 | "outputs": [ 418 | { 419 | "data": { 420 | "text/plain": [ 421 | "Index 80\n", 422 | "A 16\n", 423 | "B 16\n", 424 | "dtype: int64" 425 | ] 426 | }, 427 | "execution_count": 13, 428 | "metadata": {}, 429 | "output_type": "execute_result" 430 | } 431 | ], 432 | "source": [ 433 | "df.memory_usage(deep=False) # Index is the memory used by the index" 434 | ] 435 | }, 436 | { 437 | "cell_type": "code", 438 | "execution_count": 14, 439 | "metadata": {}, 440 | "outputs": [ 441 | { 442 | "data": { 443 | "text/plain": [ 444 | "Index 80\n", 445 | "A 16\n", 446 | "B 16\n", 447 | "dtype: int64" 448 | ] 449 | }, 450 | "execution_count": 14, 451 | "metadata": {}, 452 | "output_type": "execute_result" 453 | } 454 | ], 455 | "source": [ 456 | "df.memory_usage(deep=True) # Object-dtype columns would report more memory; A and B are float64 here, so nothing changes" 457 | ] 458 | }, 459 | { 460 | "cell_type": "code", 461 | "execution_count": null, 462 | "metadata": {}, 463 | "outputs": [], 464 | "source": [] 465 | }, 466 | { 467 | "cell_type": "markdown", 468 | "metadata": {}, 469 | "source": [ 470 | "# 5. describe()\n", 471 | "Quickly view summary statistics for each column; NaN values are excluded by default." 
472 | ] 473 | }, 474 | { 475 | "cell_type": "markdown", 476 | "metadata": {}, 477 | "source": [ 478 | "##### `DataFrame.describe(include=[np.number])`\n", 479 | "- include: 'all', or a list of dtypes such as [np.number] or [object] (`np.object` was removed in NumPy 1.24; the builtin `object` works in its place). With np.number, only numeric columns are summarized, using numeric statistics; with object, only object-dtype columns are summarized, using string-like (categorical) statistics. The default, None, summarizes numeric columns only." 480 | ] 481 | }, 482 | { 483 | "cell_type": "code", 484 | "execution_count": 15, 485 | "metadata": { 486 | "scrolled": false 487 | }, 488 | "outputs": [ 489 | { 490 | "data": { 491 | "text/html": [ 492 | "
\n", 493 | "\n", 506 | "\n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | " \n", 517 | " \n", 518 | " \n", 519 | " \n", 520 | " \n", 521 | " \n", 522 | " \n", 523 | " \n", 524 | " \n", 525 | " \n", 526 | " \n", 527 | " \n", 528 | " \n", 529 | " \n", 530 | " \n", 531 | "
numericobject
01a
12b
21b
\n", 532 | "
" 533 | ], 534 | "text/plain": [ 535 | " numeric object\n", 536 | "0 1 a\n", 537 | "1 2 b\n", 538 | "2 1 b" 539 | ] 540 | }, 541 | "execution_count": 15, 542 | "metadata": {}, 543 | "output_type": "execute_result" 544 | } 545 | ], 546 | "source": [ 547 | "df = pd.DataFrame( [[1,'a'],[2,'b'],[1,'b']], columns=['numeric','object'])\n", 548 | "df" 549 | ] 550 | }, 551 | { 552 | "cell_type": "code", 553 | "execution_count": 16, 554 | "metadata": {}, 555 | "outputs": [ 556 | { 557 | "data": { 558 | "text/plain": [ 559 | "numeric int64\n", 560 | "object object\n", 561 | "dtype: object" 562 | ] 563 | }, 564 | "execution_count": 16, 565 | "metadata": {}, 566 | "output_type": "execute_result" 567 | } 568 | ], 569 | "source": [ 570 | "df.dtypes" 571 | ] 572 | }, 573 | { 574 | "cell_type": "code", 575 | "execution_count": 17, 576 | "metadata": {}, 577 | "outputs": [ 578 | { 579 | "data": { 580 | "text/html": [ 581 | "
\n", 582 | "\n", 595 | "\n", 596 | " \n", 597 | " \n", 598 | " \n", 599 | " \n", 600 | " \n", 601 | " \n", 602 | " \n", 603 | " \n", 604 | " \n", 605 | " \n", 606 | " \n", 607 | " \n", 608 | " \n", 609 | " \n", 610 | " \n", 611 | " \n", 612 | " \n", 613 | " \n", 614 | " \n", 615 | " \n", 616 | " \n", 617 | " \n", 618 | " \n", 619 | " \n", 620 | " \n", 621 | " \n", 622 | " \n", 623 | " \n", 624 | " \n", 625 | " \n", 626 | " \n", 627 | " \n", 628 | " \n", 629 | " \n", 630 | " \n", 631 | " \n", 632 | " \n", 633 | " \n", 634 | " \n", 635 | " \n", 636 | "
numeric
count3.000000
mean1.333333
std0.577350
min1.000000
25%1.000000
50%1.000000
75%1.500000
max2.000000
\n", 637 | "
" 638 | ], 639 | "text/plain": [ 640 | " numeric\n", 641 | "count 3.000000\n", 642 | "mean 1.333333\n", 643 | "std 0.577350\n", 644 | "min 1.000000\n", 645 | "25% 1.000000\n", 646 | "50% 1.000000\n", 647 | "75% 1.500000\n", 648 | "max 2.000000" 649 | ] 650 | }, 651 | "execution_count": 17, 652 | "metadata": {}, 653 | "output_type": "execute_result" 654 | } 655 | ], 656 | "source": [ 657 | "df.describe() # 默认只对数值列进行统计 " 658 | ] 659 | }, 660 | { 661 | "cell_type": "code", 662 | "execution_count": 18, 663 | "metadata": {}, 664 | "outputs": [ 665 | { 666 | "data": { 667 | "text/html": [ 668 | "
\n", 669 | "\n", 682 | "\n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | "
object
count3
unique2
topb
freq2
\n", 708 | "
" 709 | ], 710 | "text/plain": [ 711 | " object\n", 712 | "count 3\n", 713 | "unique 2\n", 714 | "top b\n", 715 | "freq 2" 716 | ] 717 | }, 718 | "execution_count": 18, 719 | "metadata": {}, 720 | "output_type": "execute_result" 721 | } 722 | ], 723 | "source": [ 724 | "df.describe(include=[np.object]) # 只对object型列进行统计,类别统计方式,只统计这四种" 725 | ] 726 | }, 727 | { 728 | "cell_type": "code", 729 | "execution_count": 19, 730 | "metadata": { 731 | "scrolled": true 732 | }, 733 | "outputs": [ 734 | { 735 | "data": { 736 | "text/html": [ 737 | "
\n", 738 | "\n", 751 | "\n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | " \n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | " \n", 781 | " \n", 782 | " \n", 783 | " \n", 784 | " \n", 785 | " \n", 786 | " \n", 787 | " \n", 788 | " \n", 789 | " \n", 790 | " \n", 791 | " \n", 792 | " \n", 793 | " \n", 794 | " \n", 795 | " \n", 796 | " \n", 797 | " \n", 798 | " \n", 799 | " \n", 800 | " \n", 801 | " \n", 802 | " \n", 803 | " \n", 804 | " \n", 805 | " \n", 806 | " \n", 807 | " \n", 808 | " \n", 809 | " \n", 810 | " \n", 811 | " \n", 812 | " \n", 813 | " \n", 814 | " \n", 815 | " \n", 816 | "
numericobject
count3.0000003
uniqueNaN2
topNaNb
freqNaN2
mean1.333333NaN
std0.577350NaN
min1.000000NaN
25%1.000000NaN
50%1.000000NaN
75%1.500000NaN
max2.000000NaN
\n", 817 | "
" 818 | ], 819 | "text/plain": [ 820 | " numeric object\n", 821 | "count 3.000000 3\n", 822 | "unique NaN 2\n", 823 | "top NaN b\n", 824 | "freq NaN 2\n", 825 | "mean 1.333333 NaN\n", 826 | "std 0.577350 NaN\n", 827 | "min 1.000000 NaN\n", 828 | "25% 1.000000 NaN\n", 829 | "50% 1.000000 NaN\n", 830 | "75% 1.500000 NaN\n", 831 | "max 2.000000 NaN" 832 | ] 833 | }, 834 | "execution_count": 19, 835 | "metadata": {}, 836 | "output_type": "execute_result" 837 | } 838 | ], 839 | "source": [ 840 | "df.describe(include = 'all') # 数值序列和object序列共同统计的信息只有count: 非NaN元素个数" 841 | ] 842 | }, 843 | { 844 | "cell_type": "code", 845 | "execution_count": null, 846 | "metadata": {}, 847 | "outputs": [], 848 | "source": [] 849 | } 850 | ], 851 | "metadata": { 852 | "kernelspec": { 853 | "display_name": "Python 3", 854 | "language": "python", 855 | "name": "python3" 856 | }, 857 | "language_info": { 858 | "codemirror_mode": { 859 | "name": "ipython", 860 | "version": 3 861 | }, 862 | "file_extension": ".py", 863 | "mimetype": "text/x-python", 864 | "name": "python", 865 | "nbconvert_exporter": "python", 866 | "pygments_lexer": "ipython3", 867 | "version": "3.7.0" 868 | }, 869 | "toc": { 870 | "base_numbering": 1, 871 | "nav_menu": { 872 | "height": "141px", 873 | "width": "253px" 874 | }, 875 | "number_sections": false, 876 | "sideBar": true, 877 | "skip_h1_title": false, 878 | "title_cell": "Table of Contents", 879 | "title_sidebar": "Contents", 880 | "toc_cell": false, 881 | "toc_position": {}, 882 | "toc_section_display": "block", 883 | "toc_window_display": true 884 | } 885 | }, 886 | "nbformat": 4, 887 | "nbformat_minor": 2 888 | } 889 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ## 教程目录 2 | 3 |
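4 | 5 | 0. Environment setup 6 | 1. Creating Series and DataFrame objects 7 | 2. Querying, modifying, adding to, and deleting from Series and DataFrame objects 8 | 3. merge in depth 9 | 4. Index objects: creating, querying, modifying, adding, deleting, and using them 10 | 5. Converting between regular columns and the row index 11 | 6. Data structure overview 12 | 7. Display control 13 | 8. Getting a quick overview 14 | 9. Numeric operations 15 | 10. Statistical operations 16 | 11. mask and comparison operations (to be completed) 17 | 12. Category dtype and discretization 18 | 13. Datetime operations 19 | 14. Object-dtype operations 20 | 15. groupby in depth (to be completed) 21 | 16. resample in depth (to be completed) 22 | 17. …… 23 | 24 |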
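25 | 26 | ## About This Tutorial 27 | 28 | Data science is one of today's hottest professions, Python is the most widely used programming language in the field, and a major reason for Python's popularity is its powerful data science library: pandas. 29 | 30 | ## Why I Wrote This Tutorial 31 | Yet as a data science practitioner, even after years of working with pandas, I still find myself looking things up in the official documentation all the time, which seriously hurts my productivity. Sometimes I am even forced to fall back on explicit loops, which is very un-pandas. I could not put up with that, so I decided to do something about it. 32 | 33 | After reading through the official documentation several times, I believe the root causes are: 34 | - The official documentation is loosely organized, and its knowledge framework is not concise or consistent enough; 35 | - It tries to cover everything, so high-value information is diluted for the sake of completeness; 36 | - It is not always kept up to date, so API behavior sometimes differs from what the documentation describes. 37 | 38 | I have also read many pandas tutorials, both Chinese and international, but on the whole most of them only scratch the surface and are not practical enough. So I decided to write a pandas tutorial of my own, both to improve my own skills and to help others avoid detours. 39 | 40 | ## Core Principles 41 | The core principles behind this tutorial are: 42 | - Put the logic of the knowledge system first: unorganized, unsystematic information is ineffective because it is hard to remember and hard to use; 43 | - Keep the granularity moderate, neither staying on the surface nor going into too much detail; 44 | - Keep examples short and focused (just enough to show the effect of an operation), so they are easy to type out for practice; 45 | - Annotate every example with an explanation to aid understanding. 
46 | 47 | ## Who This Tutorial Is For 48 | This tutorial covers beginner through advanced material and suits both beginners and data science practitioners who want to build a more systematic understanding. I have spent a lot of time preparing it to make it useful and accurate, but mistakes and omissions are inevitable, and feedback from readers is very welcome. 49 | 50 | ## Zhihu Profile 51 | 花半楼:https://www.zhihu.com/people/HANGZS 52 | 53 | ## Contact Me on WeChat 54 | #### WeChat ID: 204078950 55 | #### WeChat official account: 花半楼 56 | 57 | ![image](./resource/204078950.png) 58 | 59 | -------------------------------------------------------------------------------- /pandas_tutorial.egg-info/PKG-INFO: -------------------------------------------------------------------------------- 1 | Metadata-Version: 2.1 2 | Name: pandas-tutorial 3 | Version: 0.1.0 4 | Summary: pandas tutorial 5 | Home-page: https://github.com/hangsz/pandas-tutorial 6 | Author: hangsz 7 | Author-email: zhenhang.sun@outlook.com 8 | Description-Content-Type: text/markdown 9 | 10 | ## Tutorial Contents 11 | 12 |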
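13 | 14 | 0. Environment setup 15 | 1. Creating Series and DataFrame objects 16 | 2. Querying, modifying, adding to, and deleting from Series and DataFrame objects 17 | 3. merge in depth 18 | 4. Index objects: creating, querying, modifying, adding, deleting, and using them 19 | 5. Converting between regular columns and the row index 20 | 6. Data structure overview 21 | 7. Display control 22 | 8. Getting a quick overview 23 | 9. Numeric operations 24 | 10. Statistical operations 25 | 11. mask and comparison operations (to be completed) 26 | 12. Category dtype and discretization 27 | 13. Datetime operations 28 | 14. Object-dtype operations 29 | 15. groupby in depth (to be completed) 30 | 16. resample in depth (to be completed) 31 | 17. …… 32 | 33 |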
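34 | 35 | ## About This Tutorial 36 | 37 | Data science is one of today's hottest professions, Python is the most widely used programming language in the field, and a major reason for Python's popularity is its powerful data science library: pandas. 38 | 39 | ## Why I Wrote This Tutorial 40 | Yet as a data science practitioner, even after years of working with pandas, I still find myself looking things up in the official documentation all the time, which seriously hurts my productivity. Sometimes I am even forced to fall back on explicit loops, which is very un-pandas. I could not put up with that, so I decided to do something about it. 41 | 42 | After reading through the official documentation several times, I believe the root causes are: 43 | - The official documentation is loosely organized, and its knowledge framework is not concise or consistent enough; 44 | - It tries to cover everything, so high-value information is diluted for the sake of completeness; 45 | - It is not always kept up to date, so API behavior sometimes differs from what the documentation describes. 46 | 47 | I have also read many pandas tutorials, both Chinese and international, but on the whole most of them only scratch the surface and are not practical enough. So I decided to write a pandas tutorial of my own, both to improve my own skills and to help others avoid detours. 48 | 49 | ## Core Principles 50 | The core principles behind this tutorial are: 51 | - Put the logic of the knowledge system first: unorganized, unsystematic information is ineffective because it is hard to remember and hard to use; 52 | - Keep the granularity moderate, neither staying on the surface nor going into too much detail; 53 | - Keep examples short and focused (just enough to show the effect of an operation), so they are easy to type out for practice; 54 | - Annotate every example with an explanation to aid understanding. 
55 | 56 | ## Who This Tutorial Is For 57 | This tutorial covers beginner through advanced material and suits both beginners and data science practitioners who want to build a more systematic understanding. I have spent a lot of time preparing it to make it useful and accurate, but mistakes and omissions are inevitable, and feedback from readers is very welcome. 58 | 59 | ## Zhihu Profile 60 | 花半楼:https://www.zhihu.com/people/HANGZS 61 | 62 | ## Contact Me on WeChat 63 | #### WeChat ID: 204078950 64 | #### WeChat official account: 花半楼 65 | 66 | ![image](./resource/204078950.png) 67 | 68 | -------------------------------------------------------------------------------- /pandas_tutorial.egg-info/SOURCES.txt: -------------------------------------------------------------------------------- 1 | README.md 2 | setup.py 3 | pandas_tutorial.egg-info/PKG-INFO 4 | pandas_tutorial.egg-info/SOURCES.txt 5 | pandas_tutorial.egg-info/dependency_links.txt 6 | pandas_tutorial.egg-info/requires.txt 7 | pandas_tutorial.egg-info/top_level.txt -------------------------------------------------------------------------------- /pandas_tutorial.egg-info/dependency_links.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /pandas_tutorial.egg-info/requires.txt: -------------------------------------------------------------------------------- 1 | raft-python==0.1.3 2 | pandas==2.2.1 3 | -------------------------------------------------------------------------------- /pandas_tutorial.egg-info/top_level.txt: 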
-------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /resource/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hangsz/pandas-tutorial/71f5868816e976421d9020a6d4c15404fd6907b7/resource/1.png -------------------------------------------------------------------------------- /resource/2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hangsz/pandas-tutorial/71f5868816e976421d9020a6d4c15404fd6907b7/resource/2.png -------------------------------------------------------------------------------- /resource/204078950.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hangsz/pandas-tutorial/71f5868816e976421d9020a6d4c15404fd6907b7/resource/204078950.png -------------------------------------------------------------------------------- /resource/3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hangsz/pandas-tutorial/71f5868816e976421d9020a6d4c15404fd6907b7/resource/3.png -------------------------------------------------------------------------------- /resource/pay.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hangsz/pandas-tutorial/71f5868816e976421d9020a6d4c15404fd6907b7/resource/pay.png -------------------------------------------------------------------------------- /resource/我的.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/hangsz/pandas-tutorial/71f5868816e976421d9020a6d4c15404fd6907b7/resource/我的.png -------------------------------------------------------------------------------- /setup.py: 
-------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # coding: utf-8 3 | 4 | import setuptools 5 | from setuptools import setup 6 | 7 | with open('README.md', 'r', encoding='utf-8') as fh: 8 | long_description = fh.read() 9 | 10 | setup( 11 | name='pandas-tutorial', 12 | version='0.1.0', 13 | author='hangsz', 14 | author_email='zhenhang.sun@outlook.com', 15 | url='https://github.com/hangsz/pandas-tutorial', 16 | description=u'pandas tutorial', 17 | long_description=long_description, 18 | long_description_content_type="text/markdown", 19 | packages=setuptools.find_packages(), 20 | install_requires=[ 21 | 'raft-python==0.1.3', 22 | 'pandas==2.2.1' 23 | ] 24 | ) -------------------------------------------------------------------------------- /tips.csv: -------------------------------------------------------------------------------- 1 | total_bill,tip,sex,smoker,day,time,size 2 | 16.99,1.01,Female,No,Sun,Dinner,2 3 | 10.34,1.66,Male,No,Sun,Dinner,3 4 | 21.01,3.5,Male,No,Sun,Dinner,3 5 | 23.68,3.31,Male,No,Sun,Dinner,2 6 | 24.59,3.61,Female,No,Sun,Dinner,4 7 | 25.29,4.71,Male,No,Sun,Dinner,4 8 | 8.77,2.0,Male,No,Sun,Dinner,2 9 | 26.88,3.12,Male,No,Sun,Dinner,4 10 | 15.04,1.96,Male,No,Sun,Dinner,2 11 | 14.78,3.23,Male,No,Sun,Dinner,2 12 | 10.27,1.71,Male,No,Sun,Dinner,2 13 | 35.26,5.0,Female,No,Sun,Dinner,4 14 | 15.42,1.57,Male,No,Sun,Dinner,2 15 | 18.43,3.0,Male,No,Sun,Dinner,4 16 | 14.83,3.02,Female,No,Sun,Dinner,2 17 | 21.58,3.92,Male,No,Sun,Dinner,2 18 | 10.33,1.67,Female,No,Sun,Dinner,3 19 | 16.29,3.71,Male,No,Sun,Dinner,3 20 | 16.97,3.5,Female,No,Sun,Dinner,3 21 | 20.65,3.35,Male,No,Sat,Dinner,3 22 | 17.92,4.08,Male,No,Sat,Dinner,2 23 | 20.29,2.75,Female,No,Sat,Dinner,2 24 | 15.77,2.23,Female,No,Sat,Dinner,2 25 | 39.42,7.58,Male,No,Sat,Dinner,4 26 | 19.82,3.18,Male,No,Sat,Dinner,2 27 | 17.81,2.34,Male,No,Sat,Dinner,4 28 | 13.37,2.0,Male,No,Sat,Dinner,2 29 | 12.69,2.0,Male,No,Sat,Dinner,2 30 | 
21.7,4.3,Male,No,Sat,Dinner,2 31 | 19.65,3.0,Female,No,Sat,Dinner,2 32 | 9.55,1.45,Male,No,Sat,Dinner,2 33 | 18.35,2.5,Male,No,Sat,Dinner,4 34 | 15.06,3.0,Female,No,Sat,Dinner,2 35 | 20.69,2.45,Female,No,Sat,Dinner,4 36 | 17.78,3.27,Male,No,Sat,Dinner,2 37 | 24.06,3.6,Male,No,Sat,Dinner,3 38 | 16.31,2.0,Male,No,Sat,Dinner,3 39 | 16.93,3.07,Female,No,Sat,Dinner,3 40 | 18.69,2.31,Male,No,Sat,Dinner,3 41 | 31.27,5.0,Male,No,Sat,Dinner,3 42 | 16.04,2.24,Male,No,Sat,Dinner,3 43 | 17.46,2.54,Male,No,Sun,Dinner,2 44 | 13.94,3.06,Male,No,Sun,Dinner,2 45 | 9.68,1.32,Male,No,Sun,Dinner,2 46 | 30.4,5.6,Male,No,Sun,Dinner,4 47 | 18.29,3.0,Male,No,Sun,Dinner,2 48 | 22.23,5.0,Male,No,Sun,Dinner,2 49 | 32.4,6.0,Male,No,Sun,Dinner,4 50 | 28.55,2.05,Male,No,Sun,Dinner,3 51 | 18.04,3.0,Male,No,Sun,Dinner,2 52 | 12.54,2.5,Male,No,Sun,Dinner,2 53 | 10.29,2.6,Female,No,Sun,Dinner,2 54 | 34.81,5.2,Female,No,Sun,Dinner,4 55 | 9.94,1.56,Male,No,Sun,Dinner,2 56 | 25.56,4.34,Male,No,Sun,Dinner,4 57 | 19.49,3.51,Male,No,Sun,Dinner,2 58 | 38.01,3.0,Male,Yes,Sat,Dinner,4 59 | 26.41,1.5,Female,No,Sat,Dinner,2 60 | 11.24,1.76,Male,Yes,Sat,Dinner,2 61 | 48.27,6.73,Male,No,Sat,Dinner,4 62 | 20.29,3.21,Male,Yes,Sat,Dinner,2 63 | 13.81,2.0,Male,Yes,Sat,Dinner,2 64 | 11.02,1.98,Male,Yes,Sat,Dinner,2 65 | 18.29,3.76,Male,Yes,Sat,Dinner,4 66 | 17.59,2.64,Male,No,Sat,Dinner,3 67 | 20.08,3.15,Male,No,Sat,Dinner,3 68 | 16.45,2.47,Female,No,Sat,Dinner,2 69 | 3.07,1.0,Female,Yes,Sat,Dinner,1 70 | 20.23,2.01,Male,No,Sat,Dinner,2 71 | 15.01,2.09,Male,Yes,Sat,Dinner,2 72 | 12.02,1.97,Male,No,Sat,Dinner,2 73 | 17.07,3.0,Female,No,Sat,Dinner,3 74 | 26.86,3.14,Female,Yes,Sat,Dinner,2 75 | 25.28,5.0,Female,Yes,Sat,Dinner,2 76 | 14.73,2.2,Female,No,Sat,Dinner,2 77 | 10.51,1.25,Male,No,Sat,Dinner,2 78 | 17.92,3.08,Male,Yes,Sat,Dinner,2 79 | 27.2,4.0,Male,No,Thur,Lunch,4 80 | 22.76,3.0,Male,No,Thur,Lunch,2 81 | 17.29,2.71,Male,No,Thur,Lunch,2 82 | 19.44,3.0,Male,Yes,Thur,Lunch,2 83 | 16.66,3.4,Male,No,Thur,Lunch,2 84 
| 10.07,1.83,Female,No,Thur,Lunch,1 85 | 32.68,5.0,Male,Yes,Thur,Lunch,2 86 | 15.98,2.03,Male,No,Thur,Lunch,2 87 | 34.83,5.17,Female,No,Thur,Lunch,4 88 | 13.03,2.0,Male,No,Thur,Lunch,2 89 | 18.28,4.0,Male,No,Thur,Lunch,2 90 | 24.71,5.85,Male,No,Thur,Lunch,2 91 | 21.16,3.0,Male,No,Thur,Lunch,2 92 | 28.97,3.0,Male,Yes,Fri,Dinner,2 93 | 22.49,3.5,Male,No,Fri,Dinner,2 94 | 5.75,1.0,Female,Yes,Fri,Dinner,2 95 | 16.32,4.3,Female,Yes,Fri,Dinner,2 96 | 22.75,3.25,Female,No,Fri,Dinner,2 97 | 40.17,4.73,Male,Yes,Fri,Dinner,4 98 | 27.28,4.0,Male,Yes,Fri,Dinner,2 99 | 12.03,1.5,Male,Yes,Fri,Dinner,2 100 | 21.01,3.0,Male,Yes,Fri,Dinner,2 101 | 12.46,1.5,Male,No,Fri,Dinner,2 102 | 11.35,2.5,Female,Yes,Fri,Dinner,2 103 | 15.38,3.0,Female,Yes,Fri,Dinner,2 104 | 44.3,2.5,Female,Yes,Sat,Dinner,3 105 | 22.42,3.48,Female,Yes,Sat,Dinner,2 106 | 20.92,4.08,Female,No,Sat,Dinner,2 107 | 15.36,1.64,Male,Yes,Sat,Dinner,2 108 | 20.49,4.06,Male,Yes,Sat,Dinner,2 109 | 25.21,4.29,Male,Yes,Sat,Dinner,2 110 | 18.24,3.76,Male,No,Sat,Dinner,2 111 | 14.31,4.0,Female,Yes,Sat,Dinner,2 112 | 14.0,3.0,Male,No,Sat,Dinner,2 113 | 7.25,1.0,Female,No,Sat,Dinner,1 114 | 38.07,4.0,Male,No,Sun,Dinner,3 115 | 23.95,2.55,Male,No,Sun,Dinner,2 116 | 25.71,4.0,Female,No,Sun,Dinner,3 117 | 17.31,3.5,Female,No,Sun,Dinner,2 118 | 29.93,5.07,Male,No,Sun,Dinner,4 119 | 10.65,1.5,Female,No,Thur,Lunch,2 120 | 12.43,1.8,Female,No,Thur,Lunch,2 121 | 24.08,2.92,Female,No,Thur,Lunch,4 122 | 11.69,2.31,Male,No,Thur,Lunch,2 123 | 13.42,1.68,Female,No,Thur,Lunch,2 124 | 14.26,2.5,Male,No,Thur,Lunch,2 125 | 15.95,2.0,Male,No,Thur,Lunch,2 126 | 12.48,2.52,Female,No,Thur,Lunch,2 127 | 29.8,4.2,Female,No,Thur,Lunch,6 128 | 8.52,1.48,Male,No,Thur,Lunch,2 129 | 14.52,2.0,Female,No,Thur,Lunch,2 130 | 11.38,2.0,Female,No,Thur,Lunch,2 131 | 22.82,2.18,Male,No,Thur,Lunch,3 132 | 19.08,1.5,Male,No,Thur,Lunch,2 133 | 20.27,2.83,Female,No,Thur,Lunch,2 134 | 11.17,1.5,Female,No,Thur,Lunch,2 135 | 12.26,2.0,Female,No,Thur,Lunch,2 136 | 
18.26,3.25,Female,No,Thur,Lunch,2 137 | 8.51,1.25,Female,No,Thur,Lunch,2 138 | 10.33,2.0,Female,No,Thur,Lunch,2 139 | 14.15,2.0,Female,No,Thur,Lunch,2 140 | 16.0,2.0,Male,Yes,Thur,Lunch,2 141 | 13.16,2.75,Female,No,Thur,Lunch,2 142 | 17.47,3.5,Female,No,Thur,Lunch,2 143 | 34.3,6.7,Male,No,Thur,Lunch,6 144 | 41.19,5.0,Male,No,Thur,Lunch,5 145 | 27.05,5.0,Female,No,Thur,Lunch,6 146 | 16.43,2.3,Female,No,Thur,Lunch,2 147 | 8.35,1.5,Female,No,Thur,Lunch,2 148 | 18.64,1.36,Female,No,Thur,Lunch,3 149 | 11.87,1.63,Female,No,Thur,Lunch,2 150 | 9.78,1.73,Male,No,Thur,Lunch,2 151 | 7.51,2.0,Male,No,Thur,Lunch,2 152 | 14.07,2.5,Male,No,Sun,Dinner,2 153 | 13.13,2.0,Male,No,Sun,Dinner,2 154 | 17.26,2.74,Male,No,Sun,Dinner,3 155 | 24.55,2.0,Male,No,Sun,Dinner,4 156 | 19.77,2.0,Male,No,Sun,Dinner,4 157 | 29.85,5.14,Female,No,Sun,Dinner,5 158 | 48.17,5.0,Male,No,Sun,Dinner,6 159 | 25.0,3.75,Female,No,Sun,Dinner,4 160 | 13.39,2.61,Female,No,Sun,Dinner,2 161 | 16.49,2.0,Male,No,Sun,Dinner,4 162 | 21.5,3.5,Male,No,Sun,Dinner,4 163 | 12.66,2.5,Male,No,Sun,Dinner,2 164 | 16.21,2.0,Female,No,Sun,Dinner,3 165 | 13.81,2.0,Male,No,Sun,Dinner,2 166 | 17.51,3.0,Female,Yes,Sun,Dinner,2 167 | 24.52,3.48,Male,No,Sun,Dinner,3 168 | 20.76,2.24,Male,No,Sun,Dinner,2 169 | 31.71,4.5,Male,No,Sun,Dinner,4 170 | 10.59,1.61,Female,Yes,Sat,Dinner,2 171 | 10.63,2.0,Female,Yes,Sat,Dinner,2 172 | 50.81,10.0,Male,Yes,Sat,Dinner,3 173 | 15.81,3.16,Male,Yes,Sat,Dinner,2 174 | 7.25,5.15,Male,Yes,Sun,Dinner,2 175 | 31.85,3.18,Male,Yes,Sun,Dinner,2 176 | 16.82,4.0,Male,Yes,Sun,Dinner,2 177 | 32.9,3.11,Male,Yes,Sun,Dinner,2 178 | 17.89,2.0,Male,Yes,Sun,Dinner,2 179 | 14.48,2.0,Male,Yes,Sun,Dinner,2 180 | 9.6,4.0,Female,Yes,Sun,Dinner,2 181 | 34.63,3.55,Male,Yes,Sun,Dinner,2 182 | 34.65,3.68,Male,Yes,Sun,Dinner,4 183 | 23.33,5.65,Male,Yes,Sun,Dinner,2 184 | 45.35,3.5,Male,Yes,Sun,Dinner,3 185 | 23.17,6.5,Male,Yes,Sun,Dinner,4 186 | 40.55,3.0,Male,Yes,Sun,Dinner,2 187 | 20.69,5.0,Male,No,Sun,Dinner,5 188 | 
20.9,3.5,Female,Yes,Sun,Dinner,3 189 | 30.46,2.0,Male,Yes,Sun,Dinner,5 190 | 18.15,3.5,Female,Yes,Sun,Dinner,3 191 | 23.1,4.0,Male,Yes,Sun,Dinner,3 192 | 15.69,1.5,Male,Yes,Sun,Dinner,2 193 | 19.81,4.19,Female,Yes,Thur,Lunch,2 194 | 28.44,2.56,Male,Yes,Thur,Lunch,2 195 | 15.48,2.02,Male,Yes,Thur,Lunch,2 196 | 16.58,4.0,Male,Yes,Thur,Lunch,2 197 | 7.56,1.44,Male,No,Thur,Lunch,2 198 | 10.34,2.0,Male,Yes,Thur,Lunch,2 199 | 43.11,5.0,Female,Yes,Thur,Lunch,4 200 | 13.0,2.0,Female,Yes,Thur,Lunch,2 201 | 13.51,2.0,Male,Yes,Thur,Lunch,2 202 | 18.71,4.0,Male,Yes,Thur,Lunch,3 203 | 12.74,2.01,Female,Yes,Thur,Lunch,2 204 | 13.0,2.0,Female,Yes,Thur,Lunch,2 205 | 16.4,2.5,Female,Yes,Thur,Lunch,2 206 | 20.53,4.0,Male,Yes,Thur,Lunch,4 207 | 16.47,3.23,Female,Yes,Thur,Lunch,3 208 | 26.59,3.41,Male,Yes,Sat,Dinner,3 209 | 38.73,3.0,Male,Yes,Sat,Dinner,4 210 | 24.27,2.03,Male,Yes,Sat,Dinner,2 211 | 12.76,2.23,Female,Yes,Sat,Dinner,2 212 | 30.06,2.0,Male,Yes,Sat,Dinner,3 213 | 25.89,5.16,Male,Yes,Sat,Dinner,4 214 | 48.33,9.0,Male,No,Sat,Dinner,4 215 | 13.27,2.5,Female,Yes,Sat,Dinner,2 216 | 28.17,6.5,Female,Yes,Sat,Dinner,3 217 | 12.9,1.1,Female,Yes,Sat,Dinner,2 218 | 28.15,3.0,Male,Yes,Sat,Dinner,5 219 | 11.59,1.5,Male,Yes,Sat,Dinner,2 220 | 7.74,1.44,Male,Yes,Sat,Dinner,2 221 | 30.14,3.09,Female,Yes,Sat,Dinner,4 222 | 12.16,2.2,Male,Yes,Fri,Lunch,2 223 | 13.42,3.48,Female,Yes,Fri,Lunch,2 224 | 8.58,1.92,Male,Yes,Fri,Lunch,1 225 | 15.98,3.0,Female,No,Fri,Lunch,3 226 | 13.42,1.58,Male,Yes,Fri,Lunch,2 227 | 16.27,2.5,Female,Yes,Fri,Lunch,2 228 | 10.09,2.0,Female,Yes,Fri,Lunch,2 229 | 20.45,3.0,Male,No,Sat,Dinner,4 230 | 13.28,2.72,Male,No,Sat,Dinner,2 231 | 22.12,2.88,Female,Yes,Sat,Dinner,2 232 | 24.01,2.0,Male,Yes,Sat,Dinner,4 233 | 15.69,3.0,Male,Yes,Sat,Dinner,3 234 | 11.61,3.39,Male,No,Sat,Dinner,2 235 | 10.77,1.47,Male,No,Sat,Dinner,2 236 | 15.53,3.0,Male,Yes,Sat,Dinner,2 237 | 10.07,1.25,Male,No,Sat,Dinner,2 238 | 12.6,1.0,Male,Yes,Sat,Dinner,2 239 | 
32.83,1.17,Male,Yes,Sat,Dinner,2 240 | 35.83,4.67,Female,No,Sat,Dinner,3 241 | 29.03,5.92,Male,No,Sat,Dinner,3 242 | 27.18,2.0,Female,Yes,Sat,Dinner,2 243 | 22.67,2.0,Male,Yes,Sat,Dinner,2 244 | 17.82,1.75,Male,No,Sat,Dinner,2 245 | 18.78,3.0,Female,No,Thur,Dinner,2 246 | --------------------------------------------------------------------------------
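The overview calls covered in chapter 8 (`shape`, `size`, `head()`, `describe()`, `memory_usage()`, `info()`) can be tried on the `tips.csv` data above. The following is only a minimal sketch under stated assumptions: the first five data rows are inlined as text so that it runs without the file, the variable names are illustrative and not part of the repository, and `exclude=["number"]` is used here in place of the notebook's `include=[np.object]` to select the non-numeric columns.

```python
import io

import pandas as pd

# First five data rows of tips.csv, inlined (an assumption made so the
# sketch is self-contained; with the real file use pd.read_csv("tips.csv")).
csv_text = """total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.5,Male,No,Sun,Dinner,3
23.68,3.31,Male,No,Sun,Dinner,2
24.59,3.61,Female,No,Sun,Dinner,4
"""
tips = pd.read_csv(io.StringIO(csv_text))

shape = tips.shape        # (rows, columns) -> (5, 7)
n_elements = tips.size    # rows * columns  -> 35
first_rows = tips.head(3) # first three rows

# Numeric columns only (the default behaviour of describe()).
numeric_summary = tips.describe()

# Non-numeric columns: count, unique, top, freq.
text_summary = tips.describe(exclude=["number"])

# Per-column memory in bytes; deep=True also measures the Python
# objects that the text columns refer to.
mem_shallow = tips.memory_usage(deep=False)
mem_deep = tips.memory_usage(deep=True)

# Prints dtypes, non-null counts and memory usage to stdout.
tips.info()
```

On the full file the same calls apply unchanged; only the reported counts and memory figures differ.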