├── 1TensorFlow介绍
│   ├── .ipynb_checkpoints
│   │   ├── 2-1两个向量的加法运算-checkpoint.ipynb
│   │   ├── 验证TensorFlow是否安装成功(不输出额外信息)-checkpoint.ipynb
│   │   └── 验证TensorFlow是否安装成功-checkpoint.ipynb
│   ├── 1-1验证TensorFlow是否安装成功.ipynb
│   ├── 1-2验证TensorFlow是否安装成功(不输出额外信息).ipynb
│   └── 2-1两个向量的加法运算.ipynb
├── Google机器学习速成课程笔记
│   ├── README.md
│   ├── images
│   │   ├── 0003_linear.png
│   │   ├── 0003_loss.png
│   │   ├── 0003_practice.png
│   │   ├── 0003_tempature.png
│   │   ├── 0004_gradient_descent.png
│   │   ├── L2_loss_figure.png
│   │   ├── LearningFromData.png
│   │   └── linear_regression.png
│   ├── 机器学习概念
│   │   ├── 机器学习简介.md
│   │   ├── 深入了解机器学习.md
│   │   ├── 问题构建.md
│   │   └── 降低损失.md
│   └── 简介
│       ├── intro_to_pandas.ipynb
│       └── 前提条件和准备工作.md
├── README.md
└── TensorFlow从入门到精通
    ├── 01_简单线性模型.ipynb
    └── README.md
/1TensorFlow介绍/1-1验证TensorFlow是否安装成功.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 3,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "# Summary: Verify that TensorFlow is installed correctly\n",
12 | "# Author: Amusi\n",
13 | "# Date: 2018-03-25"
14 | ]
15 | },
16 | {
17 | "cell_type": "code",
18 | "execution_count": 4,
19 | "metadata": {
20 | "collapsed": false
21 | },
22 | "outputs": [
23 | {
24 | "name": "stdout",
25 | "output_type": "stream",
26 | "text": [
27 | "Welcome to the world of deep neural networks!\n"
28 | ]
29 | }
30 | ],
31 | "source": [
32 | "import tensorflow as tf\n",
33 | "message = tf.constant('Welcome to the world of deep neural networks!')\n",
34 | "with tf.Session() as sess:\n",
35 | "    print(sess.run(message).decode())"
36 | ]
37 | }
38 | ],
39 | "metadata": {
40 | "anaconda-cloud": {},
41 | "kernelspec": {
42 | "display_name": "Python [default]",
43 | "language": "python",
44 | "name": "python3"
45 | },
46 | "language_info": {
47 | "codemirror_mode": {
48 | "name": "ipython",
49 | "version": 3
50 | },
51 | "file_extension": ".py",
52 | "mimetype": "text/x-python",
53 | "name": "python",
54 | "nbconvert_exporter": "python",
55 | "pygments_lexer": "ipython3",
56 | "version": "3.5.2"
57 | }
58 | },
59 | "nbformat": 4,
60 | "nbformat_minor": 1
61 | }
62 |
--------------------------------------------------------------------------------
/1TensorFlow介绍/1-2验证TensorFlow是否安装成功(不输出额外信息).ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 3,
6 | "metadata": {
7 | "collapsed": true
8 | },
9 | "outputs": [],
10 | "source": [
11 | "# Summary: Verify that TensorFlow is installed correctly, without printing warning or hardware-info messages\n",
12 | "# Author: Amusi\n",
13 | "# Date: 2018-03-25"
14 | ]
15 | },
16 | {
17 | "cell_type": "code",
18 | "execution_count": 6,
19 | "metadata": {
20 | "collapsed": false
21 | },
22 | "outputs": [
23 | {
24 | "name": "stdout",
25 | "output_type": "stream",
26 | "text": [
27 | "Welcome to the world of deep neural networks!\n"
28 | ]
29 | }
30 | ],
31 | "source": [
32 | "import os\n",
33 | "# Set this before importing TensorFlow to suppress its warning and hardware-info log messages\n",
34 | "os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'\n",
35 | "import tensorflow as tf\n",
36 | "message = tf.constant('Welcome to the world of deep neural networks!')\n",
37 | "with tf.Session() as sess:\n",
38 | "    print(sess.run(message).decode())"
39 | ]
40 | }
41 | ],
42 | "metadata": {
43 | "anaconda-cloud": {},
44 | "kernelspec": {
45 | "display_name": "Python [default]",
46 | "language": "python",
47 | "name": "python3"
48 | },
49 | "language_info": {
50 | "codemirror_mode": {
51 | "name": "ipython",
52 | "version": 3
53 | },
54 | "file_extension": ".py",
55 | "mimetype": "text/x-python",
56 | "name": "python",
57 | "nbconvert_exporter": "python",
58 | "pygments_lexer": "ipython3",
59 | "version": "3.5.2"
60 | }
61 | },
62 | "nbformat": 4,
63 | "nbformat_minor": 1
64 | }
65 |
--------------------------------------------------------------------------------
/1TensorFlow介绍/2-1两个向量的加法运算.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
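7 | "Summary: Understanding the structure of a TensorFlow program\n",
8 | "Author: Amusi\n",
9 | "Date: 2018-03-29\n",
10 | "Note: A simple example of defining a computation graph and then executing it: adding two vectors element-wise"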
11 | ]
12 | },
13 | {
14 | "cell_type": "code",
15 | "execution_count": 7,
16 | "metadata": {
17 | "collapsed": true
18 | },
19 | "outputs": [],
20 | "source": [
21 | "import tensorflow as tf"
22 | ]
23 | },
24 | {
25 | "cell_type": "code",
26 | "execution_count": 8,
27 | "metadata": {
28 | "collapsed": false
29 | },
30 | "outputs": [
31 | {
32 | "name": "stdout",
33 | "output_type": "stream",
34 | "text": [
35 | "[3 3 8 7]\n"
36 | ]
37 | }
38 | ],
39 | "source": [
40 | "# Define the graph\n",
41 | "v_1 = tf.constant([1, 2, 3, 4])\n",
42 | "v_2 = tf.constant([2, 1, 5, 3])\n",
43 | "v_add = tf.add(v_1, v_2) # or: v_add = v_1 + v_2\n",
44 | "\n",
45 | "# Execute the graph with tf.Session()\n",
46 | "with tf.Session() as sess:\n",
47 | "    print(sess.run(v_add))\n",
48 | " \n",
49 | "# The two lines above are equivalent to the commented-out code below. Prefer the with form: the session is closed automatically when the block exits. Without with, you must close the Session yourself.\n",
50 | "#sess = tf.Session()\n",
51 | "#print(sess.run(v_add))\n",
52 | "#sess.close()"
53 | ]
54 | },
55 | {
56 | "cell_type": "code",
57 | "execution_count": 9,
58 | "metadata": {
59 | "collapsed": false
60 | },
61 | "outputs": [
62 | {
63 | "name": "stdout",
64 | "output_type": "stream",
65 | "text": [
66 | "[3 3 8 7]\n"
67 | ]
68 | }
69 | ],
70 | "source": [
71 | "# tf.InteractiveSession() installs itself as the default session when it is created, so v_add.eval() works without passing a session and no sess.run() call is needed\n",
72 | "sess = tf.InteractiveSession()\n",
73 | "\n",
74 | "v_1 = tf.constant([1, 2, 3, 4])\n",
75 | "v_2 = tf.constant([2, 1, 5, 3])\n",
76 | "v_add = tf.add(v_1, v_2) # or: v_add = v_1 + v_2\n",
77 | "\n",
78 | "print(v_add.eval())\n",
79 | "sess.close()"
80 | ]
81 | }
82 | ],
83 | "metadata": {
84 | "anaconda-cloud": {},
85 | "kernelspec": {
86 | "display_name": "Python [default]",
87 | "language": "python",
88 | "name": "python3"
89 | },
90 | "language_info": {
91 | "codemirror_mode": {
92 | "name": "ipython",
93 | "version": 3
94 | },
95 | "file_extension": ".py",
96 | "mimetype": "text/x-python",
97 | "name": "python",
98 | "nbconvert_exporter": "python",
99 | "pygments_lexer": "ipython3",
100 | "version": "3.5.2"
101 | }
102 | },
103 | "nbformat": 4,
104 | "nbformat_minor": 1
105 | }
106 |
--------------------------------------------------------------------------------
/Google机器学习速成课程笔记/README.md:
--------------------------------------------------------------------------------
1 | # Google Machine Learning Crash Course
2 |
3 | Official site: https://developers.google.com/machine-learning/crash-course/
4 |
5 | - [Video lectures](https://developers.google.com/machine-learning/crash-course/)
6 | - [Exercises](https://developers.google.com/machine-learning/crash-course/exercises)
7 | - [Glossary](https://developers.google.com/machine-learning/crash-course/glossary)
8 |
9 | **About the Machine Learning Crash Course**
10 |
11 | - A fast-paced, practical introduction to machine learning produced by Google (using the TensorFlow API).
12 | - The crash course consists of a series of video lectures, real-world case studies, and hands-on practice exercises.
13 |
14 |
15 | **Bilibili videos**
16 |
17 | - [English audio with Chinese subtitles](https://www.bilibili.com/video/av21398424/?p=1)
18 | - [Chinese audio, no subtitles](https://www.bilibili.com/video/av20229263/?p=1)
19 |
20 |
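21 | # Contents
22 |
23 | ## Introduction
24 |
25 | [Prerequisites and prework](简介/前提条件和准备工作.md)
26 |
27 |
28 |
29 | ## Part 1: ML Concepts
30 |
31 | - [x] [Introduction to ML](机器学习概念/机器学习简介.md)
32 | - [x] [Framing](机器学习概念/问题构建.md)
33 | - [x] [Descending into ML](机器学习概念/深入了解机器学习.md)
34 | - [x] [Reducing Loss](机器学习概念/降低损失.md)
35 | - [ ] First Steps with TF
36 | - [ ] Generalization
37 | - [ ] Training and Test Sets
38 | - [ ] Validation
39 | - [ ] Representation
40 | - [ ] Feature Crosses
41 | - [ ] Regularization: Simplicity
42 | - [ ] Logistic Regression
43 | - [ ] Classification
44 | - [ ] Regularization: Sparsity
45 | - [ ] Introduction to Neural Networks
46 | - [ ] Training Neural Networks
47 | - [ ] Multi-Class Neural Networks
48 | - [ ] Embeddings
49 |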
50 |
51 |
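52 | ## Part 2: ML Engineering
53 |
54 | - [ ] Production ML Systems
55 | - [ ] Static vs. Dynamic Training
56 | - [ ] Static vs. Dynamic Inference
57 | - [ ] Data Dependencies
58 |
59 |
60 |
61 | ## Part 3: ML Real-World Examples
62 |
63 | - [ ] Cancer Prediction
64 | - [ ] 18th Century Literature
65 | - [ ] Real-World Guidelines
66 |
67 |
68 | ## Conclusion
69 |
70 | - Next Steps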
--------------------------------------------------------------------------------
/Google机器学习速成课程笔记/images/0003_linear.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MichaelBeechan/TensorFlow-From-One-To-One/f4527746d542b58dc230c960806f97a0d4061a15/Google机器学习速成课程笔记/images/0003_linear.png
--------------------------------------------------------------------------------
/Google机器学习速成课程笔记/images/0003_loss.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MichaelBeechan/TensorFlow-From-One-To-One/f4527746d542b58dc230c960806f97a0d4061a15/Google机器学习速成课程笔记/images/0003_loss.png
--------------------------------------------------------------------------------
/Google机器学习速成课程笔记/images/0003_practice.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MichaelBeechan/TensorFlow-From-One-To-One/f4527746d542b58dc230c960806f97a0d4061a15/Google机器学习速成课程笔记/images/0003_practice.png
--------------------------------------------------------------------------------
/Google机器学习速成课程笔记/images/0003_tempature.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MichaelBeechan/TensorFlow-From-One-To-One/f4527746d542b58dc230c960806f97a0d4061a15/Google机器学习速成课程笔记/images/0003_tempature.png
--------------------------------------------------------------------------------
/Google机器学习速成课程笔记/images/0004_gradient_descent.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MichaelBeechan/TensorFlow-From-One-To-One/f4527746d542b58dc230c960806f97a0d4061a15/Google机器学习速成课程笔记/images/0004_gradient_descent.png
--------------------------------------------------------------------------------
/Google机器学习速成课程笔记/images/L2_loss_figure.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MichaelBeechan/TensorFlow-From-One-To-One/f4527746d542b58dc230c960806f97a0d4061a15/Google机器学习速成课程笔记/images/L2_loss_figure.png
--------------------------------------------------------------------------------
/Google机器学习速成课程笔记/images/LearningFromData.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MichaelBeechan/TensorFlow-From-One-To-One/f4527746d542b58dc230c960806f97a0d4061a15/Google机器学习速成课程笔记/images/LearningFromData.png
--------------------------------------------------------------------------------
/Google机器学习速成课程笔记/images/linear_regression.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MichaelBeechan/TensorFlow-From-One-To-One/f4527746d542b58dc230c960806f97a0d4061a15/Google机器学习速成课程笔记/images/linear_regression.png
--------------------------------------------------------------------------------
/Google机器学习速成课程笔记/机器学习概念/机器学习简介.md:
--------------------------------------------------------------------------------
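1 | # Introduction to Machine Learning
2 |
3 | [Machine learning](https://en.wikipedia.org/wiki/Machine_learning) is a field of computer science that typically uses statistical techniques to let computers "learn" from data (that is, progressively improve their performance on a specific task) without being explicitly programmed.
4 |
5 | ![Machine learning](http://www.toptal.com/designers/subtlepatterns/patterns/Machine-Learning.jpg)
6 |
7 |
8 |
9 | ## Learning objectives
10 |
11 | - Recognize the practical benefits of mastering machine learning.
12 | - Understand the philosophy behind machine learning.
13 |
14 |
15 |
16 | ## Why use machine learning
17 |
18 | Machine learning helps you do the following three things better:
19 |
20 | - **Reduce programming time**: start from an off-the-shelf model, feed it a set of examples, and get a reasonably reliable program in a short time.
21 | - **Customize your products for specific user groups**: build the program with machine learning, then transfer it to the target group.
22 | - **Solve problems that cannot be programmed by hand**: for example, define a neural network and feed it a large number of examples; the network learns on its own and extracts the features.
23 |
24 |
25 |
26 |
27 | # Reference
28 |
29 | [Introduction to Machine Learning](https://developers.google.com/machine-learning/crash-course/ml-intro)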
--------------------------------------------------------------------------------
/Google机器学习速成课程笔记/机器学习概念/深入了解机器学习.md:
--------------------------------------------------------------------------------
1 | [TOC]
2 |
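3 | # Descending into ML
4 |
5 | **Linear regression** is a method for finding the straight line or hyperplane that best fits a set of points. This module first explores linear regression intuitively, laying the groundwork for a machine learning approach to linear regression.
6 |
7 | ## Learning objectives
8 |
9 | - Refresh your memory on line fitting.
10 | - Relate weights and biases in machine learning to slope and offset in line fitting.
11 | - Understand "loss" in general and squared loss in particular.
12 |
13 | ## Learning from data
14 |
15 | - There are many complex ways to learn from data.
16 | - But we can start with something simple and familiar.
17 | - Starting simple opens the door to some broadly useful methods.
18 |
19 | ![Learning from data](../images/LearningFromData.png)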
20 |
21 |
22 |
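23 | ## Linear regression example: predicting house prices
24 |
25 | x: floor area of the house
26 |
27 | y: house price
28 |
29 | The blue points in the figure are the input data. We need to fit a straight line to this data: y'=wx+b
30 |
31 | ![linear_regression](../images/linear_regression.png)
32 |
33 | ## A convenient loss function for regression
34 |
35 | The **L2 loss** for a given example is also called squared error
36 |
37 | = the square of the difference between the prediction and the label
38 |
39 | = (label - prediction)^2
40 |
41 | = (y - y')^2
42 |
43 | For the house price prediction example above, we can use the **L2 loss** to measure how well the line fits.
44 |
45 | As the figure below shows, each red segment represents {prediction - label} : {y' - y}, and this {y' - y} is the error. (y - y')^2 is then the L2 loss.
46 |
47 | ![L2_loss_figure](../images/L2_loss_figure.png)
48 |
49 | Note: the target value in the figure above is the label.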
50 |
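51 | ## Defining L2 loss on a dataset
52 |
53 | $$L_{2}Loss=\sum_{(x,y)\in D}(y-prediction(x))^{2}$$
54 |
55 | ∑: we sum over all examples in the training set.
56 |
57 | D: it is sometimes useful to average over all examples, which means multiplying the sum by $$\frac{1}{N}$$, where N is the number of examples in D.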
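
The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy dataset, the model `predict`, and the helper names `squared_loss` and `l2_loss` are hypothetical and do not come from the course material.

```python
def squared_loss(y, y_pred):
    """L2 (squared) loss for a single example: (y - y')^2."""
    return (y - y_pred) ** 2


def l2_loss(dataset, predict):
    """Sum of the squared losses over every (x, y) pair in the dataset D."""
    return sum(squared_loss(y, predict(x)) for x, y in dataset)


def predict(x):
    """Hypothetical model used only for this sketch: y' = 2x."""
    return 2.0 * x


# Hypothetical toy dataset of (x, y) pairs.
dataset = [(1.0, 2.0), (2.0, 4.0), (3.0, 7.0)]

total = l2_loss(dataset, predict)   # only the last example is off: (7 - 6)^2
mean = total / len(dataset)         # averaged form: divide by the number of examples
print(total, mean)
```

Dividing by the number of examples makes losses comparable across datasets of different sizes.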
58 |
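59 | ## Linear regression
60 |
61 | It has [long been known](https://wikipedia.org/wiki/Dolbear's_law) that crickets chirp more frequently on hotter days than on cooler days. For decades, professional and amateur entomologists have cataloged data on chirps per minute and temperature. As a birthday gift, your Aunt Ruth gives you her beloved cricket database and invites you to train a model on it that predicts the relationship between chirps and temperature.
62 |
63 | First, plot the data to see how it is distributed:
64 |
65 | ![Temperature vs. cricket chirps](../images/0003_tempature.png)
66 |
67 | As expected, the plot shows the temperature rising with the number of chirps. Is the relationship between chirps and temperature linear? Yes: you can draw a single straight line like the following to approximate it:
68 |
69 | ![linear](../images/0003_linear.png)
70 |
71 | True, the line does not pass exactly through every point, but for the data we have it clearly shows the relationship between chirps and temperature. With a little algebra, you can write this relationship down: **y=mx+b**
72 |
73 | where:
74 |
75 | - y is the temperature in Celsius, the value we are trying to predict.
76 | - m is the slope of the line.
77 | - x is the number of chirps per minute, the value of our input feature.
78 | - b is the y-intercept.
79 |
80 | By convention in machine learning, you write the model equation slightly differently: **y′=b+w1x1**
81 |
82 | where:
83 |
84 | - y′ is the predicted [label](https://developers.google.com/machine-learning/crash-course/framing/ml-terminology#labels) (a desired output).
85 | - b is the bias (the y-intercept). Some machine learning documentation calls it w0.
86 | - w1 is the weight of feature 1. Weight is the same concept as the "slope" m above.
87 | - x1 is a [feature](https://developers.google.com/machine-learning/crash-course/framing/ml-terminology#features) (a known input).
88 |
89 | To **infer** (predict) the temperature y′ for a new chirps-per-minute value x1, just substitute the x1 value into this model.
90 |
91 | The subscripts (for example, w1 and x1) hint that more sophisticated models can rely on multiple features. For example, a model with three features could use the following equation: **y′=b+w1x1+w2x2+w3x3**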
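
As a rough sketch of the model equation above, the same function can serve both the one-feature and the three-feature case. The function name `predict` and all weights, biases, and feature values below are made up for illustration; they are not fitted to the cricket data.

```python
def predict(features, weights, bias):
    """Linear model: y' = b + w1*x1 + w2*x2 + ... + wn*xn."""
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per feature")
    return bias + sum(w * x for w, x in zip(weights, features))


# One feature (for example chirps per minute); the numbers are hypothetical.
one_feature = predict([40.0], weights=[0.25], bias=5.0)

# Three features: y' = b + w1*x1 + w2*x2 + w3*x3, again with made-up values.
three_features = predict([1.0, 2.0, 3.0], weights=[0.5, -1.0, 2.0], bias=1.0)

print(one_feature, three_features)
```

Keeping the weights in a list means the model grows to more features without any change to the prediction code.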
92 |
93 | ## Key terms
94 |
95 | | [bias](https://developers.google.com/machine-learning/glossary#bias) | [inference](https://developers.google.com/machine-learning/glossary#inference) |
96 | | ---------------------------------------- | ---------------------------------------- |
97 | | [linear regression](https://developers.google.com/machine-learning/glossary#linear_regression) | [weight](https://developers.google.com/machine-learning/glossary#weight) |
98 |
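99 | ## Training and loss
100 |
101 | Put simply, **training** a model means learning (determining) good values for all the weights and the bias from **labeled examples**. In supervised learning, a machine learning algorithm builds a model by examining many examples and trying to find a model that minimizes loss; this process is called **empirical risk minimization**.
102 |
103 | Loss is the penalty for a bad prediction. That is, **loss** is a number indicating how bad the model's prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. **The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples.** For example, Figure 3 shows a high-loss model on the left and a low-loss model on the right. Note the following about the figure:
104 |
105 | - The red arrows represent loss.
106 | - The blue lines represent predictions.
107 |
108 | ![loss](../images/0003_loss.png)
109 |
110 |
111 |
112 | Notice that the red arrows in the left plot are much longer than their counterparts in the right plot. Clearly, the blue line in the right plot is a much better predictive model than the blue line in the left plot.
113 |
114 | You might be wondering whether you could create a mathematical function (a loss function) that aggregates the individual losses in a meaningful way.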
115 |
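116 | ### Squared loss: a popular loss function
117 |
118 | The linear regression models we examine next use a loss function called **squared loss** (also known as **L2 loss**). The squared loss for a single example is as follows: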
119 |
120 | ```
121 | = the square of the difference between the label and the prediction
122 | = (observation - prediction(x))^2
123 | = (y - y')^2
124 | ```
125 |
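126 | **Mean squared error** (**MSE**) is the average squared loss per example. To calculate MSE, sum up all the squared losses for the individual examples and then divide by the number of examples:
127 |
128 | $$MSE = \frac{1}{N}\sum_{(x,y)\in D}(y-prediction(x))^{2}$$
129 |
130 | where:
131 |
132 | - (x,y) is an example in which
133 |   - x is the set of features (for example, temperature, age, and mating success) that the model uses to make predictions.
134 |   - y is the example's label (for example, chirps per minute).
135 | - prediction(x) is a function of the weights and bias in combination with the set of features x.
136 | - D is a data set containing many labeled examples, which are (x,y) pairs.
137 | - N is the number of examples in D.
138 |
139 | Although MSE is commonly used in machine learning, it is neither the only practical loss function nor the best loss function for all circumstances.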
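
To make the MSE formula concrete, here is a minimal sketch that scores two candidate lines on the same data, in the spirit of the high-loss and low-loss models compared above. The dataset and both models are hypothetical, picked so that the arithmetic is easy to follow by hand.

```python
def mse(dataset, predict):
    """Mean squared error: average of (y - prediction(x))^2 over the N examples in D."""
    n = len(dataset)
    return sum((y - predict(x)) ** 2 for x, y in dataset) / n


def good_model(x):
    """Hypothetical low-loss model: matches the data exactly."""
    return 2.0 * x + 1.0


def poor_model(x):
    """Hypothetical high-loss model: same bias, slope too small."""
    return 1.0 * x + 1.0


# Hypothetical dataset lying exactly on y = 2x + 1.
dataset = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

good_mse = mse(dataset, good_model)   # every residual is 0
poor_mse = mse(dataset, poor_model)   # residuals 0, 1, 2, 3 -> squares 0, 1, 4, 9
print(good_mse, poor_mse)
```

Because the residuals are squared, the largest misses dominate the score of the poorer model.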
140 |
141 | ## Key terms
142 |
143 | | [empirical risk minimization](https://developers.google.com/machine-learning/glossary#ERM) | [loss](https://developers.google.com/machine-learning/glossary#loss) |
144 | | ---------------------------------------- | ---------------------------------------- |
145 | | [mean squared error](https://developers.google.com/machine-learning/glossary#MSE) | [squared loss](https://developers.google.com/machine-learning/glossary#squared_loss) |
146 | | [training](https://developers.google.com/machine-learning/glossary#training) | |
147 |
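148 | ## Exercises
149 |
150 | ### Mean squared error
151 |
152 | Consider the following two plots:
153 |
154 | ![practice](../images/0003_practice.png)
155 |
156 | **Which of the two data sets shown in the plots above has the higher mean squared error (MSE)?**
157 |
158 | A. The data set on the left
159 |
160 | B. The data set on the right
161 |
162 | Answer: the correct answer is B.
163 |
164 | In the left plot, the 6 examples on the line incur a total loss of 0. The 4 examples not on the line are not far off it (each at a distance of 1), so even squaring their offsets yields a small value. By the MSE formula, the MSE of the left data set is (4 × 1^2) / 10 = 0.4.
165 |
166 | In the right plot, the 8 examples on the line incur a total loss of 0. However, although only two points lie off the line, each of them is twice as far from the line as the outlier points in the left plot. Squared loss amplifies that difference, so an offset of two incurs a loss four times as great as an offset of one. By the MSE formula, the MSE of the right data set is (2 × 2^2) / 10 = 0.8.
167 |
168 | So the MSE of the right data set (0.8) is greater than the MSE of the left data set (0.4), and the answer is B.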
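
The answer can be checked numerically. The sketch below works directly from the offsets described in the explanation (6 points on the line and 4 at distance 1 on the left; 8 on the line and 2 at distance 2 on the right). The signs of the offsets are an assumption, since only the distances are given, and they do not affect the result.

```python
def mse_from_offsets(offsets):
    """MSE when each example's offset (prediction minus label) is already known."""
    return sum(d ** 2 for d in offsets) / len(offsets)


# Left plot: 6 points on the line, 4 points at distance 1 (signs assumed).
left_offsets = [0] * 6 + [1, -1, 1, -1]

# Right plot: 8 points on the line, 2 points at distance 2 (signs assumed).
right_offsets = [0] * 8 + [2, -2]

left_mse = mse_from_offsets(left_offsets)
right_mse = mse_from_offsets(right_offsets)
print(left_mse, right_mse)
```

Two points at distance 2 cost more than four points at distance 1, which is exactly the effect of squaring.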
169 |
170 |
171 |
172 | # Reference
173 |
174 | [Descending into ML](https://developers.google.com/machine-learning/crash-course/descending-into-ml)
--------------------------------------------------------------------------------
/Google机器学习速成课程笔记/机器学习概念/问题构建.md:
--------------------------------------------------------------------------------
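1 | # Framing
2 |
3 | ## **What is (supervised) machine learning?**
4 |
5 | Put simply, it can be defined as follows:
6 |
7 | - ML systems learn how to combine input to produce useful predictions on never-before-seen data.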
8 |
9 |
10 |
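11 | ## Key ML terminology
12 |
13 | ### Label
14 |
15 | A **label** is the thing we are predicting: the y variable in simple linear regression.
16 |
17 | For example, the label could be the future price of wheat, the kind of animal shown in a picture, or just about anything else.
18 |
19 | ### Features (feature/data)
20 |
21 | A **feature** is an input variable that describes the data: the x variable in simple linear regression.
22 |
23 | A simple machine learning project might use a single feature, while a more sophisticated one could use millions of features, specified as: **x** = {x1, x2, ... , xN}
24 |
25 | In the spam detector example, the features could include:
26 |
27 | - words in the email text
28 | - the sender's address
29 | - the time of day the email was sent
30 | - whether the email contains the phrase "one weird trick"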
31 |
32 |
33 |
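34 | ### Examples
35 |
36 | An **example** is a particular instance of data, **x**. (We put **x** in boldface to indicate that it is a vector.) We break examples into two categories:
37 |
38 | - labeled examples
39 | - unlabeled examples
40 |
41 | A **labeled example** includes both the features and the label: {features, label} : {x, y}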
42 |
43 | ```shell
44 | labeled examples: {features, label}: (x, y)
45 | ```
46 |
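47 | We use labeled examples to **train** the model. In our spam detector example, the labeled examples are individual emails that users have explicitly marked as "spam" or "not spam".
48 |
49 | For example, the following table shows 5 labeled examples from a [data set](https://developers.google.com/machine-learning/crash-course/california-housing-data-description) containing information about housing prices in California:
50 |
51 | | housingMedianAge (feature) | totalRooms (feature) | totalBedrooms (feature) | medianHouseValue (label) |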
52 | | -------------------- | -------------- | ----------------- | -------------------- |
53 | | 15 | 5612 | 1283 | 66900 |
54 | | 19 | 7650 | 1901 | 80100 |
55 | | 17 | 720 | 174 | 85700 |
56 | | 14 | 1501 | 337 | 73400 |
57 | | 20 | 1454 | 326 | 65500 |
58 |
59 | **Unlabeled examples** contain features but not the label: {features, ?} : {x, ?}
60 |
61 | ```shell
62 | unlabeled examples: {features, ?}: (x, ?)
63 | ```
64 |
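65 | - Once we have trained our model with labeled examples, we use that model to predict the label on unlabeled examples. In the spam detector example, unlabeled examples are new emails that users have not yet labeled.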
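
One possible way to represent the two kinds of examples in code is a record whose label field is optional. This is a sketch, not part of the course: the class name `Example`, the snake_case field names, and the `is_labeled` helper are assumptions. The labeled instance reuses the first row of the table above; the unlabeled instance uses made-up feature values.

```python
from typing import NamedTuple, Optional


class Example(NamedTuple):
    """One example: three features plus an optional label."""
    housing_median_age: float                    # feature
    total_rooms: float                           # feature
    total_bedrooms: float                        # feature
    median_house_value: Optional[float] = None   # label; None means "unlabeled"

    @property
    def is_labeled(self) -> bool:
        return self.median_house_value is not None


# Labeled example: first row of the table above, {features, label}.
labeled = Example(15, 5612, 1283, 66900)

# Unlabeled example: hypothetical features with no label yet, {features, ?}.
unlabeled = Example(18, 2000, 400)

print(labeled.is_labeled, unlabeled.is_labeled)
```

Training would consume only examples where `is_labeled` is true, while inference fills in the missing label for the rest.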
66 |
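67 | ### Models
68 |
69 | A **model** maps examples to predicted labels: y'
70 |
71 | A model defines the relationship between features and label. For example, a spam detection model might associate certain features strongly with "spam". Let's highlight two phases of a model's life:
72 |
73 | - **Training** means creating or **learning** the model. That is, you show the model labeled examples and let it gradually learn the relationships between features and label.
74 | - **Inference** means applying the trained model to unlabeled examples. That is, you use the trained model to make useful predictions (`y'`). For example, during inference, you can predict `medianHouseValue` for new unlabeled examples.
75 |
76 | Note that training produces the model's internal parameters. Their values are learned from the data through iterative updates.
77 |
78 |
79 |
80 | **Tip:** do not confuse features with examples. A feature can be thought of as an attribute, and an example as an element of the data set. A data set can contain many examples, and each example can have many attributes.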
81 |
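82 | ### Regression vs. classification
83 |
84 | A **regression** model predicts continuous values. For example, regression models make predictions that answer questions like the following:
85 |
86 | - What is the value of a house in California?
87 | - What is the probability that a user will click on this ad?
88 |
89 | A **classification** model predicts discrete values. For example, classification models make predictions that answer questions like the following:
90 |
91 | - Is a given email message spam or not spam?
92 | - Is this an image of a dog, a cat, or a hamster?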
93 |
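94 | ## Exercises
95 |
96 | ### Supervised learning
97 |
98 | Explore the options below.
99 |
100 | Suppose you want to develop a supervised machine learning model to predict whether a given email is "spam" or "not spam". Which of the following statements are true?
101 |
102 | **(True) Emails not marked as "spam" or "not spam" are unlabeled examples.**
103 |
104 | Because our label consists of the values "spam" and "not spam", any email not yet marked as spam or not spam is an unlabeled example.
105 |
106 | **(True) Some labels might be unreliable.**
107 |
108 | Definitely. The labels for this data set probably come from email users who mark particular email messages as spam. Since few users mark every suspicious email as spam, we may have trouble knowing whether an email is spam. Furthermore, some spammers or botnets could intentionally mislead our model by providing faulty labels.
109 |
110 | **(False) Words in the subject header will make good labels.**
111 |
112 | Words in the subject header might make excellent features, but they will not make good labels.
113 |
114 | **(False) We will use unlabeled examples to train the model.**
115 |
116 | We will use **labeled** examples to train the model. We can then run the trained model against unlabeled examples to infer whether the unlabeled email messages are spam or not spam.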
117 |
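118 | ### Features and labels
119 |
120 | Suppose an online shoe store wants to create a supervised ML model that provides personalized shoe recommendations to users. That is, the model will recommend certain shoes to Xiao Ma and different shoes to Xiao Mei. Which of the following statements are true?
121 |
122 | **(False) Shoe beauty is a useful feature.**
123 |
124 | Good features are concrete and quantifiable. Beauty is too vague a concept to serve as a useful feature. Beauty is probably a blend of certain concrete features, such as style and color. Style and color would each be better features than beauty.
125 |
126 | **(True) Shoe size is a useful feature.**
127 |
128 | Shoe size is a quantifiable signal that likely has a strong impact on whether the user will like the recommended shoes. For example, if Xiao Ma wears size 43, the model should not recommend size 39 shoes.
129 |
130 | **(False) The shoes that a user adores is a useful label.**
131 |
132 | Adoration is not an observable, quantifiable metric. The best we can do is search for observable proxy metrics for adoration.
133 |
134 | **(True) The number of user clicks on a shoe's description is a useful label.**
135 |
136 | Users probably only want to read more about shoes that they like. Clicks by users are therefore an observable, quantifiable metric that could serve as a good training label.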
137 |
138 | # Reference
139 |
140 | [Framing](https://developers.google.com/machine-learning/crash-course/framing/video-lecture)
141 |
142 | [Machine Learning Glossary](https://developers.google.com/machine-learning/crash-course/glossary#classification_model)
143 |
144 | [Framing: Check Your Understanding](https://developers.google.com/machine-learning/crash-course/framing/check-your-understanding)
145 |
146 |
147 |
148 |
--------------------------------------------------------------------------------
/Google机器学习速成课程笔记/机器学习概念/降低损失.md:
--------------------------------------------------------------------------------
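1 | # Reducing Loss
2 |
3 | To train a model, we need a good way to reduce the model's loss. An iterative approach is one widely used method for reducing loss, and it is simple and effective to use.
4 |
5 | ## **Learning objectives**
6 |
7 | - Discover how to train a model using an iterative approach.
8 |
9 | - Understand full gradient descent and some variants, including:
10 |
11 |   - mini-batch gradient descent
12 |   - stochastic gradient descent
13 |
14 | - Experiment with different learning rates.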
15 |
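16 | ## Weight initialization
17 |
18 | - For convex problems, weights can start anywhere (say, all 0s)
19 |
20 |   - Convex: think of a bowl shape (as in figure [1] below)
21 |   - Just one minimum
22 |
23 | - Foreshadowing: not true for neural nets
24 |
25 |   - Non-convex: think of an egg crate (as in figure [2] below)
26 |   - More than one minimum
27 |   - Strong dependency on initial values
28 |
29 |
30 |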
31 |
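32 | ## Reducing Loss: an iterative approach
33 |
34 | The [previous module](深入了解机器学习.md) introduced the concept of loss. In this module, you will learn how a machine learning model iteratively reduces loss.
35 |
36 | Iterative learning might remind you of "[Hot and Cold](http://www.howcast.com/videos/258352-how-to-play-hot-and-cold/)", the children's game for finding a hidden object such as a thimble. In our game, the "hidden object" is the best possible model. You start with a wild guess ("The value of w1 is 0.") and wait for the system to tell you what the loss is. Then you try another guess ("The value of w1 is 0.5.") and see what the loss is. Aha, you are getting warmer. In fact, if you play this game right, you will usually keep getting warmer. The real trick to the game is finding the best possible model as efficiently as possible.
37 |
38 | The following figure shows the iterative trial-and-error process that machine learning algorithms use to train a model:
39 |
40 | ![](../images/0004_gradient_descent.png)
41 |
42 | **Figure 1. An iterative approach to training a model.**
43 |
44 | We will use this same iterative approach throughout the Machine Learning Crash Course, detailing various complications, particularly within the stormy blue cloud region. Iterative strategies are prevalent in machine learning, primarily because they scale so well to large data sets.
45 |
46 | The "model" takes one or more features as input and returns one prediction (y') as output. To simplify, consider a model that takes one feature and returns one prediction:
47 |
48 | y′=b+w1x1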
49 |
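50 | What initial values should we set for b and w1? For linear regression problems, it turns out that the starting values are not important. We could pick random values, but we will just take the following trivial values instead:
51 |
52 | - b = 0
53 | - w1 = 0
54 |
55 | Suppose the first feature value is 10. Plugging that feature value into the prediction function yields: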
56 |
57 | ```
58 | y' = 0 + 0(10)
59 | y' = 0
60 |
61 | ```
62 |
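63 | The "Compute Loss" part of the diagram is the [loss function](深入了解机器学习.md) that the model will use. Suppose we use the squared loss function. The loss function takes in two input values:
64 |
65 | - **y': the model's prediction for features x**
66 | - **y: the correct label corresponding to features x.**
67 |
68 | Finally, we reach the "Compute parameter updates" part of the diagram. It is here that the machine learning system examines the value of the loss function and generates new values for b and w1. For now, just assume that this mysterious green box (shown in Figure 1) devises new values; the machine learning system then re-evaluates all the features against all the labels, yielding a new value for the loss function, which in turn yields new parameter values. The learning continues iterating until the algorithm discovers the model parameters with the lowest possible loss. Usually, you iterate until the overall loss stops changing or at least changes extremely slowly. When that happens, we say that the model has **converged**.
69 |
70 | **Key point:** a machine learning model is trained by starting with an initial guess for the weights and bias and iteratively adjusting those guesses until the weights and bias with the lowest possible loss are found.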
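
The loop in Figure 1 can be sketched as follows. The notes treat "Compute parameter updates" as a black box at this point, so the update rule used here (a gradient step on the mean squared loss, which the gradient descent section introduces) is an assumption standing in for that box. The function name `train`, the toy dataset, the learning rate, and the stopping tolerance are likewise hypothetical.

```python
def train(dataset, learning_rate=0.05, tolerance=1e-9, max_steps=10_000):
    """Fit y' = b + w1*x1 by repeating: predict -> compute loss -> update parameters."""
    b, w1 = 0.0, 0.0                      # trivial initial guesses
    n = len(dataset)
    previous_loss = float("inf")
    loss = previous_loss
    for _ in range(max_steps):
        # Model (prediction function): signed error of each prediction.
        errors = [(b + w1 * x) - y for x, y in dataset]
        # Compute loss: mean squared loss over the dataset.
        loss = sum(e * e for e in errors) / n
        # Converged: the loss has (almost) stopped changing.
        if abs(previous_loss - loss) < tolerance:
            break
        previous_loss = loss
        # Compute parameter updates (assumed rule: one gradient step).
        grad_b = 2.0 * sum(errors) / n
        grad_w1 = 2.0 * sum(e * x for e, (x, _) in zip(errors, dataset)) / n
        b -= learning_rate * grad_b
        w1 -= learning_rate * grad_w1
    return b, w1, loss


# Hypothetical data lying exactly on y = 2x + 1.
dataset = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
b, w1, loss = train(dataset)
print(b, w1, loss)
```

The stopping test mirrors the convergence criterion in the text: iteration ends once the overall loss changes extremely slowly.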
71 |
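72 | **Key terms**
73 |
74 | - [convergence](https://developers.google.com/machine-learning/crash-course/glossary#convergence): informally, often refers to a state reached during training in which training [**loss**](https://developers.google.com/machine-learning/crash-course/glossary#loss) and validation loss change very little or not at all with each iteration after a certain number of iterations. In other words, a model reaches convergence when additional training on the current data will not improve it. In deep learning, loss values sometimes stay constant or nearly so for many iterations before finally descending, temporarily producing a false sense of convergence.
75 |
76 | See also [**early stopping**](https://developers.google.com/machine-learning/crash-course/glossary#early_stopping).
77 |
78 | See also Boyd and Vandenberghe, [Convex Optimization](https://web.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf).
79 |
80 | - [loss](https://developers.google.com/machine-learning/glossary#loss): a measure of how far a model's [**predictions**](https://developers.google.com/machine-learning/glossary/#prediction) are from its [**label**](https://developers.google.com/machine-learning/glossary/#label). Or, to phrase it more pessimistically, a measure of how bad the model is. To determine this value, a model must define a loss function. For example, linear regression models typically use [**mean squared error**](https://developers.google.com/machine-learning/glossary/#MSE) as the loss function, while logistic regression models use [**Log Loss**](https://developers.google.com/machine-learning/glossary/#Log_Loss).
81 |
82 | - [training](https://developers.google.com/machine-learning/glossary#training): the process of determining the ideal [**parameters**](https://developers.google.com/machine-learning/glossary/#parameter) comprising a model.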
83 |
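84 | ## Reducing Loss: Gradient Descent (GD)
85 |
86 | Gradient descent: a technique to minimize [**loss**](https://developers.google.com/machine-learning/crash-course/glossary#loss) by computing the gradients of loss with respect to the model's parameters, conditioned on training data, and stepping against them. Informally, gradient descent iteratively adjusts parameters, gradually finding the best combination of [**weights**](https://developers.google.com/machine-learning/crash-course/glossary#weight) and bias to minimize loss.
87 |
88 | The iterative approach diagram ([Figure 1](https://developers.google.com/machine-learning/crash-course/reducing-loss/an-iterative-approach#ml-block-diagram)) contained a hand-wavy green box entitled "Compute parameter updates". We will now replace that hand-waving with something more substantial.
89 |
90 | Suppose we had the time and the computing resources to calculate the loss for all possible values of w1. For the kind of regression problems we have been examining, the resulting plot of loss vs. w1 will always be convex. In other words, the plot will always be bowl-shaped, like this:
91 |
92 | ![convex](https://developers.google.com/machine-learning/crash-course/images/convex.svg)
93 |
94 | **Figure 2. Regression problems yield convex loss vs. weight plots.**
95 |
96 |
97 |
98 | Convex problems have only one minimum; that is, only one place where the slope is exactly 0. That minimum is where the loss function converges.
99 |
100 | Calculating the loss function for every conceivable value of w1 over the entire data set would be an inefficient way of finding the convergence point. Let's examine a better mechanism, very popular in machine learning, called **gradient descent**.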
101 |
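102 | The first stage in gradient descent is to pick a starting value (a starting point) for w1. The starting point does not matter much; therefore, many algorithms simply set w1 to 0 or pick a random value. The following figure shows that we have picked a starting point slightly greater than 0:
103 |
104 | ![GradientDescentStartingPoint](https://developers.google.com/machine-learning/crash-course/images/GradientDescentStartingPoint.svg)
105 |
106 | **Figure 3. A starting point for gradient descent.**
107 |
108 | The gradient descent algorithm then calculates the gradient of the loss curve at the starting point. In brief, a **gradient** is a vector of partial derivatives; it tells you which way is "warmer" or "colder". Note that the gradient of loss with respect to a single weight (as in Figure 3) is equivalent to the derivative.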
109 |
111 |
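112 | Note that a gradient is a vector, so it has both of the following characteristics:
113 |
114 | - a direction
115 | - a magnitude
116 |
117 | The gradient always points in the direction of steepest increase in the loss function. **The gradient descent algorithm takes a step in the direction of the negative gradient in order to reduce loss as quickly as possible.**
118 |
119 | ![GradientDescentNegativeGradient](https://developers.google.com/machine-learning/crash-course/images/GradientDescentNegativeGradient.svg)
120 |
121 | **Figure 4. Gradient descent relies on negative gradients.**
122 |
123 | To determine the next point along the loss function curve, the gradient descent algorithm adds some fraction of the gradient's magnitude to the starting point (the weight should move opposite to the gradient, that is, along the negative gradient; at the current point the gradient is negative, so the negative-gradient direction is positive and the weight w increases), as shown in the following figure:
124 |
125 | ![GradientDescentGradientStep](https://developers.google.com/machine-learning/crash-course/images/GradientDescentGradientStep.svg)
126 |
127 | **Figure 5. A gradient step moves us to the next point on the loss curve.**
128 |
129 | Gradient descent then repeats this process, edging ever closer to the minimum.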
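
A few gradient steps on a one-dimensional loss curve illustrate the procedure. The loss function below is a hypothetical convex bowl with its minimum at w = 3, chosen only so the numbers are easy to follow; the starting point and the learning rate are also assumptions.

```python
def loss(w):
    """Hypothetical convex loss curve with its minimum at w = 3."""
    return (w - 3.0) ** 2


def gradient(w):
    """Derivative of the loss above with respect to the single weight w."""
    return 2.0 * (w - 3.0)


w = 0.5                 # starting point slightly greater than 0
learning_rate = 0.1     # fraction of the gradient applied per step
history = [w]
for _ in range(5):
    g = gradient(w)               # negative while w is left of the minimum
    w = w - learning_rate * g     # step along the negative gradient, so w increases
    history.append(w)

for point in history:
    print(round(point, 4), round(loss(point), 4))
```

Each step is smaller than the one before, because the gradient shrinks as w approaches the bottom of the bowl.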
130 |
131 | **Key terms**
132 |
133 | - [gradient descent](https://developers.google.com/machine-learning/crash-course/glossary#gradient_descent): a technique to minimize [**loss**](https://developers.google.com/machine-learning/crash-course/glossary#loss) by computing the gradients of loss with respect to the model's parameters, conditioned on training data, and stepping against them. Informally, gradient descent iteratively adjusts parameters, gradually finding the best combination of [**weights**](https://developers.google.com/machine-learning/crash-course/glossary#weight) and bias to minimize loss.
134 | - [step size](https://developers.google.com/machine-learning/crash-course/glossary#step): synonym for [**learning rate**](https://developers.google.com/machine-learning/crash-course/glossary#learning_rate).
135 |
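136 | ## Reducing Loss: Learning Rate (LR)
137 |
138 | As noted, the gradient vector has both a direction and a magnitude. Gradient descent algorithms multiply the gradient by a scalar known as the **learning rate** (also sometimes called **step size**) to determine the next point. For example, if the gradient magnitude is 2.5 and the learning rate is 0.01, then the gradient descent algorithm will pick the next point 0.025 away from the previous point.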
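
Written out as an update rule, the example reads as follows. The symbols η (learning rate) and ∇L(w) (gradient of the loss at the current weight) are notation introduced here for illustration; the notes themselves do not name them.

$$w_{next} = w - \eta \, \nabla L(w), \qquad \left| w_{next} - w \right| = \eta \left| \nabla L(w) \right| = 0.01 \times 2.5 = 0.025$$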
139 |
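140 | **Hyperparameters** are the knobs that programmers tweak in machine learning algorithms. Most machine learning programmers spend a fair amount of time tuning the learning rate. **If you pick a learning rate that is too small, learning will take too long.** Why does a small learning rate take longer? Because each gradient step is then small, the weights change only slightly per update, so convergence is slow and training takes a long time.
141 |
142 | ![LearningRateTooSmall](https://developers.google.com/machine-learning/crash-course/images/LearningRateTooSmall.svg)
143 |
144 | **Figure 6. Learning rate is too small.**
145 |
146 | Conversely, if you specify a learning rate that is too large, the next point will perpetually bounce haphazardly across the bottom of the U-shaped curve, like a quantum mechanics experiment gone horribly wrong:
147 |
148 | ![LearningRateTooLarge](https://developers.google.com/machine-learning/crash-course/images/LearningRateTooLarge.svg)
149 |
150 | **Figure 7. Learning rate is too large.**
151 |
152 | There is a [Goldilocks](https://wikipedia.org/wiki/Goldilocks_principle) learning rate for every regression problem. The Goldilocks value is related to how flat the loss function is. If you know the gradient of the loss function is small, you can safely try a larger learning rate, which compensates for the small gradient and results in a larger step size.
153 |
154 | ![LearningRateJustRight](https://developers.google.com/machine-learning/crash-course/images/LearningRateJustRight.svg)
155 |
156 | **Figure 8. Learning rate is just right.**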
157 |
158 | **Key terms**
159 |
160 | | [Hyperparameter](https://developers.google.com/machine-learning/crash-course/glossary#hyperparameter) | [Learning rate](https://developers.google.com/machine-learning/crash-course/glossary#learning_rate) |
161 | | ---------------------------------------- | ---------------------------------------- |
162 | | [Step size](https://developers.google.com/machine-learning/crash-course/glossary#step_size) | |
163 |
164 | - **Hyperparameter:** a "knob" that you tweak during successive runs of training a model. For example, [**learning rate**](https://developers.google.com/machine-learning/crash-course/glossary#learning_rate) is a hyperparameter. Contrast with [**parameter**](https://developers.google.com/machine-learning/crash-course/glossary#parameter).
165 | - **Learning rate:** a scalar used to train a model via gradient descent. During each iteration, the [**gradient descent**](https://developers.google.com/machine-learning/crash-course/glossary#gradient_descent) algorithm multiplies the learning rate by the gradient. The resulting product is called the **gradient step**. Learning rate is a key [**hyperparameter**](https://developers.google.com/machine-learning/crash-course/glossary#hyperparameter).
166 | - **Step size:** a synonym for [**learning rate**](https://developers.google.com/machine-learning/crash-course/glossary#learning_rate).
167 |
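168 | ## Optimizing the Learning Rate
169 |
170 | [(Exercise) An example](https://developers.google.com/machine-learning/crash-course/fitter/graph): experiment with different learning rates and see how they affect the number of steps required to reach the minimum of the loss curve. Try the exercises below the graph.
171 |
172 | 
173 |
174 |
175 |
176 | With a learning rate of 0.7, it takes 10 steps for the model's loss to reach the minimum, as shown below:
177 |
178 | 
179 |
180 |
181 |
182 | Increasing the learning rate, for example to 1.80, brings the loss to the minimum in only 3 steps, as shown below:
183 |
184 | 
185 |
186 | Note, however, that the learning rate must not be set too high; in practice it usually lies between 0 and 1. For example, with a learning rate of 4, the loss grows after every iteration and never converges.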
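
The effect of the learning rate on the number of steps can be reproduced on a toy problem. The sketch below assumes a hypothetical loss L(w) = w², whose gradient is 2w; it is not the curve used in the Google exercise, so the specific learning rates at which it converges or diverges differ from the values 0.7, 1.80, and 4 quoted above. The helper name `steps_to_converge` and its default arguments are likewise illustrative assumptions.

```python
def steps_to_converge(learning_rate, w_start=1.0, tolerance=1e-6, max_steps=1000):
    """Count gradient steps until |w| < tolerance on the toy loss L(w) = w**2.

    Returns None if the weight diverges or has not converged within max_steps.
    """
    w = w_start
    for step in range(1, max_steps + 1):
        # The gradient of w**2 is 2 * w.
        w = w - learning_rate * 2.0 * w
        if abs(w) < tolerance:
            return step
        if abs(w) > 1e6:
            return None  # the steps overshoot further each time: divergence
    return None


for lr in (0.01, 0.1, 0.5, 0.9, 1.5):
    print(lr, steps_to_converge(lr))
```

On this toy curve, very small learning rates need many steps, a well-chosen rate reaches the minimum almost immediately, and a rate that is too large overshoots by a growing amount on every step and never converges.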
187 |
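188 | ## Reducing Loss: Stochastic Gradient Descent (SGD)
189 |
190 | In gradient descent, a **batch** is the **total number of examples** used to calculate the gradient in a single iteration. So far, we have assumed that the batch is the entire data set. At Google scale, data sets often contain billions or even hundreds of billions of examples. Furthermore, Google data sets often contain huge numbers of features. Consequently, a batch can be enormous, and a very large batch may cause even a single iteration to take a very long time to compute.
191 |
192 | A large data set with randomly sampled examples probably contains redundant data. In fact, redundancy becomes more likely as the batch size grows. Some redundancy can be useful for smoothing out noisy gradients, but enormous batches tend not to carry much more predictive value than large batches.
193 |
194 | What if we could get the right gradient on average for much less computation? By choosing examples at random from our data set, we can estimate (albeit noisily) a big average from a much smaller sample. **Stochastic gradient descent** (**SGD**) takes this idea to the extreme: it uses only **a single example (a batch size of 1)** per iteration. Given enough iterations, SGD works, but it is very noisy. The term "stochastic" indicates that the one example comprising each batch is chosen at random.
195 |
196 | **Mini-batch stochastic gradient descent** (**mini-batch SGD**) is a compromise between full-batch iteration and SGD. A mini-batch typically contains between 10 and 1,000 examples, chosen at random. Mini-batch SGD reduces the amount of noise in SGD but is still more efficient than full-batch iteration.
197 |
198 | To simplify the explanation, we focused on gradient descent for a single feature. Rest assured that gradient descent also works on feature sets that contain multiple features.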
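
The three variants differ only in how many examples feed each gradient estimate, which a short NumPy sketch can make concrete. Everything specific here is an assumption for illustration: the synthetic data (y ≈ 2x + 1 plus noise), the helper name `minibatch_sgd`, the learning rate, and the step count are not taken from the course.

```python
import numpy as np


def minibatch_sgd(features, labels, batch_size, learning_rate=0.05,
                  num_steps=2000, seed=0):
    """Fit y = weight * x + bias with squared loss, one random batch per step."""
    rng = np.random.default_rng(seed)
    weight, bias = 0.0, 0.0
    for _ in range(num_steps):
        # Draw a fresh random batch on every iteration.
        idx = rng.choice(len(features), size=batch_size, replace=False)
        x, y = features[idx], labels[idx]
        error = (weight * x + bias) - y
        # Gradients of the mean squared error over the batch.
        grad_w = 2.0 * np.mean(error * x)
        grad_b = 2.0 * np.mean(error)
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias


# Hypothetical synthetic data: true weight 2.0, true bias 1.0, small noise.
data_rng = np.random.default_rng(42)
features = data_rng.uniform(-1.0, 1.0, size=1000)
labels = 2.0 * features + 1.0 + data_rng.normal(0.0, 0.1, size=1000)

w_sgd, b_sgd = minibatch_sgd(features, labels, batch_size=1)                # SGD
w_mini, b_mini = minibatch_sgd(features, labels, batch_size=32)             # mini-batch
w_full, b_full = minibatch_sgd(features, labels, batch_size=len(features))  # full batch
print(w_sgd, w_mini, w_full)
```

All three runs should land near the true weight and bias. The batch-size-1 run is the noisiest, while the full-batch run touches every example on every step and therefore does the most computation per iteration.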
199 |
200 | **Key terms**
201 |
202 | | [Batch](https://developers.google.com/machine-learning/crash-course/glossary#batch) | [Batch size](https://developers.google.com/machine-learning/crash-course/glossary#batch_size) |
203 | | ---------------------------------------- | ---------------------------------------- |
204 | | [Mini-batch](https://developers.google.com/machine-learning/crash-course/glossary#mini-batch) | [Stochastic gradient descent](https://developers.google.com/machine-learning/crash-course/glossary#stochastic_gradient_descent_(SGD)) |
205 |
206 | - **Batch:** the set of examples used in one [**iteration**](https://developers.google.com/machine-learning/crash-course/glossary#iteration) (that is, one [**gradient**](https://developers.google.com/machine-learning/crash-course/glossary#gradient) update) of [**model training**](https://developers.google.com/machine-learning/crash-course/glossary#model_training). See also [**batch size**](https://developers.google.com/machine-learning/crash-course/glossary#batch_size).
207 | - **Batch size:** the number of examples in a [**batch**](https://developers.google.com/machine-learning/crash-course/glossary#batch). For example, the batch size of [**SGD**](https://developers.google.com/machine-learning/crash-course/glossary#SGD) is 1, while the batch size of a [**mini-batch**](https://developers.google.com/machine-learning/crash-course/glossary#mini-batch) is usually between 10 and 1000. Batch size is usually fixed during training and inference; however, TensorFlow does permit dynamic batch sizes.
208 | - **Mini-batch:** a small, randomly selected subset of the entire batch of [**examples**](https://developers.google.com/machine-learning/crash-course/glossary#example) run together in a single iteration of training or inference. The [**batch size**](https://developers.google.com/machine-learning/crash-course/glossary#batch_size) of a mini-batch is usually between 10 and 1000. Calculating the loss on a mini-batch is much more efficient than calculating it on the full training data.
209 | - **Stochastic gradient descent (SGD):** a [**gradient descent**](https://developers.google.com/machine-learning/crash-course/glossary#gradient_descent) algorithm in which the batch size is 1. In other words, SGD relies on a single example chosen uniformly at random from the data set to calculate an estimate of the gradient at each step.
210 |
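211 | ## Reducing Loss: Playground Exercise
212 |
213 | ### Learning Rate and Convergence
214 |
215 | This is the first of several Playground exercises. [Playground](http://playground.tensorflow.org/) is a program developed especially for this course to teach machine learning principles.
216 |
217 | Each Playground exercise generates a data set. The label for this data set has two possible values. You could think of those two values as spam vs. not spam, or as healthy trees vs. sick trees. The goal of most exercises is to tweak various hyperparameters to build a model that successfully classifies (separates or distinguishes) one label value from the other. Note that most data sets contain a certain amount of noise, which makes it impossible to classify every example successfully.
218 |
219 | Each Playground exercise displays a visualization of the current state of the model. For example, here is a visualization of a model:
220 |
221 | 
222 |
223 | Note the following about the model visualization:
224 |
225 | - Each blue dot represents one example of one class of data (for example, a healthy tree).
226 | - Each orange dot represents one example of the other class of data (for example, a sick tree).
227 | - The background color represents the model's prediction of where examples of that color should be found. A blue background around a blue dot means that the model predicts that example correctly. Conversely, an orange background around a blue dot means that the model predicts that example incorrectly.
228 | - The background blues and oranges are scaled. For example, the left side of the visualization is solid blue but gradually fades to white in the center. You can think of the color strength as the model's confidence in its guess: solid blue means that the model is very confident, and light blue means that it is less confident. (The model visualization shown in the figure is doing a poor job of prediction.)
229 |
230 | You can judge the model's progress from the visualization. ("Excellent: most of the blue dots have a blue background" or "Oh no! The blue dots have an orange background.") Beyond the colors, Playground also displays the model's current loss numerically. ("Oh no! Loss is going up instead of down.")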
231 |
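232 | The interface for this exercise provides three buttons:
233 |
234 | | Icon | Name | What it does |
235 | | ---------------------------------------- | ---------------- | ---------------------------------------- |
236 | |  | Reset | Resets `Iterations` to 0. Resets any weights that the model has already learned. |
237 | |  | Step | Advances one iteration. With each iteration, the model changes, sometimes subtly and sometimes dramatically. |
238 | |  | Regenerate | Generates a new data set. Does not reset `Iterations`. |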
239 |
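240 | In this first Playground exercise, you will experiment with the learning rate by performing the following two tasks.
241 |
242 | **Task 1:** Notice the **Learning rate** menu at the top right of Playground. Set the learning rate to 3, which is very high. Observe how that high learning rate affects your model by clicking the "Step" button 10 or 20 times. After each early iteration, notice how the model visualization changes dramatically. You might even see some instability after the model appears to have converged. Also notice the lines running from x1 and x2 to the model visualization. The weights of these lines indicate the weights of the corresponding features in the model. That is, the thicker the line, the higher the weight.
243 |
244 | - With a learning rate of 3, the loss after 10 steps is 0.257, as shown below:
245 |
246 | 
247 |
248 |
249 |
250 | - With a learning rate of 3, the loss after 20 steps is 0.255, as shown below:
251 |
252 | 
253 |
254 |
255 |
256 | The loss curve shows that the loss falls steadily over the first 2 or 3 iterations but rises again over the next few iterations. The losses at iteration 10 and iteration 20 are nearly the same.
257 |
258 |
259 |
260 | **Task 2:** Do the following:
261 |
262 | 1. Press the **Reset** button.
263 | 2. Lower the `Learning rate`.
264 | 3. Press the "Step" button several times.
265 |
266 | How did the lower learning rate affect convergence? Examine both the number of steps needed for the model to converge and how smoothly and steadily it converges. Experiment with even lower learning rates. Can you find a learning rate that is too slow to be useful? (You will find a discussion just below the exercise.)
267 |
268 | Answer: with a lower learning rate, the loss decreases slowly and steadily.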
269 |
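270 | ## Reducing Loss: Check Your Understanding
271 |
272 | ### Check Your Understanding: Batch Size
273 |
274 | Question: When performing gradient descent on a large data set, which of the following batch sizes is likely to be more efficient?
275 |
276 | A. A small batch or even a batch of one example (SGD).
277 |
278 | B. The full batch.
279 |
280 | Answer: The correct answer is A, a small batch or even a batch of one example (SGD).
281 |
282 | Surprisingly, performing gradient descent on a small batch or even a batch of one example is usually more efficient than using the full batch. After all, computing the gradient of one example is far cheaper than computing the gradient of millions of examples. To ensure a good representative sample, the algorithm draws another random small batch (or batch of one) on every iteration.
283 |
284 | **Key terms**
285 |
286 | [Batch](https://developers.google.com/machine-learning/crash-course/glossary#batch)
287 |
288 | [Batch size](https://developers.google.com/machine-learning/crash-course/glossary#batch_size)
289 |
290 | [Mini-batch](https://developers.google.com/machine-learning/crash-course/glossary#mini-batch)
291 |
292 | [Stochastic gradient descent](https://developers.google.com/machine-learning/crash-course/glossary#stochastic_gradient_descent_(SGD))
293 |
294 | # Reference
295 |
296 | [Reducing Loss](https://developers.google.com/machine-learning/crash-course/reducing-loss/video-lecture)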
297 |
298 |
299 |
300 |
301 |
302 |
--------------------------------------------------------------------------------
/Google机器学习速成课程笔记/简介/intro_to_pandas.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "nbformat": 4,
3 | "nbformat_minor": 0,
4 | "metadata": {
5 | "colab": {
6 | "name": "intro_to_pandas.ipynb",
7 | "version": "0.3.2",
8 | "views": {},
9 | "default_view": {},
10 | "provenance": [
11 | {
12 | "file_id": "/v2/external/notebooks/mlcc/intro_to_pandas.ipynb",
13 | "timestamp": 1528022507361
14 | }
15 | ],
16 | "collapsed_sections": []
17 | }
18 | },
19 | "cells": [
20 | {
21 | "metadata": {
22 | "id": "JndnmDMp66FL",
23 | "colab_type": "text"
24 | },
25 | "cell_type": "markdown",
26 | "source": [
27 | "#### Copyright 2017 Google LLC."
28 | ]
29 | },
30 | {
31 | "metadata": {
32 | "id": "hMqWDc_m6rUC",
33 | "colab_type": "code",
34 | "colab": {
35 | "autoexec": {
36 | "startup": false,
37 | "wait_interval": 0
38 | }
39 | },
40 | "cellView": "both"
41 | },
42 | "cell_type": "code",
43 | "source": [
44 | "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
45 | "# you may not use this file except in compliance with the License.\n",
46 | "# You may obtain a copy of the License at\n",
47 | "#\n",
48 | "# https://www.apache.org/licenses/LICENSE-2.0\n",
49 | "#\n",
50 | "# Unless required by applicable law or agreed to in writing, software\n",
51 | "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
52 | "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
53 | "# See the License for the specific language governing permissions and\n",
54 | "# limitations under the License."
55 | ],
56 | "execution_count": 0,
57 | "outputs": []
58 | },
59 | {
60 | "metadata": {
61 | "id": "rHLcriKWLRe4",
62 | "colab_type": "text"
63 | },
64 | "cell_type": "markdown",
65 | "source": [
66 | " # Intro to pandas"
67 | ]
68 | },
69 | {
70 | "metadata": {
71 | "id": "QvJBqX8_Bctk",
72 | "colab_type": "text"
73 | },
74 | "cell_type": "markdown",
75 | "source": [
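76 | "**Learning objectives:**\n",
77 | " * Gain an introduction to the `DataFrame` and `Series` data structures of the *pandas* library\n",
78 | " * Access and manipulate data within a `DataFrame` and `Series`\n",
79 | " * Import CSV data into a *pandas* `DataFrame`\n",
80 | " * Reindex a `DataFrame` to shuffle data"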
81 | ]
82 | },
83 | {
84 | "metadata": {
85 | "id": "TIFJ83ZTBctl",
86 | "colab_type": "text"
87 | },
88 | "cell_type": "markdown",
89 | "source": [
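90 | " [*pandas*](http://pandas.pydata.org/) is a column-oriented data analysis API. It is a powerful tool for handling and analyzing input data, and many machine learning frameworks support *pandas* data structures as inputs.\n",
91 | "Although a comprehensive introduction to the *pandas* API would span many pages, the core concepts are fairly simple, and we present them below. For a more complete reference, see the [*pandas* docs site](http://pandas.pydata.org/pandas-docs/stable/index.html), which contains extensive documentation and many tutorials."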
92 | ]
93 | },
94 | {
95 | "metadata": {
96 | "id": "s_JOISVgmn9v",
97 | "colab_type": "text"
98 | },
99 | "cell_type": "markdown",
100 | "source": [
101 | " ## Basic Concepts\n",
102 | "\n",
103 | "The following lines import the *pandas* API and print the API version:"
104 | ]
105 | },
106 | {
107 | "metadata": {
108 | "id": "aSRYu62xUi3g",
109 | "colab_type": "code",
110 | "colab": {
111 | "autoexec": {
112 | "startup": false,
113 | "wait_interval": 0
114 | },
115 | "base_uri": "https://localhost:8080/",
116 | "height": 34
117 | },
118 | "outputId": "f9b2b479-12f9-4392-e49a-36d2d12ab7de",
119 | "executionInfo": {
120 | "status": "ok",
121 | "timestamp": 1528027858792,
122 | "user_tz": -480,
123 | "elapsed": 1550,
124 | "user": {
125 | "displayName": "陈方杰",
126 | "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
127 | "userId": "115418685205938795893"
128 | }
129 | }
130 | },
131 | "cell_type": "code",
132 | "source": [
133 | "import pandas as pd\n",
134 | "pd.__version__"
135 | ],
136 | "execution_count": 16,
137 | "outputs": [
138 | {
139 | "output_type": "execute_result",
140 | "data": {
141 | "text/plain": [
142 | "u'0.22.0'"
143 | ]
144 | },
145 | "metadata": {
146 | "tags": []
147 | },
148 | "execution_count": 16
149 | }
150 | ]
151 | },
152 | {
153 | "metadata": {
154 | "id": "daQreKXIUslr",
155 | "colab_type": "text"
156 | },
157 | "cell_type": "markdown",
158 | "source": [
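159 | " The primary data structures in *pandas* are implemented as two classes:\n",
160 | "\n",
161 | " * **`DataFrame`**, which you can imagine as a relational data table, with rows and named columns.\n",
162 | " * **`Series`**, which is a single column. A `DataFrame` contains one or more `Series`, and each `Series` has a name.\n",
163 | "\n",
164 | "The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in [Spark](https://spark.apache.org/) and [R](https://www.r-project.org/about.html)."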
165 | ]
166 | },
167 | {
168 | "metadata": {
169 | "id": "fjnAk1xcU0yc",
170 | "colab_type": "text"
171 | },
172 | "cell_type": "markdown",
173 | "source": [
174 | " One way to create a `Series` is to construct a `Series` object. For example:"
175 | ]
176 | },
177 | {
178 | "metadata": {
179 | "id": "DFZ42Uq7UFDj",
180 | "colab_type": "code",
181 | "colab": {
182 | "autoexec": {
183 | "startup": false,
184 | "wait_interval": 0
185 | },
186 | "base_uri": "https://localhost:8080/",
187 | "height": 84
188 | },
189 | "outputId": "d89e5330-3a2d-4e77-f0d9-bb9a895b863a",
190 | "executionInfo": {
191 | "status": "ok",
192 | "timestamp": 1528027897144,
193 | "user_tz": -480,
194 | "elapsed": 1671,
195 | "user": {
196 | "displayName": "陈方杰",
197 | "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
198 | "userId": "115418685205938795893"
199 | }
200 | }
201 | },
202 | "cell_type": "code",
203 | "source": [
204 | "pd.Series(['San Francisco', 'San Jose', 'Sacramento'])"
205 | ],
206 | "execution_count": 17,
207 | "outputs": [
208 | {
209 | "output_type": "execute_result",
210 | "data": {
211 | "text/plain": [
212 | "0 San Francisco\n",
213 | "1 San Jose\n",
214 | "2 Sacramento\n",
215 | "dtype: object"
216 | ]
217 | },
218 | "metadata": {
219 | "tags": []
220 | },
221 | "execution_count": 17
222 | }
223 | ]
224 | },
225 | {
226 | "metadata": {
227 | "id": "U5ouUp1cU6pC",
228 | "colab_type": "text"
229 | },
230 | "cell_type": "markdown",
231 | "source": [
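232 | " You can create a `DataFrame` object by passing a `dict` that maps `string` column names to their respective `Series`. If the `Series` differ in length, missing values are filled with special [NA/NaN](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) values. For example:"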
233 | ]
234 | },
235 | {
236 | "metadata": {
237 | "id": "avgr6GfiUh8t",
238 | "colab_type": "code",
239 | "colab": {
240 | "autoexec": {
241 | "startup": false,
242 | "wait_interval": 0
243 | },
244 | "base_uri": "https://localhost:8080/",
245 | "height": 136
246 | },
247 | "outputId": "e4d1cce6-7042-426b-fc93-9943dd83e8bd",
248 | "executionInfo": {
249 | "status": "ok",
250 | "timestamp": 1528027901578,
251 | "user_tz": -480,
252 | "elapsed": 1361,
253 | "user": {
254 | "displayName": "陈方杰",
255 | "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
256 | "userId": "115418685205938795893"
257 | }
258 | }
259 | },
260 | "cell_type": "code",
261 | "source": [
262 | "city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])\n",
263 | "population = pd.Series([852469, 1015785, 485199])\n",
264 | "\n",
265 | "pd.DataFrame({ 'City name': city_names, 'Population': population })"
266 | ],
267 | "execution_count": 18,
268 | "outputs": [
269 | {
270 | "output_type": "execute_result",
271 | "data": {
315 | "text/plain": [
316 | " City name Population\n",
317 | "0 San Francisco 852469\n",
318 | "1 San Jose 1015785\n",
319 | "2 Sacramento 485199"
320 | ]
321 | },
322 | "metadata": {
323 | "tags": []
324 | },
325 | "execution_count": 18
326 | }
327 | ]
328 | },
329 | {
330 | "metadata": {
331 | "id": "oa5wfZT7VHJl",
332 | "colab_type": "text"
333 | },
334 | "cell_type": "markdown",
335 | "source": [
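336 | " Most of the time, however, you load an entire file into a `DataFrame`. The following example loads a file containing California housing data. Run the following cell to load the data and create feature definitions:"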
337 | ]
338 | },
339 | {
340 | "metadata": {
341 | "id": "av6RYOraVG1V",
342 | "colab_type": "code",
343 | "colab": {
344 | "autoexec": {
345 | "startup": false,
346 | "wait_interval": 0
347 | },
348 | "base_uri": "https://localhost:8080/",
349 | "height": 284
350 | },
351 | "outputId": "836dcf5a-539c-40ec-a821-93429e7932df",
352 | "executionInfo": {
353 | "status": "ok",
354 | "timestamp": 1528027923363,
355 | "user_tz": -480,
356 | "elapsed": 1415,
357 | "user": {
358 | "displayName": "陈方杰",
359 | "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
360 | "userId": "115418685205938795893"
361 | }
362 | }
363 | },
364 | "cell_type": "code",
365 | "source": [
366 | "california_housing_dataframe = pd.read_csv(\"https://storage.googleapis.com/mledu-datasets/california_housing_train.csv\", sep=\",\")\n",
367 | "california_housing_dataframe.describe()"
368 | ],
369 | "execution_count": 19,
370 | "outputs": [
371 | {
372 | "output_type": "execute_result",
373 | "data": {
505 | "text/plain": [
506 | " longitude latitude housing_median_age total_rooms \\\n",
507 | "count 17000.000000 17000.000000 17000.000000 17000.000000 \n",
508 | "mean -119.562108 35.625225 28.589353 2643.664412 \n",
509 | "std 2.005166 2.137340 12.586937 2179.947071 \n",
510 | "min -124.350000 32.540000 1.000000 2.000000 \n",
511 | "25% -121.790000 33.930000 18.000000 1462.000000 \n",
512 | "50% -118.490000 34.250000 29.000000 2127.000000 \n",
513 | "75% -118.000000 37.720000 37.000000 3151.250000 \n",
514 | "max -114.310000 41.950000 52.000000 37937.000000 \n",
515 | "\n",
516 | " total_bedrooms population households median_income \\\n",
517 | "count 17000.000000 17000.000000 17000.000000 17000.000000 \n",
518 | "mean 539.410824 1429.573941 501.221941 3.883578 \n",
519 | "std 421.499452 1147.852959 384.520841 1.908157 \n",
520 | "min 1.000000 3.000000 1.000000 0.499900 \n",
521 | "25% 297.000000 790.000000 282.000000 2.566375 \n",
522 | "50% 434.000000 1167.000000 409.000000 3.544600 \n",
523 | "75% 648.250000 1721.000000 605.250000 4.767000 \n",
524 | "max 6445.000000 35682.000000 6082.000000 15.000100 \n",
525 | "\n",
526 | " median_house_value \n",
527 | "count 17000.000000 \n",
528 | "mean 207300.912353 \n",
529 | "std 115983.764387 \n",
530 | "min 14999.000000 \n",
531 | "25% 119400.000000 \n",
532 | "50% 180400.000000 \n",
533 | "75% 265000.000000 \n",
534 | "max 500001.000000 "
535 | ]
536 | },
537 | "metadata": {
538 | "tags": []
539 | },
540 | "execution_count": 19
541 | }
542 | ]
543 | },
544 | {
545 | "metadata": {
546 | "id": "WrkBjfz5kEQu",
547 | "colab_type": "text"
548 | },
549 | "cell_type": "markdown",
550 | "source": [
551 | " The example above used `DataFrame.describe` to show interesting statistics about a `DataFrame`. Another useful function is `DataFrame.head`, which displays the first few records of a `DataFrame`:"
552 | ]
553 | },
554 | {
555 | "metadata": {
556 | "id": "s3ND3bgOkB5k",
557 | "colab_type": "code",
558 | "colab": {
559 | "autoexec": {
560 | "startup": false,
561 | "wait_interval": 0
562 | },
563 | "base_uri": "https://localhost:8080/",
564 | "height": 195
565 | },
566 | "outputId": "65520b8f-c9cd-43dc-fd20-5c6300a950d9",
567 | "executionInfo": {
568 | "status": "ok",
569 | "timestamp": 1528027926502,
570 | "user_tz": -480,
571 | "elapsed": 1166,
572 | "user": {
573 | "displayName": "陈方杰",
574 | "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
575 | "userId": "115418685205938795893"
576 | }
577 | }
578 | },
579 | "cell_type": "code",
580 | "source": [
581 | "california_housing_dataframe.head()"
582 | ],
583 | "execution_count": 20,
584 | "outputs": [
585 | {
586 | "output_type": "execute_result",
587 | "data": {
683 | "text/plain": [
684 | " longitude latitude housing_median_age total_rooms total_bedrooms \\\n",
685 | "0 -114.31 34.19 15.0 5612.0 1283.0 \n",
686 | "1 -114.47 34.40 19.0 7650.0 1901.0 \n",
687 | "2 -114.56 33.69 17.0 720.0 174.0 \n",
688 | "3 -114.57 33.64 14.0 1501.0 337.0 \n",
689 | "4 -114.57 33.57 20.0 1454.0 326.0 \n",
690 | "\n",
691 | " population households median_income median_house_value \n",
692 | "0 1015.0 472.0 1.4936 66900.0 \n",
693 | "1 1129.0 463.0 1.8200 80100.0 \n",
694 | "2 333.0 117.0 1.6509 85700.0 \n",
695 | "3 515.0 226.0 3.1917 73400.0 \n",
696 | "4 624.0 262.0 1.9250 65500.0 "
697 | ]
698 | },
699 | "metadata": {
700 | "tags": []
701 | },
702 | "execution_count": 20
703 | }
704 | ]
705 | },
706 | {
707 | "metadata": {
708 | "id": "w9-Es5Y6laGd",
709 | "colab_type": "text"
710 | },
711 | "cell_type": "markdown",
712 | "source": [
713 | " Another powerful feature of *pandas* is graphing. For example, `DataFrame.hist` lets you quickly study the distribution of values in a column:"
714 | ]
715 | },
716 | {
717 | "metadata": {
718 | "id": "nqndFVXVlbPN",
719 | "colab_type": "code",
720 | "colab": {
721 | "autoexec": {
722 | "startup": false,
723 | "wait_interval": 0
724 | },
725 | "base_uri": "https://localhost:8080/",
726 | "height": 395
727 | },
728 | "outputId": "b7cc684b-908e-490e-fa44-46e50a0e8ab9",
729 | "executionInfo": {
730 | "status": "ok",
731 | "timestamp": 1528027930882,
732 | "user_tz": -480,
733 | "elapsed": 1385,
734 | "user": {
735 | "displayName": "陈方杰",
736 | "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
737 | "userId": "115418685205938795893"
738 | }
739 | }
740 | },
741 | "cell_type": "code",
742 | "source": [
743 | "california_housing_dataframe.hist('housing_median_age')"
744 | ],
745 | "execution_count": 21,
746 | "outputs": [
747 | {
748 | "output_type": "execute_result",
749 | "data": {
750 | "text/plain": [
751 | "array([[]],\n",
752 | " dtype=object)"
753 | ]
754 | },
755 | "metadata": {
756 | "tags": []
757 | },
758 | "execution_count": 21
759 | },
760 | {
761 | "output_type": "display_data",
762 | "data": {
763 | "image/png": "iVBORw0KGgoAAAANSUhEUgAAAeoAAAFZCAYAAABXM2zhAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAIABJREFUeJzt3X1UlHX+//HXMDAH0UEEGTfLarf0\naEmaa5l4U0Iokp7IVRPWdU3q6Iqtlql499WTlajRmmZZmunRU7GNtofcAjJxyyRanT0uuu0p2VOr\neTejKCqgSPP7o9Os/FRguP1Az8dfcTEz1+d6H+3pdQ1zYfF6vV4BAAAjBTT3AgAAwPURagAADEao\nAQAwGKEGAMBghBoAAIMRagAADEaogVo6cuSI7rjjjkbdxz//+U+lpKQ06j4a0h133KEjR47o448/\n1ty5c5t7OUCrZOFz1EDtHDlyREOHDtW//vWv5l6KMe644w7l5ubqpptuau6lAK0WZ9SAn5xOp0aO\nHKn7779f27dv1w8//KA//elPio+PV3x8vNLS0lRaWipJiomJ0d69e33P/enry5cva/78+Ro2bJji\n4uI0bdo0nT9/XgUFBYqLi5MkrV69Ws8++6xSU1MVGxur0aNH6+TJk5KkgwcPaujQoRo6dKheeeUV\njRw5UgUFBdWue/Xq1Vq0aJEmT56sgQMHatasWcrLy9OoUaM0cOBA5eXlSZIuXbqk5557TsOGDVNM\nTIzWrl3re42//e1viouL0/Dhw7V+/Xrf9m3btmnixImSJI/Ho5SUFMXHxysmJkZvvfVWleN/9913\nNXr0aA0cOFDp6ek1zrusrEwzZszwrWfZsmW+71U3hx07dmjkyJGKjY3VpEmTdPr06Rr3BZiIUAN+\n+OGHH1RRUaEPPvhAc+fO1cqVK/XRRx/p008/1bZt2/TXv/5VJSUl2rhxY7Wvs3v3bh05ckTZ2dnK\nzc3V7bffrn/84x9XPS47O1vz5s3Tjh07FBERoa1bt0qSFi5cqIkTJyo3N1ft2rXTt99+W6v179q1\nSy+88II++OADZWdn+9Y9ZcoUrVu3TpK0bt06HTp0SB988IG2b9+unJwc5eXlqbKyUvPnz9eiRYv0\n0UcfKSAgQJWVlVft47XXXtNNN92k7Oxsbdq0SRkZGTp27Jjv+3//+9+VmZmprVu3asuWLTp+/Hi1\na37nnXd04cIFZWdn6/3339e2bdt8//i53hwOHz6s2bNnKyMjQ5988on69eunxYsX12pGgGkINeAH\nr9erxMREST9e9j1+/Lh27dqlxMREhYSEyGq1atSoUfr888+rfZ3w8HAVFRXp448/9p0xDho06KrH\n9e3bVzfeeKMsFot69OihY8eOqby8XAcPHtSIESMkSb/97W9V23ew7r77bkVERKhDhw6KjIzU4MGD\nJUndunXzna3n5eUpOTlZNptNISEhevjhh5Wbm6tvv/1Wly5d0sCBAyVJjzzyyDX3sWDBAi1cuFCS\n1KVLF0VGRurIkSO+748cOVJWq1WdOnVSRERElYhfy6RJk/Tqq6/KYrGoffv26tq1q44cOVLtHD79\n9FPde++96tatmyRp3Lhx2rlz5zX/YQGYLrC5FwC0JFarVW3atJEkBQQE6IcfftDp06fVvn1732Pa\nt2+vU6dOVfs6d911lxYsWKDNmzdrzpw5iomJ0aJFi656nN1ur7LvyspKnT17VhaLRaGhoZKkoKAg\nRURE1Gr9bdu2rfJ6ISEhVY5Fks6dO6elS5fqpZdekvTjpfC77rpLZ8+eVbt27aoc57UUFhb6zqID\nAgLkdrt9ry2pymv8dEzV+fbbb5Wenq7//Oc/CggI0PHjxzVq1Khq53Du3Dnt3btX8fHxVfZ75syZ\nWs8KMAWhBuqpY8eOOnPmjO/rM2fOqG
PHjpKqBlCSzp496/vvn97TPnPmjObNm6c333xT0dHRNe6v\nXbt28nq9KisrU5s2bXT58uUGff/V4XBo0qRJGjJkSJXtRUVFOn/+vO/r6+1z1qxZ+v3vf6+kpCRZ\nLJZrXinwx7PPPqs777xTa9askdVq1bhx4yRVPweHw6Ho6GitWrWqXvsGTMClb6CeHnjgAWVlZams\nrEyXL1+W0+nU/fffL0mKjIzUv//9b0nShx9+qIsXL0qStm7dqjVr1kiSwsLC9Ktf/arW+2vbtq1u\nu+02ffTRR5KkzMxMWSyWBjue2NhYvffee6qsrJTX69Wrr76qTz/9VDfffLOsVqvvh7W2bdt2zf2e\nOnVKPXv2lMVi0fvvv6+ysjLfD9fVxalTp9SjRw9ZrVZ9/vnn+u6771RaWlrtHAYOHKi9e/fq8OHD\nkn782Ntzzz1X5zUAzYlQA/UUHx+vwYMHa9SoURoxYoR+8YtfaMKECZKkqVOnauPGjRoxYoSKiop0\n++23S/oxhj/9xPLw4cN16NAhPfbYY7Xe56JFi7R27Vo99NBDKi0tVadOnRos1snJyercubMeeugh\nxcfHq6ioSL/+9a8VFBSkJUuWaN68eRo+fLgsFovv0vmVpk+frtTUVI0cOVKlpaV69NFHtXDhQv33\nv/+t03r+8Ic/aNmyZRoxYoS+/PJLTZs2TatXr9a+ffuuOweHw6ElS5YoNTVVw4cP17PPPquEhIT6\njgZoFnyOGmihvF6vL8733XefNm7cqO7duzfzqpoec0Brxxk10AL98Y9/9H2cKj8/X16vV7feemvz\nLqoZMAf8HHBGDbRARUVFmjt3rs6ePaugoCDNmjVLN910k1JTU6/5+Ntuu833nrhpioqK6rzua83h\np58PAFoLQg0AgMG49A0AgMEINQAABjPyhidu9zm/Ht+hQ4iKi+v+Oc2fO+ZXd8yufphf3TG7+jFt\nfpGR9ut+r1WcUQcGWpt7CS0a86s7Zlc/zK/umF39tKT5tYpQAwDQWhFqAAAMRqgBADBYjT9MVlZW\nprS0NJ06dUoXL17U1KlT1b17d82ePVuVlZWKjIzUihUrZLPZlJWVpU2bNikgIEBjx47VmDFjVFFR\nobS0NB09elRWq1VLly5Vly5dmuLYAABo8Wo8o87Ly1PPnj21ZcsWrVy5Uunp6Vq1apWSk5P19ttv\n65ZbbpHT6VRpaanWrFmjjRs3avPmzdq0aZPOnDmj7du3KzQ0VO+8846mTJmijIyMpjguAABahRpD\nnZCQoCeeeEKSdOzYMXXq1EkFBQWKjY2VJA0ZMkT5+fnav3+/oqKiZLfbFRwcrD59+sjlcik/P19x\ncXGSpOjoaLlcrkY8HAAAWpdaf4563LhxOn78uNauXavHHntMNptNkhQRESG32y2Px6Pw8HDf48PD\nw6/aHhAQIIvFokuXLvmeDwAArq/WoX733Xf11VdfadasWbry9uDXu1W4v9uv1KFDiN+fcavuw+Ko\nGfOrO2ZXP8yv7phd/bSU+dUY6gMHDigiIkI33HCDevToocrKSrVt21bl5eUKDg7WiRMn5HA45HA4\n5PF4fM87efKkevfuLYfDIbfbre7du6uiokJer7fGs2l/7xYTGWn3+25m+B/mV3fMrn6YX90xu/ox\nbX71ujPZ3r17tWHDBkmSx+NRaWmpoqOjlZOTI0nKzc3VoEGD1KtXLxUWFqqkpEQXLlyQy+VS3759\nNWDAAGVnZ0v68QfT+vXr1xDHBADAz0KNZ9Tjxo3T/PnzlZycrPLycv3f//2fevbsqTlz5igzM1Od\nO3dWYmKigoKCNHPmTKWkpMhisSg1NVV2u10JCQnas2ePkpKSZLPZlJ6e3hTHBQBAq2Dk76P293KE\naZcwWhrmV3fMrn6YX90xu/oxbX7VXfo28rdnAcC1TErf2dxLqNGGtJjmXgJaGW4hCgCAwQg1AAAG\nI9
QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCA\nwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMA\nYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QA\nABiMUAMAYDBCDQCAwQg1AAAGC6zNg5YvX659+/bp8uXLmjx5snbu3KmDBw8qLCxMkpSSkqIHHnhA\nWVlZ2rRpkwICAjR27FiNGTNGFRUVSktL09GjR2W1WrV06VJ16dKlUQ8KAIDWosZQf/HFF/rmm2+U\nmZmp4uJiPfLII7rvvvv09NNPa8iQIb7HlZaWas2aNXI6nQoKCtLo0aMVFxenvLw8hYaGKiMjQ7t3\n71ZGRoZWrlzZqAcFAEBrUeOl73vuuUcvv/yyJCk0NFRlZWWqrKy86nH79+9XVFSU7Ha7goOD1adP\nH7lcLuXn5ysuLk6SFB0dLZfL1cCHAABA61VjqK1Wq0JCQiRJTqdTgwcPltVq1ZYtWzRhwgQ99dRT\nOn36tDwej8LDw33PCw8Pl9vtrrI9ICBAFotFly5daqTDAQCgdanVe9SStGPHDjmdTm3YsEEHDhxQ\nWFiYevTooTfeeEOvvPKK7r777iqP93q913yd622/UocOIQoMtNZ2aZKkyEi7X49HVcyv7phd/bS2\n+TXl8bS22TW1ljK/WoX6s88+09q1a7V+/XrZ7Xb179/f972YmBgtXrxYw4YNk8fj8W0/efKkevfu\nLYfDIbfbre7du6uiokJer1c2m63a/RUXl/p1EJGRdrnd5/x6Dv6H+dUds6uf1ji/pjqe1ji7pmTa\n/Kr7R0ONl77PnTun5cuX6/XXX/f9lPeTTz6pw4cPS5IKCgrUtWtX9erVS4WFhSopKdGFCxfkcrnU\nt29fDRgwQNnZ2ZKkvLw89evXryGOCQCAn4Uaz6g//PBDFRcXa8aMGb5to0aN0owZM9SmTRuFhIRo\n6dKlCg4O1syZM5WSkiKLxaLU1FTZ7XYlJCRoz549SkpKks1mU3p6eqMeEAAArYnFW5s3jZuYv5cj\nTLuE0dIwv7pjdvXj7/wmpe9sxNU0jA1pMU2yH/7s1Y9p86vXpW8AANB8CDUAAAYj1AAAGIxQAwBg\nMEINAIDBCDUAAAYj1AAAGIxQAwBgMEINAIDBCDUAAAYj1AAAGIxQAwBgMEINAIDBCDUAAAYj1AAA\nGIxQAwBgMEINAIDBCDUAAAYj1AAAGIxQAwBgMEINAIDBCDUAAAYLbO4FAA1lUvrO5l5CtTakxTT3\nEgC0QJxRAwBgMEINAIDBCDUAAAYj1AAAGIxQAwBgMEINAIDBCDUAAAYj1AAAGIxQAwBgMEINAIDB\nCDUAAAYj1AAAGIxQAwBgMEINAIDBCDUAAAbj91EDTcT035ct8TuzARNxRg0AgMFqdUa9fPly7du3\nT5cvX9bkyZMVFRWl2bNnq7KyUpGRkVqxYoVsNpuysrK0adMmBQQEaOzYsRozZowqKiqUlpamo0eP\nymq1aunSperSpUtjHxcAAK1CjaH+4osv9M033ygzM1PFxcV65JFH1L9/fyUnJ2v48OF66aWX5HQ6\nlZiYqDVr1sjpdCooKEijR49WXFyc8vLyFBoaqoyMDO3evVsZGRlauXJlUxwbAAAtXo2Xvu+55x69\n/PLLkqTQ0FCVlZWpoKBAsbGxkqQhQ4YoPz9f+/fvV1RUlOx2u4KDg9WnTx+5XC7l5+crLi5OkhQd\nHS2Xy9WIhwMAQOtS4xm11WpVSEiIJMnpdGrw4MHavXu3bDabJCkiIkJut1sej0fh4eG+54WHh1+1\nPSAgQBaLRZcuXfI9/1o6dAhRYKDVrwOJjLT79XhUxfwgNc+fg9b2
Z68pj6e1za6ptZT51fqnvnfs\n2CGn06kNGzZo6NChvu1er/eaj/d3+5WKi0truyxJPw7b7T7n13PwP8wPP2nqPwet8c9eUx1Pa5xd\nUzJtftX9o6FWP/X92Wefae3atVq3bp3sdrtCQkJUXl4uSTpx4oQcDoccDoc8Ho/vOSdPnvRtd7vd\nkqSKigp5vd5qz6YBAMD/1Bjqc+fOafny5Xr99dcVFhYm6cf3mnNyciRJubm5GjRokHr16qXCwkKV\nlJTowoULcrlc6tu3rwYMGKDs7GxJUl5envr169eIhwMAQOtS46XvDz/8UMXFxZoxY4ZvW3p6uhYs\nWKDMzEx17txZiYmJCgoK0syZM5WSkiKLxaLU1FTZ7XYlJCRoz549SkpKks1mU3p6eqMeEAAArUmN\noX700Uf16KOPXrX9rbfeumpbfHy84uPjq2z76bPTAADAf9xCFIBPS7jNKfBzwy1EAQAwGKEGAMBg\nhBoAAIMRagAADEaoAQAwGKEGAMBghBoAAIMRagAADEaoAQAwGHcmQ61wxyoAaB6cUQMAYDBCDQCA\nwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMA\nYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABgssLkXAADAlSal72zuJdRoQ1pM\nk+2LM2oAAAxGqAEAMBihBgDAYIQaAACDEWoAAAxGqAEAMBihBgDAYLX6HPXXX3+tqVOnauLEiRo/\nfrzS0tJ08OBBhYWFSZJSUlL0wAMPKCsrS5s2bVJAQIDGjh2rMWPGqKKiQmlpaTp69KisVquWLl2q\nLl26NOpBAUBz4TPAaGg1hrq0tFRLlixR//79q2x/+umnNWTIkCqPW7NmjZxOp4KCgjR69GjFxcUp\nLy9PoaGhysjI0O7du5WRkaGVK1c2/JEAANAK1Xjp22azad26dXI4HNU+bv/+/YqKipLdbldwcLD6\n9Okjl8ul/Px8xcXFSZKio6PlcrkaZuUAAPwM1BjqwMBABQcHX7V9y5YtmjBhgp566imdPn1aHo9H\n4eHhvu+Hh4fL7XZX2R4QECCLxaJLly414CEAANB61ele3w8//LDCwsLUo0cPvfHGG3rllVd09913\nV3mM1+u95nOvt/1KHTqEKDDQ6teaIiPtfj0eVTE/4OeDv+/115QzrFOor3y/OiYmRosXL9awYcPk\n8Xh820+ePKnevXvL4XDI7Xare/fuqqiokNfrlc1mq/b1i4tL/VpPZKRdbvc5/w4CPswP+Hnh73v9\nNfQMqwt/nT6e9eSTT+rw4cOSpIKCAnXt2lW9evVSYWGhSkpKdOHCBblcLvXt21cDBgxQdna2JCkv\nL0/9+vWryy4BAPhZqvGM+sCBA1q2bJm+//57BQYGKicnR+PHj9eMGTPUpk0bhYSEaOnSpQoODtbM\nmTOVkpIii8Wi1NRU2e12JSQkaM+ePUpKSpLNZlN6enpTHBcAAK1CjaHu2bOnNm/efNX2YcOGXbUt\nPj5e8fHxVbb99NlpAADgP+5MBgCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMA\nYDBCDQCAwQg1AAAGI9QAABiMUAMAYLA6/T5qAEDLNSl9Z3MvAX7gjBoAAIMRagAADEaoAQAwGKEG\nAMBghBoAAIMRagAADEaoAQAwGKEGAMBghBoAAIMRagAADEaoAQAwGKEGAMBghBoAAIMRagAADEao\nAQAwGKEGAMBghBoAAIMRagAADEaoAQAwGKEGAMBghBoAAIMRagAADEaoAQAwGKEGAMBghBoAAIMR\nagAADFarUH/99dd68MEHtWXLFknSsWPH9Lvf/U7JycmaPn26Ll26JEnKysrSb37zG40ZM0bvvfee\nJKmiokIzZ85UUlKSxo8fr8OH
DzfSoQAA0PrUGOrS0lItWbJE/fv3921btWqVkpOT9fbbb+uWW26R\n0+lUaWmp1qxZo40bN2rz5s3atGmTzpw5o+3btys0NFTvvPOOpkyZooyMjEY9IAAAWpMaQ22z2bRu\n3To5HA7ftoKCAsXGxkqShgwZovz8fO3fv19RUVGy2+0KDg5Wnz595HK5lJ+fr7i4OElSdHS0XC5X\nIx0KAACtT42hDgwMVHBwcJVtZWVlstlskqSIiAi53W55PB6Fh4f7HhMeHn7V9oCAAFksFt+lcgAA\nUL3A+r6A1+ttkO1X6tAhRIGBVr/WERlp9+vxqIr5AUDtNeX/M+sU6pCQEJWXlys4OFgnTpyQw+GQ\nw+GQx+PxPebkyZPq3bu3HA6H3G63unfvroqKCnm9Xt/Z+PUUF5f6tZ7ISLvc7nN1ORSI+QGAvxr6\n/5nVhb9OH8+Kjo5WTk6OJCk3N1eDBg1Sr169VFhYqJKSEl24cEEul0t9+/bVgAEDlJ2dLUnKy8tT\nv3796rJLAAB+lmo8oz5w4ICWLVum77//XoGBgcrJydGLL76otLQ0ZWZmqnPnzkpMTFRQUJBmzpyp\nlJQUWSwWpaamym63KyEhQXv27FFSUpJsNpvS09Ob4rgAAGgVLN7avGncxPy9pMCl2/qpzfwmpe9s\notUAgPk2pMU06Os1+KVvAADQNOr9U99oGJyxAgCuhTNqAAAMRqgBADAYoQYAwGCEGgAAgxFqAAAM\nRqgBADAYoQYAwGCEGgAAgxFqAAAMRqgBADAYoQYAwGCEGgAAgxFqAAAMRqgBADAYoQYAwGCEGgAA\ngxFqAAAMRqgBADAYoQYAwGCEGgAAgxFqAAAMRqgBADAYoQYAwGCEGgAAgxFqAAAMRqgBADAYoQYA\nwGCEGgAAgxFqAAAMRqgBADAYoQYAwGCEGgAAgxFqAAAMFtjcC2gKk9J3NvcSAACoE86oAQAwGKEG\nAMBghBoAAIMRagAADFanHyYrKCjQ9OnT1bVrV0lSt27d9Pjjj2v27NmqrKxUZGSkVqxYIZvNpqys\nLG3atEkBAQEaO3asxowZ06AHAABAa1bnn/q+9957tWrVKt/Xc+fOVXJysoYPH66XXnpJTqdTiYmJ\nWrNmjZxOp4KCgjR69GjFxcUpLCysQRYPAEBr12CXvgsKChQbGytJGjJkiPLz87V//35FRUXJbrcr\nODhYffr0kcvlaqhdAgDQ6tX5jPrQoUOaMmWKzp49q2nTpqmsrEw2m02SFBERIbfbLY/Ho/DwcN9z\nwsPD5Xa7a3ztDh1CFBho9Ws9kZF2/w4AAIA6asrm1CnUt956q6ZNm6bhw4fr8OHDmjBhgiorK33f\n93q913ze9bb//4qLS/1aT2SkXW73Ob+eAwBAXTV0c6oLf50ufXfq1EkJCQmyWCy6+eab1bFjR509\ne1bl5eWSpBMnTsjhcMjhcMjj8fied/LkSTkcjrrsEgCAn6U6hTorK0tvvvmmJMntduvUqVMaNWqU\ncnJyJEm5ubkaNGiQevXqpcLCQpWUlOjChQtyuVzq27dvw60eAIBWrk6XvmNiYvTMM8/ok08+UUVF\nhRYvXqwePXpozpw5yszMVOfOnZWYmKigoCDNnDlTKSkpslgsSk1Nld3Oe8kAANSWxVvbN46bkL/X\n/mt6j5pfygEAaEgb0mIa9PUa/D1qAADQNAg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiM\nUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAG\nI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCA\nwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABiMUAMAYDBCDQCAwQg1AAAGI9QAABgssC
l2\n8sILL2j//v2yWCyaN2+e7rrrrqbYLQAALV6jh/rLL7/Ud999p8zMTBUVFWnevHnKzMxs7N0CANAq\nNPql7/z8fD344IOSpNtuu01nz57V+fPnG3u3AAC0Co0eao/How4dOvi+Dg8Pl9vtbuzdAgDQKjTJ\ne9RX8nq9NT4mMtLu9+tW95wPMh72+/UAADBBo59ROxwOeTwe39cnT55UZGRkY+8WAIBWodFDPWDA\nAOXk5EiSDh48KIfDoXbt2jX2bgEAaBUa/dJ3nz59dOedd2rcuHGyWCxatGhRY+8SAIBWw+KtzZvG\nAACgWXBnMgAADEaoAQAwWJN/PKuhcXtS/3399deaOnWqJk6cqPHjx+vYsWOaPXu2KisrFRkZqRUr\nVshmszX3Mo20fPly7du3T5cvX9bkyZMVFRXF7GqhrKxMaWlpOnXqlC5evKipU6eqe/fuzM5P5eXl\nGjFihKZOnar+/fszv1oqKCjQ9OnT1bVrV0lSt27d9Pjjj7eY+bXoM+orb0/6/PPP6/nnn2/uJRmv\ntLRUS5YsUf/+/X3bVq1apeTkZL399tu65ZZb5HQ6m3GF5vriiy/0zTffKDMzU+vXr9cLL7zA7Gop\nLy9PPXv21JYtW7Ry5Uqlp6czuzp47bXX1L59e0n8vfXXvffeq82bN2vz5s1auHBhi5pfiw41tyf1\nn81m07p16+RwOHzbCgoKFBsbK0kaMmSI8vPzm2t5Rrvnnnv08ssvS5JCQ0NVVlbG7GopISFBTzzx\nhCTp2LFj6tSpE7PzU1FRkQ4dOqQHHnhAEn9v66slza9Fh5rbk/ovMDBQwcHBVbaVlZX5LvlEREQw\nw+uwWq0KCQmRJDmdTg0ePJjZ+WncuHF65plnNG/ePGbnp2XLliktLc33NfPzz6FDhzRlyhQlJSXp\n888/b1Hza/HvUV+JT5rVHzOs2Y4dO+R0OrVhwwYNHTrUt53Z1ezdd9/VV199pVmzZlWZF7Or3l/+\n8hf17t1bXbp0ueb3mV/1br31Vk2bNk3Dhw/X4cOHNWHCBFVWVvq+b/r8WnSouT1pwwgJCVF5ebmC\ng4N14sSJKpfFUdVnn32mtWvXav369bLb7cyulg4cOKCIiAjdcMMN6tGjhyorK9W2bVtmV0u7du3S\n4cOHtWvXLh0/flw2m40/e37o1KmTEhISJEk333yzOnbsqMLCwhYzvxZ96ZvbkzaM6Oho3xxzc3M1\naNCgZl6Rmc6dO6fly5fr9ddfV1hYmCRmV1t79+7Vhg0bJP34llVpaSmz88PKlSu1detW/fnPf9aY\nMWM0depU5ueHrKwsvfnmm5Ikt9utU6dOadSoUS1mfi3+zmQvvvii9u7d67s9affu3Zt7SUY7cOCA\nli1bpu+//16BgYHq1KmTXnytKYqYAAAArElEQVTxRaWlpenixYvq3Lmzli5dqqCgoOZeqnEyMzO1\nevVq/fKXv/RtS09P14IFC5hdDcrLyzV//nwdO3ZM5eXlmjZtmnr27Kk5c+YwOz+tXr1aN954owYO\nHMj8aun8+fN65plnVFJSooqKCk2bNk09evRoMfNr8aEGAKA1a9GXvgEAaO0INQAABiPUAAAYjFAD\nAGAwQg0AgMEINQAABiPUAAAYjFADAGCw/wdkB5RjykY3PgAAAABJRU5ErkJggg==\n",
764 | "text/plain": [
765 | ""
766 | ]
767 | },
768 | "metadata": {
769 | "tags": []
770 | }
771 | }
772 | ]
773 | },
774 | {
775 | "metadata": {
776 | "id": "XtYZ7114n3b-",
777 | "colab_type": "text"
778 | },
779 | "cell_type": "markdown",
780 | "source": [
781 | " ## Accessing Data\n",
782 | "\n",
783 | "You can access `DataFrame` data using familiar Python dict/list operations:"
784 | ]
785 | },
786 | {
787 | "metadata": {
788 | "id": "_TFm7-looBFF",
789 | "colab_type": "code",
790 | "colab": {
791 | "autoexec": {
792 | "startup": false,
793 | "wait_interval": 0
794 | },
795 | "base_uri": "https://localhost:8080/",
796 | "height": 101
797 | },
798 | "outputId": "5676d59f-377e-4944-f125-b8c32242b145",
799 | "executionInfo": {
800 | "status": "ok",
801 | "timestamp": 1528027935465,
802 | "user_tz": -480,
803 | "elapsed": 1353,
804 | "user": {
805 | "displayName": "陈方杰",
806 | "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
807 | "userId": "115418685205938795893"
808 | }
809 | }
810 | },
811 | "cell_type": "code",
812 | "source": [
813 | "cities = pd.DataFrame({ 'City name': city_names, 'Population': population })\n",
814 | "print(type(cities['City name']))\n",
815 | "cities['City name']"
816 | ],
817 | "execution_count": 22,
818 | "outputs": [
819 | {
820 | "output_type": "stream",
821 | "text": [
822 | "<class 'pandas.core.series.Series'>\n"
823 | ],
824 | "name": "stdout"
825 | },
826 | {
827 | "output_type": "execute_result",
828 | "data": {
829 | "text/plain": [
830 | "0 San Francisco\n",
831 | "1 San Jose\n",
832 | "2 Sacramento\n",
833 | "Name: City name, dtype: object"
834 | ]
835 | },
836 | "metadata": {
837 | "tags": []
838 | },
839 | "execution_count": 22
840 | }
841 | ]
842 | },
843 | {
844 | "metadata": {
845 | "id": "V5L6xacLoxyv",
846 | "colab_type": "code",
847 | "colab": {
848 | "autoexec": {
849 | "startup": false,
850 | "wait_interval": 0
851 | },
852 | "base_uri": "https://localhost:8080/",
853 | "height": 50
854 | },
855 | "outputId": "77170b83-33cd-47b1-9db4-03a825e34d1e",
856 | "executionInfo": {
857 | "status": "ok",
858 | "timestamp": 1528027938449,
859 | "user_tz": -480,
860 | "elapsed": 1241,
861 | "user": {
862 | "displayName": "陈方杰",
863 | "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
864 | "userId": "115418685205938795893"
865 | }
866 | }
867 | },
868 | "cell_type": "code",
869 | "source": [
870 | "print(type(cities['City name'][1]))\n",
871 | "cities['City name'][1]"
872 | ],
873 | "execution_count": 23,
874 | "outputs": [
875 | {
876 | "output_type": "stream",
877 | "text": [
878 | "<type 'str'>\n"
879 | ],
880 | "name": "stdout"
881 | },
882 | {
883 | "output_type": "execute_result",
884 | "data": {
885 | "text/plain": [
886 | "'San Jose'"
887 | ]
888 | },
889 | "metadata": {
890 | "tags": []
891 | },
892 | "execution_count": 23
893 | }
894 | ]
895 | },
896 | {
897 | "metadata": {
898 | "id": "gcYX1tBPugZl",
899 | "colab_type": "code",
900 | "colab": {
901 | "autoexec": {
902 | "startup": false,
903 | "wait_interval": 0
904 | },
905 | "base_uri": "https://localhost:8080/",
906 | "height": 123
907 | },
908 | "outputId": "ce5ad5e9-6509-4da0-fc82-e867a55b482a",
909 | "executionInfo": {
910 | "status": "ok",
911 | "timestamp": 1528027941171,
912 | "user_tz": -480,
913 | "elapsed": 1156,
914 | "user": {
915 | "displayName": "陈方杰",
916 | "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
917 | "userId": "115418685205938795893"
918 | }
919 | }
920 | },
921 | "cell_type": "code",
922 | "source": [
923 | "print(type(cities[0:2]))\n",
924 | "cities[0:2]"
925 | ],
926 | "execution_count": 24,
927 | "outputs": [
928 | {
929 | "output_type": "stream",
930 | "text": [
931 | "<class 'pandas.core.frame.DataFrame'>\n"
932 | ],
933 | "name": "stdout"
934 | },
935 | {
936 | "output_type": "execute_result",
937 | "data": {
976 | "text/plain": [
977 | " City name Population\n",
978 | "0 San Francisco 852469\n",
979 | "1 San Jose 1015785"
980 | ]
981 | },
982 | "metadata": {
983 | "tags": []
984 | },
985 | "execution_count": 24
986 | }
987 | ]
988 | },
989 | {
990 | "metadata": {
991 | "id": "65g1ZdGVjXsQ",
992 | "colab_type": "text"
993 | },
994 | "cell_type": "markdown",
995 | "source": [
996 | " In addition, *pandas* provides an extremely rich API for advanced [indexing and selection](http://pandas.pydata.org/pandas-docs/stable/indexing.html), too extensive to list in full here."
997 | ]
998 | },
999 | {
1000 | "metadata": {
1001 | "id": "RM1iaD-ka3Y1",
1002 | "colab_type": "text"
1003 | },
1004 | "cell_type": "markdown",
1005 | "source": [
1006 | " ## Manipulating Data\n",
1007 | "\n",
1008 | "You can apply Python's basic arithmetic operations to a `Series`. For example:"
1009 | ]
1010 | },
1011 | {
1012 | "metadata": {
1013 | "id": "XWmyCFJ5bOv-",
1014 | "colab_type": "code",
1015 | "colab": {
1016 | "autoexec": {
1017 | "startup": false,
1018 | "wait_interval": 0
1019 | },
1020 | "base_uri": "https://localhost:8080/",
1021 | "height": 84
1022 | },
1023 | "outputId": "7de9ada2-e463-49e0-d76c-c356980d9619",
1024 | "executionInfo": {
1025 | "status": "ok",
1026 | "timestamp": 1528027944724,
1027 | "user_tz": -480,
1028 | "elapsed": 1342,
1029 | "user": {
1030 | "displayName": "陈方杰",
1031 | "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
1032 | "userId": "115418685205938795893"
1033 | }
1034 | }
1035 | },
1036 | "cell_type": "code",
1037 | "source": [
1038 | "population / 1000."
1039 | ],
1040 | "execution_count": 25,
1041 | "outputs": [
1042 | {
1043 | "output_type": "execute_result",
1044 | "data": {
1045 | "text/plain": [
1046 | "0 852.469\n",
1047 | "1 1015.785\n",
1048 | "2 485.199\n",
1049 | "dtype: float64"
1050 | ]
1051 | },
1052 | "metadata": {
1053 | "tags": []
1054 | },
1055 | "execution_count": 25
1056 | }
1057 | ]
1058 | },
1059 | {
1060 | "metadata": {
1061 | "id": "TQzIVnbnmWGM",
1062 | "colab_type": "text"
1063 | },
1064 | "cell_type": "markdown",
1065 | "source": [
1066 | " [NumPy](http://www.numpy.org/) is a popular toolkit for scientific computing. A *pandas* `Series` can be passed as an argument to most NumPy functions:"
1067 | ]
1068 | },
1069 | {
1070 | "metadata": {
1071 | "id": "ko6pLK6JmkYP",
1072 | "colab_type": "code",
1073 | "colab": {
1074 | "autoexec": {
1075 | "startup": false,
1076 | "wait_interval": 0
1077 | },
1078 | "base_uri": "https://localhost:8080/",
1079 | "height": 84
1080 | },
1081 | "outputId": "7534bbe4-e719-4244-e0cf-e63da3d9993a",
1082 | "executionInfo": {
1083 | "status": "ok",
1084 | "timestamp": 1528027947332,
1085 | "user_tz": -480,
1086 | "elapsed": 1172,
1087 | "user": {
1088 | "displayName": "陈方杰",
1089 | "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
1090 | "userId": "115418685205938795893"
1091 | }
1092 | }
1093 | },
1094 | "cell_type": "code",
1095 | "source": [
1096 | "import numpy as np\n",
1097 | "\n",
1098 | "np.log(population)"
1099 | ],
1100 | "execution_count": 26,
1101 | "outputs": [
1102 | {
1103 | "output_type": "execute_result",
1104 | "data": {
1105 | "text/plain": [
1106 | "0 13.655892\n",
1107 | "1 13.831172\n",
1108 | "2 13.092314\n",
1109 | "dtype: float64"
1110 | ]
1111 | },
1112 | "metadata": {
1113 | "tags": []
1114 | },
1115 | "execution_count": 26
1116 | }
1117 | ]
1118 | },
1119 | {
1120 | "metadata": {
1121 | "id": "xmxFuQmurr6d",
1122 | "colab_type": "text"
1123 | },
1124 | "cell_type": "markdown",
1125 | "source": [
1126 | " For more complex single-column transformations, you can use `Series.apply`. Like the Python [map function](https://docs.python.org/2/library/functions.html#map), `Series.apply` accepts a [lambda function](https://docs.python.org/2/tutorial/controlflow.html#lambda-expressions) as an argument and applies it to each value.\n",
1127 | "\n",
1128 | "The example below creates a new `Series` that indicates whether `population` is over one million:"
1129 | ]
1130 | },
1131 | {
1132 | "metadata": {
1133 | "id": "Fc1DvPAbstjI",
1134 | "colab_type": "code",
1135 | "colab": {
1136 | "autoexec": {
1137 | "startup": false,
1138 | "wait_interval": 0
1139 | },
1140 | "base_uri": "https://localhost:8080/",
1141 | "height": 84
1142 | },
1143 | "outputId": "55250445-9035-4fbb-d6f7-de838d4e2d2d",
1144 | "executionInfo": {
1145 | "status": "ok",
1146 | "timestamp": 1528027949958,
1147 | "user_tz": -480,
1148 | "elapsed": 1337,
1149 | "user": {
1150 | "displayName": "陈方杰",
1151 | "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
1152 | "userId": "115418685205938795893"
1153 | }
1154 | }
1155 | },
1156 | "cell_type": "code",
1157 | "source": [
1158 | "population.apply(lambda val: val > 1000000)"
1159 | ],
1160 | "execution_count": 27,
1161 | "outputs": [
1162 | {
1163 | "output_type": "execute_result",
1164 | "data": {
1165 | "text/plain": [
1166 | "0 False\n",
1167 | "1 True\n",
1168 | "2 False\n",
1169 | "dtype: bool"
1170 | ]
1171 | },
1172 | "metadata": {
1173 | "tags": []
1174 | },
1175 | "execution_count": 27
1176 | }
1177 | ]
1178 | },
1179 | {
1180 | "metadata": {
1181 | "id": "6qh63m-ayb-c",
1182 | "colab_type": "text"
1183 | },
1184 | "cell_type": "markdown",
1185 | "source": [
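1186 | " ## Exercise 1\n",
1187 | "\n",
1188 | "Modify the `cities` table by adding a new boolean column that is True if and only if *both* of the following are True:\n",
1189 | "\n",
1190 | " * The city is named after a saint.\n",
1191 | " * The city has an area greater than 50 square miles.\n",
1192 | "\n",
1193 | "**Note:** Boolean `Series` are combined using the bitwise operators rather than the traditional boolean operators. For example, when performing a *logical and*, use `&` instead of `and`.\n",
1194 | "\n",
1195 | "**Hint:** \"San\" in Spanish means \"saint\"."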
1196 | ]
1197 | },
1198 | {
1199 | "metadata": {
1200 | "id": "ZeYYLoV9b9fB",
1201 | "colab_type": "text"
1202 | },
1203 | "cell_type": "markdown",
1204 | "source": [
1205 | " \n",
1206 | "Modifying a `DataFrame` is also straightforward. For example, the following code adds two `Series` to an existing `DataFrame`:"
1207 | ]
1208 | },
1209 | {
1210 | "metadata": {
1211 | "id": "0gCEX99Hb8LR",
1212 | "colab_type": "code",
1213 | "colab": {
1214 | "autoexec": {
1215 | "startup": false,
1216 | "wait_interval": 0
1217 | },
1218 | "base_uri": "https://localhost:8080/",
1219 | "height": 136
1220 | },
1221 | "outputId": "5f7bff59-cef4-4b20-83d0-1d98b1ccd6c3",
1222 | "executionInfo": {
1223 | "status": "ok",
1224 | "timestamp": 1528027952256,
1225 | "user_tz": -480,
1226 | "elapsed": 1152,
1227 | "user": {
1228 | "displayName": "陈方杰",
1229 | "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
1230 | "userId": "115418685205938795893"
1231 | }
1232 | }
1233 | },
1234 | "cell_type": "code",
1235 | "source": [
1236 | "cities['Area square miles'] = pd.Series([46.87, 176.53, 97.92])\n",
1237 | "cities['Population density'] = cities['Population'] / cities['Area square miles']\n",
1238 | "cities"
1239 | ],
1240 | "execution_count": 28,
1241 | "outputs": [
1242 | {
1243 | "output_type": "execute_result",
1244 | "data": {
1296 | "text/plain": [
1297 | " City name Population Area square miles Population density\n",
1298 | "0 San Francisco 852469 46.87 18187.945381\n",
1299 | "1 San Jose 1015785 176.53 5754.177760\n",
1300 | "2 Sacramento 485199 97.92 4955.055147"
1301 | ]
1302 | },
1303 | "metadata": {
1304 | "tags": []
1305 | },
1306 | "execution_count": 28
1307 | }
1308 | ]
1309 | },
1310 | {
1311 | "metadata": {
1312 | "id": "zCOn8ftSyddH",
1313 | "colab_type": "code",
1314 | "colab": {
1315 | "autoexec": {
1316 | "startup": false,
1317 | "wait_interval": 0
1318 | },
1319 | "base_uri": "https://localhost:8080/",
1320 | "height": 136
1321 | },
1322 | "outputId": "3f55dfb8-faa8-48a5-a8d8-c320df9a262a",
1323 | "executionInfo": {
1324 | "status": "ok",
1325 | "timestamp": 1528028043662,
1326 | "user_tz": -480,
1327 | "elapsed": 1341,
1328 | "user": {
1329 | "displayName": "陈方杰",
1330 | "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
1331 | "userId": "115418685205938795893"
1332 | }
1333 | }
1334 | },
1335 | "cell_type": "code",
1336 | "source": [
1337 | "# Your code here\n",
1338 | "# [2018-06-03] Added by Amusi:\n",
1339 | "# Boolean Series are combined with the bitwise operators rather than the traditional boolean ones; for a logical AND, use & instead of and.\n",
1340 | "# Note: see the pandas Series.apply documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.apply.html\n",
1341 | "\n",
1342 | "cities['Is wide and has saint name'] = (cities['Area square miles'] > 50) & cities['City name'].apply(lambda name: name.startswith('San'))\n",
1343 | "cities"
1344 | ],
1345 | "execution_count": 32,
1346 | "outputs": [
1347 | {
1348 | "output_type": "execute_result",
1349 | "data": {
1405 | "text/plain": [
1406 | " City name Population Area square miles Population density \\\n",
1407 | "0 San Francisco 852469 46.87 18187.945381 \n",
1408 | "1 San Jose 1015785 176.53 5754.177760 \n",
1409 | "2 Sacramento 485199 97.92 4955.055147 \n",
1410 | "\n",
1411 | " Is wide and has saint name \n",
1412 | "0 False \n",
1413 | "1 True \n",
1414 | "2 False "
1415 | ]
1416 | },
1417 | "metadata": {
1418 | "tags": []
1419 | },
1420 | "execution_count": 32
1421 | }
1422 | ]
1423 | },
1424 | {
1425 | "metadata": {
1426 | "id": "YHIWvc9Ms-Ll",
1427 | "colab_type": "text"
1428 | },
1429 | "cell_type": "markdown",
1430 | "source": [
1431 | " ### Solution\n",
1432 | "\n",
1433 | "Click below for a solution."
1434 | ]
1435 | },
1436 | {
1437 | "metadata": {
1438 | "id": "T5OlrqtdtCIb",
1439 | "colab_type": "code",
1440 | "colab": {
1441 | "autoexec": {
1442 | "startup": false,
1443 | "wait_interval": 0
1444 | },
1445 | "base_uri": "https://localhost:8080/",
1446 | "height": 136
1447 | },
1448 | "outputId": "1212940f-9af4-4039-d03a-ae2fd4d72e3b",
1449 | "executionInfo": {
1450 | "status": "ok",
1451 | "timestamp": 1528028019617,
1452 | "user_tz": -480,
1453 | "elapsed": 1848,
1454 | "user": {
1455 | "displayName": "陈方杰",
1456 | "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
1457 | "userId": "115418685205938795893"
1458 | }
1459 | }
1460 | },
1461 | "cell_type": "code",
1462 | "source": [
1463 | "cities['Is wide and has saint name'] = (cities['Area square miles'] > 50) & cities['City name'].apply(lambda name: name.startswith('San'))\n",
1464 | "cities"
1465 | ],
1466 | "execution_count": 31,
1467 | "outputs": [
1468 | {
1469 | "output_type": "execute_result",
1470 | "data": {
1526 | "text/plain": [
1527 | " City name Population Area square miles Population density \\\n",
1528 | "0 San Francisco 852469 46.87 18187.945381 \n",
1529 | "1 San Jose 1015785 176.53 5754.177760 \n",
1530 | "2 Sacramento 485199 97.92 4955.055147 \n",
1531 | "\n",
1532 | " Is wide and has saint name \n",
1533 | "0 False \n",
1534 | "1 True \n",
1535 | "2 False "
1536 | ]
1537 | },
1538 | "metadata": {
1539 | "tags": []
1540 | },
1541 | "execution_count": 31
1542 | }
1543 | ]
1544 | },
1545 | {
1546 | "metadata": {
1547 | "id": "f-xAOJeMiXFB",
1548 | "colab_type": "text"
1549 | },
1550 | "cell_type": "markdown",
1551 | "source": [
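1552 | " ## Indexes\n",
1553 | "Both `Series` and `DataFrame` objects also define an `index` property that assigns an identifier value to each `Series` item or `DataFrame` row.\n",
1554 | "\n",
1555 | "By default, at construction, *pandas* assigns index values that reflect the ordering of the source data. Once created, the index values are stable; that is, they do not change when the data is reordered."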
1556 | ]
1557 | },
1558 | {
1559 | "metadata": {
1560 | "id": "2684gsWNinq9",
1561 | "colab_type": "code",
1562 | "colab": {
1563 | "autoexec": {
1564 | "startup": false,
1565 | "wait_interval": 0
1566 | },
1567 | "base_uri": "https://localhost:8080/",
1568 | "height": 34
1569 | },
1570 | "outputId": "3c3fa21d-ba90-4db6-888d-cbd1df2028ed",
1571 | "executionInfo": {
1572 | "status": "ok",
1573 | "timestamp": 1528028079282,
1574 | "user_tz": -480,
1575 | "elapsed": 1359,
1576 | "user": {
1577 | "displayName": "陈方杰",
1578 | "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
1579 | "userId": "115418685205938795893"
1580 | }
1581 | }
1582 | },
1583 | "cell_type": "code",
1584 | "source": [
1585 | "city_names.index"
1586 | ],
1587 | "execution_count": 33,
1588 | "outputs": [
1589 | {
1590 | "output_type": "execute_result",
1591 | "data": {
1592 | "text/plain": [
1593 | "RangeIndex(start=0, stop=3, step=1)"
1594 | ]
1595 | },
1596 | "metadata": {
1597 | "tags": []
1598 | },
1599 | "execution_count": 33
1600 | }
1601 | ]
1602 | },
1603 | {
1604 | "metadata": {
1605 | "id": "F_qPe2TBjfWd",
1606 | "colab_type": "code",
1607 | "colab": {
1608 | "autoexec": {
1609 | "startup": false,
1610 | "wait_interval": 0
1611 | },
1612 | "base_uri": "https://localhost:8080/",
1613 | "height": 34
1614 | },
1615 | "outputId": "5364feec-d2c4-425e-e09d-b8a0590b1ca5",
1616 | "executionInfo": {
1617 | "status": "ok",
1618 | "timestamp": 1528028086703,
1619 | "user_tz": -480,
1620 | "elapsed": 1169,
1621 | "user": {
1622 | "displayName": "陈方杰",
1623 | "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
1624 | "userId": "115418685205938795893"
1625 | }
1626 | }
1627 | },
1628 | "cell_type": "code",
1629 | "source": [
1630 | "cities.index"
1631 | ],
1632 | "execution_count": 34,
1633 | "outputs": [
1634 | {
1635 | "output_type": "execute_result",
1636 | "data": {
1637 | "text/plain": [
1638 | "RangeIndex(start=0, stop=3, step=1)"
1639 | ]
1640 | },
1641 | "metadata": {
1642 | "tags": []
1643 | },
1644 | "execution_count": 34
1645 | }
1646 | ]
1647 | },
1648 | {
1649 | "metadata": {
1650 | "id": "hp2oWY9Slo_h",
1651 | "colab_type": "text"
1652 | },
1653 | "cell_type": "markdown",
1654 | "source": [
1655 | " Call `DataFrame.reindex` to manually reorder the rows. For example, the following has the same effect as sorting by city name:"
1656 | ]
1657 | },
1658 | {
1659 | "metadata": {
1660 | "id": "sN0zUzSAj-U1",
1661 | "colab_type": "code",
1662 | "colab": {
1663 | "autoexec": {
1664 | "startup": false,
1665 | "wait_interval": 0
1666 | },
1667 | "base_uri": "https://localhost:8080/",
1668 | "height": 136
1669 | },
1670 | "outputId": "34406cf9-eefb-4dab-8eb0-3bf677de2c69",
1671 | "executionInfo": {
1672 | "status": "ok",
1673 | "timestamp": 1528028093432,
1674 | "user_tz": -480,
1675 | "elapsed": 1183,
1676 | "user": {
1677 | "displayName": "陈方杰",
1678 | "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
1679 | "userId": "115418685205938795893"
1680 | }
1681 | }
1682 | },
1683 | "cell_type": "code",
1684 | "source": [
1685 | "cities.reindex([2, 0, 1])"
1686 | ],
1687 | "execution_count": 35,
1688 | "outputs": [
1689 | {
1690 | "output_type": "execute_result",
1691 | "data": {
1747 | "text/plain": [
1748 | " City name Population Area square miles Population density \\\n",
1749 | "2 Sacramento 485199 97.92 4955.055147 \n",
1750 | "0 San Francisco 852469 46.87 18187.945381 \n",
1751 | "1 San Jose 1015785 176.53 5754.177760 \n",
1752 | "\n",
1753 | " Is wide and has saint name \n",
1754 | "2 False \n",
1755 | "0 False \n",
1756 | "1 True "
1757 | ]
1758 | },
1759 | "metadata": {
1760 | "tags": []
1761 | },
1762 | "execution_count": 35
1763 | }
1764 | ]
1765 | },
1766 | {
1767 | "metadata": {
1768 | "id": "-GQFz8NZuS06",
1769 | "colab_type": "text"
1770 | },
1771 | "cell_type": "markdown",
1772 | "source": [
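1773 | " Reindexing is a great way to shuffle (randomize) a `DataFrame`. In the example below, we take the index, which is array-like, and pass it to NumPy's `random.permutation` function, which returns a randomly permuted copy of its values. Calling `reindex` with this shuffled array causes the `DataFrame` rows to be shuffled in the same way.\n",
1774 | "Try running the following cell multiple times!"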
1775 | ]
1776 | },
1777 | {
1778 | "metadata": {
1779 | "id": "mF8GC0k8uYhz",
1780 | "colab_type": "code",
1781 | "colab": {
1782 | "autoexec": {
1783 | "startup": false,
1784 | "wait_interval": 0
1785 | },
1786 | "base_uri": "https://localhost:8080/",
1787 | "height": 136
1788 | },
1789 | "outputId": "0722ce64-deb6-42c5-b2cc-e0ab1e1ea44d",
1790 | "executionInfo": {
1791 | "status": "ok",
1792 | "timestamp": 1528028109548,
1793 | "user_tz": -480,
1794 | "elapsed": 1322,
1795 | "user": {
1796 | "displayName": "陈方杰",
1797 | "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
1798 | "userId": "115418685205938795893"
1799 | }
1800 | }
1801 | },
1802 | "cell_type": "code",
1803 | "source": [
1804 | "cities.reindex(np.random.permutation(cities.index))"
1805 | ],
1806 | "execution_count": 36,
1807 | "outputs": [
1808 | {
1809 | "output_type": "execute_result",
1810 | "data": {
1866 | "text/plain": [
1867 | " City name Population Area square miles Population density \\\n",
1868 | "0 San Francisco 852469 46.87 18187.945381 \n",
1869 | "2 Sacramento 485199 97.92 4955.055147 \n",
1870 | "1 San Jose 1015785 176.53 5754.177760 \n",
1871 | "\n",
1872 | " Is wide and has saint name \n",
1873 | "0 False \n",
1874 | "2 False \n",
1875 | "1 True "
1876 | ]
1877 | },
1878 | "metadata": {
1879 | "tags": []
1880 | },
1881 | "execution_count": 36
1882 | }
1883 | ]
1884 | },
1885 | {
1886 | "metadata": {
1887 | "id": "fSso35fQmGKb",
1888 | "colab_type": "text"
1889 | },
1890 | "cell_type": "markdown",
1891 | "source": [
1892 | " For more information, see the [Index documentation](http://pandas.pydata.org/pandas-docs/stable/indexing.html#index-objects)."
1893 | ]
1894 | },
1895 | {
1896 | "metadata": {
1897 | "id": "8UngIdVhz8C0",
1898 | "colab_type": "text"
1899 | },
1900 | "cell_type": "markdown",
1901 | "source": [
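1902 | " ## Exercise 2\n",
1903 | "\n",
1904 | "The `reindex` method accepts index values that are not in the original `DataFrame`'s index. Try it and see what happens when you use such values! Why do you think this is allowed?"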
1905 | ]
1906 | },
1907 | {
1908 | "metadata": {
1909 | "id": "PN55GrDX0jzO",
1910 | "colab_type": "code",
1911 | "colab": {
1912 | "autoexec": {
1913 | "startup": false,
1914 | "wait_interval": 0
1915 | },
1916 | "base_uri": "https://localhost:8080/",
1917 | "height": 195
1918 | },
1919 | "outputId": "37b9adae-5781-401c-d4b2-63fa8069ba98",
1920 | "executionInfo": {
1921 | "status": "ok",
1922 | "timestamp": 1528028246239,
1923 | "user_tz": -480,
1924 | "elapsed": 1188,
1925 | "user": {
1926 | "displayName": "陈方杰",
1927 | "photoUrl": "https://lh3.googleusercontent.com/a/default-user=s128",
1928 | "userId": "115418685205938795893"
1929 | }
1930 | }
1931 | },
1932 | "cell_type": "code",
1933 | "source": [
1934 | "# Your code here\n",
1935 | "cities.reindex([0, 5, 1, 7, 2])"
1936 | ],
1937 | "execution_count": 39,
1938 | "outputs": [
1939 | {
1940 | "output_type": "execute_result",
1941 | "data": {
2013 | "text/plain": [
2014 | " City name Population Area square miles Population density \\\n",
2015 | "0 San Francisco 852469.0 46.87 18187.945381 \n",
2016 | "5 NaN NaN NaN NaN \n",
2017 | "1 San Jose 1015785.0 176.53 5754.177760 \n",
2018 | "7 NaN NaN NaN NaN \n",
2019 | "2 Sacramento 485199.0 97.92 4955.055147 \n",
2020 | "\n",
2021 | " Is wide and has saint name \n",
2022 | "0 False \n",
2023 | "5 NaN \n",
2024 | "1 True \n",
2025 | "7 NaN \n",
2026 | "2 False "
2027 | ]
2028 | },
2029 | "metadata": {
2030 | "tags": []
2031 | },
2032 | "execution_count": 39
2033 | }
2034 | ]
2035 | },
2036 | {
2037 | "metadata": {
2038 | "id": "TJffr5_Jwqvd",
2039 | "colab_type": "text"
2040 | },
2041 | "cell_type": "markdown",
2042 | "source": [
2043 | " ### Solution\n",
2044 | "\n",
2045 | "Click below for a solution."
2046 | ]
2047 | },
2048 | {
2049 | "metadata": {
2050 | "id": "8oSvi2QWwuDH",
2051 | "colab_type": "text"
2052 | },
2053 | "cell_type": "markdown",
2054 | "source": [
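2055 | " If your `reindex` input array includes values not in the original `DataFrame` index, `reindex` adds new rows for these \"missing\" indices and populates all corresponding columns with `NaN` values:"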
2056 | ]
2057 | },
2058 | {
2059 | "metadata": {
2060 | "id": "yBdkucKCwy4x",
2061 | "colab_type": "code",
2062 | "colab": {
2063 | "autoexec": {
2064 | "startup": false,
2065 | "wait_interval": 0
2066 | }
2067 | }
2068 | },
2069 | "cell_type": "code",
2070 | "source": [
2071 | "cities.reindex([0, 4, 5, 2])"
2072 | ],
2073 | "execution_count": 0,
2074 | "outputs": []
2075 | },
2076 | {
2077 | "metadata": {
2078 | "id": "2l82PhPbwz7g",
2079 | "colab_type": "text"
2080 | },
2081 | "cell_type": "markdown",
2082 | "source": [
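2083 | " This behavior is desirable because indexes are often strings pulled from the actual data (see the [*pandas* reindex documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html) for an example in which the index values are browser names).\n",
2084 | "\n",
2085 | "In this case, allowing \"missing\" indices makes it easy to reindex using an external list, because you don't have to worry about sanitizing the input."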
2086 | ]
2087 | }
2088 | ]
2089 | }
--------------------------------------------------------------------------------
/Google机器学习速成课程笔记/简介/前提条件和准备工作.md:
--------------------------------------------------------------------------------
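1 | # Prerequisites and Prework
2 |
3 | **Is Machine Learning Crash Course right for you?**
4 |
5 | - I have little or no machine learning background.
6 | - I have some background in machine learning, but I'd like a more current and complete understanding.
7 | - I know machine learning well, but I know little or nothing about TensorFlow.
8 |
9 | If any one of the three options above describes you, this course should be a good fit.
10 |
11 | Before beginning Machine Learning Crash Course, read the [Prerequisites](https://developers.google.com/machine-learning/crash-course/prereqs-and-prework#prerequisites) and [Prework](https://developers.google.com/machine-learning/crash-course/prereqs-and-prework#prework) sections below to make sure you are prepared to complete all the modules.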
12 |
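13 | ## Prerequisites
14 |
15 | Machine Learning Crash Course does not presume or require any prior knowledge of machine learning. However, to understand the concepts presented and complete the exercises, you should ideally meet the following prerequisites:
16 |
17 | - **Mastery of intro-level [algebra](https://en.wikipedia.org/wiki/Algebra).** You should be comfortable with variables and coefficients, linear equations, graphs of functions, and histograms. (Familiarity with more advanced math concepts such as logarithms and derivatives is helpful, but not required.)
18 | - **Proficiency in programming basics, and some experience coding in Python.** Programming exercises in Machine Learning Crash Course are built with [TensorFlow](https://www.tensorflow.org/) and coded in [Python](https://www.python.org/). No prior experience with TensorFlow is required, but you should feel comfortable reading and writing Python code that contains basic programming constructs, such as function definitions/invocations, lists and dicts, loops, and conditional expressions.
19 |
20 | **Note**: For a detailed list of the math and programming concepts used in Machine Learning Crash Course, with reference materials for each, see the [Key Concepts and Tools](https://developers.google.com/machine-learning/crash-course/prereqs-and-prework#key-concepts) section below.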
21 |
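22 | ## Prework
23 |
24 | Programming exercises run directly in your browser (no setup required!) on the [Colaboratory](https://colab.research.google.com/) platform. Colaboratory is supported on most major browsers and is most thoroughly tested on desktop versions of Chrome and Firefox. If you'd prefer to download and run the exercises offline, see these [instructions](https://github.com/google/eng-edu/blob/master/ml/cc/README.md#with-docker) for setting up a local environment.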
25 |
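26 | ### Introduction to Pandas
27 |
28 | Programming exercises in Machine Learning Crash Course use the [Pandas](http://pandas.pydata.org/) library to manipulate datasets. If you are unfamiliar with Pandas, first work through the [Intro to Pandas](https://colab.research.google.com/notebooks/mlcc/intro_to_pandas.ipynb?hl=zh-cn) tutorial, which covers the main Pandas features used in the exercises.
29 |
30 | Note: For convenience, I have already test-run and downloaded [intro_to_pandas.ipynb](intro_to_pandas.ipynb).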
31 |
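32 | ### Low-Level TensorFlow Fundamentals
33 |
34 | Programming exercises in Machine Learning Crash Course use TensorFlow's high-level [tf.estimator API](https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator) to configure models. If you're interested in building TensorFlow models from scratch, work through the following tutorials:
35 |
36 | - [TensorFlow Hello World](https://colab.research.google.com/notebooks/mlcc/hello_world.ipynb?hl=zh-cn): "Hello World" coded in low-level TensorFlow.
37 | - [TensorFlow Programming Concepts](https://colab.research.google.com/notebooks/mlcc/tensorflow_programming_concepts.ipynb?hl=zh-cn): a walkthrough of the fundamental components of a TensorFlow application: tensors, operations, graphs, and sessions.
38 | - [Creating and Manipulating Tensors](https://colab.research.google.com/notebooks/mlcc/creating_and_manipulating_tensors.ipynb?hl=zh-cn): a quick primer on tensors, the central concept in TensorFlow programming. It also includes a refresher on matrix addition and multiplication in linear algebra.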
39 |
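40 | ## Key Concepts and Tools
41 |
42 | Machine Learning Crash Course introduces and applies the following concepts and tools. For more information, see the linked resources.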
43 |
44 | ### Math
45 |
46 | #### Algebra
47 |
48 | - [Variables](https://www.khanacademy.org/math/algebra/introduction-to-algebra/alg1-intro-to-variables/v/what-is-a-variable), [coefficients](https://www.khanacademy.org/math/cc-sixth-grade-math/cc-6th-equivalent-exp/cc-6th-parts-of-expressions/v/expression-terms-factors-and-coefficients), and [functions](https://www.khanacademy.org/math/algebra/algebra-functions)
49 | - [Linear equations](https://wikipedia.org/wiki/Linear_equation) such as $y = b + w_1x_1 + w_2x_2$
50 | - [Logarithms](https://wikipedia.org/wiki/Logarithm) and logarithmic equations such as $y = \ln(1 + e^z)$
51 | - [Sigmoid function](https://wikipedia.org/wiki/Sigmoid_function)
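The three algebra items above can be tried out with nothing but the standard library. The following is a minimal sketch, not part of the course material: the function names `linear`, `softplus`, and `sigmoid` and the sample numbers are assumptions chosen for illustration.

```python
import math


def linear(b, weights, features):
    """Return y = b + w1*x1 + w2*x2 + ... for paired weights and features."""
    return b + sum(w * x for w, x in zip(weights, features))


def softplus(z):
    """Return y = ln(1 + e^z), the logarithmic equation listed above."""
    return math.log(1.0 + math.exp(z))


def sigmoid(z):
    """Return 1 / (1 + e^-z), which maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative values only: 1 + 2*3 + (-1)*4 = 3
y = linear(b=1.0, weights=[2.0, -1.0], features=[3.0, 4.0])
print(y)              # 3.0
print(softplus(0.0))  # ln(2), roughly 0.693
print(sigmoid(0.0))   # 0.5
```

Note that `math.exp` overflows for very large arguments, so this sketch is only meant for small illustrative inputs.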
52 |
53 | #### Linear Algebra
54 |
55 | - [Tensors and tensor rank](https://www.tensorflow.org/programmers_guide/tensors)
56 | - [Matrix multiplication](https://wikipedia.org/wiki/Matrix_multiplication)
57 |
58 | #### Trigonometry
59 |
60 | - [Tanh](https://reference.wolfram.com/language/ref/Tanh.html) (discussed as an [activation function](https://developers.google.com/machine-learning/crash-course/glossary#activation_function); no prior knowledge needed)
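For a feel of how tanh behaves as an activation function, here is a small standard-library sketch. The helper names `sigmoid` and `tanh_via_sigmoid` are assumptions for illustration; the identity tanh(x) = 2·sigmoid(2x) − 1 is a standard one and is not taken from the course text.

```python
import math


def sigmoid(z):
    """Return 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh_via_sigmoid(x):
    """Compute tanh through the identity tanh(x) = 2*sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


# tanh squashes inputs into (-1, 1) and is centered on zero.
for x in (-2.0, 0.0, 2.0):
    print(x, math.tanh(x), tanh_via_sigmoid(x))
```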
61 |
62 | #### Statistics
63 |
64 | - [Mean, median, outliers](https://www.khanacademy.org/math/probability/data-distributions-a1/summarizing-center-distributions/v/mean-median-and-mode), and [standard deviation](https://wikipedia.org/wiki/Standard_deviation)
65 | - Ability to read a [histogram](https://wikipedia.org/wiki/Histogram)
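The statistics terms above map directly onto Python's built-in `statistics` module. The sketch below uses made-up sample values (an assumption, not course data) to show how a single outlier moves the mean far more than the median.

```python
import statistics

# Made-up sample values for illustration.
samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.mean(samples)      # arithmetic average
median = statistics.median(samples)  # middle value of the sorted data
spread = statistics.pstdev(samples)  # population standard deviation
print(mean, median, spread)

# Appending one extreme value (an outlier) shifts the mean a lot,
# while the median barely changes.
with_outlier = samples + [100.0]
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```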
66 |
67 | #### Calculus (optional, for advanced topics)
68 |
69 | - Concept of a [derivative](https://wikipedia.org/wiki/Derivative) (you won't have to actually calculate derivatives)
70 | - [Gradient](https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/gradient-and-directional-derivatives/v/gradient) or slope
71 | - [Partial derivatives](https://wikipedia.org/wiki/Partial_derivative) (which are closely related to gradients)
72 | - [Chain rule](https://wikipedia.org/wiki/Chain_rule) (for a full understanding of the [backpropagation algorithm](https://developers.google.com/machine-learning/crash-course/backprop-scroll/) used to train neural networks)
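The calculus ideas above can be checked numerically without any symbolic math. This is a rough sketch under assumed example functions (`g`, `f`, `surface` are hypothetical): it approximates derivatives with central differences, compares the result with the chain rule, and builds a gradient out of partial derivatives.

```python
def derivative(func, x, h=1e-6):
    """Approximate func'(x) with a central difference."""
    return (func(x + h) - func(x - h)) / (2.0 * h)


def gradient(func, point, h=1e-6):
    """Approximate the gradient: one partial derivative per coordinate."""
    grads = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += h
        down[i] -= h
        grads.append((func(up) - func(down)) / (2.0 * h))
    return grads


def g(x):
    return 3.0 * x + 1.0   # g'(x) = 3


def f(u):
    return u ** 2          # f'(u) = 2u


def composed(x):
    return f(g(x))         # (3x + 1)^2


x = 2.0
chain_rule = (2.0 * g(x)) * 3.0     # f'(g(x)) * g'(x) = 2 * 7 * 3
numeric = derivative(composed, x)   # should agree with chain_rule
print(chain_rule, numeric)


def surface(p):
    return p[0] ** 2 + 3.0 * p[1]   # partials: 2*p0 and 3


grad = gradient(surface, [2.0, 5.0])
print(grad)                         # close to [4.0, 3.0]
```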
73 |
74 | ### Python Programming
75 |
76 | #### Basic Python
77 |
78 | The following Python basics are covered in [The Python Tutorial](https://docs.python.org/3/tutorial/):
79 |
80 | - [Defining and calling functions](https://docs.python.org/3/tutorial/controlflow.html#defining-functions), using positional and [keyword](https://docs.python.org/3/tutorial/controlflow.html#keyword-arguments) parameters
81 | - [Dictionaries](https://docs.python.org/3/tutorial/datastructures.html#dictionaries), [lists](https://docs.python.org/3/tutorial/introduction.html#lists), and [sets](https://docs.python.org/3/tutorial/datastructures.html#sets) (creating, accessing, and iterating)
82 | - [`for` loops](https://docs.python.org/3/tutorial/controlflow.html#for-statements), including `for` loops with multiple iterator variables (e.g., `for a, b in [(1,2), (3,4)]`)
83 | - [`if/else` conditional blocks](https://docs.python.org/3/tutorial/controlflow.html#if-statements) and [conditional expressions](https://docs.python.org/2.5/whatsnew/pep-308.html)
84 | - [String formatting](https://docs.python.org/3/tutorial/inputoutput.html#old-string-formatting) (e.g., `'%.2f' % 3.14`)
85 | - Variables, assignment, and [basic data types](https://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator) (`int`, `float`, `bool`, `str`)
86 | - [The `pass` statement](https://docs.python.org/3/tutorial/controlflow.html#pass-statements)
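If you can read the short script below without looking anything up, you probably meet the basic Python prerequisite. It is only a sketch: the function `describe` is hypothetical, and the city figures are borrowed from the Intro to Pandas notebook output for illustration.

```python
def describe(name, population, area=1.0):
    """Return a one-line summary; `area` is a keyword parameter with a default."""
    density = population / area
    size = "large" if population > 500000 else "small"  # conditional expression
    return "%s: %s city, %.2f people per square mile" % (name, size, density)


# A dictionary mapping each city name to a (population, area) tuple.
cities = {"San Francisco": (852469, 46.87), "Sacramento": (485199, 97.92)}

summaries = []
for name, (population, area) in cities.items():  # several iterator variables
    summaries.append(describe(name, population, area=area))

for line in summaries:
    print(line)

for a, b in [(1, 2), (3, 4)]:
    if a + b > 5:
        pass  # placeholder statement: do nothing for large sums
    else:
        print(a, b)
```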
87 |
88 | #### Intermediate Python
89 |
90 | [The Python Tutorial](https://docs.python.org/3/tutorial/) also covers these more advanced Python features:
91 |
92 | - [List comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions)
93 | - [Lambda functions](https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions)
95 | ### 第三方 Python 库
96 |
97 | 机器学习速成课程代码示例使用了第三方库提供的以下功能。无需提前熟悉这些库;您可以在需要时查询相关内容。
98 |
99 | #### [Matplotlib](http://matplotlib.org/contents.html)(适合数据可视化)
100 |
101 | - [`pyplot`](http://matplotlib.org/api/pyplot_api.html) 模块
102 | - [`cm`](http://matplotlib.org/api/cm_api.html) 模块
103 | - [`gridspec`](http://matplotlib.org/api/gridspec_api.html) 模块
104 |
105 | #### [Seaborn](http://seaborn.pydata.org/index.html)(适合热图)
106 |
107 | - [`heatmap`](http://seaborn.pydata.org/generated/seaborn.heatmap.html) 函数
108 |
109 | #### [Pandas](http://pandas.pydata.org/)(适合数据处理)
110 |
111 | - [`DataFrame`](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) 类
112 |
113 | #### [NumPy](http://www.numpy.org/)(适合低阶数学运算)
114 |
115 | - [`linspace`](https://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.linspace.html) 函数
116 | - [`random`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.random.html#numpy.random.random) 函数
117 | - [`array`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html) 函数
118 | - [`arange`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.arange.html) 函数
119 |
120 | #### [scikit-learn](http://scikit-learn.org/)(适合评估指标)
121 |
122 | - [metrics](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) 模块
123 |
124 | ### Bash 终端/云端控制台
125 |
126 | 要在本地计算机上或云端控制台中运行编程练习,您应该能熟练使用命令行:
127 |
128 | - [Bash 参考手册](https://tiswww.case.edu/php/chet/bash/bashref.html)
129 | - [Bash 快速参考表](https://github.com/LeCoupa/awesome-cheatsheets/blob/master/languages/bash.sh)
130 | - [了解 Shell](http://www.learnshell.org/)
131 |
132 |
133 |
134 | # Reference
135 |
136 | https://developers.google.com/machine-learning/crash-course/prereqs-and-prework#key-concepts
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # TensorFlow-From-Zero-To-One
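2 | **(In progress)** Notes and code from my own study of TensorFlow
3 |
4 | - [TensorFlow: From Beginner to Mastery](TensorFlow从入门到精通)
5 |
6 |
7 | - [Google Machine Learning Crash Course Notes](Google机器学习速成课程笔记)
8 |
9 |
10 |
11 | ## TensorFlow Learning Resources
12 |
13 | ### Getting Started with TensorFlow
14 |
15 | [How to get started with TensorFlow (Zhihu question)](https://www.zhihu.com/question/49909565)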
16 |
17 | [TensorFlow topic](https://github.com/topics/tensorflow)
18 |
19 | ### **Online Tutorials**
20 |
21 | - **(Recommended)** [Official website](https://www.tensorflow.org/)
22 |
23 | - [Official GitHub repository](https://github.com/tensorflow/tensorflow)
24 |
25 | - [English-language blog on Medium](https://medium.com/tensorflow)
26 |
27 | - [Official website (China domain)](https://tensorflow.google.cn/): reachable from mainland China without a VPN
28 |
29 | - [TensorFlow Chinese community](http://www.tensorfly.cn/)
30 |
31 | - [TensorFlow Basic Tutorial Lab](https://github.com/hunkim/DeepLearningZeroToAll): TensorFlow tutorial and GitHub code from the Hong Kong University of Science and Technology
32 |
33 | - [TensorFlow documentation in Chinese (Juejin Translation Project), the most up to date](https://github.com/xitu/tensorflow-docs)
34 |
35 | - [TensorFlow documentation in Chinese (translated by Jikexueyuan)](http://wiki.jikexueyuan.com/project/tensorflow-zh/)
36 |
37 | - [TensorFlow.js official website](https://js.tensorflow.org/)
38 |
39 | - [awesome-tensorflow](https://github.com/jtoy/awesome-tensorflow): TensorFlow - A curated list of dedicated resources
40 |
41 | - [TensorFlow-Examples](https://github.com/aymericdamien/TensorFlow-Examples): TensorFlow Tutorial and Examples for Beginners with Latest APIs
42 |
43 | - [TensorFlow-Tutorials](https://github.com/nlintz/TensorFlow-Tutorials): Simple tutorials using Google's TensorFlow Framework
44 |
45 | - [EffectiveTensorflow](https://github.com/vahidk/EffectiveTensorflow)
46 |
47 | - [DeepLearningZeroToAll](https://github.com/hunkim/DeepLearningZeroToAll)
48 |
49 | - [deep-learning-keras-tensorflow](https://github.com/leriomaggio/deep-learning-keras-tensorflow): Introduction to Deep Neural Networks with Keras and Tensorflow
50 |
51 |
52 |
53 | ### **Video Tutorials**
54 |
55 | **International**
56 |
57 | - **(Recommended) (currently studying)** [TensorFlow-Tutorials](https://github.com/Hvass-Labs/TensorFlow-Tutorials): TensorFlow Tutorials with [YouTube Videos](https://www.youtube.com/playlist?list=PL9Hr9sNUjfsmEu1ZniY0XpHSzl5uihcXZ)
58 |
59 | - Stanford CS 20SI: Tensorflow for Deep Learning Research (Stanford's TensorFlow course)
60 |
61 | - [Website](https://web.stanford.edu/class/cs20si/index.html)
62 |
63 |
64 | - [Video](https://www.bilibili.com/video/av9156347/?from=search&seid=6905181275544516403)
65 | - [GitHub](https://github.com/chiphuyen/stanford-tensorflow-tutorials)
66 |
67 | - **(Recommended) (currently studying)** [Google Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/): although the course covers machine learning in general, all of its code uses TensorFlow, so it is good hands-on practice
68 |
69 | - [Udacity](https://cn.udacity.com/course/deep-learning--ud730) (Udacity's TensorFlow course)
70 |
71 | **China**
72 |
73 | - [Morvan's TensorFlow tutorials](https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/) [TensorFlow-Tutorial](http://Tensorflow-Tutorial)
74 |
75 | - [Deep Learning Frameworks: TensorFlow Fundamentals](http://study.163.com/course/introduction/1004113066.htm?share=1&shareId=1020102948)
76 |
77 | - [Practical TensorFlow Course](http://study.163.com/course/courseMain.htm?courseId=1005167033&share=1&shareId=1020102948) [tensorflow-practice](https://github.com/yule-li/tensorflow-practice)
78 |
79 |
80 |
81 | ### **Books**
82 |
83 | - **(Recommended)** [Hands-On Machine Learning with Scikit-Learn and TensorFlow](https://github.com/ageron/handson-ml) [PDF](http://download.csdn.net/download/xinconan1992/9877225) [Chinese translation](https://github.com/apachecn/hands_on_Ml_with_Sklearn_and_TF)
84 |
85 | - **(Recommended) (currently studying)** [TensorFlow 1.x Deep Learning Cookbook](https://github.com/PacktPublishing/TensorFlow-1x-Deep-Learning-Cookbook)
86 |
87 | - [TensorFlow Machine Learning Cookbook](https://github.com/nfmcclure/tensorflow_cookbook)
88 |
89 | - **(Recommended)** [TensorFlow-Book](https://github.com/BinRoot/TensorFlow-Book): Accompanying source code for *Machine Learning with TensorFlow*
90 |
91 | - *TensorFlow: Google's Deep Learning Framework in Practice* (TensorFlow:实战Google深度学习框架)
92 |
93 | - [21 Projects for Mastering Deep Learning: A Detailed Practical Guide Based on TensorFlow](https://github.com/hzy46/Deep-Learning-21-Examples)
94 |
95 |
96 |
97 | ### Hands-On Projects
98 |
99 | - **(Recommended)** [TensorFlow/models](https://github.com/tensorflow/models): Models and examples built with TensorFlow
100 | - [TensorLayer](http://tensorlayer.readthedocs.io/en/latest/)
101 | - [DCGAN-tensorflow](https://github.com/carpedm20/DCGAN-tensorflow)
102 | - [FastMaskRCNN](https://github.com/CharlesShang/FastMaskRCNN)
103 | - [SSD-TensorFlow](https://github.com/balancap/SSD-Tensorflow)
--------------------------------------------------------------------------------
/TensorFlow从入门到精通/README.md:
--------------------------------------------------------------------------------
1 | # TensorFlow: From Beginner to Mastery
2 |
3 | Official site: https://youtu.be/er8RQZoX3yk
4 |
5 | - [Video course](https://youtu.be/er8RQZoX3yk)
6 | - [GitHub](https://github.com/Hvass-Labs/TensorFlow-Tutorials)
7 |
8 | **Introduction to the TensorFlow-Tutorials series (TensorFlow: From Beginner to Mastery)**
9 |
10 | - Introduction to and overview of the TensorFlow tutorials
11 |
12 |
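13 | # Table of Contents
14 |
15 | ## Preface
16 |
17 | - Introduction to the TensorFlow tutorials
18 | - Running the TensorFlow tutorials in the cloud
19 | - Speed test of the GPU and CPU versions of TensorFlow
20 |
21 | ## Tutorial Contents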
22 |
23 | - [01 Simple Linear Model](01_简单线性模型.ipynb)
24 |
25 | - 02 Convolutional Neural Network
26 |
27 | - 03 Pretty Tensor
28 |
29 | - 03-B Layers API
30 |
31 | - 03-C Keras API
32 |
33 | - 04 Save & Restore
34 |
35 | - 05 Ensemble Learning
36 |
37 | - 06 CIFAR-10
38 |
39 | - 07 Inception Model
40 |
41 | - 07 Inception Model (Extra)
42 |
43 | - 08 Transfer Learning
44 |
45 | - 09 Video Data
46 |
47 | - 10 Fine-Tuning
48 |
49 | - 11 Adversarial Examples
50 |
51 | - 12 Adversarial Noise for MNIST
52 |
53 | - 13 Visual Analysis
54 |
55 | - 13-B Visual Analysis for MNIST
56 |
57 | - 14 DeepDream
58 |
59 | - 15 Style Transfer
60 |
61 | - 16 Reinforcement Learning
62 |
63 | - 17 Estimator API
64 |
65 | - 18 TFRecords & Dataset API
66 |
67 | - 19 Hyper-Parameter Optimization
68 |
69 | - 20 Natural Language Processing
70 |
71 | - 21 Machine Translation
72 |
73 | - 22 Image Captioning
74 |
75 | - 23 Time-Series Prediction
76 |
--------------------------------------------------------------------------------