├── .DS_Store ├── README.md ├── SUMMARY.md ├── assets ├── .DS_Store └── images │ ├── C03_001.png │ ├── C03_002.png │ ├── C03_003.png │ ├── C03_004.png │ ├── C03_005.png │ ├── C03_006.png │ ├── C03_007.png │ ├── C03_008.png │ ├── C03_009.png │ ├── C04_001.png │ ├── C04_002.png │ ├── C04_003.png │ ├── C05_001.png │ ├── C05_002.png │ ├── C05_003.png │ ├── C05_004.png │ ├── C05_005.png │ ├── C06_001.png │ ├── C06_002.png │ ├── C06_003.png │ ├── C06_004.png │ ├── C06_005.png │ ├── C06_006.png │ ├── C06_007.png │ ├── C06_008.png │ ├── C07_009.png │ ├── F1-1.png │ ├── F1-2.png │ ├── F1-3.png │ ├── book_cn.png │ ├── buy.png │ ├── tips.png │ ├── tips_001.png │ ├── tips_002.png │ ├── tips_003.png │ └── weixin.jpg └── chapters ├── .DS_Store ├── Chapter_01_Introduction.md ├── Chapter_02_A_Crash_Course_in_Python.md ├── Chapter_03_Visualizing_Data.md ├── Chapter_04_Linear_Algebra.md ├── Chapter_05_Statistics.md ├── Chapter_06_Probability.md ├── Chapter_07_Hypothesis_and_Inference.md ├── Chapter_08_Gradient_Descent.md ├── Chapter_09_Getting_Data.md ├── Chapter_10_Working_with_Data.md ├── Chapter_11_Machine_Learning.md ├── Chapter_12_k_Nearest_Neighbors.md ├── Chapter_13_Naive_Bayes.md ├── Chapter_14_Simple_Linear_Regression.md ├── Chapter_15_Multiple_Regression.md ├── Chapter_16_Logistic_Regression.md ├── Chapter_17_Decision_Trees.md ├── Chapter_18_Neural_Networks.md ├── Chapter_19_Clustering.md ├── Chapter_20_Natural_Language_Processing.md ├── Chapter_21_Network_Analysis.md ├── Chapter_22_Recommender_Systems.md ├── Chapter_23_Database_and_SQL.md ├── Chapter_24_MapReduce.md └── Chapter_25_Go_Forth_and_Do_Data_Science.md /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/.DS_Store -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Data Science from Scratch First Principles with Python 试译 2 | 3 | # 停止声明 4 | 5 | 大家可以去图灵社区购买电子版了。已经出了电子版,比较好的版本。[图灵社区:数据科学入门](http://www.ituring.com.cn/book/1687) 6 | 欢迎大家入手。 7 | 8 | ![book](assets/images/book_cn.png) 9 | 10 | 这个项目即刻停止更新。 11 | 12 | 祝好! 
13 | 14 | ## 声明 15 | 该翻译仅供学习交流,不做任何商业用途,尊重作者的辛勤劳动,请到这里[购买]正版图书(http://www.amazon.com/Data-Science-Scratch-Principles-Python/dp/149190142X) 16 | 17 | ![buy](assets/images/buy.png) 18 | * 作者**Joel Grus**的[Github链接](https://github.com/joelgrus) 19 | * [书籍示例代码](https://github.com/joelgrus/data-science-from-scratch) 20 | 21 | ##前言 22 | 23 | ###2015.08.15 前言 24 | 25 | From : [hexcola](https://github.com/hexcola) 26 | 27 | 我在自学的过程中,遇到不少有趣的资源,比如在[Jason的网站](http://machinelearningmastery.com/)找到的对出于不同目标,不同背景学习机器学习的小伙伴划分派系(查看:[Find Your Machine Learning Tribe: Get Started And Avoid Getting The Wrong Advice](http://machinelearningmastery.com/machine-learning-tribe/))从而更具有针对性的开始和深入。 28 | 29 | 作为一个(数学底子不扎实的)程序猿 :(,深刻体会到先去学习线性代数、概率论、统计学然后再开始机器学习是有多么枯燥和沮丧,Jason的观点 [Machine Learning for Programmers: Leap from developer to machine learning practitioner](http://machinelearningmastery.com/machine-learning-for-programmers/) 绝对是让人眼前一亮,在挖他得博文时知道了**[Data Science From Scratch: First Principles with Python](http://joelgrus.com/2015/04/26/data-science-from-scratch-first-principles-with-python/)** 这本书,在今年近5月份出版,现在还没有中文版。 30 | 31 | 啃书做笔记的时候想,也许还有很多小伙伴和我面临一样的困境呢,要不试试把它翻译一下,权当巩固,说不定能帮到一些小伙伴,在我还在思考时,iphyer小朋友很久之前已经在GitBook上展开他的工作了,并已经翻译了两章了([点这里阅读](http://iphyer.gitbooks.io/data-science-from-scratch-with-python/content/index.html)),联系他后,很愉快的就决定一起实现这个计划! 32 | 33 | 我们尽力做到最好,但才疏学浅,如果有误,还望不吝赐教! 34 | 35 | 如果你想一起翻译,或者交流数据科学的学习心得,联系我们吧! 36 | 37 | ###2016.03.09 更新 38 | 39 | From : [iphyer](https://github.com/iphyer) 40 | 41 | 因为各种原因这个项目进度比较缓慢。[iphyer](https://github.com/iphyer) 和 [hexcola](https://github.com/hexcola) 都比较忙。[hexcola](https://github.com/hexcola) 也参加了创业公司,时间紧张。 42 | 43 | 正好 2016 年初 [iphyer](https://github.com/iphyer) 参加了 [bingjin](https://github.com/bingjin) 的 [ThinkPython](https://github.com/bingjin/ThinkPython2-CN) 翻译项目,所以我们重新整理了下整个项目, 也请 [bingjin](https://github.com/bingjin) 帮我们推广下,希望可以引入更多的人加入到这个项目中来。 44 | 45 | ##翻译指南 46 | ###任务分类 47 | 48 | 项目的任务主要是两类: 49 | 50 | 1. 翻译 51 | 2. 校对 52 | 53 | 欢迎大家加入进来,最近感觉到了data science的引爆点,希望这本书可以给大家更多的帮助。 54 | 55 | ###参与步骤 56 | 57 | 1. fork本项目 58 | 2. 修改进度表,添加自己的github账户。大家在commit里面标注下自己申请的章节号。 59 | 3. 提交到自己的github仓库 60 | 4. 向[iphyer](https://github.com/iphyer)pull requests 61 | 5. 翻译或者校对 62 | 6. 提交到自己的github仓库 63 | 7. 向我pull requests 64 | 65 | 主要就是这个流程。希望大家可以借助 Github 协同工作起来。 66 | 67 | 也可以微信联系我,微信号: iphyer。 68 | 69 | ![weixin](assets/images/weixin.jpg) 70 | 71 | ###注意事项 72 | 73 | 1. 统一使用 Markdown 语言编辑。 74 | 2. 图片存放在assets/images/文件夹下,截图存为png格式即可,命名格式为:C章节号-序号.png,如第3章第2张图片的序号为:C03_002.png。注意章节号为两位数字,序号为三位数字。如果出现公式也请按照图片处理,这样方便在网页上显示出来,而不需要费力协调格式问题。 75 | 3. 翻译不需要包括英文,包括也可以,这样可以帮助校对的同学校对。但是为了减轻翻译负担不强制包括。 76 | 4. 这本书的翻译并不轻松,如果感觉翻译不下去,也可以提交已经翻译好的部分结果,在commit中说明即可。但是在最后完成名单中只保留首页的完成者。 77 | 5. 
图书中有不少独立出来的提示部分,类似这张图 78 | 79 | ![caution](assets/images/tips.png) 80 | 81 | 请统一使用如下方式插入: 82 | 83 | ``` 84 | ![](../assets/images/tips_002.png) 85 | 写下注意中的内容 86 | ``` 87 | tips_001.png : 88 | 89 | ![caution1](assets/images/tips_001.png) 90 | 91 | tips_002.png : 92 | 93 | ![caution2](assets/images/tips_002.png) 94 | 95 | tips_003.png : 96 | 97 | ![caution3](assets/images/tips_003.png) 98 | 99 | 100 | 101 | ##翻译进度表 102 | 103 | |章节 |译者 |翻译进度 |校对者 |校对进度 | 104 | |------|:-------:|:-------------:|:-----:|:-----:| 105 | | [第1章:简介](chapters/Chapter_01_Introduction.md) | [iphyer](https://github.com/iphyer) | 完成 | | 待校对 | 106 | | [第2章:Python快速入门教程](chapters/Chapter_02_A_Crash_Course_in_Python.md) | [iphyer](https://github.com/iphyer) | 完成 | | 待校对 | 107 | | [第3章:数据可视化](chapters/Chapter_03_Visualizing_Data.md) | [hexcola](https://github.com/hexcola) | 完成 | | 待校对 | 108 | | [第4章:线性代数](chapters/Chapter_04_Linear_Algebra.md) | [hexcola](https://github.com/hexcola) | 完成 | | 待校对 | 109 | | [第5章:统计](chapters/Chapter_05_Statistics.md) | [hexcola](https://github.com/hexcola) | 完成 | | 待校对 | 110 | | [第6章:概率](chapters/Chapter_06_Probability.md) | [iphyer](https://github.com/iphyer) | 正在进行 | | 待校对 | 111 | | [第7章:假设和推理](chapters/Chapter_07_Hypothesis_and_Inference.md) | | 待认领 | | 待校对 | 112 | | [第8章:梯度下降](chapters/Chapter_08_Gradient_Descent.md) | | 待认领 | | 待校对 | 113 | | [第9章:获取数据](chapters/Chapter_09_Getting_Data.md) | | 待认领 | | 待校对 | 114 | | [第10章:处理数据](chapters/Chapter_10_Working_with_Data.md) | | 待认领 | | 待校对 | 115 | | [第11章:机器学习](chapters/Chapter_11_Machine_Learning.md) | | 待认领 | | 待校对 | 116 | | [第12章:k近邻算法](chapters/Chapter_12_k_Nearest_Neighbors.md) | | 待认领 | | 待校对 | 117 | | [第13章:朴素贝叶斯](chapters/Chapter_13_Naive_Bayes.md) | | 待认领 | | 待校对 | 118 | | [第14章:简单线性回归](chapters/Chapter_14_Simple_Linear_Regression.md) | | 待认领 | | 待校对 | 119 | | [第15章:多元回归](chapters/Chapter_15_Multiple_Regression.md) | | 待认领 | | 待校对 | 120 | | [第16章:逻辑回归](chapters/Chapter_16_Logistic_Regression.md) | | 待认领 |   | 待校对 | 121 | | [第17章:决策树](chapters/Chapter_17_Decision_Trees.md) | | 待认领 | | 待校对 | 122 | | [第18章:神经网络](chapters/Chapter_18_Neural_Networks.md) | | 待认领 | | 待校对 | 123 | | [第19章:集群](chapters/Chapter_19_Clustering.md) | | 待认领 | | 待校对 | 124 | | [第20章:自然语言处理](chapters/Chapter_20_Natural_Language_Processing.md)| | 待认领 | | 待校对 | 125 | | [第21章:网络分析](chapters/Chapter_21_Network_Analysis.md) | | 待认领 | | 待校对 | 126 | | [第22章:推荐系统](chapters/Chapter_22_Recommender_Systems) | | 待认领 | | 待校对 | 127 | | [第23章:数据库与SQL](chapters/Chapter_23_Database_and_SQL.md) | | 待认领 | | 待校对 | 128 | | [第24章:MapReduce](chapters/Chapter_24_MapReduce.md) | | 待认领 | | 待校对 | 129 | | [第25章:前进吧!继续你的数据科学之路](chapters/Chapter_25_Go_Forth_and_Do_Data_Science.md) | | 待认领 | | 待校对 | 130 | 全书目前的翻译进度: 131 | 132 | 133 | |状态 |已完成 |正在进行 |等待认领 |校对完成 |校对完成 | 134 | | ------|------- |:-------------:| -----:|-----:|-----:| 135 | |数量 |5 章 |1 章 |19 章 |0 章 |25 章 | 136 | 137 | 欢迎大家参与进来! 
138 | 139 | 140 | -------------------------------------------------------------------------------- /SUMMARY.md: -------------------------------------------------------------------------------- 1 | # Summary 2 | 3 | * [README](README.md) 4 | * [第1章:简介](chapters/Chapter_01_Introduction.md) 5 | * [第2章:Python快速入门教程](chapters/Chapter_02_A_Crash_Course_in_Python.md) 6 | * [第3章:数据可视化](chapters/Chapter_03_Visualizing_Data.md) 7 | * [第4章:线性代数](chapters/Chapter_04_Linear_Algebra.md) 8 | * [第5章:统计](chapters/Chapter_05_Statistics.md) 9 | * [第6章:概率](chapters/Chapter_06_Probability.md) 10 | * [第7章:假设和推理](chapters/Chapter_07_Hypothesis_and_Inference.md) 11 | * [第8章:梯度下降](chapters/Chapter_08_Gradient_Descent.md) 12 | * [第9章:获取数据](chapters/Chapter_09_Getting_Data.md) 13 | * [第10章:处理数据](chapters/Chapter_10_Working_with_Data.md) 14 | * [第11章:机器学习](chapters/Chapter_11_Machine_Learning.md) 15 | * [第12章:k近邻算法](chapters/Chapter_12_k_Nearest_Neighbors.md) 16 | * [第13章:朴素贝叶斯](chapters/Chapter_13_Naive_Bayes.md) 17 | * [第14章:简单线性回归](chapters/Chapter_14_Simple_Linear_Regression.md) 18 | * [第15章:多元回归](chapters/Chapter_15_Multiple_Regression.md) 19 | * [第16章:逻辑回归](chapters/Chapter_16_Logistic_Regression.md) 20 | * [第17章:决策树](chapters/Chapter_17_Decision_Trees.md) 21 | * [第18章:神经网络](chapters/Chapter_18_Neural_Networks.md) 22 | * [第19章:集群](chapters/Chapter_19_Clustering.md) 23 | * [第20章:自然语言处理](chapters/Chapter_20_Natural_Language_Processing.md) 24 | * [第21章:网络分析](chapters/Chapter_21_Network_Analysis.md) 25 | * [第22章:推荐系统](chapters/Chapter_22_Recommender_Systems) 26 | * [第23章:数据库与SQL](chapters/Chapter_23_Database_and_SQL.md) 27 | * [第24章:MapReduce](chapters/Chapter_24_MapReduce.md) 28 | * [第25章:前进吧!继续你的数据科学之路](chapters/Chapter_25_Go_Forth_and_Do_Data_Science.md) 29 | 30 | -------------------------------------------------------------------------------- /assets/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/.DS_Store -------------------------------------------------------------------------------- /assets/images/C03_001.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C03_001.png -------------------------------------------------------------------------------- /assets/images/C03_002.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C03_002.png -------------------------------------------------------------------------------- /assets/images/C03_003.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C03_003.png -------------------------------------------------------------------------------- /assets/images/C03_004.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C03_004.png -------------------------------------------------------------------------------- /assets/images/C03_005.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C03_005.png -------------------------------------------------------------------------------- /assets/images/C03_006.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C03_006.png -------------------------------------------------------------------------------- /assets/images/C03_007.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C03_007.png -------------------------------------------------------------------------------- /assets/images/C03_008.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C03_008.png -------------------------------------------------------------------------------- /assets/images/C03_009.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C03_009.png -------------------------------------------------------------------------------- /assets/images/C04_001.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C04_001.png -------------------------------------------------------------------------------- /assets/images/C04_002.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C04_002.png -------------------------------------------------------------------------------- /assets/images/C04_003.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C04_003.png -------------------------------------------------------------------------------- /assets/images/C05_001.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C05_001.png -------------------------------------------------------------------------------- /assets/images/C05_002.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C05_002.png -------------------------------------------------------------------------------- /assets/images/C05_003.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C05_003.png 
-------------------------------------------------------------------------------- /assets/images/C05_004.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C05_004.png -------------------------------------------------------------------------------- /assets/images/C05_005.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C05_005.png -------------------------------------------------------------------------------- /assets/images/C06_001.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C06_001.png -------------------------------------------------------------------------------- /assets/images/C06_002.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C06_002.png -------------------------------------------------------------------------------- /assets/images/C06_003.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C06_003.png -------------------------------------------------------------------------------- /assets/images/C06_004.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C06_004.png -------------------------------------------------------------------------------- /assets/images/C06_005.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C06_005.png -------------------------------------------------------------------------------- /assets/images/C06_006.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C06_006.png -------------------------------------------------------------------------------- /assets/images/C06_007.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C06_007.png -------------------------------------------------------------------------------- /assets/images/C06_008.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C06_008.png -------------------------------------------------------------------------------- /assets/images/C07_009.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/C07_009.png -------------------------------------------------------------------------------- /assets/images/F1-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/F1-1.png -------------------------------------------------------------------------------- /assets/images/F1-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/F1-2.png -------------------------------------------------------------------------------- /assets/images/F1-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/F1-3.png -------------------------------------------------------------------------------- /assets/images/book_cn.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/book_cn.png -------------------------------------------------------------------------------- /assets/images/buy.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/buy.png -------------------------------------------------------------------------------- /assets/images/tips.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/tips.png -------------------------------------------------------------------------------- /assets/images/tips_001.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/tips_001.png -------------------------------------------------------------------------------- /assets/images/tips_002.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/tips_002.png -------------------------------------------------------------------------------- /assets/images/tips_003.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/tips_003.png -------------------------------------------------------------------------------- /assets/images/weixin.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/assets/images/weixin.jpg -------------------------------------------------------------------------------- /chapters/.DS_Store: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/iphyer/data-science-from-scratch_zh/037eacc001715863c89d9ba2df0fa31b81cb63a3/chapters/.DS_Store -------------------------------------------------------------------------------- /chapters/Chapter_01_Introduction.md: -------------------------------------------------------------------------------- 1 | # Chapter 1 引子 2 | > “数据!数据!数据!”他焦急地高叫着,“(如果没有数据),巧妇难为无米之炊啊!” 3 | 4 | > --Arthur Conan Doyle 5 | 6 | ## 数据力量 7 | 我们正生活在一个被数据淹没的世界。各大网站都会记录每一个访问者的每一次点击数据,智能手机会记录使用者每一天每一秒的位置和速度数据。量化生活者会使用计步器记录自己的心跳,运动习惯,饮食和睡眠数据。智能汽车会搜集行车习惯数据,智能住宅会搜集生活习惯数据,智能店铺会搜集购物习惯数据。就连互联网本身所代表的无所不知的巨大知识库也是由无数相互链接的数据组成的一本百科全书;专业领域知识库可以提供关于电影、音乐、体育赛事结果、弹珠台机器、文化基因和鸡尾酒的各种数据;以及无数由政府提供的统计数据(有一部分还是接近事实的),这些数据都充斥在你的生活中。 8 | 9 | 埋藏在这些数据下面的是对于无数问题的惊奇的答案。在这本书中,我们将会教会你如何寻找这些答案。 10 | 11 | ## 数据科学是什么? 12 | 有一个关于数据科学家的笑话说,数据科学家就是那些比计算科学家知道更多统计知识而且比统计学家知道更多计算机知识的人(我个人并不认为这是一个好笑话)。事实上,从实用角度说,一些数据科学家确实是统计学家,但是另外的数据科学家则更像计算机工程师。一些数据科学家是机器学习专家,但是也存在不少数据科学家从幼儿园开始就没有接触过机器学习。一些数据科学家是发表过出色论文的博士,但是也有不少数据科学家是从来没有读过学术论文的人(对此我深表遗憾,他们应当感到羞愧)。一言以蔽之,无论人们如何定义数据科学,你一定可以找到让这个定义完全错误的数据科学家。 13 | 14 | 当然这并意味着我们不能尝试对于数据科学家下一个定义。我们认为数据科学家就是那些从混乱的原始数据中提取有用经验和启发性观点的人。事实上,现实生活中也的确有不少人正在努力从数据中得到有用经验和启发性观点。 15 | 16 | 比如,相亲网站 OkCupid 要求用户回答很多问题,这些问题可以帮助 OkCupid 寻找该用户最匹配的约会对象。但是有时这些问题也可以帮助人们预测一些无伤大雅的精彩问题,比如你可以要求 OkCupid 预测你是不是能够和你的约会对象在第一次约会的时候发生一次超友谊的亲密接触。 17 | 18 | Facebook 也同样要求用户列出他们的家乡和现居地,虽然表面上看这确实可以帮助真实生活中的朋友更加容易地联系到你,但是 Facebook 也利用这些数据来分析很多其他问题,比如:全球人口的移民模式以及某支球队的球迷大本营。 19 | 20 | 作为一家大型零售店,Target 会记录你在线上和线下店铺的购物交互数据。Target 会使用这些数据来预测客户是不是怀孕了来向怀孕的用户推销他们的婴幼儿用品。 21 | 22 | 2012年,奥巴马的竞选团队雇佣了很多数据科学家。这些数据科科学家使用数据挖掘技术识别出必须格外重视的选民,帮助竞选团队选择出最佳的竞选募捐策划,同时展开更有针对性的竞选募捐活动,并且用最有效的方式吸引选民参与投票。这些数据科学家们被普遍认为在奥巴马的第二次胜选中起到了重要作用。奥巴马的这次选举标志着未来的选举一定会变得越来越数据驱动,他们将使得竞选活动更加依赖数据进行决策,这也将会把竞选活动变成一场没有尽头的数据科学和数据搜集能力的军备竞赛。 23 | 24 | 在你开始感到厌烦之前,我不得不说:有些数据科学家有时也会运用他们的才智和掌握的数据来提高政府效率,比如救助无家可归者或者提高公众健康水平。当然如果你志不在此,而是希望研究出最好的吸引用户点击广告的方法等,这些致力于提高公众利益的数据科学家对于你的事业也完全没有任何伤害。 25 | 26 | ## 假设场景 : DataSciencester 27 | 恭喜你!你已经被任命为一家专为数据科学家服务的社交网站 DataSciencester 的首席数据科学家,你将全面负责 DataSciencester 网站的数据业务。 28 | 29 | 虽然这是专为数据科学家服务的网站,但是 DataSciencester 从来没有建立过自己的数据科学实践(更准确地说,DataSciencester 也从来没有建立过自己的产品)。现在该你出场了!在这本书中,你将通过解决实际工作中遇到的问题来学习数据科学的基本概念。有时我们需要研究用户直接提供的数据,有时我们需要研究用户和网站间接交互的数据,有时我们甚至需要研究通过我们自己设计的实验获得的数据。 30 | 31 | 同时因为 DataSciencester 有强烈的极客原创精神,我们需要从头构建自己的工具。通过这本书,你会对于数据科学有一个完整的详实理解。学完本书,在将来的工作中,我们希望你能够运用学到的知识和技能为陷入困境的公司提供帮助或者去研究任何吸引你的问题。 32 | 33 | 欢迎加入,祝你一帆风顺!(闲谈一句,通常,作为数据科学家你可以在周五穿着牛仔裤上班而且楼下的浴室随时待命。) 34 | 35 | ###寻找关键联系人 36 | 好了,现在你开始了在 DataSciencester 工作的第一天,公司负责客户网络的高管对于 DataSciencester 网站的用户有不少疑问,以前他找不到人帮忙,现在他非常高兴你加入了公司并且希望你可以帮助到他。 37 | 38 | 首先,他特别希望你帮助他在数据科学家用户中找出那些“关键联系人”。所以他给了你他已经得到的关于整个DataSciencester 用户关系的数据(但是,你需要注意的是,在真实的工作中,你常常得不到到你希望的数据而不得不自己动手获取数据,我们将在第 9 章详细讨论如何获取需要的数据)。 39 | 40 | 这些数据到底是什么呢?它是一个包括所有用户的 Python 列表,列表的每一个元素都是一个字典,字典包含了用户的 id 数字和用户的姓名(巧合的是,这些 id 的用户名都有和 id 数字谐音的部分) 41 | 42 | ```python 43 | users = [ 44 | { "id": 0, "name": "Hero" }, 45 | { "id": 1, "name": "Dunn" }, 46 | { "id": 2, "name": "Sue" }, 47 | { "id": 3, "name": "Chi" }, 48 | { "id": 4, "name": "Thor" }, 49 | { "id": 5, "name": "Clive" }, 50 | { "id": 6, "name": "Hicks" }, 51 | { "id": 7, "name": "Devin" }, 52 | { "id": 8, "name": "Kate" }, 53 | { "id": 9, "name": "Klein" } 54 | ] 55 | ``` 56 | 57 | 同时你得到了一组表示“友谊关系”的数据,就是如下的这个包含 id 号码的`friendships`列表: 58 | 59 | 比如,元组 (0, 1) 代表 id 为 0 的数据科学家( Hero ) 和 id 为 1 的数据科学家 ( Dunn ) 是朋友。这个关系也可以用图 1-1 来表示: 60 | ![](../assets/images/F1-1.png) 61 | 
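原书此处给出了 `friendships` 列表的具体内容,中文稿里暂缺。为了让后文的代码可以直接运行,这里先依据图 1-1 和英文原版补充如下(译者补充,与后文统计结果一致,仅供参考):

```python
friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4),
               (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)]
```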
>![](../assets/images/tips_002.png) 62 | > 暂时不要在代码的细节上纠缠太久。在第 2 章中,我们会带着你快速的学习 Python 。现在你只需要大致理解这些代码是为了实现哪些目标即可。 63 | 64 | 因为我们使用 Python 的字典结构来表示用户,所以我们可以非常方便地添加更多的数据。 65 | 66 | 比如,我们可以尝试给每一个用户添加一个朋友列表。首先我们对每一个用户创建一个代表朋友属性的空列表。 67 | 68 | ```python 69 | for user in users: 70 | user["friends"] = [] 71 | ``` 72 | 然后我们可以通过`friendships`数据来填充`friends`属性列表。 73 | ```python 74 | for i, j in friendships: 75 | #这段代码可以工作是因为 users[i] 就是 id 为 i 的用户 76 | users[i]["friends"].append(users[j]) # 给用户 i 添加朋友 j 77 | users[j]["friends"].append(users[i]) # 给用户 j 添加朋友 i 78 | ``` 79 | 80 | 一旦每一个用户字典都包括了一个朋友列表,我们就可以很容易地进一步探索朋友关系图的性质,比如 “平均一个用户有多少个朋友?”。 81 | 82 | 要回答这个问题,首先我们必须找出所有的朋友关系数,这只需要统计朋友列表的长度就可以了。 83 | 84 | ```python 85 | def number_of_friends(user): 86 | """每一个用户有多少朋友""" 87 | return len(user["friends"]) # 朋友列表长度 88 | 89 | total_connections = sum(number_of_friends(user) 90 | for user in users) # 24 91 | ``` 92 | 这样我们只需要简单地除以用户数即可得到平均一个用户有多少朋友了: 93 | ```python 94 | from __future__ import division #引入整数除法特性 95 | #注意该语句必须是模块或程序的第一个语句。 96 | num_users = len(users) #列表长度为10 97 | avg_connections = total_connections / num_users #每一个用户平均拥有的朋友数 2.4 98 | ``` 99 | 同样的思路我们也可以很容易地找出朋友关系最多的人——他们就是有最多朋友数目的人。 100 | 101 | 因为数据量不是特别大,所以我们可以很容易地对所有的用户按照从“朋友最多的人”到“朋友最少的人”的顺序进行排序: 102 | ```python 103 | #创建一个朋友数目列表num_friends_by_id 104 | num_friends_by_id = [(user["id"],number_of_friends(user)) 105 | for user in users] 106 | 107 | sorted(num_friends_by_id, #排序列表 108 | key=lambda (users_id , num_friends): num_friends, #依照num_friends排序 109 | reverse=True) #倒序输出,从最大到最小 110 | 111 | #num_friends_by_id输出结果如下,每一对都是(user_id,num_friends)组合: 112 | #[(0, 2), (1, 3), (2, 3), (3, 3), (4, 2), 113 | # (5, 3),(6, 2), (7, 2), (8, 3), (9, 1)] 114 | ``` 115 | 如果换一个思路,从社交网络的角度来理解我们刚刚完成的工作,就是找出这个用户关系网络中占据最中心位置的人。事实上,我们刚刚计算了这个网络的重要属性之一:程度中心性(见图 1-2). 
116 | ![](F1-2.png) 117 | 118 | 程度中心性非常方便计算, 但是不能给出更多准确的细节信息。比如,在 DateSciencester 的用户朋友网络中我们知道,用户 Thor (id 为 4 )只有 2 个朋友关系,而用户 Dunn ( id 为 1 )有 3 个朋友关系。但是回头看一看上面展示的网络图,似乎 Thor 更加具有程度中心性。 在第 21 章,我们将会更加仔细地讨论网络的性质和研究更加复杂的中心性定义,这些更加复杂的中心性可能会更加合适。 119 | 120 | ### 你可能认识的数据科学家 121 | 122 | 当你正在努力填写新员工登记表的时候,负责人事的高管来到你的办公桌前。她希望能够激发数据科学家之间更多的交流和联系,因此她希望你能够策划一个“你可能认识的数据科学家”的提示功能。 123 | 124 | 你的直觉告诉你一个用户很有可能认识自己朋友的朋友。这个想法非常容易验证:对于每一个用户的朋友们,验证这个朋友的朋友是不是被这个用户认识,最后合并结果即可检测这个想法是不是可靠: 125 | 126 | ```python 127 | def friends_of_friend_ids_bad(user): 128 | # "foaf" 是 "friend of a friend"的简称 129 | return [foaf["id"] 130 | for friend in user["friends"] # 对于每一个用户的朋友们 131 | for foaf in friend["friends"]] # 检验这个朋友的朋友是不是这个用户的朋友 132 | ``` 133 | 134 | 当我们把上面的函数作用在第一个用户`users[0]`上的时候,`friends_of_friend_ids_bad(users[0])`给出如下的结果 135 | ```python 136 | [0, 2, 3, 0, 1, 3] 137 | ``` 138 | 结果中包括了用户 0 两次,因为用户 0 ( Hero )确实同时是他的两个朋友的朋友。结果中也包括用户 1 和用户 2 ,虽然他们已经是用户 1 的朋友。同时他也包括了用户 3 两次,因为用户 3 ( Chi ) 可以通过用户 0 的两个朋友和用户 0 联系起来,具体的验证代码如下: 139 | 140 | ```python 141 | print [friend["id"] for friend in users[0]["friends"]] #[1, 2] 142 | print [friend["id"] for friend in users[1]["friends"]] #[0, 2, 3] 143 | print [friend["id"] for friend in users[2]["friends"]] #[0, 1, 3] 144 | ``` 145 | 知道人们可以借助自己朋友的朋友互相认识彼此是非常有趣的信息,所以或许我们应该统计下通过共同朋友可能成为朋友的数目。为了实现这个目的,我们需要借助辅助函数来排除已经彼此认识成为朋友的那批用户: 146 | ```python 147 | from collections import Counter # 并不默认加载collection函数 148 | 149 | def not_the_same(user, other_user): 150 | """排除相同用户""" 151 | return user["id"] != other_user["id"] 152 | 153 | def not_friends(user, other_user): 154 | """other_user用户并不是user用户的朋友;也就是 other_user并不和user用户的friends列表中个的用户相同""" 155 | return all(not_the_same(friend, other_user) 156 | for friend in user["friends"]) 157 | 158 | def friends_of_friend_ids(user): 159 | return Counter(foaf["id"] 160 | for friend in user["friends"] # 对于每一个user用户的朋友 161 | for foaf in friend["friends"] # 对于每一个user用户朋友的朋友 162 | if not_the_same(user, foaf) # 排除相同用户 163 | and not_friends(user, foaf)) # 排除已经是朋友的用户 164 | 165 | print friends_of_friend_ids(users[3]) # Counter({0: 2, 5: 1}) 166 | ``` 167 | 这个输出结果正确地说明用户 Chi (id 为 3 ) 和用户 Hero ( id 为 0 ) 之间有 2 个共同朋友,而和用户 Clive ( id 为 5) 只有 1 个共同用户。 168 | 169 | 作为一个数据科学家,你知道大家都喜欢遇到和自己有共同兴趣的人。(事实上,下面要做的这个小探索是对数据科学家需要掌握的专业技能的精彩展示。) 通过咨询朋友,你得到了如下的数据,这个列表的每一个元素都包括一个由用户 id 和兴趣 interest 组成的元组 。 170 | ```python 171 | interests = [ 172 | (0, "Hadoop"), (0, "Big Data"), (0, "HBase"), (0, "Java"), 173 | (0, "Spark"), (0, "Storm"), (0, "Cassandra"), 174 | (1, "NoSQL"), (1, "MongoDB"), (1, "Cassandra"), (1, "HBase"), 175 | (1, "Postgres"), (2, "Python"), (2, "scikit-learn"), (2, "scipy"), 176 | (2, "numpy"), (2, "statsmodels"), (2, "pandas"), (3, "R"), (3, "Python"), 177 | (3, "statistics"), (3, "regression"), (3, "probability"), 178 | (4, "machine learning"), (4, "regression"), (4, "decision trees"), 179 | (4, "libsvm"), (5, "Python"), (5, "R"), (5, "Java"), (5, "C++"), 180 | (5, "Haskell"), (5, "programming languages"), (6, "statistics"), 181 | (6, "probability"), (6, "mathematics"), (6, "theory"), 182 | (7, "machine learning"), (7, "scikit-learn"), (7, "Mahout"), 183 | (7, "neural networks"), (8, "neural networks"), (8, "deep learning"), 184 | (8, "Big Data"), (8, "artificial intelligence"), (9, "Hadoop"), 185 | (9, "Java"), (9, "MapReduce"), (9, "Big Data") 186 | ] 187 | ``` 188 | 比如,用户 Thor ( id 为 4 ) 和用户 Devin ( id 为 7 ) 没有任何相同的朋友,但是他们都对于机器学习有兴趣。 189 | 190 | 非常容易地我们就可以构建一个函数寻找有相同兴趣的用户: 191 | ```python 192 | def data_scientists_who_like(target_interest): 193 | return 
[user_id 194 | for user_id, user_interest in interests 195 | if user_interest == target_interest] 196 | ``` 197 | 虽然上面的方法可以正确得出我们期望的结果,但是每一次都必须遍历整个兴趣列表。如果我们有很多的用户和兴趣对或者我们希望做大量的查找,这样的程序效率就比较低来。因此,我们应该专门建立一个从兴趣到用户的检索: 198 | ```python 199 | from collections import defaultdict 200 | 201 | # 字典的键是兴趣,值是对该兴趣感兴趣用户名列表 202 | user_ids_by_interest = defaultdict(list) 203 | 204 | for user_id, interest in interests: 205 | user_ids_by_interest[interest].append(user_id) 206 | ``` 207 | 另一种形式是从用户到兴趣的检索: 208 | ```python 209 | # 键是用户名,值是该用户的兴趣列表 210 | interests_by_user_id = defaultdict(list) 211 | 212 | for user_id, interest in interests: 213 | interests_by_user_id[user_id].append(interest) 214 | ``` 215 | 现在我们可以很容易的找到对于一个特定的用户和他有最多相同兴趣的用户了,具体思路如下: 216 | 217 | * 对该用户的兴趣列表做一次循环 218 | * 对于该用户的每一个兴趣,在相应的兴趣对应的用户列表中再次循环 219 | * 记录下每一个用户在这样的循环中出现的次数 220 | 221 | 具体实现的代码为: 222 | ```python 223 | def most_common_interests_with(user_id): 224 | return Counter(interested_user_id 225 | for interest in interests_by_user_id[user_id] 226 | for interested_user_id in user_ids_by_interest[interest] 227 | if interested_user_id != user_id) 228 | ``` 229 | 230 | 或许将来,我们可以通过这个方法整合共同朋友和共同兴趣数据来构建一个更加丰富的“你应该知道的数据科学家”的功能。在第 22 章,我们将深入讨论这一点。 231 | 232 | ### 工资和经验的关系 233 | 234 | 现在你打算去吃午饭,但是负责公共关系的高管询问你是不是能够提供一些关于数据科学家收入的有趣事实。收入数据当然是非常敏感的数据,所以负责公共关系的高管给你提供的是匿名后的收入数据(单位:美元)和工作年限数据(单位:年)。 235 | ```python 236 | salaries_and_tenures = [(83000, 8.7), (88000, 8.1), 237 | (48000, 0.7), (76000, 6), 238 | (69000, 6.5), (76000, 7.5), 239 | (60000, 2.5), (83000, 10), 240 | (48000, 1.9), (63000, 4.2)] 241 | ``` 242 | 非常自然地,第一步就是先对这些数据先做一副关系图来探索可能存在的关系(我们将在第 3 章研究如何画图)。现在你可以在图 1-3 看到画图的结果: 243 | ![](../assets/images/F1-3.png) 244 | 245 | 趋势看起来非常明显,工作时间越长挣得越多。但是你怎么把这幅图转化成一个有趣的故事?你的第一个想法就是查看不同工作年限的平均工资: 246 | ```python 247 | # 键是工作年限,值是每一个工作年限的工资 248 | salary_by_tenure = defaultdict(list) 249 | 250 | for salary, tenure in salaries_and_tenures: 251 | salary_by_tenure[tenure].append(salary) 252 | 253 | # 键是工作年限,值是每一个工作年限的平均工资 254 | average_salary_by_tenure = { 255 | tenure : sum(salaries) / len(salaries) 256 | for tenure, salaries in salary_by_tenure.items() 257 | } 258 | ``` 259 | 当然,这看起来不是特别有用,因为没有一个用户有相同的工作年限,这也就意味着我们其实只是在报告每一个单独用户的工资而不是多个用户的平均工资,具体结果如下: 260 | ```python 261 | {0.7: 48000, 262 | 1.9: 48000, 263 | 2.5: 60000, 264 | 4.2: 63000, 265 | 6: 76000, 266 | 6.5: 69000, 267 | 7.5: 76000, 268 | 8.1: 88000, 269 | 8.7: 83000, 270 | 10: 83000} 271 | ``` 272 | 更有用的可能是将工作年限做一个粗略地分组再进行统计,代码如下: 273 | ```python 274 | def tenure_bucket(tenure): 275 | if tenure < 2: 276 | return "less than two" 277 | elif tenure < 5: 278 | return "between two and five" 279 | else: 280 | return "more than five" 281 | ``` 282 | 然后把属于同一个工作年限分组的工资数据合并到一个列表中,具体代码如下: 283 | ```python 284 | #键是工作年限分组数据,值是该工作年限分组对应的工资列表 285 | salary_by_tenure_bucket = defaultdict(list) 286 | 287 | for salary, tenure in salaries_and_tenures: 288 | bucket = tenure_bucket(tenure) 289 | salary_by_tenure_bucket[bucket].append(salary) 290 | ``` 291 | 最后对每一个工作年限分组计算平均值,具体代码如下: 292 | ```python 293 | average_salary_by_bucket = { 294 | tenure_bucket : sum(salaries) / len(salaries) 295 | for tenure_bucket, salaries in salary_by_tenure_bucket.iteritems() 296 | } 297 | ``` 298 | 这样我们可以得到一个更加有意思的结果: 299 | ```python 300 | {'between two and five': 61500, 301 | 'less than two': 48000, 302 | 'more than five': 79166} 303 | ``` 304 | 305 | 现在终于你有了一个可以大声宣传的有趣的事实:“有5年以上工作经验的数据科学家比菜鸟数据科学家可以多挣65%”。 306 | 307 | 
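(这个数字可以直接用上面的分组平均值验证:79166 / 48000 ≈ 1.65,也就是大约多挣 65%。)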
当然,我们必须承认我们的分组标准是粗略选择的。事实上我们真正想说明的是,平均而言,更多的工作经验对于工资的会有积极的影响。更进一步,为了做出一些更加吸引人的有趣事实,或许我们应该对于未来做一些大胆的预测。虽然我们可能并不知道的这些工作年限对应的工资数据。我们将在第 14 章仔细研究这个思路。 308 | 309 | ### 付费用户 310 | 311 | 当你吃完午饭回到办公室的时候,负责财务的高管正在等你。她希望能够更好地区分出付费用户和未付费用户。(她已经知道付费用户和未付费用户的用户名,但是没有更进一步的信息。) 312 | 313 | 你注意到工作年限和是否付费之间似乎存在某种联系。 314 | ```python 315 | 0.7 paid 316 | 1.9 unpaid 317 | 2.5 paid 318 | 4.2 unpaid 319 | 6 unpaid 320 | 6.5 unpaid 321 | 7.5 unpaid 322 | 8.1 unpaid 323 | 8.7 paid 324 | 10 paid 325 | ``` 326 | 工作年限较长和较短的用户倾向于付费,但是接近平均工作年限的用户常常不倾向于付费。 327 | 328 | 因此,你打算建立一个分类模型来区分付费用户和未付费用户。当然需要说明的是,虽然目前的数据量确实不足以创建一个可靠的模型,但是我们可以只是进行初步的尝试。所以你尝试认为工作年限较长和较短的用户是付费用户,但是接近平均工作年限的用户是未付费用户。具体的分类模型代码如下: 329 | ```python 330 | def predict_paid_or_unpaid(years_experience): 331 | if years_experience < 3.0: 332 | return "paid" 333 | elif years_experience < 8.5: 334 | return "unpaid" 335 | else: 336 | return "paid" 337 | ``` 338 | 当然,我们只是粗略地估测了分段点的位置。 339 | 340 | 随着数据量和数学知识的增加,我们可以建立一个更加可靠的基于用户工作年限来预测用户付费可能性的模型。我们会在第 16 章仔细研究这个问题。 341 | 342 | 热门话题 343 | 344 | 当你在 DataSciencester 第一天的工作接近尾声准备下班的时候,负责产品内容管理的高管向你咨询用户们最感兴趣的热门话题是哪些,因为她需要安排公司博客发布内容的日程,通常这些博客的内容就是用户门最感兴趣的话题。你已经从前面“你可能认识的数据科学家”项目中“推荐拥有共同兴趣的陌生用户”部分得到了如下的用户兴趣数据: 345 | 346 | ```python 347 | interests = [ 348 | (0, "Hadoop"), (0, "Big Data"), (0, "HBase"), (0, "Java"), 349 | (0, "Spark"), (0, "Storm"), (0, "Cassandra"), 350 | (1, "NoSQL"), (1, "MongoDB"), (1, "Cassandra"), (1, "HBase"), 351 | (1, "Postgres"), (2, "Python"), (2, "scikit-learn"), (2, "scipy"), 352 | (2, "numpy"), (2, "statsmodels"), (2, "pandas"), (3, "R"), (3, "Python"), 353 | (3, "statistics"), (3, "regression"), (3, "probability"), 354 | (4, "machine learning"), (4, "regression"), (4, "decision trees"), 355 | (4, "libsvm"), (5, "Python"), (5, "R"), (5, "Java"), (5, "C++"), 356 | (5, "Haskell"), (5, "programming languages"), (6, "statistics"), 357 | (6, "probability"), (6, "mathematics"), (6, "theory"), 358 | (7, "machine learning"), (7, "scikit-learn"), (7, "Mahout"), 359 | (7, "neural networks"), (8, "neural networks"), (8, "deep learning"), 360 | (8, "Big Data"), (8, "artificial intelligence"), (9, "Hadoop"), 361 | (9, "Java"), (9, "MapReduce"), (9, "Big Data") 362 | ] 363 | ``` 364 | 一个简单但是可能不是那么激动人心的方法就是仅仅从关键词被提及的频数角度来找出最受欢迎的兴趣,具体步骤如下: 365 | 366 | 把兴趣列表中的单词转化为小写(因为不同的用户可能大写也可能不大写同一个关键词的开头字母) 367 | 对兴趣词组进行分词 368 | 统计分词结果中,每一个不同单词的出现次数 369 | 具体代码如下: 370 | ```python 371 | words_and_counts = Counter(word 372 | for user, interest in interests 373 | for word in interest.lower().split()) 374 | ``` 375 | 这样我们就可以很方便的输出所有出现次数大于 1 的单词,具体代码如下: 376 | ```python 377 | for word, count in words_and_counts.most_common(): 378 | if count > 1: 379 | print word, count 380 | ``` 381 | 这样就可以得到你期望的结果了(除非你希望 “scikit-learn” 应该被分割开来,这时候确实没有给出你期望的结果),具体结果如下: 382 | ```python 383 | learning 3 384 | java 3 385 | python 3 386 | big 3 387 | data 3 388 | hbase 2 389 | regression 2 390 | cassandra 2 391 | statistics 2 392 | probability 2 393 | hadoop 2 394 | networks 2 395 | machine 2 396 | neural 2 397 | scikit-learn 2 398 | r 2 399 | ``` 400 | 我们会在第 20 章探讨更加高级的方法来提取热门话题。 401 | 402 | ## 向前进 403 | 404 | 这是成功的第一天!这也是非常忙碌的第一天!这也是相当忙碌的第一天!当再也没有人向你咨询问题的时候,你下班离开了办公楼。睡个好觉,明天将会是新员工入职培训会。(是的,你在新员工入职培训会之前就已经认真地工作了整整一天!一定要告诉你的人力资源主管这一点。) 405 | 406 | * [返回目录](../README.md) 407 | * [下一章](Chapter_02_A_Crash_Course_in_Python.md) 408 | -------------------------------------------------------------------------------- /chapters/Chapter_02_A_Crash_Course_in_Python.md: -------------------------------------------------------------------------------- 1 | # 
Chapter 2 Python快速入门课程 2 | > 让我难以置信的是,25 年过去了,人们依旧为 Python 痴狂。 3 | 4 | > Michael Palin 5 | 6 | 所有 DataSciencester 的新员工都被要求参加新员工的入职培训会,在入职培训会上最有意思的部分就是会有一个关于 Python 的快速入门课程。 7 | 8 | 这并不是一个无所不包的 Python 教程,但是这门课程将会重点讲授那些 DataSciencester 数据科学家们最常使用的知识点( 一部分知识点恰恰是通常的 Python 入门课程不怎么关注的。)。 9 | 10 | ## 基础知识 11 | 12 | ### 获取 Python 13 | 14 | 你可以直接从 [Python.org](https://www.python.org/) 下载 Python。 但是如果你的机器上还有 Python, 我特别推荐你直接安装 [Anaconda](https://store.continuum.io/cshop/anaconda/) 这个 Python 发行版, 因为 Anaconda 几乎已经包括了所有你可能在数据科学中用到的 Python 包。 15 | 16 | 在我写作本书的时候,最新的 Python 版本已经升级到了 3.4 。但是在 DataSciencester 我们使用更加可靠的老版本, Python 2.7 。 Python 3 并不向后兼容 Python 2 而很多重要的软件包只能在 Python 2.7 下工作。 数据科学社区依然坚守在 Python 2.7 的版本上,这也意味着我们不得不遵守这个传统。 请确保你已经安装了正确的 Python 版本。 17 | 18 | 如果你不能安装 Anaconda,请一定要安装`pip` 这个软件包, `pip` 是让你可以轻松安装第三方软件包,特别是有些我们必须的第三方软件包的 Python 包管理程序。 `pip` 也可以和 IPython很好的集合在一起工作,`IPython`是一个相当易用方便的 Python 交互式解释器。 19 | 20 | ( 如果你安装了 Anaconda 那么 `pip` 和 `IPython` 应该已经自动安装了。) 21 | 22 | ```bash 23 | pip install ipython 24 | ``` 25 | 如果出现了任何奇怪晦涩的错误信息,请直接搜索网络,那儿肯定有很多好的解决方法。 26 | 27 | ### Python 之禅 28 | 29 | Python 是一门有着自己独特设计理念的语言,这些设计理念通过 Python 之禅这个框架来解释,你可以在 Python 的交互解释器中通过输入 `import this` 来查看。 30 | 31 | 人们最长讨论和提到的是如下几句: 32 | 33 | > ( 任何问题)应该有一个解决方法,也只有一个最好的解决方案,请使用最显然直接的方法实现它。 34 | 35 | > (英文原版:There should be one-- and preferably only one --obvious way to do it.) 36 | 按照最直接的方式( 虽然对于初学者可能并不是那么直接。)写下的代码通常被认为是符合 Python 惯用法( Pythonic )的代码。虽然这本书并不是一本单纯的关于 Python 的书,但是只有偶然的几次我们会违背 Python 惯用法也就是我们使用非 Python 惯用法 的方式来实现可以用 Python 惯用法 完成的目标,大多数情况下我们倾向使用符合 Python 惯用法的方法解决问题。 37 | 38 | ### 空格格式 39 | 40 | 许多语言习惯使用花括号来表示程序块的起始与结束。但是 Python 使用缩进,比如: 41 | ```python 42 | for i in [1,2,3, 4,5]: 43 | print i # 对 i 做循环 44 | for j in [1, 2,3,4,5]: 45 | print j # 对 j 做循环 46 | print i + j # 对 j 做循环的最后一行代码 47 | print i # 对 i 做循环的最后一行代码 48 | print "done looping" 49 | ``` 50 | 这种缩进的方式让 Python 的代码非常具有可读性,当然这也要求你对于代码的格式非常小心。对于非常长的多行计算公式而言,一个非常有用的 Python 特性是, Python会自动忽略在圆括号和花括号之间的空格,比如: 51 | ```python 52 | long_winded_computation = (1 + 2 + 3 + 4 + 5 + 6 +7 + 8 + 9 + 10 + 11 + 12 + 53 | 13 + 14 + 15 + 16 + 17 + 18 + 19 + 20 ) 54 | ``` 55 | 利用这个特性你可以写出更加容易阅读的代码,比如: 56 | ```python 57 | list_of_list = [[1, 2, 3], [4, 5, 6], [7, 8, 9]] 58 | 59 | easier_to_read_list_lists = [[1, 2, 3], 60 | [4, 5, 6], 61 | [7, 8, 9] ] 62 | ``` 63 | 你也可以用反斜线来提示语句会延续到下一行,虽然我们很少这样做: 64 | ```python 65 | two_plus_three = 2 + \ 66 | 3 67 | ``` 68 | 使用空格格式的一个结果就是我们不能够像在其他语言中那样,简单地在 Python 交互解释器中复制或者粘贴代码。比如,如果你想粘贴如下的代码到 Python 交互解释器中: 69 | ```python 70 | for i in [1, 2, 3, 4, 5]: 71 | 72 | #注意上方的空格 73 | print i 74 | ``` 75 | 你可能会得到如下的提示信息: 76 | > 译者注,这里需要特别注意输入代码的格式区别。 具体详情请见 [Github Issue Report#11](https://github.com/joelgrus/data-science-from-scratch/issues/11)。 77 | 78 | ```python 79 | IndentationError: unexpected indent 80 | ``` 81 | 因为交互解释器认为空行代表了`for`循环的结束。 82 | 83 | 当然如果利用`Ipython`的话就可以体验`Ipython`自带的 `%paste`函数,`%paste`函数可以正确的系统剪贴板上的任意内容到交互解释器中,包括空格和其他格式。单单这点就值得你使用`Ipython`来编辑你的 Python 代码。 84 | 85 | ### 模块 86 | 87 | 有些功能 Python 并不默认载入。这些功能既包括 Python 自带的一部分功能也包括你自行下载的第三方软件包。为了使用这些功能你需要首先用 `import` 命令导入包含这些功能的模块。 88 | 89 | 一个简单的方法是直接导入模块本身,比如: 90 | 91 | ```python 92 | import re 93 | my_regex = re.compile("[0-9]+",re.I) 94 | ``` 95 | 这里的 `re` 就是包括了正则表达式功能的模块。通过直接使用`import` 命令导入相应的包之后,你就可以通过在模块包含的函数或者方法之前加上该模块名作为前缀比如`re`来使用他们了。 96 | 97 | 如果 `re` 在你的程序中已经被占用了的话,你就可以如下的方法来引入 `re` 模块。 98 | 99 | ```python 100 | import re as regex 101 | my_regex = regex.compile("[0-9]+",re.I) 102 | ``` 103 | 通常在你的模块名比较难以记住或者需要输入很多字符的时候,你会使用上面的第二种方法导入特定的模块。比如,当我们使用 
`matplotlib` 可视化数据的时候,通常我们这样导入需要的模块: 104 | 105 | ```python 106 | import matplotlib.pyplot as plt 107 | ``` 108 | 如果你只是需要某个模块的特定功能,你就可以直接导入这些功能然后不需要在这些功能前面加上前缀,比如: 109 | ```python 110 | from collections import defaultdict, Counter 111 | lookup = defaultdict(int) 112 | my_counter = Counter() 113 | ``` 114 | 如果你喜欢做些破坏,你可以将某个模块的所有内容导入到 Python 的命名空间,这很有可能覆盖你已经定义的某些变量名,比如: 115 | ```python 116 | match = 10 117 | from re import * #注意,re 模块含有 match 函数 118 | print match #“” 119 | ``` 120 | 当然因为你不是个破坏狂,所以你是不会这么做的,对吧? 121 | 122 | ###算数运算 123 | 124 | Python 2.7 默认使用整数除法,因为 `5 / 2` 的结果是 `2` 。 但是,大多数情况下这不是你希望的得到的结果,因此我们总会在我们的程序文件的开头使用如下语句: 125 | ```python 126 | from __future__ import division 127 | ``` 128 | 这样 `5/2` 的结果就是 `2.5` 。 这本书的所有程序都使用这种形式的除法。只有极个别的例子,当我们需要整数除法的时候,我们可以通过使用双反斜杠来得打整数除法,比如 `5 // 2`。 129 | 130 | ### 函数 131 | 132 | 函数简而言之就是一种规则,这种规则对于零个或者多个输入进行处理并给出一个相应的输出。 在 Python 中,我们使用关键字 `def`来定义函数,比如: 133 | ```python 134 | def double(x): 135 | '''这里通常会协商函数的说明文档,文档一般会解释函数的功能. 136 | 比如对于这个函数而言,我们会写上: 这个函数将把输入都乘以2作为输出''' 137 | return x * 2 138 | ``` 139 | 在 Python 中函数是一级(First-class)公民,这也就是意味着你可以像对待参数一样自由地把函数赋值给变量或者将函数传递给函数,比如: 140 | ```python 141 | def apply_to_one(f): 142 | '''该函数将自动把 1 作为函数的参数''' 143 | return f(1) 144 | 145 | my_double = double #调用前面定义的函数 146 | x = apply_to_one(my_double) #结果为2 147 | ``` 148 | 149 | 同样 Python 允许你非常方便的定义Lambda表达式,比如: 150 | ```python 151 | y = apply_to_one(lambda x : x+4) #结果是5 152 | ``` 153 | 你也可以把 `lambds`表达式赋值给变量,尽管很多人会告诉你你应该使用`def`定义函数的方法来代替: 154 | ```python 155 | another_double =lambda x : 2 * x #最好不要这样 156 | def another_double(x): return 2 *x #推荐这样 157 | ``` 158 | 函数的参数也可以事先赋值,只有在你不希望出现默认值的时候这个方法才需要特别注意。 159 | 160 | ```python 161 | def my_print(message = "my default message"): 162 | print message 163 | 164 | my_print("hello") #输出 "hello" 165 | my_print() #输出 "my default message" 166 | ``` 167 | 168 | 有时候通过参数名来确定特定参数也是非常有用的,比如: 169 | ```python 170 | def subtract(a = 0 , b = 0): 171 | return a - b 172 | subtract(10,5) #输出是5 173 | subtract(0,5) #输出是-5 174 | subtract(b=5) #输出是-5 175 | ``` 176 | 在这本书里我们可能需要自己编写很多的函数,做好准备! 
177 | 178 | ### 字符串 179 | 180 | 字符串的内容会被单引号或者双引号括起来,当然在这里你的单引号或者双引号必须配对。 181 | 182 | ```python 183 | single_quoted_string = 'data science' 184 | double_quoted_string = 'data science' 185 | ``` 186 | Python 通过反斜杠来表示特殊字符,比如: 187 | ```python 188 | tab_string = "\t" #代表制表符 189 | len(tab_string)   #输出结果是1 190 | ``` 191 | 如果你就是希望使用反斜杠( 比如在 Windows 操作系统中表示路径或者在正则表达式中 ),你可以通过`r`使用 `raw` 字符串: 192 | ```python 193 | not_tab_string = r"\t" #表示 "\" 和 "t" 194 | len(not_tab_string) #输出结果是2 195 | ``` 196 | 197 | 你也可以通过使用三重引号来建立多行字符串: 198 | ```python 199 | multi_line_string= """THis is the first line 200 | and this is the second line 201 | and this is the third line""" 202 | ``` 203 | ### 异常 204 | 205 | 当程序发生错误的时候, Python 会捕获并处理异常。如果不处理,这些异常会导致程序崩溃。你可以通过使用`try`和`except`来处理异常: 206 | ```python 207 | try: 208 | print 0 / 0 209 | except ZeroDivisionError: 210 |  print "can not divied by zero" 211 | ``` 212 | 虽然在很多其他语言中异常被认为是编程的错误,但是在 Python 的世界中通过异常来使得程序更加整洁是没什么不好意思的,在这本书中,我们有时候也会使用异常。 213 | 214 | ### 列表 215 | 216 | 可能在 Python 中最基本的数据结构就是列表`list`了。列表简而言之就是一个有序的集合(在其他语言中列表可能被成为数组,但是 Python 中的列表更多了一些附加的功能。) 217 | 218 | ```python 219 | integer_list = [1, 2, 3] 220 | heterogeneous_list = ["string", 0.1, True] 221 | list_of_lists = [integer_list, heterogeneous_list, [] ] 222 | 223 | list_length= len(integer_list) #结果是3 224 | list_sum= sum(integer_list)    #结果是6 225 | ``` 226 | 你可以通过方括号来取得或者设置列表中某个特定位置元素的值: 227 | ```python 228 | x = range(10) # 建立[0,1,2...9]的新列表 229 | zero = x[0]     # 结果是0,列表序号是从0开始的 230 | one = x[1]     # 结果是1 231 | nine = x[-1]    # 结果是9,更加符合 Python 惯用法的引用最后一个列表元素的方法 232 | eight = x[-2]   # 结果是8, 更加符合 Python 惯用法的引用倒数第二个列表元素的方法 233 | x[0] = -1     # 现在x列表的结果是[-1,1,2...9] 234 | ``` 235 | 你也可以通过方括号来对于列表进行切片操作: 236 | ```python 237 | first_three=x[:3] #[-1,1,2] 238 | three_to_end=x[3:] #[3,4,...,9] 239 | one_to_four=x[1:5] #[1,2,3,4] 240 | last_three=x[-3:] #[7,8,9] 241 | ithout_first_and_last=x[1:-1] #[1,2,...,8] 242 | copy_of_x = x[:] #[-1,1,2,...,9] 243 | ``` 244 | Python 有`in`操作符来帮助你检查某个元素是不是在列表中 245 | ```python 246 | 1 in [1,2,3] #真 247 | 0 in [1,2,3] #假 248 | ``` 249 | 需要注意的是`in`操作符的检查挨个检查列表中的元素,所以除非你知道你的列表很短否则你不应该使用它。(或者你不是那么在乎时间的时候。) 250 | 251 | 可以很容易地把两个列表拼接在一起: 252 | 253 | ```python 254 | x = [1,2,3] 255 | x.extend[4,5,6] #现在x为[1,2,3,4,5,6] 256 | ``` 257 | 258 | 如果你不希望修改x,你可以使用列表加法: 259 | 260 | ```python 261 | x=[1,2,3] 262 | y=x+[4,5,6] #现在y为[1,2,3,4,5,6],而x没有改变 263 | ``` 264 | 更经常出现的情况是,我们需要一次添加一个元素进列表: 265 | 266 | ```python 267 | x=[1,2,3] 268 | x.append(0) #现在x是[1,2,3,0] 269 | y=x[-1] #结果是0 270 | z=len(x) #结果是4 271 | ``` 272 | 273 | 如果你知道列表有多少元素,我们就可以很容易的分叉整个列表: 274 | 275 | ```python 276 | x,y = [1,2] #现在x是1,y是2 277 | ``` 278 | 当然如果你在等号两边没有同样数目的元素,那么你会得到一个`ValueError`的报错信息。 279 | 280 | 经常我们也会通过使用下划线来扔掉那些我们不需要的元素: 281 | ```python 282 | _,y = [1,2] #现在y是2,当然我们并不关心第一个元素 283 | ``` 284 | 285 | ### 元组 286 | 287 | 元组是不可的列表。很多你可以在列表上做的事情在元组上没办法实现。你可以用括号(或不加任何标记)来表示元组,而不是在列表形式中的方括号: 288 | ```python 289 | my_list = [1,2] 290 | my_tuple = (1,2) 291 | other_tuple = 3,4 292 | my_list[1] = 3 #现在列表是[ 1, 3 ] 293 | 294 | try: 295 | my_tuple[1] = 3 296 | except TypeError: 297 | print "can not modify a tuple" 298 | ``` 299 | 300 | 元组是从函数返回多个值的方法,比如: 301 | 302 | ```python 303 | def sum_and_product(x,y): 304 | return (x+y),(x * y) 305 | 306 | sp = sum_and_product(2, 3) #结果是元组(5,6) 307 | s,p = sum_and_product(5,10) #s值为 15, p值为 50 308 | ``` 309 | 元组和列表也都可以用来赋多个值。 310 | 311 | ```python 312 | x,y = 1,2 313 | x,y = y,x #Pythonic规范的交换x,y的方法 314 | ``` 315 | ### 字典 316 | 317 | 
字典是另一种非常重要的基本数据类型。字典将键和值联系在一起,所以字典允许你可以快速地依照特定的键值得到值。 318 | 319 | ```python 320 | empty_dict = {} #Pythonic 321 | empty_dict2 = dict() #less Pythonic 322 | grades = {"Joel" : 80,"Tim": 95} 323 | ``` 324 | 325 | 你可以使用方括号来通过一个键获得相应的值。 326 | 327 | ```python 328 | joel_grade = grades["Joel"] 329 | ``` 330 | 331 | 当然如果你使用字典中不存在的键,则会得到一个`KeyError`: 332 | ```python 333 | try: 334 |   kates_grad = grades["Kate"] 335 | except KeyError: 336 | print "no grad for Kate!" 337 | ``` 338 | 你可以使用`in`方法来检查一个键是否存在。 339 | 340 | ```python 341 | joel_has_grade = "Joel" in grades #True 342 | kate_has_grade = "Kate" in grades #False 343 | ``` 344 | 当你使用`get`方法的时候,如果你查找的键不在字典中,则会返回默认值而不是抛出错误。 345 | ```python 346 | joel_grade = grades.get( "joel" ,0 ) #结果是80 347 | kates_grade=grades.get( "Kate" ,0 ) #结果是 0 348 | no_ones_grade = grades.get("No One") #给出默认值 None 349 | ``` 350 | 同样的,你可以使用方括号给键值对赋值。 351 | 352 | ```python 353 | grades["Tim"] = 99 354 | grades["Kate"] = 100 355 | num_students = len(grades) 356 | ``` 357 | 我们经常使用字典来表示结构化的数据,比如: 358 | 359 | ```python 360 | tweet = { 361 | "users" : "Joelgrus", 362 | "text" : "Data Science is Awesome", 363 | "retweet_count" : 100, 364 | "hashtags" : ["#data","science","#datascience","#awesome","#yolo"] 365 | } 366 | ``` 367 | 除了可以查看特定的键值我们还可以查看所有的键值,比如: 368 | 369 | ```python 370 | tweet_keys = tweet.keys() #键列表 371 | tweet_values = tweet.values() #值列表 372 | tweet_items = tweet.items() #(键,值)元组列表 373 | 374 | "user" in tweet_keys #真,但是使用更短的列表更好 375 | "user" in tweet # 更加符合Pythonic的方式,使用更快的方法 376 | "joelgrus" in tweet #真 377 | 378 | ``` 379 | 380 | 字典的键是不可以更改的。值得注意的是,`lists`是不可以作为键的。如果你需要使用多组成的键,你可以使用元组或者想办法把键转化成一个字符串来使用字典结构。 381 | 382 | #### defaultdict 383 | 384 | 假设你正在考虑统计一个文档的单词数目,一个显然的方法是使用字典结构,键代表单词,值代表单词出现的次数。这样你就可以通过读取每一个单词增加该单词的出现次数或者对于新出现的单词增加一个新的键,具体实现如下: 385 | 386 | ```python 387 | 388 | word_counts = {} 389 | for word in document: 390 | if word in word_counts: 391 | word_counts[word] + = 1 392 | else: 393 | word_counts[word] = 1 394 | 395 | 396 | ``` 397 | 398 | 当然你也可以使用" Forgiveness is better than Permission(异常处理比强制要求更好)"的原则来处理查找一个不存在的键的异常情况,比如: 399 | 400 | ```python 401 | 402 | word_counts = {} 403 | 404 | for word in document : 405 | try: 406 | word_counts[word] + = 1 407 | except KeyError: 408 | word_counts[word] = 1 409 | 410 | ```  411 | 412 | 第三种方法就是使用`get`方法,这个可以更加优雅的处理缺失的键值: 413 | 414 | ```python 415 | 416 | word_counts = {} 417 | 418 | for word in document: 419 | previous_count = word_counts.get(word,0) 420 | word_counts[word] = previous_count + 1 421 | 422 | ``` 423 | 424 | 任何看到这几种方法的人都会觉得这些方法有点别扭,就是为什么`defaultdict`非常有用的原因。`defaultdict`看起来就像Python自带的字典结构,但是当你尝试查找一个字典中不存在的键的时候两者表现出极大的不同,`defaultdict`会尝试使用添加这个本来不存在的键并使用你预先设定的零值函数来设置该键对应的值。为了使用`defaultdict`你必须首先从`collections`包中引入它: 425 | 426 | ```python 427 | 428 | from collections import defaultdict 429 | 430 | word_counts = defaultdict(int) #int() 产生 0 431 | 432 | for word in document: 433 | word_counts[word] += 1 434 | 435 | ``` 436 | 437 | `defaultdict`也可以很方便的和`list`或者`dict`甚至其他函数整合在一起: 438 | 439 | ```python 440 | 441 | dd_list = defaultdict(list) #产生空列表 442 | dd_list[2].append(1) #现在dd_list变成{2:[1]} 443 | 444 | dd_dict = defaultdict(dict) #产生空字典 445 | dd_dict["Joel"]["City"] = "Seattle" #{"Joel":{"City" : "Seattle"}} 446 | 447 | dd_pair = defaultdict(lambda: [0,0]) 448 | dd_pair[2][1] = 1 #现在dd_pair是{2:[0,1]} 449 | 450 | ``` 451 | 452 | 这种方法在我们搜集相应的键和值的时候特别有用,因为我们可以不必每次都检查当前键是不是存在。 453 | 454 | ### Counter 455 | 456 | `Counter`将一个序列值转换成类似`defaultdict`的键值结构来统计对象出现的次数,我们一般使用这个结构来建立柱状图: 457 | 458 | 
```python 459 | 460 | form collections import Counter 461 | 462 | c = Counter([0, 1, 2, 0]) #c 现在变成 {0 : 2, 1 : 1, 2 : 1} 463 | 464 | 465 | ``` 466 | 467 | 这让我们可以非常轻松地解决单词统计问题: 468 | 469 | ```python 470 | 471 | word_counts = Counter(document) 472 | 473 | ``` 474 | 475 | 一个`Counter`对象最经常使用的方法是`most_common`: 476 | 477 | ```python 478 | 479 | #输出前10个出现最多的单词和相应的频率 480 | 481 | for word,count in word_counts.most_common(10): 482 | print word,count 483 | 484 | ``` 485 | 486 | ###集合 487 | 488 | 下一个数据结构是集合(set),集合由一系列不同元素组成: 489 | 490 | ```python 491 | 492 | s = set() #创建空集合 493 | s.add(1) #现在集合为( 1 ) 494 | s.add(2) #现在集合为( 1 2 ) 495 | s.add(2) #现在集合依然是( 1 2 ) 496 | x = len(s) #结果是2 497 | y = 2 in s #结果是True 498 | z = 3 in s #结果是False 499 | ``` 500 | 501 | 我们主要因为以下两个原因使用`sets`,首先集合的`in`操作非常快。比如如果我们希望测试很多项目是不是在某个集合中,集合数据结构比列表更加合适: 502 | 503 | ```python 504 | 505 | stopwords_list = ["a", "an", "at"]+hundreds_of_other_words +["yet", "you"] 506 | 507 | "zip " in stopwords_list #假,但比较慢 508 | 509 | stopwords_set = set(stopwords_list) 510 | 511 | "zip" in stopwords_set #快 512 | 513 | ``` 514 | 515 | 第二个原因是找出一个集合的不同元素: 516 | 517 | ```python 518 | 519 | item_list = [1, 2, 3, 1, 2, 3] 520 | num_items = len(item_list) #结果是6 521 | item_set = set(item_list) #结果是{1, 2, 3} 522 | num_distinct_items =len(item_set) #结果是3 523 | distinct_item_list = list(item_set) #结果是[1, 2, 3] 524 | ``` 525 | 526 | 但是综合而言,相对于字典和列表我们不是很经常地使用集合。 527 | 528 | ###流程控制 529 | 530 | Python和其他语言类似,你可以使用`if`关键字来根据不同的条件进行不同的操作: 531 | 532 | ```python 533 | 534 | if 1 > 2: 535 | message = "if only 1 were greater than two..." 536 | elif 1 > 3: 537 | message = "elif stands for 'else if'" 538 | else: 539 | message = "when all else fails use else (if you want to)" 540 | 541 | ``` 542 | 543 | 你当然也可以把if-then-else结构利用三元操作符写在一行,我们有时也会采用这个方法: 544 | 545 | ```python 546 | 547 | parity = "even" if x % 2 == 0 else "odd" 548 | 549 | ``` 550 | 551 | Python也有`while`循环: 552 | 553 | ```python 554 | x = 0 555 | 556 | while x < 10: 557 | print x,"is less than 10" 558 | x += 1 559 | 560 | ``` 561 | 562 | 虽然我们更经常使用`for`和`in`来做循环操作: 563 | 564 | ```python 565 | 566 | for x in range(10): 567 | print x,"is less than 10" 568 | 569 | ``` 570 | 571 | 如果你想实现更加复杂的控制罗辑,你需要使用`continue`和`break`关键字: 572 | 573 | ```python 574 | 575 | for x in range(10): 576 | if x == 3 : 577 | continue #立刻进入下一个循环 578 | if x == 5 : 579 | break #结束整个循环 580 | print x 581 | 582 | ``` 583 | 584 | 上面的代码将打印出0, 1, 2, 和 4。 585 | 586 | ###真值 587 | 588 | 布尔值在Python中和其他语言并没有什么不同,除了它们需要大写开头字母: 589 | 590 | ```python 591 | 592 | one_is_less_than two = 1 < 2 #真 593 | true_equals_false = True == False #假 594 | 595 | ``` 596 | 597 | Python 使用`None`来指示并不存在的值,这和其他语言的`null`类似: 598 | 599 | ```python 600 | 601 | x = None 602 | print x == None #结果是True,但是这种方法并不 Pythonic 603 | print x is None #结果是真,符合 Pythonic 604 | 605 | ``` 606 | 607 | Python 允许你在需要使用布尔值的时候使用任何的值。下面的都可以代表假: 608 | 609 | * False 610 | * None 611 | * [] (空列表) 612 | * {} (空字典) 613 | * "" 614 | * set() 615 | * 0 616 | * 0.0 617 | 618 | 相应地,没有列出的值都可以作为真值。这让你可以非常方便地使用`if`关键字来判断空列表,空字符串,空字典。这个特性也会引起一些特别的问题,比如你下面的代码: 619 | 620 | ```python 621 | 622 | s = some_function_that_returns_a_string() 623 | 624 | if s: 625 | first_char = s[0] 626 | else: 627 | first_char = "" 628 | 629 | ``` 630 | 631 | 更加简单的方法是: 632 | 633 | ```python 634 | 635 | first_char = s and s[0] 636 | 637 | ``` 638 | 639 | 这是因为`and`会返回第二个值如果第一个值为真的时候,当第一个值非真的时候返回第一个值。类似地,如果x可能是一个数字或者空值(None): 640 | 641 | ```python 642 | 643 | safe_x = x and 0 644 | 645 | ``` 646 | 647 | 这一定会返回返回一个数字。 648 | 649 | 
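这里需要留意 `and` 和 `or` 返回值的差别:`or` 在第一个值为假时返回第二个值,所以当 x 可能是 None 时,用 `x or 0` 才能保证结果一定是数字。下面是一个最小的对照示例(译者补充,假设 x 为 None):

```python

x = None
safe_x = x or 0     # 结果是 0:or 在第一个值为假(这里是 None)时返回第二个值
risky_x = x and 0   # 结果是 None:and 在第一个值为假时直接返回第一个值

```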
Python 也有一个`all`函数,`all`函数接收一个列表当每一个元素都为真的时候返回`True`; `any`函数则会在至少有一个元素为真的时候返回`True`: 650 | 651 | ```python 652 | 653 | all([True, 1, { 3 }]) #结果为真 654 | all([True, 1, {}]) #结果为假,因为{}代表空字典为假 655 | any([True, 1, {}]) #结果为真 656 | all([]) #结果为真,因为列表中没有假元素 657 | any([]) #结果为假,因为列表中没有真元素 658 | 659 | ``` 660 | 661 | ##高级部分 662 | 663 | 下面我们将看一些在处理数据的时候非常有用的Python高级特性。 664 | 665 | ###排序 666 | 667 | 每一个Python列表都有`sort`方法来实现排序功能。如果你不希望弄乱原来的数据顺序,你可以使用`sorted`函数,这会产生一个新的列表: 668 | 669 | ```python 670 | 671 | x = [4, 1, 2, 3] 672 | y = sorted(x) #结果是[1, 2, 3, 4],x没有改变 673 | x.sort() #现在x变成[1, 2, 3, 4] 674 | 675 | ``` 676 | 默认情况下,`sort`和`sorted`按照从小到大的顺序简单地比较每个值的大小来排序。 677 | 678 | 如果你想从大到小排序,你可以指明`reverse`参数为`True`。同时你也可以通过指定特定的比较关键字`key`来实现更加复杂的函数结果比较: 679 | 680 | ```python 681 | 682 | #从大到小按照绝对值大小排序 683 | x = sorted([-4, 1, -2, 3], key = abs, reverse = True) #结果是[-4, 3, -2, 1] 684 | 685 | #对于word和count从大到小排序 686 | 687 | wc = sorted(word_counts.item(), 688 | key = lambda (word,count) : count, 689 | reverse = True) 690 | 691 | ``` 692 | ###列表推导式(List Comprehensions也有翻译成列表生成式) 693 | 694 | 经常地我们需要把一个列表转换成另外一个列表,可能是选中的特定元素也可能是对于原始元素做一定的转化或者两者都有。列表推导式是实现这个目标更加Pythonic的方法: 695 | 696 | ```python 697 | 698 | even_numbers = [x for x in range(5) if x % 2 == 0] #结果是[0, 2, 4] 699 | squares = [x * x for x in range(5)] #结果是[0, 1, 4, 9, 16] 700 | even_squares = [x * x for x in even_numbers] #结果是[0, 4, 16] 701 | 702 | ``` 703 | 704 | 你也可以相应的把列表转换成字典或者集合: 705 | 706 | ```python 707 | 708 | square_dict = {x : x * x for x in range(5) } #结果是{ 0:0, 1:1, 2:4, 3:9, 4:16 } 709 | square_set = {x * x for x in [1,-1] } #结果是{ 1 } 710 | 711 | ``` 712 | 如果你不需要列表中的某个值,可以使用下划线去除相应的元素: 713 | 714 | ```python 715 | 716 | zeros = [ 0 for _in even_numbers ] #和even_numbers有一样的长度 717 | 718 | ``` 719 | 720 | 列表推导式也可以包含多个`for`循环: 721 | 722 | ```python 723 | 724 | pairs = [(x, y) 725 | for x in range(10) 726 | for y in range(10)] #结果是100对(0,0), (0,1), ... 
(9,8),(9,9) 727 | 728 | ``` 729 | 同时后面的`for`可以使用前一个`for`的结果: 730 | 731 | ```python 732 | 733 | increasing_pairs = [(x, y) #只包括x 我坚信可视化是实现个人目标最强大的工具之一 4 | 5 | > —— Harvey Mackay 6 | 7 | 数据科学家工具集的一个基本组成部分是数据可视化。尽管非常容易创建可视化,但要做好却不容易。 8 | 9 | 数据可视化主要有两个用途: 10 | * 研究数据 11 | * 传递数据信息 12 | 13 | 本章,我们集中处理建立 你研究数据所需要的技能以及在本书余下章节中所要生成的可视化,正如其他的一些章节,数据可视化内容丰富得足以单独写一本书。然而,我们试试让你理解怎样才是好的可视化,什么不是 14 | 15 | ## matplotlib 16 | 在数据可视化领域有很多工具。我们也使用广泛使用的 [matplotlib library](http://matplotlib.org)。如果你对生成可交互的Web可视化,这个选择并不好,但对于简单的条形图,折线图和散点图,它绰绰有余。 17 | 18 | 具体来说,我们将使用`matplotlib.pyplot`模块。 在其中最简单的应用, `pyplot`, 一旦完成,你可以保存(通过`savefig()`)或显示(通过`show()`) 19 | 20 | 例如,绘制一个简单的图表 21 | ```python 22 | from matplotlib import pyplot as plt 23 | 24 | years = [1950, 1960, 1970, 1980, 1990, 2000, 2010] 25 | gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3] 26 | 27 | # create a line chart, years on x-axis, gdp on y-axis 28 | plt.plot(years, gdp, color='green', marker='o', linestyle='solid') 29 | 30 | # add a title 31 | plt.title("Nominal GDP") 32 | 33 | # add a label to the y-axis 34 | plt.ylabel("Billions of $") 35 | plt.show() 36 | ``` 37 | 38 | ![](../assets/images/C03_001.png) 39 | 40 | 怎样让图片显示得更好比较复杂且不在本章的讨论范围之内。有许多方式来自定义图表,比如坐标文本,线条样式,点标记,与其花时间到这些,我们仅仅使用其中一些到我们的案例中。 41 | 42 | >插图 尽管我们不会使用太多功能,`matplotlib`能够生成很复杂的图表, 可交互的,如果你想了解更多查阅其相关文档。 43 | 44 | 45 | 46 | ## 条形图 47 | 用条形图是很很好的选择,图3-2展示了每个不同的电影赢得了几次奥斯卡金像奖 48 | ```python 49 | from matplotlib import pyplot as plt 50 | 51 | movies = ["Annie Hall", "Ben-Hur", "Casablanca", "Gandhi", "West Side Story"] 52 | num_oscars = [5, 11, 3, 8, 10] 53 | 54 | # bars are by default with 0.8, so we'll add 0.1 to the left corrdinates 55 | # so that each bar is centered 56 | xs = [i + 0.1 for i, _ in enumerate(movies)] 57 | 58 | # plot bars with left x-coordinates [xs], heights [num_oscars] 59 | plt.bar(xs, num_oscars) 60 | 61 | plt.ylabel("# of Academy Awards") 62 | plt.title("My Favorite Movies") 63 | 64 | # label x-axis with movie names at bar centers 65 | plt.xticks([i + 0.5 for i, _ in enumerate(movies)], movies) 66 | 67 | plt.show() 68 | ``` 69 | 条形图也可以用来绘制不同时段的数值的直方图, 用于可视化研究数值是如何分布的,如图3-3 70 | ```python 71 | from matplotlib import pyplot as plt 72 | from collections import Counter 73 | 74 | grades = [83, 95, 91,87, 70, 0, 85, 82, 100, 67, 73, 0] 75 | decile = lambda grade: grade // 10 * 10 76 | histogram = Counter(decile(grade) for grade in grades) 77 | 78 | plt.bar([x - 4 for x in histogram.keys()], # shift each bar to the left by 4 79 | histogram.values(), # give each bar its correct height 80 | 8) # give each bar a width of 8 81 | 82 | plt.axis([-5, 105, 0, 5]) # x-axis from -5 to 105 83 | # y-axis from 0 to 5 84 | 85 | plt.xticks([10 * i for i in range(11)]) # x-axis labels at 0, 10, ..., 100 86 | plt.xlabel("Decile") 87 | plt.ylabel("# of Students") 88 | plt.title("Distribution of Exam 1 Grades") 89 | plt.show() 90 | 91 | ``` 92 | ![](../assets/images/C03_003.png) 93 | 94 | `plt.bar`的第三个参数指定了条的宽度,本例中我们指定宽度为8(这让条和条之间的缝隙很小,)And we 95 | shifted the bar left by 4, so that (for example) the “80” bar has its left and right sides at 96 | 76 and 84, and (hence) its center at 80. 97 | 98 | The call to plt.axis indicates that we want the x-axis to range from -5 to 105 (so that 99 | the “0” and “100” bars are fully shown), and that the y-axis should range from 0 to 5. 100 | And the call to plt.xticks puts x-axis labels at 0, 10, 20, …, 100. 101 | 102 | Be judicious when using plt.axis(). 
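前面提到画好的图既可以用 `show()` 显示,也可以用 `savefig()` 保存成文件。下面给出一个保存图片的最小示例(译者补充,文件名只是随意假设的):

```python
from matplotlib import pyplot as plt

plt.plot([1950, 1960, 1970], [300.2, 543.3, 1075.9], color='green', marker='o')
plt.title("Nominal GDP")

# 注意:要先保存再显示;plt.show() 之后当前图表可能会被清空
plt.savefig("nominal_gdp.png")   # 把当前图表保存为 PNG 文件
plt.show()
```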
When creating bar charts it is considered espe‐ 103 | cially bad form for your y-axis not to start at 0, since this is an easy way to mislead 104 | people (Figure 3-4): 105 | 106 | ```python 107 | from matplotlib import pyplot as plt 108 | 109 | mentions = [500, 505] 110 | years = [2013, 2014] 111 | 112 | plt.bar([2012.6, 2013.6], mentions, 0.8) 113 | plt.xticks(years) 114 | plt.ylabel("# of times I heard someone say 'data science'") 115 | 116 | # if you don't do this, matplotlib will label the x-axis 0, 1 117 | # and then add a +2.013e3 off in the corner (bad matplotlib!) 118 | plt.ticklabel_format(useOffset=False) 119 | 120 | # misleading y-axis only shows the part above 500 121 | plt.axis([2012.5, 2014.5, 499, 506]) 122 | plt.title("Look at the 'Huge' Increase!") 123 | plt.show() 124 | ``` 125 | ![](../assets/images/C03_004.png) 126 | 127 | 在图3-5中,我们使用了更多的 128 | In Figure 3-5, we use more-sensible axes, and it looks far less impressive: 129 | ```python 130 | from matplotlib import pyplot as plt 131 | 132 | mentions = [500, 505] 133 | years = [2013, 2014] 134 | 135 | plt.bar([2012.6, 2013.6], mentions, 0.8) 136 | plt.xticks(years) 137 | plt.ylabel("# of times I heard someone say 'data science'") 138 | 139 | # if you don't do this, matplotlib will label the x-axis 0, 1 140 | # and then add a +2.013e3 off in the corner (bad matplotlib!) 141 | plt.ticklabel_format(useOffset=False) 142 | 143 | # misleading y-axis only shows the part above 500 144 | # plt.axis([2012.5, 2014.5, 499, 506]) 145 | # plt.title("Look at the 'Huge' Increase!") 146 | # plt.show() 147 | 148 | plt.axis([2012.5,2014.5,0,550]) 149 | plt.title("Not So Huge Anymore") 150 | plt.show() 151 | ``` 152 | ![](../assets/images/C03_005.png) 153 | 154 | 155 | ## 折线图 156 | 正如我们所见到的,我们可以使用`plt.plot()`来绘制折线图。他它们用来表示趋势很合适,如下图所示 157 | ```python 158 | from matplotlib import pyplot as plt 159 | 160 | variance = [1, 2, 4, 8, 16, 32, 64, 128, 256] 161 | bias_squared = [256, 128, 64, 32, 16, 8, 4, 2, 1] 162 | total_error = [x + y for x, y in zip(variance, bias_squared)] 163 | xs = [ i for i, _ in enumerate(variance)] 164 | 165 | # we can make multiple calls to plt.plot 166 | # to show multiple series on the same chart 167 | plt.plot(xs, variance, 'g-', label='variance') # green solid line 168 | plt.plot(xs, bias_squared, 'r-.', label='bias^2') # red dot-dashed line 169 | plt.plot(xs, total_error, 'b:', label='total error')# blue dotted line 170 | 171 | # because we've assigned labels to each series 172 | # we can get a legend for free 173 | # loc=9 means "top center" 174 | plt.legend(loc=9) 175 | plt.xlabel("model complexity") 176 | plt.title("The Bias-Variance Tradeoff") 177 | plt.show() 178 | ``` 179 | 180 | ![](../assets/images/C03_006.png) 181 | 182 | 183 | ## 散点图 184 | A scatterplot is the right choice for visualizing the relationship between two paired 185 | sets of data. 
For example, Figure 3-7 illustrates the relationship between the number 186 | of friends your users have and the number of minutes they spend on the site every 187 | day: 188 | 189 | ```python 190 | from matplotlib import pyplot as plt 191 | 192 | friends = [ 70, 65, 72, 63, 71, 64, 60, 64, 67] 193 | minutes = [175, 170, 205, 120, 220, 130, 105, 145, 190] 194 | labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'] 195 | plt.scatter(friends, minutes) 196 | 197 | # label each point 198 | for label, friend_count, minute_count in zip(labels, friends, minutes): 199 | plt.annotate(label, 200 | xy=(friend_count, minute_count), # put the label with its point 201 | xytext=(5, -5), # but slightly offset 202 | textcoords='offset points') 203 | 204 | plt.title("Daily Minutes vs. Number of Friends") 205 | plt.xlabel("# of friends") 206 | plt.ylabel("daily minutes spent on the site") 207 | plt.show() 208 | ``` 209 | ![](../assets/images/C03_007.png) 210 | 211 | If you’re scattering comparable variables, you might get a misleading picture if you let 212 | matplotlib choose the scale, as in Figure 3-8: 213 | 214 | ```python 215 | from matplotlib import pyplot as plt 216 | 217 | test_1_grades = [ 99, 90, 85, 97, 80] 218 | test_2_grades = [100, 85, 60, 90, 70] 219 | plt.scatter(test_1_grades, test_2_grades) 220 | plt.title("Axes Aren't Comparable") 221 | plt.xlabel("test 1 grade") 222 | plt.ylabel("test 2 grade") 223 | plt.show() 224 | ``` 225 | 226 | ![](../assets/images/C03_008.png) 227 | 228 | If we include a call to plt.axis("equal"), the plot (Figure 3-9) more accurately 229 | shows that most of the variation occurs on test 2. 230 | That’s enough to get you started doing visualization. We’ll learn much more about 231 | visualization throughout the book. 232 | 233 | ![](../assets/images/C03_009.png) 234 | 235 | 236 | ## 进一步探索 237 | 238 | * [seaborn](http://stanford.io) 基于matplotlib,可以用来更加简单的生成更漂亮(也更复杂)的可视化文件。 239 | 240 | * [D3.js](http://d3js.org) 是个Javascript库,它可以生成用于Web应用上的复杂的可交互的可视化。虽然不是Python,但它比较新潮且广泛使用,很值得你去熟悉了解。 241 | 242 | • [Bokeh](http://bokeh.pydata.org) is a newer library that brings D3-style visualizations into Python. 是款可以用Python生成3D风格的可视化的新库。 243 | 244 | • [ggplot](http://bit.ly) is a Python port of the popular R library ggplot2, which is widely used for 245 | creating “publication quality” charts and graphics. It’s probably most interesting 246 | if you’re already an avid ggplot2 user, and possibly a little opaque if you’re not. 247 | -------------------------------------------------------------------------------- /chapters/Chapter_04_Linear_Algebra.md: -------------------------------------------------------------------------------- 1 | # Chapter4 线性代数 2 | 3 | > 还是有什么比代数更加没用或用处不大的么? 
4 | > 5 | > —— Billy Connolly 6 | 7 | 线性代数是数学上用来处理向量空间的一个分支。尽管我不指望在这么短的章节里教你,但它是很多数据科学概念和技术的基础,这意味着我有责任尝试(教会你)。 8 | 9 | 在本章我们所学到的内容在本书接下来的部分会用到很多。 10 | 11 | ## 向量 12 | 13 | 抽象来说, 向量是可以通过相加或乘以标量(比如数字)得到新的向量的对象 。对我们而言,它是在有限维度空间的点。尽管你可能并不把数据看作向量,但用向量表示数值型数据是很好的方式。 14 | 15 | 举例来说,你有一群人的身高,体重和年龄数据,你可以把这些数据看做三维向量(身高,体重,年龄)。如果你教一个有4门课的班级,你可以将学生的分数作为四维向量(成绩1,成绩2,成绩3,成绩4) 16 | 17 | 将向量用数值列表表示是最简单直接的方式,一个含有3个数的列表对应一个三维空间的向量,反之亦然: 18 | 19 | ```python 20 | height_weight_age=[70, # inches, 21 | 170, # pounds, 22 | 40] # years 23 | 24 | grades = [95, # exam1 25 | 88, # exam2 26 | 75, # exam3 27 | 62] # exam4 28 | ``` 29 | 30 | 用该方式需要面临一个问题就是需要处理向量的运算。由于Python的列表并非向量(因此也没有提供向量计算支持),我们需要自己实现这些计算工具,那我们就从这里开始吧。 31 | 32 | 我们会频繁使用向量相加,也就是向量的分量相加(Vectors add componentwise.),我们就从这里开始。 33 | 向量的分量相加的意思是,假设,有两个向量v和w,它们长度相同(列表中元素的个数相等),他们的和的第一个元素是 v[0] + w[0], 第二个元素是v[1] + w[1], 以此类推。(如果它们的元素个数不等,则不允许相加),举例来说,[1,2] [2,1]相加的结果为[1 + 2, 2 + 1],或者[3, 3] 如图4-1所示 34 | 35 | ![](../assets/images/C04_001.png) 36 | 37 | 38 | 在python代码中,我们可以通过zip两个向量后,使用列表推导式(list comprehension) 相加对应的元素来实现: 39 | > 译者注:[什么是列表推导式](http://www.cainiao8.com/python/basic/python_14_list_comprehension.html) 40 | 41 | ```python 42 | def vector_add(v, w): 43 | """adds corresponding elements""" 44 | return [v_i + w_i 45 | for v_i, w_i in zip(v, w)] 46 | ``` 47 | 48 | 同理,两个向量相减的运算,也就是列表对应元素进行相减: 49 | 50 | ```python 51 | def vector_subtract(v, w): 52 | """subtracts corresponding elements""" 53 | return [v_i - w_i 54 | for v_i, w_i in zip(v,w)] 55 | 56 | ``` 57 | 58 | 有时我们还需要获取一组向量的和,也就是有这样一个向量,它的第一个元素是所有向量的第一个元素之和,第二个元素是所有向量的第二个元素之和,以此类推。最简单的实现方法就是对一组向量进行逐个相加: 59 | 60 | ```python 61 | def vector_sum(vectors): 62 | """sums all corresponding elements""" 63 | result = vectors[0] # start with the first vector 64 | for vector in vectors[1:]: # then loop over the others 65 | result = vector_add(result, vector) # and add them to the result 66 | return result 67 | ``` 68 | 69 | 稍微观察一下就会发现,我们是使用`vector_add`方法来减少列表中向量的总数,这意味着我们可以用高阶函数(更高级的函数)来简化该当前的函数。 70 | 71 | ```python 72 | def vector_sum(vectors): 73 | return reduce(vector_add, vectors) 74 | ``` 75 | 76 | 或者: 77 | 78 | ```python 79 | vector_sum = partial(reduce, vector_add) 80 | ``` 81 | 82 | although this last one is probably more clever than helpful. 83 | 84 | 85 | 我们还要能支持向量乘以标量,也就是将向量中的每个元素与该标量相乘: 86 | 87 | ```python 88 | def scalar_multiply(c, v): 89 | """c is a number, v is a vector""" 90 | return [c * v_i for v_i in v] 91 | 92 | ``` 93 | 94 | 以下函数可以实现求一组长度相同的向量的分量的平均值: 95 | 96 | ```python 97 | def vector_mean(vectors): 98 | """ compute the vector whose ith element is the mean of the 99 | ith elements of the input vectors""" 100 | n = len(vectors) 101 | return scalar_multiply(1/n, vector_sum(vectors)) 102 | ``` 103 | 104 | 比较少见的是点积(dot product), 两个向量的点积也就是其分量积的总和 105 | 106 | ```python 107 | def dot(v, w): 108 | """v_1 * w_1 + ... + v_n * w_n""" 109 | return sum(v_i * w_i 110 | for v_i, w_i in zip(v,w)) 111 | ``` 112 | 113 | The dot product measures how far the vector v extends in the w direction. For example, if w = [1, 0] then dot(v, w) is just the first component of v. Another way of saying this is that it’s the length of the vector you’d get if you projected v onto w 114 | (Figure 4-2). 115 | 116 | 在w方向上的 117 | 例如:假设 w = [1, 0] 118 | 119 | ![](../assets/images/C04_002.png) 120 | 121 | 使用如下的代码,可以很简单的算出向量的平方和 122 | 123 | ```python 124 | def sum_of_squares(v): 125 | """v_1 * v_1 + ... 
+ v_n * v_n""" 126 | return dot(v, v) 127 | ``` 128 | 129 | 其计算的结果我们可以用来计算向量的长度: 130 | 131 | ```python 132 | import math 133 | def magnitude(v): 134 | return math.sqrt(sum_of_squares(v)) # math.sqrt is square root function 135 | ``` 136 | 137 | 现在,用于计算两个向量之间距离的所有组成部分都具备了: 138 | 139 | ![](../assets/images/C04_003.png) 140 | 141 | ```python 142 | def squared_distance(v, w): 143 | """(v_1 - w_1) ** 2 + ... + (v_n - w_n) ** 2""" 144 | return sum_of_squares(vector_subtract(v, w)) 145 | 146 | def distance(v, w): 147 | return math.sqrt(squared_distance(v, w)) 148 | ``` 149 | 150 | 可能这样写更加清楚些(以上的代码等同于以下的代码): 151 | 152 | ```python 153 | def distance(v, w): 154 | return magnitude(vector_subtract(v, w)) 155 | ``` 156 | 157 | 知道这些足够们开始了,本书里我们会大量使用这些函数。 158 | 159 | >用列表来阐述向量非常好,但性能却很糟糕,在实际应用中,我们会使用包含了高性能的数组类以及支持各种运算函数的NumPy库。 160 | 161 | 162 | ## 矩阵 163 | 164 | 矩阵是数值的二维集合。我们将用具有相同长度列表的列表来表示矩阵,列表中的每一个列表元素代表一行。 假设A是一个矩阵,那么A[i][j]表示第i行第j列的元素。根据数学约定,我盟将使用大写字母来表示矩阵,例如: 165 | 166 | ```python 167 | A = [[1, 2, 3], # A has 2 rows and 3 columns 168 | [4, 5, 6]] 169 | 170 | B = [[1, 2], # B has 3 rows and 2 columns 171 | [3, 4], 172 | [5, 6]] 173 | ``` 174 | 175 | > 在数学中,我们用 row 1 表示矩阵中的第一行,column 1 表示第一列,但由于我们使用Python 中的list来表示数组,list的索引是从0开始的,所以我们将矩阵的第一行成为row 0,第一列称为 column 0 176 | 177 | 鉴于我们用列表的列表的表示方法,根据其形状, 矩阵A有 len(A) 行,有len(A[0]) 列: 178 | 179 | ```python 180 | def shape(A): 181 | num_rows = len(A) 182 | num_cols = len(A[0]) if A else 0 # number of elements in first row 183 | return num_rows, num_cols 184 | ``` 185 | 186 | 假设一个矩阵有n行和k列,我们认为其是 n x k矩阵。我们可以认为 n x k 矩阵的每一行是一个长度为k的矩阵,每列是一个长度为n的向量: 187 | 188 | ```python 189 | def get_row(A, i): 190 | return A[i] # A[i] is already the ith row 191 | 192 | def get_column(A, j): 193 | return [A_i[j] # jth element of row A_i 194 | for A_i in A] # for each row A_i 195 | ``` 196 | 197 | We’ll also want to be able to create a matrix given its shape and a function for generating its elements. We can do this using a nested list comprehension: 198 | 199 | 我们需要一个可以根据给定其形状来创建一个矩阵及其元素的函数,可以通过嵌套列表来实现: 200 | 201 | 202 | 203 | ```python 204 | def make_matrix(num_rows, num_cols, entry_fn): 205 | """returns a num_rows x num_cols matrix 206 | whose (i,j)th entry is entry_fn(i, j)""" 207 | return [[entry_fn(i, j) # given i, create a list 208 | for j in range(num_cols)] # [entry_fn(i, 0), ... ] 209 | for i in range(num_rows)] # create one list for each i 210 | ``` 211 | 212 | 213 | 基于这个函数,你可以创建一个 5 x 5的单位矩阵(对角线值为1,其他是0): 214 | ```python 215 | def is_diagonal(i, j): 216 | """1's on the 'diagonal', 0's everywhere else""" 217 | return 1 if i == j else 0 218 | 219 | identity_matrix = make_matrix(5, 5, is_diagonal) 220 | # [[1, 0, 0, 0, 0], 221 | # [0, 1, 0, 0, 0], 222 | # [0, 0, 1, 0, 0], 223 | # [0, 0, 0, 1, 0], 224 | # [0, 0, 0, 0, 1]] 225 | ``` 226 | 227 | Matrices will be important to us for several reasons. 228 | First, we can use a matrix to represent a data set consisting of multiple vectors, simply by considering each vector as a row of the matrix. For example, if you had the heights, weights, and ages of 1,000 people you could put them in a 1, 000 × 3 matrix: 229 | 230 | 矩阵对我们来说很重要有几个原因。 231 | 232 | 首先,我们可以用矩阵来表示 由多个向量组成的数据集,简单将每个向量当做一行。 例如,你又1000个人的身高,体重和年龄数据,你可以将它们表示为一个 1000 x 3的矩阵: 233 | 234 | ```python 235 | data = [[70, 170, 40], 236 | [65, 120, 26], 237 | [77, 250, 19], 238 | # .... 239 | ] 240 | ``` 241 | Second, as we’ll see later, we can use an n × k matrix to represent a linear function that maps k-dimensional vectors to n-dimensional vectors. 
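> 译者注:下面是一个简单的示意代码(并非原书代码),它复用了前面定义的 `dot` 和 `get_row` 函数;函数名 `matrix_times_vector` 是为了说明而假设的:

```python
def matrix_times_vector(A, v):
    """把 n x k 矩阵 A 所代表的线性函数作用在 k 维向量 v 上,
    得到一个 n 维向量(对 A 的每一行做一次点积)"""
    return [dot(get_row(A, i), v)
            for i in range(len(A))]

# 前面定义的 2 x 3 矩阵 A 把三维向量映射成二维向量:
matrix_times_vector([[1, 2, 3],
                     [4, 5, 6]], [1, 1, 1])   # [6, 15]
```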
Several of our techniques and concepts will involve such functions. 242 | 243 | 其二,在后面后涉及到,我们可以用一个 n x k 矩阵来表示一个对应k维 向量到 n维向量的线性函数。一些技术和概念会需要这样的函数 244 | 245 | Third, matrices can be used to represent binary relationships. In Chapter 1, we represented the edges of a network as a collection of pairs (i, j). An alternative representation would be to create a matrix A such that A[i][j] is 1 if nodes i and j are connected and 0 otherwise. 246 | Recall that before we had: 247 | 248 | 第三,矩阵可以来表示二进制信息,在第一章中, 另外一种表示方法可以创造一个矩阵A,其中 A[i][j] 中如果节点i和j连接,其值是1,其他的都是0. 249 | 250 | ```python 251 | friendships = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), 252 | (4, 5), (5, 6), (5, 7), (6, 8), (7, 8), (8, 9)] 253 | ``` 254 | 255 | 我们也可以表示为: 256 | 257 | ```python 258 | # user 0 1 2 3 4 5 6 7 8 9 259 | # 260 | friendships = [[0, 1, 1, 0, 0, 0, 0, 0, 0, 0], # user 0 261 | [1, 0, 1, 1, 0, 0, 0, 0, 0, 0], # user 1 262 | [1, 1, 0, 1, 0, 0, 0, 0, 0, 0], # user 2 263 | [0, 1, 1, 0, 1, 0, 0, 0, 0, 0], # user 3 264 | [0, 0, 0, 1, 0, 1, 0, 0, 0, 0], # user 4 265 | [0, 0, 0, 0, 1, 0, 1, 1, 0, 0], # user 5 266 | [0, 0, 0, 0, 0, 1, 0, 0, 1, 0], # user 6 267 | [0, 0, 0, 0, 0, 1, 0, 0, 1, 0], # user 7 268 | [0, 0, 0, 0, 0, 0, 1, 1, 0, 1], # user 8 269 | [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]] # user 9 270 | ``` 271 | 272 | 273 | If there are very few connections, this is a much more inefficient representation, since 274 | you end up having to store a lot of zeroes. However, with the matrix representation it 275 | is much quicker to check whether two nodes are connected—you just have to do a 276 | matrix lookup instead of (potentially) inspecting every edge: 277 | 278 | 如果 关系比较少,由于需要存储大量的0, 该方法效率比较低,然而,检查两个节点是否有关联, 279 | 280 | ```python 281 | friendships[0][2] == 1 # True, 0 and 2 are friends 282 | friendships[0][8] == 1 # False, 0 and 8 are not friends 283 | ``` 284 | 285 | 286 | Similarly, to find the connections a node has, you only need to inspect the column (or the row) corresponding to that node: 287 | 288 | 类似的,查找一个节点的关系,仅仅需要查看该节点对应的列(或行) 289 | 290 | ```python 291 | friends_of_five = [i # only need 292 | for i, is_friend in enumerate(friendships[5]) # to look at 293 | if is_friend] # one row 294 | ``` 295 | 296 | Previously we added a list of connections to each node object to speed up this process, but for a large, evolving graph that would probably be too expensive and difficult to maintain. 297 | We’ll revisit matrices throughout the book. 298 | 299 | 300 | 301 | ## 进一步探索 302 | 303 | * 线性代数在被据科学家们广泛使用(对于理解的人来说,频繁使用,只不过不明显,对于不理解线性代数的人说不频繁,frequently implicitly, and not infrequently by people who don’t understand it),阅读写相关的教材是很不错的想法,可以在网上找到不少免费的资料: 304 | 305 | - [UC Davis的线性代数](https://www.math.ucdavis.edu/~linear/) 306 | - [圣迈克尔学院的 线性代数](http://joshua.smcvt.edu/linearalgebra/) 307 | - 如果你喜欢探索,[Linear Algebra Done](http://www.math.brown.edu/~treil/papers/LADW/LADW.html)网站有更加高级的介绍 308 | 309 | * 我们在本章所有创建的代码,在[NumPy](http://www.numpy.org)中都能找到(而且更多)。 310 | -------------------------------------------------------------------------------- /chapters/Chapter_05_Statistics.md: -------------------------------------------------------------------------------- 1 | # Chapter 5 统计 2 | 3 | > Facts are stubborn, but statistics are more pliable. 4 | 5 | > Mark Twain 6 | 7 | Statistics refers to the mathematics and techniques with which we understand data. It is a rich, enormous field, more suited to a shelf (or room) in a library rather than a chapter in a book, and so our discussion will necessarily not be a deep one. 
Instead, I’ll try to teach you just enough to be dangerous, and pique your interest just enough that you’ll go off and learn more. 8 | 9 | 统计是指帮助我们理解数据的数学和技术。关于其丰富广阔的内容更像需要图书馆的一个书架或房间来呈现,因此在本章的讨论内容不会深入。取而代之,我所教你的内容?!足够你感兴趣 10 | 11 | ## 描述单个数据集合 12 | 13 | Through a combination of word-of-mouth and luck, DataSciencester has grown to dozens of members, and the VP of Fundraising asks you for some sort of description of how many friends your members have that he can include in his elevator pitches. 14 | 15 | 进过口碑和运气, 数据科学家已经发展到几十个成员,筹款的副总裁问你 16 | 17 | Using techniques from Chapter 1, you are easily able to produce this data. But now you are faced with the problem of how to describe it. 18 | One obvious description of any data set is simply the data itself: 19 | 20 | 使用第一章提到的技能,你可以很简单的生成数据,但你会遇到如何描述它的问题。任何数据集的描述是对数据进行简化 21 | 22 | ```python 23 | num_friends = [100, 49, 41, 40, 25, 24 | # ... and lots more 25 | ] 26 | ``` 27 | For a small enough data set this might even be the best description. But for a larger data set, this is unwieldy and probably opaque. (Imagine staring at a list of 1 million numbers.) For that reason we use statistics to distill and communicate relevant features of our data. 28 | As a first approach you put the friend counts into a histogram using Counter and plt.bar() (Figure 5-1): 29 | 30 | 对于小块数据集用以上的方式是很好的描述方式。但对于大一些的数据,这就有些不实用且恐怕有些不透明?!(想象一下一个列表里有上百万的数值)出于那样的情况,我们使用统计来提取和表示数据的相关特性,第一个方法 31 | 如图 5-1 所示: 32 | 33 | ![](../assets/images/C05_001.png) 34 | 35 | ```python 36 | friend_counts = Counter(num_friends) 37 | xs = range(101) # largest value is 100 38 | ys = [friend_counts[x] for x in xs] # height is just # of friends 39 | plt.bar(xs, ys) 40 | plt.axis([0, 101, 0, 25]) 41 | plt.title("Histogram of Friend Counts") 42 | plt.xlabel("# of friends") 43 | plt.ylabel("# of people") 44 | plt.show() 45 | ``` 46 | 47 | Unfortunately, this chart is still too difficult to slip into conversations. So you start generating some statistics. Probably the simplest statistic is simply the number of data points: 48 | 49 | 不幸的是,图表也不能很好的说明问题,于是你开始生成统计信息。也许最简单的统计信息是数据量? 50 | 51 | ```python 52 | num_points = len(num_friends) # 204 53 | ``` 54 | 55 | You’re probably also interested in the largest and smallest values: 56 | 你也许对一组数据里的最大值和最小值感兴趣 57 | 58 | ```python 59 | largest_value = max(num_friends) # 100 60 | smallest_value = min(num_friends) # 1 61 | ``` 62 | 63 | which are just special cases of wanting to know the values in specific positions: 64 | 在某些情况下需要知道特殊位置的特殊值 65 | 66 | ```python 67 | sorted_values = sorted(num_friends) 68 | smallest_value = sorted_values[0] # 1 69 | second_smallest_value = sorted_values[1] # 1 70 | second_largest_value = sorted_values[-2] # 49 71 | ``` 72 | But we’re only getting started. 73 | 但这仅仅是开始。 74 | 75 | ### Central Tendencies 集中趋势 中值计算 76 | Usually, we’ll want some notion of where our data is centered. Most commonly we’ll use the mean (or average), which is just the sum of the data divided by its count: 77 | 78 | 通常来说,我们想知道数据集中在哪个,一半我们用平均值(或平均数),也就是所有数值的总和再除以数值的个数 79 | 80 | ```python 81 | # 如果不在顶部添加 from __future__ import division会报错 82 | def mean(x): 83 | return sum(x) / len(x) 84 | mean(num_friends) # 7.333333 85 | ``` 86 | 87 | If you have two data points, the mean is simply the point halfway between them. As you add more points, the mean shifts around, but it always depends on the value of every point. 
88 | 89 | 如果有两个数据点,平均数就是它们中间的点。当添加更多的点时,平均数点会位移,但总是和所有点得值相关。 90 | 91 | We’ll also sometimes be interested in the median, which is the middle-most value (if the number of data points is odd) or the average of the two middle-most values (if the number of data points is even). 92 | 93 | 我们有时还对中位数感兴趣,也就是最中间的那个值(如果数据的总量是奇数的话)或者中间两个数的平均值(如果数据的总量是偶数)。 94 | 95 | For instance, if we have five data points in a sorted vector x, the median is x[5 // 2] or x[2]. If we have six data points, we want the average of x[2] (the third point) and x[3] (the fourth point). 96 | 97 | 举例来说, 如果在一个有序向量X中有5个值,中位数就是x[5 //2]或者 x[2]. 如果有六个值,我们会取x[2](第三个点) 和 x[3](第4个点)的平均值。 98 | 99 | Notice that—unlike the mean—the median doesn’t depend on every value in your data. For example, if you make the largest point larger (or the smallest point smaller), the middle points remain unchanged, which means so does the median. 100 | 101 | 注意中位数不像平均数,它不依赖于数据中的所有值,例如,让更大的数据点更大(或更小的数据点更小),中间的点不变,中位数也不变。 102 | 103 | The median function is slightly more complicated than you might expect, mostly because of the “even” case: 104 | 105 | 由于数据点总数会出现偶数的情况,求中位数的函数可能比你预期的稍微复杂些: 106 | 107 | 108 | ```python 109 | def median(v): 110 | """finds the 'middle-most' value of v""" n = len(v) 111 | sorted_v = sorted(v) 112 | midpoint = n // 2 113 | if n % 2 == 1: 114 | # if odd, return the middle value 115 | return sorted_v[midpoint] 116 | else: 117 | # if even, return the average of the middle values 118 | lo = midpoint - 1 119 | hi = midpoint 120 | return (sorted_v[lo] + sorted_v[hi]) / 2 121 | median(num_friends) # 6.0 122 | ``` 123 | 124 | Clearly, the mean is simpler to compute, and it varies smoothly as our data changes. If we have n data points and one of them increases by some small amount e, then necessarily the mean will increase by e / n. (This makes the mean amenable to all sorts of calculus tricks.) Whereas in order to find the median, we have to sort our data. And changing one of our data points by a small amount e might increase the median by e, by some number less than e, or not at all (depending on the rest of the data). 125 | 126 | 显而易见的,求平均数比较易于计算,且当我们的数据变化时它也能平缓的改变。 如果我们有n个数据点,其中一个点增加了e,那么平均数增加了 e/n, 这使得求平均数运算经得起各种计算,相比于求中位数,我们必须对数据进行排序。 127 | 128 | > There are, in fact, nonobvious tricks to efficiently compute medians without sorting the data. However, they are beyond the scope of this book, so we have to sort the data. 129 | 130 | > 事实上是有高效的技巧,不需要排序就能获取中间数的。然而,不在本书的讨论范围之内,因此我们必须排序。 131 | 132 | At the same time, the mean is very sensitive to outliers in our data. If our friendliest user had 200 friends (instead of 100), then the mean would rise to 7.82, while the median would stay the same. If outliers are likely to be bad data (or otherwise unrepresentative of whatever phenomenon we’re trying to understand), then the mean can sometimes give us a misleading picture. For example, the story is often told that in the mid-1980s, the major at the University of North Carolina with the highest average starting salary was geography, mostly on account of NBA star (and outlier) Michael Jordan. 133 | 134 | 与此同时,取平均数对数据中的异常值是非常敏感的。假设我们最友好的用户有200(而不是100)个朋友,那么平均数将会上升7.82,但中位数还是一样。如果异常值正好是脏数据(或者不能表示我们想理解的情况),那么该平均数有时会误导我们。例如在1980年经常说的一个故事,北卡罗莱纳大学最高的 平均起薪是地理,主要是由于把NBA明星(异常值)迈克尔-乔丹算进去了。 135 | 136 | A generalization of the median is the quantile, which represents the value less than which a certain percentile of the data lies. (The median represents the value less than which 50% of the data lies.) 
137 | 138 | 139 | 中位数所推广的是位数,用于表示数值低于某个确定的。。 140 | 141 | ```python 142 | def quantile(x, p): 143 | """returns the pth-percentile value in x""" 144 | p_index = int(p * len(x)) 145 | return sorted(x)[p_index] 146 | quantile(num_friends, 0.10) # 1 147 | quantile(num_friends, 0.25) # 3 148 | quantile(num_friends, 0.75) # 9 149 | quantile(num_friends, 0.90) # 13 150 | ``` 151 | 152 | Less commonly you might want to look at the mode, or most-common value[s]: 153 | 154 | 比较少见的是你可能需要求取众数, 155 | 156 | ```python 157 | def mode(x): 158 | """returns a list, might be more than one mode""" 159 | counts = Counter(x) 160 | max_count = max(counts.values()) 161 | return [x_i for x_i, count in counts.iteritems() 162 | if count == max_count] 163 | 164 | mode(num_friends) # 1 and 6 165 | ``` 166 | 167 | But most frequently we’ll just use the mean. 168 | 但绝大数情况下我们只使用求平均值。 169 | 170 | ### 离散 171 | Dispersion refers to measures of how spread out our data is. Typically they’re statistics for which values near zero signify not spread out at all and for which large values (whatever that means) signify very spread out. For instance, a very simple measure is the range, which is just the difference between the largest and smallest elements: 172 | 离散是用来测量我们的数据如何分布的 173 | 174 | 例如,最简单的测量就是范围,也就是最大值和最小值之间的距离 175 | 176 | 177 | ```python 178 | # "range" already means something in Python, so we'll use a different name 179 | def data_range(x): 180 | return max(x) - min(x) 181 | data_range(num_friends) # 99 182 | ``` 183 | 184 | 185 | The range is zero precisely when the max and min are equal, which can only happen if the elements of x are all the same, which means the data is as undispersed as possible. Conversely, if the range is large, then the max is much larger than the min and the data is more spread out. 186 | 187 | 当最大值和最小值相等时距离为0,这种情况仅仅在x中所有元素都相同的情况下发生,也就意味着该数据没有离散。对应的,如果距离非常大,那么最大值比最小值大很多,数据更加分散。 188 | 189 | Like the median, the range doesn’t really depend on the whole data set. A data set whose points are all either 0 or 100 has the same range as a data set whose values are 0, 100, and lots of 50s. But it seems like the first data set “should” be more spread out. 190 | 191 | 正如众数,距离几乎不依赖于整个数据集。一个数据集里的所有点全是0或100,那么 192 | 193 | A more complex measure of dispersion is the variance, which is computed as: 194 | 195 | 更加复杂的离散测量方法是方差: 196 | 197 | 198 | 199 | 200 | ```python 201 | def de_mean(x): 202 | """translate x by subtracting its mean (so the result has mean 0)""" 203 | x_bar = mean(x) 204 | return [x_i - x_bar for x_i in x] 205 | def variance(x): 206 | """assumes x has at least two elements""" 207 | n = len(x) 208 | deviations = de_mean(x) 209 | return sum_of_squares(deviations) / (n - 1) 210 | 211 | variance(num_friends) # 81.54 212 | ``` 213 | 214 | > This looks like it is almost the average squared deviation from the mean, except that we’re dividing by n-1 instead of n. In fact, when we’re dealing with a sample from a larger population, x_bar is only an estimate of the actual mean, which means that on average (x_i - x_bar) ** 2 is an underestimate of x_i’s squared deviation from the mean, which is why we divide by n-1 instead of n. See Wikipe‐ dia. 215 | 216 | Now, whatever units our data is in (e.g., “friends”), all of our measures of central tendency are in that same unit. The range will similarly be in that same unit. The var‐ iance, on the other hand, has units that are the square of the original units (e.g., “friends squared”). 
As it can be hard to make sense of these, we often look instead at the standard deviation: 217 | 218 | ```python 219 | def standard_deviation(x): 220 | return math.sqrt(variance(x)) 221 | 222 | standard_deviation(num_friends) # 9.03 223 | ``` 224 | Both the range and the standard deviation have the same outlier problem that we saw earlier for the mean. Using the same example, if our friendliest user had instead 200 friends, the standard deviation would be 14.89, more than 60% higher! 225 | A more robust alternative computes the difference between the 75th percentile value and the 25th percentile value: 226 | ```python 227 | 228 | def interquartile_range(x): 229 | return quantile(x, 0.75) - quantile(x, 0.25) 230 | interquartile_range(num_friends) # 6 231 | ``` 232 | which is quite plainly unaffected by a small number of outliers. 233 | 234 | ## Correlation 相关性 235 | DataSciencester’s VP of Growth has a theory that the amount of time people spend on the site is related to the number of friends they have on the site (she’s not a VP for nothing), and she’s asked you to verify this. 236 | After digging through traffic logs, you’ve come up with a list daily_minutes that shows how many minutes per day each user spends on DataSciencester, and you’ve ordered it so that its elements correspond to the elements of our previous num_friends list. We’d like to investigate the relationship between these two metrics. 237 | We’ll first look at covariance, the paired analogue of variance. Whereas variance measures how a single variable deviates from its mean, covariance(协方差) measures how two variables vary in tandem from their means: 238 | 239 | ```python 240 | 241 | def covariance(x, y): 242 | n = len(x) 243 | return dot(de_mean(x), de_mean(y)) / (n - 1) covariance(num_friends, daily_minutes) # 22.43 244 | ``` 245 | Recall that dot sums up the products of corresponding pairs of elements. When corresponding elements of x and y are either both above their means or both below their means, a positive number enters the sum. When one is above its mean and the other below, a negative number enters the sum. Accordingly, a “large” positive covariance means that x tends to be large when y is large and small when y is small. A “large” negative covariance means the opposite—that x tends to be small when y is large and vice versa. A covariance close to zero means that no such relationship exists. 246 | Nonetheless, this number can be hard to interpret, for a couple of reasons: 247 | 248 | * Its units are the product of the inputs’ units (e.g., friend-minutes-per-day), which can be hard to make sense of. (What’s a “friend-minute-per-day”?) 249 | * If each user had twice as many friends (but the same number of minutes), the covariance would be twice as large. But in a sense the variables would be just as interrelated. Said differently, it’s hard to say what counts as a “large” covariance. 250 | 251 | For this reason, it’s more common to look at the correlation, which divides out the standard deviations of both variables: 252 | 253 | ```python 254 | def correlation(x, y): 255 | stdev_x = standard_deviation(x) 256 | stdev_y = standard_deviation(y) 257 | if stdev_x > 0 and stdev_y > 0: 258 | return covariance(x, y) / stdev_x / stdev_y 259 | else: 260 | return 0 # if no variation, correlation is zero 261 | 262 | correlation(num_friends, daily_minutes) # 0.25 263 | ``` 264 | The correlation is unitless and always lies between -1 (perfect anti-correlation) and 1 (perfect correlation). 
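As a quick sanity check (this snippet is not from the book; it just exercises the `correlation` function defined above on made-up data, and it assumes the `from __future__ import division` mentioned earlier in the chapter):

```python
# y is exactly a positive multiple of x: perfectly correlated
correlation([1, 2, 3, 4], [2, 4, 6, 8])    # 1.0
# y decreases exactly as x increases: perfectly anti-correlated
correlation([1, 2, 3, 4], [8, 6, 4, 2])    # -1.0
```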
A number like 0.25 represents a relatively weak positive corre‐ lation. 265 | However, one thing we neglected to do was examine our data. Check out Figure 5-2. 266 | 267 | Figure 5-2 268 | The person with 100 friends (who spends only one minute per day on the site) is a huge outlier, and correlation can be very sensitive to outliers. What happens if we ignore him? 269 | 270 | ```python 271 | outlier = num_friends.index(100) # index of outlier 272 | num_friends_good = [x 273 | for i, x in enumerate(num_friends) 274 | if i != outlier] 275 | 276 | daily_minutes_good = [x 277 | for i, x in enumerate(daily_minutes) 278 | if i != outlier] 279 | 280 | correlation(num_friends_good, daily_minutes_good) # 0.57 281 | ``` 282 | Without the outlier, there is a much stronger correlation (Figure 5-3). 283 | 除去异常值,关联性更强 284 | 285 | You investigate further and discover that the outlier was actually an internal test account that no one ever bothered to remove. So you feel pretty justified in excluding it. 286 | 287 | 你继续调查并发现那个异常值其实是一个内部测试账号 288 | 289 | 290 | ## Simpson's Paradox 291 | One not uncommon surprise when analyzing data is Simpson’s Paradox, in which correlations can be misleading when confounding variables are ignored. 292 | For example, imagine that you can identify all of your members as either East Coast data scientists or West Coast data scientists. You decide to examine which coast’s data scientists are friendlier: 293 | 294 | pic 295 | 296 | It certainly looks like the West Coast data scientists are friendlier than the East Coast data scientists. Your coworkers advance all sorts of theories as to why this might be: maybe it’s the sun, or the coffee, or the organic produce, or the laid-back Pacific vibe? 297 | When playing with the data you discover something very strange. If you only look at people with PhDs, the East Coast data scientists have more friends on average. And if you only look at people without PhDs, the East Coast data scientists also have more friends on average! 298 | 299 | pic 300 | 301 | Once you account for the users’ degrees, the correlation goes in the opposite direc‐ tion! Bucketing the data as East Coast/West Coast disguised the fact that the East Coast data scientists skew much more heavily toward PhD types. 302 | This phenomenon crops up in the real world with some regularity. The key issue is that correlation is measuring the relationship between your two variables all else being equal. If your data classes are assigned at random, as they might be in a well-designed experiment, “all else being equal” might not be a terrible assumption. But when there is a deeper pattern to class assignments, “all else being equal” can be an awful assump‐ tion. 303 | The only real way to avoid this is by knowing your data and by doing what you can to make sure you’ve checked for possible confounding factors. Obviously, this is not always possible. If you didn’t have the educational attainment of these 200 data scien‐ tists, you might simply conclude that there was something inherently more sociable about the West Coast. 304 | 305 | ## 关联性的其他注意事项 306 | A correlation of zero indicates that there is no linear relationship between the two variables. However, there may be other sorts of relationships. For example, if: 307 | ```python 308 | x = [-2, -1, 0, 1, 2] 309 | y = [2, 1,0,1,2] 310 | ``` 311 | then x and y have zero correlation. But they certainly have a relationship—each element of y equals the absolute value of the corresponding element of x. 
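(A quick check, again just exercising the functions defined earlier rather than code from the book, confirms that the computed correlation really is zero:)

```python
x = [-2, -1, 0, 1, 2]
y = [ 2,  1, 0, 1, 2]    # y_i == abs(x_i) for every i

correlation(x, y)        # 0: the products of deviations cancel pairwise, so the covariance is zero
```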
What they don’t have is a relationship in which knowing how x_i compares to mean(x) gives us information about how y_i compares to mean(y). That is the sort of relationship that correlation looks for. 312 | In addition, correlation tells you nothing about how large the relationship is. The variables: 313 | ```python 314 | x = [-2, 1, 0, 1, 2] 315 | y = [99.98, 99.99, 100, 100.01, 100.02] 316 | ``` 317 | are perfectly correlated, but (depending on what you’re measuring) it’s quite possible that this relationship isn’t all that interesting. 318 | 319 | ## Correlation and Causation 关联性与因果关系 320 | You have probably heard at some point that “correlation is not causation,” most likely by someone looking at data that posed a challenge to parts of his worldview that he was reluctant to question. Nonetheless, this is an important point—if x and y are strongly correlated, that might mean that x causes y, that y causes x, that each causes the other, that some third factor causes both, or it might mean nothing. 321 | 322 | Consider the relationship between num_friends and daily_minutes. It’s possible that having more friends on the site causes DataSciencester users to spend more time on the site. This might be the case if each friend posts a certain amount of content each day, which means that the more friends you have, the more time it takes to stay current with their updates. 323 | 324 | However, it’s also possible that the more time you spend arguing in the DataSciencester forums, the more you encounter and befriend like-minded people. That is, spending more time on the site causes users to have more friends. 325 | 326 | A third possibility is that the users who are most passionate about data science spend more time on the site (because they find it more interesting) and more actively collect data science friends (because they don’t want to associate with anyone else). 327 | 328 | One way to feel more confident about causality is by conducting randomized trials. If you can randomly split your users into two groups with similar demographics and give one of the groups a slightly different experience, then you can often feel pretty good that the different experiences are causing the different outcomes. 329 | 330 | For instance, if you don’t mind being angrily accused of [experimenting on your users](http://www.nytimes.com/2014/06/30/technology/facebook-tinkers-with-users-emotions-in-news-feed-experiment-stirring-outcry.html?_r=1), you could randomly choose a subset of your users and show them content from only a fraction of their friends. If this subset subsequently spent less time on the site, this would give you some confidence that having more friends causes more time on the site. 331 | 332 | ## For Further Exploration 333 | * [SciPy](http://docs.scipy.org/doc/scipy/reference/stats.html), [pandas](http://pandas.pydata.org), and [StatsModels](http://statsmodels.sourceforge.net) all come with a wide variety of statistical func‐ tions. 334 | 335 | * Statistics is important. (Or maybe statistics are important?) If you want to be a good data scientist it would be a good idea to read a statistics textbook. Many are freely available online. 
A couple that I like are:
336 | — [OpenIntro Statistics](https://www.openintro.org/stat/textbook.php)
337 | — [OpenStax Introductory Statistics](https://www.openstaxcollege.org/textbooks/introductory-statistics)
338 |
339 |
--------------------------------------------------------------------------------
/chapters/Chapter_06_Probability.md:
--------------------------------------------------------------------------------
1 | # Chapter 6 概率
2 | > The laws of probability, so true in general, so fallacious in particular
3 |
4 | > -- Edward Gibbon
5 |
6 | It is hard to do data science without some sort of understanding of probability and its mathematics. As with our treatment of statistics in [Chapter 5](), we'll wave our hands a lot and elide many of the technicalities.
7 | For our purposes you should think of probability as a way of quantifying the uncertainty associated with events chosen from some universe of events. Rather than getting technical about what these terms mean, think of rolling a die. The universe consists of all possible outcomes. And any subset of these outcomes is an event; for example, "the die rolls a one" or "the die rolls an even number."
8 | Notationally, we write P(E) to mean "the probability of the event E."
9 | We'll use probability theory to build models. We'll use probability theory to evaluate
10 | models. We'll use probability theory all over the place.
11 | One could, were one so inclined, get really deep into the philosophy of what probability theory means. (This is best done over beers.) We won't be doing that.
12 |
13 |
14 | ## Dependence and Independence
15 | Roughly speaking, we say that two events E and F are dependent if knowing something about whether E happens gives us information about whether F happens (and vice versa). Otherwise they are independent.
16 | For instance, if we flip a fair coin twice, knowing whether the first flip is Heads gives us no information about whether the second flip is Heads. These events are independent. On the other hand, knowing whether the first flip is Heads certainly gives us information about whether both flips are Tails. (If the first flip is Heads, then definitely it's not the case that both flips are Tails.) These two events are dependent.
17 | Mathematically, we say that two events E and F are independent if the probability that they both happen is the product of the probabilities that each one happens:
18 | ```
19 | P(E,F) = P(E)P(F)
20 | ```
21 | In the example above, the probability of "first flip Heads" is 1/2, and the probability of "both flips Tails" is 1/4, but the probability of "first flip Heads and both flips Tails" is 0.
22 |
23 |
24 | ## Conditional Probability
25 | When two events E and F are independent, then by definition we have:
26 | ```
27 | P(E,F) = P(E)P(F)
28 | ```
29 | If they are not necessarily independent (and if the probability of F is not zero), then we define the probability of E "conditional on F" as:
30 |
31 | ```
32 | P(E|F) = P(E,F)/P(F)
33 | ```
34 |
35 | You should think of this as the probability that *E* happens, given that we know that *F* happens.
36 | We often rewrite this as:
37 | ```
38 | P(E,F) = P(E|F)P(F)
39 | ```
40 | When E and F are independent, you can check that this gives:
41 | ```
42 | P(E|F) = P(E)
43 | ```
44 | which is the mathematical way of expressing that knowing F occurred gives us no
45 | additional information about whether E occurred.
46 | One common tricky example involves a family with two (unknown) children. If we assume that:
47 | 1. Each child is equally likely to be a boy or a girl
48 | 2. The gender of the second child is independent of the gender of the first child
49 |
50 | then the event "no girls" has probability 1/4, the event "one girl, one boy" has probability 1/2, and the event "two girls" has probability 1/4.
51 | Now we can ask what is the probability of the event "both children are girls" (B) conditional on the event "the older child is a girl" (G)? Using the definition of conditional probability:
52 | ```
53 | P(B|G) = P(B,G)/P(G) = P(B)/P(G) = 1/2
54 | ```
55 | since the event B and G ("both children are girls and the older child is a girl") is just the event B. (Once you know that both children are girls, it's necessarily true that the older child is a girl.)
56 | Most likely this result accords with your intuition.
57 | We could also ask about the probability of the event "both children are girls" conditional on the event "at least one of the children is a girl" (L). Surprisingly, the answer is different from before!
58 | As before, the event B and L ("both children are girls and at least one of the children is a girl") is just the event B. This means we have:
59 | ```
60 | P(B|L) = P(B,L)/P(L) = P(B)/P(L) = 1/3
61 | ```
62 |
63 | How can this be the case? Well, if all you know is that at least one of the children is a girl, then it is twice as likely that the family has one boy and one girl than that it has both girls.
64 | We can check this by "generating" a lot of families:
65 |
66 | ```python
from __future__ import division   # so the ratios printed below are floats, not truncated ints
import random

def random_kid():
    return random.choice(["boy", "girl"])

both_girls = 0
older_girl = 0
either_girl = 0

random.seed(0)
for _ in range(10000):
    younger = random_kid()
    older = random_kid()
    if older == "girl":
        older_girl += 1
    if older == "girl" and younger == "girl":
        both_girls += 1
    if older == "girl" or younger == "girl":
        either_girl += 1

print "P(both | older):", both_girls / older_girl      # 0.514 ~ 1/2
print "P(both | either): ", both_girls / either_girl   # 0.342 ~ 1/3
87 | ```
88 | ## Bayes's Theorem
89 | One of the data scientist's best friends is Bayes's Theorem, which is a way of "reversing" conditional probabilities. Let's say we need to know the probability of some event E conditional on some other event F occurring. But we only have information about the probability of F conditional on E occurring. Using the definition of conditional probability twice tells us that:
90 | ```
91 | P(E|F) = P(E,F)/P(F) = P(F|E)P(E)/P(F)
92 | ```
93 | The event F can be split into the two mutually exclusive events "F and E" and "F and
94 | not E." If we write ¬E for "not E" (i.e., "E doesn't happen"), then:
95 | ```
96 | P(F) = P(F,E) + P(F,¬E)
97 | ```
98 | so that:
99 | ```
100 | P(E|F) = P(F|E)P(E)/[P(F|E)P(E) + P(F|¬E)P(¬E)]
101 | ```
102 | which is how Bayes's Theorem is often stated.
103 | This theorem often gets used to demonstrate why data scientists are smarter than doctors. Imagine a certain disease that affects 1 in every 10,000 people. And imagine that there is a test for this disease that gives the correct result ("diseased" if you have the disease, "nondiseased" if you don't) 99% of the time.
104 | What does a positive test mean?
Let's use T for the event "your test is positive" and D for the event "you have the disease." Then Bayes's Theorem says that the probability that you have the disease, conditional on testing positive, is:
105 |
106 | ```
107 | P(D|T) = P(T|D)P(D)/[P(T|D)P(D) + P(T|¬D)P(¬D)]
108 | ```
109 | Here we know that P(T|D), the probability that someone with the disease tests positive, is 0.99. P(D), the probability that any given person has the disease, is 1/10,000 = 0.0001. P(T|¬D), the probability that someone without the disease tests positive, is 0.01. And P(¬D), the probability that any given person doesn't have the disease, is 0.9999. If you substitute these numbers into Bayes's Theorem you find
110 | ```
111 | P(D|T) = 0.98%
112 | ```
113 | That is, less than 1% of the people who test positive actually have the disease.
114 |
115 | > This assumes that people take the test more or less at random. If only people with certain symptoms take the test we would instead have to condition on the event "positive test and symptoms" and the number would likely be a lot higher.
116 |
117 | While this is a simple calculation for a data scientist, most doctors will guess that P(D|T) is approximately 2.
118 | A more intuitive way to see this is to imagine a population of 1 million people. You'd expect 100 of them to have the disease, and 99 of those 100 to test positive. On the other hand, you'd expect 999,900 of them not to have the disease, and 9,999 of those to test positive. Which means that you'd expect only 99 out of (99 + 9999) positive testers to actually have the disease.
119 | ## Random Variables
120 | A random variable is a variable whose possible values have an associated probability distribution. A very simple random variable equals 1 if a coin flip turns up heads and 0 if the flip turns up tails. A more complicated one might measure the number of heads observed when flipping a coin 10 times or a value picked from range(10) where each number is equally likely.
121 |
122 | The associated distribution gives the probabilities that the variable realizes each of its possible values. The coin flip variable equals 0 with probability 0.5 and 1 with probability 0.5. The range(10) variable has a distribution that assigns probability 0.1 to each of the numbers from 0 to 9.
123 |
124 | We will sometimes talk about the expected value of a random variable, which is the average of its values weighted by their probabilities. The coin flip variable has an expected value of 1/2 (= 0 * 1/2 + 1 * 1/2), and the range(10) variable has an expected value of 4.5.
125 |
126 | Random variables can be conditioned on events just as other events can. Going back to the two-child example from ["Conditional Probability"]() on page 70, if X is the random variable representing the number of girls, X equals 0 with probability 1/4, 1 with probability 1/2, and 2 with probability 1/4.
127 |
128 | We can define a new random variable Y that gives the number of girls conditional on at least one of the children being a girl. Then Y equals 1 with probability 2/3 and 2 with probability 1/3. And a variable Z that's the number of girls conditional on the older child being a girl equals 1 with probability 1/2 and 2 with probability 1/2.
129 |
130 | For the most part, we will be using random variables implicitly in what we do without calling special attention to them. But if you look deeply you'll see them.
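As a small illustration (a sketch of ours, not code from the book; the helper name `expected_value` is made up for this example), the expected values above can be computed directly from a dict mapping each value to its probability:

```python
def expected_value(distribution):
    """distribution maps each possible value to its probability;
    the expected value is the probability-weighted average of the values"""
    return sum(value * probability
               for value, probability in distribution.items())

coin_flip = {0: 0.5, 1: 0.5}
ten_sided = {i: 0.1 for i in range(10)}          # the range(10) variable
girls_given_at_least_one = {1: 2/3.0, 2: 1/3.0}  # the variable Y above

expected_value(coin_flip)                 # 0.5
expected_value(ten_sided)                 # 4.5
expected_value(girls_given_at_least_one)  # about 1.33
```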
131 | 132 | ## Continuous Distributions 133 | A coin flip corresponds to a discrete distribution—one that associates positive proba‐ bility with discrete outcomes. Often we’ll want to model distributions across a contin‐ uum of outcomes. (For our purposes, these outcomes will always be real numbers, although that’s not always the case in real life.) For example, the uniform distribution puts equal weight on all the numbers between 0 and 1. 134 | Because there are infinitely many numbers between 0 and 1, this means that the weight it assigns to individual points must necessarily be zero. For this reason, we represent a continuous distribution with a probability density function (pdf) such that the probability of seeing a value in a certain interval equals the integral of the density function over the interval. 135 | 136 | > If your integral calculus is rusty, a simpler way of understanding this is that if a distribution has density function f , then the proba‐ bility of seeing a value between x and x+h is approximately h * f  x  if h is small. 137 | 138 | 139 | The density function for the uniform distribution is just: 140 | ```python 141 | def uniform_pdf(x): 142 | return 1 if x >= 0 and x < 1 else 0 143 | ``` 144 | The probability that a random variable following that distribution is between 0.2 and 0.3 is 1/10, as you’d expect. Python’s random.random() is a [pseudo]random variable with a uniform density. 145 | We will often be more interested in the cumulative distribution function (cdf), which gives the probability that a random variable is less than or equal to a certain value. It’s not hard to create the cumulative distribution function for the uniform distribution (Figure 6-1): 146 | 147 | ```python 148 | 149 | def uniform_cdf(x): 150 | "returns the probability that a uniform random variable is <= x" 151 | if x < 0: return 0 # uniform random is never less than 0 152 | elif x < 1: return x # e.g. P(X <= 0.4) = 0.4 153 | else: return 1 # uniform random is always less than 1 154 | 155 | ``` 156 | Figure 157 | 158 | ## The Normal Distribution 159 | The normal distribution is the king of distributions. It is the classic bell curve–shaped distribution and is completely determined by two parameters: its mean μ (mu) and its standard deviation σ (sigma). The mean indicates where the bell is centered, and the standard deviation how “wide” it is. 160 | 161 | It has the distribution function: 162 | 163 | pic 164 | 165 | which we can implement as: 166 | ```python 167 | def normal_pdf(x, mu=0, sigma=1): 168 | sqrt_two_pi = math.sqrt(2 * math.pi) 169 | return (math.exp(-(x-mu) ** 2 / 2 / sigma ** 2) / (sqrt_two_pi * sigma)) 170 | 171 | ``` 172 | In Figure 6-2, we plot some of these pdfs to see what they look like: 173 | ```python 174 | xs = [x / 10.0 for x in range(-50, 50)] 175 | plt.plot(xs,[normal_pdf(x,sigma=1) for x in xs],'-',label='mu=0,sigma=1') plt.plot(xs,[normal_pdf(x,sigma=2) for x in xs],'--',label='mu=0,sigma=2') plt.plot(xs,[normal_pdf(x,sigma=0.5) for x in xs],':',label='mu=0,sigma=0.5') plt.plot(xs,[normal_pdf(x,mu=-1) for x in xs],'-.',label='mu=-1,sigma=1') plt.legend() 176 | plt.title("Various Normal pdfs") 177 | plt.show() 178 | ``` 179 | Figure 180 | 181 | When μ = 0 and σ = 1, it’s called the standard normal distribution. If Z is a standard 182 | normal random variable, then it turns out that: 183 | ``` 184 | X = σZ + μ 185 | ``` 186 | 187 | is also normal but with mean μ and standard deviation σ. 
Conversely, if X is a normal random variable with mean μ and standard deviation σ,
188 | ```
189 | Z = (X − μ)/σ
190 | ```
191 | is a standard normal variable.
192 |
193 | The cumulative distribution function for the normal distribution cannot be written in an "elementary" manner, but we can write it using [Python's math.erf](https://en.wikipedia.org/wiki/Error_function):
194 | ```python
def normal_cdf(x, mu=0, sigma=1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2
198 | ```
199 | Again, in Figure 6-3, we plot a few:
200 | ```python
xs = [x / 10.0 for x in range(-50, 50)]
plt.plot(xs, [normal_cdf(x, sigma=1) for x in xs], '-', label='mu=0,sigma=1')
plt.plot(xs, [normal_cdf(x, sigma=2) for x in xs], '--', label='mu=0,sigma=2')
plt.plot(xs, [normal_cdf(x, sigma=0.5) for x in xs], ':', label='mu=0,sigma=0.5')
plt.plot(xs, [normal_cdf(x, mu=-1) for x in xs], '-.', label='mu=-1,sigma=1')
plt.legend(loc=4)   # bottom right
plt.title("Various Normal cdfs")
plt.show()
205 | ```
206 | Sometimes we'll need to invert normal_cdf to find the value corresponding to a specified probability. There's no simple way to compute its inverse, but normal_cdf is continuous and strictly increasing, so we can use a [binary search](https://en.wikipedia.org/wiki/Binary_search_algorithm):
207 | ```python
def inverse_normal_cdf(p, mu=0, sigma=1, tolerance=0.00001):
    """find approximate inverse using binary search"""

    # if not standard, compute standard and rescale
    if mu != 0 or sigma != 1:
        return mu + sigma * inverse_normal_cdf(p, tolerance=tolerance)

    low_z, low_p = -10.0, 0     # normal_cdf(-10) is (very close to) 0
    hi_z, hi_p = 10.0, 1        # normal_cdf(10) is (very close to) 1
    while hi_z - low_z > tolerance:
        mid_z = (low_z + hi_z) / 2    # consider the midpoint
        mid_p = normal_cdf(mid_z)     # and the cdf's value there
        if mid_p < p:
            # midpoint is still too low, search above it
            low_z, low_p = mid_z, mid_p
        elif mid_p > p:
            # midpoint is still too high, search below it
            hi_z, hi_p = mid_z, mid_p
        else:
            break

    return mid_z
225 | ```
226 | The function repeatedly bisects intervals until it narrows in on a Z that's close enough to the desired probability.
227 |
228 |
229 | ## The Central Limit Theorem
230 | One reason the normal distribution is so useful is the central limit theorem, which says (in essence) that a random variable defined as the average of a large number of independent and identically distributed random variables is itself approximately normally distributed.
231 | In particular, if x1, ..., xn are random variables with mean μ and standard deviation σ, and if n is large, then:
232 |
233 | ```
(x1 + ... + xn) / n
```
234 |
235 | is approximately normally distributed with mean μ and standard deviation σ/√n. Equivalently (but often more usefully),
236 |
237 | ```
((x1 + ... + xn) − nμ) / (σ√n)
```
238 |
239 | is approximately normally distributed with mean 0 and standard deviation 1.
240 | An easy way to illustrate this is by looking at binomial random variables, which have
241 | two parameters n and p.
A Binomial(n,p) random variable is simply the sum of n independent Bernoulli(p) random variables, each of which equals 1 with probability p and 0 with probability 1 − p: 242 | 243 | ```python 244 | def bernoulli_trial(p): 245 | return 1 if random.random() < p else 0 246 | 247 | def binomial(n, p): 248 | return sum(bernoulli_trial(p) for _ in range(n)) 249 | ``` 250 | 251 | The mean of a Bernoulli(p) variable is p, and its standard deviation is !pic 252 | The central limit theorem says that as n gets large, a Binomial(n,p) variable is approximately a normal random variable with mean μ = np and standard deviation !pic . 253 | If we plot both, you can easily see the resemblance: 254 | 255 | ```python 256 | def make_hist(p, n, num_points): 257 | data = [binomial(n, p) for _ in range(num_points)] 258 | # use a bar chart to show the actual binomial samples 259 | histogram = Counter(data) 260 | plt.bar([x - 0.4 for x in histogram.keys()], 261 | [v / num_points for v in histogram.values()], 262 | 0.8, 263 | color='0.75') 264 | mu = p * n 265 | sigma = math.sqrt(n * p * (1 - p)) 266 | # use a line chart to show the normal approximation 267 | xs = range(min(data), max(data) + 1) 268 | ys = [normal_cdf(i + 0.5, mu, sigma) - normal_cdf(i - 0.5, mu, sigma) 269 | for i in xs] plt.plot(xs,ys) 270 | plt.title("Binomial Distribution vs. Normal Approximation") 271 | plt.show() 272 | ``` 273 | For example, when you call `make_hist(0.75, 100, 10000)`, you get the graph in Figure 6-4. 274 | !Figure 275 | The moral of this approximation is that if you want to know the probability that (say) a fair coin turns up more than 60 heads in 100 flips, you can estimate it as the proba‐ bility that a Normal(50,5) is greater than 60, which is easier than computing the Bino‐ mial(100,0.5) cdf. (Although in most applications you’d probably be using statistical software that would gladly compute whatever probabilities you want.) 276 | 277 | 278 | ## For Further Exploration 279 | * [scipy.stats](http://docs.scipy.org/doc/scipy/reference/stats.html) contains pdf and cdf functions for most of the popular probability distributions. 280 | 281 | * Remember how, at the end of [Chapter 5](), I said that it would be a good idea to study a statistics textbook? It would also be a good idea to study a probability textbook. The best one I know that’s available online is [Introduction to Probability](http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/amsbook.mac.pdf). 
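As a companion to the scipy.stats pointer above, here is a small sketch (ours, not the book's, and it assumes you have SciPy installed) that double-checks the coin-flip estimate from the central limit theorem section:

```python
from scipy.stats import binom, norm

# probability of more than 60 heads in 100 fair coin flips
exact  = 1 - binom.cdf(60, 100, 0.5)        # exact binomial tail, about 0.018
approx = 1 - norm.cdf(60, loc=50, scale=5)  # Normal(50, 5) approximation, about 0.023
```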
282 | -------------------------------------------------------------------------------- /chapters/Chapter_07_Hypothesis_and_Inference.md: -------------------------------------------------------------------------------- 1 | WIP -------------------------------------------------------------------------------- /chapters/Chapter_08_Gradient_Descent.md: -------------------------------------------------------------------------------- 1 | WIP -------------------------------------------------------------------------------- /chapters/Chapter_09_Getting_Data.md: -------------------------------------------------------------------------------- 1 | WIP -------------------------------------------------------------------------------- /chapters/Chapter_10_Working_with_Data.md: -------------------------------------------------------------------------------- 1 | WIP -------------------------------------------------------------------------------- /chapters/Chapter_11_Machine_Learning.md: -------------------------------------------------------------------------------- 1 | WIP -------------------------------------------------------------------------------- /chapters/Chapter_12_k_Nearest_Neighbors.md: -------------------------------------------------------------------------------- 1 | WIP -------------------------------------------------------------------------------- /chapters/Chapter_13_Naive_Bayes.md: -------------------------------------------------------------------------------- 1 | WIP -------------------------------------------------------------------------------- /chapters/Chapter_14_Simple_Linear_Regression.md: -------------------------------------------------------------------------------- 1 | WIP -------------------------------------------------------------------------------- /chapters/Chapter_15_Multiple_Regression.md: -------------------------------------------------------------------------------- 1 | WIP -------------------------------------------------------------------------------- /chapters/Chapter_16_Logistic_Regression.md: -------------------------------------------------------------------------------- 1 | WIP -------------------------------------------------------------------------------- /chapters/Chapter_17_Decision_Trees.md: -------------------------------------------------------------------------------- 1 | WIP -------------------------------------------------------------------------------- /chapters/Chapter_18_Neural_Networks.md: -------------------------------------------------------------------------------- 1 | WIP -------------------------------------------------------------------------------- /chapters/Chapter_19_Clustering.md: -------------------------------------------------------------------------------- 1 | WIP -------------------------------------------------------------------------------- /chapters/Chapter_20_Natural_Language_Processing.md: -------------------------------------------------------------------------------- 1 | WIP -------------------------------------------------------------------------------- /chapters/Chapter_21_Network_Analysis.md: -------------------------------------------------------------------------------- 1 | WIP -------------------------------------------------------------------------------- /chapters/Chapter_22_Recommender_Systems.md: -------------------------------------------------------------------------------- 1 | WIP -------------------------------------------------------------------------------- /chapters/Chapter_23_Database_and_SQL.md: 
-------------------------------------------------------------------------------- 1 | WIP -------------------------------------------------------------------------------- /chapters/Chapter_24_MapReduce.md: -------------------------------------------------------------------------------- 1 | WIP -------------------------------------------------------------------------------- /chapters/Chapter_25_Go_Forth_and_Do_Data_Science.md: -------------------------------------------------------------------------------- 1 | WIP --------------------------------------------------------------------------------