├── .DS_Store ├── dogs_vs_cats ├── dogvscat.png ├── PetOrNot.jpeg ├── pet_annotations.jpg └── README.md ├── sf_crime_classification ├── crime.png ├── label_vis.png ├── sf_top_crimes_map.png └── README.md ├── toxic-comment-classification ├── pics │ ├── hist.png │ └── tags.png └── README.md ├── mathematical_expression_recognition ├── example.jpg └── README.md ├── README.md ├── creating_customer_segments ├── README.md ├── .gitignore ├── visuals.py ├── cluster.csv ├── customers.csv └── customer_segments.ipynb ├── boston_housing ├── README.md ├── .gitignore ├── visuals.py ├── housing.csv └── boston_housing.ipynb ├── finding_donors ├── .gitignore ├── project_description.md ├── README.md ├── visuals.py └── finding_donors.ipynb ├── titanic_survival_exploration ├── .gitignore ├── README.md ├── titanic_visualizations.py └── titanic_survival_exploration.ipynb ├── LICENSE ├── Rossmann_Store_Sales └── README.md └── quora-question-duplicate └── README.md /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/udacity/cn-machine-learning/HEAD/.DS_Store -------------------------------------------------------------------------------- /dogs_vs_cats/dogvscat.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/udacity/cn-machine-learning/HEAD/dogs_vs_cats/dogvscat.png -------------------------------------------------------------------------------- /dogs_vs_cats/PetOrNot.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/udacity/cn-machine-learning/HEAD/dogs_vs_cats/PetOrNot.jpeg -------------------------------------------------------------------------------- /dogs_vs_cats/pet_annotations.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/udacity/cn-machine-learning/HEAD/dogs_vs_cats/pet_annotations.jpg -------------------------------------------------------------------------------- /sf_crime_classification/crime.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/udacity/cn-machine-learning/HEAD/sf_crime_classification/crime.png -------------------------------------------------------------------------------- /sf_crime_classification/label_vis.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/udacity/cn-machine-learning/HEAD/sf_crime_classification/label_vis.png -------------------------------------------------------------------------------- /toxic-comment-classification/pics/hist.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/udacity/cn-machine-learning/HEAD/toxic-comment-classification/pics/hist.png -------------------------------------------------------------------------------- /toxic-comment-classification/pics/tags.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/udacity/cn-machine-learning/HEAD/toxic-comment-classification/pics/tags.png -------------------------------------------------------------------------------- /sf_crime_classification/sf_top_crimes_map.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/udacity/cn-machine-learning/HEAD/sf_crime_classification/sf_top_crimes_map.png 
-------------------------------------------------------------------------------- /mathematical_expression_recognition/example.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/udacity/cn-machine-learning/HEAD/mathematical_expression_recognition/example.jpg -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # cn-machine-learning 2 | 3 | [点击这里](https://github.com/udacity/machine-learning/)查看项目文档的英文版本。 4 | 5 | 如果你发现任何翻译错误,或有任何建议,欢迎提交 issue 告诉我们! 6 | 7 | 8 | # Archival Note 9 | This repository is deprecated; therefore, we are going to archive it. However, learners will be able to fork it to their personal Github account but cannot submit PRs to this repository. If you have any issues or suggestions to make, feel free to: 10 | - Utilize the https://knowledge.udacity.com/ forum to seek help on content-specific issues. 11 | - Submit a support ticket along with the link to your forked repository if (learners are) blocked for other reasons. Here are the links for the [retail consumers](https://udacity.zendesk.com/hc/en-us/requests/new) and [enterprise learners](https://udacityenterprise.zendesk.com/hc/en-us/requests/new?ticket_form_id=360000279131). -------------------------------------------------------------------------------- /creating_customer_segments/README.md: -------------------------------------------------------------------------------- 1 | # 项目 3: 非监督学习 2 | ## 创建用户细分 3 | 4 | ### 安装 5 | 6 | 这个项目要求使用 **Python 3** 并且需要安装下面这些python包: 7 | 8 | - [NumPy](http://www.numpy.org/) 9 | - [Pandas](http://pandas.pydata.org) 10 | - [scikit-learn](http://scikit-learn.org/stable/) 11 | 12 | 你同样需要安装好相应软件使之能够运行[Jupyter Notebook](http://jupyter.org/)。 13 | 14 | 优达学城推荐学生安装 [Anaconda](https://www.continuum.io/downloads), 这是一个已经打包好的python发行版,它包含了我们这个项目需要的所有的库和软件。 15 | 16 | ### 代码 17 | 18 | 初始代码包含在 `customer_segments.ipynb` 这个notebook文件中。这里面有一些代码已经实现好来帮助你开始项目,但是为了完成项目,你还需要实现附加的功能。 19 | 20 | ### 运行 21 | 22 | 在命令行中,确保当前目录为 `customer_segments.ipynb` 文件夹的最顶层(目录包含本 README 文件),运行下列命令: 23 | 24 | ```jupyter notebook customer_segments.ipynb``` 25 | 26 | ​这会启动 Jupyter Notebook 并把项目文件打开在你的浏览器中。 27 | 28 | ## 数据 29 | 30 | ​这个项目的数据包含在 `customers.csv` 文件中。你能在[UCI 机器学习信息库](https://archive.ics.uci.edu/ml/datasets/Wholesale+customers)页面中找到更多信息。 31 | -------------------------------------------------------------------------------- /boston_housing/README.md: -------------------------------------------------------------------------------- 1 | # 项目1:模型评估与验证 2 | ## 波士顿房价预测 3 | 4 | ### 准备工作 5 | 6 | 这个项目需要安装 **Python3** 和以下的 Python 函数库: 7 | 8 | - [NumPy](http://www.numpy.org/) 9 | - [matplotlib](http://matplotlib.org/) 10 | - [scikit-learn](http://scikit-learn.org/stable/) 11 | 12 | 你还需要安装一个软件,以运行和编辑 [.ipynb](http://jupyter.org/) 文件。 13 | 14 | 优达学城推荐学生安装 [Anaconda](https://www.continuum.io/downloads),这是一个常用的 Python 集成编译环境,且已包含了本项目中所需的全部函数库。 15 | 16 | ### 代码 17 | 18 | 代码的模版已经在 `boston_housing.ipynb` 文件中给出。你还会用到 `visuals.py` 和名为 `housing.csv` 的数据文件来完成这个项目。我们已经为你提供了一部分代码,但还有些功能需要你来实现才能以完成这个项目。 19 | 20 | ### 运行 21 | 22 | 在终端或命令行窗口中,选定 `boston_housing/` 的目录下(包含此README文件),运行下方的命令: 23 | 24 | ```jupyter notebook boston_housing.ipynb``` 25 | 26 | 这样就能够启动 Jupyter notebook 软件,并在你的浏览器中打开文件。 27 | 28 | ### 数据 29 | 30 | 经过编辑的波士顿房价数据集有490个数据点,每个点有三个特征。这个数据集编辑自[加州大学欧文分校机器学习数据集库(数据集已下线)](https://archive.ics.uci.edu/ml/datasets.html)。 31 | 32 | **特征** 33 | 34 | 
1. `RM`: 住宅平均房间数量 35 | 2. `LSTAT`: 区域中被认为是低收入阶层的比率 36 | 3. `PTRATIO`: 镇上学生与教师数量比例 37 | 38 | **目标变量** 39 | 40 | `MEDV`: 房屋的中值价格 -------------------------------------------------------------------------------- /boston_housing/.gitignore: -------------------------------------------------------------------------------- 1 | # Mac OS 2 | .DS_Store 3 | 4 | # Byte-compiled / optimized / DLL files 5 | __pycache__/ 6 | *.py[cod] 7 | *$py.class 8 | 9 | # C extensions 10 | *.so 11 | 12 | # Distribution / packaging 13 | .Python 14 | env/ 15 | build/ 16 | develop-eggs/ 17 | dist/ 18 | downloads/ 19 | eggs/ 20 | .eggs/ 21 | lib/ 22 | lib64/ 23 | parts/ 24 | sdist/ 25 | var/ 26 | *.egg-info/ 27 | .installed.cfg 28 | *.egg 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .coverage 44 | .coverage.* 45 | .cache 46 | nosetests.xml 47 | coverage.xml 48 | *,cover 49 | .hypothesis/ 50 | 51 | # Translations 52 | *.mo 53 | *.pot 54 | 55 | # Django stuff: 56 | *.log 57 | 58 | # Sphinx documentation 59 | docs/_build/ 60 | 61 | # PyBuilder 62 | target/ 63 | 64 | #Ipython Notebook 65 | .ipynb_checkpoints 66 | -------------------------------------------------------------------------------- /finding_donors/.gitignore: -------------------------------------------------------------------------------- 1 | # Mac OS 2 | .DS_Store 3 | 4 | # Byte-compiled / optimized / DLL files 5 | __pycache__/ 6 | *.py[cod] 7 | *$py.class 8 | 9 | # C extensions 10 | *.so 11 | 12 | # Distribution / packaging 13 | .Python 14 | env/ 15 | build/ 16 | develop-eggs/ 17 | dist/ 18 | downloads/ 19 | eggs/ 20 | .eggs/ 21 | lib/ 22 | lib64/ 23 | parts/ 24 | sdist/ 25 | var/ 26 | *.egg-info/ 27 | .installed.cfg 28 | *.egg 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .coverage 44 | .coverage.* 45 | .cache 46 | nosetests.xml 47 | coverage.xml 48 | *,cover 49 | .hypothesis/ 50 | 51 | # Translations 52 | *.mo 53 | *.pot 54 | 55 | # Django stuff: 56 | *.log 57 | 58 | # Sphinx documentation 59 | docs/_build/ 60 | 61 | # PyBuilder 62 | target/ 63 | 64 | #Ipython Notebook 65 | .ipynb_checkpoints 66 | -------------------------------------------------------------------------------- /creating_customer_segments/.gitignore: -------------------------------------------------------------------------------- 1 | # Mac OS 2 | .DS_Store 3 | 4 | # Byte-compiled / optimized / DLL files 5 | __pycache__/ 6 | *.py[cod] 7 | *$py.class 8 | 9 | # C extensions 10 | *.so 11 | 12 | # Distribution / packaging 13 | .Python 14 | env/ 15 | build/ 16 | develop-eggs/ 17 | dist/ 18 | downloads/ 19 | eggs/ 20 | .eggs/ 21 | lib/ 22 | lib64/ 23 | parts/ 24 | sdist/ 25 | var/ 26 | *.egg-info/ 27 | .installed.cfg 28 | *.egg 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 
33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .coverage 44 | .coverage.* 45 | .cache 46 | nosetests.xml 47 | coverage.xml 48 | *,cover 49 | .hypothesis/ 50 | 51 | # Translations 52 | *.mo 53 | *.pot 54 | 55 | # Django stuff: 56 | *.log 57 | 58 | # Sphinx documentation 59 | docs/_build/ 60 | 61 | # PyBuilder 62 | target/ 63 | 64 | #Ipython Notebook 65 | .ipynb_checkpoints 66 | -------------------------------------------------------------------------------- /titanic_survival_exploration/.gitignore: -------------------------------------------------------------------------------- 1 | # Mac OS 2 | .DS_Store 3 | 4 | # Byte-compiled / optimized / DLL files 5 | __pycache__/ 6 | *.py[cod] 7 | *$py.class 8 | 9 | # C extensions 10 | *.so 11 | 12 | # Distribution / packaging 13 | .Python 14 | env/ 15 | build/ 16 | develop-eggs/ 17 | dist/ 18 | downloads/ 19 | eggs/ 20 | .eggs/ 21 | lib/ 22 | lib64/ 23 | parts/ 24 | sdist/ 25 | var/ 26 | *.egg-info/ 27 | .installed.cfg 28 | *.egg 29 | 30 | # PyInstaller 31 | # Usually these files are written by a python script from a template 32 | # before PyInstaller builds the exe, so as to inject date/other infos into it. 33 | *.manifest 34 | *.spec 35 | 36 | # Installer logs 37 | pip-log.txt 38 | pip-delete-this-directory.txt 39 | 40 | # Unit test / coverage reports 41 | htmlcov/ 42 | .tox/ 43 | .coverage 44 | .coverage.* 45 | .cache 46 | nosetests.xml 47 | coverage.xml 48 | *,cover 49 | .hypothesis/ 50 | 51 | # Translations 52 | *.mo 53 | *.pot 54 | 55 | # Django stuff: 56 | *.log 57 | 58 | # Sphinx documentation 59 | docs/_build/ 60 | 61 | # PyBuilder 62 | target/ 63 | 64 | #Ipython Notebook 65 | .ipynb_checkpoints 66 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2017 Udacity, Inc. 2 | 3 | Permission is hereby granted, free of charge, to any person obtaining a copy 4 | of this software and associated documentation files (the "Software"), to deal 5 | in the Software without restriction, including without limitation the rights 6 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 7 | copies of the Software, and to permit persons to whom the Software is 8 | furnished to do so, subject to the following conditions: 9 | 10 | The above copyright notice and this permission notice shall be included in all 11 | copies or substantial portions of the Software. 12 | 13 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 14 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 15 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 16 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 17 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 18 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 19 | SOFTWARE. 20 | -------------------------------------------------------------------------------- /sf_crime_classification/README.md: -------------------------------------------------------------------------------- 1 | # San Francisco Crime Classification 2 | (from Kaggle Competition) 3 | 4 | ### 题目描述 5 | ![](./crime.png) 6 | 7 | 在技术天堂盛名之下, 旧金山也是一个罪恶的温床, 在这个项目中, 你将尝试运用机器学习帮助警员判断罪案发生的类型. 8 | 判断罪案类型的意义在于, 能够帮助警局灵活且快速地调配警力以及安排优先级, 防止警力浪费, 加快破案效率. 
9 | 10 | 你会获得2003到2015的罪案记录, 你需要利用坐标, 区域, 日期, 辖区警署等特征建立分类模型, 判断罪案类型. 11 | 12 | 13 | ### 数据下载 14 | 此数据集可以从Kaggle上[下载](https://www.kaggle.com/c/sf-crime/data) 15 | 或者通过[Kaggle API](https://github.com/Kaggle/kaggle-api)获取 16 | `$ kaggle competitions download -c sf-crime` 17 | 18 | 19 | ### 建议 20 | * 本项目是一个多分类问题, 类别不均衡 21 | ![label_vis](./label_vis.png) 22 | 23 | * 你可以对罪案类型的分布进行可视化, 构建一个犯罪地图, 这将是一个很不错的锻炼 24 | ![location_vis](./sf_top_crimes_map.png) 25 | _[图片来源](https://www.kaggle.com/benhamner/san-francisco-top-crimes-map/code)_ 26 | 27 | ### 提交 28 | * PDF 报告文件(注意这不应该是notebook的导出,请按照[模板](https://github.com/nd009/capstone/blob/master/capstone_report_template.md)填写) 29 | * 项目相关代码 30 | * 包含使用的库,机器硬件,机器操作系统,训练时间等数据的 README 文档 31 | * 将你的结果提交到[kaggle](https://www.kaggle.com/c/sf-crime/leaderboard), 在报告中汇报你在公榜的分数 32 | 33 | ### 参考 34 | 相关[kaggle kernels](https://www.kaggle.com/c/sf-crime/kernels) 35 | -------------------------------------------------------------------------------- /Rossmann_Store_Sales/README.md: -------------------------------------------------------------------------------- 1 | # Forecast Rossmann Store Sales 2 | (from Kaggle Competition) 3 | 4 | 5 | ### 题目描述 6 | 7 | ![](./rossmann_banner2.png) 8 | 9 | Rossmann是欧洲的一家连锁药店。 在这个源自Kaggle比赛[Rossmann Store Sales](https://www.kaggle.com/c/rossmann-store-sales)中,我们需要根据Rossmann药妆店的信息(比如促销,竞争对手,节假日)以及在过去的销售情况,来预测Rossmann未来的销售额。 10 | 11 | ### 数据下载 12 | 此数据集可以从Kaggle上[下载](https://www.kaggle.com/c/rossmann-store-sales/data)。 13 | 14 | 15 | ### 建议 16 | 17 | * 建模第一步就是分析你的数据集,包括特征分析、预测的目标分析等等;该任务是一个回归预测类问题,这里尤其要注意预测是未来的销量; 18 | * 合理的划分你的训练集、验证集,记住你的目的是对于测试集的预测,这是需要提交到kaggle测评的; 19 | * 模型层面的话,建议尝试GBDT类模型,例如xgboost、lightgbm等模型; 20 | * 多多参考kaggle discussion,你能获得很多的优秀特征工程建议以及模型构建的技巧; 21 | 22 | 23 | ### 提交 24 | * PDF 报告文件(注意这不应该是notebook的导出,请按照[模板](https://github.com/nd009/capstone/blob/master/capstone_report_template.md)填写) 25 | * 项目相关代码 26 | * 包含使用的库,机器硬件,机器操作系统,训练时间等数据的 README 文档 27 | * 我们要求学员最低达到leaderboard private 的top 10%,对于测试集rmpse为0.11773。 28 | [kaggle 排行榜](https://www.kaggle.com/c/rossmann-store-sales/leaderboard) 29 | 30 | 31 | 32 | ### 参考 33 | 比赛第一名的[采访](http://blog.kaggle.com/2015/12/21/rossmann-store-sales-winners-interview-1st-place-gert/)及[参考资料](https://www.kaggle.com/c/rossmann-store-sales/forums/t/18024/model-documentation-1st-place)。 34 | 35 | [第三名的方案](https://github.com/entron/entity-embedding-rossmann) 36 | 37 | -------------------------------------------------------------------------------- /mathematical_expression_recognition/README.md: -------------------------------------------------------------------------------- 1 | # 算式识别 2 | 3 | ## AWS 4 | 5 | 由于此项目要求的计算量较大,建议使用亚马逊 p3.2xlarge 云服务器来完成该项目,在使用 p3 之前,你可以先用 p2.xlarge 练手,参考:[在aws上配置深度学习主机 ](https://zhuanlan.zhihu.com/p/25066187),[利用AWS学习深度学习](https://zhuanlan.zhihu.com/p/33176260)。 6 | 7 | ## 描述 8 | 9 | 使用深度学习识别一张图片中的算式。 10 | 11 | * 输入:一张彩色图片 12 | * 输出:算式的内容 13 | 14 | ## 数据 15 | 16 | 数据集可以通过这个链接下载: 17 | 18 | [https://s3.cn-north-1.amazonaws.com.cn/static-documents/nd009/MLND+Capstone/Mathematical_Expression_Recognition_train.zip](https://s3.cn-north-1.amazonaws.com.cn/static-documents/nd009/MLND+Capstone/Mathematical_Expression_Recognition_train.zip) 19 | 20 | 此数据集包含10万张图片,每张图里面都有一个算式。 21 | 22 | * 可能包含 `+-*` 三种运算符 23 | * 可能包含一对括号,也可能不包含括号 24 | * 每个字符都可能旋转,所以 `+` 号可能长得像我们平时手写的 `*` 号,不过 `*` 号有六个瓣 25 | 26 | ![](example.jpg) 27 | 28 | ## 建议 29 | 30 | 建议使用 OpenCV, tensorflow, Keras 完成该项目。其他的工具也可以尝试,比如 pytorch, mxnet 等。 31 | 32 | * [OpenCV 项目](https://github.com/opencv/opencv) 33 | * 
[tensorflow 项目主页](https://github.com/tensorflow/tensorflow) 34 | * [Keras 项目主页](https://github.com/fchollet/keras) 35 | * [OpenCV python tutorials](https://docs.opencv.org/master/d6/d00/tutorial_py_root.html) 36 | * [Keras 英文文档](https://keras.io) 37 | * [Keras 中文文档](https://keras.io/zh/) 38 | 39 | ### 建议模型 40 | 41 | 如果你不知道如何去构建你的模型,可以看看下面的论文: 42 | 43 | [An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition](https://arxiv.org/abs/1507.05717) 44 | 45 | ## 最低要求 46 | 47 | 本项目的最低要求是99%的准确率。 48 | 49 | ## 应用(可选)(推荐) 50 | 51 | 为了能够让他人使用此 OCR 服务,我们可以将模型部署到服务器上。 52 | 53 | ### 网页应用 54 | 55 | 推荐的工具: 56 | 57 | * [Flask](https://github.com/pallets/flask) 58 | * [Flask 中文文档](http://docs.jinkan.org/docs/flask/) 59 | 60 | ## 评估 61 | 62 | 你的项目会由优达学城项目评审师依照[机器学习毕业项目要求](https://review.udacity.com/#!/rubrics/1785/view)来评审。请确定你已完整的读过了这个要求,并在提交前对照检查过了你的项目。提交项目必须满足所有要求中每一项才能算作项目通过。 63 | 64 | ## 提交 65 | 66 | * PDF 报告文件 67 | * 所有代码文件(jupyter notebook) 68 | * 以上 notebook 导出的 html 文件 69 | * 包含使用的库,机器硬件,机器操作系统,训练时间等数据的 README 文档(使用 Markdown 编写) 70 | -------------------------------------------------------------------------------- /titanic_survival_exploration/README.md: -------------------------------------------------------------------------------- 1 | # 项目 0: 入门与基础 2 | ## 预测泰坦尼克号乘客幸存率 3 | 4 | ### 安装要求 5 | 这个项目要求使用 **Python 2.7** 以及安装下列python库 6 | 7 | - [NumPy](http://www.numpy.org/) 8 | - [Pandas](http://pandas.pydata.org) 9 | - [matplotlib](http://matplotlib.org/) 10 | - [scikit-learn](http://scikit-learn.org/stable/) 11 | ​ 12 | 13 | 你还需要安装和运行 [Jupyter Notebook](http://jupyter.readthedocs.io/en/latest/install.html#optional-for-experienced-python-developers-installing-jupyter-with-pip)。 14 | 15 | 16 | 优达学城推荐学生安装 [Anaconda](https://www.continuum.io/downloads),一个包含了项目需要的所有库和软件的 Python 发行版本。[这里](https://classroom.udacity.com/nanodegrees/nd002/parts/0021345403/modules/317671873575460/lessons/5430778793/concepts/54140889150923)介绍了如何安装Anaconda。 17 | 18 | 如果你使用macOS系统并且对命令行比较熟悉,可以安装[homebrew](http://brew.sh/),以及brew版python 19 | 20 | ```bash 21 | $ brew install python 22 | ``` 23 | 24 | 再用下列命令安装所需要的python库 25 | 26 | ```bash 27 | $ pip install numpy pandas matplotlib scikit-learn scipy jupyter 28 | ``` 29 | 30 | ### 代码 31 | ​ 32 | 核心代码在 `titanic_survival_exploration.ipynb` 文件中,辅助代码在 `titanic_visualizations.py` 文件中。尽管已经提供了一些代码帮助你上手,你还是需要补充些代码使得项目要求的功能能够成功实现。 33 | 34 | ### 运行 35 | ​ 36 | 在命令行中,确保当前目录为 `titanic_survival_exploration/` 文件夹的最顶层(目录包含本 README 文件),运行下列命令: 37 | 38 | ```bash 39 | $ jupyter notebook titanic_survival_exploration.ipynb 40 | ``` 41 | ​ 42 | 这会启动 Jupyter Notebook 把项目文件打开在你的浏览器中。 43 | 44 | 对jupyter不熟悉的同学可以看一下这两个链接: 45 | 46 | - [Jupyter使用视频教程](http://cn-static.udacity.com/mlnd/how_to_use_jupyter.mp4) 47 | - [为什么使用jupyter?](https://www.zhihu.com/question/37490497) 48 | ​ 49 | ​ 50 | ​ 51 | ​ 52 | ​ 53 | ​ 54 | ​ 55 | ​ 56 | ​ 57 | ​ 58 | ​ 59 | ​ 60 | ​ 61 | ​ 62 | 63 | ### 数据 64 | ​ 65 | 这个项目的数据包含在 `titanic_data.csv` 文件中。文件包含下列特征: 66 | ​ 67 | - **Survived**:是否存活(0代表否,1代表是) 68 | - **Pclass**:社会阶级(1代表上层阶级,2代表中层阶级,3代表底层阶级) 69 | - **Name**:船上乘客的名字 70 | - **Sex**:船上乘客的性别 71 | - **Age**:船上乘客的年龄(可能存在 `NaN`) 72 | - **SibSp**:乘客在船上的兄弟姐妹和配偶的数量 73 | - **Parch**:乘客在船上的父母以及小孩的数量 74 | - **Ticket**:乘客船票的编号 75 | - **Fare**:乘客为船票支付的费用 76 | - **Cabin**:乘客所在船舱的编号(可能存在 `NaN`) 77 | - **Embarked**:乘客上船的港口(C 代表从 Cherbourg 登船,Q 代表从 Queenstown 登船,S 代表从 Southampton 登船) 78 | -------------------------------------------------------------------------------- 
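补充示例(非官方模板代码):下面是一个最小化的使用示例草稿,演示上面 `titanic_survival_exploration` README 中提到的辅助函数的典型调用方式。其中文件名 `titanic_data.csv`、列名以及 `survival_stats` 的参数形式均取自该 README 与 `titanic_visualizations.py`,具体的 notebook 单元格内容请以课程模板为准。

```python
# Minimal usage sketch (illustrative only, not the official notebook cells).
import pandas as pd
from titanic_visualizations import survival_stats

# Load the dataset and separate the outcome column from the remaining features
full_data = pd.read_csv("titanic_data.csv")
outcomes = full_data["Survived"]
data = full_data.drop("Survived", axis=1)

# Plot survival counts by sex, then by passenger class for female passengers only
survival_stats(data, outcomes, "Sex")
survival_stats(data, outcomes, "Pclass", ["Sex == 'female'"])
```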
/finding_donors/project_description.md: -------------------------------------------------------------------------------- 1 | # 内容: 监督学习 2 | ## 项目:为CharityML寻找捐献者 3 | 4 | ## 项目概况 5 | 在这个项目中,你将使用监督技术和分析能力对美国人口普查数据进行分析,以帮助CharityML(一个虚拟的慈善机构)识别最有可能向他们捐款的人,你将首先探索数据以了解人口普查数据是如何记录的。接下来,你将使用一系列的转换和预处理技术以将数据整理成能用的形式。然后,你将在这个数据上评价你选择的几个算法,然后考虑哪一个是最合适的。之后,你将优化你现在为CharityML选择的模型。最后,你将探索选择的模型和它的预测能力。 6 | 7 | ## 项目亮点 8 | 这个项目设计成帮助你熟悉在sklearn中能够使用的多个监督学习算法,并提供一个评价模型在某种类型的数据上表现的方法。在机器学习中准确理解在什么时候什么地方应该选择什么算法和不应该选择什么算法是十分重要的。 9 | 10 | 完成这个项目你将学会以下内容: 11 | - 知道什么时候应该使用预处理以及如何做预处理。 12 | - 如何为问题设置一个基准。 13 | - 判断在一个特定的数据集上几个监督学习算法的表现如何。 14 | - 调查候选的解决方案模型是否足够解决问题。 15 | 16 | ## 软件要求 17 | 18 | 这个项目要求使用 Python 2.7 并且需要安装下面这些python包: 19 | 20 | - [Python 2.7](https://www.python.org/download/releases/2.7/) 21 | - [NumPy](http://www.numpy.org/) 22 | - [Pandas](http://pandas.pydata.org/) 23 | - [scikit-learn](http://scikit-learn.org/stable/) 24 | - [matplotlib](http://matplotlib.org/) 25 | 26 | 你同样需要安装好相应软件使之能够运行 [iPython Notebook](http://ipython.org/notebook.html) 27 | 28 | 优达学城推荐学生安装[Anaconda](https://www.continuum.io/downloads), 这是一个已经打包好的python发行版,它包含了我们这个项目需要的所有的库和软件。请注意你安装的是2.7而不是3.X 29 | 30 | ## 开始项目 31 | 32 | 对于这个项目,你能够在**Resources**部分找到一个能下载的`find_donors.zip`。*你也可以访问我们的[机器学习项目GitHub](https://github.com/udacity/machine-learning)获取我们纳米学位中的所有项目* 33 | 34 | 这个项目包含以下文件: 35 | 36 | - `find_donors.ipynb`: 这是你需要工作的主要的文件。 37 | - `census.csv`: 项目使用的数据集,你将需要在notebook中载入这个数据集。 38 | - `visuals.py`: 一个实现了可视化功能的Python代码。不要修改它。 39 | 40 | 在终端或命令提示符中,导航到包含项目文件的文件夹,使用命令`jupyter notebook finding_donors.ipynb`以在一个浏览器窗口或一个标签页打开notebook文件。或者你也可以使用命令`jupyter notebook`或者`ipython notebook`然后在打开的网页中导航到需要的文件夹。跟随notebook中的指引,回答每一个问题以成功完成项目。在这个项目中我们也提供了一个**README**文件,其中也包含了你在这个项目中需要了解的信息或者指引。 41 | 42 | ## 提交项目 43 | 44 | ### 评价 45 | 你的项目会由Udacity项目评审师根据**为CharityML寻找捐献者项目量规**进行评审。请注意仔细阅读这份量规并在提交前进行全面的自我评价。这份量规中涉及的所有条目必须全部被标记成*meeting specifications*你才能通过。 46 | 47 | ### 需要提交的文件 48 | 当你准备好提交你的项目的时候,请收集以下的文件,并把他们压缩进单个压缩包中上传。或者你也可以在你的GitHub Repo中的一个名叫`finding_donors`的文件夹中提供以下文件以方便检查: 49 | - 回答了所有问题并且所有的代码单元被执行并显示了输出结果的`finding_donors.ipynb`文件。 50 | - 一个从项目的notebook文件中导出的命名为**report.html**的**HTML**文件。这个文件*必须*提供。 51 | 52 | 一旦你收集好了这些文件,并阅读了项目量规,请进入项目提交页面。 53 | 54 | ### 我准备好了! 55 | 当你准备好提交项目的时候,点击页面底部的**提交项目**按钮。 56 | 57 | 如果你提交项目中遇到任何问题或者是希望检查你的提交的进度,请给**machine-support@udacity.com**发邮件,或者你可以访问论坛. 58 | 59 | ### 然后? 
60 | 当你的项目评审师给你回复之后你会马上收到一封通知邮件。在等待的同时你也可以开始准备下一个项目,或者学习相关的课程。 -------------------------------------------------------------------------------- /quora-question-duplicate/README.md: -------------------------------------------------------------------------------- 1 | ## Quora句子相似度匹配 2 | 3 | ### 准备工作 4 | 5 | 6 | 优达学城推荐学生安装 [Anaconda](https://www.continuum.io/downloads),这是一个常用的Python集成编译环境,且已包含了本项目中所需的全部函数库。我们在P0项目中也有讲解[如何搭建学习环境](https://github.com/nd009/titanic_survival_exploration/blob/master/README.md)。 7 | 8 | ### 题目描述 9 | 10 | [Quora Querstion Pairs数据集](https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs)是Quora于2017年公开的句子匹配数据集,其通过**给定两个句子的一致性标签标注,从而来判断句子是否一致。** 11 | 12 | Quora 数据集训练集共包含40K的句子对,且其完全来自于Quora网站自身,Quora在发布数据集的同时,在Kaggle平台,发起了[Quora句子相似度匹配大赛](https://www.kaggle.com/c/quora-question-pairs),共有3307支队伍参加了本次句子相似度匹配大赛,参赛队伍不仅包括来自麻省理工学院、伦敦大学学院、北京大学、清华大学、中科院计算所等高校研究所,也包括了来自微软、Airbnb、IBM等工业界的人员。 13 | 14 | 我们这里要求你使用Kaggle端的数据集,其由Train,Test两部分构成,你需要通过在Train数据集上进行验证集划分、建模,在Test数据集上进行测试,并且提交到Kaggle进行测评。 15 | 16 | **【IMPORTANT】**数据集下载地址:[Quora DataSet](https://www.kaggle.com/c/quora-question-pairs),请注册kaggle之后,点击同意加入比赛,方可下载数据集。 17 | 18 | 数据集描述: 19 | 20 | 21 | | | Train | Test | 22 | | ------ | :-----: | :----: | 23 | | Data Size | 404290 | 2345796 | 24 | | Vocab Size | 95603 | 101049 | 25 | 26 | 27 | ### 题目特点 28 | 29 | 句子相似度匹配涉及了海量的自然语言处理领域的基本任务,其是检索任务、对话任务、分类任务的基础。通过该赛题你可以**有选择的**学习到自然语言处理领域方方面面的知识,并且能为你在自然语言处理相关的实际工作中提供思路。 30 | 31 | 32 | ### 预备知识 33 | 34 | * NLP基础,包括词袋模型、TF-IDF算法、主题模型(PCA、LDA、NMF) 35 | * 相关模型,包括Logistic Regression,GBDT(Xgboost,lightgbm),RandomForest 36 | * 句子相似度测度, 包括余弦相似度、编辑距离、Word Mover Distance 37 | * 这里学员可以根据自己的基础,选择是否使用深度学习类模型,深度类模型对于该题并不是必须的 38 | 39 | 部分参考链接: 40 | 1. 特征工程: 41 | * 一个非常完善的特征工程思路: https://www.linkedin.com/pulse/duplicate-quora-question-abhishek-thakur/ 42 | 43 | * 特征工程部分的参考代码: https://github.com/abhishekkrthakur/is_that_a_duplicate_quora_question 44 | 45 | * 该比赛的一个解决方案: https://github.com/qqgeogor/kaggle-quora-solution-8th 46 | 47 | 2. 
深度类模型(可选) 48 | * ARC I https://arxiv.org/abs/1503.03244 49 | * Siamese-CNN https://arxiv.org/pdf/1702.03814.pdf 50 | * Siamese-LSTM https://arxiv.org/pdf/1702.03814.pdf 51 | * Multi-Perspective-CNN https://arxiv.org/pdf/1702.03814.pdf 52 | 53 | 54 | ### 建议 55 | 56 | 在撰写报告的时候,可以侧重于理论知识方面的论述,例如关于NLP的一些基本知识(词袋模型,主题模型,TF-IDF等),以及非常常用的模型算法,例如Logistic Regression, GBDT(Xgboost,lightgbm), RandomForest;以及基本的相似度度量方法等。在你的报告你可以加上深度类模型,即使你在试验部分没有进行建模,这也将有助于你了解学习最新的相关知识。 57 | 58 | 在进行算法试验的过程中,注意对于你的特征工程有较好的记录,例如整理成表格形式;另外也建议不同类型的特征工程的生成代码可以独立开,这样更加方便你的调试。 59 | 60 | 模型融合部分,可以尝试最简单的加权平均方法,也可以直接使用更加复杂的Stacking[这里是一个较好的STACKING代码框架](https://github.com/qqgeogor/kaggle-quora-solution-8th),相关模型融合资料可以参考[这里](https://mlwave.com/kaggle-ensembling-guide/) 61 | 62 | ### 要求 63 | * PDF 报告文件(注意这不应该是notebook的导出,请按照[模板](https://github.com/nd009/capstone/blob/master/capstone_report_template.md)填写) 64 | * 项目相关代码 65 | 66 | * 包含使用的库,机器硬件,机器操作系统,训练时间等数据的 README 文档 67 | 68 | * 你的最优提交分数需要达到kaggle private leaderboard 的top 20%,对于该题目的就是660th/3307,对应logloss得分为0.18267 69 | 70 | * 符合Udacity的[项目要求](https://review.udacity.com/#!/rubrics/273/view)。 71 | 72 | 73 | 74 | 75 | -------------------------------------------------------------------------------- /finding_donors/README.md: -------------------------------------------------------------------------------- 1 | # 机器学习纳米学位 2 | # 监督学习 3 | ## 项目: 为CharityML寻找捐献者 4 | ### 安装 5 | 6 | 这个项目需要安装下面这些python包: 7 | 8 | - [NumPy](http://www.numpy.org/) 9 | - [Pandas](http://pandas.pydata.org/) 10 | - [scikit-learn](http://scikit-learn.org/stable/) 11 | - [matplotlib](http://matplotlib.org/) 12 | 13 | 你同样需要安装好相应软件使之能够运行 [iPython Notebook](http://ipython.org/notebook.html) 14 | 15 | 优达学城推荐学生安装[Anaconda](https://www.continuum.io/downloads), 这是一个已经打包好的python发行版,它包含了我们这个项目需要的所有的库和软件。 16 | 17 | ### 代码 18 | 19 | 初始代码包含在`finding_donors.ipynb`这个notebook文件中。你还会用到`visuals.py`和名为`census.csv`的数据文件来完成这个项目。我们已经为你提供了一部分代码,但还有些功能需要你来实现才能以完成这个项目。 20 | 这里面有一些代码已经实现好来帮助你开始项目,但是为了完成项目,你还需要实现附加的功能。 21 | 注意包含在`visuals.py`中的代码设计成一个外部导入的功能,而不是打算学生去修改。如果你对notebook中创建的可视化感兴趣,你也可以去查看这些代码。 22 | 23 | 24 | ### 运行 25 | 在命令行中,确保当前目录为 `finding_donors/` 文件夹的最顶层(目录包含本 README 文件),运行下列命令: 26 | 27 | ```bash 28 | jupyter notebook finding_donors.ipynb 29 | ``` 30 | 31 | ​这会启动 Jupyter Notebook 并把项目文件打开在你的浏览器中。 32 | 33 | ### 数据 34 | 35 | 修改的人口普查数据集含有将近32,000个数据点,每一个数据点含有13个特征。这个数据集是Ron Kohavi的论文*"Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid",*中数据集的一个修改版本。你能够在[这里](https://www.aaai.org/Papers/KDD/1996/KDD96-033.pdf)找到论文,在[UCI的网站](https://archive.ics.uci.edu/ml/datasets/Census+Income)找到原始数据集。 36 | 37 | **特征** 38 | 39 | - `age`: 一个整数,表示被调查者的年龄。 40 | - `workclass`: 一个类别变量表示被调查者的通常劳动类型,允许的值有 {Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked} 41 | - `education_level`: 一个类别变量表示教育程度,允许的值有 {Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool} 42 | - `education-num`: 一个整数表示在学校学习了多少年 43 | - `marital-status`: 一个类别变量,允许的值有 {Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse} 44 | - `occupation`: 一个类别变量表示一般的职业领域,允许的值有 {Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces} 45 | - `relationship`: 一个类别变量表示家庭情况,允许的值有 {Wife, Own-child, Husband, Not-in-family, 
Other-relative, Unmarried} 46 | - `race`: 一个类别变量表示人种,允许的值有 {White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black} 47 | - `sex`: 一个类别变量表示性别,允许的值有 {Female, Male} 48 | - `capital-gain`: 连续值。 49 | - `capital-loss`: 连续值。 50 | - `hours-per-week`: 连续值。 51 | - `native-country`: 一个类别变量表示原始的国家,允许的值有 {United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands} 52 | 53 | **目标变量** 54 | 55 | - `income`: 一个类别变量,表示收入属于那个类别,允许的值有 {<=50K, >50K} -------------------------------------------------------------------------------- /toxic-comment-classification/README.md: -------------------------------------------------------------------------------- 1 | ## Jigsaw恶毒评论分类 2 | 3 | 4 | ### 准备工作 5 | 6 | 7 | 优达学城推荐学生安装 [Anaconda](https://www.continuum.io/downloads),这是一个常用的Python集成编译环境,且已包含了本项目中所需的大部分函数库。我们在P0项目中也有讲解[如何搭建学习环境](https://github.com/nd009/titanic_survival_exploration/blob/master/README.md)。 8 | 9 | 此题需要进行深度学习模型的建模,推荐使用Keras2.0.8以上版本; 10 | 11 | 预训练词向量对于此题特别关键,推荐下载GloVe fastText word2vec各自最大的词向量版本,下载地址如下: 12 | * [GloVe](https://nlp.stanford.edu/projects/glove/) 13 | * [fastText](https://fasttext.cc/docs/en/english-vectors.html) 14 | * [word2vec](https://github.com/3Top/word2vec-api#where-to-get-a-pretrained-models) 15 | 16 | 17 | ### 题目描述 18 | 19 | Jigsaw(前身为Google ideas)在kaggle平台上举办了一场[文本分类比赛](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge#description),旨在对于网络社区部分恶毒评论进行区分鉴别。在该赛题中,你需要建立一个可以区分不同类型的言语攻击行为的模型,该赛题一共提供了toxic,severe_toxic,obscene,threat,insult,identity_hate这六种分类标签,你需要根据提供的训练数据进行模型训练学习。 20 | 21 | 22 | 我们这里要求你使用Kaggle端的数据集,其由Train,Test两部分构成,你需要通过在Train数据集上进行验证集划分、建模,在Test数据集上进行测试,并且提交到Kaggle进行测评。 23 | 24 | 数据集下载链接:https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data 25 | 26 | 27 | 28 | ### 题目特点 29 | 30 | 1. 非平衡数据集,很显然该题是一个严重不平衡的数据集 31 | ![训练集中不同标签的分布不一致](pics/hist.png) 32 | 2. 多类别标签,也就是一个样本可能被归于多于一个的标签,例如对于某评论,其既涉及言语辱骂,也带有威胁性质,就可以标注为toxic,threat 33 | ![多类别标签](pics/tags.png) 34 | 35 | 36 | ### 预备知识 37 | 38 | 1. 了解词向量模型,例如Word2vec,GloVe,fastText 39 | 2. 了解一维卷积神经网络,递归神经网络 40 | 3. 传统类模型,例如词袋模型,N-gram,tfidf 41 | 42 | 43 | 44 | ### 建议 45 | 46 | 在撰写报告的时候,可以侧重于理论知识方面的论述,包括但不限于: 47 | 1. 文本的两种基本表征方式(词向量模型,词袋模型) 48 | 2. 深度神经网络中不同优化器之间的区别(SGD,Adam,RMSprop等) 49 | 3. 深度类模型的综述、TextCNN,LSTM等 50 | 4. 基于词袋模型+tfidf+lsvc的传统模型介绍 51 | 52 | 在进行算法试验的过程中,一定要注意记录你的网络调参过程,该题对于深度类网络的调参是有一定要求的,另外,该题需要搭建GPU环境。 53 | 54 | 模型融合部分,可以尝试最简单的加权平均方法,也可以直接使用更加复杂的Stacking,相关模型融合资料可以参考[这里](https://mlwave.com/kaggle-ensembling-guide/) 55 | 56 | ### 要求 57 | * PDF 报告文件(注意这不应该是notebook的导出,请按照[模板](https://github.com/nd009/capstone/blob/master/capstone_report_template.md)填写) 58 | * 项目相关代码 59 | 60 | * 包含使用的库,机器硬件,机器操作系统,训练时间等数据的 README 文档 61 | 62 | * 这里我们要求你的最优提交分数需要达到kaggle private leaderboard 的top 20%,且对于单模型分数有一定的要求,你需要关注于你的单模型构建,并且在报告你给出你最高的单模型线上得分。 63 | 64 | * 此题主要关注于NLP领域深度类模型的构建,关注点是学生对于深度类模型在NLP领域的应用。而另外一个capstone(quora question duplicate)则考察学生对于NLP基本知识的综合应用,涉及范围较该题更加广。 65 | 66 | * 考虑到该课题的深度性,学员需要进行**不少于两个**的基于预训练词向量的深度类型模型的建模以及**不少于一个**的传统词袋模型建模。在你的报告中,你需要阐述清晰你所使用的模型和词向量。**不可以直接使用他人的代码** 67 | 68 | * 符合Udacity的[项目要求](https://review.udacity.com/#!/rubrics/273/view)。 69 | 70 | 71 | ### 部分解决方案参考 72 | 73 | 1. 
[词袋模型 + LR 解决方案](https://www.kaggle.com/tunguz/logistic-regression-with-words-and-char-n-grams) 74 | 2. [TextCNN 解决方案](https://www.kaggle.com/yekenot/textcnn-2d-convolution) 75 | 3. [LSTM + Attention 解决方案](https://www.kaggle.com/qqgeogor/keras-lstm-attention-glove840b-lb-0-043) 76 | 4. [GRU 解决方案](https://www.kaggle.com/prashantkikani/pooled-gru-with-preprocessing) 77 | 5. [GRU + CNN 解决方案](https://www.kaggle.com/konohayui/bi-gru-cnn-poolings) 78 | 79 | 80 | 81 | -------------------------------------------------------------------------------- /dogs_vs_cats/README.md: -------------------------------------------------------------------------------- 1 | # 猫狗大战 2 | 3 | ## 注意:请不要直接使用网上公开的代码 4 | 5 | [Dogs vs. Cats Redux: Kernels Edition 6 | ](https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition) 7 | 8 | ![](dogvscat.png) 9 | 10 | ## AWS 11 | 12 | 由于此项目要求的计算量较大,建议使用亚马逊 p3.2xlarge 云服务器来完成该项目,在使用 p3 之前,你可以先用 p2.xlarge 练手,参考:[在aws上配置深度学习主机 ](https://zhuanlan.zhihu.com/p/25066187),[利用AWS学习深度学习](https://zhuanlan.zhihu.com/p/33176260)。 13 | 14 | ## 描述 15 | 16 | 使用深度学习方法识别一张图片是猫还是狗。 17 | 18 | * 输入:一张彩色图片 19 | * 输出:是猫还是狗 20 | 21 | ## 数据 22 | 23 | 此数据集可以从 kaggle 上下载。[Dogs vs. Cats Redux: Kernels Edition](https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition/data) 24 | 25 | 此外还有一个数据集也非常好,可以作为扩充数据集或是做检测/分割问题:[The Oxford-IIIT Pet Dataset](http://www.robots.ox.ac.uk/~vgg/data/pets/) 26 | 27 | ![](pet_annotations.jpg) 28 | 29 | ## 建议 30 | 31 | 建议使用 OpenCV, tensorflow, Keras 完成该项目。其他的工具也可以尝试,比如 pytorch, mxnet 等。 32 | 33 | * [OpenCV 项目](https://github.com/opencv/opencv) 34 | * [tensorflow 项目主页](https://github.com/tensorflow/tensorflow) 35 | * [Keras 项目主页](https://github.com/fchollet/keras) 36 | * [OpenCV python tutorials](https://docs.opencv.org/master/d6/d00/tutorial_py_root.html) 37 | * [Keras 英文文档](https://keras.io) 38 | * [Keras 中文文档](https://keras.io/zh/) 39 | 40 | ### 建议模型 41 | 42 | 如果你不知道如何去构建你的模型,可以尝试以下的模型,后面的数字代表年份和月份: 43 | 44 | * [VGGNet](https://arxiv.org/abs/1409.1556) 14.09 45 | * [ResNet](https://arxiv.org/abs/1512.03385) 15.12 46 | * [Inception v3](https://arxiv.org/abs/1512.00567) 15.12 47 | * [InceptionResNetV2](https://arxiv.org/abs/1602.07261) 16.02 48 | * [DenseNet](https://arxiv.org/abs/1608.06993) 16.08 49 | * [Xception](https://arxiv.org/abs/1610.02357) 16.10 50 | * [NASNet](https://arxiv.org/abs/1707.07012) 17.07 51 | 52 | 参考 Keras 文档:[Documentation for individual models](https://keras.io/applications/#documentation-for-individual-models) 53 | 54 | ## 最低要求 55 | 56 | 本项目的最低要求是 kaggle Public Leaderboard 前 10%。 57 | 58 | 参考链接:[https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition/leaderboard](https://www.kaggle.com/c/dogs-vs-cats-redux-kernels-edition/leaderboard) 59 | 60 | ## 应用(可选)(推荐) 61 | 62 | 应用形式多种多样,可以是在本地调用摄像头跑的程序,也可以网页的,也可以是 iOS APP 或 Android APP,甚至可以是微信公众号或微信小程序。 63 | 64 | ### 网页应用 65 | 66 | 推荐的工具: 67 | 68 | * [Flask](https://github.com/pallets/flask) 69 | * [Flask 中文文档](http://docs.jinkan.org/docs/flask/) 70 | 71 | ### 微信公众号 72 | 73 | 可以参考这个例子:[微信数字识别](https://github.com/ypwhs/wechat_digit_recognition)。 74 | 75 | 网页接口部分可以参考 [Flask](https://github.com/pallets/flask) 而不必用 python cgi。 76 | 77 | 最新建议:可以使用小程序而不是公众号,小程序更合适。 78 | 79 | ### iOS 80 | 81 | 如果你使用 Keras 完成该项目,可以直接使用 Apple 提供的 [Core ML Tools](https://developer.apple.com/documentation/coreml/converting_trained_models_to_core_ml) 把训练出来的 Keras 模型直接转为 iOS 可以使用的模型。 82 | 83 | 当然在 iOS 平台上你也可以使用 [MetalPerformanceShaders](https://developer.apple.com/reference/metalperformanceshaders) 来实现卷积神经网络。 84 | 85 | 这里有一个 
[Inception v3](https://github.com/shu223/iOS-10-Sampler/blob/master/iOS-10-Sampler/Samples/Inception3Net.swift) 在 iOS 上跑的例子,你可以参考,不过我们还是建议直接用上面的工具将 Keras 的模型转为 iOS 直接可以使用的模型。 86 | 87 | ![](https://raw.githubusercontent.com/shu223/iOS-10-Sampler/master/README_resources/imagerecog.gif) 88 | 89 | OpenCV 的 iOS Framework 文件可以直接在这里下载:[OpenCV releases](https://github.com/opencv/opencv/releases)。这里有一份教程,可以轻松入门:[turorial_hello](https://docs.opencv.org/master/d7/d88/tutorial_hello.html) 90 | 91 | 最终效果可以参考这个 app :[PetOrNot](https://itunes.apple.com/cn/app/petornot/id926645155) 92 | 93 | ![PetOrNot](PetOrNot.jpeg) 94 | 95 | ### Android 96 | 97 | 在 Android 上运行 tensorflow 可以参考 [android tensorflow](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/android)。 98 | 99 | 在 Android 上运行 OpenCV 可以参考 [OpenCV4Android SDK](http://docs.opencv.org/master/da/d2a/tutorial_O4A_SDK.html)。 100 | 101 | ## 评估 102 | 103 | 你的项目会由优达学城项目评审师依照[机器学习毕业项目要求](https://review.udacity.com/#!/rubrics/1785/view)来评审。请确定你已完整的读过了这个要求,并在提交前对照检查过了你的项目。提交项目必须满足所有要求中每一项才能算作项目通过。 104 | 105 | ## 提交 106 | 107 | * PDF 报告文件 108 | * 数据预处理代码(jupyter notebook) 109 | * 模型训练代码(jupyter notebook) 110 | * 以上 notebook 导出的 html 文件 111 | * 应用代码(可选) 112 | * 包含使用的库,机器硬件,机器操作系统,训练时间等数据的 README 文档(建议使用 Markdown ) 113 | -------------------------------------------------------------------------------- /boston_housing/visuals.py: -------------------------------------------------------------------------------- 1 | ########################################### 2 | # Suppress matplotlib user warnings 3 | # Necessary for newer version of matplotlib 4 | import warnings 5 | warnings.filterwarnings("ignore", category = UserWarning, module = "matplotlib") 6 | # 7 | # Display inline matplotlib plots with IPython 8 | from IPython import get_ipython 9 | get_ipython().run_line_magic('matplotlib', 'inline') 10 | ########################################### 11 | 12 | import matplotlib.pyplot as pl 13 | import numpy as np 14 | from sklearn.model_selection import learning_curve, validation_curve 15 | from sklearn.tree import DecisionTreeRegressor 16 | from sklearn.model_selection import ShuffleSplit, train_test_split 17 | 18 | 19 | def ModelLearning(X, y): 20 | """ Calculates the performance of several models with varying sizes of training data. 21 | The learning and validation scores for each model are then plotted. 
""" 22 | 23 | # Create 10 cross-validation sets for training and testing 24 | cv = ShuffleSplit(n_splits = 10, test_size = 0.2, random_state = 0) 25 | 26 | 27 | # Generate the training set sizes increasing by 50 28 | train_sizes = np.rint(np.linspace(1, X.shape[0]*0.8 - 1, 9)).astype(int) 29 | 30 | # Create the figure window 31 | fig = pl.figure(figsize=(10,7)) 32 | 33 | # Create three different models based on max_depth 34 | for k, depth in enumerate([1,3,6,10]): 35 | 36 | # Create a Decision tree regressor at max_depth = depth 37 | regressor = DecisionTreeRegressor(max_depth = depth) 38 | 39 | # Calculate the training and testing scores 40 | sizes, train_scores, valid_scores = learning_curve(regressor, X, y, \ 41 | cv = cv, train_sizes = train_sizes, scoring = 'r2') 42 | 43 | # Find the mean and standard deviation for smoothing 44 | train_std = np.std(train_scores, axis = 1) 45 | train_mean = np.mean(train_scores, axis = 1) 46 | valid_std = np.std(valid_scores, axis = 1) 47 | valid_mean = np.mean(valid_scores, axis = 1) 48 | 49 | # Subplot the learning curve 50 | ax = fig.add_subplot(2, 2, k+1) 51 | ax.plot(sizes, train_mean, 'o-', color = 'r', label = 'Training Score') 52 | ax.plot(sizes, valid_mean, 'o-', color = 'g', label = 'Validation Score') 53 | ax.fill_between(sizes, train_mean - train_std, \ 54 | train_mean + train_std, alpha = 0.15, color = 'r') 55 | ax.fill_between(sizes, valid_mean - valid_std, \ 56 | valid_mean + valid_std, alpha = 0.15, color = 'g') 57 | 58 | # Labels 59 | ax.set_title('max_depth = %s'%(depth)) 60 | ax.set_xlabel('Number of Training Points') 61 | ax.set_ylabel('r2_score') 62 | ax.set_xlim([0, X.shape[0]*0.8]) 63 | ax.set_ylim([-0.05, 1.05]) 64 | 65 | # Visual aesthetics 66 | ax.legend(bbox_to_anchor=(1.05, 2.05), loc='lower left', borderaxespad = 0.) 67 | fig.suptitle('Decision Tree Regressor Learning Performances', fontsize = 16, y = 1.03) 68 | fig.tight_layout() 69 | fig.show() 70 | 71 | 72 | def ModelComplexity(X, y): 73 | """ Calculates the performance of the model as model complexity increases. 74 | The learning and validation errors rates are then plotted. 
""" 75 | 76 | # Create 10 cross-validation sets for training and testing 77 | cv = ShuffleSplit(n_splits = 10, test_size = 0.2, random_state = 0) 78 | 79 | # Vary the max_depth parameter from 1 to 10 80 | max_depth = np.arange(1,11) 81 | 82 | # Calculate the training and testing scores 83 | train_scores, valid_scores = validation_curve(DecisionTreeRegressor(), X, y, \ 84 | param_name = "max_depth", param_range = max_depth, cv = cv, scoring = 'r2') 85 | 86 | # Find the mean and standard deviation for smoothing 87 | train_mean = np.mean(train_scores, axis=1) 88 | train_std = np.std(train_scores, axis=1) 89 | valid_mean = np.mean(valid_scores, axis=1) 90 | valid_std = np.std(valid_scores, axis=1) 91 | 92 | # Plot the validation curve 93 | pl.figure(figsize=(7, 5)) 94 | pl.title('Decision Tree Regressor Complexity Performance') 95 | pl.plot(max_depth, train_mean, 'o-', color = 'r', label = 'Training Score') 96 | pl.plot(max_depth, valid_mean, 'o-', color = 'g', label = 'Validation Score') 97 | pl.fill_between(max_depth, train_mean - train_std, \ 98 | train_mean + train_std, alpha = 0.15, color = 'r') 99 | pl.fill_between(max_depth, valid_mean - valid_std, \ 100 | valid_mean + valid_std, alpha = 0.15, color = 'g') 101 | 102 | # Visual aesthetics 103 | pl.legend(loc = 'lower right') 104 | pl.xlabel('Maximum Depth') 105 | pl.ylabel('r2_score') 106 | pl.ylim([-0.05,1.05]) 107 | pl.show() 108 | 109 | 110 | def PredictTrials(X, y, fitter, data): 111 | """ Performs trials of fitting and predicting data. """ 112 | 113 | # Store the predicted prices 114 | prices = [] 115 | 116 | for k in range(10): 117 | # Split the data 118 | X_train, X_test, y_train, y_test = train_test_split(X, y, \ 119 | test_size = 0.2, random_state = k) 120 | 121 | # Fit the data 122 | reg = fitter(X_train, y_train) 123 | 124 | # Make a prediction 125 | pred = reg.predict([data[0]])[0] 126 | prices.append(pred) 127 | 128 | # Result 129 | print("Trial {}: ${:,.2f}".format(k+1, pred)) 130 | 131 | # Display price range 132 | print("\nRange in prices: ${:,.2f}".format(max(prices) - min(prices))) 133 | -------------------------------------------------------------------------------- /titanic_survival_exploration/titanic_visualizations.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import pandas as pd 3 | import matplotlib.pyplot as plt 4 | 5 | def filter_data(data, condition): 6 | """ 7 | Remove elements that do not match the condition provided. 8 | Takes a data list as input and returns a filtered list. 9 | Conditions should be a list of strings of the following format: 10 | ' ' 11 | where the following operations are valid: >, <, >=, <=, ==, != 12 | 13 | Example: ["Sex == 'male'", 'Age < 18'] 14 | """ 15 | 16 | field, op, value = condition.split(" ") 17 | 18 | # convert value into number or strip excess quotes if string 19 | try: 20 | value = float(value) 21 | except: 22 | value = value.strip("\'\"") 23 | 24 | # get booleans for filtering 25 | if op == ">": 26 | matches = data[field] > value 27 | elif op == "<": 28 | matches = data[field] < value 29 | elif op == ">=": 30 | matches = data[field] >= value 31 | elif op == "<=": 32 | matches = data[field] <= value 33 | elif op == "==": 34 | matches = data[field] == value 35 | elif op == "!=": 36 | matches = data[field] != value 37 | else: # catch invalid operation codes 38 | raise Exception("Invalid comparison operator. 
Only >, <, >=, <=, ==, != allowed.") 39 | 40 | # filter data and outcomes 41 | data = data[matches].reset_index(drop = True) 42 | return data 43 | 44 | def survival_stats(data, outcomes, key, filters = []): 45 | """ 46 | Print out selected statistics regarding survival, given a feature of 47 | interest and any number of filters (including no filters) 48 | """ 49 | 50 | # Check that the key exists 51 | if key not in data.columns.values : 52 | print "'{}' is not a feature of the Titanic data. Did you spell something wrong?".format(key) 53 | return False 54 | 55 | # Return the function before visualizing if 'Cabin' or 'Ticket' 56 | # is selected: too many unique categories to display 57 | if(key == 'Cabin' or key == 'PassengerId' or key == 'Ticket'): 58 | print "'{}' has too many unique categories to display! Try a different feature.".format(key) 59 | return False 60 | 61 | # Merge data and outcomes into single dataframe 62 | all_data = pd.concat([data, outcomes], axis = 1) 63 | 64 | # Apply filters to data 65 | for condition in filters: 66 | all_data = filter_data(all_data, condition) 67 | 68 | # Create outcomes DataFrame 69 | all_data = all_data[[key, 'Survived']] 70 | 71 | # Create plotting figure 72 | plt.figure(figsize=(8,6)) 73 | 74 | # 'Numerical' features 75 | if(key == 'Age' or key == 'Fare'): 76 | 77 | # Remove NaN values from Age data 78 | all_data = all_data[~np.isnan(all_data[key])] 79 | 80 | # Divide the range of data into bins and count survival rates 81 | min_value = all_data[key].min() 82 | max_value = all_data[key].max() 83 | value_range = max_value - min_value 84 | 85 | # 'Fares' has larger range of values than 'Age' so create more bins 86 | if(key == 'Fare'): 87 | bins = np.arange(0, all_data['Fare'].max() + 20, 20) 88 | if(key == 'Age'): 89 | bins = np.arange(0, all_data['Age'].max() + 10, 10) 90 | 91 | # Overlay each bin's survival rates 92 | nonsurv_vals = all_data[all_data['Survived'] == 0][key].reset_index(drop = True) 93 | surv_vals = all_data[all_data['Survived'] == 1][key].reset_index(drop = True) 94 | plt.hist(nonsurv_vals, bins = bins, alpha = 0.6, 95 | color = 'red', label = 'Did not survive') 96 | plt.hist(surv_vals, bins = bins, alpha = 0.6, 97 | color = 'green', label = 'Survived') 98 | 99 | # Add legend to plot 100 | plt.xlim(0, bins.max()) 101 | plt.legend(framealpha = 0.8) 102 | 103 | # 'Categorical' features 104 | else: 105 | 106 | # Set the various categories 107 | if(key == 'Pclass'): 108 | values = np.arange(1,4) 109 | if(key == 'Parch' or key == 'SibSp'): 110 | values = np.arange(0,np.max(data[key]) + 1) 111 | if(key == 'Embarked'): 112 | values = ['C', 'Q', 'S'] 113 | if(key == 'Sex'): 114 | values = ['male', 'female'] 115 | 116 | # Create DataFrame containing categories and count of each 117 | frame = pd.DataFrame(index = np.arange(len(values)), columns=(key,'Survived','NSurvived')) 118 | for i, value in enumerate(values): 119 | frame.loc[i] = [value, \ 120 | len(all_data[(all_data['Survived'] == 1) & (all_data[key] == value)]), \ 121 | len(all_data[(all_data['Survived'] == 0) & (all_data[key] == value)])] 122 | 123 | # Set the width of each bar 124 | bar_width = 0.4 125 | 126 | # Display each category's survival rates 127 | for i in np.arange(len(frame)): 128 | nonsurv_bar = plt.bar(i-bar_width, frame.loc[i]['NSurvived'], width = bar_width, color = 'r') 129 | surv_bar = plt.bar(i, frame.loc[i]['Survived'], width = bar_width, color = 'g') 130 | 131 | plt.xticks(np.arange(len(frame)), values) 132 | plt.legend((nonsurv_bar[0], surv_bar[0]),('Did not 
survive', 'Survived'), framealpha = 0.8) 133 | 134 | # Common attributes for plot formatting 135 | plt.xlabel(key) 136 | plt.ylabel('Number of Passengers') 137 | plt.title('Passenger Survival Statistics With \'%s\' Feature'%(key)) 138 | plt.show() 139 | 140 | # Report number of passengers with missing values 141 | if sum(pd.isnull(all_data[key])): 142 | nan_outcomes = all_data[pd.isnull(all_data[key])]['Survived'] 143 | print "Passengers with missing '{}' values: {} ({} survived, {} did not survive)".format( \ 144 | key, len(nan_outcomes), sum(nan_outcomes == 1), sum(nan_outcomes == 0)) 145 | 146 | -------------------------------------------------------------------------------- /finding_donors/visuals.py: -------------------------------------------------------------------------------- 1 | ########################################### 2 | # Suppress matplotlib user warnings 3 | # Necessary for newer version of matplotlib 4 | import warnings 5 | warnings.filterwarnings("ignore", category = UserWarning, module = "matplotlib") 6 | # 7 | # Display inline matplotlib plots with IPython 8 | from IPython import get_ipython 9 | get_ipython().run_line_magic('matplotlib', 'inline') 10 | ########################################### 11 | 12 | import matplotlib.pyplot as pl 13 | import matplotlib.patches as mpatches 14 | import numpy as np 15 | import pandas as pd 16 | from time import time 17 | from sklearn.metrics import f1_score, accuracy_score 18 | 19 | 20 | def distribution(data, transformed = False): 21 | """ 22 | Visualization code for displaying skewed distributions of features 23 | """ 24 | 25 | # Create figure 26 | fig = pl.figure(figsize = (11,5)); 27 | 28 | # Skewed feature plotting 29 | for i, feature in enumerate(['capital-gain','capital-loss']): 30 | ax = fig.add_subplot(1, 2, i+1) 31 | ax.hist(data[feature], bins = 25, color = '#00A0A0') 32 | ax.set_title("'%s' Feature Distribution"%(feature), fontsize = 14) 33 | ax.set_xlabel("Value") 34 | ax.set_ylabel("Number of Records") 35 | ax.set_ylim((0, 2000)) 36 | ax.set_yticks([0, 500, 1000, 1500, 2000]) 37 | ax.set_yticklabels([0, 500, 1000, 1500, ">2000"]) 38 | 39 | # Plot aesthetics 40 | if transformed: 41 | fig.suptitle("Log-transformed Distributions of Continuous Census Data Features", \ 42 | fontsize = 16, y = 1.03) 43 | else: 44 | fig.suptitle("Skewed Distributions of Continuous Census Data Features", \ 45 | fontsize = 16, y = 1.03) 46 | 47 | fig.tight_layout() 48 | fig.show() 49 | 50 | 51 | def evaluate(results, accuracy, f1): 52 | """ 53 | Visualization code to display results of various learners. 
54 | 55 | inputs: 56 | - learners: a list of supervised learners 57 | - stats: a list of dictionaries of the statistic results from 'train_predict()' 58 | - accuracy: The score for the naive predictor 59 | - f1: The score for the naive predictor 60 | """ 61 | 62 | # Create figure 63 | fig, ax = pl.subplots(2, 3, figsize = (11,7)) 64 | 65 | # Constants 66 | bar_width = 0.3 67 | colors = ['#A00000','#00A0A0','#00A000'] 68 | 69 | # Super loop to plot four panels of data 70 | for k, learner in enumerate(results.keys()): 71 | for j, metric in enumerate(['train_time', 'acc_train', 'f_train', 'pred_time', 'acc_val', 'f_val']): 72 | for i in np.arange(3): 73 | 74 | # Creative plot code 75 | ax[j//3, j%3].bar(i+k*bar_width, results[learner][i][metric], width = bar_width, color = colors[k]) 76 | ax[j//3, j%3].set_xticks([0.45, 1.45, 2.45]) 77 | ax[j//3, j%3].set_xticklabels(["1%", "10%", "100%"]) 78 | ax[j//3, j%3].set_xlabel("Training Set Size") 79 | ax[j//3, j%3].set_xlim((-0.1, 3.0)) 80 | 81 | # Add unique y-labels 82 | ax[0, 0].set_ylabel("Time (in seconds)") 83 | ax[0, 1].set_ylabel("Accuracy Score") 84 | ax[0, 2].set_ylabel("F-score") 85 | ax[1, 0].set_ylabel("Time (in seconds)") 86 | ax[1, 1].set_ylabel("Accuracy Score") 87 | ax[1, 2].set_ylabel("F-score") 88 | 89 | # Add titles 90 | ax[0, 0].set_title("Model Training") 91 | ax[0, 1].set_title("Accuracy Score on Training Subset") 92 | ax[0, 2].set_title("F-score on Training Subset") 93 | ax[1, 0].set_title("Model Predicting") 94 | ax[1, 1].set_title("Accuracy Score on Testing Set") 95 | ax[1, 2].set_title("F-score on Testing Set") 96 | 97 | # Add horizontal lines for naive predictors 98 | ax[0, 1].axhline(y = accuracy, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed') 99 | ax[1, 1].axhline(y = accuracy, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed') 100 | ax[0, 2].axhline(y = f1, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed') 101 | ax[1, 2].axhline(y = f1, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed') 102 | 103 | # Set y-limits for score panels 104 | ax[0, 1].set_ylim((0, 1)) 105 | ax[0, 2].set_ylim((0, 1)) 106 | ax[1, 1].set_ylim((0, 1)) 107 | ax[1, 2].set_ylim((0, 1)) 108 | 109 | # Create patches for the legend 110 | patches = [] 111 | for i, learner in enumerate(results.keys()): 112 | patches.append(mpatches.Patch(color = colors[i], label = learner)) 113 | pl.legend(handles = patches, bbox_to_anchor = (-.80, 2.53), \ 114 | loc = 'upper center', borderaxespad = 0., ncol = 3, fontsize = 'x-large') 115 | 116 | # Aesthetics 117 | pl.suptitle("Performance Metrics for Three Supervised Learning Models", fontsize = 16, y = 1.10) 118 | pl.subplots_adjust(top=0.85, bottom=0., left=0.10, right=0.95, hspace=0.3,wspace=0.35) 119 | pl.show() 120 | 121 | 122 | def feature_plot(importances, X_train, y_train): 123 | 124 | # Display the five most important features 125 | indices = np.argsort(importances)[::-1] 126 | columns = X_train.columns.values[indices[:5]] 127 | values = importances[indices][:5] 128 | 129 | # Creat the plot 130 | fig = pl.figure(figsize = (9,5)) 131 | pl.title("Normalized Weights for First Five Most Predictive Features", fontsize = 16) 132 | rects = pl.bar(np.arange(5), values, width = 0.6, align="center", color = '#00A000', \ 133 | label = "Feature Weight") 134 | 135 | # make bar chart higher to fit the text label 136 | axes = pl.gca() 137 | axes.set_ylim([0, np.max(values) * 1.1]) 138 | 139 | # add text label on each bar 140 
| delta = np.max(values) * 0.02 141 | 142 | for rect in rects: 143 | height = rect.get_height() 144 | pl.text(rect.get_x() + rect.get_width()/2., 145 | height + delta, 146 | '%.2f' % height, 147 | ha='center', 148 | va='bottom') 149 | 150 | # Detect if xlabels are too long 151 | rotation = 0 152 | for i in columns: 153 | if len(i) > 20: 154 | rotation = 10 # If one is longer than 20 than rotate 10 degrees 155 | break 156 | pl.xticks(np.arange(5), columns, rotation = rotation) 157 | pl.xlim((-0.5, 4.5)) 158 | pl.ylabel("Weight", fontsize = 12) 159 | pl.xlabel("Feature", fontsize = 12) 160 | 161 | pl.legend(loc = 'upper center') 162 | pl.tight_layout() 163 | pl.show() 164 | -------------------------------------------------------------------------------- /creating_customer_segments/visuals.py: -------------------------------------------------------------------------------- 1 | ########################################### 2 | # Suppress matplotlib user warnings 3 | # Necessary for newer version of matplotlib 4 | import warnings 5 | warnings.filterwarnings("ignore", category = UserWarning, module = "matplotlib") 6 | # 7 | # Display inline matplotlib plots with IPython 8 | from IPython import get_ipython 9 | get_ipython().run_line_magic('matplotlib', 'inline') 10 | ########################################### 11 | 12 | import matplotlib.pyplot as plt 13 | import matplotlib.cm as cm 14 | import pandas as pd 15 | import numpy as np 16 | 17 | def pca_results(good_data, pca): 18 | ''' 19 | Create a DataFrame of the PCA results 20 | Includes dimension feature weights and explained variance 21 | Visualizes the PCA results 22 | ''' 23 | 24 | # Dimension indexing 25 | dimensions = dimensions = ['Dimension {}'.format(i) for i in range(1,len(pca.components_)+1)] 26 | 27 | # PCA components 28 | components = pd.DataFrame(np.round(pca.components_, 4), columns = list(good_data.keys())) 29 | components.index = dimensions 30 | 31 | # PCA explained variance 32 | ratios = pca.explained_variance_ratio_.reshape(len(pca.components_), 1) 33 | variance_ratios = pd.DataFrame(np.round(ratios, 4), columns = ['Explained Variance']) 34 | variance_ratios.index = dimensions 35 | 36 | # Create a bar plot visualization 37 | fig, ax = plt.subplots(figsize = (14,8)) 38 | 39 | # Plot the feature weights as a function of the components 40 | components.plot(ax = ax, kind = 'bar'); 41 | ax.set_ylabel("Feature Weights") 42 | ax.set_xticklabels(dimensions, rotation=0) 43 | 44 | 45 | # Display the explained variance ratios 46 | for i, ev in enumerate(pca.explained_variance_ratio_): 47 | ax.text(i-0.40, ax.get_ylim()[1] + 0.05, "Explained Variance\n %.4f"%(ev)) 48 | 49 | # Return a concatenated DataFrame 50 | return pd.concat([variance_ratios, components], axis = 1) 51 | 52 | def cluster_results(reduced_data, preds, centers, pca_samples): 53 | ''' 54 | Visualizes the PCA-reduced cluster data in two dimensions 55 | Adds cues for cluster centers and student-selected sample data 56 | ''' 57 | 58 | predictions = pd.DataFrame(preds, columns = ['Cluster']) 59 | plot_data = pd.concat([predictions, reduced_data], axis = 1) 60 | 61 | # Generate the cluster plot 62 | fig, ax = plt.subplots(figsize = (14,8)) 63 | 64 | # Color map 65 | cmap = cm.get_cmap('gist_rainbow') 66 | 67 | # Color the points based on assigned cluster 68 | for i, cluster in plot_data.groupby('Cluster'): 69 | cluster.plot(ax = ax, kind = 'scatter', x = 'Dimension 1', y = 'Dimension 2', \ 70 | color = cmap((i)*1.0/(len(centers)-1)), label = 'Cluster %i'%(i), s=30); 71 | 72 | # 
Plot centers with indicators 73 | for i, c in enumerate(centers): 74 | ax.scatter(x = c[0], y = c[1], color = 'white', edgecolors = 'black', \ 75 | alpha = 1, linewidth = 2, marker = 'o', s=200); 76 | ax.scatter(x = c[0], y = c[1], marker='$%d$'%(i), alpha = 1, s=100); 77 | 78 | # Plot transformed sample points 79 | ax.scatter(x = pca_samples[:,0], y = pca_samples[:,1], \ 80 | s = 150, linewidth = 4, color = 'black', marker = 'x'); 81 | 82 | # Set plot title 83 | ax.set_title("Cluster Learning on PCA-Reduced Data - Centroids Marked by Number\nTransformed Sample Data Marked by Black Cross"); 84 | 85 | 86 | def biplot(good_data, reduced_data, pca): 87 | ''' 88 | Produce a biplot that shows a scatterplot of the reduced 89 | data and the projections of the original features. 90 | 91 | good_data: original data, before transformation. 92 | Needs to be a pandas dataframe with valid column names 93 | reduced_data: the reduced data (the first two dimensions are plotted) 94 | pca: pca object that contains the components_ attribute 95 | 96 | return: a matplotlib AxesSubplot object (for any additional customization) 97 | 98 | This procedure is inspired by the script: 99 | https://github.com/teddyroland/python-biplot 100 | ''' 101 | 102 | fig, ax = plt.subplots(figsize = (14,8)) 103 | # scatterplot of the reduced data 104 | ax.scatter(x=reduced_data.loc[:, 'Dimension 1'], y=reduced_data.loc[:, 'Dimension 2'], 105 | facecolors='b', edgecolors='b', s=70, alpha=0.5) 106 | 107 | feature_vectors = pca.components_.T 108 | 109 | # we use scaling factors to make the arrows easier to see 110 | arrow_size, text_pos = 7.0, 8.0, 111 | 112 | # projections of the original features 113 | for i, v in enumerate(feature_vectors): 114 | ax.arrow(0, 0, arrow_size*v[0], arrow_size*v[1], 115 | head_width=0.2, head_length=0.2, linewidth=2, color='red') 116 | ax.text(v[0]*text_pos, v[1]*text_pos, good_data.columns[i], color='black', 117 | ha='center', va='center', fontsize=18) 118 | 119 | ax.set_xlabel("Dimension 1", fontsize=14) 120 | ax.set_ylabel("Dimension 2", fontsize=14) 121 | ax.set_title("PC plane with original feature projections.", fontsize=16); 122 | return ax 123 | 124 | 125 | def channel_results(reduced_data, outliers, pca_samples): 126 | ''' 127 | Visualizes the PCA-reduced cluster data in two dimensions using the full dataset 128 | Data is labeled by "Channel" and cues added for student-selected sample data 129 | ''' 130 | 131 | # Check that the dataset is loadable 132 | try: 133 | full_data = pd.read_csv("customers.csv") 134 | except: 135 | print("Dataset could not be loaded. 
Is the file missing?") 136 | return False 137 | 138 | # Create the Channel DataFrame 139 | channel = pd.DataFrame(full_data['Channel'], columns = ['Channel']) 140 | channel = channel.drop(channel.index[outliers]).reset_index(drop = True) 141 | labeled = pd.concat([reduced_data, channel], axis = 1) 142 | 143 | # Generate the cluster plot 144 | fig, ax = plt.subplots(figsize = (14,8)) 145 | 146 | # Color map 147 | cmap = cm.get_cmap('gist_rainbow') 148 | 149 | # Color the points based on assigned Channel 150 | labels = ['Hotel/Restaurant/Cafe', 'Retailer'] 151 | grouped = labeled.groupby('Channel') 152 | for i, channel in grouped: 153 | channel.plot(ax = ax, kind = 'scatter', x = 'Dimension 1', y = 'Dimension 2', \ 154 | color = cmap((i-1)*1.0/2), label = labels[i-1], s=30); 155 | 156 | # Plot transformed sample points 157 | for i, sample in enumerate(pca_samples): 158 | ax.scatter(x = sample[0], y = sample[1], \ 159 | s = 200, linewidth = 3, color = 'black', marker = 'o', facecolors = 'none'); 160 | ax.scatter(x = sample[0]+0.25, y = sample[1]+0.3, marker='$%d$'%(i), alpha = 1, s=125); 161 | 162 | # Set plot title 163 | ax.set_title("PCA-Reduced Data Labeled by 'Channel'\nTransformed Sample Data Circled"); -------------------------------------------------------------------------------- /boston_housing/housing.csv: -------------------------------------------------------------------------------- 1 | RM,LSTAT,PTRATIO,MEDV 2 | 6.575,4.98,15.3,504000.0 3 | 6.421,9.14,17.8,453600.0 4 | 7.185,4.03,17.8,728700.0 5 | 6.998,2.94,18.7,701400.0 6 | 7.147,5.33,18.7,760200.0 7 | 6.43,5.21,18.7,602700.0 8 | 6.012,12.43,15.2,480900.0 9 | 6.172,19.15,15.2,569100.0 10 | 5.631,29.93,15.2,346500.0 11 | 6.004,17.1,15.2,396900.0 12 | 6.377,20.45,15.2,315000.0 13 | 6.009,13.27,15.2,396900.0 14 | 5.889,15.71,15.2,455700.0 15 | 5.949,8.26,21.0,428400.0 16 | 6.096,10.26,21.0,382200.0 17 | 5.834,8.47,21.0,417900.0 18 | 5.935,6.58,21.0,485100.0 19 | 5.99,14.67,21.0,367500.0 20 | 5.456,11.69,21.0,424200.0 21 | 5.727,11.28,21.0,382200.0 22 | 5.57,21.02,21.0,285600.0 23 | 5.965,13.83,21.0,411600.0 24 | 6.142,18.72,21.0,319200.0 25 | 5.813,19.88,21.0,304500.0 26 | 5.924,16.3,21.0,327600.0 27 | 5.599,16.51,21.0,291900.0 28 | 5.813,14.81,21.0,348600.0 29 | 6.047,17.28,21.0,310800.0 30 | 6.495,12.8,21.0,386400.0 31 | 6.674,11.98,21.0,441000.0 32 | 5.713,22.6,21.0,266700.0 33 | 6.072,13.04,21.0,304500.0 34 | 5.95,27.71,21.0,277200.0 35 | 5.701,18.35,21.0,275100.0 36 | 6.096,20.34,21.0,283500.0 37 | 5.933,9.68,19.2,396900.0 38 | 5.841,11.41,19.2,420000.0 39 | 5.85,8.77,19.2,441000.0 40 | 5.966,10.13,19.2,518700.0 41 | 6.595,4.32,18.3,646800.0 42 | 7.024,1.98,18.3,732900.0 43 | 6.77,4.84,17.9,558600.0 44 | 6.169,5.81,17.9,531300.0 45 | 6.211,7.44,17.9,518700.0 46 | 6.069,9.55,17.9,445200.0 47 | 5.682,10.21,17.9,405300.0 48 | 5.786,14.15,17.9,420000.0 49 | 6.03,18.8,17.9,348600.0 50 | 5.399,30.81,17.9,302400.0 51 | 5.602,16.2,17.9,407400.0 52 | 5.963,13.45,16.8,413700.0 53 | 6.115,9.43,16.8,430500.0 54 | 6.511,5.28,16.8,525000.0 55 | 5.998,8.43,16.8,491400.0 56 | 5.888,14.8,21.1,396900.0 57 | 7.249,4.81,17.9,743400.0 58 | 6.383,5.77,17.3,518700.0 59 | 6.816,3.95,15.1,663600.0 60 | 6.145,6.86,19.7,489300.0 61 | 5.927,9.22,19.7,411600.0 62 | 5.741,13.15,19.7,392700.0 63 | 5.966,14.44,19.7,336000.0 64 | 6.456,6.73,19.7,466200.0 65 | 6.762,9.5,19.7,525000.0 66 | 7.104,8.05,18.6,693000.0 67 | 6.29,4.67,16.1,493500.0 68 | 5.787,10.24,16.1,407400.0 69 | 5.878,8.1,18.9,462000.0 70 | 5.594,13.09,18.9,365400.0 71 | 
5.885,8.79,18.9,438900.0 72 | 6.417,6.72,19.2,508200.0 73 | 5.961,9.88,19.2,455700.0 74 | 6.065,5.52,19.2,478800.0 75 | 6.245,7.54,19.2,491400.0 76 | 6.273,6.78,18.7,506100.0 77 | 6.286,8.94,18.7,449400.0 78 | 6.279,11.97,18.7,420000.0 79 | 6.14,10.27,18.7,436800.0 80 | 6.232,12.34,18.7,445200.0 81 | 5.874,9.1,18.7,426300.0 82 | 6.727,5.29,19.0,588000.0 83 | 6.619,7.22,19.0,501900.0 84 | 6.302,6.72,19.0,520800.0 85 | 6.167,7.51,19.0,480900.0 86 | 6.389,9.62,18.5,501900.0 87 | 6.63,6.53,18.5,558600.0 88 | 6.015,12.86,18.5,472500.0 89 | 6.121,8.44,18.5,466200.0 90 | 7.007,5.5,17.8,495600.0 91 | 7.079,5.7,17.8,602700.0 92 | 6.417,8.81,17.8,474600.0 93 | 6.405,8.2,17.8,462000.0 94 | 6.442,8.16,18.2,480900.0 95 | 6.211,6.21,18.2,525000.0 96 | 6.249,10.59,18.2,432600.0 97 | 6.625,6.65,18.0,596400.0 98 | 6.163,11.34,18.0,449400.0 99 | 8.069,4.21,18.0,812700.0 100 | 7.82,3.57,18.0,919800.0 101 | 7.416,6.19,18.0,697200.0 102 | 6.727,9.42,20.9,577500.0 103 | 6.781,7.67,20.9,556500.0 104 | 6.405,10.63,20.9,390600.0 105 | 6.137,13.44,20.9,405300.0 106 | 6.167,12.33,20.9,422100.0 107 | 5.851,16.47,20.9,409500.0 108 | 5.836,18.66,20.9,409500.0 109 | 6.127,14.09,20.9,428400.0 110 | 6.474,12.27,20.9,415800.0 111 | 6.229,15.55,20.9,407400.0 112 | 6.195,13.0,20.9,455700.0 113 | 6.715,10.16,17.8,478800.0 114 | 5.913,16.21,17.8,394800.0 115 | 6.092,17.09,17.8,392700.0 116 | 6.254,10.45,17.8,388500.0 117 | 5.928,15.76,17.8,384300.0 118 | 6.176,12.04,17.8,445200.0 119 | 6.021,10.3,17.8,403200.0 120 | 5.872,15.37,17.8,428400.0 121 | 5.731,13.61,17.8,405300.0 122 | 5.87,14.37,19.1,462000.0 123 | 6.004,14.27,19.1,426300.0 124 | 5.961,17.93,19.1,430500.0 125 | 5.856,25.41,19.1,363300.0 126 | 5.879,17.58,19.1,394800.0 127 | 5.986,14.81,19.1,449400.0 128 | 5.613,27.26,19.1,329700.0 129 | 5.693,17.19,21.2,340200.0 130 | 6.431,15.39,21.2,378000.0 131 | 5.637,18.34,21.2,300300.0 132 | 6.458,12.6,21.2,403200.0 133 | 6.326,12.26,21.2,411600.0 134 | 6.372,11.12,21.2,483000.0 135 | 5.822,15.03,21.2,386400.0 136 | 5.757,17.31,21.2,327600.0 137 | 6.335,16.96,21.2,380100.0 138 | 5.942,16.9,21.2,365400.0 139 | 6.454,14.59,21.2,359100.0 140 | 5.857,21.32,21.2,279300.0 141 | 6.151,18.46,21.2,373800.0 142 | 6.174,24.16,21.2,294000.0 143 | 5.019,34.41,21.2,302400.0 144 | 5.403,26.82,14.7,281400.0 145 | 5.468,26.42,14.7,327600.0 146 | 4.903,29.29,14.7,247800.0 147 | 6.13,27.8,14.7,289800.0 148 | 5.628,16.65,14.7,327600.0 149 | 4.926,29.53,14.7,306600.0 150 | 5.186,28.32,14.7,373800.0 151 | 5.597,21.45,14.7,323400.0 152 | 6.122,14.1,14.7,451500.0 153 | 5.404,13.28,14.7,411600.0 154 | 5.012,12.12,14.7,321300.0 155 | 5.709,15.79,14.7,407400.0 156 | 6.129,15.12,14.7,357000.0 157 | 6.152,15.02,14.7,327600.0 158 | 5.272,16.14,14.7,275100.0 159 | 6.943,4.59,14.7,867300.0 160 | 6.066,6.43,14.7,510300.0 161 | 6.51,7.39,14.7,489300.0 162 | 6.25,5.5,14.7,567000.0 163 | 5.854,11.64,14.7,476700.0 164 | 6.101,9.81,14.7,525000.0 165 | 5.877,12.14,14.7,499800.0 166 | 6.319,11.1,14.7,499800.0 167 | 6.402,11.32,14.7,468300.0 168 | 5.875,14.43,14.7,365400.0 169 | 5.88,12.03,14.7,401100.0 170 | 5.572,14.69,16.6,485100.0 171 | 6.416,9.04,16.6,495600.0 172 | 5.859,9.64,16.6,474600.0 173 | 6.546,5.33,16.6,617400.0 174 | 6.02,10.11,16.6,487200.0 175 | 6.315,6.29,16.6,516600.0 176 | 6.86,6.92,16.6,627900.0 177 | 6.98,5.04,17.8,781200.0 178 | 7.765,7.56,17.8,835800.0 179 | 6.144,9.45,17.8,760200.0 180 | 7.155,4.82,17.8,795900.0 181 | 6.563,5.68,17.8,682500.0 182 | 5.604,13.98,17.8,554400.0 183 | 6.153,13.15,17.8,621600.0 184 | 6.782,6.68,15.2,672000.0 185 | 
6.556,4.56,15.2,625800.0 186 | 7.185,5.39,15.2,732900.0 187 | 6.951,5.1,15.2,777000.0 188 | 6.739,4.69,15.2,640500.0 189 | 7.178,2.87,15.2,764400.0 190 | 6.8,5.03,15.6,653100.0 191 | 6.604,4.38,15.6,611100.0 192 | 7.287,4.08,12.6,699300.0 193 | 7.107,8.61,12.6,636300.0 194 | 7.274,6.62,12.6,726600.0 195 | 6.975,4.56,17.0,732900.0 196 | 7.135,4.45,17.0,690900.0 197 | 6.162,7.43,14.7,506100.0 198 | 7.61,3.11,14.7,888300.0 199 | 7.853,3.81,14.7,1018500.0 200 | 5.891,10.87,18.6,474600.0 201 | 6.326,10.97,18.6,512400.0 202 | 5.783,18.06,18.6,472500.0 203 | 6.064,14.66,18.6,512400.0 204 | 5.344,23.09,18.6,420000.0 205 | 5.96,17.27,18.6,455700.0 206 | 5.404,23.98,18.6,405300.0 207 | 5.807,16.03,18.6,470400.0 208 | 6.375,9.38,18.6,590100.0 209 | 5.412,29.55,18.6,497700.0 210 | 6.182,9.47,18.6,525000.0 211 | 5.888,13.51,16.4,489300.0 212 | 6.642,9.69,16.4,602700.0 213 | 5.951,17.92,16.4,451500.0 214 | 6.373,10.5,16.4,483000.0 215 | 6.951,9.71,17.4,560700.0 216 | 6.164,21.46,17.4,455700.0 217 | 6.879,9.93,17.4,577500.0 218 | 6.618,7.6,17.4,632100.0 219 | 8.266,4.14,17.4,940800.0 220 | 8.04,3.13,17.4,789600.0 221 | 7.163,6.36,17.4,663600.0 222 | 7.686,3.92,17.4,980700.0 223 | 6.552,3.76,17.4,661500.0 224 | 5.981,11.65,17.4,510300.0 225 | 7.412,5.25,17.4,665700.0 226 | 8.337,2.47,17.4,875700.0 227 | 8.247,3.95,17.4,1014300.0 228 | 6.726,8.05,17.4,609000.0 229 | 6.086,10.88,17.4,504000.0 230 | 6.631,9.54,17.4,527100.0 231 | 7.358,4.73,17.4,661500.0 232 | 6.481,6.36,16.6,497700.0 233 | 6.606,7.37,16.6,489300.0 234 | 6.897,11.38,16.6,462000.0 235 | 6.095,12.4,16.6,422100.0 236 | 6.358,11.22,16.6,466200.0 237 | 6.393,5.19,16.6,497700.0 238 | 5.593,12.5,19.1,369600.0 239 | 5.605,18.46,19.1,388500.0 240 | 6.108,9.16,19.1,510300.0 241 | 6.226,10.15,19.1,430500.0 242 | 6.433,9.52,19.1,514500.0 243 | 6.718,6.56,19.1,550200.0 244 | 6.487,5.9,19.1,512400.0 245 | 6.438,3.59,19.1,520800.0 246 | 6.957,3.53,19.1,621600.0 247 | 8.259,3.54,19.1,898800.0 248 | 6.108,6.57,16.4,459900.0 249 | 5.876,9.25,16.4,438900.0 250 | 7.454,3.11,15.9,924000.0 251 | 7.333,7.79,13.0,756000.0 252 | 6.842,6.9,13.0,632100.0 253 | 7.203,9.59,13.0,709800.0 254 | 7.52,7.26,13.0,905100.0 255 | 8.398,5.91,13.0,1024800.0 256 | 7.327,11.25,13.0,651000.0 257 | 7.206,8.1,13.0,766500.0 258 | 5.56,10.45,13.0,478800.0 259 | 7.014,14.79,13.0,644700.0 260 | 7.47,3.16,13.0,913500.0 261 | 5.92,13.65,18.6,434700.0 262 | 5.856,13.0,18.6,443100.0 263 | 6.24,6.59,18.6,529200.0 264 | 6.538,7.73,18.6,512400.0 265 | 7.691,6.58,18.6,739200.0 266 | 6.758,3.53,17.6,680400.0 267 | 6.854,2.98,17.6,672000.0 268 | 7.267,6.05,17.6,697200.0 269 | 6.826,4.16,17.6,695100.0 270 | 6.482,7.19,17.6,611100.0 271 | 6.812,4.85,14.9,737100.0 272 | 7.82,3.76,14.9,953400.0 273 | 6.968,4.59,14.9,743400.0 274 | 7.645,3.01,14.9,966000.0 275 | 7.088,7.85,15.3,676200.0 276 | 6.453,8.23,15.3,462000.0 277 | 6.23,12.93,18.2,422100.0 278 | 6.209,7.14,16.6,487200.0 279 | 6.315,7.6,16.6,468300.0 280 | 6.565,9.51,16.6,520800.0 281 | 6.861,3.33,19.2,598500.0 282 | 7.148,3.56,19.2,783300.0 283 | 6.63,4.7,19.2,585900.0 284 | 6.127,8.58,16.0,501900.0 285 | 6.009,10.4,16.0,455700.0 286 | 6.678,6.27,16.0,600600.0 287 | 6.549,7.39,16.0,569100.0 288 | 5.79,15.84,16.0,426300.0 289 | 6.345,4.97,14.8,472500.0 290 | 7.041,4.74,14.8,609000.0 291 | 6.871,6.07,14.8,520800.0 292 | 6.59,9.5,16.1,462000.0 293 | 6.495,8.67,16.1,554400.0 294 | 6.982,4.86,16.1,695100.0 295 | 7.236,6.93,18.4,758100.0 296 | 6.616,8.93,18.4,596400.0 297 | 7.42,6.47,18.4,701400.0 298 | 6.849,7.53,18.4,592200.0 299 | 
6.635,4.54,18.4,478800.0 300 | 5.972,9.97,18.4,426300.0 301 | 4.973,12.64,18.4,338100.0 302 | 6.122,5.98,18.4,464100.0 303 | 6.023,11.72,18.4,407400.0 304 | 6.266,7.9,18.4,453600.0 305 | 6.567,9.28,18.4,499800.0 306 | 5.705,11.5,18.4,340200.0 307 | 5.914,18.33,18.4,373800.0 308 | 5.782,15.94,18.4,415800.0 309 | 6.382,10.36,18.4,485100.0 310 | 6.113,12.73,18.4,441000.0 311 | 6.426,7.2,19.6,499800.0 312 | 6.376,6.87,19.6,485100.0 313 | 6.041,7.7,19.6,428400.0 314 | 5.708,11.74,19.6,388500.0 315 | 6.415,6.12,19.6,525000.0 316 | 6.431,5.08,19.6,516600.0 317 | 6.312,6.15,19.6,483000.0 318 | 6.083,12.79,19.6,466200.0 319 | 5.868,9.97,16.9,405300.0 320 | 6.333,7.34,16.9,474600.0 321 | 6.144,9.09,16.9,415800.0 322 | 5.706,12.43,16.9,359100.0 323 | 6.031,7.83,16.9,407400.0 324 | 6.316,5.68,20.2,466200.0 325 | 6.31,6.75,20.2,434700.0 326 | 6.037,8.01,20.2,443100.0 327 | 5.869,9.8,20.2,409500.0 328 | 5.895,10.56,20.2,388500.0 329 | 6.059,8.51,20.2,432600.0 330 | 5.985,9.74,20.2,399000.0 331 | 5.968,9.29,20.2,392700.0 332 | 7.241,5.49,15.5,686700.0 333 | 6.54,8.65,15.9,346500.0 334 | 6.696,7.18,17.6,501900.0 335 | 6.874,4.61,17.6,655200.0 336 | 6.014,10.53,18.8,367500.0 337 | 5.898,12.67,18.8,361200.0 338 | 6.516,6.36,17.9,485100.0 339 | 6.635,5.99,17.0,514500.0 340 | 6.939,5.89,19.7,558600.0 341 | 6.49,5.98,19.7,480900.0 342 | 6.579,5.49,18.3,506100.0 343 | 5.884,7.79,18.3,390600.0 344 | 6.728,4.5,17.0,632100.0 345 | 5.663,8.05,22.0,382200.0 346 | 5.936,5.57,22.0,432600.0 347 | 6.212,17.6,20.2,373800.0 348 | 6.395,13.27,20.2,455700.0 349 | 6.127,11.48,20.2,476700.0 350 | 6.112,12.67,20.2,474600.0 351 | 6.398,7.79,20.2,525000.0 352 | 6.251,14.19,20.2,417900.0 353 | 5.362,10.19,20.2,436800.0 354 | 5.803,14.64,20.2,352800.0 355 | 3.561,7.12,20.2,577500.0 356 | 4.963,14.0,20.2,459900.0 357 | 3.863,13.33,20.2,485100.0 358 | 4.906,34.77,20.2,289800.0 359 | 4.138,37.97,20.2,289800.0 360 | 7.313,13.44,20.2,315000.0 361 | 6.649,23.24,20.2,291900.0 362 | 6.794,21.24,20.2,279300.0 363 | 6.38,23.69,20.2,275100.0 364 | 6.223,21.78,20.2,214200.0 365 | 6.968,17.21,20.2,218400.0 366 | 6.545,21.08,20.2,228900.0 367 | 5.536,23.6,20.2,237300.0 368 | 5.52,24.56,20.2,258300.0 369 | 4.368,30.63,20.2,184800.0 370 | 5.277,30.81,20.2,151200.0 371 | 4.652,28.28,20.2,220500.0 372 | 5.0,31.99,20.2,155400.0 373 | 4.88,30.62,20.2,214200.0 374 | 5.39,20.85,20.2,241500.0 375 | 5.713,17.11,20.2,317100.0 376 | 6.051,18.76,20.2,487200.0 377 | 5.036,25.68,20.2,203700.0 378 | 6.193,15.17,20.2,289800.0 379 | 5.887,16.35,20.2,266700.0 380 | 6.471,17.12,20.2,275100.0 381 | 6.405,19.37,20.2,262500.0 382 | 5.747,19.92,20.2,178500.0 383 | 5.453,30.59,20.2,105000.0 384 | 5.852,29.97,20.2,132300.0 385 | 5.987,26.77,20.2,117600.0 386 | 6.343,20.32,20.2,151200.0 387 | 6.404,20.31,20.2,254100.0 388 | 5.349,19.77,20.2,174300.0 389 | 5.531,27.38,20.2,178500.0 390 | 5.683,22.98,20.2,105000.0 391 | 4.138,23.34,20.2,249900.0 392 | 5.608,12.13,20.2,585900.0 393 | 5.617,26.4,20.2,361200.0 394 | 6.852,19.78,20.2,577500.0 395 | 5.757,10.11,20.2,315000.0 396 | 6.657,21.22,20.2,361200.0 397 | 4.628,34.37,20.2,375900.0 398 | 5.155,20.08,20.2,342300.0 399 | 4.519,36.98,20.2,147000.0 400 | 6.434,29.05,20.2,151200.0 401 | 6.782,25.79,20.2,157500.0 402 | 5.304,26.64,20.2,218400.0 403 | 5.957,20.62,20.2,184800.0 404 | 6.824,22.74,20.2,176400.0 405 | 6.411,15.02,20.2,350700.0 406 | 6.006,15.7,20.2,298200.0 407 | 5.648,14.1,20.2,436800.0 408 | 6.103,23.29,20.2,281400.0 409 | 5.565,17.16,20.2,245700.0 410 | 5.896,24.39,20.2,174300.0 411 | 5.837,15.69,20.2,214200.0 
412 | 6.202,14.52,20.2,228900.0 413 | 6.193,21.52,20.2,231000.0 414 | 6.38,24.08,20.2,199500.0 415 | 6.348,17.64,20.2,304500.0 416 | 6.833,19.69,20.2,296100.0 417 | 6.425,12.03,20.2,338100.0 418 | 6.436,16.22,20.2,300300.0 419 | 6.208,15.17,20.2,245700.0 420 | 6.629,23.27,20.2,281400.0 421 | 6.461,18.05,20.2,201600.0 422 | 6.152,26.45,20.2,182700.0 423 | 5.935,34.02,20.2,176400.0 424 | 5.627,22.88,20.2,268800.0 425 | 5.818,22.11,20.2,220500.0 426 | 6.406,19.52,20.2,359100.0 427 | 6.219,16.59,20.2,386400.0 428 | 6.485,18.85,20.2,323400.0 429 | 5.854,23.79,20.2,226800.0 430 | 6.459,23.98,20.2,247800.0 431 | 6.341,17.79,20.2,312900.0 432 | 6.251,16.44,20.2,264600.0 433 | 6.185,18.13,20.2,296100.0 434 | 6.417,19.31,20.2,273000.0 435 | 6.749,17.44,20.2,281400.0 436 | 6.655,17.73,20.2,319200.0 437 | 6.297,17.27,20.2,338100.0 438 | 7.393,16.74,20.2,373800.0 439 | 6.728,18.71,20.2,312900.0 440 | 6.525,18.13,20.2,296100.0 441 | 5.976,19.01,20.2,266700.0 442 | 5.936,16.94,20.2,283500.0 443 | 6.301,16.23,20.2,312900.0 444 | 6.081,14.7,20.2,420000.0 445 | 6.701,16.42,20.2,344400.0 446 | 6.376,14.65,20.2,371700.0 447 | 6.317,13.99,20.2,409500.0 448 | 6.513,10.29,20.2,424200.0 449 | 6.209,13.22,20.2,449400.0 450 | 5.759,14.13,20.2,417900.0 451 | 5.952,17.15,20.2,399000.0 452 | 6.003,21.32,20.2,401100.0 453 | 5.926,18.13,20.2,401100.0 454 | 5.713,14.76,20.2,422100.0 455 | 6.167,16.29,20.2,417900.0 456 | 6.229,12.87,20.2,411600.0 457 | 6.437,14.36,20.2,487200.0 458 | 6.98,11.66,20.2,625800.0 459 | 5.427,18.14,20.2,289800.0 460 | 6.162,24.1,20.2,279300.0 461 | 6.484,18.68,20.2,350700.0 462 | 5.304,24.91,20.2,252000.0 463 | 6.185,18.03,20.2,306600.0 464 | 6.229,13.11,20.2,449400.0 465 | 6.242,10.74,20.2,483000.0 466 | 6.75,7.74,20.2,497700.0 467 | 7.061,7.01,20.2,525000.0 468 | 5.762,10.42,20.2,457800.0 469 | 5.871,13.34,20.2,432600.0 470 | 6.312,10.58,20.2,445200.0 471 | 6.114,14.98,20.2,401100.0 472 | 5.905,11.45,20.2,432600.0 473 | 5.454,18.06,20.1,319200.0 474 | 5.414,23.97,20.1,147000.0 475 | 5.093,29.68,20.1,170100.0 476 | 5.983,18.07,20.1,285600.0 477 | 5.983,13.35,20.1,422100.0 478 | 5.707,12.01,19.2,457800.0 479 | 5.926,13.59,19.2,514500.0 480 | 5.67,17.6,19.2,485100.0 481 | 5.39,21.14,19.2,413700.0 482 | 5.794,14.1,19.2,384300.0 483 | 6.019,12.92,19.2,445200.0 484 | 5.569,15.1,19.2,367500.0 485 | 6.027,14.33,19.2,352800.0 486 | 6.593,9.67,21.0,470400.0 487 | 6.12,9.08,21.0,432600.0 488 | 6.976,5.64,21.0,501900.0 489 | 6.794,6.48,21.0,462000.0 490 | 6.03,7.88,21.0,249900.0 491 | -------------------------------------------------------------------------------- /titanic_survival_exploration/titanic_survival_exploration.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 机器学习工程师纳米学位\n", 8 | "## 机器学习基础\n", 9 | "## 项目 0: 预测泰坦尼克号乘客生还率\n", 10 | "\n", 11 | "1912年,泰坦尼克号在第一次航行中就与冰山相撞沉没,导致了大部分乘客和船员身亡。在这个入门项目中,我们将探索部分泰坦尼克号旅客名单,来确定哪些特征可以最好地预测一个人是否会生还。为了完成这个项目,你将需要实现几个基于条件的预测并回答下面的问题。我们将根据代码的完成度和对问题的解答来对你提交的项目的进行评估。 \n", 12 | "\n", 13 | "> **提示**:这样的文字将会指导你如何使用 iPython Notebook 来完成项目。" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "点击[这里](https://github.com/udacity/machine-learning/blob/master/projects/titanic_survival_exploration/titanic_survival_exploration.ipynb)查看本文件的英文版本。" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "### 了解数据\n", 28 | "\n", 29 | 
"当我们开始处理泰坦尼克号乘客数据时,会先导入我们需要的功能模块以及将数据加载到 `pandas` DataFrame。运行下面区域中的代码加载数据,并使用 `.head()` 函数显示前几项乘客数据。 \n", 30 | "\n", 31 | "> **提示**:你可以通过单击代码区域,然后使用键盘快捷键 **Shift+Enter** 或 **Shift+ Return** 来运行代码。或者在选择代码后使用**播放**(run cell)按钮执行代码。像这样的 MarkDown 文本可以通过双击编辑,并使用这些相同的快捷键保存。[Markdown](http://daringfireball.net/projects/markdown/syntax) 允许你编写易读的纯文本并且可以转换为 HTML。" 32 | ] 33 | }, 34 | { 35 | "cell_type": "code", 36 | "execution_count": null, 37 | "metadata": { 38 | "collapsed": true 39 | }, 40 | "outputs": [], 41 | "source": [ 42 | "# 检查你的Python版本\n", 43 | "from sys import version_info\n", 44 | "if version_info.major != 2 and version_info.minor != 7:\n", 45 | " raise Exception('请使用Python 2.7来完成此项目')" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": { 52 | "collapsed": false 53 | }, 54 | "outputs": [], 55 | "source": [ 56 | "import numpy as np\n", 57 | "import pandas as pd\n", 58 | "\n", 59 | "# 数据可视化代码\n", 60 | "from titanic_visualizations import survival_stats\n", 61 | "from IPython.display import display\n", 62 | "%matplotlib inline\n", 63 | "\n", 64 | "# 加载数据集\n", 65 | "in_file = 'titanic_data.csv'\n", 66 | "full_data = pd.read_csv(in_file)\n", 67 | "\n", 68 | "# 显示数据列表中的前几项乘客数据\n", 69 | "display(full_data.head())" 70 | ] 71 | }, 72 | { 73 | "cell_type": "markdown", 74 | "metadata": {}, 75 | "source": [ 76 | "从泰坦尼克号的数据样本中,我们可以看到船上每位旅客的特征\n", 77 | "\n", 78 | "- **Survived**:是否存活(0代表否,1代表是)\n", 79 | "- **Pclass**:社会阶级(1代表上层阶级,2代表中层阶级,3代表底层阶级)\n", 80 | "- **Name**:船上乘客的名字\n", 81 | "- **Sex**:船上乘客的性别\n", 82 | "- **Age**:船上乘客的年龄(可能存在 `NaN`)\n", 83 | "- **SibSp**:乘客在船上的兄弟姐妹和配偶的数量\n", 84 | "- **Parch**:乘客在船上的父母以及小孩的数量\n", 85 | "- **Ticket**:乘客船票的编号\n", 86 | "- **Fare**:乘客为船票支付的费用\n", 87 | "- **Cabin**:乘客所在船舱的编号(可能存在 `NaN`)\n", 88 | "- **Embarked**:乘客上船的港口(C 代表从 Cherbourg 登船,Q 代表从 Queenstown 登船,S 代表从 Southampton 登船)\n", 89 | "\n", 90 | "因为我们感兴趣的是每个乘客或船员是否在事故中活了下来。可以将 **Survived** 这一特征从这个数据集移除,并且用一个单独的变量 `outcomes` 来存储。它也做为我们要预测的目标。\n", 91 | "\n", 92 | "运行该代码,从数据集中移除 **Survived** 这个特征,并将它存储在变量 `outcomes` 中。" 93 | ] 94 | }, 95 | { 96 | "cell_type": "code", 97 | "execution_count": null, 98 | "metadata": { 99 | "collapsed": false 100 | }, 101 | "outputs": [], 102 | "source": [ 103 | "# 从数据集中移除 'Survived' 这个特征,并将它存储在一个新的变量中。\n", 104 | "outcomes = full_data['Survived']\n", 105 | "data = full_data.drop('Survived', axis = 1)\n", 106 | "\n", 107 | "# 显示已移除 'Survived' 特征的数据集\n", 108 | "display(data.head())" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "metadata": {}, 114 | "source": [ 115 | "这个例子展示了如何将泰坦尼克号的 **Survived** 数据从 DataFrame 移除。注意到 `data`(乘客数据)和 `outcomes` (是否存活)现在已经匹配好。这意味着对于任何乘客的 `data.loc[i]` 都有对应的存活的结果 `outcome[i]`。" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "### 计算准确率\n", 123 | "为了验证我们预测的结果,我们需要一个标准来给我们的预测打分。因为我们最感兴趣的是我们预测的**准确率**,既正确预测乘客存活的比例。运行下面的代码来创建我们的 `accuracy_score` 函数以对前五名乘客的预测来做测试。\n", 124 | "\n", 125 | "**思考题**:在前五个乘客中,如果我们预测他们全部都存活,你觉得我们预测的准确率是多少?" 
126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "metadata": { 132 | "collapsed": false 133 | }, 134 | "outputs": [], 135 | "source": [ 136 | "def accuracy_score(truth, pred):\n", 137 | " \"\"\" 返回 pred 相对于 truth 的准确率 \"\"\"\n", 138 | " \n", 139 | " # 确保预测的数量与结果的数量一致\n", 140 | " if len(truth) == len(pred): \n", 141 | " \n", 142 | " # 计算预测准确率(百分比)\n", 143 | " return \"Predictions have an accuracy of {:.2f}%.\".format((truth == pred).mean()*100)\n", 144 | " \n", 145 | " else:\n", 146 | " return \"Number of predictions does not match number of outcomes!\"\n", 147 | " \n", 148 | "# 测试 'accuracy_score' 函数\n", 149 | "predictions = pd.Series(np.ones(5, dtype = int)) #五个预测全部为1,既存活\n", 150 | "print accuracy_score(outcomes[:5], predictions)" 151 | ] 152 | }, 153 | { 154 | "cell_type": "markdown", 155 | "metadata": {}, 156 | "source": [ 157 | "> **提示**:如果你保存 iPython Notebook,代码运行的输出也将被保存。但是,一旦你重新打开项目,你的工作区将会被重置。请确保每次都从上次离开的地方运行代码来重新生成变量和函数。\n", 158 | "\n", 159 | "### 最简单的预测\n", 160 | "\n", 161 | "如果我们要预测泰坦尼克号上的乘客是否存活,但是我们又对他们一无所知,那么最好的预测就是船上的人无一幸免。这是因为,我们可以假定当船沉没的时候大多数乘客都遇难了。下面的 `predictions_0` 函数就预测船上的乘客全部遇难。 " 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": { 168 | "collapsed": false 169 | }, 170 | "outputs": [], 171 | "source": [ 172 | "def predictions_0(data):\n", 173 | " \"\"\" 不考虑任何特征,预测所有人都无法生还 \"\"\"\n", 174 | "\n", 175 | " predictions = []\n", 176 | " for _, passenger in data.iterrows():\n", 177 | " \n", 178 | " # 预测 'passenger' 的生还率\n", 179 | " predictions.append(0)\n", 180 | " \n", 181 | " # 返回预测结果\n", 182 | " return pd.Series(predictions)\n", 183 | "\n", 184 | "# 进行预测\n", 185 | "predictions = predictions_0(data)" 186 | ] 187 | }, 188 | { 189 | "cell_type": "markdown", 190 | "metadata": {}, 191 | "source": [ 192 | "**问题1**:对比真实的泰坦尼克号的数据,如果我们做一个所有乘客都没有存活的预测,这个预测的准确率能达到多少?\n", 193 | "\n", 194 | "**回答**: *请用预测结果来替换掉这里的文字*\n", 195 | "\n", 196 | "**提示**:运行下面的代码来查看预测的准确率。" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": null, 202 | "metadata": { 203 | "collapsed": true 204 | }, 205 | "outputs": [], 206 | "source": [ 207 | "print accuracy_score(outcomes, predictions)" 208 | ] 209 | }, 210 | { 211 | "cell_type": "markdown", 212 | "metadata": {}, 213 | "source": [ 214 | "### 考虑一个特征进行预测\n", 215 | "\n", 216 | "我们可以使用 `survival_stats` 函数来看看 **Sex** 这一特征对乘客的存活率有多大影响。这个函数定义在名为 `titanic_visualizations.py` 的 Python 脚本文件中,我们的项目提供了这个文件。传递给函数的前两个参数分别是泰坦尼克号的乘客数据和乘客的 生还结果。第三个参数表明我们会依据哪个特征来绘制图形。\n", 217 | "\n", 218 | "运行下面的代码绘制出依据乘客性别计算存活率的柱形图。 " 219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": null, 224 | "metadata": { 225 | "collapsed": false 226 | }, 227 | "outputs": [], 228 | "source": [ 229 | "survival_stats(data, outcomes, 'Sex')" 230 | ] 231 | }, 232 | { 233 | "cell_type": "markdown", 234 | "metadata": {}, 235 | "source": [ 236 | "观察泰坦尼克号上乘客存活的数据统计,我们可以发现大部分男性乘客在船沉没的时候都遇难了。相反的,大部分女性乘客都在事故中**生还**。让我们以此改进先前的预测:如果乘客是男性,那么我们就预测他们遇难;如果乘客是女性,那么我们预测他们在事故中活了下来。\n", 237 | "\n", 238 | "将下面的代码补充完整,让函数可以进行正确预测。 \n", 239 | "\n", 240 | "**提示**:您可以用访问 dictionary(字典)的方法来访问船上乘客的每个特征对应的值。例如, `passenger['Sex']` 返回乘客的性别。" 241 | ] 242 | }, 243 | { 244 | "cell_type": "code", 245 | "execution_count": null, 246 | "metadata": { 247 | "collapsed": false 248 | }, 249 | "outputs": [], 250 | "source": [ 251 | "def predictions_1(data):\n", 252 | " \"\"\" 只考虑一个特征,如果是女性则生还 \"\"\"\n", 253 | " \n", 254 | " predictions = []\n", 255 | " for _, passenger in data.iterrows():\n", 256 | " \n", 
257 | " # TODO 1\n", 258 | " # 移除下方的 'pass' 声明\n", 259 | " # 输入你自己的预测条件\n", 260 | " pass\n", 261 | " \n", 262 | " # 返回预测结果\n", 263 | " return pd.Series(predictions)\n", 264 | "\n", 265 | "# 进行预测\n", 266 | "predictions = predictions_1(data)" 267 | ] 268 | }, 269 | { 270 | "cell_type": "markdown", 271 | "metadata": {}, 272 | "source": [ 273 | "**问题2**:当我们预测船上女性乘客全部存活,而剩下的人全部遇难,那么我们预测的准确率会达到多少?\n", 274 | "\n", 275 | "**回答**: *用预测结果来替换掉这里的文字*\n", 276 | "\n", 277 | "**提示**:你需要在下面添加一个代码区域,实现代码并运行来计算准确率。" 278 | ] 279 | }, 280 | { 281 | "cell_type": "markdown", 282 | "metadata": {}, 283 | "source": [ 284 | "### 考虑两个特征进行预测\n", 285 | "\n", 286 | "仅仅使用乘客性别(Sex)这一特征,我们预测的准确性就有了明显的提高。现在再看一下使用额外的特征能否更进一步提升我们的预测准确度。例如,综合考虑所有在泰坦尼克号上的男性乘客:我们是否找到这些乘客中的一个子集,他们的存活概率较高。让我们再次使用 `survival_stats` 函数来看看每位男性乘客的年龄(Age)。这一次,我们将使用第四个参数来限定柱形图中只有男性乘客。\n", 287 | "\n", 288 | "运行下面这段代码,把男性基于年龄的生存结果绘制出来。" 289 | ] 290 | }, 291 | { 292 | "cell_type": "code", 293 | "execution_count": null, 294 | "metadata": { 295 | "collapsed": false 296 | }, 297 | "outputs": [], 298 | "source": [ 299 | "survival_stats(data, outcomes, 'Age', [\"Sex == 'male'\"])" 300 | ] 301 | }, 302 | { 303 | "cell_type": "markdown", 304 | "metadata": { 305 | "collapsed": true 306 | }, 307 | "source": [ 308 | "仔细观察泰坦尼克号存活的数据统计,在船沉没的时候,大部分小于10岁的男孩都活着,而大多数10岁以上的男性都随着船的沉没而**遇难**。让我们继续在先前预测的基础上构建:如果乘客是女性,那么我们就预测她们全部存活;如果乘客是男性并且小于10岁,我们也会预测他们全部存活;所有其它我们就预测他们都没有幸存。 \n", 309 | "\n", 310 | "将下面缺失的代码补充完整,让我们的函数可以实现预测。 \n", 311 | "**提示**: 您可以用之前 `predictions_1` 的代码作为开始来修改代码,实现新的预测函数。" 312 | ] 313 | }, 314 | { 315 | "cell_type": "code", 316 | "execution_count": null, 317 | "metadata": { 318 | "collapsed": false 319 | }, 320 | "outputs": [], 321 | "source": [ 322 | "def predictions_2(data):\n", 323 | " \"\"\" 考虑两个特征: \n", 324 | " - 如果是女性则生还\n", 325 | " - 如果是男性并且小于10岁则生还 \"\"\"\n", 326 | " \n", 327 | " predictions = []\n", 328 | " for _, passenger in data.iterrows():\n", 329 | " \n", 330 | " # TODO 2\n", 331 | " # 移除下方的 'pass' 声明\n", 332 | " # 输入你自己的预测条件\n", 333 | " pass\n", 334 | " \n", 335 | " # 返回预测结果\n", 336 | " return pd.Series(predictions)\n", 337 | "\n", 338 | "# 进行预测\n", 339 | "predictions = predictions_2(data)" 340 | ] 341 | }, 342 | { 343 | "cell_type": "markdown", 344 | "metadata": {}, 345 | "source": [ 346 | "**问题3**:当预测所有女性以及小于10岁的男性都存活的时候,预测的准确率会达到多少?\n", 347 | "\n", 348 | "**回答**: *用预测结果来替换掉这里的文字*\n", 349 | "\n", 350 | "**提示**:你需要在下面添加一个代码区域,实现代码并运行来计算准确率。" 351 | ] 352 | }, 353 | { 354 | "cell_type": "markdown", 355 | "metadata": { 356 | "collapsed": true 357 | }, 358 | "source": [ 359 | "### 你自己的预测模型\n", 360 | "\n", 361 | "添加年龄(Age)特征与性别(Sex)的结合比单独使用性别(Sex)也提高了不少准确度。现在该你来做预测了:找到一系列的特征和条件来对数据进行划分,使得预测结果提高到80%以上。这可能需要多个特性和多个层次的条件语句才会成功。你可以在不同的条件下多次使用相同的特征。**Pclass**,**Sex**,**Age**,**SibSp** 和 **Parch** 是建议尝试使用的特征。 \n", 362 | "\n", 363 | "使用 `survival_stats` 函数来观测泰坦尼克号上乘客存活的数据统计。 \n", 364 | "**提示:** 要使用多个过滤条件,把每一个条件放在一个列表里作为最后一个参数传递进去。例如: `[\"Sex == 'male'\", \"Age < 18\"]`" 365 | ] 366 | }, 367 | { 368 | "cell_type": "code", 369 | "execution_count": null, 370 | "metadata": { 371 | "collapsed": false 372 | }, 373 | "outputs": [], 374 | "source": [ 375 | "survival_stats(data, outcomes, 'Age', [\"Sex == 'male'\", \"Age < 18\"])" 376 | ] 377 | }, 378 | { 379 | "cell_type": "markdown", 380 | "metadata": {}, 381 | "source": [ 382 | "当查看和研究了图形化的泰坦尼克号上乘客的数据统计后,请补全下面这段代码中缺失的部分,使得函数可以返回你的预测。 \n", 383 | "在到达最终的预测模型前请确保记录你尝试过的各种特征和条件。 \n", 384 | "**提示:** 您可以用之前 `predictions_2` 的代码作为开始来修改代码,实现新的预测函数。" 385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | 
"execution_count": null, 390 | "metadata": { 391 | "collapsed": false 392 | }, 393 | "outputs": [], 394 | "source": [ 395 | "def predictions_3(data):\n", 396 | " \"\"\" 考虑多个特征,准确率至少达到80% \"\"\"\n", 397 | " \n", 398 | " predictions = []\n", 399 | " for _, passenger in data.iterrows():\n", 400 | " \n", 401 | " # TODO 3\n", 402 | " # 移除下方的 'pass' 声明\n", 403 | " # 输入你自己的预测条件\n", 404 | " pass\n", 405 | " \n", 406 | " # 返回预测结果\n", 407 | " return pd.Series(predictions)\n", 408 | "\n", 409 | "# 进行预测\n", 410 | "predictions = predictions_3(data)" 411 | ] 412 | }, 413 | { 414 | "cell_type": "markdown", 415 | "metadata": {}, 416 | "source": [ 417 | "**问题4**:请描述你实现80%准确度的预测模型所经历的步骤。您观察过哪些特征?某些特性是否比其他特征更有帮助?你用了什么条件来预测生还结果?你最终的预测的准确率是多少?\n", 418 | "\n", 419 | "**回答**:*用上面问题的答案来替换掉这里的文字*\n", 420 | "\n", 421 | "**提示**:你需要在下面添加一个代码区域,实现代码并运行来计算准确率。" 422 | ] 423 | }, 424 | { 425 | "cell_type": "markdown", 426 | "metadata": {}, 427 | "source": [ 428 | "### 结论\n", 429 | "\n", 430 | "经过了数次对数据的探索和分类,你创建了一个预测泰坦尼克号乘客存活率的有用的算法。在这个项目中你手动地实现了一个简单的机器学习模型——决策树(*decision tree*)。决策树每次按照一个特征把数据分割成越来越小的群组(被称为 *nodes*)。每次数据的一个子集被分出来,如果分割后新子集之间的相似度比分割前更高(包含近似的标签),我们的预测也就更加准确。电脑来帮助我们做这件事会比手动做更彻底,更精确。[这个链接](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)提供了另一个使用决策树做机器学习入门的例子。 \n", 431 | "\n", 432 | "决策树是许多**监督学习**算法中的一种。在监督学习中,我们关心的是使用数据的特征并根据数据的结果标签进行预测或建模。也就是说,每一组数据都有一个真正的结果值,不论是像泰坦尼克号生存数据集一样的标签,或者是连续的房价预测。\n", 433 | "\n", 434 | "**问题5**:想象一个真实世界中应用监督学习的场景,你期望预测的结果是什么?举出两个在这个场景中能够帮助你进行预测的数据集中的特征。" 435 | ] 436 | }, 437 | { 438 | "cell_type": "markdown", 439 | "metadata": { 440 | "collapsed": true 441 | }, 442 | "source": [ 443 | "**回答**: *用你的答案替换掉这里的文字*" 444 | ] 445 | }, 446 | { 447 | "cell_type": "markdown", 448 | "metadata": {}, 449 | "source": [ 450 | "> **注意**: 当你写完了所有**5个问题,3个TODO**。你就可以把你的 iPython Notebook 导出成 HTML 文件。你可以在菜单栏,这样导出**File -> Download as -> HTML (.html)** 把这个 HTML 和这个 iPython notebook 一起做为你的作业提交。" 451 | ] 452 | }, 453 | { 454 | "cell_type": "markdown", 455 | "metadata": {}, 456 | "source": [ 457 | "---\n", 458 | "翻译:毛礼建 | 校译:黄强 | 审译:曹晨巍" 459 | ] 460 | } 461 | ], 462 | "metadata": { 463 | "kernelspec": { 464 | "display_name": "Python 2", 465 | "language": "python", 466 | "name": "python2" 467 | }, 468 | "language_info": { 469 | "codemirror_mode": { 470 | "name": "ipython", 471 | "version": 2 472 | }, 473 | "file_extension": ".py", 474 | "mimetype": "text/x-python", 475 | "name": "python", 476 | "nbconvert_exporter": "python", 477 | "pygments_lexer": "ipython2", 478 | "version": "2.7.13" 479 | } 480 | }, 481 | "nbformat": 4, 482 | "nbformat_minor": 0 483 | } 484 | -------------------------------------------------------------------------------- /creating_customer_segments/cluster.csv: -------------------------------------------------------------------------------- 1 | Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicatessen,cluster 2 | 3,12669,9656,7561,214,2674,1338,1 3 | 3,7057,9810,9568,1762,3293,1776,1 4 | 3,6353,8808,7684,2405,3516,7844,1 5 | 3,13265,1196,4221,6404,507,1788,0 6 | 3,22615,5410,7198,3915,1777,5185,1 7 | 3,9413,8259,5126,666,1795,1451,1 8 | 3,12126,3199,6975,480,3140,545,1 9 | 3,7579,4956,9426,1669,3321,2566,1 10 | 3,5963,3648,6192,425,1716,750,1 11 | 3,6006,11093,18881,1159,7425,2098,1 12 | 3,3366,5403,12974,4400,5977,1744,1 13 | 3,13146,1124,4523,1420,549,497,0 14 | 3,31714,12319,11757,287,3881,2931,1 15 | 3,21217,6208,14982,3095,6707,602,1 16 | 3,24653,9465,12091,294,5058,2168,1 17 | 3,10253,1114,3821,397,964,412,0 18 | 3,1020,8816,12121,134,4508,1080,1 19 | 
3,5876,6157,2933,839,370,4478,0 20 | 3,18601,6327,10099,2205,2767,3181,1 21 | 3,7780,2495,9464,669,2518,501,1 22 | 3,17546,4519,4602,1066,2259,2124,1 23 | 3,5567,871,2010,3383,375,569,0 24 | 3,31276,1917,4469,9408,2381,4334,1 25 | 3,26373,36423,22019,5154,4337,16523,1 26 | 3,22647,9776,13792,2915,4482,5778,1 27 | 3,16165,4230,7595,201,4003,57,1 28 | 3,9898,961,2861,3151,242,833,0 29 | 3,14276,803,3045,485,100,518,0 30 | 3,4113,20484,25957,1158,8604,5206,1 31 | 3,43088,2100,2609,1200,1107,823,0 32 | 3,18815,3610,11107,1148,2134,2963,1 33 | 3,2612,4339,3133,2088,820,985,0 34 | 3,21632,1318,2886,266,918,405,0 35 | 3,29729,4786,7326,6130,361,1083,0 36 | 3,1502,1979,2262,425,483,395,0 37 | 3,688,5491,11091,833,4239,436,1 38 | 3,29955,4362,5428,1729,862,4626,0 39 | 3,15168,10556,12477,1920,6506,714,1 40 | 3,4591,15729,16709,33,6956,433,1 41 | 3,56159,555,902,10002,212,2916,0 42 | 3,24025,4332,4757,9510,1145,5864,1 43 | 3,19176,3065,5956,2033,2575,2802,1 44 | 3,10850,7555,14961,188,6899,46,1 45 | 3,630,11095,23998,787,9529,72,1 46 | 3,9670,7027,10471,541,4618,65,1 47 | 3,5181,22044,21531,1740,7353,4985,1 48 | 3,3103,14069,21955,1668,6792,1452,1 49 | 3,44466,54259,55571,7782,24171,6465,1 50 | 3,11519,6152,10868,584,5121,1476,1 51 | 3,4967,21412,28921,1798,13583,1163,1 52 | 3,6269,1095,1980,3860,609,2162,0 53 | 3,3347,4051,6996,239,1538,301,1 54 | 3,40721,3916,5876,532,2587,1278,1 55 | 3,491,10473,11532,744,5611,224,1 56 | 3,27329,1449,1947,2436,204,1333,0 57 | 3,5264,3683,5005,1057,2024,1130,1 58 | 3,4098,29892,26866,2616,17740,1340,1 59 | 3,5417,9933,10487,38,7572,1282,1 60 | 3,13779,1970,1648,596,227,436,0 61 | 3,6137,5360,8040,129,3084,1603,1 62 | 3,8590,3045,7854,96,4095,225,1 63 | 3,35942,38369,59598,3254,26701,2017,1 64 | 3,7823,6245,6544,4154,4074,964,1 65 | 3,9396,11601,15775,2896,7677,1295,1 66 | 3,4760,1227,3250,3724,1247,1145,0 67 | 3,19913,6759,13462,1256,5141,834,1 68 | 3,2446,7260,3993,5870,788,3095,0 69 | 3,8352,2820,1293,779,656,144,0 70 | 3,16705,2037,3202,10643,116,1365,0 71 | 3,18291,1266,21042,5373,4173,14472,1 72 | 3,4420,5139,2661,8872,1321,181,0 73 | 3,19899,5332,8713,8132,764,648,0 74 | 3,8190,6343,9794,1285,1901,1780,1 75 | 3,717,3587,6532,7530,529,894,0 76 | 3,12205,12697,28540,869,12034,1009,1 77 | 3,10766,1175,2067,2096,301,167,0 78 | 3,1640,3259,3655,868,1202,1653,1 79 | 3,7005,829,3009,430,610,529,0 80 | 3,219,9540,14403,283,7818,156,1 81 | 3,10362,9232,11009,737,3537,2342,1 82 | 3,20874,1563,1783,2320,550,772,0 83 | 3,11867,3327,4814,1178,3837,120,1 84 | 3,16117,46197,92780,1026,40827,2944,1 85 | 3,22925,73498,32114,987,20070,903,1 86 | 3,43265,5025,8117,6312,1579,14351,1 87 | 3,7864,542,4042,9735,165,46,0 88 | 3,24904,3836,5330,3443,454,3178,0 89 | 3,11405,596,1638,3347,69,360,0 90 | 3,12754,2762,2530,8693,627,1117,0 91 | 3,9198,27472,32034,3232,18906,5130,1 92 | 3,11314,3090,2062,35009,71,2698,0 93 | 3,5626,12220,11323,206,5038,244,1 94 | 3,3,2920,6252,440,223,709,1 95 | 3,23,2616,8118,145,3874,217,1 96 | 3,403,254,610,774,54,63,0 97 | 3,503,112,778,895,56,132,0 98 | 3,9658,2182,1909,5639,215,323,0 99 | 3,11594,7779,12144,3252,8035,3029,1 100 | 3,1420,10810,16267,1593,6766,1838,1 101 | 3,2932,6459,7677,2561,4573,1386,1 102 | 3,56082,3504,8906,18028,1480,2498,1 103 | 3,14100,2132,3445,1336,1491,548,0 104 | 3,15587,1014,3970,910,139,1378,0 105 | 3,1454,6337,10704,133,6830,1831,1 106 | 3,8797,10646,14886,2471,8969,1438,1 107 | 3,1531,8397,6981,247,2505,1236,1 108 | 3,1406,16729,28986,673,836,3,1 109 | 3,11818,1648,1694,2276,169,1647,0 110 | 
3,12579,11114,17569,805,6457,1519,1 111 | 3,19046,2770,2469,8853,483,2708,0 112 | 3,14438,2295,1733,3220,585,1561,0 113 | 3,18044,1080,2000,2555,118,1266,0 114 | 3,11134,793,2988,2715,276,610,0 115 | 3,11173,2521,3355,1517,310,222,0 116 | 3,6990,3880,5380,1647,319,1160,0 117 | 3,20049,1891,2362,5343,411,933,0 118 | 3,8258,2344,2147,3896,266,635,0 119 | 3,17160,1200,3412,2417,174,1136,0 120 | 3,4020,3234,1498,2395,264,255,0 121 | 3,12212,201,245,1991,25,860,0 122 | 3,11170,10769,8814,2194,1976,143,1 123 | 3,36050,1642,2961,4787,500,1621,0 124 | 3,76237,3473,7102,16538,778,918,0 125 | 3,19219,1840,1658,8195,349,483,0 126 | 3,21465,7243,10685,880,2386,2749,1 127 | 3,42312,926,1510,1718,410,1819,0 128 | 3,7149,2428,699,6316,395,911,0 129 | 3,2101,589,314,346,70,310,0 130 | 3,14903,2032,2479,576,955,328,0 131 | 3,9434,1042,1235,436,256,396,0 132 | 3,7388,1882,2174,720,47,537,0 133 | 3,6300,1289,2591,1170,199,326,0 134 | 3,4625,8579,7030,4575,2447,1542,1 135 | 3,3087,8080,8282,661,721,36,1 136 | 3,13537,4257,5034,155,249,3271,0 137 | 3,5387,4979,3343,825,637,929,0 138 | 3,17623,4280,7305,2279,960,2616,0 139 | 3,30379,13252,5189,321,51,1450,0 140 | 3,37036,7152,8253,2995,20,3,0 141 | 3,10405,1596,1096,8425,399,318,0 142 | 3,18827,3677,1988,118,516,201,0 143 | 3,22039,8384,34792,42,12591,4430,1 144 | 3,7769,1936,2177,926,73,520,0 145 | 3,9203,3373,2707,1286,1082,526,0 146 | 3,5924,584,542,4052,283,434,0 147 | 3,31812,1433,1651,800,113,1440,0 148 | 3,16225,1825,1765,853,170,1067,0 149 | 3,1289,3328,2022,531,255,1774,0 150 | 3,18840,1371,3135,3001,352,184,0 151 | 3,3463,9250,2368,779,302,1627,0 152 | 3,1989,10690,19460,233,11577,2153,1 153 | 3,3830,5291,14855,317,6694,3182,1 154 | 3,17773,1366,2474,3378,811,418,0 155 | 3,2861,6570,9618,930,4004,1682,1 156 | 3,355,7704,14682,398,8077,303,1 157 | 3,1725,3651,12822,824,4424,2157,1 158 | 3,12434,540,283,1092,3,2233,0 159 | 3,15177,2024,3810,2665,232,610,0 160 | 3,5531,15726,26870,2367,13726,446,1 161 | 3,5224,7603,8584,2540,3674,238,1 162 | 3,15615,12653,19858,4425,7108,2379,1 163 | 3,4822,6721,9170,993,4973,3637,1 164 | 3,2926,3195,3268,405,1680,693,1 165 | 3,5809,735,803,1393,79,429,0 166 | 3,5414,717,2155,2399,69,750,0 167 | 3,260,8675,13430,1116,7015,323,1 168 | 3,200,25862,19816,651,8773,6250,1 169 | 3,955,5479,6536,333,2840,707,1 170 | 3,514,7677,19805,937,9836,716,1 171 | 3,286,1208,5241,2515,153,1442,0 172 | 3,2343,7845,11874,52,4196,1697,1 173 | 3,45640,6958,6536,7368,1532,230,0 174 | 3,12759,7330,4533,1752,20,2631,0 175 | 3,11002,7075,4945,1152,120,395,0 176 | 3,3157,4888,2500,4477,273,2165,0 177 | 3,12356,6036,8887,402,1382,2794,1 178 | 3,112151,29627,18148,16745,4948,8550,1 179 | 3,694,8533,10518,443,6907,156,1 180 | 3,36847,43950,20170,36534,239,47943,1 181 | 3,327,918,4710,74,334,11,1 182 | 3,8170,6448,1139,2181,58,247,0 183 | 3,3009,521,854,3470,949,727,0 184 | 3,2438,8002,9819,6269,3459,3,1 185 | 3,8040,7639,11687,2758,6839,404,1 186 | 3,834,11577,11522,275,4027,1856,1 187 | 3,16936,6250,1981,7332,118,64,0 188 | 3,13624,295,1381,890,43,84,0 189 | 3,5509,1461,2251,547,187,409,0 190 | 3,180,3485,20292,959,5618,666,1 191 | 3,7107,1012,2974,806,355,1142,0 192 | 3,17023,5139,5230,7888,330,1755,0 193 | 1,30624,7209,4897,18711,763,2876,1 194 | 1,2427,7097,10391,1127,4314,1468,1 195 | 1,11686,2154,6824,3527,592,697,0 196 | 1,9670,2280,2112,520,402,347,0 197 | 1,3067,13240,23127,3941,9959,731,1 198 | 1,4484,14399,24708,3549,14235,1681,1 199 | 1,25203,11487,9490,5065,284,6854,1 200 | 1,583,685,2216,469,954,18,1 201 | 1,1956,891,5226,1383,5,1328,0 
202 | 1,1107,11711,23596,955,9265,710,1 203 | 1,6373,780,950,878,288,285,0 204 | 1,2541,4737,6089,2946,5316,120,1 205 | 1,1537,3748,5838,1859,3381,806,1 206 | 1,5550,12729,16767,864,12420,797,1 207 | 1,18567,1895,1393,1801,244,2100,0 208 | 1,12119,28326,39694,4736,19410,2870,1 209 | 1,7291,1012,2062,1291,240,1775,0 210 | 1,3317,6602,6861,1329,3961,1215,1 211 | 1,2362,6551,11364,913,5957,791,1 212 | 1,2806,10765,15538,1374,5828,2388,1 213 | 1,2532,16599,36486,179,13308,674,1 214 | 1,18044,1475,2046,2532,130,1158,0 215 | 1,18,7504,15205,1285,4797,6372,1 216 | 1,4155,367,1390,2306,86,130,0 217 | 1,14755,899,1382,1765,56,749,0 218 | 1,5396,7503,10646,91,4167,239,1 219 | 1,5041,1115,2856,7496,256,375,0 220 | 1,2790,2527,5265,5612,788,1360,0 221 | 1,7274,659,1499,784,70,659,0 222 | 1,12680,3243,4157,660,761,786,0 223 | 1,20782,5921,9212,1759,2568,1553,1 224 | 1,4042,2204,1563,2286,263,689,0 225 | 1,1869,577,572,950,4762,203,0 226 | 1,8656,2746,2501,6845,694,980,0 227 | 1,11072,5989,5615,8321,955,2137,0 228 | 1,2344,10678,3828,1439,1566,490,1 229 | 1,25962,1780,3838,638,284,834,0 230 | 1,964,4984,3316,937,409,7,1 231 | 1,15603,2703,3833,4260,325,2563,0 232 | 1,1838,6380,2824,1218,1216,295,1 233 | 1,8635,820,3047,2312,415,225,0 234 | 1,18692,3838,593,4634,28,1215,0 235 | 1,7363,475,585,1112,72,216,0 236 | 1,47493,2567,3779,5243,828,2253,0 237 | 1,22096,3575,7041,11422,343,2564,0 238 | 1,24929,1801,2475,2216,412,1047,0 239 | 1,18226,659,2914,3752,586,578,0 240 | 1,11210,3576,5119,561,1682,2398,1 241 | 1,6202,7775,10817,1183,3143,1970,1 242 | 1,3062,6154,13916,230,8933,2784,1 243 | 1,8885,2428,1777,1777,430,610,0 244 | 1,13569,346,489,2077,44,659,0 245 | 1,15671,5279,2406,559,562,572,0 246 | 1,8040,3795,2070,6340,918,291,0 247 | 1,3191,1993,1799,1730,234,710,0 248 | 1,6134,23133,33586,6746,18594,5121,1 249 | 1,6623,1860,4740,7683,205,1693,0 250 | 1,29526,7961,16966,432,363,1391,0 251 | 1,10379,17972,4748,4686,1547,3265,1 252 | 1,31614,489,1495,3242,111,615,0 253 | 1,11092,5008,5249,453,392,373,0 254 | 1,8475,1931,1883,5004,3593,987,0 255 | 1,56083,4563,2124,6422,730,3321,0 256 | 1,53205,4959,7336,3012,967,818,0 257 | 1,9193,4885,2157,327,780,548,0 258 | 1,7858,1110,1094,6818,49,287,0 259 | 1,23257,1372,1677,982,429,655,0 260 | 1,2153,1115,6684,4324,2894,411,0 261 | 1,1073,9679,15445,61,5980,1265,1 262 | 1,5909,23527,13699,10155,830,3636,1 263 | 1,572,9763,22182,2221,4882,2563,1 264 | 1,20893,1222,2576,3975,737,3628,0 265 | 1,11908,8053,19847,1069,6374,698,1 266 | 1,15218,258,1138,2516,333,204,0 267 | 1,4720,1032,975,5500,197,56,0 268 | 1,2083,5007,1563,1120,147,1550,0 269 | 1,514,8323,6869,529,93,1040,0 270 | 3,36817,3045,1493,4802,210,1824,0 271 | 3,894,1703,1841,744,759,1153,0 272 | 3,680,1610,223,862,96,379,0 273 | 3,27901,3749,6964,4479,603,2503,0 274 | 3,9061,829,683,16919,621,139,0 275 | 3,11693,2317,2543,5845,274,1409,0 276 | 3,17360,6200,9694,1293,3620,1721,1 277 | 3,3366,2884,2431,977,167,1104,0 278 | 3,12238,7108,6235,1093,2328,2079,1 279 | 3,49063,3965,4252,5970,1041,1404,0 280 | 3,25767,3613,2013,10303,314,1384,0 281 | 3,68951,4411,12609,8692,751,2406,1 282 | 3,40254,640,3600,1042,436,18,0 283 | 3,7149,2247,1242,1619,1226,128,0 284 | 3,15354,2102,2828,8366,386,1027,0 285 | 3,16260,594,1296,848,445,258,0 286 | 3,42786,286,471,1388,32,22,0 287 | 3,2708,2160,2642,502,965,1522,0 288 | 3,6022,3354,3261,2507,212,686,0 289 | 3,2838,3086,4329,3838,825,1060,0 290 | 2,3996,11103,12469,902,5952,741,1 291 | 2,21273,2013,6550,909,811,1854,0 292 | 2,7588,1897,5234,417,2208,254,1 293 | 
2,19087,1304,3643,3045,710,898,0 294 | 2,8090,3199,6986,1455,3712,531,1 295 | 2,6758,4560,9965,934,4538,1037,1 296 | 2,444,879,2060,264,290,259,1 297 | 2,16448,6243,6360,824,2662,2005,1 298 | 2,5283,13316,20399,1809,8752,172,1 299 | 2,2886,5302,9785,364,6236,555,1 300 | 2,2599,3688,13829,492,10069,59,1 301 | 2,161,7460,24773,617,11783,2410,1 302 | 2,243,12939,8852,799,3909,211,1 303 | 2,6468,12867,21570,1840,7558,1543,1 304 | 2,17327,2374,2842,1149,351,925,0 305 | 2,6987,1020,3007,416,257,656,0 306 | 2,918,20655,13567,1465,6846,806,1 307 | 2,7034,1492,2405,12569,299,1117,0 308 | 2,29635,2335,8280,3046,371,117,0 309 | 2,2137,3737,19172,1274,17120,142,1 310 | 2,9784,925,2405,4447,183,297,0 311 | 2,10617,1795,7647,1483,857,1233,0 312 | 2,1479,14982,11924,662,3891,3508,1 313 | 2,7127,1375,2201,2679,83,1059,0 314 | 2,1182,3088,6114,978,821,1637,1 315 | 2,11800,2713,3558,2121,706,51,0 316 | 2,9759,25071,17645,1128,12408,1625,1 317 | 2,1774,3696,2280,514,275,834,0 318 | 2,9155,1897,5167,2714,228,1113,0 319 | 2,15881,713,3315,3703,1470,229,0 320 | 2,13360,944,11593,915,1679,573,0 321 | 2,25977,3587,2464,2369,140,1092,0 322 | 2,32717,16784,13626,60869,1272,5609,1 323 | 2,4414,1610,1431,3498,387,834,0 324 | 2,542,899,1664,414,88,522,0 325 | 2,16933,2209,3389,7849,210,1534,0 326 | 2,5113,1486,4583,5127,492,739,0 327 | 2,9790,1786,5109,3570,182,1043,0 328 | 2,11223,14881,26839,1234,9606,1102,1 329 | 2,22321,3216,1447,2208,178,2602,0 330 | 2,8565,4980,67298,131,38102,1215,1 331 | 2,16823,928,2743,11559,332,3486,0 332 | 2,27082,6817,10790,1365,4111,2139,1 333 | 2,13970,1511,1330,650,146,778,0 334 | 2,9351,1347,2611,8170,442,868,0 335 | 2,3,333,7021,15601,15,550,1 336 | 2,2617,1188,5332,9584,573,1942,0 337 | 3,381,4025,9670,388,7271,1371,1 338 | 3,2320,5763,11238,767,5162,2158,1 339 | 3,255,5758,5923,349,4595,1328,1 340 | 3,1689,6964,26316,1456,15469,37,1 341 | 3,3043,1172,1763,2234,217,379,0 342 | 3,1198,2602,8335,402,3843,303,1 343 | 3,2771,6939,15541,2693,6600,1115,1 344 | 3,27380,7184,12311,2809,4621,1022,1 345 | 3,3428,2380,2028,1341,1184,665,0 346 | 3,5981,14641,20521,2005,12218,445,1 347 | 3,3521,1099,1997,1796,173,995,0 348 | 3,1210,10044,22294,1741,12638,3137,1 349 | 3,608,1106,1533,830,90,195,0 350 | 3,117,6264,21203,228,8682,1111,1 351 | 3,14039,7393,2548,6386,1333,2341,0 352 | 3,190,727,2012,245,184,127,1 353 | 3,22686,134,218,3157,9,548,0 354 | 3,37,1275,22272,137,6747,110,1 355 | 3,759,18664,1660,6114,536,4100,0 356 | 3,796,5878,2109,340,232,776,0 357 | 3,19746,2872,2006,2601,468,503,0 358 | 3,4734,607,864,1206,159,405,0 359 | 3,2121,1601,2453,560,179,712,0 360 | 3,4627,997,4438,191,1335,314,1 361 | 3,2615,873,1524,1103,514,468,0 362 | 3,4692,6128,8025,1619,4515,3105,1 363 | 3,9561,2217,1664,1173,222,447,0 364 | 3,3477,894,534,1457,252,342,0 365 | 3,22335,1196,2406,2046,101,558,0 366 | 3,6211,337,683,1089,41,296,0 367 | 3,39679,3944,4955,1364,523,2235,0 368 | 3,20105,1887,1939,8164,716,790,0 369 | 3,3884,3801,1641,876,397,4829,0 370 | 3,15076,6257,7398,1504,1916,3113,1 371 | 3,6338,2256,1668,1492,311,686,0 372 | 3,5841,1450,1162,597,476,70,0 373 | 3,3136,8630,13586,5641,4666,1426,1 374 | 3,38793,3154,2648,1034,96,1242,0 375 | 3,3225,3294,1902,282,68,1114,0 376 | 3,4048,5164,10391,130,813,179,1 377 | 3,28257,944,2146,3881,600,270,0 378 | 3,17770,4591,1617,9927,246,532,0 379 | 3,34454,7435,8469,2540,1711,2893,1 380 | 3,1821,1364,3450,4006,397,361,0 381 | 3,10683,21858,15400,3635,282,5120,1 382 | 3,11635,922,1614,2583,192,1068,0 383 | 3,1206,3620,2857,1945,353,967,0 384 | 
3,20918,1916,1573,1960,231,961,0 385 | 3,9785,848,1172,1677,200,406,0 386 | 3,9385,1530,1422,3019,227,684,0 387 | 3,3352,1181,1328,5502,311,1000,0 388 | 3,2647,2761,2313,907,95,1827,0 389 | 3,518,4180,3600,659,122,654,0 390 | 3,23632,6730,3842,8620,385,819,0 391 | 3,12377,865,3204,1398,149,452,0 392 | 3,9602,1316,1263,2921,841,290,0 393 | 3,4515,11991,9345,2644,3378,2213,1 394 | 3,11535,1666,1428,6838,64,743,0 395 | 3,11442,1032,582,5390,74,247,0 396 | 3,9612,577,935,1601,469,375,0 397 | 3,4446,906,1238,3576,153,1014,0 398 | 3,27167,2801,2128,13223,92,1902,0 399 | 3,26539,4753,5091,220,10,340,0 400 | 3,25606,11006,4604,127,632,288,0 401 | 3,18073,4613,3444,4324,914,715,0 402 | 3,6884,1046,1167,2069,593,378,0 403 | 3,25066,5010,5026,9806,1092,960,0 404 | 3,7362,12844,18683,2854,7883,553,1 405 | 3,8257,3880,6407,1646,2730,344,1 406 | 3,8708,3634,6100,2349,2123,5137,1 407 | 3,6633,2096,4563,1389,1860,1892,0 408 | 3,2126,3289,3281,1535,235,4365,0 409 | 3,97,3605,12400,98,2970,62,1 410 | 3,4983,4859,6633,17866,912,2435,0 411 | 3,5969,1990,3417,5679,1135,290,0 412 | 3,7842,6046,8552,1691,3540,1874,1 413 | 3,4389,10940,10908,848,6728,993,1 414 | 3,5065,5499,11055,364,3485,1063,1 415 | 3,660,8494,18622,133,6740,776,1 416 | 3,8861,3783,2223,633,1580,1521,0 417 | 3,4456,5266,13227,25,6818,1393,1 418 | 3,17063,4847,9053,1031,3415,1784,1 419 | 3,26400,1377,4172,830,948,1218,0 420 | 3,17565,3686,4657,1059,1803,668,0 421 | 3,16980,2884,12232,874,3213,249,1 422 | 3,11243,2408,2593,15348,108,1886,0 423 | 3,13134,9347,14316,3141,5079,1894,1 424 | 3,31012,16687,5429,15082,439,1163,0 425 | 3,3047,5970,4910,2198,850,317,0 426 | 3,8607,1750,3580,47,84,2501,0 427 | 3,3097,4230,16483,575,241,2080,0 428 | 3,8533,5506,5160,13486,1377,1498,0 429 | 3,21117,1162,4754,269,1328,395,0 430 | 3,1982,3218,1493,1541,356,1449,0 431 | 3,16731,3922,7994,688,2371,838,1 432 | 3,29703,12051,16027,13135,182,2204,0 433 | 3,39228,1431,764,4510,93,2346,0 434 | 3,14531,15488,30243,437,14841,1867,1 435 | 3,10290,1981,2232,1038,168,2125,0 436 | 3,2787,1698,2510,65,477,52,1 437 | -------------------------------------------------------------------------------- /creating_customer_segments/customers.csv: -------------------------------------------------------------------------------- 1 | Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicatessen 2 | 2,3,12669,9656,7561,214,2674,1338 3 | 2,3,7057,9810,9568,1762,3293,1776 4 | 2,3,6353,8808,7684,2405,3516,7844 5 | 1,3,13265,1196,4221,6404,507,1788 6 | 2,3,22615,5410,7198,3915,1777,5185 7 | 2,3,9413,8259,5126,666,1795,1451 8 | 2,3,12126,3199,6975,480,3140,545 9 | 2,3,7579,4956,9426,1669,3321,2566 10 | 1,3,5963,3648,6192,425,1716,750 11 | 2,3,6006,11093,18881,1159,7425,2098 12 | 2,3,3366,5403,12974,4400,5977,1744 13 | 2,3,13146,1124,4523,1420,549,497 14 | 2,3,31714,12319,11757,287,3881,2931 15 | 2,3,21217,6208,14982,3095,6707,602 16 | 2,3,24653,9465,12091,294,5058,2168 17 | 1,3,10253,1114,3821,397,964,412 18 | 2,3,1020,8816,12121,134,4508,1080 19 | 1,3,5876,6157,2933,839,370,4478 20 | 2,3,18601,6327,10099,2205,2767,3181 21 | 1,3,7780,2495,9464,669,2518,501 22 | 2,3,17546,4519,4602,1066,2259,2124 23 | 1,3,5567,871,2010,3383,375,569 24 | 1,3,31276,1917,4469,9408,2381,4334 25 | 2,3,26373,36423,22019,5154,4337,16523 26 | 2,3,22647,9776,13792,2915,4482,5778 27 | 2,3,16165,4230,7595,201,4003,57 28 | 1,3,9898,961,2861,3151,242,833 29 | 1,3,14276,803,3045,485,100,518 30 | 2,3,4113,20484,25957,1158,8604,5206 31 | 1,3,43088,2100,2609,1200,1107,823 32 | 1,3,18815,3610,11107,1148,2134,2963 33 | 
1,3,2612,4339,3133,2088,820,985 34 | 1,3,21632,1318,2886,266,918,405 35 | 1,3,29729,4786,7326,6130,361,1083 36 | 1,3,1502,1979,2262,425,483,395 37 | 2,3,688,5491,11091,833,4239,436 38 | 1,3,29955,4362,5428,1729,862,4626 39 | 2,3,15168,10556,12477,1920,6506,714 40 | 2,3,4591,15729,16709,33,6956,433 41 | 1,3,56159,555,902,10002,212,2916 42 | 1,3,24025,4332,4757,9510,1145,5864 43 | 1,3,19176,3065,5956,2033,2575,2802 44 | 2,3,10850,7555,14961,188,6899,46 45 | 2,3,630,11095,23998,787,9529,72 46 | 2,3,9670,7027,10471,541,4618,65 47 | 2,3,5181,22044,21531,1740,7353,4985 48 | 2,3,3103,14069,21955,1668,6792,1452 49 | 2,3,44466,54259,55571,7782,24171,6465 50 | 2,3,11519,6152,10868,584,5121,1476 51 | 2,3,4967,21412,28921,1798,13583,1163 52 | 1,3,6269,1095,1980,3860,609,2162 53 | 1,3,3347,4051,6996,239,1538,301 54 | 2,3,40721,3916,5876,532,2587,1278 55 | 2,3,491,10473,11532,744,5611,224 56 | 1,3,27329,1449,1947,2436,204,1333 57 | 1,3,5264,3683,5005,1057,2024,1130 58 | 2,3,4098,29892,26866,2616,17740,1340 59 | 2,3,5417,9933,10487,38,7572,1282 60 | 1,3,13779,1970,1648,596,227,436 61 | 1,3,6137,5360,8040,129,3084,1603 62 | 2,3,8590,3045,7854,96,4095,225 63 | 2,3,35942,38369,59598,3254,26701,2017 64 | 2,3,7823,6245,6544,4154,4074,964 65 | 2,3,9396,11601,15775,2896,7677,1295 66 | 1,3,4760,1227,3250,3724,1247,1145 67 | 2,3,85,20959,45828,36,24231,1423 68 | 1,3,9,1534,7417,175,3468,27 69 | 2,3,19913,6759,13462,1256,5141,834 70 | 1,3,2446,7260,3993,5870,788,3095 71 | 1,3,8352,2820,1293,779,656,144 72 | 1,3,16705,2037,3202,10643,116,1365 73 | 1,3,18291,1266,21042,5373,4173,14472 74 | 1,3,4420,5139,2661,8872,1321,181 75 | 2,3,19899,5332,8713,8132,764,648 76 | 2,3,8190,6343,9794,1285,1901,1780 77 | 1,3,20398,1137,3,4407,3,975 78 | 1,3,717,3587,6532,7530,529,894 79 | 2,3,12205,12697,28540,869,12034,1009 80 | 1,3,10766,1175,2067,2096,301,167 81 | 1,3,1640,3259,3655,868,1202,1653 82 | 1,3,7005,829,3009,430,610,529 83 | 2,3,219,9540,14403,283,7818,156 84 | 2,3,10362,9232,11009,737,3537,2342 85 | 1,3,20874,1563,1783,2320,550,772 86 | 2,3,11867,3327,4814,1178,3837,120 87 | 2,3,16117,46197,92780,1026,40827,2944 88 | 2,3,22925,73498,32114,987,20070,903 89 | 1,3,43265,5025,8117,6312,1579,14351 90 | 1,3,7864,542,4042,9735,165,46 91 | 1,3,24904,3836,5330,3443,454,3178 92 | 1,3,11405,596,1638,3347,69,360 93 | 1,3,12754,2762,2530,8693,627,1117 94 | 2,3,9198,27472,32034,3232,18906,5130 95 | 1,3,11314,3090,2062,35009,71,2698 96 | 2,3,5626,12220,11323,206,5038,244 97 | 1,3,3,2920,6252,440,223,709 98 | 2,3,23,2616,8118,145,3874,217 99 | 1,3,403,254,610,774,54,63 100 | 1,3,503,112,778,895,56,132 101 | 1,3,9658,2182,1909,5639,215,323 102 | 2,3,11594,7779,12144,3252,8035,3029 103 | 2,3,1420,10810,16267,1593,6766,1838 104 | 2,3,2932,6459,7677,2561,4573,1386 105 | 1,3,56082,3504,8906,18028,1480,2498 106 | 1,3,14100,2132,3445,1336,1491,548 107 | 1,3,15587,1014,3970,910,139,1378 108 | 2,3,1454,6337,10704,133,6830,1831 109 | 2,3,8797,10646,14886,2471,8969,1438 110 | 2,3,1531,8397,6981,247,2505,1236 111 | 2,3,1406,16729,28986,673,836,3 112 | 1,3,11818,1648,1694,2276,169,1647 113 | 2,3,12579,11114,17569,805,6457,1519 114 | 1,3,19046,2770,2469,8853,483,2708 115 | 1,3,14438,2295,1733,3220,585,1561 116 | 1,3,18044,1080,2000,2555,118,1266 117 | 1,3,11134,793,2988,2715,276,610 118 | 1,3,11173,2521,3355,1517,310,222 119 | 1,3,6990,3880,5380,1647,319,1160 120 | 1,3,20049,1891,2362,5343,411,933 121 | 1,3,8258,2344,2147,3896,266,635 122 | 1,3,17160,1200,3412,2417,174,1136 123 | 1,3,4020,3234,1498,2395,264,255 124 | 1,3,12212,201,245,1991,25,860 
125 | 2,3,11170,10769,8814,2194,1976,143 126 | 1,3,36050,1642,2961,4787,500,1621 127 | 1,3,76237,3473,7102,16538,778,918 128 | 1,3,19219,1840,1658,8195,349,483 129 | 2,3,21465,7243,10685,880,2386,2749 130 | 1,3,140,8847,3823,142,1062,3 131 | 1,3,42312,926,1510,1718,410,1819 132 | 1,3,7149,2428,699,6316,395,911 133 | 1,3,2101,589,314,346,70,310 134 | 1,3,14903,2032,2479,576,955,328 135 | 1,3,9434,1042,1235,436,256,396 136 | 1,3,7388,1882,2174,720,47,537 137 | 1,3,6300,1289,2591,1170,199,326 138 | 1,3,4625,8579,7030,4575,2447,1542 139 | 1,3,3087,8080,8282,661,721,36 140 | 1,3,13537,4257,5034,155,249,3271 141 | 1,3,5387,4979,3343,825,637,929 142 | 1,3,17623,4280,7305,2279,960,2616 143 | 1,3,30379,13252,5189,321,51,1450 144 | 1,3,37036,7152,8253,2995,20,3 145 | 1,3,10405,1596,1096,8425,399,318 146 | 1,3,18827,3677,1988,118,516,201 147 | 2,3,22039,8384,34792,42,12591,4430 148 | 1,3,7769,1936,2177,926,73,520 149 | 1,3,9203,3373,2707,1286,1082,526 150 | 1,3,5924,584,542,4052,283,434 151 | 1,3,31812,1433,1651,800,113,1440 152 | 1,3,16225,1825,1765,853,170,1067 153 | 1,3,1289,3328,2022,531,255,1774 154 | 1,3,18840,1371,3135,3001,352,184 155 | 1,3,3463,9250,2368,779,302,1627 156 | 1,3,622,55,137,75,7,8 157 | 2,3,1989,10690,19460,233,11577,2153 158 | 2,3,3830,5291,14855,317,6694,3182 159 | 1,3,17773,1366,2474,3378,811,418 160 | 2,3,2861,6570,9618,930,4004,1682 161 | 2,3,355,7704,14682,398,8077,303 162 | 2,3,1725,3651,12822,824,4424,2157 163 | 1,3,12434,540,283,1092,3,2233 164 | 1,3,15177,2024,3810,2665,232,610 165 | 2,3,5531,15726,26870,2367,13726,446 166 | 2,3,5224,7603,8584,2540,3674,238 167 | 2,3,15615,12653,19858,4425,7108,2379 168 | 2,3,4822,6721,9170,993,4973,3637 169 | 1,3,2926,3195,3268,405,1680,693 170 | 1,3,5809,735,803,1393,79,429 171 | 1,3,5414,717,2155,2399,69,750 172 | 2,3,260,8675,13430,1116,7015,323 173 | 2,3,200,25862,19816,651,8773,6250 174 | 1,3,955,5479,6536,333,2840,707 175 | 2,3,514,7677,19805,937,9836,716 176 | 1,3,286,1208,5241,2515,153,1442 177 | 2,3,2343,7845,11874,52,4196,1697 178 | 1,3,45640,6958,6536,7368,1532,230 179 | 1,3,12759,7330,4533,1752,20,2631 180 | 1,3,11002,7075,4945,1152,120,395 181 | 1,3,3157,4888,2500,4477,273,2165 182 | 1,3,12356,6036,8887,402,1382,2794 183 | 1,3,112151,29627,18148,16745,4948,8550 184 | 1,3,694,8533,10518,443,6907,156 185 | 1,3,36847,43950,20170,36534,239,47943 186 | 1,3,327,918,4710,74,334,11 187 | 1,3,8170,6448,1139,2181,58,247 188 | 1,3,3009,521,854,3470,949,727 189 | 1,3,2438,8002,9819,6269,3459,3 190 | 2,3,8040,7639,11687,2758,6839,404 191 | 2,3,834,11577,11522,275,4027,1856 192 | 1,3,16936,6250,1981,7332,118,64 193 | 1,3,13624,295,1381,890,43,84 194 | 1,3,5509,1461,2251,547,187,409 195 | 2,3,180,3485,20292,959,5618,666 196 | 1,3,7107,1012,2974,806,355,1142 197 | 1,3,17023,5139,5230,7888,330,1755 198 | 1,1,30624,7209,4897,18711,763,2876 199 | 2,1,2427,7097,10391,1127,4314,1468 200 | 1,1,11686,2154,6824,3527,592,697 201 | 1,1,9670,2280,2112,520,402,347 202 | 2,1,3067,13240,23127,3941,9959,731 203 | 2,1,4484,14399,24708,3549,14235,1681 204 | 1,1,25203,11487,9490,5065,284,6854 205 | 1,1,583,685,2216,469,954,18 206 | 1,1,1956,891,5226,1383,5,1328 207 | 2,1,1107,11711,23596,955,9265,710 208 | 1,1,6373,780,950,878,288,285 209 | 2,1,2541,4737,6089,2946,5316,120 210 | 1,1,1537,3748,5838,1859,3381,806 211 | 2,1,5550,12729,16767,864,12420,797 212 | 1,1,18567,1895,1393,1801,244,2100 213 | 2,1,12119,28326,39694,4736,19410,2870 214 | 1,1,7291,1012,2062,1291,240,1775 215 | 1,1,3317,6602,6861,1329,3961,1215 216 | 2,1,2362,6551,11364,913,5957,791 217 
| 1,1,2806,10765,15538,1374,5828,2388 218 | 2,1,2532,16599,36486,179,13308,674 219 | 1,1,18044,1475,2046,2532,130,1158 220 | 2,1,18,7504,15205,1285,4797,6372 221 | 1,1,4155,367,1390,2306,86,130 222 | 1,1,14755,899,1382,1765,56,749 223 | 1,1,5396,7503,10646,91,4167,239 224 | 1,1,5041,1115,2856,7496,256,375 225 | 2,1,2790,2527,5265,5612,788,1360 226 | 1,1,7274,659,1499,784,70,659 227 | 1,1,12680,3243,4157,660,761,786 228 | 2,1,20782,5921,9212,1759,2568,1553 229 | 1,1,4042,2204,1563,2286,263,689 230 | 1,1,1869,577,572,950,4762,203 231 | 1,1,8656,2746,2501,6845,694,980 232 | 2,1,11072,5989,5615,8321,955,2137 233 | 1,1,2344,10678,3828,1439,1566,490 234 | 1,1,25962,1780,3838,638,284,834 235 | 1,1,964,4984,3316,937,409,7 236 | 1,1,15603,2703,3833,4260,325,2563 237 | 1,1,1838,6380,2824,1218,1216,295 238 | 1,1,8635,820,3047,2312,415,225 239 | 1,1,18692,3838,593,4634,28,1215 240 | 1,1,7363,475,585,1112,72,216 241 | 1,1,47493,2567,3779,5243,828,2253 242 | 1,1,22096,3575,7041,11422,343,2564 243 | 1,1,24929,1801,2475,2216,412,1047 244 | 1,1,18226,659,2914,3752,586,578 245 | 1,1,11210,3576,5119,561,1682,2398 246 | 1,1,6202,7775,10817,1183,3143,1970 247 | 2,1,3062,6154,13916,230,8933,2784 248 | 1,1,8885,2428,1777,1777,430,610 249 | 1,1,13569,346,489,2077,44,659 250 | 1,1,15671,5279,2406,559,562,572 251 | 1,1,8040,3795,2070,6340,918,291 252 | 1,1,3191,1993,1799,1730,234,710 253 | 2,1,6134,23133,33586,6746,18594,5121 254 | 1,1,6623,1860,4740,7683,205,1693 255 | 1,1,29526,7961,16966,432,363,1391 256 | 1,1,10379,17972,4748,4686,1547,3265 257 | 1,1,31614,489,1495,3242,111,615 258 | 1,1,11092,5008,5249,453,392,373 259 | 1,1,8475,1931,1883,5004,3593,987 260 | 1,1,56083,4563,2124,6422,730,3321 261 | 1,1,53205,4959,7336,3012,967,818 262 | 1,1,9193,4885,2157,327,780,548 263 | 1,1,7858,1110,1094,6818,49,287 264 | 1,1,23257,1372,1677,982,429,655 265 | 1,1,2153,1115,6684,4324,2894,411 266 | 2,1,1073,9679,15445,61,5980,1265 267 | 1,1,5909,23527,13699,10155,830,3636 268 | 2,1,572,9763,22182,2221,4882,2563 269 | 1,1,20893,1222,2576,3975,737,3628 270 | 2,1,11908,8053,19847,1069,6374,698 271 | 1,1,15218,258,1138,2516,333,204 272 | 1,1,4720,1032,975,5500,197,56 273 | 1,1,2083,5007,1563,1120,147,1550 274 | 1,1,514,8323,6869,529,93,1040 275 | 1,3,36817,3045,1493,4802,210,1824 276 | 1,3,894,1703,1841,744,759,1153 277 | 1,3,680,1610,223,862,96,379 278 | 1,3,27901,3749,6964,4479,603,2503 279 | 1,3,9061,829,683,16919,621,139 280 | 1,3,11693,2317,2543,5845,274,1409 281 | 2,3,17360,6200,9694,1293,3620,1721 282 | 1,3,3366,2884,2431,977,167,1104 283 | 2,3,12238,7108,6235,1093,2328,2079 284 | 1,3,49063,3965,4252,5970,1041,1404 285 | 1,3,25767,3613,2013,10303,314,1384 286 | 1,3,68951,4411,12609,8692,751,2406 287 | 1,3,40254,640,3600,1042,436,18 288 | 1,3,7149,2247,1242,1619,1226,128 289 | 1,3,15354,2102,2828,8366,386,1027 290 | 1,3,16260,594,1296,848,445,258 291 | 1,3,42786,286,471,1388,32,22 292 | 1,3,2708,2160,2642,502,965,1522 293 | 1,3,6022,3354,3261,2507,212,686 294 | 1,3,2838,3086,4329,3838,825,1060 295 | 2,2,3996,11103,12469,902,5952,741 296 | 1,2,21273,2013,6550,909,811,1854 297 | 2,2,7588,1897,5234,417,2208,254 298 | 1,2,19087,1304,3643,3045,710,898 299 | 2,2,8090,3199,6986,1455,3712,531 300 | 2,2,6758,4560,9965,934,4538,1037 301 | 1,2,444,879,2060,264,290,259 302 | 2,2,16448,6243,6360,824,2662,2005 303 | 2,2,5283,13316,20399,1809,8752,172 304 | 2,2,2886,5302,9785,364,6236,555 305 | 2,2,2599,3688,13829,492,10069,59 306 | 2,2,161,7460,24773,617,11783,2410 307 | 2,2,243,12939,8852,799,3909,211 308 | 
2,2,6468,12867,21570,1840,7558,1543 309 | 1,2,17327,2374,2842,1149,351,925 310 | 1,2,6987,1020,3007,416,257,656 311 | 2,2,918,20655,13567,1465,6846,806 312 | 1,2,7034,1492,2405,12569,299,1117 313 | 1,2,29635,2335,8280,3046,371,117 314 | 2,2,2137,3737,19172,1274,17120,142 315 | 1,2,9784,925,2405,4447,183,297 316 | 1,2,10617,1795,7647,1483,857,1233 317 | 2,2,1479,14982,11924,662,3891,3508 318 | 1,2,7127,1375,2201,2679,83,1059 319 | 1,2,1182,3088,6114,978,821,1637 320 | 1,2,11800,2713,3558,2121,706,51 321 | 2,2,9759,25071,17645,1128,12408,1625 322 | 1,2,1774,3696,2280,514,275,834 323 | 1,2,9155,1897,5167,2714,228,1113 324 | 1,2,15881,713,3315,3703,1470,229 325 | 1,2,13360,944,11593,915,1679,573 326 | 1,2,25977,3587,2464,2369,140,1092 327 | 1,2,32717,16784,13626,60869,1272,5609 328 | 1,2,4414,1610,1431,3498,387,834 329 | 1,2,542,899,1664,414,88,522 330 | 1,2,16933,2209,3389,7849,210,1534 331 | 1,2,5113,1486,4583,5127,492,739 332 | 1,2,9790,1786,5109,3570,182,1043 333 | 2,2,11223,14881,26839,1234,9606,1102 334 | 1,2,22321,3216,1447,2208,178,2602 335 | 2,2,8565,4980,67298,131,38102,1215 336 | 2,2,16823,928,2743,11559,332,3486 337 | 2,2,27082,6817,10790,1365,4111,2139 338 | 1,2,13970,1511,1330,650,146,778 339 | 1,2,9351,1347,2611,8170,442,868 340 | 1,2,3,333,7021,15601,15,550 341 | 1,2,2617,1188,5332,9584,573,1942 342 | 2,3,381,4025,9670,388,7271,1371 343 | 2,3,2320,5763,11238,767,5162,2158 344 | 1,3,255,5758,5923,349,4595,1328 345 | 2,3,1689,6964,26316,1456,15469,37 346 | 1,3,3043,1172,1763,2234,217,379 347 | 1,3,1198,2602,8335,402,3843,303 348 | 2,3,2771,6939,15541,2693,6600,1115 349 | 2,3,27380,7184,12311,2809,4621,1022 350 | 1,3,3428,2380,2028,1341,1184,665 351 | 2,3,5981,14641,20521,2005,12218,445 352 | 1,3,3521,1099,1997,1796,173,995 353 | 2,3,1210,10044,22294,1741,12638,3137 354 | 1,3,608,1106,1533,830,90,195 355 | 2,3,117,6264,21203,228,8682,1111 356 | 1,3,14039,7393,2548,6386,1333,2341 357 | 1,3,190,727,2012,245,184,127 358 | 1,3,22686,134,218,3157,9,548 359 | 2,3,37,1275,22272,137,6747,110 360 | 1,3,759,18664,1660,6114,536,4100 361 | 1,3,796,5878,2109,340,232,776 362 | 1,3,19746,2872,2006,2601,468,503 363 | 1,3,4734,607,864,1206,159,405 364 | 1,3,2121,1601,2453,560,179,712 365 | 1,3,4627,997,4438,191,1335,314 366 | 1,3,2615,873,1524,1103,514,468 367 | 2,3,4692,6128,8025,1619,4515,3105 368 | 1,3,9561,2217,1664,1173,222,447 369 | 1,3,3477,894,534,1457,252,342 370 | 1,3,22335,1196,2406,2046,101,558 371 | 1,3,6211,337,683,1089,41,296 372 | 2,3,39679,3944,4955,1364,523,2235 373 | 1,3,20105,1887,1939,8164,716,790 374 | 1,3,3884,3801,1641,876,397,4829 375 | 2,3,15076,6257,7398,1504,1916,3113 376 | 1,3,6338,2256,1668,1492,311,686 377 | 1,3,5841,1450,1162,597,476,70 378 | 2,3,3136,8630,13586,5641,4666,1426 379 | 1,3,38793,3154,2648,1034,96,1242 380 | 1,3,3225,3294,1902,282,68,1114 381 | 2,3,4048,5164,10391,130,813,179 382 | 1,3,28257,944,2146,3881,600,270 383 | 1,3,17770,4591,1617,9927,246,532 384 | 1,3,34454,7435,8469,2540,1711,2893 385 | 1,3,1821,1364,3450,4006,397,361 386 | 1,3,10683,21858,15400,3635,282,5120 387 | 1,3,11635,922,1614,2583,192,1068 388 | 1,3,1206,3620,2857,1945,353,967 389 | 1,3,20918,1916,1573,1960,231,961 390 | 1,3,9785,848,1172,1677,200,406 391 | 1,3,9385,1530,1422,3019,227,684 392 | 1,3,3352,1181,1328,5502,311,1000 393 | 1,3,2647,2761,2313,907,95,1827 394 | 1,3,518,4180,3600,659,122,654 395 | 1,3,23632,6730,3842,8620,385,819 396 | 1,3,12377,865,3204,1398,149,452 397 | 1,3,9602,1316,1263,2921,841,290 398 | 2,3,4515,11991,9345,2644,3378,2213 399 | 
1,3,11535,1666,1428,6838,64,743 400 | 1,3,11442,1032,582,5390,74,247 401 | 1,3,9612,577,935,1601,469,375 402 | 1,3,4446,906,1238,3576,153,1014 403 | 1,3,27167,2801,2128,13223,92,1902 404 | 1,3,26539,4753,5091,220,10,340 405 | 1,3,25606,11006,4604,127,632,288 406 | 1,3,18073,4613,3444,4324,914,715 407 | 1,3,6884,1046,1167,2069,593,378 408 | 1,3,25066,5010,5026,9806,1092,960 409 | 2,3,7362,12844,18683,2854,7883,553 410 | 2,3,8257,3880,6407,1646,2730,344 411 | 1,3,8708,3634,6100,2349,2123,5137 412 | 1,3,6633,2096,4563,1389,1860,1892 413 | 1,3,2126,3289,3281,1535,235,4365 414 | 1,3,97,3605,12400,98,2970,62 415 | 1,3,4983,4859,6633,17866,912,2435 416 | 1,3,5969,1990,3417,5679,1135,290 417 | 2,3,7842,6046,8552,1691,3540,1874 418 | 2,3,4389,10940,10908,848,6728,993 419 | 1,3,5065,5499,11055,364,3485,1063 420 | 2,3,660,8494,18622,133,6740,776 421 | 1,3,8861,3783,2223,633,1580,1521 422 | 1,3,4456,5266,13227,25,6818,1393 423 | 2,3,17063,4847,9053,1031,3415,1784 424 | 1,3,26400,1377,4172,830,948,1218 425 | 2,3,17565,3686,4657,1059,1803,668 426 | 2,3,16980,2884,12232,874,3213,249 427 | 1,3,11243,2408,2593,15348,108,1886 428 | 1,3,13134,9347,14316,3141,5079,1894 429 | 1,3,31012,16687,5429,15082,439,1163 430 | 1,3,3047,5970,4910,2198,850,317 431 | 1,3,8607,1750,3580,47,84,2501 432 | 1,3,3097,4230,16483,575,241,2080 433 | 1,3,8533,5506,5160,13486,1377,1498 434 | 1,3,21117,1162,4754,269,1328,395 435 | 1,3,1982,3218,1493,1541,356,1449 436 | 1,3,16731,3922,7994,688,2371,838 437 | 1,3,29703,12051,16027,13135,182,2204 438 | 1,3,39228,1431,764,4510,93,2346 439 | 2,3,14531,15488,30243,437,14841,1867 440 | 1,3,10290,1981,2232,1038,168,2125 441 | 1,3,2787,1698,2510,65,477,52 442 | -------------------------------------------------------------------------------- /boston_housing/boston_housing.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 机器学习工程师纳米学位\n", 8 | "## 模型评价与验证\n", 9 | "## 项目 1: 预测波士顿房价\n", 10 | "\n", 11 | "\n", 12 | "欢迎来到机器学习的预测波士顿房价项目!在此文件中,有些示例代码已经提供给你,但你还需要实现更多的功能来让项目成功运行。除非有明确要求,你无须修改任何已给出的代码。以**编程练习**开始的标题表示接下来的内容中有需要你必须实现的功能。每一部分都会有详细的指导,需要实现的部分也会在注释中以**TODO**标出。请仔细阅读所有的提示!\n", 13 | "\n", 14 | "除了实现代码外,你还**必须**回答一些与项目和实现有关的问题。每一个需要你回答的问题都会以**'问题 X'**为标题。请仔细阅读每个问题,并且在问题后的**'回答'**文字框中写出完整的答案。你的项目将会根据你对问题的回答和撰写代码所实现的功能来进行评分。\n", 15 | "\n", 16 | ">**提示:**Code 和 Markdown 区域可通过 **Shift + Enter** 快捷键运行。此外,Markdown可以通过双击进入编辑模式。" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "---\n", 24 | "## 第一步. 导入数据\n", 25 | "在这个项目中,你将利用马萨诸塞州波士顿郊区的房屋信息数据训练和测试一个模型,并对模型的性能和预测能力进行测试。通过该数据训练后的好的模型可以被用来对房屋做特定预测---尤其是对房屋的价值。对于房地产经纪等人的日常工作来说,这样的预测模型被证明非常有价值。\n", 26 | "\n", 27 | "此项目的数据集来自[UCI机器学习知识库(数据集已下线)](https://archive.ics.uci.edu/ml/datasets.html)。波士顿房屋这些数据于1978年开始统计,共506个数据点,涵盖了麻省波士顿不同郊区房屋14种特征的信息。本项目对原始数据集做了以下处理:\n", 28 | "- 有16个`'MEDV'` 值为50.0的数据点被移除。 这很可能是由于这些数据点包含**遗失**或**看不到的值**。\n", 29 | "- 有1个数据点的 `'RM'` 值为8.78. 
这是一个异常值,已经被移除。\n", 30 | "- 对于本项目,房屋的`'RM'`, `'LSTAT'`,`'PTRATIO'`以及`'MEDV'`特征是必要的,其余不相关特征已经被移除。\n", 31 | "- `'MEDV'`特征的值已经过必要的数学转换,可以反映35年来市场的通货膨胀效应。\n", 32 | "\n", 33 | "运行下面区域的代码以载入波士顿房屋数据集,以及一些此项目所需的 Python 库。如果成功返回数据集的大小,表示数据集已载入成功。" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": { 40 | "collapsed": true 41 | }, 42 | "outputs": [], 43 | "source": [ 44 | "# Import libraries necessary for this project\n", 45 | "import numpy as np\n", 46 | "import pandas as pd\n", 47 | "from sklearn.model_selection import ShuffleSplit\n", 48 | "\n", 49 | "# Import supplementary visualizations code visuals.py\n", 50 | "import visuals as vs\n", 51 | "\n", 52 | "# Pretty display for notebooks\n", 53 | "%matplotlib inline\n", 54 | "\n", 55 | "# Load the Boston housing dataset\n", 56 | "data = pd.read_csv('housing.csv')\n", 57 | "prices = data['MEDV']\n", 58 | "features = data.drop('MEDV', axis = 1)\n", 59 | " \n", 60 | "# Success\n", 61 | "print(\"Boston housing dataset has {} data points with {} variables each.\".format(*data.shape))" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "---\n", 69 | "## 第二步. 分析数据\n", 70 | "在项目的第一个部分,你会对波士顿房地产数据进行初步的观察并给出你的分析。通过对数据的探索来熟悉数据可以让你更好地理解和解释你的结果。\n", 71 | "\n", 72 | "由于这个项目的最终目标是建立一个预测房屋价值的模型,我们需要将数据集分为**特征(features)**和**目标变量(target variable)**。\n", 73 | "- **特征** `'RM'`, `'LSTAT'`,和 `'PTRATIO'`,给我们提供了每个数据点的数量相关的信息。\n", 74 | "- **目标变量**:` 'MEDV'`,是我们希望预测的变量。\n", 75 | "\n", 76 | "他们分别被存在 `features` 和 `prices` 两个变量名中。" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "### 编程练习 1:基础统计运算\n", 84 | "你的第一个编程练习是计算有关波士顿房价的描述统计数据。我们已为你导入了 ` NumPy `,你需要使用这个库来执行必要的计算。这些统计数据对于分析模型的预测结果非常重要的。\n", 85 | "在下面的代码中,你要做的是:\n", 86 | "- 计算 `prices` 中的 `'MEDV'` 的最小值、最大值、均值、中值和标准差;\n", 87 | "- 将运算结果储存在相应的变量中。" 88 | ] 89 | }, 90 | { 91 | "cell_type": "code", 92 | "execution_count": null, 93 | "metadata": { 94 | "collapsed": true 95 | }, 96 | "outputs": [], 97 | "source": [ 98 | "# TODO: Minimum price of the data\n", 99 | "minimum_price = None\n", 100 | "\n", 101 | "# TODO: Maximum price of the data\n", 102 | "maximum_price = None\n", 103 | "\n", 104 | "# TODO: Mean price of the data\n", 105 | "mean_price = None\n", 106 | "\n", 107 | "# TODO: Median price of the data\n", 108 | "median_price = None\n", 109 | "\n", 110 | "# TODO: Standard deviation of prices of the data\n", 111 | "std_price = None\n", 112 | "\n", 113 | "# Show the calculated statistics\n", 114 | "print(\"Statistics for Boston housing dataset:\\n\")\n", 115 | "print(\"Minimum price: ${:.2f}\".format(minimum_price)) \n", 116 | "print(\"Maximum price: ${:.2f}\".format(maximum_price))\n", 117 | "print(\"Mean price: ${:.2f}\".format(mean_price))\n", 118 | "print(\"Median price ${:.2f}\".format(median_price))\n", 119 | "print(\"Standard deviation of prices: ${:.2f}\".format(std_price))" 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "### 问题 1 - 特征观察\n", 127 | "\n", 128 | "如前文所述,本项目中我们关注的是其中三个值:`'RM'`、`'LSTAT'` 和`'PTRATIO'`,对每一个数据点:\n", 129 | "- `'RM'` 是该地区中每个房屋的平均房间数量;\n", 130 | "- `'LSTAT'` 是指该地区有多少百分比的业主属于是低收入阶层(有工作但收入微薄);\n", 131 | "- `'PTRATIO'` 是该地区的中学和小学里,学生和老师的数目比(`学生/老师`)。\n", 132 | "\n", 133 | "_凭直觉,上述三个特征中对每一个来说,你认为增大该特征的数值,`'MEDV'`的值会是**增大**还是**减小**呢?每一个答案都需要你给出理由。_\n", 134 | "\n", 135 | "**提示:**你预期一个`'RM'` 值是6的房屋跟`'RM'` 值是7的房屋相比,价值更高还是更低呢?" 
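For reference, a minimal sketch of how the statistics exercise above could be completed, assuming `prices` is the `'MEDV'` Series produced by the data-loading cell. Note that `np.std` returns the population standard deviation, while `prices.std()` would return the sample version.

```python
# Possible completion of the statistics TODOs (sketch only, not the official solution).
# Assumes `prices` was created in the data-loading cell above.
import numpy as np

minimum_price = np.min(prices)
maximum_price = np.max(prices)
mean_price    = np.mean(prices)
median_price  = np.median(prices)
std_price     = np.std(prices)

print("Minimum price: ${:.2f}".format(minimum_price))
print("Maximum price: ${:.2f}".format(maximum_price))
print("Mean price: ${:.2f}".format(mean_price))
print("Median price: ${:.2f}".format(median_price))
print("Standard deviation of prices: ${:.2f}".format(std_price))
```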
136 | ] 137 | }, 138 | { 139 | "cell_type": "markdown", 140 | "metadata": {}, 141 | "source": [ 142 | "### 问题 1 - 回答:" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "---\n", 150 | "## 第三步. 建立模型\n", 151 | "在项目的第三步中,你需要了解必要的工具和技巧来让你的模型进行预测。用这些工具和技巧对每一个模型的表现做精确的衡量可以极大地增强你预测的信心。" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "### 编程练习2:定义衡量标准\n", 159 | "如果不能对模型的训练和测试的表现进行量化地评估,我们就很难衡量模型的好坏。通常我们会定义一些衡量标准,这些标准可以通过对某些误差或者拟合程度的计算来得到。在这个项目中,你将通过运算[决定系数](https://en.wikipedia.org/wiki/Coefficient_of_determination) $R^2$ 来量化模型的表现。模型的决定系数是回归分析中十分常用的统计信息,经常被当作衡量模型预测能力好坏的标准。\n", 160 | "\n", 161 | "$R^2$ 的数值范围从0至1,表示**目标变量**的预测值和实际值之间的相关程度平方的百分比。一个模型的 $R^2$ 值为0还不如直接用**平均值**来预测效果好;而一个 $R^2$ 值为1的模型则可以对目标变量进行完美的预测。从0至1之间的数值,则表示该模型中目标变量中有百分之多少能够用**特征**来解释。模型也可能出现负值的 $R^2$,这种情况下模型所做预测有时会比直接计算目标变量的平均值差很多。\n", 162 | "\n", 163 | "在下方代码的 `performance_metric` 函数中,你要实现:\n", 164 | "- 使用 `sklearn.metrics` 中的 [`r2_score`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html) 来计算 `y_true` 和 `y_predict` 的 $R^2$ 值,作为对其表现的评判。\n", 165 | "- 将他们的表现评分储存到 `score` 变量中。" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": null, 171 | "metadata": { 172 | "collapsed": true 173 | }, 174 | "outputs": [], 175 | "source": [ 176 | "# TODO: Import 'r2_score'\n", 177 | "\n", 178 | "def performance_metric(y_true, y_predict):\n", 179 | " \"\"\" Calculates and returns the performance score between \n", 180 | " true and predicted values based on the metric chosen. \"\"\"\n", 181 | " \n", 182 | " # TODO: Calculate the performance score between 'y_true' and 'y_predict'\n", 183 | " score = None\n", 184 | " \n", 185 | " # Return the score\n", 186 | " return score" 187 | ] 188 | }, 189 | { 190 | "cell_type": "markdown", 191 | "metadata": {}, 192 | "source": [ 193 | "### 问题 2 - 拟合程度\n", 194 | "\n", 195 | "假设一个数据集有五个数据且一个模型做出下列目标变量的预测:\n", 196 | "\n", 197 | "| 真实数值 | 预测数值 |\n", 198 | "| :-------------: | :--------: |\n", 199 | "| 3.0 | 2.5 |\n", 200 | "| -0.5 | 0.0 |\n", 201 | "| 2.0 | 2.1 |\n", 202 | "| 7.0 | 7.8 |\n", 203 | "| 4.2 | 5.3 |\n", 204 | "*你觉得这个模型已成功地描述了目标变量的变化吗?如果成功,请解释为什么,如果没有,也请给出原因。* \n", 205 | "\n", 206 | "**提示1**:运行下方的代码,使用 `performance_metric` 函数来计算 `y_true` 和 `y_predict` 的决定系数。\n", 207 | "\n", 208 | "**提示2**:$R^2$ 分数是指可以从自变量中预测的因变量的方差比例。 换一种说法:\n", 209 | "\n", 210 | "* $R^2$ 为0意味着因变量不能从自变量预测。\n", 211 | "* $R^2$ 为1意味着可以从自变量预测因变量。\n", 212 | "* $R^2$ 在0到1之间表示因变量可预测的程度。\n", 213 | "* $R^2$ 为0.40意味着 Y 中40%的方差可以从 X 预测。" 214 | ] 215 | }, 216 | { 217 | "cell_type": "code", 218 | "execution_count": null, 219 | "metadata": { 220 | "collapsed": true 221 | }, 222 | "outputs": [], 223 | "source": [ 224 | "# Calculate the performance of this model\n", 225 | "score = performance_metric([3, -0.5, 2, 7, 4.2], [2.5, 0.0, 2.1, 7.8, 5.3])\n", 226 | "print(\"Model has a coefficient of determination, R^2, of {:.3f}.\".format(score))" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "### 问题 2 - 回答:" 234 | ] 235 | }, 236 | { 237 | "cell_type": "markdown", 238 | "metadata": {}, 239 | "source": [ 240 | "### 编程练习 3: 数据分割与重排\n", 241 | "接下来,你需要把波士顿房屋数据集分成训练和测试两个子集。通常在这个过程中,数据也会被重排列,以消除数据集中由于顺序而产生的偏差。\n", 242 | "在下面的代码中,你需要\n", 243 | "\n", 244 | "* 使用 `sklearn.model_selection` 中的 `train_test_split`, 将 `features` 和 `prices` 的数据都分成用于训练的数据子集和用于测试的数据子集。\n", 245 | " - 分割比例为:80%的数据用于训练,20%用于测试;\n", 246 | " - 选定一个数值以设定 `train_test_split` 中的 
`random_state` ,这会确保结果的一致性;\n", 247 | "* 将分割后的训练集与测试集分配给 `X_train`, `X_test`, `y_train` 和 `y_test`。" 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": null, 253 | "metadata": { 254 | "collapsed": true 255 | }, 256 | "outputs": [], 257 | "source": [ 258 | "# TODO: Import 'train_test_split'\n", 259 | "\n", 260 | "# TODO: Shuffle and split the data into training and testing subsets\n", 261 | "X_train, X_test, y_train, y_test = (None, None, None, None)\n", 262 | "\n", 263 | "# Success\n", 264 | "print(\"Training and testing split was successful.\")" 265 | ] 266 | }, 267 | { 268 | "cell_type": "markdown", 269 | "metadata": {}, 270 | "source": [ 271 | "### 问题 3 - 训练及测试\n", 272 | "*将数据集按一定比例分为训练用的数据集和测试用的数据集对学习算法有什么好处?*\n", 273 | "\n", 274 | "*如果用模型已经见过的数据,例如部分训练集数据进行测试,又有什么坏处?*\n", 275 | "\n", 276 | "**提示:** 如果没有数据来对模型进行测试,会出现什么问题?" 277 | ] 278 | }, 279 | { 280 | "cell_type": "markdown", 281 | "metadata": {}, 282 | "source": [ 283 | "### 问题 3 - 回答:" 284 | ] 285 | }, 286 | { 287 | "cell_type": "markdown", 288 | "metadata": {}, 289 | "source": [ 290 | "---\n", 291 | "## 第四步. 分析模型的表现\n", 292 | "在项目的第四步,我们来看一下不同参数下,模型在训练集和验证集上的表现。这里,我们专注于一个特定的算法(带剪枝的决策树,但这并不是这个项目的重点),和这个算法的一个参数 `'max_depth'`。用全部训练集训练,选择不同`'max_depth'` 参数,观察这一参数的变化如何影响模型的表现。画出模型的表现来对于分析过程十分有益。" 293 | ] 294 | }, 295 | { 296 | "cell_type": "markdown", 297 | "metadata": {}, 298 | "source": [ 299 | "### 学习曲线\n", 300 | "下方区域内的代码会输出四幅图像,它们是一个决策树模型在不同最大深度下的表现。每一条曲线都直观得显示了随着训练数据量的增加,模型学习曲线的在训练集评分和验证集评分的变化,评分使用决定系数 $R^2$。曲线的阴影区域代表的是该曲线的不确定性(用标准差衡量)。\n", 301 | "\n", 302 | "运行下方区域中的代码,并利用输出的图形回答下面的问题。" 303 | ] 304 | }, 305 | { 306 | "cell_type": "code", 307 | "execution_count": null, 308 | "metadata": { 309 | "collapsed": true, 310 | "scrolled": false 311 | }, 312 | "outputs": [], 313 | "source": [ 314 | "# Produce learning curves for varying training set sizes and maximum depths\n", 315 | "vs.ModelLearning(features, prices)" 316 | ] 317 | }, 318 | { 319 | "cell_type": "markdown", 320 | "metadata": {}, 321 | "source": [ 322 | "### 问题 4 - 学习曲线\n", 323 | "* 选择上述图像中的其中一个,并给出其最大深度。\n", 324 | "* 随着训练数据量的增加,训练集曲线的评分有怎样的变化?验证集曲线呢?\n", 325 | "* 如果有更多的训练数据,是否能有效提升模型的表现呢?\n", 326 | "\n", 327 | "**提示:**学习曲线的评分是否最终会收敛到特定的值?一般来说,你拥有的数据越多,模型表现力越好。但是,如果你的训练和测试曲线以高于基准阈值的分数收敛,这是否有必要?基于训练和测试曲线已经收敛的前提下,思考添加更多训练点的优缺点。" 328 | ] 329 | }, 330 | { 331 | "cell_type": "markdown", 332 | "metadata": {}, 333 | "source": [ 334 | "### 问题 4 - 回答:" 335 | ] 336 | }, 337 | { 338 | "cell_type": "markdown", 339 | "metadata": {}, 340 | "source": [ 341 | "### 复杂度曲线\n", 342 | "下列代码内的区域会输出一幅图像,它展示了一个已经经过训练和验证的决策树模型在不同最大深度条件下的表现。这个图形将包含两条曲线,一个是训练集的变化,一个是验证集的变化。跟**学习曲线**相似,阴影区域代表该曲线的不确定性,模型训练和测试部分的评分都用的 `performance_metric` 函数。\n", 343 | "\n", 344 | "**运行下方区域中的代码,并利用输出的图形并回答下面的问题5与问题6。**" 345 | ] 346 | }, 347 | { 348 | "cell_type": "code", 349 | "execution_count": null, 350 | "metadata": { 351 | "collapsed": true 352 | }, 353 | "outputs": [], 354 | "source": [ 355 | "vs.ModelComplexity(X_train, y_train)" 356 | ] 357 | }, 358 | { 359 | "cell_type": "markdown", 360 | "metadata": {}, 361 | "source": [ 362 | "### 问题 5 - 偏差(bias)与方差(variance)之间的权衡取舍\n", 363 | "* 当模型以最大深度 1训练时,模型的预测是出现很大的偏差还是出现了很大的方差?\n", 364 | "* 当模型以最大深度10训练时,情形又如何呢?\n", 365 | "* 图形中的哪些特征能够支持你的结论?\n", 366 | " \n", 367 | "**提示:** 高偏差表示欠拟合(模型过于简单),而高方差表示过拟合(模型过于复杂,以至于无法泛化)。考虑哪种模型(深度1或10)对应着上述的情况,并权衡偏差与方差。" 368 | ] 369 | }, 370 | { 371 | "cell_type": "markdown", 372 | "metadata": {}, 373 | "source": [ 374 | "### 问题 5 - 回答:" 375 | ] 376 | }, 377 | { 378 | "cell_type": "markdown", 379 | "metadata": {}, 
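A hedged sketch of Exercises 2 and 3 above (the R² performance metric and the 80/20 shuffle-and-split). The `random_state` value is an arbitrary choice, and `features`/`prices` are assumed to exist from the data-loading cell.

```python
# Exercise 2 (sketch): R^2 as the performance metric.
from sklearn.metrics import r2_score

def performance_metric(y_true, y_predict):
    """Return the R^2 score between true and predicted values."""
    return r2_score(y_true, y_predict)

# Exercise 3 (sketch): shuffle and split into 80% training / 20% testing data.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features, prices, test_size=0.2, random_state=42)  # random_state chosen arbitrarily
```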
380 | "source": [ 381 | "### 问题 6- 最优模型的猜测\n", 382 | "* 结合问题 5 中的图,你认为最大深度是多少的模型能够最好地对未见过的数据进行预测?\n", 383 | "* 你得出这个答案的依据是什么?\n", 384 | "\n", 385 | "**提示**:查看问题5上方的图表,并查看模型在不同 `depth`下的验证分数。随着深度的增加模型的表现力会变得更好吗?我们在什么情况下获得最佳验证分数而不会使我们的模型过度复杂?请记住,奥卡姆剃刀:“在竞争性假设中,应该选择假设最少的那一个。”" 386 | ] 387 | }, 388 | { 389 | "cell_type": "markdown", 390 | "metadata": {}, 391 | "source": [ 392 | "### 问题 6 - 回答:" 393 | ] 394 | }, 395 | { 396 | "cell_type": "markdown", 397 | "metadata": {}, 398 | "source": [ 399 | "---\n", 400 | "## 第五步. 评估模型的表现\n", 401 | "在项目的最后一节中,你将构建一个模型,并使用 `fit_model` 中的优化模型去预测客户特征集。" 402 | ] 403 | }, 404 | { 405 | "cell_type": "markdown", 406 | "metadata": {}, 407 | "source": [ 408 | "### 问题 7- 网格搜索(Grid Search)\n", 409 | "* 什么是网格搜索法?\n", 410 | "* 如何用它来优化模型?\n", 411 | "\n", 412 | "**提示**:在解释网格搜索算法时,首先要理解我们为什么使用网格搜索算法,以及我们使用它的最终目的是什么。为了使你的回答更具有说服力,你还可以给出一个模型中可以使用此方法进行优化参数的示例。" 413 | ] 414 | }, 415 | { 416 | "cell_type": "markdown", 417 | "metadata": {}, 418 | "source": [ 419 | "### 问题 7 - 回答:" 420 | ] 421 | }, 422 | { 423 | "cell_type": "markdown", 424 | "metadata": {}, 425 | "source": [ 426 | "### 问题 8 - 交叉验证\n", 427 | "- 什么是K折交叉验证法(k-fold cross-validation)?\n", 428 | "- [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) 是如何结合交叉验证来完成对最佳参数组合的选择的?\n", 429 | "- [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) 中的`'cv_results_'`属性能告诉我们什么?\n", 430 | "- 网格搜索为什么要使用K折交叉验证?K折交叉验证能够避免什么问题?\n", 431 | "\n", 432 | "**提示**:在解释k-fold交叉验证时,一定要理解'k'是什么,和数据集是如何分成不同的部分来进行训练和测试的,以及基于'k'值运行的次数。\n", 433 | "在考虑k-fold交叉验证如何帮助网格搜索时,你可以使用特定的数据子集来进行训练与测试有什么缺点,以及K折交叉验证是如何帮助缓解这个问题。" 434 | ] 435 | }, 436 | { 437 | "cell_type": "markdown", 438 | "metadata": {}, 439 | "source": [ 440 | "### 问题 8 - 回答:" 441 | ] 442 | }, 443 | { 444 | "cell_type": "markdown", 445 | "metadata": {}, 446 | "source": [ 447 | "### 编程练习 4:拟合模型\n", 448 | "在这个练习中,你将需要将所学到的内容整合,使用**决策树算法**训练一个模型。为了得出的是一个最优模型,你需要使用网格搜索法训练模型,以找到最佳的 `'max_depth'` 参数。你可以把`'max_depth'` 参数理解为决策树算法在做出预测前,允许其对数据提出问题的数量。决策树是**监督学习算法**中的一种。\n", 449 | "\n", 450 | "另外,你会发现在实现的过程中是使用`ShuffleSplit()`作为交叉验证的另一种形式(参见'cv_sets'变量)。虽然它不是你在问题8中描述的K-fold交叉验证方法,但它同样非常有用!下面的`ShuffleSplit()`实现将创建10个('n_splits')混洗集合,并且对于每个混洗集,数据的20%('test_size')将被用作验证集合。当您在实现代码的时候,请思考一下它与 `K-fold cross-validation` 的不同与相似之处。\n", 451 | "\n", 452 | "请注意,`ShuffleSplit` 在 `Scikit-Learn` 版本0.17和0.18中有不同的参数。对于下面代码单元格中的 `fit_model` 函数,您需要实现以下内容:\n", 453 | "\n", 454 | "1. **定义 `'regressor'` 变量**: 使用 `sklearn.tree` 中的 [`DecisionTreeRegressor`](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) 创建一个决策树的回归函数;\n", 455 | "2. **定义 `'params'` 变量**: 为 `'max_depth'` 参数创造一个字典,它的值是从1至10的数组;\n", 456 | "3. **定义 `'scoring_fnc'` 变量**: 使用 `sklearn.metrics` 中的 [`make_scorer`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html) 创建一个评分函数。将 `‘performance_metric’` 作为参数传至这个函数中;\n", 457 | "4. 
**定义 `'grid'` 变量**: 使用 `sklearn.model_selection` 中的 [`GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) 创建一个网格搜索对象;将变量`'regressor'`, `'params'`, `'scoring_fnc'`和 `'cv_sets'` 作为参数传至这个对象构造函数中;\n", 458 | "\n", 459 | " \n", 460 | "如果你对 Python 函数的默认参数定义和传递不熟悉,可以参考这个MIT课程的[视频](http://cn-static.udacity.com/mlnd/videos/MIT600XXT114-V004200_DTH.mp4)。" 461 | ] 462 | }, 463 | { 464 | "cell_type": "code", 465 | "execution_count": null, 466 | "metadata": { 467 | "collapsed": true 468 | }, 469 | "outputs": [], 470 | "source": [ 471 | "# TODO: Import 'make_scorer', 'DecisionTreeRegressor', and 'GridSearchCV'\n", 472 | "\n", 473 | "def fit_model(X, y):\n", 474 | " \"\"\" Performs grid search over the 'max_depth' parameter for a \n", 475 | " decision tree regressor trained on the input data [X, y]. \"\"\"\n", 476 | " \n", 477 | " # Create cross-validation sets from the training data\n", 478 | " # sklearn version 0.18: ShuffleSplit(n_splits=10, test_size=0.1, train_size=None, random_state=None)\n", 479 | " # sklearn versiin 0.17: ShuffleSplit(n, n_iter=10, test_size=0.1, train_size=None, random_state=None)\n", 480 | " cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=42)\n", 481 | " \n", 482 | " # TODO: Create a decision tree regressor object\n", 483 | " regressor = None\n", 484 | "\n", 485 | " # TODO: Create a dictionary for the parameter 'max_depth' with a range from 1 to 10\n", 486 | " params = {}\n", 487 | "\n", 488 | " # TODO: Transform 'performance_metric' into a scoring function using 'make_scorer' \n", 489 | " scoring_fnc = None\n", 490 | "\n", 491 | " # TODO: Create the grid search cv object --> GridSearchCV()\n", 492 | " # Make sure to include the right parameters in the object:\n", 493 | " # (estimator, param_grid, scoring, cv) which have values 'regressor', 'params', 'scoring_fnc', and 'cv_sets' respectively.\n", 494 | " grid = None\n", 495 | "\n", 496 | " # Fit the grid search object to the data to compute the optimal model\n", 497 | " grid = grid.fit(X, y)\n", 498 | "\n", 499 | " # Return the optimal model after fitting the data\n", 500 | " return grid.best_estimator_" 501 | ] 502 | }, 503 | { 504 | "cell_type": "markdown", 505 | "metadata": {}, 506 | "source": [ 507 | "## 第六步. 
做出预测\n", 508 | "当我们用数据训练出一个模型,它现在就可用于对新的数据进行预测。在决策树回归函数中,模型已经学会对新输入的数据*提问*,并返回对**目标变量**的预测值。你可以用这个预测来获取数据未知目标变量的信息,这些数据必须是不包含在训练数据之内的。" 509 | ] 510 | }, 511 | { 512 | "cell_type": "markdown", 513 | "metadata": {}, 514 | "source": [ 515 | "### 问题 9 - 最优模型\n", 516 | "*最优模型的最大深度(maximum depth)是多少?此答案与你在**问题 6**所做的猜测是否相同?*\n", 517 | "\n", 518 | "运行下方区域内的代码,将决策树回归函数代入训练数据的集合,以得到最优化的模型。" 519 | ] 520 | }, 521 | { 522 | "cell_type": "code", 523 | "execution_count": null, 524 | "metadata": { 525 | "collapsed": true, 526 | "scrolled": true 527 | }, 528 | "outputs": [], 529 | "source": [ 530 | "# Fit the training data to the model using grid search\n", 531 | "reg = fit_model(X_train, y_train)\n", 532 | "\n", 533 | "# Produce the value for 'max_depth'\n", 534 | "print(\"Parameter 'max_depth' is {} for the optimal model.\".format(reg.get_params()['max_depth']))" 535 | ] 536 | }, 537 | { 538 | "cell_type": "markdown", 539 | "metadata": {}, 540 | "source": [ 541 | "### 问题 9 - 回答:\n" 542 | ] 543 | }, 544 | { 545 | "cell_type": "markdown", 546 | "metadata": {}, 547 | "source": [ 548 | "### 问题 10 - 预测销售价格\n", 549 | "想像你是一个在波士顿地区的房屋经纪人,并期待使用此模型以帮助你的客户评估他们想出售的房屋。你已经从你的三个客户收集到以下的资讯:\n", 550 | "\n", 551 | "| 特征 | 客戶 1 | 客戶 2 | 客戶 3 |\n", 552 | "| :---: | :---: | :---: | :---: |\n", 553 | "| 房屋内房间总数 | 5 间房间 | 4 间房间 | 8 间房间 |\n", 554 | "| 社区贫困指数(%被认为是贫困阶层) | 17% | 32% | 3% |\n", 555 | "| 邻近学校的学生-老师比例 | 15:1 | 22:1 | 12:1 |\n", 556 | "\n", 557 | "* 你会建议每位客户的房屋销售的价格为多少?\n", 558 | "* 从房屋特征的数值判断,这样的价格合理吗?为什么?\n", 559 | "\n", 560 | "**提示:**用你在**分析数据**部分计算出来的统计信息来帮助你证明你的答案。\n", 561 | "\n", 562 | "运行下列的代码区域,使用你优化的模型来为每位客户的房屋价值做出预测。" 563 | ] 564 | }, 565 | { 566 | "cell_type": "code", 567 | "execution_count": null, 568 | "metadata": { 569 | "collapsed": true 570 | }, 571 | "outputs": [], 572 | "source": [ 573 | "# Produce a matrix for client data\n", 574 | "client_data = [[5, 17, 15], # Client 1\n", 575 | " [4, 32, 22], # Client 2\n", 576 | " [8, 3, 12]] # Client 3\n", 577 | "\n", 578 | "# Show predictions\n", 579 | "for i, price in enumerate(reg.predict(client_data)):\n", 580 | " print(\"Predicted selling price for Client {}'s home: ${:,.2f}\".format(i+1, price))" 581 | ] 582 | }, 583 | { 584 | "cell_type": "markdown", 585 | "metadata": {}, 586 | "source": [ 587 | "### 问题 10 - 回答:" 588 | ] 589 | }, 590 | { 591 | "cell_type": "markdown", 592 | "metadata": {}, 593 | "source": [ 594 | "### 编程练习 5\n", 595 | "你刚刚预测了三个客户的房子的售价。在这个练习中,你将用你的最优模型在整个测试数据上进行预测, 并计算相对于目标变量的决定系数 $R^2$ 的值。\n", 596 | "\n", 597 | "**提示:**\n", 598 | "* 你可能需要用到 `X_test`, `y_test`, `reg`, `performance_metric`。\n", 599 | "* 参考问题10的代码进行预测。\n", 600 | "* 参考问题2的代码来计算 $R^2$ 的值。\n" 601 | ] 602 | }, 603 | { 604 | "cell_type": "code", 605 | "execution_count": null, 606 | "metadata": { 607 | "collapsed": true 608 | }, 609 | "outputs": [], 610 | "source": [ 611 | "# TODO Calculate the r2 score between 'y_true' and 'y_predict'\n", 612 | "\n", 613 | "r2 = None\n", 614 | "\n", 615 | "print(\"Optimal model has R^2 score {:,.2f} on test data\".format(r2))" 616 | ] 617 | }, 618 | { 619 | "cell_type": "markdown", 620 | "metadata": {}, 621 | "source": [ 622 | "### 问题11 - 分析决定系数\n", 623 | "\n", 624 | "你刚刚计算了最优模型在测试集上的决定系数,你会如何评价这个结果?" 
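The following sketch pulls together Exercises 4 and 5 above: a `GridSearchCV` over `'max_depth'` for a `DecisionTreeRegressor`, scored with the `performance_metric` function, followed by the test-set R². It assumes `performance_metric`, `X_train`, `y_train`, `X_test`, `y_test` and `reg` are defined as in the earlier cells; the `random_state` values are arbitrary.

```python
# Exercise 4 (sketch): grid search over 'max_depth' for a decision tree regressor.
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, ShuffleSplit

def fit_model(X, y):
    """Grid-search 'max_depth' in 1..10 for a DecisionTreeRegressor trained on X, y."""
    cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=42)
    regressor = DecisionTreeRegressor(random_state=42)
    params = {'max_depth': list(range(1, 11))}          # depths 1 through 10
    scoring_fnc = make_scorer(performance_metric)       # wraps the R^2 metric defined earlier
    grid = GridSearchCV(regressor, params, scoring=scoring_fnc, cv=cv_sets)
    grid = grid.fit(X, y)
    return grid.best_estimator_

# Exercise 5 (sketch): R^2 of the optimal model on the held-out test set.
r2 = performance_metric(y_test, reg.predict(X_test))
print("Optimal model has R^2 score {:,.2f} on test data".format(r2))
```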
625 | ] 626 | }, 627 | { 628 | "cell_type": "markdown", 629 | "metadata": {}, 630 | "source": [ 631 | "### 问题11 - 回答" 632 | ] 633 | }, 634 | { 635 | "cell_type": "markdown", 636 | "metadata": {}, 637 | "source": [ 638 | "### 模型健壮性\n", 639 | "\n", 640 | "一个最优的模型不一定是一个健壮模型。有的时候模型会过于复杂或者过于简单,以致于难以泛化新增添的数据;有的时候模型采用的学习算法并不适用于特定的数据结构;有的时候样本本身可能有太多噪点或样本过少,使得模型无法准确地预测目标变量。这些情况下我们会说模型是欠拟合的。\n", 641 | "\n", 642 | "### 问题 12 - 模型健壮性\n", 643 | "\n", 644 | "模型是否足够健壮来保证预测的一致性?\n", 645 | "\n", 646 | "**提示**: 执行下方区域中的代码,采用不同的训练和测试集执行 `fit_model` 函数10次。注意观察对一个特定的客户来说,预测是如何随训练数据的变化而变化的。" 647 | ] 648 | }, 649 | { 650 | "cell_type": "code", 651 | "execution_count": null, 652 | "metadata": { 653 | "collapsed": true 654 | }, 655 | "outputs": [], 656 | "source": [ 657 | "vs.PredictTrials(features, prices, fit_model, client_data)" 658 | ] 659 | }, 660 | { 661 | "cell_type": "markdown", 662 | "metadata": {}, 663 | "source": [ 664 | "### 问题 12 - 回答:" 665 | ] 666 | }, 667 | { 668 | "cell_type": "markdown", 669 | "metadata": {}, 670 | "source": [ 671 | "### 问题 13 - 实用性探讨\n", 672 | "*简单地讨论一下你建构的模型能否在现实世界中使用?* \n", 673 | "\n", 674 | "提示:回答以下几个问题,并给出相应结论的理由:\n", 675 | "- *1978年所采集的数据,在已考虑通货膨胀的前提下,在今天是否仍然适用?*\n", 676 | "- *数据中呈现的特征是否足够描述一个房屋?*\n", 677 | "- *在波士顿这样的大都市采集的数据,能否应用在其它乡镇地区?*\n", 678 | "- *你觉得仅仅凭房屋所在社区的环境来判断房屋价值合理吗?*" 679 | ] 680 | }, 681 | { 682 | "cell_type": "markdown", 683 | "metadata": {}, 684 | "source": [ 685 | "### 问题 13 - 回答:" 686 | ] 687 | }, 688 | { 689 | "cell_type": "markdown", 690 | "metadata": {}, 691 | "source": [ 692 | "## 第七步.完成和提交" 693 | ] 694 | }, 695 | { 696 | "cell_type": "markdown", 697 | "metadata": {}, 698 | "source": [ 699 | "当你完成了以上所有的代码和问题,你需要将 iPython Notebook 导出 HTML,导出方法:在左上角的菜单中选择 **File -> Download as -> HTML (.html)**。当你提交项目时,需要包含**可运行的 .ipynb 文件**和**导出的 HTML 文件**。" 700 | ] 701 | } 702 | ], 703 | "metadata": { 704 | "kernelspec": { 705 | "display_name": "Python 3", 706 | "language": "python", 707 | "name": "python3" 708 | }, 709 | "language_info": { 710 | "codemirror_mode": { 711 | "name": "ipython", 712 | "version": 3 713 | }, 714 | "file_extension": ".py", 715 | "mimetype": "text/x-python", 716 | "name": "python", 717 | "nbconvert_exporter": "python", 718 | "pygments_lexer": "ipython3", 719 | "version": "3.6.4" 720 | } 721 | }, 722 | "nbformat": 4, 723 | "nbformat_minor": 1 724 | } 725 | -------------------------------------------------------------------------------- /creating_customer_segments/customer_segments.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 机器学习纳米学位\n", 8 | "## 非监督学习\n", 9 | "## 项目 3: 创建用户分类" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "欢迎来到机器学习工程师纳米学位的第三个项目!在这个 notebook 文件中,有些模板代码已经提供给你,但你还需要实现更多的功能来完成这个项目。除非有明确要求,你无须修改任何已给出的代码。以**'练习'**开始的标题表示接下来的代码部分中有你必须要实现的功能。每一部分都会有详细的指导,需要实现的部分也会在注释中以 **'TODO'** 标出。请仔细阅读所有的提示!\n", 17 | "\n", 18 | "除了实现代码外,你还**必须**回答一些与项目和你的实现有关的问题。每一个需要你回答的问题都会以**'问题 X'**为标题。请仔细阅读每个问题,并且在问题后的**'回答'**文字框中写出完整的答案。我们将根据你对问题的回答和撰写代码所实现的功能来对你提交的项目进行评分。\n", 19 | "\n", 20 | ">**提示:**Code 和 Markdown 区域可通过 **Shift + Enter** 快捷键运行。此外,Markdown 可以通过双击进入编辑模式。" 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": {}, 26 | "source": [ 27 | "## 开始\n", 28 | "\n", 29 | "在这个项目中,你将分析一个数据集的内在结构,这个数据集包含很多客户真对不同类型产品的年度采购额(用**金额**表示)。这个项目的任务之一是如何最好地描述一个批发商不同种类顾客之间的差异。这样做将能够使得批发商能够更好的组织他们的物流服务以满足每个客户的需求。\n", 30 | "\n", 31 | 
"这个项目的数据集能够在[UCI机器学习信息库](https://archive.ics.uci.edu/ml/datasets/Wholesale+customers)中找到.因为这个项目的目的,分析将不会包括 'Channel' 和 'Region' 这两个特征——重点集中在6个记录的客户购买的产品类别上。\n", 32 | "\n", 33 | "运行下面的的代码单元以载入整个客户数据集和一些这个项目需要的 Python 库。如果你的数据集载入成功,你将看到后面输出数据集的大小。" 34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": {}, 40 | "outputs": [], 41 | "source": [ 42 | "# 检查你的Python版本\n", 43 | "from sys import version_info\n", 44 | "if version_info.major != 3:\n", 45 | " raise Exception('请使用Python 3.x 来完成此项目')" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "# 引入这个项目需要的库\n", 55 | "import numpy as np\n", 56 | "import pandas as pd\n", 57 | "import visuals as vs\n", 58 | "from IPython.display import display # 使得我们可以对DataFrame使用display()函数\n", 59 | "\n", 60 | "# 设置以内联的形式显示matplotlib绘制的图片(在notebook中显示更美观)\n", 61 | "%matplotlib inline\n", 62 | "# 高分辨率显示\n", 63 | "# %config InlineBackend.figure_format='retina'\n", 64 | "\n", 65 | "# 载入整个客户数据集\n", 66 | "try:\n", 67 | " data = pd.read_csv(\"customers.csv\")\n", 68 | " data.drop(['Region', 'Channel'], axis = 1, inplace = True)\n", 69 | " print(\"Wholesale customers dataset has {} samples with {} features each.\".format(*data.shape))\n", 70 | "except:\n", 71 | " print(\"Dataset could not be loaded. Is the dataset missing?\")" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "## 分析数据\n", 79 | "在这部分,你将开始分析数据,通过可视化和代码来理解每一个特征和其他特征的联系。你会看到关于数据集的统计描述,考虑每一个属性的相关性,然后从数据集中选择若干个样本数据点,你将在整个项目中一直跟踪研究这几个数据点。\n", 80 | "\n", 81 | "运行下面的代码单元给出数据集的一个统计描述。注意这个数据集包含了6个重要的产品类型:**'Fresh'**, **'Milk'**, **'Grocery'**, **'Frozen'**, **'Detergents_Paper'**和 **'Delicatessen'**。想一下这里每一个类型代表你会购买什么样的产品。" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "# 显示数据集的一个描述\n", 91 | "display(data.describe())" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "### 练习: 选择样本\n", 99 | "为了对客户有一个更好的了解,并且了解代表他们的数据将会在这个分析过程中如何变换。最好是选择几个样本数据点,并且更为详细地分析它们。在下面的代码单元中,选择**三个**索引加入到索引列表`indices`中,这三个索引代表你要追踪的客户。我们建议你不断尝试,直到找到三个明显不同的客户。" 100 | ] 101 | }, 102 | { 103 | "cell_type": "code", 104 | "execution_count": null, 105 | "metadata": {}, 106 | "outputs": [], 107 | "source": [ 108 | "# TODO:从数据集中选择三个你希望抽样的数据点的索引\n", 109 | "indices = []\n", 110 | "\n", 111 | "# 为选择的样本建立一个DataFrame\n", 112 | "samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)\n", 113 | "print(\"Chosen samples of wholesale customers dataset:\")\n", 114 | "display(samples)" 115 | ] 116 | }, 117 | { 118 | "cell_type": "markdown", 119 | "metadata": {}, 120 | "source": [ 121 | "### 问题 1\n", 122 | "在你看来你选择的这三个样本点分别代表什么类型的企业(客户)?对每一个你选择的样本客户,通过它在每一种产品类型上的花费与数据集的统计描述进行比较,给出你做上述判断的理由。\n", 123 | "\n", 124 | "\n", 125 | "**提示:** 企业的类型包括超市、咖啡馆、零售商以及其他。注意不要使用具体企业的名字,比如说在描述一个餐饮业客户时,你不能使用麦当劳。" 126 | ] 127 | }, 128 | { 129 | "cell_type": "markdown", 130 | "metadata": {}, 131 | "source": [ 132 | "**回答:**" 133 | ] 134 | }, 135 | { 136 | "cell_type": "markdown", 137 | "metadata": {}, 138 | "source": [ 139 | "### 练习: 特征相关性\n", 140 | "一个有趣的想法是,考虑这六个类别中的一个(或者多个)产品类别,是否对于理解客户的购买行为具有实际的相关性。也就是说,当用户购买了一定数量的某一类产品,我们是否能够确定他们必然会成比例地购买另一种类的产品。有一个简单的方法可以检测相关性:我们用移除了某一个特征之后的数据集来构建一个监督学习(回归)模型,然后用这个模型去预测那个被移除的特征,再对这个预测结果进行评分,看看预测结果如何。\n", 141 | "\n", 142 | "在下面的代码单元中,你需要实现以下的功能:\n", 143 | " - 使用 `DataFrame.drop` 
函数移除数据集中你选择的不需要的特征,并将移除后的结果赋值给 `new_data` 。\n", 144 | " - 使用 `sklearn.model_selection.train_test_split` 将数据集分割成训练集和测试集。\n", 145 | " - 使用移除的特征作为你的目标标签。设置 `test_size` 为 `0.25` 并设置一个 `random_state` 。\n", 146 | " \n", 147 | " \n", 148 | " - 导入一个 DecisionTreeRegressor (决策树回归器),设置一个 `random_state`,然后用训练集训练它。\n", 149 | " - 使用回归器的 `score` 函数输出模型在测试集上的预测得分。" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": null, 155 | "metadata": {}, 156 | "outputs": [], 157 | "source": [ 158 | "# TODO:为DataFrame创建一个副本,用'drop'函数丢弃一个特征# TODO: \n", 159 | "new_data = None\n", 160 | "\n", 161 | "# TODO:使用给定的特征作为目标,将数据分割成训练集和测试集\n", 162 | "X_train, X_test, y_train, y_test = (None, None, None, None)\n", 163 | "\n", 164 | "# TODO:创建一个DecisionTreeRegressor(决策树回归器)并在训练集上训练它\n", 165 | "regressor = None\n", 166 | "\n", 167 | "# TODO:输出在测试集上的预测得分\n", 168 | "score = None" 169 | ] 170 | }, 171 | { 172 | "cell_type": "markdown", 173 | "metadata": {}, 174 | "source": [ 175 | "### 问题 2\n", 176 | "你尝试预测哪一个特征?预测的得分是多少?这个特征对于区分用户的消费习惯来说必要吗?为什么? \n", 177 | "**提示:** 决定系数(coefficient of determination),$R^2$ 结果在0到1之间,1表示完美拟合,一个负的 $R^2$ 表示模型不能够拟合数据。" 178 | ] 179 | }, 180 | { 181 | "cell_type": "markdown", 182 | "metadata": {}, 183 | "source": [ 184 | "**回答:**" 185 | ] 186 | }, 187 | { 188 | "cell_type": "markdown", 189 | "metadata": {}, 190 | "source": [ 191 | "### 可视化特征分布\n", 192 | "为了能够对这个数据集有一个更好的理解,我们可以对数据集中的每一个产品特征构建一个散布矩阵(scatter matrix)。如果你发现你在上面尝试预测的特征对于区分一个特定的用户来说是必须的,那么这个特征和其它的特征可能不会在下面的散射矩阵中显示任何关系。相反的,如果你认为这个特征对于识别一个特定的客户是没有作用的,那么通过散布矩阵可以看出在这个数据特征和其它特征中有关联性。运行下面的代码以创建一个散布矩阵。" 193 | ] 194 | }, 195 | { 196 | "cell_type": "code", 197 | "execution_count": null, 198 | "metadata": {}, 199 | "outputs": [], 200 | "source": [ 201 | "# 对于数据中的每一对特征构造一个散布矩阵\n", 202 | "pd.plotting.scatter_matrix(data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');" 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": {}, 208 | "source": [ 209 | "### 问题 3\n", 210 | "这里是否存在一些特征他们彼此之间存在一定程度相关性?如果有请列出。这个结果是验证了还是否认了你尝试预测的那个特征的相关性?这些特征的数据是怎么分布的?\n", 211 | "\n", 212 | "**提示:** 这些数据是正态分布(normally distributed)的吗?大多数的数据点分布在哪?" 
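For the feature-relevance exercise above, a minimal sketch is shown below. Which feature to drop is part of the exercise; `'Grocery'` is used here purely as an illustrative, hypothetical choice, and the `random_state` values are arbitrary.

```python
# Feature-relevance sketch: predict one product category from the other five.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

dropped = 'Grocery'                                  # hypothetical target feature
new_data = data.drop([dropped], axis=1)              # `data` comes from the loading cell above

X_train, X_test, y_train, y_test = train_test_split(
    new_data, data[dropped], test_size=0.25, random_state=1)

regressor = DecisionTreeRegressor(random_state=1)
regressor.fit(X_train, y_train)
score = regressor.score(X_test, y_test)              # R^2 on the held-out 25%
print("R^2 when predicting '{}': {:.4f}".format(dropped, score))
```

A score close to 1 would suggest the dropped feature is largely redundant with the others, while a low or negative score would suggest it carries information the remaining features cannot reproduce.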
213 | ] 214 | }, 215 | { 216 | "cell_type": "markdown", 217 | "metadata": {}, 218 | "source": [ 219 | "**回答:**" 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "metadata": {}, 225 | "source": [ 226 | "## 数据预处理\n", 227 | "在这个部分,你将通过在数据上做一个合适的缩放,并检测异常点(你可以选择性移除)将数据预处理成一个更好的代表客户的形式。预处理数据是保证你在分析中能够得到显著且有意义的结果的重要环节。" 228 | ] 229 | }, 230 | { 231 | "cell_type": "markdown", 232 | "metadata": {}, 233 | "source": [ 234 | "### 练习: 特征缩放\n", 235 | "如果数据不是正态分布的,尤其是数据的平均数和中位数相差很大的时候(表示数据非常歪斜)。这时候通常用一个[非线性的缩放](https://github.com/czcbangkai/translations/blob/master/use_of_logarithms_in_economics/use_of_logarithms_in_economics.pdf)是很合适的,[(英文原文)](http://econbrowser.com/archives/2014/02/use-of-logarithms-in-economics) — 尤其是对于金融数据。一种实现这个缩放的方法是使用 [Box-Cox 变换](http://scipy.github.io/devdocs/generated/scipy.stats.boxcox.html),这个方法能够计算出能够最佳减小数据倾斜的指数变换方法。一个比较简单的并且在大多数情况下都适用的方法是使用自然对数。\n", 236 | "\n", 237 | "在下面的代码单元中,你将需要实现以下功能:\n", 238 | " - 使用 `np.log` 函数在数据 `data` 上做一个对数缩放,然后将它的副本(不改变原始data的值)赋值给 `log_data`。 \n", 239 | " - 使用 `np.log` 函数在样本数据 `samples` 上做一个对数缩放,然后将它的副本赋值给 `log_samples`。" 240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "execution_count": null, 245 | "metadata": {}, 246 | "outputs": [], 247 | "source": [ 248 | "# TODO:使用自然对数缩放数据\n", 249 | "log_data = None\n", 250 | "\n", 251 | "# TODO:使用自然对数缩放样本数据\n", 252 | "log_samples = None\n", 253 | "\n", 254 | "# 为每一对新产生的特征制作一个散射矩阵\n", 255 | "pd.plotting.scatter_matrix(log_data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "metadata": {}, 261 | "source": [ 262 | "### 观察\n", 263 | "在使用了一个自然对数的缩放之后,数据的各个特征会显得更加的正态分布。对于任意的你以前发现有相关关系的特征对,观察他们的相关关系是否还是存在的(并且尝试观察,他们的相关关系相比原来是变强了还是变弱了)。\n", 264 | "\n", 265 | "运行下面的代码以观察样本数据在进行了自然对数转换之后如何改变了。" 266 | ] 267 | }, 268 | { 269 | "cell_type": "code", 270 | "execution_count": null, 271 | "metadata": {}, 272 | "outputs": [], 273 | "source": [ 274 | "# 展示经过对数变换后的样本数据\n", 275 | "display(log_samples)" 276 | ] 277 | }, 278 | { 279 | "cell_type": "markdown", 280 | "metadata": {}, 281 | "source": [ 282 | "### 练习: 异常值检测\n", 283 | "对于任何的分析,在数据预处理的过程中检测数据中的异常值都是非常重要的一步。异常值的出现会使得把这些值考虑进去后结果出现倾斜。这里有很多关于怎样定义什么是数据集中的异常值的经验法则。这里我们将使用[ Tukey 的定义异常值的方法](http://datapigtechnologies.com/blog/index.php/highlighting-outliers-in-your-data-with-the-tukey-method/):一个异常阶(outlier step)被定义成1.5倍的四分位距(interquartile range,IQR)。一个数据点如果某个特征包含在该特征的 IQR 之外的特征,那么该数据点被认定为异常点。\n", 284 | "\n", 285 | "在下面的代码单元中,你需要完成下面的功能:\n", 286 | " - 将指定特征的 25th 分位点的值分配给 `Q1` 。使用 `np.percentile` 来完成这个功能。\n", 287 | " - 将指定特征的 75th 分位点的值分配给 `Q3` 。同样的,使用 `np.percentile` 来完成这个功能。\n", 288 | " - 将指定特征的异常阶的计算结果赋值给 `step`。\n", 289 | " - 选择性地通过将索引添加到 `outliers` 列表中,以移除异常值。\n", 290 | "\n", 291 | "**注意:** 如果你选择移除异常值,请保证你选择的样本点不在这些移除的点当中!\n", 292 | "一旦你完成了这些功能,数据集将存储在 `good_data` 中。" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": null, 298 | "metadata": {}, 299 | "outputs": [], 300 | "source": [ 301 | "# 对于每一个特征,找到值异常高或者是异常低的数据点\n", 302 | "for feature in log_data.keys():\n", 303 | " \n", 304 | " # TODO: 计算给定特征的Q1(数据的25th分位点)\n", 305 | " Q1 = None\n", 306 | " \n", 307 | " # TODO: 计算给定特征的Q3(数据的75th分位点)\n", 308 | " Q3 = None\n", 309 | " \n", 310 | " # TODO: 使用四分位范围计算异常阶(1.5倍的四分位距)\n", 311 | " step = None\n", 312 | " \n", 313 | " # 显示异常点\n", 314 | " print(\"Data points considered outliers for the feature '{}':\".format(feature))\n", 315 | " display(log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))])\n", 316 | " \n", 317 | "# TODO(可选): 
选择你希望移除的数据点的索引\n", 318 | "outliers = []\n", 319 | "\n", 320 | "# 以下代码会移除outliers中索引的数据点, 并储存在good_data中\n", 321 | "good_data = log_data.drop(log_data.index[outliers]).reset_index(drop = True)" 322 | ] 323 | }, 324 | { 325 | "cell_type": "markdown", 326 | "metadata": {}, 327 | "source": [ 328 | "### 问题 4\n", 329 | "请列出所有在多于一个特征下被看作是异常的数据点。这些点应该被从数据集中移除吗?为什么?把你认为需要移除的数据点全部加入到到 `outliers` 变量中。" 330 | ] 331 | }, 332 | { 333 | "cell_type": "markdown", 334 | "metadata": {}, 335 | "source": [ 336 | "**回答:**" 337 | ] 338 | }, 339 | { 340 | "cell_type": "markdown", 341 | "metadata": {}, 342 | "source": [ 343 | "## 特征转换\n", 344 | "在这个部分中你将使用主成分分析(PCA)来分析批发商客户数据的内在结构。由于使用PCA在一个数据集上会计算出最大化方差的维度,我们将找出哪一个特征组合能够最好的描绘客户。" 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "metadata": {}, 350 | "source": [ 351 | "### 练习: 主成分分析(PCA)\n", 352 | "\n", 353 | "既然数据被缩放到一个更加正态分布的范围中并且我们也移除了需要移除的异常点,我们现在就能够在 `good_data` 上使用PCA算法以发现数据的哪一个维度能够最大化特征的方差。除了找到这些维度,PCA 也将报告每一个维度的解释方差比(explained variance ratio)--这个数据有多少方差能够用这个单独的维度来解释。注意 PCA 的一个组成部分(维度)能够被看做这个空间中的一个新的“特征”,但是它是原来数据中的特征构成的。\n", 354 | "\n", 355 | "在下面的代码单元中,你将要实现下面的功能:\n", 356 | " - 导入 `sklearn.decomposition.PCA` 并且将 `good_data` 用 PCA 并且使用6个维度进行拟合后的结果保存到 `pca` 中。\n", 357 | " - 使用 `pca.transform` 将 `log_samples` 进行转换,并将结果存储到 `pca_samples` 中。" 358 | ] 359 | }, 360 | { 361 | "cell_type": "code", 362 | "execution_count": null, 363 | "metadata": {}, 364 | "outputs": [], 365 | "source": [ 366 | "# TODO:通过在good data上进行PCA,将其转换成6个维度\n", 367 | "pca = None\n", 368 | "\n", 369 | "# TODO:使用上面的PCA拟合将变换施加在log_samples上\n", 370 | "pca_samples = None\n", 371 | "\n", 372 | "# 生成PCA的结果图\n", 373 | "pca_results = vs.pca_results(good_data, pca)" 374 | ] 375 | }, 376 | { 377 | "cell_type": "markdown", 378 | "metadata": {}, 379 | "source": [ 380 | "### 问题 5\n", 381 | "数据的第一个和第二个主成分**总共**表示了多少的方差? 
前四个主成分呢?使用上面提供的可视化图像,从用户花费的角度来讨论前四个主要成分中每个主成分代表的消费行为并给出你做出判断的理由。\n", 382 | "\n", 383 | "**提示:**\n", 384 | "* 对每个主成分中的特征分析权重的正负和大小。\n", 385 | "* 结合每个主成分权重的正负讨论消费行为。\n", 386 | "* 某一特定维度上的正向增长对应正权特征的增长和负权特征的减少。增长和减少的速率和每个特征的权重相关。[参考资料:Interpretation of the Principal Components](https://onlinecourses.science.psu.edu/stat505/node/54)" 387 | ] 388 | }, 389 | { 390 | "cell_type": "markdown", 391 | "metadata": {}, 392 | "source": [ 393 | "**回答:**" 394 | ] 395 | }, 396 | { 397 | "cell_type": "markdown", 398 | "metadata": {}, 399 | "source": [ 400 | "### 观察\n", 401 | "运行下面的代码,查看经过对数转换的样本数据在进行一个6个维度的主成分分析(PCA)之后会如何改变。观察样本数据的前四个维度的数值。考虑这和你初始对样本点的解释是否一致。" 402 | ] 403 | }, 404 | { 405 | "cell_type": "code", 406 | "execution_count": null, 407 | "metadata": {}, 408 | "outputs": [], 409 | "source": [ 410 | "# 展示经过PCA转换的sample log-data\n", 411 | "display(pd.DataFrame(np.round(pca_samples, 4), columns = pca_results.index.values))" 412 | ] 413 | }, 414 | { 415 | "cell_type": "markdown", 416 | "metadata": {}, 417 | "source": [ 418 | "### 练习:降维\n", 419 | "当使用主成分分析的时候,一个主要的目的是减少数据的维度,这实际上降低了问题的复杂度。当然降维也是需要一定代价的:更少的维度能够表示的数据中的总方差更少。因为这个,**累计解释方差比(cumulative explained variance ratio)**对于我们确定这个问题需要多少维度非常重要。另外,如果大部分的方差都能够通过两个或者是三个维度进行表示的话,降维之后的数据能够被可视化。\n", 420 | "\n", 421 | "在下面的代码单元中,你将实现下面的功能:\n", 422 | " - 将 `good_data` 用两个维度的PCA进行拟合,并将结果存储到 `pca` 中去。\n", 423 | " - 使用 `pca.transform` 将 `good_data` 进行转换,并将结果存储在 `reduced_data` 中。\n", 424 | " - 使用 `pca.transform` 将 `log_samples` 进行转换,并将结果存储在 `pca_samples` 中。" 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": null, 430 | "metadata": {}, 431 | "outputs": [], 432 | "source": [ 433 | "# TODO:通过在good data上进行PCA,将其转换成两个维度\n", 434 | "pca = None\n", 435 | "\n", 436 | "# TODO:使用上面训练的PCA将good data进行转换\n", 437 | "reduced_data = None\n", 438 | "\n", 439 | "# TODO:使用上面训练的PCA将log_samples进行转换\n", 440 | "pca_samples = None\n", 441 | "\n", 442 | "# 为降维后的数据创建一个DataFrame\n", 443 | "reduced_data = pd.DataFrame(reduced_data, columns = ['Dimension 1', 'Dimension 2'])" 444 | ] 445 | }, 446 | { 447 | "cell_type": "markdown", 448 | "metadata": {}, 449 | "source": [ 450 | "### 观察\n", 451 | "运行以下代码观察当仅仅使用两个维度进行 PCA 转换后,这个对数样本数据将怎样变化。观察这里的结果与一个使用六个维度的 PCA 转换相比较时,前两维的数值是保持不变的。" 452 | ] 453 | }, 454 | { 455 | "cell_type": "code", 456 | "execution_count": null, 457 | "metadata": {}, 458 | "outputs": [], 459 | "source": [ 460 | "# 展示经过两个维度的PCA转换之后的样本log-data\n", 461 | "display(pd.DataFrame(np.round(pca_samples, 4), columns = ['Dimension 1', 'Dimension 2']))" 462 | ] 463 | }, 464 | { 465 | "cell_type": "markdown", 466 | "metadata": {}, 467 | "source": [ 468 | "## 可视化一个双标图(Biplot)\n", 469 | "双标图是一个散点图,每个数据点的位置由它所在主成分的分数确定。坐标系是主成分(这里是 `Dimension 1` 和 `Dimension 2`)。此外,双标图还展示出初始特征在主成分上的投影。一个双标图可以帮助我们理解降维后的数据,发现主成分和初始特征之间的关系。\n", 470 | "\n", 471 | "运行下面的代码来创建一个降维后数据的双标图。" 472 | ] 473 | }, 474 | { 475 | "cell_type": "code", 476 | "execution_count": null, 477 | "metadata": {}, 478 | "outputs": [], 479 | "source": [ 480 | "# 可视化双标图\n", 481 | "vs.biplot(good_data, reduced_data, pca)" 482 | ] 483 | }, 484 | { 485 | "cell_type": "markdown", 486 | "metadata": {}, 487 | "source": [ 488 | "### 观察\n", 489 | "\n", 490 | "一旦我们有了原始特征的投影(红色箭头),就能更加容易的理解散点图每个数据点的相对位置。\n", 491 | "\n", 492 | "在这个双标图中,哪些初始特征与第一个主成分有强关联?哪些初始特征与第二个主成分相关联?你观察到的是否与之前得到的 pca_results 图相符?" 
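A combined sketch of the scaling, outlier-detection and PCA exercises above, using the variable names from the notebook (`data`, `samples` and `display` are assumed to come from the earlier cells). The `outliers` list is intentionally left empty here, because which points to remove is a judgment call made after inspecting the Tukey output.

```python
# Sketch covering the log-scaling, Tukey-outlier and PCA exercises.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Natural-log scaling of the data and the chosen samples.
log_data = np.log(data)
log_samples = np.log(samples)

# Tukey's rule: flag points more than 1.5 * IQR outside the 25th/75th percentiles.
for feature in log_data.keys():
    Q1 = np.percentile(log_data[feature], 25)
    Q3 = np.percentile(log_data[feature], 75)
    step = 1.5 * (Q3 - Q1)
    print("Data points considered outliers for the feature '{}':".format(feature))
    display(log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))])

outliers = []                                        # to be filled in after inspection
good_data = log_data.drop(log_data.index[outliers]).reset_index(drop=True)

# Full six-component PCA, then the two-component reduction used for visualisation.
pca = PCA(n_components=6).fit(good_data)
pca_samples = pca.transform(log_samples)

pca = PCA(n_components=2).fit(good_data)
reduced_data = pd.DataFrame(pca.transform(good_data),
                            columns=['Dimension 1', 'Dimension 2'])
pca_samples = pca.transform(log_samples)
```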
493 | ] 494 | }, 495 | { 496 | "cell_type": "markdown", 497 | "metadata": {}, 498 | "source": [ 499 | "## 聚类\n", 500 | "\n", 501 | "在这个部分,你讲选择使用 K-Means 聚类算法或者是高斯混合模型聚类算法以发现数据中隐藏的客户分类。然后,你将从簇中恢复一些特定的关键数据点,通过将它们转换回原始的维度和规模,从而理解他们的含义。" 502 | ] 503 | }, 504 | { 505 | "cell_type": "markdown", 506 | "metadata": {}, 507 | "source": [ 508 | "### 问题 6\n", 509 | "使用 K-Means 聚类算法的优点是什么?使用高斯混合模型聚类算法的优点是什么?基于你现在对客户数据的观察结果,你选用了这两个算法中的哪一个,为什么?" 510 | ] 511 | }, 512 | { 513 | "cell_type": "markdown", 514 | "metadata": {}, 515 | "source": [ 516 | "**回答:**" 517 | ] 518 | }, 519 | { 520 | "cell_type": "markdown", 521 | "metadata": {}, 522 | "source": [ 523 | "### 练习: 创建聚类\n", 524 | "\n", 525 | "针对不同情况,有些问题你需要的聚类数目可能是已知的。但是在聚类数目不作为一个**先验**知道的情况下,我们并不能够保证某个聚类的数目对这个数据是最优的,因为我们对于数据的结构(如果存在的话)是不清楚的。但是,我们可以通过计算每一个簇中点的**轮廓系数**来衡量聚类的质量。数据点的[轮廓系数](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html)衡量了它与分配给他的簇的相似度,这个值范围在-1(不相似)到1(相似)。**平均**轮廓系数为我们提供了一种简单地度量聚类质量的方法。\n", 526 | "\n", 527 | "在接下来的代码单元中,你将实现下列功能:\n", 528 | " - 在 `reduced_data` 上使用一个聚类算法,并将结果赋值到 `clusterer`,需要设置 `random_state` 使得结果可以复现。\n", 529 | " - 使用 `clusterer.predict` 预测 `reduced_data` 中的每一个点的簇,并将结果赋值到 `preds`。\n", 530 | " - 使用算法的某个属性值找到聚类中心,并将它们赋值到 `centers`。\n", 531 | " - 预测 `pca_samples` 中的每一个样本点的类别并将结果赋值到 `sample_preds`。\n", 532 | " - 导入 `sklearn.metrics.silhouette_score` 包并计算 `reduced_data` 相对于 `preds` 的轮廓系数。\n", 533 | " - 将轮廓系数赋值给 `score` 并输出结果。" 534 | ] 535 | }, 536 | { 537 | "cell_type": "code", 538 | "execution_count": null, 539 | "metadata": {}, 540 | "outputs": [], 541 | "source": [ 542 | "# TODO:在降维后的数据上使用你选择的聚类算法\n", 543 | "clusterer = None\n", 544 | "\n", 545 | "# TODO:预测每一个点的簇\n", 546 | "preds = None\n", 547 | "\n", 548 | "# TODO:找到聚类中心\n", 549 | "centers = None\n", 550 | "\n", 551 | "# TODO:预测在每一个转换后的样本点的类\n", 552 | "sample_preds = None\n", 553 | "\n", 554 | "# TODO:计算选择的类别的平均轮廓系数(mean silhouette coefficient)\n", 555 | "score = None" 556 | ] 557 | }, 558 | { 559 | "cell_type": "markdown", 560 | "metadata": {}, 561 | "source": [ 562 | "### 问题 7\n", 563 | "\n", 564 | "汇报你尝试的不同的聚类数对应的轮廓系数。在这些当中哪一个聚类的数目能够得到最佳的轮廓系数?" 
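A minimal clustering sketch for the exercise above, using K-Means. A Gaussian mixture model (`sklearn.mixture.GaussianMixture`) could be used instead with minor changes (its centers live in `means_`), and the number of clusters (2 here) is only an illustration that should be compared against other values via the silhouette score.

```python
# Clustering sketch: K-Means on the PCA-reduced data plus the mean silhouette score.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

clusterer = KMeans(n_clusters=2, random_state=1)     # random_state chosen arbitrarily
clusterer.fit(reduced_data)

preds = clusterer.predict(reduced_data)
centers = clusterer.cluster_centers_
sample_preds = clusterer.predict(pca_samples)

score = silhouette_score(reduced_data, preds)
print("Mean silhouette coefficient with 2 clusters: {:.4f}".format(score))
```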
565 | ] 566 | }, 567 | { 568 | "cell_type": "markdown", 569 | "metadata": {}, 570 | "source": [ 571 | "**回答:**" 572 | ] 573 | }, 574 | { 575 | "cell_type": "markdown", 576 | "metadata": {}, 577 | "source": [ 578 | "### 聚类可视化\n", 579 | "一旦你选好了通过上面的评价函数得到的算法的最佳聚类数目,你就能够通过使用下面的代码块可视化来得到的结果。作为实验,你可以试着调整你的聚类算法的聚类的数量来看一下不同的可视化结果。但是你提供的最终的可视化图像必须和你选择的最优聚类数目一致。" 580 | ] 581 | }, 582 | { 583 | "cell_type": "code", 584 | "execution_count": null, 585 | "metadata": {}, 586 | "outputs": [], 587 | "source": [ 588 | "# 从已有的实现中展示聚类的结果\n", 589 | "vs.cluster_results(reduced_data, preds, centers, pca_samples)" 590 | ] 591 | }, 592 | { 593 | "cell_type": "markdown", 594 | "metadata": {}, 595 | "source": [ 596 | "### 练习: 数据恢复\n", 597 | "上面的可视化图像中提供的每一个聚类都有一个中心点。这些中心(或者叫平均点)并不是数据中真实存在的点,但是是所有预测在这个簇中的数据点的平均。对于创建客户分类的问题,一个簇的中心对应于那个分类的平均用户。因为这个数据现在进行了降维并缩放到一定的范围,我们可以通过施加一个反向的转换恢复这个点所代表的用户的花费。\n", 598 | "\n", 599 | "在下面的代码单元中,你将实现下列的功能:\n", 600 | " - 使用 `pca.inverse_transform` 将 `centers` 反向转换,并将结果存储在 `log_centers` 中。\n", 601 | " - 使用 `np.log` 的反函数 `np.exp` 反向转换 `log_centers` 并将结果存储到 `true_centers` 中。\n" 602 | ] 603 | }, 604 | { 605 | "cell_type": "code", 606 | "execution_count": null, 607 | "metadata": {}, 608 | "outputs": [], 609 | "source": [ 610 | "# TODO:反向转换中心点\n", 611 | "log_centers = None\n", 612 | "\n", 613 | "# TODO:对中心点做指数转换\n", 614 | "true_centers = None\n", 615 | "\n", 616 | "# 显示真实的中心点\n", 617 | "segments = ['Segment {}'.format(i) for i in range(0,len(centers))]\n", 618 | "true_centers = pd.DataFrame(np.round(true_centers), columns = data.keys())\n", 619 | "true_centers.index = segments\n", 620 | "display(true_centers)" 621 | ] 622 | }, 623 | { 624 | "cell_type": "markdown", 625 | "metadata": {}, 626 | "source": [ 627 | "### 问题 8\n", 628 | "考虑上面的代表性数据点在每一个产品类型的花费总数,你认为这些客户分类代表了哪类客户?为什么?需要参考在项目最开始得到的统计值来给出理由。\n", 629 | "\n", 630 | "**提示:** 一个被分到`'Cluster X'`的客户最好被用 `'Segment X'`中的特征集来标识的企业类型表示。" 631 | ] 632 | }, 633 | { 634 | "cell_type": "markdown", 635 | "metadata": {}, 636 | "source": [ 637 | "**回答:**" 638 | ] 639 | }, 640 | { 641 | "cell_type": "markdown", 642 | "metadata": {}, 643 | "source": [ 644 | "### 问题 9\n", 645 | "对于每一个样本点**问题 8 **中的哪一个分类能够最好的表示它?你之前对样本的预测和现在的结果相符吗?\n", 646 | "\n", 647 | "运行下面的代码单元以找到每一个样本点被预测到哪一个簇中去。" 648 | ] 649 | }, 650 | { 651 | "cell_type": "code", 652 | "execution_count": null, 653 | "metadata": {}, 654 | "outputs": [], 655 | "source": [ 656 | "# 显示预测结果\n", 657 | "for i, pred in enumerate(sample_preds):\n", 658 | " print(\"Sample point\", i, \"predicted to be in Cluster\", pred)" 659 | ] 660 | }, 661 | { 662 | "cell_type": "markdown", 663 | "metadata": {}, 664 | "source": [ 665 | "**回答:**" 666 | ] 667 | }, 668 | { 669 | "cell_type": "markdown", 670 | "metadata": {}, 671 | "source": [ 672 | "## 结论\n", 673 | "\n", 674 | "在最后一部分中,你要学习如何使用已经被分类的数据。首先,你要考虑不同组的客户**客户分类**,针对不同的派送策略受到的影响会有什么不同。其次,你要考虑到,每一个客户都被打上了标签(客户属于哪一个分类)可以给客户数据提供一个多一个特征。最后,你会把客户分类与一个数据中的隐藏变量做比较,看一下这个分类是否辨识了特定的关系。" 675 | ] 676 | }, 677 | { 678 | "cell_type": "markdown", 679 | "metadata": { 680 | "collapsed": true 681 | }, 682 | "source": [ 683 | "### 问题 10\n", 684 | "在对他们的服务或者是产品做细微的改变的时候,公司经常会使用 [A/B tests ](https://en.wikipedia.org/wiki/A/B_testing)以确定这些改变会对客户产生积极作用还是消极作用。这个批发商希望考虑将他的派送服务从每周5天变为每周3天,但是他只会对他客户当中对此有积极反馈的客户采用。这个批发商应该如何利用客户分类来知道哪些客户对它的这个派送策略的改变有积极的反馈,如果有的话?你需要给出在这个情形下A/B 测试具体的实现方法,以及最终得出结论的依据是什么?\n", 685 | "\n", 686 | "**提示:** 我们能假设这个改变对所有的客户影响都一致吗?我们怎样才能够确定它对于哪个类型的客户影响最大?" 
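The data-recovery exercise above reduces to undoing the two transformations in reverse order, as sketched here (assuming `pca` is the two-component fit and `centers` comes from the clustering step):

```python
# Data-recovery sketch: map cluster centers back to the original spending scale.
import numpy as np

log_centers = pca.inverse_transform(centers)   # undo the 2-D PCA projection
true_centers = np.exp(log_centers)             # undo the natural-log scaling
```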
687 | ] 688 | }, 689 | { 690 | "cell_type": "markdown", 691 | "metadata": {}, 692 | "source": [ 693 | "**回答:**" 694 | ] 695 | }, 696 | { 697 | "cell_type": "markdown", 698 | "metadata": {}, 699 | "source": [ 700 | "### 问题 11\n", 701 | "通过聚类技术,我们能够将原有的没有标记的数据集中的附加结构分析出来。因为每一个客户都有一个最佳的划分(取决于你选择使用的聚类算法),我们可以把用户分类作为数据的一个[工程特征](https://en.wikipedia.org/wiki/Feature_learning#Unsupervised_feature_learning)。假设批发商最近迎来十位新顾客,并且他已经为每位顾客每个产品类别年度采购额进行了预估。进行了这些估算之后,批发商该如何运用它的预估和非监督学习的结果来对这十个新的客户进行更好的预测?\n", 702 | "\n", 703 | "**提示**:在下面的代码单元中,我们提供了一个已经做好聚类的数据(聚类结果为数据中的cluster属性),我们将在这个数据集上做一个小实验。尝试运行下面的代码看看我们尝试预测‘Region’的时候,如果存在聚类特征'cluster'与不存在相比对最终的得分会有什么影响?这对你有什么启发?" 704 | ] 705 | }, 706 | { 707 | "cell_type": "code", 708 | "execution_count": null, 709 | "metadata": {}, 710 | "outputs": [], 711 | "source": [ 712 | "from sklearn.ensemble import RandomForestClassifier\n", 713 | "from sklearn.model_selection import train_test_split\n", 714 | "\n", 715 | "# 读取包含聚类结果的数据\n", 716 | "cluster_data = pd.read_csv(\"cluster.csv\")\n", 717 | "y = cluster_data['Region']\n", 718 | "X = cluster_data.drop(['Region'], axis = 1)\n", 719 | "\n", 720 | "# 划分训练集测试集\n", 721 | "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=24)\n", 722 | "\n", 723 | "clf = RandomForestClassifier(random_state=24)\n", 724 | "clf.fit(X_train, y_train)\n", 725 | "score_with_cluster = clf.score(X_test, y_test)\n", 726 | "\n", 727 | "# 移除cluster特征\n", 728 | "X_train = X_train.copy()\n", 729 | "X_train.drop(['cluster'], axis=1, inplace=True)\n", 730 | "X_test = X_test.copy()\n", 731 | "X_test.drop(['cluster'], axis=1, inplace=True)\n", 732 | "clf.fit(X_train, y_train)\n", 733 | "score_no_cluster = clf.score(X_test, y_test)\n", 734 | "\n", 735 | "print(\"不使用cluster特征的得分: %.4f\"%score_no_cluster)\n", 736 | "print(\"使用cluster特征的得分: %.4f\"%score_with_cluster)" 737 | ] 738 | }, 739 | { 740 | "cell_type": "markdown", 741 | "metadata": {}, 742 | "source": [ 743 | "**回答:**" 744 | ] 745 | }, 746 | { 747 | "cell_type": "markdown", 748 | "metadata": {}, 749 | "source": [ 750 | "### 可视化内在的分布\n", 751 | "\n", 752 | "在这个项目的开始,我们讨论了从数据集中移除 `'Channel'` 和 `'Region'` 特征,这样在分析过程中我们就会着重分析用户产品类别。通过重新引入 `Channel` 这个特征到数据集中,并施加和原来数据集同样的 PCA 变换的时候我们将能够发现数据集产生一个有趣的结构。\n", 753 | "\n", 754 | "运行下面的代码单元以查看哪一个数据点在降维的空间中被标记为 `'HoReCa'` (旅馆/餐馆/咖啡厅)或者 `'Retail'`。另外,你将发现样本点在图中被圈了出来,用以显示他们的标签。" 755 | ] 756 | }, 757 | { 758 | "cell_type": "code", 759 | "execution_count": null, 760 | "metadata": { 761 | "scrolled": false 762 | }, 763 | "outputs": [], 764 | "source": [ 765 | "# 根据‘Channel‘数据显示聚类的结果\n", 766 | "vs.channel_results(reduced_data, outliers, pca_samples)" 767 | ] 768 | }, 769 | { 770 | "cell_type": "markdown", 771 | "metadata": {}, 772 | "source": [ 773 | "### 问题 12\n", 774 | "\n", 775 | "你选择的聚类算法和聚类点的数目,与内在的旅馆/餐馆/咖啡店和零售商的分布相比,有足够好吗?根据这个分布有没有哪个簇能够刚好划分成'零售商'或者是'旅馆/饭店/咖啡馆'?你觉得这个分类和前面你对于用户分类的定义是一致的吗?" 
776 | ] 777 | }, 778 | { 779 | "cell_type": "markdown", 780 | "metadata": {}, 781 | "source": [ 782 | "**回答:**" 783 | ] 784 | }, 785 | { 786 | "cell_type": "markdown", 787 | "metadata": {}, 788 | "source": [ 789 | "> **注意**: 当你写完了所有的代码,并且回答了所有的问题。你就可以把你的 iPython Notebook 导出成 HTML 文件。你可以在菜单栏,这样导出**File -> Download as -> HTML (.html)**把这个 HTML 和这个 iPython notebook 一起做为你的作业提交。 " 790 | ] 791 | } 792 | ], 793 | "metadata": { 794 | "anaconda-cloud": {}, 795 | "kernelspec": { 796 | "display_name": "Python 3", 797 | "language": "python", 798 | "name": "python3" 799 | }, 800 | "language_info": { 801 | "codemirror_mode": { 802 | "name": "ipython", 803 | "version": 3 804 | }, 805 | "file_extension": ".py", 806 | "mimetype": "text/x-python", 807 | "name": "python", 808 | "nbconvert_exporter": "python", 809 | "pygments_lexer": "ipython3", 810 | "version": "3.6.4" 811 | } 812 | }, 813 | "nbformat": 4, 814 | "nbformat_minor": 1 815 | } 816 | -------------------------------------------------------------------------------- /finding_donors/finding_donors.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 机器学习纳米学位\n", 8 | "## 监督学习\n", 9 | "## 项目2: 为*CharityML*寻找捐献者" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": {}, 15 | "source": [ 16 | "欢迎来到机器学习工程师纳米学位的第二个项目!在此文件中,有些示例代码已经提供给你,但你还需要实现更多的功能让项目成功运行。除非有明确要求,你无须修改任何已给出的代码。以**'练习'**开始的标题表示接下来的代码部分中有你必须要实现的功能。每一部分都会有详细的指导,需要实现的部分也会在注释中以'TODO'标出。请仔细阅读所有的提示!\n", 17 | "\n", 18 | "除了实现代码外,你还必须回答一些与项目和你的实现有关的问题。每一个需要你回答的问题都会以**'问题 X'**为标题。请仔细阅读每个问题,并且在问题后的**'回答'**文字框中写出完整的答案。我们将根据你对问题的回答和撰写代码所实现的功能来对你提交的项目进行评分。\n", 19 | ">**提示:**Code 和 Markdown 区域可通过**Shift + Enter**快捷键运行。此外,Markdown可以通过双击进入编辑模式。" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "## 开始\n", 27 | "\n", 28 | "在这个项目中,你将使用1994年美国人口普查收集的数据,选用几个监督学习算法以准确地建模被调查者的收入。然后,你将根据初步结果从中选择出最佳的候选算法,并进一步优化该算法以最好地建模这些数据。你的目标是建立一个能够准确地预测被调查者年收入是否超过50000美元的模型。这种类型的任务会出现在那些依赖于捐款而存在的非营利性组织。了解人群的收入情况可以帮助一个非营利性的机构更好地了解他们要多大的捐赠,或是否他们应该接触这些人。虽然我们很难直接从公开的资源中推断出一个人的一般收入阶层,但是我们可以(也正是我们将要做的)从其他的一些公开的可获得的资源中获得一些特征从而推断出该值。\n", 29 | "\n", 30 | "这个项目的数据集来自[UCI机器学习知识库](https://archive.ics.uci.edu/ml/datasets/Census+Income)。这个数据集是由Ron Kohavi和Barry Becker在发表文章_\"Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid\"_之后捐赠的,你可以在Ron Kohavi提供的[在线版本](https://www.aaai.org/Papers/KDD/1996/KDD96-033.pdf)中找到这个文章。我们在这里探索的数据集相比于原有的数据集有一些小小的改变,比如说移除了特征`'fnlwgt'` 以及一些遗失的或者是格式不正确的记录。" 31 | ] 32 | }, 33 | { 34 | "cell_type": "markdown", 35 | "metadata": {}, 36 | "source": [ 37 | "----\n", 38 | "## 探索数据\n", 39 | "运行下面的代码单元以载入需要的Python库并导入人口普查数据。注意数据集的最后一列`'income'`将是我们需要预测的列(表示被调查者的年收入会大于或者是最多50,000美元),人口普查数据中的每一列都将是关于被调查者的特征。" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": null, 45 | "metadata": {}, 46 | "outputs": [], 47 | "source": [ 48 | "# 为这个项目导入需要的库\n", 49 | "import numpy as np\n", 50 | "import pandas as pd\n", 51 | "from time import time\n", 52 | "from IPython.display import display # 允许为DataFrame使用display()\n", 53 | "\n", 54 | "# 导入附加的可视化代码visuals.py\n", 55 | "import visuals as vs\n", 56 | "\n", 57 | "# 为notebook提供更加漂亮的可视化\n", 58 | "%matplotlib inline\n", 59 | "\n", 60 | "# 导入人口普查数据\n", 61 | "data = pd.read_csv(\"census.csv\")\n", 62 | "\n", 63 | "# 成功 - 显示第一条记录\n", 64 | "display(data.head(n=1))" 65 | ] 66 | }, 67 | { 68 | "cell_type": "markdown", 69 | "metadata": {}, 70 | "source": [ 71 | "### 
练习:数据探索\n", 72 | "首先我们对数据集进行一个粗略的探索,我们将看看每一个类别里会有多少被调查者?并且告诉我们这些里面多大比例是年收入大于50,000美元的。在下面的代码单元中,你将需要计算以下量:\n", 73 | "\n", 74 | "- 总的记录数量,`'n_records'`\n", 75 | "- 年收入大于50,000美元的人数,`'n_greater_50k'`.\n", 76 | "- 年收入最多为50,000美元的人数 `'n_at_most_50k'`.\n", 77 | "- 年收入大于50,000美元的人所占的比例, `'greater_percent'`.\n", 78 | "\n", 79 | "**提示:** 您可能需要查看上面的生成的表,以了解`'income'`条目的格式是什么样的。 " 80 | ] 81 | }, 82 | { 83 | "cell_type": "code", 84 | "execution_count": null, 85 | "metadata": {}, 86 | "outputs": [], 87 | "source": [ 88 | "# TODO:总的记录数\n", 89 | "n_records = None\n", 90 | "\n", 91 | "# TODO:被调查者的收入大于$50,000的人数\n", 92 | "n_greater_50k = None\n", 93 | "\n", 94 | "# TODO:被调查者的收入最多为$50,000的人数\n", 95 | "n_at_most_50k = None\n", 96 | "\n", 97 | "# TODO:被调查者收入大于$50,000所占的比例\n", 98 | "greater_percent = None\n", 99 | "\n", 100 | "# 打印结果\n", 101 | "print (\"Total number of records: {}\".format(n_records))\n", 102 | "print (\"Individuals making more than $50,000: {}\".format(n_greater_50k))\n", 103 | "print (\"Individuals making at most $50,000: {}\".format(n_at_most_50k))\n", 104 | "print (\"Percentage of individuals making more than $50,000: {:.2f}%\".format(greater_percent))" 105 | ] 106 | }, 107 | { 108 | "cell_type": "markdown", 109 | "metadata": {}, 110 | "source": [ 111 | "----\n", 112 | "## 准备数据\n", 113 | "在数据能够被作为输入提供给机器学习算法之前,它经常需要被清洗,格式化,和重新组织 - 这通常被叫做**预处理**。幸运的是,对于这个数据集,没有我们必须处理的无效或丢失的条目,然而,由于某一些特征存在的特性我们必须进行一定的调整。这个预处理都可以极大地帮助我们提升几乎所有的学习算法的结果和预测能力。\n", 114 | "\n", 115 | "### 获得特征和标签\n", 116 | "`income` 列是我们需要的标签,记录一个人的年收入是否高于50K。 因此我们应该把他从数据中剥离出来,单独存放。" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": null, 122 | "metadata": {}, 123 | "outputs": [], 124 | "source": [ 125 | "# 将数据切分成特征和对应的标签\n", 126 | "income_raw = data['income']\n", 127 | "features_raw = data.drop('income', axis = 1)" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "### 转换倾斜的连续特征\n", 135 | "\n", 136 | "一个数据集有时可能包含至少一个靠近某个数字的特征,但有时也会有一些相对来说存在极大值或者极小值的不平凡分布的的特征。算法对这种分布的数据会十分敏感,并且如果这种数据没有能够很好地规一化处理会使得算法表现不佳。在人口普查数据集的两个特征符合这个描述:'`capital-gain'`和`'capital-loss'`。\n", 137 | "\n", 138 | "运行下面的代码单元以创建一个关于这两个特征的条形图。请注意当前的值的范围和它们是如何分布的。" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": null, 144 | "metadata": {}, 145 | "outputs": [], 146 | "source": [ 147 | "# 可视化 'capital-gain'和'capital-loss' 两个特征\n", 148 | "vs.distribution(features_raw)" 149 | ] 150 | }, 151 | { 152 | "cell_type": "markdown", 153 | "metadata": {}, 154 | "source": [ 155 | "对于高度倾斜分布的特征如`'capital-gain'`和`'capital-loss'`,常见的做法是对数据施加一个对数转换,将数据转换成对数,这样非常大和非常小的值不会对学习算法产生负面的影响。并且使用对数变换显著降低了由于异常值所造成的数据范围异常。但是在应用这个变换时必须小心:因为0的对数是没有定义的,所以我们必须先将数据处理成一个比0稍微大一点的数以成功完成对数转换。\n", 156 | "\n", 157 | "运行下面的代码单元来执行数据的转换和可视化结果。再次,注意值的范围和它们是如何分布的。" 158 | ] 159 | }, 160 | { 161 | "cell_type": "code", 162 | "execution_count": null, 163 | "metadata": {}, 164 | "outputs": [], 165 | "source": [ 166 | "# 对于倾斜的数据使用Log转换\n", 167 | "skewed = ['capital-gain', 'capital-loss']\n", 168 | "features_raw[skewed] = data[skewed].apply(lambda x: np.log(x + 1))\n", 169 | "\n", 170 | "# 可视化对数转换后 'capital-gain'和'capital-loss' 两个特征\n", 171 | "vs.distribution(features_raw, transformed = True)" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "### 规一化数字特征\n", 179 | "除了对于高度倾斜的特征施加转换,对数值特征施加一些形式的缩放通常会是一个好的习惯。在数据上面施加一个缩放并不会改变数据分布的形式(比如上面说的'capital-gain' or 'capital-loss');但是,规一化保证了每一个特征在使用监督学习器的时候能够被平等的对待。注意一旦使用了缩放,观察数据的原始形式不再具有它本来的意义了,就像下面的例子展示的。\n", 180 | "\n", 
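作为补充,下面用一个极小的数值例子说明 min-max 缩放的计算方式(仅作演示,与项目数据无关):每个值被线性映射为 x' = (x - x_min) / (x_max - x_min),因此全部落在 [0, 1] 区间内,而样本之间的相对大小和分布形状保持不变。

```python
import numpy as np

# 极小的演示:min-max 缩放把数值线性映射到 [0, 1],不改变相对顺序
x = np.array([2.0, 4.0, 10.0])
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)   # 输出: [0.   0.25 1.  ]
```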
181 | "运行下面的代码单元来规一化每一个数字特征。我们将使用[`sklearn.preprocessing.MinMaxScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)来完成这个任务。" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": null, 187 | "metadata": {}, 188 | "outputs": [], 189 | "source": [ 190 | "from sklearn.preprocessing import MinMaxScaler\n", 191 | "\n", 192 | "# 初始化一个 scaler,并将它施加到特征上\n", 193 | "scaler = MinMaxScaler()\n", 194 | "numerical = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']\n", 195 | "features_raw[numerical] = scaler.fit_transform(data[numerical])\n", 196 | "\n", 197 | "# 显示一个经过缩放的样例记录\n", 198 | "display(features_raw.head(n = 1))" 199 | ] 200 | }, 201 | { 202 | "cell_type": "markdown", 203 | "metadata": {}, 204 | "source": [ 205 | "### 练习:数据预处理\n", 206 | "\n", 207 | "从上面的**数据探索**中的表中,我们可以看到有几个属性的每一条记录都是非数字的。通常情况下,学习算法期望输入是数字的,这要求非数字的特征(称为类别变量)被转换。转换类别变量的一种流行的方法是使用**独热编码**方案。独热编码为每一个非数字特征的每一个可能的类别创建一个_“虚拟”_变量。例如,假设`someFeature`有三个可能的取值`A`,`B`或者`C`,。我们将把这个特征编码成`someFeature_A`, `someFeature_B`和`someFeature_C`.\n", 208 | "\n", 209 | "| 特征X | | 特征X_A | 特征X_B | 特征X_C |\n", 210 | "| :-: | | :-: | :-: | :-: |\n", 211 | "| B | | 0 | 1 | 0 |\n", 212 | "| C | ----> 独热编码 ----> | 0 | 0 | 1 |\n", 213 | "| A | | 1 | 0 | 0 |\n", 214 | "\n", 215 | "此外,对于非数字的特征,我们需要将非数字的标签`'income'`转换成数值以保证学习算法能够正常工作。因为这个标签只有两种可能的类别(\"<=50K\"和\">50K\"),我们不必要使用独热编码,可以直接将他们编码分别成两个类`0`和`1`,在下面的代码单元中你将实现以下功能:\n", 216 | " - 使用[`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html?highlight=get_dummies#pandas.get_dummies)对`'features_raw'`数据来施加一个独热编码。\n", 217 | " - 将目标标签`'income_raw'`转换成数字项。\n", 218 | " - 将\"<=50K\"转换成`0`;将\">50K\"转换成`1`。" 219 | ] 220 | }, 221 | { 222 | "cell_type": "code", 223 | "execution_count": null, 224 | "metadata": { 225 | "scrolled": true 226 | }, 227 | "outputs": [], 228 | "source": [ 229 | "# TODO:使用pandas.get_dummies()对'features_raw'数据进行独热编码\n", 230 | "features = None\n", 231 | "\n", 232 | "# TODO:将'income_raw'编码成数字值\n", 233 | "income = None\n", 234 | "\n", 235 | "# 打印经过独热编码之后的特征数量\n", 236 | "encoded = list(features.columns)\n", 237 | "print (\"{} total features after one-hot encoding.\".format(len(encoded)))\n", 238 | "\n", 239 | "# 移除下面一行的注释以观察编码的特征名字\n", 240 | "#print encoded" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "### 混洗和切分数据\n", 248 | "现在所有的 _类别变量_ 已被转换成数值特征,而且所有的数值特征已被规一化。和我们一般情况下做的一样,我们现在将数据(包括特征和它们的标签)切分成训练和测试集。其中80%的数据将用于训练和20%的数据用于测试。然后再进一步把训练数据分为训练集和验证集,用来选择和优化模型。\n", 249 | "\n", 250 | "运行下面的代码单元来完成切分。" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": null, 256 | "metadata": {}, 257 | "outputs": [], 258 | "source": [ 259 | "# 导入 train_test_split\n", 260 | "from sklearn.model_selection import train_test_split\n", 261 | "\n", 262 | "# 将'features'和'income'数据切分成训练集和测试集\n", 263 | "X_train, X_test, y_train, y_test = train_test_split(features, income, test_size = 0.2, random_state = 0,\n", 264 | " stratify = income)\n", 265 | "# 将'X_train'和'y_train'进一步切分为训练集和验证集\n", 266 | "X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0,\n", 267 | " stratify = y_train)\n", 268 | "\n", 269 | "# 显示切分的结果\n", 270 | "print (\"Training set has {} samples.\".format(X_train.shape[0]))\n", 271 | "print (\"Validation set has {} samples.\".format(X_val.shape[0]))\n", 272 | "print (\"Testing set has {} samples.\".format(X_test.shape[0]))" 273 | ] 274 | }, 275 | { 276 | 
"cell_type": "markdown", 277 | "metadata": {}, 278 | "source": [ 279 | "----\n", 280 | "## 评价模型性能\n", 281 | "在这一部分中,我们将尝试四种不同的算法,并确定哪一个能够最好地建模数据。四种算法包含一个*天真的预测器* 和三个你选择的监督学习器。" 282 | ] 283 | }, 284 | { 285 | "cell_type": "markdown", 286 | "metadata": {}, 287 | "source": [ 288 | "### 评价方法和朴素的预测器\n", 289 | "*CharityML*通过他们的研究人员知道被调查者的年收入大于\\$50,000最有可能向他们捐款。因为这个原因*CharityML*对于准确预测谁能够获得\\$50,000以上收入尤其有兴趣。这样看起来使用**准确率**作为评价模型的标准是合适的。另外,把*没有*收入大于\\$50,000的人识别成年收入大于\\$50,000对于*CharityML*来说是有害的,因为他想要找到的是有意愿捐款的用户。这样,我们期望的模型具有准确预测那些能够年收入大于\\$50,000的能力比模型去**查全**这些被调查者*更重要*。我们能够使用**F-beta score**作为评价指标,这样能够同时考虑查准率和查全率:\n", 290 | "\n", 291 | "$$ F_{\\beta} = (1 + \\beta^2) \\cdot \\frac{precision \\cdot recall}{\\left( \\beta^2 \\cdot precision \\right) + recall} $$\n", 292 | "\n", 293 | "\n", 294 | "尤其是,当 $\\beta = 0.5$ 的时候更多的强调查准率,这叫做**F$_{0.5}$ score** (或者为了简单叫做F-score)。" 295 | ] 296 | }, 297 | { 298 | "cell_type": "markdown", 299 | "metadata": {}, 300 | "source": [ 301 | "### 问题 1 - 天真的预测器的性能\n", 302 | "\n", 303 | "通过查看收入超过和不超过 \\$50,000 的人数,我们能发现多数被调查者年收入没有超过 \\$50,000。如果我们简单地预测说*“这个人的收入没有超过 \\$50,000”*,我们就可以得到一个 准确率超过 50% 的预测。这样我们甚至不用看数据就能做到一个准确率超过 50%。这样一个预测被称作是天真的。通常对数据使用一个*天真的预测器*是十分重要的,这样能够帮助建立一个模型表现是否好的基准。 使用下面的代码单元计算天真的预测器的相关性能。将你的计算结果赋值给`'accuracy'`, `‘precision’`, `‘recall’` 和 `'fscore'`,这些值会在后面被使用,请注意这里不能使用scikit-learn,你需要根据公式自己实现相关计算。\n", 304 | "\n", 305 | "*如果我们选择一个无论什么情况都预测被调查者年收入大于 \\$50,000 的模型,那么这个模型在**验证集上**的准确率,查准率,查全率和 F-score是多少?* \n" 306 | ] 307 | }, 308 | { 309 | "cell_type": "code", 310 | "execution_count": null, 311 | "metadata": {}, 312 | "outputs": [], 313 | "source": [ 314 | "#不能使用scikit-learn,你需要根据公式自己实现相关计算。\n", 315 | "\n", 316 | "#TODO: 计算准确率\n", 317 | "accuracy = None\n", 318 | "\n", 319 | "# TODO: 计算查准率 Precision\n", 320 | "precision = None\n", 321 | "\n", 322 | "# TODO: 计算查全率 Recall\n", 323 | "recall = None\n", 324 | "\n", 325 | "# TODO: 使用上面的公式,设置beta=0.5,计算F-score\n", 326 | "fscore = None\n", 327 | "\n", 328 | "# 打印结果\n", 329 | "print (\"Naive Predictor on validation data: \\n \\\n", 330 | " Accuracy score: {:.4f} \\n \\\n", 331 | " Precision: {:.4f} \\n \\\n", 332 | " Recall: {:.4f} \\n \\\n", 333 | " F-score: {:.4f}\".format(accuracy, precision, recall, fscore))" 334 | ] 335 | }, 336 | { 337 | "cell_type": "markdown", 338 | "metadata": {}, 339 | "source": [ 340 | "## 监督学习模型\n", 341 | "### 问题 2 - 模型应用\n", 342 | "\n", 343 | "你能够在 [`scikit-learn`](http://scikit-learn.org/stable/supervised_learning.html) 中选择以下监督学习模型\n", 344 | "- 高斯朴素贝叶斯 (GaussianNB)\n", 345 | "- 决策树 (DecisionTree)\n", 346 | "- 集成方法 (Bagging, AdaBoost, Random Forest, Gradient Boosting)\n", 347 | "- K近邻 (K Nearest Neighbors)\n", 348 | "- 随机梯度下降分类器 (SGDC)\n", 349 | "- 支撑向量机 (SVM)\n", 350 | "- Logistic回归(LogisticRegression)\n", 351 | "\n", 352 | "从上面的监督学习模型中选择三个适合我们这个问题的模型,并回答相应问题。" 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": {}, 358 | "source": [ 359 | "### 模型1\n", 360 | "\n", 361 | "**模型名称**\n", 362 | "\n", 363 | "回答:\n", 364 | "\n", 365 | "\n", 366 | "**描述一个该模型在真实世界的一个应用场景。(你需要为此做点研究,并给出你的引用出处)**\n", 367 | "\n", 368 | "回答:\n", 369 | "\n", 370 | "**这个模型的优势是什么?他什么情况下表现最好?**\n", 371 | "\n", 372 | "回答:\n", 373 | "\n", 374 | "**这个模型的缺点是什么?什么条件下它表现很差?**\n", 375 | "\n", 376 | "回答:\n", 377 | "\n", 378 | "**根据我们当前数据集的特点,为什么这个模型适合这个问题。**\n", 379 | "\n", 380 | "回答:" 381 | ] 382 | }, 383 | { 384 | "cell_type": "markdown", 385 | "metadata": {}, 386 | "source": [ 387 | "### 模型2\n", 388 | "\n", 389 | "**模型名称**\n", 390 | "\n", 391 | "回答:\n", 392 | "\n", 393 | "\n", 394 | 
"**描述一个该模型在真实世界的一个应用场景。(你需要为此做点研究,并给出你的引用出处)**\n", 395 | "\n", 396 | "回答:\n", 397 | "\n", 398 | "**这个模型的优势是什么?他什么情况下表现最好?**\n", 399 | "\n", 400 | "回答:\n", 401 | "\n", 402 | "**这个模型的缺点是什么?什么条件下它表现很差?**\n", 403 | "\n", 404 | "回答:\n", 405 | "\n", 406 | "**根据我们当前数据集的特点,为什么这个模型适合这个问题。**\n", 407 | "\n", 408 | "回答:" 409 | ] 410 | }, 411 | { 412 | "cell_type": "markdown", 413 | "metadata": {}, 414 | "source": [ 415 | "### 模型3\n", 416 | "\n", 417 | "**模型名称**\n", 418 | "\n", 419 | "回答:\n", 420 | "\n", 421 | "\n", 422 | "**描述一个该模型在真实世界的一个应用场景。(你需要为此做点研究,并给出你的引用出处)**\n", 423 | "\n", 424 | "回答:\n", 425 | "\n", 426 | "**这个模型的优势是什么?他什么情况下表现最好?**\n", 427 | "\n", 428 | "回答:\n", 429 | "\n", 430 | "**这个模型的缺点是什么?什么条件下它表现很差?**\n", 431 | "\n", 432 | "回答:\n", 433 | "\n", 434 | "**根据我们当前数据集的特点,为什么这个模型适合这个问题。**\n", 435 | "\n", 436 | "回答:" 437 | ] 438 | }, 439 | { 440 | "cell_type": "markdown", 441 | "metadata": {}, 442 | "source": [ 443 | "### 练习 - 创建一个训练和预测的流水线\n", 444 | "为了正确评估你选择的每一个模型的性能,创建一个能够帮助你快速有效地使用不同大小的训练集并在验证集上做预测的训练和验证的流水线是十分重要的。\n", 445 | "你在这里实现的功能将会在接下来的部分中被用到。在下面的代码单元中,你将实现以下功能:\n", 446 | "\n", 447 | " - 从[`sklearn.metrics`](http://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics)中导入`fbeta_score`和`accuracy_score`。\n", 448 | " - 用训练集拟合学习器,并记录训练时间。\n", 449 | " - 对训练集的前300个数据点和验证集进行预测并记录预测时间。\n", 450 | " - 计算预测训练集的前300个数据点的准确率和F-score。\n", 451 | " - 计算预测验证集的准确率和F-score。" 452 | ] 453 | }, 454 | { 455 | "cell_type": "code", 456 | "execution_count": null, 457 | "metadata": {}, 458 | "outputs": [], 459 | "source": [ 460 | "# TODO:从sklearn中导入两个评价指标 - fbeta_score和accuracy_score\n", 461 | "from sklearn.metrics import fbeta_score, accuracy_score\n", 462 | "\n", 463 | "def train_predict(learner, sample_size, X_train, y_train, X_val, y_val): \n", 464 | " '''\n", 465 | " inputs:\n", 466 | " - learner: the learning algorithm to be trained and predicted on\n", 467 | " - sample_size: the size of samples (number) to be drawn from training set\n", 468 | " - X_train: features training set\n", 469 | " - y_train: income training set\n", 470 | " - X_val: features validation set\n", 471 | " - y_val: income validation set\n", 472 | " '''\n", 473 | " \n", 474 | " results = {}\n", 475 | " \n", 476 | " # TODO:使用sample_size大小的训练数据来拟合学习器\n", 477 | " # TODO: Fit the learner to the training data using slicing with 'sample_size'\n", 478 | " start = time() # 获得程序开始时间\n", 479 | " learner = None\n", 480 | " end = time() # 获得程序结束时间\n", 481 | " \n", 482 | " # TODO:计算训练时间\n", 483 | " results['train_time'] = None\n", 484 | " \n", 485 | " # TODO: 得到在验证集上的预测值\n", 486 | " # 然后得到对前300个训练数据的预测结果\n", 487 | " start = time() # 获得程序开始时间\n", 488 | " predictions_val = None\n", 489 | " predictions_train = None\n", 490 | " end = time() # 获得程序结束时间\n", 491 | " \n", 492 | " # TODO:计算预测用时\n", 493 | " results['pred_time'] = None\n", 494 | " \n", 495 | " # TODO:计算在最前面的300个训练数据的准确率\n", 496 | " results['acc_train'] = None\n", 497 | " \n", 498 | " # TODO:计算在验证上的准确率\n", 499 | " results['acc_val'] = None\n", 500 | " \n", 501 | " # TODO:计算在最前面300个训练数据上的F-score\n", 502 | " results['f_train'] = None\n", 503 | " \n", 504 | " # TODO:计算验证集上的F-score\n", 505 | " results['f_val'] = None\n", 506 | " \n", 507 | " # 成功\n", 508 | " print (\"{} trained on {} samples.\".format(learner.__class__.__name__, sample_size))\n", 509 | " \n", 510 | " # 返回结果\n", 511 | " return results" 512 | ] 513 | }, 514 | { 515 | "cell_type": "markdown", 516 | "metadata": {}, 517 | "source": [ 518 | "### 练习:初始模型的评估\n", 519 | "在下面的代码单元中,您将需要实现以下功能: \n", 520 | "- 
导入你在前面讨论的三个监督学习模型。 \n", 521 | "- 初始化三个模型并存储在`'clf_A'`,`'clf_B'`和`'clf_C'`中。\n", 522 | " - 使用模型的默认参数值,在接下来的部分中你将需要对某一个模型的参数进行调整。 \n", 523 | " - 设置`random_state` (如果有这个参数)。 \n", 524 | "- 计算1%, 10%, 100%的训练数据分别对应多少个数据点,并将这些值存储在`'samples_1'`, `'samples_10'`, `'samples_100'`中\n", 525 | "\n", 526 | "**注意:**取决于你选择的算法,下面实现的代码可能需要一些时间来运行!" 527 | ] 528 | }, 529 | { 530 | "cell_type": "code", 531 | "execution_count": null, 532 | "metadata": {}, 533 | "outputs": [], 534 | "source": [ 535 | "# TODO:从sklearn中导入三个监督学习模型\n", 536 | "\n", 537 | "# TODO:初始化三个模型\n", 538 | "clf_A = None\n", 539 | "clf_B = None\n", 540 | "clf_C = None\n", 541 | "\n", 542 | "# TODO:计算1%, 10%, 100%的训练数据分别对应多少点\n", 543 | "samples_1 = None\n", 544 | "samples_10 = None\n", 545 | "samples_100 = None\n", 546 | "\n", 547 | "# 收集学习器的结果\n", 548 | "results = {}\n", 549 | "for clf in [clf_A, clf_B, clf_C]:\n", 550 | " clf_name = clf.__class__.__name__\n", 551 | " results[clf_name] = {}\n", 552 | " for i, samples in enumerate([samples_1, samples_10, samples_100]):\n", 553 | " results[clf_name][i] = train_predict(clf, samples, X_train, y_train, X_val, y_val)\n", 554 | "\n", 555 | "# 对选择的三个模型得到的评价结果进行可视化\n", 556 | "vs.evaluate(results, accuracy, fscore)" 557 | ] 558 | }, 559 | { 560 | "cell_type": "markdown", 561 | "metadata": {}, 562 | "source": [ 563 | "----\n", 564 | "## 提高效果\n", 565 | "\n", 566 | "在这最后一节中,您将从三个有监督的学习模型中选择 *最好的* 模型来使用学生数据。你将在整个训练集(`X_train`和`y_train`)上使用网格搜索优化至少调节一个参数以获得一个比没有调节之前更好的 F-score。" 567 | ] 568 | }, 569 | { 570 | "cell_type": "markdown", 571 | "metadata": {}, 572 | "source": [ 573 | "### 问题 3 - 选择最佳的模型\n", 574 | "\n", 575 | "*基于你前面做的评价,用一到两段话向 *CharityML* 解释这三个模型中哪一个对于判断被调查者的年收入大于 \\$50,000 是最合适的。* \n", 576 | "**提示:**你的答案应该包括评价指标,预测/训练时间,以及该算法是否适合这里的数据。" 577 | ] 578 | }, 579 | { 580 | "cell_type": "markdown", 581 | "metadata": {}, 582 | "source": [ 583 | "**回答:**" 584 | ] 585 | }, 586 | { 587 | "cell_type": "markdown", 588 | "metadata": {}, 589 | "source": [ 590 | "### 问题 4 - 用通俗的话解释模型\n", 591 | "\n", 592 | "*用一到两段话,向 *CharityML* 用外行也听得懂的话来解释最终模型是如何工作的。你需要解释所选模型的主要特点。例如,这个模型是怎样被训练的,它又是如何做出预测的。避免使用高级的数学或技术术语,不要使用公式或特定的算法名词。*" 593 | ] 594 | }, 595 | { 596 | "cell_type": "markdown", 597 | "metadata": {}, 598 | "source": [ 599 | "**回答: ** " 600 | ] 601 | }, 602 | { 603 | "cell_type": "markdown", 604 | "metadata": {}, 605 | "source": [ 606 | "### 练习:模型调优\n", 607 | "调节选择的模型的参数。使用网格搜索(GridSearchCV)来至少调整模型的重要参数(至少调整一个),这个参数至少需尝试3个不同的值。你要使用整个训练集来完成这个过程。在接下来的代码单元中,你需要实现以下功能:\n", 608 | "\n", 609 | "- 导入[`sklearn.model_selection.GridSearchCV`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) 和 [`sklearn.metrics.make_scorer`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html).\n", 610 | "- 初始化你选择的分类器,并将其存储在`clf`中。\n", 611 | " - 设置`random_state` (如果有这个参数)。\n", 612 | "- 创建一个对于这个模型你希望调整参数的字典。\n", 613 | " - 例如: parameters = {'parameter' : [list of values]}。\n", 614 | " - **注意:** 如果你的学习器有 `max_features` 参数,请不要调节它!\n", 615 | "- 使用`make_scorer`来创建一个`fbeta_score`评分对象(设置$\\beta = 0.5$)。\n", 616 | "- 在分类器clf上用'scorer'作为评价函数运行网格搜索,并将结果存储在grid_obj中。\n", 617 | "- 用训练集(X_train, y_train)训练grid search object,并将结果存储在`grid_fit`中。\n", 618 | "\n", 619 | "**注意:** 取决于你选择的参数列表,下面实现的代码可能需要花一些时间运行!" 
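在动手填写下一个代码单元之前,这里先给出一个可能的参考骨架(仅作示意:以决策树和 `max_depth` 为例,你应当换成自己在前面选定的模型和想调节的参数)。

```python
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score
from sklearn.tree import DecisionTreeClassifier

# 初始化分类器(示例:决策树),并设置 random_state
clf = DecisionTreeClassifier(random_state=0)

# 希望调节的参数:至少一个参数、至少 3 个不同取值(注意不要调节 max_features)
parameters = {'max_depth': [4, 6, 8, 10]}

# 用 make_scorer 创建 fbeta_score 评分对象,beta = 0.5
scorer = make_scorer(fbeta_score, beta=0.5)

# 在分类器上运行网格搜索,使用 scorer 作为评价函数,并用训练集拟合
grid_obj = GridSearchCV(clf, parameters, scoring=scorer)
grid_fit = grid_obj.fit(X_train, y_train)
best_clf = grid_fit.best_estimator_
```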
620 | ] 621 | }, 622 | { 623 | "cell_type": "code", 624 | "execution_count": null, 625 | "metadata": {}, 626 | "outputs": [], 627 | "source": [ 628 | "# TODO:导入'GridSearchCV', 'make_scorer'和其他一些需要的库\n", 629 | "\n", 630 | "# TODO:初始化分类器\n", 631 | "clf = None\n", 632 | "\n", 633 | "# TODO:创建你希望调节的参数列表\n", 634 | "parameters = None\n", 635 | "\n", 636 | "# TODO:创建一个fbeta_score打分对象\n", 637 | "scorer = None\n", 638 | "\n", 639 | "# TODO:在分类器上使用网格搜索,使用'scorer'作为评价函数\n", 640 | "grid_obj = None\n", 641 | "\n", 642 | "# TODO:用训练数据拟合网格搜索对象并找到最佳参数\n", 643 | "\n", 644 | "# 得到estimator\n", 645 | "best_clf = grid_obj.best_estimator_\n", 646 | "\n", 647 | "# 使用没有调优的模型做预测\n", 648 | "predictions = (clf.fit(X_train, y_train)).predict(X_val)\n", 649 | "best_predictions = best_clf.predict(X_val)\n", 650 | "\n", 651 | "# 汇报调优后的模型\n", 652 | "print (\"best_clf\\n------\")\n", 653 | "print (best_clf)\n", 654 | "\n", 655 | "# 汇报调参前和调参后的分数\n", 656 | "print (\"\\nUnoptimized model\\n------\")\n", 657 | "print (\"Accuracy score on validation data: {:.4f}\".format(accuracy_score(y_val, predictions)))\n", 658 | "print (\"F-score on validation data: {:.4f}\".format(fbeta_score(y_val, predictions, beta = 0.5)))\n", 659 | "print (\"\\nOptimized Model\\n------\")\n", 660 | "print (\"Final accuracy score on the validation data: {:.4f}\".format(accuracy_score(y_val, best_predictions)))\n", 661 | "print (\"Final F-score on the validation data: {:.4f}\".format(fbeta_score(y_val, best_predictions, beta = 0.5)))" 662 | ] 663 | }, 664 | { 665 | "cell_type": "markdown", 666 | "metadata": {}, 667 | "source": [ 668 | "### 问题 5 - 最终模型评估\n", 669 | "\n", 670 | "_你的最优模型在测试数据上的准确率和 F-score 是多少?这些分数比没有优化的模型好还是差?_\n", 671 | "**注意:**请在下面的表格中填写你的结果,然后在答案框中提供讨论。" 672 | ] 673 | }, 674 | { 675 | "cell_type": "markdown", 676 | "metadata": {}, 677 | "source": [ 678 | "#### 结果:\n", 679 | " \n", 680 | "| 评价指标 | 未优化的模型 | 优化的模型 |\n", 681 | "| :------------: | :---------------: | :-------------: | \n", 682 | "| 准确率 | | |\n", 683 | "| F-score | | |" 684 | ] 685 | }, 686 | { 687 | "cell_type": "markdown", 688 | "metadata": {}, 689 | "source": [ 690 | "**回答:**" 691 | ] 692 | }, 693 | { 694 | "cell_type": "markdown", 695 | "metadata": {}, 696 | "source": [ 697 | "----\n", 698 | "## 特征的重要性\n", 699 | "\n", 700 | "在数据上(比如我们这里使用的人口普查的数据)使用监督学习算法的一个重要的任务是决定哪些特征能够提供最强的预测能力。专注于少量的有效特征和标签之间的关系,我们能够更加简单地理解这些现象,这在很多情况下都是十分有用的。在这个项目的情境下这表示我们希望选择一小部分特征,这些特征能够在预测被调查者是否年收入大于\\$50,000这个问题上有很强的预测能力。\n", 701 | "\n", 702 | "选择一个有 `'feature_importance_'` 属性的scikit学习分类器(例如 AdaBoost,随机森林)。`'feature_importance_'` 属性是对特征的重要性排序的函数。在下一个代码单元中用这个分类器拟合训练集数据并使用这个属性来决定人口普查数据中最重要的5个特征。" 703 | ] 704 | }, 705 | { 706 | "cell_type": "markdown", 707 | "metadata": {}, 708 | "source": [ 709 | "### 问题 6 - 观察特征相关性\n", 710 | "\n", 711 | "当**探索数据**的时候,它显示在这个人口普查数据集中每一条记录我们有十三个可用的特征。 \n", 712 | "_在这十三个记录中,你认为哪五个特征对于预测是最重要的,选择每个特征的理由是什么?你会怎样对他们排序?_" 713 | ] 714 | }, 715 | { 716 | "cell_type": "markdown", 717 | "metadata": {}, 718 | "source": [ 719 | "**回答:**\n", 720 | "- 特征1:\n", 721 | "- 特征2:\n", 722 | "- 特征3:\n", 723 | "- 特征4:\n", 724 | "- 特征5:" 725 | ] 726 | }, 727 | { 728 | "cell_type": "markdown", 729 | "metadata": {}, 730 | "source": [ 731 | "### 练习 - 提取特征重要性\n", 732 | "\n", 733 | "选择一个`scikit-learn`中有`feature_importance_`属性的监督学习分类器,这个属性是一个在做预测的时候根据所选择的算法来对特征重要性进行排序的功能。\n", 734 | "\n", 735 | "在下面的代码单元中,你将要实现以下功能:\n", 736 | " - 如果这个模型和你前面使用的三个模型不一样的话从sklearn中导入一个监督学习模型。\n", 737 | " - 在整个训练集上训练一个监督学习模型。\n", 738 | " - 使用模型中的 `'feature_importances_'`提取特征的重要性。" 739 | ] 740 | }, 741 | { 742 | 
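对于上面“练习 - 提取特征重要性”,一种可能的参考写法如下(仅作示意,以 AdaBoost 为例;注意属性的正确名称是 `feature_importances_`)。

```python
from sklearn.ensemble import AdaBoostClassifier

# 在整个训练集上训练一个带 feature_importances_ 属性的监督学习模型(示例:AdaBoost)
model = AdaBoostClassifier(random_state=0).fit(X_train, y_train)

# 提取特征重要性,随后交给 vs.feature_plot 可视化
importances = model.feature_importances_
```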
"cell_type": "code", 743 | "execution_count": null, 744 | "metadata": {}, 745 | "outputs": [], 746 | "source": [ 747 | "# TODO:导入一个有'feature_importances_'的监督学习模型\n", 748 | "\n", 749 | "# TODO:在训练集上训练一个监督学习模型\n", 750 | "model = None\n", 751 | "\n", 752 | "# TODO: 提取特征重要性\n", 753 | "importances = None\n", 754 | "\n", 755 | "# 绘图\n", 756 | "vs.feature_plot(importances, X_train, y_train)" 757 | ] 758 | }, 759 | { 760 | "cell_type": "markdown", 761 | "metadata": {}, 762 | "source": [ 763 | "### 问题 7 - 提取特征重要性\n", 764 | "观察上面创建的展示五个用于预测被调查者年收入是否大于\\$50,000最相关的特征的可视化图像。\n", 765 | "\n", 766 | "_这五个特征的权重加起来是否超过了0.5?_
\n", 767 | "_这五个特征和你在**问题 6**中讨论的特征比较怎么样?_
\n", 768 | "_如果说你的答案和这里的相近,那么这个可视化怎样佐证了你的想法?_
\n", 769 | "_如果你的选择不相近,那么为什么你觉得这些特征更加相关?_" 770 | ] 771 | }, 772 | { 773 | "cell_type": "markdown", 774 | "metadata": {}, 775 | "source": [ 776 | "**回答:**" 777 | ] 778 | }, 779 | { 780 | "cell_type": "markdown", 781 | "metadata": {}, 782 | "source": [ 783 | "### 特征选择\n", 784 | "\n", 785 | "如果我们只是用可用特征的一个子集的话模型表现会怎么样?通过使用更少的特征来训练,在评价指标的角度来看我们的期望是训练和预测的时间会更少。从上面的可视化来看,我们可以看到前五个最重要的特征贡献了数据中**所有**特征中超过一半的重要性。这提示我们可以尝试去**减小特征空间**,简化模型需要学习的信息。下面代码单元将使用你前面发现的优化模型,并**只使用五个最重要的特征**在相同的训练集上训练模型。" 786 | ] 787 | }, 788 | { 789 | "cell_type": "code", 790 | "execution_count": null, 791 | "metadata": {}, 792 | "outputs": [], 793 | "source": [ 794 | "# 导入克隆模型的功能\n", 795 | "from sklearn.base import clone\n", 796 | "\n", 797 | "# 减小特征空间\n", 798 | "X_train_reduced = X_train[X_train.columns.values[(np.argsort(importances)[::-1])[:5]]]\n", 799 | "X_val_reduced = X_val[X_val.columns.values[(np.argsort(importances)[::-1])[:5]]]\n", 800 | "\n", 801 | "# 在前面的网格搜索的基础上训练一个“最好的”模型\n", 802 | "clf_on_reduced = (clone(best_clf)).fit(X_train_reduced, y_train)\n", 803 | "\n", 804 | "# 做一个新的预测\n", 805 | "reduced_predictions = clf_on_reduced.predict(X_val_reduced)\n", 806 | "\n", 807 | "# 对于每一个版本的数据汇报最终模型的分数\n", 808 | "print (\"Final Model trained on full data\\n------\")\n", 809 | "print (\"Accuracy on validation data: {:.4f}\".format(accuracy_score(y_val, best_predictions)))\n", 810 | "print (\"F-score on validation data: {:.4f}\".format(fbeta_score(y_val, best_predictions, beta = 0.5)))\n", 811 | "print (\"\\nFinal Model trained on reduced data\\n------\")\n", 812 | "print (\"Accuracy on validation data: {:.4f}\".format(accuracy_score(y_val, reduced_predictions)))\n", 813 | "print (\"F-score on validation data: {:.4f}\".format(fbeta_score(y_val, reduced_predictions, beta = 0.5)))" 814 | ] 815 | }, 816 | { 817 | "cell_type": "markdown", 818 | "metadata": {}, 819 | "source": [ 820 | "### 问题 8 - 特征选择的影响\n", 821 | "\n", 822 | "*最终模型在只是用五个特征的数据上和使用所有的特征数据上的 F-score 和准确率相比怎么样?* \n", 823 | "*如果训练时间是一个要考虑的因素,你会考虑使用部分特征的数据作为你的训练集吗?*" 824 | ] 825 | }, 826 | { 827 | "cell_type": "markdown", 828 | "metadata": {}, 829 | "source": [ 830 | "**回答:**" 831 | ] 832 | }, 833 | { 834 | "cell_type": "markdown", 835 | "metadata": {}, 836 | "source": [ 837 | "### 问题 9 - 在测试集上测试你的模型\n", 838 | "\n", 839 | "终于到了测试的时候,记住,测试集只能用一次。\n", 840 | "\n", 841 | "*使用你最有信心的模型,在测试集上测试,计算出准确率和 F-score。*\n", 842 | "*简述你选择这个模型的原因,并分析测试结果*" 843 | ] 844 | }, 845 | { 846 | "cell_type": "code", 847 | "execution_count": null, 848 | "metadata": {}, 849 | "outputs": [], 850 | "source": [ 851 | "#TODO test your model on testing data and report accuracy and F score" 852 | ] 853 | }, 854 | { 855 | "cell_type": "markdown", 856 | "metadata": {}, 857 | "source": [ 858 | "> **注意:** 当你写完了所有的代码,并且回答了所有的问题。你就可以把你的 iPython Notebook 导出成 HTML 文件。你可以在菜单栏,这样导出**File -> Download as -> HTML (.html)**把这个 HTML 和这个 iPython notebook 一起做为你的作业提交。" 859 | ] 860 | } 861 | ], 862 | "metadata": { 863 | "anaconda-cloud": {}, 864 | "kernelspec": { 865 | "display_name": "Python 3", 866 | "language": "python", 867 | "name": "python3" 868 | }, 869 | "language_info": { 870 | "codemirror_mode": { 871 | "name": "ipython", 872 | "version": 3 873 | }, 874 | "file_extension": ".py", 875 | "mimetype": "text/x-python", 876 | "name": "python", 877 | "nbconvert_exporter": "python", 878 | "pygments_lexer": "ipython3", 879 | "version": "3.6.4" 880 | } 881 | }, 882 | "nbformat": 4, 883 | "nbformat_minor": 1 884 | } 885 | --------------------------------------------------------------------------------