├── 其他资料
├── 书籍
│ ├── 书单.md
│ ├── 01. 概率论与数理统计.陈希孺.ipynb
│ ├── data
│ │ └── 1.txt
│ ├── 02. 机器学习.周志华.ipynb
│ └── 03. 统计学习方法.李航.ipynb
├── 论文
│ └── README.md
└── 课程
│ ├── 02. Python3入门机器学习经典算法与应用.liuyubobobo.ipynb
│ └── 01. Machine Learning. Coursera. Andrew Ng.ipynb
├── .gitignore
├── images
├── or.png
├── and.png
├── bst.png
├── knn.png
├── not.png
├── svm-1.jpg
├── svm-2.jpg
├── svr.png
├── xor.png
├── cement.png
├── kd-bst.webp
├── loss-2.jpg
├── neutron.jpg
├── newton.jpeg
├── sigmoid.jpg
├── ai-ml-dl.jpg
├── bp-method.png
├── cement-2.png
├── cement-3.png
├── gradient-1.jpg
├── gradient-2.jpg
├── hinge-loss.png
├── kd-create.webp
├── knn-date.jpg
├── knn-movie.png
├── mp-neuron.png
├── neutron-1.jpg
├── pr-curve.webp
├── roc-auc.webp
├── svm-3-loss.jpg
├── l2-distance.gif
├── xor-neotron.png
├── cost-function.png
├── learning-rate.png
├── svm-unseparable.png
├── xor-nonlinear.png
├── bayesian-network.jpg
├── bayesian-network.png
├── gradient-sample-1.png
├── gradient-sample-2.png
├── kernel-functions.jpg
├── Chebyshev-Distance.png
├── Manhattan-Distance.png
├── NewtonIteration_Ani.gif
├── gradient-direction-2.jpg
├── gradient-direction.jpg
├── l1-parameter-space.webp
├── l2-parameter-space.webp
├── lp-parameter-space.webp
├── neutron-and-or-not.png
├── ridge-regression-1.webp
├── ridge-regression-2.webp
├── ridge-regression-3.webp
├── activation-function-1.png
├── activation-function-2.png
├── activation-function-3.png
├── gaussian-distribution.png
├── decision-tree-watermelon.png
├── l1-l2-parameterr-space.webp
├── l1-vs-l2-parameter-space.webp
├── bayesian-network-dependency.png
├── decision-tree-axis-parallel.png
└── decision-tree-multi-variate.png
├── 降维
└── TODO.md
├── 聚类
├── 4. DBSCAN 算法.ipynb
├── 5. 层次聚类算法.ipynb
├── 2. 聚类算法一览.ipynb
├── 3. k-means 算法.ipynb
└── 1. 什么是聚类.ipynb
├── 逻辑回归
├── x. 从逻辑回归到神经网络.ipynb
├── 08. 从二分类到多分类.ipynb
├── 09. 从线性分类到非线性分类.ipynb
├── x. 广义线性模型.ipynb
├── 06. 使用牛顿法解逻辑回归.ipynb
├── 07. 线性回归和逻辑回归总结.ipynb
├── 10. 分类模型的评估方法.ipynb
└── 03. 逻辑回归和一元分类问题的求解.ipynb
├── 线性回归
├── 19. LASSO回归和稀疏学习.ipynb
├── x. 线性回归的概率解释.ipynb
├── TODO.md
├── 优化算法
│ ├── 02. 拟牛顿法.ipynb
│ └── 01. 牛顿法.ipynb
├── x. 最小二乘的数学解释.ipynb
├── 18. 使用岭回归解决共线性问题.ipynb
├── 15. 线性回归实例.ipynb
├── 17. 求解岭回归和LASSO回归.ipynb
├── 11. 梯度下降的简单例子.ipynb
├── 10. 什么是梯度下降.ipynb
├── 14. 最优化问题的其他算法.ipynb
├── 09. 回归模型的评估和选择.ipynb
└── 16. 带约束条件的线性回归.ipynb
├── 决策树
├── 03. 决策树算法实战.ipynb
├── 06. 多变量决策树.ipynb
├── 01. 决策树算法概述.ipynb
└── 05. 连续值和缺失值.ipynb
├── 支持向量机
├── 08. 非线性支持向量机实战.ipynb
├── 05. SMO 算法.ipynb
├── 01. 什么是支持向量机.ipynb
├── 09. 支持向量回归.ipynb
├── 02. 支持向量机的对偶问题.ipynb
├── 04. 软间隔支持向量机的对偶问题.ipynb
├── 07. 非线性支持向量机和核函数.ipynb
└── 03. 硬间隔与软间隔.ipynb
├── 集成学习
├── 02. Boosting 算法.ipynb
├── 04. 随机森林.ipynb
├── 05. 个体学习器的多样性.ipynb
├── 03. Bagging 算法.ipynb
└── 01. 什么是集成学习?.ipynb
├── 贝叶斯
├── x. 隐变量和 EM 算法.ipynb
├── x. 概率图模型.ipynb
├── 04. 半朴素贝叶斯分类器.ipynb
├── 03. 贝叶斯分类实例.ipynb
├── 05. 贝叶斯网.ipynb
├── 01. 贝叶斯定理.ipynb
└── 02. 朴素贝叶斯分类器.ipynb
├── 数学基础
├── 01. LaTeX 笔记.ipynb
├── x. 拉格朗日对偶问题.ipynb
├── x. 满秩矩阵.ipynb
└── x. 正定矩阵.ipynb
├── 神经网络
├── TODO.md
├── 02. 神经元模型.ipynb
├── 05. 其他神经网络.ipynb
├── 03. 从感知机到神经网络.ipynb
└── 01. 感知机模型.ipynb
├── 概述
├── 01. 什么是机器学习.ipynb
├── 03. 机器学习算法.ipynb
└── 02. 机器学习分类.ipynb
├── kNN
├── 01. k-近邻算法概述.ipynb
└── 03. 
k-近邻的算法实现:kd 树.ipynb └── README.md /其他资料/书籍/书单.md: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | .ipynb_checkpoints 2 | -------------------------------------------------------------------------------- /images/or.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/or.png -------------------------------------------------------------------------------- /images/and.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/and.png -------------------------------------------------------------------------------- /images/bst.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/bst.png -------------------------------------------------------------------------------- /images/knn.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/knn.png -------------------------------------------------------------------------------- /images/not.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/not.png -------------------------------------------------------------------------------- /images/svm-1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/svm-1.jpg -------------------------------------------------------------------------------- /images/svm-2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/svm-2.jpg -------------------------------------------------------------------------------- /images/svr.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/svr.png -------------------------------------------------------------------------------- /images/xor.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/xor.png -------------------------------------------------------------------------------- /images/cement.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/cement.png -------------------------------------------------------------------------------- /images/kd-bst.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/kd-bst.webp -------------------------------------------------------------------------------- /images/loss-2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/loss-2.jpg -------------------------------------------------------------------------------- /images/neutron.jpg: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/neutron.jpg -------------------------------------------------------------------------------- /images/newton.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/newton.jpeg -------------------------------------------------------------------------------- /images/sigmoid.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/sigmoid.jpg -------------------------------------------------------------------------------- /images/ai-ml-dl.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/ai-ml-dl.jpg -------------------------------------------------------------------------------- /images/bp-method.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/bp-method.png -------------------------------------------------------------------------------- /images/cement-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/cement-2.png -------------------------------------------------------------------------------- /images/cement-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/cement-3.png -------------------------------------------------------------------------------- /images/gradient-1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/gradient-1.jpg -------------------------------------------------------------------------------- /images/gradient-2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/gradient-2.jpg -------------------------------------------------------------------------------- /images/hinge-loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/hinge-loss.png -------------------------------------------------------------------------------- /images/kd-create.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/kd-create.webp -------------------------------------------------------------------------------- /images/knn-date.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/knn-date.jpg -------------------------------------------------------------------------------- /images/knn-movie.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/knn-movie.png -------------------------------------------------------------------------------- /images/mp-neuron.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/mp-neuron.png -------------------------------------------------------------------------------- /images/neutron-1.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/neutron-1.jpg -------------------------------------------------------------------------------- /images/pr-curve.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/pr-curve.webp -------------------------------------------------------------------------------- /images/roc-auc.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/roc-auc.webp -------------------------------------------------------------------------------- /images/svm-3-loss.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/svm-3-loss.jpg -------------------------------------------------------------------------------- /images/l2-distance.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/l2-distance.gif -------------------------------------------------------------------------------- /images/xor-neotron.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/xor-neotron.png -------------------------------------------------------------------------------- /images/cost-function.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/cost-function.png -------------------------------------------------------------------------------- /images/learning-rate.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/learning-rate.png -------------------------------------------------------------------------------- /images/svm-unseparable.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/svm-unseparable.png -------------------------------------------------------------------------------- /images/xor-nonlinear.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/xor-nonlinear.png -------------------------------------------------------------------------------- /images/bayesian-network.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/bayesian-network.jpg -------------------------------------------------------------------------------- /images/bayesian-network.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/bayesian-network.png -------------------------------------------------------------------------------- /images/gradient-sample-1.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/gradient-sample-1.png -------------------------------------------------------------------------------- /images/gradient-sample-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/gradient-sample-2.png -------------------------------------------------------------------------------- /images/kernel-functions.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/kernel-functions.jpg -------------------------------------------------------------------------------- /images/Chebyshev-Distance.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/Chebyshev-Distance.png -------------------------------------------------------------------------------- /images/Manhattan-Distance.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/Manhattan-Distance.png -------------------------------------------------------------------------------- /images/NewtonIteration_Ani.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/NewtonIteration_Ani.gif -------------------------------------------------------------------------------- /images/gradient-direction-2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/gradient-direction-2.jpg -------------------------------------------------------------------------------- /images/gradient-direction.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/gradient-direction.jpg -------------------------------------------------------------------------------- /images/l1-parameter-space.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/l1-parameter-space.webp -------------------------------------------------------------------------------- /images/l2-parameter-space.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/l2-parameter-space.webp -------------------------------------------------------------------------------- /images/lp-parameter-space.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/lp-parameter-space.webp -------------------------------------------------------------------------------- /images/neutron-and-or-not.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/neutron-and-or-not.png -------------------------------------------------------------------------------- /images/ridge-regression-1.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/ridge-regression-1.webp 
-------------------------------------------------------------------------------- /images/ridge-regression-2.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/ridge-regression-2.webp -------------------------------------------------------------------------------- /images/ridge-regression-3.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/ridge-regression-3.webp -------------------------------------------------------------------------------- /images/activation-function-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/activation-function-1.png -------------------------------------------------------------------------------- /images/activation-function-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/activation-function-2.png -------------------------------------------------------------------------------- /images/activation-function-3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/activation-function-3.png -------------------------------------------------------------------------------- /images/gaussian-distribution.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/gaussian-distribution.png -------------------------------------------------------------------------------- /降维/TODO.md: -------------------------------------------------------------------------------- 1 | 常见的降维算法 2 | * 主成分分析(PCA,Principal Component Analysis) 3 | * 特征分解 4 | * 奇异值分解(SVD) 5 | * t-SNE 6 | * 自动编码器 -------------------------------------------------------------------------------- /images/decision-tree-watermelon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/decision-tree-watermelon.png -------------------------------------------------------------------------------- /images/l1-l2-parameterr-space.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/l1-l2-parameterr-space.webp -------------------------------------------------------------------------------- /images/l1-vs-l2-parameter-space.webp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/l1-vs-l2-parameter-space.webp -------------------------------------------------------------------------------- /images/bayesian-network-dependency.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/bayesian-network-dependency.png -------------------------------------------------------------------------------- /images/decision-tree-axis-parallel.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/decision-tree-axis-parallel.png 
-------------------------------------------------------------------------------- /images/decision-tree-multi-variate.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/aneasystone/ml-notes/HEAD/images/decision-tree-multi-variate.png -------------------------------------------------------------------------------- /其他资料/论文/README.md: -------------------------------------------------------------------------------- 1 | * [An overview of gradient descent optimization algorithms](https://arxiv.org/pdf/1609.04747.pdf) 2 | 3 | 梯度下降及其优化算法 4 | 5 | * [Simple and scalable response prediction for display advertising](http://people.csail.mit.edu/romer/papers/TISTRespPredAds.pdf) 6 | 7 | 使用逻辑回归进行广告投放预测 -------------------------------------------------------------------------------- /聚类/4. DBSCAN 算法.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [] 7 | } 8 | ], 9 | "metadata": { 10 | "kernelspec": { 11 | "display_name": "Python 3", 12 | "language": "python", 13 | "name": "python3" 14 | }, 15 | "language_info": { 16 | "codemirror_mode": { 17 | "name": "ipython", 18 | "version": 3 19 | }, 20 | "file_extension": ".py", 21 | "mimetype": "text/x-python", 22 | "name": "python", 23 | "nbconvert_exporter": "python", 24 | "pygments_lexer": "ipython3", 25 | "version": "3.7.1" 26 | } 27 | }, 28 | "nbformat": 4, 29 | "nbformat_minor": 2 30 | } 31 | -------------------------------------------------------------------------------- /逻辑回归/x. 从逻辑回归到神经网络.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [] 7 | } 8 | ], 9 | "metadata": { 10 | "kernelspec": { 11 | "display_name": "Python 3", 12 | "language": "python", 13 | "name": "python3" 14 | }, 15 | "language_info": { 16 | "codemirror_mode": { 17 | "name": "ipython", 18 | "version": 3 19 | }, 20 | "file_extension": ".py", 21 | "mimetype": "text/x-python", 22 | "name": "python", 23 | "nbconvert_exporter": "python", 24 | "pygments_lexer": "ipython3", 25 | "version": "3.7.1" 26 | } 27 | }, 28 | "nbformat": 4, 29 | "nbformat_minor": 2 30 | } 31 | -------------------------------------------------------------------------------- /线性回归/19. LASSO回归和稀疏学习.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [] 7 | } 8 | ], 9 | "metadata": { 10 | "kernelspec": { 11 | "display_name": "Python 3", 12 | "language": "python", 13 | "name": "python3" 14 | }, 15 | "language_info": { 16 | "codemirror_mode": { 17 | "name": "ipython", 18 | "version": 3 19 | }, 20 | "file_extension": ".py", 21 | "mimetype": "text/x-python", 22 | "name": "python", 23 | "nbconvert_exporter": "python", 24 | "pygments_lexer": "ipython3", 25 | "version": "3.7.1" 26 | } 27 | }, 28 | "nbformat": 4, 29 | "nbformat_minor": 2 30 | } 31 | -------------------------------------------------------------------------------- /决策树/03. 
决策树算法实战.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "TODO" 8 | ] 9 | } 10 | ], 11 | "metadata": { 12 | "kernelspec": { 13 | "display_name": "Python 3", 14 | "language": "python", 15 | "name": "python3" 16 | }, 17 | "language_info": { 18 | "codemirror_mode": { 19 | "name": "ipython", 20 | "version": 3 21 | }, 22 | "file_extension": ".py", 23 | "mimetype": "text/x-python", 24 | "name": "python", 25 | "nbconvert_exporter": "python", 26 | "pygments_lexer": "ipython3", 27 | "version": "3.7.1" 28 | } 29 | }, 30 | "nbformat": 4, 31 | "nbformat_minor": 2 32 | } 33 | -------------------------------------------------------------------------------- /支持向量机/08. 非线性支持向量机实战.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "TODO" 8 | ] 9 | } 10 | ], 11 | "metadata": { 12 | "kernelspec": { 13 | "display_name": "Python 3", 14 | "language": "python", 15 | "name": "python3" 16 | }, 17 | "language_info": { 18 | "codemirror_mode": { 19 | "name": "ipython", 20 | "version": 3 21 | }, 22 | "file_extension": ".py", 23 | "mimetype": "text/x-python", 24 | "name": "python", 25 | "nbconvert_exporter": "python", 26 | "pygments_lexer": "ipython3", 27 | "version": "3.7.1" 28 | } 29 | }, 30 | "nbformat": 4, 31 | "nbformat_minor": 2 32 | } 33 | -------------------------------------------------------------------------------- /逻辑回归/08. 从二分类到多分类.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "// 二分类 -> 多分类" 8 | ] 9 | } 10 | ], 11 | "metadata": { 12 | "kernelspec": { 13 | "display_name": "Python 3", 14 | "language": "python", 15 | "name": "python3" 16 | }, 17 | "language_info": { 18 | "codemirror_mode": { 19 | "name": "ipython", 20 | "version": 3 21 | }, 22 | "file_extension": ".py", 23 | "mimetype": "text/x-python", 24 | "name": "python", 25 | "nbconvert_exporter": "python", 26 | "pygments_lexer": "ipython3", 27 | "version": "3.7.1" 28 | } 29 | }, 30 | "nbformat": 4, 31 | "nbformat_minor": 2 32 | } 33 | -------------------------------------------------------------------------------- /逻辑回归/09. 从线性分类到非线性分类.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "// 线性分类 -> 非线性分类" 8 | ] 9 | } 10 | ], 11 | "metadata": { 12 | "kernelspec": { 13 | "display_name": "Python 3", 14 | "language": "python", 15 | "name": "python3" 16 | }, 17 | "language_info": { 18 | "codemirror_mode": { 19 | "name": "ipython", 20 | "version": 3 21 | }, 22 | "file_extension": ".py", 23 | "mimetype": "text/x-python", 24 | "name": "python", 25 | "nbconvert_exporter": "python", 26 | "pygments_lexer": "ipython3", 27 | "version": "3.7.1" 28 | } 29 | }, 30 | "nbformat": 4, 31 | "nbformat_minor": 2 32 | } 33 | -------------------------------------------------------------------------------- /线性回归/x. 线性回归的概率解释.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "线性回归实际上满足下面三个假设:\n", 8 | "\n", 9 | "1. 因变量 y 和自变量 x 之间是线性关系;\n", 10 | "2. 自变量 x 和干扰项 e 相互独立;\n", 11 | "3. 
没有被线性模型捕捉到的随即因素 e 服从正态分布;" 12 | ] 13 | } 14 | ], 15 | "metadata": { 16 | "kernelspec": { 17 | "display_name": "Python 3", 18 | "language": "python", 19 | "name": "python3" 20 | }, 21 | "language_info": { 22 | "codemirror_mode": { 23 | "name": "ipython", 24 | "version": 3 25 | }, 26 | "file_extension": ".py", 27 | "mimetype": "text/x-python", 28 | "name": "python", 29 | "nbconvert_exporter": "python", 30 | "pygments_lexer": "ipython3", 31 | "version": "3.7.1" 32 | } 33 | }, 34 | "nbformat": 4, 35 | "nbformat_minor": 2 36 | } 37 | -------------------------------------------------------------------------------- /集成学习/02. Boosting 算法.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "Boosting 算法的工作机制如下:先从初始训练集训练出一个基学习器,再根据基学习器的表现对训练样本分布进行调整,使得之前错误分类的样本在后续受到更多关注,然后基于调整后的样本分布来训练下一个基学习器,直到基学习器数目达到预设的 T,最终将这 T 个基学习器进行加权结合。\n", 8 | "\n", 9 | "Boosting 的代表算法是 AdaBoost。" 10 | ] 11 | } 12 | ], 13 | "metadata": { 14 | "kernelspec": { 15 | "display_name": "Python 3", 16 | "language": "python", 17 | "name": "python3" 18 | }, 19 | "language_info": { 20 | "codemirror_mode": { 21 | "name": "ipython", 22 | "version": 3 23 | }, 24 | "file_extension": ".py", 25 | "mimetype": "text/x-python", 26 | "name": "python", 27 | "nbconvert_exporter": "python", 28 | "pygments_lexer": "ipython3", 29 | "version": "3.7.1" 30 | } 31 | }, 32 | "nbformat": 4, 33 | "nbformat_minor": 2 34 | } 35 | -------------------------------------------------------------------------------- /贝叶斯/x. 隐变量和 EM 算法.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "在现实应用中,我们经常会遇到一些不完整的训练样本,某些属性可能会由于某种原因无法观测,比如在西瓜数据中,如果某个西瓜根蒂已脱落,无法看出它是蜷缩的,还是硬挺的,这种无法观测的属性变量叫做 **隐变量**(latent variable)。\n", 8 | "\n", 9 | "常用的隐变量估计方法是 **EM算法**(Expectation Maximization,期望最大化算法),它是一种迭代算法。" 10 | ] 11 | } 12 | ], 13 | "metadata": { 14 | "kernelspec": { 15 | "display_name": "Python 3", 16 | "language": "python", 17 | "name": "python3" 18 | }, 19 | "language_info": { 20 | "codemirror_mode": { 21 | "name": "ipython", 22 | "version": 3 23 | }, 24 | "file_extension": ".py", 25 | "mimetype": "text/x-python", 26 | "name": "python", 27 | "nbconvert_exporter": "python", 28 | "pygments_lexer": "ipython3", 29 | "version": "3.7.1" 30 | } 31 | }, 32 | "nbformat": 4, 33 | "nbformat_minor": 2 34 | } 35 | -------------------------------------------------------------------------------- /线性回归/TODO.md: -------------------------------------------------------------------------------- 1 | * 带约束条件的最小二乘 2 | * 岭回归 3 | * Lasso 回归 4 | * 弹性网 5 | 6 | * 优化求解算法 7 | * 梯度下降法 8 | * 拟牛顿法 9 | * 牛顿法 10 | 11 | 鲁棒学习 12 | * L1 损失 13 | * Huber 损失 14 | * Tukey 损失 15 | 16 | 17 | 各种回归模型 18 | * 线性回归(简单线性回归) 19 | * 股票投资:根据股票背后企业的财务特征(X)预测该股票的未来收益率(Y),实现超额收益率 20 | * 客户终身价值:根据消费者的人口统计特征以及过去的消费记录(X)预测其给企业创造的收入(Y),识别潜在的高价值客户 21 | * 预测高血压:根据各种相关因素如:饮食习惯、服药习惯等(X)预测一个人的血压(Y) 22 | * 0-1 回归(广义线性模型) 23 | * 逻辑回归,Logistic Regression 24 | * Probit Regression 25 | * 个人征信:评价借款人未来还钱的可能性 26 | * 个性化商品推荐 27 | * 好友推荐 28 | * 定序回归 29 | * 定序数据就是关乎顺序的数据,但是又没有具体的数值意义 30 | * 消费者调查:1 表示非常不喜欢,2 表示有点不喜欢,3 表示一般般,4 表示有点喜欢,5 表示非常喜欢 31 | * 电影打分评级 32 | * 计数回归 33 | * RFM 模型:一定时间内,客户到访的次数 34 | * 一个癌症病人体内肿瘤的个数 35 | * 泊松回归、负二项回归、零膨胀泊松回归等方法 36 | * 生存回归 37 | * 生存数据就是一个现象或个体存续生存了多长时间,也就是我们常说的生存时间 38 | * 生存数据通常是 Censored Data,也就是截断的数据,比如 60+ 39 | * 寿险精算 40 | * Cox 等比例风险模型、AFT 加速失效模型 
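
下面附一段简单的示意代码(假设环境中装有 scikit-learn,数据为随机生成,并非本仓库笔记中的内容),在同一份带共线性特征和离群点的模拟数据上,对比上面清单里提到的普通最小二乘、岭回归、Lasso、弹性网和 Huber 回归的系数表现,仅作直观参考:

```python
# 示意代码:对比几种带正则化 / 鲁棒损失的线性回归(假设已安装 scikit-learn)
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import (ElasticNet, HuberRegressor, Lasso,
                                  LinearRegression, Ridge)

# 随机生成回归数据,并人为制造一对高度相关的特征和少量离群点
X, y = make_regression(n_samples=200, n_features=10, n_informative=5,
                       noise=10.0, random_state=42)
X[:, 5] = X[:, 0] + 0.01 * np.random.RandomState(0).randn(200)  # 共线性
y[:5] += 500                                                    # 离群点

models = {
    "OLS": LinearRegression(),
    "Ridge (L2)": Ridge(alpha=1.0),
    "Lasso (L1)": Lasso(alpha=1.0),
    "ElasticNet": ElasticNet(alpha=1.0, l1_ratio=0.5),
    "Huber": HuberRegressor(epsilon=1.35),
}

# 拟合各个模型,观察系数的稀疏程度和大小
for name, model in models.items():
    model.fit(X, y)
    w = model.coef_
    print(f"{name:12s} 非零系数个数 = {int(np.sum(np.abs(w) > 1e-6)):2d}, "
          f"最大系数绝对值 = {np.max(np.abs(w)):10.2f}")
```

通常可以观察到 Lasso 和弹性网会把一部分系数压成零(稀疏解),而 Huber 回归对离群点更不敏感,这与清单中各方法的定位是一致的。
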
-------------------------------------------------------------------------------- /数学基础/01. LaTeX 笔记.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "在学习机器学习的过程中,难免会遇到一些符号和公式,要生成这类数学公式,使用 **LaTeX** 无疑是最适合的。这篇文档记录在我做笔记时用到的一些 LaTeX 语法。\n", 8 | "\n", 9 | "### 参考\n", 10 | "\n", 11 | "* [常用数学符号的 LaTeX 表示方法](http://www.mohu.org/info/symbols/symbols.htm)\n", 12 | "* [一份不太简短的 LATEX 2ε 介绍](http://www.mohu.org/info/lshort-cn.pdf)" 13 | ] 14 | } 15 | ], 16 | "metadata": { 17 | "kernelspec": { 18 | "display_name": "Python 3", 19 | "language": "python", 20 | "name": "python3" 21 | }, 22 | "language_info": { 23 | "codemirror_mode": { 24 | "name": "ipython", 25 | "version": 3 26 | }, 27 | "file_extension": ".py", 28 | "mimetype": "text/x-python", 29 | "name": "python", 30 | "nbconvert_exporter": "python", 31 | "pygments_lexer": "ipython3", 32 | "version": "3.7.1" 33 | } 34 | }, 35 | "nbformat": 4, 36 | "nbformat_minor": 2 37 | } 38 | -------------------------------------------------------------------------------- /线性回归/优化算法/02. 拟牛顿法.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "拟牛顿法在一定程度上解决了牛顿法计算量大的问题,其本质思想是改善牛顿法每次需要求解复杂的 Hessian 矩阵的逆矩阵的缺陷,它使用正定矩阵来近似 Hessian 矩阵的逆,从而简化了运算的复杂度。\n", 8 | "\n", 9 | "拟牛顿法和最速下降法一样只要求每一步迭代时知道目标函数的梯度。通过测量梯度的变化,构造一个目标函数的模型使之足以产生超线性收敛性。这类方法大大优于最速下降法,尤其对于困难的问题。另外,因为拟牛顿法不需要二阶导数的信息,所以有时比牛顿法更为有效。如今,优化软件中包含了大量的拟牛顿算法用来解决无约束,约束,和大规模的优化问题。" 10 | ] 11 | } 12 | ], 13 | "metadata": { 14 | "kernelspec": { 15 | "display_name": "Python 3", 16 | "language": "python", 17 | "name": "python3" 18 | }, 19 | "language_info": { 20 | "codemirror_mode": { 21 | "name": "ipython", 22 | "version": 3 23 | }, 24 | "file_extension": ".py", 25 | "mimetype": "text/x-python", 26 | "name": "python", 27 | "nbconvert_exporter": "python", 28 | "pygments_lexer": "ipython3", 29 | "version": "3.7.1" 30 | } 31 | }, 32 | "nbformat": 4, 33 | "nbformat_minor": 2 34 | } 35 | -------------------------------------------------------------------------------- /决策树/06. 多变量决策树.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "决策树的决策边界有一个很明显的特点:**轴平行**(axis-parallel),也就是说它的分类边界由若干个与坐标轴平行的分段组成,比如下图中的决策树和相应的决策边界:\n", 8 | "\n", 9 | "\n", 10 | "\n", 11 | "如果决策边界不是轴平行的,而是使用斜的划分边界,不仅模型会简单得多,而且泛化性能也能得到提高。\n", 12 | "\n", 13 | "" 14 | ] 15 | } 16 | ], 17 | "metadata": { 18 | "kernelspec": { 19 | "display_name": "Python 3", 20 | "language": "python", 21 | "name": "python3" 22 | }, 23 | "language_info": { 24 | "codemirror_mode": { 25 | "name": "ipython", 26 | "version": 3 27 | }, 28 | "file_extension": ".py", 29 | "mimetype": "text/x-python", 30 | "name": "python", 31 | "nbconvert_exporter": "python", 32 | "pygments_lexer": "ipython3", 33 | "version": "3.7.1" 34 | } 35 | }, 36 | "nbformat": 4, 37 | "nbformat_minor": 2 38 | } 39 | -------------------------------------------------------------------------------- /集成学习/04. 
随机森林.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**随机森林**(Random Forest,简称 RF)是 Bagging 的一个扩展变体,它是在以决策树为基学习器构建 Bagging 集成的基础上,进一步在决策树的训练过程中引入了随机属性选择。传统决策树在选择划分属性时是在当前节点的属性集合中(假设有 $d$ 个属性)选择一个最优属性,而在 RF 中,会先从该节点的属性集合中随机选择 $k$ 个,再从这 $k$ 个属性中选择一个最优属性。推荐 $k = log_2d$。\n", 8 | "\n", 9 | "在 Bagging 算法中,我们通过对训练集的样本进行随机采样从而实现了基学习器的**多样性**,这被称为**样本扰动**;随机森林对其作了一点改进,在**样本扰动**的基础上,还加上了**属性扰动**,这样得到的基学习器更具多样性,最终集成的模型也具有更好的泛化性能。" 10 | ] 11 | } 12 | ], 13 | "metadata": { 14 | "kernelspec": { 15 | "display_name": "Python 3", 16 | "language": "python", 17 | "name": "python3" 18 | }, 19 | "language_info": { 20 | "codemirror_mode": { 21 | "name": "ipython", 22 | "version": 3 23 | }, 24 | "file_extension": ".py", 25 | "mimetype": "text/x-python", 26 | "name": "python", 27 | "nbconvert_exporter": "python", 28 | "pygments_lexer": "ipython3", 29 | "version": "3.7.1" 30 | } 31 | }, 32 | "nbformat": 4, 33 | "nbformat_minor": 2 34 | } 35 | -------------------------------------------------------------------------------- /集成学习/05. 个体学习器的多样性.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### 误差-分歧分解\n", 8 | "\n", 9 | "个体学习器准确性越高,多样性越大,则集成越好。\n", 10 | "\n", 11 | "### 多样性度量\n", 12 | "\n", 13 | "很很多方法用于估算个体学习器的多样化程度:\n", 14 | "\n", 15 | "* 不合度量\n", 16 | "* 相关系数\n", 17 | "* $Q$-统计量\n", 18 | "* $\\kappa$-统计量\n", 19 | "\n", 20 | "这些都是 **成对型**(pairwise)多样性度量,可以通过 **$\\kappa$-误差图** 绘制出来。\n", 21 | "\n", 22 | "### 如何提高多样性?\n", 23 | "\n", 24 | "* 数据样本扰动\n", 25 | "* 输入属性扰动\n", 26 | "* 输出表示扰动\n", 27 | "* 算法参数扰动" 28 | ] 29 | } 30 | ], 31 | "metadata": { 32 | "kernelspec": { 33 | "display_name": "Python 3", 34 | "language": "python", 35 | "name": "python3" 36 | }, 37 | "language_info": { 38 | "codemirror_mode": { 39 | "name": "ipython", 40 | "version": 3 41 | }, 42 | "file_extension": ".py", 43 | "mimetype": "text/x-python", 44 | "name": "python", 45 | "nbconvert_exporter": "python", 46 | "pygments_lexer": "ipython3", 47 | "version": "3.7.1" 48 | } 49 | }, 50 | "nbformat": 4, 51 | "nbformat_minor": 2 52 | } 53 | -------------------------------------------------------------------------------- /聚类/5. 
层次聚类算法.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "层次聚类分为 **聚合**(agglomerative)和 **分裂**(divisive)两种方法,也称为 **自下而上**(bottom-up)和 **自上而下**(top-down)。聚合聚类开始将每个样本各自分到一个类,之后将相距最近的两个类合并,建立一个新类,重复此操作直到满足停止条件。而分裂聚类相反,开始将所有样本分到一个类,之后将已有类中相距最远的样本分到两个新类,重复此操作直到满足停止条件。\n", 8 | "\n", 9 | "层次聚类需要明确下面三个要素:\n", 10 | "\n", 11 | "* 距离或相似度算法:闵氏距离、马氏距离、相关系数、夹角余弦等\n", 12 | "* 合并规则:一般通过类间距离最小来进行合并,类间距离可以是最短距离、最长距离、中心距离、平均距离等\n", 13 | "* 停止条件:类的个数达到某个阈值,类的直径超过某个阈值等\n", 14 | "\n", 15 | "### **AGNES算法**(AGglomerative NESting)\n", 16 | "\n", 17 | "AGNES算法是经典的自下而上聚类算法。\n", 18 | "\n", 19 | "### **DIANA算法**(DIvisive ANAlysis)\n", 20 | "\n", 21 | "DIANA算法是经典的自上而下聚类算法。" 22 | ] 23 | } 24 | ], 25 | "metadata": { 26 | "kernelspec": { 27 | "display_name": "Python 3", 28 | "language": "python", 29 | "name": "python3" 30 | }, 31 | "language_info": { 32 | "codemirror_mode": { 33 | "name": "ipython", 34 | "version": 3 35 | }, 36 | "file_extension": ".py", 37 | "mimetype": "text/x-python", 38 | "name": "python", 39 | "nbconvert_exporter": "python", 40 | "pygments_lexer": "ipython3", 41 | "version": "3.7.1" 42 | } 43 | }, 44 | "nbformat": 4, 45 | "nbformat_minor": 2 46 | } 47 | -------------------------------------------------------------------------------- /神经网络/TODO.md: -------------------------------------------------------------------------------- 1 | 神经网络概览 2 | 3 | * 感知机(Perceptrons) 4 | * 前馈神经网络(Feed Forward Neural Networks) 5 | * 径向基函数网络(Radial Basis Function,RBF) 6 | * Hopfield 神经网络(Hopfield Network,HN) 7 | * 波尔兹曼机(Boltzmann Machines,BM) 8 | * 受限玻尔兹曼机(Restricted Boltzmann Machines,RBM) 9 | * 马尔可夫链(Markov Chains,MC) 10 | * 离散时间马尔可夫链(Discrete Time Markov Chain,DTMC) 11 | * 自编码器(Autoencoders,AE) 12 | * 稀疏自编码器(Sparse Autoencoders,SAE) 13 | * 变分自编码器(Variational Autoencoders, VAE) 14 | * 去噪自编码器(Denoising Autoencoders,DAE) 15 | * 深度信念网络(Deep Belief Networks,DBN) 16 | * 卷积神经网络(Convolutional Neural Networks,CNN) 17 | * 深度卷积神经网络(Deep Convolutional Neural Networks,DCNN) 18 | * 反卷积神经网络(Deconvolutional Networks,DN) 19 | * 深度卷积逆向图网络(Deep Convolutional Inverse Graphics Networks,DCIGN) 20 | * 生成式对抗网络(Generative Adversarial Networks,GAN) 21 | * 循环神经网络(Recurrent Neural Networks,RNN) 22 | * 长短时记忆网络(Long Short Term Memory,LSTM) 23 | * 门控循环单元(Gated Recurrent Units,GRU) 24 | * 神经图灵机(Neural Turing Machines,NTM) 25 | * 深度残差网络(Deep Residual Networks,DRN) 26 | * 回声状态网络(Echo State Networks,ESN) 27 | * 极限学习机(Extreme Learning Machines,ELM) 28 | * 液体状态机(Liquid State Machines,LSM) 29 | * 支持向量机(Support Vector Machines,SVM) 30 | * Kohonen 网络(Kohonen Networks,KN) 31 | * 自组织映射(Self-Organizing Map,SOM) 32 | 33 | 34 | https://www.infoq.cn/article/teach-you-how-to-read-all-kinds-of-neural-networks 35 | -------------------------------------------------------------------------------- /逻辑回归/x. 
广义线性模型.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "线性回归和逻辑回归都是广义线性模型(Generalized Linear Model)的一种特殊形式。\n", 8 | "\n", 9 | "\n", 10 | "广义线性模型(Generalized Linear Model)\n", 11 | "https://zhuanlan.zhihu.com/p/22876460\n", 12 | "\n", 13 | "广义线性模型 学习笔记(一)——定义\n", 14 | "https://xg1990.com/blog/archives/304\n", 15 | "\n", 16 | "广义线性模型和逻辑回归的公式推导\n", 17 | "https://blog.csdn.net/wx_blue_pig/article/details/79808491\n", 18 | "\n", 19 | "从线性模型到广义线性模型(1)——模型假设篇\n", 20 | "https://cosx.org/2011/01/how-does-glm-generalize-lm-assumption/\n", 21 | "\n", 22 | "从线性模型到广义线性模型 (2)——参数估计、假设检验\n", 23 | "https://cosx.org/2011/01/how-does-glm-generalize-lm-fit-and-test/" 24 | ] 25 | } 26 | ], 27 | "metadata": { 28 | "kernelspec": { 29 | "display_name": "Python 3", 30 | "language": "python", 31 | "name": "python3" 32 | }, 33 | "language_info": { 34 | "codemirror_mode": { 35 | "name": "ipython", 36 | "version": 3 37 | }, 38 | "file_extension": ".py", 39 | "mimetype": "text/x-python", 40 | "name": "python", 41 | "nbconvert_exporter": "python", 42 | "pygments_lexer": "ipython3", 43 | "version": "3.7.1" 44 | } 45 | }, 46 | "nbformat": 4, 47 | "nbformat_minor": 2 48 | } 49 | -------------------------------------------------------------------------------- /集成学习/03. Bagging 算法.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "集成学习中个体学习器应尽可能的相互独立,要设法使基学习器尽可能具有较大差异。一种最简单的做法是对训练样本进行采样,产生若干个不同的子集,然后基于每一个子集训练出一个基学习器,这样得到的基学习器由于训练样本不同,他们之间有希望能得到比较大的差异。但是如果要保证每个子集样本都不同,那么每个子集的训练数据也会比较少,这样得到的基学习器性能就会比较差,所以常常使用交叉采样法。\n", 8 | "\n", 9 | "**Bagging**(Bootstrap AGGregatING) 是并形式集成学习方法最著名的代表,它基于 **自助采样法**(bootstrap sampling),它的基本流程如下:首先从数据集中随机取出一个样本放入采样集,再把该样本放回,这样经过 m 次随机采样,就得到了一个包含 m 个样本的采样集,使用同样的方法重复 T 次,我们就得到了 T 个采样集,然后基于每个采样集训练出一个基学习器,最后将这些基学习器进行结合。\n", 10 | "\n", 11 | "标准的 AdaBoost 只适用于二分类任务,而 Bagging 可以直接用于多分类和回归任务。另外,由于 Bagging 算法采用了自助采样法,所以每个基学习器只使用了训练集的约 63.2% 的样本,剩下约 36.8% 的样本称为 **包外样本**,可用于验证集对泛化性能进行 **包外估计**(out-of-bag estimate),还可以根据不同的基学习器采用不同的方法辅助学习器的训练:比如基学习器是决策树时,可用来辅助剪枝,基学习器是神经网络时,可用来辅助早停减少过拟合风险。" 12 | ] 13 | } 14 | ], 15 | "metadata": { 16 | "kernelspec": { 17 | "display_name": "Python 3", 18 | "language": "python", 19 | "name": "python3" 20 | }, 21 | "language_info": { 22 | "codemirror_mode": { 23 | "name": "ipython", 24 | "version": 3 25 | }, 26 | "file_extension": ".py", 27 | "mimetype": "text/x-python", 28 | "name": "python", 29 | "nbconvert_exporter": "python", 30 | "pygments_lexer": "ipython3", 31 | "version": "3.7.1" 32 | } 33 | }, 34 | "nbformat": 4, 35 | "nbformat_minor": 2 36 | } 37 | -------------------------------------------------------------------------------- /贝叶斯/x. 
概率图模型.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### 概率图模型分类\n", 8 | "\n", 9 | "* 概率图模型(Probabilistic Graphical Model,PGM)\n", 10 | " * 贝叶斯网络(Bayesian Network,有向图模型)\n", 11 | " * 隐马尔可夫模型(Hidden Markov Model,HMM,生成式模型)\n", 12 | " * 马尔可夫网络(Markov Network,无向图模型)\n", 13 | " * 马尔可夫随即场(Markov Random Field,MRF,生成式模型)\n", 14 | " * 条件随即场(Conditional Random Field,CRF,判别式模型)\n", 15 | " \n", 16 | "### 概率图模型的推断方法\n", 17 | "\n", 18 | "* 精确推断(动态规划问题)\n", 19 | " * 变量消去\n", 20 | " * 信念传播(Belief Propagation)\n", 21 | "* 近似推断\n", 22 | " * 采样(sampling)\n", 23 | " * 马尔可夫链-蒙特卡洛(Markov Chain Monte Carlo,MCMC)\n", 24 | " * Metropolis-Hastings(MH 算法)\n", 25 | " * 吉布斯采样(Gibbs sampling)\n", 26 | " * 变分推断(variational inference)" 27 | ] 28 | } 29 | ], 30 | "metadata": { 31 | "kernelspec": { 32 | "display_name": "Python 3", 33 | "language": "python", 34 | "name": "python3" 35 | }, 36 | "language_info": { 37 | "codemirror_mode": { 38 | "name": "ipython", 39 | "version": 3 40 | }, 41 | "file_extension": ".py", 42 | "mimetype": "text/x-python", 43 | "name": "python", 44 | "nbconvert_exporter": "python", 45 | "pygments_lexer": "ipython3", 46 | "version": "3.7.1" 47 | } 48 | }, 49 | "nbformat": 4, 50 | "nbformat_minor": 2 51 | } 52 | -------------------------------------------------------------------------------- /数学基础/x. 拉格朗日对偶问题.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "可微分的凸函数 $f: \\mathbb{R}^d \\to \\mathbb{R}$ 和 $g: \\mathbb{R}^d \\to \\mathbb{R}^p$ 的约束条件的最小化问题:\n", 8 | "\n", 9 | "$$\n", 10 | "\\mathop{\\min}_{t} f(t), g(t) \\leqslant 0\n", 11 | "$$\n", 12 | "\n", 13 | "可以转换为拉格朗日对偶问题重新定义:\n", 14 | "\n", 15 | "$$\n", 16 | "\\mathop{\\max}_{\\lambda} \\mathop{\\inf}_{t} L(t, \\lambda), \\lambda \\geqslant 0\n", 17 | "$$\n", 18 | "\n", 19 | "其中,$\\lambda$ 称为拉格朗日乘子:\n", 20 | "\n", 21 | "$$\n", 22 | "\\lambda = (\\lambda_1, \\dots, \\lambda_p)^T\n", 23 | "$$\n", 24 | "\n", 25 | "$L(t, \\lambda)$ 称为拉格朗日函数:\n", 26 | "\n", 27 | "$$\n", 28 | "L(t, \\lambda) = f(t) + \\lambda^Tg(t)\n", 29 | "$$\n", 30 | "\n", 31 | "拉格朗日对偶问题的 t 的解,与原来的问题的解是一致的。" 32 | ] 33 | } 34 | ], 35 | "metadata": { 36 | "kernelspec": { 37 | "display_name": "Python 3", 38 | "language": "python", 39 | "name": "python3" 40 | }, 41 | "language_info": { 42 | "codemirror_mode": { 43 | "name": "ipython", 44 | "version": 3 45 | }, 46 | "file_extension": ".py", 47 | "mimetype": "text/x-python", 48 | "name": "python", 49 | "nbconvert_exporter": "python", 50 | "pygments_lexer": "ipython3", 51 | "version": "3.7.1" 52 | } 53 | }, 54 | "nbformat": 4, 55 | "nbformat_minor": 2 56 | } 57 | -------------------------------------------------------------------------------- /概述/01. 
什么是机器学习.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### 机器学习定义\n", 8 | "\n", 9 | "* 非正式定义:在不直接针对问题进行编程的情况下,赋予计算机学习能力的一个研究领域。(1959年,出自 Arthur Samuel,他被誉为**机器学习之父**)\n", 10 | "* 正式定义:对于一个计算机程序来讲,给他一个任务 T 和一个性能测量方法 P,如果在经验 E 的影响下 P 对 T 的测量结果得到了改进,那么就说该程序从 E 中得到了学习。(1998年,出自 Tom Mitchell《Machine Leaming》)\n", 11 | "\n", 12 | "### 人工智能、深度学习和机器学习\n", 13 | "\n", 14 | "这三者的关系如下图所示:\n", 15 | "\n", 16 | "![](../images/ai-ml-dl.jpg)\n", 17 | "\n", 18 | "**人工智能**(Artificial Intelligence,简称 AI)是用于研究、模拟及扩展人的智能应用科学,涉及机器人、语音识别、图像识别、自然语言处理等。它的研究横跨多门学科,如计算机、数学、生物、语言、声音、视觉甚至心理学和哲学。AI 的核心是做到感知、推断、行动及根据经验值进行调整(sense, reason, act and adapt),即类似人类的智慧体智能学习提升。\n", 19 | "\n", 20 | "**深度学习**(Deep Learning)泛指深度神经网络学习,如卷积神经网络(Convolutional Neural Network,CNN)等,它把普通的神经网络从几层扩展到更多的层,通过大数据以及近些年发展的强大运算能力(比如 GPU)获取更精准的模型,广泛应用在图像和视频识别等领域。" 21 | ] 22 | } 23 | ], 24 | "metadata": { 25 | "kernelspec": { 26 | "display_name": "Python 3", 27 | "language": "python", 28 | "name": "python3" 29 | }, 30 | "language_info": { 31 | "codemirror_mode": { 32 | "name": "ipython", 33 | "version": 3 34 | }, 35 | "file_extension": ".py", 36 | "mimetype": "text/x-python", 37 | "name": "python", 38 | "nbconvert_exporter": "python", 39 | "pygments_lexer": "ipython3", 40 | "version": "3.7.1" 41 | } 42 | }, 43 | "nbformat": 4, 44 | "nbformat_minor": 2 45 | } 46 | -------------------------------------------------------------------------------- /概述/03. 机器学习算法.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### 机器学习算法清单\n", 8 | "\n", 9 | "1. k-近邻(kNN)\n", 10 | "1. 决策树 ID3 算法\n", 11 | "1. 朴素贝叶斯分类算法\n", 12 | "1. 感知机\n", 13 | "1. 支持向量机(SVM)\n", 14 | "1. 逻辑回归(logistic regression)\n", 15 | "1. AdaBoost(adaptive boosting)\n", 16 | "\n", 17 | "### 生成模型 vs. 判别模型\n", 18 | "\n", 19 | "生成方法和判别方法是对监督学习算法的分类,由判别方法学到的模型叫做判别模型,由生成方法学到的模型叫做生成模型。\n", 20 | "\n", 21 | "判别方法由数据直接学习决策函数 $Y=f(X)$ 或者条件概率分布 $P(Y|X)$ 作为预测的模型,即判别模型。基本思想是有限样本条件下建立判别函数,不考虑样本的产生模型,直接研究预测模型。典型的判别模型包括k-近邻,感知机,决策树,支持向量机等。\n", 22 | "\n", 23 | "生成方法由数据学习联合概率密度分布 $P(X,Y)$,然后求出条件概率分布 $P(Y|X)$ 作为预测的模型,即生成模型 $P(Y|X) = \\frac{P(X,Y)}{P(X)}$。基本思想是首先建立样本的联合概率概率密度模型 $P(X,Y)$,然后再得到后验概率 $P(Y|X)$,再利用它进行分类。典型的生成模型是朴素贝叶斯算法。\n", 24 | "\n", 25 | "判别方法更多的关注各个类别之间的区别,而生成方法更多关注各个类别内部的相似度。判别方法画一条线,明确这种差别,线左边是类别A,线右边是类别B;生成方法拿样本去比较两个类别的已有数据,看看这个样本生成自哪个类别的可能性更大。\n", 26 | "\n", 27 | "由生成模型可以得到判别模型,但由判别模型得不到生成模型。" 28 | ] 29 | } 30 | ], 31 | "metadata": { 32 | "kernelspec": { 33 | "display_name": "Python 3", 34 | "language": "python", 35 | "name": "python3" 36 | }, 37 | "language_info": { 38 | "codemirror_mode": { 39 | "name": "ipython", 40 | "version": 3 41 | }, 42 | "file_extension": ".py", 43 | "mimetype": "text/x-python", 44 | "name": "python", 45 | "nbconvert_exporter": "python", 46 | "pygments_lexer": "ipython3", 47 | "version": "3.7.1" 48 | } 49 | }, 50 | "nbformat": 4, 51 | "nbformat_minor": 2 52 | } 53 | -------------------------------------------------------------------------------- /聚类/2. 
聚类算法一览.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "聚类算法非常多,更新也非常快,这是因为聚类不存在客观标准,给定数据集,总能从某个角度找到以往算法未覆盖的某种标准从而设计出新算法,相对其他机器学习算法而言,聚类的知识还不够系统化。\n", 8 | "\n", 9 | "通过聚类得到的类或簇,本质是样本的子集,如果一个聚类方法假定一个样本只属于一个类,那么该方法称为 **硬聚类**(hard clustering),如果一个样本可以属于多个类,则称为 **软聚类**(soft clustering)。\n", 10 | "\n", 11 | "周志华老师的《机器学习》将聚类算法分为三大类:**原型聚类**(prototype-based clustering)、**密度聚类**(density-based clustering)和**层次聚类**(hierarchical clustering)。\n", 12 | "\n", 13 | "* 原型聚类假设聚类结构能通过一组原型刻画,它首先对原型进行初始化,然后对原型进行迭代更新求解,其代表算法有 **k 均值算法**(k-means),**学习向量量化算法**(learning vector quantization,简称 LVQ),**高斯混合聚类算法**(mixture-of-Gaussian)。\n", 14 | "* 密度聚类假设聚类结构能通过样本分布的紧密程度来确定,通常情况下,密度聚类算法从样本密度的角度来考察样本之间的**可连接性**,并基于可连接样本不断扩展聚类簇以获得最终的聚类结果,代表算法为 **DBSCAN**(Density-Based Spatial Clustering of Applications with Noise)。\n", 15 | "* 层次聚类试图在不同层次对数据集进行划分,从而形成树形的聚类结构,数据集的划分可采用**自底向上**的聚合策略,比如 **AGNES算法**(AGglomerative NESting),也可以采用**自顶向下**的分拆策略,比如 **DIANA算法**(DIvisive ANAlysis)。AGNES 和 DIANA 都不能对已合并或已分拆的聚类簇进行回溯调整,**BIRCH**、**ROCK** 等算法对此进行了改进。" 16 | ] 17 | } 18 | ], 19 | "metadata": { 20 | "kernelspec": { 21 | "display_name": "Python 3", 22 | "language": "python", 23 | "name": "python3" 24 | }, 25 | "language_info": { 26 | "codemirror_mode": { 27 | "name": "ipython", 28 | "version": 3 29 | }, 30 | "file_extension": ".py", 31 | "mimetype": "text/x-python", 32 | "name": "python", 33 | "nbconvert_exporter": "python", 34 | "pygments_lexer": "ipython3", 35 | "version": "3.7.1" 36 | } 37 | }, 38 | "nbformat": 4, 39 | "nbformat_minor": 2 40 | } 41 | -------------------------------------------------------------------------------- /数学基础/x. 满秩矩阵.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### 线性相关和线性无关\n", 8 | "\n", 9 | "如果有不全为 0 的数 $k_1, \\dots, k_s$ 使得 $k_1 \\bf{\\alpha_1} + \\dots + k_s \\bf{\\alpha_s} = 0$,那么向量组 $\\bf{\\alpha_1}, \\dots, \\bf{\\alpha_s}$ 称为 **线性相关** 的。\n", 10 | "\n", 11 | "如果只有全为 0 的数 $k_1, \\dots, k_s$ 使得 $k_1 \\bf{\\alpha_1} + \\dots + k_s \\bf{\\alpha_s} = 0$,那么向量组 $\\bf{\\alpha_1}, \\dots, \\bf{\\alpha_s}$ 称为 **线性无关** 的。\n", 12 | "\n", 13 | "判断一个向量组是否线性相关,一般将向量组组成一个矩阵,再对矩阵进行 **初等行变换** 转成 **阶梯型矩阵**,最后比较阶梯型矩阵的非零行数目和未知量数目。\n", 14 | "\n", 15 | "### 极大线性无关组\n", 16 | "\n", 17 | "如果一个向量组的部分组是线性无关的,但是从这个向量组的其余向量中(如果还有的话)任取一个添进去,得到的新的部分组都是线性相关的,那么这个部分组被称为 **极大线性无关组**。\n", 18 | "\n", 19 | "### 向量组的秩\n", 20 | "\n", 21 | "向量组的极大线性无关组所含向量的个数称为这个 **向量组的秩**。\n", 22 | "\n", 23 | "### 矩阵的秩\n", 24 | "\n", 25 | "矩阵 A 的列向量组的秩称为 A 的 **列秩**,A 的行向量组的秩称为 A 的 **行秩**,任一矩阵的行秩等于它的列秩,行秩和列秩统称为 A 的 **秩**,记为 $rank(A)$。\n", 26 | "\n", 27 | "### 满秩矩阵\n", 28 | "\n", 29 | "一个 n 级矩阵 A 的秩如果等于它的级数 n,那么称 A 为 **满秩矩阵**(full-rank matrix)。" 30 | ] 31 | } 32 | ], 33 | "metadata": { 34 | "kernelspec": { 35 | "display_name": "Python 3", 36 | "language": "python", 37 | "name": "python3" 38 | }, 39 | "language_info": { 40 | "codemirror_mode": { 41 | "name": "ipython", 42 | "version": 3 43 | }, 44 | "file_extension": ".py", 45 | "mimetype": "text/x-python", 46 | "name": "python", 47 | "nbconvert_exporter": "python", 48 | "pygments_lexer": "ipython3", 49 | "version": "3.7.1" 50 | } 51 | }, 52 | "nbformat": 4, 53 | "nbformat_minor": 2 54 | } 55 | -------------------------------------------------------------------------------- /决策树/01. 
决策树算法概述.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**决策树**(decision tree)是一种基于树结构的机器学习算法,它和人类在面临决策问题时的处理机制类似。譬如在判定一个西瓜是不是好瓜时,我们会进行一系列的判断:先看它是什么颜色,如果是青绿色,再看它的根蒂是什么形态,如果是蜷缩的,再听一听它敲起来是什么声音,如果是浊响的,最后得出结论,这是一个好瓜。(周志华, 机器学习, 第4章)\n", 8 | "\n", 9 | "![](../images/decision-tree-watermelon.png)\n", 10 | "\n", 11 | "可以看出,决策树算法和之前的算法有着明显的不同,它是一种 **非参数学习算法**(nonparametric methods)。之前的线性回归、逻辑回归等算法都是先选择一个目标函数,然后通过训练,学习目标函数的系数,这种算法称为 **参数学习算法**(parametric methods)。而非参数学习算法不需要对目标函数做过多的假设,算法可以自由的从训练数据中学习任意形式的函数。\n", 12 | "\n", 13 | "决策树算法的核心在于决策树的构建,如上图所示,决策树的每一个非叶子节点对应一个特征测试,所以要构建这样一棵树,首先我们要从样本数据中找出一个特征,通过该特征可以将数据划分成几个子集,然后在每个子集上重复上面的划分过程,直到所有的特征消耗完或子集不能再划分为止,这时生成的整个树形结构就是决策树。如何找到最优的划分特征,是构建过程中最关键的一步,这也是众多决策树算法的精华部分。\n", 14 | "\n", 15 | "### 决策树优缺点\n", 16 | "\n", 17 | "决策树最大的优点是简单,非常容易理解,而且模型可以可视化,非常直观。它的应用范围很广,可以用于解决回归和分类问题,天然支持多分类问题的求解。此外,无论特征是离散值,还是连续值,都可以处理。\n", 18 | "\n", 19 | "决策树很容易在训练数据中生成复杂的树结构,造成过拟合(overfitting),通常用剪枝来解决过拟合,譬如限制树的高度、叶子节点中的最少样本数量等。学习一棵最优的决策树被认为是 NP-Complete 问题,实际中的决策树是基于启发式的贪心算法建立的,这种算法不能保证建立全局最优的决策树,随机森林(Random Forest)可以缓解这个问题。" 20 | ] 21 | } 22 | ], 23 | "metadata": { 24 | "kernelspec": { 25 | "display_name": "Python 3", 26 | "language": "python", 27 | "name": "python3" 28 | }, 29 | "language_info": { 30 | "codemirror_mode": { 31 | "name": "ipython", 32 | "version": 3 33 | }, 34 | "file_extension": ".py", 35 | "mimetype": "text/x-python", 36 | "name": "python", 37 | "nbconvert_exporter": "python", 38 | "pygments_lexer": "ipython3", 39 | "version": "3.7.1" 40 | } 41 | }, 42 | "nbformat": 4, 43 | "nbformat_minor": 2 44 | } 45 | -------------------------------------------------------------------------------- /线性回归/优化算法/01. 牛顿法.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "牛顿法是一种二阶优化算法,相对于梯度下降算法收敛速度更快。\n", 8 | "\n", 9 | "首先,我们在函数 $f(x)$ 上选择一个初始点 $x_1$,并计算相应的 $f(x_1)$ 和切线斜率 $f'(x_1)$,然后我们计算穿过点 $(x_1,f(x_1))$ 并且斜率为 $f'(x_1)$ 的直线和 X 轴的交点,也就是求如下方程的解:\n", 10 | "\n", 11 | "$$\n", 12 | "f(x_1) + f'(x_1)∗(x−x_1) = 0\n", 13 | "$$\n", 14 | "\n", 15 | "我们将新求得的点的坐标命名为 $x_2$,通常 $x_2$ 会比 $x_1$ 更接近 $f(x)=0$ 的解。因此我们现在可以利用 $x_2$ 开始下一轮迭代,迭代公式可化简为如下所示:\n", 16 | "\n", 17 | "$$\n", 18 | "x_{n+1} = x_n − \\frac{f(x_n)}{f'(x_n)}\n", 19 | "$$\n", 20 | "\n", 21 | "疑惑:不是二阶导数吗?\n", 22 | "\n", 23 | "牛顿法是基于当前位置的切线来确定下一次的位置,所以牛顿法又被很形象地称为是**切线法**。牛顿法的搜索路径(二维情况)如下图所示:\n", 24 | "\n", 25 | "![](../../images/NewtonIteration_Ani.gif)" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "### 优缺点\n", 33 | "\n", 34 | "* 优点\n", 35 | " * 相对于梯度下降法,收敛速度更快。\n", 36 | "\n", 37 | "* 缺点\n", 38 | " * 每次计算都需要计算 Hessian 矩阵的逆,因此计算量较大。\n", 39 | " * 在多变量的情况下,如果目标矩阵的 Hessain 矩阵非正定,牛顿法确定的搜索方向并不一定是目标函数下降的方向。" 40 | ] 41 | } 42 | ], 43 | "metadata": { 44 | "kernelspec": { 45 | "display_name": "Python 3", 46 | "language": "python", 47 | "name": "python3" 48 | }, 49 | "language_info": { 50 | "codemirror_mode": { 51 | "name": "ipython", 52 | "version": 3 53 | }, 54 | "file_extension": ".py", 55 | "mimetype": "text/x-python", 56 | "name": "python", 57 | "nbconvert_exporter": "python", 58 | "pygments_lexer": "ipython3", 59 | "version": "3.7.1" 60 | } 61 | }, 62 | "nbformat": 4, 63 | "nbformat_minor": 2 64 | } 65 | -------------------------------------------------------------------------------- /支持向量机/05. 
SMO 算法.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "前面我们使用拉格朗日乘子法将支持向量机的基本形式转化为对偶形式,得到了硬间隔支持向量机和软间隔支持向量机的优化目标。\n", 8 | "\n", 9 | "硬间隔的优化目标为:\n", 10 | "\n", 11 | "$$\n", 12 | "\\begin{align}\n", 13 | "&\\mathop \\min_{\\alpha} \\frac{1}{2} \\sum_{i=1}^n \\sum_{j=1}^n \\alpha_i \\alpha_j y_i y_j x_i x_j - \\sum_{i=1}^n \\alpha_i \\\\\n", 14 | "&s.t. \\sum_{i=1}^n \\alpha_i y_i = 0, \\alpha_i \\geq 0, i = 1,2,\\dots,n\n", 15 | "\\end{align}\n", 16 | "$$\n", 17 | "\n", 18 | "软间隔的优化目标为:\n", 19 | "\n", 20 | "$$\n", 21 | "\\begin{align}\n", 22 | "&\\mathop \\min_{\\alpha} \\frac{1}{2} \\sum_{i=1}^n \\sum_{j=1}^n \\alpha_i \\alpha_j y_i y_j x_i x_j - \\sum_{i=1}^n \\alpha_i\\\\\n", 23 | "&s.t. \\sum_{i=1}^n \\alpha_i y_i = 0, 0 \\leq \\alpha_i \\leq C, i = 1,2,\\dots,n\n", 24 | "\\end{align}\n", 25 | "$$\n", 26 | "\n", 27 | "对于这个带约束条件的优化问题,可以使用 **二次规划** 的方法进行求解,在 SVM 的发展过程中,很多算法被提出来,其中 Platt 于 1998 年提出的 **序列最小最优化算法**(Sequential Minimal Optimization, 简称 SMO)被广泛使用。" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "// TODO:SMO 算法原理\n", 35 | "\n", 36 | "// TODO:KKT 条件" 37 | ] 38 | } 39 | ], 40 | "metadata": { 41 | "kernelspec": { 42 | "display_name": "Python 3", 43 | "language": "python", 44 | "name": "python3" 45 | }, 46 | "language_info": { 47 | "codemirror_mode": { 48 | "name": "ipython", 49 | "version": 3 50 | }, 51 | "file_extension": ".py", 52 | "mimetype": "text/x-python", 53 | "name": "python", 54 | "nbconvert_exporter": "python", 55 | "pygments_lexer": "ipython3", 56 | "version": "3.7.1" 57 | } 58 | }, 59 | "nbformat": 4, 60 | "nbformat_minor": 2 61 | } 62 | -------------------------------------------------------------------------------- /线性回归/x. 
最小二乘的数学解释.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**微积分角度来讲**,最小二乘法是采用非迭代法,针对代价函数求导数而得出全局极值,进而对所给定参数进行估算。\n", 8 | "\n", 9 | "**计算数学角度来讲**,最小二乘法的本质上是一个线性优化问题,试图找到一个最优解。\n", 10 | "\n", 11 | "**线性代数角度来讲**,最小二乘法是求解线性方程组,当方程个数大于未知量个数,其方程本身无解,而最小二乘法则试图找到最优残差。\n", 12 | "\n", 13 | "**几何角度来讲**,最小二乘法中的几何意义是高维空间中的一个向量在低维子空间的投影。\n", 14 | "\n", 15 | "**概率论角度来讲**,如果数据的观测误差满足**高斯分布**(Gaussian distribution)或**正态分布**(Normal distribution),则最小二乘解就是使得观测数据出现概率最大的解,即**最大似然估计**(Maximum Likelihood Estimate,MLE),利用已知的样本结果,反推出最有可能(最大概率)导致这样结果的参数值。\n", 16 | "\n", 17 | "![](../images/gaussian-distribution.png)\n", 18 | "\n", 19 | "对于误差不符合高斯分布的,可以扩展到**广义误差分布**(Generalized error distribution,简称 GED)或**指数分布**(Exponential power distribution),广义误差是用来描述一类中间高两边低连续且对称的概率密度函数:\n", 20 | "\n", 21 | "$$\n", 22 | "p(x) \\mathrm{d}x = {1 \\over 2 a \\Gamma(1+1/b)} \\exp{(-|x/a|^b)} \\mathrm{d}x\n", 23 | "$$\n", 24 | "\n", 25 | "当 $b = 2$ 且 $a = \\sqrt{2}\\sigma$ 时,上式就退化成**高斯分布**,当 $b = 1$ 时,上式就退化成 **拉普拉斯分布**(Laplace distribution)。在一些长尾(long tail)的数据中,经常服从拉普拉斯分布,对他们来说**最小一乘法**才是更好的解法。\n", 26 | "\n", 27 | "#### 参考\n", 28 | "\n", 29 | "https://www.jianshu.com/p/40e251127025" 30 | ] 31 | } 32 | ], 33 | "metadata": { 34 | "kernelspec": { 35 | "display_name": "Python 3", 36 | "language": "python", 37 | "name": "python3" 38 | }, 39 | "language_info": { 40 | "codemirror_mode": { 41 | "name": "ipython", 42 | "version": 3 43 | }, 44 | "file_extension": ".py", 45 | "mimetype": "text/x-python", 46 | "name": "python", 47 | "nbconvert_exporter": "python", 48 | "pygments_lexer": "ipython3", 49 | "version": "3.7.1" 50 | } 51 | }, 52 | "nbformat": 4, 53 | "nbformat_minor": 2 54 | } 55 | -------------------------------------------------------------------------------- /线性回归/18. 
使用岭回归解决共线性问题.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### 什么时候使用岭回归?\n", 8 | "\n", 9 | "我们把岭回归的解和普通最小二乘的解析解作一个对比。\n", 10 | "\n", 11 | "我们知道,使用最小二乘进行线性回归时,有正规方程和梯度下降两种解法,使用正规方程可以求得:\n", 12 | "\n", 13 | "$$\n", 14 | "\hat{w} = (X^TX)^{-1}X^Ty\n", 15 | "$$\n", 16 | "\n", 17 | "而岭回归的解析解为 $\hat{w} = (X^TX + \lambda I)^{-1}X^Ty$(具体求解过程见后面的章节),可以看到岭回归的解和普通最小二乘的解非常类似,只是在 $X^TX$ 的基础上加上一个对角矩阵 $\lambda I$。在前面我们提到过,正规方程有很多不足之处,其中很重要的一点是 $X^TX$ 必须是可逆的,关于 $X^TX$ 可逆有很多其他的说法:\n", 18 | "\n", 19 | "* $X^TX$ 必须是可逆的\n", 20 | "* $X^TX$ 的行列式不能等于 0\n", 21 | "* $X$ 必须是满秩的\n", 22 | "* $X^TX$ 是正定矩阵\n", 23 | "\n", 24 | "为了保证 $X^TX$ 可逆,可以在 $X^TX$ 的对角线上加上一个很小的常量,这样可以提高其正则性,进而可以更稳定地进行逆矩阵的求解,这就是岭回归。\n", 25 | "\n", 26 | "另一方面,$X^TX$ 的行列式不能为 0,当变量之间具有很强的相关性时,$X^TX$ 的行列式会变得很小,甚至趋于 0,譬如一个自变量是身高 $x_1$,另一个是体重 $x_2$,根据这两个变量我们得到一个回归模型,$y = ax_1 + bx_2 + c$,由于 $x_1$ 和 $x_2$ 高度相关,所以 a 和 b 之间存在互相抵消的效应:你可以把 a 弄成一个很大的正数,同时把 b 弄成一个绝对值很大的负数,最终 $\hat{y}$ 可能不会改变多少。这会导致用不同人群拟合出来的 a 和 b 差别可能会很大,模型的可解释性就大大降低了。这种情况被称为 **多重共线性**。\n", 27 | "\n", 28 | "岭回归是一种专用于共线性数据分析的有偏估计回归方法,实质上是一种改良的最小二乘估计法,通过放弃最小二乘法的无偏性,以损失部分信息、降低精度为代价获得回归系数更为符合实际、更可靠的回归方法,对病态数据的拟合要强于最小二乘。" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "TODO 使用岭回归解决多重共线性" 36 | ] 37 | } 38 | ], 39 | "metadata": { 40 | "kernelspec": { 41 | "display_name": "Python 3", 42 | "language": "python", 43 | "name": "python3" 44 | }, 45 | "language_info": { 46 | "codemirror_mode": { 47 | "name": "ipython", 48 | "version": 3 49 | }, 50 | "file_extension": ".py", 51 | "mimetype": "text/x-python", 52 | "name": "python", 53 | "nbconvert_exporter": "python", 54 | "pygments_lexer": "ipython3", 55 | "version": "3.7.1" 56 | } 57 | }, 58 | "nbformat": 4, 59 | "nbformat_minor": 2 60 | } 61 | -------------------------------------------------------------------------------- /数学基础/x. 
正定矩阵.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### 二次型\n", 8 | "\n", 9 | "数域 $K$ 上的一个 **n 元二次型** 是系数在 $K$ 中的 n 个变量的二次齐次多项式,它的一般形式为:\n", 10 | "\n", 11 | "$$\n", 12 | "f(x_1, x_2, \\dots, x_n) = \\sum_{i=1}^n \\sum_{j=1}^n a_{ij}x_ix_j\n", 13 | "$$\n", 14 | "\n", 15 | "其中 $a_{ij} = a_{ji}$。\n", 16 | "\n", 17 | "将上式中的系数写成矩阵形式:\n", 18 | "\n", 19 | "$$\n", 20 | "A = \n", 21 | "\\begin{aligned}{\n", 22 | "\\left[ \n", 23 | "\\begin{array}{ccc}\n", 24 | "a_{11} & a_{12} & \\dots & a_{1n} \\\\\n", 25 | "a_{21} & a_{22} & \\dots & a_{2n} \\\\\n", 26 | "\\vdots & \\vdots & \\dots & \\vdots \\\\\n", 27 | "a_{n1} & a_{n2} & \\dots & a_{nn}\n", 28 | "\\end{array} \n", 29 | "\\right ]\n", 30 | "}\\end{aligned}\n", 31 | "$$\n", 32 | "\n", 33 | "显然 A 是一个对称矩阵,它被称为 **二次型 $f(x_1, x_2, \\dots, x_n)$ 的矩阵**。\n", 34 | "\n", 35 | "令 $X = \\left[ \\begin{array}{ccc} x_1 \\\\ x_2 \\\\ \\vdots \\\\ x_n \\end{array} \\right]$,则二次型可以写成下面的形式:\n", 36 | "\n", 37 | "$$\n", 38 | "f(x_1, x_2, \\dots, x_n) = X'AX\n", 39 | "$$\n", 40 | "\n", 41 | "### 正定矩阵\n", 42 | "\n", 43 | "如果对于 $R^n$ 中的任意非零列向量 $\\bf{\\alpha}$,都有 $\\bf{\\bf{\\alpha}'A\\bf{\\alpha}} > 0$,那么 n 元实二次型 $X'AX$ 称为 **正定** 的,实对称矩阵 $A$ 被称为 **正定矩阵**(positive definite matrix)。" 44 | ] 45 | } 46 | ], 47 | "metadata": { 48 | "kernelspec": { 49 | "display_name": "Python 3", 50 | "language": "python", 51 | "name": "python3" 52 | }, 53 | "language_info": { 54 | "codemirror_mode": { 55 | "name": "ipython", 56 | "version": 3 57 | }, 58 | "file_extension": ".py", 59 | "mimetype": "text/x-python", 60 | "name": "python", 61 | "nbconvert_exporter": "python", 62 | "pygments_lexer": "ipython3", 63 | "version": "3.7.1" 64 | } 65 | }, 66 | "nbformat": 4, 67 | "nbformat_minor": 2 68 | } 69 | -------------------------------------------------------------------------------- /其他资料/书籍/01. 
概率论与数理统计.陈希孺.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## 第 1 章 事件的概率\n", 8 | "\n", 9 | "### 1.1 概率是什么\n", 10 | "\n", 11 | "### 1.2 古典概率计算\n", 12 | "\n", 13 | "### 1.3 事件的运算、条件概率与独立性\n", 14 | "\n", 15 | "## 第 2 章 随机变量及概率分布\n", 16 | "\n", 17 | "### 2.1 一维随机变量\n", 18 | "\n", 19 | "### 2.2 多维随机变量(随机向量)\n", 20 | "\n", 21 | "### 2.3 条件概率分布与随机变量的独立性\n", 22 | "\n", 23 | "### 2.4 随机变量的函数的概率分布\n", 24 | "\n", 25 | "## 第 3 章 随机变量的数字特征\n", 26 | "\n", 27 | "### 3.1 数学期望(均值)与中位数\n", 28 | "\n", 29 | "### 3.2 方差与矩\n", 30 | "\n", 31 | "### 3.3 协方差与相关系数\n", 32 | "\n", 33 | "### 3.4 大数定理和中心极限定理\n", 34 | "\n", 35 | "## 第 4 章 参数估计\n", 36 | "\n", 37 | "### 4.1 数理统计学的基本概念\n", 38 | "\n", 39 | "### 4.2 矩估计、极大似然估计和贝叶斯估计\n", 40 | "\n", 41 | "### 4.3 点估计的优良性准则\n", 42 | "\n", 43 | "### 4.4 区间估计\n", 44 | "\n", 45 | "## 第 5 章 假设检验\n", 46 | "\n", 47 | "### 5.1 问题提法和基本概念\n", 48 | "\n", 49 | "### 5.2 重要参数检验\n", 50 | "\n", 51 | "### 5.3 拟合优度检验\n", 52 | "\n", 53 | "## 第 6 章 回归、相关与方差分析\n", 54 | "\n", 55 | "### 6.1 回归分析的基本概念\n", 56 | "\n", 57 | "### 6.2 一元线性回归\n", 58 | "\n", 59 | "### 6.3 多元线性回归\n", 60 | "\n", 61 | "### 6.4 相关分析\n", 62 | "\n", 63 | "### 6.5 方差分析" 64 | ] 65 | } 66 | ], 67 | "metadata": { 68 | "kernelspec": { 69 | "display_name": "Python 3", 70 | "language": "python", 71 | "name": "python3" 72 | }, 73 | "language_info": { 74 | "codemirror_mode": { 75 | "name": "ipython", 76 | "version": 3 77 | }, 78 | "file_extension": ".py", 79 | "mimetype": "text/x-python", 80 | "name": "python", 81 | "nbconvert_exporter": "python", 82 | "pygments_lexer": "ipython3", 83 | "version": "3.7.1" 84 | } 85 | }, 86 | "nbformat": 4, 87 | "nbformat_minor": 2 88 | } 89 | -------------------------------------------------------------------------------- /概述/02. 
机器学习分类.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### 机器学习分类\n", 8 | "\n", 9 | "机器学习根据处理数据的种类,可以分为:监督学习、无监督学习、半监督学习、强化学习等。\n", 10 | "\n", 11 | "**监督学习**(Supervised Learning)通常对有标签的训练样本特征进行学习,找到特征与标签之间的关系(函数),从而当训练结束,输入无标签的数据时,可以利用已经找出的关系分析出数据标签。如果学习的输出值是连续的,叫做**回归**(regression),如果是离散的,叫做**分类**(classification)。常用的监督学习方法有:神经网络、决策树、贝叶斯分类器、支持向量机等。\n", 12 | "\n", 13 | "**无监督学习**(Unsupervised Learning)指的是学习的数据没有标签,它通过学习算法找到数据内部的规律和性质,从而对数据进行分组,也就是所谓的『物以类聚,人以群分』,常见的任务有 **聚类**(clustering)、**异常检测**(anomaly detection) 和 **降维**(dimension reduction)。如果数据样本介于无标记及部分标记之间,这种机器学习被称为**半监督学习**(Semi-Supervised Learning)。\n", 14 | "\n", 15 | "**强化学习**(Reinforcement Learning)是智能系统**从环境到行为映射**的学习,以使奖励函数值最大,奖励函数是对产生动作的好坏作一种评价,而不是告诉机器如何去产生正确的动作。\n", 16 | "\n", 17 | "下面三个例子说明了监督学习、无监督学习和强化学习之间的区别,假设我们手上有一堆的选择题(训练样本),如何训练计算机来学习呢?\n", 18 | "\n", 19 | "监督学习是**提供选择题的同时也提供它们的标准答案**,计算机努力调整自己的模型参数,希望自己推测的答案与标准答案越一致越好,使计算机学会怎么做这类题。然后再让计算机去帮我们做没有提供答案的选择题(测试样本)。\n", 20 | "\n", 21 | "无监督学习是**只提供选择题,但是不提供标准答案**,计算机尝试分析这些题目之间的关系,对题目进行分类,计算机也不知道这几堆题的答案分别是什么,但计算机认为每一个类别内的题的答案应该是相同的。\n", 22 | "\n", 23 | "强化学习和无监督学习一样,**提供选择题但是不提供标准答案**,计算机尝试去做这些题,我们作为老师**批改计算机做的对不对**,对的越多,奖励越多,计算机努力调整自己的模型参数,希望自己推测的答案能够得到更多的奖励。不严谨的讲,可以理解为先无监督后有监督学习。\n", 24 | "\n", 25 | "另外,考虑到大部分数据或任务是存在相关性的,如果能把已经训练好的模型参数分享给新的模型来帮助新模型训练数据集,从而加快并优化模型的学习,不用像之前那样从零开始,这种学习方式称为**迁移学习**(Transfer Learning)。\n", 26 | "\n", 27 | "推荐系统(Recommendation System)\n", 28 | "\n", 29 | "学习理论(Learning Theory)" 30 | ] 31 | } 32 | ], 33 | "metadata": { 34 | "kernelspec": { 35 | "display_name": "Python 3", 36 | "language": "python", 37 | "name": "python3" 38 | }, 39 | "language_info": { 40 | "codemirror_mode": { 41 | "name": "ipython", 42 | "version": 3 43 | }, 44 | "file_extension": ".py", 45 | "mimetype": "text/x-python", 46 | "name": "python", 47 | "nbconvert_exporter": "python", 48 | "pygments_lexer": "ipython3", 49 | "version": "3.7.1" 50 | } 51 | }, 52 | "nbformat": 4, 53 | "nbformat_minor": 2 54 | } 55 | -------------------------------------------------------------------------------- /kNN/01. k-近邻算法概述.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**k-近邻**(k-Nearest Neighbor, 简称 kNN)是一种常见的监督学习算法,它是 Cover T 和 Hart P 在 1967 年提出的一种既可以用于分类,也可以用于回归的学习方法。它的工作原理非常简单:用一句古话来说就是,“近朱者赤,近墨者黑”,给定测试样本,基于某种距离度量找出训练集中与其最靠近的 k 个训练样本,这 k 个距离最近的样本就是 k-Nearest Neighbor。如果是在分类任务中,我们把这 k 个样本中出现最多的类别作为预测结果;如果是在回归任务中,我们把这 k 个样本的平均值作为预测结果。一般情况下,用 kNN 处理回归问题比较少见,我们这一节使用 kNN 解决分类问题。\n", 8 | "\n", 9 | "![](../images/knn.png)\n", 10 | "\n", 11 | "如上图所示,这是一个简单的分类任务,有两类不同的样本数据,分别用蓝色正方形和红色三角形表示,而图正中间的那个绿色的圆点则是待分类的数据。使用 kNN 来解决这个问题的基本思路是,从样本数据中找出距离绿色圆点最近的 k 个数据,然后看这 k 个数据中哪个类别出现的最多。可以看出,当 k = 3 时,有 2 个红色三角形和 1 个蓝色正方形,可以判定绿色圆点和红色三角形一类;但是当 k = 5 时,有 2 个红色三角形和 3 个蓝色正方形,可以判定绿色圆点和蓝色正方形一类。所以,在 kNN 算法中 k 值的选择很重要。\n", 12 | "\n", 13 | "在李航的《统计学习方法》中,将 **k值的选择**、**距离度量** 和 **分类决策规则** 作为 kNN 算法的三个基本要素:\n", 14 | "\n", 15 | "1. 首先是参数 k,它是一个超参数,它的取值对分类结果有着显著的影响,所以一般会通过交叉验证法,选择不同的 k 值进行计算,最后取一个分类结果最好的 k 值;\n", 16 | "2. 其次是距离度量的方法,最常用的是欧氏距离,除此之外,还有切比雪夫距离、马氏距离、巴氏距离等,要根据实际求解的问题来选择;\n", 17 | "3. 
最后是分类决策规则,通常使用多数表决规则(majority voting rule),即将 k 个最近的样本中出现最多的类别作为预测结果。\n", 18 | "\n", 19 | "要注意的是,kNN 算法没有显式的训练过程,它的训练开销为零,这种学习算法又被称为 **懒惰学习**(lazy learning),相反的,那些有训练过程的算法被称为 **急切学习**(eager learning)。\n", 20 | "\n", 21 | "### kNN 优缺点\n", 22 | "\n", 23 | "* 优点\n", 24 | " * 简单好用,容易理解,精度高,理论成熟,既可以用来做分类也可以用来做回归;\n", 25 | " * 可用于数值型数据和离散型数据;\n", 26 | " * 训练时间复杂度为O(n);无数据输入假定;\n", 27 | " * 对异常值不敏感;\n", 28 | "* 缺点\n", 29 | " * 计算复杂性高;空间复杂性高;\n", 30 | " * 样本不平衡问题(即有些类别的样本数量很多,而其它样本的数量很少);\n", 31 | " * 一般数值很大的时候不用这个,计算量太大。但是单个样本又不能太少,否则容易发生误分;\n", 32 | " * 最大的缺点是无法给出数据的内在含义;" 33 | ] 34 | } 35 | ], 36 | "metadata": { 37 | "kernelspec": { 38 | "display_name": "Python 3", 39 | "language": "python", 40 | "name": "python3" 41 | }, 42 | "language_info": { 43 | "codemirror_mode": { 44 | "name": "ipython", 45 | "version": 3 46 | }, 47 | "file_extension": ".py", 48 | "mimetype": "text/x-python", 49 | "name": "python", 50 | "nbconvert_exporter": "python", 51 | "pygments_lexer": "ipython3", 52 | "version": "3.7.1" 53 | } 54 | }, 55 | "nbformat": 4, 56 | "nbformat_minor": 2 57 | } 58 | -------------------------------------------------------------------------------- /集成学习/01. 什么是集成学习?.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "所谓“三个臭皮蛋,顶一个诸葛亮”,**集成学习**(ensemble learning)就是使用类似的思想,通过构建并结合多个学习器来完成学习任务,有时也叫做 **多分类器系统**(multi-classifier system)或 **基于委员会的学习**(committee-based learning)等。\n", 8 | "\n", 9 | "从集成学习的定义可以看出,集成学习的关键在于两点:构建和结合。如何构建出多个不同的学习器?如何将多个学习器的学习结果结合在一起输出?\n", 10 | "\n", 11 | "一般我们把构建出来的学习器称为 **个体学习器**(individual learner)或 **组件学习器**(component learner),如果个体学习器都是同一种类型,这样的集成叫做 **同质集成**(homogeneous),如果个体学习器类型不同,叫做 **异质集成**(heterogenous)。同质集成中的个体学习器又叫做 **基学习器**(base learner)。\n", 12 | "\n", 13 | "根据 Krogh 和 Vedelsby 在 1995 年提出的 **误差-分歧分解** 理论(error-ambiguity decomposition):个体学习器的准确性越高,多样性越大,则集成效果越好,我们可以知道,要获得好的集成结果,个体学习器要做到“好而不同”,即个体学习器要同时具有**准确性**(accuracy)和**多样性**(diversity)。不过准确性和多样性是存在冲突的,一般情况下,准备性很高时,增加多样性就需要牺牲准确性,集成学习的构建算法的核心就是如何产生“好而不同”的个体学习器。\n", 14 | "\n", 15 | "根据个体学习器的生成方式,集成学习可以分为两类:\n", 16 | "\n", 17 | "* 序列化方法:个体学习器之间存在强依赖关系,必须串行生成,代表算法为 **Boosting**。\n", 18 | "* 并行化方法:个体学习器之间不存在强依赖关系,可以同时生成,代表算法为 **Bagging** 和 **随机森林**(Random Forest)。\n", 19 | "\n", 20 | "生成个体学习器之后,集成学习的第二项重点工作就是结合。通过结合多个学习器有很多好处:可以提高泛化性能,可以避免学习陷入局部极小值,可以扩大假设空间学得更好的近似。那么,如何结合多个学习器的学习结果呢?一般来说有三种不同的结合策略:\n", 21 | "\n", 22 | "* 平均法(averaging):常用于数值型输出,可分为**简单平均法**(simple averaging)和**加权平均法**(weighted averaging),在个体学习器性能相差较大时适合采用加权平均法,而在个体学习器性能相近时采用简单平均法。\n", 23 | "* 投票法(voting):常用于分类任务,可分为 **绝对多数投票法**(majority voting)、**相对多数投票法**(plurality voting)和**加权投票法**(weighted voting),使用绝对多数投票法时,当某标记得票数超过一半则预测为该标记,否则拒绝预测,这在可靠性要求较高的学习任务中是一个很好的机制,如果学习任务必须提供预测结果,则绝对多数投票法将退化为相对多数投票法。\n", 24 | "* 学习法:这是一种更强大的组合策略,它使用另一个学习器来学习如何组合,这里我们把个体学习器称为**初级学习器**,用于结合的学习器称为**次级学习器**或**元学习器**(meta learner),典型算法为 **Stacking**。" 25 | ] 26 | } 27 | ], 28 | "metadata": { 29 | "kernelspec": { 30 | "display_name": "Python 3", 31 | "language": "python", 32 | "name": "python3" 33 | }, 34 | "language_info": { 35 | "codemirror_mode": { 36 | "name": "ipython", 37 | "version": 3 38 | }, 39 | "file_extension": ".py", 40 | "mimetype": "text/x-python", 41 | "name": "python", 42 | "nbconvert_exporter": "python", 43 | "pygments_lexer": "ipython3", 44 | "version": "3.7.1" 45 | } 46 | }, 47 | "nbformat": 4, 48 | "nbformat_minor": 2 49 | } 50 | -------------------------------------------------------------------------------- /神经网络/02. 
神经元模型.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "生物学家在很早的时候就开始研究人体中的 **神经元**(neuron)结构。在人体内,神经元的结构形式并非是完全相同的,但是无论结构形式如何,神经元都是由一些基本的成份组成的,主要包括三大部分:**细胞体**,**树突**(dendrites) 和 **轴突**(axon)。一个神经元通常具有多个 **树突**,用来接受传入信号,说白了就是接受其他神经元传递过来的化学物质,这些化学物质汇聚在细胞体内可以改变神经元的电位,如果电位超过一个阈值,那么神经元被激活,并通过 **轴突** 传递信息到其他神经元。如下图所示:\n", 8 | "\n", 9 | "\n", 10 | "\n", 11 | "基于上述的神经元结构,McCulloch 和 Pitts 在 1943 年提出了 **M-P 神经元模型**(由他们两的名字命名),在这个模型中,神经元接收 n 个输入,每个输入和神经元之间的连接赋有相应的权重,神经元通过某种计算(譬如带权求和)得到一个值,如果这个值超过神经元设定的阈值,就通过一个 **激活函数** 得到神经元的输出。如下图所示:\n", 12 | "\n", 13 | "\n", 14 | "\n", 15 | "常见的激活函数有:\n", 16 | "\n", 17 | "* 阶跃函数\n", 18 | "* Sigmoid 函数\n", 19 | "* tanh 函数\n", 20 | "* ReLU 函数\n", 21 | "* Softmax 函数\n", 22 | "\n", 23 | "
\n", 24 | " \n", 25 | " \n", 26 | " \n", 27 | "
\n", 28 | "\n", 29 | "把许多个这样的神经元按一定的层次结构连接起来,就得到了 **神经网络**(neural network)。\n", 30 | "\n", 31 | "除了 M-P 神经元模型,还有一些其他的神经元模型,比如脉冲神经元(spiking neuron)模型,但是 M-P 神经元模型是目前用的最多的和最广的,可谓是现在神经网络的基石。\n", 32 | "\n", 33 | "### 参考\n", 34 | "\n", 35 | "1. https://www.cnblogs.com/subconscious/p/5058741.html\n", 36 | "1. https://www.zhihu.com/question/22553761/answer/36429105\n", 37 | "1. http://fitzeng.org/2018/02/19/MLNeuralNetwork/" 38 | ] 39 | } 40 | ], 41 | "metadata": { 42 | "kernelspec": { 43 | "display_name": "Python 3", 44 | "language": "python", 45 | "name": "python3" 46 | }, 47 | "language_info": { 48 | "codemirror_mode": { 49 | "name": "ipython", 50 | "version": 3 51 | }, 52 | "file_extension": ".py", 53 | "mimetype": "text/x-python", 54 | "name": "python", 55 | "nbconvert_exporter": "python", 56 | "pygments_lexer": "ipython3", 57 | "version": "3.7.1" 58 | } 59 | }, 60 | "nbformat": 4, 61 | "nbformat_minor": 2 62 | } 63 | -------------------------------------------------------------------------------- /贝叶斯/04. 半朴素贝叶斯分类器.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "我们回顾一下朴素贝叶斯分类器,它的表达式如下:\n", 8 | "\n", 9 | "$$\n", 10 | "\\begin{align}\n", 11 | "y = f(x) &= \\mathop{\\arg\\max}_{c_k} P(Y=c_k|X=x) \\\\\n", 12 | "&= \\mathop{\\arg\\max}_{c_k} P(Y=c_k)P(X=x|Y=c_k) \\\\\n", 13 | "&= \\mathop{\\arg\\max}_{c_k} P(Y=c_k) \\prod_i P(X=x^{(i)}|Y=c_k)\n", 14 | "\\end{align}\n", 15 | "$$\n", 16 | "\n", 17 | "可以看出在使用贝叶斯公式计算后验概率 $P(Y=c_k|X=x)$ 时,类条件概率 $P(X=x|Y=c_k)$ 很难计算,它是所有属性上的联合概率,难以从有限的训练样本直接估计出来。所以,朴素贝叶斯分类器采用了一个很简单的假设:属性条件独立性假设,也就是说所有的属性相互独立,这样类条件概率可以写成每个属性的条件概率的乘积。\n", 18 | "\n", 19 | "不过在现实任务中,这个独立性假设往往很难成立,于是便有了 **半朴素贝叶斯分类器**(semi-naive Bayes classifiers),它的基本思想是适当的考虑一部分属性间的相互依赖信息,既不需要计算完全联合概率,又不至于彻底忽略了比较强的属性依赖关系。\n", 20 | "\n", 21 | "**独依赖估计**(One-Dependent Estimator,简称 ODE)是半朴素贝叶斯分类器最常用的一种策略,它假设每个属性在类别之外最多仅依赖于一个其他属性:\n", 22 | "\n", 23 | "$$\n", 24 | "y = f(x) = \\mathop{\\arg\\max}_{c_k} P(Y=c_k) \\prod_i P(X=x^{(i)}|Y=c_k, X=pa^{(i)})\n", 25 | "$$\n", 26 | "\n", 27 | "其中 $pa^{(i)}$ 为属性 $x^{(i)}$ 所依赖的属性,称为 $x^{(i)}$ 的父属性,若父属性 $pa^{(i)}$ 已知,概率 $P(X=x^{(i)}|Y=c_k, X=pa^{(i)})$ 也就很容易计算,所以问题也就转化为如何确定每个属性的父属性。常见的方法有:\n", 28 | "\n", 29 | "* SPODE(Super-Parent ODE),它假设所有属性都依赖于同一个属性,称为超父(super-parent),然后通过交叉验证等模型选择方法来确定超父属性\n", 30 | "* AODE(Averaged ODE),它尝试将每个属性作为超父来构建 SPODE,然后将那些具有足够训练数据支撑的 SPODE 通过集成学习汇聚起来得到结果\n", 31 | "* TAN(Tree Augmented naive Bayes)通过计算任意两个属性之间的 **条件互信息**(conditional mutual information) 构建完全图,再根据 **最大带权生成树**(maximum weighted spanning tree) 算法保留强相关属性之间的依赖性\n", 32 | "\n", 33 | "将 **属性条件独立性假设** 放松为 **独依赖假设** 可能获得泛化性能的提升,那么将 **独依赖假设**(ODE)再放松到 **k依赖假设**(kDE),是否能进一步提升泛化性能呢?要注意的是,随着 k 的增加,准确估计概率 $P(X=x^{(i)}|Y=c_k, X=pa^{(i)})$ 所需的训练样本数量将以指数级增加,因此,如果训练数据非常充分,泛化性能有可能提升,但是在有限样本条件下,效果不一定好。" 34 | ] 35 | } 36 | ], 37 | "metadata": { 38 | "kernelspec": { 39 | "display_name": "Python 3", 40 | "language": "python", 41 | "name": "python3" 42 | }, 43 | "language_info": { 44 | "codemirror_mode": { 45 | "name": "ipython", 46 | "version": 3 47 | }, 48 | "file_extension": ".py", 49 | "mimetype": "text/x-python", 50 | "name": "python", 51 | "nbconvert_exporter": "python", 52 | "pygments_lexer": "ipython3", 53 | "version": "3.7.1" 54 | } 55 | }, 56 | "nbformat": 4, 57 | "nbformat_minor": 2 58 | } 59 | -------------------------------------------------------------------------------- /神经网络/05. 
其他神经网络.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### 其他常见的神经网络\n", 8 | "\n", 9 | "* RBF网络\n", 10 | "\n", 11 | "使用 RBF(Radial Basis Fuction,径向基函数)作为隐层神经元的激活函数。\n", 12 | "\n", 13 | "* ART网络\n", 14 | "\n", 15 | "ART(Adaptive Resonance Theory,自适应谐振理论)网络是 **竞争型学习**(competitive learning)的重要代表,能比较好的解决竞争型学习中的 **可塑性-稳定性困境**(stability-plasticity dilemma),可塑性指神经网络要有学习新知识的能力,稳定性是指神经网络在学习新知识时要保持对旧知识的记忆。所以 ART 网络可以增量学习(incremental learning) 或 在线学习(online learning)。\n", 16 | "\n", 17 | "* SOM网络\n", 18 | "\n", 19 | "SOM(Self-Organizing Map,自组织映射)网络也是一种竞争型学习网络,能将高维输入数据映射到低维空间,同时保持输入空间在高维的拓扑结构。也叫做 **自组织特征映射网络** 或 **Kohonen 网络**。\n", 20 | "\n", 21 | "* 级联相关网络\n", 22 | "\n", 23 | "级联相关(Cascade Correlation)网络是 **结构自适应网络** 的重要代表,一般的神经网络模型网络结构是事先固定的,而结构自适应网络将网络结构也作为学习目标,在训练过程中,自动加入隐层节点。\n", 24 | "\n", 25 | "* Elman网络\n", 26 | "\n", 27 | "Elman网络是一种常见的 **递归神经网络**(recurrent neural networks),递归神经网络中允许出现环状结构,即可以让一些神经元的输出反馈回来作为输入信号,这使得网络在 t 时刻的输出状态不仅与 t 时刻的输入有关,还与 t-1 时刻的网络状态有关,从而可以处理与时间有关的动态变化。\n", 28 | "\n", 29 | "* Boltzmann机\n", 30 | "\n", 31 | "Boltzmann机也是一种递归神经网络,它是一种基于能量模型(energy-based model)的神经网络,能量最小化时网络达到理想状态,这时称为 **Boltzmann分布** 或 **平衡态**(equilibrium)或 **平稳分布**(stationary distribution)。标准的Boltzmann机是一个全连接图,训练复杂度很高,现实中常采用 **受限Boltzmann机**(RBM,Restricted Boltzmann Machine)。" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "### 深度学习\n", 39 | "\n", 40 | "深度学习(deep learning)就是多隐层神经网络,这样的网络难以使用BP算法训练,通常使用以下方法:\n", 41 | "\n", 42 | "* 无监督逐层训练:**预训练**(pre-training)+ **微调**(fine-tuning),比如在 **深度信念网络**(Deep Belief Network,简称 DBN)中每层都是一个受限 Boltzmann 机\n", 43 | "* 权共享(weight sharing),让一组神经元使用相同的连接权,比如在 **卷积神经网络**(Convolutional Neural Network,简称 CNN)中发挥了重要作用" 44 | ] 45 | } 46 | ], 47 | "metadata": { 48 | "kernelspec": { 49 | "display_name": "Python 3", 50 | "language": "python", 51 | "name": "python3" 52 | }, 53 | "language_info": { 54 | "codemirror_mode": { 55 | "name": "ipython", 56 | "version": 3 57 | }, 58 | "file_extension": ".py", 59 | "mimetype": "text/x-python", 60 | "name": "python", 61 | "nbconvert_exporter": "python", 62 | "pygments_lexer": "ipython3", 63 | "version": "3.7.1" 64 | } 65 | }, 66 | "nbformat": 4, 67 | "nbformat_minor": 2 68 | } 69 | -------------------------------------------------------------------------------- /线性回归/15. 
线性回归实例.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "经过前面的学习,我们来看一个使用线性回归解决真实问题的实例:使用线性回归检测水泥质量([案例来源](https://www.ibm.com/developerworks/cn/analytics/library/ba-lo-machine-learning-cement-quality/index.html))。\n", 8 | "\n", 9 | "已知类似于下面这样的水泥成分样本数据,如何得到一个水泥质量预测模型,能预测出水泥的好坏?\n", 10 | "\n", 11 | "![](../images/cement.png)\n", 12 | "\n", 13 | "其中,每一行代表一个样本,1-7 列是每立方米混合物中各个成分的重量(单位:千克),第 8 列是已使用天数,第 9 列 是该行水泥样本的强度(单位:MPa)。前 8 列 是水泥的输入属性,最后 1 列是输出,我们要让机器学习这些样本后得到一个模型,这个模型可以输入样本表格之外的数据,预测出水泥的强度。很显然,这是一个回归问题。\n", 14 | "\n", 15 | "我们假设该模型为线性模型,用线性回归去解决该问题,我们可以使用正规方程法或梯度下降法,上文中采用的是梯度下降法。\n", 16 | "\n", 17 | "不过在计算之前,我们观察到部分这样的样本数据:\n", 18 | "\n", 19 | "![](../images/cement-2.png)\n", 20 | "\n", 21 | "可以发现不同维度的数值范围差异会很大,塑化剂维度和粗粒料的维度值范围分别是 10 左右和 1000 多 ,如此大的差异如果用梯度下降,会使得不同方向的下降速度差异很大,迭代算法可能收敛得很慢甚至不收敛。如果直接将该样本送入训练,会使目标函数的形状太\"扁\",在找到最优解前,梯度下降的过程不仅是曲折的,也是非常耗时的,如下面图 3 所示,如果特征差异太大,就会像左图一样,在求最优解时,会得到一个窄长的椭圆形,导致在梯度下降时,梯度的方向为垂直等高线的方向而走之字形路线,这样会使迭代很慢;而如果特征差异不大,如右图会形成一个近似的圆形,则迭代就会很快。\n", 22 | "\n", 23 | "![](../images/cement-3.png)\n", 24 | "\n", 25 | "所以如果求解使用迭代算法,需要在计算前把不同的特征缩放到同一个范围内,比如控制在 [-1 1] 内。这个特征数据缩放的过程就是数据归一化。" 26 | ] 27 | }, 28 | { 29 | "cell_type": "markdown", 30 | "metadata": {}, 31 | "source": [ 32 | "### 标准化和归一化\n", 33 | "\n", 34 | "在实际的运用中我们选取的特征,比如长度,重量,面积等等,通常单位和范围都不同,这会导致梯度下降算法变慢,所以我们要将特征缩放到相对统一的范围内。通常的方法有 **标准化(Standardization)** 和 **归一化(Normalization)**。\n", 35 | "\n", 36 | "标准化是把数据变成符合标准的正态分布,由 **中心极限定理** 可知,当数据量足够大时,无论原来的数据是何种分布,都可以通过下面的更新公式转变成正态分布:\n", 37 | "\n", 38 | "$$\n", 39 | "x_i := \\frac{x_i-\\mu}{\\delta}\n", 40 | "$$\n", 41 | "\n", 42 | "归一化对梯度下降算法很友好,可以让算法最终收敛并且提高训练速度和精度,归一化的更新公式为:\n", 43 | "\n", 44 | "$$\n", 45 | "x_i := \\frac{x_i-min(x_i)}{max(x_i)-min(x_i)}\n", 46 | "$$" 47 | ] 48 | } 49 | ], 50 | "metadata": { 51 | "kernelspec": { 52 | "display_name": "Python 3", 53 | "language": "python", 54 | "name": "python3" 55 | }, 56 | "language_info": { 57 | "codemirror_mode": { 58 | "name": "ipython", 59 | "version": 3 60 | }, 61 | "file_extension": ".py", 62 | "mimetype": "text/x-python", 63 | "name": "python", 64 | "nbconvert_exporter": "python", 65 | "pygments_lexer": "ipython3", 66 | "version": "3.7.1" 67 | } 68 | }, 69 | "nbformat": 4, 70 | "nbformat_minor": 2 71 | } 72 | -------------------------------------------------------------------------------- /逻辑回归/06. 
使用牛顿法解逻辑回归.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "在前面的学习中,我们知道逻辑回归问题可以转换为求解似然函数的极大值:\n", 8 | "\n", 9 | "$$\n", 10 | "\\mathop{\\arg \\max}_{\\theta} \\prod_{i=1}^n p(y_i | x_i; \\theta)\n", 11 | "$$\n", 12 | "\n", 13 | "再通过一系列的转换,最终得到逻辑回归的损失函数:\n", 14 | "\n", 15 | "$$\n", 16 | "loss = \\sum_{i=1}^n [ -y_i(\\theta^Tx_i) + \\ln (1+e^{\\theta^Tx_i}) ]\n", 17 | "$$\n", 18 | "\n", 19 | "为了让损失函数最小,然后使用了梯度下降算法对参数 $\\theta$ 不断迭代,最终得出参数 $\\theta$。\n", 20 | "\n", 21 | "它的梯度如下:\n", 22 | "\n", 23 | "$$\n", 24 | "\\frac{\\partial loss}{\\partial \\theta} = \\sum_{i=1}^n (\\pi(x_i) -y_i)x_i\n", 25 | "$$\n", 26 | "\n", 27 | "其中,\n", 28 | "\n", 29 | "$$\n", 30 | "\\pi(x) = p(y=1|x) = \\frac{e^{\\theta^Tx}}{1+e^{\\theta^Tx}}\n", 31 | "$$\n", 32 | "\n", 33 | "最后得出梯度下降的迭代公式如下:\n", 34 | "\n", 35 | "$$\n", 36 | "\\theta_j := \\theta_j + \\eta \\sum_{i=1}^n (y^{(i)} - \\pi(x^{(i)}))x_j^{(i)}\n", 37 | "$$\n", 38 | "\n", 39 | "梯度下降还可以进一步衍生出随机梯度下降法,都可以很好的求解逻辑回归问题,这一节将介绍一种新的方法,**牛顿法**(Newton method),它可以利用曲线本身的信息,比梯度下降法更容易收敛,迭代次数更少。" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "## 牛顿法解一元函数极值\n", 47 | "\n", 48 | "## 牛顿法解多元函数极值\n", 49 | "\n", 50 | "## 牛顿法解逻辑回归\n", 51 | "\n", 52 | "牛顿法迭代公式为:\n", 53 | "\n", 54 | "$$\n", 55 | "\\theta := \\theta - \\frac{loss'(\\theta)}{loss''(\\theta)}\n", 56 | "$$\n", 57 | "\n", 58 | "其中,$loss'(\\theta)$ 为损失函数关于 $\\theta$ 的一阶导数:\n", 59 | "\n", 60 | "$$\n", 61 | "loss'(\\theta) = \\frac{\\partial loss}{\\partial \\theta} = \\sum_{i=1}^n (\\pi(x_i) -y_i)x_i\n", 62 | "$$\n", 63 | "\n", 64 | "$loss''(\\theta)$ 为损失函数关于 $\\theta$ 的二阶导数:\n", 65 | "\n", 66 | "$$\n", 67 | "loss''(\\theta) = \\frac{\\partial^2 loss}{\\partial \\theta \\partial \\theta^T} = \\sum_{i=1}^n x_ix_i^T \\pi(x_i)(1-\\pi(x_i))\n", 68 | "$$\n", 69 | "\n", 70 | "https://blog.csdn.net/baimafujinji/article/details/51179381" 71 | ] 72 | } 73 | ], 74 | "metadata": { 75 | "kernelspec": { 76 | "display_name": "Python 3", 77 | "language": "python", 78 | "name": "python3" 79 | }, 80 | "language_info": { 81 | "codemirror_mode": { 82 | "name": "ipython", 83 | "version": 3 84 | }, 85 | "file_extension": ".py", 86 | "mimetype": "text/x-python", 87 | "name": "python", 88 | "nbconvert_exporter": "python", 89 | "pygments_lexer": "ipython3", 90 | "version": "3.7.1" 91 | } 92 | }, 93 | "nbformat": 4, 94 | "nbformat_minor": 2 95 | } 96 | -------------------------------------------------------------------------------- /聚类/3. 
k-means 算法.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**k均值**(k-means)算法最初由 MacQueen 在 1967 年提出,是最著名的原型聚类算法之一。假设样本集被划分为 k 类 $\\{C_1, C_2, \\dots, C_k\\}$,我们可以计算出每个类的均值向量,也就是类的中心位置:\n", 8 | "\n", 9 | "$$\n", 10 | "\\mu_i = \\frac{\\sum_{x \\in C_i}x}{|C_i|}\n", 11 | "$$\n", 12 | "\n", 13 | "得到每个类的均值向量后,可以计算出每个样本到均值向量的平方误差之和:\n", 14 | "\n", 15 | "$$\n", 16 | "E(C_i) = \\sum_{x \\in C_i} \\|x-\\mu_i\\|^2_2\n", 17 | "$$\n", 18 | "\n", 19 | "将 k 个类的平方误差之和累加起来,就是 k 均值聚类算法的损失函数:\n", 20 | "\n", 21 | "$$\n", 22 | "E(C) = \\sum_{i=1}^k E(C_i)\n", 23 | "$$\n", 24 | "\n", 25 | "函数 $E(C)$ 也称为能量,它表示簇内样本围绕簇均值向量的紧密程度,E 值越小,表示簇内样本相似度越高,k 均值算法就是试图最小化该函数。不过要计算该函数的最小值,必须计算出样本集所有可能的簇划分,这是一个 NP 难问题,因此 k 均值算法采用贪心策略,通过迭代优化来近似求解。\n", 26 | "\n", 27 | "k 均值算法流程如下:\n", 28 | "\n", 29 | "* 从样本集 D 中随机选择 k 个样本作为初始簇,每个样本的值就是该簇的均值向量\n", 30 | "* 计算每个样本到均值向量的距离,选择距离最近的均值向量将每个样本划分到相应的簇里\n", 31 | "* 根据每个簇里的样本,重新计算每个簇的均值向量\n", 32 | "* 不断重复上述过程,直到迭代收敛或达到停止条件\n", 33 | "\n", 34 | "k 均值聚类算法的时间复杂度是 O(mnk),其中 m 是样本维数,n 是样本个数,k 是类别个数。" 35 | ] 36 | }, 37 | { 38 | "cell_type": "markdown", 39 | "metadata": {}, 40 | "source": [ 41 | "### 学习向量量化\n", 42 | "\n", 43 | "**学习向量量化**(Learning Vector Quantization,简称 LVQ)和 k-means 一样,也是一种原型聚类算法。不过 LVQ 假设数据样本带有类别标记,学习过程利用样本的这些监督信息来辅助聚类。\n", 44 | "\n", 45 | "首先随机初始化一组原型向量 $\\{p_1, p_2, \\dots, p_q\\}$,q 是一个超参数,代表簇的个数,然后从样本集中随机选择一个样本 $(x_j, y_j)$,计算样本 $x_j$ 和 原型向量中每个向量的距离,找出和 $x_j$ 距离最近的向量 $p_{i^*}$,如果样本 $x_j$ 的类别 $y_j$ 和 原型向量 $p_{i^*}$ 的类别相同,则使用下面的公式更新原型向量:\n", 46 | "\n", 47 | "$$\n", 48 | "p' = p_{i^*} + \\eta \\cdot (x_j - p_{i^*})\n", 49 | "$$\n", 50 | "\n", 51 | "如果样本 $x_j$ 的类别 $y_j$ 和 原型向量 $p_{i^*}$ 的类别不同,则使用下面的公式更新原型向量:\n", 52 | "\n", 53 | "$$\n", 54 | "p' = p_{i^*} - \\eta \\cdot (x_j - p_{i^*})\n", 55 | "$$\n", 56 | "\n", 57 | "直观上看,如果样本和原型向量类别相同,就让原型向量向样本靠拢,否则就远离样本。\n", 58 | "\n", 59 | "重复上面的迭代步骤,一直到满足停止条件为止(达到最大迭代次数,或者原型向量更新幅度很小甚至不再更新)。\n", 60 | "\n", 61 | "最终学的原型向量可以实现对样本空间的簇划分,将每个样本划入与其距离最近的原型向量即可,也就是说每个原型向量对应一个划分区域,在这个区域里,每个样本和原型向量的距离不大于它和其他原型向量的距离,这样的划分通常称为 **Voronoi 剖分**(Voronoi tessellation)。\n", 62 | "\n", 63 | "如果使用原型向量来表示划分区域中的样本,则可以实现数据的 **有损压缩**(lossy compression),这称为 **向量量化**(vector quantization),这就是 LVQ 名字的由来。" 64 | ] 65 | } 66 | ], 67 | "metadata": { 68 | "kernelspec": { 69 | "display_name": "Python 3", 70 | "language": "python", 71 | "name": "python3" 72 | }, 73 | "language_info": { 74 | "codemirror_mode": { 75 | "name": "ipython", 76 | "version": 3 77 | }, 78 | "file_extension": ".py", 79 | "mimetype": "text/x-python", 80 | "name": "python", 81 | "nbconvert_exporter": "python", 82 | "pygments_lexer": "ipython3", 83 | "version": "3.7.1" 84 | } 85 | }, 86 | "nbformat": 4, 87 | "nbformat_minor": 2 88 | } 89 | -------------------------------------------------------------------------------- /支持向量机/01. 
什么是支持向量机.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**支持向量机**(Support Vector Machine,简称 SVM)和逻辑回归、朴素贝叶斯一样,也是一种有监督分类算法,它的目标是通过最大化训练数据集的间隔找到一个最优划分超平面。\n", 8 | "\n", 9 | "那么什么是支持向量?什么是间隔?什么是最优划分超平面呢?\n", 10 | "\n", 11 | "给定如下图所示的训练样本,很显然这是二维平面上的一个二分类问题,一个最基本的分类想法是能不能找一条直线把两个样本类别划分成两个部分?这条直线有点像我们之前在逻辑回归中介绍的 **决策边界**,而在支持向量机中,我们称之为 **划分超平面**。但是能将样本分开的划分超平面可能会有多个,就像下图这样,到底哪个才是最好的呢?\n", 12 | "\n", 13 | "![](../images/svm-1.jpg)\n", 14 | "\n", 15 | "直观上看,我们应该去找位于这个类别正中间的划分超平面,尽可能的远离所有类别的数据点,这样它对训练样本的噪声影响最小,容忍性最好,也就是说,基于这个划分超平面所产生的分类结果是最鲁棒的,泛化能力最强。这个超平面叫做 **最优划分超平面**。\n", 16 | "\n", 17 | "那么如何去找最优划分超平面呢?这就要引出间隔的概念。我们知道平面可以用线性方程来表示,所以划分超平面可以写成:\n", 18 | "\n", 19 | "$$\n", 20 | "w^Tx + b = 0\n", 21 | "$$\n", 22 | "\n", 23 | "样本空间中的任意点到划分超平面的距离为:\n", 24 | "\n", 25 | "$$\n", 26 | "r = \\frac{|w^Tx + b|}{\\|w\\|}\n", 27 | "$$\n", 28 | "\n", 29 | "假设 $w^Tx + b \\ge +1$ 时为正例,$w^Tx + b \\le -1$ 时为负例,如下图所示:\n", 30 | "\n", 31 | "![](../images/svm-2.jpg)\n", 32 | "\n", 33 | "从图中可以看出,距离划分超平面最近的正例样本位于直线 $w^Tx + b = 1$ 上,而距离划分超平面最近的负例样本位于直线 $w^Tx + b = -1$ 上,这两条直线上的样本被称为 **支持向量**(support vector),这两条直线之间的距离被称为 **间隔**(margin),很显然,间隔的值为:\n", 34 | "\n", 35 | "$$\n", 36 | "\\gamma = \\frac{2}{\\|w\\|}\n", 37 | "$$\n", 38 | "\n", 39 | "要找到最优划分超平面,也就是让间隔最大化,所以问题可以转换为求优化问题的解:\n", 40 | "\n", 41 | "$$\n", 42 | "\\mathop \\max_{w,b} \\frac{2}{\\|w\\|}\n", 43 | "$$\n", 44 | "\n", 45 | "当然这个优化问题有一个约束条件:\n", 46 | "\n", 47 | "$$\n", 48 | "\\left\\{\n", 49 | "\\begin{align}\n", 50 | "w^Tx_i + b \\ge +1, y_i = +1 \\\\\n", 51 | "w^Tx_i + b \\le -1, y_i = -1\n", 52 | "\\end{align}\n", 53 | "\\right.\n", 54 | "$$\n", 55 | "\n", 56 | "约束条件也可以简写成:\n", 57 | "\n", 58 | "$$\n", 59 | "y_i(w^Tx_i + b) \\ge 1\n", 60 | "$$\n", 61 | "\n", 62 | "上面这个有约束条件的最优化问题就是支持向量机的基本形式,一般写成:\n", 63 | "\n", 64 | "$$\n", 65 | "\\begin{align}\n", 66 | "&\\mathop \\max_{w,b} \\frac{2}{\\|w\\|} \\\\\n", 67 | "&s.t. y_i(w^Tx_i + b) \\ge 1, i = 1,2,\\dots,n\n", 68 | "\\end{align}\n", 69 | "$$\n", 70 | "\n", 71 | "其中,$s.t.$ 是 subject to 的缩写,表示受限于,是前面优化目标的约束条件。稍作转换,支持向量机还能写成最小化的形式:\n", 72 | "\n", 73 | "$$\n", 74 | "\\begin{align}\n", 75 | "&\\mathop \\min_{w,b} \\frac{1}{2}\\|w\\|^2 \\\\\n", 76 | "&s.t. y_i(w^Tx_i + b) \\ge 1, i = 1,2,\\dots,n\n", 77 | "\\end{align}\n", 78 | "$$" 79 | ] 80 | } 81 | ], 82 | "metadata": { 83 | "kernelspec": { 84 | "display_name": "Python 3", 85 | "language": "python", 86 | "name": "python3" 87 | }, 88 | "language_info": { 89 | "codemirror_mode": { 90 | "name": "ipython", 91 | "version": 3 92 | }, 93 | "file_extension": ".py", 94 | "mimetype": "text/x-python", 95 | "name": "python", 96 | "nbconvert_exporter": "python", 97 | "pygments_lexer": "ipython3", 98 | "version": "3.7.1" 99 | } 100 | }, 101 | "nbformat": 4, 102 | "nbformat_minor": 2 103 | } 104 | -------------------------------------------------------------------------------- /贝叶斯/03. 
贝叶斯分类实例.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### 实例一、预测性别\n", 8 | "\n", 9 | "在日常生活中,我们往往可以根据一个人的名字判断这个人是男是女,比如李大志,这个名字一听就是男性,因为大和志在男性的名字中用的比较多。通过贝叶斯定理,我们可以求得:\n", 10 | "\n", 11 | "$$\n", 12 | "P(gender=male | name=DaZhi) = \\frac{P(name=DaZhi | gender=male)P(gender=male)}{P(name=DaZhi)}\n", 13 | "$$\n", 14 | "\n", 15 | "$$\n", 16 | "P(gender=female | name=DaZhi) = \\frac{P(name=DaZhi | gender=female)P(gender=female)}{P(name=DaZhi)}\n", 17 | "$$\n", 18 | "\n", 19 | "其中,$P(name=DaZhi)$ 可以不用计算,$P(gender=female)$ 和 $P(gender=male)$ 一般来说差别不大,所以问题可以近似为:比较 $P(name=DaZhi | gender=female)$ 和 $P(name=DaZhi | gender=male)$ 的大小。\n", 20 | "\n", 21 | "### 实例二、预测西瓜好坏\n", 22 | "\n", 23 | "在周志华的西瓜书上,有这样一个数据集:\n", 24 | "\n", 25 | "|编号|色泽|根蒂|敲声|纹理|脐部|触感|密度|含糖率|好瓜|\n", 26 | "|---|---|---|---|----|----|---|---|------|---|\n", 27 | "|1|青绿|蜷缩|浊响|清晰|凹陷|硬滑|0.697|0.46|是|\n", 28 | "|2|乌黑|蜷缩|沉闷|清晰|凹陷|硬滑|0.774|0.376|是|\n", 29 | "|3|乌黑|蜷缩|浊响|清晰|凹陷|硬滑|0.634|0.264|是|\n", 30 | "|4|青绿|蜷缩|沉闷|清晰|凹陷|硬滑|0.608|0.318|是|\n", 31 | "|5|浅白|蜷缩|浊响|清晰|凹陷|硬滑|0.556|0.215|是|\n", 32 | "|6|青绿|稍蜷|浊响|清晰|稍凹|软粘|0.403|0.237|是|\n", 33 | "|7|乌黑|稍蜷|浊响|稍糊|稍凹|软粘|0.481|0.149|是|\n", 34 | "|8|乌黑|稍蜷|浊响|清晰|稍凹|硬滑|0.437|0.211|是|\n", 35 | "|9|乌黑|稍蜷|沉闷|稍糊|稍凹|硬滑|0.666|0.091|否|\n", 36 | "|10|青绿|硬挺|清脆|清晰|平坦|软粘|0.243|0.267|否|\n", 37 | "|11|浅白|硬挺|清脆|模糊|平坦|硬滑|0.245|0.057|否|\n", 38 | "|12|浅白|蜷缩|浊响|模糊|平坦|软粘|0.343|0.099|否|\n", 39 | "|13|青绿|稍蜷|浊响|稍糊|凹陷|硬滑|0.639|0.161|否|\n", 40 | "|14|浅白|稍蜷|沉闷|稍糊|凹陷|硬滑|0.657|0.198|否|\n", 41 | "|15|乌黑|稍蜷|浊响|清晰|稍凹|软粘|0.36|0.37|否|\n", 42 | "|16|浅白|蜷缩|浊响|模糊|平坦|硬滑|0.593|0.042|否| \n", 43 | "|17|青绿|蜷缩|沉闷|稍糊|稍凹|硬滑|0.719|0.103|否|\n", 44 | "\n", 45 | "如果这时拿到一个西瓜,特征如下:\n", 46 | "\n", 47 | "|色泽|根蒂|敲声|纹理|脐部|触感|密度|含糖率|\n", 48 | "|---|---|---|----|----|---|---|------|\n", 49 | "|青绿|蜷缩|浊响|清晰|凹陷|硬滑|0.697|0.46|\n", 50 | "\n", 51 | "那么如何判断这个瓜是好瓜还是坏瓜?\n", 52 | "\n", 53 | "### 实例三、拼写检查\n", 54 | "\n", 55 | "当用户在输入单词时不小心拼写出现了错误,如何返回他想输入的拼写正确的单词呢?比如,用户输入单词 thew,那么他是想输入 the,还是想输入 thaw?这种拼写检查的问题等同于分类问题:在许多可能拼写正确单词中,到底哪一个最有可能呢?\n", 56 | "\n", 57 | "假设用户输入的单词为 w,要返回的正确单词为 c,拼写检查就是求最大可能的 c:\n", 58 | "\n", 59 | "$$\n", 60 | "\\mathop{\\arg \\max}_c P(c|w) = \\mathop{\\arg \\max}_c P(w|c)P(c)\n", 61 | "$$\n", 62 | "\n", 63 | "$P(c)$ 表示单词 c 出现的概率,这个通过统计文本库中单词 c 出现的频率可以求出来。$P(w|c)$ 表示在输入单词 c 时输错成 w 的概率,这需要对大量的输入错误进行统计,这个数据一般不太容易得到,Peter Norvig 在他的一篇文章 [How to Write a Spelling Corrector](http://norvig.com/spell-correct.html) 中介绍了一种方法来近似这个,他使用的是**编辑距离**(edit distance)。" 64 | ] 65 | } 66 | ], 67 | "metadata": { 68 | "kernelspec": { 69 | "display_name": "Python 3", 70 | "language": "python", 71 | "name": "python3" 72 | }, 73 | "language_info": { 74 | "codemirror_mode": { 75 | "name": "ipython", 76 | "version": 3 77 | }, 78 | "file_extension": ".py", 79 | "mimetype": "text/x-python", 80 | "name": "python", 81 | "nbconvert_exporter": "python", 82 | "pygments_lexer": "ipython3", 83 | "version": "3.7.1" 84 | } 85 | }, 86 | "nbformat": 4, 87 | "nbformat_minor": 2 88 | } 89 | -------------------------------------------------------------------------------- /线性回归/17. 
求解岭回归和LASSO回归.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### 岭回归的求解\n", 8 | "\n", 9 | "岭回归就是 **基于 $L_2$ 约束的最小二乘法**,它的损失函数为:\n", 10 | "\n", 11 | "$$\n", 12 | "loss = \\sum_{i=1}^m (y_i - w^Tx_i)^2 + \\lambda \\| w \\|^2\n", 13 | "$$\n", 14 | "\n", 15 | "我们将损失函数表示成矩阵形式:\n", 16 | "\n", 17 | "$$\n", 18 | "loss = (y - w^TX)^2 + \\lambda w^2\n", 19 | "$$\n", 20 | "\n", 21 | "和最小二乘的正规方程解法一样,我们对 $w$ 求偏导:\n", 22 | "\n", 23 | "$$\n", 24 | "\\frac {\\partial}{\\partial w}loss = 2X^T(Xw-y)+2 \\lambda w\n", 25 | "$$\n", 26 | "\n", 27 | "令其等于 0,求得损失函数最小时 $w$ 的解析解:\n", 28 | "\n", 29 | "$$\n", 30 | "\\hat{w} = (X^TX + \\lambda I)^{-1}X^Ty\n", 31 | "$$" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "### 一般 $L_2$ 约束的最小二乘求解\n", 39 | "\n", 40 | "在前面介绍岭回归的时候,我们不仅介绍了 $L_2$ 约束,还将其扩展到更一般的情况:\n", 41 | "\n", 42 | "$$\n", 43 | "\\mathop{\\min}_{\\theta}J(\\theta), \\theta^TG\\theta \\leqslant R\n", 44 | "$$\n", 45 | "\n", 46 | "它的损失函数可以表示成:\n", 47 | "\n", 48 | "$$\n", 49 | "loss = (y - w^TX)^2 + \\lambda w^TGw\n", 50 | "$$\n", 51 | "\n", 52 | "我们可以对其求偏导,得到 $w$ 的解析解:\n", 53 | "\n", 54 | "$$\n", 55 | "\\hat{w} = (X^TX + \\lambda G)^{-1}X^Ty\n", 56 | "$$" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "### LASSO 回归的求解\n", 64 | "\n", 65 | "LASSO 回归是 **基于 L_1 约束的最小二乘法**,它的损失函数为:\n", 66 | "\n", 67 | "$$\n", 68 | "loss = \\sum_{i=1}^m (y_i - w^Tx_i)^2 + \\lambda \\| w \\|\n", 69 | "$$\n", 70 | "\n", 71 | "这里的 $\\| w \\|$ 是对所有的 $w_i$ 绝对值进行求和,我们知道绝对值函数不能求导,所以不能像上面那样通过求偏导来求解。\n", 72 | "\n", 73 | "1. 解法一:通过二次函数在绝对值函数的上方进行控制\n", 74 | "\n", 75 | "$$\n", 76 | "|\\theta| \\leqslant \\frac{\\theta^2}{2c} + \\frac{c}{2}\n", 77 | "$$\n", 78 | "\n", 79 | "得到 $\\theta$ 的迭代更新公式:\n", 80 | "\n", 81 | "$$\n", 82 | "\\theta := (X^TX + \\lambda\\Theta^\\dagger)^{-1}X^Ty\n", 83 | "$$\n", 84 | "\n", 85 | "其中,$\\Theta = diag(|\\theta_1|, \\dots, |\\theta_b|)$,$\\Theta^\\dagger$ 表示 $\\Theta$ 的广义逆。\n", 86 | "\n", 87 | "具体的求解过程参见《图解机器学习》p45.\n", 88 | "\n", 89 | "2. 解法二:使用近端梯度下降(Proximal Gradient Descent,简称 PGD)\n", 90 | "\n", 91 | "L-Lipschitz 条件、二阶泰勒展开\n", 92 | "\n", 93 | "参见《机器学习》p253.\n", 94 | "\n", 95 | "3. 
解法三:拟牛顿法\n", 96 | "\n", 97 | "BFGS、L-BFGS、Armijo 搜索准则、Sherman-Morrison 公式" 98 | ] 99 | } 100 | ], 101 | "metadata": { 102 | "kernelspec": { 103 | "display_name": "Python 3", 104 | "language": "python", 105 | "name": "python3" 106 | }, 107 | "language_info": { 108 | "codemirror_mode": { 109 | "name": "ipython", 110 | "version": 3 111 | }, 112 | "file_extension": ".py", 113 | "mimetype": "text/x-python", 114 | "name": "python", 115 | "nbconvert_exporter": "python", 116 | "pygments_lexer": "ipython3", 117 | "version": "3.7.1" 118 | } 119 | }, 120 | "nbformat": 4, 121 | "nbformat_minor": 2 122 | } 123 | -------------------------------------------------------------------------------- /其他资料/书籍/data/1.txt: -------------------------------------------------------------------------------- 1 | -0.017612 14.053064 0 2 | -1.395634 4.662541 1 3 | -0.752157 6.538620 0 4 | -1.322371 7.152853 0 5 | 0.423363 11.054677 0 6 | 0.406704 7.067335 1 7 | 0.667394 12.741452 0 8 | -2.460150 6.866805 1 9 | 0.569411 9.548755 0 10 | -0.026632 10.427743 0 11 | 0.850433 6.920334 1 12 | 1.347183 13.175500 0 13 | 1.176813 3.167020 1 14 | -1.781871 9.097953 0 15 | -0.566606 5.749003 1 16 | 0.931635 1.589505 1 17 | -0.024205 6.151823 1 18 | -0.036453 2.690988 1 19 | -0.196949 0.444165 1 20 | 1.014459 5.754399 1 21 | 1.985298 3.230619 1 22 | -1.693453 -0.557540 1 23 | -0.576525 11.778922 0 24 | -0.346811 -1.678730 1 25 | -2.124484 2.672471 1 26 | 1.217916 9.597015 0 27 | -0.733928 9.098687 0 28 | -3.642001 -1.618087 1 29 | 0.315985 3.523953 1 30 | 1.416614 9.619232 0 31 | -0.386323 3.989286 1 32 | 0.556921 8.294984 1 33 | 1.224863 11.587360 0 34 | -1.347803 -2.406051 1 35 | 1.196604 4.951851 1 36 | 0.275221 9.543647 0 37 | 0.470575 9.332488 0 38 | -1.889567 9.542662 0 39 | -1.527893 12.150579 0 40 | -1.185247 11.309318 0 41 | -0.445678 3.297303 1 42 | 1.042222 6.105155 1 43 | -0.618787 10.320986 0 44 | 1.152083 0.548467 1 45 | 0.828534 2.676045 1 46 | -1.237728 10.549033 0 47 | -0.683565 -2.166125 1 48 | 0.229456 5.921938 1 49 | -0.959885 11.555336 0 50 | 0.492911 10.993324 0 51 | 0.184992 8.721488 0 52 | -0.355715 10.325976 0 53 | -0.397822 8.058397 0 54 | 0.824839 13.730343 0 55 | 1.507278 5.027866 1 56 | 0.099671 6.835839 1 57 | -0.344008 10.717485 0 58 | 1.785928 7.718645 1 59 | -0.918801 11.560217 0 60 | -0.364009 4.747300 1 61 | -0.841722 4.119083 1 62 | 0.490426 1.960539 1 63 | -0.007194 9.075792 0 64 | 0.356107 12.447863 0 65 | 0.342578 12.281162 0 66 | -0.810823 -1.466018 1 67 | 2.530777 6.476801 1 68 | 1.296683 11.607559 0 69 | 0.475487 12.040035 0 70 | -0.783277 11.009725 0 71 | 0.074798 11.023650 0 72 | -1.337472 0.468339 1 73 | -0.102781 13.763651 0 74 | -0.147324 2.874846 1 75 | 0.518389 9.887035 0 76 | 1.015399 7.571882 0 77 | -1.658086 -0.027255 1 78 | 1.319944 2.171228 1 79 | 2.056216 5.019981 1 80 | -0.851633 4.375691 1 81 | -1.510047 6.061992 0 82 | -1.076637 -3.181888 1 83 | 1.821096 10.283990 0 84 | 3.010150 8.401766 1 85 | -1.099458 1.688274 1 86 | -0.834872 -1.733869 1 87 | -0.846637 3.849075 1 88 | 1.400102 12.628781 0 89 | 1.752842 5.468166 1 90 | 0.078557 0.059736 1 91 | 0.089392 -0.715300 1 92 | 1.825662 12.693808 0 93 | 0.197445 9.744638 0 94 | 0.126117 0.922311 1 95 | -0.679797 1.220530 1 96 | 0.677983 2.556666 1 97 | 0.761349 10.693862 0 98 | -2.168791 0.143632 1 99 | 1.388610 9.341997 0 100 | 0.317029 14.739025 0 -------------------------------------------------------------------------------- /神经网络/03. 
从感知机到神经网络.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "通过前面的学习,我们掌握了感知机模型和神经元模型,我们稍微对比一下。\n", 8 | "\n", 9 | "感知机模型如下:\n", 10 | "\n", 11 | "$$\n", 12 | "y = sign(w^Tx + b)\n", 13 | "$$\n", 14 | "\n", 15 | "其中 $sign(z)$ 为符号函数,也就是阶跃函数。\n", 16 | "\n", 17 | "神经元模型如下:\n", 18 | "\n", 19 | "$$\n", 20 | "y = f(\\sum_{i=1}^n w_i x_i - \\theta)\n", 21 | "$$\n", 22 | "\n", 23 | "其中 $f(z)$ 可以是 $sign(z)$ 或 $sigmoid(z)$ 或其他类型的激活函数。\n", 24 | "\n", 25 | "可以看出两个模型实际上是一样的,可以说感知机就是一个简化版神经元,如下所示:\n", 26 | "\n", 27 | "\n", 28 | "\n", 29 | "这实际上是一个神经网络,左边的输入节点被称为 **输入层**,右边的输出节点被称为 **输出层**。要注意的是,一般我们根据网络中计算节点的层数来命名神经网络,在上面的模型中,左边的输入层只负责传输数据,并没有进行计算,只有右边的输出层进行了计算,所以被称为 **单层神经网络**。有计算过程的神经元叫做 **功能神经元**(functional neuron)。" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "我们用一个例子来展示下具体的神经元模型,譬如我们要解决下面的三种分类问题(AND、OR、NOT):\n", 37 | "\n", 38 | "
\n", 39 | " \n", 40 | " \n", 41 | " \n", 42 | "
\n", 43 | "\n", 44 | "很显然这三个问题都是线性可分的,我们可以构造出如下的神经元模型:\n", 45 | "\n", 46 | "\n", 47 | "\n", 48 | "但是如果我们要处理下面的异或问题,这是一个线性不可分问题,单层的神经网络就无法解决了。\n", 49 | "\n", 50 | "\n", 51 | "\n", 52 | "我们在神经网络中增加一层,构造出下面这样的神经网络:\n", 53 | "\n", 54 | "\n", 55 | "\n", 56 | "这是一个经典的两层神经网络,除了最左边的输入层和最右边的输出层,我们在中间增加了一层,被称为 **隐藏层**(hidden layer),这样的两层神经网络有时候也被称为 **单隐层神经网络**。可以看出,在这个网络里每一层神经元与下一层神经元全互联,神经元之间不存在同层连接,也不存在跨层连接,这样的神经网络通常称为 **多层前馈神经网络**(multi-layer feedforward neural networks),所谓的前馈指的是,数据从输入节点到输出节点向前传递,没有输出的信息传递到输入层,即没有反馈,表现在图形上是有向图没有回路。当然也有 **反馈神经网络**(recurrent neural network),比如 Hopfield 网络、布尔兹曼机等。神经网络的学习过程,就是根据训练数据来调整神经元之间的连接权重和每个神经元的阈值。\n", 57 | "\n", 58 | "两层神经网络比单层神经网络具有更强的表现力,理论上可以证明,两层神经网络可以无限逼近任意连续函数,也就是说它可以解决非线性分类问题。为什么单层神经网络只能处理线性问题,而增加一层就可以处理非线性问题呢?这是因为增加的一层神经元,相当于对数据作了空间变换。也就是说,隐藏层对原始的数据进行了一个空间变换,使其可以被线性分类,然后输出层的决策分界划出了一个线性分类分界线,对其进行分类。\n", 59 | "\n", 60 | "在多层神经网络的基础上,后来又继续发展成了 **深度学习**(deep learning),其在语音识别、图像识别等领域得到了广泛的应用。" 61 | ] 62 | }, 63 | { 64 | "cell_type": "markdown", 65 | "metadata": {}, 66 | "source": [ 67 | "### 参考\n", 68 | "\n", 69 | "1. https://www.cnblogs.com/subconscious/p/5058741.html\n", 70 | "1. https://www.zhihu.com/question/22553761/answer/36429105\n", 71 | "1. http://fitzeng.org/2018/02/19/MLNeuralNetwork/" 72 | ] 73 | } 74 | ], 75 | "metadata": { 76 | "kernelspec": { 77 | "display_name": "Python 3", 78 | "language": "python", 79 | "name": "python3" 80 | }, 81 | "language_info": { 82 | "codemirror_mode": { 83 | "name": "ipython", 84 | "version": 3 85 | }, 86 | "file_extension": ".py", 87 | "mimetype": "text/x-python", 88 | "name": "python", 89 | "nbconvert_exporter": "python", 90 | "pygments_lexer": "ipython3", 91 | "version": "3.7.1" 92 | } 93 | }, 94 | "nbformat": 4, 95 | "nbformat_minor": 2 96 | } 97 | -------------------------------------------------------------------------------- /贝叶斯/05. 
贝叶斯网.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "在半朴素贝叶斯分类器中,我们对每一个属性的依赖做了约束,使其只依赖于另一个(ODE)或另几个属性(kDE),这使得属性之间的依赖关系呈树状或网状,不过这里的约束比较刻意,每个属性依赖的属性个数是相等的,在现实生活中,属性之间的依赖是没有规律的,更像是一个普通的网状结构。比如下图描述了心血管疾病和成因之间的关系,这被称为 **贝叶斯网**(Bayesian network)。\n", 8 | "\n", 9 | "\n", 10 | "\n", 11 | "贝叶斯网也叫做 **信念网**(belief network),它由网络结构 $G$ 和参数 $\\Theta$ 两部分组成,网络结构 $G$ 是一个有向无环图(directed acyclic graph,简称 DAG),其每一个节点对应一个属性,若两个属性有直接依赖关系,则用一条边连接起来;参数 $\\Theta$ 用于定量描述这种依赖关系,它包含了每个属性的 **条件概率表**(conditional probability table,简称 CPT)。\n", 12 | "\n", 13 | "### 贝叶斯网络结构\n", 14 | "\n", 15 | "下图是西瓜问题的一种贝叶斯网结构以及属性“根蒂”的条件概率表:\n", 16 | "\n", 17 | "\n", 18 | "\n", 19 | "贝叶斯网假设每个属性与它的非后裔属性独立,对于上图的联合概率可以定义为:\n", 20 | "\n", 21 | "$$\n", 22 | "P(x_1,x_2,x_3,x_4,x_5) = P(x_1)P(x_2)P(x_3|x_1)P(x_4|x_1,x_2)P(x_5|x_2)\n", 23 | "$$\n", 24 | "\n", 25 | "很显然,在贝叶斯网络中,存在下面三种典型的依赖关系:\n", 26 | "\n", 27 | "\n", 28 | "\n", 29 | "为了分析有向图中变量的条件独立性,通常使用 **有向分离**(D-separation) 将有向图转化为一个无向图,这个无向图称为 **道德图**(moral graph)。\n", 30 | "\n", 31 | "在知道了贝叶斯网络结构之后,下一步就是如何根据训练数据集构造出这样的贝叶斯网络?条件概率表如何计算?怎么根据贝叶斯网络来预测新的测试数据?\n", 32 | "\n", 33 | "### 贝叶斯网络结构的学习\n", 34 | "\n", 35 | "首先定义一个评分函数,用于评估贝叶斯网和训练数据的契合程度,然后基于这个评分函数来寻找结构最优的贝叶斯网,这被称为 **评分搜索**。评分函数通常基于信息论准则来设计,譬如将学习问题看作一个数据压缩任务,试图找到一个能以最短编码长度描述训练数据的模型,编码长度包括了**描述模型自身**所需的字节长度和使用该模型**描述数据**所需的节点长度,这就是 **最小描述长度**(Minimal Description Lenght,简称 MDL)准则。\n", 36 | "\n", 37 | "对给定的训练集 $D$,贝叶斯网 $B$ 的评分函数可写为:\n", 38 | "\n", 39 | "$$\n", 40 | "s(B|D) = f(\\theta)|B| - LL(B|D)\n", 41 | "$$\n", 42 | "\n", 43 | "其中,$|B|$ 是贝叶斯网的参数个数,$f(\\theta)$ 是描述每个参数 $\\theta$ 的字节数,$LL(B|D)$ 是贝叶斯网 $B$ 的对数似然:\n", 44 | "\n", 45 | "$$\n", 46 | "LL(B|D) = \\sum_{i=1}^m \\log P_B(x_i)\n", 47 | "$$\n", 48 | "\n", 49 | "可以看出,上式中第一项是计算编码贝叶斯网 $B$ 所需的字节数,第二项是计算 $B$ 所对应的概率分布 $P_B$ 需多少字节来描述。于是,学习任务就转化为一个优化任务,寻找一个贝叶斯网 $B$ 使评分函数 $s(B|D)$ 最小。\n", 50 | "\n", 51 | "不过从所有可能的网络结构空间搜索最优贝叶斯网结构是一个 NP 难问题,通常有两种常用的策略在有限时间内近似求解:\n", 52 | "\n", 53 | "* 贪心法:从某个网络结构出发,每次调整一条边(增加、删除、调整方向),直到评分函数值不再降低为止\n", 54 | "* 约束法:给网络结构施加约束来削减搜索空间,譬如之前学习的半朴素贝叶斯分类器 TAN 将结构限定为树形(半朴素贝叶斯分类器也是一种特殊的贝叶斯网)\n", 55 | "\n", 56 | "### 贝叶斯网络参数的学习\n", 57 | "\n", 58 | "若网络结构已知,即属性间的依赖关系已知,只需要对训练样本进行计数就可以估计出每个属性节点的条件概率表。\n", 59 | "\n", 60 | "### 根据贝叶斯网络进行推断\n", 61 | "\n", 62 | "贝叶斯网络训练好之后(结构和参数都已训练好),就可以通过一些属性的观测值来推测其他属性的值,比如观测到西瓜色泽青绿、敲声浊响、根蒂蜷缩,要判断它是否成熟、甜度如何。\n", 63 | "\n", 64 | "最理想的方法是根据贝叶斯网定义的联合概率分布来精确计算后验概率,不过这样的计算方法是 NP 难的,当网络节点较多、连接稠密时,难以精确推断,这时通过使用某种近似推断的方法,降低精度要求,在有限时间内求取近似解,在现实应用中,通常使用 **吉布斯采样**(Gibbs sampling)来完成。" 65 | ] 66 | } 67 | ], 68 | "metadata": { 69 | "kernelspec": { 70 | "display_name": "Python 3", 71 | "language": "python", 72 | "name": "python3" 73 | }, 74 | "language_info": { 75 | "codemirror_mode": { 76 | "name": "ipython", 77 | "version": 3 78 | }, 79 | "file_extension": ".py", 80 | "mimetype": "text/x-python", 81 | "name": "python", 82 | "nbconvert_exporter": "python", 83 | "pygments_lexer": "ipython3", 84 | "version": "3.7.1" 85 | } 86 | }, 87 | "nbformat": 4, 88 | "nbformat_minor": 2 89 | } 90 | -------------------------------------------------------------------------------- /逻辑回归/07. 线性回归和逻辑回归总结.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### 线性回归总结\n", 8 | "\n", 9 | "我们先回顾一下线性回归问题,线性模型一般表示成:\n", 10 | "\n", 11 | "$$\n", 12 | "h(x) = h_\\theta(x) = \\theta_0 + \\theta_1x_1 + ... 
+ \theta_dx_d\n", 13 | "$$\n", 14 | "\n", 15 | "也可以写成向量表示:\n", 16 | "\n", 17 | "$$\n", 18 | "h_\theta(x) = \theta^TX\n", 19 | "$$\n", 20 | "\n", 21 | "最常用的求解方法是 **最小二乘法**,它采用 **平方损失** 作为损失函数:\n", 22 | "\n", 23 | "$$\n", 24 | "J(\theta) = \sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2\n", 25 | "$$\n", 26 | "\n", 27 | "要求解的模型参数就是损失函数取最小值时的参数取值:\n", 28 | "\n", 29 | "$$\n", 30 | "\theta = \mathop{\arg\min}_{\theta} J(\theta)\n", 31 | "$$\n", 32 | "\n", 33 | "最小二乘法有两种常见的求解思路,一种使用正规方程:\n", 34 | "\n", 35 | "$$\n", 36 | "\theta = (X^TX)^{-1}X^Ty\n", 37 | "$$\n", 38 | "\n", 39 | "另一种使用优化算法梯度下降,迭代更新公式为:\n", 40 | "\n", 41 | "$$\n", 42 | "\theta_j := \theta_j + \alpha(y^{(i)} - h_\theta(x^{(i)}))x_j^{(i)}\n", 43 | "$$" 44 | ] 45 | }, 46 | { 47 | "cell_type": "markdown", 48 | "metadata": {}, 49 | "source": [ 50 | "### 逻辑回归总结\n", 51 | "\n", 52 | "逻辑回归模型一般表示成:\n", 53 | "\n", 54 | "$$\n", 55 | "h_\theta(x) = g(\theta^Tx) = \frac{1}{1 + e^{-\theta^Tx}}\n", 56 | "$$\n", 57 | "\n", 58 | "其中,$g(z) = \frac{1}{1 + e^{-z}}$,假设预测分类的概率满足伯努利分布:\n", 59 | "\n", 60 | "$$\n", 61 | "\begin{align}\n", 62 | "P(y = 1 | x; \theta) &= h_\theta(x) \\\\\n", 63 | "P(y = 0 | x; \theta) &= 1 - h_\theta(x)\n", 64 | "\end{align}\n", 65 | "$$\n", 66 | "\n", 67 | "我们把上面两个式子合在一起,有:\n", 68 | "\n", 69 | "$$\n", 70 | "P(y\ | x; \theta) = h_\theta^{y}(x) (1-h_\theta(x))^{1-y}\n", 71 | "$$\n", 72 | "\n", 73 | "根据 **极大似然估计** 我们有:\n", 74 | "\n", 75 | "$$\n", 76 | "L(\theta) = \prod_{i=1}^m h_\theta^{y^{(i)}}(x^{(i)}) (1-h_\theta(x^{(i)}))^{1-y^{(i)}}\n", 77 | "$$\n", 78 | "\n", 79 | "对其取对数得到对数似然函数(取负号即为逻辑回归的损失函数):\n", 80 | "\n", 81 | "$$\n", 82 | "\ell(\theta) = \log L(\theta) = \sum_{i=1}^m y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1 - h_\theta(x^{(i)}))\n", 83 | "$$\n", 84 | "\n", 85 | "和线性回归一样,可以使用梯度法迭代求解,这里是求 $\ell(\theta)$ 的最大值(等价于对损失函数 $-\ell(\theta)$ 使用梯度下降求最小值),迭代公式为:\n", 86 | "\n", 87 | "$$\n", 88 | "\theta_j := \theta_j + \alpha(y^{(i)} - h_\theta(x^{(i)}))x_j^{(i)}\n", 89 | "$$\n", 90 | "\n", 91 | "可以看到梯度下降公式和线性回归是一样的,差别在于 $h_\theta(\cdot)$,线性回归的 $h_\theta(x) = \theta^TX$,而逻辑回归的 $h_\theta(x) = g(\theta^Tx)$,这个 $g(\cdot)$ 通常称为**连接函数**(link function,或称为联系函数),它必须是单调可微的,通过连接函数得到的模型称为**广义线性模型**(generalized linear model)。" 92 | ] 93 | }, 94 | { 95 | "cell_type": "markdown", 96 | "metadata": {}, 97 | "source": [ 98 | "### 线性回归和逻辑回归的比较\n", 99 | "\n", 100 | "回归问题中经常使用基于欧氏距离的损失函数,而分类问题中,损失函数是基于概率分布,学术上被称为 **交叉熵**(Cross Entropy)。虽然两种损失函数的形式不同,但是它们的值都对应着模型的预测误差,因此这个值越小越好,这也是损失函数参数估计的基本原则。" 101 | ] 102 | } 103 | ], 104 | "metadata": { 105 | "kernelspec": { 106 | "display_name": "Python 3", 107 | "language": "python", 108 | "name": "python3" 109 | }, 110 | "language_info": { 111 | "codemirror_mode": { 112 | "name": "ipython", 113 | "version": 3 114 | }, 115 | "file_extension": ".py", 116 | "mimetype": "text/x-python", 117 | "name": "python", 118 | "nbconvert_exporter": "python", 119 | "pygments_lexer": "ipython3", 120 | "version": "3.7.1" 121 | } 122 | }, 123 | "nbformat": 4, 124 | "nbformat_minor": 2 125 | } 126 | -------------------------------------------------------------------------------- /支持向量机/09. 
支持向量回归.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "支持向量机模型也可以处理回归问题,假设我们能容忍 $f(x)$ 和 $y$ 之间最多有 $\\epsilon$ 的偏差,只有当 $f(x)$ 和 $y$ 之间的偏差大于 $\\epsilon$ 时才计算损失,如下图所示:\n", 8 | "\n", 9 | "\n", 10 | "\n", 11 | "这被称为 **支持向量回归**(Support Vector Regression,简称 SVR),它的基本形式为:\n", 12 | "\n", 13 | "$$\n", 14 | "\\mathop \\min_{w,b} \\frac{1}{2} \\|w\\|^2 + C \\sum_{i=1}^m \\ell_{\\epsilon} (f(x_i)-y_i)\n", 15 | "$$\n", 16 | "\n", 17 | "其中 $\\ell_{\\epsilon}$ 表示 $\\epsilon-$不敏感损失($\\epsilon-$ insensitive loss)函数:\n", 18 | "\n", 19 | "$$\n", 20 | "\\ell_{\\epsilon}(z) =\n", 21 | "\\left\\{\n", 22 | "\\begin{align}\n", 23 | "&0, &|z| \\leq \\epsilon \\\\\n", 24 | "&|z| - \\epsilon, &|z| > \\epsilon \\\\\n", 25 | "\\end{align}\n", 26 | "\\right.\n", 27 | "$$" 28 | ] 29 | }, 30 | { 31 | "cell_type": "markdown", 32 | "metadata": {}, 33 | "source": [ 34 | "由于回归线两侧的松弛程度可以不同,可以引入两个松弛变量,将 SVR 的基本形式改写为:\n", 35 | "\n", 36 | "$$\n", 37 | "\\mathop \\min_{w,b,\\xi,\\hat{\\xi}} \\frac{1}{2} \\|w\\|^2 + C \\sum_{i=1}^m (\\xi_i + \\hat{\\xi_i})\n", 38 | "$$\n", 39 | "\n", 40 | "约束条件为:\n", 41 | "\n", 42 | "$$\n", 43 | "\\begin{align}\n", 44 | "s.t. &f(x_i) - y_i \\leq \\epsilon + \\xi_i \\\\\n", 45 | "&y_i - f(x_i) \\leq \\epsilon + \\hat{\\xi_i} \\\\\n", 46 | "&\\xi_i \\geq 0, \\hat{\\xi_i} \\geq 0, i = 1,2,\\dots,m\n", 47 | "\\end{align}\n", 48 | "$$\n", 49 | "\n", 50 | "引入拉格朗日乘子得到拉格朗日函数:\n", 51 | "\n", 52 | "$$\n", 53 | "L(w,b,\\xi,\\hat{\\xi},\\alpha,\\hat{\\alpha},\\mu,\\hat{\\mu})\n", 54 | "= \\frac{1}{2} \\|w\\|^2 + C \\sum_{i=1}^m (\\xi_i + \\hat{\\xi_i}) - \\sum_{i=1}^m \\mu_i \\xi_i - \\sum_{i=1}^m \\hat{\\mu_i} \\hat{\\xi_i} + \\sum_{i=1}^m \\alpha_i(f(x_i) - y_i - \\epsilon - \\xi_i) + \\sum_{i=1}^m \\hat{\\alpha_i}(y_i - f(x_i) - \\epsilon - \\hat{\\xi_i})\n", 55 | "$$\n", 56 | "\n", 57 | "对 $w, b, \\xi_i, \\hat{\\xi_i}$ 求偏导并令其为 0 得到:\n", 58 | "\n", 59 | "$$\n", 60 | "\\left\\{\n", 61 | "\\begin{align}\n", 62 | "w &= \\sum_{i=1}^m (\\hat{\\alpha_i} - \\alpha_i)x_i \\\\\n", 63 | "0 &= \\sum_{i=1}^m (\\hat{\\alpha_i} - \\alpha_i) \\\\\n", 64 | "C &= (\\alpha_i + \\mu_i) \\\\\n", 65 | "C &= (\\hat{\\alpha_i} + \\hat{\\mu_i}) \\\\\n", 66 | "\\end{align}\n", 67 | "\\right .\n", 68 | "$$\n", 69 | "\n", 70 | "将其带入拉格朗日函数,得到 SVR 的对偶形式:\n", 71 | "\n", 72 | "$$\n", 73 | "\\begin{align}\n", 74 | "\\mathop \\min_{\\alpha,\\hat{\\alpha}} &\\frac{1}{2} \\sum_{i=1}^m \\sum_{j=1}^m (\\hat{\\alpha_i} - \\alpha_i)(\\hat{\\alpha_j} - \\alpha_j) x_i x_j - \\sum_{i=1}^m y_i(\\hat{\\alpha_i} - \\alpha_i) + \\epsilon(\\hat{\\alpha_i} - \\alpha_i) \\\\\n", 75 | "s.t. &\\sum_{i=1}^m (\\hat{\\alpha_i} - \\alpha_i) = 0 \\\\\n", 76 | "& 0 \\leq \\alpha_i, \\hat{\\alpha_i} \\leq C\n", 77 | "\\end{align}\n", 78 | "$$" 79 | ] 80 | } 81 | ], 82 | "metadata": { 83 | "kernelspec": { 84 | "display_name": "Python 3", 85 | "language": "python", 86 | "name": "python3" 87 | }, 88 | "language_info": { 89 | "codemirror_mode": { 90 | "name": "ipython", 91 | "version": 3 92 | }, 93 | "file_extension": ".py", 94 | "mimetype": "text/x-python", 95 | "name": "python", 96 | "nbconvert_exporter": "python", 97 | "pygments_lexer": "ipython3", 98 | "version": "3.7.1" 99 | } 100 | }, 101 | "nbformat": 4, 102 | "nbformat_minor": 2 103 | } 104 | -------------------------------------------------------------------------------- /决策树/05. 
连续值和缺失值.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### 处理连续属性\n", 8 | "\n", 9 | "如果某个属性的取值是连续值,不能直接使用连续属性的可取值来对节点进行划分,通常采用 **二分法**(bi-partition)对连续属性进行处理。\n", 10 | "\n", 11 | "假设数据集 D 的 a 属性为连续属性,它有 n 个不同的取值,我们将其从小到大排序,得到 $\\{a^1, a^2, ..., a^n\\}$,对每两个相邻的点,我们取其中位点作为候选划分点,这样就得到了 n-1 个候选划分点:\n", 12 | "\n", 13 | "$$\n", 14 | "T_a = \\{ \\frac{a^i + a^{i+1}}{2} | i \\leqslant i \\leqslant n-1 \\}\n", 15 | "$$\n", 16 | "\n", 17 | "我们从中考察每一个划分点,假设划分点为 t,它可以将数据集划分为 $D_t^-$ 和 $D_t^+$,$D_t^-$ 表示属性 a 小于等于 t 的样本集合,$D_t^+$ 表示属性 a 大于 t 的样本集合。使用属性 a 上的划分点 t 对数据集进行划分后的信息熵可以记为:\n", 18 | "\n", 19 | "$$\n", 20 | "Ent(D_{a,t}) = \\frac{|D_t^-|}{|D|} Ent(D_t^-) + \\frac{|D_t^+|}{|D|} Ent(D_t^+)\n", 21 | "$$\n", 22 | "\n", 23 | "相应的信息增益为:\n", 24 | "\n", 25 | "$$\n", 26 | "Gain(D,a,t) = Ent(D) - Ent(D_{a,t})\n", 27 | "$$\n", 28 | "\n", 29 | "可以看出 t 的取值不同,划分后的信息熵也不同,信息增益自然也不同,我们希望划分后的信息增益越大越好,所以可以把信息增益最大值作为连续属性 a 的信息增益。\n", 30 | "\n", 31 | "$$\n", 32 | "Gain(D,a) = \\mathop \\max_{t \\in T_a} Gain(D,a,t)\n", 33 | "$$\n", 34 | "\n", 35 | "处理连续属性和处理离散属性的根本原理是一样的,都是选择信息增益最大的划分属性对数据集进行划分,不过在计算连续属性的信息增益时,还要计算合适的划分点。另外,还要注意一点,离散属性一当被选为划分属性,后代节点就不能再使用该属性进行划分,而连续属性却可以,譬如在父节点上使用了 $a \\leqslant 10$,子节点上还可以进一步使用 $a \\leqslant 5$。\n", 36 | "\n", 37 | "### 处理缺失属性\n", 38 | "\n", 39 | "我们有时候会遇到不完整的数据集,样本的某些属性值由于一些原因缺失了,这样的数据如果丢弃掉显然是很可惜的,缺失值处理是机器学习算法中一个很重要的问题。在决策树算法中,关键是找到最优的划分属性,可以使用信息增益或基尼指数等手段,不过前面学习的内容都是基于完整的数据集来划分的,有没有办法在属性缺失的情况下划分属性呢?\n", 40 | "\n", 41 | "假设有数据集 $D$ 和 属性 $a$,属性 $a$ 在某些样本上有缺失,我们剔除掉那些属性 $a$ 缺失的样本得到一个新数据集 $\\tilde{D}$,这样可以计算出新数据集在属性 $a$ 上的信息增益:\n", 42 | "\n", 43 | "$$\n", 44 | "\\begin{align}\n", 45 | "Gain(\\tilde{D}, a) &= Ent(\\tilde{D}) - \\sum_{v=1}^{V} \\frac{|\\tilde{D}^v|}{|\\tilde{D}|} Ent(\\tilde{D}^v) \\\\\n", 46 | "&= - \\sum_{k=1}^{\\left| \\mathcal{Y} \\right|} \\frac{|\\tilde{D}_k|}{|\\tilde{D}|} log_2 \\frac{|\\tilde{D}_k|}{|\\tilde{D}|} - \\sum_{v=1}^{V} \\frac{|\\tilde{D}^v|}{|\\tilde{D}|} Ent(\\tilde{D}^v)\n", 47 | "\\end{align}\n", 48 | "$$\n", 49 | "\n", 50 | "$\\tilde{D}_k(k = 1,2,...,|\\mathcal{Y}|)$ 表示在数据集 $\\tilde{D}$ 中一种有 $\\mathcal{Y}$ 个类别,而 $\\tilde{D}^v(v = 1,2,...,V)$ 表示属性 a 一共有 V 个不同的取值。\n", 51 | "\n", 52 | "这里得到的是在数据集 $\\tilde{D}$ 上属性 a 的信息增益,可以根据完整样本的占比来推算在数据集 $D$ 上的信息增益:\n", 53 | "\n", 54 | "$$\n", 55 | "Gain(D, a) = \\frac{|\\tilde{D}|}{|D|} Gain(\\tilde{D}, a)\n", 56 | "$$\n", 57 | "\n", 58 | "通过上面的计算方法,可以得到每个属性的信息增益,选择信息增益最大的属性来进行划分,譬如该属性有 3 种不同的取值,将得到 3 个决策子树,每个子树对应一个相应的子集。但是,如果某个样本该属性值未知,这个样本该放入哪个子树呢?一种常见的处理方式是将其同时放入三个子树,不过要分别赋予不同的权重,为样本赋予权重之后,信息增益的计算方式也需要稍微调整下(假设样本 $x$ 的权重为 $w_x$):\n", 59 | "\n", 60 | "$$\n", 61 | "\\left\\{\n", 62 | "\\begin{align}\n", 63 | "\\frac{|\\tilde{D}|}{|D|} &\\Longrightarrow \\frac{\\sum_{x \\in \\tilde{D}} w_x}{\\sum_{x \\in D} w_x} \\\\\n", 64 | "\\frac{|\\tilde{D}_k|}{|\\tilde{D}|} &\\Longrightarrow \\frac{\\sum_{x \\in \\tilde{D}_k} w_x}{\\sum_{x \\in \\tilde{D}} w_x} \\\\\n", 65 | "\\frac{|\\tilde{D}^v|}{|\\tilde{D}|} &\\Longrightarrow \\frac{\\sum_{x \\in \\tilde{D}^v} w_x}{\\sum_{x \\in \\tilde{D}} w_x}\n", 66 | "\\end{align}\n", 67 | "\\right.\n", 68 | "$$" 69 | ] 70 | } 71 | ], 72 | "metadata": { 73 | "kernelspec": { 74 | "display_name": "Python 3", 75 | "language": "python", 76 | "name": "python3" 77 | }, 78 | "language_info": { 79 | "codemirror_mode": { 80 | "name": "ipython", 81 | "version": 3 82 | }, 83 | "file_extension": ".py", 84 | "mimetype": "text/x-python", 85 | "name": "python", 86 
| "nbconvert_exporter": "python", 87 | "pygments_lexer": "ipython3", 88 | "version": "3.7.1" 89 | } 90 | }, 91 | "nbformat": 4, 92 | "nbformat_minor": 2 93 | } 94 | -------------------------------------------------------------------------------- /kNN/03. k-近邻的算法实现:kd 树.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "根据前面的学习,我们知道实现 k-近邻算法的关键就是如何找到距离某个样本最近的 k 个样本。最简单的实现方法是线性扫描(linear scan),或者叫做暴力搜索,也就是依次计算输入样本和每个训练样本的距离,不过当训练集非常大时,这种计算非常耗时,是行不通的。\n", 8 | "\n", 9 | "为了提高 k-近邻搜索的效率,人们想出了很多特殊的数据结构存储训练数据,以减少计算距离的次数,其中 **kd 树**(kd tree)是最基础,也是最重要的一种。它的基本思想是对搜索空间进行层次划分,将整个空间划分为特定的几个部分,然后在特定空间的部分内进行搜索操作。如果划分空间没有重叠,这种情况叫做 **Clipping**,其代表算法就是 kd 树;如果划分空间有重叠,这种情况叫 **Overlapping**,其代表算法是 R 树。\n", 10 | "\n", 11 | "### 什么是 kd 树\n", 12 | "\n", 13 | "kd 树的概念是斯坦福大学 Jon Louis Bentley 于 1975 年在 ACM 杂志上发表的一篇论文 **Multidimensional Binary Search Trees Used for Associative Searching** 中正式提出的。它是 K-dimension tree 的缩写。其本质是一颗二叉树,树中存储的是一些 k 维数据,当 k = 1 时,kd 树就是我们非常熟悉的 **二叉搜索树**(Binary Search Tree,BST)。\n", 14 | "\n", 15 | "二叉搜索树具有如下性质:\n", 16 | "\n", 17 | "* 若它的左子树不为空,则左子树上所有结点的值均小于它的根结点的值;\n", 18 | "* 若它的右子树不为空,则右子树上所有结点的值均大于它的根结点的值;\n", 19 | "* 它的左、右子树也分别为二叉搜索树;\n", 20 | "\n", 21 | "我们不妨看一个例子,假设我们有一个表格存储了学生的语文成绩 chinese、数学成绩 math 和 英语成绩 english,如果要查询语文成绩介于 30~93 分的学生,如何处理?假设学生数量为N,如果顺序查询,则其时间复杂度为O(N),当学生规模很大时,其效率显然很低,如果使用二叉树,则其时间复杂度为O(logN),能极大地提高查询效率。将语文成绩存在构造好的二叉树里,就相当于在语文成绩上建立了索引,二叉搜索树示意图为:\n", 22 | "\n", 23 | "\n", 24 | "\n", 25 | "现在我们考虑二维的情况:查询语文成绩介于 30~93 分,且数学成绩介于 30~90 分的学生,又该如何处理呢?很显然,我们可以分别使用二叉树对语文成绩和数学成绩建立索引,先在语文成绩中查询得到集合 S1,再在数学成绩中查询得到集合 S2,然后计算 S1 和 S2 的交集,若 |S1|=m,|S2|=n,则其时间复杂度为 O(m\\*n),有没有更好的办法呢?\n", 26 | "\n", 27 | "实际上,kd 树可以更好的解决这个问题。\n", 28 | "\n", 29 | "### 构造 kd 树\n", 30 | "\n", 31 | "我们首先在二维平面上画出所有学生的成绩分布,然后先根据语文成绩,将所有人的成绩分成两半,其中一半的语文成绩 <=a,另一半的语文成绩 >a,分别得到集合 S1 和 S2;然后针对 S1,根据数学成绩分为两半,其中一半的数学成绩 <=b1,另一半的数学成绩 >b1,分别得到 S3 和 S4,针对 S2,也根据数学成绩分为两半,其中一半的数学成绩 <=b2,另一半的数学成绩 >b2,分别得到 S5 和 S6,然后再根据语文成绩分别对 S3、S4、S5、S6 继续执行类似划分得到更小的集合,然后再在更小的集合上根据数学成绩继续,以此类推...\n", 32 | "\n", 33 | "上面描述的就是构建 kd 树的基本思路,其构建后的 kd 树如下图所示:\n", 34 | "\n", 35 | "\n", 36 | "\n", 37 | "可以看到,l1 左边都是语文成绩低于 45 分的,l1 右边都是语文成绩高于 45 分的,l2 下方都是语文成绩低于 45 分且数学成绩低于 50 分的,l2 上方都是语文成绩低于 45 分且数学成绩高于 50 分的,后面以此类推。下面的图示,更清晰地表示了 kd 树的结构及其对应的二叉树:\n", 38 | "\n", 39 | "\n", 40 | "\n", 41 | "### kd 树相关算法\n", 42 | "\n", 43 | "上面只是介绍了构造 kd 树的基本思路,其具体的算法实现可以参考后面的参考链接。除了 kd 树的构造之外,关于 kd 树的常见算法还有:\n", 44 | "\n", 45 | "* kd 树的插入\n", 46 | "* kd 树的删除\n", 47 | "* kd 树的搜索\n", 48 | "\n", 49 | "关于 kd 树的搜索又可以分成几种不同的应用场景:\n", 50 | "\n", 51 | "* 范围查询:譬如上面的例子,查询语文成绩介于 30~93 分,且数学成绩介于 30~90 分的学生;\n", 52 | "* k-近邻查询:查询距离某个点最近的几个点,当 k = 1 时,就相当于最近邻查询;\n", 53 | "\n", 54 | "要实现 kNN 算法,就要用到这里的 kd 树的构造算法和 k-近邻查询算法,这两个是重点,其他的算法可以稍微了解一下,可以参考后面的参考链接。在李航老师的《统计学习方法》中,也介绍了 kd 树的构造算法和最近邻查询的算法以及相应的实例。\n", 55 | "\n", 56 | "关于 kd 树的搜索,还有一种改进算法,叫做 **BBF 算法**,另外,除了可以使用 kd 树来构造多维索引,还可以使用上面提到过的 **R 树**,以及 **球树**、**M树**、**VP树**、**MVP树** 等等数据结构。\n", 57 | "\n", 58 | "### 参考\n", 59 | "\n", 60 | "1. https://zhuanlan.zhihu.com/p/23966698\n", 61 | "1. https://blog.sengxian.com/algorithms/k-dimensional-tree\n", 62 | "1. https://www.jianshu.com/p/ffe52db3e12b\n", 63 | "1. https://wuzhiwei.net/kdtree/\n", 64 | "1. 
https://blog.csdn.net/v_JULY_v/article/details/8203674" 65 | ] 66 | } 67 | ], 68 | "metadata": { 69 | "kernelspec": { 70 | "display_name": "Python 3", 71 | "language": "python", 72 | "name": "python3" 73 | }, 74 | "language_info": { 75 | "codemirror_mode": { 76 | "name": "ipython", 77 | "version": 3 78 | }, 79 | "file_extension": ".py", 80 | "mimetype": "text/x-python", 81 | "name": "python", 82 | "nbconvert_exporter": "python", 83 | "pygments_lexer": "ipython3", 84 | "version": "3.7.1" 85 | } 86 | }, 87 | "nbformat": 4, 88 | "nbformat_minor": 2 89 | } 90 | -------------------------------------------------------------------------------- /神经网络/01. 感知机模型.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**感知机**(perceptron)是一种非常简单的用于二分类的线性模型,对于线性可分的样本,它对应于输入空间中将样本划分为正负两类的划分超平面,感知机模型在 1957 年由 Rosenblatt 提出,是神经网络和支持向量机的基础。\n", 8 | "\n", 9 | "### 感知机模型\n", 10 | "\n", 11 | "在支持向量机的学习中,我们知道,对于一个线性可分的输入空间,存在多个划分超平面可以将训练样本分开,如下图所示:\n", 12 | "\n", 13 | "![](../images/svm-1.jpg)\n", 14 | "\n", 15 | "这样的划分超平面可以写成线性方程的形式:\n", 16 | "\n", 17 | "$$\n", 18 | "w^Tx + b = 0\n", 19 | "$$\n", 20 | "\n", 21 | "一旦将空间划分成两个部分之后,就可以通过 $w^Tx + b$ 的符号来进行分类:\n", 22 | "\n", 23 | "$$\n", 24 | "f(x) = sign(w^Tx + b)\n", 25 | "$$\n", 26 | "\n", 27 | "其中,sign 是符号函数,当 $w^Tx + b \\geq 0$ 时,将其归为正类,当 $w^Tx + b < 0$ 时,将其归为负类。这里的 $f(x)$ 函数被称为 **感知机**(perceptron)模型。\n", 28 | "\n", 29 | "### 感知机的损失函数\n", 30 | "\n", 31 | "输入空间中的某个点到划分超平面的距离可以写成:\n", 32 | "\n", 33 | "$$\n", 34 | "r = \\frac{|w^Tx + b|}{\\|w\\|}\n", 35 | "$$\n", 36 | "\n", 37 | "如果某个点 $(x_i, y_i)$ 被误分类,也就是说当 $w^Tx_i + b \\geq 0$ 时,将其归为负类($y_i = -1$),当 $w^Tx_i + b < 0$ 时,将其归为正类($y_i = 1$),这时我们有:\n", 38 | "\n", 39 | "$$\n", 40 | "-y_i(w^Tx_i + b) \\geq 0\n", 41 | "$$\n", 42 | "\n", 43 | "所以对于误分类的点,它到划分超平面的距离可以写成:\n", 44 | "\n", 45 | "$$\n", 46 | "r = \\frac{|w^Tx + b|}{\\|w\\|} = \\frac{-y_i(w^Tx_i + b)}{\\|w\\|}\n", 47 | "$$\n", 48 | "\n", 49 | "我们将所有误分类的点到划分超平面的距离累加,得到感知机的损失函数:\n", 50 | "\n", 51 | "$$\n", 52 | "loss = -\\sum_{x_i \\in M} y_i(w^Tx_i + b)\n", 53 | "$$\n", 54 | "\n", 55 | "这里我们把 $\\frac{1}{\\|w\\|}$ 省略了,它不影响损失函数的结果,其中 $M$ 表示误分类点的集合,很显然,当没有点被误分类时,损失函数等于 0,感知机算法就是求损失函数最小时的 w、b 参数。\n", 56 | "\n", 57 | "### 使用随机梯度下降训练感知机\n", 58 | "\n", 59 | "要使感知机的损失函数最小化,通常使用 **随机梯度下降法**(stochastic gradient descent),首先选取一个超平面 $w_0, b_0$,然后用梯度下降法不断的极小化损失函数,在极小化过程中每次随机选取一个误分类点使其梯度下降。\n", 60 | "\n", 61 | "损失函数的梯度如下:\n", 62 | "\n", 63 | "$$\n", 64 | "\\begin{align}\n", 65 | "\\nabla_w loss &= -\\sum_{x_i \\in M} y_i x_i \\\\\n", 66 | "\\nabla_b loss &= -\\sum_{x_i \\in M} y_i\n", 67 | "\\end{align}\n", 68 | "$$\n", 69 | "\n", 70 | "随机选取一个误分类点 $(x_i, y_i)$,对 w, b 的更新公式为:\n", 71 | "\n", 72 | "$$\n", 73 | "\\begin{align}\n", 74 | "w &:= w + \\eta y_i x_i \\\\\n", 75 | "b &:= b + \\eta y_i\n", 76 | "\\end{align}\n", 77 | "$$\n", 78 | "\n", 79 | "式中 $\\eta$ 是步长,又叫学习率。通过不断的迭代,损失函数不断减小,直到为 0 。\n", 80 | "\n", 81 | "### 感知机算法的收敛性\n", 82 | "\n", 83 | "对于一个线性可分的数据集,感知机学习算法是不是一定收敛呢?这就需要引出 **Novikoff 定理** 了。\n", 84 | "\n", 85 | "根据 **Novikoff 定理**,感知机算法在训练集上的误分类次数 k 满足不等式:\n", 86 | "\n", 87 | "$$\n", 88 | "k \\leq (\\frac{R}{\\gamma})^2\n", 89 | "$$\n", 90 | "\n", 91 | "也就是说,误分类次数是有上界的,经过有限次迭代肯定可以找到将训练数据完全正确分开的超平面。\n", 92 | "\n", 93 | "### 感知机算法的对偶形式\n", 94 | "\n", 95 | "$$\n", 96 | "f(x) = sign(\\sum_{j=1}^N \\alpha_j y_j x_j \\cdot x + b)\n", 97 | "$$\n", 98 | "\n", 99 | "参考《统计学习方法》中的实例。\n", 100 | "\n", 101 | "### 感知机算法存在的问题\n", 102 | "\n", 103 | 
"感知机学习算法存在多个解,采用不同的初始值或者在迭代过程中改变选取误分类点的顺序,求得的解都会不同。为了得到唯一的超平面,需要对超平面增加约束条件,这就是线性支持向量机的想法。\n", 104 | "\n", 105 | "当训练样本不是线性可分的时候,感知机算法不收敛,迭代结果会发生震荡。" 106 | ] 107 | } 108 | ], 109 | "metadata": { 110 | "kernelspec": { 111 | "display_name": "Python 3", 112 | "language": "python", 113 | "name": "python3" 114 | }, 115 | "language_info": { 116 | "codemirror_mode": { 117 | "name": "ipython", 118 | "version": 3 119 | }, 120 | "file_extension": ".py", 121 | "mimetype": "text/x-python", 122 | "name": "python", 123 | "nbconvert_exporter": "python", 124 | "pygments_lexer": "ipython3", 125 | "version": "3.7.1" 126 | } 127 | }, 128 | "nbformat": 4, 129 | "nbformat_minor": 2 130 | } 131 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # ml-notes 2 | 3 | 我的机器学习笔记 4 | 5 | ### 概述 6 | 7 | * [01. 什么是机器学习](概述/01.%20什么是机器学习.ipynb) 8 | * [02. 机器学习分类](概述/02.%20机器学习分类.ipynb) 9 | * [03. 机器学习算法](概述/03.%20机器学习算法.ipynb) 10 | 11 | ## 有监督学习 12 | 13 | ### 回归 14 | 15 | * [01. 什么是线性回归?](线性回归/01.%20什么是线性回归.ipynb) 16 | * [02. 一元线性回归的求解](线性回归/02.%20一元线性回归的求解.ipynb) 17 | * [03. 二元线性回归的求解](线性回归/03.%20二元线性回归的求解.ipynb) 18 | * [04. 多元线性回归的求解](线性回归/04.%20多元线性回归的求解.ipynb) 19 | * [05. 线性回归的扩展](线性回归/05.%20线性回归的扩展.ipynb) 20 | * [06. NFL定理和过拟合](线性回归/06.%20NFL定理和过拟合.ipynb) 21 | * [07. 使用 sklearn 解线性回归](线性回归/07.%20使用%20sklearn%20解线性回归.ipynb) 22 | * [08. 通过模型评估降低过拟合](线性回归/08.%20通过模型评估降低过拟合.ipynb) 23 | * [09. 回归模型的评估和选择](线性回归/09.%20回归模型的评估和选择.ipynb) 24 | * [10. 什么是梯度下降](线性回归/10.%20什么是梯度下降.ipynb) 25 | * [11. 梯度下降的简单例子](线性回归/11.%20梯度下降的简单例子.ipynb) 26 | * [12. 利用梯度下降法解线性回归](线性回归/12.%20利用梯度下降法解线性回归.ipynb) 27 | * [13. sklearn 中的梯度下降法](线性回归/13.%20sklearn%20中的梯度下降法.ipynb) 28 | * [14. 最优化问题的其他算法](线性回归/14.%20最优化问题的其他算法.ipynb) 29 | * [15. 线性回归实例](线性回归/15.%20线性回归实例.ipynb) 30 | * [16. 带约束条件的线性回归](线性回归/16.%20带约束条件的线性回归.ipynb) 31 | * [17. 求解岭回归和LASSO回归](线性回归/17.%20求解岭回归和LASSO回归.ipynb) 32 | * [18. 使用岭回归解决共线性问题](线性回归/18.%20使用岭回归解决共线性问题.ipynb) 33 | * [19. LASSO回归和稀疏学习](线性回归/19.%20LASSO回归和稀疏学习.ipynb) 34 | * [20. 鲁棒学习](线性回归/20.%20鲁棒学习.ipynb) 35 | 36 | ### 分类 37 | 38 | #### 逻辑回归 39 | 40 | * [01. 从回归到分类](逻辑回归/01.%20从回归到分类.ipynb) 41 | * [02. 一元分类问题和逻辑函数](逻辑回归/02.%20一元分类问题和逻辑函数.ipynb) 42 | * [03. 逻辑回归和一元分类问题的求解](逻辑回归/03.%20逻辑回归和一元分类问题的求解.ipynb) 43 | * [04. 使用梯度下降解逻辑回归](逻辑回归/04.%20使用梯度下降解逻辑回归.ipynb) 44 | * [05. 多元分类问题的求解](逻辑回归/05.%20多元分类问题的求解.ipynb) 45 | * [06. 使用牛顿法解逻辑回归](逻辑回归/06.%20使用牛顿法解逻辑回归.ipynb) 46 | * [07. 线性回归和逻辑回归总结](逻辑回归/07.%20线性回归和逻辑回归总结.ipynb) 47 | * [08. 从二分类到多分类](逻辑回归/08.%20从二分类到多分类.ipynb) 48 | * [09. 从线性分类到非线性分类](逻辑回归/09.%20从线性分类到非线性分类.ipynb) 49 | * [10. 分类模型的评估方法](逻辑回归/10.%20分类模型的评估方法.ipynb) 50 | 51 | #### k-近邻 52 | 53 | * [01. k-近邻算法概述](kNN/01.%20k-近邻算法概述.ipynb) 54 | * [02. 距离度量方法](kNN/02.%20距离度量方法.ipynb) 55 | * [03. k-近邻的算法实现:kd 树](kNN/03.%20k-近邻的算法实现:kd%20树.ipynb) 56 | * [04. sklearn 实战 k-近邻算法](kNN/04.%20sklearn%20实战%20k-近邻算法.ipynb) 57 | 58 | #### 决策树 59 | 60 | * [01. 决策树算法概述](决策树/01.%20决策树算法概述.ipynb) 61 | * [02. 如何选择最优的划分属性](决策树/02.%20如何选择最优的划分属性.ipynb) 62 | * [03. 决策树算法实战](决策树/03.%20决策树算法实战.ipynb) 63 | * [04. 决策树的剪枝策略](决策树/04.%20决策树的剪枝策略.ipynb) 64 | * [05. 连续值和缺失值](决策树/05.%20连续值和缺失值.ipynb) 65 | * [06. 多变量决策树](决策树/06.%20多变量决策树.ipynb) 66 | 67 | #### 贝叶斯 68 | 69 | * [01. 贝叶斯定理](贝叶斯/01.%20贝叶斯定理.ipynb) 70 | * [02. 朴素贝叶斯分类器](贝叶斯/02.%20朴素贝叶斯分类器.ipynb) 71 | * [03. 贝叶斯分类实例](贝叶斯/03.%20贝叶斯分类实例.ipynb) 72 | * [04. 半朴素贝叶斯分类器](贝叶斯/04.%20半朴素贝叶斯分类器.ipynb) 73 | * [05. 贝叶斯网](贝叶斯/05.%20贝叶斯网.ipynb) 74 | 75 | #### 支持向量机 76 | 77 | * [01. 
什么是支持向量机](支持向量机/01.%20什么是支持向量机.ipynb) 78 | * [02. 支持向量机的对偶问题](支持向量机/02.%20支持向量机的对偶问题.ipynb) 79 | * [03. 硬间隔与软间隔](支持向量机/03.%20硬间隔与软间隔.ipynb) 80 | * [04. 软间隔支持向量机的对偶问题](支持向量机/04.%20软间隔支持向量机的对偶问题.ipynb) 81 | * [05. SMO 算法](支持向量机/05.%20SMO%20算法.ipynb) 82 | * [06. Scikit Learn 中的 SVM](支持向量机/06.%20Scikit%20Learn%20中的%20SVM.ipynb) 83 | * [07. 非线性支持向量机和核函数](支持向量机/07.%20非线性支持向量机和核函数.ipynb) 84 | * [08. 非线性支持向量机实战](支持向量机/08.%20非线性支持向量机实战.ipynb) 85 | * [09. 支持向量回归](支持向量机/09.%20支持向量回归.ipynb) 86 | 87 | #### 神经网络 88 | 89 | * [01. 感知机模型](神经网络/01.%20感知机模型.ipynb) 90 | * [02. 神经元模型](神经网络/02.%20神经元模型.ipynb) 91 | * [03. 从感知机到神经网络](神经网络/03.%20从感知机到神经网络.ipynb) 92 | * [04. 反向传播算法](神经网络/04.%20反向传播算法.ipynb) 93 | * [05. 其他神经网络](神经网络/05.%20其他神经网络.ipynb) 94 | 95 | #### 集成学习 96 | 97 | * [01. 什么是集成学习?](集成学习/01.%20什么是集成学习?.ipynb) 98 | * [02. Boosting 算法](集成学习/02.%20Boosting%20算法.ipynb) 99 | * [03. Bagging 算法](集成学习/03.%20Bagging%20算法.ipynb) 100 | * [04. 随机森林](集成学习/04.%20随机森林.ipynb) 101 | * [05. 个体学习器的多样性](集成学习/05.%20个体学习器的多样性.ipynb) 102 | 103 | ## 无监督学习 104 | 105 | ### 聚类 106 | 107 | * [1. 什么是聚类](聚类/1.%20什么是聚类.ipynb) 108 | * [2. 聚类算法一览](聚类/2.%20聚类算法一览.ipynb) 109 | * [3. k-means 算法](聚类/3.%20k-means%20算法.ipynb) 110 | * [4. DBSCAN 算法](聚类/4.%20DBSCAN%20算法.ipynb) 111 | * [5. 层次聚类算法](聚类/5.%20层次聚类算法.ipynb) 112 | 113 | ### 降维 114 | 115 | ### 特征选择 116 | 117 | ### 概率估计 118 | 119 | ### 主题模型 120 | 121 | ### 图分析 -------------------------------------------------------------------------------- /线性回归/11. 梯度下降的简单例子.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "根据前面介绍的梯度,我们来看几个使用梯度下降法求解极值的例子。对于函数 $J(\\theta)$,我们首先选取一个初始点 $\\theta_0$,然后沿着梯度方向对初始值进行更新得到 $\\theta_1$:\n", 8 | "\n", 9 | "$$\n", 10 | "\\theta_1 = \\theta_0 - \\eta \\frac{\\partial J(\\theta)}{\\partial \\theta}\n", 11 | "$$\n", 12 | "\n", 13 | "然后再对 $\\theta_1$ 进行更新得到 $\\theta_2$,以此类推,直到损失函数的值不再有显著变化,最终得到 $\\theta_k$ 就是我们要求的参数,它使损失函数达到最小。其中 $\\eta$ 是一个待确定的常数,通常被称为 **学习率**(learning rate)或**步长**(step),它是一个超参数。\n", 14 | "\n", 15 | "上面的迭代过程一般写成下面的形式,其中的 $:=$ 表示使用右侧的值更新左侧的值:\n", 16 | "\n", 17 | "$$\n", 18 | "\\theta := \\theta - \\eta \\frac{\\partial J(\\theta)}{\\partial \\theta}\n", 19 | "$$" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "### 一元函数的梯度下降\n", 27 | "\n", 28 | "我们首先看一个最简单的一元函数:\n", 29 | "\n", 30 | "$$\n", 31 | "J(\\theta) = \\theta^2\n", 32 | "$$\n", 33 | "\n", 34 | "求一元函数的梯度,也就对其求导:\n", 35 | "\n", 36 | "$$\n", 37 | "\\nabla J(\\theta) = J'(\\theta) = 2\\theta\n", 38 | "$$\n", 39 | "\n", 40 | "给 $\\theta$ 一个初始值 $\\theta_0 = 1$,并设学习率 $\\eta = 0.4$,根据梯度下降的更新公式,我们有:\n", 41 | "\n", 42 | "$$\n", 43 | "\\begin{align}\n", 44 | "\\theta_0 &= 1 \\\\\n", 45 | "\\theta_1 &= \\theta_0 - \\eta \\nabla J(\\theta_0) = 1 - 0.4 * (2*1) = 0.2 \\\\\n", 46 | "\\theta_2 &= \\theta_1 - \\eta \\nabla J(\\theta_1) = 0.2 - 0.4 * (2*0.2) = 0.04 \\\\\n", 47 | "\\theta_3 &= 0.008 \\\\\n", 48 | "\\theta_4 &= 0.0016 \\\\\n", 49 | "\\end{align}\n", 50 | "$$\n", 51 | "\n", 52 | "经过4步,基本上已经到达函数的最低点了。\n", 53 | "\n", 54 | "![](../images/gradient-sample-1.png)" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "### 二元函数的梯度下降\n", 62 | "\n", 63 | "我们再来看一个二元函数:\n", 64 | "\n", 65 | "$$\n", 66 | "J(\\theta) = \\theta_1^2 + \\theta_2^2\n", 67 | "$$\n", 68 | "\n", 69 | "该函数的梯度为:\n", 70 | "\n", 71 | "$$\n", 72 | "\\nabla J(\\theta) = \\langle 2\\theta_1, 2\\theta_2 \\rangle\n", 73 | 
"$$\n", 74 | "\n", 75 | "给 $\\theta$ 一个初始值 $\\theta^0 = \\langle 1,3 \\rangle$,并设学习率 $\\eta = 0.1$,开始迭代更新:\n", 76 | "\n", 77 | "$$\n", 78 | "\\begin{align}\n", 79 | "\\theta^0 &= \\langle 1,3 \\rangle \\\\\n", 80 | "\\theta^1 &= \\theta^0 - \\eta \\nabla J(\\theta^0) = \\langle 1,3 \\rangle - 0.1*\\langle 2,6 \\rangle = \\langle 0.8,2.4 \\rangle \\\\\n", 81 | "\\theta^2 &= \\theta^1 - \\eta \\nabla J(\\theta^1) = \\langle 0.8,2.4 \\rangle - 0.1*\\langle 1.6,4.8 \\rangle = \\langle 0.64,1.92 \\rangle\n", 82 | "\\end{align}\n", 83 | "$$\n", 84 | "\n", 85 | "可以写个程序来自动完成迭代过程,可以发现经过若干步的迭代,函数也达到了最低点。\n", 86 | "\n", 87 | "![](../images/gradient-sample-2.png)" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "### 关于学习率\n", 95 | "\n", 96 | "梯度下降算法中的 $\\eta$ 称为 **学习率**(或者 **步长**),它是一个超参数,关于它的值,不宜太大,也不宜太小。学习率用于控制每一步走的距离,如果太小,参数更新每次也很小,梯度下降的速度会非常慢,如果太大,参数更新也会很大,可能会直接越过极小值,甚至无法收敛到达最低点。\n", 97 | "\n", 98 | "![](../images/learning-rate.png)\n", 99 | "\n", 100 | "### 参考\n", 101 | "\n", 102 | "* [深入浅出--梯度下降法及其实现 - 简书](https://www.jianshu.com/p/c7e642877b0e)\n", 103 | "* [Gradient Descent - Problem of Hiking Down a Mountain | Udacity](https://storage.googleapis.com/supplemental_media/udacityu/315142919/Gradient%20Descent.pdf)" 104 | ] 105 | } 106 | ], 107 | "metadata": { 108 | "kernelspec": { 109 | "display_name": "Python 3", 110 | "language": "python", 111 | "name": "python3" 112 | }, 113 | "language_info": { 114 | "codemirror_mode": { 115 | "name": "ipython", 116 | "version": 3 117 | }, 118 | "file_extension": ".py", 119 | "mimetype": "text/x-python", 120 | "name": "python", 121 | "nbconvert_exporter": "python", 122 | "pygments_lexer": "ipython3", 123 | "version": "3.7.1" 124 | } 125 | }, 126 | "nbformat": 4, 127 | "nbformat_minor": 2 128 | } 129 | -------------------------------------------------------------------------------- /支持向量机/02. 支持向量机的对偶问题.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "上一节我们学习了支持向量机的优化目标为:\n", 8 | "\n", 9 | "$$\n", 10 | "\\begin{align}\n", 11 | "&\\mathop \\min_{w,b} \\frac{1}{2}\\|w\\|^2 \\\\\n", 12 | "&s.t. 
y_i(w^Tx_i + b) \\geq 1, i = 1,2,\\dots,n\n", 13 | "\\end{align}\n", 14 | "$$\n", 15 | "\n", 16 | "这是一个 **凸二次规划**(convex quadratic programming) 问题,可以用现成的 QP 优化包来求解。(TODO)\n", 17 | "\n", 18 | "另外,还可以通过 **拉格朗日乘子法** 将其转换为解法更高效的对偶形式,其主要思想是将约束条件函数与原函数联系到一起,使之配成与变量数量相等的等式方程,从而求出原函数极值的各个变量的解。转换的方法简单来说就是对每一个约束条件加上一个 **拉格朗日乘子**(Lagrange multiplier),定义出 **拉格朗日函数** 如下:\n", 19 | "\n", 20 | "$$\n", 21 | "L(w,b,\\alpha) = \\frac{1}{2}\\|w\\|^2 - \\sum_{i=1}^n \\alpha_i (y_i(w^Tx_i + b) - 1)\n", 22 | "$$\n", 23 | "\n", 24 | "其中,引入了一个新的变量 $\\alpha$,这个变量就是 **拉格朗日乘子**。所以支持向量机的优化目标可以写成:\n", 25 | "\n", 26 | "$$\n", 27 | "\\mathop \\min_{w,b} \\mathop \\max_{\\alpha} L(w,b,\\alpha)\n", 28 | "$$\n", 29 | "\n", 30 | "根据 **拉格郎日对偶性**,可以得到该优化目标的 **对偶问题**(dual problem):\n", 31 | "\n", 32 | "$$\n", 33 | "\\mathop \\max_{\\alpha} \\mathop \\min_{w,b} L(w,b,\\alpha)\n", 34 | "$$\n", 35 | "\n", 36 | "为了求解这个问题,我们可以先求 $\\mathop \\min_{w,b} L(w,b,\\alpha)$,很显然,这是一个最小值问题,直接使用求导的方法,我们对 $w$ 和 $b$ 分别求偏导:\n", 37 | "\n", 38 | "$$\n", 39 | "\\frac{\\partial L(w,b,\\alpha)}{\\partial w} = w - \\sum_{i=1}^n \\alpha_i y_i x_i\n", 40 | "$$\n", 41 | "\n", 42 | "$$\n", 43 | "\\frac{\\partial L(w,b,\\alpha)}{\\partial b} = \\sum_{i=1}^n \\alpha_i y_i\n", 44 | "$$\n", 45 | "\n", 46 | "令偏导等于 0,可以得到:\n", 47 | "\n", 48 | "$$\n", 49 | "w = \\sum_{i=1}^n \\alpha_i y_i x_i\n", 50 | "$$\n", 51 | "\n", 52 | "$$\n", 53 | "\\sum_{i=1}^n \\alpha_i y_i = 0\n", 54 | "$$\n", 55 | "\n", 56 | "将这两个结果带入 $L(w,b,\\alpha)$ 有:\n", 57 | "\n", 58 | "$$\n", 59 | "\\begin{align}\n", 60 | "L(w,b,\\alpha) &= \\frac{1}{2}\\|w\\|^2 - \\sum_{i=1}^n \\alpha_i (y_i(w^Tx_i + b) - 1) \\\\\n", 61 | "&= \\frac{1}{2}w^Tw - w^T \\sum_{i=1}^n \\alpha_i y_i x_i - b \\sum_{i=1}^n \\alpha_i y_i + \\sum_{i=1}^n \\alpha_i \\\\\n", 62 | "&= \\frac{1}{2}w^Tw - w^Tw - b 0 + \\sum_{i=1}^n \\alpha_i \\\\\n", 63 | "&= \\sum_{i=1}^n \\alpha_i - \\frac{1}{2}w^Tw \\\\\n", 64 | "&= \\sum_{i=1}^n \\alpha_i - \\frac{1}{2} \\sum_{i=1}^n \\sum_{j=1}^n \\alpha_i \\alpha_j y_i y_j x_i x_j \\\\\n", 65 | "\\end{align}\n", 66 | "$$\n", 67 | "\n", 68 | "所以问题转换为求:\n", 69 | "\n", 70 | "$$\n", 71 | "\\begin{align}\n", 72 | "\\mathop \\max_{\\alpha} \\mathop \\min_{w,b} L(w,b,\\alpha) &= \\mathop \\max_{\\alpha} \\sum_{i=1}^n \\alpha_i - \\frac{1}{2} \\sum_{i=1}^n \\sum_{j=1}^n \\alpha_i \\alpha_j y_i y_j x_i x_j \\\\\n", 73 | "&s.t. \\sum_{i=1}^n \\alpha_i y_i = 0, \\alpha_i \\geq 0, i = 1,2,\\dots,n\n", 74 | "\\end{align}\n", 75 | "$$\n", 76 | "\n", 77 | "转换符号变成求最小值:\n", 78 | "\n", 79 | "$$\n", 80 | "\\begin{align}\n", 81 | "&\\mathop \\min_{\\alpha} \\frac{1}{2} \\sum_{i=1}^n \\sum_{j=1}^n \\alpha_i \\alpha_j y_i y_j x_i x_j - \\sum_{i=1}^n \\alpha_i \\\\\n", 82 | "&s.t. 
\\sum_{i=1}^n \\alpha_i y_i = 0, \\alpha_i \\geq 0, i = 1,2,\\dots,n\n", 83 | "\\end{align}\n", 84 | "$$" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "解出 $\\alpha$ 后,从而得到划分超平面:\n", 92 | "\n", 93 | "$$\n", 94 | "f(x) = w^Tx+b = \\sum_{i=1}^n \\alpha_i y_i x_i^T x + b\n", 95 | "$$" 96 | ] 97 | } 98 | ], 99 | "metadata": { 100 | "kernelspec": { 101 | "display_name": "Python 3", 102 | "language": "python", 103 | "name": "python3" 104 | }, 105 | "language_info": { 106 | "codemirror_mode": { 107 | "name": "ipython", 108 | "version": 3 109 | }, 110 | "file_extension": ".py", 111 | "mimetype": "text/x-python", 112 | "name": "python", 113 | "nbconvert_exporter": "python", 114 | "pygments_lexer": "ipython3", 115 | "version": "3.7.1" 116 | } 117 | }, 118 | "nbformat": 4, 119 | "nbformat_minor": 2 120 | } 121 | -------------------------------------------------------------------------------- /支持向量机/04. 软间隔支持向量机的对偶问题.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "上一节我们得到了软间隔支持向量机的基本形式:\n", 8 | "\n", 9 | "$$\n", 10 | "\\begin{align}\n", 11 | "&\\mathop \\min_{w,b} \\frac{1}{2}\\|w\\|^2 + C\\sum_{i=1}^n\\xi_i \\\\\n", 12 | "&s.t. y_i(w^Tx_i + b) \\geq 1 - \\xi_i, \\xi_i \\geq 0, i = 1,2,\\dots,n\n", 13 | "\\end{align}\n", 14 | "$$\n", 15 | "\n", 16 | "为了求解这个问题,我们也使用 **拉格郎日乘子法** 将其转换为对偶形式。我们给每一个约束条件加上 **拉格朗日乘子**,得到 **拉格郎日函数**:\n", 17 | "\n", 18 | "$$\n", 19 | "\\begin{align}\n", 20 | "L(w,b,\\xi,\\alpha,\\beta) = \n", 21 | "& \\frac{1}{2}\\|w\\|^2 + C\\sum_{i=1}^n\\xi_i \\\\\n", 22 | "& + \\sum_{i=1}^n \\alpha_i (1-\\xi_i-y_i(w^Tx_i+b)) - \\sum_{i=1}^n \\beta_i\\xi_i\n", 23 | "\\end{align}\n", 24 | "$$\n", 25 | "\n", 26 | "其中 $\\alpha \\geq 0, \\beta \\geq 0$ 是拉格郎日乘子。\n", 27 | "\n", 28 | "和硬间隔支持向量机的优化目标一样:\n", 29 | "\n", 30 | "$$\n", 31 | "\\mathop \\min_{w,b,\\xi} \\mathop \\max_{\\alpha,\\beta} L(w,b,\\xi,\\alpha,\\beta)\n", 32 | "$$\n", 33 | "\n", 34 | "转换为对偶问题:\n", 35 | "\n", 36 | "$$\n", 37 | "\\mathop \\max_{\\alpha,\\beta} \\mathop \\min_{w,b,\\xi} L(w,b,\\xi,\\alpha,\\beta)\n", 38 | "$$\n", 39 | "\n", 40 | "所以,我们对 $w, b, \\xi$ 分别求偏导:\n", 41 | "\n", 42 | "$$\n", 43 | "\\left\\{\n", 44 | "\\begin{align}\n", 45 | "\\frac{\\partial L(w,b,\\xi,\\alpha,\\beta)}{\\partial w} &= w - \\sum_{i=1}^n\\alpha_iy_ix_i \\\\\n", 46 | "\\frac{\\partial L(w,b,\\xi,\\alpha,\\beta)}{\\partial b} &= -\\sum_{i=1}^n \\alpha_iy_i \\\\\n", 47 | "\\frac{\\partial L(w,b,\\xi,\\alpha,\\beta)}{\\partial \\xi_i} &= C - (\\alpha_i + \\beta_i)\n", 48 | "\\end{align}\n", 49 | "\\right .\n", 50 | "$$\n", 51 | "\n", 52 | "分别令其为零,得到:\n", 53 | "\n", 54 | "$$\n", 55 | "\\left\\{\n", 56 | "\\begin{align}\n", 57 | "w &= \\sum_{i=1}^n\\alpha_iy_ix_i \\\\\n", 58 | "0 &= \\sum_{i=1}^n \\alpha_iy_i \\\\\n", 59 | "C &= (\\alpha_i + \\beta_i) \\\\\n", 60 | "\\end{align}\n", 61 | "\\right .\n", 62 | "$$\n", 63 | "\n", 64 | "将这三个式子带入 $L(w,b,\\xi,\\alpha,\\beta)$ 有:\n", 65 | "\n", 66 | "$$\n", 67 | "\\begin{align}\n", 68 | "L(w,b,\\xi,\\alpha,\\beta) \n", 69 | "&= \\frac{1}{2}\\|w\\|^2 + C\\sum_{i=1}^n\\xi_i + \\sum_{i=1}^n \\alpha_i (1-\\xi_i-y_i(w^Tx_i+b)) - \\sum_{i=1}^n \\beta_i\\xi_i \\\\\n", 70 | "&= \\frac{1}{2}w^Tw + \\sum_{i=1}^n(\\alpha_i+\\beta_i)\\xi_i + \\sum_{i=1}^n(\\alpha_i-\\alpha_i\\xi_i-\\alpha_iy_iw^Tx_i-\\alpha_iy_ib-\\beta_i\\xi_i) \\\\\n", 71 | "&= \\sum_{i=1}^n\\alpha_i - \\frac{1}{2}w^Tw \\\\\n", 72 | "&= \\sum_{i=1}^n \\alpha_i - 
\\frac{1}{2} \\sum_{i=1}^n \\sum_{j=1}^n \\alpha_i \\alpha_j y_i y_j x_i x_j \\\\\n", 73 | "\\end{align}\n", 74 | "$$\n", 75 | "\n", 76 | "可以看到,参数 $\\beta$ 被消掉了,得到的结果和硬间隔支持向量机是一样的,问题转换为求:\n", 77 | "\n", 78 | "$$\n", 79 | "\\begin{align}\n", 80 | "\\mathop \\max_{\\alpha,\\beta} \\mathop \\min_{w,b} L(w,b,\\xi,\\alpha,\\beta) &= \\mathop \\max_{\\alpha,\\beta} \\sum_{i=1}^n \\alpha_i - \\frac{1}{2} \\sum_{i=1}^n \\sum_{j=1}^n \\alpha_i \\alpha_j y_i y_j x_i x_j \\\\\n", 81 | "\\end{align}\n", 82 | "$$\n", 83 | "\n", 84 | "转换符号有:\n", 85 | "\n", 86 | "$$\n", 87 | "\\begin{align}\n", 88 | "\\mathop \\min_{\\alpha} \\frac{1}{2} \\sum_{i=1}^n \\sum_{j=1}^n \\alpha_i \\alpha_j y_i y_j x_i x_j - \\sum_{i=1}^n \\alpha_i\\\\\n", 89 | "\\end{align}\n", 90 | "$$\n", 91 | "\n", 92 | "只是约束条件相比硬间隔支持向量机来说,多了 $\\alpha_i \\leq C$,所以有:\n", 93 | "\n", 94 | "$$\n", 95 | "s.t. \\sum_{i=1}^n \\alpha_i y_i = 0, 0 \\leq \\alpha_i \\leq C, i = 1,2,\\dots,n\n", 96 | "$$" 97 | ] 98 | } 99 | ], 100 | "metadata": { 101 | "kernelspec": { 102 | "display_name": "Python 3", 103 | "language": "python", 104 | "name": "python3" 105 | }, 106 | "language_info": { 107 | "codemirror_mode": { 108 | "name": "ipython", 109 | "version": 3 110 | }, 111 | "file_extension": ".py", 112 | "mimetype": "text/x-python", 113 | "name": "python", 114 | "nbconvert_exporter": "python", 115 | "pygments_lexer": "ipython3", 116 | "version": "3.7.1" 117 | } 118 | }, 119 | "nbformat": 4, 120 | "nbformat_minor": 2 121 | } 122 | -------------------------------------------------------------------------------- /贝叶斯/01. 贝叶斯定理.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### 条件概率\n", 8 | "\n", 9 | "**概率**(probability)是从随机试验中的事件到实数域的映射函数,用于表示事件发生的可能性,事件 $A$ 发生的概率通常记为 $P(A)$。\n", 10 | "\n", 11 | "如果 $A$ 和 $B$ 是样本空间 $\\Omega$ 上的两个事件,且 $P(B) > 0$,那么在给定 $B$ 时 $A$ 的**条件概率**(conditional probability)为:\n", 12 | "\n", 13 | "$$\n", 14 | "P(A|B) = \\frac {P(A \\cap B)}{P(B)}\n", 15 | "$$\n", 16 | "\n", 17 | "$P(A|B)$ 表示在事件 $B$ 发生的条件下,事件 $A$ 发生的概率。\n", 18 | "\n", 19 | "同理可得:\n", 20 | "\n", 21 | "$$\n", 22 | "P(B|A) = \\frac {P(A \\cap B)}{P(A)}\n", 23 | "$$\n", 24 | "\n", 25 | "结合上面两个公式,得到概率的**乘法定理**(也叫乘法法则):\n", 26 | "\n", 27 | "$$\n", 28 | "P(A \\cap B) = P(A)P(B|A) = P(B)P(A|B)\n", 29 | "$$\n", 30 | "\n", 31 | "$P(A \\cap B)$ 称为**联合概率**,表示事件 A 和 B 同时发生的概率。\n", 32 | "\n", 33 | "### 独立事件\n", 34 | "\n", 35 | "在条件概率中,如果事件 A 和 B 是相互独立的,也就是说事件 A 的发生和事件 B 没有任何关系,我们有:\n", 36 | "\n", 37 | "$$\n", 38 | "P(A|B) = P(A)\n", 39 | "$$\n", 40 | "\n", 41 | "上面的式子也可以解释成,无论事件 B 有没有发生,事件 A 发生的概率都不变。于是对于独立事件,我们得到一个很重要的公式,这个公式是 **朴素贝叶斯**(naive Bayes)的前提:\n", 42 | "\n", 43 | "$$\n", 44 | "P(A \\cap B) = P(A)P(B)\n", 45 | "$$\n", 46 | "\n", 47 | "推广到一般形式:\n", 48 | "\n", 49 | "$$\n", 50 | "P(A_1 \\cap A_2 \\cap \\dots \\cap A_n) = P(A_1)P(A_2) \\dots P(A_n) = \\Pi_i P(A_i)\n", 51 | "$$\n", 52 | "\n", 53 | "### 全概率公式\n", 54 | "\n", 55 | "上面我们假设 $A$ 是样本空间 $\\Omega$ 上的一个事件,我们把这个事件的对立事件记为 $\\bar{A}$,很显然,样本空间是 $A$ 和 $\\bar{A}$ 两个事件之和。于是,事件 $B$ 的概率可以写成下面这样:\n", 56 | "\n", 57 | "$$\n", 58 | "\\begin{align}\n", 59 | "P(B) &= P(B \\cap A) + P(B \\cap \\bar{A}) \\\\\n", 60 | "&= P(B|A)P(A) + P(B|\\bar{A})P(\\bar{A})\n", 61 | "\\end{align}\n", 62 | "$$\n", 63 | "\n", 64 | "这就是**全概率公式**。这个公式可以推广到一般形式,假设 $A_1, A_2, ..., A_n$ 是样本空间 $\\Omega$ 的一个划分,即 $\\sum_{i=1}^nA_i = \\Omega$,并且 $A_i$ 互不相交,则有:\n", 65 | "\n", 66 | "$$\n", 67 | "P(B) = \\sum_{i=1}^n 
P(B|A_i)P(A_i)\n", 68 | "$$\n", 69 | "\n", 70 | "### 贝叶斯定理\n", 71 | "\n", 72 | "根据上面概率的乘法定理,可以得到下面的公式:\n", 73 | "\n", 74 | "$$\n", 75 | "P(A|B) = \\frac{P(B|A)P(A)}{P(B)}\n", 76 | "$$\n", 77 | "\n", 78 | "这个公式被称为 **贝叶斯公式**,也叫 **贝叶斯法则** 或 **贝叶斯定理**(Bayesian theorem),它是英国数学家托马斯·贝叶斯(Thomas Bayes)在1763年提出的,这个公式可以用于条件概率的求解。\n", 79 | "\n", 80 | "推广到一般形式,我们可以把样本空间划分成 $A_1, A_2, ..., A_n$,根据前面的全概率公式,则可以计算出每个划分 $A_i$ 的条件概率:\n", 81 | "\n", 82 | "$$\n", 83 | "P(A_i|B) = \\frac{P(B|A_i)P(A_i)}{\\sum_{i=1}^n P(B|A_i)P(A_i)}\n", 84 | "$$\n", 85 | "\n", 86 | "在上面的公式中,$P(A_i)$ 是在事件 B 发生之前 $A_i$ 的概率,所以叫做 **先验概率**(Prior probability)或 **边缘概率**,而 $P(A_i|B)$ 指的是事件 B 发生后,对 $A_i$ 的概率进行修正,我们把它叫做 **后验概率**(Posterior probability),$P(B|A_i)$ 叫做 **类条件概率**(class-conditional probability)或 **似然**(Likelihood)。\n", 87 | "\n", 88 | "疑惑:全概率公式在贝叶斯定理中有什么用?\n", 89 | "\n", 90 | "### 贝叶斯定理实例\n", 91 | "\n", 92 | "我们以一个实例来看看贝叶斯定理在实际中的应用。假设有两个箱子 A 和 B,箱子 A 中装了 5 个球,有一个坏球,箱子 B 中装了 3 个球,有 2 个坏球。\n", 93 | "\n", 94 | "|箱子|总球数|坏球数|\n", 95 | "|----|-----|-----|\n", 96 | "|A |5 |1 |\n", 97 | "|B |3 |2 |\n", 98 | "\n", 99 | "现在有一个人从两个箱子中随机挑一个箱子并拿出一个球,发现是坏球,请问这个球可能来自哪个箱子?\n", 100 | "\n", 101 | "这是一个典型的条件概率问题,要想知道球来自哪个箱子的概率最大,我们可以计算出球来自箱子 A 的概率,和来自箱子 B 的概率,比较其大小,就知道最有可能来自哪个箱子了。也就是计算 $P(A|bad)$ 和 $P(B|bad)$。\n", 102 | "\n", 103 | "根据贝叶斯定理很容易得出:\n", 104 | "\n", 105 | "$$\n", 106 | "P(A|bad) = \\frac{P(bad|A)P(A)}{P(bad)} = \\frac{\\frac{1}{5} \\times \\frac{1}{2}}{\\frac{3}{8}} = \\frac{4}{15}\n", 107 | "$$\n", 108 | "\n", 109 | "$$\n", 110 | "P(B|bad) = \\frac{P(bad|B)P(B)}{P(bad)} = \\frac{\\frac{2}{3} \\times \\frac{1}{2}}{\\frac{3}{8}} = \\frac{8}{9}\n", 111 | "$$\n", 112 | "\n", 113 | "很显然,这个球来自箱子 B 的概率最大。" 114 | ] 115 | } 116 | ], 117 | "metadata": { 118 | "kernelspec": { 119 | "display_name": "Python 3", 120 | "language": "python", 121 | "name": "python3" 122 | }, 123 | "language_info": { 124 | "codemirror_mode": { 125 | "name": "ipython", 126 | "version": 3 127 | }, 128 | "file_extension": ".py", 129 | "mimetype": "text/x-python", 130 | "name": "python", 131 | "nbconvert_exporter": "python", 132 | "pygments_lexer": "ipython3", 133 | "version": "3.7.1" 134 | } 135 | }, 136 | "nbformat": 4, 137 | "nbformat_minor": 2 138 | } 139 | -------------------------------------------------------------------------------- /聚类/1. 
什么是聚类.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "**聚类**(clustering) 是一种无监督学习算法(unsupervised learning),训练样本的标记信息通常是未知的,它试图对无标记的训练样本进行学习,找到数据的内在性质和规律,将数据集划分成若干个通常是不相交的子集,这些子集被称为 **簇**(cluster)。通过这样的划分,每个簇可能对应一个潜在的概念,这些概念对聚类算法而言事先是未知的,聚类过程仅能自动形成簇结构,簇所对应的概念语义需要由使用者把握和命名。\n", 8 | "\n", 9 | "## 聚类模型的评估\n", 10 | "\n", 11 | "如何评估聚类模型的好坏,也就是如何对聚类的性能进行度量,这也被称为 **有效性指标**(validity index)。直观上看,我们希望同一簇的样本尽可能的彼此相似,不同簇的样本尽可能的不同,也就是我们常说的 “物以类聚”。\n", 12 | "\n", 13 | "评估方法一般有两种,一种是将聚类结果和某个**参考模型**(reference model)进行比较,这称作**外部指标**(external index);另一种是直接考察聚类结果本身,称作**内部指标**(internal index)。\n", 14 | "\n", 15 | "### 外部指标\n", 16 | "\n", 17 | "外部指标的计算首先要定义几个变量,假定通过聚类算法给出的簇划分为 $C = \\{C_1, C_2, ..., C_k\\}$,而根据参考模型给出的簇划分为 $C^* = \\{C_1^*, C_2^*, ..., C_s^*\\}$,也就是说聚类算法将数据划分为 k 类,参考模型将数据划分为 s 类,一般情况下 $k \\neq s$。然后将样本两两进行配对得到组合 $(x_i, x_j)$,其中 $i < j$。一共可以得到 $m(m-1)/2$ 个组合,并且只会有四种不同的情况:\n", 18 | "\n", 19 | "* $x_i$ 和 $x_j$ 在 $C$ 中属于相同簇,在 $C^*$ 中也属于相同簇,一共有 a 个组合\n", 20 | "* $x_i$ 和 $x_j$ 在 $C$ 中属于相同簇,在 $C^*$ 中属于不同簇,一共有 b 个组合\n", 21 | "* $x_i$ 和 $x_j$ 在 $C$ 中属于不同簇,在 $C^*$ 中属于相同簇,一共有 c 个组合\n", 22 | "* $x_i$ 和 $x_j$ 在 $C$ 中属于不同簇,在 $C^*$ 中也属于不同簇,一共有 d 个组合\n", 23 | "\n", 24 | "很显然:\n", 25 | "\n", 26 | "$$\n", 27 | "a + b + c + d = m(m-1)/2\n", 28 | "$$\n", 29 | "\n", 30 | "根据这四个变量,就可以定义出下面这些常用的外部指标:\n", 31 | "\n", 32 | "* **Jaccard 系数**(Jaccard Coefficient,简称 JC)\n", 33 | "\n", 34 | "$$\n", 35 | "JC = \\frac{a}{a+b+c}\n", 36 | "$$\n", 37 | "\n", 38 | "* **FM 指数**(Fowlkes and Mallows Index,简称 FMI)\n", 39 | "\n", 40 | "$$\n", 41 | "FMI = \\sqrt {\\frac{a}{a+b} \\cdot \\frac{a}{a+c}}\n", 42 | "$$\n", 43 | "\n", 44 | "* **Rand 指数**(Rand Index,简称 RI)\n", 45 | "\n", 46 | "$$\n", 47 | "RI = \\frac{2(a+d)}{m(m-1)}\n", 48 | "$$\n", 49 | "\n", 50 | "这些指标值的范围都在 0 - 1 之间,值越大说明聚类结果越准。\n", 51 | "\n", 52 | "### 内部指标\n", 53 | "\n", 54 | "在定义内部指标之前,也首先定义几个变量,假定通过聚类算法给出的簇划分为 $C = \\{C_1, C_2, ..., C_k\\}$,样本 $x_i$ 和 $x_j$ 之间的距离记为 $dist(x_i, x_j)$,可以得到簇内样本间的平均距离:\n", 55 | "\n", 56 | "$$\n", 57 | "avg(C) = \\frac{2}{|C|(|C|-1)} \\sum_{1 \\leqslant i \\leqslant j \\leqslant |C|} dist(x_i, x_j)\n", 58 | "$$\n", 59 | "\n", 60 | "簇内样本间的最远距离:\n", 61 | "\n", 62 | "$$\n", 63 | "diam(C) = max_{1 \\leqslant i \\leqslant j \\leqslant |C|} dist(x_i, x_j)\n", 64 | "$$\n", 65 | "\n", 66 | "簇 $C_i$ 和 簇 $C_j$ 最近样本间的距离:\n", 67 | "\n", 68 | "$$\n", 69 | "d_{min}(C_i, C_j) = min_{x_i \\in C_i, x_j \\in C_j} dist(x_i, x_j)\n", 70 | "$$\n", 71 | "\n", 72 | "簇 C 的中心点:\n", 73 | "\n", 74 | "$$\n", 75 | "\\mu = \\frac{1}{|C|} \\sum_{1 \\leqslant i \\leqslant |C|} x_i\n", 76 | "$$\n", 77 | "\n", 78 | "簇 $C_i$ 和 簇 $C_j$ 中心点间的距离:\n", 79 | "\n", 80 | "$$\n", 81 | "d_{cen}(C_i, C_j) = dist(\\mu_i, \\mu_j)\n", 82 | "$$\n", 83 | "\n", 84 | "从而得到下面这些常用的内部指标:\n", 85 | "\n", 86 | "* **DB 指数**(Davies-Bouldin Index,简称 DBI)\n", 87 | "\n", 88 | "$$\n", 89 | "DBI = \\frac{1}{k} \\sum_{1}^k \\mathop \\max _{j \\neq i} ( \\frac{avg(C_i) + avg(C_j)}{d_{cen}(C_i, C_j)} )\n", 90 | "$$\n", 91 | "\n", 92 | "* **Dunn 指数**(Dunn Index,简称 DI)\n", 93 | "\n", 94 | "$$\n", 95 | "DI = \\mathop \\min_{1 \\leqslant i \\leqslant k} \\Big \\{ \\mathop \\min_{j \\neq i} \\Big ( \\frac{d_{min}(C_i,C_j)}{\\mathop \\max_{1 \\leqslant l \\leqslant k}diam(C_l)} \\Big ) \\Big \\}\n", 96 | "$$\n", 97 | "\n", 98 | "## 距离的度量\n", 99 | "\n", 100 | "聚类的核心概念是 **距离**(distance)或 **相似度**(similarity),距离或相似度的定义多种多样,选择合适的距离度量方法是聚类的关键。\n", 101 | "\n", 102 | "* 闵可夫斯基距离(闵氏距离)\n", 
103 | " * 欧式距离\n", 104 | " * 曼哈顿距离\n", 105 | " * 切比雪夫距离\n", 106 | "* 马哈拉诺比斯距离(马氏距离)\n", 107 | "* 相关系数\n", 108 | "* 夹角余弦\n", 109 | "\n", 110 | "参考 [kNN 距离度量方法](../kNN/02.%20距离度量方法.ipynb)" 111 | ] 112 | } 113 | ], 114 | "metadata": { 115 | "kernelspec": { 116 | "display_name": "Python 3", 117 | "language": "python", 118 | "name": "python3" 119 | }, 120 | "language_info": { 121 | "codemirror_mode": { 122 | "name": "ipython", 123 | "version": 3 124 | }, 125 | "file_extension": ".py", 126 | "mimetype": "text/x-python", 127 | "name": "python", 128 | "nbconvert_exporter": "python", 129 | "pygments_lexer": "ipython3", 130 | "version": "3.7.1" 131 | } 132 | }, 133 | "nbformat": 4, 134 | "nbformat_minor": 2 135 | } 136 | -------------------------------------------------------------------------------- /支持向量机/07. 非线性支持向量机和核函数.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "在前面的学习中,我们假设训练样本是线性可分的,也就是存在一个划分超平面能将训练样本正确分类,那么如果训练样本是非线性的,不存在这样的划分超平面时,该如何解决呢?\n", 8 | "\n", 9 | "譬如下面的异或问题就不是线性可分的:\n", 10 | "\n", 11 | "\n", 12 | "\n", 13 | "我们在学习非线性回归的时候也遇到过类似这种问题,可以将样本从原始空间映射到一个更高维的特征空间,比如上图中我们将其映射到三维空间,这时就能找到合适的划分超平面。" 14 | ] 15 | }, 16 | { 17 | "cell_type": "markdown", 18 | "metadata": {}, 19 | "source": [ 20 | "令 $\\phi(x)$ 表示 $x$ 映射后的特征向量,于是划分超平面可以表示为:\n", 21 | "\n", 22 | "$$\n", 23 | "f(x) = w^T\\phi(x) + b\n", 24 | "$$\n", 25 | "\n", 26 | "优化目标可以写为:\n", 27 | "\n", 28 | "$$\n", 29 | "\\begin{align}\n", 30 | "&\\mathop \\min_{w,b} \\frac{1}{2}\\|w\\|^2 \\\\\n", 31 | "&s.t. y_i(w^T\\phi(x_i) + b) \\geq 1, i = 1,2,\\dots,n\n", 32 | "\\end{align}\n", 33 | "$$\n", 34 | "\n", 35 | "其对偶问题为:\n", 36 | "\n", 37 | "$$\n", 38 | "\\begin{align}\n", 39 | "&\\mathop \\min_{\\alpha} \\frac{1}{2} \\sum_{i=1}^n \\sum_{j=1}^n \\alpha_i \\alpha_j y_i y_j \\phi(x_i)^T \\phi(x_j) - \\sum_{i=1}^n \\alpha_i \\\\\n", 40 | "&s.t. 
\\sum_{i=1}^n \\alpha_i y_i = 0, \\alpha_i \\geq 0, i = 1,2,\\dots,n\n", 41 | "\\end{align}\n", 42 | "$$\n", 43 | "\n", 44 | "上式中,$\\phi(x_i)^T \\phi(x_j)$ 表示样本 $x_i$ 和 $x_j$ 映射到特征空间之后的内积,由于特征空间维数可能很高,直接计算通常比较困难,可以设想这样一个函数:\n", 45 | "\n", 46 | "$$\n", 47 | "\\kappa(x_i, x_j) = \\langle \\phi(x_i), \\phi(x_j) \\rangle = \\phi(x_i)^T \\phi(x_j)\n", 48 | "$$\n", 49 | "\n", 50 | "也就是 $x_i$ 和 $x_j$ 在特征空间的内积等于它们在原始样本空间中通过函数 $\\kappa(\\centerdot, \\centerdot)$ 计算的结果,这个函数就是 **核函数**(kernel function)。\n", 51 | "\n", 52 | "通过引入核函数,非线性支持向量机的优化目标就可以写成下面这种形式:\n", 53 | "\n", 54 | "$$\n", 55 | "\\mathop \\min_{\\alpha} \\frac{1}{2} \\sum_{i=1}^n \\sum_{j=1}^n \\alpha_i \\alpha_j y_i y_j \\kappa(x_i, x_j) - \\sum_{i=1}^n \\alpha_i\n", 56 | "$$" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "那么这样的核函数是否存在呢?如果已知映射函数 $\\phi(x)$,那么显然核函数可以通过映射函数的内积计算得到,如果不知道 $\\phi(x)$ 的具体形式,核函数是否存在?什么样的函数可以当作核函数?\n", 64 | "\n", 65 | "这里引入核函数定理,如果函数 $\\kappa$ 是对称的,并且对任意的输入数据它对应的 **核矩阵**(kernel matrix)$K$ 总是 **半正定** 的,那么 $\\kappa$ 就是一个核函数。核矩阵是由核函数构成的一个矩阵:\n", 66 | "\n", 67 | "$$\n", 68 | "K = \n", 69 | "\\left [\n", 70 | "\\begin{array}{ccc}\n", 71 | "\\kappa(x_1, x_1) & \\kappa(x_1, x_2) & \\dots & \\kappa(x_1, x_n) \\\\\n", 72 | "\\vdots & \\vdots & \\vdots & \\vdots \\\\\n", 73 | "\\kappa(x_n, x_1) & \\kappa(x_n, x_2) & \\dots & \\kappa(x_n, x_n)\n", 74 | "\\end{array} \n", 75 | "\\right ]\n", 76 | "$$\n", 77 | "\n", 78 | "这个定理也叫 **Mercer定理**,也就是说,对于任何一个半正定的核矩阵,总能找到一个与之对应的映射 $\\phi$,对于任何一个核函数,都隐式的对应一个特征空间,这个特征空间被称为 **再生核希尔伯特空间**(Reproducing Kernel Hilbert Space,简称 RKHS 空间)。\n", 79 | "\n", 80 | "关于核函数的概念,涉及很多泛函分析的概念,包括各种函数空间的概念,可以参考:[说一说核方法(一)——核方法与核函数简介](http://sakigami-yang.me/2017/08/13/about-kernel-01/)、[说一说核方法(二)——数学角度简介](http://sakigami-yang.me/2017/08/13/about-kernel-02/)。\n", 81 | "\n", 82 | "下面是一些常用的核函数:\n", 83 | "\n", 84 | "![](../images/kernel-functions.jpg)" 85 | ] 86 | }, 87 | { 88 | "cell_type": "markdown", 89 | "metadata": {}, 90 | "source": [ 91 | "可以看出使用核函数解决非线性分类问题的基本思想,是通过一个非线性变换将输入空间(欧式空间 $R^n$ 或离散集合)对应于一个特征空间(希尔伯特空间 $\\mathcal{H}$),使得在输入空间中的超曲面模型对应特征空间中的超平面模型,这样,分类问题的学习任务通过在特征空间中求解线性支持向量机就可以完成。\n", 92 | "\n", 93 | "实际上这样的技巧可以用在更广泛的地方,对于满足某些条件的优化问题,其最优解都可以表示成核函数的线性组合,这被称为 **表示定理**(representer theorem):\n", 94 | "\n", 95 | "> 对于下面的优化问题:\n", 96 | "> $$ \\mathop \\min_{w} \\frac{\\lambda}{2} \\|w\\|^2 + \\frac{1}{m} \\sum_{i=1}^m \\ell(w^T\\phi(x_i), y_i) $$\n", 97 | "> 它的解 $w$ 可以表示成样本的线性组合:\n", 98 | "> $$ w = \\sum_{i=1}^m \\alpha_i \\phi(x_i) $$\n", 99 | "\n", 100 | "通过核函数,我们可以将线性模型扩展成非线性模型,这启发了一系列基于核函数的学习方法,被统称为 **核方法**(kernel methods)或 **核技巧**(kernel trick)。譬如在 **线性判别分析**(LDA)中引入核方法对其进行非线性扩展,就得到了 **核线性判别分别**(KLDA)。" 101 | ] 102 | } 103 | ], 104 | "metadata": { 105 | "kernelspec": { 106 | "display_name": "Python 3", 107 | "language": "python", 108 | "name": "python3" 109 | }, 110 | "language_info": { 111 | "codemirror_mode": { 112 | "name": "ipython", 113 | "version": 3 114 | }, 115 | "file_extension": ".py", 116 | "mimetype": "text/x-python", 117 | "name": "python", 118 | "nbconvert_exporter": "python", 119 | "pygments_lexer": "ipython3", 120 | "version": "3.7.1" 121 | } 122 | }, 123 | "nbformat": 4, 124 | "nbformat_minor": 2 125 | } 126 | -------------------------------------------------------------------------------- /逻辑回归/10. 
分类模型的评估方法.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "和回归模型一样,这一节介绍一些常用的方法来评估分类模型的好坏。\n", 8 | "\n", 9 | "## 错误率和精度\n", 10 | "\n", 11 | "错误率指分类错误的样本数占样本总数的比例:\n", 12 | "\n", 13 | "$$\n", 14 | "E(f;D) = \\frac{1}{m} \\sum_{i=1}^m I(f(x_i) \\neq y_i)\n", 15 | "$$\n", 16 | "\n", 17 | "精度和错误率相反,指的是分类正确的样本数占样本总数的比例:\n", 18 | "\n", 19 | "$$\n", 20 | "acc(f;D) = \\frac{1}{m} \\sum_{i=1}^m I(f(x_i) = y_i) = 1 - E(f;D)\n", 21 | "$$\n", 22 | "\n", 23 | "## 查准率和查全率\n", 24 | "\n", 25 | "对于二分类问题,可以将样本的真实类别和模型的识别结果组成四种情形:\n", 26 | "\n", 27 | "* True Positive\n", 28 | " * 真正类(TP),样本的真实类别是正类,并且模型识别的结果也是正类。\n", 29 | "* False Negative\n", 30 | " * 假负类(FN),样本的真实类别是正类,但是模型将其识别成为负类。\n", 31 | "* False Positive\n", 32 | " * 假正类(FP),样本的真实类别是负类,但是模型将其识别成为正类。\n", 33 | "* True Negative\n", 34 | " * 真负类(TN),样本的真实类别是负类,并且模型将其识别成为负类。\n", 35 | " \n", 36 | "将这四种情形写成矩阵形式如下:\n", 37 | "\n", 38 | "\n", 39 | " \n", 40 | " \n", 41 | " \n", 42 | " \n", 43 | " \n", 44 | " \n", 45 | " \n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | "
| 真实情况 | 预测结果:正例 | 预测结果:反例 |
| -------- | -------------- | -------------- |
| 正例 | TP(真正例) | FN(假反例) |
| 反例 | FP(假正例) | TN(真反例) |
\n", 58 | "\n", 59 | "这个矩阵被称为 **混淆矩阵**(confusion matrix)。\n", 60 | "\n", 61 | "**查准率**(precision)也叫做 **准确率**,定义如下:\n", 62 | "\n", 63 | "$$\n", 64 | "P = \\frac{TP}{TP + FP}\n", 65 | "$$\n", 66 | "\n", 67 | "**查全率**(recall)也叫做 **召回率**,定义如下:\n", 68 | "\n", 69 | "$$\n", 70 | "R = \\frac{TP}{TP + FN}\n", 71 | "$$\n", 72 | "\n", 73 | "查准率和查全率是一对矛盾的度量,一般情况下,查准率高时,查全率会偏低,查全率高时,查准率会偏低。以查准率为纵轴,查全率为横轴,可以画出查准率-查全率曲线,简称为 **P-R曲线**:\n", 74 | "\n", 75 | "![](../images/pr-curve.webp)\n", 76 | "\n", 77 | "如果一个学习器的 P-R 曲线把另一个学习器的 P-R 曲线包住,则可以断言第一个学习器要好,比如上图中 A 和 B 的性能要大于 C。\n", 78 | "\n", 79 | "如果两个学习器的 P-R 曲线发生了交叉,如上图中 A 和 B,这时两个学习器的性能就不好比较了,通常有几种不同的方法:\n", 80 | "\n", 81 | "* 比较 P-R 曲线下面积的大小,但这个面积不太容易估算\n", 82 | "* 比较 BEP 大小,BEP 是 **平衡点**(Break-Even Point)的简称,它是 查准率 = 查全率 时的取值,比如上图中 A 的 BEP = 0.8,B 的 BEP = 0.72,所以 A 优于 B\n", 83 | "* 比较 $F_1$ 大小,根据 $F_1$ 还可以推广到一般形式 $F_\\beta$,可以根据实际情况调整查准率和查全率的相对重要性\n", 84 | "\n", 85 | "$F_1$ 定义如下:\n", 86 | "\n", 87 | "$$\n", 88 | "F_1 = \\frac{2 \\times P \\times R}{P + R}\n", 89 | "$$\n", 90 | "\n", 91 | "$F_1$ 是基于查准率和查全率的 **调和平均**(harmonic mean)定义出来的:\n", 92 | "\n", 93 | "$$\n", 94 | "\\frac{1}{F_1} = \\frac{1}{2} (\\frac{1}{P} + \\frac{1}{R})\n", 95 | "$$\n", 96 | "\n", 97 | "$F_\\beta$ 定义如下:\n", 98 | "\n", 99 | "$$\n", 100 | "F_\\beta = \\frac{(1 + \\beta^2) \\times P \\times R}{(\\beta^2 + P) + R}\n", 101 | "$$\n", 102 | "\n", 103 | "$F_\\beta$ 是基于查准率和查全率的 **加权调和平均** 定义出来的:\n", 104 | "\n", 105 | "$$\n", 106 | "\\frac{1}{F_\\beta} = \\frac{1}{1 + \\beta^2} (\\frac{1}{P} + \\frac{\\beta^2}{R})\n", 107 | "$$\n", 108 | "\n", 109 | "> 和算术平均($\\frac{P+R}{2}$)和几何平均($\\sqrt{P \\times R}$)相比,调和平均更重视较小值。\n", 110 | "\n", 111 | "## ROC 与 AUC\n", 112 | "\n", 113 | "从混淆矩阵中还可以定义出 **真正例率**(True Positive Rate,简称 TPR)和 **假正例率**(False Positive Rate,简称 FPR):\n", 114 | "\n", 115 | "$$\n", 116 | "TPR = \\frac{TP}{TP + FN}\n", 117 | "$$\n", 118 | "\n", 119 | "$$\n", 120 | "FPR = \\frac{FP}{TN + FP}\n", 121 | "$$\n", 122 | "\n", 123 | "将 TPR 作为纵轴,FPR 作为横轴,可以画出 **ROC 曲线**(ROC 全称为 Receiver Operating Characteristic,受试者工作特征):\n", 124 | "\n", 125 | "![](../images/roc-auc.webp)\n", 126 | "\n", 127 | "通过 ROC 也可以判断两个学习器的性能优劣,如果一个学习器的 ROC 包住另一个学习器的 ROC,则认为第一个学习器要好。如果两个学习器的 ROC 发生交叉,则可以通过比较 ROC 曲线下的面积,这个面积有一个专门的名称叫 **AUC**(Area Under ROC Curve)。" 128 | ] 129 | } 130 | ], 131 | "metadata": { 132 | "kernelspec": { 133 | "display_name": "Python 3", 134 | "language": "python", 135 | "name": "python3" 136 | }, 137 | "language_info": { 138 | "codemirror_mode": { 139 | "name": "ipython", 140 | "version": 3 141 | }, 142 | "file_extension": ".py", 143 | "mimetype": "text/x-python", 144 | "name": "python", 145 | "nbconvert_exporter": "python", 146 | "pygments_lexer": "ipython3", 147 | "version": "3.7.1" 148 | } 149 | }, 150 | "nbformat": 4, 151 | "nbformat_minor": 2 152 | } 153 | -------------------------------------------------------------------------------- /线性回归/10. 
什么是梯度下降.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "我们前面在求解线性回归时,都是通过下面的这个被叫做**正规方程**的式子来计算线性回归的:\n", 8 | "\n", 9 | "$$\n", 10 | "\\bf{a} = (\\rm{X}^T\\rm{X})^{-1}\\rm{X}^T\\bf{y}\n", 11 | "$$\n", 12 | "\n", 13 | "使用这种方法虽然简单,容易理解,但在现实生活中却有着很大的局限性。首先,并非所有的函数都可以通过求导来计算它的极值点,有些函数的导数根本不存在解析解;其次,在正规方程的计算过程中使用的矩阵 $\\rm{X}^T\\rm{X}$ 必须是可逆的,且不说有些矩阵是不可逆的,还有些矩阵的求逆非常困难,比如 **希尔伯特矩阵**;另外,当数据量或维度不断增加时,矩阵计算的开销也会越来越大,要知道,矩阵求逆的时间复杂度是矩阵维度的三次方,因此当数据量过大时,使用正规方程会非常耗时,数据量如果只有成百上千,可能影响并不大,一旦数据量达到万级以上,使用正规方程的性能问题就会凸显出来。\n", 14 | "\n", 15 | "我们以一元线性回归为例,在前面的最小二乘法求解过程中我们知道,求解一元线性回归可以转化为求解损失函数的最小值,损失函数是一个关于参数 a, b 的二元二次函数:\n", 16 | "\n", 17 | "$$\n", 18 | "loss = n\\bar{y^2} + a^2n\\bar{x^2} + nb^2 + 2abn\\bar{x} - 2an\\bar{xy} - 2bn\\bar{y}\n", 19 | "$$\n", 20 | "\n", 21 | "它是三维空间里的一个曲面,图像类似于下面左图这样:\n", 22 | "\n", 23 | "![](../images/loss-2.jpg)\n", 24 | "\n", 25 | "右图代表的是损失函数的**等高图**,每一条线表示损失函数的值相等,红色的叉号表示损失函数的最小值。要求这个最小值,除了前面介绍的正规方程法,这一节我们将学习一种新的方法,**梯度下降**(Gradient Descent),它比正规方程更具有广泛性,可以用于很多复杂算法的求解。\n", 26 | "\n", 27 | "梯度下降算法的基本思路是,在坡上任意取一点,然后沿着下坡方向走最后到达最低点,为了走的最快,我们每次都沿着最陡的方向下坡,这样不断的迭代,直到损失函数收敛到最小值。\n", 28 | "\n" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "### 什么是梯度?\n", 36 | "\n", 37 | "为了更好的理解梯度下降,我们来看看什么是**梯度**(gradient),而要知道什么是梯度,还得从函数的**导数**(derivative)说起。\n", 38 | "\n", 39 | "导数的公式如下所示:\n", 40 | "\n", 41 | "$$\n", 42 | "f'(x) = \\lim\\limits_{\\Delta x \\to 0}\\frac{\\Delta y}{\\Delta x}\n", 43 | " = \\lim\\limits_{\\Delta x \\to 0}\\frac{f(x_0 + \\Delta x) - f(x_0)}{\\Delta x}\n", 44 | "$$\n", 45 | "\n", 46 | "它可以表示函数在某点的切线斜率,也就是函数在该点附近的变化率,如果导数大于零,那么函数在区间内单调递增,如果导数小于零,函数在区间内则是单调递减。\n", 47 | "\n", 48 | "![](../images/gradient-1.jpg)\n", 49 | "\n", 50 | "上面是一个一元函数的导数示例,当扩展到二元或更多元的情况时,就出现了 **偏导数**(partial derivative)和 **方向导数**(directional derivative),偏导数是多元函数沿坐标轴的变化率,而方向导数则是多元函数沿任意方向的变化率。\n", 51 | "\n", 52 | "从上面的导数定义可以看出,导数是一个数,包括偏导数和方向导数也是一个数,代表的是函数沿某个方向的变化率。梯度并不是一个数,而是一个向量,它代表的是在各个导数中,变化趋势最大的那个方向。梯度的数学定义为:\n", 53 | "\n", 54 | "> 设函数 $z = f(x, y)$ 在平面区域 $D$ 内具有一阶连续偏导数,则对于每一点 $(x_0, y_0)$ 属于 $D$,都可定义一个向量:\n", 55 | ">\n", 56 | "> $$ \\frac{\\partial{f(x_0, y_0)}}{\\partial x}i + \\frac{\\partial{f(x_0, y_0)}}{\\partial y}j $$\n", 57 | ">\n", 58 | "> 这个向量被称为函数 $z = f(x, y)$ 在点 $(x_0, y_0)$ 处的梯度,记作 $grad f(x_0, y_0)$,即:\n", 59 | ">\n", 60 | ">$$\n", 61 | "\\begin{align}\n", 62 | "grad f(x_0, y_0) &= \\frac{\\partial{f(x_0, y_0)}}{\\partial x}i + \\frac{\\partial{f(x_0, y_0)}}{\\partial y}j \\\\\n", 63 | " &= f_x(x_0, y_0)i + f_y(x_0, y_0)j\n", 64 | "\\end{align}\n", 65 | "$$\n", 66 | ">\n", 67 | "> 有时候,梯度也可以用 $\\nabla$ 来表示:\n", 68 | "> $$ grad f(x_0, y_0) = \\nabla f(x_0, y_0) $$\n", 69 | ">\n", 70 | ">其中,$\\frac{\\partial}{\\partial x}i + \\frac{\\partial}{\\partial y}j$ 被称为 **向量微分算子**或 **Nabla 算子**。\n", 71 | "\n", 72 | "可以看到梯度就是对各个坐标轴分别求偏导,得到的一个向量,下面是一个例子,假设 $J(\\theta)$ 是一个三元一次函数:\n", 73 | "\n", 74 | "$$\n", 75 | "J(\\theta) = 5\\theta_1 + 2\\theta_2 - 12\\theta_3\n", 76 | "$$\n", 77 | "\n", 78 | "该函数的梯度为:\n", 79 | "\n", 80 | "$$\n", 81 | "\\nabla J(\\theta) = \\langle \\frac{\\partial J}{\\partial \\theta_1}, \\frac{\\partial J}{\\partial \\theta_2}, \\frac{\\partial J}{\\partial \\theta_3} \\rangle \\\\\n", 82 | "= \\langle 5, 2, -12 \\rangle\n", 83 | "$$\n", 84 | "\n", 85 | 
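下面用一小段 Python 代码对这个梯度做数值验证,思路是用中心差分近似各个偏导数(代码只是演示用的草稿,其中的函数 J、取点 theta 和步长 h 都是为说明假设的):

```python
import numpy as np

def J(theta):
    # 示例函数 J(θ) = 5θ1 + 2θ2 - 12θ3
    return 5 * theta[0] + 2 * theta[1] - 12 * theta[2]

def numerical_gradient(f, theta, h=1e-6):
    # 用中心差分 (f(θ+h·e_i) - f(θ-h·e_i)) / 2h 逐分量近似偏导数
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = h
        grad[i] = (f(theta + e) - f(theta - e)) / (2 * h)
    return grad

theta = np.array([1.0, 1.0, 1.0])    # 任取一点
print(numerical_gradient(J, theta))  # 输出近似为 [  5.   2. -12.]
```

由于该函数是线性的,梯度处处相同,数值结果与上面手工求出的 $\langle 5, 2, -12 \rangle$ 一致。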
"那么为什么说梯度是变化趋势最大的方向呢?可以参考[这里](https://zhuanlan.zhihu.com/p/24913912)。总之,沿着梯度方向,函数会以最快的速度到达极大值,而沿着负梯度方向,函数会以最快的速度到达极小值。如果这个函数是一个**凸函数**,那么极值点就是最值点。假设上面的 $f(x,y)$ 就是我们的损失函数,且它是一个凸函数,那么我们就可以通过沿着负梯度方向,不断迭代,最终收敛在最小值。\n", 86 | "\n", 87 | "![](../images/gradient-direction.jpg)\n", 88 | "\n", 89 | "不过,如果损失函数是一个**非凸函数**,则可能会存在多个极值点,使用梯度下降法收敛的结果可能只是一个局部最小值,而不是全局最小值,譬如下图中的两种情况,可以看出,选择的初始点不同,计算出的极值点也不尽相同。\n", 90 | "\n", 91 | "![](../images/gradient-direction-2.jpg)\n", 92 | "\n", 93 | "这是梯度下降法的一个最主要的缺点,对初始点的选择极为敏感,不可避免的会存在陷入局部极小值的情形,梯度下降另一个缺点是,当到达最小点附近的时候收敛速度会变的很慢,所以针对这两个缺点,后来又提出了很多改进算法,比如 **拟牛顿法**,**共轭梯度法** 等。" 94 | ] 95 | } 96 | ], 97 | "metadata": { 98 | "kernelspec": { 99 | "display_name": "Python 3", 100 | "language": "python", 101 | "name": "python3" 102 | }, 103 | "language_info": { 104 | "codemirror_mode": { 105 | "name": "ipython", 106 | "version": 3 107 | }, 108 | "file_extension": ".py", 109 | "mimetype": "text/x-python", 110 | "name": "python", 111 | "nbconvert_exporter": "python", 112 | "pygments_lexer": "ipython3", 113 | "version": "3.7.1" 114 | } 115 | }, 116 | "nbformat": 4, 117 | "nbformat_minor": 2 118 | } 119 | -------------------------------------------------------------------------------- /线性回归/14. 最优化问题的其他算法.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "机器学习的本质是求解最优化问题,除了之前提到的线性问题的两种解法之外,还有很多相关的优化算法,这些算法不仅适用于线性模型,也适用于非线性模型。\n", 8 | "\n", 9 | "### 正规方程的优化算法\n", 10 | "\n", 11 | "在正规方程的计算过程中,我们既要对矩阵求逆,又要求矩阵乘法,如果直接计算,算法复杂度往往很高。这里的矩阵计算有很多优化方法,譬如对 X′X 进行 Cholesky 分解,或者对 X 进行 QR 分解等等。\n", 12 | "\n", 13 | "### 梯度下降的优化算法\n", 14 | "\n", 15 | "前面介绍了梯度下降的三种形式:BGD、SGD 和 MBGD。但在实际运用中还会面临很多挑战,具体的可以参考 Sebastian Ruder 的这篇论文 [An overview of gradient descent optimization algorithms](https://arxiv.org/pdf/1609.04747.pdf),这里是相应的 [中文翻译](https://blog.csdn.net/google19890102/article/details/69942970),论文中提到了 6 种优化算法:\n", 16 | "\n", 17 | "* Momentum(动量法)\n", 18 | "* NAG(Nesterov加速梯度下降法,Nesterov accelerated gradient)\n", 19 | "* Adagrad\n", 20 | "* Adadelta\n", 21 | "* RMSprop\n", 22 | "* Adam(自适应矩估计,Adaptive Moment Estimation)\n", 23 | "\n", 24 | "### 无约束项优化算法\n", 25 | "\n", 26 | "\n", 27 | "\n", 28 | "#### 牛顿法\n", 29 | "\n", 30 | "梯度下降主要是从目标函数的一阶导数推导而来的,形象点说,就是每次朝着当前梯度最大的方向收敛;而牛顿法是二阶收敛,每次考虑收敛方向的时候,还会考虑下一次的收敛的方向是否是最大(也就是梯度的梯度)。可以参考下图,红线为牛顿法,绿线为梯度下降:\n", 31 | "\n", 32 | "![](../images/newton.jpeg)\n", 33 | "\n", 34 | "本质上来看,牛顿法是二阶收敛,而梯度下降则为一阶收敛,所以牛顿法更快。简单来说,梯度下降是从所处位置选择一个坡度最大的方向走一步,而牛顿法则在选择方向时,不仅考虑坡度,还会考虑下一步的坡度是否会变得更大。\n", 35 | "\n", 36 | "几何上来说,牛顿法是用一个二次曲面去拟合当前所处位置的局部曲面,而梯度下降法是用一个平面去拟合当前的局部曲面,通常情况下,二次曲面的拟合会比平面更好,所以牛顿法选择的下降路径会更符合真实的最优下降路径。\n", 37 | "\n", 38 | "牛顿法的优点是二阶收敛,收敛速度快;它的缺点是每一步都需要求解目标函数的Hessian矩阵的逆矩阵,计算比较复杂。\n", 39 | "\n", 40 | "#### 高斯-牛顿迭代算法\n", 41 | "\n", 42 | "由于牛顿法需要求解二阶导,也就是 Hessian matrix,运算量大,不利于实现,所以通常在牛顿法的基础上用去掉二阶项,用一阶项来近似二阶导,从而保证了计算效率。\n", 43 | "\n", 44 | "高斯-牛顿迭代算法(Gauss-Newton iteration method)是另一种经常用来求解非线性最小二乘的迭代法,其原理是利用了 **泰勒展开公式**,其最大优点是收敛速度快。\n", 45 | "\n", 46 | "Taylor 级数求得原目标函数的二阶近似:\n", 47 | "\n", 48 | "$$\n", 49 | "f(x) \\approx \\phi(x) \n", 50 | "= f(x^{(k)}) + \\nabla f(x^{(k)})(x - x^{(k)}) + \\frac{1}{2}(x - x^{(k)})^T \\nabla^2 f(x^{k})(x - x^{(k)})\n", 51 | "$$\n", 52 | "\n", 53 | "把 $x$ 看作自变量,所有带有 $x^{(k)}$ 的项看作常量,令一阶导数等于 0,即可求近似函数的最小值:\n", 54 | "\n", 55 | "$$\n", 56 | "P_k = -(\\nabla^2 f(x_k))^{-1} \\nabla f(x_k) = - {Hesse}^{-1} {Jacobi}\n", 57 | "$$\n", 58 | "\n", 59 | "上边的 Hesse 矩阵,是一个多元函数的二阶偏导数构成的方阵,描述了函数的局部曲率。\n", 60 | "\n", 
61 | "对于二元的情况,根据上述泰勒展开公式及求导,取0,可以得到如下迭代公式:\n", 62 | "\n", 63 | "$$\n", 64 | "x_{k+1} = x_k - \\frac{f'(x_k)}{f''(x_k)}\n", 65 | "$$\n", 66 | "\n", 67 | "可以看出,因为我们需要求矩阵逆,当 Hesse 矩阵不可逆时无法计算。而且矩阵的逆计算复杂度为n的立方,当规模很大时,计算量超大,通常改良做法是采用 **拟牛顿法**,如 BFGS、L-BFGS 等。此外,如果初始值离局部极小值太远,Taylor 展开并不能对原函数进行良好的近似。\n", 68 | "\n", 69 | "#### 拟牛顿法\n", 70 | "\n", 71 | "**拟牛顿法**(Quasi-Newton Methods)的本质思想是改善牛顿法每次需要求解复杂的Hessian矩阵的逆矩阵的缺陷,它使用正定矩阵来近似Hessian矩阵的逆,从而简化了运算的复杂度。拟牛顿法和最速下降法一样只要求每一步迭代时知道目标函数的梯度。通过测量梯度的变化,构造一个目标函数的模型使之足以产生超线性收敛性。这类方法大大优于最速下降法,尤其对于困难的问题。另外,因为拟牛顿法不需要二阶导数的信息,所以有时比牛顿法更为有效。如今,优化软件中包含了大量的拟牛顿算法用来解决无约束,约束,和大规模的优化问题。\n", 72 | "\n", 73 | "#### LM 方法\n", 74 | "\n", 75 | "Levenberg-Marquardt 方法,是由于高斯-牛顿方法在计算时需要保证矩阵的正定性,于是引入了一个约束,从而保证计算方法更具普适性。\n", 76 | "\n", 77 | "### 共轭梯度\n", 78 | "\n", 79 | "共轭梯度法(Conjugate gradient, CG)\n", 80 | "\n", 81 | "https://cosx.org/2016/11/conjugate-gradient-for-regression/\n", 82 | "\n", 83 | "### 近端梯度\n", 84 | "\n", 85 | "近端梯度法(Proximal Gradient Method ,PG)\n", 86 | "\n", 87 | "https://blog.csdn.net/qq547276542/article/details/78251779\n", 88 | "\n", 89 | "http://roachsinai.github.io/2016/08/03/1Proximal_Method/\n", 90 | "\n", 91 | "### 参考\n", 92 | "\n", 93 | "* [梯度下降优化算法综述](https://blog.csdn.net/google19890102/article/details/69942970)\n", 94 | "* [一文通透优化算法:从随机梯度、随机梯度下降法到牛顿法、共轭梯度](https://blog.csdn.net/v_JULY_v/article/details/81350035)\n", 95 | "* [关于梯度下降法、牛顿法、高斯-牛顿、LM方法的总结](https://blog.csdn.net/wuaini_1314/article/details/79562400)\n", 96 | "* [高斯牛顿(Gauss Newton)、列文伯格-马夸尔特(Levenberg-Marquardt)最优化算法与VSLAM](https://blog.csdn.net/zhubaohua_bupt/article/details/74973347)\n", 97 | "* [常见的几种最优化方法(梯度下降法、牛顿法、拟牛顿法、共轭梯度法等)](https://www.cnblogs.com/shixiangwan/p/7532830.html)\n", 98 | "* [Gauss-Newton algorithm for nonlinear models](http://fourier.eng.hmc.edu/e176/lectures/NM/node36.html)" 99 | ] 100 | } 101 | ], 102 | "metadata": { 103 | "kernelspec": { 104 | "display_name": "Python 3", 105 | "language": "python", 106 | "name": "python3" 107 | }, 108 | "language_info": { 109 | "codemirror_mode": { 110 | "name": "ipython", 111 | "version": 3 112 | }, 113 | "file_extension": ".py", 114 | "mimetype": "text/x-python", 115 | "name": "python", 116 | "nbconvert_exporter": "python", 117 | "pygments_lexer": "ipython3", 118 | "version": "3.7.1" 119 | } 120 | }, 121 | "nbformat": 4, 122 | "nbformat_minor": 2 123 | } 124 | -------------------------------------------------------------------------------- /支持向量机/03. 
硬间隔与软间隔.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "在前面的例子中,我们假定样本空间是线性可分的,即存在一个超平面将不同类别完全划分开来。然而在现实的分类任务中,样本空间往往是线性不可分的,譬如下面这样:\n", 8 | "\n", 9 | "![](../images/svm-unseparable.png)\n", 10 | "\n", 11 | "可见,数据中混入了一些异常点,导致没办法通过一个超平面将其分成两个部分。解决这个问题的一个办法是,允许支持向量机在一些样本上出错。在前面介绍支持向量机的基本形式时,我们要求所有的样本都满足下面的约束条件:\n", 12 | "\n", 13 | "$$\n", 14 | "\\left\\{\n", 15 | "\\begin{align}\n", 16 | "w^Tx_i + b \\ge +1, y_i = +1 \\\\\n", 17 | "w^Tx_i + b \\le -1, y_i = -1\n", 18 | "\\end{align}\n", 19 | "\\right.\n", 20 | "$$\n", 21 | "\n", 22 | "也可以简写成:\n", 23 | "\n", 24 | "$$\n", 25 | "y_i(w^Tx_i + b) \\geq 1\n", 26 | "$$\n", 27 | "\n", 28 | "这个约束条件确保所有样本都被正确划分,这被称为 **硬间隔**(hard margin),我们把这个约束条件稍微放宽,允许某些样本不满足该条件,得到的就是 **软间隔**(soft margin),当然,我们希望不满足约束条件的样本越少越好。\n", 29 | "\n", 30 | "为此,我们对每个样本引入一个松弛变量 $\\xi_i \\geq 0$,约束条件变成:\n", 31 | "\n", 32 | "$$\n", 33 | "y_i(w^Tx_i + b) \\geq 1 - \\xi_i\n", 34 | "$$\n", 35 | "\n", 36 | "同时,对每一个松弛变量 $\\xi$,支付一个代价 C,所以优化目标就变成了:\n", 37 | "\n", 38 | "$$\n", 39 | "\\begin{align}\n", 40 | "&\\mathop \\min_{w,b} \\frac{1}{2}\\|w\\|^2 + C\\sum_{i=1}^n\\xi_i \\\\\n", 41 | "&s.t. y_i(w^Tx_i + b) \\geq 1 - \\xi_i, \\xi_i \\geq 0, i = 1,2,\\dots,n\n", 42 | "\\end{align}\n", 43 | "$$\n", 44 | "\n", 45 | "这就是软间隔支持向量机的基本形式。\n", 46 | "\n", 47 | "这里的 C 是一个超参数,决定了你有多重视异常点带来的损失。显然,当 C 为无穷大时,为了求优化目标的最小值,这里加上的松弛变量类似于惩罚项,要非常小甚至等于 0 才行,也就是 $\\xi_i = 0$,会迫使所有的样本都满足约束 $y_i(w^Tx_i + b) \\geq 1$,这就和硬间隔一样;当 C 为某一常数时,允许某些样本不满足约束,得到的就是软间隔。" 48 | ] 49 | }, 50 | { 51 | "cell_type": "markdown", 52 | "metadata": {}, 53 | "source": [ 54 | "### 损失函数的其他形式\n", 55 | "\n", 56 | "上面通过在硬间隔的优化目标中引入松弛变量 $\\xi_i$ 得到了软间隔支持向量机的基本形式。实际上,我们还有很多其他的方式来推导软间隔的优化目标,最简单的想法是在优化目标中加入 **0/1损失函数**,硬间隔的优化目标为:\n", 57 | "\n", 58 | "$$\n", 59 | "\\mathop \\min_{w,b} \\frac{1}{2}\\|w\\|^2\n", 60 | "$$\n", 61 | "\n", 62 | "加入 0/1损失函数后,优化目标变成了:\n", 63 | "\n", 64 | "$$\n", 65 | "\\mathop \\min_{w,b} \\frac{1}{2}\\|w\\|^2 + C \\sum_{i=1}^m \\ell_{0/1}(y_i(w^Tx_i + b) - 1)\n", 66 | "$$\n", 67 | "\n", 68 | "其中,$\\ell_{0/1}$ 就是 0/1损失函数,当样本满足硬间隔的约束条件时,也就是 $y_i(w^Tx_i + b) \\geq 1$ 时,损失为 0,不满足时损失为 1:\n", 69 | "\n", 70 | "$$\n", 71 | "\\ell_{0/1}(z) =\n", 72 | "\\left\\{\n", 73 | "\\begin{align}\n", 74 | "0, z \\geq 0 \\\\\n", 75 | "1, z < 0 \\\\\n", 76 | "\\end{align}\n", 77 | "\\right.\n", 78 | "$$\n", 79 | "\n", 80 | "但是 0/1损失函数不连续,也不是凸函数,导致优化目标不易求解,通常用一些其他函数来替代 0/1损失函数,这被称为 **替代损失**(surrogate loss),替代损失一般具有良好的数学性质,通常是凸的连续函数并且是 0/1损失函数的上界。比如 **Hinge损失**:\n", 81 | "\n", 82 | "$$\n", 83 | "\\ell_{hinge}(z) = max(0, 1-z)\n", 84 | "$$\n", 85 | "\n", 86 | "Hinge损失,又叫 **合页损失** 或 **铰链损失**,它就像一个打开的合页形状:\n", 87 | "\n", 88 | "![](../images/hinge-loss.png)\n", 89 | "\n", 90 | "加入 Hinge损失函数后,优化目标变成了:\n", 91 | "\n", 92 | "$$\n", 93 | "\\mathop \\min_{w,b} \\frac{1}{2}\\|w\\|^2 + C \\sum_{i=1}^m \\max(0, 1 - y_i(w^Tx_i + b))\n", 94 | "$$\n", 95 | "\n", 96 | "其中,当样本满足硬间隔的约束条件时,也就是 $y_i(w^Tx_i + b) \\geq 1$ 时,这时 $1 - y_i(w^Tx_i + b) \\leq 0$,Hinge损失为 0;当样本不满足约束条件时损失为 $1 - y_i(w^Tx_i + b)$。\n", 97 | "\n", 98 | "令:\n", 99 | "\n", 100 | "$$\n", 101 | "\\xi_i = \\max(0, 1 - y_i(w^Tx_i + b))\n", 102 | "$$\n", 103 | "\n", 104 | "于是得到和上面一样的优化目标:\n", 105 | "\n", 106 | "$$\n", 107 | "\\mathop \\min_{w,b} \\frac{1}{2}\\|w\\|^2 + C \\sum_{i=1}^m \\xi_i\n", 108 | "$$\n", 109 | "\n", 110 | "很显然:\n", 111 | "\n", 112 | "$$\n", 113 | "\\left\\{\n", 114 | "\\begin{align}\n", 115 | "& \\xi_i = \\max(0, 1 - y_i(w^Tx_i + b)) \\geq 1 
- y_i(w^Tx_i + b) \\\\\n", 116 | "& \\xi_i = \\max(0, 1 - y_i(w^Tx_i + b)) \\geq 0\n", 117 | "\\end{align}\n", 118 | "\\right .\n", 119 | "$$\n", 120 | "\n", 121 | "所以有约束条件:\n", 122 | "\n", 123 | "$$\n", 124 | "y_i(w^Tx_i + b) \\geq 1 - \\xi_i, \\xi_i \\geq 0\n", 125 | "$$\n", 126 | "\n", 127 | "除 Hinge损失之外,还有很多其他的替代损失,比如 **指数损失**(exponential loss):\n", 128 | "\n", 129 | "$$\n", 130 | "\\ell_{exp}(z) = exp(-z)\n", 131 | "$$\n", 132 | "\n", 133 | "或者 **对率损失**(logistic loss):\n", 134 | "\n", 135 | "$$\n", 136 | "\\ell_{log}(z) = log(1+exp(-z))\n", 137 | "$$\n", 138 | "\n", 139 | "他们的图像如下所示:\n", 140 | "\n", 141 | "![](../images/svm-3-loss.jpg)" 142 | ] 143 | } 144 | ], 145 | "metadata": { 146 | "kernelspec": { 147 | "display_name": "Python 3", 148 | "language": "python", 149 | "name": "python3" 150 | }, 151 | "language_info": { 152 | "codemirror_mode": { 153 | "name": "ipython", 154 | "version": 3 155 | }, 156 | "file_extension": ".py", 157 | "mimetype": "text/x-python", 158 | "name": "python", 159 | "nbconvert_exporter": "python", 160 | "pygments_lexer": "ipython3", 161 | "version": "3.7.1" 162 | } 163 | }, 164 | "nbformat": 4, 165 | "nbformat_minor": 2 166 | } 167 | -------------------------------------------------------------------------------- /其他资料/课程/02. Python3入门机器学习经典算法与应用.liuyubobobo.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "* 课程名称:Python3入门机器学习经典算法与应用\n", 8 | "* 课程来源:https://coding.imooc.com/\n", 9 | "* 课程地址:https://coding.imooc.com/class/chapter/169.html\n", 10 | "* 讲师:liuyubobobo\n", 11 | "\n", 12 | "## 第1章 欢迎来到 Python3 玩转机器学习\n", 13 | "\n", 14 | "### 1-1 什么是机器学习\n", 15 | "### 1-2 课程涵盖的内容和理念\n", 16 | "### 1-3 课程所使用的主要技术栈\n", 17 | "\n", 18 | "## 第2章 机器学习基础\n", 19 | "\n", 20 | "### 2-1 机器学习世界的数据\n", 21 | "### 2-2 机器学习的主要任务\n", 22 | "### 2-3 监督学习,非监督学习,半监督学习和增强学习\n", 23 | "### 2-4 批量学习,在线学习,参数学习和非参数学习\n", 24 | "### 2-5 和机器学习相关的“哲学”思考\n", 25 | "### 2-6 课程使用环境搭建\n", 26 | "\n", 27 | "## 第3章 Jupyter Notebook, numpy和matplotlib\n", 28 | "\n", 29 | "### 3-1 Jupyter Notebook基础\n", 30 | "### 3-2 Jupyter Notebook中的魔法命令\n", 31 | "### 3-3 Numpy数据基础\n", 32 | "### 3-4 创建Numpy数组(和矩阵)\n", 33 | "### 3-5 Numpy数组(和矩阵)的基本操作\n", 34 | "### 3-6 Numpy数组(和矩阵)的合并与分割\n", 35 | "### 3-7 Numpy中的矩阵运算\n", 36 | "### 3-8 Numpy中的聚合运算\n", 37 | "### 3-9 Numpy中的arg运算\n", 38 | "### 3-10 Numpy中的比较和Fancy Indexing\n", 39 | "### 3-11 Matplotlib数据可视化基础\n", 40 | "### 3-12 数据加载和简单的数据探索\n", 41 | "\n", 42 | "## 第4章 最基础的分类算法-k近邻算法 kNN\n", 43 | "\n", 44 | "### 4-1 k近邻算法基础\n", 45 | "### 4-2 scikit-learn中的机器学习算法封装\n", 46 | "### 4-3 训练数据集,测试数据集\n", 47 | "### 4-4 分类准确度\n", 48 | "### 4-5 超参数\n", 49 | "### 4-6 网格搜索与k近邻算法中更多超参数\n", 50 | "### 4-7 数据归一化\n", 51 | "### 4-8 scikit-learn中的Scaler\n", 52 | "### 4-9 更多有关k近邻算法的思考\n", 53 | "\n", 54 | "## 第5章 线性回归法\n", 55 | "\n", 56 | "### 5-1 简单线性回归\n", 57 | "### 5-2 最小二乘法\n", 58 | "### 5-3 简单线性回归的实现\n", 59 | "### 5-4 向量化\n", 60 | "### 5-5 衡量线性回归法的指标:MSE,RMSE和MAE\n", 61 | "### 5-6 最好的衡量线性回归法的指标:R Squared\n", 62 | "### 5-7 多元线性回归和正规方程解\n", 63 | "### 5-8 实现多元线性回归\n", 64 | "### 5-9 使用scikit-learn解决回归问题\n", 65 | "### 5-10 线性回归的可解释性和更多思考\n", 66 | "\n", 67 | "## 第6章 梯度下降法\n", 68 | "\n", 69 | "### 6-1 什么是梯度下降法\n", 70 | "### 6-2 模拟实现梯度下降法\n", 71 | "### 6-3 线性回归中的梯度下降法\n", 72 | "### 6-4 实现线性回归中的梯度下降法\n", 73 | "### 6-5 梯度下降法的向量化和数据标准化\n", 74 | "### 6-6 随机梯度下降法\n", 75 | "### 6-7 scikit-learn中的随机梯度下降法\n", 76 | "### 6-8 如何确定梯度计算的准确性?调试梯度下降法\n", 77 | "### 6-9 
有关梯度下降法的更多深入讨论\n", 78 | "\n", 79 | "## 第7章 PCA与梯度上升法\n", 80 | "\n", 81 | "### 7-1 什么是PCA\n", 82 | "### 7-2 使用梯度上升法求解PCA问题\n", 83 | "### 7-3 求数据的主成分PCA\n", 84 | "### 7-4 求数据的前n个主成分\n", 85 | "### 7-5 高维数据映射为低维数据\n", 86 | "### 7-6 scikit-learn中的PCA\n", 87 | "### 7-7 试手MNIST数据集\n", 88 | "### 7-8 使用PCA对数据进行降噪\n", 89 | "### 7-9 人脸识别与特征脸\n", 90 | "\n", 91 | "## 第8章 多项式回归与模型泛化\n", 92 | "\n", 93 | "### 8-1 什么是多项式回归\n", 94 | "### 8-2 scikit-learn中的多项式回归与Pipeline\n", 95 | "### 8-3 过拟合与欠拟合\n", 96 | "### 8-4 为什么要有训练数据集与测试数据集\n", 97 | "### 8-5 学习曲线\n", 98 | "### 8-6 验证数据集与交叉验证\n", 99 | "### 8-7 偏差方差平衡\n", 100 | "### 8-8 模型泛化与岭回归\n", 101 | "### 8-9 LASSO\n", 102 | "### 8-10 L1, L2和弹性网络\n", 103 | "\n", 104 | "## 第9章 逻辑回归\n", 105 | "\n", 106 | "### 9-1 什么是逻辑回归\n", 107 | "### 9-2 逻辑回归的损失函数\n", 108 | "### 9-3 逻辑回归损失函数的梯度\n", 109 | "### 9-4 实现逻辑回归算法\n", 110 | "### 9-5 决策边界\n", 111 | "### 9-6 在逻辑回归中使用多项式特征\n", 112 | "### 9-7 scikit-learn中的逻辑回归\n", 113 | "### 9-8 OvR与OvO\n", 114 | "\n", 115 | "## 第10章 评价分类结果\n", 116 | "\n", 117 | "### 10-1 准确度的陷阱和混淆矩阵\n", 118 | "### 10-2 精准率和召回率\n", 119 | "### 10-3 实现混淆矩阵,精准率和召回率\n", 120 | "### 10-4 F1 Score\n", 121 | "### 10-5 精准率和召回率的平衡\n", 122 | "### 10-6 精准率-召回率曲线\n", 123 | "### 10-7 ROC曲线\n", 124 | "### 10-8 多分类问题中的混淆矩阵\n", 125 | "\n", 126 | "## 第11章 支撑向量机 SVM\n", 127 | "\n", 128 | "### 11-1 什么是SVM\n", 129 | "### 11-2 SVM背后的最优化问题\n", 130 | "### 11-3 Soft Margin SVM\n", 131 | "### 11-4 scikit-learn中的SVM\n", 132 | "### 11-5 SVM中使用多项式特征和核函数\n", 133 | "### 11-6 到底什么是核函数\n", 134 | "### 11-7 RBF核函数\n", 135 | "### 11-8 RBF核函数中的gamma\n", 136 | "### 11-9 SVM思想解决回归问题\n", 137 | "\n", 138 | "## 第12章 决策树\n", 139 | "\n", 140 | "### 12-1 什么是决策树\n", 141 | "### 12-2 信息熵\n", 142 | "### 12-3 使用信息熵寻找最优划分\n", 143 | "### 12-4 基尼系数\n", 144 | "### 12-5 CART与决策树中的超参数\n", 145 | "### 12-6 决策树解决回归问题\n", 146 | "### 12-7 决策树的局限性\n", 147 | "\n", 148 | "## 第13章 集成学习和随机森林\n", 149 | "\n", 150 | "### 13-1 什么是集成学习\n", 151 | "### 13-2 Soft Voting Classifier\n", 152 | "### 13-3 Bagging 和 Pasting\n", 153 | "### 13-4 oob (Out-of-Bag) 和关于Bagging的更多讨论\n", 154 | "### 13-5 随机森林和 Extra-Trees\n", 155 | "### 13-6 Ada Boosting 和 Gradient Boosting\n", 156 | "### 13-7 Stacking\n", 157 | "\n", 158 | "## 第14章 更多机器学习算法\n", 159 | "\n", 160 | "### 14-1 学习scikit-learn文档, 大家加油!\n", 161 | "### 14-2 学习完这个课程以后怎样继续深入机器学习的学习?" 162 | ] 163 | } 164 | ], 165 | "metadata": { 166 | "kernelspec": { 167 | "display_name": "Python 3", 168 | "language": "python", 169 | "name": "python3" 170 | }, 171 | "language_info": { 172 | "codemirror_mode": { 173 | "name": "ipython", 174 | "version": 3 175 | }, 176 | "file_extension": ".py", 177 | "mimetype": "text/x-python", 178 | "name": "python", 179 | "nbconvert_exporter": "python", 180 | "pygments_lexer": "ipython3", 181 | "version": "3.7.1" 182 | } 183 | }, 184 | "nbformat": 4, 185 | "nbformat_minor": 2 186 | } 187 | -------------------------------------------------------------------------------- /线性回归/09. 
回归模型的评估和选择.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "在上一节中我们学到,为了避免模型在数据集上过度训练导致过拟合,通常将数据集划分为训练集和测试集。\n", 8 | "\n", 9 | "### 数据集划分方法\n", 10 | "\n", 11 | "假设我们有一个包含 m 个样例的数据集 $ D = \\{(\\bf{x_1}, y_1), (\\bf{x_2}, y_2), ..., (\\bf{x_m}, y_m)\\} $,得想办法对数据集进行划分,分成训练和测试两类,训练集记为 $ S $,测试集记为 $ T $。这样的划分方法有很多,常见的有:**留出法**(hold-out)、**交叉验证法**(cross validation)、**自助法**(bootstrapping)等。\n", 12 | "\n", 13 | "留出法非常简单,直接将数据集划分成两个互斥的集合,分别作为训练集和测试集,$ D = S \\cup T, S \\cap T = \\emptyset $,常见的做法是将大约 2/3 ~ 4/5 的数据用于训练,剩余的数据用于测试,注意划分的时候尽可能保持数据分布的一致性。\n", 14 | "\n", 15 | "交叉验证法先将数据集划分成 k 个大小相似的互斥子集,$ D = D_1 \\cup D_2 \\cup \\dots \\cup D_k, D_i \\cap D_j = \\emptyset(i \\ne j) $,同样划分时尽可能保持数据分布的一致性。然后,用其中的 k-1 个子集的并集作为训练集,剩下的那个子集作为测试集,这样可以进行 k 次训练和测试,最终计算这 k 次测试结果的均值。交叉验证法又叫 **k 折交叉验证**(k-fold cross validation)。\n", 16 | "\n", 17 | "自助法是以 **自助采样**(bootstrap sampling,也称为 **可重复采样**)为基础的样本划分方法,所谓自助采样,指的是从数据集 $ D $ 中随机挑选一个样本,将其拷贝放入一个新集合 $ D' $ 中,然后将样本重新放回集合 $ D $ 中,重复 m 次,得到一个包含 m 个样本的数据集 $ D' $。很显然,$ D $ 中有些样本在 $ D' $ 中会出现多次,还有一些样本不会出现,样本不会出现的概率是:\n", 18 | "\n", 19 | "$$\n", 20 | "\\lim\\limits_{m \\to +\\infty}(1-\\frac{1}{m})^m = \\frac{1}{e} \\approx 0.368\n", 21 | "$$\n", 22 | "\n", 23 | "我们将 $ D' $ 用作训练集,$ D' $ 中没有出现的样本 $ D \\backslash D'$ 用作测试集,这样的测试结果也叫做 **包外估计**(out-of-bag estimate)。" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "### 模型评估方法\n", 31 | "\n", 32 | "通过上面的方法,把样本数据分成训练集和测试集之后,就可以使用你的机器学习算法对训练集进行训练生成一个模型,然后再拿着这个模型在测试集上进行验证,评估这个模型的效果。\n", 33 | "\n", 34 | "其实,在前面介绍损失函数时提到了两种损失函数,一种是 **绝对损失(Absolute Loss)**,一种是 **平方损失(Squared Loss)**。\n", 35 | "\n", 36 | "绝对损失函数:$$ loss = |y - \\hat{y}| $$\n", 37 | "\n", 38 | "平方损失函数:$$ loss = (y - \\hat{y})^2 $$\n", 39 | "\n", 40 | "这实际上也是两种评估模型的指标,分别是:**平均绝对误差**(Mean Absolute Error,MAE) 和 **均方误差**(Mean Squared Error,MSE)。\n", 41 | "\n", 42 | "$$\n", 43 | "MAE = \\frac{1}{N} \\sum_{i=1}^{N} \\left| y_i - \\hat{y_i} \\right|\n", 44 | "$$\n", 45 | "\n", 46 | "$$\n", 47 | "MSE = \\frac{1}{N} \\sum_{i=1}^{N} (y_i - \\hat{y_i})^2\n", 48 | "$$\n", 49 | "\n", 50 | "除此之外,从 MAE 还可以演变成一个新的指标,**平均绝对百分误差**(Mean Absolute Percentage Error,MAPE):\n", 51 | "\n", 52 | "$$\n", 53 | "MAPE = \\frac{100}{N} \\sum_{i=1}^{N} \\left| \\frac{y_i - \\hat{y_i}}{y_i} \\right|, y_i \\ne 0\n", 54 | "$$\n", 55 | "\n", 56 | "MAPE 通过计算绝对误差的百分比来表示预测效果,其取值越小越好,如果 MAPE = 10,这表明预测平均偏离真实值 10%。\n", 57 | "\n", 58 | "从 MSE 也可以演变成一个新的指标,**均方根误差**(Root Mean Squared Error,RMSE),它就是对 MSE 求开方:\n", 59 | "\n", 60 | "$$\n", 61 | "RMSE = \\sqrt{\\frac{1}{N} \\sum_{i=1}^{N} (y_i - \\hat{y_i})^2} = \\sqrt{MSE}\n", 62 | "$$\n", 63 | "\n", 64 | "基于 RMSE 还有一个变种指标,叫**均方根对数误差**(Root Mean Squared Logarithmic Error,RMSLE),将上面公式中的 $y_i$ 和 $\\hat{y_i}$ 换成对数形式:\n", 65 | "\n", 66 | "$$\n", 67 | "RMSLE = \\sqrt{\\frac{1}{N} \\sum_{i=1}^{N} (log(y_i+1) - log(\\hat{y_i}+1))^2}\n", 68 | "$$\n", 69 | "\n", 70 | "最后,**$R^2$**(R-Square)也是一个常见的评估指标,它的公式如下:\n", 71 | "\n", 72 | "$$\n", 73 | "R^2 = 1 - \\frac{\\sum_{i=1}^{N} (y_i - \\hat{y_i})^2}{\\sum_{i=1}^{N} (y_i - \\bar{y_i})^2}\n", 74 | "$$\n", 75 | "\n", 76 | "$R^2$ 用于度量因变量的变异中可由自变量解释部分所占的比例,一般取值范围是 0~1,$R^2$ 越接近 1,表明回归平方和占总平方和的比例越大,回归线与各观测点越接近,用 x 的变化来解释 y 值变差的部分就越多,回归的拟合程度就越好。" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "### 使用 sklearn 评估模型\n", 84 | "\n", 85 | "sklearn 中的 metrics 模块提供了几个方法来计算模型评估的指标值,譬如 `mean_absolute_error()` 用于计算 
MAE,`mean_squared_error()` 用于计算 MSE,`r2_score()` 用于计算 $R^2$。" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 3, 91 | "metadata": {}, 92 | "outputs": [ 93 | { 94 | "name": "stdout", 95 | "output_type": "stream", 96 | "text": [ 97 | "MSE = 4.5\n", 98 | "MAE = 1.5\n", 99 | "R-Square = 0.9945454545454545\n" 100 | ] 101 | } 102 | ], 103 | "source": [ 104 | "import numpy as np\n", 105 | "from sklearn import metrics\n", 106 | "\n", 107 | "expected = np.array([10,20,30,40,50,60,70,80,90,100])\n", 108 | "predicted = np.array([10,21,33,41,49,55,70,82,90,102])\n", 109 | "\n", 110 | "print('MAE = ', metrics.mean_absolute_error(expected, predicted))\n", 111 | "print('MSE = ', metrics.mean_squared_error(expected, predicted))\n", 112 | "print('R-Square = ', metrics.r2_score(expected, predicted))" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "### 模型选择\n", 120 | "\n", 121 | "上面介绍了多种数据集划分的方法和模型评估的方法,看起来我们只要挑一个评估结果最好的模型就可以了,但是实际上,模型选择可能要比想象中复杂的多。这里面涉及几个重要因素:\n", 122 | "\n", 123 | "* 首先,我们希望比较的是泛化性能,而我们通过评估方法得到的是测试集上的性能,两者的对比结果未必相同;\n", 124 | "* 第二,测试集上的性能和测试集本身的选择有很大关系,使用大小不同的测试集,或者使用大小相同但数据不同的测试集,测试结果也会有不同;\n", 125 | "* 第三,很多机器学习算法本身具有一定的随机性,即便相同的参数在相同的测试集上运行多次,测试结果也可能不同;\n", 126 | "\n", 127 | "那么,有没有适当的方法对学习器的性能进行比较呢?统计学中的 **假设检验**(hypothesis test)为此提供了重要依据,根据假设检验,我们可以推断出,若在测试集上观察到学习器 A 比 B 好,则 A 的泛化性能是否在统计意义上优于 B,以及这个结论的把握有多大。\n", 128 | "\n", 129 | "**二项检验**(binomial test)和 **t 检验**(t-test)是两种最基本的假设检验方法,可以对单个学习器泛化性能的假设进行检验;如果需要对多个学习器性能进行比较,可以使用 **交叉验证 t 检验** 和 **McNemar 检验**,这两个方法都是在同一个数据集上进行检验,**Fredman 检验** 和 **Nemenyi 后续检验** 可以在一组数据集上对学习器的性能进行检验。" 130 | ] 131 | }, 132 | { 133 | "cell_type": "markdown", 134 | "metadata": {}, 135 | "source": [ 136 | "#### 参考\n", 137 | "\n", 138 | "* 机器学习, 周志华\n", 139 | "* [一份非常全面的机器学习分类与回归算法的评估指标汇总](https://juejin.im/post/5bbbc1d6f265da0af40726fb)\n", 140 | "\n", 141 | "#### TODO\n", 142 | "\n", 143 | "* 为什么需要这么多的评估指标?\n", 144 | "* 每一种评估指标的优缺点,举例说明\n", 145 | "* 还有没有其他的评估指标?\n", 146 | "* RMSE 代表的是预测值和真实值差值的样本标准差?" 147 | ] 148 | } 149 | ], 150 | "metadata": { 151 | "kernelspec": { 152 | "display_name": "Python 3", 153 | "language": "python", 154 | "name": "python3" 155 | }, 156 | "language_info": { 157 | "codemirror_mode": { 158 | "name": "ipython", 159 | "version": 3 160 | }, 161 | "file_extension": ".py", 162 | "mimetype": "text/x-python", 163 | "name": "python", 164 | "nbconvert_exporter": "python", 165 | "pygments_lexer": "ipython3", 166 | "version": "3.7.1" 167 | } 168 | }, 169 | "nbformat": 4, 170 | "nbformat_minor": 2 171 | } 172 | -------------------------------------------------------------------------------- /其他资料/书籍/02. 
机器学习.周志华.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## 第 1 章 绪论\n", 8 | "\n", 9 | "### 1.1 引言\n", 10 | "\n", 11 | "机器学习所研究的主要内容,是关于在计算机上从数据中产生 **模型**(model)的算法,即 **学习算法**(learning algorithm)。\n", 12 | "\n", 13 | "### 1.2 基本术语\n", 14 | "\n", 15 | "要进行机器学习,先要有数据,比如:\n", 16 | "\n", 17 | "|色泽|根蒂|敲声|\n", 18 | "|---|----|---|\n", 19 | "|青绿|蜷缩|浊响|\n", 20 | "|乌黑|稍蜷|沉闷|\n", 21 | "|浅白|硬挺|清脆|\n", 22 | "\n", 23 | "这组记录的集合称为 **数据集**(data set),其中的每条记录称为一个 **示例**(instance)或 **样本**(sample)。\n", 24 | "\n", 25 | "### 1.3 假设空间\n", 26 | "\n", 27 | "### 1.4 归纳偏好\n", 28 | "\n", 29 | "### 1.5 发展历程\n", 30 | "\n", 31 | "### 1.6 应用现状\n", 32 | "\n", 33 | "### 1.7 阅读材料\n", 34 | "\n", 35 | "## 第 2 章 模型评估与选择\n", 36 | "\n", 37 | "### 2.1 经验误差与过拟合\n", 38 | "\n", 39 | "### 2.2 评估方法\n", 40 | "\n", 41 | "### 2.3 性能度量\n", 42 | "\n", 43 | "### 2.4 比较检验\n", 44 | "\n", 45 | "### 2.5 偏差与方差\n", 46 | "\n", 47 | "### 2.6 阅读材料\n", 48 | "\n", 49 | "## 第 3 章 线性模型\n", 50 | "\n", 51 | "### 3.1 基本形式\n", 52 | "\n", 53 | "### 3.2 线性回归\n", 54 | "\n", 55 | "### 3.3 对数几率回归\n", 56 | "\n", 57 | "### 3.4 线性判别分析\n", 58 | "\n", 59 | "### 3.5 多分类学习\n", 60 | "\n", 61 | "### 3.6 类别不平衡问题\n", 62 | "\n", 63 | "### 3.7 阅读材料\n", 64 | "\n", 65 | "## 第 4 章 决策树\n", 66 | "\n", 67 | "### 4.1 基本流程\n", 68 | "\n", 69 | "### 4.2 划分选择\n", 70 | "\n", 71 | "### 4.3 剪枝处理\n", 72 | "\n", 73 | "### 4.4 连续与缺失值\n", 74 | "\n", 75 | "### 4.5 多变量决策树\n", 76 | "\n", 77 | "### 4.6 阅读材料\n", 78 | "\n", 79 | "## 第 5 章 神经网络\n", 80 | "\n", 81 | "### 5.1 神经元模型\n", 82 | "\n", 83 | "### 5.2 感知机与多层网络\n", 84 | "\n", 85 | "### 5.3 误差拟传播算法\n", 86 | "\n", 87 | "### 5.4 全局最小与局部最小\n", 88 | "\n", 89 | "### 5.5 其他常见神经网络\n", 90 | "\n", 91 | "### 5.6 阅读材料\n", 92 | "\n", 93 | "## 第 6 章 支持向量机\n", 94 | "\n", 95 | "### 6.1 间隔与支持向量\n", 96 | "\n", 97 | "### 6.2 对偶问题\n", 98 | "\n", 99 | "### 6.3 核函数\n", 100 | "\n", 101 | "### 6.4 软间隔与正则化\n", 102 | "\n", 103 | "### 6.5 支持向量回归\n", 104 | "\n", 105 | "### 6.6 核方法\n", 106 | "\n", 107 | "### 6.7 阅读材料\n", 108 | "\n", 109 | "## 第 7 章 贝叶斯分类器\n", 110 | "\n", 111 | "### 7.1 贝叶斯决策论\n", 112 | "\n", 113 | "### 7.2 极大似然估计\n", 114 | "\n", 115 | "### 7.3 朴素贝叶斯分类器\n", 116 | "\n", 117 | "### 7.4 半朴素贝叶斯分类器\n", 118 | "\n", 119 | "### 7.5 贝叶斯网\n", 120 | "\n", 121 | "### 7.6 EM 算法\n", 122 | "\n", 123 | "### 7.7 阅读材料\n", 124 | "\n", 125 | "## 第 8 章 集成学习\n", 126 | "\n", 127 | "### 8.1 个体与集成\n", 128 | "\n", 129 | "### 8.2 Boosting\n", 130 | "\n", 131 | "### 8.3 Bagging 与随机森林\n", 132 | "\n", 133 | "### 8.4 结合策略\n", 134 | "\n", 135 | "### 8.5 多样性\n", 136 | "\n", 137 | "### 8.6 阅读材料\n", 138 | "\n", 139 | "## 第 9 章 聚类\n", 140 | "\n", 141 | "### 9.1 聚类任务\n", 142 | "\n", 143 | "### 9.2 性能度量\n", 144 | "\n", 145 | "### 9.3 距离计算\n", 146 | "\n", 147 | "### 9.4 原型聚类\n", 148 | "\n", 149 | "### 9.5 密度聚类\n", 150 | "\n", 151 | "### 9.6 层次聚类\n", 152 | "\n", 153 | "### 9.7 阅读材料\n", 154 | "\n", 155 | "## 第 10 章 降维与度量学习\n", 156 | "\n", 157 | "### 10.1 k近邻学习\n", 158 | "\n", 159 | "### 10.2 低维嵌入\n", 160 | "\n", 161 | "### 10.3 主成分分析\n", 162 | "\n", 163 | "### 10.4 核化线性降维\n", 164 | "\n", 165 | "### 10.5 流形学习\n", 166 | "\n", 167 | "### 10.6 度量学习\n", 168 | "\n", 169 | "### 10.7 阅读材料\n", 170 | "\n", 171 | "## 第 11 章 特征选择与稀疏学习\n", 172 | "\n", 173 | "### 11.1 子集搜索与评价\n", 174 | "\n", 175 | "### 11.2 过滤式选择\n", 176 | "\n", 177 | "### 11.3 包裹式选择\n", 178 | "\n", 179 | "### 11.4 嵌入式选择与 $L_1$ 正则化\n", 180 | "\n", 181 | "### 11.5 稀疏表示与字典学习\n", 182 | "\n", 183 | "### 11.6 
压缩感知\n", 184 | "\n", 185 | "### 11.7 阅读材料\n", 186 | "\n", 187 | "## 第 12 章 计算学习理论\n", 188 | "\n", 189 | "### 12.1 基础知识\n", 190 | "\n", 191 | "### 12.2 PAC 学习\n", 192 | "\n", 193 | "### 12.3 有限假设空间\n", 194 | "\n", 195 | "### 12.4 VC 维\n", 196 | "\n", 197 | "### 12.5 Rademacher 复杂度\n", 198 | "\n", 199 | "### 12.6 稳定性\n", 200 | "\n", 201 | "### 12.7 阅读材料\n", 202 | "\n", 203 | "## 第 13 章 半监督学习\n", 204 | "\n", 205 | "### 13.1 未标记样本\n", 206 | "\n", 207 | "### 13.2 生成式方法\n", 208 | "\n", 209 | "### 13.3 半监督 SVM\n", 210 | "\n", 211 | "### 13.4 图半监督学习\n", 212 | "\n", 213 | "### 13.5 基于分歧的方法\n", 214 | "\n", 215 | "### 13.6 半监督聚类\n", 216 | "\n", 217 | "### 13.7 阅读材料\n", 218 | "\n", 219 | "## 第 14 章 概率图模型\n", 220 | "\n", 221 | "### 14.1 隐马尔可夫模型\n", 222 | "\n", 223 | "### 14.2 马尔可夫随机场\n", 224 | "\n", 225 | "### 14.3 条件随机场\n", 226 | "\n", 227 | "### 14.4 学习与推断\n", 228 | "\n", 229 | "### 14.5 近似推断\n", 230 | "\n", 231 | "### 14.6 话题模型\n", 232 | "\n", 233 | "### 14.7 阅读材料\n", 234 | "\n", 235 | "## 第 15 章 规则学习\n", 236 | "\n", 237 | "### 15.1 基本概念\n", 238 | "\n", 239 | "### 15.2 序贯覆盖\n", 240 | "\n", 241 | "### 15.3 剪枝优化\n", 242 | "\n", 243 | "### 15.4 一阶规则学习\n", 244 | "\n", 245 | "### 15.5 归纳逻辑程序设计\n", 246 | "\n", 247 | "### 15.6 阅读材料\n", 248 | "\n", 249 | "## 第 16 章 强化学习\n", 250 | "\n", 251 | "### 16.1 任务与奖赏\n", 252 | "\n", 253 | "### 16.2 K-摇臂赌博机\n", 254 | "\n", 255 | "### 16.3 有模型学习\n", 256 | "\n", 257 | "### 16.4 免模型学习\n", 258 | "\n", 259 | "### 16.5 值函数近似\n", 260 | "\n", 261 | "### 16.6 模仿学习\n", 262 | "\n", 263 | "### 16.7 阅读材料\n", 264 | "\n", 265 | "## 附录\n", 266 | "\n", 267 | "### 矩阵\n", 268 | "\n", 269 | "### 优化\n", 270 | "\n", 271 | "### 概率分布" 272 | ] 273 | } 274 | ], 275 | "metadata": { 276 | "kernelspec": { 277 | "display_name": "Python 3", 278 | "language": "python", 279 | "name": "python3" 280 | }, 281 | "language_info": { 282 | "codemirror_mode": { 283 | "name": "ipython", 284 | "version": 3 285 | }, 286 | "file_extension": ".py", 287 | "mimetype": "text/x-python", 288 | "name": "python", 289 | "nbconvert_exporter": "python", 290 | "pygments_lexer": "ipython3", 291 | "version": "3.7.1" 292 | } 293 | }, 294 | "nbformat": 4, 295 | "nbformat_minor": 2 296 | } 297 | -------------------------------------------------------------------------------- /贝叶斯/02. 
朴素贝叶斯分类器.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### 贝叶斯分类器\n", 8 | "\n", 9 | "在前面的例子中,我们可以把所有的样本以表格的形式列出来:\n", 10 | "\n", 11 | "|箱子|球的类别|\n", 12 | "|----|------|\n", 13 | "|A|好|\n", 14 | "|A|坏|\n", 15 | "|A|好|\n", 16 | "|A|好|\n", 17 | "|A|好|\n", 18 | "|B|好|\n", 19 | "|B|坏|\n", 20 | "|B|坏|\n", 21 | "\n", 22 | "每一个样本都由两部分组成:特征和类别。可以看出这里只有一个特征(箱子编号),而且是二分类问题(好或坏)。\n", 23 | "\n", 24 | "所以贝叶斯定理天生具有分类功能,我们不妨推广到多特征多分类的情况,譬如把 $A_i$ 当作类标记,也就是说样本空间被划分成 $A_1, A_2, ..., A_n$,可以理解为样本空间可以分成 n 类,我们可以把 $B$ 当作样本数据的特征向量。类标记集合 $\\mathcal{Y} = \\{c_1, c_2, \\dots, c_K\\}$,输入为特征向量 $x$,输出为类标记 $y$,那么贝叶斯公式可以改成下面的形式:\n", 25 | "\n", 26 | "$$\n", 27 | "P(Y=c_k|X=x) = \\frac{P(X=x|Y=c_k)P(Y=c_k)}{\\sum_{i=1}^n P(X=x|Y=c_k)P(Y=c_k)}\n", 28 | "$$\n", 29 | "\n", 30 | "通过这个公式,可以计算出输入特征 $X$ 属于类别 $c_k$ 的概率,计算所有类别的概率,看看哪个类别的概率最大,就把输入特征归到这个类别,这就是**贝叶斯分类器**(Bayes classifier)的基本原理。\n", 31 | "\n", 32 | "### 朴素贝叶斯分类器\n", 33 | "\n", 34 | "我们再来看一个例子。假设某个医院早上来了8个门诊病人,如下表:\n", 35 | "\n", 36 | "|症状|职业|疾病|\n", 37 | "|---|----|----|\n", 38 | "|打喷嚏|护士|感冒|\n", 39 | "|打喷嚏|农夫|过敏|\n", 40 | "|头痛|建筑工人|脑震荡|\n", 41 | "|头痛|建筑工人|感冒|\n", 42 | "|打喷嚏|建筑工人|过敏|\n", 43 | "|打喷嚏|教师|感冒|\n", 44 | "|头痛|教师|脑震荡|\n", 45 | "|打喷嚏|教师|过敏|\n", 46 | "\n", 47 | "现在来了第9个病人,是一个打喷嚏的建筑工人,那么他最可能得的疾病是什么?\n", 48 | "\n", 49 | "很显然,从上表中的样本数据可以知道,这里的特征有两个(症状和职业),可能的疾病有:感冒、过敏、脑震荡,是个三分类问题。要想预测这个建筑工人的疾病,实际上就是求下面的三个条件概率,然后取概率值最大的那种情况:\n", 50 | "\n", 51 | "* $P(cold|sneeze \\cap builder)$\n", 52 | "* $P(allergy|sneeze \\cap builder)$\n", 53 | "* $P(concussion|sneeze \\cap builder)$\n", 54 | "\n", 55 | "根据贝叶斯定理,我们有:\n", 56 | "\n", 57 | "$$\n", 58 | "\\begin{align}\n", 59 | "P(cold|sneeze \\cap builder) = \\frac{P(sneeze \\cap builder|cold)P(cold)}{P(sneeze \\cap builder)}\n", 60 | "\\end{align}\n", 61 | "$$\n", 62 | "\n", 63 | "这里的 $P(sneeze \\cap builder|cold)$ 是个联合概率,当特征数非常多时,联合概率非常难求,所以我们在这里做了一个大胆的假设:**所有的特征是彼此独立的**。所以:\n", 64 | "\n", 65 | "$$\n", 66 | "\\begin{align}\n", 67 | "P(cold|sneeze \\cap builder) &= \\frac{P(sneeze \\cap builder|cold)P(cold)}{P(sneeze \\cap builder)} \\\\\n", 68 | "&= \\frac{P(sneeze|cold)P(builder|cold)P(cold)}{P(sneeze)P(builder)}\n", 69 | "\\end{align}\n", 70 | "$$\n", 71 | "\n", 72 | "根据这个假设得到的分类器,我们称之为**朴素贝叶斯分类器**(naive Bayes classifier)。英文 naive 的意思是天真的幼稚的,不过,尽管这个假设非常幼稚,但它在很多分类领域发挥着重要的作用。\n", 73 | "\n", 74 | "根据上面的表格,我们有:\n", 75 | "\n", 76 | "$$\n", 77 | "\\begin{align}\n", 78 | "P(cold) &= \\frac{3}{8} \\\\\n", 79 | "P(sneeze) &= \\frac{5}{8} \\\\\n", 80 | "P(builder) &= \\frac{3}{8} \\\\\n", 81 | "P(sneeze|cold) &= \\frac{2}{3} \\\\\n", 82 | "P(builder|cold) &= \\frac{1}{3}\n", 83 | "\\end{align}\n", 84 | "$$\n", 85 | "\n", 86 | "所以求得:\n", 87 | "\n", 88 | "$$\n", 89 | "P(cold|sneeze \\cap builder) = \\frac{\\frac{2}{3} \\times \\frac{1}{3} \\times \\frac{3}{8}}{\\frac{5}{8} \\times \\frac{3}{8}} = \\frac{16}{45}\n", 90 | "$$\n", 91 | "\n", 92 | "同理:\n", 93 | "\n", 94 | "$$\n", 95 | "P(allergy|sneeze \\cap builder) = \\frac{\\frac{3}{3} \\times \\frac{1}{3} \\times \\frac{3}{8}}{\\frac{5}{8} \\times \\frac{3}{8}} = \\frac{24}{45} \\\\\n", 96 | "P(concussion|sneeze \\cap builder) = \\frac{\\frac{0}{2} \\times \\frac{1}{2} \\times \\frac{2}{8}}{\\frac{5}{8} \\times \\frac{3}{8}} = 0\n", 97 | "$$\n", 98 | "\n", 99 | "可以推断出,这个建筑工人得过敏的可能性最大。\n", 100 | "\n", 101 | "在上面的计算过程中,三个概率的分母都是 $P(sneeze \\cap builder)$,而我们最后是要比较这三个概率的大小,所以这个值实际上可以不用算,这个值有时候又被为 **证据因子**(evidence)。\n", 102 | "\n", 103 | 
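为了验证上面的手工计算,下面给出一个最小的 Python 演示(仅为示意,假设直接用上表的 8 个样本按频率做极大似然估计,并像上文一样用 $P(sneeze)P(builder)$ 作为证据因子):

```python
from collections import Counter

# 样本数据:(症状, 职业, 疾病),与上表一致
samples = [
    ('打喷嚏', '护士', '感冒'),
    ('打喷嚏', '农夫', '过敏'),
    ('头痛', '建筑工人', '脑震荡'),
    ('头痛', '建筑工人', '感冒'),
    ('打喷嚏', '建筑工人', '过敏'),
    ('打喷嚏', '教师', '感冒'),
    ('头痛', '教师', '脑震荡'),
    ('打喷嚏', '教师', '过敏'),
]

n = len(samples)
disease_count = Counter(d for _, _, d in samples)

def p_cond(idx, value, disease):
    """条件概率 P(特征=value | 疾病=disease),按频率(极大似然)估计"""
    return sum(1 for s in samples if s[idx] == value and s[2] == disease) / disease_count[disease]

# 证据因子:按朴素假设拆成 P(打喷嚏) * P(建筑工人)
p_sneeze = sum(1 for s in samples if s[0] == '打喷嚏') / n
p_builder = sum(1 for s in samples if s[1] == '建筑工人') / n
evidence = p_sneeze * p_builder

for disease, count in disease_count.items():
    prior = count / n                     # 先验概率 P(疾病)
    likelihood = p_cond(0, '打喷嚏', disease) * p_cond(1, '建筑工人', disease)
    print(disease, likelihood * prior / evidence)
```

输出的三个概率约为 0.356(感冒)、0.533(过敏)和 0(脑震荡),与上面推算的 16/45、24/45 和 0 一致,同样可以推断这个建筑工人最可能是过敏。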
"最后,我们总结一下贝叶斯分类的公式:\n", 104 | "\n", 105 | "$$\n", 106 | "\\begin{align}\n", 107 | "y = f(x) &= \\mathop{\\arg\\max}_{c_k} P(Y=c_k|X=x) \\\\\n", 108 | "&= \\mathop{\\arg\\max}_{c_k} P(Y=c_k)P(X=x|Y=c_k) \\\\\n", 109 | "&= \\mathop{\\arg\\max}_{c_k} P(Y=c_k) \\prod_i P(X=x^{(i)}|Y=c_k)\n", 110 | "\\end{align}\n", 111 | "$$\n", 112 | "\n", 113 | "朴素贝叶斯法通过训练数据集学习到联合概率分布 $P(X,Y)$,具体的,它是先验概率和条件概率的乘积,实际上,联合概率分布代表的是生成数据的机制,所以贝叶斯模型被称为**生成式模型**。使用朴素贝叶斯法分类时,对给定的输入 $x$,通过学习到的模型计算后验概率,将后验概率最大的类别作为 $x$ 的类输出。不过上式中的连乘操作很容易造成下溢,可以对其求对数来规避这个问题,求对数后的似然叫做**对数似然**(log-likelihood):\n", 114 | "\n", 115 | "$$\n", 116 | "\\log \\prod_i P(X=x^{(i)}|Y=c_k) = \\sum_i \\log P(X=x^{(i)}|Y=c_k)\n", 117 | "$$\n", 118 | "\n", 119 | "为什么朴素贝叶斯法将实例分到后验概率最大的类中呢?因为这可以使得**期望风险**最小。(证明过程参见李航的《统计学习方法》4.1.2节)\n", 120 | "\n", 121 | "为了求先验概率 $P(Y=c_k)$ 和 条件概率 $P(X=x|Y=c_k)$,一般使用**极大似然估计**法。\n", 122 | "\n", 123 | "先验概率 $P(Y=c_k)$ 的极大似然估计是:\n", 124 | "\n", 125 | "$$\n", 126 | "P(Y=c_k) = \\frac{\\sum_{i=1}^{N}I(y_i=c_k)}{N}\n", 127 | "$$\n", 128 | "\n", 129 | "其中,$I(x)$ 称为**指示函数**(indicator function),当括号中的条件满足时函数值为 1,否则函数值为 0,相当于计数。条件概率 $P(X=x|Y=c_k)$ 的极大似然估计是:\n", 130 | "\n", 131 | "$$\n", 132 | "P(X=x|Y=c_k) = \\frac{\\sum_{i=1}^{N}I(x_i=x,y_i=c_k)}{\\sum_{i=1}^{N}I(y_i=c_k)}\n", 133 | "$$\n", 134 | "\n", 135 | "### 贝叶斯估计\n", 136 | "\n", 137 | "在上面的计算中,我们发现 $P(concussion|sneeze \\cap builder) = 0$,这是因为用极大似然法来估计条件概率时出现了概率值为 0 的情况,为了避免这种情况,一般推荐采用**贝叶斯估计**(Bayesian estimation),先验概率的贝叶斯估计为:\n", 138 | "\n", 139 | "$$\n", 140 | "P_{\\lambda}(Y=c_k) = \\frac{\\sum_{i=1}^{N}I(y_i=c_k) + \\lambda}{N+K\\lambda}\n", 141 | "$$\n", 142 | "\n", 143 | "其中,$K$ 表示类别的个数。同样的,条件概率的贝叶斯估计为:\n", 144 | "\n", 145 | "$$\n", 146 | "P_{\\lambda}(X=x|Y=c_k) = \\frac{\\sum_{i=1}^{N}I(x_i=x,y_i=c_k) + \\lambda}{\\sum_{i=1}^{N}I(y_i=c_k) + S\\lambda}\n", 147 | "$$\n", 148 | "\n", 149 | "当 $\\lambda = 0$ 时,就是极大似然估计,通常取 $\\lambda = 1$,这时称为 **拉普拉斯平滑**(Laplace smoothing)。\n", 150 | "\n", 151 | "### 连续属性的处理\n", 152 | "\n", 153 | "上面例子中的样本特征都是离散值,当特征是连续值时,我们可以用概率密度函数 $p(x|c)$ 替换上面的概率 $P(x|c)$。假设概率密度函数服从正态分布,即 $p(x|c) \\sim \\mathcal{N}(\\mu_c, \\delta^2_c)$,其中 $\\mu_c$ 和 $\\delta^2_c$ 分别是 c 类样本在某个属性上取值的均值和方差,则有:\n", 154 | "\n", 155 | "$$\n", 156 | "p(x|c) = \\frac{1}{\\sqrt{2\\pi}\\delta_c} \\exp \\lgroup -\\frac{(x-\\mu_c)^2}{2\\delta_c^2} \\rgroup\n", 157 | "$$" 158 | ] 159 | } 160 | ], 161 | "metadata": { 162 | "kernelspec": { 163 | "display_name": "Python 3", 164 | "language": "python", 165 | "name": "python3" 166 | }, 167 | "language_info": { 168 | "codemirror_mode": { 169 | "name": "ipython", 170 | "version": 3 171 | }, 172 | "file_extension": ".py", 173 | "mimetype": "text/x-python", 174 | "name": "python", 175 | "nbconvert_exporter": "python", 176 | "pygments_lexer": "ipython3", 177 | "version": "3.7.1" 178 | } 179 | }, 180 | "nbformat": 4, 181 | "nbformat_minor": 2 182 | } 183 | -------------------------------------------------------------------------------- /逻辑回归/03. 
逻辑回归和一元分类问题的求解.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "上一节我们看到可以使用逻辑回归来求解分类问题:\n", 8 | "\n", 9 | "$$\n", 10 | "ln \\frac{y}{1-y} = wx+b\n", 11 | "$$\n", 12 | "\n", 13 | "但是这里的 w 和 b 要怎么计算呢?前面在求解线性回归时,我们首先确定模型的损失函数,然后通过计算损失函数的最小值从而得到模型的各个参数。那么逻辑回归的损失函数该如何确定呢?\n", 14 | "\n", 15 | "如果按照最小二乘的求解思路,这里的 $(x, ln \\frac{y}{1-y})$ 呈线性关系,可以定义出损失函数:\n", 16 | "\n", 17 | "$$\n", 18 | "loss = \\sum_{i=1}^{n}(ln \\frac{y_i}{1-y_i} - (wx_i + b))^2\n", 19 | "$$\n", 20 | "\n", 21 | "但是很显然,$y_i$ 的取值不是 0 就是 1,代入上面的损失函数要么分母是 0 要么分子是 0,根本没法求解。回忆前面一节 S 型函数的拟合曲线,实际上,它的值域范围是 $(0, 1)$,并不是等于 0 或 1,只不过取值非常接近 0 或 1,我们不妨把这个值理解为可能性。\n", 22 | "\n", 23 | "若将 y 视为样本 x 作为正例的可能性,则 1-y 则是 x 作为反例的可能性,两者的比值 $\\frac{y}{1-y}$ 称为 **几率**(odds),反映了 x 作为正例的相对可能性。对几率取对数 $ln \\frac{y}{1-y}$ 就是 **对数几率**(log odds,亦称 logit),这也是对数几率函数名称的由来,而使用对数几率函数拟合分类数据的这种回归方法被称为 **对数几率回归**(logistic regression),可能是中文比较拗口,很多地方喜欢把它称为 **逻辑回归**(logistic regression, 或者 **对数回归**)。不过要注意的是,虽然名字中有回归两个字,但它解决的是分类问题,它不仅能够预测“类别”,还能给出近似的概率预测。\n", 24 | "\n", 25 | "这样我们就可以把 $y$ 改写成 $p(y = 1 | x)$,这是一个条件概率,表示当 $x$ 发生时 $y = 1$ 的概率,这个概率也被称为 **后验概率**,同理,我们把 $1-y$ 改写成 $p(y = 0 | x)$ 表示当 $x$ 发生时 $y = 0$ 的概率。所以有:\n", 26 | "\n", 27 | "$$\n", 28 | "ln \\frac{p(y = 1 | x)}{p(y = 0 | x)} = wx+b\n", 29 | "$$\n", 30 | "\n", 31 | "并把分类问题改写成下面的条件概率形式:\n", 32 | "\n", 33 | "$$\n", 34 | "p(y = 1 | x) = \\frac{e^{wx+b}}{1+e^{wx+b}}\n", 35 | "$$\n", 36 | "\n", 37 | "$$\n", 38 | "p(y = 0 | x) = \\frac{1}{1+e^{wx+b}}\n", 39 | "$$\n", 40 | "\n", 41 | "这就是逻辑回归模型。" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "### 极大似然法解逻辑回归\n", 49 | "\n", 50 | "上面的逻辑回归模型和线性回归模型很不一样,没办法再用最小二乘来进行参数估计了。在最小二乘中,我们有每个样本的实际值和它对应的预测值,我们可以计算出两者的平方损失,然后累加起来,最小二乘的目标是使得累加起来的损失最小。而在这里,我们有每个样本的实际值和它对应的预测值的概率,如果我们的模型能让每个样本都正确分类,这样的模型损失是最小的,要实现这一点,也就是说我们要找一个模型让每个样本属于其真实分类的概率越大越好,这种方法被称为 **极大似然法**(maximum likelihood method)。\n", 51 | "\n", 52 | "$$\n", 53 | "\\mathop{\\arg \\max}_{w,b} \\prod_{i=1}^n p(y_i | x_i; w,b)\n", 54 | "$$\n", 55 | "\n", 56 | "但是这里的连乘结果很可能导致下溢,通常使用**对数似然**(log-likelihood)来代替,这样就可以将连乘转换成连加:\n", 57 | "\n", 58 | "$$\n", 59 | "\\mathop{\\arg \\max}_{w,b} \\sum_{i=1}^n \\ln p(y_i | x_i; w,b)\n", 60 | "$$\n", 61 | "\n", 62 | "其中 $p(y | x)$ 包含了 $y = 0$ 和 $y = 1$ 两种情况,这里有一个小技巧可以把两种情况写成一个表达式:\n", 63 | "\n", 64 | "$$\n", 65 | "p(y|x) = p(y=1|x)^y p(y=0|x)^{(1-y)}\n", 66 | "$$\n", 67 | "\n", 68 | "对其求对数有:\n", 69 | "\n", 70 | "$$\n", 71 | "\\ln p(y|x) = y \\ln p(y=1|x) + (1-y) \\ln p(y=0|x)\n", 72 | "$$\n", 73 | "\n", 74 | "令:$p(y=1|x) = \\pi(x)$,那么有 $p(y=0|x) = 1 - \\pi(x)$,所以有:\n", 75 | "\n", 76 | "$$\n", 77 | "\\begin{align}\n", 78 | "\\ln p(y|x) &= y \\ln \\pi(x) + (1-y) \\ln (1-\\pi(x)) \\\\\n", 79 | "&= y \\ln \\pi(x) + \\ln (1-\\pi(x)) - y\\ln(1-\\pi(x)) \\\\\n", 80 | "&= y \\ln \\frac{\\pi(x)}{1-\\pi(x)} + \\ln (1-\\pi(x)) \\\\\n", 81 | "&= y \\ln e^{wx+b} + \\ln \\frac{1}{1+e^{wx+b}} \\\\\n", 82 | "&= y(wx+b) - \\ln (1+e^{wx+b})\n", 83 | "\\end{align}\n", 84 | "$$\n", 85 | "\n", 86 | "于是我们得到了逻辑回归的损失函数:\n", 87 | "\n", 88 | "$$\n", 89 | "\\begin{align}\n", 90 | "loss \n", 91 | "&= \\sum_{i=1}^n \\ln p(y_i|x_i) \\\\\n", 92 | "&= \\sum_{i=1}^n y_i \\ln \\pi(x_i) + (1-y_i) \\ln (1-\\pi(x_i)) \\\\\n", 93 | "&= \\sum_{i=1}^n [ y_i(wx_i+b) - \\ln (1+e^{wx_i+b}) ] \\\\\n", 94 | "\\end{align}\n", 95 | "$$\n", 96 | "\n", 97 | "我们要求的就是上式最大值时的 w 和 b 参数,一般情况下,我们都是求损失函数的最小值,所以我们在上式中加一个负号。\n", 98 | "\n", 99 | "$$\n", 100 | 
"\\mathop{\\arg \\min}_{w,b} \\sum_{i=1}^n [ -y_i(wx_i+b) + \\ln (1+e^{wx_i+b}) ]\n", 101 | "$$" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "### 梯度下降法\n", 109 | "\n", 110 | "逻辑回归的损失函数不能像线性回归那样求解析解,一般我们通过梯度下降法或牛顿法等优化算法来求解。这里我们使用梯度下降法,首先我们需要求出上面的损失函数的梯度,也就是分别对 w 和 b 求偏导。\n", 111 | "\n", 112 | "这里要运用到微分法则中很重要的一条:**链式法则**,即:\n", 113 | "\n", 114 | "$$\n", 115 | "\\frac{\\partial y}{\\partial x} = \\frac{\\partial y}{\\partial z} \\frac{\\partial z}{\\partial x}\n", 116 | "$$\n", 117 | "\n", 118 | "我们令:\n", 119 | "\n", 120 | "$$\n", 121 | "z = \\ln p = \\ln (1+e^{wx_i+b})\n", 122 | "$$\n", 123 | "\n", 124 | "则:\n", 125 | "\n", 126 | "$$\n", 127 | "\\frac{\\partial z}{\\partial w} = \\frac{\\partial z}{\\partial p} \\frac{\\partial p}{\\partial w} = \\frac{1}{p} e^{wx_i+b} x_i = \\frac{e^{wx_i+b}}{1+e^{wx_i+b}}x_i\n", 128 | "$$\n", 129 | "\n", 130 | "$$\n", 131 | "\\frac{\\partial z}{\\partial b} = \\frac{\\partial z}{\\partial p} \\frac{\\partial p}{\\partial b} = \\frac{1}{p} e^{wx_i+b} = \\frac{e^{wx_i+b}}{1+e^{wx_i+b}}\n", 132 | "$$\n", 133 | "\n", 134 | "再令:\n", 135 | "\n", 136 | "$$\n", 137 | "z = -y_i(wx_i+b)\n", 138 | "$$\n", 139 | "\n", 140 | "则:\n", 141 | "\n", 142 | "$$\n", 143 | "\\frac{\\partial z}{\\partial w} = -y_ix_i\n", 144 | "$$\n", 145 | "\n", 146 | "$$\n", 147 | "\\frac{\\partial z}{\\partial b} = -y_i\n", 148 | "$$\n", 149 | "\n", 150 | "于是有:\n", 151 | "\n", 152 | "$$\n", 153 | "\\begin{align}\n", 154 | "\\frac{\\partial loss}{\\partial w} &= \\sum_{i=1}^n (\\frac{e^{wx_i+b}}{1+e^{wx_i+b}}x_i -y_ix_i) \\\\\n", 155 | "&= \\sum_{i=1}^n (\\frac{e^{wx_i+b}}{1+e^{wx_i+b}} -y_i)x_i \\\\\n", 156 | "&= \\sum_{i=1}^n (\\pi(x_i) -y_i)x_i\n", 157 | "\\end{align}\n", 158 | "$$\n", 159 | "\n", 160 | "$$\n", 161 | "\\begin{align}\n", 162 | "\\frac{\\partial loss}{\\partial b} &= \\sum_{i=1}^n (\\frac{e^{wx_i+b}}{1+e^{wx_i+b}} -y_i) \\\\\n", 163 | "&= \\sum_{i=1}^n (\\pi(x_i) -y_i)\n", 164 | "\\end{align}\n", 165 | "$$\n", 166 | "\n", 167 | "最后得到参数 w 和 b 的更新公式:\n", 168 | "\n", 169 | "$$\n", 170 | "w := w - \\eta \\frac{\\partial loss}{\\partial w} = w + \\eta \\sum_{i=1}^n (y_i - \\pi(x_i))x_i\n", 171 | "$$\n", 172 | "\n", 173 | "$$\n", 174 | "b := b - \\eta \\frac{\\partial loss}{\\partial b} = b + \\eta \\sum_{i=1}^n (y_i - \\pi(x_i))\n", 175 | "$$" 176 | ] 177 | } 178 | ], 179 | "metadata": { 180 | "kernelspec": { 181 | "display_name": "Python 3", 182 | "language": "python", 183 | "name": "python3" 184 | }, 185 | "language_info": { 186 | "codemirror_mode": { 187 | "name": "ipython", 188 | "version": 3 189 | }, 190 | "file_extension": ".py", 191 | "mimetype": "text/x-python", 192 | "name": "python", 193 | "nbconvert_exporter": "python", 194 | "pygments_lexer": "ipython3", 195 | "version": "3.7.1" 196 | } 197 | }, 198 | "nbformat": 4, 199 | "nbformat_minor": 2 200 | } 201 | -------------------------------------------------------------------------------- /其他资料/课程/01. Machine Learning. Coursera. 
Andrew Ng.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "* 课程名称:Machine Learning\n", 8 | "* 课程来源:https://www.coursera.org/\n", 9 | "* 课程地址:https://www.coursera.org/learn/machine-learning\n", 10 | "* 讲师:Andrew Ng\n", 11 | "\n", 12 | "## Week 1\n", 13 | "\n", 14 | "### Introduction\n", 15 | "\n", 16 | "#### Welcome to Machine Learning!\n", 17 | "#### Welcome\n", 18 | "#### What is Machine Learning?\n", 19 | "#### Supervised Learning\n", 20 | "#### Unsupervised Learning\n", 21 | "\n", 22 | "### Linear Regression with One Variable\n", 23 | "\n", 24 | "#### Model Representation\n", 25 | "#### Cost Function\n", 26 | "#### Cost Function - Intuition I\n", 27 | "#### Cost Function - Intuition II\n", 28 | "#### Gradient Descent\n", 29 | "#### Gradient Descent Intuition\n", 30 | "#### Gradient Descent For Linear Regression\n", 31 | "\n", 32 | "### Linear Algebra Review\n", 33 | "\n", 34 | "#### Matrices and Vectors\n", 35 | "#### Addition and Scalar Multiplication\n", 36 | "#### Matrix Vector Multiplication\n", 37 | "#### Matrix Matrix Multiplication\n", 38 | "#### Matrix Multiplication Properties\n", 39 | "#### Inverse and Transpose\n", 40 | "\n", 41 | "## Week 2\n", 42 | "\n", 43 | "### Linear Regression with Multiple Variables\n", 44 | "\n", 45 | "#### Multiple Features\n", 46 | "#### Gradient Descent for Multiple Variables\n", 47 | "#### Gradient Descent in Practice I - Feature Scaling\n", 48 | "#### Gradient Descent in Practice II - Learning Rate\n", 49 | "#### Features and Polynomial Regression\n", 50 | "#### Normal Equation\n", 51 | "#### Normal Equation Noninvertibility\n", 52 | "#### Working on and Submitting Programming Assignments\n", 53 | "\n", 54 | "### Octave/Matlab Tutorial\n", 55 | "\n", 56 | "#### Basic Operations\n", 57 | "#### Moving Data Around\n", 58 | "#### Computing on Data\n", 59 | "#### Plotting Data\n", 60 | "#### Control Statements: for, while, if statement\n", 61 | "#### Vectorization\n", 62 | "\n", 63 | "## Week 3\n", 64 | "\n", 65 | "### Logistic Regression\n", 66 | "\n", 67 | "#### Classification\n", 68 | "#### Hypothesis Representation\n", 69 | "#### Decision Boundary\n", 70 | "#### Cost Function\n", 71 | "#### Simplified Cost Function and Gradient Descent\n", 72 | "#### Advanced Optimization\n", 73 | "#### Multiclass Classification: One-vs-all\n", 74 | "\n", 75 | "### Regularization\n", 76 | "\n", 77 | "#### The Problem of Overfitting\n", 78 | "#### Cost Function\n", 79 | "#### Regularized Linear Regression\n", 80 | "#### Regularized Logistic Regression\n", 81 | "\n", 82 | "## Week 4\n", 83 | "\n", 84 | "### Neural Networks: Representation\n", 85 | "\n", 86 | "#### Non-linear Hypotheses\n", 87 | "#### Neurons and the Brain\n", 88 | "#### Model Representation I\n", 89 | "#### Model Representation II\n", 90 | "#### Examples and Intuitions I\n", 91 | "#### Examples and Intuitions II\n", 92 | "#### Multiclass Classification\n", 93 | "\n", 94 | "## Week 5\n", 95 | "\n", 96 | "### Neural Networks: Learning\n", 97 | "\n", 98 | "#### Cost Function\n", 99 | "#### Backpropagation Algorithm\n", 100 | "#### Backpropagation Intuition\n", 101 | "#### Implementation Note: Unrolling Parameters\n", 102 | "#### Gradient Checking\n", 103 | "#### Random Initialization\n", 104 | "#### Putting It Together\n", 105 | "#### Autonomous Driving\n", 106 | "\n", 107 | "## Week 6\n", 108 | "\n", 109 | "### Advice for Applying Machine Learning\n", 110 | "\n", 111 
| "#### Deciding What to Try Next\n", 112 | "#### Evaluating a Hypothesis\n", 113 | "#### Model Selection and Train/Validation/Test Sets\n", 114 | "#### Diagnosing Bias vs. Variance\n", 115 | "#### Regularization and Bias/Variance\n", 116 | "#### Learning Curves\n", 117 | "#### Deciding What to Do Next Revisited\n", 118 | "\n", 119 | "### Machine Learning System Design\n", 120 | "\n", 121 | "#### Prioritizing What to Work On\n", 122 | "#### Error Analysis\n", 123 | "#### Error Metrics for Skewed Classes\n", 124 | "#### Trading Off Precision and Recall\n", 125 | "#### Data For Machine Learning\n", 126 | "\n", 127 | "## Week 7\n", 128 | "\n", 129 | "### Support Vector Machines\n", 130 | "\n", 131 | "#### Optimization Objective\n", 132 | "#### Large Margin Intuition\n", 133 | "#### Mathematics Behind Large Margin Classification\n", 134 | "#### Kernels I\n", 135 | "#### Kernels II\n", 136 | "#### Using An SVM\n", 137 | "\n", 138 | "## Week 8\n", 139 | "\n", 140 | "### Unsupervised Learning\n", 141 | "\n", 142 | "#### Unsupervised Learning: Introduction\n", 143 | "#### K-Means Algorithm\n", 144 | "#### Optimization Objective\n", 145 | "#### Random Initialization\n", 146 | "#### Choosing the Number of Clusters\n", 147 | "\n", 148 | "### Dimensionality Reduction\n", 149 | "\n", 150 | "#### Motivation I: Data Compression\n", 151 | "#### Motivation II: Visualization\n", 152 | "#### Principal Component Analysis Problem Formulation\n", 153 | "#### Principal Component Analysis Algorithm\n", 154 | "#### Reconstruction from Compressed Representation\n", 155 | "#### Choosing the Number of Principal Components\n", 156 | "#### Advice for Applying PCA\n", 157 | "\n", 158 | "## Week 9\n", 159 | "\n", 160 | "### Anomaly Detection\n", 161 | "\n", 162 | "#### Problem Motivation\n", 163 | "#### Gaussian Distribution\n", 164 | "#### Algorithm\n", 165 | "#### Developing and Evaluating an Anomaly Detection System\n", 166 | "#### Anomaly Detection vs. 
Supervised Learning\n", 167 | "#### Choosing What Features to Use\n", 168 | "#### Multivariate Gaussian Distribution\n", 169 | "#### Anomaly Detection using the Multivariate Gaussian Distribution\n", 170 | "\n", 171 | "### Recommender Systems\n", 172 | "\n", 173 | "#### Problem Formulation\n", 174 | "#### Content Based Recommendations\n", 175 | "#### Collaborative Filtering\n", 176 | "#### Collaborative Filtering Algorithm\n", 177 | "#### Vectorization: Low Rank Matrix Factorization\n", 178 | "#### Implementational Detail: Mean Normalization\n", 179 | "\n", 180 | "## Week 10\n", 181 | "\n", 182 | "### Large Scale Machine Learning\n", 183 | "\n", 184 | "#### Learning With Large Datasets\n", 185 | "#### Stochastic Gradient Descent\n", 186 | "#### Mini-Batch Gradient Descent\n", 187 | "#### Stochastic Gradient Descent Convergence\n", 188 | "#### Online Learning\n", 189 | "#### Map Reduce and Data Parallelism\n", 190 | "\n", 191 | "## Week 11\n", 192 | "\n", 193 | "### Application Example: Photo OCR\n", 194 | "\n", 195 | "#### Problem Description and Pipeline\n", 196 | "#### Sliding Windows\n", 197 | "#### Getting Lots of Data and Artificial Data\n", 198 | "#### Ceiling Analysis: What Part of the Pipeline to Work on Next\n", 199 | "#### Summary and Thank You" 200 | ] 201 | } 202 | ], 203 | "metadata": { 204 | "kernelspec": { 205 | "display_name": "Python 3", 206 | "language": "python", 207 | "name": "python3" 208 | }, 209 | "language_info": { 210 | "codemirror_mode": { 211 | "name": "ipython", 212 | "version": 3 213 | }, 214 | "file_extension": ".py", 215 | "mimetype": "text/x-python", 216 | "name": "python", 217 | "nbconvert_exporter": "python", 218 | "pygments_lexer": "ipython3", 219 | "version": "3.7.1" 220 | } 221 | }, 222 | "nbformat": 4, 223 | "nbformat_minor": 2 224 | } 225 | -------------------------------------------------------------------------------- /其他资料/书籍/03. 
统计学习方法.李航.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## 第 1 章 统计学习方法概论\n", 8 | "\n", 9 | "### 1.1 统计学习\n", 10 | "\n", 11 | "### 1.2 监督学习\n", 12 | "\n", 13 | "#### 1.2.1 基本概念\n", 14 | "\n", 15 | "#### 1.2.2 问题的形式化\n", 16 | "\n", 17 | "### 1.3 统计学习三要素\n", 18 | "\n", 19 | "#### 1.3.1 模型\n", 20 | "\n", 21 | "#### 1.3.2 策略\n", 22 | "\n", 23 | "#### 1.3.3 算法\n", 24 | "\n", 25 | "### 1.4 模型评估与模型选择\n", 26 | "\n", 27 | "#### 1.4.1 训练误差与测试误差\n", 28 | "\n", 29 | "#### 1.4.2 过拟合与模型模型选择\n", 30 | "\n", 31 | "### 1.5 正则化与交叉验证\n", 32 | "\n", 33 | "#### 1.5.1 正则化\n", 34 | "\n", 35 | "#### 1.5.2 交叉验证\n", 36 | "\n", 37 | "### 1.6 泛化能力\n", 38 | "\n", 39 | "#### 1.6.1 泛化误差\n", 40 | "\n", 41 | "#### 1.6.2 泛化误差上界\n", 42 | "\n", 43 | "### 1.7 生成模型与判别模型\n", 44 | "\n", 45 | "### 1.8 分类问题\n", 46 | "\n", 47 | "### 1.9 标注问题\n", 48 | "\n", 49 | "### 1.10 回归问题\n", 50 | "\n", 51 | "## 第 2 章 感知机\n", 52 | "\n", 53 | "### 2.1 感知机模型\n", 54 | "\n", 55 | "### 2.2 感知机学习策略\n", 56 | "\n", 57 | "#### 2.2.1 数据集的线性可分\n", 58 | "\n", 59 | "#### 2.2.2 感知机学习策略\n", 60 | "\n", 61 | "### 2.3 感知机学习算法\n", 62 | "\n", 63 | "#### 2.3.1 感知机学习算法的原始形式\n", 64 | "\n", 65 | "#### 2.3.2 算法的收敛性\n", 66 | "\n", 67 | "#### 2.3.3 感知机学习算法的对偶形式\n", 68 | "\n", 69 | "## 第 3 章 k 近邻法\n", 70 | "\n", 71 | "### 3.1 k 近邻算法\n", 72 | "\n", 73 | "### 3.2 k 近邻模型\n", 74 | "\n", 75 | "#### 3.2.1 模型\n", 76 | "\n", 77 | "#### 3.2.2 距离度量\n", 78 | "\n", 79 | "#### 3.2.3 k 值的选择\n", 80 | "\n", 81 | "#### 3.2.4 分类决策规则\n", 82 | "\n", 83 | "### 3.3 k 近邻法的实现:kd 树\n", 84 | "\n", 85 | "#### 3.3.1 构造 kd 树\n", 86 | "\n", 87 | "#### 3.3.2 搜索 kd 树\n", 88 | "\n", 89 | "## 第 4 章 朴素贝叶斯法\n", 90 | "\n", 91 | "### 4.1 朴素贝叶斯法的学习与分类\n", 92 | "\n", 93 | "#### 4.1.1 基本方法\n", 94 | "\n", 95 | "#### 4.1.2 后验概率最大化的含义\n", 96 | "\n", 97 | "### 4.2 朴素贝叶斯法的参数估计\n", 98 | "\n", 99 | "#### 4.2.1 极大似然法\n", 100 | "\n", 101 | "#### 4.2.2 学习与分类算法\n", 102 | "\n", 103 | "#### 4.2.3 贝叶斯估计\n", 104 | "\n", 105 | "## 第 5 章 决策树\n", 106 | "\n", 107 | "### 5.1 决策树模型与学习\n", 108 | "\n", 109 | "#### 5.1.1 决策树模型\n", 110 | "\n", 111 | "#### 5.1.2 决策树与 if-then 规则\n", 112 | "\n", 113 | "#### 5.1.3 决策树与条件概率分布\n", 114 | "\n", 115 | "#### 5.1.4 决策树学习\n", 116 | "\n", 117 | "### 5.2 特征选择\n", 118 | "\n", 119 | "#### 5.2.1 特征选择问题\n", 120 | "\n", 121 | "#### 5.2.2 信息增益\n", 122 | "\n", 123 | "#### 5.2.3 信息增益比\n", 124 | "\n", 125 | "### 5.3 决策树的生成\n", 126 | "\n", 127 | "#### 5.3.1 ID3 算法\n", 128 | "\n", 129 | "#### 5.3.2 C4.5 的生成算法\n", 130 | "\n", 131 | "### 5.4 决策树的剪枝\n", 132 | "\n", 133 | "### 5.5 CART 算法\n", 134 | "\n", 135 | "#### 5.5.1 CART 生成\n", 136 | "\n", 137 | "#### 5.5.2 CART 剪枝\n", 138 | "\n", 139 | "## 第 6 章 逻辑斯蒂回归与最大熵模型\n", 140 | "\n", 141 | "### 6.1 逻辑斯蒂回归模型\n", 142 | "\n", 143 | "#### 6.1.1 逻辑斯蒂分布\n", 144 | "\n", 145 | "#### 6.1.2 二项逻辑斯蒂回归模型\n", 146 | "\n", 147 | "#### 6.1.3 模型参数估计\n", 148 | "\n", 149 | "#### 6.1.4 多项逻辑斯蒂回归\n", 150 | "\n", 151 | "### 6.2 最大熵模型\n", 152 | "\n", 153 | "#### 6.2.1 最大熵原理\n", 154 | "\n", 155 | "#### 6.2.2 最大熵模型的定义\n", 156 | "\n", 157 | "#### 6.2.3 最大熵模型的学习\n", 158 | "\n", 159 | "#### 6.2.4 极大似然估计\n", 160 | "\n", 161 | "### 6.3 模型学习的最优化算法\n", 162 | "\n", 163 | "#### 6.3.1 改进的迭代尺度法\n", 164 | "\n", 165 | "#### 6.3.2 拟牛顿法\n", 166 | "\n", 167 | "## 第 7 章 支持向量机\n", 168 | "\n", 169 | "### 7.1 线性可分支持向量机与硬间隔最大化\n", 170 | "\n", 171 | "#### 7.1.1 线性可分支持向量机\n", 172 | "\n", 173 | "#### 7.1.2 函数间隔和几何间隔\n", 174 | "\n", 175 | "#### 7.1.3 间隔最大化\n", 176 | "\n", 177 | "#### 
7.1.4 学习的对偶算法\n", 178 | "\n", 179 | "### 7.2 线性支持向量机与软间隔最大化\n", 180 | "\n", 181 | "#### 7.2.1 线性支持向量机\n", 182 | "\n", 183 | "#### 7.2.2 学习的对偶算法\n", 184 | "\n", 185 | "#### 7.2.3 支持向量\n", 186 | "\n", 187 | "#### 7.2.4 合页损失函数\n", 188 | "\n", 189 | "### 7.3 非线性支持向量机与核函数\n", 190 | "\n", 191 | "#### 7.3.1 核技巧\n", 192 | "\n", 193 | "#### 7.3.2 正定核\n", 194 | "\n", 195 | "#### 7.3.3 常用核函数\n", 196 | "\n", 197 | "#### 7.3.4 非线性支持向量分类机\n", 198 | "\n", 199 | "### 7.4 序列最小最优化算法\n", 200 | "\n", 201 | "#### 7.4.1 两个变量二次规划的求解方法\n", 202 | "\n", 203 | "#### 7.4.2 变量的选择方法\n", 204 | "\n", 205 | "#### 7.4.3 SMO 算法\n", 206 | "\n", 207 | "## 第 8 章 提升方法\n", 208 | "\n", 209 | "### 8.1 提升方法 AdaBoost 算法\n", 210 | "\n", 211 | "#### 8.1.1 提升方法的基本思路\n", 212 | "\n", 213 | "#### 8.1.2 AdaBoost 算法\n", 214 | "\n", 215 | "#### 8.1.3 AdaBoost 的例子\n", 216 | "\n", 217 | "### 8.2 AdaBoost 算法的训练误差分析\n", 218 | "\n", 219 | "### 8.3 AdaBoost 算法的解释\n", 220 | "\n", 221 | "#### 8.3.1 前向分步算法\n", 222 | "\n", 223 | "#### 8.3.2 前向分步算法与 AdaBoost\n", 224 | "\n", 225 | "### 8.4 提升树\n", 226 | "\n", 227 | "#### 8.4.1 提升树模型\n", 228 | "\n", 229 | "#### 8.4.2 提升树算法\n", 230 | "\n", 231 | "#### 8.4.3 梯度提升\n", 232 | "\n", 233 | "## 第 9 章 EM 算法及其推广\n", 234 | "\n", 235 | "### 9.1 EM 算法的引入\n", 236 | "\n", 237 | "#### 9.1.1 EM 算法\n", 238 | "\n", 239 | "#### 9.1.2 EM 算法的导出\n", 240 | "\n", 241 | "#### 9.1.3 EM 算法在非监督学习中的应用\n", 242 | "\n", 243 | "### 9.2 EM 算法的收敛性\n", 244 | "\n", 245 | "### 9.3 EM 算法在高斯混合模型学习中的应用\n", 246 | "\n", 247 | "#### 9.3.1 高斯混合模型\n", 248 | "\n", 249 | "#### 9.3.2 高斯混合模型参数估计的 EM 算法\n", 250 | "\n", 251 | "### 9.4 EM 算法的推广\n", 252 | "\n", 253 | "#### 9.4.1 F 函数的极大-极大算法\n", 254 | "\n", 255 | "#### 9.4.2 GEM 算法\n", 256 | "\n", 257 | "## 第 10 章 隐马尔可夫模型\n", 258 | "\n", 259 | "### 10.1 隐马尔可夫模型的基本概念\n", 260 | "\n", 261 | "#### 10.1.1 隐马尔可夫模型的定义\n", 262 | "\n", 263 | "#### 10.1.2 观测序列的生成模型\n", 264 | "\n", 265 | "#### 10.1.3 隐马尔可夫模型的 3 个基本问题\n", 266 | "\n", 267 | "### 10.2 概率计算算法\n", 268 | "\n", 269 | "#### 10.2.1 直接计算法\n", 270 | "\n", 271 | "#### 10.2.2 前向算法\n", 272 | "\n", 273 | "#### 10.2.3 后向算法\n", 274 | "\n", 275 | "#### 10.2.4 一些概率与期望值的计算\n", 276 | "\n", 277 | "### 10.3 学习算法\n", 278 | "\n", 279 | "#### 10.3.1 监督学习方法\n", 280 | "\n", 281 | "#### 10.3.2 Baum-Welch 算法\n", 282 | "\n", 283 | "#### 10.3.3 Baum-Welch 模型参数估计公式\n", 284 | "\n", 285 | "### 10.4 预测算法\n", 286 | "\n", 287 | "#### 10.4.1 近似算法\n", 288 | "\n", 289 | "#### 10.4.2 维特比算法\n", 290 | "\n", 291 | "## 第 11 章 条件随机场\n", 292 | "\n", 293 | "### 11.1 概率无向图模型\n", 294 | "\n", 295 | "#### 11.1.1 模型定义\n", 296 | "\n", 297 | "#### 11.1.2 概率无向图模型的因子分解\n", 298 | "\n", 299 | "### 11.2 条件随机场的定义与形式\n", 300 | "\n", 301 | "#### 11.2.1 条件随机场的定义\n", 302 | "\n", 303 | "#### 11.2.2 条件随机场的参数化形式\n", 304 | "\n", 305 | "#### 11.2.3 条件随机场的简化形式\n", 306 | "\n", 307 | "#### 11.2.4 条件随机场的矩阵形式\n", 308 | "\n", 309 | "### 11.3 条件随机场的概率计算问题\n", 310 | "\n", 311 | "#### 11.3.1 前向-后向算法\n", 312 | "\n", 313 | "#### 11.3.2 概率计算\n", 314 | "\n", 315 | "#### 11.3.3 期望值的计算\n", 316 | "\n", 317 | "### 11.4 条件随机场的学习算法\n", 318 | "\n", 319 | "#### 11.4.1 改进的迭代尺度法\n", 320 | "\n", 321 | "#### 11.4.2 拟牛顿法\n", 322 | "\n", 323 | "### 11.5 条件随机场的预测算法\n", 324 | "\n", 325 | "## 第 12 章 统计学习方法总结\n", 326 | "\n", 327 | "## 附录 A 梯度下降法\n", 328 | "\n", 329 | "## 附录 B 牛顿法和拟牛顿法\n", 330 | "\n", 331 | "## 附录 C 拉格朗日对偶性" 332 | ] 333 | } 334 | ], 335 | "metadata": { 336 | "kernelspec": { 337 | "display_name": "Python 3", 338 | "language": "python", 339 | "name": "python3" 340 | }, 341 | "language_info": { 342 | "codemirror_mode": { 343 | "name": 
"ipython", 344 | "version": 3 345 | }, 346 | "file_extension": ".py", 347 | "mimetype": "text/x-python", 348 | "name": "python", 349 | "nbconvert_exporter": "python", 350 | "pygments_lexer": "ipython3", 351 | "version": "3.7.1" 352 | } 353 | }, 354 | "nbformat": 4, 355 | "nbformat_minor": 2 356 | } 357 | -------------------------------------------------------------------------------- /线性回归/16. 带约束条件的线性回归.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "在前面的学习中,我们从同一份样本数据中得到了三个模型,而且三个模型的拟合效果都很好,不过很显然,二次模型和三次模型都过于复杂了,属于过拟合。为了判断哪个模型的性能最优,泛化能力最强,我们将数据集划分为训练集和测试集,然后选择在测试集上表现最好的模型。使用这种方法可以有效的评估模型结果,避免过拟合。\n", 8 | "\n", 9 | "这一节我们将学习一种避免过拟合的新方法:**正则化**(regularization)方法。\n", 10 | "\n", 11 | "在线性回归模型 $ \\hat{y} = w_0 + w_1x_1 + w_2x_2 + ... + w_nx_n = w^Tx $ 中,每一个变量 $x_i$ 就是模型的一个特征,参数 $w_i$ 代表相应的特征在模型中所占的权重。在前面的例子里,本来只有一个特征,我们人为造出了两个特征来($x^2$ 和 $x^3$),得到了三个模型,这两个特征对模型来说其实根本没用,过拟合就是因为模型过于复杂,把这些没用的特征也学习到了。所以如果能降低这些没用特征的权重,譬如直接降到 0,那就不会出现过拟合了,为了达到这一点,我们可以稍微改进下我们的损失函数:\n", 12 | "\n", 13 | "$$\n", 14 | "loss = \\sum_{i=1}^m (y_i - w^Tx_i)^2 + \\lambda \\| w \\|^2\n", 15 | "$$\n", 16 | "\n", 17 | "这个函数相比于我们之前的平方损失函数多了一个 $\\lambda \\| w \\|^2$,这被称之为 **正则化项** 或 **惩罚项**,其中正则化参数 $\\lambda > 0$,它是一个超参数,$\\| w \\|^2$ 表示 $L_2$ 范数,如下:\n", 18 | "\n", 19 | "$$\n", 20 | "\\| w \\|^2 = \\sum_{i=1}^m w_i^2 = w_1^2 + w_2^2 + \\dots + w_m^2\n", 21 | "$$\n", 22 | "\n", 23 | "很显然,为了使上面的损失函数最小,模型中不相干的参数越小越好,从而降低过拟合的风险,使用这个损失函数的回归方法称为 **岭回归**(ridge regression)。\n", 24 | "\n", 25 | "上面的 $L_2$ 范数可以改成 $L_p$ 范数,当 $p = 1$ 时,有损失函数:\n", 26 | "\n", 27 | "$$\n", 28 | "loss = \\sum_{i=1}^m (y_i - w^Tx_i)^2 + \\lambda \\| w \\|\n", 29 | "$$\n", 30 | "\n", 31 | "其中 $L_1$ 范数如下:\n", 32 | "\n", 33 | "$$\n", 34 | "\\| w \\| = \\sum_{i=1}^m |w_i| = |w_1| + |w_2| + \\dots + |w_m|\n", 35 | "$$\n", 36 | "\n", 37 | "这种回归方法称为 **LASSO 回归**(Least Absolute Shrinkage and Selection Operator)。\n", 38 | "\n", 39 | "可以看到这里的损失函数仍然是最小二乘,但是在后面增加了 $L_1$ 或 $L_2$ 范数约束,所以岭回归又被称为 **基于 $L_2$ 约束的最小二乘法**,LASSO 回归又被称为 **基于 $L_1$ 约束的最小二乘法**。" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "### $L_2$ 约束的最小二乘\n", 47 | "\n", 48 | "可以看到,为了避免过拟合,我们在模型的参数上施加了一定的约束条件。在普通最小二乘回归中,参数空间没有限制,参数可以取任何值,但是在岭回归和 Lasso 回归中,参数被限制在一定的范围内,这被称为 **部分空间约束的最小二乘学习法**。为了保证参数不会偏移到值域范围之外,可以附加一个 $P\\theta = \\theta$ 的约束条件。\n", 49 | "\n", 50 | "$$\n", 51 | "\\mathop{\\min}_{\\theta}J(\\theta), P\\theta = \\theta\n", 52 | "$$\n", 53 | "\n", 54 | "$P$ 是满足 $P^2 = P$ 和 $P^T = P$ 的 $b \\times b$ 维矩阵,表示的是矩阵 $P$ 的值域的正交投影矩阵。$P$ 通常是手工设置的,通过主成分分析法,也可以基于数据进行设置。\n", 55 | "\n", 56 | "> 疑惑:$P\\theta = \\theta$,$P$ 不就是单位矩阵吗?\n", 57 | "\n", 58 | "不过正交投影矩阵 $P$ 的设置有很大的自由度,所以改用操作相对容易的 $L_2$ 约束条件,譬如:\n", 59 | "\n", 60 | "$$\n", 61 | "\\mathop{\\min}_{\\theta}J(\\theta), \\|\\theta\\|^2 \\leqslant R\n", 62 | "$$\n", 63 | "\n", 64 | "这时的参数空间如下图所示:\n", 65 | "\n", 66 | "![](../images/l2-parameter-space.webp)\n", 67 | "\n", 68 | "从图中可以看出来,这种约束方法是以参数空间的原点为圆心,在一定半径范围的圆内(一般为**超球**)进行求解的。$R$ 表示的是圆的半径。\n", 69 | "\n", 70 | "利用**拉格朗日对偶问题**,该问题可以转换为求解下式的最优解:\n", 71 | "\n", 72 | "$$\n", 73 | "\\mathop{\\max}_{\\lambda} \\mathop{\\min}_{\\theta} [ J(\\theta) + \\frac{\\lambda}{2}(\\|\\theta\\|^2 - R) ], \\lambda \\geqslant 0\n", 74 | "$$\n", 75 | "\n", 76 | "拉格朗日对偶问题的拉格朗日乘子 $\\lambda$ 的解由圆的半径 $R$ 决定,如果不根据 $R$ 来决定 $\\lambda$,而是直接指定的话,$L_2$ 约束的最小二乘解 $\\hat{\\theta}$ 可以通过下式求得:\n", 77 | "\n", 78 | "$$\n", 79 | "\\hat{\\theta} = 
\\mathop{\\arg\\min}_{\\theta} [ J(\\theta) + \\frac{\\lambda}{2}\\|\\theta\\|^2 ]\n", 80 | "$$\n", 81 | "\n", 82 | "这就是岭回归。" 83 | ] 84 | }, 85 | { 86 | "cell_type": "markdown", 87 | "metadata": {}, 88 | "source": [ 89 | "### 一般 $L_2$ 约束的最小二乘\n", 90 | "\n", 91 | "上面的 $L_2$ 约束可以写成更一般的情况:\n", 92 | "\n", 93 | "$$\n", 94 | "\\mathop{\\min}_{\\theta}J(\\theta), \\theta^TG\\theta \\leqslant R\n", 95 | "$$\n", 96 | "\n", 97 | "这里的 $G$ 是一个 $b \\times b$ 的正则化矩阵。当 $G$ 是对称正定矩阵的时候,$\\theta^TG\\theta \\leqslant R$ 可以把参数限制在一个椭圆区域内。" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": {}, 103 | "source": [ 104 | "### $L_1$ 约束的最小二乘\n", 105 | "\n", 106 | "和 $L_2$ 约束类似,$L_1$ 约束可以写成下面的形式:\n", 107 | "\n", 108 | "$$\n", 109 | "\\mathop{\\min}_{\\theta}J(\\theta), \\|\\theta\\| \\leqslant R\n", 110 | "$$\n", 111 | "\n", 112 | "这时的参数空间如下图所示:\n", 113 | "\n", 114 | "![](../images/l1-parameter-space.webp)\n", 115 | "\n", 116 | "可见参数被限制在一个菱形范围内,利用拉格朗日对偶问题,$L_1$ 约束的最小二乘解可以通过下式求得:\n", 117 | "\n", 118 | "$$\n", 119 | "\\hat{\\theta} = \\mathop{\\arg\\min}_{\\theta} ( J(\\theta) + \\lambda\\|\\theta\\| )\n", 120 | "$$\n", 121 | "\n", 122 | "这就是 Lasso 回归。如果我们在参数空间中画出损失函数 $J(\\theta)$ 的等高线,可以把 $L_1$ 约束和 $L_2$ 约束作一个对比。由于 $J(\\theta)$ 是平方损失函数,它是一个二次凸函数,所以它的等高线在参数空间呈椭圆状,椭圆的中心即是最小二乘解,如下图:\n", 123 | "\n", 124 | "![](../images/l1-vs-l2-parameter-space.webp)\n", 125 | "\n", 126 | "而等高线和约束边界的交点即为带约束条件的最小二乘解。可以直观的看出,由于 $L_1$ 约束边界在各个参数的轴有一个尖角,等高线和它的交点更容易出现在参数的轴上,也就意味着其他的参数等于 0,像这样的解被称为 **稀疏解**(Sparse solution)。" 127 | ] 128 | }, 129 | { 130 | "cell_type": "markdown", 131 | "metadata": {}, 132 | "source": [ 133 | "### 岭回归的几何解释\n", 134 | "\n", 135 | "如果上面的等高线看上去不是很直观,我们可以在三维场景下再看看岭回归到底是怎么限制参数空间的。在一元线性回归的求解过程中,我们知道,损失函数实际上就是三维空间的一个曲面,损失函数的最小值就是该曲面的最低点。\n", 136 | "\n", 137 | "![](../images/ridge-regression-1.webp)\n", 138 | "\n", 139 | "岭回归意味着参数的取值被限制在一个圆内,也就是下面三维空间中的圆柱体,这个圆柱体和曲面的交点就是岭回归的解。\n", 140 | "\n", 141 | "![](../images/ridge-regression-2.webp)\n", 142 | "\n", 143 | "所以,从参数平面理解,即为抛物面等高线在水平面的投影和圆的交点,如下图所示:\n", 144 | "\n", 145 | "![](../images/ridge-regression-3.webp)" 146 | ] 147 | }, 148 | { 149 | "cell_type": "markdown", 150 | "metadata": {}, 151 | "source": [ 152 | "### $L_p$ 约束的最小二乘\n", 153 | "\n", 154 | "上面介绍了 $L_1$ 约束和 $L_2$ 约束,我们可以将其进一步推广到 $L_p$ 约束:\n", 155 | "\n", 156 | "$$\n", 157 | "\\|\\theta\\|_p = \\lgroup \\sum_{j=1}^b |\\theta_j|^p \\rgroup ^{\\frac{1}{p}} \\leqslant R\n", 158 | "$$\n", 159 | "\n", 160 | "$L_p$ 范数中有两个比较特殊的场景,一种是当 $p = 0$ 时的 $L_0$ 范数:\n", 161 | "\n", 162 | "$$\n", 163 | "\\|\\theta\\|_0 = \\sum_{j=1}^b I(\\theta_j \\neq 0)\n", 164 | "$$\n", 165 | "\n", 166 | "其中 $I(\\theta_j \\neq 0)$ 为指示函数,$\\theta_j \\neq 0$ 时为 1,否则为 0,所以 $L_0$ 范数表示的是非零元素的个数。\n", 167 | "\n", 168 | "另一种场景是当 $p = \\infty$ 时的 $L_\\infty$ 范数:\n", 169 | "\n", 170 | "$$\n", 171 | "\\|\\theta\\|_\\infty = \\max \\{ |\\theta_1|, \\dots, |\\theta_b| \\}\n", 172 | "$$\n", 173 | "\n", 174 | "它表示的是元素绝对值中的最大值,因此 $L_\\infty$ 范数也称为 **最大值范数**。\n", 175 | "\n", 176 | "我们画出不同的 $p$ 值参数空间的约束边界,如下图:\n", 177 | "\n", 178 | "![](../images/lp-parameter-space.webp)\n", 179 | "\n", 180 | "当 $p \\leqslant 1$ 时,约束边界在坐标轴上呈有峰值的尖形;当 $p \\geqslant 1$ 时,呈凸形,在坐标轴上呈尖行是存在稀疏解的秘诀。但是,如果约束边界不是凸形的话,可能存在多个解,最优化往往很困难。因此 $p = 1$ 是稀疏解存在的唯一凸形。" 181 | ] 182 | }, 183 | { 184 | "cell_type": "markdown", 185 | "metadata": {}, 186 | "source": [ 187 | "### $L_1 + L_2$ 约束的最小二乘\n", 188 | "\n", 189 | "$L_1$ 约束在实际使用时存在些许限制:\n", 190 | "\n", 191 | "1. 参数 b 比训练样本 n 多时,线性模型可选择的最大特征数被局限为 n;\n", 192 | "2. 
如果有多个基函数相似的集合时,$L_1$ 约束会选择一个而忽略其它的,另外,$L_1$ 约束只能在多个相关性较强中的特征中选择一个;\n", 193 | "3. 参数 b 比训练样本 n 少时, $L_1$ 的通用性比 $L_2$ 稍差;\n", 194 | "\n", 195 | "通过结合 $L_1$ 和 $L_2$ 两个约束可以解决上述问题,这就是 $L_1 + L_2$ 约束的最小二乘,也被称为 **弹性网**。\n", 196 | "\n", 197 | "$$\n", 198 | "(1-\\tau)\\|\\theta\\|_1 + \\tau \\|\\theta\\|^2 \\leqslant R\n", 199 | "$$\n", 200 | "\n", 201 | "其中 $0 \\leqslant \\tau \\leqslant 1$,当 $\\tau = 0$ 时就是 $L_1$ 约束,当 $\\tau = 1$ 时就是 $L_2$ 约束,当 $\\tau = 0.5$ 时,约束边界如下图所示:\n", 202 | "\n", 203 | "![](../images/l1-l2-parameterr-space.webp)\n", 204 | "\n", 205 | "可以看出 $\\tau = 0.5$ 时的 $L_1 + L_2$ 约束和 $L_{1.4}$ 约束几乎一样,但是如果我们把角的部分放大,可以看出 $L_{1.4}$ 约束和 $L_2$ 约束一样平滑,而 $L_1 + L_2$ 约束却是和 $L_1$ 约束一样呈尖行。所以 $L_1 + L_2$ 约束也和 $L_1$ 约束一样容易求得稀疏解。\n", 206 | "\n", 207 | "实验证明 $L_1 + L_2$ 约束比 $L_1$ 约束具有更高的精度,但是为了调整 $L_1$ 和 $L_2$ 的平衡,引入了一个新的超参数 $\\tau$,求解要更复杂一点。" 208 | ] 209 | }, 210 | { 211 | "cell_type": "markdown", 212 | "metadata": {}, 213 | "source": [ 214 | "### 参考\n", 215 | "\n", 216 | "* https://www.jianshu.com/p/1677d27e08a7\n", 217 | "* https://www.jianshu.com/p/9c0b029478e9" 218 | ] 219 | } 220 | ], 221 | "metadata": { 222 | "kernelspec": { 223 | "display_name": "Python 3", 224 | "language": "python", 225 | "name": "python3" 226 | }, 227 | "language_info": { 228 | "codemirror_mode": { 229 | "name": "ipython", 230 | "version": 3 231 | }, 232 | "file_extension": ".py", 233 | "mimetype": "text/x-python", 234 | "name": "python", 235 | "nbconvert_exporter": "python", 236 | "pygments_lexer": "ipython3", 237 | "version": "3.7.1" 238 | } 239 | }, 240 | "nbformat": 4, 241 | "nbformat_minor": 2 242 | } 243 | --------------------------------------------------------------------------------
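作为《带约束条件的线性回归》一节的补充,下面用 scikit-learn 粗略对比一下 $L_2$ 约束(岭回归)和 $L_1$ 约束(LASSO 回归)求得的参数。这只是一个示意性的小实验,其中的数据和 `alpha` 取值都是随意假设的:构造 10 个特征中只有前两个与 $y$ 真正相关的数据,观察两种约束下无关特征的系数。

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# 构造演示数据:10 个特征中只有前两个和 y 相关,其余都是无关特征
rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(100)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 约束的最小二乘(岭回归)
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 约束的最小二乘(LASSO 回归)

print('Ridge 系数:', np.round(ridge.coef_, 3))
print('Lasso 系数:', np.round(lasso.coef_, 3))
print('Lasso 中恰好为 0 的系数个数:', int(np.sum(lasso.coef_ == 0)))
```

一般来说,岭回归只是把无关特征的系数压小,而 LASSO 更容易把它们直接压成 0,即得到稀疏解;`alpha` 越大,被压成 0 的系数通常越多,对应上文中约束半径 $R$ 越小的情形。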