├── .gitignore ├── Deep-Learning ├── mxnet-notes │ ├── mshadow-note2-data-structures.ipynb │ └── operators-in-mxnet.ipynb ├── keras-notes │ └── keras-tips.ipynb ├── back-propagation-in-matrix-form.ipynb ├── theano-notes │ ├── part3-shared-variable.ipynb │ ├── part6-scan-function.ipynb │ └── part7-dimshuffle.ipynb └── back-propagation-through-time.ipynb ├── README.md ├── PRML ├── Chap1-Introduction │ └── Summary-three-curve-fitting-approaches.ipynb ├── Chap2-Probability-Distributions │ └── 2.2-multinomial-variables.ipynb └── Chap3-Linear-Models-For-Regression │ ├── 3.1-linear-basis-function-models.ipynb │ └── summary-baysian-linear-regression.ipynb ├── ML-Foundation └── lecture-1.ipynb ├── CS229 ├── RL1.ipynb ├── EM.ipynb ├── RL2.ipynb └── GLM.ipynb ├── YidaXu-ML ├── variational-inference-for-gaussian-distribution.ipynb ├── exponential-family.ipynb ├── EM-review.ipynb ├── variational-inference.ipynb ├── exponential-family-variational-inference.ipynb └── sampling-methods-part2.ipynb ├── README.ipynb ├── Machine-Learning ├── svd1.ipynb ├── xgboost-notes │ └── xgboost-note1.ipynb ├── softmax-crossentropy-derivative.ipynb └── svd-ridge-regression.ipynb └── NLP └── latent-dirichlet-allocation-1.ipynb /.gitignore: -------------------------------------------------------------------------------- 1 | *-checkpoint.ipynb 2 | *.pyc 3 | *.pdf 4 | *.csv 5 | errata.md 6 | -------------------------------------------------------------------------------- /Deep-Learning/mxnet-notes/mshadow-note2-data-structures.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# MSHADOW代码阅读笔记2——数据结构\n", 8 | "\n", 9 | "本文主要介绍MSHADOW使用到的数据结构。 \n", 10 | "## Random\n", 11 | "位于`random.h`,定义了MSHADOW的随机数生成器\n", 12 | "\n", 13 | "### Random\n", 14 | "\n", 15 | "作用:随机数生成器的模板类 \n", 16 | "定义: \n", 17 | "```c\n", 18 | "/*!\n", 19 | " * \\brief random number generator\n", 20 | " * \\tparam Device the device of random number generator\n", 21 | " * \\tparam DType the target data type of random number can be float for double\n", 22 | " */\n", 23 | "template\n", 24 | "class Random {};\n", 25 | "```\n", 26 | "\n", 27 | "### CPU Random\n", 28 | "作用:CPU版本的随机数生成器 \n", 29 | "定义: \n", 30 | "```c\n", 31 | "/*! 
\\brief CPU random number generator */\n", 32 | "template\n", 33 | "class Random\n", 34 | "```\n", 35 | "`struct cpu`来自`tensor.h`。cpu版本的随机数生成器默认使用C++ 11提供的基于梅森旋转伪随机算法的随机数生成器mt19937。\n", 36 | "\n", 37 | "### GPU Random\n", 38 | "作用:GPU版本的随机数生成器 \n", 39 | "定义: \n", 40 | "```c\n", 41 | "template\n", 42 | "class Random\n", 43 | "```\n", 44 | "gpu版本的随机数生成器默认使用curand库提供的接口实现。\n", 45 | "\n", 46 | "## Device\n", 47 | "位于tensor.h,定义所运行的设备。 \n", 48 | "### CPU\n", 49 | "\n", 50 | "### GPU" 51 | ] 52 | } 53 | ], 54 | "metadata": { 55 | "anaconda-cloud": {}, 56 | "kernelspec": { 57 | "display_name": "Python [conda root]", 58 | "language": "python", 59 | "name": "conda-root-py" 60 | }, 61 | "language_info": { 62 | "codemirror_mode": { 63 | "name": "ipython", 64 | "version": 2 65 | }, 66 | "file_extension": ".py", 67 | "mimetype": "text/x-python", 68 | "name": "python", 69 | "nbconvert_exporter": "python", 70 | "pygments_lexer": "ipython2", 71 | "version": "2.7.12" 72 | } 73 | }, 74 | "nbformat": 4, 75 | "nbformat_minor": 1 76 | } 77 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 简介 2 | 3 | > 作者:hschen 4 | > QQ:357033150 5 | > 邮箱:hschen0712@gmail.com 6 | 7 | 此笔记主要总结自一些论文、书籍以及公开课,由于本人水平有限,笔记中难免会出现各种错误,欢迎指正。 8 | 由于Github渲染`.ipynb`文件较慢,可以用nbviewer加快渲染:[点此加速](http://nbviewer.jupyter.org/github/hschen0712/machine-learning-notes/blob/master/README.ipynb) 9 | 10 | 11 | ## 目录 12 | 13 | 14 | 1.公开课/读书笔记 15 | - [徐亦达机器学习笔记](YidaXu-ML/) 16 | - [采样算法系列1——基本采样算法](YidaXu-ML/sampling-methods-part1.ipynb) 17 | - [采样算法系列2——MCMC](YidaXu-ML/sampling-methods-part2.ipynb) 18 | - [EM算法](YidaXu-ML/EM-review.ipynb) 19 | - [变分推断](YidaXu-ML/variational-inference.ipynb) 20 | - [高斯分布的变分推断](YidaXu-ML/variational-inference-for-gaussian-distribution.ipynb) 21 | - [指数分布族](YidaXu-ML/exponential-family.ipynb) 22 | - [指数分布族的变分推断](YidaXu-ML/exponential-family-variational-inference.ipynb) 23 | 24 | - [CS229课程笔记](CS229/) 25 | - [广义线性模型](CS229/GLM.ipynb) 26 | - [EM算法](CS229/EM.ipynb) 27 | - [增强学习系列1](CS229/RL1.ipynb) 28 | - [增强学习系列2](CS229/RL2.ipynb) 29 | 30 | - [台大机器学习基石笔记](ML-Foundation/) 31 | - [第一讲-学习问题](ML-Foundation/lecture-1.ipynb) 32 | - [PRML读书笔记](PRML/) 33 | - [第一章 简介](PRML/Chap1-Introduction) 34 | - [1.1 多项式曲线拟合](PRML/Chap1-Introduction/1.1-polynomial-curve-fitting.ipynb) 35 | - [1.2 概率论回顾](PRML/Chap1-Introduction/1.2-probability-theory.ipynb) 36 | - [总结-曲线拟合的三种参数估计方法](PRML/Chap1-Introduction/Summary-three-curve-fitting-approaches.ipynb) 37 | - [第二章 概率分布](PRML/Chap2-Probability-Distributions) 38 | - [2.1 二元变量](PRML/Chap2-Probability-Distributions/2.1-binary-variables.ipynb) 39 | - [2.2 多元变量](PRML/Chap2-Probability-Distributions/2.2-multinomial-variables.ipynb) 40 | - [第三章 线性回归模型](PRML/Chap3-Linear-Models-For-Regression) 41 | - [3.1 线性基函数模型](PRML/Chap3-Linear-Models-For-Regression/3.1-linear-basis-function-models.ipynb) 42 | - [总结-贝叶斯线性回归](PRML/Chap3-Linear-Models-For-Regression/summary-baysian-linear-regression.ipynb) 43 | 44 | 2.[机器学习笔记](Machine-Learning/) 45 | - [xgboost笔记](Machine-Learning/xgboost-notes) 46 | - [1. 
xgboost的安装](Machine-Learning/xgboost-notes/xgboost-note1.ipynb) 47 | - [softmax分类器](Machine-Learning/softmax-crossentropy-derivative.ipynb) 48 | - [用theano实现softmax分类器](Machine-Learning/implement-softmax-in-theano.ipynb) 49 | - [用SVD实现岭回归](Machine-Learning/svd-ridge-regression.ipynb) 50 | - [SVD系列1](Machine-Learning/svd1.ipynb) 51 | 52 | 3.[NLP笔记](NLP/) 53 | - [LDA系列1——LDA简介](NLP/latent-dirichlet-allocation-1.ipynb) 54 | - [LDA系列2——Gibbs采样](NLP/latent-dirichlet-allocation-2.ipynb) 55 | - [朴素贝叶斯](NLP/naive-bayes.ipynb) 56 | 57 | 58 | 4.[深度学习笔记](Deep-Learning/) 59 | - [theano笔记](Deep-Learning/theano-notes) 60 | - [2. theano简单计算](Deep-Learning/theano-notes/part2-simple-computations.ipynb) 61 | - [3. theano共享变量](Deep-Learning/theano-notes/part3-shared-variable.ipynb) 62 | - [4. theano随机数](Deep-Learning/theano-notes/part4-random-number.ipynb) 63 | - [6. theano的scan函数](Deep-Learning/theano-notes/part6-scan-function.ipynb) 64 | - [7. theano的dimshuffle](Deep-Learning/theano-notes/part7-dimshuffle.ipynb) 65 | - [mxnet笔记](Deep-Learning/mxnet-notes) 66 | - [1. Win10下安装MXNET](Deep-Learning/mxnet-notes/1-installation.ipynb) 67 | 68 | - [2. MXNET符号API](Deep-Learning/mxnet-notes/2-mxnet-symbolic.ipynb) 69 | 70 | - [mxnet中的运算符](Deep-Learning/mxnet-notes/operators-in-mxnet.ipynb) 71 | 72 | - [mshadow表达式模板教程](Deep-Learning/mxnet-notes/mshadow-expression-template-tutorial.ipynb) 73 | 74 | - [keras笔记](Deep-Learning/keras-notes) 75 | - [keras心得](Deep-Learning/keras-notes/keras-tips.ipynb) 76 | 77 | - [windows下安装caffe](Deep-Learning/install-caffe-in-windows.ipynb) 78 | - [BP算法矩阵形式推导](Deep-Learning/back-propagation-in-matrix-form.ipynb) 79 | - [随时间反向传播算法数学推导过程](Deep-Learning/back-propagation-through-time.ipynb) 80 | - [用numpy实现RNN](Deep-Learning/rnn-numpy.ipynb) 81 | - [随机矩阵的奇异值分析](Deep-Learning/singular-value-of-random-matrix.ipynb) 82 | -------------------------------------------------------------------------------- /PRML/Chap1-Introduction/Summary-three-curve-fitting-approaches.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 总结-曲线拟合的三种参数估计方法\n", 8 | "\n", 9 | "## 1.MLE\n", 10 | "假设目标变量服从一个均值为$y(x,\\mathbf{w})$,方差为$\\beta^{-1}$的高斯分布\n", 11 | "$$ p(t|x,\\mathbf{w},\\beta)=\\mathcal{N}(t|y(x,\\mathbf{w}),\\beta^{-1})=\\sqrt{\\frac{\\beta}{2\\pi}}\\exp(-\\frac{\\beta}{2}(t-y(x,\\mathbf{w}))^2)$$\n", 12 | "这里参数为$\\mathbf{w},\\beta$\n", 13 | "MLE是一种Frequentist方法,它认为参数是一个未知的定值,通过最大化样本的似然函数找到参数的估计。这种方法要工作,我们需要事先假定样本是独立同分布(i.i.d.)产生的。在独立同分布假设下,似然函数写为\n", 14 | "$$ p(\\mathbf{t}|\\mathbf{x},\\mathbf{w},\\beta)=\\prod_{i=1}^N \\mathcal{N}(t_i|y(x_i,\\mathbf{w}),\\beta^{-1})$$\n", 15 | "我们可以认为似然函数是关于参数的函数(因为数据已经给定了),同时我们取一个对数(其好处是防止最终概率值下溢):\n", 16 | "$$ \\mathcal{L}(\\mathbf{w},\\beta)=\\sum_{i=1}^N \\bigg\\{\\frac{1}{2}ln\\,(\\beta)-\\frac{1}{2}ln\\,(2\\pi)-\\frac{\\beta}{2}(t_i-y(x_i,\\mathbf{w}))^2\\bigg\\}=\\frac{N}{2}ln\\,(\\beta)-\\frac{\\beta}{2}\\sum_{i=1}^N(t_i-y(x_i,\\mathbf{w}))^2+const$$\n", 17 | "我们先优化$\\mathbf{w}$,此时可以将$\\beta$视作参数,于是我们的目标变为:\n", 18 | "$$\\mathbf{w}_{ML}=\\arg\\max_{\\mathbf{w}}-\\sum_{i=1}^N(t_i-y(x_i,\\mathbf{w}))^2=\\arg\\min_{\\mathbf{w}}\\sum_{i=1}^N(t_i-y(x_i,\\mathbf{w}))^2$$\n", 19 | "于是我们知道,求解$\\mathbf{w}_{ML}$等价于最小化误差的平方和。求得$\\mathbf{w}$的最大似然估计后我们带回到似然函数,并对$\\beta$求导令导数为0,得到\n", 20 | "$$ \\beta_{ML}^{-1}=\\frac{1}{N}\\sum_{i=1}^N(t_i-y(x_i,\\mathbf{w}_{ML}))^2$$\n", 21 | "这里其实有个地方我不太明白,就是为什么可以先优化$\\mathbf{w}$再代入到$\\beta$的最大似然解中,还需要多看一些优化的书。\n", 
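A minimal NumPy sketch of the MLE procedure above: fitting a polynomial to toy data by least squares yields $\mathbf{w}_{ML}$, and the mean squared residual then gives $\beta_{ML}^{-1}$. The toy target function, the polynomial degree $M=3$, and the noise level are arbitrary choices for illustration.

```python
# Sketch: maximum-likelihood curve fitting (least squares) plus noise-variance estimate.
import numpy as np

np.random.seed(0)
N, M = 20, 3                                        # sample size and polynomial degree (arbitrary)
x = np.random.rand(N)
t = np.sin(2 * np.pi * x) + np.random.normal(scale=0.2, size=N)   # toy targets

w_ml = np.polyfit(x, t, M)                          # w_ML minimizes sum_i (t_i - y(x_i, w))^2
residuals = t - np.polyval(w_ml, x)
beta_ml_inv = np.mean(residuals ** 2)               # beta_ML^{-1} = (1/N) sum_i (t_i - y(x_i, w_ML))^2

print("w_ML = %s" % w_ml)
print("1/beta_ML = %.4f" % beta_ml_inv)
```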
22 | "## 2.MAP\n", 23 | "MLE方法的一个缺点是容易over-fitting。举个例子来说,如果样本中存在一个outlier,为了拟合这个outlier,我们需要对$\\mathbf{w}_{ML}$作出很大的调整,从而失去了刻画一般数据的能力。而Bayesian方法通过给参数加先验避免了过拟合。\n", 24 | "我们假设$\\mathbf{w}$服从一个高斯分布:\n", 25 | "$$p(\\mathbf{w}|\\alpha)=\\mathcal{N}(\\mathbf{w}|\\mathbf{0},\\alpha^{-1}\\boldsymbol{I})=\\Big(\\frac{\\alpha}{2\\pi}\\Big)^{(M+1)/2}\\exp(-\\frac{\\alpha}{2}\\mathbf{w}^T\\mathbf{w})$$\n", 26 | "\n", 27 | "$\\alpha$称为超参数(hyperparameter),我们假定$\\alpha$和$\\beta$已知,根据贝叶斯公式有\n", 28 | "$$p(\\mathbf{w}|\\mathbf{x},\\mathbf{t},\\alpha,\\beta)\\propto p(\\mathbf{w}|\\alpha)p(\\mathbf{t}|\\mathbf{x},\\mathbf{w},\\beta)$$\n", 29 | "我们通过最大化后验概率$p(\\mathbf{w}|\\mathbf{x},\\mathbf{t},\\alpha,\\beta)$找到$\\mathbf{w}$的估计,这种方法称为MAP。$\\mathbf{w}_{MAP}$由最小化如下目标函数得到:\n", 30 | "$$ \\frac{\\beta}{2}\\sum_{i=1}^N(t_i-y(x_i,\\mathbf{w}))^2+\\frac{\\alpha}{2}\\mathbf{w}^T\\mathbf{w}$$\n", 31 | "通过观察可以发现,MAP等价于在MLE的目标函数后加了一个L2正则项。\n", 32 | "\n", 33 | "## 3.full Bayesian\n", 34 | "\n", 35 | "关于式1.68的理解\n", 36 | "在贝叶斯中,数据是已知的,只有参数w是未知的。因此式(1.68)中$x,\\mathbf{x},\\mathbf{t}$都是已知的,为了直观,我们可以把已知的都去掉,于是式1.68变为\n", 37 | "$$p(t)=\\int p(t|\\mathbf{w})p(\\mathbf{w}) d\\mathbf{w}=\\int p(t,\\mathbf{w})d\\mathbf{w}$$\n", 38 | "这就很好理解了,就是对$\\mathbf{w}$做marginalization(运用概率论的乘法公式和加法公式(连续的情况下求和变为积分))\n", 39 | "如果带上数据,推导是这样的:\n", 40 | "$$\\int p(t|x,\\mathbf{w})p(\\mathbf{w}|\\mathbf{x},\\mathbf{t})d\\mathbf{w}=\\int p(t,\\mathbf{w}|x,\\mathbf{x},\\mathbf{t})d\\mathbf{w}=p(t|x,\\mathbf{x},\\mathbf{t})$$\n", 41 | "关于后验预测分布是高斯分布的证明,参考我的另一篇笔记。\n", 42 | "\n", 43 | "总结一下各个方法。MLE和MAP都属于点估计,MLE在数据量少的场合下容易过拟合,而MAP通过对参数引入先验分布避免过拟合,等价于MLE的目标函数加正则化,如果假设参数是正态分布则对应L2正则(岭回归),如果假设参数服从拉普拉斯分布则对应L1正则(lasso);full Bayesian认为点估计不准确,在MAP的基础上考虑了w的分布,通过对参数做marginalization概括所有的w,进一步避免过拟合\n", 44 | "\n", 45 | "参考: \n", 46 | "1.http://ask.julyedu.com/question/150 \n", 47 | "2.PRML勘误 \n", 48 | "3.MLAPP 7.6 Bayesian linear regression" 49 | ] 50 | } 51 | ], 52 | "metadata": { 53 | "kernelspec": { 54 | "display_name": "Python 2", 55 | "language": "python", 56 | "name": "python2" 57 | }, 58 | "language_info": { 59 | "codemirror_mode": { 60 | "name": "ipython", 61 | "version": 2 62 | }, 63 | "file_extension": ".py", 64 | "mimetype": "text/x-python", 65 | "name": "python", 66 | "nbconvert_exporter": "python", 67 | "pygments_lexer": "ipython2", 68 | "version": "2.7.12" 69 | } 70 | }, 71 | "nbformat": 4, 72 | "nbformat_minor": 0 73 | } 74 | -------------------------------------------------------------------------------- /ML-Foundation/lecture-1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 第一讲 学习问题\n", 8 | "\n", 9 | "这门课的授课老师是个台湾人,师从Caltech的Yaser S. Abu-Mostafa,他们共同编撰了《Learning From Data》这本书。Yaser S. 
Abu-Mostafa在edx上也开设了机器学习的公开课,不过说实话,他的埃及口音英语实在很难听懂,而且讲的内容偏重理论,所以追了几节课就放弃了。这次他的学生带来了coursera的机器学习基石这门公开课,讲的内容和Yaser的公开课差不多,而且是中文授课(ppt是英文),这对于华语世界的学生来说是个福音。未来几周,我将把这门课的听课收获记录下来,以便备忘。 \n", 10 | "\n", 11 | "\n", 12 | "## 什么是机器学习?\n", 13 | "\n", 14 | "机器学习是一门能够让计算机从数据中学习得到经验,以改善在特定任务上表现的新兴技术。机器学习模仿了人类的学习过程,人类的学习来源于观察,比如我们学说话,是通过聆听、模仿大人的话语;我们区分两个不同的物体,是通过视觉上的观察、比对,经思考、归纳总结出结论的。机器学习也是一样,不过由于机器不能主动观察获得数据,所以需要人类把资料“喂”给它。接着,机器学习算法通过求解一个最优化问题,来揭示数据背后蕴藏的规律。\n", 15 | "\n", 16 | "## 机器学习能够应用的场合:\n", 17 | "\n", 18 | "为了能使机器学习能够work, 以下三个要素缺一不可:\n", 19 | "\n", 20 | "1.首先要有数据\n", 21 | "\n", 22 | "数据的作用是训练机器学习模型,调整其参数,以逼近真实的目标函数。当有人登门拜访,期望你用机器学习解决他所提出的问题时,你首先应该问的是:你有什么样的数据?如果有,那么我们可以继续合作;如果没有,那就爱莫能助了。\n", 23 | "\n", 24 | "2.数据中有特定的规律可循\n", 25 | "\n", 26 | "数据背后必须含有潜在的规律或模式,如果数据杂乱无章没有规律可循,那么数据就是垃圾数据。\n", 27 | "\n", 28 | "3.数据中的规律无法给出明确的数学表达式定义\n", 29 | "\n", 30 | "现实生活中许多复杂的决策问题是无法简单地用明确的数学形式刻画的,如果能够明确定义,也就不需要机器学习了,直接写个程序就可以。\n", 31 | "\n", 32 | " \n", 33 | " \n", 34 | "我们可以通过以下问题来加深理解,**下面哪个问题可以应用机器学习?**\n", 35 | "\n", 36 | "1.预测一个婴儿下一次哭是在奇数分钟还是偶数分钟?\n", 37 | "\n", 38 | "2.寻找一张图上两个顶点间的最短路径\n", 39 | "\n", 40 | "3.预测未来十年地球是否会因为核战争而毁灭?\n", 41 | "\n", 42 | "4.预测股市明天的走势\n", 43 | "\n", 44 | "\n", 45 | "答案是:4,原因如下:\n", 46 | "\n", 47 | "问题1,婴儿下一次哭在什么时候是无规律可循的\n", 48 | "\n", 49 | "问题2,寻找图上两顶点间最短路径可以显式地编程给出\n", 50 | "\n", 51 | "问题3,预测地球未来十年是否会毁灭,我们没有之前的历史数据,所以无法进行训练\n", 52 | "\n", 53 | "问题4,是可以应用机器学习的,因为股价的波动遵循一定的规律;而且股价的波动与许多因素有关,我们无法用一个明确的数学公式给出这个规律;此外,我们可以利用过去十年的股市行情数据进行学习,综上,答案是4\n", 54 | "\n" 55 | ] 56 | }, 57 | { 58 | "cell_type": "markdown", 59 | "metadata": {}, 60 | "source": [ 61 | "## 机器学习的模型\n", 62 | "\n", 63 | "图中包含了机器学习中的一些基本概念:\n", 64 | "\n", 65 | "\n", 66 | "1.**目标函数(Target functuion)**:我们要学习得到的目标函数(记为$f:\\mathcal{X}\\to \\mathcal{Y}$),是观测值$\\mathcal{X}$到真实值$\\mathcal{Y}$的一个映射关系。通常这个函数是无法用数学形式明确表示的。\n", 67 | "\n", 68 | "2.**训练数据(Traning examples)**:由观察值和真实值所组成的一个集合,系由$f$产生的。\n", 69 | "\n", 70 | "3.**假设集(Hypothesis set)**:虽然我们事先不知道目标函数,但是我们可以猜想,比如我们认为某个假设可以用于预测,那么就把该假设加入假设集,当然假设不止一种,所有候选的假设构成了整个假设集。假设集可以是线性超平面(感知器、支持向量机),任意的非线性函数(神经网络),各种规则的集合(决策树)等。\n", 71 | "\n", 72 | "4.**学习算法(Learning algorithm)**:根据数据,利用最优化算法从假设集中找到最符合目标函数$f$的假设$g$。学习算法可以是感知器学习算法(PLA),序列最小最优化算法(SMO),反向传播算法(BP)等。\n", 73 | "\n", 74 | "\n", 75 | "一个机器学习模型的工作流程可以简单描述为:学习算法$\\mathcal{A}$根据训练数据调整模型参数,从假设集$\\mathcal{H}$里找到一个与目标函数$f$最一致的假设$g$。其中,目标函数$f$和训练集$\\mathcal{D}$是我们无法控制的,但学习算法$\\mathcal{A}$和假设集$\\mathcal{H}$是我们可以选择的,这两者决定了学习模型(learning model),如何选择学习模型对于我们的机器学习问题至关重要。用机器学习解决实际问题的过程类似建造一枚火箭,首先要有一个强大的引擎(模型),其次要有优质的燃料(数据)。空有强大的引擎,却没有足够好的燃料,火箭是无法升空的。而更好的引擎可以使得火箭飞的更高。" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "## 机器学习与其他相关领域的关系\n", 83 | "\n", 84 | "### 1.机器学习与数据挖掘\n", 85 | "\n", 86 | "机器学习的目标是根据数据,从假设集中找到一个与目标函数$f$很相近的假设$g$,数据挖掘的目标是从海量的数据中寻找有用的性质。如果要寻找的有用性质与寻找一个和目标函数很相近的假设的目标是一致的话,那么机器学习和数据挖掘没有本质上的区别。 \n", 87 | "\n", 88 | "如果数据挖掘要寻找的有趣的性质与机器学习任务有联系的话,那么二者可以互相帮助。比如数据挖掘找到的性质可以帮助机器学习得到更好的假设;反过来,机器学习也可以用在数据挖掘中来寻找有用的性质。传统的数据挖掘关注于大规模数据库的有效计算,而在现实中我们很难区分机器学习与数据挖掘,因为二者是如此相像而又密不可分。\n", 89 | "\n", 90 | "### 2.机器学习与人工智能\n", 91 | "\n", 92 | "人工智能的目标是让电脑具有某种方面上的智能,比如电脑学会下棋。机器学习是实现人工智能的方法,但实现人工智能的方法不局限与机器学习。\n", 93 | "\n", 94 | "### 3.机器学习与统计学\n", 95 | "\n", 96 | "统计学的目标:使用数据来对未知的过程作出推论。机器学习中,$g$是我们要得出的推论,$f$是未知的过程,从这个角度上看,我们可以结合统计学的知识来实现机器学习。统计学为机器学习提供了很多有用的工具。" 97 | ] 98 | } 99 | ], 100 | "metadata": { 101 | "kernelspec": { 102 | "display_name": "Python 2", 103 | "language": "python", 104 | "name": "python2" 105 | }, 106 | "language_info": { 107 | "codemirror_mode": { 108 | "name": "ipython", 109 
| "version": 2 110 | }, 111 | "file_extension": ".py", 112 | "mimetype": "text/x-python", 113 | "name": "python", 114 | "nbconvert_exporter": "python", 115 | "pygments_lexer": "ipython2", 116 | "version": "2.7.13" 117 | } 118 | }, 119 | "nbformat": 4, 120 | "nbformat_minor": 2 121 | } 122 | -------------------------------------------------------------------------------- /CS229/RL1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 强化学习(1)\n", 8 | "## 强化学习简介\n", 9 | "监督学习在机器学习中取得了重大的成功,然而在顺序决策制定和控制问题中,比如无人直升机、无人汽车等,难以给出显式的监督信息,因此这类问题中监督模型无法学习。\n", 10 | "强化学习就是为了解决这类问题而产生的。在强化学习框架中,学习算法被称为一个agent,假设这个agent处于一个环境中,两者之间存在交互。agent通过与环境交互不断增强对环境的适应力,故得名强化学习。\n", 11 | "这个过程可以用下图来理解:\n", 12 | "![RL](http://7xikew.com1.z0.glb.clouddn.com/42741-29805054c02be5ea.png)\n", 13 | "在每个时间步$t$,agent:\n", 14 | "* 接受状态$s_t$\n", 15 | "* 接受标量回报$r_t$\n", 16 | "* 执行行动$a_t$\n", 17 | "\n", 18 | "环境:\n", 19 | "* 接受动作$a_t$\n", 20 | "* 产生状态$s_t$\n", 21 | "* 产生标量回报$r_t$\n", 22 | "\n", 23 | "\n", 24 | "## MDP\n", 25 | "通常我们都是从MDP(马尔科夫决策过程)来了解强化学习的。MDP问题中,我们有一个五元组:$(S,A,P,\\gamma,R)$\n", 26 | "* $S$:状态集,由agent所有可能的状态组成\n", 27 | "* $A$:动作集,由agent所有可能的行动构成\n", 28 | "* $P(s,a,s')$:转移概率分布,表示状态$s$下执行动作$a$后下个时刻状态的概率分布\n", 29 | "* $\\gamma$:折扣因子,$0\\leq \\gamma\\leq 1$,表示未来回报相对于当前回报的重要程度。如果$\\gamma=0$,表示只重视当前立即回报;$\\gamma=1$表示将未来回报视为与当前回报同等重要。\n", 30 | "* $R(s,a,s')$:标量立即回报函数。执行动作$a$,导致状态$s$转移到$s'$产生的回报。可以是关于状态-动作的函数$S\\times A\\to \\mathbb{R}$,也可以是只关于状态的函数$S\\to \\mathbb{R}$。记$t$时刻的回报为$r_t$,为了后续表述方便,假设我们感兴趣的问题中回报函数只取决于状态,而状态-动作函数可以很容易地推广,这里暂不涉及。\n", 31 | "\n", 32 | "注:这里阐述的MDP称为**discounted MDP**,即带折扣因子的MDP。有些MDP也可以定义为四元组:$(S,A,P,R)$,这是因为这类MDP中使用的值函数不考虑折扣因子。\n", 33 | "\n", 34 | "MDP过程具有马尔科夫性质,即给定当前状态,未来的状态与过去的状态无关。但与马尔科夫链不同的是,MDP还考虑了动作,也就是说MDP中状态的转移不仅和当前状态有关,还依赖于agent采取的动作。举个下棋的例子来说,我们在局面(状态$s_{t-1}$)走了一步(动作$a_{t-1}$),导致棋盘状态变为$s_{t}$,这时对手只需要根据当前棋盘状态$s_t$思考,作出应对$a_t$,导致局面变为$s_{t+1}$,而不需要考虑再之前的棋盘状态$s_{t-i},(i=1,2,...)$。由于对手会采取什么行动我们无法预测,因此$s_{t+1}$是随机的,依赖于当前的局面$s_t$和对手的落子行动$a_t$。\n", 35 | "我们可以通过下面表格了解各种马尔科夫模型的区别:\n", 36 | "\n", 37 | "||不考虑动作|考虑动作|\n", 38 | "|:|:|:|\n", 39 | "|状态可观测|马尔科夫链(MC)|马尔科夫决策过程(MDP)|\n", 40 | "|状态不完全可观测|隐马尔科夫模型(HMM)|不完全可观察马尔科夫决策过程(POMDP)|\n", 41 | "\n" 42 | ] 43 | }, 44 | { 45 | "cell_type": "markdown", 46 | "metadata": {}, 47 | "source": [ 48 | "\n", 49 | "MDP的运行过程:\n", 50 | "$$s_0\\xrightarrow{a_0}s_1\\xrightarrow{a_1}s_2\\xrightarrow{a_2}s_3\\xrightarrow{a_3}\\cdots$$\n", 51 | "我们从初始状态$s_0$出发,执行某个动作$a_0$,根据转移概率分布确定下一个状态$s_1\\sim P_{s_{0}a_{0}}$,接着执行动作$a_1$,再根据$P_{s_{1}a_{1}}$确定$s_2$...。\n", 52 | "\n", 53 | "一个discounted MDP中,我们的目标是最大化一个累积未来折扣回报:\n", 54 | "$$ R_t=\\sum_{k=0}^\\infty \\gamma^k r_{t+k+1}$$\n", 55 | "\n", 56 | "\n", 57 | "具体地,我们希望学得一个策略(policy),通过执行这个策略使上式最大化。策略一般可以表示为一个函数:$\\pi:s\\to a$,它以状态为输入,输出对应的动作。策略函数可以是确定的$\\pi(s)=a$,也可以是不确定的$\\pi(s,a)=p(a|s)$(这时策略函数是一个条件概率分布,表示给定状态$s$下执行下一个动作$a$的概率)。当agent执行一个策略时,每个状态下agent都执行策略指定的动作。\n", 58 | "\n", 59 | "强化学习通常具有**延迟回报**的特点,以下围棋为例,只有在最终决定胜负的那个时刻才有回报(赢棋为1,输棋为-1),而之前的时刻立即回报均为0。这种情况下,$R_t$等于1或-1,这将导致我们很难衡量策略的优劣,因为即使赢了一盘棋,未必能说明策略中每一步都是好棋;同样输了一盘棋也未必能说明每一步都是坏棋。因此我们需要一个目标函数来刻画策略的长期效用。\n", 60 | "\n", 61 | "为此,我们可以为策略定义一个值函数(value function)来综合评估某个策略的好坏。这个函数既可以是只关于状态的值函数$V^{\\pi}(s)$,也可以状态-动作值函数$Q^\\pi(s,a)$。状态值函数评估agent处于某个状态下的长期收益, 动作值函数评估agent在某个状态下执行某个动作的长期收益。\n", 62 | "本文后续都将以状态值函数为例,进行阐述。一般常用的有三种形式:\n", 63 | "1. $V^{\\pi}(s) = E_\\pi[\\sum\\limits_{k=0}^{\\infty}r_{t+k+1}|s_t=s]$\n", 64 | "2. 
$V^{\\pi}(s) = E_\\pi[\\lim\\limits_{k\\to\\infty}\\frac{1}{k}\\sum\\limits_{i=0}^{k}r_{t+i+1}|s_t=s]$\n", 65 | "3. $V^{\\pi}(s) = E_\\pi[\\sum\\limits_{k=0}^{\\infty}\\gamma^k r_{t+k+1}|s_t=s]$\n", 66 | "\n", 67 | "其中$E_\\pi[\\cdot|s_t=s]$表示从状态$s$开始,通过执行策略$\\pi$得到的累积回报的期望。有些情况下,agent和环境的交互是无止境的,比如一些控制问题,这样的问题称为continuing task。还有一种情况是我们可以把交互过程打散成一个个片段式任务(episodic task),每个片段有一个起始态和一个终止态(或称为吸收态,absorbing state),比如下棋。当每个episode结束时,我们对整个过程重启随机设置一个起始态或者从某个随机起始分布采样决定一个起始态。下面的部分我们只考虑片段式任务。\n", 68 | "\n", 69 | "上面三种值函数中,我们一般常用第三种形式,我把它叫做折扣值函数(discounted value function)。下一篇将介绍计算折扣值函数三种方法:动态规划、蒙特卡洛、时间差分。\n", 70 | "\n", 71 | "参考:\n", 72 | "1.[wiki](https://en.wikipedia.org/wiki/Reinforcement_learning) \n", 73 | "2.[强化学习二](http://blog.csdn.net/zz_1215/article/details/44138823) \n", 74 | "3.Reinforcement Learning: An Introduction\n" 75 | ] 76 | } 77 | ], 78 | "metadata": { 79 | "kernelspec": { 80 | "display_name": "Python 2", 81 | "language": "python", 82 | "name": "python2" 83 | }, 84 | "language_info": { 85 | "codemirror_mode": { 86 | "name": "ipython", 87 | "version": 2 88 | }, 89 | "file_extension": ".py", 90 | "mimetype": "text/x-python", 91 | "name": "python", 92 | "nbconvert_exporter": "python", 93 | "pygments_lexer": "ipython2", 94 | "version": "2.7.12" 95 | } 96 | }, 97 | "nbformat": 4, 98 | "nbformat_minor": 0 99 | } 100 | -------------------------------------------------------------------------------- /YidaXu-ML/variational-inference-for-gaussian-distribution.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 用变分推断估计高斯分布的参数\n", 8 | "在上一篇中,我们通过对高斯分布的均值$\\mu|\\tau$和精度 $\\tau$分别指定不同共轭先验:\n", 9 | "$$ x_i|\\mu,\\tau\\stackrel{ i.i.d.}{\\sim}\\mathcal{N}(\\mu,\\tau^{-1})=(\\frac{\\tau}{2\\pi})^{n/2}\\exp(-\\frac{\\tau}{2}\\sum_{i=1}^n(x_i-\\mu)^2) \\\\ \\mu|\\tau \\sim\\mathcal{N}(\\mu_0,(\\lambda_0 \\tau)^{-1})\\propto \\exp(-\\frac{\\lambda_0 \\tau}{2}(\\mu-\\mu_0)^2)\\\\ \\tau\\sim Ga(\\alpha_0,\\beta_0)\\propto \\tau^{\\alpha_0-1}\\exp(-\\beta_0\\tau)$$\n", 10 | "通过一系列计算最终得到了后验分布的解析解:\n", 11 | "$$ \\tau|x\\sim Ga(\\alpha_0+\\frac{n}{2},\\beta_0+\\frac{1}{2}\\sum_{i=1}^n (x_i-\\bar{x})^2+\\frac{n\\lambda_0(\\bar{x}-\\mu_0)^2}{2(n+\\lambda_0)})\\\\\\mu|x,\\tau\\sim\\mathcal{N}(\\frac{n\\bar{x}+\\lambda_0\\mu_0}{n+\\lambda_0},\\big((n+\\lambda_0)\\tau\\big)^{-1})$$\n", 12 | "\n", 13 | "这次我们假装不知道如何估计参数,转而采用变分推断估计$\\mu$和$\\tau$\n", 14 | "变分推断中,我们需要把隐变量进行分解:\n", 15 | "$$q(\\mu,\\tau)=q_\\mu(\\mu)q_\\tau(\\tau)$$\n", 16 | "均值的变分推断的迭代更新式为:\n", 17 | "$$\\begin{aligned}ln\\,(q^*_\\mu(\\mu))&=\\mathbb{E}_{q_\\tau(\\tau)}[ln\\,p(D,\\mu,\\tau)]+const\\\\&=\\mathbb{E}_{q_\\tau(\\tau)}[ln\\,p(D|\\mu,\\tau)+ln\\,p(\\mu|\\tau)+\\underbrace{ln\\,p(\\tau)}_{\\mbox{constant}}]+const\\\\&=\\mathbb{E}_{q_\\tau(\\tau)}[ln\\,p(D|\\mu,\\tau)+ln\\,p(\\mu|\\tau)]+const\\\\&=-\\frac{\\mathbb{E}_{q_\\tau(\\tau)}(\\tau)}{2}\\bigg\\{\\sum_{i=1}^n(x_i-\\mu)^2+\\lambda_0(\\mu-\\mu_0)^2\\bigg\\}+const\\\\&=-\\frac{\\mathbb{E}_{q_\\tau(\\tau)}(\\tau)}{2}\\bigg\\{n(\\mu-\\bar{x})^2+\\sum_{i=1}^n(x_i-\\bar{x})^2+\\lambda_0(\\mu-\\mu_0)^2\\bigg\\}+const\\\\&=-\\frac{\\mathbb{E}_{q_\\tau(\\tau)}(\\tau)}{2}\\bigg\\{n\\mu^2-2n\\bar{x}\\mu+\\lambda_0 
\\mu^2-2\\lambda_0\\mu_0\\mu\\bigg\\}+const\\\\&=-\\frac{\\mathbb{E}_{q_\\tau(\\tau)}(\\tau)}{2}\\bigg\\{(n+\\lambda_0)\\mu^2-2(n\\bar{x}+\\lambda_0\\mu_0)\\mu\\bigg\\}+const\\\\&=-\\frac{\\mathbb{E}_{q_\\tau(\\tau)}(\\tau)}{2}\\bigg\\{(n+\\lambda_0)\\Big[(\\mu-\\frac{n\\bar{x}+\\lambda_0\\mu_0}{n+\\lambda_0})^2+(\\frac{n\\bar{x}+\\lambda_0\\mu_0}{n+\\lambda_0})^2\\Big]\\bigg\\}+const\\\\&=-\\frac{(n+\\lambda_0)\\mathbb{E}_{q_\\tau(\\tau)}(\\tau)}{2}(\\mu-\\frac{n\\bar{x}+\\lambda_0\\mu_0}{n+\\lambda_0})^2+const\\end{aligned}$$\n", 18 | "其中$\\mathbb{E}_{q_\\tau(\\tau)}[ln\\,p(\\tau)]$由于积分把$\\tau$积没了,可以视为常数,放到const里面,所有含有$\\tau$的项可以用相同方法处理。\n", 19 | "因此\n", 20 | "$$ q^*_\\mu(\\mu)\\propto \\exp(-\\frac{(n+\\lambda_0)\\mathbb{E}_{q_\\tau(\\tau)}(\\tau)}{2}(\\mu-\\frac{n\\bar{x}+\\lambda_0\\mu_0}{n+\\lambda_0})^2)\\propto \\mathcal{N}(\\mu|\\frac{n\\bar{x}+\\lambda_0\\mu_0}{n+\\lambda_0},\\big((n+\\lambda_0)\\mathbb{E}_{q_\\tau(\\tau)}(\\tau)\\big)^{-1})$$\n", 21 | "\n", 22 | "精度的迭代更新式为:\n", 23 | "$$\\begin{aligned}ln\\,(q^*_\\tau(\\tau))&=\\mathbb{E}_{q_\\mu(\\mu)}[ln\\,p(D,\\mu,\\tau)]+const\\\\&=\\mathbb{E}_{q_\\mu(\\mu)}[ln\\,p(D|\\mu,\\tau)+ln\\,p(\\mu|\\tau)+ln\\,p(\\tau)]+const\\\\&=\\mathbb{E}_{q_\\mu(\\mu)}[\\frac{n}{2}ln\\,(\\tau)-\\frac{\\tau}{2}\\sum_{i=1}^n(x_i-\\mu)^2-\\frac{\\lambda_0\\tau}{2}(\\mu-\\mu_0)^2+(\\alpha_0-1)ln\\,(\\tau)-\\beta_0\\tau]+const\\\\&=(\\underbrace{\\alpha_0+\\frac{n}{2}}_{\\alpha_n}-1)ln\\,(\\tau)-\\tau\\Big\\{\\underbrace{\\beta_0+\\frac{1}{2}\\mathbb{E}_{q_\\mu(\\mu)}\\bigg(\\sum_{i=1}^n(x_i-\\mu)^2+\\lambda_0(\\mu-\\mu_0)^2\\bigg)}_{\\beta_n}\\Big\\}+const\\end{aligned}$$\n", 24 | "$\\beta_n$可以写为\n", 25 | "$$\\begin{aligned}\\beta_n&=\\beta_0+\\frac{1}{2}\\mathbb{E}_{q_\\mu(\\mu)}\\bigg[\\sum_{i=1}^n(x_i-\\mu)^2+\\lambda_0(\\mu-\\mu_0)^2\\bigg]\\\\&=\\beta_0+\\frac{1}{2}\\bigg\\{\\mathbb{E}_{q_\\mu(\\mu)}\\bigg[n\\mu^2-2n\\bar{x}\\mu+\\lambda_0\\mu^2-2\\lambda_0\\mu_0\\mu\\bigg]+\\sum_{i=1}^n(x_i)^2+\\lambda_0\\mu_0^2\\bigg\\}\\\\&=\\beta_0+\\frac{1}{2}\\bigg\\{(n+\\lambda_0)\\mathbb{E}_{q_\\mu(\\mu)}[\\mu^2]-2(n\\bar{x}+\\lambda_0\\mu_0)\\mathbb{E}_{q_\\mu(\\mu)}[\\mu]+\\sum_{i=1}^n(x_i)^2+\\lambda_0\\mu_0^2\\bigg\\}\\end{aligned}$$\n", 26 | "\n", 27 | "我们用上一轮迭代得到的$q^*_\\mu(\\mu)$计算$\\mathbb{E}_{q_\\mu(\\mu)}[\\mu^2]$和$\\mathbb{E}_{q_\\mu(\\mu)}[\\mu]$,再用这一轮得到的$q^*_\\tau(\\tau)$计算下一轮的$(n+\\lambda_0)\\mathbb{E}_{q_\\tau(\\tau)}(\\tau)$\n", 28 | "\n" 29 | ] 30 | } 31 | ], 32 | "metadata": { 33 | "kernelspec": { 34 | "display_name": "Python 2", 35 | "language": "python", 36 | "name": "python2" 37 | }, 38 | "language_info": { 39 | "codemirror_mode": { 40 | "name": "ipython", 41 | "version": 2 42 | }, 43 | "file_extension": ".py", 44 | "mimetype": "text/x-python", 45 | "name": "python", 46 | "nbconvert_exporter": "python", 47 | "pygments_lexer": "ipython2", 48 | "version": "2.7.12" 49 | } 50 | }, 51 | "nbformat": 4, 52 | "nbformat_minor": 0 53 | } 54 | -------------------------------------------------------------------------------- /Deep-Learning/keras-notes/keras-tips.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# keras使用心得\n", 8 | "\n", 9 | "### 问题一:如何在训练时用F1作为metric输出?\n", 10 | "\n", 11 | "keras官方是不提供F1这个metric的,需要自己实现。\n", 12 | "实现代码如下:" 13 | ] 14 | }, 15 | { 16 | "cell_type": "code", 17 | "execution_count": 1, 18 | "metadata": { 19 | "collapsed": true 20 | }, 21 | "outputs": [], 22 | "source": [ 23 | "def 
f1score(y_true, y_pred):\n", 24 | " num_tp = K.sum(y_true*y_pred)\n", 25 | " num_fn = K.sum(y_true*(1.0-y_pred))\n", 26 | " num_fp = K.sum((1.0-y_true)*y_pred)\n", 27 | " num_tn = K.sum((1.0-y_true)*(1.0-y_pred))\n", 28 | " f1 = 2.0*num_tp/(2.0*num_tp+num_fn+num_fp)\n", 29 | " return f1" 30 | ] 31 | }, 32 | { 33 | "cell_type": "markdown", 34 | "metadata": {}, 35 | "source": [ 36 | "其中`y_true`表示真实标号,`y_pred`表示预测标号(假定标号取值为0或1)。\n", 37 | "接着,在模型编译的时候指定我们要使用的metric即可:" 38 | ] 39 | }, 40 | { 41 | "cell_type": "code", 42 | "execution_count": null, 43 | "metadata": { 44 | "collapsed": true 45 | }, 46 | "outputs": [], 47 | "source": [ 48 | "model.compile(optimizer=..., loss=..., metrics=[f1score])" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "### 问题二:如何在训练模型时同时输出多种loss和metric?(比如同时输出accuracy和f1)\n", 56 | "注意看函数`model.compile()`中的metrics参数,首先它是个复数形式,其次它是个列表,说到这应该明白了吧:) \n", 57 | "我们只要把要输出的多个metric丢进列表就可以了:" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "metadata": { 64 | "collapsed": true 65 | }, 66 | "outputs": [], 67 | "source": [ 68 | "model.compile(optimizer=..., loss=..., metrics=[f1score, 'acc'])" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "### 问题三:如何输出神经网络的模型结构?\n", 76 | "1.安装pydot \n", 77 | "不要用pip安装,以免安装到最新版。最新版的pydot中删除了find_graphviz()函数,与keras不兼容。 \n", 78 | "到这里安装[1.1.0版本](https://github.com/erocarrera/pydot/tree/v1.1.0)\n", 79 | "\n", 80 | "2.安装graphviz并将可执行文件加入环境变量path \n", 81 | "[下载地址](http://www.graphviz.org/Download_windows.php)\n", 82 | " \n", 83 | "3.使用keras调用graphviz进行画图" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": null, 89 | "metadata": { 90 | "collapsed": true 91 | }, 92 | "outputs": [], 93 | "source": [ 94 | "from keras.utils.visualize_util import plot\n", 95 | "plot(model, to_file='model.png')" 96 | ] 97 | }, 98 | { 99 | "cell_type": "markdown", 100 | "metadata": {}, 101 | "source": [ 102 | "### 问题四:`merge`和`Merge`的区别是什么?\n", 103 | "首先,`merge()`是个函数,而`Merge`是个类。两者的相同点都是合并输出,不同的是`merge()`只能用于合并两或多个`tensor`,且输出也是`tensor`;而`Merge`既可以用于合并两或多个`tensor`,也可以用于合并两或多个`layer`。" 104 | ] 105 | }, 106 | { 107 | "cell_type": "markdown", 108 | "metadata": {}, 109 | "source": [ 110 | "### 问题五:如何使用ModelCheckPoint?\n", 111 | "ModelCheckPoint(模型检查点)是keras的一个回调函数,用于在每个epoch后保存模型权重到指定路径,其定义如下:\n", 112 | "```\n", 113 | "keras.callbacks.ModelCheckpoint(filepath, monitor='val_loss', verbose=0, save_best_only=False, mode='auto')\n", 114 | "```\n", 115 | "参数\n", 116 | "\n", 117 | "* filename:字符串,保存模型的路径\n", 118 | "\n", 119 | "* monitor:需要监视的值\n", 120 | "\n", 121 | "* verbose:信息展示模式,0或1\n", 122 | "\n", 123 | "* save_best_only:当设置为True时,将只保存在验证集上性能最好的模型\n", 124 | "\n", 125 | "* mode:‘auto’,‘min’,‘max’之一,在save_best_only=True时决定性能最佳模型的评判准则,例如,当监测值为val_acc时,模式应为max,当检测值为val_loss时,模式应为min。在auto模式下,评价准则由被监测值的名字自动推断。\n", 126 | "\n", 127 | "* save_weights_only:若设置为True,则只保存模型权重,否则将保存整个模型(包括模型结构,配置信息等)\n", 128 | "\n", 129 | "filepath可以是格式化的字符串,里面的占位符将会被epoch值和传入on_epoch_end的logs关键字所填入\n", 130 | "\n", 131 | "例如,filename若为`weights.{epoch:02d}-{val_loss:.2f}}.hdf5`,则会生成对应epoch和验证集loss的多个文件。\n", 132 | "也就是说如果我们想要保存每个epoch上的模型权重的话需要指定文件名为`weights.{epoch:02d}.hdf5`。\n", 133 | "\n", 134 | "我们可以用load_weights函数读取权重,不过注意的是这个函数仅仅是读取权重而已,还需要在读取权重之前要重建模型结构。" 135 | ] 136 | } 137 | ], 138 | "metadata": { 139 | "kernelspec": { 140 | "display_name": "Python 2", 141 | "language": "python", 142 | "name": "python2" 143 | }, 144 | "language_info": { 145 | 
"codemirror_mode": { 146 | "name": "ipython", 147 | "version": 2 148 | }, 149 | "file_extension": ".py", 150 | "mimetype": "text/x-python", 151 | "name": "python", 152 | "nbconvert_exporter": "python", 153 | "pygments_lexer": "ipython2", 154 | "version": "2.7.12" 155 | } 156 | }, 157 | "nbformat": 4, 158 | "nbformat_minor": 0 159 | } 160 | -------------------------------------------------------------------------------- /YidaXu-ML/exponential-family.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## 指数分布族\n", 8 | "具有如下形式的分布族称为指数分布族(exponential family):\n", 9 | "$$ p(x|\\eta)=b(x)\\exp(T(x)^T\\eta-a(\\eta))$$\n", 10 | "其中$\\eta$是分布的自然参数(natural parameter,比如高斯分布的自然参数就是$\\eta=(\\mu,\\sigma^2)^T$),$T(x)$是充分统计量(sufficient statistic,大多数情况$T(x)=x$),$a(\\eta)$是log normalizer,之所以这么叫是因为$a(\\eta)=ln\\ (\\int b(x)\\exp(T(x)^T\\eta)dx)$。\n", 11 | "确定$a,b,T$即确定了一个分布族,而$\\eta$是此分布族的参数。当$\\eta$变化时,我们得到这个分布族里不同的分布。\n", 12 | "指数分布族是个大家族,家族成员有正态分布,指数分布,gamma分布,卡方分布,beta分布,狄利克雷分布,多项式分布,伯努利分布,泊松分布等。\n", 13 | "\n", 14 | "## 多项式分布\n", 15 | "假设天上有n个球以概率$p_1,p_2,...,p_k$落到地上摆放的k个桶里,根据n和k的不同,我们可以将四种分布区分开来\n", 16 | "\n", 17 | "* 多项式分布(multinomial):n个球落到k个桶,参数为$n,p_1,...,p_k$\n", 18 | "* 二项式分布(binomial):n个球落到2个桶,参数为$n,p$\n", 19 | "* 分类分布(categorical):1个球落到k个桶,参数为$p_1,...,p_k$\n", 20 | "* 伯努利分布(bernoulli):1个球落到2个桶,参数为$p$\n", 21 | "\n", 22 | "## 指数族的MLE估计\n", 23 | "使用指数分布族我们可以大大简化最大似然的估计的计算过程\n", 24 | "$$ ln\\ p(\\mathbf{X}|\\eta)=ln\\ \\prod_{i=1}^N p(x_i|\\eta)=ln\\ \\prod_{i=1}^N b(x_i)\\exp\\bigg(\\Big(\\sum_{i=1}^N T(x_i)\\Big)^T\\eta-N a(\\eta)\\bigg)$$\n", 25 | "因为$\\mathbf{X}=\\{x_1,...,x_N\\}$已知,于是$ \\prod_{i=1}^N b(x_i)$可以看作常数我们可以将其省略,于是我们的目标函数变成\n", 26 | "$$ \\eta_{ML}=\\arg\\max_{\\eta}ln\\ \\exp\\bigg(\\Big(\\sum_{i=1}^N T(x_i)\\Big)^T\\eta-N a(\\eta)\\bigg)=\\arg\\max_{\\eta}\\Bigg\\{\\Big(\\sum_{i=1}^N T(x_i)\\Big)^T\\eta-N a(\\eta)\\Bigg\\}$$\n", 27 | "令$\\mathcal{L}=\\Big(\\sum_{i=1}^N T(x_i)\\Big)^T\\eta-N a(\\eta)$,对$\\eta$求梯度\n", 28 | "$$\\nabla{\\mathcal{L}}(\\eta)=\\sum_{i=1}^N T(x_i)-N\\nabla a(\\eta)=0\\to \\nabla a(\\eta)=\\frac{1}{N}\\sum_{i=1}^N T(x_i)$$\n", 29 | "这个结论告诉我们只要令$a(\\eta)$的梯度等于充分统计量的均值并求解这个方程组我们就能得到最大似然估计\n", 30 | "\n", 31 | "举个高斯分布的例子,高斯分布的pdf写为\n", 32 | "$$ p(x|\\mu,\\sigma^2)=\\frac{1}{\\sqrt{2\\pi}\\sigma}\\exp(-\\frac{1}{2\\sigma^2}(x-\\mu)^2)$$\n", 33 | "我们要将这个pdf $p(x|\\theta)$写为指数家族的形式$p(x|\\eta)$,由于高斯分布有两个参数$\\theta=(\\mu,\\sigma)^T$,因此在表示为指数分布族时也对应有两个自然参数$\\eta=(\\eta_1,\\eta_2)^T$,所不同的是需要在这两组参数之间作一个变换,那么接下来我们来看这个变换怎么得到。\n", 34 | "$$\\frac{1}{\\sqrt{2\\pi}\\sigma}\\exp(-\\frac{1}{2\\sigma^2}(x-\\mu)^2)=\\exp\\big(-\\frac{1}{2\\sigma^2}x^2+\\frac{\\mu}{\\sigma^2}x-\\frac{\\mu^2}{2\\sigma^2}-\\frac{1}{2}ln\\ (2\\pi\\sigma^2)\\big)\\\\=\\exp(-\\frac{1}{2}ln\\ (2\\pi))\\exp\\big(-\\frac{1}{2\\sigma^2}x^2+\\frac{\\mu}{\\sigma^2}x-\\frac{\\mu^2}{2\\sigma^2}-\\frac{1}{2}ln\\ \\sigma^2\\big)$$\n", 35 | "通过类比可得\n", 36 | "$$T(x)=(x,x^2)^T,\\eta=(\\eta_1,\\eta_2)^T=(\\frac{\\mu}{\\sigma^2},-\\frac{1}{2\\sigma^2})^T\\\\b(x)=\\exp(-\\frac{1}{2}ln\\ (2\\pi))\\\\a(\\eta)=\\frac{\\mu^2}{2\\sigma^2}+\\frac{1}{2}ln\\ \\sigma^2$$\n", 37 | "接下来我们要将$(\\mu,\\sigma)$用$(\\eta_1,\\eta_2)$表示,通过解方程\n", 38 | "$$ \\left\\{\\begin{aligned}&\\eta_1=\\frac{\\mu}{\\sigma^2}\\\\&\\eta_2=-\\frac{1}{2\\sigma^2}\\end{aligned}\\right.$$\n", 39 | "得到\n", 40 | 
"$$\\left\\{\\begin{aligned}&\\mu=-\\frac{\\eta_1}{2\\eta_2}\\\\&\\sigma^2=-\\frac{1}{2\\eta_2}\\end{aligned}\\right.$$\n", 41 | "于是\n", 42 | "$$a(\\eta)=-\\frac{\\eta_1^2}{4\\eta_2}-\\frac{1}{2}ln\\ (-2\\eta_2)$$\n", 43 | "其梯度为\n", 44 | "$$\\nabla a(\\eta)=\\begin{bmatrix}&-\\frac{\\eta_1}{2\\eta_2}\\\\&\\frac{\\eta_1^2}{4\\eta_2^2}-\\frac{1}{2\\eta_2}\\end{bmatrix}=\\begin{bmatrix}&\\mu\\\\&\\mu^2+\\sigma^2\\end{bmatrix}=\\begin{bmatrix}&\\frac{1}{N}\\sum_{i=1}^N x_i\\\\&\\frac{1}{N}\\sum_{i=1}^N x_i^2\\end{bmatrix}$$\n", 45 | "得到\n", 46 | "$$\\mu_{ML}=\\frac{1}{N}\\sum_{i=1}^N x_i\\\\\\sigma^2_{ML}=\\frac{1}{N}\\sum_{i=1}^N x_i^2-\\mu_{ML}^2=\\frac{1}{N}\\sum_{i=1}^N x_i^2-(\\frac{1}{N}\\sum_{i=1}^N x_i)^2=\\frac{1}{N}\\sum_{i=1}^N x_i^2-2(\\frac{1}{N}\\sum_{i=1}^N x_i)^2+(\\frac{1}{N}\\sum_{i=1}^N x_i)^2\\\\=\\frac{1}{N}\\sum_{i=1}^N x_i^2-2\\frac{1}{N}\\sum_{i=1}^Nx_i \\bar{x}+\\frac{1}{N}N\\bar{x}^2\\\\=\\frac{1}{N}\\sum_{i=1}^N(x_i^2-2 x_i\\bar{x}+\\bar{x}^2)=\\frac{1}{N}\\sum_{i=1}^N(x_i-\\bar{x})^2$$\n", 47 | "\n", 48 | "## log-normalizer的性质\n", 49 | "可以证明以下结论:\n", 50 | "$$ \\frac{\\partial a(\\eta)}{\\partial \\eta}=E_{p(x|\\eta)}[T(x)]$$\n", 51 | "证明过程如下:\n", 52 | "首先我们将log-normalizer表示出来:\n", 53 | "$$ a(\\eta)=ln\\ (\\int b(x)\\exp(T(x)^T\\eta)dx)$$\n", 54 | "接着对$\\eta$求导,这里利用莱布尼茨微积分的一个性质:\n", 55 | "$$\\frac{d\\ \\int_x f(x,y)dx}{d y}=\\int_x\\frac{\\partial\\ f(x,y)}{\\partial y}dx$$\n", 56 | "利用这个结论我们有\n", 57 | "$$ \\frac{da(\\eta)}{d \\eta}=\\int_x\\frac{\\partial\\ a(\\eta)}{\\partial \\eta}dx$$" 58 | ] 59 | } 60 | ], 61 | "metadata": { 62 | "kernelspec": { 63 | "display_name": "Python 2", 64 | "language": "python", 65 | "name": "python2" 66 | }, 67 | "language_info": { 68 | "codemirror_mode": { 69 | "name": "ipython", 70 | "version": 2 71 | }, 72 | "file_extension": ".py", 73 | "mimetype": "text/x-python", 74 | "name": "python", 75 | "nbconvert_exporter": "python", 76 | "pygments_lexer": "ipython2", 77 | "version": "2.7.12" 78 | } 79 | }, 80 | "nbformat": 4, 81 | "nbformat_minor": 0 82 | } 83 | -------------------------------------------------------------------------------- /YidaXu-ML/EM-review.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# EM简要回顾\n", 8 | "## 作者:hschen0712\n", 9 | "\n", 10 | "EM算法主要用于求解包含隐变量模型的最大化似然估计问题,EM框架中我们通常将$ln\\ p(\\mathbf{X}|\\theta)$称为不完全对数似然(incomplete log data likelihood),而$ln\\ p(\\mathbf{X},\\mathbf{Z}|\\theta)$称为完全对数似然(complete log data likelihood ),EM算法的目标是通过最大化完全对数似然找到参数的合理估计。然而由于我们对隐变量一无所知,因此没办法直接最大化完全对数似然,替代方案是最大化完全对数似然的后验数学期望,具体地该算法的参数迭代更新式如下:\n", 11 | "$$\\theta^{(t+1)} = \\arg\\max_{\\theta} \\int ln \\ p(\\mathbf{X},\\mathbf{Z}|\\theta)\\cdot p(\\mathbf{Z}|\\mathbf{X},\\theta^{(t)}) d\\mathbf{Z} =\\arg\\max_{\\theta}\\mathbb{E}_{p(\\mathbf{Z}|\\mathbf{X},\\theta^{(t)})}[ln\\ p(\\mathbf{X},\\mathbf{Z}|\\theta)]$$\n", 12 | "此处以高斯混合模型(GMM)为例,设$\\mathbf{X}=\\{x_1,x_2,...,x_n\\}$,$\\mathbf{Z}=\\{z_1,...,z_n\\}$($Z$不一定要与$X$一一对应)。可以证明,**最大化完全对数似然关于后验分布$p(\\mathbf{Z}|\\mathbf{X},\\theta^{(t)})$的期望等价于最大化不完全对数似然**。具体证明过程如下,首先我们可以应用贝叶斯公式把不完全数据似然函数写成\n", 13 | "$$p(\\mathbf{X}|\\theta)=\\frac{p(\\mathbf{X},\\mathbf{Z}|\\theta)}{p(\\mathbf{Z}|\\mathbf{X},\\theta)}$$\n", 14 | "两边取对数,有\n", 15 | "$$ln\\ p(\\mathbf{X}|\\theta)=ln\\ p(\\mathbf{X},\\mathbf{Z}|\\theta) - ln\\ p(\\mathbf{Z}|\\mathbf{X},\\theta)$$\n", 16 | "两边关于分布$p(\\mathbf{Z}|\\mathbf{X},\\theta^{(t)})$求数学期望,有\n", 17 | "$$ ln\\ 
p(\\mathbf{X}|\\theta)=\\int ln\\ p(\\mathbf{X},\\mathbf{Z}|\\theta)p(\\mathbf{Z}|\\mathbf{X},\\theta^{(t)})d\\mathbf{Z}-\\int ln\\ p(\\mathbf{Z}|\\mathbf{X},\\theta) p(\\mathbf{Z}|\\mathbf{X},\\theta^{(t)})d\\mathbf{Z}$$\n", 18 | "我们要证明$ln\\ p(\\mathbf{X}|\\theta^{(t+1)})\\geq ln\\ p(\\mathbf{X}|\\theta^{(t)})$(即经过一轮迭代不完全对数似然必定上升),因此作一个差值\n", 19 | "$$ ln\\ p(\\mathbf{X}|\\theta^{(t+1)})- ln\\ p(\\mathbf{X}|\\theta^{(t)})=(\\int ln\\ p(\\mathbf{X},\\mathbf{Z}|\\theta^{(t+1)})p(\\mathbf{Z}|\\mathbf{X},\\theta^{(t)}) d\\mathbf{Z}-\\int ln\\ p(\\mathbf{X},\\mathbf{Z}|\\theta^{(t)})p(\\mathbf{Z}|\\mathbf{X},\\theta^{(t)}) d\\mathbf{Z})-\\int ln\\ \\frac{p(\\mathbf{Z}|\\mathbf{X},\\theta^{(t+1)})}{p(\\mathbf{Z}|\\mathbf{X},\\theta^{(t)})}p(\\mathbf{Z}|\\mathbf{X},\\theta^{(t)}) d\\mathbf{Z}$$\n", 20 | "注意到$-\\int ln\\ \\frac{p(\\mathbf{Z}|\\mathbf{X},\\theta^{(t+1)})}{p(\\mathbf{Z}|\\mathbf{X},\\theta^{(t)})}p(\\mathbf{Z}|\\mathbf{X},\\theta^{(t)}) d\\mathbf{Z}$实际上是KL散度,我们可以用琴生不等式证明它一定是非负的。那么\n", 21 | "$$ln\\ p(\\mathbf{X}|\\theta^{(t+1)})- ln\\ p(\\mathbf{X}|\\theta^{(t)})\\geq \\int ln\\ p(\\mathbf{X},\\mathbf{Z}|\\theta^{(t+1)})p(\\mathbf{Z}|\\mathbf{X},\\theta^{(t)}) d\\mathbf{Z}-\\int ln\\ p(\\mathbf{X},\\mathbf{Z}|\\theta^{(t)})p(\\mathbf{Z}|\\mathbf{X},\\theta^{(t)}) d\\mathbf{Z}$$\n", 22 | "当$\\theta^{(t+1)} = \\arg\\max_{\\theta} \\int ln \\ p(\\mathbf{X},\\mathbf{Z}|\\theta)\\cdot p(\\mathbf{Z}|\\mathbf{X},\\theta^{(t)}) d\\mathbf{Z}$时这个差值就会大于等于0,于是每一轮迭代都会导致不完全对数似然增加,直到收敛。\n", 23 | "概括一下EM算法的步骤:\n", 24 | "首先我们给参数设定一个初值$\\theta^{(0)}$,接着在以下两部间循环直至目标函数收敛\n", 25 | "1)expectation:计算后验分布$p(\\mathbf{Z}|\\mathbf{X},\\theta^{(t)})$和期望$\\mathbb{E}_{p(\\mathbf{Z}|\\mathbf{X},\\theta^{(t)})}[ln\\ p(\\mathbf{X},\\mathbf{Z}|\\theta)]$,这个期望是个关于$\\theta$的函数\n", 26 | "2)maximization:令$f(\\theta)=\\mathbb{E}_{p(\\mathbf{Z}|\\mathbf{X},\\theta^{(t)})}[ln\\ p(\\mathbf{X},\\mathbf{Z}|\\theta)]$,寻找使$f(\\theta)$最大的$\\theta^{(t+1)}$\n", 27 | "下面我们结合GMM叙述EM算法的主要步骤,首先我们假设$(x_i,z_i)$之间相互独立,则完全数据似然的计算方法为\n", 28 | "$$p(\\mathbf{X},\\mathbf{Z}|\\theta)=\\prod_{i=1}^n p(x_i,z_i|\\theta)=\\prod_{i=1}^n p(x_i|z_i,\\theta)p(z_i|\\theta)=\\prod_{i=1}^n \\alpha_{z_i} \\mathcal{N}(x_i|\\mu_{z_i},\\Sigma_{z_i})$$\n", 29 | "对数完全数据似然\n", 30 | "$$ ln\\ p(\\mathbf{X},\\mathbf{Z}|\\theta)=\\sum_{i=1}^n ln\\ (\\alpha_{z_i} \\mathcal{N}(x_i|\\mu_{z_i},\\Sigma_{z_i}))$$\n", 31 | "同时我们假设$z_i|x_i$也相互独立,于是$p(\\mathbf{Z}|\\mathbf{X},\\theta^{(t)})$可以写成\n", 32 | "$$ p(\\mathbf{Z}|\\mathbf{X},\\theta^{(t)})=\\prod_{i=1}^n p(z_i|x_i,\\theta^{(t)})=\\prod_{i=1}^n\\frac{p(x_i|z_i,\\theta^{(t)})p(z_i|\\theta^{(t)})}{\\sum_{z_i=1}^K p(x_i|z_i,\\theta^{(t)})p(z_i|\\theta^{(t)})}= \\prod_{i=1}^n\\frac{\\alpha^{(t)}_{z_i} \\mathcal{N}(x_i|\\mu^{(t)}_{z_i},\\Sigma^{(t)}_{z_i})}{\\sum_{l=1}^K \\alpha^{(t)}_{l} \\mathcal{N}(x_i|\\mu^{(t)}_{l},\\Sigma^{(t)}_{l})}$$\n", 33 | "\n", 34 | "\n", 35 | "# EM和VI区别以及联系\n", 36 | "## 区别\n", 37 | "EM认为完全似然关于后验概率的期望是可以计算的,并通过期望最大化更新这个参数,EM算法是得到的估计是点估计;VI认为后验不可计算,因此需要估计,VI的目的是求得一个隐变量的分布,因此是区间估计。\n", 38 | "EM假设分布的形式已知(可以将其参数化),从而能够将参数与隐变量进行分离,通过最大似然法估计参数;而VI则认为参数也是隐变量的一部分,于是我们不知道分布的样子,只能通过简单的分布估计推得。\n", 39 | "## 联系\n", 40 | "两者都将分布估计视为优化问题,两者的优化过程都使用到了坐标上升的思想。\n", 41 | "# Variational EM\n", 42 | "待续...\n" 43 | ] 44 | } 45 | ], 46 | "metadata": { 47 | "kernelspec": { 48 | "display_name": "Python 2", 49 | "language": "python", 50 | "name": "python2" 51 | }, 52 | "language_info": { 53 | "codemirror_mode": { 54 | "name": "ipython", 55 | "version": 2 56 | }, 57 | "file_extension": ".py", 58 | "mimetype": 
"text/x-python", 59 | "name": "python", 60 | "nbconvert_exporter": "python", 61 | "pygments_lexer": "ipython2", 62 | "version": "2.7.12" 63 | } 64 | }, 65 | "nbformat": 4, 66 | "nbformat_minor": 0 67 | } 68 | -------------------------------------------------------------------------------- /YidaXu-ML/variational-inference.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 变分推断\n", 8 | "\n", 9 | "变分推断的主要思想是用简单分布区估计复杂的分布。\n", 10 | "\n", 11 | "假设我们想要估计分布$p(X)$,我们在模型中引入隐变量$Z$,根据贝叶斯公式我们有\n", 12 | "$$ ln\\,p(X)=ln\\,p(X,Z)-ln\\,p(Z|X)$$\n", 13 | "通过引入一个辅助分布$q(Z)$,上式可以改写为\n", 14 | "$$ ln\\,p(X)=ln\\,\\frac{p(X,Z)}{q(Z)}-ln\\,\\frac{p(Z|X)}{q(Z)}=ln\\,p(X,Z)-ln\\,q(Z)-ln\\,\\frac{p(Z|X)}{q(Z)}$$\n", 15 | "对等式两边关于$q(Z)$求期望,我们有:\n", 16 | "$$ \\begin{aligned}ln\\,p(X)=\\int_z ln\\,p(X) q(Z) dz&=\\int_z ln\\,\\frac{p(X,Z)}{q(Z)} q(Z)dz-\\int_z ln\\,\\frac{p(Z|X)}{q(Z)}q(Z)dz\\\\&=\\color{blue}{\\underbrace{\\int_z ln\\,p(X,Z)q(Z)dz-\\int_z ln\\,q(Z) q(Z)dz}_{\\mbox{Evidence Lower Bound}}}+\\color{red}{\\underbrace{KL(q(Z)||p(Z|X))}_{\\mbox{Kullback–Leibler divergence}}}\\\\&=\\mathcal{L}(q)+KL(q||p)\\end{aligned}$$\n", 17 | "\n", 18 | "其中第一项叫作证据下界(Evidence Lower Bound,ELOB),它是关于$q(Z)$的泛函;第二项称为Kullback–Leibler散度,它估计了分布$q(Z)$和$p(Z|X)$的相似度。注意到KL divergence总是非负的,因此我们知道$\\mathcal{L}(q)\\leq ln p(X)$。当 $q(Z)=p(Z|X)$时,我们有 $KL(q||p)=0$,此时ELOB达到它的最大值,正是$p(X)$的log-likelihood。\n" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "我们的目标是在给定数据$X=\\{X_1,...,X_n\\}$的情况下,使用一个变分分布$q(Z)$去近似隐变量$Z=\\{Z_1,...,Z_n\\}$的后验分布:\n", 26 | "$$ q(Z)\\approx p(Z|X)$$\n", 27 | "后验概率分布 $p(Z|X)$可能非常复杂,因为给定$X$的情况下,$Z_i$彼此之间通常不是独立的。为了使得近似过程尽量可行,我们假设$Z$可以分解为$M$个互不相交的组(或者说有 $M$个独立的成分):\n", 28 | "$$ q(Z)=\\prod_{i=1}^M q_i(Z_i)$$\n", 29 | "把这个选择代入ELOB,我们有\n", 30 | "$$ \\underbrace{\\int_z \\prod_{i=1}^M q_i(Z_i)ln\\,p(X,Z)dz}_{part 1}-\\underbrace{\\int_z \\prod_{i=1}^M q_i(Z_i) \\sum_{i=1}^M ln\\,q_i(Z_i) dz}_{part 2}$$\n", 31 | "我们把第一项称为`part 1`,把第二项称为`part 2`,接下来我们分别研究这两项。\n", 32 | "### part 1\n", 33 | "$$ (part 1)=\\int_{Z_1}...\\int_{Z_M} \\prod_{i=1}^M q_i(Z_i)ln\\,p(X,Z)dZ_1, ... 
,dZ_M $$\n", 34 | "我们把$q_j(Z_j)$从积分中抽取出来并重新排列积分式:\n", 35 | "$$ (part 1)=\\int_{Z_j} q_j(Z_j)\\bigg(\\mathop{\\int\\cdots\\int}_{Z_{i\\neq j}}ln\\,p(X,Z) \\prod_{i\\neq j}^M q_i(Z_i)dZ_i\\bigg)dZ_j=\\int_{Z_j} q_j(Z_j)\\mathbb{E}_{i\\neq j}[ln\\,p(X,Z)]dZ_j$$\n", 36 | "### part2\n", 37 | "$$\\begin{aligned} (part 2)&=\\int_z \\prod_{i=1}^M q_i(Z_i) \\sum_{i=1}^M ln\\,q_i(Z_i) dz\\\\&=\\int_{Z_1}...\\int_{Z_M} \\sum_{i=1}^M ln\\,q_i(Z_i) \\prod_{j=1}^M q_j(Z_j)dZ_j\\\\&=\\int_{Z_1}...\\int_{Z_M} \\sum_{i=1}^M ln\\,q_i(Z_i) (\\prod_{j\\neq i}^M q_j(Z_j)dZ_j)dZ_i\\\\&=\\int_{Z_i}\\sum_{i=1}^M ln\\,q_i(Z_i)\\bigg(\\mathop{\\int\\cdots\\int}_{Z_{j\\neq i}}\\prod_{j\\neq i}^M q_j(Z_j)dZ_j\\bigg)dZ_i\\\\&=\\int_{Z_i}\\sum_{i=1}^M q_i(Z_i) ln\\,(q_i(Z_i) )dZ_i\\\\&=\\sum_{i=1}^M \\int_{Z_i}q_i(Z_i) ln\\,(q_i(Z_i) )dZ_i \\end{aligned}$$\n", 38 | "对于特定的 $q_j(Z_j)$,和的剩余部分可以视为常数\n", 39 | "$$ (part2)=\\int_{Z_j}q_j(Z_j) ln\\,(q_j(Z_j) )dZ_j+const$$\n", 40 | "然后\n", 41 | "$$\\begin{aligned}\\mathcal{L}(q)&=part1-part2=\\int_{Z_j} q_j(Z_j)\\mathbb{E}_{i\\neq j}[ln\\,p(X,Z)]dZ_j-\\int_{Z_j}q_j(Z_j) ln\\,(q_j(Z_j) )dZ_j+const\\\\&=\\int_{Z_j} q_j(Z_j)\\mathbb{E}_{i\\neq j}[ln\\,p(X,Z)]dZ_j+const_1\\int_{Z_j} q_j(Z_j)dZ_j-\\int_{Z_j}q_j(Z_j) ln\\,(q_j(Z_j) )dZ_j+const_2\\\\&=\\int_{Z_j} q_j(Z_j)(\\mathbb{E}_{i\\neq j}[ln\\,p(X,Z)]+const_1)dZ_j-\\int_{Z_j}q_j(Z_j) ln\\,(q_j(Z_j) )dZ_j+const_2\\end{aligned}$$\n", 42 | "其中 $const_1+const_2=const$\n", 43 | "设$ln\\,\\tilde{p}_j(X,Z_j)=\\mathbb{E}_{i\\neq j}[ln\\,p(X,Z)]+const_1$,接着我们有\n", 44 | "$$\\tilde{p}_j(X,Z_j)=\\exp(\\mathbb{E}_{i\\neq j}[ln\\,p(X,Z)]+const_1)=\\exp(\\mathbb{E}_{i\\neq j}[ln\\,p(X,Z)])\\exp(const_1)$$\n", 45 | "其中常数项$\\exp(const_1)$的作用是归一化。\n" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "\n", 53 | "如果我们将$\\mathcal{L}(q)$视为$q_j(Z_j)$的泛函,同时将剩余的分布$q_{i\\neq j}(Z_i)$ 固定,我们可以将 ELOB表示为\n", 54 | "$$\\mathcal{L}(q_j)=\\int_{Z_j} q_j(Z_j)ln\\,\\frac{\\tilde{p}_j(X,Z_j)}{q_j(Z_j)}dZ_j+const_2$$\n", 55 | "这等价于最小化$-KL( q_j(Z_j)||\\exp(\\mathbb{E}_{i\\neq j}[ln\\,p(X,Z)]+const_1))$\n", 56 | "因此最大化 ELOB等价于最小化这个特殊的 KL divergence,最优分布$q^*_j(Z_j)$ 满足\n", 57 | "$$ ln\\,(q^*_j(Z_j))=\\mathbb{E}_{i\\neq j}[ln\\,p(X,Z)]+const_1$$\n", 58 | "其中 $const_1$ 作用是归一化分布 $q^*_j(Z_j)$.如果我们在等式两边同时取指数exp则有\n", 59 | "$$ q^*_j(Z_j)=\\exp(\\mathbb{E}_{i\\neq j}[ln\\,p(X,Z)]+const_1)$$\n", 60 | "由\n", 61 | "$$\\int q^*_j(Z_j) dZ_j=\\int \\exp(\\mathbb{E}_{i\\neq j}[ln\\,p(X,Z)]+const_1) dZ_j=1\\to \\exp(const_1)=\\frac{1}{\\int\\exp(\\mathbb{E}_{i\\neq j}[ln\\,p(X,Z)]dZ_j}$$\n", 62 | "我们可以得到变分推断的更新式\n", 63 | "$$ q^*_j(Z_j)=\\frac{\\exp(\\mathbb{E}_{i\\neq j}[ln\\,p(X,Z)])}{\\int\\exp(\\mathbb{E}_{i\\neq j}[ln\\,p(X,Z)]dZ_j}$$\n" 64 | ] 65 | } 66 | ], 67 | "metadata": { 68 | "kernelspec": { 69 | "display_name": "Python 2", 70 | "language": "python", 71 | "name": "python2" 72 | }, 73 | "language_info": { 74 | "codemirror_mode": { 75 | "name": "ipython", 76 | "version": 2 77 | }, 78 | "file_extension": ".py", 79 | "mimetype": "text/x-python", 80 | "name": "python", 81 | "nbconvert_exporter": "python", 82 | "pygments_lexer": "ipython2", 83 | "version": "2.7.12" 84 | } 85 | }, 86 | "nbformat": 4, 87 | "nbformat_minor": 0 88 | } 89 | -------------------------------------------------------------------------------- /README.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 简介\n", 8 | "\n", 9 | "> 
作者:hschen \n", 10 | "> QQ:357033150 \n", 11 | "> 邮箱:hschen0712@gmail.com\n", 12 | "\n", 13 | "此笔记主要总结自一些论文、书籍以及公开课,由于本人水平有限,笔记中难免会出现各种错误,欢迎指正。 \n", 14 | "由于Github渲染`.ipynb`文件较慢,可以用nbviewer加快渲染:[点此加速](http://nbviewer.jupyter.org/github/hschen0712/machine-learning-notes/blob/master/README.ipynb)\n", 15 | "\n", 16 | "\n", 17 | "## 目录\n", 18 | "\n", 19 | "\n", 20 | "1.公开课/读书笔记 \n", 21 | "- [徐亦达机器学习笔记](YidaXu-ML/)\n", 22 | " - [采样算法系列1——基本采样算法](YidaXu-ML/sampling-methods-part1.ipynb)\n", 23 | " - [采样算法系列2——MCMC](YidaXu-ML/sampling-methods-part2.ipynb)\n", 24 | " - [EM算法](YidaXu-ML/EM-review.ipynb)\n", 25 | " - [变分推断](YidaXu-ML/variational-inference.ipynb)\n", 26 | " - [高斯分布的变分推断](YidaXu-ML/variational-inference-for-gaussian-distribution.ipynb)\n", 27 | " - [指数分布族](YidaXu-ML/exponential-family.ipynb)\n", 28 | " - [指数分布族的变分推断](YidaXu-ML/exponential-family-variational-inference.ipynb)\n", 29 | "\n", 30 | "- [CS229课程笔记](CS229/)\n", 31 | " - [广义线性模型](CS229/GLM.ipynb)\n", 32 | " - [EM算法](CS229/EM.ipynb)\n", 33 | " - [增强学习系列1](CS229/RL1.ipynb)\n", 34 | " - [增强学习系列2](CS229/RL2.ipynb)\n", 35 | "\n", 36 | "- [台大机器学习基石笔记](ML-Foundation/) \n", 37 | " - [第一讲-学习问题](ML-Foundation/lecture-1.ipynb)\n", 38 | "- [PRML读书笔记](PRML/)\n", 39 | " - [第一章 简介](PRML/Chap1-Introduction)\n", 40 | " - [1.1 多项式曲线拟合](PRML/Chap1-Introduction/1.1-polynomial-curve-fitting.ipynb)\n", 41 | " - [1.2 概率论回顾](PRML/Chap1-Introduction/1.2-probability-theory.ipynb)\n", 42 | " - [总结-曲线拟合的三种参数估计方法](PRML/Chap1-Introduction/Summary-three-curve-fitting-approaches.ipynb)\n", 43 | " - [第二章 概率分布](PRML/Chap2-Probability-Distributions)\n", 44 | " - [2.1 二元变量](PRML/Chap2-Probability-Distributions/2.1-binary-variables.ipynb)\n", 45 | " - [2.2 多元变量](PRML/Chap2-Probability-Distributions/2.2-multinomial-variables.ipynb)\n", 46 | " - [第三章 线性回归模型](PRML/Chap3-Linear-Models-For-Regression)\n", 47 | " - [3.1 线性基函数模型](PRML/Chap3-Linear-Models-For-Regression/3.1-linear-basis-function-models.ipynb)\n", 48 | " - [总结-贝叶斯线性回归](PRML/Chap3-Linear-Models-For-Regression/summary-baysian-linear-regression.ipynb)\n", 49 | "\n", 50 | "2.[机器学习笔记](Machine-Learning/)\n", 51 | "- [xgboost笔记](Machine-Learning/xgboost-notes)\n", 52 | " - [1. xgboost的安装](Machine-Learning/xgboost-notes/xgboost-note1.ipynb)\n", 53 | "- [softmax分类器](Machine-Learning/softmax-crossentropy-derivative.ipynb)\n", 54 | "- [用theano实现softmax分类器](Machine-Learning/implement-softmax-in-theano.ipynb)\n", 55 | "- [用SVD实现岭回归](Machine-Learning/svd-ridge-regression.ipynb)\n", 56 | "- [SVD系列1](Machine-Learning/svd1.ipynb)\n", 57 | "\n", 58 | "3.[NLP笔记](NLP/)\n", 59 | "- [LDA系列1——LDA简介](NLP/latent-dirichlet-allocation-1.ipynb)\n", 60 | "- [LDA系列2——Gibbs采样](NLP/latent-dirichlet-allocation-2.ipynb)\n", 61 | "- [朴素贝叶斯](NLP/naive-bayes.ipynb)\n", 62 | "\n", 63 | "\n", 64 | "4.[深度学习笔记](Deep-Learning/)\n", 65 | "- [theano笔记](Deep-Learning/theano-notes)\n", 66 | " - [2. theano简单计算](Deep-Learning/theano-notes/part2-simple-computations.ipynb)\n", 67 | " - [3. theano共享变量](Deep-Learning/theano-notes/part3-shared-variable.ipynb)\n", 68 | " - [4. theano随机数](Deep-Learning/theano-notes/part4-random-number.ipynb)\n", 69 | " - [6. theano的scan函数](Deep-Learning/theano-notes/part6-scan-function.ipynb)\n", 70 | " - [7. theano的dimshuffle](Deep-Learning/theano-notes/part7-dimshuffle.ipynb)\n", 71 | "- [mxnet笔记](Deep-Learning/mxnet-notes)\n", 72 | " - [1. Win10下安装MXNET](Deep-Learning/mxnet-notes/1-installation.ipynb)\n", 73 | "\n", 74 | " - [2. 
MXNET符号API](Deep-Learning/mxnet-notes/2-mxnet-symbolic.ipynb)\n", 75 | "\n", 76 | " - [mxnet中的运算符](Deep-Learning/mxnet-notes/operators-in-mxnet.ipynb)\n", 77 | "\n", 78 | " - [mshadow表达式模板教程](Deep-Learning/mxnet-notes/mshadow-expression-template-tutorial.ipynb)\n", 79 | "\n", 80 | "- [keras笔记](Deep-Learning/keras-notes)\n", 81 | " - [keras心得](Deep-Learning/keras-notes/keras-tips.ipynb)\n", 82 | "\n", 83 | "- [windows下安装caffe](Deep-Learning/install-caffe-in-windows.ipynb)\n", 84 | "- [BP算法矩阵形式推导](Deep-Learning/back-propagation-in-matrix-form.ipynb)\n", 85 | "- [随时间反向传播算法数学推导过程](Deep-Learning/back-propagation-through-time.ipynb)\n", 86 | "- [用numpy实现RNN](Deep-Learning/rnn-numpy.ipynb)\n", 87 | "- [随机矩阵的奇异值分析](Deep-Learning/singular-value-of-random-matrix.ipynb)\n" 88 | ] 89 | } 90 | ], 91 | "metadata": { 92 | "anaconda-cloud": {}, 93 | "kernelspec": { 94 | "display_name": "Python 2", 95 | "language": "python", 96 | "name": "python2" 97 | }, 98 | "language_info": { 99 | "codemirror_mode": { 100 | "name": "ipython", 101 | "version": 2 102 | }, 103 | "file_extension": ".py", 104 | "mimetype": "text/x-python", 105 | "name": "python", 106 | "nbconvert_exporter": "python", 107 | "pygments_lexer": "ipython2", 108 | "version": "2.7.9" 109 | } 110 | }, 111 | "nbformat": 4, 112 | "nbformat_minor": 0 113 | } 114 | -------------------------------------------------------------------------------- /YidaXu-ML/exponential-family-variational-inference.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 指数分布族的变分推理\n", 8 | "回顾一下上节课*变分推断*讲的内容。 \n", 9 | "与MCMC,Gibbs Sampling等基于抽样来估计未知分布的方法不同,变分推断通过一个优化过程来近似隐变量的后验概率分布。给定一组隐变量 $Z=\\{z_1,...,z_M\\}$,其中$M$表示$Z$可以分解为$M$个独立成分。变分推断用一个变分分布$q(Z)$估计后验分布$P(Z|X)$,通过最大化如下的ELBO来完成近似过程:\n", 10 | "$$ \\mathcal{L}(q(Z))=\\mathbb{E}_{q(Z)}[ln\\ p(X,Z)]-\\mathbb{E}_{q(Z)}[ln\\ q(Z)]$$\n", 11 | "最优的近似分布为\n", 12 | "$$ q^*(Z)=\\arg\\max_{q(Z)}\\mathcal{L}(q(Z))$$\n", 13 | "\n", 14 | "根据 meanfield假设,我们可以对$q(Z)$进行分解:\n", 15 | "$$q(Z)=\\prod_{i=1}^M q_i(Z_i)$$\n", 16 | "\n", 17 | "与上一堂课所用方法不同,这次我们假设$q_i(Z_i),(i=1,2,...,M)$属于指数分布族 ,这些分布的超参数是超参数$\\lambda_i,(i=1,2,...,M)$,这样做的好处是将泛函优化问题简化为了超参数优化问题。\n" 18 | ] 19 | }, 20 | { 21 | "cell_type": "markdown", 22 | "metadata": {}, 23 | "source": [ 24 | "我们可以用一个概率图模型来表示上一节课的模型(原谅我图画的丑):\n", 25 | "\n", 26 | "由于$Z_1,Z_2,...,Z_M$都是$X$的co-parents,根据概率图模型理论,要估计$Z_i$的后验,我们不仅需要知道$X$的信息,还需要知道$\\{Z_1,...,Z_{i-1},Z_{i+1},...,Z_M\\}$的信息,即$Z_i$的后验同时取决于$X$和其他的$Z_j$,可以将其表示为$P(Z_i|X,Z_1,...,Z_{i-1},Z_{i+1},...,Z_M)$ \n", 27 | "为了数学上表示的方便,我们将$\\{Z_1,...,Z_{i-1},Z_{i+1},...,Z_M\\}$简记为$Z_{-i}$。我们假设$Z_i$的后验属于指数分布族:\n", 28 | "$$p(Z_i|X,Z_{-i})=h(Z_i)\\exp(T(Z_i)^T\\eta(X,Z_{-i})-A_g(\\eta(X,Z_{-i}))$$\n", 29 | "其中$T(Z_i)$是$Z_i$的充分统计量,$\\eta(X,Z_{-i})$是其自然参数,它是关于$X,Z_{-i}$的函数,特别地,对于广义线性模型来说这个函数是线性的。\n", 30 | "\n", 31 | "同样地,我们假设每个近似分布$q_i(Z_i)$也是指数分布族:\n", 32 | "$$q_i(Z_i|\\lambda_i)=h(Z_i)\\exp(T(Z_i)^T\\lambda_i-A_\\ell(\\lambda_i))$$\n", 33 | "其中$\\lambda_i$是$Z_i$的超参数。为了方便,记$\\lambda_{-i}=\\{\\lambda_1,...,\\lambda_{i-1},\\lambda_{i+1},...,\\lambda_{M}\\}$\n" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "通过引入超参我们可以将问题转化为一个参数优化问题:\n", 41 | "$$ \\lambda^*_1,...,\\lambda^*_M=\\arg\\max_{\\lambda_1,...,\\lambda_M} \\mathcal{L}(\\lambda_1,...,\\lambda_M)$$\n", 42 | "\n", 43 | "用贝叶斯公式给目标函数做个变形:\n", 44 | "$$ \\mathcal{L}(q(Z))=\\mathbb{E}_{q(Z)}[\\ln P(Z_i|X,Z_{-i}) +\\ln P(Z_{-i}|X) + \\ln 
P(X)]-\\mathbb{E}_{q(Z)}[\\ln q(Z)]$$\n", 45 | "其参数化形式为:\n", 46 | "$$ \\begin{aligned}\\mathcal{L}(\\lambda_1,...,\\lambda_M)&= \\mathbb{E}_{q(Z|\\lambda)}[\\ln P(Z_i|X,Z_{-i}) +\\ln P(Z_{-i}|X) + \\ln P(X)]-\\mathbb{E}_{q(Z|\\lambda)}[\\ln q_i(Z_i|\\lambda_i)]-\\mathbb{E}_{q(Z|\\lambda)}[\\sum_{j\\neq i} \\ln q_j(Z_j|\\lambda_j)]\\\\&= \\mathbb{E}_{q(Z|\\lambda)}[\\ln P(Z_i|X,Z_{-i})]-\\mathbb{E}_{q(Z|\\lambda)}[\\ln q_i(Z_i|\\lambda_i)]+const\\end{aligned}$$\n", 47 | "其中$\\mathbb{E}_{q(Z|\\lambda)}[\\ln P(Z_{-i}|X)],\\mathbb{E}_{q(Z|\\lambda)}[\\ln P(X)]$和$\\mathbb{E}_{q(Z|\\lambda)}[\\sum_{j\\neq i} \\ln q_j(Z_j|\\lambda_j)]$都是常数。一开始不太理解为什么$\\mathbb{E}_{q(Z|\\lambda)}[\\ln P(Z_{-i}|X)]$是常数,后来经过如下的计算想通了。\n", 48 | "$$\\begin{aligned}\\mathbb{E}_{q(Z|\\lambda)}[\\ln P(Z_{-i}|X)]&= \\int_{Z_i} \\bigg[\\int_{Z_{j\\neq i}} \\ln P(Z_{-i}|X)\\prod_{j\\neq i}q_j(Z_j|\\lambda_j) dZ_j\\bigg] q_i(Z_i|\\lambda_i) dZ_i\\\\&=\\int_{Z_{j\\neq i}} \\ln P(Z_{-i}|X)\\prod_{j\\neq i}q_j(Z_j|\\lambda_j) dZ_j \\int_{Z_i} q_i(Z_i|\\lambda_i) dZ_i\\\\&=\\int_{Z_{j\\neq i}} \\ln P(Z_{-i}|X)\\prod_{j\\neq i}q_j(Z_j|\\lambda_j) dZ_j =constant(\\mbox{因为$\\lambda_{-i}$已知})\\end{aligned}$$\n", 49 | "\n", 50 | "\n", 51 | "\n", 52 | "代入指数分布式子有\n", 53 | "$$ \\begin{aligned}\\mathcal{L}(\\lambda_1,...,\\lambda_M)&= \\mathbb{E}_{q(Z|\\lambda)}[\\ln P(Z_i|X,Z_{-i})]-\\mathbb{E}_{q(Z|\\lambda)}[\\ln q_i(Z_i|\\lambda_i)]+const\\\\&= \\mathbb{E}_{q(Z|\\lambda)}[\\ln h(Z_i)]+\\mathbb{E}_{q(Z|\\lambda)}[T(Z_i)^T\\eta(X,Z_{-i})]-\\underbrace{\\mathbb{E}_{q(Z|\\lambda)}[A_g(\\eta(X,Z_{-i}))]}_{const}\\\\&\\quad-\\mathbb{E}_{q(Z|\\lambda)}[\\ln h(Z_i)]-\\mathbb{E}_{q(Z|\\lambda)}[T(Z_i)^T\\lambda_i]+\\mathbb{E}_{q(Z|\\lambda)}[A_\\ell(\\lambda_i)]+const\\\\&=\\mathbb{E}_{q(Z|\\lambda)}[T(Z_i)^T(\\eta(X,Z_{-i})-\\lambda_i)]+\\mathbb{E}_{q(Z|\\lambda)}[A_\\ell(\\lambda_i)]+const\\\\\n", 54 | "&=\\mathbb{E}_{q_i(Z_i|\\lambda_i)}[T(Z_i)]^T\\cdot \\mathbb{E}_{q_{-i}(Z_{-i}|\\lambda_{-i})}[\\eta(X,Z_{-i})-\\lambda_i]+\\mathbb{E}_{q(Z|\\lambda)}[A_\\ell(\\lambda_i)]+const\\\\&=A'_\\ell(\\lambda_i)^T\\mathbb{E}_{q_{-i}(Z_{-i}|\\lambda_{-i})}[\\eta(X,Z_{-i})-\\lambda_i]+\\mathbb{E}_{q(Z|\\lambda)}[A_\\ell(\\lambda_i)]+const\\\\&=A'_\\ell(\\lambda_i)^T(\\mathbb{E}_{q_{-i}(Z_{-i}|\\lambda_{-i})}[\\eta(X,Z_{-i})]-\\lambda_i)+A_\\ell(\\lambda_i)+const\\end{aligned}$$\n", 55 | "这里用了一个trick:\n", 56 | "$\\mathbb{E}_{q(Z|\\lambda)}[T(Z_i)^T(\\eta(X,Z_{-i})-\\lambda_i)]=\\mathbb{E}_{q_i(Z_i|\\lambda_i)}[T(Z_i)]^T\\cdot \\mathbb{E}_{q_{-i}(Z_{-i}|\\lambda_{-i})}[\\eta(X,Z_{-i})-\\lambda_i]$ \n", 57 | "来说明一下这是为什么:\n", 58 | "$$\\begin{aligned}\\mathbb{E}_{q(Z|\\lambda)}[T(Z_i)^T(\\eta(X,Z_{-i})-\\lambda_i)]&=\\int_{Z_1,...,Z_M}T(Z_i)^T(\\eta(X,Z_{-i})-\\lambda_i)\\prod_{i=1}^M q_i(Z_i|\\lambda_i) dZ_i\\\\&=\\int_{Z_i}T(Z_i)^T q_i(Z_i|\\lambda_i) dZ_i\\cdot \\int_{Z_{-i}}(\\eta(X,Z_{-i})-\\lambda_i) \\prod_{j\\neq i} q_j(Z_j|\\lambda_j) dZ_j \\\\&=\\mathbb{E}_{q_i(Z_i|\\lambda_i)}[T(Z_i)]^T\\cdot \\mathbb{E}_{q_{-i}(Z_{-i}|\\lambda_{-i})}[\\eta(X,Z_{-i})-\\lambda_i]\\end{aligned}$$\n", 59 | "回归正题,接着我们对$\\lambda_i$求导:\n", 60 | "\n", 61 | "$$\\frac{\\partial \\mathcal{L}}{\\partial \\lambda_i}=A_\\ell''(\\lambda_i)(\\mathbb{E}_{q_{-i}(Z_{-i}|\\lambda_{-i})}[\\eta(X,Z_{-i})]-\\lambda_i)-A'_\\ell(\\lambda_i)+A'_\\ell(\\lambda_i)=0$$\n", 62 | "一般来说$A_\\ell''(\\lambda_i)\\neq 0$,于是我们有\n", 63 | "$$\\lambda^*_i=\\mathbb{E}_{q_{-i}(Z_{-i}|\\lambda_{-i})}[\\eta(X,Z_{-i})]$$\n", 64 | "有了这个更新式,只要我们遍历$\\lambda_i$固定其他参数,就可以迭代地对ELBO进行优化,获得后验概率分布的估计。" 65 | ] 66 | } 
67 | ], 68 | "metadata": { 69 | "kernelspec": { 70 | "display_name": "Python 2", 71 | "language": "python", 72 | "name": "python2" 73 | }, 74 | "language_info": { 75 | "codemirror_mode": { 76 | "name": "ipython", 77 | "version": 2 78 | }, 79 | "file_extension": ".py", 80 | "mimetype": "text/x-python", 81 | "name": "python", 82 | "nbconvert_exporter": "python", 83 | "pygments_lexer": "ipython2", 84 | "version": "2.7.12" 85 | } 86 | }, 87 | "nbformat": 4, 88 | "nbformat_minor": 0 89 | } 90 | -------------------------------------------------------------------------------- /Machine-Learning/svd1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 奇异值分解系列1\n", 8 | "## 线性代数拾遗\n", 9 | "在介绍奇异值分解之前,先来回顾线性代数中的一些重要概念。\n", 10 | "### 线性变换\n", 11 | "我们考察一个线性变换\n", 12 | "$$\\left\\{\\begin{aligned}&x'=2x+y\\\\&y'=-x+y\\end{aligned}\\right.$$\n", 13 | "我们可以将这个线性变换表示为一个矩阵和一个向量的乘积:\n", 14 | "$$\\begin{bmatrix}x'\\\\ y'\\end{bmatrix}=\\begin{bmatrix}2 & 1\\\\-1 & 1\\end{bmatrix}\\begin{bmatrix}x\\\\ y\\end{bmatrix}$$\n", 15 | "于是我们可以用一个矩阵$A$来描述这样一个线性变换:\n", 16 | "$$A=\\begin{bmatrix}2 & 1\\\\-1 & 1\\end{bmatrix}$$\n", 17 | "举个例子来说明。假设我们有一个向量:\n", 18 | "$$ \\mathbf{x}=\\begin{bmatrix}1\\\\ 3\\end{bmatrix}$$\n", 19 | "经过矩阵A变换后变为\n", 20 | "$$ \\mathbf{x}'=A\\mathbf{x}=\\begin{bmatrix}5\\\\2\\end{bmatrix}$$\n", 21 | "其变换过程如下图所示:\n", 22 | "\n", 23 | "从图中我们看到,一个线性变换相当于对一个向量作了两种操作:旋转和拉伸(但不包括平移)。特别地,有一种特殊的线性变换,它只对向量进行了拉伸,而不改变向量的方向,这种情况发生于当一个矩阵作用于它的特征向量时。\n", 24 | "### 正交矩阵\n", 25 | "如果一个n阶矩阵满足\n", 26 | "$$ U^\\top U=I\\\\U^{-1}=U^\\top$$\n", 27 | "则称$U$为正交矩阵。上述关系透露出两个信息: \n", 28 | "1)$U$的各列间彼此正交,$u_i^\\top u_j=0(i\\neq j)$ \n", 29 | "2)$U$的各列由单位向量组成$\\left\\|u_i\\right\\|=1$ \n", 30 | "正交矩阵是一类特殊的线性变换,它具有以下性质:\n", 31 | "* 保长性:当正交变换作用于一个向量后,该向量的长度不变 \n", 32 | "对$\\mathbf{x}$作正交变换$\\mathbf{y}=U\\mathbf{x}$,则$\\mathbf{y}^T\\mathbf{y}=\\mathbf{x}^\\top U^\\top U \\mathbf{x}=\\mathbf{x}^\\top \\mathbf{x}$,即$|y|=|x|$ \n", 33 | "* 保角性:两个向量经过正交变换后夹角不变 \n", 34 | "对$\\mathbf{x}_1,\\mathbf{x}_2$作正交变换$\\mathbf{y}_1=U\\mathbf{x}_1,\\mathbf{y}_2=U\\mathbf{x}_2$,则$\\cos<\\mathbf{y}_1,\\mathbf{y}_2>=\\frac{\\mathbf{y}_1^\\top\\mathbf{y}_2}{|\\mathbf{y}_1||\\mathbf{y}_2|}=\\frac{\\mathbf{x}_1^\\top\\mathbf{x}_2}{|\\mathbf{x}_1||\\mathbf{x}_2|}=\\cos<\\mathbf{x}_1, \\mathbf{x}_2>$ \n", 35 | "由此可以得出,正交变换只对向量作了旋转。" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "### 矩阵的对角分解\n", 43 | ">给定一个方阵n阶$A$,如果其有n个线性无关的特征向量,则可以将其分解为\n", 44 | "$$A=S\\Lambda S^{-1}$$\n", 45 | "其中,$\\Lambda$为对角矩阵,其对角元为$A$的各特征值,$S$为可逆矩阵,其列为$A$的各特征向量。 \n", 46 | "\n", 47 | "特别地,当$A$为对称矩阵时,可以将其分解为$A=Q\\Lambda Q^T$,$Q$为正交矩阵\n", 48 | "\n", 49 | "### 矩阵的四个基本子空间\n", 50 | "给定一个$m\\times n$矩阵$A$,它的四个基本子空间定义为:\n", 51 | "* 列空间$C(A)$,维数为$r(A)$\n", 52 | "* 行空间$C(A^T)$,维数为$r(A)$,等价于$A^T$的列空间\n", 53 | "* 零空间$N(A)$,维数为$n-r(A)$,由所有满足$Ax=0$的向量$x$组成\n", 54 | "* 左零空间$N(A^T)$,维数为$m-r(A)$,等价于$A^\\top$的零空间\n", 55 | "\n", 56 | "其中$r(A)$表示$A$的秩。由秩-零化度定理,$rank(A)+nullity(A)=n$,故行空间与零空间互补,构成$\\mathbb{R}^n$;列空间与左零空间互补,构成$\\mathbb{R}^m$" 57 | ] 58 | }, 59 | { 60 | "cell_type": "markdown", 61 | "metadata": {}, 62 | "source": [ 63 | "## 奇异值分解\n", 64 | "奇异值这个词最早由法国数学家 Émile Picard提出,用于形容异常、不平凡的事物。奇异值分解(SVD)是线性代数中一种常用的矩阵分解方法,与特征值分解不同的是,它不要求被分解矩阵必须是方阵,因此具有较好的可推广性。\n", 65 | "\n", 66 | ">定理(奇异值分解-full svd) \n", 67 | "给定一个矩阵$A_{m\\times n}$,则矩阵$A$的奇异值分解为:\n", 68 | "$$A_{m\\times n}=U_{m\\times m}\\Sigma_{m\\times n} 
V_{n\\times n}^\\top$$\n", 69 | "\n", 70 | "其中,$U$和$V$分别是$m\\times m$和$n\\times n$的正交矩阵,$\\Sigma$是一个$m\\times n$的“对角”矩阵,其形式是\n", 71 | "$$\\Sigma_{m\\times n}=\\begin{bmatrix}D_{r\\times r} & 0\\\\0& 0\\end{bmatrix}$$\n", 72 | "$D$表示一个$r\\times r$的对角矩阵($r=rank(A)$),其对角元为$\\sigma_1\\geq \\sigma_2\\geq \\cdots\\geq \\sigma_r\\geq 0$,称为$A$的奇异值。$U$的列称为$A$的左奇异向量,$V$的列称为$A$的的右奇异向量。\n", 73 | "\n", 74 | "\n" 75 | ] 76 | }, 77 | { 78 | "cell_type": "markdown", 79 | "metadata": {}, 80 | "source": [ 81 | "### SVD的基本思想\n", 82 | "设$A$是一个$m\\times n$的矩阵,$v_1,...,v_r$是其行空间的一组标准正交基,$u_1,...,u_r$是其列空间的一组标准正交基,SVD的目标是找出这样的一组正交基,使得在线性变换$A$的作用下,$v_i$与$u_i$同向,即$A v_i=\\sigma_i u_i (1\\leq i\\leq r)$ \n", 83 | "我们分别为这两组基作扩充,为行空间添上零空间的标准正交基(补齐为n个基向量),为列空间加上左零空间的标准正交基(补齐为m个基向量),我们有\n", 84 | "\n", 85 | "将上式写成矩阵形式我们有:\n", 86 | "$$AV=U\\Sigma$$\n", 87 | "因为$V$是正交矩阵,于是我们得到了SVD的分解式:\n", 88 | "$$A=U\\Sigma V^T$$\n", 89 | "### Reduced-SVD\n", 90 | "上面的full svd中,我们需要额外存储零空间、左零空间的正交基,这实际上是没必要的,去掉它们可以提高计算效率,于是我们有了如下三种SVD的变种。\n", 91 | "1.thin svd\n", 92 | ">**定理**(奇异值分解-thin svd) \n", 93 | "给定一个矩阵$A_{m\\times n}$,则矩阵$A$的奇异值分解为:\n", 94 | "$$A_{m\\times n}=U_{m\\times q}\\Sigma_{q\\times q} V_{q\\times n}^\\top$$\n", 95 | "\n", 96 | "其中$q=\\min(m, n)$\n", 97 | "\n", 98 | "2.compact svd \n", 99 | ">**定理**(奇异值分解-compact svd) \n", 100 | "给定一个矩阵$A_{m\\times n}$,则矩阵$A$的奇异值分解为:\n", 101 | "$$A_{m\\times n}=U_{m\\times r}\\Sigma_{r\\times r} V_{r\\times n}^\\top$$\n", 102 | "\n", 103 | "其中$r=rank(A)$\n", 104 | "\n", 105 | "3.truncated svd\n", 106 | ">**定理**(奇异值分解-truncated svd) \n", 107 | "给定一个矩阵$A_{m\\times n}$,则矩阵$A$的奇异值分解为:\n", 108 | "$$A_{m\\times n}\\approx U_{m\\times k}\\Sigma_{k\\times k} V_{k\\times n}^\\top$$\n", 109 | "\n", 110 | "其中$k" 125 | ] 126 | } 127 | ], 128 | "metadata": { 129 | "kernelspec": { 130 | "display_name": "Python 2", 131 | "language": "python", 132 | "name": "python2" 133 | }, 134 | "language_info": { 135 | "codemirror_mode": { 136 | "name": "ipython", 137 | "version": 2 138 | }, 139 | "file_extension": ".py", 140 | "mimetype": "text/x-python", 141 | "name": "python", 142 | "nbconvert_exporter": "python", 143 | "pygments_lexer": "ipython2", 144 | "version": "2.7.12" 145 | } 146 | }, 147 | "nbformat": 4, 148 | "nbformat_minor": 0 149 | } 150 | -------------------------------------------------------------------------------- /Machine-Learning/xgboost-notes/xgboost-note1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# xgboost的安装\n", 8 | "首先将xgboost的源代码从github上下载到本地:\n", 9 | "```\n", 10 | "git clone --recursive https://github.com/dmlc/xgboost\n", 11 | "```\n", 12 | "别忘了加上`--recursive`参数,以下载依赖的`dmlc-core`等模块。\n", 13 | "接着,编译xgboost的源代码 \n", 14 | "```\n", 15 | "cd xgboost \n", 16 | "make -j4 #开4个线程编译\n", 17 | "```\n", 18 | "编译完成后,将在`xgboost/lib`目录下生成文件`libxgboost.so`。 \n", 19 | "接下来我们需要安装xgboost的python接口,具体步骤如下: \n", 20 | "1.进入`xgboost/python-packages`,执行\n", 21 | "```\n", 22 | "python setup.py install\n", 23 | "```\n", 24 | "安装过程顺便会将`xgboost/lib/libxgboost.so`拷贝到`/home/little-prince/anaconda2/lib/python2.7/site-packages/xgboost-0.6-py2.7.egg/xgboost`目录下。 \n", 25 | "2.可以运行以下代码验证是否安装成功" 26 | ] 27 | }, 28 | { 29 | "cell_type": "code", 30 | "execution_count": 1, 31 | "metadata": { 32 | "collapsed": false 33 | }, 34 | "outputs": [ 35 | { 36 | "name": "stdout", 37 | "output_type": "stream", 38 | "text": [ 39 | "0.6\n" 40 | ] 41 | } 42 | ], 43 | "source": [ 44 | "import xgboost as 
xgb\n", 45 | "print xgb.__version__" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "## 安装过程中遇到的错误及解决\n", 53 | "1.`/home/little-prince/anaconda2/lib/python2.7/site-packages/scipy/sparse/../../../../libstdc++.so.6: version 'GLIBCXX_3.4.20' not found` \n", 54 | "原因: \n", 55 | "GLIBCXX的版本太老。执行\n", 56 | "```\n", 57 | "sudo find / -name libstdc++.so.6*\n", 58 | "```\n", 59 | "返回结果: \n", 60 | "```\n", 61 | "/usr/share/gdb/auto-load/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21-gdb.py\n", 62 | "/usr/lib/vmware-tools/lib32/libstdc++.so.6\n", 63 | "/usr/lib/vmware-tools/lib32/libstdc++.so.6/libstdc++.so.6\n", 64 | "/usr/lib/vmware-tools/lib64/libstdc++.so.6\n", 65 | "/usr/lib/vmware-tools/lib64/libstdc++.so.6/libstdc++.so.6\n", 66 | "/usr/lib/x86_64-linux-gnu/libstdc++.so.6\n", 67 | "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21\n", 68 | "find: ‘/run/user/1000/gvfs’: Permission denied\n", 69 | "/home/little-prince/anaconda2/lib/libstdc++.so.6\n", 70 | "/home/little-prince/anaconda2/lib/libstdc++.so.6.0.19\n", 71 | "/home/little-prince/anaconda2/pkgs/libgcc-4.8.5-2/lib/libstdc++.so.6\n", 72 | "/home/little-prince/anaconda2/pkgs/libgcc-4.8.5-2/lib/libstdc++.so.6.0.19\n", 73 | "/home/little-prince/文档/vmware-tools-distrib/lib/lib32/libstdc++.so.6\n", 74 | "/home/little-prince/文档/vmware-tools-distrib/lib/lib32/libstdc++.so.6/libstdc++.so.6\n", 75 | "/home/little-prince/文档/vmware-tools-distrib/lib/lib64/libstdc++.so.6\n", 76 | "/home/little-prince/文档/vmware-tools-distrib/lib/lib64/libstdc++.so.6/libstdc++.so.6\n", 77 | "```\n", 78 | "其中,`/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21`就是我们需要的。接下来执行如下步骤: \n", 79 | "1)`rm /home/little-prince/anaconda2/lib/libstdc++.so.6` \n", 80 | "2)`cp /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21 /home/little-prince/anaconda2/lib/libstdc++.so.6` \n", 81 | "\n", 82 | "2.`/home/little-prince/anaconda2/bin/../lib/libgomp.so.1: version `GOMP_4.0' not found` \n", 83 | "原因: \n", 84 | "同上,版本太老。 \n", 85 | "执行\n", 86 | "```\n", 87 | "sudo find / -name libgomp.so.1*\n", 88 | "```\n", 89 | "找到 \n", 90 | "```\n", 91 | "/usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0\n", 92 | "/usr/lib/x86_64-linux-gnu/libgomp.so.1\n", 93 | "find: ‘/run/user/1000/gvfs’: Permission denied\n", 94 | "/home/little-prince/anaconda2/lib/libgomp.so.1.0.0\n", 95 | "/home/little-prince/anaconda2/lib/libgomp.so.1\n", 96 | "/home/little-prince/anaconda2/pkgs/libgcc-4.8.5-2/lib/libgomp.so.1.0.0\n", 97 | "/home/little-prince/anaconda2/pkgs/libgcc-4.8.5-2/lib/libgomp.so.1\n", 98 | "```\n", 99 | "运行 \n", 100 | "```\n", 101 | "strings /usr/lib/x86_64-linux-gnu/libgomp.so.1 | grep GOMP_4.0\n", 102 | "```\n", 103 | "得到 \n", 104 | "```\n", 105 | "GOMP_4.0\n", 106 | "GOMP_4.0.1\n", 107 | "```\n", 108 | "说明此文件就是我们想要的。接着执行\n", 109 | "```\n", 110 | "rm /home/little-prince/anaconda2/bin/../lib/libgomp.so.1\n", 111 | "cp /usr/lib/x86_64-linux-gnu/libgomp.so.1 /home/little-prince/anaconda2/bin/../lib/libgomp.so.1\n", 112 | "```\n" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "## 用xgboost训练分类模型\n", 120 | "\n", 121 | "下段代码演示了如何用xgboost训练一个分类器并完成预测:\n" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": 2, 127 | "metadata": { 128 | "collapsed": false 129 | }, 130 | "outputs": [ 131 | { 132 | "name": "stdout", 133 | "output_type": "stream", 134 | "text": [ 135 | "accuracy is: 0.98\n" 136 | ] 137 | } 138 | ], 139 | "source": [ 140 | "from sklearn.datasets import load_iris\n", 141 | "from 
sklearn.model_selection import train_test_split\n", 142 | "from sklearn.metrics import accuracy_score\n", 143 | "import xgboost as xgb\n", 144 | "\n", 145 | "# loading database\n", 146 | "iris = load_iris()\n", 147 | "# split data into train set and test set\n", 148 | "iris_train, iris_test, y_train, y_test = train_test_split(iris['data'], iris['target'], test_size=0.33, random_state=42)\n", 149 | "# doing all the XGBoost magic\n", 150 | "xgb_model = xgb.XGBClassifier().fit(iris_train, y_train)\n", 151 | "# make predicts on test set\n", 152 | "y_pred = xgb_model.predict(iris_test)\n", 153 | "# show accuracy\n", 154 | "print \"accuracy is:\",accuracy_score(y_test, y_pred)" 155 | ] 156 | } 157 | ], 158 | "metadata": { 159 | "kernelspec": { 160 | "display_name": "Python 2", 161 | "language": "python", 162 | "name": "python2" 163 | }, 164 | "language_info": { 165 | "codemirror_mode": { 166 | "name": "ipython", 167 | "version": 2 168 | }, 169 | "file_extension": ".py", 170 | "mimetype": "text/x-python", 171 | "name": "python", 172 | "nbconvert_exporter": "python", 173 | "pygments_lexer": "ipython2", 174 | "version": "2.7.13" 175 | } 176 | }, 177 | "nbformat": 4, 178 | "nbformat_minor": 2 179 | } 180 | -------------------------------------------------------------------------------- /CS229/EM.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# EM算法\n", 8 | "\n", 9 | "## 1.Jensen不等式\n", 10 | "\n", 11 | "回顾凸函数的定义,假设$f(X)$是定义在实数集上的函数,如果$f''(x)\\geq 0$对于所有的$x\\in\\mathbb{R}$都成立,则称$f(X)$是凸函数;如果$f''(x)> 0$对于所有的$x\\in\\mathbb{R}$都成立,则称$f(X)$是严格凸函数。当$f(X)$定义在实数向量空间上时,如果其海赛矩阵$H(x)$是半正定的,则称$f(X)$是凸函数;如果$H(x)$是正定矩阵,则称$f(X)$是严格凸函数。\n", 12 | "接着我们给出Jensen不等式的定义: \n", 13 | "如果$f(X)$是凸函数,那么对于任意的$X$都有下式成立: \n", 14 | "$$E[f(X)]\\geq f(E[X])$$\n", 15 | "\n", 16 | "下图给出了这个定理的直观解释: \n", 17 | "\n", 18 | "\n", 19 | "\n", 20 | "图中,点$X$是一个随机变量,以0.5的概率取值点a,0.5的概率取值点b,那么$E[X]$就落在$[a,b]$的中点。而$E[f(X)]$则位于$y$轴上$f(a)$和$f(b)$的中点。观察可以发现,对于凸函数来说,$E[f(X)]$的值一定在$f(E[X])$的上方。\n", 21 | "\n", 22 | "这个定理可以总结为:“对于凸函数而言,函数值的期望总是大于等于期望的函数值”。" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "## 2.EM算法\n", 30 | "\n", 31 | "假设我们有$m$个独立样例构成的训练集$\\{x^{(1)},...,x^{(m)}\\}$,我们希望从数据中学习到模型$p(x)$的参数$\\Theta$,一般都是通过最大化如下的似然函数来找到参数$\\Theta$的最大似然估计:\n", 32 | "\n", 33 | "$$\\ell(\\Theta)=\\sum_{i=1}^m \\log p(x^{(i)};\\Theta)$$\n", 34 | "\n", 35 | "然而并不是所有情况下都有解析解的。EM算法给出的解决办法是引入隐变量$z$,具体来说就是估计$p(x,z)$的参数,其似然函数为\n", 36 | "\n", 37 | "$$\\ell(\\Theta)=\\sum_{i=1}^m \\log p(x^{(i)};\\Theta)=\\sum_{i=1}^m \\log \\sum\\limits_{z^{(i)}} p(x^{(i)},z^{(i)};\\Theta)$$\n", 38 | "\n", 39 | "此处,$z$是离散随机变量,如果$z$是连续随机变量,求和符号变更为积分符号。\n", 40 | "\n", 41 | "EM算法的原理是: \n", 42 | "1)E-step:给出$\\ell(\\Theta)$的下界 \n", 43 | "2)M-step:优化这个下界 \n", 44 | "\n", 45 | "下面我们来详细叙述每个步骤的推导过程: \n", 46 | "### E-step\n", 47 | "对于每个$i$,设$Q_i$是$z$上的分布($\\sum\\limits_z Q_i(z)=1,Q_i(z)\\geq 0$),那么 \n", 48 | "$$\\begin{aligned}\\sum_{i=1}^m \\log p(x^{(i)};\\Theta)&=\\sum_{i=1}^m \\log\\sum\\limits_{z^{(i)}} p(x^{(i)},z^{(i)};\\Theta)\\\\\n", 49 | "&=\\sum_{i=1}^m \\log\\sum\\limits_{z^{(i)}} Q_i(z^{(i)}) \\frac{p(x^{(i)},z^{(i)};\\Theta)}{Q_i(z^{(i)})}\\\\&=\\sum_{i=1}^m \\log \\mathbb{E}_{z^{(i)}\\sim Q_i}\\bigg[\\frac{p(x^{(i)},z^{(i)};\\Theta)}{Q_i(z^{(i)})}\\bigg]\\\\&\\geq \\sum_{i=1}^m \\mathbb{E}_{z^{(i)}\\sim Q_i}\\bigg[ \\log \\frac{p(x^{(i)},z^{(i)};\\Theta)}{Q_i(z^{(i)})}\\bigg]\\\\&=\\sum_{i=1}^m 
\\sum_{z^{(i)}}Q_i(z^{(i)}) \\log \\frac{p(x^{(i)},z^{(i)};\\Theta)}{Q_i(z^{(i)})} \\end{aligned}$$\n", 50 | "\n", 51 | "最后一步我们用到了Jensen不等式,同时注意到$\\log x$是凹函数,因此不等式方向与凸函数版本的Jensen不等式相反。\n", 52 | "\n", 53 | "由此,我们得到了$\\ell(\\Theta)$的下界,无论我们选择什么样的$Q(z)$,上述不等式都成立。现在我们的问题是,对于特定的$\\theta$选择什么样的$Q$能够使得不等式两端的差距最小,即不等式的等号成立。实际上,只要$\\frac{p(x^{(i)},z^{(i)};\\Theta)}{Q_i(z^{(i)})}$是一个与$z^{(i)}$无关的常量,那么在求期望时就可以把$z^{(i)}$消除,因此我们假设:\n", 54 | "$$\\frac{p(x^{(i)},z^{(i)};\\Theta)}{Q_i(z^{(i)})}=c(x^{(i)};\\Theta)$$\n", 55 | "当我们只考虑$z$时,$c(x^{(i)};\\Theta)$可以视为一个常量。结合$\\sum_{z^{(i)}} Q_i(z^{(i)})=1$的事实,我们知道\n", 56 | "$$c(x^{(i)};\\Theta)=\\sum_{z^{(i)}} p(x^{(i)},z^{(i)};\\Theta)=p(x^{(i)};\\Theta)$$\n", 57 | "于是\n", 58 | "$$Q_i(z^{(i)})=\\frac{p(x^{(i)},z^{(i)};\\Theta)}{p(x^{(i)};\\Theta)}=p(z^{(i)}|x^{(i)};\\Theta)$$\n", 59 | "这告诉我们,对于特定的$\\Theta$只要$Q_i(z^{(i)})$等于给定$x^{(i)}$下$z^{(i)}$的后验分布即可让不等式的等号成立。\n", 60 | "\n", 61 | "\n", 62 | "### M-step\n", 63 | "\n", 64 | "在上一步,对于特定的$\\Theta$的取值,我们通过选择$Q_i(z^{(i)})$为后验分布$p(z^{(i)}|x^{(i)};\\Theta)$,找到了似然函数的下界,即以下不等式的等号成立:\n", 65 | "$$\\sum_{i=1}^m \\log p(x^{(i)};\\Theta)\\geq \\sum_{i=1}^m \\sum_{z^{(i)}}Q_i(z^{(i)}) \\log \\frac{p(x^{(i)},z^{(i)};\\Theta)}{Q_i(z^{(i)})} $$\n", 66 | "\n", 67 | "从而得到似然函数的准确估计。接下来我们找到令上式的右端最大的$\\Theta$\n", 68 | "\n", 69 | "$$\\Theta:=\\arg\\max_{\\Theta} \\sum_{i=1}^m \\sum_{z^{(i)}}Q_i(z^{(i)}) \\log \\frac{p(x^{(i)},z^{(i)};\\Theta)}{Q_i(z^{(i)})} $$\n", 70 | "\n", 71 | "由于我们事先不知道$\\Theta$的值,所以后验分布$p(z^{(i)}|x^{(i)};\\Theta)$一开始不会估计的很准确,因此我们通过如下迭代的方式来逐渐更新$\\Theta$的估计\n", 72 | "\n", 73 | "Repeat until convergence { \n", 74 | "(E-step) For each i,set \n", 75 | "$$Q^{(t)}_i(z^{(i)}):=p(z^{(i)}|x^{(i)};\\Theta^{(t)})$$\n", 76 | "\n", 77 | "(M-step) Set\n", 78 | "\n", 79 | "$$\\Theta^{(t+1)}:=\\arg\\max_{\\Theta} \\sum_{i=1}^m \\sum_{z^{(i)}}Q^{(t)}_i(z^{(i)}) \\log \\frac{p(x^{(i)},z^{(i)};\\Theta)}{Q^{(t)}_i(z^{(i)})}$$\n", 80 | "\n", 81 | "$$t:=t+1$$\n", 82 | "\n", 83 | "}\n", 84 | "\n", 85 | "### 收敛性的证明\n", 86 | "\n", 87 | "EM算法是保证能收敛的,我们如何证明这个结论?实际上,我们只要证明似然函数在每次迭代后都是单调递增的,即$\\ell(\\Theta^{(t+1)})\\geq \\ell(\\Theta^{(t)})$。 \n", 88 | "证明过程实际上是很简单的,关键在于$Q^{(t)}_i(z^{(i)})$的选择,通过$Q^{(t)}_i(z^{(i)}):=p(z^{(i)}|x^{(i)};\\Theta^{(t)})$,我们保证了Jensen不等式的等号成立:\n", 89 | "$$\\ell(\\Theta^{(t)})=\\sum_{i=1}^m \\sum_{z^{(i)}}Q^{(t)}_i(z^{(i)}) \\log \\frac{p(x^{(i)},z^{(i)};\\Theta^{(t)})}{Q^{(t)}_i(z^{(i)})}$$\n", 90 | "又因为$\\Theta^{(t+1)}=\\arg\\max_{\\Theta} \\sum_{i=1}^m \\sum_{z^{(i)}}Q^{(t)}_i(z^{(i)}) \\log \\frac{p(x^{(i)},z^{(i)};\\Theta)}{Q^{(t)}_i(z^{(i)})}$,我们有\n", 91 | "\n", 92 | "$$\\begin{aligned}\\ell(\\Theta^{(t+1)})=\\sum_{i=1}^m \\log p(x^{(i)};\\Theta^{(t+1)})&\\geq \\sum_{i=1}^m \\sum_{z^{(i)}}Q^{(t)}_i(z^{(i)}) \\log \\frac{p(x^{(i)},z^{(i)};\\Theta^{(t+1)})}{Q^{(t)}_i(z^{(i)})}\\\\&\\geq \\sum_{i=1}^m \\sum_{z^{(i)}}Q^{(t)}_i(z^{(i)}) \\log \\frac{p(x^{(i)},z^{(i)};\\Theta^{(t)})}{Q^{(t)}_i(z^{(i)})} =\\ell(\\Theta^{(t)})\\end{aligned}$$\n" 93 | ] 94 | }, 95 | { 96 | "cell_type": "markdown", 97 | "metadata": {}, 98 | "source": [ 99 | "如果我们定义\n", 100 | "$$J(Q,\\Theta)=\\sum_{i=1}^m \\sum_{z^{(i)}}Q_i(z^{(i)}) \\log \\frac{p(x^{(i)},z^{(i)};\\Theta)}{Q_i(z^{(i)})} $$\n", 101 | "\n", 102 | "从前面的推导我们已经知道$\\ell(\\Theta)\\geq J(Q,\\Theta)$,那么EM算法可以看作是$J(Q,\\Theta)$上的坐标上升算法,E-step首先关于$Q$最大化$J$,在M-step中关于$\\Theta$优化$J$" 103 | ] 104 | } 105 | ], 106 | "metadata": { 107 | "kernelspec": { 108 | "display_name": "Python 2", 109 | "language": "python", 110 | "name": "python2" 111 | }, 112 | "language_info": { 113 | 
"codemirror_mode": { 114 | "name": "ipython", 115 | "version": 2 116 | }, 117 | "file_extension": ".py", 118 | "mimetype": "text/x-python", 119 | "name": "python", 120 | "nbconvert_exporter": "python", 121 | "pygments_lexer": "ipython2", 122 | "version": "2.7.13" 123 | } 124 | }, 125 | "nbformat": 4, 126 | "nbformat_minor": 2 127 | } 128 | -------------------------------------------------------------------------------- /PRML/Chap2-Probability-Distributions/2.2-multinomial-variables.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 2.2 多元变量\n", 8 | "\n", 9 | "上一节我们学习了二元变量的分布,其特点是结果只能有2种取值。但有时我们遇到的问题的结果数目不局限于两种,例如掷骰子就有6种互斥的可能结果。这一节我们将研究多元随机变量及其分布。 \n", 10 | "\n", 11 | "为了数学上方便,我们通常将多元变量表示为一个one-hot向量。假设随机变量的结果共有K种可能性,那么one-hot向量是一个K维向量,且其中只有一个元素$x_k=1$,其余位置上的元素均为0,以表示随机变量的结果为第k个状态。举个例子,比如$K=4$,且$x_2=1$,那么结果可以表示为:\n", 12 | "$$\\mathbf{x}=(0,1,0,0)^T$$\n", 13 | "## 分类分布\n", 14 | "我们用一个参数向量$\\boldsymbol{\\mu}=(\\mu_1,\\mu_2,...,\\mu_K)^T$来表示每一种可能状态的概率,则一个样本的离散概率分布为\n", 15 | "$$p(\\mathbf{x}|\\boldsymbol{\\mu})=\\prod_{k=1}^K \\mu_k^{x_k}$$\n", 16 | "其中$\\mu_k\\geq 0且\\sum_{k=1}^K\\mu_k=1$。我们将这样的分布称为分类分布(Categorical Distribution)。分类分布可以看做伯努利分布在多于2个结果情况下的推广。很容易证明这个分布已经是归一化的\n", 17 | "$$\\sum_{\\mathbf{x}}p(\\mathbf{x}|\\boldsymbol{\\mu})=\\sum_{k=1}^K\\mu_k=1$$\n", 18 | "并且\n", 19 | "$$\\mathbb{E}[\\mathbf{x}|\\boldsymbol{\\mu}]=\\sum_{\\mathbf{x}}p(\\mathbf{x}|\\boldsymbol{\\mu})\\mathbf{x}=(\\mu_1,...,\\mu_K)^T=\\boldsymbol{\\mu}$$\n", 20 | "\n", 21 | "## 多项式分布\n", 22 | "\n", 23 | "分类分布是多项式分布在样本数$N=1$时的特殊情况,当样本数$N>1$且各样本之间服从独立同分布假设时,记每种可能状态发生次数之和组成的向量为$\\boldsymbol{m}=(m_1,m_2,...,m_K)^T$,则$\\boldsymbol{m}$服从多项式分布,其概率分布为\n", 24 | "$$ Multi(m_1,m_2,...,m_K|\\boldsymbol{\\mu},N)=\\binom N {m_1m_2...m_K}\\prod_{k=1}^K \\mu_k^{m_k}$$\n", 25 | "其中$\\binom N {m_1m_2...m_K}=\\frac{N!}{m_1!m_2!...m_K!}$,注意到$m_k$满足约束\n", 26 | "$$ \\sum_{k=1}^K m_k=N$$\n", 27 | "\n", 28 | "我们目前学习了伯努利分布、二项式分布、分类分布、多项式分布,他们之间的区别可能有些容易混淆,因此我以一个例子来说明它们的区别: \n", 29 | "假设天上有$N$个球以概率$\\mu_1,\\mu_2,...,\\mu_K$独立且随机地掉落到地上的$K$个桶内,那么 \n", 30 | "- 当$N=1,K=2$时,我们得到伯努利分布\n", 31 | "- 当$N>1,K=2$时,我们得到二项式分布\n", 32 | "- 当$N=1,K>2$时,我们得到分类分布\n", 33 | "- 当$N>1,K>2$时,我们得到多项式分布 \n", 34 | "\n", 35 | "或者更直观点理解,就是:\n", 36 | "- 1个球落到2个桶——伯努利分布\n", 37 | "- N个球落到2个桶——二项式分布\n", 38 | "- 1个球落到K个桶——分类分布\n", 39 | "- N个球落到K个桶——多项式分布" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "### 多项式分布的最大似然估计\n", 47 | "现在让我们考虑一个由$N$个独立同分布样本组成的数据集$\\mathcal{D}=\\{\\mathbf{x}_1,...,\\mathbf{x}_N\\}$,则$\\mathcal{D}$的似然函数可以写为:\n", 48 | "$$p(\\mathcal{D}|\\boldsymbol{\\mu})=\\prod_{i=1}^N p(\\mathbf{x}_i|\\boldsymbol{\\mu})=\\prod_{i=1}^N \\prod_{k=1}^K \\mu_k^{x_{i,k}}= \\prod_{k=1}^K \\mu_k^{\\sum_{i=1}^N x_{i,k}}=\\prod_{k=1}^K \\mu_k^{m_k}$$\n", 49 | "其中$m_k=\\sum_{i=1}^N x_{i,k}$,表示第$k$个事件发生的次数。其对数似然为\n", 50 | "$$\\mathcal{L}(\\mathcal{D})=\\ln p(\\mathcal{D}|\\boldsymbol{\\mu})=\\sum_{k=1}^K m_k\\ln \\mu_k$$\n", 51 | "其中$\\sum_{k=1}^K \\mu_k=1$。 \n", 52 | "这是一个等式约束的优化问题,应用拉格朗日乘数法,写出拉格朗日目标函数\n", 53 | "$$\\mathcal{L}(\\boldsymbol{\\mu})=\\sum_{k=1}^K m_k\\ln \\mu_k-\\lambda(\\sum_{k=1}^K \\mu_k-1)$$\n", 54 | "关于$\\mu_k$求导\n", 55 | "$$\\frac{\\partial \\mathcal{L}(\\boldsymbol{\\mu})}{\\partial \\mu_k}=\\frac{m_k}{\\mu_k}-\\lambda =0$$\n", 56 | "得\n", 57 | "$$\\mu_k=\\frac{m_k}{\\lambda},(k=1,2,...,K)$$\n", 58 | "结合$\\sum_{k=1}^K \\mu_k=1$的事实,我们有\n", 59 | 
"$$\\frac{1}{\\lambda}\\sum_{k=1}^K m_k=\\frac{N}{\\lambda}=1\\rightarrow \\lambda=N$$\n", 60 | "因此$\\mu_k$的最大似然估计为\n", 61 | "$$\\mu^{ML}_k=\\frac{m_k}{N}$$" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": {}, 67 | "source": [ 68 | "### 狄利克雷分布\n", 69 | "通过观察多项式分布的形式,我们可以看出共轭先验具有如下形式\n", 70 | "$$p(\\boldsymbol{\\mu}|\\boldsymbol{\\alpha})\\propto \\prod_{k=1}^K \\mu_{k}^{\\alpha_k-1}$$\n", 71 | "其中$\\alpha_k$是该分布的参数,且$\\boldsymbol{\\alpha}=(\\alpha_1,...,\\alpha_K)^T$,必须注意的是$\\alpha_k$的和不一定为1。我们将其归一化,得到\n", 72 | "$$Dir(\\boldsymbol{\\mu}|\\boldsymbol{\\alpha})=\\frac{\\Gamma(\\alpha_0)}{\\Gamma(\\alpha_1)\\cdots\\Gamma(\\alpha_K)}\\prod_{k=1}^K \\mu_{k}^{\\alpha_k-1}=\\frac{1}{Z}\\prod_{k=1}^K \\mu_{k}^{\\alpha_k-1}$$\n", 73 | "这个分布称为狄利克雷分布(Dirichlet distribution)。其中$\\Gamma(x)$是伽马函数,$\\alpha_0=\\sum_{k=1}^K \\alpha_k$,$Z=\\frac{\\Gamma(\\alpha_1)\\cdots\\Gamma(\\alpha_K)}{\\Gamma(\\alpha_0)}$。" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": {}, 79 | "source": [ 80 | "狄利克雷分布的期望为\n", 81 | "$$\\mathbb{E}[\\mu_k]=\\frac{\\alpha_k}{\\alpha_0}$$\n", 82 | "证明过程见末尾的附录。 \n", 83 | "### 共轭后验分布\n", 84 | "我们将$\\boldsymbol{\\mu}$的先验$Dir(\\boldsymbol{\\mu}|\\boldsymbol{\\alpha})$乘以似然函数$p(\\mathcal{D}|\\boldsymbol{\\mu})$,得到$\\boldsymbol{\\mu}$的后验分布:\n", 85 | "$$p(\\boldsymbol{\\mu}|\\mathcal{D})\\propto Dir(\\boldsymbol{\\mu}|\\boldsymbol{\\alpha}) \\cdot p(\\mathcal{D}|\\boldsymbol{\\mu})\\propto \\prod_{k=1}^K \\mu_k^{(\\alpha_k+m_k)-1}$$\n", 86 | "$\\boldsymbol{\\mu}$的后验分布与先验具有相同的形式,经过类比可得\n", 87 | "$$\\begin{aligned}p(\\boldsymbol{\\mu}|\\mathcal{D})&=Dir(\\boldsymbol{\\mu}|\\boldsymbol{\\alpha}+\\boldsymbol{m})\\\\&=\\frac{\\Gamma(\\alpha_0+N)}{\\Gamma(\\alpha_1+m_1)\\cdots\\Gamma(\\alpha_K+m_K)}\\prod_{k=1}^K \\mu_{k}^{\\alpha_k+m_k-1}\\end{aligned}$$" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "## 附录\n", 95 | "\n", 96 | "$$\\begin{aligned}\\mathbb{E}[\\mu_j]&=\\int_{\\boldsymbol{\\mu}\\in A}\\mu_j Dir(\\boldsymbol{\\mu}|\\boldsymbol{\\alpha})d\\boldsymbol{\\mu}\\\\&=\\frac{\\Gamma(\\alpha_0)}{\\Gamma(\\alpha_1)\\cdots\\Gamma(\\alpha_K)}\\int_{\\boldsymbol{\\mu}\\in A}\\mu_j^{(\\alpha_j+1)-1}\\prod_{k=1,k\\neq j}^K\\mu_k^{\\alpha_k-1} d\\mu_k \\\\&=\\frac{\\Gamma(\\alpha_0)}{\\Gamma(\\alpha_1)\\cdots\\Gamma(\\alpha_K)}\\frac{\\Gamma(\\alpha_1)\\cdots\\Gamma(\\alpha_{j-1})\\Gamma(\\alpha_j+1)\\Gamma(\\alpha_{j+1})\\cdots\\Gamma(\\alpha_K)}{\\Gamma(\\alpha_0+1)}\\\\&=\\frac{\\Gamma(\\alpha_0)\\Gamma(\\alpha_j+1)}{\\Gamma(\\alpha_0+1)\\Gamma(\\alpha_j)}\\\\&=\\frac{\\alpha_j}{\\alpha_0}\\end{aligned}$$\n" 97 | ] 98 | } 99 | ], 100 | "metadata": { 101 | "anaconda-cloud": {}, 102 | "kernelspec": { 103 | "display_name": "Python [default]", 104 | "language": "python", 105 | "name": "python2" 106 | }, 107 | "language_info": { 108 | "codemirror_mode": { 109 | "name": "ipython", 110 | "version": 2 111 | }, 112 | "file_extension": ".py", 113 | "mimetype": "text/x-python", 114 | "name": "python", 115 | "nbconvert_exporter": "python", 116 | "pygments_lexer": "ipython2", 117 | "version": "2.7.12" 118 | } 119 | }, 120 | "nbformat": 4, 121 | "nbformat_minor": 1 122 | } 123 | -------------------------------------------------------------------------------- /Machine-Learning/softmax-crossentropy-derivative.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# softmax分类器\n", 8 | "\n", 9 | 
"softmax regression是logisitic regression在多分类问题上的推广。softmax可以认为是一组超平面的集合,每个超平面代表一个类,每个类的参数可以由*法向量*$\\mathbf{\\theta}$和*截距*$b$来描述。在分类时,我们计算每个类别的得分$\\exp(\\mathbf{\\theta}_k^Tx+b_k),k=1,2,...,C$,然后把样本分配到得分最高的类。通常我们将这个得分进行归一化转化为一个概率分布:\n", 10 | "$$P(Y=k|x,\\mathbf{\\Theta},\\mathbf{b})= \\frac{\\exp(\\mathbf{\\theta}_k^Tx+b_k)}{\\sum_{k=1}^C \\exp(\\mathbf{\\theta}_k^Tx+b_k)}$$\n", 11 | "\n", 12 | "Notations:\n", 13 | "* $x$:输入向量,$d\\times 1$列向量,$d$是feature数\n", 14 | "* $W=[w_1,w_2,...,w_c]^T$:权重矩阵,$c\\times d$矩阵,$c$是label数,每一行对应一个类的超平面法向量\n", 15 | "* $b$:每个类对应超平面的偏置组成的向量, $c\\times 1$列向量\n", 16 | "* $z=Wx+b$:线性分类器输出, $c\\times 1$列向量\n", 17 | "* $\\hat{y}$:softmax函数输出, $c\\times 1$列向量\n", 18 | "* 记$\\vec{e}_j=[0,...,1,...,0]^T\\in\\mathbb{R}^{c\\times 1}$,其中$1$出现在第$j$个位置\n", 19 | "* $1_c$表示一个全$1$的$c$维列向量\n", 20 | "* $y$:我们要拟合的目标变量,是一个one-hot vector(只有一个1,其余均为0),也是 $c\\times 1$列向量 。 我们将其转置,表示为一个列向量:$y=(0,...,1,...,0)^T$\n", 21 | "\n", 22 | "他们之间的关系:\n", 23 | "$$\\left\\{\\begin{aligned}&z=Wx+b\\\\& \\hat{y}=\\mathrm{softmax}(z)=\\frac{exp(z)}{1_c^Texp(z)} \\end{aligned}\\right.$$\n", 24 | "\n", 25 | "cross-entropy error定义为:\n", 26 | "$$ E = -y^Tlog(\\hat{y}) $$\n", 27 | "因为$y$是一个one-hot vector(即只有一个位置为1),假设$y_k=1$,那么上式等于$-log(\\hat{y}_k)=-log(\\frac{exp(z_k)}{\\sum\\limits_i exp(z_i)})=-z_k+log(\\sum\\limits_i exp(z_i))$\n", 28 | "\n", 29 | "\n", 30 | "依据chain rule有:\n", 31 | "$$ \\begin{aligned}\\frac{\\partial E}{\\partial W_{ij}}\n", 32 | "&=tr\\bigg(\\big(\\frac{\\partial E}{\\partial z}\\big)^T\\frac {\\partial z}{\\partial W_{ij}}\\bigg)\\\\\n", 33 | "&=tr\\bigg( \\big(\\color{red}{\\frac{\\partial \\hat{y}}{\\partial z}\\cdot\\frac{\\partial E}{\\partial \\hat{y}}}\\big)^T\\frac {\\partial z}{\\partial W_{ij}} \\bigg)\\end{aligned}$$\n", 34 | "注意上式中红色部分的相乘顺序,这里我用了`Denominator layout`,因此套用链式法则时要颠倒相乘顺序,原先是\n", 35 | "$$E\\to \\hat{y}, \\hat{y} \\to z$$\n", 36 | "用了`Denominator layout`后要反过来,变为了\n", 37 | "$$\\hat{y} \\to z , E\\to \\hat{y}$$\n", 38 | "\n" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "我们一个一个来求上式中的各个部分\n", 46 | "1. $\\frac{\\partial \\hat{y}}{\\partial z}$\n", 47 | "$$\\begin{equation}\\begin{aligned}\\frac{\\partial \\hat{y}}{\\partial z}&=\\frac{\\partial ( \\frac{exp(z)}{1_c^Texp(z)})}{\\partial z}\\\\&= \\frac{1}{1_c^Texp(z)}\\frac{\\partial exp(z)}{\\partial z}+ \\frac{\\partial (\\frac{1}{1_c^Texp(z)})}{\\partial z}( exp(z) )^T\\\\&= \\frac{1}{1_c^Texp(z)}diag(exp(z))-\\frac{1}{(1_c^Texp(z))^2}exp(z)exp(z)^T\\\\&=diag(\\frac{exp(z)}{1_c^Texp(z)})-\\frac{exp(z)}{1_c^Texp(z)}\\cdot (\\frac{exp(z)}{1_c^Texp(z)})^T\\\\&=diag(\\mathrm{ softmax}(z))- \\mathrm{ softmax}(z) \\mathrm{ softmax}(z)^T\\\\&=diag(\\hat{y})-\\hat{y}\\hat{y}^T \\end{aligned}\\end{equation}$$\n", 48 | "注:上述求导过程使用了`Denominator layout`。\n", 49 | "设$a=a( \\boldsymbol{ x}),\\boldsymbol{u}= \\boldsymbol{u}( \\boldsymbol{x}) $,这里$ \\boldsymbol{ x}$特意加粗表示是列向量,$a$没加粗表示是一个标量函数,$ \\boldsymbol{u}$加粗表示是一个向量函数。在`Numerator layout`下,$\\frac{\\partial a \\boldsymbol{u}}{ \\boldsymbol{x}}=a\\frac{\\partial \\boldsymbol{u}}{\\partial \\boldsymbol{x}}+ \\boldsymbol{u}\\frac{\\partial a}{\\partial \\boldsymbol{x}} $,而在`Denominator layout`下,则为$\\frac{\\partial a \\boldsymbol{u}}{\\partial \\boldsymbol{x}}=a\\frac{\\partial \\boldsymbol{u}}{\\partial \\boldsymbol{x}}+\\frac{\\partial a}{\\partial \\boldsymbol{x}} \\boldsymbol{u}^T$,对比可知上述推导用的实际是`Denominator layout`。\n", 50 | "以下推导均采用 Denominator layout,这样的好处是我们用梯度更新权重时不需要对梯度再转置。\n", 51 | "\n", 52 | "2. 
$\\frac{\\partial E}{\\partial \\hat{y}}$ \n", 53 | "同样应用反向链式法则有:\n", 54 | "$$\\begin{equation}\\frac{\\partial E}{\\partial \\hat{y}}=\\frac{\\partial log(\\hat{y})}{\\partial \\hat{y}}\\cdot \\frac{\\partial (-y^Tlog(\\hat{y}))}{\\partial log(\\hat{y})}=\\big(diag(\\hat{y})\\big)^{-1}\\cdot(-y)\\end{equation}$$\n", 55 | "\n", 56 | "3. $\\frac{\\partial z}{\\partial W_{ij}} $ \n", 57 | "$z$的第$k$个分量可以表示为:$z_k=\\sum\\limits_j W_{kj}x_j+b_k$,因此\n", 58 | "$$\\begin{equation}\\frac{\\partial z}{\\partial W_{ij}} =\\begin{bmatrix}\\frac{\\partial z_1}{\\partial W_{ij}}\\\\\\vdots\\\\\\frac{\\partial z_c}{\\partial W_{ij}}\\end{bmatrix}=\\underbrace{[0,\\cdots, x_j,\\cdots, 0]^T}_{\\mbox{第$i$个元素为$x_j$}}=x_j \\vec{e}_i\\end{equation}$$\n", 59 | "其中$x_j$是向量$x$的第$j$个元素,为标量,它出现在第$i$行。\n", 60 | "综合$(1),(2),(3)$,我们有\n", 61 | "$$ \\begin{aligned}\\frac{\\partial E}{\\partial W_{ij}}&=tr\\bigg(\\big( (diag(\\hat{y})-\\hat{y}\\hat{y}^T)\\cdot (diag(\\hat{y}))^{-1} \\cdot (-y) \\big)^T\\cdot x_j \\vec{e}_i \\bigg)\\\\&=tr\\bigg(\\big( \\hat{y}\\cdot (1_c^Ty)-y\\big)^T\\cdot x_j \\vec{e}_i \\bigg)\\\\&=(\\hat{y}-y)^T\\cdot x_j \\vec{e}_i=r_ix_j\\end{aligned}$$\n", 62 | "其中$r_i=(\\hat{y}-y)_i$表示残差向量的第$i$项\n", 63 | "\n", 64 | "我们可以把上式改写为矩阵形式:\n", 65 | "$$ \\frac{\\partial E}{\\partial W}=(\\hat{y}-y)\\cdot x^T $$\n", 66 | "同理可得\n", 67 | "$$ \\frac{\\partial E}{\\partial b}=(\\hat{y}-y) $$\n", 68 | "那么在进行随机梯度下降的时候,更新式就是:\n", 69 | "$$ \\begin{aligned}&W \\leftarrow W - \\lambda (\\hat{y}-y)\\cdot x^T \\\\&b \\leftarrow b - \\lambda (\\hat{y}-y)\\end{aligned}$$\n", 70 | "其中$\\lambda$是学习率" 71 | ] 72 | }, 73 | { 74 | "cell_type": "markdown", 75 | "metadata": {}, 76 | "source": [ 77 | "### 一些个人思考\n", 78 | "1. 如何理解每个超平面的得分$\\exp(\\mathbf{\\theta}_k^Tx+b_k)$? \n", 79 | "我的理解是$d=\\mathbf{\\theta}_k^Tx+b_k$表示点$x$到超平面$\\mathbf{\\theta}_k$,$b_k$的距离(并不完全严谨,类似SVM中的functional margin)。\n", 80 | "2. 
可以证明,如果损失函数是均方误差(MSE),那么它的梯度的更新式与交叉熵的更新式是相同的 \n", 81 | "\n", 82 | "### 参考 \n", 83 | "[Matrix Calculus Wiki](https://en.wikipedia.org/wiki/Matrix_calculus)" 84 | ] 85 | } 86 | ], 87 | "metadata": { 88 | "kernelspec": { 89 | "display_name": "Python 2", 90 | "language": "python", 91 | "name": "python2" 92 | }, 93 | "language_info": { 94 | "codemirror_mode": { 95 | "name": "ipython", 96 | "version": 2 97 | }, 98 | "file_extension": ".py", 99 | "mimetype": "text/x-python", 100 | "name": "python", 101 | "nbconvert_exporter": "python", 102 | "pygments_lexer": "ipython2", 103 | "version": "2.7.12" 104 | } 105 | }, 106 | "nbformat": 4, 107 | "nbformat_minor": 0 108 | } 109 | -------------------------------------------------------------------------------- /Machine-Learning/svd-ridge-regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 用SVD实现ridge regression\n", 8 | "\n", 9 | "今天在实现岭回归的时候发现了一个问题,我写好了如下的岭回归代码,接着我将正则化系数设为0(退化为线性回归),输出拟合曲线的系数,然后我又将其与线性回归的系数输出进行比较,发现两者不一致且相差很大。" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": 6, 15 | "metadata": { 16 | "collapsed": false 17 | }, 18 | "outputs": [ 19 | { 20 | "name": "stdout", 21 | "output_type": "stream", 22 | "text": [ 23 | "weights of ridge regression(lambda=0):\n", 24 | "w_0=-0.17, w_1=19.90, w_2=-141.49, w_3=485.66, w_4=-503.59, w_5=-1146.10, w_6=2537.15, w_7=-79.13, w_8=-2483.43, w_9=1311.17, \n", 25 | "weights of linear regression:\n", 26 | "w_0=-0.24, w_1=73.92, w_2=-1227.70, w_3=9413.51, w_4=-40494.25, w_5=106457.07, w_6=-175722.05, w_7=177370.64, w_8=-99533.63, w_9=23662.71, \n" 27 | ] 28 | } 29 | ], 30 | "source": [ 31 | "import numpy as np\n", 32 | "import matplotlib.pyplot as plt\n", 33 | "def polyfit(x, t, M):\n", 34 | " '''polynomial curve fitting\n", 35 | " # Arguments:\n", 36 | " x: vector of input variables\n", 37 | " t: targets of input variables\n", 38 | " M: degree of polynomial model\n", 39 | " # Returns:\n", 40 | " coefficients of the polynomial model\n", 41 | " '''\n", 42 | " X = np.array([[xx ** m for m in range(M+1)] for xx in x], dtype='float32')\n", 43 | " #return np.linalg.pinv(X).dot(t)\n", 44 | " # more accurate version, equivalent to pinv\n", 45 | " return np.linalg.lstsq(X, t)[0]\n", 46 | "\n", 47 | "M = 9\n", 48 | "_lambda = 0\n", 49 | "x = np.linspace(0, 1, 501)\n", 50 | "t = np.sin(2 * np.pi * x)\n", 51 | "# 生成0~1之间等间距的10个点\n", 52 | "n = 10\n", 53 | "x_tr = np.linspace(0, 1, n)\n", 54 | "t_tr = np.sin(2 * np.pi * x_tr) + np.random.normal(0, 0.3, n)\n", 55 | "Phi = np.array([[xx ** m for m in range(M+1)] for xx in x_tr], dtype='float32') \n", 56 | "w_ridge = np.linalg.solve(Phi.T.dot(Phi)+_lambda * np.eye(M+1), Phi.T.dot(t_tr))\n", 57 | "wgt_info = ''\n", 58 | "print 'weights of ridge regression(lambda={}):'.format(_lambda)\n", 59 | "for i, ww in enumerate(w_ridge):\n", 60 | " wgt_info += 'w_{}={:.2f}, '.format(i, ww)\n", 61 | "print wgt_info\n", 62 | "\n", 63 | "w_lr = polyfit(x_tr, t_tr, M)\n", 64 | "wgt_info = ''\n", 65 | "print 'weights of linear regression:'\n", 66 | "for i, ww in enumerate(w_lr):\n", 67 | " wgt_info += 'w_{}={:.2f}, '.format(i, ww)\n", 68 | "print wgt_info" 69 | ] 70 | }, 71 | { 72 | "cell_type": "markdown", 73 | "metadata": {}, 74 | "source": [ 75 | "对于这个问题我百思不得其解,于是求助于网络,经过一番查找,大概得到的答案是直接用solve函数求解在式子左边的矩阵接近奇异时会出现数值上的不稳定,解决方法是用奇异值分解过滤掉接近0的奇异值。" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": 
[ 82 | "回顾岭回归的求解公式:\n", 83 | "$$w^*=(X^\\top X+\\lambda I)^{-1}X^\\top y$$\n", 84 | "将矩阵$X$进行奇异值分解有:\n", 85 | "$$ X=U\\Sigma V^\\top$$\n", 86 | "我们要求解的是伪逆矩阵$X^\\dagger=(X^\\top X+\\lambda I)^{-1}X^\\top$,将分解式代入有:\n", 87 | "$$\\begin{aligned}(X^\\top X+\\lambda I)^{-1}X^\\top &=(V\\Sigma (U^\\top U)\\Sigma V^\\top+\\lambda I)^{-1}\\cdot V\\Sigma U^\\top\\\\&=(V\\Sigma^2 V^T+\\lambda V V^\\top)^{-1}\\cdot V\\Sigma U^\\top\\\\&=\\big(V(\\Sigma^2 +\\lambda I) V^\\top\\big)^{-1}\\cdot V\\Sigma U^\\top\\\\&=(V^\\top)^{-1}\\cdot(\\Sigma^2+\\lambda I)^{-1}\\cdot(V)^{-1}\\cdot V\\Sigma U^\\top\\\\&=V(\\Sigma^2+\\lambda I)^{-1}\\Sigma U^\\top\\\\&=V \\cdot diag(\\frac{\\sigma_1}{\\sigma_1^2+\\lambda},...,\\frac{\\sigma_k}{\\sigma_k^2+\\lambda})\\cdot U^\\top\\end{aligned}$$\n", 88 | "实现:" 89 | ] 90 | }, 91 | { 92 | "cell_type": "code", 93 | "execution_count": 8, 94 | "metadata": { 95 | "collapsed": true 96 | }, 97 | "outputs": [], 98 | "source": [ 99 | "def ridge_regression(x, t,_lambda, M):\n", 100 | " '''ridge regression\n", 101 | " # Arguments:\n", 102 | " x: vector of input variables\n", 103 | " t: targets of input variables\n", 104 | " M: degree of polynomial model\n", 105 | " # Returns:\n", 106 | " coefficients of the polynomial model \n", 107 | " '''\n", 108 | " Phi = np.array([[xx ** m for m in range(M+1)] for xx in x], dtype='float32')\n", 109 | " U, S, Vh = np.linalg.svd(Phi, full_matrices=False)\n", 110 | " Ut = U.T.dot(t)\n", 111 | " #return np.linalg.solve(Phi.T.dot(Phi) + _lambda * np.eye(M+1)), Phi.T.dot(t))\n", 112 | " return reduce(np.dot, [Vh.T, np.diag(S/(S**2 + _lambda)), Ut])" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "测试:" 120 | ] 121 | }, 122 | { 123 | "cell_type": "code", 124 | "execution_count": 9, 125 | "metadata": { 126 | "collapsed": false 127 | }, 128 | "outputs": [ 129 | { 130 | "name": "stdout", 131 | "output_type": "stream", 132 | "text": [ 133 | "weights of ridge regression(lambda=0):\n", 134 | "w_0=-0.24, w_1=73.92, w_2=-1227.70, w_3=9413.51, w_4=-40494.27, w_5=106457.13, w_6=-175722.14, w_7=177370.73, w_8=-99533.68, w_9=23662.72, \n", 135 | "weights of linear regression:\n", 136 | "w_0=-0.24, w_1=73.92, w_2=-1227.70, w_3=9413.51, w_4=-40494.25, w_5=106457.07, w_6=-175722.05, w_7=177370.64, w_8=-99533.63, w_9=23662.71, \n" 137 | ] 138 | } 139 | ], 140 | "source": [ 141 | "M = 9\n", 142 | "_lambda = 0\n", 143 | "w_ridge = ridge_regression(x_tr, t_tr, _lambda, M)\n", 144 | "wgt_info = ''\n", 145 | "print 'weights of ridge regression(lambda={}):'.format(_lambda)\n", 146 | "for i, ww in enumerate(w_ridge):\n", 147 | " wgt_info += 'w_{}={:.2f}, '.format(i, ww)\n", 148 | "print wgt_info\n", 149 | "\n", 150 | "w_lr = polyfit(x_tr, t_tr, M)\n", 151 | "wgt_info = ''\n", 152 | "print 'weights of linear regression:'\n", 153 | "for i, ww in enumerate(w_lr):\n", 154 | " wgt_info += 'w_{}={:.2f}, '.format(i, ww)\n", 155 | "print wgt_info" 156 | ] 157 | }, 158 | { 159 | "cell_type": "markdown", 160 | "metadata": {}, 161 | "source": [ 162 | "可以看到,当正则系数等于0时,svd版本的岭回归的结果和线性回归的结果基本一致。" 163 | ] 164 | } 165 | ], 166 | "metadata": { 167 | "kernelspec": { 168 | "display_name": "Python 2", 169 | "language": "python", 170 | "name": "python2" 171 | }, 172 | "language_info": { 173 | "codemirror_mode": { 174 | "name": "ipython", 175 | "version": 2 176 | }, 177 | "file_extension": ".py", 178 | "mimetype": "text/x-python", 179 | "name": "python", 180 | "nbconvert_exporter": "python", 181 | "pygments_lexer": "ipython2", 182 | "version": 
"2.7.12" 183 | } 184 | }, 185 | "nbformat": 4, 186 | "nbformat_minor": 0 187 | } 188 | -------------------------------------------------------------------------------- /Deep-Learning/back-propagation-in-matrix-form.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# BackPropagation算法的矩阵形式推导\n", 8 | "\n", 9 | "## 符号说明\n", 10 | "|符号|说明|\n", 11 | "|:|:|\n", 12 | "|$l$|神经网络的第$l$层,$1\\leq l\\leq L$|\n", 13 | "|$d^{(l)}$|第$l$层的节点个数|\n", 14 | "| $W^{(l)}$|$d^{(l)}\\times d^{(l-1)}$的权值矩阵,$W^{(l)}_{ij}$表示第$l-1$层节点$j$到第$l$层节点$i$的突触权值|\n", 15 | "| $x^{(l)}$|第$l$层每个节点输出组成的向量,是一个$d^{(l)}\\times 1$的列向量|\n", 16 | "|$s^{(l)}$|第$l$层每个节点输入组成的向量,同上|\n", 17 | "|$b^{(l)}$|第$l$层每个节点的偏置,同上|\n", 18 | "| $\\sigma(x)$|_sigmoid_函数,$\\sigma(x)=1/(1+exp(-x))$,它的导数是$\\sigma^\\prime(x)=\\sigma(x)(1-\\sigma(x))$|\n", 19 | "| $n$|一个mini-batch上的样本数|\n", 20 | "|$y_i$|第i个样本的标签,为one-hot向量|\n", 21 | "|$\\hat{y}_i$|第i个样本的softmax输出,其维度与$y_i$相同|\n", 22 | "| $e_i$|当输入为$(x_i,y_i)$时的损失函数|\n", 23 | "|$\\delta^{(l)}$|第$l$层的敏感度,定义为$\\delta^{(l)}={\\partial e}/{\\partial s^{(l)}}$|" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "## 前馈过程和目标函数\n", 31 | "神经网络的前馈公式:\n", 32 | "$$ \\left\\{ \\begin{aligned} & s^{(l)} = W^{(l)} \\cdot x^{(l-1)} + b^{(l)}\\\\ & x^{(l)} = \\sigma(s^{(l)}) \\end{aligned} \\right. $$\n", 33 | "其中$\\sigma(\\cdot)$是sigmoid或softmax函数。 \n", 34 | "一般对于多分类的场合,神经网络的目标是最小化交叉熵误差。特别地,对于样本$(x_i,y_i)$来说:\n", 35 | "$$ e_i = -y_i^T \\log(\\hat{y}_i)$$\n", 36 | "\n", 37 | "通常为了加快神经网络的运算速度,我们采用mini-batch sgd的方式来更新权值,mini-batch版本的损失函数为:\n", 38 | "$$ E = \\frac{1}{n} \\sum_{i=1}^n e_i=-\\frac{1}{n} \\sum_{i=1}^n y_i^T \\log(\\hat{y}_i)$$\n", 39 | "其中,$n$为mini-batch中的样本数,$\\hat{y}_i=softmax(W^{(L)}x_i^{(L-1)}+ b^{(L)})$是神经网络的输出" 40 | ] 41 | }, 42 | { 43 | "cell_type": "markdown", 44 | "metadata": {}, 45 | "source": [ 46 | "## BP算法推导\n", 47 | "BP算法的关键是计算$\\delta^{(l)}=\\frac{\\partial E}{\\partial s^{(l)}}$,其含义是最终误差对于第$l$层神经元的敏感程度。 \n", 48 | "### 1.输出层的敏感度$\\frac{\\partial E}{\\partial s^{(L)}}$ \n", 49 | "首先列出各变量间的关系\n", 50 | "$$\\left\\{\\begin{aligned}&s^{(L)}=W^{(L)}x^{(L-1)}+b^{(L)}\\\\& \\hat{y}=\\mathrm{softmax}(s^{(L)})=\\frac{exp(s^{(L)})}{1_c^Texp(s^{(L)})} \\end{aligned}\\right.$$\n", 51 | "依据chain rule有:\n", 52 | "$$ \\begin{aligned}\\frac{\\partial E}{\\partial s^{(L)}}\n", 53 | "&=\\frac {\\partial \\hat{y}}{\\partial s^{(L)}}\\frac{\\partial E}{\\partial \\hat{y}}\\end{aligned}$$\n", 54 | "注意上式的相乘顺序,这里我用了`Denominator layout`,因此套用链式法则时要颠倒相乘顺序,原先是\n", 55 | "$$E\\to \\hat{y}, \\hat{y} \\to s^{(L)}$$\n", 56 | "用了`Denominator layout`后要反过来,变为了\n", 57 | "$$\\hat{y} \\to s^{(L)} , E\\to \\hat{y}$$\n", 58 | "1)$\\frac{\\partial \\hat{y}}{\\partial s^{(L)}}$ \n", 59 | "$$\\begin{equation}\\begin{aligned}\\frac{\\partial \\hat{y}}{\\partial s^{(L)}}&=\\frac{\\partial ( \\frac{exp(s^{(L)})}{1_c^Texp(s^{(L)})})}{\\partial s^{(L)}}\\\\&= \\frac{1}{1_c^Texp(s^{(L)})}\\frac{\\partial exp(s^{(L)})}{\\partial s^{(L)}}+ \\frac{\\partial (\\frac{1}{1_c^Texp(s^{(L)})})}{\\partial s^{(L)}}( exp(s^{(L)}) )^T\\\\&= \\frac{1}{1_c^Texp(s^{(L)})}diag(exp(s^{(L)}))-\\frac{1}{(1_c^Texp(s^{(L)}))^2}exp(s^{(L)})exp(s^{(L)})^T\\\\&=diag(\\frac{exp(s^{(L)})}{1_c^Texp(s^{(L)})})-\\frac{exp(s^{(L)})}{1_c^Texp(s^{(L)})}\\cdot (\\frac{exp(s^{(L)})}{1_c^Texp(s^{(L)})})^T\\\\&=diag(\\mathrm{ softmax}(s^{(L)}))- \\mathrm{ softmax}(s^{(L)}) \\mathrm{ 
softmax}(s^{(L)})^T\\\\&=diag(\\hat{y})-\\hat{y}\\hat{y}^T \\end{aligned}\\end{equation}$$\n", 60 | "2)$\\frac{\\partial E}{\\partial \\hat{y}}$ \n", 61 | "同样应用反向链式法则有:\n", 62 | "$$\\begin{equation}\\frac{\\partial E}{\\partial \\hat{y}}=\\frac{\\partial log(\\hat{y})}{\\partial \\hat{y}}\\cdot \\frac{\\partial (-y^Tlog(\\hat{y}))}{\\partial log(\\hat{y})}=\\big(diag(\\hat{y})\\big)^{-1}\\cdot(-y)\\end{equation}$$\n", 63 | "\n", 64 | "综合以上,有\n", 65 | "$$ \\begin{aligned}\\frac{\\partial E}{\\partial s^{(L)}}\n", 66 | "&=\\frac {\\partial \\hat{y}}{\\partial s^{(L)}}\\frac{\\partial E}{\\partial \\hat{y}}\\\\&=\\bigg(diag(\\hat{y})-\\hat{y}\\hat{y}^T\\bigg) \\big(diag(\\hat{y})\\big)^{-1}\\cdot(-y)\\\\&=-y+\\hat{y}(\\mathbf{1}^Ty)\\\\&=\\hat{y}-y\\end{aligned}$$\n", 67 | "\n", 68 | "其中$\\hat{y}^T\\cdot diag(\\hat{y})^{-1}=\\mathbf{1}^T=(1,1,...,1)^T$表示一个全为1的向量,因为$y$是一个one hot向量,因此$\\mathbf{1}^Ty=\\sum_{i=1}^n y_i=1$ \n", 69 | "因此我们知道:\n", 70 | ">$$\\delta^{(L)}=\\frac{\\partial E}{\\partial s^{(L)}}=\\hat{y}-y$$\n", 71 | "\n" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "### 2.隐层的敏感度\n", 79 | "依据链式法则,有 \n", 80 | "$$ \\begin{aligned}\\delta^{(l)}=\\frac{\\partial E}{\\partial s^{(l)}} = &( \\frac{\\partial E}{\\partial s^{(l+1)}} \\cdot \\frac{\\partial s^{(l+1)}}{\\partial x^{(l)}} \\cdot\\frac{\\partial x^{(l)}}{\\partial s^{(l)}} ) ^T\\\\ = &\\{ (\\delta^{(l+1)})^T \\cdot W^{(l+1)} \\cdot diag(\\sigma^\\prime(s^{(l)}))\\}^T\\\\=& diag(\\sigma^\\prime(s^{(l)}))\\cdot (W^{(l+1)})^T\\delta^{(l+1)}\\\\=& \\sigma^\\prime(s^{(l)})\\circ ((W^{(l+1)})^T\\delta^{(l+1)})\\end{aligned}$$\n", 81 | "其中$\\circ$是向量点积(或Hadamard product,element-wise product)。于是我们有了误差敏感度的递推公式:\n", 82 | "\n", 83 | ">$$ \\begin{aligned}\\delta^{(l)} &= \\sigma^\\prime(s^{(l)})\\circ ((W^{(l+1)})^T\\delta^{(l+1)})\\\\&= \\sigma(s^{(l)})\\circ (1-\\sigma(s^{(l)}))\\circ ((W^{(l+1)})^T\\delta^{(l+1)})\\end{aligned}$$\n", 84 | "\n", 85 | "我们发现,敏感度满足一个递推式,当前层的敏感度依赖于下一层的敏感度。在进行反向传播时,首先我们需要先计算最后一层(即输出层)的敏感度$\\delta^{(L)}$,再回传给$L-1$层,计算$\\delta^{(L-1)}$,接着按照从后向前的顺序依次遍历剩余所有层,计算$\\delta^{(l)}$,这个过程可以表示为:\n", 86 | "$$\\delta^{(L)}\\to \\delta^{(L-1)}\\to \\delta^{(L-2)}\\to\\cdots\\delta^{(1)}$$" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "## mini-batch sgd\n", 94 | "前面的推导均是基于单样本的sgd,当以一个mini-batch作为输入时,梯度应该为这个batch上的样本的梯度的平均值,即\n", 95 | "$$ \\nabla{W}^{(l)}=\\frac{\\partial E}{\\partial W^{(l)}}=\\frac{1}{n}\\sum_{i=1}^n \\delta_i^{(l)}\\otimes x_i^{(l-1)}$$\n", 96 | "$$\\nabla{b}^{(l)}=\\frac{\\partial E}{\\partial b^{(l)}} = \\frac{1}{n}\\sum_{i=1}^n \\delta_i^{(l)}$$\n", 97 | "其中$\\otimes$表示向量外积 \n", 98 | "梯度下降的更新式为:\n", 99 | "$$ W^{(l)}\\longleftarrow W^{(l)} - \\lambda \\nabla{W}^{(l)}$$\n", 100 | "$$b^{(l)}\\longleftarrow b^{(l)} - \\lambda\\nabla{b}^{(l)}$$\n", 101 | "其中$\\lambda>0$为学习率" 102 | ] 103 | } 104 | ], 105 | "metadata": { 106 | "anaconda-cloud": {}, 107 | "kernelspec": { 108 | "display_name": "Python [default]", 109 | "language": "python", 110 | "name": "python2" 111 | }, 112 | "language_info": { 113 | "codemirror_mode": { 114 | "name": "ipython", 115 | "version": 2 116 | }, 117 | "file_extension": ".py", 118 | "mimetype": "text/x-python", 119 | "name": "python", 120 | "nbconvert_exporter": "python", 121 | "pygments_lexer": "ipython2", 122 | "version": "2.7.12" 123 | } 124 | }, 125 | "nbformat": 4, 126 | "nbformat_minor": 0 127 | } 128 | -------------------------------------------------------------------------------- 
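The matrix-form equations derived in the back-propagation notebook above — $\delta^{(L)}=\hat{y}-y$, the recursion $\delta^{(l)}=\sigma'(s^{(l)})\circ((W^{(l+1)})^T\delta^{(l+1)})$, and the mini-batch averaged gradients — translate almost line-for-line into NumPy. The following minimal sketch assumes a single hidden layer with sigmoid units and a softmax output; the layer sizes, batch size, learning rate, and random synthetic data are assumptions chosen purely for illustration, not the notebook's own code.

```python
# Minimal NumPy sketch of matrix-form back-propagation (illustrative only).
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def softmax(s):
    e = np.exp(s - s.max(axis=0, keepdims=True))   # stabilized column-wise softmax
    return e / e.sum(axis=0, keepdims=True)

np.random.seed(0)
d_in, d_hid, d_out, n = 4, 8, 3, 32                # assumed layer sizes / batch size
lr = 0.5                                           # assumed learning rate

X = np.random.randn(d_in, n)                       # columns are samples (column-vector convention)
Y = np.eye(d_out)[:, np.random.randint(d_out, size=n)]   # random one-hot labels

W1, b1 = 0.1 * np.random.randn(d_hid, d_in), np.zeros((d_hid, 1))
W2, b2 = 0.1 * np.random.randn(d_out, d_hid), np.zeros((d_out, 1))

for step in range(500):
    # forward: s1 = W1 x + b1, x1 = sigma(s1), y_hat = softmax(W2 x1 + b2)
    X1 = sigmoid(W1.dot(X) + b1)
    Yhat = softmax(W2.dot(X1) + b2)
    # backward: output-layer sensitivity and hidden-layer recursion
    D2 = Yhat - Y                                  # delta^(L) = y_hat - y, one column per sample
    D1 = X1 * (1.0 - X1) * W2.T.dot(D2)            # delta^(l) = sigma'(s) o (W^T delta)
    # mini-batch averaged gradients: (1/n) sum_i delta_i x_i^T and (1/n) sum_i delta_i
    W2 -= lr * D2.dot(X1.T) / n
    b2 -= lr * D2.mean(axis=1, keepdims=True)
    W1 -= lr * D1.dot(X.T) / n
    b1 -= lr * D1.mean(axis=1, keepdims=True)

Yhat = softmax(W2.dot(sigmoid(W1.dot(X) + b1)) + b2)
print("cross-entropy after training: %.4f" % (-np.mean(np.sum(Y * np.log(Yhat + 1e-12), axis=0))))
```

Stacking the samples as columns keeps every shape identical to the per-sample derivation, so the batched product $D^{(l)}(X^{(l-1)})^\top/n$ is exactly the average of the outer products $\delta_i^{(l)}\otimes x_i^{(l-1)}$ used in the update rule.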
/Deep-Learning/theano-notes/part3-shared-variable.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# theano共享变量\n", 8 | "上节介绍了theano的符号变量及计算图的运行机制,本节将介绍另一个重要概念——共享变量。与符号变量稍有不同的是,共享变量有明确的值,可以被get/set,并且在使用它的函数间共享。" 9 | ] 10 | }, 11 | { 12 | "cell_type": "code", 13 | "execution_count": 1, 14 | "metadata": { 15 | "collapsed": false 16 | }, 17 | "outputs": [ 18 | { 19 | "name": "stderr", 20 | "output_type": "stream", 21 | "text": [ 22 | "Using gpu device 0: GeForce GTX 960M (CNMeM is disabled, cuDNN not available)\n" 23 | ] 24 | } 25 | ], 26 | "source": [ 27 | "from __future__ import print_function\n", 28 | "\n", 29 | "import numpy as np\n", 30 | "import theano\n", 31 | "import theano.tensor as T" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "共享变量的类型可以根据初始化方式自动推导" 39 | ] 40 | }, 41 | { 42 | "cell_type": "code", 43 | "execution_count": 2, 44 | "metadata": { 45 | "collapsed": false 46 | }, 47 | "outputs": [ 48 | { 49 | "name": "stdout", 50 | "output_type": "stream", 51 | "text": [ 52 | "\n" 53 | ] 54 | } 55 | ], 56 | "source": [ 57 | "shared_var = theano.shared(np.array([[1, 2], [3, 4]], dtype=theano.config.floatX))\n", 58 | "print(shared_var.type())" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "metadata": {}, 64 | "source": [ 65 | "我们可以用set_value设置它的值,用get_value获取它的值" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 3, 71 | "metadata": { 72 | "collapsed": false 73 | }, 74 | "outputs": [ 75 | { 76 | "name": "stdout", 77 | "output_type": "stream", 78 | "text": [ 79 | "[[ 3. 4.]\n", 80 | " [ 2. 1.]]\n" 81 | ] 82 | } 83 | ], 84 | "source": [ 85 | "shared_var.set_value(np.array([[3, 4], [2, 1]], dtype=theano.config.floatX))\n", 86 | "print(shared_var.get_value())" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "与符号变量类似的地方是,我们可以为它定义函数关系。注意到shared_var已经有确切的值,所以它不必作为函数输入参数的一部分,因为theano已经隐式地将其作使用表达式shared_squared的函数的输入。" 94 | ] 95 | }, 96 | { 97 | "cell_type": "code", 98 | "execution_count": 4, 99 | "metadata": { 100 | "collapsed": false 101 | }, 102 | "outputs": [ 103 | { 104 | "name": "stdout", 105 | "output_type": "stream", 106 | "text": [ 107 | "[[ 9. 16.]\n", 108 | " [ 4. 1.]]\n" 109 | ] 110 | } 111 | ], 112 | "source": [ 113 | "shared_squared = shared_var**2\n", 114 | "function_1 = theano.function([], shared_squared)\n", 115 | "print(function_1())" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "metadata": {}, 121 | "source": [ 122 | "## updates\n", 123 | "可以使用`theano.function`的`updates`参数来自动更新共享变量的值" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 5, 129 | "metadata": { 130 | "collapsed": false 131 | }, 132 | "outputs": [ 133 | { 134 | "name": "stdout", 135 | "output_type": "stream", 136 | "text": [ 137 | "shared_var before subtracting [[1, 1], [1, 1]] using function_2:\n", 138 | "[[ 3. 4.]\n", 139 | " [ 2. 1.]]\n", 140 | "shared_var after calling function_2:\n", 141 | "[[ 2. 3.]\n", 142 | " [ 1. 0.]]\n", 143 | "New output of function_1() (shared_var**2):\n", 144 | "[[ 4. 9.]\n", 145 | " [ 1. 
0.]]\n" 146 | ] 147 | } 148 | ], 149 | "source": [ 150 | "subtract = T.matrix('subtract')\n", 151 | "# updates以一个字典为输入,字典的键位共享变量,值为更新的表达式\n", 152 | "# 这里updates被设置为shared_var = shared_var - subtract\n", 153 | "function_2 = theano.function([subtract], shared_var, updates={shared_var: shared_var - subtract})\n", 154 | "print(\"shared_var before subtracting [[1, 1], [1, 1]] using function_2:\")\n", 155 | "print(shared_var.get_value())\n", 156 | "# 减去矩阵[[1, 1], [1, 1]]\n", 157 | "function_2(np.array([[1, 1], [1, 1]], dtype=theano.config.floatX))\n", 158 | "print(\"shared_var after calling function_2:\")\n", 159 | "print(shared_var.get_value())\n", 160 | "# 注意到共享变量被所有使用它的函数共享, 由于它的值已经改变, 因此function_1()的输出也将改变\n", 161 | "print(\"New output of function_1() (shared_var**2):\")\n", 162 | "print(function_1())" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "## 如何获取共享变量的shape\n", 170 | "共享变量的shape是个`TensorVariable`,如果直接print的话是无法查看具体的值的" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 6, 176 | "metadata": { 177 | "collapsed": false 178 | }, 179 | "outputs": [ 180 | { 181 | "name": "stdout", 182 | "output_type": "stream", 183 | "text": [ 184 | "type of shared_var:\n", 185 | "shared_var.shape:Shape.0\n" 186 | ] 187 | } 188 | ], 189 | "source": [ 190 | "print(\"type of shared_var:{}\".format(type(shared_var.shape)))\n", 191 | "print(\"shared_var.shape:{}\".format(shared_var.shape))" 192 | ] 193 | }, 194 | { 195 | "cell_type": "markdown", 196 | "metadata": {}, 197 | "source": [ 198 | "定义一个函数来查看shape的过程太过繁琐,解决的办法是使用eval函数" 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": 7, 204 | "metadata": { 205 | "collapsed": false 206 | }, 207 | "outputs": [ 208 | { 209 | "name": "stdout", 210 | "output_type": "stream", 211 | "text": [ 212 | "shared_var.shape.eval():[2 2]\n" 213 | ] 214 | } 215 | ], 216 | "source": [ 217 | "print(\"shared_var.shape.eval():{}\".format(shared_var.shape.eval()))" 218 | ] 219 | }, 220 | { 221 | "cell_type": "markdown", 222 | "metadata": {}, 223 | "source": [ 224 | "## 共享变量构造器参数`borrow`的作用\n", 225 | "borrow参数指定了在定义共享变量时是深拷贝还是浅拷贝。如果`borrow`为`False`,则是深拷贝,反之则为浅拷贝。接下来我们定义三个共享变量,分别对应`borrow`取默认值,False,True" 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": 8, 231 | "metadata": { 232 | "collapsed": true 233 | }, 234 | "outputs": [], 235 | "source": [ 236 | "np_array = np.ones(2, dtype='float32')\n", 237 | "\n", 238 | "s_default = theano.shared(np_array)\n", 239 | "s_false = theano.shared(np_array, borrow=False)\n", 240 | "s_true = theano.shared(np_array, borrow=True)" 241 | ] 242 | }, 243 | { 244 | "cell_type": "markdown", 245 | "metadata": {}, 246 | "source": [ 247 | "接着我们改变变量`np_array`" 248 | ] 249 | }, 250 | { 251 | "cell_type": "code", 252 | "execution_count": 9, 253 | "metadata": { 254 | "collapsed": false 255 | }, 256 | "outputs": [ 257 | { 258 | "name": "stdout", 259 | "output_type": "stream", 260 | "text": [ 261 | "[ 1. 1.]\n", 262 | "[ 1. 1.]\n", 263 | "[ 1. 
1.]\n" 264 | ] 265 | } 266 | ], 267 | "source": [ 268 | "np_array += 1 # now it is an array of 2.0 s\n", 269 | "\n", 270 | "print(s_default.get_value())\n", 271 | "print(s_false.get_value())\n", 272 | "print(s_true.get_value())" 273 | ] 274 | }, 275 | { 276 | "cell_type": "markdown", 277 | "metadata": {}, 278 | "source": [ 279 | "由结果我们看到,当`borrow`取默认值时,使用的初始化方式是深拷贝;当`borrow=False`时使用的方式也是深拷贝;当`borrow=True`时,要分情况来讨论,如果运行代码的设备是CPU,那么初始化方式将是浅拷贝,输出结果则是[2, 2], 如果设备是GPU,那么`borrow`将失去作用,也就是说不管它是`True`还是`False`,初始化方式都将是深拷贝(因为需要将数据从CPU拷贝到GPU来运行)。" 280 | ] 281 | } 282 | ], 283 | "metadata": { 284 | "kernelspec": { 285 | "display_name": "Python 2", 286 | "language": "python", 287 | "name": "python2" 288 | }, 289 | "language_info": { 290 | "codemirror_mode": { 291 | "name": "ipython", 292 | "version": 2 293 | }, 294 | "file_extension": ".py", 295 | "mimetype": "text/x-python", 296 | "name": "python", 297 | "nbconvert_exporter": "python", 298 | "pygments_lexer": "ipython2", 299 | "version": "2.7.12" 300 | } 301 | }, 302 | "nbformat": 4, 303 | "nbformat_minor": 0 304 | } 305 | -------------------------------------------------------------------------------- /CS229/RL2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 强化学习(2)\n", 8 | "上一篇介绍了强化学习的基本概念,并引入了状态值函数的定义,这一篇主要讲述估计状态值函数的三种方法:动态规划、蒙特卡洛、时间差分。\n", 9 | "我们先回顾值函数的形式:\n", 10 | "$$V^{\\pi}(s) = E_\\pi[\\sum\\limits_{k=0}^{\\infty}\\gamma^k r_{t+k+1}|s_t=s]$$\n", 11 | "\n", 12 | "## 1.动态规划\n", 13 | "利用贝尔曼方程,我们可以证明值函数具有递推性。\n", 14 | "$$ \\begin{aligned}V^\\pi(s)&=E_\\pi[\\sum\\limits_{k=0}^{\\infty}\\gamma^k r_{t+k+1}|s_t=s]\\\\&=\\sum_{a}\\pi(s,a)E_a\\{r_{t+1}+\\gamma \\sum_{k=0}^\\infty \\gamma^k r_{t+k+2}|s_t=s\\}\\\\&=\\sum_{a}\\pi(s,a) \\sum_{s'}P(s,a,s')E\\{r_{t+1}+\\gamma \\sum_{k=0}^\\infty \\gamma^k r_{t+k+2}|s_{t+1}=s'\\}\\\\&=\\sum_{a}\\pi(s,a) \\sum_{s'}P(s,a,s')\\{R(s,a,s')+\\gamma E_\\pi[\\sum_{k=0}^\\infty \\gamma^k r_{t+k+2}|s_{t+1}=s']\\}\\\\&=\\sum_{a}\\pi(s,a) \\sum_{s'}P(s,a,s')\\{R(s,a,s')+\\gamma V^\\pi(s')\\}\\end{aligned}$$\n", 15 | "\n", 16 | "上式中,$E_a[\\cdot|s_t=s]$也可以写为$E[\\cdot|s_t=s,a]$。\n", 17 | "\n", 18 | "如果状态空间有限(假设为$|S|$),我们可以写下$|S|$个方程组,每个方程有$|S|$个变量,求解这个方程组就可求得所有状态的值函数。\n", 19 | "\n", 20 | "同样我们可以定义一个最优值函数(optimal value function)\n", 21 | "$$ V^*(s)=\\max_\\pi V^\\pi(s)$$\n", 22 | "$V^*(s)$表示所有可能策略里能够得到的$V^\\pi(s)$的最大值。最优值函数也有一个对应的贝尔曼方程版本:\n", 23 | "$$ \\begin{aligned}V^*(s)&=\\max_a \\sum_{s'\\in S}P(s,a,s')\\{R(s,a,s')+\\gamma V^\\pi(s')\\}\\\\&=\\sum_{s'\\in S}P(s,a,s')\\{R(s,a,s')+\\gamma \\max_a V^\\pi(s')\\}\\\\&=\\sum_{s'\\in S}P(s,a,s')\\{R(s,a,s')+\\gamma V^*(s')\\}\\end{aligned}$$\n", 24 | "我们可以根据估计出的最优值函数制定最优的策略(由于$R(s)$是常数,可以丢掉)\n", 25 | "$$\\label{policy}\\begin{equation}\\pi^*(s)=\\arg\\max\\limits_{a\\in A}\\sum_{s'\\in S}P_{sa}(s')V^*(s')\\end{equation}$$\n", 26 | "\n", 27 | "$\\pi^*$满足如下性质:\n", 28 | "\n", 29 | "$$V^*(s)=V^{\\pi^*}(s)\\geq V^\\pi(s)$$\n", 30 | "\n", 31 | "注意到$\\pi^*(s)$是所有状态下最优的策略,也就是说不论从以哪个状态做起始,都能按照它执行。\n", 32 | "\n", 33 | "#### 值迭代(value iteration)\n", 34 | "前面介绍了一种解方程组求解$V^\\pi(s)$的方法,如果用克拉默法则,求解时间为$O(|S|!)$;如果用高斯消元法时间复杂度为$O(|S|^3)$;目前最快的算法时间复杂度为$O(|S|^{2.376})$。当状态空间较大时,求解线性方程组效率较低。\n", 35 | "\n", 36 | "这里介绍另一种高效的算法:值迭代。该算法的步骤如下:\n", 37 | "1. 对每个状态$s$,初始化$V(s):=0$\n", 38 | "2. 
重复直到收敛 {\n", 39 | " 对每个状态,更新: $V(s)=R(s)+\\max\\limits_{a\\in A} \\sum\\limits_{s'\\in S}P_{sa}(s')V(s')$\n", 40 | "}\n", 41 | "\n", 42 | "更新策略有两种:\n", 43 | "* 同步更新:一次性计算所有状态的新值,再覆盖旧值\n", 44 | "* 异步更新:循环过程中一次更新一个状态的值\n", 45 | "\n", 46 | "可以证明无论是同步更新还是异步更新,$V(s)$最后都会收敛于$V^*(s)$(具体怎么分析的,我还需要看更多文献,这里只是照搬结论)。有了收敛后的$V(s)$,我们可以用$\\eqref{policy}$制定策略\n", 47 | "时间复杂度分析:\n", 48 | "设收敛时的迭代次数为$T$,那么值迭代的时间复杂度为$O(T|S|^2|A|)$。在一些场合下比如机器人寻路,可执行的动作只有前后左右四种,动作空间远低于状态空间,这种情况下值迭代远比高斯消元法高效。\n", 49 | "\n", 50 | "#### 策略迭代(policy iteration)\n", 51 | "前面介绍的值迭代是基于更新值函数得到策略的,这里介绍一种更新策略的方法——策略迭代。\n", 52 | "策略迭代算法步骤如下:\n", 53 | "1. 随机初始化$\\pi$\n", 54 | "2. 重复直至收敛 {\n", 55 | "(a) 令$V:=V^\\pi$\n", 56 | "(b) 对每个状态$s$,计算$\\pi(s):=\\arg\\max\\limits_{a\\in A}\\sum\\limits_{s'}P_{sa}(s')V(s')$\n", 57 | "}\n", 58 | "\n", 59 | "循环中我们首先为当前的策略计算一个值函数,接着再用这个值函数更新策略。注意到(a)步骤中我们可以通过求解一个$|S|$元线性方程组得到值函数。同样可以证明,经过有限次循环,策略迭代算法能够收敛。\n", 60 | "时间复杂度分析:\n", 61 | "一般情况下,策略迭代可以收敛于多项式时间$O(T(|S|^3+|S|^2|A|))$。考虑到所有的策略有$|A|^{|S|}$种可能,故最坏情况下策略迭代要指数时间才能收敛$O(|A|^{|S|})$。\n", 62 | "\n", 63 | "#### 值迭代和策略迭代的对比\n", 64 | "值迭代的输出是每个状态的最优值函数,再用这个最优值函数生成策略函数(时间复杂度为$O(|S||A|)$);策略迭代则同时得到最优值函数和最优策略。\n", 65 | "当状态空间$|S|$比较大,动作空间$|A|$较小时,策略函数要求解一个线性方程组的代价太高,因此我们倾向于使用值迭代。当动作空间$|A|$较大,状态空间$|S|$比较小时,策略迭代求解线性方程组十分快速,可以很快收敛。\n", 66 | "考虑到实际中的问题大多都是状态比较多的,因此比较常用的是值迭代算法。\n", 67 | "\n", 68 | "## 2.蒙特卡洛\n", 69 | "蒙特卡洛也称为统计模拟方法,是一种用随机实验解决数值计算问题的方法。蒙特卡洛方法提出于20世纪40年代的二战时期,被老美用来计算原子弹的弹道。数学家冯·诺伊曼用驰名世界的赌城—摩纳哥的Monte Carlo—来命名这种方法,为它蒙上了一层神秘色彩。在这之前蒙特卡洛方法就已经存在。1777年,法国数学家布丰(Georges Louis Leclere de Buffon,1707—1788)提出用投针实验的方法求圆周率$\\pi$。这被认为是蒙特卡洛方法的起源。\n", 70 | "\n", 71 | "\n", 72 | "给定某个策略$\\pi$,蒙特卡洛估计每个状态$V^\\pi(s_i),(i=1,2,...,|S|)$的步骤是这样的:\n", 73 | "1. 从状态集随机挑选一个起始状态\n", 74 | "2. 执行策略$\\pi$,生成一个episode。每执行一次策略,我们可以获得执行路径上每个状态的累积回报样本。\n", 75 | "3. 我们运行系统足够多次,为每个状态获得足够多的值函数样本,将这些样本平均,就得到其值函数的估计。\n", 76 | "\n", 77 | "根据平均方式的不同,MC方法可以分为:First-visit MC和Every-visit MC。\n", 78 | "#### First-visit MC\n", 79 | "用第一次访问$s$后的累积回报样本估计$V^\\pi(s)$\n", 80 | "![1](http://7xikew.com1.z0.glb.clouddn.com/53dd5510-ee17-47aa-91a4-25ca761272ad.png)\n", 81 | "\n", 82 | "#### Every-visit MC\n", 83 | "用所有访问过s后的累积回报样本估计$V^\\pi(s)$。算法流程如下(原论文没给出,自己P了一张):\n", 84 | "![2](http://7xikew.com1.z0.glb.clouddn.com/75fb1e20-0084-4886-b555-be49f97dfeaf.jpg)\n", 85 | "\n", 86 | "可以保证,当运行次数趋于无限,这个估计就会逐渐收敛于$V^\\pi(s)$。\n", 87 | "前面介绍了如何用MC估计一个策略的值函数,那么如何用估计到的值函数求解最优策略?\n", 88 | "下图为广义策略迭代示意图,evaluation阶段,我们不断修正值函数,使其逼近当前策略下真实的值函数。improvement阶段,我们用估计到的值函数贪心地更新策略。\n", 89 | "![3](http://7xikew.com1.z0.glb.clouddn.com/f8af90ec-1a92-4c68-87ff-9a43120fcc63.png)\n", 90 | "很容易想到,我们用MC做evaluation估计值函数,再根据估计的值函数用贪心算法更新当前策略不就行了?如下图所示,E表示evaluation,I表示更新策略。\n", 91 | "![4](http://7xikew.com1.z0.glb.clouddn.com/9b08bd08-5dd2-4ec3-a3ec-660f2d47d746.png)\n", 92 | "\n", 93 | "## 3. 
时间差分\n", 94 | "动态规划算法有如下特性:\n", 95 | "* 需要环境模型,即状态转移概率$P_{sa}$\n", 96 | "* 状态值函数的估计是自举的(bootstrapping),即当前状态值函数的更新依赖于已知的其他状态值函数。\n", 97 | "\n", 98 | "蒙特卡罗方法的特点则有:\n", 99 | "* 不需要环境模型\n", 100 | "* 状态值函数的估计是相互独立的\n", 101 | "* 只能用于episode tasks\n", 102 | "\n", 103 | "时间差分是蒙特卡洛和动态规划的结合,因此具有两者的优点:1)不需要环境知识,只需要根据经验求得最优策略。 2)不局限与episode task,可以用于连续任务\n", 104 | "\n", 105 | "\n", 106 | "\n", 107 | "#### constant-$\\alpha$ MC\n", 108 | "这里介绍MC的一个变种。它的估计式是:\n", 109 | "$$V(s_t) \\leftarrow V(s_t) + \\alpha[R_t - V(s_t)] \\tag {1}$$\n", 110 | "\n", 111 | "其中$R_t=\\sum\\limits_{k=0}^\\infty \\gamma^k R(s_{t+k})$是$t$时刻后的累积回报(可以认为是对$V(s_t)$的估计),$\\alpha$是步长,$R_t-V(s_t)$是一个增量,用于修正$V(s_t)$。constant-$\\alpha$ MC方法的缺点是需要等到整个episode结束才能计算$R_t$,以更新$V(s_t)$。\n", 112 | "这里有必要对$V(s_t)$、$V^\\pi(s)$和$V^\\pi(s_t)$做个区别。\n", 113 | "首先$V(s_t)$不是期望,只是一个sample。$V^\\pi(s)$和$V^\\pi(s_t)$都是数学期望。\n", 114 | "* $V(s_t)$是$t$时刻的状态值函数,是对$V^\\pi(s_t)$的估计值,$s_t$还没有给定,可以是任意状态\n", 115 | "* $V^\\pi(s_t)$是策略$\\pi$下状态$s_t$的累积回报的期望,但$s_t$没有给定\n", 116 | "* $V^\\pi(s)$是策略$\\pi$下状态$s_t$的累积回报期望,给定了$s_t=s$,产生了立即回报$R(s)$。\n", 117 | "#### Temporal difference\n", 118 | "constant-$\\alpha$ MC由于要等到一段episode结束才能计算所有的累积回报,因此无法推广到continuing task。与MC方法不同的是,TD(0)方法只需要下一时刻的信息来更新$V(s_t)$:\n", 119 | "$$ V(s_t)\\leftarrow V(s_t)+\\alpha[R(s_{t})+\\gamma V(s_{t+1})-V(s_t)]$$\n", 120 | "其中,$R(s_{t})$是t时刻的立即回报,$V_{s_{t+1}}$是下个时刻累积回报的估计。从式中可以看到,TD(0)的更新过程具有和DP类似的自举性质。\n", 121 | "TD(0)的算法过程如下图所示:\n", 122 | "![5](http://7xikew.com1.z0.glb.clouddn.com/baed76eb-588f-462b-8ee6-a0629b77a0d5.png)\n", 123 | "用类似上节MC的手段,我们将TD(0)算法与policy improvement结合就能实现一个智能控制问题。\n", 124 | "\n", 125 | "#### DP,MC,TD三者对比\n", 126 | "考察公式\n", 127 | "$$\\begin{aligned}V^\\pi(s)&=E_\\pi[R_t|s_t=s]=E_\\pi[\\sum_{k=0}^\\infty \\gamma^k R(s_{t+k})|s_t=s]\\\\&=R(s)+\\gamma E_\\pi[\\sum_{k=0}^\\infty \\gamma^k R(s_{t+k+1})|s_t=s]\\\\&=R(s)+\\gamma \\sum_{s'\\in S}P_{s\\pi(s)}(s')E_\\pi[\\sum_{k=0}^\\infty \\gamma^k R(s_{t+k+1})|s_{t+1}=s']\\\\&=R(s)+\\gamma \\sum_{s'\\in S}P_{s\\pi(s)}(s')V^\\pi(s')\\end{aligned}$$\n", 128 | "可以发现:\n", 129 | "* MC方法是以$E_\\pi[R_t|s_t=s]$为目标进行估计\n", 130 | "* DP方法是以$R(s)+\\gamma \\sum_{s'\\in S}P_{s\\pi(s)}(s')V^\\pi(s')$为目标进行估计\n", 131 | "* TD方法目标和DP的一样,只不过不是直接进行估计,而是用类似MC的方法通过采样去估计的。具体地,TD用$V(s')$替换$V^\\pi(s')$,用$R(s)+\\gamma V(s')$作为$R(s)+\\gamma \\sum_{s'\\in S}P_{s\\pi(s)}(s')V^\\pi(s')$的估计\n", 132 | "\n", 133 | "\n", 134 | "\n", 135 | "## 结语\n", 136 | "到此,我们介绍了三种求解值函数的方法。文献1的公式有些看不懂,不过还是尝试着自己理解了,写下了自己对这些方法的理解。下次的内容是SARSA,Q-learning。\n", 137 | "\n", 138 | "参考:\n", 139 | "1.Reinforcement Learning: An Introduction\n", 140 | "2.CS229-lecture-notes12\n", 141 | "3.Littman M L, Dean T L, Kaelbling L P. On the Complexity of Solving Markov Decision Problems[C]// Eleventh International Conference on Uncertainty in Artificial Intelligence. 
2013:394--402.\n", 142 | "4.[增强学习(四) ----- 蒙特卡罗方法(Monte Carlo Methods) ](http://blog.csdn.net/zz_1215/article/details/44138881)\n", 143 | "5.[增强学习(五)----- 时间差分学习(Q learning, Sarsa learning)](http://www.cnblogs.com/jinxulin/p/5116332.html?utm_source=tuicool&utm_medium=referral)" 144 | ] 145 | } 146 | ], 147 | "metadata": { 148 | "kernelspec": { 149 | "display_name": "Python 2", 150 | "language": "python", 151 | "name": "python2" 152 | }, 153 | "language_info": { 154 | "codemirror_mode": { 155 | "name": "ipython", 156 | "version": 2 157 | }, 158 | "file_extension": ".py", 159 | "mimetype": "text/x-python", 160 | "name": "python", 161 | "nbconvert_exporter": "python", 162 | "pygments_lexer": "ipython2", 163 | "version": "2.7.12" 164 | } 165 | }, 166 | "nbformat": 4, 167 | "nbformat_minor": 0 168 | } 169 | -------------------------------------------------------------------------------- /Deep-Learning/theano-notes/part6-scan-function.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# theano的scan函数\n", 8 | "`scan`函数是theano中一个十分重要的概念,利用它我们可以处理时序数据,从而完成许多复杂的计算过程。其原理有点类似一个高度封装过的for循环,每个时刻都调用相同的回调函数处理该时刻的数据,最后再将处理的结果按照时间顺序堆叠、汇总。 \n", 9 | "`scan`相对于for循环的好处:\n", 10 | "* 迭代次数可以作为符号图的一部分\n", 11 | "* 最小化GPU数据传输\n", 12 | "* 可以按时序计算梯度\n", 13 | "* 比python在for循环中调用编译好的函数速度更快\n", 14 | "* 可以通过检测实际需要使用的内存量来减少总的内存使用\n", 15 | "\n", 16 | "它的缺点是过于复杂,难于调试,因此也成为笔者学习theano遇到的第一道坎。 \n", 17 | "\n", 18 | "scan函数的定义:\n", 19 | "```py\n", 20 | "theano.scan(fn, sequences=None, outputs_info=None, non_sequences=None, n_steps=None, truncate_gradient=-1, go_backwards=False, mode=None, name=None, profile=False, allow_gc=None, strict=False)\n", 21 | "```\n", 22 | "* fn:供scan在每个时刻调用的函数句柄,fn的输入参数顺序应该与传给scan函数的参数顺序一致 \n", 23 | "fn读取参数的顺序为:sequence->outputs_info->non_sequences\n", 24 | "* sequences:存放作为输入的时序数据,类型可以是列表或字典 \n", 25 | "举个例子,假设sequences=[a, b],那么执行scan时fn会依次读取该列表中每个变量第t时刻的数据a[t],b[t]。需要注意的是传入时要通过dimshuffle把时间维放在axis0\n", 26 | "* outputs_info:scan函数输出的初始值,同时通过outputs_info我们可以实现递归计算,其中taps是scan实现递归运算的关键。默认情况下,如果不指定taps参数,则outputs_info输入参数的前一时刻的值将会参与当前时刻的计算,如果outputs_info为None,表示不使用任何tap,此时不具有递归结构。如果使用多个时刻的taps,需要有额外的维度,比如taps=[-5,-2,-1],而输出的shape为(6,3),那么初始状态至少需要(5,)+(6,3)=(5,6,3)才能足够容纳scan函数的输出。如果类型为dict,需要指定initial(初值)和taps(参与计算的时刻),比如taps=-2表示计算过程需要用到t-2时刻的数据\n", 27 | "* non_sequences:作为输入的非时序数据\n", 28 | "* n_steps:循环执行fn的次数,可以是符号变量\n", 29 | "* truncate_gradient:用于限定BPTT中梯度回传的时刻数,如果为-1则表示不适用梯度截断\n", 30 | "* go_backwards:默认为False,如果为True,则从序列的末尾后往前计算各时刻输出,该参数一般用在双向的递归结构,比如BiGRU和BiLSTM的实现中\n", 31 | "* 返回值:scan函数的返回值是fn在每个时刻上处理的结果按照时间维度的堆叠\n" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "我们以累加器为例,来研究scan函数到底是怎么工作的:\n", 39 | "$$ sum(n) = \\sum_{i=0}^n i$$\n", 40 | "这个关系可以表示为如下的递归关系:\n", 41 | "$$sum(n)=sum(n-1)+n$$" 42 | ] 43 | }, 44 | { 45 | "cell_type": "code", 46 | "execution_count": 1, 47 | "metadata": { 48 | "collapsed": false 49 | }, 50 | "outputs": [ 51 | { 52 | "name": "stdout", 53 | "output_type": "stream", 54 | "text": [ 55 | "[ 0. 1. 3. 6. 10. 
15.]\n" 56 | ] 57 | }, 58 | { 59 | "name": "stderr", 60 | "output_type": "stream", 61 | "text": [ 62 | "DEBUG: nvcc STDOUT mod.cu\r\n", 63 | " ���ڴ����� C:/Users/hschen/AppData/Local/Theano/compiledir_Windows-10-10.0.14393-Intel64_Family_6_Model_60_Stepping_3_GenuineIntel-2.7.12-64/tmpiviokf/265abc51f7c376c224983485238ff1a5.lib �Ͷ��� C:/Users/hschen/AppData/Local/Theano/compiledir_Windows-10-10.0.14393-Intel64_Family_6_Model_60_Stepping_3_GenuineIntel-2.7.12-64/tmpiviokf/265abc51f7c376c224983485238ff1a5.exp\r\n", 64 | "\n", 65 | "Using gpu device 0: GeForce GTX 960M (CNMeM is disabled, cuDNN 5103)\n", 66 | "C:\\Anaconda2\\lib\\site-packages\\theano\\sandbox\\cuda\\__init__.py:600: UserWarning: Your cuDNN version is more recent than the one Theano officially supports. If you see any problems, try updating Theano or downgrading cuDNN to version 5.\n", 67 | " warnings.warn(warn)\n" 68 | ] 69 | } 70 | ], 71 | "source": [ 72 | "import theano\n", 73 | "import theano.tensor as T\n", 74 | "import numpy as np\n", 75 | "\n", 76 | "n = T.iscalar()\n", 77 | "acc_out, updates = theano.scan(lambda i, acc_sum: acc_sum + i, sequences=T.arange(n+1), \n", 78 | " outputs_info=T.constant(np.float64(0)))\n", 79 | "accumulate_sum = theano.function([n], acc_out)\n", 80 | "print accumulate_sum(5)" 81 | ] 82 | }, 83 | { 84 | "cell_type": "markdown", 85 | "metadata": {}, 86 | "source": [ 87 | "为了获得累加结果,我以0作为outputs_info的initial_state的初始值。接着传给sequences一个以0起始、增量为1的整数序列,T.arange是theano版本的np.arange。随后我定义了一个匿名函数,这个函数的第一个参数`i`是序列的第i个元素;第二个参数`acc_sum`是上一个时刻的输出值,即$sum(i-1)$;而这个函数的返回值是$sum(i)=sum(i-1)+i$。scan函数按照时间顺序计算每个时刻的输出,并将结果按照时间顺序堆叠成一个np.ndarray数组:$[sum(0), sum(1),...,sum(n)]$" 88 | ] 89 | }, 90 | { 91 | "cell_type": "markdown", 92 | "metadata": {}, 93 | "source": [ 94 | "下面结合一些scan函数的examples进行讲解,以加深理解" 95 | ] 96 | }, 97 | { 98 | "cell_type": "markdown", 99 | "metadata": {}, 100 | "source": [ 101 | "## 平方和累加\n", 102 | "给定一个正整数$n$,我们要通过scan函数求解如下的式子:\n", 103 | "$$\\sum_{i=1}^n i^2$$" 104 | ] 105 | }, 106 | { 107 | "cell_type": "code", 108 | "execution_count": 2, 109 | "metadata": { 110 | "collapsed": false 111 | }, 112 | "outputs": [ 113 | { 114 | "name": "stdout", 115 | "output_type": "stream", 116 | "text": [ 117 | "55\n" 118 | ] 119 | } 120 | ], 121 | "source": [ 122 | "n = T.iscalar()\n", 123 | "acc_out, updates = theano.scan(lambda i, acc_sum: acc_sum + i**2, sequences=T.arange(n+1), \n", 124 | " outputs_info=T.constant(np.int64(0)))\n", 125 | "acc_out = acc_out[-1]\n", 126 | "accumulate_square_sum = theano.function([n], acc_out)\n", 127 | "print accumulate_square_sum(5)" 128 | ] 129 | }, 130 | { 131 | "cell_type": "markdown", 132 | "metadata": {}, 133 | "source": [ 134 | "## 斐波那契数列\n", 135 | "斐波那契数列的递推式为\n", 136 | "$$ x(n)=x(n-1)+x(n-2),(n\\geq 2)$$\n", 137 | "其中$x(0)=0, x(1)=1$ \n", 138 | "其scan版本的实现如下:" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": 3, 144 | "metadata": { 145 | "collapsed": false, 146 | "scrolled": true 147 | }, 148 | "outputs": [ 149 | { 150 | "name": "stdout", 151 | "output_type": "stream", 152 | "text": [ 153 | "[ 1 2 3 5 8 13 21 34 55 89]\n" 154 | ] 155 | } 156 | ], 157 | "source": [ 158 | "n = T.iscalar()\n", 159 | "x0 = T.ivector()\n", 160 | "fib_out, updates = theano.scan(lambda xtm1, xtm2: xtm1 + xtm2, \n", 161 | " outputs_info=[dict(initial=x0, taps=[-1,-2])], n_steps=n)\n", 162 | "fib_out = fib_out\n", 163 | "fib = theano.function([x0, n], fib_out)\n", 164 | "print fib(np.int32([0, 1]), 10)" 165 | ] 166 | }, 167 | { 168 | "cell_type": "markdown", 169 | 
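为了更直观地理解 `outputs_info` 配合 `taps=[-1,-2]` 时的递归机制,下面补充一段纯 Python 的等价示意代码(函数名 `scan_like_fib` 为说明自拟,只演示 scan 内部循环的思路,并非 theano 的真实实现;由于加法可交换,两个 tap 的先后顺序不影响结果):

```py
import numpy as np

def scan_like_fib(x0, n_steps):
    # x0 plays the role of outputs_info's `initial` (here [0, 1]);
    # taps=[-1, -2] means fn sees the outputs from 1 and 2 steps back.
    history = list(x0)            # keeps enough past outputs to cover the taps
    outputs = []
    for _ in range(n_steps):
        xtm1, xtm2 = history[-1], history[-2]   # taps -1 and -2
        new = xtm1 + xtm2                       # same body as the lambda above
        outputs.append(new)
        history.append(new)
    return np.array(outputs)

print(scan_like_fib([0, 1], 10))   # [ 1  2  3  5  8 13 21 34 55 89]
```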
"metadata": {}, 170 | "source": [ 171 | "## 计算圆周率\n", 172 | "圆周率可以通过下面的积分式计算\n", 173 | "$$\\pi=2\\int_{-1}^1 \\sqrt{1-x^2}dx$$ " 174 | ] 175 | }, 176 | { 177 | "cell_type": "code", 178 | "execution_count": 4, 179 | "metadata": { 180 | "collapsed": false 181 | }, 182 | "outputs": [ 183 | { 184 | "name": "stdout", 185 | "output_type": "stream", 186 | "text": [ 187 | "3.14156113248\n" 188 | ] 189 | } 190 | ], 191 | "source": [ 192 | "inp=T.dvector()\n", 193 | "dx = T.dscalar()\n", 194 | "pi, updates = theano.scan(lambda xt, pi_sum: pi_sum+2.*T.sqrt(1-xt**2)*dx, sequences=[inp], \n", 195 | " outputs_info=T.constant(np.float64(0.)))\n", 196 | "pi = pi[-1]\n", 197 | "cal_pi = theano.function([inp, dx], pi)\n", 198 | "n_interval = 100000\n", 199 | "print cal_pi(np.linspace(-1, 1, n_interval)[1:-1], 2. / n_interval)" 200 | ] 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "metadata": {}, 205 | "source": [ 206 | "## 计算泰勒展开\n", 207 | "我们用scan实现$e^x$的泰勒展开式:\n", 208 | "$$ e^x=1+x+\\frac{1}{2!}x^2+\\frac{1}{3!}x^3+\\cdots = \\sum_{n=0}^\\infty \\frac{1}{n!}x^n$$" 209 | ] 210 | }, 211 | { 212 | "cell_type": "code", 213 | "execution_count": 5, 214 | "metadata": { 215 | "collapsed": false 216 | }, 217 | "outputs": [ 218 | { 219 | "name": "stdout", 220 | "output_type": "stream", 221 | "text": [ 222 | "calc_exp(1)=2.718282\n", 223 | "calc_exp(0)=1.000000\n", 224 | "calc_exp(0)=0.367879\n", 225 | "whether calc_exp(1) equals np.exp(1):True\n" 226 | ] 227 | } 228 | ], 229 | "source": [ 230 | "n = T.iscalar()\n", 231 | "x = T.dscalar()\n", 232 | "#factorial = T.cumprod(T.arange(1, n + 1)))\n", 233 | "factorial = T.gamma(n+1)\n", 234 | "\n", 235 | "def fn(n, power, exp_sum, x):\n", 236 | " power = power*x\n", 237 | " return power, exp_sum + 1./T.gamma(n+1)*power\n", 238 | "result, updates = theano.scan(fn, sequences=T.arange(1, n),\n", 239 | " outputs_info=[T.constant(np.float64(1.)), T.constant(np.float64(1.))],\n", 240 | " non_sequences=x)\n", 241 | "exp_ = result[1][-1]\n", 242 | "calc_exp = theano.function([n, x], exp_)\n", 243 | "print \"calc_exp(1)=%f\"%calc_exp(15, 1)\n", 244 | "print \"calc_exp(0)=%f\"%calc_exp(15, 0)\n", 245 | "print \"calc_exp(0)=%f\"%calc_exp(15, -1)\n", 246 | "print \"whether calc_exp(1) equals np.exp(1):%s\"%np.allclose(calc_exp(15, 1), np.exp(1))" 247 | ] 248 | } 249 | ], 250 | "metadata": { 251 | "kernelspec": { 252 | "display_name": "Python 2", 253 | "language": "python", 254 | "name": "python2" 255 | }, 256 | "language_info": { 257 | "codemirror_mode": { 258 | "name": "ipython", 259 | "version": 2 260 | }, 261 | "file_extension": ".py", 262 | "mimetype": "text/x-python", 263 | "name": "python", 264 | "nbconvert_exporter": "python", 265 | "pygments_lexer": "ipython2", 266 | "version": "2.7.12" 267 | } 268 | }, 269 | "nbformat": 4, 270 | "nbformat_minor": 0 271 | } 272 | -------------------------------------------------------------------------------- /CS229/GLM.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "\n", 8 | "# 广义线性模型\n", 9 | "线性回归模型假设给定特征$x$和参数$\\theta$的情况下,标号$y$服从一个高斯分布,即$y|x;\\theta\\sim\\mathcal{N}(\\mu,\\sigma)$;而逻辑回归假设给定特征$x$和参数$\\theta$的情况下,标号$y$服从一个伯努利分布,即$y|x;\\theta\\sim Bernoulli(\\phi)$,这里伯努利分布的参数$\\phi$是条件概率$p(y|x;\\theta)$,因此每次采样都依赖于$x$;softmax假设给定特征$x$和参数$\\Theta$的情况下,标号$y$服从一个多项式分布,即$y|x;\\Theta\\sim Multinomial(\\phi)$,假设有K个类,$\\sum_{i=1}^K \\phi_i=1$。下文将证明这三个模型都是广义线性模型(GLM)的特例。\n", 10 | "## 指数分布族\n", 11 | 
"具有如下形式的分布族称为指数分布族(exponential family):\n", 12 | "$$ p(y;\\eta)=b(y)\\exp(\\eta^T T(y)-a(\\eta))$$\n", 13 | "其中$\\eta$是分布的natural parameter,就是分布待求的参数。$T(y)$是sufficient statistics(大多数情况$T(y)=y$),是关于样本的函数。$a(\\eta)$是log normalizer,这是因为$\\exp(a(\\eta))=\\int b(y)\\exp(\\eta^TT(y))dy$,起到归一化作用。\n", 14 | "确定$a,b,T$即确定了一个分布族,而$\\eta$是此分布族的参数。当$\\eta$变化时,我们得到这个分布族里不同的分布。\n", 15 | "指数分布族是个大家族,家族成员有正态分布,指数分布,gamma分布,卡方分布,beta分布,狄利克雷分布,伯努利分布,泊松分布等。\n", 16 | "下面我们可以证明高斯分布和伯努利、多项式分布属于指数分布族。\n", 17 | "### 高斯分布\n", 18 | "高斯分布的pdf为\n", 19 | "$$ p(y|\\mu,\\sigma)=\\mathcal{N}(y|\\mu,\\sigma)=\\frac{1}{\\sqrt{2\\pi}\\sigma}\\exp(-\\frac{(y-\\mu)^2}{2\\sigma^2})$$\n", 20 | "这里我们只考虑$\\sigma^2=1$的情况,这样充分统计量就只有1维了,则高斯分布可以简化为\n", 21 | "$$ p(y|\\mu)=\\frac{1}{\\sqrt{2\\pi}}\\exp(-\\frac{(y-\\mu)^2}{2})=\\frac{1}{\\sqrt{2\\pi}}\\exp(-\\frac{y^2}{2})\\exp(\\mu y-\\frac{\\mu^2}{2})$$\n", 22 | "上述高斯分布可以写成指数分布族:\n", 23 | "$$\\eta = \\mu \\\\\n", 24 | "T(y) = y \\\\\n", 25 | "a(\\eta) = \\mu^2/2 = \\eta^2 /2 \\\\\n", 26 | "b(y)=\\frac{1}{\\sqrt{2\\pi}} \\exp(-\\frac{1}{2} y^2)$$\n", 27 | "以上的推导我做了很大的简化,即认为方差已知,然而做学问不能这么不求甚解,所以还是把方差不知道的情况也推一下呗。\n", 28 | "既然方差不知道,那么自然参数就应该是个向量$\\eta=(\\mu,\\sigma^2)^T$。\n", 29 | "接下来依然是观察\n", 30 | "$$ p(y|\\mu,\\sigma)=\\mathcal{N}(y|\\mu,\\sigma)=\\frac{1}{\\sqrt{2\\pi}\\sigma}\\exp(-\\frac{(y-\\mu)^2}{2\\sigma^2})$$\n", 31 | "我们可以对$\\frac{1}{\\sqrt{2\\pi}\\sigma}$做点手脚,写为\n", 32 | "$$\\begin{aligned}\\frac{1}{\\sqrt{2\\pi}\\sigma}&=\\exp(\\ln \\frac{1}{\\sqrt{2\\pi}\\sigma})=\\exp(-\\ln \\sqrt{2\\pi}\\sigma)=\\exp(\\ln\\sqrt{\\frac{1}{2\\pi\\sigma^2}})=\\exp(\\ln\\sqrt{\\frac{\\beta}{2\\pi}})\\\\&=\\exp(\\frac{1}{2}\\ln\\ \\beta-\\frac{1}{2}\\ln \\ (2\\pi))\\end{aligned}$$\n", 33 | "则原式写为\n", 34 | "$$ p(y|\\mu,\\sigma)=\\frac{1}{\\sqrt{2\\pi}\\sigma}\\exp(-\\frac{(y-\\mu)^2}{2\\sigma^2})=\\exp(\\color{blue}{\\underbrace{{-\\frac{y^2}{2}\\beta+\\mu y \\beta}}_{\\eta^TT(y)}}\\color{red}{\\underbrace{-\\frac{\\mu^2}{2}\\beta+\\frac{1}{2}\\ln \\ \\beta}_{a(\\eta)}}\\color{magenta}{\\underbrace{-\\frac{1}{2}\\ln(2\\pi)}_{b(y)}})$$\n", 35 | "其中$b(y)=\\exp(-\\frac{1}{2}\\ln(2\\pi))$,$\\eta=(\\mu\\beta,-\\frac{1}{2}\\beta)^T$,$T(y)=(y,y^2)^T$,$a(\\eta)=-\\frac{\\mu^2}{2}\\beta+\\frac{1}{2}\\ln \\ \\beta$\n", 36 | "\n", 37 | "\n", 38 | "### 伯努利分布\n", 39 | "伯努利分布的概率密度为\n", 40 | "$$\\begin{aligned}p(y;\\phi)&=\\phi^y(1-\\phi)^{1-y}\\\\&=\\exp(\\color{red}{\\underbrace{y\\log\\phi+(1-y)\\log(1-\\phi)}_{\\mbox{opposite of binary crossentropy}}})\\\\&=\\exp(y\\log(\\frac{\\phi}{1-\\phi})+\\log(1-\\phi))\\end{aligned}$$\n", 41 | "其中$\\phi=P(y=1)$\n", 42 | "对应的指数分布族参数为:\n", 43 | "$$T(y) = y \\\\ \\eta=\\log \\frac{\\phi}{1-\\phi} \\\\a(\\eta) = -\\log(1-\\phi)=\\log(1+e^\\eta) \\\\ b(y)=1$$\n", 44 | "注意到第二行中$exp(\\cdot)$里面是二元交叉熵误差(binary crossentropy)的**相反数**,交叉熵衡量了标签y与概率$\\phi$的相似度,同时也是logistic regression的loss function。\n", 45 | "logistic regression里面,标签y是给定的,而$\\phi=\\frac{e^\\eta}{1+e^\\eta}=\\frac{1}{1+e^{-\\eta}}$是未知的,我们要估计的正是这个概率$\\phi$,我们同时还应该注意到$\\phi$的形式是一个sigmoid function。\n", 46 | "\n", 47 | "## 广义线性模型\n", 48 | "广义线性模型对$p(y|x;\\theta)$和模型作了如下三个假设\n", 49 | "\n", 50 | "* 给定$x$和$\\theta$,$y$服从参数为$\\eta$的指数分布族:$y|x;\\theta\\sim ExponentialFamily(\\eta)$\n", 51 | "* 给定$x$,我们的目标是预测$T(y)$的数学期望,通常$T(y)=y$,因此是$h(x)=E[T(y)|x]=E[y|x]$\n", 52 | "* 自然参数是$x$的线性函数,即$\\eta=\\theta^Tx$\n", 53 | "\n", 54 | "利用GLM,我们可以很容易地得出最小二乘、logistic回归,softmax回归。\n", 55 | "## 线性回归\n", 56 | "期望为\n", 57 | "$$h_{\\theta}(x)=E[y|x]=\\mu=\\eta=\\theta^Tx$$\n", 58 | "可以用最大似然法求解参数$\\theta$\n", 59 | "## logistic回归\n", 60 | 
"期望为\n", 61 | "$$h_{\\theta}(x)=E[y|x]=\\phi=\\frac{1}{1+e^{-\\eta}}=\\frac{1}{1+e^{-\\theta^Tx}}$$\n", 62 | "我们依然可以用最大似然法求解参数$\\theta$\n", 63 | "给定一个训练集$\\mathcal{D}=\\{(x_1,y_1),...,(x_n,y_n)\\}$,假设$y_i$独立同分布,对于每个样本有$p(y_i|x_i,\\theta)=\\phi^{y_i}(1-\\phi)^{1-y_i}$.我们求对数似然:\n", 64 | "$$ ln\\prod_{i=1}^n p(y_i|x_i,\\theta)=\\sum_{i=1}^n[y_i\\ln(\\phi^{(i)})+(1-y_i)\\ln(1-\\phi^{(i)})]=\\sum_{i=1}^n[y_i\\ln(\\frac{1}{1+\\exp(-\\theta^Tx_i)})+(1-y_i)\\ln(1-\\frac{1}{1+\\exp(-\\theta^Tx_i)})]$$\n", 65 | "特别说一下,这里的$\\phi$为什么带有上标$(i)$,这是因为每个$y_i$虽然都服从一个伯努利分布,但是其对应的概率不同:$\\phi^{(i)}=\\frac{1}{1+\\exp(-\\theta^Tx_i)}$,也就是说$y_i$虽然独立,但**不严格同分布**(形式相同,参数不同)。\n", 66 | "最大化对数似然等价于最小化它的相反数,于是我们把目标函数表示为一个交叉熵:\n", 67 | "$$J(\\theta)=\\sum_{i=1}^n-[y_i\\ln(\\frac{1}{1+\\exp(-\\theta^Tx_i)})+(1-y_i)\\ln(1-\\frac{1}{1+\\exp(-\\theta^Tx_i)})]$$\n", 68 | "接下去求导与梯度下降的过程就不推导了。\n", 69 | "## softmax\n", 70 | "softmax中,$y\\in\\{1,2,...,k\\},(k\\geq 2)$,k个类对应有k个参数$\\Theta=\\{\\theta_1,\\theta_2,...,\\theta_k\\}$\n", 71 | "\n", 72 | "定义$T(y)\\in\\mathbb{R}^{k-1}$,且\n", 73 | "$$T(1)=\\begin{bmatrix}1\\\\0\\\\\\vdots\\\\0\\end{bmatrix},T(2)=\\begin{bmatrix}0\\\\1\\\\\\vdots\\\\0\\end{bmatrix},...,T(k-1)=\\begin{bmatrix}0\\\\0\\\\\\vdots\\\\1\\end{bmatrix},T(k)=\\begin{bmatrix}0\\\\0\\\\\\vdots\\\\0\\end{bmatrix}$$\n", 74 | "令$\\boldsymbol{\\phi}=(\\phi_1,\\phi_2,...,\\phi_{k-1})^T$,则$\\phi_k=1-\\sum_{i=1}^{k-1}\\phi_i=1-\\boldsymbol{1}^T\\boldsymbol{\\phi}$,假设$y|x;\\Theta\\sim Multinomial(\\phi^1,...,\\phi^k)$(其实这应该是个categorical distribution,但机器学习里通常不区分multinomial和categorical,把两者视作一个东西),那么有\n", 75 | "$$\\begin{aligned}P(y|x;\\theta)&=\\phi^{\\mathbb{I}(y=1)}_1\\phi^{\\mathbb{I}(y=2)}_2\\cdots \\phi^{\\mathbb{I}(y=k)}_k\\\\&=\\exp(\\sum_{i=1}^{k-1}\\mathbb{I}(y=i)\\log\\phi_i+\\mathbb{I}(y=k)\\log\\phi_k)\\\\&=\\exp((\\log\\boldsymbol{\\phi})^TT(y)+(1-\\boldsymbol{1}^TT(y))\\log\\phi_k)\\\\&=\\exp((\\log\\boldsymbol{\\phi}-\\log\\phi_k\\cdot\\boldsymbol{1})^TT(y)+\\log\\phi_k)\\\\&=\\exp((\\log\\frac{\\boldsymbol{\\phi}}{\\phi_k})^TT(y)-(-\\log\\phi_k))\\end{aligned}$$\n", 76 | "对照指数分布族定义式有:\n", 77 | "$$b(y)=1\\\\\\eta=\\log\\frac{\\boldsymbol{\\phi}}{\\phi_k}\\\\a(\\eta)=-\\log\\phi_k$$\n", 78 | "其中$\\eta_i=\\log\\frac{\\phi_i}{\\phi_k}$,定义$\\eta_k=\\log\\frac{\\phi_k}{\\phi_k}=0$\n", 79 | "则$e^{\\eta_i}=\\frac{\\phi_i}{\\phi_k},(i=1,...,k)$,$\\phi_i=\\phi_k e^{\\eta_i}$,由$\\sum_{i=1}^k\\phi_i=1$有:\n", 80 | "$$\\phi_k=\\frac{1}{\\sum_{j=1}^k e^{\\eta_j}}$$\n", 81 | "于是$\\phi_i=\\frac{e^{\\eta_i}}{\\sum_{j=1}^k e^{\\eta_j}}$,代入$\\eta_i=\\theta_i^Tx$得\n", 82 | "$$p(y=i|x;\\theta)=\\phi_i=\\frac{e^{\\theta_i^Tx}}{\\sum_{j=1}^k e^{\\theta_j^Tx}}$$\n", 83 | "模型的输出为\n", 84 | "$$h_\\theta(x)=E\\begin{bmatrix}\\mathbb{I}(y=1)|x;\\theta\\\\\\mathbb{I}(y=2)|x;\\theta\\\\\\vdots\\\\\\mathbb{I}(y=k)|x;\\theta\\end{bmatrix}=\\begin{bmatrix}\\phi_1\\\\\\phi_2\\\\\\vdots\\\\\\phi_k\\end{bmatrix}=\\begin{bmatrix}\\frac{e^{\\theta_1^Tx}}{\\sum_{j=1}^k e^{\\theta_j^Tx}}\\\\\\frac{e^{\\theta_2^Tx}}{\\sum_{j=1}^k e^{\\theta_j^Tx}}\\\\\\vdots\\\\\\frac{e^{\\theta_k^Tx}}{\\sum_{j=1}^k e^{\\theta_j^Tx}}\\end{bmatrix}$$ \n", 85 | "\n", 86 | "最大似然估计\n", 87 | "依然是求对数似然,在进行之前引入一个记号,以表示第i个样本出自第j个类的概率\n", 88 | "$$\\phi_j^{(i)}=p(y_i=j|x_i,\\Theta)=\\frac{\\exp(\\theta_j^Tx_i)}{\\sum_{k=1}^K \\exp(\\theta_k^Tx_i)}$$\n", 89 | "$$\\begin{aligned} ln\\prod_{i=1}^n p(y_i|x_i,\\Theta)&=ln \\prod_{i=1}^n\\phi^{\\mathbb{I}(y_i=1)}_1\\phi^{\\mathbb{I}(y_i=2)}_2\\cdots \\phi^{\\mathbb{I}(y_i=k)}_k=ln\\ \\phi^{\\sum_{i=1}^n 
\\mathbb{I}(y_i=1)}_1\\phi^{\\sum_{i=1}^n \\mathbb{I}(y_i=2)}_2\\cdots \\phi^{\\sum_{i=1}^n \\mathbb{I}(y_i=k)}_k \\\\&=\\sum_{i=1}^n[\\mathbb{I}(y_i=1)ln\\ \\phi^{(i)}_1+\\mathbb{I}(y_i=2)ln\\ \\phi^{(i)}_2+...+\\mathbb{I}(y_i=k)ln\\ \\phi^{(i)}_k]\\end{aligned}$$\n", 90 | "等价于最小化\n", 91 | "$$\\begin{aligned}J(\\Theta)&=-\\sum_{i=1}^n \\bigg[\\mathbb{I}(y_i=1)\\ln (\\frac{\\exp(\\theta_1^Tx_i)}{\\sum_{k=1}^K \\exp(\\theta_k^Tx_i)})+...+\\mathbb{I}(y_i=1)\\ln (\\frac{\\exp(\\theta_K^Tx_i)}{\\sum_{k=1}^K \\exp(\\theta_k^Tx_i)})\\bigg]\\\\&=-\\sum_{i=1}^n \\sum_{j=1}^K\\mathbb{I}(y_i=j) \\ln (\\frac{\\exp(\\theta_j^Tx_i)}{\\sum_{k=1}^K \\exp(\\theta_k^Tx_i)})\\\\&=-\\big[\\sum_{i=1}^n \\sum_{j=1}^K\\mathbb{I}(y_i=j)\\theta_j^Tx_i -\\sum_{i=1}^n \\sum_{j=1}^K\\mathbb{I}(y_i=j)\\ln\\sum_{k=1}^K\\exp(\\theta_k^Tx_i)\\big]\\\\&=-\\big[\\sum_{i=1}^n \\sum_{j=1}^K\\mathbb{I}(y_i=j)\\theta_j^Tx_i -\\sum_{i=1}^n \\ln\\sum_{k=1}^K\\exp(\\theta_k^Tx_i)(\\sum_{j=1}^K\\mathbb{I}(y_i=j))\\big]\\\\&=-\\big[\\sum_{i=1}^n \\sum_{j=1}^K\\mathbb{I}(y_i=j)\\theta_j^Tx_i -\\sum_{i=1}^n \\ln\\sum_{k=1}^K\\exp(\\theta_k^Tx_i))\\big]\\end{aligned}$$\n", 92 | "\n", 93 | "$$\\begin{aligned}\\frac{\\partial J(\\Theta)}{\\partial \\theta_j}&=-\\sum_{i=1}^n\\frac{\\partial\\big\\{\\mathbb{I}(y_i=j)\\theta_j^Tx_i-\\ln\\ \\sum_{k=1}^K\\exp(\\theta_k^Tx_i)\\big\\}}{\\partial \\theta_j}\\\\&=-\\sum_{i=1}^n\\mathbb{I}(y_i=j)x_i-\\frac{\\exp(\\theta_j^Tx_i)}{\\sum_{k=1}^K\\exp(\\theta_k^Tx_i)}\\cdot x_i=-\\sum_{i=1}^n(\\mathbb{I}(y_i=j)-\\phi_j^{(i)})x_i\\end{aligned}$$\n", 94 | "\n", 95 | "上式没有解析解,因此使用梯度下降优化。对应的梯度下降更新式为\n", 96 | "$$\\theta_j := \\theta_j-\\lambda \\frac{\\partial J(\\Theta)}{\\partial \\theta_j}=\\theta_j+\\lambda\\sum_{i=1}^n(\\mathbb{I}(y_i=j)-\\phi_j^{(i)})x_i$$\n", 97 | "\n", 98 | "对于large scale数据,我们通常使用基于mini-batch的梯度下降(mini-batch SGD)来更新参数" 99 | ] 100 | } 101 | ], 102 | "metadata": { 103 | "kernelspec": { 104 | "display_name": "Python 2", 105 | "language": "python", 106 | "name": "python2" 107 | }, 108 | "language_info": { 109 | "codemirror_mode": { 110 | "name": "ipython", 111 | "version": 2 112 | }, 113 | "file_extension": ".py", 114 | "mimetype": "text/x-python", 115 | "name": "python", 116 | "nbconvert_exporter": "python", 117 | "pygments_lexer": "ipython2", 118 | "version": "2.7.12" 119 | } 120 | }, 121 | "nbformat": 4, 122 | "nbformat_minor": 0 123 | } 124 | -------------------------------------------------------------------------------- /Deep-Learning/back-propagation-through-time.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "## 随时间反向传播 (BackPropagation Through Time,BPTT)\n", 8 | "\n", 9 | "RNN(循环神经网络,Recurrent Neural Network)是一种具有长时记忆能力的神经网络模型,被广泛应用于序列标注(Sequence Labeling)问题。在序列标注问题中,模型的输入是一段时间序列,记为$\\mathbf{x}=\\{x_1,x_2,...,x_T\\}$,我们的目标是为输入序列的每个元素打上标签集合中的对应标签,记为$\\mathbf{y}=\\{y_1,y_2,...,y_T\\}$。\n", 10 | "\n", 11 | "NLP中的大部分任务(比如分词、实体识别、词性标注)都可以最终归结为序列标注问题。这类问题中,输入是语料库中一段由$T$个词(或字)构成的文本$\\mathbf{x}=\\{x_1,x_2,...,x_T\\}$(其中$x_t$表示文本中的第$t$个词);输出是每个词对应的标签,根据任务的不同标签的形式也各不相同,但本质上都是针对每个词根据它的上下文进行标签的分类。 \n", 12 | "\n", 13 | "一个典型的RNN的结构如下图所示:\n", 14 | "![rnn1](http://7xikew.com1.z0.glb.clouddn.com/rnn-unfold.jpg \"RNN模型示意图\")\n", 15 | "\n", 16 | 
"从图中可以看到,一个RNN通常由三层组成,分别是输入层、隐藏层和输出层。与一般的神经网络不同的地方是,RNN的隐藏层存在一条有向反馈边,正是这种反馈机制赋予了RNN记忆能力。要理解左边的图可能有点难度,我们可以将其展开为右边这种更直观的形式,其中RNN的每个神经元接受当前时刻的输入$x_t$以及上一时刻隐单元的输出$h_{t-1}$,计算出当前神经元的输入$s_t$,经过激活函数变换得到输出$h_t$,并传递给下一时刻的隐单元。此外,我们还需要注意到RNN中每个时刻上的神经元的参数都是相同的(类似CNN的权值共享),这么做一方面是减小参数空间,保证泛化能力;另一方面是为了赋予RNN记忆能力,将有用信息存储在$W_{in},W_{rec},W_{out}$三个矩阵中。 \n", 17 | "\n", 18 | "由于RNN是一种基于时序数据的神经网络模型,因此传统的BP算法并不适用于该模型的优化,这要求我们提出新的优化算法。RNN中最常用的优化算法是随时间反向传播(BackPropagation Through Time,BPTT),下文将叙述BPTT算法的数学推导。\n", 19 | "\n", 20 | "\n", 21 | "### 符号注解\n", 22 | "在进一步讨论BPTT之前,先来总结一下本文要用到的数学符号。\n", 23 | "\n", 24 | "|符号|解释|\n", 25 | "|:---:|:|\n", 26 | "|$K$|词汇表的大小|\n", 27 | "|$T$|句子的长度|\n", 28 | "|$H$|隐藏层单元数|\n", 29 | "|$\\mathbf{x}=\\{x_1,x_2,...,x_T\\}$|句子的单词序列|\n", 30 | "| $x_t\\in\\mathbb{R}^{K\\times 1}$|第$t$个时刻RNN的输入,one-hot vector|\n", 31 | "|$\\hat{y}_t\\in\\mathbb{R}^{K\\times 1}$|第$t$时刻softmax层的输出,估计每个词出现的概率|\n", 32 | "|$y_t\\in\\mathbb{R}^{K\\times 1}$|第$t$时刻的label,为每个词出现的概率,one-hot vector|\n", 33 | "|$E_t$|第$t$个时刻(第$t$个word)的损失函数,定义为交叉熵误差$E_t=-y_t^Tlog(\\hat{y}_t)$|\n", 34 | "|$E$|一个句子的损失函数,由各个时刻(即每个word)的损失函数组成,$E=\\sum\\limits_t^T E_t$(注: 由于我们要推导的是SGD算法, 更新梯度是相对于一个训练样例而言的, 因此我们一次只考虑一个句子的误差,而不是整个训练集的误差(对应BGD算法))|\n", 35 | "| $s_t\\in\\mathbb{R}^{H\\times 1}$|第$t$个时刻RNN隐藏层的输入|\n", 36 | "| $h_t\\in\\mathbb{R}^{H\\times 1}$|第$t$个时刻RNN隐藏层的输出\n", 37 | "| $z_t\\in\\mathbb{R}^{K\\times 1}$|输出层的汇集输入|\n", 38 | "| $\\delta^{(t)}_k=\\frac{\\partial E_t}{\\partial s_k}$|第$t$个时刻损失函数$E_t$对第$k$时刻带权输入$s_k$的导数|\n", 39 | "| $r_t=\\hat{y}_t-y_t$|残差向量|\n", 40 | "| $W_{in}\\in\\mathbb{R}^{H\\times K}$|从输入层到隐藏层的权值|\n", 41 | "| $W_{rec}\\in\\mathbb{R}^{H\\times H}$|隐藏层上一个时刻到当前时刻的权值|\n", 42 | "| $W_{out}\\in\\mathbb{R}^{K\\times H}$|隐藏层到输出层的权值|\n", 43 | "\n", 44 | "上述符号之间的关系:\n", 45 | "$$\\left\\{\\begin{aligned}&s_t=W_{rec}h_{t-1}+W_{in}x_t\\\\&h_t=tanh(s_t)\\\\&z_t=W_{out}h_t\\\\& \\hat{y}_t=\\mathrm{softmax}(z_t)\\\\&E_t=-y_t^Tlog(\\hat{y}_t)\\\\&E=\\sum\\limits_t^T E_t \\end{aligned}\\right.$$\n", 46 | "\n", 47 | "这里有必要对上面的一些符号进行进一步解释。 \n", 48 | "1. 本文只讨论输入为one-hot vector的情况,这种向量的特点是茫茫0海中的一个1,即只用一个1表示某个单词的出现;其余的0均表示单词不出现。\n", 49 | "2. RNN要预测的输出是一个one-hot vector,表示下一个时刻各个单词出现的概率。\n", 50 | "3. 由于$y_t$是one-hot vector,不妨假设$y_{t,j}=1(y_{t,i}=0,i\\neq j)$,那么当前时刻的交叉熵为$E_t=-y_t^T \\log(\\hat{y}_t)=-\\log(\\hat{y}_{t,j})$。也就是说如果$t$出现的是第$j$个词,那么计算交叉熵时候只要看$\\hat{y}_t$的第$j$个分量即可。\n", 51 | "4. 由于$x_t$是one-hot向量,假设第$j$个词出现,则$W_{in}x_t$相当于把$W_{in}$的第$j$列选出来,因此这一步是不用进行任何矩阵运算的,直接做下标操作即可。\n", 52 | "\n", 53 | "\n" 54 | ] 55 | }, 56 | { 57 | "cell_type": "markdown", 58 | "metadata": {}, 59 | "source": [ 60 | "BPTT与BP类似,是在时间上反传的梯度下降算法。RNN中,我们的目的是求得$\\frac{\\partial E}{\\partial W_{in}},\\frac{\\partial E}{\\partial W_{rec}},\\frac{\\partial E}{\\partial W_{out}}$,根据这三个变化率来优化三个参数$W_{in},W_{rec},W_{out}$。注意到$\\frac{\\partial E}{\\partial W_{in}}=\\sum\\limits_t \\frac{\\partial E_t}{\\partial W_{in}}$,因此我们只要对每个时刻的损失函数求偏导数再加起来即可。矩阵求导有两种布局方式:分母布局(Denominator Layout)和分子布局(Numerator Layout),关于分子布局和分母布局的区别,请参考文献3。如果这里采用分子布局,那么更新梯度时还需要将梯度矩阵进行一次转置,因此出于数学上的方便,后续的矩阵求导都将采用分母布局
\n", 61 | "\n", 62 | "### 1.计算$\\frac{\\partial E_t}{\\partial W_{out}}$ \n", 63 | "注意到$E_t$是$W_{out}$的复合函数,参考文献3中`Scalar-by-matrix identities`一节中关于复合矩阵函数求导法则(右边的是分母布局):\n", 64 | "![](http://7xikew.com1.z0.glb.clouddn.com/matrix_calculus.jpg)\n", 65 | "我们有: \n", 66 | "$$ \\begin{aligned}\\frac{\\partial E_t}{\\partial W_{out}(i,j)}&=tr\\bigg( \\big( \\frac{\\partial E_t}{\\partial z_t}\\big)^T\\cdot \\frac{\\partial z_t}{\\partial W_{out}(i,j)}\\bigg)\\\\&=tr\\bigg((\\hat{y}_t-y_t)^T\\cdot\\begin{bmatrix}0\\\\ \\vdots \\\\ \\frac{\\partial z_{t}^{(i)}}{\\partial W_{out}(i,j)}\\\\\\vdots\\\\0\\end{bmatrix}\\bigg)\\\\&=r_t^{(i)} h_t^{(j)}\\end{aligned}$$\n", 67 | "其中$r_t^{(i)}=(\\hat{y}_t-y_t)^{(i)}$表示残差向量第$i$个分量,$h_t^{(j)}$表示$h_t$的第j个分量。\n", 68 | "上述结果可以改写为:\n", 69 | "$$ \\frac{\\partial E_t}{\\partial W_{out}}=r_t\\otimes h_t$$\n", 70 | "$$ \\frac{\\partial E}{\\partial W_{out}} = \\sum_{t=0}^T r_t\\otimes h_t $$\n", 71 | "其中$\\otimes$表示向量外积。" 72 | ] 73 | }, 74 | { 75 | "cell_type": "markdown", 76 | "metadata": {}, 77 | "source": [ 78 | "### 2.计算$\\frac{\\partial E_t}{\\partial W_{rec}}$ \n", 79 | "由于$W_{rec}$是各个时刻共享的,所以$t$时刻之前的每个时刻$W_{rec}$的变化都对$E_t$有贡献,反过来求偏导时,也要考虑之前每个时刻$W_{rec}$对$E$的影响。我们以$s_k$为中间变量,应用链式法则:\n", 80 | "$$ \\frac{\\partial E_t}{\\partial W_{rec}} = \\sum_{k=0}^t \\frac{\\partial s_k}{\\partial W_{rec}} \\frac{\\partial E_t}{\\partial s_k}$$\n", 81 | "但由于$\\frac{\\partial s_k}{\\partial W_{rec}}$(分子向量,分母矩阵)以目前的数学发展水平是没办法求的,因此我们要求这个偏导,可以拆解为$E_t$对$W_{rec}(i,j)$的偏导数:\n", 82 | "$$ \\frac{\\partial E_t}{\\partial W_{rec}(i,j)} = \\sum_{k=0}^t tr[(\\frac{\\partial E_t}{\\partial s_k})^T \\frac{\\partial s_k}{\\partial W_{rec}(i,j)}]= \\sum_{k=0}^t tr[(\\delta_k^{(t)})^T\\frac{\\partial s_k}{\\partial W_{rec}(i,j)}]$$\n", 83 | "其中,$\\delta^{(t)}_k=\\frac{\\partial E_t}{\\partial s_k}$,遵循\n", 84 | "$$s_k\\to h_k\\to s_{k+1}\\to ...\\to E_t$$\n", 85 | "的传递关系,应用链式法则有:\n", 86 | "$$\\delta^{(t)}_k=\\frac{\\partial h_k}{\\partial s_k}\\frac{\\partial s_{k+1}}{\\partial h_k} \\frac{\\partial E_t}{\\partial s_{k+1}}=diag(1-h_k\\odot h_k)W_{rec}^T\\delta^{(t)}_{k+1}=(W_{rec}^T\\delta^{(t)}_{k+1})\\odot (1-h_k\\odot h_k)$$\n", 87 | "其中,$\\odot$表示向量点乘(element-wise product)。注意$E_t$求导时链式法则的顺序,$E_t$是关于$s_k$的符合函数,且求导链上的每个变量都是向量,根据参考文献3,这种情况下应用分母布局的链式法则,方向应该是相反的。 \n", 88 | "接下来计算$\\delta^{(t)}_t$:\n", 89 | "$$\\delta^{(t)}_t=\\frac{\\partial E_t}{\\partial s_t}=\\frac{\\partial h_t}{\\partial s_t}\\frac{\\partial z_t}{\\partial h_t}\\frac{\\partial E_t}{\\partial z_t}=diag(1-h_t\\odot h_t)\\cdot W_{out}^T\\cdot(\\hat{y}_t-y_t)=(W_{out}^T(\\hat{y}_t-y_t))\\odot (1-h_t\\odot h_t)$$\n", 90 | "于是,我们得到了关于$\\delta$ 的递推关系式:\n", 91 | "$$\\left\\{\\begin{aligned}\\delta^{(t)}_t&=(W_{out}^T r_t)\\odot (1-h_t\\odot h_t)\\\\ \\delta^{(t)}_k&=(W_{rec}^T\\delta^{(t)}_{k+1})\\odot (1-h_k\\odot h_k)\\end{aligned}\\right.$$\n", 92 | "由$\\delta^{(t)}_t$出发,我们可以往前推出每一个$\\delta$,\n", 93 | "将$\\delta^{(t)}_0,...,\\delta^{(t)}_t$代入$ \\frac{\\partial E_t}{\\partial W_{rec}(i,j)} $有:\n", 94 | "$$ \\frac{\\partial E_t}{\\partial W_{rec}(i,j)} = \\sum_{k=0}^t \\delta_k^{(t)} h_{k-1}^{(j)} $$\n", 95 | "将上式写成矩阵形式:\n", 96 | "$$ \\frac{\\partial E_t}{\\partial W_{rec}} = \\sum_{k=0}^t \\delta^{(t)}_k \\otimes h_{k-1} \\\\\\frac{\\partial E}{\\partial W_{rec}}=\\sum_{t=0}^T \\sum_{k=0}^t \\delta^{(t)}_k \\otimes h_{k-1}$$\n", 97 | "不失严谨性,定义$h_{-1}$为全0的向量。" 98 | ] 99 | }, 100 | { 101 | "cell_type": "markdown", 102 | "metadata": {}, 103 | "source": [ 104 | "### 3.计算$\\frac{\\partial E_t}{\\partial W_{in}}$\n", 105 | 
"按照上述思路,我们可以得到\n", 106 | "$$ \\frac{\\partial E_t}{\\partial W_{in}} = \\sum_{k=0}^t \\delta_k \\otimes x_{k} $$\n", 107 | "由于$x_k$是个one-hot vector,假设$x_k(m)=1$,那么我们在更新$W$时只需要更新$W$的第$m$列即可。" 108 | ] 109 | }, 110 | { 111 | "cell_type": "markdown", 112 | "metadata": {}, 113 | "source": [ 114 | "## 参数更新\n", 115 | "我们有了$E_t$关于各个参数的偏导数,就可以用梯度下降来更新各参数了:\n", 116 | "\n", 117 | "$$\\begin{aligned}&W_{in}:=W_{in}-\\lambda \\sum_{t=0}^T \\sum_{k=0}^t \\delta_k \\otimes x_{k} \\\\& W_{rec}:=W_{rec}-\\lambda \\sum_{t=0}^T \\sum_{k=0}^t \\delta_k \\otimes h_{k-1}\\\\ &W_{out} :=W_{out}-\\lambda \\sum_{t=0}^T r_t \\otimes h_t\\end{aligned}$$\n", 118 | "其中$r_t=\\hat{y}_t-y_t$,$\\delta_t=\\frac{\\partial E_t}{\\partial s_t}=(W_{out}^T r_t)\\odot (1-h_t\\odot h_t)$,$\\lambda>0$表示学习率" 119 | ] 120 | }, 121 | { 122 | "cell_type": "markdown", 123 | "metadata": {}, 124 | "source": [ 125 | "### 一些个人思考\n", 126 | "1. 为什么RNN中要对隐藏层的输出进行一次运算$z_t=W_{out}h_t$,然后再对$z_t$进行一次softmax,而不是直接对$h_t$进行softmax求得概率?为什么要有$W_{out}$这个参数? \n", 127 | "答:\n", 128 | "$x_t$是一个$K\\times 1$的向量,我们要将它映射到一个$H\\times 1$的$h_t$(其中$H$是隐神经元的个数),从$x_t$到$h_t$相当于对词向量做了一次编码;最终我们要得到的是一个$K\\times 1$的向量(这里$K$是词汇表大小),表示每个词接下来出现的概率,所以我们需要一个矩阵$K\\times H$的$W_{out}$来将$h_t$映射回原来的空间去,这个过程相当于解码。因此,RNN可以理解为一种编解码网络。
\n", 129 | "2. $W_{in},W_{rec},W_{out}$三个参数分别有什么意义? \n", 130 | "答:\n", 131 | "$W_{in}$将$K\\times 1$的one-hot词向量映射到$H\\times 1$隐藏层空间,将输入转化为计算机内部可理解可处理的形式,这个过程可以理解为一次编码过程;$W_{rec}$则是隐含层到自身的一个映射,它定义了模型如何结合上文信息,在编码中融入了之前的“记忆”;$W_{in},W_{rec}$结合了当前输入单词和之前的记忆,形成了当前时刻的知识状态。$W_{out}$是隐含层到输出的映射,$z=W_{out}h$是映射后的分数,这个过程相当于一次解码。这个解码后的分数再经过一层softmax转化为概率输出来,我们挑选概率最高的那个作为我们的预测。作为总结, RNN的记忆由两部分构成,一部分是当前的输入,另一部分是之前的记忆。 \n", 132 | "3. BPTT和BP的区别在哪?为什么不能用BP算法训练RNN? \n", 133 | "答:BP算法只考虑了误差的导数在上下层级之间梯度的反向传播;而BPTT则同时包含了梯度在纵向层级间的反向传播和在时间维度上的反向传播,同时在两个方向上进行参数优化。 \n", 134 | "4. 文中词$x_t$的特征是一个one-hot vector,这里能不能替换为word2vec训练出的词向量?效果和性能如何? \n", 135 | "答:RNNLM本身自带了训练词向量的过程。由于$x_t$是one-hot向量,假设出现的词的索引为$j$,那么$W_{in}x_t$就是把$W_{in}$的第$j$列$W[:,j]$取出,这个列向量就相当于该词的词向量。实际上用语言模型训练词向量的思想最早可以追溯到03年Bengio的一篇论文《A neural probabilistic language model 》,这篇论文中作者使用一个神经网络模型来训练n-gram模型,顺便学到了词向量。本文出于数学推导以及代码实现上的方便采用了one-hot向量作为输入。实际工程中,词汇表通常都是几百万,内存没办法装下几百万维的稠密矩阵,所以工程上基本上没有用one-hot的,基本都是用词向量。
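结合上面的更新式,这里给出一段基于前文 `forward` 草图的 BPTT 梯度计算示意代码,按推导出的 δ 递推式和外积公式累加三个梯度(仅为示意,未包含梯度截断等工程处理):

```py
import numpy as np

def bptt_grads(x_idx, y_idx, W_in, W_rec, W_out, cache):
    # cache: list of (h_{t-1}, h_t, \hat{y}_t) produced by the forward() sketch above
    dW_in, dW_rec, dW_out = np.zeros_like(W_in), np.zeros_like(W_rec), np.zeros_like(W_out)
    for t in range(len(x_idx)):
        h_t, y_hat = cache[t][1], cache[t][2]
        r_t = y_hat.copy()
        r_t[y_idx[t]] -= 1.0                          # r_t = \hat{y}_t - y_t
        dW_out += np.outer(r_t, h_t)                  # dE_t/dW_out = outer(r_t, h_t)
        delta = W_out.T.dot(r_t) * (1.0 - h_t**2)     # delta_t^{(t)}
        for k in range(t, -1, -1):                    # back through time: k = t, ..., 0
            h_km1 = cache[k][0]                       # h_{k-1} (zero vector when k = 0)
            dW_rec += np.outer(delta, h_km1)          # += outer(delta_k, h_{k-1})
            dW_in[:, x_idx[k]] += delta               # x_k one-hot: only column x_k is updated
            delta = W_rec.T.dot(delta) * (1.0 - h_km1**2)   # delta_{k-1}^{(t)}
    return dW_in, dW_rec, dW_out

# one SGD step with learning rate 0.1 (toy illustration):
# gin, grec, gout = bptt_grads(x_idx, y_idx, W_in, W_rec, W_out, cache)
# W_in -= 0.1*gin; W_rec -= 0.1*grec; W_out -= 0.1*gout
```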
\n", 136 | "\n", 137 | "\n", 138 | "参考: \n", 139 | "1.[使用RNN解决NLP中序列标注问题的通用优化思路](http://blog.csdn.net/malefactor/article/details/50725480) \n", 140 | "2.[wildml的rnn tutorial part3](http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/) \n", 141 | "3.[Matrix Calculus Wiki](https://en.wikipedia.org/wiki/Matrix_calculus) \n", 142 | "4.《神经网络与深度学习讲义》 邱锡鹏" 143 | ] 144 | } 145 | ], 146 | "metadata": { 147 | "kernelspec": { 148 | "display_name": "Python 2", 149 | "language": "python", 150 | "name": "python2" 151 | }, 152 | "language_info": { 153 | "codemirror_mode": { 154 | "name": "ipython", 155 | "version": 2 156 | }, 157 | "file_extension": ".py", 158 | "mimetype": "text/x-python", 159 | "name": "python", 160 | "nbconvert_exporter": "python", 161 | "pygments_lexer": "ipython2", 162 | "version": "2.7.12" 163 | } 164 | }, 165 | "nbformat": 4, 166 | "nbformat_minor": 0 167 | } 168 | -------------------------------------------------------------------------------- /PRML/Chap3-Linear-Models-For-Regression/3.1-linear-basis-function-models.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 3.1 线性基函数模型\n", 8 | "\n", 9 | "本章开始,我们将学习监督学习中的一个重要任务——回归。给定一个$D$维的输入向量$x\\in \\mathbb{R}^D$,回归的目标是预测与之对应的目标变量$t\\in\\mathbb{R}$。对于回归,我们并不陌生,因为早在第一章的[多项式曲线拟合](../Chap1-Introduction/1.1-polynomial-curve-fitting.ipynb)一节中,我们就已经遇到了一个回归问题:\n", 10 | "$$y(x,\\mathbf{w})=\\sum_{j=0}^Mw_jx^j$$\n", 11 | "我们使用了上式中的多项式函数来拟合一个未知函数在各个点的目标函数值。事实上,多项式函数是本节将要介绍的线性回归模型的一个特例。" 12 | ] 13 | }, 14 | { 15 | "cell_type": "markdown", 16 | "metadata": {}, 17 | "source": [ 18 | "## 线性回归模型\n", 19 | "\n", 20 | "我们先来看一个最简单的线性回归模型\n", 21 | "\n", 22 | "$$ y(\\mathbf{x},\\mathbf{w})=w_0 + w_1 x_1 + ... 
+ x_D w_D $$\n", 23 | "\n", 24 | "该模型中,拟合函数不仅是关于参数的线性函数,同时也是关于输入变量的线性函数。我们把这类关于参数呈线性的模型称为`线性回归模型`(linear regression model),该模型对应于多项式函数在$M=1$时的情形。 \n", 25 | "\n", 26 | "回顾第一章的[多项式曲线拟合](../Chap1-Introduction/1.1-polynomial-curve-fitting.ipynb)中进行的实验,当$M=1$时,多项式函数的拟合效果很差。\n", 27 | "\n", 28 | "为了提高线性模型的拟合能力,我们引入基函数(basis function)对输入变量作一个非线性变换$\\phi(\\mathbf{x})$,于是引入了基函数的线性回归模型可以表示为\n", 29 | "$$y(\\mathbf{x},\\mathbf{w})=w_0 + \\sum_{j=1}^{D-1} w_j \\phi(x_j)$$\n", 30 | "为了简化上式,我们定义$\\phi(x_0)=1$,于是上式又可以简写为\n", 31 | "$$y(\\mathbf{x},\\mathbf{w})=\\sum_{j=0}^{D-1} w_j \\phi(x_j)=\\mathbf{w}^\\top \\phi(\\mathbf{x})$$\n", 32 | "其中,$\\mathbf{w}=(w_0,w_1,...,w_{D-1})^\\top$,$\\phi(\\mathbf{x})=(\\phi_0(\\mathbf{x}),\\phi_1(\\mathbf{x}),...,\\phi_{D-1}(\\mathbf{x}))^\\top$" 33 | ] 34 | }, 35 | { 36 | "cell_type": "markdown", 37 | "metadata": {}, 38 | "source": [ 39 | "可供我们选择的基函数有:\n", 40 | "\n", 41 | "* 高斯基函数\n", 42 | "\n", 43 | "$$ \\phi_j(x)=\\exp\\big\\{-\\frac{(x-\\mu_j)^2}{2s^2}\\big\\}$$\n", 44 | "\n", 45 | "* S基函数\n", 46 | "$$ \\phi_j(x)=\\sigma\\big(-\\frac{x-\\mu_j}{s}\\big)$$\n", 47 | "\n", 48 | "其中$\\sigma(a)=\\frac{1}{1+\\exp(-a)}$\n" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "metadata": {}, 54 | "source": [ 55 | "## 最大似然和最小二乘\n", 56 | "\n", 57 | "回顾第一章,我们通过最小化平方和误差求出了最优多项式曲线的参数,并且证明了在高斯噪声的假设下,平方和误差函数可以由最大似然法导出。这一节,我们将更深入讨论最大似然和最小二乘之间的关系。\n", 58 | "\n", 59 | "像往常一样,我们假设目标变量等于一个确定函数与一个高斯噪声之和\n", 60 | "\n", 61 | "$$ t = y(\\mathbf{x},\\mathbf{w}) + \\epsilon $$\n", 62 | "\n", 63 | "其中$\\epsilon$服从均值为0,精度为$\\beta$的高斯分布,即$\\epsilon \\sim \\mathcal{N}(0, \\beta^{-1})$。\n", 64 | "\n", 65 | "于是$t$也服从高斯分布:\n", 66 | "\n", 67 | "$$ t\\sim \\mathcal{N}(y(\\mathbf{x},\\mathbf{w}), \\beta^{-1})) $$\n", 68 | "\n", 69 | "对于给定的新样本$\\mathbf{x}^*$,其目标变量最优的预测是\n", 70 | "\n", 71 | "$$ t^* = \\int t p(t|\\mathbf{x}^*, \\mathbf{w}, \\beta) dt=y(\\mathbf{x}^*,\\mathbf{w})$$\n", 72 | "\n", 73 | "\n", 74 | "现假设我们从该高斯分布独立同分布地采样得到一个样本集$\\{(\\mathbf{x}_1,t_1), ..., (\\mathbf{x}_N, t_N)\\}$,并令$\\mathrm{t} =(t_1,...,t_N)^\\top$表示由该样本集中所有的目标变量构成的向量。\n", 75 | "\n", 76 | "那么$\\mathrm{t}$的联合概率分布就可以表示为\n", 77 | "\n", 78 | "$$p(\\mathrm{t}|X,\\mathbf{w},\\beta)=\\prod_{i=1}^N p(t_i|\\mathbf{x}_i,\\mathbf{w},\\beta^{-1})=\\prod_{i=1}^N \\mathcal{N}(t_i|\\mathbf{w}^\\top \\phi(\\mathbf{x}_i), \\beta^{-1} )$$\n", 79 | "\n", 80 | "由于$X$已知,可以把$p(\\mathrm{t}|X,\\mathbf{w},\\beta)$简记为$p(\\mathrm{t}|\\mathbf{w},\\beta)$, 对数似然函数为\n", 81 | "\n", 82 | "$$ \\begin{aligned}\\ln p(\\mathrm{t}|\\mathbf{w},\\beta) &= \\sum_{i=1}^N \\ln \\mathcal{N}(t_i|\\mathbf{w}^\\top \\phi(\\mathbf{x}_i), \\beta^{-1} ) =\\sum_{i=1}^N\\ln \\sqrt{\\frac{\\beta}{2\\pi}}\\exp\\big\\{-\\frac{\\beta}{2}[t-\\mathbf{w}^\\top \\phi(x_i)]^2\\big\\}\\\\&=\\frac{N}{2}\\ln\\beta - \\frac{N}{2}\\ln(2\\pi) - \\frac{\\beta}{2} \\sum_{i=1}^N[t_i-\\mathbf{w}^\\top\\phi(x_i)]^2\\\\&=\\frac{N}{2}\\ln\\beta - \\frac{N}{2}\\ln(2\\pi) - \\beta E_D(\\mathbf{w})\\end{aligned}$$\n", 83 | "其中\n", 84 | "$$ E_D(\\mathbf{w})= \\frac{1}{2}\\sum_{i=1}^N[t_i-\\mathbf{w}^\\top\\phi(x_i)]^2$$\n", 85 | "\n", 86 | "即是最小二乘法的目标函数。 \n", 87 | "\n", 88 | "最大似然方法通过最大化似然函数来找到参数的最优值。在线性回归问题中,我们只关心$\\mathbf{w}$,即找到最大化$\\ln p(\\mathrm{t}|\\mathbf{w},\\beta)$的$\\mathbf{w}$,这等价于最小化最小二乘的目标函数$E_D(\\mathbf{w})$\n", 89 | "\n", 90 | "由$E_D(\\mathbf{w})$关于$\\mathbf{w}$的梯度为0\n", 91 | "$$\\begin{aligned}\\nabla_{\\mathbf{w}} E_D(\\mathbf{w}) &= \\frac{1}{2} \\frac{\\partial \\sum_{i=1}^N(t_i^2-2\\mathbf{w}^\\top \\phi(x_i)t_i+(\\mathbf{w}^\\top\\phi(x_i))^2)}{\\partial \\mathbf{w}}\\\\&= 
\\sum_{i=1}^N [\\phi(x_i)\\phi(x_i)^\\top - \\phi(x_i)t_i]=0\\end{aligned}$$\n", 92 | "得\n", 93 | "$$ \\sum_{i=1}^N \\phi(x_i)\\phi(x_i)^\\top \\mathbf{w}=\\sum_{i=1}^N\\phi(x_i)t_i $$\n", 94 | "\n", 95 | "$$ \\mathbf{w}_{ML} = \\big(\\sum_{i=1}^N \\phi(x_i)\\phi(x_i)^\\top\\big)^{-1} \\sum_{i=1}^N\\phi(x_i)t_i = (\\Phi^\\top \\Phi)^{-1}\\Phi^\\top \\mathrm{t}$$\n", 96 | "\n", 97 | "这个结果称为正规方程(Normal Equation),其中$\\Phi$是一个$N\\times D$的矩阵,一般称之为设计矩阵(Design Matrix)\n", 98 | "\n", 99 | "$$ \\Phi=\\begin{bmatrix}\\phi_0(x_0) \\quad \\phi_1(x_0) \\quad... \\quad\\phi_{D-1}(x_0)\\\\\\phi_0(x_1)\\quad \\phi_1(x_1) \\quad... \\quad\\phi_{D-1}(x_1)\\\\\\vdots\\\\\\phi_0(x_N) \\quad \\phi_1(x_N) \\quad... \\quad\\phi_{D-1}(x_N)\\end{bmatrix}$$\n", 100 | "\n", 101 | "$ \\Phi^\\dagger=(\\Phi^\\top \\Phi)^{-1}\\Phi^\\top$称为伪逆矩阵(Moore-Penrose pseudo-inverse),它是逆矩阵概念在非方阵上的推广。注意到,当$\\Phi$是一个方阵时,它的伪逆矩阵等于它的逆矩阵$\\Phi^\\dagger = \\Phi^{-1} (\\Phi^\\top)^{-1} \\Phi^\\top = \\Phi^{-1} \\Phi^{-1} \\Phi = \\Phi^{-1}$。" 102 | ] 103 | }, 104 | { 105 | "cell_type": "markdown", 106 | "metadata": {}, 107 | "source": [ 108 | "## 最小二乘的几何解释\n", 109 | "\n", 110 | "如下图所示,$\\mathrm{t}=(t_1,...,t_N)^\\top$是由所有样本点的目标变量组成的N维向量,$\\varphi_j$表示设施矩阵$\\Phi$的第$j$列,$\\mathcal{S}$表示一个由$\\varphi_j(j=1,2,...,D)$张成的子空间(即$\\Phi$的列空间)\n", 111 | "\n", 112 | "\n", 113 | "\n", 114 | "$\\mathbf{y}$是一个$N$维向量,它的第$j$个位置为$y(\\mathbf{x}_j, \\mathbf{w})$,注意到\n", 115 | "\n", 116 | "$$\\begin{aligned}\\mathbf{y}&=\\big(y(\\mathbf{x}_1, \\mathbf{w}),...,y(\\mathbf{x}_N, \\mathbf{w})\\big)^\\top\\\\&=\\big(\\phi(\\mathbf{x}_1)^\\top \\mathbf{w},..., \\phi(\\mathbf{x}_N)^\\top \\mathbf{w}\\big)\\\\&=\\Phi \\mathbf{w}=\\sum_{j=1}^D \\varphi_j w_j\\end{aligned}$$\n", 117 | "\n", 118 | "由此可知,$\\mathbf{y}$是$\\varphi_j$的线性组合,因此$\\mathbf{y}$也在子空间$\\mathcal{S}$中 \n", 119 | "\n", 120 | "接下来我们来考察最小二乘的目标函数\n", 121 | "\n", 122 | "$$ E_D(\\mathbf{w})= \\frac{1}{2}\\sum_{i=1}^N[t_i-\\mathbf{w}^\\top\\phi(x_i)]^2= \\frac{1}{2}\\left\\|\\mathbf{t}-\\mathbf{y}\\right\\|^2$$\n", 123 | "\n", 124 | "于是最小二乘问题转化为了在$\\Phi$的列空间中找到一个与$\\mathbf{t}$欧式距离最小的点$\\mathbf{y}$。显然,当$\\mathbf{y}$是$\\mathbf{t}$在$\\Phi$中的正交投影时,欧式距离最小,此时对应的参数值$\\mathbf{w}$刚好等于最大似然估计$\\mathbf{w}_{ML}$。\n", 125 | "\n", 126 | "我们来证明这个结论。将$\\mathbf{w}_{ML}$代入$\\mathbf{y}$:\n", 127 | "\n", 128 | "$$\\mathbf{y}_1 = (\\mathbf{w}_{ML}^\\top \\phi(\\mathbf{x}_1), ..., \\mathbf{w}_{ML}^\\top \\phi(\\mathbf{x}_N))^\\top = \\Phi \\mathbf{w}_{ML} =\\Phi (\\Phi^\\top \\Phi)^{-1}\\Phi^\\top \\mathbf{t}$$\n", 129 | "\n", 130 | "另一方面,设\n", 131 | "\n", 132 | "$$\\mathbf{y}_2 = P\\mathbf{t}$$\n", 133 | "\n", 134 | "是$\\mathbf{t}$在$\\Phi$的列空间中的投影,其中$P$是投影矩阵\n", 135 | "\n", 136 | "\n", 137 | "\n", 138 | "我们可以将$\\mathbf{y}_2$表示为以$\\Phi$矩阵中各列为基的线性组合:\n", 139 | "\n", 140 | "$$ \\mathbf{y}_2 = \\Phi \\xi $$\n", 141 | "\n", 142 | "其中$\\xi$是$N$维的向量,由各个基的系数组成\n", 143 | "\n", 144 | "\n", 145 | "二者的残差向量为\n", 146 | "\n", 147 | "$$ \\mathbf{e} = \\mathbf{t} - \\mathbf{y}_2 $$\n", 148 | "\n", 149 | "显然,$\\mathbf{e}$是$\\Phi$的列空间的法向量,因此有\n", 150 | "\n", 151 | "$$ \\Phi^\\top \\mathbf{e} = \\Phi^\\top (\\mathbf{t} - \\mathbf{y})=\\Phi^\\top (\\mathbf{t}-\\Phi \\xi)=0$$\n", 152 | "则\n", 153 | "$$\\Phi^\\top\\Phi \\xi = \\Phi^\\top \\mathbf{t}$$\n", 154 | "\n", 155 | "因此\n", 156 | "\n", 157 | "$$ \\xi = (\\Phi^\\top\\Phi)^{-1}\\Phi^\\top \\mathbf{t}$$\n", 158 | "\n", 159 | "由\n", 160 | "\n", 161 | "$$Pt = \\Phi\\xi=\\Phi(\\Phi^\\top\\Phi)^{-1}\\Phi^\\top \\mathbf{t}$$\n", 162 | "\n", 163 | "可知\n", 164 | "\n", 165 | 
"$$P=\\Phi(\\Phi^\\top\\Phi)^{-1}\\Phi^\\top$$\n", 166 | "\n", 167 | "代入$\\mathbf{y}_2=P\\mathbf{t}$可得\n", 168 | "\n", 169 | "$$\\mathbf{y}_2=P\\mathbf{t} = \\Phi(\\Phi^\\top\\Phi)^{-1}\\Phi^\\top\\mathbf{t} = \\Phi \\mathbf{w}_{ML}=\\mathbf{y}_1$$\n", 170 | "\n", 171 | "由此可知,最小二乘法求解参数的过程等同于求目标变量$\\mathbf{t}$在设计矩阵$\\Phi$的列空间中的正交投影。值得注意的是,当任意两个$\\varphi_j$之间近似地处于同一个方向时,$\\Phi^\\top\\Phi$接近奇异,可能会导致$\\mathbf{w}$数值不稳定,一般的方法是给$\\Phi^\\top\\Phi$添加一个正则项。" 172 | ] 173 | }, 174 | { 175 | "cell_type": "markdown", 176 | "metadata": {}, 177 | "source": [ 178 | "## 正则化最小二乘\n", 179 | "\n", 180 | "回顾1.1节多项式曲线拟合中,我们通过向损失函数引入一个正则项的方法来防止过拟合\n", 181 | "\n", 182 | "$$ E_D(\\mathbf{w})+\\lambda E_W(\\mathbf{w}) $$\n", 183 | "\n", 184 | "其中$\\lambda $是正则系数,用于控制过拟合的代价,$\\lambda$越大,过拟合的代价越大,因此越不容易发生过拟合;$\\lambda$越小,发生过拟合的风险越大。\n", 185 | "\n", 186 | "可以选择的正则项有很多种,其中最简单的一种就是L2正则项(在机器学习中,也叫权重衰减),即参数的平方和:\n", 187 | "$$ E_W(\\mathbf{w})=\\frac{1}{2}\\mathbf{w}^\\top \\mathbf{w}$$\n", 188 | "\n", 189 | "引入了L2正则项的损失函数变为\n", 190 | "\n", 191 | "$$ \\frac{1}{2}\\sum_{i=1}^N[t_i-\\mathbf{w}^\\top\\phi(x_i)]^2 + \\frac{\\lambda}{2}\\mathbf{w}^\\top \\mathbf{w}$$\n", 192 | "\n", 193 | "此时正规方程将变为\n", 194 | "\n", 195 | "$$ \\mathbf{w}_{ML}=(\\Phi^\\top \\Phi + \\lambda I)^{-1}\\Phi^\\top \\mathrm{t}$$\n", 196 | "\n", 197 | "不失一般性,我们可以用一个更通用的形式来表示正则化目标函数\n", 198 | "\n", 199 | "$$ \\frac{1}{2}\\sum_{i=1}^N[t_i-\\mathbf{w}^\\top\\phi(x_i)]^2 + \\frac{\\lambda}{2} \\sum_{j=1}^D |w_j|^q $$\n", 200 | "\n", 201 | "下图展示了当$q$的取不同值时,正则项的等高线变化\n", 202 | "![](http://7xikew.com1.z0.glb.clouddn.com/PRML-3.1-2.png)\n", 203 | "\n", 204 | "L2正则项对应于$q=2$的情况。$q=1$时(L1正则项)在统计学中一般被称为Lasso回归,它可以使权值向量中绝大多数的$w_j$都为0,只有少量权值大于0。为了更直观地理解这一点,我们可以将损失函数改写为一个带约束最优化问题\n", 205 | "\n", 206 | "$$\\begin{aligned} &\\min \\frac{1}{2}\\sum_{i=1}^N[t_i-\\mathbf{w}^\\top\\phi(x_i)]^2\\\\& \\mbox{subject to} \\sum_{j=1}^D |w_j|^q\\leq \\eta\\end{aligned}$$\n", 207 | "\n", 208 | "式中,$\\eta$是某个大于$0$的常数 \n", 209 | "\n", 210 | "下图为我们演示了上述不等式优化问题在二维的情况,图中红线包裹的区域是$\\mathbf{w}$需要满足的不等式约束,蓝色的同心圆表示目标函数的等高线(越往里,函数值越小)\n", 211 | "\n", 212 | "\n", 213 | "\n", 214 | "经过观察,可以知道,$q=1$时,Lasso的L1约束导致了$\\mathbf{w}_1^*=0$。当$\\lambda$越大时,这种约束将会越明显,权值向量将变得越稀疏。" 215 | ] 216 | } 217 | ], 218 | "metadata": { 219 | "kernelspec": { 220 | "display_name": "Python 2", 221 | "language": "python", 222 | "name": "python2" 223 | }, 224 | "language_info": { 225 | "codemirror_mode": { 226 | "name": "ipython", 227 | "version": 2 228 | }, 229 | "file_extension": ".py", 230 | "mimetype": "text/x-python", 231 | "name": "python", 232 | "nbconvert_exporter": "python", 233 | "pygments_lexer": "ipython2", 234 | "version": "2.7.9" 235 | } 236 | }, 237 | "nbformat": 4, 238 | "nbformat_minor": 2 239 | } 240 | -------------------------------------------------------------------------------- /Deep-Learning/theano-notes/part7-dimshuffle.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# theano的dimshuffle\n", 8 | "theano中的dimshuffle函数用于对张量的维度进行操作,可以增加维度,也可以交换维度,删除维度。 \n", 9 | "以下规则总结自官方文档: \n", 10 | "* 'x'表示增加一维,从0d scalar到1d vector \n", 11 | "* (0, 1)表示一个与原先相同的2D向量 \n", 12 | "* (1, 0)表示将2D向量的两维交换 \n", 13 | "* (‘x’, 0) 表示将一个1d vector变为一个1xN矩阵 \n", 14 | "* (0, ‘x’)将一个1d vector变为一个Nx1矩阵 \n", 15 | "* (2, 0, 1) -> A x B x C to C x A x B (2表示第三维也就是C,0表示第一维A,1表示第二维B) \n", 16 | "* (0, ‘x’, 1) -> A x B to A x 1 x B 表示A,B顺序不变在中间增加一维 \n", 17 | "* (1, ‘x’, 0) -> A x B to B x 1 x A 同理自己理解一下 \n", 
18 | "* (1,) -> 删除维度0,被移除的维度必须是可广播的(broadcastable),即(1xA to A) \n", 19 | "\n", 20 | "写了个小程序来验证一下 \n", 21 | "首先定义了一个[0 1 2]的1D vector:v,v.dimshuffle(0)中的0表示第一维:3,也只有一维,所以不变。因为是1D的,所以shape只有(3,)" 22 | ] 23 | }, 24 | { 25 | "cell_type": "code", 26 | "execution_count": 1, 27 | "metadata": { 28 | "collapsed": false, 29 | "scrolled": true 30 | }, 31 | "outputs": [ 32 | { 33 | "name": "stdout", 34 | "output_type": "stream", 35 | "text": [ 36 | "v.dimshuffle(0): [0 1 2]\n", 37 | "v.dimshuffle(0).shape: [3]\n" 38 | ] 39 | } 40 | ], 41 | "source": [ 42 | "from __future__ import print_function\n", 43 | "import theano\n", 44 | "import numpy as np\n", 45 | "\n", 46 | "v = theano.shared(np.arange(3))\n", 47 | "# v.shape is a symbol expression, need theano.function or eval to compile it\n", 48 | "v_disp = v.dimshuffle(0)\n", 49 | "print('v.dimshuffle(0):',v_disp.eval())\n", 50 | "print('v.dimshuffle(0).shape:',v_disp.shape.eval())" 51 | ] 52 | }, 53 | { 54 | "cell_type": "markdown", 55 | "metadata": {}, 56 | "source": [ 57 | "v.dimshuffle('x',0)表示在第一维前加入一维,只要记住加了'x'就加了一维,所以大小变成了1x3" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": 2, 63 | "metadata": { 64 | "collapsed": false 65 | }, 66 | "outputs": [ 67 | { 68 | "name": "stdout", 69 | "output_type": "stream", 70 | "text": [ 71 | "v.dimshuffle('x',0): [[0 1 2]]\n", 72 | "v.dimshuffle('x',0).shape: [1 3]\n" 73 | ] 74 | } 75 | ], 76 | "source": [ 77 | "v_disp = v.dimshuffle('x', 0)\n", 78 | "print(\"v.dimshuffle('x',0):\",v_disp.eval())\n", 79 | "print(\"v.dimshuffle('x',0).shape:\",v_disp.shape.eval())" 80 | ] 81 | }, 82 | { 83 | "cell_type": "markdown", 84 | "metadata": {}, 85 | "source": [ 86 | "在axis=0后插入一维,那么形状应该变为3 x 1" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 3, 92 | "metadata": { 93 | "collapsed": false 94 | }, 95 | "outputs": [ 96 | { 97 | "name": "stdout", 98 | "output_type": "stream", 99 | "text": [ 100 | "v.dimshuffle(0,'x'): [[0]\n", 101 | " [1]\n", 102 | " [2]]\n", 103 | "v.dimshuffle(0,'x').shape: [3 1]\n" 104 | ] 105 | } 106 | ], 107 | "source": [ 108 | "v_disp = v.dimshuffle(0,'x')\n", 109 | "print(\"v.dimshuffle(0,'x'):\",v_disp.eval())\n", 110 | "print(\"v.dimshuffle(0,'x').shape:\",v_disp.shape.eval())" 111 | ] 112 | }, 113 | { 114 | "cell_type": "markdown", 115 | "metadata": {}, 116 | "source": [ 117 | "在axis=0后插入两个新的axis,那么形状应该变为3 x 1 x 1" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | "execution_count": 4, 123 | "metadata": { 124 | "collapsed": false 125 | }, 126 | "outputs": [ 127 | { 128 | "name": "stdout", 129 | "output_type": "stream", 130 | "text": [ 131 | "v.dimshuffle(0,'x','x'): [[[0]]\n", 132 | "\n", 133 | " [[1]]\n", 134 | "\n", 135 | " [[2]]]\n", 136 | "v.dimshuffle(0,'x','x').shape: [3 1 1]\n" 137 | ] 138 | } 139 | ], 140 | "source": [ 141 | "v_disp = v.dimshuffle(0,'x','x')\n", 142 | "print(\"v.dimshuffle(0,'x','x'):\",v_disp.eval())\n", 143 | "print(\"v.dimshuffle(0,'x','x').shape:\",v_disp.shape.eval())" 144 | ] 145 | }, 146 | { 147 | "cell_type": "markdown", 148 | "metadata": {}, 149 | "source": [ 150 | "在axis=0前后各插入一个新的axis,那么形状应该变为1 x 3 x 1" 151 | ] 152 | }, 153 | { 154 | "cell_type": "code", 155 | "execution_count": 5, 156 | "metadata": { 157 | "collapsed": false 158 | }, 159 | "outputs": [ 160 | { 161 | "name": "stdout", 162 | "output_type": "stream", 163 | "text": [ 164 | "v.dimshuffle('x',0,'x'): [[[0]\n", 165 | " [1]\n", 166 | " [2]]]\n", 167 | "v.dimshuffle('x',0,'x').shape: [1 3 1]\n" 168 | ] 169 | } 170 | ], 171 | "source": [ 172 | "v_disp = 
v.dimshuffle('x',0,'x')\n", 173 | "print(\"v.dimshuffle('x',0,'x'):\",v_disp.eval())\n", 174 | "print(\"v.dimshuffle('x',0,'x').shape:\",v_disp.shape.eval())" 175 | ] 176 | }, 177 | { 178 | "cell_type": "markdown", 179 | "metadata": {}, 180 | "source": [ 181 | "在axis=0前插入两个新的axis,那么形状应该变为1 x 1 x 3" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 6, 187 | "metadata": { 188 | "collapsed": false, 189 | "scrolled": true 190 | }, 191 | "outputs": [ 192 | { 193 | "name": "stdout", 194 | "output_type": "stream", 195 | "text": [ 196 | "v.dimshuffle('x','x',0): [[[0 1 2]]]\n", 197 | "v.dimshuffle('x','x',0).shape: [1 1 3]\n" 198 | ] 199 | } 200 | ], 201 | "source": [ 202 | "v_disp = v.dimshuffle('x','x',0)\n", 203 | "print(\"v.dimshuffle('x','x',0):\",v_disp.eval())\n", 204 | "print(\"v.dimshuffle('x','x',0).shape:\",v_disp.shape.eval())" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "metadata": {}, 210 | "source": [ 211 | "第二个例子,m是一个2x3矩阵" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": 7, 217 | "metadata": { 218 | "collapsed": false, 219 | "scrolled": true 220 | }, 221 | "outputs": [ 222 | { 223 | "name": "stdout", 224 | "output_type": "stream", 225 | "text": [ 226 | "m: [[0 1 2]\n", 227 | " [3 4 5]]\n", 228 | "m.shape: [2 3]\n" 229 | ] 230 | } 231 | ], 232 | "source": [ 233 | "m = theano.shared(np.arange(6).reshape(2,3))\n", 234 | "print(\"m:\",m.eval())\n", 235 | "print(\"m.shape:\",m.shape.eval())" 236 | ] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "metadata": {}, 241 | "source": [ 242 | "先确定0,'x',1的维数,0对应第一维(2),1表示第二维(3),'x'表示新加入的维度(1),所以结果维度是2x1x3\n", 243 | "加括号的顺序按照从左到右(外->内): \n", 244 | "1.先加最内层3,3表示括号内有3个数,因此是[0 1 2]和[3 4 5] \n", 245 | "2.再加中间层1,1表示括号内只有一个匹配的\"[]\",因此是[[0 1 2]],[[3 4 5]] \n", 246 | "3.最后加最外层2,2表示括号内有两个匹配的\"[]\"(只算最外层的匹配),于是最后结果是 \n", 247 | "```py\n", 248 | "[[[0 1 2]]\n", 249 | " [[3 4 5]]]\n", 250 | "```" 251 | ] 252 | }, 253 | { 254 | "cell_type": "code", 255 | "execution_count": 8, 256 | "metadata": { 257 | "collapsed": false, 258 | "scrolled": false 259 | }, 260 | "outputs": [ 261 | { 262 | "name": "stdout", 263 | "output_type": "stream", 264 | "text": [ 265 | "m.dimshuffle(0,'x',1): [[[0 1 2]]\n", 266 | "\n", 267 | " [[3 4 5]]]\n", 268 | "m.dimshuffle(0,'x',1).shape: [2 1 3]\n" 269 | ] 270 | } 271 | ], 272 | "source": [ 273 | "m_disp = m.dimshuffle(0,'x',1)\n", 274 | "print(\"m.dimshuffle(0,'x',1):\",m_disp.eval())\n", 275 | "print(\"m.dimshuffle(0,'x',1).shape:\",m_disp.shape.eval())" 276 | ] 277 | }, 278 | { 279 | "cell_type": "markdown", 280 | "metadata": {}, 281 | "source": [ 282 | "同理,结果应该是1 x 2 x 3的张量" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": 9, 288 | "metadata": { 289 | "collapsed": false, 290 | "scrolled": true 291 | }, 292 | "outputs": [ 293 | { 294 | "name": "stdout", 295 | "output_type": "stream", 296 | "text": [ 297 | "m.dimshuffle('x',0,1): [[[0 1 2]\n", 298 | " [3 4 5]]]\n", 299 | "m.dimshuffle('x',0,1).shape: [1 2 3]\n" 300 | ] 301 | } 302 | ], 303 | "source": [ 304 | "m_disp = m.dimshuffle('x', 0, 1)\n", 305 | "print(\"m.dimshuffle('x',0,1):\",m_disp.eval())\n", 306 | "print(\"m.dimshuffle('x',0,1).shape:\",m_disp.shape.eval())" 307 | ] 308 | }, 309 | { 310 | "cell_type": "markdown", 311 | "metadata": {}, 312 | "source": [ 313 | "同理,结果应该是2 x 3 x 1的张量" 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": 10, 319 | "metadata": { 320 | "collapsed": false, 321 | "scrolled": true 322 | }, 323 | "outputs": [ 324 | 
{ 325 | "name": "stdout", 326 | "output_type": "stream", 327 | "text": [ 328 | "m.dimshuffle(0,1,'x'): [[[0]\n", 329 | " [1]\n", 330 | " [2]]\n", 331 | "\n", 332 | " [[3]\n", 333 | " [4]\n", 334 | " [5]]]\n", 335 | "m.dimshuffle(0,1,'x').shape: [2 3 1]\n" 336 | ] 337 | } 338 | ], 339 | "source": [ 340 | "m_disp = m.dimshuffle(0, 1, 'x')\n", 341 | "print(\"m.dimshuffle(0,1,'x'):\",m_disp.eval())\n", 342 | "print(\"m.dimshuffle(0,1,'x').shape:\",m_disp.shape.eval())" 343 | ] 344 | }, 345 | { 346 | "cell_type": "markdown", 347 | "metadata": {}, 348 | "source": [ 349 | "`1,'x',0`表示将axis0和axis1进行了交换,并在中间插入一个新的axis,因此结果应该是3 x 1 x 2" 350 | ] 351 | }, 352 | { 353 | "cell_type": "code", 354 | "execution_count": 11, 355 | "metadata": { 356 | "collapsed": false, 357 | "scrolled": true 358 | }, 359 | "outputs": [ 360 | { 361 | "name": "stdout", 362 | "output_type": "stream", 363 | "text": [ 364 | "m.dimshuffle(1,'x',0): [[[0 3]]\n", 365 | "\n", 366 | " [[1 4]]\n", 367 | "\n", 368 | " [[2 5]]]\n", 369 | "m.dimshuffle(1,'x',0).shape: [3 1 2]\n" 370 | ] 371 | } 372 | ], 373 | "source": [ 374 | "# amount to transpose\n", 375 | "m_disp = m.dimshuffle(1,'x',0)\n", 376 | "print(\"m.dimshuffle(1,'x',0):\",m_disp.eval())\n", 377 | "print(\"m.dimshuffle(1,'x',0).shape:\",m_disp.shape.eval())" 378 | ] 379 | }, 380 | { 381 | "cell_type": "markdown", 382 | "metadata": {}, 383 | "source": [ 384 | "(1,)表示删除一个第一维(axis=0),在初始化shared时要保证被删除的维度是可广播的 " 385 | ] 386 | }, 387 | { 388 | "cell_type": "code", 389 | "execution_count": 12, 390 | "metadata": { 391 | "collapsed": false 392 | }, 393 | "outputs": [ 394 | { 395 | "name": "stdout", 396 | "output_type": "stream", 397 | "text": [ 398 | "[1 3]\n", 399 | "[3]\n" 400 | ] 401 | } 402 | ], 403 | "source": [ 404 | "m = theano.shared(np.array([[1,2,3]]),broadcastable=(True,False))\n", 405 | "print(m.shape.eval())\n", 406 | "print(m.dimshuffle(1,).shape.eval())" 407 | ] 408 | } 409 | ], 410 | "metadata": { 411 | "kernelspec": { 412 | "display_name": "Python 2", 413 | "language": "python", 414 | "name": "python2" 415 | }, 416 | "language_info": { 417 | "codemirror_mode": { 418 | "name": "ipython", 419 | "version": 2 420 | }, 421 | "file_extension": ".py", 422 | "mimetype": "text/x-python", 423 | "name": "python", 424 | "nbconvert_exporter": "python", 425 | "pygments_lexer": "ipython2", 426 | "version": "2.7.12" 427 | } 428 | }, 429 | "nbformat": 4, 430 | "nbformat_minor": 0 431 | } 432 | -------------------------------------------------------------------------------- /NLP/latent-dirichlet-allocation-1.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# LDA介绍\n", 8 | "\n", 9 | "## 主题模型简介\n", 10 | "主题模型(Topic Model)在机器学习和自然语言处理等领域是用来在一系列文档中发现抽象主题的一种统计模型。直观来讲,如果一篇文章有一个中心思想,那么一些特定词语会更频繁的出现。比方说,如果一篇文章是在讲狗的,那“狗”和“骨头”等词出现的频率会高些。如果一篇文章是在讲猫的,那“猫”和“鱼”等词出现的频率会高些。而有些词例如“这个”、“和”大概在两篇文章中出现的频率会大致相等。但真实的情况是,一篇文章通常包含多种主题,而且每个主题所占比例各不相同。因此,如果一篇文章10%和猫有关,90%和狗有关,那么和狗相关的关键字出现的次数大概会是和猫相关的关键字出现次数的9倍。一个主题模型试图用数学框架来体现文档的这种特点。主题模型自动分析每个文档,统计文档内的词语,根据统计的信息来断定当前文档含有哪些主题,以及每个主题所占的比例各为多少。主题模型最初是运用于自然语言处理相关方向,但目前以及延伸至例如生物信息学的其它领域。\n", 11 | "\n", 12 | "下面我用一个实际的例子来说明主题模型的工作原理,下图是我们手头上有的一个语料库,一共有7篇文章\n", 13 | "\n", 14 | "![](http://7xikew.com1.z0.glb.clouddn.com/topic_model1.png)\n", 15 | "\n", 16 | "我们要将这些文档分配到三个主题,Topic1是关于信息技术的,Topic2是关于商业,Topic3是关于电影\n", 17 | "![](http://7xikew.com1.z0.glb.clouddn.com/topic_model2.png)\n", 18 | "\n", 19 | 
"经过主题模型分析后,我们会得到关于每篇文章的一个主题分布,主题分布告诉我们一篇文章上各个主题的占比是多少\n", 20 | "\n", 21 | "![](http://7xikew.com1.z0.glb.clouddn.com/topic_model5.png)\n", 22 | "\n", 23 | "比如黄颜色的文档属于技术的主题的概率远高于其他主题的概率;橙色文章位于主题1和主题2的中间,因此技术主题和商业主题的比例各占一半;而灰色文章位于三个主题的中心,因此三个主题的比例大致相等。\n" 24 | ] 25 | }, 26 | { 27 | "cell_type": "markdown", 28 | "metadata": {}, 29 | "source": [ 30 | "\n", 31 | "## 矩阵分解\n", 32 | "\n", 33 | "我们还可以从矩阵分解的角度来理解LDA。具体来说,就是将文档-词频矩阵分解为文档-主题矩阵与主题-词矩阵的乘积。\n", 34 | "\n", 35 | "![](http://7xikew.com1.z0.glb.clouddn.com/topic_model4.png)\n", 36 | "\n", 37 | "图中,$M$代表文章数,$K$代表主题数,$V$代表词表大小。\n", 38 | "\n", 39 | "在矩阵分解之前,文本挖掘领域在文本特征表示问题上所采用的主流方法是向量空间模型。然而,传统的基于向量空间模型的方法有一个致命缺陷,那就是特征稀疏问题。一方面,算法的效率会受到影响;另一方面,增加了存储上的开销。\n", 40 | "\n", 41 | "作为改进,以LSA(潜在语义分析)为代表的矩阵分解方法被提出。LSA的革新之处在于打破了人们在文本表示上的思维定势,即认为文档是表示在词典空间上的。通过在文档和词之间引入一个语义维度,从而让文档和词在更低维的语义空间上建立起联系。语义是文档集合信息的一种浓缩,在LSA框架下,文档被表示为语义空间上的向量,向量值代表文档与语义的关联度;语义被表示为词典空间上的向量,向量值代表词和语义的关联度。\n", 42 | "\n", 43 | "其基本原理是运用线性代数中的矩阵分解方法(奇异值分解,简称SVD)将文档-词频矩阵分解为两个维度更低的矩阵的乘积,从而获得原始矩阵的低秩近似(low-rank approximation)。这么做的好处是可以去除噪音,降低数据表示成本。\n", 44 | "\n", 45 | "LSA的缺点是可解释性不强,它虽然能获得文档的语义表示,但不能解答文档集是如何产生的这一问题。针对于此,David M.Blei将概率模型引入到LSA框架中,提出LDA(潜在狄利克雷分配)算法。由于引入概率,所以LDA是一种生成式的矩阵分解方法(Generative Matrix Factorization)。与LSA相似的地方是,LDA也试图在文档->词的映射中引入一个中间变量,所不同的是他们把这一抽象概念称为主题。\n" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "\n", 53 | "## 基础知识介绍\n", 54 | "\n", 55 | "\n", 56 | "### 多项式分布\n", 57 | "\n", 58 | "多项式分布是文本处理领域最常用的分布之一,它是定义在一组离散随机变量上的分布,由一个和为1的参数向量表示,其概率分布数学形式为\n", 59 | "\n", 60 | "$$\n", 61 | "Multi(n_1,n_2,...,n_K|\\boldsymbol{p},N)=\\binom N {n_1n_2...n_K}\\prod_{k=1}^K p_k^{n_k}\n", 62 | "$$\n", 63 | "\n", 64 | "其中$K$为所有可能结果数,$\\mathbf{p}=(p_1,...,p_K)$,$p_k$表示一次独立实验中第$k$种结果发生的概率,$N$表示总观测次数(每次观测相互独立),$n_k$表示$N$次观测中第$k$种结果出现的次数,$N$和$n_k$的关系是$\\sum_{i=1}^K n_i=N$。我们可以把$N$次观测的结果表示为一个向量$\\mathbf{n}=\\{n_1n_2...n_K\\}$\n", 65 | "\n", 66 | "$\\binom N {n_1n_2...n_K}=\\frac{N!}{n_1!n_2!...n_K!}$表示可能结果总数\n", 67 | "理解多项式分布最简单的例子就是掷骰子。假设我们有一枚六面的骰子($K=6$),当我们独立地投掷这枚骰子$N$次,6个面的总出现次数$\\mathbf{n}=\\{n_1,...,n_6\\}$所构成的分布就服从一个多项式分布:\n", 68 | "$$\n", 69 | "Multi(n_1,n_2,...,n_6|\\boldsymbol{p},N)=\\binom N {n_1n_2...n_6}\\prod_{k=1}^6 p_k^{n_k}\n", 70 | "$$\n", 71 | "\n", 72 | "### 多项式分布的可视化\n", 73 | "\n", 74 | "前面提到,多项式分布可以由参数向量$\\mathbf{p}$表示,现在我们来对它进行可视化分析。如下图所示,当$K=3$时,多项式分布可以表示为三角形区域中的一个点,这些点的坐标的和为1,这个三角形区域在数学上被称为单纯形(simplex)。\n", 75 | "\n", 76 | "![](http://7xikew.com1.z0.glb.clouddn.com/topic_model6.png)\n", 77 | "\n", 78 | "\n", 79 | "\n", 80 | "### 狄利克雷分布\n", 81 | "\n", 82 | "\n", 83 | "狄利克雷分布是一组连续多变量概率分布,是Beta分布在多变量情况下的推广。为了纪念德国数学家约翰·彼得·古斯塔夫·勒热纳·狄利克雷(Peter Gustav Lejeune Dirichlet)而命名。狄利克雷分布常作为贝叶斯统计的先验概率。当狄利克雷分布维度趋向无限时,便成为狄利克雷过程(Dirichlet process)。\n", 84 | "狄利克雷分布奠定了狄利克雷过程的基础,被广泛应用于自然语言处理特别是主题模型(topic model)的研究。\n", 85 | "\n", 86 | "狄利克雷分布数学定义式为\n", 87 | "\n", 88 | "$$Dir(\\mathbf{p}|\\alpha)=\\frac{\\Gamma(\\sum_{k}\\alpha_k)}{\\prod_{k}\\Gamma(\\alpha _k)}\\prod_{k}p_k^{\\alpha_k-1}$$\n", 89 | "\n", 90 | "其中$\\alpha$是狄利克雷分布的参数。\n", 91 | "\n", 92 | "### 狄利克雷分布的均值\n", 93 | "\n", 94 | "在继续深入之前,先证明一个结论,狄利克雷分布的均值为\n", 95 | "\n", 96 | "$$\\mu_j=\\frac{\\alpha_j}{\\alpha_0}$$\n", 97 | "其中$\\alpha_0=\\sum_{k=1}^K \\alpha_k$\n", 98 | "\n", 99 | "证明:\n", 100 | "\n", 101 | "$$\n", 102 | "\\begin{aligned}\\mu_j&=\\int_{ \\boldsymbol{p}\\in A} p_j Dir(\\boldsymbol{p}|\\alpha)d\\boldsymbol{p}\\\\&=\\frac{\\Gamma(\\alpha_0)}{\\Gamma(\\alpha_1)\\cdots\\Gamma(\\alpha_K)}\\int_{\\boldsymbol{p}\\in A}p_j^{(\\alpha_j+1)-1}\\prod_{k=1,k\\neq j}^K 
p_k^{\\alpha_k-1} d p_k d p_j \\\\&=\\frac{\\Gamma(\\alpha_0)}{\\Gamma(\\alpha_1)\\cdots\\Gamma(\\alpha_K)}\\frac{\\Gamma(\\alpha_1)\\cdots\\Gamma(\\alpha_{j-1})\\Gamma(\\alpha_j+1)\\Gamma(\\alpha_{j+1})\\cdots\\Gamma(\\alpha_K)}{\\Gamma(\\alpha_0+1)}\\\\&=\\frac{\\Gamma(\\alpha_0)\\Gamma(\\alpha_j+1)}{\\Gamma(\\alpha_0+1)\\Gamma(\\alpha_j)}\\\\&=\\frac{\\alpha_j}{\\alpha_0}\\end{aligned}\n", 103 | "$$\n", 104 | "\n", 105 | "其中$A$表示$\\mathbf{p}$所在的单纯形。$\\alpha_0$决定了狄利克雷分布的陡峭程度(有点类似于高斯分布中的方差),$\\alpha_0$越小,分布就越平缓;$\\alpha_0$越大,分布就越陡峭。\n", 106 | "\n", 107 | "\n", 108 | "### 参数的取值对于狄利克雷分布形状的影响\n", 109 | "\n", 110 | "还是以K=3为例,狄利克雷分布定义在多项式分布的单纯形上。下图为我们演示了当参数$\\alpha$变化时狄利克雷分布的等高线分布图(颜色越红表示密度越大):\n", 111 | "\n", 112 | "![](http://7xikew.com1.z0.glb.clouddn.com/topic_model7.png)\n", 113 | "\n", 114 | "\n", 115 | "\n", 116 | "\n", 117 | "当$\\alpha=(1,1,1)$时,$p_k^{\\alpha_k-1}=1$,退化为均匀分布;其概率密度如下所示\n", 118 | "![](http://7xikew.com1.z0.glb.clouddn.com/dirichlet2.png)\n", 119 | "\n", 120 | "\n", 121 | "当$\\alpha=(0.1,0.1,0.1)$时,分布将变得异常稀疏,密度主要会集中在单纯形的角落和边界上,其余位置会变得异常扁平,如下图所示:\n", 122 | "\n", 123 | "\n", 124 | "当$\\alpha=(10,0,10,10)$时,分布的均值向量是$m=(1/3,1/3,1/3)$,其概率密度的形状类似于凸起的山丘,其最高点就是均值\n", 125 | "![](http://7xikew.com1.z0.glb.clouddn.com/dirichlet3.png)\n", 126 | "\n", 127 | "\n", 128 | "\n", 129 | "限于篇幅,狄利克雷分布的很多细节没法讲清楚,后面我会专门写一篇关于狄利克雷分布的笔记。\n", 130 | "\n", 131 | "\n", 132 | "### 狄利克雷分布与多项式分布的关系\n", 133 | "\n", 134 | "既然Beta分布是二项式分布的共轭先验,而狄利克雷分布是Beta分布在多维空间上的推广,那么我们可以做出假设:狄利克雷分布是多项式分布的共轭先验。这意味着如果我们选择狄利克雷分布作为多项式分布参数的先验分布,那么它的后验分布也是一个狄利克雷分布(只是参数与先验分布的不同)。\n", 135 | "\n", 136 | "接下来我们来证明这个结论。如果把贝叶斯公式给做个简化,则有\n", 137 | "$$p(x|y)p(y)\\propto p(y|x)$$\n", 138 | "\n", 139 | "因此,我们只需要证明\n", 140 | "\n", 141 | "$$ Multi(\\mathbf{n}|N,\\boldsymbol{p}) \\cdot Dir(\\mathbf{p}|\\alpha) \\propto Dir(\\mathbf{p}|\\alpha') $$\n", 142 | "\n", 143 | "证明:\n", 144 | "\n", 145 | "$$ \\begin{aligned}Multi(\\mathbf{n}|N,\\boldsymbol{p}) \\cdot Dir(\\mathbf{p}|\\alpha \\mathbf{m}) &= \\binom N {n_1n_2...n_K}\\prod_{k=1}^K p_k^{n_k} \\cdot \\frac{\\Gamma(\\sum_{k}\\alpha_k)}{\\prod_{k}\\Gamma(\\alpha _k)}\\prod_{k}p_k^{\\alpha_k-1}\\\\&= \\binom N {n_1n_2...n_K} \\frac{\\Gamma(\\sum_{k=1}^K\\alpha_k)}{\\prod_{k=1}^K\\Gamma(\\alpha_k)} \\prod_{k=1}^K p_k^{n_k+\\alpha_k - 1}\\\\&\\propto \\prod_{k=1}^K p_k^{n_k+\\alpha_k - 1}\\end{aligned}$$\n", 146 | "\n", 147 | "注意到等式右端与狄利克雷分布的形式相同,通过类比可以得出,参数$\\mathbf{p}$的后验分布为\n", 148 | "\n", 149 | "$$Dir(p|\\alpha')=\\frac{\\Gamma(\\sum_{k=1}^K\\alpha_k+n_k)}{\\prod_{k=1}^K\\Gamma(\\alpha_k+n_k)} \\prod_{k=1}^K p_k^{n_k+\\alpha_k - 1}$$\n", 150 | "\n", 151 | "于是我们可以得出结论,狄利克雷分布是多项式分布的共轭先验,这个结论可以表达为\n", 152 | "\n", 153 | "$$ Dir(p|\\alpha_1,...,\\alpha_k) + Multi(n_1,...,n_k|N,p)\\to Dir(p|\\alpha_1+n_1,...,\\alpha_k+n_k)$$" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "metadata": {}, 159 | "source": [ 160 | "\n", 161 | "## LDA工作原理\n", 162 | "\n", 163 | "LDA算法框架可以用如下的概率图模型表示\n", 164 | "\n", 165 | "![](http://7xikew.com1.z0.glb.clouddn.com/topic_model8.png)\n", 166 | "\n", 167 | "下表给出了图中公式符号的含义\n", 168 | "\n", 169 | "\n", 170 | "|符号|含义|\n", 171 | "|:|:|\n", 172 | "|$D$|文档数|\n", 173 | "|$N$|一篇文档的单词数|\n", 174 | "|$ K$|主题数|\n", 175 | "|$V$|词表大小|\n", 176 | "|$\\alpha$|$\\theta_d$的先验分布的超参数|\n", 177 | "|$\\eta$|$\\beta_k$的先验分布的超参数|\n", 178 | "|$\\beta_k\\sim DIR(\\eta)$|第$k$个主题下所有单词的分布|\n", 179 | "|$\\theta_d \\sim DIR(\\alpha)$|文档$d$的主题分布|\n", 180 | "|$W_{d,n}\\sim Multi(\\beta_{Z_{d,n}})$|文档$d$第$n$个词|\n", 181 | "|$Z_{d,n}\\sim Multi(\\theta_d)$|$W_{d,n}$所对应主题|\n", 182 | "\n", 183 
| "此处为了简化模型,我们假设所有文章都包含的单词数均为$N$。LDA模型下,一篇文档的产生过程如下:\n", 184 | "\n", 185 | "\n", 186 | ">for $k$ in $1,2,...,K$: \n", 187 | " $\\beta_k\\sim DIR(\\eta,...,\\eta)$ \n", 188 | "for $d$ in $1,2,...,D$: \n", 189 | " $\\theta_d \\sim DIR(\\alpha,...,\\alpha)$ \n", 190 | " for $n$ in $1,2,...,N$: \n", 191 | "  $Z_{d,n}\\sim Multi(\\theta_d)$ \n", 192 | "  $W_{d,n}\\sim Multi(\\beta_{Z_{d,n}})$\n", 193 | "\n", 194 | "为了更直观的叙述这个过程,我截取《LDA数学八卦》中的一张示意图(符号有些差异,不过不影响我们理解这个过程):\n", 195 | "\n", 196 | "![](http://7xikew.com1.z0.glb.clouddn.com/topic_model9.png)\n", 197 | "\n", 198 | "我们可以把$\\beta_k$看作有$V$个面的骰子(同理把$\\theta_d$看作有$K$个面的骰子),而它们都是从狄利克雷分布(对应图中左方的两个坛子)中随机抽样生成的,上面的坛子里装的是是doc-topic骰子,下面的坛子装的是topic-word骰子。在此设定下,LDA模型生成文档的过程描述如下: \n", 199 | "1)从下面的坛子中随机取出$K$个topic-word骰子 \n", 200 | "2)从上面的坛子随机取出一个doc-topic骰子 \n", 201 | "3)随机投掷doc-topic骰子得到主题$Z_{d,n}$ \n", 202 | "4)从第1步取出的K个骰子中取出编号为$Z_{d,n}$的骰子,并进行投掷,得到词$W_{d,n}$\n" 203 | ] 204 | }, 205 | { 206 | "cell_type": "markdown", 207 | "metadata": {}, 208 | "source": [ 209 | "## Dirichlet-Multinomial共轭\n", 210 | "\n", 211 | "我们观察LDA的概率图模型,发现LDA模型可以分解为两个物理过程: \n", 212 | "\n", 213 | "* $\\alpha\\to\\theta_d\\to z_{d,n}$\n", 214 | "\n", 215 | "这个过程对应于生成第$d$篇文档的时候,从上面的坛子中随机抽取一个doc-topic骰子$\\theta_d$,再用这个骰子进行投掷,得到第$n$个词的topic编号$z_{d,n}$\n", 216 | "\n", 217 | "* $\\eta\\to\\beta_k\\to w_{d,n}|z_{d,n}=k$\n", 218 | "\n", 219 | "这个过程对应于在生成第$d$篇文档的第$n$个词的时候,先从下面的坛子中取出编号为$z_{d,n}$的topic-word骰子,再进行投掷得到$w_{d,n}$\n", 220 | "\n", 221 | "注意到$\\theta_d,\\beta_k$满足Dirichlet分布,而$z_{d,n},w_{d,n}$服从Multinomial分布,根据上一节推导的结论可知,$\\theta_d$和$\\beta_k$的后验分布也是Dirichlet分布,这个性质称为Dirichlet-Multinomial共轭。\n", 222 | "\n" 223 | ] 224 | }, 225 | { 226 | "cell_type": "markdown", 227 | "metadata": {}, 228 | "source": [ 229 | "利用该性质,我们可以得出$\\theta_d$和$\\beta_k$的后验分布分别为:\n", 230 | "\n", 231 | "$$\\theta_d\\sim Dir(\\alpha+N_1^{(d)}, ..., \\alpha+N_K^{(d)})$$\n", 232 | "\n", 233 | "其中$N_k^{(d)}$表示第$d$篇文档中属于主题$k$的单词数,即$N_k^{(d)}=\\sum_{n=1}^N \\mathbb{1}(z_{d,n}=k)$ \n", 234 | "\n", 235 | "其中$\\theta_{d,k}$的参数估计为\n", 236 | "\n", 237 | "$$\\theta^*_{d,k}=\\mathbb{E}[\\theta_{d,k}]=\\frac{\\alpha+N_k^{(d)}}{K\\alpha+\\sum_{k=1}^K N_k^{(d)}}$$" 238 | ] 239 | }, 240 | { 241 | "cell_type": "markdown", 242 | "metadata": {}, 243 | "source": [ 244 | "\n", 245 | "$$ \\beta_k \\sim Dir(\\eta+N_1^{(k)},...,\\eta+N_V^{(k)}) $$\n", 246 | "\n", 247 | "其中$N_v^{(k)}$表示第$v$个词在主题$k$中的出现次数\n", 248 | "\n", 249 | "那么$\\beta_{k,v}$的参数估计为\n", 250 | "\n", 251 | "$$\\beta^*_{k,v}=\\mathbb{E}[\\beta_{k,v}]=\\frac{\\eta+N_v^{(k)}}{V\\eta+\\sum_{v=1}^V N_v^{(k)}}$$" 252 | ] 253 | }, 254 | { 255 | "cell_type": "markdown", 256 | "metadata": {}, 257 | "source": [ 258 | "如此一来,LDA模型一共有$D+K$个Dirichlet-Multinomial共轭链" 259 | ] 260 | } 261 | ], 262 | "metadata": { 263 | "kernelspec": { 264 | "display_name": "Python 2", 265 | "language": "python", 266 | "name": "python2" 267 | }, 268 | "language_info": { 269 | "codemirror_mode": { 270 | "name": "ipython", 271 | "version": 2 272 | }, 273 | "file_extension": ".py", 274 | "mimetype": "text/x-python", 275 | "name": "python", 276 | "nbconvert_exporter": "python", 277 | "pygments_lexer": "ipython2", 278 | "version": "2.7.9" 279 | } 280 | }, 281 | "nbformat": 4, 282 | "nbformat_minor": 2 283 | } 284 | -------------------------------------------------------------------------------- /PRML/Chap3-Linear-Models-For-Regression/summary-baysian-linear-regression.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": 
"markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 总结-贝叶斯线性回归\n", 8 | "\n", 9 | "## 问题背景:\n", 10 | "为了与PRML第一章一致,我们假定数据出自一个高斯分布:\n", 11 | "$$ p(t|x,\\mathbf{w},\\beta)=\\mathcal{N}(t|y(x,\\mathbf{w}),\\beta^{-1})=\\sqrt{\\frac{\\beta}{2\\pi}}\\exp(-\\frac{\\beta}{2}(t-y(x,\\mathbf{w}))^2)$$\n", 12 | "其中$\\beta$是精度,$y(x,\\mathbf{w})=\\sum\\limits_{j=0}^Mw_jx^j$\n", 13 | "$\\mathbf{w}$的先验为:\n", 14 | "$$ p(\\mathbf{w})=\\mathcal{N}(\\mathbf{w}|\\mathbf{0},\\alpha^{-1}\\mathbf{I})=(\\frac{\\alpha}{2\\pi})^{(M+1)/2}\\exp(-\\frac{\\alpha}{2}\\mathbf{w}^T\\mathbf{w})$$\n", 15 | "其中$\\alpha$是高斯分布的精度。\n", 16 | "为了表示方便,我们定义一个变换$\\phi(x)=(1,x,x^2,...,x^M)^T$,那么$y(x,\\mathbf{w})=\\mathbf{w}^T\\phi(x)$。为了对$\\mathbf{w}$作推断,我们需要收集数据更新先验分布,记收集到的数据为$\\mathbf{x}_N=\\{x_1,...,x_N\\}$,$\\mathbf{t}_N=\\{t_1,...,t_N\\}$,其中$t_i$是$x_i$对应的响应。进一步我们引入一个矩阵$\\Phi$,其定义如下:\n", 17 | "$$\\Phi=\\begin{bmatrix}\\phi(x_1)^T\\\\\\phi(x_2)^T\\\\\\vdots\\\\\\phi(x_N)^T\\end{bmatrix}$$\n", 18 | "我们可以认为这个矩阵是由$\\phi(x_i)^T$平铺而成。\n", 19 | "\n" 20 | ] 21 | }, 22 | { 23 | "cell_type": "markdown", 24 | "metadata": {}, 25 | "source": [ 26 | "## 详细推导过程\n", 27 | "首先我们先验证$p(\\mathbf{w}|\\mathbf{x},\\mathbf{t})$是个高斯分布。根据贝叶斯公式我们有:\n", 28 | "\n", 29 | "$$\\begin{aligned}p(\\mathbf{w}|\\mathbf{x},\\mathbf{t})&\\propto p(\\mathbf{w}|\\alpha)p(\\mathbf{t}|\\mathbf{x},\\mathbf{w})\\\\&=(\\frac{\\alpha}{2\\pi})^{(M+1)/2} \\exp(-\\frac{\\alpha}{2}\\mathbf{w}^T\\mathbf{w})\\prod_{i=1}^N\\sqrt{\\frac{\\beta}{2\\pi}}\\exp(-\\frac{\\beta}{2}(t_i-\\mathbf{w}^T\\phi(x_i))^2)\\\\&\\propto \\exp(-\\frac{1}{2}\\Big\\{\\alpha \\mathbf{w}^T\\mathbf{w}+\\beta\\sum_{i=1}^N(t_i^2-2t_i\\mathbf{w}^T\\phi(x_i)+\\mathbf{w}^T\\phi(x_i)\\phi(x_i)^T\\mathbf{w})\\Big\\})\\\\&\\propto \\exp(-\\frac{1}{2}\\Big\\{ \\mathbf{w}^T(\\alpha \\mathbf{I}+\\beta\\sum_{i=1}^N\\phi(x_i)\\phi(x_i)^T)\\mathbf{w}-2\\beta\\sum_{i=1}^N t_i\\phi(x_i)^T\\mathbf{w}\\Big\\})\\\\&\\propto \\exp(-\\frac{1}{2}\\Big\\{\\mathbf{w}^T(\\alpha \\mathbf{I}+\\beta \\Phi^T\\Phi)\\mathbf{w} -2\\beta(\\Phi^T\\mathbf{t}_N)^T\\mathbf{w}\\Big\\})\\end{aligned}$$\n", 30 | "令$S^{-1}=\\alpha \\mathbf{I}+\\beta\\Phi^T\\Phi$,$\\mu=\\beta S\\Phi^T\\mathbf{t}_N$(注:原书中式1.72写错了)\n", 31 | "则\n", 32 | "$$ p(\\mathbf{w}|\\mathbf{x},\\mathbf{t})\\propto \\exp(-\\frac{1}{2}(\\mathbf{w}-\\mu)^TS^{-1}(\\mathbf{w}-\\mu))\\propto \\mathcal{N}(\\mathbf{w}|\\mu,S)$$\n", 33 | "至此,我们证明了后验分布也是个高斯,接下来我们计算predictive distribution,注意到$p(t|x,\\mathbf{x},\\mathbf{t})$是两个高斯分布的卷积,其结果也是一个高斯,但为了严谨起见,还是证明一下。\n", 34 | "$$\\begin{aligned}p(t|x,\\mathbf{x},\\mathbf{t})&=\\int p(t|x,\\mathbf{w})p(\\mathbf{w}|\\mathbf{x},\\mathbf{t})d\\mathbf{w}\\\\&=\\frac{1}{(2\\pi)^{M/2+1}}\\frac{1}{(\\beta^{-1}|S|)^{1/2}} \\int \\exp(-\\frac{1}{2}\\Big\\{\\beta(t-\\mathbf{w}^T\\phi(x))^2+(\\mathbf{w}-\\mu)^TS^{-1}(\\mathbf{w}-\\mu)\\Big\\})d\\mathbf{w}\\\\&=\\frac{1}{(2\\pi)^{M/2+1}}\\frac{1}{(\\beta^{-1}|S|)^{1/2}}\\int\\exp(-\\frac{1}{2}\\Big\\{\\beta t^2-2\\beta t\\phi(x)^T\\mathbf{w}+\\beta\\mathbf{w}^T\\phi(x)\\phi(x)^T\\mathbf{w}\\\\&+\\mathbf{w}^TS^{-1}\\mathbf{w}-2\\mu^T S^{-1}\\mathbf{w}+\\mu^T S^{-1}\\mu\\Big\\})d\\mathbf{w}\\\\&=\\frac{1}{(2\\pi)^{M/2+1}}\\frac{1}{(\\beta^{-1}|S|)^{1/2}}\\exp(-\\frac{1}{2}(\\beta t^2+\\mu^T S^{-1}\\mu))\\cdot \\\\&\\int\\exp(-\\frac{1}{2}\\Big\\{-2\\beta t\\phi(x)^T\\mathbf{w}+\\beta\\mathbf{w}^T\\phi(x)\\phi(x)^T\\mathbf{w}+\\mathbf{w}^TS^{-1}\\mathbf{w}-2\\mu^T S^{-1}\\mathbf{w}\\Big\\})d\\mathbf{w}\\\\&=\\frac{1}{(2\\pi)^{M/2+1}}\\frac{1}{(\\beta^{-1}|S|)^{1/2}}\\exp(-\\frac{1}{2}(\\beta t^2+\\mu^T 
S^{-1}\\mu))\\cdot\\\\&\\int\\exp(-\\frac{1}{2}\\Big\\{\\mathbf{w}^T(\\underbrace{\\beta\\phi(x)\\phi(x)^T+S^{-1})}_{\\Lambda^{-1}}\\mathbf{w}-2(\\underbrace{\\beta t\\phi(x)^T+\\mu^TS^{-1}}_{m^T\\Lambda^{-1}})\\mathbf{w}\\Big\\}d\\mathbf{w}\\\\&=\\frac{1}{(2\\pi)^{M/2+1}}\\frac{1}{(\\beta^{-1}|S|)^{1/2}}\\exp(-\\frac{1}{2}(\\beta t^2+\\mu^T S^{-1}\\mu))\\cdot\\\\&\\int\\exp(-\\frac{1}{2}\\Big\\{\\mathbf{w}^T\\Lambda^{-1}\\mathbf{w}-2m^T\\Lambda^{-1}\\mathbf{w}+m^T\\Lambda^{-1}m-m^T\\Lambda^{-1}m\\Big\\})d\\mathbf{w}\\\\&=\\frac{1}{(2\\pi)^{M/2+1}}\\frac{1}{(\\beta^{-1}|S|)^{1/2}}\\exp(-\\frac{1}{2}(\\beta t^2+\\mu^T S^{-1}\\mu-m^T\\Lambda^{-1}m))\\cdot\\\\&\\int\\exp(-\\frac{1}{2}\\Big\\{(\\mathbf{w}-m)^T\\Lambda^{-1}(\\mathbf{w}-m)\\Big\\})d\\mathbf{w}\\\\&=\\frac{(2\\pi)^{(M+1)/2}}{(2\\pi)^{M/2+1}}\\frac{|\\Lambda|^{1/2}}{(\\beta^{-1}|S|)^{1/2}}\\exp(-\\frac{1}{2}(\\beta t^2+\\mu^T S^{-1}\\mu-m^T\\Lambda^{-1}m))\\\\&=\\frac{1}{(2\\pi)^{1/2}}\\frac{|(\\beta\\phi(x)\\phi(x)^T+S^{-1})^{-1}|^{1/2}}{(\\beta^{-1}|S|)^{1/2}}\\exp(-\\frac{1}{2}(\\beta t^2+\\mu^T S^{-1}\\mu-m^T\\Lambda^{-1}m))\\\\&=\\frac{1}{(2\\pi)^{1/2}}\\frac{1}{(\\beta^{-1}|S||\\beta\\phi(x)\\phi(x)^T+S^{-1}|)^{1/2}}\\exp(-\\frac{1}{2}(\\beta t^2+\\mu^T S^{-1}\\mu-m^T\\Lambda^{-1}m))\\end{aligned}$$\n", 35 | "到这里不知道怎么推下去了,于是去网上闲逛找解决办法,终于找到了一篇论文《Modeling Inverse Covariance Matrices by Basis Expansion》这篇论文里介绍了一个引理\n", 36 | "**引理 (对称矩阵的秩1扰动)** 设$\\alpha\\in\\mathbb{R}$,$\\mathbf{a}\\in\\mathbb{R}^d$,$P\\in\\mathbb{R}^{d\\times d}$ 为可逆矩阵。如果$\\alpha\\neq\\mathbf{a} -(\\mathbf{a}^TP\\mathbf{a})^{-1}$那么秩1扰动矩阵$P+\\alpha \\mathbf{a} \\mathbf{a}^T$可逆,且\n", 37 | "$$ (P+\\alpha \\mathbf{a} \\mathbf{a}^T)^{-1}=P^{-1}-\\frac{\\alpha P^{-1}\\mathbf{a}\\mathbf{a}^T P^{-1}}{1+\\alpha \\mathbf{a}^TP^{-1}\\mathbf{a}}$$\n", 38 | "且\n", 39 | "$$ det(P+\\alpha \\mathbf{a} \\mathbf{a}^T)=(1+\\alpha \\mathbf{a}^T P^{-1}\\mathbf{a})det(P)$$\n", 40 | "这条定理说的是如果我们给协方差矩阵一个秩为1的向量外积做扰动,我们可以将扰动后的矩阵的逆和行列式进行展开。具体地,我们考察$|\\beta\\phi(x)\\phi(x)^T+S^{-1}|$,发现\n", 41 | "$$|\\beta\\phi(x)\\phi(x)^T+S^{-1}|=(1+\\beta \\phi(x)^T S\\phi(x))det(S^{-1})=(1+\\beta \\phi(x)^T S\\phi(x))/|S|$$\n", 42 | "于是\n", 43 | "$$ \\frac{1}{(\\beta^{-1}|S||\\beta\\phi(x)\\phi(x)^T+S^{-1}|)^{1/2}}=\\frac{1}{(\\beta^{-1}|S|\\cdot \\frac{(1+\\beta \\phi(x)^T S\\phi(x))}{|S|})^{1/2}}=\\frac{1}{(\\beta^{-1}+\\phi(x)^T S\\phi(x))^{1/2}}$$\n", 44 | "接下来考察指数部分$\\exp(-\\frac{1}{2}(\\beta t^2+\\mu^T S^{-1}\\mu-m^T\\Lambda^{-1}m))$,注意到$\\mu=\\beta S\\Phi^T\\mathbf{t}_N$,于是$\\mu^TS^{-1}\\mu=\\beta^2(\\Phi^T\\mathbf{t}_N)^TS(\\Phi^T\\mathbf{t}_N)$。同时,应用上述引理我们有\n", 45 | "$$\\Lambda=(\\beta\\phi(x)\\phi(x)^T+S^{-1})^{-1}=S-\\frac{\\beta S\\phi(x)\\phi(x)^TS}{1+\\beta \\phi(x)^TS\\phi(x)}=S-\\frac{ S\\phi(x)\\phi(x)^TS}{\\beta^{-1}+\\phi(x)^TS\\phi(x)}$$\n" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "metadata": {}, 51 | "source": [ 52 | "利用以上两个关系,我们进一步进行推导\n", 53 | "$$\\begin{aligned}\\exp(-\\frac{1}{2}(\\beta t^2+\\mu^T S^{-1}\\mu-m^T\\Lambda^{-1}m))&=\\exp(-\\frac{1}{2}\\Big\\{\\beta t^2+\\beta^2(\\Phi^T\\mathbf{t}_N)^TS(\\Phi^T\\mathbf{t}_N)-(\\beta t\\phi(x)^T+\\beta (\\Phi^T\\mathbf{t}_N)^T)\\Lambda\\Lambda^{-1}\\Lambda(\\beta t\\phi(x)+\\beta \\Phi^T\\mathbf{t}_N)\\Big\\})\\\\&=\\exp(-\\frac{1}{2}\\Big\\{\\beta t^2+\\beta^2(\\Phi^T\\mathbf{t}_N)^TS(\\Phi^T\\mathbf{t}_N)-\\big[\\beta^2 t^2\\phi(x)^T\\Lambda\\phi(x)+2\\beta^2 t\\phi(x)^T\\Lambda (\\Phi^T\\mathbf{t}_N)+\\beta^2 (\\Phi^T\\mathbf{t}_N)^T\\Lambda(\\Phi^T\\mathbf{t}_N)\\big]\\Big\\})\\\\&=\n", 54 | 
"\\exp(-\\frac{1}{2}\\Big\\{\\beta t^2+\\beta^2(\\Phi^T\\mathbf{t}_N)^TS(\\Phi^T\\mathbf{t}_N)+\\big[-\\beta^2 t^2\\phi(x)^TS\\phi(x)+\\beta^3 t^2 \\frac{(\\phi(x)^TS\\phi(x))^2}{1+\\beta\\phi(x)^TS\\phi(x)}\\\\\n", 55 | "&-2\\beta^2 t\\phi(x)^TS(\\Phi^T\\mathbf{t}_N)+2\\beta^3 t\\frac{\\phi(x)^T\\phi(x)\\phi(x)^TS(\\Phi^T\\mathbf{t}_N)}{1+\\beta\\phi(x)^TS\\phi(x)}\\\\&-\\beta^2(\\Phi^T\\mathbf{t}_N)^TS(\\Phi^T\\mathbf{t}_N)+\\beta^3 \\frac{(\\Phi^T\\mathbf{t}_N)^TS\\phi(x)\\phi(x)^TS(\\Phi^T\\mathbf{t}_N)}{1+\\beta\\phi(x)^TS\\phi(x)}\\big]\\Big\\})\\\\&=\\exp(-\\frac{1}{2}\\Big\\{\\big(\\beta-\\beta^2\\phi(x)^TS\\phi(x)+\\beta^3 \\frac{(\\phi(x)^TS\\phi(x))^2}{1+\\beta\\phi(x)^TS\\phi(x)}\\big)t^2\\\\\n", 56 | "&+2\\big(\\beta^3 \\frac{\\phi(x)^T\\phi(x)\\phi(x)^TS(\\Phi^T\\mathbf{t}_N)}{1+\\beta\\phi(x)^TS\\phi(x)}-\\beta^2\\phi(x)^TS(\\Phi^T\\mathbf{t}_N)\\big)t+\\beta^3 \\frac{(\\Phi^T\\mathbf{t}_N)^TS\\phi(x)\\phi(x)^TS(\\Phi^T\\mathbf{t}_N)}{1+\\beta\\phi(x)^TS\\phi(x)}\\Big\\})\\end{aligned}$$\n", 57 | "我们考察每一个系数,首先是$t^2$的系数\n", 58 | "$$\\begin{aligned} \\beta-\\beta^2\\phi(x)^TS\\phi(x)+\\beta^3 \\frac{(\\phi(x)^TS\\phi(x))^2}{1+\\beta\\phi(x)^TS\\phi(x)}&=\\beta+\\frac{-\\beta^2\\phi(x)^TS\\phi(x)[1+\\beta\\phi(x)^TS\\phi(x)]+\\beta^3 (\\phi(x)^TS\\phi(x))^2}{1+\\beta\\phi(x)^TS\\phi(x)}\\\\&=\\beta-\\frac{\\beta^2\\phi(x)^TS\\phi(x)}{1+\\beta \\phi(x)^TS\\phi(x)}\\\\&=\\frac{\\beta(1+\\beta \\phi(x)^TS\\phi(x))-\\beta^2\\phi(x)^TS\\phi(x)}{1+\\beta\\phi(x)^TS\\phi(x)}\\\\&=\\frac{\\beta}{1+\\beta\\phi(x)^TS\\phi(x)}=\\frac{1}{\\beta^{-1}+\\phi(x)^TS\\phi(x)}\\end{aligned}$$\n", 59 | "接着是$t$的系数\n", 60 | "$$\\begin{aligned}\\beta^3 \\frac{\\phi(x)^T\\phi(x)\\phi(x)^TS(\\Phi^T\\mathbf{t}_N)}{1+\\beta\\phi(x)^TS\\phi(x)}-\\beta^2\\phi(x)^TS(\\Phi^T\\mathbf{t}_N)&=\\frac{\\beta^3\\phi(x)^T\\phi(x)\\phi(x)^TS(\\Phi^T\\mathbf{t}_N)-\\beta^2\\phi(x)^TS(\\Phi^T\\mathbf{t}_N)(1+\\beta\\phi(x)^TS\\phi(x))}{1+\\beta\\phi(x)^TS\\phi(x)}\\\\&=\\frac{\\beta^3\\phi(x)^T\\phi(x)\\phi(x)^TS(\\Phi^T\\mathbf{t}_N)-\\beta^2\\phi(x)^TS(\\Phi^T\\mathbf{t}_N)-\\beta^3\\phi(x)^TS(\\Phi^T\\mathbf{t}_N)\\phi(x)^TS\\phi(x)}{1+\\beta\\phi(x)^TS\\phi(x)}\\\\&=\\frac{-\\beta\\phi(x)^TS(\\Phi^T\\mathbf{t}_N)}{\\beta^{-1}+\\phi(x)^TS\\phi(x)}\\end{aligned}$$\n", 61 | "最后我们考察常数项\n", 62 | "$$\\beta^3 \\frac{(\\Phi^T\\mathbf{t}_N)^TS\\phi(x)\\phi(x)^TS(\\Phi^T\\mathbf{t}_N)}{1+\\beta\\phi(x)^TS\\phi(x)}=\\frac{\\beta^2(\\phi(x)^TS(\\Phi^T\\mathbf{t}_N))^2}{\\beta^{-1}+\\phi(x)^TS\\phi(x)}$$\n", 63 | "综合以上,我们有\n", 64 | "$$\\begin{aligned}\\exp(-\\frac{1}{2}(\\beta t^2+\\mu^T S^{-1}\\mu-m^T\\Lambda^{-1}m))&=\\exp(-\\frac{1}{2(\\beta^{-1}+\\phi(x)^TS\\phi(x))}\\Big\\{t^2-2\\beta\\phi(x)^TS(\\Phi^T\\mathbf{t}_N)t+\\beta^2(\\phi(x)^TS(\\Phi^T\\mathbf{t}_N))^2\\Big\\})\\\\&=\\exp(-\\frac{1}{2(\\beta^{-1}+\\phi(x)^TS\\phi(x))}(t-\\beta\\phi(x)^TS(\\Phi^T\\mathbf{t}_N))^2)\\end{aligned}$$\n", 65 | "综合以上,我们可以得到\n", 66 | "$$p(t|x,\\mathbf{x},\\mathbf{t})=\\frac{1}{\\sqrt{2\\pi\\cdot(\\beta^{-1}+\\phi(x)^TS\\phi(x))}}\\exp(-\\frac{1}{2(\\beta^{-1}+\\phi(x)^TS\\phi(x))}(t-\\beta\\phi(x)^TS(\\Phi^T\\mathbf{t}_N))^2)$$\n", 67 | "令\n", 68 | "$$ m(x)=\\beta\\phi(x)^TS(\\Phi^T\\mathbf{t}_N)=\\mu^T\\phi(x)\\\\\n", 69 | "s^2(x)=\\beta^{-1}+\\phi(x)^TS\\phi(x)$$\n", 70 | "以上两式对应PRML中的式1.70~1.71。式1.71中,第一项表示数据中的噪音(方差越小,数据越集中,不确定性越小);第二项表示关于参数$\\mathbf{w}$的不确定性,当$N\\to \\infty$时,第二项趋于0,这是由于当数据量趋于无限大时,关于参数的不确定性逐渐消失,先验的影响逐渐减弱。理论上的证明如下,首先我们考察$S_{N+1}$:\n", 71 | "$$S_{N+1}=(\\alpha I+\\beta \\sum_{i=1}^N \\phi(x_i)\\phi(x_i)^T+\\beta 
\\phi(x_{N+1})\\phi(x_{N+1})^T)=(S_N^{-1}+\\beta \\phi(x_{N+1})\\phi(x_{N+1})^T)\\\\=S_N-\\beta\\frac{S_N\\phi(x_{N+1})\\phi(x_{N+1})^TS_N}{1+\\beta \\phi(x_{N+1})^TS_N\\phi(x_{N+1})}\\\\=S_N-\\frac{\\beta}{1+\\beta \\phi(x_{N+1})^TS_N\\phi(x_{N+1})} (S_N\\phi(x_{N+1}))(S_N\\phi(x_{N+1}))^T$$\n", 72 | "于是\n", 73 | "$$\\sigma_{N+1}^2(x)=\\beta^{-1}+\\phi(x)^TS_{N+1}\\phi(x)=\\sigma_N^2(x)-\\frac{\\beta}{1+\\beta \\phi(x_{N+1})^TS_N\\phi(x_{N+1})}[ \\phi(x)^T(S_N\\phi(x_{N+1}))]^2\\leq \\sigma_N^2(x)$$\n", 74 | "因此序列$\\sigma_N^2(x)$是单调递减序列,又由于有下界(0),因此当$N\\to\\infty$时,$\\sigma_N^2(x)\\to 0$\n", 75 | "\n", 76 | "于是我们知道\n", 77 | "$$p(t|x,\\mathbf{x},\\mathbf{t})=\\mathcal{N}(t|m(x),s^2(x))$$\n", 78 | "也就是说后验预测分布也是一个高斯,$t$的且均值、方差取决于$x$\n", 79 | "需要注意的是当$x$满足$\\beta=-(\\phi(x)^TS\\phi(x))^{-1}$时,方差\n", 80 | "$$ s^2(x)=\\beta^{-1}+\\phi(x)^TS\\phi(x)=-\\phi(x)^TS\\phi(x)+\\phi(x)^TS\\phi(x)=0$$\n", 81 | "因此在这一点分布未定义" 82 | ] 83 | } 84 | ], 85 | "metadata": { 86 | "kernelspec": { 87 | "display_name": "Python 2", 88 | "language": "python", 89 | "name": "python2" 90 | }, 91 | "language_info": { 92 | "codemirror_mode": { 93 | "name": "ipython", 94 | "version": 2 95 | }, 96 | "file_extension": ".py", 97 | "mimetype": "text/x-python", 98 | "name": "python", 99 | "nbconvert_exporter": "python", 100 | "pygments_lexer": "ipython2", 101 | "version": "2.7.12" 102 | } 103 | }, 104 | "nbformat": 4, 105 | "nbformat_minor": 0 106 | } 107 | -------------------------------------------------------------------------------- /YidaXu-ML/sampling-methods-part2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 采样方法系列2" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "## MCMC\n", 15 | "\n", 16 | "这一节将介绍MCMC(Markov Chain Monte Carlo)采样算法,MCMC是一种用马氏链从随机分布取样的算法,之前步骤的作为底本。步数越多,结果越好。首先我将介绍MH算法,接着介绍Gibbs采样,然后证明Gibbs采样是MH算法的一种特例。" 17 | ] 18 | }, 19 | { 20 | "cell_type": "markdown", 21 | "metadata": {}, 22 | "source": [ 23 | "## 马尔科夫链\n", 24 | "\n", 25 | "本节以一个简单的例子来回顾马氏过程中的一些基本概念。社会学家经常把人按其经济状况分成 3 类:下层 (lower-class)、中层 (middle-class)、上层 (upper-class),我们用 1、2、3 分别代表这三个阶层。社会学家们发现决定一个人的收入阶层的最重要的因素就是其父母的收入阶层。如果一个人的收入属于下层类别,那么他的孩子属于下层收入的概率是 0.65,属于中层收入的概率是 0.28,属于上层收入的概率是 0.07。事实上,从父代到子代,收入阶层的变化的转移概率如下\n", 26 | "\n", 27 | "|-|\t下层\t|中层|上层|\n", 28 | "|:|:|:|:|\n", 29 | "|下层|0.65|0.28|0.07|\n", 30 | "|中层|0.15|0.67|0.18|\n", 31 | "|上层|0.12|0.36|0.52|\n", 32 | "\n", 33 | "它们之间的概率转移图为:\n", 34 | "![](https://uploads.cosx.org/2013/01/markov-transition.png)\n", 35 | "\n", 36 | "现在,我们要计算第N代后,整个社会中各阶层人数的比例:\n", 37 | "\n", 38 | "计算概率转移矩阵\n", 39 | "\n", 40 | "$$P =\\left[ \\begin{matrix} 0.65 & 0.28 & 0.07\\\\ 0.15 & 0.67 & 0.18 \\\\ 0.12 & 0.36 & 0.52 \\end{matrix} \\right]$$\n", 41 | "\n", 42 | "\n", 43 | "假设当前这一代人处在下层、中层、上层的人的比例是概率分布向量$\\pi_0=[\\pi_0(1), \\pi_0(2), \\pi_0(3)]$,那么他们的子女的分布比例将是$\\pi_1=\\pi_0P$,他们的孙子代的分布比例将是$\\pi_2 = \\pi_1P=\\pi_0P^2$,……, 第$n$代子孙的收入分布比例将是 $\\pi_n = \\pi_{n-1}P = \\pi_0P^n$\n", 44 | "\n", 45 | "接下来我们用程序来模拟这个过程,首先定义概率转移矩阵$P$和子孙代数$N$" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 1, 51 | "metadata": { 52 | "collapsed": false 53 | }, 54 | "outputs": [], 55 | "source": [ 56 | "import numpy as np\n", 57 | "from numpy import linalg as LA\n", 58 | "\n", 59 | "def compute_dist(init_dist, P, N):\n", 60 | " '''\n", 61 | " 计算子代分布\n", 62 | " init_dist: 初始分布\n", 63 | " P: 概率转移矩阵\n", 64 | " N: 子孙代数\n", 65 | " '''\n", 66 | 
" assert P.shape[0] == P.shape[1]\n", 67 | " for n in range(N + 1):\n", 68 | " current_dist = np.dot(init_dist, LA.matrix_power(P, n))\n", 69 | " print '第{}代人'.format(n), current_dist \n", 70 | " return current_dist\n", 71 | "\n", 72 | "# 概率转移矩阵\n", 73 | "P = np.array([[.65, .28, .07],[.15,.67,.18],[.12,.36,.52]])\n", 74 | "# 子孙代数\n", 75 | "N = 10\n", 76 | "\n" 77 | ] 78 | }, 79 | { 80 | "cell_type": "markdown", 81 | "metadata": {}, 82 | "source": [ 83 | "第一次实验,假设当代人处在下层、中层、上层的比例是$\\pi_0=[0.51,0.26,0.23]$,那么让我们来看一下$N=10$代以后,他们的子孙在三个阶层的分布分别是多少" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": 2, 89 | "metadata": { 90 | "collapsed": false, 91 | "scrolled": true 92 | }, 93 | "outputs": [ 94 | { 95 | "name": "stdout", 96 | "output_type": "stream", 97 | "text": [ 98 | "第0代人 [ 0.51 0.26 0.23]\n", 99 | "第1代人 [ 0.3981 0.3998 0.2021]\n", 100 | "第2代人 [ 0.342987 0.45209 0.204923]\n", 101 | "第3代人 [ 0.31534581 0.47270894 0.21194525]\n", 102 | "第4代人 [ 0.30131455 0.48131211 0.21737335]\n", 103 | "第5代人 [ 0.29413607 0.48510159 0.22076234]\n", 104 | "第6代人 [ 0.29044517 0.48685061 0.22270423]\n", 105 | "第7代人 [ 0.28854146 0.48768807 0.22377047]\n", 106 | "第8代人 [ 0.28755761 0.48809999 0.2243424 ]\n", 107 | "第9代人 [ 0.28704854 0.48830639 0.22464508]\n", 108 | "第10代人 [ 0.28678492 0.4884111 0.22480399]\n" 109 | ] 110 | }, 111 | { 112 | "data": { 113 | "text/plain": [ 114 | "array([ 0.28678492, 0.4884111 , 0.22480399])" 115 | ] 116 | }, 117 | "execution_count": 2, 118 | "metadata": {}, 119 | "output_type": "execute_result" 120 | } 121 | ], 122 | "source": [ 123 | "init_state1 = np.array([.51, .26, .23])\n", 124 | "compute_dist(init_state1, P, N)" 125 | ] 126 | }, 127 | { 128 | "cell_type": "markdown", 129 | "metadata": {}, 130 | "source": [ 131 | "第二次实验,我们试着将当代人在下层、中层、上层的比例固定为$\\pi_0=[0.32,0.47,0.21]$,看看会发生什么" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 3, 137 | "metadata": { 138 | "collapsed": false, 139 | "scrolled": true 140 | }, 141 | "outputs": [ 142 | { 143 | "name": "stdout", 144 | "output_type": "stream", 145 | "text": [ 146 | "第0代人 [ 0.32 0.47 0.21]\n", 147 | "第1代人 [ 0.3037 0.4801 0.2162]\n", 148 | "第2代人 [ 0.295364 0.484535 0.220101]\n", 149 | "第3代人 [ 0.29107897 0.48657673 0.2223443 ]\n", 150 | "第4代人 [ 0.28886916 0.48755247 0.22357838]\n", 151 | "第5代人 [ 0.28772723 0.48803173 0.22424104]\n", 152 | "第6代人 [ 0.28713638 0.48827166 0.22459196]\n", 153 | "第7代人 [ 0.28683043 0.4883933 0.22477626]\n", 154 | "第8代人 [ 0.28667193 0.48845549 0.22487258]\n", 155 | "第9代人 [ 0.28658979 0.48848745 0.22492277]\n", 156 | "第10代人 [ 0.28654721 0.48850393 0.22494886]\n" 157 | ] 158 | }, 159 | { 160 | "data": { 161 | "text/plain": [ 162 | "array([ 0.28654721, 0.48850393, 0.22494886])" 163 | ] 164 | }, 165 | "execution_count": 3, 166 | "metadata": {}, 167 | "output_type": "execute_result" 168 | } 169 | ], 170 | "source": [ 171 | "init_state2 = np.array([.32, .47, .21])\n", 172 | "compute_dist(init_state2, P, N)" 173 | ] 174 | }, 175 | { 176 | "cell_type": "markdown", 177 | "metadata": {}, 178 | "source": [ 179 | "我们发现虽然两次实验的初始分布不同,但最终经过若干代后,三个阶层的比例都会最终固化。实际上,只要$N$足够大,无论怎么初始化起始分布,最终都会收敛于某个分布,我们把这个分布叫做平稳分布。这个性质不是该马氏链独有的,绝大部分马氏链都有这个性质。" 180 | ] 181 | }, 182 | { 183 | "cell_type": "markdown", 184 | "metadata": {}, 185 | "source": [ 186 | "## 细致平衡条件\n", 187 | "\n", 188 | "\n", 189 | "细致平衡条件是MCMC算法能够工作的前提。 \n", 190 | "\n", 191 | ">细致平衡条件 对于定义在状态空间$\\mathcal{X}$上的概率密度函数$\\pi$,如果对于任意的$i,j\\in \\mathcal{X}$,满足$\\pi_i P_{ij}=\\pi_j P_{ji}$,那么称$\\pi$关于$P$满足细致平衡条件(Detailed Balance)。\n", 192 
| "\n", 193 | "其中,$P$是马氏链的转移概率矩阵,$P_{ij}$表示从状态$i$跳转到状态$j$的概率$P_{ij}=P(x^{(t)}=j|x^{(t-1)}=i)$。 \n", 194 | "从细致平衡条件可以推导出许多有用的结论 \n", 195 | "\n", 196 | "* 结论一 \n", 197 | "$\\pi$是$P$对应马氏链的平稳分布 \n", 198 | "证明: \n", 199 | "\n", 200 | "$$ \\sum_i \\pi_i P_{ij}=\\sum_i \\pi_j P_{ji}=\\pi_j \\sum_i P_{ji}=\\pi_j$$\n", 201 | "\n", 202 | "由此可知,如果一个分布满足细致平衡条件,那么它一定是某个马氏链的平稳分布。我们可以通过构造马氏链的转移概率矩阵$P$,使得该马氏链的平稳分布刚好是要采样的分布。\n", 203 | "\n", 204 | "* 结论二 \n", 205 | "满足细致平衡条件的马氏链称为可逆的马氏链(reversible Markov chain) ,其性质如下: \n", 206 | "如果$\\pi,P$满足细致平衡,且序列$(x_0,...,x_n)$服从以$\\pi,P$为参数的马氏链 $MC(\\pi, P)$,那么其逆序列$(x_n,...,x_0)$也服从该马氏链\n", 207 | "\n" 208 | ] 209 | }, 210 | { 211 | "cell_type": "markdown", 212 | "metadata": {}, 213 | "source": [ 214 | "## Metropolis–Hastings算法\n", 215 | "\n", 216 | "Metropolis–Hastings算法通过构造一个马氏链,并使其平稳分布为被采样分布,通过在该马氏链的平稳分布采样就获得了目标分布的样本。\n", 217 | "\n", 218 | "Metropolis–Hastings算法将细致平衡条件中的转移概率$P(x^*|x)$分解为提议概率$q(x^*|x)$和接受概率$A(x^*|x)$:\n", 219 | "$$P(x^*|x) = q(x^*|x) A(x^*|x)$$\n", 220 | "\n", 221 | "提议概率定义为$x$跳转到$x^*$的概率;接受概率表示接受$x^*$的概率。\n", 222 | "\n", 223 | "于是\n", 224 | "\n", 225 | "$$A(x^*|x)=\\frac{P(x^*|x)}{q(x^*|x)}$$\n", 226 | "\n", 227 | "定义接受率为\n", 228 | "\n", 229 | "$$ \\alpha(x^*|x) = \\frac{A(x^*|x)}{A(x|x^*)}=\\frac{P(x^*|x)q(x|x^*)}{P(x|x^*)q(x^*|x)}=\\frac{\\pi(x^*)q(x|x^*)}{\\pi(x)q(x^*|x)}$$\n", 230 | "\n", 231 | "这里用到了细致平衡条件\n", 232 | "\n", 233 | "$$\\pi(x)P(x^*|x)=\\pi(x^*)P(x|x^*)\\rightarrow \\frac{P(x^*|x)}{P(x|x^*)}=\\frac{\\pi(x^*)}{\\pi(x)}$$\n", 234 | "\n", 235 | "接受率的含义是从$x$转移到$x^*$的接受概率与从$x^*$转移到$x$的接受概率的比值,如果接受率大于1,则表明样本点更有可能出现在$x^*$,选择接受;如果接受率不大于1,则表明有$\\alpha$的可能性接受,$1-\\alpha$的可能性拒绝。最终我们倾向于留在高概率密度的地方,然后仅偶尔跑到低概率状态的地方(这也就是MH算法直观上的运行机理)。 \n" 236 | ] 237 | }, 238 | { 239 | "cell_type": "markdown", 240 | "metadata": {}, 241 | "source": [ 242 | "\n", 243 | "\n", 244 | "Metropolis–Hastings算法的流程如下图所示:\n", 245 | "![](http://7xikew.com1.z0.glb.clouddn.com/Metropolis-Hasting-Algorithm.png)" 246 | ] 247 | }, 248 | { 249 | "cell_type": "markdown", 250 | "metadata": {}, 251 | "source": [ 252 | "其中$\\alpha(x^*|x)=\\min \\big(1, \\frac{\\pi(x^*)q(x|x^*)}{\\pi(x)q(x^*|x)}\\big)$是接受率(accept rate),主题到MH算法给接受率加了一个上界。\n", 253 | "\n", 254 | "简单描述一下算法的工作流程: \n", 255 | "1)首先初始化马氏链的其实状态$x^{(0)}$ \n", 256 | "2)从$(0,1)$之间的均匀分布采样$u$,从转移概率分布$q(x^*|x^{(i)})$采样$x^*$ \n", 257 | "3)如果$u$小于采样率$\\alpha(x^*|x)$,则接受$x^*$;否则拒绝$x^*$ \n", 258 | "\n", 259 | "## MH算法细致平衡条件证明\n", 260 | "MH算法之所以能work,在于它满足了细致平衡条件。\n", 261 | "\n", 262 | "证明: \n", 263 | "\n", 264 | "根据上一节推导的的结论一,我们只要证明$\\pi(x)$满足细致平衡条件。\n", 265 | "\n", 266 | "将马氏链的转移概率分解为\n", 267 | "$$P(x^*|x)=q(x^*|x)A(x^*|x)$$ \n", 268 | "\n", 269 | "$P(x^*|x)$由两个步骤构成:\n", 270 | "\n", 271 | "1)从提议分布$q(x^*|x)$产生$x^*$ \n", 272 | "2)以概率$A(x^*|x)$接受$x^*$\n", 273 | "\n", 274 | "\n", 275 | "接下来证明细致平衡条件\n", 276 | "\n", 277 | "$$\\begin{aligned}\\pi(x)P(x^*|x)&=\\pi(x)q(x^*|x)A(x^*|x)\\\\&=\\pi(x)q(x^*|x)\\min \\big(1, \\frac{\\pi(x^*)q(x|x^*)}{\\pi(x)q(x^*|x)}\\big)\\\\&=\\min(\\pi(x)q(x^*|x),\\pi(x^*)q(x|x^*))\\\\&=\\pi(x^*)q(x|x^*)\\min \\big(1, \\frac{\\pi(x)q(x^*|x)}{\\pi(x^*)q(x|x^*)}\\big)\\\\&=\\pi(x^*)q(x|x^*) A(x|x^*)\\\\&=\\pi(x^*)P(x|x^*)\\end{aligned}$$\n", 278 | "\n", 279 | "\n", 280 | "由此,我们可知我们只要构造一个转移概率分布为$P(x^*|x)$的马氏链,并在该马氏链的平稳分布上进行抽样,就可以获得到目标分布$\\pi(x)$的样本。" 281 | ] 282 | }, 283 | { 284 | "cell_type": "markdown", 285 | "metadata": {}, 286 | "source": [ 287 | "## Gibbs采样算法\n", 288 | "\n", 289 | 
"给定一个起始样本点$(x^{(1)},y^{(1)},z^{(1)})^\\top$,我们想从目标分布$p(x,y,z)$采样得到$\\{(x^{(2)},y^{(2)},z^{(2)})^\\top,(x^{(3)},y^{(3)},z^{(3)})^\\top,...,(x^{(n)},y^{(n)},z^{(n)})^\\top\\}$,Gibbs采样算法的工作流程是这样的: \n", 290 | "采样$(x^{(2)},y^{(2)},z^{(2)})^\\top$的过程:\n", 291 | "$$ x^{(2)}\\sim p(x|y^{(1)},z^{(1)})\\\\ y^{(2)}\\sim p(y|x^{(2)},z^{(1)})\\\\ z^{(2)}\\sim p(z|x^{(2)},y^{(2)})$$\n", 292 | "采样$(x^{(3)},y^{(3)},z^{(3)})^\\top$的过程: \n", 293 | "$$ x^{(3)}\\sim p(x|y^{(2)},z^{(2)})\\\\ y^{(3)}\\sim p(y|x^{(3)},z^{(2)})\\\\ z^{(3)}\\sim p(z|x^{(3)},y^{(3)})$$\n", 294 | "\n", 295 | "这个过程理解起来很简单,循环遍历$x,y,z$三个坐标轴,每次固定其他两个变量(比如$y,z$),然后对剩余的一个变量(比如$x$)进行采样。 \n", 296 | "\n", 297 | "考虑更一般的情况,假设要采样的随机向量有$K$维,定义$x_{-i}=\\{x_1,...,x_{i-1},x_{i+1},...,x_K\\}$,表示去掉$x_i$后其他维度上的随机变量构成的集合。\n", 298 | "\n", 299 | "更一般情况下,Gibbs采样算法定义如下: \n", 300 | "> n维Gibbs采样算法 \n", 301 | "1.initilize $(x^{(0)}_1,...,x^{(0)}_K)$ \n", 302 | "2.for t in 0,1,2,... \n", 303 | "  $x_1^{(t+1)}\\sim p(x_1|x_2^{(t)},...,x_K^{(t)})$ \n", 304 | "  $x_2^{(t+1)}\\sim p(x_2|x_1^{(t+1)},x_3^{(t)},...,x_K^{(t)})$ \n", 305 | "  ... \n", 306 | "  $x_i^{(t+1)}\\sim p(x_i|x_1^{(t+1)},...,x_{i-1}^{(t+1)},x_{i+1}^{(t)},...,x_K^{(t)})$ \n", 307 | "  $x_K^{(t+1)}\\sim p(x_K|x_1^{(t+1)},...,x_{K-1}^{(t+1)})$\n", 308 | "\n", 309 | "其中\n", 310 | "$$p(x_i|x_{-i})=p(x_i|x_1,...,x_{i-1},x_{i+1},...,x_K)=\\frac{p(x_1,...,x_K)}{p(x_1,...,x_{i-1},x_{i+1},...,x_K)}$$\n", 311 | "是Gibbs采样的提议分布,也称为完全条件概率" 312 | ] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "metadata": {}, 317 | "source": [ 318 | "### Gibbs采样细致平衡条件证明 \n" 319 | ] 320 | }, 321 | { 322 | "cell_type": "markdown", 323 | "metadata": {}, 324 | "source": [ 325 | "这个问题可以转化为证明Gibbs采样是MH算法的一种特殊情况。假设我们要对$x_i$进行采样,考察接受率\n", 326 | "$$\\alpha(x^*|x)=\\frac{\\pi(x^*)q(x|x^*)}{\\pi(x)q(x^*|x)}=\\frac{p(x^*)p(x_i|x_{-i}^*)}{p(x)p(x_i^*|x_{-i})}=\\frac{p(x_i^*|x^*_{-i})p(x^*_{-i})p(x_i|x_{-i}^*)}{p(x_i|x^{-i})p(x_{-i})p(x_i^*|x_{-i})}$$\n", 327 | "\n", 328 | "注意到$x^*_{-i}=x_{-i}$,于是\n", 329 | "\n", 330 | "$$\\alpha(x^*|x)=\\frac{p(x^*|x^*_{-i})p(x^*_{-i})p(x_i|x_{-i}^*)}{p(x|x^{-i})p(x_i)p(x_i^*|x_{-i})}=\\frac{p(x_i^*|x_{-i})p(x_{-i})p(x_i|x_{-i})}{p(x_i|x_{-i})p(x_{-i})p(x_i^*|x_{-i})}=1$$\n", 331 | "\n", 332 | "由此我们知道,Gibbs采样是MH算法在接受率等于1时的特殊情形,也就是说Gibbs采样的过程只要一直接受提议分布产生的样本点即可。" 333 | ] 334 | }, 335 | { 336 | "cell_type": "markdown", 337 | "metadata": {}, 338 | "source": [ 339 | "### Gibbs采样的并行化\n", 340 | "\n", 341 | "未完待续" 342 | ] 343 | }, 344 | { 345 | "cell_type": "markdown", 346 | "metadata": {}, 347 | "source": [ 348 | "参考: \n", 349 | "1.[LDA-math-MCMC 和 Gibbs Sampling by 靳志辉](https://cosx.org/2013/01/lda-math-mcmc-and-gibbs-sampling) \n", 350 | "2.[LDA算法理解](https://www.zybuluo.com/Dounm/note/435982)" 351 | ] 352 | } 353 | ], 354 | "metadata": { 355 | "kernelspec": { 356 | "display_name": "Python 2", 357 | "language": "python", 358 | "name": "python2" 359 | }, 360 | "language_info": { 361 | "codemirror_mode": { 362 | "name": "ipython", 363 | "version": 2 364 | }, 365 | "file_extension": ".py", 366 | "mimetype": "text/x-python", 367 | "name": "python", 368 | "nbconvert_exporter": "python", 369 | "pygments_lexer": "ipython2", 370 | "version": "2.7.9" 371 | } 372 | }, 373 | "nbformat": 4, 374 | "nbformat_minor": 2 375 | } 376 | -------------------------------------------------------------------------------- /Deep-Learning/mxnet-notes/operators-in-mxnet.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 
6 | "source": [ 7 | "# MXNet中的运算符\n", 8 | "-------------------------------------------------- -------------------------------------------------- ------------\n", 9 | "\n", 10 | "在MXNet中,运算符是一个包含实际计算逻辑和辅助信息的类,可以帮助系统执行优化,如就地更新和自动求导。在继续使用本文档之前,我们强烈建议您了解`mshadow`库,因为所有运算符都是在运行时由系统提供的张量结构`mshadow :: TBlob`计算的。\n", 11 | "\n", 12 | "MXNet的运算符接口允许您:\n", 13 | "\n", 14 | " - 通过指定就地更新来节省内存分配成本。\n", 15 | " - 从Python中隐藏一些内部参数,使其更清洁。\n", 16 | " - 定义输入张量和输出张量之间的关系,这允许系统为您执行形状检查。\n", 17 | " - 从系统获取额外的临时空间以执行计算(例如,调用`cudnn`例程)。\n", 18 | " \n" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "metadata": {}, 24 | "source": [ 25 | "### 运算符接口\n", 26 | "\n", 27 | "`Forward`是运算符接口的核心:\n", 28 | "```c\n", 29 | "    virtual void Forward(const OpContext&ctx,\n", 30 | "                         const std :: vector &in_data,\n", 31 | "                         const std :: vector &req,\n", 32 | "                         const std :: vector &out_data,\n", 33 | "                         const std :: vector &aux_states)= 0;\n", 34 | "```\n", 35 | "“OpContext”结构是:\n", 36 | "```c\n", 37 | "           struct OpContext {\n", 38 | "             int is_train\n", 39 | "             RunContext run_ctx;\n", 40 | "             std :: vector requested;\n", 41 | "           }}\n", 42 | "```\n", 43 | "它描述了运算符是处于训练还是测试阶段(在`is_train`中指定),运算符应该在哪个设备上运行(在`run_ctx`中),以及请求的资源(在以下部分中介绍)。\n", 44 | "\n", 45 | "- `in_data`和`out_data`分别表示输入和输出张量。所有的张量空间都由系统分配。\n", 46 | "- `req`表示如何将计算结果写入`out_data`。换句话说,`req.size()== out_data.size()`和`req [i]`对应于`out_data [i]`的写类型。\n", 47 | "\n", 48 | "- 运算符请求类型OpReqType定义为:\n", 49 | "```c\n", 50 | "           enum OpReqType {\n", 51 | "             kNullOp,\n", 52 | "             kWriteTo,\n", 53 | "             kWriteInplace,\n", 54 | "             kAddTo\n", 55 | "           };\n", 56 | "```\n", 57 | "通常,所有`out_data`的类型应该是`kWriteTo`,意味着提供的`out_data`张量是一个*原始*内存块,因此运算符应该将结果直接写入它。在某些情况下,例如当计算梯度张量时,如果我们可以累积结果,而不是直接重写张量内容,这样就很好,这样每次不需要创建额外的空间。在这种情况下,相应的`req`类型被设置为`kAddTo`,表示应该调用`+ =`。当我们需要进行同址运算(in-place computation)时,`out_data`的类型应该是`kWriteInplace`,这时输出张量`out_data`将与输入张量`in_data`共享同一片内存,将运算后的结果就地写入输入张量,而不是新开辟一块内存,以节省内存开销。\n", 58 | "\n", 59 | "- `aux_states`被有意设计用于帮助计算的辅助张量。目前,它是无用的。\n", 60 | "\n", 61 | "除了`Forward`运算符,你可以选择实现`Backward`接口:\n", 62 | "```c\n", 63 | "    virtual void Backward(const OpContext&ctx,\n", 64 | "                          const std :: vector &out_grad,\n", 65 | "                          const std :: vector &in_data,\n", 66 | "                          const std :: vector &out_data,\n", 67 | "                          const std :: vector &req,\n", 68 | "                          const std :: vector &in_grad,\n", 69 | "                          const std :: vector &aux_states);\n", 70 | "```\n", 71 | "该接口遵循与“Forward”接口相同的设计原则,除了给出了“out_grad”,“in_data”和“out_data”,并且运算符计算“in_grad”作为结果。命名策略类似于Torch的约定,可以总结如下图(mxnet官方没有给出图片):\n", 72 | "\n", 73 | "[输入/输出语义图]\n", 74 | "\n", 75 | "一些运算符可能不需要所有以下内容:`out_grad`,`in_data`和`out_data`。这些可以在`OperatorProperty`中的`DeclareBackwardDependency`接口中指定。" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "metadata": {}, 81 | "source": [ 82 | "### 运算符属性\n", 83 | "一个卷积可能有几个实现,你可能想在它们之间切换以实现最佳性能。因此,我们将operator 的* 语义 *接口从实现接口(`Operator`类)剥离为`OperatorProperty`类。 `OperatorProperty`接口包括:\n", 84 | "\n", 85 | "- InferShape:\n", 86 | "```c\n", 87 | "           virtual bool InferShape(std :: vector * in_shape,\n", 88 | "                                   std :: vector * out_shape,\n", 89 | "                                   std :: vector * aux_shape)const = 
0;\n", 90 | "```\n", 91 | "这个接口有两个目的:1.告诉系统每个输入和输出张量的大小,所以它可以在`Forward`和`Backward`调用之前为它们分配空间; 2.执行大小检查以确保在运行之前没有明显的错误。 `in_shape`中的形状将由系统设置(根据前面的运算符的`out_shape`)。当没有足够的信息来推断形状时,它返回`false`,当形状不一致时抛出错误。\n", 92 | "\n", 93 | "- Request Resources:像`cudnnConvolutionForward`这样的运算符需要一个用于计算的工作空间。如果系统可以管理它,它可以执行优化,如重用该空间,等等。 MXNet定义了两个接口来实现:\n", 94 | "```c\n", 95 | "           virtual std :: vector ForwardResource(\n", 96 | "               const std :: vector &in_shape)const;\n", 97 | "           virtual std :: vector BackwardResource(\n", 98 | "               const std :: vector &in_shape)const;\n", 99 | "```\n", 100 | "ResourceRequest结构(在resource.h中)目前只包含一个类型标志:\n", 101 | "```c\n", 102 | "           struct ResourceRequest {\n", 103 | "             enum Type{\n", 104 | "               kRandom,// 获得一个mshadow :: Random 对象\n", 105 | "               kTempSpace,//请求临时空间\n", 106 | "             };\n", 107 | "             Type type;\n", 108 | "           };\n", 109 | "```\n", 110 | "如果`ForwardResource`和`BackwardResource`返回非空数组,系统通过`Operator`的`Forward`和`Backward`接口中的`ctx`参数提供相应的资源。基本上,要访问这些资源,只需写:\n", 111 | "```c\n", 112 | "           auto tmp_space_res = ctx.requested [kTempSpace] .get_space(some_shape,some_stream);\n", 113 | "           auto rand_res = ctx.requested [kRandom] .get_random(some_stream);\n", 114 | "```\n", 115 | "有关示例,请参见`src / operator / cudnn_convolution-inl.h`。\n", 116 | "\n", 117 | "- Backward dependency:让我们看看两个不同的操作符签名(我们将所有的参数命名为演示目的):\n", 118 | "\n", 119 | "```c\n", 120 | "           void FullyConnectedForward(TBlob weight,TBlob in_data,TBlob out_data);\n", 121 | "           void FullyConnectedBackward(TBlob weight,TBlob in_data,TBlob out_grad,TBlob in_grad);\n", 122 | "           void PoolingForward(TBlob in_data,TBlob out_data);\n", 123 | "           void PoolingBackward(TBlob in_data,TBlob out_data,TBlob out_grad,TBlob in_grad);\n", 124 | "```\n", 125 | "\n", 126 | "注意,`FullyConnectedForward`中的`out_data`不被`FullyConnectedBackward`使用,`PoolingBackward`需要`PoolingForward`的所有参数。因此,对于`FullyConnectedForward`,`out_data`张量一旦消耗可以安全地释放,因为向后的函数不需要它。这提供了系统尽可能快地对一些张量进行垃圾回收的机会。为了指定这种情况,我们提供了一个接口:\n", 127 | "```c\n", 128 | "          virtual std :: vector DeclareBackwardDependency(\n", 129 | "               const std :: vector &out_grad,\n", 130 | "               const std :: vector &in_data,\n", 131 | "               const std :: vector &out_data)const;\n", 132 | "```\n", 133 | "参数向量的`int`元素是用于区分不同数组的ID。让我们看看这个接口如何为`FullyConnected`和`Pooling`指定不同的依赖:\n", 134 | "```c\n", 135 | "          std :: vector FullyConnectedProperty :: DeclareBackwardDependency(\n", 136 | "              const std :: vector &out_grad,\n", 137 | "              const std :: vector &in_data,\n", 138 | "              const std :: vector &out_data)const {\n", 139 | "            return {out_grad [0],in_data [0]}; //注:不包括out_data [0]\n", 140 | "          }}\n", 141 | "          std :: vector PoolingProperty :: DeclareBackwardDependency(\n", 142 | "              const std :: vector &out_grad,\n", 143 | "              const std :: vector &in_data,\n", 144 | "              const std :: vector &out_data)const {\n", 145 | "            return {out_grad [0],in_data [0],out_data [0]};\n", 146 | "          }}\n", 147 | "```\n", 148 | "- Inplace选项:要进一步节省内存分配的成本,可以使用就地更新。当输入张量和输出张量具有相同的形状时,它们适合于逐元素操作。您可以使用以下接口指定和就地更新:\n", 149 | "```c\n", 150 | "           virtual std :: vector > ElewiseOpProperty :: ForwardInplaceOption\n", 151 | "               const std :: vector &in_data,\n", 152 | "               const std :: vector &out_data)const {\n", 
153 | "             return {{in_data [0],out_data [0]}};\n", 154 | "           }}\n", 155 | " virtual std::vector> ElewiseOpProperty::BackwardInplaceOption(\n", 156 | " const std::vector &out_grad,\n", 157 | " const std::vector &in_data,\n", 158 | " const std::vector &out_data,\n", 159 | " const std::vector &in_grad) const {\n", 160 | " return { {out_grad[0], in_grad[0]} }\n", 161 | " }\n", 162 | "``` " 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "这告诉系统在`Forward`期间`in_data [0]`和`out_data [0]`张量可以共享相同的存储空间,`out_grad [0]`和`in_grad [0]`在`Backward`期间共享内存。\n", 170 | "\n", 171 | ">重要:即使您使用上述规范,也不能保证输入和输出张量共享相同的空间。事实上,这只是对系统作最后的决定的一个建议。但是,在任何一种情况下,这个决定对你是完全透明的,所以实际的“Forward”和“Backward”实现不需要考虑。\n", 172 | "\n", 173 | "- 暴露运算符到Python:由于C ++的限制,您需要用户实现以下接口:\n", 174 | "```c\n", 175 | " //从键值字符串列表中初始化属性类\n", 176 | " virtual void Init(const vector >&kwargs)= 0;\n", 177 | " //返回键值字符串映射中的参数\n", 178 | " virtual map GetParams()const = 0;\n", 179 | " //返回参数的名称(用于在python中生成签名)\n", 180 | " virtual vector ListArguments()const;\n", 181 | " //返回输出值的名称\n", 182 | " virtual vector ListOutputs()const;\n", 183 | " //返回辅助状态的名称\n", 184 | " virtual vector ListAuxiliaryStates()const;\n", 185 | " //返回输出值的个数\n", 186 | " virtual int NumOutputs()const;\n", 187 | " //返回可见输出的数量\n", 188 | " virtual int NumVisibleOutputs()const;\n", 189 | "``` " 190 | ] 191 | }, 192 | { 193 | "cell_type": "markdown", 194 | "metadata": {}, 195 | "source": [ 196 | "### 从运算符属性创建运算符\n", 197 | "\n", 198 | "`OperatorProperty`包括操作的所有*语义*属性。它还负责创建用于实际计算的“运算符”指针。\n", 199 | "\n", 200 | "#### 创建运算符\n", 201 | "\n", 202 | "在OperatorProperty中实现以下接口:\n", 203 | "```c\n", 204 | "    virtual Operator * CreateOperator(Context ctx)const = 0;\n", 205 | "```\n", 206 | "例如:\n", 207 | "```c\n", 208 | "    class ConvolutionOp {\n", 209 | "     public:\n", 210 | "      void Forward(...){...}\n", 211 | "      void Backward(...){...}\n", 212 | "    };\n", 213 | "    class ConvolutionOpProperty:public OperatorProperty {\n", 214 | "     public:\n", 215 | "      Operator* CreateOperator(Context ctx)const {\n", 216 | "        return new ConvolutionOp;\n", 217 | "      }}\n", 218 | "    };\n", 219 | "```\n", 220 | "#### 运算符参数\n", 221 | "\n", 222 | "当实现卷积运算符时,您需要知道卷积核大小,步幅大小,填充大小等。在调用任何`Forward`或`Backward`接口之前,这些参数应该传递给操作符。为此,您可以定义一个`ConvolutionParam`结构,如下所示:\n", 223 | "```c\n", 224 | "    #include \n", 225 | "    struct ConvolutionParam:public dmlc :: Parameter {\n", 226 | " TShape kernel, stride, pad;\n", 227 | " uint32_t num_filter,num_group,workspace;\n", 228 | "      bool no_bias;\n", 229 | "    };\n", 230 | "```\n", 231 | "把它放在ConvolutionOpProperty中,并在构造过程中传递给操作符类:\n", 232 | "```c\n", 233 | "    class ConvolutionOp {\n", 234 | "     public:\n", 235 | "      ConvolutionOp(ConvolutionParam p):param_(p){}\n", 236 | "      void Forward(...){...}\n", 237 | "      void Backward(...){...}\n", 238 | "     private:\n", 239 | "      ConvolutionParam param_;\n", 240 | "    };\n", 241 | "    class ConvolutionOpProperty:public OperatorProperty {\n", 242 | "     public:\n", 243 | "      void Init(const vector &kwargs){\n", 244 | "        // initialize param_ using kwargs\n", 245 | "      }}\n", 246 | "      Operator* CreateOperator(Context ctx)const {\n", 247 | "        return new ConvolutionOp(param_);\n", 248 | "      }}\n", 249 | "     private:\n", 250 | "      ConvolutionParam param_;\n", 251 | "    };\n", 252 | "```\n", 253 | "#### 将运算符属性类和参数类注册到MXNet\n", 254 | "\n", 255 | "使用以下宏来将参数结构和运算符属性类注册到MXNet:\n", 256 | "```c\n", 257 
| "    DMLC_REGISTER_PARAMETER(ConvolutionParam);\n", 258 | "    MXNET_REGISTER_OP_PROPERTY(Convolution,ConvolutionOpProperty);\n", 259 | "```\n", 260 | "第一个参数是名称字符串,第二个是属性类名。\n", 261 | "\n", 262 | "### 接口总结\n", 263 | "\n", 264 | "我们几乎覆盖了定义新运算符所需的整个接口。让我们回顾一下:\n", 265 | "\n", 266 | "- 使用`Operator`接口来写你的计算逻辑(`Forward`和`Backward`)。\n", 267 | "- 使用`OperatorProperty`接口:\n", 268 | "     - 将参数传递给操作符类(可以使用`Init`接口)。\n", 269 | "     - 使用`CreateOperator`接口创建一个操作符。\n", 270 | "     - 正确实现操作符描述接口,例如参数名等。\n", 271 | "     - 正确实现“InferShape”接口以设置输出张量形状。\n", 272 | "     - [可选]如果需要额外的资源,请选中“ForwardResource”和“BackwardResource”。\n", 273 | "     - [可选]如果`Backward`不需要`Forward`的所有输入和输出,请检查`DeclareBackwardDependency`。\n", 274 | "     - [可选]如果支持就地更新,请检查“ForwardInplaceOption”和“BackwardInplaceOption”。\n", 275 | "- 注册`OperatorProperty`类和参数类。" 276 | ] 277 | } 278 | ], 279 | "metadata": { 280 | "anaconda-cloud": {}, 281 | "kernelspec": { 282 | "display_name": "Python [conda root]", 283 | "language": "python", 284 | "name": "conda-root-py" 285 | }, 286 | "language_info": { 287 | "codemirror_mode": { 288 | "name": "ipython", 289 | "version": 2 290 | }, 291 | "file_extension": ".py", 292 | "mimetype": "text/x-python", 293 | "name": "python", 294 | "nbconvert_exporter": "python", 295 | "pygments_lexer": "ipython2", 296 | "version": "2.7.12" 297 | } 298 | }, 299 | "nbformat": 4, 300 | "nbformat_minor": 1 301 | } 302 | --------------------------------------------------------------------------------