├── MCMC_ACF.png ├── MCMC_auto.png ├── Cartoon_bvf.png ├── Image_cartoon.png ├── MCMC_burn_prob.png ├── MCMC_burn_sol.png ├── Image_cartoon_god.png ├── README.md ├── 0_Overview.ipynb ├── 8_Large_BVAR.ipynb ├── 1_Introduction.ipynb └── 9_state_space.ipynb /MCMC_ACF.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jamie-L-Cross/Intro-Bayesian-Econometrics/HEAD/MCMC_ACF.png -------------------------------------------------------------------------------- /MCMC_auto.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jamie-L-Cross/Intro-Bayesian-Econometrics/HEAD/MCMC_auto.png -------------------------------------------------------------------------------- /Cartoon_bvf.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jamie-L-Cross/Intro-Bayesian-Econometrics/HEAD/Cartoon_bvf.png -------------------------------------------------------------------------------- /Image_cartoon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jamie-L-Cross/Intro-Bayesian-Econometrics/HEAD/Image_cartoon.png -------------------------------------------------------------------------------- /MCMC_burn_prob.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jamie-L-Cross/Intro-Bayesian-Econometrics/HEAD/MCMC_burn_prob.png -------------------------------------------------------------------------------- /MCMC_burn_sol.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jamie-L-Cross/Intro-Bayesian-Econometrics/HEAD/MCMC_burn_sol.png -------------------------------------------------------------------------------- /Image_cartoon_god.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jamie-L-Cross/Intro-Bayesian-Econometrics/HEAD/Image_cartoon_god.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Bayes 2 | A set of Jupyter notebooks and programs associated with a course titled "Introduction to Bayesian Econometrics". The target audience is postgraduate students and researchers in economics, finance and related disciplines. 3 | -------------------------------------------------------------------------------- /0_Overview.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": { 6 | "collapsed": false 7 | }, 8 | "source": [ 9 | "# [Introduction to Bayesian Econometrics (DRE 7030)](https://programmeinfo.bi.no/nb/course/DRE-7030/2020-autumn)\n", 10 | "\n", 11 | "### [Jamie Cross](https://sites.google.com/view/jamiecross/home) - jamie.cross(at)bi.no\n", 12 | "\n", 13 | "## Learning outcomes\n", 14 | "\n", 15 | "This is an introductory course in Bayesian econometrics. The intended audience is graduate students and researchers in economics, finance and related fields. My objective is to get you to a position where you have the ability to:\n", 16 | "\n", 17 | "1. Contrast classical and Bayesian thinking in econometrics\n", 18 | "2. Estimate commonly used econometric models using Bayesian methods\n", 19 | "3. 
Create original pieces of research using Bayesian methods\n", 20 | "\n", 21 | "Most of the models that we consider will be useful for modeling time-series data, however the estimation methods can also be applied to either cross-sectional or panel data. Topics include:\n", 22 | "\n", 23 | "1. Overview of Bayesian thinking: how it differs from classical/frequentist thinking\n", 24 | "2. Posterior simulation via Monte Carlo Integration and Markov chain Monte Carlo (MCMC) methods: Gibbs Sampling and Metropolis-Hastings algorithms\n", 25 | "3. Estimation and application of some commonly used models: linear regression with various error structures, vector autoregression and state-space models\n", 26 | "\n", 27 | "### Prerequisite knowledge\n", 28 | "\n", 29 | "I assume that you've already taken courses in introductory probability and statistics as well as (classical) econometrics and time series analysis.\n", 30 | "\n", 31 | "For anyone looking to learn the basics, I highly recommend:\n", 32 | "\n", 33 | "1. [Statistics 110 at Harvard](https://projects.iq.harvard.edu/stat110/home)\n", 34 | "2. [Introduction to Econometrics](https://www.amazon.com/Introduction-Econometrics-Pearson-Economics-James-ebook/dp/B00XIGZW9W) by James Stock and Mark Watson\n", 35 | "\n", 36 | "For graduate students and researchers, I recommend:\n", 37 | "\n", 38 | "1. [Introduction to Probability and Statistics at MIT](https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/index.htm)\n", 39 | "2. [Econometric Analysis](https://www.amazon.com/Econometric-Analysis-8th-William-Greene/dp/0134461363) by William Greene\n", 40 | "3. [Time Series Analysis](https://www.amazon.com/Time-Analysis-James-Douglas-Hamilton/dp/0691042896) by James Hamilton\n", 41 | "\n", 42 | "## About these Lectures\n", 43 | "\n", 44 | "### Course Material\n", 45 | "\n", 46 | "All of the course material can be downloaded from [my GitHub page](https://github.com/Jamie-L-Cross/Bayes). We won't be following any textbook, however I highly recommend that you obtain a copy of:\n", 47 | "\n", 48 | "1. [Notes on Bayesian Macroeconometrics](http://joshuachan.org/notes_BayesMacro.html) by Joshua CC Chan. This collection of notes contains most of the algorithms and models that we will learn in this course with associated MATLAB codes.\n", 49 | "2. [Bayesian Econometrics](https://www.amazon.com/Bayesian-Econometrics-Gary-Koop/dp/0470845678) by Gary Koop. Gary provides great intuition for Bayesian thinking and also covers some of the models that we will consider in this course.\n", 50 | "3. [Bayesian Econometric Methods (Econometric Exercises)](https://www.amazon.com/Bayesian-Econometric-Methods-Exercises/dp/0521671736), by Joshua CC Chan, Gary Koop, Dale Poirier and Justin Tobias. This is a book of exercises and associated MATLAB codes. It is an extremely useful companion to Josh and Gary's textbooks.\n", 51 | "\n", 52 | "In addition to these resources, you might find the following useful:\n", 53 | "\n", 54 | "1. [QuantEcon](https://quantecon.org/) website for economic modeling (in both Julia and Python)\n", 55 | "2. [Ben Lambert](https://ben-lambert.com/about/) has a book called [A Student’s Guide to Bayesian Statistics](https://www.amazon.co.uk/Students-Guide-Bayesian-Statistics/dp/1473916364/) which is targeted at students without any previous knowledge of statistics nor probability. 
He also has some great videos on [his YouTube channel](https://www.youtube.com/user/SpartacanUsuals/playlists).\n", 56 | "\n", 57 | "Many academics also provide free to use code (mostly provided in MATLAB but easily translatable into Julia):\n", 58 | "\n", 59 | "1. [Joshua Chan](http://joshuachan.org/code.html)\n", 60 | "2. [Haroon Mumtaz](https://sites.google.com/site/hmumtaz77/code)\n", 61 | "3. [Dimitris Korobilis](https://sites.google.com/site/dimitriskorobilis/matlab)\n", 62 | "4. [Gary Koop](https://sites.google.com/site/garykoop/home/computer-code-2)\n", 63 | "\n", 64 | "### Jupyter notebooks with Julia\n", 65 | "\n", 66 | "All of the lectures will be delivered using [The Jupyter Notebook](https://jupyter.org/) in which we will use the programming language [Julia](https://julialang.org/).\n", 67 | "\n", 68 | "The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Common uses include data cleaning and transformation, numerical simulation and statistical modeling. The reason I'm using the notebooks is that they allow for an interactive classroom in which we can do math and estimate models without the need to switch programs.\n", 69 | "\n", 70 | "By default, The Jupyter Notebook allow you to use the programming language [Python](https://www.python.org/), however it supports over 40 programming languages, including Julia. Both Python and Julia are open source and free to use, however I have chosen to use Julia because (1) it's [syntax](https://cheatsheets.quantecon.org/) is easier to write when doing matrix computations (which we will do a lot of in this course) and (2) it's [faster](https://julialang.org/benchmarks/). Further nice comparison of the two languages can be found [here](https://www.geeksforgeeks.org/julia-vs-python/).\n", 71 | "\n", 72 | "#### Getting started\n", 73 | "\n", 74 | "I assume that most people aren't familiar with either The Jupyter Notebook or Julia, but will assume that you can set them up and teach yourself the basics. As a first step, I recommend that you try [the online demo](https://jupyter.org/try).\n", 75 | "\n", 76 | "A simple step-by-step guide to installing both Julia and The Jupyter Notebook can be found [here](https://datatofish.com/add-julia-to-jupyter/). Installing these two programs will allow you to run all of the codes used in the lectures through The Jupyter Notebook. Alternatively, you can run them online via [cocalc.com](cocalc.com).\n", 77 | "\n", 78 | "If you want to learn more about The Jupyter Notebook and Julia, then I highly recommend:\n", 79 | "\n", 80 | "1. [Julia Tutorials](https://datatofish.com/julia-tutorials/) by [Data to Fish](https://datatofish.com/)\n", 81 | "2. [Getting Started with Julia](https://julia.quantecon.org/getting_started_julia/index.html) from [this lecture series](https://julia.quantecon.org/index_toc.html) by [QuantEcon](https://quantecon.org/).\n", 82 | "\n", 83 | "If you want to use Julia to run programs outside of The Jupyter Notebook (useful for research), then you will need a text editor such as [Atom](https://atom.io/) and [Juno](http://docs.junolab.org/latest/man/installation/). 
This [discussion on specifying the Julia path](https://discourse.julialang.org/t/set-julia-path-in-juno/37417) may also be useful.\n", 84 | "\n", 85 | "#### Jula vs MATLAB\n", 86 | "\n", 87 | "For those of you with coding experience in the programming language MATLAB (like myself), note that Julia's syntax is almost identical. [QuantEcon](https://quantecon.org/) provide a nice [cheat cheat](https://cheatsheets.quantecon.org/) that compares the two languages syntax (along with Python) as well as a list of [Julia's advantages](https://julia.quantecon.org/about_lectures.html) over MATLAB and other programming languages. The main reason that I've chosen to use Julia instead of MATLAB is that it's compatible with The Jupyter Notebook and therefore great for teaching. Others might like the fact that it's (1) free and (2) fast. The cost of using Julia is that (1) you have to manually install libraries to use certain functions (2) it has fewer libraries than MATLAB and (3) it doesn't have any customer support. That being said, there are a bunch of [Julia packages](https://julialang.org/packages/) available with more popping up everyday, as well as a [forum](https://discourse.julialang.org/) where you might find a solution to your problem or even post a question that other users can answer.\n", 88 | "\n", 89 | "Note that if you're still not convinced about making the switch from MATLAB to Julia then you can stick with MATLAB and convert the provided Julia codes using the [cheat cheat](https://cheatsheets.quantecon.org/). This will not impact your overall grade in any way.\n", 90 | "\n", 91 | "" 102 | ] 103 | } 104 | ], 105 | "metadata": { 106 | "kernelspec": { 107 | "display_name": "Julia 1.5.1", 108 | "env": { 109 | "JULIA_DEPOT_PATH": "/home/user/.julia/:/ext/julia/depot/", 110 | "JULIA_PROJECT": "/home/user/.julia/environment/v1.5" 111 | }, 112 | "language": "julia", 113 | "metadata": { 114 | "cocalc": { 115 | "description": "The Julia Programming Language", 116 | "priority": 10, 117 | "url": "https://julialang.org/" 118 | } 119 | }, 120 | "name": "julia-1.5" 121 | }, 122 | "language_info": { 123 | "file_extension": ".jl", 124 | "mimetype": "application/julia", 125 | "name": "julia", 126 | "version": "1.5.1" 127 | } 128 | }, 129 | "nbformat": 4, 130 | "nbformat_minor": 4 131 | } 132 | -------------------------------------------------------------------------------- /8_Large_BVAR.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Large Bayesian vector autorergession models\n", 8 | "\n", 9 | "Up until 2010, most empirical macroeconomic research using vector autoregression (VAR) models was done with small systems (3-5 variables with 1-2 lags). This is because large VARs often contain more parameters than can be fitted by standard macroeconomic datasets. For example, a $VAR(4)$ with $n = 20$ dependent variables has $1,620$ VAR coefficients, which is much larger than the number of observations in typical quarterly macroeconomic datasets (e.g. forty years of quarterly data (T=160)). As a result, frequentist estimation methods, such as maximum likelihood, are unlikely to have reasonable properties. 
This is known as the [*over-parameterization problem*](https://en.wikipedia.org/wiki/Overfitting).\n", 10 | "\n", 11 | "In an influential paper, [Banbura, Giannone, and Reichlin (2010)](https://onlinelibrary.wiley.com/doi/full/10.1002/jae.1137?casa_token=RgnCtzbYBjUAAAAA%3AmOIDzTVr46iEoSU73_ZGAmf9RcxmqF20fuv3Ot_jqj8buYLpY0sbERRBU6OGLPauoZZKTMt7-19Niiw) showed that this *over-parameterization* problem can be overcome through the use of informative priors. The basic idea is to use informative priors to push the coefficient values in the VAR towards zero resulting in a system of parsimonious random walks. Such priors are known as *shrinkage priors* (or *regularization priors*) and have the effect of reducing parameter uncertainty, thereby making them better suited than frequentist methods when estimating large VAR models. In their paper, Banbura, Giannone, and Reichlin (2010) not only showed that the use of such priors facilitates the estimation of a VAR with over 100 variables, but also found that this large Bayesian VAR (BVAR) improves upon the forecast performance of commonly used small VARs.\n", 12 | "\n", 13 | "**Remarks**:\n", 14 | "1. The prior is imposed *a priori*. If there is a sufficiently strong signal in the data, then *a posteriori*, each equation of the BVAR may follow a more complicated autoregressive process as opposed to a random walk.\n", 15 | "\n", 16 | "## VAR Recap\n", 17 | "\n", 18 | "Recall that the n-variable vector autorergession model of order $p$, denoted VAR(p), is defined as\n", 19 | "$$\n", 20 | "\\begin{equation}\n", 21 | "\\mathbf{y}_t = \\mathbf{b} + \\sum_{i=1}^{p}\\mathbf{B}_i\\mathbf{y}_{t-i} + \\boldsymbol{\\varepsilon}_t\n", 22 | "\\end{equation}\n", 23 | "$$\n", 24 | "where $\\mathbf{y}_t,\\mathbf{b},\\boldsymbol{\\varepsilon}_t$ are $n\\times 1$ vectors and $\\mathbf{B}_i$, $i=1,\\dots,p$ are $n\\times n$ matrices. \n", 25 | "\n", 26 | "For estimation purposes, we stack the model over all dates, $t=1,\\dots,T$ to get\n", 27 | "$$\n", 28 | "\\mathbf{y} = \\mathbf{X}\\boldsymbol{\\beta} + \\boldsymbol{\\varepsilon}, \\quad \\boldsymbol{\\varepsilon}\\sim \\mathcal{N}(\\mathbf{0},\\mathbf{I}_T\\otimes\\boldsymbol{\\Sigma})\n", 29 | "$$\n", 30 | "where $\\mathbf{y} = (y_1,\\dots,y_T)'$ and $\\boldsymbol{\\varepsilon} = (\\varepsilon_1,\\dots,\\varepsilon_T)'$ are $Tn\\times1$, and \n", 31 | "$\\mathbf{X}$ is a $Tn\\times nk$ matrix that stacks the regressors into a matrix\n", 32 | "$$\n", 33 | "\\mathbf{X}=\\begin{bmatrix}\n", 34 | "\\mathbf{x}_1 \\\\ \n", 35 | "\\vdots \\\\ \n", 36 | "\\mathbf{x}_T \n", 37 | "\\end{bmatrix}\n", 38 | "$$\n", 39 | "\n", 40 | "If we are willing to assume prior independence, then the conjugate independent Normal and inverse-Wishart priors facilitate the use of a two block Gibbs sampler. Details are provided in the lecture on VARs.\n", 41 | "\n", 42 | "## Shrinkage priors\n", 43 | "A generic Normal distributed prior on the VAR coefficients takes the form \n", 44 | "$$\n", 45 | "\\boldsymbol{\\beta}\\sim \\mathcal{N}(\\boldsymbol{\\mu},\\mathbf{V}),\n", 46 | "$$\n", 47 | "where $\\boldsymbol{\\mu}$ and $\\mathbf{V}$ are researcher specified hyperparameters. \n", 48 | "\n", 49 | "The prior becomes shrinkage prior by setting $\\boldsymbol{\\mu}=\\mathbf{0}$. Set in this manner, the choice of elements of the covariance matrix $ \\mathbf{V} $ will govern the degree of shrinkage. 
\n", 50 | "\n", 51 | "This means that the *independent Normal and inverse-Wishart prior* that we used to estimate the BVAR in the previous lecture is a shrinkage prior in which all of the BVAR coefficients are pushed towards zero at the same rate. \n", 52 | "\n", 53 | "More generally, we can use our knowledge of macroeconomic variables to impose some additional structure on the covariance matrix $ \\mathbf{V} $. The most common method is the *Minnesota prior*.\n", 54 | "\n", 55 | "### Minnesota prior\n", 56 | "Many variants of the Minnesota prior have been proposed (see, e.g., [Karlsson (2013)](https://www.sciencedirect.com/science/article/pii/B9780444627315000154) for\n", 57 | "a comprehensive discussion). In each case, it's common to assume that the variance-covariance matrix is diagonal. Hence, there is no relationship among the coefficients of various VAR equations. Here we follow the specification in [Koop and Korobilis (2010)](http://personal.strath.ac.uk/gary.koop/kk3.pdf) which is commonly used in practice. \n", 58 | "\n", 59 | "For the intercept term, it's common to specify a common scalar variance, denoted $\\lambda$. Smaller values imply a more informative prior with more shrinkage, and vice versa.\n", 60 | "\n", 61 | "For the coefficients of the lags, note that the diagonal elements of the prior variance matrix on the lagged coefficient matrices can be written as $ \\left(v_1,\\dots,v_{nk}\\right) = \\text{vec}\\left(\\left(\\mathbf{V}_1,\\dots,\\mathbf{V}_p\\right)'\\right)$. The $ \\left(i,j\\right) $-th element of $ \\mathbf{V}_r $, $ \\mathbf{V}_r^{ij} $, denotes the variance of the $ \\left(i,j\\right) $-the element of the VAR coefficient matrix $ \\mathbf{B}_r $, $ r=1,\\dots,p $. \n", 62 | "\n", 63 | "The Minnesota prior specifies that\n", 64 | "$$\n", 65 | "\\mathbf{V}_r^{ij}=\\begin{cases}\n", 66 | "\\frac{\\pi_{1}^{2}}{r^{\\pi_{3}}} & \\text{for coefficients on own lag }r\\text{ for }r=1,\\ldots,p,\\\\\n", 67 | "\\frac{\\pi_{1}^{2}\\pi_{2}\\sigma_j}{r^{\\pi_{3}}\\sigma_i} & \\text{for coefficients on lag }r\\text{ of variable }j\\neq i,\\text{ for }r=1,\\ldots,p,\n", 68 | "\\end{cases}\n", 69 | "$$\n", 70 | "where $\\pi_{1}, \\pi_{2}$ and $\\pi_{3}$ are hyperparameters, and $ \\sigma_l $ is the standard deviation from an $ AR\\left(p\\right) $ model for the variable $ l $, $ l=1,\\dots,n $. The hyperparameters\n", 71 | "1. $\\pi_{1}$ controls the overall tightness of the marginal distributions around zero and therefore governs the relative importance of the prior compared to information contained in the data. Smaller values imply a more informative prior with more shrinkage, and vice versa. \n", 72 | "2. $\\pi_2$ governs the relative importance of *own-lags* relative to lags of other variables. If $\\pi_{2}=1$, then both types of lags are a priori equally important. Conversely, setting $\\pi_{2}<1$ implies that own-lags are relatively more important than lags on other variables, and vice-versa. In most economic applications $\\pi_{2}<1$ is a natural choice.\n", 73 | "3. $\\pi_3$ governs the degree of shrinkage on recent lags relative to distant ones. In most economic applications recent lags are more likely to be important predictors than distant ones making $\\pi_{3}\\geq 1$ a natural choice. 
\n", 74 | "\n", 75 | "To see what the Minnesota prior implies consider a VAR(2) with $n=2$:\n", 76 | "$$\n", 77 | "\\mathbf{V} = \n", 78 | "\\begin{bmatrix}\n", 79 | "\\lambda & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\\\ \n", 80 | "0 & \\pi_{1}^{2} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\\\ \n", 81 | "0 & 0 & \\frac{\\pi_{1}^{2}\\pi_{2}\\sigma_2}{\\sigma_1} & 0 & 0 & 0 & 0 & 0 & 0 & 0\\\\ \n", 82 | "0 & 0 & 0 & \\frac{\\pi_{1}^{2}}{2^{\\pi_{3}}} & 0 & 0 & 0 & 0 & 0 & 0\\\\ \n", 83 | "0 & 0 & 0 & 0 & \\frac{\\pi_{1}^{2}\\pi_{2}\\sigma_2}{2^{\\pi_{3}}\\sigma_1} & 0 & 0 & 0 & 0 & 0\\\\ \n", 84 | "0 & 0 & 0 & 0 & 0 & \\lambda & 0 & 0 & 0 & 0\\\\ \n", 85 | "0 & 0 & 0 & 0 & 0 & 0 & \\frac{\\pi_{1}^{2}\\pi_{2}\\sigma_1}{\\sigma_2} & 0 & 0 & 0\\\\ \n", 86 | "0 & 0 & 0 & 0 & 0 & 0 & 0 & \\pi_{1}^{2} & 0 & 0\\\\ \n", 87 | "0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \\frac{\\pi_{1}^{2}\\pi_{2}\\sigma_1}{2^{\\pi_{3}}\\sigma_2} & 0\\\\ \n", 88 | "0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \\frac{\\pi_{1}^{2}}{2^{\\pi_{3}}} \n", 89 | "\\end{bmatrix}\n", 90 | "$$\n", 91 | "\n", 92 | "**Remarks**:\n", 93 | "1. **Name**: The Minnesota prior was first proposed in [Litterman (1979)](https://ideas.repec.org/p/fip/fedmwp/115.html), and was subsequently developed by researchers at University of Minnesota [Doan, Litterman, and Sims (1984)](https://www.tandfonline.com/doi/abs/10.1080/07474938408800053) and [Litterman (1986)](https://amstat.tandfonline.com/doi/abs/10.1080/07350015.1986.10509491).\n", 94 | "2. **Modern variants**: The original Minnesota prior used [ordinary least squares (OLS)](https://en.wikipedia.org/wiki/Ordinary_least_squares) to estimate the error covariance $\\boldsymbol{\\Sigma}$ in the VAR model, thereby ignoring parameter uncertainty associated with estimating $\\boldsymbol{\\Sigma}$. Modern treatments overcome this issue by using the *independent Normal and inverse-Wishart prior* discussed in the previous lecture. No extra estimation tools are required. \n", 95 | "3. **Hyperparameters**: By construction, the Minnesota prior estimates will be sensitive to the choice of hyperparameter values. This is because it's designed to be an informative prior. Following Koop and Korobilis (2010), most practitioners set $\\pi_{3}=2$, however values for $\\pi_{1}$ and $\\pi_{2}$ are generally subjective.\n", 96 | "3. **Hierarchical priors**: Dissatisfaction with different user choices of hyperparameter values led to the development of [*hierarchical priors*](https://en.wikipedia.org/wiki/Bayesian_hierarchical_modeling) that introduce priors on the hyperparameters (also called *hyperpriors*) and integrate them out in a Bayesian manner. For instance, [Giannone, Lenza and Primiceri (2015)](https://www.mitpressjournals.org/doi/abs/10.1162/REST_a_00483) show that hyperpriors can generate more accurate macroeconomic forecasts than conventional choices. With this in mind, [Chan, Jacobi and Zhu (2020)](http://joshuachan.org/papers/AD_OptHyper.pdf) have proposed the use of a derivative based approach for efficient selection of the Minnesota hyperparameters." 
97 | ] 98 | }, 99 | { 100 | "cell_type": "code", 101 | "execution_count": 1, 102 | "metadata": {}, 103 | "outputs": [ 104 | { 105 | "ename": "LoadError", 106 | "evalue": "UndefVarError: nk not defined", 107 | "output_type": "error", 108 | "traceback": [ 109 | "UndefVarError: nk not defined", 110 | "", 111 | "Stacktrace:", 112 | " [1] top-level scope at In[1]:3", 113 | " [2] include_string(::Function, ::Module, ::String, ::String) at .\\loading.jl:1091" 114 | ] 115 | } 116 | ], 117 | "source": [ 118 | "# Minnesota prior for a BVAR - replace prior for beta in previous codes with the following\n", 119 | "# prior mean\n", 120 | "pri_beta0 = zeros(nk);\n", 121 | "\n", 122 | "# prior variance\n", 123 | "lambda = 10; # overall tightness on intercepts\n", 124 | "c1 = 0.2; # overall tightness on lags\n", 125 | "c2 = 0.5; # additional shrinkage on cross-lags\n", 126 | "c3 = 2; # rate at which the prior variance decreases with increases lag length\n", 127 | "\n", 128 | "# OLS estimates for standard deviation\n", 129 | "sigOLS = zeros(n,1);\n", 130 | "for i = 1:n\n", 131 | " yi = y[:,i];\n", 132 | " Xi = [ones(T) X[:, (i+1):n:end ]];\n", 133 | " betai = (Xi'*Xi)\\(Xi'*yi);\n", 134 | " e = yi - Xi*betai;\n", 135 | " global sigOLS[i] = sqrt(mean(e.^2));\n", 136 | "end\n", 137 | "\n", 138 | "# Elements of Minneota variance\n", 139 | "sig_ratio = sigOLS*transpose(1 ./sigOLS);\n", 140 | "C1 = 1 ./transpose( kron(collect(1:p),ones(n,n)).^c3 );\n", 141 | "C2 = repeat(sig_ratio, 1, p);\n", 142 | "C3 = c2*ones(n,n);\n", 143 | "C3[diagind(C3)] .= 1;\n", 144 | "C3 = repeat( C3, 1, p);\n", 145 | "vPhi = c1^2 .*C1 .*C2 .*C3;\n", 146 | "\n", 147 | "# Prior covariance and precision matrix\n", 148 | "vc0 = lambda*ones(n,1); # variance of intercept terms\n", 149 | "vc1 = transpose([vc0 vPhi]);\n", 150 | "pri_Vbeta0 = Array(vec(vc1));\n", 151 | "pri_invVbeta0 = sparse(diagm(1 ./pri_Vbeta0));" 152 | ] 153 | }, 154 | { 155 | "cell_type": "markdown", 156 | "metadata": {}, 157 | "source": [ 158 | "# Conclusion\n", 159 | "VARs tend to be highly parameterized, and Bayesian methods allow us to overcome this overparameterization problem through the use of informative priors, known as shrinkage priors. The most prominent of these is the Minnesota prior which captures the following ideas\n", 160 | "1. Own lags are more likely to be important predictors than lags of other variables\n", 161 | "2. More recent lags are more likely to be important predictors than distant ones\n", 162 | "\n", 163 | "To incorporate the Minnesota prior beliefs into the BVAR, simply change the prior for beta using the code provided, and then estimate the BVAR using the code made available for the idependent Normal and inverse-Wishart prior.\n", 164 | "\n", 165 | "## Recent research\n", 166 | "Over the past few years, researchers have proposed *adaptive hierarchical shrinkage priors* which have shown to have optimal theoretical properties when applied to sparse datasets. The main idea is to leave preserve the signal from 'large' coefficients while strongly shrinking 'small' coefficients to zero. 
Popular examples include the *Dirichlet-Laplace* ([Kastner and Huber, 2020](https://onlinelibrary.wiley.com/doi/full/10.1002/for.2680)), *Horseshoe* ([Follett and Yu, 2017](https://arxiv.org/abs/1709.07524)) and *Normal-Gamma* ([Huber and Feldkircher, 2019](https://www.tandfonline.com/doi/full/10.1080/07350015.2016.1256217)) priors, which are from the family of *global-local priors* ([Polson and Scott, 2010](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.180.727&rep=rep1&type=pdf)). \n", 167 | "\n", 168 | "Despite having good theoretical properties, [Cross, Poon and Hou (2020)](https://www.sciencedirect.com/science/article/pii/S0169207019302547) find that these adaptive hierarchical priors don't seem to forecast better than a variant of the Minnesota prior where the hyperparameters are selected based on the data. A possible reason for this result is that macroeconomic datasets are not *sparse* but *dense*, a result that was originally claimed in [Giannoni, Lenza and Primiceri (2018)](https://ideas.repec.org/p/fip/fednls/87258.html). This has lead to new research developments on Minnesota-type adaptive hierarchical priors ([Chan, 2020](http://joshuachan.org/papers/BVAR-MAHP.pdf)) and hierarchical priors for time-varying parameter models ([Huber, Koop and Onorante, 2020](https://www.tandfonline.com/doi/full/10.1080/07350015.2020.1713796)), and remains an area of active research. \n", 169 | "\n", 170 | "## Recommended reading\n", 171 | "1. For those interested in learning more about large BVARs, I recommend reading the manuscript by [Koop and Korobilis (2010)](http://personal.strath.ac.uk/gary.koop/kk3.pdf) and [Large Bayesian Vector Autoregressions](http://joshuachan.org/papers/large_BVAR.pdf) by Joshua CC Chan, in [Macroeconomic Forecasting in the Era of Big Data: Theory and Practice\n", 172 | "](https://www.springer.com/gp/book/9783030311490).\n", 173 | "2. For those interested in estimating large BVARs with non-Gaussian, heteroscedastic and serially dependent errors, see [Chan (2020)](http://www.joshuachan.org/papers/BVAR.pdf)." 174 | ] 175 | } 176 | ], 177 | "metadata": { 178 | "kernelspec": { 179 | "display_name": "Julia 1.5.1", 180 | "language": "julia", 181 | "name": "julia-1.5" 182 | }, 183 | "language_info": { 184 | "file_extension": ".jl", 185 | "mimetype": "application/julia", 186 | "name": "julia", 187 | "version": "1.5.1" 188 | } 189 | }, 190 | "nbformat": 4, 191 | "nbformat_minor": 2 192 | } 193 | -------------------------------------------------------------------------------- /1_Introduction.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# Lecture 1: What does it mean to be Bayesian?\n", 8 | "\n", 9 | "My objective in this course is to get you to a position where you have the ability to \n", 10 | "1. Contrast classical and Bayesian thinking\n", 11 | "2. Estimate models using Bayesian methods\n", 12 | "2. Create original pieces of research using Bayesian methods\n", 13 | "\n", 14 | "This lecture is aimed at the first goal. \n", 15 | "\n", 16 | "By the end of this lecture we will have (hopefully) completed the first goal. To that end, we will\n", 17 | "1. Recap how to compute and interpret probabilities and contrast Bayesian and frequentist interpretations of probability\n", 18 | "2. Develop an understanding of Bayesian econometric theory\n", 19 | "3. 
Apply the theory in an experiment\n", 20 | "\n", 21 | "## Motivating questions\n", 22 | "Answer the following questions:\n", 23 | "1. Suppose I pull a coin out of my pocket, flip it and catch it. What is the probability that the outcome (i.e. Heads or Tails) is Heads? Explain your reasoning. \n", 24 | "2. Now suppose that I look at the outcome, but don't show it to you. What is the probability that I am observing Heads? Explain your reasoning. \n", 25 | "3. What are the probabilities of Donald Trump and Joe Biden respectively winning the 2020 US election? \n", 26 | "4. Suppose we estimate that the return to an additional year of education is, on average, 7% for which [5%,10%] represents a 95% interval estimate around this value. What is the probability that this interval estimate contains the true population value?\n", 27 | "\n", 28 | "### What do your answers suggest about your thought process?\n", 29 | "1. Under the assumption that the coin is fair, the correct answer from either a Bayesian or frequentist view is 1/2. We will soon learn that the difference between the two approaches will depend on our reasoning. If you reason that the probability is 1/2 because it represents the expected long-run frequency of observing Heads if I repeatedly tossed the coin, then you're thinking like frequentist. This is because frequentists view probabilities as representing the long-run frequency of an outcome in a large set of repeated experiments. In constrast, if you reasoned that the probability is 1/2 based on your past experience in probability classes, discussions with others, reading book etc, then you're thinking like a Bayesian. This is because Bayesians use their existing knowledge about the world to establish a *prior belief* about the probability, they then update this belief after seeing evidence (in this case outcomes of the coin flip). \n", 30 | "\n", 31 | "2. If you answered 1/2, or any other value in $[0,1]$, then you are thinking like a Bayesian. In contrast, if you answered either 0 or 1, then it depends on your reasoning. This is because Bayesians use probabilities to represent their uncertainty about outcomes, in this case about me observing Heads. Thus, any probability in $(0,1)$ is in line with a Bayesian interpretation of uncertainty about the outcome, while a probability of 0 or 1 with the reasoning that it represents absolute certainty about the outcome, is also in line with Bayesian thought. In contrast, to a frequentist, the outcome of the experiment has already been determined and I'm either looking at a Heads (probability = 1) or I'm not (probability = 0). Notice in this case, that the Bayesian and frequentists probability estimates can be the same (i.e. values of 0 or 1). This often occurs in practice.\n", 32 | "\n", 33 | "3. If you answered this question with any probability, then you are thinking like a Bayesian. This is because you have used your existing knowledge about the the world to infer a probability. In contrast, if you stated that we can't assign a probability to this question then you are thinking like a frequentist. This is because frequentists only assign probabilities to repeatable events, while Bayesians can assign probabilities to either repeatable or non-repeatable events.\n", 34 | "\n", 35 | "4. If you answered 95%, then you are thinking like a Bayesian. This is because Bayesians view population parameters as random variables and can therefore use probabilities to represent their degree of uncertainty about them. 
In contrast, if you answered either 0 or 1, then you are thinking like a frequentist. This is because frequentists view population population parameters as predetermined fixed quantities and consequently don't assign probabilities to them. Note that a common interpretation error among users of frequentist methods is that a *confidence interval* expresses the probability that the population parameter lies within the interval. Instead, the correct interpretation is that if the experiment where repeated over and over again, then 95% of the confidence intervals generated will cover the population parameter. \n", 36 | "\n", 37 | "\n", 38 | "## What is a probability?\n", 39 | "Recall that the *sample space* S of an experiment is the set of all possible outcomes, and an *event* $E$ is a subset of the sample space $S$. \n", 40 | "\n", 41 | "**Example:** \n", 42 | " Experiment: flip a coin and record the result, i.e. Heads (H) or Tails (T). \n", 43 | " Sample space: $S=\\{H,T\\}$\n", 44 | " Event: observed Heads, i.e. $E=\\{H\\}$\n", 45 | "\n", 46 | "The earliest definition of the probability of an event was to count the number of ways that the event could happen and divide by the total number of possible outcomes for the experiment. Written as an equation we have\n", 47 | "$P(E)=\\frac{\\text{number of outcomes favorable to } E}{\\text{total number of outcomes in } S}$\n", 48 | "This definition is formally known as the *naive definition of probability*.\n", 49 | "\n", 50 | "**Example:** According to the naive definition, the probability that a coin flip will show Heads is 0.5. This is computed as follows: First compute that the sample space $S=\\{H,T\\}$ has two outcomes. Next, compute that the event of showing Heads $E=\\{H\\}$ has one outcome. Now use the definition to compute that the probability a flip will show heads is 1/2=0.5.\n", 51 | "\n", 52 | "The naive definition is very restrictive in that it requires the sample space to be finite, and the outcomes to be equally likely. This is fine in some cases, e.g. our coin flip example, but is not applicable to many others. For instance, consider an experiment in which we roll a six sided die. In this case, the sample space is S={1,2,3,4,5,6} and by applying the naive definition of probability we can compute that the probability of landing any single outcome is 1/6. This is fine. Suppose, however, that the die was previously loaded with a weight so that the probability of rolling a one is 1/3, and the probability of landing any other single outcome is equal, i.e. 2/15. There is no way of computing such probabilities using the naive definition.\n", 53 | "\n", 54 | "For this reason, a *general definition of probability* was proposed that relies on the notation of a *probability space*. A probability space consists of a sample space S and a probability function $P:E\\subset S\\to [0,1]$ (i.e. $P$ takes an event $E\\subset S$ as an input and returns a real number in the interval $[0,1]$ as an output) which satisfies the following axioms:\n", 55 | "1. $P(\\emptyset) = 0 $ - The probability of nothing occurring (AKA a null event) is zero\n", 56 | "2. $P(S) = 1 $ - The probability of the sample space occurring is one\n", 57 | "3. 
If $E_i\\cap E_j = \\emptyset$ for $i\\neq j$, then $P(\\cup_{i=1}^{\\infty}E_i) = \\sum_{i=1}^{\\infty}P(E_i)$ - The probability of one or more mutually exclusive events occurring (AKA disjoint sets) is given by the sum of their probabilities.\n", 58 | "\n", 59 | "Unlike in the naive definition of probability, the general definition allows for a countably infinite number of outcomes in which each outcome may have a different probability of occurrence. Moreover, by applying logical arguments to these three axioms, we can derived the general rules of probability e.g. complements, unions, intersections etc. \n", 60 | "\n", 61 | "Further reading for those interested:\n", 62 | "1. [A Short History of Probability](http://homepages.wmich.edu/~mackey/Teaching/145/probHist.html)\n", 63 | "\n", 64 | "## Two interpretations of probability\n", 65 | "What do we mean when we say that the probability that a coin flip will show Heads is equal to 1/2?\n", 66 | "\n", 67 | "While the general definition of probability tells us what a probability function is, it doesn't tell us how probabilities should be interpreted. Two main schools of thought exist: \n", 68 | "1. The *frequentist* view of probability is that it represents a *long-run frequency* over a large number of repetitions of an experiment. This means that if a frequentist says that a coin flip has probability of showing Heads equal to 1/2, then they mean that: *if the coin were flipped it over and over again and the result recorded, then the coin will land Heads 50% of the time*. \n", 69 | "2. The *Bayesian* view of probability is that it represents a *degree of belief* about the event in question. This means that if a Bayesian says that a coin flip has probability of showing Heads equal to 1/2, then they mean that: *if the coin is flipped once, then I believe that there's a 50% chance that it shows Heads*.\n", 70 | "\n", 71 | "**Remarks**: \n", 72 | "1. Regardless of which perspective we take, the general rules of probability e.g. complements, unions, intersections etc remain the same. \n", 73 | "2. Both the Bayesian and frequentist perspectives are non-falsifiable. Which school of thought you choose to subscribe to is therefore a matter of personal preference. In this course we will adopt a Bayesian perspective and draw comparisons with the frequentist view along the way.\n", 74 | "3. Frequentists often argue that the concept of individual *degree of belief* is problematic, because people may have differing degrees of belief about different hypothesis, and science is about finding the correct answers to those hypothesis. The first point to note is that the Bayesian perspective of probability is *subjective* in the sense that it requires people to form a *prior belief* about an unknown quantity before seeing any results from an experiment. E.g. formulate the probability of the coin flip showing Heads, before seeing any outcomes from flipping the coin. Clearly this prior belief might be \"wrong\" in the sense that it might differ from the *true probability*. The second point to note, as we will see later in the lecture, however, Bayesian thinking does not stop here. Instead, we conduct experiments to gain evidence for or against the hypothesis (just like frequentists) and update our prior beliefs in a manner that is consistent with the rules of probability. 
This means that while a researcher might begin with a subjective *a priori* belief about an unknown quantity before seeing the data, they will then update their beliefs after seeing the data, and these *a posteriori* beliefs about the unknown quantity of interest will generally converge to the \"truth\". Thus, while the Bayesian interpretation of probability is fundamentally subjective, the process of inference is objective in the sense that it is consistent with the *likelihood principle* - the proposition that all relevant information about an unknown quantity obtained from an experiment is contained in the likelihood function. In fact, at the end of this lecture we will see that the Bayesian and frequentist methods often provide the same point estimates of unknown quantities. In such cases, the difference is in how we interpret the estimates. Finally, it's important to note that the frequentists do have a valid point. If Bayesian analysis is based on a *dogmatic prior belief* - in the sense that the person is not willing to update their belief in the face of evidence - and this prior is far away from the truth, then this will inevitably skew the results away from the truth. Fortunately, scientists know better than to stop here. Others will repeat the analysis with non-dogmatic priors and see if the results are robust. If they are not, then they will claim the initial analysis is wrong. Others will then repeat the experiment it again, and again, and confidence will grow they have settled on the true outcome. This is *Bayesian updating* in practice...we start with no knowledge, make some observations, update our knowledge, and repeat.\n", 75 | "\n", 76 | "\n", 77 | "Further reading for those interested:\n", 78 | "1. A brief history of Bayesian thinking, titled: [When Did Bayesian Inference Become \"Bayesian\"?](https://projecteuclid.org/download/pdf_1/euclid.ba/1340371071), by Professor Stephen E. Fienberg\n", 79 | "2. [Bayesian Epistemology](https://plato.stanford.edu/entries/epistemology-bayesian/) in the Stanford Encyclopedia of Philosophy\n", 80 | "\n", 81 | "## Practical implications\n", 82 | "While the distinction between frequentist and Bayesian interpretations about probability might seem abstract and irrelevant from the perspective of a pragmatic researcher, there are major practical implications for the way in which each school does statistics, and therefore econometrics. Three of the most important differences are:\n", 83 | "\n", 84 | "1. Both Bayesians and frequenstsis believe that there is some true data generating process (DGP) that governs the outcomes of an observed sample. However, beliefs differ over whether the parameters or data are random. Frequentists believe that the parameters are non-random values and that the data is random. In contrast, Bayesian's treat all unknown quantities, e.g. population parameters, as random variables and all known quantities, e.g. observed data, as fixed. An implication of this difference is that Bayesians assign probability statements to the unknown quantities, e.g. population parameters, while frequentists assign probabilities to constructed functions of the data, e.g. estimators.\n", 85 | "\n", 86 | "**Example**: If we toss a coin and think about the probability of showing Heads, then a Bayesian will view the unknown probability as a random variable, while a frequentist will view it as a number. 
A 95% Bayesian interval estimate around a point estimate, known as a *credible interval*, is interpreted by saying that: there is a 95% probability that the true value would lie within the interval, given the evidence provided by the observed data. In contrast, an analagous interval estimate from a frequentist perspective, known as a *confidence interval*, is interpreted by saying that: there is a 95% probability that the random interval (computed from the random data) contains the true value.\n", 87 | "\n", 88 | "**Remark**: While the [theory underlying confidence intervals](https://pdfs.semanticscholar.org/6281/be0dff2f86781fcda53f8d5263cb98000797.pdf) is sound, in practice, many people misinterpret them as Bayesian credible intervals. To avoid making this mistake ever again, remember that frequentists only place probabilities on data or functions of the data, e.g. the confidence interval, and never on hypotheses that involve non-random values, e.g. the population parameter.\n", 89 | "\n", 90 | "![](Cartoon_bvf.png)\n", 91 | "Source: [Agoston Torok's discussion of Bayesianism vs Frequentism](https://agostontorok.github.io/2017/03/26/bayes_vs_frequentist/)\n", 92 | "\n", 93 | "\n", 94 | "2. Since Frequentists define probability on the basis of a long-run frequency occurrence, they only assign probabilities to repeatable random events. In contrast, Bayesian's can assign probabilities to either repeatable or non-repeatable events. \n", 95 | "\n", 96 | "**Example**: A political scientist may be interested in answering the question: What is the probability that Donald Trump will win the 2020 US election? Since the 2020 election is not a repeatable event, a frequentist can not answer this question, however a Bayesian can use evidence from related sources, e.g. polls, to estimate the probability. This is one of the primary reasons that Bayesian statistics is popular among researchers in a range of disciplines including business analytics, finance, machine learning and the social sciences.\n", 97 | "\n", 98 | "3. As we will see in the next section, all Bayesian statistical notions: estimation, inference, prediction stem from Bayes theorem. In contrast, frequentist's have distinct methods for each.\n", 99 | "\n", 100 | "Further reading for those interested:\n", 101 | "1. [Frequentist and Subjectivist Perspectives on the Problems of Model Building in Economics](https://www.jstor.org/stable/pdf/1942744.pdf?refreqid=excelsior%3Aaa23cccaa6b6092cb8e65d0c0d714888), by Dale J. Poirier\n", 102 | "2. [Bayesian Methods in Applied Econometrics, or, Why Econometrics Should Always and Everywhere Be Bayesian](http://sims.princeton.edu/yftp/EmetSoc607/AppliedBayes.pdf), by Professor Chris Sims\n", 103 | "3. [Why isn't everyone a Bayesian?](https://www.jstor.org/stable/pdf/2683105.pdf?refreqid=excelsior%3Ace3dc05a001f7fa7f16220451838ea89), by Professor Brad Efron\n", 104 | "4. 
[Objections to Bayesian statistics](https://projecteuclid.org/download/pdf_1/euclid.ba/1340370429), by Professor Andrew Gelman\n", 105 | "\n", 106 | "![](Image_cartoon.png)\n", 107 | "Source: [RevBayes](https://twitter.com/revbayes/status/514231641300955137)\n", 108 | "\n", 109 | "" 113 | ] 114 | } 115 | ], 116 | "metadata": { 117 | "kernelspec": { 118 | "display_name": "Julia 1.5.1", 119 | "language": "julia", 120 | "name": "julia-1.5" 121 | }, 122 | "language_info": { 123 | "file_extension": ".jl", 124 | "mimetype": "application/julia", 125 | "name": "julia", 126 | "version": "1.5.1" 127 | } 128 | }, 129 | "nbformat": 4, 130 | "nbformat_minor": 2 131 | } 132 | -------------------------------------------------------------------------------- /9_state_space.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# State space models\n", 8 | "A [state space representation](https://en.wikipedia.org/wiki/State-space_representation#State_variables) of a dynamical model relates the set of output variables $\\mathbf{y}_t$ to the set of (possibly unobserved) state-variables $\\mathbf{z}_t$ using first-order differential equations when time is continuous or difference equations when time is discrete. Such representations are useful because they provide a summary of the systems dynamics. In this lecture we will assume that time is discrete, and learn about the linear state space model.\n", 9 | "\n", 10 | "**Remarks**:\n", 11 | "1. **Jargon**: \n", 12 | " 1. The *state variables* are the smallest possible set of variables that can represent the entire state of the system at any given time.\n", 13 | " 2. State space models are also referred to as state space systems.\n", 14 | " 3. Writing a model as a state space system is referred to as the *state space representation* of the model.\n", 15 | " 4. In some cases the state space representation facilitate estimation, so the terms *representation* and *model* are used interchangeably. " 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 1, 21 | "metadata": {}, 22 | "outputs": [ 23 | { 24 | "data": { 25 | "text/html": [ 26 | "\n", 62 | "\n", 67 | " Unable to load WebIO. Please make sure WebIO works for your Jupyter client.\n", 68 | " For troubleshooting, please see \n", 69 | " the WebIO/IJulia documentation.\n", 70 | " \n", 71 | "

\n" 72 | ], 73 | "text/plain": [ 74 | "HTML{String}(\"\\n\\n Unable to load WebIO. Please make sure WebIO works for your Jupyter client.\\n For troubleshooting, please see \\n the WebIO/IJulia documentation.\\n \\n

\\n\")" 75 | ] 76 | }, 77 | "metadata": {}, 78 | "output_type": "display_data" 79 | } 80 | ], 81 | "source": [ 82 | "# Load packages\n", 83 | "using Distributions # Work with standard probability distributions\n", 84 | "using Interact # Create widgets \n", 85 | "using Plots # Create plots\n", 86 | "using LinearAlgebra # Use extra linear algebra functions such as the identity matrix I(n)" 87 | ] 88 | }, 89 | { 90 | "cell_type": "markdown", 91 | "metadata": {}, 92 | "source": [ 93 | "## Linear state space model\n", 94 | "The general form of the linear state space model is given by two equations\n", 95 | "\n", 96 | "$$\n", 97 | "\\begin{align}\n", 98 | "\\mathbf{y}_t &= \\mathbf{A}\\mathbf{z}_t + \\mathbf{B}\\mathbf{e}_t \\\\\n", 99 | "\\mathbf{z}_t &= \\mathbf{C}\\mathbf{z}_{t-1} + \\mathbf{D}\\mathbf{u}_t \\\\\n", 100 | "\\end{align}\n", 101 | "$$\n", 102 | "\n", 103 | "in which\n", 104 | "1. $\\mathbf{y}_t$ is a $n\\times 1$ vector of observations\n", 105 | "2. $\\mathbf{z}_t$ is a $k\\times 1$ vector of (possibly unobserved) states\n", 106 | "3. $\\mathbf{e}_t$ and $\\mathbf{u}_t$ are $n\\times 1$ iid random vectors\n", 107 | "4. $\\mathbf{A}$ is a $n\\times k$ matrix, sometimes called the *output* (or *system*) *matrix*.\n", 108 | "5. $\\mathbf{B}$ is a $n\\times n$ *observation volatility matrix*\n", 109 | "6. $\\mathbf{C}$ is a $k\\times k$ *state transition matrix*\n", 110 | "7. $\\mathbf{D}$ is a $k\\times 1$ *state volatility matrix*\n", 111 | "\n", 112 | "The first equation is known as the *observation* (or *measurement*) *equation* and the second equation is known as the *state* (or *transition*) *equation*.\n", 113 | "\n", 114 | "A variety of dynamical models can be represented in terms of the linear state space model. Common examples include:\n", 115 | "1. Autoregressive (AR) model\n", 116 | "2. 
Vector Autoregressive (VAR) model\n", 117 | "\n", 118 | "**Example**: Recall that the AR(p) model is given by\n", 119 | "$$\n", 120 | "y_t = \\rho_0 + \\rho_1 y_{t-1} + \\dots + \\rho_p y_{t-p} + \\varepsilon_t\n", 121 | "$$\n", 122 | "\n", 123 | "The state space representation of the AR(p) model is given by defining the observation equation as\n", 124 | "$$\n", 125 | "y_t = \\rho_0 + \n", 126 | "\\begin{bmatrix} \n", 127 | "1 & 0 & \\dots & 0\n", 128 | "\\end{bmatrix}\n", 129 | "\\begin{bmatrix} \n", 130 | "y_t - \\rho_0\\\\\n", 131 | "y_{t-1} - \\rho_0\\\\\n", 132 | "\\vdots\\\\\n", 133 | "y_{t-p} - \\rho_0\\\\\n", 134 | "\\end{bmatrix}\n", 135 | "$$\n", 136 | "and the state equation as\n", 137 | "$$\n", 138 | "\\begin{bmatrix} \n", 139 | "y_t - \\rho_0\\\\\n", 140 | "y_{t-1} - \\rho_0\\\\\n", 141 | "\\vdots\\\\\n", 142 | "y_{t-p} - \\rho_0\\\\\n", 143 | "\\end{bmatrix}\n", 144 | "=\n", 145 | "\\begin{bmatrix}\n", 146 | "\\rho_1 & \\rho_2 & \\dots & \\rho_{p-1} & \\rho_p \\\\\n", 147 | "1 & 0 & \\dots & 0 & 0 \\\\\n", 148 | "0 & 1 & \\dots & 0 & 0 \\\\\n", 149 | "\\vdots & \\vdots & \\ddots & \\vdots & \\vdots \\\\\n", 150 | "0 & 0 & \\dots & 1 & 0 \\\\\n", 151 | "\\end{bmatrix}\n", 152 | "\\begin{bmatrix} \n", 153 | "y_{t-1} - \\rho_0\\\\\n", 154 | "y_{t-2} - \\rho_0\\\\\n", 155 | "\\vdots\\\\\n", 156 | "y_{t-p} - \\rho_0\\\\\n", 157 | "\\end{bmatrix}\n", 158 | "+\n", 159 | "\\begin{bmatrix} \n", 160 | "\\varepsilon_t - \\rho_0\\\\\n", 161 | "0\\\\\n", 162 | "\\vdots\\\\\n", 163 | "0\\\\\n", 164 | "\\end{bmatrix}\n", 165 | "$$\n", 166 | "\n", 167 | "In this lecture we learn about a commonly used linear state space model, known as the unobserved components model.\n", 168 | "\n", 169 | "# Unobserved components model\n", 170 | "The *unobserved components (UC) model* decomposes a time series $y_t$ into a non-stationary trend component $\\tau_t$ and a cyclical component $\\varepsilon_t$. The simplest variant of the UC model is the *local level* model which is defined by\n", 171 | "$$\n", 172 | "\\begin{align}\n", 173 | "y_t &= \\tau_{t} + \\varepsilon_t, \\quad \\varepsilon_t \\sim N(0,\\sigma^2)\\\\\n", 174 | "\\tau_t &= \\tau_{t-1} + u_t, \\quad u_t \\sim N(0,\\omega^2)\\\\\n", 175 | "\\end{align}\n", 176 | "$$\n", 177 | "where $\\varepsilon_t$ and $u_s$ are independent for all dates $t$ and $s$ and the initial condition $\\tau_0$ is estimated. \n", 178 | "\n", 179 | "**Remarks**:\n", 180 | "1. **Jargon**: Since both the observation and state equations are linear in the unobserved $\\tau_t$ and both the error terms are Normally distributed, the local level model is a type of *linear Gaussian state space model*.\n", 181 | "2. **Trend-cycle decompositions**: The local level model is a simple trend-cycle decomposition. [Morely, Nelson and Zivot (2003)](https://www.mitpressjournals.org/doi/abs/10.1162/003465303765299765) show that the [Beveridge-Nelson decomposition](https://stats.stackexchange.com/questions/80548/explaining-the-beveridge-nelson-decomposition) can be written as a UC model that allows for correlation between the trend and cycle innovations. [Grant and Chan (2016)](http://www.joshuachan.org/papers/output-gap-2M.pdf) show that the popular [Hodrick-Prescott (HP) filter](https://en.wikipedia.org/wiki/Hodrick%E2%80%93Prescott_filter) can be written as a UC model in which the cyclical components are serially independent --- an assumption that is rejected by the data. 
This is one of many reasons [why you should never use the HP filter](https://www.mitpressjournals.org/doi/abs/10.1162/REST_a_00706).\n", 182 | "3. **Linear regression**: Notice that the local level model can be viewed as a linear regression with a time-varying intercept, i.e. $x_{1,t}=1$ for all dates $t=1,\\dots,T$, and $\\beta_1 = \\tau_t$. We can extend it to estimate.\n" 183 | ] 184 | }, 185 | { 186 | "cell_type": "code", 187 | "execution_count": 5, 188 | "metadata": {}, 189 | "outputs": [ 190 | { 191 | "data": { 192 | "image/svg+xml": [ 193 | "\n", 194 | "\n", 195 | "\n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | "\n", 200 | "\n", 203 | "\n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | "\n", 208 | "\n", 211 | "\n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | "\n", 216 | "\n", 219 | "\n", 222 | "\n", 225 | "\n", 228 | "\n", 231 | "\n", 234 | "\n", 237 | "\n", 240 | "\n", 243 | "\n", 246 | "\n", 249 | "\n", 252 | "\n", 255 | "\n", 258 | "\n", 261 | "\n", 264 | "\n", 267 | "\n", 270 | "\n", 273 | "\n", 276 | "\n", 279 | "\n", 282 | "\n", 325 | "\n", 368 | "\n", 371 | "\n", 374 | "\n", 377 | "\n", 380 | "\n" 381 | ] 382 | }, 383 | "execution_count": 5, 384 | "metadata": {}, 385 | "output_type": "execute_result" 386 | } 387 | ], 388 | "source": [ 389 | "## Simulate data from the UC model\n", 390 | "# Parameter values\n", 391 | "true_sig2 = 1; # measurement variance\n", 392 | "true_omeg2 = 1; # state variance\n", 393 | "true_tau0 = 1; # initial state\n", 394 | "T = 400; # no. of dates\n", 395 | "\n", 396 | "# Storage\n", 397 | "y = zeros(T); # storage vector\n", 398 | "true_tau = zeros(T); # storage vector\n", 399 | "\n", 400 | "# Initial conditions\n", 401 | "true_tau[1] = true_tau0 + rand(Normal(0,sqrt(true_omeg2))); # initial state\n", 402 | "y[1] = true_tau[1] + rand(Normal(0,sqrt(true_sig2))); # initial obs\n", 403 | "\n", 404 | "# Simulation\n", 405 | "for t = 2:T\n", 406 | " true_tau[t] = true_tau[t-1] + rand(Normal(0,sqrt(true_omeg2))); # simulate state\n", 407 | " y[t] = true_tau[t] + rand(Normal(0,sqrt(true_sig2))); # simulate obs\n", 408 | "end\n", 409 | "\n", 410 | "#Plot\n", 411 | "x = collect(1:1:T);\n", 412 | "data = y;\n", 413 | "plot(x,data, label=\"Simulated data\")\n", 414 | "plot!(x,true_tau, label=\"Simulated trend\")\n" 415 | ] 416 | }, 417 | { 418 | "cell_type": "markdown", 419 | "metadata": {}, 420 | "source": [ 421 | "## Estimation\n", 422 | "To estimation the model we use matrix notation to stack the observation and state equations over all dates, $t=1,\\dots,T$. \n", 423 | "\n", 424 | "The measurement equation is given by\n", 425 | "$$\n", 426 | "\\mathbf{y} = \\boldsymbol{\\tau} + \\boldsymbol{\\varepsilon}\n", 427 | "$$\n", 428 | "in which $\\mathbf{y}=[y_1,\\dots,y_T]'$, $\\boldsymbol{\\tau}=[\\tau_1,\\dots,\\tau_T]'$ and $\\boldsymbol{\\varepsilon}=[\\varepsilon_1,\\dots,\\varepsilon_T]'$ are each $T\\times 1$ vectors. 
This implies that \n",
"$$\n",
"\\mathbf{y}|\\boldsymbol{\\tau},\\sigma^2\\sim N(\\boldsymbol{\\tau},\\boldsymbol{\\Sigma})\n",
"$$\n",
"in which $\\boldsymbol{\\Sigma} = \\sigma^2\\mathbf{I}_T$.\n",
"\n",
"Next, to stack the state equation, note that\n",
"$$\n",
"\\begin{align}\n",
"\\tau_1 &= \\tau_0 + u_1\\\\\n",
"\\tau_2 &= \\tau_1 + u_2\\\\\n",
" &\\vdots\\\\\n",
"\\tau_T &= \\tau_{T-1} + u_T\\\\ \n",
"\\end{align}\n",
"$$\n",
"Taking the lagged states $\\tau_{t-1}$, $t=2,\\dots,T$, to the left-hand side gives\n",
"$$\n",
"\\begin{align}\n",
"\\tau_1 &= \\tau_0 + u_1\\\\\n",
"\\tau_2 - \\tau_1 &= u_2\\\\\n",
" &\\vdots\\\\\n",
"\\tau_T - \\tau_{T-1} &= u_T\\\\ \n",
"\\end{align}\n",
"$$\n",
"Writing this system in matrix form gives\n",
"$$\n",
"\\underset{\\mathbf{H}}{\\underbrace{\\begin{bmatrix}\n",
"1 & 0 & 0 & \\dots & 0\\\\\n",
"-1 & 1 & 0 & \\dots & 0\\\\\n",
"0 & -1 & 1 & \\dots & 0\\\\\n",
"0 & 0 & \\ddots & \\ddots & \\vdots\\\\\n",
"0 & 0 & 0 & -1 & 1\\\\\n",
"\\end{bmatrix}}}\n",
"\\underset{\\boldsymbol{\\tau}}{\\underbrace{\\begin{bmatrix}\n",
"\\tau_1\\\\\n",
"\\tau_2\\\\\n",
"\\tau_3\\\\\n",
"\\vdots\\\\\n",
"\\tau_T\n",
"\\end{bmatrix}}}\n",
"=\n",
"\\underset{\\tilde{\\boldsymbol{\\alpha}}}{\\underbrace{\\begin{bmatrix}\n",
"\\tau_0\\\\\n",
"0\\\\\n",
"0\\\\\n",
"\\vdots\\\\\n",
"0\n",
"\\end{bmatrix}}}\n",
"+\n",
"\\underset{\\mathbf{u}}{\\underbrace{\\begin{bmatrix}\n",
"u_1\\\\\n",
"u_2\\\\\n",
"u_3\\\\\n",
"\\vdots\\\\\n",
"u_T\n",
"\\end{bmatrix}}}\n",
"$$\n",
"or more compactly\n",
"$$\n",
"\\mathbf{H}\\boldsymbol{\\tau} = \\tilde{\\boldsymbol{\\alpha}} + \\mathbf{u}\n",
"$$\n",
"\n",
"Since [the determinant of a lower triangular matrix is equal to the product of its diagonal elements](https://en.wikipedia.org/wiki/Triangular_matrix#Properties), it follows that $|\\mathbf{H}|=1$, implying that $\\mathbf{H}$ is invertible. Thus, \n",
"$$\n",
"\\boldsymbol{\\tau} = \\boldsymbol{\\alpha} + \\mathbf{H}^{-1}\\mathbf{u}\n",
"$$\n",
"in which $\\boldsymbol{\\alpha} = \\mathbf{H}^{-1}\\tilde{\\boldsymbol{\\alpha}}$. This implies that \n",
"$$\\boldsymbol{\\tau}|\\omega^2,\\tau_0 \\sim N(\\boldsymbol{\\alpha},\\boldsymbol{\\Omega})$$\n",
"in which $\\boldsymbol{\\Omega} = \\omega^2(\\mathbf{H}'\\mathbf{H})^{-1}$.\n"
]
},
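{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before moving on, it may help to see $\\mathbf{H}$ and $\\boldsymbol{\\Omega}$ for a small example. The minimal sketch below builds the difference matrix for $T=5$ and $\\omega^2=0.5$ (arbitrary illustrative values) and verifies that $|\\mathbf{H}|=1$.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Small-T illustration of the stacked state equation (illustrative values only)\n",
"using LinearAlgebra\n",
"T_small = 5;\n",
"omeg2_small = 0.5;\n",
"H_small = I(T_small) - diagm(-1 => ones(T_small-1)); # lower triangular difference matrix\n",
"println(det(H_small)) # equals 1, so H is invertible\n",
"Omega_small = omeg2_small*inv(H_small'*H_small) # covariance of tau given omega^2 and tau0\n"
]
},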
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Taken together, these equations imply that the probability model representation of the UC model is given by\n",
"$$\n",
"\\begin{align}\n",
"\\mathbf{y}|\\boldsymbol{\\tau},\\sigma^2 &\\sim N(\\boldsymbol{\\tau},\\boldsymbol{\\Sigma})\\\\\n",
"\\boldsymbol{\\tau}|\\omega^2,\\tau_0 &\\sim N(\\boldsymbol{\\alpha},\\boldsymbol{\\Omega})\n",
"\\end{align}\n",
"$$\n",
"\n",
"Estimating the UC model is therefore the same as estimating the parameters of two multivariate normal distributions with unknown mean and covariance. The only conceptual difference is that the mean of the distribution is an unobserved state as opposed to a parameter. Nonetheless, we can still estimate the states by sampling them in the same manner.\n",
"\n",
"### Likelihood\n",
"Using the probability model representation, the likelihood is given by\n",
"$$\n",
"p(\\mathbf{y}|\\boldsymbol{\\tau},\\sigma^2) = (2\\pi\\sigma^2)^{-\\frac{T}{2}}\\exp(-\\frac{1}{2\\sigma^2}(\\mathbf{y}-\\boldsymbol{\\tau})'(\\mathbf{y}-\\boldsymbol{\\tau}))\n",
"$$\n",
"\n",
"### Priors\n",
"We assume the following independent prior distributions\n",
"1. $\\tau_0\\sim N(m_0,v_0)$\n",
"2. $\\sigma^2\\sim IG(\\nu_{0,\\sigma},S_{0,\\sigma})$\n",
"3. $\\omega^2\\sim IG(\\nu_{0,\\omega},S_{0,\\omega})$\n",
"\n",
"### Posterior\n",
"We will use a 4-block Gibbs sampler to simulate from the joint posterior distribution $p(\\boldsymbol{\\tau},\\sigma^2,\\omega^2,\\tau_0|\\mathbf{y})$ which cycles through:\n",
"1. $p(\\boldsymbol{\\tau}|\\mathbf{y},\\sigma^2,\\omega^2,\\tau_0)$\n",
"2. $p(\\sigma^2|\\mathbf{y},\\boldsymbol{\\tau},\\omega^2,\\tau_0)$\n",
"3. $p(\\omega^2|\\mathbf{y},\\boldsymbol{\\tau},\\sigma^2,\\tau_0)\\propto p(\\boldsymbol{\\tau}|\\omega^2,\\tau_0)p(\\omega^2)$\n",
"4. $p(\\tau_0|\\mathbf{y},\\boldsymbol{\\tau},\\sigma^2,\\omega^2)\\propto p(\\boldsymbol{\\tau}|\\omega^2,\\tau_0)p(\\tau_0)$\n",
"\n",
"#### 1. Sampling $\\boldsymbol{\\tau}$\n",
"Note that \n",
"$$\n",
"p(\\boldsymbol{\\tau}|\\mathbf{y},\\sigma^2,\\omega^2,\\tau_0)\\propto p(\\mathbf{y}|\\boldsymbol{\\tau},\\sigma^2)p(\\boldsymbol{\\tau}|\\omega^2,\\tau_0)\n",
"$$\n",
"which is the product of two multivariate normal densities. Using results from the linear regression model, we know that\n",
"$$\n",
"\\boldsymbol{\\tau}|\\mathbf{y},\\sigma^2,\\omega^2,\\tau_0\\sim N(\\hat{\\boldsymbol{\\tau}},\\mathbf{D}_{\\tau}^{-1})\n",
"$$\n",
"in which \n",
"$\\hat{\\boldsymbol{\\tau}}=\\mathbf{D}_{\\tau}^{-1}(\\boldsymbol{\\Sigma}^{-1}\\mathbf{y} + \\boldsymbol{\\Omega}^{-1}\\boldsymbol{\\alpha})$ and $\\mathbf{D}_{\\tau}=\\boldsymbol{\\Sigma}^{-1}+\\boldsymbol{\\Omega}^{-1}$.\n",
"\n",
"#### 2. Sampling $\\sigma^2$\n",
"$$\n",
"p(\\sigma^2|\\mathbf{y},\\boldsymbol{\\tau},\\omega^2,\\tau_0)\\propto p(\\mathbf{y}|\\boldsymbol{\\tau},\\sigma^2)p(\\sigma^2)\n",
"$$\n",
"which is the product of a multivariate normal density and an inverse-Gamma density. Using results from the linear regression model, we know that\n",
"$$\n",
"\\sigma^2|\\mathbf{y},\\boldsymbol{\\tau},\\omega^2,\\tau_0\\sim IG(\\nu_{\\sigma},S_{\\sigma})\n",
"$$\n",
"in which \n",
"$\\nu_{\\sigma} = \\nu_{0,\\sigma}+\\frac{T}{2}$ and $S_{\\sigma} = S_{0,\\sigma} + \\frac{1}{2}(\\mathbf{y}-\\boldsymbol{\\tau})'(\\mathbf{y}-\\boldsymbol{\\tau})$.\n",
"\n",
"#### 3. Sampling $\\omega^2$\n",
"$$\n",
"p(\\omega^2|\\mathbf{y},\\boldsymbol{\\tau},\\sigma^2,\\tau_0)\\propto p(\\boldsymbol{\\tau}|\\omega^2,\\tau_0)p(\\omega^2)\n",
"$$\n",
"which is the product of a multivariate normal density and an inverse-Gamma density. 
Using results from the linear regression model, we know that\n",
"$$\n",
"\\omega^2|\\mathbf{y},\\boldsymbol{\\tau},\\sigma^2,\\tau_0\\sim IG(\\nu_{\\omega},S_{\\omega})\n",
"$$\n",
"in which \n",
"$\\nu_{\\omega} = \\nu_{0,\\omega}+\\frac{T}{2}$ and $S_{\\omega} = S_{0,\\omega} + \\frac{1}{2}(\\boldsymbol{\\tau}-\\boldsymbol{\\alpha})'\\mathbf{H}'\\mathbf{H}(\\boldsymbol{\\tau}-\\boldsymbol{\\alpha})$.\n",
"\n",
"#### 4. Sampling $\\tau_0$\n",
"Recall that the initial condition $\\tau_0$ only appears in the first state equation\n",
"$$\n",
"\\tau_1 = \\tau_0 + u_1, \\quad u_1 \\sim N(0,\\omega^2)\n",
"$$\n",
"This implies that $\\tau_1\\sim N(\\tau_0,\\omega^2)$. The conditional posterior distribution \n",
"$$\n",
"p(\\tau_0|\\mathbf{y},\\boldsymbol{\\tau},\\sigma^2,\\omega^2)\\propto p(\\tau_1|\\omega^2,\\tau_0)p(\\tau_0)\n",
"$$\n",
"is the product of two univariate normal distributions. Using results from the worked example on the AR with drift model, it follows that\n",
"$$\n",
"\\tau_0|\\mathbf{y},\\boldsymbol{\\tau},\\sigma^2,\\omega^2\\sim N(\\hat{\\tau}_0,D_{\\tau_0})\n",
"$$\n",
"in which \n",
"$\\hat{\\tau}_0={D}_{\\tau_0}(\\frac{\\tau_1}{\\omega^2} + \\frac{m_0}{v_0})$ and ${D}_{\\tau_0}=(\\frac{1}{\\omega^2}+\\frac{1}{v_0})^{-1}$.\n",
"\n",
"### Computational points\n",
"#### Hyperparameters\n",
"The hyperparameters for the initial values are relatively unimportant provided that the variance $v_0$ is not extremely small (informative). In practice, the appropriate size will depend on the scale of the data, but a unit variance is the default option.\n",
"\n",
"On the other hand, the state variance $\\omega^2$ controls the smoothness of the trend component and the hyperparameters can be chosen to reflect the desired smoothness. While setting $\\nu_{0,\\omega}=3$ is reasonable, in our experience, the scale parameter $S_{0,\\omega} = \\mathbb{E}[\\omega^2](\\nu_{0,\\omega}-1)$ can greatly influence the results. Choosing $\\mathbb{E}[\\omega^2]$ to be a small number generally results in a smooth, but relatively flat trend. In contrast, choosing $\\mathbb{E}[\\omega^2]$ to be large generally results in the trend exhibiting substantial time variation. This is because the state variance is inferred from the unobserved states $\\tau_t$, which may carry a weak signal. With this in mind, [Amir-Ahmadi, Matthes and Wan (2020)](https://amstat.tandfonline.com/doi/full/10.1080/07350015.2018.1459302?casa_token=RG6wJLB-1vYAAAAA%3AM-2RxN3vw6Ui3hQ3HRTeqiPht88X7_BuKo3qsu0q6hK9UQ0QRopO1RG9-co58tEp-NFWgmi6r1NF) have recently proposed the use of a hierarchical model to estimate those hyperparameters jointly with all other parameters in the model."
]
},
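{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a concrete illustration of the mapping just described, the short cell below computes the implied inverse-Gamma scale parameter for two hypothetical choices of $\\mathbb{E}[\\omega^2]$; the numbers are purely illustrative.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Implied IG scale parameters for two illustrative prior means of omeg2\n",
"nu0_omeg = 3; # shape hyperparameter\n",
"E_omeg2_smooth = 0.01; # small prior mean => smoother, flatter trend\n",
"E_omeg2_loose = 1.0; # large prior mean => more time variation in the trend\n",
"S0_smooth = E_omeg2_smooth*(nu0_omeg-1); # scale implied by the smooth choice\n",
"S0_loose = E_omeg2_loose*(nu0_omeg-1); # scale implied by the loose choice\n",
"println((S0_smooth, S0_loose))\n"
]
},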
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# Estimate UC model\n",
"## Posterior analysis\n",
"# Data\n",
"y = data; # Observations\n",
"T = size(y,1); # No. of dates\n",
"\n",
"# Controls\n",
"nburn = 1000;\n",
"ndraws = nburn + 1000;\n",
"\n",
"# Prior for sig2\n",
"pri_nu1 = 3;\n",
"pri_S1 = 1*(pri_nu1-1); # sets E(pri_sig2) = 1\n",
"pri_sig2 = InverseGamma(pri_nu1,pri_S1);\n",
"\n",
"# Prior for omeg2\n",
"pri_nu2 = 3;\n",
"pri_S2 = 1*(pri_nu2-1); # sets E(pri_omeg2) = 1\n",
"pri_omeg2 = InverseGamma(pri_nu2,pri_S2);\n",
"\n",
"# Prior for tau0\n",
"pri_m = 0;\n",
"pri_v = 100;\n",
"\n",
"# Storage\n",
"s_tau = zeros(ndraws-nburn,T);\n",
"s_sig2 = zeros(ndraws-nburn,1);\n",
"s_omeg2 = zeros(ndraws-nburn,1);\n",
"s_tau0 = zeros(ndraws-nburn,1);\n",
"\n",
"# Deterministic terms in posterior\n",
"post_nu1 = pri_nu1 + T/2;\n",
"post_nu2 = pri_nu2 + T/2;\n",
"\n",
"# Difference matrix\n",
"H = I(T) - diagm(-1 => ones(T-1));\n",
"HH = H'*H;\n",
"\n",
"# Gibbs Sampler\n",
"let \n",
"# Initial values for the chain\n",
"MC_sig2 = 1;\n",
"MC_omeg2 = 0.1;\n",
"MC_tau0 = 0;\n",
"\n",
"for loop in 1:ndraws\n",
"# Local definitions to speed up code\n",
"alp = ones(T)*MC_tau0;\n",
"invSig = I(T)/MC_sig2;\n",
"invOmega = HH/MC_omeg2;\n",
"\n",
"# Draw tau\n",
"    post_invD = invSig + invOmega;\n",
"    post_tauhat = post_invD\\(invSig*y + invOmega*alp);\n",
"    MC_tau = post_tauhat + transpose(cholesky(Hermitian(post_invD)).L)\\rand(Normal(0,1),T);\n",
"\n",
"# Draw sig2\n",
"    post_S1 = pri_S1 + 0.5*(y-MC_tau)'*(y-MC_tau);\n",
"    MC_sig2 = rand(InverseGamma(post_nu1,post_S1));\n",
"\n",
"# Draw omeg2\n",
"    post_S2 = pri_S2 + 0.5*(MC_tau-alp)'*HH*(MC_tau-alp);\n",
"    MC_omeg2 = rand(InverseGamma(post_nu2,post_S2));\n",
"\n",
"# Draw tau0\n",
"    post_v = 1/(1/MC_omeg2 + 1/pri_v);\n",
"    post_m = post_v*(MC_tau[1]/MC_omeg2 + pri_m/pri_v);\n",
"    MC_tau0 = post_m + rand(Normal(0,sqrt(post_v)));\n",
"\n",
"# Store\n",
"    if loop > nburn\n",
"        count_loop = loop - nburn;\n",
"        s_tau[count_loop,:] = transpose(MC_tau);\n",
"        s_sig2[count_loop] = MC_sig2;\n",
"        s_omeg2[count_loop] = MC_omeg2;\n",
"        s_tau0[count_loop] = MC_tau0;\n",
"    end\n",
"end\n",
"end\n"
]
},
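{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional extra (not part of the original notes), we can summarise posterior uncertainty around the trend by computing pointwise quantiles of the stored draws. The 16th and 84th percentiles used below are an arbitrary choice corresponding to a 68% credible band.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Pointwise posterior credible bands for the trend (illustrative sketch)\n",
"using Statistics\n",
"tau_med = [quantile(s_tau[:,t], 0.5) for t in 1:T]; # posterior median\n",
"tau_lo = [quantile(s_tau[:,t], 0.16) for t in 1:T]; # lower band\n",
"tau_hi = [quantile(s_tau[:,t], 0.84) for t in 1:T]; # upper band\n",
"plot(1:T, tau_med, label=\"Posterior median trend\")\n",
"plot!(1:T, tau_lo, label=\"16th percentile\", linestyle=:dash)\n",
"plot!(1:T, tau_hi, label=\"84th percentile\", linestyle=:dash)\n"
]
},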
pdf\")\n", 687 | "plot!([true_tao0], seriestype=\"vline\", label=\"True value\")\n", 688 | "p2c = plot!([post_tao0], seriestype=\"vline\", label=\"MC mean\")\n", 689 | "\n", 690 | "plot(p2a,p2b,p2c,layout = (1,3))\n", 691 | "\n" 692 | ] 693 | }, 694 | { 695 | "cell_type": "code", 696 | "execution_count": null, 697 | "metadata": {}, 698 | "outputs": [], 699 | "source": [ 700 | "# Plot states\n", 701 | "x = collect(1:T);\n", 702 | "plot(x,post_tau,label =\"Estimated trend\")\n", 703 | "plot!(x,true_tau,label =\"True trend\")\n" 704 | ] 705 | }, 706 | { 707 | "cell_type": "markdown", 708 | "metadata": {}, 709 | "source": [ 710 | "# Conclusion\n", 711 | "State space models are generally useful because they provide a summary of the systems dynamics. From a practical perspective, they facilitate the estimation of models with unobserved components, such as the local level model. The local level model can be viewed as a regression with a time-varying intercept. More generally, [Canova (1993)](https://www.sciencedirect.com/science/article/abs/pii/S0165188906800114) extends this set-up to a regression with time-varying coefficients. [Cogley and Sargent (2001)](https://www.journals.uchicago.edu/doi/pdfplus/10.1086/654451) extend the VAR to have time-varying parameters. \n", 712 | "\n", 713 | "Estimating linear state space models is extremely similar to estimating the linear regression model. The only conceptual difference is that the mean of the observation equation contains unobserved states. Nonetheless, they both amount to estimating the unknown mean and covariance of a multivariate normal distribution. We can therefore estimate the states using a Gibbs sampler in which the state equation acts as a prior distribution. \n", 714 | "\n", 715 | "Note that this idea is not yet common practice. To estimate the unobserved states in linear state space models, people have traditionally relied on [*Kalman Filter*](https://en.wikipedia.org/wiki/Kalman_filter) based algorithms such as [Carter and Kohn (1994)](https://www.jstor.org/stable/2337125?seq=1) and [Durbin and Koopman (2002)](https://www.jstor.org/stable/4140605?seq=1). That being said, the idea of bypassing the Kalman filter in favor of directly sampling the states has become increasing popular. A highly influential paper in this literature is [Chan and Jeliazkov (2009)](http://www.joshuachan.org/papers/statespace1.pdf). In two empirical applications on macroeconomic data, they show how the precision sampler can be used to estimate (1) a time varying parameter VAR model and (2) a dynamic factor model. \n", 716 | "\n", 717 | "## Recommended reading\n", 718 | "1. For those looking to learn how to estimate a time-varying parameter VAR an excellent textbook treatment is provided in Chapter 8.2 of Joshua C. C. Chan's [Notes on Bayesian Macroeconometrics](http://joshuachan.org/notes_BayesMacro.html). \n", 719 | "2. Chan and Strachan (2020) have a manuscript on [Bayesian State Space Models in Macroeconometrics](http://www.joshuachan.org/papers/JES_StateSpaceModels.pdf)\n" 720 | ] 721 | } 722 | ], 723 | "metadata": { 724 | "kernelspec": { 725 | "display_name": "Julia 1.5.1", 726 | "language": "julia", 727 | "name": "julia-1.5" 728 | }, 729 | "language_info": { 730 | "file_extension": ".jl", 731 | "mimetype": "application/julia", 732 | "name": "julia", 733 | "version": "1.5.1" 734 | } 735 | }, 736 | "nbformat": 4, 737 | "nbformat_minor": 2 738 | } 739 | --------------------------------------------------------------------------------