├── LassoRidgeGraph.png ├── Lecture 11.pdf ├── Lecture 11.tex ├── Lecture_1.Rmd ├── Lecture_1.html ├── Lecture_1.pdf ├── Lecture_10.Rmd ├── Lecture_10.html ├── Lecture_10.pdf ├── Lecture_2.Rmd ├── Lecture_2.html ├── Lecture_2.pdf ├── Lecture_3.Rmd ├── Lecture_3.html ├── Lecture_3.pdf ├── Lecture_4.Rmd ├── Lecture_4.html ├── Lecture_4.pdf ├── Lecture_5.Rmd ├── Lecture_5.html ├── Lecture_5.pdf ├── Lecture_6.Rmd ├── Lecture_6.html ├── Lecture_6.pdf ├── Lecture_7.Rmd ├── Lecture_7.html ├── Lecture_7.pdf ├── Lecture_8.Rmd ├── Lecture_8.html ├── Lecture_8.pdf ├── Lecture_9.Rmd ├── Lecture_9.html ├── Lecture_9.pdf ├── PS1.Rmd ├── PS1.pdf ├── PS2.Rmd ├── PS2.html ├── PS2.pdf ├── PS3.Rmd ├── PS3.pdf ├── PS4.Rmd ├── PS4.pdf ├── PS5.Rmd ├── PS5.pdf ├── README.md ├── benchmark.R ├── bibliography.bib ├── syllabus.html ├── syllabus.md └── syllabus.pdf /LassoRidgeGraph.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MartinSpindler/Machine-Learning-in-Econometrics/2108b4e5c378b23e21a98ad976446b2a2603b512/LassoRidgeGraph.png -------------------------------------------------------------------------------- /Lecture 11.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MartinSpindler/Machine-Learning-in-Econometrics/2108b4e5c378b23e21a98ad976446b2a2603b512/Lecture 11.pdf -------------------------------------------------------------------------------- /Lecture 11.tex: -------------------------------------------------------------------------------- 1 | \documentclass{beamer} 2 | 3 | \usetheme{Warsaw} 4 | %\usetheme{Rochester} 5 | 6 | \usepackage[latin1]{inputenc} 7 | %\usepackage[ngerman]{babel} 8 | \usepackage{amsmath} 9 | \usepackage{amsfonts} 10 | \usepackage{amssymb} 11 | 12 | %%% Math symbols 13 | \newcommand{\argmax}{\operatornamewithlimits{arg\:max}} 14 | \newcommand{\argmin}{\operatornamewithlimits{arg\:min}} 15 | 16 | %%% other symbols 17 | \newcommand{\file}[1]{\hbox{\rm\texttt{#1}}} 18 | \newcommand{\stress}[1]{\textit{#1}} 19 | \newcommand{\booktitle}[1]{`#1'} %%' 20 | 21 | \newcommand{\bm}[1]{\mbox{\boldmath$#1$}} % bold greek letters in math mode 22 | \newcommand{\Varepsilon}{\bm{\Large \mbox{$\varepsilon$}}} 23 | \newcommand{\hats}[1]{#1_{\hat{s}}} 24 | \newcommand{\hatsm}{\hat{s}_m} 25 | \newcommand{\gammam}[1]{\bm{\gamma}^{[#1]}} 26 | \newcommand{\betam}[1]{\bm{\beta}^{[#1]}} 27 | \newcommand{\mbeta}{{\ensuremath{\boldsymbol{\beta}}}} 28 | \newcommand{\mgamma}{{\ensuremath{\boldsymbol{\gamma}}}} 29 | 30 | \AtBeginSection[]{\frame{\frametitle{Overview} \tableofcontents[current]}} 31 | 32 | \newcommand{\defin}[1]{\textit{\color{blue}#1}} 33 | 34 | % ========== Abk?rzungen ========== 35 | \newcommand{\N}{\mathbb{N}} 36 | \newcommand{\Z}{\mathbb{Z}} 37 | \newcommand{\Q}{\mathbb{Q}} 38 | \newcommand{\R}{\mathbb{R}} 39 | \newcommand{\C}{\mathbb{C}} 40 | 41 | \author[]{} 42 | \title[hdm]{Lecture 11 -- High-dimensional Microeconometric Models} 43 | 44 | \begin{document} 45 | \frame{\maketitle} 46 | 47 | \begin{frame} 48 | \tableofcontents 49 | \end{frame} 50 | 51 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 52 | \section{Introduction} 53 | 54 | \begin{frame}{Motivation} 55 | \begin{itemize} 56 | \item \textbf{Machine Learning}: Methods usually tailored for prediction. 57 | \item In \textbf{Economics / Econometrics} both prediction (stock market, demand, ...) but also learning of relations / causal inference is of interest. 
58 | \item Here: Focus on causal inference. 59 | \item Examples for causal inference: What is the effect of a job market programme on future job prospects? What is the effect of a price change? 60 | \item General: What is the effect of a certain treatment on a relevant outcome variable 61 | \end{itemize} 62 | \end{frame} 63 | 64 | 65 | \begin{frame}{Motivation} 66 | \begin{itemize} 67 | \item Typical problem in Economics: potential endogeneity of the treatment. 68 | \item: Potential source: optimizing behaviour of the individuals with regard to the outcome and unobserved heterogeneity. 69 | \item Possible Solutions: 70 | \begin{itemize} 71 | \item Instrumental Variable (IV) estimation 72 | \item Selection of controls 73 | \end{itemize} 74 | \item Additional challenge: high-dimensional setting with $p$ even larger than $n$ 75 | \end{itemize} 76 | \end{frame} 77 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 78 | \section{High-dimensional Instrumental Variable (IV) Setting} 79 | 80 | \begin{frame}{Estimation and Inference with Many Instruments} 81 | Focus discussion on a simple IV model 82 | 83 | \begin{eqnarray} 84 | y_i &=& d_i \alpha + \varepsilon,\\ 85 | d_i &=& g(z_i) + v_i, \mbox{(first stage)} 86 | \label{eq:} 87 | \end{eqnarray} 88 | with $\begin{pmatrix} \varepsilon_i \\ v_i \end{pmatrix} | z_i \sim \left( 0, \begin{pmatrix} 89 | \sigma^2_{\varepsilon} & \sigma_{\varepsilon v} \\ \sigma_{\varepsilon v} & \sigma^2_{v} 90 | \end{pmatrix}\right)$ 91 | \end{frame} 92 | 93 | \begin{frame} 94 | \begin{itemize} 95 | \item can have additional low-dimensional controls $w_i$ entering both 96 | equations -- assume these have been partialled out; also can 97 | have multiple endogenous variables; see references for details 98 | \item the main target is $\alpha$, and $g$ is the unspecified regression function 99 | = ?optimal instrument? 100 | \item We have either 101 | \begin{itemize} 102 | \item Many instruments. $x_i = z_i$ , or 103 | \item Many technical instruments. $x_i = P(z_i)$, e.g. polynomials, 104 | trigonometric terms. 105 | \end{itemize} 106 | \item where where the number of instruments $p$ is large, possibly much larger than $n$ 107 | \end{itemize} 108 | \end{frame} 109 | 110 | \begin{frame}{Inference in the IV Model} 111 | \begin{itemize} 112 | \item Assume approximate sparsity: 113 | \[ g(z_i) = E[d_i|z_i]= \underbrace{x_i'\beta_0}_{\text{sparse approximation}} + \underbrace{r_i}_{\text{approx error}} \] 114 | 115 | that is, optimal instrument is approximated by s (unknown) instruments, such that 116 | \[ s:= ||\beta_0||_0 \ll n, \sqrt{1/n \sum_{i=1}^n r_i^2} \leq \sigma_v \sqrt{\frac{s}{n}} \] 117 | \item We shall find these "effective" instruments amongst $x_i$ by Lasso and estimate the optimal instrument by Post-Lasso, $\hat{g}(z_i)=x_i' \hat{\beta}_{PL}$. 118 | \item Estimate $\alpha$ using the estimated optimal instrument via 2SLS 119 | \end{itemize} 120 | \end{frame} 121 | 122 | \begin{frame}{Example: Instrument Selection in Angrist Krueger Data} 123 | \begin{itemize} 124 | \item $y_i =$ wage 125 | \item $d_i$ = education (endogenous) 126 | \item $\alpha$ = returns to schooling 127 | \item $z_i=$ quarter of birth and controls (50 state of birth dummies and 7 128 | year of birth dummies) 129 | \item $x_i = P(z_i)$, includes $z_i$ and all interactions 130 | \item a very large list, $p = 1530$ 131 | \end{itemize} 132 | Using few instruments (3 quarters of birth) or many instruments 133 | (1530) gives big standard errors. 
So it seems a good idea to use 134 | instrument selection to see if can improve. 135 | \end{frame} 136 | 137 | \begin{frame}{AK Example} 138 | 139 | \begin{tabular}{lccc} 140 | \hline \hline 141 | Estimator & Instruments & Schooling Coef & Rob Std Error\\ \hline 142 | 2SLS &(3 IVs) 3 &.10 &.020\\ \hline 143 | 2SLS &(All IVs) 1530& .10& .042\\ \hline 144 | 2SLS &(LASSO IVs) 12& .10& .014\\ \hline \hline 145 | \end{tabular} 146 | 147 | Notes: 148 | \begin{itemize} 149 | \item About 12 constructed instruments contain nearly all information. 150 | \item Fuller's form of 2SLS is used due to robustness. 151 | \item The Lasso selection of instruments and standard errors are fully 152 | justified theoretically below 153 | \end{itemize} 154 | \end{frame} 155 | 156 | \begin{frame}{2SLS with Post-LASSO estimated Optimal IV} 157 | 2SLS with Post-LASSO estimated Optimal IV 158 | \begin{itemize} 159 | \item In step one, estimate optimal instrument $\hat{g}(z_i) = x_i' \hat{\beta} $using 160 | Post-LASSO estimator. 161 | \item In step two, compute the 2SLS using optimal instrument as IV, 162 | \[ \hat{\alpha}= \left[ 1/n \sum_{i=1}^n (d_i\hat{g}(z_i)') \right]^{-1} 1/n \sum_{i=1}^n [\hat{g}(z_i)y_i] \] 163 | \end{itemize} 164 | \end{frame} 165 | 166 | \begin{frame}{IV Selection: Theoretical Justification} 167 | Theorem (2SLS with LASSO-selected IV) 168 | 169 | Under practical regularity conditions, if the optimal instrument is 170 | sufficient sparse, namely $s^2 \log^2 p = o(n)$, and is strong, namely 171 | $|E[d_i g(z_i)]|$ is bounded away from zero, we have that 172 | \[ \sigma_n^{-1} \sqrt{n} (\hat{\alpha}-\alpha) \rightarrow_d N(0,1) \] 173 | where $\sigma^2_n$ is the standard White?s robust formula for the variance of 174 | 2SLS. The estimator is semi-parametrically efficient under 175 | homoscedasticity. 176 | 177 | \begin{itemize} 178 | {\tiny 179 | \item Ref: Belloni, Chen, Chernozhukov, and Hansen (Econometrica, 2012) 180 | for a general statement. 181 | \item A weak-instrument robust procedure is also available: the sup-score 182 | test 183 | \item Key point: "Selection mistakes" are asymptotically negligible due to 184 | "low-bias" property of the estimating equations, which we shall discuss 185 | later.} 186 | \end{itemize} 187 | \end{frame} 188 | 189 | \begin{frame}{Example of IV: Eminent Domain} 190 | 191 | Estimate economic consequences of government take-over of 192 | property rights from individuals 193 | \begin{itemize} 194 | \item $y_i$ = economic outcome in a region i, e.g. housing price index 195 | \item $d_i$ = indicator of a property take-over decided in a court of law, 196 | by panels of 3 judges 197 | \item $x_i$ = demographic characteristics of judges, that are randomly 198 | assigned to panels: education, political affiliations, age, 199 | experience etc. 200 | \item $f_i = x_i$ + various interactions of components of $x_i$ , 201 | \item a very large list $p = p(f_i) = 344$ 202 | \end{itemize} 203 | \end{frame} 204 | 205 | \begin{frame}{Example continued} 206 | \begin{itemize} 207 | \item Outcome is log of housing price index; endogenous variable is 208 | government take-over 209 | \item Can use 2 elementary instruments, suggested by real lawyers 210 | (Chen and Yeh, 2010) 211 | \item Can use all 344 instruments and select approximately the right 212 | set using LASSO. 
213 | \end{itemize} 214 | 215 | \begin{tabular}{lccc} 216 | \hline \hline 217 | Estimator &Instruments &Price Effect &Rob Std Error\\ \hline 218 | 2SLS &2& .07& .032\\ \hline 219 | 2SLS / LASSO IVs& 4& .05 &.017\\ \hline \hline 220 | \end{tabular} 221 | \end{frame} 222 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 223 | \section{Treatment Effects in a Partially Linear Model} 224 | 225 | \begin{frame} 226 | Example: (Exogenous) Cross-Country Growth Regression. 227 | \begin{itemize} 228 | \item Relation between growth rate and initial per capita GDP, 229 | conditional on covariates, describing institutions and 230 | technological factors: 231 | 232 | \[ \underbrace{\text{GrowRate}}_{y_i} = \beta_0 + \underbrace{\alpha}_{\text{ATE}} \underbrace{\log (\text{GDP})}_{d_i} + \sum_{j=1}^p \beta_j x_{ij} + \varepsilon_i \] 233 | where the model is exogenous, 234 | \[ E[\varepsilon_i| d_i, x_i] = 0. \] 235 | \item Test the convergence hypothesis -- $\alpha < 0$ -- poor countries catch 236 | up with richer countries, conditional on similar institutions etc. 237 | Prediction from the classical Solow growth model. 238 | \item In Barro-Lee data, we have p = 60 covariates, n = 90 239 | observations. Need to do selection. 240 | \end{itemize} 241 | \end{frame} 242 | 243 | 244 | 245 | \begin{frame}{How to perform selection?} 246 | \begin{itemize} 247 | \item (Don't do it!) Naive/Textbook selection 248 | \begin{enumerate} 249 | \item Drop all $x_{ij}$s that have small coefficients, using model selection 250 | devices (classical such as t-tests or modern) 251 | \item Run OLS of yi on di and selected regressors. 252 | \end{enumerate} 253 | 254 | Does not work because fails to control omitted variable bias. 255 | (Leeb and P\"otscher, 2009). 256 | \item We propose Double Selection approach: 257 | \begin{enumerate} 258 | \item Select controls $x_{ij}$s that predict $y_i$ . 259 | \item Select controls $x_{ij}$s that predict $d_i$ . 260 | \item Run OLS of $y_i$ on $d_i$ and the union of controls selected in steps 1 261 | and 2. 262 | \end{enumerate} 263 | \item The additional selection step controls the omitted variable bias. 264 | \item We find that the coefficient on lagged GDP is negative, and the 265 | confidence intervals exclude zero. 266 | \end{itemize} 267 | \end{frame} 268 | 269 | \begin{frame} 270 | \begin{tabular}{lcc} 271 | \hline \hline 272 | Method &effect &Std. Err.\\ \hline 273 | Barro-Lee (Economic Reasoning) &$-0.02$& $0.005$\\ \hline 274 | All Controls ($n = 90$, $p = 60$)& $-0.02$& $0.031$\\ \hline 275 | Post-Naive Selection &$-0.01$ &$0.004$\\ \hline 276 | Post-Double-Selection &$-0.03$ &$0.011$\\ \hline \hline 277 | \end{tabular} 278 | 279 | \begin{itemize} 280 | \item Double-Selection finds 8 controls, including trade-openness and 281 | several education variables. 282 | \item Our findings support the conclusions reached in Barro and Lee 283 | and Barro and Sala-i-Martin. 284 | \item Using all controls is very imprecise. 285 | \item Using naive selection gives a biased estimate for the speed of 286 | convergence. 
287 | \end{itemize} 288 | \end{frame} 289 | 290 | \begin{frame}{TE in a PLM} 291 | Partially linear regression model (exogenous) 292 | \[ y_i = d_i \alpha_0 + g(z_i) + \xi_i, E[\xi_i|z_i, d_i ] = 0, \] 293 | \begin{itemize} 294 | \item $y_i$ is the outcome variable 295 | \item $d_i$ is the policy/treatment variable whose impact is $\alpha_0$ 296 | \item $z_i$ represents confounding factors on which we need to condition 297 | \end{itemize} 298 | For us the auxiliary equation will be important: 299 | \[ d_i = m(z_i) + v_i, E[v_i | z_i ] = 0 \] 300 | \begin{itemize} 301 | \item $m$ summarizes the confounding effect and creates omitted 302 | variable biases. 303 | \end{itemize} 304 | \end{frame} 305 | 306 | \begin{frame}{TE in a PLM} 307 | Use many control terms $x_i = P(z_i) \in \mathbb{R}^p$ to approximate $g$ and $m$ 308 | \[ y_i = d_i\alpha_0 + x_i' \beta_{g0} + r_{gi} + \xi_i, d_i=x_i' \beta_{m0} + r_{mi} + v_i\] 309 | \begin{itemize} 310 | \item Many controls. $x_i = z_i$. 311 | \item Many technical controls. $x_i = P(z_i)$, e.g. polynomials, 312 | trigonometric terms. 313 | \end{itemize} 314 | Key assumption: $g$ and $m$ are approximately sparse 315 | \end{frame} 316 | 317 | \begin{frame} 318 | \[ y_i = d_i \alpha_0 + x_i' \beta_{g0} + r_i + \xi_i, E[\xi_i| z_i, d_i ] = 0, \] 319 | Naive/Textbook Inference: 320 | \begin{enumerate} 321 | \item Select control terms by running Lasso (or variants) of $y_i$ on $d_i$ 322 | and $x_i$ 323 | \item Estimate $\alpha_0$ by least squares of $y_i$ on $d_i$ and selected controls, 324 | apply standard inference 325 | \end{enumerate} 326 | However, this naive approach has caveats: 327 | \begin{itemize} 328 | \item Relies on perfect model selection and exact sparsity. Extremely 329 | unrealistic. 330 | \item Easily and badly breaks down both theoretically (Leeb and 331 | P\"otscher, 2009) and practically. 332 | \end{itemize} 333 | \end{frame} 334 | 335 | \begin{frame}{(Post) Double Selection Method} 336 | To define the method, write the reduced form (substitute out $d_i$) 337 | \begin{eqnarray} 338 | y_i &=& x_i' \bar{\beta}_0 + \bar{r}_i + \bar{\xi_i},\\ 339 | d_i &=& x_i' \beta_{m0} + r_{mi} + v_i, 340 | \end{eqnarray} 341 | 342 | \begin{enumerate} 343 | \item (Direct) Let $\hat{I}_1$ be the controls selected by Lasso of $y_i$ on $x_i$. 344 | \item (Indirect) Let $\hat{I}_2$ be the controls selected by Lasso of $d_i$ on $x_i$. 345 | \item (Final) Run least squares of $y_i$ on $d_i$ and the union of selected controls: 346 | \end{enumerate} 347 | \[ (\tilde{\alpha}, \tilde{\beta}) = \argmin_{\alpha, \beta} \left\{ 1/n \sum_{i=1}^n [(y_i - d_i \alpha - x_i' \beta)^2]: \beta_j=0, \forall j \notin \hat{I}=\hat{I}_1 \cup \hat{I}_2 \right\}. \] 348 | 349 | This is the post-double-selection estimator. 350 | \begin{itemize} 351 | \item Belloni, Chernozhukov, Hansen (World Congress, 2010) 352 | \item Belloni, Chernozhukov, Hansen (ReStud, 2013) 353 | \end{itemize} 354 | \end{frame} 355 | 356 | \begin{frame}{Intuition} 357 | \begin{itemize} 358 | \item The double selection method is robust to moderate selection 359 | mistakes. 360 | \item The Indirect Lasso step -- the selection among the controls $x_i$ 361 | that predict $d_i$ -- creates this robustness. It finds controls whose 362 | omission would lead to a "large" omitted variable bias, and 363 | includes them in the regression. 364 | \item In essence the procedure is a selection version of the Frisch-Waugh 365 | procedure for estimating linear regression.
366 | \end{itemize} 367 | \end{frame} 368 | 369 | \begin{frame}{More Intuition} 370 | \small 371 | Think about omitted variables bias in case with one treatment (d) and one 372 | regressor (x): 373 | \[ y_i = \alpha d_i + \beta x_i + \xi_i, d_i = x_i + v_i \] 374 | If we drop $x_i$ , the short regression of $y_i$ on $d_i$ gives 375 | \[ \sqrt{n} (\hat{\alpha} - \alpha) = \text{good term} + \sqrt{n} (D'D/n)^{-1}(X'X/n) (\gamma \beta).\] 376 | \begin{itemize} 377 | \item the good term is asymptotically normal, and we want 378 | $ \sqrt{n} \gamma \beta \rightarrow 0.$ 379 | \item naive selection drops $x_i$ if $\beta= O(\sqrt{\log n/n})$, but 380 | $ \sqrt{n} \gamma \sqrt{\log n /n} \rightarrow \infty $ 381 | \item double selection drops $x_i$ only if both $\beta= O(\sqrt{\log n/n})$ and $\gamma= O(\sqrt{\log n/n})$, that is, if 382 | \[ \sqrt{n} \gamma \beta = O((\log n)/\sqrt{n}) \rightarrow 0. \] 383 | \end{itemize} 384 | \end{frame} 385 | 386 | \begin{frame}{Main Result} 387 | Theorem (Inference on a Coefficient in Regression) 388 | 389 | Uniformly within a rich class of models, in which g and m admit a 390 | sparse approximation with $s^2 \log^2(p \vee n)/n \rightarrow 0$ and other practical 391 | conditions holding, 392 | \[ \sigma_n^{-1} \sqrt{n} (\hat{\alpha}-\alpha_0) \rightarrow_d N(0,1) \] 393 | $\sigma_n^{-1}$ is Robinson's formula for variance of LS in a partially linear 394 | model. Under homoscedasticity, semi-parametrically efficient. 395 | 396 | Model selection mistakes are asymptotically negligible due to 397 | double selection. 398 | \end{frame} 399 | 400 | \begin{frame}{Example: Effect of Abortion on Murder Rates} 401 | Estimate the consequences of abortion rates on crime, Donohue and 402 | Levitt (2001) 403 | \[ y_{it} = \alpha d_{it} + x_{it} + \xi_{it} \] 404 | \begin{itemize} 405 | \item $y_{it} =$ change in crime-rate in state i between t and t - 1, 406 | \item $d_{it} =$ change in the (lagged) abortion rate, 407 | \item $x_{it} =$ controls for time-varying confounding state-level factors, 408 | including initial conditions and interactions of all these variables 409 | with trend and trend-squared 410 | \item p = 251, n = 576 411 | \end{itemize} 412 | \end{frame} 413 | 414 | \begin{frame}{Example continued} 415 | Double selection: 8 controls selected, including initial conditions 416 | and trends interacted with initial conditions 417 | 418 | \begin{tabular}{lcc} 419 | \hline \hline 420 | Estimator &Effect& Std. Err.\\ \hline 421 | DS &$-0.204$& $0.068$\\ \hline 422 | Post-Single Selection& $- 0.202$& $0.051$\\ \hline 423 | Post-Double-Selection& $-0.166$ &$0.216$\\ \hline \hline 424 | \end{tabular} 425 | \end{frame} 426 | 427 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 428 | \section{Heterogenous Treatment Effects} 429 | 430 | \begin{frame}{Heterogenous Treatment Effects} 431 | \small 432 | \begin{itemize} 433 | \item Here $d_i$ is binary, indicating the receipt of the treatment, 434 | \item Drop partially linear structure; instead assume $d_i$ is fully 435 | interacted with all other control variables: 436 | \[ y_i = d_i g(1, z_i) + (1 - d_i )g(0, z_i ) + \xi_i, E[\xi_i| d_i, z_i ] = 0 \] 437 | \[ d_i = m(z_i) + u_i, E[u_i| z_i] = 0 \text{(as before)} \] 438 | \item Target parameter. Average Treatment Effect: 439 | \[ \alpha_0 = E[g(1, z_i) - g(0, z_i)] \] 440 | \item Example. 
$d_i=$ 401(k) eligibility, $z_i=$ characteristics of the 441 | worker/firm, $y_i=$ net savings or total wealth, $\alpha_0 =$ the average 442 | impact of 401(k) eligibility on savings. 443 | \end{itemize} 444 | \end{frame} 445 | 446 | \begin{frame}{Heterogenous Treatment Effects} 447 | \small 448 | An appropriate $M_i$ is given by Hahn's (1998) efficient score 449 | \[ M_i(\alpha, g, m) = \left( \frac{d_i(y_i-g(1,z_i))}{m(z_i)} - \frac{(1-d_i)(y_i-g(0,z_i))}{1-m(z_i)}+ g(1,z_i) - g(0,z_i) \right) - \alpha \] 450 | which is "immunized" against perturbations in $g_0$ and $m_0$: 451 | \[ \frac{\partial}{\partial g} E[M_i (\alpha_0, g, m_0)]|_{g=g_0} = 0, 452 | \frac{\partial}{\partial m} E[M_i (\alpha_0, g_0, m)]|_{m=m_0} = 0. \] 453 | Hence the post-double selection estimator for $\alpha$ is given by 454 | \[ \tilde{\alpha} = 1/N \sum_{i=1}^N \left( \frac{d_i(y_i-\hat{g}(1,z_i))}{\hat{m}(z_i)} - \frac{(1-\hat{d}_i)(y_i-\hat{g}(0,z_i))}{1-\hat{m}(z_i)}+ \hat{g}(1,z_i) - \hat{g}(0,z_i) \right)\] 455 | where we estimate g and m via post- selection (Post-Lasso) 456 | methods. 457 | \end{frame} 458 | 459 | \begin{frame}{Heterogenous Treatment Effects} 460 | \small 461 | Theorem (Inference on ATE) 462 | 463 | Uniformly within a rich class of models, in which g and m admit a 464 | sparse approximation with $s^2 \log^2(p \vee n)/n \rightarrow 0$ and other practical 465 | conditions holding, 466 | \[ \sigma_n^{-1} \sqrt{n} (\tilde{\alpha} - \alpha_0) \rightarrow_d N(0,1) \] 467 | where $\sigma^{-1}_n = E[M^2_i (\alpha_0, g_0, m_0)].$ 468 | Moreover, $\tilde{\alpha}$ is semi-parametrically efficient for $\alpha_0$. 469 | \begin{itemize} 470 | \item Model selection mistakes are asymptotically negligible due to the 471 | use of "immunizing" moment equations. 472 | \item Ref. Belloni, Chernozhukov, Hansen, Inference on TE after selection amongst 473 | high-dimensional controls (Restud, 2013). 474 | \end{itemize} 475 | \end{frame} 476 | %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 477 | 478 | 479 | \end{document} 480 | -------------------------------------------------------------------------------- /Lecture_1.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 1 -- Introduction" 3 | author: "Martin Spindler" 4 | date: '`r format(Sys.Date())`' 5 | output: 6 | ioslides_presentation: null 7 | beamer_presentation: default 8 | mathjax: local 9 | self_contained: no 10 | --- 11 | 12 | ```{r setup, include=FALSE} 13 | knitr::opts_chunk$set(echo = FALSE) 14 | ``` 15 | 16 | ## Defintions | Taxonomy of Data Sets 17 | 18 | - Larger data become more and more available. 19 | - $n$: number of observations; $p$: number of variables 20 | - "Tall data": big $n$, small $p$ 21 | computational demanding 22 | - "High-dimensional data" or "wide data": $n << p$ or small $n$, big $p$ 23 | non-standard theory, computational demanding 24 | - "Big Data": big $n$, small / big $p$ 25 | - Important concept: MapReduce and its software implementation [hadoop](https://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/), in particular for tall data 26 | 27 | ## Defintions | Input and Output Variables 28 | - Inputs $X$: measured or present variables. Synonyms: predictors, features or independent variables 29 | - These inputs have some influence on one or more outputs. 30 | - Output variable $Y$ is also called response or dependent variable or outcome variables. 
31 | - $$ Y = f(X) + \varepsilon $$ 32 | - $f$ unknown function, $X=(X_1, \ldots, X_p)$ $p$ predictor variables, $\varepsilon$ random error term 33 | 34 | ## Defintions | Supervised vs Unsupervised Learning 35 | - Supervised Learning: Presence of the outcome variable to guide the learning process 36 | Goal: e.g. to use the inputs to predict the values of the outputs 37 | Methods: regression methods (linear, lasso, ridge, etc.), bagging, trees, random forests, ensemble learning, ... 38 | - Unsupervised Learning: only features are observed, no measurements of the outcome variable 39 | Goal: insights how the data are organized or clustered 40 | Methods: Association Rules, PCA, cluster analysis 41 | 42 | ## Definitions | Regression vs Classification 43 | - Input variables $X$ 44 | - Quantitative output $Y$: *regression* 45 | - Qualitative output (categorical / discrete) G: *classification* 46 | - Also input variables can also vary in measurement type. 47 | - Coding of qualitative variables: $0/1$, $-1/+1$, or in general case via dummy variables. 48 | 49 | ## Basic Concepts | Prediction vs. Inference 50 | 51 | - **Prediction**: Given inputs $X$, but not the output $Y$, we want to predict $Y$: 52 | $$ \hat{Y}=\hat{f}(X) $$ 53 | We are interested in high quality predictions and not in the function $f$ which is more or less considered as a black box. 54 | 55 | - **Inference**: Here the goal is understanding the relationship between $Y$ and $X$ and the form of $f$. Related questions are which predictors are associated with the response (model selection) and is the relationship linear or nonlinear. 56 | 57 | ## Basic Concepts | Trade-off between Prediction Accuracy and Model Interpretability 58 | Some methods are less flexible or more restrictive, meaning that the range of shapes of $f$ they can estimate is restricted. Other methods are more flexible in this regard. 59 | Usually there is a tension between prediction accuracy and interpretability. This means that flexible models often deliver good prediction accuracy and give models which are harder to interpret. 60 | 61 | This will become clearer in Part I. 62 | 63 | ## Basic Concepts | The Bias-Variance Trade-off I 64 | - The mean squared error (MSE) is defined as 65 | $$ MSE = 1/n \sum_{i=1}^n (y_i - \hat{f}(x_i))^2 $$ 66 | 67 | - Calculating the MSE for the sample used for estimation of $f$ (training set) might lead to **overfitting**. 
68 | - Hence, MSE for a new unseen sample (testing set) is preferable: 69 | $$ MSE = Ave (y_0 - \hat{f}(x_0))^2 $$ 70 | with a new observation $x_0$ 71 | 72 | ## Basic Concepts | The Bias-Variance Trade-off II 73 | - We have the following decomposition 74 | $$ \mathbb{E} (y_0 - \hat{f}(x_0))^2 = \mathbb{Var}(\hat{f}(x_0)) + [Bias(\hat{f}(x_0))]^2 + \mathbb{Var}(\varepsilon) $$ 75 | - Variance: amount by which $\hat{f}$ changes if estimated by using a different training data set 76 | - Bias: error due to approximation the real relationship by a simpler model 77 | - "Bias-Variance Trade-off" 78 | 79 | ## Basic Concepts | The Bias-Variance Trade-off III (Illustration) 80 | 81 | ```{r, include=TRUE} 82 | set.seed(12345) 83 | n <- 25 84 | X <- rnorm(n) 85 | f <- X*1 86 | y <- f + rnorm(n) 87 | reg2 <- lm(y~ X + I(X^2)) 88 | reg3 <- lm(y~ X + I(X^2) + I(X^3)) 89 | reg4 <- lm(y~ X + I(X^2) + I(X^3) + I(X^4)) 90 | i <- order(X) 91 | ``` 92 | ```{r} 93 | plot(X,f, type="l") 94 | points(X,y) 95 | abline(lm(y ~ X), col="red") 96 | points(X[i], predict(reg2)[i], type="l", col="blue") 97 | points(X[i], predict(reg3)[i], type="l", col="green") 98 | points(X[i], predict(reg4)[i], type="l", col="yellow") 99 | legend("topleft", c("True line", "linear fit", "quadratic fit", "third order polynomial", "fourth order polynomial"), 100 | lty=c(1,1,1,1,1), lwd=rep(1,5),col=c("black","red", "blue", "green", "yellow"), cex=0.65) 101 | ``` 102 | 103 | 104 | 105 | 106 | ## Problems / Challenges in High-Dimensions 107 | 108 | - Lost in the immensity of high-dimensional spaces 109 | - Fluctuations cumulate. 110 | - An accumulation of rare events may not be rare. 111 | - Computational complexity 112 | 113 | ## Immensity of High-Dimensional Spaces I 114 | When the dimension $p$ increases, the notion of "nearest points" vanishes. Below the histograms of the pairwise distances of $n=100$ points randomly drawn (uniformly) from the unit cube are given. 115 | ```{r, include=FALSE} 116 | p <- c(2,10,100,1000) 117 | n <- 100 118 | Obs <- n*(n-1)/2 119 | Result <- matrix(NA, ncol=length(p), nrow=Obs) 120 | for (i in 1:length(p)) { 121 | X <- matrix(runif(n*p[i]), nrow=n) 122 | Res <- as.vector(apply(X, 1, function(x) {sqrt(rowSums(sweep(X,2,x)^2))})) 123 | Result[,i] <- unique(Res[Res!=0]) 124 | } 125 | ``` 126 | 127 | ## Immensity of High-Dimensional Spaces II 128 | ```{r} 129 | par(mfrow=c(2,2)) 130 | hist(Result[,1], main="dimension=2", ylab="frequency", xlab="distance between points", xlim=c(0,2)) 131 | hist(Result[,2], main="dimension=10", ylab="frequency", xlab="distance between points", xlim=c(0,3)) 132 | hist(Result[,3], main="dimension=100", ylab="frequency", xlab="distance between points", xlim=c(2,5)) 133 | hist(Result[,4], main="dimension=1000", ylab="frequency", xlab="distance between points", xlim=c(10,15)) 134 | ``` 135 | 136 | ## Immensity of High-Dimensional Spaces III 137 | 138 | How many points are needed in order to fill the hypercube $[0,1]^p$ in such a way that at any $x \in [0,1]^p$ there exists at least one point at distance less than $1$ from $x$? 139 | 140 | p | $20$| $30$| $50$| $100$|$150$|$200$ 141 | --|-----|-----|------|------|-----|---------------------------------- 142 | n |$39$| $45630$| $5.7*10^{12}$|$42*10^{39}$| $1.28*10^{72}$ |Inf 143 | 144 | ## Fluctuations accumulate. 
145 | In the linear regression model $Y = X \beta + \varepsilon$ for the OLS estimate $\hat{\beta}=(X^TX)^{-1}X^TY$ we have 146 | $$ \mathbb{E} [||\hat{\beta} - \beta ||^2] = \mathbb{E} [||((X^TX)^{-1}X^T \varepsilon)||^2] = Tr((X^TX)^{-1}) \sigma^2. $$ 147 | In the case of orthogonal design: 148 | $$ \mathbb{E} [||\hat{\beta} - \beta ||^2] = p \sigma^2 $$ 149 | with $\mathbb{Var} \varepsilon= \sigma^2$. 150 | Hence the estimation error grows with the dimension $p$ of the problem. 151 | 152 | ## Fluctuations accumulate. 153 | We consider a standard Gaussian distribution $\mathcal{N}(0,I_p)$ with density $f_p(x)=(2 \pi)^{-p/2} \exp(-||x||^2/2).$ 154 | We are interested in the mass of the distribution in the "bell" 155 | $$ B_{p,\delta} = \{ x \in \mathbb{R}^p: f_p(x) \geq \delta f_p(0)\} = \{ x \in \mathbb{R}^p: ||x||^2 \leq 2 \log(\delta^{-1})\}. $$ 156 | The Markov Inequality gives us: 157 | $$ \mathbb{P}(X \in B_{p,\delta}) = \mathbb{P}(e^{-||X||^2/2}\geq \delta) \leq 1/\delta \mathbb{E}[e^{-||X||^2/2}] 158 | = \frac{1}{\delta 2^{p/2}}$$ 159 | ```{r, include=FALSE} 160 | delta = 0.001 161 | p = 30:100 162 | Prob = 1/(delta*2^(p/2)) # upper bound on the mass in the bell from the Markov inequality 163 | #Prob = Prob/Prob[1] 164 | ``` 165 | ## Fluctuations accumulate. 166 | ```{r} 167 | plot(p, Prob, type="l", main="Mass in the bell (upper bound)", xlab="dimension p", ylab="mass in the bell") 168 | ``` 169 | 170 | ## Accumulation of Rare Events 171 | Suppose an error $\varepsilon$ is Gaussian distributed with $\mathcal{N}(0,1)$. Then with probability at least $1-\alpha$, the noise $\varepsilon$ has an absolute value smaller than $(2\log(1/\alpha))^{1/2}$. This follows from the inequality $\mathbb{P}(|\varepsilon| \geq x) \leq \exp(-x^2/2).$ 172 | 173 | When we observe $p$ noise variables $\varepsilon_1,\ldots,\varepsilon_p$ which are i.i.d. and standard normal, we have 174 | 175 | $$ \mathbb{P}(\max_{j=1,\ldots,p} |\varepsilon_j| \geq x) = 1 - (1- \mathbb{P}(|\varepsilon_1| \geq x))^p \approx p \mathbb{P}(|\varepsilon_1| \geq x).$$ 176 | 177 | This means that if we want to bound the max of the absolute values with probability $1-\alpha$, then we can only guarantee that the maximum is smaller than $(2\log(p/\alpha))^{1/2}$. 178 | 179 | 180 | ## Computational Complexity 181 | With increasing dimension, numerical computations can become very demanding and exceed the available computing resources. 182 | 183 | Example: When we have $p$ potential regressors, then the number of submodels is $2^p$, which grows exponentially with the number of regressors.
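The accumulation of rare events can also be seen in a short simulation: the observed maximum of $p$ i.i.d. standard normal noise terms grows with $p$ and stays below the bound $(2\log(p/\alpha))^{1/2}$. The following chunk is only an illustrative sketch (the values of $p$ and $\alpha$ are arbitrary choices).

```{r}
# Illustration: maximum of p i.i.d. N(0,1) noise terms vs. the bound sqrt(2*log(p/alpha))
set.seed(1)
p <- c(10, 100, 1000, 10000)   # illustrative dimensions
alpha <- 0.05
observed_max <- sapply(p, function(k) max(abs(rnorm(k))))
bound <- sqrt(2 * log(p / alpha))
round(rbind(observed_max, bound), 2)
```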
-------------------------------------------------------------------------------- /Lecture_1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MartinSpindler/Machine-Learning-in-Econometrics/2108b4e5c378b23e21a98ad976446b2a2603b512/Lecture_1.pdf -------------------------------------------------------------------------------- /Lecture_10.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 10 -- Model Assessment and Selection" 3 | author: "Martin Spindler" 4 | date: '`r format(Sys.Date())`' 5 | output: 6 | beamer_presentation: default 7 | ioslides_presentation: null 8 | keep_tex: yes 9 | mathjax: local 10 | self_contained: no 11 | --- 12 | 13 | ## Bias, Variance, and Model Complexity 14 | * Target variable $Y$, inputs $X$, $\hat{f}(X)$ prediction model estimated from training set $\mathcal{T}$ 15 | * Typical choices of loss functions: $L(Y,\hat{f}(X))=(Y - \hat{f}(X))^2$ (squared error) or $L(Y,\hat{f}(X))=|Y - \hat{f}(X)|$ (absolute error) 16 | * The test error / generalization error, is the prediction error over an independent test sample 17 | $$ Err_{\mathcal{T}} = E[L(Y, \hat{f}(X))|\mathcal{T}]$$ 18 | where both $X$ and $Y$ are drawn randomly from their joint distribution (population). 19 | 20 | ## Bias, Variance, and Model Complexity 21 | * Expected prediction error (or expected test error) 22 | $$ Err=E[L(Y,\hat{f}(X))]=E[Err_{\mathcal{T}}]$$ 23 | 24 | ## Bias, Variance, and Model Complexity 25 | * Goal: estimation of $Err_{\mathcal{T}}$ 26 | * Training error is the average loss over the training sample: 27 | $$ \bar{err} = 1/n \sum_{i=1}^n L(y_i, \hat{f}(x_i)).$$ 28 | * Similar for categorical variables (but different loss function). 29 | 30 | ## Bias, Variance, and Model Complexity 31 | * Usually model depends on a tuning parameter $\alpha$: $\hat{f}_{\alpha}(x)$ 32 | * Two different goals: 33 | + Model selection: estimating the performance of different models in order to choose the best one. 34 | + Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data. 35 | 36 | 37 | ## The Bias-Variance Decomposition 38 | * $Y=f(X) + \varepsilon$ with $E[\varepsilon]=0$ and $Var(\varepsilon)= \sigma^2_{\varepsilon}$ 39 | * The expected prediction error of a regression fit $\hat{f}(X)$ at a point $X=x_0$ under squared-error loss is given by 40 | $$ Err(x) = \sigma^2_{\varepsilon} + Bias^2(\hat{f}(x_0)) + Var(\hat{f}(x_0))$$ 41 | * This can be interpreted as "Irreducible Error + Bias$^2$ + Variance". 42 | 43 | ## The Bias-Variance Decomposition | Example OLS 44 | * For linear model fit $\hat{f}_p(x)=x^T \hat{\beta}$ with p components by ols we have 45 | $$ E(x_0)=E[(Y-\hat{f}_p (x) )^2|X=x_0] = \sigma^2_{\varepsilon} + [f(x_0) - E\hat{f}_p(x_0)]^2 + \|h(x_0)\|^2 \sigma^2_{\varepsilon}.$$ 46 | $h(x_0)=X(X^TX)^{-1} x_0$ 47 | 48 | ## The Bias-Variance Decomposition | Example OLS 49 | * Average over all sample values $x_i$ gives: 50 | $$ 1/n \sum_{i=1}^n Err(x_i) = \sigma^2_{\varepsilon} + 1/n \sum_{i=1}^n [f(x_i) - E\hat{f}(x_i)]^2 + \frac{p}{n} \sigma^2_{\varepsilon},$$ the in-sample error. 
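As a small illustration of this in-sample error, one can simulate new responses at the same design points and compare the training error with the error on these fresh responses; the sketch below uses an arbitrary data-generating process and polynomial fits of increasing size.

```{r}
# Sketch: training error vs. error on new responses at the same x_i (in-sample error)
set.seed(42)
n <- 100
x <- runif(n, -2, 2)
f <- sin(x)
y <- f + rnorm(n)    # training responses
y0 <- f + rnorm(n)   # new responses Y^0 at the same design points
degrees <- 1:10
errs <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d))
  c(train = mean(residuals(fit)^2), insample = mean((y0 - fitted(fit))^2))
}))
matplot(degrees, errs, type = "b", pch = 1, lty = 1, col = c("blue", "red"),
        xlab = "polynomial degree", ylab = "MSE")
legend("topleft", c("training error", "in-sample error"), col = c("blue", "red"), lty = 1)
```

The training error keeps decreasing with model size, while the error on the new responses does not; this gap is the optimism discussed next.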
51 | 52 | ## Optimism of the Training Error Rate 53 | * With $\mathcal{T}=\{(x_1,y_1), \ldots, (x_n, y_n)\}$ given, the generalization error of a model $\hat{f}$ is 54 | $$ Err_{\mathcal{T}} = E_{X^0,Y^0}[L(Y^0, \hat{f}(X^0))|\mathcal{T}]$$ 55 | (fixed training set $\mathcal{T}$, new observation /data point $(X^0,Y^0)$ drawn from $F$, the distribution of the data) 56 | * Averaging over training sets yields the expected error: 57 | $$ Err = E_{\mathcal{T}} E_{X^0,Y^0}[L(Y^0, \hat{f}(X^0))|\mathcal{T}]$$ 58 | (easier to analyze) 59 | * In general: $\bar{err}=1/n \sum_{i=1}^n L(y_i, \hat{f}(x_i)) \leq Err_{\mathcal{T}}$ 60 | 61 | ## Optimism of the Training Error Rate 62 | * Part of the discrepancy come from the location where the evaluation points occur. $Err_{\mathcal{T}}$ as extra-sample error. 63 | * In-sample error (for analysis of $\bar{err}$) 64 | $$ Err_{in} = 1/n \sum_{i=1}^n E_{Y^0}[L(Y_i^0, \hat{f}(x_i))|\mathcal{T}]$$ 65 | (observation of n new response values at each of the training points $x_i$) 66 | * Optimism: difference between $Err_{in}$ and training error $\bar{err}$: 67 | $$ op \equiv Err_{in} - \bar{err}.$$ 68 | * Average optimism is the expectation of the optimism over training sets: 69 | $$ \omega \equiv E_y(op).$$ 70 | 71 | ## Optimism of the Training Error Rate 72 | * Usually only $\omega$ and not $op$ can be estimated (analogous to $Err$ and $Err_{\mathcal{T}}$) 73 | * It can be shown: $\omega = 2/n \sum_{i=1}^n Cov(\hat{y}_i, y_i).$ 74 | * Interpretation 75 | * In sum: $E_y(Err_{in})=E_y(\bar{err})+ 2/n \sum_{i=1}^n Cov(\hat{y}_i, y_i)$ 76 | * Example: linear fit with $p$ variables for model $Y=f(X)+ \varepsilon$: $\sum_{i=1}^n Cov(\hat{y}_i, y_i) = p \sigma^2_{\varepsilon}$ 77 | 78 | ## Estimates of In-Sample Prediction Error 79 | * General form of the in-sample estimates: $\hat{Err}_{in}=\bar{err} + \hat{\omega}.$ 80 | * $C_p$ statistic: $C_p= \bar{err} + 2 \frac{d}{n} \hat{\sigma}^2_{\varepsilon}$ 81 | * Akaike Information Criterion: $AIC= - \frac{2}{n} loglik + 2 \frac{d}{n}$ 82 | * Bayesian Information Criterion: $BIC = - 2 loglik + (\log n) d$ 83 | 84 | ## Cross-Validation 85 | * Estimation of the prediction error directly. 86 | * CV estimates the expected extra-sample error $Err=E[L(Y,\hat{f}(X))]$ 87 | * Formal description: 88 | + Denote $\kappa$ a partitioning function: $\kappa: \{1,\ldots, n\} \rightarrow \{1, \ldots, K\}$ 89 | + Denote by $\hat{f}^{-k}(x)$ the fitted function, computed with the kth part of the data removed. 90 | 91 | ## Cross-Validation 92 | * The cross-validated estimator of the prediction error is 93 | $$ CV(\hat{f}) = 1/n \sum_{i=1}^n L(y_i, \hat{f}^{-k}(x_i)).$$ 94 | * Typical choices : $K=5,10$, $K=n$ is called *leave-one-out* cross-validation 95 | 96 | ## Cross-Validation | Tuning Parameter 97 | Given a set of models $f(x,\alpha)$ indexed by a tuning parameter $\alpha$, denote by $\hat{f}^{-k}(x,\alpha)$ the model fit with the kth part of the data removed and tuning parameter $\alpha$. Then for this set of model we define 98 | $$ CV(\hat{f}, \alpha) = 1/n \sum_{i=1}^n L(y_i, \hat{f}^{-k}(x_i, \alpha)).$$ 99 | 100 | The function $CV(\hat{f}, \alpha)$ provides an estimate of the test error curve, and we find the tuning parameter $\hat{\alpha}$ that minimizes it. Our final is $\hat{f}(x,\hat{\alpha})$, which we then fit to all the data. 
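A minimal R sketch of this procedure (the data-generating process, $K=10$, and the use of the polynomial degree as tuning parameter $\alpha$ are illustrative assumptions, not part of the lecture's setup):

```{r}
# K-fold cross-validation for choosing a tuning parameter (here: polynomial degree)
set.seed(123)
n <- 200
x <- runif(n, -2, 2)
y <- sin(x) + rnorm(n, sd = 0.3)
K <- 10
fold <- sample(rep(1:K, length.out = n))   # the partitioning function kappa
degrees <- 1:8                             # candidate tuning parameter values
cv <- sapply(degrees, function(d) {
  mean(sapply(1:K, function(k) {
    fit <- lm(y ~ poly(x, d), subset = fold != k)   # fit with the kth part removed
    mean((y[fold == k] - predict(fit, newdata = data.frame(x = x[fold == k])))^2)
  }))
})
degrees[which.min(cv)]   # tuning parameter minimizing the estimated test error curve
```

The selected degree is then used to refit the model on all the data, as described above.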
-------------------------------------------------------------------------------- /Lecture_10.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MartinSpindler/Machine-Learning-in-Econometrics/2108b4e5c378b23e21a98ad976446b2a2603b512/Lecture_10.pdf -------------------------------------------------------------------------------- /Lecture_2.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 2 -- Linear Regression and Extensions" 3 | author: "Martin Spindler" 4 | date: '`r format(Sys.Date())`' 5 | output: 6 | beamer_presentation: default 7 | ioslides_presentation: null 8 | mathjax: local 9 | self_contained: no 10 | --- 11 | 12 | ```{r setup, include=FALSE} 13 | knitr::opts_chunk$set(echo = FALSE) 14 | ``` 15 | 16 | ## Linear Regression 17 | We start with a linear regression model: 18 | $$ y_i = x_i' \beta + \varepsilon_i, i=1, \ldots,n, $$ 19 | where $x_i$ is a p-dimensional vector of regressors for observation $i$, $\beta$ a p-dimensional coefficient vector, and $\varepsilon_i$ iid error terms with $\mathbb{E} [\varepsilon_i|x_i]= 0$. 20 | 21 | The ordinary least squares (ols) estimator for $\beta$ is defined as 22 | 23 | $$ \hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^n (y_i - x_i' \beta)^2.$$ 24 | 25 | 26 | ## Linear Regression 27 | If the Gram matrix $\sum_{i=1}^n x_i x_i'$ is of full rank, the ols estimate is given by 28 | 29 | $$ \hat{\beta} = (\sum_{i=1}^n x_i x_i')^{-1} (\sum_{i=1}^n x_i y_i). $$ 30 | 31 | The residuals $\hat{\varepsilon}_i$ are defined as 32 | $$ \hat{\varepsilon}_i = y_i - x_i'\hat{\beta}. $$ 33 | For an observation $x$ the *fitted* or *predicted* values are given by 34 | $$ \hat{y} = x'\hat{\beta}. $$ 35 | 36 | ## Linear Regression 37 | 38 | In matrix notation we can write 39 | 40 | $$ Y=X \beta + \varepsilon $$ 41 | 42 | with $Y=\begin{pmatrix} y_1 & \ldots & y_n \end{pmatrix}$, $\varepsilon=\begin{pmatrix} \varepsilon_1 & \ldots & \varepsilon_n \end{pmatrix}$ and $X$ is a $n \times p$-matrix with observation $i$ forming the $i$th row of the matrix $X$. 43 | 44 | The ols estimate $\hat{\beta}$ can then be written as 45 | 46 | $$ \hat{\beta} = (X'X)^{-1}X'y.$$ 47 | 48 | ## Linear Regression 49 | 50 | Under homoscedastic errors, i.e. $\mathbb{V} \varepsilon_i=\sigma^2$, we have that 51 | 52 | $$ \mathbb{V}(\hat{\beta})= (X'X)^{-1} \sigma^2.$$ 53 | 54 | Asymptotically, the ols estimate is normal distributed: 55 | 56 | $$ \hat{\beta} \sim N(\beta, (X'X)^{-1} \sigma^2).$$ 57 | 58 | This can be used for testing hypotheses and construction of confidence intervals. 59 | 60 | ## Linear Regression 61 | $$ z_j=\frac{\hat{\beta_j}}{\hat{\sigma}^2 \sqrt{v_j}}$$ 62 | where $v_j$ is the $j$th diagonal element of $(X'X)^{-1}$. 63 | Under the null hypothesis $\beta_j=0$ the *Z-score* / *t-statistic* $z_j$ is $t_{n-p-1}$-distributed. 64 | 65 | ## Linear Regression 66 | 67 | Remark: In the high-dimensional-setting, i.e. $p >> n$ the Gram Matrix is rank deficient and the ols estimate is not uniquely defined and the variance of the parameter estimate is unbounded. 68 | 69 | ## Extensions 70 | - Polynomial Regression 71 | - Step Functions 72 | - Basis Functions 73 | - Regression Splines 74 | - Smoothing Splines 75 | 76 | ## Extensions | Remarks 77 | * Although the linear regression model looks quite simple, it can be extended / modified to model complex relations. 
78 | * For the extensions we consider without loss of generality univariate regressions: 79 | 80 | $$ y_i=\beta_0 + \beta_1 x_i + \varepsilon_i$$ 81 | 82 | ## Extensions | Polynomial Regression 83 | 84 | To make the linear specification more flexible, we might include higher-order polynomials: 85 | 86 | $$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \ldots \beta_p x_i^p + \varepsilon_i$$ 87 | 88 | 89 | * Estimation by ols 90 | * Quite flexible, but usually p=3 or p=4 91 | * Higher order polynomials (p > 5) might lead to strange fits (overfitting), especially at the boundary. 92 | 93 | ## Extensions | Polynomial Regression - Example 94 | ```{r, include=FALSE} 95 | library(ISLR) 96 | attach(Wage) 97 | # Fourth orger polynomial 98 | fit4 = lm(wage ~ poly(age, 4, raw=T), data=Wage) 99 | # fit = lm(wage ~ age + I(age^2) + I(age^3) + I(age^4), data=Wage) 100 | agelims = range(age) 101 | age.grid = seq(from=agelims[1], to=agelims[2]) 102 | preds4 = predict(fit4, newdata=list(age=age.grid), se=TRUE) 103 | se.bands4 = cbind(preds4$fit + 2*preds4$se.fit, preds4$fit - 2*preds4$se.fit) #95% confidence intervall 104 | ``` 105 | 106 | 107 | ```{r, include=FALSE} 108 | # calculation higher order polynomials 109 | fit3 = lm(wage ~ poly(age, 3, raw=T), data=Wage) 110 | preds3 = predict(fit3, newdata=list(age=age.grid), se=TRUE) 111 | fit5 = lm(wage ~ poly(age, 5, raw=T), data=Wage) 112 | preds5 = predict(fit5, newdata=list(age=age.grid), se=TRUE) 113 | ``` 114 | 115 | ```{r} 116 | par(mfrow=c(1,2), mar=c(4.5,4.5,1,1), oma=c(0,0,4,0)) 117 | plot(age, wage, xlim=agelims, cex=.5, col="darkgrey") 118 | title("Polynomial of order 4", outer=F) 119 | lines(age.grid, preds4$fit, lwd=2, col="blue") 120 | matlines(age.grid, se.bands4, lwd=1, col="blue", lty=3) 121 | plot(age, wage, xlim=agelims, cex=.5, col="darkgrey") 122 | lines(age.grid, preds4$fit, lwd=2, col="blue") 123 | lines(age.grid, preds5$fit, lwd=2, col="red") 124 | lines(age.grid, preds3$fit, lwd=2, col="green") 125 | title("Different higher order polynomials") 126 | legend("bottomright", c("Third order", "Fourth order", "Fifth order"), 127 | lty=c(1,1,1), lwd=rep(2,3),col=c("green","blue", "red"), cex=0.65) 128 | ``` 129 | 130 | 131 | ## Extensions | Step Functions 132 | 133 | * Definition: Step functions are functions which are constant on each part of a partition of the domain. 134 | * Univariate Regression: choosing $K$ cut points $c_1, \ldots, c_K$ and defining new auxiliary variables: 135 | $C_0(x)= 1(x < c_1)$, $C_1(x)=1(c_1 \leq x < c_2)$, $\ldots$, $C_K(x)=1(c_K \leq x)$ 136 | * $1(\cdot)$ is the so-called indicator function which is $1$ is the condition is true and $0$ otherwise. 137 | * This gives us the following regression: 138 | $$ y_i = \beta_0 = \beta_1 *C_1(x_i) + \ldots + \beta_K *C_K(x_i) + \varepsilon_i $$ 139 | 140 | ## Extensions | Step Functions 141 | 142 | * Note: $C_0(x) + \ldots + C_K(x) = 1$ and hence we drop $C_0=(\cdot)$ to avoid multicollinearity. 
143 | * Interpretation $\beta_0$ 144 | * Example: wage regression 145 | 146 | ## Extensions | Step Functions 147 | ```{r, include=FALSE} 148 | Wage$agegroup<-cut(Wage$age, c(18,25,35,45,55,65,80)) 149 | fitgroup = lm(wage ~ agegroup, data=Wage) 150 | predsgroup = predict(fitgroup, newdata=list(agegroup=cut(18:80, c(18,25,35,45,55,65,80))), se=TRUE) 151 | se.bandsgr = cbind(predsgroup$fit + 2*predsgroup$se.fit, predsgroup$fit - 2*predsgroup$se.fit) #95% confidence intervall 152 | ``` 153 | 154 | ```{r} 155 | plot(age, wage, xlim=agelims, cex=.5, col="darkgrey") 156 | title("Regression with step functions", outer=T) 157 | lines(18:80, predsgroup$fit, lwd=2, col="blue") 158 | matlines(age.grid, se.bandsgr, lwd=1, col="blue", lty=3) 159 | ``` 160 | 161 | ## Extension | Basis Functions 162 | 163 | * Idea: family of functions or transformations that can be applied to a variable: $b_1(x), \ldots, b_K(x)$ (basis functions) 164 | * Regression: $$y_i = \beta_0 + \beta_1 b_1(x_i) + \ldots + \beta_K b_K(x_i) + \varepsilon_i $$ 165 | * Examples 166 | + Polynomial regression: $b_j(x_i)=x^j_i$ 167 | + Piecewise constant functions (step functions): $b_j(x_i)=1(c_j \leq x_i < c_{j+1})$ 168 | + Regressions splines (coming next) -------------------------------------------------------------------------------- /Lecture_2.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MartinSpindler/Machine-Learning-in-Econometrics/2108b4e5c378b23e21a98ad976446b2a2603b512/Lecture_2.pdf -------------------------------------------------------------------------------- /Lecture_3.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 3 -- Linear Regression and Extensions" 3 | author: "Martin Spindler" 4 | date: '`r format(Sys.Date())`' 5 | output: 6 | beamer_presentation: default 7 | ioslides_presentation: null 8 | mathjax: local 9 | self_contained: no 10 | --- 11 | 12 | ```{r setup, include=FALSE} 13 | knitr::opts_chunk$set(echo = FALSE) 14 | ``` 15 | 16 | ## Extensions 17 | - Polynomial Regression 18 | - Step Functions 19 | - Basis Functions 20 | - Regression Splines 21 | - Smoothing Splines 22 | 23 | ## Regression Splines 24 | 25 | * Polynomial regressions often leads to rough and instable estimates. 26 | * Solution: Fitting separate low-degree polynomials over different regions (and make them smooth) 27 | * Construction of splines: 28 | - Partition x-axis into different smaller sub-intervals and estimate a separate polynomial for each interval. 29 | - Additionally, it is required that the combined function is smooth (e.g. continuously differentiable) at the boundary points (knots). 30 | 31 | ## Regression Splines | Defintion 32 | 33 | Let $a=c_1 < c_2 < \ldots < c_K=b$ be a partition of the interval $[a,b]$. A function $s: [a,b] \rightarrow \mathbb{R}$ is called a polynomial spline of degree $l$ if 34 | 35 | 1. $s(z)$ is a polynomial of degree $l$ for $z \in [c_j, c_{j+1}), 1 \leq j < m.$ 36 | 2. $s(z)$ is $(l-1)$-times continuously differentiable. 37 | 38 | $c_1, \ldots, c_K$ are called knots of the splines and $\Omega=\{ c_1,\ldots, c_k\}$ knot set. 39 | 40 | ## Regression Splines 41 | 42 | It can be shown that regression splines form a vector space of dimension of dimension $K+l-1$. 
43 | 44 | Hence every regression spline can be represented as the sum of $K+l-1$ basis functions: 45 | 46 | $$ s(z) = \beta_0 B_0(z) + \ldots + \beta_{K+l-2} B_{K+l-2}(z).$$ 47 | 48 | Basis functions: truncated power series basis, B-spline basis (numerically more stable) 49 | Truncated power series: 50 | $$s(z) = \sum_{j=0}^l \beta_j z^j + \sum_{j=2}^{K-1} \beta_{l+j-1} (z-c_j)^l_+$$ 51 | with $(z-c_j)^l_+ = \max(0, (z-c_j))^l$. 52 | 53 | ## Regression Splines | Example Basis Functions 54 | We consider the interval $[0,1]$ and knots $0<0.25<0.5<0.75<1$. 55 | 56 | For a quadratic spline and 5 knots, the number of basis functions is given by $2+5-1=6$. 57 | 58 | 59 | ## Regression Splines | Example Basis Functions 60 | 61 | ```{r} 62 | par(mfrow=c(3,2)) 63 | plot(NULL, xlim=c(0,1), ylim=c(0,2), xlab="", ylab="", main="Spline of degree 2, B_0") 64 | abline(h=1) 65 | curve(x^1, from=0, to=1, xlab="", ylab="", main="Spline of degree 2, B_1") 66 | curve(x^2, from=0, to=1, xlab="", ylab="", main="Spline of degree 2, B_2") 67 | f3 <- function(x) ifelse(x<0.25, 0, (x-0.25)^2) 68 | curve(f3, from=0, to=1, xlab="", ylab="", main="Spline of degree 2, B_3") 69 | f4 <- function(x) ifelse(x<0.5, 0, (x-0.5)^2) 70 | curve(f4, from=0, to=1, xlab="", ylab="", main="Spline of degree 2, B_4") 71 | f5 <- function(x) ifelse(x<0.75, 0, (x-0.75)^2) 72 | curve(f5, from=0, to=1, xlab="", ylab="", main="Spline of degree 2, B_5") 73 | ``` 74 | 75 | ## Regression Splines 76 | How to choose the number and location of knots? 77 | 78 | * Equi-distant knots 79 | * Choice according to quantiles of the x-variable 80 | 81 | ## Regression Splines | Example 82 | ```{r, include=FALSE} 83 | library(ISLR) 84 | attach(Wage) 85 | agelims = range(age) 86 | age.grid = seq(from=agelims[1], to=agelims[2]) 87 | library(splines) 88 | fit = lm(wage ~ bs(age, knots=c(25,40,60)), data=Wage) 89 | pred = predict(fit, newdata=list(age=age.grid), se=T) 90 | # alternative way by specifying degrees of freedom 91 | # fit = lm(wage ~ bs(age, df=6), data=Wage) 92 | # attr(bs(age, df=6), "knots") 93 | ``` 94 | ```{r} 95 | plot(age, wage, col="gray") 96 | lines(age.grid, pred$fit, lwd=2) 97 | lines(age.grid, pred$fit + 2*pred$se, lty="dashed") 98 | lines(age.grid, pred$fit - 2*pred$se, lty="dashed") 99 | ``` 100 | 101 | ## Natural Splines 102 | 103 | * Problem: Regression splines tend to display erratic behavior at the boundaries of the domain leading to high variance. 104 | * Solution: additional constraints at the boundary (left of the leftmost knot and right of the rightmost knot) 105 | * Definition **Natural Spline** 106 | A natural spline of degree $l$ is a regression spline of degree $l$ with the additional constraint that 107 | it is a polynomial of degree $(l-1)/2$ on $(-\infty, c_1]$ and $[c_K, + \infty)$. 108 | * Most popular natural splines are cubic which are linear beyond the boundaries. 109 | * Modifications of the truncated power basis and B-spline basis for natural splines (here dimension $K$!) 110 | 111 | ## Smoothing Splines 112 | 113 | * Optimization problem: Among all functions $f(x)$ with two continuous derivatives, minimize: 114 | $$ RSS(f, \lambda) = \sum_{i=1}^n (y_i -f(x_i))^2 + \lambda \int (f''(t))^2dt $$ 115 | * $\lambda$ is called *smoothing parameter* (interpretation?) 116 | * It can be shown that the solution of the optimization problem is unique and a natural cubic spline with knots at the unique values of the $x_i, i=1,\ldots,n$.
117 | * Here: no problem how to choose the knots (as in the regression spline case) 118 | * Intuition: Overparametrization (because of $n$ knots), but penalization 119 | 120 | ## Snoothing Splines 121 | 122 | Since the solution is a natural spline, we can write it as 123 | 124 | $$ f(x) = \sum_{j=1}^n b_j(x) \beta_j $$ 125 | 126 | with $b_1(\cdot), \ldots, b_n(\cdot)$ an $n$ dimensional set of basis functions for representing the family of natural splines. 127 | 128 | ## Snoothing Splines 129 | 130 | Then the criterion reduces to 131 | 132 | $$ RSS(\beta, \lambda) = (y-B \beta)^T (y-B \beta) + \lambda \beta^T \Omega_n \beta $$ 133 | 134 | where $B_{ij}=b_j(x_i)$ and $(\Omega_n)_{jk}=\int b_j''(d) b_k''(t)dt.$ 135 | 136 | The solution is given by 137 | 138 | $$ \hat{\beta}=(B^TB + \lambda \Omega_n)^{-1}B^T y.$$ 139 | (generalized Ridge regression). 140 | 141 | ## Snoothing Splines 142 | The fitted smoothing spline is given by 143 | 144 | $$\hat{f}(x) = \sum_{j=1}^n b_j(x) \hat{\beta_j}$$ 145 | 146 | -------------------------------------------------------------------------------- /Lecture_3.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MartinSpindler/Machine-Learning-in-Econometrics/2108b4e5c378b23e21a98ad976446b2a2603b512/Lecture_3.pdf -------------------------------------------------------------------------------- /Lecture_4.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 4 -- Ridge and Lasso Regression" 3 | author: "Martin Spindler" 4 | date: '`r format(Sys.Date())`' 5 | output: 6 | ioslides_presentation: null 7 | beamer_presentation: default 8 | keep_tex: yes 9 | mathjax: local 10 | self_contained: no 11 | --- 12 | 13 | ## Ridge Regression 14 | 15 | We consider a linear model 16 | $$ y_i = \sum_{k=1}^p x_{ik} \beta_k + \varepsilon_i = x_i' \beta + \varepsilon$$ 17 | The OLS solution is given by: 18 | $$ \hat{\beta}_{OLS} = (X'X)^{-1} X'y $$ 19 | The ridge estimator is given by 20 | $$ \hat{\beta}_{ridge} = (X'X+ \lambda I_p)^{-1} X'y $$ 21 | We inflate the $X'X$ matrix by $\lambda I_p$ so that it is positive definite 22 | irrespective of $p$, including $p > n$. 23 | 24 | ## Ridge Regression | Expectation 25 | 26 | $$ \mathbb{E}[\hat{\beta}(\lambda)] = \mathbb{E}[(X'X+ \lambda I_p)^{-1}X'Y]= X'X(\lambda I_p + X'X)^{-1} \beta $$ 27 | Clearly, $\mathbb{E}[\hat{\beta}(\lambda)] \neq \beta$ for any $\lambda >0$. 28 | 29 | $$ \lim_{\lambda \rightarrow \infty} \mathbb{E}[\hat{\beta}(\lambda)] = 0_p $$ 30 | 31 | Hence, all regression coefficients are shrunken towards zero as the penalty increases. 32 | 33 | ## Ridge Regression | Expectation | Orthogonal Design 34 | 35 | Orthogonal design: $X'X=I_p$ 36 | 37 | $$\hat{\beta}(\lambda) = (1+\lambda)^{-1} \hat{\beta}$$ 38 | 39 | + Ridge estimator scales the OLS estimator by a fixed factor. 40 | 41 | ## Ridge Regression | Variance 42 | 43 | Define $W_{\lambda}=[\lambda I_p + (X'X)^{-1}]^{-1}$. 
Hence, 44 | 45 | $W_{\lambda} \hat{\beta}= \hat{\beta}(\lambda).$ 46 | 47 | $$ \mathbb{Var}[\hat{\beta}(\lambda)] = \sigma^2 W_{\lambda} X'X W_{\lambda}' $$ 48 | 49 | $$ \lim_{\lambda \rightarrow \infty} \mathbb{Var} [\hat{\beta}(\lambda)] = 0_p$$ 50 | 51 | It can be shown that $\mathbb{Var}[\hat{\beta}] \geq \mathbb{Var}[\hat{\beta}(\lambda)]$ 52 | 53 | ## Ridge Regression | Variance | Orthogonal Design 54 | 55 | $\mathbb{Var}[\hat{\beta}(\lambda)]=\sigma^2(1+ \lambda)^{-2}I_p$ 56 | 57 | The variance of the OLS estimator exceeds the variance of the Ridge estimator. 58 | 59 | ## Ridge Regression | Mean Squared Error (MSE) 60 | 61 | $MSE(\hat{\theta})= \mathbb{E}[(\hat{\theta} - \theta)^2] = \mathbb{Var}(\hat{\theta}) + [Bias(\hat{\theta})]^2$ 62 | 63 | $MSE(\hat{\beta}(\lambda))= \sigma^2 tr\{W_{\lambda}(X'X)^{-1} W_{\lambda}'\}$ 64 | 65 | $+ \beta'(W_{\lambda}-I_p)'(W_{\lambda}-I_p) \beta.$ 66 | 67 | **Theorem** (Theobald, 1974) 68 | 69 | There exists $\lambda > 0$ such that $MSE[\hat{\beta}(\lambda)] < MSE[\hat{\beta}]$. 91 | 92 | ## Ridge Regression | Bayesian Interpretation 93 | 94 | *Bayesian Interpretation*: If the prior distribution 95 | for $\beta$ is $N(0, \tau^2 I_p)$, the distribution of $\varepsilon_i$ is normal $N(0, \sigma^2)$, and $\lambda = \sigma^2/\tau^2$, then $\hat{\beta}_{ridge}$ is the posterior mean/mode/median. 96 | 97 | ## Lasso Regression | Introduction 98 | 99 | Linear model: $y_i=\sum_{j=1}^p \beta_j X_i^{(j)} + \varepsilon_i, i=1,\ldots,n$, 100 | $\varepsilon_1,\ldots, \varepsilon_n$ i.i.d., independent of $\{X_i, i=1,\ldots,n\}$ and $\mathbb{E} [\varepsilon_i]=0$ and $\mathbb{V} [\varepsilon_i]=\sigma^2$ 101 | 102 | Wlog: intercept is zero and covariates are centered and on the same (unit) scale. 103 | 104 | Now: $p \gg n$ 105 | 106 | Problem: the OLS estimator is not unique and overfits the data 107 | 108 | ## Lasso Regression 109 | 110 | The Lasso estimator is given by 111 | $$ \hat{\beta}(\lambda)=\arg \min_{\beta \in \mathbb{R}^p} \left(||Y-X\beta||_2^2/n + \lambda ||\beta||_1 \right) (*) $$ 112 | 113 | $||Y-X\beta||_2^2=\sum_{i=1}^n (Y_i - (X\beta)_i)^2$, $|| \beta||_1 = \sum_{j=1}^p |\beta_j|$, $\lambda \geq 0$ penalization parameter 114 | 115 | ## Lasso Regression 116 | 117 | $(*)$ is equivalent to 118 | 119 | $$ \hat{\beta}_{primal}(R) = \arg \min_{\beta \in \mathbb{R}^p} \left(||Y-X\beta||_2^2/n \right) $$ 120 | such that $||\beta||_1 \leq R$ with a one-to-one relation between $R$ and $\lambda$. 121 | 122 | This optimization problem is a convex problem (and hence efficient computation is possible.
)
123 | 
124 | ## Comparison Lasso and Ridge Regression
125 | 
126 | ![Ridge and Lasso](LassoRidgeGraph.png)
127 | 
128 | ## Computation of the Lasso Solution | Single Predictor
129 | $$ \min_{\beta} \{1/(2n) \sum_{i=1}^n (y_i - z_i \beta)^2 + \lambda |\beta|\} $$
130 | 
131 | The solution is given by
132 | 
133 | $$ \hat{\beta} = \left\{\begin{array}{cl} 1/n \langle z,y \rangle - \lambda & \mbox{if}\quad 1/n \langle z,y \rangle > \lambda\\
134 | 0 & \mbox{if}\quad 1/n |\langle z,y \rangle | \leq \lambda\\
135 | 1/n \langle z,y \rangle + \lambda & \mbox{if}\quad 1/n \langle z,y \rangle < - \lambda
136 | \end{array}\right.$$
137 | 
138 | ## Computation of the Lasso Solution | Single Predictor
139 | This can be written as
140 | $$ \hat{\beta} = \mathcal{S}_{\lambda}(1/n \langle z,y \rangle)$$
141 | with the *soft-thresholding* operator
142 | 
143 | $$ \mathcal{S}_{\lambda}(x)=sign(x)(|x|-\lambda)_+$$
144 | 
145 | ## Computation of the Lasso Solution | Multiple Predictors
146 | 
147 | Cyclical coordinate descent:
148 | 
149 | $$ \frac{1}{2n} \sum_i (y_i - \sum_{k \neq j} x_{ik}\beta_k - x_{ij} \beta_j)^2 + \lambda \sum_{k \neq j}|\beta_k| + \lambda |\beta_j|$$
150 | 
151 | Idea: we repeatedly cycle through the predictors in some fixed order, where at the $j$th step, we update the coefficient $\beta_j$ by minimizing the objective function in this coordinate while holding all other coefficients fixed at their current values.
152 | 
153 | ## Computation of the Lasso Solution | Multiple Predictors
154 | 
155 | With partial residuals $r_i^{(j)}= y_i - \sum_{k \neq j} x_{ik} \hat{\beta}_k$
156 | 
157 | $$ \hat{\beta}_j=\mathcal{S}_{\lambda}(\frac{1}{n}\langle x_j, r^{(j)}\rangle) $$
158 | 
159 | The overall algorithm operates by applying this soft-thresholding update repeatedly in a cyclical manner, updating the coordinates of $\hat{\beta}$ along the way.
--------------------------------------------------------------------------------
/Lecture_4.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MartinSpindler/Machine-Learning-in-Econometrics/2108b4e5c378b23e21a98ad976446b2a2603b512/Lecture_4.pdf
--------------------------------------------------------------------------------
/Lecture_5.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 5 -- Ridge and Lasso Regression II"
3 | author: "Martin Spindler"
4 | date: '`r format(Sys.Date())`'
5 | output:
6 | ioslides_presentation: null
7 | beamer_presentation: default
8 | keep_tex: yes
9 | mathjax: local
10 | self_contained: no
11 | ---
12 | 
13 | 
14 | ## Lasso Regression
15 | 
16 | $$ \hat{\beta}(\lambda)=\arg \min_{\beta \in \mathbb{R}^p} \left(||Y-X\beta||_2^2/n + \lambda ||\beta||_1 \right) (*) $$
17 | 
18 | $||Y-X\beta||_2^2=\sum_{i=1}^n (Y_i - (X\beta)_i)^2$, $|| \beta||_1 = \sum_{j=1}^p |\beta_j|$, $\lambda \geq 0$ penalisation parameter
19 | 
20 | $(*)$ is equivalent to
21 | 
22 | $$ \hat{\beta}_{primal}(R) = \arg \min_{\beta \in \mathbb{R}^p} \left(||Y-X\beta||_2^2/n \right) $$
23 | such that $||\beta||_1 \leq R$ with a one-to-one relation between $R$ and $\lambda$.
24 | 
25 | This optimization problem is a convex problem (and hence efficient computation is possible.)
26 | 
27 | ## Lasso Regression
28 | 
29 | Key assumption: **sparsity**
30 | 
31 | The number of variables $p$ can grow with the sample size and even be larger than $n$, but the number of non-zero coefficients $s$ is required to be smaller than $n$ (but may also grow with the sample size).
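A purely illustrative sketch (not part of the original slides) of what this sparsity assumption looks like in practice, using the *glmnet* package from the problem sets; the dimensions and the value of $\lambda$ below are arbitrary choices.

```{r, eval=FALSE}
# Sketch: lasso on simulated data with p > n and only s truly non-zero
# coefficients; most estimated coefficients come out exactly zero.
library(glmnet)
set.seed(1)
n <- 100; p <- 200; s <- 5
X <- matrix(rnorm(n * p), ncol = p)
beta <- c(rep(1, s), rep(0, p - s))
y <- X %*% beta + rnorm(n)
fit <- glmnet(X, y, alpha = 1)   # alpha = 1 corresponds to the lasso
sum(coef(fit, s = 0.1) != 0)     # number of non-zero coefficients at lambda = 0.1
```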
32 | 
33 | Notation:
34 | 
35 | * $\beta^0$ true vector with components $\beta_j^0, j=1,\ldots,p$
36 | * $S_0=\{j: \beta^0_j \neq 0, j=1,\ldots,p\}$, $s=|S_0|$
37 | * $\hat{S}=\{j: \hat{\beta}_j \neq 0, j=1,\ldots,p\}$
38 | 
39 | 
40 | ## A glimpse of the theory
41 | 
42 | * convergence in prediction norm
43 | * convergence in $\ell_p$ norm
44 | * variable screening
45 | * variable selection
46 | 
47 | ## Theory | Convergence in prediction norm
48 | * Conditions:
49 | + Restricted eigenvalue condition / compatibility condition
50 | + no condition on the non-zero coefficients
51 | * Result:
52 | $$||X(\hat{\beta}-\beta^0)||^2_2/n = O_P(s \log(p)/n)$$
53 | * Interpretation
54 | 
55 | ## Theory | Convergence in $\ell_p$ norm
56 | * Conditions:
57 | + Restricted eigenvalue condition / compatibility condition
58 | + no condition on the non-zero coefficients
59 | * Result:
60 | $$||\hat{\beta}-\beta^0||_q = O_P(s^{1/q} \sqrt{\log(p)/n})$$ with $q \in \{1,2\}.$
61 | * Interpretation
62 | 
63 | 
64 | ## Theory | Variable Screening
65 | * Conditions
66 | + Restricted eigenvalue condition
67 | + beta-min condition: $\min_{j \in S_0} |\beta_j^0| \gg C \sqrt{s \log(p)/n}$ ($C$ some constant)
68 | * Result:
69 | $$ \mathbb{P}[S_0 \subset \hat{S}] \rightarrow 1$$
70 | ($p \geq n \rightarrow \infty$)
71 | * Interpretation
72 | 
73 | ## Theory | Variable Selection
74 | * Conditions:
75 | + neighbourhood stability condition (equivalent to irrepresentable condition)
76 | + beta-min condition
77 | * Result:
78 | $$ \mathbb{P}[S_0 = \hat{S}] \rightarrow 1$$
79 | 
80 | ## Extensions
81 | 
82 | * Adaptive Lasso (Zou, 2006)
83 | * Post-Lasso (Belloni & Chernozhukov, 2011)
84 | * Elastic Net (Zou & Hastie, 2005)
85 | * LAVA (Chernozhukov et al., 2015)
86 | * Group Lasso
87 | 
88 | ## Adaptive Lasso (Zou, 2006)
89 | 
90 | * $\hat{\beta}_{adapt}(\lambda)=\arg \min_{\beta} \left( ||Y-X\beta||^2_2/n + \lambda \sum_{j=1}^p \frac{|\beta_j|}{|\hat{\beta}_{init,j}|} \right)$
91 | where $\hat{\beta}_{init}$ is an initial estimator (e.g.
Lasso from an initial stage)
92 | * Intuition: $\hat{\beta}_{init,j}=0$ leads to $\hat{\beta}_{adapt,j}=0$;
93 | $|\hat{\beta}_{init,j}|$ large $\Rightarrow$ small penalty
94 | * Goal: Reduction of the bias of the Lasso
95 | 
96 | 
97 | ## Post-Lasso (Belloni & Chernozhukov, 2011)
98 | * $\hat{\beta}(\lambda) = \arg \min \left( ||Y-X\beta||^2_2/n + \lambda ||\beta||_1 \right)$
99 | * $\hat{T}=supp(\hat{\beta}) =\{ j \in \{1,\ldots,p\}: |\hat{\beta}_j|>0 \}$
100 | * Post model selection estimator $\tilde{\beta}$ (Post-Lasso)
101 | $$ \tilde{\beta}= \arg \min_{\beta} ||Y-X\beta||^2_2/n: \quad \beta_j=0 \mbox{ for each } j \in \hat{T}^C$$
102 | * Idea: Reduce bias by running OLS on the variables selected by Lasso in a first stage
103 | 
104 | ## Elastic Net (Zou & Hastie, 2005)
105 | * Idea: Combination of $\ell_1$- and $\ell_2$-penalty
106 | * $\ell_1$-penalty: sparse model
107 | * $\ell_2$-penalty: enforcing a grouping effect, stabilization of the regularization path, removes the limit on the number of selected variables
108 | $$\hat{\beta}= \arg \min \left( ||Y-X\beta||^2_2/n + \lambda_2 ||\beta||_2^2 + \lambda_1 ||\beta||_1 \right)$$
109 | * Rescaled version: $\hat{\beta}_{enet}=(1+\lambda_2)\hat{\beta}$
110 | 
111 | ## LAVA (Chernozhukov et al., 2015)
112 | * Idea: $\theta= \underbrace{\beta}_{\text{dense part}} + \underbrace{\delta}_{\text{sparse part}}$
113 | * $\hat{\theta}=\hat{\beta} + \hat{\delta}$
114 | * $$ (\hat{\beta},\hat{\delta})=\arg \min_{(\beta', \delta')} \{l(data, \beta+\delta) + \lambda_2 ||\beta||_2^2 + \lambda_1|| \delta||_1\}$$
115 | 
116 | ## Group Lasso
117 | * Motivation: with factor variables, one would like to choose whether all categories or none of them should be included.
118 | * $\mathcal{G}_1, \ldots, \mathcal{G}_q$ groups which partition the index set $\{1,\ldots,p\}$
119 | * $\beta=(\beta_{\mathcal{G}_1}, \ldots, \beta_{\mathcal{G}_q})$, $\beta_{\mathcal{G}_j}=\{ \beta_r, r \in \mathcal{G}_j\}$
120 | * $\hat{\beta}(\lambda)=\arg \min_{\beta} Q_{\lambda}(\beta)$
121 | * $$Q_{\lambda}(\beta)=1/n ||Y-X\beta||^2_2 + \lambda \sum_{j=1}^q m_j ||\beta_{\mathcal{G}_j}||_2$$
122 | $m_j=\sqrt{T_j}$, $T_j=|\mathcal{G}_j|$
123 | * All variables in a group are either all zero or all different from zero: selection of groups of variables (e.g. factors!)
124 | 
125 | 
126 | 
--------------------------------------------------------------------------------
/Lecture_5.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MartinSpindler/Machine-Learning-in-Econometrics/2108b4e5c378b23e21a98ad976446b2a2603b512/Lecture_5.pdf
--------------------------------------------------------------------------------
/Lecture_6.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 6 -- Of Trees and Forests"
3 | author: "Martin Spindler"
4 | date: '`r format(Sys.Date())`'
5 | output:
6 | ioslides_presentation: null
7 | beamer_presentation: default
8 | keep_tex: yes
9 | mathjax: local
10 | self_contained: no
11 | ---
12 | 
13 | 
14 | ## Introduction
15 | * Idea: Partition the feature / covariate space into a set of rectangles and then fit a simple model (a constant) in each one. The estimated function is then the average of the outcomes falling into the respective rectangle.
16 | * Recursive binary partitions, i.e. sequentially we choose the variable and corresponding split point which achieve the best fit, until some stopping criterion is reached.
17 | * Here: Regression, but also used for classification (with different criteria) 18 | * Example [cf blackboard] 19 | 20 | ## Regression Trees 21 | $n$ observations $(x_i,y_i), i=1,\ldots,n, \quad x_i=(x_{i1}, \ldots, x_{ip})$ 22 | 23 | Given a partition into $M$ regions $R_1,\ldots,R_M$ with a fitted constant in each region: 24 | 25 | $$ f(x) = \sum_{m=1}^M c_m I(x \in R_m) $$ 26 | 27 | Minimizing the residual sum of squares (RSS) $\sum_{x_i \in R_m} (y_i-f(x_i))^2$ leads to $\hat{c}_m=ave(y_i|x_i \in R_m)$ 28 | 29 | Finding the best partition in terms of minimal RSS is generally computational infeasible. 30 | 31 | Instead: greedy algorithm 32 | 33 | ## Regression Trees 34 | 35 | **Algorithm:** 36 | 37 | Start with all data 38 | 1. We consider a splitting variable $j$ and split point $s$, s.t. 39 | $$ \min_{j, s} \left[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i-c_1)^2 40 | + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i-c_2)^2 \right] $$ 41 | with $R_1(j,s)=\{ X|X_j \leq s \}$, $R_2(j,s)=\{ X|X_j > s \}$ 42 | Determination of the best pair $(j,s)$ is feasible. 43 | 2. Repeat 1) on each of the two resulting regions 44 | 3. Stop when some criterion is reached 45 | 46 | 47 | ## Regression Trees 48 | How large should we grow a tree? 49 | 50 | Tree size is a tuning parameter governing the model's complexity. 51 | 52 | It should be chosen adaptively from the data. 53 | 54 | Strategy 1: Split tree node only, if decrease in rss is sufficiently high (but too short-sighted) 55 | 56 | Strategy 2: (preferred) 57 | 58 | * Given a large tree, stopping when some minimal node size is reached 59 | * Prune the tree by *cost-complexity pruning* 60 | 61 | ## Regression Trees 62 | 63 | Subtree $T \subset T_0$ is any tree that can be obtained by pruning $T_0$, i.e. collapsing any number of its internal nodes. 64 | We denote the terminal nodes by $m=1,\ldots,M$. 65 | 66 | $N_m=|\{x_i \in R_m\}|$ 67 | 68 | $\hat{c}_m = 1/N_m \sum_{x_i in R_m} y_i$ 69 | 70 | $Q_m(T) = \sum_{x_i in R_m} (y_i - \hat{c}_m)^2$ 71 | 72 | Cost-complexity criterion $C_{\alpha}(T)=\sum_{m=1}^{M} N_m Q_m(T) + \alpha |T|$ 73 | 74 | ## Regression Trees 75 | 76 | Find $T_{\alpha} \subset T_0$ to minimize $C_{\alpha}(T)$ 77 | 78 | $\alpha$ governs the trade-off between tree-size and its goodness of fit to the data. 79 | $\alpha=0$ yields full tree. 80 | 81 | Choice of $\alpha$ by cross validation. 
82 | 
83 | Final tree $T_{\hat{\alpha}}$
84 | 
85 | ## Bagging
86 | * Bagging: short for bootstrap aggregation
87 | * Training data $Z=\{(x_1,y_1),\ldots,(x_n,y_n)\}$
88 | * Fit a model to $Z$ and obtain $\hat{f}(x)$
89 | * Idea: average the predictions over a collection of bootstrap samples, thereby reducing the variance
90 | * Procedure:
91 | + $Z^{*b}, b=1,\ldots,B$ bootstrap samples
92 | + Fit model to get $\hat{f}^{*,b}(x)$
93 | + $\hat{f}_{bag}(x)=1/B \sum_{b=1}^B \hat{f}^{*,b}(x)$
94 | 
95 | ## Bagging
96 | * $\hat{f}_{bag}(x)$ estimate of the true bagging value $E_{\hat{\mathcal{P}}} \hat{f}^{*}(x)$
97 | * Well suited for high-variance, low-bias procedures
98 | * Application: regression trees
99 | 
100 | ## Random Forests
101 | * Introduced by Breiman (2001)
102 | * Very powerful (good performance) in many applications
103 | * Modified version of bagging
104 | * Idea: Building a large collection of de-correlated trees and then averaging them (Breiman, 2001)
105 | 
106 | ## Random Forests
107 | * Trees: low bias, but very noisy / high variance $\Rightarrow$ goal: reduction of variance
108 | * Trees generated by bagging are identically distributed, but not necessarily independent
109 | * For identically distributed variables with positive pairwise correlation $\rho$: variance of the average is $\rho \sigma^2 + (1-\rho)/B \sigma^2$ ($\rho$ correlation of the trees); for i.i.d. rvs it is $\sigma^2/B$
110 | * Application: nonlinear estimators like trees
111 | 
112 | ## Random Forests | Procedure
113 | * Bootstrap samples $1,\ldots,B$
114 | * Build trees and
115 | "Before each split, select $m \leq p$ of the input variables at random as candidates for splitting" (e.g. $m=\sqrt{p}, m=1$)
116 | * Aggregation:
117 | $$\hat{f}_{rf}(x) = \frac{1}{B} \sum_{b=1}^B T(x; \Theta_b)$$
118 | $\Theta_b$: split variables, cut points, terminal node values for tree $b$
119 | 
120 | ## Ensemble Learning | Introduction
121 | * Idea: To build a prediction model by combining the strengths of a collection of simpler base models.
122 | * Bagging and random forests are ensemble methods for classification, where a committee of trees each casts a vote for the predicted class.
123 | * Boosting was proposed as a committee method where the committee of weak learners evolves over time, with a weighted vote of the members
124 | * Ensemble learning:
125 | + Developing a population of base learners from the training data
126 | + Combining them to form the composite predictor
127 | 
128 | ## Learning Ensembles
129 | * We consider functions of the form
130 | $$ f(x) = \alpha_0 + \sum_{T_k \in \mathcal{T}} \alpha_k T_k(x) $$
131 | with $\mathcal{T}$ a dictionary of basis functions, e.g. trees, with $|\mathcal{T}|$ quite large.
132 | * Hybrid approach of Friedman and Popescu (2003)
133 | + A finite dictionary $\mathcal{T}_L=\{T_1(x),\ldots, T_M(x)\}$ of basis functions is induced from the training data.
134 | + A family of functions $f_{\lambda}(x)$ is built by fitting a lasso path in this dictionary
135 | $$ \alpha(\lambda) = \arg\min_{\alpha} \sum_{i=1}^N L[y_i, \alpha_0 + \sum_{m=1}^M \alpha_m T_m(x_i)] + \lambda \sum_{m=1}^M |\alpha_m|$$
136 | 
137 | ## Ensemble Generating Algorithm
138 | How to choose the set of base functions $b(x;\gamma)$ forming $\mathcal{T}_L$?
139 | 140 | * $f_0(x)=\arg \min_c \sum_{i=1}^N L(y_i,c)$ 141 | * For $m=1$ to $M$ do 142 | + $\gamma_m = \arg \min_{\gamma} \sum_{i \in S_m(\eta)} L(y_i, f_{m-1}(x_i) + b(x_i;\gamma))$ 143 | + $f_m(x)=f_{m-1}(x) + \nu b(x;\gamma_m)$ 144 | * $\mathcal{T}_{ISLE}=\{ b(x;\gamma_1), \ldots, b(x;\gamma_M) \}$ 145 | 146 | $S_m(\eta)$ refers to a subsample of $N \eta$ of the training observations, typically without replacement. 147 | Recommendation: $\eta \leq 1/2$ and $\eta \sim 1/\sqrt(N)$, $\nu=0.1$ 148 | *Importance sampled learning ensemble (ISLE)* 149 | 150 | 151 | 152 | ## Special cases of the Algorithm 153 | * Bagging: $\eta=1$, samples with replacement, $\nu=0$ 154 | * Random forest: sampling is similar with more randomness introduced by the selection of the splitting variable. 155 | * Gradient boosting with shrinkage uses $\eta=1$ 156 | * Stochastic gradient boosting: identical 157 | 158 | -------------------------------------------------------------------------------- /Lecture_6.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MartinSpindler/Machine-Learning-in-Econometrics/2108b4e5c378b23e21a98ad976446b2a2603b512/Lecture_6.pdf -------------------------------------------------------------------------------- /Lecture_7.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 7 -- Neural Nets" 3 | author: "Martin Spindler" 4 | date: '`r format(Sys.Date())`' 5 | output: 6 | ioslides_presentation: null 7 | beamer_presentation: default 8 | keep_tex: yes 9 | mathjax: local 10 | self_contained: no 11 | --- 12 | 13 | 14 | ## Recap: What have we already learnt? 15 | 1. Introduction (Definitions, Basic Concepts, Challenges in High-Dimensions) 16 | 2. Linear Regression and Extensions (Linear Regression, Regression and Smoothing Splines) 17 | 3. Ridge Regression 18 | 4. Lasso Regression (Basic Principle, Some Results, Extensions) 19 | 5. 
Of Trees and Forests (Regression Trees, Bagging, Random Forests)
20 | 
21 | Today: Neural Nets / Deep Learning
22 | 
23 | ## Neural Networks | Introduction
24 | * Inspired by the mode of operation of the brain, imitation of the human brain
25 | * Idea: Extract linear combinations of the inputs as derived features, and model the target ($Y$) as a nonlinear function of these features
26 | * Fields: Statistics, Artificial Intelligence
27 | 
28 | ## Projection Pursuit Regression
29 | * Input vector $X$ with $p$ components; target $Y$
30 | * $\omega_m, m=1,\ldots,M$ unit $p$-vectors of unknown parameters
31 | * Projection Pursuit Regression (PPR) model:
32 | $$ f(x)= \sum_{m=1}^M g_m(\omega_m' x)$$
33 | $V_m=\omega_m' x$ derived feature; projection on $\omega_m$
34 | * $g_m$ estimated along with $\omega_m$ by flexible smoothing methods
35 | * $g_m(\omega_m ' x)$ "ridge function" in $\mathbb{R}^p$
36 | * Useful for prediction; difficult to interpret
37 | 
38 | 
39 | ## Neural Networks
40 | * Large class of models / learning methods
41 | * Here: single hidden layer back-propagation network / single layer perceptron
42 | * Two-stage regression or classification model, represented by a network diagram
43 | * Can be seen as nonlinear statistical models
44 | * Diagram cf blackboard
45 | 
46 | ## Neural Networks
47 | * $Z_m= \sigma(\alpha_{0m} + \alpha_m'x), m=1,\ldots, M$
48 | * $T_k = \beta_{0k} + \beta_k'z, k=1,\ldots, K$
49 | * $f_k(x) = g_k(T), k=1,\ldots, K$
50 | 
51 | ## Neural Networks
52 | * Activation function: $\sigma(v)=\frac{1}{1+e^{-v}}$ (sigmoid) (cf blackboard)
53 | * Regression case: $g_k(T)=T_k$;
54 | * Classification case: $g_k(T)=\frac{e^{T_k}}{\sum_{l=1}^K e^{T_l}}$ (softmax fct.)
55 | * Related to PPR
56 | * Measure of fit $R(\theta)$: sum-of-squared errors (regression), or squared error / cross-entropy (classification)
57 | * Estimation: minimize $R(\theta)$ by gradient descent (\textquotedblleft back-propagation\textquotedblright); regularization might be needed
58 | 
59 | ## Fitting Neural Networks
60 | * Unknown parameters, called weights, $\theta$:
61 | * $\{ \alpha_{0m}, \alpha_m; m=1,2,\ldots, M \}$: $M(p+1)$ weights
62 | * $\{ \beta_{0k}, \beta_k; k=1,2,\ldots, K \}$: $K(M+1)$ weights
63 | * Criterion function: $R(\theta)= \sum_{k=1}^K \sum_{i=1}^N (y_{ik} - f_k(x_i))^2 = \sum_{i=1}^N R_i$
64 | * Derivatives: $\frac{\partial R_i}{\partial \beta_{km}} = -2(y_i-f_k(x_i))g_k'(\beta_k^Tz_i)z_{mi}$
65 | * Analogous derivatives $\frac{\partial R_i}{\partial \alpha_{ml}}$
66 | 
67 | ## Fitting Neural Networks
68 | A gradient descent update at the $(r+1)$st iteration is given by
69 | * $\beta_{km}^{(r+1)} = \beta_{km}^{(r)} - \gamma_r \sum_{i=1}^N \frac{\partial R_i}{\partial \beta_{km}^{(r)}}$
70 | * $\alpha_{ml}^{(r+1)} = \alpha_{ml}^{(r)} - \gamma_r \sum_{i=1}^N \frac{\partial R_i}{\partial \alpha_{ml}^{(r)}}$
71 | * $\gamma_r$ learning rate
72 | 
73 | ## Fitting Neural Networks
74 | * Rewrite the derivatives as
75 | * $\frac{\partial R_i}{\partial \beta_{km}} = \delta_{ki}z_{mi}$
76 | * $\frac{\partial R_i}{\partial \alpha_{ml}} = s_{mi}x_{il}$
77 | * $s_{mi}=\sigma'(\alpha^T_m x_i) \sum_{k=1}^K \beta_{km} \delta_{ki}$ (*)
78 | * $\delta_{ki}$ and $s_{mi}$ "errors"
79 | 
80 | ## Fitting Neural Networks
81 | Estimation via the back-propagation equations:
82 | 
83 | * Updates are computed with a two-pass algorithm
84 | * Forward pass: current weights are fixed, calculate $\hat{f}_k(x_i)$
85 | * Backward pass: calculate $\delta_{ki}$, back-propagate via (*), calculate gradients and update.
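## Fitting Neural Networks | A Small Illustration in R

A brief, purely illustrative sketch (not part of the original slides): a single-hidden-layer network of the type described above can be fitted with the *nnet* package listed in Problem Set 1. The number of hidden units, the weight-decay value and the use of the Boston data from the problem sets are arbitrary choices here.

```{r, eval=FALSE}
# Sketch: single-hidden-layer network for a regression problem
library(nnet)
library(MASS)                      # Boston data, as in the problem sets
X <- scale(Boston[, -14])          # standardized inputs (mean zero, unit variance)
y <- Boston$medv
fit <- nnet(X, y, size = 5,        # M = 5 hidden units
            linout = TRUE,         # linear output g_k(T) = T_k (regression case)
            decay = 0.1,           # weight decay penalty lambda * J(theta)
            maxit = 500)
mean((y - predict(fit, X))^2)      # in-sample mean squared error
```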
86 | 87 | ## Neural Networks 88 | * Starting values: random values near zero. Intuition: model starts out nearly linear and becomes nonlinear as the weights increase. 89 | * Overfitting: to prevent overfitting early stopping and penalization (weight decay; $R(\theta) + \lambda J(\theta)$) 90 | * Scaling of the inputs: large effects on the quality of the final solution. Default: standardization and normalization of of inputs (mean zero and unit variance) 91 | * Number of hidden units and layers: better to have too many hidden units than too few. (Flexibility + Regularization!) 92 | * Multiple Minima: nonconvex criterion function with many local minima (different starting values, average of predictions of collection of neural nets) 93 | 94 | 95 | -------------------------------------------------------------------------------- /Lecture_7.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MartinSpindler/Machine-Learning-in-Econometrics/2108b4e5c378b23e21a98ad976446b2a2603b512/Lecture_7.pdf -------------------------------------------------------------------------------- /Lecture_8.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Lecture 8 -- Boosting" 3 | author: "Martin Spindler" 4 | date: '`r format(Sys.Date())`' 5 | output: 6 | ioslides_presentation: null 7 | beamer_presentation: default 8 | keep_tex: yes 9 | mathjax: local 10 | self_contained: no 11 | --- 12 | 13 | 14 | ## Introduction 15 | * One of the most powerful learning ideas 16 | * Originally for classification problems, but extended to regression settings 17 | * Idea: Combining the output of many "weak" classifiers to produce a powerful committee 18 | * Weak classifier: classifier which error rate is only slightly better than random guessing. 19 | * "best off-the-shelf classifier in the world" (Breiman, 1998) 20 | 21 | ## Boosting for Classification 22 | * Freund and Schapire (1997): AdaBoost.M1 23 | * Output variable $Y \in \{-1,1\}$, $X$ predictor variable, $N$ observations (training sample) 24 | * $G(x)$: classification rule with values in $\{-1,1\}$ 25 | * Training error rate: $\bar{err}=1/N \sum_{i=1}^N I(y_i \neq G(x_i))$ 26 | * Expected error rate (on new observations): $\mathbb{E}_{XY}I(Y\neq G(X))$ 27 | 28 | ## Boosting for Classification | Main idea 29 | * Sequentially applying the weak classification algorithm to repeatedly modified versions of the data. 30 | * This produces a sequence of weak classifiers $G_m(x), m=1,\ldots,M$ 31 | * Finally, the predictions of all of them are then combined through a weighted majority vote to produce the final prediction: 32 | $$ G(x)=sign \left( \sum_{m=1}^M \alpha_m G_m(x) \right)$$ 33 | * Weights $\alpha_1,\ldots, \alpha_M$ computed by boosting. Higher influence to the more accurate classifiers in the sequence. 34 | 35 | ## Boosting for Classification | Main idea | Weights 36 | * Boosting applies different weights $w_1, \ldots, w_N$ to the training data at each step. 37 | * Initially, $w_i=1/N$ 38 | * For $m=2,\ldots,M$ observations weights are individually modified and classifier is reapplied to weighted data. 39 | * At step $m$ those observations that were misclassified by the classifier $G_{m-1}(x)$ at the previous step have their weight increased, the weights for the correctly classified ones are decreased. 40 | * Concentration on the training observations missed in the previous rounds. 41 | 42 | ## Boosting for Classification | Algorithm 43 | 1. 
Initialize the observation weights $w_i=1/N$ 44 | 2. For $m=1$ to $M$: 45 | + Fit a classifier $G_m(x)$ to the training data using weights $w_i$ 46 | + Compute $err_m = \frac{\sum_{i=1}^N w_i I(y_i \neq G_m(x_i))}{\sum_{i=1}^N w_i }$ 47 | + Compute $\alpha_m = \log((1-err_m)/err_m)$ 48 | + Set $w_i \leftarrow w_i \exp[\alpha_m I(y_i\neq G_m(x_i))]$ 49 | 3. Output $G(x)=sign\left[\sum_{m=1}^M \alpha_m G_m(x)\right]$ 50 | 51 | ## Boosting as Additive Modelling 52 | * Boosting is a way of fitting an additive expansion in a set of elementary "basis" functions. 53 | * Basis function expansion 54 | $$f(x) = \sum_{m=1}^M \beta_m b(x; \gamma_m)$$ 55 | where $\beta_m$ are expansion coefficients and $b(x;\gamma)$ simple functions / basis functions parametrized by $\gamma$ 56 | 57 | ## Boosting as Additive Modelling 58 | * Estimation by solving 59 | $$ \min_{\beta_m, \gamma_m} \sum_{i=1}^N L \left( y_i, \sum_{m=1}^M \beta_m b (x_i; \gamma_m) \right) (*)$$ 60 | * Often hard to solve, but often feasible subproblem 61 | $$ \min_{\beta, \gamma} \sum_{i=1}^N L(y_i; \beta b(x_i; \gamma))$$ 62 | 63 | ## Forward Stagewise Additive Modeling 64 | * Idea: Solving (*) by sequentially adding new basis functions to the expansion without adjusting the parameters and coefficients of those already added. 65 | * At each step $m$: Solve for the optimal $\beta_m$ and $b(x; \gamma_m)$ given the current expansion $f_{m-1}(x)$; this gives $f_m(x)$ and continue. 66 | * For squared-error loss: 67 | $$ L(y_i; f_{m-1}(x_i) + \beta b(x_i;\gamma)) = (y_i - f_{m-1}(x_i) - \beta b(x_i; \gamma))^2 $$ 68 | 69 | ## Forward Stagewise Additive Modeling | Algorithm 70 | 1. Initialize $f_0(x)=0$ 71 | 2. For $m=1$ to $M$: 72 | + Compute 73 | $$ (\beta_m, \gamma_m) = \arg \min_{\beta, \gamma} \sum_{i=1}^NL(y_i, f_{m-1}(x_i) + \beta b(x_i;\gamma)).$$ 74 | + Set $f_m(x)=f_{m-1}(x) + \beta_m b(x;\gamma_m)$ 75 | 76 | ## Exponential Loss and AdaBoost 77 | * AdaBoost as stagewise additive modeling using the loss function $L(y, f(x))=\exp(-yf(x))$ 78 | * Solve 79 | $$ (\beta_m, G_m) = \arg \min_{\beta, G} \sum exp[-y_i(f_{m-1}(x_i) + \beta G(x_i))] $$ 80 | * This can be expressed as 81 | $$ (\beta_m, G_m) = \arg \min_{\beta, G} w_i^{(m)} \exp(-\beta y_i G(x_i))$$ 82 | with $w_i^{(m)}=\exp(-y_i f_{m-1}(x_i))$ 83 | 84 | ## Exponential Loss and AdaBoost 85 | * Solution can be obtained in two steps. 86 | 1. For any value of $\beta > 0$, the solution for $G_m(x)$ is 87 | $$ G_m = \arg \min_G \sum_{i=1}^N w_i^{(m)}I(y_i\neq G(x_i))$$ 88 | 89 | ## Exponential Loss and AdaBoost 90 | 2. Plugin this $G_m$ into the criterion function and solving for $\beta$ gives: 91 | $$ \beta_m=1/2 \log \frac{1-err_m}{err_m},$$ where $err_m$ is the minimized weighted error rate 92 | $$ err_m = \frac{\sum_{i}w_i^{(m)} I(y_i \neq G_m(x_i))}{\sum_{i}w_i^{(m)}}$$ 93 | 94 | ## Exponential Loss and AdaBoost 95 | * Update is given by $f_m(x)=f_{m-1}(x) + \beta_m G_m(x)$ and weights $w_i^{(m)}=w_i^{(m)} exp(-\beta_m y_i G_m(x_i))$ 96 | * Weights can be rewritten as $w_i^{(m+1)}=w_i^{(m)} exp(\alpha_m I(y_i \neq G_m(x_i))) exp(-\beta_m)$ 97 | * $\alpha_m=2 \beta_m$ 98 | 99 | ## Remarks on the Loss Function | Classification 100 | * $yf(x)$ is called margin. Goal: maximize margin. 
101 | * Classification rule: $G(x)=sign(f(x))$ 102 | * Exponential loss: 103 | + Computational easy (simple modular reweighting) 104 | + $f^*(x)=\arg \min_{f(x)} E_{Y|x}(e^{-Yf(x)})=1/2 \log \frac{Pr(Y=1|x)}{Pr(Y=-1|x)}$ or equivalently 105 | $Pr(y=1|x)=\frac{1}{1+e^{-2f^*(x)}}$ 106 | + Hence, AdaBoost estimates one half of the log-odd of $P(Y=1|x)$ 107 | * Alternative; binomial negative log-likelihood or deviance (coded $\{0,1\}$) has the same population minimizer. 108 | 109 | ## Remarks on the Loss Function | Regression Case 110 | * Squared error loss: $L(y,f(x))=(y-f(x))^2$ with population minimizer $f(x)=E[Y|x]$ 111 | * Mean absolute loss: $L(y,f(x))=|y-f(x)|$ with population minimizer $f(x)=median[Y|x]$ 112 | * Huber Loss: $L(y,f(x))=1(|y-f(x)|\leq \delta)(y-f(x))^2 +$ 113 | $+1(|y-f(x)|> \delta) (2\delta|y-f(x)|-\delta^2)$ 114 | 115 | ## Boosting Trees 116 | * Trees: Partition of the space of all joint predictors into disjoint regions $R_j$ (terminal nodes of the tree) with 117 | $$ x \in R_j \rightarrow f(x) = \gamma_j$$ 118 | * Trees can be expressed as $T(x;\Theta)=\sum_{j=1}^J \gamma_j I(x \in R_j)$ with parameters $\Theta=\{R_j,\gamma_j\}_1^J$ 119 | * Usually estimated via recursive partitioning 120 | 121 | ## Boosting Trees 122 | * Boosted tree model: $f_M(x)=\sum_{m=1}^M T(x;\Theta_m)$ 123 | * Forward stagewise procedure solves: 124 | $$\hat{\Theta}_m = \arg \min_{\Theta_m} \sum_{i=1}^N L(y_i, f_{m-1}(x_i) + T(x_i;\Theta_m))$$ 125 | * Simplification with squared-error loss and two-class classification with exponential loss (specialized tree-growing algorithm) 126 | 127 | ## Numerical Optimization via Gradient Boosting 128 | * Goal: approximate algorithms for solving $\hat{\Theta}_m = \arg \min_{\Theta_m} \sum_i L(y_i, f_{m-1}(x_i)+T(x_i;\Theta_m))$ 129 | * Loss to predict $y$ using $f(x)$: $L(f)=\sum_i L(y_i;f(x_i))$ (e.g. $f$ as sum of trees) 130 | * $\hat{f}= \arg \min_f L(f)$ (**) where the "parameters" $f \in \mathbb{R}^N$ are the values of the approximating function $f(x_i)$ at each of the N data points $x_i$. 131 | * Numerical optimization procedures solve (**) as as sum of component vectors $f_M=\sum_{m=0}^M h_m, h_m \in \mathbb{R}^N$ and $h_0$ starting value / initial guess. 132 | * Numerical methods differ in how to specify $h_m$ 133 | 134 | ## Gradient Boosting | Steepest Descent 135 | * $h_m= \rho_m g_m$ (steepest descent) with $\rho_m$ scalar and $g_m \in \mathbb{R}^N$ is the gradient of $L(f)$ evaluated at $f=f_{m-1}$ 136 | * $g_{im}=\left[ \frac{\partial L(y_i,f(x_i))}{\partial f(x_i)} \right]_{f(x_i)=f_{m-1}(x_i)}$ 137 | * Step length $\rho_m$ solves 138 | $$\rho_m= \arg \min_{\rho} L(f_{m-1}-\rho g_m)$$ 139 | * Updating: $f_m=f_{m-1}- \rho g_m$ 140 | * **Greedy Strategy** 141 | 142 | ## Gradient Boosting 143 | * Gradient is the unconstrained maximal descent direction. Only defined at the training data points $x_i$, but goal is generalization. 144 | * Idea: Approximate negative gradient by some "model", e.g. tree $T(x;\Theta_m)$ at $m$th iteration whose predictions $t_m$ are as close as possible to the negative gradient. 145 | * This leads to 146 | $$ \tilde{\Theta}_m = \arg \min_{\Theta} \sum_i (-g_{im}-T(x_i;\Theta))^2.$$ 147 | 148 | ## Gradient Tree Boosting Algorithm 149 | 1. Initialize $f_0(x)=\arg \min_{\gamma} \sum_i L(y_i, \gamma)$ 150 | 2. 
For $m=1$ to $M$:
151 | + For $i=1,2,\ldots,N$ compute
152 | $$ r_{im} = - \left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f=f_{m-1}}.$$
153 | + Fit a regression tree to the targets $r_{im}$ giving terminal regions $R_{jm}, j=1,2,\ldots, J_m$.
154 | + For $j=1,2,\ldots, J_m$ compute
155 | $$ \gamma_{jm} = \arg \min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + \gamma).$$
156 | + Update $f_m(x)=f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm}I(x \in R_{jm})$.
157 | 3. Output $\hat{f}(x)=f_M(x)$
158 | 
159 | ## Remarks
160 | * Shrinkage: $f_m(x)=f_{m-1}(x) + \nu \sum_{j=1}^J \gamma_{jm}I(x \in R_{jm})$
161 | * Size of trees $J$ in each step: important choice, usually $2 < J < 10$ (by experimenting)
162 | * Early stopping required. (When to stop?)
163 | * Penalization / tuning parameters: $M$, $J$, $\nu$
--------------------------------------------------------------------------------
/Lecture_8.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MartinSpindler/Machine-Learning-in-Econometrics/2108b4e5c378b23e21a98ad976446b2a2603b512/Lecture_8.pdf
--------------------------------------------------------------------------------
/Lecture_9.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Lecture 9 -- Support Vector Machines"
3 | author: "Martin Spindler"
4 | date: '`r format(Sys.Date())`'
5 | output:
6 | ioslides_presentation: null
7 | beamer_presentation: default
8 | keep_tex: yes
9 | mathjax: local
10 | self_contained: no
11 | ---
12 | 
13 | ## Introduction
14 | * Outcome variable: $Y \in \{-1,+1\}$
15 | * Generalization to multi-valued outcomes is straightforward.
16 | * Basic Idea (for classification): Separating the space of features by hyperplanes into different regions so that the dependent variable / outcome is separated.
17 | * *Separating* hyperplanes
18 | * Graph: cf white board
19 | * In this lecture we focus only on the basic idea.
20 | 
21 | 
22 | ## Digression: Hyperplanes
23 | * In a $p$-dimensional space a *hyperplane* is a flat affine subspace of dimension $p-1$.
24 | * Example 1: In $p=2$ a hyperplane is a line.
25 | * Example 2: In $p=3$ a hyperplane is a plane.
26 | * A hyperplane in a $p$-dimensional space is defined by $$\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p = 0.$$
27 | * If $\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p <0$ then $X=(X_1,\ldots,X_p)$ lies on one side of the hyperplane.
28 | * If $\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p >0$ then $X$ lies on the other side of the hyperplane.
29 | * Hence: Separation of the space into two parts.
30 | * $f(x)=\beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p$ gives the signed distance from a point $x$ to the hyperplane defined by $f(x)=0$
31 | 
32 | ## Support Vector Classifier
33 | * Training data: $(x_1, y_1), \ldots, (x_n,y_n)$, $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, +1\}$
34 | * Definition hyperplane: $\{x: f(x)=x^T \beta + \beta_0=0 \}$ ($\beta$ unit vector with $\|\beta\|=1$)
35 | * Classification rule: $G(x)=sign[x^T \beta + \beta_0]$
36 | 
37 | ## Support Vector Classifier: Separable Case
38 | * Separable means: $y_i f(x_i)>0 \ \forall i$ for some hyperplane $f(x)$
39 | * Find the hyperplane that creates the biggest margin between the classes $-1$ and $+1$.
40 | * Optimization problem: $\max_{\beta, \beta_0, \|\beta\|=1} M$ subject to
41 | $$ y_i(x_i^T \beta + \beta_0) \geq M, i=1,\ldots,n$$
42 | * Equivalent formulation: $\min_{\beta, \beta_0} \| \beta \|$ subject to
43 | $$ y_i(x_i^T \beta + \beta_0) \geq 1, i=1,\ldots,n$$
44 | 
45 | ## Support Vector Classifiers: Non-separable Case
46 | * Now: classes overlap in the feature space.
47 | * Still maximize $M$, but allow for some points to be on the wrong side of the margin.
48 | * Slack variables $\xi=(\xi_1, \ldots, \xi_n)$
49 | * Modification of the constraints: $y_i(x_i^T \beta + \beta_0) \geq M - \xi_i$ or
50 | $y_i(x_i^T \beta + \beta_0) \geq M(1 - \xi_i)$
51 | * $\xi_i \geq 0 \ \forall i$, $\sum_{i=1}^n \xi_i \leq$ constant
52 | * Interpretation: overlap in actual distance from the margin vs overlap in relative distance.
53 | * Focus on the second case (b/c convex optimization problem)
54 | 
55 | ## Support Vector Classifiers: Non-separable Case
56 | * $\xi_i$: proportional amount by which the prediction $f(x_i)$ is on the wrong side of the margin.
57 | * Misclassification occurs if $\xi_i > 1$
58 | * Bounding $\sum \xi_i$ at a value $K$ bounds the number of training misclassifications at $K$.
59 | * Equivalent formulation of the problem: $\min \| \beta \|$ subject to
60 | $$ y_i(x_i^T \beta + \beta_0) \geq 1 - \xi_i \ \forall i, \ \xi_i \geq 0, \ \sum \xi_i \leq K$$
61 | 
62 | ## Support Vector Machines
63 | * Up to now: linear boundaries in the feature space.
64 | * Flexibility by enlarging the feature space using basis expansions (e.g. polynomials, splines)
65 | * Better training-class separation and nonlinear boundaries in the original space.
66 | * Selection of basis functions $h_m(x), m=1,\ldots, M$ and fit of the SV classifier using the input features $h(x_i)=(h_1(x_i), \ldots, h_M(x_i))$.
67 | * Nonlinear function $\hat{f}(x)= h(x)^T \hat{\beta} + \hat{\beta}_0$
68 | * Classifier: $\hat{G}=sign(\hat{f}(x))$
69 | 
70 | ## Support Vector Machines
71 | * SVMs use a very large space of basis functions, leading to computational problems.
72 | * Problem of overfitting.
73 | * SVM technology takes care of both problems.
74 | 
75 | ## Support Vector Machines
76 | * (omitting technical details)
77 | * The solution of the optimization problem involves $h(x)$ only through inner products.
78 | * Knowledge of the kernel function $K(x,x')= \langle h(x), h(x') \rangle$ is sufficient.
79 | * Examples:
80 | + $d$th degree polynomial: $K(x,x')= (1+ \langle x,x' \rangle)^d$
81 | + Radial basis: $K(x,x')=\exp(-\gamma \|x-x'\|^2)$
82 | 
83 | ## Support Vector Machines | Example
84 | * Consider a two-dimensional space $(X_1,X_2)$ and a polynomial kernel of degree 2.
85 | * $K(X,X')=1+2X_1 X_1' + 2 X_2 X_2' + (X_1 X_1')^2 + (X_2 X_2')^2 + 2 X_1 X_1' X_2 X_2'$
86 | * Then $M=6$ and $h_1(X)=1, h_2(X)=\sqrt{2} X_1, \ldots$. Then $K(X,X')=\langle h(X), h(X') \rangle$
87 | 
88 | ## SVM as a Penalization Method
89 | * With $f(x)=h(x)^T \beta + \beta_0$, we consider the optimization problem:
90 | $$ \min_{\beta_0, \beta} \sum_{i=1}^n [ 1- y_i f(x_i)]_{+} + \frac{\lambda}{2} \| \beta \|^2$$
91 | * loss + penalty
92 | * Hinge loss function: $L(y,f)=[1- yf]_{+}$
93 | * The solution to the above optimization problem (with $\lambda = 1/C$) is the same as for the SVM problem.
94 | * ($C$ is a Cost parameter related to $K$) 95 | 96 | ## SVM | An Illustration 97 | We use the library *e1071* (alternative *LiblineaR* for very large linear problems) 98 | ```{r} 99 | library(e1071) 100 | ``` 101 | 102 | 103 | First, we generate data we would like to classify 104 | ```{r} 105 | set.seed(12345) 106 | x = matrix(rnorm(200*2), ncol=2) 107 | x[1:100,] = x[1:100,] + 2 108 | x[101:150,] = x[101:150,] - 2 109 | y = c(rep(1,150), rep(2,50)) 110 | dat = data.frame(x=x, y=as.factor(y)) 111 | ``` 112 | 113 | ## SVM | An Illustration 114 | ```{r} 115 | plot(x, col=y) 116 | ``` 117 | 118 | ## SVM | An Illustration 119 | Next, we split the data into training and testing sample 120 | ```{r} 121 | train = sample(200,100) 122 | ``` 123 | Then we fit a SVM with radial basis and plot the result 124 | ```{r} 125 | svmfit = svm(y~., data=dat[train,], kernel="radial", gamma=1, cost=1) 126 | # summary(smvfit) 127 | ``` 128 | 129 | ## SVM | An Illustration 130 | ```{r, fig.height=5, fig.width=8} 131 | plot(svmfit, dat[train,]) 132 | ``` 133 | 134 | ## SVM | An Illustration 135 | We can now increase the cost parameter to reduce the training errors 136 | 137 | ```{r} 138 | svmfit = svm(y~., data=dat[train,], kernel="radial", gamma=1, cost=1e5) 139 | #plot(svmfit, dat[train,]) 140 | # summary(smvfit) 141 | ``` 142 | 143 | 144 | ## SVM | An Illustration 145 | ```{r, fig.height=5, fig.width=8} 146 | plot(svmfit, dat[train,]) 147 | ``` 148 | 149 | 150 | ## SVM | An Illustration 151 | Selection of the cost parameter and $\gamma$ by CV 152 | ```{r} 153 | tune.out=tune(svm, y~., data=dat[train,], kernel="radial", ranges=list(cost=c(0.1,1,10,100,1000), gamma=c(0.5,1,2,3,4))) 154 | #summary(tune.out) 155 | ``` 156 | ## SVM | An Illustration 157 | Finally, we test it on the testing data 158 | ```{r} 159 | table(true=dat[-train,"y"], pred=predict(tune.out$best.model, newdata=dat[-train,])) 160 | ``` 161 | 162 | 163 | 164 | 165 | -------------------------------------------------------------------------------- /Lecture_9.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MartinSpindler/Machine-Learning-in-Econometrics/2108b4e5c378b23e21a98ad976446b2a2603b512/Lecture_9.pdf -------------------------------------------------------------------------------- /PS1.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Problem Set 1" 3 | author: "" 4 | date: "`r format(Sys.Date())`" 5 | output: pdf_document 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | knitr::opts_chunk$set(error = TRUE) 11 | ``` 12 | 13 | 14 | ## Installing R and RStudio 15 | * Install R, a free software environment for statistical computing and graphics from [CRAN](https://cran.r-project.org/), the Comprehensive R Archive Network. It is recommended to install a precompiled binary distribution for your operating system. 16 | * Install RStudio’s IDE (Integrated Development Environment), a powerful user interface for R. RStudio Desktop is freely [available](https://www.rstudio.com). 17 | 18 | 19 | ## Installing R packages 20 | 21 | * A strength of R is that many add on packages are available which extend the capability of R. The "official" packages are hosted at CRAN and can be installed in R/RStudio with command \texttt{install.packages()}. 
The package \texttt{hdm}, which we will use later in the course, can be installed by
22 | 
23 | ```{r, eval=FALSE}
24 | install.packages("hdm")
25 | ```
26 | 
27 | * Many packages are collected in so-called [task views](https://cran.r-project.org/web/views/). For Machine Learning a good starting point is [here](https://cran.r-project.org/web/views/MachineLearning.html).
28 | 
29 | * Often the most current versions of packages, but also some packages which are not hosted at CRAN, are hosted at file repositories like R-Forge and GitHub. E.g., the most current version of \texttt{hdm} can be installed from R-Forge by specifying the corresponding repository:
30 | 
31 | ```{r, eval=FALSE}
32 | install.packages("hdm", repos="http://R-Forge.R-project.org")
33 | ```
34 | 
35 | * After installing a package, it can be made available in the current R session with the command
36 | ```{r, eval=FALSE}
37 | library(hdm)
38 | ```
39 | ## Packages for Machine Learning
40 | 
41 | One of the strengths of R is that many useful packages for Machine Learning are available. Some of the most important ones, which will also be useful during this course, are given in the table.
42 | 
43 | package | description |
44 | --------|-------------------------------------------|
45 | mlr | interface to a large number of classification and regression techniques|
46 | rpart, tree, party | tree-structured models for regression, classification and survival analysis|
47 | randomForest | random forests|
48 | nnet | single-hidden-layer neural network|
49 | mboost, gbm | boosting methods|
50 | hdm, lars, glmnet | lasso implementations|
51 | 
52 | ## Finding help in R and on the web
53 | 
54 | * R has a comprehensive built-in help system. E.g. to get help for the function \texttt{lm}, which conducts linear regression, you can use any of the following at the program's command prompt:
55 | 
56 | ```{r, eval=FALSE}
57 | help.start() # general help
58 | help(lm) # help about function lm
59 | ?lm # same result
60 | apropos("lm") # list all functions containing string lm
61 | ??lm # extensive search on all documents containing the string "lm"
62 | example(lm) # show an example of function lm
63 | RSiteSearch("lm") # search for "lm" in help manuals and archived mailing lists
64 | 
65 | ```
66 | 
67 | Moreover, many packages contain introductions called "vignettes".
68 | 
69 | ```{r, eval=FALSE}
70 | # get vignettes on using installed packages
71 | vignette() # show available vignettes
72 | vignette(package="XYZ") # show vignettes in the package XYZ
73 | ```
74 | 
75 | More information on searching for help in R can be found in this [stackoverflow question](http://stackoverflow.com/questions/15289995/how-to-get-help-in-r).
76 | 
77 | * Exercise:
78 | a. Install and load the package "hdm".
79 | b. Which function in this package conducts "lasso" estimation?
80 | c. How is this function used?
81 | d. Find the vignette of the package!
82 | 
83 | * R is shipped with different manuals where "An Introduction to R" is a good starting point to learn more about R.
84 | Moreover, good sources for help are [stackoverflow](http://stackoverflow.com/questions/tagged/r) and the archive of the R-help list, where solutions to many problems can be found.
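A minimal sketch (not part of the original problem set, and assuming the \texttt{hdm} package has been installed as described above) of how these help commands can be applied to the package from the exercise:

```{r, eval=FALSE}
# Exploring the hdm package with R's help system
library(hdm)               # load the package
help(package = "hdm")      # overview of the functions in the package
?rlasso                    # help page of hdm's lasso implementation
vignette(package = "hdm")  # list the vignettes shipped with the package
```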
85 | 86 | ## Course Material 87 | 88 | Material for this course is hosted at 89 | https://github.com/MartinSpindler/Machine-Learning-in-Econometrics 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | -------------------------------------------------------------------------------- /PS1.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MartinSpindler/Machine-Learning-in-Econometrics/2108b4e5c378b23e21a98ad976446b2a2603b512/PS1.pdf -------------------------------------------------------------------------------- /PS2.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Problem Set 2 -- Linear Regression and Extensions" 3 | author: '' 4 | date: '`r format(Sys.Date())`' 5 | output: pdf_document 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | knitr::opts_chunk$set(error = TRUE) 11 | ``` 12 | 13 | 14 | ## Loading and Exploring the Data Set 15 | For this problem set we will analyze the data set *Boston* which is contained in the library *MASS*. This data set records the median house value (*medv*) for $506$ neighborhoods around Boston. The goal is to predict the variable *medv* using $13$ predictors. 16 | 17 | * Load the data set. 18 | * Make yourself familiar with the data. Hint: *str()*, *names()*, *help()* 19 | * Generate Descriptive statistics. Hint: *summary*, *mean*, *sd*, *var*, *min*, *max*, *median*, *range*, *quantile*, *fivenum* 20 | * Plot the data, especially the outcome variable *medv* and the variable *lstat*. Hint: *plot*, *hist*, *boxplot* 21 | 22 | ```{r, include=TRUE} 23 | options(warn=1) 24 | library(MASS) 25 | data(Boston) 26 | ?Boston 27 | attach(Boston) 28 | summary(medv) 29 | summary(lstat) 30 | ``` 31 | 32 | 33 | ## Univariate Linear Regression 34 | 35 | * Analyse the relation between *medv* and *lstat* with a linear regression. Hint: *lm()* 36 | * Interpret the results. Hint: *summary* 37 | * Plot the regression line in a graph with the original data points. 38 | * What is the predicted value of *medv* for a region with a *lstat* of $32$? 39 | ```{r} 40 | reg1 = lm(medv ~ lstat, data=Boston) 41 | summary(reg1) 42 | #plot(reg1) 43 | plot(medv ~ lstat, data=Boston) 44 | abline(reg1, col="red") 45 | predict(reg1, newdata = list(lstat=32), interval="confidence") 46 | ``` 47 | 48 | ## Multivariate Linear Regression 49 | * Fit now a multivariate regression. 50 | * Interpret the results, in particular with a focus on the variable *lstat*. 51 | * Fit a more complex model, e.g. considering interaction effects and higher order polynomials. 52 | ```{r} 53 | reg2 = lm(medv ~ ., data=Boston) 54 | summary(reg2) 55 | #plot(reg2) 56 | reg3 = lm(medv ~ poly(lstat,3)+ crim + zn + crim:zn+ (chas + nox + rm)^2, data=Boston) 57 | coef(reg3) 58 | ``` 59 | 60 | ## Regression Splines 61 | Now we consider again the relation between *lstat* and *medv*. 62 | 63 | * Fit a cubic regression spline to the data! Hint: library *splines* and function *bs()* 64 | * Plot the fitted line! 
65 | ```{r}
66 | library(splines)
67 | par(mfrow=c(1,1))
68 | fit = lm(medv ~ bs(lstat, knots=c(10,20,30)), data=Boston)
69 | lstat.grid <- seq(from=1.8, to=37.9, by=0.1)
70 | pred = predict(fit, newdata=list(lstat=lstat.grid), se=TRUE)
71 | plot(Boston$lstat, Boston$medv, col="gray")
72 | lines(lstat.grid, pred$fit, lwd=2)
73 | lines(lstat.grid, pred$fit + 2*pred$se, lty="dashed") # pointwise 2*SE band
74 | lines(lstat.grid, pred$fit - 2*pred$se, lty="dashed")
75 | ```
76 | * Experiment with different spline specifications! Hint: options *knots* and *df*
77 | ```{r, include=F}
78 | fit2 = lm(medv ~ bs(lstat, df=6), data=Boston)
79 | attr(bs(lstat, df=6), "knots")
80 | pred2 = predict(fit2, newdata=list(lstat=lstat.grid), se=TRUE)
81 | plot(Boston$lstat, Boston$medv, col="gray")
82 | lines(lstat.grid, pred2$fit, lwd=2)
83 | lines(lstat.grid, pred2$fit + 2*pred2$se, lty="dashed")
84 | lines(lstat.grid, pred2$fit - 2*pred2$se, lty="dashed")
85 | ```
86 | 
87 | ```{r}
88 | fit2 = lm(medv ~ bs(lstat, df=20), data=Boston)
89 | attr(bs(lstat, df=20), "knots")
90 | pred2 = predict(fit2, newdata=list(lstat=lstat.grid), se=TRUE)
91 | plot(Boston$lstat, Boston$medv, col="gray")
92 | lines(lstat.grid, pred2$fit, lwd=2)
93 | lines(lstat.grid, pred2$fit + 2*pred2$se, lty="dashed")
94 | lines(lstat.grid, pred2$fit - 2*pred2$se, lty="dashed")
95 | ```
96 | 
97 | * Fit a natural spline. Hint: *ns()*
98 | ```{r}
99 | fit = lm(medv ~ ns(lstat, knots=c(10,20,30)), data=Boston)
100 | lstat.grid <- seq(from=1.8, to=37.9, by=0.1)
101 | pred = predict(fit, newdata=list(lstat=lstat.grid), se=TRUE)
102 | plot(Boston$lstat, Boston$medv, col="gray")
103 | lines(lstat.grid, pred$fit, lwd=2)
104 | lines(lstat.grid, pred$fit + 2*pred$se, lty="dashed")
105 | lines(lstat.grid, pred$fit - 2*pred$se, lty="dashed")
106 | ```
107 | 
108 | * Compare the different specifications!
109 | 
110 | ## Smoothing Splines
111 | 
112 | * Fit and plot a smoothing spline to the data! Hint: *smooth.spline()*
113 | ```{r}
114 | plot(lstat, medv, cex=.5, col="darkgrey")
115 | title("Smoothing Splines")
116 | fit = smooth.spline(lstat, medv, df=16)
117 | fit2 = smooth.spline(lstat, medv, cv=FALSE) # smoothing parameter chosen by generalized CV
118 | fit2$df
119 | lines(fit, col="red", lwd=2)
120 | lines(fit2, col="blue", lwd=2)
121 | legend("topright", legend=c("16 df", "6.8 df"), col=c("red", "blue"), lty=1, lwd=2, cex=.8)
122 | ```
123 | 
124 | 
125 | 
126 | 
127 | 
128 | 
129 | 
130 | 
131 | 
132 | 
--------------------------------------------------------------------------------
/PS2.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MartinSpindler/Machine-Learning-in-Econometrics/2108b4e5c378b23e21a98ad976446b2a2603b512/PS2.pdf
--------------------------------------------------------------------------------
/PS3.Rmd:
--------------------------------------------------------------------------------
1 | ---
2 | title: "Problem Set 3 -- Ridge and Lasso Regression"
3 | author: ''
4 | date: '`r format(Sys.Date())`'
5 | output: pdf_document
6 | ---
7 | 
8 | ```{r setup, include=FALSE}
9 | knitr::opts_chunk$set(echo = TRUE)
10 | knitr::opts_chunk$set(error = TRUE)
11 | ```
12 | 
13 | 
14 | ## Simulating Data
15 | 
16 | 
17 | To simulate data we must draw random variables with some prespecified distribution. In *R*, for every distribution usually four functions are implemented which are useful for working with distributions and differ by their prefix: *r* (random), *d* (density), *p* (probability), and *q* (quantile).
The prefix is combined with a name for the distribution, e.g. *norm* for the normal distribution: *dnorm* for the density of a normal distribution, *pnorm* for the probability, *qnorm* for the quantiles, and *rnorm* to draw from a normal distribution. (Check out the help page of the functions!) 18 | 19 | Here we want to simulate a linear model of the form 20 | $$ y_i = x_i' \beta + \varepsilon_i, i = 1,\ldots, n$$ 21 | with $\beta$ a $p$-dimensional coefficient vector and $x_i$ $p$-dimensional vector of regressors. In vector notation: 22 | $$ y = X\beta + \varepsilon$$ 23 | with $y$ and $\varepsilon$ $n$-dimensional vectors and $X$ a $n \times p$-design matrix. 24 | 25 | Here the task is to simulate from this model, where we assume that the coefficient vector $\beta$ has $s$ entries equal to one and all others are zero. 26 | 27 | * Set $n=100$, $p=10$, $s=3$ 28 | * Create the coefficient vector $\beta$. Useful functions: *c()*, *rep()* 29 | * Simulate a design matrix and the error. Useful functions: *matrix()*, *rnorm* 30 | * Construct the model from above. Useful function: $\%*\%$ for matrix multiplication 31 | 32 | ```{r, eval=FALSE} 33 | set.seed(12345) 34 | n <- 100 35 | p <- 10 36 | s <- 3 37 | 38 | beta <- c(rep(1,s), rep(0, p-s)) 39 | X <- matrix(rnorm(n*p), ncol=p); eps <- rnorm(n) 40 | y <- X%*%beta + eps 41 | 42 | ``` 43 | 44 | ## Ridge Regression I 45 | 46 | * Estimate a ridge regression on simulated data from Exercise 1. Useful function: *glmnet* from the package *glmnet* with default $\alpha=0$. Also check out the option *lambda* in *glmnet* and the function *cv.glmnet* to perform cross-validation to determine $\lambda$. 47 | * Simulate new data from the same model and make predictions both in- and out-of-sample. Calculate the MSE for the predictions (also for the in-sample fit). Useful function: *predict* 48 | * Repeat the previous steps with different settings on $n$, $p$, and $s$. 49 | * Compare the results with ols regression! 50 | 51 | 52 | ```{r, eval=FALSE} 53 | library(glmnet) 54 | ridge1 <- glmnet(X, y, alpha=0) 55 | plot(ridge1) 56 | # lambda.grid <- seq(2,0, by=0.05) 57 | # ridge2 <- glmnet(X, y, lambda=lambda.grid, alpha=1) 58 | cv.out <- cv.glmnet(X, y, alpha=0) 59 | plot(cv.out) 60 | bestlam <- cv.out$lambda.min 61 | ridge.pred <- predict(ridge1, s= bestlam, newx=X) 62 | MSE.ridge.ins <- mean((y-ridge.pred)^2) 63 | # out of sample (size n/4) 64 | Xnew <- matrix(rnorm(n/4*p), ncol=p); epsnew <- rnorm(n/4) 65 | ynew <- Xnew%*%beta + epsnew 66 | ridge.pred.out <- predict(ridge1, s= bestlam, newx=Xnew) 67 | MSE.ridge.out <- mean((ynew-ridge.pred.out)^2) 68 | 69 | ## now for ols 70 | 71 | df <- data.frame(y=y,X=X) 72 | ols1 <- lm(y~X, data=df) 73 | MSE.ols.ins <- mean((y-predict(ols1))^2) 74 | yhatnew <- cbind(1,Xnew)%*%coef(ols1) 75 | MSE.ols.out <- mean((ynew - yhatnew)^2) 76 | 77 | 78 | # compare coefficients 79 | Coefs <- cbind(coef(ols1), as.vector(predict(ridge1, s=bestlam, newx=X, type="coefficients"))) 80 | colnames(Coefs) <- c("OLS", "Ridge") 81 | head(Coefs) 82 | ``` 83 | 84 | 85 | ## Lasso Estimation I 86 | 87 | * Redo the calculations from above but with Lasso with varying $n$, $p$, and $s$. Hint: Set option $\alpha$ in *glmnet* to $1$ 88 | * Compare the results! (In particular compare a "sparse" with a "dense" setting) 89 | * The package *hdm* contains the function *rlasso* which determines the penalization parameter by some theoretical grounded method. 
Look up the function in the man pages and / or vignette and analyze now the data set using this function. Compare the results. 90 | 91 | ```{r, eval=FALSE} 92 | lasso1 <- glmnet(X, y, alpha=1) 93 | plot(lasso1) 94 | # lambda.grid <- seq(2,0, by=0.05) 95 | # ridge2 <- glmnet(X, y, lambda=lambda.grid, alpha=1) 96 | cv.lasso.out <- cv.glmnet(X, y, alpha=1) 97 | plot(cv.lasso.out) 98 | bestlam <- cv.lasso.out$lambda.min 99 | lasso.pred <- predict(lasso1, s= bestlam, newx=X) 100 | MSE.lasso.ins <- mean((y-lasso.pred)^2) 101 | lasso.pred.out <- predict(lasso1, s= bestlam, newx=Xnew) 102 | MSE.lasso.out <- mean((ynew-lasso.pred.out)^2) 103 | ``` 104 | 105 | 106 | ```{r, eval=FALSE} 107 | library(hdm) 108 | lasso2 <- rlasso(y~X) 109 | lasso2.pred.ins <- predict(lasso2) 110 | MSE.lasso2.ins <- mean((y-lasso2.pred.ins)^2) 111 | lasso2.pred.out <- predict(lasso2, newdata=Xnew) 112 | MSE.lasso2.out <- mean((ynew-lasso2.pred.out)^2) 113 | ``` 114 | 115 | 116 | ## Bonus Excercise 117 | 118 | * Write a loop around the programs above so that you can perform a small simulation study! 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | -------------------------------------------------------------------------------- /PS3.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/MartinSpindler/Machine-Learning-in-Econometrics/2108b4e5c378b23e21a98ad976446b2a2603b512/PS3.pdf -------------------------------------------------------------------------------- /PS4.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Problem Set 4 -- Regression Trees and Random Forests" 3 | author: '' 4 | date: '`r format(Sys.Date())`' 5 | output: pdf_document 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | knitr::opts_chunk$set(error = TRUE) 11 | ``` 12 | 13 | 14 | ## Classification trees 15 | 16 | * Load the data set *Carseats* from the package *ISLR* and also load the package *tree* which will be used for the exercise. 17 | * Construct a binary variable *High*, which takes on a value of Yes if the Sales variable exceeds 8, and 18 | takes on a value of No otherwise. Hint: *ifelse()* 19 | * Fit a classification tree to predict *High* using all other variables in the data set. Hint: *tree()* 20 | * Plot the fitted tree. 21 | * Determine the "best" tree by pruning / CV and prune to this tree on some training set. Hint: *cv.tree* 22 | * Determine the quality of the predictions on the testing set. 
Hint: *predict()* 23 | 24 | ```{r, include=FALSE} 25 | library(tree) 26 | library(ISLR) 27 | attach(Carseats) 28 | High= ifelse(Sales <=8 ," No"," Yes ") 29 | Carseats = data.frame(Carseats,High) 30 | tree.carseats = tree(High ~ .-Sales , Carseats) 31 | summary(tree.carseats) 32 | plot(tree.carseats) 33 | text(tree.carseats, pretty =0) 34 | set.seed(2) 35 | train = sample(1:nrow(Carseats), 200) 36 | Carseats.test= Carseats[-train,] 37 | High.test=High[-train] 38 | tree.carseats = tree(High~ .-Sales ,Carseats ,subset = train) 39 | tree.pred= predict (tree.carseats ,Carseats.test, type ="class") 40 | table(tree.pred, High.test) 41 | ## 42 | set.seed(3) 43 | cv.carseats =cv.tree(tree.carseats, FUN =prune.misclass) 44 | names(cv.carseats ) 45 | cv.carseats 46 | par(mfrow =c(1 ,2) ) 47 | plot(cv.carseats$size,cv.carseats$dev ,type ="b") 48 | plot(cv.carseats$k,cv.carseats$dev ,type ="b") 49 | ###################### 50 | prune.carseats =prune.misclass(tree.carseats ,best =9) 51 | plot(prune.carseats) 52 | text(prune.carseats, pretty =0) 53 | ###################### 54 | tree.pred= predict (prune.carseats ,Carseats.test ,type ="class") 55 | table(tree.pred, High.test ) 56 | ###################### 57 | prune.carseats=prune.misclass(tree.carseats ,best =15) 58 | plot(prune.carseats) 59 | text(prune.carseats, pretty =0) 60 | tree.pred= predict (prune.carseats, Carseats.test ,type ="class") 61 | table(tree.pred, High.test) 62 | ``` 63 | 64 | ## Regression Trees 65 | * Fit a regression tree to the Boston data we had before in the class. The depedent variable is *medv*. Hint: package *MASS* 66 | * Do this on a training set (50% of the sample) and then evaluate the predictions on the testing set. 67 | * Plot the tree and interpret the results! 68 | 69 | ```{r, include=FALSE} 70 | library(MASS) 71 | set.seed(1) 72 | train = sample(1:nrow(Boston), nrow(Boston)/2) 73 | tree.boston =tree(medv~.,Boston, subset =train ) 74 | summary(tree.boston) 75 | plot(tree.boston) 76 | text(tree.boston, pretty =0) 77 | cv.boston =cv.tree(tree.boston) 78 | plot(cv.boston$size,cv.boston$dev, type="b") 79 | prune.boston = prune.tree(tree.boston ,best =5) 80 | plot(prune.boston) 81 | text(prune.boston, pretty =0) 82 | yhat= predict(tree.boston, newdata = Boston[-train ,]) 83 | boston.test = Boston[-train,"medv"] 84 | plot(yhat ,boston.test) 85 | abline(0 ,1) 86 | mean((yhat - boston.test)^2) 87 | ``` 88 | 89 | ## Bagging and Random Forests 90 | 91 | * Repeat the excerise from above (fitting on training set + prediction on testing set) with bagging. Hint: *randomForest* from the package with the same name. 92 | * Finally, fit a random forest on the training data and compare the model with all previous models. 

```{r, include=FALSE}
library(randomForest)
set.seed(1)
# bagging: mtry = 13 uses all predictors at each split
bag.boston <- randomForest(medv ~ ., data = Boston, subset = train,
                           mtry = 13, importance = TRUE)
yhat.bag <- predict(bag.boston, newdata = Boston[-train, ])
plot(yhat.bag, boston.test)
abline(0, 1)
mean((yhat.bag - boston.test)^2)
# bagging with fewer trees
bag.boston <- randomForest(medv ~ ., data = Boston, subset = train,
                           mtry = 13, ntree = 25)
yhat.bag <- predict(bag.boston, newdata = Boston[-train, ])
mean((yhat.bag - boston.test)^2)
```


```{r, include=FALSE}
# random forest: mtry = 6 samples a subset of predictors at each split
set.seed(1)
rf.boston <- randomForest(medv ~ ., data = Boston, subset = train,
                          mtry = 6, importance = TRUE)
yhat.rf <- predict(rf.boston, newdata = Boston[-train, ])
mean((yhat.rf - boston.test)^2)
importance(rf.boston)
varImpPlot(rf.boston)
```

--------------------------------------------------------------------------------
/PS4.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MartinSpindler/Machine-Learning-in-Econometrics/2108b4e5c378b23e21a98ad976446b2a2603b512/PS4.pdf
--------------------------------------------------------------------------------
/PS5.Rmd:
--------------------------------------------------------------------------------
---
title: "Problem Set 5 -- Boosting"
author: ''
date: '`r format(Sys.Date())`'
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(error = TRUE)
```


## Boosting

* Analyse the Boston data set using tree-based boosting. The dependent variable is again *medv*. Split the data into a training and a testing set. Use the testing set to analyze the prediction quality of your model.
* Hint: function *gbm()* from the package *gbm* or *mboost()* from the package *mboost*
* Estimate again a tree-based boosting model, but now with shrinkage parameter $\nu=0.3$.
* Experiment with the options and report the best tree-based boosting model; a sketch of the *mboost* route is given after this list.
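The solution chunk below uses *gbm()*. For the *mboost* route mentioned in the hint, a minimal sketch is given here, assuming *blackboost()* with regression-tree base learners; the values of *mstop* and $\nu$ are illustrative choices and are not prescribed by the exercise.

```{r, eval=FALSE}
# Sketch: tree-based boosting with mboost::blackboost instead of gbm.
library(MASS)
library(mboost)
set.seed(12345)
train <- sample(1:nrow(Boston), floor(nrow(Boston)/2))
boston.test <- Boston[-train, "medv"]
# blackboost boosts regression trees; mstop = number of iterations, nu = shrinkage
bb.boston <- blackboost(medv ~ ., data = Boston[train, ],
                        control = boost_control(mstop = 500, nu = 0.1))
yhat.bb <- predict(bb.boston, newdata = Boston[-train, ])
mean((yhat.bb - boston.test)^2)
```

The number of boosting iterations could also be chosen by cross-validation, e.g. with *cvrisk()*.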

```{r, include=TRUE}
library(MASS)
library(gbm)
library(mboost)
set.seed(12345)
# training/testing split
train <- sample(1:nrow(Boston), floor(nrow(Boston)/2))
# boosted regression trees with default shrinkage
boost.boston <- gbm(medv ~ ., data = Boston[train, ], distribution = "gaussian",
                    n.trees = 5000, interaction.depth = 4)
par(mfrow = c(1, 2))
plot(boost.boston, i = "rm")
plot(boost.boston, i = "lstat")
yhat.boost <- predict(boost.boston, newdata = Boston[-train, ], n.trees = 5000)
boston.test <- Boston[-train, "medv"]
mean((yhat.boost - boston.test)^2)
plot(yhat.boost, boston.test)
abline(0, 1)
# boosted regression trees with a larger shrinkage parameter
boost.boston <- gbm(medv ~ ., data = Boston[train, ], distribution = "gaussian",
                    n.trees = 5000, interaction.depth = 4, shrinkage = 0.2)
yhat.boost <- predict(boost.boston, newdata = Boston[-train, ], n.trees = 5000)
mean((yhat.boost - boston.test)^2)
```

--------------------------------------------------------------------------------
/PS5.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MartinSpindler/Machine-Learning-in-Econometrics/2108b4e5c378b23e21a98ad976446b2a2603b512/PS5.pdf
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Machine-Learning-in-Econometrics
Collection of lecture notes and exercises for a course "Machine Learning in Econometrics"
--------------------------------------------------------------------------------
/benchmark.R:
--------------------------------------------------------------------------------
# Timing comparison of glmnet, cross-validated glmnet, and the rlasso implementations in hdm
library(microbenchmark)
library(glmnet)
library(hdm)

set.seed(12345)
n <- 1000
p <- 100
s <- 10

# sparse linear model: s active coefficients out of p
beta <- c(rep(1, s), rep(0, p - s))
X <- matrix(rnorm(n * p), ncol = p)
eps <- rnorm(n)
y <- X %*% beta + eps

MB <- microbenchmark(glmnet(X, y, alpha = 0), rlasso.fit(X, y, post = FALSE),
                     rlasso.fit(X, y, post = TRUE), rlasso(y ~ X, post = TRUE),
                     rlasso(y ~ X, post = FALSE), cv.glmnet(X, y))

MB
str(MB)
--------------------------------------------------------------------------------
/bibliography.bib:
--------------------------------------------------------------------------------
% This file was created with JabRef 2.9.2.
% Encoding: Cp1252

@BOOK{ESL,
  title = {The Elements of Statistical Learning},
  publisher = {Springer New York Inc.},
  year = {2001},
  author = {Hastie, Trevor and Tibshirani, Robert and Friedman, Jerome},
  series = {Springer Series in Statistics},
  address = {New York, NY, USA},
  keywords = {ml statistics}
}

@BOOK{ISL,
  title = {An Introduction to Statistical Learning: With Applications in R},
  publisher = {Springer Publishing Company, Incorporated},
  year = {2014},
  author = {James, Gareth and Witten, Daniela and Hastie, Trevor and Tibshirani, Robert},
  isbn = {1461471370, 9781461471370}
}

@BOOK{UML,
  title = {Understanding Machine Learning: From Theory to Algorithms},
  publisher = {Cambridge University Press},
  year = {2014},
  author = {Shalev-Shwartz, Shai and Ben-David, Shai},
  address = {New York, NY, USA},
  isbn = {1107057132, 9781107057135},
  url = {http://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/understanding-machine-learning-theory-algorithms.pdf}
}

@MANUAL{hdm,
  title = {{hdm}: High-Dimensional Metrics (Vignette)},
  author = {Victor Chernozhukov and Chris Hansen and Martin Spindler},
  year = {2016},
  owner = {Martin},
  timestamp = {2016.02.16}
}

--------------------------------------------------------------------------------
/syllabus.md:
--------------------------------------------------------------------------------
---
title: "Machine Learning in Econometrics"
author: ''
date: "February 2016"
output: pdf_document
bibliography: bibliography.bib
---

The goal of this course is to give an introduction to Machine Learning in Econometrics. In the first part, methods from Machine Learning are presented. In the second part, applications of those methods in Econometrics are discussed. Moreover, the statistical software package R is used to illustrate the methods.

# Lectures
1. [Introduction](Lecture_1.html)
    + Definitions
    + Basic Concepts
    + Challenges in High-Dimensions

Part I. How to make predictions?

1. [Linear Regression and Extensions](Lecture_2.html)
    + [Recap: Linear Regression](Lecture_2.html)
    + [Regression Splines](Lecture_3.html)
    + [Smoothing Splines](Lecture_3.html)
2. [Ridge Regression](Lecture_4.html)
3. [Lasso Regression](Lecture_4.html)
    + [Basic Principle](Lecture_4.html)
    + [Some Results](Lecture_5.html)
    + [Extensions](Lecture_5.html)
4. Of Trees and Forests
    + [Regression Trees](Lecture_6.html)
    + [Bagging](Lecture_6.html)
    + [Random Forests](Lecture_6.html)
5. [Neural Nets / Deep Learning](Lecture_7.html)
6. [Boosting](Lecture_8.html)
    + Basic Idea
    + $L_2$-Boosting for Regression
7. [Support Vector Machines](Lecture_9.html)
8. [Model Selection: How to choose between different models?](Lecture_10.html)

Part II. Estimation and Inference of Structural Parameters and Treatment Effects

1. Partialling-out
2. Inference on Selected Target Variables in High-Dimensional Regressions
3. IV Estimation in High-Dimensions
4. The Orthogonality Principle

# Problem Sets

* [PS1](PS1.pdf) Introduction to R / RStudio and some important R packages
* [PS2](PS2.pdf) Linear Regression and Extensions
* [PS3](PS3.pdf) Ridge and Lasso Regression

# Literature

---
nocite: |
  @UML, @ISL, @ESL, @hdm
...

--------------------------------------------------------------------------------
/syllabus.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/MartinSpindler/Machine-Learning-in-Econometrics/2108b4e5c378b23e21a98ad976446b2a2603b512/syllabus.pdf
--------------------------------------------------------------------------------